Interpretable machine learning methods for regulatory and disease genomics

Placeholder Show Content

Abstract/Contents

Abstract
It is an incredible feat of nature that the same genome contains the code to every cell in each living organism. From this same genome, each unique cell type gains a different program of gene expression that enables the development and function of an organism throughout its lifespan. The non-coding genome - the ~98 of the genome that does not code directly for proteins - serves an important role in generating the diverse programs of gene expression turned on in each unique cell state. A complex network of proteins bind specific regulatory elements in the non-coding genome to regulate the expression of nearby genes. While basic principles of gene regulation are understood, the regulatory code of which factors bind together at which genomic elements to turn on which genes remains to be revealed. Further, we do not understand how disruptions in gene regulation, such as from mutations that fall in non-coding regions, ultimately lead to disease or other changes in cell state. In this work we present several methods developed and applied to learn the regulatory code or the rules that govern non-coding regions of the genome and how they regulate nearby genes. We first formulate the problem as one of learning pairs of sequence motifs and expressed regulator proteins that jointly predict the state of the cell, such as the cell type specific gene expression or chromatin accessibility. Using pre-engineered sequence features and known expression, we use a paired-feature boosting approach to build an interpretable model of how the non-coding genome contributes to cell state. We also demonstrate a novel improvement to this method that takes into account similarities between closely related cell types by using a hierarchy imposed on all of the predicted cell states. We apply this method to discover validated regulators of tadpole tail regeneration and to predict protein-ligand binding interactions. Recognizing the need for improved sequence features and stronger predictive performance, we then move to a deep learning modeling framework to predict epigenomic phenotypes such as chromatin accessibility from just underlying DNA sequence. We use deep learning models, specifically multi-task convolutional neural networks, to learn a featurization of sequences over several kilobases long and their mapping to a functional phenotype. We develop novel architectures that encode principles of genomics in models typically designed for computer vision, such as incorporating reverse complementation and the 3D structure of the genome. We also develop methods to interpret traditionally ``black box" neural networks by 1) assigning importance scores to each input sequence to the model, 2) summarizing non-redundant patterns learned by the model that are predictive in each cell type, and 3) discovering interactions learned by the model that provide indications as to how different non-coding sequence features depend on each other. We apply these methods in the system of hematopoiesis to interpret chromatin dynamics across differentiation of blood cell types, to understand immune stimulation, and to interpret immune disease-associated variants that fall in non-coding regions. We demonstrate strong performance of our boosting and deep learning models and demonstrate improved performance of these machine learning frameworks when taking into account existing knowledge about the biological system being modeled. We benchmark our interpretation methods using gold standard systems and existing experimental data where available. We confirm existing knowledge surrounding essential factors in hematopoiesis, and also generate novel hypotheses surrounding how factors interact to regulate differentiation. Ultimately our work provides a set of tools for researchers to probe and understand the non-coding genome and its role in controlling gene expression as well as a set of novel insights surrounding how hematopoiesis is controlled on many scales from global quantification of regulatory sequence to interpretation of individual variants.

Description

Type of resource text
Form electronic resource; remote; computer; online resource
Extent 1 online resource.
Place California
Place [Stanford, California]
Publisher [Stanford University]
Copyright date 2018; ©2018
Publication date 2018; 2018
Issuance monographic
Language English

Creators/Contributors

Author Greenside, Peyton Greis
Degree supervisor Kundaje, Anshul, 1980-
Thesis advisor Kundaje, Anshul, 1980-
Thesis advisor Altman, Russ
Thesis advisor Pritchard, Jonathan D
Degree committee member Altman, Russ
Degree committee member Pritchard, Jonathan D
Associated with Stanford University, Program in Biomedical Informatics.

Subjects

Genre Theses
Genre Text

Bibliographic information

Statement of responsibility Peyton Greis Greenside.
Note Submitted to the Program in Biomedical Informatics.
Thesis Thesis Ph.D. Stanford University 2018.
Location electronic resource

Access conditions

Copyright
© 2018 by Peyton Greis Greenside
License
This work is licensed under a Creative Commons Attribution Non Commercial 3.0 Unported license (CC BY-NC).

Also listed in

Loading usage metrics...