Interpretable machine learning for scientific discovery in regulatory genomics


Abstract/Contents

Abstract
All cells in our body have approximately the same DNA sequence, yet different cell types behave distinctly because of differential expression of genes. This cell-type-specific control of gene expression is governed by regulatory proteins that bind to DNA. Over 90% of disease-associated mutations do not disrupt the DNA sequences of genes, but rather disrupt functions involved in the regulation of gene expression. Unfortunately, conventional computational models can fail to distinguish between mutations that are benign and mutations that are likely to affect regulatory activity. Machine learning offers a solution to this problem: by training complex models, including deep learning models, to predict regulatory activity from DNA sequence, we implicitly force the models to learn which sequence features are relevant for regulation. However, our difficulty in interpreting and trusting these models limits our ability to extract novel scientific insights from them. In this thesis, I will present techniques I have developed to address some of these limitations. I will begin by discussing DeepLIFT, a fast algorithm for calculating example-specific importance scores to explain the predictions of a deep learning model, as well as GkmExplain, an algorithm for efficiently computing importance scores for gapped k-mer support vector machines. I will then describe TF-MoDISco, an algorithm that leverages importance scores produced by an algorithm such as DeepLIFT or GkmExplain to discover recurring patterns learned by the model. Next, I will discuss two projects on leveraging domain-specific knowledge to improve the performance and interpretability of deep learning models trained on regulatory genomic data. The first project, on reverse-complement parameter sharing, introduces architectures that can account for symmetries inherent in the double-stranded nature of regulatory DNA.
The second project, on separable fully-connected layers, introduces a novel parameterization to exploit the fact that positional patterns in DNA binding sites are often shared across different regulatory proteins. Finally, I will discuss three projects centered on improving the reliability of predictions derived from these models. The first project deals with the situation where a deep learning model trained on regulatory genomic data is leveraged to identify pairs of proteins that have non-additive interaction effects; we demonstrate that looking at the change in the model's prediction loss, rather than simply looking at the change in the predictions, is a far more robust indicator of whether the model's learned interaction effect is likely to be an artifact. The second project presents a state-of-the-art algorithm for improving the model predictions under a type of data distribution shift known as "label shift", where the class proportions in the held-out testing set differ from the class proportions that the model was trained on (this can occur, for example, if a model that is trained to predict diseases given symptoms is deployed in a situation where the prevalence of the disease is far higher than in the data distribution it was trained on). The third project explores the scenario where a model can abstain from making predictions on a subset of examples that it is uncertain about, in order to improve user trust in the predictions on the remaining examples; in this project, we devise a novel and flexible strategy for choosing which examples to abstain on when the goal is to optimize metrics other than simple prediction accuracy, such as the area under the ROC curve or the sensitivity at a target specificity level (such metrics are commonly used in genomics and medicine). Taken together, I hope these methods help pave the way for successful application of advanced machine learning techniques to derive novel scientific insights from regulatory genomic data.

Description

Type of resource text
Form electronic resource; remote; computer; online resource
Extent 1 online resource
Place California
Place [Stanford, California]
Publisher [Stanford University]
Copyright date ©2020
Publication date 2020
Issuance monographic
Language English

Creators/Contributors

Author Shrikumar, Avanti
Degree supervisor Kundaje, Anshul, 1980-
Thesis advisor Kundaje, Anshul, 1980-
Thesis advisor Fordyce, Polly
Thesis advisor Leskovec, Jurij
Degree committee member Fordyce, Polly
Degree committee member Leskovec, Jurij
Associated with Stanford University, Computer Science Department

Subjects

Genre Theses
Genre Text

Bibliographic information

Statement of responsibility Avanti Shrikumar
Note Submitted to the Computer Science Department
Thesis Thesis Ph.D. Stanford University 2020
Location electronic resource

Access conditions

Copyright
© 2020 by Avanti Shrikumar
License
This work is licensed under a Creative Commons Attribution 3.0 Unported license (CC BY).
