Deciphering regulatory DNA with deep learning models and interpretation methods

Abstract/Contents

Abstract
Accurate predictive modeling of gene regulation is crucial for a fundamental understanding of cell identity and function. High-throughput profiling of diverse biochemical and functional properties of cells has enabled powerful deep learning based DNA sequence models that predict protein-DNA binding, chromatin accessibility and histone marks across cell types with state-of-the-art accuracy. Interpretation of these DNA sequence models has revealed novel insights into the cis-regulatory code of TF binding, effects of sequence variation and repeats, and the sequence basis of chromatin accessibility. However, there is much scope for enhancing modeling strategies, model performance, and the tooling and infrastructure around model development and interpretation. Moreover, the full potential of these models for extracting biological insights from high-throughput functional profiling data has not been realized. In this thesis, I will present novel methods that advance DNA sequence models for regulatory genomics, and some applications of these DNA sequence models to glean insights into biological systems. First, I introduce ChromDragoNN, a method that enables generalization of DNA sequence models to make predictions in new cell types. Next, I describe fastISM, an algorithm to significantly speed up variant scoring for convolutional neural networks. I then present dynseq, a tool for sharing and visualization of model-derived importance scores of individual bases. I will then apply DNA sequence models to two different biological systems. First, I combine single-cell chromatin accessibility profiling with DNA sequence models to nominate regulatory DNA variants associated with eye disorders. Next, I apply DNA sequence models to the study of single-cell chromatin accessibility from a time course of human skin cells transforming into induced pluripotent stem cells over four weeks. Using DNA sequence models, I reveal mechanistic insights into reprogramming progression by linking transcription factor abundance changes to sequence logic encoded in regulatory elements. Together, this thesis advances predictive modeling and analysis of gene regulation through new methods, tools and biological applications. I hope that the work moves the field closer to realizing the full potential of DNA sequence models for understanding cell identity and function.

Description

Type of resource text
Form electronic resource; remote; computer; online resource
Extent 1 online resource.
Place California
Place [Stanford, California]
Publisher [Stanford University]
Copyright date ©2023
Publication date 2023
Issuance monographic
Language English

Creators/Contributors

Author Nair, Surag
Degree supervisor Kundaje, Anshul, 1980-
Thesis advisor Kundaje, Anshul, 1980-
Thesis advisor Engreitz, Jesse
Thesis advisor Horowitz, Mark (Mark Alan)
Degree committee member Engreitz, Jesse
Degree committee member Horowitz, Mark (Mark Alan)
Associated with Stanford University, School of Engineering
Associated with Stanford University, Computer Science Department

Subjects

Genre Theses
Genre Text

Bibliographic information

Statement of responsibility Surag Nair.
Note Submitted to the Computer Science Department.
Thesis Thesis Ph.D. Stanford University 2023.
Location https://purl.stanford.edu/mz621td1032

Access conditions

Copyright
© 2023 by Surag Nair
License
This work is licensed under a Creative Commons Attribution 3.0 Unported license (CC BY).
