Deciphering regulatory DNA with deep learning models and interpretation methods
- Accurate predictive modeling of gene regulation is crucial for a fundamental understanding of cell identity and function. High-throughput profiling of diverse biochemical and functional properties of cells has enabled powerful deep learning-based DNA sequence models that predict protein-DNA binding, chromatin accessibility and histone marks across cell types with state-of-the-art accuracy. Interpretation of these DNA sequence models has revealed novel insights into the cis-regulatory code of TF binding, effects of sequence variation and repeats, and the sequence basis of chromatin accessibility. However, there is much scope for enhancing modeling strategies, model performance, and the tooling and infrastructure around model development and interpretation. Moreover, the full potential of these models for extracting biological insights from high-throughput functional profiling data has not been realized. In this thesis, I present novel methods that advance DNA sequence models for regulatory genomics, and some applications of these DNA sequence models to glean insights into biological systems. First, I introduce ChromDragoNN, a method that enables generalization of DNA sequence models to make predictions in new cell types. Next, I describe fastISM, an algorithm to significantly speed up variant scoring for convolutional neural networks. I then present dynseq, a tool for sharing and visualization of model-derived importance scores of individual bases. I then apply DNA sequence models to two different biological systems. First, I combine single-cell chromatin accessibility profiling with DNA sequence models to nominate regulatory DNA variants associated with eye disorders. Next, I apply DNA sequence models to the study of single-cell chromatin accessibility from a time course of human skin cells transforming into induced pluripotent stem cells over four weeks. Using DNA sequence models, I reveal mechanistic insights into reprogramming progression by linking transcription factor abundance changes to sequence logic encoded in regulatory elements. Together, this thesis advances predictive modeling and analysis of gene regulation through new methods, tools and biological applications. I hope that the work moves the field closer to realizing the full potential of DNA sequence models for understanding cell identity and function.
|Type of resource
|electronic resource; remote; computer; online resource
|1 online resource.
|Degree committee member
|Stanford University, School of Engineering
|Stanford University, Computer Science Department
|Statement of responsibility
|Submitted to the Computer Science Department.
|Thesis Ph.D. Stanford University 2023.
- © 2023 by Surag Nair
- This work is licensed under a Creative Commons Attribution 3.0 Unported license (CC BY).