Machine learning models for analyzing chromatin and RNA structure data
Abstract/Contents
- Abstract
- Chromatin and RNA structure play key roles in regulating gene expression. Recently developed experimental assays for genome-wide probing of chromatin and RNA structure provide data at base-pair resolution. However, the data are typically sparse and noisy at practical sequencing depths. In this thesis, we develop methods based on two machine learning techniques, deep learning and probabilistic graphical models, that exploit these high-dimensional, sparse, and noisy datasets to predict histone modifications and RNA secondary structure. We first introduce the GenomeLake system for efficiently streaming genomics data into deep learning models. GenomeLake simplifies model development by eliminating the need to pre-extract data into intermediate files, and by providing a convenient interface for randomly accessing data at arbitrary genomic loci. Next, we describe Chromputer, an integrative deep learning model based on convolutional neural networks for predicting histone modifications from chromatin structure data. We show that Chromputer achieves high predictive accuracy on a subset of modifications typically associated with active chromatin, within and across different cell types, and in an epidermal differentiation time course. Chromputer models trained on orthogonal DNase-seq and MNase-seq datasets also obtained high predictive accuracy, suggesting a fundamentally predictive relationship between chromatin architecture and histone modifications. Finally, we describe CONTRAfold-SE, a probabilistic model for RNA secondary structure prediction that models RNA structure-probing data as observations of possibly unknown secondary structures. This model can then be learned from datasets containing only structure-probing data, or a mix of known structures and probing data. We train CONTRAfold-SE on various combinations of structure-probing data and complete structures and find that, while genome-wide structure-probing data provides only a modest improvement in prediction performance, with sufficiently dense probing data alone it is possible to learn a model that approaches the performance of energy-based methods.
Description
Type of resource | text |
---|---|
Form | electronic; electronic resource; remote |
Extent | 1 online resource. |
Publication date | 2017 |
Issuance | monographic |
Language | English |
Creators/Contributors
Associated with | Foo, Chuan Sheng |
---|---|
Associated with | Stanford University, Computer Science Department. |
Primary advisor | Kundaje, Anshul, 1980- |
Thesis advisor | Kundaje, Anshul, 1980- |
Thesis advisor | Batzoglou, Serafim |
Thesis advisor | Greenleaf, William James |
Advisor | Batzoglou, Serafim |
Advisor | Greenleaf, William James |
Subjects
Genre | Theses |
---|---|
Bibliographic information
Statement of responsibility | Chuan Sheng Foo. |
---|---|
Note | Submitted to the Department of Computer Science. |
Thesis | Thesis (Ph.D.)--Stanford University, 2017. |
Location | electronic resource |
Access conditions
- Copyright
- © 2017 by Chuan Sheng Foo
- License
- This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported license (CC BY-NC 3.0).