Machine learning models for analyzing chromatin and RNA structure data

Abstract/Contents

Abstract
Chromatin and RNA structure play key roles in regulating gene expression. Recently developed experimental assays for genome-wide probing of chromatin and RNA structure provide data at base-pair resolution. However, the data are typically sparse and noisy at practical sequencing depths. In this thesis, we develop methods based on two powerful machine learning techniques, deep learning and probabilistic graphical models, that enable us to fully leverage the richness of these high-dimensional, sparse, and noisy datasets to predict histone modifications and RNA secondary structure. We first introduce the GenomeLake system for efficiently streaming genomics data into deep learning models. GenomeLake simplifies model development by eliminating the need to pre-extract data into intermediate files, and by providing a convenient interface for randomly accessing data at arbitrary genomic loci. Next, we describe Chromputer, an integrative deep learning model based on convolutional neural networks for predicting histone modifications from chromatin structure data. We show that Chromputer achieves high predictive accuracy on a subset of modifications typically associated with active chromatin, within and across different cell types, and in an epidermal differentiation time course. Chromputer models trained on orthogonal DNase-seq and MNase-seq datasets also obtained high predictive accuracy, suggesting a fundamentally predictive relationship between chromatin architecture and histone modifications. Finally, we describe CONTRAfold-SE, a probabilistic model for RNA secondary structure prediction that models RNA structure-probing data as observations of possibly unknown secondary structures. This model can then be learned from datasets containing only structure-probing data, or a mix of known structures and probing data. We train CONTRAfold-SE on various combinations of structure-probing data and complete structures and find that, while genome-wide structure-probing data provide a modest improvement in prediction performance, with sufficiently dense probing data alone it is possible to learn a model that approaches the performance of energy-based methods.

Description

Type of resource text
Form electronic; electronic resource; remote
Extent 1 online resource.
Publication date 2017
Issuance monographic
Language English

Creators/Contributors

Associated with Foo, Chuan Sheng
Associated with Stanford University, Computer Science Department.
Primary advisor Kundaje, Anshul, 1980-
Thesis advisor Kundaje, Anshul, 1980-
Thesis advisor Batzoglou, Serafim
Thesis advisor Greenleaf, William James

Subjects

Genre Theses

Bibliographic information

Statement of responsibility Chuan Sheng Foo.
Note Submitted to the Department of Computer Science.
Thesis Thesis (Ph.D.)--Stanford University, 2017.
Location electronic resource

Access conditions

Copyright
© 2017 by Chuan Sheng Foo
License
This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported license (CC BY-NC 3.0).
