Machine learning models for analyzing chromatin and RNA structure data

Foo, Chuan Sheng; Stanford University, Computer Science Department.

Abstract/Contents

Abstract: Chromatin and RNA structure play key roles in regulating gene expression. Recently developed experimental assays for genome-wide probing of chromatin and RNA structure provide data at base-pair resolution. However, the data are typically sparse and noisy at practical sequencing depths. In this thesis, we develop methods based on two powerful machine learning techniques, deep learning and probabilistic graphical models, that enable us to fully leverage the richness of these high-dimensional, sparse, and noisy datasets to predict histone modifications and RNA secondary structure. We first introduce the GenomeLake system for efficiently streaming genomics data into deep learning models. GenomeLake simplifies model development by eliminating the need to pre-extract data into intermediate files, and by providing a convenient interface for randomly accessing data at arbitrary genomic loci. Next, we describe Chromputer, an integrative deep learning model based on convolutional neural networks for predicting histone modifications from chromatin structure data. We show that Chromputer achieves high predictive accuracy on a subset of modifications typically associated with active chromatin, within and across different cell types, and in an epidermal differentiation time-course. Chromputer models trained on orthogonal DNase-seq and MNase-seq datasets also obtained high predictive accuracy, suggesting a fundamentally predictive relationship between chromatin architecture and histone modifications. Finally, we describe CONTRAfold-SE, a probabilistic model for RNA secondary structure prediction that models RNA structure-probing data as observations of possibly unknown secondary structures. This model can then be learned from datasets containing only structure-probing data, or a mix of known structures and probing data. We train CONTRAfold-SE on various combinations of structure-probing data and complete structures and find that, while genome-wide structure-probing data provide a modest improvement in prediction performance, with sufficiently dense probing data alone it is possible to learn a model that approaches the performance of energy-based methods.

Description

Type of resource	text
Form	electronic; electronic resource; remote
Extent	1 online resource.
Publication date	2017
Issuance	monographic
Language	English

Creators/Contributors

Associated with	Foo, Chuan Sheng
Associated with	Stanford University, Computer Science Department.
Primary advisor	Kundaje, Anshul, 1980-
Thesis advisor	Kundaje, Anshul, 1980-
Thesis advisor	Batzoglou, Serafim
Thesis advisor	Greenleaf, William James
Advisor	Batzoglou, Serafim
Advisor	Greenleaf, William James

Subjects

Genre	Theses

Bibliographic information

Statement of responsibility	Chuan Sheng Foo.
Note	Submitted to the Department of Computer Science.
Thesis	Thesis (Ph.D.)--Stanford University, 2017.
Location	electronic resource

Access conditions

License: This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported license (CC BY-NC 3.0).
