Statistical models for phenotypic and genotypic expression
- Over the last decades, genomic data has become significantly cheaper to produce and more ubiquitous. The analysis of such data can shed light on the functioning of organisms and how phenotypes are encoded in the genetic material of each cell. It is thus of interest to create informative statistical models of cellular phenotypes, such as gene expression, using this genomic data. In this work, we develop models to understand how gene expression is regulated by the genetic code and key regulatory proteins, and how these regulatory programs give rise to diverse phenotypes.
The first part of the thesis focuses on inferring phenotypic traits directly from genotypic variants. To this end we introduce a new sparse regression method called the component lasso. The method is suited for datasets with highly correlated groups of variables, which often occur in genetics. In particular, we consider predicting traits from correlated sets of mutations in genes. The method estimates and uses the connected-components structure of the sample covariance matrix during inference to achieve a lower mean squared error as well as better support recovery. We evaluate the performance of the component lasso on simulated and real data examples.
In the second part of the thesis, the focus is on the problem of genotypic expression and methods for modeling gene regulatory networks in different cells. We assume a simplified model in which mechanisms responsible for gene expression involve only two main elements: 1) transcription factors that bind to the DNA molecule; and 2) motifs that exist in the regulatory regions of genes. We first present a solution within a boosting framework that represents a regulatory network with alternating decision trees. We use the cell differentiation hierarchy to infer different networks for different cell types, while restricting the differences for models of closely related cells. We evaluate the boosting method on simulated data as well as on a real hematopoiesis dataset that has an inherent hierarchy over blood cells that stem from a single progenitor. We then present a deep learning approach for the classification of gene expression. We use a multimodal neural network on the raw DNA sequence and regulator expression data, which allows us to automatically discover relevant and new motif sequences.
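The connected-components step mentioned in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: the thresholding rule, the threshold values, the toy data, and the function names `sample_covariance` and `covariance_components` are all assumptions made for illustration. The sketch only groups variables into blocks; the per-block lasso fits and the way they are combined are not shown.

```python
def sample_covariance(X):
    """Sample covariance matrix of the columns of X (X is a list of rows)."""
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    cov = [[0.0] * p for _ in range(p)]
    for j in range(p):
        for k in range(j, p):
            s = sum((row[j] - means[j]) * (row[k] - means[k]) for row in X)
            cov[j][k] = cov[k][j] = s / (n - 1)
    return cov


def covariance_components(cov, threshold):
    """Connected components of the graph that has an edge between
    variables j and k whenever |cov[j][k]| > threshold (an assumed rule)."""
    p = len(cov)
    parent = list(range(p))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for j in range(p):
        for k in range(j + 1, p):
            if abs(cov[j][k]) > threshold:
                parent[find(j)] = find(k)

    groups = {}
    for j in range(p):
        groups.setdefault(find(j), []).append(j)
    return sorted(groups.values())


# Toy data (hypothetical): three mutually orthogonal +/-1 patterns.
# Variables 0-2 follow z1 and variables 3-4 follow z2; z3 adds a weak
# link between variable 1 and variable 4.
X = []
for i in range(8):
    z1 = 1 if i % 2 == 0 else -1
    z2 = 1 if (i // 2) % 2 == 0 else -1
    z3 = 1 if i < 4 else -1
    X.append([z1, z1 + 0.3 * z3, -z1, z2, z2 - 0.3 * z3])

cov = sample_covariance(X)
components = covariance_components(cov, threshold=0.5)
print(components)
```

With the threshold at 0.5 the weak cross-block covariance is ignored and the variables split into two blocks. Lowering the threshold far enough merges them into one block, so the choice of threshold controls how fine the decomposition is.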
|Type of resource
|electronic; electronic resource; remote
|1 online resource.
|Stanford University, Department of Electrical Engineering.
|Kundaje, Anshul, 1980-
|Statement of responsibility
|Submitted to the Department of Electrical Engineering.
|Thesis (Ph.D.)--Stanford University, 2017.
- © 2017 by Nadine Hussami
- This work is licensed under a Creative Commons Attribution Non Commercial 3.0 Unported license (CC BY-NC).