Statistical models for phenotypic and genotypic expression

Abstract/Contents

Abstract
Over the last decades, genomic data has become significantly cheaper to produce and more ubiquitous. The analysis of such data can shed light on the functioning of organisms and how phenotypes are encoded in the genetic material of each cell. It is thus of interest to create informative statistical models of cellular phenotypes, such as gene expression, using this genomic data. In this work, we develop models to understand how gene expression is regulated by the genetic code and key regulatory proteins, and how these regulatory programs give rise to diverse phenotypes.

The first part of the thesis focuses on inferring phenotypic traits directly from genotypic variants. To this end we introduce a new sparse regression method called the component lasso. The method is suited for datasets with highly correlated groups of variables, which often occur in genetics. In particular, we consider predicting traits from correlated sets of mutations in genes. The method estimates and uses the connected-components structure of the sample covariance matrix during inference to achieve a lower mean squared error as well as better support recovery. We evaluate the performance of the component lasso on simulated and real data examples.

In the second part of the thesis, the focus is on the problem of genotypic expression and methods for modeling gene regulatory networks in different cells. We assume a simplified model in which mechanisms responsible for gene expression involve only two main elements: 1) transcription factors that bind to the DNA molecule; and 2) motifs that exist in the regulatory regions of genes. We first present a solution within a boosting framework that represents a regulatory network with alternating decision trees. We use the cell differentiation hierarchy to infer different networks for different cell types, while restricting the differences for models of closely related cells. We evaluate the boosting method on simulated data as well as on a real hematopoiesis dataset that has an inherent hierarchy over blood cells that stem from a single progenitor. We then present a deep learning approach for the classification of gene expression. We use a multimodal neural network on the raw DNA sequence and regulator expression data, which allows us to automatically discover relevant and new motif sequences.
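
The connected-components idea behind the component lasso mentioned in the abstract can be illustrated with a minimal sketch. This record does not say how the thesis estimates the components, so the hard-thresholding rule, the `threshold` parameter, and the function names `sample_covariance` and `covariance_components` below are all assumptions made for illustration. The sketch only groups variables whose sample covariances link them together; the per-group sparse regression fits are omitted.

```python
from collections import deque


def sample_covariance(rows):
    """Unbiased sample covariance matrix of a list of observations."""
    n = len(rows)
    p = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    cov = [[0.0] * p for _ in range(p)]
    for j in range(p):
        for k in range(j, p):
            s = sum((r[j] - means[j]) * (r[k] - means[k]) for r in rows)
            cov[j][k] = cov[k][j] = s / (n - 1)
    return cov


def covariance_components(cov, threshold):
    """Group variables into connected components.

    Assumption (not from the thesis record): variables i and j are
    linked when |cov[i][j]| exceeds `threshold`. Components are found
    with a breadth-first search over that graph.
    """
    p = len(cov)
    seen = [False] * p
    components = []
    for start in range(p):
        if seen[start]:
            continue
        seen[start] = True
        queue = deque([start])
        comp = []
        while queue:
            i = queue.popleft()
            comp.append(i)
            for j in range(p):
                if not seen[j] and abs(cov[i][j]) > threshold:
                    seen[j] = True
                    queue.append(j)
        components.append(sorted(comp))
    return components


# Hypothetical covariance matrix with two correlated blocks of variables.
example_cov = [
    [1.00, 0.80, 0.00, 0.05],
    [0.80, 1.00, 0.02, 0.00],
    [0.00, 0.02, 1.00, 0.60],
    [0.05, 0.00, 0.60, 1.00],
]
groups = covariance_components(example_cov, threshold=0.1)
```

In this illustration, a sparse regression could then be fit separately on each group of variables in `groups`; how the thesis combines the per-component fits is not described in this record.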

Description

Type of resource text
Form electronic; electronic resource; remote
Extent 1 online resource.
Publication date 2017
Issuance monographic
Language English

Creators/Contributors

Associated with Hussami, Nadine
Associated with Stanford University, Department of Electrical Engineering.
Primary advisor Kundaje, Anshul, 1980-
Primary advisor Tibshirani, Robert
Thesis advisor Kundaje, Anshul, 1980-
Thesis advisor Tibshirani, Robert
Thesis advisor Duchi, John
Thesis advisor Weissman, Tsachy
Advisor Duchi, John
Advisor Weissman, Tsachy

Subjects

Genre Theses

Bibliographic information

Statement of responsibility Nadine Hussami.
Note Submitted to the Department of Electrical Engineering.
Thesis Thesis (Ph.D.)--Stanford University, 2017.
Location electronic resource

Access conditions

Copyright
© 2017 by Nadine Hussami
License
This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported license (CC BY-NC 3.0).
