Computational algorithms and statistical models for ChIP sequencing analysis
- Chromatin immunoprecipitation coupled with ultra-high-throughput DNA sequencing (ChIP-seq) has been widely used to study the genome-wide localization of protein-DNA interactions since 2007. This powerful technique provides comprehensive, high-resolution protein-DNA binding data for the identification of cis-regulatory elements, which is important for understanding and deciphering the underlying transcriptional gene regulatory mechanisms. In this dissertation, I developed a computational and statistical framework for the analysis of ChIP-seq data. A typical pipeline for analyzing ChIP-seq data is presented and discussed, including data exploration and visualization, background estimation, peak detection, genomic annotation, and motif analysis. In particular, I developed and implemented a series of peak detection algorithms and methods for transcription factor (TF)-binding ChIP-seq data, covering both single-replicate and multiple-replicate datasets. For single-replicate peak calling, I developed an iterative conditional Binomial model for the two-sample problem (when both the treated ChIP sample and the negative control sample are available). This iterative method is computationally efficient and provides accurate estimation of the joint background distribution between the ChIP and negative control samples. Compared to other two-sample peak callers, our method achieves higher sensitivity in peak calling and sharper motif resolution in detected peak regions. For the multiple-replicate problem, I put forward a hierarchical Negative Binomial model to assess binding signal variation among multiple ChIP-seq biological replicates. A closed-form empirical Bayes estimator of the expected peak signal is developed by pooling information from all candidate peak regions. This empirical Bayes estimator significantly reduces the variance of the estimation error and increases the sensitivity in detecting binding loci of interest, especially when the number of replicates is small. Our method outperforms existing heuristic approaches for multiple-replicate peak calling, including the pooling approach and the intersection approach. The computational algorithms and statistical models developed for ChIP-seq data analysis are applied and evaluated in a study of the Sonic Hedgehog (Shh) signaling pathway in the ventral neural tube of mouse embryos. Through an integrative analysis of transcription profiling data, ChIP-seq data, functional annotation, and motif discovery, we identified hundreds of novel cis-regulatory elements mediated by Gli1, predicted potential co-binding partners of Gli1, and gained insights into Gli functions in mouse embryonic ventral neural tube development.
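The abstract names an iterative conditional Binomial model for the two-sample problem but does not spell it out. The sketch below illustrates the general idea only and is not the dissertation's implementation. It assumes that, conditional on the total read count in a window, the ChIP count is Binomial with a background fraction `p0`, and that `p0` is re-estimated from windows not currently flagged as enriched. The function names, the fixed-window input, and the `alpha` threshold are all hypothetical.

```python
from math import comb


def binom_sf(k, n, p):
    """Return P(X >= k) for X ~ Binomial(n, p) by direct summation."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def call_peaks(chip, control, alpha=1e-3, max_iter=20):
    """Illustrative iterative two-sample peak caller (an assumption, not the thesis code).

    chip, control: per-window read counts from the ChIP and control samples.
    Returns the final background ChIP fraction p0 and a list of per-window flags.
    """
    flagged = [False] * len(chip)
    p0 = None
    for _ in range(max_iter):
        # Re-estimate the background fraction from windows not flagged as peaks.
        bg_chip = sum(k for k, f in zip(chip, flagged) if not f)
        bg_ctrl = sum(c for c, f in zip(control, flagged) if not f)
        if bg_chip + bg_ctrl == 0:
            break  # no background windows left to estimate from
        p0 = bg_chip / (bg_chip + bg_ctrl)
        # Conditional on n = k + c reads in a window, test k against Binomial(n, p0).
        new_flagged = [
            (k + c) > 0 and binom_sf(k, k + c, p0) < alpha
            for k, c in zip(chip, control)
        ]
        if new_flagged == flagged:
            break  # the set of flagged windows has stabilized
        flagged = new_flagged
    return p0, flagged


# Toy data: one strongly enriched window among background-like windows.
chip = [5, 4, 6, 5, 80, 5, 6, 4]
control = [5, 6, 4, 5, 4, 5, 4, 6]
p0, flagged = call_peaks(chip, control)
```

On this toy input, the enriched window inflates the first background estimate, and excluding it on the next pass brings the estimate back to the ratio seen in the remaining windows. That is the motivation for iterating rather than estimating the background once.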
|Type of resource
|electronic; electronic resource; remote
|1 online resource.
|Stanford University, Computer Science Department
|Wong, Wing Hung
|Dill, David L
|Statement of responsibility
|Submitted to the Department of Computer Science.
|Thesis (Ph.D.)--Stanford University, 2012.
- © 2012 by Wenxiu Ma