Computational algorithms and statistical models for ChIP sequencing analysis


Chromatin immunoprecipitation coupled with ultra-high-throughput DNA sequencing (ChIP-seq) has been widely used since 2007 to study the genome-wide localization of protein-DNA interactions. This powerful technique provides comprehensive, high-resolution protein-DNA binding data for the identification of cis-regulatory elements, which is important for deciphering the underlying mechanisms of transcriptional gene regulation. In this dissertation, I developed a computational and statistical framework for the analysis of ChIP-seq data. A typical pipeline for analyzing ChIP-seq data is presented and discussed, including data exploration and visualization, background estimation, peak detection, genomic annotation, and motif analysis. In particular, I developed and implemented a series of peak detection algorithms and methods for transcription factor (TF)-binding ChIP-seq data, for both single-replicate and multiple-replicate ChIP-seq datasets. For single-replicate peak calling, I developed an iterative conditional Binomial model for the two-sample problem (when both the treated ChIP sample and the negative control sample are available). This iterative method is computationally efficient and provides accurate estimation of the joint background distribution between ChIP and negative control samples. Compared to other two-sample peak callers, our method achieves higher sensitivity in peak calling and sharper motif resolution in detected peak regions. For the multiple-replicate problem, I put forward a hierarchical Negative Binomial model to assess binding signal variations among multiple ChIP-seq biological replicates. A closed-form empirical Bayes estimator of the expected peak signal is developed by pooling information from all candidate peak regions. This empirical Bayes estimator significantly reduces the variance of the estimation error and increases the sensitivity in detecting binding loci of interest, especially when the number of replicates is small. Our method outperforms existing heuristic approaches for multiple-replicate peak calling, including the pooling approach and the intersection approach. The computational algorithms and statistical models developed for ChIP-seq data analysis are applied and evaluated in a study of the Sonic Hedgehog (Shh) signaling pathway in the ventral neural tube of mouse embryos. Through an integrative analysis of transcription profiling data, ChIP-seq data, functional annotation, and motif discovery, we identified hundreds of novel cis-regulatory elements mediated by Gli1, predicted potential co-binding partners of Gli1, and gained insight into Gli functions in mouse embryonic ventral neural tube development.
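To make the two-sample idea in the abstract concrete, the following is a minimal sketch of an iterative conditional Binomial test. It is not the dissertation's implementation. It assumes, purely for illustration, that the genome has been binned into fixed windows with a ChIP read count and a control read count per window, and that under the background model the ChIP count in a window, conditional on the window's total count, is Binomial with a shared ChIP fraction `p0`. The function names `binom_sf` and `call_peaks`, the threshold `alpha`, and the stopping rule are all hypothetical.

```python
import math


def binom_sf(k, n, p):
    """Upper tail P(X >= k) for X ~ Binomial(n, p), summed in log space."""
    if k <= 0:
        return 1.0
    if k > n or p <= 0.0:
        return 0.0
    if p >= 1.0:
        return 1.0
    total = 0.0
    for i in range(k, n + 1):
        log_pmf = (
            math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
            + i * math.log(p) + (n - i) * math.log1p(-p)
        )
        total += math.exp(log_pmf)
    return min(total, 1.0)


def call_peaks(chip, control, alpha=1e-3, max_iter=20):
    """Alternate between estimating the background ChIP fraction p0 and
    flagging windows whose ChIP count is enriched relative to control.

    Assumption (illustrative): windows currently flagged as peaks are
    excluded when p0 is re-estimated, so true binding signal does not
    inflate the background estimate.
    """
    is_peak = [False] * len(chip)
    p0 = 0.5  # placeholder; overwritten in the first iteration when data exist
    for _ in range(max_iter):
        bg_chip = sum(x for x, flagged in zip(chip, is_peak) if not flagged)
        bg_ctrl = sum(y for y, flagged in zip(control, is_peak) if not flagged)
        if bg_chip + bg_ctrl == 0:
            break
        p0 = bg_chip / (bg_chip + bg_ctrl)
        # Conditional test: given x + y reads in a window, is x unusually large?
        new_flags = [binom_sf(x, x + y, p0) < alpha for x, y in zip(chip, control)]
        if new_flags == is_peak:
            break  # peak set is stable, so p0 has converged
        is_peak = new_flags
    return p0, is_peak


chip_counts = [10, 12, 9, 11, 80, 10]
control_counts = [10, 11, 10, 12, 9, 10]
p0, is_peak = call_peaks(chip_counts, control_counts)
```

In this toy input only the fifth window is flagged. The first pass estimates the background fraction from all windows, and the second pass re-estimates it without the flagged window, which is the step that keeps strong peaks from biasing the background.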


Type of resource text
Form electronic; electronic resource; remote
Extent 1 online resource.
Publication date 2012
Issuance monographic
Language English


Associated with Ma, Wenxiu
Associated with Stanford University, Computer Science Department
Primary advisor Batzoglou, Serafim
Primary advisor Wong, Wing Hung
Thesis advisor Batzoglou, Serafim
Thesis advisor Wong, Wing Hung
Thesis advisor Dill, David L
Advisor Dill, David L


Genre Theses

Bibliographic information

Statement of responsibility Wenxiu Ma.
Note Submitted to the Department of Computer Science.
Thesis Thesis (Ph.D.)--Stanford University, 2012.
Location electronic resource

Access conditions

© 2012 by Wenxiu Ma
