Differential expression identification and false discovery rate estimation in RNA-Seq data



RNA-Seq is becoming the primary tool for measuring genome-wide transcript expression. We discuss the identification of features (genes, isoforms, exons, etc.) that are differentially expressed between samples in different biological conditions or with different disease statuses. Besides finding the right set of significant features, we emphasize accurate estimation of the corresponding false discovery rate (FDR). RNA-Seq data take the form of counts, so models based on the Gaussian distribution are generally unsuitable. Also, different sequencing experiments have very different sequencing depths, which need to be estimated accurately and then used to normalize the data. Current methods model counts by Gaussian, Poisson, or negative binomial distributions, and they apply the Benjamini-Hochberg procedure for FDR estimation. They have obvious limitations: (1) They are applicable only to two-class data, not to quantitative or survival data. (2) They are sensitive to violations of their distributional assumptions and often fail completely when outliers are present in the data. (3) Their estimates of the FDR can often be inaccurate. To overcome these difficulties, we propose two novel methods, a parametric one and a nonparametric one. Our parametric method uses a new permutation plug-in procedure for estimating the FDR, and our nonparametric method uses a novel resampling strategy for normalizing the count data. Both methods can be applied to different types of RNA-Seq data. The parametric method is less sensitive to violations of its distributional assumptions, and the nonparametric method is very robust even to outliers. Both often give reliable estimates of the FDR in cases where other methods cannot. Although we mainly discuss the identification of differentially expressed genes in RNA-Seq data, the two methods we develop should be equally applicable to data generated by other sequencing technologies, such as DNA-Seq, ChIP-Seq, and 3SEQ.
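
For readers unfamiliar with the Benjamini-Hochberg procedure that the abstract attributes to current methods, a minimal sketch in Python follows. It is not code from the thesis and does not implement the permutation plug-in or resampling methods the thesis proposes; the function name `benjamini_hochberg` and the example p-values are assumptions chosen for illustration.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order.

    Illustrative helper (hypothetical name), not taken from the thesis.
    """
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0  # also caps adjusted values at 1
    # Walk from the largest p-value down to the smallest so that the
    # adjusted values are non-decreasing in the original p-values.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Example with made-up p-values for four features.
example_p = [0.01, 0.04, 0.03, 0.005]
example_q = benjamini_hochberg(example_p)
# Features whose adjusted p-value is at most 0.03 are called significant.
significant = [i for i, q in enumerate(example_q) if q <= 0.03]
```

Calling a feature significant when its adjusted p-value is at most a target level controls the FDR at that level under the procedure's assumptions (independent or positively dependent tests).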


Type of resource text
Form electronic; electronic resource; remote
Extent 1 online resource.
Publication date 2012
Issuance monographic
Language English


Associated with Li, Jun, (Statistician)
Associated with Stanford University, Department of Statistics
Primary advisor Tibshirani, Robert
Thesis advisor Tibshirani, Robert
Thesis advisor Hastie, Trevor
Thesis advisor Wong, Wing Hung


Genre Theses

Bibliographic information

Statement of responsibility Jun Li.
Note Submitted to the Department of Statistics.
Thesis Thesis (Ph.D.)--Stanford University, 2012.
Location electronic resource

Access conditions

© 2012 by Jun Li
This work is licensed under a Creative Commons Attribution Non Commercial 3.0 Unported license (CC BY-NC).
