Differential expression identification and false discovery rate estimation in RNA-Seq data
- RNA-Seq is becoming the primary tool for measuring genome-wide transcript expression. We discuss the identification of features (genes, isoforms, exons, etc.) that are differentially expressed between samples in different biological conditions or with different disease statuses. Besides finding the right set of significant features, we emphasize accurate estimation of the corresponding false discovery rate (FDR). RNA-Seq data take the form of counts, so models based on the Gaussian distribution are generally unsuitable. Also, different sequencing experiments have very different sequencing depths, which need to be estimated accurately and then used to normalize the data. Current methods model counts by Gaussian, Poisson, or negative binomial distributions, and they apply the Benjamini-Hochberg procedure for FDR estimation. They have obvious limitations: (1) They are applicable only to two-class data, not to quantitative or survival data. (2) They are sensitive to violations of distributional assumptions and often fail completely when outliers are present in the data. (3) Their estimates of FDR are often inaccurate. To overcome these difficulties, we propose two novel methods, a parametric one and a nonparametric one. Our parametric method uses a new permutation plug-in procedure for estimating FDR, and our nonparametric method utilizes a novel resampling strategy for normalizing the count data. Both methods can be applied to different types of RNA-Seq data. The parametric method is less sensitive to violations of its distributional assumptions, and the nonparametric method is very robust even to outliers. Both often give reliable estimates of FDR in cases where other methods cannot. Although we mainly discuss the identification of differentially expressed genes in RNA-Seq data, the two methods we develop should be equally applicable to data generated by other sequencing technologies, such as DNA-Seq, ChIP-Seq, and 3SEQ.
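To make the idea of a permutation plug-in FDR estimate concrete, here is a minimal generic sketch for two-class count data. It is not the procedure developed in the thesis. The function names (`depth_normalize`, `score`, `plugin_fdr`), the total-count depth estimate, the log-fold-change statistic, the toy data, and the implicit assumption that every feature is null (which makes the estimate conservative) are all illustrative assumptions.

```python
import math
import random


def depth_normalize(counts):
    """Scale each sample (column) by its total count relative to the mean total.

    Assumption: total count is used as a crude stand-in for sequencing depth.
    """
    n_samples = len(counts[0])
    totals = [sum(row[j] for row in counts) for j in range(n_samples)]
    mean_total = sum(totals) / n_samples
    return [[row[j] * mean_total / totals[j] for j in range(n_samples)]
            for row in counts]


def score(row, labels):
    """Absolute difference in mean log2(count + 1) between class 1 and class 0."""
    g1 = [math.log2(x + 1) for x, y in zip(row, labels) if y == 1]
    g0 = [math.log2(x + 1) for x, y in zip(row, labels) if y == 0]
    return abs(sum(g1) / len(g1) - sum(g0) / len(g0))


def plugin_fdr(counts, labels, cutoff, n_perm=200, seed=0):
    """Return (features called, plug-in FDR estimate) at a given score cutoff."""
    rng = random.Random(seed)
    norm = depth_normalize(counts)
    # Features called significant under the observed labels.
    called = sum(score(row, labels) >= cutoff for row in norm)
    # Average number of features called when labels are randomly permuted;
    # this estimates the expected number of false calls.
    null_calls = 0
    perm = list(labels)
    for _ in range(n_perm):
        rng.shuffle(perm)
        null_calls += sum(score(row, perm) >= cutoff for row in norm)
    expected_false = null_calls / n_perm
    fdr = min(1.0, expected_false / called) if called else 0.0
    return called, fdr


# Toy data (hypothetical): 200 features, 3 samples per class,
# the first 5 features have an 8-fold higher count in class 1.
labels = [0, 0, 0, 1, 1, 1]
counts = []
for g in range(200):
    fold = 8 if g < 5 else 1
    counts.append([(90 + (7 * g + 3 * j) % 21) * (fold if y == 1 else 1)
                   for j, y in enumerate(labels)])

called, fdr = plugin_fdr(counts, labels, cutoff=2.0)
print(called, fdr)
```

With only three samples per class there are few distinct label permutations, so the estimate in this toy example is coarse; it is meant only to show the mechanics of counting calls under permuted labels and dividing by the number of observed calls.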
|Type of resource
|electronic; electronic resource; remote
|1 online resource.
|Li, Jun, (Statistician)
|Stanford University, Department of Statistics
|Wong, Wing Hung
|Statement of responsibility
|Submitted to the Department of Statistics.
|Thesis (Ph.D.)--Stanford University, 2012.
- © 2012 by Jun Li
- This work is licensed under a Creative Commons Attribution Non Commercial 3.0 Unported license (CC BY-NC).