Efficient permutation P-value estimation for gene set tests

Abstract/Contents

Abstract
In a genome-wide expression study, gene set testing is often used to find potential gene sets that correlate with a treatment (disease, drug, phenotype, etc.). A gene set may contain tens to thousands of genes, and genes within a gene set are generally correlated. Permutation tests are the standard approach to obtaining p-values for these gene set tests. Plain Monte Carlo methods that generate random permutations can be computationally infeasible for small p-values. Ackermann and Strimmer (2009) find two families of test statistics that achieve the best overall performance: a linear family and a quadratic family. This dissertation first reviews the relevant background of gene set testing and permutation tests, and then provides three alternative approaches to estimating small permutation p-values efficiently. The first approach focuses on the linear statistic. Because the p-value can be written as the proportion of points lying in a spherical cap, it is approximated by the volume of that spherical cap. Error estimates can be derived from a generalized Stolarsky invariance principle, and alternative probabilistic proofs are provided. The second approach focuses on the quadratic statistic. Importance sampling is used to estimate the area of the (continuous) significant region on the sphere, and the volume of the region is used as an approximation for the (discrete proportion) p-value. Different proposal distributions are studied and compared. The third approach estimates the p-value with nested sampling. It may work for both the linear and the quadratic statistic. Similar ideas can be found in literature spanning combinatorics, sequential Monte Carlo, Bayesian computation, rare event estimation, and network reliability, and they bear different names, e.g. approximate counting, nested sampling, subset simulation, and multilevel splitting.
We give a thorough review of the literature in these different areas, and apply the technique to gene set testing with the quadratic test statistic. Finally, we compare the proposed methods with plain Monte Carlo and saddlepoint approximation on three expression studies in Parkinson's disease patients. This work was supported by the US National Science Foundation under grant DMS-1521145.
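
As a rough illustration of the first approach described in the abstract, the sketch below contrasts a plain Monte Carlo permutation p-value with a spherical cap approximation for a linear statistic. It is not the dissertation's implementation. It assumes, for illustration only, that the linear statistic reduces to an inner product between a centered, unit-length gene set summary vector and a centered, unit-length treatment vector; the function names, the summary vector `x`, and the numerical integration scheme are all hypothetical choices made here.

```python
import math
import random


def center_and_normalize(v):
    """Center a vector and scale it to unit length (assumes v is not constant)."""
    mean = sum(v) / len(v)
    centered = [a - mean for a in v]
    norm = math.sqrt(sum(a * a for a in centered))
    return [a / norm for a in centered]


def linear_statistic(x0, y0):
    """Inner product of two vectors (the assumed form of the linear statistic)."""
    return sum(a * b for a, b in zip(x0, y0))


def mc_permutation_pvalue(x, y, n_perm=2000, seed=0):
    """Plain Monte Carlo estimate: share of random permutations of y whose
    statistic is at least the observed one, with the usual +1 correction."""
    rng = random.Random(seed)
    x0, y0 = center_and_normalize(x), center_and_normalize(y)
    observed = linear_statistic(x0, y0)
    perm = list(y0)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(perm)
        # Small tolerance so ties are not lost to floating-point rounding.
        if linear_statistic(x0, perm) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_perm + 1)


def spherical_cap_pvalue(x, y, grid=20000):
    """Continuous approximation: fraction of the unit sphere in the
    (n-1)-dimensional centered subspace whose inner product with x0 is at
    least the observed value rho. One coordinate t of a uniform point on that
    sphere has density proportional to (1 - t^2)^((n-4)/2)."""
    n = len(x)
    if n < 4:
        raise ValueError("this sketch requires at least 4 samples")
    rho = linear_statistic(center_and_normalize(x), center_and_normalize(y))
    rho = max(-1.0, min(1.0, rho))  # guard against rounding outside [-1, 1]
    power = (n - 4) / 2

    def integral(a, b):
        # Midpoint rule; midpoints stay strictly inside (-1, 1).
        h = (b - a) / grid
        return h * sum((1 - (a + (k + 0.5) * h) ** 2) ** power for k in range(grid))

    return integral(rho, 1.0) / integral(-1.0, 1.0)
```

The Monte Carlo estimate can never fall below 1/(n_perm + 1), which is why very small p-values require a prohibitive number of permutations, whereas the cap approximation has no such floor.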

Description

Type of resource text
Form electronic; electronic resource; remote
Extent 1 online resource.
Publication date 2016
Issuance monographic
Language English

Creators/Contributors

Associated with He, Yu
Associated with Stanford University, Department of Statistics.
Primary advisor Owen, Art B
Thesis advisor Owen, Art B
Thesis advisor Hastie, Trevor
Thesis advisor Wong, Wing Hung
Advisor Hastie, Trevor
Advisor Wong, Wing Hung

Subjects

Genre Theses

Bibliographic information

Statement of responsibility Yu He.
Note Submitted to the Department of Statistics.
Thesis Thesis (Ph.D.)--Stanford University, 2016.
Location electronic resource

Access conditions

Copyright
© 2016 by Yu He
License
This work is licensed under a Creative Commons Attribution Non Commercial 3.0 Unported license (CC BY-NC).
