False discoveries with dependence : an application of objective inference


Abstract/Contents

Abstract
Especially lately, increased availability of data and decreased barriers for data analysis, while promising for the popularity of statistics, have led to increased concern about personal biases and motives, causing distrust in published conclusions. We propose that inferential guarantees which hold regardless of individual methods and beliefs are a solution to building trust. This motivates defining an objective inference as one which is interpretable - no other information on the underlying data is needed to synthesize conclusions from results - and fair - interpretation of the same results across multiple observers with different preferences is equally easy. An inference which has these properties is more factual than one which does not. The properties of interpretable and fair are given decision-theoretic definitions in the context of hypothesis testing. They amount to a robustness requirement of null risk with respect to nuisance parameters and transformations of data. Two examples applying these properties are explored in depth. For the first example, we regain objectivity for false discovery analysis in the presence of dependent test statistics by deriving upper confidence bounds on the false discovery quantities: false discovery proportion (Fdp), Bayesian false discovery rate (Fdr), and local false discovery rate (fdr). These upper confidence bounds are uniform across all choices of hypothesis cutoffs, motivating plotting many cutoffs at once, or listing them in a table. Extensions to theoretical versus estimated null components are included. We call these the "U" methods, e.g. UFdp. These methods use derived covariance formulas for both the empirical process and density estimates from the Expectation-Maximization (EM) algorithm when the data are correlated, and approximations to tail probabilities of the supremum of Gaussian processes over parameter sets embedded in metric spaces from Volume of Tubes and Double Sum arguments.
For the second example, we consider online experimentation, or sequential hypothesis testing of streaming data on the internet, where practitioners have short-term incentives to subvert statistical protocol. Defining an always valid p-value process that is super-uniform regardless of the stopping time chosen for the experiment allows for an objective inference. We explore a particular always valid p-value based on the mixture sequential probability ratio test (mSPRT), and show it has asymptotically optimal risk, even among tests which are allowed to fix the stopping time and maximal sample size truncation in advance. We also show how to choose the tuning parameters of an mSPRT to gain good sub-asymptotic performance. Finally, we compare multiple always valid p-value processes at once through sequential false discovery analysis. Even though the p-value processes may themselves be independent, the choice of a mutual stopping time can introduce dependence, biasing false discovery procedures. We derive false discovery rate bounds from applying the Benjamini-Hochberg (BH) procedure at any stopping time, and examine them for three common classes of stopping times.
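
For readers unfamiliar with the Benjamini-Hochberg (BH) procedure named in the abstract, the following is a minimal sketch of the standard fixed-sample step-up rule. It is not taken from the thesis. The function name `benjamini_hochberg` and the example p-values are illustrative assumptions, and the sketch does not show the sequential setting or the stopping-time bounds the thesis derives.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Standard BH step-up procedure (illustrative sketch, not the thesis's code).

    Returns a list of booleans, one per input p-value, marking which
    hypotheses are rejected at nominal false discovery rate level alpha.
    """
    m = len(p_values)
    if m == 0:
        return []

    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest rank k with p_(k) <= alpha * k / m.
    k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha * rank / m:
            k = rank

    # Reject the k hypotheses with the smallest p-values.
    rejected = [False] * m
    for idx in order[:k]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values, chosen only to illustrate the rule.
example = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
decisions = benjamini_hochberg(example, alpha=0.05)
```

In the sequential setting described above, the inputs would be the always valid p-values observed at the chosen stopping time. Whether the nominal level still holds in that case depends on the stopping rule, which is the question the thesis addresses.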

Description

Type of resource text
Form electronic; electronic resource; remote
Extent 1 online resource.
Publication date 2016
Issuance monographic
Language English

Creators/Contributors

Associated with Pekelis, Leonid B
Associated with Stanford University, Department of Statistics.
Primary advisor Efron, Bradley
Thesis advisor Efron, Bradley
Thesis advisor Johnstone, Iain
Thesis advisor Owen, Art B
Thesis advisor Taylor, Jonathan

Subjects

Genre Theses

Bibliographic information

Statement of responsibility Leonid B. Pekelis.
Note Submitted to the Department of Statistics.
Thesis Thesis (Ph.D.)--Stanford University, 2016.
Location electronic resource

Access conditions

Copyright
© 2016 by Leonid Boris Pekelis
License
This work is licensed under a Creative Commons Attribution Non Commercial 3.0 Unported license (CC BY-NC).
