Two graph-based tests for high-dimensional inference


Abstract/Contents

Abstract
Modern science places a growing emphasis on multivariate, complex data types. Some of these data are high dimensional. Others, such as survey preference, network, and tree data, cannot be characterized easily with standard models on Euclidean spaces. This dissertation investigates two classic statistical problems in this new setting: change-point detection and two-sample comparison of categorical data.

Change-point models are widely used in various fields for detecting lack of homogeneity in a sequence of observations. In many applications, the dimension of the observations in the sequence can be very high, even much larger than the length of the sequence. Testing the homogeneity of such sequences is a challenging but important problem, and existing approaches are limited in many ways. We propose a new non-parametric approach that can be applied to high-dimensional data, and even to non-Euclidean object data, as long as an informative similarity measure on the sample space can be defined. The approach adapts graph-based two-sample tests to the scan-statistic setting. Graph-based two-sample tests are tests based on graphs connecting observations by similarity [Friedman and Rafsky, 1979; Rosenbaum, 2005]. We show that this new approach is powerful in high dimensions compared to parametric approaches. We also derive accurate analytic $p$-value approximations for very general situations, which lead to easy off-the-shelf homogeneity testing for large multivariate data sets. The approach has been applied to two data sets: determining the authorship of a classic novel, and detecting change in a social network over time.

Two-sample comparison of categorical data is a classic problem in statistics. In many modern applications, the number of categories can be quite large, even comparable to the sample size, causing existing methods to have low power. When the number of categories is large, there is often underlying structure on the sample space that can be exploited. We propose a general non-parametric approach that makes use of similarity information on the space of categories in two-sample tests. Our approach addresses a shortcoming of existing graph-based two-sample tests by no longer requiring uniqueness of the underlying graph, thus allowing ties in the distance matrix defining the graph. We identify two types of statistics that are both powerful and fast to compute. We show that their permutation null distributions are asymptotically normal and that their $p$-value approximations under typical settings are quite accurate, facilitating the application of this approach.
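
The graph-based two-sample idea mentioned in the abstract can be illustrated with a minimal sketch in the spirit of the Friedman and Rafsky edge-count test: pool the two samples, build a minimum spanning tree on pairwise distances, and count the edges that join observations from different samples. Unusually few such edges suggest the samples differ. This is not the dissertation's scan statistic, and it uses a plain permutation $p$-value rather than the analytic approximations the abstract describes. The function names (`minimum_spanning_tree`, `between_sample_edges`, `edge_count_test`), the Euclidean distance, and the default number of permutations are all assumptions made for illustration.

```python
import math
import random


def minimum_spanning_tree(points):
    """Return the MST edges (i, j) over the points, using Prim's algorithm
    with Euclidean distance as the (assumed) dissimilarity measure."""
    n = len(points)
    in_tree = [False] * n
    best_dist = [math.inf] * n
    best_from = [-1] * n
    best_dist[0] = 0.0
    edges = []
    for _ in range(n):
        # Pick the vertex outside the tree that is closest to the tree.
        u = min((i for i in range(n) if not in_tree[i]),
                key=lambda i: best_dist[i])
        in_tree[u] = True
        if best_from[u] >= 0:
            edges.append((best_from[u], u))
        # Update the cheapest known connection for the remaining vertices.
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best_dist[v]:
                    best_dist[v] = d
                    best_from[v] = u
    return edges


def between_sample_edges(edges, labels):
    """Count edges whose endpoints carry different sample labels."""
    return sum(1 for i, j in edges if labels[i] != labels[j])


def edge_count_test(sample_a, sample_b, n_perm=999, seed=0):
    """Permutation test: a small between-sample edge count is evidence
    that the two samples come from different distributions."""
    rng = random.Random(seed)
    points = list(sample_a) + list(sample_b)
    labels = [0] * len(sample_a) + [1] * len(sample_b)
    edges = minimum_spanning_tree(points)  # the graph is fixed under permutation
    observed = between_sample_edges(edges, labels)
    hits = 0
    for _ in range(n_perm):
        permuted = labels[:]
        rng.shuffle(permuted)
        if between_sample_edges(edges, permuted) <= observed:
            hits += 1
    p_value = (hits + 1) / (n_perm + 1)
    return observed, p_value
```

Because the tree depends only on the pooled observations, it is built once and only the labels are permuted. Any dissimilarity could replace the Euclidean distance here, which is what lets the same construction be applied to non-Euclidean data when a similarity measure is available.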

Description

Type of resource text
Form electronic; electronic resource; remote
Extent 1 online resource.
Publication date 2013
Issuance monographic
Language English

Creators/Contributors

Associated with Chen, Hao, 1984-
Associated with Stanford University, Department of Statistics.
Primary advisor Siegmund, David, 1941-
Primary advisor Zhang, Nancy R. (Nancy Ruonan)
Primary advisor Friedman, J. H. (Jerome H.)
Thesis advisor Siegmund, David, 1941-
Thesis advisor Zhang, Nancy R. (Nancy Ruonan)
Thesis advisor Friedman, J. H. (Jerome H.)
Advisor Friedman, J. H. (Jerome H.)

Subjects

Genre Theses

Bibliographic information

Statement of responsibility Hao Chen.
Note Submitted to the Department of Statistics.
Thesis Thesis (Ph.D.)--Stanford University, 2013.
Location electronic resource

Access conditions

Copyright
© 2013 by Hao Chen
License
This work is licensed under a Creative Commons Attribution Non Commercial 3.0 Unported license (CC BY-NC).
