Robust dimensionality reduction for data visualization and latent structure recovery

Abstract/Contents

Abstract
High dimensionality is one of the major challenges in the analysis of modern data sets, as it is now common to have hundreds or even millions of simultaneous measurements collected for a single sample. Visualization of the data becomes difficult if not impossible, while standard statistical methods lose power due to the curse of dimensionality. Even if a large volume of data is available, exploring a high dimensional space exhaustively is computationally impractical. Low-dimensional data representations that remove noise but retain the signal of interest can be instrumental in detecting hidden structures and patterns. Our work focuses on improving current methods to perform dimensionality reduction (DR) and interpret its output. Many datasets are governed by a continuous process, which is often unknown. Estimating data points' natural ordering and their corresponding uncertainties often sheds light on these underlying mechanisms. We develop a Bayesian Unidimensional Scaling (BUDS) technique which extracts a dominant source of variation in high dimensional datasets and produces a visual data summary, facilitating the exploration of a hidden continuum. The method maps multivariate data points to latent one-dimensional coordinates along their inherent trajectory, and provides uncertainty bounds estimated using a Bayesian posterior distribution. We then turn our attention to DR techniques for data visualization. In particular, we study the behavior of t-Stochastic Neighbor Embedding (t-SNE), a technique broadly adopted for visualizing high-dimensional datasets. We show why t-SNE is usually unable to recover large-scale structures. We then propose a new embedding method, Diffusion t-SNE, which introduces a time-step parameter that can generate a multi-view representation of the data, recovering its geometry at different scales. 
We also explain mathematically why the entropy equalization procedure used in t-SNE discards information about local variances. The resulting distortions produce misleading representations in which relative sizes are uninformative and the sampling densities and variances of the input data cannot be identified. Building upon this analysis, we present a scaling scheme for the pairwise proximities that achieves accurate representations of regional data variances.
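The loss of local variance information mentioned in the abstract can be illustrated with a minimal sketch of the standard t-SNE perplexity calibration. This is not the thesis's proposed scaling scheme, and the helper names (`conditional_probs`, `entropy_bits`, `calibrate_beta`) are hypothetical. For each point, the Gaussian precision is tuned until the conditional neighbor distribution reaches a fixed entropy, so a point cloud and a copy of it scaled up 100 times yield the same probabilities.

```python
import numpy as np

def conditional_probs(sq_dists, beta):
    """Gaussian neighbor probabilities p_{j|i} for one point, with beta = 1 / (2 sigma^2)."""
    # Subtracting the minimum keeps the largest term at exp(0) and avoids 0/0.
    p = np.exp(-beta * (sq_dists - sq_dists.min()))
    return p / p.sum()

def entropy_bits(p):
    """Shannon entropy in bits, ignoring zero-probability entries."""
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def calibrate_beta(sq_dists, perplexity, n_iter=200):
    """Geometric bisection for the beta whose entropy equals log2(perplexity)."""
    target = np.log2(perplexity)
    lo, hi = 1e-12, 1e12  # assumed search bracket for this illustration
    for _ in range(n_iter):
        beta = np.sqrt(lo * hi)
        if entropy_bits(conditional_probs(sq_dists, beta)) > target:
            lo = beta  # distribution too flat: sharpen it
        else:
            hi = beta  # distribution too peaked: flatten it
    return np.sqrt(lo * hi)

# Squared distances from one point to its 49 neighbors in a synthetic cloud,
# and the same cloud with every coordinate multiplied by 100.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
d_tight = ((X[1:] - X[0]) ** 2).sum(axis=1)
d_wide = 100.0 ** 2 * d_tight

beta_tight = calibrate_beta(d_tight, perplexity=10)
beta_wide = calibrate_beta(d_wide, perplexity=10)
p_tight = conditional_probs(d_tight, beta_tight)
p_wide = conditional_probs(d_wide, beta_wide)

# The calibrated bandwidth absorbs the scale, so the two distributions coincide.
print(np.allclose(p_tight, p_wide, atol=1e-6))
```

Because the scale is absorbed into the per-point bandwidth, the input to the embedding step carries no record of how spread out each region was, which is consistent with the distortion the abstract describes.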

Description

Type of resource text
Form electronic resource; remote; computer; online resource
Extent 1 online resource.
Place California
Place [Stanford, California]
Publisher [Stanford University]
Copyright date 2019; ©2019
Publication date 2019; 2019
Issuance monographic
Language English

Creators/Contributors

Author Nguyễn, Lan Hương
Degree supervisor Holmes, Susan, 1954-
Thesis advisor Holmes, Susan, 1954-
Thesis advisor Kitanidis, P. K. (Peter K.)
Thesis advisor Kundaje, Anshul, 1980-
Degree committee member Kitanidis, P. K. (Peter K.)
Degree committee member Kundaje, Anshul, 1980-
Associated with Stanford University, Institute for Computational and Mathematical Engineering.

Subjects

Genre Theses
Genre Text

Bibliographic information

Statement of responsibility Lan Huong Nguyen.
Note Submitted to the Institute for Computational and Mathematical Engineering.
Thesis Thesis Ph.D. Stanford University 2019.
Location electronic resource

Access conditions

Copyright
© 2019 by Lan Huong Nguyen
License
This work is licensed under a Creative Commons Attribution Non Commercial 3.0 Unported license (CC BY-NC).
