Robust dimensionality reduction for data visualization and latent structure recovery
Abstract/Contents
- Abstract
- High dimensionality is one of the major challenges in the analysis of modern data sets, as it is now common to have hundreds or even millions of simultaneous measurements collected for a single sample. Visualization of the data becomes difficult if not impossible, while standard statistical methods lose power due to the curse of dimensionality. Even if a large volume of data is available, exploring a high dimensional space exhaustively is computationally impractical. Low-dimensional data representations that remove noise but retain the signal of interest can be instrumental in detecting hidden structures and patterns. Our work focuses on improving current methods to perform dimensionality reduction (DR) and interpret its output. Many datasets are governed by a continuous process, which is often unknown. Estimating data points' natural ordering and their corresponding uncertainties often sheds light on these underlying mechanisms. We develop a Bayesian Unidimensional Scaling (BUDS) technique which extracts a dominant source of variation in high dimensional datasets and produces a visual data summary, facilitating the exploration of a hidden continuum. The method maps multivariate data points to latent one-dimensional coordinates along their inherent trajectory, and provides uncertainty bounds estimated using a Bayesian posterior distribution. We then turn our attention to DR techniques for data visualization. In particular, we study the behavior of t-Stochastic Neighbor Embedding (t-SNE), a technique broadly adopted for visualizing high-dimensional datasets. We show why t-SNE is usually unable to recover large-scale structures. We then propose a new embedding method, Diffusion t-SNE, which introduces a time-step parameter that can generate a multi-view representation of the data, recovering its geometry at different scales. 
We also provide mathematical explanations for why the entropy equalization procedure used in t-SNE results in a loss of information about local variances, leading to data distortions that produce misleading representations with uninformative relative sizes and unidentifiable input data sampling densities and variances. Building upon this analysis, we present a scaling scheme of the pairwise proximities that achieves accurate representations of regional data variances.
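The entropy equalization mentioned in the abstract can be illustrated with a minimal sketch of the standard t-SNE perplexity calibration, in which each point's Gaussian bandwidth is tuned until its conditional neighbor distribution reaches a fixed entropy. This is not code from the thesis: the function names (`conditional_probs`, `row_sq_dists`), the toy data, and the tolerance values are assumptions made for illustration. The sketch shows that a point cloud and a copy of it stretched by a factor of 10 produce essentially the same conditional probabilities, so the difference in spread is absorbed into the bandwidth and does not reach the embedding.

```python
import numpy as np

def conditional_probs(sq_dists, perplexity, tol=1e-6, max_iter=200):
    """Calibrate p_{j|i} for one point so its entropy equals log(perplexity).

    sq_dists: squared distances from point i to every other point.
    Returns the probabilities and the precision beta = 1 / (2 * sigma_i**2).
    """
    target = np.log(perplexity)        # target Shannon entropy, in nats
    lo, hi = 0.0, np.inf               # bracket for beta
    beta = 1.0
    d = sq_dists - sq_dists.min()      # shift for stability; p is unchanged
    for _ in range(max_iter):
        w = np.exp(-beta * d)
        p = w / w.sum()
        entropy = -np.sum(p * np.log(p + 1e-300))
        if abs(entropy - target) < tol:
            break
        if entropy > target:           # distribution too flat: raise beta
            lo = beta
            beta = beta * 2.0 if np.isinf(hi) else (lo + hi) / 2.0
        else:                          # distribution too peaked: lower beta
            hi = beta
            beta = (lo + hi) / 2.0
    return p, beta

def row_sq_dists(X, i):
    """Squared Euclidean distances from row i of X to all other rows."""
    d = np.sum((X - X[i]) ** 2, axis=1)
    return np.delete(d, i)

# Toy data (hypothetical): one cloud and the same cloud stretched 10x.
rng = np.random.default_rng(0)
tight = rng.normal(scale=1.0, size=(50, 5))
wide = 10.0 * tight

p_tight, beta_tight = conditional_probs(row_sq_dists(tight, 0), perplexity=10)
p_wide, beta_wide = conditional_probs(row_sq_dists(wide, 0), perplexity=10)

# After calibration the two neighbor distributions nearly coincide;
# only the bandwidths differ, and those are not passed to the embedding.
print("max |p_tight - p_wide| =", np.max(np.abs(p_tight - p_wide)))
print("beta_tight / beta_wide =", beta_tight / beta_wide)
```

Because squared distances scale by 100 when the data are stretched by 10, the calibrated precision shrinks by roughly the same factor and the probabilities stay put. This is one simple way to see why relative cluster sizes in a t-SNE plot need not reflect input variances.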
Description
Type of resource | text |
---|---|
Form | electronic resource; remote; computer; online resource |
Extent | 1 online resource. |
Place | California |
Place | [Stanford, California] |
Publisher | [Stanford University] |
Copyright date | 2019; ©2019 |
Publication date | 2019 |
Issuance | monographic |
Language | English |
Creators/Contributors
Author | Nguyễn, Lan Hương |
---|---|
Degree supervisor | Holmes, Susan, 1954- |
Thesis advisor | Holmes, Susan, 1954- |
Thesis advisor | Kitanidis, P. K. (Peter K.) |
Thesis advisor | Kundaje, Anshul, 1980- |
Degree committee member | Kitanidis, P. K. (Peter K.) |
Degree committee member | Kundaje, Anshul, 1980- |
Associated with | Stanford University, Institute for Computational and Mathematical Engineering. |
Subjects
Genre | Theses |
---|---|
Genre | Text |
Bibliographic information
Statement of responsibility | Lan Huong Nguyen. |
---|---|
Note | Submitted to the Institute for Computational and Mathematical Engineering. |
Thesis | Thesis Ph.D. Stanford University 2019. |
Location | electronic resource |
Access conditions
- Copyright
- © 2019 by Lan Huong Nguyen
- License
- This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported license (CC BY-NC).