Robust dimensionality reduction for data visualization and latent structure recovery

Nguyễn, Lan Hương

Abstract/Contents

Abstract: High dimensionality is one of the major challenges in the analysis of modern data sets, as it is now common to have hundreds or even millions of simultaneous measurements collected for a single sample. Visualization of the data becomes difficult if not impossible, while standard statistical methods lose power due to the curse of dimensionality. Even if a large volume of data is available, exploring a high dimensional space exhaustively is computationally impractical. Low-dimensional data representations that remove noise but retain the signal of interest can be instrumental in detecting hidden structures and patterns. Our work focuses on improving current methods to perform dimensionality reduction (DR) and interpret its output.
Many datasets are governed by a continuous process, which is often unknown. Estimating data points' natural ordering and their corresponding uncertainties often sheds light on these underlying mechanisms. We develop a Bayesian Unidimensional Scaling (BUDS) technique which extracts a dominant source of variation in high dimensional datasets and produces a visual data summary, facilitating the exploration of a hidden continuum. The method maps multivariate data points to latent one-dimensional coordinates along their inherent trajectory, and provides uncertainty bounds estimated using a Bayesian posterior distribution.
We then turn our attention to DR techniques for data visualization. In particular, we study the behavior of t-Stochastic Neighbor Embedding (t-SNE), a technique broadly adopted for visualizing high-dimensional datasets. We show why t-SNE is usually unable to recover large-scale structures. We then propose a new embedding method, Diffusion t-SNE, which introduces a time-step parameter that can generate a multi-view representation of the data, recovering its geometry at different scales.
We also provide mathematical explanations for why the entropy equalization procedure used in t-SNE results in a loss of information about local variances, leading to data distortions that produce misleading representations with uninformative relative sizes and unidentifiable input data sampling densities and variances. Building upon this analysis, we present a scaling scheme of the pairwise proximities that achieves accurate representations of regional data variances.
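The entropy equalization step mentioned in the abstract can be illustrated with a minimal sketch of the standard t-SNE perplexity calibration. This is not code from the thesis and it does not implement the proposed scaling scheme; the function names (`conditional_probs`, `entropy_bits`, `calibrate_beta`) and the toy data are assumptions made for illustration. For each point, a Gaussian precision `beta` is tuned by bisection until the neighbor distribution reaches a fixed perplexity. Because the result depends only on the product of `beta` and the squared distances, multiplying every distance by a constant yields the same probabilities, so the local scale of the data is no longer visible to the embedding.

```python
import numpy as np

def conditional_probs(sq_dists, beta):
    """Gaussian neighbor probabilities for one point; beta = 1 / (2 * sigma**2)."""
    logits = -beta * sq_dists
    logits -= logits.max()  # shift for numerical stability; does not change the result
    p = np.exp(logits)
    return p / p.sum()

def entropy_bits(p):
    """Shannon entropy in bits, ignoring zero-probability entries."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def calibrate_beta(sq_dists, perplexity, tol=1e-5, max_iter=200):
    """Bisection on beta so that 2**entropy matches the requested perplexity."""
    target = np.log2(perplexity)
    lo, hi = 0.0, np.inf
    beta = 1.0
    for _ in range(max_iter):
        h = entropy_bits(conditional_probs(sq_dists, beta))
        if abs(h - target) < tol:
            break
        if h > target:
            # Distribution too flat: increase the precision.
            lo = beta
            beta = beta * 2 if np.isinf(hi) else (lo + hi) / 2
        else:
            # Distribution too peaked: decrease the precision.
            hi = beta
            beta = (lo + hi) / 2
    return beta

# Toy data (an assumption for illustration): squared distances from one point
# to 49 others, and the same neighborhood stretched by a factor of 10.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
d_small = np.sum((X[1:] - X[0]) ** 2, axis=1)
d_big = 100.0 * d_small  # distances x10, squared distances x100

beta_small = calibrate_beta(d_small, perplexity=10)
beta_big = calibrate_beta(d_big, perplexity=10)
p_small = conditional_probs(d_small, beta_small)
p_big = conditional_probs(d_big, beta_big)

print("precision ratio:", beta_small / beta_big)
print("max probability difference:", np.abs(p_small - p_big).max())
```

In this sketch the two neighborhoods differ in spread by a factor of ten, yet the calibrated probabilities passed on to the embedding step coincide up to the bisection tolerance. This is one way to see the loss of variance information that the abstract describes.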

Description

Type of resource	text
Form	electronic resource; remote; computer; online resource
Extent	1 online resource.
Place	California
Place	[Stanford, California]
Publisher	[Stanford University]
Copyright date	2019; ©2019
Publication date	2019; 2019
Issuance	monographic
Language	English

Creators/Contributors

Author	Nguyễn, Lan Hương
Degree supervisor	Holmes, Susan, 1954-
Thesis advisor	Holmes, Susan, 1954-
Thesis advisor	Kitanidis, P. K. (Peter K.)
Thesis advisor	Kundaje, Anshul, 1980-
Degree committee member	Kitanidis, P. K. (Peter K.)
Degree committee member	Kundaje, Anshul, 1980-
Associated with	Stanford University, Institute for Computational and Mathematical Engineering.

Subjects

Genre	Theses
Genre	Text

Bibliographic information

Statement of responsibility	Lan Huong Nguyen.
Note	Submitted to the Institute for Computational and Mathematical Engineering.
Thesis	Thesis Ph.D. Stanford University 2019.
Location	electronic resource

Access conditions

License: This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported license (CC BY-NC).
