Randomized linear algebra for large-scale data applications

Yang, Jiyan; Stanford University, Institute for Computational and Mathematical Engineering.

Randomized linear algebra for large-scale data applications

<a href="https://embed.stanford.edu/iframe/?url=https%3A%2F%2Fpurl.stanford.edu%2Fwr092fb7484" class="su-underline">Show Content</a>

Abstract/Contents

Abstract: As the ability to generate datasets with billions of records in relatively automated ways continues to grow, new challenges are posed to modern large-scale data analysis from several points of view such as scalability, feasibility, and interpretability. Thus, improved algorithms on large-scale data platforms are of interest. Recent years have seen great interest in Randomized Linear Algebra (RLA) algorithms. RLA is an interdisciplinary research area that exploits randomization as a computational resource for the development of improved algorithms for common matrix problems such as least-squares approximation, low-rank matrix approximation, and Laplacian-based linear equation solvers. In this thesis, our focus is on the underlying theory and practical implementation of RLA algorithms. In particular: Chapter 3 describes a novel sampling algorithm for large-scale over-determined quantile regression problems whose running time is roughly proportional to the number of non-zero elements in the matrix plus a term that depends on the low dimension only. Chapter 4 describes a hybrid algorithm named pwSGD --- precondition weighted stochastic gradient descent that combines RLA and SGD. We prove that pwSGD inherits faster convergence rates that only depend on the lower dimension of the linear system, while maintaining low computation complexity. Chapter 5 describes a class of randomized Newton-type algorithms that exploit non-uniform sub-sampling as well as inexact updates for a class of constrained optimization problems. We show that our methods are more robust to ill-conditioned problems than other similar previous approaches. Chapter 6 presents results of implementing RLA algorithms for least-squares problems, quantile regression, and CX decomposition problems in parallel and distributed environments using Apache Spark, and discuss various tradeoffs in the implementations. In particular, we demonstrate that least-squares problems with up to terabyte-sized data can be solved to low, medium, or high precision on existing distributed systems. Chapter 7 demonstrates that applying CX/CUR decomposition to large-scale Mass Spectrometry Imaging (MSI) datasets returns interpretable results as it successfully identifies important ions and locations in complex biological samples. All in all, we show that RLA is a powerful tool for deriving scalable algorithms for large-scale regression and optimization problems, and we demonstrate that RLA algorithms are amenable to distributed computing platforms and are useful in scientific applications.

Description

Type of resource	text
Form	electronic; electronic resource; remote
Extent	1 online resource.
Publication date	2016
Issuance	monographic
Language	English

Creators/Contributors

Associated with	Yang, Jiyan
Associated with	Stanford University, Institute for Computational and Mathematical Engineering.
Primary advisor	Saunders, Michael
Thesis advisor	Saunders, Michael
Thesis advisor	Mahoney, Michael
Thesis advisor	Ré, Christopher
Advisor	Mahoney, Michael
Advisor	Ré, Christopher

Subjects

Genre	Theses

Bibliographic information

Statement of responsibility	Jiyan Yang.
Note	Submitted to the Institute for Computational and Mathematical Engineering.
Thesis	Thesis (Ph.D.)--Stanford University, 2016.
Location	electronic resource

Access conditions

License: This work is licensed under a Creative Commons Attribution Non Commercial 3.0 Unported license (CC BY-NC).

Also listed in

View in SearchWorks

Loading usage metrics...