Topics in statistical learning with a focus on large-scale data


Abstract/Contents

Abstract
The spread of modern information technologies to all spheres of society has led to a dramatic increase in data flow, giving rise to the "big data" phenomenon. Big data are data whose volume, intensity, and complexity exceed the capacity of traditional statistical methods and standard tools. When the data become extremely large, computations may take too long to run, and it may even be infeasible to store all of the data on a single computer. It is therefore necessary to turn to distributed architectures and scalable statistical methods. Big data vary in shape and call for different approaches.

One type of big data is tall data: a very large number of samples but not too many features. Chapter 1 describes a general communication-efficient algorithm for distributed statistical learning on this type of data. Our algorithm distributes the samples uniformly across multiple machines and uses a common reference dataset to improve the performance of the local estimates. It enables potentially much faster analysis, at a small cost to statistical performance.

Another type of big data is wide data: too many features but a limited number of samples. Such data are also called high-dimensional data, to which many classical statistical methods are not applicable. Chapter 2 discusses a method of dimensionality reduction for high-dimensional classification. Our method partitions the features into independent communities and splits the original classification problem into separate, smaller ones. It enables parallel computing and produces more interpretable results.

For unsupervised learning methods such as principal component analysis and clustering, the key challenges are choosing the optimal tuning parameter and evaluating method performance. Chapter 3 proposes a general cross-validation approach for unsupervised learning methods. The approach randomly partitions the data matrix into K unstructured folds; for each fold, it fits a matrix completion algorithm to the remaining K − 1 folds and evaluates the predictions on the held-out fold. Our approach provides a unified framework for parameter tuning in unsupervised learning and shows strong performance in practice.
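As a rough illustration of the Chapter 3 procedure, here is a minimal sketch in Python. It follows the abstract's description (entries of the data matrix randomly partitioned into K unstructured folds, a matrix completion algorithm fit to the remaining folds, predictions scored on the held-out fold), but the hard-impute-style completion step, the squared-error criterion, and all function names are illustrative assumptions, not the thesis's actual algorithm.

    import numpy as np

    def cv_choose_rank(X, ranks, K=5, n_iter=100, seed=0):
        # Assign each entry of X to one of K unstructured folds at random.
        rng = np.random.default_rng(seed)
        folds = rng.integers(0, K, size=X.shape)
        errors = {r: 0.0 for r in ranks}
        for k in range(K):
            observed = folds != k              # entries in the other K-1 folds
            for r in ranks:
                X_hat = hard_impute(X, observed, rank=r, n_iter=n_iter)
                # Score predictions on the held-out fold only.
                errors[r] += np.sum((X - X_hat)[~observed] ** 2)
        return min(errors, key=errors.get)     # tuning value with lowest CV error

    def hard_impute(X, observed, rank, n_iter):
        # Illustrative stand-in for "a matrix completion algorithm": alternate
        # between imputing the held-out entries and taking a rank-r SVD.
        Z = np.where(observed, X, X[observed].mean())
        for _ in range(n_iter):
            U, s, Vt = np.linalg.svd(Z, full_matrices=False)
            low_rank = (U[:, :rank] * s[:rank]) @ Vt[:rank]
            Z = np.where(observed, X, low_rank)  # keep observed entries at their values
        return low_rank

For example, cv_choose_rank(X, ranks=range(1, 11)) would select the rank of a principal component approximation by held-out prediction error, the kind of tuning-parameter choice the abstract describes.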

Description

Type of resource text
Form electronic resource; remote; computer; online resource
Extent 1 online resource.
Place [Stanford, California]
Publisher [Stanford University]
Copyright date ©2018
Publication date 2018
Issuance monographic
Language English

Creators/Contributors

Author Le, Ya
Degree supervisor Hastie, Trevor
Thesis advisor Hastie, Trevor
Thesis advisor Efron, Bradley
Thesis advisor Taylor, Jonathan E
Degree committee member Efron, Bradley
Degree committee member Taylor, Jonathan E
Associated with Stanford University, Department of Statistics.

Subjects

Genre Theses
Genre Text

Bibliographic information

Statement of responsibility Ya Le.
Note Submitted to the Department of Statistics.
Thesis Thesis (Ph.D.)--Stanford University, 2018.
Location electronic resource

Access conditions

Copyright
© 2018 by Ya Le
License
This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported license (CC BY-NC 3.0).
