Topics in statistical learning with a focus on large-scale data


Abstract/Contents

Abstract
The spread of modern information technologies to all spheres of society has led to a dramatic increase in data flow, giving rise to the "big data" phenomenon. Big data are data whose volume, intensity, and complexity exceed the capacity of traditional statistical methods and standard tools. When the data become extremely large, computations may take too long to run, and it may even be infeasible to store all of the data on a single computer. It is therefore necessary to turn to distributed architectures and scalable statistical methods. Big data vary in shape and call for different approaches.

One type of big data is tall data: a very large number of samples but not too many features. Chapter 1 describes a general communication-efficient algorithm for distributed statistical learning on this type of data. Our algorithm distributes the samples uniformly across multiple machines and uses a common reference dataset to improve the performance of the local estimates. It enables potentially much faster analysis, at a small cost to statistical performance.

Another type of big data is wide data: too many features but a limited number of samples. Such data are also called high-dimensional data, to which many classical statistical methods are not applicable. Chapter 2 discusses a method of dimensionality reduction for high-dimensional classification. Our method partitions the features into independent communities and splits the original classification problem into separate, smaller ones. It enables parallel computing and produces more interpretable results.

For unsupervised learning methods such as principal component analysis and clustering, the key challenges are choosing the optimal tuning parameter and evaluating method performance. Chapter 3 proposes a general cross-validation approach for unsupervised learning methods. The approach randomly partitions the data matrix into K unstructured folds; for each fold, it fits a matrix completion algorithm to the remaining K − 1 folds and evaluates the predictions on the held-out fold. Our approach provides a unified framework for parameter tuning in unsupervised learning and shows strong performance in practice.
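As a rough illustration of the Chapter 3 procedure, here is a minimal sketch in Python. It follows the abstract's description (entries of the data matrix randomly partitioned into K unstructured folds, a matrix completion algorithm fit to the remaining folds, predictions scored on the held-out fold), but the hard-impute-style completion step, the squared-error criterion, and all function names are illustrative assumptions, not the thesis's actual algorithm.

    import numpy as np

    def cv_choose_rank(X, ranks, K=5, n_iter=100, seed=0):
        # Assign each entry of X to one of K unstructured folds at random.
        rng = np.random.default_rng(seed)
        folds = rng.integers(0, K, size=X.shape)
        errors = {r: 0.0 for r in ranks}
        for k in range(K):
            observed = folds != k              # entries in the other K-1 folds
            for r in ranks:
                X_hat = hard_impute(X, observed, rank=r, n_iter=n_iter)
                # Score predictions on the held-out fold only.
                errors[r] += np.sum((X - X_hat)[~observed] ** 2)
        return min(errors, key=errors.get)     # tuning value with lowest CV error

    def hard_impute(X, observed, rank, n_iter):
        # Illustrative stand-in for "a matrix completion algorithm": alternate
        # between imputing the held-out entries and taking a rank-r SVD.
        Z = np.where(observed, X, X[observed].mean())
        for _ in range(n_iter):
            U, s, Vt = np.linalg.svd(Z, full_matrices=False)
            low_rank = (U[:, :rank] * s[:rank]) @ Vt[:rank]
            Z = np.where(observed, X, low_rank)  # keep observed entries at their values
        return low_rank

For example, cv_choose_rank(X, ranks=range(1, 11)) would select the rank of a principal component approximation by held-out prediction error, the kind of tuning-parameter choice the abstract describes.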

Description

Type of resource text
Form electronic resource; remote; computer; online resource
Extent 1 online resource.
Place [Stanford, California]
Publisher [Stanford University]
Copyright date ©2018
Publication date 2018
Issuance monographic
Language English

Creators/Contributors

Author Le, Ya
Degree supervisor Hastie, Trevor
Thesis advisor Hastie, Trevor
Thesis advisor Efron, Bradley
Thesis advisor Taylor, Jonathan E
Degree committee member Efron, Bradley
Degree committee member Taylor, Jonathan E
Associated with Stanford University, Department of Statistics.

Subjects

Genre Theses
Genre Text

Bibliographic information

Statement of responsibility Ya Le.
Note Submitted to the Department of Statistics.
Thesis Thesis (Ph.D.)--Stanford University, 2018.
Location electronic resource

Access conditions

Copyright
© 2018 by Ya Le
License
This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported license (CC BY-NC 3.0).
