Large-scale and high-dimensional statistical learning methods and algorithms


Abstract/Contents

Abstract
In the past two decades, many areas such as genomics, neuroscience, economics and Internet services have been producing increasingly large datasets that have high dimension, large sample size, or both. This provides unprecedented opportunities for us to retrieve and infer valuable information from the data. Meanwhile, it also poses new challenges for statistical methodologies and computational algorithms. On the one hand, we want to formulate a reasonable model to capture the desired structures and improve the quality of statistical estimation and inference. On the other hand, in the face of increasingly large datasets, computation can be a big hurdle to arriving at meaningful conclusions. This thesis stands at the intersection of the two topics, proposing statistical methods to capture desired structures in the data, and seeking scalable approaches to optimizing the computation for very large datasets. We propose a scalable and flexible framework for solving large-scale sparse regression problems with the lasso/elastic-net, and a scalable framework for solving sparse reduced rank regression in the presence of multiple correlated responses and other nuances such as missing values. Optimized implementations are developed for genomics data in the PLINK 2.0 format in the R packages snpnet and multiSnpnet, respectively. The two methods are demonstrated on the very large and ultrahigh-dimensional UK Biobank studies and show significant improvement over traditional predictive modeling methods. In addition, we consider a different class of high-dimensional problems, heterogeneous causal effect estimation. Unlike the setting of supervised learning, the main challenge of such problems is that in the historical data, we never observe the other side of the coin, so we have no access to the ground truth of the difference among treatments. We propose adaptations of nonparametric statistical learning methods, in particular gradient boosting and multivariate adaptive regression splines, to the estimation of treatment effects based on the predictors available. The implementation is packaged in the R package causalLearning.

Description

Type of resource text
Form electronic resource; remote; computer; online resource
Extent 1 online resource
Place California
Place [Stanford, California]
Publisher [Stanford University]
Copyright date 2020; ©2020
Publication date 2020
Issuance monographic
Language English

Creators/Contributors

Author Qian, Junyang
Degree supervisor Hastie, Trevor
Thesis advisor Hastie, Trevor
Thesis advisor Rivas, Manuel
Thesis advisor Tibshirani, Robert
Degree committee member Rivas, Manuel
Degree committee member Tibshirani, Robert
Associated with Stanford University, Department of Statistics.

Subjects

Genre Theses
Genre Text

Bibliographic information

Statement of responsibility Junyang Qian
Note Submitted to the Department of Statistics
Thesis Thesis Ph.D. Stanford University 2020
Location electronic resource

Access conditions

Copyright
© 2020 by Junyang Qian
License
This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported license (CC BY-NC 3.0).

