Large-scale and high-dimensional statistical learning methods and algorithms

Qian, Junyang

Abstract/Contents

Abstract: In the past two decades, many areas such as genomics, neuroscience, economics and Internet services have been producing increasingly large datasets that have high dimension, large sample size, or both. This provides unprecedented opportunities to retrieve and infer valuable information from the data. Meanwhile, it also poses new challenges for statistical methodologies and computational algorithms. On the one hand, we want to formulate a reasonable model to capture the desired structures and improve the quality of statistical estimation and inference. On the other hand, in the face of increasingly large datasets, computation can be a major hurdle to arriving at meaningful conclusions. This thesis stands at the intersection of the two topics, proposing statistical methods to capture desired structures in the data, and seeking scalable approaches to optimizing the computation for very large datasets. We propose a scalable and flexible framework for solving large-scale sparse regression problems with the lasso/elastic-net, and a scalable framework for solving sparse reduced rank regression in the presence of multiple correlated responses and other nuances such as missing values. Optimized implementations for genomics data in the PLINK 2.0 format are provided in the R packages snpnet and multiSnpnet, respectively. The two methods are demonstrated on the very large and ultrahigh-dimensional UK Biobank studies and show significant improvement over traditional predictive modeling methods. In addition, we consider a different class of high-dimensional problems, heterogeneous causal effect estimation. Unlike in supervised learning, the main challenge of such problems is that in the historical data we never observe the other side of the coin, so we have no access to the ground truth of the difference among treatments. We propose adaptations of nonparametric statistical learning methods, in particular gradient boosting and multivariate adaptive regression splines, to the estimation of treatment effects based on the available predictors. The implementation is provided in the R package causalLearning.

Description

Type of resource	text
Form	electronic resource; remote; computer; online resource
Extent	1 online resource
Place	California
Place	[Stanford, California]
Publisher	[Stanford University]
Copyright date	2020; ©2020
Publication date	2020; 2020
Issuance	monographic
Language	English

Creators/Contributors

Author	Qian, Junyang
Degree supervisor	Hastie, Trevor
Thesis advisor	Hastie, Trevor
Thesis advisor	Rivas, Manuel
Thesis advisor	Tibshirani, Robert
Degree committee member	Rivas, Manuel
Degree committee member	Tibshirani, Robert
Associated with	Stanford University, Department of Statistics.

Subjects

Genre	Theses
Genre	Text

Bibliographic information

Statement of responsibility	Junyang Qian
Note	Submitted to the Department of Statistics
Thesis	Thesis Ph.D. Stanford University 2020
Location	electronic resource

Access conditions

License: This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported license (CC BY-NC).
