Large-scale and high-dimensional statistical learning methods and algorithms
- In the past two decades, many areas such as genomics, neuroscience, economics and Internet services have been producing increasingly big datasets that have high dimension, large sample size, or both. This provides unprecedented opportunities for us to retrieve and infer valuable information from the data. Meanwhile, it also poses new challenges for statistical methodologies and computational algorithms. On the one hand, we want to formulate a reasonable model to capture the desired structures and improve the quality of statistical estimation and inference. On the other hand, in the face of increasingly large datasets, computation can be a big hurdle to arriving at meaningful conclusions. This thesis stands at the intersection of the two topics, proposing statistical methods to capture desired structures in the data, and seeking scalable approaches to optimizing the computation for very large datasets. We propose a scalable and flexible framework for solving large-scale sparse regression problems with the lasso/elastic-net, and a scalable framework for solving sparse reduced rank regression in the presence of multiple correlated responses and other nuances such as missing values. Optimized implementations are developed for genomics data in the PLINK 2.0 format in the R packages snpnet and multiSnpnet, respectively. The two methods are demonstrated on the very large and ultrahigh-dimensional UK Biobank studies and show significant improvement over traditional predictive modeling methods. In addition, we consider a different class of high-dimensional problems, heterogeneous causal effect estimation. Unlike the setting of supervised learning, the main challenge of such problems is that in the historical data, we never observe the other side of the coin, so we have no access to the ground truth for the difference among treatments.
We propose adaptations of nonparametric statistical learning methods, in particular gradient boosting and multivariate adaptive regression splines, to the estimation of treatment effects based on the available predictors. The implementation is provided in the R package causalLearning.
|Type of resource
|electronic resource; remote; computer; online resource
|1 online resource
|Degree committee member
|Degree committee member
|Stanford University, Department of Statistics.
|Statement of responsibility
|Submitted to the Department of Statistics
|Thesis Ph.D. Stanford University 2020
- © 2020 by Junyang Qian
- This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported license (CC BY-NC).