Statistical learning for large-scale survival data
- The constantly growing population biobanks give researchers unprecedented opportunities to understand the genetics of human disease. Survival analysis gives insight into the association between predictors and time-to-event responses and is particularly suitable for such data. At the same time, millions of genetic variants sequenced from hundreds of thousands of individuals pose computational challenges. Chapter 1 and Chapter 3 of this dissertation present three methods that reduce the memory requirement and improve the computational speed of analyzing such data. The first method is a variable screening procedure that exploits sparsity in the association between the predictors and the response in high-dimensional datasets, which reduces the frequency of expensive I/O operations for larger-than-RAM data. The second method uses a 2-bits-per-entry compact representation designed specifically for genetic matrices, which further reduces the memory requirement and makes our bandwidth-bound optimization algorithm scalable to more CPU cores. The third method combines the compact representation for genetic variants with a simplified version of the compressed sparse block format to represent genetic data with a large number of rare variants. The prediction performance of survival models suffers when the number of censored survival times is large. This can happen if we define the survival time as the age of onset of a rare disease. In Chapter 2, I provide a group-sparse regression-based algorithm to boost the prediction performance on such data. This method is applicable when there are other survival responses that have a large number of observed events and are associated with the same predictors as the rare-event response. Finally, Chapter 4 provides a baseline-adjusted concordance index as a stable evaluation metric for survival models. This metric is particularly useful in evaluating stratified Cox models, as well as in model selection using cross-validation.
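To make the 2-bits-per-entry idea in the abstract concrete, here is a minimal sketch of how genotype codes could be packed four to a byte. It is an illustration only and not the dissertation's implementation: the coding (0, 1, 2 for the minor-allele count and 3 for a missing value), the bit order within a byte, and the names `pack_genotypes`, `unpack_genotypes`, and `MISSING` are all assumptions made for this example.

```python
# Illustrative sketch only: the 0/1/2/3 coding and the bit order are
# assumptions, not the layout used in the dissertation.
MISSING = 3  # assumed code for a missing genotype


def pack_genotypes(genotypes):
    """Pack genotype codes (each in 0..3) into bytes, four entries per byte."""
    packed = bytearray((len(genotypes) + 3) // 4)
    for i, g in enumerate(genotypes):
        if not 0 <= g <= 3:
            raise ValueError(f"genotype code out of range: {g}")
        # Entry i occupies bits 2*(i % 4) and 2*(i % 4) + 1 of byte i // 4.
        packed[i // 4] |= g << (2 * (i % 4))
    return bytes(packed)


def unpack_genotypes(packed, n):
    """Recover the first n genotype codes from a packed byte string."""
    return [(packed[i // 4] >> (2 * (i % 4))) & 0b11 for i in range(n)]


genotypes = [0, 1, 2, MISSING, 2, 0]
packed = pack_genotypes(genotypes)
```

In this sketch the six entries occupy two bytes, compared with 48 bytes if each entry were stored as an 8-byte floating-point number. Reading fewer bytes per entry is what would help an algorithm whose speed is limited by memory bandwidth.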
|Type of resource
|electronic resource; remote; computer; online resource
|1 online resource.
|Taylor, Jonathan E
|Degree committee member
|Taylor, Jonathan E
|Stanford University, Institute for Computational and Mathematical Engineering
|Statement of responsibility
|Submitted to the Institute for Computational and Mathematical Engineering.
|Thesis Ph.D. Stanford University 2021.
- © 2021 by Ruilin Li
- This work is licensed under a Creative Commons Attribution Non Commercial 3.0 Unported license (CC BY-NC).