Statistical learning for large-scale survival data

Abstract/Contents

Abstract
The constantly growing population biobanks have given scientists and researchers unprecedented opportunities to understand the genetics of human diseases. Survival analysis gives insight into the association between predictors and time-to-event responses and is particularly suitable for such data. On the other hand, millions of genetic variants sequenced from hundreds of thousands of individuals also pose computational challenges. Chapter 1 and Chapter 3 of this dissertation present three methods that reduce the memory requirement and improve the computational speed of analyzing such data. The first method is a variable screening procedure that exploits the sparsity of the association between the predictors and the response in high-dimensional datasets, which reduces the frequency of expensive I/O operations for larger-than-RAM data. The second method uses a 2-bits-per-entry compact representation designed specifically for genetic matrices, which further reduces the memory requirement and makes our bandwidth-bound optimization algorithm scalable to more CPU cores. The third method combines the compact representation for genetic variants with a simplified version of the compressed sparse block format to represent genetic data with a large number of rare variants. The prediction performance of survival models suffers when the number of censored survival times is large. This can happen if we define the survival time as the age of onset of a rare disease. In Chapter 2, I provide a group-sparse regression-based algorithm to boost the prediction performance on such data. This method is applicable when there are other survival responses that have a large number of observed events and are associated with the same predictors as the rare-event response. Finally, Chapter 4 provides a baseline-adjusted concordance index as a stable evaluation metric for survival models. This metric is particularly useful in evaluating stratified Cox models, as well as in model selection using cross-validation.

Description

Type of resource text
Form electronic resource; remote; computer; online resource
Extent 1 online resource.
Place California
Place [Stanford, California]
Publisher [Stanford University]
Copyright date 2021; ©2021
Publication date 2021
Issuance monographic
Language English

Creators/Contributors

Author Li, Ruilin
Degree supervisor Rivas, Manuel
Degree supervisor Tibshirani, Robert
Thesis advisor Rivas, Manuel
Thesis advisor Tibshirani, Robert
Thesis advisor Taylor, Jonathan E
Degree committee member Taylor, Jonathan E
Associated with Stanford University, Institute for Computational and Mathematical Engineering

Subjects

Genre Theses
Genre Text

Bibliographic information

Statement of responsibility Ruilin Li.
Note Submitted to the Institute for Computational and Mathematical Engineering.
Thesis Thesis Ph.D. Stanford University 2021.
Location https://purl.stanford.edu/fr646ms1849

Access conditions

Copyright
© 2021 by Ruilin Li
License
This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported license (CC BY-NC 3.0).
