Advances in multivariate statistics and its applications
Abstract/Contents
- Abstract
- My research focuses on multivariate statistics, dimension reduction, and applied statistical modeling. During my Ph.D. studies in the Department of Statistics at Stanford, I took part in various collaborative projects, developing methodology and tools for the analysis of complex phenomena in areas such as biology, genetics, and neuroscience. This dissertation covers four main branches of my research, each presented in a separate chapter.
Conformation reconstruction is one of the main challenges in computational biology. In this study we develop a model for the 3D spatial organization of chromatin, a crucial component of numerous cellular processes (e.g. transcription). The central object in this study is the so-called contact matrix, which represents the frequency of contacts between each pair of genomic loci and thus can be used to infer the 3D structure. Most existing algorithms operating on contact matrices are based on multidimensional scaling (MDS) and produce reconstructed 3D configurations in the form of a polygonal chain. However, none of these methods exploits the fact that the target solution is a smooth curve in 3D. Smoothness is either ignored or addressed indirectly by introducing highly non-convex penalties into the model, which typically increases the computational complexity and instability of the reconstruction algorithm. In our work we develop Principal Curve Metric Scaling (PCMS), a novel approach that models chromatin directly as a smooth curve. We subsequently use PCMS as a building block to create more complex distribution-based models for the conformation. The resulting reconstruction technique therefore combines the advantages of MDS and smoothness penalties while remaining computationally efficient.
Low-rank matrix approximation (LRMA) is one of the central concepts in machine learning. It is closely related to areas such as dimension reduction and de-noising. A recent extension of LRMA is low-rank matrix completion, which solves the LRMA problem when some observations are missing and is especially useful for recommender systems (see, for example, the famous Netflix Prize competition). In this study we consider a weighted generalization of LRMA (WLRMA). We build an algorithm for solving the weighted problem as well as two important modifications: one for high-dimensional data and one for sparse data. In addition, we propose an efficient way to accelerate the WLRMA algorithm. Although our research to date has mainly focused on developing the WLRMA methodology, the technique has strong potential for applications. Beyond matrix completion, which it covers as a special case, it can serve as a building block for generalized linear models (GLMs) with a matrix structure. For example, in ecology, populations of species can be modeled via Poisson GLMs. In this case, the population matrices (with rows and columns corresponding to sites and species, respectively) can be analyzed using low-rank models, for which the WLRMA technique is a natural tool.
Canonical correlation analysis (CCA) is one of the core approaches in multivariate statistics. It is a technique for measuring the association between two sets of variables and has a wide variety of applications. This part of the research was motivated by a neuroscience study aimed at exploring the influence of emotional disorders on brain activity. While working with the brain imaging data we encountered the following challenges. First, the measurements are made on a very dense grid of brain loci, leading to extremely high-dimensional data. Second, the data was collected for only a few patients, so the sample size is small. Finally, the data has a structure defined by the brain geometry. To address the first two challenges we consider Regularized CCA and develop a "kernel trick" that allows us to handle extreme data size. We subsequently incorporate the brain structure into the regularization, introducing Group Regularized CCA (GRCCA), and extend the "kernel trick" to the structured data setting. The resulting GRCCA technique has demonstrated strong potential for brain imaging applications while being computationally efficient.
Epidemic forecasting became an area of high demand during the COVID-19 era. In this research, we study the trajectory of the COVID-19 pandemic using the open-source COVIDcast dataset collected by the Delphi Group. This dataset contains a wide variety of features, such as cases, deaths, hospitalizations, and many auxiliary indicators of COVID-19 activity, and therefore opens up a wealth of research directions. In particular, we develop the multi-period forecasting (MPF) methodology, which aims to predict the number of cases for multiple "ahead" values. The MPF technique solves a multi-response regression problem in which the response columns represent the same phenomenon measured at different time points. To incorporate this time dependence, it assumes the model coefficients to be smooth functions of time. We test this idea for point estimation of COVID-19 cases and subsequently extend it to predicting confidence intervals for the case counts via quantile regression.
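To make the weighted low-rank approximation problem mentioned in the abstract concrete, the following is a minimal illustrative sketch, not the dissertation's algorithm. It assumes weights in [0, 1] (a weight of 0 marks a missing entry, which gives matrix completion as a special case) and uses a generic fill-in-and-truncate iteration: blend the data with the current fit, then project onto low-rank matrices with a truncated SVD. The function name `weighted_low_rank` and all parameters are hypothetical.

```python
import numpy as np

def weighted_low_rank(X, W, rank, n_iter=200):
    """Approximately minimize sum(W * (X - M)**2) over matrices M of rank <= `rank`.

    Illustrative sketch only: assumes every weight in W lies in [0, 1],
    with W == 0 marking a missing entry.
    """
    M = np.zeros_like(X, dtype=float)
    for _ in range(n_iter):
        # Blend the data with the current fit, entry by entry, according to the weights.
        filled = W * X + (1.0 - W) * M
        # Project the blended matrix onto rank-`rank` matrices via a truncated SVD.
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        M = (U[:, :rank] * s[:rank]) @ Vt[:rank]
    return M

# Toy example: a rank-2 matrix with roughly 20% of its entries unobserved.
rng = np.random.default_rng(0)
truth = rng.normal(size=(30, 2)) @ rng.normal(size=(2, 20))
W = (rng.random(truth.shape) < 0.8).astype(float)  # 1 = observed, 0 = missing
X = np.where(W == 1, truth, 0.0)  # missing entries hold an arbitrary placeholder
M = weighted_low_rank(X, W, rank=2)
```

Each step minimizes a quadratic upper bound on the weighted loss, so the loss never increases from one iteration to the next. This simple scheme can converge slowly.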
Description
Type of resource | text |
---|---|
Form | electronic resource; remote; computer; online resource |
Extent | 1 online resource. |
Place | California |
Place | [Stanford, California] |
Publisher | [Stanford University] |
Copyright date | 2022; ©2022 |
Publication date | 2022 |
Issuance | monographic |
Language | English |
Creators/Contributors
Author | Tuzhilina, Elena |
---|---|
Degree supervisor | Hastie, Trevor |
Thesis advisor | Hastie, Trevor |
Thesis advisor | Segal, Mark |
Thesis advisor | Tibshirani, Robert |
Degree committee member | Segal, Mark |
Degree committee member | Tibshirani, Robert |
Associated with | Stanford University, Department of Statistics |
Subjects
Genre | Theses |
---|---|
Genre | Text |
Bibliographic information
Statement of responsibility | Elena Tuzhilina. |
---|---|
Note | Submitted to the Department of Statistics. |
Thesis | Thesis Ph.D. Stanford University 2022. |
Location | https://purl.stanford.edu/gr942mg4066 |
Access conditions
- Copyright
- © 2022 by Elena Tuzhilina
- License
- This work is licensed under a Creative Commons Attribution 3.0 Unported license (CC BY).