Advances in multivariate statistics and its applications


Abstract/Contents

Abstract
My research focuses on multivariate statistics, dimension reduction, and applied statistical modeling. During my Ph.D. studies at the Department of Statistics at Stanford, I took part in various collaborative projects, developing methodology and tools for the analysis of complex phenomena in areas such as biology, genetics, and neuroscience. This dissertation covers four main branches of my research, each presented in a separate chapter.

Conformation reconstruction is one of the main challenges in computational biology. In this study we develop a model for the 3D spatial organization of chromatin, a crucial component of numerous cellular processes (e.g., transcription). The central object in this study is the so-called contact matrix. It represents the frequency of contacts between each pair of genomic loci and thus can be used to infer the 3D structure. Most of the existing algorithms operating on contact matrices are based on multidimensional scaling (MDS) and produce reconstructed 3D configurations in the form of a polygonal chain. However, none of these methods exploit the fact that the target solution is a smooth curve in 3D. The smoothness attribute is either ignored or addressed indirectly by introducing highly non-convex penalties into the model. This typically leads to increased computational complexity and instability of the reconstruction algorithm. In our work we develop Principal Curve Metric Scaling (PCMS), a novel approach that models chromatin directly by a smooth curve. We subsequently use PCMS as a building block to create more complex distribution-based models for the conformation. The resulting reconstruction technique therefore combines the advantages of MDS and smoothness penalties while remaining computationally efficient.

Low-rank matrix approximation (LRMA) is one of the central concepts in machine learning. It is closely related to areas such as dimension reduction and de-noising. A recent extension of LRMA is low-rank matrix completion. It solves the LRMA problem when some observations are missing and is especially useful for recommender systems (see, for example, the famous Netflix Prize competition). In this study we consider a weighted generalization of LRMA (WLRMA). We build an algorithm for solving the weighted problem as well as two important modifications: one for high-dimensional data and one for sparse data. In addition, we propose an efficient way to accelerate the WLRMA algorithm. Although our previous research mainly focused on developing the WLRMA methodology, the technique has strong potential for applications. Beyond matrix completion, which it covers as a special case, it can serve as a building block for generalized linear models (GLMs) with a matrix structure. For example, in ecology, populations of species can be modeled via Poisson GLMs. In this case, the population matrices (with rows and columns corresponding to sites and species, respectively) can be analyzed using low-rank models, and the WLRMA technique will be of great importance.

Canonical correlation analysis (CCA) is one of the core approaches in multivariate statistics. It is a technique for measuring the association between two multivariate sets of variables and has a wide variety of applications. This part of the research was motivated by a study in neuroscience aimed at exploring the influence of emotional disorders on brain activity. While working with the brain imaging data we encountered the following challenges. First, the measurements are made on a very dense grid of brain loci, leading to extremely high-dimensional data. Second, the data was collected for only a few patients, so the number of observations is small. Finally, the data has a structure defined by the brain geometry. To address the first two challenges we consider Regularized CCA and develop a "kernel trick" that allows us to handle extreme data size. We subsequently incorporate brain structure into the regularization, introducing Group Regularized CCA (GRCCA), and extend the "kernel trick" to the structured data setting. The resulting GRCCA technique has demonstrated strong potential for brain imaging applications while being computationally efficient.

Epidemic forecasting became an area of high demand during the COVID-19 era. In this research, we study the trajectory of the COVID-19 pandemic by means of the open-source COVIDcast dataset collected by the Delphi Group. This dataset contains a wide variety of features such as cases, deaths, hospitalizations, and many auxiliary indicators of COVID-19 activity, and therefore opens up a wealth of research directions. In particular, we develop the multi-period forecasting (MPF) methodology, which aims to predict the number of cases for multiple "ahead" values. The MPF technique solves a multi-response regression problem, where the response columns represent the same phenomenon measured at different time points. To incorporate this time dependence, it assumes that the model coefficients are smooth functions of time. We test this idea for the point estimation of COVID-19 cases and subsequently extend it to predicting confidence intervals for the case counts via quantile regression.
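
As a rough illustration of the weighted low-rank approximation problem mentioned in the abstract, the sketch below fits a rank-one model by alternating weighted least squares in plain Python. It is not the algorithm developed in the dissertation: the function names weighted_rank_one and weighted_loss, the restriction to rank one, and the toy data are all assumptions made for illustration. Setting a weight to zero marks an entry as missing, which is how matrix completion arises as a special case of the weighted problem.

```python
def weighted_loss(X, W, u, v):
    """Weighted squared error: sum_ij W[i][j] * (X[i][j] - u[i] * v[j])**2."""
    return sum(
        W[i][j] * (X[i][j] - u[i] * v[j]) ** 2
        for i in range(len(X))
        for j in range(len(X[0]))
    )


def weighted_rank_one(X, W, n_sweeps=100):
    """Fit X ~ u v^T under non-negative weights W by alternating least squares.

    Illustrative sketch only (hypothetical helper, rank one for simplicity).
    Each half-step solves a weighted least-squares problem in closed form,
    so the weighted loss never increases from one sweep to the next.
    """
    n, p = len(X), len(X[0])
    u = [0.0] * n
    v = [1.0] * p  # arbitrary non-zero starting point
    for _ in range(n_sweeps):
        # Update row factors with column factors held fixed.
        for i in range(n):
            num = sum(W[i][j] * X[i][j] * v[j] for j in range(p))
            den = sum(W[i][j] * v[j] ** 2 for j in range(p))
            u[i] = num / den if den > 0 else 0.0
        # Update column factors with row factors held fixed.
        for j in range(p):
            num = sum(W[i][j] * X[i][j] * u[i] for i in range(n))
            den = sum(W[i][j] * u[i] ** 2 for i in range(n))
            v[j] = num / den if den > 0 else 0.0
    return u, v


# Toy data (made up): a rank-one matrix whose last entry is treated as missing.
X = [[2.0, 1.0, 4.0],
     [4.0, 2.0, 8.0],
     [6.0, 3.0, 0.0]]   # the 0.0 is a placeholder for an unobserved value
W = [[1.0, 1.0, 1.0],
     [1.0, 1.0, 1.0],
     [1.0, 1.0, 0.0]]   # weight 0 means "ignore this entry"

u, v = weighted_rank_one(X, W)
print("weighted loss on observed entries:", weighted_loss(X, W, u, v))
print("fitted value at the missing entry:", u[2] * v[2])
```

Only the observed entries contribute to the loss, so the fitted value at the zero-weight position acts as the model's imputation for that entry. A higher-rank or accelerated variant would need a different update scheme than the one sketched here.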

Description

Type of resource text
Form electronic resource; remote; computer; online resource
Extent 1 online resource.
Place California
Place [Stanford, California]
Publisher [Stanford University]
Copyright date 2022; ©2022
Publication date 2022; 2022
Issuance monographic
Language English

Creators/Contributors

Author Tuzhilina, Elena
Degree supervisor Hastie, Trevor
Thesis advisor Hastie, Trevor
Thesis advisor Segal, Mark
Thesis advisor Tibshirani, Robert
Degree committee member Segal, Mark
Degree committee member Tibshirani, Robert
Associated with Stanford University, Department of Statistics

Subjects

Genre Theses
Genre Text

Bibliographic information

Statement of responsibility Elena Tuzhilina.
Note Submitted to the Department of Statistics.
Thesis Thesis Ph.D. Stanford University 2022.
Location https://purl.stanford.edu/gr942mg4066

Access conditions

Copyright
© 2022 by Elena Tuzhilina
License
This work is licensed under a Creative Commons Attribution 3.0 Unported license (CC BY).
