Essays on statistical learning and causal inference on panel data
- Panel data, which provides multiple observations on each individual over time, has become widely available and has received growing interest in many domains. For example, in asset pricing, panel data on asset returns over time is central to the study of how financial assets, such as stocks, bonds, and futures, are priced. In public policy, panel data is valuable for estimating and analyzing the effects of economic and social policies. Panel data can improve the power of analyses, uncover dynamic relationships among variables, and generate more accurate predictions of individual outcomes. The growing interest in panel data in empirical research has spurred the development of new methodologies. In the first part of this thesis, we present several novel statistical inference methods for large-dimensional panel data with a large number of units and time periods. An effective method for summarizing the information in large-dimensional panel data is the factor model, which has been used successfully in asset pricing, recommendation systems, and many other areas. We focus on latent factor models, where the factors are unobserved and estimated from the data. Latent factor models can address concerns about model misspecification, i.e., that we cannot fully observe all covariates that affect the outcome. However, latent factors are hard to interpret because they are usually weighted averages of all units. We propose sparse proximate factors as proxies for latent factors. Sparse proximate factors are constructed from the few units with the largest signal-to-noise ratios; they approximate latent factors well while remaining interpretable. When the panel data spans a long time horizon, as macroeconomic data does, it is restrictive to assume that the factor structure is static. We generalize the factor structure to depend on an observed state process. For example, the factor model for stock return data can change with the business cycle. 
We provide an estimator for this state-varying factor model and develop its inferential theory. Many studies in the social sciences and healthcare try to answer questions about causal relationships that go beyond statistical association. Many of these studies rely on observational data when running experiments is infeasible, and observational panel data has received more attention because it can capture changes within units over time. A fundamental problem in estimating causal effects from observational data is estimating the counterfactual outcomes, which can be modeled as missing observations. We connect large-dimensional factor modeling with causal inference. Specifically, we provide an estimator for the latent factor model on large-dimensional panel data with missing observations and derive the inferential theory for our estimator, which can be used to test the effect of a treatment at any time and of general weighted treatments. An alternative approach to studying treatment effects is to run experiments, which are the gold standard in medical and clinical research and have become increasingly popular for testing new products at large technology companies. In the second part of this thesis, we study optimal multi-period experimental design to increase statistical power, since insufficient power is a common hurdle in designing experiments. We show that the structure of the optimal multi-period experimental design depends on how long the effects of interventions last. In the presence of pre-experimentation data, we can further optimize our treatment designs and hence reduce the number of required samples, which lowers the cost of the experiment.
|Type of resource
|electronic resource; remote; computer; online resource
|1 online resource
|Degree committee member
|Degree committee member
|Stanford University, Department of Management Science and Engineering
|Statement of responsibility
|Submitted to the Department of Management Science and Engineering
|Thesis Ph.D. Stanford University 2020
- © 2020 by Ruoxuan Xiong
- This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported license (CC BY-NC).