Policy evaluation and learning in adaptive experiments
Abstract/Contents
- Abstract
- Adaptive experiments are becoming increasingly prevalent due to their ability to greatly improve sample efficiency in pursuit of particular objectives. As a result, data collected from such designs, for example contextual bandits, is increasingly available. A natural question arises: can we reuse these data to answer questions that the experiments were not originally designed to target? Adaptivity, however, poses serious statistical challenges when the post hoc objective differs substantially from the original one, and standard approaches for analyzing independently collected data can suffer from bias, excessive variance, or both. This thesis takes one step toward such post hoc analyses, organized around two themes: evaluating alternative treatment assignment policies to guide future innovation or experiments, and learning optimal policies to facilitate personalization. Our main contributions are as follows: (i) We present a family of generalized augmented inverse propensity weighted (AIPW) estimators for evaluating a given policy with adaptively collected data from multi-armed bandits. Our approach adaptively reweights the terms of an AIPW estimator to control the contribution of each term to the estimator's variance. This scheme reduces overall estimation variance and yields an asymptotically normal test statistic. (ii) We extend the adaptive weighting approach to evaluate policies in contextual bandits, where the weights are carefully chosen to accommodate variances of AIPW terms that may differ not only over time but also across the context space. The resulting estimator further reduces estimation variance. (iii) Based on a special variant of the above estimators, we propose an algorithm for learning optimal policies from contextual bandit data and establish its finite-sample regret bound. We complement this upper bound with a lower bound that characterizes the fundamental difficulty of policy learning with adaptive data.
Collectively, we hope our results can shed light on the design and implementation of hypothesis testing and efficient policy learning using adaptively collected data.
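The adaptively weighted AIPW scheme described in contribution (i) can be sketched in code. This is an illustrative simplification, not the thesis's exact estimator: it assumes variance-stabilizing weights of the form h_t(w) = sqrt(e_t(w)), which is one choice from the family of adaptive weights, and the function name `aw_aipw_value` and its inputs are hypothetical.

```python
import numpy as np

def aw_aipw_value(arms, rewards, propensities, mu_hat, target_policy):
    """Adaptively weighted AIPW estimate of a target policy's value.

    arms          : (T,)   arm indices chosen by the bandit algorithm
    rewards       : (T,)   observed rewards
    propensities  : (T, K) assignment probabilities e_t(w) used at time t
    mu_hat        : (T, K) outcome-model predictions fit on data before t
    target_policy : (K,)   arm probabilities of the policy being evaluated
    """
    T, K = propensities.shape
    rows = np.arange(T)

    # AIPW score for every arm at every time step:
    # Gamma_t(w) = mu_hat_t(w) + 1{W_t = w} * (Y_t - mu_hat_t(w)) / e_t(w)
    gamma = mu_hat.copy()
    gamma[rows, arms] += (rewards - mu_hat[rows, arms]) / propensities[rows, arms]

    # Adaptive, variance-stabilizing weights h_t(w) = sqrt(e_t(w)):
    # terms with small propensities (high variance) are down-weighted.
    h = np.sqrt(propensities)

    # Per-arm weighted average, then combine under the target policy.
    q_hat = (h * gamma).sum(axis=0) / h.sum(axis=0)
    return float(target_policy @ q_hat)

# Usage on synthetic bandit data (uniform propensities for illustration).
rng = np.random.default_rng(0)
T, K = 50, 3
propensities = np.full((T, K), 1.0 / K)
arms = rng.integers(0, K, size=T)
rewards = rng.normal(size=T)
mu_hat = np.zeros((T, K))
pi = np.array([0.2, 0.5, 0.3])
print(aw_aipw_value(arms, rewards, propensities, mu_hat, pi))
```

Note that when the propensities are constant, the adaptive weights are constant as well, so the estimate reduces to the ordinary (uniformly averaged) AIPW estimate; the weights matter precisely when the bandit algorithm changes its assignment probabilities over time.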
Description
Type of resource | text |
---|---|
Form | electronic resource; remote; computer; online resource |
Extent | 1 online resource. |
Place | California |
Place | [Stanford, California] |
Publisher | [Stanford University] |
Copyright date | ©2021 |
Publication date | 2021 |
Issuance | monographic |
Language | English |
Creators/Contributors
Author | Zhan, Ruohan |
---|---|
Degree supervisor | Athey, Susan |
Thesis advisor | Athey, Susan |
Thesis advisor | Van Roy, Benjamin |
Thesis advisor | Wager, Stefan |
Degree committee member | Van Roy, Benjamin |
Degree committee member | Wager, Stefan |
Associated with | Stanford University, Institute for Computational and Mathematical Engineering |
Subjects
Genre | Theses |
---|---|
Genre | Text |
Bibliographic information
Statement of responsibility | Ruohan Zhan. |
---|---|
Note | Submitted to the Institute for Computational and Mathematical Engineering. |
Thesis | Thesis (Ph.D.), Stanford University, 2021. |
Location | https://purl.stanford.edu/wm876jf4432 |
Access conditions
- Copyright
- © 2021 by Ruohan Zhan
- License
- This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported license (CC BY-NC 3.0).