Adaptive and efficient batch reinforcement learning algorithms
Abstract/Contents
- Abstract
- Reinforcement learning (RL) focuses on solving the problem of sequential decision-making in an unknown environment and has achieved many successes in domains with good simulators (Atari, Go, etc.), given hundreds of millions of samples. However, real-world applications of reinforcement learning often cannot afford high-risk online exploration. To bridge this gap, this dissertation investigates how to perform reinforcement learning from an offline dataset, an approach known as batch reinforcement learning. We provide theoretically justified new algorithms, together with empirical validation on simulated and real datasets. More specifically, this dissertation studies the two main aspects of batch reinforcement learning. First, how can we evaluate a policy given a fixed dataset collected by other policies? Offline policy evaluation is a challenging counterfactual reasoning problem in these partial-information, sequential settings. Solutions to this problem rely on two types of estimators: direct model-based estimators and importance re-weighting estimators. We propose a model-based estimator with a mean-squared error bound and analyze a variance-reduction heuristic for importance sampling. Second, how can we learn a new policy from a fixed dataset collected by other policies? Learning a policy with function approximation from a fixed dataset suffers from overestimation of counterfactual policies under empirical reward/value maximization. Prior theoretical justifications rely on strong assumptions about the data distribution and are therefore less informative for guiding practice. We propose the idea of pessimism: constraining the policy search space to avoid instability. Following this intuition, we propose three new algorithms: the first provably convergent batch policy gradient method, value function learning with finite-sample error bounds, and a method for avoiding overfitting in direct policy gradient.
The primary contribution of this dissertation is advancing the foundations of batch RL and developing batch RL algorithms with provable guarantees under realistic assumptions. One of the driving goals is finite-sample error bounds for algorithms in function approximation settings. To that end, this dissertation makes progress on both the policy evaluation and policy learning problems by studying the theory of these problems and using the theoretical insights to derive new algorithms with strong empirical performance as well.
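To make the importance re-weighting idea from the abstract concrete, the following is a minimal sketch of a standard per-trajectory importance sampling estimator for offline policy evaluation; the function names, trajectory format, and interface are illustrative assumptions, not the dissertation's exact implementation.

```python
import numpy as np

def importance_sampling_ope(trajectories, pi_e, pi_b, gamma=0.99):
    """Per-trajectory importance sampling estimate of the value of a
    target policy pi_e from data collected by a behavior policy pi_b.

    trajectories: list of trajectories, each a list of (state, action,
        reward) tuples (an assumed, illustrative data format).
    pi_e, pi_b: callables pi(a, s) returning the probability of taking
        action a in state s under the target / behavior policy.
    """
    estimates = []
    for traj in trajectories:
        weight, discounted_return = 1.0, 0.0
        for t, (s, a, r) in enumerate(traj):
            # Cumulative likelihood ratio re-weights this trajectory
            # from the behavior distribution to the target distribution.
            weight *= pi_e(a, s) / pi_b(a, s)
            discounted_return += (gamma ** t) * r
        estimates.append(weight * discounted_return)
    # Unbiased (given full support) but potentially high-variance,
    # which motivates the variance-reduction heuristics the abstract mentions.
    return float(np.mean(estimates))
```

The estimator is unbiased when the behavior policy has full support over the target policy's actions, but its variance grows with horizon length, which is exactly why variance reduction matters in this setting.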
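The pessimism principle described above can be illustrated with a simple tabular sketch: value iteration on a fixed dataset where each state-action estimate is penalized by a count-based uncertainty bonus, and unseen pairs receive a pessimistic floor. This is a generic illustration of the idea, not the dissertation's specific algorithms; all names and the penalty form are assumptions.

```python
import numpy as np
from collections import defaultdict

FLOOR = -1e6  # finite pessimistic value assigned to unseen (s, a) pairs

def pessimistic_q_iteration(dataset, n_states, n_actions, gamma=0.9,
                            penalty=1.0, n_iters=100):
    """Tabular batch value iteration with a count-based pessimism penalty.

    dataset: list of (s, a, r, s_next) transitions (assumed format).
    penalty: scale of the 1/sqrt(count) pessimism bonus subtracted from
        each Q-estimate; penalty=0 recovers plain certainty-equivalence.
    """
    counts = np.zeros((n_states, n_actions))
    reward_sum = np.zeros((n_states, n_actions))
    next_counts = defaultdict(lambda: defaultdict(int))
    for s, a, r, s2 in dataset:
        counts[s, a] += 1
        reward_sum[s, a] += r
        next_counts[(s, a)][s2] += 1

    q = np.zeros((n_states, n_actions))
    for _ in range(n_iters):
        v = q.max(axis=1)
        new_q = np.full((n_states, n_actions), FLOOR)
        for s in range(n_states):
            for a in range(n_actions):
                n = counts[s, a]
                if n == 0:
                    continue  # never observed: stays at pessimistic floor
                # Empirical model: mean reward and empirical next-state values.
                exp_next = sum(c * v[s2]
                               for s2, c in next_counts[(s, a)].items()) / n
                # Subtracting the bonus discourages poorly-covered actions,
                # keeping the learned policy inside well-supported regions.
                new_q[s, a] = reward_sum[s, a] / n + gamma * exp_next \
                              - penalty / np.sqrt(n)
        q = new_q
    return q
```

The key design choice is that uncertainty lowers, rather than raises, the value estimate, so empirical value maximization can no longer exploit estimation error on rarely-visited state-action pairs.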
Description
Type of resource | text |
---|---|
Form | electronic resource; remote; computer; online resource |
Extent | 1 online resource. |
Place | California |
Place | [Stanford, California] |
Publisher | [Stanford University] |
Copyright date | ©2021
Publication date | 2021
Issuance | monographic |
Language | English |
Creators/Contributors
Author | Liu, Yao |
---|---|
Degree supervisor | Brunskill, Emma |
Thesis advisor | Brunskill, Emma |
Thesis advisor | Ma, Tengyu |
Thesis advisor | Van Roy, Benjamin |
Thesis advisor | Wager, Stefan |
Degree committee member | Ma, Tengyu |
Degree committee member | Van Roy, Benjamin |
Degree committee member | Wager, Stefan |
Associated with | Stanford University, Computer Science Department |
Subjects
Genre | Theses |
---|---|
Genre | Text |
Bibliographic information
Statement of responsibility | Yao Liu. |
---|---|
Note | Submitted to the Computer Science Department. |
Thesis | Thesis (Ph.D.)--Stanford University, 2021.
Location | https://purl.stanford.edu/tn970zq6499 |
Access conditions
- Copyright
- © 2021 by Yao Liu
- License
- This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported license (CC BY-NC-ND 3.0).