Adaptive and efficient batch reinforcement learning algorithms


Abstract/Contents

Abstract
Reinforcement learning (RL) addresses sequential decision-making in an unknown environment and has achieved many successes in domains with good simulators (Atari, Go, etc.), typically using hundreds of millions of samples. However, real-world applications of reinforcement learning often cannot afford high-risk online exploration. To bridge this gap, this dissertation investigates how to perform reinforcement learning from an offline dataset, known as batch reinforcement learning. We provide theoretically justified new algorithms, as well as empirical validation on simulated and real datasets. More specifically, this dissertation studies the two main aspects of batch reinforcement learning.

How can we evaluate a policy given a fixed dataset collected by other policies? Offline policy evaluation is a challenging counterfactual reasoning problem in these partial-information, sequential settings. Solutions to this problem rely on two types of estimators: direct model-based estimators and importance re-weighting estimators. We propose a model-based estimator with a mean-squared error bound and analyze a variance reduction heuristic for importance sampling.

How can we learn a new policy from a fixed dataset collected by other policies? Learning a policy with function approximation from a fixed dataset suffers from overestimation of counterfactual policies under empirical reward/value maximization. Prior theoretical justifications rely on strong assumptions about the data distribution and are therefore less informative for guiding practice. We propose the idea of pessimism to constrain the policy search space and avoid instability. Following this intuition, we propose three new algorithms: the first convergent policy gradient algorithm in this setting, value function learning with finite-sample error bounds, and a method that avoids overfitting in direct policy gradient.

The primary contribution of this dissertation is to advance the foundations of batch RL and to develop batch RL algorithms with provable guarantees under realistic assumptions. One of the driving goals is finite-sample error bounds for algorithms in function approximation settings. To that end, this dissertation makes progress on both policy evaluation and policy learning by studying the theory of these problems and using the theoretical insights to derive new algorithms with strong empirical performance.
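
For readers unfamiliar with the estimator families and the pessimism idea named in the abstract, the following LaTeX sketch writes out the standard textbook forms they build on; it is illustrative only and is not taken from the dissertation. The notation (target policy \pi_e, behavior policy \pi_b, horizon H, discount \gamma, dataset \mathcal{D}, uncertainty penalty b) is conventional and assumed here.

\documentclass{article}
\usepackage{amsmath}
\begin{document}
% Ordinary (trajectory-wise) importance sampling: re-weight the return of each
% of the n logged trajectories by the likelihood ratio between the target and
% behavior policies; under the standard support assumption the average is an
% unbiased estimate of the target policy's value.
\begin{equation*}
\hat{V}^{\pi_e}_{\mathrm{IS}}
  = \frac{1}{n}\sum_{i=1}^{n}
    \left(\prod_{t=0}^{H-1}
      \frac{\pi_e\!\left(a_t^{(i)} \mid s_t^{(i)}\right)}
           {\pi_b\!\left(a_t^{(i)} \mid s_t^{(i)}\right)}\right)
    \sum_{t=0}^{H-1} \gamma^{t} r_t^{(i)}
\end{equation*}
% Generic pessimistic policy learning: maximize a lower confidence bound on the
% estimated value, so policies poorly covered by the dataset D are penalized
% rather than overestimated.
\begin{equation*}
\hat{\pi}
  = \operatorname*{arg\,max}_{\pi \in \Pi}
    \left(\hat{V}^{\pi}(\mathcal{D}) - b(\pi, \mathcal{D})\right)
\end{equation*}
\end{document}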

Description

Type of resource text
Form electronic resource; remote; computer; online resource
Extent 1 online resource.
Place California
Place [Stanford, California]
Publisher [Stanford University]
Copyright date 2021; ©2021
Publication date 2021
Issuance monographic
Language English

Creators/Contributors

Author Liu, Yao
Degree supervisor Brunskill, Emma
Thesis advisor Brunskill, Emma
Thesis advisor Ma, Tengyu
Thesis advisor Van Roy, Benjamin
Thesis advisor Wager, Stefan
Degree committee member Ma, Tengyu
Degree committee member Van Roy, Benjamin
Degree committee member Wager, Stefan
Associated with Stanford University, Computer Science Department

Subjects

Genre Theses
Genre Text

Bibliographic information

Statement of responsibility Yao Liu.
Note Submitted to the Computer Science Department.
Thesis Thesis (Ph.D.)--Stanford University, 2021.
Location https://purl.stanford.edu/tn970zq6499

Access conditions

Copyright
© 2021 by Yao Liu
License
This work is licensed under a Creative Commons Attribution Non Commercial No Derivatives 3.0 Unported license (CC BY-NC-ND).
