Adaptive and efficient batch reinforcement learning algorithms


Abstract/Contents

Abstract
Reinforcement learning (RL) addresses sequential decision-making in an unknown environment and has achieved many successes in domains with good simulators (Atari, Go, etc.), typically using hundreds of millions of samples. However, real-world applications of reinforcement learning often cannot afford high-risk online exploration. To bridge this gap, this dissertation investigates how to perform reinforcement learning from an offline dataset, known as batch reinforcement learning. We provide theoretically justified new algorithms, as well as empirical validation on simulated and real datasets. More specifically, this dissertation studies the two main aspects of batch reinforcement learning.

How can we evaluate a policy given a fixed dataset collected by other policies? Offline policy evaluation is a challenging counterfactual reasoning problem in these partial-information, sequential settings. Solutions to this problem rely on two types of estimators: direct model-based estimators and importance re-weighting estimators. We propose a model-based estimator with a mean-squared error bound and analyze a variance reduction heuristic for importance sampling.

How can we learn a new policy from a fixed dataset collected by other policies? Learning a policy with function approximation from a fixed dataset suffers from overestimation of counterfactual policies under empirical reward/value maximization. Prior theoretical justifications rely on strong assumptions about the data distribution and are therefore less informative for guiding practice. We propose the idea of pessimism to constrain the policy search space and avoid instability. Following this intuition, we propose three new algorithms: the first convergent policy gradient algorithm in this setting, value function learning with finite-sample error bounds, and a method that avoids overfitting in direct policy gradient.

The primary contribution of this dissertation is to advance the foundations of batch RL and to develop batch RL algorithms with provable guarantees under realistic assumptions. One of the driving goals is finite-sample error bounds for algorithms in function approximation settings. To that end, this dissertation makes progress on both policy evaluation and policy learning by studying the theory of these problems and using the theoretical insights to derive new algorithms with strong empirical performance.
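
For readers unfamiliar with the estimator families and the pessimism idea named in the abstract, the following LaTeX sketch writes out the standard textbook forms they build on; it is illustrative only and is not taken from the dissertation. The notation (target policy \pi_e, behavior policy \pi_b, horizon H, discount \gamma, dataset \mathcal{D}, uncertainty penalty b) is conventional and assumed here.

\documentclass{article}
\usepackage{amsmath}
\begin{document}
% Ordinary (trajectory-wise) importance sampling: re-weight the return of each
% of the n logged trajectories by the likelihood ratio between the target and
% behavior policies; under the standard support assumption the average is an
% unbiased estimate of the target policy's value.
\begin{equation*}
\hat{V}^{\pi_e}_{\mathrm{IS}}
  = \frac{1}{n}\sum_{i=1}^{n}
    \left(\prod_{t=0}^{H-1}
      \frac{\pi_e\!\left(a_t^{(i)} \mid s_t^{(i)}\right)}
           {\pi_b\!\left(a_t^{(i)} \mid s_t^{(i)}\right)}\right)
    \sum_{t=0}^{H-1} \gamma^{t} r_t^{(i)}
\end{equation*}
% Generic pessimistic policy learning: maximize a lower confidence bound on the
% estimated value, so policies poorly covered by the dataset D are penalized
% rather than overestimated.
\begin{equation*}
\hat{\pi}
  = \operatorname*{arg\,max}_{\pi \in \Pi}
    \left(\hat{V}^{\pi}(\mathcal{D}) - b(\pi, \mathcal{D})\right)
\end{equation*}
\end{document}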

Description

Type of resource text
Form electronic resource; remote; computer; online resource
Extent 1 online resource.
Place California
Place [Stanford, California]
Publisher [Stanford University]
Copyright date 2021; ©2021
Publication date 2021
Issuance monographic
Language English

Creators/Contributors

Author Liu, Yao
Degree supervisor Brunskill, Emma
Thesis advisor Brunskill, Emma
Thesis advisor Ma, Tengyu
Thesis advisor Van Roy, Benjamin
Thesis advisor Wager, Stefan
Degree committee member Ma, Tengyu
Degree committee member Van Roy, Benjamin
Degree committee member Wager, Stefan
Associated with Stanford University, Computer Science Department

Subjects

Genre Theses
Genre Text

Bibliographic information

Statement of responsibility Yao Liu.
Note Submitted to the Computer Science Department.
Thesis Thesis (Ph.D.)--Stanford University, 2021.
Location https://purl.stanford.edu/tn970zq6499

Access conditions

Copyright
© 2021 by Yao Liu
License
This work is licensed under a Creative Commons Attribution Non Commercial No Derivatives 3.0 Unported license (CC BY-NC-ND).
