Deep exploration via randomized value functions

Abstract
The "Big Data" revolution is spawning systems designed to make decisions from data. Statistics and machine learning has made great strides in prediction and estimation from any fixed dataset. However, if you want to learn to take actions where your choices can affect both the underlying system and the data you observe, you need reinforcement learning. Reinforcement learning builds upon learning from datasets, but also addresses the issues of partial feedback and long term consequences. In a reinforcement learning problem the decisions you make may affect the data you get, and even alter the underlying system for future timesteps. Statistically efficient reinforcement learning requires "deep exploration" or the ability to plan to learn. Previous approaches to deep exploration have not been computationally tractable beyond small scale problems. For this reason, most practical implementations use statistically inefficient methods for exploration such as epsilon-greedy dithering, which can lead to exponentially slower learning. In this dissertation we present an alternative approach to deep exploration through the use of randomized value functions. Our work is inspired by the Thompson sampling heuristic for multi-armed bandits which suggests, at a high level, to "randomly select a policy according to the probability that it is optimal". We provide insight into why this algorithm can be simultaneously more statistically efficient and more computationally efficient than existing approaches. We leverage these insights to establish several state of the art theoretical results and performance guarantees. Importantly, and unlike previous approaches to deep exploration, this approach also scales gracefully to complex domains with generalization. We complement our analysis with extensive empirical experiments; these include several didactic examples as well as a recommendation system, Tetris, and Atari 2600 games.

Description

Type of resource text
Form electronic; electronic resource; remote
Extent 1 online resource.
Publication date 2016
Issuance monographic
Language English

Creators/Contributors

Associated with Osband, Ian
Associated with Stanford University, Department of Management Science and Engineering.
Primary advisor Van Roy, Benjamin
Thesis advisor Duchi, John
Thesis advisor Johari, Ramesh, 1976-
Thesis advisor Kochenderfer, Mykel J, 1980-

Subjects

Genre Theses

Bibliographic information

Statement of responsibility Ian Osband.
Note Submitted to the Department of Management Science and Engineering.
Thesis Thesis (Ph.D.)--Stanford University, 2016.
Location electronic resource

Access conditions

Copyright
© 2016 by Ian David Moffat Osband
License
This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported license (CC BY-NC 3.0).
