Efficient learning in sequential optimization


Abstract/Contents

Abstract
We consider a broad class of online optimization problems in which a decision-maker must balance exploration and exploitation while learning from partial feedback. In these problems, the decision-maker repeatedly chooses among a set of possible actions, observes an outcome, and receives a reward representing the utility derived from that outcome. She is uncertain about the underlying system and is therefore initially unsure which action is best. However, as outcomes are observed, she can learn over time to make increasingly effective decisions. Her objective is to choose actions sequentially so as to maximize expected cumulative reward. We focus on three algorithmic approaches that accommodate flexible statistical modeling and are capable of experimenting efficiently across broad classes of problems.

The first part of the thesis focuses on a design principle known as optimism in the face of uncertainty, which underlies many of the most effective exploration algorithms. We provide a regret bound for an optimistic algorithm that applies broadly and can be specialized to many specific model classes. Our bound depends on a new notion of dimension that measures the degree of dependence among actions. We compare this notion to the Vapnik-Chervonenkis dimension and explain why it and other measures of dimension used in the supervised learning literature do not suffice for analyzing optimistic algorithms. We then turn our attention to Thompson sampling, an elegant algorithm for learning in online optimization problems with partial feedback. We derive a close theoretical connection between Thompson sampling and optimistic algorithms; through this connection, existing analyses of specific optimistic algorithms translate immediately into expected regret bounds for Thompson sampling.

The second part of the thesis pushes beyond the optimistic principle and offers a fresh, information-theoretic perspective on the exploration/exploitation tradeoff. We first revisit Thompson sampling from this perspective and provide novel regret bounds that scale with the entropy of the optimal action distribution. We then propose a new algorithm--information-directed sampling (IDS)--and study its performance. IDS quantifies the amount learned by selecting an action through an information-theoretic measure, and selects actions by optimizing an objective that explicitly balances attaining high immediate reward and selecting informative actions. We provide a general regret bound for IDS, demonstrate strong performance in simulation, and show through simple analytic examples that it can dramatically outperform Thompson sampling due to the way it quantifies information.
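For concreteness, the sketch below illustrates Thompson sampling, one of the algorithms studied in the thesis, on an independent-armed Bernoulli bandit. The Beta-Bernoulli model, the arm means, and the function name are illustrative assumptions introduced here and are not taken from the thesis, which treats far more general model classes.

import numpy as np

def thompson_sampling_bernoulli(true_means, horizon, seed=0):
    """Minimal Thompson sampling sketch for a Bernoulli bandit with independent arms.

    Each arm's mean gets a Beta(1, 1) prior; after every pull the posterior is
    updated by conjugacy, and the next arm is chosen by sampling a plausible
    mean from each posterior and playing the argmax.
    """
    rng = np.random.default_rng(seed)
    n_arms = len(true_means)
    successes = np.ones(n_arms)   # Beta posterior alpha parameters
    failures = np.ones(n_arms)    # Beta posterior beta parameters
    total_reward = 0.0

    for _ in range(horizon):
        # Posterior sampling: draw one candidate mean per arm.
        sampled_means = rng.beta(successes, failures)
        arm = int(np.argmax(sampled_means))

        # Observe a Bernoulli reward and update the chosen arm's posterior.
        reward = rng.random() < true_means[arm]
        successes[arm] += reward
        failures[arm] += 1 - reward
        total_reward += reward

    return total_reward

# Hypothetical arm means for illustration only; the third arm is best.
print(thompson_sampling_bernoulli([0.3, 0.5, 0.7], horizon=10_000))

Because the algorithm plays each arm with the posterior probability that it is optimal, pulls concentrate on the best arm as evidence accumulates; the optimistic and information-directed methods described above replace this sampling rule with, respectively, an upper-confidence-style choice and an explicit trade-off between immediate reward and information gained.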

Description

Type of resource text
Form electronic; electronic resource; remote
Extent 1 online resource.
Publication date 2015
Issuance monographic
Language English

Creators/Contributors

Associated with Russo, Daniel
Associated with Stanford University, Department of Management Science and Engineering.
Primary advisor Van Roy, Benjamin
Thesis advisor Johari, Ramesh, 1976-
Thesis advisor Tse, David

Subjects

Genre Theses

Bibliographic information

Statement of responsibility Daniel Russo.
Note Submitted to the Department of Management Science and Engineering.
Thesis Thesis (Ph.D.)--Stanford University, 2015.
Location electronic resource

Access conditions

Copyright
© 2015 by Daniel Joseph Russo
License
This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported license (CC BY-NC 3.0).
