Thompson sampling beyond classical bandits
- Thompson sampling has been shown to be an effective policy for a variety of sequential decision-making problems. Motivated by its state-of-the-art empirical performance and straightforward implementation, many recent works have analyzed its theoretical performance. Despite this interest, many questions remain open. Can Thompson sampling identify the best action in a sequential decision-making problem? What is the best performance guarantee we can provide for Thompson sampling in a Gaussian linear model? Can Thompson sampling still perform well when it receives delayed feedback in the form of batches? These questions lie at the heart of many real-life applications, and by answering them this thesis contributes to a better understanding of the performance of Thompson sampling in more complex and realistic scenarios.

We first study the exploration capabilities of Thompson sampling. While it is well known that Thompson sampling is an optimal algorithm achieving sub-linear cumulative regret in the classical multi-armed bandit problem, whether it can identify the optimal action has remained unknown: a regret-optimal algorithm can still select a suboptimal arm infinitely often and thereby fail to identify the optimal action. In this thesis, we show that Thompson sampling gradually determines the optimal arm with probability one whenever it achieves sub-linear regret, which is known to be the case in many classical bandit problems. Using this result, we introduce the first strongly consistent estimator of the optimal action that uses only the actions selected by the Thompson sampling agent.

We then study the performance of Thompson sampling for Gaussian linear bandits. We improve the state-of-the-art bounds on its expected regret by a factor of sqrt(log T), where T denotes the experiment duration. We achieve this result by introducing a novel Cauchy-Schwarz-type inequality for random vectors.

Finally, we study the performance of Thompson sampling for the batched multi-armed bandit problem. Prior work has devised algorithms specialized for this batched setting that optimize the batch structure for a given time horizon T and prioritize exploration at the beginning of the experiment to eliminate suboptimal actions. It has been unclear whether Thompson sampling, an algorithm that implicitly balances exploration and exploitation without knowing the time horizon, can perform well under batched feedback. In this thesis, we answer this question positively. We provide the first adversarial batching result in the literature by showing that Thompson sampling maintains its optimal performance even when the batch structure is chosen adversarially, as long as it receives enough feedback. Additionally, we introduce two adaptive batching strategies tuned to a given target performance criterion: asymptotic or finite-time performance. Both strategies require only O(log log T) batches to achieve optimal performance on a given problem instance, exponentially fewer batches than previous algorithms. This paves the way to drastically parallelizing the operation of Thompson sampling and reducing the number of times the agent needs to interact with the underlying system in many real-world applications.
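The abstract refers repeatedly to Thompson sampling's posterior-sampling loop. The following is a minimal illustrative sketch, not code from the thesis, of Thompson sampling on a Bernoulli multi-armed bandit with Beta posteriors; the function name, arm probabilities, and horizon are assumptions chosen for the example.

```python
import random

def thompson_sampling(success_probs, horizon, seed=0):
    """Illustrative Beta-Bernoulli Thompson sampling sketch.

    Each arm keeps a Beta(successes + 1, failures + 1) posterior over its
    mean reward. Each round we draw one sample per arm and pull the argmax,
    so arms are selected in proportion to their posterior probability of
    being optimal.
    """
    rng = random.Random(seed)
    k = len(success_probs)
    wins = [0] * k      # observed successes per arm
    losses = [0] * k    # observed failures per arm
    pulls = [0] * k     # how often each arm was selected
    for _ in range(horizon):
        # Sample a plausible mean reward for each arm from its posterior.
        samples = [rng.betavariate(wins[a] + 1, losses[a] + 1) for a in range(k)]
        arm = max(range(k), key=samples.__getitem__)
        # Observe a Bernoulli reward and update that arm's posterior counts.
        reward = 1 if rng.random() < success_probs[arm] else 0
        wins[arm] += reward
        losses[arm] += 1 - reward
        pulls[arm] += 1
    return pulls

pulls = thompson_sampling([0.3, 0.5, 0.7], horizon=2000)
```

Over a long horizon, pulls concentrate on the best arm (index 2 here), which is the concentration behavior underlying the thesis's result on identifying the optimal action from the selected arms alone.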
|Type of resource
|electronic resource; remote; computer; online resource
|1 online resource.
|Degree committee member
|Stanford University, Department of Electrical Engineering
|Statement of responsibility
|Submitted to the Department of Electrical Engineering.
|Thesis (Ph.D.), Stanford University, 2022.
- © 2022 by Cem Kalkanli
- This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported license (CC BY-NC 3.0).