Efficient reinforcement learning with agent states
- In a wide range of decision problems, academic research has largely focused on stylized models whose capacity is limited by problem-specific assumptions. Over the past decade, approaches based on reinforcement learning (RL) have received growing attention. With these approaches, a unified method can be applied to a broad class of problems, circumventing the need for stylized solutions. Moreover, when it comes to real-life applications, such RL-based approaches, unfettered by constraining models, can potentially leverage the growing amounts of data and computational resources. As such, continuing innovations may empower RL to tackle problems in the complex physical world. So far, however, the empirical accomplishments of RL have largely been limited to artificial environments, such as games. One reason is that the success of RL often hinges on the availability of a simulator that can mass-produce samples, whereas real environments, such as medical facilities, fulfillment centers, and the World Wide Web, exhibit complex dynamics that are hard to capture with hard-coded simulators. To bring the achievements of RL into practice, it is useful to reconsider how the interactions between the agent and the real world ought to be modeled. Recent work on RL theory tends to focus on restrictive classes of environments that fail to capture certain aspects of the real world. For example, many such works model the environment as a Markov Decision Process (MDP), which requires that the agent always observe a summary statistic of its situation. In practice, this means that the agent designer must identify a set of "environmental states," where each state incorporates all information about the environment relevant to decision-making. Moreover, to ensure that the agent learns from its trajectories, MDP models presume that some environmental states are visited infinitely often.
This could be a significant oversimplification of the real world; as the Argentine poet Jorge Luis Borges once observed, "Every day, perhaps every hour, is different." To generate insights on agent design in authentic applications, this dissertation considers a more general framework of RL that relaxes such restrictions. Specifically, we present a simple RL agent that implements an optimistic version of Q-learning and establish through regret analysis that this agent can operate with some level of competence in any environment. While we leverage concepts from the literature on provably efficient RL, we consider a general agent-environment interface and provide a novel agent design and analysis that further develop the concept of agent state, defined as the collection of information that the agent maintains in order to make decisions. This level of generality positions our results to inform the design of future agents for operation in complex real environments. We establish that, as time progresses, our agent performs competitively relative to policies that require longer times to evaluate. The time it takes to approach asymptotic performance is polynomial in the complexity of the agent's state representation and the time required to evaluate the best policy that the agent can represent; notably, there is no dependence on the complexity of the environment. The ultimate per-period performance loss is bounded by a constant multiple of a measure of the distortion introduced by the agent's state representation. Our work is the first to establish that an algorithm approaches this asymptotic condition within a tractable time frame, and the results presented in this dissertation resolve multiple open issues in approximate dynamic programming.
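The central ideas in the abstract, Q-learning driven by optimistic value estimates and computed on agent states rather than environmental states, can be illustrated with a minimal sketch. The class and parameter names below are illustrative assumptions, not the dissertation's exact algorithm or its regret-optimal parameter choices.

```python
import random
from collections import defaultdict

class OptimisticQAgent:
    """Minimal sketch of optimistic Q-learning over agent states.

    The agent never observes an environmental state; it acts on an
    `agent_state`, a summary it maintains from its own history.
    All constants here are hypothetical placeholders.
    """

    def __init__(self, actions, optimistic_value=1.0, step_size=0.1, discount=0.9):
        self.actions = actions
        # Optimistic initialization drives exploration: untried
        # (agent_state, action) pairs look maximally rewarding.
        self.q = defaultdict(lambda: optimistic_value)
        self.alpha = step_size
        self.gamma = discount

    def act(self, agent_state):
        # Greedy with respect to the optimistic Q-values; ties broken randomly.
        best = max(self.q[(agent_state, a)] for a in self.actions)
        return random.choice(
            [a for a in self.actions if self.q[(agent_state, a)] == best]
        )

    def update(self, agent_state, action, reward, next_agent_state):
        # Standard Q-learning target, evaluated on agent states.
        target = reward + self.gamma * max(
            self.q[(next_agent_state, a)] for a in self.actions
        )
        key = (agent_state, action)
        self.q[key] += self.alpha * (target - self.q[key])
```

Because the Q-table is keyed on agent states, the learning time scales with the size of the agent's state representation, not with the (possibly enormous) environment, which is the flavor of the guarantee the abstract describes.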
|Type of resource
|electronic resource; remote; computer; online resource
|1 online resource.
|Dong, Shi (researcher of reinforcement learning)
|Van Roy, Benjamin
|Degree committee member
|Stanford University, Department of Electrical Engineering
|Statement of responsibility
|Submitted to the Department of Electrical Engineering.
|Thesis Ph.D. Stanford University 2022.
- © 2022 by Shi Dong