Suboptimal control policies via convex optimization
- In this dissertation we consider several suboptimal policies for the stochastic control problem with discounted objective and full state information. In the general case, this problem is difficult to solve exactly. However, solutions can be found for certain special cases. When the state and action spaces are finite, for example, the problem is readily solved. Another such case is when the state and action spaces are finite-dimensional real vector spaces, the system dynamics are linear, the cost function is convex quadratic, and there are no constraints on the action or the state. In this case the optimal control policy is affine in the state variable, with coefficients that are readily computable [13,16,25,98]. One general method for finding the optimal control policy is dynamic programming (DP). DP relies on characterizing the value function of the stochastic control problem. The value function evaluated at a state is the expected cost incurred by an optimal policy starting from that state. The optimal policy can be written as an optimization problem involving the value function [13,16,18]. However, due to the 'curse of dimensionality', even representing the value function can be intractable when the state or action spaces are infinite, or, as a practical matter, when the number of states or actions is very large. Even in cases where the value function can be represented, evaluating the optimal policy may still be intractable. In such cases a common alternative is approximate dynamic programming (ADP) [19,143,179]. In ADP we replace the true value function with an approximate value function in the expression for the optimal policy. The goal is to choose the approximate value function so that the performance of the resulting policy is close to optimal, or at least, good.
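The linear-quadratic special case mentioned above can be sketched numerically. The following is an illustrative example, not material from the dissertation: all problem data is randomly generated, and the quadratic value function V(x) = x'Px (plus a constant due to the noise) is found by fixed-point iteration on the discounted Riccati equation, yielding a policy u = Kx that is linear in the state.

```python
import numpy as np

# Illustrative sketch (made-up data): the discounted LQR special case.
# Dynamics x_{t+1} = A x_t + B u_t + w_t, stage cost x'Qx + u'Ru,
# discount factor gamma. The value function is quadratic, V(x) = x'Px
# plus a constant from the noise, and the optimal policy is u = K x.
np.random.seed(0)
n, m, gamma = 3, 2, 0.95
A = np.random.randn(n, n) / np.sqrt(n)
B = np.random.randn(n, m)
Q, R = np.eye(n), np.eye(m)

P = np.eye(n)
for _ in range(1000):  # Riccati fixed-point iteration on the Bellman equation
    P_next = Q + gamma * A.T @ P @ A - gamma**2 * A.T @ P @ B @ np.linalg.solve(
        R + gamma * B.T @ P @ B, B.T @ P @ A)
    if np.max(np.abs(P_next - P)) < 1e-12:
        P = P_next
        break
    P = P_next

# Optimal gain: u = K x, with K readily computable from P
K = -gamma * np.linalg.solve(R + gamma * B.T @ P @ B, B.T @ P @ A)
```

Note that no constraints appear here; it is precisely the presence of constraints (or non-quadratic costs) that makes the exact solution intractable and motivates the ADP policies developed in the dissertation.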
In the first part of this dissertation we develop two ADP control policies, which we refer to as the min-max ADP policy and the iterated approximate value function (AVF) policy, respectively. Both of these policies rely on our ability to parameterize a family of lower bounds on the true value function of the stochastic control problem. The condition we use to parameterize our family of lower bounds is related to the 'linear programming approach' to ADP, which was first introduced by Manne in 1960, and extended to approximate dynamic programming in [47,159]. The basic idea is that any function which satisfies the Bellman inequality is a pointwise lower bound on the true value function [13,16]. The min-max ADP policy uses the pointwise supremum of the family of lower bounds as a surrogate value function; evaluating the control policy involves the solution of a min-max or saddle-point problem. For the iterated AVF policy we first develop a sequence of approximate value functions optimized along a trajectory of states; we then perform control at each time step using the corresponding member of the sequence. The trajectory and approximate value functions are generated simultaneously as the solution and dual variables of a single convex optimization problem. For the class of problems we consider, finding the control action under either policy requires solving a convex optimization problem. For the min-max ADP policy we solve a semidefinite program (SDP) [26,30] at each time step. The iterated AVF policy requires solving a single SDP offline, and then solving a much smaller convex problem at each iteration.

Model predictive control (MPC) is a widespread technique for generating a suboptimal control policy that often performs very well in practice [37,64,76,105,117,125,162,184]. In the second part of this dissertation we introduce a new algorithm for solving optimal control problems, of which model predictive control is a special case.
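The Bellman-inequality lower bound underlying both ADP policies can be illustrated on a toy finite MDP with made-up data (this is a generic sketch of the idea, not the dissertation's parameterization): any V satisfying V ≤ TV, where T is the Bellman operator, is a pointwise lower bound on the true value function V*. With nonnegative costs, every value-iteration iterate starting from V = 0 satisfies the Bellman inequality, so each is a valid lower bound.

```python
import numpy as np

# Toy sketch (hypothetical data) of the Bellman-inequality lower bound
# on a small finite MDP with a minimization objective. Any V with
# V <= T V (T the Bellman operator) lower-bounds the true V* pointwise.
np.random.seed(1)
nS, nA, gamma = 5, 3, 0.9
P = np.random.rand(nA, nS, nS)
P /= P.sum(axis=2, keepdims=True)   # transition probabilities P[a, s, s']
c = np.random.rand(nA, nS)          # nonnegative stage costs c[a, s]

def T(V):
    # Bellman operator: (T V)(s) = min_a [ c[a,s] + gamma * sum_s' P[a,s,s'] V(s') ]
    return np.min(c + gamma * P @ V, axis=0)

V_star = np.zeros(nS)
for _ in range(2000):               # value iteration: converges to V*
    V_star = T(V_star)

V_lb = np.zeros(nS)
for _ in range(10):                 # stopping early still gives a valid lower bound
    V_lb = T(V_lb)
```

The linear programming approach searches over such lower bounds directly, maximizing a weighted sum of V subject to the Bellman inequality as a set of linear constraints; the dissertation's policies use a parameterized family of these bounds in place of the exhaustive tabular representation.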
The algorithm, which we refer to as operator splitting for control (OSC), solves MPC problems quickly and robustly. In many cases the resulting algorithm can be implemented in fixed-point arithmetic and is thus suitable for embedded applications. The algorithm relies on an operator splitting technique, referred to as the alternating direction method of multipliers (ADMM), or as Douglas-Rachford (D-R) splitting [28,57,62,63,70].

The third part of this document investigates the efficacy and computational burden of the suboptimal policies we developed in earlier chapters through an in-depth multi-period portfolio optimization example [1,55,94,141,166,191]. We exhibit a lower bound on the performance (our problem statement involves minimizing an objective), which we compare to two of the policies detailed in the previous chapters. We present timing results and demonstrate that for several numerical instances the performance of both policies is very close to the lower bound, and thus very close to optimal.
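The ADMM iteration at the heart of such splitting methods can be sketched on a small box-constrained QP, the kind of subproblem that arises in MPC. This is an illustration of generic ADMM with made-up data, not the OSC algorithm itself: the objective splits as a quadratic f(x) plus the indicator g(z) of the box, with consensus constraint x = z.

```python
import numpy as np

# ADMM sketch (illustrative data; generic method, not OSC itself) for
#   minimize (1/2) x'Qx + q'x   subject to   l <= x <= u,
# split as f(x) = (1/2) x'Qx + q'x, g(z) = indicator of the box, x = z.
np.random.seed(2)
n, rho = 4, 1.0
M = np.random.randn(n, n)
Q = M @ M.T + np.eye(n)             # positive definite
q = np.random.randn(n)
l, u = -np.ones(n), np.ones(n)

x = z = y = np.zeros(n)
L = np.linalg.cholesky(Q + rho * np.eye(n))   # factor once, reuse each iteration
for _ in range(500):
    rhs = rho * (z - y) - q
    x = np.linalg.solve(L.T, np.linalg.solve(L, rhs))  # x-update: linear solve
    z = np.clip(x + y, l, u)                           # z-update: projection onto box
    y = y + x - z                                      # scaled dual update
```

The appeal for embedded MPC is visible even in this sketch: each iteration costs only a pre-factored linear solve, a clipping operation, and a vector update, all simple enough to run with fixed iteration counts on modest hardware.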
Stanford University, Department of Electrical Engineering
Boyd, Stephen P
Candès, Emmanuel J. (Emmanuel Jean)
Statement of responsibility: Submitted to the Department of Electrical Engineering.
Thesis (Ph.D.)--Stanford University, 2012.
- © 2012 by Brendan O'Donoghue
- This work is licensed under a Creative Commons Attribution Non Commercial 3.0 Unported license (CC BY-NC).