Lightweight statistical learning : accelerating and avoiding empirical risk minimization

Abstract/Contents

Abstract
In statistical machine learning, the goal is to train a model that, once deployed in the world, continues to predict accurately on fresh data. A unifying training principle is empirical risk minimization (ERM): globally minimizing the average prediction error incurred on a representative training set. Essentially, ERM prescribes optimization as a proxy for statistical estimation. Learning tasks for which ERM is computationally tractable call for optimization algorithms that scale as efficiently as possible as training set dimensions grow. Other tasks---namely, neural network learning and learning with indirect supervision---elude a general, tractable ERM algorithm in theory, motivating a reformulation of the training problem altogether. In this thesis, we first focus on efficient algorithms for empirical risk minimization in the overdetermined, convex setting with fully-observed data, where ERM is tractable and the aim is optimal computational efficiency. We improve the guaranteed running time for a broad range of problems (including basic regression problems) in terms of the condition number induced by the underlying training set. Atop these methods, we develop an algorithm for principal component regression. Next, we move to settings where general tractability of ERM is not guaranteed, but where ERM still guides algorithm design in practice. Specifically, we study two learning frameworks: convolutional neural network learning, and learning conditional random fields from indirect supervision. In both contexts, the prevalent replacement for global ERM is gradient descent. Since descent can converge to arbitrarily bad local optima, initialization becomes all the more relevant. In each setting, we develop lightweight training algorithms based on convex procedures. For neural networks, we consider replacing optimization of hidden layers with randomization. For indirect supervision, we reduce estimation to solving a linear system followed by a convex maximum-likelihood step. Via the role of initialization, both algorithms imply conditions under which local descent ought to fare at least as well as they do.
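For reference, the abstract's description of ERM corresponds to the standard textbook formulation (a sketch in generic notation, not drawn from the thesis itself): given training examples $(x_1, y_1), \ldots, (x_n, y_n)$ sampled from the data distribution, a hypothesis class $\mathcal{H}$, and a loss $\ell$, ERM selects

\[
\hat{h} \in \operatorname*{arg\,min}_{h \in \mathcal{H}} \; \frac{1}{n} \sum_{i=1}^{n} \ell\big(h(x_i),\, y_i\big),
\]

as a computational proxy for minimizing the population risk $\mathbb{E}\big[\ell(h(x), y)\big]$ over fresh draws $(x, y)$.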

Description

Type of resource text
Form electronic; electronic resource; remote
Extent 1 online resource.
Publication date 2017
Issuance monographic
Language English

Creators/Contributors

Associated with Frostig, Roy
Associated with Stanford University, Computer Science Department.
Primary advisor Liang, Percy
Thesis advisor Duchi, John
Thesis advisor Valiant, Gregory

Subjects

Genre Theses

Bibliographic information

Statement of responsibility Roy Frostig.
Note Submitted to the Department of Computer Science.
Thesis Thesis (Ph.D.)--Stanford University, 2017.
Location electronic resource

Access conditions

Copyright
© 2017 by Roy Frostig
