When do gradient methods work well in non-convex learning problems?
Abstract/Contents
- Abstract
- Classical learning theory advocates convex losses in machine learning because of the computational efficiency they guarantee, yet recent successes in deep learning suggest that highly non-convex losses can often be minimized efficiently by simple gradient-based algorithms, a phenomenon not well explained by existing theory. Towards closing this gap, this thesis proposes and instantiates landscape analysis, a theoretical framework for studying the optimization of non-convex learning problems by analyzing optimization-aware geometric properties of the empirical risk. In the first part of the thesis, we establish a generic property of the optimization landscape of smooth non-convex empirical risks: their gradients and Hessians concentrate uniformly around those of the population risk at a statistical rate. Consequently, benign landscape properties of the population risk carry over to the empirical risk when the sample size is sufficiently large. We apply this principle to show that gradient descent succeeds in solving several non-convex empirical risk minimization problems, including non-convex binary classification (and its sparse version), robust linear regression, and Gaussian mixture estimation. In the second part, we apply landscape analysis to the orthogonal dictionary learning problem, solved by minimizing a non-convex, non-smooth L1 risk over the sphere, an approach known to work well empirically but not well understood in theory. Standard uniform convergence analysis fails on this problem because the subgradients are set-valued and non-Lipschitz. We develop tools, building on the Hausdorff distance between sets and a novel sign-aware metric customized for the dictionary learning problem, to handle these issues and establish uniform convergence of the subgradients. We show that the landscape of the non-smooth empirical risk is benign, and that subgradient descent succeeds in recovering the dictionary with a lower sample complexity than existing approaches based on non-convex optimization.
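The second part's approach, minimizing a non-smooth L1 risk over the sphere with subgradient descent, can be sketched as Riemannian subgradient descent: take a Euclidean subgradient, project it onto the tangent space, step, and renormalize. The following NumPy sketch is an illustration only; the step-size schedule, iteration count, and data model (identity dictionary, Bernoulli-Gaussian sparse coefficients) are assumptions for the demo, not the thesis's prescribed choices.

```python
import numpy as np

def dictionary_column_recovery(Y, steps=1000, lr0=0.3, decay=0.995, seed=0):
    """Riemannian subgradient descent for min_{||q||=1} (1/m) * sum_i |q^T y_i|.

    Sketch under assumed parameters; returns a unit vector that, for sparse
    data Y = D X with orthogonal D, should land near a signed column of D.
    """
    rng = np.random.default_rng(seed)
    n, m = Y.shape
    q = rng.standard_normal(n)
    q /= np.linalg.norm(q)                 # random initialization on the sphere
    for k in range(steps):
        g = Y @ np.sign(Y.T @ q) / m       # Euclidean subgradient of the L1 risk
        g -= (g @ q) * q                   # project onto the tangent space at q
        q -= lr0 * decay**k * g            # geometrically decaying step (assumed schedule)
        q /= np.linalg.norm(q)             # retract back onto the sphere
    return q
```

Running the routine from several random initializations recovers different columns of the dictionary, one per benign local minimizer of the landscape.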
Description
Type of resource | text |
---|---|
Form | electronic resource; remote; computer; online resource |
Extent | 1 online resource |
Place | California |
Place | [Stanford, California] |
Publisher | [Stanford University] |
Copyright date | ©2019 |
Publication date | 2019 |
Issuance | monographic |
Language | English |
Creators/Contributors
Author | Bai, Yu (Statistician) |
---|---|
Degree supervisor | Duchi, John |
Thesis advisor | Duchi, John |
Thesis advisor | Candès, Emmanuel J. (Emmanuel Jean) |
Thesis advisor | Montanari, Andrea |
Degree committee member | Candès, Emmanuel J. (Emmanuel Jean) |
Degree committee member | Montanari, Andrea |
Associated with | Stanford University, Department of Statistics. |
Subjects
Genre | Theses |
---|---|
Genre | Text |
Bibliographic information
Statement of responsibility | Yu Bai |
---|---|
Note | Submitted to the Department of Statistics |
Thesis | Thesis (Ph.D.)--Stanford University, 2019 |
Location | electronic resource |
Access conditions
- Copyright
- © 2019 by Yu Bai
- License
- This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported license (CC BY-NC 3.0).