When do gradient methods work well in non-convex learning problems?

Bai, Yu, (Statistician)

When do gradient methods work well in non-convex learning problems?

<a href="https://embed.stanford.edu/iframe/?url=https%3A%2F%2Fpurl.stanford.edu%2Fsm107sv3158" class="su-underline">Show Content</a>

Abstract/Contents

Abstract: Classical learning theory advocates the use of convex losses in machine learning due to its guaranteed computational efficiency, yet recent success in deep learning suggests that highly non-convex losses can often be efficiently minimized by simple gradient-based algorithms, a phenomenon not well explained by existing theory. Towards closing this gap, this thesis proposes and instantiates landscape analyses, a theoretical framework for studying the optimization of non-convex learning problems via analyzing optimization-aware geometric properties of the empirical risk. In the first part of the thesis, we establish a generic connection between the optimization landscape of smooth non-convex empirical risks, namely, that their gradients and Hessians concentrate uniformly around those of the population risk with statistical rate. Consequently, benign landscape properties on the population risk can be carried onto the empirical risk as well when the sample size is sufficiently large. We apply this principle to showing that gradient descent succeeds in solving several non-convex empirical risk minimization problems, including non-convex binary classification (and its sparse version), robust linear regression, and Gaussian mixture estimation. In the second part, we adopt landscape analysis to solving the orthogonal dictionary learning problem via minimizing a non-convex non-smooth L1 risk on the sphere, which was shown to work well empirically but not well understood in theory. Standard uniform convergence analysis fails on this problem as the subgradients are set-valued and non-Lipschitz. We develop tools, building on Hausdorff distance between sets and a novel sign-aware metric customized for the dictionary learning problem, to handle the above issues and establish the uniform convergence of subgradients. We show that the landscape of the non-smooth empirical risk is benign and subgradient descent succeeds in recovering the dictionary with required sample complexity lower than existing approaches based on non-convex optimization

Description

Type of resource	text
Form	electronic resource; remote; computer; online resource
Extent	1 online resource
Place	California
Place	[Stanford, California]
Publisher	[Stanford University]
Copyright date	2019; ©2019
Publication date	2019; 2019
Issuance	monographic
Language	English

Creators/Contributors

Author	Bai, Yu, (Statistician)
Degree supervisor	Duchi, John
Thesis advisor	Duchi, John
Thesis advisor	Candès, Emmanuel J. (Emmanuel Jean)
Thesis advisor	Montanari, Andrea
Degree committee member	Candès, Emmanuel J. (Emmanuel Jean)
Degree committee member	Montanari, Andrea
Associated with	Stanford University, Department of Statistics.

Subjects

Genre	Theses
Genre	Text

Bibliographic information

Statement of responsibility	Yu Bai
Note	Submitted to the Department of Statistics
Thesis	Thesis Ph.D. Stanford University 2019
Location	electronic resource

Access conditions

License: This work is licensed under a Creative Commons Attribution Non Commercial 3.0 Unported license (CC BY-NC).

Also listed in

View in SearchWorks

Loading usage metrics...