Efficient second-order methods for machine learning
- Due to the large-scale nature of many modern machine learning applications, including but not limited to deep learning problems, people have been focusing on studying and developing efficient optimization algorithms. Most of these are first-order methods which use only gradient information. The conventional wisdom in the machine learning community is that second-order methods that use Hessian information are inappropriate to use since they can not be efficient. In this thesis, we consider second-order optimization methods: we develop new sub-sampled Newton-type algorithms for both convex and non-convex optimization problems; we prove that they are efficient and scalable; and we provide a detailed empirical evaluation of their scalability as well as usefulness. In the convex setting, we present a subsampled Newton-type algorithm (SSN) that exploits non-uniform subsampling Hessians as well as inexact updates to reduce the computational complexity. Theoretically we show that our algorithms achieve a linear-quadratic convergence rate and empirically we demonstrate the efficiency of our methods on several real datasets. In addition, we extend our methods into a distributed setting and propose a distributed Newton-type method, Globally Improved Approximate NewTon method (GIANT). Theoretically we show that GIANT is highly communication efficient comparing with existing distributed optimization algorithms. Empirically we demonstrate the scalability and efficiency of GIANT in Spark. In the non-convex setting, we consider two classic non-convex Newton-type methods --- Trust Region method (TR) and Cubic Regularization method (CR). We relax the Hessian approximation condition that has been assumed in the existing works of using inexact Hessian for those algorithms. Under the relaxed Hessian approximation condition, we show that worst-case iteration complexities to converge an approximate second-order stationary point are retained for both methods. Using the similar idea of SSN, we present the sub-sampled TR and CR methods along with the sampling complexities to achieve the Hessian approximation condition. To understand the empirical performances of those methods, we conduct an extensive empirical study on some non-convex machine learning problems and showcase the efficiency and robustness of these Newton-type methods under various settings.
|Type of resource
|electronic resource; remote; computer; online resource
|1 online resource.
|Degree committee member
|Degree committee member
|Stanford University, Institute for Computational and Mathematical Engineering.
|Statement of responsibility
|Submitted to the Institute for Computational and Mathematical Engineering.
|Thesis Ph.D. Stanford University 2018.
- © 2018 by Peng Xu
- This work is licensed under a Creative Commons Attribution Non Commercial 3.0 Unported license (CC BY-NC).
Also listed in
Loading usage metrics...