Learning with neural networks in high dimensions
- A central challenge of modern statistics concerns the curse of dimensionality, which refers to the difficulty of learning in high dimensions due to an exponential increase in degrees of freedom. Over the past half-century, researchers have developed various methods to overcome this curse, including clever feature engineering and prior assumptions on the data structure. In recent years, however, deep learning has emerged as a standard and powerful approach to exploiting massive high-dimensional datasets, vastly outperforming previous handcrafted and domain-specific methods. Despite the blackbox nature of neural networks, they appear to avoid the curse of dimensionality by automatically learning relevant features and adapting to inherent structures in the data. However, understanding precisely the mechanism behind this capability remains an outstanding challenge. For example, it is unclear what relevant features neural networks can learn efficiently, how gradient-type algorithms construct these features dynamically, or how different resources (sample size, network width, and training time) should scale for solving different tasks. Historically, neural networks' approximation and statistical properties have been investigated separately from computational aspects. This gap leaves open the study of the subset of neural networks that are of practical interest, namely those efficiently constructed by gradient-based algorithms. In this thesis, we consider three models, corresponding to two-layer neural networks trained in three different optimization regimes, and investigate their properties in high dimensions. In particular, our aim is to precisely characterize in each setting how the performance of neural networks depends on the number of samples, neurons, and gradient descent iterations used in their training. Neural networks in the linear regime: Neural networks trained in this regime effectively behave as linear methods. Specifically, we study random feature and kernel regression, which correspond respectively to finite-width and infinite-width models for linearized neural networks. We derive sharp asymptotics for the test error in a new polynomial high-dimensional scaling, under certain abstract conditions on the underlying kernel. We show that these conditions are verified by some classical high-dimensional examples. These results allow us to precisely characterize the performance of neural networks in the linear regime, including optimal width and impact of network architecture. In addition, they explain phenomena such as multiple descents in the risk curve and benign overfitting. Neural networks in the mean-field regime: Training in this regime is non-linear and allows neural networks to perform feature learning, i.e., to construct features that are adapted to the data. As an instructive example, we consider the problem of learning multi-index functions on Boolean or isotropic Gaussian data. Those functions only depend on a latent low-dimensional subspace and are not learnable efficiently by kernel methods. In contrast, we show that in the mean-field regime, the training sequentially learns the function support with a saddle-to-saddle dynamic. The overall time complexity now depends on the target function's leap, which measures how "hierarchical" the function is, and not on the target function's smoothness as in the linear regime. In particular, this illustrates how non-linear training of neural networks can vastly outperform kernel methods by exploiting low-dimensional and hierarchical structures in the data to construct good features. Convex neural networks: We introduce a family of convex problems over the space of infinite-width two-layer neural networks. These can be seen as generalizations of kernel methods, whereby the RKHS norm---which can be seen as a weighted 2-norm over the second layer weights---is replaced by a weighted functional p-norm with p between 1 and 2. We first show that, for p strictly bigger than 1, these problems can be solved in polynomial time using a random feature approximation. This implies that convex neural networks are tractable methods that break the curse of dimensionality on an increasing set of functions as p decreases to 1. On the other hand, we argue that for p equal to 1, which achieves near-optimal statistical rates on multi-index functions, learning becomes computationally hard under some standard hardness assumptions. The picture that emerges from this study is that neural networks can realize a rich range of trade-offs between statistical and computational complexity. In particular, neural networks trained in the linear, mean-field, and convex regimes---which can be seen as implementing three different statistical learning paradigms (fixed features, feature learning, and feature selection respectively)---suffer very differently from the curse of dimensionality. While linearized neural networks are simple computational models, they are statistically inefficient in high dimensions, except for very smooth target functions. On the other hand, convex neural networks can achieve near statistical optimal rates on multi-index functions, but do not admit general tractable algorithms. Finally, non-linear training of neural networks can be seen as a tractable middle-ground between these two extremes. In particular, it outperforms fixed feature methods by allowing efficient construction of relevant features for data with low-dimensional and favorable hierarchical structure.
|Type of resource
|electronic resource; remote; computer; online resource
|1 online resource.
|Misiakiewicz, Theodor Jakub
|Donoho, David Leigh
|Degree committee member
|Donoho, David Leigh
|Degree committee member
|Stanford University, School of Humanities and Sciences
|Stanford University, Department of Statistics
|Statement of responsibility
|Submitted to the Department of Statistics.
|Thesis Ph.D. Stanford University 2023.
- © 2023 by Theodor Jakub Misiakiewicz
- This work is licensed under a Creative Commons Attribution Non Commercial 3.0 Unported license (CC BY-NC).
Also listed in
Loading usage metrics...