Theory and algorithms for data-centric machine learning
- Machine learning (ML) and AI have achieved remarkable, superhuman performance in a wide variety of domains: computer vision, natural language processing, and protein folding, to name but a few. Until recently, most advancements have taken a model-centric approach, focusing primarily on improved neural network architectures (ConvNets, ResNets, transformers, etc.) and optimization procedures for training these models (batch norm, dropout, neural architecture search, etc.). Relatively little attention has been paid to the data used to train these models, in spite of the well-known fact that ML is critically dependent on high-quality data, captured succinctly by the phrase "garbage in, garbage out." As the returns on ever larger and more complicated models diminish (e.g., the 530B-parameter MT-NLG model from NVIDIA and Microsoft), researchers have begun to realize the importance of taking a data-centric approach and developing principled methods for studying the fuel for these models: the data itself. Beyond improved task performance, a data-centric perspective also allows us to take socially critical considerations, such as data privacy, into account. In this thesis, we will take a critical look at several points in the ML data pipeline: before, during, and after model training. Before model training, we will explore the problem of data selection: which data should be used to train the model, and on what type of data should we expect our model to work? Moving into model training, we will turn our attention to two issues that can result from the interaction of our ML systems with the environment in which they are deployed. The first is data privacy: how can we prevent our models from leaking sensitive information about their training data? The second concerns the dynamic nature of some modeled populations. Especially when a model is used to make socially impactful decisions (e.g., automated loan approval or recommender systems), the model itself may alter the distribution of the data, leading to degraded performance. Lastly, even after following best practices before and during model training, we may want to post-process a model to remove the effects of certain data after training. How can this be achieved in a computationally efficient manner? This thesis presents novel solutions for each of the preceding problems, with an emphasis on provable guarantees for each of the proposed algorithms. By applying mathematical rigor to challenging real-world problems, we can develop algorithms that are both effective and trustworthy.
|Type of resource
|electronic resource; remote; computer; online resource
|1 online resource.
|Degree committee member
|Stanford University, School of Humanities and Sciences
|Stanford University, Department of Mathematics
|Statement of responsibility
|Zachary Luigi Edward Izzo.
|Submitted to the Department of Mathematics.
|Thesis (Ph.D.)--Stanford University, 2023.
- © 2023 by Zachary Izzo
- This work is licensed under a Creative Commons Attribution Non Commercial 3.0 Unported license (CC BY-NC).