Theory and algorithms for data-centric machine learning

Abstract/Contents

Abstract
Machine learning (ML) and AI have achieved remarkable, superhuman performance in a wide variety of domains: computer vision, natural language processing, and protein folding, to name but a few. Until recently, most advancements have taken a model-centric approach, focusing primarily on improved neural network architectures (ConvNets, ResNets, transformers, etc.) and optimization procedures for training these models (batch norm, dropout, neural architecture search, etc.). Relatively less attention has been paid to the data used to train these models, in spite of the well-known fact that ML is critically dependent on high-quality data, captured succinctly by the phrase "garbage in, garbage out." As the returns on ever larger and more complicated models diminish (e.g., MT-NLG from Nvidia and Microsoft has 530B parameters), researchers have begun to realize the importance of taking a data-centric approach and developing principled methods for studying the fuel for these models: the data itself. Beyond improved task performance, a data-centric perspective also allows us to take socially critical considerations, such as data privacy, into account. In this thesis, we take a critical look at several points in the ML data pipeline: before, during, and after model training. Before model training, we explore the problem of data selection: which data should be used to train the model, and on what type of data should we expect our model to work? Moving into model training, we turn our attention to two issues that can result from the interaction of our ML systems with the environment in which they are deployed. The first issue is data privacy: how can we prevent our models from leaking sensitive information about their training data? The second issue concerns the dynamic nature of some modeled populations.
Especially when our model is used to make socially impactful decisions (e.g., automated loan approval or recommender systems), the model itself may impact the distribution of the data, leading to degraded performance. Lastly, despite following best practices before and during model training, we may want to post-process a model to remove the effects of certain data after training. How can this be achieved in a computationally efficient manner? This thesis presents novel solutions for each of the preceding problems, with an emphasis on provable guarantees for each of the proposed algorithms. By applying mathematical rigor to challenging real-world problems, we can develop algorithms that are both effective and trustworthy.

Description

Type of resource text
Form electronic resource; remote; computer; online resource
Extent 1 online resource.
Place California
Place [Stanford, California]
Publisher [Stanford University]
Copyright date 2023; ©2023
Publication date 2023
Issuance monographic
Language English

Creators/Contributors

Author Izzo, Zachary
Degree supervisor Ying, Lexing
Degree supervisor Zou, James
Thesis advisor Ying, Lexing
Thesis advisor Zou, James
Thesis advisor Chatterjee, Sourav
Degree committee member Chatterjee, Sourav
Associated with Stanford University, School of Humanities and Sciences
Associated with Stanford University, Department of Mathematics

Subjects

Genre Theses
Genre Text

Bibliographic information

Statement of responsibility Zachary Luigi Edward Izzo.
Note Submitted to the Department of Mathematics.
Thesis Thesis (Ph.D.)--Stanford University, 2023.
Location https://purl.stanford.edu/th037nx8240

Access conditions

Copyright
© 2023 by Zachary Izzo
License
This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported license (CC BY-NC 3.0).
