Theory and algorithms for data-centric machine learning

Izzo, Zachary

Abstract/Contents

Abstract: Machine learning (ML) and AI have achieved remarkable, super-human performance in a wide variety of domains: computer vision, natural language processing, and protein folding, to name but a few. Until recently, most advancements have taken a model-centric approach, focusing primarily on improved neural network architectures (ConvNets, ResNets, transformers, etc.) and optimization procedures for training these models (batch norm, dropout, neural architecture search, etc.). Relatively less attention has been paid to the data used to train these models, in spite of the well-known fact that ML is critically dependent on high-quality data, captured succinctly with the phrase "garbage in, garbage out." As the returns on ever larger and more complicated models diminish (MT-NLG from Nvidia and Microsoft having 530B parameters), researchers have begun to realize the importance of taking a data-centric approach and developing principled methods for studying the fuel for these models: the data itself. Beyond improved task performance, a data-centric perspective also allows us to take socially critical considerations, such as data privacy, into account. In this thesis, we will take a critical look at several points in the ML data pipeline: before, during, and after model training. Before model training, we will explore the problem of data selection: which data should be used to train the model, and on what type of data should we expect our model to work? As we move forward into model training, we will turn our attention to two issues which can result from the interaction of our ML systems with the environment in which they are deployed. The first issue is that of data privacy: how can we prevent our models from leaking sensitive information about their training data? The second issue concerns the dynamic nature of some modeled populations. Especially when our model is used to make socially impactful decisions (e.g., automated loan approval or recommender systems), the model itself may impact the distribution of the data, leading to degraded performance. Lastly, despite following best practices before and during model training, it may be the case that we want to post-process a model to remove the effects of certain data after training. How can this be achieved in a computationally efficient manner? This thesis covers novel solutions for each of the preceding problems, with an emphasis on the provable guarantees for each of the proposed algorithms. By applying mathematical rigor to challenging real-world problems, we can develop algorithms which are both effective and trustworthy.

Description

Type of resource	text
Form	electronic resource; remote; computer; online resource
Extent	1 online resource.
Place	California
Place	[Stanford, California]
Publisher	[Stanford University]
Copyright date	2023; ©2023
Publication date	2023; 2023
Issuance	monographic
Language	English

Creators/Contributors

Author	Izzo, Zachary
Degree supervisor	Ying, Lexing
Degree supervisor	Zou, James
Thesis advisor	Ying, Lexing
Thesis advisor	Zou, James
Thesis advisor	Chatterjee, Sourav
Degree committee member	Chatterjee, Sourav
Associated with	Stanford University, School of Humanities and Sciences
Associated with	Stanford University, Department of Mathematics

Subjects

Genre	Theses
Genre	Text

Bibliographic information

Statement of responsibility	Zachary Luigi Edward Izzo.
Note	Submitted to the Department of Mathematics.
Thesis	Thesis Ph.D. Stanford University 2023.
Location	https://purl.stanford.edu/th037nx8240

Access conditions

License: This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported license (CC BY-NC 3.0).
