Interface design to enable efficient data curation for machine learning



Abstract
Machine learning (ML) models achieve performance on par with or exceeding human capabilities on benchmark tasks across numerous applications, spanning object detection, language modeling, question answering, and more. Unfortunately, translating the success of these models from benchmark datasets to real-world deployments is often bottlenecked by the curation of high-quality, domain-specific training datasets. It can take a domain expert weeks to months (if ever) to curate a labeled training dataset for a model that solves a specific business need. This poses a major scalability challenge, as a company's typical ML deployment can consist of hundreds of tasks, spanning different data modalities, that address several distinct yet overlapping business needs. Without training data of sufficient quality as input to the ML training procedure, the resulting models fail to perform well in practice. This dissertation argues that a key opportunity for optimization is to leverage shared information across tasks to expedite the data curation process: while companies deploy models for many tasks, these tasks are often related, and overlapping information can bootstrap data curation for new tasks. We demonstrate how defining and systematizing new abstractions enables practitioners to more efficiently curate datasets across varying numbers of data sources, schemas, and modalities for fixed classes of downstream ML tasks. We consider the problem of data curation across four regimes: a single data source; multiple data sources with the same schema and data modality; multiple data sources with heterogeneous schemas; and multiple data sources with more than one data modality. For each regime, we present our findings from designing and developing interfaces for efficient data curation for a specific class of downstream workload.
Concretely, we show how to: (1) perform whole-workload optimization to pre-process a single training dataset; (2) automatically enrich datasets with additional data sources of both the same and heterogeneous schemas; (3) transfer and reuse labels across modalities, reducing data curation overhead by up to several weeks at scale.

Description

Type of resource text
Form electronic resource; remote; computer; online resource
Extent 1 online resource.
Place [Stanford, California]
Publisher [Stanford University]
Copyright date ©2022
Publication date 2022
Issuance monographic
Language English

Creators/Contributors

Author Suri, Sahaana
Degree supervisor Ré, Christopher
Thesis advisor Ré, Christopher
Thesis advisor Bailis, Peter
Thesis advisor Olukotun, Oyekunle Ayinde
Degree committee member Bailis, Peter
Degree committee member Olukotun, Oyekunle Ayinde
Associated with Stanford University, Department of Electrical Engineering

Subjects

Genre Theses
Genre Text

Bibliographic information

Statement of responsibility Sahaana Suri.
Note Submitted to the Department of Electrical Engineering.
Thesis Thesis (Ph.D.)--Stanford University, 2022.
Location https://purl.stanford.edu/wt166vm9805

Access conditions

Copyright
© 2022 by Sahaana Suri
License
This work is licensed under a Creative Commons Attribution Non Commercial 3.0 Unported license (CC BY-NC).
