Interface design to enable efficient data curation for machine learning



Abstract
Machine learning (ML) models achieve performance on par with or exceeding human capabilities on benchmark tasks across numerous applications, spanning object detection, language modeling, question answering, and more. Unfortunately, translating the success of these models from benchmark datasets to real-world deployments is often bottlenecked by the curation of high-quality, domain-specific training datasets. It can take a domain expert weeks to months (if ever) to curate a labeled training dataset for a model that solves a specific business need. This poses a major scalability challenge, as a company's typical ML deployment can consist of hundreds of tasks, spanning different data modalities, that address several distinct yet overlapping business needs. Without training data of sufficient quality as input to the ML training procedure, the resulting models fail to perform well in practice. This dissertation argues that a key opportunity for optimization is to leverage shared information across tasks to expedite the data curation process: while companies deploy models for many tasks, these tasks are often related, and overlapping information can bootstrap data curation for new tasks. We demonstrate how defining and systematizing new abstractions enables practitioners to more efficiently curate datasets across varying numbers of data sources, schemas, and modalities for fixed classes of downstream ML tasks. We consider the problem of data curation across four regimes: a single data source; multiple data sources with the same schema and data modality; multiple data sources with heterogeneous schemas; and multiple data sources with more than one data modality. For each regime, we present our findings from designing and developing interfaces for efficient data curation for a specific class of downstream workload.
Concretely, we show how to: (1) perform whole-workload optimization to pre-process a single training dataset; (2) automatically enrich datasets with additional data sources of both the same and heterogeneous schemas; (3) transfer and reuse labels across modalities, reducing data curation overhead by up to several weeks at scale.

Description

Type of resource text
Form electronic resource; remote; computer; online resource
Extent 1 online resource.
Place [Stanford, California]
Publisher [Stanford University]
Copyright date ©2022
Publication date 2022
Issuance monographic
Language English

Creators/Contributors

Author Suri, Sahaana
Degree supervisor Ré, Christopher
Thesis advisor Ré, Christopher
Thesis advisor Bailis, Peter
Thesis advisor Olukotun, Oyekunle Ayinde
Degree committee member Bailis, Peter
Degree committee member Olukotun, Oyekunle Ayinde
Associated with Stanford University, Department of Electrical Engineering

Subjects

Genre Theses
Genre Text

Bibliographic information

Statement of responsibility Sahaana Suri.
Note Submitted to the Department of Electrical Engineering.
Thesis Thesis (Ph.D.)--Stanford University, 2022.
Location https://purl.stanford.edu/wt166vm9805

Access conditions

Copyright
© 2022 by Sahaana Suri
License
This work is licensed under a Creative Commons Attribution Non Commercial 3.0 Unported license (CC BY-NC).
