Interface design to enable efficient data curation for machine learning
Abstract/Contents
- Abstract
- Machine learning (ML) models achieve performance on par with or exceeding human capabilities on benchmark tasks across numerous applications, spanning object detection, language modeling, question answering, and more. Unfortunately, translating the success of these models from benchmark datasets to real-world deployments is often bottlenecked by the curation of high-quality, domain-specific training datasets. It can take a domain expert weeks to months (if ever) to curate a labeled training dataset for a model that solves a specific business need. This poses a major scalability challenge, as a company's typical ML deployment can consist of hundreds of tasks, spanning different data modalities, that address several distinct yet overlapping business needs. Without training data of sufficient quality to feed the ML training procedure, the resulting models fail to perform well in practice. This dissertation argues that a key opportunity for optimization is to leverage shared information across tasks to expedite the data curation process: while companies deploy models for many tasks, those tasks are often related, and overlapping information can bootstrap data curation for new tasks. We demonstrate how defining and systematizing new abstractions enables practitioners to more efficiently curate datasets across varying numbers of data sources, schemas, and modalities for fixed classes of downstream ML tasks. We consider the problem of data curation across four regimes: a single data source; multiple data sources with the same schema and data modality; multiple data sources with heterogeneous schemas; and multiple data sources spanning more than one data modality. For each regime, we present our findings from designing and developing interfaces for efficient data curation for a specific class of downstream workload.
- Concretely, we show how to: (1) perform whole-workload optimization to pre-process a single training dataset; (2) automatically enrich datasets with additional data sources of both the same and heterogeneous schemas; (3) transfer and reuse labels across modalities, saving up to several weeks of data curation overhead at scale.
Description
Type of resource | text
---|---
Form | electronic resource; remote; computer; online resource
Extent | 1 online resource.
Place | California
Place | [Stanford, California]
Publisher | [Stanford University]
Copyright date | ©2022
Publication date | 2022
Issuance | monographic
Language | English
Creators/Contributors
Author | Suri, Sahaana
---|---
Degree supervisor | Ré, Christopher
Thesis advisor | Ré, Christopher
Thesis advisor | Bailis, Peter
Thesis advisor | Olukotun, Oyekunle Ayinde
Degree committee member | Bailis, Peter
Degree committee member | Olukotun, Oyekunle Ayinde
Associated with | Stanford University, Department of Electrical Engineering
Subjects
Genre | Theses
---|---
Genre | Text
Bibliographic information
Statement of responsibility | Sahaana Suri.
---|---
Note | Submitted to the Department of Electrical Engineering.
Thesis | Thesis (Ph.D.)--Stanford University, 2022.
Location | https://purl.stanford.edu/wt166vm9805
Access conditions
- Copyright
- © 2022 by Sahaana Suri
- License
- This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported license (CC BY-NC 3.0).