Programmatic weak supervision to label training data

Abstract/Contents

Abstract
As deep learning models are applied to increasingly diverse problems, a key bottleneck is gathering enough high-quality training labels tailored to each task. Manually labeling data at this scale is slow and expensive, and the resulting labels are static: they must be redone if the task changes or new training data is introduced. Users therefore turn to weak supervision, relying on imperfect sources of labels such as pattern matching and user-defined heuristics to label training data efficiently, albeit noisily. Recently, generative models have been used to model and combine the outputs of these sources to assign high-quality training labels to unlabeled data across various domains. However, current work treats these weak supervision sources as black boxes, which limits the ability to further simplify and automate the process of labeling training data. In this dissertation, we open the black box of supervision sources by combining tools from traditional programming languages with statistical algorithms to improve theoretical guarantees and build weak supervision systems that generate labels to train state-of-the-art models. We begin by studying the role of domain-specific primitives, interpretable building blocks of data, which simplify the process of developing supervision signals. We then address the complex dependencies that arise from sharing these primitives: we present a method to learn correlations among supervision sources using static analysis over the programs that define the sources, and an additional statistical method that can capture correlations not evident from the source code. By gaining access to the reasoning and logic behind supervision sources, we then design a system inspired by program synthesis to automatically generate supervision sources, and provide theoretical guarantees for the accuracy of these signals.
To validate the benefits of opening the black box of supervision, we apply our methods across applications from relation extraction to image classification. We build systems that use supervision sources developed in conjunction with cardiologists to label real-world population-level medical datasets. We use our program synthesis-based approach to automatically complete benchmark visual knowledge bases by labeling rare relationships without requiring additional manual labeling. This body of work demonstrates how traditional programming language analysis methods, combined with the appropriate statistical tools, can enrich weak supervision systems by improving theoretical guarantees and empirical performance on real-world tasks.

Description

Type of resource text
Form electronic resource; remote; computer; online resource
Extent 1 online resource.
Place California
Place [Stanford, California]
Publisher [Stanford University]
Copyright date 2019; ©2019
Publication date 2019
Issuance monographic
Language English

Creators/Contributors

Author Varma, Paroma
Degree supervisor Ré, Christopher
Thesis advisor Ré, Christopher
Thesis advisor Garcia-Molina, Hector
Thesis advisor Olukotun, Oyekunle Ayinde
Degree committee member Garcia-Molina, Hector
Degree committee member Olukotun, Oyekunle Ayinde
Associated with Stanford University, Department of Electrical Engineering.

Subjects

Genre Theses
Genre Text

Bibliographic information

Statement of responsibility Paroma Varma.
Note Submitted to the Department of Electrical Engineering.
Thesis Thesis Ph.D. Stanford University 2019.
Location electronic resource

Access conditions

Copyright
© 2019 by Paroma Varma
License
This work is licensed under a Creative Commons Attribution Non Commercial 3.0 Unported license (CC BY-NC).
