Programmatic weak supervision to label training data

Varma, Paroma

Abstract/Contents

Abstract: As deep learning models are applied to increasingly diverse problems, a key bottleneck is gathering enough high-quality training labels tailored to each task. Manually labeling this much data is slow and expensive, and the resulting labels are static, requiring relabeling if the task changes or new training data is introduced. Users therefore turn to weak supervision, relying on imperfect sources of labels like pattern matching and user-defined heuristics to efficiently, albeit noisily, label training data. Recently, generative models have been used to model and combine the outputs of these sources to assign high-quality training labels to unlabeled data across various domains. However, current work treats these weak supervision sources as black boxes, which limits the ability to further simplify and automate the process of labeling training data. In this dissertation, we open the black box of supervision sources by combining tools from traditional programming languages with statistical algorithms to improve theoretical guarantees and build weak supervision systems that generate labels to train state-of-the-art models. We begin by studying the role of domain-specific primitives, interpretable building blocks of data, which simplify the process of developing supervision signals. We then address the issue of complex dependencies that arise as a result of sharing these primitives and present a method to learn correlations among supervision sources using static analysis over the programs that define the sources, and an additional statistical method that can capture correlations not evident from the source code. By gaining access to the reasoning and logic behind supervision sources, we then design a system inspired by program synthesis to automatically generate supervision sources, and provide theoretical guarantees for the accuracy of these signals.
To validate the benefits of opening the black box of supervision, we apply our methods across applications from relation extraction to image classification. We build systems that use supervision sources developed in conjunction with cardiologists to label real-world population-level medical datasets. We use our program synthesis-based approach to automatically complete benchmark visual knowledge bases by labeling rare relationships without requiring additional manual labeling. This body of work demonstrates how traditional programming language analysis methods, combined with the appropriate statistical tools, can enrich weak supervision systems by improving theoretical guarantees and empirical performance on real-world tasks.

Description

Type of resource	text
Form	electronic resource; remote; computer; online resource
Extent	1 online resource.
Place	California
Place	[Stanford, California]
Publisher	[Stanford University]
Copyright date	2019; ©2019
Publication date	2019
Issuance	monographic
Language	English

Creators/Contributors

Author	Varma, Paroma
Degree supervisor	Ré, Christopher
Thesis advisor	Ré, Christopher
Thesis advisor	Garcia-Molina, Hector
Thesis advisor	Olukotun, Oyekunle Ayinde
Degree committee member	Garcia-Molina, Hector
Degree committee member	Olukotun, Oyekunle Ayinde
Associated with	Stanford University, Department of Electrical Engineering.

Subjects

Genre	Theses
Genre	Text

Bibliographic information

Statement of responsibility	Paroma Varma.
Note	Submitted to the Department of Electrical Engineering.
Thesis	Thesis Ph.D. Stanford University 2019.
Location	electronic resource

Access conditions

License: This work is licensed under a Creative Commons Attribution Non Commercial 3.0 Unported license (CC BY-NC).
