Leveraging prior knowledge and structure for data-efficient machine learning


Abstract/Contents

Abstract
Building high-performing end-to-end machine learning systems consists primarily of developing the machine learning model and gathering high-quality training data for the application of interest, assuming access to the right hardware. Although machine learning models have become increasingly commoditized in recent years with the rise of open-source platforms, curating high-quality labeled training datasets remains costly or infeasible for many real-world applications. This thesis therefore focuses on data: specifically, how to (1) reduce dependence on labeled data through data-efficient machine learning methods that either inject domain-specific prior knowledge or leverage existing software systems and datasets originally created for different tasks, (2) effectively manage training data and build the associated tooling needed to maximize the utility of the data, and (3) improve the quality of embedding representations by matching the structure of the data to the geometry of the embedding space.

We begin by describing our work on data-efficient machine learning methods for accelerated magnetic resonance imaging (MRI) reconstruction: physics-driven augmentations for consistency training, scale-equivariant unrolled neural networks, and weak supervision using untrained neural networks. We then describe our work on data-efficient methods for natural language understanding; in particular, we discuss a supervised contrastive learning approach for fine-tuning pre-trained language models and a large-scale data augmentation method for retrieving in-domain data.

On effectively managing training data, we present Glean, our proposed information extraction system for form-like documents, and focus on the often-overlooked aspects of training data management and associated tooling. We highlight the importance of effective training data management by showing that it is at least as critical as advances in the machine learning model for downstream extraction performance on a real-world dataset.

Finally, to improve embedding representations for a variety of data types, we investigate spaces with heterogeneous curvature and demonstrate that mixed-curvature spaces provide higher-quality representations for both graphs and word embeddings. We also investigate integrating entity embeddings from the Wikidata knowledge graph into an abstractive text summarization model to enhance factuality.
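For concreteness, the supervised contrastive fine-tuning approach mentioned above typically augments the standard cross-entropy loss with a supervised contrastive term computed over a training batch. The following is a minimal sketch of such an objective in LaTeX, with notation of our choosing (the exact formulation in the thesis may differ):

\[
\mathcal{L} = (1 - \lambda)\, \mathcal{L}_{\mathrm{CE}} + \lambda\, \mathcal{L}_{\mathrm{SCL}},
\qquad
\mathcal{L}_{\mathrm{SCL}} = \sum_{i=1}^{N} \frac{-1}{N_{y_i} - 1}
\sum_{\substack{j=1,\; j \neq i \\ y_j = y_i}}^{N}
\log \frac{\exp\!\left( \Phi(x_i) \cdot \Phi(x_j) / \tau \right)}
          {\sum_{k \neq i} \exp\!\left( \Phi(x_i) \cdot \Phi(x_k) / \tau \right)}
\]

Here \(\Phi(\cdot)\) is the encoder's representation of an input, \(\tau > 0\) is a temperature hyperparameter, \(N_{y_i}\) is the number of examples in the batch sharing label \(y_i\), and \(\lambda \in [0, 1]\) trades off the cross-entropy and contrastive terms.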

Description

Type of resource text
Form electronic resource; remote; computer; online resource
Extent 1 online resource.
Place [Stanford, California]
Publisher [Stanford University]
Copyright date ©2022
Publication date 2022
Issuance monographic
Language English

Creators/Contributors

Author Gunel, Beliz
Degree supervisor Pauly, John (John M.)
Thesis advisor Pauly, John (John M.)
Thesis advisor Chaudhari, Akshay
Thesis advisor Pilanci, Mert
Thesis advisor Vasanawala, Shreyas
Degree committee member Chaudhari, Akshay
Degree committee member Pilanci, Mert
Degree committee member Vasanawala, Shreyas
Associated with Stanford University, Department of Electrical Engineering

Subjects

Genre Theses
Genre Text

Bibliographic information

Statement of responsibility Beliz Gunel.
Note Submitted to the Department of Electrical Engineering.
Thesis Ph.D., Stanford University, 2022.
Location https://purl.stanford.edu/sb560hz7613

Access conditions

Copyright
© 2022 by Beliz Gunel
License
This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported license (CC BY-NC 3.0).
