Leveraging prior knowledge and structure for data-efficient machine learning


Abstract/Contents

Abstract
Building high-performing end-to-end machine learning systems consists primarily of developing the machine learning model and gathering high-quality training data for the application of interest, assuming access to the right hardware. Although machine learning models have become increasingly commoditized in recent years with the rise of open-source platforms, curating high-quality labeled training datasets remains costly or infeasible for many real-world applications. This thesis therefore focuses on data: specifically, how to (1) reduce dependence on labeled data through data-efficient machine learning methods that either inject domain-specific prior knowledge or leverage existing software systems and datasets originally created for different tasks, (2) effectively manage training data and build the associated tooling needed to maximize the utility of the data, and (3) improve the quality of embedding representations by matching the structure of the data to the geometry of the embedding space.

We begin by describing our work on data-efficient machine learning methods for accelerated magnetic resonance imaging (MRI) reconstruction: physics-driven augmentations for consistency training, scale-equivariant unrolled neural networks, and weak supervision using untrained neural networks. We then describe our work on data-efficient methods for natural language understanding; in particular, we discuss a supervised contrastive learning approach for fine-tuning pre-trained language models and a large-scale data augmentation method for retrieving in-domain data.

On effectively managing training data, we present Glean, our proposed information extraction system for form-like documents, and focus on the often-overlooked aspects of training data management and associated tooling. We highlight the importance of effective training data management by showing that it is at least as critical as advances in the machine learning model for downstream extraction performance on a real-world dataset.

Finally, to improve embedding representations for a variety of data types, we investigate spaces with heterogeneous curvature and demonstrate that mixed-curvature spaces provide higher-quality representations for both graphs and word embeddings. We also investigate integrating entity embeddings from the Wikidata knowledge graph into an abstractive text summarization model to enhance factuality.
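For concreteness, the supervised contrastive fine-tuning approach mentioned above typically augments the standard cross-entropy loss with a supervised contrastive term computed over a training batch. The following is a minimal sketch of such an objective in LaTeX, with notation of our choosing (the exact formulation in the thesis may differ):

\[
\mathcal{L} = (1 - \lambda)\, \mathcal{L}_{\mathrm{CE}} + \lambda\, \mathcal{L}_{\mathrm{SCL}},
\qquad
\mathcal{L}_{\mathrm{SCL}} = \sum_{i=1}^{N} \frac{-1}{N_{y_i} - 1}
\sum_{\substack{j=1,\; j \neq i \\ y_j = y_i}}^{N}
\log \frac{\exp\!\left( \Phi(x_i) \cdot \Phi(x_j) / \tau \right)}
          {\sum_{k \neq i} \exp\!\left( \Phi(x_i) \cdot \Phi(x_k) / \tau \right)}
\]

Here \(\Phi(\cdot)\) is the encoder's representation of an input, \(\tau > 0\) is a temperature hyperparameter, \(N_{y_i}\) is the number of examples in the batch sharing label \(y_i\), and \(\lambda \in [0, 1]\) trades off the cross-entropy and contrastive terms.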

Description

Type of resource text
Form electronic resource; remote; computer; online resource
Extent 1 online resource.
Place [Stanford, California]
Publisher [Stanford University]
Copyright date ©2022
Publication date 2022
Issuance monographic
Language English

Creators/Contributors

Author Gunel, Beliz
Degree supervisor Pauly, John (John M.)
Thesis advisor Pauly, John (John M.)
Thesis advisor Chaudhari, Akshay
Thesis advisor Pilanci, Mert
Thesis advisor Vasanawala, Shreyas
Degree committee member Chaudhari, Akshay
Degree committee member Pilanci, Mert
Degree committee member Vasanawala, Shreyas
Associated with Stanford University, Department of Electrical Engineering

Subjects

Genre Theses
Genre Text

Bibliographic information

Statement of responsibility Beliz Gunel.
Note Submitted to the Department of Electrical Engineering.
Thesis Ph.D., Stanford University, 2022.
Location https://purl.stanford.edu/sb560hz7613

Access conditions

Copyright
© 2022 by Beliz Gunel
License
This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported license (CC BY-NC 3.0).
