Leveraging prior knowledge and structure for data-efficient machine learning
Abstract/Contents
- Abstract
- Building high-performing end-to-end machine learning systems primarily consists of developing the machine learning model and gathering high-quality training data for the application of interest, assuming one has access to the right hardware. Although machine learning models have become increasingly commoditized in the last few years with the rise of open-source platforms, curating high-quality labeled training datasets is still either costly or infeasible for many real-world applications. Hence, we mainly focus on data in this thesis, specifically how to (1) reduce dependence on labeled data with data-efficient machine learning methods, either by injecting domain-specific prior knowledge or by leveraging existing software systems and datasets that were initially created for different tasks, (2) effectively manage training data and build associated tooling in order to maximize the utility of the data, and (3) improve the quality of the data representations achieved by embeddings by matching the structure of the data to the geometry of the embedding space. We start by describing our work on building data-efficient machine learning methods for accelerated magnetic resonance imaging (MRI) reconstruction through physics-driven augmentations for consistency training, scale-equivariant unrolled neural networks, and weak supervision using untrained neural networks. Then, we describe our work on building data-efficient machine learning methods for natural language understanding. In particular, we discuss a supervised contrastive learning approach for pre-trained language model fine-tuning and a large-scale data augmentation method to retrieve in-domain data. On effectively managing training data, we discuss Glean, our proposed information extraction system for form-like documents, and focus on the often overlooked aspects of training data management and associated tooling. We highlight the importance of effectively managing training data by showing that it is at least as critical as machine learning model advances in terms of downstream extraction performance on a real-world dataset. Finally, to improve embedding representations for a variety of data types, we investigate spaces with heterogeneous curvature. We demonstrate that mixed-curvature representations provide higher-quality representations for both graphs and word embeddings. We also investigate integrating entity embeddings from the Wikidata knowledge graph into an abstractive text summarization model to enhance factuality.
Description
Type of resource | text |
---|---|
Form | electronic resource; remote; computer; online resource |
Extent | 1 online resource. |
Place | California |
Place | [Stanford, California] |
Publisher | [Stanford University] |
Copyright date | 2022; ©2022 |
Publication date | 2022 |
Issuance | monographic |
Language | English |
Creators/Contributors
Author | Gunel, Beliz |
---|---|
Degree supervisor | Pauly, John (John M.) |
Thesis advisor | Pauly, John (John M.) |
Thesis advisor | Chaudhari, Akshay |
Thesis advisor | Pilanci, Mert |
Thesis advisor | Vasanawala, Shreyas |
Degree committee member | Chaudhari, Akshay |
Degree committee member | Pilanci, Mert |
Degree committee member | Vasanawala, Shreyas |
Associated with | Stanford University, Department of Electrical Engineering |
Subjects
Genre | Theses |
---|---|
Genre | Text |
Bibliographic information
Statement of responsibility | Beliz Gunel. |
---|---|
Note | Submitted to the Department of Electrical Engineering. |
Thesis | Thesis Ph.D. Stanford University 2022. |
Location | https://purl.stanford.edu/sb560hz7613 |
Access conditions
- Copyright
- © 2022 by Beliz Gunel
- License
- This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported license (CC BY-NC 3.0).