Automating knowledge distillation and representation from richly formatted data
Abstract/Contents
- Abstract
- Much of the information created today is scattered throughout books, magazines, web pages, academic papers, and other documents that cannot be directly queried in a structured way. To address this challenge, people are building structured knowledge bases to make this information more accessible. The process of populating knowledge bases from unstructured inputs is called knowledge base construction (KBC). Through KBC, we can make troves of valuable information more accessible. However, existing KBC systems have limited abilities to handle input data from a wide variety of formats, including unstructured text, tables, and figures, contained within highly variable file structures. This data heterogeneity makes automated, scalable KBC difficult to achieve in real-world scenarios. This dissertation focuses on automating and scaling the complex process of building KBC systems from heterogeneous data. In particular, we study both knowledge distillation and representation from richly formatted data, where information is expressed via combinations of textual, structural, tabular, and visual cues, as well as novel techniques for making these processes feasible in practice. This dissertation consists of two parts. In the first part, we aim to discover the fundamental building blocks of KBC from richly formatted data. We present Fonduer, a KBC system enabling extraction of information from richly formatted data. Fonduer automatically models richly formatted data and allows both users and machines to systematically retrieve all of a document's multimodal information in a programmatic way. This information can formalize multimodal signals for both training data generation via weak supervision and for augmenting deep learning models with multimodal features to perform task learning. In the second part of this dissertation, we study how building knowledge bases can be made feasible in practice. We investigate two crucial parts of Fonduer's pipeline: training data generation and task learning. We propose two systems. The first is Dauphin, a training data generation system that uses data augmentation to lower the cost of acquiring training data. The second is Emmental, a system for building multimodal, multi-task learning models to increase the efficiency of learning multiple tasks by leveraging the similarities among them. In Dauphin, we analyze the generalization effects of linear transformations in data augmentation and propose a method to strategically select the augmented data samples that provide the most information. In Emmental, we analyze and improve the information transfer in multi-task learning to make the task learning process more efficient and achieve higher quality. The three systems described in this dissertation (Fonduer, Dauphin, and Emmental) demonstrate that it is both possible and practical to build knowledge bases from richly formatted data. We believe that these innovations hold great promise for future KBC techniques.
Description
Type of resource | text |
---|---|
Form | electronic resource; remote; computer; online resource |
Extent | 1 online resource |
Place | California |
Place | [Stanford, California] |
Publisher | [Stanford University] |
Copyright date | ©2020 |
Publication date | 2020 |
Issuance | monographic |
Language | English |
Creators/Contributors
Author | Wu, Sen |
---|---|
Degree supervisor | Ré, Christopher |
Thesis advisor | Ré, Christopher |
Thesis advisor | Levis, Philip |
Thesis advisor | Olukotun, Oyekunle Ayinde |
Degree committee member | Levis, Philip |
Degree committee member | Olukotun, Oyekunle Ayinde |
Associated with | Stanford University, Computer Science Department. |
Subjects
Genre | Theses |
---|---|
Genre | Text |
Bibliographic information
Statement of responsibility | Sen Wu |
---|---|
Note | Submitted to the Computer Science Department |
Thesis | Thesis (Ph.D.), Stanford University, 2020 |
Location | electronic resource |
Access conditions
- Copyright
- © 2020 by Sen Wu
- License
- This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported license (CC BY-NC).