Automating knowledge distillation and representation from richly formatted data


Abstract/Contents

Abstract
Much of the information created today is scattered throughout books, magazines, web pages, academic papers, and other documents that cannot be directly queried in a structured way. To address this challenge, people are building structured knowledge bases to make this information more accessible. The process of populating knowledge bases from unstructured inputs is called knowledge base construction (KBC). Through KBC, we can make troves of valuable information more accessible. However, existing KBC systems have limited ability to handle input data that spans a wide variety of formats, including unstructured text, tables, and figures, contained within highly variable file structures. This data heterogeneity makes automated, scalable KBC difficult to achieve in real-world scenarios. This dissertation focuses on automating and scaling the complex process of building KBC systems from heterogeneous data. In particular, we study both knowledge distillation and representation from richly formatted data, where information is expressed via combinations of textual, structural, tabular, and visual cues, as well as novel techniques for making these processes feasible in practice. This dissertation consists of two parts.

In the first part, we aim to discover the fundamental building blocks of KBC from richly formatted data. We present Fonduer, a KBC system that enables the extraction of information from richly formatted data. Fonduer automatically models richly formatted data and allows both users and machines to systematically retrieve all of a document's multimodal information in a programmatic way. This information can formalize multimodal signals both for training data generation via weak supervision and for augmenting deep learning models with multimodal features to perform task learning.

In the second part of this dissertation, we study how building knowledge bases can be made feasible in practice. We investigate two crucial parts of Fonduer's pipeline: training data generation and task learning. We propose two systems. The first is Dauphin, a training data generation system that uses data augmentation to lower the cost of acquiring training data. The second is Emmental, a system for building multimodal, multi-task learning models that increases the efficiency of learning multiple tasks by leveraging the similarities among them. In Dauphin, we analyze the generalization effects of linear transformations in data augmentation and propose a method to strategically select the augmented data samples that provide the most information. In Emmental, we analyze and improve information transfer in multi-task learning to make the task learning process more efficient and achieve higher quality.

The three systems described in this dissertation, Fonduer, Dauphin, and Emmental, demonstrate that it is both possible and practical to build knowledge bases from richly formatted data. We believe that these innovations hold great promise for future KBC techniques.
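To make the abstract's notion of programmatic, multimodal weak supervision concrete, the following is a minimal sketch in plain Python, not the actual Fonduer API: a few labeling functions vote on a candidate (part number, maximum voltage) pair extracted from a datasheet by combining textual, tabular, and visual cues. The Span and Candidate classes, their fields, the example values, and the majority-vote aggregation are illustrative assumptions standing in for Fonduer's data model and a learned label model.

```python
# Hypothetical, simplified sketch of multimodal weak supervision
# (not the actual Fonduer API).
from dataclasses import dataclass

ABSTAIN, FALSE, TRUE = -1, 0, 1

@dataclass
class Span:
    text: str          # surface text of the mention
    page: int          # page on which the mention appears
    row_header: str    # header of the table row containing the mention ("" if not in a table)
    y_center: float    # vertical position on the page, in points

@dataclass
class Candidate:
    part: Span         # candidate part-number mention
    voltage: Span      # candidate maximum-voltage mention

def lf_same_row_voltage_header(c: Candidate) -> int:
    """Tabular cue: vote TRUE if the voltage sits in a row whose header mentions 'voltage'."""
    return TRUE if "voltage" in c.voltage.row_header.lower() else ABSTAIN

def lf_visually_aligned(c: Candidate) -> int:
    """Visual cue: vote TRUE if both mentions share a page and are roughly horizontally aligned."""
    if c.part.page == c.voltage.page and abs(c.part.y_center - c.voltage.y_center) < 5.0:
        return TRUE
    return ABSTAIN

def lf_implausible_value(c: Candidate) -> int:
    """Textual cue: vote FALSE if the voltage text is not a plausible number."""
    try:
        float(c.voltage.text.rstrip("V"))
        return ABSTAIN
    except ValueError:
        return FALSE

def weak_label(c: Candidate) -> int:
    """Majority vote over labeling functions, standing in for a learned label model."""
    votes = [lf(c) for lf in (lf_same_row_voltage_header, lf_visually_aligned, lf_implausible_value)]
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    return TRUE if sum(votes) > len(votes) / 2 else FALSE

# Illustrative usage with made-up values:
cand = Candidate(
    part=Span("SMBT3904", page=1, row_header="", y_center=412.0),
    voltage=Span("40V", page=1, row_header="Collector-emitter voltage", y_center=410.5),
)
print(weak_label(cand))  # -> 1 (TRUE)
```

In practice, many such noisy votes would be combined by a generative label model rather than a simple majority vote, and the resulting probabilistic labels used to train a downstream multimodal extraction model.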

Description

Type of resource text
Form electronic resource; remote; computer; online resource
Extent 1 online resource
Place [Stanford, California]
Publisher [Stanford University]
Copyright date ©2020
Publication date 2020
Issuance monographic
Language English

Creators/Contributors

Author Wu, Sen
Degree supervisor Ré, Christopher
Thesis advisor Ré, Christopher
Thesis advisor Levis, Philip
Thesis advisor Olukotun, Oyekunle Ayinde
Degree committee member Levis, Philip
Degree committee member Olukotun, Oyekunle Ayinde
Associated with Stanford University, Computer Science Department.

Subjects

Genre Theses
Genre Text

Bibliographic information

Statement of responsibility Sen Wu
Note Submitted to the Computer Science Department
Thesis Thesis (Ph.D.), Stanford University, 2020
Location electronic resource

Access conditions

Copyright
© 2020 by Sen Wu
License
This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported license (CC BY-NC).
