Automating knowledge distillation and representation from richly formatted data


Abstract/Contents

Abstract
Much of the information created today is scattered throughout books, magazines, web pages, academic papers, and other documents that cannot be directly queried in a structured way. To address this challenge, people are building structured knowledge bases to make this information more accessible. The process of populating knowledge bases from unstructured inputs is called knowledge base construction (KBC). Through KBC, we can make troves of valuable information more accessible. However, existing KBC systems have limited ability to handle input data that spans a wide variety of formats, including unstructured text, tables, and figures, contained within highly variable file structures. This data heterogeneity makes automated, scalable KBC difficult to achieve in real-world scenarios. This dissertation focuses on automating and scaling the complex process of building KBC systems from heterogeneous data. In particular, we study both knowledge distillation and representation from richly formatted data, where information is expressed via combinations of textual, structural, tabular, and visual cues, as well as novel techniques for making these processes feasible in practice. This dissertation consists of two parts.

In the first part, we aim to discover the fundamental building blocks of KBC from richly formatted data. We present Fonduer, a KBC system that enables the extraction of information from richly formatted data. Fonduer automatically models richly formatted data and allows both users and machines to systematically retrieve all of a document's multimodal information in a programmatic way. This information can formalize multimodal signals both for training data generation via weak supervision and for augmenting deep learning models with multimodal features to perform task learning.

In the second part of this dissertation, we study how building knowledge bases can be made feasible in practice. We investigate two crucial parts of Fonduer's pipeline: training data generation and task learning. We propose two systems. The first is Dauphin, a training data generation system that uses data augmentation to lower the cost of acquiring training data. The second is Emmental, a system for building multimodal, multi-task learning models that increases the efficiency of learning multiple tasks by leveraging the similarities among them. In Dauphin, we analyze the generalization effects of linear transformations in data augmentation and propose a method to strategically select the augmented data samples that provide the most information. In Emmental, we analyze and improve information transfer in multi-task learning to make the task learning process more efficient and achieve higher quality.

The three systems described in this dissertation, Fonduer, Dauphin, and Emmental, demonstrate that it is both possible and practical to build knowledge bases from richly formatted data. We believe that these innovations hold great promise for future KBC techniques.
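To make the abstract's notion of programmatic, multimodal weak supervision concrete, the following is a minimal sketch in plain Python, not the actual Fonduer API: a few labeling functions vote on a candidate (part number, maximum voltage) pair extracted from a datasheet by combining textual, tabular, and visual cues. The Span and Candidate classes, their fields, the example values, and the majority-vote aggregation are illustrative assumptions standing in for Fonduer's data model and a learned label model.

```python
# Hypothetical, simplified sketch of multimodal weak supervision
# (not the actual Fonduer API).
from dataclasses import dataclass

ABSTAIN, FALSE, TRUE = -1, 0, 1

@dataclass
class Span:
    text: str          # surface text of the mention
    page: int          # page on which the mention appears
    row_header: str    # header of the table row containing the mention ("" if not in a table)
    y_center: float    # vertical position on the page, in points

@dataclass
class Candidate:
    part: Span         # candidate part-number mention
    voltage: Span      # candidate maximum-voltage mention

def lf_same_row_voltage_header(c: Candidate) -> int:
    """Tabular cue: vote TRUE if the voltage sits in a row whose header mentions 'voltage'."""
    return TRUE if "voltage" in c.voltage.row_header.lower() else ABSTAIN

def lf_visually_aligned(c: Candidate) -> int:
    """Visual cue: vote TRUE if both mentions share a page and are roughly horizontally aligned."""
    if c.part.page == c.voltage.page and abs(c.part.y_center - c.voltage.y_center) < 5.0:
        return TRUE
    return ABSTAIN

def lf_implausible_value(c: Candidate) -> int:
    """Textual cue: vote FALSE if the voltage text is not a plausible number."""
    try:
        float(c.voltage.text.rstrip("V"))
        return ABSTAIN
    except ValueError:
        return FALSE

def weak_label(c: Candidate) -> int:
    """Majority vote over labeling functions, standing in for a learned label model."""
    votes = [lf(c) for lf in (lf_same_row_voltage_header, lf_visually_aligned, lf_implausible_value)]
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    return TRUE if sum(votes) > len(votes) / 2 else FALSE

# Illustrative usage with made-up values:
cand = Candidate(
    part=Span("SMBT3904", page=1, row_header="", y_center=412.0),
    voltage=Span("40V", page=1, row_header="Collector-emitter voltage", y_center=410.5),
)
print(weak_label(cand))  # -> 1 (TRUE)
```

In practice, many such noisy votes would be combined by a generative label model rather than a simple majority vote, and the resulting probabilistic labels used to train a downstream multimodal extraction model.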

Description

Type of resource text
Form electronic resource; remote; computer; online resource
Extent 1 online resource
Place [Stanford, California]
Publisher [Stanford University]
Copyright date ©2020
Publication date 2020
Issuance monographic
Language English

Creators/Contributors

Author Wu, Sen
Degree supervisor Ré, Christopher
Thesis advisor Ré, Christopher
Thesis advisor Levis, Philip
Thesis advisor Olukotun, Oyekunle Ayinde
Degree committee member Levis, Philip
Degree committee member Olukotun, Oyekunle Ayinde
Associated with Stanford University, Computer Science Department.

Subjects

Genre Theses
Genre Text

Bibliographic information

Statement of responsibility Sen Wu
Note Submitted to the Computer Science Department
Thesis Thesis (Ph.D.), Stanford University, 2020
Location electronic resource

Access conditions

Copyright
© 2020 by Sen Wu
License
This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported license (CC BY-NC).
