Distantly supervised information extraction using bootstrapped patterns


Abstract/Contents

Abstract
Information extraction (IE) is the task of extracting information such as entities, relations, and events from unstructured text. Most work in IE applies supervised machine learning to tasks that have abundant training data, but in practice most IE problems have no labeled training data available. Learning conditional random fields (CRFs), a state-of-the-art supervised approach, is impractical for such real-world applications because (1) CRFs require large and expensive labeled corpora, and (2) they are difficult to interpret and to analyze for errors, and interpretability is an often-ignored but important feature.
This dissertation focuses on information extraction for tasks that have no labeled data available apart from some seed examples. Supervision in the form of seed examples is usually easier to obtain than fully labeled sentences. In addition, for many tasks the seed examples can be acquired from existing resources such as Wikipedia and other human-curated knowledge bases. I present Bootstrapped Pattern Learning (BPL), an iterative pattern and entity learning approach, as an effective and interpretable approach to entity extraction tasks with only seed examples as supervision.
I propose two new tasks: (1) extracting key aspects from scientific articles to study the influence of sub-communities of a research community, and (2) extracting medical entities from online web forums. For the first task, I propose three new categories of key aspects and a new definition of influence based on the key aspects. This dissertation is the first work to address the second task, extracting entities of two types (drugs & treatments, and symptoms & conditions) from patient-authored text. Extracting these entities can aid in studying the efficacy and side effects of drugs and home remedies at a large scale. I show that BPL, using either dependency patterns or lexico-syntactic surface-word patterns, is an effective approach to both problems. It outperforms existing tools and CRFs.
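The iterative pattern and entity learning loop described above can be illustrated with a minimal sketch. This is not the dissertation's implementation: the pattern form (previous word, next word), the scoring rule (fraction of a pattern's extractions that are already known entities), the `min_score` threshold, and the toy forum-style sentences are all assumptions chosen to keep the example small.

```python
from collections import defaultdict


def context_matches(sentences):
    """Map each (previous word, next word) pattern to the words it extracts."""
    matches = defaultdict(set)
    for tokens in sentences:
        for i in range(1, len(tokens) - 1):
            matches[(tokens[i - 1], tokens[i + 1])].add(tokens[i])
    return matches


def bootstrap(sentences, seeds, max_iterations=5, min_score=0.5):
    """Alternate between learning patterns from entities and entities from patterns."""
    entities = set(seeds)
    patterns = set()
    matches = context_matches(sentences)
    for _ in range(max_iterations):
        # Pattern step: keep patterns whose extractions are mostly known entities.
        # Unlabeled words count against a pattern here (a closed-world assumption).
        for pattern, words in matches.items():
            if len(words & entities) / len(words) >= min_score:
                patterns.add(pattern)
        # Entity step: every word extracted by an accepted pattern becomes an entity.
        extracted = set().union(*(matches[p] for p in patterns))
        new_entities = extracted - entities
        if not new_entities:
            break  # nothing new was learned, so further iterations change nothing
        entities |= new_entities
    return entities, patterns


# Hypothetical toy corpus and seed set, for illustration only.
sentences = [s.split() for s in [
    "i took aspirin for my headache",
    "i took ibuprofen for my back pain",
    "she took tylenol for a fever",
    "he tried tylenol with some honey",
    "he tried ginger with some honey",
]]
entities, patterns = bootstrap(sentences, seeds={"aspirin", "ibuprofen"})
print(sorted(entities))
print(sorted(patterns))
```

On this toy corpus the first iteration learns the pattern `("took", "for")` from the two seeds and extracts "tylenol"; the second iteration uses "tylenol" to learn `("tried", "with")`, which in turn extracts "ginger". The learned patterns remain human-readable, which is the interpretability property the abstract emphasizes.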
Like most bootstrapped or semi-supervised systems, earlier BPL systems either ignore the unlabeled data or make closed-world assumptions about it, resulting in less accurate classifiers. To address this problem, I propose improvements to BPL's pattern and entity scoring functions that evaluate the unlabeled entities using unsupervised similarity measures, such as word embeddings and the contrast between domain-specific and general text. I improve the entity classifier of BPL by expanding its training sets using similarity computed from distributed representations of entities. My systems successfully leverage unlabeled data and, by not making closed-world assumptions, significantly outperform the baselines.
Developing any learning system usually requires a developer-in-the-loop to tune the parameters. I utilize the interpretability of patterns to humans, a highly desirable attribute for industrial applications, to develop a new diagnostic tool for visualizing the output of multiple pattern-based entity learning systems. Such comparisons can help in diagnosing errors faster, resulting in a shorter and easier development cycle. I make the source code of all tools developed in this dissertation publicly available.
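The contrast between a closed-world pattern score and one that evaluates unlabeled entities by similarity can be sketched as follows. The exact scoring functions of the dissertation are not given in this record, so everything here is an assumption made for illustration: the soft-label rule (an unlabeled word gets the maximum cosine similarity to any positive entity, clipped at zero), the function names, and the hand-written two-dimensional vectors standing in for word embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def soft_label(word, positives, vectors):
    """1.0 for known positives; otherwise the best clipped similarity to a positive."""
    if word in positives:
        return 1.0
    if word not in vectors:
        return 0.0
    return max(
        (max(0.0, cosine(vectors[word], vectors[p])) for p in positives if p in vectors),
        default=0.0,
    )


def closed_world_score(extracted, positives):
    """Treat every unlabeled extraction as a negative example."""
    return sum(1 for w in extracted if w in positives) / len(extracted)


def similarity_score(extracted, positives, vectors):
    """Give unlabeled extractions partial credit based on similarity to positives."""
    return sum(soft_label(w, positives, vectors) for w in extracted) / len(extracted)


# Hypothetical toy embeddings; real systems would use learned word vectors.
vectors = {"aspirin": [1.0, 0.0], "tylenol": [3.0, 4.0], "notes": [0.0, 1.0]}
positives = {"aspirin"}

drug_pattern = ["aspirin", "tylenol"]  # extractions of a pattern that finds drugs
noisy_pattern = ["aspirin", "notes"]   # extractions of a pattern that also finds noise

closed_drug = closed_world_score(drug_pattern, positives)
closed_noisy = closed_world_score(noisy_pattern, positives)
soft_drug = similarity_score(drug_pattern, positives, vectors)
soft_noisy = similarity_score(noisy_pattern, positives, vectors)
print(closed_drug, closed_noisy, soft_drug, soft_noisy)
```

Under the closed-world score both patterns look equally good, because each extracts one known entity and one unlabeled word. The similarity-based score ranks the drug pattern higher, since its unlabeled extraction lies closer to a known positive in the toy vector space.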

Description

Type of resource text
Form electronic; electronic resource; remote
Extent 1 online resource.
Publication date 2015
Issuance monographic
Language English

Creators/Contributors

Associated with Gupta, Sonal
Associated with Stanford University, Department of Computer Science.
Primary advisor Manning, Christopher D
Thesis advisor Manning, Christopher D
Thesis advisor Heer, Jeffrey Michael
Thesis advisor Liang, Percy

Subjects

Genre Theses

Bibliographic information

Statement of responsibility Sonal Gupta.
Note Submitted to the Department of Computer Science.
Thesis Thesis (Ph.D.)--Stanford University, 2015.
Location electronic resource

Access conditions

Copyright
© 2015 by Sonal Gupta
License
This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported license (CC BY-NC 3.0).
