Distantly supervised information extraction using bootstrapped patterns

Gupta, Sonal; Stanford University, Department of Computer Science.

Abstract/Contents

Abstract: Information extraction (IE) involves extracting information such as entities, relations, and events from unstructured text. Although most work in IE focuses on tasks that have abundant training data by exploiting supervised machine learning techniques, in practice, most IE problems do not have any supervised training data available. Learning conditional random fields (CRFs), a state-of-the-art supervised approach, is impractical for such real-world applications because: (1) they require large and expensive labeled corpora, and (2) it is difficult to interpret them and analyze errors, an often-ignored but important feature. This dissertation focuses on information extraction for tasks that have no labeled data available, apart from some seed examples. Supervision using seed examples is usually easier to obtain than fully labeled sentences. In addition, for many tasks, the seed examples can be acquired using existing resources like Wikipedia and other human-curated knowledge bases. I present Bootstrapped Pattern Learning (BPL), an iterative pattern and entity learning approach, as an effective and interpretable approach to entity extraction tasks with only seed examples as supervision. I propose two new tasks: (1) extracting key aspects from scientific articles to study the influence of sub-communities of a research community, and (2) extracting medical entities from online web forums. For the first task, I propose three new categories of key aspects and a new definition of influence based on the key aspects. This dissertation is the first work to address the second task of extracting drugs & treatments and symptoms & conditions entities from patient-authored text. Extracting these entities can aid in studying the efficacy and side effects of drugs and home remedies at a large scale. I show that BPL, using either dependency patterns or lexico-syntactic surface-word patterns, is an effective approach to solving both problems. It outperforms existing tools and CRFs. Similar to most bootstrapped or semi-supervised systems, BPL systems developed earlier either ignore the unlabeled data or make closed-world assumptions about it, resulting in less accurate classifiers. To address this problem, I propose improvements to BPL's pattern and entity scoring functions by evaluating the unlabeled entities using unsupervised similarity measures, such as word embeddings and contrasting domain-specific and general text. I improve the entity classifier of BPL by expanding the training sets using similarity computed by distributed representations of entities. My systems successfully leverage unlabeled data and significantly outperform the baselines by not making closed-world assumptions. Developing any learning system usually requires a developer-in-the-loop to tune the parameters. I utilize the interpretability of patterns to humans, a highly desirable attribute for industrial applications, to develop a new diagnostic tool for visualization of the output of multiple pattern-based entity learning systems. Such comparisons can help in diagnosing errors faster, resulting in a shorter and easier development cycle. I make source code of all tools developed in this dissertation publicly available.
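
Illustration: the iterative pattern and entity learning loop described in the abstract can be pictured with a toy sketch. This is not the dissertation's implementation. The function name `bootstrap`, the `min_score` threshold, and the use of a (previous word, next word) context as a crude surface-word pattern are all assumptions made for illustration. The scoring rule also counts every unlabeled match as a negative, which is exactly the kind of closed-world assumption the abstract says the improved scoring functions avoid.

```python
from collections import defaultdict


def bootstrap(sentences, seeds, iterations=3, min_score=0.5):
    """Toy bootstrapped pattern learning (hypothetical sketch, not the thesis code)."""
    entities = set(seeds)

    # Index every (previous word, next word) context and the tokens it surrounds.
    matches = defaultdict(set)
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for i in range(1, len(tokens) - 1):
            matches[(tokens[i - 1], tokens[i + 1])].add(tokens[i])

    learned_patterns = set()
    for _ in range(iterations):
        # Score a pattern by the fraction of its matches that are known entities.
        # Unlabeled matches count as negatives here (a closed-world simplification).
        good = {
            pattern
            for pattern, toks in matches.items()
            if len(toks & entities) / len(toks) >= min_score
        }
        learned_patterns |= good

        # Everything matched by a learned pattern becomes a candidate entity.
        candidates = set().union(*(matches[p] for p in learned_patterns))
        new_entities = candidates - entities
        if not new_entities:
            break  # nothing new was learned, so further iterations change nothing
        entities |= new_entities
    return entities, learned_patterns


if __name__ == "__main__":
    corpus = [
        "I took aspirin for pain",
        "I took ibuprofen for pain",
        "She took tylenol for fever",
        "He took naproxen for cramps",
        "We took a bus home",
    ]
    found, patterns = bootstrap(corpus, seeds={"aspirin", "ibuprofen"})
    print(sorted(found))
    print(sorted(patterns))
```

On this made-up corpus, the two seeds lead to the single context "took _ for", which in turn picks up the two remaining drug names while leaving "a" (from "took a bus") out. A real system would rank patterns and entities with richer scores rather than a fixed threshold.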

Description

Type of resource	text
Form	electronic; electronic resource; remote
Extent	1 online resource.
Publication date	2015
Issuance	monographic
Language	English

Creators/Contributors

Associated with	Gupta, Sonal
Associated with	Stanford University, Department of Computer Science.
Primary advisor	Manning, Christopher D
Thesis advisor	Manning, Christopher D
Thesis advisor	Heer, Jeffrey Michael
Thesis advisor	Liang, Percy

Subjects

Genre	Theses

Bibliographic information

Statement of responsibility	Sonal Gupta.
Note	Submitted to the Department of Computer Science.
Thesis	Thesis (Ph.D.)--Stanford University, 2015.
Location	electronic resource

Access conditions

License: This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported license (CC BY-NC 3.0).
