Holistic language processing : joint models of linguistic structure


Abstract
Humans are much better than computers at understanding language. This is, in part, because humans naturally employ holistic language processing: they effortlessly keep track of many inter-related layers of low-level information, while simultaneously integrating long-distance information from elsewhere in the conversation or document. This thesis is about joint models for natural language processing which likewise aim to capture the dependencies between different layers of information, and between different parts of a document, when making a decision. I address three aspects of holistic language processing.

First, I present an information extraction model that includes long-distance links in order to jointly make decisions about related words which may be far from one another in a document. Most information extraction systems use sequence models, such as linear-chain conditional random fields, which have access to only a small, local context when making decisions. I show how to add long-distance links between related words which can be arbitrarily far apart within the document. Experiments show that these long-distance links can be used to improve performance on multiple tasks.

I then move to jointly modeling different layers of information. First, I present a sampling-based pipeline. In a typical linguistic annotation pipeline, different components are run one after another, and the best output from each is used as the input to the next stage. The pipeline I present is theoretically equivalent to passing the entire distribution from one stage to the next, instead of just the most likely output. Experimentally, this pipeline outperformed the typical, greedy pipeline, but did not outperform taking the k-best outputs at each stage. I follow this with a full joint model of parsing and named entity recognition. This joint model does not have the directionality constraints inherent in a pipeline, and both levels of annotation can directly influence and constrain one another. Experiments show that this joint model can produce significant improvements on both tasks. I then show how to further improve the joint model using additional data which has been annotated with only one type of structure, unlike the jointly annotated data needed by the original joint model. The additional data is incorporated using a hierarchical prior, which links feature weights between the models for the different tasks.

Lastly, I address the problem of multi-domain learning, where the goal is to jointly model different genres of text annotated for the same task. This is once again done via a hierarchical prior which links the feature weights between the models for the different genres. Experiments show that this technique can improve performance across all domains, though, not surprisingly, domains with smaller training corpora improve more.
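The sampling-based pipeline idea above can be illustrated with a toy sketch: instead of passing only the single most likely output of one stage to the next, we draw samples from the first stage's distribution and aggregate the downstream decisions. Everything here (the two-tag posterior, the "imperative vs. statement" second stage) is invented for illustration and is not taken from the thesis.

```python
import random
from collections import Counter

random.seed(0)

# Hypothetical stage-1 posterior over POS taggings of a two-word
# sentence, and a toy second stage conditioned on that tagging.
stage1_posterior = {
    ("VB", "NN"): 0.6,   # e.g. "Book flights" read as an imperative
    ("NN", "NN"): 0.4,   # e.g. "Book flights" read as a compound noun
}

def stage2(tags):
    # Toy downstream decision that depends on the stage-1 output.
    return "imperative" if tags[0] == "VB" else "statement"

def greedy_pipeline():
    # Typical pipeline: commit to the single best stage-1 output.
    best = max(stage1_posterior, key=stage1_posterior.get)
    return stage2(best)

def sampled_pipeline(n_samples=1000):
    # Approximate marginalizing over stage-1 outputs by sampling,
    # rather than committing to the single best tagging.
    taggings = list(stage1_posterior)
    weights = [stage1_posterior[t] for t in taggings]
    outcomes = Counter()
    for _ in range(n_samples):
        tags = random.choices(taggings, weights=weights)[0]
        outcomes[stage2(tags)] += 1
    return outcomes.most_common(1)[0][0], outcomes

print(greedy_pipeline())
print(sampled_pipeline()[0])
```

In this tiny example the greedy and sampled pipelines agree; the sampled version pays off when the downstream decision is sensitive to lower-probability stage-1 analyses.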
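The hierarchical prior used for both the semi-supervised joint model and the multi-domain setting can be sketched as a pair of Gaussian priors: each domain-specific weight vector is pulled toward shared top-level weights, which are in turn pulled toward zero. The function, domain names, and numbers below are all illustrative assumptions, not values from the thesis.

```python
import numpy as np

def hierarchical_penalty(domain_weights, w_star, sigma_d=1.0, sigma_star=10.0):
    """L2 penalties implied by a hierarchical Gaussian prior:
    each domain's weights w_d are pulled toward the shared weights
    w_star, and w_star is pulled toward zero."""
    penalty = np.sum(w_star ** 2) / (2 * sigma_star ** 2)
    for w_d in domain_weights.values():
        penalty += np.sum((w_d - w_star) ** 2) / (2 * sigma_d ** 2)
    return penalty

# Two hypothetical domains sharing top-level weights w_star.
w_star = np.array([1.0, -0.5])
domain_weights = {
    "newswire":   np.array([1.2, -0.4]),
    "biomedical": np.array([0.9, -0.6]),
}
print(hierarchical_penalty(domain_weights, w_star))
```

Because every domain pays a cost for drifting from the shared weights, a domain with little training data ends up close to w_star (borrowing strength from the others), while a well-resourced domain can afford to deviate, which matches the observation that smaller domains benefit most.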

Description

Type of resource text
Form electronic; electronic resource; remote
Extent 1 online resource.
Publication date 2010
Issuance monographic
Language English

Creators/Contributors

Associated with Finkel, Jenny Rose
Associated with Stanford University, Computer Science Department
Primary advisor Manning, Christopher D
Thesis advisor Manning, Christopher D
Thesis advisor Jurafsky, Dan, 1962-
Thesis advisor Koller, Daphne
Thesis advisor Ng, Andrew Y, 1976-

Subjects

Genre Theses

Bibliographic information

Statement of responsibility Jenny Rose Finkel.
Note Submitted to the Department of Computer Science.
Thesis Thesis (Ph.D.)--Stanford University, 2010.
Location electronic resource

Access conditions

Copyright
© 2010 by Jenny Rose Finkel
License
This work is licensed under a Creative Commons Attribution Non Commercial 3.0 Unported license (CC BY-NC).
