Designing syntactic representations for NLP : an empirical investigation

Abstract/Contents

Abstract
This dissertation is a study of the use of linguistic structure in Natural Language Processing (NLP) applications. Specifically, it investigates how different ways of packaging syntactic information have consequences for goals such as representing linguistic properties, training statistical parsers, and sourcing features for information extraction. The focus of these investigations is the design of Universal Dependencies (UD), a multilingual syntactic representation for NLP. Chapter 2 discusses the theoretical foundations of UD and its relations to other frameworks for the study of syntax. This discussion shows specific design decisions that characterize UD, and the principles motivating those decisions. The rationale for headedness criteria and type distinctions in UD is introduced there. Chapter 3 studies how choices of headedness in dependency representations have consequences for parsing and crosslinguistic parallelism. UD strongly prefers lexical heads in dependency trees, and this chapter presents quantitative results supporting this preference for its impact on parallelism. However, that design can be suboptimal for parsing, and in some languages parsing accuracy can be improved by using a parser-internal representation that favors function words as heads. Chapter 4 presents the first detailed linguistic analysis of UD-represented data, taking four Romance languages as a case study. UD's conciseness and orientation to surface syntax allow for a simple and straightforward analysis of Romance SE constructions, which are very difficult to unify in generative syntax. On the other hand, complex predicates require us to choose between representing syntactic or semantic properties. The Romance case also shows why maximizing the crosslinguistic uniformity of the distinction between function and content words requires a small amount of semantic information, in addition to syntactic cues.
Chapter 5 investigates the actual usage of UD in a pipeline, with an extrinsic evaluation that compares UD to minimally transformed versions of it. The main takeaway is methodological: it is very difficult to obtain consistent improvements across data sets by manipulating the dependency representation. The most consistent result obtained was an improvement in performance when using a version of UD that is restructured and relabeled to have shorter predicate-argument paths. The results and analyses presented in this work show that the main (and perhaps only) reason to use a lexical-head design is to support crosslinguistic parallelism. However, that is only possible if function words are defined uniformly across languages, and doing so satisfactorily requires the use of criteria outside syntax. Moreover, the complexity of the results shows that a single design cannot necessarily serve every purpose equally well. Knowing this, one of the most useful things that designers can do is provide a discussion of the properties of their representation for users, empowering them to make transformations such as the many examples illustrated in this dissertation. A deep understanding of syntactic representations creates flexibility for users to exploit their properties in the way that is most suitable for a particular task and data set. This dissertation aims to build such an understanding of UD, thereby, hopefully, enabling users to utilize it in the way that is most suitable for them.

Description

Type of resource text
Form electronic; electronic resource; remote
Extent 1 online resource.
Publication date 2016
Issuance monographic
Language English

Creators/Contributors

Associated with Silveira, Natalia G
Associated with Stanford University, Department of Linguistics.
Primary advisor Manning, Christopher D
Thesis advisor Manning, Christopher D
Thesis advisor Jurafsky, Dan, 1962-
Thesis advisor Potts, Christopher, 1977-
Thesis advisor De Marneffe, Marie-Catherine

Subjects

Genre Theses

Bibliographic information

Statement of responsibility Natalia G. Silveira.
Note Submitted to the Department of Linguistics.
Thesis Thesis (Ph.D.)--Stanford University, 2016.
Location electronic resource

Access conditions

Copyright
© 2016 by Natalia Giordani Silveira
License
This work is licensed under a Creative Commons Attribution 3.0 Unported license (CC BY).
