Grammar induction and parsing with dependency-and-boundary models

Abstract/Contents

Abstract
Unsupervised learning of hierarchical syntactic structure from free-form natural language text is an important and difficult problem, with implications for scientific goals, such as understanding human language acquisition, and for engineering applications, including question answering, machine translation and speech recognition. As is the case with many unsupervised settings in machine learning, grammar induction usually reduces to a non-convex optimization problem. This dissertation proposes a novel family of head-outward generative dependency parsing models and a curriculum learning strategy, co-designed to induce grammars effectively despite local optima, by taking advantage of multiple views of the data. The dependency-and-boundary models are parameterized to exploit, as much as possible, any observable state, such as words at sentence boundaries, which limits the proliferation of optima that is ordinarily caused by the presence of latent variables. They are also flexible in their modeling of overlapping subgrammars and sensitive to different input types. These capabilities allow training data to be split into simpler text fragments, in accordance with proposed parsing constraints, thereby increasing the number of visible edges. An optimization strategy then gradually exposes learners to more complex data.

The proposed suite of constraints on possible valid parse structures, which can be extracted from unparsed surface text forms, helps guide language learners towards linguistically plausible syntactic constructions. These constraints are efficient, easy to implement and applicable to a variety of naturally-occurring partial bracketings, including capitalization changes, punctuation and web markup. Connections between traditional syntax and HTML annotations, for instance, were not previously known, and are one of several discoveries about statistical regularities in text that this thesis contributes to the science of linguistics.

The resulting grammar induction pipelines attain state-of-the-art performance not only on a standard English dependency parsing test bed, but also as judged by constituent structure metrics, in addition to a more comprehensive multilingual evaluation that spans disparate language families. This work widens the scope and difficulty of the evaluation methodology for unsupervised parsing, testing against nineteen languages (rather than just English), evaluating on all (not just short) sentence lengths, and using disjoint (blind) training and test data splits. The proposed methods also show that it is possible to eliminate commonly used supervision signals, including biased initializers, manually tuned training subsets, custom termination criteria and knowledge of part-of-speech tags, and still improve performance.

Empirical evidence presented in this dissertation strongly suggests that complex learning tasks like grammar induction can cope with non-convexity and discover more correct syntactic structures by pursuing learning strategies that begin with simple data and basic models and progress to more complex data instances and more expressive model parameterizations. A contribution to artificial intelligence more broadly is thus a collection of search techniques that make expectation-maximization and other optimization algorithms less sensitive to local optima. The proposed tools include multi-objective approaches for avoiding or escaping fixed points, iterative model recombination and "starting small" strategies that gradually improve candidate solutions, and a generic framework for transforming these and other already-found locally optimal models. Such transformations make for informed, intelligent, non-random restarts, enabling the design of comprehensive search networks that are capable of exploring combinatorial parameter spaces more rapidly and more thoroughly than conventional optimization methods.
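
As a rough illustration only (not code from the dissertation), the Python sketch below shows one way the two ideas summarized above could be organized: punctuation is used to split raw sentences into simpler fragments, and an EM-based learner is then trained on a "starting small" curriculum of increasingly long inputs. The helper names and the train_em callback are hypothetical placeholders, not interfaces defined in this work.

    import re

    # Hypothetical helper: split a raw sentence into inter-punctuation fragments,
    # reflecting the observation that punctuation often marks phrase boundaries.
    def punctuation_fragments(sentence):
        pieces = re.split(r'[,;:()"?!.]|--', sentence)
        return [piece.split() for piece in pieces if piece.split()]

    # Hypothetical curriculum: train on short fragments first, then longer ones,
    # warm-starting each EM stage from the previous model ("starting small").
    def starting_small(corpus, train_em, max_lengths=(5, 10, 15, 45)):
        fragments = [f for s in corpus for f in punctuation_fragments(s)]
        model = None
        for cap in max_lengths:
            stage = [f for f in fragments if len(f) <= cap]
            model = train_em(stage, init=model)  # each stage seeded by the last
        return model

Such a staged pipeline is only one possible realization of the curriculum idea; the dissertation's actual models and constraints are considerably richer.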

Description

Type of resource text
Form electronic; electronic resource; remote
Extent 1 online resource.
Publication date 2013
Issuance monographic
Language English

Creators/Contributors

Associated with Spitkovsky, Valentin Ilyich
Associated with Stanford University, Department of Computer Science.
Primary advisor Jurafsky, Dan, 1962-
Thesis advisor Alshawi, Hiyan
Thesis advisor Manning, Christopher D.

Subjects

Genre Theses

Bibliographic information

Statement of responsibility Valentin Ilyich Spitkovsky.
Note Submitted to the Department of Computer Science.
Thesis (Ph.D.)--Stanford University, 2013.
Location electronic resource

Access conditions

Copyright
© 2013 by Valentin Ilyich Spitkovsky
License
This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported license (CC BY-NC 3.0).
