Bilingual and cross-lingual learning of sequence models with bitext

Abstract/Contents

Abstract
Information extraction technologies such as detecting the names of people and places in natural language texts are becoming ever more prevalent as the amount of unstructured text data grows exponentially. Tremendous progress has been made in the past decade in learning supervised sequence models for such tasks, and current state-of-the-art results are in the lower 90s in terms of F1 score for resource-rich languages like English and widely studied datasets such as the CoNLL newswire corpus. However, the performance of existing supervised methods lags by a significant margin when evaluated on non-English languages, new datasets, and domains other than newswire. Furthermore, for resource-poor languages, where there is often little or no annotated training data, neither supervised nor existing unsupervised methods tend to work well. This thesis describes a series of models and experiments in response to these challenges in three specific areas. First, we address the problem of balancing between feature weight undertraining and overtraining in learning log-linear models. We explore two novel regularization techniques, a mixed L2/L1 norm in a product-of-experts ensemble and adaptive regularization with feature noising, and show that they can be very effective in improving system performance. Second, we challenge the conventional wisdom of employing a linear architecture and a sparse, discrete feature representation for sequence labeling tasks, and closely examine the connections and tradeoffs between linear and nonlinear architectures, as well as between discrete and continuous feature representations. We show that a nonlinear architecture enjoys a significant advantage over a linear one when used with continuous feature vectors, but does not seem to offer benefits over traditional sparse features. Lastly, we explore methods that leverage readily available unlabeled parallel text from translation as a rich source of constraints for learning bilingual models that transfer knowledge from English to resource-poor languages. We formalize these models as loopy Markov random fields and propose a suite of approximate inference methods for decoding. Evaluated on standard test sets for five non-English languages, our semi-supervised models yield significant improvements over state-of-the-art results for all five languages. We further propose a cross-lingual projection method that is capable of learning sequence models for languages with no annotated resources at all. Our method projects model posteriors from English to the foreign side over word alignments on bitext, and handles missing and noisy labels via expectation regularization. Trained with no annotated data at all, our model attains the same accuracy as supervised models trained with thousands of labeled examples.
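For readers unfamiliar with the mixed-norm regularization mentioned above, the sketch below gives the general definition of a mixed L_{p,q} norm over grouped weights. This is an illustrative definition only: the specific grouping of weights and the order of the inner and outer norms used in the thesis's product-of-experts ensemble may differ from the L_{2,1} instance shown here.

\[
  \|w\|_{p,q} \;=\; \Big( \sum_{g=1}^{G} \|w_g\|_p^{\,q} \Big)^{1/q},
  \qquad
  \|w\|_{2,1} \;=\; \sum_{g=1}^{G} \sqrt{\sum_{j \in g} w_j^2}
  % Sketch, not necessarily the thesis's exact formulation: w is a weight
  % vector partitioned into G groups w_1, ..., w_G. With p=2, q=1 this is
  % the group-lasso penalty, typically added to a log-linear training
  % objective as a term lambda * ||w||_{2,1}; it drives entire groups of
  % weights toward zero while shrinking weights within a group jointly.
\]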

Description

Type of resource text
Form electronic; electronic resource; remote
Extent 1 online resource.
Publication date 2014
Issuance monographic
Language English

Creators/Contributors

Associated with Wang, Mengqiu
Associated with Stanford University, Department of Computer Science.
Primary advisor Manning, Christopher D
Thesis advisor Jurafsky, Dan, 1962-
Thesis advisor Liang, Percy

Subjects

Genre Theses

Bibliographic information

Statement of responsibility Mengqiu Wang.
Note Submitted to the Department of Computer Science.
Thesis (Ph.D.), Stanford University, 2014.
Location electronic resource

Access conditions

Copyright
© 2014 by Mengqiu Wang
License
This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported license (CC BY-NC 3.0).
