Processing short message communications in low-resource languages

Abstract/Contents

Abstract
This dissertation explores the nature of the variation inherent to short message communications in the majority of the world's languages, and the extent to which modeling this variation can improve natural language processing systems. Text messaging may be the most linguistically diverse form of digital communication that has ever existed, but it is almost completely unstudied. Three sets of short messages are studied, in the Haitian Kreyol, Chichewa, and Urdu languages. It is found that all three contain the substantial spelling variations that result from a productive use of affixes/compounds, from phonological/orthographic variation, and from the typographic errors that arise from speakers with varying literacy. For example, the 600 Chichewa messages have more than 40 spellings for the word odwala ('patient'), with most appearing just once. This is problematic for many current approaches to natural language processing, which assume the level of standardization that is found in formal written English. However, as the variation is linguistically predictable, it follows patterns that can be modeled. The dissertation first looks at automated methods for modeling this variation, finding that language-independent methods can perform as accurately as language-specific methods, indicating a broad deployment potential. Turning to categorization, it is shown that by generalizing across the spelling variations, we can, for example, implement classification systems that can more accurately distinguish emergency messages from those that are less time-critical, even when incoming messages contain a large number of previously unknown spellings of words. Looking across languages, the words that vary the least in translation are named entities, meaning that it is possible to leverage loosely aligned translations to automatically extract the names of people, places and organizations.
Taken together, it is hoped that the results will lead to more accurate natural language processing systems for low-resource languages and, in turn, lead to greater services for their speakers.

Description

Type of resource text
Form electronic; electronic resource; remote
Extent 1 online resource.
Publication date 2012
Issuance monographic
Language English

Creators/Contributors

Associated with Munro, Robert James
Associated with Stanford University, Department of Linguistics
Primary advisor Manning, Christopher D
Thesis advisor Manning, Christopher D
Thesis advisor Jurafsky, Dan, 1962-
Thesis advisor Parikh, Tapan S
Advisor Jurafsky, Dan, 1962-
Advisor Parikh, Tapan S

Subjects

Genre Theses

Bibliographic information

Statement of responsibility Robert Munro.
Note Submitted to the Department of Linguistics.
Thesis Thesis (Ph.D.)--Stanford University, 2012.
Location electronic resource

Access conditions

Copyright
© 2012 by Robert James Munro
License
This work is licensed under a Creative Commons Attribution Non Commercial 3.0 Unported license (CC BY-NC).
