Processing short message communications in low-resource languages

Munro, Robert James; Stanford University, Department of Linguistics

Abstract/Contents

Abstract: This dissertation explores the nature of the variation inherent to short message communications in the majority of the world's languages, and the extent to which modeling this variation can improve natural language processing systems. Text messaging may be the most linguistically diverse form of digital communication that has ever existed, but it is almost completely unstudied. Three sets of short messages are studied, in Haitian Kreyol, Chichewa, and Urdu. All three are found to contain substantial spelling variation that results from the productive use of affixes and compounds, from phonological and orthographic variation, and from the typographic errors of writers with varying literacy. For example, the 600 Chichewa messages have more than 40 spellings for the word odwala ('patient'), with most appearing just once. This is problematic for many current approaches to natural language processing, which assume the level of standardization that is found in formal written English. However, because the variation is linguistically predictable, it follows patterns that can be modeled. The dissertation first looks at automated methods for modeling this variation, finding that language-independent methods can perform as accurately as language-specific methods, indicating a broad deployment potential. Turning to categorization, it is shown that by generalizing across the spelling variations we can, for example, implement classification systems that more accurately distinguish emergency messages from those that are less time-critical, even when incoming messages contain a large number of previously unknown spellings of words. Looking across languages, the words that vary the least in translation are named entities, meaning that it is possible to leverage loosely aligned translations to automatically extract the names of people, places, and organizations.
Taken together, it is hoped that the results will lead to more accurate natural language processing systems for low-resource languages and, in turn, lead to greater services for their speakers.

Description

Type of resource	text
Form	electronic; electronic resource; remote
Extent	1 online resource.
Publication date	2012
Issuance	monographic
Language	English

Creators/Contributors

Associated with	Munro, Robert James
Associated with	Stanford University, Department of Linguistics
Primary advisor	Manning, Christopher D
Thesis advisor	Manning, Christopher D
Thesis advisor	Jurafsky, Dan, 1962-
Thesis advisor	Parikh, Tapan S

Subjects

Genre	Theses

Bibliographic information

Statement of responsibility	Robert Munro.
Note	Submitted to the Department of Linguistics.
Thesis	Thesis (Ph.D.)--Stanford University, 2012.
Location	electronic resource

Access conditions

License: This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported license (CC BY-NC 3.0).
