Processing short message communications in low-resource languages
Abstract/Contents
- Abstract
- This dissertation explores the nature of the variation inherent to short message communications in the majority of the world's languages, and the extent to which modeling this variation can improve natural language processing systems. Text messaging may be the most linguistically diverse form of digital communication that has ever existed, but it is almost completely unstudied. Three sets of short messages are studied, in the Haitian Kreyol, Chichewa, and Urdu languages. It is found that all three contain the substantial spelling variations that result from a productive use of affixes/compounds, from phonological/orthographic variation, and from the typographic errors that arise from speakers with varying literacy. For example, the 600 Chichewa messages have more than 40 spellings for the word odwala ('patient'), with most appearing just once. This is problematic for many current approaches to natural language processing, which assume the level of standardization that is found in formal written English. However, because the variation is linguistically predictable, it follows patterns that can be modeled. The dissertation first looks at automated methods for modeling this variation, finding that language-independent methods can perform as accurately as language-specific methods, indicating a broad deployment potential. Turning to categorization, it is shown that by generalizing across the spelling variations, we can, for example, implement classification systems that can more accurately distinguish emergency messages from those that are less time-critical, even when incoming messages contain a large number of previously unknown spellings of words. Looking across languages, the words that vary the least in translation are named entities, meaning that it is possible to leverage loosely aligned translations to automatically extract the names of people, places and organizations.
Taken together, it is hoped that the results will lead to more accurate natural language processing systems for low-resource languages and, in turn, lead to greater services for their speakers.
Description
Type of resource | text |
---|---|
Form | electronic; electronic resource; remote |
Extent | 1 online resource. |
Publication date | 2012 |
Issuance | monographic |
Language | English |
Creators/Contributors
Associated with | Munro, Robert James |
---|---|
Associated with | Stanford University, Department of Linguistics |
Primary advisor | Manning, Christopher D |
Thesis advisor | Manning, Christopher D |
Thesis advisor | Jurafsky, Dan, 1962- |
Thesis advisor | Parikh, Tapan S |
Advisor | Jurafsky, Dan, 1962- |
Advisor | Parikh, Tapan S |
Subjects
Genre | Theses |
---|---|
Bibliographic information
Statement of responsibility | Robert Munro. |
---|---|
Note | Submitted to the Department of Linguistics. |
Thesis | Thesis (Ph.D.)--Stanford University, 2012. |
Location | electronic resource |
Access conditions
- Copyright
- © 2012 by Robert James Munro
- License
- This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported license (CC BY-NC).