Improving access to untranscribed speech corpora using AI
Abstract/Contents
- Abstract
- How easily speech corpora can be indexed and searched has a direct impact on how effectively their contents can be used by many interested parties, from linguists to language teachers to community members. As transcribing speech is much more time-consuming than recording it, large parts of speech corpora typically remain untranscribed, making these parts difficult to index and search. While searchable transcriptions can be automatically derived using a speech-to-text system for major languages like English, such technologies are typically unavailable for smaller languages, especially those commonly encountered in language documentation work. For documentation projects, this difficulty creates a bottleneck both for linguistic analyses and for creating language learning materials for language revitalisation and maintenance. In this dissertation, I propose four approaches to widen this bottleneck, either by enabling some form of search or indexing or by accelerating the time-consuming process of transcription. Each chapter addresses a common but distinct scenario within language documentation projects according to the types and amounts of available data. For each scenario, I propose a context-appropriate, data-efficient solution that leverages AI speech models as well as external resources where appropriate.
Description
Type of resource | text |
---|---|
Form | electronic resource; remote; computer; online resource |
Extent | 1 online resource. |
Place | California |
Place | [Stanford, California] |
Publisher | [Stanford University] |
Copyright date | 2024; ©2024 |
Publication date | 2024 |
Issuance | monographic |
Language | English |
Creators/Contributors
Author | San, Nay Myo |
---|---|
Degree supervisor | Jurafsky, Dan, 1962- |
Thesis advisor | Jurafsky, Dan, 1962- |
Thesis advisor | Anttila, Arto |
Thesis advisor | Manning, Christopher D |
Degree committee member | Anttila, Arto |
Degree committee member | Manning, Christopher D |
Associated with | Stanford University, School of Humanities and Sciences |
Associated with | Stanford University, Department of Linguistics |
Subjects
Genre | Theses |
---|---|
Genre | Text |
Bibliographic information
Statement of responsibility | Nay Myo San. |
---|---|
Note | Submitted to the Department of Linguistics. |
Thesis | Thesis Ph.D. Stanford University 2024. |
Location | https://purl.stanford.edu/jx557wt1543 |
Access conditions
- Copyright
- © 2024 by Nay Myo San
- License
- This work is licensed under a Creative Commons Attribution Share Alike 3.0 Unported license (CC BY-SA).