Nuisance compensation and prosodic modeling on high-level speech tasks

Abstract/Contents

Abstract
As automatic speech processing has matured, research has expanded from automatic speech recognition and keyword spotting to paralinguistic speech problems, which aim to detect "beyond-the-words" information. Researchers have focused on automatically deriving speaker characteristics from speech and classifying speakers into categories ranging from age, identity, language, dialect, idiolect, and sociolect to truthfulness, cognitive health, and emotion. This dissertation focuses on three of these categories: emotion recognition, psychological state detection, and speaker verification. One of the many difficulties in emotion recognition and psychological state detection is the lack of real data. Most publicly available emotion databases use acted speech, which is not representative of real speakers' emotions because the emotions are stereotypical and exaggerated. In this work, features and approaches that have been found successful in other speech areas are applied to the new tasks of emotion and psychological state detection using three different databases of real, non-acted speech. These methods, which include modeling cepstral and prosodic features with Gaussian mixture models and applying nuisance compensation to cepstral features to reduce speaker and channel variability, outperform the standard linear classifier approach on simple prosodic features. Because these techniques require large amounts of data and the emotion and psychological health databases are small, data with only speaker identity labels are used to initially train the models. Although this data is not tagged with emotion or psychological state labels, the dissertation shows that it can be used successfully in training for these tasks. All emotion and psychological state detection tasks in this dissertation are binary classifications: detecting fear vs. neutral in 911 emergency calls, distinguishing severely depressed from nondepressed older males, and differentiating high-risk suicidal adults from depressed adults, nondepressed adults, and adults who have ideas of committing suicide. With N-fold leave-one-out cross-validation, performance with these new systems is 19% better on average than a basic linear discriminative classifier that uses only prosodic features. Performance is also 17% better than state-of-the-art research on the same data. Results show that fear in 911 calls can be detected with 85% accuracy, and that high-risk suicidal males are discriminated from males with ideation, depressed males, and nondepressed males with 90% accuracy. Constrained speaker verification systems, which model standard cepstral features that fall within particular types of speech regions, are also studied. A question in modeling such systems is whether to constrain universal background model (UBM) training, joint factor analysis (JFA), or both. This question is explored, along with how to optimize the UBM size, using a corpus of Arabic male speakers. Over a large set of phonetic and prosodic constraints, the performance of a system using constrained JFA and UBM is found to be on average 5.2% better than that of a system using constraint-independent (all frames) JFA and UBM. Further improvement is found from optimizing the UBM size based on the percentage of frames covered by the constraint.

Description

Type of resource text
Form electronic; electronic resource; remote
Extent 1 online resource.
Publication date 2011
Issuance monographic
Language English

Creators/Contributors

Associated with Sanchez, Michelle Hewlett
Associated with Stanford University, Department of Electrical Engineering
Primary advisor El Gamal, Abbas A
Primary advisor Gray, Robert M, 1943-
Thesis advisor El Gamal, Abbas A
Thesis advisor Gray, Robert M, 1943-
Thesis advisor Ferrer, Luciana
Thesis advisor Olshen, Richard A, 1942-
Advisor Ferrer, Luciana
Advisor Olshen, Richard A, 1942-

Subjects

Genre Theses

Bibliographic information

Statement of responsibility Michelle Hewlett Sanchez.
Note Submitted to the Department of Electrical Engineering.
Thesis Thesis (Ph.D.)--Stanford University, 2011.
Location electronic resource

Access conditions

Copyright
© 2011 by Michelle Hewlett Sanchez
License
This work is licensed under a Creative Commons Attribution Non Commercial 3.0 Unported license (CC BY-NC).
