Nuisance compensation and prosodic modeling on high-level speech tasks
Abstract/Contents
- Abstract
- As automatic speech processing has matured, research has expanded from automatic speech recognition and keyword spotting to paralinguistic problems that aim to detect "beyond-the-words" information. Researchers have focused on automatically deriving speaker characteristics from speech and classifying speakers into categories ranging from age, identity, language, dialect, idiolect, and sociolect to truthfulness, cognitive health, and emotion. This dissertation focuses on three of these tasks: emotion recognition, psychological state detection, and speaker verification. One of the many difficulties in emotion recognition and psychological state detection is the lack of real data. Most publicly available emotion databases use acted speech, which is not representative of real speakers' emotions because the acted emotions are stereotypical and exaggerated. In this work, features and approaches that have been found successful in other speech areas are applied to the new tasks of emotion and psychological state detection using three different databases of real, non-acted speech. These methods, which include modeling cepstral and prosodic features with Gaussian mixture models and applying nuisance compensation to cepstral features to reduce speaker and channel variability, outperform the standard linear classifier approach on simple prosodic features. Because these techniques require large amounts of data and the emotion and psychological health databases are small, data with only speaker identity labels are used to initially train the models. Although these data are not tagged with emotion or psychological state labels, this dissertation shows that they can be used successfully for these tasks during training. All emotion and psychological state detection tasks in this dissertation are binary classifications, including detecting fear vs. neutral in 911 emergency calls, distinguishing severely depressed from nondepressed older males, and differentiating high-risk suicidal adults from both depressed and nondepressed adults, as well as from adults who have ideas of committing suicide. With N-fold leave-one-out cross-validation, performance with these new systems is 19% better on average than a basic linear discriminative classifier that uses only prosodic features. Performance is also 17% better than state-of-the-art research on the same data. Results show that fear in 911 calls can be detected with 85% accuracy, and that high-risk suicidal males are discriminated from males with ideation, depressed males, and nondepressed males with 90% accuracy.
Constrained speaker verification systems, which model standard cepstral features that fall within particular types of speech regions, are also studied. A question in modeling such systems is whether to constrain universal background model (UBM) training, joint factor analysis (JFA), or both. This question is explored, along with how to optimize the UBM size, using a corpus of Arabic male speakers. Over a large set of phonetic and prosodic constraints, a system using constrained JFA and UBM performs on average 5.2% better than one using constraint-independent (all frames) JFA and UBM. Further improvement is found from optimizing the UBM size based on the percentage of frames covered by the constraint.
Description
Type of resource | text |
---|---|
Form | electronic; electronic resource; remote |
Extent | 1 online resource. |
Publication date | 2011 |
Issuance | monographic |
Language | English |
Creators/Contributors
Associated with | Sanchez, Michelle Hewlett |
---|---|
Associated with | Stanford University, Department of Electrical Engineering |
Primary advisor | El Gamal, Abbas A |
Primary advisor | Gray, Robert M, 1943- |
Thesis advisor | El Gamal, Abbas A |
Thesis advisor | Gray, Robert M, 1943- |
Thesis advisor | Ferrer, Luciana |
Thesis advisor | Olshen, Richard A, 1942- |
Advisor | Ferrer, Luciana |
Advisor | Olshen, Richard A, 1942- |
Subjects
Genre | Theses |
---|---|
Bibliographic information
Statement of responsibility | Michelle Hewlett Sanchez. |
---|---|
Note | Submitted to the Department of Electrical Engineering. |
Thesis | Thesis (Ph.D.)--Stanford University, 2011. |
Location | electronic resource |
Access conditions
- Copyright
- © 2011 by Michelle Hewlett Sanchez
- License
- This work is licensed under a Creative Commons Attribution Non Commercial 3.0 Unported license (CC BY-NC).