A non-negative framework for joint modeling of spectral structure and temporal dynamics in sound mixtures

Abstract/Contents

Abstract
Statistical modeling of audio is an ongoing pursuit. There is a great deal of structure in audio and good models need to make use of this structure. Audio is non-stationary but the statistics of the spectral structure are quite consistent over segments of time. Moreover, there is a structure to the non-stationarity itself in the form of temporal dynamics. Sound mixtures are commonly encountered in practice. Polyphonic music, multiple concurrent speakers, and most environmental sounds are mixtures. Moreover, most real-world sound sources are actually a mixture of the source and noise. When dealing with mixtures, the structure of the individual sources becomes particularly important if we wish to deal with the sources separately. In recent years, non-negative spectrogram factorization methods have become quite popular for modeling audio as they provide a rich representation of audio spectra and are amenable to high-quality reconstructions. However, they disregard non-stationarity as they use a single dictionary to characterize the statistics of the spectral structure of an entire source. On the other hand, hidden Markov models (HMMs) cater well to non-stationarity and have been used successfully to model temporal dynamics. They can be powerful for audio analysis, as shown by their application to speech recognition. They can also be used for the reconstruction of sources but have certain limitations due to a rigid observation model. This can be an issue for high-quality reconstructions. We propose a new model of single sound sources, the non-negative hidden Markov model (N-HMM), that jointly models the spectral structure and temporal dynamics of a given source. In the proposed model, rather than learning a single dictionary, we learn several small dictionaries that characterize the spectral structure of the source, catering well to non-stationarity. Moreover, we jointly learn a Markov chain that characterizes the temporal dynamics of the source. This is done with a flexible observation model that allows high-quality reconstructions. We demonstrate this model on content-aware audio processing. We then propose a new model of sound mixtures, the non-negative factorial hidden Markov model (N-FHMM), that combines models of individual sources. This model incorporates the spectral structure and temporal dynamics of each individual source. We demonstrate the model on single-channel source separation and show that it yields superior performance to non-negative spectrogram factorization. Although it is demonstrated on source separation, the N-FHMM is a general model of sound mixtures and can be used for various applications.
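
Illustrative note: the single-dictionary baseline that the abstract contrasts with the N-HMM can be sketched with a standard non-negative matrix factorization using the well-known multiplicative updates for the generalized KL divergence. This is a minimal sketch under stated assumptions, not the thesis's N-HMM or its learning algorithm; the names `nmf_kl` and `kl_divergence`, the toy data, and all parameter values are hypothetical and chosen only for illustration.

```python
import numpy as np


def kl_divergence(V, V_hat, eps=1e-9):
    """Generalized KL divergence between two non-negative matrices."""
    return float(np.sum(V * np.log((V + eps) / (V_hat + eps)) - V + V_hat))


def nmf_kl(V, n_components, n_iter=200, eps=1e-9, seed=0):
    """Factor a non-negative matrix V (freq x frames) as W @ H.

    W holds one dictionary of spectral components for the whole source;
    H holds the per-frame weights. Hypothetical helper for illustration.
    """
    rng = np.random.default_rng(seed)
    n_freq, n_frames = V.shape
    W = rng.random((n_freq, n_components)) + eps
    H = rng.random((n_components, n_frames)) + eps
    ones = np.ones_like(V)
    for _ in range(n_iter):
        # Update the weights with the dictionary held fixed.
        R = V / (W @ H + eps)
        H *= (W.T @ R) / (W.T @ ones + eps)
        # Update the dictionary with the weights held fixed.
        R = V / (W @ H + eps)
        W *= (R @ H.T) / (ones @ H.T + eps)
    return W, H


# Toy "spectrogram": a random non-negative matrix of rank 3 (assumed data).
rng = np.random.default_rng(1)
V = rng.random((64, 3)) @ rng.random((3, 40))

W, H = nmf_kl(V, n_components=3)
print("KL divergence after fitting:", kl_divergence(V, W @ H))
```

Because a single `W` describes every frame, this sketch has no notion of which components are active at which time. Per the abstract, the N-HMM addresses this by learning several small dictionaries together with a Markov chain over them.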

Description

Type of resource text
Form electronic; electronic resource; remote
Extent 1 online resource.
Publication date 2010
Issuance monographic
Language English

Creators/Contributors

Associated with Mysore, Gautham J
Associated with Stanford University, Department of Music
Primary advisor Smith, Julius O. (Julius Orion)
Thesis advisor Smith, Julius O. (Julius Orion)
Thesis advisor Slaney, Malcolm
Thesis advisor Smaragdis, Paris
Thesis advisor Tibshirani, Robert

Subjects

Genre Theses

Bibliographic information

Statement of responsibility Gautham J. Mysore.
Note Submitted to the Department of Music.
Thesis Thesis (Ph. D.)--Stanford University, 2010.
Location electronic resource

Access conditions

Copyright
© 2010 by Gautham J Mysore
License
This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported license (CC BY-NC).
