A non-negative framework for joint modeling of spectral structure and temporal dynamics in sound mixtures

Mysore, Gautham J; Stanford University, Department of Music

Abstract/Contents

Abstract: Statistical modeling of audio is an ongoing pursuit. There is a great deal of structure in audio, and good models need to make use of this structure. Audio is non-stationary, but the statistics of the spectral structure are quite consistent over segments of time. Moreover, there is a structure to the non-stationarity itself in the form of temporal dynamics. Sound mixtures are commonly encountered in practice. Polyphonic music, multiple concurrent speakers, and most environmental sounds are mixtures. Moreover, most real-world sound sources are actually a mixture of the source and noise. When dealing with mixtures, the structure of the individual sources becomes particularly important if we wish to deal with the sources separately. In recent years, non-negative spectrogram factorization methods have become quite popular for modeling audio, as they provide a rich representation of audio spectra and are amenable to high-quality reconstructions. However, they disregard non-stationarity, as they use a single dictionary to characterize the statistics of the spectral structure of an entire source. On the other hand, hidden Markov models (HMMs) cater well to non-stationarity and have been used successfully to model temporal dynamics. They can be powerful for audio analysis, as shown by their application to speech recognition. They can also be used for the reconstruction of sources, but they have certain limitations due to a rigid observation model, which can be an issue for high-quality reconstructions. We propose a new model of single sound sources, the non-negative hidden Markov model (N-HMM), that jointly models the spectral structure and temporal dynamics of a given source. In the proposed model, rather than learning a single dictionary, we learn several small dictionaries that characterize the spectral structure of the source, catering well to non-stationarity. Moreover, we jointly learn a Markov chain that characterizes the temporal dynamics of the source. This is done with a flexible observation model that allows high-quality reconstructions. We demonstrate this model on content-aware audio processing. We then propose a new model of sound mixtures, the non-negative factorial hidden Markov model (N-FHMM), that combines models of individual sources. This model incorporates the spectral structure and temporal dynamics of each individual source. We demonstrate the model on single-channel source separation and show that it yields superior performance to non-negative spectrogram factorization. Although it is demonstrated on source separation, the N-FHMM is a general model of sound mixtures and can be used for various applications.

Description

Type of resource	text
Form	electronic; electronic resource; remote
Extent	1 online resource.
Publication date	2010
Issuance	monographic
Language	English

Creators/Contributors

Associated with	Mysore, Gautham J
Associated with	Stanford University, Department of Music
Primary advisor	Smith, Julius O. (Julius Orion)
Thesis advisor	Smith, Julius O. (Julius Orion)
Thesis advisor	Slaney, Malcolm
Thesis advisor	Smaragdis, Paris
Thesis advisor	Tibshirani, Robert

Subjects

Genre	Theses

Bibliographic information

Statement of responsibility	Gautham J. Mysore.
Note	Submitted to the Department of Music.
Thesis	Thesis (Ph. D.)--Stanford University, 2010.
Location	electronic resource

Access conditions

License: This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported license (CC BY-NC 3.0).
