SANE Conference Overview

Tara N. Sainath

SLTC Newsletter, November 2012

The Speech and Audio in the Northeast (SANE) Conference was held on October 24, 2012 at Mitsubishi Electric Research Laboratories (MERL) in Cambridge, MA. The goal of the meeting was to bring together researchers and students working on speech and audio in the Northeast of the American continent. The conference featured eight talks, which are outlined in more detail below.

Unsupervised pattern discovery in speech was the subject of three talks, given by Jim Glass and Chia-ying Lee from MIT, T.J. Hazen and David Harwath from MIT Lincoln Laboratory, and Herb Gish from BBN. Most current speech recognition systems require labeled data for training acoustic, pronunciation, and language models. However, given the vast amount of unlabeled data available and the difficulty of obtaining labeled data, unsupervised pattern discovery has clear potential to help in areas such as learning acoustic sub-word units, lexical dictionaries, and higher-level semantic information. Zero-resource learning from spoken audio aims to discover patterns in speech signals using unsupervised techniques, without labeled transcriptions or prior knowledge of the language. These techniques have been explored for acoustic pattern discovery [1], learning of acoustic-phonetic inventories [2], and topic modeling for spoken documents [3].
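As a concrete illustration of the kind of computation these zero-resource methods build on, the sketch below aligns two unlabeled utterances with dynamic time warping (DTW) over acoustic feature frames; a low alignment cost between fragments suggests a recurring spoken pattern. The feature vectors, distance measure, and sizes are illustrative assumptions, not the systems presented in the talks.

    # Illustrative sketch only: unsupervised matching of two unlabeled utterances
    # with dynamic time warping (DTW) over acoustic feature frames, the basic
    # operation behind acoustic pattern discovery. The MFCC-like feature vectors
    # and sizes below are assumptions for illustration.
    import numpy as np

    def dtw_distance(x, y):
        """Align two feature sequences (frames x dims) and return the
        average per-frame cosine distance along the best warping path."""
        nx, ny = len(x), len(y)
        # Pairwise cosine distances between frames.
        xn = x / np.linalg.norm(x, axis=1, keepdims=True)
        yn = y / np.linalg.norm(y, axis=1, keepdims=True)
        dist = 1.0 - xn @ yn.T
        # Standard DTW recursion on the cumulative cost matrix.
        cost = np.full((nx + 1, ny + 1), np.inf)
        cost[0, 0] = 0.0
        for i in range(1, nx + 1):
            for j in range(1, ny + 1):
                cost[i, j] = dist[i - 1, j - 1] + min(
                    cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
        return cost[nx, ny] / (nx + ny)

    # Two hypothetical utterances as sequences of 13-dimensional feature frames.
    rng = np.random.default_rng(0)
    utt_a = rng.normal(size=(80, 13))
    utt_b = rng.normal(size=(95, 13))

    # A low DTW distance between fragments would suggest a recurring spoken pattern.
    print("DTW distance:", dtw_distance(utt_a, utt_b))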

Tara Sainath from IBM Research presented work on bottleneck features extracted from Deep Belief Networks (DBNs). The work proposes an alternative bottleneck structure that trains a neural network with a constant number of hidden units to predict output targets, and then reduces the dimensionality of these output probabilities through an auto-encoder to create auto-encoder bottleneck (AE-BN) features [4]. The benefit of placing the bottleneck after the posterior-estimation network is that it avoids the loss in frame classification accuracy incurred by networks that place the bottleneck before the softmax. AE-BN features provide a 10-20% relative improvement over state-of-the-art GMM-based systems across telephony and broadcast news tasks.
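The sketch below (with untrained, randomly initialized weights and assumed layer sizes) illustrates the AE-BN feature path described above: the posterior-estimating network keeps its full-size softmax output, and a separate auto-encoder compresses the resulting posterior vector, with its small hidden layer providing the bottleneck features. It is a structural illustration only, not the training recipe of [4].

    # Minimal structural sketch of AE-BN feature extraction: a posterior-estimating
    # network with a full-size softmax output, followed by an auto-encoder whose
    # small hidden layer yields the bottleneck features. Layer sizes and weights
    # are illustrative assumptions, not those of [4].
    import numpy as np

    rng = np.random.default_rng(0)
    softmax = lambda z: np.exp(z - z.max()) / np.exp(z - z.max()).sum()
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

    n_in, n_hid, n_targets, n_bn = 360, 1024, 2000, 40   # assumed sizes

    # Stage 1: network trained to predict output targets (here, random weights
    # stand in for supervised training).
    W1 = rng.normal(scale=0.01, size=(n_hid, n_in))
    W2 = rng.normal(scale=0.01, size=(n_targets, n_hid))

    # Stage 2: auto-encoder that compresses the posterior vector; its hidden
    # layer of size n_bn provides the AE-BN features.
    W_enc = rng.normal(scale=0.01, size=(n_bn, n_targets))

    def ae_bn_features(frame):
        hidden = sigmoid(W1 @ frame)          # hidden layer of the posterior net
        posteriors = softmax(W2 @ hidden)     # full-size output, no early bottleneck
        return sigmoid(W_enc @ posteriors)    # encoder bottleneck activations

    frame = rng.normal(size=n_in)             # a spliced acoustic feature frame
    print(ae_bn_features(frame).shape)        # -> (40,) bottleneck feature vector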

Dan Ellis from Columbia spoke about "Recognizing and Classifying Environmental Sounds". Environmental sound recognition is the process of identifying user-relevant acoustic events and sources in a soundtrack. The talk discussed methods for identifying such events in real-world consumer video clips [5], as well as video classification from audio transients, which represent foreground events in a video scene [6].
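A minimal sketch of the clip-level setup behind such audio-based concept classification is given below: per-frame spectral features are pooled into one descriptor per clip and fed to a standard classifier. The crude band-energy features, mean/std pooling, and linear SVM are assumptions for illustration rather than the pipelines of [5] or [6].

    # Illustrative sketch of clip-level audio concept classification: frame-level
    # spectral features are pooled into one vector per clip and classified.
    # Features, pooling, and classifier are assumptions, not the systems of [5], [6].
    import numpy as np
    from sklearn.svm import SVC

    def clip_features(signal, frame_len=512, hop=256, n_bands=20):
        """Pool per-frame log band energies into a mean+std clip descriptor."""
        frames = [signal[i:i + frame_len]
                  for i in range(0, len(signal) - frame_len, hop)]
        spec = np.abs(np.fft.rfft(np.array(frames), axis=1)) ** 2
        bands = np.log(spec[:, :n_bands] + 1e-8)          # crude band energies
        return np.concatenate([bands.mean(axis=0), bands.std(axis=0)])

    rng = np.random.default_rng(0)
    # Hypothetical training clips: noise-like vs. tone-like soundtracks.
    clips = [rng.normal(size=16000) for _ in range(20)] + \
            [np.sin(2 * np.pi * 440 * np.arange(16000) / 16000.0) +
             0.1 * rng.normal(size=16000) for _ in range(20)]
    labels = [0] * 20 + [1] * 20

    X = np.stack([clip_features(c) for c in clips])
    clf = SVC(kernel="linear").fit(X, labels)
    print(clf.predict(X[:2]))                              # predictions for two clips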

Josh McDermott from MIT spoke about "Understanding Audition via Sound Analysis and Synthesis". Analyzing and synthesizing natural sounds can help us understand how sounds are heard and processed [7]. The talk argued that sound synthesis is a useful tool for exploring auditory models: variables that produce compelling synthesis could underlie perception, while synthesis failures point to new variables that might be important to the perceptual system. Such auditory models, ranging from simple to more complex representations, help us understand how many natural sounds may be recognized.
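The sketch below computes the kind of auditory statistics that statistics-matched texture synthesis relies on: a sound is split into frequency subbands, amplitude envelopes are extracted, and simple envelope moments are measured. The filterbank and the particular moments are illustrative assumptions, and the synthesis step itself (iteratively imposing such statistics on noise) is not shown.

    # Sketch of simple auditory-periphery statistics of the kind used in
    # statistics-matched texture synthesis [7]. The filterbank and moments
    # below are illustrative assumptions; synthesis is not shown.
    import numpy as np
    from scipy.signal import butter, sosfiltfilt, hilbert

    def envelope_stats(x, sr, bands=((100, 400), (400, 1600), (1600, 6400))):
        stats = []
        for lo, hi in bands:
            sos = butter(4, [lo, hi], btype="bandpass", fs=sr, output="sos")
            sub = sosfiltfilt(sos, x)                 # one cochlear-like subband
            env = np.abs(hilbert(sub))                # its amplitude envelope
            m = env.mean()
            stats.append((m,                          # envelope mean (power)
                          env.std() / m,              # coefficient of variation
                          ((env - m) ** 3).mean() / env.std() ** 3))  # skewness
        return np.array(stats)

    sr = 16000
    rng = np.random.default_rng(0)
    noise = rng.normal(size=2 * sr)                   # a stand-in "texture"
    print(envelope_stats(noise, sr))                  # one row of moments per band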

Steve Rennie from IBM Research discussed using the Factorial Hidden Restricted Boltzmann Machine (FHRBM) for robust speech recognition. In this work, speech and noise are modeled as independent RBMs [8], and the interaction between them is explicitly modeled to capture how the two components combine to generate the observed noisy speech features. The work compares FHRBMs to the traditional factorial model for noisy speech, which is based on GMMs. FHRBMs offer the advantage that the representations of both speech and noise are highly distributed, allowing the model to learn a parts-based representation of noisy speech that generalizes better to previously unseen noise conditions, with promising results.
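To make the factorial idea concrete, the sketch below shows a common explicit interaction used in factorial models of noisy speech: the power spectra of speech and noise add, so their log-spectra combine approximately as y = log(exp(x) + exp(n)) in each frequency bin. The specific values are assumptions, and the RBM machinery of [8] is not reproduced here.

    # Illustrative sketch of the explicit interaction at the heart of factorial
    # models of noisy speech: independent speech and noise representations
    # combine (approximately log-additively) to produce the observed noisy
    # log-spectral features. Values are assumptions; the RBMs of [8] are not shown.
    import numpy as np

    def combine_log_spectra(speech_logspec, noise_logspec):
        """Log-sum interaction: power spectra add, so log-spectra combine as
        y = log(exp(x) + exp(n)) in each frequency bin."""
        return np.logaddexp(speech_logspec, noise_logspec)

    rng = np.random.default_rng(0)
    speech = rng.normal(loc=2.0, scale=1.0, size=40)   # hypothetical speech log-spectrum
    noise = rng.normal(loc=1.0, scale=0.5, size=40)    # hypothetical noise log-spectrum
    noisy = combine_log_spectra(speech, noise)

    # Each noisy bin is dominated by whichever source is louder there; the gap
    # above the per-bin maximum is at most log(2).
    print((noisy - np.maximum(speech, noise)).max())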

Finally, John Hershey from MERL presented work on "A New Class of Dynamical System Models for Speech and Audio". This work introduces a new family of models, non-negative dynamical systems (NDS), which combine aspects of linear dynamical systems (LDS), hidden Markov models (HMMs), and non-negative matrix factorization (NMF) in a simple probabilistic model. In an NDS, the state and observation variables are non-negative, and the innovation noise is multiplicative and Gamma-distributed rather than additive and Gaussian as in an LDS. The talk showed how an NDS can model frame-to-frame dependencies in audio via structured transitions between the elementary spectral patterns that describe the data. Speech enhancement results were demonstrated on a difficult non-stationary noise task, comparing favorably to standard methods.
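The sketch below generates data from a toy non-negative dynamical system along the lines described above: non-negative states evolve through a transition matrix with multiplicative, Gamma-distributed innovation noise, and non-negative spectra are emitted through an NMF-style dictionary of spectral patterns. All dimensions and parameter values are illustrative assumptions, not those of the talk.

    # Toy generative sketch of a non-negative dynamical system: non-negative
    # states with multiplicative Gamma innovation noise, emitted through an
    # NMF-style dictionary. Dimensions and parameters are illustrative assumptions.
    import numpy as np

    rng = np.random.default_rng(0)
    n_components, n_freq, n_frames = 4, 64, 100

    W = rng.random((n_freq, n_components))        # spectral patterns (dictionary)
    A = rng.random((n_components, n_components))  # non-negative state transitions
    A /= A.sum(axis=0, keepdims=True)             # normalize to keep dynamics stable
    shape = 5.0                                   # Gamma shape: larger = less noisy

    h = np.ones(n_components)
    spectra = np.empty((n_freq, n_frames))
    for t in range(n_frames):
        # Multiplicative, unit-mean Gamma innovation instead of additive Gaussian.
        h = (A @ h) * rng.gamma(shape, 1.0 / shape, size=n_components)
        spectra[:, t] = W @ h                     # non-negative observed spectrum

    print(spectra.min() >= 0, spectra.shape)      # stays non-negative: True (64, 100)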

Acknowledgements

Thank you to Jonathan Le Roux of MERL for his help in providing content and feedback for this article.

References

[1] A. Park and J. Glass, "Unsupervised pattern discovery in speech," IEEE TASLP, Jan. 2008.

[2] C. Lee and J. Glass, "A nonparametric Bayesian approach to acoustic model discovery," Proc. ACL, Jul. 2012.

[3] T. J. Hazen, M.-H. Siu, H. Gish, S. Lowe and A. Chan, "Topic modeling for spoken documents using only phonetic information," Proc. ASRU, Dec. 2011.

[4] T. N. Sainath, B. Kingsbury, and B. Ramabhadran, "Auto-Encoder Bottleneck Features Using Deep Belief Networks," Proc. ICASSP, Mar. 2012.

[5] K. Lee and D. Ellis, "Audio-Based Semantic Concept Classification for Consumer Video," IEEE TASLP, Aug. 2010.

[6] C. Cotton, D. Ellis, and A. Loui, "Soundtrack classification by transient events," Proc. ICASSP, May 2011.

[7] J. H. McDermott and E. P. Simoncelli, "Sound texture perception via statistics of the auditory periphery: Evidence from sound synthesis," Neuron, Sep. 2011.

[8] S. Rennie, P. Fousek, and P. Dognin, "Factorial Hidden Restricted Boltzmann Machine for Noise Robust Speech Recognition," Proc. ICASSP, Mar. 2012.

If you have comments, corrections, or additions to this article, please contact the author: Tara Sainath, tsainath [at] us [dot] ibm [dot] com.

Tara Sainath is a Research Staff Member at IBM T.J. Watson Research Center in New York. Her research interests are mainly in acoustic modeling. Email: tsainath@us.ibm.com