-- KarenLivescu - 15 Dec 2005

Welcome to the home of the Johns Hopkins 2006 Summer Workshop project on articulatory feature-based speech recognition!

This is a planning, collaboration, and discussion area for the project. To be added as a user, contact Karen Livescu at klivescu@csail.mitNOSPAM.edu (remove the "NOSPAM").

A brief description of the planned project:

Articulatory features have a long history in proposals for automatic speech recognition (ASR) architectures. These proposals are usually motivated by two intuitions: (1) models related to speech production should better account for spontaneous speech pronunciation effects, and (2) certain aspects of articulation are more acoustically salient or robust to noise than others, leading to alternative models of the acoustic observations. Improved recognition has been obtained using feature classifiers, especially in noisy conditions or for varying speaking styles, and improvements in pronunciation modeling have been obtained by accounting for articulatory asynchrony and reduction. Articulatory features also provide a natural explanation for the observed asynchronies in multimodal speech recognition, e.g., between audio and video streams in automatic speechreading. Feature-based models have also been proposed for language-universal speech recognition.

Recent developments make this a particularly good time to pursue articulatory models in a concerted effort such as a JHU summer workshop. A growing number of researchers are investigating different aspects of this idea. Members of our team have developed methods for extracting articulatory states from acoustics, for pronunciation modeling using articulatory features, and for building multistream dynamic Bayesian networks for articulatory and other ASR models. Another helpful recent development is the advent of graphical modeling techniques, including speech-oriented toolkits such as GMTK (the successful product of a previous summer workshop). Whereas previous efforts to replace the phonetic state with a feature-based representation have required new algorithms for inference and training, graphical models allow us to focus instead on quickly building new models. The components needed for a complete feature-based speech recognizer are reaching maturity, but a joint effort is needed to combine them into a successful system.

In this workshop project, we propose to build complete, end-to-end articulatory feature-based speech recognition systems for both audio-only and audio-visual speech, and to answer some of the modeling questions that arise in such systems. The project will use a graphical model implementation in which each feature is associated with a separate hidden state stream, with the asynchrony between streams explicitly modeled. This type of structure has shown promise in pronunciation modeling and has begun to be used in complete systems. The project will also draw on recent ideas for feature classification and will explore questions such as:

  • The type and amount of inter-articulator asynchrony allowed
  • The modeling of reductions from target articulations
  • The effect of context (e.g., phonemic, syllabic, prosodic, or speaker characteristics) on both asynchrony and reductions
  • The use of different feature sets and different classifier types for those features
  • The application of articulator asynchrony as an account for observed audio-video asynchrony, as an alternative to the typical phoneme-viseme modeling approach
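
As a rough illustration of the first question above, the sketch below enumerates the joint states available to several feature streams when the asynchrony between any two streams is capped. It is a toy example in Python, not the project's graphical model implementation: the function name `allowed_joint_states`, the number of streams, the number of sub-word positions, and the `max_async` cap are all assumptions made for illustration.

```python
from itertools import product


def allowed_joint_states(num_positions, num_streams, max_async):
    """Enumerate joint stream positions with bounded asynchrony.

    Each stream holds an index into the same sequence of sub-word
    positions. A joint state is allowed when no two streams are more
    than `max_async` positions apart (hypothetical hard constraint).
    """
    states = []
    for combo in product(range(num_positions), repeat=num_streams):
        # The largest pairwise gap equals max index minus min index.
        if max(combo) - min(combo) <= max_async:
            states.append(combo)
    return states


# Hypothetical setup: 3 feature streams and a word with 4 sub-word positions.
synchronous = allowed_joint_states(num_positions=4, num_streams=3, max_async=0)
one_step = allowed_joint_states(num_positions=4, num_streams=3, max_async=1)

print(len(synchronous), len(one_step))
```

With these toy settings, the fully synchronous case has 4 joint states and allowing one position of asynchrony gives 22, which suggests how quickly the joint state space grows as the constraint is relaxed. A probabilistic model would typically replace the hard cap with a distribution over the degree of asynchrony.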

During the workshop, the project will focus mainly on two tasks:

  • Small-vocabulary conversational speech recognition using the SVitchboard 1 database, recently introduced by Simon King et al.
  • Audio-visual speech recognition, in a read speech setting

This project also has potential for much follow-up work, including the additional application of language-universal ASR.