Welcome to the home of the Johns Hopkins 2006 Summer Workshop project on articulatory feature-based speech recognition!
This is a planning, collaboration, and discussion area for the project. To be added as a user, contact Karen Livescu at
klivescu@csail.mitNOSPAM.edu (remove the "NOSPAM").
A brief description of the planned project:
Articulatory features have a long history in proposals for automatic
speech recognition (ASR) architectures. These are usually motivated
by the intuitions that (1) models related to speech production should
better account for spontaneous speech pronunciation effects, and (2)
certain aspects of articulation are more acoustically salient or
robust to noise than others, leading to alternative models of the
acoustic observations. Improved recognition has been obtained using
feature classifiers, especially in noisy conditions or for varying
speaking styles, and improvements in pronunciation modeling have been
obtained by accounting for articulatory asynchrony and reduction.
Articulatory features also provide a natural explanation for the
observed asynchronies in multimodal speech recognition, e.g., between
audio and video streams in automatic speechreading. Feature-based
models have also been proposed for language-universal speech
recognition.
Recent developments make this a particularly good time to pursue
articulatory models in a concerted effort such as a JHU summer
workshop. There is a growing number of researchers investigating
different aspects of this idea. Members of our team have developed
methods for extracting articulatory states from acoustics,
pronunciation modeling using articulatory features, and multistream
dynamic Bayesian networks for articulatory and other models for ASR.
Another helpful recent development is the advent of graphical modeling
techniques, including speech-oriented toolkits such as GMTK (the
successful product of a previous summer workshop). Whereas previous
efforts to replace the phonetic state with a feature-based
representation have required new algorithms for inference and
training, graphical models allow us to focus instead on quickly
building new models. The components needed for a complete
feature-based speech recognizer are reaching maturity, but a joint
effort is needed to combine them into a successful system.
In this workshop project, we propose to build complete, end-to-end
articulatory feature-based speech recognition systems, for both
audio-only and audio-visual speech, and to answer some of the modeling
questions that arise in such systems. The project will use a
graphical model implementation, in which each feature is associated
with a separate hidden state stream, with the asynchrony between
streams explicitly modeled. This type of structure has shown promise
in pronunciation modeling and has begun to be used in complete
systems. The project will also draw on recent ideas for feature
classification and will explore questions such as:
- The type and amount of inter-articulator asynchrony allowed
- The modeling of reductions from target articulations
- The effect of context (e.g., phonemic, syllabic, or prosodic context, or speaker characteristics) on both asynchrony and reductions
- The use of different feature sets and different classifier types for those features
- The application of articulator asynchrony as an account of observed audio-video asynchrony, as an alternative to the typical phoneme-viseme modeling approach
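To make the idea of separate feature streams with explicitly modeled asynchrony more concrete, here is a minimal toy sketch in Python. It is not the project's model and does not use GMTK. It assumes that each stream steps through the same sequence of targets, that a stream's state is just its position in that sequence, and that asynchrony is limited by a hard bound on how far apart any two positions may be. The function names and the `max_async` parameter are hypothetical, chosen only for illustration.

```python
from itertools import product


def allowed_joint_positions(num_streams, num_targets, max_async):
    """List joint stream positions whose spread is at most max_async.

    Each position is an index into a shared sequence of targets
    (an assumption of this sketch, not of the project).
    """
    joint = []
    for positions in product(range(num_targets), repeat=num_streams):
        if max(positions) - min(positions) <= max_async:
            joint.append(positions)
    return joint


def successors(positions, num_targets, max_async):
    """List joint positions reachable in one time step.

    In one step every stream either stays at its current target or
    advances to the next one. Moves that run past the last target or
    violate the asynchrony bound are dropped.
    """
    moves = []
    for step in product((0, 1), repeat=len(positions)):
        nxt = tuple(p + s for p, s in zip(positions, step))
        if max(nxt) >= num_targets:
            continue  # some stream would move past the final target
        if max(nxt) - min(nxt) <= max_async:
            moves.append(nxt)
    return moves


if __name__ == "__main__":
    # Two streams, three targets: compare fully synchronous streams
    # (max_async=0) with streams allowed to drift by one target.
    print(len(allowed_joint_positions(2, 3, 0)))
    print(len(allowed_joint_positions(2, 3, 1)))
    print(successors((0, 1), 3, 1))
```

With `max_async=0` the streams are forced to move together, which corresponds to a single phone-like state stream. Raising the bound enlarges the joint state space, which is one way to see why the amount of asynchrony allowed is a modeling question rather than a free choice. A probabilistic model would replace the hard bound with a distribution over the degree of asynchrony.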
During the workshop, the project will focus mainly on two tasks:
- Small-vocabulary conversational speech recognition using the SVitchboard 1 database, recently introduced by Simon King et al.
- Audio-visual speech recognition, in a read speech setting
This project also has potential for much follow-up work, including the
additional application of language-universal ASR.