-- KarenLivescu - 15 Dec 2005

Papers

An archive of papers that may be relevant. Papers are marked as follows:

  • DONE indicates a paper that we should probably all read and have in our collective consciousness
  • DONE indicates a useful paper that may be relevant to some parts of the project
  • DONE indicates a paper that may be of interest, but probably does not directly impact project planning

Some papers don't have a "relevance marker" yet. Please feel free to add one, or to change an existing one, if you are familiar with the paper.

This list is incomplete--please add papers, links, comments, new sections, etc.

Some of these references are taken from Katrin Kirchhoff's compilation for the 2001 JHU workshop.

Some basic background

  • F. Jelinek, Statistical Methods for Speech Recognition. Cambridge, MA: The MIT Press, 1997.
    • Brief, somewhat dense, well-written introduction. The only prerequisite is basic probability. The most relevant chapters for us are 1-3, 9, and 12.
  • F. V. Jensen, Bayesian Networks and Decision Graphs. Springer, 2001.
    • I haven't read this in a while, but I believe it's the best truly introductory BN book out there. -Karen

Articulatory feature classification/recognition

  • DONE M.R. Schroeder, ``Determination of the geometry of the human vocal tract by acoustic measurements'', JASA 41(2), pp. 1002-1010, 1967.
  • DONE K. Shirai and M. Honda, "Estimation of Articulatory Motion", in Dynamic Aspects of Speech Production, pp. 279-302, Tokyo University Press, 1976.
  • DONE B.S. Atal, J.J. Chang, M.V. Mathews & J.W. Tukey, ``Inversion of articulatory-to-acoustic transformation in the vocal tract by a computer-sorting technique'', JASA 63(5), pp. 1535-1555, 1978.
    • ... so acoustic-to-articulatory mapping goes back at least to the 60s.

  • T. Kobayashi, M. Yagyu & K. Shirai, ``Application of neural networks to articulatory motion estimation'', Proceedings ICASSP-91, pp. 489-492.

  • J. Papcun, T.R. Hochberg, T.R. Thomas, F. Larouche, J. Zacks & S. Levy, ``Inferring articulation and recognizing gestures from acoustics with a neural network trained on X-ray microbeam data'', JASA 92, pp. 688-700, 1992.

  • K. Elenius and M. Blomberg, "Comparing phoneme and feature based speech recognition using artificial neural networks", Proceedings ICSLP-92, pp. 1279-1282, 1992.

  • E. Eide, J.R. Rohlicek, H. Gish and S. Mitter, "A linguistic feature representation of the speech waveform", Proceedings ICASSP-93, 1993, pp. 483-486.

  • M.G. Rahim, W.B. Kleijn, J. Schroeter & C.C. Goodyear, ``Acoustic to articulatory parameter mapping using an assembly of neural networks'', Proceedings of ICASSP-91, pp. 485-488, 1991.
  • H.B. Richards, J.S. Mason, M.J. Hunt & J.S. Bridle, ``Deriving articulatory representations of speech'', Proceedings Eurospeech-95, pp. 761-764, 1995.

  • C.S. Blackburn & S.J. Young, ``Towards improved speech recognition using a speech production model'', Proceedings Eurospeech-95, pp. 1623-1626, Madrid, Spain, 1995.
  • C.S. Blackburn, Articulatory Methods for Speech Production and Recognition, Ph.D. Thesis, Cambridge University Engineering Department, 1996.

  • J. Hogden et al., ``Accurate recovery of articulator positions from acoustics: new conclusions based on human data'', Journal of the Acoustical Society of America 100(3), pp. 1819-1834, 1996.

  • A.V. Hansen, ``Acoustic parameters optimised for recognition of phonetic features'', Proceedings of Eurospeech-97, pp. 397-400, Rhodes, Greece, 1997.

  • DONE S. Dusan and L. Deng, ``Acoustic-to-articulatory inversion using dynamical and phonological constraints'', Proceedings of the 5th Speech Production Workshop: models and data, Kloster Seeon, Germany, 2000.
  • DONE S. Dusan & L. Deng, ``Estimation of articulatory parameters from speech acoustics by Kalman filtering'', Proceedings of CITO Researcher Retreat, Hamilton, Canada, 1998.
  • DONE S. Dusan & L. Deng, ``Recovering vocal tract shapes from MFCC parameters'', Proceedings ICSLP-98, Sydney, Australia, 1998.
  • DONE S. Dusan, Statistical Estimation of Articulatory Trajectories from the Speech Signals Using Dynamic and Phonological Constraints, University of Waterloo, Canada, 2000.

  • M. Rajamanohar and E. Fosler-Lussier, "An Evaluation of Hierarchical Articulatory Feature Detectors," IEEE Automatic Speech Recognition and Understanding Workshop (ASRU 2005), San Juan, Puerto Rico, 2005.

Articulatory feature-based ASR

  • DONE L. Deng & K. Erler, ``Structural design of hidden Markov model speech recognizer using multivalued phonetic features: Comparison with segmental speech units'', JASA 92(6), pp. 3058-3066, 1992.
  • DONE L. Deng & D. Sun, ``Speech recognition using atomic speech units constructed from overlapping articulatory features'', Proceedings Eurospeech-93, pp. 1635-1638, Berlin, Germany, 1993.
  • DONE L. Deng & D. Sun, ``Phonetic classification and recognition using HMM representation of overlapping articulatory features for all classes of English sounds'', Proceedings ICASSP-94, pp. I-45-48, Adelaide, Australia, 1994.
  • DONE L. Deng, G. Ramsay & D. Sun, ``Production models as a structural basis for automatic speech recognition'', ETRW-96, 1996.
  • DONE K. Erler & G. H. Freeman, ``An HMM-based speech recognizer using overlapping articulatory features'', JASA 100(4), pp. 2500-2513, 1996.
    • A series of papers using HMMs in which each state corresponds to a combination of feature values.

  • J. Zacks & T.R. Thomas, ``A new neural network for articulatory speech recognition and its application to vowel identification'', Computer Speech and Language 8, pp. 189-209, 1994.

  • DONE K. Kirchhoff. "Syllable-level desynchronisation of phonetic features for speech recognition", International Conference on Spoken Language Processing, Philadelphia, USA, October, 1996.
    • Two-pass recognition approach allowing for asynchrony between articulatory features within syllable boundaries.

  • D.J. Iskra & W.H. Edmondson, ``Feature-based approach to speech recognition'', Proceedings ICSLP-98, Sydney, Australia, 1998.

  • T. Stephenson et al., ``Automatic speech recognition using dynamic Bayesian networks with both acoustic and articulatory variables'', Proceedings ICSLP-00, Beijing, China, 2000.

Graphical models

  • DONE K. Livescu, "Graphical models and speech recognition", guest lecture in MIT 6.345 Automatic Speech Recognition. PDF and PPT.
  • DONE Homework assignment associated with above lecture
    • A lecture-plus-lab unit on graphical models in speech. Might be useful for background. We should certainly all have the "warm-up exercises" in the homework assignment down pat. I might be able to find the files for the actual lab part too if necessary. --Karen

  • DONE J. Bilmes, Graphical Models in Speech and Language Research, tutorial presented at the 2004 Human Language Technology conference / North American chapter of the Association for Computational Linguistics (HLT/NAACL'04).
    • Another tutorial that includes more information on inference and other applications besides ASR.

Multi-stream models for ASR

  • H.J. Nock, S.J. Young, Loosely Coupled HMMs for ASR. In Proc of ICSLP 2000, Beijing, China.
  • H.J. Nock and S.J. Young, Modelling Asynchrony in Automatic Speech Recognition Using Loosely-Coupled HMMs. Cognitive Science. May-June 2002.

  • Özgür Çetin. Multi-rate Modeling, Model Inference, and Estimation for Statistical Classifiers, Ph.D. thesis, University of Washington, 2004.

Articulatory phonology


Pronunciation modeling

  • G. Tajchman, E. Fosler, and D. Jurafsky. "Building Multiple Pronunciation Models for Novel Words using Exploratory Computational Phonology," Fourth European Conference on Speech Communication and Technology (Eurospeech '95), Madrid, Spain, 1995.

  • M. Ostendorf, B. Byrne, M. Bacchiani, M. Finke, A. Gunawardana, K. Ross, S. Roweis, E. Shriberg, D. Talkin, A. Waibel, B. Wheatley and T. Zeppenfeld, “Modeling Systematic Variations in Pronunciation via a Language-Dependent Hidden Speaking Mode,” Proc. of the International Conference on Spoken Language Processing, 1996, supplementary paper.
    • Results from the 1996 JHU workshop.

  • Byrne W, Finke M, Khudanpur S, McDonough J, Nock H, Riley M, Saraclar M, Wooters C and Zavaliagkos G, "Pronunciation Modelling Using a Hand-Labelled Corpus for Conversational Speech Recognition," IEEE International Conference on Acoustics, Speech and Signal Processing, Seattle, WA, 1998.
  • Byrne W, Finke M, Khudanpur S, McDonough J, Nock H, Riley M, Saraclar M, Wooters C and Zavaliagkos G, "Pronunciation Modelling for Conversational Speech Recognition: A Status Report from WS97," Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, Santa Barbara, CA, December 1997.
  • Riley, M., Byrne, W., Finke, M., Khudanpur, S., Ljolje, A., McDonough, J., Nock, H., Saraclar, M., Wooters, C., Zavaliagkos, G. "Stochastic pronunciation modeling from hand-labeled phonetic corpora," in the Proceedings of the Workshop on Modeling Pronunciation Variation for Automatic Speech Recognition, Rolduc, The Netherlands, May 4-6, 1998, pp. 109-116.
  • Michael Riley, William Byrne, Michael Finke, Sanjeev Khudanpur, Andrej Ljolje, John McDonough, Harriet Nock, Murat Saraclar, Charles Wooters, George Zavaliagkos, "Stochastic pronunciation modelling from hand-labelled phonetic corpora," Speech Communication, to appear.
    • Papers resulting from the 1997 JHU workshop.

  • D. Jurafsky, A. Bell, E. Fosler-Lussier, C. Girand, and W. Raymond. "Reduction of English function words in Switchboard," International Conference on Spoken Language Processing (ICSLP-98), Sydney, Australia, 1998.
  • DONE E. Fosler-Lussier and N. Morgan. "Effects of Speaking Rate and Word Predictability on Conversational Pronunciations," ESCA Tutorial and Research Workshop on Modeling Pronunciation Variation for Automatic Speech Recognition, Kerkrade, Netherlands, 1998.

  • DONE E. Fosler-Lussier. "Contextual word and syllable pronunciation models," International Workshop on Automatic Speech Recognition and Understanding (ASRU '99), Keystone, Colorado, 1999.

  • DONE J. E. Fosler-Lussier. "Dynamic Pronunciation Models for Automatic Speech Recognition," Ph.D. thesis, University of California, Berkeley, 1999. Reprinted as International Computer Science Institute technical report TR-99-015.
  • DONE Murat Saraclar. Pronunciation Modeling for Conversational Speech Recognition. Ph.D. thesis, Johns Hopkins University, Baltimore, MD, USA, 2000.
  • DONE H. J. Nock, Techniques for Modelling Phonological Processes in Automatic Speech Recognition, Ph.D. Thesis, Cambridge University Engineering Department. August 2001.
    • The first few chapters of these theses give a lot of good background. Harriet Nock's thesis also goes into details of multistream DBNs for ASR.

  • Murat Saraclar and Sanjeev Khudanpur. Pronunciation change in conversational speech and its implications for automatic speech recognition. Computer Speech and Language, 18(4):375-395, October 2004.
  • Murat Saraclar, Harriet Nock, and Sanjeev Khudanpur. Pronunciation modeling by sharing gaussian densities across phonetic models. Computer Speech and Language, 14(2):137-160, April 2000.
    • Papers showing that pronunciation changes are often partial, rather than wholesale substitutions of one phone for another.

  • Jurafsky, Daniel, Alan Bell, Michelle Gregory, and William D. Raymond. 2001. Probabilistic Relations between Words: Evidence from Reduction in Lexical Production. In Bybee, Joan and Paul Hopper (eds.). Frequency and the emergence of linguistic structure. Amsterdam: John Benjamins. 229-254.
  • Jurafsky, Daniel, Alan Bell, Michelle Gregory, and William D. Raymond. 2001. The Effect of Language Model Probability on Pronunciation Reduction. In Proceedings of ICASSP-01 II.801--804, Salt Lake City, Utah.

  • DONE Jurafsky, Dan, Wayne Ward, Zhang Jianping, Keith Herold, Yu Xiuyang, and Zhang Sen. 2001. What Kind of Pronunciation Variation is Hard for Triphones to Model? Proceedings of ICASSP-01, I.577-580, Salt Lake City, Utah.

  • E. Fosler-Lussier. "A Tutorial on Pronunciation Modeling for Large Vocabulary Speech Recognition," in S. Renals and G. Grefenstette (eds), Text and Speech Triggered Information Access, Springer Verlag, Berlin, 2003.

  • E. Fosler-Lussier, W. Byrne, and D. Jurafsky, eds. Speech Communication Special Issue on Pronunciation Modeling and Lexicon Adaptation, 46:2, June 2005.

Pronunciation modeling with articulatory features


Visual/Audio-visual ASR

  • DONE A. Mashari, J. Sison, C. Neti, G. Potamianos, J. Luettin, Modeling visual co-articulation for large vocabulary continuous visual speech recognition. ICASSP 2001 Conference Student Forum.
    • Explores visually meaningful co-articulation models using decision trees. Part of the JHU Workshop 2000 (see below).
  • DONE C. Neti, G. Potamianos, J. Luettin, I. Matthews, H. Glotin, and D. Vergyri, Large-vocabulary audio-visual speech recognition: A summary of the Johns Hopkins Summer 2000 Workshop. Proc. Works. Signal Processing 2001.
    • Focused on integration of audio and visual speech signals for large-vocabulary recognition (workshop homepage).

Hybrid/Tandem ASR

The institutes currently pursuing these approaches include IDIAP, ICSI, and SRI. Key authors to search for are Bourlard, Morgan, and Hermansky.

  • Bourlard, H., and Morgan, N. (1998), “Hybrid HMM/ANN Systems for Speech Recognition: Overview and New Research Directions,” in Adaptive Processing of Sequences and Data Structures, C.L. Giles and M. Gori (Eds.), Lecture Notes in Artificial Intelligence (1387), Springer Verlag (ISBN 3-540-64341-9), pp. 389-417.

  • Morgan, N. and Bourlard, H. (1995), “Continuous Speech Recognition: An Introduction to the Hybrid HMM/Connectionist Approach,” IEEE Signal Processing Magazine, Invited Paper, vol. 12, no. 3, pp. 25-42, May 1995 (IEEE Award paper).

  • Renals, S., Morgan, N., Bourlard, H., Cohen, M. and Franco, H. (1994), “Connectionist Probability Estimators in HMM Speech Recognition,” IEEE Trans. on Speech and Audio Processing, vol. 2, no. 1, pp. 161-174.

Corpora

  • DONE Bowon Lee, Mark Hasegawa-Johnson, Camille Goudeseune, Suketu Kamdar, Sarah Borys, Ming Liu, and Thomas Huang, "AVICAR: An Audiovisual Speech Corpus in a Car Environment," ICSLP 2004.

Review papers/position papers/idea papers

  • O. Schmidbauer, F. Casacuberta, M.J. Castro, G. Hegerl, H. Hoge, J.A. Sanchez & I. Zlokarnik, ``Articulatory representation and speech technology'', Language and Speech 36, pp. 331-351, 1993.

  • R.C. Rose, J. Schroeter & M.M. Sondhi, ``An investigation of the potential role of speech production models in automatic speech recognition'', Proceedings ICSLP-94, pp. 575-578.

  • R.S. McGowan & A. Faber, ``Introduction to papers on speech recognition and perception from an articulatory point of view'', JASA 99(3), pp. 1680-1681, 1996.

  • DONE M. Ostendorf, “Moving beyond the ‘beads-on-a-string’ model of speech,” Proc. IEEE ASRU Workshop, 1999.
  • DONE M. Ostendorf, ``Incorporating linguistic theories of phonological variation into speech recognition models,'' Phil. Trans. Royal Society, vol. 358, no. 1769, pp. 1325-1338, 2000.
    • These two papers give some good background and motivate models beyond phone-based HMMs.

Other

  • N. Morgan and E. Fosler-Lussier. "Combining Multiple Estimators of Speaking Rate," International Conference on Acoustics, Speech, and Signal Processing (ICASSP-98), Seattle, Washington, 1998.
  • N. Morgan, E. Fosler, and N. Mirghafori. "Speech Recognition using On-line Estimation of Speaking Rate," Fifth European Conference on Speech Communication and Technology (Eurospeech '97), Rhodes, Greece, 1997.
  • N. Mirghafori, E. Fosler, and N. Morgan. "Towards Robustness to Fast Speech in ASR," International Conference on Acoustics, Speech, and Signal Processing (ICASSP-96), Atlanta, Georgia, 1996.
  • N. Mirghafori, E. Fosler, and N. Morgan. "Why Is ASR Harder For Fast Speech and What Can We Do About It?" Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU '95), Snowbird, Utah, 1995.
  • N. Mirghafori, E. Fosler, and N. Morgan. "Fast Speakers in Large Vocabulary Continuous Speech Recognition: Analysis & Antidotes," Fourth European Conference on Speech Communication and Technology (Eurospeech '95), Madrid, Spain, 1995.
    • May be useful if we want to do experiments with varying speaking rates.

-- KarenLivescu - 11 Dec 2005