-- KarenLivescu - 15 Dec 2005

First official planning meeting, Sunday, April 23, 9am-5pm, MIT, Cambridge, MA

  • Local information
  • Present:
    • Team members: Karen, Ozgur, Mark, Simon, Nash, Chris, Arthur, Partha, Lisa
    • Satellite/advisory members: Trevor, Edward, Kate (thank you!)
  • Presentation slides
    • Karen: Project intro, DBNs for ASR, pronunciation modeling, feature transcriptions, wrap-up and discussion slides. PPT and PDF
    • Simon: Articulatory feature classification, hybrid and tandem models; PDF and PDF 8-up handout version for printing
    • Mark: SVM classifiers, lessons from JHU WS04, AVICAR corpus. PPT
    • Kate: Audio-visual speech recognition (includes additional slides on ICCV visual ASR paper) PDF
    • Chris: GMTK tutorial. PPT

Status report prior to the meeting

What's been done, and who's been involved in each.

  • Feature sets & phone ↔ canonical feature value mappings determined (Karen, Simon, Mark, Eric)
    • Feature set for pronunciation modeling may still change a bit (Karen)
  • Manual feature transcriptions underway (Karen, Xuemin, Lisa)
  • Video front-end processing (Kate, Mark) and audio-visual baseline recognizers (Kate, Karen) underway
  • GMTK infrastructure
    • State tying tool underway (Simon, Jeff)
    • Updated parallel training/decoding scripts underway (Karen)
  • Training of NN feature classifiers about to start (Joe, Simon)
  • Maybe won’t use SVM feature classifiers? (Mark, Karen, Simon)
  • SVitchboard baseline GMTK recognizers unchanged :-( (Karen)

Some main conclusions

  • (See last few of Karen's slides above for more info)

  • We plan to test several observation models...
    • Fully-generative (Gaussian mixture-based)
      • May compare factored vs. unfactored observation models (diagrammed here)
    • Hybrid (using scaled AF classifier outputs as likelihoods)
    • Tandem (using AF classifier outputs as observations)
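The hybrid option above can be sketched as follows. By Bayes' rule, p(x | v) = P(v | x) p(x) / P(v), and p(x) is the same for every feature value v within a frame, so dividing a classifier posterior by the value's prior gives a likelihood up to a per-frame constant. This is a minimal Python sketch, assuming the classifier outputs per-frame posteriors over the values of one feature and that priors come from training label frequencies; the function name and the flooring constant are illustrative and not taken from the project's scripts.

```python
import math


def scaled_log_likelihoods(posteriors, priors, floor=1e-10):
    """Turn posteriors P(v | x) for one frame into scaled log-likelihoods.

    Returns log P(v | x) - log P(v) for each feature value v. Both inputs
    are floored (an assumption of this sketch) to avoid log(0).
    """
    if len(posteriors) != len(priors):
        raise ValueError("posteriors and priors must have the same length")
    return [
        math.log(max(p, floor)) - math.log(max(q, floor))
        for p, q in zip(posteriors, priors)
    ]


# Hypothetical frame: three values of one articulatory feature.
frame_scores = scaled_log_likelihoods([0.7, 0.2, 0.1], [0.5, 0.3, 0.2])
```

A value whose posterior exceeds its prior gets a positive score, and one below its prior gets a negative score.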

  • ...and several pronunciation models.
    • Context-dependent feature substitution probabilities
    • Different ways of modeling asynchrony (diagrammed here)
      • Using async variables for feature subsets (the "original" model)
      • Using single async variable that evaluates the goodness of the entire “synchrony state”
      • No async variables, just condition each stream’s state transition on other stream states (coupled HMM-style)
    • Cross-word asynchrony
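As a rough illustration of the asynchrony variables listed above, one possible formulation scores a pair of feature streams by how far apart their sub-word state indices are. The sketch below is an assumption about how such a variable might be defined, not the model discussed at the meeting: the degree of asynchrony is taken to be the absolute index difference, and the probability table is a hypothetical placeholder.

```python
def async_degree(index_a, index_b):
    """Degree of asynchrony between two streams (assumed definition:
    absolute difference of their sub-word state indices)."""
    return abs(index_a - index_b)


def async_prob(index_a, index_b, degree_probs):
    """Look up the probability of the current degree of asynchrony.

    `degree_probs` maps degree -> probability. Degrees missing from the
    table get probability 0.0, which acts as a hard limit on asynchrony.
    """
    return degree_probs.get(async_degree(index_a, index_b), 0.0)


# Hypothetical table: streams are usually synchronous, at most 2 states apart.
example_probs = {0: 0.7, 1: 0.2, 2: 0.1}
p_one_apart = async_prob(3, 4, example_probs)
```

With one such variable per pair of feature subsets, each pair is scored separately. A single variable over the whole synchrony state would instead score all stream indices jointly.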

  • Corpora
    • Audio-only: Mainly SVitchboard 1
      • If results are very good, might try Switchboard?
      • If results are very bad, might try PhoneBook?
      • Might also try using AF classifier outputs in a tandem Arabic recognizer at ICSI, just because we can
    • Audio-visual: Choice of corpus will depend on ongoing baseline experiments
      • There is a concern that digits tasks may be sufficiently well modeled by existing phoneme-viseme approaches
      • Nevertheless, we will continue working with these for now since they have quick turnaround for initial experiments
      • For audio-visual work, we may use only fully-generative models, perhaps also hybrid/tandem depending on time and visual AF classifier performance

  • AF classifiers for hybrid/tandem models
    • We will use NNs, and perhaps also SVMs
    • We would like to do embedded training of classifiers, re-aligning between iterations with a pronunciation model that allows for asynchrony and reductions
    • We will test AF classifiers against manual transcriptions currently being created at MIT

  • We will use forced-aligned AF transcriptions as "ground truth" for pronunciation modeling experiments and maybe analysis
    • Which classifiers/models will be used to generate forced alignments will depend on accuracy as tested against manual transcriptions

  • Acoustic observations:
    • Fully-gen models: PLPs per frame, PLPs over N frames + transformation
    • AF classifiers: trained on PLPs over N concatenated frames; also log critical band energies?
    • Tandem: classifier outputs + PLPs per frame
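The observation vectors listed above can be sketched in a few lines of Python. This is a minimal illustration, assuming frames are plain lists of floats, that N = 2 * context + 1, and that edge frames are repeated for padding; the function names are hypothetical. Tandem systems often also take logs of the classifier outputs and decorrelate them before concatenation, which is omitted here.

```python
def stack_frames(frames, context):
    """Concatenate each frame with `context` neighbours on each side.

    Edge frames are repeated for padding (an assumption of this sketch),
    so every output vector has (2 * context + 1) * frame_dim entries.
    """
    last = len(frames) - 1
    stacked = []
    for t in range(len(frames)):
        window = []
        for offset in range(-context, context + 1):
            idx = min(max(t + offset, 0), last)  # clamp to valid range
            window.extend(frames[idx])
        stacked.append(window)
    return stacked


def tandem_observations(plp_frames, classifier_outputs):
    """Append per-frame classifier outputs to per-frame PLP vectors."""
    if len(plp_frames) != len(classifier_outputs):
        raise ValueError("both streams must have the same number of frames")
    return [list(p) + list(c) for p, c in zip(plp_frames, classifier_outputs)]
```

In this reading, the stacked frames are the classifier input, and the classifier's per-frame outputs are then joined with the unstacked PLPs to form the tandem observation.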

  • Training data:
    • NNs: All of (Fisher – Switchboard 1) for now; maybe reduce later or for embedded training
    • SVMs will have to be trained on less data
    • Fully-gen: Just SVitchboard for now; later, maybe train on Switchboard/test on SVitchboard

To do by next planning meeting

  • SVitchboard 1 baselines (Karen, Ozgur)

  • GMTK tying tool (Simon)

  • Audio-visual baselines and decisions on which corpora to use (Kate, Karen, Mark)
    • For now, Kate & Karen work on MIT webcam digits corpus, Mark on AVICAR
    • Kate & Mark will try MIT's liptracking algorithm on AVICAR

  • NN audio AF classifiers (Simon, Joe)

  • SVM audio AF classifiers (Ozgur, Mark)

  • Visual classifiers (Ozgur)
    • Visual observations and alignments, at least for initial tests, to be provided by Kate & Karen

  • Updated GMTK training/decoding scripts (Karen, Chris, Simon, Partha)
    • These should be ready around mid-May, along with an example training/decoding task, so that they can be tested at the various sites:
      • JHU (Nash, Lisa)
      • UW (Chris)
      • Edinburgh (Simon, Partha, Bronwyn?)
      • UIUC (Mark, Arthur)
      • MIT? (Karen)

  • Manual AF transcriptions
    • All utts done/almost done (Karen)
    • Analysis (e.g. inter-transcriber agreement) (Nash, Lisa)
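The notes do not say which agreement measure will be used. One common choice for two transcribers is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. This is a minimal Python sketch, assuming each transcriber's labels for one feature have already been aligned frame by frame (or segment by segment); the label names in the example are made up.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length label sequences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(labels_a)
    # Fraction of positions where the two transcribers agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each transcriber's own label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[v] * count_b[v] for v in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical transcriptions of one feature over four segments.
kappa = cohen_kappa(["stop", "stop", "fric", "fric"],
                    ["stop", "fric", "fric", "fric"])
```

Computing kappa separately for each feature would show which features the transcribers find hardest to label consistently.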

-- KarenLivescu - 04 May 2006