-- KarenLivescu - 15 Dec 2005
First official planning meeting, Sunday, April 23, 9am-5pm, MIT, Cambridge, MA
- Local information
- Present:
- Team members: Karen, Ozgur, Mark, Simon, Nash, Chris, Arthur, Partha, Lisa
- Satellite/advisory members: Trevor, Edward, Kate (thank you!)
- Presentation slides
- Karen: Project intro, DBNs for ASR, pronunciation modeling, feature transcriptions, wrap-up and discussion slides. PPT and PDF
- Simon: Articulatory feature classification, hybrid and tandem models; PDF and PDF 8-up handout version for printing
- Mark: SVM classifiers, lessons from JHU WS04, AVICAR corpus. PPT
- Kate: Audio-visual speech recognition (includes additional slides on ICCV visual ASR paper) PDF
- Chris: GMTK tutorial. PPT
Status report prior to the meeting
What's been done, and who's been involved in each.
- Feature sets & phone ↔ canonical feature value mappings determined (Karen, Simon, Mark, Eric)
- Feature set for pronunciation modeling may still change a bit (Karen)
- Manual feature transcriptions underway (Karen, Xuemin, Lisa)
- Video front-end processing (Kate, Mark) and audio-visual baseline recognizers (Kate, Karen) underway
- GMTK infrastructure
- State tying tool underway (Simon, Jeff)
- Updated parallel training/decoding scripts underway (Karen)
- Training of NN feature classifiers about to start (Joe, Simon)
- Maybe won’t use SVM feature classifiers? (Mark, Karen, Simon)
- SVitchboard baseline GMTK recognizers unchanged (Karen)
Some main conclusions
- (See last few of Karen's slides above for more info)
- We plan to test several observation models...
- Fully-generative (Gaussian mixture-based)
- May compare factored vs. unfactored observation models (diagrammed here)
- Hybrid (using scaled AF classifier outputs as likelihoods)
- Tandem (using AF classifier outputs as observations)
- ...and several pronunciation models.
- Context-dependent feature substitution probabilities
- Different ways of modeling asynchrony (diagrammed here)
- Using async variables for feature subsets (the "original" model)
- Using single async variable that evaluates the goodness of the entire “synchrony state”
- No async variables, just condition each stream’s state transition on other stream states (coupled HMM-style)
- Cross-word asynchrony
- Corpora
- Audio-only: Mainly SVitchboard 1
- If results are very good, might try Switchboard?
- If results are very bad, might try PhoneBook?
- Might also try using AF classifier outputs in a tandem Arabic recognizer at ICSI, just because we can
- Audio-visual: Choice of corpus will depend on ongoing baseline experiments
- There is a concern that digits tasks may be sufficiently well modeled by existing phoneme-viseme approaches
- Nevertheless, we will continue working with these for now since they have quick turnaround for initial experiments
- For audio-visual work, we may use only fully-generative models, perhaps also hybrid/tandem depending on time and visual AF classifier performance
- AF classifiers for hybrid/tandem models
- We will use NNs, and perhaps also SVMs
- We would like to do embedded training of classifiers, re-aligning between iterations with a pronunciation model that allows for asynchrony and reductions
- We will test AF classifiers against manual transcriptions currently being created at MIT
- We will use forced-aligned AF transcriptions as "ground truth" for pronunciation modeling experiments and maybe analysis
- The choice of classifiers/models for generating forced alignments will depend on their accuracy as tested against manual transcriptions
- Acoustic observations:
- Fully-gen models: PLPs per frame, PLPs over N frames + transformation
- AF classifiers: trained on PLPs over N concatenated frames; also log critical band energies?
- Tandem: classifier outputs + PLPs per frame
- Training data:
- NNs: All of (Fisher – Switchboard 1) for now; maybe reduce later or for embedded training
- SVMs will have to be trained on less data
- Fully-gen: Just SVitchboard for now; later, maybe train on Switchboard/test on SVitchboard
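As a rough illustration of how the classifier inputs and the hybrid and tandem observations listed above differ, here is a minimal Python sketch. It is not the project's code: the function names, the symmetric context window, the edge padding, and the division of posteriors by class priors as the "scaling" for the hybrid case are all assumptions made for the example. Any further transformation of the tandem features is left out.

```python
import math


def stack_frames(frames, context):
    """Concatenate each frame with `context` neighbours on each side.

    Assumption: a symmetric window of 2*context + 1 frames, with the first
    and last frames repeated at the utterance edges.
    """
    n = len(frames)
    stacked = []
    for t in range(n):
        window = []
        for k in range(t - context, t + context + 1):
            idx = min(max(k, 0), n - 1)  # clamp index at utterance edges
            window.extend(frames[idx])
        stacked.append(window)
    return stacked


def hybrid_scaled_log_likelihoods(posteriors, priors, floor=1e-10):
    """Hybrid-style scores: log(posterior / prior) for each feature value.

    Assumption: "scaled" means dividing by class priors; values are floored
    to avoid log(0).
    """
    return [
        math.log(max(p, floor)) - math.log(max(q, floor))
        for p, q in zip(posteriors, priors)
    ]


def tandem_observation(posteriors, plp_frame):
    """Tandem-style observation: classifier outputs appended to the PLP frame."""
    return list(plp_frame) + list(posteriors)
```

For example, `stack_frames(plps, 4)` would give 9-frame classifier inputs, and `tandem_observation(outputs, plps[t])` would give the per-frame observation vector for a Gaussian mixture-based model; the window size here is illustrative only.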
To do by next planning meeting
- SVitchboard 1 baselines (Karen, Ozgur)
- Audio-visual baselines and decisions on which corpora to use (Kate, Karen, Mark)
- For now, Kate & Karen work on MIT webcam digits corpus, Mark on AVICAR
- Kate & Mark will try MIT's liptracking algorithm on AVICAR
- NN audio AF classifiers (Simon, Joe)
- SVM audio AF classifiers (Ozgur, Mark)
- Visual classifiers (Ozgur)
- Visual observations and alignments, at least for initial tests, to be provided by Kate & Karen
- Updated GMTK training/decoding scripts (Karen, Chris, Simon, Partha)
- These should be ready around mid-May, along with an example training/decoding task, so that they can be tested at the various sites:
- JHU (Nash, Lisa)
- UW (Chris)
- Edinburgh (Simon, Partha, Bronwyn?)
- UIUC (Mark, Arthur)
- MIT? (Karen)
- Manual AF transcriptions
- All utts done/almost done (Karen)
- Analysis (e.g. inter-transcriber agreement) (Nash, Lisa)
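One common way to quantify inter-transcriber agreement is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below is only an example of such an analysis, not necessarily the measure the team will use; it assumes two transcribers have labelled the same sequence of frames (or segments) for a single feature, and the function name is hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length label sequences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(labels_a)
    # Observed agreement: fraction of positions where the labels match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each transcriber's label proportions.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        # Both transcribers used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)
```

Kappa would be computed separately for each feature, since the label inventories differ across features.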
-- KarenLivescu - 04 May 2006