-- KarenLivescu - 15 Dec 2005

Results log

Some pre-workshop results

We've changed a number of things since these models were built (types of observations, dictionaries, ...), so these numbers are only a rough guide to what to expect.

SVitchboard

Model Vocab Task Test set WER(%) Notes
Monophone 10 1 31.2 uses MIT PLPs, 64 Gaussians/state, ins penalty = 0, lm_scale = .8, am_scale = 0.2; not fully tuned
Monophone 100 69.1 MFCCs
Whole-word (King et al., Eurospeech '05) 500 70.8 MFCCs
Monofeat 69.1 MFCCs, # states = 130
Whole-word (King et al., Eurospeech '05) 20.8 MFCCs

CUAVE

See AudioVisualRecCUAVE


Workshop results

NOTE:

  • Click in the Model column for a (currently very brief) definition of the model.
  • Results in italics are provisional (e.g. still being tuned).
  • All systems are GMTK unless otherwise indicated. Tuning is on CV_small unless otherwise indicated.
  • Results on the full CV set will generally not be computed. The column is included only to report those full-CV numbers that we happen to have from initial experiments.
  • Boxes marked '*' are ones we don't plan to fill in unless we have spare time or we absolutely need them for comparison with another model.
  • Initials next to models indicate the person working on it. Models with no initials have not yet been taken up by anyone.
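
The tuning procedure behind the `lms` (LM scale) and `lmp` (LM penalty) values in the Notes columns can be sketched roughly as a grid search on CV_small. This is only an illustration: `wer_fn` is a hypothetical callable standing in for a full decoding and scoring run, the grids are arbitrary, and the toy surface below is made up and does not correspond to any result on this page.

```python
from itertools import product

def tune_lm_params(wer_fn, lms_grid, lmp_grid, tune_split="CV_small"):
    """Return (best_wer, lms, lmp) minimising WER on the tuning split.

    wer_fn(split, lms, lmp) -> WER in percent. Hypothetical: it stands in
    for decoding the split with those settings and scoring the output.
    """
    best = None
    for lms, lmp in product(lms_grid, lmp_grid):
        wer = wer_fn(tune_split, lms, lmp)
        if best is None or wer < best[0]:
            best = (wer, lms, lmp)
    return best

def toy_wer(split, lms, lmp):
    # Made-up surface with a single minimum at lms=20, lmp=-2.
    # Purely illustrative; these are not real recognition results.
    return 50.0 + abs(lms - 20) + abs(lmp + 2)

best_wer, best_lms, best_lmp = tune_lm_params(
    toy_wer, lms_grid=range(10, 41, 5), lmp_grid=range(-8, 1, 2)
)
```

The Test number would then be obtained by decoding the test set once with the selected pair, rather than by tuning on the test set.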

SVitchboard, phone-based models using PLPs

  WER (%)  
Model Vocab Task CV_small CV Test Notes
(e.g. D_short) (e.g. D) (e.g. E)
(OC) Monophone 10 1 20.9 * 24.5 (best result) 64 comp, lms=25, lmp=-4
(OC) Monophone 21.6 * 25.5 (for comparison to the result below) 32 comp, lms=27, lmp=-2
(OC) Monophone w/train wd alignments 16.7 18.7 19.6 (fully tuned) 32 comp, lms=25, lmp=-1
(CB) Triphone   *    
(OC) Monophone 500 1 65.1 66.2 67.7 128 comp, lms=10, lmp=-1
(OC) Monophone w/train wd alignments 62.1 63.4 65.0 128 comp, lms=10, lmp=-1
(CB) Triphone 56.1 * 59.2 without word alignments
(CB) Triphone 58.7   61.8 trained with word alignments, 64 components, lms=12, lmp=-2
(OC) HTK triphone * 56.4 61.2 16comp, lms=15, lmp=-18, 560 tied states
(OC) SRI, 1st pass * 41.6 42.4  
(OC) SRI, Nth pass * 26.2 26.8  

SVitchboard, tandem experiments

  WER (%)  
Model Vocab Task CV_small CV Test Notes
(e.g. D_short) (e.g. D) (e.g. E)
(OC) TandemMonophoneUnfactoredFisher 10 1 16.0 * 21.1 64 comp, lms=60, lmp=0
(OC) TandemMonophoneSemiFactored1Fisher 15.5 * 19.7 64 comp, lms=71, lmp=-1
(OC) TandemMonophonePhoneMLPSVB 500 1 59.2 * 63.0 128comp lms=21 lmp=-1
(OC) TandemMonophoneUnfactoredFisher 54.5 58.4 59.7 128comp lms=17 lmp=0
(OC) TandemMonophoneUnfactoredSVB 57.9 * 62.3 128comp lms=14 lmp=-1
(AK) TandemMonophoneFactoredFisher   *    
(AK) TandemMonophoneSemiFactored1Fisher 58.7 * 59.5 128 comp, lms=16, lmp=-1, weight=1, ckbeam=10000
(OC) TandemMonophoneSemiFactored1Fisher 54.9 * 59.1 128 comp, lms=16, lmp=-1
(AK) TandemMonophoneSemiFactored2Fisher   *    
(OC) TandemTriphoneUnfactoredFisher 49.9 * 55.0 64comp lms=19 lmp=-2
(OC) TandemTriphoneSemiFactored1Fisher 48.7 * 53.8 64comp lms=22 lmp=-1
Best of the above + embedded   *    
Triphone, best of the above + best of factored   *    

SVitchboard, pron modeling experiments

  WER (%)  
Model Vocab Task CV_small CV Test Notes
(e.g. D_short) (e.g. D) (e.g. E)
(AK) MonoFeat1State 10 1 28.3 * 28.5 64comp lms=12 lmp=-8 details
(CB) MonoFeat3State 22.7 * 25.4 no word alignments, 32comp lms=12 lmp=-8
(CB) MonoFeat3State 17.6 * 21.3 with word alignments, 32comp lms=8 lmp=-6
(CB) MonoFeat3State, no asynchrony 18.5 * 20.7 with word alignments, 64comp lms=16 lmp=-1
TriFeat3State   *    
(BW) MonoFeat1StateSubsCI 48.0 *    
(BW) MonoFeat3StateSubsSmoothed   *    
(CB) MonoFeat3StateCHMM   *    
(LY) MonoFeat3StatePOS   *    
(NB) MonoFeat3StateCrosswd   *    
Best of above + tandem   *    
(AK) MonoFeat1State 500 1   *    
(AK) MonoFeat1StateNoAsync 76.1 *   mix=512+ lms=9 lmp=0
(AK) MonoFeat1StateFA 73.1   74.8 256comp lms=6 lmp=-2 not comparable to MonoFeat1StateNoAsync
(AK) MonoFeat1StateNoAsyncFA 67.5 * 70.8 512comp lms=7 lmp=-1
(CB) MonoFeat3State 66.6 * 67.4 with word alignments, 64m, lms=8, lmp=-2
(CB) MonoFeat3StateNoAsync 64.0 * 65.2 with word alignments
TriFeat3State   *    
(BW) MonoFeat3StateSubsCI   *    
(BW) MonoFeat3StateSubsSmoothed   *    
(CB) MonoFeat3StateCHMM   *    
(LY) MonoFeat3StatePOS   *    
(NB) MonoFeat3StateCrosswd   *    
Best of above + tandem   *    

SVitchboard, hybrid experiments

Where two results are given, these are before / after 1 round of embedded MLP training.
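
As a reading aid for the "before / after" cells, here is a small sketch that splits such a cell and computes the relative WER change. The helper names are hypothetical; the two cells are copied from the HybridMonophone, nondet CPTs row for the 10-word vocabulary.

```python
def parse_before_after(cell):
    """Split a 'before / after' table cell into two floats."""
    before, after = (float(part) for part in cell.split("/"))
    return before, after

def relative_reduction(before, after):
    """Relative WER reduction in percent (positive means WER went down)."""
    return 100.0 * (before - after) / before

# Cells copied from the HybridMonophone, nondet CPTs row (10-word vocabulary).
cv_small = parse_before_after("26.0 / 23.1")
test = parse_before_after("30.1 / 24.3")

cv_small_gain = relative_reduction(*cv_small)
test_gain = relative_reduction(*test)
```

For that row, one round of embedded training corresponds to roughly an 11% relative reduction on CV_small and roughly 19% on Test.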

  WER (%)  
Model Vocab Task CV_small CV Test Notes
(e.g. D_short) (e.g. D) (e.g. E)
(SK) HybridMonophone, det. CPTs 10 1 32.9 *    
(SK) HybridMonophone, nondet CPTs 26.0 / 23.1 * 30.1 / 24.3  
(SK) HybridMonophone with PLPs 16.2 * 19.6 initialised from best 32 component monophone model
(SDH/SK) HybridMonofeat, det   *    
(SDH/SK) HybridMonofeat, nondet 43.8 *    
(SK) HybridMonophone, det CPTs 500 * * *  
(SK) HybridMonophone, nondet CPTs 66.6 *    
Best of above + PLP   *    
(SK) HybridMonofeat, det   *    
(SK) HybridMonofeat, nondet   *    

CUAVE

See CuaveRecipe

These are dev results only for now, using the "S" split.

  WER (%)
Model CLEAN 12dB 10dB 6dB 4dB -4dB Notes
(OC) Audio-only 1.3 18.0 23.3 39.7 50.0 81.3 32 components
(OC) Video-only 63.3 63.3 63.3 63.3 63.3 63.3 16 components
(PL) Monophone + viseme, synchronous 1.7 8.0 11.3 23.0 30.0 57.3 16 Gaussians was optimal for CLEAN and so is used throughout this row; dev set on original split
(KL) Monophone + viseme, async var, 1 state of async 0.7 8.3 12.7 25.3 35.0 57.3 16comp, trained starting from Ozgur's synchronous 8comp; #s in parentheses are video weight; weights tested are {0, .1, .3, .5, .7, .9, 1}
(KL) Monophone + viseme, async var, 2 states of async 1.0 8.7 12.0 25.3 34.7 58.7 16comp, trained starting from Ozgur's synchronous 8comp; #s in parentheses are video weight.
(KL) ", flat start 1.0 9.0 13.7 22.3 29.3 54.0 16comp
(KL) Monophone + viseme, async var, asymmetric async              
(MH) phone+viseme CHMM 1.7 9.7 12.3 25.3 33.3   16 Gaussians, 0-state async
(MH) phone+viseme CHMM 1.3 6.0 11.3 26.3 31.7 59.0 16 Gaussians, 1-state async
(MH) phone+viseme CHMM 1.3 7.7 10.0 20.3 28.3 60.7 8 Gaussians, 2-state async
(MH) phone+viseme 2-chain 1.7 7.0 9.0 21.0 31.0 61.3 4 Gaussians; unlimited within-word asynchrony
(MH) 1-state monofeat CHMM 8.3 36.0 47.3       16 Gaussians, 2-state async; AF-dep transitions
(MH) 3-state monofeat CHMM 1.0 10.3 15.3 27.7 38.0 65.0 4 Gaussians, 1-state async; AF-dep transitions
(MH) 3-state monofeat CHMM 1.7 9.3 13.7 28.3 35.7 65.7 1 Gaussian, 2-state async; AF-dep transitions
(MH) above+expanded card 2.0 6.3 10.3 21.3 29.3 67.0 AF cardinalities expanded so that every monophone state is distinct: G:(VL)->(asp,other VL); T:(MF,HF)->(MFTense,MFLax,HFTense,HFLax); /ow2/->protruded wide (not a new state, just changed ow2); 2 Gaussians
(MH) above+phTrans 1.3 7.7 11.3 21.0 29.7 61.3 expanded set, with phone-dep transitions
(PL) Monofeat 5.0 20.7 24.7 42.7 48.7 68.7 16 Gaussians was optimal on CLEAN for this row; dev set on original split
(PL) Monofeat (tied Gaussians)             Tying (with these questions) didn't help
(PL) Monofeat (3 states per phone) 2.3 8.0 11.7 23.3 32.7 62.7 4 Gaussians was optimal on CLEAN
(PL) Monofeat (3 states per phone, tied) 1.3 9.7 12.3 25.3 33.3 65.3 430 clusters, 4 gaussians
(PL) Monofeat (crossword asynchrony)             3 states per phone

Audio and video weights sum to one. We're only looking at intervals of 0.1.
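
A minimal sketch of the weight grid described above. The way the weights enter the score here, as a weighted sum of per-stream log-likelihoods, is an assumption made for illustration; the page does not say how the models apply them, and both function names are hypothetical.

```python
def weight_grid():
    """Candidate (audio_weight, video_weight) pairs in steps of 0.1.

    Built from integer tenths so that each pair is derived from exact
    integers rather than from repeated float addition.
    """
    return [((10 - k) / 10, k / 10) for k in range(11)]

def combined_log_likelihood(ll_audio, ll_video, video_weight):
    """Weighted sum of stream log-likelihoods (assumed combination rule)."""
    audio_weight = 1.0 - video_weight
    return audio_weight * ll_audio + video_weight * ll_video

grid = weight_grid()
```

With this convention a video weight of 0 reduces to the audio-only score and a video weight of 1 to the video-only score.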

  video weight
Model CLEAN 12dB 10dB 6dB 4dB -4dB Notes
Monophone + viseme, synchronous 0.1 0.1 0.1 0.1 0.2 0.8 The weights are optimized for each noise condition. dev set on original split
(KL) Monophone + viseme, async var, 1 state of async .1 .1 .1 .1 .2 .7  
(KL) Monophone + viseme, async var, 2 states of async .1 .1 .1 .2 .2 .8  
(KL) ", flat start .1 .1 .2 .2 .2 .4  
(MH) Monophone + viseme, CHMM, 0-state async 0.1 0.2 0.2 0.3 0.3 0.2  
(MH) Monophone + viseme, CHMM, 1-state async 0.1 0.1/2 0.1/2 0.1 0.2 0.5 "0.1/2" means weights 0.1 and 0.2 give the same performance
(MH) Monophone + viseme, CHMM, 2-state async 0.1 0.2 0.2 0.2 0.3 0.4  
(MH) Monophone + viseme, 2-chain 0.1 0.2 0.2 0.2 0.2 0.2 4 Gaussians; unlimited within-word asynchrony
(MH) 1-state monofeat CHMM, 1-state async 0.1 0.2 0.2       no tying; 16 Gaussians
(MH) 1-state monofeat CHMM, 2-state async 0.1 0.2 0.2       no tying; 16 Gaussians
(MH) 3-state monofeat CHMM, 1-state async 0.1 0.1/2 0.1 0.1 0.2 0.7 no tying; 4 Gaussians
(MH) 3-state monofeat CHMM, 2-state async 0.1 0.1 0.1 0.2 0.2 0.4 no tying; 1 Gaussian
(MH) monofeat with expanded AF cardinality 0.1 0.2 0.1 0.1 0.1 0.3  
Monofeat 0.2 0.2 0.3 0.2 0.2 0.3  
Monofeat (3 states per phone) 0.1 0.1 0.1 0.1 0.1 0.7  

Cross-validation run. Train on FGH, tune the number of Gaussians on I, then tune the video weight at that number of Gaussians for each SNR on I. Train again on FGHI and test on J using the previously optimized number of Gaussians and video weights.
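
The protocol can be sketched as follows. `train` and `wer` are hypothetical callables standing in for model training and decoding. The criterion for choosing the number of Gaussians (CLEAN WER on I, at the best video weight for each candidate) is an assumption, since the description does not state it. The toy functions at the end are made up, only to show the call pattern.

```python
def cross_validation_run(train, wer, gaussian_grid, weight_grid, snrs):
    """Tune on I with FGH models, then retrain on FGHI and test on J.

    train(sets, n_gauss) -> model                     (hypothetical)
    wer(model, test_set, snr, video_weight) -> WER %  (hypothetical)
    """
    tuning_models = {n: train("FGH", n) for n in gaussian_grid}

    # Assumption: choose the number of Gaussians by CLEAN WER on I,
    # using the best video weight for each candidate.
    def clean_wer(n):
        return min(wer(tuning_models[n], "I", "CLEAN", w) for w in weight_grid)

    best_n = min(gaussian_grid, key=clean_wer)

    # Tune the video weight separately for each SNR on I.
    best_w = {}
    for snr in snrs:
        best_w[snr] = min(
            weight_grid, key=lambda w: wer(tuning_models[best_n], "I", snr, w)
        )

    # Retrain on FGHI and test on J with the settings fixed above.
    final_model = train("FGHI", best_n)
    results = {snr: wer(final_model, "J", snr, best_w[snr]) for snr in snrs}
    return best_n, best_w, results

# Toy stand-ins, made up for illustration only (not real results).
def toy_train(sets, n_gauss):
    return (sets, n_gauss)

def toy_wer(model, test_set, snr, video_weight):
    target = {"CLEAN": 0.1, "-4dB": 0.7}[snr]
    return abs(model[1] - 16) + 10 * abs(video_weight - target)

best_n, best_w, results = cross_validation_run(
    toy_train, toy_wer,
    gaussian_grid=[4, 8, 16, 32],
    weight_grid=[k / 10 for k in range(11)],
    snrs=["CLEAN", "-4dB"],
)
```

Nothing is tuned on J in this sketch: the number of Gaussians and the per-SNR video weights are fixed before the final models are scored.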

  WER (%)
Model CLEAN 12dB 10dB 6dB 4dB -4dB Notes
(PL) Monophone + viseme, synchronous 2.4 7.3 9.7 18.2 29.2 56.2  
(PL) Monofeat (3 states per phone) 2.4 9.1 13.3 29.2 36.5 66.0  

OLD TABLE FOR ARCHIVAL PURPOSES

  WER (%)  
Model Observations Vocab Task CV_small CV Test Notes
Monophone PLP 10 1 20.9   24.5 lm_scale = 25; lm_penalty = -4; 64 components; 57 states
Hybrid 29.0   35.1 details
Tandem        
Monofeat PLP 31.9     Unoptimized
3-state Monofeat PLP 24.6     Unoptimized
Monofeat PLP 28.3   28.5 lm_scale = 12; lm_penalty = -8, 64 gaussians per mixture, (arthur) details
SRI, 1st pass 500   41.6 42.4  
SRI, Nth pass   26.2 26.8  
Monophone PLP        
Hybrid       under construction
Tandem + KLT        
HTK triphone PLP   56.4 61.2 lm_scale = 15; lm_penalty = -18; 16 components; 560 tied states
Triphone PLP 59.6 61.6 62.7 gmtkTie, 16 comp per mixture, tuned on CV_small

Fine print

-- KarenLivescu - 11 Jul 2006