-- KarenLivescu - 15 Dec 2005
Results log
Some pre-workshop results
We've changed a number of things since these models were built (types of observations, dictionaries, ...), so these numbers are only a rough guide to what to expect.
SVitchboard
| Model | Vocab  | Task | Test set WER(%) | notes |
| Monophone | 10 | 1 | 31.2 | uses MIT PLPs, 64 Gaussians/state, ins penalty = 0, lm_scale = .8, am_scale = 0.2; not fully tuned |
| Monophone | 100 | 69.1 | MFCCs |
| Whole-word (King et al., Eurospeech '05) | 500 | 70.8 | MFCCs |
| Monofeat | 69.1 | MFCCs, # states = 130 |
| Whole-word (King et al., Eurospeech '05) | 20.8 | MFCCs |
CUAVE
See AudioVisualRecCUAVE
Workshop results
NOTE:
- Click in the Model column for a (currently very brief) definition of the model.
- Results in italics are provisional (e.g. still being tuned).
- All systems are GMTK unless otherwise indicated. Tuning is on CV_small unless otherwise indicated.
- Results on the full CV set will generally not be computed. The column is included only to report those full-CV numbers that we happen to have from initial experiments.
- Boxes marked '*' are ones we don't plan to fill in unless we have spare time or we absolutely need them for comparison with another model.
- Initials next to a model indicate the person working on it. Models with no initials have not yet been taken up by anyone.
SVitchboard, phone-based models using PLPs
| Model | Vocab | Task | WER (%), CV_small (e.g. D_short) | WER (%), CV (e.g. D) | WER (%), Test (e.g. E) | Notes |
| (OC) Monophone | 10 | 1 | 20.9 | * | 24.5 | (best result) 64 comp, lms=25, lmp=-4 |
| (OC) Monophone | 10 | 1 | 21.6 | * | 25.5 | (for comparison to the result below) 32 comp, lms=27, lmp=-2 |
| (OC) Monophone w/train wd alignments | 10 | 1 | 16.7 | 18.7 | 19.6 | (fully tuned) 32 comp, lms=25, lmp=-1 |
| (CB) Triphone | 10 | 1 | | * | | |
| (OC) Monophone | 500 | 1 | 65.1 | 66.2 | 67.7 | 128 comp, lms=10, lmp=-1 |
| (OC) Monophone w/train wd alignments | 500 | 1 | 62.1 | 63.4 | 65.0 | 128 comp, lms=10, lmp=-1 |
| (CB) Triphone | 500 | 1 | 56.1 | * | 59.2 | without word alignments |
| (CB) Triphone | 500 | 1 | 58.7 | | 61.8 | trained with word alignments, 64 components, lms=12, lmp=-2 |
| (OC) HTK triphone | 500 | 1 | * | 56.4 | 61.2 | 16comp, lms=15, lmp=-18, 560 tied states |
| (OC) SRI, 1st pass | 500 | 1 | * | 41.6 | 42.4 | |
| (OC) SRI, Nth pass | 500 | 1 | * | 26.2 | 26.8 | |
SVitchboard, tandem experiments
| Model | Vocab | Task | WER (%), CV_small (e.g. D_short) | WER (%), CV (e.g. D) | WER (%), Test (e.g. E) | Notes |
| (OC) TandemMonophoneUnfactoredFisher | 10 | 1 | 16.0 | * | 21.1 | 64 comp, lms=60, lmp=0 |
| (OC) TandemMonophoneSemiFactored1Fisher | 10 | 1 | 15.5 | * | 19.7 | 64 comp, lms=71, lmp=-1 |
| (OC) TandemMonophonePhoneMLPSVB | 500 | 1 | 59.2 | * | 63.0 | 128comp lms=21 lmp=-1 |
| (OC) TandemMonophoneUnfactoredFisher | 500 | 1 | 54.5 | 58.4 | 59.7 | 128comp lms=17 lmp=0 |
| (OC) TandemMonophoneUnfactoredSVB | 500 | 1 | 57.9 | * | 62.3 | 128comp lms=14 lmp=-1 |
| (AK) TandemMonophoneFactoredFisher | 500 | 1 | | * | | |
| (AK) TandemMonophoneSemiFactored1Fisher | 500 | 1 | 58.7 | * | 59.5 | 128 comp, lms=16, lmp=-1, weight=1, ckbeam=10000 |
| (OC) TandemMonophoneSemiFactored1Fisher | 500 | 1 | 54.9 | * | 59.1 | 128 comp, lms=16, lmp=-1 |
| (AK) TandemMonophoneSemiFactored2Fisher | 500 | 1 | | * | | |
| (OC) TandemTriphoneUnfactoredFisher | 500 | 1 | 49.9 | * | 55.0 | 64comp lms=19 lmp=-2 |
| (OC) TandemTriphoneSemiFactored1Fisher | 500 | 1 | 48.7 | * | 53.8 | 64comp lms=22 lmp=-1 |
| Best of the above + embedded | 500 | 1 | | * | | |
| Triphone, best of the above + best of factored | 500 | 1 | | * | | |
SVitchboard, pron modeling experiments
SVitchboard, hybrid experiments
Where two results are given, these are before / after 1 round of embedded MLP training.
CUAVE
See CuaveRecipe
These are dev results only for now, using the "S" split.
| | WER (%) |
| Model | CLEAN | 12dB | 10dB | 6dB | 4dB | -4dB | Notes |
| (OC) Audio-only | 1.3 | 18.0 | 23.3 | 39.7 | 50.0 | 81.3 | 32 components |
| (OC) Video-only | 63.3 | 63.3 | 63.3 | 63.3 | 63.3 | 63.3 | 16 components |
| (PL) Monophone + viseme, synchronous | 1.7 | 8.0 | 11.3 | 23.0 | 30.0 | 57.3 | 16 gaussians was optimal for CLEAN, and so used throughout this row. dev set on original split |
| (KL) Monophone + viseme, async var, 1 state of async | 0.7 | 8.3 | 12.7 | 25.3 | 35.0 | 57.3 | 16comp, trained starting from Ozgur's synchronous 8comp; #s in parentheses are video weight; weights tested are {0, .1, .3, .5, .7, .9, 1} |
| (KL) Monophone + viseme, async var, 2 states of async | 1.0 | 8.7 | 12.0 | 25.3 | 34.7 | 58.7 | 16comp, trained starting from Ozgur's synchronous 8comp; #s in parentheses are video weight. |
| (KL) ", flat start | 1.0 | 9.0 | 13.7 | 22.3 | 29.3 | 54.0 | 16comp |
| (KL) Monophone + viseme, async var, asymmetric async | | | | | | | |
| (MH) phone+viseme CHMM | 1.7 | 9.7 | 12.3 | 25.3 | 33.3 | | 16 Gaussians, 0-state async |
| (MH) phone+viseme CHMM | 1.3 | 6.0 | 11.3 | 26.3 | 31.7 | 59.0 | 16 Gaussians, 1-state async |
| (MH) phone+viseme CHMM | 1.3 | 7.7 | 10.0 | 20.3 | 28.3 | 60.7 | 8 Gaussians, 2-state async |
| (MH) phone+viseme 2-chain | 1.7 | 7.0 | 9.0 | 21.0 | 31.0 | 61.3 | 4 Gaussians; unlimited within-word asynchrony |
| (MH) 1-state monofeat CHMM | 8.3 | 36.0 | 47.3 | | | | 16 Gaussians, 2-state async; AF-dep transitions |
| (MH) 3-state monofeat CHMM | 1.0 | 10.3 | 15.3 | 27.7 | 38.0 | 65.0 | 4 Gaussians, 1-state async; AF-dep transitions |
| (MH) 3-state monofeat CHMM | 1.7 | 9.3 | 13.7 | 28.3 | 35.7 | 65.7 | 1 Gaussian, 2-state async; AF-dep transitions |
| (MH) above+expanded card | 2.0 | 6.3 | 10.3 | 21.3 | 29.3 | 67.0 | AF cardinalities expanded so that every monophone state is distinct: G:(VL)->(asp,other VL); T:(MF,HF)->(MFTense,MFLax,HFTense,HFLax); /ow2/->protruded wide (not a new state, just changed ow2); 2 Gaussians |
| (MH) above+phTrans | 1.3 | 7.7 | 11.3 | 21.0 | 29.7 | 61.3 | expanded set, with phone-dep transitions |
| (PL) Monofeat | 5.0 | 20.7 | 24.7 | 42.7 | 48.7 | 68.7 | 16 gaussians was optimal on CLEAN for this row. dev set on original split |
| (PL) Monofeat (tied Gaussians) | | | | | | | Tying (with these questions) didn't help |
| (PL) Monofeat (3 states per phone) | 2.3 | 8.0 | 11.7 | 23.3 | 32.7 | 62.7 | 4 Gaussians was optimal on CLEAN |
| (PL) Monofeat (3 states per phone, tied) | 1.3 | 9.7 | 12.3 | 25.3 | 33.3 | 65.3 | 430 clusters, 4 gaussians |
| (PL) Monofeat (crossword asynchrony) | | | | | | | 3 states per phone |
Audio and video weights sum to one. We only consider weights at intervals of 0.1.
| | video weight |
| Model | CLEAN | 12dB | 10dB | 6dB | 4dB | -4dB | Notes |
| Monophone + viseme, synchronous | 0.1 | 0.1 | 0.1 | 0.1 | 0.2 | 0.8 | The weights are optimized for each noise condition. dev set on original split |
| (KL) Monophone + viseme, async var, 1 state of async | .1 | .1 | .1 | .1 | .2 | .7 | |
| (KL) Monophone + viseme, async var, 2 states of async | .1 | .1 | .1 | .2 | .2 | .8 | |
| (KL) ", flat start | .1 | .1 | .2 | .2 | .2 | .4 | |
| (MH) Monophone + viseme, CHMM, 0-state async | 0.1 | 0.2 | 0.2 | 0.3 | 0.3 | 0.2 | |
| (MH) Monophone + viseme, CHMM, 1-state async | 0.1 | 0.1/2 | 0.1/2 | 0.1 | 0.2 | 0.5 | / means two weights give the same performance |
| (MH) Monophone + viseme, CHMM, 2-state async | 0.1 | 0.2 | 0.2 | 0.2 | 0.3 | 0.4 | |
| (MH) Monophone + viseme, 2-chain | 0.1 | 0.2 | 0.2 | 0.2 | 0.2 | 0.2 | 4 Gaussians; unlimited within-word asynchrony |
| (MH) 1-state monofeat CHMM, 1-state async | 0.1 | 0.2 | 0.2 | | | | no tying; 16 Gaussians |
| (MH) 1-state monofeat CHMM, 2-state async | 0.1 | 0.2 | 0.2 | | | | no tying; 16 Gaussians |
| (MH) 3-state monofeat CHMM, 1-state async | 0.1 | 0.1/2 | 0.1 | 0.1 | 0.2 | 0.7 | no tying; 4 Gaussians |
| (MH) 3-state monofeat CHMM, 2-state async | 0.1 | 0.1 | 0.1 | 0.2 | 0.2 | 0.4 | no tying; 1 Gaussian |
| (MH) monofeat with expanded AF cardinality | 0.1 | 0.2 | 0.1 | 0.1 | 0.1 | 0.3 | |
| Monofeat | 0.2 | 0.2 | 0.3 | 0.2 | 0.2 | 0.3 | |
| Monofeat (3 states per phone) | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.7 | |
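The weighting scheme behind the table above can be illustrated with a minimal sketch. It assumes the two streams are combined as a weighted sum of per-stream log-likelihoods, which is a common choice for audio-visual recognition but is not confirmed by this page. All function names here are hypothetical and are not part of GMTK or any workshop script.

```python
def video_weight_grid(step=0.1):
    """Candidate video weights from 0 to 1 at the given interval."""
    n = round(1 / step)
    return [i / n for i in range(n + 1)]


def combine_stream_scores(audio_loglik, video_loglik, video_weight):
    """Weighted sum of stream log-likelihoods (assumed combination rule).

    The audio weight is implied, because the two weights sum to one.
    """
    if not 0.0 <= video_weight <= 1.0:
        raise ValueError("video_weight must lie in [0, 1]")
    audio_weight = 1.0 - video_weight
    return audio_weight * audio_loglik + video_weight * video_loglik


def best_video_weights(wer_by_weight):
    """Return every weight tied for the lowest WER.

    Ties correspond to entries such as "0.1/2" in the table.
    """
    best = min(wer_by_weight.values())
    return sorted(w for w, wer in wer_by_weight.items() if wer == best)
```

For example, `best_video_weights({0.1: 6.0, 0.2: 6.0, 0.3: 7.7})` returns both 0.1 and 0.2, matching the "/" notation used for tied weights.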
Cross-validation run.
Train on FGH, tune the number of Gaussians on I, then tune the video weight at that number of Gaussians across SNRs on I. Train again on FGHI and test on J using the previously optimized number of Gaussians and video weights.
| | WER (%) |
| Model | CLEAN | 12dB | 10dB | 6dB | 4dB | -4dB | Notes |
| (PL) Monophone + viseme, synchronous | 2.4 | 7.3 | 9.7 | 18.2 | 29.2 | 56.2 | |
| (PL) Monofeat (3 states per phone) | 2.4 | 9.1 | 13.3 | 29.2 | 36.5 | 66.0 | |
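The cross-validation procedure described before this table can be sketched roughly as follows. The `train` and `evaluate` callables are hypothetical placeholders for the real training and decoding scripts. Two details are assumptions rather than facts from this page: that the number of Gaussians is tuned on the CLEAN condition, and that it is tuned at a fixed initial video weight.

```python
def run_cross_validation(train, evaluate, snrs, gaussian_counts, video_weights,
                         tune_snr="CLEAN", init_video_weight=0.1):
    """Tune on group I, then retrain on F, G, H, I and test on J.

    train(groups, n_gauss) -> model                      (hypothetical)
    evaluate(model, group, snr, video_weight) -> WER     (hypothetical)
    """
    # Step 1: train on FGH and pick the number of Gaussians on I.
    # Assumption: tuning uses one condition and one fixed video weight.
    models = {n: train(["F", "G", "H"], n) for n in gaussian_counts}
    best_n = min(
        gaussian_counts,
        key=lambda n: evaluate(models[n], "I", tune_snr, init_video_weight),
    )

    # Step 2: at that number of Gaussians, pick a video weight per SNR on I.
    best_w = {
        snr: min(video_weights,
                 key=lambda w: evaluate(models[best_n], "I", snr, w))
        for snr in snrs
    }

    # Step 3: retrain on FGHI and test on J with the tuned settings.
    final_model = train(["F", "G", "H", "I"], best_n)
    test_wer = {snr: evaluate(final_model, "J", snr, best_w[snr])
                for snr in snrs}
    return best_n, best_w, test_wer
```

Passing the training and scoring steps in as callables keeps the sketch independent of any particular toolkit, so only the order of tuning and retraining is being illustrated.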
OLD TABLE FOR ARCHIVAL PURPOSES
| | WER (%) | |
| Model | Observations | Vocab | Task | CV_small | CV | Test | Notes |
| Monophone | PLP | 10 | 1 | 20.9 | | 24.5 | lm_scale = 25; lm_penalty = -4; 64 components; 57 states |
| Hybrid | 29.0 | | 35.1 | details |
| Tandem | | | | |
| Monofeat | PLP | 31.9 | | | Unoptimized |
| 3-state Monofeat | PLP | 24.6 | | | Unoptimized |
| Monofeat | PLP | 28.3 | | 28.5 | lm_scale = 12; lm_penalty = -8, 64 gaussians per mixture, (arthur) details |
| SRI, 1st pass | 500 | | 41.6 | 42.4 | |
| SRI, Nth pass | | 26.2 | 26.8 | |
| Monophone | PLP | | | | |
| Hybrid | | | | under construction |
| Tandem + KLT | | | | |
| HTK triphone | PLP | | 56.4 | 61.2 | lm_scale = 15; lm_penalty = -18; 16 components; 560 tied states |
| Triphone | PLP | 59.6 | 61.6 | 62.7 | gmtkTie, 16 comp per mixture, tuned on CV_small |
Fine print
-- KarenLivescu - 11 Jul 2006