| Model | Vocab | Task | CV_small WER (%) (e.g. D_short) | CV WER (%) (e.g. D) | Test WER (%) (e.g. E) | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| (OC) Monophone | 10 | 1 | 20.9 | * | 24.5 | (best result) 64 comp, lms=25, lmp=-4 |
| (OC) Monophone | 10 | 1 | 21.6 | * | 25.5 | (for comparison to the result below) 32 comp, lms=27, lmp=-2 |
| (OC) Monophone w/ train word alignments | 10 | 1 | 16.7 | 18.7 | 19.6 | (fully tuned) 32 comp, lms=25, lmp=-1 |
| (CB) Triphone | 10 | 1 | | * | | |
| (OC) Monophone | 500 | 1 | 65.1 | 66.2 | 67.7 | 128 comp, lms=10, lmp=-1 |
| (OC) Monophone w/ train word alignments | 500 | 1 | 62.1 | 63.4 | 65.0 | 128 comp, lms=10, lmp=-1 |
| (CB) Triphone | 500 | 1 | 56.1 | * | 59.2 | without word alignments |
| (CB) Triphone | 500 | 1 | 58.7 | | 61.8 | trained with word alignments, 64 comp, lms=12, lmp=-2 |
| (OC) HTK triphone | 500 | 1 | * | 56.4 | 61.2 | 16 comp, lms=15, lmp=-18, 560 tied states |
| (OC) SRI, 1st pass | 500 | 1 | * | 41.6 | 42.4 | |
| (OC) SRI, Nth pass | 500 | 1 | * | 26.2 | 26.8 | |

| Model | Vocab | Task | CV_small WER (%) (e.g. D_short) | CV WER (%) (e.g. D) | Test WER (%) (e.g. E) | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| (OC) TandemMonophoneUnfactoredFisher | 10 | 1 | 16.0 | * | 21.1 | 64 comp, lms=60, lmp=0 |
| (OC) TandemMonophoneSemiFactored1Fisher | 10 | 1 | 15.5 | * | 19.7 | 64 comp, lms=71, lmp=-1 |
| (OC) TandemMonophonePhoneMLPSVB | 500 | 1 | 59.2 | * | 63.0 | 128 comp, lms=21, lmp=-1 |
| (OC) TandemMonophoneUnfactoredFisher | 500 | 1 | 54.5 | 58.4 | 59.7 | 128 comp, lms=17, lmp=0 |
| (OC) TandemMonophoneUnfactoredSVB | 500 | 1 | 57.9 | * | 62.3 | 128 comp, lms=14, lmp=-1 |
| (AK) TandemMonophoneFactoredFisher | 500 | 1 | | * | | |
| (AK) TandemMonophoneSemiFactored1Fisher | 500 | 1 | 58.7 | * | 59.5 | 128 comp, lms=16, lmp=-1, weight=1, ckbeam=10000 |
| (OC) TandemMonophoneSemiFactored1Fisher | 500 | 1 | 54.9 | * | 59.1 | 128 comp, lms=16, lmp=-1 |
| (AK) TandemMonophoneSemiFactored2Fisher | 500 | 1 | | * | | |
| (OC) TandemTriphoneUnfactoredFisher? | 500 | 1 | 49.9 | * | 55.0 | 64 comp, lms=19, lmp=-2 |
| (OC) TandemTriphoneSemiFactored1Fisher? | 500 | 1 | 48.7 | * | 53.8 | 64 comp, lms=22, lmp=-1 |
| Best of the above + embedded | 500 | 1 | | * | | |
| Triphone, best of the above + best of factored | 500 | 1 | | * | | |

| Model | CLEAN WER (%) | 12dB WER (%) | 10dB WER (%) | 6dB WER (%) | 4dB WER (%) | -4dB WER (%) | Notes |
| --- | --- | --- | --- | --- | --- | --- | --- |
| (OC) Audio-only | 1.3 | 18.0 | 23.3 | 39.7 | 50.0 | 81.3 | 32 components |
| (OC) Video-only | 63.3 | 63.3 | 63.3 | 63.3 | 63.3 | 63.3 | 16 components |
| (PL) Monophone + viseme, synchronous | 1.7 | 8.0 | 11.3 | 23.0 | 30.0 | 57.3 | 16 Gaussians was optimal for CLEAN and so was used throughout this row; dev set on original split |
| (KL) Monophone + viseme, async var, 1 state of async | 0.7 | 8.3 | 12.7 | 25.3 | 35.0 | 57.3 | 16comp, trained starting from Ozgur's synchronous 8comp; #s in parentheses are video weight; weights tested are {0, .1, .3, .5, .7, .9, 1} |
| (KL) Monophone + viseme, async var, 2 states of async | 1.0 | 8.7 | 12.0 | 25.3 | 34.7 | 58.7 | 16comp, trained starting from Ozgur's synchronous 8comp; #s in parentheses are video weight. |
| (KL) Monophone + viseme, async var, 2 states of async, flat start | 1.0 | 9.0 | 13.7 | 22.3 | 29.3 | 54.0 | 16 comp |
| (KL) Monophone + viseme, async var, asymmetric async | | | | | | | |
| (MH) phone+viseme CHMM | 1.7 | 9.7 | 12.3 | 25.3 | 33.3 | | 16 Gaussians, 0-state async |
| (MH) phone+viseme CHMM | 1.3 | 6.0 | 11.3 | 26.3 | 31.7 | 59.0 | 16 Gaussians, 1-state async |
| (MH) phone+viseme CHMM | 1.3 | 7.7 | 10.0 | 20.3 | 28.3 | 60.7 | 8 Gaussians, 2-state async |
| (MH) phone+viseme 2-chain | 1.7 | 7.0 | 9.0 | 21.0 | 31.0 | 61.3 | 4 Gaussians; unlimited within-word asynchrony |
| (MH) 1-state monofeat CHMM | 8.3 | 36.0 | 47.3 | | | | 16 Gaussians, 2-state async; AF-dep transitions |
| (MH) 3-state monofeat CHMM | 1.0 | 10.3 | 15.3 | 27.7 | 38.0 | 65.0 | 4 Gaussians, 1-state async; AF-dep transitions |
| (MH) 3-state monofeat CHMM | 1.7 | 9.3 | 13.7 | 28.3 | 35.7 | 65.7 | 1 Gaussian, 2-state async; AF-dep transitions |
| (MH) above+expanded card | 2.0 | 6.3 | 10.3 | 21.3 | 29.3 | 67.0 | AF cardinalities expanded so that every monophone state is distinct: G:(VL)->(asp,other VL); T:(MF,HF)->(MFTense,MFLax,HFTense,HFLax); /ow2/->protruded wide (not a new state, just changed ow2); 2 Gaussians |
| (MH) above+phTrans | 1.3 | 7.7 | 11.3 | 21.0 | 29.7 | 61.3 | expanded set, with phone-dep transitions |
| (PL) Monofeat | 5.0 | 20.7 | 24.7 | 42.7 | 48.7 | 68.7 | 16 Gaussians was optimal on CLEAN for this row; dev set on original split |
| (PL) Monofeat (tied Gaussians) | | | | | | | Tying (with these questions) didn't help |
| (PL) Monofeat (3 states per phone) | 2.3 | 8.0 | 11.7 | 23.3 | 32.7 | 62.7 | 4 Gaussians was optimal on CLEAN |
| (PL) Monofeat (3 states per phone, tied) | 1.3 | 9.7 | 12.3 | 25.3 | 33.3 | 65.3 | 430 clusters, 4 Gaussians |
| (PL) Monofeat (crossword asynchrony) | | | | | | | 3 states per phone |
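
One way to read the WER table above is as relative WER reduction over the audio-only baseline. The sketch below is only an illustration: it uses the "(OC) Audio-only" and "(PL) Monophone + viseme, synchronous" rows as given, and the helper name `relative_reduction` is an assumption, not something from the experiments.

```python
# Noise conditions and WER (%) values copied from the table above.
CONDITIONS = ["CLEAN", "12dB", "10dB", "6dB", "4dB", "-4dB"]
audio_only = [1.3, 18.0, 23.3, 39.7, 50.0, 81.3]   # (OC) Audio-only
av_sync = [1.7, 8.0, 11.3, 23.0, 30.0, 57.3]       # (PL) Monophone + viseme, synchronous


def relative_reduction(baseline_wer, system_wer):
    """Relative WER reduction in percent; negative means the system is worse."""
    return 100.0 * (baseline_wer - system_wer) / baseline_wer


# Hypothetical helper usage: one relative reduction per noise condition.
reductions = {
    cond: relative_reduction(base, av)
    for cond, base, av in zip(CONDITIONS, audio_only, av_sync)
}

for cond in CONDITIONS:
    print(f"{cond:>6}: {reductions[cond]:+.1f}%")
```

A negative value (as for CLEAN here) means the audio-visual system is slightly worse than audio-only in that condition.
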
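
| Model | CLEAN video weight | 12dB video weight | 10dB video weight | 6dB video weight | 4dB video weight | -4dB video weight | Notes |
| --- | --- | --- | --- | --- | --- | --- | --- |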
| Monophone + viseme, synchronous | 0.1 | 0.1 | 0.1 | 0.1 | 0.2 | 0.8 | The weights are optimized for each noise condition. dev set on original split |
| (KL) Monophone + viseme, async var, 1 state of async | 0.1 | 0.1 | 0.1 | 0.1 | 0.2 | 0.7 | |
| (KL) Monophone + viseme, async var, 2 states of async | 0.1 | 0.1 | 0.1 | 0.2 | 0.2 | 0.8 | |
| (KL) Monophone + viseme, async var, 2 states of async, flat start | 0.1 | 0.1 | 0.2 | 0.2 | 0.2 | 0.4 | |
| (MH) Monophone + viseme, CHMM, 0-state async | 0.1 | 0.2 | 0.2 | 0.3 | 0.3 | 0.2 | |
| (MH) Monophone + viseme, CHMM, 1-state async | 0.1 | 0.1/0.2 | 0.1/0.2 | 0.1 | 0.2 | 0.5 | "/" means the two weights give the same performance |
| (MH) Monophone + viseme, CHMM, 2-state async | 0.1 | 0.2 | 0.2 | 0.2 | 0.3 | 0.4 | |
| (MH) Monophone + viseme, 2-chain | 0.1 | 0.2 | 0.2 | 0.2 | 0.2 | 0.2 | 4 Gaussians; unlimited within-word asynchrony |
| (MH) 1-state monofeat CHMM, 1-state async | 0.1 | 0.2 | 0.2 | | | | no tying; 16 Gaussians |
| (MH) 1-state monofeat CHMM, 2-state async | 0.1 | 0.2 | 0.2 | | | | no tying; 16 Gaussians |
| (MH) 3-state monofeat CHMM, 1-state async | 0.1 | 0.1/0.2 | 0.1 | 0.1 | 0.2 | 0.7 | no tying; 4 Gaussians |
| (MH) 3-state monofeat CHMM, 2-state async | 0.1 | 0.1 | 0.1 | 0.2 | 0.2 | 0.4 | no tying; 1 Gaussian |
| (MH) monofeat with expanded AF cardinality | 0.1 | 0.2 | 0.1 | 0.1 | 0.1 | 0.3 | |
| Monofeat | 0.2 | 0.2 | 0.3 | 0.2 | 0.2 | 0.3 | |
| Monofeat (3 states per phone) | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.7 | |
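
The video weights above are chosen separately for each noise condition, with "/" marking weights that tie. A minimal sketch of that per-condition selection, assuming a dev-set WER is available for every candidate weight; the function name `pick_video_weights` and the WER numbers in the example are hypothetical and are not results from these experiments.

```python
def pick_video_weights(dev_wer):
    """Pick the best video weight(s) for each noise condition.

    dev_wer maps condition -> {video_weight: dev-set WER}.
    Returns condition -> sorted list of weights reaching the lowest WER,
    so ties (written "0.1/0.2" in the table) show up as two entries.
    """
    best = {}
    for condition, wer_by_weight in dev_wer.items():
        lowest = min(wer_by_weight.values())
        best[condition] = sorted(
            weight for weight, wer in wer_by_weight.items() if wer == lowest
        )
    return best


# Hypothetical dev-set WERs, for illustration only.
example_dev_wer = {
    "CLEAN": {0.0: 2.0, 0.1: 1.0, 0.3: 1.5},
    "12dB": {0.1: 8.0, 0.2: 8.0, 0.3: 9.5},
}

selected = pick_video_weights(example_dev_wer)
print(selected)
```

Returning a list rather than a single value keeps ties visible instead of silently preferring one weight.
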