A family of models is reported on this page. They all use virtual evidence from ANNs as follows:
           phState
          /   |
         /    |   ...
        V     V
       dg1   pl1
        |     |
        |     |
        V     V
    VE_dg1  VE_pl1
where VE_<F> is the virtual evidence given by MLP activations
for feature <F>.
The relationship between the phone state and the features can be non-deterministic (a dense CPT) or deterministic (implemented as a dense CPT with only one non-zero entry per row).
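To make the distinction concrete, here is a minimal sketch in plain Python (not GMTK code). The table sizes, the example probabilities and the helper names `is_valid_cpt`, `is_deterministic` and `deterministic_cpt` are all hypothetical; the only thing taken from this page is the idea that a deterministic relationship is stored as a dense table with a single non-zero entry per row.

```python
# Toy illustration: rows index phone states, columns index values of one
# feature (e.g. dg1). All sizes and numbers here are made up.

def is_valid_cpt(cpt, tol=1e-9):
    """Each row must be a probability distribution over feature values."""
    return all(min(row) >= 0.0 and abs(sum(row) - 1.0) <= tol for row in cpt)

def is_deterministic(cpt):
    """True if every row has exactly one non-zero entry."""
    return all(sum(1 for p in row if p != 0.0) == 1 for row in cpt)

def deterministic_cpt(mapping, n_values):
    """Build a dense table from a phone-state -> feature-value mapping."""
    return [[1.0 if v == target else 0.0 for v in range(n_values)]
            for target in mapping]

# Non-deterministic: probability mass is spread over several feature values.
dense = [
    [0.7, 0.2, 0.1, 0.0],
    [0.25, 0.25, 0.25, 0.25],
    [0.0, 0.1, 0.1, 0.8],
]

# Deterministic: state 0 -> value 0, state 1 -> value 2, state 2 -> value 3.
det = deterministic_cpt([0, 2, 3], n_values=4)
```

Both tables have the same shape, so the same model structure can be used for either case; only the parameter values differ.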
Optionally, the model may also have a conventional PLP observation, generated from a mixture of Gaussians:
           phState
          /   |   \
         /    |    \
        /     |  ... \
       V      V       \
      dg1    pl1       |
       |      |        |
       |      |        |
       V      V        |
    VE_dg1  VE_pl1    PLPs
where VE_<F> is the virtual evidence given by MLP activations
for feature <F> and PLPs are conventional observations.
SVitchboard, monophone, hybrid
This system uses the 8 ANNs to provide virtual evidence about the 8 features. Each of the 8 hidden feature RVs depends on the phone state via a
DenseCPT. If this has only one non-zero entry per row, it is deterministic.
Validation set is always D_short unless noted.
The PLPs are generated from mixtures of either:
- for 10 word models:
- 64 Gaussians, initialised from the PLP-only monophone model trained without word alignments.
- 32 Gaussians, initialised from the best overall monophone model trained with word alignments
- for 500 word models:
The Gaussians do get retrained when this model is trained using VE.
| Models trained using original word alignments |
| Vocab size | Task | Word error rate (%) | scale factors | language model | Divide by prior? | Det. CPTs? | PLP obs? | Notes |
| | | Validation | Test | dg1 | pl1 | plp | scale | penalty | |
| 10 | 1 | 26.0 | 30.1 | 1.5 | 1.5 | - | 25 | -2 | no | no | no | (A) |
| 23.1 | 24.3 | 24 | -3 | (B)(R2) |
| 32.0 | * | 1.5 | 1.5 | - | 27 | -4 | yes | (B) |
| 32.9 | * | 0.5 | 0.5 | - | 28 | -5 | no | yes | (A) |
| * | * | | | - | | | yes | |
| 17.4 | * | 1.5 | 1.5 | 1.0 | 22 | -2 | no | no | 64 | (C2) |
| 18.8 | 19.9 | all=0.0 | 1.0 | 24 | -1 | (C4) |
| 16.7 | 19.6 | all=0.6 | 23 | -2 |
| 16.9 | 20.0 | all=0.0 | 24 | -1 | 32 | (C3); (C4) now CV on townhill |
| 16.2 | 19.6 | all=0.5 | 24 | -3 |
| 16.9 | * | all=0.0 | | | (C4)(R2) plp weight of zero is best |
| 21.1 | * | 0.0 | 1.5 | 1.0 | 28 | -4 | yes | 64 | (C2) |
| * | * | | | | | | no | yes | |
| * | * | | | | | | yes | |
| 500 | 66.6 | | 1.5 | 1.5 | - | 13 | 0 | no | no | no | (B) ckbeam=30000 |
| 62.6 | | 1.5 | 1.5 | - | 10 | 1 | (B)(R2) ckbeam=30000 (th still doing more...) |
| * | * | | | - | | | yes | |
| | | | | - | | | no | yes | |
| * | * | | | - | | | yes | |
| | | | | | | | no | no | TBA | |
| * | * | | | | | | yes | |
| * | * | | | | | | no | yes | |
| * | * | | | | | | yes | |
(A) weight search over all combinations of 0.5/1.0/1.5 for dg1 and pl1
(B) No weight search (yet)
(C1) weight search over all combinations of 0.0/1.5 for dg1 & pl1 and 0.0/0.1/0.5/1.0/1.5/3.0/10.0 for plp
(C2) weight search over all combinations of 0.0/0.5/1.0/1.5 for dg1 & pl1 and 0.0/0.1/0.5/1.0/1.5/3.0/10.0 for plp
(C3) fixed PLP weight of 1.0, search over an equal weighting on VE of 0.0/0.1/0.5/1.0/1.5
(C4) fixed PLP weight of 1.0, search over an equal weighting on VE of 0.0/0.1/0.3/0.4/0.5/0.6/0.7/1.0/1.5
(R2) trained (from a flat start) on the "realigned_2" activations. These are from a net trained using realigned targets, which were produced by the 500 word hybrid (no PLP) model.
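The weight searches described in the notes above are exhaustive grids. As a rough sketch, the (C2) grid could be enumerated as below. This is not the script actually used for these experiments: the names `c2_grid` and `best_setting` are hypothetical, `wer_fn` is a stand-in for decoding the validation set with a given setting, and the sketch treats dg1 and pl1 as the only two feature weights, as the notes are written.

```python
from itertools import product

# Candidate values copied from note (C2).
VE_WEIGHTS = [0.0, 0.5, 1.0, 1.5]                     # for dg1 and for pl1
PLP_WEIGHTS = [0.0, 0.1, 0.5, 1.0, 1.5, 3.0, 10.0]    # for plp

def c2_grid():
    """All combinations of dg1, pl1 and plp scale factors."""
    return [{"dg1": d, "pl1": p, "plp": w}
            for d, p, w in product(VE_WEIGHTS, VE_WEIGHTS, PLP_WEIGHTS)]

def best_setting(grid, wer_fn):
    """Return the setting with the lowest validation WER.

    wer_fn is a hypothetical callback: it takes one setting (a dict) and
    returns the word error rate obtained when decoding with it.
    """
    return min(grid, key=wer_fn)
```

The grid has 4 x 4 x 7 = 112 points, each requiring a full decode of the validation set, which is why the later searches (C3) and (C4) fix the PLP weight and tie the VE weights together.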
Recipes for the 500 word task
Very slow to train starting with uniform DCPTs (unless I can find a better triangulation), so:
Recipe 1
Train on 1000 utterances for 2 iterations
Take the DCPTs and make them more sparse by zeroing all entries less than 0.1
Using these parameters, run the genetic triangulation script to find a fast triangulation, given this particular sparsity of the DCPTs.
Starting from these parameters, train to 0.5% tolerance (takes 8 iterations) on the full training set
Find a decoding graph triangulation using the final trained parameters.
--
SimonKing - 01 Aug 2006
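The sparsification step in Recipe 1 can be sketched as follows. This is plain Python on a list-of-rows table, not the script that was run on the GMTK parameter files. The function name `sparsify_rows` is hypothetical, and two behaviours are assumptions not stated in the recipe: rows are renormalised after zeroing, and a row with no surviving entries is treated as an error.

```python
def sparsify_rows(cpt, threshold=0.1, renormalise=True):
    """Zero every entry below `threshold` in each row of a dense CPT.

    Assumption: surviving entries are rescaled so each row sums to 1.
    Pass renormalise=False to keep the raw thresholded values instead.
    """
    out = []
    for row in cpt:
        kept = [p if p >= threshold else 0.0 for p in row]
        total = sum(kept)
        if total == 0.0:
            # Assumption: a row with nothing left is an error, not a silent zero row.
            raise ValueError("row has no entry >= threshold")
        out.append([p / total for p in kept] if renormalise else kept)
    return out
```

For example, `sparsify_rows([[0.05, 0.15, 0.8]])` zeroes the first entry and rescales the other two by 1/0.95. The resulting pattern of zeros is what the triangulation search is then tuned for.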
Recipe 2
Found a better triangulation using the genetic algorithm. Then manually re-triangulated the epilogue and prologue (because they were "completed") using heuristic "S".
This model is easily trainable with fully dense CPTs.
However, decoding takes serious memory (ckbeam of 25000, because anything smaller led to different decodings on one test sentence), although it is fast enough (~20 secs per utt).
To make this decode in reasonable amounts of memory, all state_to_FEAT DCPTs were made sparser by zeroing all entries smaller than 0.1.
--
SimonKing - 04 Aug 2006
Recipe 3
As recipe two, but zeroing entries smaller than 0.? (TO DO)
--
SimonKing - 08 Aug 2006