Table 2: Baseline results of ASR trained only on children’s speech (91 hours).
| Model | WER |
|---|---|
| GMM-HMM Monophone | 54.53% |
| GMM-HMM Triphone | 36.96% |
| GMM-HMM Triphone LDA+MLLT | 32.79% |
| GMM-HMM Triphone LDA+MLLT+SAT | 24.55% |
| GMM-HMM Triphone LDA+MLLT+SAT + VTLN | 25.66% |
| Hybrid DNN-HMM | 35.97% |
| Hybrid DNN-HMM + VTLN | 32.72% |
| Hybrid DNN-HMM + LDA+MLLT+SAT | 21.31% |
| Hybrid DNN-HMM + LDA+MLLT+SAT + VTLN | 21.82% |
| Hybrid DNN-HMM + online i-vector (speaker) | 28.03% |
| Hybrid DNN-HMM + online i-vector (utterance) | 26.59% |
| Hybrid DNN-HMM + offline i-vector (utterance) | 25.53% |
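
For reference, the WER values above follow the usual definition: the number of word substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the number of reference words. The sketch below is a minimal illustration of that definition, not the scoring tool used for these experiments (which is not stated here). The function name `word_error_rate` and the whitespace tokenization are assumptions made for the example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent for one reference/hypothesis pair.

    Illustrative only: tokenizes on whitespace and applies no text
    normalization, which real scoring pipelines typically do.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return 100.0 * prev[-1] / len(ref)
```

Corpus-level WER, as reported in a table like this one, is normally obtained by summing the edit counts and the reference lengths over all test utterances before dividing, rather than by averaging per-utterance percentages.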