Table 8.
CER (%) and WER (%) results of different E2E ASR models trained on the Uzbek language speech corpus. The impact of the language model (LM), speed perturbation (SP), and spectral augmentation (SA) is also reported.
| Model | LM | SP | SA | Valid CER | Valid WER | Test CER | Test WER |
|---|---|---|---|---|---|---|---|
| E2E-LSTM | × | × | × | 13.8 | 43.1 | 14.0 | 44.0 |
| | √ | × | × | 14.9 | 30.0 | 14.3 | 31.4 |
| | √ | √ | × | 13.7 | 27.6 | 14.4 | 30.6 |
| | √ | √ | √ | 12.6 | 24.9 | 12.0 | 27.0 |
| DNN-HMM | × | × | × | 12.8 | 34.7 | 10.2 | 32.1 |
| | √ | × | × | 10.3 | 20.5 | 8.6 | 24.9 |
| | √ | √ | × | 6.9 | 18.8 | 7.5 | 23.5 |
| | √ | √ | √ | 6.9 | 19.9 | 8.1 | 24.9 |
| RNN-CTC | × | × | × | 13.3 | 35.8 | 9.7 | 32.3 |
| | √ | × | × | 12.2 | 27.2 | 9.1 | 24.3 |
| | √ | √ | × | 10.9 | 25.1 | 8.7 | 23.9 |
| | √ | √ | √ | 8.3 | 24.7 | 7.9 | 22.3 |
| E2E-Transformer | × | × | × | 12.3 | 35.2 | 9.4 | 31.6 |
| | √ | × | × | 11.7 | 25.7 | 8.7 | 23.9 |
| | √ | √ | × | 10.7 | 23.9 | 8.4 | 23.0 |
| | √ | √ | √ | 9.9 | 21.4 | 7.6 | 21.0 |
| E2E-Conformer | × | × | × | 12.7 | 37.6 | 10.7 | 35.1 |
| | √ | × | × | 11.5 | 27.5 | 9.7 | 26.3 |
| | √ | √ | × | 9.2 | 21.7 | 7.5 | 21.2 |
| | √ | √ | √ | 7.8 | 18.1 | 5.8 | 17.4 |
| E2E-T (CTC + Attention) | × | × | × | 12.1 | 33.2 | 9.8 | 30.3 |
| | √ | × | × | 9.6 | 19.4 | 7.9 | 22.7 |
| | √ | √ | × | 6.4 | 17.9 | 7.4 | 20.3 |
| | √ | √ | √ | 5.7 | 15.2 | 5.41 | 14.3 |
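
For readers unfamiliar with the two metrics in Table 8, the following is a minimal sketch of how CER and WER are commonly computed: the Levenshtein (edit) distance between reference and hypothesis, summed over all utterances and divided by the total reference length. The function names `edit_distance` and `error_rate`, the choice to count spaces as characters for CER, and the example sentence are illustrative assumptions; the scoring tool and text normalization actually used for the table are not specified here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences
    (minimum number of substitutions, insertions, and deletions)."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference token
                curr[j - 1] + 1,     # insertion of a hypothesis token
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def error_rate(refs, hyps, unit="word"):
    """Corpus-level error rate in percent.

    unit="word" gives WER; unit="char" gives CER.
    Assumption: for CER, spaces are counted as characters.
    """
    total_errors = 0
    total_ref_len = 0
    for ref, hyp in zip(refs, hyps):
        if unit == "word":
            r, h = ref.split(), hyp.split()
        else:
            r, h = list(ref), list(hyp)
        total_errors += edit_distance(r, h)
        total_ref_len += len(r)
    return 100.0 * total_errors / total_ref_len


# Made-up example pair (not taken from the corpus): one wrong character in one word.
refs = ["men bugun maktabga bordim"]
hyps = ["men bugun maktabda bordim"]
print(f"WER: {error_rate(refs, hyps, unit='word'):.1f}%")  # WER: 25.0%
print(f"CER: {error_rate(refs, hyps, unit='char'):.1f}%")  # CER: 4.0%
```

In this example a single-character error costs one full word out of four for WER but only one character out of 25 for CER, which is consistent with the WER values in the table being considerably higher than the corresponding CER values.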