Table 4.
Experimental configuration of the CTC-attention model.
| Acoustic Unit | Mono-phone (Mandarin initials and finals) | 
|---|---|
| Acoustic Feature | The window length is 30 ms and the frame shift is 30 ms. The input is a 40-dimensional filter-bank feature with first-order and second-order derivatives, plus a 3-dimensional pitch feature. | 
| Configuration | The CTC output layer has 59 units: 58 initial and final labels plus one blank label. Because CTC performs well without a context decision tree, the mono-phone (initial or final) is used as the acoustic unit. The lower frame rate reduces the computational cost of decoding and greatly improves decoding speed. The attention-based model takes the same input as the CTC model, and the two share one encoder. The attention-based model has 60 output units: 58 phone labels plus `<SOS>` and `<EOS>`. During decoding, a one-pass beam search combines the attention and CTC probability scores, which further eliminates irregular alignments. The CCTV, PSC-G1-112, and PSC-Train-1000 speech corpora are used as training sets, and performance is evaluated on the PSC-Test-89 speech corpus. |
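
The score combination described in Table 4 can be illustrated with a minimal sketch. It assumes the common log-linear interpolation of CTC and attention scores; the weight `ctc_weight`, the helper names `joint_score` and `prune_beam`, and the toy probabilities are hypothetical and do not come from the table. A full one-pass decoder would apply CTC prefix scores to partial hypotheses at every step, whereas this sketch only shows the pruning decision for a set of scored hypotheses.

```python
import math

NUM_PHONES = 58               # initials and finals, from Table 4
CTC_UNITS = NUM_PHONES + 1    # 58 labels plus one blank label
ATT_UNITS = NUM_PHONES + 2    # 58 labels plus <SOS> and <EOS>


def joint_score(log_p_ctc, log_p_att, ctc_weight=0.5):
    """Interpolate CTC and attention log-probabilities for one hypothesis.

    ctc_weight is a hypothetical tuning parameter, not a value from Table 4.
    """
    if not 0.0 <= ctc_weight <= 1.0:
        raise ValueError("ctc_weight must lie in [0, 1]")
    return ctc_weight * log_p_ctc + (1.0 - ctc_weight) * log_p_att


def prune_beam(hypotheses, beam_size, ctc_weight=0.5):
    """Keep the beam_size hypotheses with the highest joint score.

    hypotheses: list of (label_sequence, log_p_ctc, log_p_att) tuples.
    Returns a list of (label_sequence, joint_score) pairs, best first.
    """
    scored = [(seq, joint_score(c, a, ctc_weight)) for seq, c, a in hypotheses]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:beam_size]


# Toy candidates with made-up probabilities (assumption, for illustration only).
candidates = [
    (("n", "i", "h", "ao"), math.log(0.20), math.log(0.30)),
    # Attention alone prefers this one, but CTC gives it a very low score,
    # as would happen for a hypothesis with an irregular alignment.
    (("n", "i", "ao"), math.log(0.01), math.log(0.40)),
    (("l", "i", "h", "ao"), math.log(0.10), math.log(0.10)),
]
best = prune_beam(candidates, beam_size=1)
```

In this toy case the attention score alone would rank the second candidate first, while the joint score keeps the first candidate, which is the behaviour the table attributes to combining the two scores.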