Table 1.
Comparison of the forced-alignment algorithms under consideration.
| Algorithm | Engine | Alignment | English training set | Remark |
|---|---|---|---|---|
| P2FA (Yuan & Liberman, 2008) | HMM-GMM on PLP features. HTK backend. | Monophone | 25 hr of U.S. Supreme Court oral arguments | Not trainable. |
| Prosodylab (Gorman et al., 2011) | HMM-GMM on MFCC features. HTK backend. | Monophone | 10 hr of laboratory-recorded North American speech | |
| Kaldi (Povey et al., 2011) | HMM-GMM on MFCC features. Kaldi backend. | Two passes: monophone, triphone | Librispeech (Panayotov et al., 2015): 1,000 hr of adult-read audiobooks | Kaldi is a speech recognition engine, but recipes are available for forced alignment. |
| MFA-No-SAT (McAuliffe et al., 2017) | HMM-GMM on MFCC features. Kaldi backend. | Two passes: monophone, triphone | Librispeech | Automates Kaldi alignment recipes; developed by the same lab as Prosodylab. |
| MFA-SAT (McAuliffe et al., 2017) | HMM-GMM on MFCC features. Kaldi backend. | Three passes: monophone, triphone, speaker-adapted triphone | Librispeech | |
Note. P2FA = Penn Phonetics Lab Forced Aligner; HMM = hidden Markov model; GMM = Gaussian mixture model; PLP = perceptual linear prediction; HTK = Hidden Markov Model Toolkit (Young et al., 2015); MFCC = Mel-frequency cepstral coefficient; MFA = Montreal Forced Aligner; No-SAT = no speaker-adaptive training; SAT = speaker-adaptive training.