
Table 1.

Comparison of the forced-alignment algorithms under consideration.

Algorithm | Engine | Alignment | English training set | Remark
--- | --- | --- | --- | ---
P2FA (Yuan & Liberman, 2008) | HMM-GMM on PLP features. HTK backend. | Monophone | 25 hr of U.S. Supreme Court oral arguments | Not trainable.
Prosodylab (Gorman et al., 2011) | HMM-GMM on MFCC features. HTK backend. | Monophone | 10 hr laboratory-recorded North American speech | 
Kaldi (Povey et al., 2011) | HMM-GMM on MFCC features. Kaldi backend. | Two passes: monophone, triphone | Librispeech (Panayotov et al., 2015): 1,000 hr of adult-read audiobooks | Kaldi is a speech recognition engine but recipes are available for forced alignment.
MFA-No-SAT (McAuliffe et al., 2017) | HMM-GMM on MFCC features. Kaldi backend. | Two passes: monophone, triphone | Librispeech | Automates Kaldi alignment recipes. Developed by same lab as Prosodylab.
MFA-SAT (McAuliffe et al., 2017) | HMM-GMM on MFCC features. Kaldi backend. | Three passes: monophone, triphone, speaker-adapted triphone | Librispeech | 

Note. P2FA = Penn Phonetics Lab Forced Aligner; HMM = hidden Markov model; GMM = Gaussian mixture model; PLP = perceptual linear prediction; HTK = Hidden Markov Model Toolkit (Young et al., 2015); MFCC = Mel-frequency cepstral coefficient; MFA = Montreal Forced Aligner; No-SAT = no speaker adaptive training; SAT = speaker adaptive training.
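
As a practical illustration of how the MFA aligners in the table are typically run (this sketch is not part of the original article), the snippet below drives the Montreal Forced Aligner's `mfa align` command from Python. It assumes MFA 2.x is installed, that the pretrained `english_us_arpa` pronunciation dictionary and acoustic model have already been downloaded, and that the corpus and output directory names are placeholders.

```python
# Illustrative sketch only (not from the article): aligning a small corpus with the
# Montreal Forced Aligner (MFA 2.x) by calling its command-line interface.
# All paths and model names below are placeholders/assumptions.
import subprocess

corpus_dir = "my_corpus"            # .wav files with matching transcript files (assumed layout)
dictionary = "english_us_arpa"      # pretrained MFA pronunciation dictionary (assumed downloaded)
acoustic_model = "english_us_arpa"  # pretrained MFA acoustic model (assumed downloaded)
output_dir = "aligned"              # word- and phone-level TextGrids are written here

# Equivalent to running: mfa align my_corpus english_us_arpa english_us_arpa aligned
subprocess.run(
    ["mfa", "align", corpus_dir, dictionary, acoustic_model, output_dir],
    check=True,  # raise an error if alignment fails
)
```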