Patterns. 2022 Dec 9;3(12):100616. doi: 10.1016/j.patter.2022.100616

Table 2.

An overview of the recent audio self-supervised learning methods

| Model | Input format | Framework | Encoder | Loss | Inspired by |
|---|---|---|---|---|---|
| LIM [73] (2019) | raw waveform | 1(b) | SincNet | BCE, MINE, or NCE loss | SimCLR |
| COLA [74] (2021) | log mel-filterbanks | 1(b) | EfficientNet | InfoNCE loss | SimCLR |
| CLAR [81] (2021, semi-supervised) | raw waveform + log mel-spectrogram | 1(b) | 1D ResNet-18 + ResNet-18 | NT-Xent + cross-entropy loss | SimCLR |
| Fonseca et al. [75] (2021) | log mel-spectrogram | 1(b) | ResNet, VGG, CRNN | NT-Xent loss | SimCLR |
| Wang et al. [82] (2020) | raw waveform + log mel-filterbanks | 1(b) | CNN, ResNet | NT-Xent + cross-entropy loss | SimCLR |
| BYOL-A [83] (2021) | log mel-spectrogram | 2(a) | CNN | MSE loss | BYOL |
| Carr [37] (2021) | MFCCs | 1(a) | context-free network | Fenchel-Young loss | – |
| Ryan [38] (2020) | constant-Q transform spectrogram | 1(a) | AlexNet | triplet loss | – |
| Speech2Vec [90] (2018) | mel spectrogram | 3 | RNN | MSE loss | Word2Vec |
| Audio2Vec [89] (2020) | MFCCs | 3 | CNN | MSE loss | Word2Vec |
| DeCoAR [91] (2020) | log filterbank features | 3 | RNN | L1 loss | Word2Vec |
| Audio Word2Vec [195] (2019) | MFCCs | 3 | RNN | MSE loss | Word2Vec |
| Mockingjay [95] (2020) | mel spectrogram | 4(b) | transformer | L1 loss | BERT |
| TERA [96] (2021) | log mel spectrogram | 4(b) | transformer | L1 loss | BERT |
| Audio ALBERT [98] (2021) | log mel spectrogram | 4(b) | transformer | L1 loss | BERT |
| DAPC [99] (2021) | spectrogram | 4(b) | transformer | modified MSE loss + orthogonality penalty | BERT |
| PASE [85] (2019) | raw waveform | 1(a) | SincNet + CNN | L1, BCE loss | MTL |
| PASE+ [87] (2020) | raw waveform | 1(a) | SincNet + CNN + QRNN | MSE, BCE loss | MTL |
| APC [66] (2019) | log mel spectrogram | 4(a) | RNN | L1 loss | – |
| VQ-APC [114] (2020) | log mel spectrogram | 4(a) | RNN, transformer | L1 loss | – |
| NPC [69] (2021) | log mel spectrogram | – | CNN + masked CNN | L1 loss | – |
| CPC [42] (2018) | raw waveform | 4(a) | ResNet + GRU | InfoNCE loss | – |
| CPC v2 [71] (2020) | raw waveform | 4(a) | ResNet + masked CNN | InfoNCE loss | – |
| CPC2 [93] (2021) | raw waveform | 4(a) | ResNet + LSTM | InfoNCE loss | – |
| wav2vec [77] (2019) | raw waveform | 4(a) | 1D CNN | contrastive loss | – |
| VQ-wav2vec [78] (2019) | raw waveform | 4(a) | 1D CNN + BERT | contrastive loss | BERT |
| wav2vec 2.0 [72] (2020) | raw waveform | 4(b) | 1D CNN + transformer | contrastive loss | BERT |
| HuBERT [112] (2021) | raw waveform | 4(b) | 1D CNN + transformer | contrastive loss | BERT |
| WavLM [113] (2022) | raw waveform | 4(b) | 1D CNN + transformer | contrastive loss | BERT |

For each model, the bracketed number is its reference number, the input format, framework (referring to Figures 1, 2, 3, and 4), encoder, and loss are listed, and the final column names the earlier technique by which the method was inspired (MTL: multi-task learning).
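Many of the contrastive methods in the table (COLA, CPC and its variants, wav2vec 2.0, and the NT-Xent-based models) share the same underlying InfoNCE objective: each anchor embedding must identify its matching positive among all other positives in the batch, which act as negatives. The following is a minimal NumPy sketch of that loss, assuming cosine similarity and in-batch negatives; it is an illustration, not any one paper's exact implementation (the function name and temperature default are our own).

```python
import numpy as np

def info_nce_loss(anchors, positives, temperature=0.1):
    """InfoNCE / NT-Xent-style loss (illustrative sketch).

    anchors, positives: (N, D) arrays; anchors[i] should match
    positives[i], and the other N-1 positives serve as negatives.
    """
    # L2-normalize so dot products become cosine similarities.
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature                 # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Cross-entropy with the diagonal (the true pair) as the target class.
    return -np.mean(np.diag(log_probs))
```

With perfectly matched pairs the loss approaches zero as the temperature shrinks; with shuffled pairs it stays large, which is exactly the pressure that drives the encoders in these frameworks to align the two views of the same audio segment.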