Table 2. An overview of recent audio self-supervised learning methods
| Model | Speech | Input format | Framework | Encoder | Loss | Inspired by |
|---|---|---|---|---|---|---|
| LIM [73] (2019) | ✓ | raw waveform | 1(b) | SincNet | BCE, MINE, or NCE loss | SimCLR |
| COLA [74] (2021) | ✗ | log mel-filterbanks | 1(b) | EfficientNet | InfoNCE loss | SimCLR |
| CLAR [81] (2021, semi) | ✗ | raw waveform / log mel-spectrogram | 1(b) | 1D ResNet-18 / ResNet-18 | NT-Xent + cross-entropy | SimCLR |
| Fonseca et al. [75] (2021) | ✗ | log mel-spectrogram | 1(b) | ResNet, VGG, CRNN | NT-Xent loss | SimCLR |
| Wang et al. [82] (2020) | ✗ | raw waveform + log mel-filterbanks | 1(b) | CNN / ResNet | NT-Xent loss + cross-entropy | SimCLR |
| BYOL-A [83] (2021) | ✗ | log mel-spectrogram | 2(a) | CNN | MSE loss | BYOL |
| Carr [37] (2021) | ✓ | MFCCs | 1(a) | context-free network | Fenchel-Young loss | – |
| Ryan [38] (2020) | ✗ | constant-Q transform spectrogram | 1(a) | AlexNet | triplet loss | – |
| Speech2Vec [90] (2018) | ✓ | mel spectrogram | 3 | RNN | MSE loss | Word2Vec |
| Audio2Vec [89] (2020) | ✓✗ | MFCCs | 3 | CNN | MSE loss | Word2Vec |
| DeCoAR [91] (2020) | ✓ | log filterbank features | 3 | RNN | L1 loss | Word2Vec |
| Audio Word2Vec [195] (2019) | ✓ | MFCCs | 3 | RNN | MSE loss | Word2Vec |
| Mockingjay [95] (2020) | ✓ | mel spectrogram | 4(b) | transformer | L1 loss | BERT |
| TERA [96] (2021) | ✓ | log mel spectrogram | 4(b) | transformer | L1 loss | BERT |
| Audio ALBERT [98] (2021) | ✓ | log mel spectrogram | 4(b) | transformer | L1 loss | BERT |
| DAPC [99] (2021) | ✓ | spectrogram | 4(b) | transformer | modified MSE loss + orthogonality penalty | BERT |
| PASE [85] (2019) | ✓ | raw waveform | 1(a) | SincNet + CNN | L1, BCE loss | MTL |
| PASE+ [87] (2020) | ✓ | raw waveform | 1(a) | SincNet + CNN + QRNN | MSE, BCE loss | MTL |
| APC [66] (2019) | ✓ | log mel spectrogram | 4(a) | RNN | L1 loss | – |
| VQ-APC [114] (2020) | ✓ | log mel spectrogram | 4(a) | RNN, transformer | L1 loss | – |
| NPC [69] (2021) | ✓ | log mel spectrogram | – | CNN + masked CNN | L1 loss | – |
| CPC [42] (2018) | ✓ | raw waveform | 4(a) | ResNet + GRU | InfoNCE loss | – |
| CPC v2 [71] (2020) | ✓ | raw waveform | 4(a) | ResNet + masked CNN | InfoNCE loss | – |
| CPC2 [93] (2021) | ✓ | raw waveform | 4(a) | ResNet + LSTM | InfoNCE loss | – |
| wav2vec [77] (2019) | ✓ | raw waveform | 4(a) | 1D CNN | contrastive loss | – |
| VQ-wav2vec [78] (2019) | ✓ | raw waveform | 4(a) | 1D CNN + BERT | contrastive loss | BERT |
| wav2vec 2.0 [72] (2020) | ✓ | raw waveform | 4(b) | 1D CNN + transformer | contrastive loss | BERT |
| HuBERT [112] (2021) | ✓ | raw waveform | 4(b) | 1D CNN + transformer | contrastive loss | BERT |
| WavLM [113] (2022) | ✓ | raw waveform | 4(b) | 1D CNN + transformer | contrastive loss | BERT |
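
Several rows in the table name the InfoNCE loss. As a rough illustration of what that objective computes, the following is a minimal sketch of a generic batch-wise InfoNCE, not the implementation of any listed model. The function name `info_nce`, the default `temperature`, and the use of the other in-batch positives as negatives are assumptions made for this example. NT-Xent is closely related, being likewise a temperature-scaled cross-entropy over normalized similarities.

```python
import math


def info_nce(anchors, positives, temperature=0.1):
    """Generic batch-wise InfoNCE (illustrative sketch, hypothetical helper).

    positives[i] is the positive for anchors[i]; every other positives[j]
    in the batch is treated as a negative for anchors[i].
    """

    def normalize(vec):
        # L2-normalize so that the dot product equals cosine similarity.
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        return [x / norm for x in vec]

    a = [normalize(v) for v in anchors]
    p = [normalize(v) for v in positives]

    losses = []
    for i, a_i in enumerate(a):
        # Temperature-scaled similarity of anchor i to every candidate.
        logits = [sum(x * y for x, y in zip(a_i, p_j)) / temperature for p_j in p]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the true pair (index i) as the target class.
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)
```

With well-separated embeddings and correct pairing the loss approaches zero; when all candidates look identical it equals the log of the batch size, which is the chance-level value.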