Table 2:
Depression detection performance of various models and AO-SOTA baselines on the DAIC-WoZ dataset, reported as F1-AVG, F1(ND), F1(D), and speaker identification (SID) accuracy. SOTA baseline results are either reproduced or taken from the corresponding study. The symbol ‘-’ indicates that the value was not reported in the corresponding study. The symbols ‘↑’ and ‘↓’ indicate that a higher or lower value is better, respectively. Best results are highlighted in bold.
| Model Architecture | Input Feature | Disentanglement Method | Model Parameters | F1-AVG ↑ | F1(ND) ↑ | F1(D) ↑ | SID Accuracy ↓ |
|---|---|---|---|---|---|---|---|
| DepAudioNet [34] | Mel-Spectrogram | None | 280k | 0.6081 | 0.6977 | 0.5185 | - |
| FVTC-CNN [40] | Formants | None | - | 0.6400 | 0.4600 | 0.8200 | - |
| Speech SimCLR [41] | Mel-Spectrogram | None | - | 0.6578 | 0.7556 | 0.5600 | - |
| CPC [42] | Mel-Spectrogram | None | - | 0.6762 | 0.7317 | 0.6207 | - |
| CNN-LSTM [23] | Spk. Embd. + OpenSmile | None | - | 0.6850 | 0.8600 | 0.5100 | - |
| SpeechFormer [43] | Wav2Vec | None | 33M | 0.6940 | - | - | - |
| Vowel-based [44] | Mel-Spectrogram | None | - | 0.7000 | 0.8400 | 0.5600 | - |
| DepAudioNet [37] (D1) | Raw-Audio | None | 445k | 0.6259 | 0.7755 | 0.4762 | 10.04% |
| DepAudioNet [30] (D2) | Raw-Audio | USD | 459k | 0.6830 | 0.7826 | 0.5833 | 8.91% |
| DepAudioNet (D3) | Raw-Audio | NUSD | 459k | 0.7086 | 0.8085 | 0.6087 | 8.05% |
| ECAPA-TDNN (E1) | Raw-Audio | None | 595k | 0.6329 | 0.7273 | 0.5385 | 42.33% |
| ECAPA-TDNN (E2) | Raw-Audio | USD | 609k | 0.7086 | 0.8085 | 0.6087 | 9.38% |
| ECAPA-TDNN (E3) | Raw-Audio | NUSD | 609k | 0.7349 | 0.8333 | 0.6364 | 4.68% |
| Δ (E3 vs E2) in % | - | - | - | 3.70 | 2.80 | 4.55 | −50.11 |
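
The derived quantities in the table can be sanity-checked with a short sketch. It assumes that F1-AVG is the unweighted mean of F1(ND) and F1(D) and that the Δ row is the relative change of E3 over E2. Both assumptions are consistent with the table entries but are not stated in the caption, and the function names are illustrative only. Because the inputs are the rounded values from the table, recomputed numbers may differ from the printed row in the last digits.

```python
# Minimal sketch: recompute F1-AVG and the relative change between two rows.
# Assumptions (not confirmed by the caption): F1-AVG is the unweighted mean of
# the two class-wise F1 scores, and the delta row is (E3 - E2) / E2 * 100.

def macro_f1(f1_nd: float, f1_d: float) -> float:
    """Unweighted mean of the two class-wise F1 scores."""
    return (f1_nd + f1_d) / 2.0


def relative_change_pct(new: float, old: float) -> float:
    """Relative change of `new` with respect to `old`, in percent."""
    return (new - old) / old * 100.0


# Rounded values copied from the E2 (USD) and E3 (NUSD) rows of the table.
e2 = {"F1-AVG": 0.7086, "F1(ND)": 0.8085, "F1(D)": 0.6087, "SID": 9.38}
e3 = {"F1-AVG": 0.7349, "F1(ND)": 0.8333, "F1(D)": 0.6364, "SID": 4.68}

if __name__ == "__main__":
    print(f"E3 F1-AVG from class scores: {macro_f1(e3['F1(ND)'], e3['F1(D)']):.4f}")
    for metric in e3:
        delta = relative_change_pct(e3[metric], e2[metric])
        print(f"{metric}: {delta:+.2f}%")
```

A negative change in SID accuracy is the desired direction here, since lower speaker identification accuracy is better.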