Author manuscript; available in PMC: 2023 Dec 1.
Published in final edited form as: Interspeech. 2023 Aug;2023:2343–2347. doi: 10.21437/interspeech.2023-2101

Table 2:

Depression detection performance for various models and AO-SOTA baselines on the DAIC-WoZ dataset, reported as F1-AVG, F1(ND), F1(D), and Speaker ID (SID) accuracy. SOTA baseline results are either reproduced or taken from the corresponding study. The symbol ‘–’ indicates that a value was not reported in the corresponding study. The symbols ‘↑’ and ‘↓’ indicate that a higher or lower value is better, respectively. Best results are highlighted in bold.

| Model Architecture | Input Feature | Disentanglement Method | Model Parameters | F1-AVG ↑ | F1(ND) ↑ | F1(D) ↑ | SID Accuracy ↓ |
|---|---|---|---|---|---|---|---|
| DepAudioNet [34] | Mel-Spectrogram | None | 280k | 0.6081 | 0.6977 | 0.5185 | – |
| FVTC-CNN [40] | Formants | None | – | 0.6400 | 0.4600 | 0.8200 | – |
| Speech SimCLR [41] | Mel-Spectrogram | None | – | 0.6578 | 0.7556 | 0.5600 | – |
| CPC [42] | Mel-Spectrogram | None | – | 0.6762 | 0.7317 | 0.6207 | – |
| CNN-LSTM [23] | Spk. Embd. + OpenSmile | None | – | 0.6850 | 0.8600 | 0.5100 | – |
| SpeechFormer [43] | Wav2Vec | None | 33M | 0.6940 | – | – | – |
| Vowel-based [44] | Mel-Spectrogram | None | – | 0.7000 | 0.8400 | 0.5600 | – |
| DepAudioNet [37] (D1) | Raw-Audio | None | 445k | 0.6259 | 0.7755 | 0.4762 | 10.04% |
| DepAudioNet [30] (D2) | Raw-Audio | USD | 459k | 0.6830 | 0.7826 | 0.5833 | 8.91% |
| DepAudioNet (D3) | Raw-Audio | NUSD | 459k | 0.7086 | 0.8085 | 0.6087 | 8.05% |
| ECAPA-TDNN (E1) | Raw-Audio | None | 595k | 0.6329 | 0.7273 | 0.5385 | 42.33% |
| ECAPA-TDNN (E2) | Raw-Audio | USD | 609k | 0.7086 | 0.8085 | 0.6087 | 9.38% |
| ECAPA-TDNN (E3) | Raw-Audio | NUSD | 609k | 0.7349 | 0.8333 | 0.6364 | 4.68% |
| Δ (E3 vs E2) in % | – | – | – | 3.70 | 2.80 | 4.55 | −50.11 |
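
The arithmetic behind the last row can be sketched as follows. This is a minimal illustration, not the authors' evaluation code, and it rests on two assumptions inferred from the numbers rather than stated in the caption: that F1-AVG is the unweighted mean of F1(ND) and F1(D), and that Δ is the relative change (E3 − E2) / E2 expressed in percent. The function names `macro_f1` and `relative_change_pct` are hypothetical. Using the rounded values from the table, this reproduces the F1-AVG, F1(D), and SID entries to within rounding; for F1(ND) the rounded inputs give roughly 3.07 rather than the listed 2.80, so that entry may have been computed from unrounded scores.

```python
def macro_f1(f1_nd: float, f1_d: float) -> float:
    """Unweighted mean of the per-class F1 scores (assumed definition of F1-AVG)."""
    return (f1_nd + f1_d) / 2.0


def relative_change_pct(new: float, old: float) -> float:
    """Relative change from `old` to `new`, in percent (assumed definition of Δ)."""
    return (new - old) / old * 100.0


# Rounded values copied from the E2 and E3 rows of the table.
E2 = {"f1_avg": 0.7086, "f1_nd": 0.8085, "f1_d": 0.6087, "sid_acc": 9.38}
E3 = {"f1_avg": 0.7349, "f1_nd": 0.8333, "f1_d": 0.6364, "sid_acc": 4.68}

# Relative change of E3 over E2 for every metric.
deltas = {name: relative_change_pct(E3[name], E2[name]) for name in E2}

if __name__ == "__main__":
    print(f"macro F1 for E3: {macro_f1(E3['f1_nd'], E3['f1_d']):.4f}")
    for name, value in deltas.items():
        print(f"delta {name}: {value:+.2f}%")
```

A negative Δ for SID accuracy is the desired direction here, since lower speaker-identification accuracy is marked as better (↓) in the table header.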