Table 2.
Depression detection and severity estimation performance on the DAIC-WOZ and Vocal Mind datasets, in terms of per-class scores for the depressed and healthy classes, Balanced Accuracy (BAc.), and RMSE.
Column groups: acoustic features alone (Acoustic), speaker embeddings alone (ECAPA), and acoustic and speaker embeddings combined (Combined). "Depressed" and "Healthy" are the per-class scores for the respective classes.

| Dataset | Acoustic feature | Model | Acoustic: Depressed | Acoustic: Healthy | Acoustic: BAc. | Acoustic: RMSE | ECAPA: Depressed | ECAPA: Healthy | ECAPA: BAc. | ECAPA: RMSE | Combined: Depressed | Combined: Healthy | Combined: BAc. | Combined: RMSE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DAIC | COVAREP | MK-CNN | 0.35 | 0.70 | 0.52 | 7.39 | 0.43 | 0.78 | 0.60 | 6.35 | 0.45 | 0.79 | 0.61 | 6.21 |
| DAIC | COVAREP | LSTM | 0.32 | 0.70 | 0.51 | 7.41 | 0.46 | 0.79 | 0.61 | 6.31 | 0.47 | 0.80 | 0.63 | 6.19 |
| DAIC | OpenSMILE | MK-CNN | 0.37 | 0.74 | 0.55 | 6.87 | 0.43 | 0.78 | 0.61 | 6.35 | 0.49 | 0.81 | 0.65 | 6.08 |
| DAIC | OpenSMILE | LSTM | 0.39 | 0.73 | 0.56 | 6.82 | 0.46 | 0.79 | 0.63 | 6.31 | 0.50 | 0.83 | 0.66 | 6.01 |
| VM | COVAREP | MK-CNN | 0.30 | 0.68 | 0.49 | 7.61 | 0.32 | 0.80 | 0.55 | 6.64 | 0.34 | 0.80 | 0.57 | 6.55 |
| VM | COVAREP | LSTM | 0.32 | 0.67 | 0.50 | 7.63 | 0.34 | 0.81 | 0.57 | 6.62 | 0.37 | 0.81 | 0.60 | 6.51 |
| VM | OpenSMILE | MK-CNN | 0.32 | 0.74 | 0.53 | 6.96 | 0.32 | 0.80 | 0.56 | 6.64 | 0.41 | 0.81 | 0.61 | 6.41 |
| VM | OpenSMILE | LSTM | 0.34 | 0.75 | 0.54 | 6.94 | 0.34 | 0.81 | 0.57 | 6.62 | 0.43 | 0.84 | 0.64 | 6.28 |

The two per-class scores are for the depressed and healthy classes, respectively. COVAREP and OpenSMILE are acoustic features. Results are obtained using ECAPA-TDNN x-vectors (ECAPA), COVAREP, and OpenSMILE features on the DAIC-WOZ (DAIC) and Vocal Mind (VM) datasets. For results obtained by combining acoustic and speaker embeddings ((ECAPA, COVAREP) and (ECAPA, OpenSMILE)), MK-CNN and LSTM refer to CE models with MK-CNN and LSTM blocks, respectively.
Bold values indicate best results in each comparison group.
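
For readers unfamiliar with the two aggregate metrics in the table, the following is a minimal sketch of how Balanced Accuracy and RMSE are commonly computed. It assumes the standard definitions (balanced accuracy as the mean of per-class recall, RMSE as the root of the mean squared error); the function names and the toy labels and scores are illustrative only and are not taken from the datasets or from the authors' evaluation code.

```python
import math


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    recalls = []
    for cls in sorted(set(y_true)):
        # Indices of all samples whose true label is this class.
        idx = [i for i, t in enumerate(y_true) if t == cls]
        correct = sum(1 for i in idx if y_pred[i] == cls)
        recalls.append(correct / len(idx))
    return sum(recalls) / len(recalls)


def rmse(y_true, y_pred):
    """Root mean squared error between true and predicted severity scores."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


# Toy, made-up example: 1 = depressed, 0 = healthy (hypothetical labels).
labels_true = [1, 1, 0, 0, 0, 0]
labels_pred = [1, 0, 0, 0, 0, 1]

# Toy, made-up severity scores (hypothetical values).
scores_true = [10.0, 4.0, 7.0, 1.0]
scores_pred = [8.0, 6.0, 7.0, 3.0]

bac = balanced_accuracy(labels_true, labels_pred)
err = rmse(scores_true, scores_pred)
print(f"BAc. = {bac:.3f}, RMSE = {err:.3f}")
```

Because balanced accuracy averages recall per class, a classifier that always predicts the majority (healthy) class scores 0.5 rather than the majority-class proportion, which is why it is a useful companion to per-class scores on imbalanced data.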