Author manuscript; available in PMC: 2024 Apr 22.
Published in final edited form as: CEUR Workshop Proc. 2024 Feb;3649:57–63.

Table 1.

F1-scores using majority voting (MV) and DeID, for speaker disentanglement through ADV and USSD using the DAIC-WOZ dataset. Recall that, unlike ADV, USSD does not use speaker labels for disentanglement. The parameter count for USSD does not include speaker ID models, as they are neither retrained nor fine-tuned. The best results are bold-faced.

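| Feature | Model | Disentanglement | Number of Parameters | F1-AVG (MV) | F1-ND | F1-D | DeID |
|---|---|---|---|---|---|---|---|
| Mel-Spectrogram | CNN-LSTM | No | 280k | 0.658 | 0.756 | 0.560 | NA |
| Mel-Spectrogram | CNN-LSTM | ADV | 293k | 0.694 | 0.773 | 0.615 | 14.01% |
| Mel-Spectrogram | CNN-LSTM | USSD | 280k | 0.683 | 0.783 | 0.583 | 10.29% |
| Mel-Spectrogram | ECAPA-TDNN | No | 515k | 0.709 | 0.809 | 0.609 | NA |
| Mel-Spectrogram | ECAPA-TDNN | ADV | 529k | 0.746 | 0.826 | 0.667 | 3.69% |
| Mel-Spectrogram | ECAPA-TDNN | USSD | 515k | 0.746 | 0.826 | 0.667 | 5.97% |
| Raw-Audio | CNN-LSTM | No | 445k | 0.669 | 0.792 | 0.546 | NA |
| Raw-Audio | CNN-LSTM | ADV | 459k | 0.709 | 0.809 | 0.609 | 55.83% |
| Raw-Audio | CNN-LSTM | USSD | 445k | 0.746+ | 0.826 | 0.667 | 45.35% |
| Raw-Audio | ECAPA-TDNN | No | 595k | 0.694 | 0.773 | 0.615 | NA |
| Raw-Audio | ECAPA-TDNN | ADV | 609k | 0.790 | 0.880 | 0.700 | 22.32% |
| Raw-Audio | ECAPA-TDNN | USSD | 595k | 0.773+ | 0.851 | 0.696 | 19.90% |
| ComparE16 | LSTM-only | No | 1.15M | 0.694 | 0.773 | 0.615 | NA |
| ComparE16 | LSTM-only | ADV | 1.18M | 0.762+ | 0.857 | 0.667 | 68.37% |
| ComparE16 | LSTM-only | USSD | 1.15M | 0.776 | 0.885 | 0.667 | 92.87% |
| Wav2vec2 | LSTM-only | No | 3.6M | 0.683 | 0.783 | 0.583 | NA |
| Wav2vec2 | LSTM-only | ADV | 3.7M | 0.747 | 0.863 | 0.632 | 52.43% |
| Wav2vec2 | LSTM-only | USSD | 3.6M | 0.720 | 0.840 | 0.600 | 58.65% |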

The symbols ‘↑’ and ‘↓’ indicate that a higher or lower value is better, respectively.

‘+’ indicates that improvements are not statistically significant.
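
As a reading aid for the F1 columns, the reported F1-AVG values agree, to within 0.001, with the unweighted mean of F1-ND and F1-D (for example, (0.756 + 0.560) / 2 = 0.658 in the first row). The sketch below illustrates one way such scores could be computed. It assumes that majority voting (MV) aggregates segment-level predictions into a single label per speaker, and that "ND" and "D" are the two class labels. The function names and toy data are hypothetical and are not taken from the paper.

```python
from collections import Counter


def majority_vote(segment_preds):
    """Return the most frequent label among one speaker's segment predictions.

    Assumption: ties go to the label seen first (Counter keeps insertion order).
    """
    return Counter(segment_preds).most_common(1)[0][0]


def f1_for_class(y_true, y_pred, positive):
    """F1-score treating `positive` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def f1_avg(y_true, y_pred, labels=("ND", "D")):
    """Unweighted mean of the per-class F1-scores."""
    return sum(f1_for_class(y_true, y_pred, c) for c in labels) / len(labels)


# Toy data (hypothetical): segment-level predictions grouped by speaker.
segments = {
    "spk1": ["D", "D", "ND"],
    "spk2": ["ND", "ND", "D"],
    "spk3": ["ND", "D", "D"],
    "spk4": ["ND", "ND", "ND"],
}
truth = {"spk1": "D", "spk2": "ND", "spk3": "ND", "spk4": "ND"}

speakers = sorted(segments)
y_true = [truth[s] for s in speakers]
y_pred = [majority_vote(segments[s]) for s in speakers]

print("F1-ND :", round(f1_for_class(y_true, y_pred, "ND"), 3))
print("F1-D  :", round(f1_for_class(y_true, y_pred, "D"), 3))
print("F1-AVG:", round(f1_avg(y_true, y_pred), 3))
```

Averaging the two per-class scores without weighting keeps the minority class from being swamped by the majority class, which may be why the table reports both per-class columns alongside the average.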