Table 1.
F1-scores using majority voting (MV) and DeID, for speaker disentanglement through ADV and USSD using the DAIC-WOZ dataset. Recall that, unlike ADV, USSD does not use speaker labels for disentanglement. The parameter count for USSD does not include speaker ID models, as they are neither retrained nor fine-tuned. The best results are bold-faced.
Feature | Model | Disentanglement | Number of Parameters ↓ | F1-AVG (MV) ↑ | F1-ND ↑ | F1-D ↑ | DeID ↑ |
---|---|---|---|---|---|---|---|
Mel-Spectrogram | CNN-LSTM | No | 280k | 0.658 | 0.756 | 0.560 | NA |
Mel-Spectrogram | CNN-LSTM | ADV | 293k | 0.694 | 0.773 | 0.615 | 14.01% |
Mel-Spectrogram | CNN-LSTM | USSD | 280k | 0.683 | 0.783 | 0.583 | 10.29% |
Mel-Spectrogram | ECAPA-TDNN | No | 515k | 0.709 | 0.809 | 0.609 | NA |
Mel-Spectrogram | ECAPA-TDNN | ADV | 529k | 0.746 | 0.826 | 0.667 | 3.69% |
Mel-Spectrogram | ECAPA-TDNN | USSD | 515k | 0.746 | 0.826 | 0.667 | 5.97% |
Raw-Audio | CNN-LSTM | No | 445k | 0.669 | 0.792 | 0.546 | NA |
Raw-Audio | CNN-LSTM | ADV | 459k | 0.709 | 0.809 | 0.609 | 55.83% |
Raw-Audio | CNN-LSTM | USSD | 445k | 0.746+ | 0.826 | 0.667 | 45.35% |
Raw-Audio | ECAPA-TDNN | No | 595k | 0.694 | 0.773 | 0.615 | NA |
Raw-Audio | ECAPA-TDNN | ADV | 609k | 0.790 | 0.880 | 0.700 | 22.32% |
Raw-Audio | ECAPA-TDNN | USSD | 595k | 0.773+ | 0.851 | 0.696 | 19.90% |
ComparE16 | LSTM-only | No | 1.15M | 0.694 | 0.773 | 0.615 | NA |
ComparE16 | LSTM-only | ADV | 1.18M | 0.762+ | 0.857 | 0.667 | 68.37% |
ComparE16 | LSTM-only | USSD | 1.15M | 0.776 | 0.885 | 0.667 | 92.87% |
Wav2vec2 | LSTM-only | No | 3.6M | 0.683 | 0.783 | 0.583 | NA |
Wav2vec2 | LSTM-only | ADV | 3.7M | 0.747 | 0.863 | 0.632 | 52.43% |
Wav2vec2 | LSTM-only | USSD | 3.6M | 0.720 | 0.840 | 0.600 | 58.65% |
The symbols ↑ and ↓ indicate that a higher or lower value is better, respectively.
+ indicates that improvements are not statistically significant.
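
The reported F1-AVG values appear consistent with the unweighted mean of the two per-class scores, F1-ND and F1-D. The following is a minimal sketch that checks this against a few rows copied from Table 1. It assumes that F1-AVG is computed as this macro average, which the caption does not state explicitly, and the helper names `macro_f1` and `max_deviation` are hypothetical, introduced only for illustration.

```python
# Rows copied from Table 1 as (F1-AVG, F1-ND, F1-D).
rows = [
    (0.658, 0.756, 0.560),  # Mel-Spectrogram, CNN-LSTM, no disentanglement
    (0.694, 0.773, 0.615),  # Mel-Spectrogram, CNN-LSTM, ADV
    (0.683, 0.783, 0.583),  # Mel-Spectrogram, CNN-LSTM, USSD
    (0.762, 0.857, 0.667),  # ComparE16, LSTM-only, ADV
    (0.776, 0.885, 0.667),  # ComparE16, LSTM-only, USSD
    (0.720, 0.840, 0.600),  # Wav2vec2, LSTM-only, USSD
]


def macro_f1(f1_nd, f1_d):
    """Unweighted mean of the two per-class F1 scores (assumed definition of F1-AVG)."""
    return (f1_nd + f1_d) / 2


def max_deviation(table_rows):
    """Largest absolute gap between the reported F1-AVG and the recomputed mean."""
    return max(abs(macro_f1(nd, d) - avg) for avg, nd, d in table_rows)


if __name__ == "__main__":
    # Values in the table are rounded to three decimals, so a gap below 1e-3
    # is compatible with the macro-average assumption.
    print(f"max deviation: {max_deviation(rows):.4f}")
```

Because the table values are rounded to three decimals, agreement within 0.001 supports the assumption but does not confirm how the authors computed the column.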