Table 1.
TL Strategy | Inputs | Models | With VAD (InaSpeech) | Accuracy (%) ± 95% CI |
---|---|---|---|---|
- | - | Human perception [18] | - | 67.00 |
- | - | ZeroR | - | 13.33 ± 2.06 |
Feature Extraction | Deep-Spectrum embs. from fc7 of AlexNet | SVC | No | 43.32 ± 2.56 |
Feature Extraction | Deep-Spectrum embs. from fc7 of AlexNet | SVC | Yes | 45.80 ± 2.57 |
Feature Extraction | PANNs embs. from CNN-14 | SVC | No | 39.73 ± 2.53 |
Feature Extraction | PANNs embs. from CNN-14 | SVC | Yes | 37.22 ± 2.50 |
Fine Tuning | Mel spectrograms | AlexNet | No | 60.72 ± 2.52 |
Fine Tuning | Mel spectrograms | AlexNet | Yes | 61.67 ± 2.51 |
Fine Tuning | Mel spectrograms | CNN-14 | No | 76.58 ± 2.18 |
Fine Tuning | Mel spectrograms | CNN-14 | Yes | 75.25 ± 2.23 |