Table 4:
Description | (Unaligned) CMU-MOSEI Sentiment | ||||
---|---|---|---|---|---|
Metric | Acc7h | Acc2h | F1h | MAEl | Corrh |
Unimodal Transformers | |||||
Language only | 46.5 | 77.4 | 78.2 | 0.653 | 0.631 |
Audio only | 41.4 | 65.6 | 68.8 | 0.764 | 0.310 |
Vision only | 43.5 | 66.4 | 69.3 | 0.759 | 0.343 |
Late Fusion by using Multiple Unimodal Transformers | |||||
LF-Transformer | 47.9 | 78.6 | 78.5 | 0.636 | 0.658 |
Temporally Concatenated Early Fusion Transformer | |||||
EF-Transformer | 47.8 | 78.9 | 78.8 | 0.648 | 0.647 |
Multimodal Transformers | |||||
Only [V,A → L] (ours) | 50.5 | 80.1 | 80.4 | 0.605 | 0.670 |
Only [L,A → V] (ours) | 48.2 | 79.7 | 80.2 | 0.611 | 0.651 |
Only [L,V → A] (ours) | 47.5 | 79.2 | 79.7 | 0.620 | 0.648 |
MulT mixing intermediate-level features (ours) | 50.3 | 80.5 | 80.6 | 0.602 | 0.674 |
MulT (ours) | 50.7 | 81.6 | 81.6 | 0.591 | 0.691 |