Table 4:
An ablation study on the benefit of MulT’s crossmodal transformers using CMU-MOSEI.
| Description | (Unaligned) CMU-MOSEI Sentiment | ||||
|---|---|---|---|---|---|
| | Acc7^h | Acc2^h | F1^h | MAE^l | Corr^h |
| Unimodal Transformers | |||||
| Language only | 46.5 | 77.4 | 78.2 | 0.653 | 0.631 |
| Audio only | 41.4 | 65.6 | 68.8 | 0.764 | 0.310 |
| Vision only | 43.5 | 66.4 | 69.3 | 0.759 | 0.343 |
| Late Fusion by using Multiple Unimodal Transformers | |||||
| LF-Transformer | 47.9 | 78.6 | 78.5 | 0.636 | 0.658 |
| Temporally Concatenated Early Fusion Transformer | |||||
| EF-Transformer | 47.8 | 78.9 | 78.8 | 0.648 | 0.647 |
| Multimodal Transformers | |||||
| Only [V,A → L] (ours) | 50.5 | 80.1 | 80.4 | 0.605 | 0.670 |
| Only [L,A → V] (ours) | 48.2 | 79.7 | 80.2 | 0.611 | 0.651 |
| Only [L,V → A] (ours) | 47.5 | 79.2 | 79.7 | 0.620 | 0.648 |
| MulT mixing intermediate-level features (ours) | 50.3 | 80.5 | 80.6 | 0.602 | 0.674 |
| MulT (ours) | 50.7 | 81.6 | 81.6 | 0.591 | 0.691 |