Author manuscript; available in PMC: 2020 May 1.
Published in final edited form as: Proc Conf Assoc Comput Linguist Meet. 2019 Jul;2019:6558–6569. doi: 10.18653/v1/p19-1656

Table 4:

An ablation study on the benefit of MulT’s crossmodal transformers using CMU-MOSEI.

                                                 (Unaligned) CMU-MOSEI Sentiment
Description                                      Acc7 (↑)  Acc2 (↑)  F1 (↑)  MAE (↓)  Corr (↑)

Unimodal Transformers
Language only                                    46.5      77.4      78.2    0.653    0.631
Audio only                                       41.4      65.6      68.8    0.764    0.310
Vision only                                      43.5      66.4      69.3    0.759    0.343

Late Fusion by using Multiple Unimodal Transformers
LF-Transformer                                   47.9      78.6      78.5    0.636    0.658

Temporally Concatenated Early Fusion Transformer
EF-Transformer                                   47.8      78.9      78.8    0.648    0.647

Multimodal Transformers
Only [V,A → L] (ours)                            50.5      80.1      80.4    0.605    0.670
Only [L,A → V] (ours)                            48.2      79.7      80.2    0.611    0.651
Only [L,V → A] (ours)                            47.5      79.2      79.7    0.620    0.648
MulT mixing intermediate-level features (ours)   50.3      80.5      80.6    0.602    0.674
MulT (ours)                                      50.7      81.6      81.6    0.591    0.691
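To make the two fusion baselines in the table concrete, the sketch below illustrates (under assumed shapes and a toy prediction-averaging rule, not the paper's actual implementation) how the EF-Transformer's input is formed by concatenating the unaligned modality sequences along the time axis, versus late fusion combining separate unimodal predictions. The sequence lengths and feature dimension here are illustrative placeholders.

```python
import numpy as np

# Hypothetical unaligned feature sequences (time_steps x dim); lengths
# differ per modality, as in unaligned CMU-MOSEI.
rng = np.random.default_rng(0)
L = rng.standard_normal((50, 8))   # language features
A = rng.standard_normal((375, 8))  # audio features
V = rng.standard_normal((500, 8))  # vision features

def early_fusion_input(seqs):
    """Temporally concatenated early fusion: stack all modality
    sequences along the time axis, yielding one long sequence that a
    single transformer (the EF-Transformer) would consume."""
    return np.concatenate(seqs, axis=0)  # shape: (sum of lengths, dim)

def late_fusion(unimodal_preds):
    """Late fusion: each modality gets its own unimodal transformer;
    here their scalar sentiment predictions are simply averaged
    (one plausible combination rule, assumed for illustration)."""
    return float(np.mean(unimodal_preds))

fused = early_fusion_input([L, A, V])
print(fused.shape)                    # (925, 8)
print(late_fusion([0.25, 0.5, 0.75]))  # 0.5
```

The crossmodal rows in the table ([V,A → L], etc.) replace both schemes with directional cross-attention, which is why no single concatenated sequence or prediction average appears in MulT itself.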