
Figure 6:

Visualization of sample crossmodal attention weights from layer 3 of the [V → L] crossmodal transformer on CMU-MOSEI. We found that the crossmodal attention has learned to correlate certain meaningful words (e.g., “movie”, “disappointing”) with segments of stronger visual signals (typically stronger facial motions or expression changes), despite the lack of alignment between the original L/V sequences. Note that, due to the temporal convolutions, each textual/visual feature contains the representation of nearby elements.
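
For readers who want to reproduce this kind of heatmap, the sketch below shows a single-head [V → L] crossmodal attention step: language features act as queries over visual keys/values, and the resulting (T_lang × T_vis) softmax weight matrix is the quantity visualized in Figure 6. The tensor shapes, the head dimension, and the random projection matrices are illustrative assumptions for a minimal example, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def crossmodal_attention(x_lang, x_vis, d_k=40):
    """Single-head [V -> L] crossmodal attention (illustrative sketch).

    x_lang: (T_lang, d) language-stream features, used as queries
    x_vis:  (T_vis,  d) visual-stream features, used as keys/values
    Returns the attended output and the (T_lang, T_vis) attention weights,
    i.e. the kind of matrix shown as a heatmap in Figure 6.
    """
    d = x_lang.size(-1)
    # Hypothetical projections; in the model these are learned per layer/head.
    W_q = torch.randn(d, d_k) / d ** 0.5
    W_k = torch.randn(d, d_k) / d ** 0.5
    W_v = torch.randn(d, d_k) / d ** 0.5

    Q = x_lang @ W_q                       # queries from language
    K = x_vis @ W_k                        # keys from vision
    V = x_vis @ W_v                        # values from vision

    scores = Q @ K.t() / d_k ** 0.5        # (T_lang, T_vis) similarity scores
    weights = F.softmax(scores, dim=-1)    # each word distributes attention over visual steps
    return weights @ V, weights
```

Because the two sequences need not be aligned or of equal length, each row of `weights` simply shows which visual time steps a given word attends to, which is why meaningful words can line up with segments of stronger facial motion even without explicit alignment.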