Beyond Additive Fusion: Learning Non-Additive Multimodal Interactions

Torsten Wörtwein; Lisa B Sheeber; Nicholas Allen; Jeffrey F Cohn; Louis-Philippe Morency

doi:10.18653/v1/2022.findings-emnlp.344

Author manuscript; available in PMC: 2025 Dec 1.

Published in final edited form as: Find ACL EMNLP. 2022 Dec;2022:4681–4696. doi: 10.18653/v1/2022.findings-emnlp.344


PMCID: PMC12665182 NIHMSID: NIHMS2013310 PMID: 41328176

Abstract

Multimodal fusion addresses the problem of analyzing spoken words in the multimodal context, including visual expressions and prosodic cues. Even when multimodal models lead to performance improvements, it is often unclear whether bimodal and trimodal interactions are learned or whether modalities are processed independently of each other. We propose Multimodal Residual Optimization (MRO)¹ to separate unimodal, bimodal, and trimodal interactions in a multimodal model. This improves interpretability as the multimodal interaction can be quantified. Inspired by Occam’s razor, the main intuition of MRO is that (simpler) unimodal contributions should be learned before learning (more complex) bimodal and trimodal interactions. For example, bimodal predictions should learn to correct the mistakes (residuals) of unimodal predictions, thereby letting the bimodal predictions focus on the remaining bimodal interactions. Empirically, we observe that MRO successfully separates unimodal, bimodal, and trimodal interactions while not degrading predictive performance. We complement our empirical results with a human perception study and observe that MRO learns multimodal interactions that align with human judgments.

1. Introduction

Multimodal fusion integrates information from what we say, how we speak, and how we visually express ourselves. While multimodal models have led to performance improvements (Zadeh et al., 2017; Tsai et al., 2019; Zellers et al., 2021), they often have the downside of being difficult to interpret: it is unclear whether interactions between two modalities (bimodal) or three modalities (trimodal) are learned, whether modalities are processed independently of each other, or whether these models focus on only one modality (Wu et al., 2021). Quantifying multimodal interactions is an essential building block for future research: in model debugging as a step to better understand models and improve their performance (Du et al., 2019) as well as in AI applications as a step to be more interpretable (Goodman and Flaxman, 2017).

Seminal work (Hessel and Lee, 2020) observed that many multimodal models function like the sum of unimodal models, so-called additive models. In other words, these models might not be learning as many non-additive (bimodal and trimodal) interactions as expected. The non-additive interaction example in Figure 1 exemplifies how humans perceive the whole multimodal example as more than the sum of the two modalities. While the current approach of separating additive and non-additive interactions highlighted the problem of models primarily learning additive contributions, it did not provide solutions to learn non-additive interactions explicitly (Hessel and Lee, 2020). However, many multimodal tasks, such as visual question answering (Cadene et al., 2019), require learning unimodal, bimodal, and trimodal interactions.

Figure 1: The joint assessment of language and vision (denoted as $f (L, V)$) is different from the sum of the unimodal assessments (additive). This is an example for valence from the IEMOCAP dataset (Busso et al., 2008).

In this paper, we introduce Multimodal Residual Optimization (MRO) to explicitly learn and decompose predictions into the sum of unimodal, bimodal, and trimodal interactions. Inspired by Occam’s razor, to prefer simpler solutions, the main intuition of MRO is that (simpler) unimodal contributions should be learned before learning (more complex) bimodal and trimodal interactions. For example, the bimodal predictions should learn to correct the mistakes (residuals) of the unimodal predictions, thereby letting the bimodal predictions focus on the remaining bimodal interactions. Similarly, trimodal predictions should learn what is not modeled by unimodal and bimodal predictions.

We evaluate MRO on six multimodal language datasets, including tasks for intent, sentiment, and emotion recognition. MRO aims to separate multimodal interactions (unimodal, bimodal, and trimodal) without degrading predictive performance. As part of evaluating MRO, we propose a new evaluation metric that extends prior work to three modalities (Hessel and Lee, 2020). We complement our empirical results with a human perception study to evaluate whether MRO learns non-additive interactions that align with human judgment.

2. Related Work

We review previous research on four aspects related to multimodal interactions: the prevalence of additive interactions, model-specific and model-agnostic quantification of modality interactions, and taxonomies of multimodal interactions.

Prevalence of Additive Interactions:

Growing empirical evidence (Hessel and Lee, 2020) and annotation studies (Provost et al., 2015; Kruk et al., 2019; Wörtwein et al., 2021) highlight that additive interactions are prevalent especially on datasets that are not carefully balanced, e.g., not having the same image contextualized with different captions (Hessel and Lee, 2020). An empirical approach highlights that multimodal models can be factorized into additive models without significant loss in performance (Hessel and Lee, 2020), indicating that the examined models primarily relied on additive interactions. Similarly, multimodal perception studies indicate the importance of additive interactions: unimodal ratings of emotions are predictive of multimodal ratings (Provost et al., 2015). Further, annotations of the semiotic mode, how the multimodal meaning emerges from individual modalities (Bateman, 2014), of text-image pairs found that modalities provide mostly the same meaning (Kruk et al., 2019). Moreover, modality importance annotations for affective states found that a single modality often contains sufficient information to confirm an affective state (Wörtwein et al., 2021). While additive interactions are sufficient in many cases, non-additive interactions are still needed, especially when datasets contain the same unimodal representation in different multimodal contexts (Provost et al., 2015; Hessel and Lee, 2020).

Model-specific quantification:

Models can indicate how much they rely on potentially non-additive interactions (Zadeh et al., 2018; Tsai et al., 2020). Multimodal routing (Tsai et al., 2020) was recently proposed to interpret the relative importance of multimodal interactions. It uses the routing-by-agreement algorithm (Sabour et al., 2017) to focus more on modalities whose embedding is similar to other modalities’ embeddings. The performance gains of the routing model hint at modalities containing partially redundant information (De Gelder and Bertelson, 2003) for emotion and sentiment prediction. While most model-specific approaches cannot rule out that a multimodal model potentially uses only one modality (Wu et al., 2021), MRO encourages a bimodal model to focus on bimodal interactions.

Model-agnostic quantification:

Multimodal interactions can be quantified after a model has been trained (Hessel and Lee, 2020; Tsang et al., 2020; Wang et al., 2021; Lyu et al., 2022). EMAP (Hessel and Lee, 2020) is based on the idea of factorizing any trained model into additive and non-additive interactions. Unfortunately, this marginalization is very costly: with $m$ modalities and a dataset of $N$ samples, it requires $N^{m}$ forward passes. In contrast to EMAP, MRO learns a model that directly separates multimodal interactions.

Taxonomy of Multimodal Interactions:

Many categorizations have been proposed to quantify the relationship between modalities (Kloepfer, 1976; Zhang et al., 2018; Wang et al., 2021). A recent study (Kruk et al., 2019) uses Kloepfer’s categories: parallel, amplifying, and divergent. Parallel signals that only one modality is needed for prediction as they all provide the same meaning. Amplifying is sometimes also referred to as “additive” in a non-mathematical sense: modalities provide similar information but their combined meaning is either amplified or diminished. Finally, divergent indicates that modalities provide opposing information. Figure 1 is an example of opposing information.

3. Quantifying Multimodal Interactions

To learn a multimodal model that separates unimodal, bimodal, and trimodal interactions, we first define how to quantify these three types of multimodal interactions. Further, the work presented in this section extends prior work (Hessel and Lee, 2020), which defined evaluation metrics to quantify multimodal interactions in the bimodal case.

Consider three modalities $T$ (text), $V$ (vision), and $A$ (acoustic) with corresponding features $x_{T}$ , $x_{V}$ , $x_{A}$ . A bimodal function $f$ is additive when it can be factorized into the sum of two unimodal functions, $\forall x_{T}, x_{V} : f (x_{T}, x_{V}) = g (x_{T}) + h (x_{V})$ . Further, $f$ contains unimodal contributions when parts of the prediction depend on only one modality: $\exists x_{T} : E_{v} f (x_{T}, v) \neq 0$ (Lyu et al., 2022). This equation is illustrated for the language modality but has the same formulation for the vision modality. Prior work (Hessel and Lee, 2020) proposed EMAP to quantify unimodal contributions $(U C)$ in the context of two modalities. In this paper, we generalize $U C$ to three modalities.

Claim 1.

A trimodal function $f$ contains unimodal contributions when $U C (f, x_{T}, x_{V}, x_{A}) \neq 0$ with

\begin{array}{l} U C (f, x_{T}, x_{V}, x_{A}) = \\ \underset{v, a}{E} f (x_{T}, v, a) + \underset{t, a}{E} f (t, x_{V}, a) \\ + \underset{t, v}{E} f (t, v, x_{A}) - 2 \underset{t, v, a}{E} f (t, v, a) . \end{array}

(1)

The idea of $U C$ is to evaluate the model with all possible combinations of unimodal features (even feature combinations that are not in a dataset) so that the model cannot use non-additive interactions between modalities. Similarly, we can formulate a function $B I$ to quantify bimodal interactions.

Claim 2.

A trimodal function $f$ contains bimodal interactions $(B I)$ when $B I (f, x_{T}, x_{V}, x_{A}) \neq 0$ with

\begin{array}{l} B I (f, x_{T}, x_{V}, x_{A}) = \\ \underset{t}{E} [f (t, x_{V}, x_{A}) - U C (f, t, x_{V}, x_{A})] \\ + \underset{v}{E} [f (x_{T}, v, x_{A}) - U C (f, x_{T}, v, x_{A})] \\ + \underset{a}{E} [f (x_{T}, x_{V}, a) - U C (f, x_{T}, x_{V}, a)] . \end{array}

(2)

The remaining trimodal interactions $(T I)$ are then simply what is not covered by the unimodal contributions and bimodal interactions:

\begin{array}{l} T I (f, x_{T}, x_{V}, x_{A}) = f (x_{T}, x_{V}, x_{A}) \\ - U C (f, x_{T}, x_{V}, x_{A}) - B I (f, x_{T}, x_{V}, x_{A}) . \end{array}

(3)

When computationally feasible², $U C$ , $B I$ and $T I$ are valuable tools to evaluate whether a trimodal model contains unimodal, bimodal, and trimodal interactions. We will use these metrics to evaluate our proposed approaches.
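Equations 1 to 3 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes scalar features, takes each expectation over all combinations of the observed per-modality values (the lists `T`, `V`, `A`), and uses a hypothetical function `toy` whose unimodal, bimodal, and trimodal parts are known in advance.

```python
from itertools import product
from statistics import fmean


def uc(f, x_t, x_v, x_a, T, V, A):
    """Equation 1: unimodal contributions of f at (x_t, x_v, x_a)."""
    e_t = fmean(f(x_t, v, a) for v, a in product(V, A))
    e_v = fmean(f(t, x_v, a) for t, a in product(T, A))
    e_a = fmean(f(t, v, x_a) for t, v in product(T, V))
    e_all = fmean(f(t, v, a) for t, v, a in product(T, V, A))
    return e_t + e_v + e_a - 2 * e_all


def bi(f, x_t, x_v, x_a, T, V, A):
    """Equation 2: what UC misses once a single modality is marginalized."""
    return (
        fmean(f(t, x_v, x_a) - uc(f, t, x_v, x_a, T, V, A) for t in T)
        + fmean(f(x_t, v, x_a) - uc(f, x_t, v, x_a, T, V, A) for v in V)
        + fmean(f(x_t, x_v, a) - uc(f, x_t, x_v, a, T, V, A) for a in A)
    )


def ti(f, x_t, x_v, x_a, T, V, A):
    """Equation 3: the remainder is attributed to trimodal interactions."""
    args = (f, x_t, x_v, x_a, T, V, A)
    return f(x_t, x_v, x_a) - uc(*args) - bi(*args)


# Hypothetical toy function with known unimodal, bimodal, and trimodal parts.
def toy(t, v, a):
    return t + 2 * v + 3 * a + t * v + t * v * a


# Observed values per modality (zero-mean, so each part is recovered exactly).
T = V = A = [-1.0, 1.0]
point = (1.0, -1.0, 1.0)
print(uc(toy, *point, T, V, A))  # unimodal part: t + 2v + 3a
print(bi(toy, *point, T, V, A))  # bimodal part: t * v
print(ti(toy, *point, T, V, A))  # trimodal part: t * v * a
```

Even in this toy setting, every call to `uc` enumerates all feature combinations, which is why the metrics are only practical when the number of samples is small enough.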

4. Multimodal Residual Optimization

The main contribution of this paper is Multimodal Residual Optimization (MRO) which has the goal of learning and decomposing predictions into unimodal, bimodal and trimodal interactions to quantify them. Inspired by Occam’s razor, the intuition of MRO is that (simpler) unimodal interactions should be prioritized before learning (more complex) bimodal and trimodal interactions. MRO has two components to separate modality interactions: an architecture and loss-function component.

4.1. MRO Architecture

Instead of using a single trimodal function to make a prediction $\hat{y} = f (x_{T}, x_{V}, x_{A})$ , the goal of MRO is to make predictions as $\hat{y} = U C (f, x_{T}, x_{V}, x_{A}) + B I (f, x_{T}, x_{V}, x_{A}) + T I (f, x_{T}, x_{V}, x_{A})$ without having to compute $U C$ , $B I$ and $T I$ . Therefore, MRO makes predictions $\hat{y}$ based on three components:

\hat{y} = {\hat{y}}_{uni} + {\hat{y}}_{bi} + {\hat{y}}_{tri}

(4)

where ${\hat{y}}_{uni}$ , ${\hat{y}}_{bi}$ and ${\hat{y}}_{tri}$ model the unimodal, bimodal, and trimodal interactions respectively. It is important to note that ${\hat{y}}_{bi}$ and ${\hat{y}}_{tri}$ are intended to model only non-additive interactions, while ${\hat{y}}_{uni}$ is designed to model only additive interactions. ${\hat{y}}_{uni}$ is defined as

{\hat{y}}_{uni} = f_{θ_{T}} (x_{T}) + f_{θ_{V}} (x_{V}) + f_{θ_{A}} (x_{A})

(5)

where $f_{θ_{T}}$ , $f_{θ_{V}}$ and $f_{θ_{A}}$ are models, e.g., neural networks that use only one modality as an input. Each model has its own set of parameters ( $θ_{T}$ , $θ_{V}$ , and $θ_{A}$ ). We parameterize the bimodal and trimodal models in a similar manner:

{\hat{y}}_{bi} = f_{θ_{T V}} (x_{T}, x_{V}) + f_{θ_{T A}} (x_{T}, x_{A}) + f_{θ_{A V}} (x_{A}, x_{V})

(6)

{\hat{y}}_{tri} = f_{θ_{T V A}} (x_{T}, x_{V}, x_{A})

(7)

where $f_{θ_{T V}}$ , $f_{θ_{T A}}$ and $f_{θ_{A V}}$ are the bimodal models that take only two modalities as input, and $f_{θ_{T V A}}$ takes all three modalities as input. The whole MRO model is parameterized with $Θ = (θ_{T}, θ_{V}, θ_{A}, θ_{T V}, θ_{T A}, θ_{A V}, θ_{T V A})$ .

This architecture already enforces that ${\hat{y}}_{uni}$ can only contain unimodal contributions. While dedicating unimodal, bimodal, and trimodal models was explored in prior work (Zadeh et al., 2016, 2019; Tsai et al., 2020), they did not explicitly encourage ${\hat{y}}_{bi}$ and ${\hat{y}}_{tri}$ not to contain unimodal contributions and similarly ${\hat{y}}_{tri}$ not to contain bimodal interactions. The MRO loss function described in the next section addresses this issue.
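The decomposition in Equations 4 to 7 can be sketched as a container of seven sub-models. The class name `MROModel`, the scalar inputs, and the toy lambdas below are illustrative assumptions; in the paper the sub-models are neural networks with their own parameters.

```python
from dataclasses import dataclass
from typing import Callable

Uni = Callable[[float], float]
Bi = Callable[[float, float], float]
Tri = Callable[[float, float, float], float]


@dataclass
class MROModel:
    f_t: Uni
    f_v: Uni
    f_a: Uni
    f_tv: Bi
    f_ta: Bi
    f_av: Bi
    f_tva: Tri

    def components(self, x_t, x_v, x_a):
        """Return (y_uni, y_bi, y_tri) as in Equations 5 to 7."""
        y_uni = self.f_t(x_t) + self.f_v(x_v) + self.f_a(x_a)
        y_bi = self.f_tv(x_t, x_v) + self.f_ta(x_t, x_a) + self.f_av(x_a, x_v)
        y_tri = self.f_tva(x_t, x_v, x_a)
        return y_uni, y_bi, y_tri

    def predict(self, x_t, x_v, x_a):
        """Equation 4: the prediction is the sum of the three components."""
        return sum(self.components(x_t, x_v, x_a))


# Hypothetical sub-models standing in for trained networks.
model = MROModel(
    f_t=lambda t: 0.5 * t,
    f_v=lambda v: -v,
    f_a=lambda a: 0.0,
    f_tv=lambda t, v: t * v,
    f_ta=lambda t, a: 0.0,
    f_av=lambda a, v: 0.0,
    f_tva=lambda t, v, a: 0.1 * t * v * a,
)
y_uni, y_bi, y_tri = model.components(1.0, 2.0, 3.0)
print(y_uni, y_bi, y_tri, model.predict(1.0, 2.0, 3.0))
```

Keeping the three components separate at prediction time is what later allows the non-additive part to be inspected or removed post hoc.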

4.2. MRO Loss Function

We first explain MRO for two modalities (language and vision) before presenting the more general formulation for three and more modalities.

Bimodal case:

To encourage ${\hat{y}}_{bi}$ to not contain unimodal contributions, MRO prioritizes ${\hat{y}}_{uni}$ . MRO defines the loss function as

L (y, \hat{y}) = L (y, {\hat{y}}_{uni}) + L (y, s g ({\hat{y}}_{uni}) + {\hat{y}}_{bi})

(8)

where $s g$ refers to stop-gradient (Razavi et al., 2019), which prevents back-propagation through $s g$ ’s arguments. The first part of Equation 8 updates $θ_{T}$ and $θ_{V}$ to predict $y$ using only unimodal contributions ${\hat{y}}_{uni} = f_{θ_{T}} (x_{T}) + f_{θ_{V}} (x_{V})$ . The second part of Equation 8 updates $θ_{T V}$ so that $L (y, {\hat{y}}_{uni} + {\hat{y}}_{bi})$ is smaller; i.e., ${\hat{y}}_{bi}$ corrects mistakes that ${\hat{y}}_{uni}$ makes. We do not backpropagate again to $θ_{T}$ and $θ_{V}$ so that ${\hat{y}}_{bi}$ does not influence ${\hat{y}}_{uni}$ ; i.e., ${\hat{y}}_{uni}$ is optimized independently of ${\hat{y}}_{bi}$ .

Figure 2 summarizes MRO in the bimodal case.
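The effect of the stop-gradient in Equation 8 can be illustrated with hand-written gradients on a toy problem. The sketch below makes several assumptions that are not in the paper: scalar features, linear sub-models, a squared-error loss (chosen only because its gradient is easy to write out), and a bimodal model that is deliberately able to represent unimodal terms. With `use_mro=True` the unimodal parameters are updated only from $L (y, {\hat{y}}_{uni})$; with `use_mro=False` all parameters are updated from the joint loss.

```python
from itertools import product

# Toy data (an assumption): y = x_t + x_v + x_t * x_v on all sign combinations.
DATA = [(x_t, x_v, x_t + x_v + x_t * x_v) for x_t, x_v in product((-1.0, 1.0), repeat=2)]


def train(use_mro, steps=300, lr=0.1):
    w = {"t": 0.0, "v": 0.0}             # unimodal parameters (theta_T, theta_V)
    b = {"t": 0.0, "v": 0.0, "tv": 0.0}  # bimodal parameters (theta_TV)
    for _ in range(steps):
        g_w = dict.fromkeys(w, 0.0)
        g_b = dict.fromkeys(b, 0.0)
        for x_t, x_v, y in DATA:
            feats = {"t": x_t, "v": x_v, "tv": x_t * x_v}
            y_uni = sum(w[k] * feats[k] for k in w)
            y_bi = sum(b[k] * feats[k] for k in b)
            # With stop-gradient, y_uni only sees its own error L(y, y_uni);
            # under joint training it sees the error of the full prediction.
            err_uni = (y - y_uni) if use_mro else (y - y_uni - y_bi)
            err_bi = y - y_uni - y_bi  # y_bi always corrects the remaining residual
            for k in w:
                g_w[k] += -2.0 * err_uni * feats[k] / len(DATA)
            for k in b:
                g_b[k] += -2.0 * err_bi * feats[k] / len(DATA)
        for k in w:
            w[k] -= lr * g_w[k]
        for k in b:
            b[k] -= lr * g_b[k]
    return w, b


w_mro, b_mro = train(use_mro=True)
w_joint, b_joint = train(use_mro=False)
print("MRO:  ", w_mro, b_mro)
print("Joint:", w_joint, b_joint)
```

In this toy setting both variants fit the data, but only the MRO-style update leaves the unimodal weights of the bimodal model near zero; under joint training the unimodal signal is split between the two components.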

$m$ -modal case:

In the case of $m$ modalities, we have $m$ types of interactions: unimodal, bimodal, trimodal, ..., $m$ -modal. Instead of separating just additive from all non-additive interactions, we want to separate these $m$ types of interactions. MRO defines the loss function as

L (y, \hat{y}) = \sum_{i = 1}^{m} L (y, s g (\sum_{j = 1}^{i - 1} {\hat{y}}_{j}) + {\hat{y}}_{i})

(9)

where ${\hat{y}}_{i}$ refers to the $i$ -modal predictions, i.e., ${\hat{y}}_{1} = {\hat{y}}_{uni}$ , ${\hat{y}}_{2} = {\hat{y}}_{bi}$ , ${\hat{y}}_{3} = {\hat{y}}_{tri}$ . For the trimodal case, ${\hat{y}}_{uni}$ , ${\hat{y}}_{bi}$ , and ${\hat{y}}_{tri}$ were defined in subsection 4.1. When $m$ is larger than three, the models can be defined following the same approach. Similar to the bimodal case, ${\hat{y}}_{bi}$ is optimized independently of ${\hat{y}}_{tri}$ as the gradient of ${\hat{y}}_{bi}$ is stopped by $s g$ when optimizing ${\hat{y}}_{tri}$ .
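As a worked instance, Equation 9 can be written out for $m = 3$, where the inner sum is empty for $i = 1$:

```latex
L (y, \hat{y}) = L (y, {\hat{y}}_{uni}) + L (y, s g ({\hat{y}}_{uni}) + {\hat{y}}_{bi}) + L (y, s g ({\hat{y}}_{uni} + {\hat{y}}_{bi}) + {\hat{y}}_{tri})
```

The first two terms coincide with Equation 8. Because $s g$ treats its argument as a constant during back-propagation, each parameter group receives a gradient from exactly one term: the unimodal parameters from the first, the bimodal parameters from the second, and $θ_{T V A}$ from the third.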

4.3. Sequential MRO

As an alternative to MRO’s approach of simultaneously optimizing all prediction components $({\hat{y}}_{uni}, {\hat{y}}_{bi}, {\hat{y}}_{tri})$ , sequential MRO (sMRO) optimizes them one after another.

First, sMRO optimizes the parameters of ${\hat{y}}_{uni}$ using the loss $L (y, {\hat{y}}_{uni})$ until convergence and then freezes its parameters $θ_{T}$ , $θ_{V}$ , and $θ_{A}$ before optimizing ${\hat{y}}_{bi}$ and ${\hat{y}}_{tri}$ . Next, sMRO optimizes the parameters of ${\hat{y}}_{bi}$ using the loss $L (y, {\hat{y}}_{uni} + {\hat{y}}_{bi})$ until convergence and then freezes the bimodal parameters $θ_{T V}$ , $θ_{T A}$ and $θ_{A V}$ . The trimodal ${\hat{y}}_{tri}$ can then be optimized using the loss $L (y, {\hat{y}}_{uni} + {\hat{y}}_{bi} + {\hat{y}}_{tri})$ . For cases with more than three modalities, sMRO can optimize the parameters of ${\hat{y}}_{m}$ for $L (y, \sum_{i = 1}^{m} {\hat{y}}_{i})$ until convergence and then freeze the parameters of ${\hat{y}}_{m}$ .
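The staged procedure can be sketched for two modalities as follows. The helper `fit_stage`, the linear sub-models, the squared-error loss, the toy target, and the fixed number of gradient steps (standing in for "until convergence") are all assumptions made for this illustration.

```python
from itertools import product


def fit_stage(feats, targets, frozen, steps=300, lr=0.1):
    """Fit linear weights on feats so that frozen + prediction matches targets."""
    w = [0.0] * len(feats[0])
    for _ in range(steps):  # fixed step count stands in for "until convergence"
        grad = [0.0] * len(w)
        for x, y, c in zip(feats, targets, frozen):
            resid = y - (c + sum(wi * xi for wi, xi in zip(w, x)))
            for k, xk in enumerate(x):
                grad[k] += -2.0 * resid * xk / len(feats)
        w = [wi - lr * g for wi, g in zip(w, grad)]
    return w


def predict(w, feats):
    return [sum(wi * xi for wi, xi in zip(w, x)) for x in feats]


# Toy data (an assumption): y = x_t + x_v + x_t * x_v on all sign combinations.
pairs = list(product((-1.0, 1.0), repeat=2))
y = [x_t + x_v + x_t * x_v for x_t, x_v in pairs]
uni_feats = [[x_t, x_v] for x_t, x_v in pairs]
bi_feats = [[x_t, x_v, x_t * x_v] for x_t, x_v in pairs]

# Stage 1: optimize the unimodal component alone, then freeze it.
w_uni = fit_stage(uni_feats, y, frozen=[0.0] * len(y))
y_uni = predict(w_uni, uni_feats)

# Stage 2: optimize the bimodal component on top of the frozen unimodal predictions.
w_bi = fit_stage(bi_feats, y, frozen=y_uni)
print(w_uni, w_bi)
```

Because the unimodal component is already fitted and frozen when the second stage starts, the bimodal weights on the purely unimodal features have nothing left to explain and stay close to zero.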

sMRO has similarities with gradient boosting (GB) (Friedman, 2001) when GB has, in the trimodal case, three learners that correspond to the prediction components ${\hat{y}}_{uni}$ , ${\hat{y}}_{bi}$ , and ${\hat{y}}_{tri}$ . Unlike sMRO, GB is not suitable for some loss functions, such as the mean absolute error (MAE; its gradient is not proportional to the residual), as each learner in GB estimates the gradient of the errors from the previous learners. In the case of MAE, learners will predict −1 or 1, which leads to a poor fit with only three learners.
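The remark about the mean absolute error can be made concrete by comparing the targets that a gradient-boosting learner would be asked to fit. The function names and the toy values below are illustrative assumptions; the squared error is included only as a contrast.

```python
def neg_grad_squared(y, y_hat):
    """Negative gradient of (y - y_hat)**2 w.r.t. y_hat: proportional to the residual."""
    return 2.0 * (y - y_hat)


def neg_grad_mae(y, y_hat):
    """Negative gradient of |y - y_hat| w.r.t. y_hat: only the sign of the residual."""
    r = y - y_hat
    return float((r > 0) - (r < 0))


targets = [0.1, 2.5, -4.0]
prev = [0.0, 0.0, 0.0]  # predictions of the previous learners
mse_targets = [neg_grad_squared(y, p) for y, p in zip(targets, prev)]
mae_targets = [neg_grad_mae(y, p) for y, p in zip(targets, prev)]
print(mse_targets)  # scales with the size of each residual
print(mae_targets)  # only -1.0 or 1.0, regardless of the residual's size
```

With the absolute error, a small and a large residual produce the same target, so a handful of learners cannot close large gaps. sMRO avoids this because each component is trained on the loss itself instead of on its gradient.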

5. Experimental Methodology

We evaluate whether we can train a model that separates unimodal, bimodal, and trimodal interactions while not degrading predictive performance.

Datasets:

We focus on five sentiment- and emotion-annotated datasets for which prior work used multimodal models, see Table 1. We also include a sixth Instagram dataset (Kruk et al., 2019) as it has modality interaction annotations (semiotic modes), which we can use to evaluate MRO.

Table 1:

Dataset overview.

Original Paper	Tasks	Abbreviation	Samples	Modalities
(Zadeh et al., 2016)	Sentiment (regression)	MOSI	2.2k	3
(Zadeh et al., 2018)	Sentiment, Polarity, Happiness (regression)	MOSEI	22.9k	3
(Busso et al., 2008)	Arousal and Valence (regression)	IEMOCAP	4.8k	3
(Valstar et al., 2016)	Arousal and Valence (regression)	SEWA	1.9k	3
(Nelson et al., 2021)	Affect categories (4-way classification)	TPOT	17.3k	3
(Kruk et al., 2019)	Intent of Instagram posts (7-way classification)	Instagram	1.3k	2


We use the same features across all sentiment and emotion datasets: RoBERTa (Liu et al., 2020) as a representation of transcribed utterances; OpenFace 2.0 (Baltrusaitis et al., 2018) to summarize face-related features, and openSMILE’s eGeMAPS (Eyben et al., 2015) to summarize acoustic features. For the Instagram dataset, we use the author-provided ResNet features (He et al., 2016) to summarize the image content and use RoBERTa to represent captions.

Evaluation:

We want the prediction components ${\hat{y}}_{uni}$ , ${\hat{y}}_{bi}$ and ${\hat{y}}_{tri}$ to correspond to $U C (\hat{y})$ , $B I (\hat{y})$ , and $T I (\hat{y})$ so that the prediction components represent only unimodal, only bimodal, and only trimodal interactions. To test this, we use $|U C ({\hat{y}}_{bi} + {\hat{y}}_{tri})|$ to evaluate whether the bimodal and trimodal predictions contain unimodal contributions and $|B I ({\hat{y}}_{tri})|$ to evaluate whether the trimodal prediction contains bimodal interactions. Given the MRO architecture, ${\hat{y}}_{uni}$ cannot include bimodal and trimodal interactions and ${\hat{y}}_{bi}$ cannot include trimodal interactions. This means that if $|U C ({\hat{y}}_{bi} + {\hat{y}}_{tri})| + |B I ({\hat{y}}_{tri})|$ is 0, the model perfectly separates unimodal, bimodal, and trimodal interactions, i.e., ${\hat{y}}_{uni} = U C (\hat{y})$ , ${\hat{y}}_{bi} = B I (\hat{y})$ , and ${\hat{y}}_{tri} = T I (\hat{y})$ . We use a 5-fold test setup for all datasets.

Models:

We compare the MRO architecture when optimized in different manners: with $L (y, {\hat{y}}_{uni} + {\hat{y}}_{bi} + {\hat{y}}_{tri})$ (referred to as Joint), sMRO, and MRO. For performance comparison, we include the routing model (Tsai et al., 2020) (referred to as Routing), a recently proposed model with the goal of modality interpretability. Lastly, we compare the performance against a single trimodal model $\hat{y} = f_{θ_{T V A}} (x_{T}, x_{V}, x_{A})$ (referred to as Tri) to evaluate whether the larger MRO architecture has too many parameters for smaller datasets.

Implementation Details:

The functions $f$ of Equation 4 are instantiated as multi-layer perceptrons. For each multimodal model, e.g., $f_{θ_{T V}}$ , we implement two popular types of fusion: early fusion (concatenating the modalities) and tensor fusion (Zadeh et al., 2017) (outer product between modalities after learning unimodal embeddings). The type of fusion is a hyper-parameter together with the number of layers, their width, learning rate, learning rate decay, L2 weight decay, dropout, and with/without prior feature selection. As a loss function, we use the mean absolute error for regression tasks and the cross-entropy loss for classification tasks.
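The two fusion types differ in how the unimodal embeddings are combined before the remaining layers. The sketch below uses plain Python lists as stand-ins for learned embeddings; the function names and values are assumptions, and details of the original tensor fusion network beyond the outer product described above are omitted.

```python
def early_fusion(z_t, z_v):
    """Concatenate the modality embeddings into one vector."""
    return list(z_t) + list(z_v)


def tensor_fusion(z_t, z_v):
    """Flattened outer product: one entry per pair of embedding dimensions."""
    return [a * b for a in z_t for b in z_v]


z_t = [1.0, 2.0]       # hypothetical text embedding
z_v = [3.0, 4.0, 5.0]  # hypothetical vision embedding
print(early_fusion(z_t, z_v))   # length grows additively: 2 + 3
print(tensor_fusion(z_t, z_v))  # length grows multiplicatively: 2 * 3
```

The outer product exposes every pairwise product of embedding dimensions to the following layers, at the cost of an input size that grows multiplicatively with the embedding widths.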

6. Multimodal Perception Study

We conduct a multimodal perception study to evaluate whether MRO learns non-additive interactions when humans also require non-additive interactions. We choose arousal and valence on the IEMOCAP dataset for this study as arousal and valence are two fundamental dimensions to describe emotional states (Munezero et al., 2014).

Study Design:

Crowd workers³ are asked to rate arousal and valence of video segments when being exposed to only a subset of modalities. The four subsets are: 1) the transcript of what the person says (T); 2) the muted video (V); 3) the low-pass filtered audio (A), and 4) the transcripts, the video, and the original audio ( ${TVA}_{O}$ ). IEMOCAP has ten speakers. We randomly select ten segments for each speaker, i.e. 100 segments.

Audio Processing:

It is challenging to disentangle speech content and how we speak (Bhargava and Başkent, 2012). Similar to previous work, we low-pass filter the audio signal (Yang et al., 2012). Instead of using 850 Hz as a cut-off (Yang et al., 2012), we use a lower cut-off frequency, as we could understand spoken words at 850 Hz. We choose 660 Hz⁴ as it is the mean of the maximum pitch in an empirical study (Li and Yiu, 2006) and it also closely coincides with the maximum pitch of contralto singers (E5 at 659.25 Hz). We choose this pitch-focused definition as we believe that prosodic information will predict arousal and valence.
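The E5 frequency quoted above can be checked with a short calculation. The sketch assumes 12-tone equal temperament with A4 tuned to 440 Hz, neither of which is stated in the text; E5 lies seven semitones above A4.

```python
A4_HZ = 440.0           # assumed concert pitch
SEMITONES_A4_TO_E5 = 7  # A4 -> A#4 -> B4 -> C5 -> C#5 -> D5 -> D#5 -> E5

# Each semitone multiplies the frequency by 2 ** (1 / 12).
e5_hz = A4_HZ * 2 ** (SEMITONES_A4_TO_E5 / 12)
print(f"E5 is approximately {e5_hz:.3f} Hz")
```

Under these assumptions the result agrees with the quoted value up to rounding and sits just below the chosen 660 Hz cut-off.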

Avoiding learning effects:

Raters might be able to infer the missing multimodal context after having rated some of the unimodal subsets for a specific segment. We therefore use two mechanisms to address learning effects across the modalities. First, each of the raters annotates only 20 randomly selected segments for each modality subset (we have eight raters per segment and modality subset). Second, we structurally randomize the order of the modality subsets by first presenting all unimodal subsets in a random order and in the end the trimodal segments.

Ratings and reliability:

Following the annotation setup from IEMOCAP, we use the ordinal arousal and valence manikins scale consisting of five levels (Bradley and Lang, 1994) to rate the two emotional dimensions. The effective reliability (Rosenthal, 2005) over $k$ raters as measured by the Intra-class Correlation Coefficient ICC(2, k-1) is excellent (above 0.9) (Koo and Li, 2016) for all modality subsets. Further, our new trimodal ratings ( ${TVA}_{O}$ ) correlate highly with the existing annotations on IEMOCAP r(98) = 0.88, p < 0.001 for arousal and r(98) = 0.92, p < 0.001 for valence, indicating that we can use our new annotations to inspect models trained on the original annotations.

Evaluation:

To evaluate when humans require non-additive interactions, we train a linear regression model (an additive model) that predicts ${TVA}_{O}$ given T, V, and A. We refer to this model as ${\hat{y}}_{uni}^{human}$ . The model fit of ${\hat{y}}_{uni}^{human}$ shows how important the missing non-additive interactions are (Provost et al., 2015). Further, the absolute error $|{TVA}_{O} - {\hat{y}}_{uni}^{human}|$ measures how important the missing non-additive interactions are to humans for each segment. We use $|{TVA}_{O} - {\hat{y}}_{uni}^{human}|$ to answer the question: does MRO learn more non-additive interactions when $|{TVA}_{O} - {\hat{y}}_{uni}^{human}|$ is larger, i.e., when humans require non-additive interactions?

7. Results and Discussion

Sanity Check:

Before evaluating MRO on more complex datasets, we conduct a sanity check on two simpler datasets: $x_{T} + x_{V} + x_{A}$ which requires only unimodal contributions (we refer to it as Sanity Check Unimodal) and $x_{T} x_{V} + x_{T} x_{A} + x_{V} x_{A}$ which requires only bimodal interactions (we refer to it as Sanity Check Bimodal). Figure 3 shows that the joint and the routing model do not separate unimodal, bimodal, and trimodal interactions well as $|U C ({\hat{y}}_{bi} + {\hat{y}}_{tri})| + |B I ({\hat{y}}_{tri})|$ is high. As expected, sMRO and MRO separate the interactions almost perfectly as $|U C ({\hat{y}}_{bi} + {\hat{y}}_{tri})| + |B I ({\hat{y}}_{tri})|$ is very close to 0.
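The two sanity-check targets are simple enough to generate synthetically. The text does not say how the features were sampled, so the standard-normal scalar features, the sample count, the seed, and the function name `make_sanity_dataset` below are assumptions.

```python
import random


def make_sanity_dataset(kind, n=1000, seed=0):
    """Return rows (x_t, x_v, x_a, y) for the unimodal or bimodal sanity check."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x_t, x_v, x_a = (rng.gauss(0.0, 1.0) for _ in range(3))
        if kind == "unimodal":
            y = x_t + x_v + x_a                    # only unimodal contributions
        elif kind == "bimodal":
            y = x_t * x_v + x_t * x_a + x_v * x_a  # only bimodal interactions
        else:
            raise ValueError(f"unknown kind: {kind}")
        rows.append((x_t, x_v, x_a, y))
    return rows


unimodal_rows = make_sanity_dataset("unimodal")
bimodal_rows = make_sanity_dataset("bimodal")
print(unimodal_rows[0])
print(bimodal_rows[0])
```

Because the generating function is known, a model that separates interactions well should place the whole signal in ${\hat{y}}_{uni}$ for the first dataset and in ${\hat{y}}_{bi}$ for the second.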

Figure 3: Average $|U C ({\hat{y}}_{bi} + {\hat{y}}_{tri})| + |B I ({\hat{y}}_{tri})|$ for all models and datasets. Lower values indicate a better separation of unimodal, bimodal, and trimodal contributions.

To test how many epochs are needed to minimize $|U C ({\hat{y}}_{bi} + {\hat{y}}_{tri})| + |B I ({\hat{y}}_{tri})|$ , we evaluate it after each epoch. The results in Figure 4 show that the separation during the first epochs becomes worse as ${\hat{y}}_{uni}$ has not yet learned much, meaning ${\hat{y}}_{bi}$ and ${\hat{y}}_{tri}$ try to predict unimodal contributions which increases $|U C ({\hat{y}}_{bi} + {\hat{y}}_{tri})|$ . However, after a few epochs the separation becomes better and $|U C ({\hat{y}}_{bi} + {\hat{y}}_{tri})| + |B I ({\hat{y}}_{tri})|$ reaches 0. The same can be observed for the bimodal sanity check in Figure 4.

MRO significantly reduces $|U C ({\hat{y}}_{bi} + {\hat{y}}_{tri})| + |B I ({\hat{y}}_{tri})|$ .

Similar to the sanity check on the simpler datasets, we want $|U C ({\hat{y}}_{bi} + {\hat{y}}_{tri})| + |B I ({\hat{y}}_{tri})|$ to be as small as possible. For easier comparison across datasets, we normalize $|U C ({\hat{y}}_{bi} + {\hat{y}}_{tri})| + |B I ({\hat{y}}_{tri})|$ by the standard deviation of the ground truth from the training set. Figure 3 shows that sMRO and MRO significantly reduce $|U C ({\hat{y}}_{bi} + {\hat{y}}_{tri})| + |B I ({\hat{y}}_{tri})|$ compared to models optimized with $L (y, {\hat{y}}_{uni} + {\hat{y}}_{bi} + {\hat{y}}_{tri})$ (Joint) and the routing model.

As it is computationally very expensive to evaluate $|U C ({\hat{y}}_{bi} + {\hat{y}}_{tri})| + |B I ({\hat{y}}_{tri})|$ after each epoch, we plot it only for arousal and valence on IEMOCAP in Figure 4. We focus on IEMOCAP as we also conduct the perception study on it, see section 6. While the plot for arousal in Figure 4 is a bit noisy, MRO quickly reduces $|U C ({\hat{y}}_{bi} + {\hat{y}}_{tri})| + |B I ({\hat{y}}_{tri})|$ . The same can be observed for valence in Figure 4.

MRO does not degrade performance.

The secondary goal of MRO is not degrading performance. Table 3 lists the models’ performance. Models optimized with MRO are in no case significantly worse than any other model. However, they are statistically significantly better than the joint model for valence on SEWA and happiness on MOSEI.

Table 3:

Average performance over the test folds. Higher is better.

	Tri	Routing	Joint	sMRO	MRO
MOSI (Pearson’s $r$ )
Sentiment	0.662	0.658	0.657	0.656	0.661
MOSEI (Pearson’s $r$ )
Sentiment	0.723	0.727	0.727	0.726	0.727
Polarity	0.599	0.597	0.606	0.593	0.605
Happiness	0.637	0.642	0.637	0.630	0.641
IEMOCAP (Concordance Correlation Coefficient)
Arousal	0.588	0.613	0.622	0.624	0.611
Valence	0.647	0.655	0.624	0.603	0.634
SEWA (Concordance Correlation Coefficient)
Arousal	0.317	0.263	0.293	0.292	0.304
Valence	0.268	0.335	0.268	0.310	0.337
TPOT (Accuracy)
Constructs	0.565	0.554	0.566	0.566	0.574
Instagram (macro ROC AUC)
Intent	0.876	0.731	0.891	0.888	0.891

Mean	0.588	0.595	0.589	0.589	0.599


MRO might generalize slightly better because, similar to structural risk minimization (Vapnik, 1999), it prioritizes simpler models and relies on more complex multimodal models only when needed. Another reason is that MRO has similar effects to having auxiliary unimodal loss functions, which seem beneficial for multimodal models (Wang et al., 2020; Zeng et al., 2021).

Ablating ${\hat{y}}_{bi} + {\hat{y}}_{tri}$ decreases performance.

We quantify the average performance impact of post-hoc removing ${\hat{y}}_{bi} + {\hat{y}}_{tri}$ across datasets, i.e., $\hat{y} = {\hat{y}}_{uni}$ . When comparing Table 4 with Table 3, we observe that removing ${\hat{y}}_{bi} + {\hat{y}}_{tri}$ (the non-additive predictions) hurts performance. While additive contributions are very important, non-additive interactions are needed for best performance.

Table 4:

Average performance when post-hoc removing ${\hat{y}}_{bi} + {\hat{y}}_{tri}$ , i.e., $\hat{y} = {\hat{y}}_{uni}$ .

	sMRO	MRO
Mean	0.577	0.587


MRO learns more non-additive interactions when two modalities are informative.

The TPOT dataset has human judgments for how important modalities are to confirm the current affective state (Wörtwein et al., 2021). Three importance levels were annotated: 1) a modality is sufficient to confirm the affective state (while ignoring other modalities), 2) a modality contains relevant information for the affective state (information from a second modality is needed), and 3) a modality contains no information for the current affective state.

We hypothesize that MRO uses more non-additive interactions $({\hat{y}}_{bi} + {\hat{y}}_{tri})$ for samples with at least two informative modalities (relevant or sufficient) compared to samples with only one informative modality. To measure whether ${\hat{y}}_{bi} + {\hat{y}}_{tri}$ are used more, we calculate how much the softmax probabilities (TPOT is a classification task) change when removing ${\hat{y}}_{bi} + {\hat{y}}_{tri}$ , i.e., $\sum_{k = 1}^{4} |softmax {(\hat{y})}^{(k)} - softmax {({\hat{y}}_{uni})}^{(k)}|$ where $k$ indexes the probability vector for the four classes. The means of samples with two informative modalities (0.299) and only one informative modality (0.264) are significantly different according to an independent t-test, $t (2671) = 5.059$ , p < 0.001. This suggests that MRO not only mathematically separates unimodal, bimodal, and trimodal interactions but that its separation also correlates with human assessments. Further, this observation provides evidence that models are more likely to learn non-additive interactions when several modalities are themselves informative.
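The probability-change measure can be sketched directly from its formula. The function names and the example logits are hypothetical; the inputs are the per-class outputs of the unimodal component and of the combined non-additive components for one sample.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def nonadditive_change(y_uni, y_nonadd):
    """Sum over classes of |softmax(y_uni + y_nonadd) - softmax(y_uni)|."""
    y_full = [u + n for u, n in zip(y_uni, y_nonadd)]
    p_full, p_uni = softmax(y_full), softmax(y_uni)
    return sum(abs(p - q) for p, q in zip(p_full, p_uni))


y_uni = [1.0, 2.0, 0.5, -1.0]     # hypothetical logits for the four classes
y_nonadd = [0.0, -1.5, 1.0, 0.0]  # hypothetical y_bi + y_tri for the same sample
print(nonadditive_change(y_uni, y_nonadd))
```

The measure is 0 when the non-additive components leave the class probabilities unchanged, including when they shift all logits by the same constant, and it is bounded above by 2.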

MRO learns more non-additive interactions when modalities amplify each other.

We included the Instagram dataset (Kruk et al., 2019) because it has modality interaction annotations (semiotic modes) that are inspired by Kloepfer (Kloepfer, 1976). To test whether ${\hat{y}}_{bi}$ (this dataset has only two modalities) contributes more depending on the semiotic mode (parallel, amplifying, and divergent), we conduct a one-way ANOVA on the probability changes when removing ${\hat{y}}_{bi}$ . The means between the semiotic modes are significantly different, F(2,1296) = 5.059, p = 0.006, with the highest absolute average change for amplifying (0.317), followed by parallel (0.272), and then divergent (0.256). The means of amplifying and parallel are significantly different, t(1297) = 2.432, p = 0.015, as are those of amplifying and divergent, t(1297) = 2.874, p = 0.004. Similar to the results on TPOT, this confirms that MRO learned significantly larger non-additive contributions $({\hat{y}}_{bi})$ for amplifying than for parallel. A possible explanation for why divergent seems to require the fewest non-additive interactions is that the definition of divergent requires only that the meanings of the modalities oppose each other; it does not specify how the combined meaning is formed. Even if the combined meaning of Figure 1 were neutral (additive), the semiotic mode would still be divergent.

MRO learns non-additive interaction when humans need non-additive interactions.

The additive model ${\hat{y}}_{uni}^{human}$ , which predicts the multimodal ratings ${TVA}_{O}$ from the unimodal ratings, fits very well (r² = 0.85 for arousal and r² = 0.85 for valence), which is in line with similar prior work (Provost et al., 2015). Even though our multimodal model is not on par with ${\hat{y}}_{uni}^{human}$ (r² = 0.68 for arousal and r² = 0.66 for valence), we observe a significant correlation of r(98) = 0.202, p = 0.043 for valence between $|{TVA}_{O} - {\hat{y}}_{uni}^{human}|$ (the missing non-additive interactions) and $|{\hat{y}}_{bi} + {\hat{y}}_{tri}|$ (non-additive contributions). This indicates that ${\hat{y}}_{bi} + {\hat{y}}_{tri}$ learned non-additive interactions that cannot be explained by ${\hat{y}}_{uni}^{human}$ . For arousal, we do not observe a significant correlation, potentially because the optimization seems far noisier for arousal than for valence, see Figure 4.
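The comparison above can be sketched as a Pearson correlation between two per-segment magnitudes. The function names and toy values below are hypothetical; the sketch assumes the human ratings, the additive human predictions, and the model's non-additive outputs are already aligned per segment.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def nonadditive_alignment(tva_o, y_uni_human, y_bi, y_tri):
    """Correlate |TVA_O - y_uni_human| with |y_bi + y_tri| across segments."""
    human_gap = [abs(r - u) for r, u in zip(tva_o, y_uni_human)]
    model_nonadditive = [abs(b + t) for b, t in zip(y_bi, y_tri)]
    return pearson_r(human_gap, model_nonadditive)


# Toy values: the model's non-additive magnitude is exactly half the human gap.
tva_o = [1.0, 0.0, -1.0, 2.0]
y_uni_human = [0.5, 0.0, 0.5, 0.0]
y_bi = [0.25, 0.0, -0.75, 1.0]
y_tri = [0.0, 0.0, 0.0, 0.0]
r = nonadditive_alignment(tva_o, y_uni_human, y_bi, y_tri)
```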

8. Conclusion

We proposed MRO to explicitly learn and separate unimodal, bimodal, and trimodal interactions in a multimodal model. This separation is essential for quantifying how much a model uses multimodal interactions and is a step towards more interpretable models. Based on prior work (Hessel and Lee, 2020), we proposed new evaluation metrics to quantify whether a trimodal model uses unimodal, bimodal, and trimodal interactions. Empirically, we observed that MRO successfully separated unimodal, bimodal, and trimodal interactions while not degrading predictive performance. Beyond the empirical evaluation, MRO learns non-additive interactions in accordance with human judgments on three datasets.

Limitations

We evaluated MRO in the context of language, vision, and acoustic modalities. Future work could explore MRO’s performance on different modalities. Exploring MRO beyond three modalities will also be interesting. To address a potentially growing number of parameters for models with more than three modalities, sharing modality representations could be explored: the bimodal models could be given access to intermediate representations from the unimodal models for the modalities they have in common. Sharing representations could reduce the overall model size.

While we evaluated MRO on many sentiment and emotion annotated datasets, these datasets are primarily in English, and one is in German (SEWA). More research is needed to work with a more diverse set of languages.

It will also be interesting to study MRO in tasks that require multimodal fusion and translation: generating a modality given a set of different modalities.

Table 2:

Basic demographic information about the annotators.

	Arousal	Valence
Min. age	19	21
Mean age	36	37
Max. age	79	62

Female	20	19
Male	20	21

Acknowledgements

This material is based upon work partially supported by the National Science Foundation (Awards #1722822 and #1750439), and National Institutes of Health (Awards #R01MH125740, #R01MH096951, #U01MH116925, and #U01MH116923). Any opinions, findings, conclusions, or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation or National Institutes of Health, and no official endorsement should be inferred.

A. Experimental Details

A.1. IEMOCAP

IEMOCAP has multiple recording conditions of two speakers interacting: acted interactions, improvised interactions, and spontaneous interactions (Busso et al., 2008). In this paper, we use the improvised interactions as they cover a diverse range of emotional expressions and are not tied to a set of fixed utterances, as is the case for the acted interactions.

A.2. MOSEI

Polarity is an established dimension in ethics research (Tyagi et al., 2020) and is typically defined as the absolute value of the sentiment intensity (Hutto and Gilbert, 2014). We use this definition and apply it to MOSEI’s sentiment ratings.
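As a minimal illustration of this definition, polarity is the absolute value of each sentiment rating. The function name and the example ratings are hypothetical.

```python
def polarity(sentiment_ratings):
    """Polarity as the absolute value of the sentiment intensity."""
    return [abs(s) for s in sentiment_ratings]


# Hypothetical sentiment ratings (illustrative values only).
ratings = [-3.0, -0.5, 0.0, 2.0]
polarities = polarity(ratings)
```

Strongly negative and strongly positive segments therefore receive the same high polarity, while neutral segments receive a polarity near zero.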

A.3. SEWA

Instead of using SEWA’s time continuous ratings of valence and arousal, we take the average of the ratings for each utterance to make SEWA similar to the other datasets.
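A minimal sketch of this aggregation, assuming the time-continuous ratings are available as (utterance id, rating) pairs; the data layout and names are hypothetical and not SEWA's actual file format.

```python
from collections import defaultdict


def utterance_level_ratings(frame_ratings):
    """Average time-continuous ratings within each utterance.

    frame_ratings: iterable of (utterance_id, rating) pairs.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for utterance_id, rating in frame_ratings:
        sums[utterance_id] += rating
        counts[utterance_id] += 1
    return {utt: sums[utt] / counts[utt] for utt in sums}


# Hypothetical valence ratings for two utterances (illustrative values only).
frames = [
    ("utt_1", 0.2),
    ("utt_1", 0.4),
    ("utt_2", -0.5),
    ("utt_2", -0.1),
    ("utt_2", 0.0),
]
means = utterance_level_ratings(frames)
```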

A.4. Features

We use the openSMILE configuration eGeMAPS v0.1b (Eyben et al., 2015), which extracts instantaneous low-level descriptors and summaries over a moving window. For the low-level descriptors, we calculate the median and interquartile range for each segment. For the summary features, we take the median over each segment.

OpenFace 2.0 extracts many face-related features. We summarize OpenFace’s facial action units (Ekman, 1982), head pose, and eye gaze features with the mean and standard deviation.
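The segment-level summaries described in this section can be sketched with Python's standard library. The quartile convention (linear interpolation, `method="inclusive"`) and the population standard deviation are assumptions, as the text does not specify them; the function names and values are hypothetical.

```python
import statistics


def summarize_median_iqr(values):
    """Median and interquartile range of one low-level descriptor over a segment."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return statistics.median(values), q3 - q1


def summarize_mean_std(values):
    """Mean and (population) standard deviation of one visual feature over a segment."""
    return statistics.fmean(values), statistics.pstdev(values)


# Hypothetical frame-level values of a single feature within one segment.
segment = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
median, iqr = summarize_median_iqr(segment)
mean, std = summarize_mean_std(segment)
```

Each frame-level feature thus contributes two segment-level values, which keeps the input size independent of the segment duration.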

B. $U C$ and $B I$ for Trimodal Models

Any trimodal function $f$ can be expressed as $f_{T} + f_{V} + f_{A} + f_{T V} + f_{T A} + f_{V A} + f_{T V A}$ such that $U C (f) = f_{T} + f_{V} + f_{A}$ and $B I (f) = f_{T V} + f_{T A} + f_{V A}$ where the bimodal functions do not contain unimodal contributions: $\forall x_{T} : E_{v} [f_{T V} (x_{T}, v)] = 0$ and similar for $x_{V}$ and $x_{A}$ . Further, the trimodal function should not contain unimodal and bimodal interactions: $\forall x_{T}, x_{V} : E_{a} [f_{T V A} (x_{T}, x_{V}, a)] = 0$ and similar for the pairs $(x_{T}, x_{A})$ and $(x_{V}, x_{A})$ .

Proof. As any trimodal function can be expressed as the above function, we show that the definition of $U C$ returns exactly the unimodal contributions $f_{T} + f_{V} + f_{A}$ .

\begin{array}{l} U C (f) \\ = \underset{v, a}{E} f (x_{T}, v, a) + \underset{t, a}{E} f (t, x_{V}, a) \\ + \underset{t, v}{E} f (t, v, x_{A}) - 2 \underset{t, v, a}{E} f (t, v, a) \end{array}

(10)

\begin{array}{l} = (f_{T} (x_{T}) + \underset{v}{E} f_{V} (v) + \underset{a}{E} f_{A} (a) \\ + \underset{v}{E} f_{T V} (x_{T}, v) + \underset{a}{E} f_{T A} (x_{T}, a) \\ + \underset{v, a}{E} f_{V A} (v, a) + \underset{v, a}{E} f_{T V A} (x_{T}, v, a)) \\ + (\underset{t}{E} f_{T} (t) + f_{V} (x_{V}) + \underset{a}{E} f_{A} (a) \\ + \underset{t}{E} f_{T V} (t, x_{V}) + \underset{t, a}{E} f_{T A} (t, a) \\ + \underset{a}{E} f_{V A} (x_{V}, a) + \underset{t, a}{E} f_{T V A} (t, x_{V}, a)) \\ + (\underset{t}{E} f_{T} (t) + \underset{v}{E} f_{V} (v) + f_{A} (x_{A}) \\ + \underset{t, v}{E} f_{T V} (t, v) + \underset{t}{E} f_{T A} (t, x_{A}) \\ + \underset{v}{E} f_{V A} (v, x_{A}) + \underset{t, v}{E} f_{T V A} (t, v, x_{A})) \\ - 2 (\underset{t}{E} f_{T} (t) + \underset{v}{E} f_{V} (v) + \underset{a}{E} f_{A} (a) \\ + \underset{t, v}{E} f_{T V} (t, v) + \underset{t, a}{E} f_{T A} (t, a) \\ + \underset{v, a}{E} f_{V A} (v, a) + \underset{t, v, a}{E} f_{T V A} (t, v, a)) \end{array}

(11)

\begin{array}{l} = (f_{T} (x_{T}) + \underset{v}{E} f_{V} (v) + \underset{a}{E} f_{A} (a)) \\ + (\underset{t}{E} f_{T} (t) + f_{V} (x_{V}) + \underset{a}{E} f_{A} (a)) \\ + (\underset{t}{E} f_{T} (t) + \underset{v}{E} f_{V} (v) + f_{A} (x_{A})) \end{array}

(12)

- 2 (\underset{t}{E} f_{T} (t) + \underset{v}{E} f_{V} (v) + \underset{a}{E} f_{A} (a))

(13)

= f_{T} (x_{T}) + f_{V} (x_{V}) + f_{A} (x_{A})

(14)

□

Compared to $B I$ in the bimodal case, we need to also remove trimodal interactions for $B I$ in the trimodal context.

Claim 3.

$B I$ is defined for three modalities as

\begin{array}{l} B I (f) \\ = \underset{t}{E} [f (t, x_{V}, x_{A}) - U C (f, t, x_{V}, x_{A})] \\ + \underset{v}{E} [f (x_{T}, v, x_{A}) - U C (f, x_{T}, v, x_{A})] \\ + \underset{a}{E} [f (x_{T}, x_{V}, a) - U C (f, x_{T}, x_{V}, a)] \end{array}

(15)

\begin{array}{l} = \underset{t}{E} f (t, x_{V}, x_{A}) + \underset{v}{E} f (x_{T}, v, x_{A}) \\ + \underset{a}{E} f (x_{T}, x_{V}, a) - 2 \underset{t, v}{E} f (t, v, x_{A}) \\ - 2 \underset{t, a}{E} f (t, x_{V}, a) - 2 \underset{v, a}{E} f (x_{T}, v, a) \\ + 3 \underset{t, v, a}{E} f (t, v, a) \end{array}

(16)

= f_{T V} (x_{T}, x_{V}) + f_{T A} (x_{T}, x_{A}) + f_{V A} (x_{V}, x_{A})

(17)

The omitted steps apply the definition of $U C$ and cancel terms to arrive at Equation 16. From there on, we write $f$ as $f_{T} + f_{V} + f_{A} + f_{T V} + f_{T A} + f_{V A} + f_{T V A}$ and use the properties of its components (the expected values of the bimodal and trimodal functions are 0).
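The two identities in this appendix can be checked numerically on a toy function. The sketch below is an illustration under stated assumptions, not the paper's implementation: the three modalities take a few discrete values with a uniform distribution, and the component functions are hypothetical choices constructed so that the bimodal and trimodal parts have zero conditional expectations. `uc` follows Equation 10 and `bi` follows Equation 16.

```python
from itertools import product

# Hypothetical discrete value sets for the three modalities (uniformly distributed).
T_VALUES = [0, 1, 2]
V_VALUES = [0, 1]
A_VALUES = [0, 1, 2, 3]


def centered(value, values):
    """Subtract the mean so that products of centered terms have zero expectation."""
    return value - sum(values) / len(values)


def f_uni(t, v, a):
    return (2 * t + 1) + (-3 * v) + a * a


def f_bi(t, v, a):
    ct, cv, ca = centered(t, T_VALUES), centered(v, V_VALUES), centered(a, A_VALUES)
    return ct * cv + 2 * ct * ca - cv * ca


def f_tri(t, v, a):
    return centered(t, T_VALUES) * centered(v, V_VALUES) * centered(a, A_VALUES)


def f(t, v, a):
    return f_uni(t, v, a) + f_bi(t, v, a) + f_tri(t, v, a)


def expect(fn, t=None, v=None, a=None):
    """Average fn over every modality that is left as None."""
    ts = T_VALUES if t is None else [t]
    vs = V_VALUES if v is None else [v]
    a_s = A_VALUES if a is None else [a]
    values = [fn(ti, vi, ai) for ti, vi, ai in product(ts, vs, a_s)]
    return sum(values) / len(values)


def uc(fn, t, v, a):
    """Unimodal contributions, Equation 10."""
    return expect(fn, t=t) + expect(fn, v=v) + expect(fn, a=a) - 2 * expect(fn)


def bi(fn, t, v, a):
    """Bimodal interactions for three modalities, Equation 16."""
    return (
        expect(fn, v=v, a=a) + expect(fn, t=t, a=a) + expect(fn, t=t, v=v)
        - 2 * expect(fn, a=a) - 2 * expect(fn, v=v) - 2 * expect(fn, t=t)
        + 3 * expect(fn)
    )


grid = list(product(T_VALUES, V_VALUES, A_VALUES))
max_uc_error = max(abs(uc(f, *x) - f_uni(*x)) for x in grid)
max_bi_error = max(abs(bi(f, *x) - f_bi(*x)) for x in grid)
```

Both errors are zero up to floating-point precision on this toy function, matching Equations 14 and 17. The nested averages also illustrate why these metrics become expensive when the expectations are taken over a full dataset.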

C. Study Details

In addition to the three unimodal and the trimodal combinations we explored bimodal combinations: 1) the muted video with the transcript (TV); 2) the muted video with the low-pass filtered audio (VA); 3) the transcript with the low-pass filtered audio (TA); 4) for comparison the original audio with the transcript ( ${TA}_{O}$ ).

C.1. Reliability

We report two types of reliability: the averaged pairwise reliability between two random raters (ICC(2,1)) and the effective reliability of the mean over k=8 raters (ICC(2, k-1)). Pairwise and effective reliability serve different purposes: pairwise reliability is needed to determine how many raters are required to achieve a targeted effective reliability (Rosenthal, 2005). Averaging over raters is important as emotional dimensions are subjective and difficult to annotate (especially when modalities are missing). The effective reliability describes how reliable the mean over the raters is, i.e., if we were to draw a new set of ratings and average them, how similar this new mean would be to our current mean.

Table 5:

Pairwise and effective reliability across the eight combinations. ICC is calculated with the R package psych.

	Avg. ICC(2, 1)		ICC(2, k-1)
Combination	Arousal	Valence	Arousal	Valence
T	0.36	0.55	0.96	0.98
V	0.52	0.64	0.98	0.99
A	0.57	0.38	0.98	0.96

TV	0.48	0.62	0.97	0.98
VA	0.56	0.61	0.98	0.98
TA	0.60	0.54	0.98	0.98
${TA}_{O}$	0.55	0.62	0.98	0.98

${TVA}_{O}$	0.56	0.64	0.98	0.99

Except for transcripts-only (T) on arousal and acoustic-only (A) on valence, all pairwise reliabilities are moderate (between 0.5 and 0.75) (Koo and Li, 2016), see Table 5. The effective reliability (Rosenthal, 2005) of the mean over k raters as measured by ICC(2, k-1) is excellent (above 0.9) for all combinations. Instead of directly taking the mean over the raters, we apply, as is common in affective computing, a z-normalization for each rater (Valstar et al., 2016; Busso et al., 2008) and take a weighted mean (Grimm and Kroschel, 2005) over the raters.
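The rater aggregation can be sketched as follows. This is a simplified illustration: it assumes every rater rated every segment, uses the population standard deviation for the z-normalization, and takes the rater weights as an input rather than reproducing the weighting scheme of Grimm and Kroschel (2005). All names and values are hypothetical.

```python
import statistics


def z_normalize(ratings):
    """Standardize one rater's ratings to zero mean and unit variance."""
    mean = statistics.fmean(ratings)
    std = statistics.pstdev(ratings)
    return [(r - mean) / std for r in ratings]


def weighted_mean_rating(ratings_by_rater, weights):
    """Weighted mean over raters of the z-normalized ratings.

    ratings_by_rater[k][n] is the rating of segment n by rater k;
    weights[k] is a caller-supplied, non-negative weight for rater k.
    """
    z_scores = [z_normalize(ratings) for ratings in ratings_by_rater]
    total_weight = sum(weights)
    n_segments = len(z_scores[0])
    return [
        sum(w * z[n] for w, z in zip(weights, z_scores)) / total_weight
        for n in range(n_segments)
    ]


# Two hypothetical raters who agree on the ordering but use different scales.
ratings_by_rater = [[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]]
consensus = weighted_mean_rating(ratings_by_rater, weights=[1.0, 3.0])
```

After z-normalization, both raters contribute identical scores in this example, so the consensus does not depend on the weights; differences in how raters use the scale are removed before averaging.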

C.2. Compensation

All raters are paid the same fixed amount, leading to an average hourly rate of 11.14 USD/h.

D. Additional Experiments

MRO reaches $|U C ({\hat{y}}_{bi} + {\hat{y}}_{tri})| + |B I ({\hat{y}}_{tri})| = 0$ when trained long enough.

As we have seen in Figure 4, a model might need to be optimized long enough to minimize $|U C ({\hat{y}}_{bi} + {\hat{y}}_{tri})| + |B I ({\hat{y}}_{tri})|$ . We therefore also investigate $|U C ({\hat{y}}_{bi} + {\hat{y}}_{tri})| + |B I ({\hat{y}}_{tri})|$ when models are trained without early stopping. While such models are more likely to have poor generalization performance, this allows us to test how much we could minimize $|U C ({\hat{y}}_{bi} + {\hat{y}}_{tri})| + |B I ({\hat{y}}_{tri})|$ with MRO and sMRO. As can be seen in Table 6, MRO and sMRO are numerically very close to 0.0, demonstrating that such optimized models almost perfectly separate unimodal, bimodal, and trimodal interactions.

Table 6:

Bootstrapped average of normalized $|U C ({\hat{y}}_{bi} + {\hat{y}}_{tri})| + |B I ({\hat{y}}_{tri})|$ on the test folds (1.0 corresponds to a magnitude of one standard deviation) when models are trained without early stopping. Lower is better (ideally 0.0).

	Joint	sMRO	MRO
SEWA
Arousal	[0.81, 0.87]	[0.00, 0.00]	[0.00, 0.01]
Valence	[0.29, 0.31]	[0.00, 0.00]	[0.02, 0.03]
IEMOCAP
Arousal	[0.41, 0.42]	[0.00, 0.00]	[0.03, 0.03]
Valence	[0.20, 0.21]	[0.00, 0.00]	[0.05, 0.06]
MOSI
Sentiment	[0.14, 0.15]	[0.00, 0.00]	[0.00, 0.00]
MOSEI
Sentiment	[0.41, 0.42]	[0.00, 0.00]	[0.00, 0.00]
Polarity	[0.44, 0.45]	[0.04, 0.04]	[0.00, 0.00]
Happiness	[0.44, 0.45]	[0.15, 0.15]	[0.01, 0.01]
TPOT
Constructs	[0.29, 0.29]	[0.01, 0.01]	[0.04, 0.04]
Instagram
Intent	[0.12, 0.13]	[0.00, 0.01]	[0.02, 0.02]

Removing stop-gradient leads to a worse separation:

Theoretically, we should be able to remove stop-gradient $(s g)$ from Equation 9, as the objectives with and without it have the same global minima. In practice, we observe that doing so leads to worse separation of interactions, see Table 7.

MRO applies to transformers as well:

When choosing transformers (Vaswani et al., 2017) as a base model instead of multilayer perceptrons, we observe the same trend that a model trained without MRO does not separate the multimodal interactions, whereas when trained with MRO the interactions are far better separated, see Table 8.

Interactions needed for amplifiers, ambiguity, and rare behaviors.

Table 9 summarizes the five segments with the largest absolute errors $|{TVA}_{O} - {\hat{y}}_{uni}^{human}|$ separately for arousal and valence (we refer to these segments with A₁ to A₅ and V₁ to V₅). Qualitatively, three groups emerge: amplifiers, ambiguities, and rare behaviors.

Unimodal amplifiers:

Amplifiers are essential for valence and sentiment as an intense expression can be very negative or positive. A modality might contain a strong amplifier (language in case of example V₃ and V₄) but the modality might not provide strong evidence for the directionality. In such cases, non-additive interactions are needed to combine the directionality from one modality with the amplifier from another modality.

Table 7:

Average performance and bootstrapped average of normalized $|U C ({\hat{y}}_{bi} + {\hat{y}}_{tri})| + |B I ({\hat{y}}_{tri})|$ on the test folds (1.0 corresponds to a magnitude of one standard deviation) when models are trained without early stopping.

	Performance	$|U C ({\hat{y}}_{bi} + {\hat{y}}_{tri})| + |B I ({\hat{y}}_{tri})|$
SEWA
Arousal	0.599	[0.05, 0.06]
Valence	0.638	[0.13, 0.14]
IEMOCAP
Arousal	0.316	[0.27, 0.29]
Valence	0.297	[0.30, 0.32]
MOSI
Sentiment	0.660	[0.03, 0.03]
MOSEI
Sentiment	0.724	[0.06, 0.06]
Polarity	0.605	[0.17, 0.19]
Happiness	0.644	[0.16, 0.16]
TPOT
Constructs	0.569	[0.13, 0.13]
Instagram
Intent	0.890	[0.04, 0.04]

Table 8:

Average of normalized $|U C ({\hat{y}}_{bi} + {\hat{y}}_{tri})| + |B I ({\hat{y}}_{tri})|$ on the test folds (1.0 corresponds to a magnitude of one standard deviation) when using transformers as a base model instead of multilayer perceptrons.

	Joint	MRO
MOSEI
Sentiment	[0.64, 0.67]	[0.01, 0.01]

Ambiguities:

When a modality might not provide information in either direction (language in case V₁, V₂, V₅, A₁, A₂), more contextual information in the form of bimodal interactions is needed.

Rare behaviors:

When a typically important modality is “missing” (language in case of A₂ and vision in case of A₃) or a typically less important modality contains an important behavior (acoustic in case of V₂) it changes the relative importance of the remaining modalities. Unlike the routing model, additive models have no mechanism to re-weight how important modalities are. When a modality is unexpectedly very (un)important, a bimodal or trimodal model becomes necessary.

E. Reproducibility

Computing Resources:

All models are implemented in PyTorch and were optimized on servers with consumer-level graphics cards.

Model Information:

The validation performance, the training time, and the number of parameters for the best models as chosen based on the validation performance, are listed in Table 10.

Hyperparameter Search:

All models and datasets use the same exhaustive hyperparameter search, see Table 11. In most cases, the grid search determined the same hyperparameters across the different optimization strategies (Joint, sMRO, and MRO); we therefore only highlight the best hyperparameters for MRO in Table 11. The performance metrics reported in Table 3 are also used to select the best validation model.

Datasplits:

MOSI and MOSEI have an established hold-out test set, which we use for testing. SEWA has a private test set, so we use the public development set for testing. IEMOCAP and Instagram have an established 5-fold test setup, which we use.

Table 9:

The five segments with the largest absolute errors when predicting ${TVA}_{O}$ with T, V, A.

	Transcript	Non-verbal behaviors	T	V	A	${TVA}_{O}$	${\hat{y}}_{uni}^{human}$
Arousal

A₁	I’m gonna forget him.	gaze aversion, almost whining	−0.57	0.04	0.32	1.16	0.05
A₂	What?	not attentive, quiet	−0.62	−0.83	−1.67	−2.1	−1.17
A₃	Do you know how long it’s gonna take me to start all over and fill out the new form?	little movement, loud	1.11	−0.36	0.1	0.98	0.1
A₄	That’s right. That’s right. I mean he would want us to, you know, celebrate the life that he- that he lived and, you know, enjoy the rest of ours as much as we can.	gaze aversion, quiet	−0.23	−1.11	−1.14	−0.17	−0.99
A₅	Well, I need-I need you to be able to do this for me ‘cause I can’t do anything about it.	gaze aversion, almost whining	0.47	−0.43	0.29	0.73	0.03

Valence

V₁	I mean, it’s just as hard for me, but-I know that we can do it, you know	eye-gaze aversion, fidgeting, quiet	0.25	−0.82	−0.08	−1.51	−0.41
V₂	I did exactly what they told me to do.	loud, determined	−0.02	0.30	−1.45	−1.06	−0.06
V₃	Oh, wow. You got in?	smile, loud/staccato-like voice	1.08	1.49	−0.86	2.04	1.08
V₄	Oh, my God. That’s so dramatic.	smile, slight laughter, pitch jumps	−0.70	1.70	−0.12	1.68	0.74
V₅	How can you lose my luggage like from like, in -	looking up, hand gestures	−1.39	0.88	−0.37	−0.89	−0.01

Table 10:

Validation performance (perf), training time in seconds (sec), and the number of parameters of the best validation model (params).

	Tri			Routing			Joint			sMRO			MRO
	perf	sec	params	perf	sec	params	perf	sec	params	perf	sec	params	perf	sec	params
MOSI
Sentiment	0.73	20.3	138263	0.741	68.2	556168	0.724	29.7	138263	0.735	50.8	111623	0.732	32.7	111623
MOSEI
Sentiment	0.71	97.6	141163	0.71	201.9	567768	0.713	189.9	503347	0.715	254.9	441187	0.71	171.4	114523
Polarity	0.611	99.1	441187	0.613	295.9	567768	0.617	181.8	441187	0.614	247.3	86647	0.617	290.9	441187
Happiness	0.636	87.4	441187	0.642	358.3	823208	0.638	252.6	441187	0.644	372.3	21317	0.642	380.0	503347
IEMOCAP
Arousal	0.64	41.6	111623	0.644	301.2	811608	0.643	60.3	84327	0.662	165.8	491747	0.65	69.6	4303
Valence	0.67	35.5	138263	0.66	118.8	811608	0.658	27.3	138263	0.642	64.8	22683	0.651	41.4	138263
SEWA
Arousal	0.367	174.7	22683	0.372	691.9	78518	0.388	266.3	41467	0.396	466.6	43077	0.384	484.1	43077
Valence	0.424	190.3	20283	0.427	527.3	200408	0.424	147.0	20283	0.425	461.5	11693	0.428	144.1	22683
TPOT
Constructs	0.569	105.2	80038	0.557	359.7	56036	0.568	143.4	180710	0.565	251.7	14298	0.576	243.5	14298
Instagram
Intent	0.888	9.0	12098	0.757	75.7	13427	0.9	24.7	12098	0.89	24.9	12098	0.896	18.0	12098

Table 11:

Parameter search. Parameters that were determined to be best for MRO and task x are indicated by x in the superscript. We enumerate tasks in the same order as they are presented in Table 3.

Learning rate	{0.05^{5}, 0.01^{1}, 0.005^{2,6,10}, 0.001^{7,8}, 0.0001^{3,4}, 0.00001^{9}}
L2 weight decay	{0.1, 0.01^{1,3,4,5,6,7,8,9,10}, 0.001^{2}, 0.0}
Hidden layers	{(5,), (10,)^{10}, (20, 10)^{5,8,9}, (10, 20)^{7}, (100, 20, 10)^{1,2,3}, (100, 100, 10)^{4,6}}
Selection	{yes^{5,10}, no^{1,2,3,4,6,7,8,9}}
Feature Fusion	{concatenation^{3,4,7,10}, tensor fusion^{1,2,5,6,8,9}}

Footnotes

Ethics Statement

Emotional states can provide insights into mental health, especially into mood disorders like depression. While emotion recognition systems can be part of medical pre-screening tools to facilitate care (DeVault et al., 2014), the same technology can be part of job interview tools (Naim et al., 2016), potentially leading to discrimination against people with mood disorders. More work on visualizing and interpreting model predictions can provide tools that highlight potential biases and help to better understand the internal decision process of multimodal models.

¹ Code available at https://github.com/twoertwein/MultimodalResidualOptimization.

² UC, BI, and TI can be computationally demanding given the expectation terms. While this is not as much of an issue when used as evaluation metrics, the computational cost prohibits us from using them as part of an iterative optimization process, e.g., in the loss function of neural networks.

³ We recruited 40 US-based crowd workers whose first language is English from the platform Prolific (https://www.prolific.co/).

⁴ We use ffmpeg for low-pass filtering with the following filter configuration: firequalizer=gain='if(lt(f,660), 0, -INF)':min_phase=1

Contributor Information

Torsten Wörtwein, Language Technologies Institute, Carnegie Mellon University.

Lisa B. Sheeber, Oregon Research Institute.

Nicholas Allen, Department of Psychology, University of Oregon.

Jeffrey F. Cohn, Department of Psychology, University of Pittsburgh.

Louis-Philippe Morency, Language Technologies Institute, Carnegie Mellon University.

References

Baltrusaitis Tadas, Zadeh Amir, Lim Yao Chong, and Morency Louis-Philippe. 2018. Openface 2.0: Facial behavior analysis toolkit. In 2018 13th IEEE international conference on automatic face & gesture recognition (FG 2018), pages 59–66. IEEE.
Bateman John A. 2014. Text and image: A critical introduction to the visual/verbal divide. Routledge.
Bhargava Pranesh and Başkent Deniz. 2012. Effects of low-pass filtering on intelligibility of periodically interrupted speech. The Journal of the Acoustical Society of America, 131(2):EL87–EL92.
Bradley Margaret M and Lang Peter J. 1994. Measuring emotion: the self-assessment manikin and the semantic differential. Journal of behavior therapy and experimental psychiatry, 25(1):49–59.
Busso Carlos, Bulut Murtaza, Lee Chi-Chun, Kazemzadeh Abe, Mower Emily, Kim Samuel, Chang Jeannette N, Lee Sungbok, and Narayanan Shrikanth S. 2008. Iemocap: Interactive emotional dyadic motion capture database. Language resources and evaluation, 42(4):335–359.
Cadene Remi, Dancette Corentin, Cord Matthieu, Parikh Devi, et al. 2019. Rubi: Reducing unimodal biases for visual question answering. Advances in neural information processing systems, 32.
De Gelder Beatrice and Bertelson Paul. 2003. Multisensory integration, perception and ecological validity. Trends in cognitive sciences, 7(10):460–467.
DeVault David, Artstein Ron, Benn Grace, Dey Teresa, Fast Ed, Gainer Alesia, Georgila Kallirroi, Gratch Jon, Hartholt Arno, Lhommet Margaux, et al. 2014. Simsensei kiosk: A virtual human interviewer for healthcare decision support. In Proceedings of the 2014 international conference on Autonomous agents and multi-agent systems, pages 1061–1068.
Du Mengnan, Liu Ninghao, and Hu Xia. 2019. Techniques for interpretable machine learning. Communications of the ACM, 63(1):68–77.
Ekman Paul. 1982. Methods for measuring facial action. Handbook of methods in nonverbal behavior research, pages 45–90.
Eyben Florian, Scherer Klaus R, Schuller Björn W, Sundberg Johan, André Elisabeth, Busso Carlos, Devillers Laurence Y, Epps Julien, Laukka Petri, Narayanan Shrikanth S, et al. 2015. The geneva minimalistic acoustic parameter set (gemaps) for voice research and affective computing. IEEE transactions on affective computing, 7(2):190–202.
Friedman Jerome H. 2001. Greedy function approximation: a gradient boosting machine. Annals of statistics, pages 1189–1232.
Goodman Bryce and Flaxman Seth. 2017. European union regulations on algorithmic decision-making and a “right to explanation”. AI magazine, 38(3):50–57.
Grimm Michael and Kroschel Kristian. 2005. Evaluation of natural emotions using self assessment manikins. In IEEE Workshop on Automatic Speech Recognition and Understanding, 2005., pages 381–385. IEEE.
He Kaiming, Zhang Xiangyu, Ren Shaoqing, and Sun Jian. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778.
Hessel Jack and Lee Lillian. 2020. Does my multimodal model learn cross-modal interactions? it’s harder to tell than you might think! In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing.
Hutto Clayton and Gilbert Eric. 2014. Vader: A parsimonious rule-based model for sentiment analysis of social media text. In Proceedings of the international AAAI conference on web and social media, volume 8, pages 216–225.
Kloepfer Rolf. 1976. Komplementarität von sprache und bild am beispiel von comic, karikatur und reklame. Sprache in Technischen Zeitalter Stuttgart.
Koo Terry K and Li Mae Y. 2016. A guideline of selecting and reporting intraclass correlation coefficients for reliability research. Journal of chiropractic medicine, 15(2):155–163.
Kruk Julia, Lubin Jonah, Sikka Karan, Lin Xiao, Jurafsky Dan, and Divakaran Ajay. 2019. Integrating text and image: Determining multimodal document intent in instagram posts. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing.
Li Nicole YK and Yiu Edwin M-L. 2006. Acoustic and perceptual analysis of modal and falsetto registers in females with dysphonia. Clinical linguistics & phonetics, 20(6):463–481.
Liu Yinhan, Ott Myle, Goyal Naman, Du Jingfei, Joshi Mandar, Chen Danqi, Levy Omer, Lewis Mike, Zettlemoyer Luke, and Stoyanov Veselin. 2020. RoBERTa: A robustly optimized BERT pretraining approach. https://openreview.net/forum?id=SyxS0T4tvS.
Lyu Yiwei, Liang Paul Pu, Deng Zihao, Salakhutdinov Ruslan, and Morency Louis-Philippe. 2022. Dime: Fine-grained interpretations of multimodal models via disentangled local explanations. arXiv preprint arXiv:2203.02013.
Munezero Myriam, Montero Calkin Suero, Sutinen Erkki, and Pajunen John. 2014. Are they different? affect, feeling, emotion, sentiment, and opinion detection in text. IEEE transactions on affective computing, 5(2):101–111.
Naim Iftekhar, Tanveer Md Iftekhar, Gildea Daniel, and Hoque Mohammed Ehsan. 2016. Automated analysis and prediction of job interview performance. IEEE Transactions on Affective Computing, 9(2):191–204.
Nelson Benjamin W, Sheeber Lisa, Pfeifer Jennifer, and Allen Nicholas B. 2021. Psychobiological markers of allostatic load in depressed and nondepressed mothers and their adolescent offspring. Journal of Child Psychology and Psychiatry, 62(2):199–211.
Provost Emily Mower, Shangguan Yuan, and Busso Carlos. 2015. Umeme: University of michigan emotional mcgurk effect data set. IEEE Transactions on Affective Computing, 6(4):395–409.
Razavi Ali, van den Oord Aaron, and Vinyals Oriol. 2019. Generating diverse high-fidelity images with vq-vae-2. In Advances in neural information processing systems, pages 14866–14876.
Rosenthal Robert. 2005. Conducting judgment studies: Some methodological issues. The new handbook of methods in nonverbal behavior research, pages 199–234.
Sabour Sara, Frosst Nicholas, and Hinton Geoffrey E. 2017. Dynamic routing between capsules. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017.
Tsai Yao-Hung Hubert, Bai Shaojie, Liang Paul Pu, Kolter J Zico, Morency Louis-Philippe, and Salakhutdinov Ruslan. 2019. Multimodal transformer for unaligned multimodal language sequences. In Proceedings of the conference. Association for Computational Linguistics. Meeting.
Tsai Yao-Hung Hubert, Ma Martin Q, Yang Muqiao, Salakhutdinov Ruslan, and Morency Louis-Philippe. 2020. Multimodal routing: Improving local and global interpretability of multimodal language analysis. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Conference on Empirical Methods in Natural Language Processing.
Tsang Michael, Cheng Dehua, Liu Hanpeng, Feng Xue, Zhou Eric, and Liu Yan. 2020. Feature interaction interpretability: A case for explaining ad-recommendation systems via neural interaction detection. arXiv preprint arXiv:2006.10966.
Tyagi Aman, Field Anjalie, Lathwal Priyank, Tsvetkov Yulia, and Carley Kathleen M. 2020. A computational analysis of polarization on indian and pakistani social media. In International Conference on Social Informatics, pages 364–379. Springer.
Valstar Michel, Gratch Jonathan, Schuller Björn, Ringeval Fabien, Lalanne Denis, Torres Mercedes Torres, Scherer Stefan, Stratou Giota, Cowie Roddy, and Pantic Maja. 2016. Avec 2016: Depression, mood, and emotion recognition workshop and challenge. In Proceedings of the 6th international workshop on audio/visual emotion challenge, pages 3–10.
Vapnik Vladimir N. 1999. An overview of statistical learning theory. IEEE transactions on neural networks, 10(5):988–999.
Vaswani Ashish, Shazeer Noam, Parmar Niki, Uszkoreit Jakob, Jones Llion, Gomez Aidan N, Kaiser Łukasz, and Polosukhin Illia. 2017. Attention is all you need. Advances in neural information processing systems, 30.
Wang Weiyao, Tran Du, and Feiszli Matt. 2020. What makes training multi-modal classification networks hard? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12695–12705.
Wang Xingbo, He Jianben, Jin Zhihua, Yang Muqiao, Wang Yong, and Qu Huamin. 2021. M2lens: Visualizing and explaining multimodal models for sentiment analysis. IEEE Transactions on Visualization and Computer Graphics, 28(1):802–812.
Wörtwein Torsten, Sheeber Lisa B, Allen Nicholas, Cohn Jeffrey F, and Morency Louis-Philippe. 2021. Human-guided modality informativeness for affective states. In Proceedings of the 2021 International Conference on Multimodal Interaction, pages 728–734.
Wu Zhiyong, Kong Lingpeng, Bi Wei, Li Xiang, and Kao Ben. 2021. Good for misconceived reasons: An empirical revisiting on the need for visual context in multimodal machine translation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, pages 6153–6166.
Yang Ying, Fairbairn Catherine, and Cohn Jeffrey F. 2012. Detecting depression severity from vocal prosody. IEEE transactions on affective computing, 4(2):142–150.
Zadeh Amir, Chen Minghai, Poria Soujanya, Cambria Erik, and Morency Louis-Philippe. 2017. Tensor fusion network for multimodal sentiment analysis. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing.
Zadeh Amir, Mao Chengfeng, Shi Kelly, Zhang Yiwei, Liang Paul Pu, Poria Soujanya, and Morency Louis-Philippe. 2019. Factorized multimodal transformer for multimodal sequential learning. In Elsevier Information Fusion Journal (IF 11.21).
Zadeh Amir, Zellers Rowan, Pincus Eli, and Morency Louis-Philippe. 2016. Mosi: multimodal corpus of sentiment intensity and subjectivity analysis in online opinion videos. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics.
Zadeh AmirAli Bagher, Liang Paul Pu, Poria Soujanya, Cambria Erik, and Morency Louis-Philippe. 2018. Multimodal language analysis in the wild: CMU-MOSEI dataset and interpretable dynamic fusion graph. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, pages 2236–2246.
Zellers Rowan, Lu Ximing, Hessel Jack, Yu Youngjae, Park Jae Sung, Cao Jize, Farhadi Ali, and Choi Yejin. 2021. Merlot: Multimodal neural script knowledge models. In Advances in Neural Information Processing Systems 34.
Zeng Ying, Mai Sijie, and Hu Haifeng. 2021. Which is making the contribution: Modulating unimodal and cross-modal dynamics for multimodal sentiment analysis. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 1262–1274. Association for Computational Linguistics.
Zhang Mingda, Hwa Rebecca, and Kovashka Adriana. 2018. Equal but not the same: Understanding the implicit relationship between persuasive images and text. In Proceedings of the British Machine Vision Conference (BMVC).

