Abstract
Learning multimodal representations involves integrating information from multiple heterogeneous sources of data. It is a challenging yet crucial area with numerous real-world applications in multimedia, affective computing, robotics, finance, human-computer interaction, and healthcare. Unfortunately, multimodal research has seen limited resources to study (1) generalization across domains and modalities, (2) complexity during training and inference, and (3) robustness to noisy and missing modalities. In order to accelerate progress towards understudied modalities and tasks while ensuring real-world robustness, we release MultiBench, a systematic and unified large-scale benchmark for multimodal learning spanning 15 datasets, 10 modalities, 20 prediction tasks, and 6 research areas. MultiBench provides an automated end-to-end machine learning pipeline that simplifies and standardizes data loading, experimental setup, and model evaluation. To enable holistic evaluation, MultiBench offers a comprehensive methodology to assess (1) generalization, (2) time and space complexity, and (3) modality robustness. MultiBench introduces impactful challenges for future research, including scalability to large-scale multimodal datasets and robustness to realistic imperfections. To accompany this benchmark, we also provide a standardized implementation of 20 core approaches in multimodal learning spanning innovations in fusion paradigms, optimization objectives, and training approaches. Simply applying methods proposed in different research areas can improve the state-of-the-art performance on 9/15 datasets. Therefore, MultiBench presents a milestone in unifying disjoint efforts in multimodal machine learning research and paves the way towards a better understanding of the capabilities and limitations of multimodal models, all the while ensuring ease of use, accessibility, and reproducibility. MultiBench, our standardized implementations, and leaderboards are publicly available, will be regularly updated, and welcome input from the community.
1. Introduction
Our perception of the natural world surrounding us involves multiple sensory modalities: we see objects, hear audio signals, feel textures, smell fragrances, and taste flavors. A modality refers to a way in which a signal exists or is experienced. Multiple modalities then refer to a combination of multiple signals each expressed in heterogeneous manners [10]. Many real-world research problems are inherently multimodal: from the early research on audio-visual speech recognition [48] to the recent explosion of interest in language, vision, and video understanding [48] for applications such as multimedia [102, 116], affective computing [101, 127], robotics [84, 91], finance [70], dialogue [126], human-computer interaction [47, 117], and healthcare [51, 172]. The research field of multimodal machine learning (ML) brings unique challenges for both computational and theoretical research given the heterogeneity of various data sources [10]. At its core lies the learning of multimodal representations that capture correspondences between modalities for prediction; it has emerged as a vibrant interdisciplinary field of immense importance and with extraordinary potential.
Limitations of current multimodal datasets:
Current multimodal research has led to impressive advances in benchmarking and modeling for specific domains such as language and vision [4, 103, 105, 132]. However, other domains, modalities, and tasks are relatively understudied. Many of these tasks are crucial for real-world intelligence such as improving accessibility to technology for diverse populations [62], accelerating healthcare diagnosis to aid doctors [78], and building reliable robots that can engage in human-AI interactions [16, 83, 137]. Furthermore, current benchmarks typically focus on performance without quantifying the potential drawbacks involved with increased time and space complexity [148], and the risk of decreased robustness from imperfect modalities [101, 123]. In real-world deployment, a balance between performance, robustness, and complexity is often required.
MultiBench:
In order to accelerate research in building general-purpose multimodal models, our main contribution is MultiBench (Figure 1), a systematic and unified large-scale benchmark that brings us closer to the requirements of real-world multimodal applications. MultiBench is designed to comprehensively evaluate 3 main components: generalization across domains and modalities, complexity during training and inference, and robustness to noisy and missing modalities:
Generalization across domains and modalities: MultiBench contains a diverse set of 15 datasets spanning 10 modalities and testing for 20 prediction tasks across 6 distinct research areas. These research areas include important tasks understudied from a multimodal learning perspective, such as healthcare, finance, and HCI. Building upon extensive data-collection efforts by domain experts, we worked with them to adapt datasets that reflect real-world relevance, present unique challenges to multimodal learning, and enable opportunities in algorithm design and evaluation.
Complexity during training and inference: MultiBench also quantifies potential drawbacks involving increased time and space complexity of multimodal learning. Together, these metrics summarize the tradeoffs of current models as a step towards efficiency in real-world settings [142].
Robustness to noisy and missing modalities: Different modalities often display different noise topologies, and real-world multimodal signals possibly suffer from missing or noisy data in at least one of the modalities [10]. MultiBench provides a standardized way to assess the risk of decreased robustness from imperfect modalities through a set of modality-specific and multimodal imperfections that reflect real-world noise, thereby providing a benchmark towards safe and robust deployment.
Together, MultiBench unifies efforts across separate research areas in multimodal learning to enable quick and accurate benchmarking across a wide range of datasets and metrics.
To help the community accurately compare performance and ensure reproducibility, MultiBench includes an end-to-end pipeline including data preprocessing, dataset splits, multimodal algorithms, evaluation metrics, and cross-validation protocols. This includes an implementation of 20 core multimodal approaches spanning innovations in fusion paradigms, optimization objectives, and training approaches in a standard public toolkit called MultiZoo. We perform a systematic evaluation and show that directly applying these methods can improve the state-of-the-art performance on 9 out of the 15 datasets. Therefore, MultiBench presents a step towards unifying disjoint efforts in multimodal research and paves a way towards a deeper understanding of multimodal models.
Most importantly, our public zoo of multimodal benchmarks and models will ensure ease of use, accessibility, and reproducibility. Finally, we outline our plans to ensure the continual availability, maintenance, and expansion of MultiBench, including using it as a theme for future workshops and competitions and to support the multimodal learning courses taught around the world.
2. MultiBench: The Multiscale Multimodal Benchmark
Background:
We define a modality as a single particular mode in which a signal is expressed or experienced. Multiple modalities then refer to a combination of multiple heterogeneous signals [10]. The first version of MultiBench focuses on benchmarking algorithms for multimodal fusion, where the main challenge is to join information from two or more modalities to perform a prediction (e.g., classification, regression). Classic examples for multimodal fusion include audio-visual speech recognition where visual lip motion is fused with speech signals to predict spoken words [48]. Multimodal fusion can be contrasted with multimodal translation where the goal is to generate a new and different modality [162], grounding and question answering where one modality is used to query information in another (e.g., visual question answering [4]), and unsupervised or self-supervised multimodal representation learning [109, 143]. We outline plans for future versions of MultiBench to study these important topics in multimodal research in Appendix I.
Each of the following 15 datasets in MultiBench contributes a unique perspective to the various technical challenges in multimodal learning involving learning and aligning complementary information, scalability to a large number of modalities, and robustness to realistic real-world imperfections.
2.1. Datasets
Table 1 shows an overview of the datasets provided in MultiBench. We provide a brief overview of the modalities and tasks for each of these datasets and refer the reader to Appendix C for details.
Table 1: Overview of the datasets in MultiBench, spanning research areas, dataset sizes (S/M/L), modalities, sample counts, and prediction tasks.

| Research Area | Size | Dataset | Modalities | # Samples | Prediction task |
|---|---|---|---|---|---|
| Affective Computing | S | MUStARD [24] | {language, video, audio} | 690 | sarcasm |
| Affective Computing | M | CMU-MOSI [181] | {language, video, audio} | 2,199 | sentiment |
| Affective Computing | L | UR-FUNNY [64] | {language, video, audio} | 16,514 | humor |
| Affective Computing | L | CMU-MOSEI [183] | {language, video, audio} | 22,777 | sentiment, emotions |
| Healthcare | L | MIMIC [78] | {time-series, tabular} | 36,212 | mortality, ICD-9 codes |
| Robotics | M | MuJoCo Push [90] | {image, force, proprioception} | 37,990 | object pose |
| Robotics | L | Vision&Touch [92] | {image, force, proprioception} | 147,000 | contact, robot pose |
| Finance | M | Stocks-F&B | {time-series ×18} | 5,218 | stock price, volatility |
| Finance | M | Stocks-Health | {time-series ×63} | 5,218 | stock price, volatility |
| Finance | M | Stocks-Tech | {time-series ×100} | 5,218 | stock price, volatility |
| HCI | S | ENRICO [93] | {image, set} | 1,460 | design interface |
| Multimedia | S | Kinetics400-S [80] | {video, audio, optical flow} | 2,624 | human action |
| Multimedia | M | MM-IMDb [8] | {language, image} | 25,959 | movie genre |
| Multimedia | M | AV-MNIST [161] | {image, audio} | 70,000 | digit |
| Multimedia | L | Kinetics400-L [80] | {video, audio, optical flow} | 306,245 | human action |
Affective computing studies the perception of human affective states (emotions, sentiment, and personalities) from our natural display of multimodal signals spanning language (spoken words), visual (facial expressions, gestures), and acoustic (prosody, speech tone) [124]. It has broad impacts towards building emotionally intelligent computers, human behavior analysis, and AI-assisted education. MultiBench contains 4 datasets involving fusing language, video, and audio time-series data to predict sentiment (CMU-MOSI [181]), emotions (CMU-MOSEI [183]), humor (UR-FUNNY [64]), and sarcasm (MUStARD [24]). Complementary information may occur at different moments, requiring models to address the multimodal challenges of grounding and alignment.
Healthcare:
Modern medical decision-making often involves integrating complementary information and signals from several sources such as lab tests, imaging reports, and patient-doctor conversations. Multimodal models can help doctors make sense of high-dimensional data and assist them in the diagnosis process [5]. MultiBench includes the large-scale MIMIC dataset [78], which records ICU patient data including time-series data measured every hour and other demographic variables (e.g., age, gender, and ethnicity in the form of tabular numerical data). These are used to predict the disease ICD-9 code and mortality rate. MIMIC poses unique challenges in integrating time-varying and static modalities, reinforcing the need for aligning multimodal information at the correct granularities.
Robotics:
Modern robot systems are equipped with multiple sensors to aid in their decision-making. We include the large-scale MuJoCo Push [90] and Vision&Touch [92] datasets, which record the manipulation of simulated and real robotic arms equipped with visual (RGB and depth), force, and proprioception sensors. In MuJoCo Push, the goal is to predict the pose of the object being pushed by the robot end-effector. In Vision&Touch, the goals are action-conditional prediction tasks that capture the forward dynamics of the different modalities (contact prediction and robot end-effector pose). Robustness is important due to the risk of real-world sensor failures [89].
Finance:
We gathered historical stock data from the internet to create our own dataset for financial time-series prediction across 3 groups of correlated stocks: Stocks-F&B, Stocks-Health, and Stocks-Tech. Within each group, the previous stock prices of a set of stocks are used as multimodal time-series inputs to predict the price and volatility of a related stock (e.g., using Apple, Google, and Microsoft data to predict future Microsoft prices). Multimodal stock prediction [136] presents scalability issues due to a large number of modalities (18/63/100 vs 2/3 in most datasets), as well as robustness challenges arising from real-world data with an inherently low signal-to-noise ratio.
Human Computer Interaction (HCI) studies the design of computer technology and interactive interfaces between humans and computers [43]. Many real-world problems involve multimodal inputs such as language, visual, and audio interfaces. We use the Enrico (Enhanced Rico) dataset [40, 93] of Android app screens (consisting of an image as well as a set of apps and their locations) categorized by their design motifs and collected for data-driven design applications such as design search, user interface (UI) layout generation, UI code generation, and user interaction modeling.
Multimedia:
A significant body of research in multimodal learning has been fueled by the large availability of multimedia data (language, image, video, and audio) on the internet. MultiBench includes 3 popular large-scale multimedia datasets with varying sizes and levels of difficulty: (1) AV-MNIST [161] is assembled from images of handwritten digits [88] and audio samples of spoken digits [94], (2) MM-IMDb [8] uses movie titles, metadata, and movie posters to perform multi-label classification of movie genres, and (3) Kinetics [80] contains video, audio, and optical flow of 306,245 video clips annotated for 400 human actions. To ease experimentation, we split Kinetics into small and large partitions (see Appendix C).
2.2. Evaluation Protocol
MultiBench contains evaluation scripts for the following holistic desiderata in multimodal learning:
Performance:
We standardize evaluation using metrics designed for each dataset, ranging from MSE and MAE for regression to accuracy, micro & macro F1-score, and AUPRC for classification.
Complexity:
Modern ML research unfortunately incurs significant energy costs [142], a phenomenon often exacerbated when processing high-dimensional multimodal data. As a step towards quantifying energy complexity and recommending lightweight multimodal models, MultiBench records the amount of input data in bits (i.e., data size), the number of model parameters, as well as the time and memory resources required during the entire training process. Real-world models may also need to be small and compact to run on mobile devices [131], so we also report inference time and memory on CPU and GPU (see Appendix D.2).
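As a concrete illustration (a minimal sketch, not MultiBench's actual evaluation scripts), these complexity metrics can be measured for any trained PyTorch model roughly as follows; the helper assumes, for simplicity, a model that consumes a single batched tensor:

```python
import time
import torch

def complexity_report(model, sample_batch, device="cpu"):
    """Summarize model size and inference cost on one batch (illustrative only)."""
    model = model.to(device).eval()
    sample_batch = sample_batch.to(device)
    # number of trainable parameters
    num_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    if device != "cpu":
        torch.cuda.reset_peak_memory_stats(device)
    with torch.no_grad():
        start = time.time()
        for _ in range(10):  # average wall-clock time over repeated forward passes
            model(sample_batch)
        infer_time = (time.time() - start) / 10
    # peak GPU memory is only meaningful when running on a CUDA device
    peak_mem = torch.cuda.max_memory_allocated(device) if device != "cpu" else None
    return {"params": num_params,
            "inference_time_s": infer_time,
            "peak_gpu_mem_bytes": peak_mem}
```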
Robustness:
Real-world multimodal data is often imperfect as a result of missing entries, noise corruption, or entirely missing modalities, which calls for robust models that can still make accurate predictions despite having access to only noisy or missing signals [101, 123]. To standardize efforts in evaluating robustness, MultiBench includes the following tests: (1) Modality-specific imperfections are independently applied to each modality, taking into account its unique noise topologies (e.g., flips and crops of images, natural misspellings in text, abbreviations in spoken audio). (2) Multimodal imperfections capture correlations in imperfections across modalities (e.g., missing modalities, or a chunk of time missing in multimodal time-series data). We use both qualitative measures (performance-imperfection curves) and quantitative metrics [149] that summarize (1) relative robustness, measuring accuracy under imperfections, and (2) effective robustness, measuring the rate of accuracy drops after equalizing for initial accuracy on clean test data (see Appendix D.3 for details).
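The two robustness summaries can be sketched as follows; this is a simplified illustration of the formal definitions in Appendix D.3, and the baseline fit used for effective robustness is a hypothetical stand-in for a fit across many models:

```python
import numpy as np

def relative_robustness(acc_under_noise):
    """Average accuracy over increasing imperfection levels (higher is better)."""
    return float(np.mean(acc_under_noise))

def effective_robustness(acc_under_noise, clean_acc, baseline_fit):
    """Accuracy retained beyond what the model's clean accuracy predicts.

    `baseline_fit` maps (clean accuracy, imperfection level) to the accuracy
    expected under that imperfection; here it is an assumed placeholder.
    """
    levels = range(len(acc_under_noise))
    expected = np.array([baseline_fit(clean_acc, level) for level in levels])
    return float(np.mean(np.array(acc_under_noise) - expected))

# usage with a toy linear baseline: expected accuracy decays 5 points per noise level
accs = [0.80, 0.74, 0.69, 0.61]
print(relative_robustness(accs))
print(effective_robustness(accs, clean_acc=0.82,
                           baseline_fit=lambda a, lvl: a - 0.05 * lvl))
```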
3. MultiZoo: A Zoo of Multimodal Algorithms
To complement MultiBench, we release a comprehensive toolkit, MultiZoo, as starter code for multimodal algorithms which implements 20 methods spanning different methodological innovations in (1) data preprocessing, (2) fusion paradigms, (3) optimization objectives, and (4) training procedures (see Figure 2). To introduce these algorithms, we use the simple setting with 2 modalities for notational convenience but refer the reader to Appendix E for detailed descriptions and implementations. We use $\mathbf{x}_1, \mathbf{x}_2$ for the input modalities, $\mathbf{z}_1, \mathbf{z}_2$ for the unimodal representations, $\mathbf{z}_{\text{mm}}$ for the multimodal representation, and $\hat{y}$ for the predicted label.
3.1. Data Preprocessing
Temporal alignment [26] has been shown to help tackle the multimodal alignment problem for time-series data. This approach assumes a temporal granularity of the modalities (e.g., at the level of words for text) and aligns information from the remaining modalities to the same granularity. We call this approach WordAlign [26] for temporal data where text is one of the modalities.
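A minimal sketch of this idea, assuming each word comes with start/end timestamps and the other modality is a regularly sampled feature sequence (not the exact WordAlign implementation):

```python
import numpy as np

def word_align(frame_feats, frame_times, word_spans):
    """Average the frames of a non-text modality that fall inside each word's span.

    frame_feats: (T, d) array of e.g. audio or visual features
    frame_times: (T,) array of timestamps, one per frame
    word_spans:  list of (start, end) times, one per spoken word
    """
    aligned = []
    for start, end in word_spans:
        mask = (frame_times >= start) & (frame_times < end)
        # fall back to a zero vector if no frame lands inside the word
        feat = frame_feats[mask].mean(axis=0) if mask.any() else np.zeros(frame_feats.shape[1])
        aligned.append(feat)
    return np.stack(aligned)  # (num_words, d), same granularity as the text
```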
3.2. Fusion Paradigms
Early and late fusion:
Early fusion concatenates the raw input data before applying a model (i.e., $\mathbf{z}_{\text{mm}} = [\mathbf{x}_1, \mathbf{x}_2]$), while late fusion applies suitable unimodal models to each modality to obtain their feature representations, concatenates these features, and trains a classifier from these features to the label (i.e., $\mathbf{z}_{\text{mm}} = [\mathbf{z}_1, \mathbf{z}_2]$) [10]. MultiZoo includes their implementations, denoted as EF and LF respectively. Tensors are specifically designed to tackle the multimodal complementarity challenge by explicitly capturing higher-order interactions across modalities [179]. Given unimodal representations $\mathbf{z}_1, \mathbf{z}_2$, tensors are defined as $\mathbf{z}_{\text{mm}} = \mathbf{z}_1 \otimes \mathbf{z}_2$ where $\otimes$ denotes an outer product. However, computing tensor products is expensive since their dimension scales exponentially with the number of modalities, so several efficient approximations have been proposed [71, 101, 106]. MultiZoo includes Tensor Fusion (TF) [179] as well as the approximate Low-rank Tensor Fusion (LRTF) [106].
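For illustration, minimal PyTorch sketches of these fusion operations (not the exact MultiZoo modules) might look as follows, where z1 and z2 are batched unimodal representations:

```python
import torch

def early_fusion(x1, x2):
    # concatenate raw (flattened) inputs before any model is applied
    return torch.cat([x1.flatten(1), x2.flatten(1)], dim=-1)

def late_fusion(z1, z2):
    # concatenate unimodal feature representations before the classifier
    return torch.cat([z1, z2], dim=-1)

def tensor_fusion(z1, z2):
    # append a constant 1 so the outer product also retains unimodal terms,
    # then take the batched outer product: output has (d1+1)*(d2+1) features
    ones = torch.ones(z1.size(0), 1, device=z1.device)
    z1_, z2_ = torch.cat([z1, ones], dim=-1), torch.cat([z2, ones], dim=-1)
    return torch.einsum("bi,bj->bij", z1_, z2_).flatten(1)
```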
Multiplicative Interactions (MI) generalize tensor products to include learnable parameters that capture multimodal interactions [77]. In its most general form, MI defines a bilinear product $\mathbf{z}_{\text{mm}} = \mathbf{z}_1 \mathbb{W} \mathbf{z}_2 + \mathbf{z}_1 \mathbb{U} + \mathbb{V} \mathbf{z}_2 + \mathbf{b}$ where $\mathbb{W}$, $\mathbb{U}$, $\mathbb{V}$, and $\mathbf{b}$ are trainable parameters. By appropriately constraining the rank and structure of these parameters, MI recovers HyperNetworks [61] (unconstrained parameters resulting in a matrix output), Feature-wise linear modulation (FiLM) [120, 188] (diagonal parameters resulting in a vector output), and Sigmoid units [37] (scalar parameters resulting in a scalar output). MultiZoo includes all 3 as MI-Matrix, MI-Vector, and MI-Scalar respectively.
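A minimal sketch of the MI-Matrix form above (shapes and initialization are illustrative assumptions, not the MultiZoo implementation):

```python
import torch
import torch.nn as nn

class MIMatrix(nn.Module):
    """z_mm = z1 W z2 + z1 U + V z2 + b with unconstrained parameters."""
    def __init__(self, d1, d2, d_out):
        super().__init__()
        self.W = nn.Parameter(torch.randn(d1, d2, d_out) * 0.01)  # 3-way tensor for z1 W z2
        self.U = nn.Parameter(torch.randn(d1, d_out) * 0.01)
        self.V = nn.Parameter(torch.randn(d2, d_out) * 0.01)
        self.b = nn.Parameter(torch.zeros(d_out))

    def forward(self, z1, z2):
        bilinear = torch.einsum("bi,ijk,bj->bk", z1, self.W, z2)
        return bilinear + z1 @ self.U + z2 @ self.V + self.b

# MI-Vector and MI-Scalar correspond to constraining these parameters
# to diagonal and scalar forms respectively.
z1, z2 = torch.randn(4, 32), torch.randn(4, 16)
print(MIMatrix(32, 16, 8)(z1, z2).shape)  # torch.Size([4, 8])
```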
Multimodal gated units learn representations that dynamically change for every input [25, 167, 171]. Their general form can be written as $\mathbf{z}_{\text{mm}} = \mathbf{z}_1 \odot h(\mathbf{z}_2)$, where $h$ represents a function with sigmoid activation and $\odot$ denotes element-wise product. $h(\mathbf{z}_2)$ is commonly referred to as “attention weights” learned from $\mathbf{z}_2$ to attend on $\mathbf{z}_1$. Attention is conceptually similar to MI-Vector, but recent work has explored more expressive forms of $h$ such as using a Query-Key-Value mechanism [167] or fully-connected layers [25]. We implement the Query-Key-Value mechanism as NL Gate [167].
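A minimal sketch of a gated unit with a fully-connected $h$ (layer sizes are assumptions for illustration, not the NL Gate architecture):

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """z2 produces sigmoid 'attention weights' that modulate z1 element-wise."""
    def __init__(self, d1, d2):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(d2, d1), nn.Sigmoid())  # h(z2)

    def forward(self, z1, z2):
        return z1 * self.gate(z2)  # element-wise product z1 ⊙ h(z2)

z1, z2 = torch.randn(4, 32), torch.randn(4, 16)
print(GatedFusion(32, 16)(z1, z2).shape)  # torch.Size([4, 32])
```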
Temporal attention models tackle the challenge of multimodal alignment and complementarity. Transformer models [158] are useful for temporal data since they automatically align and capture complementary features at different time-steps [154, 174]. We include the Multimodal Transformer (MulT) [154], which applies a Crossmodal Transformer block that uses $\mathbf{z}_1$ to attend to $\mathbf{z}_2$ (and vice versa) to obtain a multimodal representation $\mathbf{z}_{\text{mm}} = [\mathbf{z}_{1 \to 2}, \mathbf{z}_{2 \to 1}]$.
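Before the full MultiZoo example in Algorithm 1, here is a minimal sketch of a single crossmodal attention block in this spirit, built on PyTorch's nn.MultiheadAttention; the real MulT stacks several such blocks per direction:

```python
import torch
import torch.nn as nn

class CrossmodalAttention(nn.Module):
    """One attention block where modality 1 queries modality 2."""
    def __init__(self, d_model, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, seq1, seq2):
        # queries come from modality 1; keys and values come from modality 2
        out, _ = self.attn(query=seq1, key=seq2, value=seq2)
        return out

seq1, seq2 = torch.randn(4, 20, 64), torch.randn(4, 50, 64)  # different sequence lengths
z_1_to_2 = CrossmodalAttention(64)(seq1, seq2)  # shape (4, 20, 64)
```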
Algorithm 1.
import torch
from datasets.get_data import get_dataloader
from unimodals.common_models import ResNet, Transformer, MLP
from fusions.common_fusions import MultInteractions
from training_structures.gradient_blend import train, test

# load the Multimodal IMDb dataset
traindata, validdata, testdata = get_dataloader('multimodal_imdb')
out_channels = 3
# define ResNet and Transformer unimodal encoders
encoders = [ResNet(in_channels=1, out_channels=out_channels, layers=5),
            Transformer(in_channels=1, out_channels=out_channels, layers=3)]
# define a Multiplicative Interactions fusion layer
fusion = MultInteractions([out_channels*8, out_channels*32], out_channels*32, 'matrix')
classifier = MLP(out_channels*32, 100, labels=23)
# train using the Gradient Blend algorithm
model = train(encoders, fusion, classifier, traindata, validdata, epochs=100,
              optimtype=torch.optim.SGD, lr=0.01, weight_decay=0.0001)
# test: evaluate performance, complexity, and robustness
performance, complexity, robustness = test(model, testdata)
Architecture search:
Instead of hand-designing architectures, several approaches define a set of atomic operations (e.g., linear transformation, activation, attention, etc.) and use architecture search to learn the best order of these operations for a given task [122, 173], which we call MFAS.
3.3. Optimization Objectives
In addition to the standard supervised losses (e.g., cross-entropy for classification, MSE/MAE for regression), several methods have proposed new objective functions based on:
Prediction-level alignment objectives tackle the challenge of alignment by capturing a representation space where semantically similar concepts from different modalities are close together [9, 33, 91, 151]. Alignment objectives have been applied at both prediction and feature levels. In the former, we implement Canonical Correlation Analysis (CCA) [7, 145, 166], which maximizes correlation by adding a loss term $-\operatorname{corr}(g_1(\mathbf{z}_1), g_2(\mathbf{z}_2))$ where $g_1, g_2$ are auxiliary classifiers mapping each unimodal representation to the label.
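A minimal sketch of such a correlation-based alignment term (a simple Pearson-correlation surrogate; deep CCA variants involve whitening and are more involved than this illustration):

```python
import torch

def correlation_loss(p1, p2, eps=1e-8):
    """Negative Pearson correlation between two auxiliary prediction tensors (batch, dims)."""
    p1 = p1 - p1.mean(dim=0)
    p2 = p2 - p2.mean(dim=0)
    corr = (p1 * p2).sum(dim=0) / (p1.norm(dim=0) * p2.norm(dim=0) + eps)
    return -corr.mean()  # minimized when unimodal predictions are maximally correlated
```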
Feature-level alignment:
In the latter, contrastive learning has emerged as a popular approach to bring similar concepts close together in feature space and push different concepts far apart [33, 91, 151]. We include REFNET [135], which uses a self-supervised contrastive loss between the unimodal representations $\mathbf{z}_1, \mathbf{z}_2$ and the multimodal representation $\mathbf{z}_{\text{mm}}$, where layers $g_1, g_2$ map each modality's representation into the joint multimodal space.
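A minimal sketch of a contrastive alignment loss in this spirit (an InfoNCE-style objective with assumed projection layers g1 and g2, not the exact REFNET loss):

```python
import torch
import torch.nn.functional as F

def contrastive_alignment(z1, z2, z_mm, g1, g2, temperature=0.1):
    # project each modality into the joint multimodal space and normalize
    a = F.normalize(g1(z1), dim=-1)
    b = F.normalize(g2(z2), dim=-1)
    m = F.normalize(z_mm, dim=-1)
    # treat matching (unimodal, multimodal) pairs within the batch as positives
    logits1 = a @ m.t() / temperature
    logits2 = b @ m.t() / temperature
    labels = torch.arange(z_mm.size(0), device=z_mm.device)
    return F.cross_entropy(logits1, labels) + F.cross_entropy(logits2, labels)
```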
Reconstruction objectives based on generative-discriminative models (e.g., VAEs) aim to reconstruct the input (or some part of the input) [91, 155]. These have been shown to better preserve task-relevant information in the learned representation, especially in settings with sparse supervised signals such as robotics [91] and long videos [155]. We include the Multimodal Factorized Model (MFM) [155], which learns a representation $\mathbf{z}_{\text{mm}}$ that can reconstruct the input data $\mathbf{x}_1, \mathbf{x}_2$ while also predicting the label, i.e., adding an objective $\|d_1(\mathbf{z}_{\text{mm}}) - \mathbf{x}_1\| + \|d_2(\mathbf{z}_{\text{mm}}) - \mathbf{x}_2\|$ where $d_1, d_2$ are auxiliary decoders mapping $\mathbf{z}_{\text{mm}}$ to each raw input modality. MFM can be paired with any multimodal model from section 3.2 (e.g., learning $\mathbf{z}_{\text{mm}}$ via tensors and adding a term to reconstruct the input data).
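A minimal sketch of combining a supervised loss with such reconstruction terms (classification is assumed, and d1, d2 are placeholder decoder modules rather than the exact MFM architecture):

```python
import torch.nn.functional as F

def reconstruction_augmented_loss(x1, x2, z_mm, y, classifier, d1, d2, recon_weight=0.1):
    task_loss = F.cross_entropy(classifier(z_mm), y)                   # supervised prediction
    recon_loss = F.mse_loss(d1(z_mm), x1) + F.mse_loss(d2(z_mm), x2)   # reconstruct raw inputs
    return task_loss + recon_weight * recon_loss
```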
Improving robustness:
These approaches modify the objective function to account for robustness to noisy [101] or missing [89, 111, 123] modalities. MultiZoo includes MCTN [123], which uses cycle-consistent translation to predict a noisy or missing modality from the ones that are present (i.e., a translation path $\mathbf{x}_1 \to \mathbf{x}_2 \to \mathbf{x}_1$ with additional reconstruction losses in both directions). While MCTN is trained with multimodal data, it only takes in one modality at test time, which makes it robust to noise or absence in the remaining modalities.
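A minimal sketch of a cycle-consistent translation objective in this spirit (forward_net and backward_net are placeholder translation modules, not the exact MCTN architecture):

```python
import torch.nn.functional as F

def cycle_translation_loss(x1, x2, forward_net, backward_net):
    x2_hat = forward_net(x1)       # translate modality 1 -> modality 2
    x1_hat = backward_net(x2_hat)  # translate back: modality 2 -> modality 1
    # penalize both the forward translation and the cycle reconstruction
    return F.mse_loss(x2_hat, x2) + F.mse_loss(x1_hat, x1)
```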
3.4. Training Procedures
Improving generalization:
Recent work has found that directly training a multimodal model is sub-optimal since different modalities overfit and generalize at different rates. MultiZoo includes Gradient Blending (GRadBlend), which computes generalization statistics for each modality to determine their weights during fusion [167], and Regularization by Maximizing Functional Entropies (RMFE), which uses functional entropy to balance the contribution of each modality to the result [53].
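A minimal sketch of how Gradient-Blending-style per-modality weights can be derived from train/validation losses measured at two checkpoints (a simplified version of the procedure in [167], not the full algorithm):

```python
def gradient_blend_weights(train_losses_t0, val_losses_t0, train_losses_t1, val_losses_t1):
    """Weight each modality branch by its generalization gain vs. growth in overfitting."""
    weights = []
    for tr0, va0, tr1, va1 in zip(train_losses_t0, val_losses_t0,
                                  train_losses_t1, val_losses_t1):
        gain = va0 - va1                        # generalization: drop in validation loss
        overfit = (va1 - tr1) - (va0 - tr0)     # growth of the train/validation gap
        weights.append(max(gain, 0.0) / (overfit ** 2 + 1e-8))
    total = sum(weights) or 1.0
    return [w / total for w in weights]         # normalized per-branch loss weights

# usage: scale each branch's loss by these weights when training the fused model
print(gradient_blend_weights([0.9, 0.8], [1.0, 0.95], [0.5, 0.6], [0.8, 0.7]))
```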
3.5. Putting Everything Together
In Algorithm 1, we show a sample code snippet in Python that loads a dataset from MultiBench (section C.2), defines the unimodal and multimodal architectures, optimization objective, and training procedures (section 3), before running the evaluation protocol (section 2.2). Our MultiZoo toolkit is easy to use and trains entire multimodal models in less than 10 lines of code. By standardizing the implementation of each module and disentangling the individual effects of models, optimizations, and training, MultiZoo ensures both accessibility and reproducibility of its algorithms.
4. Experiments and Discussion
Setup:
Using MultiBench, we load each of the datasets and test the multimodal approaches in MultiZoo. We only vary the contributed method of interest and keep all other possibly confounding factors constant (i.e., using the exact same training loop when testing a new multimodal fusion paradigm), a practice unfortunately not consistent in previous work. Our code is available at https://github.com/pliang279/MultiBench. Please refer to Appendix G for experimental details. MultiBench allows for careful analysis of multimodal models and we summarize the main take-away messages below (see Appendix H for full results and analysis).
Benefits of standardization:
From Table 2, simply applying methods proposed outside of the same research area can improve the state-of-the-art performance on 9 of the 15 MultiBench datasets, especially for relatively understudied domains and modalities (i.e., healthcare, finance, HCI).
Table 2:
| Dataset | MUStARD ↑ | CMU-MOSI ↑ | UR-FUNNY ↑ | CMU-MOSEI ↑ | MIMIC ↑ |
|---|---|---|---|---|---|
| Unimodal | 68.6±0.4 | 74.2±0.5 | 58.3±0.2 | 78.8±1.5 | 76.7±0.3 |
| In-domain | 66.3±0.3 | 83.0±0.1 | 62.9±0.2 | 82.1±0.5 | 77.9±0.3 |
| Out-domain | 71.8±0.3 | 75.5±0.5 | 66.7±0.3 | 78.1±0.3 | 78.2±0.2 |
| Improvement | 4.7% | - | 6.0% | - | 0.4% |

| Dataset | MuJoCo Push ↓ | V&T EE ↓ | Stocks-F&B ↓ | Stocks-Health ↓ | Stocks-Tech ↓ |
|---|---|---|---|---|---|
| Unimodal | 0.334±0.034 | 0.202±0.022 | 1.856±0.093 | 0.541±0.010 | 0.125±0.004 |
| In-domain | 0.290±0.018 | 0.258±0.011 | 1.856±0.093 | 0.541±0.010 | 0.125±0.004 |
| Out-domain | 0.402±0.026 | 0.185±0.011 | 1.820±0.138 | 0.526±0.017 | 0.120±0.008 |
| Improvement | - | 8.4% | 1.9% | 2.8% | 4.0% |

| Dataset | ENRICO ↑ | MM-IMDb ↑ | AV-MNIST ↑ | Kinetics-S ↑ | Kinetics-L ↑ |
|---|---|---|---|---|---|
| Unimodal | 47.0±1.6 | 45.6±4.5 | 65.1±0.2 | 56.5 | 72.6 |
| In-domain | 47.0±1.6 | 49.8±1.7 | 72.8±0.2 | 56.1 | 74.7 |
| Out-domain | 51.0±1.4 | 50.2±0.9 | 72.3±0.2 | 23.7 | 71.7 |
| Improvement | 8.5% | 0.8% | - | - | - |
Generalization across domains and modalities:
MultiBench offers an opportunity to analyze algorithmic developments across a large suite of modalities, domains, and tasks. We summarize the following observations regarding performance across datasets and tasks (see details in Appendix H.7):
Many multimodal methods show their strongest performance on in-domain datasets and do not generalize across domains and modalities. For example, MFAS [122] works well on domains it was designed for (AV-MNIST and MM-IMDb in multimedia) but does not generalize to other domains such as healthcare (MIMIC). Similarly, MulT [154] performs extremely well on the affect recognition datasets it was designed for but struggles on other multimodal time-series data in the finance and robotics domains. Finally, GRadBlend [167], an approach specifically designed to improve generalization in multimodal learning and tested on video and audio datasets (e.g., Kinetics), does not perform well on other datasets. In general, we observe high variance in the performance of multimodal methods across datasets in MultiBench. Therefore, there still does not exist a one-size-fits-all model, especially for understudied modalities and tasks.
There are methods that are surprisingly generalizable across datasets. These are typically general modality-agnostic methods such as LF. While simple, it is a strong method that balances simplicity, performance, and low complexity. However, it does not achieve the best performance on any dataset.
Several methods such as MFAS and CCA are designed for only 2 modalities (usually image and text), and TF and MI do not scale efficiently beyond 2/3 modalities. We encourage the community to generalize these approaches across datasets and modalities on MultiBench.
Tradeoffs between modalities:
How far can we go with unimodal methods? Surprisingly far! From Table 2, we observe that decent performance can be obtained with the best performing modality. Further improvement via multimodal models may come at the expense of around 2−3× the parameters.
Tradeoffs between performance and complexity:
In Figure 3(a), we summarize all methods in terms of performance and complexity. We find a strong tradeoff between these two desiderata: simple fusion techniques (e.g., LF) are actually appealing choices which score high on both metrics, especially when compared to complex (but slightly better performing) methods such as architecture search (MFAS) or Multimodal Transformers (MulT). While LF is the easiest to adapt to new datasets and domains, we encountered difficulties in adapting several possibly well-performing methods (such as MFAS or MulT) to new datasets and domains. Therefore, while their average performance across all datasets is only slightly better than LF (see Figure 3(a)), they perform much better on well-studied datasets (see Figure 3(b)). We hope that the release of MultiBench will greatly accelerate research in adapting complex methods to new datasets (see full results in Appendix H.8).
Tradeoffs between performance and robustness:
In Figure 4, we plot a similar tradeoff plot between accuracy and (relative & effective) robustness. As a reminder, relative robustness directly measures accuracy under imperfections while effective robustness measures the rate at which accuracy drops after equalizing for initial accuracy on clean test data (see Appendix D.3 for details). We observe a positive correlation between performance and relative robustness (see Figure 4(a)), implying that models starting off with higher accuracy tend to stay above other models on the performance-imperfection curve. However, we observe a negative best fit between performance and effective robustness (see Figure 4(b)) because several well-performing methods such as MulT, CCA, and MVAE tend to drop off faster after equalizing for initial accuracy on clean test data. Furthermore, very few models currently achieve both positive relative and effective robustness, which is a crucial area for future multimodal research (see full results in Appendix H.9).
5. Related Work
We review related work on standardizing datasets and methods in multimodal learning.
Comparisons with related benchmarks:
To the best of our knowledge, MultiBench is the first multimodal benchmark with such a large number of datasets, modalities, and tasks. Most previous multimodal benchmarks have focused on a single research area such as within affective computing [56], human multimodal language [177], language and vision-based question answering [50, 138], text classification with external multimodal information [60], and multimodal learning for education [65]. MultiBench is specifically designed to go beyond the commonly studied language, vision, and audio modalities to encourage the research community to explore relatively understudied modalities (e.g., tabular data, time-series, sensors, graph and set data) and build general multimodal methods that can handle a diverse set of modalities.
Our work is also inspired by recent progress in better evaluation benchmarks for a suite of important tasks in ML such as language representation learning [163, 164], long-range sequence modeling [150], multilingual representation learning [72], graph representation learning [74], and robustness to distribution shift [85]. These well-crafted benchmarks have accelerated progress in new algorithms, evaluation, and analysis in their respective research areas.
Standardizing multimodal learning:
There have also been several attempts to build a single model that works well on a suite of multimodal tasks [95, 109, 143]. However, these are limited to the language and vision space, and multimodal training is highly tailored for text and images. Transformer architectures have emerged as a popular choice due to their suitability for both language and image data [27, 73] and a recent public toolkit was released for incorporating multimodal data on top of text-based Transformers for prediction tasks [60]. By going beyond Transformers and text data, MultiBench opens the door to important research questions involving a much more diverse set of modalities and tasks while holistically evaluating performance, complexity, and robustness.
Analysis of multimodal representations:
Recent work has begun to carefully analyze and challenge long-standing assumptions in multimodal learning. They have shown that certain models do not actually learn cross-modal interactions but rather rely on ensembles of unimodal statistics [68] and that certain datasets and models are biased to the most dominant modality [22, 59], sometimes ignoring others completely [3]. These observations are currently only conducted on specific datasets and models without testing their generalization to others, a shortcoming we hope to solve using MultiBench which enables scalable analysis over modalities, tasks, and models.
6. Conclusion
Limitations:
While MultiBench can help to accelerate research in multimodal ML, we are aware of the following possible limitations (see detailed future directions in Appendix I):
Tradeoffs between generality and specificity: While it is desirable to build models that work across modalities and tasks, there is undoubtedly merit in building modality and task-specific models that can often utilize domain knowledge to improve performance and interpretability (e.g., see neurosymbolic VQA [159], or syntax models for the language modality [31]). MultiBench is not at odds with research in this direction: in fact, by easing access to data, models, and evaluation, we hope that MultiBench will challenge researchers to design interpretable models leveraging domain knowledge for many multimodal tasks. It remains an open question to define “interpretability” for other modalities beyond image and text, a question we hope MultiBench will drive research in.
Scale of datasets, models, and metrics: We plan for MultiBench to be a continuously-growing community effort with regular maintenance and expansion. While MultiBench currently does not include several important research areas outside of multimodal fusion (e.g., question answering [4, 63], retrieval [187], grounding [32], and reinforcement learning [110]), and is also limited by the models and metrics it supports, we outline our plan to expand in these directions in Appendix I.
Projected expansions of MultiBench:
In this subsection, we describe concrete ongoing and future work towards expanding MultiBench (see details in Appendix I).
Other multimodal research problems: We are genuinely committed to building a community around these resources and to continually improving them over time. While we chose to focus on multimodal fusion for this first version by design, in order to have a more coherent way to standardize and evaluate methods across datasets, we acknowledge the breadth of multimodal learning and look forward to expanding MultiBench in other directions in collaboration with domain experts. We have already included 2 datasets in captioning (and, more generally, retrieval with non-language outputs): (1) Yummly-28K of paired videos and text descriptions of food recipes [114], and (2) the Clotho dataset for audio captioning [45], as well as a language-guided RL environment, Read to Fight Monsters (RTFM) [188], and we are also working towards more datasets in QA, retrieval, and multimodal RL.
To help with scalable expansion, we plan an open call to the community for suggestions and feedback about domains, datasets, and metrics. As a step in this direction, we have concrete plans to use MultiBench as a theme for future workshops and competitions (building on the multimodal workshops we have been organizing at NAACL 2021, ACL 2020, and ACL 2019) and in multimodal learning courses (starting with the course taught annually at CMU). Since MultiBench is public and will be regularly maintained, the existing benchmark, code, evaluation, and experimental protocols can greatly accelerate any dataset and modeling innovations added in the future. In our public GitHub, we have included a section on contributing through task proposals or additions of datasets and algorithms. The authors will regularly monitor new proposals through this channel.
New evaluation metrics: We also plan to include evaluation for distribution shift, uncertainty estimation, tests for fairness and social biases, as well as labels/metrics for interpretable multimodal learning. In the latter, we plan to include the EMAP score [68] as an interpretability metric assessing whether cross-modal interactions improve performance.
Multimodal transfer learning and co-learning: Can training data in one dataset help learning on other datasets? MultiBench enables easy experimentation with such research questions: our initial experiments on transfer learning found that pre-training on larger datasets in the same domain can improve performance when fine-tuning on smaller datasets: performance on the smaller CMU-MOSI dataset improved from 75.2 to 75.8 using the same late fusion model with transfer learning from the larger UR-FUNNY and CMU-MOSEI datasets. Furthermore, recent work has shown that multimodal training can help improve unimodal performance as well [140, 170, 180]. While these initial experiments were small-scale and limited to a single domain, we plan to expand significantly on this phenomenon (multimodal co-learning) in future versions of MultiBench.
Multitask learning across modalities: Multitask learning across multimodal tasks with a shared set of input modalities is a promising direction that can enable statistical strength sharing across datasets and efficiency in training a single model. Using MultiBench, we also ran an extra experiment on multi-dataset multitask learning. We used the 4 datasets in the affective computing domain and trained a single model across all 4 of them, with adjustable input embedding layers when the input features differed and separate classification heads for each dataset's task. We found promising initial results: performance on the largest CMU-MOSEI dataset improved from 79.2 to 80.9 for a late fusion model and from 82.1 to 82.9 using a multimodal transformer model, although performance on the smaller CMU-MOSI dataset decreased from 75.2 to 70.8. We believe these potential future studies in co-learning, transfer learning, and multitask learning are strengths of MultiBench, since they showcase the range of experiments and uses it enables.
In conclusion, we present MultiBench, a large-scale benchmark unifying previously disjoint efforts in multimodal research with a focus on ease of use, accessibility, and reproducibility, thereby paving the way towards a deeper understanding of multimodal models. Through its unprecedented range of research areas, datasets, modalities, tasks, and evaluation metrics, MultiBench highlights several future directions in building more generalizable, lightweight, and robust multimodal models.
Acknowledgements
This material is based upon work partially supported by the National Science Foundation (Awards #1722822 and #1750439), National Institutes of Health (Awards #R01MH125740, #R01MH096951, #U01MH116925, and #U01MH116923), BMW of North America, and SquirrelAI. PPL is supported by a Facebook PhD Fellowship and a Center for Machine Learning and Health Fellowship. RS is supported in part by NSF IIS1763562 and ONR Grant N000141812861. Any opinions, findings, conclusions, or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation, National Institutes of Health, Facebook, CMLH, Office of Naval Research, BMW of North America, and SquirrelAI, and no official endorsement should be inferred. We are extremely grateful to Amir Zadeh, Chaitanya Ahuja, Volkan Cirik, Murtaza Dalal, Benjamin Eysenbach, Tiffany Min, and Devendra Chaplot for helpful discussions and feedback, as well as Ziyin Liu and Chengfeng Mao for providing tips on working with financial time-series data. Finally, we would also like to acknowledge NVIDIA’s GPU support.
Appendix
A. Broader Impact Statement
Multimodal data and models are ubiquitous in a range of real-world applications. MultiBench and MultiZoo are our attempt to systematically categorize the plethora of datasets and models currently in use. While these contributions will accelerate research on multimodal datasets and models as well as their real-world deployment, we believe that special care must be taken in the following regards to ensure that these models are safely deployed for real-world benefit:
Time & space complexity:
Modern multimodal datasets and models are large, especially when building on already large pretrained unimodal datasets and models such as BERT or ResNets. The increasing time and space complexity of these models can cause financial impacts resulting from the cost of hardware, electricity, and computation, as well as environmental impacts resulting from the carbon footprint required to fuel modern hardware. Therefore, there has been much recent interest in building lightweight machine learning models [142].
MultiBench also provides several efforts in this direction:
Firstly, MultiBench alleviates the need for separate research groups to repeat preprocessing efforts when beginning to work on a new multimodal dataset, which often takes significant time when large video & audio datasets and feature extractors are involved.
Secondly, our standardized implementation of core approaches in MultiZoo prevents duplicate efforts in adapting approaches to new datasets. We found that while many authors of these multimodal methods released their code publicly on GitHub, there was still some effort needed to adapt their code and tune their models to achieve the best performance on our standardized implementation in MultiZoo. By standardizing these experimentation efforts, we can facilitate the sharing of code and trained models, ensure reproducibility across implementations, and save time and effort in the future.
Finally, MultiBench explicitly tests for complexity and encourages researchers to build lightweight models. While this has been less studied in multimodal research, we hope that our efforts will pave the way for greener multimodal learning.
Privacy and security:
There may be privacy risks associated with making predictions from multimodal data of recorded human behaviors. The datasets potentially in question include those in affective computing (recorded video data labeled for sentiment, emotions, and personality attributes) and healthcare (health data labeled for disease and mortality rate). Therefore, it is crucial to obtain user consent before collecting device data. In our experiments with real-world data where people are involved (i.e., healthcare and affective computing), the creators of these datasets have taken the appropriate steps to only access public data which participants/content creators have consented to release to the public (see details in Appendix C.2). We only use these datasets for research purposes. All data was anonymized and stripped of personal (e.g., personally identifiable information) and protected attributes (e.g., race, gender).
To deploy these algorithms at scale in the real world, it is also important to keep data and features private on each device without sending it to other locations using techniques such as federated learning [96, 100], differential privacy [55], or encryption [35].
Social biases:
We acknowledge that there is a risk of exposure bias due to imbalanced datasets, especially when human-centric data and possibly sensitive labels are involved. For example, will models trained on imbalanced data disproportionately classify videos of a particular gender as displaying a particular emotion? Models trained on biased data have been shown to amplify the underlying social biases especially when they correlate with the prediction targets [108]. This leaves room for future work in exploring methods tailored for specific scenarios such as mitigating social biases in words [18], sentences [99], images [118], and other modalities. Future research in multimodal models should also focus on quantifying the trade-offs between fairness and performance [186]. MultiBench enables the large-scale study of these crucial research questions and we outline some of our ongoing and future efforts in expanding the evaluation metrics in MultiBench to take these into account in Appendix I.
Possible biases within each dataset:
In this section, we expand upon the previous two points regarding privacy and social biases by describing the possible biases in each domain/dataset included in MultiBench.
Affective computing: Analysis of sentiment, emotions, and personality might carry biases if care is not taken to appropriately anonymize the video data used. In MultiBench, all models trained on affect recognition datasets use only pre-extracted non-invertible features that rely on general visual or audio features such as the presence of a smile or the magnitude of voice. Therefore, the features used in this paper cannot be used to identify the speaker [181, 183]. Furthermore, videos within the datasets all follow the creative commons license and follow the fair use guidelines of YouTube. This license is the standard way for content creators to grant someone else permission to use and redistribute their work. We use no information regarding gender, ethnicity, identity, or video identifier in online sources. We emphasize that the models trained to perform automated affect recognition should not in any way be used to harm individuals and should only be used as a scientific study.
In addition to privacy issues, we also studied the videos collected in these affective computing datasets and found no offensive content. While there are clearly expressions of highly negative sentiment or strong displays of anger and disgust, there are no offensive words used or personal attacks recorded in the video. All videos are related to movie or product reviews, TED talks, and TV shows.
Healthcare: The MIMIC dataset [78] has been rigorously de-identified in accordance with Health Insurance Portability and Accountability Act (HIPAA) such that all possible personal information has been removed from the dataset. Removed personal information include patient name, telephone number, address, and dates. Dates of birth for patients aged over 89 were shifted to obscure their true age. Please refer to Appendix C.2.2 for de-identification details. Again, we emphasize that any multimodal models trained to perform prediction should only be used for scientific study and should not in any way be used for real-world prediction.
Finance: There is no personal/human data included and there is no risk of personally identifiable information and offensive content.
Robotics: There is no personal/human data included and there is no risk of personally identifiable information and offensive content.
HCI: There is no personal/human data included and there is no risk of personally identifiable information and offensive content.
Multimedia: For MM-IMDb and AV-MNIST, there is no personal/human data included and there is no risk of personally identifiable information and offensive content. For Kinetics, all videos within the dataset are obtained from public YouTube videos that follow the creative commons license which allows content creators to grant permission to use and redistribute their work. We use no information regarding gender, ethnicity, identity, or video identifier in online sources. We emphasize that the models trained to perform action recognition should not in any way be used to harm individuals and should only be used as a scientific study. We also checked to make sure that these videos do not contain offensive content. All videos are related to human actions and do not contain any offensive words/audio.
Overall, MultiBench offers opportunities to study these potential issues at scale across modalities, tasks, datasets, and domains. We plan to continue expanding this benchmark to rigorously test for these social impacts to improve the safety and reliability of multimodal models. For example, in Appendix I.3.3, we describe some concrete extensions to include evaluations for fairness and privacy of multimodal models trained on the datasets in MultiBench. Our holistic evaluation metrics will also encourage the research community to quantify the tradeoffs between performance, complexity, robustness, fairness, and privacy in human-centric multimodal models.
B. Background: Multimodal Representation Learning
We first provide background focusing on multimodal representation learning and several core technical challenges in this area.
B.1 Problem Statement
We define a modality as a single particular mode in which a signal is expressed or experienced. Multiple modalities then refer to a combination of multiple signals each expressed or experienced in heterogeneous manners [10]. We distinguish between the possible temporal resolution of modalities that will impact the types of approaches used:
Static modalities include inputs without a time dimension, such as images and tabular data (i.e., a table of numerical data).
Temporal modalities include those coming in a sequence with a time-dimension such as language (a sequence of tokens), video (a sequence of frames/audio features/optical flow features), or time-series data (a sequence of data points indexed by time).
The first version of MultiBench focuses on benchmarks and algorithms for multimodal fusion, where the main challenge is to join information from two or more modalities to perform a prediction. Classic examples include audio-visual speech recognition where visual lip motion is fused with speech signals to predict spoken words [48]. Note that in fusion problems, it should be well-defined to predict the label with a single modality only, which marks an important distinction to tasks in question answering and grounding where one modality is used to query information in another (e.g., visual question answering [4] using a text question to query information in the image). We outline our plans to extend future versions of MultiBench to include more multimodal challenges such as question answering, retrieval, and grounding in Appendix I.
Formally, the multimodal fusion problem is defined as follows. We suppose there is a set of $M$ modalities $\mathbf{x}_1, \ldots, \mathbf{x}_M$ drawn from a joint distribution $p(\mathbf{x}_1, \ldots, \mathbf{x}_M, y)$, where each $\mathbf{x}_i$ is a random variable denoting data distributed according to modality $i$ and $y$ is a random variable representing the label. If modality $i$ is a static modality, $\mathbf{x}_i$ is a random vector without a time dimension. If modality $i$ is a temporal modality, $\mathbf{x}_i$ has a time dimension and can be represented as $\mathbf{x}_i = (\mathbf{x}_i^{(1)}, \ldots, \mathbf{x}_i^{(T_i)})$, where $T_i$ is the number of time-steps in the temporal modality.
A multimodal dataset is a collection of $n$ (data, label) draws from this joint distribution, which we denote as $\mathcal{D} = \{(\mathbf{x}_{1,j}, \ldots, \mathbf{x}_{M,j}, y_j)\}_{j=1}^{n}$. These draws from the true distribution are possibly biased (e.g., across individuals, topics, or labels) and noisy (e.g., due to noisy or missing modalities). A multimodal model is a set of functions $(f_1, \ldots, f_M, f_{\text{mm}})$ where each $f_i$ is a unimodal encoder, one for each modality, and $f_{\text{mm}}$ is a multimodal fusion network. The unimodal encoders are specially designed with domain knowledge to learn representations from each modality (e.g., convolutional networks for images, temporal models for time-series data), resulting in unimodal representations $\mathbf{z}_i = f_i(\mathbf{x}_i)$. The multimodal network is designed to capture information across unimodal representations and summarize it in a multimodal representation $\mathbf{z}_{\text{mm}} = f_{\text{mm}}(\mathbf{z}_1, \ldots, \mathbf{z}_M)$ that can be used to predict the label $\hat{y}$. The goal of multimodal fusion is to learn a model with the lowest prediction error as measured on a held-out test set, while also balancing other potential objectives such as low complexity and robustness to imperfect data.
B.2 Technical Challenges
MultiBench tests for the following holistic desiderata in multimodal fusion:
- Performance: We summarize the following core challenges across all prediction tasks for multimodal learning with reference to Baltrusaitis et al. [10]. Solving these challenges is essential in any multimodal prediction problem, regardless of domain and task.
- Unimodal structure and granularity: The information coming from each modality follows a certain underlying structure and invariance, which needs to be processed by suitable unimodal encoders. While there are certain generally adopted unimodal encoders for commonly studied modalities such as images and text, there remain challenges in designing unimodal encoders with the right types of inductive biases for other less-studied modalities such as tabular and time-series data. Representations extracted from unimodal encoders should contain task-relevant information from that modality, expressed at the right granularity.
- Multimodal complementarity: The information coming from different modalities has varying predictive power by itself and also when complemented by other modalities. We refer to these as higher-order interactions: first-order interactions define a predictive signal from a single granular unit of information in one modality to the label (e.g., the presence of a smile indicating positive sentiment); second-order interactions define a predictive signal from a pair of granular units of information across two modalities to the label (e.g., the presence of an eye-roll together with a positive word indicating sarcasm); and nth-order interactions extend the above definition to $n$ modalities. There are many possible interactions that explain the labels in a dataset, out of which only some may generalize to unseen data. It remains a challenge to discover these higher-order interactions using suitably expressive models. At the same time, the space of possible interactions is very large, which requires suitable inductive biases in model design (see challenges regarding complexity in model design below).
- Multimodal alignment: Information from different modalities often comes in different granularities. In order to learn predictive signals from higher-order interactions, there is a need to first identify the relations between granular units from two or more different modalities. This challenge requires a measure of the relationship between different modalities, which we call cross-modal alignment. When dealing with temporal data, it also requires capturing possible long-range dependencies across time, which we call temporal alignment. For example, it requires aligning the presence of an eye-roll together with a positive word to recognize sarcasm even when both signals happen at different times. This challenge extends cross-modal alignment to the temporal dimension.
- Complexity: The space of possible interactions is very large, which requires suitable inductive biases in model design. While more expressive models may perform better, these often come at the cost of time and space complexity during training and inference. To enable real-world deployment of multimodal models in a variety of settings [142], there is a need to build lightweight models with cheap training and inference.
- Robustness: Information from different modalities often displays different noise topologies, and real-world multimodal signals possibly suffer from missing or noisy data in at least one of the modalities [10]. While most methods are trained on carefully curated and cleaned datasets, there is a need to benchmark their robustness in realistic scenarios. The core challenge here is to build models that still perform well despite the presence of unimodal-specific or multimodal imperfections.
C. MultiBench Datasets
MultiBench provides a standardized machine learning pipeline that starts from data loading to running multimodal models, providing evaluation metrics, and a public leaderboard to encourage future research in multimodal representation learning (see Figure 5).
In this section, we provide additional details on the distribution, release, and maintenance of each of the datasets in MultiBench as well as the maintenance of MultiBench as a whole.
C.1 Dataset Selection
In this section, we discuss our choices of datasets in MultiBench. We select each dataset based on its data collection method, input modalities, evaluation tasks, evaluation metrics, and train/test splits, so that each reflects real-world multimodal applications. We consulted with domain experts in each of the application areas to select datasets that satisfy the following properties:
Realism in data collection, input modalities, preprocessing, and task: Each of the datasets in MultiBench reflects a subset of real-world sensory modalities collected in the wild. Realism is important since it brings natural noise topologies in each modality and in the prediction task. It is crucial that these datasets reflect real-world data such that capturing these imperfections through machine learning models can potentially bridge the gap towards real-world deployment.
Diversity in research area: We chose these research areas through a survey of recent research papers in multimodal learning across conferences in machine learning and beyond (e.g., HCI, NLP, vision, and robotics conferences). Furthermore, we consulted with domain experts in applying multimodal learning to their respective application areas to determine areas of high potential. Through engaging with domain experts, we were able to select research areas and datasets that reflected realism in data collection, input modalities, preprocessing, and tasks, which present challenges for machine learning models and potential for real-world transfer of learned algorithms. These research areas are designed to span both human-centric and data-centric machine learning. In the former, we selected HCI, healthcare, and robotics since these are fast-growing research areas with increasingly specialized tracks in machine learning conferences dedicated to them. In the latter, financial data analysis is an area with an inherently low signal-to-noise ratio, reflecting extremely noisy, imperfect, and uncertain real-world datasets which provide challenges for current multimodal models. We also included several multimedia datasets due to the large resources publicly available on the internet, which results in multimodal datasets of the largest scale.
Diversity in modalities: We started with a set of commonly studied modalities such as language, image, video, and audio. For each of the selected research areas, we consulted with domain experts to choose datasets that are established, but not overstudied. More importantly, we aimed for diversity in modalities to truly test the generalization capabilities of modern multimodal models outside of commonly studied domains and modalities. For example, while there is much work in HCI involving images and text, we chose a modality representing a set of mobile application UI elements for coverage. Similarly, in robotics, we consulted with domain experts to obtain datasets with high-frequency force and proprioception sensors that provide a unique challenge to machine learning researchers.
Challenging for ML models: We aim to choose datasets where the current state-of-the-art performance via machine learning models is still far from human performance (if human performance is provided, otherwise judged by a domain expert). This is to ensure that there is room for improvement through community involvement in this research area.
Community expansion: Finally, we would like to emphasize that we heavily encourage and actively seek out community participation in expanding MultiBench to keep up with the incredible pace in multimodal machine learning research. We describe our plans for an open call for proposals of new research areas, datasets, and prediction tasks in section I.
C.2 Dataset Details
We provide details for each of the research areas and datasets selected in MultiBench. In each of the categories, we describe the research area, the datasets and their associated data collection process, their access restrictions and licenses, and any data preprocessing or feature extraction we used following current work in each of these domains.
C.2.1 Affective Computing
1. MUStARD is a multimodal video corpus for research in automated sarcasm discovery [24]. The dataset is compiled from popular TV shows including Friends, The Golden Girls, The Big Bang Theory, and Sarcasmaholics Anonymous. MUStARD consists of audiovisual utterances annotated with sarcasm labels. Each utterance is accompanied by its context, which provides additional information on the scenario where the utterance occurs, thereby providing a further challenge in the long-range modeling of multimodal information. Sarcasm is specifically chosen as an annotation task since it requires careful modeling of complementary information, particularly when the semantic information from the modalities does not agree.
Data collection:
According to Castro et al., [24], they conducted web searches on YouTube using keywords such as Friends sarcasm, Chandler sarcasm, Sarcasm 101, and Sarcasm in TV shows to obtain videos with sarcastic content from three main TV shows: Friends, The Golden Girls, and Sarcasmaholics Anonymous. To obtain non-sarcastic videos, they used a subset of 400 videos from MELD, a multimodal emotion recognition dataset derived from the Friends TV series [128]. Videos from The Big Bang Theory were also collected by segmenting episodes using laughter cues from its audience.
Access restrictions:
While we do not have the license to this dataset, it is a public dataset free to download by the research community from https://github.com/soujanyaporia/MUStARD.
Licenses:
MIT, see https://github.com/soujanyaporia/MUStARD/blob/master/LICENSES
Dataset preprocessing:
We followed these preprocessing steps for each modality as suggested in the original paper [24]:
Language: Textual utterances are represented using pretrained BERT representations [42] as well as Common Crawl pre-trained 300-dimensional GloVe word vectors [119] for each token.
Visual: Visual features are extracted for each frame using the pool5 layer of an ImageNet [41] pretrained ResNet-152 [66] model. Every frame is first preprocessed by resizing, center-cropping, and normalizing it. We also use the OpenFace facial behavioral analysis tool [11] to extract facial expression features.
Audio: Low-level features from the audio data stream are extracted using the speech processing library Librosa [112]. We also extract COVAREP [39] features as is commonly used for the other datasets in the affective computing domain (see below).
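For illustration, below is a minimal, hedged sketch of low-level audio feature extraction with Librosa; the exact feature set and parameters used for MUStARD may differ, and the file path is a placeholder.

```python
import librosa
import numpy as np

# Minimal sketch of low-level audio feature extraction with Librosa.
# The exact features/parameters used for MUStARD may differ; "utterance.wav" is a placeholder.
y, sr = librosa.load("utterance.wav", sr=None)

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # (13, T) cepstral features
zcr = librosa.feature.zero_crossing_rate(y)         # (1, T) voicing-related feature
rms = librosa.feature.rms(y=y)                      # (1, T) loudness proxy

# Stack frame-level features into a single (T, d) sequence for the utterance.
features = np.concatenate([mfcc, zcr, rms], axis=0).T
print(features.shape)
```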
Train, validation, and test splits:
There are 414, 138, and 138 video segments in the train, valid, and test data respectively, which gives a total of 690 data points.
2. CMU-MOSI is a collection of 2,199 opinion video clips, each rigorously annotated with labels for subjectivity and sentiment intensity, together with per-frame and per-opinion annotated visual features and per-millisecond annotated audio features [181]. Sentiment intensity is annotated in the range [−3,+3], which enables fine-grained prediction of sentiment beyond the classical positive/negative split. Each video is collected from YouTube with a focus on video blogs (vlogs), which reflect the real-world distribution of speakers expressing their behaviors through monologue videos. CMU-MOSI is a realistic real-world multimodal dataset for affect recognition and is regularly used in competitions and workshops.
Data collection:
According to Zadeh et al., [181], videos were collected from YouTube with a focus on video blogs indexed by #vlog. A total of 93 videos were randomly selected. The final set of videos contained 89 distinct speakers, including 41 female and 48 male speakers. Most of the speakers were approximately between the ages of 20 and 30 from different backgrounds (e.g., Caucasian, African-American, Hispanic, Asian). All speakers expressed themselves in English and the videos originated from either the United States of America or the United Kingdom.
Access restrictions:
The authors are part of the team who collected the CMU-MOSI dataset [181] so we have the license and right to redistribute this dataset. CMU-MOSI was originally downloaded from https://github.com/A2Zadeh/CMU-MultimodalSDK.
Licenses:
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the conditions in https://raw.githubusercontent.com/A2Zadeh/CMU-MultimodalSDK/master/LICENSE.txt
Train, validation, and test splits:
Each dataset contains several videos, and each video is further split into short segments (roughly 10 − 20 seconds) that are annotated. We split the data at the level of videos so that segments from the same video will not appear across train, valid, and test splits. This enables us to train user-independent models instead of having a model potentially memorizing the average affective state of a user. There are 52, 10, and 31 videos in train, valid, and test data respectively. Splitting up these videos gives a total of 1,284, 229, and 686 segments respectively for a total of 2,199 data points.
Dataset preprocessing:
We follow current work [103, 183] and apply the following preliminary feature extraction for the CMU-MOSI dataset:
Language: GloVe word embeddings [119] were used to embed a sequence of individual words from video segment transcripts into a sequence of word vectors that represent spoken text. The GloVe word embeddings used are 300-dimensional embeddings trained on 840 billion tokens from the Common Crawl dataset, resulting in a sequence of dimension (number of words) × 300 after alignment (a minimal sketch of this token-to-vector mapping is shown after this list). The timing of word utterances is extracted using the P2FA forced aligner [176]. This extraction enables alignment between text, audio, and video.
Visual: We use the library Facet [75] to extract a set of visual features including facial action units, facial landmarks, head pose, gaze tracking, and HOG features. These visual features are extracted from the full video segment at 30Hz to form a sequence of facial gesture changes throughout time, resulting in a sequence of dimension (number of frames) × 35. In addition to Facet, the OpenFace facial behavioral analysis tool [11] is used to extract facial expression features, which include facial Action Units (AU) based on the Facial Action Coding System (FACS) [49].
Audio: The software COVAREP [39] is used to extract acoustic features including 12 Mel-frequency cepstral coefficients, pitch tracking and voiced/unvoiced segment features [46], glottal source parameters [28], peak slope parameters, and maxima dispersion quotients [79]. These acoustic features are extracted from the full audio clip of each segment at 100Hz to form a sequence that represents variations in tone of voice over an audio segment, resulting in a sequence of dimension (number of audio frames) × 74.
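As referenced above, the following is a minimal sketch of mapping a transcript to a (sequence length × 300) GloVe matrix. The GloVe file name, the example sentence, and the zero-vector handling of out-of-vocabulary tokens are illustrative assumptions rather than the exact MultiBench implementation.

```python
import numpy as np

def load_glove(path, vocab):
    """Load 300-d GloVe vectors for the words we actually need."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            word = " ".join(parts[:-300])  # some GloVe keys contain spaces
            if word in vocab:
                vectors[word] = np.asarray(parts[-300:], dtype=np.float32)
    return vectors

tokens = "but overall the movie was really good".split()
glove = load_glove("glove.840B.300d.txt", set(tokens))  # Common Crawl GloVe file (path is a placeholder)
unk = np.zeros(300, dtype=np.float32)                   # zero vector for out-of-vocabulary tokens
sequence = np.stack([glove.get(t, unk) for t in tokens])
print(sequence.shape)                                   # (number of words, 300)
```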
3. UR-FUNNY is the first large-scale multimodal dataset of humor detection in human speech [64]. UR-FUNNY is a realistic representation of multimodal language (including text, visual and acoustic modalities). This dataset opens the door to understanding and modeling humor in a multimodal framework, which is crucial since humor is an inherently multimodal communicative tool involving the effective use of words (text), accompanying gestures (visual), and prosodic cues (acoustic). UR-FUNNY consists of more than 16,000 video samples from TED talks which are among the most diverse idea-sharing channels covering speakers from various backgrounds, ethnic groups, and cultures discussing a variety of topics from discoveries in science and arts to motivational speeches and everyday events. The diversity of speakers, topics, and unique annotation targets make it a realistic dataset for multimodal language modeling.
Data collection:
According to Hasan et al., [64], 1,866 videos and their transcripts in English were collected from the TED portal, chosen from 1,741 different speakers and across 417 topics. The laughter markup is used to identify 8,257 humorous punchlines from the transcripts. The context is extracted from the sentences prior to the punchline (until the previous humor instance or the beginning of the video is reached). Using a similar approach, 8,257 negative samples are chosen at random intervals where the last sentence is not immediately followed by a laughter marker. After this negative sampling, there is a balanced 50% split in the dataset between positive and negative humor examples.
Access restrictions:
This is a public dataset free to download by the research community from https://github.com/ROC-HCI/UR-FUNNY. The authors of the dataset also note that videos on www.ted.com are publicly available for download [64].
Licenses:
No license was provided with this dataset.
Dataset preprocessing:
We follow current work [103, 183] and apply the same preliminary feature extraction as the CMU-MOSI dataset described above.
Train, validation, and test splits:
Each dataset contains several videos, and each video is further split into short segments (roughly 10 − 20 seconds) that are annotated. We split the data at the level of videos so that segments from the same video will not appear across train, valid, and test splits. This enables us to train user-independent models instead of having a model potentially memorizing the average affective state of a user. There are 1,166, 300, and 400 videos in train, valid, and test data respectively. Splitting up these videos gives a total of 10,598, 2,626, and 3,290 segments respectively for a total of 16,514 data points.
4. CMU-MOSEI is the largest dataset for sentence-level sentiment analysis and emotion recognition in real-world online videos [102, 183]. CMU-MOSEI contains more than 65 hours of annotated video from more than 1,000 speakers and 250 topics. Each video is annotated for sentiment, for the presence of 9 discrete emotions (angry, excited, fear, sad, surprised, frustrated, happy, disappointed, and neutral), and for continuous emotions (valence, arousal, and dominance). The diversity of prediction tasks makes CMU-MOSEI a valuable dataset to test multimodal models across a range of real-world affective computing tasks. The dataset has been continuously used in workshops and competitions revolving around human multimodal language.
Data collection:
According to Zadeh et al., [183], videos from YouTube are automatically analyzed for the presence of one speaker in the frame using face detection to ensure the video is a monologue and rejecting videos that have moving cameras. A diverse set of 250 frequently used topics in online videos is used as the seed for acquisition. The authors restrict the number of videos acquired from each channel to a maximum of 10 and limit the videos to have manual and properly punctuated transcriptions. After manual quality inspection, they also performed automatic checks on the quality of video and transcript using facial feature extraction confidence and forced alignment confidence, before balancing the gender in the dataset using the data provided by annotators (57% male to 43% female).
Access restrictions:
The authors are part of the team who collected the CMU-MOSEI dataset [183] so we have the license and right to redistribute this dataset. CMU-MOSEI was originally downloaded from https://github.com/A2Zadeh/CMU-MultimodalSDK.
Licenses:
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the conditions in https://raw.githubusercontent.com/A2Zadeh/CMU-MultimodalSDK/master/LICENSE.txt
Dataset preprocessing:
We follow current work [103, 183] and apply the same preliminary feature extraction as the CMU-MOSI and UR-FUNNY datasets described above.
Train, validation, and test splits:
Each dataset contains several videos, and each video is further split into short segments (roughly 10 − 20 seconds) that are annotated. We split the data at the level of videos so that segments from the same video will not appear across train, valid, and test splits. This enables us to train user-independent models instead of having a model potentially memorizing the average affective state of a user. There are 16,265, 1,869, and 4,643 segments in the train, valid, and test sets respectively for a total of 22,777 data points.
C.2.2 Healthcare
1. MIMIC-III (Medical Information Mart for Intensive Care III) [78] is a large, freely-available database comprising de-identified health-related data associated with over 40,000 patients who stayed in critical care units of the Beth Israel Deaconess Medical Center between 2001 and 2012. Following [129], we organized numerous patient data into two major modalities (using the 17 features in feature set A in [129]): a time-series modality, which is a set of medical measurements of the patient taken every hour over a period of 24 hours, where each measurement is a vector of size 12 (12 different measured numerical values); and a static modality, which is a set of medical information about the patient, represented in a vector of size 5. We use these modalities for 3 tasks: mortality prediction (6-class prediction on whether the patient dies in 1 day, 2 days, 3 days, 1 week, 1 year, or longer than 1 year), and 2 ICD-9 code predictions (binary classification on whether the patient fits any ICD-9 code in group 1 (140 − 239) and binary classification on whether the patient fits any ICD-9 code in group 7 (460 − 519)).
Data collection:
According to Johnson et al., [78], MIMIC contains data associated with 53,423 distinct hospital admissions for adult patients (aged 16 years or above) admitted to critical care units between 2001 and 2012, as well as 7,870 neonates admitted between 2001 and 2008. The data covers 38,597 distinct adult patients and 49,785 hospital admissions. Data was also downloaded from several sources, including archives from critical care information systems, hospital electronic health record databases, and Social Security Administration Death Master File.
Privacy:
Before data was incorporated into the MIMIC-III database, it was first de-identified in accordance with Health Insurance Portability and Accountability Act (HIPAA) standards using structured data cleansing and date shifting. The de-identification process removed all eighteen identifying data elements listed in HIPAA, such as patient name, date of birth (for patients over 89 years of age), telephone number, address, and dates. Protected health information was also removed from text fields, such as diagnostic reports and physician notes. We refer the reader to [129] for full de-identification details.
Access restrictions:
We do not have the license and right to redistribute this dataset. Accessing MIMIC requires the completion of a training course and approval for access on PhysioNet (https://physionet.org/about/database/). However, we provide our own data preprocessing scripts for MIMIC, which transform the raw data into the standardized format for multimodal data and perform standardized splitting into the train, validation, and test splits. For a new user getting started with MIMIC data, all they would need to do is to complete the training course and obtain approval of access for scientific research from PhysioNet before they can use our public code to load all extracted features from the raw dataset in a version that can directly be used for machine learning studies.
Licenses:
MIT, see https://github.com/mit-lcp/mimic-code/blob/main/LICENSE
Dataset preprocessing:
We followed the instructions on https://mimic.physionet.org/gettingstarted/access/ to download the dataset in the form of raw tables, then generated preprocessed data following the steps described in https://github.com/USC-Melady/Benchmarking_DL_MIMICIII (which takes 1 − 2 weeks of running time) to get the data used for experiments. Specifically, we use the data in the file 24hrs/series/imputed-normed-ep_1_24-stdized.npz. When accessing this data from our code repo, set imputed_path to the path of the npz file above in get_data.py, and the script will generate the PyTorch data loaders for the tasks (where we normalize the data).
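As a rough illustration of this last step, the sketch below turns a preprocessed npz archive into a PyTorch data loader. The array key names ("ep_tdata", "adm_features_all", "y_mor") are hypothetical placeholders; get_data.py in our repository handles the exact keys, task labels, and normalization.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset

# Minimal sketch of wrapping the preprocessed MIMIC arrays into a PyTorch loader.
# Key names below are hypothetical; see get_data.py for the actual loading logic.
archive = np.load("24hrs/series/imputed-normed-ep_1_24-stdized.npz", allow_pickle=True)
time_series = torch.as_tensor(archive["ep_tdata"], dtype=torch.float32)     # (N, 24, 12) hourly measurements
static = torch.as_tensor(archive["adm_features_all"], dtype=torch.float32)  # (N, 5) static patient information
labels = torch.as_tensor(archive["y_mor"], dtype=torch.long)                # (N,) mortality labels

dataset = TensorDataset(time_series, static, labels)
loader = DataLoader(dataset, batch_size=64, shuffle=True)
```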
Train, validation, and test splits:
We split the data into train/valid/test sets randomly (using a fixed random seed) in an 80:10:10 ratio (i.e., 28,970 train, 3,621 valid, and 3,621 test data points) for a total of 36,212 data points.
C.2.3 Robotics
1. MuJoCo Push is a planar pushing task in which a 7-DoF Franka Panda robot pushes a circular puck with its end-effector in simulation. The task is to estimate the 2D position of the unknown object on a table surface while the robot intermittently interacts with it. Similar to Vision&Touch, planar pushing is a contact-rich task. However, instead of estimating robot states, this dataset involves estimating the state of the object the robot is currently interacting with. While other robotics datasets have also studied planar pushing [14, 175], Yu et al., [175] use a Vicon tracker (instead of raw RGB images) while Bauza et al., [14] only collect visual and proprioceptive data.
Data collection:
According to Lee et al. [90], this dataset consists of 1000 trajectories, each with 250 steps at 10 Hz, of a simulated Franka Panda robot arm pushing a circular puck in MuJoCo [152]. The pushing actions are generated by a heuristic controller that tries to move the end-effector to the center of the object. The multimodal inputs are gray-scaled images (1 × 32 × 32) from an RGB camera, forces (and binary contact information) from a force/torque sensor, and the 3D position of the robot end-effector. The task is to predict the 2D planar object pose, which we measure by MSE.
Access restrictions:
While we do not have the license to this dataset, it is a public dataset free to download by the research community from https://github.com/brentyi/multimodalfilter/.
Licenses:
MIT, see https://github.com/brentyi/multimodalfilter/blob/master/LICENSE.
Dataset preprocessing:
Training, validation, and test data are each in their own files and can be used directly after downloading. Data is normalized using mean and variance from the train set.
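As a brief illustration of this normalization step, the sketch below standardizes all splits using statistics computed on the training set only; the array shapes and names are illustrative placeholders rather than the exact file layout of this dataset.

```python
import numpy as np

# Minimal sketch: normalize every split with the mean and standard deviation of the train set.
def normalize_with_train_stats(train, valid, test, eps=1e-8):
    mean = train.mean(axis=0, keepdims=True)
    std = train.std(axis=0, keepdims=True)  # standard deviation derived from the train-set variance
    return tuple((x - mean) / (std + eps) for x in (train, valid, test))

# Placeholder arrays standing in for one sensor stream (e.g., end-effector positions).
train = np.random.randn(29000, 16, 3)
valid = np.random.randn(290, 16, 3)
test = np.random.randn(8700, 16, 3)
train_n, valid_n, test_n = normalize_with_train_stats(train, valid, test)
```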
Train, validation, and test splits:
This dataset contains 1000 training, 10 validation, and 300 test trajectories. Each trajectory is split into 29 time-series sequences of length 16. The total numbers of data points for training, validation, and test are 29,000, 290, and 8,700 respectively, for a total of 37,990 data points.
2. Vision&Touch is a real-world robot manipulation dataset that collects visual, force, and robot proprioception data (as well as the robot actions) for a peg insertion task. The robot is a 7-DoF, torque-controlled Franka Panda robot, which has a triangle peg attached to its end-effector. Rigidly attached to the table in front of the robot is a box with a triangle hole. The robot attempts to insert the peg into the hole, a contact-rich manipulation task that has been studied for decades due to its relevance in manufacturing. Vision, force, and proprioception are feedback modalities shown to be complementary and concurrent during contact-rich manipulation [17].
Data collection:
According to Lee et al., [92], the data is collected by running on the robot a random policy (that takes random actions) as well as a heuristic policy (that attempts peg insertion). Four sensor modalities are available, including robot proprioception, an RGB-D camera, and a force-torque sensor. The proprioceptive input is the robot end-effector pose as well as linear and angular velocity, computed using forward kinematics. RGB images and depth maps are recorded from a fixed camera (Kinect v2 camera) pointed at the robot. Input images to our model are down-sampled to 128 × 128. The force sensor provides 6-axis feedback on the forces and moments along the x, y, and z axes. The OptoForce force sensor is mounted between the last joint and the peg. The robot action data is also collected at every timestep. The robot action is the Cartesian end-effector position displacement and z-axis roll rotation of the end-effector. There are 150 trajectories collected, each with 1000 timesteps of data. While the dataset was originally intended for representation learning for reinforcement learning, we use 2 tasks from the Vision&Touch dataset: (1) predicting binary contact in the next time step and (2) predicting end-effector position measured in MSE.
Access restrictions: While we do not have the license to this dataset, it is a public dataset free to download by the research community from https://github.com/stanford-iprl-lab/multimodal_representation/.
Licenses: MIT, see https://github.com/stanford-iprl-lab/multimodal_representation/blob/master/LICENSE.
Dataset preprocessing:
The dataset has already been pre-processed and can be downloaded directly at https://github.com/stanford-iprl-lab/multimodal_representation/. The dataset comes as a zipped file with 3000 hdf5 files, each with 50 timesteps of data. In order to obtain action-conditional contact as well as robot end-effector position labels, the dataset uses the contact and end-effector position data from the next timestep. Since the data from the first time step cannot be used as a label, only 49 of the 50 timesteps of data per file can be used.
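For illustration, the sketch below pairs each timestep's inputs with the next timestep's contact and end-effector labels, yielding 49 usable examples per 50-step file. The hdf5 key names and the directory path are hypothetical placeholders; the released MultiBench loader defines the actual keys and shapes.

```python
import glob
import h5py

# Minimal sketch: build (inputs at t, labels at t+1) pairs from each 50-step hdf5 file.
# Key names ("image", "force", "proprio", "contact", "ee_pos") and the directory are placeholders.
examples = []
for path in sorted(glob.glob("vision_and_touch/*.h5")):
    with h5py.File(path, "r") as f:
        image = f["image"][:]      # per-timestep camera frames
        force = f["force"][:]      # per-timestep force/torque readings
        proprio = f["proprio"][:]  # per-timestep proprioception
        contact = f["contact"][:]  # per-timestep binary contact
        ee_pos = f["ee_pos"][:]    # per-timestep end-effector position
    # Inputs come from timesteps 0..48; labels come from timesteps 1..49.
    for t in range(len(image) - 1):
        examples.append((image[t], force[t], proprio[t], contact[t + 1], ee_pos[t + 1]))
```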
Train/validation split:
This dataset uses an 80:20 training and validation split. There are 117,600 training data points and 29,400 validation data points. Since the original dataset does not contain test data, we report validation performance instead of test performance for this dataset.
C.2.4 Finance
We created the following financial datasets, which consist of historical stock data retrieved from publicly available online financial databases. We record the opening price of each stock from 2000-06-01 to 2021-02-28, which creates a total of 5,218 time steps. Details of each dataset are described in its own section below.
Stocks-F&B consists of 18 selected stocks from S&P 500 stocks categorized by GICS as Restaurants or Packaged Foods & Meats. We select mcd, sbux, hsy, and hrl for initial experiments on this dataset, record their opening prices, and preprocess the data following the preprocessing procedures below.
Stocks-Health consists of 63 selected stocks from S&P 500 stocks categorized by GICS as Health Care. We select mrk, wst, cvs, mck, abt, unh, and tfx for initial experiments on this dataset, record their opening prices, and preprocess the data following the preprocessing procedures below.
Stocks-Tech consists of 100 selected stocks from S&P 500 stocks categorized by GICS as Information Technology or Communication Services. We select aapl, msft, amzn, intc, amd, and msi for initial experiments on this dataset, record their opening prices, and preprocess the data following the preprocessing procedures below.
Access restrictions:
The datasets were collected from Yahoo Finance, which is publicly available but does not allow redistribution of their data. We provide automated download and preprocessing scripts for this dataset.
Licenses:
We could not find a finance dataset with a free redistribution license that includes historical financial data. As such, we provide automated download and preprocessing scripts as part of this project, which utilizes the open-source pandas-datareader to download raw finance data. We used the open-source code at https://github.com/pydata/pandas-datareader/blob/master/pandas_datareader/yahoo/components.py. The automated scripts we provide are licensed under an MIT License.
Dataset preprocessing:
Data is downloaded, converted to returns, and normalized. Labels are converted to squared returns. Each time series is split in chronological order, where the test split corresponds to the latest prices. For each data point, 500 previous returns are used to predict the squared return of the next day. The first 500 time steps are not predicted since they do not have 500 previous steps. We consider each stock as a modality; unimodal datasets have the input stock identical to the target stock. To keep memory usage practical for MulT [154] models, we evenly separate the stocks into 3 groups and use each group as a modality when preprocessing for MulT [154].
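The sketch below illustrates this preprocessing on a single synthetic price series: prices are converted to returns, inputs are normalized, the label is the squared return of the next day, and each input is a window of the 500 previous returns. The exact ordering of normalization and the split-aware details follow our released download and preprocessing scripts; the synthetic series here is only a placeholder.

```python
import numpy as np
import pandas as pd

# Placeholder opening-price series for one stock (the real data is fetched with pandas-datareader).
prices = pd.Series(100 * np.exp(np.cumsum(0.001 * np.random.randn(5218))))

returns = prices.pct_change().dropna().to_numpy()
targets = returns ** 2                                         # label: squared return (volatility proxy)
inputs = (returns - returns.mean()) / (returns.std() + 1e-8)   # normalized input returns

window = 500
X = np.stack([inputs[i - window:i] for i in range(window, len(inputs))])  # 500 previous returns
y = targets[window:]                                                       # squared return of the next day
print(X.shape, y.shape)
```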
Train, validation, and test splits:
We split the data according to time. There are 3200 continuous days of stock prices in the train data (2002–06-04 start to 2015–02-18 end date), 500 continuous days of stock prices in the valid data (2015–02-19 start to 2017–02-10 end date), and 1017 continuous days of stock prices in the test data (2017–02-13 start to 2021–02-26 end date).
C.2.5 HCI
1. ENRICO (Enhanced Rico) [93] is a dataset of Android app screens categorized by their design motifs. ENRICO was collected to help data-driven design applications such as design search, UI layout generation, UI code generation, and user interaction modeling. ENRICO is a subset of RICO [40], which is a large dataset of app screens collected by the automated and semi-automated “crawling” of Android apps available on the Google Play Store.
The RICO and ENRICO datasets have been used as benchmarks for data-driven models of design in scaffolding the creation of mobile apps. These constitute a set of relevant examples that help designers understand best practices and trends in building human-centered interfaces. Building multimodal models on these examples will enable systems that can predict whether a UI design will achieve its targeted goals even before it is deployed to millions of people. In the long run, this will enable the large-scale creation of personalized UI designs that can automatically adapt to diverse users and contexts.
The authors of ENRICO employed two main modalities for app classification: (1) the app screenshot and (2) the view hierarchy. The app screenshot is given in the form of an image. The view hierarchy is a type of metadata associated with some UI screens that describes the spatial and structural layout of UI elements. This view hierarchy can be treated as a set since it contains an unordered collection of UI elements, each containing metadata about its spatial and structural layout.
Data collection:
The original RICO dataset was collected using a combination of manual (i.e., crowdworkers) and automated (i.e., app crawler) methods. More information about how the apps were downloaded and captured is available in the RICO paper [40]. The ENRICO dataset is a subset of RICO that was created by first randomly sampling 10,000 screens from RICO and labeling a high-quality subset (1,460 screens) that can be categorized into 20 design categories. More information about the collection and annotation process is available in the ENRICO paper [93].
Access restrictions:
While we do not have the license to this dataset, it is a public dataset free to download by the research community from https://github.com/luileito/enrico.
Licenses:
MIT, see https://github.com/luileito/enrico/blob/master/LICENSE
Dataset preprocessing:
We extract the following features from each modality:
Image: The authors of ENRICO used a VGG-16 network (augmented with batch normalization and dropout) to encode app screenshots. To reduce overfitting on the relatively small dataset (1460 examples), we use a VGG-11 network pre-trained on ImageNet, with a frozen feature extraction network and a slimmed-down classifier network.
Set: We followed prior modeling approaches [40, 93] to represent the view hierarchy as a set of UI elements spatially rendered as a “wireframe” (similar to a semantic map). The wireframe was then fed into the same VGG-11 network used to encode the screenshot (a minimal sketch of this shared VGG-11 encoder follows below). Another possibility, which we briefly explored, is to use a set encoder [184], i.e., a permutation-invariant function that computes a pooled representation of the set of UI elements. We found that the CNN-based approach resulted in better performance, as it allowed the network to be initialized from a pre-trained checkpoint, although our experiments were initial and there is still ample room for future work to explore better encoders for this set modality.
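Below is a minimal sketch of the frozen, ImageNet-pretrained VGG-11 encoder with a slimmed-down classifier head described above, applied to both screenshots and rendered wireframes. The hidden size and other hyperparameters are illustrative assumptions; see the MultiBench code for the exact configuration.

```python
import torch
import torch.nn as nn
from torchvision import models

# Minimal sketch: ImageNet-pretrained VGG-11 with a frozen convolutional backbone
# and a slimmed-down classifier head for the 20 ENRICO design categories.
vgg = models.vgg11(pretrained=True)
for p in vgg.features.parameters():
    p.requires_grad = False                 # freeze the feature extraction network

vgg.classifier = nn.Sequential(             # slimmed-down classifier (hidden size is an assumption)
    nn.Linear(512 * 7 * 7, 256),
    nn.ReLU(inplace=True),
    nn.Dropout(0.5),
    nn.Linear(256, 20),
)

screens = torch.randn(4, 3, 224, 224)       # batch of screenshots or rendered wireframes
logits = vgg(screens)
print(logits.shape)                         # torch.Size([4, 20])
```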
Train, validation, and test splits:
The original paper does not provide official splits for training, validation, and testing. We used a known seed to deterministically shuffle the dataset and create splits for training (65%, 947 examples), validation (15%, 219 examples), and testing (20%, 292 examples).
C.2.6 Multimedia
1. AV-MNIST is a multimodal dataset created by pairing audio of a human reading digits from the FSDD dataset [1] with written digits from the MNIST dataset [88], with the task of classifying each digit into one of 10 classes (0 − 9). Since existing models can already complete the digit recognition task from either modality quite well, one common practice in previous work [161] is to increase the difficulty by removing 75% of the energy in the visual modality via PCA and adding noise from ESC-50 [125] to the audio modality, such that models have to leverage information from both modalities to make accurate predictions. ESC-50 is a realistic dataset collected from real-world audio of various everyday objects. Therefore, AV-MNIST serves as a good starting point: a relatively simple multimodal dataset but with underlying challenges of complementarity and noisy data. In fact, the method of injecting real-world background noises into the audio modality also inspired more tests for robustness included in MultiBench. AV-MNIST has served as a popular benchmark for evaluating the effectiveness of multimodal fusion models [122, 161].
Data collection:
According to Vielzeuf et al., [161], AV-MNIST starts with the entirety of the MNIST image and FSDD audio datasets. The audio samples are augmented by adding randomly chosen ‘noise’ samples from the ESC-50 dataset [125], to reach the same number of examples as in MNIST (55000 training, 5000 validation, and 10000 testing examples).
Access restrictions:
This dataset is programmatically generated by combining 2 unimodal datasets: MNIST and FSDD (with the additional audio signal from ESC-50). While we do not have the license to these datasets, they are public datasets free to download by the research community.
Licenses:
MNIST is released with a Creative Commons Attribution-Share Alike 3.0. FSDD is released with a Creative Commons Attribution-ShareAlike 4.0 International license. ESC-50 is released with a Creative Commons Attribution Non-Commercial license. All of these licenses allow redistribution of the datasets.
Dataset preprocessing:
To create the dataset, we downloaded MNIST from http://yann.lecun.com/exdb/mnist/, FSDD from https://github.com/Jakobovski/free-spoken-digit-dataset, ESC-50 from https://github.com/karolpiczak/ESC-50, and generated AV-MNIST with the scripts provided in https://github.com/slyviacassell/_MFAS/blob/master/datasets/avmnist_gen.py. Note that since the official implementation of the preprocessing is not released, our preprocessing, as well as all other existing preprocessing scripts, may differ from the original preprocessing in some details (such as whether at most or at least 25% of the energy is kept in the image modality, and the parameters used when adding noise to the audio; a hedged sketch of the PCA step is shown below), so the performance of models on our version of AV-MNIST should not be compared directly with the performance of models on AV-MNIST in other papers.
No further preprocessing is done for the image modality. The audio is converted to a 112 × 112 spectrogram. See the code in https://github.com/slyviacassell/_MFAS/blob/master/datasets/avmnist_gen.py for details.
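As referenced above, the following is a hedged sketch of the image-side degradation: project flattened MNIST digits onto a PCA basis, keep only the leading components explaining roughly 25% of the variance ("energy"), and reconstruct. The random placeholder data and exact parameters are assumptions; the generation script linked above defines the actual procedure.

```python
import numpy as np
from sklearn.decomposition import PCA

# Placeholder data standing in for flattened 28x28 MNIST digits.
images = np.random.rand(1000, 28 * 28)

# Keep only the principal components explaining ~25% of the variance, then reconstruct,
# which removes roughly 75% of the image "energy" before the images are used for training.
pca = PCA(n_components=0.25, svd_solver="full")
reduced = pca.fit_transform(images)
degraded = pca.inverse_transform(reduced)
print(reduced.shape, degraded.shape)
```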
Train, validation, and test splits:
Data splits for AV-MNIST follow that of the MNIST dataset, with 55000 training, 5000 validation, and 10000 testing examples.
2. MM-IMDb is the largest publicly available multimodal dataset for genre prediction on movies [8]. MM-IMDb starts from the movies of the MovieLens 20M dataset and expands this dataset by collecting genre, poster, and plot information for each movie. The final dataset contains ratings for 25,959 movies. MM-IMDb is a realistic real-world multimodal dataset and is a popular benchmark for multimodal learning [8, 81, 122].
Data collection:
According to Arevalo et al., [8], the MM-IMDb dataset is built with the IMDb ids provided by the MovieLens 20M dataset, which contains ratings of 27,000 movies. Using the IMDbPY library, movies that do not contain their poster image were filtered out. The resulting dataset comprises 25,959 movies along with their plot, poster, genres, and 50 additional metadata fields such as year, language, writer, director, aspect ratio, etc. The task is to perform multilabel classification over 23 movie genres.
Access restrictions:
While we do not have the license to this dataset, it is a public dataset free to download by the research community from http://lisi1.unal.edu.co/mmimdb/ and https://github.com/johnarevalo/gmu-mmimdb/.
Licenses:
MIT, see https://github.com/johnarevalo/gmu-mmimdb/blob/master/LICENSE
Dataset preprocessing:
We used the same method as [8] to extract features from texts and images.
Text: We used pretrained Google word2vec embeddings to extract text features. The final vocabulary contains 41,612 words, which is the intersection of the Google word2vec vocabulary and the words in the MM-IMDb plots. We converted all text to lowercase following existing work.
Image: All images were scaled, and cropped when required, to 160 × 256 pixels keeping the aspect ratio. A VGG-16 model [139] is applied as the image feature extractor. This CNN consists of 5 convolutional layers with squared filters of sizes 5, 3, 3, 3, and 3 and 2 × 2 pooling; each convolutional layer has 16 hidden units. The convolutional layers are connected to a MaxoutMLP on top.
Train, validation, and test splits:
The MM-IMDb dataset is split by genre into train, valid, and test sets containing 15,552, 2,608, and 7,799 movies respectively. The split was performed so that the training, validation, and test sets comprise 60%, 10%, and 30% of the samples of each genre respectively.
3. Kinetics is a series of large-scale datasets of curated video clips covering a diverse range of human actions. We use the original Kinetics-400 dataset [80], which contains 400 human action classes with at least 400 video clips for each action. Each clip lasts around 10s and is taken from a different YouTube video. This is one of the largest publicly available multimodal datasets, with a total of 306,245 video clips spanning 400 human actions. Therefore, Kinetics is suitable for testing the scalability of multimodal models to extremely large datasets. Furthermore, recognizing human actions is a core challenge in a variety of applications such as human-AI interaction, robotics, and human behavior analysis.
The sheer scale of the Kinetics dataset means that even the simplest models take up to several weeks to finish training. To enable multimodal learning from video and audio while also increasing access across researchers with limited computing resources, we subsample the Kinetics dataset into small and large partitions:
Kinetics-S:
We subsampled 5 human actions: archery, breakdancing, crying, dining, and singing, and retained all video clips annotated for these 5 actions. We selected these actions randomly out of the 400 actions in Kinetics-400. This gave us a total of 2,624 video clips in the small version of the dataset. Training a basic supervised learning model on Kinetics-S takes roughly 2 hours on a single GPU.
Kinetics-L:
This represents the entire Kinetics-400 dataset with 306,245 video clips spanning 400 human actions. Training a basic supervised learning model on Kinetics-L takes roughly 2 weeks on a single GPU.
Data collection:
We refer the reader to Kay et al., [80] for a detailed description of the dataset collection process. Briefly, the authors (1) started with a list of human actions from sources spanning existing action datasets, motion capture, and crowdsourcing, (2) obtained candidate clips from YouTube and extracted temporal positions within each video, (3) performed manual labeling of human actions with Amazon’s Mechanical Turk, and (4) cleaned up and de-noised the selected videos.
Access restrictions:
While we do not have the license to this dataset, it is a public dataset free to download by the research community from https://deepmind.com/research/open-source/kinetics.
Licenses:
Creative Commons Attribution 4.0 International, so we are free to share, copy, and redistribute the material in any medium or format, see https://deepmind.com/research/open-source/kinetics.
Dataset preprocessing:
We downloaded links from https://deepmind.com/research/open-source/kinetics and preprocessed them with the torchvision Kinetics scripts.
We processed the video and audio modalities as follows:
Video: We use 150 × 224 × 224 × 3 input clips, created with a frame skip of 2, a center crop with shape (224,224), and the normalization step required for using torchvision.models.
Audio: We use log-scaled mel spectrograms with 763 temporal frames by 40 Mel filters, element-wise averaging 2-channel waveforms to yield single channel ones.
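For illustration, the sketch below mirrors this audio pipeline: a stereo waveform is averaged to mono and converted to a log-scaled mel spectrogram with 40 mel filters. The sample rate, FFT size, and hop length are illustrative assumptions chosen so that a roughly 10-second clip yields on the order of the 763 temporal frames quoted above.

```python
import torch
import torchaudio

# Placeholder 10-second stereo clip at 16 kHz.
waveform = torch.randn(2, 16000 * 10)
mono = waveform.mean(dim=0, keepdim=True)   # element-wise average of the 2 channels

# Log-scaled mel spectrogram with 40 mel filters (parameters are assumptions).
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_fft=400, hop_length=210, n_mels=40
)(mono)
log_mel = torch.log(mel + 1e-6)
print(log_mel.shape)                        # (1, 40, T) with T around 763
```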
Train, validation, and test splits:
We use the 80.5/6.5/13 split provided by the original dataset, taking all the data points in our chosen classes. This yields 2,112, 171, and 341 data points in the train, validation, and test splits respectively for Kinetics-S, and 246,527, 19,906, and 39,812 data points in the train, validation, and test splits respectively for Kinetics-L.
C.3 Documentation
We provide documentation for MultiBench in the form of datasheets for datasets [54]:
- Motivation
-
For what purpose was the dataset created? Was there a specific task in mind? Was there a specific gap that needed to be filled? Please provide a description.
Learning multimodal representations involves integrating information from multiple heterogeneous sources of data. It is a challenging yet crucial area with numerous real-world applications in multimedia, affective computing, robotics, finance, and healthcare. Unfortunately, current research focuses primarily on a fixed set of modalities and tasks without a concrete understanding of generalization across domains and modalities, complexity during training and inference, and robustness to noisy and missing modalities. In order to standardize multimodal research and accelerate progress towards understudied modalities and tasks while ensuring real-world robustness, we release MultiBench, a systematic and unified large-scale benchmark for multimodal learning spanning 15 datasets, 10 modalities, 20 prediction tasks, and 6 research areas. MultiBench provides an automated end-to-end machine learning pipeline that simplifies and standardizes data loading, experimental setup, and model evaluation. To enable holistic evaluation, MultiBench summarizes both performance as well as the potential drawbacks involving increased time and space complexity and risk of decreased robustness from other modalities. To accompany MultiBench, we also provide a standardized implementation of 20 core approaches in multimodal learning unifying innovations in fusion paradigms, optimization objectives, and training approaches. MultiBench datasets present significant challenges of scalability to large-scale multimodal datasets and robustness to realistic imperfections, which present fruitful opportunities for future research. We hope that MultiBench will present a milestone in unifying disjoint efforts in multimodal machine learning research and paves a way towards a better understanding of the capabilities and limitations of multimodal models, all the while ensuring ease of use, accessibility, and reproducibility. MultiBench, our standardized implementation, and leaderboards are publicly available, will be regularly updated, and welcomes inputs from the community.
-
Who created the dataset (e.g., which team, research group) and on behalf of which entity (e.g., company, institution, organization)?
MultiBench is created primarily by the MultiComp Lab in the Language Technologies Institute and Machine Learning Department of the School of Computer Science at Carnegie Mellon University, in collaboration with several other researchers in the Human-Computer Interaction Institute and Computer Science Department at Carnegie Mellon University as well as at Johns Hopkins University, Stanford University, and UT Austin. The creation of MultiBench is for purely research purposes only.
-
Who funded the creation of the dataset? If there is an associated grant, please provide the name of the grantor and the grant name and number.
This material was based upon work partially supported by the National Science Foundation (Awards #1722822 and #1750439) and National Institutes of Health (Awards #R01MH125740, #R01MH096951, #U01MH116925, and #U01MH116923), NSF IIS1763562, and ONR Grant N000141812861. Any opinions, findings, and conclusions, or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation or National Institutes of Health, and no official endorsement should be inferred.
-
Any other comments?
No.
- Composition
-
What do the instances that comprise the dataset represent (e.g., documents, photos, people, countries)? Are there multiple types of instances (e.g., movies, users, and ratings; people and interactions between them; nodes and edges)? Please provide a description.
We describe each dataset in detail in Appendix C.2. MultiBench provides a comprehensive suite of multimodal datasets to benchmark current and proposed approaches in multimodal representation learning. It covers a diverse range of research areas (affective computing, healthcare, robotics, finance, HCI, and multimedia), dataset sizes (small, medium, and large), input modalities (language, image, video, audio, time-series, tabular, optical flow, force sensor, proprioception sensor, and set), and prediction tasks (affect recognition, robot manipulation, stock prediction, design interface, action recognition, movie genre prediction, and digit prediction).
-
How many instances are there in total (of each type, if appropriate)?
We describe each dataset’s statistics in detail in Appendix C.2. We chose datasets to span small, medium, and large sizes. The smallest dataset contains 1,460 instances (and training a model takes roughly a few minutes on a single GPU) while the largest one contains 306,245 instances (and training a model takes roughly 2 weeks on a single GPU). This enables accessibility for researchers with limited computational resources, while also allowing for large-scale studies of multimodal datasets and models.
-
Does the dataset contain all possible instances or is it a sample (not necessarily random) of instances from a larger set? If the dataset is a sample, then what is the larger set? Is the sample representative of the larger set (e.g., geographic coverage)? If so, please describe how this representativeness was validated/verified. If it is not representative of the larger set, please describe why not (e.g., to cover a more diverse range of instances, because instances were withheld or unavailable).
Each of the datasets is collected in different ways that we detail in Appendix C.2. To summarize, each dataset consists of samples from a larger set since it is impossible to include all videos/stock data/medical data/robotics data in the world. Each dataset is collected with the aim to be representative of the entire population.
-
What data does each instance consist of? “Raw” data (e.g., unprocessed text or images) or features? In either case, please provide a description.
We describe in detail the raw data and processed features for each dataset in Appendix C.2. To summarize, MultiBench contains both raw modality data as well as processed data with predefined feature extractors following current work.
-
Is there a label or target associated with each instance? If so, please provide a description.
We describe in detail the labels for each dataset in Appendix C.2. To summarize, MultiBench contains 6 research areas with a total of 15 prediction tasks spanning affect recognition, robot manipulation, stock prediction, design interface, action recognition, movie genre prediction, and digit prediction.
-
Is any information missing from individual instances? If so, please provide a description, explaining why this information is missing (e.g., because it was unavailable). This does not include intentionally removed information, but might include, e.g., redacted text.
No, all datasets are provided in full. For robustness tests, we do inject noise and imperfections into each dataset to simulate the performance of machine learning models on real-world imperfections (see Appendix D.3 for details).
-
Are relationships between individual instances made explicit (e.g., users’ movie ratings, social network links)? If so, please describe how these relationships are made explicit.
We describe in detail the relationships between modalities for each dataset in Appendix C.2.
-
Are there recommended data splits (e.g., training, development/validation, testing)? If so, please provide a description of these splits, explaining the rationale behind them.
Yes, MultiBench provides a data loading pipeline that directly loads train, validation, and test splits according to current work. We provide these details for each dataset in Appendix C.2.
-
Are there any errors, sources of noise, or redundancies in the dataset? If so, please provide a description.
We do not know of any errors in each of the datasets included in MultiBench. However, we will always be on the lookout for potential issues and update them via https://cmu-multicomp-lab.github.io/multibench/ and https://github.com/pliang279/MultiBench.
-
Is the dataset self-contained, or does it link to or otherwise rely on external resources (e.g., websites, tweets, other datasets)? If it links to or relies on external resources, a) are there guarantees that they will exist, and remain constant, over time; b) are there official archival versions of the complete dataset (i.e., including the external resources as they existed at the time the dataset was created); c) are there any restrictions (e.g., licenses, fees) associated with any of the external resources that might apply to a future user? Please provide descriptions of all external resources and any restrictions associated with them, as well as links or other access points, as appropriate.
Most of the datasets in MultiBench have been collected, stored, processed, and are self-contained. There are some datasets that depend on external resources, which we explain below:
- MIMIC: We depend on the original dataset to be hosted on https://mimic.physionet.org/gettingstarted/access/. Unfortunately, since we are not allowed to redistribute the raw data and users need to complete training to access the raw data, we are unable to provide a self-contained version of the MIMIC dataset. We are currently planning to add several new multimodal datasets in the healthcare domain that can be self-contained after appropriate de-identification.
- Finance: Yahoo Finance prohibits the redistribution of their data. We depend on the original data to be hosted on Yahoo Finance and provide automated downloading and preprocessing scripts for the datasets based on pandas-datareader, which has original code at https://github.com/pydata/pandas-datareader/blob/master/pandas_datareader/yahoo/components.py
-
Does the dataset contain data that might be considered confidential (e.g., data that is protected by legal privilege or by doctor-patient confidentiality, data that includes the content of individuals’ non-public communications)? If so, please provide a description.
From the authors of MIMIC [78]: “The project was approved by the Institutional Review Boards of Beth Israel Deaconess Medical Center (Boston, MA) and the Massachusetts Institute of Technology (Cambridge, MA). Requirement for individual patient consent was waived because the project did not impact clinical care and all protected health information was de-identified.”
To the best of our knowledge, all other datasets do not contain confidential data and are publicly available for research purposes.
-
Does the dataset contain data that, if viewed directly, might be offensive, insulting, threatening, or might otherwise cause anxiety? If so, please describe why.
We reviewed the datasets and found no offensive content. While there are clearly expressions of highly negative sentiment or strong displays of anger and disgust in the affective computing videos, there are no offensive words used or personal attacks recorded in the video. All videos are related to movie or product reviews, TED talks, and TV shows.
-
Does the dataset relate to people? If not, you may skip the remaining questions in this section.
Yes, the healthcare, affective computing, and Kinetics (multimedia) datasets relate to people. The other datasets in MultiBench do not.
-
Does the dataset identify any subpopulations (e.g., by age, gender)? If so, please describe how these subpopulations are identified and provide a description of their respective distributions within the dataset.
The following datasets relate to people:
- Affective computing: These datasets do not identify any subpopulations in their modeling decisions. However, the raw data comes in the form of videos publicly available and free to download from YouTube. Sub-population and demographic information can be inferred from these raw videos.
- MIMIC: According to the authors [78]: “The median age of adult patients is 65.8 years and 55.9% patients are male.”
- Kinetics: This dataset does not identify any subpopulations. However, the raw data comes in the form of videos publicly available and free to download from YouTube. Sub-population and demographic information can be inferred from these raw videos.
-
Is it possible to identify individuals (i.e., one or more natural persons), either directly or indirectly (i.e., in combination with other data) from the dataset? If so, please describe how.
The following datasets relate to people:
- Affective computing: One can see the person in the raw video, but the dataset contains no personal information. We do not explicitly use information regarding gender, ethnicity, identity, or video identifier in online sources. All pre-extracted features are not easily invertible and only rely on general visual or audio features such as the presence of a smile or the magnitude of voice [181, 183].
- MIMIC: The MIMIC dataset has been rigorously de-identified in accordance with the Health Insurance Portability and Accountability Act (HIPAA) such that all possible personal information has been removed from the dataset. Removed personal information includes patient name, telephone number, address, and dates. Dates of birth for patients aged over 89 were shifted to obscure their true age. Please refer to Appendix C.2.2 for de-identification details. Again, we emphasize that any multimodal models trained to perform prediction should only be used for scientific study and should not in any way be used for real-world prediction.
- Kinetics: One can see the person in the raw video, but the dataset does not contain direct personal information. We do not explicitly use information regarding gender, ethnicity, identity, or video identifier in online sources.
-
Does the dataset contain data that might be considered sensitive in any way (e.g., data that reveals racial or ethnic origins, sexual orientations, religious beliefs, political opinions or union memberships, or locations; financial or health data; biometric or genetic data; forms of government identification, such as social security numbers; criminal history)? If so, please provide a description.
MultiBench contains datasets with financial and healthcare data. However, all these datasets are publicly available for research purposes. Healthcare data (MIMIC) has been rigorously de-identified in accordance with the Health Insurance Portability and Accountability Act (HIPAA) such that all possible personal information (patient name, telephone number, address, and dates, date of birth) has been removed from the dataset. Please refer to Appendix C.2.2 for de-identification details.
-
Any other comments?
No.
- Collection Process
-
How was the data associated with each instance acquired? Was the data directly observable (e.g., raw text, movie ratings), reported by subjects (e.g., survey responses), or indirectly inferred/derived from other data (e.g., part-of-speech tags, model-based guesses for age or language)? If data was reported by subjects or indirectly inferred/derived from other data, was the data validated/verified? If so, please describe how.
We include the collection process for each dataset in Appendix C.2.
-
What mechanisms or procedures were used to collect the data (e.g., hardware apparatus or sensor, manual human curation, software program, software API)? How were these mechanisms or procedures validated?
We include these details in Appendix C.2.
-
If the dataset is a sample from a larger set, what was the sampling strategy (e.g., deterministic, probabilistic with specific sampling probabilities)?
We include sampling methods for each dataset in Appendix C.2.
-
Who was involved in the data collection process (e.g., students, crowdworkers, contractors) and how were they compensated (e.g., how much were crowdworkers paid)?
We include annotation details for each dataset in Appendix C.2.
-
Over what timeframe was the data collected? Does this timeframe match the creation timeframe of the data associated with the instances (e.g., recent crawl of old news articles)? If not, please describe the timeframe in which the data associated with the instances was created.
We include timeframes for each dataset in Appendix C.2.
-
Were any ethical review processes conducted (e.g., by an institutional review board)? If so, please provide a description of these review processes, including the outcomes, as well as a link or other access point to any supporting documentation.
From the authors of MIMIC [78]: “The project was approved by the Institutional Review Boards of Beth Israel Deaconess Medical Center (Boston, MA) and the Massachusetts Institute of Technology (Cambridge, MA). Requirement for individual patient consent was waived because the project did not impact clinical care and all protected health information was de-identified.”
-
Does the dataset relate to people? If not, you may skip the remainder of the questions in this section.
Yes, the healthcare, affective computing, and Kinetics (multimedia) datasets relate to people. The other datasets in MultiBench do not.
-
Did you collect the data from the individuals in question directly, or obtain it via third parties or other sources (e.g., websites)?
Affective computing and Kinetics datasets are collected from YouTube videos that follow the creative commons license and follow fair use guidelines of YouTube. According to the authors for the MIMIC dataset [78]: “Data was downloaded from several sources, including archives from critical care information systems, hospital electronic health record databases, and Social Security Administration Death Master File.”
-
Were the individuals in question notified about the data collection? If so, please describe (or show with screenshots or other information) how the notice was provided, and provide a link or other access point to, or otherwise reproduce, the exact language of the notification itself.
Affective computing and Kinetics datasets are collected from YouTube videos that follow the creative commons license and follow fair use guidelines of YouTube. This is the standard way for content creators to grant someone else permission to use and redistribute their work. According to the authors for the MIMIC dataset [78]: “The project was approved by the Institutional Review Boards of Beth Israel Deaconess Medical Center (Boston, MA) and the Massachusetts Institute of Technology (Cambridge, MA). Requirement for individual patient consent was waived because the project did not impact clinical care and all protected health information was de-identified.”
-
Did the individuals in question consent to the collection and use of their data? If so, please describe (or show with screenshots or other information) how consent was requested and provided, and provide a link or other access point to, or otherwise reproduce, the exact language to which the individuals consented.
Affective computing and Kinetics datasets are collected from YouTube videos that follow the creative commons license and follow fair use guidelines of YouTube which allows content creators to grant someone else permission to use and redistribute their work. According to the authors for the MIMIC dataset [78]: “Requirement for individual patient consent was waived because the project did not impact clinical care and all protected health information was de-identified.”
-
If consent was obtained, were the consenting individuals provided with a mechanism to revoke their consent in the future or for certain uses? If so, please provide a description, as well as a link or other access point to the mechanism (if appropriate).
N/A.
-
Has an analysis of the potential impact of the dataset and its use on data subjects (e.g., a data protection impact analysis) been conducted? If so, please provide a description of this analysis, including the outcomes, as well as a link or other access point to any supporting documentation.
N/A.
-
Any other comments?
N/A.
-
Preprocessing/cleaning/labeling
-
Was any preprocessing/cleaning/labeling of the data done (e.g., discretization or bucketing, tokenization, part-of-speech tagging, SIFT feature extraction, removal of instances, processing of missing values)? If so, please provide a description. If not, you may skip the remainder of the questions in this section.
Yes, we followed the convention in prior research for any preprocessing done to the datasets. We explain these steps in Appendix C.2.
-
Was the “raw” data saved in addition to the preprocessed/cleaned/labeled data (e.g., to support unanticipated future uses)? If so, please provide a link or other access point to the “raw” data.
Yes, we include the raw data in MultiBench in addition to the preprocessed features. The raw data (usually in the form of raw text, videos, audio, time series, etc.) is useful for users to perform their own feature extraction and also for robustness tests on the raw data itself (e.g., imperfections in raw text through spelling errors and missing words). There are certain cases where we are not allowed to distribute the raw data: for MIMIC, users must undergo training to download the raw data, and for the finance datasets, Yahoo Finance is publicly available but does not allow redistribution of raw data. For both of these datasets, we provide automated download and preprocessing scripts once the raw data has been downloaded through the correct procedure by each user (see details in Appendix C.2).
-
Is the software used to preprocess/clean/label the instances available? If so, please provide a link or other access point.
Yes, we provide all links and references to preprocessing steps in Appendix C.2.
-
Any other comments?
No.
-
-
Uses
-
Has the dataset been used for any tasks already? If so, please provide a description.
Yes, MultiBench contains several datasets that have been used in the multimodal ML community. We provide links to the original repositories of each dataset and their original citations in Appendix C.2.
-
Is there a repository that links to any or all papers or systems that use the dataset? If so, please provide a link or other access point.
We provide links to the original repositories of each dataset and their original citations in Appendix C.2. We also include references to general multimodal methods implemented in MultiZoo in Appendix E. Many of these methods have been tested by their original authors on a small subset of datasets in MultiBench. In addition to these references, the leading authors maintain a reading list on topics in multimodal ML at [98] which contains links to papers, datasets, code, academic courses, conferences, and workshops relevant to the multimodal ML community.
-
What (other) tasks could the dataset be used for?
In addition to building multimodal models for the prediction tasks, datasets in MultiBench can also be used for:
Unsupervised learning across multimodal data/unsupervised pre-training of multimodal models.
Interpreting relationships between modalities.
Designing models for robustness to noisy and missing modalities.
Investigating alignment between modalities.
Other multimodal tasks including but not limited to: co-learning, translation, retrieval, and grounding [10].
-
Is there anything about the composition of the dataset or the way it was collected and preprocessed/cleaned/labeled that might impact future uses? For example, is there anything that a future user might need to know to avoid uses that could result in unfair treatment of individuals or groups (e.g., stereotyping, quality of service issues) or other undesirable harms (e.g., financial harms, legal risks) If so, please provide a description. Is there anything a future user could do to mitigate these undesirable harms?
We are careful to outline all possible risks associated with each dataset in Appendix C.2 and also in our broader impact statement (Appendix A). We acknowledge that there could be risks regarding the privacy and security of data, as well as the real-world deployment of these methods, whenever human-centric data is involved (e.g., in healthcare, affective computing, and multimedia). We discussed data demographics in the previous section; these should be taken into consideration when making claims regarding the generalization of models to new users. We also emphasize that these multimodal datasets and methods should only be used for research purposes and not for actual real-world deployment until research can sufficiently verify their safety. Finally, we are carefully working with domain experts towards better understanding biases in these multimodal datasets and models as well as their real-world safety.
-
Are there tasks for which the dataset should not be used? If so, please provide a description.
Yes, we emphasize that all multimodal models trained to perform prediction on these datasets should not in any way be used to harm individuals and should only be used as a scientific study. They should not be deemed safe for real-world deployment. In particular, the models used to make predictions of affective states, human actions, health indicators, and financial indicators are particularly sensitive and should not be used to inform any real-world decisions. All results must only be used as a scientific study of machine learning methods. See more details in Appendix A.
-
Any other comments?
No.
-
-
Distribution
-
Will the dataset be distributed to third parties outside of the entity (e.g., company, institution, organization) on behalf of which the dataset was created? If so, please provide a description.
Yes, the benchmark will be distributed to the public research community for theoreticians and practitioners to experiment on multimodal data.
-
How will the dataset be distributed (e.g., tarball on website, API, GitHub)? Does the dataset have a digital object identifier (DOI)?
We plan to distribute MultiBench via our public GitHub: https://github.com/pliang279/MultiBench. We also include a landing website page on https://cmu-multicomp-lab.github.io/multibench/ that includes an introduction to the benchmark, links to the relevant papers on multimodal datasets and algorithms, and a public leaderboard to keep track of current progress on these multimodal tasks.
-
When will the dataset be distributed?
The dataset is currently available for use.
-
Will the dataset be distributed under a copyright or other intellectual property (IP) license, and/or under applicable terms of use (ToU)? If so, please describe this license and/or ToU, and provide a link or other access point to, or otherwise reproduce, any relevant licensing terms or ToU, as well as any fees associated with these restrictions.
We release the benchmark and code under an MIT license: see https://github.com/pliang279/MultiBench/blob/main/LICENSE, which allows for sharing and distribution of the code for research purposes. Each of the datasets in MultiBench has their own licenses which we detail in Appendix C.2.
-
Have any third parties imposed IP-based or other restrictions on the data associated with the instances? If so, please describe these restrictions, and provide a link or other access point to, or otherwise reproduce, any relevant licensing terms, as well as any fees associated with these restrictions.
Yes, MultiBench brings together a collection of several existing datasets in multimodal research that were built by their individual authors, who hold the original licenses for these datasets. We only included datasets with licenses that allow for redistribution (MIT or Creative Commons licenses) and that are freely downloadable for research purposes. We detail all dataset licenses in Appendix C.2.
-
Do any export controls or other regulatory restrictions apply to the dataset or to individual instances? If so, please describe these restrictions, and provide a link or other access point to, or otherwise reproduce, any supporting documentation.
We are not aware of any such restrictions.
-
Any other comments?
No.
-
-
Maintenance
-
Who is supporting/hosting/maintaining the dataset?
The dataset is supported and hosted by the team of authors at CMU, who will lead the maintenance and expansion of MultiBench. The team will also work with the other collaborators on the paper who are domain experts in each research area MultiBench covers, such as robotics, HCI, healthcare, and finance.
-
How can the owner/curator/manager of the dataset be contacted (e.g., email address)?
We provide all contact addresses at https://cmu-multicomp-lab.github.io/multibench/.
-
Is there an erratum? If so, please provide a link or other access point.
All errata and updates to the dataset will be tracked via GitHub commit histories at https://github.com/pliang279/MultiBench. We will also provide updates via our landing page https://cmu-multicomp-lab.github.io/multibench/.
-
Will the dataset be updated (e.g., to correct labeling errors, add new instances, delete instances)? If so, please describe how often, by whom, and how updates will be communicated to users (e.g., mailing list, GitHub)?
Yes, we plan for long-term maintenance and expansion of the dataset. All errata and updates to the dataset will be tracked via GitHub commit histories at https://github.com/pliang279/MultiBench. We will also provide updates via our landing page https://cmu-multicomp-lab.github.io/multibench/. Please refer to Appendix C.5 for details.
-
If the dataset relates to people, are there applicable limits on the retention of the data associated with the instances (e.g., were individuals in question told that their data would be retained for a fixed period of time and then deleted)? If so, please describe these limits and explain how they will be enforced.
The individuals in question were not notified about the data collection. For YouTube videos, they are released under a creative commons license which is the standard way for content creators to grant someone else permission to use and redistribute their work. According to the authors for the MIMIC dataset [78]: “The project was approved by the Institutional Review Boards of Beth Israel Deaconess Medical Center (Boston, MA) and the Massachusetts Institute of Technology (Cambridge, MA). Requirement for individual patient consent was waived because the project did not impact clinical care and all protected health information was de-identified.”
-
Will older versions of the dataset continue to be supported/hosted/maintained? If so, please describe how. If not, please describe how its obsolescence will be communicated to users.
Yes, we will maintain a GitHub history for all updates and older versions of datasets and code in MultiBench.
-
If others want to extend/augment/build on/contribute to the dataset, is there a mechanism for them to do so? If so, please provide a description. Will these contributions be validated/verified? If so, please describe how. If not, why not? Is there a process for communicating/distributing these contributions to other users? If so, please provide a description.
Yes, we will create a system where users can create pull requests on GitHub to include their datasets and models. The authors will verify that the additions are in the scope of multimodal learning and do not break the current experimental code. We will work with these authors to ensure that their data and algorithms can be included in MultiBench.
-
Any other comments?
No.
-
C.4. Benchmark Distribution
We plan to distribute the MultiBench benchmark via our public GitHub: https://github.com/pliang279/MultiBench. We also include a landing website page on https://cmu-multicomp-lab.github.io/multibench/ that includes an introduction to the benchmark, links to the relevant papers on multimodal datasets and algorithms, and a public leaderboard to keep track of current progress on these multimodal tasks.
The GitHub and webpage will also allow feedback from the research community in suggesting and adding new datasets and algorithms. Finally, we plan to include a list of planned future updates to MultiBench on the webpage along with their target release dates.
C.5. Hosting and Maintenance
We have a long-term plan to continue the expansion and maintenance of MultiBench. Here we summarize the main directions we plan to expand towards and leave details and other areas of future work to Appendix I.
Maintenance: MultiBench will be continuously hosted via GitHub which provides stable access to code and a landing page website. We guarantee that MultiBench will be available for a long time through our distribution channels. The authors themselves are also actively working on multimodal learning in affective computing, robotics, healthcare, human-computer interaction, and multimedia. The authors are also involved in efforts in applying multimodal machine learning to finance. As a result of these long-term collaborative research efforts, the authors will continue to maintain and expand on the datasets and code provided in MultiBench.
Expansion of datasets: We plan to include more datasets for multimodal fusion as well as more research areas in multimodal learning such as retrieval, question answering, grounding, and reinforcement learning. While these research areas are very different, we hope that insights in multimodal representations can be shared across them.
Expansion of evaluation: To enable holistic evaluation, we plan to build on top of our metrics by adding robustness to distribution shift, uncertainty measures, tests for fairness and social biases, as well as labels/metrics for interpretable multimodal learning.
Expansion of datasets and models: We plan to encourage students taking the multimodal machine learning course at CMU (https://cmu-multicomp-lab.github.io/mmml-course/fall2020/) to use the benchmark and add their proposed datasets and models to it.
Expansion of methods: The authors currently collect a very up-to-date reading list of core multimodal papers https://github.com/pliang279/awesome-multimodal-ml and plan to continuously update MultiZoo with new multimodal methods proposed by the community.
C.6. Author Statement
The authors carefully reviewed the information present in this document. To the best of our knowledge, the datasets in MultiBench can be used for research purposes, following the methodology and licenses described in the dataset section (Appendix C.2).
C.7. License
Each of the datasets included in MultiBench has its own license, which we detail in Appendix C.2. We release all preprocessing code across all datasets under the MIT license. All other code for multimodal algorithms in MultiZoo, as well as evaluation scripts, is also released under the MIT license: see https://github.com/pliang279/MultiBench/blob/main/LICENSE, which allows for sharing and distribution of the code for research purposes.
C.8. Metadata
We have included structured metadata for MultiBench on our landing page: https://cmu-multicomp-lab.github.io/multibench/.
C.9. Persistence of MultiBench
MultiBench is publicly hosted on https://github.com/pliang279/MultiBench. For larger datasets that cannot be uploaded to GitHub, we plan to upload the processed dataset to CMU Box. We are still exploring the best options for sharing large datasets. Users need to download these processed datasets, place them into a correct folder, and run the MultiBench data loader and machine learning pipeline.
D. MultiBench Evaluation Protocol
To enable holistic evaluation, MultiBench offers a comprehensive evaluation methodology to assess (1) generalization across domains and modalities, (2) complexity during training and inference, and (3) robustness to noisy and missing modalities. We describe the evaluation protocol for each of these desiderata in the following subsections.
D.1. Performance
MultiBench provides standardized evaluation using metrics designed for each dataset, ranging from MSE and MAE for regression to accuracy, micro & macro F1-score, and AUPRC for classification. To assess generalization, we compute the variance of a particular model's performance across all datasets in MultiBench on which it is tested. We split these results into in-domain and out-domain datasets: in-domain datasets refer to those a model was initially proposed and tested on, while out-domain datasets refer to the remaining datasets. Comparing out-domain vs in-domain performance, as well as the variance in performance across datasets as a whole, allows us to summarize the generalization statistics of each multimodal model.
D.2. Complexity
Modern ML research, unfortunately, incurs significant energy costs [142], a problem often exacerbated when processing high-dimensional multimodal data. As a step towards quantifying energy complexity and recommending lightweight multimodal models, MultiBench records the amount of information taken in as input in bits (i.e., data size), the number of model parameters, and the time and memory resources required during the entire training process. To enforce consistency, the training time of all models on each dataset is measured on the same CPUs and GPUs. We report training memory by measuring the peak memory usage of the Python process over the entire training process using the Python memory_profiler package (https://pypi.org/project/memory-profiler/). When counting the number of parameters of a model during training, we count only the parameters of modules that persist throughout training and do not count ephemeral networks or modules created in the middle of the training process (such as the networks trained to determine weights in GradBlend or the fusion architectures created as part of the architecture search process in MFAS).
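As a rough illustration of how such complexity measures can be recorded, the following minimal sketch (not the exact MultiBench instrumentation) counts trainable parameters of a PyTorch model and measures the peak memory of a training function with the memory_profiler package; the toy model and training loop are stand-ins for a real experiment.

```python
import torch
import torch.nn as nn
from memory_profiler import memory_usage  # pip install memory-profiler


def count_parameters(model: nn.Module) -> int:
    """Count trainable parameters of the persistent modules of a model."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)


def train_one_epoch(model: nn.Module, optimizer: torch.optim.Optimizer) -> None:
    """Placeholder training loop on random data (stands in for a real epoch)."""
    for _ in range(100):
        x, y = torch.randn(32, 64), torch.randint(0, 2, (32,))
        loss = nn.functional.cross_entropy(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()


if __name__ == "__main__":
    model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 2))
    optimizer = torch.optim.Adam(model.parameters())
    print("trainable parameters:", count_parameters(model))
    # Peak memory (in MiB) of the Python process while the training function runs.
    peak = max(memory_usage((train_one_epoch, (model, optimizer), {}), interval=0.1))
    print("peak training memory (MiB):", peak)
```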
In addition to training time and resources, real-world models may need to be small and compact to run on mobile devices [131]. To account for this, MultiBench also records inference time and parameters. We report inference time by measuring the time it takes for the trained model to complete inference on the entire test set of the dataset. In some cases, only parts of the parameters used in training are counted towards the inference parameters (for example, the parameters in decoders of MVAE and MFM are part of training parameters but not part of inference parameters).
D.3. Robustness to Imperfect Data
Real-world multimodal data is often imperfect as a result of missing entries, noise corruption, or missing modalities entirely. For example, multimodal dialogue systems trained on acted TV shows are susceptible to poor performance when deployed in the real world where users might be less expressive in using facial gestures. This calls for robust models that can still make accurate predictions despite only having access to a (possibly noisy [101]) subset of signals [123]. To standardize efforts in evaluating the robustness of multimodal models, MultiBench includes the following robustness tests as part of the evaluation:
D.3.1. Modality-specific Imperfections
Modality-specific imperfections are independently applied to each modality, taking into account the unique noise topologies in that source of data (e.g., flips and crops of images, natural misspellings in text, abbreviations in spoken audio). We describe all the modality-specific imperfections implemented in MultiBench below:
Language:
Imperfections in the language modality can occur at various granularities spanning the character, word, phrase, and sentence levels. With reference to [15], many of these imperfections occur at the raw-text level and typically result from spelling errors on a QWERTY keyboard as well as abbreviations in written, typed, and spoken text. Given a word of length $L$ and a fixed probability $p$, we implement the following language-specific imperfections (a minimal code sketch of several of these corruptions follows this list):
- Spelling errors: note that spelling mistakes are different from intentionally changed word forms (e.g., abbreviations used in instant messaging) since they are unintentional [144]. We simulate typos by replacing each letter, with probability $p$, with a letter at an adjacent position on a QWERTY keyboard.
- Short message noise: Short Message Service (SMS) data usually include intentional corruptions of words and phrases like abbreviations, phonetic substitutions, omission of characters and words, and dialectal and informal usages [144]. We implement the following:
- Simulate sticky keys: given a number $k$, choose $k$ letters of the word at random and repeat each with probability $p$.
- Simulate quick typing: given a number $k$, choose $k$ letters of the word at random and omit each with probability $p$.
- Random permutation of letters: swapping two adjacent letters is a common natural noise when typing quickly [15], while randomly permuting the entire word or the majority of its letters is a form of synthetic noise. We implement the following:
- Swap two random adjacent letters (excluding the first and the last letter) with probability $p$.
- Permute the middle chunk of a word: denote the middle chunk (all letters except the first and the last) as $c_1 c_2 \ldots c_m$; with probability $p$, sample a random permutation $\pi$ of $\{1, \ldots, m\}$ and replace the middle chunk with $c_{\pi(1)} c_{\pi(2)} \ldots c_{\pi(m)}$, keeping the first and last letters fixed.
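The following is a minimal sketch of two of these language corruptions (QWERTY typos and adjacent-letter swaps), assuming words are plain lowercase strings; the keyboard-adjacency rule and the function names are simplifications for illustration.

```python
import random

# Rows of a QWERTY keyboard, used to find adjacent keys for simulated typos.
QWERTY_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]


def adjacent_key(ch: str) -> str:
    """Return a random key adjacent (same row) to ch on a QWERTY keyboard, or ch itself."""
    for row in QWERTY_ROWS:
        idx = row.find(ch.lower())
        if idx != -1:
            neighbors = row[max(0, idx - 1):idx] + row[idx + 1:idx + 2]
            return random.choice(neighbors) if neighbors else ch
    return ch


def spelling_errors(word: str, p: float) -> str:
    """Replace each letter with an adjacent QWERTY key with probability p."""
    return "".join(adjacent_key(c) if random.random() < p else c for c in word)


def swap_adjacent(word: str, p: float) -> str:
    """With probability p, swap two random adjacent letters (excluding first/last)."""
    if len(word) > 3 and random.random() < p:
        i = random.randrange(1, len(word) - 2)
        chars = list(word)
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
        return "".join(chars)
    return word


if __name__ == "__main__":
    random.seed(0)
    print(spelling_errors("multimodal", p=0.2))
    print(swap_adjacent("benchmark", p=1.0))
```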
Image:
Given an RGB image of height $H$ and width $W$, let $R$, $G$, $B$ denote the matrices of the three color channels. We implement the following robustness tests on the image modality (a brief sketch of the first two noises follows this list):
- Noises in digital images: various noises are naturally prevalent in digital images during image acquisition, coding, transmission, and processing steps [19]. We implement the following:
- Gaussian/electronic noise, whose histogram over gray values follows a Gaussian distribution. We add Gaussian noise as a matrix with each entry drawn from a Gaussian distribution $\mathcal{N}(0, \sigma^2)$.
- Impulse-valued/salt-and-pepper noise, which places dark pixels in bright regions and bright pixels in dark regions. To add salt-and-pepper noise, each pixel is converted, with probability $p$, into a dead pixel set to white or black uniformly at random.
- Periodic noise, which makes it look as if repeating patterns are superimposed on the affected image. We add periodic noise by exposing the original image to periodic patterns with probability $p$.
- Color errors:
- Convert the image to grayscale (0.3R + 0.59G + 0.11B) with probability $p$.
- Decrease the contrast with probability $p$.
- Negate the colors: with probability $p$, replace the image with its inverse $R' = 255 - R$, $G' = 255 - G$, $B' = 255 - B$ [3].
- Change the white balance by increasing/decreasing the color temperature with probability $p$.
- Colorize the image with probability $p$.
- Flips, crops, and rotations:
- Horizontal flipping with probability $p$.
- Color space transformation (isolating a single color channel, changing brightness, etc.) with probability $p$.
- Random cropping with probability $p$.
- Translation of the image to the left, right, up, or down with probability $p$.
Most of these transformations are achieved with the Python Imaging Library (PIL).
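As a minimal sketch of the first two image noises above, the following functions operate on an H x W x 3 uint8 numpy array; the specific sigma and p values are illustrative, not the levels used in the benchmark.

```python
import numpy as np


def gaussian_noise(image: np.ndarray, sigma: float = 25.0) -> np.ndarray:
    """Add zero-mean Gaussian noise with standard deviation sigma to an HxWx3 image."""
    noisy = image.astype(np.float64) + np.random.normal(0.0, sigma, image.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)


def salt_and_pepper(image: np.ndarray, p: float = 0.05) -> np.ndarray:
    """With probability p, turn each pixel into a white or black dead pixel."""
    noisy = image.copy()
    h, w = image.shape[:2]
    mask = np.random.rand(h, w) < p
    # Choose white (255) or black (0) uniformly at random for the corrupted pixels.
    colors = np.where(np.random.rand(h, w) < 0.5, 255, 0).astype(np.uint8)
    noisy[mask] = colors[mask][:, None]
    return noisy


if __name__ == "__main__":
    img = np.random.randint(0, 256, size=(64, 64, 3), dtype=np.uint8)
    print(gaussian_noise(img, sigma=15.0).shape)
    print(salt_and_pepper(img, p=0.02).shape)
```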
Video:
We treat video data as a time series of images. For each image in the video, we apply the image-specific robustness tests as described above. In addition, we also apply the following tests to simulate imperfections in time-series data:
Random drop: each time step is dropped with probability $p$.
Structured drop: given a number $k$, $k$ consecutive time steps containing at least one nonzero signal are dropped with probability $p$.
Audio:
Audio is typically represented as a time-series signal. Noise primarily stems from imperfections in the recording device, which can add static Gaussian noise to the recorded temporal waveform at random time steps, pick up background noise at higher magnitudes, and drop certain time steps (or consecutive time steps) from the recording. We implement the following unimodal noises in the audio modality:
Additive white Gaussian noise: given an array of length $n$ representing a sampled audio segment, we add white Gaussian noise, an array of length $n$ with each entry drawn from a normal distribution with mean 0 and standard deviation $\sigma$.
In addition to these imperfections applied at a single time step, we also apply the following across the entire time-series signal:
Random drop: each time step is dropped with probability $p$.
Structured drop: given a number $k$, $k$ consecutive time steps containing at least one nonzero signal are dropped with probability $p$.
Time series:
Time-series data consists of a sequence of data points indexed by time. Following Liang et al. [101], we implement the following types of noise and missing values in time-series data (a minimal sketch of the drop operations follows this list):
White noise added independently at every time step (noise sampled from a zero-mean Gaussian with standard deviation $\sigma$).
Random drop: each time step is dropped with probability $p$.
Structured drop: given a number $k$, $k$ consecutive time steps across modalities are dropped with probability $p$.
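The drop operations above can be sketched as follows for a single (time x features) array; zeroing out dropped steps is one illustrative choice of how "dropping" can be realized.

```python
import numpy as np


def random_drop(series: np.ndarray, p: float) -> np.ndarray:
    """Zero out each time step independently with probability p (series: T x d)."""
    noisy = series.copy()
    mask = np.random.rand(series.shape[0]) < p
    noisy[mask] = 0.0
    return noisy


def structured_drop(series: np.ndarray, k: int, p: float) -> np.ndarray:
    """With probability p, zero out k consecutive time steps starting at a random step."""
    noisy = series.copy()
    if np.random.rand() < p and series.shape[0] >= k:
        start = np.random.randint(0, series.shape[0] - k + 1)
        noisy[start:start + k] = 0.0
    return noisy


if __name__ == "__main__":
    ts = np.random.randn(50, 8)  # 50 time steps, 8 features
    print(random_drop(ts, p=0.1).shape)
    print(structured_drop(ts, k=5, p=1.0).shape)
```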
Optical flow:
We treat optical flow in a similar manner as time-series data and implement the same robustness tests.
Force and proprioception sensors:
We also treat these sensors in robotics as time-series data with a key difference - we add noise/drop time steps at a higher frequency since force and proprioception sensors often record data at a higher frequency.
Tabular data:
Tabular data takes the form of rows, each of which contains information about a set of features (e.g., age). We define the following robustness tests on tabular data:
Random drops of elements from the table with probability $p$.
Random swaps of elements in the table with probability $p$.
Sets:
Sets are data instances whose input elements satisfy permutation invariance, in contrast to the fixed-dimensional vectors commonplace in machine learning on images, text, and audio. The key difference between sets and tabular data is that each element in a set is often assumed to be drawn from the same distribution (e.g., a point cloud is a set of 3D coordinates). We define the following types of noise on an input set modality:
Random dropping of elements from the set with probability $p$.
Adding noise to elements of the set, with noise sampled from a zero-mean Gaussian with standard deviation $\sigma$.
D.3.2. Multimodal Imperfections
Multimodal imperfections capture correlations in imperfections across modalities (e.g., missing modalities [123], or a chunk of time missing in multimodal time-series data [101]). These represent settings where data collection across modalities is correlated rather than independent.
Correlated noise: adding noise to all modalities with probability $p$, where noise is defined according to the aforementioned modality-specific noises.
Correlated drop: dropping all modalities with probability $p$, where dropping patterns are defined according to the aforementioned modality-specific drops.
Temporal drop: in the case of temporal modalities recorded in parallel (e.g., video, audio, and text recorded across time; financial time-series data recorded across days), we perform correlated drops across all modalities at random time steps with probability $p$.
Structured temporal drop: we extend temporal drop such that, given a number $k$, we perform temporal drop on $k$ consecutive time steps with probability $p$.
Missing modalities: dropping an entire modality with probability $p$.
D.3.3. Robustness Measure
We train the model on clean training data and evaluate it under increasing levels of noise added only to the test data. To simulate realistic noise and imperfections in test data, we follow the modality-specific and multimodal imperfections described above. Given a multimodal dataset with $n$ modalities, this allows us to create $n + 1$ partitions of imperfect test data: one partition of increasing noise levels for modality-specific imperfections within each modality (giving a total of $n$ partitions) and one partition of multimodal imperfections across all modalities. For datasets where multimodal imperfections cannot be created due to the lack of a shared dimension (e.g., image and text datasets typically do not share a correlated dimension, whereas multimodal time-series datasets share an underlying time dimension), we only implement the modality-specific imperfections, which results in $n$ imperfect data partitions.
A qualitative visualization:
Given each test partition, we take a unimodal or multimodal model trained on clean data and plot its performance on the $y$-axis as increasing levels of noise, ranging from 0 (no noise) to 1 (complete noise), are added to the test data along the $x$-axis. This allows us to visually inspect the robustness of each model as increasing imperfections are added to the test data. Visually, a robust model should maintain high accuracy (or low MSE) as much as possible despite increasing levels of noise.
A quantitative metric:
While the visualization technique above allows one to compare the robustness of several multimodal models on the same dataset, it does not allow us to aggregate robustness performance across the broad range of datasets and tasks in MultiBench. To design such a metric, we extend the quantitative robustness measures proposed in Taori et al. [149] to deal with multimodal imperfections across a range of imperfection levels $\epsilon \in [0, 1]$.
We begin by reviewing the example given in Taori et al. [149]: suppose we are given two models $f_1$ and $f_2$, where $f_1$ sees a 5% drop in accuracy from the clean to the noisy test set and $f_2$ sees a 14% drop. Model $f_2$ has higher accuracy on the noisy test set, but overall sees a drop of 14% from the clean to the noisy test set. In contrast, $f_1$ starts off with a lower accuracy but sees only a 5% drop. To capture both these desiderata (i.e., having higher accuracy at all levels of imperfection and smaller drops in accuracy), Taori et al. [149] introduce two notions of robustness: relative and effective robustness.
Relative robustness directly measures accuracy under imperfection. A model with higher relative robustness displays higher accuracy at all levels of imperfection compared to a baseline model. We measure the relative robustness of all multimodal models with respect to a baseline LF (simple late fusion with concatenation) method, since that is the most basic method tested on all datasets. We compute the relative robustness of a model $f$ using the formula
$$\tau_{\text{relative}}(f) = \int_{0}^{1} \Big( \mathrm{acc}_f(\epsilon) - \mathrm{acc}_{\mathrm{LF}}(\epsilon) \Big) \, d\epsilon, \qquad (1)$$
which essentially measures the area between the two performance-imperfection curves as the imperfection level $\epsilon$ increases from 0.0 to 1.0 (we compute a discrete approximation to the integral).
Effective robustness measures the rate at which accuracy drops as imperfection levels increase. However, to reliably measure this rate, one must remove the confounding effect of differences in initial accuracy on clean test data. Taori et al. [149] therefore propose to measure whether a model offers higher accuracy on the noisy test set beyond what is expected from having higher accuracy on the original test set. Taori et al. use a log-linear fit on the set of (accuracy on noisy test data, accuracy on clean test data) points across a range of models trained on ImageNet to estimate the expected accuracy on noisy test data given a new model's performance on clean test data. Graphically, effective robustness then corresponds to a model's performance on noisy test data lying above this trendline. Similar to relative robustness, we measure the effective robustness of multimodal models relative to the accuracy trend of the LF baseline, which we denote as $\widetilde{\mathrm{acc}}_{\mathrm{LF}}(\epsilon) = \mathrm{acc}_{\mathrm{LF}}(\epsilon) + \mathrm{acc}_f(0) - \mathrm{acc}_{\mathrm{LF}}(0)$ (the LF curve shifted to match the clean accuracy of model $f$). We compute the effective robustness of a model $f$ using the formula
$$\tau_{\text{effective}}(f) = \int_{0}^{1} \Big( \mathrm{acc}_f(\epsilon) - \widetilde{\mathrm{acc}}_{\mathrm{LF}}(\epsilon) \Big) \, d\epsilon, \qquad (2)$$
which essentially measures the area between the performance-imperfection curve of model $f$ and the shifted performance-imperfection curve of the LF baseline (shifted to match the initial accuracy of model $f$ at imperfection level 0.0). A model with higher effective robustness lies above this shifted accuracy curve at all imperfection levels $\epsilon$. Again, we compute a discrete approximation to the integral.
Overall, a robust multimodal model should obtain both high relative and effective robustness.
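The sketch below shows one way to discretely approximate Equations (1) and (2) using trapezoidal integration, assuming the accuracy of the model and of the LF baseline have been evaluated at the same grid of imperfection levels; the curves in the example are synthetic.

```python
import numpy as np


def relative_robustness(acc_f: np.ndarray, acc_lf: np.ndarray,
                        eps: np.ndarray) -> float:
    """Area between the model's and the LF baseline's accuracy-imperfection curves,
    approximating Eq. (1) with the trapezoidal rule."""
    return float(np.trapz(acc_f - acc_lf, eps))


def effective_robustness(acc_f: np.ndarray, acc_lf: np.ndarray,
                         eps: np.ndarray) -> float:
    """Area between the model's curve and the LF curve shifted to match the model's
    clean accuracy at eps = 0, approximating Eq. (2)."""
    shifted_lf = acc_lf + (acc_f[0] - acc_lf[0])
    return float(np.trapz(acc_f - shifted_lf, eps))


if __name__ == "__main__":
    eps = np.linspace(0.0, 1.0, 11)     # imperfection levels
    acc_lf = 0.70 - 0.30 * eps          # toy LF baseline curve
    acc_f = 0.75 - 0.20 * eps           # toy model curve
    print(relative_robustness(acc_f, acc_lf, eps))
    print(effective_robustness(acc_f, acc_lf, eps))
```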
D.4. Aggregating Measures Across Datasets and Tasks
MultiBench benefits from benchmarking multimodal models across a diverse set of datasets, modalities, and tasks. While it is useful to analyze methods on a single dataset in isolation, it is also useful to assess the generalization and failure modes of methods across multiple datasets. Therefore, we need a way to reliably summarize the above metrics (performance, complexity, and robustness) across datasets despite these metrics being on vastly different scales (e.g., accuracy over different numbers of categories) and directions (e.g., higher accuracy is better whereas lower RMSE is better). We find that min-max normalization of results per dataset into a 0-1 scale (where min and max are appropriately reversed for RMSE/MSE metrics) before averaging across datasets gives a reliable indicator of overall performance across multiple datasets.
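As a small illustration of this aggregation step, the following sketch min-max normalizes per-dataset scores (reversing the direction for error metrics) before averaging; the model names and numbers are made up for the example.

```python
import numpy as np


def normalize_scores(scores: dict, higher_is_better: bool = True) -> dict:
    """Min-max normalize per-dataset scores of several models into a 0-1 scale.

    scores maps model name -> raw metric on one dataset; for error metrics such
    as MSE/RMSE, pass higher_is_better=False so that lower error maps to 1.
    """
    values = np.array(list(scores.values()), dtype=float)
    lo, hi = values.min(), values.max()
    normed = (values - lo) / (hi - lo) if hi > lo else np.ones_like(values)
    if not higher_is_better:
        normed = 1.0 - normed
    return dict(zip(scores.keys(), normed))


if __name__ == "__main__":
    # Toy example: accuracy on one dataset and MSE on another, then averaged.
    acc = normalize_scores({"LF": 0.61, "MulT": 0.66, "TF": 0.63})
    mse = normalize_scores({"LF": 1.9, "MulT": 1.4, "TF": 1.6}, higher_is_better=False)
    overall = {m: (acc[m] + mse[m]) / 2 for m in acc}
    print(overall)
```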
Table 3: A coarse categorization of the multimodal methods implemented in MultiZoo by the core technical challenges they target (alignment, complementarity, and robustness).

| Category | Method | Alignment | Complementarity | Robustness |
|---|---|---|---|---|
| Data | WordAlign [26] | ✓ | ✗ | ✗ |
| Model | EF, LF [10] | ✗ | ✓ | ✗ |
| | TF [179], LRTF [106] | ✗ | ✓ | ✗ |
| | MI-Matrix, MI-Vector, MI-Scalar [77] | ✗ | ✓ | ✗ |
| | NL Gate [167] | ✗ | ✓ | ✗ |
| | MulT [154] | ✓ | ✓ | ✗ |
| | MFAS [122] | ✗ | ✓ | ✗ |
| Objective | CCA [7] | ✓ | ✗ | ✗ |
| | RefNet [135] | ✓ | ✗ | ✗ |
| | MFM [155] | ✓ | ✓ | ✗ |
| | MVAE [168] | ✗ | ✓ | ✗ |
| | MCTN [123] | ✗ | ✗ | ✓ |
| Training | GradBlend [167] | ✗ | ✓ | ✓ |
| | RMFE [53] | ✗ | ✓ | ✓ |
E. MultiZoo: A Zoo of Multimodal Algorithms
In this section, we provide more details on our choice of algorithms for standardizing multimodal representation learning as well as on the implementation of our standardized library. For each category, we carefully describe the algorithm, motivate its relevance to one of the core challenges in Appendix B.2, and provide references to the original code that we adapted for inclusion in MultiZoo.
E.1. Selection of Algorithms in MultiZoo
We begin by discussing our choices of algorithms in MultiZoo. We consulted with domain experts in each of the application areas to select methods that satisfy the following properties:
Diversity in areas: We chose algorithms that present novel perspectives across a suite of machine learning research domains spanning data preprocessing, fusion paradigms, optimization objectives, and training procedures.
Coverage of technical challenges: Each of the algorithms selected in MultiZoo was chosen because it provides a unique perspective on the technical challenges in multimodal learning as elucidated in Appendix B.2. In Table 3, we provide a coarse categorization of each method according to the technical challenges it addresses. As a result, we avoided including too many methods in any one category (e.g., multiple architecture-based methods that tackle the similar challenge of learning complementary information). Even within the same category, and among methods tackling the same technical challenge, we attempted to select ones that are fundamentally different (e.g., architectures based on domain knowledge, general-purpose Transformers, and architecture search).
SOTA on a particular dataset: For each dataset chosen in MultiBench, we aim to include the model that currently achieves state-of-the-art performance on that dataset. This allows us to assess the best performing model within the same domain of the dataset, as well as the best performing model outside the domain of the dataset.
Community expansion: Any initial set of methods we choose represents only a small sample of the many powerful multimodal methods available. We will encourage community participation in expanding the methods in MultiZoo and encourage researchers to implement new methods using a similar modular structure to reduce confounding factors, enable standardized sharing of code, and ensure reproducibility of results.
E.2. Data Preprocessing
Temporal alignment:
As a preprocessing step, performing temporal alignment [26] has been shown to help tackle the multimodal alignment problem in the case of time-series data. This approach makes an implicit assumption on the temporal granularity of the modalities (e.g., at the level of words for text) and aligns information from the remaining modalities to the same temporal granularity. We call this approach WordAlign [26] and apply it to temporal data with text as one of the modalities. We use the temporal alignment provided in https://github.com/A2Zadeh/CMU-MultimodalSDK, which performs alignment at the granularity of words. Given a sentence of $n$ words, each annotated with start and end times $(s_1, e_1), (s_2, e_2), \ldots, (s_n, e_n)$, word-level alignment takes the non-text modality features (which are typically extracted at a higher frequency) and averages them over the intervals $[s_1, e_1], [s_2, e_2], \ldots, [s_n, e_n]$. This results in a text sequence of $n$ words alongside aligned non-text modality sequences of $n$ time steps (a minimal sketch follows).
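The averaging step can be sketched as follows, assuming the non-text features come with per-frame timestamps; the array shapes and interval values are illustrative rather than the exact CMU-MultimodalSDK interface.

```python
import numpy as np


def word_align(features: np.ndarray, times: np.ndarray, intervals: list) -> np.ndarray:
    """Average non-text features within each word's (start, end) interval.

    features: T x d array sampled at timestamps `times` (length T);
    intervals: list of (start, end) times, one per word in the sentence.
    Returns an n_words x d array aligned to the word sequence.
    """
    aligned = []
    for start, end in intervals:
        mask = (times >= start) & (times <= end)
        if mask.any():
            aligned.append(features[mask].mean(axis=0))
        else:
            aligned.append(np.zeros(features.shape[1]))  # no frames fall in this interval
    return np.stack(aligned)


if __name__ == "__main__":
    times = np.linspace(0.0, 2.0, 200)               # audio/visual frame timestamps
    feats = np.random.randn(200, 35)                 # e.g., 35-dim visual features
    words = [(0.0, 0.4), (0.4, 1.1), (1.1, 2.0)]     # (start, end) per word
    print(word_align(feats, times, words).shape)     # -> (3, 35)
```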
E.3. Fusion Paradigms
Early and late fusion have been the de-facto first approaches when tackling new multimodal problems. Early fusion performs concatenation at the input data level before applying a suitable prediction model (i.e., $\hat{y} = g([x_1, x_2])$), while late fusion applies suitable unimodal models to each modality to obtain their feature representations, concatenates these features, and applies a classifier to predict the label (i.e., $\hat{y} = g([f_1(x_1), f_2(x_2)])$) [10]. MultiZoo includes their implementations, denoted EF and LF respectively (see the sketch below). Since these are basic building blocks of the multimodal learning field, we implement them ourselves.
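A minimal sketch of these two baselines for two vector-valued modalities is shown below; the layer sizes and encoder choices are arbitrary and not the configurations used in MultiBench experiments.

```python
import torch
import torch.nn as nn


class EarlyFusion(nn.Module):
    """Concatenate raw inputs, then apply a single prediction model."""

    def __init__(self, d1: int, d2: int, n_classes: int):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(d1 + d2, 128), nn.ReLU(),
                                  nn.Linear(128, n_classes))

    def forward(self, x1, x2):
        return self.head(torch.cat([x1, x2], dim=-1))


class LateFusion(nn.Module):
    """Encode each modality separately, concatenate features, then classify."""

    def __init__(self, d1: int, d2: int, d_hidden: int, n_classes: int):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Linear(d1, d_hidden), nn.ReLU())
        self.enc2 = nn.Sequential(nn.Linear(d2, d_hidden), nn.ReLU())
        self.head = nn.Linear(2 * d_hidden, n_classes)

    def forward(self, x1, x2):
        return self.head(torch.cat([self.enc1(x1), self.enc2(x2)], dim=-1))


if __name__ == "__main__":
    x1, x2 = torch.randn(4, 300), torch.randn(4, 74)    # e.g., text and audio features
    print(EarlyFusion(300, 74, 2)(x1, x2).shape)        # -> (4, 2)
    print(LateFusion(300, 74, 64, 2)(x1, x2).shape)     # -> (4, 2)
```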
Tensors are specifically designed to tackle the multimodal complementarity challenge by explicitly capturing higher-order interactions across modalities [179]. Given unimodal representations $z_1$ and $z_2$, a multimodal tensor representation is defined as $z_{\text{mm}} = \begin{bmatrix} z_1 \\ 1 \end{bmatrix} \otimes \begin{bmatrix} z_2 \\ 1 \end{bmatrix}$, where $\otimes$ denotes an outer product. However, computing tensor products is expensive since their dimension scales exponentially with the number of modalities. Several efficient variants have been proposed that approximate expensive full tensor products with cheaper variants while maintaining performance [71, 101, 106]. MultiZoo includes Tensor Fusion (TF) [179] as well as approximate Low-rank Tensor Fusion (LRTF) [106].
We use the Tensor Fusion implementation in https://github.com/Justin1904/TensorFusionNetworks and the Low-rank Tensor Fusion implementation in https://github.com/Justin1904/Low-rank-Multimodal-Fusion. As future work, we also plan to include more expressive higher-order tensor fusion methods [71].
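The outer-product fusion step can be sketched as follows for two modalities; this is a simplified illustration of the fusion operation rather than the referenced implementations.

```python
import torch


def tensor_fusion(z1: torch.Tensor, z2: torch.Tensor) -> torch.Tensor:
    """Outer product of [z1; 1] and [z2; 1], flattened per example.

    Appending a constant 1 retains the unimodal terms in addition to the
    bimodal interaction terms, following the Tensor Fusion formulation.
    """
    batch = z1.shape[0]
    ones = torch.ones(batch, 1, device=z1.device, dtype=z1.dtype)
    z1_aug = torch.cat([z1, ones], dim=1)              # (batch, d1 + 1)
    z2_aug = torch.cat([z2, ones], dim=1)              # (batch, d2 + 1)
    fused = torch.bmm(z1_aug.unsqueeze(2), z2_aug.unsqueeze(1))
    return fused.view(batch, -1)                       # (batch, (d1 + 1) * (d2 + 1))


if __name__ == "__main__":
    z1, z2 = torch.randn(4, 32), torch.randn(4, 16)
    print(tensor_fusion(z1, z2).shape)                 # -> (4, 33 * 17) = (4, 561)
```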
Multiplicative Interactions (MI) further generalize tensor products to include learnable parameters that capture the interactions between streams of information [77]. In its most general form, MI defines a bilinear product $z_{\text{mm}} = z_1 \mathbb{W} z_2 + z_1 \mathbb{U} + \mathbb{V} z_2 + \mathbb{b}$, where $\mathbb{W}$, $\mathbb{U}$, $\mathbb{V}$, and $\mathbb{b}$ are trainable parameters. By appropriately constraining the rank and structure of these parameters, MI recovers HyperNetworks [61] (unconstrained parameters resulting in a matrix output), Feature-wise Linear Modulation (FiLM) [120, 188] (diagonal parameters resulting in a vector output), and Sigmoid units [37] (scalar parameters resulting in a scalar output). MultiZoo includes all three as MI-Matrix, MI-Vector, and MI-Scalar respectively.
Since code was not released for the Multiplicative Interactions paper [77], we implemented the MI layer ourselves. We also referred to the implementation of Feature-wise Linear Modulation (FiLM) [120] from https://github.com/ethanjperez/film and added it as a module in MultiBench, which we call FiLM. While MI-Vector (i.e., diagonal parameters in an MI layer, resulting in a vector output) corresponds to the most basic form of FiLM, the original FiLM layer uses multiple non-linear layers instead of the single linear transformation in MI-Vector, which has been shown to improve performance [120]. A minimal sketch of this vector-output modulation is shown below.
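The following sketch illustrates the MI-Vector / FiLM-style case, where one modality produces a per-dimension scale and shift that modulate the other; it is an illustrative reduction of the general bilinear form, not the exact MultiZoo module.

```python
import torch
import torch.nn as nn


class MIVector(nn.Module):
    """Vector-output multiplicative interaction: z2 produces a per-dimension
    scale and shift (FiLM-style) that modulate z1."""

    def __init__(self, d1: int, d2: int):
        super().__init__()
        self.scale = nn.Linear(d2, d1)   # gamma(z2)
        self.shift = nn.Linear(d2, d1)   # beta(z2)

    def forward(self, z1, z2):
        return self.scale(z2) * z1 + self.shift(z2)


if __name__ == "__main__":
    z1, z2 = torch.randn(4, 64), torch.randn(4, 32)
    print(MIVector(64, 32)(z1, z2).shape)   # -> (4, 64)
```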
Gated attention models are prevalent for learning combinations of two representations that dynamically change for every input [25, 167, 171]. Their general form can be written as $z_{\text{mm}} = z_1 \odot h(z_2)$, where $h$ is a function with sigmoid activation and $\odot$ denotes the element-wise product. The output $h(z_2)$ is commonly referred to as "attention weights" learned from $z_2$ that are used to attend on $z_1$.
We implement the Query-Key-Value mechanism as NL Gate, as proposed in [167], by referring to the implementation in https://github.com/facebookresearch/VMZ. This attention mechanism is conceptually similar to the MI-Vector case above, but recent work has explored more expressive forms of $h$, such as a Query-Key-Value mechanism [167] or several fully-connected layers [25], rather than the single linear transformation in MI-Vector. A simplified sigmoid-gate sketch follows.
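The sketch below implements only the general gated form $z_{\text{mm}} = z_1 \odot h(z_2)$ with a single linear layer and sigmoid; it is a simplification, not the Query-Key-Value NL Gate itself.

```python
import torch
import torch.nn as nn


class SigmoidGate(nn.Module):
    """Gated fusion z_mm = z1 * sigmoid(h(z2)): z2 produces attention weights
    in [0, 1] that gate each dimension of z1."""

    def __init__(self, d1: int, d2: int):
        super().__init__()
        self.h = nn.Sequential(nn.Linear(d2, d1), nn.Sigmoid())

    def forward(self, z1, z2):
        return z1 * self.h(z2)


if __name__ == "__main__":
    z1, z2 = torch.randn(4, 64), torch.randn(4, 128)
    print(SigmoidGate(64, 128)(z1, z2).shape)   # -> (4, 64)
```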
Temporal attention models are useful for tackling the challenges of multimodal alignment and complementarity. Transformer models [158] have been shown to be useful for temporal multimodal data by automatically aligning and capturing complementary features at different time steps [154, 174]. We include the Multimodal Transformer (MulT) [154], which uses a Crossmodal Transformer block in which one modality attends to another (and vice-versa) before both representations are concatenated to obtain $z_{\text{mm}}$.
We use the public implementation available at https://github.com/yaohungt/Multimodal-Transformer, which includes a basic crossmodal transformer block designed for 2 modalities. To extend this to 3 modalities, the crossmodal transformer block is repeated across all 3 modality pairs (i.e., one block per pair of modalities). While this is still computationally feasible for 3 modalities, such as the language, video, and audio datasets that MulT was originally designed for, it quickly becomes intractable for problems involving more than 3 modalities. To adapt MulT to the financial prediction task involving more than 10 modalities, we cluster all modalities into 3 groups based on similarities in their data and perform early fusion within each cluster before applying MulT only to the 3 clusters of modalities. While MulT is a strong model in terms of performance, it poses scalability issues that should be the subject of future work (i.e., the number of cross-modal attention blocks grows quadratically with the number of modalities).
Architecture search:
Finally, instead of hand-designing multimodal architectures, several approaches define a set of atomic neural operations (e.g., linear transformation, activation, attention, etc.) and use architecture search to automatically learn the best order of these operations for a given multimodal task [122, 173]. We focus on the more general approach, MFAS [122], designed for language and vision datasets.
We adapt the implementation from https://github.com/juanmanpr/mfas. While this approach is categorized under innovations in model architecture (since it primarily targets better architectures for multimodal fusion), its code in the MultiZoo toolkit is implemented under training structures, since architecture search requires an outer loop to learn model architectures over multiple inner supervised learning loops that train an individual model architecture. Therefore, we are unable to integrate MFAS directly with the basic supervised learning training loops like we do for the other fusion paradigms described above.
E.4. Optimization Objectives
In addition to the standard supervised losses (e.g., cross-entropy for classification, MSE/MAE for regression), several methods propose new optimization objectives based on:
Prediction-level alignment:
There has been extensive research in defining objective functions to tackle the challenge of multimodal alignment: capturing a representation space where semantically similar concepts from different modalities are close together. While primarily useful for cross-modal retrieval [104, 187], recent work has also shown its utility in learning representations for prediction [9, 33, 91, 151]. These alignment objectives have been applied at both the prediction and feature levels. For the former, we implement Canonical Correlation Analysis (CCA) [7, 166], which computes $\mathcal{L}_{\text{CCA}} = -\mathrm{corr}\big(g_1(z_1), g_2(z_2)\big)$, where $g_1$ and $g_2$ are auxiliary classifiers mapping each unimodal representation to the label. This objective corresponds to prediction-level alignment since it aims to learn representations of each modality that agree on the label, as measured by the correlation of label predictions made by each modality across a batch of samples.
We refer to the paper that most closely implements CCA-based alignment for multimodal data (specifically directly testing on the CMU-MOSI dataset) [145]. Since the authors did not release their code, we implemented it from scratch with reference to CCA implementations from https://github.com/Michaelvll/DeepCCA and https://github.com/VahidooX/DeepCCA.
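A minimal sketch of the correlation term in this objective, computed per output dimension over a batch of label predictions, is shown below; it illustrates the prediction-level alignment idea rather than the full DeepCCA implementations referenced above.

```python
import torch


def correlation_loss(p1: torch.Tensor, p2: torch.Tensor) -> torch.Tensor:
    """Negative Pearson correlation between two batches of label predictions
    (computed per output dimension, then averaged); added to the supervised loss."""
    p1 = p1 - p1.mean(dim=0, keepdim=True)
    p2 = p2 - p2.mean(dim=0, keepdim=True)
    corr = (p1 * p2).sum(dim=0) / (p1.norm(dim=0) * p2.norm(dim=0) + 1e-8)
    return -corr.mean()


if __name__ == "__main__":
    preds1 = torch.randn(32, 1)   # predictions from auxiliary classifier g1(z1)
    preds2 = torch.randn(32, 1)   # predictions from auxiliary classifier g2(z2)
    print(correlation_loss(preds1, preds2))
```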
Feature-level alignment:
For the latter, contrastive learning has emerged as a popular approach that brings similar concepts close together in feature space and pushes different concepts far apart [33, 91, 151]. MultiZoo includes RefNet [135], which uses a self-supervised contrastive loss between the unimodal representations $z_1, z_2$ and the multimodal representation $z_{\text{mm}}$, i.e., $\mathcal{L} = \big(1 - \cos(z_{\text{mm}}, g_1(z_1))\big) + \big(1 - \cos(z_{\text{mm}}, g_2(z_2))\big)$, where $g_i$ is an auxiliary layer mapping each modality's representation into the joint multimodal space. The intuition is that the unimodal representations $z_1, z_2$ and the multimodal representation $z_{\text{mm}}$ should be aligned in the multimodal feature space, as measured by cosine similarity. While the original RefNet method does not use negative samples, closely related work in multi-view contrastive learning has extended this idea to use negative samples, which is more closely in line with recent work in contrastive learning [151].
Since they did not release code, we implement REFNET ourselves on top of current supervised learning modules in MultiZoo.
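A simplified sketch of such a cosine-based alignment loss (without negative samples) is shown below; the projection layers and dimensions are illustrative, not our exact RefNet re-implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CosineAlignment(nn.Module):
    """Align unimodal representations with the multimodal representation via
    cosine similarity after projecting each modality into the joint space."""

    def __init__(self, d1: int, d2: int, d_mm: int):
        super().__init__()
        self.proj1 = nn.Linear(d1, d_mm)
        self.proj2 = nn.Linear(d2, d_mm)

    def forward(self, z1, z2, z_mm):
        sim1 = F.cosine_similarity(self.proj1(z1), z_mm, dim=-1)
        sim2 = F.cosine_similarity(self.proj2(z2), z_mm, dim=-1)
        return ((1 - sim1) + (1 - sim2)).mean()


if __name__ == "__main__":
    z1, z2, z_mm = torch.randn(8, 64), torch.randn(8, 32), torch.randn(8, 128)
    print(CosineAlignment(64, 32, 128)(z1, z2, z_mm))
```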
Reconstruction objectives:
Methods based on generative-discriminative models (e.g., VAEs) include an objective to reconstruct the input (or some part of it) [91, 155]. These objectives have been shown to better preserve task-relevant information in the learned representation, especially in settings with sparse supervised signals such as robotics [91] and long videos [155]. We include the Multimodal Factorized Model (MFM) [155], a general approach that learns a representation which can reconstruct the input data $x_1, \ldots, x_M$ while also predicting the label $y$. The multimodal representation is a concatenation of factorized modality-specific representations $z_{x_1}, z_{x_2}, \ldots, z_{x_M}$ and a multimodal discriminative representation $z_y$.
Since MFM optimizes a variational lower-bound to the log likelihood, the overall objective consists of 3 terms - generative, discriminative, and prior regularization:
$$\mathcal{L}_{\text{MFM}} = \underbrace{\lambda_{\text{gen}} \sum_{i=1}^{M} \big\| D_i(z_{x_i}, z_y) - x_i \big\|_2^2}_{\text{generative}} \;+\; \underbrace{\lambda_{\text{dis}} \, \mathcal{L}_{\text{pred}}\big(C(z_y), y\big)}_{\text{discriminative}} \;+\; \underbrace{\lambda_{\text{prior}} \, \mathrm{MMD}\big(q(z), p(z)\big)}_{\text{prior regularization}} \qquad (3)$$
where $z_{x_i} = E_i(x_i)$ are obtained from encoders $E_i$ mapping each modality to its representation, $z_y = E_y(x_1, \ldots, x_M)$ is obtained from a multimodal encoder $E_y$ producing the joint representation, $D_i$ are decoders mapping the latent representations back into the input data, and $C$ is a classification head predicting the label. The final MMD term is a regularizer that brings the latent representations $q(z)$ close to a unit Gaussian prior $p(z)$. The multimodal encoder $E_y$ in MFM can be instantiated with any multimodal model from Section 3.2 (e.g., learning $z_y$ via tensors and adding a term to reconstruct the input data). We use the public implementation at https://github.com/pliang279/factorized, which uses a temporal attention model as $E_y$ for multimodal time-series data. For the remaining experiments we replace $E_y$ with simple late fusion, but also run some experiments with multimodal methods that are state-of-the-art in each domain.
Improving robustness:
These approaches modify the objective function to account for robustness to noisy [101] or missing [89, 111, 123] modalities. MultiZoo includes MCTN [123], which uses cycle-consistent translation to predict the noisy or missing modality from the ones that are present. The key insight is that a joint representation between modalities $x_1$ and $x_2$ can be learned by using $x_1$ to predict $x_2$, in a vein similar to machine translation or image/text style transfer. MCTN defines a cyclic translation path $x_1 \rightarrow x_2 \rightarrow x_1$ and adds reconstruction losses on top of the supervised learning loss. The representations learned via translation are then used to predict the label. Perhaps surprisingly, the model only needs to take $x_1$ as input at test time and is therefore robust to all levels of noise or missingness in $x_2$.
E.5. Training Procedures
Improving generalization:
Recent work has found that directly training a multimodal model on all modalities using supervised learning is sub-optimal since different modalities overfit and generalize at different rates. MultiZoo includes an approach to address this, Gradient Blending (GradBlend), which computes generalization statistics for each modality to determine their weights during multimodal fusion [167]. We use the implementation in https://github.com/facebookresearch/VMZ and modify it to fit the MultiZoo training structures.
We also include a similar work, Regularization by Maximizing Functional Entropies (RMFE), which uses functional entropy to balance the contribution of each modality to the classification result [53]. We use the public implementation from https://github.com/itaigat/removing-bias-in-multi-modal-classifiers.
E.6. Domain-specific Methods
Finally, we also implemented several domain-specific methods that had been applied to each domain. These include sensor fusion [91] and Kalman filtering [90] for robotics, and the multimodal Refiner network [135] for multimedia experiments. We refer the reader to the respective papers for algorithmic details.
F. Integrating MultiBench and MultiZoo: A Brief Tutorial
MultiBench is available via our public GitHub: https://github.com/pliang279/MultiBench. We also include a landing website page at https://cmu-multicomp-lab.github.io/multibench/ that includes an introduction to the benchmark, links to the relevant papers on multimodal datasets and algorithms, and a public leaderboard to keep track of current progress on these multimodal tasks. In this section, we provide more details on the dataset loading and machine learning pipeline provided by MultiBench. We also describe the modular implementation of multimodal models in MultiZoo and provide several code examples to illustrate its usage.
F.1. Reading the Dataset
We provide scripts for reading each dataset supported by MultiBench at dataset/[dataset_name]/get_data.py in the repository. For each dataset, the user first needs to follow the downloading and preprocessing instructions documented in Appendix C.2 or in the comments of get_data.py. The Python script contains a function (usually called get_dataloader) that takes in the required arguments (such as the location of the preprocessed dataset or compressed data) and outputs a tuple of three PyTorch DataLoader objects for the train, validation, and test splits of the dataset respectively. These dataloaders can be fed directly into the training structures in MultiZoo, as illustrated in the sketch below.
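A minimal usage sketch follows; the exact module path and arguments of get_dataloader vary per dataset (see each dataset's get_data.py and Appendix C.2), so the dataset name, path, and assumed batch layout below are illustrative placeholders rather than the exact interface.

```python
# Hypothetical dataset choice and path; consult the dataset's get_data.py for the
# actual module location and required arguments.
from dataset.avmnist.get_data import get_dataloader

train_loader, valid_loader, test_loader = get_dataloader("/path/to/preprocessed/data")

for batch in train_loader:
    # Assumed convention: one tensor per modality followed by the labels.
    *modalities, labels = batch
    print([m.shape for m in modalities], labels.shape)
    break
```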
F.2. Unimodal Models
In addition to the multimodal models described in Appendix E, which are the main subject of study in this area, each dataset and modality typically also requires an initial processing stage, either through feature extraction (see Appendix C.2 for the initial feature extraction done on each dataset) and/or unimodal models applied to raw data or extracted features.
To standardize the implementation of unimodal models, MultiZoo includes an implementation of several standard unimodal models that we encountered when running experiments on the diverse range of datasets and modalities in MultiBench. Each unimodal model is implemented as a function class that takes in either raw data or extracted features from a modality and returns a unimodal representation tensor after applying the function. MultiZoo includes the following unimodal methods:
Multi-layer Perceptrons form the building blocks of many deep learning methods and are generally suitable for any modality that has been feature-extracted into a vector requiring no further processing with inductive biases. Their general structure means that they can be flexibly adapted to the tabular, set, image, and text (e.g., see the Deep Averaging Network [76]) modalities. They have also been used as a starting point for force and proprioception sensors in robotics when the data does not come in the form of time series [91].
Convolutional Networks [87] are typically used over the image modality. They are also used on the audio modality if an initial preprocessing step of converting raw audio to spectrograms is used.
ResNets [66] are an improvement over ConvNets to enable training of deeper models and have been used extensively for images and audio spectrograms.
Recurrent Networks [134], GRUs [29], and LSTMs [69] are suitable for temporal data in the form of text, video, audio, and time-series modalities.
Transformers [158] have recently emerged as a strong alternative to recurrent models by using self-attention rather than an accumulative memory. They are also suitable for text, video, audio, and time-series modalities. We also implemented recently proposed Vision Transformers [44] that adapt Transformer models for image classification as well.
Deep Sets [184] was proposed as a permutation-invariant method for machine learning on sets, and was shown to outperform prior methods such as MLPs that are sensitive to the permutation of elements.
Finally, we also included several domain-specific methods that we encountered as we were accumulating the datasets in MultiBench. Some of these methods include MaxOut networks [58] used for MM-IMDb [8] and Causal Convolution [157] for the high-frequency force sensors used in robotics datasets [91, 90].
F.3. Multimodal Models
MultiZoo includes an implementation of all multimodal methods described in Appendix E. Each multimodal method (i.e., fusion paradigm) is implemented as a Pytorch Module class taking in unimodal tensors and returning final multimodal representation vectors. We implemented several common fusion modules, such as Concatenation, Early-Concatenation (i.e., concatenate in input space), Stack, FilM, Multiplicative-Interactions (MI), Tensor Fusion, LRTF, NL-gate, and more described in Appendix E. When the training algorithm requires non-standard multimodal representations (e.g., more than one vector output from fusion module) or the unimodal encoders produce non-standard unimodal representations (i.e., not a single vector representation), special fusion modules will be needed in these situations. For example, we wrote a roboticsConcat module that performs concatenation for the Vision&Touch dataset due to its non-standard unimodal encoder output. We also have special fusion modules for optimization objectives or training structures such as MVAE, MFAS, and GRadBlend. The design of modular fusion modules gives flexibility in model design, as users can reuse a previous fusion module directly in most cases but can also write their own special fusion modules easily.
F.4. Classification Head
Finally, MultiZoo includes flexible implementations of classification heads that take in the multimodal representation and return a label either directly (perhaps with some activation) for regression or a softmax over classes for classification.
F.5. Optimization Objectives
The optimization objectives are modules that take in the classification or regression result produced by the model and the ground-truth (as well as other necessary inputs if applicable) and return a loss that can be used to optimize the model based on the desired objective. In most methods we simply use torch.nn.CrossEntropyLoss as the objective for classification tasks and torch.nn.MSELoss as the objective for regression tasks. However, in certain training structures, special objectives are required. For example, MultiZoo includes implementations of objective functions such as weighted reconstruction loss and ELBO loss used in reconstruction-based methods MFM and MVAE, and there are also implementations of alignment-based objectives such as CCA and contrastive learning. The final optimization objective returns a weighted sum of these prediction objectives and auxiliary objectives, where the user is free to specify these weights as hyperparameters.
F.6. Training Structures
Training Structures are the main body of MultiZoo programs. All other modules (unimodal models, fusion paradigms, optimization objectives, classification heads, etc) can be seen as exchangeable plugins to these training structures. The training structure determines the main training algorithm, with the most common one being supervised_learning (training unimodal, multimodal, and classification parameters directly for a task-specific supervised learning objective).
More advanced methods may change this training structure either through additional optimization objectives (MVAE [168], MFM [155]) or via extensions of supervised learning through dynamic weighting of modalities (GRadBlend [167]) or an outer architecture search training loop (MFAS [122]). Each of these methods, therefore, have their own training structure module.
These interchangeable plugin modules give a lot of flexibility in adapting each training structure to new tasks. For example, for the experiments described in Section G, the methods that are primarily based on different fusion paradigms (i.e., EF, LF, TF, LRTF, MI, NL-Gate, MulT etc all use the same training structure (supervised_learning) with different plugin fusion modules (and different unimodal encoders and heads based on datasets and tasks). Similarly, while most of these more advanced training structures were originally paired with a simple LF model in their original papers, our modular implementation makes it possible to combine advances in fusion paradigms with training structures in future work.
F.7. Performance Evaluation
We standardize evaluation using metrics designed for each dataset, ranging from MSE and MAE for regression to accuracy, micro & macro F1-score, and AUPRC for classification. We use the standard PyTorch and scikit-learn implementations of these performance metrics.
Algorithm 2.
from datasets.get_data import get_dataloader |
from unimodals.common_models import ResNet, Transformer |
from fusions.common_fusions import MultInteractions |
from training_structures.gradient_blend import train, test |
# loading Multimodal IMDB dataset |
traindata, validdata, testdata = get_dataloader(‘multimodal_imdb’) out_channels = 3 |
# defining ResNet and Transformer unimodal encoders encoders = [ResNet(in_channels=1, out_channels, layers=5), |
Transformer(in_channels=1, out_channels, layers=3)] # defining a Multiplicative Interactions fusion layer fusion = MultInteractions([out_channels*8, out_channels*32], out_channels*32, ‘matrix’) classifier = MLP(out_channels*32, 100, labels=23) # training using Gradient Blend algorithm |
model = train(encoders, fusion, classifier, traindata, validdata, epochs=100, optimtype=torch.optim.SGD, lr=0.01, weight_decay=0.0001) |
# testing performance, complexity, robustness = test(model, testdata) |
F.8. Complexity Evaluation
We report training memory by measuring peak memory usage of the python process during the entire training process using python memory_profiler toolkit (https://pypi.org/project/memory-profiler/). When counting the number of parameters when training a model, we only count the parameters in persistent modules during training and does not count the ephemeral networks or modules created in the middle of the training process (such as the networks trained for determining weights in GRadBlend or the fusion architectures created as part of the architecture search process in MFAS).
F.9. Robustness Evaluation
For robustness experiments, modality-specific and multimodal imperfections are implemented as modules. A separate version of data loader is created for each dataset to test robustness, which adds custom unimodal or multimodal imperfections of increasing noise levels [0,1] to the original clean test set. A testing module is also provided specifically for robustness experiments, which evaluates the model on increasing levels of noisy test datasets and prints out the metrics for visualization. In this way, MultiZoo allows highly modular data loading and robustness evaluation that requires minimal modification to the regular training and testing workflow.
MultiZoo includes evaluation protocols summarizing these robustness results. It includes visualization functions of the performance-imperfection curves across datasets in MultiBench. We also implemented relative and effective robustness as two quantitative metrics for robustness evaluation. For relative robustness, we approximate the area under the performance-imperfection curves for each model across MultiBench datasets. For effective robustness, we take the performance-imperfection curve of LF evaluated on the same dataset equalized for initial accuracy on clean test data. For both metrics, we normalized performance across all models evaluated on the same dataset.
F.10. Code Snippets
In Algorithm 2, we show a sample code snippet in Python that loads a dataset from MultiBench (Appendix C.2), defines the unimodal and multimodal architectures, optimization objectives, and training procedures (Appendix E), before running the evaluation protocol (Appendix D). Our MultiZoo toolkit is easy to use and trains entire multimodal models in less than 10 lines of code. By standardizing the implementation of each module and disentangling the individual effects of models, optimizations, and training, MultiZoo ensures accessibility and reproducibility of its multimodal algorithms.
Table 4:
Component | Model | Parameter | Value | |
---|---|---|---|---|
GRU Encoder | GRU | Input sizes | [5,20,35,74,300,704] | |
Hidden sizes | [32,32,64,128,512,1024] | |||
Num of layers | 1 or 2 | |||
Dropout | 0:0 or 0:1 | |||
| ||||
Transformer Encoder [158] | Transformer [158] | Input sizes | [5,20,35,74,300,704] | |
Hidden sizes | [32,32,64,128,512,1024] | |||
Num of layers | 2 or 3 | |||
Dropout | 0.2 | |||
| ||||
Head | MLP | Input sizes | [5,20,32,64,128,256] | |
Hidden sizes | [5,20,32,64,128,256] | |||
Num layers | [2 | |||
Dropout | 0.2 | |||
| ||||
MCTN [123] Encoder | GRU | Input sizes | 300 | |
Hidden sizes | [32, 64] | |||
Num of layers | 1 or 2 | |||
Dropout | 0.0 or 0.1 | |||
| ||||
MCTN [123] Decoder | GRU | Input sizes | [32, 64] | |
Hidden sizes | 300 | |||
Num of layers | 1 or 2 | |||
Dropout | 0.0 or 0.1 | |||
| ||||
MCTN [123] Seq2Seq | GRU+GRU | teaching ratio | 0.5 | |
Embed sizes | 32 | |||
, , | 0.01 | |||
| ||||
Fusion | LRTF [106] | Num ranks | 64 | |
Output sizes | 128 | |||
| ||||
MI-Matrix [77] | Hidden size | 128 | ||
| ||||
MulT [ | Hidden size | 40 | ||
Num heads | 8 or 10 | |||
| ||||
Training | Loss | MAE or Cross Entropy | ||
Batch size | 32 | |||
Seq Length | 50 or 20 | |||
Num epochs | 100 or 300 | |||
Early stop | True | |||
Patience | [8,20] | |||
Activation | ReLU | |||
Optimizer | AdamW | |||
Weight Decay | 1×10−4 | |||
Learning rate | 1×10−4 |
G. Experimental Setup
In this section, we provide additional details of the experimental setup. All experiments were conducted on a server with 4× Nvidia GTX 980 Ti GPUs, 5× Nvidia Tesla P40 GPUs, 2× Nvidia Tesla K40c GPUs, 4× Nvidia TITAN X GPUs, 1× Tesla T4 GPU, and 1× Tesla V100 GPU. The server also contained 32× Intel(R) Xeon(R) CPU (E5 − 2670, 2.60GHz).
G.1. Affective Computing
Hyperparameters:
We show the hyperparameters used for models on datasets in the Affective Computing domain in Table 4. For each dataset we tune the following hyperparameters selected from the following ranges: the learning rate is selected between 0.00001 to 0.001 and set to be 0.0001 in the beginning; Early stopping is applied with patience 8 to 20 before overfitting happens; The input sizes and hidden sizes vary according to the different modalities and datasets. The , , and hyperparameters in MCTN [123] is tuned between 0.005 to 0.1. The sequence length varies from 20 to 50. Only punchline sentences (target sentences) are used in UR-FUNNY [64] and MUStARD [24] following the original papers.
Hyperparameters were selected based on performance on the validation set. For models that had been previously proposed and tested on these datasets, we use the same hyperparameters as those reported in their paper or public code.
All experiments were repeated 10 times and a mean and standard deviation was computed.
G.2. Healthcare
We show the hyperparameters used for models on datasets in the Healthcare domain in Table 5. The unimodal architectures follow the original paper that created this partition of MIMIC [129], then we tune the following hyperparameters selected from the following ranges: Learning rate is tuned between 0.1 and 0.0001; the number of epochs is selected based on when overfitting happens; for hyperparameters specific to architectures or training structures (such as GRadBlend, MFAS), we followed the same configuration as the original papers where these methods are proposed.
All experiments were repeated 10 times and a mean and standard deviation was computed.
G.3. Robotics
We show the hyperparameters used for MuJoCo Push in Table 6 and Vision&Touch in Table 7.
For MuJoCo Push, we follow hyperparameters and preprocessing in the original paper [90]. Unimodal modules follow the original hyperparameters assigned to the input modality.
For Vision&Touch, we follow hyperparameters in the original paper [91] for all unimodal modules as well as Sensor Fusion (which is the method proposed in [91]).
All other hyperparameters were selected based on performance on the validation set. For models that had been previously proposed and tested on these datasets, we use the same hyperparameters as those reported in their paper or public code. The original Vision&Touch dataset did not have a unique test dataset, so we report their best performance on the validation set instead.
All experiments were repeated 10 times and a mean and standard deviation was computed.
G.4. Finance
We show the hyperparameters used for models on datasets in the Finance domain in Table 8. For each dataset, we tune the following hyperparameters selected from the following ranges: Hidden/embed dim (4 − 512), Transformer/MulT layers (1 − 4), Transformer/MulT heads (1 − 4), epochs (1 − 32), and batch size (4 − 128). Hyperparameters were selected based on performance on the validation set. Note that this dataset overfits quickly when model complexity is increased; several hyperparameters are kept small for this reason.
All experiments were repeated 10 times and a mean and standard deviation was computed.
G.5. HCI
We show the hyperparameters used for models on the ENRICO dataset in the HCI domain in Table 9.
We tune the learning rate by starting from 10−4, the value reported in the original paper [93]. We searched in a range between 10−2 and 10−6 and found that 10−5 led to the best performance. We tested hidden dimension sizes from 8 to 128 and found that a size of 16 was sufficient for the unimodal encoders. Note that this dataset is small and overfits quickly when model complexity is increased. We minimized the risk of overfitting by keeping several hyperparameters (e.g., hidden dim) small. For more information, refer to the dataset preprocessing section for ENRICO.
All experiments were repeated 10 times and a mean and standard deviation was computed.
G.6. Multimedia
We show the hyperparameters used for models on datasets in the Multimedia domain in Tables 10, 11, 12.
For AV-MNIST, used the same LeNet unimodal encoders following current work [161]. We tuned learning rates between 0.1 and 0.001. The default batch size is 40, although it can be changed in some methods (such as CCA) to make sure the methods work as intended; the number of epochs is selected based on when overfitting happens; for hyperparameters specific to architectures or training structures (such as GRadBlend, MFAS), we followed the same configuration as the original papers where these methods are proposed.
For MM-IMDb, used the same MaxoutLinear unimodal encoders following current work [8]. Learning rates were tuned between 0.1 and 0.001 except for unimodal training. The default batch size is 128 while that for CCA is 800 to make sure the methods work as intended. The number of epochs was selected based on early stopping with patience equal to 7, which means if the macro F1 on the validation set did not improve for 7 epochs, training was stopped early.
For Kinetics, we use a ResNet-LSTM for the visual modality encoder and the architectures described by Wang et al [167] for the rest of the models. We use a learning rate of 0.0001, batch size of 16, and 15 epochs for the small dataset experiments. For the large dataset experiments, we used the setup described by Wang et al [167].
Hyperparameters were selected based on performance on the validation set. For models that had been previously proposed and tested on these datasets, we use the same hyperparameters as those reported in their paper or public code.
All experiments were repeated 10 times and a mean and standard deviation was computed.
Table 5:
Component | Model | Parameter | Value |
---|---|---|---|
Static Encoder | 2-layer MLP | Hidden sizes Activation |
[10,10] LeakyReLU(0.2) |
Static Decoder | 2-layer MLP | Layer sizes Activation |
[200,40,5] LeakyReLU(0.2) |
Time Series Encoder | GRU | Hidden size | 30 |
Time Series Decoder | GRU | Hidden size | 30 |
Classification Head | 2-Layer MLP | Hidden size Activation |
40 LeakyReLU(0.2) |
Fusion | LRTF [106] | Output dim Ranks |
100 40 |
NL-Gate [167] | thw-dim/c-dim/tf-dim key linear value linear |
24/30/10 [10, 300] [10, 300] |
|
MI-Matrix [77] | output dim | 100 | |
Training | Unimodal, LF, LRTF, MI-Matrix, NL-gate | Loss Batch size Num epochs Optimizer Learning rate |
Cross Entropy 40 20 RMSprop 0.001 |
GRadBlend [167] | Loss Batch size Num epochs Optimizer Learning Rate GB-epoch v-rate finetune epoch |
Cross Entropy 40 300 SGD 0.005 20 0.8 25 |
|
MVAE [168] | Loss Batch size Num epochs Optimizer Learning Rate Cross Entropy Weight Latent Representation Fusion |
Cross Entropy + ELBO 40 30 Adam 0.001 2.0 ProductOfExpert |
|
MFM [155] | Loss Batch size Num epochs Optimizer Learning Rate Recon Loss Modality Weights Cross Entropy Weight Intermediate Modules |
Cross Entropy + Reconstruction(MSE) 40 30 Adam 0.001 [1,1] 2.0 MLPs [200, 100, 100], [200, 100, 100], [400, 100, 100] |
|
MFAS [122] | Epochs/search iters Num samples/surrogates per epoch η max/min/Ti/Tm Temperature init/final/decay Max progression level Surrogate learning rate Surrogate hidden size Surrogate embedding size Search space Optimizer Representation Size |
3/3/6 15/50 10−3/10−6/1/2 10.0/0.2/4.0 4 0.001 100 100 (3,3,2) Adam 16 |
Table 6:
Component | Model | Parameter | Value |
---|---|---|---|
Pos Encoder | Linear | Hidden sizes | [64,64,64 (residual)] |
Sensors Encoder | Linear | Hidden sizes | [64,64,64 (residual)] |
Image Encoder | CNN | Filter sizes Num filters Filter strides Filter padding |
[5,3,3,3,3] [32,32,32,16,8] 1 [2,1,1,1,1] |
Control Encoder | Linear | Hidden sizes | [64,64,64 (residual)] |
Fusion | Early Fusion & Unimodal LSTM | Hidden size Num layers |
512 2 |
Late Fusion LSTM | Hidden size Num layers |
256 1 |
|
MulT [156] | Embed size Num heads |
64 4 |
|
Classification Head | Linear | Hidden size | 64 |
Training | Loss Batch size Num epochs Activation Optimizer Learning rate |
Mean Squared Error 32 20 ReLU Adam 10−5 |
Table 7:
Component | Model | Parameter | Value |
---|---|---|---|
Image Encoder | CNN | Filter sizes Num filters Filter strides Filter padding |
[7,5,5,3,3,3] [16,32,64,64,128,128] [2,2,2,2,2,2] Same |
Force Encoder | Causal Convolution [157] | Filter sizes Num filters Filter strides Filter padding |
[2,2,2,2,2] [16,32,64,128,256] [2,2,2,2,2] 1 |
Proprio Encoder | Linear | Hidden sizes | [32, 64, 128, 256] |
Depth Encoder | CNN | Filter sizes Num filters Filter strides Filter padding |
[3, 3, 4, 3, 3, 3] [32, 64, 64, 64, 128, 128] [2, 2, 2, 2, 2, 2] Same |
Action Encoder | Linear | Hidden sizes | [32, 32] |
Classification Head | 2-Layer MLP | Hidden size Activation |
128 LeakyReLU(0.2) |
Fusion | LRTF [106] | Output dim Ranks |
200 40 |
Sensor Fusion [91] | z-dim | 128 | |
Training | Loss Batch size Num epochs Optimizer Learning rate |
Contact: Cross Entropy End-Effector: MSE 64 Sensor Fusion: 50 LRTF: 35; Others: 15 Adam Contact: 10−4 End-Effector: 5×10−4 |
|
RefNet [135] | Loss Batch size Optimizer/Learning Rate Refiner Self Loss Weight |
Cross Entropy + Contrast 40 Adam / 0.0005 MLP(1056,2000,65760) 0.0001 |
Table 8:
Model | Parameter | Value | |
---|---|---|---|
Unimodal & Early Fusion LSTM | Hidden dim | 128 | |
Late Fusion LSTM | Hidden dim | 16 | |
Transformer [158] | Embed dim Num heads Layers |
9 3 3 |
|
MulT [154] | Embed dim Num heads Layers |
9 3 3 |
|
GRadBlend [167] LSTM | Hidden dim | 128 | |
Training | Loss Batch size Max seq length Activation Optimizer Learning rate |
Mean Squared Error 16 500 ReLU Adam 10−3 |
|
Num epochs | Unimodal, EF | 2 | |
LF, Transformer, MulT, GRadBlend | 4 |
Table 9:
Model | Parameter | Value |
---|---|---|
Unimodal | Hidden dim | 16 |
Late Fusion | Hidden dim | 32 |
GradBlend [167] | Hidden dim | 32 |
RefNet [135] | Hidden dim | 32 |
MI-Matrix [77] | Hidden dim Input dims |
32 16, 16 |
Tensor Matrix | Hidden dim Input dims |
32 16, 16 |
LRTF [106] | Hidden dim Input dims Rank |
32 16, 16 20 |
CCA [145] | Hidden dim | 32 |
Training | Loss Batch size Activation Dropout Optimizer Learning rate |
Class-weighted Cross Entropy 32 ReLU 0.2 Adam 10−5 |
Num epochs | 50 |
Table 10:
Component | Model | Parameter | Value |
---|---|---|---|
Image Encoder | LeNet-3 | Filter Sizes Num Filters Filter Strides / Filter Paddings Max Pooling |
[5, 3, 3, 3] [6, 12, 24, 48] [1, 1, 1, 1]/[2, 1, 1, 1] [2, 2, 2, 2] |
Image Decoder | DeLeNet-3 | Filter Sizes Num Filters Filter Strides / Filter Paddings |
[4, 4, 4, 8] [24, 12, 6, 3] [2, 2, 2, 4]/[1, 1, 1, 1] |
Audio Encoder | LeNet-5 | Filter Sizes Num Filters Filter Strides / Filter Paddings Max Pooling |
[5, 3, 3, 3, 3, 3] [6, 12, 24, 48, 96, 192] [1,1,1,1,1,1]/[2,1,1,1,1,1] [2, 2, 2, 2, 2, 2] |
Audio Decoder | DeLeNet-5 | Filter Sizes Num Filters Filter Strides / Filter Paddings |
[4, 4, 4, 4, 4, 8] [96, 48, 24, 12, 6, 3] [2, 2, 2, 2, 2, 4]/[1, 1, 1, 1, 1, 1] |
Classification Head | 2-Layer MLP | Hidden size Activation |
100 LeakyReLU(0.2) |
Fusion | LRTF [106] | Output dim Ranks |
120 40 |
MI-Matrix [77] | output dim | 240 | |
Training | Unimodal, LF, LRTF, MI-Matrix | Loss Batch size Num epochs Optimizer/Learning rate/weight decay |
Cross Entropy 40 LRTF: 30, Others: 25 SGD/0.05/0.0001 |
GRadBlend [167] | Loss Batch size Num epochs Optimizer/Learning rate GB-epoch/finetune-epoch v-rate |
Cross Entropy 40 300 SGD/0.05 10/25 0.8 |
|
MVAE [168] | Loss Batch size Num epochs Optimizer/Learning rate Cross Entropy Weight Latent Representation Fusion |
Cross Entropy + ELBO 40 20 Adam/0.001 2.0 ProductOfExpert |
|
MFM [155] | Loss Batch size Num epochs Optimizer/Learning rate Recon Loss Modality Weights Cross Entropy Weight Intermediate Modules |
Cross Entropy + Reconstruction(MSE) 40 25 Adam/0.001 [1,1] 2.0 MLPs [200,100,100], |
|
MFAS [122] | Batch size Main epochs/search iters/epochs per model Num samples/surrogates per epoch η max/min/Ti/Tm Temperature init/final/decay Max progression level Surrogate learning rate Surrogate hidden/embedding size Search space Optimizer Representation Size |
32 3/3/6 15/50 10−3/10−6/ 1/2 10.0/0.2/4.0 4 0.001 100/100 (3,5,2) Adam 16 |
|
CCA [145] | Batch size Loss Optimizer/Learning Rate/Weight Decay |
800 CCALoss AdamW/ 0.01/0.01 |
|
RefNet [135] | Loss Batch size Optimizer/Learning Rate Refiner Self Loss Weight |
Cross Entropy + Contrast 40 SGD / 0.05 MLP(384,1000,13328) 0.1 |
Table 11:
Component | Model | Parameter | Value |
---|---|---|---|
Text Encoder | 2-Layer MaxoutMLP | Hidden size Output dim MLP num |
512 128/256/512 2 |
Image Encoder | 2-Layer MaxoutMLP | Hidden size Output dim MLP num |
1024 128/256/512 2 |
Classification Head | Linear | ||
2-Layer MLP | Hidden size Activation |
512 ReLU |
|
2-Layer Maxout_Linear | Hidden size MLP num |
512 2 |
|
Fusion | Concatenate | ||
LRTF [106] | Output dim Ranks |
512 128 |
|
MI-Matrix [77] | output dim | 1024 | |
Training | Unimodal, EF, LF, LRTF, MI-Matrix | Loss Batch size Num epochs Optimizer Learning rate Weight decay |
Binary Cross Entropy 128 Text: 125, Image: 25, LF:5, EF/LRTF:15, MI-Matrix:20 AdamW Unimodal: 0.0001, EF: 0.04, LF/LRTF/MI-Matrix: 0.008 0.01 |
CCA [145] | Loss CCA weight Batch size Num epochs Optimizer Learning rate Weight decay |
Binary Cross Entropy + CCA 0.001 800 20 AdamW 0.01 0.01 |
|
RMFE [53] | Loss Regularization weight Batch size Num epochs Optimizer Learning rate Weight decay |
Binary Cross Entropy + Regularization 1e −10 128 10 AdamW 0.01 0.01 |
|
RefNet [135] | Loss Contrast weight Self-supervised weight Batch size Num epochs Optimizer Learning rate Weight decay |
Binary Cross Entropy + Contrast + Self-supervised 0.0001 0.1 128 10 AdamW 0.01 0.01 |
|
MFM [155] | Loss Batch size Num epochs Optimizer Learning rate Recon Loss Modality Weight Cross Entropy Weight Intermediate Modules |
Binary Cross Entropy + Reconstruction(MSE) 128 10 Adam 0.005 [1,1] 2.0 MLP [512,256,256] MLP [512,256,256] MLP [1024,512,256] |
Table 12:
Component | Model | Parameter | Value |
---|---|---|---|
Video Encoder | ResNet [66] + LSTM | ResNet Version LSTM Hidden size |
18-layer 64 |
Audio Encoder | ResNet [66] + 2-Layer MLP | ResNet Version MLP hidden size MLP output size MLP activation |
50-layer 200 64 ReLU |
Classification Head | Linear | ||
2-Layer MLP | Hidden size Activation |
200 ReLU | |
Fusion | Concatenate | ||
Training | Unimodal, LF | Loss Batch size Num epochs Optimizer Learning rate |
Cross Entropy 16 15 Adam 0.0001 |
Table 13:
Dataset Acc(2)↑ |
MUStARD Acc(2)↑ |
CMU-MOSI Acc(2)↑ |
UR-FUNNY Acc(2)↑ |
CMU-MOSEI Acc(2)↑ |
|
---|---|---|---|---|---|
U | Unimodal Unimodal Unimodal |
68.6±0.4 64.9±0.4 65.7±0.7 |
74.2±0.5 65.5±0.2 66.3±0.3 |
58.3±0.2 57.2±0.9 57.3±0.5 |
78.8±1.5 66.4±0.7 67.2±0.4 |
M | EF-GRU LF-GRU EF-Transformer LF-Transformer TF [179] LRTF [106] MI-Matrix [77] MulT [154] |
66.3±0.3 66.1±0.9 65.3±1.4 66.1±0.9 62.1±2.2 65.2±1.5 61.8±0.3 71.8±0.3 |
73.2±2.2 75.2±0.8 78.8±0.4 79.6±0.4 74.4±0.2 76.3±0.3 73.9±0.4 83.0±0.1 |
60.2±0.5 62.5±0.5 62.9±0.2 63.4±0.3 61.2±0.4 62.7±0.2 61.9±0.3 66.7±0.3 |
78.4±0.6 79.2±0.4 79.6±0.3 80.6±0.3 79.4±0.5 79.6±0.6 76.5±0.4 82.1±0.5 |
O | MFM [155] MVAE [168] MCTN [123] |
66.3±0.3 64.5±0.4 63.2±1.4 |
78.1±0.9 77.2±0.3 76.9±2.1 |
62.4±1.1 62.0±0.5 63.2±0.8 |
79.4±0.7 79.1±0.2 76.4±0.4 |
T | GradBlend [167] | 66.1±0.3 | 75.5±0.5 | 62.3±0.3 | 78.1±0.3 |
H. Experimental Results
In this section, we provide additional experimental results and observations. For all experimental tables, we describe the accuracy metrics using Acc(c) where is the number of classes. AUPRC stands for the area under the precision-recall curve which is a useful performance metric for imbalanced data in settings where one cares a lot about finding positive examples. MSE stands for mean squared error. We use up and down arrows (↑ and ↓) to indicate metrics where higher is better (Acc, AUPRC) and metrics where lower is better (MSE) respectively.
H.1. Affective Computing
We show the full performance results in Table 13 and complexity results in Table 14. Here we list some observations regarding these results:
Language is usually the best performing modality, especially on sentiment and emotion prediction. However, the improvement of language over audio and video on humor prediction and sarcasm prediction is much less. This follows our intuition that while language is primarily useful for sentiment and emotion prediction, audio and visual are strong predictors for humor and sarcasm.
The best performing method over these datasets is consistently the Multimodal Transformer (MulT [155]), which was originally tested on predicting sentiment and emotions on the CMU-MOSI and CMU-MOSEI dataset. We find that it is a general method and generalizes to humor and sarcasm prediction as well.
However, while it MulT achieves the best performance, it suffers in complexity, taking more than 12× the inference time of unimodal models and 3 − 4× several simpler early or late fusion multimodal baselines.
Some methods that work well on humor, sentiment, and emotion prediction do not generalize to sarcasm detection, such as tensor fusion (TF) and reconstruction-based models (MVAE and MFM). It is not a surprise that this coincides with sarcasm being the least studied task as well. Furthermore, we believe that it is a task with extremely complementary information (e.g., sarcasm is usually displayed via text and video/audio features contradicting each other). We hope that MultiBench can encourage further research in such multimodal tasks since current methods do not generalize to these tasks.
Several out-of-domain methods, such as GRadBlend do not work well. In fact we find that the variance of the GRadBlend method is quite high and shows strong performance on several datasets but struggles on others.
MCTN is designed for robustness and only uses the language modality at test time. While it was shown to work well for relatively easier fusion tasks in predicting sentiment, emotions, and humor [123], we find that it struggles on the more challenging sarcasm prediction task.
Table 14:
Dataset | MUStARD | ||||||
---|---|---|---|---|---|---|---|
Metric | Epochs trained | Training time (s) | Training params (M) | Training peak memory (MB) | Inference time (s) | Inference params (M) | |
U | Unimodal | 43 | 381 | 0.12 | 2347 | 0.33 | 0.12 |
Unimodal | 48 | 56 | 0.01 | 2288 | 0.24 | 0.01 | |
Unimodal | 69 | 288 | 0.001 | 2288 | 0.25 | 0.001 | |
| |||||||
M | EF-GRU | 126 | 168 | 0.84 | 2291 | 0.34 | 0.84 |
LF-GRU | 74 | 52 | 1.52 | 2307 | 0.40 | 1.52 | |
EF-Transformer | 30 | 601 | 1.86 | 2423 | 0.79 | 1.86 | |
LF-Transformer | 42 | 1868 | 14.0 | 2586 | 1.02 | 14.0 | |
TF [179] | 46 | 1370 | 14.7 | 2542 | 1.62 | 14.7 | |
LRTF [106] | 33 | 49 | 0.68 | 2483 | 0.50 | 0.68 | |
MulT [154] | 31 | 2414 | 1.93 | 3345 | 3.01 | 1.93 | |
| |||||||
O | MFM [155] | 40 | 2138 | 4.85 | 2417 | 1.48 | 4.33 |
MVAE [168] | 33 | 4645 | 4.32 | 2695 | 2.11 | 4.05 | |
MCTN [123] | 100 | 1026 | 0.19 | 2359 | 1.02 | 0.19 | |
| |||||||
T | GradBlend [167] | 100 | 6012 | 1.95 | 2406 | 0.42 | 1.58 |
Dataset | CMU-MOSI | ||||||
---|---|---|---|---|---|---|---|
Metric | Epochs trained | Training time (s) | Training params (M) | Training peak memory (MB) | Inference time (s) | Inference params (M) | |
U | Unimodal | 30 | 590 | 0.17 | 2347 | 0.49 | 0.17 |
Unimodal | 35 | 71 | 0.01 | 2288 | 0.36 | 0.01 | |
Unimodal | 188 | 346 | 0.001 | 2288 | 0.38 | 0.001 | |
| |||||||
M | EF-GRU | 106 | 221 | 1.42 | 2291 | 0.44 | 1.42 |
LF-GRU | 14 | 60 | 1.84 | 2307 | 0.58 | 1.84 | |
EF-TRANSFORMER | 20 | 635 | 2.18 | 2423 | 1.07 | 2.18 | |
LF-Transformer | 33 | 2011 | 15.1 | 2586 | 2.12 | 15.1 | |
TF [179] | 35 | 384 | 12.2 | 2867 | 2.38 | 12.2 | |
LRTF [106] | 43 | 172 | 0.82 | 2454 | 0.59 | 0.82 | |
MulT [154] | 22 | 2414 | 2.38 | 3345 | 4.30 | 2.38 | |
| |||||||
O | MFM [155] | 31 | 1692 | 5.53 | 2455 | 1.52 | 4.98 |
MVAE [168] | 35 | 3820 | 5.31 | 2564 | 2.03 | 4.69 | |
MCTN [123] | 100 | 1149 | 0.19 | 2366 | 0.98 | 0.19 | |
| |||||||
T | GradBlend [167] | 300 | 18869 | 3.91 | 2355 | 0.59 | 1.86 |
Dataset | UR-FUNNY | ||||||
---|---|---|---|---|---|---|---|
Metric | Epochs trained | Training time (s) | Training params (M) | Training peak memory (MB) | Inference time (s) | Inference params (M) | |
U | Unimodal | 32 | 602 | 1.99 | 6524 | 1.82 | 1.99 |
Unimodal | 29 | 70 | 0.14 | 6528 | 1.61 | 0.14 | |
Unimodal | 40 | 1039 | 0.03 | 6599 | 1.66 | 0.03 | |
| |||||||
M | EF-GRU | 34 | 612 | 3.58 | 6535 | 2.51 | 3.58 |
LF-GRU | 10 | 498 | 2.28 | 6791 | 3.25 | 2.28 | |
EF-TRANSFORMER | 32 | 2358 | 4.87 | 7086 | 3.81 | 4.87 | |
LF-Transformer | 33 | 6024 | 34.5 | 7288 | 6.75 | 34.5 | |
TF [179] | 32 | 2780 | 21.3 | 7165 | 6.35 | 21.3 | |
LRTF [106] | 25 | 2057 | 1.05 | 6931 | 3.32 | 1.05 | |
MulT [154] | 30 | 8096 | 5.01 | 9572 | 12.1 | 5.01 | |
| |||||||
O | MFM [155] | 30 | 5123 | 6.89 | 6970 | 10.3 | 6.23 |
MVAE [168] | 32 | 10670 | 6.59 | 7038 | 12.1 | 6.10 | |
MCTN [123] | 100 | 10857 | 0.19 | 6578 | 4.39 | 0.19 | |
| |||||||
T | GradBlend [167] | 100 | 19212 | 4.12 | 6832 | 3.42 | 2.31 |
Dataset | CMU-MOSEI | ||||||
---|---|---|---|---|---|---|---|
Metric | Epochs trained | Training time (s) | Training params (M) | Training peak memory (MB) | Inference time (s) | Inference params (M) | |
U | Unimodal | 23 | 561 | 1.80 | 5830 | 1.79 | 1.80 |
Unimodal | 27 | 647 | 0.12 | 5817 | 1.46 | 0.12 | |
Unimodal | 39 | 910 | 0.03 | 5818 | 1.48 | 0.03 | |
| |||||||
M | EF-GRU | 22 | 548 | 3.23 | 5835 | 2.01 | 3.23 |
LF-GRU | 9 | 443 | 2.08 | 5996 | 2.55 | 2.08 | |
EF-Transformer | 30 | 1658 | 4.49 | 6082 | 2.88 | 4.49 | |
LF-Transformer | 35 | 5504 | 31.5 | 6996 | 5.65 | 31.5 | |
TF [179] | 30 | 2784 | 22.6 | 6337 | 5.89 | 22.6 | |
LRTF [106] | 22 | 2057 | 0.78 | 6102 | 2.45 | 0.78 | |
MulT [154] | 32 | 6033 | 4.75 | 7572 | 10.1 | 4.75 | |
| |||||||
O | MFM [155] | 33 | 5340 | 6.65 | 6088 | 9.42 | 5.97 |
MVAE [168] | 40 | 11673 | 6.21 | 6782 | 12.0 | 5.89 | |
MCTN [123] | 100 | 12242 | 0.19 | 6526 | 4.84 | 0.19 | |
| |||||||
T | GradBlend [167] | 100 | 18176 | 3.89 | 6042 | 2.63 | 2.25 |
We show the robustness of multimodal models with increasing levels of noise on MUStARD in Figure 12, CMU-MOSI in Figure 13, UR-FUNNY in Figure 14, and CMU-MOSEI in Figure 15.
We highlight the following observations:
Unimodal and multimodal models are in general not robust to increasing noise and imperfections in these datasets. Performance drops off very quickly towards random.
We find that multimodal models are slightly more robust than unimodal models. For video and audio, the unimodal method is the least robust. However, for language, the unimodal model can actually be more robust than several multimodal models. In other words, multimodal models are more robust to video and audio while being less robust to language, which is the best performing modality. We believe that directly training multimodal models via supervised learning can be prone to overfitting on the most informative modality (in this case language) which causes the multimodal model to be even less robust than unimodal models in language. A similar observation was the motivation behind the GRadBlend approach to balance overfitting and generalization across different modalities [167].
GRadBlend [167] seems to be a surprisingly robust approach while also generalizing to several datasets. GRadBlend was not in fact not initially designed for the affective computing domain, although it was designed for similar multimodal time-series data in the multimedia domain.
MCTN [123] was designed as a robust alternative to multimodal models since it uses multimodal data at training time but only language data at test time. On imperfections to video and audio, MCTN therefore stays constant and can potentially be a viable alternative that learns a unimodal model from multimodal data during training but remains unimodal at testing.
Table 15:
Dataset | MIMIC Mortality | MIMIC ICD-9 10 – 19 | MIMIC ICD-9 70 – 79 | |||
---|---|---|---|---|---|---|
Metric | Acc(6) ↑ | Acc(2) ↑ | AUPRC(2) ↑ | Acc(2) ↑ | AUPRC(2) ↑ | |
Most frequent | 76.1 | 83.1 | – | 52.5 | – | |
| ||||||
U | Unimodal Unimodal |
76.7±0.3 76.4±0.2 |
83.6±0.1 91.4±0.0 |
35.0±0.968.4±0.1 | 67.6±0.4 56.3±0.3 |
72.9±0.3 54.6±0.4 |
| ||||||
M | LF LRTF [106] MI-Matrix [77] NL Gate [167] MFAS [122] |
77.9±0.3 78.2±0.3 77.6±0.4 78.1±0.2 77.9±0.2 |
91.5±0.1 91.5±0.1 91.5±0.1 |
74.2±0.7 75.1±0.3 74.2±0.6 91.6±0.1 91.4±0.0 73.8±0.7 70.3±1.2 |
68.9±0.5 68.5±0.4 67.9±0.3 68.7±0.5 68.5±0.4 |
74.3±0.4 73.8±0.4 73.0±0.5 74.3±0.4 73.7±0.4 |
| ||||||
O | MFM [155] MVAE [168] |
78.2±0.3 78.0±0.3 |
91.5±0.1 91.6±0.1 |
75.0±0.5 73.5±1.4 |
68.8±0.4 68.7±0.6 |
74.4±0.4 74.0±0.7 |
| ||||||
T | GradBlend [167] | 78.2±0.2 | 91.5±0.1 | 74.1±0.4 | 68.0±0.7 | 73.2±0.5 |
Table 16:
Dataset | MIMIC | ||||||
---|---|---|---|---|---|---|---|
Metric | Epochs trained | Training time (s) | Training params (M) | Training peak memory (MB) | Inference time (s) | Inference params (M) | |
U | Unimodal Unimodal |
20 20 |
46.4 34.6 |
0.019 0.001 |
2360 2359 |
0.41 0.39 |
0.019 0.001 |
M | LF LRTF [106] MI-Matrix [77] NL Gate [167] MFAS [122] |
20 50 20 20 42×6 |
49.4 261 56.6 51.4 3762 |
0.034 0.008 0.801 0.040 0.086* |
2362 2575 2377 2422 2360 |
0.41 0.41 0.39 0.43 1.79 |
0.034 0.008 0.801 0.040 0.016 |
O | MFM [155] MVAE [168] | 25 30 |
221 486 |
0.323 0.312 |
2438 2553 |
0.85 0.89 |
0.315 0.305 |
T | GRadBlend [167] | 300 | 2785 | 0.063 | 2575 | 0.45 | 0.034 |
H.2. Healthcare
We show the full results in Table 15 and complexity results in Table 16. Here we list some observations regarding these results:
We find that results across all models show small variations on MIMIC, which suggests that many current multimodal approaches may not generalize that well to the input modalities and prediction tasks that MIMIC tests for.
In particular, while MFAS (architecture search) is otherwise a pretty general solution that works well across quite a few datasets, it struggles on MIMIC. While there has been a recently proposed MUFASA [173] method that adapts architecture search specifically for healthcare datasets, we were not able to test this method on our partition of MIMIC, and it is in our top priorities to implement that approach into MultiZoo and accurately benchmark its performance on a suite of datasets.
Late Fusion (LF) with simple concatenation was the best-performing model in the evaluations conducted by the previous paper that used the exact same partition as ours [129]. It actually works quite well compared to more complex models evaluated here, as it has the best performance on ICD-9 group 7 task and is quite close to the best performing models in the other two. This may suggest that simple multimodal models such as Late Fusion is worth being tried first on healthcare datasets.
The reconstruction-based multimodal models such as MVAE and MFM have strong performance on this dataset, possibly due to the low dimensions of the input modalities. This suggests that reconstruction-based architectures and objectives might work well on datasets with simple or low-dimensional modalities which are easier to reconstruct.
Finally, we show the robustness of multimodal models with increasing levels of noise on the MIMIC dataset in Figure 16. We highlight the following observations:
Unimodal and multimodal models are in general not robust to increasing noise and imperfections in the table and time-series modalities. Performance drops off very quickly towards random.
In general, multimodal models are slightly more robust than unimodal models. The behavior is best exhibited in the ICD-9 group 7 task where many models start off strong, but multimodal models remain more robust than the best unimodal model. This perhaps indicates that multimodal models do learn to use information from other sources when another one is noisy.
There is high variance in the robustness of each multimodal model even within the same dataset and modalities but across different prediction tasks. We observe that LRTF is the most robust model on the ICD-9 group 7 task but the least robust model on the ICD-9 group 1 task. This high variance is a concern especially given the close similarity across both of these tasks.
Table 17:
Dataset Metric |
MuJoCo Push MSE ↓ |
|
---|---|---|
U | Unimodal Unimodal Unimodal Unimodal |
0.334±0.034 4.266±0.085 3.885±0.004 3.804±0.005 |
M | EF-LSTM LF-LSTM TF [179] MulT [156] |
0.363±0.038 0.290±0.018 0.574±0.059 0.402±0.026 |
Dataset Metric |
Vision&Touch Contact Acc(2) ↑ |
Vision&Touch End Effector MSE (×10−4) ↓ |
|
U | Unimodal (i) Unimodal (f) Unimodal (p) |
83.6±0.3 93.6±0.1 85.6±0.6 |
1.99±0.160 87.2±0.477 0.202±0.022 |
M | LF Sensor Fusion [91] LRTF [106] |
93.6±0.1 93.4±0.1 93.3±0.1 |
0.185±0.011 0.258±0.011 0.232±0.031 |
O | REFNET [135] | 93.5±0.1 | 0.203±0.025 |
H.3. Robotics
We show the full results in Table 17 and complexity results in Table 18. Here we list some observations regarding these results:
We find that in all robotics tasks, there exists one modality with extremely strong unimodal performance (force in Vision&Touch contact task, proprioception in Vision&Touch End Effector task, image in MuJoCo Push).
On the Vision&Touch dataset, we found that Late Fusion outperforms the method of choice in the original paper for the dataset [91] (Sensor Fusion) on both tasks, so Late Fusion seems to generalize well to this domain.
In our experiments, as well as the baselines [91], the action modality is typically treated as a general modality without specific modeling. Future work should explore whether this is the best way to encode action as a modality in these action-conditional prediction tasks, and possibly unify these datasets with those used in embodied multimodal learning [36, 97, 110].
We plan to include several more reinforcement learning tasks for multimodal learning in robotics. It remains an open question where multimodal representations suitable for fusion-type prediction tasks are also suitable for reinforcement learning tasks. Adding such reinforcement learning tasks from multiple sensors to MultiBench will enable more accurate benchmarking of the generalization capabilities of these multimodal models.
Finally, we show the robustness of multimodal models with increasing levels of noise on MuJoCo Push in Figure 17 and on Vision&Touch in Figure 18. We highlight the following observations:
For MuJoCo Push we plot the MSE using a log scale on the y-axis since the error of the TF method blows up significantly much faster than the other methods.
We observe that multimodal methods are much more robust than unimodal methods, which match the robustness results as reported in the paper [91] where the trained multimodal model is robust and able to recover from external forces on the force sensor or occlusions to the image sensor. This observation is true for both datasets.
For Vision&Touch, we observe that unimodal performance is especially bad for the object pose prediction task. The remaining multimodal models are relatively robust as compared to unimodal performance. The most robust models seem to be Sensor Fusion [91] (SF) and Late Fusion (LF).
Table 18:
Dataset | MuJoCo Push | ||||||
---|---|---|---|---|---|---|---|
Metric | Epochs trained | Training time (s) | Training params (M) | Training peak memory (MB) | Inference time (s) | Inference params (M) | |
U | Unimodal Unimodal Unimodal Unimodal |
20 20 20 20 |
738±133 288±39 252±6 372±64 |
3.88 3.33 3.33 3.33 |
3607±1 3595±2 3594±1 3594±1 |
3.46±0.02 0.91±0.08 0.87±0.04 0.86±0.04 |
3.88 3.33 3.33 3.33 |
| |||||||
M | EF LF-LSTM TF-LSTM [179] MulT [156] |
20 20 20 20 |
815±34 856±46 1914±31 4792±62 |
3.92 1.90 23.5 14.6 |
3654±1 3636±1 4530±9 6530±16 |
4.44±0.55 4.32±0.45 7.75±0.12 22.4±0.28 |
3.92 1.90 23.5 14.6 |
Dataset | Vision&Touch | ||||||
---|---|---|---|---|---|---|---|
Metric | Epochs trained | Training time (s) | Training params (M) | Training peak memory (MB) | Inference time (s) | Inference params (M) | |
U | Unimodal Unimodal Unimodal |
15 1 5 15 |
2633 2185 2514 |
1.00 0.13 0.08 |
5530 2426 2389 |
63.9 51.6 59.5 |
1.00 0.13 0.08 |
| |||||||
M | LF Sensor Fusion [91] LRTF [106] |
15 50 35 |
2672 11604 8366 |
1.20 1.10 1.09 |
5572 4467 4987 |
64.4 62.6 64.4 |
1.20 1.10 1.09 |
| |||||||
O | RefNet [135] | 15 | 3819 | 135 | 6067 | 65.0 | 1.20 |
Table 19:
Dataset | Stocks-F&B | Stocks-Health | Stock-Tech | |
---|---|---|---|---|
Metric | MSE ↓ | MSE ↓ | MSE ↓ | |
Mean | 2.140 | 0.575 | 0.140 | |
| ||||
U | ARIMA Unimodal |
2.199 1.856±0.093 |
0.620 0.541±0.010 |
0.152 0.125±0.004 |
| ||||
M | EF-LSTM LF-LSTM EF-Transformer LF-Transformer MulT [156] |
1.835±0.098 1.893±0.106 2.144±0.014 2.155±0.023 2.053±0.022 |
0.526±0.017 0.541±0.018 0.573±0.006 0.573±0.006 0.555±0.005 |
0.121±0.003 0.120±0.008 0.143±0.003 0.143±0.004 0.135±0.003 |
| ||||
T | GradBlend [167] | 1.820±0.138 | 0.537±0.011 | 0.138±0.030 |
H.4. Finance
We show the full results in Table 19 and complexity results in Table 20. Here we list some observations regarding these results:
We do observe better performance using multimodal models as compared to unimodal ones, which suggests that multiple financial signals do help in stock prediction. Several multimodal models do generalize to this more challenging area which presents scalability challenges due to a large number of modalities (18/63/100 as compared to 2/3 in most datasets), as well as robustness challenges arising from real-world data with an inherently low signal-to-noise ratio.
There has been very little research in multimodal models in this area, and no public implementations of multimodal models on actual finance data. By adapting current models on this dataset, we observe decent performance of several out of domain methods. Specifically, early fusion (EF) works well which we believe to be due to the little heterogeneity in data origins (i.e., all data comes in the form of time-series data, which is much less heterogeneous as compare to image and text datasets).
There remains high variance in the performance of multimodal models even within the same domain: we observe that the best multimodal is not consistent across the 3 partitions of finance datasets, which suggests that current multimodal models remain highly sensitive to the task at hand.
Perhaps surprisingly, our experiments on using a Transformer found that they performed worse off than LSTM models. We hypothesize that these large Transformer models might be prone to overfitting on these small and noisy datasets.
These datasets present scalability issues to a large number of modalities. We find that we had to adapt several methods such as Tensor Fusion (TF) and Multimodal Transformer (MulT) since they scale exponentially and quadratically with the number of modalities respectively, which does not scale to these finance datasets with more than 10 modalities. We had to adapt these models by performing an initial clustering over the modalities to form 2/3 groups, performing early fusion by concatenating the data within each group and forming 2/3 ‘modalities’ before applying methods such as Tensor Fusion (TF) and Multimodal Transformer (MulT). This might explain their slightly worse performance, especially MulT given its strong performance and generalization to different datasets in the affective computing domain. Future research should focus on more scalable multimodal methods to a large number of modalities. Unfortunately, the bulk of multimodal research being in language and vision means that this question is relatively unexplored.
Finally, we show the robustness of multimodal models with increasing levels of noise on the finance datasets in Figure 19. We highlight the following observations:
We again observe a similar trend where the best multimodal models (MulT and sometimes EF) are more robust than the best unimodal model. However, different from other datasets, we find that certain multimodal models can be worse in performance and robustness than the best unimodal model. LF in particular is not very robust and performs worse than the best unimodal method.
The Gradient Blend (GRadBlend) method is interesting since it starts off with the best (lowest) MSE but is the least robust – its error increases really quickly and ends up worse than several models that it was initially outperforming on 0 noise levels.
We find that several approaches might be underfitting the data on Stocks-Health and Stocks-Tech. These methods do not start off with a good MSE and are also not affected significantly at increasing noise levels, showing a roughly straight horizontal line in Figure 19.
Table 20:
Dataset | Stocks-F&B | ||||||
---|---|---|---|---|---|---|---|
Metric | Epochs trained | Training time (s) | Training params (M) | Training peak memory (MB) | Inference time (s) | Inference params (M) | |
U | Unimodal | 2 | 9.5 ± 0.1 | 0.067 | 3028 ± 3 | 0.50 ± 0.01 | 0.067 |
M | EF-LSTM | 2 | 9.7 ± 0.1 | 0.069 | 3067 ± 21 | 0.51 ± 0.01 | 0.069 |
LF-LSTM | 4 | 62 ± 0.4 | 0.005 | 2433 ± 4 | 1.74 ± 0.02 | 0.005 | |
EF-Transformer | 4 | 25 ± 0.3 | 0.118 | 2434 ± 3 | 0.62 ± 0.01 | 0.118 | |
LF-Transformer | 4 | 88 ± 0.3 | 0.472 | 2468 ± 1 | 1.70 ± 0.00 | 0.472 | |
MulT [156] | 4 | 160±1 | 0.125 | 3313±1 | 4.82 ± 0.06 | 0.125 | |
| |||||||
T | GradBlend [167] | 4 | 409 ± 2 | 0.338 | 3102±1 | 0.44 ± 0.01 | 0.069 |
Dataset | Stocks-Health | ||||||
---|---|---|---|---|---|---|---|
Metric | Epochs trained | Training time (s) | Training params (M) | Training peak memory (MB) | Inference time (s) | Inference params (M) | |
U | Unimodal | 2 | 9.6 ± 0.1 | 0.067 | 3032 ± 15 | 0.51 ± 0.01 | 0.067 |
| |||||||
M | EF-LSTM | 2 | 9.6 ± 0.1 | 0.070 | 3083 ± 2 | 0.51 ± 0.02 | 0.070 |
LF-LSTM | 4 | 108±1 | 0.009 | 2464 ± 7 | 2.89 ± 0.04 | 0.009 | |
EF-Transformer | 4 | 25 ± 0.4 | 0.118 | 2466 ± 4 | 0.65 ± 0.02 | 0.118 | |
LF-Transformer | 4 | 159 ± 1 | 0.826 | 2524±1 | 2.93 ± 0.01 | 0.826 | |
MulT [156] | 4 | 162 ± 1 | 0.125 | 3315 ± 1 | 4.88 ± 0.04 | 0.125 | |
| |||||||
T | GradBlend [167] | 4 | 582 ± 4 | 0.541 | 3141 ± 2 | 0.49 ± 0.01 | 0.070 |
Dataset | Stock-Tech | ||||||
---|---|---|---|---|---|---|---|
Metric | Epochs trained | Training time (s) | Training params (M) | Training peak memory (MB) | Inference time (s) | Inference params (M) | |
U | Unimodal | 2 | 9.5 ± 0.1 | 0.067 | 3023 ± 1 | 0.51 ± 0.01 | 0.067 |
| |||||||
M | EF-LSTM | 2 | 9.6 ± 0.1 | 0.070 | 3075 ± 4 | 0.53 ± 0.01 | 0.070 |
LF-LSTM | 4 | 92 ± 0.5 | 0.007 | 2453 ± 4 | 2.51 ± 0.04 | 0.007 | |
EF-Transformer | 4 | 25 ± 0.4 | 0.118 | 2453 ± 1 | 0.63 ± 0.01 | 0.118 | |
LF-Transformer | 4 | 135 ± 1 | 0.708 | 2506 ± 1 | 2.52 ± 0.00 | 0.708 | |
MulT [156] | 4 | 161 ± 1 | 0.125 | 3315 ± 2 | 4.79 ± 0.03 | 0.125 | |
| |||||||
T | GradBlend [167] | 4 | 500 ± 3 | 0.473 | 3167±1 | 0.44 ± 0.01 | 0.070 |
Table 21:
Table 22:
Dataset | ENRICO | ||||||
---|---|---|---|---|---|---|---|
Metric | Epochs trained | Training time (s) | Training params (M) | Training peak memory (MB) | Inference time (s) | Inference params (M) | |
U | Unimodal Unimodal |
50 50 |
1601 1644 |
9.6 9.6 |
2796 2771 |
7.3 8.1 |
19.3 19.3 |
| |||||||
M | LF TF [179] LRTF [106] MI-Matrix [77] |
50 50 50 50 |
1714 2012 1853 1604 |
19.3 19.3 19.3 19.3 |
2730 2718 2717 2730 |
8.7 10.9 9.7 8.5 |
19.3 19.3 19.3 19.3 |
| |||||||
O | CCA [145] RefNet [135] |
50 50 |
2945 1747 |
19.3 25.7 |
2923 2757 |
9.1 13.8 |
19.3 25.7 |
| |||||||
T | GRadBlend [167] | 50 | 2618 | 19.3 | 2610 | 12.1 | 19.3 |
H.5. HCI
We show the full results in Table 21 and results on complexity in Table 22. Here we list some observations regarding these results:
The ENRICO paper [93] does not include code or provide many details about their experiments (e.g., data splits, hyperparameters). Compared to their reported results, our reproduction resulted in better performance for the set modality and worse performance for the screenshot modality.
Using multiple modalities can help prediction on ENRICO, boosting performance over the best unimodal model by 4%.
Similar to finance, there has been very little research in multimodal models for HCI. We observe decent performance of several out of domain methods, especially GRadBlend which offers a slight improvement over a standard LF model.
Certain more complex methods, unfortunately, do not work that well on this dataset. On the architecture side, more expressive methods such as TF, LRTF and MI do not offer improvements over a simple LF model. We hypothesize that these more complex models have a larger number of trainable parameters which make them more prone to overfitting to small and noisy datasets.
We show robustness results with increasing levels of noise in Figure 20. We highlight the following observations:
We again observe a similar trend where the best multimodal models (LF and sometimes GRadBlend) are more robust than the best unimodal model. However, different from other datasets, we find that certain multimodal models can be worse in performance and robustness than the best unimodal model. TF in particular is not robust and performs worse than the best unimodal method.
LF is surprisingly robust to imperfections in the image modality and shows a very stable trend despite high levels of noise, implying that the model has learned to rely on the set modality instead when the image is imperfect.
Multimodal models show a high variance in robustness at high noise levels – performance can range from 5% to 40% at the highest noise levels.
Table 23:
Dataset | MM-IMDb | ||
---|---|---|---|
Metric | Micro F1(23) ↑ | Macro F1(23) ↑ | |
U | Unimodal Unimodal |
58.6±1.3 40.1±1.3 |
45.6±4.5 25.3±0.6 |
| |||
M | EF LF LRTF [106] MI-Matrix [77] |
58.9±2.6 58.8±1.6 59.2±0.5 58.3±1.0 |
49.8±1.7 49.2±2.0 49.2±0.6 48.0±1.1 |
| |||
O | CCA [145] RefNet [135] MFM [155] |
59.3±1.2 59.2±2.7 38.4±1.6 |
50.2±0.9 50.2±1.4 22.3±1.3 |
| |||
T | RMFE | 58.6±2.3 | 47.1±2.0 |
Dataset | AV-MNIST | |
---|---|---|
Metric | Acc(10) ↑ | |
U | Unimodal Unimodal |
65.1±0.2 42.0±0.2 |
| ||
M | LF LRTF [106] MI-Matrix [77] MFAS [122] |
71.7±0.4 71.5±0.5 71.2±0.5 72.8±0.2 |
| ||
O | CCA [145] REFNET [135] MFM [155] MVAE [168] |
71.9±0.4 70.9±0.6 71.8±0.4 72.3±0.2 |
| ||
T | GradBlend [167] | 68.5±0.5 |
Dataset | Kinetics-S | Kinetics-L | |
---|---|---|---|
Metric | Acc(5) ↑ | Acc(400) ↑ | |
U | Unimodal Unimodal |
56.5 39.7 |
72.6 19.7 |
| |||
M | LF | 56.1 | 71.7 |
| |||
T | GRadBlend [167] | 23.7 | 74.7 |
H.6. Multimedia
We show the full results in Table 23 and results on complexity in Table 24. Here we list some observations regarding these results:
The current SOTA on AV-MNIST is based on architecture search: MFAS [122]. Amongst all the methods we evaluated, MFAS is still the best performing method and beats the second best method (MVAE) by 0.5%. Meanwhile, Gradient Blend (GRadBlend) does not seem to generalize well to this dataset, as it performs worse than all other multimodal methods.
On MM-IMDb, we attempted several methods on the objective function side. We found that using contrastive learning (REFNET) [135] or canonical correlation analysis (CCA) were quite useful in improving performance, with both outperforming purely architectural baselines without alignment as an optimization objective. In particular, while the CCA approach for multimodal fusion was originally proposed for affect recognition datasets [145], we find that they also generalize to the multimedia domain.
On Kinetics, Gradient Blend (GRadBlend) [167] was shown to work really well in their original paper. However, we found that this approach does not generalize well to other datasets such as AV-MNIST. We also created a smaller version of Kinetics called Kinetics-S to enable quick prototyping of multimodal models. Unfortunately, we found that GRadBlend also struggles on the smaller partition of Kinetics.
For Kinetics-S, we also observed that the visual unimodal model slightly outperformed the late fusion model despite the latter using more modalities. This reflects the observations by Wang et al., [167] on the original full version of the Kinetics dataset.
Therefore, we find that multimodal models still struggle on the Kinetics dataset with multimodal performance on simple models (LF) unable to outperform unimodal methods. While GRadBlend can improve multimodal performance, it comes at the expense of ∼ 3× the training time. Future research should explore building lightweight and effective multimodal models on Kinetics as well as other datasets in MultiBench.
Table 24:
Dataset | MM-IMDb | ||||||
---|---|---|---|---|---|---|---|
Metric | Epochs trained | Training time (s) | Training params (M) | Training peak memory (MB) | Inference time (s) | Inference params (M) | |
U | Unimodal | 125 | 622 | 0.55 | 2146 | 2.07 | 0.55 |
Unimodal | 25 | 127 | 4.86 | 2176 | 2.14 | 4.86 | |
| |||||||
M | EF | 15 | 117 | 5.05 | 2010 | 3.24 | 5/05 |
LF | 5 | 45 | 10.3 | 2016 | 3.44 | 10.3 | |
LRTF [106] | 15 | 741 | 10.3 | 2448 | 5.57 | 10.3 | |
MI-Matrix [77] | 20 | 735 | 280 | 4036 | 3.59 | 280 | |
MFM [155] | 10 | 78 | 21.3 | 2038 | 3.36 | 10.9 | |
CCA [145] | 20 | 1025 | 9.51 | 2273 | 3.33 | 9.51 | |
RMFE [53] | 10 | 104 | 8.78 | 22297 | 3.46 | 8.78 | |
RefNet [135] | 10 | 2207 | 27.0 | 2899 | 3.47 | 10.3 |
Dataset | AV-MNIST | ||||||
Metric | Epochs trained | Training time (s) | Training params (M) | Training peak memory (MB) | Inference time (s) | Inference params (M) | |
U | Unimodal | 25 | 106 | 0.02 | 9549 | 0.95 | 0.02 |
Unimodal | 25 | 158 | 0.24 | 11895 | 1.35 | 0.24 | |
| |||||||
M | LF | 25 | 260 | 0.26 | 11917 | 1.20 | 0.26 |
MI-Matrix [77] | 25 | 289 | 2.53 | 11509 | 1.21 | 2.53 | |
LRTF [106] | 30 | 470 | 0.25 | 11610 | 1.25 | 0.25 | |
MFAS [122] | 172 x 6 | 17648 | 0.14* | 9444 | 4.39 | 0.07 | |
| |||||||
O | CCA [145] | 25 | 310 | 0.25 | 9548 | 1.42 | 0.25 |
RefNet [135] | 15 | 1179 | 14.01 | 15931 | 4.39 | 0.28 | |
MFM [155] | 25 | 544 | 0.92 | 9570 | 4.76 | 0.45 | |
MVAE [168] | 20 | 679 | 0.81 | 9755 | 4.98 | 0.34 | |
| |||||||
T | GradBlend [167] | 300 | 12539 | 0.29 | 12029 | 1.51 | 0.26 |
Dataset | Kinetics-Small | ||||||
---|---|---|---|---|---|---|---|
Metric | Epochs trained | Training time (s) | Training params (M) | Training peak memory (MB) | Inference time (s) | Inference params (M) | |
U | Unimodal | 15 | 6702 | 12.0 | 12151 | 13.7 | 12.0 |
Unimodal | 15 | 46767 | 25.8 | 8533 | 60.9 | 25.8 | |
| |||||||
M | LF | 15 | 20283 | 37.8 | 9525 | 13.9 | 37.8 |
| |||||||
T | GradBlend [167] | 15 | 20283 | 37.8 | 9525 | 13.9 | 37.8 |
Dataset | Kinetics-Large | ||||||
---|---|---|---|---|---|---|---|
Metric | Epochs trained | Training time (s) | Training params (M) | Training peak memory (MB) | Inference time (s) | Inference params (M) | |
U | Unimodal | 45 | 938280 | 12.0 | 12151 | 1918 | 12.0 |
Unimodal | 45 | 947380 | 33.5 | 8533 | 8526 | 33.5 | |
| |||||||
M | LF | 45 | 2839620 | 45.5 | 9525 | 1946 | 45.5 |
| |||||||
T | GradBlend [167] | 45 | 2839620 | 45.5 | 9525 | 1946 | 45.5 |
*This is the number of parameters in the modules given as input to MFAS at the start of training; MFAS generates additional parameters during the architecture search process. U: unimodal models, M: multimodal fusion paradigms, O: optimization objectives, T: training structures.
We show robustness results with increasing levels of noise on the MM-IMDb dataset in Figure 21. We highlight the following observations:
Multimodal models outperform unimodal models in both robustness and initial performance, and especially so under imperfections in the image modality. We believe that multimodal models are able to fall back on the other modality when one is imperfect. The gap between multimodal and unimodal performance is very significant under image imperfections, but much smaller under text imperfections.
MFM was a method initially tested on affective computing datasets, but we found that it did not generalize to MM-IMDb, giving both poor initial performance and poor robustness. We believe that the high dimensionality of the image and text inputs makes reconstruction of the input modalities difficult, which causes the reconstruction-based objectives in MFM to suffer.
On the whole, multimodal models are more robust to imperfections in the image modality than in the language modality. However, unimodal performance is much better on language than on image, which implies that the language modality is more informative. Similar to the observations on the affective computing datasets, we find that multimodal models tend to overfit to the more informative modality (language) and are therefore less robust to imperfections in it.
H.7. Performance
In this subsection, we summarize several general observations regarding the performance of multimodal models across domains, modalities, and tasks.
In the following analysis, we aggregate the performance of models by first assigning each task a weight of 1/n, where n is the number of tasks in its dataset (e.g., there are 3 tasks in the MIMIC dataset: mortality, ICD-9 group 1, and ICD-9 group 7 prediction), so that each dataset contributes equally. Then, we compute the scaled performance of each model on each task by min-max normalization: setting the best-performing model's performance to 1 and the worst-performing model's performance to 0, and scaling the performance of all remaining models linearly between 0 and 1. Note that we only take the best unimodal performance into account when determining the best- and worst-performing models. The final performance score for each model is then computed as a weighted average of its scaled performances on all tasks that the model was evaluated on.
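To make this aggregation concrete, the following is a minimal sketch of the scoring procedure described above; the data structures (a per-task dictionary of raw scores and a per-dataset list of tasks) and function names are illustrative assumptions, not part of the MultiBench codebase.

```python
import numpy as np

def scaled_scores(perf_on_task):
    """Min-max normalize one task: best model -> 1, worst model -> 0.

    perf_on_task maps model name -> raw performance (higher is better). In the
    procedure above, only the best unimodal model is included alongside the
    multimodal models when determining the best/worst performers.
    """
    vals = np.array(list(perf_on_task.values()), dtype=float)
    lo, hi = vals.min(), vals.max()
    return {m: (v - lo) / (hi - lo) if hi > lo else 0.5
            for m, v in perf_on_task.items()}

def aggregate_performance(perf, tasks_per_dataset):
    """Weighted average of scaled scores; each task in a dataset gets weight 1/n."""
    totals, weights = {}, {}
    for tasks in tasks_per_dataset.values():
        w = 1.0 / len(tasks)  # e.g., each of MIMIC's 3 tasks gets weight 1/3
        for task in tasks:
            for model, s in scaled_scores(perf[task]).items():
                totals[model] = totals.get(model, 0.0) + w * s
                weights[model] = weights.get(model, 0.0) + w
    # Normalize by the total weight of the tasks each model was actually evaluated on.
    return {m: totals[m] / weights[m] for m in totals}
```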
Benefits of standardization:
Simply applying methods from a different research area achieves state-of-the-art performance on 9 out of the 15 datasets. We find that this is especially true for domains and modalities that have been relatively less studied in multimodal research (i.e., healthcare, finance, HCI), where performance gains can be obtained using multimodal methods from outside that research area. This motivates the benefits of standardizing and unifying areas of research in multimodal machine learning. We believe that the ever-expanding diversity of datasets in MultiBench can greatly accelerate research in multimodal learning.
Generalization across domains and modalities:
MultiBench offers an opportunity to analyze algorithmic developments across a large suite of modalities, domains, and tasks. We illustrate these observations through 2 summary plots of the generalization performance of multimodal models. Firstly, in Figure 22, we plot the performance of each multimodal method across all datasets that it is tested on, using the color red to indicate performance on datasets that it was initially proposed and tested on (which we label as in-domain), and blue to indicate its performance on the remaining datasets (which we label as out-domain). Secondly, in Figure 23, we color-code the performance on each dataset depending on which research area the dataset belongs to (one of the 6 research areas covered in MultiBench).
We summarize several observations regarding generalization across domains and modalities below:
Many multimodal methods still do not generalize across domains and datasets. For example, MFAS [122] works well on the domains it was designed for (AV-MNIST and MM-IMDb in the multimedia domain) but does not generalize to other domains such as healthcare (MIMIC). Similarly, MCTN [123], a method designed for robustness, does not generalize to other datasets within the affective computing domain (UR-FUNNY and MUStARD). Finally, GradBlend [167], an approach specifically designed to improve generalization in multimodal learning and tested on video and audio datasets (e.g., Kinetics), does not perform well on other datasets. Therefore, there still does not exist a one-size-fits-all model, especially for understudied modalities and tasks.
From Figure 22, we find that many methods show their strongest performance on in-domain datasets and that their performance drops when tested on different domains, modalities, and tasks. For instance, MulT performs extremely well on the affect recognition datasets it was designed for but struggles on other multimodal time series in the finance and robotics domains. On the other hand, MFM generalizes impressively to new domains, although its in-domain performance has been exceeded by several other methods.
From Figure 22, we also observe high variance in the performance of multimodal methods across the datasets in MultiBench, which suggests open questions in building more generalizable models. We find that LF is quite stable and always achieves above-average performance.
There are methods that are surprisingly generalizable across datasets. These are typically general, modality-agnostic methods such as LF. While simple, LF balances simplicity, performance, and low complexity. However, it does not achieve the best performance on any single dataset, which suggests that it is a good starting point but perhaps not the best eventual method.
From Figure 23, we find that performance also varies significantly across research areas.
Several methods such as MFAS and CCA are designed for only 2 modalities (usually image and text), and TF and MI do not scale efficiently beyond 2-3 modalities. Therefore, we were unable to directly adapt these approaches to other datasets. We encourage the community to generalize these approaches across the datasets and modalities in MultiBench.
Tradeoffs between modalities:
How far can we go with unimodal methods? Surprisingly far! From each of the individual tables, we observe that decent performance can be obtained with the best-performing modality alone. Further improvement via multimodal models may come at the expense of around 2-3× the parameters.
H.8. Complexity
We aggregate the complexity of each model by taking the weighted average of the relative complexity of the model across the tasks on which it is evaluated. The weights are assigned in the same way as the performance weights described in the subsection above (i.e., we min-max normalize across models within each task and average the normalized values across all datasets on which the model was tested). The relative complexity of each model on each task is computed by dividing its training time by the training time of the best unimodal model and taking the negative log10 of this value (we take the negative log because some more complex methods can take hundreds of times the training time of simpler methods). Thus, the higher the value of aggregated complexity, the faster the model trains.
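As a minimal illustration of the relative-complexity computation (the function name is ours; the example numbers are taken from the AV-MNIST column of Table 24 purely for illustration):

```python
import numpy as np

def relative_complexity(train_time, best_unimodal_time):
    """Higher is better (faster): negative log10 of training time relative to the best unimodal model."""
    return -np.log10(train_time / best_unimodal_time)

# Example with AV-MNIST numbers from Table 24: GradBlend trains for ~12539s versus
# ~106s for the fastest unimodal model, giving a relative complexity of about -2.07,
# i.e., roughly two orders of magnitude slower.
print(relative_complexity(12539.0, 106.0))   # ~ -2.07
print(relative_complexity(106.0, 106.0))     # 0.0 for the best unimodal model itself
```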
Based on the full results above, we summarize the overall tradeoff between performance and complexity in Figure 24(a). We aggregate performance and complexity statistics by first performing min-max normalization within each dataset to a scale of 0-1 for performance and complexity separately. Note that for metrics where lower is better (i.e., MSE or RMSE), we reverse the direction of the min-max normalization. We then aggregate the normalized statistics across all datasets and plot the tradeoff between performance and complexity. We highlight the following observations:
In Figure 24, we plot a dotted blue line of best quadratic fit to show the Pareto frontier between performance and complexity. We choose a quadratic fit since it is common to fit a curve rather than a straight line when considering the tradeoff frontier between two variables (related to the law of diminishing returns in economics). Using this plot, we find a strong tradeoff between these two desiderata: simple fusion techniques (e.g., early fusion EF and late fusion LF) are appealing choices that score high on both metrics, especially when compared to complex (but slightly better-performing) methods such as architecture search (MFAS) or Multimodal Transformers (MulT).
Using this quadratic curve, we find that the best unimodal model lies under the curve (i.e., is worse off than the Pareto front). This implies that while unimodal models train the fastest, several multimodal methods outperform them despite being slightly slower, making them an overall better choice when taking both performance and complexity into account. LF is an appealing choice that lies above the curve.
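As a rough sketch of how such a quadratic tradeoff curve can be fit to the normalized scores (the arrays below are placeholders for illustration only, not the actual values plotted in Figure 24):

```python
import numpy as np

# Normalized (0-1) aggregate scores per method, as plotted in Figure 24(a).
# These values are placeholders for illustration only.
complexity  = np.array([0.95, 0.90, 0.60, 0.35, 0.15])
performance = np.array([0.40, 0.55, 0.70, 0.85, 0.90])

# Quadratic line of best fit used to visualize the performance-complexity frontier.
coeffs = np.polyfit(complexity, performance, deg=2)
curve = np.poly1d(coeffs)

# A method lying above the curve at its complexity value offers a favorable tradeoff;
# one lying below it is dominated with respect to these two metrics.
method_complexity, method_performance = 0.90, 0.55
print(method_performance > curve(method_complexity))
```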
While LF is the easiest method to adapt to new datasets and domains, we encountered difficulties in adapting several potentially well-performing methods (such as MFAS or MulT) to new datasets and domains. MFAS is designed with a specific set of atomic architectural elements in mind, which makes it most suitable for convolutional networks. MulT is suited to multimodal time-series data, and it is unclear how to adapt its fusion paradigm to modalities without a temporal dimension. For a fairer comparison, in Figure 24(b), we plot the accumulated performance of methods only on the most commonly studied datasets, i.e., those on which we experimented with more than 6 methods. We find that while these well-performing methods (MFAS or MulT) are only slightly better than LF when aggregated over all datasets (see Figure 24(a)), their advantage is larger on the commonly studied datasets they can be applied to (see Figure 24(b)). Therefore, it is important for future research to focus on models that generalize to multiple domains, modalities, and tasks, since our results suggest that many methods currently do not satisfy these desiderata.
These plots do not capture the complete picture since complexity is measured via total training time (training speed), which can be prohibitively high for methods such as MFAS, MVAE, and GradBlend. However, these methods are primarily slow due to extra parameters or extra procedures during training; once the model is trained, test-time inference is fast and cheap. Plotting a performance-complexity tradeoff using a different complexity metric would likely lead to different observations. Overall, MultiBench enables a holistic evaluation of training and test-time space and memory complexity so that practitioners can choose the most suitable model for their real-world application setting.
H.9. Robustness
In this section, we summarize our observations regarding the tradeoffs between accuracy and robustness, using the quantitative metrics for relative and effective robustness described in Appendix D.3. As a reminder, relative robustness directly measures accuracy under imperfections, while effective robustness measures the rate at which accuracy drops under imperfection after equalizing for initial accuracy on clean test data. In Figure 25, we plot a similar tradeoff plot between accuracy and (relative and effective) robustness. Again, we aggregate statistics by first performing min-max normalization within each dataset to a scale of 0-1 for performance and robustness separately. We then aggregate the normalized statistics across all datasets and plot the tradeoff between performance and robustness. We highlight the following observations:
We show the line of best linear fit for relative and effective robustness in dotted blue in Figure 25. We observe a slight positive correlation between performance and relative robustness, which implies that models starting off with higher accuracy tend to stay above other models on the performance-imperfection curve. In particular, several methods such as MVAE and RMFE show strong performance and robustness.
However, we observe a slightly negative correlation for effective robustness. Unfortunately, several well-performing methods such as MulT, CCA, and MVAE tend to drop off faster after equalizing for initial accuracy on clean test data.
Finally, we plot an average of relative and effective robustness in Figure 26 as an overall quantitative measure of robustness. We observe that very few models currently achieve both relative and effective robustness, which prompts an area for future multimodal research.
H.10. Summary of Takeaway Messages
From these results, we emphasize the main take-away messages and motivate several directions for future work:
- Benefits of standardization: Applying methods from a different research area achieves state-of-the-art performance on 9 out of the 15 datasets, especially those relatively less studied in multimodal research (i.e., healthcare, finance, HCI). This motivates the benefits of standardizing and unifying areas of research in multimodal learning. We hope that MultiBench and MultiZoo can be a step in this direction.
- Generalization across domains and modalities:
- Many multimodal methods still do not generalize across domains and datasets, showing high variance across the datasets in MultiBench. Some of these methods perform worse on out-of-domain datasets than on in-domain datasets, while others are designed specifically for certain modalities and domains, which prevents them from being adapted to other datasets in a straightforward way.
- Certain simple methods (e.g., LF) are surprisingly generalizable. However, LF does not achieve the best performance on any dataset, which suggests that it is a good starting point but perhaps not the best eventual method.
- Tradeoffs between modalities: Decent performance can be obtained with the best-performing modality alone, which motivates the need for new datasets that offer challenges and opportunities in multimodal modeling not achievable with unimodal methods.
- Tradeoffs between performance and complexity: There is a strong tradeoff between performance and complexity, which suggests that future work should also focus on lightweight multimodal models that generalize across the datasets in MultiBench.
- Tradeoffs between performance and robustness:
- Models starting off with higher accuracy tend to stay above other models on the performance-imperfection curve.
- However, several well-performing methods also tend to drop off faster after equalizing for initial accuracy on clean test data.
- Overall, very few models currently achieve both relative and effective robustness, which prompts an area for future multimodal research.
I. Future Directions
We plan to ensure the continual availability, maintenance, and expansion of MultiBench. Several immediate future directions include expanding the datasets provided, the algorithms implemented in MultiZoo, and the scope of our holistic evaluation of multimodal models.
I.1. Datasets
One main area of expansion lies in the datasets supported by MultiBench. We first describe the categories of multimodal datasets in the fusion domain that we plan to add in the coming months. We also plan to include several new application areas beyond fusion, such as cross-modal retrieval, multimodal question answering, grounding across modalities, and reinforcement learning, which we detail in the following subsections. Finally, we explain our plan for community-based expansion of datasets and models based on user feedback, which will happen in parallel.
I.1.1. Fusion
Within the same category of multimodal fusion, we plan to add datasets within the current application domains as well as to expand to new ones. Within the current domains, we plan to include (1) the Hateful Memes challenge [82], a core challenge in multimedia for ensuring safer learning from the ubiquitous text and images on the internet, (2) more datasets in the robotics and HCI domains, where there are many opportunities for multimodal modeling, and (3) several datasets that are of broad interest but are released under licenses that restrict redistribution, such as dyadic emotion recognition on IEMOCAP [21], deception prediction from real-world trial data [121], and multilingual affect recognition on CMU-MOSEAS [182], which was only recently released. We are currently working with the authors to integrate some of these datasets into MultiBench in the near future. These new datasets will benchmark multimodal modeling in human-centric areas where privacy and fairness are important desiderata. Furthermore, they will enable benchmarking of multimodal learning in languages other than English, which is important for building more accessible multimodal models that include the language modality.
I.1.2. Retrieval
Another area of great interest lies in cross-modal retrieval [104, 187]. Here, the goal is to retrieve semantically similar data in one modality given a query in another modality (e.g., given a phrase, retrieve the closest image depicting that phrase). The core challenge is to align representations across both modalities. Retrieval has been studied primarily in the multimedia space (e.g., retrieving images, video, and audio given a text query), and we hope to add some of these datasets as well as to expand cross-modal retrieval to different combinations of query and retrieved modalities.
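To illustrate the basic retrieval setup, the sketch below performs nearest-neighbor search by cosine similarity; it is not a MultiBench implementation, and it assumes that the encoders and embeddings already map both modalities into a shared space, which is precisely the alignment challenge described above.

```python
import numpy as np

def retrieve(query_embedding, candidate_embeddings, k=5):
    """Return the indices of the k candidates most similar to the query (cosine similarity)."""
    q = query_embedding / np.linalg.norm(query_embedding)
    c = candidate_embeddings / np.linalg.norm(candidate_embeddings, axis=1, keepdims=True)
    return np.argsort(-(c @ q))[:k]

# Hypothetical usage, assuming text_encoder and image_embeddings exist and share a space:
#   query = text_encoder("a dog catching a frisbee")      # shape (d,)
#   top_images = retrieve(query, image_embeddings, k=5)   # image_embeddings: (N, d)
```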
I.1.3. Question Answering
Within the domain of language and vision, there has been growing interest in language-based question answering (i.e., “query” modality) of entities in the visual, video, or embodied domain (i.e., “queried” modality). Datasets such as Visual Question Answering [4], Social IQ [178], and Embodied Question Answering [36] have been proposed to benchmark the performance of multimodal models in these settings. A core challenge lies in aligning words asked in the question with entities in the queried modalities, which can take the form of visual entities in images or videos, and actions in embodied environments. We plan to add these datasets as soon as possible, and also plan to add QA over multiple queried modalities such as text, images, and tables as proposed in recent work [63, 147].
I.1.4. Grounding
Grounding is the task of linking entities (often at their most granular level) in one modality with entities in another modality. For example, in the domain of language and vision, a well-studied grounding task is visual referring expressions: localizing the object in an image referred to by a natural language expression (e.g., "half of a sandwich on the right side of a plate nearest a coffee mug") [32]. Grounding can be seen as a more fine-grained version of retrieval where the retrieved items are sub-patches of an image. We currently do not include grounding tasks since existing datasets rarely go beyond using language to query images (and their subregions). We plan to include grounding datasets in the language and vision domain, but we also encourage research that extends this problem to other modalities (e.g., using language to query video, audio, sets, or tables).
I.1.5. Reinforcement Learning
Learning from multiple modalities in an interactive setting is an area of interest as a step towards building more intelligent embodied agents that can perceive the visual world, language instructions, auditory feedback, and other sensor modalities. These research areas broadly span language-conditional RL (i.e., instruction following, learning a reward function from instructions, language in the observation or action space) and language-assisted RL (language as domain knowledge, language to structure policies) [110]. Recent work has also explored audio as a modality in an agent’s multisensory interaction with the world [38]. Modern robot systems are also equipped with multiple sensors to aid in their decision-making and there has been considerable research in learning multimodal representations from multiple sensors for robot manipulation [89–91].
These multimodal problems are fundamentally different from those that are concerned with prediction tasks. Alongside the core challenges in learning complementary information and aligning entities in language instructions to those in the visual environment, there also lies the core challenge of learning actionable representations that link to the set of actions that can be taken and their associated long-term rewards [110]. We plan to include these datasets in a future version of MultiBench. We also encourage research in extending these multimodal tasks beyond language and vision to truly incorporate the diverse set of modalities humans use in everyday interactive tasks.
I.2. Models
By partitioning multimodal code into the distinct areas described in Appendix E (data processing, unimodal and multimodal model design, optimization objectives, and training structures), MultiZoo enables the easy addition of new innovations from all of these areas. It is easy to add new unimodal encoders as they are developed in areas such as computer vision and natural language processing. Similarly, it is straightforward to add multimodal methods while ensuring compatibility with existing unimodal encoders, fusion paradigms, optimization objectives, and training structures. Please refer to Appendix F for code snippets for changing multimodal models, optimization objectives, and training structures.
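As a hypothetical sketch of what this kind of modular composition looks like in PyTorch (the class names and constructor arguments here are illustrative assumptions and do not reproduce MultiZoo's actual interface; see Appendix F for the real code snippets):

```python
import torch
from torch import nn

class Concat(nn.Module):
    """A simple late-fusion paradigm: concatenate unimodal representations."""
    def forward(self, reps):
        return torch.cat(reps, dim=-1)

class MultimodalModel(nn.Module):
    """Composes per-modality encoders, a fusion paradigm, and a prediction head."""
    def __init__(self, encoders, fusion, head):
        super().__init__()
        self.encoders = nn.ModuleList(encoders)
        self.fusion = fusion
        self.head = head

    def forward(self, modalities):
        reps = [enc(x) for enc, x in zip(self.encoders, modalities)]
        return self.head(self.fusion(reps))

# Swapping the fusion paradigm (or an encoder) only touches one argument:
model = MultimodalModel(
    encoders=[nn.Linear(300, 64), nn.Linear(35, 64)],  # e.g., text and audio feature encoders
    fusion=Concat(),
    head=nn.Linear(128, 2),
)
out = model([torch.randn(8, 300), torch.randn(8, 35)])  # batch of 8, binary logits
```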
The authors maintain a reading list for topics in multimodal ML [98] that is regularly updated for the latest advances in the area. We plan to periodically add proposed methods to the MultiZoo toolkit with help from the community as well.
I.3. Evaluation
MultiBench is designed with holistic evaluation in mind. Currently, MultiBench supports evaluation for prediction performance, time and space complexity, and robustness to noisy and missing modalities. There are several other crucial evaluation dimensions that we plan to include in the following versions of the benchmark:
I.3.1. Uncertainty Estimates
There has been important work in building ML models that return uncertainty estimates along with their prediction targets [52, 57], along with recent interest in building multimodal models with similar capabilities [20, 169]. As ML models are increasingly deployed in sensitive real-world scenarios [12, 34, 160], there is a growing need to quantify when ML models do not know the right answer and should potentially abstain [107] or defer the prediction to a human expert [86]. As future steps, we plan to also include evaluations of uncertainty estimates in MultiBench, such as via the recently proposed Uncertainty Toolbox [2, 30, 153]. This will enable the inclusion and evaluation of uncertainty-predicting multimodal models such as those proposed in [20, 169].
I.3.2. Robustness to Distribution Shifts
Distribution shifts, spanning shifts in dataset distributions and label distributions, are among core challenges currently preventing machine learning systems from being safely deployed in real-world settings [130]. Subtle changes in the data distribution can significantly impact performance, a phenomenon exemplified by adversarial examples [146], and shifts in the label distribution can significantly compromise accuracy as well [185].
Distribution shifts in multimodal settings remain largely unexplored by the research community. Multimodal data can exhibit shifts in the marginal data distribution of each modality as well as in the joint distribution across modalities, which makes the problem inherently more complex. To enable research in benchmarking and analyzing distribution shift in multimodal settings, we plan to include:
Data: Data partitions (or new datasets) added to MultiBench that test for generalization across domains and subpopulations, in a manner similar to [85]. Building on the current datasets available in MultiBench, some examples include affect recognition across different users, robotic manipulation across different physical robots, and medical diagnosis across different age groups.
Algorithms: On the algorithmic side, we plan to incorporate into MultiZoo currently established methods for handling distribution shift in a single modality (which has been the focus of most existing work), which will enable both theoreticians and practitioners to analyze the new challenges that multimodal data brings to the study of distribution shift.
Evaluation: Finally, to evaluate robustness to distribution shift, we plan to build a standardized evaluation pipeline into MultiBench (similar to the robustness tests currently implemented). We will also draw on insights from the experimental protocol in [130], which includes evaluation metrics to detect dataset shift before attempting to correct it.
I.3.3. Fairness
To safely deploy human-centric multimodal models in real-world scenarios such as healthcare, HCI, legal systems, and social science, it is necessary to recognize the role they play in shaping social biases and stereotypes. Recent work has shown that word-level embeddings reflect and propagate social biases present in training corpora [18, 23]. Machine learning systems that incorporate these word embeddings can further amplify biases [13] and unfairly discriminate against users, particularly those from disadvantaged social groups. Similar issues have been observed for datasets and models in the visual domain, such as facial recognition [6] and image captioning [67], which has prompted calls for better documentation and risk analysis of both ML datasets [54] and models [115].
We believe that the ability to make fair judgments is even more important in a multimodal setting for the following reasons:
Human behavior is inherently multimodal. As a result, many research problems in multimodal learning involve human-centric data and tasks, such as healthcare, affective computing, HCI, multimedia, and human-robot interaction. As multimodal systems (such as emotion recognition systems) are deployed in the real world, it is crucial to characterize the social biases they may encode and to design algorithms that mitigate these biases. Otherwise, real harm can be brought to under-represented populations, whom unfair machine learning models disproportionately affect [18].
While there has been a large body of work investigating the fairness of representations learned from language and images, there is little work currently investigating this for other modalities, as well as for the wide spectrum of multimodal models integrating multiple modalities which can potentially compound biases stemming from each one [141].
There are many definitions of fairness and bias in ML, and it is unclear which are important in which multimodal settings. While we do not have a conclusive answer for evaluating fairness in multimodal systems, we are making it a priority to include this feature in future versions of MultiBench. Following [113], certain dimensions of fairness that we are currently exploring and plan to add to MultiBench include:
Data: A better fine-grained understanding of bias in data, which we plan to achieve via human annotations for several multimodal datasets in MultiBench (especially those that involve human-centric tasks such as affect recognition).
Algorithms: Algorithmic fairness, including training models that satisfy individual and group fairness, analyzing trained models from a geometric perspective (i.e., studying whether biases are encoded in representations learned by a model [18, 99]), and methods for preprocessing and post-processing data and models to satisfy fairness metrics.
Evaluation: Bias evaluation of trained multimodal models as well as those trained within a single modality, to determine the relationship between biases in a single modality versus those that manifest in multimodal problems, and comparing them to current progress in this direction on the language and vision modalities [133, 141].
These tasks tackle benchmarking and analysis of biases in multimodal methods from different perspectives spanning data, algorithms, and evaluation, which make them compatible with our proposed modular framework in MultiBench and MultiZoo. We plan to include additional data annotations in the MultiBench data loader, a suite of algorithms designed to mitigate bias for unimodal and multimodal models in MultiZoo, and evaluation metrics for fairness in the MultiBench evaluation pipeline.
I.4. Broader Outreach
In workshops and competitions:
The authors have extensive experience in organizing challenges, workshops, and tutorials at leading ML, NLP, and computer vision conferences. These include large-scale challenges in multimodal language analysis at NAACL 2021 (http://multicomp.cs.cmu.edu/naacl2021multimodalworkshop/), ACL 2020 (http://multicomp.cs.cmu.edu/acl2020multimodalworkshop/), and ACL 2019 (http://multicomp.cs.cmu.edu/acl2018multimodalchallenge/). We plan to use MultiBench as the subject of future workshops to accelerate reproducible research in multimodal learning. These workshops will focus on both new algorithms and careful analysis of existing algorithms in the field. Both directions will be accelerated via our resources: we plan to provide MultiBench as a starting point for loading datasets and MultiZoo as starter code for multimodal modeling, evaluation, and analysis.
In academic courses:
We plan to use the MultiBench benchmark as well as the standardized MultiZoo codebase as an educational tool to support the Multimodal ML course taught annually at CMU (https://cmu-multicomp-lab.github.io/mmml-course/fall2020/). Students can choose to use one of the datasets provided in MultiBench or add a new one to the current suite of multimodal datasets. When designing new algorithmic contributions, students can implement their approaches in the MultiZoo toolkit, which enables easy testing on multiple datasets, quick logging and analysis of results, and reproducible experiments. This form of community-based expansion is also likely to yield great leaps in the variety of datasets and models supported by the toolkit.
Community-based expansion:
Finally, we plan to present a system for expanding the datasets and models in MultiBench via input from the research community. Since MultiBench is publicly released and will be regularly maintained, the existing benchmark, code, evaluation, and experimental protocols can greatly accelerate the addition of new datasets and models in the future. In the public GitHub repository (https://github.com/pliang279/MultiBench), we have included a section on contributing to MultiBench through either task proposals or additions of datasets and algorithms. The README includes detailed instructions for adding new datasets and dataloaders, as well as new algorithms, following the code structure we have developed and standardized. The README also contains details for writing a main function to test new data loaders and multimodal algorithms, and a test script to ensure compatibility with existing experiments. The authors will regularly monitor new proposals through this channel, and will periodically select popular task proposals (datasets and models) and add them to new versions of MultiBench. The ease of loading datasets and evaluating models will naturally encourage interest in building new datasets and models on top of the toolkit. We further plan to encourage participants and students in our organized workshops and courses to use MultiBench and contribute task proposals as well.
References
- [1].Free spoken digit dataset (fsdd). https://github.com/Jakobovski/free-spoken-digit-dataset. Accessed: 2021–04-30.
- [2].Uncertainty toolbox. https://github.com/uncertainty-toolbox/uncertainty-toolbox, 2021.
- [3].Agrawal Aishwarya, Batra Dhruv, and Parikh Devi. Analyzing the behavior of visual question answering models. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1955–1960, Austin, Texas, November 2016. Association for Computational Linguistics. [Google Scholar]
- [4].Agrawal Aishwarya, Lu Jiasen, Antol Stanislaw, Mitchell Margaret, Zitnick C. Lawrence, Parikh Devi, and Batra Dhruv. VQA: Visual question answering. International Journal of Computer Vision, 2017. [Google Scholar]
- [5].Amisha Paras Malik, Pathania Monika, and Rathaur Vyas Kumar. Overview of artificial intelligence in medicine. Journal of family medicine and primary care, 8(7):2328, 2019. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [6].Anastasi Jeffrey S and Rhodes Matthew G. An own-age bias in face recognition for children and older adults. Psychonomic bulletin & review, 12(6):1043–1047, 2005. [DOI] [PubMed] [Google Scholar]
- [7].Andrew Galen, Arora Raman, Bilmes Jeff, and Livescu Karen. Deep canonical correlation analysis. In International conference on machine learning, pages 1247–1255. PMLR, 2013. [Google Scholar]
- [8].Arevalo John, Solorio Thamar, Montes-y Gómez Manuel, and González Fabio A. Gated multimodal units for information fusion. In 5th International conference on learning representations 2017 workshop, 2017. [Google Scholar]
- [9].Bachman Philip, Hjelm R Devon, and Buchwalter William. Learning representations by maximizing mutual information across views. In Advances in Neural Information Processing Systems, pages 15535–15545, 2019.
- [10].Baltrušaitis Tadas, Ahuja Chaitanya, and Morency Louis-Philippe. Multimodal machine learning: A survey and taxonomy. IEEE transactions on pattern analysis and machine intelligence, 41(2):423–443, 2018.
- [11].Baltrušaitis Tadas, Robinson Peter, and Morency Louis-Philippe. Openface: an open source facial behavior analysis toolkit. In 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1–10. IEEE, 2016. [Google Scholar]
- [12].Bamman David, Doğruöz A. Seza, Eisenstein Jacob, Hovy Dirk, Jurgens David, O'Connor Brendan, Oh Alice, Tsur Oren, and Volkova Svitlana. Proceedings of the first workshop on NLP and computational social science. 2016.
- [13].Barocas Solon and Selbst Andrew D. Big data’s disparate impact. Calif. L. Rev, 104:671, 2016. [Google Scholar]
- [14].Bauzá Maria, Alet Ferran, Lin Yen-Chen, Lozano-Pérez Tomás, Kaelbling Leslie Pack, Isola Phillip, and Rodriguez Alberto. Omnipush: accurate, diverse, real-world dataset of pushing dynamics with rgb-d video. In IROS, 2019.
- [15].Belinkov Yonatan and Bisk Yonatan. Synthetic and natural noise both break neural machine translation. In International Conference on Learning Representations, 2018. [Google Scholar]
- [16].Belpaeme Tony, Kennedy James, Ramachandran Aditi, Scassellati Brian, and Tanaka Fumihide. Social robots for education: A review. Science robotics, 3(21), 2018. [DOI] [PubMed] [Google Scholar]
- [17].Blake Randolph, Sobel Kenith V, and James Thomas W. Neural synergy between kinetic vision and touch. Psychological science, pages 397–402, 2004. [DOI] [PubMed] [Google Scholar]
- [18].Bolukbasi Tolga, Chang Kai-Wei, Zou James Y, Saligrama Venkatesh, and Kalai Adam T. Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. In NIPS, 2016.
- [19].Boyat Ajay Kumar and Joshi Brijendra Kumar. A review paper: Noise models in digital image processing, 2015. [Google Scholar]
- [20].Brown Katherine E, Bhuiyan Farzana Ahamed, and Talbert Douglas A. Uncertainty quantification in multimodal ensembles of deep learners. In The Thirty-Third International Flairs Conference, 2020. [Google Scholar]
- [21].Busso Carlos, Bulut Murtaza, Lee Chi-Chun, Kazemzadeh Abe, Mower Emily, Kim Samuel, Chang Jeannette N, Lee Sungbok, and Narayanan Shrikanth S. Iemocap: Interactive emotional dyadic motion capture database. Language resources and evaluation, 42(4):335, 2008.
- [22].Cadene Remi, Dancette Corentin, Cord Matthieu, Parikh Devi, et al. Rubi: Reducing unimodal biases for visual question answering. Advances in Neural Information Processing Systems, 32:841–852, 2019. [Google Scholar]
- [23].Caliskan Aylin, Bryson Joanna J, and Narayanan Arvind. Semantics derived automatically from language corpora contain human-like biases. Science, 2017. [DOI] [PubMed] [Google Scholar]
- [24].Castro Santiago, Hazarika Devamanyu, Pérez-Rosas Verónica, Zimmermann Roger, Mihalcea Rada, and Poria Soujanya. Towards multimodal sarcasm detection (an _obviously_ perfect paper). In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4619–4629, 2019. [Google Scholar]
- [25].Chaplot Devendra Singh, Sathyendra Kanthashree Mysore, Pasumarthi Rama Kumar, Rajagopal Dheeraj, and Salakhutdinov Ruslan. Gated-attention architectures for task-oriented language grounding. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
- [26].Chen Minghai, Wang Sen, Liang Paul Pu, Baltrušaitis Tadas, Zadeh Amir, and Morency Louis-Philippe. Multimodal sentiment analysis with word-level fusion and reinforcement learning. In Proceedings of the 19th ACM International Conference on Multimodal Interaction, pages 163–171, 2017. [Google Scholar]
- [27].Chen Yen-Chun, Li Linjie, Yu Licheng, Kholy Ahmed El, Ahmed Faisal, Gan Zhe, Cheng Yu, and Liu Jingjing. Uniter: Universal image-text representation learning. In European Conference on Computer Vision, pages 104–120. Springer, 2020. [Google Scholar]
- [28].Childers Donald G and Lee CK. Vocal quality factors: Analysis, synthesis, and perception. The Journal of the Acoustical Society of America, 90(5):2394–2410, 1991.
- [29].Chung Junyoung, Gulcehre Caglar, Cho KyungHyun, and Bengio Yoshua. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014. [Google Scholar]
- [30].Chung Youngseog, Neiswanger Willie, Char Ian, and Schneider Jeff. Beyond pinball loss: Quantile methods for calibrated uncertainty quantification. arXiv preprint arXiv:2011.09588, 2020. [Google Scholar]
- [31].Cirik Volkan, Berg-Kirkpatrick Taylor, and Morency Louis-Philippe. Using syntax to ground referring expressions in natural images. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018. [Google Scholar]
- [32].Cirik Volkan, Morency Louis-Philippe, and Berg-Kirkpatrick Taylor. Visual referring expression recognition: What do systems actually learn? In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), New Orleans, Louisiana, June 2018. Association for Computational Linguistics. [Google Scholar]
- [33].Cui Wanyun, Zheng Guangyu, and Wang Wei. Unsupervised natural language inference via decoupled multimodal contrastive learning, 2020. [Google Scholar]
- [34].Dale Robert. Law and word order: Nlp in legal tech. Natural Language Engineering, 25(1):211–217, 2019. [Google Scholar]
- [35].Dankar Fida Kamal and El Emam Khaled. Practicing differential privacy in health care: A review. Trans. Data Priv., 6(1):35–67, 2013.
- [36].Das Abhishek, Datta Samyak, Gkioxari Georgia, Lee Stefan, Parikh Devi, and Batra Dhruv. Embodied question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–10, 2018. [Google Scholar]
- [37].Dauphin Yann N, Fan Angela, Auli Michael, and Grangier David. Language modeling with gated convolutional networks. In International conference on machine learning, pages 933–941. PMLR, 2017. [Google Scholar]
- [38].Dean Victoria, Tulsiani Shubham, and Gupta Abhinav. See, hear, explore: Curiosity via audio-visual association. NeurIPS, 2020. [Google Scholar]
- [39].Degottex Gilles, Kane John, Drugman Thomas, Raitio Tuomo, and Scherer Stefan. Covarep—a collaborative voice analysis repository for speech technologies. In Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on, pages 960–964. IEEE, 2014. [Google Scholar]
- [40].Deka Biplab, Huang Zifeng, Franzen Chad, Hibschman Joshua, Afergan Daniel, Li Yang, Nichols Jeffrey, and Kumar Ranjitha. Rico: A mobile app dataset for building data-driven design applications. In Proceedings of the 30th Annual ACM Symposium on User Interface Software and Technology, pages 845–854, 2017. [Google Scholar]
- [41].Deng Jia, Dong Wei, Socher Richard, Li Li-Jia, Li Kai, and Fei-Fei Li. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009. [Google Scholar]
- [42].Devlin Jacob, Chang Ming-Wei, Lee Kenton, and Toutanova Kristina. Bert: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT (1), 2019. [Google Scholar]
- [43].Dix Alan, Finlay Janet, Abowd Gregory D, and Beale Russell. Human-computer interaction. Harlow ua, 2000. [Google Scholar]
- [44].Dosovitskiy Alexey, Beyer Lucas, Kolesnikov Alexander, Weissenborn Dirk, Zhai Xiaohua, Unterthiner Thomas, Dehghani Mostafa, Minderer Matthias, Heigold Georg, Gelly Sylvain, et al. An image is worth 16×16 words: Transformers for image recognition at scale. ICLR, 2021. [Google Scholar]
- [45].Drossos Konstantinos, Lipping Samuel, and Virtanen Tuomas. Clotho: An audio captioning dataset. In ICASSP 2020–2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 736–740. IEEE, 2020. [Google Scholar]
- [46].Drugman Thomas and Alwan Abeer. Joint robust voicing detection and pitch estimation based on residual harmonics. In Interspeech, pages 1973–1976, 2011. [Google Scholar]
- [47].Dumas Bruno, Lalanne Denis, and Oviatt Sharon. Multimodal interfaces: A survey of principles, models and frameworks. In Human machine interaction, pages 3–26. Springer, 2009. [Google Scholar]
- [48].Dupont Stéphane and Luettin Juergen. Audio-visual speech modeling for continuous speech recognition. IEEE transactions on multimedia, 2(3):141–151, 2000. [Google Scholar]
- [49].Ekman Paul. Universal facial expressions of emotion. [DOI] [PubMed] [Google Scholar]
- [50].Ferraro Francis, Mostafazadeh Nasrin, Huang Ting-Hao, Vanderwende Lucy, Devlin Jacob, Galley Michel, and Mitchell Margaret. A survey of current datasets for vision and language research. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 207–213, Lisbon, Portugal, September 2015. Association for Computational Linguistics. [Google Scholar]
- [51].Frantzidis Christos A, Bratsas Charalampos, Manousos A Klados, Konstantinidis Evdokimos, Lithari Chrysa D, Vivas Ana B, Papadelis Christos L, Kaldoudi Eleni, Pappas Costas, and Bamidis Panagiotis D. On the classification of emotional biosignals evoked while viewing affective pictures: an integrated data-mining-based approach for healthcare applications. IEEE Transactions on Information Technology in Biomedicine, 14(2):309–318, 2010. [DOI] [PubMed] [Google Scholar]
- [52].Gal Yarin and Ghahramani Zoubin. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In international conference on machine learning, pages 1050–1059. PMLR, 2016. [Google Scholar]
- [53].Gat Itai, Schwartz Idan, Schwing Alexander, and Hazan Tamir. Removing bias in multi-modal classifiers: Regularization by maximizing functional entropies. Advances in Neural Information Processing Systems, 33, 2020. [Google Scholar]
- [54].Gebru Timnit, Morgenstern Jamie, Vecchione Briana, Vaughan Jennifer Wortman, Wallach Hanna, Daumé Hal III, and Crawford Kate. Datasheets for datasets. arXiv preprint arXiv:1803.09010, 2018. [Google Scholar]
- [55].Geyer Robin C, Klein Tassilo, and Nabi Moin. Differentially private federated learning: A client level perspective. arXiv preprint arXiv:1712.07557, 2017.
- [56].Gkoumas Dimitris, Li Qiuchi, Lioma Christina, Yu Yijun, and Song Dawei. What makes the difference? an empirical comparison of fusion strategies for multimodal language analysis. Information Fusion, 66:184–197. [Google Scholar]
- [57].Gneiting Tilmann, Balabdaoui Fadoua, and Raftery Adrian E. Probabilistic forecasts, calibration and sharpness. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 69(2):243–268, 2007. [Google Scholar]
- [58].Goodfellow Ian, Warde-Farley David, Mirza Mehdi, Courville Aaron, and Bengio Yoshua. Maxout networks. In Sanjoy Dasgupta and David McAllester, editors, Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pages 1319–1327, Atlanta, Georgia, USA, 17–19 Jun 2013. PMLR.
- [59].Goyal Yash, Khot Tejas, Summers-Stay Douglas, Batra Dhruv, and Parikh Devi. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6904–6913, 2017.
- [60].Gu Ken. Multimodal toolkit. https://github.com/georgian-io/Multimodal-Toolkit, 2020.
- [61].Ha David, Dai Andrew, and Quoc V Le. Hypernetworks. arXiv preprint arXiv:1609.09106, 2016. [Google Scholar]
- [62].Hamisu Pascal, Heinrich Gregor, Jung Christoph, Hahn Volker, Duarte Carlos, Langdon Pat, and Biswas Pradipta. Accessible ui design and multimodal interaction through hybrid tv platforms: towards a virtualuser centered design framework. In International Conference on Universal Access in Human-Computer Interaction, pages 32–41. Springer, 2011. [Google Scholar]
- [63].Hannan Darryl, Jain Akshay, and Bansal Mohit. Manymodalqa: Modality disambiguation and qa over diverse inputs. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 7879–7886, 2020. [Google Scholar]
- [64].Hasan Md Kamrul, Rahman Wasifur, Zadeh AmirAli Bagher, Zhong Jianyuan, Tanveer Md Iftekhar, Morency Louis-Philippe, and Hoque Mohammed Ehsan. Ur-funny: A multimodal language dataset for understanding humor. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2046–2056, 2019. [Google Scholar]
- [65].Hassan Javaria, Leong Jovin, and Schneider Bertrand. Multimodal Data Collection Made Easy: The EZ-MMLA Toolkit: A Data Collection Website That Provides Educators and Researchers with Easy Access to Multimodal Data Streams., page 579–585. Association for Computing Machinery, New York, NY, USA, 2021. [Google Scholar]
- [66].He Kaiming, Zhang Xiangyu, Ren Shaoqing, and Sun Jian. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016. [Google Scholar]
- [67].Hendricks Lisa Anne, Burns Kaylee, Saenko Kate, Darrell Trevor, and Rohrbach Anna. Women also snowboard: Overcoming bias in captioning models. In Proceedings of the European Conference on Computer Vision (ECCV), pages 771–787, 2018. [Google Scholar]
- [68].Hessel Jack and Lee Lillian. Does my multimodal model learn cross-modal interactions? it’s harder to tell than you might think! In EMNLP, 2020. [Google Scholar]
- [69].Hochreiter Sepp and Schmidhuber Jürgen. Long short-term memory. Neural computation, 9(8):1735–1780, 1997. [DOI] [PubMed] [Google Scholar]
- [70].Hollerer Markus A., Jancsary Dennis, and Grafstrom Maria. A picture is worth a thousand words: Multimodal sensemaking of the global financial crisis. Organization Studies, 2018. [Google Scholar]
- [71].Hou Ming, Tang Jiajia, Zhang Jianhai, Kong Wanzeng, and Zhao Qibin. Deep multimodal multilinear fusion with high-order polynomial pooling. Advances in Neural Information Processing Systems, 32:12136–12145, 2019. [Google Scholar]
- [72].Hu Junjie, Ruder Sebastian, Siddhant Aditya, Neubig Graham, Firat Orhan, and Johnson Melvin. Xtreme: A massively multilingual multi-task benchmark for evaluating cross-lingual generalisation. In International Conference on Machine Learning, pages 4411–4421. PMLR, 2020. [Google Scholar]
- [73].Hu Ronghang and Singh Amanpreet. Transformer is all you need: Multimodal multitask learning with a unified transformer. arXiv preprint arXiv:2102.10772, 2021. [Google Scholar]
- [74].Hu Weihua, Fey Matthias, Zitnik Marinka, Dong Yuxiao, Ren Hongyu, Liu Bowen, Catasta Michele, and Leskovec Jure. Open graph benchmark: Datasets for machine learning on graphs. NeurIPS, 2020. [Google Scholar]
- [75].iMotions. Facial expression analysis, 2017. [Google Scholar]
- [76].Iyyer Mohit, Manjunatha Varun, Jordan Boyd-Graber, and Hal Daumé. Deep unordered composition rivals syntactic methods for text classification. In Proceedings of the 53rd annual meeting of the association for computational linguistics and the 7th international joint conference on natural language processing (volume 1: Long papers), pages 1681–1691, 2015. [Google Scholar]
- [77].Jayakumar Siddhant M., Czarnecki Wojciech M., Menick Jacob, Schwarz Jonathan, Rae Jack, Osindero Simon, Teh Yee Whye, Harley Tim, and Pascanu Razvan. Multiplicative interactions and where to find them. In International Conference on Learning Representations, 2020.
- [78].Johnson Alistair EW, Pollard Tom J, Shen Lu, Li-Wei H Lehman, Feng Mengling, Ghassemi Mohammad, Moody Benjamin, Szolovits Peter, Celi Leo Anthony, and Mark Roger G. Mimic-iii, a freely accessible critical care database. Scientific data, 3(1):1–9, 2016. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [79].Kane John and Gobl Christer. Wavelet maxima dispersion for breathy to tense voice discrimination. IEEE Transactions on Audio, Speech, and Language Processing, 21(6):1170–1179, 2013. [Google Scholar]
- [80].Kay Will, Carreira Joao, Simonyan Karen, Zhang Brian, Hillier Chloe, Vijayanarasimhan Sudheendra, Viola Fabio, Green Tim, Back Trevor, Natsev Paul, et al. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017. [Google Scholar]
- [81].Kiela Douwe, Bhooshan Suvrat, Firooz Hamed, Perez Ethan, and Testuggine Davide. Supervised multimodal bitransformers for classifying images and text. arXiv preprint arXiv:1909.02950, 2019. [Google Scholar]
- [82].Kiela Douwe, Firooz Hamed, Mohan Aravind, Goswami Vedanuj, Singh Amanpreet, Ringshia Pratik, and Testuggine Davide. The hateful memes challenge: Detecting hate speech in multimodal memes. Advances in Neural Information Processing Systems, 33, 2020. [Google Scholar]
- [83].Elizabeth S Kim, Lauren D Berkovits, Emily P Bernier, Dan Leyzberg, Frederick Shic, Rhea Paul, and Brian Scassellati. Social robots as embedded reinforcers of social behavior in children with autism. Journal of autism and developmental disorders, 43(5):1038–1049, 2013. [DOI] [PubMed] [Google Scholar]
- [84].Kirchner Elsa A, Fairclough Stephen H, and Kirchner Frank. Embedded multimodal interfaces in robotics: applications, future trends, and societal implications. In The Handbook of Multimodal-Multisensor Interfaces: Language Processing, Software, Commercialization, and Emerging Directions-Volume 3, pages 523–576. 2019. [Google Scholar]
- [85].Koh Pang Wei, Sagawa Shiori, Marklund Henrik, Xie Sang Michael, Zhang Marvin, Balsubramani Akshay, Hu Weihua, Yasunaga Michihiro, Phillips Richard Lanas, Beery Sara, et al. Wilds: A benchmark of in-the-wild distribution shifts. arXiv preprint arXiv:2012.07421, 2020.
- [86].Kompa Benjamin, Snoek Jasper, and Beam Andrew L. Second opinion needed: communicating uncertainty in medical machine learning. NPJ Digital Medicine, 4(1):1–6, 2021. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [87].LeCun Yann, Bengio Yoshua, et al. Convolutional networks for images, speech, and time series. The handbook of brain theory and neural networks, 3361(10):1995, 1995. [Google Scholar]
- [88].LeCun Yann, Bottou Léon, Bengio Yoshua, and Haffner Patrick. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998. [Google Scholar]
- [89].Lee Michelle A, Tan Matthew, Zhu Yuke, and Bohg Jeannette. Detect, reject, correct: Crossmodal compensation of corrupted sensors. In IEEE International Conference on Robotics and Automation (ICRA), 2021. [Google Scholar]
- [90].Lee Michelle A, Yi Brent, Martín-Martín Roberto, Savarese Silvio, and Bohg Jeannette. Multimodal sensor fusion with differentiable filters. IROS, 2020. [Google Scholar]
- [91].Lee Michelle A, Zhu Yuke, Srinivasan Krishnan, Shah Parth, Savarese Silvio, Fei-Fei Li, Garg Animesh, and Bohg Jeannette. Making sense of vision and touch: Self-supervised learning of multimodal representations for contact-rich tasks. In 2019 International Conference on Robotics and Automation (ICRA), pages 8943–8950. IEEE, 2019. [Google Scholar]
- [92].Lee Michelle A, Zhu Yuke, Zachares Peter, Tan Matthew, Srinivasan Krishnan, Savarese Silvio, Fei-Fei Li, Garg Animesh, and Bohg Jeannette. Making sense of vision and touch: Learning multimodal representations for contact-rich tasks. IEEE Transactions on Robotics, 36(3):582–596, 2020.
- [93].Leiva Luis A, Hota Asutosh, and Oulasvirta Antti. Enrico: A dataset for topic modeling of mobile ui designs. In 22nd International Conference on Human-Computer Interaction with Mobile Devices and Services (MobileHCI’20 Extended Abstracts), 2020. [Google Scholar]
- [94].Leonard R Gary and Doddington George. Tidigits speech corpus. Texas Instruments, Inc, 1993.
- [95].Li Liunian Harold, Yatskar Mark, Yin Da, Hsieh Cho-Jui, and Chang Kai-Wei. Visualbert: A simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557, 2019. [Google Scholar]
- [96].Li Tian, Sahu Anit Kumar, Zaheer Manzil, Sanjabi Maziar, Talwalkar Ameet, and Smith Virginia. Federated optimization in heterogeneous networks. CoRR, abs/1812.06127, 2018. [Google Scholar]
- [97].Li Xiujun, Li Chunyuan, Xia Qiaolin, Bisk Yonatan, Celikyilmaz Asli, Gao Jianfeng, Smith Noah, and Choi Yejin. Robust navigation with language pretraining and stochastic sampling, 2019. [Google Scholar]
- [98].Liang Paul. Awesome multimodal ml. https://github.com/pliang279/awesome-multimodal-ml, 2020.
- [99].Liang Paul Pu, Li Irene, Zheng Emily, Lim Yao Chong, Salakhutdinov Ruslan, and Morency Louis-Philippe. Towards debiasing sentence representations. In ACL, 2020.
- [100].Liang Paul Pu, Liu Terrance, Liu Ziyin, Allen Nicholas B., Auerbach Randy P., Brent David, Salakhutdinov Ruslan, and Morency Louis-Philippe. Think locally, act globally: Federated learning with local and global representations. 2020.
- [101].Liang Paul Pu, Liu Zhun, Tsai Yao-Hung Hubert, Zhao Qibin, Salakhutdinov Ruslan, and Morency Louis-Philippe. Learning representations from imperfect time series data via tensor rank regularization. In ACL, 2019.
- [102].Liang Paul Pu, Liu Ziyin, Zadeh AmirAli Bagher, and Morency Louis-Philippe. Multimodal language analysis with recurrent multistage fusion. In EMNLP, 2018.
- [103].Liang Paul Pu, Salakhutdinov Ruslan, and Morency Louis-Philippe. Computational modeling of human multimodal language: The mosei dataset and interpretable dynamic fusion. Carnegie Mellon University, 2018.
- [104].Liang Paul Pu, Wu Peter, Liu Ziyin, Morency Louis-Philippe, and Salakhutdinov Ruslan. Cross-modal generalization: Learning in low resource modalities via meta-alignment. arXiv preprint arXiv:2012.02813, 2020.
- [105].Lin Tsung-Yi, Maire Michael, Belongie Serge, Hays James, Perona Pietro, Ramanan Deva, Dollár Piotr, and Zitnick C Lawrence. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014.
- [106].Liu Zhun, Shen Ying, Lakshminarasimhan Varun Bharadhwaj, Liang Paul Pu, Zadeh AmirAli Bagher, and Morency Louis-Philippe. Efficient low-rank multimodal fusion with modality-specific factors. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2247–2256, 2018.
- [107].Liu Ziyin, Wang Zhikang, Liang Paul Pu, Salakhutdinov Russ R, Morency Louis-Philippe, and Ueda Masahito. Deep gamblers: Learning to abstain with portfolio theory. Advances in Neural Information Processing Systems, 32:10623–10633, 2019.
- [108].Lloyd Kirsten. Bias amplification in artificial intelligence systems. CoRR, abs/1809.07842, 2018.
- [109].Lu Jiasen, Batra Dhruv, Parikh Devi, and Lee Stefan. Vilbert: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, pages 13–23, 2019.
- [110].Luketina Jelena, Nardelli Nantas, Farquhar Gregory, Foerster Jakob N, Andreas Jacob, Grefenstette Edward, Whiteson Shimon, and Rocktäschel Tim. A survey of reinforcement learning informed by natural language. In IJCAI, 2019.
- [111].Ma Mengmeng, Ren Jian, Zhao Long, Tulyakov Sergey, Wu Cathy, and Peng Xi. Smil: Multimodal learning with severely missing modality. AAAI, 2021.
- [112].McFee Brian, Raffel Colin, Liang Dawen, Ellis Daniel PW, McVicar Matt, Battenberg Eric, and Nieto Oriol. librosa: Audio and music signal analysis in python. Citeseer, 2015.
- [113].Mehrabi Ninareh, Morstatter Fred, Saxena Nripsuta, Lerman Kristina, and Galstyan Aram. A survey on bias and fairness in machine learning. arXiv preprint arXiv:1908.09635, 2019.
- [114].Min Weiqing, Jiang Shuqiang, Sang Jitao, Wang Huayang, Liu Xinda, and Herranz Luis. Being a supercook: Joint food attributes and multimodal content modeling for recipe retrieval and exploration. IEEE Transactions on Multimedia, 19(5):1100–1113, 2016.
- [115].Mitchell Margaret, Wu Simone, Zaldivar Andrew, Barnes Parker, Vasserman Lucy, Hutchinson Ben, Spitzer Elena, Raji Inioluwa Deborah, and Gebru Timnit. Model cards for model reporting. In Proceedings of the conference on fairness, accountability, and transparency, pages 220–229, 2019.
- [116].Naphade Milind, Smith John R, Tesic Jelena, Chang Shih-Fu, Hsu Winston, Kennedy Lyndon, Hauptmann Alexander, and Curtis Jon. Large-scale concept ontology for multimedia. IEEE multimedia, 13(3):86–91, 2006.
- [117].Obrenovic Zeljko and Starcevic Dusan. Modeling multimodal human-computer interaction. Computer, 37(9):65–72, 2004.
- [118].Otterbacher Jahna, Checco Alessandro, Demartini Gianluca, and Clough Paul. Investigating user perception of gender bias in image search: The role of sexism. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, SIGIR ’18, pages 933–936, New York, NY, USA, 2018. Association for Computing Machinery.
- [119].Pennington Jeffrey, Socher Richard, and Manning Christopher D. Glove: Global vectors for word representation. In EMNLP, volume 14, pages 1532–1543, 2014.
- [120].Perez Ethan, Strub Florian, de Vries Harm, Dumoulin Vincent, and Courville Aaron C. Film: Visual reasoning with a general conditioning layer. In AAAI, 2018.
- [121].Pérez-Rosas Verónica, Abouelenien Mohamed, Mihalcea Rada, and Burzo Mihai. Deception detection using real-life trial data. In Proceedings of the 2015 ACM on International Conference on Multimodal Interaction, pages 59–66, 2015.
- [122].Pérez-Rúa Juan-Manuel, Vielzeuf Valentin, Pateux Stéphane, Baccouche Moez, and Jurie Frédéric. Mfas: Multimodal fusion architecture search. In Proceedings of the IEEE Conference on computer vision and pattern recognition, pages 6966–6975, 2019.
- [123].Pham Hai, Liang Paul Pu, Manzini Thomas, Morency Louis-Philippe, and Póczos Barnabás. Found in translation: Learning robust joint representations by cyclic translations between modalities. In AAAI, 2019.
- [124].Picard Rosalind W. Affective computing. MIT press, 2000.
- [125].Piczak Karol J. ESC: Dataset for Environmental Sound Classification. In Proceedings of the 23rd Annual ACM Conference on Multimedia, pages 1015–1018. ACM Press, 2015.
- [126].Pittermann Johannes, Pittermann Angela, and Minker Wolfgang. Emotion recognition and adaptation in spoken dialogue systems. International Journal of Speech Technology, 2010.
- [127].Poria Soujanya, Cambria Erik, Bajpai Rajiv, and Hussain Amir. A review of affective computing: From unimodal analysis to multimodal fusion. Information Fusion, 2017.
- [128].Poria Soujanya, Hazarika Devamanyu, Majumder Navonil, Naik Gautam, Cambria Erik, and Mihalcea Rada. MELD: A multimodal multi-party dataset for emotion recognition in conversations. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019.
- [129].Purushotham Sanjay, Meng Chuizheng, Che Zhengping, and Liu Yan. Benchmarking deep learning models on large healthcare datasets. Journal of Biomedical Informatics, 83:112–134, 2018.
- [130].Rabanser Stephan, Günnemann Stephan, and Lipton Zachary C. Failing loudly: An empirical study of methods for detecting dataset shift. NeurIPS, 2019.
- [131].Radu Valentin, Lane Nicholas D, Bhattacharya Sourav, Mascolo Cecilia, Marina Mahesh K, and Kawsar Fahim. Towards multimodal deep learning for activity recognition on mobile devices. In Proceedings of the 2016 ACM International Joint Conference on Pervasive and Ubiquitous Computing: Adjunct, pages 185–188, 2016.
- [132].Ramesh Aditya, Pavlov Mikhail, Goh Gabriel, Gray Scott, Voss Chelsea, Radford Alec, Chen Mark, and Sutskever Ilya. Zero-shot text-to-image generation. arXiv preprint arXiv:2102.12092, 2021.
- [133].Ross Candace, Katz Boris, and Barbu Andrei. Measuring social biases in grounded vision and language embeddings. arXiv preprint arXiv:2002.08911, 2020.
- [134].Rumelhart David E, Hinton Geoffrey E, and Williams Ronald J. Learning internal representations by error propagation. Technical report, California Univ San Diego La Jolla Inst for Cognitive Science, 1985.
- [135].Sankaran Sethuraman, Yang David, and Lim Ser-Nam. Multimodal fusion refiner networks. arXiv preprint arXiv:2104.03435, 2021.
- [136].Sardelich Marcelo and Manandhar Suresh. Multimodal deep learning for short-term stock volatility prediction. arXiv preprint arXiv:1812.10479, 2018.
- [137].Scassellati Brian, Admoni Henny, and Matarić Maja. Robots for use in autism research. Annual review of biomedical engineering, 14, 2012.
- [138].Sharif Naeha, Nadeem Uzair, Shah Syed Afaq Ali, Bennamoun Mohammed, and Liu Wei. Vision to language: Methods, metrics and datasets. In Machine Learning Paradigms, pages 9–62. Springer, 2020.
- [139].Simonyan Karen and Zisserman Andrew. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, 2015.
- [140].Socher Richard, Ganjoo Milind, Sridhar Hamsa, Bastani Osbert, Manning Christopher D, and Ng Andrew Y. Zero-shot learning through cross-modal transfer. arXiv preprint arXiv:1301.3666, 2013.
- [141].Srinivasan Tejas and Bisk Yonatan. Worst of both worlds: Biases compound in pre-trained vision-and-language models. arXiv preprint arXiv:2104.08666, 2021.
- [142].Strubell Emma, Ganesh Ananya, and McCallum Andrew. Energy and policy considerations for deep learning in nlp. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3645–3650, 2019.
- [143].Su Weijie, Zhu Xizhou, Cao Yue, Li Bin, Lu Lewei, Wei Furu, and Dai Jifeng. Vl-bert: Pre-training of generic visual-linguistic representations. In International Conference on Learning Representations, 2020.
- [144].Venkata Subramaniam L, Roy Shourya, Faruquie Tanveer A., and Negi Sumit. A survey of types of text noise and techniques to handle noisy text. In Proceedings of The Third Workshop on Analytics for Noisy Unstructured Text Data, AND ’09, pages 115–122, New York, NY, USA, 2009. Association for Computing Machinery.
- [145].Sun Zhongkai, Sarma Prathusha, Sethares William, and Liang Yingyu. Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 8992–8999, 2020.
- [146].Szegedy Christian, Zaremba Wojciech, Sutskever Ilya, Bruna Joan, Erhan Dumitru, Goodfellow Ian, and Fergus Rob. Intriguing properties of neural networks. In 2nd International Conference on Learning Representations, ICLR 2014, 2014.
- [147].Talmor Alon, Yoran Ori, Catav Amnon, Lahav Dan, Wang Yizhong, Asai Akari, Ilharco Gabriel, Hajishirzi Hannaneh, and Berant Jonathan. MultimodalQA: Complex question answering over text, tables and images. In International Conference on Learning Representations, 2021.
- [148].Tan Sabine, O’Halloran Kay, and Wignell Peter. Multimodal research: Addressing the complexity of multimodal environments and the challenges for call. ReCALL, 28(3):253–273, 2016.
- [149].Taori Rohan, Dave Achal, Shankar Vaishaal, Carlini Nicholas, Recht Benjamin, and Schmidt Ludwig. Measuring robustness to natural distribution shifts in image classification. NeurIPS, 2020.
- [150].Tay Yi, Dehghani Mostafa, Abnar Samira, Shen Yikang, Bahri Dara, Pham Philip, Rao Jinfeng, Yang Liu, Ruder Sebastian, and Metzler Donald. Long range arena: A benchmark for efficient transformers. ICLR, 2021.
- [151].Tian Yonglong, Krishnan Dilip, and Isola Phillip. Contrastive multiview coding. ECCV, 2020.
- [152].Todorov Emanuel, Erez Tom, and Tassa Yuval. Mujoco: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–5033. IEEE, 2012.
- [153].Tran Kevin, Neiswanger Willie, Yoon Junwoong, Zhang Qingyang, Xing Eric, and Ulissi Zachary W. Methods for comparing uncertainty quantifications for material property predictions. Machine Learning: Science and Technology, 1(2):025006, 2020.
- [154].Tsai Yao-Hung Hubert, Bai Shaojie, Liang Paul Pu, Kolter J Zico, Morency Louis-Philippe, and Salakhutdinov Ruslan. Multimodal transformer for unaligned multimodal language sequences. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6558–6569, 2019.
- [155].Tsai Yao-Hung Hubert, Liang Paul Pu, Zadeh Amir, Morency Louis-Philippe, and Salakhutdinov Ruslan. Learning factorized multimodal representations. In ICLR, 2019.
- [156].Tsai Yao-Hung Hubert, Ma Martin, Yang Muqiao, Salakhutdinov Ruslan, and Morency Louis-Philippe. Multimodal routing: Improving local and global interpretability of multimodal language analysis. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1823–1833, 2020.
- [157].van den Oord Aaron, Dieleman Sander, Zen Heiga, Simonyan Karen, Vinyals Oriol, Graves Alex, Kalchbrenner Nal, Senior Andrew, and Kavukcuoglu Koray. Wavenet: A generative model for raw audio, 2016.
- [158].Vaswani Ashish, Shazeer Noam, Parmar Niki, Uszkoreit Jakob, Jones Llion, Gomez Aidan N, Kaiser Lukasz, and Polosukhin Illia. Attention is all you need. In NIPS, 2017.
- [159].Vedantam Ramakrishna, Desai Karan, Lee Stefan, Rohrbach Marcus, Batra Dhruv, and Parikh Devi. Probabilistic neural symbolic models for interpretable visual question answering. In International Conference on Machine Learning, pages 6428–6437, 2019.
- [160].Velupillai Sumithra, Suominen Hanna, Liakata Maria, Roberts Angus, Shah Anoop D., Morley Katherine, Osborn David, Hayes Joseph, Stewart Robert, Downs Johnny, Chapman Wendy, and Dutta Rina. Using clinical natural language processing for health outcomes research: Overview and actionable suggestions for future advances. Journal of Biomedical Informatics, 88:11–19, 2018.
- [161].Vielzeuf Valentin, Lechervy Alexis, Pateux Stéphane, and Jurie Frédéric. Centralnet: a multilayer approach for multimodal fusion, 2018.
- [162].Vinyals Oriol, Toshev Alexander, Bengio Samy, and Erhan Dumitru. Show and tell: Lessons learned from the 2015 mscoco image captioning challenge. IEEE transactions on pattern analysis and machine intelligence, 39(4):652–663, 2016.
- [163].Wang Alex, Pruksachatkun Yada, Nangia Nikita, Singh Amanpreet, Michael Julian, Hill Felix, Levy Omer, and Bowman Samuel R. Superglue: A stickier benchmark for general-purpose language understanding systems. Advances in Neural Information Processing Systems, 32, 2019.
- [164].Wang Alex, Singh Amanpreet, Michael Julian, Hill Felix, Levy Omer, and Bowman Samuel. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355, Brussels, Belgium, November 2018. Association for Computational Linguistics.
- [165].Wang Shirly, McDermott Matthew BA, Chauhan Geeticka, Ghassemi Marzyeh, Hughes Michael C, and Naumann Tristan. Mimic-extract: A data extraction, preprocessing, and representation pipeline for mimic-iii. In Proceedings of the ACM Conference on Health, Inference, and Learning, pages 222–235, 2020.
- [166].Wang Weiran, Arora Raman, Livescu Karen, and Bilmes Jeff. On deep multi-view representation learning. In International conference on machine learning, pages 1083–1092. PMLR, 2015.
- [167].Wang Weiyao, Tran Du, and Feiszli Matt. What makes training multi-modal classification networks hard? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12695–12705, 2020.
- [168].Wu Mike and Goodman Noah. Multimodal generative models for scalable weakly-supervised learning. In Advances in Neural Information Processing Systems, pages 5575–5585, 2018.
- [169].Xia Yingda, Yang Dong, Yu Zhiding, Liu Fengze, Cai Jinzheng, Yu Lequan, Zhu Zhuotun, Xu Daguang, Yuille Alan, and Roth Holger. Uncertainty-aware multi-view co-training for semi-supervised medical image segmentation and domain adaptation. Medical Image Analysis, 65:101766, 2020.
- [170].Xing Chen, Rostamzadeh Negar, Oreshkin Boris, and Pinheiro Pedro O. O. Adaptive cross-modal few-shot learning. In Wallach H, Larochelle H, Beygelzimer A, d’Alché-Buc F, Fox E, and Garnett R, editors, Advances in Neural Information Processing Systems 32. 2019.
- [171].Xu Kelvin, Ba Jimmy, Kiros Ryan, Cho Kyunghyun, Courville Aaron, Salakhutdinov Ruslan, Zemel Rich, and Bengio Yoshua. Show, attend and tell: Neural image caption generation with visual attention. In International conference on machine learning, pages 2048–2057, 2015.
- [172].Xu Keyang, Lam Mike, Pang Jingzhi, Gao Xin, Band Charlotte, Mathur Piyush, Papay Frank, Khanna Ashish K, Cywinski Jacek B, Maheshwari Kamal, et al. Multimodal machine learning for automated icd coding. In Machine Learning for Healthcare Conference, pages 197–215. PMLR, 2019.
- [173].Xu Zhen, So David R, and Dai Andrew M. Mufasa: Multimodal fusion architecture search for electronic health records. arXiv preprint arXiv:2102.02340, 2021.
- [174].Yao Shaowei and Wan Xiaojun. Multimodal transformer for multimodal machine translation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, July 2020. Association for Computational Linguistics.
- [175].Yu Kuan-Ting, Bauza Maria, Fazeli Nima, and Rodriguez Alberto. More than a million ways to be pushed: A high-fidelity experimental dataset of planar pushing. In 2016 IEEE/RSJ international conference on intelligent robots and systems (IROS), pages 30–37. IEEE, 2016.
- [176].Yuan Jiahong and Liberman Mark. Speaker identification on the scotus corpus. Journal of the Acoustical Society of America, 123(5):3878, 2008.
- [177].Zadeh Amir. CMU multimodal SDK. https://github.com/A2Zadeh/CMU-MultimodalSDK, 2019.
- [178].Zadeh Amir, Chan Michael, Liang Paul Pu, Tong Edmund, and Morency Louis-Philippe. Social-iq: A question answering benchmark for artificial social intelligence. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8807–8817, 2019.
- [179].Zadeh Amir, Chen Minghai, Poria Soujanya, Cambria Erik, and Morency Louis-Philippe. Tensor fusion network for multimodal sentiment analysis. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1103–1114, 2017.
- [180].Zadeh Amir, Liang Paul Pu, and Morency Louis-Philippe. Foundations of multimodal co-learning. Information Fusion, 64:188–193, 2020.
- [181].Zadeh Amir, Zellers Rowan, Pincus Eli, and Morency Louis-Philippe. Mosi: multimodal corpus of sentiment intensity and subjectivity analysis in online opinion videos. arXiv preprint arXiv:1606.06259, 2016.
- [182].Zadeh AmirAli Bagher, Cao Yansheng, Hessner Simon, Liang Paul Pu, Poria Soujanya, and Morency Louis-Philippe. Moseas: A multimodal language dataset for spanish, portuguese, german and french. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1801–1812, 2020.
- [183].Zadeh AmirAli Bagher, Liang Paul Pu, Poria Soujanya, Cambria Erik, and Morency Louis-Philippe. Multimodal language analysis in the wild: Cmu-mosei dataset and interpretable dynamic fusion graph. In ACL, 2018.
- [184].Zaheer Manzil, Kottur Satwik, Ravanbakhsh Siamak, Póczos Barnabás, Salakhutdinov Ruslan R, and Smola Alexander J. Deep sets. In NIPS, 2017.
- [185].Zhang Kun, Schölkopf Bernhard, Muandet Krikamol, and Wang Zhikun. Domain adaptation under target and conditional shift. In International Conference on Machine Learning, pages 819–827. PMLR, 2013.
- [186].Zhao Han and Gordon Geoff. Inherent tradeoffs in learning fair representations. In Wallach H, Larochelle H, Beygelzimer A, d’Alché-Buc F, Fox E, and Garnett R, editors, Advances in Neural Information Processing Systems, volume 32, pages 15675–15685. Curran Associates, Inc., 2019.
- [187].Zhen Liangli, Hu Peng, Wang Xu, and Peng Dezhong. Deep supervised cross-modal retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10394–10403, 2019.
- [188].Zhong Victor, Rocktäschel Tim, and Grefenstette Edward. Rtfm: Generalising to new environment dynamics via reading. In International Conference on Learning Representations, 2020.