
Table 1:

MultiBench provides a comprehensive suite of 15 multimodal datasets to benchmark current and proposed approaches in multimodal representation learning. It covers a diverse range of research areas, dataset sizes, input modalities (ℓ: language, i: image, v: video, a: audio, t: time-series, ta: tabular, f: force sensor, p: proprioception sensor, s: set, o: optical flow), and prediction tasks. We provide a standardized data loader for datasets in MultiBench, along with a set of state-of-the-art multimodal models; a sketch of such a loader interface follows the table below.

| Research Area | Size | Dataset | Modalities | # Samples | Prediction task |
|---|---|---|---|---|---|
| Affective Computing | S | MUStARD [24] | {ℓ, v, a} | 690 | sarcasm |
| Affective Computing | M | CMU-MOSI [181] | {ℓ, v, a} | 2,199 | sentiment |
| Affective Computing | L | UR-FUNNY [64] | {ℓ, v, a} | 16,514 | humor |
| Affective Computing | L | CMU-MOSEI [183] | {ℓ, v, a} | 22,777 | sentiment, emotions |
| Healthcare | L | MIMIC [78] | {t, ta} | 36,212 | mortality, ICD-9 codes |
| Robotics | M | MuJoCo Push [90] | {i, f, p} | 37,990 | object pose |
| Robotics | L | Vision&Touch [92] | {i, f, p} | 147,000 | contact, robot pose |
| Finance | M | Stocks-F&B | {t ×18} | 5,218 | stock price, volatility |
| Finance | M | Stocks-Health | {t ×63} | 5,218 | stock price, volatility |
| Finance | M | Stocks-Tech | {t ×100} | 5,218 | stock price, volatility |
| HCI | S | ENRICO [93] | {i, s} | 1,460 | design interface |
| Multimedia | S | Kinetics400-S [80] | {v, a, o} | 2,624 | human action |
| Multimedia | M | MM-IMDb [8] | {ℓ, i} | 25,959 | movie genre |
| Multimedia | M | AV-MNIST [161] | {i, a} | 70,000 | digit |
| Multimedia | L | Kinetics400-L [80] | {v, a, o} | 306,245 | human action |
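
The caption notes that MultiBench ships a standardized data loader for every dataset. Below is a minimal PyTorch sketch of what such a unified multimodal loading interface looks like: every dataset, regardless of research area, yields samples in the same `([modality_1, ..., modality_k], label)` format. The names `MultimodalDataset` and `get_dataloader`, and the CMU-MOSI feature dimensions in the usage example, are illustrative assumptions, not MultiBench's exact API.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class MultimodalDataset(Dataset):
    """Wraps aligned per-modality tensors plus labels for one dataset.

    Illustrative sketch of a standardized loader in the style MultiBench
    describes; class name and interface are assumptions, not the real API.
    """
    def __init__(self, modalities, labels):
        # modalities: list of tensors, one per modality (e.g. language,
        # video, audio), each of shape (num_samples, ...);
        # labels: tensor of shape (num_samples,)
        assert all(m.shape[0] == labels.shape[0] for m in modalities)
        self.modalities = modalities
        self.labels = labels

    def __len__(self):
        return self.labels.shape[0]

    def __getitem__(self, idx):
        # Standardized sample format shared across datasets:
        # ([modality_1, ..., modality_k], label)
        return [m[idx] for m in self.modalities], self.labels[idx]

def get_dataloader(modalities, labels, batch_size=32, shuffle=True):
    # Hypothetical helper mirroring MultiBench's per-dataset loader utilities.
    return DataLoader(MultimodalDataset(modalities, labels),
                      batch_size=batch_size, shuffle=shuffle)

# Usage with synthetic stand-ins for a {ℓ, v, a} dataset like CMU-MOSI
# (2,199 samples; the feature dimensions below are illustrative):
language = torch.randn(2199, 50, 300)  # 50 timesteps of 300-d word embeddings
video    = torch.randn(2199, 50, 35)   # 35-d visual features per timestep
audio    = torch.randn(2199, 50, 74)   # 74-d acoustic features per timestep
labels   = torch.randn(2199)           # continuous sentiment scores
train_loader = get_dataloader([language, video, audio], labels)
```

Because every dataset exposes the same `(list-of-modalities, label)` batch structure, the same multimodal model code can be benchmarked across all 15 datasets without per-dataset glue code.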