Author manuscript; available in PMC: 2024 May 21.

Published in final edited form as: Adv Neural Inf Process Syst. 2021 Dec;2021(DB1):1–20.

Table 1:

MultiBench provides a comprehensive suite of 15 multimodal datasets to benchmark current and proposed approaches in multimodal representation learning. It covers a diverse range of research areas, dataset sizes, input modalities (in the form of $ℓ$: language, $i$: image, $v$: video, $a$: audio, $t$: time-series, $ta$: tabular, $f$: force sensor, $p$: proprioception sensor, $s$: set, $o$: optical flow), and prediction tasks. We provide a standardized data loader for datasets in MultiBench, along with a set of state-of-the-art multimodal models.

Research Area	Size	Dataset	Modalities	# Samples	Prediction task
Affective Computing	S	MUStARD [24]	{$ℓ$, $v$, $a$}	690	sarcasm
Affective Computing	M	CMU-MOSI [181]	{$ℓ$, $v$, $a$}	2,199	sentiment
Affective Computing	L	UR-FUNNY [64]	{$ℓ$, $v$, $a$}	16,514	humor
Affective Computing	L	CMU-MOSEI [183]	{$ℓ$, $v$, $a$}	22,777	sentiment, emotions
Healthcare	L	MIMIC [78]	{$t$, $ta$}	36,212	mortality, ICD-9 codes
Robotics	M	MuJoCo Push [90]	{$i$, $f$, $p$}	37,990	object pose
Robotics	L	Vision&Touch [92]	{$i$, $f$, $p$}	147,000	contact, robot pose
Finance	M	Stocks-F&B	{$t$ ×18}	5,218	stock price, volatility
Finance	M	Stocks-Health	{$t$ ×63}	5,218	stock price, volatility
Finance	M	Stocks-Tech	{$t$ ×100}	5,218	stock price, volatility
HCI	S	ENRICO [93]	{$i$, $s$}	1,460	design interface
Multimedia	S	Kinetics400-S [80]	{$v$, $a$, $o$}	2,624	human action
Multimedia	M	MM-IMDb [8]	{$ℓ$, $i$}	25,959	movie genre
Multimedia	M	AV-MNIST [161]	{$i$, $a$}	70,000	digit
Multimedia	L	Kinetics400-L [80]	{$v$, $a$, $o$}	306,245	human action