2021 Jun 10;38(8):2939–2970. doi: 10.1007/s00371-021-02166-7

Table 2.

Summary of the multimodal applications reviewed, their related technical details, and best results achieved

| Reference | Year | Application | Sensing modality/data sources | Fusion scheme | Dataset (best result) |
|---|---|---|---|---|---|
| [73] | 2018 | Person recognition | Face and body information | Late fusion (score-level fusion) | DFB-DB1 (EER = 1.52%); ChokePoint (EER = 0.58%) |
| [51] | 2015 | Face recognition | Holistic face + Rendered frontal pose data | Late fusion | LFW (ACC = 98.43%); CASIA-WebFace (ACC = 99.02%) |
| [93] | 2020 | Face recognition | Biometric traits (face and iris) | Feature concatenation | CASIA-ORL (ACC = 99.16%); CASIA-FERET (ACC = 99.33%) |
| [100] | 2016 | Image retrieval | Visual + Textual | Joint embeddings | Flickr30K (mAP = 47.72%; R@10 = 79.9%); MSCOCO (R@10 = 86.9%) |
| [101] | 2016 | Image retrieval | Photos + Sketches | Joint embeddings | Fine-grained SBIR Database (R@5 = 19.8%) |
| [102] | 2015 | Image retrieval | Cross-view image pairs | Alignment | A dataset of 78k pairs of Google street-view images (AP = 41.9%) |
| [103] | 2019 | Image retrieval | Visual + Textual | Feature concatenation | Fashion-200k (R@50 = 63.8%); MIT-State (R@10 = 43.1%); CS (R@1 = 73.7%) |
| [97] | 2015 | Gesture recognition | RGB + D | Recurrent fusion, late fusion, and early fusion | SKIG (ACC = 97.8%) |
| [98] | 2017 | Gesture recognition | RGB + D | A canonical correlation scheme | Chalearn LAP IsoGD (ACC = 67.71%) |
| [200] | 2019 | Gesture recognition | RGB + D + Optical flow | A spatio-temporal semantic alignment loss (SSA) | VIVA hand gestures (ACC = 86.08%); EgoGesture (ACC = 93.87%); NVGestures (ACC = 86.93%) |
| [52] | 2019 | Image captioning | Visual + Textual | RNN + Attention mechanism | GoodNews (BLEU-1 = 8.92%) |
| [53] | 2019 | Image captioning | Visual + Textual + Acoustic | Alignment | MSCOCO (R@10 = 91.6%); Flickr30K (R@10 = 79.0%) |
| [128] | 2019 | Image captioning | Visual + Textual | Alignment | MSCOCO (BLEU-1 = 61.7%) |
| [174] | 2019 | Image captioning | Visual + Textual | Gated fusion network | MSR-VTT (BLEU-1 = 81.2%); MSVD (BLEU-4 = 53.9%) |
| [175] | 2019 | Image captioning | Visual + Acoustic | GRU encoder-decoder | Proposed dataset (BLEU-1 = 36.9%) |
| [176] | 2020 | Image captioning | Visual + Textual (spatio-temporal data) | Object-aware knowledge distillation mechanism | MSR-VTT (BLEU-4 = 40.5%); MSVD (BLEU-4 = 52.2%) |
| [87] | 2018 | Vision-and-language navigation | Visual + Textual (instructions) | Attention mechanism + LSTM | R2R (SPL = 18%) |
| [88] | 2019 | Vision-and-language navigation | Visual + Textual | Attention mechanism + Language encoder | R2R (SPL = 38%) |
| [118] | 2020 | Vision-and-language navigation | Visual + Textual (instructions) | Domain adaptation | R2R (performance gap = 8.6); R4R (performance gap = 23.9); CVDN (performance gap = 3.55) |
| [119] | 2020 | Vision-and-language navigation | Visual + Textual (instructions) | Early fusion + Late fusion | R2R (SPL = 59%) |
| [120] | 2020 | Vision-and-language navigation | Visual + Textual (instructions) | Attention mechanism + Feature concatenation | VLN-CE (SPL = 35%) |
| [121] | 2019 | Vision-and-language navigation | Visual + Textual (instructions) | Encoder-decoder + Multiplicative attention mechanism | ASKNAV (success rate = 52.26%) |
| [89] | 2018 | Embodied question answering | Visual + Textual (questions) | Attention mechanism + Alignment | EQA-v1 (MR = 3.22) |
| [90] | 2019 | Embodied question answering | Visual + Textual (questions) | Feature concatenation | EQA-v1 (ACC = 61.45%) |
| [122] | 2019 | Embodied question answering | Visual + Textual (questions) | Alignment | VideoNavQA (ACC = 64.08%) |
| [125] | 2019 | Video question answering | Visual + Textual (questions) | Bilinear fusion | TDIUC (ACC = 88.20%); VQA-CP (ACC = 39.54%); VQA-v2 (ACC = 65.14%) |
| [126] | 2019 | Video question answering | Visual + Textual (questions) | Alignment | TGIF-QA (ACC = 53.8%); MSVD-QA (ACC = 33.7%); MSRVTT-QA (ACC = 33.00%); Youtube2Text-QA (ACC = 82.5%) |
| [127] | 2020 | Video question answering | Visual + Textual (questions) | Hierarchical Conditional Relation Networks (HCRN) | MSRVTT-QA (ACC = 35.6%); MSVD-QA (ACC = 36.1%) |
| [129] | 2019 | Video question answering | Visual + Textual (questions) | Dual-LSTM + Spatial and temporal attention | TGIF-QA (l2 distance = 4.22) |
| [159] | 2019 | Style transfer | Content + Style | Graph-based matching | A dataset of images from MSCOCO and WikiArt (PV = 33.45%) |
| [160] | 2017 | Style transfer | Content + Style | Hierarchical feature concatenation | A dataset of images from MSCOCO (PS = 0.54 s) |
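
Two of the fusion schemes that recur in the table, feature concatenation (a form of early fusion) and score-level late fusion, can be contrasted with a minimal Python sketch. This is an illustration only, not the method of any cited work: the function names, the weighted-average rule for combining scores, and all numeric values are assumptions made for the example.

```python
from typing import List, Sequence


def feature_concatenation(feat_a: Sequence[float], feat_b: Sequence[float]) -> List[float]:
    """Early fusion: join per-modality feature vectors into a single vector
    that one downstream model would consume."""
    return list(feat_a) + list(feat_b)


def score_level_fusion(scores: Sequence[float], weights: Sequence[float]) -> float:
    """Late fusion: each modality is scored by its own model, and only the
    resulting scores are combined (here by a weighted average, an assumed rule)."""
    if len(scores) != len(weights):
        raise ValueError("need exactly one weight per modality score")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(s * w for s, w in zip(scores, weights)) / total


# Hypothetical feature vectors and matching scores, for illustration only.
face_feat = [0.2, 0.7]
body_feat = [0.1, 0.4, 0.9]

fused_feat = feature_concatenation(face_feat, body_feat)  # one 5-dimensional vector
fused_score = score_level_fusion([0.9, 0.6], [2.0, 1.0])  # (0.9*2 + 0.6*1) / 3
```

The practical difference is where the modalities meet: concatenation lets a single model learn cross-modal interactions but requires all features at once, whereas score-level fusion keeps the per-modality models independent and combines only their outputs.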