Table 2.
Summary of the multimodal applications reviewed, their related technical details, and best results achieved
References | Year | Application | Sensing modality/data sources | Fusion scheme | Dataset/best results |
---|---|---|---|---|---
[73] | 2018 | Person recognition | Face and body information | Late fusion (score-level) | DFB-DB1 (EER = 1.52%); ChokePoint (EER = 0.58%)
[51] | 2015 | Face recognition | Holistic face + Rendered frontal pose data | Late fusion | LFW (ACC = 98.43%); CASIA-WebFace (ACC = 99.02%)
[93] | 2020 | Face recognition | Biometric traits (face and iris) | Feature concatenation | CASIA-ORL (ACC = 99.16%); CASIA-FERET (ACC = 99.33%)
[100] | 2016 | Image retrieval | Visual + Textual | Joint embeddings | Flickr30K (mAP = 47.72%; R@10 = 79.9%); MSCOCO (R@10 = 86.9%)
[101] | 2016 | Image retrieval | Photos + Sketches | Joint embeddings | Fine-grained SBIR Database (R@5 = 19.8%)
[102] | 2015 | Image retrieval | Cross-view image pairs | Alignment | A dataset of 78k pairs of Google street-view images (AP = 41.9%)
[103] | 2019 | Image retrieval | Visual + Textual | Feature concatenation | Fashion-200k (R@50 = 63.8%); MIT-State (R@10 = 43.1%); CS (R@1 = 73.7%)
[97] | 2015 | Gesture recognition | RGB + D | Recurrent fusion, late fusion, and early fusion | SKIG (ACC = 97.8%)
[98] | 2017 | Gesture recognition | RGB + D | Canonical correlation scheme | Chalearn LAP IsoGD (ACC = 67.71%)
[200] | 2019 | Gesture recognition | RGB + D + Optical flow | Spatio-temporal semantic alignment (SSA) loss | VIVA hand gestures (ACC = 86.08%); EgoGesture (ACC = 93.87%); NVGestures (ACC = 86.93%)
[52] | 2019 | Image captioning | Visual + Textual | RNN + Attention mechanism | GoodNews (BLEU-1 = 8.92%)
[53] | 2019 | Image captioning | Visual + Textual + Acoustic | Alignment | MSCOCO (R@10 = 91.6%); Flickr30K (R@10 = 79.0%)
[128] | 2019 | Image captioning | Visual + Textual | Alignment | MSCOCO (BLEU-1 = 61.7%)
[174] | 2019 | Image captioning | Visual + Textual | Gated fusion network | MSR-VTT (BLEU-1 = 81.2%); MSVD (BLEU-4 = 53.9%)
[175] | 2019 | Image captioning | Visual + Acoustic | GRU encoder-decoder | Proposed dataset (BLEU-1 = 36.9%)
[176] | 2020 | Image captioning | Visual + Textual (spatio-temporal data) | Object-aware knowledge distillation mechanism | MSR-VTT (BLEU-4 = 40.5%); MSVD (BLEU-4 = 52.2%)
[87] | 2018 | Vision-and-language navigation | Visual + Textual (instructions) | Attention mechanism + LSTM | R2R (SPL = 18%)
[88] | 2019 | Vision-and-language navigation | Visual + Textual | Attention mechanism + Language encoder | R2R (SPL = 38%)
[118] | 2020 | Vision-and-language navigation | Visual + Textual (instructions) | Domain adaptation | R2R (performance gap = 8.6); R4R (performance gap = 23.9); CVDN (performance gap = 3.55)
[119] | 2020 | Vision-and-language navigation | Visual + Textual (instructions) | Early fusion + Late fusion | R2R (SPL = 59%)
[120] | 2020 | Vision-and-language navigation | Visual + Textual (instructions) | Attention mechanism + Feature concatenation | VLN-CE (SPL = 35%)
[121] | 2019 | Vision-and-language navigation | Visual + Textual (instructions) | Encoder-decoder + Multiplicative attention mechanism | ASKNAV (success rate = 52.26%)
[89] | 2018 | Embodied question answering | Visual + Textual (questions) | Attention mechanism + Alignment | EQA-v1 (MR = 3.22)
[90] | 2019 | Embodied question answering | Visual + Textual (questions) | Feature concatenation | EQA-v1 (ACC = 61.45%)
[122] | 2019 | Embodied question answering | Visual + Textual (questions) | Alignment | VideoNavQA (ACC = 64.08%)
[125] | 2019 | Video question answering | Visual + Textual (questions) | Bilinear fusion | TDIUC (ACC = 88.20%); VQA-CP (ACC = 39.54%); VQA-v2 (ACC = 65.14%)
[126] | 2019 | Video question answering | Visual + Textual (questions) | Alignment | TGIF-QA (ACC = 53.8%); MSVD-QA (ACC = 33.7%); MSRVTT-QA (ACC = 33.00%); Youtube2Text-QA (ACC = 82.5%)
[127] | 2020 | Video question answering | Visual + Textual (questions) | Hierarchical Conditional Relation Networks (HCRN) | MSRVTT-QA (ACC = 35.6%); MSVD-QA (ACC = 36.1%)
[129] | 2019 | Video question answering | Visual + Textual (questions) | Dual-LSTM + Spatial and temporal attention | TGIF-QA (L2 distance = 4.22)
[159] | 2019 | Style transfer | Content + Style | Graph-based matching | A dataset of images from MSCOCO and WikiArt (PV = 33.45%)
[160] | 2017 | Style transfer | Content + Style | Hierarchical feature concatenation | A dataset of images from MSCOCO (PS = 0.54s)
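The two simplest fusion schemes recurring in the table, late (score-level) fusion and feature concatenation, can be contrasted in a minimal NumPy sketch. The function names, the equal-weight averaging, and all numeric values below are illustrative assumptions, not taken from any of the cited works.

```python
import numpy as np

def late_fusion(scores_a, scores_b, w=0.5):
    """Score-level (late) fusion: each modality is classified
    independently, then the per-class scores are combined
    (here, a weighted average)."""
    return w * scores_a + (1.0 - w) * scores_b

def feature_concatenation(feat_a, feat_b):
    """Feature concatenation (early fusion): modality features are
    stacked into a single vector before any shared classifier."""
    return np.concatenate([feat_a, feat_b])

# Hypothetical face vs. iris class scores over three identities
face_scores = np.array([0.7, 0.2, 0.1])
iris_scores = np.array([0.5, 0.4, 0.1])
fused = late_fusion(face_scores, iris_scores)
print(fused)            # [0.6 0.3 0.1]
print(fused.argmax())   # 0 -> identity 0 wins under equal weights

# Hypothetical 4-d face feature joined with a 3-d iris feature
joint = feature_concatenation(np.ones(4), np.zeros(3))
print(joint.shape)      # (7,)
```

The trade-off is the usual one: late fusion keeps the unimodal pipelines independent (robust to a missing modality), while concatenation lets the downstream model learn cross-modal interactions at the cost of a joint training step.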