Table 2.
Summary of the reviewed multimodal applications, their key technical details, and the best results achieved
| References | Year | Application | Sensing modality/data sources | Fusion scheme | Dataset/best results |
|---|---|---|---|---|---|
| [73] | 2018 | Person recognition | Face and body information | Late fusion (score-level fusion) | DFB-DB1 (EER = 1.52%); ChokePoint (EER = 0.58%) |
| [51] | 2015 | Face recognition | Holistic face + Rendered frontal pose data | Late fusion | LFW (ACC = 98.43%); CASIA-WebFace (ACC = 99.02%) |
| [93] | 2020 | Face recognition | Biometric traits (face and iris) | Feature concatenation | CASIA-ORL (ACC = 99.16%); CASIA-FERET (ACC = 99.33%) |
| [100] | 2016 | Image retrieval | Visual + Textual | Joint embeddings | Flickr30K (mAP = 47.72%; R@10 = 79.9%); MSCOCO (R@10 = 86.9%) |
| [101] | 2016 | Image retrieval | Photos + Sketches | Joint embeddings | Fine-grained SBIR database (R@5 = 19.8%) |
| [102] | 2015 | Image retrieval | Cross-view image pairs | Alignment | Dataset of 78k Google Street View image pairs (AP = 41.9%) |
| [103] | 2019 | Image retrieval | Visual + Textual | Feature concatenation | Fashion-200k (R@50 = 63.8%); MIT-State (R@10 = 43.1%); CS (R@1 = 73.7%) |
| [97] | 2015 | Gesture recognition | RGB + Depth | Recurrent fusion, late fusion, and early fusion | SKIG (ACC = 97.8%) |
| [98] | 2017 | Gesture recognition | RGB + Depth | Canonical correlation scheme | ChaLearn LAP IsoGD (ACC = 67.71%) |
| [200] | 2019 | Gesture recognition | RGB + Depth + Optical flow | Spatio-temporal semantic alignment (SSA) loss | VIVA hand gestures (ACC = 86.08%); EgoGesture (ACC = 93.87%); NVGestures (ACC = 86.93%) |
| [52] | 2019 | Image captioning | Visual + Textual | RNN + attention mechanism | GoodNews (BLEU-1 = 8.92%) |
| [53] | 2019 | Image captioning | Visual + Textual + Acoustic | Alignment | MSCOCO (R@10 = 91.6%); Flickr30K (R@10 = 79.0%) |
| [128] | 2019 | Image captioning | Visual + Textual | Alignment | MSCOCO (BLEU-1 = 61.7%) |
| [174] | 2019 | Image captioning | Visual + Textual | Gated fusion network | MSR-VTT (BLEU-1 = 81.2%); MSVD (BLEU-4 = 53.9%) |
| [175] | 2019 | Image captioning | Visual + Acoustic | GRU encoder-decoder | Proposed dataset (BLEU-1 = 36.9%) |
| [176] | 2020 | Image captioning | Visual + Textual (spatio-temporal data) | Object-aware knowledge distillation mechanism | MSR-VTT (BLEU-4 = 40.5%); MSVD (BLEU-4 = 52.2%) |
| [87] | 2018 | Vision-and-language navigation | Visual + Textual (instructions) | Attention mechanism + LSTM | R2R (SPL = 18%) |
| [88] | 2019 | Vision-and-language navigation | Visual + Textual | Attention mechanism + language encoder | R2R (SPL = 38%) |
| [118] | 2020 | Vision-and-language navigation | Visual + Textual (instructions) | Domain adaptation | R2R (performance gap = 8.6); R4R (performance gap = 23.9); CVDN (performance gap = 3.55) |
| [119] | 2020 | Vision-and-language navigation | Visual + Textual (instructions) | Early fusion + late fusion | R2R (SPL = 59%) |
| [120] | 2020 | Vision-and-language navigation | Visual + Textual (instructions) | Attention mechanism + feature concatenation | VLN-CE (SPL = 35%) |
| [121] | 2019 | Vision-and-language navigation | Visual + Textual (instructions) | Encoder-decoder + multiplicative attention mechanism | ASKNAV (success rate = 52.26%) |
| [89] | 2018 | Embodied question answering | Visual + Textual (questions) | Attention mechanism + alignment | EQA-v1 (MR = 3.22) |
| [90] | 2019 | Embodied question answering | Visual + Textual (questions) | Feature concatenation | EQA-v1 (ACC = 61.45%) |
| [122] | 2019 | Embodied question answering | Visual + Textual (questions) | Alignment | VideoNavQA (ACC = 64.08%) |
| [125] | 2019 | Video question answering | Visual + Textual (questions) | Bilinear fusion | TDIUC (ACC = 88.20%); VQA-CP (ACC = 39.54%); VQA-v2 (ACC = 65.14%) |
| [126] | 2019 | Video question answering | Visual + Textual (questions) | Alignment | TGIF-QA (ACC = 53.8%); MSVD-QA (ACC = 33.7%); MSRVTT-QA (ACC = 33.00%); Youtube2Text-QA (ACC = 82.5%) |
| [127] | 2020 | Video question answering | Visual + Textual (questions) | Hierarchical Conditional Relation Networks (HCRN) | MSRVTT-QA (ACC = 35.6%); MSVD-QA (ACC = 36.1%) |
| [129] | 2019 | Video question answering | Visual + Textual (questions) | Dual LSTM + spatial and temporal attention | TGIF-QA (L2 distance = 4.22) |
| [159] | 2019 | Style transfer | Content + Style | Graph-based matching | Dataset of images from MSCOCO and WikiArt (PV = 33.45%) |
| [160] | 2017 | Style transfer | Content + Style | Hierarchical feature concatenation | Dataset of images from MSCOCO (PS = 0.54 s) |
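The two fusion schemes that recur most often in Table 2, feature concatenation (early fusion) and score-level combination (late fusion), differ mainly in *where* the modalities are merged. The sketch below illustrates this distinction with a generic two-branch biometric example; the feature dimensions, cosine-similarity matching score, and equal-weight score averaging are illustrative assumptions, not the implementation of any of the cited works.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative per-modality embeddings (e.g., a face branch and a body branch).
face_feat = rng.normal(size=128)
body_feat = rng.normal(size=128)
face_template = rng.normal(size=128)   # enrolled templates (assumed)
body_template = rng.normal(size=128)

# Early fusion / feature concatenation:
# modality features are merged before any classifier or matcher sees them.
early_fused = np.concatenate([face_feat, body_feat])   # shape (256,)

def cosine_score(a, b):
    """Cosine similarity used here as a stand-in matching score."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

# Late fusion (score-level):
# each modality is matched independently and only the scores are combined,
# here with a simple equal-weight average (an assumption for illustration).
face_score = cosine_score(face_feat, face_template)
body_score = cosine_score(body_feat, body_template)
late_fused_score = 0.5 * (face_score + body_score)

print(early_fused.shape, round(late_fused_score, 4))
```

Early fusion lets a downstream model learn cross-modal interactions directly from the joint feature vector, whereas late fusion keeps the per-modality pipelines independent and is often easier to apply when one modality is missing or unreliable.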