Table 1.
Comparison of existing multimodal fusion methods.
| Model | Architecture | Features | Fusion Strategy |
|---|---|---|---|
| Luo et al. [8] | CNN + RNN | voice and text | Fuse the audio and handcrafted low-level descriptors through simple vector concatenation. |
| Micucci et al. [9] | CNN | palmprint and hand geometry | Score-level fusion: sum the weighted scores from each modality. |
| Sell et al. [10] | DNN + CNN | face and voice | Convert the output scores generated by the unimodal verification systems into log-likelihood ratios. |
| PINS [11] | VGG-M | face and voice | Establish a joint embedding between faces and voices. |
| EmoRL-Net [12] | ResNet-18 | face and voice | Project the representations of two fully connected layers into a spherical space. |
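
To make two of the fusion strategies in Table 1 concrete, the following minimal sketch contrasts feature-level fusion by vector concatenation with score-level fusion by weighted sum. It is a generic illustration and not the implementation used in [8] or [9]; the function names, feature values, scores, and weights are all hypothetical.

```python
def concat_fusion(feat_a, feat_b):
    """Feature-level fusion: join two feature vectors end to end."""
    return list(feat_a) + list(feat_b)


def weighted_score_fusion(scores, weights):
    """Score-level fusion: weighted sum of per-modality match scores."""
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    return sum(w * s for w, s in zip(weights, scores))


# Hypothetical inputs, chosen only for illustration.
audio_features = [0.12, 0.40]          # assumed learned audio features
descriptor_features = [0.90]           # assumed handcrafted descriptor
fused_vector = concat_fusion(audio_features, descriptor_features)

modality_scores = [0.8, 0.6]           # assumed per-modality match scores
modality_weights = [0.5, 0.5]          # assumed weights summing to 1
fused_score = weighted_score_fusion(modality_scores, modality_weights)
```

Concatenation keeps every dimension of both inputs and leaves the weighting to later layers, whereas the weighted sum reduces each modality to a single score before combining them.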