Table 1.
No. | Study | Modalities | Subjects | Tasks | Fusion strategy | Fusion details | Performance comparison (uni-/multimodal) | Performance comparison (different fusion methods) |
1 | Holste et al [16] | MRI images, clinical features | 17,046 samples of 5,248 patients | Classification of breast cancer | Operation | Element-wise multiplication, element-wise summation, or concatenation of learned unimodal features or direct features. | [AUC] Images: 0.860, clinical features: 0.806, all: 0.903 (p-value < 0.05) | [AUC] Learned feature concatenation: 0.903, sum: 0.902, multiplication: 0.896; probability fusion: 0.888 (p-value > 0.05) |
2 | Lu et al [18] | H&E images, clinical features | 1) 32,537 samples of 29,107 patients from CPTAC [22] and TCGA [23]. (Public) 2) 19,162 samples of 19,162 patients from an in-house dataset. 3) External testing set: 682 patients. | Classification of primary and metastatic tumors, and origin sites. | Operation | Concatenation of clinical features and the learned pathology image feature. | [Top-1 accuracy] Image: about 0.740, image + sex: about 0.808, image + sex + site: about 0.762 (Metastatic tumors) | — |
3 | El-Sappagh et al [24] | MRI and PET images, neuropsychology data, cognitive scores, assessment data | 1,536 patients from ADNI [37]. (Public) | Classification of AD and prodromal status. Regression of 4 cognitive scores. | Operation | Concatenation of learned static features and learned time-series unimodal features from five stacked CNN-biLSTM. | [Accuracy] Five modalities: 92.62, four modalities: 90.45, three modalities: 89.40, two modalities: 89.09. (Regression performance is consistent with the classification) | — |
4 | Yan et al [27] | Pathology images, clinical features | 3,764 samples of 153 patients. (Public) | Classification of breast cancer. | Operation | Concatenation of increased-dimensional clinical features and multi-scale image features. | [Accuracy] Image + clinical features: 87.9, clinical features: 78.5, images: 83.6 | — |
5 | Mobadersany et al [21] | H&E images, genomic data | 1,061 samples of 769 patients from the TCGA-GBM and TCGA-LGG [23]. (Public) | Survival prediction of glioma tumors | Operation | Concatenation of genomic biomarkers and learned pathology image features. | [C-index] Image: 0.745, gene: 0.746, images + gene: 0.774 (p-value < 0.05) | — |
6 | Yap et al [42] | Macroscopic images, dermatoscopic images, clinical features | 2,917 samples from ISIC [43]. (Public) | Classification of skin lesions. | Operation | Concatenation of clinical features and learned image features. | [AUC] Dsc + macro + clinical: 0.888, dsc + macro: 0.888, macro: 0.854, dsc: 0.871 | — |
7 | Silva et al [44] | Pathology images, mRNA, miRNA, DNA, copy number variation (CNV), clinical features | 11,081 patients of 33 cancer types from TCGA [23]. (Public) | Pancancer survival prediction. | Operation Attention | Attention weighted element-wise summation of unimodal features. | [C-Index] Clinical: 0.742, mRNA: 0.763, miRNA: 0.717, DNA: 0.761, CNV: 0.640, pathology: 0.562, clinical + mRNA + DNAm: 0.779, all six modalities: 0.768 | — |
8 | Kawahara et al [45] | Clinical images, dermoscopic images, clinical features | 1,011 samples. (Public) | Classification of skin lesions. | Operation | Concatenation of learned unimodal features. | [Accuracy] Clinical images + clinical features: 65.3, dermoscopic images + clinical features: 72.9, all modalities: 73.7 | — |
9 | Yoo et al [38] | MRI images, clinical features | 140 patients | Classification of brain lesion conversion. | Operation | Concatenation of learned image features and the replicated and rescaled clinical features. | [AUC] Images: 71.8, images + clinical: 74.6 | — |
10 | Yao et al [28] | Pathology images, genomic data | 1) 106 patients from TCGA-LUSC. 2) 126 patients from the TCGA-GBM [23]. (Public) | Survival prediction of lung cancer and brain cancer. | Operation Subspace | Maximum correlated representation supervised by the CCA-based loss. | [C-index] Pathology images: 0.5540, molecular: 0.5989, images + molecular: 0.6287. (LUSC). Similar results on other two datasets. | [C-index] Proposed: 0.6287, SCCA [46]: 0.5518, DeepCorr + DeepSurv [17]: 0.5760 (LUSC). Similar results on other two datasets. |
11 | Cheerla et al [47] | Pathology images, genomic data, clinical features | 11,160 patients from TCGA [23] (nearly 43% of patients are missing modalities). (Public) | Survival prediction of 20 types of cancer. | Operation Subspace | The average of learned unimodal features, while a margin-based hinge loss was used to regularize the similarity of learned unimodal features. | [C-index] Clinical + miRNA + mRNA + pathology: 0.78, clinical + miRNA: 0.78, clinical + mRNA: 0.60, clinical + miRNA + mRNA: 0.78, clinical + miRNA + pathology: 0.78 | — |
12 | Li et al [48] | Pathology images, genomic data | 826 cases from the TCGA-BRCA [23]. (Public) | Survival prediction of breast cancer. | Operation Subspace | Concatenation of the learned unimodal features regularized by a similarity loss. | [C-index] Images + gene: 0.7571, gene: 0.6912, image: 0.6781. (p-value < 0.05) | — |
13 | Zhou et al [39] | CT images, laboratory indicators, clinical features | 733 patients | Classification of COVID-19 severity. | Operation Subspace | Concatenation of the learned unimodal features regularized by a similarity loss. | [Accuracy] Clinical features: 90.45, CT + clinical features: 96.36 | [Accuracy] Proposed: 96.36, proposed w/o similarity loss: 93.18 |
14 | Ghosal et al [65] | Images from two fMRI paradigms, genomic data (single nucleotide polymorphisms (SNP)) | 1) 210 patients from the LIBD institute. 2) External testing set: 97 patients from BARI institute. | Classification of neuropsychiatric disorders. | Operation Subspace | Mean vector of learned unimodal features, supervised by the reconstruction loss. | — | [AUC] Proposed: 0.68, encoder + dropout: 0.62, encoder only: 0.59 (LIBD). The external test set showed the same trend of results. |
15 | Cui et al [35] | H&E and MRI images, genomic data (DNA), demographic features | 962 patients (170 with complete modalities) from TCGA-GBMLGG [23] and BraTS [66]. (Public) | Survival prediction of glioma tumors. | Operation Subspace | Mean vector of learned unimodal features with modality dropout, supervised by the reconstruction loss. | [C-index] Pathology: 0.7319, radiology: 0.7062, DNA: 0.7174, demographics: 0.7050, all: 0.7857 | [C-index] Proposed: 0.8053, pathomic fusion [67]: 0.7697, deep orthogonal [34]: 0.7624 |
16 | Schulz et al [20] | CT, MRI and H&E images, genomic data | 1) 230 patients from the TCGA-KIRC [23]. (Public) 2) External testing set: 18 patients. | Survival prediction of clear-cell renal cell carcinoma. | Operation Attention | Concatenation of learned unimodal features with an attention layer. | [C-index] Radiology: 0.7074, pathology: 0.7424, rad + path: 0.7791. (p-value < 0.05). The external test set showed similar results. | — |
17 | Cui et al [68] | CT images, clinical features | 924 samples of 397 patients | Lymph node metastasis prediction of cell carcinoma. | Operation Attention | The concatenation of learned unimodal features with a category-wise contextual attention was used as the attributes of graph nodes. | [AUC] Images: 0.782, images + clinical: 0.823. | [AUC] Proposed: 0.823, logistic regression: 0.713, attention gated [74]: 0.6390, deep insight [75]: 0.739 |
18 | Li et al [31] | H&E images, clinical features | 3,990 cases | Lymph node metastasis prediction of breast cancer. | Operation Attention | Attention-based MIL for WSI-level representation, whose attention coefficients were learned from both modalities. | [AUC] Clinical: 0.8312, image: 0.7111, clinical and image: 0.8844 | [AUC] Proposed: 0.8844, concatenation: 0.8420, gating attention [67]: 0.8570, M3DN [70]: 0.8117 |
19 | Duanmu et al [59] | MRI images, genomic data, demographic features. | 112 patients | Response prediction to neoadjuvant chemotherapy in breast cancer. | Operation Attention | The learned feature vector of non-image modality was multiplied in a channel-wise way with the image features at multiple layers. | [AUC] Image: 0.5758, image and non-image: 0.8035 | [AUC] Proposed: 0.8035, concatenation: 0.5871 |
20 | Guan et al [36] | CT images, clinical features | 553 patients | Classification of esophageal fistula risk. | Operation Attention | Self-attention on the concatenation of learned unimodal features. Concatenation of all paths in the end. | [AUC] Images: 0.7341, clinical features [76]: 0.8196, images + clinical: 0.9119 | [AUC] Proposed: 0.9119, concatenation: 0.8953, Ye et al [77]: 0.7736, Chauhan et al [53]: 0.6885, Yap et al [42]: 0.8123 |
21 | Pölsterl et al [52] | MRI images, clinical features. | 1,341 patients for diagnosis and 755 patients for prognosis. (Public) | Survival prediction and diagnosis of AD. | Operation Attention | Dynamic affine transform module. | [C-index] Images: 0.599, images + clinical: 0.748 | [C-index] Proposed: 0.748, FiLM [78]: 0.7012, Duanmu et al [59]: 0.706, concatenation: 0.729 |
22 | Wang et al [79] | X-ray images, free-text reports. | 1) ChestX-ray14 dataset [80]. 2) 900 samples from a hand-labeled dataset. 3) 3,643 samples from the OpenI [81]. (Partially public) | Classification of thorax disease. | Operation Attention | Multi-level attention for learned features of image and text. | [Weighted accuracy] Text reports: 0.978, images: 0.722, images + text reports: 0.922. (ChestX-ray14). Similar results on the other two datasets. | — |
23 | Chen et al [67] | H&E images, genomic data (DNA and mRNA) | 1) 1,505 samples of 769 patients from TCGA-GBM/LGG. 2) 1,251 samples of 417 patients from TCGA-KIRC [23]. (Public) | Survival prediction and grade classification of glioma tumors and renal cell carcinoma. | Operation Attention Tensor Fusion | Kronecker product of different modalities, with a gated-attention layer used to regularize the unimportant features. | [C-index] Images (CNN): 0.792, images (GCN): 0.746, gene: 0.808, images + gene: 0.826. (GBM/LGG) Similar results on the other dataset. | [C-index] Proposed: 0.826, Mobadersany et al [21]: 0.781. (p-value < 0.05) (GBM/LGG). Similar results on the other dataset. |
24 | Wang et al [29] | Pathology images, genomic data | 345 patients from TCGA [23]. (Public) | Survival prediction of breast cancer. | Operation Tensor Fusion | Inter-modal features and intra-modal features produced by the bilinear layers. | [C-index] Gene: 0.695, images: 0.578, gene + images: 0.723 | [C-index] Proposed: 0.723, LASSO-Cox: 0.700, inter-modal features: 0.708, DeepCorrSurv [28]: 0.684, MDNNMD [82]: 0.704, concatenation: 0.703 |
25 | Braman et al [34] | T1 and T2 MRI images, genomic data (DNA), clinical features | 176 patients from TCGA-GBM/LGG [23] and BraTS [66]. (Public) | Survival prediction of brain glioma tumors. | Operation Attention Tensor Fusion | Extended the fusion method in [67] to four modalities, and an orthogonal loss was added to encourage the learning of complementary unimodal features. | [C-index] Radiology: 0.718, pathology: 0.715, gene: 0.716, clinical: 0.702, path + clin: 0.690, all: 0.785 | [C-index] Proposed: 0.785, pathomic fusion [67]: 0.775, concatenation: 0.76 |
26 | Cao et al [41] | fMRI images, clinical features | 871 patients from ABIDE [83]. (Public) | Classification of ASD and healthy controls. | Graph Operation | Node features were composed of image features, while the edge weights were calculated from image and non-image features. | [Accuracy] Sites + gender + age + FIQ: 0.7456, sites + age + FIQ: 0.7534, sites + age: 0.7520 | [Accuracy] Proposed: 0.737, Parisot et al [40]: 0.704 |
27 | Parisot et al [40] | fMRI images, clinical features | 1) 871 patients from ABIDE [83]. 2) 675 subjects from ADNI [37]. (Public) | Classification of ASD and healthy controls. Prediction of conversion to AD. | Graph Operation | Node features were composed of image features, while the edge weights were calculated from image and non-image features. | [AUC] Image + sex + APOE4: 0.89, image + sex + APOE4 + age: 0.85 (ADNI dataset) | [AUC] Proposed: 0.89, GCN: 0.85, MLP (Concatenation): 0.74 (ADNI dataset) |
28 | Chen et al [26] | H&E images, genomic data | 1)-5) 437, 1,022, 1,011, 515 and 538 patients from TCGA-BLCA, TCGA-BRCA, TCGA-GBMLGG, TCGA-LUAD and TCGA-UCEC, respectively [23]. (Public) | Survival prediction of five kinds of tumors. | Operation Attention | Co-attention mapping between WSIs and genomic features. | [C-index] Gene: 0.527, pathology images: 0.614, all: 0.653 (overall prediction of five tumors) | [C-index] Proposed: 0.653, concatenation: 0.634, bilinear pooling: 0.621. (Overall prediction of five tumors) |
29 | Zhou et al [84] | PET images, MRI images, genomic data (SNP) | 805 patients from ADNI [37] (360 with complete multimodalities). (Public) | Classification of AD and its prodromal status | Operation | Learned features of every two modalities and all three modalities were concatenated at the 1st and 2nd fusion stage separately. | [Accuracy] MRI + PET + SNP > MRI + PET > MRI > MRI + SNP > PET + SNP > PET > SNP (Four-class classification) | [Accuracy] Proposed > MKL [85] > SAE [86] (Direct concatenation of learned unimodal features) |
30 | Huang et al [87] | CT images, clinical features, and lab test results | 1,837 studies from 1,794 patients | Classification of the presence of pulmonary embolism | Operation | Compared seven kinds of fusion, including early, intermediate and late fusion. Late elastic fusion performed the best. | [AUC] Images: 0.791, clinical and lab test: 0.911, all: 0.947. | [AUC] Early fusion: 0.899, late fusion: 0.947, joint fusion: 0.893. |
31 | Lu et al [69] | Pathology images, genomic data | 736 patients from TCGA-GBM/LGG [23]. (Public) | Survival prediction and grade classification of glioma tumors. | Operation Attention | Proposed a multimodal transformer encoder for co-attention fusion. | [C-index] Images: 0.7385, gene: 0.7979, images + gene: 0.8266 (Same trend for the classification task) | [C-index] Proposed: 0.8266, pathomic fusion [67]: 0.7994 |
32 | Cai et al [51] | Camera/dermatoscopic images, clinical features | 1) 10,015 cases from ISIC [43]. (Public) 2) 760 cases from a private dataset. | Classification of skin wounds | Operation Attention | Two multi-head cross-attention modules to interactively fuse information from images and metadata. | [AUC] Images: 0.944, clinical features: 0.964, images + clinical: 0.974 (Private dataset) | [AUC] Proposed: 0.974, metaBlock [88]: 0.968, concatenation: 0.964 (Private dataset) |
33 | Jacenkow et al [72] | X-ray images, free-text reports | 210,538 cases from MIMIC-CXR [89]. (Public) | Classification of chest diseases | Attention | Fine-tuned unimodally pre-trained BERT models on a multimodal task. | [ACC] Images: 86.0, text: 85.1, images + text: 87.7 | [ACC] Proposed: 87.7, attentive [90]: 86.8 |
34 | Li et al [58] | X-ray images, free-text reports | 1) 222,713 cases from MIMIC-CXR [89]. 2) 3,684 cases from OpenI [81]. (Public) | Classification of chest diseases | Attention | Used different pre-trained visual-text transformers. | [AUC] Text: 0.974, image + text: 0.987 (MIMIC-CXR) | [AUC] VisualBERT [91, 92]: 0.987, LXMERT [93]: 0.984, UNITER [94]: 0.985, PixelBERT [95]: 0.953 (MIMIC-CXR) |
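
The operation-based entries in the "Fusion details" column (concatenation, element-wise summation, element-wise multiplication, Kronecker product, attention-weighted summation) can be illustrated on toy feature vectors. The following is a minimal NumPy sketch, not the implementation of any listed study: the function names, the toy vectors `img` and `clin`, the softmax weighting, and the choice to append a constant 1 before the Kronecker product are all illustrative assumptions.

```python
import numpy as np


def concat_fusion(a, b):
    # Concatenation: the fused vector keeps every unimodal feature side by side.
    return np.concatenate([a, b], axis=-1)


def sum_fusion(a, b):
    # Element-wise summation: requires both vectors to share one dimension.
    return a + b


def product_fusion(a, b):
    # Element-wise multiplication: also requires matching dimensions.
    return a * b


def kronecker_fusion(a, b):
    # Kronecker (outer) product of the two vectors, flattened to 1-D.
    # Assumption: a constant 1 is appended to each vector so the result
    # also contains the original unimodal terms next to the pairwise products.
    a1 = np.append(a, 1.0)
    b1 = np.append(b, 1.0)
    return np.kron(a1, b1)


def attention_weighted_sum(features, scores):
    # features: (n_modalities, dim); scores: (n_modalities,)
    # Assumption: softmax over per-modality scores gives the fusion weights.
    weights = np.exp(scores - scores.max())
    weights = weights / weights.sum()
    return (weights[:, None] * features).sum(axis=0)


# Hypothetical learned features for one subject (same dimension for simplicity).
img = np.array([1.0, 2.0, 3.0])    # e.g. an image embedding
clin = np.array([0.5, 0.0, -1.0])  # e.g. an embedding of clinical features

fused_concat = concat_fusion(img, clin)   # shape (6,)
fused_sum = sum_fusion(img, clin)         # shape (3,)
fused_prod = product_fusion(img, clin)    # shape (3,)
fused_kron = kronecker_fusion(img, clin)  # shape (16,)
fused_attn = attention_weighted_sum(np.stack([img, clin]), np.array([0.0, 0.0]))
```

Under these assumptions, concatenation grows the fused dimension additively (3 + 3 = 6), whereas the Kronecker product grows it multiplicatively ((3 + 1) × (3 + 1) = 16), which is one reason tensor-fusion approaches are often paired with gating or dimensionality reduction.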