Table 1.
No. | Study | Modalities | Subjects | Tasks | Fusion strategy | Fusion details | Performance comparison (uni-/multimodal) | Performance comparison (different fusion methods) |
---|---|---|---|---|---|---|---|---|
1 | Holste et al [16] | MRI images, clinical features | 17,046 samples of 5,248 patients | Classification of breast cancer. | Operation | Element-wise multiplication/element-wise summation/concatenation of learned unimodal features or direct features (sketched in the first code example below the table). | [AUC] Images: 0.860, clinical features: 0.806, all: 0.903 (p-value < 0.05) | [AUC] Learned feature concatenation: 0.903, sum: 0.902, multiplication: 0.896; probability fusion: 0.888 (p-value > 0.05) |
2 | Lu et al [18] | H&E images, clinical features | 1) 32,537 samples of 29,107 patients from CPTAC [22] and TCGA [23]. (Public) 2) 19,162 samples of 19,162 patients from an in-house dataset. 3) External testing set: 682 patients. | Classification of primary and metastatic tumors, and origin sites. | Operation | Concatenation of clinical features and the learned pathology image feature. | [Top-1 accuracy] Image: about 0.740, image + sex: about 0.808, image + sex + site: about 0.762 (metastatic tumors) | — |
3 | El-Sappagh et al [24] | MRI and PET images, neuropsychology data, cognitive scores, assessment data | 1,536 patients from ADNI [37]. (Public) | Classification of AD and its prodromal status. Regression of 4 cognitive scores. | Operation | Concatenation of learned static features and learned time-series unimodal features from five stacked CNN-BiLSTM networks. | [Accuracy] Five modalities: 92.62, four modalities: 90.45, three modalities: 89.40, two modalities: 89.09. (Regression performance was consistent with classification) | — |
4 | Yan et al [27] | Pathology images, clinical features | 3,764 samples of 153 patients. (Public) | Classification of breast cancer. | Operation | Concatenation of dimension-expanded clinical features and multi-scale image features. | [Accuracy] Images + clinical features: 87.9, clinical features: 78.5, images: 83.6 | — |
5 | Mobadersany et al [21] | H&E images, genomic data | 1,061 samples of 769 patients from TCGA-GBM and TCGA-LGG [23]. (Public) | Survival prediction of glioma tumors. | Operation | Concatenation of genomic biomarkers and learned pathology image features. | [C-index] Images: 0.745, gene: 0.746, images + gene: 0.774 (p-value < 0.05) | — |
6 | Yap et al [42] | Macroscopic images, dermatoscopic images, clinical features | 2,917 samples from ISIC [43]. (Public) | Classification of skin lesions. | Operation | Concatenation of clinical features and learned image features. | [AUC] Dermatoscopic (dsc) + macroscopic (macro) + clinical: 0.888, dsc + macro: 0.888, macro: 0.854, dsc: 0.871 | — |
7 | Silva et al [44] | Pathology images, mRNA, miRNA, DNA methylation (DNAm), copy number variation (CNV), clinical features | 11,081 patients of 33 cancer types from TCGA [23]. (Public) | Pan-cancer survival prediction. | Operation, Attention | Attention-weighted element-wise summation of unimodal features. | [C-index] Clinical: 0.742, mRNA: 0.763, miRNA: 0.717, DNAm: 0.761, CNV: 0.640, pathology: 0.562, clinical + mRNA + DNAm: 0.779, all six modalities: 0.768 | — |
8 | Kawahara et al [45] | Clinical images, dermoscopic images, clinical features | 1,011 samples. (Public) | Classification of skin lesions. | Operation | Concatenation of learned unimodal features. | [Accuracy] Clinical images + clinical features: 65.3, dermoscopic images + clinical features: 72.9, all modalities: 73.7 | — |
9 | Yoo et al [38] | MRI images, clinical features | 140 patients | Classification of brain lesion conversion. | Operation | Concatenation of learned image features and replicated, rescaled clinical features. | [AUC] Images: 71.8, images + clinical: 74.6 | — |
10 | Yao et al [28] | Pathology images, genomic data | 1) 106 patients from TCGA-LUSC. 2) 126 patients from TCGA-GBM [23]. (Public) | Survival prediction of lung cancer and brain cancer. | Operation, Subspace | Maximally correlated representations supervised by a CCA-based loss. | [C-index] Pathology images: 0.5540, molecular: 0.5989, images + molecular: 0.6287 (LUSC). Similar results on the other dataset. | [C-index] Proposed: 0.6287, SCCA [46]: 0.5518, DeepCorr + DeepSurv [17]: 0.5760 (LUSC). Similar results on the other dataset. |
11 | Cheerla et al [47] | Pathology images, genomic data, clinical features | 11,160 patients from TCGA [23] (nearly 43% of patients missing modalities). (Public) | Survival prediction of 20 types of cancer. | Operation, Subspace | Average of the learned unimodal features; a margin-based hinge loss regularized the similarity of the unimodal features. | [C-index] Clinical + miRNA + mRNA + pathology: 0.78, clinical + miRNA: 0.78, clinical + mRNA: 0.60, clinical + miRNA + mRNA: 0.78, clinical + miRNA + pathology: 0.78 | — |
12 | Li et al [48] | Pathology images, genomic data | 826 cases from TCGA-BRCA [23]. (Public) | Survival prediction of breast cancer. | Operation, Subspace | Concatenation of learned unimodal features regularized by a similarity loss (see the similarity-loss sketch below the table). | [C-index] Images + gene: 0.7571, gene: 0.6912, images: 0.6781 (p-value < 0.05) | — |
13 | Zhou et al [39] | CT images, laboratory indicators, clinical features | 733 patients | Classification of COVID-19 severity. | Operation, Subspace | Concatenation of learned unimodal features regularized by a similarity loss. | [Accuracy] Clinical features: 90.45, CT + clinical features: 96.36 | [Accuracy] Proposed: 96.36, proposed w/o similarity loss: 93.18 |
14 | Ghosal et al [65] | fMRI images from two paradigms, genomic data (single nucleotide polymorphisms, SNPs) | 1) 210 patients from the LIBD institute. 2) External testing set: 97 patients from the BARI institute. | Classification of neuropsychiatric disorders. | Operation, Subspace | Mean vector of learned unimodal features, supervised by a reconstruction loss. | — | [AUC] Proposed: 0.68, encoder + dropout: 0.62, encoder only: 0.59 (LIBD). The external testing set showed the same trend. |
15 | Cui et al [35] | H&E and MRI images, genomic data (DNA), demographic features | 962 patients (170 with complete modalities) from TCGA-GBMLGG [23] and BraTS [66]. (Public) | Survival prediction of glioma tumors. | Operation, Subspace | Mean vector of learned unimodal features with modality dropout, supervised by a reconstruction loss. | [C-index] Pathology: 0.7319, radiology: 0.7062, DNA: 0.7174, demographics: 0.7050, all: 0.7857 | [C-index] Proposed: 0.8053, pathomic fusion [67]: 0.7697, deep orthogonal [34]: 0.7624 |
16 | Schulz et al [20] | CT, MRI and H&E images, genomic data | 1) 230 patients from TCGA-KIRC [23]. (Public) 2) External testing set: 18 patients. | Survival prediction of clear-cell renal cell carcinoma. | Operation, Attention | Concatenation of learned unimodal features with an attention layer. | [C-index] Radiology: 0.7074, pathology: 0.7424, rad + path: 0.7791 (p-value < 0.05). The external testing set showed similar results. | — |
17 | Cui et al [68] | CT images, clinical features | 924 samples of 397 patients | Lymph node metastasis prediction of cell carcinoma. | Operation, Attention | Concatenation of learned unimodal features with category-wise contextual attention, used as the attributes of graph nodes. | [AUC] Images: 0.782, images + clinical: 0.823 | [AUC] Proposed: 0.823, logistic regression: 0.713, attention gated [74]: 0.6390, deep insight [75]: 0.739 |
18 | Li et al [31] | H&E images, clinical features | 3,990 cases | Lymph node metastasis prediction of breast cancer. | Operation, Attention | Attention-based multiple-instance learning (MIL) for the WSI-level representation, with attention coefficients learned from both modalities. | [AUC] Clinical: 0.8312, images: 0.7111, clinical + images: 0.8844 | [AUC] Proposed: 0.8844, concatenation: 0.8420, gating attention [67]: 0.8570, M3DN [70]: 0.8117 |
19 | Duanmu et al [59] | MRI images, genomic data, demographic features | 112 patients | Response prediction to neoadjuvant chemotherapy in breast cancer. | Operation, Attention | The learned non-image feature vector was channel-wise multiplied with the image features at multiple layers (see the channel-wise modulation sketch below the table). | [AUC] Images: 0.5758, images + non-image: 0.8035 | [AUC] Proposed: 0.8035, concatenation: 0.5871 |
20 | Guan et al [36] | CT images, clinical features | 553 patients | Classification of esophageal fistula risk. | Operation, Attention | Self-attention on the concatenation of learned unimodal features; all paths concatenated at the end. | [AUC] Images: 0.7341, clinical features [76]: 0.8196, images + clinical: 0.9119 | [AUC] Proposed: 0.9119, concatenation: 0.8953, Ye et al [77]: 0.7736, Chauhan et al [53]: 0.6885, Yap et al [42]: 0.8123 |
21 | Pölsterl et al [52] | MRI images, clinical features | 1,341 patients for diagnosis and 755 patients for prognosis. (Public) | Survival prediction and diagnosis of AD. | Operation, Attention | Dynamic affine transform module that scales and shifts image feature maps conditioned on clinical features. | [C-index] Images: 0.599, images + clinical: 0.748 | [C-index] Proposed: 0.748, FiLM [78]: 0.7012, Duanmu et al [59]: 0.706, concatenation: 0.729 |
22 | Wang et al [79] | X-ray images, free-text reports | 1) ChestX-ray14 dataset [80]. 2) 900 samples from a hand-labeled dataset. 3) 3,643 samples from OpenI [81]. (Partially public) | Classification of thorax diseases. | Operation, Attention | Multi-level attention on learned image and text features. | [Weighted accuracy] Text reports: 0.978, images: 0.722, images + text reports: 0.922 (ChestX-ray14). Similar results on the other two datasets. | — |
23 | Chen et al [67] | H&E images, genomic data (DNA and mRNA) | 1) 1,505 samples of 769 patients from TCGA-GBM/LGG. 2) 1,251 samples of 417 patients from TCGA-KIRC [23]. (Public) | Survival prediction and grade classification of glioma tumors and renal cell carcinoma. | Operation, Attention, Tensor fusion | Kronecker product of the unimodal features, with a gated attention layer to suppress unimportant features (see the Kronecker-fusion sketch below the table). | [C-index] Images (CNN): 0.792, images (GCN): 0.746, gene: 0.808, images + gene: 0.826 (GBM/LGG). Similar results on the other dataset. | [C-index] Proposed: 0.826, Mobadersany et al [21]: 0.781 (p-value < 0.05) (GBM/LGG). Similar results on the other dataset. |
24 | Wang et al [29] | Pathology images, genomic data | 345 patients from TCGA [23]. (Public) | Survival prediction of breast cancer. | Operation, Tensor fusion | Inter-modal and intra-modal features produced by bilinear layers. | [C-index] Gene: 0.695, images: 0.578, gene + images: 0.723 | [C-index] Proposed: 0.723, LASSO-Cox: 0.700, inter-modal features: 0.708, DeepCorrSurv [28]: 0.684, MDNNMD [82]: 0.704, concatenation: 0.703 |
25 | Braman et al [34] | T1 and T2 MRI images, genomic data (DNA), clinical features | 176 patients from TCGA-GBM/LGG [23] and BraTS [66]. (Public) | Survival prediction of brain glioma tumors. | Operation, Attention, Tensor fusion | Extended the fusion method of [67] to four modalities; an orthogonal loss encouraged the learning of complementary unimodal features. | [C-index] Radiology: 0.718, pathology: 0.715, gene: 0.716, clinical: 0.702, path + clin: 0.690, all: 0.785 | [C-index] Proposed: 0.785, pathomic fusion [67]: 0.775, concatenation: 0.76 |
26 | Cao et al [41] | fMRI images, clinical features | 871 patients from ABIDE [83]. (Public) | Classification of ASD and healthy controls. | Graph, Operation | Node features were composed of image features, while edge weights were computed from image and non-image features (see the population-graph sketch below the table). | [Accuracy] Sites + gender + age + FIQ: 0.7456, sites + age + FIQ: 0.7534, sites + age: 0.7520 | [Accuracy] Proposed: 0.737, Parisot et al [40]: 0.704 |
27 | Parisot et al [40] | fMRI images, clinical features | 1) 871 patients from ABIDE [83]. 2) 675 subjects from ADNI [37]. (Public) | Classification of ASD and healthy controls. Prediction of conversion to AD. | Graph, Operation | Node features were composed of image features, while edge weights were computed from image and non-image features. | [AUC] Images + sex + APOE4: 0.89, images + sex + APOE4 + age: 0.85 (ADNI dataset) | [AUC] Proposed: 0.89, GCN: 0.85, MLP (concatenation): 0.74 (ADNI dataset) |
28 | Chen et al [26] | H&E images, genomic data | 437, 1,022, 1,011, 515, and 538 patients from TCGA-BLCA, TCGA-BRCA, TCGA-GBMLGG, TCGA-LUAD, and TCGA-UCEC, respectively [23]. (Public) | Survival prediction of five kinds of tumors. | Operation, Attention | Co-attention mapping between WSIs and genomic features. | [C-index] Gene: 0.527, pathology images: 0.614, all: 0.653 (overall prediction of five tumors) | [C-index] Proposed: 0.653, concatenation: 0.634, bilinear pooling: 0.621 (overall prediction of five tumors) |
29 | Zhou et al [84] | PET images, MRI images, genomic data (SNP) | 805 patients from ADNI [37] (360 with complete modalities). (Public) | Classification of AD and its prodromal status. | Operation | Learned features of every pair of modalities were concatenated at the first fusion stage, and those of all three modalities at the second stage. | [Accuracy] MRI + PET + SNP > MRI + PET > MRI > MRI + SNP > PET + SNP > PET > SNP (four-class classification) | [Accuracy] Proposed > MKL [85] > SAE [86] (direct concatenation of learned unimodal features) |
30 | Huang et al [87] | CT images, clinical features, lab test results | 1,837 studies of 1,794 patients | Classification of the presence of pulmonary embolism. | Operation | Compared seven kinds of fusion, including early, intermediate and late fusion; late elastic fusion performed the best. | [AUC] Images: 0.791, clinical + lab tests: 0.911, all: 0.947 | [AUC] Early fusion: 0.899, late fusion: 0.947, joint fusion: 0.893 |
31 | Lu et al [69] | Pathology images, genomic data | 736 patients from TCGA-GBM/LGG [23]. (Public) | Survival prediction and grade classification of glioma tumors. | Operation, Attention | Proposed a multimodal transformer encoder for co-attention fusion. | [C-index] Images: 0.7385, gene: 0.7979, images + gene: 0.8266 (same trend for the classification task) | [C-index] Proposed: 0.8266, pathomic fusion [67]: 0.7994 |
32 | Cai et al [51] | Camera/dermatoscopic images, clinical features | 1) 10,015 cases from ISIC [43]. (Public) 2) 760 cases from a private dataset. | Classification of skin lesions. | Operation, Attention | Two multi-head cross-attention modules to interactively fuse information from images and metadata (see the cross-attention sketch below the table). | [AUC] Images: 0.944, clinical features: 0.964, images + clinical: 0.974 (private dataset) | [AUC] Proposed: 0.974, MetaBlock [88]: 0.968, concatenation: 0.964 (private dataset) |
33 | Jacenkow et al [72] | X-ray images, free-text reports | 210,538 cases from MIMIC-CXR [89]. (Public) | Classification of chest diseases. | Attention | Fine-tuned unimodally pre-trained BERT models on a multimodal task. | [Accuracy] Images: 86.0, text: 85.1, images + text: 87.7 | [Accuracy] Proposed: 87.7, attentive [90]: 86.8 |
34 | Li et al [58] | X-ray images, free-text reports | 1) 222,713 cases from MIMIC-CXR [89]. 2) 3,684 cases from OpenI [81]. (Public) | Classification of chest diseases. | Attention | Compared different pre-trained vision-and-language transformers. | [AUC] Text: 0.974, images + text: 0.987 (MIMIC-CXR) | [AUC] VisualBERT [91, 92]: 0.987, LXMERT [93]: 0.984, UNITER [94]: 0.985, PixelBERT [95]: 0.953 (MIMIC-CXR) |
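To make the fusion strategies in Table 1 concrete, the sketches below illustrate the main families. The first covers operation-based fusion as compared in row 1: element-wise summation, element-wise multiplication, and concatenation of two learned unimodal embeddings. This is a minimal PyTorch sketch; the function name, the 256-dimensional feature size, and the batch size are illustrative assumptions, not details from the cited papers.

```python
# Minimal sketch of operation-based fusion (cf. row 1). All names and
# dimensions here are illustrative assumptions, not from the cited papers.
import torch

def operation_fusion(img_feat: torch.Tensor,
                     clin_feat: torch.Tensor,
                     mode: str = "concat") -> torch.Tensor:
    """Fuse two same-sized unimodal embeddings of shape (batch, dim)."""
    if mode == "sum":       # element-wise summation
        return img_feat + clin_feat
    if mode == "mul":       # element-wise multiplication (Hadamard product)
        return img_feat * clin_feat
    if mode == "concat":    # concatenation doubles the feature dimension
        return torch.cat([img_feat, clin_feat], dim=-1)
    raise ValueError(f"unknown fusion mode: {mode}")

img = torch.randn(8, 256)    # e.g. a CNN embedding of an MRI image
clin = torch.randn(8, 256)   # clinical features projected to the same size
fused = operation_fusion(img, clin, "concat")   # shape: (8, 512)
```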
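Rows 12 and 13 add a subspace constraint on top of concatenation: an auxiliary loss pushes the unimodal embeddings toward a shared representation. The cited papers each define their own similarity loss; the sketch below uses a cosine-similarity penalty as one plausible instance, and the weight `lam` is an assumption.

```python
# Sketch of concatenation with a similarity regularizer (cf. rows 12-13).
# The cosine-based penalty and the weight `lam` are assumptions; the cited
# papers each define their own similarity loss.
import torch
import torch.nn.functional as F

def fuse_with_similarity_loss(feat_a, feat_b, task_loss, lam=0.1):
    fused = torch.cat([feat_a, feat_b], dim=-1)
    # Penalize dissimilarity so the two embeddings share a subspace:
    # 1 - mean cosine similarity is 0 when the embeddings align perfectly.
    sim_loss = 1.0 - F.cosine_similarity(feat_a, feat_b, dim=-1).mean()
    return fused, task_loss + lam * sim_loss
```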
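Row 19 injects the non-image modality multiplicatively (row 21 adds a shift term in the same spirit): the non-image vector is projected to one scale per image channel and multiplied into the CNN feature maps. In the sketch below, the sigmoid gate and all dimensions are assumptions.

```python
# Sketch of channel-wise multiplicative conditioning (cf. rows 19 and 21):
# a non-image feature vector gates the channels of a CNN feature map.
import torch

class ChannelWiseModulation(torch.nn.Module):
    def __init__(self, non_image_dim: int, num_channels: int):
        super().__init__()
        self.proj = torch.nn.Linear(non_image_dim, num_channels)

    def forward(self, feat_maps, non_image):
        # feat_maps: (batch, C, H, W); non_image: (batch, non_image_dim)
        scale = torch.sigmoid(self.proj(non_image))    # one gate per channel
        return feat_maps * scale[:, :, None, None]     # broadcast over H, W

mod = ChannelWiseModulation(non_image_dim=16, num_channels=64)
out = mod(torch.randn(2, 64, 32, 32), torch.randn(2, 16))  # (2, 64, 32, 32)
```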
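Rows 23-25 model explicit feature interactions with a Kronecker (outer) product of the unimodal embeddings; appending a constant 1 to each embedding keeps the unimodal terms alongside the pairwise interactions. The sketch below covers two modalities and omits the gated attention layer of [67]; the dimensions are assumptions.

```python
# Sketch of Kronecker-product (tensor) fusion for two modalities
# (cf. rows 23-25). Dimensions are illustrative assumptions.
import torch

def kronecker_fusion(h1: torch.Tensor, h2: torch.Tensor) -> torch.Tensor:
    ones = torch.ones(h1.size(0), 1, device=h1.device)
    h1 = torch.cat([h1, ones], dim=-1)                   # (batch, d1 + 1)
    h2 = torch.cat([h2, ones], dim=-1)                   # (batch, d2 + 1)
    outer = torch.bmm(h1.unsqueeze(2), h2.unsqueeze(1))  # (batch, d1+1, d2+1)
    return outer.flatten(start_dim=1)   # unimodal + pairwise interaction terms

fused = kronecker_fusion(torch.randn(4, 32), torch.randn(4, 32))  # (4, 1089)
```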
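Rows 26 and 27 fuse modalities through the population graph itself: each subject is a node carrying image features, and edge weights mix an image-feature similarity with agreement on non-image phenotypes such as acquisition site and age. The sketch below shows one plausible edge-weight computation; the cosine similarity, the phenotype choice, and the `age_tol` threshold are assumptions.

```python
# Sketch of population-graph edge weights mixing image-feature similarity
# with non-image phenotypes (cf. rows 26-27).
import torch
import torch.nn.functional as F

def population_edge_weights(img_feats, site_ids, ages, age_tol=2.0):
    x = F.normalize(img_feats, dim=1)
    sim = x @ x.T                              # image-feature similarity
    same_site = (site_ids[:, None] == site_ids[None, :]).float()
    close_age = ((ages[:, None] - ages[None, :]).abs() < age_tol).float()
    return sim * (same_site + close_age)       # (n_subjects, n_subjects)

W = population_edge_weights(torch.randn(6, 128),
                            torch.tensor([0, 0, 1, 1, 2, 2]),
                            torch.tensor([61.0, 63.0, 70.0, 59.0, 66.0, 65.0]))
```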
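Several attention-based rows (e.g., 28, 31, and 32) let one modality query the other rather than concatenating pooled vectors. A minimal cross-attention sketch using torch.nn.MultiheadAttention is shown below; the token counts, embedding size, and single-block design are assumptions rather than details of any cited architecture.

```python
# Sketch of cross-attention fusion (cf. rows 28, 31, and 32): tokens of one
# modality act as queries over tokens of the other. Sizes are assumptions.
import torch

attn = torch.nn.MultiheadAttention(embed_dim=256, num_heads=4, batch_first=True)

img_tokens = torch.randn(2, 49, 256)    # e.g. 7x7 CNN feature-map patches
text_tokens = torch.randn(2, 20, 256)   # e.g. embedded report tokens

# Each image token is re-expressed as a weighted mixture of text tokens;
# a second block with the roles swapped makes the fusion bidirectional.
fused, weights = attn(query=img_tokens, key=text_tokens, value=text_tokens)
print(fused.shape)                       # torch.Size([2, 49, 256])
```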