Abstract
Multi-modal learning (e.g., integrating pathological images with genomic features) tends to improve the accuracy of cancer diagnosis and prognosis compared to learning with a single modality. However, missing data is a common problem in clinical practice, i.e., not every patient has all modalities available. Most previous works simply discard samples with missing modalities, which loses the information in these data and increases the likelihood of overfitting. In this work, we generalize multi-modal learning for cancer diagnosis, using histological images and genomic data, so that it can handle missing data. Our integrated model can utilize all available data from patients with both complete and partial modalities. Experiments on the public TCGA-GBM and TCGA-LGG datasets show that data with missing modalities can contribute to multi-modal learning and improve the model performance in grade classification of glioma.
Keywords: Multi-modal learning, missing data, deep learning
1. INTRODUCTION
Histopathological images are generally recognized as the “gold standard” diagnostic images in cancer diagnosis and prognosis. However, in addition to histopathological images, other modalities (e.g., genomic information, clinical features) of the same patients commonly assist oncologists in making decisions in clinical practice. Similarly, for deep-learning models, the shared and independent information from multiple modalities has the potential to benefit model learning. Many recent works [1]-[11] have shown that integrating pathological images with other heterogeneous data sources leads to greater accuracy and robustness than using pathological images alone. Both conventional and deep-learning-based techniques have been developed to exploit the complementary and correlated information of the multiple modalities. Traditional canonical correlation analysis (CCA) based methods are popular for detecting the correspondence between genes and pathological images [1] [2] [3], and CCA was introduced into deep learning as a supervised loss by Yao et al. [4]. Mobadersany et al. [5] and Yao et al. [4] concatenated low-dimensional features of pathological images and genomic features. The integration methods in [3] [7] [8] learn a shared feature space of the different modalities with a similarity loss. More recently, Chen et al. [9] and Wang et al. [10] extracted pairwise feature interactions with the Kronecker product and bilinear layers.
Most previous works focused on using only the data with complete modalities. However, such an assumption might not hold, or at least be limiting, for more realistic clinical data. For instance, the number of patients in a clinical study is often limited, with no guarantee that every patient contributed the same types of modalities to the routinely collected dataset. If the data with missing modalities are simply discarded in multi-modal learning, information is lost and the risk of overfitting increases. Moreover, the usage of the model may be limited in the inference phase because it cannot generate a prediction for data with a missing modality. Limited work has been done to deal with the missing modality problem in multi-modal learning for cancer diagnosis. Cheerla et al. [6] and Vale-Silva et al. [11] drop the entire feature vector corresponding to the missing modality and scale up the weights of the other modalities correspondingly, but they simply use a weighted average strategy to fuse the features from different modalities.
In this study, we introduce and explore the missing modality problem in multi-modal learning for cancer diagnosis. Pathomic Fusion [9] is a recently proposed multi-modal learning pipeline for cancer diagnosis using pathological and genomic features. It extracts features from the different modalities with multiple supervised methods, but it cannot accept data in which either the pathological images or the genomic data are missing. To investigate the potential of the extracted features when utilizing the data with missing modalities, we integrated the multi-modal fusion strategy from CPM-Nets [12], which can handle arbitrary modality-missing patterns and learn a structured, complete joint representation for classification. Our experiments investigate whether the cancer diagnosis accuracy of the integrated model can be improved by using the data with missing modalities.
2. METHODS
Feature extraction and fusion across multiple modalities are two typical components of multi-modal learning. Different types of modalities, e.g., histopathological images and molecular data, are converted into informative feature vectors by different methods before being integrated for the final prediction. In this work, the feature extraction method from Pathomic Fusion and the feature fusion method from CPM-Nets are integrated. The entire pipeline of the integrated model is shown in Figure 1.
Figure 1.

The pipeline of the integrated multi-modal learning model. The feature extraction stage is adapted from Pathomic Fusion, while the multi-modal feature fusion stage is adapted from CPM-Nets. The CNN, GCN, and SNN are first trained separately, supervised by the grade labels. Then, the neurons at the last hidden layer of each network are extracted as a low-dimensional feature. In the feature fusion stage, the hidden representation is randomly initialized at first. Three MLP networks then separately learn to reconstruct the low-dimensional features from the CNN, GCN, and SNN. The hidden representation and the trainable parameters of the MLP networks are learned alternately from the reconstruction losses. In addition, a clustering-like classification loss is used to classify samples by calculating the similarity of hidden representations.
2.1. Feature Extraction
At the image feature extraction stage, both a convolutional neural network (CNN) and a graph convolutional neural network (GCN) are used to learn pathological image features. The GCN aims to capture cell-to-cell interactions and the structured environment of cell graphs. The CNN is initialized with a VGG19 network pre-trained on ImageNet. For the GCN, a cell graph of each ROI is pre-defined [9]: the segmented nuclei are the nodes of the graph, and the edges are defined by the similarity of the features of these nuclei. Through the GCN, the handcrafted cell features and the deep features extracted by contrastive learning for each node are aggregated from their neighbors. By pooling over the aggregated features of all nodes, a representation vector of the entire graph is obtained. As for the genomic features, the most informative genes are selected first. Then, a self-normalizing network (SNN) [13] is employed to mitigate overfitting and learn low-dimensional genomic features. After training the above three networks separately with labels, their last hidden layers are extracted as three kinds of low-dimensional features. According to Pathomic Fusion [9], although the features from the pathological images and genes can predict the cancer diagnosis separately, their integration achieves the best results.
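As a minimal illustration of this feature extraction stage, the sketch below shows how a 32-dimensional penultimate-layer feature can be exposed by an ImageNet-pretrained VGG19 image branch in PyTorch. The layer sizes other than the 32-dimensional feature and the `return_features` interface are illustrative assumptions rather than the exact Pathomic Fusion architecture; a recent torchvision (≥ 0.13) is assumed for the pretrained-weights API.

```python
import torch
import torch.nn as nn
from torchvision import models

class PathologyCNN(nn.Module):
    """ImageNet-pretrained VGG19 backbone with a 32-d penultimate feature layer.

    A sketch of the unimodal image branch; after supervised training, the 32-d
    hidden vector is passed on to the multi-modal fusion stage.
    """
    def __init__(self, n_classes=3, feat_dim=32):
        super().__init__()
        backbone = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1)
        self.features = backbone.features                 # convolutional trunk
        self.pool = nn.AdaptiveAvgPool2d((7, 7))
        self.embed = nn.Sequential(                        # compress to a 32-d vector
            nn.Flatten(),
            nn.Linear(512 * 7 * 7, 256), nn.ReLU(),
            nn.Linear(256, feat_dim), nn.ReLU(),
        )
        self.classifier = nn.Linear(feat_dim, n_classes)   # grade II/III/IV head

    def forward(self, x, return_features=False):
        h = self.embed(self.pool(self.features(x)))        # (B, 32) low-dim feature
        logits = self.classifier(h)
        return (logits, h) if return_features else logits
```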
2.2. Multi-modal Feature Fusion with Missing Modalities
Pathomic Fusion uses the Kronecker product of multi-modal feature vectors to capture the correlation between different modalities. However, it is restricted to data with both modalities (pathological images and genomic data) available. We therefore replace the Kronecker product with the fusion method from CPM-Nets to utilize the data with missing modalities. In detail, a structured hidden representation H, which integrates the multi-modal features for each sample, is trained with a reconstruction loss and a clustering-like classification loss. According to Zhang et al. [12], the representation H learns to reconstruct all available modalities of a sample and in turn encodes the comprehensive information from the different available modalities. It defines a common space in which samples with different available modalities are comparable. In this common space, the clustering-like classification scheme penalizes misclassification and guarantees a structured representation. The representation H and the trainable parameters of the multi-layer perceptrons (MLPs) are updated alternately by backpropagation. A missing-modality controller records the modality availability for each sample. When a modality of a sample is missing, the other available modalities can still be used to calculate the reconstruction loss and learn the representation H, as sketched below. In the testing phase, the self-supervised reconstruction loss and the trained MLPs generate the hidden representations for the testing data. Finally, the testing samples are classified by the average similarity of their hidden representations to the hidden representations of the training set.
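A minimal sketch of this fusion step is given below, assuming dictionary-based interfaces for the per-modality features, availability masks, and reconstruction networks (these names are illustrative, not the CPM-Nets reference code). The cosine similarity used for the test-time classification is also an assumption; the text only specifies an average-similarity rule.

```python
import torch
import torch.nn.functional as F

def reconstruction_loss(h, feats, masks, decoders):
    """Masked reconstruction loss over whichever modalities are available.

    h        : (B, d_h) trainable hidden representations, one row per sample
    feats    : dict {modality: (B, 32) low-dimensional unimodal features}
    masks    : dict {modality: (B,) float, 1.0 if the modality is present else 0.0}
    decoders : dict {modality: network reconstructing that modality from h}
    """
    loss = 0.0
    for m, x in feats.items():
        recon = decoders[m](h)                       # (B, 32) reconstruction
        per_sample = ((recon - x) ** 2).mean(dim=1)  # MSE per sample
        # Missing modalities contribute nothing to the loss.
        loss = loss + (per_sample * masks[m]).sum() / masks[m].sum().clamp(min=1)
    return loss

def classify_by_similarity(h_test, h_train, y_train, n_classes=3):
    """Assign each test sample to the class whose training representations
    are most similar on average (cosine similarity assumed)."""
    sims = F.normalize(h_test, dim=1) @ F.normalize(h_train, dim=1).T  # (B_test, B_train)
    scores = torch.stack([sims[:, y_train == c].mean(dim=1) for c in range(n_classes)], dim=1)
    return scores.argmax(dim=1)
```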
3. EXPERIMENTS
3.1. Data
In this study, we used the merged TCGA-GBM (Glioblastoma Multiforme) and TCGA-LGG (Low-Grade Glioma) datasets, which were preprocessed by previous works [9] [5]. The merged dataset contained 769 patients, each with a modality of 1-3 1024 × 1024 regions of interest (ROIs) from diagnostic slides and/or a modality of 80 genomic features, including the 79 most informative features from copy number variations (CNV) and one from mutation status. Only the 664 patients out of the 769 that had both the gene and image modalities available were used in this project. The World Health Organization (WHO) grade (II, III, or IV) of each patient was determined by manual interpretation of histology for malignancy [9] and was used as the ground truth for cancer diagnosis in the following experiments.
3.2. Experiment Settings
Two sets of experiments were performed to investigate the contribution of the data with missing modalities. 1) The first experiment compared the integrated model trained with all data against the same model trained with the data with complete modalities only. 2) The second experiment compared the integrated model trained with all data against the Pathomic Fusion model trained with the data with complete modalities only. Specifically, we randomly discarded the pathological image modality or the gene modality of n% (n = 99, 95, 90, 75, 50, 25, 0) of the training data, while keeping the rest of the samples with both modalities, as sketched below. All available data of each modality were used to train the CNN, GCN, and SNN at the feature extraction stage to generate the low-dimensional features. At the feature fusion stage, either only the samples with complete modalities or all samples, including those with missing modalities, were used to train the models that make the final prediction.
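The following sketch illustrates how such availability masks can be generated for a given missing rate; the 50/50 choice between dropping the image or the gene modality and the function name are assumptions for illustration.

```python
import numpy as np

def simulate_missing(n_samples, missing_rate, seed=0):
    """Mark `missing_rate` (a fraction, e.g. 0.75) of the training samples as
    having only one modality; the remaining samples keep both modalities."""
    rng = np.random.default_rng(seed)
    has_image = np.ones(n_samples, dtype=bool)
    has_gene = np.ones(n_samples, dtype=bool)
    n_drop = int(round(missing_rate * n_samples))
    dropped = rng.choice(n_samples, size=n_drop, replace=False)
    for i in dropped:
        if rng.random() < 0.5:
            has_image[i] = False  # discard the pathological-image modality
        else:
            has_gene[i] = False   # discard the genomic modality
    return has_image, has_gene
```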
The framework was built with PyTorch on an Nvidia Quadro RTX5000 GPU. In the feature extraction stage, we followed the same settings as Pathomic Fusion. Specifically, the CNN for pathological image features was initialized with the VGG19 pre-trained on ImageNet and trained with a small learning rate of 0.0005 and a batch size of 8. In each training epoch, one 512 × 512 image patch was randomly cropped from each 1024 × 1024 pathological ROI and augmented by color jittering and random vertical and horizontal flips; the augmented patches were used to train the CNN. The GCN and SNN were initialized with the self-normalizing weights from Klambauer et al. [13] and trained with a learning rate of 0.002 and batch sizes of 32 and 64, respectively. Note that the graph of a whole ROI did not take up much computation memory, so there was no need to crop it as was done with the pathological image patches for the CNN. For each modality, the 32-dimensional feature vectors before the last linear layer and the softmax activation were extracted as the low-dimensional features for multi-modal fusion.
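For reference, a torchvision-style version of this patch augmentation might look as follows; the color-jitter magnitudes and the normalization statistics are assumptions, since the text does not specify them.

```python
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomCrop(512),   # random 512x512 patch from a 1024x1024 ROI
    transforms.ColorJitter(brightness=0.1, contrast=0.1, saturation=0.1, hue=0.01),
    transforms.RandomVerticalFlip(),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
```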
In the multi-modal feature fusion stage, nine overlapping 512 × 512 patches were cropped from each 1024 × 1024 pathological ROI. Each low-dimensional feature of a cropped pathological image patch was paired with the low-dimensional feature of the corresponding pathological ROI graph and the low-dimensional feature of the genomic data from the corresponding patient for multi-modal fusion. The maximum softmax activation score over all samples of a patient was used to determine the class of that patient. For the CPM fusion method, the initial learning rate was set to 0.001 and the length of the hidden representation H was set to 64 empirically. Each modality had a reconstruction network, which consisted of two linear layers with dimensions [64, 96] and [96, 32] and a dropout layer (dropout ratio = 0.1). The hidden representation vectors and the weights of the reconstruction networks were randomly initialized with the Xavier method. In each training epoch, the hidden representation was fixed and the reconstruction networks were updated by the reconstruction loss, i.e., mean squared error, for the first 5 inner epochs; then the hidden representation was updated by the sum of the clustering-like classification loss [12] and the reconstruction loss for another 5 inner epochs, while the reconstruction networks were fixed. In the testing phase, the trained reconstruction networks were fixed and the hidden representation was updated by the reconstruction loss.

To compare with the previous state of the art on data with complete modalities, the fusion method proposed in Pathomic Fusion [9] was also implemented. It used attention gates for each modality to further sparsify the low-dimensional features. A value of one was appended to each low-dimensional feature vector, and a Kronecker product was then used to capture the interactions of every two or three modality features by generating a multi-modal tensor with the dimension of 33 × 33 × 33 (see the sketch after this paragraph). The result of the Kronecker product was flattened into a multi-modal feature vector and compressed to a size of 64 through a three-layer perceptron. A simple linear classifier made the final prediction, supervised by the cross-entropy loss. The learning rate was set to 0.0001 for this fusion method empirically.
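The sketch below illustrates the Kronecker-product baseline described above (appending a one to each 32-dimensional feature, forming the 33 × 33 × 33 tensor, flattening, and compressing to 64 dimensions). The attention gates are omitted for brevity, and the hidden sizes of the compressing perceptron are assumptions.

```python
import torch
import torch.nn as nn

class KroneckerFusion(nn.Module):
    """Trimodal Kronecker-product fusion in the spirit of Pathomic Fusion."""
    def __init__(self, feat_dim=32, out_dim=64, n_classes=3):
        super().__init__()
        d = feat_dim + 1                      # 33 after appending a constant 1
        self.compress = nn.Sequential(        # three-layer perceptron to 64-d
            nn.Linear(d ** 3, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, out_dim), nn.ReLU(),
        )
        self.classifier = nn.Linear(out_dim, n_classes)

    def forward(self, f_cnn, f_gcn, f_snn):
        ones = torch.ones(f_cnn.size(0), 1, device=f_cnn.device)
        a = torch.cat([f_cnn, ones], dim=1)   # (B, 33)
        b = torch.cat([f_gcn, ones], dim=1)
        c = torch.cat([f_snn, ones], dim=1)
        # Outer (Kronecker) product over the three modalities -> (B, 33, 33, 33)
        fused = torch.einsum('bi,bj,bk->bijk', a, b, c).flatten(1)
        return self.classifier(self.compress(fused))
```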
To simplify the computation and to focus on the comparison of the fusion methods, the feature extraction stage and the feature fusion stage were trained separately rather than end-to-end. All networks were trained with the Adam optimizer and a linearly decaying learning rate scheduler. Different from the previous works [9] [12], which trained the models for a fixed number of epochs, training was stopped when there was no improvement on the validation set for 10 epochs, and the model that performed best on the validation set was used for testing (see the sketch below).
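A sketch of this early-stopping scheme follows; `train_one_epoch`, `evaluate`, and `max_epochs` are placeholders for the actual training and validation loops rather than functions from the original code.

```python
import copy

def fit(model, optimizer, scheduler, train_one_epoch, evaluate,
        patience=10, max_epochs=200):
    """Stop when the validation metric has not improved for `patience` epochs
    and restore the best checkpoint."""
    best_metric, best_state, stale = float('-inf'), None, 0
    for epoch in range(max_epochs):
        train_one_epoch(model, optimizer)
        scheduler.step()                      # linearly decaying learning rate
        metric = evaluate(model)              # e.g. validation AUC
        if metric > best_metric:
            best_metric = metric
            best_state = copy.deepcopy(model.state_dict())
            stale = 0
        else:
            stale += 1
            if stale >= patience:             # no improvement for 10 epochs
                break
    model.load_state_dict(best_state)
    return model
```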
To evaluate the model performance, we used the same 15 train-test splits [5] that were also used by Pathomic Fusion. Each split contained about 80% training data and 20% testing data, split randomly by patient. The difference is that 10% of the training set was randomly split off as the validation set in our experiments. The area under the receiver operating characteristic curve (AUC), the F1 score, and the F1 score for the Grade IV class were used as evaluation metrics for classification.
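A possible implementation of these metrics with scikit-learn is sketched below; the one-vs-rest macro-averaged AUC and the micro-averaged F1 score are assumptions about the averaging scheme, which the text does not specify.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, f1_score

def grade_metrics(y_true, y_prob):
    """y_true: class indices (0 = grade II, 1 = III, 2 = IV); y_prob: (N, 3) softmax scores."""
    y_pred = np.argmax(y_prob, axis=1)
    return {
        "auc": roc_auc_score(y_true, y_prob, multi_class="ovr", average="macro"),
        "f1": f1_score(y_true, y_pred, average="micro"),
        "f1_grade_iv": f1_score(y_true, y_pred, labels=[2], average=None)[0],
    }
```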
3.3. Results
Figure 2 compares the AUC values of the multi-modal models and the unimodal models under different missing rates. The proposed integrated model trained with all data (blue line) always performed better than the unimodal models (dashed lines), whereas the multi-modal models trained only on complete data surpassed the unimodal models only when the missing rate was not very large (smaller than 90%).
Figure 2:

AUC of the classification results at different missing rates (mean ± std over the testing sets of the 15 splits). The dashed lines represent the AUC values of the unimodal models (CNN, GCN, and SNN). The solid lines represent the AUC values of the fusion models. The performance of the proposed integrated model trained with all data, including samples with incomplete modalities, is shown by the blue line; it is better than the other two fusion models trained only on data with complete modalities when the missing rate is equal to or larger than 75%.
Table 1 provides further quantitative comparisons of the multi-modal methods trained with all data or with complete data only. When the missing rate of the dataset was equal to or larger than 75%, the AUC and the F1 score for Grade IV of the integrated model trained with all data were higher than those of Pathomic Fusion and of the integrated model trained only on the data with complete modalities. Although the best AUC and F1 scores of the integrated model were slightly lower than those of Pathomic Fusion, the integrated model performed much more stably across the different missing rates: its AUC changed from 0.896 to 0.885, while the AUC of Pathomic Fusion changed from 0.911 to 0.751. In the extreme case of a 99% missing rate, the number of samples with complete modalities in each class can be very unbalanced; even worse, some classes may have no training data with complete modalities for the fusion models. This is probably why the models trained only on the data with complete modalities performed clearly worse, or even failed, when the missing rate was large, while the integrated model trained with all available data maintained good performance.
Table 1:
Model performance of the integrated model and Pathomic Fusion at different missing rates* of the data. (All values are the mean ± standard deviation computed over the testing sets of the 15 splits. The methods with the best performance are highlighted in bold.)
| Methods | Use data with missing modalities? | Evaluation Metrics | 0% | 25% | 50% | 75% | 90% | 95% | 99% |
|---|---|---|---|---|---|---|---|---|---|
| Proposed integrated model (Pathomic + CPM-Nets) | Yes | AUC ↑ | 0.896 ± 0.009 | 0.895 ± 0.008 | 0.888 ± 0.011 | 0.887 ± 0.010 | 0.890 ± 0.010 | 0.886 ± 0.010 | 0.885 ± 0.011 |
| | | F1 score ↑ | 0.731 ± 0.018 | 0.738 ± 0.023 | 0.719 ± 0.023 | 0.716 ± 0.018 | 0.721 ± 0.024 | 0.702 ± 0.023 | 0.696 ± 0.023 |
| | | F1 Grade IV ↑ | 0.926 ± 0.011 | 0.922 ± 0.014 | 0.918 ± 0.010 | 0.910 ± 0.014 | 0.909 ± 0.015 | 0.896 ± 0.017 | 0.898 ± 0.016 |
| Pathomic Fusion | No | AUC ↑ | 0.911 ± 0.010 | 0.905 ± 0.009 | 0.894 ± 0.010 | 0.883 ± 0.014 | 0.880 ± 0.012 | 0.874 ± 0.017 | 0.751 ± 0.076 |
| | | F1 score ↑ | 0.749 ± 0.021 | 0.740 ± 0.019 | 0.729 ± 0.017 | 0.711 ± 0.020 | 0.723 ± 0.018 | 0.699 ± 0.024 | 0.579 ± 0.084 |
| | | F1 Grade IV ↑ | 0.933 ± 0.015 | 0.928 ± 0.013 | 0.919 ± 0.013 | 0.903 ± 0.011 | 0.903 ± 0.018 | 0.887 ± 0.018 | 0.723 ± 0.172 |
| Proposed integrated model (Pathomic + CPM-Nets) | No | AUC ↑ | 0.896 ± 0.009 | 0.891 ± 0.009 | 0.893 ± 0.013 | 0.884 ± 0.011 | 0.881 ± 0.009 | 0.869 ± 0.011 | 0.786 ± 0.063 |
| | | F1 score ↑ | 0.731 ± 0.018 | 0.727 ± 0.021 | 0.723 ± 0.019 | 0.715 ± 0.021 | 0.705 ± 0.017 | 0.705 ± 0.017 | 0.660 ± 0.011 |
| | | F1 Grade IV ↑ | 0.926 ± 0.011 | 0.926 ± 0.009 | 0.921 ± 0.012 | 0.905 ± 0.013 | 0.901 ± 0.024 | 0.888 ± 0.020 | 0.815 ± 0.126 |
* A missing rate of n% indicates that n% of the data in the training set had only one modality (pathological image or genomic data) available, while the rest of the training data had both modalities. Using the data with missing modalities means using all available data, including the data with missing modalities, at the feature fusion stage.
In short, these experimental results indicate that multi-modal fusion can improve on the performance of unimodal models in most cases. Furthermore, utilizing the data with missing modalities at the multi-modal feature fusion stage is beneficial to model performance: models that use the data both with and without missing modalities are more likely to maintain a high level of performance, even when the missing rate is large.
4. DISCUSSION
In this work, to deal with the missing data problem in multi-modal cancer diagnosis, we integrated the feature extraction methods from Pathomic Fusion with the missing-modality fusion strategy from CPM-Nets. In the experiments using pathological images and genomic information for glioma grade classification, the results show that the integrated model, by utilizing data with missing modalities, achieved better results when the dataset had a high missing rate. This would be helpful in clinical practice, where missing data is a common problem. Moreover, the integrated model is flexible not only with respect to data with missing modalities, but also to multi-modal datasets with more modalities. The fusion method used in Pathomic Fusion is not practical when there are many modalities (e.g., more than three) because it requires the Kronecker product between every two or three modalities, and the size of the resulting tensor grows rapidly with the number of modalities. In cancer diagnosis and prognosis, more modalities can be available than pathological images and genomic data. We plan to explore the missing modality problem with additional modalities in future work. If this work is conclusive, it has the potential to fuse complex information from multiple modalities into computer-aided decisions, which could greatly help clinicians in cancer diagnosis.
ACKNOWLEDGEMENTS
This work has not been submitted for publication or presentation elsewhere.
This work is in part based upon data generated by the TCGA Research Network: https://www.cancer.gov/tcga
This work was supported by a grant from the Skin Cancer Foundation and the Dermatology Foundation, and it was supported by Leona M. and Harry B. Helmsley Charitable Trust grant G-1903-03793, NSF CAREER 1452485
REFERENCES
- [1]. Ash JT, Darnell G, Munro D, and Engelhardt BE, “Joint analysis of expression levels and histological images identifies genes associated with tissue morphology,” Nature Communications, vol. 12, no. 1, 2021, doi: 10.1038/s41467-021-21727-x.
- [2]. Subramanian V, Chidester B, Ma J, and Do MN, “Correlating Cellular Features with Gene Expression Using CCA,” in Proceedings - International Symposium on Biomedical Imaging (ISBI), 2018, pp. 805–808.
- [3]. Subramanian V, Syeda-Mahmood T, and Do MN, “Multimodal fusion using sparse CCA for breast cancer survival prediction,” in Proceedings - International Symposium on Biomedical Imaging (ISBI), 2021, pp. 1429–1432, doi: 10.1109/ISBI48211.2021.9434033.
- [4]. Yao J, Zhu X, Zhu F, and Huang J, “Deep correlational learning for survival prediction from multi-modality data,” Lecture Notes in Computer Science, vol. 10434 LNCS, pp. 406–414, 2017, doi: 10.1007/978-3-319-66185-8_46.
- [5]. Mobadersany P et al., “Predicting cancer outcomes from histology and genomics using convolutional networks,” Proceedings of the National Academy of Sciences of the United States of America, vol. 115, no. 13, pp. E2970–E2979, 2018, doi: 10.1073/pnas.1717139115.
- [6]. Cheerla A and Gevaert O, “Deep learning with multimodal representation for pancancer prognosis prediction,” Bioinformatics, vol. 35, no. 14, pp. i446–i454, 2019, doi: 10.1093/bioinformatics/btz342.
- [7]. Shao W et al., “Multi-task multi-modal learning for joint diagnosis and prognosis of human cancers,” Medical Image Analysis, vol. 65, 2020, doi: 10.1016/j.media.2020.101795.
- [8]. Li S, Shi H, Sui D, Hao A, and Qin H, “A Novel Pathological Images and Genomic Data Fusion Framework for Breast Cancer Survival Prediction,” Proceedings of the Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), pp. 1384–1387, 2020, doi: 10.1109/EMBC44109.2020.9176360.
- [9]. Chen RJ et al., “Pathomic Fusion: An Integrated Framework for Fusing Histopathology and Genomic Features for Cancer Diagnosis and Prognosis,” IEEE Transactions on Medical Imaging, 2020, doi: 10.1109/TMI.2020.3021387.
- [10]. Wang Z, Li R, Wang M, and Li A, “GPDBN: deep bilinear network integrating both genomic data and pathological images for breast cancer prognosis prediction,” Bioinformatics, pp. 1–8, 2021, doi: 10.1093/bioinformatics/btab185.
- [11]. Vale-Silva LA and Rohr K, “Pan-Cancer Prognosis Prediction Using Multimodal Deep Learning,” in Proceedings - International Symposium on Biomedical Imaging (ISBI), 2020, pp. 568–571.
- [12]. Zhang C, Han Z, Cui Y, Fu H, Zhou JT, and Hu Q, “CPM-Nets: Cross partial multi-modal networks,” Advances in Neural Information Processing Systems, vol. 32, 2019.
- [13]. Klambauer G, Unterthiner T, Mayr A, and Hochreiter S, “Self-normalizing neural networks,” Advances in Neural Information Processing Systems, vol. 30, pp. 972–981, 2017.
