Table 3.
Comparison of MeTra to current state-of-the-art methods for survival prediction in intensive care patients, in terms of the area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPRC).
Method | AUROC | AUPRC | Comments |
---|---|---|---|
Early14 | 0.827 [0.801, 0.854] | 0.485 [0.417, 0.555] | Feature extractors for clinical parameters (CP) and imaging data are first pre-trained separately. Subsequently, the latent representations of both modalities are concatenated, and a final classification layer is trained to merge the inputs |
Joint14 | 0.825 [0.798, 0.853] | 0.506 [0.436, 0.574] | CP and imaging data are fed through separate feature extraction layers; the resulting features are then concatenated and passed through a final classification head to form the prediction. Training is performed end-to-end (a minimal sketch of this concatenation-based fusion follows the table) |
MMTM14,42 | 0.819 [0.788, 0.846] | 0.474 [0.402, 0.544] | A Multimodal Transfer Module (MMTM)14,42 is added after feature extraction of both modalities to merge the inputs |
DAFT14,16 | 0.828 [0.799, 0.854] | 0.492 [0.427, 0.572] | A Dynamic Affine Feature Map Transform (DAFT)14,16 is used after feature extraction of both modalities to scale and shift the resulting feature maps, thereby merging the modalities |
Unified14,15 | 0.835 [0.808, 0.861] | 0.495 [0.424, 0.567] | In every training iteration, a two-step approach is performed. First, feature extractors for the CP and imaging data (which do not necessarily have to be paired) are trained separately to extract meaningful features. Second, the previously learned feature extractors extract features for a set of paired samples, which are then concatenated and fed through a learnable classification head |
MedFuse (PT)14 | 0.841 [0.813, 0.868] | 0.544 [0.477, 0.609] | For CP and imaging data, separate feature extractors are learned on modality-specific labels. The final prediction is then formed by feeding both feature representations sequentially into a neural network of LSTM (Long Short-Term Memory) layers. This AUROC cannot be directly compared to our method: the MedFuse (PT) configuration uses considerably more imaging data (pre-training on 340,470 additional radiographs) and more CP (22,356 samples) than MeTra (6,798 samples for both CP and imaging data) |
MedFuse (OPTIMAL)14 | 0.865 [0.837, 0.889] | 0.594 [0.526, 0.655] | For CP and imaging data, separate feature extractors are learned on modality-specific labels. The final prediction is then formed by feeding both feature representations sequentially into a neural network of LSTM layers. This AUROC cannot be directly compared to our method: MedFuse (OPTIMAL) uses the same additional imaging data as MedFuse (PT) and performs extensive selection on the CP data based on 22,356 samples |
MedFuse (RI)14 | 0.817 [0.785, 0.846] | 0.471 [0.404, 0.545] | For CP and imaging data, separate feature extractors are learned on modality-specific labels. The final prediction is then formed by feeding both feature representations sequentially into a neural network of LSTM layers. This AUROC cannot be directly compared to our method: MedFuse (RI) is not pre-trained on additional imaging data (unlike the PT and OPTIMAL configurations) but still uses more CP data (22,356 samples) than MeTra (6,798 samples) |
MeTra (CP + CXR) | 0.863 [0.835, 0.889] | 0.594 [0.526, 0.662] | MeTra is based on the transformer model, in which data are processed as a set of tokens. The CP and imaging data are fed through corresponding transformer-based backbones to extract latent feature tokens, which are then merged in a final transformer encoder (see the second sketch after the table). MeTra is trained on less data than MedFuse (OPTIMAL), (PT), and (RI) |
Means [95% confidence intervals].
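
For concreteness, the concatenation-based baselines ("Early" and "Joint") can be illustrated with a short sketch. The PyTorch module below is a minimal illustration under stated assumptions, not the published implementation: the encoder architectures, layer sizes, and the name `JointFusion` are invented for exposition. In the "Early" variant, both encoders would be pre-trained separately and frozen, so that only the classification head is trained; the "Joint" variant trains all components end-to-end.

```python
# Minimal sketch of concatenation-based fusion (the "Joint" baseline), assuming
# hypothetical encoders; all layer sizes are illustrative, not from the paper.
import torch
import torch.nn as nn

class JointFusion(nn.Module):
    """Extract features per modality, concatenate, and classify end-to-end."""

    def __init__(self, cp_dim: int, img_channels: int, latent_dim: int = 128):
        super().__init__()
        # Feature extractor for clinical parameters (CP), here a small MLP.
        self.cp_encoder = nn.Sequential(
            nn.Linear(cp_dim, latent_dim), nn.ReLU(),
            nn.Linear(latent_dim, latent_dim),
        )
        # Feature extractor for imaging data, here a tiny CNN with pooling.
        self.img_encoder = nn.Sequential(
            nn.Conv2d(img_channels, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(16, latent_dim),
        )
        # Classification head on the concatenated latent representations.
        # For the "Early" variant, freeze both encoders and train only this head.
        self.head = nn.Linear(2 * latent_dim, 1)

    def forward(self, cp: torch.Tensor, img: torch.Tensor) -> torch.Tensor:
        z = torch.cat([self.cp_encoder(cp), self.img_encoder(img)], dim=-1)
        return self.head(z)  # one survival logit per patient
```

Calling `JointFusion(cp_dim=48, img_channels=1)(cp_batch, img_batch)` on a batch of CP vectors and single-channel radiographs returns one logit per patient.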
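
MeTra's token-level fusion (last table row) can be sketched in the same spirit. The module below assumes that modality-specific transformer backbones (not shown) have already produced token sequences of a shared dimension; the class-token readout, dimensions, and depth are illustrative assumptions rather than the published configuration.

```python
# Minimal sketch of MeTra-style token fusion; dimensions, depth, and the
# class-token readout are assumptions for illustration, not the published model.
import torch
import torch.nn as nn

class TokenFusion(nn.Module):
    """Merge CP and image tokens from separate backbones in a shared
    transformer encoder, then classify from a learnable class token."""

    def __init__(self, token_dim: int = 256, n_heads: int = 8, depth: int = 2):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, token_dim))
        layer = nn.TransformerEncoderLayer(
            d_model=token_dim, nhead=n_heads, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(token_dim, 1)

    def forward(self, cp_tokens: torch.Tensor, img_tokens: torch.Tensor):
        # cp_tokens: (B, N_cp, D) from a CP backbone; img_tokens: (B, N_img, D)
        # from an image backbone (e.g. ViT patch tokens); D == token_dim.
        cls = self.cls_token.expand(cp_tokens.size(0), -1, -1)
        tokens = torch.cat([cls, cp_tokens, img_tokens], dim=1)
        fused = self.fusion(tokens)
        return self.head(fused[:, 0])  # survival logit from the class token
```

Because all tokens attend to one another in the shared encoder, clinical and imaging information can interact at the token level rather than only after pooling, which is the key contrast to the concatenation baselines above.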