2023 Jul 1;13:10666. doi: 10.1038/s41598-023-37835-1

Figure 5.

Medical Transformer (MeTra) architecture. The chest radiograph is first processed in the vision backbone, where it is split into patch embeddings that are then passed through a transformer encoder. Similarly, the clinical parameter items are passed through the clinical backbone, which projects them into an embedding space whose dimensionality matches that of the image embeddings. Next, a learnable position encoding token is added to the embeddings of both modalities. Finally, the modalities are fused by a transformer encoder that applies multi-head self-attention to all input tokens, allowing information to transfer across modalities. A multilayer perceptron (MLP) is applied to the output to form the final prediction of in-hospital survival.
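The fusion step described in the caption can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the token counts, embedding width, number of heads, random weights, the per-item linear projection used as the clinical backbone, and the mean pooling before the MLP are all assumptions chosen for illustration, and residual connections, layer normalization, and the vision backbone itself are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes; the caption does not specify any of them.
D, N_IMG, N_CLIN, HEADS, HIDDEN = 32, 16, 8, 4, 16


def init(shape):
    """Random weights scaled by fan-in so activations stay O(1)."""
    return rng.normal(size=shape) / np.sqrt(shape[0])


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_self_attention(tokens, w_q, w_k, w_v, w_o, heads):
    """Self-attention over ALL tokens, so image and clinical tokens mix."""
    n, d = tokens.shape
    dh = d // heads

    def split(x):  # (n, d) -> (heads, n, dh)
        return x.reshape(n, heads, dh).transpose(1, 0, 2)

    q, k, v = split(tokens @ w_q), split(tokens @ w_k), split(tokens @ w_v)
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(dh))  # (heads, n, n)
    out = (attn @ v).transpose(1, 0, 2).reshape(n, d)       # merge heads
    return out @ w_o, attn


# Stand-in for the vision backbone output: one D-dim token per image patch.
img_tokens = rng.normal(size=(N_IMG, D))

# Assumed clinical backbone: each scalar item gets its own linear projection
# into the same D-dim space as the image tokens.
clin_values = rng.normal(size=(N_CLIN,))
w_clin, b_clin = init((N_CLIN, D)), np.zeros((N_CLIN, D))
clin_tokens = clin_values[:, None] * w_clin + b_clin

# Position encodings (learnable in a trained model, random here) are added
# to the tokens of both modalities.
pos = 0.02 * rng.normal(size=(N_IMG + N_CLIN, D))
tokens = np.concatenate([img_tokens, clin_tokens], axis=0) + pos

# Fusion: one multi-head self-attention layer over the joint token sequence.
fused, attn = multi_head_self_attention(
    tokens, init((D, D)), init((D, D)), init((D, D)), init((D, D)), HEADS
)

# Assumed prediction head: mean-pool the fused tokens, then a two-layer MLP
# with a sigmoid output for the survival probability.
pooled = fused.mean(axis=0)
hidden = np.maximum(0.0, pooled @ init((D, HIDDEN)))
logit = float(hidden @ init((HIDDEN,)))
prob = 1.0 / (1.0 + np.exp(-logit))

# Attention mass that image tokens place on clinical tokens, per head.
img_to_clin = attn[:, :N_IMG, N_IMG:].sum(axis=-1).mean(axis=-1)
```

Because the attention matrix spans the concatenated sequence, the block `attn[:, :N_IMG, N_IMG:]` holds the weights with which image tokens attend to clinical tokens; this is the cross-modality information transfer the caption refers to.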