Journal of Imaging Informatics in Medicine. 2025 Feb 19;38(6):3568–3583. doi: 10.1007/s10278-025-01446-1

Enhancing Chest X-ray Diagnosis with a Multimodal Deep Learning Network by Integrating Clinical History to Refine Attention

Lian Yang 1,2, Yiliang Wan 3, Feng Pan 1,2,
PMCID: PMC12701160  PMID: 39971817

Abstract

The rapid advancements of deep learning technology have revolutionized medical imaging diagnosis. However, training these models is often challenged by label imbalance and the scarcity of certain diseases. Most models fail to recognize multiple coexisting diseases, which are common in real-world clinical scenarios. Moreover, most radiological models rely solely on image data, which contrasts with radiologists’ comprehensive approach, incorporating both images and other clinical information such as clinical history and laboratory results. In this study, we introduce a Multimodal Chest X-ray Network (MCX-Net) that integrates chest X-ray images and clinical history texts for multi-label disease diagnosis. This integration is achieved by combining a pretrained text encoder, a pretrained image encoder, and a pretrained image-text cross-modal encoder, fine-tuned on the public MIMIC-CXR-JPG dataset, to diagnose 13 diverse lung diseases on chest X-rays. As a result, MCX-Net achieved the highest macro AUROC of 0.816 on the test set, significantly outperforming unimodal baselines such as ViT-base and ResNet152, which scored 0.747 and 0.749, respectively (p < 0.001). This multimodal approach represents a substantial advancement over existing image-based deep-learning diagnostic systems for chest X-rays.

Keywords: Chest X-ray, Lung diseases, Differential diagnoses, Artificial intelligence, Transformer

Introduction

Chest X-rays stand as the preeminent medical imaging test globally due to their cost-effectiveness, rapid imaging capabilities, and minimal radiation exposure [1]. They serve as a substantial diagnostic tool for investigating a spectrum of chest pathologies, such as pneumonia, tumors, and cardiac diseases [1]. However, the interpretation of chest X-ray images remains a complex task owing to their insufficient soft tissue contrast and the superimposed densities of a two-dimensional projection, often challenging even for radiologists, particularly in scenarios involving multiple co-existing diseases [2, 3]. For example, approximately 90% of overlooked lung cancer cases are attributable to errors in the interpretation of chest X-rays [4]. In particular, with the surge in the number of chest X-rays during the COVID-19 pandemic, human error stemming from fatigue or increased workload, coupled with the scarcity of experienced thoracic radiologists, further compounds the challenge [1, 4, 5].

Efforts to mitigate these challenges have prompted the development of deep-learning systems designed to assist radiologists in interpreting chest X-rays. In particular, many deep-learning diagnostic tools have been established and have significantly improved the diagnostic accuracy of radiologists across more than a hundred clinical findings with robust performance [6]. Despite these advancements, current deep-learning models face significant challenges in simultaneously identifying multiple diseases in chest X-rays [7, 8]. First, identifying multiple diseases requires models to recognize and distinguish complex patterns associated with each pathology [7]. Similar imaging features can be shared among different diseases (e.g., pneumonia and consolidation) [9]. Additionally, these patterns may interact across various labels, making it challenging for models to accurately differentiate between them [7, 8, 10]. Second, diseases can present with a wide range of variations in terms of size, shape, location, and appearance [11–13]. Models must be able to generalize across this variability to effectively detect and differentiate these diseases [7, 14]. However, capturing this diversity in training data might be difficult [7]. Finally, data imbalance and limited interpretability also hinder robust diagnoses [7, 15, 16]. Although recent techniques such as spatial attention, multi-instance learning, and transfer learning have been applied to improve detection performance, they still grapple with noise and the disparate arrangement of visual elements that scatter model focus [17–20]. Although existing lesion detection and instance segmentation methods demonstrate great potential for multi-disease identification, these approaches often necessitate extensive manual annotation, rendering them resource-intensive [21, 22]. These limitations highlight the need for more advanced methods to address the complexity and variability of chest diseases in medical imaging.

Current deep-learning models primarily focus on single-modality training, particularly with imaging data. In contrast, radiologists consider both imaging and non-imaging data when diagnosing diseases. Among the different types of non-imaging data, clinical history (e.g., symptoms and previous medical conditions) is indispensable in diagnosis as it helps radiologists focus their attention on specific areas of interest in images to find potential abnormalities [23–27]. Recognizing this difference, a novel deep-learning approach integrating clinically relevant information beyond imaging data, such as clinical history and laboratory findings, can be considered to enhance diagnostic capabilities [28, 29]. Although projecting both image and text into the same semantic space is a challenging task, the success of the transformer architecture in natural language processing (e.g., BERT) has recently paved the way for its extension into multi- and cross-modal classification tasks owing to its input-agnostic nature [30–34].

In this context, this study proposes a Multimodal Chest X-ray Network (MCX-Net) with a composite application of a pretrained text encoder, a pretrained image encoder, and a pretrained image-text cross-modal encoder for chest X-ray multi-disease diagnosis, addressing the imperative of advancing beyond the limitations of most deep-learning systems. We aim to explore whether involving clinical history texts can enhance the diagnostic capacities of deep learning for chest X-rays in our proposed approach. The proposed MCX-Net merges image and text representations using Transformer network architectures from natural language processing and computer vision. It applies self-attention across multiple modalities simultaneously, enabling a more direct and nuanced fusion of multimodal data. In sum, MCX-Net offers several advantages and improvements:

  1. It amalgamates chest X-rays and clinical history with an intermediate fusion strategy, since it merges the deep feature maps of these two types of clinical information in the middle of the processing, presenting a more feasible and powerful performance [35, 36].

  2. Pretrained text, image, and image-text cross-modal encoders are employed and finetuned, offering a simpler and more adaptable approach that eliminates the need for retraining [30]. Furthermore, this strategy is modality-agnostic, devoid of specific regions or bounding box proposals, which are often challenging to obtain from various public datasets. It facilitates the computation of raw image features, thus enabling smooth backpropagation throughout the entire encoder architecture.

Materials and Methods

Dataset

Our study utilized the publicly available MIMIC-CXR-JPG dataset (Version: 2.0.0, Published: Sept. 19, 2019, available at https://doi.org/10.13026/C2JT1Q) [37]. This dataset comprises 377,110 chest X-ray images and corresponding free-text reports from studies conducted at the Beth Israel Deaconess Medical Center in Boston, MA [37]. It includes thirteen (13) disease labels for multi-disease diagnoses: “Atelectasis,” “Cardiomegaly,” “Consolidation,” “Edema,” “Enlarged Cardiomediastinum,” “Fracture,” “Lung Lesion,” “Lung Opacity,” “Pleural Effusion,” “Pleural Other,” “Pneumonia,” “Pneumothorax,” and “Support Devices.” Additionally, a “No Finding” label is introduced to indicate the absence of these diseases. To comply with the US Health Insurance Portability and Accountability Act of 1996 (HIPAA) Safe Harbor requirements, the dataset is de-identified and can be used in research after credentialing [37]. The exclusion criteria for images were as follows: (1) images with lateral views; (2) images with labels marked as ambiguous; and (3) missing images. Consequently, 195,528 images were retained. Among these, 135,682 images included clinical history text (marked as “Indication” in the original dataset) provided by physicians. We divided these 135,682 images with “Indication” information into train, validation, and test sets with ratios of 0.8, 0.1, and 0.1, respectively. The remaining 59,846 images without clinical history information were allocated to an additional test set (images only). The flowchart depicting the data inclusion process is illustrated in Fig. 1. Further details are provided in Table 1.
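The split sizes reported in Table 1 can be reproduced with a short sketch. This is only an illustration of the 0.8/0.1/0.1 arithmetic: the function name `split_indices`, the random seed, and the image-level shuffling are assumptions, since the text does not state whether the split was made per image, per study, or per patient.

```python
import random

def split_indices(n_items, ratios=(0.8, 0.1, 0.1), seed=0):
    """Shuffle item indices and cut them into train/validation/test lists.

    Hypothetical helper: the seed and the image-level shuffling are
    assumptions, not details taken from the study.
    """
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    n_train = round(n_items * ratios[0])
    n_val = round(n_items * ratios[1])
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # remainder goes to the test set
    return train, val, test

# 135,682 images have an "Indication" text; the other 59,846 of the
# 195,528 retained images form the additional image-only test set.
train, val, test = split_indices(135_682)
print(len(train), len(val), len(test))  # 108546 13568 13568
print(195_528 - 135_682)                # 59846
```

The resulting sizes match the denominators in Table 1 (108,546, 13,568, and 13,568).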

Fig. 1.

Flowchart of data inclusion

Table 1.

Summary of the dataset

Values are count/total (%).

| Label | Train set | Validation set | Test set | Additional test set (images only) |
|---|---|---|---|---|
| No finding | 48678/108546 (44.85%) | 6157/13568 (45.38%) | 6093/13568 (44.91%) | 29315/59846 (48.98%) |
| Diseases (labels) | 59868/108546 (55.15%) | 7411/13568 (54.62%) | 7475/13568 (55.09%) | 30531/59846 (51.02%) |
| Atelectasis | 20498/108546 (18.88%) | 2491/13568 (18.36%) | 2561/13568 (18.88%) | 10387/59846 (17.36%) |
| Cardiomegaly | 22516/108546 (20.74%) | 2712/13568 (19.99%) | 2810/13568 (20.71%) | 8778/59846 (14.67%) |
| Consolidation | 3216/108546 (2.96%) | 413/13568 (3.04%) | 392/13568 (2.89%) | 2260/59846 (3.78%) |
| Edema | 11956/108546 (11.01%) | 1510/13568 (11.13%) | 1476/13568 (10.88%) | 6231/59846 (10.41%) |
| Enlarged cardiomediastinum | 2659/108546 (2.45%) | 338/13568 (2.49%) | 316/13568 (2.33%) | 2162/59846 (3.61%) |
| Fracture | 2159/108546 (1.99%) | 268/13568 (1.98%) | 245/13568 (1.81%) | 1221/59846 (2.04%) |
| Lung lesion | 2603/108546 (2.40%) | 339/13568 (2.50%) | 332/13568 (2.45%) | 1751/59846 (2.93%) |
| Lung opacity | 15994/108546 (14.73%) | 2106/13568 (15.52%) | 1992/13568 (14.68%) | 9763/59846 (16.31%) |
| Pleural effusion | 22793/108546 (21.00%) | 2878/13568 (21.21%) | 2853/13568 (21.03%) | 12175/59846 (20.34%) |
| Pleural other | 887/108546 (0.82%) | 112/13568 (0.83%) | 113/13568 (0.83%) | 440/59846 (0.74%) |
| Pneumonia | 6957/108546 (6.41%) | 902/13568 (6.65%) | 906/13568 (6.68%) | 4031/59846 (6.74%) |
| Pneumothorax | 5534/108546 (5.10%) | 671/13568 (4.95%) | 669/13568 (4.93%) | 2449/59846 (4.09%) |
| Support devices | 24275/108546 (22.36%) | 2949/13568 (21.73%) | 3030/13568 (22.33%) | 12109/59846 (20.23%) |
| Multi-label statistics in cases with disease labels | | | | |
| 1 label | 17535/59868 (29.29%) | 2143/7411 (28.92%) | 2235/7475 (29.90%) | 9265/30531 (30.35%) |
| 2 labels | 18334/59868 (30.62%) | 2240/7411 (30.23%) | 2250/7475 (30.10%) | 8784/30531 (28.77%) |
| 3 labels | 12721/59868 (21.25%) | 1633/7411 (22.03%) | 1611/7475 (21.55%) | 6159/30531 (20.17%) |
| 4 labels | 7665/59868 (12.80%) | 937/7411 (12.64%) | 886/7475 (11.85%) | 3891/30531 (12.74%) |
| 5 labels | 2804/59868 (4.68%) | 347/7411 (4.68%) | 386/7475 (5.16%) | 1822/30531 (5.97%) |
| 6 labels | 678/59868 (1.13%) | 96/7411 (1.30%) | 96/7475 (1.28%) | 509/30531 (1.67%) |
| 7 labels | 117/59868 (0.20%) | 12/7411 (0.16%) | 11/7475 (0.15%) | 89/30531 (0.29%) |
| 8 labels | 12/59868 (0.02%) | 3/7411 (0.04%) | 0/7475 (0.00%) | 12/30531 (0.04%) |
| 9 labels | 2/59868 (0.00%) | 0/7411 (0.00%) | 0/7475 (0.00%) | 0/30531 (0.00%) |

Note: No significant differences among the classes across all sets were identified by Chi-square tests

Architecture of MCX-Net

The MCX-Net effectively integrates text data (clinical history) and image data (chest X-rays) based on a supervised cross-modal bitransformers-like architecture [32]. This network jointly finetunes a pretrained text encoder (including a text tokenizer and a text encoder), a pretrained image encoder (including an image tokenizer and an image encoder), and a pretrained image-text cross-modal encoder by projecting image and text into the same cross-modal semantic space. The illustration of the network is presented in Fig. 2.

Fig. 2.

The proposed network architecture of MCX-Net

Image Tokenizer and Encoder

Images are resized to 224 × 224 resolution and normalized. Subsequently, the 224 × 224 image is divided into 14 × 14 grids (or patches) of equal size, enabling direct integration with a standard Vision Transformer (ViT) architecture [38]. Each patch consists of 16 × 16 pixels, which are subsequently flattened and linearly projected to generate input vectors. Next, a semantic-rich visual tokenizer pretrained from Vector-quantized Knowledge Distillation in BEIT-v2 model was applied to transform the input vectors into 768-dimensional visual tokens, like different “visual words” [39]. In this way, we can map any image to a discrete semantic space corresponding to the visual dictionary. Then, a pretrained BEIT-v2 image encoder is employed to conduct the sequential encoding process to obtain image representations [40]. It adheres to a ViT architecture, featuring a 12-layer Transformer with 768 hidden units, 12 attention heads, and an intermediate size of 3072 in feed-forward networks [40]. Similar to natural language processing, the image is encoded as a sequence of discrete tokens (14 × 14 = 196), each comprising 768 dimensions.
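The patch arithmetic above can be checked with a minimal NumPy sketch. It covers only the splitting and flattening step, not the learned linear projection or the BEIT-v2 tokenizer; the function name `patchify` and the three-channel input are assumptions made for illustration.

```python
import numpy as np

def patchify(image, patch=16):
    """Split an (H, W, C) image into flattened, non-overlapping patches.

    Returns an array of shape (num_patches, patch * patch * C), with
    patches ordered row by row across the grid.
    """
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    rows, cols = h // patch, w // patch
    x = image.reshape(rows, patch, cols, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (rows, cols, patch, patch, C)
    return x.reshape(rows * cols, patch * patch * c)

# Assumed input: a resized 224 x 224 image with 3 channels.
image = np.zeros((224, 224, 3), dtype=np.float32)
patches = patchify(image)
print(patches.shape)  # (196, 768): a 14 x 14 grid of 16 x 16 x 3 patches
```

With three channels, each flattened patch happens to hold 16 × 16 × 3 = 768 values, the same size as the encoder's hidden dimension; the projection into token space is still a learned layer.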

Text Tokenizer and Encoder

For text representations of clinical history, we employ a pretrained BERT language model (trained on English Wikipedia), a state-of-the-art language model capable of capturing contextual information from the text [31]. The text data is tokenized into tokens using the default BERT tokenizer. Each token corresponds to a word or sub-word in the text data. If the input text exceeds 128 tokens, it will be truncated to 128; conversely, if it is shorter than 128 tokens, it will be padded to ensure a consistent length. Following this tokenization process, the resulting sequence of 128 tokens is converted into input embeddings using BERT’s embedding layer. Afterward, the sequence of input embeddings is passed through a pretrained BERT encoder. This encoder is also based on a Transformer architecture. The output of the BERT encoder is a sequence of contextualized embeddings with 768 dimensions for each word, as the text semantic embeddings.
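The fixed-length rule described above can be sketched in plain Python. The helper name `pad_or_truncate`, the padding id of 0, and the accompanying attention mask are illustrative assumptions; the study used the default BERT tokenizer, whose exact handling of special tokens within the 128-token budget is not described here.

```python
def pad_or_truncate(token_ids, max_len=128, pad_id=0):
    """Force a token-id sequence to exactly max_len entries.

    Returns the fixed-length ids and a mask with 1 for real tokens and
    0 for padding. pad_id=0 is an assumption made for this sketch.
    """
    ids = list(token_ids)[:max_len]          # truncate long inputs
    n_real = len(ids)
    ids = ids + [pad_id] * (max_len - n_real)  # pad short inputs
    mask = [1] * n_real + [0] * (max_len - n_real)
    return ids, mask

# A short clinical history (5 hypothetical token ids) is padded.
short_ids, short_mask = pad_or_truncate([101, 2023, 2003, 1037, 102])
print(len(short_ids), sum(short_mask))  # 128 5

# A long input (300 hypothetical token ids) is truncated.
long_ids, long_mask = pad_or_truncate(range(1, 301))
print(len(long_ids), sum(long_mask))    # 128 128
```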

Image-Text Cross-modal Encoder and Sequential Classifier

In this section, we employ an image-text cross-modal encoder with a multi-head bitransformer architecture to map input data from multiple modalities into a shared semantic space where information from different modalities can be integrated and processed together [32, 33]. This shared semantic space enables the model to capture rich semantic relationships and interactions between different modalities, facilitating downstream classification tasks [35, 36].

First, image embeddings from a pretrained BEIT-v2 image encoder and text embeddings from a pretrained BERT text encoder are concatenated to create unified embeddings. In this process, a specific 768-dimensional token, “[CLS]” (which stands for “class” tokens), is added at the beginning of the image tokens to aggregate information from the entire unified sequence [41]. Additionally, another 768-dimensional token, “[SEP]” (which stands for “separator” tokens), is placed between the image and text sequence as a delimiter [41]. Subsequently, additional segment and position embeddings are incorporated into the unified embeddings with a shape of 768 × 326 before their input into a pretrained cross-modal encoder—a multimodal bitransformer pretrained in a public hateful memes dataset, as outlined in a previous study [32]. Similarly, the encoder has a 12-layer Transformer with 768 hidden units and 12 attention heads. In scenarios involving tasks with a single text and single image input, we allocate one segment identity to the text and the other to the image embeddings. This design ensures compatibility with situations where not every modality is present—for instance, if only image or text modality is available; moreover, this method can be readily extended to accommodate an arbitrary number of modalities, such as laboratory test data [32]. Then, the sequence of input embeddings undergoes processing by the cross-modal encoder to capture rich semantic information about the input context [31]. Ultimately, the output consists of 326 multimodal contextual embeddings, including the “[CLS]” token, the “[SEP]” token, 128 text contextual embeddings, and 196 image contextual embeddings.
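The sequence length of 326 follows from 1 + 196 + 1 + 128. The sketch below assembles a sequence of that shape from random placeholder arrays; it stores tokens along the first axis, so the shape is (326, 768) rather than 768 × 326. Assigning the “[CLS]” and “[SEP]” tokens to the image segment is an assumption made for illustration only.

```python
import numpy as np

HIDDEN = 768
N_IMAGE, N_TEXT = 196, 128

# Random placeholders stand in for the encoder outputs and special tokens.
rng = np.random.default_rng(0)
image_emb = rng.normal(size=(N_IMAGE, HIDDEN))
text_emb = rng.normal(size=(N_TEXT, HIDDEN))
cls_tok = rng.normal(size=(1, HIDDEN))
sep_tok = rng.normal(size=(1, HIDDEN))

# Layout: [CLS], 196 image tokens, [SEP], 128 text tokens.
sequence = np.concatenate([cls_tok, image_emb, sep_tok, text_emb], axis=0)

# One segment id per modality (assumption: special tokens share the
# image segment) and one position id per token.
segment_ids = np.array([0] * (1 + N_IMAGE + 1) + [1] * N_TEXT)
position_ids = np.arange(sequence.shape[0])

print(sequence.shape)     # (326, 768)
print(segment_ids.shape)  # (326,)
```

In the actual network, learned segment and position embeddings indexed by these ids would be added to `sequence` before it enters the cross-modal encoder.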

To achieve multi-label classification, a Softmax pooling technique is applied [42]. This pooling strategy integrates the advantages of existing methods, such as Average Pooling and the Generalized Pooling Operator, while addressing their limitations, including excessive training parameters and inadequate preservation of intra-modal correlations, and it showed better performance in a previous study [42]. Specifically, the 128 text contextual embeddings undergo average pooling to form a 768-dimensional vector, which is then repeated to create a 768 × 196 tensor, matching the size of the image contextual embeddings. Both image and processed text contextual embeddings are reshaped into tensors of dimensions 768 × 14 × 14 (Channel × Height × Width). Subsequently, the image contextual tensor undergoes two iterations of a 3 × 3 convolution, followed by instance normalization and leaky ReLU activation. Afterward, the processed image and text contextual tensors are concatenated, resulting in a tensor of shape 1536 × 14 × 14. A 1 × 1 convolution operation is then applied to transform this tensor into a shape of 13 × 14 × 14, where 13 represents the number of labels. Following this, we apply the Softmax operation along the channel dimension, producing a probability map of the same shape, 13 × 14 × 14. Next, a 2D MaxPooling operation is performed across the height and width dimensions, reducing the probability map to 13 × 1 × 1, effectively summarizing the probabilities into a single value per label. For the final classification, if the probability for any channel exceeds 0.5, the corresponding label is predicted as positive. Conversely, if all channel probabilities remain below 0.5, the overall prediction is categorized as “No Finding.”
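The last three steps (channel-wise Softmax, spatial max pooling, and thresholding) can be illustrated with a NumPy sketch that starts from the 13 × 14 × 14 map. The convolutional layers are omitted, and the function names and the channel order of `LABELS` are assumptions for illustration.

```python
import numpy as np

# Assumed channel order; the actual order in the trained model may differ.
LABELS = [
    "Atelectasis", "Cardiomegaly", "Consolidation", "Edema",
    "Enlarged Cardiomediastinum", "Fracture", "Lung Lesion",
    "Lung Opacity", "Pleural Effusion", "Pleural Other",
    "Pneumonia", "Pneumothorax", "Support Devices",
]

def softmax_max_pool(logits):
    """Map (labels, H, W) logits to one score per label."""
    shifted = logits - logits.max(axis=0, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    probs = exp / exp.sum(axis=0, keepdims=True)  # softmax over channels
    return probs.max(axis=(1, 2))                 # max over height and width

def decide(scores, labels=LABELS, threshold=0.5):
    """Return labels whose score exceeds the threshold, else 'No Finding'."""
    positives = [name for name, s in zip(labels, scores) if s > threshold]
    return positives or ["No Finding"]

# Flat logits: every label scores 1/13 at every location.
flat = np.zeros((13, 14, 14))
print(decide(softmax_max_pool(flat)))  # ['No Finding']

# Two labels dominate at two different grid locations.
peaked = np.zeros((13, 14, 14))
peaked[1, 0, 0] = 10.0  # Cardiomegaly at grid cell (0, 0)
peaked[8, 5, 5] = 10.0  # Pleural Effusion at grid cell (5, 5)
print(decide(softmax_max_pool(peaked)))  # ['Cardiomegaly', 'Pleural Effusion']
```

Because the Softmax normalizes across labels at each grid location, at most one label can exceed 0.5 at any single location; several labels become positive only when they dominate at different locations.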

Fine-Tuning

The MCX-Net primarily comprises a mixture of pre-trained encoders. Each encoder undergoes fine-tuning in its entirety rather than being transferred with fixed parameters. This approach is adopted because the initial weights are derived from natural images and text, rather than medical data. Additionally, other components are initialized randomly.

Other Implementation Details

We applied an AutoAugment policy derived from reinforcement learning, comprising 25 augmentations such as random rotation, shear, and sharpness adjustments, which has been shown to enhance model accuracy across diverse datasets [43]. Asymmetric loss (ASL) is employed to prioritize learning from challenging and rare samples, as real-world data often exhibit a long-tailed distribution, in which the majority of samples belong to the head category (e.g., “No Finding”), while disease samples are relatively scarce and predominantly found in the tail categories [7, 44–46]. This approach dynamically down-weights negative samples while hard-thresholding easy samples, addressing the imbalanced nature of the dataset [45]. Additionally, hyperparameter settings included a batch size of 32 due to GPU memory constraints, 60,000 training iterations, the Adam optimizer, and a learning rate (lr) of 5.0e-05. Given the dataset’s high imbalance, model selection in this study focused on maximizing macro AUROC in the validation set rather than classification accuracy. The top 3 iterations with the highest macro AUROC in the validation set were saved for each training session. All models were trained on the same cloud GPU platform (https://gpuhub.com/home). The hardware configuration includes: Nvidia 3090 24G GPU × 4, a 60-core Intel(R) Xeon(R) Platinum 8358P CPU, and 360G RAM. The code used to train the proposed model is publicly available on GitHub (https://github.com/Rad-HUST/VLP_X-ray).
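The behaviour of ASL on a single label can be sketched as follows, using the commonly cited form of the loss: a focal-style term for positives, and a probability-shifted focal term for negatives. The values of `gamma_pos`, `gamma_neg`, and `margin` are illustrative assumptions, not the settings used in this study.

```python
import math

def asymmetric_loss(p, y, gamma_pos=0.0, gamma_neg=4.0, margin=0.05, eps=1e-8):
    """Asymmetric loss for one label.

    p: predicted probability of the label; y: 1 (positive) or 0 (negative).
    gamma_pos, gamma_neg, and margin are assumed values for illustration.
    """
    if y == 1:
        # Positive term: focal-style weighting with a small exponent.
        return -((1.0 - p) ** gamma_pos) * math.log(max(p, eps))
    # Negative term: shift the probability down by the margin, so that
    # very easy negatives (p <= margin) are discarded entirely.
    p_m = max(p - margin, 0.0)
    return -(p_m ** gamma_neg) * math.log(max(1.0 - p_m, eps))

easy_negative = asymmetric_loss(0.03, 0)  # below the margin: no loss
mild_negative = asymmetric_loss(0.20, 0)  # strongly down-weighted
hard_negative = asymmetric_loss(0.90, 0)  # confident mistake: large loss
positive = asymmetric_loss(0.50, 1)       # equals -log(0.5) when gamma_pos = 0
print(easy_negative == 0.0, mild_negative < hard_negative)  # True True
```

The larger exponent on the negative term is what down-weights the abundant negative samples, while the margin implements the hard threshold on easy ones.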

Exploration of the Influence of Clinical History

To assess the impact of clinical history (text modality), we trained the MCX-Net using multimodal data (image + clinical history) and compared its performance to unimodal baselines, including ResNet152 and ViT-base (pretrained on ImageNet), on the same test set. To further investigate the influence of clinical history, we tested the MCX-Net on a unimodal test set comprising only X-ray images (the additional test set illustrated in Fig. 1) and compared its performance to that of MCX-Net trained solely on image data.

Ablation Studies

To understand the individual contributions of different components to the overall performance of our proposed model, we conducted ablation studies. Specifically, we systematically removed or modified specific elements of the model architecture and observed their impact on key performance metrics:

  1. Impact of image encoder: We conducted experiments to compare the performance of a different image encoder of ResNet (ResNet152), which was trained and released as part of a supervised multimodal bitransformer study [32].

  2. Handling long-tailed data: We investigated the effects of various techniques for dealing with long-tailed data, including distribution-balanced loss (DBL) and class-balanced re-sampling [47–49].

Visualization and Interpretation

To address the model interpretability in Transformer architectures, particularly in multimodal settings, we employed a generic method for attention visualization [50]. This method overcomes the limitations of traditional explainability techniques, which often focus solely on pure self-attention and overlook the complexities of co-attention maps, so it can be used for unimodal or cross-modal Transformers [51]. By meticulously tracking the evolution and mixing of attention maps across layers, this approach can generate relevancy maps that highlight significant interactions between input modalities [50]. This process involves extracting attention weights of both image and text modalities, computing relevancy scores, and visualizing these scores to provide an interpretable representation of the model's decision-making process [50]. The implementation code is publicly available at: https://github.com/hila-chefer/Transformer-MM-Explainability. Given that the last layer of the Transformer is considered crucial for understanding contextual relationships, our study focuses on attention maps from the last Transformer layer of the image-text cross-modal encoder in MCX-Net or the visual encoder in the ViT-base model, representing an average across all attention heads [50]. For ResNet, model interpretability was assessed using Gradient-weighted Class Activation Mapping (Grad-CAM), which visualizes the heatmap at the last convolutional layer, providing deeper insights into the model’s interpretability [52].
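The head-averaging and slicing steps can be illustrated with a simplified NumPy sketch. It does not implement the relevancy propagation of the cited method [50]; it only averages last-layer attention over heads and splits the “[CLS]” row into an image map and text scores. The token layout ([CLS], 196 image tokens, [SEP], 128 text tokens), the row-by-row patch order, and the function name are assumptions.

```python
import numpy as np

N_IMAGE, N_TEXT, GRID = 196, 128, 14
SEQ = 1 + N_IMAGE + 1 + N_TEXT  # [CLS] + image + [SEP] + text = 326

def cls_attention_maps(attn):
    """Split head-averaged [CLS] attention into image and text parts.

    attn: (heads, SEQ, SEQ) attention weights from the last layer.
    Returns a (14, 14) image map and a (128,) vector of text scores.
    """
    mean_attn = attn.mean(axis=0)  # average over attention heads
    cls_row = mean_attn[0]         # attention paid by the [CLS] token
    image_map = cls_row[1:1 + N_IMAGE].reshape(GRID, GRID)
    text_scores = cls_row[2 + N_IMAGE:]  # skip the [SEP] token
    return image_map, text_scores

# Random row-normalized weights stand in for real attention (12 heads).
rng = np.random.default_rng(0)
raw = rng.random((12, SEQ, SEQ))
attn = raw / raw.sum(axis=-1, keepdims=True)

image_map, text_scores = cls_attention_maps(attn)
print(image_map.shape, text_scores.shape)  # (14, 14) (128,)
```

The image map would then be upsampled to 224 × 224 and overlaid on the X-ray, and the text scores mapped to word-level highlights.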

Performance Evaluations and Statistical Analysis

Quantitative data are presented as mean ± standard deviation, while categorical data are presented as count and percentage. The multi-label classification performance was evaluated using macro precision, recall, F1-score, AUROC, and accuracy, which are commonly used in multiclass assessments [7]. Precision for each label is calculated as the ratio of true positives to total predicted positives, while recall is the ratio of true positives to total actual positives. Accuracy is the ratio of correct predictions for each label to the total number of samples. The F1-score is the harmonic mean of precision and recall for each label. AUROC is the area under the curve obtained by plotting the true positive rate against the false positive rate across thresholds for each label. The macro precision, recall, F1-score, and AUROC are determined by averaging the corresponding metrics across all labels. Overall accuracy is defined as the ratio of correctly predicted label sets to the total number of samples. Student's t-tests were used for statistical comparisons of quantitative results between models. Bonferroni corrections were applied for multiple comparisons. Chi-square tests were used for statistical comparisons of categorical variables. A two-tailed p value of less than 0.05 was considered statistically significant. Statistical analysis was performed using IBM SPSS Statistics Software (version 26, IBM, New York, USA).
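The definitions above can be made concrete with a small pure-Python sketch covering macro precision, recall, F1-score, and the label-set (exact-match) accuracy; AUROC is omitted because it requires continuous scores. The function name and the convention of scoring 0 when a denominator is zero are assumptions.

```python
def macro_metrics(y_true, y_pred):
    """Macro precision/recall/F1 and exact-match accuracy.

    y_true, y_pred: lists of 0/1 lists with shape (samples, labels).
    A metric with a zero denominator is set to 0.0 (an assumption).
    """
    n_labels = len(y_true[0])
    precisions, recalls, f1s = [], [], []
    for j in range(n_labels):
        pairs = [(t[j], p[j]) for t, p in zip(y_true, y_pred)]
        tp = sum(1 for t, p in pairs if t == 1 and p == 1)
        fp = sum(1 for t, p in pairs if t == 0 and p == 1)
        fn = sum(1 for t, p in pairs if t == 1 and p == 0)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    exact = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    return {
        "macro_precision": sum(precisions) / n_labels,
        "macro_recall": sum(recalls) / n_labels,
        "macro_f1": sum(f1s) / n_labels,
        "overall_accuracy": exact,
    }

# Toy example: 4 samples, 2 labels.
y_true = [[1, 0], [0, 1], [1, 1], [0, 0]]
y_pred = [[1, 0], [0, 0], [1, 1], [1, 0]]
result = macro_metrics(y_true, y_pred)
print(result["macro_recall"], result["overall_accuracy"])  # 0.75 0.5
```

In this toy example the per-label precisions are 2/3 and 1 and the per-label F1-scores are 0.8 and 2/3, so the macro F1-score differs from the harmonic mean of macro precision and macro recall.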

Results

Basic Characteristics of Datasets

The distribution of cases across the datasets is summarized in Table 1. The “No Finding” category constitutes nearly half of the cases in each set. Notably, “Atelectasis,” “Cardiomegaly,” and “Pleural Effusion” were among the most frequent diseases, each appearing in roughly 15–21% of cases. Among cases with disease labels, about 30% had one label, another 30% had two labels, and around 21% had three labels. The proportion of cases with four or more labels decreased progressively. No significant differences in label categories were found among the sets.

Comparisons to Unimodal Baselines

We evaluated the performance of MCX-Net (image + text) against three unimodal models (image only) across various metrics (Table 2). The multimodal model exhibited significantly higher macro recall (0.521 ± 0.015, p < 0.01), F1 score (0.451 ± 0.012, p < 0.01), and macro AUROC (0.816 ± 0.004, p < 0.001), compared to the unimodal models. In contrast, the performance of MCX-Net without text input was significantly lower across these metrics, suggesting the critical role of clinical history in enhancing diagnostic accuracy. The integration of text data allowed MCX-Net to focus more effectively on relevant anatomical regions, such as the bilateral lung and mediastinum, compared to unimodal image models (Fig. 3). This underscores the effectiveness of the multimodal approach for improving the accuracy and reliability of medical image analysis. Despite the overall superior performance of the multimodal model, certain disease labels with rare occurrences, such as “Fracture” (1.81% in the test set), still showed limited improvement (Fig. 4). The AUROC for “Fracture” in MCX-Net was only 0.731 ± 0.012, indicating that data scarcity remains a significant challenge.

Table 2.

Overall comparisons between the proposed multimodal model and unimodal baselines in the test set

| Model | Macro precision | Macro recall | Macro F1-score | Macro AUROC | Overall accuracy |
|---|---|---|---|---|---|
| MCX-Net¹ (image + text) | 0.433 ± 0.017 | 0.521 ± 0.015** | 0.451 ± 0.012** | 0.816 ± 0.004*** | 0.839 ± 0.004 |
| MCX-Net without the text input² (image only) | 0.342 ± 0.001 | 0.335 ± 0.001 | 0.287 ± 0.005 | 0.728 ± 0.001 | 0.819 ± 0.002 |
| ViT-base (image only) | 0.384 ± 0.025 | 0.220 ± 0.008 | 0.265 ± 0.003 | 0.747 ± 0.004 | 0.848 ± 0.002 |
| ResNet152 (image only) | 0.438 ± 0.049 | 0.231 ± 0.024 | 0.249 ± 0.035 | 0.749 ± 0.008 | 0.845 ± 0.004 |

¹ Our proposed multimodal model, MCX-Net

² As a comparable unimodal control, the proposed MCX-Net is modified to run without the text (clinical history) input

*p < 0.05, **p < 0.01, ***p < 0.001, compared with other three groups. All p values were calculated by t-tests with Bonferroni corrections

Fig. 3.

Visual interpretation comparison for different models. This figure presents the original chest X-rays, attention maps from the last Transformer layer for various Transformer models, alongside saliency maps for ResNet152 used in this study. From left to right, the columns display the original chest X-ray images with red boxes marking disease regions, attention maps from the ViT-base model, MCX-Net without text input, MCX-Net with text input, and Gradient-weighted Class Activation Mapping (Grad-CAM) results from the ResNet152 model. Since ResNet152 does not produce attention maps, Grad-CAM was applied to the last convolutional layer to highlight important image regions contributing to predictions. As a result, the ViT-base and ResNet152 models exhibit limited and less focused attention on the correct regions for predictions. MCX-Net without text input shows more comprehensive attention but includes noticeable noise outside the bilateral lungs and mediastinum. In contrast, MCX-Net with integrated text input demonstrates enhanced and more precise attention on the bilateral lung and mediastinum regions, as well as significant clinical history words (darker green backgrounds indicate greater textual attention), indicating improved interpretability and performance in identifying relevant clinical features. All cases were randomly chosen from the test set

Fig. 4.

Performance of different models on identifying different diseases in the test set

Improvement of Unimodal Image Multi-label Classification Through Integrated Text Modality in Model Training

To explore the additional effects of clinical information, we compared the performance of MCX-Net to that of MCX-Net trained with unimodal data (either X-rays or clinical histories) on the same additional test set, which consisted of only X-ray images without accompanying clinical histories. The results revealed that incorporating text modality during model training significantly improved the performance of downstream unimodal image multi-label classification, in terms of precision, recall, F1-score, and AUROC for several diseases, including cardiomegaly, edema, and pleural effusion (Table 3). For instance, this enhancement was evident across several key metrics, including macro recall (0.376 ± 0.013 vs. 0.322 ± 0.001, p = 0.018), F1 score (0.347 ± 0.011 vs. 0.279 ± 0.003, p = 0.009), and AUROC (0.740 ± 0.001 vs. 0.723 ± 0.002, p = 0.003). The network trained solely on text data (with image data masked) failed to converge and was excluded from the analysis. These results indicate the integration of clinical histories helps to refine attention, leading to better identification of relevant clinical features and improving overall model performance in multi-label classification tasks, even when only images are available during testing (Fig. 5).

Table 3.

Comparisons among different training modalities in the additional test set

| Label | Precision: Image + Text¹ | Precision: Image² | Recall: Image + Text¹ | Recall: Image² | F1-score: Image + Text¹ | F1-score: Image² | AUROC: Image + Text¹ | AUROC: Image² | Accuracy: Image + Text¹ | Accuracy: Image² |
|---|---|---|---|---|---|---|---|---|---|---|
| Atelectasis | 0.452 ± 0.009 | 0.442 ± 0.003 | 0.777 ± 0.041 | 0.771 ± 0.012 | 0.571 ± 0.018 | 0.562 ± 0.004 | 0.730 ± 0.006 | 0.725 ± 0.001 | 0.637 ± 0.008 | 0.625 ± 0.003 |
| Cardiomegaly | 0.571 ± 0.005** | 0.473 ± 0.001 | 0.618 ± 0.035** | 0.893 ± 0.013 | 0.593 ± 0.017 | 0.618 ± 0.002 | 0.781 ± 0.004 | 0.775 ± 0.000 | 0.726 ± 0.004*** | 0.642 ± 0.002 |
| Consolidation | 0.383 ± 0.173 | 0.000 ± 0.000 | 0.063 ± 0.034 | 0.000 ± 0.000 | 0.105 ± 0.047 | 0.000 ± 0.000 | 0.737 ± 0.010*** | 0.630 ± 0.008 | 0.901 ± 0.006 | 0.906 ± 0.000 |
| Edema | 0.560 ± 0.028* | 0.694 ± 0.004 | 0.595 ± 0.071* | 0.180 ± 0.035 | 0.574 ± 0.017** | 0.284 ± 0.044 | 0.798 ± 0.003** | 0.770 ± 0.000 | 0.757 ± 0.012 | 0.751 ± 0.006 |
| Enlarged cardiomediastinum | 0.167 ± 0.144 | 0.000 ± 0.000 | 0.011 ± 0.010 | 0.000 ± 0.000 | 0.021 ± 0.018 | 0.000 ± 0.000 | 0.646 ± 0.015 | 0.631 ± 0.005 | 0.926 ± 0.000 | 0.928 ± 0.000 |
| Fracture | 0.163 ± 0.152 | 0.000 ± 0.000 | 0.056 ± 0.048 | 0.000 ± 0.000 | 0.082 ± 0.071 | 0.000 ± 0.000 | 0.703 ± 0.018 | 0.728 ± 0.002 | 0.950 ± 0.005 | 0.957 ± 0.000 |
| Lung lesion | 0.112 ± 0.063 | 0.074 ± 0.006 | 0.092 ± 0.105 | 0.050 ± 0.012 | 0.078 ± 0.051 | 0.059 ± 0.011 | 0.660 ± 0.020 | 0.664 ± 0.002 | 0.902 ± 0.051 | 0.912 ± 0.005 |
| Lung opacity | 0.422 ± 0.030 | 0.468 ± 0.006 | 0.675 ± 0.034* | 0.512 ± 0.017 | 0.518 ± 0.014 | 0.489 ± 0.009 | 0.609 ± 0.016 | 0.644 ± 0.003 | 0.543 ± 0.049 | 0.612 ± 0.005 |
| Pleural effusion | 0.702 ± 0.032** | 0.890 ± 0.002 | 0.828 ± 0.036** | 0.440 ± 0.014 | 0.759 ± 0.007*** | 0.589 ± 0.013 | 0.860 ± 0.006 | 0.848 ± 0.001 | 0.753 ± 0.016* | 0.711 ± 0.005 |
| Pleural other | 0.119 ± 0.072 | 0.000 ± 0.000 | 0.078 ± 0.077 | 0.000 ± 0.000 | 0.066 ± 0.024* | 0.000 ± 0.000 | 0.800 ± 0.032 | 0.751 ± 0.005 | 0.933 ± 0.042 | 0.964 ± 0.000 |
| Pneumonia | 0.241 ± 0.009** | 0.548 ± 0.041 | 0.292 ± 0.136 | 0.032 ± 0.005 | 0.257 ± 0.057* | 0.061 ± 0.009 | 0.688 ± 0.009* | 0.640 ± 0.004 | 0.780 ± 0.033* | 0.865 ± 0.001 |
| Pneumothorax | 0.192 ± 0.053 | 0.118 ± 0.001 | 0.206 ± 0.131 | 0.460 ± 0.014 | 0.182 ± 0.051 | 0.188 ± 0.001 | 0.755 ± 0.018 | 0.723 ± 0.006 | 0.913 ± 0.026* | 0.801 ± 0.007 |
| Support devices | 0.855 ± 0.057 | 0.707 ± 0.022 | 0.601 ± 0.114 | 0.851 ± 0.016 | 0.699 ± 0.062 | 0.772 ± 0.007 | 0.847 ± 0.010 | 0.865 ± 0.000 | 0.766 ± 0.024 | 0.768 ± 0.013 |
| Macro/overall | 0.380 ± 0.032 | 0.340 ± 0.001 | 0.376 ± 0.013* | 0.322 ± 0.001 | 0.347 ± 0.011** | 0.279 ± 0.003 | 0.740 ± 0.001** | 0.723 ± 0.002 | 0.807 ± 0.006 | 0.803 ± 0.001 |

Note: MCX-Net, fine-tuned on training and validation sets with only text data (image data were masked), failed to converge

1MCX-Net was fine-tuned in multimodal train and validation sets

2MCX-Net was fine-tuned in train and validation sets with only image data (text data were masked)

*p < 0.05; **p < 0.01; ***p < 0.001. All p values were calculated by t-tests
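As a reading aid for Table 3, the sketch below shows one way per-label and macro-averaged metrics could be computed for a multi-label classifier. The labels, scores, and the 0.5 decision threshold are toy values chosen for illustration and are not taken from the study; the function names are hypothetical. It illustrates why a rare label can show zero precision and recall alongside high accuracy, while AUROC, which depends only on how the scores rank positives against negatives, remains informative.

```python
# Toy illustration only: labels, scores, and the 0.5 threshold are assumptions,
# not data or settings from the study.

def binary_metrics(y_true, y_pred):
    """Precision, recall, F1, and accuracy for one label (lists of 0/1)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    # Metrics with an empty denominator are reported as 0.0 by convention here.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / len(y_true)
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}


def auroc(y_true, scores):
    """Probability that a random positive outscores a random negative.

    Ties count as 0.5. Assumes at least one positive and one negative sample.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def macro_average(per_label):
    """Unweighted mean of each metric across labels."""
    return {k: sum(m[k] for m in per_label) / len(per_label)
            for k in per_label[0]}


# One rare label (1 positive in 10) and one common label (5 positives in 10).
rare_true = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
rare_score = [0.30, 0.10, 0.20, 0.05, 0.40, 0.15, 0.10, 0.25, 0.05, 0.20]
common_true = [1, 1, 1, 0, 0, 1, 0, 1, 0, 0]
common_score = [0.90, 0.80, 0.40, 0.60, 0.20, 0.70, 0.10, 0.55, 0.30, 0.45]

per_label = []
for y_true, scores in [(rare_true, rare_score), (common_true, common_score)]:
    y_pred = [1 if s >= 0.5 else 0 for s in scores]
    metrics = binary_metrics(y_true, y_pred)
    metrics["auroc"] = auroc(y_true, scores)
    per_label.append(metrics)

macro = macro_average(per_label)
print("rare label:", per_label[0])
print("macro:", macro)
```

In this toy case no score for the rare label reaches the threshold, so the label is never predicted: precision, recall, and F1 are zero while accuracy stays at 0.9, the same pattern as several image-only rows in Table 3.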

Fig. 5.

Attention visualization for unimodal and multimodal model training. This figure illustrates the attention maps of MCX-Net with different training modalities. The columns from left to right represent the original chest X-ray images with red boxes marking disease regions, attention maps from MCX-Net trained with only image input, and MCX-Net trained with both image and text (clinical history) inputs. The attention maps generated by MCX-Net trained solely on image data (second column) display more comprehensive attention but also exhibit noticeable noise outside the bilateral lungs and mediastinum. In contrast, the attention maps from MCX-Net trained with integrated clinical history inputs (third column) demonstrate enhanced and more precise attention focused on the bilateral lungs and mediastinum regions, as well as highlighting significant clinical history terms (with darker green backgrounds indicating greater textual attention). This emphasizes the supplementary role of clinical text information in improving model interpretability and performance. All cases were randomly chosen from the additional test set

Ablation Studies

Our proposed method achieved a macro precision of 0.433 ± 0.017, higher than the variant with a ResNet image encoder (0.407 ± 0.034). Interestingly, the distribution-balanced loss (DBL) and Re-sampling variants achieved higher macro precision values of 0.461 ± 0.015 and 0.468 ± 0.011, respectively. In terms of macro recall, our method significantly outperformed the baselines with a score of 0.521 ± 0.015 (p < 0.001); ResNet followed with 0.482 ± 0.005, while DBL and Re-sampling lagged behind with 0.296 ± 0.002 and 0.360 ± 0.009, respectively. Our method also achieved the highest macro F1-score of 0.451 ± 0.012 (p < 0.001), indicating a balance between precision and recall; ResNet scored 0.422 ± 0.024, while DBL and Re-sampling scored 0.314 ± 0.007 and 0.386 ± 0.008, respectively. For macro AUROC, our method was significantly superior with a score of 0.816 ± 0.004 (p < 0.001), followed by ResNet with 0.796 ± 0.007, Re-sampling with 0.769 ± 0.001, and DBL with 0.764 ± 0.003. Finally, our method achieved an overall accuracy of 0.839 ± 0.004, slightly lower than DBL (0.851 ± 0.004) and Re-sampling (0.853 ± 0.000) but higher than ResNet (0.831 ± 0.005). The details are presented in Table 4. Overall, our method performed robustly across metrics, excelling in macro recall, F1-score, and AUROC, which we attribute to the improved image encoder and loss function.

Table 4.

Ablation studies in the test set

| Method | Macro precision | Macro recall | Macro F1-score | Macro AUROC | Overall accuracy |
| --- | --- | --- | --- | --- | --- |
| Ours¹ | 0.433 ± 0.017 | 0.521 ± 0.015*** | 0.451 ± 0.012*** | 0.816 ± 0.004*** | 0.839 ± 0.004 |
| ResNet² | 0.407 ± 0.034 | 0.482 ± 0.005 | 0.422 ± 0.024 | 0.796 ± 0.007 | 0.831 ± 0.005 |
| DBL³ | 0.461 ± 0.015 | 0.296 ± 0.002 | 0.314 ± 0.007 | 0.764 ± 0.003 | 0.851 ± 0.004 |
| Re-sampling⁴ | 0.468 ± 0.011 | 0.360 ± 0.009 | 0.386 ± 0.008 | 0.769 ± 0.001 | 0.853 ± 0.000 |

¹ Our proposed network and training settings

² Our proposed network in which the image encoder was replaced with ResNet152

³ Our proposed network in which the loss was replaced with distribution-balanced loss (DBL)

⁴ Our proposed network in which class-balanced re-sampling was applied

***p < 0.001, which was calculated by t-tests with Bonferroni corrections
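Table 4 reports t-tests with Bonferroni corrections. The text does not state which t-test variant was used or how many comparisons were corrected for, so the sketch below assumes Welch's unequal-variance form and uses made-up per-run scores. It computes the test statistic and degrees of freedom with the standard library and applies the Bonferroni adjustment to a list of hypothetical p-values. Converting the statistic to a p-value needs a t-distribution CDF (for example from a statistics package), which is left out here.

```python
from math import sqrt
from statistics import mean, stdev


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom.

    Assumption: an unequal-variance two-sample test; the paper only says
    "t-tests", so this choice is illustrative.
    """
    va = stdev(a) ** 2 / len(a)  # squared standard error of group a
    vb = stdev(b) ** 2 / len(b)  # squared standard error of group b
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def bonferroni(p_values):
    """Multiply each p-value by the number of comparisons, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Hypothetical macro-AUROC values from three repeated runs of two methods.
runs_a = [0.812, 0.816, 0.820]
runs_b = [0.789, 0.796, 0.803]

t_stat, dof = welch_t(runs_a, runs_b)
print(f"mean A = {mean(runs_a):.3f} ± {stdev(runs_a):.3f}")
print(f"mean B = {mean(runs_b):.3f} ± {stdev(runs_b):.3f}")
print(f"Welch t = {t_stat:.2f}, df = {dof:.2f}")

# Hypothetical raw p-values for three pairwise comparisons against one method.
print(bonferroni([0.0002, 0.01, 0.04]))
```

Bonferroni is the most conservative of the common corrections: with three comparisons, a raw p-value must fall below 0.05 / 3 to remain significant at the 0.05 level.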

Discussion

This study aimed to assess the impact of integrating clinical history information into a multimodal model “MCX-Net” for improving the diagnosis of chest X-ray images. Our findings indicate that the inclusion of text data of clinical history in the training phase significantly enhances the performance of multi-disease diagnosis. Notably, MCX-Net achieved superior results across various metrics, including macro precision, recall, F1-score, and AUROC, compared to conventional unimodal models trained solely on image data. Ablation studies further highlighted the robustness and effectiveness of our proposed method, particularly in terms of macro recall and F1-score.

One recent study developed a ViT-based multimodal network that integrates chest X-rays, quantitative clinical parameters (e.g., blood pressure, heart rate, and Glasgow Coma Scale), and laboratory parameters (e.g., C-reactive protein level and leukocyte count), showing improved diagnostic performance for up to 25 pathologic conditions compared with models using only one data type [53]. This multimodal approach outperformed unimodal models in diagnosing diseases in ICU patients, as evidenced by higher AUROC scores [53]. Quantitative parameters are relatively straightforward to integrate during model training, but combining textual clinical information, such as clinical history, with other modalities remains a challenge [41, 54]. This study goes a step further by incorporating free-text clinical history, which is easy to obtain in clinical practice, through an intermediate fusion strategy built on advanced pretrained backbones: a text encoder, an image encoder, and an image-text cross-modal encoder. Intermediate fusion balances complexity by integrating features from different modalities at an abstract level, capturing modality-specific interactions and leading to better generalization [35]. It also offers flexibility in feature extraction prior to integration, so that any encoder can be replaced with a state-of-the-art component in the future [35]. Our results demonstrate a significant advantage of incorporating clinical history into the training of MCX-Net over conventional unimodal deep-learning models, underscoring the value of a multimodal approach. The integrated text modality probably provided richer context, which likely contributed to better regional attention and overall model performance. This was particularly evident in the improved recall, F1-score, and AUROC of the multimodal model.
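To make the idea of intermediate fusion concrete, the toy sketch below concatenates image tokens and text tokens, as if they came from two separate encoders, and passes them through one shared self-attention step so that every token can attend across modalities. This is not the MCX-Net implementation: the token vectors are hand-written placeholders, the attention uses identity projections instead of learned weights, and the function names are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head self-attention with identity Q/K/V projections.

    Illustration only: a real cross-modal encoder learns its projections.
    """
    d = len(tokens[0])
    outputs, weights = [], []
    for q in tokens:
        # Scaled dot-product score of this query against every token.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in tokens]
        w = softmax(scores)
        weights.append(w)
        outputs.append([sum(wj * v[i] for wj, v in zip(w, tokens))
                        for i in range(d)])
    return outputs, weights


def intermediate_fusion(image_tokens, text_tokens):
    """Fuse at feature level: joint attention over both token sequences."""
    fused, weights = self_attention(image_tokens + text_tokens)
    d = len(fused[0])
    # Mean-pool the fused tokens into one vector for a classifier head.
    pooled = [sum(tok[i] for tok in fused) / len(fused) for i in range(d)]
    return pooled, weights


# Placeholder features standing in for encoder outputs (not real embeddings).
image_tokens = [[1.0, 0.0], [0.0, 1.0]]
text_tokens = [[1.0, 1.0]]

pooled, weights = intermediate_fusion(image_tokens, text_tokens)
n_img = len(image_tokens)
# Attention that each image token pays to the text tokens.
image_to_text = [row[n_img:] for row in weights[:n_img]]
print("pooled:", pooled)
print("image-to-text attention:", image_to_text)
```

Because fusion happens on encoder outputs and not on raw inputs or final predictions, either encoder could in principle be swapped without changing the fusion step, which is the flexibility described above.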

Specifically, the integration of text modality into MCX-Net significantly enhanced unimodal image multi-label classification performance, as evidenced by substantial improvements in precision, recall, F1-score, and AUROC for several diseases such as cardiomegaly, edema, and pleural effusion. Notably, the model trained exclusively on text data failed to converge, underscoring the critical role of visual information. These findings suggest that incorporating clinical history information provides supplementary contextual information, potentially refining the model’s focus on relevant image regions.

Ablation studies provided further insight into the contributions of individual components of our proposed method. Our proposed model with a BEIT-v2 image encoder outperformed the variant with a ResNet152 image encoder in macro recall, F1-score, and AUROC (Table 4), suggesting advances in image encoding capability [39]. This advantage is likely attributable to BEIT-v2’s ability to project visual information more effectively into the associated text token space, facilitating better alignment between image and text modalities [39]. Convolutional neural networks remain the mainstay for extracting visual features; however, because of the limited receptive fields of convolutional kernels, they tend to emphasize local over global feature information [15]. BEIT-v2 image encoders instead use self-attention to obtain a global receptive field, capturing long-range dependencies more effectively [39]. This, combined with a weaker inductive bias, makes them more adaptable to diverse image datasets and tasks. Additionally, the BEIT-v2 image encoder often exhibits better computational efficiency and superior feature learning, leading to more accurate and informative image representations [39]. While distribution-balanced loss (DBL) and class-balanced re-sampling yielded higher precision, asymmetric loss (ASL) excelled in recall and F1-score, indicating a more balanced performance; ASL was specifically designed to adaptively address class imbalance, especially in multi-label classification [55]. By focusing on hard negative and positive samples, ASL helps the model learn more discriminative features [55]. The accuracy with ASL (0.839) was slightly lower than with DBL (0.851) and Re-sampling (0.853). This is due to the highly imbalanced dataset, where higher accuracy can result from predominantly assigning samples to the head class (“No findings”), despite increasing loss [7]. Therefore, the best model in this study was selected by maximizing macro AUROC on the validation set rather than classification accuracy [7]. As a result, the high macro AUROC achieved by MCX-Net with ASL further supports its effectiveness in handling the complexities of multi-label classification.
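The behaviour attributed to ASL above can be made concrete with a small sketch of the asymmetric loss as defined by Ridnik et al. [55]. Positives use a focal-style term with exponent γ+, while negatives use a shifted probability p_m = max(p − m, 0) and a larger exponent γ−, so easy negatives contribute almost nothing. The hyperparameter values below (γ+ = 0, γ− = 4, m = 0.05) are assumed for illustration; the values used to train MCX-Net are not stated in this section, and the function name is hypothetical.

```python
import math


def asymmetric_loss(probs, targets, gamma_pos=0.0, gamma_neg=4.0,
                    margin=0.05, eps=1e-8):
    """Sum of per-label asymmetric losses for one sample.

    probs:   predicted probabilities in [0, 1], one per label
    targets: 0/1 ground-truth labels
    Hyperparameter defaults are illustrative assumptions.
    """
    total = 0.0
    for p, y in zip(probs, targets):
        if y == 1:
            # Positive term: (1 - p)^gamma_pos * log(p)
            total += -((1.0 - p) ** gamma_pos) * math.log(max(p, eps))
        else:
            # Probability shifting: negatives scored below the margin
            # are treated as p_m = 0 and add no loss.
            p_m = max(p - margin, 0.0)
            # Negative term: p_m^gamma_neg * log(1 - p_m)
            total += -(p_m ** gamma_neg) * math.log(max(1.0 - p_m, eps))
    return total


# An easy negative (p = 0.2) versus a hard negative (p = 0.9).
easy = asymmetric_loss([0.2], [0])
hard = asymmetric_loss([0.9], [0])
print(f"easy negative loss: {easy:.6f}")
print(f"hard negative loss: {hard:.6f}")

# With both exponents and the margin set to zero, the loss reduces to
# ordinary binary cross-entropy.
print(asymmetric_loss([0.7, 0.2], [1, 0],
                      gamma_pos=0.0, gamma_neg=0.0, margin=0.0))
```

Down-weighting the many easy negatives is what keeps rare positive labels from being swamped in a long-tailed multi-label dataset, at the cost of one extra margin hyperparameter and two exponents to tune.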

This study has several limitations. First, it was conducted using a specific dataset, which may limit the generalizability of the findings to other populations or imaging modalities. While many public datasets of chest X-rays are available, few include detailed clinical histories. Future research should establish more comprehensive datasets that include both medical images and extensive clinical information to validate and extend our findings. Second, the reliance on text data for model training means that the quality and completeness of clinical histories significantly impact performance. In the MIMIC-CXR-JPG dataset, many abbreviations were used instead of full-text names (e.g., “___F with DKA”, “___F with RUQ pain s/p ERCP w/stent placement”), suggesting that providing complete descriptions of clinical history might improve the performance of MCX-Net, especially when applying the BERT text encoder from transfer learning.

In conclusion, the integration of clinical history into the multimodal deep-learning model significantly enhances the performance of chest X-ray image classification. The results underscore the potential of multimodal approaches to improve diagnostic accuracy, highlighting the importance of considering both image and clinical text data in medical imaging tasks.

Author Contribution

Lian Yang: conceptualization, methodology, data collection, validation, formal analysis, investigation, resources, writing—original draft, funding acquisition; Yiliang Wan: conceptualization, methodology, software, validation, formal analysis, investigation, data curation, writing—review and editing, visualization; Feng Pan: conceptualization, supervision, writing—original draft, writing—review and editing, project administration, funding acquisition. All authors have read and approved the final manuscript. They agree to be accountable for all aspects of the work to ensure that questions related to the accuracy or integrity of any part are appropriately investigated and resolved.

Funding

This work was supported by the National Natural Science Foundation of China (grant numbers: 82272083 and 82172034) and the Key Project of Natural Science Foundation of Hubei, China (grant number: 2023BCB014).

Data Availability

The MIMIC-CXR-JPG dataset (Version: 2.0.0, Published: Sept. 19, 2019) used in this study is available at https://doi.org/10.13026/C2JT1Q. The code for the proposed model is available at https://github.com/Rad-HUST/VLP_X-ray. The implementation code of attention visualization for Transformers is publicly available at https://github.com/hila-chefer/Transformer-MM-Explainability.

Declarations

Ethics Approval

This study utilized the publicly available MIMIC-CXR-JPG dataset (Version: 2.0.0, Published: Sept. 19, 2019), which is governed by the PhysioNet Credentialed Health Data License Version 1.5.0. The dataset is provided by the MIT Laboratory for Computational Physiology for research and educational purposes, ensuring compliance with stringent data protection and privacy guidelines. The authors adhered to these guidelines, ensuring ethical use of the data in accordance with the terms set by MIT-LCP. This adherence to ethical standards underscores our commitment to maintaining the confidentiality and privacy of individuals represented in the dataset.

Conflict of Interest

The authors declare no competing interests.

Declaration of Generative AI in Scientific Writing

No generative AI tool was used in the writing process.

Footnotes

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  • 1. Jacobi A, Chung M, Bernheim A et al. (2020) Portable chest X-ray in coronavirus disease-19 (COVID-19): A pictorial review. Clin Imaging 64:35-42. 10.1016/j.clinimag.2020.04.001
  • 2. Pan F, Li L, Liu B et al. (2021) A novel deep learning-based quantification of serial chest computed tomography in Coronavirus Disease 2019 (COVID-19). Sci Rep 11:417. 10.1038/s41598-020-80261-w
  • 3. Roberts M, Driggs D, Thorpe M et al. (2021) Common pitfalls and recommendations for using machine learning to detect and prognosticate for COVID-19 using chest radiographs and CT scans. Nature Machine Intelligence 3:199-217. 10.1038/s42256-021-00307-0
  • 4. Lee CS, Nagy PG, Weaver SJ et al. (2013) Cognitive and system factors contributing to diagnostic errors in radiology. AJR Am J Roentgenol 201:611-617. 10.2214/AJR.12.10375
  • 5. Brady AP (2017) Error and discrepancy in radiology: inevitable or avoidable? Insights Imaging 8:171-182. 10.1007/s13244-016-0534-1
  • 6. Seah JCY, Tang CHM, Buchlak QD et al. (2021) Effect of a comprehensive deep-learning model on the accuracy of chest x-ray interpretation by radiologists: a retrospective, multireader multicase study. Lancet Digit Health 3:e496-e506. 10.1016/S2589-7500(21)00106-0
  • 7. Chen Y, Wan Y, Pan F (2023) Enhancing Multi-disease Diagnosis of Chest X-rays with Advanced Deep-learning Networks in Real-world Data. J Digit Imaging 36:1332-1347. 10.1007/s10278-023-00801-4
  • 8. Jin Y, Lu H, Zhu W et al. (2023) Deep learning based classification of multi-label chest X-ray images via dual-weighted metric loss. Comput Biol Med 157:106683. 10.1016/j.compbiomed.2023.106683
  • 9. Bankier AA, Macmahon H, Colby T et al. (2024) Fleischner Society: Glossary of Terms for Thoracic Imaging. Radiology 310:e232558. 10.1148/radiol.232558
  • 10. Sajid NA, Rahman A, Ahmad M et al. (2023) Single vs. Multi-Label: The Issues, Challenges and Insights of Contemporary Classification Schemes. Applied Sciences 13:6804. 10.3390/app13116804
  • 11. Garg M, Prabhakar N, Kiruthika P et al. (2017) Imaging of Pneumonia: An Overview. Current Radiology Reports 5:16. 10.1007/s40134-017-0209-9
  • 12. Hollings N, Shaw P (2002) Diagnostic imaging of lung cancer. European Respiratory Journal 19:722-742. 10.1183/09031936.02.00280002
  • 13. Panunzio A, Sartori P (2020) Lung Cancer and Radiological Imaging. Curr Radiopharm 13:238-242. 10.2174/1874471013666200523161849
  • 14. Bassi P, Dertkigil SSJ, Cavalli A (2024) Improving deep neural network generalization and robustness to background bias via layer-wise relevance propagation optimization. Nat Commun 15:291. 10.1038/s41467-023-44371-z
  • 15. Nazir S, Dickson DM, Akram MU (2023) Survey of explainable artificial intelligence techniques for biomedical imaging with deep neural networks. Computers in Biology and Medicine 156:106668. 10.1016/j.compbiomed.2023.106668
  • 16. Rahman S, Sarker S, Miraj MaA et al. (2021) Deep learning–driven automated detection of Covid-19 from radiography images: A comparative analysis. Cognitive Computation:1–30. 10.1007/s12559-020-09779-5
  • 17. Cha S-M, Lee S-S, Ko B (2021) Attention-Based transfer learning for efficient pneumonia detection in chest X-ray images. Applied Sciences 11:1242. 10.3390/app11031242
  • 18. El-Dahshan E-SA, Bassiouni MM, Hagag A et al. (2022) RESCOVIDTCNnet: A residual neural network-based framework for COVID-19 detection using TCN and EWT with chest X-ray images. Expert Systems with Applications 204:117410. 10.1016/j.eswa.2022.117410
  • 19. Fan Y, Liu J, Yao R et al. (2021) COVID-19 detection from X-ray images using multi-kernel-size spatial-channel attention network. Pattern Recognition 119:108055. 10.1016/j.patcog.2021.108055
  • 20. Li X, Shen L, Xie X et al. (2020) Multi-resolution convolutional networks for chest X-ray radiograph based lung nodule detection. Artificial intelligence in medicine 103:101744. 10.1016/j.artmed.2019.101744
  • 21. Lin T-Y, Dollár P, Girshick R et al. (2017) Feature pyramid networks for object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition. p 2117–2125. 10.48550/arXiv.1612.03144
  • 22. Xu S, Lu H, Ye M et al. (2020) Improved cascade R-CNN for medical images of pulmonary nodules detection combining dilated HRNet. In: Proceedings of the 2020 12th international conference on machine learning and computing. p 283–288. 10.1145/3383972.3384070
  • 23. Castillo C, Steffens T, Sim L et al. (2021) The effect of clinical information on radiology reporting: A systematic review. J Med Radiat Sci 68:60-74. 10.1002/jmrs.424
  • 24. Doubilet P, Herman PG (1981) Interpretation of radiographs: effect of clinical history. AJR Am J Roentgenol 137:1055-1058. 10.2214/ajr.137.5.1055
  • 25. Leslie A, Jones AJ, Goddard PR (2000) The influence of clinical information on the reporting of CT by radiologists. Br J Radiol 73:1052-1055. 10.1259/bjr.73.874.11271897
  • 26. Maizlin NN, Somers S (2019) The Role of Clinical History Collected by Diagnostic Imaging Staff in Interpreting of Imaging Examinations. J Med Imaging Radiat Sci 50:31-35. 10.1016/j.jmir.2018.07.009
  • 27. Yapp KE, Brennan P, Ekpo E (2022) The Effect of Clinical History on Diagnostic Imaging Interpretation - A Systematic Review. Acad Radiol 29:255-266. 10.1016/j.acra.2020.10.021
  • 28. Shende P, Augustine S, Prabhakar B et al. (2019) Advanced multimodal diagnostic approaches for detection of lung cancer. Expert Rev Mol Diagn 19:409-417. 10.1080/14737159.2019.1607299
  • 29. Zhang H, Xu C, Liang P et al. (2022) MMLN: Leveraging Domain Knowledge for Multimodal Diagnosis. In: Bansal MS, Cai Z, Mangul S (eds) Bioinformatics Research and Applications. Springer Nature Switzerland, Cham, p 192–203. 10.1007/978-3-031-23198-8_18
  • 30. Chen Y, Pan F (2022) Multimodal detection of hateful memes by applying a vision-language pre-training model. PLoS One 17:e0274300. 10.1371/journal.pone.0274300
  • 31. Devlin J, Chang M-W, Lee K, Toutanova K (2019) BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of NAACL-HLT. p 2. 10.48550/arXiv.1810.04805
  • 32. Kiela D, Bhooshan S, Firooz H et al. (2019) Supervised multimodal bitransformers for classifying images and text. arXiv preprint arXiv:1909.02950
  • 33. Kiela D, Firooz H, Mohan A et al. (2020) The hateful memes challenge: Detecting hate speech in multimodal memes. Advances in neural information processing systems 33:2611–2624. https://proceedings.neurips.cc/paper_files/paper/2020/file/1b84c4cee2b8b3d823b30e2d604b1878-Paper.pdf
  • 34. Lu J, Batra D, Parikh D et al. (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Advances in neural information processing systems 32. 10.48550/arXiv.1908.02265
  • 35. Boulahia SY, Amamra A, Madi MR et al. (2021) Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32:121. 10.1007/s00138-021-01249-8
  • 36. Wang W, Bao H, Dong L et al. (2022) Image as a foreign language: Beit pretraining for all vision and vision-language tasks. arXiv preprint arXiv:2208.10442
  • 37. Johnson AEW, Pollard TJ, Berkowitz SJ et al. (2019) MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Sci Data 6:317. 10.1038/s41597-019-0322-0
  • 38. Dosovitskiy A, Beyer L, Kolesnikov A et al. (2020) An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929. 10.48550/arXiv.2010.11929
  • 39. Peng Z, Dong L, Bao H et al. (2022) Beit v2: Masked image modeling with vector-quantized visual tokenizers. arXiv preprint arXiv:2208.06366
  • 40. Bao H, Dong L, Piao S et al. (2021) Beit: Bert pre-training of image transformers. arXiv preprint arXiv:2106.08254
  • 41. Li X, Yin X, Li C et al. (2020) Oscar: Object-semantics aligned pre-training for vision-language tasks. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings. Lecture Notes in Computer Science. Springer, Cham, p 121–137. 10.1007/978-3-030-58577-8_8
  • 42. Zeng Z, Cao J, Weng N et al. (2021) Softmax Pooling for Super Visual Semantic Embedding. In: 2021 IEEE 12th Annual Information Technology, Electronics and Mobile Communication Conference (IEMCON). p 0258–0265. 10.1109/IEMCON53756.2021.9623131
  • 43. Cubuk ED, Zoph B, Mane D et al. (2019) Autoaugment: Learning augmentation strategies from data. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR):113–123. 10.48550/arXiv.1805.09501
  • 44. Ferguson AR, Nielson JL, Cragin MH et al. (2014) Big data from small data: data-sharing in the 'long tail' of neuroscience. Nature Neuroscience 17:1442-1447. 10.1038/nn.3838
  • 45. Ridnik T, Ben-Baruch E, Zamir N et al. (2021) Asymmetric loss for multi-label classification. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV):82–91. 10.48550/arXiv.2009.14119
  • 46. Wang X, Peng Y, Lu L et al. (2017) ChestX-ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR):2097–2106. 10.1109/CVPR.2017.369
  • 47. Shi J-X, Wei T, Xiang Y et al. (2023) How Re-sampling Helps for Long-Tail Learning? Advances in Neural Information Processing Systems 36. https://proceedings.neurips.cc/paper_files/paper/2023/file/eeffa70bcbbd43f6bd067edebc6595e8-Paper-Conference.pdf
  • 48. Tran TT, Pham HH, Nguyen TV et al. (2021) Learning to automatically diagnose multiple diseases in pediatric chest radiographs using deep convolutional neural networks. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. p 3314–3323. 10.48550/arXiv.2108.06486
  • 49. Wu T, Huang Q, Liu Z et al. (2020) Distribution-balanced loss for multi-label classification in long-tailed datasets. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part IV 16. Springer, p 162–178. 10.1007/978-3-030-58548-8_10
  • 50. Chefer H, Gur S, Wolf L (2021) Generic attention-model explainability for interpreting bi-modal and encoder-decoder transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. p 397–406. 10.48550/arXiv.2103.15679
  • 51. Chefer H, Gur S, Wolf L (2021) Transformer interpretability beyond attention visualization. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. p 782–791. 10.48550/arXiv.2012.09838
  • 52. Tan X, Pan F, Zhan N et al. (2024) Multimodal integration to identify the invasion status of lung adenocarcinoma intraoperatively. iScience 27:111421. 10.1016/j.isci.2024.111421
  • 53. Khader F, Muller-Franzes G, Wang T et al. (2023) Multimodal Deep Learning for Integrating Chest Radiographs and Clinical Parameters: A Case for Transformers. Radiology 309:e230806. 10.1148/radiol.230806
  • 54. Zhang P, Li X, Hu X et al. (2021) Vinvl: Making visual representations matter in vision-language models. arXiv preprint arXiv:2101.00529 1:8.
  • 55. Ridnik T, Ben-Baruch E, Zamir N et al. (2021) Asymmetric loss for multi-label classification. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. p 82–91. https://openaccess.thecvf.com/content/ICCV2021/papers/Ridnik_Asymmetric_Loss_for_Multi-Label_Classification_ICCV_2021_paper.pdf


Articles from Journal of Imaging Informatics in Medicine are provided here courtesy of Springer
