This is a preprint.

It has not yet been peer reviewed by a journal.

[Preprint]. 2024 Aug 30:rs.3.rs-4837662. [Version 1] doi: 10.21203/rs.3.rs-4837662/v1

Joint Imbalance Adaptation for Radiology Report Generation

Yuexin Wu 1, I-Chan Huang 2, Xiaolei Huang 1,*
PMCID: PMC11384792  PMID: 39257991

Abstract

Purpose:

Radiology report generation, which translates radiological images into precise and clinically relevant descriptions, may face a data imbalance challenge: medical tokens appear less frequently than regular tokens, and normal entries significantly outnumber abnormal ones. However, very few studies consider these imbalance issues, and fewer still consider the imbalance factors jointly.

Methods:

In this study, we propose a Joint Imbalance Adaptation (JIMA) model to promote task robustness by leveraging token and label imbalance. JIMA predicts entity distributions from images and generates reports based on these distributions and image features. We employ a hard-to-easy learning strategy that mitigates overfitting to frequent labels and tokens, thereby encouraging the model to focus more on rare labels and clinical tokens.

Results:

JIMA shows notable improvements (16.75% - 50.50% on average) across evaluation metrics on the IU X-ray and MIMIC-CXR datasets. Our ablation analysis indicates that JIMA’s enhanced handling of infrequent tokens and abnormal labels accounts for the major contribution. Human evaluation and case study experiments further support that JIMA can generate more clinically accurate reports.

Conclusion:

Data imbalance (e.g., infrequent tokens and abnormal labels) leads to the underperformance of radiology report generation. Our curriculum learning strategy reduces the impact of data imbalance by reducing overfitting on frequent patterns and underfitting on infrequent patterns. While data imbalance remains challenging, our approach opens new directions for the generation task.

Keywords: Data imbalance, Radiology report generation, Curriculum learning, Model robustness

1. Introduction

Radiology report generation is a multimodal medical image-to-text task that generates text descriptions for radiographs (e.g., X-ray or CT scans), which may reduce the workload of radiologists [1, 2]. The task has unique characteristics compared with general image-to-text tasks (e.g., image captioning), such as lengthy medical notes, medical annotations, and clinical terminology. As demonstrated in Figure 1, data imbalance can significantly impair model robustness and thereby hinder model deployment in practice, because models can easily overfit on frequent patterns. However, modeling data imbalance to improve the robustness of radiology report generation is understudied.

Fig. 1.

State-of-the-art model performance on normal and abnormal entries by BLEU-4 (left two) and low- and high-frequent tokens by F1 scores (right two).

Two major data imbalances exist in the radiology report generation task: label and token. Label imbalance refers to a disproportionate ratio of normal to abnormal diagnosis categories, which exists in both radiological images and text reports. For instance, normal cases (images and reports) dominate radiology data, which can easily lead to underperformance in disease detection and professional description. As shown in Table 1, abnormal reports are considerably longer than normal reports, yet they can account for less than 15% of the data. These abnormal reports are much harder to generate than shorter reports [35], and generation can be worse still because there are fewer abnormal samples than normal ones. Existing imbalance learning studies of radiology report generation primarily focus on label imbalance [7, 8]. Token imbalance, in which tokens have widely varying occurrence frequencies, is a critical challenge in generation, and the issue is more severe in the medical task. Learning infrequent tokens can be harder than learning frequent tokens for generation models [9, 10]. Medical tokens appear less frequently than regular ones, and the infrequent tokens may carry more of the medical findings, which highlights the distinctive challenge of this task. Figure 1 illustrates the learning progress of the state-of-the-art (SOTA) model RRG [11] in predicting a report with predominantly normal diagnoses. The model shows strong performance on normal cases but struggles on abnormal reports.

Table 1.

Data statistics summary. Variations exist in label (Normal and Abnormal %) and average report length ($L$).

| Dataset | Image | Report | Vocab | Abnormal % | Normal % | $L$ | $L_{normal}$ | $L_{abnormal}$ |
|---|---|---|---|---|---|---|---|---|
| IU X-ray | 7,470 | 3,955 | 1,517 | 32.96% | 67.04% | 35.99 | 27.76 | 40.72 |
| MIMIC-CXR | 377,110 | 227,835 | 13,876 | 13.97% | 86.03% | 59.70 | 34.57 | 59.36 |

To promote the quality of generated reports, we propose the Joint Imbalance Adaptation (JIMA) model, which is based on curriculum learning [12]. JIMA automatically guides the model learning process by leveraging optimization difficulties, strengthening learning capability on infrequent samples, and alleviating overfitting on frequent patterns for both labels and tokens. We incorporate the token and label metrics into a joint optimization and design a novel Training Scheduler that samples and sorts training instances with a multi-aspect scoring mechanism. The scheduler automatically adjusts the training samples when model performance varies across multiple imbalance factors. We conduct experiments on two publicly available datasets, MIMIC-CXR [13] and IU X-ray [14], with automatic and human evaluations. Compared with up to seven state-of-the-art (SOTA) baselines under overall and imbalance performance settings, our approach shows promising results. Our ablation and qualitative analyses show that JIMA can generate more precise medical reports, alleviating label and token imbalance.

2. Data

We use two publicly accessible, de-identified chest X-ray datasets, IU X-ray [14] and MIMIC-CXR [13], to evaluate radiology report generation. IU X-ray [14], collected from the Indiana Network for Patient Care, includes 7,470 X-ray images and 3,955 corresponding radiology reports. MIMIC-CXR [13], collected from the Beth Israel Deaconess Medical Center, contains 377,110 X-ray images and 227,827 radiology reports for 65,379 patients. Each report is a text document associated with one or more frontal and lateral X-ray images. Table 1 summarizes statistics of data imbalance, and Figure 2 visualizes the distributions of frequent (ranked in the top 12.5% of the vocabulary) and infrequent tokens. We include preprocessing details in Appendix A.

Fig. 2.

Frequent and infrequent token distributions conditioning on report label.

Table 1 presents imbalance patterns in tokens and labels. Abnormal entries are predominant in both datasets, and MIMIC-CXR displays a more skewed label distribution, as more abnormal samples were collected during diagnosis phases rather than for screening purposes. MIMIC-CXR has a longer average report length than IU X-ray. The lengthier documents may pose a unique multimodal generation challenge in the medical field. For our analysis, we define low- and high-frequency tokens using the top 12.5% most frequent tokens as the threshold. Figure 2 suggests a joint relation between label and token imbalance, with higher ratios of low-frequency tokens in abnormal reports. This observation motivates us to investigate how the imbalance impacts model robustness and reliability.

2.1. Imbalance Effects

We examine the potential impact of label and token imbalance on model performance. To ensure consistency, we keep the top-12.5% threshold to split low- and high-frequency tokens for evaluation purposes. The analysis includes three state-of-the-art models, R2Gen [15], WCL [16], and CMN [17]. We use BLEU-4 [18] and F1 scores to measure performance across both token (low vs. high frequency) and label (normal vs. abnormal) imbalance. We visualize performance variations in Figure 1.

The results suggest that the models have significant difficulty coping with label and token imbalance. Models consistently perform worse on abnormal reports, which are lengthier and contain more infrequent tokens than normal reports. For example, the top 12.5% most frequent tokens account for more than 80% of token occurrences in the two datasets, and performance on low-frequency tokens is much worse than on frequent tokens, as infrequent tokens are harder to optimize [19]. However, infrequent tokens contain higher ratios of medical terms (e.g., silhouettes and pulmonary) describing health states. The significantly varying performance highlights the unique challenges of adapting to token and label imbalance. While existing work [7] has considered label imbalance, that study did not examine the performance effects of label or token imbalance. These findings inspire us to propose our model, Joint Imbalance Adaptation (JIMA), to model token and label imbalance.

3. Joint Imbalance Adaptation

In this section, we present our approach, Joint Imbalance Adaptation (JIMA), shown in Fig. 3, which uses curriculum learning. JIMA aims to improve model robustness under label and token imbalance. As optimizing under data imbalance has been shown to be difficult, deploying such a learning strategy can strengthen model robustness and reliability. Our approach deploys curriculum learning (CL) [20], which automatically adjusts the optimization process by gradually selecting training data entries according to learning difficulty; we learn from hard to easy samples as our optimization strategy [21]. To achieve this goal, we design two major CL modules: a difficulty measurer for assessing the difficulty of samples, and a training scheduler for determining the percentage of training data. We then apply our CL training strategy to two tasks. Task 1 predicts entities from the images, and Task 2 generates a report from the images’ features and the entity distribution.

Fig. 3.

JIMA has two tasks. Task 1 aims to predict the entity distribution from images, and Task 2 aims to generate a report from the image’s features and the entity distribution. We assign one color per task and use solid arrows for workflows.

The difficulty measurer measures sample difficulty. To diversify learning aspects and jointly incorporate imbalance factors, we propose a novel measurement to improve model performance over imbalance patterns. Our measurement adopts a competitive mechanism that encourages correct options to rank higher than incorrect ones, rather than independently increasing the likelihood of correct options and decreasing the likelihood of incorrect options. This approach helps mitigate overfitting on common samples and underfitting on rare samples, since it focuses on the ranking of the correct option rather than on prediction confidence. Specifically, given a reference token $z$, a vocabulary list $V$, and the prediction $p \in \mathbb{R}^{|V|}$, we calculate the probability ranking of the reference token $z$ in the prediction $p$ as follows,

$$k = \mathrm{Rank}(p, p[z]) / |V| \tag{1}$$

where $|V|$ is the vocabulary size. $\mathrm{Rank}(p, p[z])$ sorts $p$ in descending order and returns the position of $p[z]$ within this ranking. $k$ ranges from 0 to 1 after normalization by $|V|$. A higher value of $k$ indicates that the sample is more difficult. We then feed the difficulty information to the next step, the Training Scheduler.
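As a minimal sketch of Eq. 1, the following pure-Python function computes the normalized rank of a reference token. The function name `rank_difficulty`, the 1-based rank, and the tie-breaking rule (only strictly larger probabilities count as ranked ahead) are assumptions made for illustration; the text does not specify them.

```python
from typing import Sequence


def rank_difficulty(p: Sequence[float], z: int) -> float:
    """Normalized rank k of the reference token z in the prediction p (Eq. 1)."""
    # Assumption: the rank is 1-based, and only entries strictly larger than
    # p[z] are counted as ranked ahead of it (ties do not increase the rank).
    rank = 1 + sum(1 for value in p if value > p[z])
    return rank / len(p)


# Toy prediction over a 3-token vocabulary (hypothetical values).
p = [0.1, 0.6, 0.3]
k_easy = rank_difficulty(p, z=1)  # reference token has the highest probability
k_hard = rank_difficulty(p, z=0)  # reference token has the lowest probability
```

With this convention, a reference token ranked first yields the smallest possible $k$ (here $1/3$), and a reference token ranked last yields $k = 1$.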

The training scheduler aims to automatically counteract imbalance effects by selecting training samples via the difficulty measurers. Our goal is to increase the number of easier samples when performance decreases, and vice versa. Accordingly, we design our scheduler function $c_{s_t}$ as follows:

$$c_{s_t} = \min\left(1, \left[1 - \frac{s_t - s_{t-1}}{s_{t-1}}\right] \times c_{s_{t-1}}\right), \quad t \geq 1 \tag{2}$$

where $s$ is the average performance over all training samples, which measures the model’s learning ability, and $t$ is the training step. Taking decreasing performance as an example, $\frac{s_t - s_{t-1}}{s_{t-1}}$ will be negative. In that case, the ratio $1 - \frac{s_t - s_{t-1}}{s_{t-1}} > 1$ allows the model to include more easy training data than at the last step $c_{s_{t-1}}$. When the performance increases, the scheduler feeds fewer easy samples to the model and reduces overfitting on these samples. After multiple epochs of training, harder samples receive more training iterations than easier samples. In this way, we can alleviate the challenge posed by imbalanced tokens and labels in the radiology report generation task. To start our curriculum learning, we record the samples’ average performance over the last two regular training epochs as $s_0$ and $s_1$, and we empirically initialize $c_{s_0}$ as 1.
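The scheduler update in Eq. 2 can be sketched in a few lines of Python. The function name `next_fraction` and the example values are hypothetical; the sketch only assumes that the previous performance is nonzero.

```python
def next_fraction(c_prev: float, s_prev: float, s_curr: float) -> float:
    """Eq. 2: c_t = min(1, [1 - (s_t - s_{t-1}) / s_{t-1}] * c_{t-1})."""
    # Assumption: s_prev is nonzero so the relative change is well defined.
    relative_change = (s_curr - s_prev) / s_prev
    return min(1.0, (1.0 - relative_change) * c_prev)


# Performance drops from 0.5 to 0.4: the training fraction grows (more easy samples).
grown = next_fraction(c_prev=0.8, s_prev=0.5, s_curr=0.4)
# Performance rises from 0.5 to 0.6: the training fraction shrinks.
shrunk = next_fraction(c_prev=0.8, s_prev=0.5, s_curr=0.6)
# The fraction is capped at 1, i.e. the full training set.
capped = next_fraction(c_prev=1.0, s_prev=0.5, s_curr=0.4)
```

In this toy run the fraction moves from 0.8 to about 0.96 when performance drops by 20%, and to about 0.64 when performance rises by 20%.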

3.1. CL-Task 1

CL-Task 1 is to exploit imbalance patterns of radiology labels to generate clinically accurate reports. Entities in clinical reports play a crucial role in disease diagnosis. However, these clinical tokens often occur infrequently and are significantly underestimated during model training. Hence, we assess the accuracy of clinical entities to evaluate performance. Our intuition is that as abnormal cases contain more infrequent entities, focusing on the clinical entities may benefit the abnormal cases. If our generated reports are clinically correct, the visual extractor can accurately extract the same entities as gold entities from images.

The computing process is as follows. Given a radiology image $Img$ and the corresponding report $Z = z_0, \ldots, z_l$ of length $l$, we extract features from the image with a visual extractor. We use ResNet101 [22] $f$ as our visual extractor and obtain image features ($X$) from different convolutional channels, $X = f(Img)$, $X \in \mathbb{R}^{patch\_size \times d}$, where $d$ is the size of the feature vector. To predict the entity distribution, we feed the features $X$ into the Entity Extractor $f_E$ with parameters $W_E \in \mathbb{R}^{d \times |V|}$ and average the values over the patches (1st dimension),

$$q = \mathrm{AVG}_{:1}\left(f_E(X W_E)\right) \tag{3}$$

Then we obtain the entity distribution representation $q \in \mathbb{R}^{|V|}$. To optimize the model, we minimize the binary cross entropy as follows,

$$\mathcal{L}_{task1} = -\frac{1}{|V|} \sum_{i=1}^{|V|} \left( y_i \log(q_i) + (1 - y_i) \log(1 - q_i) \right) \tag{4}$$

where $q_i$ is the predicted probability of the $i$-th token and $y_i = 1$ if the $i$-th token is an entity. We extract the gold entities ($e$) with RadGraph [23]. To evaluate a sample’s difficulty in this task, we input the entity distribution prediction $q$ into Eq. 1 and obtain $k_{task1} = \sum_{i}^{|e|} \mathrm{Rank}(q, q[e_i]) / (|V| \cdot |e|)$.
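To make the Task 1 objective and its difficulty score concrete, here is a small pure-Python sketch of the binary cross entropy in Eq. 4 and of $k_{task1}$. The function names, the clamping constant `eps`, the tie-breaking in the rank, and the toy values are assumptions for illustration and are not taken from the paper.

```python
import math
from typing import Sequence


def bce_loss(q: Sequence[float], y: Sequence[int]) -> float:
    """Eq. 4: binary cross entropy averaged over the vocabulary."""
    eps = 1e-12  # assumption: clamp probabilities to avoid log(0)
    total = 0.0
    for q_i, y_i in zip(q, y):
        q_i = min(max(q_i, eps), 1.0 - eps)
        total += y_i * math.log(q_i) + (1 - y_i) * math.log(1.0 - q_i)
    return -total / len(q)


def task1_difficulty(q: Sequence[float], entity_ids: Sequence[int]) -> float:
    """k_task1: summed rank of the gold entities in q, normalized by |V| * |e|."""
    # Assumption: 1-based rank in descending order; ties do not increase the rank.
    ranks = [1 + sum(1 for v in q if v > q[e]) for e in entity_ids]
    return sum(ranks) / (len(q) * len(entity_ids))


# Toy 4-token vocabulary where tokens 0 and 2 are the gold entities (hypothetical).
q = [0.9, 0.2, 0.7, 0.1]
y = [1, 0, 1, 0]
loss = bce_loss(q, y)
k1 = task1_difficulty(q, entity_ids=[0, 2])
```

Here the two gold entities are ranked first and second, so $k_{task1} = (1 + 2) / (4 \cdot 2) = 0.375$, which marks the sample as relatively easy.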

3.2. CL-Task 2

CL-Task 2 implements an image-to-text generation pipeline with the objective of improving the prediction of infrequent tokens in reports. To generate a report containing more clinically useful information, we integrate the probability prediction of entities ($q$) in Eq. 3 with the image’s features ($X$). Since $d \neq |V|$, we cannot combine $q$ and $X$ directly. To facilitate their interaction and information sharing, we employ a cross-concatenation and perform a dot product operation on their cross-concatenated matrices as follows:

$$S = \mathrm{concat}_{:2}(X, q) \cdot \mathrm{concat}_{:2}(q, X)$$

where $S \in \mathbb{R}^{patch\_size \times (d + |V|)}$. Finally, we adopt a transformer structure to encode $S$ and generate the $i$-th token probability distribution $P_i$ from the encoded feature $S$ and the preceding token, $P_i = f_{\mathcal{T}}(S, z_{i-1})$. To optimize the model, we minimize the negative log-likelihood loss (NLL) as follows,

$$\mathcal{L}_{task2} = -\sum_{i}^{l} \log P_i \tag{5}$$

We can assess the sample’s difficulty with $P_i$ by Eq. 1, $k_{task2} = \sum_{i}^{l} \mathrm{Rank}(P_i, P_i[z_i]) / (|V| \cdot l)$.
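The shapes involved in the cross-concatenation and the Task 2 difficulty score can be illustrated with a short pure-Python sketch. Two points are assumptions rather than statements from the text: the entity vector is copied to every patch before concatenation, and the product of the two concatenated matrices is taken elementwise, which is one reading that yields the stated shape of patch_size × (d + |V|). All names and values are hypothetical.

```python
from typing import List, Sequence


def cross_concat(X: Sequence[Sequence[float]], q: Sequence[float]) -> List[List[float]]:
    """Combine patch features X (patches x d) with the entity vector q (|V|)."""
    S = []
    for row in X:
        left = list(row) + list(q)   # concat(X, q) along the 2nd dimension
        right = list(q) + list(row)  # concat(q, X) along the 2nd dimension
        # Assumption: elementwise product, so each row keeps length d + |V|.
        S.append([a * b for a, b in zip(left, right)])
    return S


def task2_difficulty(P: Sequence[Sequence[float]], Z: Sequence[int]) -> float:
    """k_task2: summed rank of each reference token, normalized by |V| * l."""
    vocab_size = len(P[0])
    ranks = [1 + sum(1 for v in P_i if v > P_i[z_i]) for P_i, z_i in zip(P, Z)]
    return sum(ranks) / (vocab_size * len(Z))


X = [[1.0, 2.0], [3.0, 4.0]]  # 2 patches, d = 2
q = [0.5, 0.1, 0.4]           # |V| = 3
S = cross_concat(X, q)        # 2 rows, each of length d + |V| = 5

P = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]  # per-step token distributions
k2 = task2_difficulty(P, Z=[0, 1])      # reference tokens ranked 1st and 2nd
```

In the toy example the reference tokens are ranked first and second across the two steps, giving $k_{task2} = (1 + 2) / (3 \cdot 2) = 0.5$.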

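Algorithm 1.

Optimization Process of JIMA

Require: learning rates $\alpha$, $\beta$
 1: for each epoch do
 2:  Rank entries by the two difficulty measurers ($k_{task1}$ and $k_{task2}$), and obtain two sorted datasets $\mathcal{D}_1$, $\mathcal{D}_2$
 3:  Calculate the training schedulers $c_{k_t}^{task1}$ and $c_{k_t}^{task2}$
 4:  Select the top $c_{k_t}^{task1}$ samples from the sorted dataset $\mathcal{D}_1$ obtained in step 2 as the training set
 5:  Select the top $c_{k_t}^{task2}$ samples from the sorted dataset $\mathcal{D}_2$ obtained in step 2 as the training set
 6:  Sample a batch from $\mathcal{D}_1$ and update Task 1: $\tilde{f} \leftarrow f - \alpha \nabla_{f} \mathcal{L}_{task1}$, $\tilde{f}_E \leftarrow f_E - \alpha \nabla_{f_E} \mathcal{L}_{task1}$
 7:  Sample a batch from $\mathcal{D}_2$ and update Task 2: $\tilde{f}_{\mathcal{T}} \leftarrow f_{\mathcal{T}} - \beta \nabla_{f_{\mathcal{T}}} \mathcal{L}_{task2}$
 8: end for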

3.3. CL-Joint Optimization

We propose a joint optimization approach to integrate the two tasks. Algorithm 1 summarizes the overall optimization process of our approach. We denote the learning rate of Task 1 as $\alpha$ and the learning rate of Task 2 as $\beta$. In each training step, we sample different data for different tasks, and each task focuses on optimizing its own module of the model. For example, we update the visual extractor $f$ and the entity extractor $f_E$ in Task 1. Next, we freeze the parameters of the visual extractor and the entity extractor, and update the parameters of the transformer $f_{\mathcal{T}}$ specifically for Task 2. Our optimization approach integrates with curriculum learning to tailor joint imbalance learning for each module ($f$, $f_E$, $f_{\mathcal{T}}$). Curriculum learning enables the model to concentrate on optimizing hard samples while mitigating the risk of overfitting to easier samples. The joint optimization scheme lets each task manage the optimization of different module parameters and learn transferable knowledge from the simpler to the more complex task. As a result, all modules collaborate to reduce errors propagated from previous tasks.
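The sample-selection steps of Algorithm 1 can be sketched as follows. This is an illustrative sketch only: it assumes that each sorted dataset lists the hardest samples first, so that a smaller scheduler value drops the easiest samples, and the function name, sample ids, and difficulty values are hypothetical.

```python
from typing import Dict, List


def select_training_subset(difficulty: Dict[str, float], fraction: float) -> List[str]:
    """Return the `fraction` hardest sample ids, hardest first."""
    # Assumption: samples are sorted by descending difficulty k, and the
    # scheduler value is interpreted as the fraction of samples to keep.
    ordered = sorted(difficulty, key=lambda sid: difficulty[sid], reverse=True)
    keep = max(1, int(fraction * len(ordered)))
    return ordered[:keep]


# Hypothetical per-sample difficulties from the two difficulty measurers.
k_task1 = {"a": 0.9, "b": 0.1, "c": 0.5, "d": 0.3}
k_task2 = {"a": 0.2, "b": 0.8, "c": 0.6, "d": 0.4}

d1 = select_training_subset(k_task1, fraction=0.5)  # training set for Task 1
d2 = select_training_subset(k_task2, fraction=1.0)  # training set for Task 2
```

Because each task ranks the samples with its own measurer, the same report can be hard for entity prediction and easy for generation, so the two training sets generally differ.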

4. Experiments

We design our experiments to evaluate performance in both regular and imbalanced settings via automatic and human evaluations. The automatic evaluation includes NLG-oriented and clinical-correctness metrics. NLG-oriented metrics measure the similarity between generated and reference reports. Clinical correctness and human evaluation are factually oriented, domain-specific evaluation methods. To be consistent with our baselines [10, 11, 15], we use the CheXbert F1 [24] as the clinical-correctness metric. The experiments compare our proposed approach (JIMA) with the state-of-the-art baselines. Two of our baselines (CMM + RL and RRG) are designed to address label imbalance by improving the generation of abnormal findings. We conduct ablation and case analyses to fully understand the capabilities of our proposed approach. We include more implementation details and hyperparameter settings in Appendix B.1.

4.1. Baselines

To examine the validity of our method, we include seven state-of-the-art baselines under the same experimental settings: R2Gen [15], CMN [17], WCL [16], CMM + RL [25], RRG [26], TIMER [10], and RGRG [27]. We obtain them from their open-source code repositories.

R2Gen [15] is a transformer-based model with ResNet101 [22] as the visual extractor. To capture patterns in medical reports, R2Gen proposes a relational memory to enhance the transformer so that the model can learn from the characteristics of these patterns. Furthermore, R2Gen applies a memory-driven conditional layer normalization to the transformer decoder, which helps incorporate the previous generation step into the current step.

CMN [17] is an extension of the transformer architecture that facilitates the alignment of textual and visual modalities. The cross-modal memory network records the shared information of visual and textual features. The alignment process is carried out via memory querying and responding. The model maps the visual and textual features into the same representation space in memory querying and learns a weighted representation of these features in memory responding.

WCL [16] utilizes the R2Gen framework and incorporates a weakly supervised contrastive loss. Specifically, WCL leverages the contrastive loss to enhance the similarity between a given source image and its corresponding target sequence. Furthermore, the model enhances its ability to learn from difficult samples by assigning more weights to instances sharing common labels.

CMM + RL [25] is a cross-modal memory-based model with reinforcement learning for optimization. CMM + RL designs a cross-modal memory model to align the visual and textual features and deploys reinforcement learning to capture the label imbalance between abnormality and normality. The authors use BLEU-4 as a reward to guide the model to generate the next word from the image and previous words.

RRG [11, 26] aims to generate clinically correct reports by weakly-supervised learning of the entities and relations from reports. RRG is a BERT-based model with Densenet-121 [28] as a visual extractor. RRG leverages RadGraph [23] to extract the entities and relation labels in a report. RRG utilizes reinforcement learning to optimize the model. The reward assesses the consistency and completeness of entities and the relation set between generated reports and reference radiology reports. RRG addresses label imbalance issues by maximizing the reward of predicting more complicated entities and relations in abnormal samples.

TIMER [10] aims to reduce overfitting on frequent tokens by introducing an unlikelihood loss that penalizes errors on these tokens. The token set for the unlikelihood loss is automatically adjusted by maximizing the average F1 score across tokens of different frequencies.

RGRG [27] adopts GPT-2 as the language generation model and generates a report based on the localized visual features of anatomical regions, which are extracted by an object detector. This baseline experiment was carried out only on the MIMIC-CXR dataset, as the IU X-ray dataset lacks anatomical region information, which makes it impossible to train an object detection module effectively.

4.2. Imbalance Setting

We evaluate model robustness under token and label imbalance settings and present results in Sections 5.2 and 5.3. For token imbalance, we compare F1 scores of frequent and infrequent tokens separately. We introduce three different scales to define the frequent token sets: 1/4, 1/6, and 1/8. The splits define the top 1/4, 1/6, and 1/8 of the vocabulary as frequent tokens and the rest of the vocabulary as infrequent tokens. This setting is designed to demonstrate the effectiveness of our approach in adapting to token imbalance. For label imbalance, we divide our samples into two categories, normal and abnormal.
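The frequency splits described above can be sketched in Python as follows. The function name, the whitespace-tokenized toy reports, and the rounding of the cutoff (truncation, with at least one frequent token) are assumptions for illustration; the text does not specify them.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple


def split_vocab_by_frequency(
    reports: Iterable[List[str]], ratio: float
) -> Tuple[Set[str], Set[str]]:
    """Split the vocabulary into the top `ratio` frequent tokens and the rest."""
    counts = Counter(tok for report in reports for tok in report)
    ordered = [tok for tok, _ in counts.most_common()]
    # Assumption: truncate the cutoff and keep at least one frequent token.
    cutoff = max(1, int(len(ordered) * ratio))
    return set(ordered[:cutoff]), set(ordered[cutoff:])


# Toy corpus with an 8-token vocabulary (hypothetical reports).
reports = [
    ["no", "acute", "disease"],
    ["no", "acute", "pleural", "effusion"],
    ["no", "acute", "cardiomegaly"],
    ["no", "mild", "edema"],
]
frequent, infrequent = split_vocab_by_frequency(reports, ratio=1 / 4)
```

On this toy corpus the 1/4 split marks "no" and "acute" as frequent and the six remaining, mostly clinical, tokens as infrequent, which mirrors the pattern the evaluation is designed to probe.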

5. Results and Analysis

In this section, we present overall performance, report results of imbalance evaluations, and include an ablation analysis and a case study. Generally, JIMA outperforms the state-of-the-art baselines by a large margin, especially under imbalance settings. Our qualitative studies show that our method achieves better clinical accuracy and generates clinical terms more precisely.

5.1. Overall Performance

Table 2 presents the performance of JIMA on NLG and clinical-correctness metrics. JIMA outperforms baseline models (both imbalance and regular methods) on BLEU scores by a large margin, supporting the validity of selecting training samples with our curriculum learning method. The approach enables the model to learn multiple times from the samples with lower BLEU-4, resulting in better performance than the baseline models. For example, JIMA shows an improvement of 16.59% on average for IU X-ray and 16.28% for MIMIC-CXR. We attribute this to Tasks 1 and 2 working jointly to mitigate the token and label imbalance problems.

Table 2.

Overall performance. Δ is the averaged percentage improvements over baselines. BLEU, METEOR, and ROUGE-L are NLG metrics; F1 is the CE metric.

| Dataset | Model | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | METEOR | ROUGE-L | F1 |
|---|---|---|---|---|---|---|---|---|
| IU X-ray | R2Gen | 48.80 | 31.93 | 23.24 | 17.72 | 20.21 | 37.10 | 63.62 |
| | CMN | 45.53 | 29.50 | 21.47 | 16.53 | 18.99 | 36.78 | 64.83 |
| | WCL | 44.74 | 29.30 | 21.49 | 16.79 | 20.45 | 37.11 | 49.24 |
| | CMM + RL | 49.40 | 30.08 | 21.45 | 16.10 | 20.10 | 38.40 | 40.79 |
| | RRG | 49.96 | 31.44 | 22.11 | 17.05 | 18.81 | 33.46 | 49.10 |
| | TIMER | 49.34 | 32.49 | 23.84 | 18.61 | 20.38 | 38.25 | 94.52 |
| | JIMA (Ours) | 50.50 | 33.12 | 24.15 | 18.88 | 21.16 | 38.56 | 96.58 |
| | Δ (%) | 5.49 | 7.74 | 8.65 | 10.44 | 6.86 | 4.86 | 72.10 |
| MIMIC-CXR | R2Gen | 35.42 | 21.99 | 14.50 | 10.30 | 13.75 | 27.24 | 54.60 |
| | CMN | 35.60 | 21.41 | 14.07 | 9.91 | 14.18 | 27.14 | 50.50 |
| | WCL | 37.30 | 23.13 | 15.49 | 10.70 | 14.40 | 27.39 | 55.58 |
| | CMM + RL | 35.35 | 21.80 | 14.82 | 10.58 | 14.20 | 27.37 | 65.43 |
| | RRG | 37.57 | 19.78 | 15.87 | 9.56 | 14.77 | 26.81 | 62.20 |
| | TIMER | 38.30 | 22.49 | 14.60 | 10.40 | 14.70 | 28.00 | 75.86 |
| | RGRG | 30.7 | 20.59 | 14.10 | 10.18 | 15.43 | 24.03 | 80.28 |
| | JIMA (Ours) | 41.37 | 24.83 | 16.72 | 11.20 | 16.75 | 30.15 | 81.25 |
| | Δ (%) | 16.26 | 15.24 | 13.34 | 9.59 | 15.73 | 12.52 | 31.29 |

Second, our model achieves the best F1 on the clinical metric, which indicates that Task 1 (Section 3.1) enables the model to pay more attention to difficult samples with lower F1 scores. Additionally, our method promotes clinical token prediction, as performance on infrequent tokens and medical terms has improved. For example, our generation outperforms the baselines on F1 score by 72.10% on IU X-ray and 31.29% on MIMIC-CXR on average. CMM + RL performs better than other baselines on IU X-ray but not on MIMIC-CXR. JIMA maintains a stable performance on both IU X-ray and MIMIC-CXR. We attribute this to our joint imbalance adaptation, which can yield more improvements.
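As a worked check of how Δ can be read, the sketch below averages the per-baseline relative improvements of JIMA, using the IU X-ray BLEU-1 column of Table 2, and then averages the IU X-ray Δ row across metrics. The function name is hypothetical, and the formula is an interpretation of "averaged percentage improvements over baselines" that reproduces the reported values.

```python
from typing import Sequence


def average_improvement(ours: float, baselines: Sequence[float]) -> float:
    """Mean over baselines of the relative improvement (ours - b) / b, in percent."""
    gains = [(ours - b) / b * 100.0 for b in baselines]
    return sum(gains) / len(gains)


# IU X-ray BLEU-1 scores of the six baselines in Table 2, and JIMA's score.
bleu1_baselines = [48.80, 45.53, 44.74, 49.40, 49.96, 49.34]
delta_bleu1 = average_improvement(50.50, bleu1_baselines)

# IU X-ray delta row of Table 2; its mean is the per-dataset average improvement.
delta_row_iu = [5.49, 7.74, 8.65, 10.44, 6.86, 4.86, 72.10]
mean_delta_iu = sum(delta_row_iu) / len(delta_row_iu)
```

Under this reading, `delta_bleu1` rounds to 5.49 and `mean_delta_iu` rounds to 16.59, matching the BLEU-1 entry of the Δ row and the average improvement quoted for IU X-ray.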

5.2. Token Imbalance

Table 3 compares F1 on high- and low-frequency tokens under different ratio splits. Our method consistently outperforms baselines on the low-frequency tokens across frequency splits (1/4, 1/6, and 1/8) on IU X-ray and MIMIC-CXR. While the RRG and CMM + RL approaches adapt to label imbalance, they may not be able to adapt to token imbalance. Our approach achieves better performance under token imbalance. Generating rare tokens accurately remains a difficult task despite the high performance achieved on frequent tokens. Common tokens are prone to overfitting, while rare tokens are predicted with less precision. For example, R2Gen scores 0.00 on the infrequent tokens of MIMIC-CXR under the 1/4 split (i.e., the remaining 3/4 of the vocabulary). Performance imbalance can degrade the clinical correctness of generated reports, as medical terminologies are usually infrequent. Nonetheless, our joint imbalance adaptation approach shows considerable improvements in this area, indicating a promising direction to enhance the robustness of radiology report generation, a critical clinical task.

Table 3.

Results on high- and low-frequent tokens with three ratio splits.

| Ratio | Method | IU X-ray infreq | IU X-ray freq | MIMIC-CXR infreq | MIMIC-CXR freq |
|---|---|---|---|---|---|
| 1/8 | R2Gen | 4.46 | 62.73 | 2.52 | 52.01 |
| | CMN | 5.88 | 55.86 | 2.23 | 45.60 |
| | WCL | 5.29 | 60.23 | 2.91 | 48.60 |
| | CMM + RL | 5.19 | 49.36 | 0.21 | 23.64 |
| | RRG | 7.28 | 41.94 | 2.50 | 43.57 |
| | TIMER | 13.23 | 61.89 | 3.15 | 52.66 |
| | RGRG | - | - | 0.22 | 31.33 |
| | JIMA (Ours) | 14.87 | 62.55 | 3.58 | 53.06 |
| 1/6 | R2Gen | 2.80 | 61.62 | 2.02 | 49.86 |
| | CMN | 5.75 | 65.12 | 0.85 | 52.02 |
| | WCL | 3.72 | 59.26 | 2.13 | 47.88 |
| | CMM + RL | 5.19 | 49.36 | 0.14 | 23.36 |
| | RRG | 4.55 | 40.46 | 2.09 | 43.56 |
| | TIMER | 5.93 | 67.79 | 2.02 | 51.72 |
| | RGRG | - | - | 0.26 | 30.66 |
| | JIMA (Ours) | 10.52 | 68.82 | 2.83 | 52.32 |
| 1/4 | R2Gen | 1.16 | 59.98 | 0.00 | 48.77 |
| | CMN | 2.60 | 63.92 | 0.33 | 51.09 |
| | WCL | 1.50 | 56.83 | 0.30 | 46.95 |
| | CMM + RL | 5.19 | 49.36 | 0.07 | 23.05 |
| | RRG | 2.04 | 38.84 | 0.39 | 41.45 |
| | TIMER | 8.66 | 64.00 | 0.58 | 51.39 |
| | RGRG | - | - | 0.20 | 29.56 |
| | JIMA (Ours) | 9.77 | 66.23 | 0.94 | 51.92 |

5.3. Label Imbalance

We report NLG evaluations on label imbalance (normal vs. abnormal) in Table 4. JIMA outperforms baseline models on both normal and abnormal splits, which demonstrates its effectiveness under label imbalance. JIMA also performs better than the label imbalance methods, RRG and CMM + RL, indicating that joint imbalance adaptation is a promising direction to improve model robustness. It is worth noting that models generally perform better on normal samples than on abnormal ones. We attribute this to two reasons: 1) abnormal reports contain more infrequent medical tokens, and 2) abnormal reports are longer, as discussed in Section 2. JIMA shows larger improvements over baselines on abnormal samples while maintaining a similar performance on samples with normal labels. The observations suggest that our approach can successfully learn from lengthier documents with more medical tokens.

Table 4.

Label imbalance evaluation with binary types, normal and abnormal.

| Dataset | Label | Model | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | METEOR | ROUGE-L |
|---|---|---|---|---|---|---|---|---|
| IU X-ray | Normal | R2Gen | 50.50 | 34.91 | 25.86 | 20.93 | 23.66 | 40.56 |
| | | CMN | 47.42 | 32.80 | 25.25 | 18.72 | 20.51 | 38.69 |
| | | WCL | 49.74 | 35.44 | 28.02 | 18.71 | 26.88 | 42.09 |
| | | CMM + RL | 51.68 | 36.65 | 21.99 | 19.47 | 24.53 | 40.05 |
| | | RRG | 50.03 | 33.76 | 24.81 | 19.89 | 20.43 | 34.39 |
| | | TIMER | 51.83 | 32.43 | 33.71 | 20.19 | 24.43 | 39.39 |
| | | JIMA (Ours) | 52.65 | 37.06 | 28.39 | 21.56 | 27.20 | 42.33 |
| | Abnormal | R2Gen | 42.67 | 27.86 | 18.47 | 12.35 | 15.04 | 30.10 |
| | | CMN | 35.09 | 21.42 | 14.97 | 11.32 | 14.36 | 29.85 |
| | | WCL | 32.31 | 19.93 | 13.87 | 10.50 | 13.81 | 30.37 |
| | | CMM + RL | 38.09 | 25.42 | 11.17 | 15.09 | 13.13 | 27.64 |
| | | RRG | 43.38 | 23.44 | 10.02 | 15.58 | 12.43 | 31.52 |
| | | TIMER | 44.25 | 26.73 | 15.28 | 10.76 | 15.43 | 33.26 |
| | | JIMA (Ours) | 45.41 | 27.95 | 19.15 | 15.68 | 16.36 | 34.59 |
| MIMIC-CXR | Normal | R2Gen | 40.42 | 26.76 | 19.75 | 15.60 | 17.58 | 32.02 |
| | | CMN | 41.42 | 27.80 | 20.25 | 15.72 | 17.51 | 33.69 |
| | | WCL | 39.74 | 25.44 | 18.02 | 13.71 | 16.88 | 32.09 |
| | | CMM + RL | 17.50 | 10.11 | 6.83 | 14.99 | 8.05 | 19.10 |
| | | RRG | 38.78 | 21.63 | 18.04 | 12.09 | 18.27 | 27.56 |
| | | TIMER | 40.33 | 27.53 | 19.88 | 14.87 | 17.47 | 33.08 |
| | | RGRG | 32.09 | 22.67 | 16.40 | 12.30 | 18.26 | 27.28 |
| | | JIMA (Ours) | 41.79 | 27.87 | 20.49 | 16.00 | 17.93 | 33.87 |
| | Abnormal | R2Gen | 33.97 | 19.31 | 12.07 | 10.17 | 10.98 | 26.82 |
| | | CMN | 33.00 | 19.44 | 10.02 | 8.73 | 10.21 | 25.16 |
| | | WCL | 34.56 | 22.45 | 14.63 | 10.26 | 12.43 | 26.87 |
| | | CMM + RL | 27.74 | 10.87 | 5.18 | 3.43 | 6.11 | 16.08 |
| | | RRG | 17.47 | 9.71 | 5.78 | 3.74 | 8.37 | 17.59 |
| | | TIMER | 35.66 | 21.83 | 14.25 | 14.87 | 9.84 | 26.77 |
| | | RGRG | 30.54 | 20.34 | 13.82 | 9.92 | 15.13 | 23.66 |
| | | JIMA (Ours) | 37.81 | 22.46 | 15.26 | 10.28 | 14.56 | 27.38 |

5.4. Ablation Analysis

In this section, we carry out ablation experiments to analyze the impact of our curriculum learning approach on tokens and labels with different frequencies. To investigate the performance across different tokens, we categorize tokens into five groups based on their frequency, with “0” representing the most frequent tokens and “4” representing the least frequent tokens. Each group contains an equal number of tokens. In order to compare the performance across different labels, we present their performance individually. We conduct our ablation experiments on the MIMIC-CXR dataset, and the results are depicted in Figure 4.
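The grouping of tokens into five equal-size frequency bins can be sketched as below. The function name, the toy counts, and the handling of any remainder (extra tokens fall into the last group) are assumptions for illustration; the text only states that each group contains an equal number of tokens.

```python
from typing import Dict


def frequency_groups(token_counts: Dict[str, int], n_groups: int = 5) -> Dict[str, int]:
    """Map each token to a group id, 0 = most frequent, n_groups - 1 = least."""
    ordered = sorted(token_counts, key=lambda tok: token_counts[tok], reverse=True)
    size = max(1, len(ordered) // n_groups)
    # Assumption: if the vocabulary does not divide evenly, the leftover
    # tokens are assigned to the last (least frequent) group.
    return {tok: min(rank // size, n_groups - 1) for rank, tok in enumerate(ordered)}


# Toy vocabulary of ten tokens with strictly decreasing counts (hypothetical).
counts = {f"t{i}": 10 - i for i in range(10)}
groups = frequency_groups(counts)
```

With ten tokens, each of the five groups receives two tokens, so `t0` and `t1` land in group 0 and `t8` and `t9` in group 4.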

Fig. 4.

Performance comparison of JIMA with and without curriculum learning across various labels and tokens.

First, we notice that removing curriculum learning does not have a significant detrimental impact on highly frequent tokens and labels. For instance, the performance is comparable in the “0” token group and the “0–5” label groups. Curriculum learning enables the model to allocate more attention to challenging samples, thereby reducing the predicted likelihood on highly frequent samples. However, our curriculum learning strategy selects training samples based on the ranking of the correct answers. Therefore, despite the reduced probability of the correct answer, the ranking remains unchanged (for example, the correct option still holds the highest estimate). As a result, our curriculum learning approach does not diminish the performance on highly frequent samples. Next, our curriculum learning approach enhances performance primarily on moderately frequent samples. The average improvement amounts to 6.49% in the “1–3” token group and 2.58% in the “6–10” label group. However, our method exhibits limitations in enhancing the performance on exceedingly rare tokens. Notably, the model struggles to predict tokens in the “4” group.

5.5. Human Evaluation

To verify factual correctness, we invite two health professionals to perform the evaluation. First, we randomly select 50 test instances from each of IU X-ray and MIMIC-CXR. We choose CMM+RL as our comparison target, as the model is the best-performing baseline by automatic metrics. In the evaluation, we show the X-ray images, the corresponding ground truth reports, and two generated reports (one from our model and the other from CMM+RL) to the experts without disclosing their sources. The experts select the better description from the two candidate reports or choose the “Same” option if both reports are of similar quality.

We present our human evaluation results in Table 5, which are consistent with the automatic evaluation results shown in Table 2. Generally, JIMA outperforms the baseline by 11 reports in total. Notably, our approach exhibits clear improvements on abnormal samples. Even though JIMA has only one more vote than the baseline on normal samples, our model secures ten more votes on abnormal samples. Because abnormal samples have lengthier reports on average and encompass more medical entities, this suggests that our approach generates more clinically precise reports.

Table 5.

Human evaluation. “Same” means that the expert judged the two generated reports to be of similar quality.

Dataset     Label      CMM+RL     Same       JIMA (Ours)
IU X-ray    Normal     6 / 7      12 / 7     6 / 10
IU X-ray    Abnormal   4 / 4      10 / 5     12 / 13
MIMIC-CXR   Normal     6 / 7      15 / 7     7 / 11
MIMIC-CXR   Abnormal   5 / 6      10 / 7     7 / 16
Overall     Normal     12 / 14    27 / 14    13 / 21
Overall     Abnormal   9 / 10     20 / 12    19 / 29
Overall     All        21 / 24    47 / 26    32 / 50
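The vote margins quoted in the text can be re-derived from the Overall rows of Table 5. The sketch below is plain arithmetic on the published counts; it assumes that the margins in the text refer to the first value in each cell, since the table does not state what the two values denote. The dictionary layout and function name are illustrative.

```python
# Overall rows of Table 5, stored as (first value, second value) per cell.
overall = {
    "Normal":   {"CMM+RL": (12, 14), "Same": (27, 14), "JIMA": (13, 21)},
    "Abnormal": {"CMM+RL": (9, 10),  "Same": (20, 12), "JIMA": (19, 29)},
    "All":      {"CMM+RL": (21, 24), "Same": (47, 26), "JIMA": (32, 50)},
}


def margins(row):
    """Return JIMA votes minus CMM+RL votes for the first and second values."""
    return tuple(j - c for j, c in zip(row["JIMA"], row["CMM+RL"]))


for label, row in overall.items():
    print(label, margins(row))
```

With the first values, the margins are 1 (normal), 10 (abnormal), and 11 (all), matching the figures in the text; the second values give larger margins in the same direction.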

5.6. Case Study

To verify our model’s effectiveness in generating clinically correct descriptions, we perform a case study and present the results in Fig. 5. We select four samples from the IU X-ray and MIMIC-CXR datasets and compare performance on normal and abnormal samples separately. Correct predictions of pathological and anatomical entities are highlighted in blue. Generally, our predictions cover more than 90% of the entities in the reference reports. Compared to normal samples, abnormal samples have longer descriptions and contain more complex entities. These entities are usually rare in the corpus, so models tend to underfit them and therefore underperform on abnormal samples. However, JIMA captures most of the entities in both kinds of samples and achieves similar performance on normal and abnormal samples, which supports our model’s effectiveness in improving the factual completeness and correctness of generated radiology reports.

Fig. 5.

Qualitative comparison between JIMA and CMM+RL. Correct predictions of pathological and anatomical entities are highlighted in blue.

6. Related Work

Radiology report generation is a domain-specific image-to-text task with two major directions: retrieval-based [29, 30] and generation-based [15, 25, 31]. The retrieval-based approach compares similarities between an input radiology image and a set of report candidates, ranks the candidates, and returns the most similar one [5, 26, 29, 30, 32]. In contrast, our study focuses on the generation-based task, which automatically generates a precise report from an input image. The task has domain-specific characteristics in the clinical field: clinical data contains many infrequent medical terms, and its documents are longer than image captions from general domains [6]. Because radiology report generation can reduce the workloads of radiologists, generating high-quality and precise reports is a critical challenge, especially under imbalanced settings. Differing from previous work, we aim to promote model robustness and reliability under imbalanced settings, which have rarely been studied in radiology report generation.

Imbalance learning aims to model skewed data distributions. Its primary focus is class or label imbalance, such as between positive and negative reviews in sentiment analysis [33]. While previous studies proposed new objective functions (e.g., focal loss [34]) or oversampling [35], those methods may not be applicable to our primary generation unit, the token, which has a large vocabulary and extreme sparsity. In radiology report generation, reports may have disease-related labels. Recent studies have improved model robustness by using reinforcement learning to balance performance between disease and normal samples [7, 8]. However, those methods ignore a fundamental challenge of the generation task: token imbalance, which follows a long-tail distribution. Token imbalance can be even more critical in the clinical domain, as medical tokens appear less frequently than regular tokens in radiology reports. Our study makes a unique contribution to radiology report generation by jointly considering multiple imbalance factors via curriculum learning.

7. Conclusion

In this study, we have demonstrated the critical imbalance challenge and developed a curriculum learning-based model that jointly adapts to label and token imbalance. Extensive experiments, ablation analysis, and human evaluations show that JIMA leads to significant improvements over the existing state-of-the-art baselines, especially in handling token and label imbalance. Our future work will examine the proposed approach on more imbalance factors (e.g., demographics).

8. Limitations

The limitations of this study should be acknowledged before interpreting its results. Evaluation. We are aware of other evaluation metrics, such as RadGraph [23] and CheXpert [36]. However, these additional metrics may only be applicable to MIMIC-CXR or may overlap with our existing metrics, as CheXpert does with CheXbert [24]. We have included diverse metrics, covering NLG, clinical correctness, and human evaluations. To remain consistent with our state-of-the-art baselines, we use a similar evaluation schema. The consistent observations between our human and automatic evaluations also support the validity of our evaluation.

Funding

This work was supported by the National Science Foundation under Award IIS-2245920 and National Cancer Institute Award R01CA258193.

Appendix A Data

We extract the labels of each data entry and follow baseline studies [15, 17, 25] to pre-process the report documents, ensuring comparisons under the same settings. To ensure data format consistency, we include and infer two primary labels for radiology reports: normality and abnormality. To obtain labels for IU X-ray, we build a supervised classifier using BioBert-PubMed200kRCT [37] to extract the binary labels (normal and abnormal) from the Medical Subject Headings (MeSH) and RadLex labels. To obtain labels for MIMIC-CXR, we use CheXbert [24] to extract the binary categories, disease types and “no finding”. We define “no finding” as normality and disease types as abnormality. We conduct text preprocessing with the Natural Language Toolkit (NLTK) [38] to lowercase and tokenize documents. Furthermore, we remove redundant spaces, empty lines, serial numbers, and punctuation marks from the documents.
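The text-cleaning steps described above can be sketched as follows. This is a minimal illustration rather than the released pipeline: it swaps the NLTK tokenizer for a plain whitespace split so that it runs with the standard library only, and it assumes that a "serial number" is a leading list index such as "1." or "2)". The function name is hypothetical.

```python
import re
import string

# Map every ASCII punctuation mark to a space so that "a/b" does not fuse into "ab".
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def preprocess_report(text):
    """Lowercase a report, drop empty lines, serial numbers and punctuation, and tokenize."""
    tokens = []
    for line in text.lower().splitlines():
        line = line.strip()
        if not line:
            continue  # skip empty lines
        # Assumption: serial numbers appear as a leading index like "1." or "2)".
        line = re.sub(r"^\d+[.)]\s*", "", line)
        line = line.translate(_PUNCT_TO_SPACE)
        # Whitespace split stands in for the NLTK tokenizer and collapses redundant spaces.
        tokens.extend(line.split())
    return tokens


report = "1. The heart is   normal.\n\n2. No pleural effusion, no pneumothorax."
print(preprocess_report(report))
```

A whitespace split is coarser than a linguistic tokenizer, so token boundaries may differ from those produced by NLTK on contractions or hyphenated terms.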

A.1. Ethics, Privacy, and IRB

We complete the required data use agreements and training to access the two radiology report datasets. To protect privacy, we ensure proper data usage and experiment only with de-identified data. Our experiments do not store any data and only use the available multimodal entries for research demonstrations. Due to privacy and ethical considerations, we will not release any clinical data associated with patient identities. Instead, we will release our code and provide detailed instructions to replicate our study. This study uses only publicly available, de-identified data, focuses on computational approaches, and does not collect data from human subjects. Our institutional IRB determined that IRB approval is not required for this study.

Appendix B Experiment

B.1. Implementation Details

In our model architecture, we use a transformer with 3 layers, 8 attention heads, and 512-dimensional hidden states. The memory-driven model is a single-layer GRU network with a hidden size equal to the vocabulary size. We set the α learning rate to 4e-4 and the β learning rate to 1e-5 and decay both by a rate of 0.8 per epoch for all datasets. We pre-train for 30 epochs on IU X-ray and 10 epochs on MIMIC-CXR. Then we adopt curriculum learning to optimize our pre-trained model. The maximum number of training epochs is 70 for IU X-ray and 50 for MIMIC-CXR. We keep the learning rates the same as in the pre-training stage.
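The learning-rate schedule can be sketched in a few lines. The sketch assumes that "decay by a 0.8 rate per epoch" means multiplying the rate by 0.8 after every epoch (exponential decay, comparable to a PyTorch ExponentialLR scheduler with gamma=0.8); the function and constant names are illustrative and not taken from the released code.

```python
ALPHA_LR = 4e-4  # initial alpha learning rate from the text
BETA_LR = 1e-5   # initial beta learning rate from the text
DECAY = 0.8      # assumed multiplicative decay factor per epoch


def decayed_lr(base_lr, epoch, decay=DECAY):
    """Learning rate in effect during `epoch` (0-based) under exponential decay."""
    return base_lr * decay ** epoch


for epoch in range(3):
    print(epoch, decayed_lr(ALPHA_LR, epoch), decayed_lr(BETA_LR, epoch))
```

Under this assumption, both rates shrink to roughly one tenth of their initial value after about ten epochs, since 0.8 raised to the tenth power is close to 0.107.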

For all baselines, we set the maximum number of training epochs to 100 for IU X-ray and 60 for MIMIC-CXR. We use the same pre-processing, optimizer, batch size, maximum length of training data, sampling method, and machine learning framework in all experiments. Specifically, we optimize models with Adam [39] using a batch size of 16. The maximum length of training data is 60. At test time, we generate tokens by beam search [40] with a beam size of 3 for all experiments. All implementations are in PyTorch [41]. In implementing the baselines, we keep all model architectures and optimization parameters the same as in their papers. For R2Gen, CMN, and RRG, we generate reports using the code and pre-trained models published by the authors. For the other baselines (WCL, CMM+RL, and TIMER), we use the released code to train the models and generate reports.
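For readers unfamiliar with the decoding step, beam search with a beam size of 3 can be sketched on a toy model. Everything except the beam size and the maximum length of 60 is an assumption for illustration: `next_log_probs` is a hypothetical callback, the toy vocabulary is invented, and the sketch applies no length normalization, which practical decoders often add.

```python
import math


def beam_search(next_log_probs, bos, eos, beam_size=3, max_len=60):
    """Return the highest-scoring token sequence found by beam search.

    `next_log_probs(prefix)` maps a tuple of tokens to a dict
    {token: log-probability} for the next position.
    """
    beams = [((bos,), 0.0)]  # (sequence, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for tok, lp in next_log_probs(seq).items():
                candidates.append((seq + (tok,), score + lp))
        # Keep only the `beam_size` best expansions.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_size]:
            (finished if seq[-1] == eos else beams).append((seq, score))
        if not beams:
            break
    finished.extend(beams)  # sequences cut off at max_len
    return max(finished, key=lambda c: c[1])[0]


def toy_model(prefix):
    """Hypothetical next-token distribution that depends only on the last token."""
    table = {
        "<bos>": {"no": math.log(0.6), "mild": math.log(0.4)},
        "no": {"effusion": math.log(0.5), "<eos>": math.log(0.5)},
        "mild": {"effusion": math.log(0.9), "<eos>": math.log(0.1)},
        "effusion": {"<eos>": math.log(1.0)},
    }
    return table[prefix[-1]]


best = beam_search(toy_model, "<bos>", "<eos>")
print(best)
```

In this toy example, greedy decoding would commit to "no" at the first step because it has the higher probability, whereas beam search keeps "mild" alive and ends with the sequence of higher overall probability.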

We customize the following settings in the baselines. For WCL, we use the basic contrastive learning loss without assigning hardness weights to different samples on the IU X-ray dataset, because the file measuring the similarity among different samples is inaccessible. We set the contrastive embedding size to 256 and the weight of the contrastive loss to 0.2. For CMM+RL, the reinforcement learning reward is based on evaluation metrics, and we select BLEU-4 in this case.

B.2. Evaluation Metrics

Automatic evaluation includes seven evaluation metrics from two major categories: NLG and clinical metrics. We first evaluate our model and the baseline models on natural language generation (NLG) metrics, including BLEU-1, BLEU-2, BLEU-3, and BLEU-4 [18], METEOR [42], and ROUGE-L [43]. BLEU measures the n-gram precision of a prediction, with a brevity penalty based on the reference-to-prediction length ratio. METEOR computes the harmonic mean of unigram precision and recall and additionally incorporates a penalty to account for word order. ROUGE-L naturally takes sentence-level structural similarity into account by identifying the longest common subsequence between the prediction and the reference. Clinical metrics are domain-specific and measure the factual completeness and consistency of generated reports. We use CheXbert [24] to extract the labels of the ground truth and the prediction and evaluate clinical efficacy (CE) by F1. We do not present the clinical F1 score in the label imbalance experiment, since we cannot obtain recall separately for the normal and abnormal sample sets.
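The core quantities behind BLEU and ROUGE-L can be sketched as follows. This is a simplified illustration, not the evaluation toolkit used in the experiments: it handles a single reference, weights precision and recall equally in ROUGE-L (an assumption), and exposes the clipped n-gram precision and brevity penalty of BLEU separately rather than combining them into a corpus-level score. All function names are illustrative.

```python
import math
from collections import Counter


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]


def rouge_l_f1(pred, ref):
    """ROUGE-L with equal weight on LCS precision and LCS recall (assumed)."""
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(pred), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def clipped_precision(pred, ref, n):
    """BLEU's modified n-gram precision against a single reference."""
    pred_ngrams = Counter(tuple(pred[i:i + n]) for i in range(len(pred) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    total = sum(pred_ngrams.values())
    if total == 0:
        return 0.0
    # Each predicted n-gram is credited at most as often as it occurs in the reference.
    overlap = sum(min(count, ref_ngrams[gram]) for gram, count in pred_ngrams.items())
    return overlap / total


def brevity_penalty(pred_len, ref_len):
    """BLEU's penalty for predictions shorter than the reference."""
    if pred_len == 0:
        return 0.0
    return 1.0 if pred_len > ref_len else math.exp(1 - ref_len / pred_len)


pred = "the heart size is normal".split()
ref = "the heart is normal in size".split()
print(rouge_l_f1(pred, ref), clipped_precision(pred, ref, 2), brevity_penalty(len(pred), len(ref)))
```

In the example, every predicted unigram appears in the reference, yet only two of the four predicted bigrams do, which shows why higher-order BLEU scores are more sensitive to word order than BLEU-1.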

Footnotes

Declarations

Competing interests The authors declare no competing interests.

1. Clinical reports are also much longer than general-domain image captions, such as MS-COCO [6].

Ethics approval and consent to participate Not applicable.

Consent for publication All authors have reviewed and approved the final version of this manuscript and consent to its publication.

Code availability The code that implements the method described in this study is publicly available at https://github.com/woqingdoua/JIMA. This repository includes all necessary scripts, tools, and documentation to replicate the results and analyses presented in this paper.

Data availability

The data that support the findings of this study are derived from the publicly available MIMIC-CXR and IU-Xray datasets. MIMIC-CXR: The MIMIC-CXR (Medical Information Mart for Intensive Care) dataset is available through the PhysioNet repository. More information on accessing MIMIC-CXR can be found at https://physionet.org/content/mimic-cxr/2.0.0/. IU-Xray: The IU-Xray dataset, which consists of chest X-ray images and associated radiology reports, is available from the Open Access Biomedical Image Search Engine (OpenI) provided by the U.S. National Library of Medicine. The dataset can be accessed at https://openi.nlm.nih.gov/faq#collection. These datasets are publicly available to researchers subject to the respective data use agreements and ethical guidelines. Any additional data generated and analyzed during the current study are available from the corresponding author on reasonable request.

References

  • [1] Jing B., Xie P., Xing E.: On the automatic generation of medical imaging reports. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 2577–2586. Association for Computational Linguistics, Melbourne, Australia (2018). doi: 10.18653/v1/P18-1240. https://aclanthology.org/P18-1240
  • [2] Jing B., Wang Z., Xing E.: Show, describe and conclude: On exploiting the structure information of chest X-ray reports. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 6570–6580. Association for Computational Linguistics, Florence, Italy (2019). doi: 10.18653/v1/P19-1657. https://aclanthology.org/P19-1657
  • [3] Lovelace J., Mortazavi B.: Learning to generate clinically coherent chest X-ray reports. In: Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 1235–1243. Association for Computational Linguistics, Online (2020). doi: 10.18653/v1/2020.findings-emnlp.110. https://aclanthology.org/2020.findings-emnlp.110
  • [4] Tan B., Yang Z., Al-Shedivat M., Xing E., Hu Z.: Progressive generation of long text with pretrained language models. In: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 4313–4324. Association for Computational Linguistics, Online (2021). doi: 10.18653/v1/2021.naacl-main.341. https://aclanthology.org/2021.naacl-main.341
  • [5] Wang Z., Liu L., Wang L., Zhou L.: Metransformer: Radiology report generation by transformer with multiple learnable expert tokens. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11558–11567 (2023). https://openaccess.thecvf.com/content/CVPR2023/html/Wang_METransformer_Radiology_Report_Generation_by_Transformer_With_Multiple_Learnable_Expert_CVPR_2023_paper.html
  • [6] Lin T.-Y., Maire M., Belongie S., Hays J., Perona P., Ramanan D., Dollár P., Zitnick C.L.: Microsoft coco: Common objects in context. In: Fleet D., Pajdla T., Schiele B., Tuytelaars T. (eds.) Computer Vision – ECCV 2014, pp. 740–755. Springer, Cham (2014). doi: 10.1007/978-3-319-10602-1_48
  • [7] Nishino T., Ozaki R., Momoki Y., Taniguchi T., Kano R., Nakano N., Tagawa Y., Taniguchi M., Ohkuma T., Nakamura K.: Reinforcement learning with imbalanced dataset for data-to-text medical report generation. In: Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 2223–2236. Association for Computational Linguistics, Online (2020). doi: 10.18653/v1/2020.findings-emnlp.202
  • [8] Yu H., Zhang Q.: Clinically coherent radiology report generation with imbalanced chest x-rays. In: 2022 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp. 1781–1786. IEEE Computer Society, Los Alamitos, CA, USA (2022). doi: 10.1109/BIBM55620.2022.9994871
  • [9] Gu S., Zhang J., Meng F., Feng Y., Xie W., Zhou J., Yu D.: Token-level adaptive training for neural machine translation. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1035–1046. Association for Computational Linguistics, Online (2020). doi: 10.18653/v1/2020.emnlp-main.76. https://aclanthology.org/2020.emnlp-main.76
  • [10] Wu Y., Huang I.-C., Huang X.: Token imbalance adaptation for radiology report generation. In: Mortazavi B.J., Sarker T., Beam A., Ho J.C. (eds.) Proceedings of the Conference on Health, Inference, and Learning, vol. 209, pp. 72–85. PMLR (2023). https://proceedings.mlr.press/v209/wu23a.html
  • [11] Delbrouck J.-B., Chambon P., Bluethgen C., Tsai E., Almusa O., Langlotz C.: Improving the factual correctness of radiology report generation with semantic rewards. In: Findings of the Association for Computational Linguistics: EMNLP 2022, pp. 4348–4360. Association for Computational Linguistics, Abu Dhabi, United Arab Emirates (2022). https://aclanthology.org/2022.findings-emnlp.319
  • [12] Bengio Y., Louradour J., Collobert R., Weston J.: Curriculum learning. In: Proceedings of the 26th Annual International Conference on Machine Learning. ICML '09, pp. 41–48. Association for Computing Machinery, New York, NY, USA (2009). doi: 10.1145/1553374.1553380
  • [13] Johnson A.E.W., Pollard T.J., Berkowitz S.J., Greenbaum N.R., Lungren M.P., Deng C.-y., Mark R.G., Horng S.: MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Scientific Data 6(1), 317 (2019). doi: 10.1038/s41597-019-0322-0
  • [14] Demner-Fushman D., Kohli M.D., Rosenman M.B., Shooshan S.E., Rodriguez L., Antani S., Thoma G.R., McDonald C.J.: Preparing a collection of radiology examinations for distribution and retrieval. Journal of the American Medical Informatics Association 23(2), 304–310 (2015). doi: 10.1093/jamia/ocv080. https://academic.oup.com/jamia/article-pdf/23/2/304/34147537/ocv080.pdf
  • [15] Chen Z., Song Y., Chang T.-H., Wan X.: Generating radiology reports via memory-driven transformer. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1439–1449. Association for Computational Linguistics, Online (2020). doi: 10.18653/v1/2020.emnlp-main.112. https://aclanthology.org/2020.emnlp-main.112
  • [16] Yan A., He Z., Lu X., Du J., Chang E., Gentili A., McAuley J., Hsu C.-n.: Weakly supervised contrastive learning for chest x-ray report generation. In: Findings of the Association for Computational Linguistics: EMNLP 2021, pp. 4009–4015 (2021). https://aclanthology.org/2021.findings-emnlp.336.pdf
  • [17] Chen Z., Shen Y., Song Y., Wan X.: Cross-modal memory networks for radiology report generation. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 5904–5914. Association for Computational Linguistics, Online (2021). doi: 10.18653/v1/2021.acl-long.459. https://aclanthology.org/2021.acl-long.459
  • [18] Papineni K., Roukos S., Ward T., Zhu W.-J.: Bleu: A method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics. ACL '02, pp. 311–318. Association for Computational Linguistics, USA (2002). doi: 10.3115/1073083.1073135
  • [19] Yu S., Song J., Kim H., Lee S., Ryu W.-J., Yoon S.: Rare tokens degenerate all tokens: Improving neural text generation via adaptive gradient gating for rare token embeddings. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 29–45. Association for Computational Linguistics, Dublin, Ireland (2022). doi: 10.18653/v1/2022.acl-long.3. https://aclanthology.org/2022.acl-long.3
  • [20] Wang X., Chen Y., Zhu W.: A survey on curriculum learning. IEEE Transactions on Pattern Analysis and Machine Intelligence 44(9), 4555–4576 (2022). doi: 10.1109/TPAMI.2021.3069908
  • [21] Zhou T., Wang S., Bilmes J.: Curriculum learning by dynamic instance hardness. In: Larochelle H., Ranzato M., Hadsell R., Balcan M.F., Lin H. (eds.) Advances in Neural Information Processing Systems, vol. 33, pp. 8602–8613. Curran Associates, Inc. (2020). https://proceedings.neurips.cc/paper_files/paper/2020/file/62000dee5a05a6a71de3a6127a68778a-Paper.pdf
  • [22] He K., Zhang X., Ren S., Sun J.: Deep residual learning for image recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778 (2016). doi: 10.1109/CVPR.2016.90. https://ieeexplore.ieee.org/document/7780459
  • [23] Jain S., Agrawal A., Saporta A., Truong S., Duong D.N., Bui T., Chambon P., Zhang Y., Lungren M., Ng A., Langlotz C., Rajpurkar P.: Radgraph: Extracting clinical entities and relations from radiology reports. In: Vanschoren J., Yeung S. (eds.) Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks, vol. 1 (2021). https://datasets-benchmarks-proceedings.neurips.cc/paper/2021/file/c8ffe9a587b126f152ed3d89a146b445-Paper-round1.pdf
  • [24] Smit A., Jain S., Rajpurkar P., Pareek A., Ng A., Lungren M.: Combining automatic labelers and expert annotations for accurate radiology report labeling using BERT. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1500–1519. Association for Computational Linguistics, Online (2020). doi: 10.18653/v1/2020.emnlp-main.117. https://aclanthology.org/2020.emnlp-main.117
  • [25] Qin H., Song Y.: Reinforced cross-modal alignment for radiology report generation. In: Findings of the Association for Computational Linguistics: ACL 2022, pp. 448–458. Association for Computational Linguistics, Dublin, Ireland (2022). doi: 10.18653/v1/2022.findings-acl.38. https://aclanthology.org/2022.findings-acl.38
  • [26] Delbrouck J.-B., Varma M., Chambon P., Langlotz C.: Overview of the RadSum23 shared task on multi-modal and multi-anatomical radiology report summarization. In: The 22nd Workshop on Biomedical Natural Language Processing and BioNLP Shared Tasks, pp. 478–482. Association for Computational Linguistics, Toronto, Canada (2023). doi: 10.18653/v1/2023.bionlp-1.45. https://aclanthology.org/2023.bionlp-1.45
  • [27] Tanida T., Müller P., Kaissis G., Rueckert D.: Interactive and explainable region-guided radiology report generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7433–7442 (2023)
  • [28] Huang G., Liu Z., Van Der Maaten L., Weinberger K.Q.: Densely connected convolutional networks. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2261–2269 (2017). doi: 10.1109/CVPR.2017.243. https://ieeexplore.ieee.org/document/8099726
  • [29] Endo M., Krishnan R., Krishna V., Ng A.Y., Rajpurkar P.: Retrieval-based chest x-ray report generation using a pre-trained contrastive language-image model. In: Roy S., Pfohl S., Rocheteau E., Tadesse G.A., Oala L., Falck F., Zhou Y., Shen L., Zamzmi G., Mugambi P., Zirikly A., McDermott M.B.A., Alsentzer E. (eds.) Proceedings of Machine Learning for Health. Proceedings of Machine Learning Research, vol. 158, pp. 209–219. PMLR (2021). https://proceedings.mlr.press/v158/endo21a.html
  • [30] Jeong J., Tian K., Li A., Hartung S., Adithan S., Behzadi F., Calle J., Osayande D., Pohlen M., Rajpurkar P.: Multimodal image-text matching improves retrieval-based chest x-ray report generation. In: Oguz I., Noble J., Li X., Styner M., Baumgartner C., Rusu M., Heinmann T., Kontos D., Landman B., Dawant B. (eds.) Medical Imaging with Deep Learning. Proceedings of Machine Learning Research, vol. 227, pp. 978–990. PMLR (2024). https://proceedings.mlr.press/v227/jeong24a.html
  • [31] Kale K., Bhattacharyya P., Jadhav K.: Replace and report: NLP assisted radiology report generation. In: Findings of the Association for Computational Linguistics: ACL 2023, pp. 10731–10742. Association for Computational Linguistics, Toronto, Canada (2023). doi: 10.18653/v1/2023.findings-acl.683. https://aclanthology.org/2023.findings-acl.683
  • [32] Liu F., Ge S., Wu X.: Competence-based multimodal curriculum learning for medical report generation. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 3001–3012. Association for Computational Linguistics, Online (2021). doi: 10.18653/v1/2021.acl-long.234. https://aclanthology.org/2021.acl-long.234
  • [33] Li Q., Peng H., Li J., Xia C., Yang R., Sun L., Yu P.S., He L.: A survey on text classification: From traditional to deep learning. ACM Trans. Intell. Syst. Technol. 13(2) (2022). doi: 10.1145/3495162
  • [34] Lin T.-Y., Goyal P., Girshick R., He K., Dollár P.: Focal loss for dense object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 42(2), 318–327 (2020). doi: 10.1109/TPAMI.2018.2858826
  • [35] Chawla N.V., Bowyer K.W., Hall L.O., Kegelmeyer W.P.: SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research 16, 321–357 (2002). doi: 10.1613/jair.953
  • [36] Irvin J., Rajpurkar P., Ko M., Yu Y., Ciurea-Ilcus S., Chute C., Marklund H., Haghgoo B., Ball R., Shpanskaya K., Seekins J., Mong D.A., Halabi S.S., Sandberg J.K., Jones R., Larson D.B., Langlotz C.P., Patel B.N., Lungren M.P., Ng A.Y.: Chexpert: A large chest radiograph dataset with uncertainty labels and expert comparison. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 590–597 (2019). doi: 10.1609/aaai.v33i01.3301590. https://ojs.aaai.org/index.php/AAAI/article/view/3834
  • [37] Deka P., Jurek-Loughrey A., et al.: Evidence extraction to validate medical claims in fake news detection. In: International Conference on Health Information Science, pp. 3–15. Springer (2022). doi: 10.1007/978-3-031-20627-6_1
  • [38] Loper E., Bird S.: NLTK: The natural language toolkit. In: Proceedings of the ACL-02 Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics, pp. 63–70. Association for Computational Linguistics, Philadelphia, Pennsylvania, USA (2002). doi: 10.3115/1118108.1118117. https://aclanthology.org/W02-0109
  • [39] Kingma D., Ba J.: Adam: A method for stochastic optimization. In: International Conference on Learning Representations (ICLR), San Diego, CA, USA (2015). https://arxiv.org/abs/1412.6980
  • [40] Sutskever I., Vinyals O., Le Q.V.: Sequence to sequence learning with neural networks. In: Ghahramani Z., Welling M., Cortes C., Lawrence N., Weinberger K.Q. (eds.) Advances in Neural Information Processing Systems, vol. 27. Curran Associates, Inc. (2014). https://proceedings.neurips.cc/paper/2014/file/a14ac55a4f27472c5d894ec1c3c743d2-Paper.pdf
  • [41] Paszke A., Gross S., Massa F., Lerer A., Bradbury J., Chanan G., Killeen T., Lin Z., Gimelshein N., Antiga L., Desmaison A., Kopf A., Yang E., DeVito Z., Raison M., Tejani A., Chilamkurthy S., Steiner B., Fang L., Bai J., Chintala S.: Pytorch: An imperative style, high-performance deep learning library. In: Advances in Neural Information Processing Systems, vol. 32, pp. 8024–8035. Curran Associates (2019). https://proceedings.neurips.cc/paper/2019/file/bdbca288fee7f92f2bfa9f7012727740-Paper.pdf
  • [42] Denkowski M., Lavie A.: Meteor 1.3: Automatic metric for reliable optimization and evaluation of machine translation systems. In: Proceedings of the Sixth Workshop on Statistical Machine Translation, pp. 85–91. Association for Computational Linguistics, Edinburgh, Scotland (2011). https://aclanthology.org/W11-2107
  • [43] Lin C.-Y.: ROUGE: A package for automatic evaluation of summaries. In: Text Summarization Branches Out, pp. 74–81. Association for Computational Linguistics, Barcelona, Spain (2004). https://aclanthology.org/W04-1013



Articles from Research Square are provided here courtesy of American Journal Experts
