Scientific Reports. 2026 Jan 5;16:4645. doi: 10.1038/s41598-025-34799-2

Hybrid framework for lesion-aware, clinically coherent chest X-ray report generation using contrastive learning and large language models

Won-Jun Noh 1,#, Sun-Woo Pi 1,#, Byoung-Dai Lee 1,2
PMCID: PMC12868645  PMID: 41492086

Abstract

Automated radiology report generation from chest X-rays (CXRs) has the potential to reduce the workload of radiologists and improve diagnostic consistency. However, conventional approaches have been constrained by trade-offs between understanding global images and characterizing fine-grained lesions, often leading to omissions or clinically inconsistent narratives. This study proposed a hybrid framework, CLALA-Net, to integrate global and regional representations through three key modules: Lesion Cross-Attention (LCA), Lesion-Level Contrastive Learning (LLCL), and Image-Text Contrastive Learning (ITCL). LCA injects lesion-level cues derived from full-image classification into each region of interest (ROI), LLCL enhances discriminability by aligning lesion representations across CXRs, and ITCL improves visual-textual semantic alignment. A large language model (LLM)-based aggregator was utilized to consolidate ROI-level descriptions into a clinically coherent report. An LLM-driven label extraction pipeline was introduced to generate fine-grained lesion annotations for training and evaluation. Extensive experiments on the Chest-Imagenome dataset demonstrated that CLALA-Net outperformed existing baselines in both lesion-level accuracy (mean F1-score: 0.40) and report-level consistency (total score: 14.32/20). Ablation studies confirmed the complementary roles of LCA and LLCL, whereas the sensitivity analysis indicated strong performance gains with improved label quality. By bridging full-image contextual reasoning with regional-level lesion analysis, CLALA-Net produced accurate, semantically consistent, and clinically reliable chest radiography reports. This framework provides a robust and interpretable foundation for the real-world deployment of automated radiological reporting.

Supplementary Information

The online version contains supplementary material available at 10.1038/s41598-025-34799-2.

Keywords: Chest x-ray, Contrastive learning, Large language model, Multimodal learning, Radiology report generation

Subject terms: Computational biology and bioinformatics, Medical research

Introduction

Chest radiography is the cornerstone diagnostic modality for detecting a wide range of thoracic conditions, including pneumonia, tuberculosis, emphysema, and fractures, owing to its affordability, accessibility, and diagnostic versatility. However, the growing volume of chest radiography studies has placed unprecedented strain on radiology services, leading to longer turnaround times and potential declines in reporting accuracy and consistency1.

To address this challenge, the automation of diagnostic imaging interpretation, including chest X-rays (CXRs), through deep learning and artificial intelligence (AI) has emerged as a critical area of research2,3. Among these efforts, automatic radiology report generation has shown particular promise, as it can reduce manual reporting time while improving consistency and diagnostic reliability. Existing approaches typically fall into one of two categories. Full image-based methods analyze the entire CXR to generate a single report, benefiting from global contextual awareness but often failing to capture localized findings. In contrast, region-based methods analyze predefined anatomical regions to describe fine-grained features but are prone to narrative inconsistency and redundancy. For example, a region-based system might simultaneously state “no consolidation” and “patchy opacities” within the same zone, contradictions that are uncommon in expert-written reports.

In real-world practice, radiologists interpret CXRs by integrating both global patterns and regional lesion cues into a unified clinical impression. Motivated by this reasoning process, we propose Contrastive Lesion Attention and large language model (LLM) Aggregation Network (CLALA-Net), a hybrid framework that combines the contextual efficiency of full-image modeling with the lesion-level precision of region-based analysis. The architecture consists of three key components. First, the Lesion Cross-Attention (LCA) module integrates lesion-level predictions derived from the entire CXR into the encoding of region-specific features, thereby enabling each ROI to be contextualized with global diagnostic cues. Second, the Lesion-Level Contrastive Learning (LLCL) module constructs semantically meaningful pairs of ROI embeddings based on lesion status and anatomical location, enforcing intra-lesion consistency and inter-lesion separability within the learned feature space. Finally, the Image-Text Contrastive Learning (ITCL) module aligns ROI-level visual embeddings with the corresponding textual descriptions from radiology reports, reinforcing multimodal consistency and improving the clinical fidelity of the generated language. An LLM-based aggregator subsequently integrates regional-level outputs into a unified, coherent report, resolving conflicts and minimizing redundancy.

The main contributions of this study are summarized as follows:

  • We propose CLALA-Net, a hybrid CXR report-generation framework that leverages lesion-guided attention and contrastive learning to unify full-image and region-based features, improving both detection sensitivity and report coherence.

  • We introduce an LLM-based evaluation pipeline that enables scalable, lesion-level accuracy assessment and narrative consistency measurement between predicted and reference reports.

  • We demonstrate the effectiveness of CLALA-Net through extensive experiments on the Chest-Imagenome dataset4, achieving superior performance over state-of-the-art baselines in both lesion-level and report-level metrics.

The remainder of this paper is organized as follows. Section 2 reviews related work. Section 3 describes the CLALA-Net methodology. Section 4 details the experimental setup and evaluation protocol. Section 5 presents the quantitative and qualitative results. Finally, Section 6 concludes the paper and outlines future research directions.

Related work

Image captioning

Image captioning aims to generate natural language descriptions from visual inputs by integrating advances in computer vision and natural language processing. Traditional approaches combined convolutional neural networks (CNNs) for feature extraction with recurrent neural networks (RNNs) for sequence modeling. With the advent of Transformer architectures and large-scale vision–language pre-training, significant improvements have been achieved in accuracy and contextual coherence.

These techniques have been adapted to automate radiology report generation from CXRs. A pioneering CNN–LSTM framework was proposed by Sishar et al.5, in which salient image features are extracted and corresponding reports are generated. Subsequently, Fenglin et al.6 introduced a contrastive attention mechanism designed to enhance focus on lesion-specific regions. However, these methods often rely on coarse global representations, limiting their capacity to capture fine-grained clinical findings and resulting in inconsistent narratives.

To address this limitation, more recent studies have explored vision-language pretrained models and refined alignment strategies that more effectively bridge the gap between medical images and diagnostic reports2. These advances provide a foundation for more accurate and clinically reliable report-generation systems.

Radiology report generation

Automated CXR report generation methods can be broadly categorized into full image-based, region-based, and hybrid strategies. Each category reflects a distinct trade-off between global context modeling, localized feature focus, and integration robustness.

Full image-based models process the entire CXR as a single input. For example, Radiology Report Generation (R2Gen)7 employs a transformer-based architecture augmented with a query-driven relational memory module that progressively generates reports by leveraging previously decoded context. METransformer8 enhances representation learning by utilizing multiple parameter sets and orthogonal loss, encouraging diverse feature learning without requiring external annotations. Recent approaches have further enriched image features with domain knowledge: dynamic structure and nodes are integrated in Contrastive Learning9 through lesion-relationship graphs; complex organ mask-guided report generation10 applies segmentation masks with weighted attention; and RaDialog11 adopts a two-stage pipeline that feeds both raw images and lesion predictions into an LLM. Although effective in capturing global context, these models may overlook subtle, localized findings.

Extending this paradigm, grounded and generalist vision–language models have emerged as powerful encoders for radiology reporting. BioViL12 adopts a dual-encoder architecture trained with image–text matching and contrastive objectives, aligning CXR images with sentence-level radiology findings. MAIRA-213 further strengthens this alignment by introducing memory-aware region grounding, enabling more explicit association between localized visual cues and textual report segments. Although these models do not perform report generation directly, they serve as transferable, semantically grounded feature extractors that can be fine-tuned for downstream tasks.

Generalist vision–language models such as XrayGPT14 and LLaVA-Rad15 extend this line of work by adapting large-scale pretrained image–text architectures to the radiology domain via domain-specific instruction tuning. These models leverage natural image–text corpora and radiological prompts to produce fluent, long-form outputs. However, they often lack explicit regional supervision and rely on high-level associations between entire images and global narratives, which can limit their sensitivity to subtle or spatially constrained pathologies.

GLoRIA16 introduced an attention-based framework that learns both global and local representations by contrasting image sub-regions with word-level features in paired radiology reports. By computing similarity matrices between sub-region features and semantic tokens, it produces attention-weighted visual embeddings, enabling pathology-specific feature alignment. However, because the image sub-regions are implicitly selected by attention rather than anatomically predefined, the alignment between visual regions and textual phrases can become inconsistent across images, increasing the risk of missing clinically important structures.

Region-based approaches introduce explicit anatomical structure into the generation process by first detecting relevant subregions and then producing localized descriptions. For example, Region-guided Radiology Report Generation (RGRG)17 employs object detection to localize 29 predefined chest regions, from which region-specific features are extracted. Abnormal ROIs are selected using an auxiliary classification loss. These features are then decoded using GPT-218, whereas redundant sentences are removed using BERTScore19. Although such frameworks enable fine-grained analysis, they incur higher computational costs owing to multi-region processing and carry the risk of introducing contradictions during sentence integration, particularly when description similarity is prioritized over clinical accuracy.

Hybrid models aim to combine the strengths of full image-based and region-based methods by integrating global semantic awareness with region-level precision. CLALA-Net follows this paradigm by using LCA to infuse ROI embeddings with image-level context, applying contrastive learning to enforce lesion representations, and utilizing an LLM-based aggregator to synthesize concise, clinically coherent reports that address both the coarse coverage limitations of full image-based models and the integration challenges inherent to region-based systems.

Methods

Overview

The generation of radiology reports from CXR images requires a model architecture capable of capturing both global contextual information and localized lesion features. A critical requirement is the implementation of differentiated representation learning based on the presence or absence of lesions within each anatomically defined ROI, while employing a report synthesis strategy that preserves semantic coherence throughout the generated report. To satisfy these requirements, CLALA-Net is proposed as a hybrid architecture that generates diagnostic reports based on region-level features enriched with lesion-specific information. Unlike conventional region-based methods, such as RGRG, the proposed approach refines region-specific representations by incorporating lesion predictions derived from the entire CXR image as global guidance, thereby enabling more precise and clinically consistent report generation.

Figure 1 illustrates the overall process of automated CXR report generation using the proposed framework. Solid and dotted lines denote the training and inference phases, respectively. The framework begins by extracting predefined anatomical ROIs from the input CXR image using an ROI extractor. An image encoder then generates initial embeddings for each ROI. Simultaneously, the CXR classifier predicts the overall lesion status at the image level. These diagnostic cues are then integrated into the ROI embeddings through the LCA mechanism, allowing region-level representations to be refined using global contextual information. Next, LLCL is applied to enforce consistency among embeddings corresponding to ROIs with the same lesion type while enhancing separability across different lesion types. Additionally, ITCL aligns each ROI embedding with the corresponding textual descriptions found in radiology reports, reinforcing cross-modal semantic coherence. Finally, the learned ROI embeddings are passed to a text decoder to generate individual region-level descriptions. These partial descriptions are then aggregated by a report aggregator, which produces a final report that is both semantically coherent and clinically consistent.

Fig. 1.
Overview of the proposed CLALA-Net framework for automated chest X-ray report generation. The model integrates global and regional information through the Lesion Cross-Attention, Lesion-Level Contrastive Learning, and Image–Text Contrastive Learning modules, followed by large language model-based aggregation. Solid arrows indicate the training phase, whereas dashed arrows denote the inference phase.

The following section provides a detailed description of each component introduced in the proposed framework.

Anatomical ROI definition

The proposed framework adopts anatomically defined ROIs as the basic units for region-level report generation. ROIs are purely anatomical and lesion-agnostic at the extraction stage, providing a consistent spatial basis for organizing localized visual evidence while decoupling anatomical localization from diagnostic supervision.

To reduce redundancy and focus on clinically relevant content, report generation is restricted to a fixed set of 11 anatomically salient ROIs that reflect routine radiological reporting practice: left lung, right lung, left lower lung zone, right lower lung zone, left hilar structures, right hilar structures, mediastinum, upper mediastinum, left costophrenic angle, right costophrenic angle, and cardiac silhouette.

Lesion awareness is introduced at the representation learning stage through the LCA and LLCL modules. Global lesion predictions derived from the full CXR image are injected as soft guidance to refine ROI features, enabling lesion-aware reasoning while retaining the flexibility to capture secondary or subtle findings based on regional visual cues.

Lesion cross attention

ROI-only methods risk omitting lesion cues that are derived from the full image context. To address this, an LCA was introduced, which injects global lesion predictions into the feature encoding of each region. First, a predefined list of lesion names was passed through a text encoder to obtain the corresponding lesion text embeddings:

$$T = \mathrm{TextEncoder}\big(\{\ell_1, \ldots, \ell_L\}\big) \in \mathbb{R}^{L \times d}$$

where $L$ represents the number of lesion categories and $d$ denotes the dimensionality of the text embeddings. Then, a weighting matrix $W \in \mathbb{R}^{L \times d}$ is applied to reflect the diagnostic status (presence or absence) of each lesion in the corresponding CXR image. To account for label uncertainty and potential false predictions, weights of 0.9 and 0.1 were assigned to positively and negatively identified lesions, respectively. This design avoids completely discarding lesion information, an inherent risk when assigning a weight of 0. These weights are integrated into the lesion embeddings using a Hadamard product (element-wise multiplication), defined as follows:

$$\tilde{T} = W \odot T$$

In the LCA module, the visual features of all ROIs are denoted as

$$V = \big[v_1; v_2; \ldots; v_N\big] \in \mathbb{R}^{N \times d}$$

where $N$ is the number of ROIs and $d$ is the dimensionality of each ROI embedding. These ROI embeddings act as queries in the cross-attention mechanism, whereas the weighted lesion embeddings $\tilde{T}$ serve as both keys and values.

$$\hat{V} = \mathrm{softmax}\!\left(\frac{V \tilde{T}^{\top}}{\sqrt{d}}\right)\tilde{T}$$

Through this process, lesion-specific information is incorporated into each ROI representation, resulting in refined ROI embeddings that encode globally informed, lesion-aware contextual features. Importantly, the injected lesion information serves as a soft hint rather than a hard constraint. The model is not restricted to describing only the abnormalities indicated by the global lesion labels. Instead, it retains the capacity to identify and express additional or secondary findings, such as subtle morphological changes or atypical presentations, by leveraging both full-image context and fine-grained regional visual cues. These enriched embeddings are subsequently employed in downstream contrastive learning and report generation, ensuring that the representation of each region retains the critical lesion information inferred from the entire CXR image.
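The LCA computation described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions: the function and variable names, and the residual update at the end, are not taken from the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def lesion_cross_attention(roi_emb, lesion_emb, lesion_present):
    """Inject weighted lesion embeddings into ROI embeddings.

    roi_emb:        (N, d) ROI visual features, used as queries
    lesion_emb:     (L, d) lesion-name text embeddings
    lesion_present: (L,)   binary predictions from the full-image classifier
    """
    d = roi_emb.shape[-1]
    # Soft weighting: 0.9 for predicted-positive lesions, 0.1 for negative,
    # so negative lesion cues are attenuated rather than discarded entirely.
    w = np.where(lesion_present == 1, 0.9, 0.1)[:, None]   # (L, 1)
    kv = w * lesion_emb                                    # Hadamard product
    attn = softmax(roi_emb @ kv.T / np.sqrt(d))            # (N, L) attention
    # Residual refinement (an assumption): each ROI keeps its own features
    # while absorbing globally informed, lesion-aware context.
    return roi_emb + attn @ kv
```

A call with 11 ROIs and 14 lesion categories returns one refined embedding per ROI, ready for the downstream contrastive objectives.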

Lesion-level contrastive learning

Although the LCA module enriches each ROI embedding with global lesion context, additional mechanisms are required to enforce both intraclass consistency and interclass separability among lesion representations. To address this need, an LLCL module is introduced. The key distinction of the LLCL module lies in the construction of anchor-comparison embedding pairs. Specifically, the anchor embeddings are derived from ROI features refined via LCA, whereas the corresponding comparison embeddings originate from raw ROI features without LCA refinement. Each comparison ROI is extracted from other CXR images annotated with the same lesion type as the anchor ROI and located in the same anatomical region. This design enables the model to explicitly learn alignment of lesion-informed representations according to lesion labels, thereby reinforcing the effectiveness of the LCA mechanism. LLCL further guides LCA-refined embeddings to exhibit semantically meaningful differences across lesion categories through contrastive learning.

In LLCL, ROI pairs sharing the same lesion type are defined as positive pairs, whereas those containing different lesion types or originating from different anatomical locations are defined as negative pairs. A contrastive loss based on InfoNCE20 encourages the model to pull positive pairs closer in the embedding space while pushing negative pairs farther apart. This process facilitates the learning of lesion-discriminative representations that maintain consistency across similar abnormalities and are distinct across different abnormalities.

$$\mathcal{L}_{\mathrm{LLCL}} = -\frac{1}{B}\sum_{i=1}^{B} \log \frac{\exp\!\big(\mathrm{sim}(a_i, p_i)/\tau\big)}{\sum_{j=1}^{B} \exp\!\big(\mathrm{sim}(a_i, p_j)/\tau\big)}$$

where $a_i$ denotes the anchor ROI embedding refined through the LCA module and $p_i$ refers to the corresponding comparison ROI embedding without LCA refinement. The function $\mathrm{sim}(\cdot,\cdot)$ represents cosine similarity, $\tau$ denotes a temperature scaling hyperparameter, and $B$ represents the number of samples in the batch. By explicitly structuring the embedding space according to lesion labels and locations, the LLCL enhances the ability of the model to discriminate between clinically similar abnormalities while maintaining consistency across ROIs. This capability is crucial for generating accurate and reliable radiology reports.
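A minimal NumPy sketch of an InfoNCE objective of this form, with anchors playing the role of LCA-refined embeddings and comparisons the role of raw ROI embeddings; the function name and in-batch negative scheme are illustrative assumptions:

```python
import numpy as np

def info_nce(anchors, comparisons, tau=0.07):
    """InfoNCE over a batch: anchors[i] and comparisons[i] form the positive
    pair; the other comparisons in the batch act as negatives.

    anchors, comparisons: (B, d) embeddings (normalized internally).
    """
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    c = comparisons / np.linalg.norm(comparisons, axis=1, keepdims=True)
    sim = a @ c.T / tau                        # (B, B) scaled cosine similarities
    # Per-anchor loss: log of the row partition function minus the positive logit.
    logZ = np.log(np.exp(sim).sum(axis=1))
    return float(np.mean(logZ - np.diag(sim)))
```

When the matched pair is pulled together, the diagonal logit dominates its row and the loss approaches zero; mismatched pairings yield a markedly higher loss.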

Image-text contrastive learning

Although LCA and LLCL establish lesion-aware visual features, these representations must be aligned with their textual counterparts to ensure clinically coherent language generation. To address this need, an ITCL module was introduced to enforce the multimodal correspondence between each ROI embedding and its associated report sentences.

ITCL employs a contrastive learning strategy that aligns visual and textual representations, encouraging clinically relevant image–text pairs to cluster closely in the embedding space, while pushing unrelated pairs further apart. Consequently, each ROI embedding becomes more accurately aligned with the semantics of its corresponding textual description, thereby enabling the text decoder to generate more precise and clinically meaningful sentences. Specifically, a contrastive loss is employed to maximize the similarity between each ROI image embedding $v_i$ and its corresponding textual embedding $t_i$, while minimizing the similarity with textual embeddings of other, non-matching ROIs. The loss function is formulated using the InfoNCE objective as follows:

$$\mathcal{L}_{\mathrm{ITCL}} = -\frac{1}{B}\sum_{i=1}^{B} \log \frac{\exp\!\big(\mathrm{sim}(v_i, t_i)/\tau\big)}{\sum_{j=1}^{B} \exp\!\big(\mathrm{sim}(v_i, t_j)/\tau\big)}$$

where $\mathrm{sim}(\cdot,\cdot)$ denotes cosine similarity, $\tau$ is a temperature scaling hyperparameter, and $B$ represents the number of samples in the batch.

Through this process, ROI embeddings are trained to align semantically with their clinically relevant textual counterparts. This alignment plays a crucial role in enhancing the coherence between lesion representations and language expressions, thereby improving the quality and clinical relevance of generated radiology reports.
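An ITCL-style objective can be sketched in NumPy as below. Note the hedge: this sketch uses the symmetric CLIP-style variant that averages the image-to-text and text-to-image directions, which is a common practical choice and an assumption here, as are the function names.

```python
import numpy as np

def itcl_loss(img_emb, txt_emb, tau=0.07):
    """Symmetric image-text contrastive loss over a batch.

    img_emb[i] and txt_emb[i] are the matched ROI image / sentence pair.
    """
    v = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    t = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = v @ t.T / tau                        # (B, B) similarity logits

    def xent_diag(m):
        # Stabilized log-softmax per row; the diagonal holds the positives.
        m = m - m.max(axis=1, keepdims=True)
        logp = m - np.log(np.exp(m).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))

    # Average the image->text and text->image directions.
    return 0.5 * (xent_diag(logits) + xent_diag(logits.T))
```

Matched image-text batches drive the loss toward zero, whereas mispaired batches keep it high, which is exactly the pressure that aligns ROI embeddings with their report sentences.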

Text decoder

The Transformer-based text decoder consumes lesion-enriched ROI embeddings to generate concise, clinically meaningful statements for each anatomical region, accurately describing lesion presence, location, and morphology. Similar to standard autoregressive language models, the text decoder predicts the probability distribution of the next word $w_t$ given the previously generated word sequence $w_{<t}$ and the corresponding ROI embedding $v$, generating sentences in a step-by-step manner. By grounding the generation process in LCA- and contrastive-refined embeddings, the decoder avoids generic templates and instead produces language that faithfully reflects lesion characteristics.

During training, the cross-entropy loss between predicted and ground-truth tokens was minimized:

$$\mathcal{L}_{\mathrm{CE}} = -\sum_{t=1}^{T} \log p\big(w_t \mid w_{<t}, v\big)$$

where $T$ denotes the target sequence length, $w_t$ represents the ground-truth word at time step $t$, and $v$ refers to the ROI embedding enriched with lesion information through the LCA and contrastive learning.

By training the text decoder in this manner, the model can generate natural language reports that go beyond surface-level descriptions, effectively incorporating lesion characteristics and clinical context. This design enables the decoder to generate region-level descriptions that are not limited to the presence or absence of predefined diseases, but instead convey richer clinical observations derived from the underlying ROI embeddings.
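The token-level cross-entropy above can be computed from per-step decoder logits as in this NumPy sketch; the function name and tensor shapes are illustrative, not taken from the paper:

```python
import numpy as np

def decoder_ce_loss(step_logits, target_ids):
    """Teacher-forced cross-entropy for one region-level sentence.

    step_logits: (T, V) decoder logits at each step, conditioned on the
                 ROI embedding and the ground-truth prefix w_{<t}.
    target_ids:  (T,)   ground-truth token ids w_1 .. w_T.
    """
    # Stabilized log-softmax over the vocabulary at each step.
    z = step_logits - step_logits.max(axis=1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    # Negative log-likelihood of each ground-truth token, summed over steps.
    return float(-logp[np.arange(len(target_ids)), target_ids].sum())
```

Logits that concentrate mass on the ground-truth tokens drive the loss toward zero, while uninformative (uniform) logits give a loss of T·log(V).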

Aggregator

To generate a comprehensive report that reflects the entire CXR image, the integration strategy must go beyond simple concatenation of region-level descriptions and instead ensure semantic coherence and narrative consistency across sentences. Without this consideration, conflicting findings may arise, particularly between positive and negative statements, and redundant content may accumulate across ROI-level reports, ultimately compromising readability and clinical reliability.

To address these challenges, an aggregation module was designed to leverage an LLM for semantically consolidating region-level diagnostic statements into a unified final report. The LLM evaluates semantic relationships between sentences generated from each ROI, eliminating unnecessary repetition while preserving logical coherence. Because the inputs to the aggregator are observation-level ROI reports rather than discrete disease labels, the LLM can reason over complementary or partially overlapping regional descriptions instead of merely verbalizing classification outcomes.

During this process, the LLM prioritizes content based on the clinical significance of lesion descriptions, emphasizing the presence of abnormalities over negative findings when structuring the final report. When both positive and negative statements are generated for the same lesion, the framework prioritizes the inclusion of positive findings to ensure a more clinically sensitive and cautious interpretation. This design reflects real-world diagnostic practice, where acknowledging the potential presence of an abnormality is preferable to prematurely ruling it out. For regions without detected lesions, verbose descriptions are minimized in favor of concise, informative expressions, thereby enhancing the overall efficiency and consistency of the report. This aggregation strategy allows the final report to reconcile local observations within a coherent global narrative, while preserving clinically relevant nuance present at the region level.

The prompt used for unified report generation is provided in Supplementary Appendix A1 and includes explicit instructions on lesion prioritization, redundancy handling, and content condensation. This enables the LLM to produce consolidated reports of consistent quality and clinical coherence. Consequently, the proposed aggregation strategy serves as a crucial post-processing component in automated reporting systems, contributing directly to both diagnostic accuracy and interpretive consistency.
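A toy sketch of how such an aggregation prompt could be assembled from ROI-level outputs; the rules below paraphrase the priorities described in this section, and the function name and wording are hypothetical (the authors' actual prompt is in Supplementary Appendix A1):

```python
def build_aggregation_prompt(roi_reports):
    """Compose an LLM aggregation prompt from region-level descriptions.

    roi_reports: dict mapping ROI name -> generated sentence(s).
    """
    findings = "\n".join(f"- {roi}: {text}" for roi, text in roi_reports.items())
    # Instruction block encoding lesion prioritization, redundancy handling,
    # and content condensation (illustrative wording).
    rules = (
        "Merge the regional findings below into one coherent radiology report.\n"
        "1. If a lesion has both positive and negative statements, keep the positive one.\n"
        "2. Remove redundant or repeated statements.\n"
        "3. Summarize regions without abnormalities briefly.\n"
    )
    return rules + "\nRegional findings:\n" + findings
```

The resulting string would be sent to the LLM, which returns the consolidated final report.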

Experimental setup and evaluation protocol

Datasets

The proposed method was evaluated using the publicly available and de-identified Chest-Imagenome dataset, an extension of the widely used Medical Information Mart for Intensive Care Chest X-Ray (MIMIC-CXR) dataset21. MIMIC-CXR contains 377,110 chest X-ray images from 227,827 radiographic studies covering 65,379 unique patients, developed jointly by Beth Israel Deaconess Medical Center and the Massachusetts Institute of Technology as part of the MIMIC project.

The Chest-Imagenome dataset, derived from 242,072 chest X-ray studies within MIMIC-CXR, provides bounding-box annotations that localize pathological findings within specific anatomical regions, thereby supporting tasks such as lesion classification, detection, segmentation, and anatomical localization. In this study, the anatomical region coordinates from Chest-Imagenome were utilized to preprocess chest X-ray images for region-level radiology report generation. The final dataset used in our experiments comprised 111,326 training, 15,640 validation, and 31,255 test images, with stratified patient-level sampling performed to prevent overlap and to ensure a consistent distribution of pathologies and anatomical regions across partitions.

As both MIMIC-CXR and Chest-Imagenome are publicly available and fully de-identified in compliance with the Health Insurance Portability and Accountability Act, this research was exempt from requiring additional institutional review board approval, and all experiments were conducted in accordance with relevant guidelines and regulations.

Lesion labeling strategy

Effective training of the LCA and LLCL modules requires accurate lesion annotations that include both the presence and absence of abnormalities and their corresponding anatomical locations on the chest radiographs. These labels are essential for distinguishing between positive and negative lesion instances during the construction of contrastive pairs and for embedding lesion-specific information into each ROI representation.

Although previous studies have employed automated labeling tools such as CheXbert22, these methods only yield coarse-grained binary labels (positive/negative) for each lesion, without providing anatomical localization. This can lead to incorrect pairing when the same pathology appears in different anatomical regions, potentially undermining the effectiveness of contrastive learning and attentional mechanisms.

To address this limitation, a label extraction pipeline was developed using an LLM designed to generate refined lesion labels with higher accuracy and clinical granularity by leveraging both diagnostic status and region-specific context. The prompt for label extraction was structured as follows. First, a list of target lesions was provided, along with example expressions commonly found in radiology reports for each lesion, encompassing both positive and negative cases. To enhance reasoning capabilities, a chain-of-thought23 prompting strategy was adopted, which guides the LLM to infer not only the presence or absence of a lesion but also its anatomical location and the rationale behind the prediction in a step-by-step manner. To ensure label reliability, we validated our LLM-derived annotations against both CheXbert outputs and a radiologist-annotated subset of the Chest-Imagenome test set. Detailed results are reported in Section 5.3.

This approach is particularly effective in managing the contextual variability and linguistic ambiguity inherent in radiological descriptions, significantly improving the precision of the extracted labels by leveraging the high-level reasoning capabilities of the LLM. The extracted lesion information was structured according to lesion name, status (positive or negative), and location (left, right, bilateral, or not mentioned) and stored in a standardized JSON format. Lesions not explicitly mentioned in the report were assigned a negative status, whereas missing location information was labeled as “Not mentioned” to prevent ambiguity during downstream processing. These refined labels serve three critical purposes within our framework. First, in LLCL training, they are used to construct contrastive pairs based on both lesion status (positive or negative) and anatomical location. Second, in the LCA module, lesion presence is used to compute attention weights, thereby integrating lesion-specific information into the ROI embeddings. Third, they act as supervised targets for training the CXR classifier, enabling the model to learn the lesion status from full CXR images.
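The normalization rules just described (unmentioned lesions default to negative, missing locations to "Not mentioned") can be sketched as a small parser; the JSON field names (`lesion`, `status`, `location`) are assumptions about the stored schema:

```python
import json

def parse_lesion_labels(raw_json, target_lesions):
    """Normalize LLM-extracted lesion labels into a fixed-schema dict.

    raw_json:       JSON array of {"lesion", "status", "location"} entries.
    target_lesions: full list of lesion names the classifier is trained on.
    """
    extracted = {e["lesion"]: e for e in json.loads(raw_json)}
    labels = {}
    for name in target_lesions:
        e = extracted.get(name, {})
        labels[name] = {
            # Lesions not mentioned in the report default to negative.
            "status": e.get("status", "negative"),
            # Missing location information is labeled explicitly.
            "location": e.get("location", "Not mentioned"),
        }
    return labels
```

The resulting dict supplies contrastive-pair construction (status plus location), LCA attention weights (status), and classifier supervision.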

The detailed structure of the lesion-labeling prompt is provided in Supplementary Appendix A2. This LLM-based labeling approach offers an efficient solution for automatically generating high-quality, fine-grained annotations, which form the foundation for contrastive and lesion-specific learning. Given its scalability and adaptability, the proposed method has a significant potential for broader applications across various medical imaging domains.

Implementation and training details

The proposed model utilizes a vision transformer (ViT-B)24, pre-trained on ImageNet25, as the image encoder, and adopts a BERT-based architecture26 as the text encoder. Training was conducted over 15 epochs with a batch size of 24, using the AdamW optimizer27 with a weight decay coefficient of 0.05.

The training procedure was divided into two stages: pre-training and fine-tuning. During the pre-training stage, all modules were jointly optimized to learn generalizable representations, with an initial learning rate of 1e-4. In the subsequent fine-tuning stage, updates were restricted to the LCA and text decoder modules to perform more focused adjustments, and the learning rate was reduced to 1e-5 to allow for precise optimization.
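The two-stage schedule above can be summarized as a small configuration table; the module names below are illustrative placeholders for the components named in the text, not the authors' identifiers.

```python
# Sketch of the two-stage training schedule described above. Module names
# are placeholders; in a PyTorch implementation each entry would map to a
# sub-module whose parameters are frozen or passed to the AdamW optimizer.
WEIGHT_DECAY = 0.05  # AdamW weight decay used in both stages

STAGES = {
    # Pre-training: all modules jointly optimized at lr 1e-4.
    "pretrain": {"lr": 1e-4, "trainable": {"image_encoder", "text_encoder",
                                           "lca", "text_decoder"}},
    # Fine-tuning: only the LCA module and text decoder are updated, at lr 1e-5.
    "finetune": {"lr": 1e-5, "trainable": {"lca", "text_decoder"}},
}

def stage_plan(stage, all_modules):
    """Return the learning rate plus trainable/frozen split for a stage."""
    cfg = STAGES[stage]
    trainable = sorted(m for m in all_modules if m in cfg["trainable"])
    frozen = sorted(m for m in all_modules if m not in cfg["trainable"])
    return {"lr": cfg["lr"], "weight_decay": WEIGHT_DECAY,
            "trainable": trainable, "frozen": frozen}
```

In practice, the frozen list would be applied by setting `requires_grad = False` on the corresponding parameters before constructing the optimizer for each stage.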

The input resolution of ROI image patches was progressively varied across the two stages to enhance feature learning. The images were resized to 224 × 224 pixels during pre-training and increased to 384 × 384 pixels during fine-tuning and final report generation. This higher-resolution input allowed the model to capture finer visual details critical for clinical interpretation.

To predict lesion presence at the global image level, we employed BioViL-T12, a CLIP28-based vision–language model trained using lesion labels generated by the proposed LLM-based annotation pipeline. Training was performed on 14 lesion categories, encompassing common thoracic conditions frequently observed in clinical practice. These include cardiomegaly, emphysema, pleural effusion, hernia, nodules, pneumothorax, atelectasis, pleural thickening, mass, edema, consolidation, infiltration, fibrosis, and pneumonia.
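A standard multi-label reading, with an independent sigmoid per lesion, is a reasonable sketch of how per-lesion presence can be predicted over these 14 categories; the 0.5 threshold is an assumption for illustration, not a reported hyperparameter.

```python
import math

# The 14 thoracic lesion categories listed above.
LESIONS = ["cardiomegaly", "emphysema", "pleural effusion", "hernia",
           "nodule", "pneumothorax", "atelectasis", "pleural thickening",
           "mass", "edema", "consolidation", "infiltration", "fibrosis",
           "pneumonia"]

def predict_lesions(logits, threshold=0.5):
    """Map 14 per-lesion logits to binary presence predictions using an
    independent sigmoid per class (standard multi-label classification)."""
    assert len(logits) == len(LESIONS)
    probs = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    return {name: p >= threshold for name, p in zip(LESIONS, probs)}
```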

Regarding ROI localization, ground-truth anatomical bounding boxes provided by the Chest-Imagenome dataset were used during training to extract ROIs. During inference, YOLOv1029, an efficient real-time object detector, was employed to localize anatomical ROIs within each CXR image. Importantly, this detector was used to identify predefined anatomical regions rather than to predict lesion categories, ensuring consistency between training and deployment settings. As a result, the model benefits from precise anatomical supervision during training while remaining robust to practical localization noise at test time. For report generation, the model was restricted to a fixed set of 11 clinically relevant anatomical ROIs. Each ROI patch was processed independently by the image encoder to obtain region-level visual embeddings, which were subsequently refined through lesion-aware attention and contrastive learning.
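As a minimal illustration of the ROI handling above, the helper below crops a region given a bounding box; in the paper the boxes come from Chest-Imagenome ground truth during training and from the YOLOv10 detector at inference, after which each crop is resized (e.g., to 384 × 384) and encoded independently.

```python
def crop_roi(image, box):
    """Crop an anatomical ROI from a 2-D image (list of pixel rows) given a
    bounding box (x1, y1, x2, y2) in pixel coordinates, using the usual
    half-open convention (x2 and y2 exclusive)."""
    x1, y1, x2, y2 = box
    return [row[x1:x2] for row in image[y1:y2]]

# Tiny 4x4 example image with pixel values 0..15.
img = [[r * 4 + c for c in range(4)] for r in range(4)]
roi = crop_roi(img, (1, 1, 3, 3))  # central 2x2 crop
```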

For the final report consolidation, GPT-4o mini30 was employed, guided by carefully designed prompts to eliminate redundant content across region-level descriptions and ensure semantic consistency throughout the report. All models were implemented using PyTorch31, and both training and inference were conducted on a single NVIDIA RTX 3090 GPU.

Evaluation strategy

To ensure a robust and clinically meaningful assessment of the automatically generated radiology reports, we employed a combination of quantitative metrics and expert-informed interpretive criteria. In particular, two complementary evaluation strategies were adopted: lesion-based evaluation and consistency-based evaluation. Additionally, to align with recent evaluation practices in radiology report generation, we incorporated several widely adopted metrics including Metric for Evaluation of Translation with Explicit Ordering (METEOR)32, Recall-Oriented Understudy for Gisting Evaluation–Longest Common Subsequence (ROUGE-L)33, Generative Radiology Report Evaluation and Error Notation (GREEN)34, and F1-RadGraph35.

The lesion-based evaluation quantifies the model’s ability to correctly describe the presence or absence of specific abnormalities, while the consistency-based evaluation assesses the overall semantic fidelity, fluency, and structural coherence of the generated reports. All evaluations are fully automated via LLM-guided extraction and scoring processes, ensuring scalable and reproducible analysis across large datasets.

Lesion evaluation

Lesion-based evaluation was performed by extracting and comparing the lesion statuses (positive or negative) from both the predicted and reference radiology reports. An LLM was used to extract lesion mentions, following the same prompt structure and querying methodology described in Sect. 4.2. The evaluation encompassed all 14 thoracic lesion categories defined in Sect. 4.3.

For each lesion, the precision, recall, and F1-score were calculated based on the agreement between the predicted and reference labels. This evaluation directly reflects the model’s ability to recognize and describe clinically important findings, serving as a critical indicator of its potential applicability in real-world diagnostic practice.
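These per-lesion metrics can be computed directly from the binary statuses extracted from the predicted and reference reports; a minimal sketch, treating "positive" as the positive class:

```python
def lesion_prf(pred, ref):
    """Compute precision, recall, and F1 for one lesion from parallel lists
    of per-report binary statuses (True = lesion described as positive)."""
    tp = sum(p and r for p, r in zip(pred, ref))       # true positives
    fp = sum(p and not r for p, r in zip(pred, ref))   # false positives
    fn = sum(r and not p for p, r in zip(pred, ref))   # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

Averaging the per-lesion F1-scores across the 14 categories gives the macro mean F1 reported in the tables.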

Consistency evaluation

Consistency-based evaluation was conducted to assess the alignment between the generated radiology report and reference reports in terms of overall structure and clinical reliability. While traditional natural language evaluation metrics such as METEOR and ROUGE-L provide useful signals about lexical and surface-level similarity, they may not fully capture clinical equivalence when semantically similar findings are expressed with different phrasing.

To address this limitation, an LLM-based semantic evaluation strategy was introduced. The LLM was provided with both the predicted and reference reports and guided by a structured prompt to assign a score from 1 to 5 in each of the following four categories:

  • Clinical accuracy: The degree to which the predicted report accurately reflects the actual clinical condition of the lesions.

  • Findings completeness: The extent to which all essential clinical findings are adequately covered.

  • Linguistic and grammatical quality: Assessment of sentence construction, expressiveness, and grammatical correctness.

  • Report structure: Evaluation of the structural similarity between the generated and reference reports.

Each category is scored from 1 (significant errors or omissions) to 5 (clinically comprehensive and accurate). The final consistency score is computed as the sum of the four category scores, yielding a total out of 20, which is the convention followed by the totals reported in Table 2. The full prompt used for this evaluation is provided in Supplementary Appendix A3.
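The total reported in Table 2 corresponds to the sum of the four category scores (e.g., 3.14 + 2.98 + 4.36 + 3.83 ≈ 14.31, matching the reported 14.32/20 up to component rounding). A sketch of that aggregation, with illustrative key names:

```python
# Illustrative category keys for the four LLM-scored dimensions.
CATEGORIES = ["clinical_accuracy", "findings_completeness",
              "linguistic_quality", "report_structure"]

def consistency_total(scores):
    """Aggregate the four 1-5 category scores into a total out of 20,
    consistent with how the totals in Table 2 are formed."""
    for c in CATEGORIES:
        if not 1.0 <= scores[c] <= 5.0:
            raise ValueError(f"score for {c} outside the 1-5 range")
    return sum(scores[c] for c in CATEGORIES)
```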

Additionally, we include ROUGE-L and METEOR for evaluating lexical and semantic similarity, the GREEN score for assessing factual consistency, and F1-RadGraph for measuring structured clinical correctness. These standardized metrics support multifaceted and reproducible comparisons with prior work.

Experimental results

Comparative performance

To validate the lesion description performance of the proposed model, a comparative analysis was conducted on precision, recall, and F1-score between the predicted and reference reports across all 14 major thoracic abnormalities. Several recent state-of-the-art baselines were included in the evaluation, namely, RaDialog11, RGRG17, LLM-CXR36, CvT-DistilGPT237, R2Gen7, MAIRA-213, XrayGPT14, LLaVa-Rad15, and TSGET38. A summary of these results is presented in Table 1.

Table 1.

Lesion-level performance comparison across 14 thoracic abnormalities.

Lesions RaDialog11 RGRG17 LLM-CXR36 CvTDistilGPT237 R2Gen7 XrayGPT14 MAIRA-213 LLaVa-Rad15 TSGET38 Ours (CLALA-Net)
Cardiomegaly 0.65, 0.63, 0.64 0.51, 0.73, 0.60 0.37, 0.27, 0.31 0.61, 0.52, 0.56 0.59, 0.33, 0.42 0.43, 0.52, 0.47 0.53, 0.53, 0.53 0.62, 0.66, 0.64 0.62, 0.62, 0.62 0.62, 0.67, 0.64
Emphysema 0.34, 0.29, 0.31 0.17, 0.05, 0.08 0.11, 0.09, 0.10 0.34, 0.19, 0.25 0.44, 0.01, 0.03 0.12, 0.06, 0.08 0.30, 0.30, 0.30 0.34, 0.41, 0.37 0.31, 0.20, 0.24 0.37, 0.48, 0.42
Pleural effusion 0.73, 0.71, 0.72 0.71, 0.41, 0.52 0.57, 0.46, 0.51 0.79, 0.45, 0.57 0.85, 0.18, 0.29 0.48, 0.37, 0.42 0.69, 0.68, 0.69 0.72, 0.70, 0.71 0.74, 0.58, 0.65 0.71, 0.73, 0.72
Hernia 0.07, 0.07, 0.07 0.82, 0.26, 0.39 0.00, 0.00, 0.00 0.85, 0.30, 0.45 0.08, 0.05, 0.06 0.02, 0.02, 0.02 0.27, 0.38, 0.31 0.43, 0.53, 0.48 0.40, 0.20, 0.27 0.67, 0.37, 0.47
Nodule 0.21, 0.17, 0.19 0.29, 0.00, 0.00 0.06, 0.03, 0.04 0.20, 0.00, 0.01 0.62, 0.02, 0.05 0.10, 0.09, 0.10 0.19, 0.09, 0.12 0.31, 0.18, 0.23 0.15, 0.00, 0.01 0.26, 0.13, 0.18
Pneumothorax 0.37, 0.39, 0.38 0.18, 0.16, 0.17 0.03, 0.05, 0.04 0.56, 0.09, 0.15 0.33, 0.03, 0.05 0.05, 0.08, 0.06 0.28, 0.27, 0.27 0.46, 0.44, 0.45 0.11, 0.04, 0.06 0.51, 0.45, 0.48
Atelectasis 0.51, 0.61, 0.56 0.45, 0.85, 0.59 0.38, 0.33, 0.35 0.50, 0.31, 0.39 0.45, 0.25, 0.32 0.41, 0.43, 0.42 0.50, 0.28, 0.36 0.51, 0.58, 0.55 0.52, 0.41, 0.46 0.57, 0.61, 0.59
Pleural thickening 0.11, 0.10, 0.11 0.50, 0.00, 0.00 0.00, 0.00, 0.00 0.13, 0.02, 0.04 0.00, 0.00, 0.00 0.01, 0.01, 0.01 0.02, 0.00, 0.01 0.20, 0.12, 0.15 0.26, 0.02, 0.04 0.22, 0.13, 0.16
Mass 0.13, 0.25, 0.17 0.33, 0.00, 0.00 0.01, 0.01, 0.01 0.50, 0.02, 0.04 0.23, 0.06, 0.09 0.01, 0.01, 0.01 0.14, 0.13, 0.13 0.17, 0.14, 0.15 0.22, 0.04, 0.07 0.26, 0.23, 0.24
Edema 0.49, 0.61, 0.54 0.46, 0.54, 0.49 0.47, 0.44, 0.46 0.50, 0.36, 0.42 0.44, 0.24, 0.31 0.33, 0.36, 0.34 0.49, 0.34, 0.40 0.48, 0.46, 0.47 0.48, 0.49, 0.48 0.47, 0.67, 0.55
Consolidation 0.35, 0.37, 0.36 0.32, 0.21, 0.26 0.30, 0.34, 0.32 0.45, 0.18, 0.26 0.44, 0.18, 0.25 0.24, 0.21, 0.22 0.38, 0.26, 0.31 0.43, 0.37, 0.40 0.46, 0.19, 0.27 0.41, 0.41, 0.41
Infiltration 0.20, 0.16, 0.18 0.22, 0.01, 0.01 0.12, 0.13, 0.13 0.25, 0.12, 0.16 0.17, 0.22, 0.19 0.11, 0.09, 0.10 0.21, 0.18, 0.19 0.23, 0.19, 0.21 0.20, 0.11, 0.14 0.22, 0.29, 0.25
Fibrosis 0.17, 0.03, 0.06 0.00, 0.00, 0.00 0.05, 0.02, 0.02 0.50, 0.04, 0.07 0.23, 0.06, 0.10 0.04, 0.01, 0.01 0.18, 0.12, 0.15 0.27, 0.11, 0.16 0.66, 0.06, 0.11 0.20, 0.18, 0.19
Pneumonia 0.17, 0.20, 0.19 0.18, 0.14, 0.15 0.20, 0.28, 0.24 0.29, 0.09, 0.14 0.16, 0.08, 0.10 0.09, 0.12, 0.10 0.15, 0.01, 0.03 0.22, 0.27, 0.24 0.22, 0.08, 0.12 0.24, 0.27, 0.26
Average 0.32, 0.33, 0.32 0.37, 0.24, 0.23 0.19, 0.17, 0.18 0.47, 0.19, 0.25 0.36, 0.12, 0.16 0.17, 0.17, 0.17 0.31, 0.26, 0.27 0.38, 0.37, 0.37 0.38, 0.22, 0.25 0.41, 0.40, 0.40

Values in each cell correspond to Precision, Recall, and F1 Score. Bold values indicate the best performance for each metric.

Among the evaluated models, CLALA-Net achieved the highest average F1-score of 0.40, surpassing prior approaches including RaDialog (0.32), RGRG (0.23), LLM-CXR (0.18), CvT-DistilGPT2 (0.25), R2Gen (0.16), and the recently introduced large-scale models XrayGPT (0.17), MAIRA-2 (0.27), LLaVA-Rad (0.37), and TSGET (0.25). While LLaVA-Rad showed competitive results on frequently observed abnormalities such as pleural effusion (0.71) and cardiomegaly (0.64), its performance decreased for low-prevalence or radiographically subtle conditions such as fibrosis (0.16) and pleural thickening (0.15), where CLALA-Net recorded marginally higher F1-scores of 0.19 and 0.16, respectively.

CLALA-Net demonstrated consistently strong performance on several clinically important lesions, including pleural effusion (0.72), atelectasis (0.59), edema (0.55), and pneumothorax (0.48). While the absolute F1-scores for some of these conditions may appear moderate, they are substantially higher than those of other state-of-the-art models, highlighting the relative advantage of our approach. These findings typically require a nuanced understanding of both global radiographic context and local lesion patterns—an integration effectively supported by the hybrid architecture of CLALA-Net, which combines image-level lesion priors via LCA with region-specific discrimination reinforced by LLCL.

For challenging lesion types such as pleural thickening, mass, and fibrosis, where performance tends to drop across all models, CLALA-Net maintained relatively stable F1-scores of 0.16, 0.24, and 0.19, respectively. While the absolute performance gains over competing methods were modest, several baselines—such as RGRG and CvT-DistilGPT2—exhibited extremely low recall or inconsistent predictions for these lesion categories. In contrast, CLALA-Net consistently retained non-trivial recall, suggesting that the attention-based fusion and contrastive learning strategy contributes to preserving lesion-specific signals in subtle or ambiguous cases.

From a precision–recall perspective, CLALA-Net achieved a balanced trade-off with a mean precision of 0.41 and recall of 0.40. By contrast, CvT-DistilGPT2 displayed higher precision (0.47) but much lower recall (0.19), indicating a tendency toward conservative generation that may miss relevant findings. Conversely, region-based models like RGRG showed high recall for specific lesions (e.g., atelectasis: 0.85) but suffered from performance instability across other classes, such as nodule (0.00) and fibrosis (0.00), possibly due to over-sensitivity or poor generalization.

Despite advances in generalist vision–language models, including MAIRA-2 and LLaVA-Rad, their performance varied considerably across lesion categories. While they captured major findings well, they were less consistent in detecting less common abnormalities. This underscores the continued importance of domain-specific architectural components—such as lesion-centric representation learning and explicit regional-global integration—for reliable and comprehensive radiology report generation.

Taken together, the results demonstrate that CLALA-Net provides a well-rounded and clinically attuned solution for lesion-level report generation, exhibiting competitive performance across both common and challenging abnormalities while maintaining consistency in precision and recall.

To assess the overall quality, fluency, and clinical reliability of the generated radiology reports, we performed a multi-faceted report-level evaluation that combines LLM-guided scoring with standardized lexical and factual metrics. This comprehensive evaluation captures both human-aligned and automated perspectives on report consistency.

As shown in Table 2a, our model consistently achieved the highest scores across all four LLM-evaluated dimensions. It attained a clinical accuracy score of 3.14, reflecting a strong alignment with true lesion findings described in the reference reports. The findings completeness score was 2.98, indicating that the proposed model was more effective at capturing all essential diagnostic elements than competing baselines. In terms of linguistic and grammatical quality, our model scored 4.36, demonstrating fluency and coherence comparable to radiologist-authored reports. It also achieved a report structure score of 3.83, showing that the generated narratives adhered closely to the organization and flow expected in clinical documentation. These consistent gains led to a total LLM-based consistency score of 14.32, clearly surpassing the best-performing baseline (RGRG, 13.40), as well as recent vision-language models such as MAIRA-2 (11.58) and LLaVA-Rad (12.24).

Table 2.

Comprehensive report-level performance evaluation across experimental settings, assessed using LLM-based consistency evaluation and standardized language generation metrics. Results are presented for (a) comparison with baseline methods, (b) ablation analysis of CLALA-Net components, and (c) comparison between lesion labeling strategies (LLM-based vs. CheXbert-based).

a
Metric RaDialog11 RGRG17 LLM-CXR36 CvTDistilGPT237 R2Gen7 XrayGPT14 MAIRA-213 LLaVa-Rad15 TSGET38 Ours (CLALA-Net)
LLM-based Consistency Evaluation Clinical accuracy 2.57 2.84 2.26 2.48 2.18 2.14 2.38 2.66 2.41 3.14
Findings completeness 2.42 2.72 2.18 2.34 2.09 2.08 2.25 2.55 2.30 2.98
Linguistic and grammatical quality 4.06 4.24 3.94 3.57 3.35 3.68 3.77 3.64 3.50 4.36
Report structure 3.39 3.59 3.08 3.03 2.87 3.07 3.17 3.38 3.05 3.83
Total 12.45 13.40 11.48 11.43 10.51 10.98 11.58 12.24 11.27 14.32
METEOR 0.139 0.169 0.076 0.138 0.140 0.117 0.100 0.127 0.144 0.165
ROUGE 0.240 0.232 0.162 0.219 0.271 0.159 0.155 0.223 0.282 0.218
GREEN 0.294 0.430 0.196 0.305 0.270 0.197 0.261 0.301 0.301 0.441
F1-Radgraph 0.215 0.267 0.124 0.200 0.195 0.135 0.130 0.149 0.216 0.262
b
LLCL LCA Clinical accuracy Findings completeness Linguistic and grammatical quality Report structure Total METEOR ROUGE GREEN F1-RadGraph
✗ ✗ 3.05 2.87 4.30 3.79 14.02 0.156 0.215 0.407 0.257
✓ ✗ 3.07 2.90 4.33 3.80 14.11 0.162 0.216 0.412 0.260
✗ ✓ 3.13 2.97 4.36 3.82 14.29 0.164 0.216 0.438 0.261
✓ ✓ 3.14 2.98 4.36 3.83 14.32 0.165 0.218 0.441 0.262
c
Model Clinical accuracy Findings completeness Linguistic and grammatical quality Report structure Total METEOR ROUGE GREEN F1-RadGraph
CheXbert Labels 2.85 2.71 4.20 3.66 13.43 0.172 0.218 0.331 0.233
LLM-based Labels 3.14 2.98 4.36 3.83 14.32 0.165 0.218 0.441 0.262

Bold values indicate the best performance for each metric. LLCL lesion-level contrastive learning, LCA lesion cross-attention.

In addition to the LLM-based analysis, we evaluated report quality using four widely adopted automatic metrics: METEOR, ROUGE-L, GREEN, and F1-RadGraph. Our model achieved a METEOR score of 0.165, the second-highest among all models, closely trailing RGRG (0.169) and outperforming recent models such as MAIRA-2 (0.100) and LLaVA-Rad (0.127). The ROUGE-L score was 0.218, remaining competitive with R2Gen (0.271), RGRG (0.232), and TSGET (0.282), despite our model not relying on sentence-level templates or token-level copying strategies. More importantly, CLALA-Net achieved the highest GREEN score of 0.441, demonstrating its superior ability to maintain factual consistency between the image and the generated report; this score reflects its success in minimizing hallucinations while ensuring the accurate inclusion of key clinical facts. Finally, the F1-RadGraph score reached 0.262, ranking second among all compared models, indicating that the generated reports preserved strong structural clinical correctness and effectively captured key medical entities and relations, although a marginal gap remains compared with the top-performing model.

Figure 2 illustrates an example output for a real CXR case. Although the phrasing and sentence structure differ from the reference report, the generated descriptions accurately convey key abnormalities, including pleural effusion, atelectasis, and pneumothorax, with clear semantic equivalence. The model maintained diagnostic fidelity while expressing findings through diverse yet clinically consistent formulations, reinforcing the high scores observed in the linguistic and structural dimensions of the evaluation.

Fig. 2.

Fig. 2

Examples of automatically generated radiology reports for chest X-rays using the proposed model. Left: input CXR image. Middle and right: variations of the generated sentences for the same image. Disease entities such as pleural effusion, atelectasis, and pneumothorax are highlighted in color. The generated reports maintain clinical consistency despite variations in syntax and phrasing.

Together, these results indicate that the proposed model not only excels in identifying and describing individual lesions, but also produces reports that are coherent, clinically accurate, and faithful to the imaging content. The synergy between fine-grained regional features and global lesion context, combined with a controlled LLM-based aggregator, enables the generation of high-quality radiology reports that align with both clinical expectations and linguistic standards.

Ablation study

Impact of individual modules

To assess the contribution of each core component, an ablation study was conducted, comparing the lesion-level performance of four configurations: base model, LLCL-only, LCA-only, and full model combining LLCL and LCA. Table 3 summarizes the precision, recall, and F1-scores across the 14 lesion categories.

Table 3.

Ablation study of lesion-level performance under different configurations of LLCL and LCA modules.

Lesions Base model LLCL-only LCA-only LLCL + LCA
Metrics Precision Recall F1-score Precision Recall F1-score Precision Recall F1-score Precision Recall F1-score
Cardiomegaly 0.61 0.58 0.60 0.58 0.54 0.56 0.62 0.67 0.64 0.62 0.67 0.64
Emphysema 0.25 0.35 0.29 0.41 0.32 0.36 0.36 0.47 0.41 0.37 0.48 0.42
Pleural effusion 0.72 0.73 0.72 0.66 0.79 0.72 0.72 0.73 0.72 0.71 0.73 0.72
Hernia 0.34 0.38 0.36 0.50 0.38 0.43 0.67 0.37 0.47 0.67 0.37 0.47
Nodule 0.12 0.01 0.03 0.23 0.13 0.16 0.26 0.14 0.18 0.26 0.13 0.18
Pneumothorax 0.60 0.11 0.18 0.57 0.28 0.38 0.50 0.50 0.48 0.51 0.45 0.48
Atelectasis 0.38 0.90 0.54 0.37 0.89 0.52 0.56 0.60 0.58 0.57 0.61 0.59
Pleural thickening 0.15 0.03 0.06 0.28 0.07 0.10 0.20 0.12 0.15 0.22 0.13 0.16
Mass 0.66 0.15 0.23 0.10 0.01 0.02 0.26 0.22 0.24 0.26 0.23 0.24
Edema 0.41 0.71 0.52 0.41 0.64 0.50 0.46 0.67 0.55 0.47 0.67 0.55
Consolidation 0.34 0.37 0.35 0.30 0.35 0.32 0.38 0.41 0.39 0.41 0.41 0.41
Infiltration 0.16 0.43 0.22 0.13 0.40 0.20 0.21 0.30 0.25 0.22 0.29 0.25
Fibrosis 0.90 0.05 0.10 0.67 0.11 0.19 0.21 0.16 0.18 0.20 0.18 0.19
Pneumonia 0.17 0.18 0.17 0.13 0.23 0.17 0.21 0.28 0.24 0.24 0.27 0.26
Average 0.41 0.36 0.31 0.38 0.37 0.33 0.40 0.40 0.39 0.41 0.40 0.40

Bold values indicate the best performance for each metric.

The LCA-only variant demonstrated improved recognition of location-sensitive pathologies. Significant F1-score gains were observed for pneumothorax (0.18 → 0.48) and fibrosis (0.10 → 0.18), with the highest precision achieved for pleural effusion (0.72), while maintaining the base F1-score (0.72). These results highlight the importance of incorporating a global lesion context for accurate localization and detailed description.

The LLCL-only variant improved performance for subtle findings, such as emphysema (F1-score: 0.29 → 0.36) and nodule (F1-score: 0.03 → 0.16), demonstrating its effectiveness in enhancing feature separability for small or visually ambiguous abnormalities. However, in some cases, this configuration resulted in reduced recall, suggesting a trade-off between precision and coverage when contrastive learning is employed in isolation.

The combination of LCA and LLCL yielded the highest overall performance, increasing the mean F1-score from 0.31 to 0.40. These improvements were consistent across both precision and recall. Notably, substantial gains were observed for clinically significant conditions such as atelectasis (0.54 → 0.59), edema (0.52 → 0.55), consolidation (0.35 → 0.41), and pneumothorax (0.18 → 0.48). These results underscore the complementary strengths of the two modules: LCA provides globally informed representations that improve sensitivity to complex or spatially diffuse findings, whereas LLCL enhances lesion-specific discrimination through contrastive feature separation.

The impact of each configuration on report-level consistency was also evaluated by comparing models with different combinations of the LLCL and LCA modules, using both LLM-based consistency metrics and standardized language evaluation scores (Table 2b).

When neither LLCL nor LCA was applied, the model yielded a total consistency score of 14.02. Introducing LLCL alone led to consistent gains across evaluation metrics, including improvements in clinical accuracy (3.07) and linguistic quality (4.33). Similarly, enabling LCA alone resulted in greater enhancements, notably in clinical accuracy (3.13), findings completeness (2.97), and overall report structure (3.82), with the total score reaching 14.29.

The full configuration, integrating both LLCL and LCA, achieved the best performance across all metrics. It recorded the highest clinical accuracy (3.14), findings completeness (2.98), linguistic and grammatical quality (4.36), and report structure (3.83), resulting in a total LLM-based consistency score of 14.32. Furthermore, standardized metrics such as METEOR (0.165), ROUGE (0.218), GREEN (0.441), and F1-RadGraph (0.262) were also maximized in the complete model, indicating improved language quality, medical factuality, and structural correctness.

In summary, this ablation analysis quantitatively confirmed that LCA improved the contextual richness and fluency of the generated reports, whereas LLCL enhanced the intra- and inter-lesion discriminability. Their integration enabled the model to generate diagnostic narratives that were clinically comprehensive and linguistically robust.

Impact of lesion label accuracy

To assess the influence of lesion label accuracy on the performance of the LCA module and the overall quality of the generated reports, an experiment was conducted in which the label F1-score was varied across five levels: 0.20, 0.43 (the model’s baseline), 0.70, 0.85, and 1.00 (ground truth). The key performance metrics of the generated reports were evaluated for each level.

As illustrated in Fig. 3a, there exists a strong positive correlation between lesion label accuracy and overall report generation performance. When trained with noisy labels exhibiting low accuracy (0.2), the model demonstrated limited ability to capture relevant clinical findings, yielding an F1-score of only 0.23. By contrast, training with labels derived from our LLM-based automatic extraction pipeline (label accuracy = 0.43) led to a notable performance improvement, achieving a balanced F1-score of 0.40. Further increments in label accuracy to 0.7 and 0.85 continued to yield proportional gains, culminating in a peak F1-score of 0.76 when trained with ground-truth annotations.

Fig. 3.

Fig. 3

Impact of lesion label accuracy on lesion-level detection and report generation performance. (a) Lesion-level precision, recall, and F1-score across different levels of label accuracy; (b) LLM-based consistency evaluation scores (clinical accuracy, findings completeness, linguistic and grammatical quality, and report structure); and (c) standardized language generation metrics (METEOR, ROUGE-L, GREEN, and F1-RadGraph). Overall, the results show consistent improvements across lesion-level and report-level metrics with increasing label accuracy, highlighting the importance of accurate lesion guidance.

Interestingly, even a moderate enhancement in label quality—from 0.43 to 0.7—resulted in a substantial F1 increase of 0.17, highlighting the sensitivity of region-guided report generation models to annotation accuracy. These findings emphasize the critical role of precise lesion-level supervision for optimizing downstream report quality, while also affirming the practical value of our LLM-based labeling approach as a scalable and effective alternative to exhaustive manual annotation.

As shown in Fig. 3b,c, increasing lesion label accuracy not only improves lesion detection fidelity but also leads to consistent gains in report-level performance, including clinical consistency, structural coherence, and linguistic quality.

In the LLM-based clinical consistency evaluation (Fig. 3b), metrics such as clinical accuracy and findings completeness improved markedly with higher label accuracy. Clinical accuracy increased from 2.87 (at 0.2) to 3.36 (at 1.0), while findings completeness rose from 2.73 to 3.19. Although linguistic and grammatical quality and report structure showed more gradual gains, their upward trends—reaching 4.42 and 3.98, respectively—indicate that improved lesion annotation quality contributes positively to both factual and stylistic dimensions of report generation.

Complementary results were observed in the standardized language metrics (Fig. 3c). The F1-RadGraph score, which evaluates the preservation of clinical entities and relations, increased from 0.241 to 0.278 as label accuracy improved. Similarly, GREEN, a metric for semantic alignment with ground truth, rose from 0.384 to 0.475. METEOR and ROUGE-L, which assess lexical and semantic overlaps, exhibited steady improvements from 0.160 to 0.174 and from 0.210 to 0.225, respectively.

These results underscore the importance of precise lesion classification in maximizing the diagnostic utility of automated report-generation systems. Specifically, they validate the role of high-quality lesion labels in enhancing the effectiveness of the LCA module and improving the clinical consistency of radiology reports.

Impact of lesion labeling strategy

To rigorously assess the contribution of our fine-grained lesion labeling pipeline, we conducted a controlled experiment comparing CLALA-Net trained with LLM-derived labels against a variant trained using CheXbert-generated labels. Both labeling strategies were applied to the same Chest-Imagenome training set, with all other model components, hyperparameters, and training protocols held constant. As CheXbert provides only image-level binary annotations without anatomical localization, the LLCL module in the CheXbert-based variant was adapted to construct contrastive pairs based solely on lesion presence, thereby lacking regional awareness during training.

As shown in Table 4, the LLM-derived model consistently outperforms its CheXbert-generated counterpart in lesion-level performance, achieving a mean F1-score of 0.40 versus 0.31. This advantage is especially prominent for spatially localized or subtle abnormalities. For example, the LLM-based model yields markedly higher F1-scores for pneumothorax (0.48 vs. 0.21), hernia (0.47 vs. 0.22), and pleural thickening (0.16 vs. 0.00), indicating superior sensitivity to region-specific findings that are poorly captured by image-level annotations. In contrast, for globally prominent conditions such as cardiomegaly, both models perform comparably (0.64 vs. 0.66), suggesting that image-level labels may suffice where spatial reasoning is less critical.

Table 4.

Lesion-level performance comparison between LLM-labeled and CheXbert-labeled training for CLALA-Net.

Lesions CheXbert labels LLM-based labels
Metrics Precision Recall F1 score Precision Recall F1 score
Cardiomegaly 0.61 0.72 0.66 0.62 0.67 0.64
Emphysema 0.22 0.42 0.29 0.37 0.48 0.42
Pleural effusion 0.62 0.87 0.72 0.71 0.73 0.72
Hernia 0.33 0.16 0.22 0.67 0.37 0.47
Nodule 0.43 0.11 0.17 0.26 0.13 0.18
Pneumothorax 0.22 0.20 0.21 0.51 0.45 0.48
Atelectasis 0.48 0.77 0.59 0.57 0.61 0.59
Pleural thickening 0.00 0.00 0.00 0.22 0.13 0.16
Mass 0.33 0.13 0.18 0.26 0.23 0.24
Edema 0.52 0.71 0.60 0.47 0.67 0.55
Consolidation 0.40 0.32 0.35 0.41 0.41 0.41
Infiltration 0.21 0.28 0.24 0.22 0.29 0.25
Fibrosis 0.00 0.00 0.00 0.20 0.18 0.19
Pneumonia 0.21 0.15 0.18 0.24 0.27 0.26
Average 0.33 0.34 0.31 0.41 0.40 0.40

Bold values indicate the best performance for each metric.

To determine how these lesion-level improvements translate into holistic report quality, we evaluated the two models using both LLM-based consistency metrics and standardized language metrics. As summarized in Table 2c, the LLM-labeled model achieves superior clinical accuracy (3.14 vs. 2.85) and findings completeness (2.98 vs. 2.71), resulting in a gain of +0.89 in the total consistency score (14.32 vs. 13.43). These gains underscore that anatomically contextualized supervision enhances both the factual correctness and clinical breadth of generated reports. Importantly, this improvement comes without sacrificing language fluency: both models achieve similarly high scores in grammaticality (4.36 vs. 4.20) and report structure (3.83 vs. 3.66), demonstrating that the aggregator module remains robust to variations in label type.

The superiority of LLM-based labeling is further validated through standardized clinical metrics. The GREEN score, which penalizes hallucinations and factual inconsistencies, improves substantially from 0.331 to 0.441, while F1-RadGraph rises from 0.233 to 0.262. These gains indicate that fine-grained supervision not only improves detection of relevant findings but also enhances the structural and semantic alignment of the reports. In contrast, text overlap metrics such as METEOR (0.165 vs. 0.172) and ROUGE-L (0.218 for both) remain similar, highlighting their limited sensitivity to clinically meaningful distinctions.

Together, these results indicate that anatomically contextualized supervision contributes meaningfully to both lesion-level recognition and holistic report quality. By introducing spatial specificity and semantically aligned supervision, the LLM-based labeling pipeline plays a central role in enabling CLALA-Net’s region-aware reasoning and clinically coherent generation capabilities.

Impact of ROI quality

Table 5 summarizes the impact of ROI quality on both lesion-level detection and report-level generation across three dimensions: ROI selection strategy, input resolution, and localization accuracy. Overall, improvements across ROI quality dimensions yield modest but consistent gains, with distinct effects observed at the lesion and report levels. Detailed lesion-wise performance and component-level LLM consistency scores (clinical accuracy, findings completeness, and report structure) are provided in Supplementary Tables S1–S4.

Table 5.

Aggregated lesion- and report-level performance across different ROI quality configurations, including ROI selection strategy, input resolution, and localization accuracy.

| Configuration | Setting | Precision | Recall | F1-score | LLM-based consistency (total, /20) | METEOR | ROUGE-L | GREEN | F1-RadGraph |
|---|---|---|---|---|---|---|---|---|---|
| Selection strategy | All ROIs | 0.37 | **0.43** | 0.39 | 13.90 | 0.174 | 0.199 | 0.378 | 0.226 |
| Selection strategy | Relevant ROIs | 0.41 | 0.40 | 0.40 | 14.32 | 0.165 | 0.218 | 0.441 | 0.262 |
| Resolution | 256 × 256 | 0.41 | 0.38 | 0.39 | 14.29 | 0.162 | 0.216 | 0.439 | 0.258 |
| Resolution | 384 × 384 | 0.41 | 0.40 | 0.40 | 14.32 | 0.165 | 0.218 | 0.441 | 0.262 |
| Localization accuracy | GT | **0.44** | 0.41 | **0.42** | **14.73** | **0.178** | **0.236** | **0.470** | **0.288** |
| Localization accuracy | Predicted | 0.41 | 0.40 | 0.40 | 14.32 | 0.165 | 0.218 | 0.441 | 0.262 |

Precision, recall, and F1-score are lesion-level metrics; the remaining columns are report-level metrics.

Bold values indicate the best performance for each metric. GT, ground truth.

ROI selection strategy primarily affects the precision–recall balance. Restricting the model to clinically relevant ROIs improves lesion-level precision (0.37 → 0.41) while slightly reducing recall (0.43 → 0.40), resulting in a marginal increase in average F1-score (0.39 → 0.40). At the report level, this configuration achieves higher LLM-based consistency and factual accuracy metrics, including GREEN (0.378 → 0.441) and F1-RadGraph (0.226 → 0.262). These results suggest that excluding redundant or noisy anatomical regions enhances clinical coherence and factual alignment of generated reports, even when lesion-level gains remain limited.

ROI input resolution shows incremental yet consistent benefits. Increasing the resolution from 256 × 256 to 384 × 384 maintains lesion-level precision while improving recall (0.38 → 0.40), leading to a slight increase in average F1-score (0.39 → 0.40). Correspondingly, report-level metrics, including total LLM-based consistency (14.29 → 14.32) and factual consistency measures (GREEN and F1-RadGraph), exhibit small but uniform improvements. This trend indicates that higher-resolution ROIs better preserve fine-grained visual cues relevant to subtle abnormalities without compromising linguistic quality.

ROI localization accuracy exerts the strongest influence among the evaluated factors. Using ground-truth ROIs yields the highest lesion-level performance (F1 = 0.42) and the highest overall report-level consistency (14.73), outperforming the predicted-ROI configuration across all metrics. Notably, the predicted-ROI setting still achieves competitive performance (F1 = 0.40; consistency = 14.32), demonstrating that CLALA-Net remains robust under realistic localization noise. The performance gap highlights the importance of accurate anatomical boundary alignment, particularly for spatially localized lesions, while also underscoring the practical deployability of the proposed framework.

Collectively, these results indicate that ROI quality governs complementary aspects of model behavior: ROI selection modulates the precision–recall trade-off, ROI resolution influences sensitivity to subtle visual patterns, and ROI localization accuracy determines the upper bound of both lesion-level fidelity and report-level clinical consistency.

Conclusion

This study introduces CLALA-Net, a hybrid framework for chest X-ray report generation that integrates lesion-aware attention, contrastive learning, and LLM-based aggregation. By combining both region-level and global image representations, the model effectively bridges the gap between fine-grained localization and holistic clinical reasoning. Extensive experiments on the Chest-Imagenome dataset demonstrate that CLALA-Net achieves superior performance compared to both region-based and generalist vision–language models, particularly in generating accurate, clinically coherent reports. Ablation studies further validate the contributions of lesion-centric supervision and anatomically contextualized learning to both lesion-level and report-level quality.

Despite these promising results, our approach has several limitations. First, it depends on fine-grained lesion annotations with anatomical localization, which are costly to obtain and currently available only in specialized datasets such as Chest-Imagenome. Although we employ an LLM-based labeling pipeline to reduce annotation effort, the pipeline still relies on structured radiology reports and may face domain-adaptation challenges in settings without consistent reporting protocols. Second, our evaluation is limited to a single dataset, which constrains the generalizability of our findings across imaging modalities, disease distributions, and institutional reporting styles; future work should extend validation to multi-institutional and multi-modal datasets. Third, the current framework does not directly exploit the native high-resolution or dynamic global–local processing capabilities of recent multimodal LLMs. Although focusing on anatomically bounded ROIs allows the model to preserve fine-grained regional information at moderate input resolutions, tighter integration with high-resolution vision–language encoders remains an open direction; pursuing it may further enhance diagnostic detail while maintaining interpretability and clinical coherence. Fourth, the computational cost of the hybrid architecture, which spans region extraction, cross-attention, and LLM-based aggregation, is higher than that of end-to-end models. Although this complexity is justified by performance gains, lightweight variants or model distillation techniques may be necessary for deployment in resource-limited environments. Finally, while our benchmark covers a representative range of recent baselines, including full-image, region-based, and hybrid architectures, it does not encompass all state-of-the-art methods, because some recently proposed implementations with incomplete public releases could not be reliably reproduced. This should be considered when interpreting the scope of our comparative results.

In summary, CLALA-Net provides a clinically attuned, lesion-guided solution to radiology report generation. Through anatomically grounded training and structured language synthesis, it establishes a foundation for future models aiming to combine medical accuracy with linguistic fluency.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary Material 1 (67.6KB, docx)

Author contributions

Guarantors of integrity of the entire study, W.-J.N., S.-W.P., and B.-D.L.; conceptualization, B.-D.L.; literature review, W.-J.N., S.-W.P., and B.-D.L.; supervision, B.-D.L.; software, W.-J.N. and S.-W.P.; formal analysis, W.-J.N., S.-W.P., and B.-D.L.; writing—original draft preparation, W.-J.N., S.-W.P., and B.-D.L.; writing—review and editing, W.-J.N., S.-W.P., and B.-D.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by Kyonggi University’s Graduate Research Assistantship 2025. This research was also supported by the Basic Science Research Program of the National Research Foundation of Korea (NRF) funded by the Ministry of Education (RS-2020-NR049579).

Data availability

The dataset utilized in this study can be accessed publicly at https://physionet.org/content/chest-imagenome/1.0.0/.

Declarations

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Won-Jun Noh and Sun-Woo Pi contributed equally to this work.

