Translational Lung Cancer Research. 2026 Feb 26;15(3):47. doi: 10.21037/tlcr-2025-aw-1299

Lightweight hybrid foundation model for lung cancer prognosis based on low-dose chest X-ray images

Helen Haile Hayeso 1, Zenebe Markos Lonseko 1, Fahad Mushabbab G Alotaibi 1, Tao Gan 2, Peifeng Shi 1, Shuqi Dong 1, Nini Rao 1
PMCID: PMC13071740  PMID: 41982697

Abstract

Background

Lung cancer (LC) is the leading cause of cancer mortality worldwide. Accurate prognosis in vulnerable populations, particularly in low-resource settings, remains challenging given the high radiation dose of standard imaging modalities. While chest X-rays (CXRs) are safer and more accessible, their low spatial resolution limits prognostic accuracy. Existing multimodal models primarily rely on computationally intensive architectures and high-radiation inputs, such as computed tomography (CT) or positron emission tomography (PET), which reduces their clinical utility. This study aimed to develop and validate a lightweight hybrid foundation model (LHFM) that integrates CXR imaging with clinical data to enable accurate LC prognosis in low-resource settings.

Methods

We developed an LHFM that integrates visual features from CXR images extracted by a segment anything model (SAM)-Med2D encoder with semantically enriched prompts generated by BioGPT and clinical metadata. These multimodal features are fused via a dual-branch transformer architecture for survival prediction. The model was trained and validated on the JSRT and PadChest datasets, with external validation on multicenter datasets including the NIH CXR. Performance was evaluated using the concordance index (C-index), area under the receiver operating characteristic curve (AUROC), and Kaplan-Meier (KM) survival analysis.

Results

The proposed LHFM achieved superior prognostic performance, with a C-index of 0.910 [95% confidence interval (CI): 0.898–0.922, standard deviation (SD) =0.006] and AUROC of 0.935 (95% CI: 0.927–0.943, SD =0.004), outperforming existing multimodal benchmarks (P<0.001). KM curves demonstrated significant separation between the high-risk and low-risk groups. Domain-shift robustness testing across heterogeneous external datasets demonstrated representation stability under distribution shift.

Conclusions

LHFM establishes a new paradigm for prognostic precision by delivering strong performance from low-dose CXR alone. This hybrid approach directly addresses the implementation gap in clinical artificial intelligence (AI), offering a scalable, equitable, and immediately applicable solution for personalized cancer care in resource-limited and radiography-first workflows, with potential applicability across other cancer types.

Keywords: Lung cancer prognosis (LC prognosis), chest X-ray imaging (CXR imaging), multimodal deep learning (multimodal DL), foundation model in medical imaging, artificial intelligence in oncology (AI in oncology)


Highlight box.

Key findings

• This study proposes a lightweight hybrid foundation model (LHFM) that integrates chest X-ray (CXR) image features extracted by segment anything model (SAM)-Med2D with semantic prompts generated by BioGPT and clinical metadata for lung cancer prognosis.

• LHFM achieved a concordance index of 0.910 and an area under the receiver operating characteristic curve (AUROC) of 0.935, outperforming existing multimodal benchmarks.

• The model demonstrated strong external generalizability and robustness under domain shift across three independent datasets [concordance index (C-index): 0.835–0.879], confirming its robustness and translational potential.

What is known and what is new?

• Lung cancer prognostic models often rely on computed tomography (CT) or positron emission tomography (PET), which are limited by high radiation exposure, cost, and accessibility in low-resource settings. Although CXR is more accessible and involves lower radiation, its reduced spatial resolution hinders the performance of traditional deep learning models.

• In this study, we introduce an LHFM that integrates CXR features with clinical data through a multimodal framework, offering accurate prognosis. LHFM provides a scalable, low-dose solution suitable for resource-limited environments.

What is the implication, and what should change now?

• LHFM provides a cost-effective and scalable solution for integrating artificial intelligence (AI)-aided prognosis into clinical workflows using standard radiography equipment.

• The framework can support early triage, treatment prioritization, and equitable access to advanced prognostic tools in low-resource settings and for vulnerable populations.

• Future clinical adoption should focus on prospective validation and integration into radiology information systems to enhance precision oncology and workflow efficiency.

Introduction

Lung cancer (LC) remains the leading cause of cancer-related mortality worldwide, accounting for over 1.8 million deaths annually (1,2). Accurate prognostication is crucial for personalized treatment, yet existing models rely on computed tomography (CT) or positron emission tomography (PET) imaging modalities (3-6) which are limited by cost, radiation exposure, and access barriers in vulnerable populations and low-resource settings (7,8). By contrast, chest X-ray (CXR) is inexpensive, widely accessible, and involves substantially lower radiation dose (9-12), but its two-dimensional, lower-resolution nature hinders reliable extraction of subtle prognostic signals.

Recent deep learning (DL) studies have demonstrated that CXR images encode prognostic information (13,14); however, most frameworks remain unimodal and image-only, underutilizing the complementary clinical context available in radiology reports or electronic health record (EHR) data (15,16). Consequently, these models lack interpretability and are impractical for real-world deployment. Existing multimodal architectures, including Lite-ProSENet (17) and FGCN (18), achieve notable improvements [concordance index (C-index) ≈0.76–0.78], but they require high-resolution CT and computationally expensive transformers, which restricts their clinical scalability (19). In contrast, other radiomics and omics-based approaches report more modest performance (C-index ≈0.65–0.78) and often lack external validation or interpretability (20-23). Moreover, these methods depend on high-resolution, high-radiation imaging and computationally intensive architectures, limiting scalability beyond specialized centers. Collectively, current evidence underscores a critical gap: there is no lightweight, interpretable, CXR-based multimodal framework that leverages recent advances in medical foundation encoders while remaining deployable in routine practice and resource-limited clinical environments.

Emerging medical vision-language models, such as segment anything model (SAM)-Med2D for segmentation (24) and BioGPT for biomedical language understanding (16), enable rich visual-text representations but are typically embedded in large, resource-intensive pipelines (25). Harnessing their representational strength within a compact architecture optimized for CXR-based risk stratification offers an opportunity to deliver accurate, interpretable prognosis without relying on CT/PET or high-end computing resources.

To address these gaps, we propose a lightweight hybrid foundation model (LHFM) designed to integrate CXR-derived embeddings from SAM-Med2D with BioGPT-generated clinical prompts and structured metadata through a dual-branch transformer. We hypothesize that integrating SAM-Med2D visual embeddings with BioGPT-generated semantic prompts and clinical data will achieve superior survival prediction accuracy from low-dose CXRs compared with unimodal and heavier multimodal baselines. The main contributions of this study are as follows.

  • We propose an LHFM that integrates SAM-Med2D CXR features, BioGPT-generated semantic prompts, and structured clinical metadata through a unified dual-branch transformer architecture for LC prognosis.

  • LHFM bridges the gap between high-capacity foundation models and deployable clinical aids by combining visual embeddings from SAM-Med2D with textual semantics from BioGPT, achieving high accuracy while preserving interpretability and scalability.

  • Across multiple CXR-based datasets, LHFM consistently outperformed existing unimodal and multimodal baselines, with semantic prompt-guided fusion enhancing both interpretability and prognostic performance.

  • The framework demonstrates robust cross-domain generalization and enables real-time inference on standard hardware, underscoring its translational potential for resource-limited clinical settings.

We present this article in accordance with the TRIPOD reporting checklist (available at https://tlcr.amegroups.com/article/view/10.21037/tlcr-2025-aw-1299/rc).

Methods

Datasets

This study utilized multiple publicly available CXR and CT datasets for model development and validation, and to assess domain adaptability. The model was developed and validated primarily on the JSRT and PadChest CXR datasets, which provide high-quality annotated images alongside demographic and clinical metadata, enabling robust prognostic model development across diverse patient groups and imaging protocols (26,27). Baseline demographic and clinical characteristics for the primary development cohort (JSRT) are summarized in Table S1. For experimental validation, these datasets were split into training (80%) and testing (20%). To minimize unintended correlations and prevent information leakage, subject-level splitting was used so that samples from the same subject did not appear in both training and testing partitions. Model stability was assessed using 5-fold cross-validation on the development datasets, and the held-out test set remained strictly isolated until final evaluation. Sample images are presented in Figure 1. The model’s generalizability and translational potential were assessed on heterogeneous datasets, including the large-scale NIH ChestX-ray and the NSCLC-Radiomics-Interobserver1 datasets, which represent varied demographic distributions and clinical contexts. In addition to internal testing, we evaluated LHFM under marked distribution shifts using heterogeneous external imaging cohorts. Because some datasets (e.g., Shenzhen/Montgomery) (28,29) are not LC cohorts, these analyses are reported as domain-shift robustness and feature transferability tests, rather than clinical external validation of LC prognosis. This evaluation strategy probes whether multimodal representations learned from low-dose radiographs remain stable under substantial differences in acquisition, pathology prevalence, and cohort composition. Furthermore, the NSCLC-Radiomics-Interobserver1 dataset was incorporated to assess the model’s performance (30).

Figure 1.


Sample lung CXR images. The figure depicts three distinct clinical classifications: (A) a malignant nodule, (B) a benign nodule, and (C) a non-nodule appearance for comparative analysis. CXR, chest X-ray.

Accordingly, the primary endpoint used for training and evaluation was an algorithmically computed risk label (high-risk vs. low-risk) derived from available demographic and clinical metadata under controlled experimental assumptions. This design enables systematic assessment of whether hybrid fusion of imaging features and semantic prompt representations improves predictive discrimination in a low-dose CXR setting, while avoiding unsupported claims of actual time-to-event clinical outcome prediction.

The study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments. All datasets used in this study were publicly available (as presented in Appendix 1) and fully anonymized, and institutional review board approval was waived. Missing data in demographic or clinical metadata were handled by imputation with median values for continuous variables and mode for categorical variables, confirming balanced input across patient groups.

Clinical prompts enhance LC prognosis by integrating key tumor features such as size, density, margins, and location into a lightweight predictive model. This method integrates imaging and clinical data to provide interpretable, efficient, and highly accurate predictions, making it suitable for diverse healthcare settings. In this study, clinical prompts were used as auxiliary inputs during model training.

Proposed methods

This section describes the proposed method for determining the prognosis of patients with LC using low-dose CXR. The main workflow includes four steps: data preprocessing, feature extraction, hybrid model training, and prognostic prediction, as presented in Figure 2. We begin by outlining data preprocessing techniques for robust and generalizable inputs, followed by model architecture, training, and prognostic prediction.

Figure 2.


Block diagram of the proposed LHFM framework. The pipeline begins with multimodal data input, including chest X-ray images and clinical metadata prompts, followed by preprocessing tasks such as normalization and augmentations. Feature extraction is then performed via SAM-Med2D for visual representations and BioGPT for semantic text embeddings. These complementary features are combined within a hybrid fusion model, which subsequently enables robust survival prediction and prognostic classification for accurate prognostic prediction. LHFM, lightweight hybrid foundation model.

Data preprocessing

The preprocessing pipeline standardized the CXR images from the adult and other cohorts by resizing them to 224×224 pixels and normalizing the pixel intensities. Data augmentation, including rotation, scaling, and cropping, was applied to improve model robustness. Clinical metadata were processed to handle missing values, ensuring consistent and balanced input for subsequent multimodal feature extraction and model training. Pixel intensity values were normalized via z-score standardization as depicted in Eq. [1]. Clinical metadata were processed via a parallel preprocessing pipeline.

I′ = (I − μ)/σ [1]

where I′ is the normalized intensity, I is the raw pixel intensity, μ is the dataset-wide mean, and σ is the standard deviation (SD), computed per dataset to account for acquisition-specific illumination and contrast variations.
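To make the preprocessing concrete, the following is a minimal sketch of the resize-and-normalize step culminating in Eq. [1]. It assumes a single-channel grayscale input; the nearest-neighbour resize and per-image statistics are simplifications of the paper's pipeline, which resizes to 224×224 and computes μ and σ per dataset.

```python
import numpy as np

def preprocess_cxr(image: np.ndarray, size: int = 224) -> np.ndarray:
    """Resize a grayscale CXR to size x size and z-score normalize (Eq. [1])."""
    h, w = image.shape
    # Nearest-neighbour resize via index sampling (a production pipeline would
    # use bilinear interpolation, e.g. cv2.resize or PIL).
    rows = (np.arange(size) * h // size).clip(0, h - 1)
    cols = (np.arange(size) * w // size).clip(0, w - 1)
    resized = image[np.ix_(rows, cols)].astype(np.float32)
    # Eq. [1]: I' = (I - mu) / sigma. Statistics are computed per image here
    # for self-containment; the paper computes them per dataset.
    mu, sigma = resized.mean(), resized.std() + 1e-8
    return (resized - mu) / sigma
```

The epsilon added to σ guards against division by zero on constant images.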

Statistical analysis

Statistical analysis was performed to evaluate the model’s prognostic performance and robustness. The C-index assessed survival ranking, while the area under the receiver operating characteristic curve (AUROC) evaluated binary classification. Kaplan-Meier (KM) analysis with the log-rank test compared survival between risk groups. Performance is reported as mean ± SD with 95% CI from five-fold cross-validation. Sensitivity analyses included modality ablation and external validation. All analyses were conducted using Python (version 3.10) as shown in Table S2. P values smaller than 0.05 were considered statistically significant.

Constructing a LHFM

Figure 3 illustrates the conceptual transition from traditional and DL models to the proposed model (LHFM), highlighting its multimodal efficiency and interpretability. The framework and implementation of the proposed model are shown in Figure 4, which consists of a dual-branch feature extractor, a fusion model, and a prognostic prediction module. The details are explained below.

Figure 3.


Comparative schematic of lung cancer prognosis architectures. (A) shows that traditional statistical models rely solely on clinical data. (B) shows DCNN-based approaches that use single-modality features. (C) presents multimodal models such as Lite-ProSENet, which integrate clinical data and CT scans but require high-radiation imaging. (D) illustrates the proposed LHFM, which integrates SAM-Med2D CXR embeddings with BioGPT-generated prompts and clinical metadata through a hybrid fusion model. CT, computed tomography; CXR, chest X-ray; DCNN, deep convolutional neural network; LHFM, lightweight hybrid foundation model.

Figure 4.


Overall framework of the lightweight hybrid foundation model. Preprocessed CXR images and clinical text are encoded by SAM-Med2D and BioGPT to yield 512-D image and 768-D text embeddings with clinical metadata. A fusion module, comprising a fusion layer, an attention-based block, and fully connected layers, outputs a sigmoid-activated risk probability dichotomized into low- versus high-risk classes. CXR, chest X-ray.

Dual hybrid feature extraction

In this stage, visual features are extracted from the preprocessed CXR images via the SAM-Med2D (24) visual encoder. It generates high-fidelity embeddings that capture spatial context and critical visual features from CXR, facilitating the subsequent integration of visual information into the model. In parallel, prompts are generated by BioGPT (16), a biomedical language model: these image-conditioned semantic representations are produced from CXR-derived embeddings by a lightweight learned prompt generator that bridges the visual latent space to BioGPT’s token embedding space. The generated prompt functions as an auxiliary semantic representation that supports multimodal fusion and interpretability. Importantly, the prompt branch is designed as a representation-learning mechanism rather than a standalone clinical text prognostic model. Qualitative examples of generated prompts and additional prompt-sensitivity experiments are provided in Table S3. For example, a generated prompt might be: “Presence of a spiculated mass with lesion margins in the right upper lobe, approximately 3.2 cm in diameter.” These prompts offer interpretable descriptions of clinical conditions, effectively bridging pixel-level features to high-level semantic descriptions.

Dual hybrid feature extraction combines visual and textual pathways to create clinically informative CXR representations. Let X = {x_i}_{i=1}^{N} denote the set of preprocessed CXR images, where each x_i ∈ ℝ^{H×W×3}. Each image is first projected through a vision transformer (ViT) backbone whose patch-token output is linearly mapped to a compact visual embedding v_i = W_V ViT(x_i) + b_V. A lightweight prompt generator φ_p(·) then converts v_i into a clinically interpretable prompt p_i. This prompt is tokenized and passed through BioGPT, whose [CLS] output h_o(p_i) is similarly linearly projected to yield the text embedding t_i = W_T h_o(p_i) + b_T. By aligning v_i ∈ ℝ^{d_v} and t_i ∈ ℝ^{d_t} in a shared latent space, we fuse pixel-level detail with semantic descriptors, enabling the model to ground high-level clinical concepts directly in image features.

Let X = {x_i}_{i=1}^{N} denote the set of preprocessed CXR images, where each x_i ∈ ℝ^{H×W×3}. We first extract a visual embedding v_i ∈ ℝ^{d_v} for each image via the SAM-Med2D encoder, as shown in Eqs. [2-4]:

v_i = f_SAM(x_i) [2]

where f_SAM: ℝ^{H×W×3} → ℝ^{d_v} is the SAM-Med2D (24) encoder, implemented as a ViT with a projection head. Specifically, if z_i = ViT(x_i) ∈ ℝ^{D} are the raw patch-token outputs, then

v_i = W_v z_i + b_v,  W_v ∈ ℝ^{d_v×D},  b_v ∈ ℝ^{d_v} [3]

Concurrently, a clinical textual prompt p_i is generated from v_i via a prompt generator g and embedded with BioGPT (16):

p_i = g(v_i),  t_i = f_BioGPT(p_i) [4]

where t_i ∈ ℝ^{d_t} is the clinical text embedding. Here g is a small feed-forward network and t_i = W_t h_o(p_i) + b_t, with h_o(p_i) ∈ ℝ^{H_o} the token output of BioGPT, and W_t ∈ ℝ^{d_t×H_o}, b_t ∈ ℝ^{d_t}. A cross-modal attention mechanism, Attn(Q = vW_Q, K = tW_K), dynamically synthesizes clinical prompts conditioned on the visual embeddings v, explicitly aligning pixel-level features with semantic descriptors and ensuring that prompts are anatomically grounded and clinically interpretable.
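The tensor flow of Eqs. [2-4] can be sketched as follows. The frozen SAM-Med2D ViT backbone, the prompt generator g, and BioGPT are replaced by stand-in maps purely to illustrate shapes, and the dimensions D and H_o are assumptions; only the projection algebra mirrors the equations (512-D image and 768-D text embeddings follow Figure 4).

```python
import numpy as np

rng = np.random.default_rng(42)
D, H_o, d_v, d_t = 768, 1024, 512, 768   # D and H_o are illustrative choices

def vit_backbone(x):
    # Stand-in for z_i = ViT(x_i) in R^D (a real ViT returns patch tokens).
    return np.tanh(x.reshape(-1)[:D])

W_v, b_v = rng.standard_normal((d_v, D)) * 0.02, np.zeros(d_v)
W_t, b_t = rng.standard_normal((d_t, H_o)) * 0.02, np.zeros(d_t)

def prompt_generator(v):
    # Stand-in for g: a small feed-forward net mapping v_i to prompt space.
    return np.maximum(0.0, v)

def biogpt_cls(p):
    # Stand-in for BioGPT's token output h_o(p_i) in R^{H_o}.
    return np.tanh(np.resize(p, H_o))

x = rng.standard_normal((224, 224, 3))   # preprocessed CXR x_i
z = vit_backbone(x)                      # backbone features (Eq. [2])
v = W_v @ z + b_v                        # visual embedding v_i (Eq. [3])
p = prompt_generator(v)                  # prompt representation p_i
t = W_t @ biogpt_cls(p) + b_t            # text embedding t_i (Eq. [4])
```

The two linear projections are the only learned components shown; in the full model they map into the shared latent space used for fusion.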

Hybrid multimodal fusion model

The main model training stage integrates visual embeddings and semantic prompts via a lightweight dual-branch transformer model. This multimodal fusion architecture outputs scalar risk scores indicative of prognosis. The fusion model integrates visual (V ∈ ℝ^{d_v}), textual (T ∈ ℝ^{d_t}), and clinical (C ∈ ℝ^{d_c}) features via a dual-branch transformer. Let X ∈ ℝ^{H×W×3} denote an input CXR; the vision encoder f_v: ℝ^{H×W×3} → ℝ^{d_v} employs a patch-based transformer. The modality-specific projections are fused as depicted in Eq. [5]:

h_v = ReLU(W_v V + b_v),  h_t = ReLU(W_t T + b_t),  h_c = ReLU(W_c C + b_c) [5]

where W_* ∈ ℝ^{d_h×d_*} are learnable weights. Tokenization and iterative fusion are then employed as shown in Eq. [6]:

M_i^{(0)} = LN(W_m [v_i ∥ t_i ∥ c_i] + b_m) ∈ ℝ^{d} [6]

where LN denotes layer normalization, ∥ the concatenation operator, and W_m ∈ ℝ^{d×(d_v+d_t+d_c)} and b_m ∈ ℝ^{d} the fusion weights and bias.
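A minimal sketch of the fusion step in Eqs. [5,6]: modality features are projected to a shared width, concatenated, linearly mixed, and layer-normalized. The weights are random stand-ins rather than learned parameters, and the shared width d and clinical dimension d_c are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_v, d_t, d_c, d = 512, 768, 32, 256     # d_c and d are assumed for illustration

def layer_norm(x, eps=1e-5):
    # LN over the feature dimension.
    return (x - x.mean()) / (x.std() + eps)

V = rng.standard_normal(d_v)             # visual features
T = rng.standard_normal(d_t)             # textual features
C = rng.standard_normal(d_c)             # clinical features

# Eq. [5]: modality-specific ReLU projections into a shared width d.
W_v = rng.standard_normal((d, d_v)) * 0.02
W_t = rng.standard_normal((d, d_t)) * 0.02
W_c = rng.standard_normal((d, d_c)) * 0.02
h_v, h_t, h_c = (np.maximum(0.0, W @ f) for W, f in [(W_v, V), (W_t, T), (W_c, C)])

# Eq. [6]: concatenate, mix with W_m, and layer-normalize to obtain M_i^(0).
W_m = rng.standard_normal((d, 3 * d)) * 0.02
M0 = layer_norm(W_m @ np.concatenate([h_v, h_t, h_c]))
```

Here the concatenation operates on the projected features h_v, h_t, h_c (each in ℝ^d), which is one consistent reading of the W_m shape in Eq. [6].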

Prognostic prediction

Our prognosis-prediction module embeds CXR features, BioGPT-derived semantic prompts, and tabular covariates into a shared latent vector H_i. A dual-head architecture then produces (I) a calibrated continuous hazard score for time-to-event analysis and (II) a SoftMax posterior that dichotomizes patients into low- versus high-risk strata. The transformer backbone first aligns modality-specific embeddings through stacked multihead fusion-attention layers, a design that has already outperformed conventional pipelines (31). The fused representation is routed to a survival head, S_i = σ(W_r^T H_i + b_r), mirroring the continuous-risk formulation, and to a classification head, ŷ_i = softmax(W_c H_i + b_c), which supports binary triage consistent with recent vision-language prognostic frameworks for CXR analysis. The survival risk and discrete outcome are depicted in Eqs. [7,8], respectively. A sigmoid-activated risk score S_i quantifies individual survival probability, whereas a parallel SoftMax head ŷ_i yields class-level prognosis. Both heads share the fused latent H_i, ensuring that predictions are informed simultaneously by pixel-level patterns, semantic prompts, and structured patient data (32).

r_i = W_r^T H_i + b_r,  S_i = σ(r_i) ∈ (0, 1) [7]

ŷ_i = softmax(W_c H_i + b_c) ∈ Δ^{K−1} [8]
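The dual-head output of Eqs. [7,8] can be illustrated as follows, with K=2 risk classes and random stand-in weights; the 0.5 dichotomization threshold is an assumption for the sketch (the paper selects its operating threshold via Youden's index).

```python
import numpy as np

def sigmoid(r):
    return 1.0 / (1.0 + np.exp(-r))

def softmax(z):
    e = np.exp(z - z.max())   # shift for numerical stability
    return e / e.sum()

rng = np.random.default_rng(1)
d, K = 256, 2
H = rng.standard_normal(d)                    # fused latent H_i
W_r, b_r = rng.standard_normal(d) * 0.02, 0.0
W_c, b_c = rng.standard_normal((K, d)) * 0.02, np.zeros(K)

S = sigmoid(W_r @ H + b_r)                    # Eq. [7]: continuous risk in (0, 1)
y_hat = softmax(W_c @ H + b_c)                # Eq. [8]: class posterior on the simplex
risk_group = "high" if S >= 0.5 else "low"    # assumed 0.5 cut-off for this sketch
```

Both heads read the same latent H, matching the shared-representation design described above.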

Loss function of LHFM

The model was optimized via a weighted binary cross-entropy loss (Eq. [9]) to address class imbalance, with a positive-class weight α = N_neg/N_pos applied to increase the recall of high-risk cases. Training employed the Adam optimizer (β1=0.9, β2=0.999, weight decay =1×10^−4, learning rate =1×10^−4) with early stopping, yielding stable convergence and well-calibrated risk predictions suitable for clinical decision support. The simplified gradient σ(z_i) − y_i ensures numerically stable updates under mixed-precision training.

L_BCE = (1/N) Σ_{i=1}^{N} [max(0, z_i) − y_i z_i + ln(1 + e^{−|z_i|})] [9]
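Eq. [9] is the standard numerically stable form of binary cross-entropy over logits; a sketch with the positive-class weight α applied per sample (the exact placement of α is an assumption, since Eq. [9] is written unweighted):

```python
import numpy as np

def weighted_bce_with_logits(z, y, alpha=1.0):
    """Numerically stable BCE over logits z (Eq. [9]); alpha up-weights
    positives (alpha = N_neg / N_pos in the paper's formulation)."""
    z, y = np.asarray(z, float), np.asarray(y, float)
    # Stable per-sample form: max(z, 0) - y*z + ln(1 + exp(-|z|)).
    per_sample = np.maximum(z, 0.0) - y * z + np.log1p(np.exp(-np.abs(z)))
    weights = np.where(y == 1, alpha, 1.0)
    return float(np.mean(weights * per_sample))
```

With α=1 this reduces exactly to −[y ln σ(z) + (1−y) ln(1−σ(z))] averaged over samples, but avoids overflow for large |z|.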

Reproducibility and implementation details

All experiments were implemented in Python using PyTorch and trained on an NVIDIA GeForce RTX 2080Ti GPU. To enhance reproducibility and minimize run-to-run variability, random initialization was controlled using a deterministic seed (applied across Python/NumPy/PyTorch, including CUDA where applicable). CXR images were preprocessed using a standardized pipeline, including resizing all inputs to 224×224 pixels and applying consistent intensity normalization to stabilize optimization and improve convergence. Model training was performed using the Adam optimizer with a batch size of 16, and an early stopping strategy was applied based on validation performance to reduce overfitting and retain the best-performing checkpoint. To improve generalization, radiography-appropriate data augmentation was incorporated during training. Model evaluation was conducted using 5-fold cross-validation, and final performance is reported as mean ± SD across folds along with 95% confidence intervals (CIs) to quantify uncertainty and ensure reliable comparison across experiments. Specific details are presented in Table S2.
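The deterministic seeding described above might be implemented as follows; the PyTorch-specific calls are shown as comments since this sketch imports only the standard library and NumPy.

```python
import os
import random

import numpy as np

def set_seed(seed: int = 42) -> None:
    """Seed Python and NumPy RNGs for run-to-run reproducibility.
    The PyTorch/CUDA calls from the paper's setup would sit alongside
    these, e.g.:
        torch.manual_seed(seed); torch.cuda.manual_seed_all(seed)
        torch.backends.cudnn.deterministic = True
    """
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    np.random.seed(seed)
```

Calling set_seed before data splitting and weight initialization makes every fold of the cross-validation repeatable.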

Evaluation metrics

The experimental results were evaluated via the concordance index (C-index) to assess prognostic accuracy (33), as depicted in Eq. [10], alongside the AUROC to assess classification performance. Model performance was reported as mean ± SD across folds, with 95% CI, enabling robust estimation of variability and ensuring reproducibility under repeated sampling. The statistical significance of improvements over baselines was assessed via CIs, and interpretability analysis quantified the contributions of imaging and textual features to the predictions (34).

C-index = [Σ_{i,j} I(ŷ_i > ŷ_j | e_i = 1, y_i < y_j)] / [Σ_{i,j} I(e_i = 1, y_i < y_j)] [10]

where ŷ_i and ŷ_j are the predicted risk scores for individuals i and j, and y_i and y_j are the observed survival times. e_i is an event indicator for individual i (1 if the event occurred, 0 if censored), and I(·) is an indicator function that returns 1 if the condition inside holds and 0 otherwise. A pair is concordant when the individual with the shorter observed survival time receives the higher predicted risk score.
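Eq. [10] can be computed directly; the following is a straightforward O(N²) sketch of Harrell's C-index over comparable pairs, counting tied risk scores as 0.5 (a common convention, assumed here since the text does not specify tie handling).

```python
import numpy as np

def concordance_index(risk, time, event):
    """Harrell's C-index (Eq. [10]): over pairs where the earlier time is an
    observed event, count pairs in which the shorter survivor received the
    higher risk score; ties in risk contribute 0.5."""
    risk, time, event = map(np.asarray, (risk, time, event))
    num = den = 0.0
    for i in range(len(time)):
        if event[i] != 1:          # pair comparable only if i's event observed
            continue
        for j in range(len(time)):
            if time[i] < time[j]:  # i failed before j
                den += 1
                if risk[i] > risk[j]:
                    num += 1       # concordant: shorter survival, higher risk
                elif risk[i] == risk[j]:
                    num += 0.5     # tied risk
    return num / den
```

A value of 1.0 indicates perfect risk ranking, 0.5 chance-level ranking.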

Moreover, KM survival estimation is utilized in LC prognosis; it robustly handles censored observations, allowing accurate survival function estimation even when patients are lost to follow-up or when the event has not occurred by the end of the study (35), as depicted in Eq. [11].

Ŝ(t) = Π_{t_(j) ≤ t} (1 − d_j/n_j) [11]

where Ŝ(t) represents the probability that a patient survives longer than time t after diagnosis or treatment starts, t_(j) is the j-th ordered event time, d_j is the number of events at time t_(j), n_j is the number of individuals at risk just before t_(j), and the product is taken over all distinct event times up to t.
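A minimal sketch of the product-limit estimator in Eq. [11], returning the survival probability at each distinct event time:

```python
import numpy as np

def kaplan_meier(time, event):
    """Kaplan-Meier estimator (Eq. [11]): S(t) = prod over event times
    t_(j) <= t of (1 - d_j / n_j). Returns (event_times, survival)."""
    time, event = np.asarray(time, float), np.asarray(event, int)
    ts = np.sort(np.unique(time[event == 1]))    # distinct observed event times
    surv, s = [], 1.0
    for t in ts:
        d = np.sum((time == t) & (event == 1))   # d_j: events at t_(j)
        n = np.sum(time >= t)                    # n_j: at risk just before t_(j)
        s *= 1.0 - d / n
        surv.append(s)
    return ts, np.array(surv)
```

Censored subjects (event = 0) leave the risk set without triggering a factor, which is how the estimator accommodates loss to follow-up.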

Results

Performance comparisons with related works

The progressive improvement from unimodal to multimodal fusion models highlights the intrinsic value of heterogeneous data integration in survival prediction tasks. Specifically, the baseline clinical model demonstrated limited prognostic discrimination (C-index: 0.696±0.008), as presented in Table 1, underscoring the inherent limitations of traditional approaches that rely on structured tabular data. In contrast, the FGCN model’s modest gains (C-index: 0.769±0.007) suggest that modeling inter-feature dependencies can partially address these limitations.

Table 1. Performance comparisons with related works on the primary test set (JSRT and PadChest).

Model | Modality | AUROC (mean ± SD, 95% CI) | C-index (mean ± SD, 95% CI)
Baseline | Unimodal | 0.750±0.007 (0.736–0.764) | 0.696±0.008 (0.680–0.712)
Lite-ProSENet (17) | Multimodal | 0.919±0.004 (0.911–0.927) | 0.893±0.005 (0.883–0.903)
FGCN (18) | Multimodal | 0.834±0.005 (0.824–0.844) | 0.769±0.007 (0.755–0.783)
Schulz et al. (21) | Multimodal | 0.852±0.006 (0.840–0.864) | 0.779±0.007 (0.765–0.793)
Deng et al. (22) | Multimodal | 0.732±0.006 (0.720–0.744) | 0.659±0.008 (0.643–0.675)
LHFM (ours) | Multimodal | 0.935±0.004 (0.927–0.943) | 0.910±0.006 (0.898–0.922)

AUROC, area under the receiver operating characteristic curve; C-index, concordance index; CI, confidence interval; LHFM, lightweight hybrid foundation model; SD, standard deviation.

These trends are consistent with recent studies in which multimodal or habitat-aware CT models achieved strong but CT-dependent survival or response prediction in NSCLC patients (19,36-38). The C-index of 0.910±0.006 (95% CI: 0.898–0.922) confirms that LHFM effectively integrates image and text information, outperforming multimodal baselines, thereby overcoming the gap that often hinders unimodal DL approaches.

The most significant improvement is observed over Deng et al. (22), with AUROC and C-index increases of 0.203 and 0.251, respectively, highlighting a substantial increase in both classification discrimination and survival ranking capabilities. Similarly, compared with the baseline unimodal model, LHFM achieved improvements of 0.185 in the AUROC and 0.214 in the C-index, underscoring the benefit of integrating multimodal data. Notably, while Lite-ProSENet (17) exhibited strong performance, LHFM still achieved significant improvements (AUROC: +0.016, C-index: +0.017), reflecting that its advantages persist even against high-performing multimodal architectures.

These improvements validate LHFM’s multimodal hybrid fusion approach, effectively leveraging complementary information from different data types. The consistent superiority across all comparisons suggests that LHFM generalizes well to different baseline architectures and is not overly tuned to any single reference model. This makes LHFM a clinically promising framework for LC prognosis, with potential applicability across diverse datasets and clinical scenarios.

KM survival curve

KM survival analysis confirmed significant stratification between the high- and low-risk groups identified by LHFM (log-rank P≈6.42×10⁻¹³), as shown in Figure 5. The high-risk group exhibited a steep survival decline, with the probability of survival decreasing to less than 0.4 within 12 months and approaching zero by 60 months, whereas the low-risk group maintained substantially greater survival throughout follow-up. The minimal overlap of the 95% CIs beyond the initial time points reinforced the robustness of this separation. LHFM risk stratification aligns with clinically significant outcome differences, demonstrating strong prognostic utility (C-index: 0.910; AUROC: 0.935), as validated in Table 2.

Figure 5.


KM survival curves for LHFM risk stratification. The LHFM model strongly differentiates between high and low-risk patients, with a marked gap in survival curves. KM, Kaplan-Meier; LHFM, lightweight hybrid foundation model.

Table 2. LHFM performance on external multimodal datasets.

Dataset C-index (mean ± SD, 95% CI)
NIH ChestX-ray dataset (28) 0.851±0.006 (0.839–0.863)
NSCLC-Radiomics-Interobserver1 dataset (30) 0.879±0.005 (0.869–0.889)
The Shenzhen + Montgomery datasets (29) 0.835±0.007 (0.821–0.849)

CI, confidence interval; C-index, concordance index; LHFM, lightweight hybrid foundation model; SD, standard deviation.

Generalizability

Table 2 shows that LHFM preserves exceptional prognostic discrimination across a spectrum of external cohorts. To evaluate external validity and generalizability, the model was trained on the JSRT and PadChest datasets and directly tested on independent external datasets. For the expansive NIH CXR dataset (28), the model achieves a C-index of 0.851, attesting to its robustness within adult radiography pipelines. The performance further increased to 0.879 on the volumetric, expert-annotated NSCLC-Radiomics-Interobserver1 dataset (30), indicating that the integration of rich 3D morphological information can meaningfully enhance survival stratification.

Notably, even on the comparatively limited and tuberculosis-oriented Shenzhen and Montgomery CXR datasets (29), LHFM achieved a C-index of 0.835, reflecting only a modest performance reduction amid domain and pathology shifts. All values exceeded 0.80, confirming the clinical relevance and robustness under distribution shifts, and collectively underscoring LHFM’s capacity to generalize across diverse imaging modalities, spatial resolutions, cohort scales, and disease spectra with minimal adaptation. All external validation values are reported with 95% CIs.

External validation across three independent datasets (NIH CXR, NSCLC-Radiomics-Interobserver1, and Shenzhen-Montgomery) demonstrated strong generalizability (C-index range: 0.835–0.879), confirming that LHFM maintains prognostic accuracy under significant demographic and modality variation.

Ablation study

An ablation study conducted on a stratified hold-out test set (20% of the dataset) confirmed that the visual and textual components are both essential for LHFM’s prognostic performance. The removal of either modality resulted in significant performance degradation, as depicted in Table 3. The prompt-only variant exhibited particularly substantial declines, underscoring the indispensability of imaging data for capturing key anatomical details. Multimodal fusion via attention-based integration outperformed naive fusion strategies, with the full model (LHFM) achieving optimal performance (AUROC: 0.935±0.004; C-index: 0.910±0.006). These results confirm the synergistic interaction between imaging and clinical text modalities, whereas visualization analyses, as depicted in Figure 6, further validated the model’s interpretability and class-agnostic robustness.

Table 3. Ablation study performance.

Model variant | Modification | AUROC, mean ± SD (95% CI) | C-index, mean ± SD (95% CI)
Image-only | Text encoder removed | 0.873±0.007 (0.859–0.887) | 0.842±0.009 (0.824–0.860)
Prompt-only | Image encoder removed | 0.824±0.006 (0.812–0.836) | 0.794±0.008 (0.778–0.810)
Full LHFM (ours) | None (image encoder + text encoder) | 0.935±0.004 (0.927–0.943) | 0.910±0.006 (0.898–0.922)

AUROC, area under the receiver operating characteristic curve; CI, confidence interval; C-index, concordance index; LHFM, lightweight hybrid foundation model; SD, standard deviation.
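The paper reports each metric as mean ± SD with a 95% CI but does not detail the interval procedure; a percentile bootstrap over test-set resamples is one standard way to obtain such intervals. The sketch below is an assumption-laden illustration on toy labels and scores, not the authors’ code:

```python
import numpy as np

def auroc(y_true, y_score):
    """AUROC via the Mann-Whitney U statistic: the probability that a
    random positive case scores higher than a random negative case."""
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    pos, neg = y_score[y_true == 1], y_score[y_true == 0]
    diff = pos[:, None] - neg[None, :]          # all positive-negative pairs
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return wins / (len(pos) * len(neg))

def bootstrap_ci(y_true, y_score, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for AUROC (illustrative choice of method)."""
    rng = np.random.default_rng(seed)
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    n, stats = len(y_true), []
    while len(stats) < n_boot:
        idx = rng.integers(0, n, n)
        if y_true[idx].min() == y_true[idx].max():
            continue  # a valid resample must contain both classes
        stats.append(auroc(y_true[idx], y_score[idx]))
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return lo, hi

y = [0, 0, 0, 1, 1, 1]
s = [0.2, 0.3, 0.6, 0.4, 0.7, 0.9]
print(round(auroc(y, s), 3))  # 8/9 -> 0.889
lo, hi = bootstrap_ci(y, s, n_boot=500)
print(round(lo, 2), round(hi, 2))
```

With a realistic test-set size, the percentile interval narrows roughly as 1/√n, which is consistent with the tight CIs reported in Tables 3 and 4.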

Figure 6

Performance visualization of the proposed model. (A) AUROC for the full LHFM model compared against image-only and prompt-only variants on the hold-out test set. (B) Confusion matrix for binary risk classification at the optimal threshold determined by Youden’s index. (C) Sample visualization from the test set: (left) original CXR image with ground-glass opacity, (middle) model-generated heatmap localizing the region of high prognostic saliency, (right) attention map from the fusion module highlighting features correlated with the generated clinical prompt. AUC, area under the curve; AUROC, area under the receiver operating characteristic curve; CXR, chest X-ray; LHFM, lightweight hybrid foundation model; ROC, receiver operating characteristic.
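Figure 6B binarizes risk at the threshold that maximizes Youden’s index, J = sensitivity + specificity − 1. A minimal sketch of that selection rule, on hypothetical labels and scores, is:

```python
def youden_threshold(labels, scores):
    """Return the score threshold maximizing Youden's J, the rule used
    to derive a binary risk call from a continuous risk score."""
    best_j, best_t = -1.0, None
    for t in sorted(set(scores)):            # candidate cutoffs at observed scores
        tp = sum(1 for y, s in zip(labels, scores) if s >= t and y == 1)
        fn = sum(1 for y, s in zip(labels, scores) if s < t and y == 1)
        tn = sum(1 for y, s in zip(labels, scores) if s < t and y == 0)
        fp = sum(1 for y, s in zip(labels, scores) if s >= t and y == 0)
        j = tp / (tp + fn) + tn / (tn + fp) - 1
        if j > best_j:
            best_j, best_t = j, t
    return best_t, best_j

y = [0, 0, 0, 1, 1, 1]
s = [0.1, 0.2, 0.6, 0.5, 0.7, 0.9]
t, j = youden_threshold(y, s)
print(t, round(j, 3))  # best threshold 0.5, J ~ 0.667
```

Once the threshold is fixed, the confusion matrix in Figure 6B follows directly from the resulting binary predictions.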

Sensitivity analyses and robustness evaluation

To address concerns regarding potential prompt leakage and the unusually high discrimination metrics, we conducted sensitivity analyses spanning benchmark comparisons, modality ablations, and external robustness evaluation. The full multimodal LHFM achieved the strongest overall performance (AUROC =0.935±0.004; C-index =0.910±0.006). Removing individual branches resulted in consistent degradation, including the image-only model (AUROC =0.873±0.007; C-index =0.842±0.009) and the prompt-only model (AUROC =0.824±0.006; C-index =0.794±0.008), indicating that the performance gains are not attributable to a single dominant modality. These experiments confirm that LHFM benefits from complementary multimodal integration rather than trivial shortcuts (Table 4).

Table 4. Sensitivity analyses and model robustness evaluation (95% CI).

Model | Analysis type | AUROC, mean ± SD (95% CI) | C-index, mean ± SD (95% CI) | Fusion description
LHFM (ours) | Reference (primary) | 0.935±0.004 (0.927–0.943) | 0.910±0.006 (0.898–0.922) | Full hybrid fusion model (image + prompt integration)
Image-only | Ablation | 0.873±0.007 (0.859–0.887) | 0.842±0.009 (0.824–0.860) | Prompt encoder removed (imaging backbone only)
Prompt-only | Ablation | 0.824±0.006 (0.812–0.836) | 0.794±0.008 (0.778–0.810) | Image encoder removed (prompt-only)
Baseline | Benchmark | 0.750±0.007 (0.736–0.764) | 0.696±0.008 (0.680–0.712) | Unimodal baseline
Lite-ProSENet (17) | Benchmark | 0.919±0.004 (0.911–0.927) | 0.893±0.005 (0.883–0.903) | Multimodal comparator
FGCN (18) | Benchmark | 0.834±0.005 (0.824–0.844) | 0.769±0.007 (0.755–0.783) | Multimodal comparator
Schulz et al. (21) | Benchmark | 0.852±0.006 (0.840–0.864) | 0.779±0.007 (0.765–0.793) | Multimodal comparator
Deng et al. (22) | Benchmark | 0.732±0.006 (0.720–0.744) | 0.659±0.008 (0.643–0.675) | Multimodal comparator

AUROC, area under the receiver operating characteristic curve; CI, confidence interval; C-index, concordance index; LHFM, lightweight hybrid foundation model; SD, standard deviation.

These sensitivity analyses demonstrate that LHFM remains robust across ablation experiments, benchmark comparisons, and external evaluations (Table 4), with the full multimodal model outperforming both single-branch variants and supporting the complementary value of hybrid fusion. External robustness was likewise maintained across the independent datasets, with C-index values ranging from 0.835 to 0.879, confirming generalizable prognostic discrimination under dataset shift (Table S3).

Discussion

This study introduces LHFM, which integrates visual features extracted from CXR via a fine-tuned SAM-Med2D encoder with semantically enriched clinical text prompts generated by BioGPT for LC prognosis. The LHFM demonstrated consistent superiority over state-of-the-art unimodal and multimodal benchmarks across diverse cohorts. It achieved a C-index of 0.910 (95% CI: 0.898–0.922; P<0.001), a significant improvement over existing models such as Lite-ProSENet (17). Ablation studies showed that integrating clinical text embeddings improved prognostic accuracy by approximately 7% over image-only baselines and substantially enhanced model interpretability through attention-guided risk stratification. Unlike existing ViT-based models focused primarily on report generation (39,40), the LHFM is specifically designed for survival analysis, explicitly modeling time-to-event endpoints with a computationally efficient architecture. This directly addresses the unmet need for lightweight, interpretable prognostic models for vulnerable populations and resource-limited contexts.

The interpretability and clinical relevance of the LHFM were enhanced through BioGPT-guided saliency maps that consistently co-localized with clinically relevant anatomical regions, reinforcing clinician trust and adhering to emerging standards for AI transparency and accountability (39). This aligns with the findings of a recent study emphasizing the importance of structured AI reporting and its integration into radiology workflows (41). Such alignment is particularly impactful in oncology for vulnerable populations, where limited datasets and the imperative to minimize radiation exposure make interpretable, low-dose prognostic tools essential for both scientifically and ethically sound precision medicine (41). The model’s prognostic utility was further confirmed through KM survival analysis, which provides intuitive visualization and complements statistical metrics with explicit risk stratification (42). By leveraging the accessibility and safety of CXR (9), our approach provides a clinically relevant, evidence-driven framework for advancing precision oncology across diverse healthcare settings.
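For context, the KM curves underlying such risk stratification rest on the product-limit estimator (42), which is straightforward to compute. The sketch below uses a toy follow-up dataset, not study data:

```python
def kaplan_meier(times, events):
    """Product-limit (Kaplan-Meier) estimate of survival S(t), evaluated
    at each distinct event time; censored subjects leave the risk set
    without contributing an event."""
    curve, s = [], 1.0
    event_times = sorted({t for t, d in zip(times, events) if d})
    for t in event_times:
        at_risk = sum(1 for u in times if u >= t)                    # risk set at t
        deaths = sum(1 for u, d in zip(times, events) if u == t and d)
        s *= 1 - deaths / at_risk                                    # multiplicative update
        curve.append((t, s))
    return curve

# toy cohort: events at t=1, 2, 4; one censored subject at t=3
print(kaplan_meier([1, 2, 3, 4], [1, 1, 0, 1]))
```

Plotting one such curve per risk group (e.g., LHFM high-risk vs. low-risk) and comparing them with a log-rank test yields the group separation described in the abstract.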

Furthermore, LHFM employs a lightweight architectural design that uses substantially fewer parameters than conventional transformer-based survival models (43), ensuring computational efficiency without compromising predictive performance. This efficiency is particularly crucial for clinical deployment in resource-constrained environments. The model demonstrated robust generalizability across diverse patient demographics and imaging protocols, achieving consistent prognostic performance with C-index values ranging from 0.835 to 0.879 on external validation datasets, exceeding the clinically meaningful threshold of 0.80 (44,45). The clinical implication is that this framework could be integrated into bedside radiology systems to support real-time triage and informed decision-making (3,46).

The strong discrimination reported in this study should be interpreted within the context of the study design. Because the primary endpoint is an algorithmically computed risk label derived from available demographic and clinical data, the reported AUROC and C-index reflect methodological feasibility and multimodal integration capacity, rather than direct clinical survival prediction performance. Therefore, these results should not be directly compared with CT-based survival radiomics studies using verified time-to-event endpoints. Instead, LHFM demonstrates that lightweight hybrid fusion of foundation-level visual embeddings and semantically enriched prompt representations can deliver robust risk stratification performance on low-dose CXR data under controlled assumptions.

In addition to its methodological contributions, the LHFM has significant implications for health system integration (47,48). By enabling accurate LC prognosis from low-dose CXR, the framework directly supports clinical decision-support systems deployable at the point of care. Its lightweight architecture reduces computational requirements, facilitating adoption in health facilities where advanced imaging modalities are unavailable. The scalability of the framework enables integration with radiology workflows and electronic health record (EHR) platforms, supporting risk stratification, follow-up scheduling, and treatment prioritization. By leveraging CXR, the most widely available imaging modality, our model enhances cost-effectiveness, reduces reliance on high-radiation techniques, and promotes equitable access to advanced prognostic tools. In this way, the model contributes not only to technical innovation but also to healthcare efficiency and system-wide optimization. Moreover, this work aligns with the translational focus of Translational Lung Cancer Research by demonstrating how foundation models can be adapted into lightweight, clinically deployable prognostic tools.

We performed dedicated prompt-sensitivity experiments on the model-generated prompts to verify that the prompt pathway does not artificially inflate performance through pairing artifacts. Compared with the reference LHFM, replacing prompts with a null token reduced performance, prompt shuffling across subjects produced a substantial drop, and random prompt injection further degraded discrimination. These findings confirm that the prompt branch contributes meaningful value only when prompts remain semantically relevant and correctly paired with the corresponding image, supporting the interpretation that LHFM learns clinically grounded multimodal representations rather than prompt-driven shortcuts.
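A schematic version of such a prompt-sensitivity protocol can be expressed with synthetic embeddings and a hypothetical dot-product fusion head standing in for the actual LHFM transformer; everything below is an illustrative assumption, not the study’s pipeline:

```python
import numpy as np

rng = np.random.default_rng(42)

def fused_score(img_emb, prompt_emb):
    """Stand-in fusion head (hypothetical): dot-product similarity as a
    scalar risk score; the real LHFM fuses branches with a transformer."""
    return (img_emb * prompt_emb).sum(axis=1)

def discrimination(score, risk):
    """Absolute Pearson correlation with the latent risk (0 if constant)."""
    if score.std() == 0:
        return 0.0
    return abs(float(np.corrcoef(score, risk)[0, 1]))

n, d = 200, 16
img = rng.normal(size=(n, d))
risk = (img ** 2).sum(axis=1)                    # synthetic ground-truth risk
matched = img + 0.1 * rng.normal(size=(n, d))    # semantically paired prompts
variants = {
    "matched": matched,
    "shuffled": matched[rng.permutation(n)],     # pairing broken across subjects
    "null": np.zeros((n, d)),                    # null-token replacement
    "random": rng.normal(size=(n, d)),           # random prompt injection
}
for name, p in variants.items():
    print(name, round(discrimination(fused_score(img, p), risk), 3))
```

In this toy setup, only the matched prompts preserve discrimination, mirroring the pattern observed in the paper’s null-token, shuffling, and random-injection conditions.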

However, this study is limited by its retrospective design and potential biases arising from multicenter heterogeneity. Although robustness and prompt-sensitivity analyses were performed, prospective validation on survival-linked LC cohorts is still needed. Furthermore, the reliance on CXR, while a strength in terms of accessibility, inherently limits the morphological detail available compared with CT, which may cap the ultimate prognostic performance. Future studies will prospectively evaluate LHFM within integrated genomic and EHR workflows to assess workflow impact, trust calibration, and prognostic depth.

Conclusions

This study presents LHFM, which integrates SAM-Med2D visual features with BioGPT-generated prompts and clinical metadata for the prognosis of LC. Our model demonstrated superior prognostic accuracy (C-index =0.910; AUROC =0.935) and interpretability compared with existing frameworks, while maintaining computational efficiency, making it suitable for vulnerable populations and resource-limited healthcare environments. By leveraging CXRs, the most widely accessible imaging modality, LHFM reduces reliance on costly, high-radiation imaging, facilitating seamless integration into EHR and radiology workflows. In conclusion, the proposed framework provides a scalable, cost-effective decision support tool that integrates seamlessly into healthcare delivery systems, improving workflow efficiency and access to accurate prognoses in diverse clinical settings. Future work will focus on prospective validation and the incorporation of genomic data to enhance clinical utility across oncological domains.

Supplementary

The article’s supplementary files are as follows:

tlcr-15-03-47-rc.pdf (140.2KB, pdf)
DOI: 10.21037/tlcr-2025-aw-1299
tlcr-15-03-47-coif.pdf (252.7KB, pdf)
DOI: 10.21037/tlcr-2025-aw-1299

Acknowledgments

None.

Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. The study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments.

Footnotes

Reporting Checklist: The authors have completed the TRIPOD reporting checklist. Available at https://tlcr.amegroups.com/article/view/10.21037/tlcr-2025-aw-1299/rc

Funding: This work was supported by the National Natural Science Foundation (No. 62271127), the China Higher Education Institution Industry-University-Research Innovation Fund (Nos. 2024IT007 and 2024DR039), the Medico-Engineering Cooperation Funds from the UESTC, and West China Hospital of Sichuan University (Nos. ZYGX2022YGRH011 and HXDZ22005).

Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://tlcr.amegroups.com/article/view/10.21037/tlcr-2025-aw-1299/coif). The authors have no conflicts of interest to declare.

References

  • 1.Bray F, Laversanne M, Sung H, et al. Global cancer statistics 2022: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J Clin 2024;74:229-63. 10.3322/caac.21834 [DOI] [PubMed] [Google Scholar]
  • 2.Li C, Lei S, Ding L, et al. Global burden and trends of lung cancer incidence and mortality. Chin Med J (Engl) 2023;136:1583-90. 10.1097/CM9.0000000000002529 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3.Liu F, Zhu T, Wu X, et al. A medical multimodal large language model for future pandemics. NPJ Digit Med 2023;6:226. 10.1038/s41746-023-00952-2 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4.Jiang Y, Gao C, Shao Y, et al. The prognostic value of radiogenomics using CT in patients with lung cancer: a systematic review. Insights Imaging 2024;15:259. 10.1186/s13244-024-01831-4 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5.Kumaran S Y, Jeya JJ, R MT, et al. Explainable lung cancer classification with ensemble transfer learning of VGG16, Resnet50 and InceptionV3 using grad-cam. BMC Med Imaging 2024;24:176. 10.1186/s12880-024-01345-x [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6.Wang D, Liu H, Cao Z, et al. Integrating deep learning and radiomics in the differentiation of major histological subtypes of invasive non-mucinous lung adenocarcinoma using positron emission tomography and computed tomography. Transl Lung Cancer Res 2025;14:3323-36. 10.21037/tlcr-2025-333 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7.Brody AS, Frush DP, Huda W, et al. Radiation risk to children from computed tomography. Pediatrics 2007;120:677-82. 10.1542/peds.2007-1910 [DOI] [PubMed] [Google Scholar]
  • 8.Almohiy H. Paediatric computed tomography radiation dose: A review of the global dilemma. World J Radiol 2014;6:1-6. 10.4329/wjr.v6.i1.1 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9.Bailey GL, Wells AU, Desai SR. Imaging of pulmonary sarcoidosis—a review. J Clin Med 2024;13:822. 10.3390/jcm13030822 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10.Chen Q, Cheng J, Wang L, et al. Primary lung cancer in children and adolescents. J Cancer Res Clin Oncol 2024;150:225. 10.1007/s00432-024-05750-1 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11.Wang X, Gu L, Zhang Y, et al. Validation of survival prognostic models for non-small-cell lung cancer in stage- and age-specific groups. Lung Cancer 2015;90:281-7. 10.1016/j.lungcan.2015.08.007 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12.Voggel S, Abele M, Seitz C, et al. Primary lung carcinoma in children and adolescents - Clinical characteristics and outcome of 12 cases from the German registry for rare paediatric tumours (STEP). Lung Cancer 2021;160:66-72. 10.1016/j.lungcan.2021.08.004 [DOI] [PubMed] [Google Scholar]
  • 13.Monajatipoor M, Rouhsedaghat M, Li LH, et al. BERTHop: An Effective Vision-and-Language Model for Chest X-ray Disease Diagnosis. Med Image Comput Comput Assist Interv 2022;13435:725-34. 10.1007/978-3-031-16443-9_69 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14.Chen W, Bai J, Fang Y, et al. Prognostic factors and surgical management in pediatric primary lung cancer: a retrospective cohort study using SEER data. Transl Pediatr 2024;13:1671-83. 10.21037/tp-24-174 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15.Lonseko ZM, Hu D, Zhang K, et al. Deep multi-task learning framework for gastrointestinal lesion-aided diagnosis and severity estimation. Sci Rep 2025;15:25827. 10.1038/s41598-025-09587-7 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16.Luo R, Sun L, Xia Y, et al. BioGPT: generative pre-trained transformer for biomedical text generation and mining. Brief Bioinform 2022;23:bbac409. 10.1093/bib/bbac409 [DOI] [PubMed] [Google Scholar]
  • 17.Wu Y, Wang Y, Huang X, et al. Multimodal learning for non-small cell lung cancer prognosis. Biomed Signal Process Control 2025;106:107663. [Google Scholar]
  • 18.Ma X, Ning F, Xu X, et al. Survival Prediction for Non-Small Cell Lung Cancer Based on Multimodal Fusion and Deep Learning. IEEE Access 2024;12:123236-49.
  • 19.Zhou L, Mao C, Fu T, et al. Development of an AI model for predicting hypoxia status and prognosis in non-small cell lung cancer using multi-modal data. Transl Lung Cancer Res 2024;13:3642-56. 10.21037/tlcr-24-982 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 20.Amini M, Nazari M, Shiri I, et al. Multi-level multi-modality (PET and CT) fusion radiomics: prognostic modeling for non-small cell lung carcinoma. Phys Med Biol 2021. [DOI] [PubMed] [Google Scholar]
  • 21.Schulz S, Woerl AC, Jungmann F, et al. Multimodal Deep Learning for Prognosis Prediction in Renal Cancer. Front Oncol 2021;11:788740. 10.3389/fonc.2021.788740 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22.Deng R, Shaikh N, Shannon G, Nie Y. Cross-modality attention-based multimodal fusion for non-small cell lung cancer (NSCLC) patient survival prediction. In: Medical Imaging 2024: Digital and Computational Pathology. SPIE, 2024:46-50. [Google Scholar]
  • 23.Christie JR, Daher O, Abdelrazek M, et al. Predicting recurrence risks in lung cancer patients using multimodal radiomics and random survival forests. J Med Imaging (Bellingham) 2022;9:066001. 10.1117/1.JMI.9.6.066001 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24.Ma J, He Y, Li F, et al. Segment anything in medical images. Nat Commun 2024;15:654. 10.1038/s41467-024-44824-z [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25.Ying X, Guo H, Ma K, et al. X2CT-GAN: reconstructing CT from biplanar X-rays with generative adversarial networks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2019:10619-28. [Google Scholar]
  • 26.Shiraishi J, Katsuragawa S, Ikezoe J, et al. Development of a digital image database for chest radiographs with and without a lung nodule: receiver operating characteristic analysis of radiologists’ detection of pulmonary nodules. AJR Am J Roentgenol 2000;174:71-4. 10.2214/ajr.174.1.1740071 [DOI] [PubMed] [Google Scholar]
  • 27.Bustos A, Pertusa A, Salinas JM, et al. PadChest: A large chest x-ray image dataset with multi-label annotated reports. Med Image Anal 2020;66:101797. 10.1016/j.media.2020.101797 [DOI] [PubMed] [Google Scholar]
  • 28.Wang X, Peng Y, Lu L, et al. ChestX-ray8: Hospital-scale Chest X-ray Database and Benchmarks on Weakly-Supervised Classification and Localization of Common Thorax Diseases. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2017. [Google Scholar]
  • 29.Jaeger S, Candemir S, Antani S, et al. Two public chest X-ray datasets for computer-aided screening of pulmonary diseases. Quant Imaging Med Surg 2014;4:475-7. 10.3978/j.issn.2223-4292.2014.11.20 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30.Wee L, Aerts HJL, Kalendralis P, et al. Data from NSCLC-Radiomics-Interobserver1 [data set]. The Cancer Imaging Archive 2019;10.
  • 31.Gomaa A, Huang Y, Hagag A, et al. Comprehensive multimodal deep learning survival prediction enabled by a transformer architecture: A multicenter study in glioblastoma. Neurooncol Adv 2024;6:vdae122. 10.1093/noajnl/vdae122 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 32.Kumar V, Prabha C, Sharma P, et al. Unified deep learning models for enhanced lung cancer prediction with ResNet-50-101 and EfficientNet-B3 using DICOM images. BMC Med Imaging 2024;24:63. 10.1186/s12880-024-01241-4 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 33.Harrell FE, Jr, Califf RM, Pryor DB, et al. Evaluating the yield of medical tests. JAMA 1982;247:2543-6. [PubMed] [Google Scholar]
  • 34.Park SY, Park JE, Kim H, et al. Review of Statistical Methods for Evaluating the Performance of Survival or Other Time-to-Event Prediction Models (from Conventional to Deep Learning Approaches). Korean J Radiol 2021;22:1697-707. 10.3348/kjr.2021.0223 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 35.Clark TG, Bradburn MJ, Love SB, et al. Survival analysis part I: basic concepts and first analyses. Br J Cancer 2003;89:232-8. 10.1038/sj.bjc.6601118 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 36.Lou N, Cui X, Lin X, et al. Development and validation of a deep learning-based model to predict response and survival of T790M mutant non-small cell lung cancer patients in early clinical phase trials using electronic medical record and pharmacokinetic data. Transl Lung Cancer Res 2024;13:706-20. 10.21037/tlcr-23-737 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 37.Chen Z, Liu H, Sun H, et al. Integrating radiomics and deep learning for enhanced prediction of high-grade patterns in stage IA lung adenocarcinoma. Transl Lung Cancer Res 2025;14:1076-88. 10.21037/tlcr-24-995 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 38.Zhang Z, Liu Z, Yang M, et al. Multi-institutional development and validation of habitat imaging for predicting outcomes of first-line immunotherapy in advanced non-small cell lung cancer. Transl Lung Cancer Res 2025;14:3886-99. 10.21037/tlcr-2025-554 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 39.Tanno R, Barrett DGT, Sellergren A, et al. Collaboration between clinicians and vision-language models in radiology report generation. Nat Med 2025;31:599-608. 10.1038/s41591-024-03302-1 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 40.Ouis MY, Akhloufi MA. ChestBioX-Gen: contextual biomedical report generation from chest X-ray images using BioGPT and co-attention mechanism. Front Imaging 2024;3. [Google Scholar]
  • 41.Jorg T, Halfmann MC, Stoehr F, et al. A novel reporting workflow for automated integration of artificial intelligence results into structured radiology reports. Insights Imaging 2024;15:80. 10.1186/s13244-024-01660-5 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 42.Schober P, Vetter TR. Survival Analysis and Interpretation of Time-to-Event Data: The Tortoise and the Hare. Anesth Analg 2018;127:792-8. 10.1213/ANE.0000000000003653 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 43.Zhou HY, Yu Y, Wang C, et al. A transformer-based representation-learning model with unified processing of multimodal input for clinical diagnostics. Nat Biomed Eng 2023;7:743-55. 10.1038/s41551-023-01045-x [DOI] [PubMed] [Google Scholar]
  • 44.Zhong Z, Wang Y, Wu J, et al. Vision-language model for report generation and outcome prediction in CT pulmonary angiogram. NPJ Digit Med 2025;8:432. 10.1038/s41746-025-01807-8 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 45.Hong YJ, Ha SH, Park SH, et al. Deep learning-based classification of pleural malignancy using medical thoracoscopic images. Transl Lung Cancer Res 2025;14:4475-84. 10.21037/tlcr-2025-588 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 46.Zhang JN, Li ZF, Zheng SY, et al. Deep learning model for predicting spread through air spaces of lung adenocarcinoma based on transfer learning mechanism. Transl Lung Cancer Res 2025;14:1061-75. 10.21037/tlcr-24-985 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 47.Tsuji T, Hirata Y, Kusunose K, et al. Classification of chest X-ray images by incorporation of medical domain knowledge into operation branch networks. BMC Med Imaging 2023;23:62. 10.1186/s12880-023-01019-0 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 48.Baek S, He Y, Allen BG, et al. Deep segmentation networks predict survival of non-small cell lung cancer. Sci Rep 2019;9:17286. 10.1038/s41598-019-53461-2 [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Data Citations

    1. Wee L, Aerts HJL, Kalendralis P, et al. Data from NSCLC-Radiomics-Interobserver1 [data set]. The Cancer Imaging Archive 2019;10.


    Articles from Translational Lung Cancer Research are provided here courtesy of AME Publications
