BMC Medical Imaging. 2026 Feb 5;26:125. doi: 10.1186/s12880-026-02179-5

Fusion of deep learning and radiomics with dynamic spatiotemporal modeling for non-small cell lung cancer recurrence risk assessment after microwave ablation

Meng Li 1, Yongzhao Li 2, Xiangming Wang 1, Hui Feng 1, Yang Li 1, Ying Zhang 3, Gaofeng Shi 1
PMCID: PMC12969849  PMID: 41645153

Abstract

Background and aims

Non-small cell lung cancer (NSCLC) recurrence after microwave ablation (MWA) remains a critical clinical challenge, with existing risk stratification tools limited by static feature analysis and poor generalizability. This study aimed to develop a deep learning-radiomics (DLR) fusion model incorporating dynamic spatiotemporal patterns from longitudinal imaging to enable robust recurrence prediction.

Methods

A single-center cohort comprising 184 patients with pre- and post-MWA CT sequences (baseline to 12 months) was analyzed. We implemented a spatiotemporal modeling framework where 3D-ResNet convolutional networks and a PyRadiomics-based pipeline were applied in parallel to the same longitudinal CT series to simultaneously extract deep learning embeddings and radiomic signatures from the ablation zones. These spatiotemporal features were subsequently processed through a Transformer architecture with temporal self-attention mechanisms to capture dynamic lesion evolution patterns. The model integrated imaging-derived characteristics with radiomics features and deep learning features through adaptive multimodal fusion modules, utilizing gated cross-attention mechanisms to establish feature inter-dependencies.

Results

The DLR model achieved superior performance (AUC = 0.92, 95% CI: 0.89–0.95) compared to standalone deep learning (AUC = 0.85) or radiomics models (AUC = 0.78), outperforming radiologists’ visual assessments (AUC = 0.76, Kappa = 0.68).

Conclusion

This study establishes the DLR framework for dynamic NSCLC recurrence risk profiling, demonstrating that spatiotemporal feature fusion significantly enhances predictive accuracy.

Supplementary Information

The online version contains supplementary material available at 10.1186/s12880-026-02179-5.

Keywords: Deep learning-radiomics integration, Microwave ablation, Non-small cell lung cancer, Spatiotemporal transformer

Background

Non-small cell lung cancer (NSCLC), accounting for 85% of all lung malignancies, remains a leading cause of cancer-related mortality worldwide [1]. For inoperable patients, microwave ablation (MWA) has emerged as a cornerstone minimally invasive therapy due to its repeatability and low complication rates [2]. However, post-procedural recurrence rates persist at 20%-40%, severely compromising long-term survival [3]. Conventional surveillance protocols relying on static anatomical imaging (CT) and clinicopathological markers (e.g., tumor size, differentiation grade) exhibit limited sensitivity (58–67%) in detecting occult residual disease, primarily due to their inability to quantify tumor heterogeneity or microenvironmental dynamics [4].

Traditional prognostic models for post-ablation recurrence have largely relied on static clinical parameters or single-time-point imaging features. While radiomics has revolutionized tumor phenotyping by extracting high-throughput imaging features (texture, wavelet, etc.), its dependence on handcrafted feature engineering introduces critical limitations. Aerts et al. [5] demonstrated that only 18% of CT-derived radiomic features exhibit sufficient reproducibility (ICC > 0.75) across multicenter studies. Moreover, traditional radiomics fails to capture spatiotemporal patterns in post-ablation tissue remodeling—a key determinant of recurrence driven by HIF-1α-mediated angiogenesis [6].

Deep convolutional neural networks (CNNs), despite achieving state-of-the-art performance in tumor segmentation (Dice = 0.92) [7], struggle with interpretability when analyzing longitudinal imaging data. Zhu et al. [8] reported a preoperative recurrence prediction AUC of 0.85 using 3D-ResNet, yet their model ignored post-MWA temporal evolution—a period when 72% of recurrence-driving biological processes occur [9].

We propose a deep learning–radiomics (DLR) model that integrates volumetric imaging features and handcrafted radiomics across multiple post-ablation follow-up time points to forecast local recurrence after MWA. Specifically, the model uses contrast-enhanced CT scans acquired at baseline and at 1, 3, and 6 months after MWA as input and predicts the binary outcome of 12-month local recurrence (yes/no). Existing models either (i) handcraft delta-radiomics between two time points or (ii) learn deep features at a single snapshot, and therefore fail to capture sequence-wide, non-uniform temporal dependencies and bidirectional interactions between radiomics and deep representations. To overcome these limitations, our framework employs a 3D-ResNet backbone to extract volumetric representations and a Transformer with gated cross-attention and continuous-time self-attention to model lesion evolution and tissue remodeling dynamics during the first six months after MWA, with the aim of early and accurate prognosis of local recurrence within one year.

Methods

Study design and patient cohort

This single-center retrospective study was approved by the Ethics Committee of the Fourth Hospital of Hebei Medical University, with waiver of informed consent for use of de-identified data. Between January 2020 and December 2024, we reviewed 247 consecutive patients with histologically confirmed stage I-III NSCLC who underwent percutaneous CT-guided microwave ablation (MWA) at our institution. Inclusion criteria were: (1) age ≥ 18 years; (2) lesions treated with curative-intent MWA; (3) contrast-enhanced thoracic CT scans at baseline (within 2 weeks before MWA) and at 1, 3, 6, and 12 months post-ablation; and (4) documented local disease status at 12 months. Exclusion criteria were slice thickness > 5 mm, severe respiratory motion or metal artifacts, prior radiotherapy to the lung, and incomplete clinical or imaging data. After exclusions, 184 patients (mean age 65.1 ± 8.9 years; 102 males) comprised the final cohort (Fig. 1). Patients were randomly assigned to a training set (n = 129, 70%) and an independent test set (n = 55, 30%), stratified by recurrence status; randomization was performed at the patient level to prevent data leakage between sets. All preprocessing, feature scaling, and decision threshold selection were fit on the training set only and applied unchanged to the test set. We fixed random seeds (Python = 42, NumPy = 42, PyTorch = 42) and enabled deterministic behavior where feasible.

The study was designed as a prognostic forecasting task: the model used contrast-enhanced CT scans obtained at baseline and at 1, 3, and 6 months post-ablation as inputs to predict the binary outcome of 12-month local recurrence (yes/no). The 12-month CT scan was used solely for outcome adjudication and was excluded from the primary predictive modeling process. A post-hoc diagnostic variant incorporating the 12-month image was analyzed separately in the Supplementary Materials as a non-deployable comparator. Reporting of this radiomics and AI study adhered to CLEAR (Checklist for Evaluation of Radiomics Studies), METRICS (METhodological RadiomICs Score), and CLAIM (Checklist for Artificial Intelligence in Medical Imaging); completed checklists are provided as Supplementary files.
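The patient-level, outcome-stratified 70/30 split with a fixed seed can be illustrated with a minimal sketch. The function name `stratified_patient_split` and the toy cohort below are hypothetical and are not taken from the study's code or data; only the split fraction and the seed value of 42 come from the text.

```python
import random

def stratified_patient_split(labels_by_patient, test_fraction=0.30, seed=42):
    """Split patient IDs into train/test sets, stratified by binary outcome.

    labels_by_patient maps a patient ID to 0 (no recurrence) or 1 (recurrence).
    Each patient lands in exactly one set, so all scans of a patient stay together.
    """
    rng = random.Random(seed)  # local generator keeps the split reproducible
    train_ids, test_ids = [], []
    for outcome in (0, 1):
        ids = sorted(pid for pid, y in labels_by_patient.items() if y == outcome)
        rng.shuffle(ids)
        n_test = round(test_fraction * len(ids))
        test_ids.extend(ids[:n_test])
        train_ids.extend(ids[n_test:])
    return sorted(train_ids), sorted(test_ids)

# Hypothetical cohort: 100 patients, 30 with recurrence (not the study data).
toy_labels = {f"P{i:03d}": int(i < 30) for i in range(100)}
train_ids, test_ids = stratified_patient_split(toy_labels)
```

Because the shuffle happens within each outcome class, the recurrence rate is approximately the same in both sets, which is the property the stratification is meant to guarantee.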

Fig. 1.

Flowchart of the patient selection

Imaging acquisition and preprocessing

CT examinations were performed on a GE Revolution CT scanner (General Electric, USA). Patients fasted for 5 h before the examination and received breath-hold training. Patients were scanned head-first in the supine position with both arms raised beside the head. The tube voltage was set at 100 kVp, and automatic tube current modulation was applied. The slice thickness and interval were both 5 mm. Contrast medium (ioversol, 320 mgI/ml, 1.2–1.5 ml/kg) was injected via an antecubital vein at a rate of 4.5–5 ml/s, and scanning started after a 25–30 s delay. The matrix was 512 × 512, the reconstruction slice thickness was 1.25 mm, the pitch was 0.992:1, and the rotation time was 0.5 s per revolution.

Tumor and ablation zones were semi-automatically segmented at each time point by two experienced thoracic radiologists using the Darwin AI Research Platform, with adjudication by a third reader. The Darwin AI software employed 3D region-growing algorithms combined with edge-snapping and morphological refinement. All AI-generated segmentations underwent manual quality control, and joint adjudication was performed for all cases to ensure consistency. Inter-reader reproducibility was assessed on 50 randomly selected volumes, yielding a Dice coefficient of 0.88 (95% CI 0.85–0.90) and a feature-level ICC of 0.82 (IQR 0.79–0.85). To evaluate whether segmentation quality affected downstream modeling performance, we conducted a sensitivity analysis using intentionally perturbed masks (± 3-voxel boundary shift). The model’s AUC decreased by less than 0.01 compared with baseline training, indicating that minor segmentation variability did not materially influence recurrence prediction.

All CT volumes were resampled to isotropic 1.0 × 1.0 × 1.0 mm voxels using trilinear interpolation (masks via nearest-neighbor), ensuring consistent spatial resolution for radiomics and deep feature extraction. Intensities were clipped to −1000 to 400 HU. All normalization and scaling parameters (mean, SD, and histogram limits) were computed on the training folds only and applied unchanged to the validation/test data to prevent data leakage.

For multi-timepoint alignment, rigid followed by B-spline registration (mutual information; 200 iterations) was applied using ANTs 2.5. Landmark-based evaluation using ten anatomical fiducials per case showed a mean target registration error (TRE) of 1.6 ± 0.8 mm, a level of alignment judged suitable for longitudinal feature analysis. Radiomic feature robustness was further verified by segmentation perturbation (mask dilation/erosion ± 3 voxels) and intensity jittering (± 5 HU); 86% of extracted features exhibited ICC ≥ 0.80 across perturbations, and only stable features were retained for model construction.
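The intensity handling described above (clipping to a fixed HU window, then scaling with statistics estimated on training data only) can be sketched as follows. This is a minimal illustration in plain Python; the helper names and toy voxel values are hypothetical, and the use of z-score scaling is an assumption, since the text states only that mean, SD, and histogram limits were fit on the training folds.

```python
from statistics import mean, pstdev

HU_MIN, HU_MAX = -1000.0, 400.0  # clipping window stated in the text

def clip_hu(values):
    """Clamp intensities to the [-1000, 400] HU window."""
    return [min(max(v, HU_MIN), HU_MAX) for v in values]

def fit_scaler(train_values):
    """Estimate mean and SD on clipped TRAINING intensities only."""
    clipped = clip_hu(train_values)
    return mean(clipped), pstdev(clipped)

def apply_scaler(values, mu, sigma):
    """Clip, then standardize with the frozen training statistics (assumed z-score)."""
    return [(v - mu) / sigma for v in clip_hu(values)]

# Toy intensities, not study data.
train_voxels = [-1200.0, -800.0, 0.0, 40.0, 600.0]
test_voxels = [-1000.0, 400.0, 100.0]

mu, sigma = fit_scaler(train_voxels)                 # fit on training data
test_scaled = apply_scaler(test_voxels, mu, sigma)   # reuse unchanged on test data
```

The key design point is that `fit_scaler` never sees validation or test intensities, so no information from held-out patients can leak into the normalization.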

Radiomics feature extraction and selection

Radiomic features (n = 1125) were computed from each segmented volume with an Imaging Biomarker Standardisation Initiative (IBSI)-conformant implementation on the Darwin AI Research Platform, covering first-order statistics, shape descriptors, gray-level co-occurrence matrix, gray-level run-length matrix, and wavelet decompositions. Feature stability was systematically evaluated to ensure reliability across both spatial and temporal variations. In addition to perturbation-based robustness testing (mask dilation/erosion of ± 3 voxels and intensity shifts of ± 5 HU), we assessed longitudinal reproducibility by computing intraclass correlation coefficients (ICC(2,1)) across available serial follow-up scans. Features demonstrating ICC ≥ 0.80 under both perturbation and temporal repeatability conditions were retained for downstream analysis.

The retained features were further reduced through a two-step selection: (1) minimum redundancy-maximum relevance (mRMR) ranking to retain the top 200 features; and (2) recursive feature elimination (RFE) embedded within a five-fold cross-validated random forest classifier to select the final 30 radiomic features most predictive of 12-month recurrence. To avoid data leakage, feature selection was nested within each training fold: mRMR ranking and RFE were performed inside the inner loop of a nested five-fold cross-validation scheme before evaluation on the held-out fold, so that feature importance estimates were not biased by held-out data. All 30 final features and their PyRadiomics tags (first-order, GLCM, GLRLM, GLSZM, shape, wavelet, and LoG families) are listed in Supplementary Table S1 for reproducibility.
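As an illustration of the stability criterion, the sketch below computes ICC(2,1) (two-way random effects, absolute agreement, single measurement) from the standard mean-square decomposition. The function name and the toy matrix are hypothetical; in the setting above, each row would hold one case and each column one repeated measurement of the same feature (for example original, dilated, and eroded masks).

```python
def icc_2_1(ratings):
    """ICC(2,1) for an n x k table: n cases (rows), k repeated measurements (columns)."""
    n, k = len(ratings), len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # between-case
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # between-measurement
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))

    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )

# Toy table: 6 cases x 4 repeated measurements (illustrative values only).
toy_table = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
toy_icc = icc_2_1(toy_table)
feature_is_stable = toy_icc >= 0.80  # retention threshold stated in the text
```

In this toy table the columns differ systematically, so the absolute-agreement ICC is low and the feature would be discarded under the 0.80 threshold.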

Deep learning feature extraction

Deep learning features were derived from a 3D-ResNet18 architecture pre-trained on a public thoracic CT corpus (LIDC/LUNA). The first three convolutional blocks were frozen for 10 epochs to preserve low-level representations, after which all layers were unfrozen and fine-tuned end-to-end using a focal loss (γ = 2), which places higher weight on hard-to-classify recurrence cases. This two-stage strategy balanced stability and adaptation to our limited dataset. For each patient and follow-up time point, we extracted a 64 × 64 × 64 voxel patch centered on the lesion. Data augmentations (random rotations of ± 15°, scaling of 0.9–1.1, and Gaussian noise with σ = 0.01) were applied during training to improve generalization. The network’s penultimate layer (512-dimensional) served as the deep feature representation for each time point.

Fine-tuning employed the Adam optimizer (initial learning rate = 1e-4, weight decay = 1e-5, batch size = 8, mixed-precision training enabled) with early stopping based on validation-set AUC: training was halted if no gain greater than 0.002 was observed over 10 consecutive epochs. Model development followed a three-stage scheme: (1) five-fold internal cross-validation within the training set for hyperparameter tuning, (2) validation-based early stopping during each fold, and (3) independent testing on a held-out test set. Bayesian optimization (30 trials) was applied within the cross-validation loop to tune the learning rate (1 × 10⁻⁵ to 3 × 10⁻⁴), weight decay (1 × 10⁻⁶ to 1 × 10⁻³), Transformer depth (2–4), attention heads (4–8), and gating dropout (0–0.3). The final configuration was chosen according to the highest mean validation AUC across folds, and this model was retrained on the entire training set before independent evaluation. All experiments were conducted on an NVIDIA RTX 3090 GPU (CUDA 11.8, cuDNN 8.9, PyTorch 1.13).
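To make the role of the focusing parameter concrete, here is a minimal per-sample sketch of the binary focal loss with γ = 2. It uses the standard form −(1 − p_t)^γ · log(p_t) without a class-balancing α term, which is an assumption, since the text reports only γ; the function name is hypothetical.

```python
import math

def binary_focal_loss(p, y, gamma=2.0, eps=1e-12):
    """Focal loss for one sample: -(1 - p_t)**gamma * log(p_t).

    p is the predicted recurrence probability, y the true label (0 or 1),
    and p_t the probability the model assigned to the true class.
    """
    p_t = p if y == 1 else 1.0 - p
    p_t = min(max(p_t, eps), 1.0)  # guard against log(0)
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

easy_case = binary_focal_loss(0.9, 1)  # confident and correct: strongly down-weighted
hard_case = binary_focal_loss(0.1, 1)  # confident and wrong: keeps most of its loss
plain_ce = binary_focal_loss(0.9, 1, gamma=0.0)  # gamma = 0 recovers cross-entropy
```

Comparing `easy_case` with `plain_ce` shows the effect: the well-classified sample contributes roughly one hundredth of its cross-entropy loss, so training gradients concentrate on the hard cases.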

Spatiotemporal transformer modeling

To capture longitudinal changes, we fed the sequence of 512-dimensional deep features (baseline, 1, 3, and 6 months in the primary model) into a Transformer encoder consisting of four layers, each with eight multi-head self-attention heads and a hidden dimension of 512. To handle non-uniform temporal intervals, we implemented a continuous-time positional encoding formulated as:

[Equation 1: continuous-time positional encoding]

where Δt denotes the time offset (in months) and ω_k the frequency of the k-th harmonic component. This Fourier-based encoding maps irregular follow-up intervals (0, 1, 3, and 6 months) into a continuous latent space, enabling the attention mechanism to model both short- and long-term dependencies. A learnable linear term and bias were added to capture non-periodic trends in lesion evolution. The Transformer captured temporal dependencies among serial CT features, generating a spatiotemporal embedding that summarizes lesion evolution across follow-up intervals. Alternative recurrent architectures (LSTM, GRU) were benchmarked but yielded lower accuracy (AUC ≤ 0.905) than the proposed continuous-time Transformer (AUC = 0.929).
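One plausible reading of a Fourier-based continuous-time encoding is sketched below. It is not the authors' formula: the doubling frequency ladder, the 24-month maximum period, the number of harmonics, and the way the linear trend is added are all assumptions made for illustration, and `slope` and `bias` are fixed stand-ins for parameters that would be learned.

```python
import math

def continuous_time_encoding(t_months, n_harmonics=4, max_period=24.0,
                             slope=0.0, bias=0.0):
    """Map a time offset in months to sine/cosine features plus a linear trend."""
    features = []
    for k in range(n_harmonics):
        # Assumed frequency ladder: each harmonic doubles the previous frequency.
        omega_k = 2.0 * math.pi * (2 ** k) / max_period
        features.append(math.sin(omega_k * t_months))
        features.append(math.cos(omega_k * t_months))
    trend = slope * t_months + bias  # non-periodic component (learned in practice)
    return [f + trend for f in features]

follow_up_months = [0.0, 1.0, 3.0, 6.0]  # irregular intervals from the text
encodings = [continuous_time_encoding(t) for t in follow_up_months]
```

Because the encoding is a function of the actual elapsed time rather than of the scan index, the 1-to-3-month gap and the 3-to-6-month gap produce different offsets, which is what lets attention account for unequal spacing.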

Adaptive multimodal fusion architecture

To effectively integrate complementary information from handcrafted radiomics and deep learning representations, we designed a gated cross-attention fusion module that enables bidirectional interaction between two imaging-derived modalities: (i) a 512-dimensional deep spatiotemporal embedding generated by the continuous-time Transformer from longitudinal CT features, and (ii) a 30-dimensional radiomic signature derived from IBSI-conformant radiomics. Radiomic features were extracted at each available time point using identical tumor/ablation-zone segmentations and were then aggregated via time-aware mean pooling to obtain a patient-level radiomic vector. Deep features were extracted per time point using a 3D-ResNet18 backbone and subsequently summarized by the temporal Transformer to form a patient-level deep embedding. Importantly, fusion is performed after temporal encoding/pooling in each branch, and the gating operation is applied within the cross-attention block to modulate cross-modal messages. The overall framework is shown in Fig. 2 and Supplementary Figure S1.

Fig. 2.

Model Framework Diagram

Cross-attention mechanism. For each modality m (deep embedding or radiomic signature), we project the input feature vector x_m into query Q_m, key K_m, and value V_m spaces of dimension 128:

[Equation 2: linear query, key, and value projections]

where W_m^Q, W_m^K, and W_m^V are learnable weight matrices and b_m^Q, b_m^K, and b_m^V are bias terms. Cross-attention is then computed bidirectionally:

[Equation 3: bidirectional cross-attention]

This allows each modality to attend to and be modulated by the other, capturing complementary information.

To adaptively control the contribution of each modality, we introduce learnable gating coefficients g_deep and g_rad, initialized to 1.0 and constrained via sigmoid:

[Equation 4: sigmoid-gated cross-attention outputs]

The gated outputs h_deep and h_rad (each 128-dimensional) are concatenated into a 256-dimensional vector and fed into a two-layer feed-forward network for joint feature integration:

[Equation 5: two-layer feed-forward fusion]

A sigmoid-activated classification head with hidden size 128 → 1 is applied to the fused representation z to predict the probability of 12-month recurrence.
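The fusion steps described in this subsection can be sketched in plain Python as follows. This is an illustrative reading, not the authors' implementation: the scaled dot-product form of attention, the scalar gates applied multiplicatively, the random stand-in weights, and all function and variable names are assumptions; only the dimensions (512, 30, 128, 256) and the gate initialization of 1.0 come from the text.

```python
import math
import random

D_DEEP, D_RAD, D_ATT = 512, 30, 128  # dimensions stated in the text

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def init_linear(rng, d_in, d_out):
    """Random weights (one row per output) and zero biases; stand-ins for learned values."""
    bound = 1.0 / math.sqrt(d_in)
    weight = [[rng.uniform(-bound, bound) for _ in range(d_in)] for _ in range(d_out)]
    return weight, [0.0] * d_out

def linear(x, weight, bias):
    """Affine map y = W x + b."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]

def cross_attend(query, keys, values):
    """Scaled dot-product attention of one query over the other modality's tokens."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    weights = [e / sum(exps) for e in exps]
    dim = len(values[0])
    message = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return message, weights

rng = random.Random(42)
x_deep = [rng.gauss(0.0, 1.0) for _ in range(D_DEEP)]  # toy deep embedding
x_rad = [rng.gauss(0.0, 1.0) for _ in range(D_RAD)]    # toy radiomic signature

# Separate query/key/value projections for each modality.
params = {
    (mod, role): init_linear(rng, d_in, D_ATT)
    for mod, d_in in (("deep", D_DEEP), ("rad", D_RAD))
    for role in ("q", "k", "v")
}
q_deep, k_deep, v_deep = (linear(x_deep, *params[("deep", r)]) for r in "qkv")
q_rad, k_rad, v_rad = (linear(x_rad, *params[("rad", r)]) for r in "qkv")

# Bidirectional cross-attention. With a single token per modality the softmax
# weight is trivially 1, so each message equals the other modality's value vector.
msg_deep, w_deep = cross_attend(q_deep, [k_rad], [v_rad])
msg_rad, w_rad = cross_attend(q_rad, [k_deep], [v_deep])

# Scalar gates: raw parameters start at 1.0 and pass through a sigmoid.
g_deep_raw, g_rad_raw = 1.0, 1.0
h_deep = [sigmoid(g_deep_raw) * m for m in msg_deep]
h_rad = [sigmoid(g_rad_raw) * m for m in msg_rad]

fused_input = h_deep + h_rad  # list concatenation: 128 + 128 = 256 values
```

In this sketch the gate starts at sigmoid(1.0), about 0.73, so both modalities contribute from the first training step and the optimizer can then raise or lower each contribution. The two-layer feed-forward network and the classification head are omitted because their activation functions are not specified in the text.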

Statistical analysis

All statistical analyses were conducted in Python (v3.8) using scikit-learn (v0.24) and R (v4.0.3) with the pROC and rms packages. Continuous variables are presented as mean ± standard deviation (SD) or median (interquartile range, IQR) depending on distributional normality assessed by the Shapiro-Wilk test. Between-group comparisons for continuous variables were performed with Student’s t-test (normal data) or the Mann-Whitney U test (non-normal data). Categorical variables are reported as counts (percentages) and compared using Pearson’s chi-square test or Fisher’s exact test, as appropriate. For all discriminative metrics (AUC, PR-AUC, sensitivity, specificity, PPV, NPV), 95% confidence intervals were estimated using a non-parametric bootstrap procedure with 2,000 resamples on the independent test set, preserving the outcome distribution within each resample. The 2.5th and 97.5th percentiles of the bootstrap distribution were used as the CI bounds. Pairwise AUC comparisons were performed using DeLong’s test (two-sided). Calibration performance was quantified by the Brier score and Expected Calibration Error (ECE), with 95% CIs also obtained via 2,000-iteration bootstrapping. Because multiple pairwise model comparisons were conducted (e.g., DLR vs. DL-only; DLR vs. Radiomics-only; DL-only vs. Radiomics-only), all p-values were adjusted for multiple testing using the Benjamini–Hochberg false-discovery-rate (FDR) procedure, and a corrected q < 0.05 was considered statistically significant.
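The bootstrap procedure for the AUC interval can be sketched as below: positives and negatives are resampled separately so that every resample keeps the original outcome distribution, and the interval is read from the 2.5th and 97.5th percentiles. The function names and the synthetic scores are hypothetical, and the simple index-based percentile is an approximation rather than the exact routine used in the study.

```python
import random

def auc(labels, scores):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def stratified_bootstrap_ci(labels, scores, n_boot=2000, seed=42):
    """Percentile 95% CI for the AUC, resampling within each outcome class."""
    rng = random.Random(seed)
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    stats = []
    for _ in range(n_boot):
        boot_pos = rng.choices(pos, k=len(pos))  # sample with replacement
        boot_neg = rng.choices(neg, k=len(neg))
        boot_labels = [1] * len(boot_pos) + [0] * len(boot_neg)
        stats.append(auc(boot_labels, boot_pos + boot_neg))
    stats.sort()
    lower = stats[int(0.025 * n_boot)]
    upper = stats[int(0.975 * n_boot) - 1]
    return lower, upper

# Synthetic scores with the test-set class sizes (20 recurrences, 35 non-recurrences).
toy_rng = random.Random(0)
toy_labels = [1] * 20 + [0] * 35
toy_scores = ([toy_rng.gauss(0.7, 0.15) for _ in range(20)]
              + [toy_rng.gauss(0.4, 0.15) for _ in range(35)])

point_estimate = auc(toy_labels, toy_scores)
ci_low, ci_high = stratified_bootstrap_ci(toy_labels, toy_scores)
```

With only 55 test patients, the width of the resulting interval is driven mostly by the small number of positives, which is why resampling within each class matters.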

Results

Patient characteristics

A total of 184 patients (mean age 65.1 ± 8.9 years; 93 males) were included, with 42 (32.6%) experiencing local recurrence within 12 months. Baseline demographics and tumor characteristics were balanced between training (n = 129) and test (n = 55) cohorts (all p > 0.10). In the test set, 20/55 patients (36.4%) recurred by 12 months. There were no significant differences in age, sex, smoking status, tumor size, or histological subtype between recurrence and non-recurrence groups (Table 1).

Table 1.

The clinical data of enrolled patients

Variable Recurrence (n = 20) No Recurrence (n = 35) P value
Age, years (mean ± SD) 65.4 ± 9.1 64.3 ± 8.7 0.65
Sex, n (%) 0.82
Male 12 (60.0) 20 (57.1)
Female 8 (40.0) 15 (42.9)
Smoking status, n (%) 0.90
Current or former smoker 13 (65.0) 22 (62.9)
Never smoker 7 (35.0) 13 (37.1)
Tumor size, cm (mean ± SD) 3.4 ± 1.1 3.2 ± 1.0 0.45
Histological subtype, n (%) 0.95
Adenocarcinoma 12 (60.0) 22 (62.9)
Squamous cell carcinoma 6 (30.0) 9 (25.7)
Other histologies 2 (10.0) 4 (11.4)

Radiomics-only and deep-learning-only model performance

We first evaluated the predictive performance of the radiomics-only and deep-learning-only models on the test set (n = 55). The radiomics model, incorporating the 30 selected features, showed moderate discriminative ability: AUC = 0.786 (95% CI: 0.698–0.874), PR-AUC = 0.742 (95% CI: 0.656–0.823), Brier = 0.181 (95% CI: 0.142–0.221), ECE = 0.074 (95% CI: 0.050–0.100), Sensitivity = 0.82 (95% CI: 0.680–0.920), Specificity = 0.582 (95% CI: 0.400–0.750), PPV = 0.75 (95% CI: 0.58–0.88), NPV = 0.865 (95% CI: 0.740–0.950). The deep-learning model (a 3D-ResNet architecture with the temporal Transformer modules removed) performed better: test-set AUC = 0.861 (95% CI: 0.789–0.933), PR-AUC = 0.812 (95% CI: 0.726–0.884), Brier = 0.152 (95% CI: 0.118–0.188), ECE = 0.058 (95% CI: 0.032–0.085), Sensitivity = 0.88 (95% CI: 0.75–0.95), Specificity = 0.745 (95% CI: 0.580–0.880), PPV = 0.774 (95% CI: 0.620–0.890), NPV = 0.891 (95% CI: 0.770–0.970). The quantitative metrics are summarized in Table 2 and Fig. 3. To complement the single test-set evaluation and assess performance variability, five-fold cross-validation results (mean ± SD) for all models are additionally reported in Supplementary Tables S2 and S3, together with paired statistical comparisons.

Table 2.

The performance comparison of radiomics-only and deep-learning-only models

Model AUC (95% CI) PR-AUC (95% CI) Sensitivity (95% CI) Specificity (95% CI) PPV (95% CI) NPV (95% CI) Accuracy Brier (95% CI) ECE (95% CI)
Radiomics-Only (Logistic regression) 0.786 (0.698–0.874) 0.742 (0.656–0.823) 0.82 (0.68–0.92) 0.582 (0.400–0.750) 0.75 (0.58–0.88) 0.865 (0.74–0.950) 0.741 0.181 (0.142–0.221) 0.074 (0.05–0.10)
Deep-Learning-Only (3D-ResNet18) 0.861 (0.789–0.933) 0.812 (0.726–0.884) 0.88 (0.75–0.95) 0.745 (0.580–0.880) 0.774 (0.62–0.890) 0.891 (0.77–0.970) 0.824 0.152 (0.118–0.188) 0.058 (0.032–0.085)

Fig. 3.

Comparison of performance of radiomics-only and deep-learning-only models

Performance of Deep Learning-Radiomics fusion (DLR) model

The proposed Deep Learning-Radiomics (DLR) fusion model demonstrated significantly enhanced predictive performance in the test cohort (n = 55). In the primary forecasting task (0–6 → 12 months), it achieved AUC = 0.929 (95% CI: 0.884–0.975), PR-AUC = 0.879 (95% CI: 0.804–0.941), Brier = 0.172 (95% CI: 0.132–0.212), and ECE = 0.061 (95% CI: 0.028–0.102) on the independent test set, significantly outperforming the radiomics-only and DL-only baselines (Table 3). In contrast, the deep-learning-only model (3D-ResNet18 without temporal Transformer) achieved AUC = 0.861 (95% CI: 0.789–0.933), PR-AUC = 0.812 (95% CI: 0.726–0.884), Brier = 0.176 (95% CI: 0.135–0.214), and ECE = 0.069 (95% CI: 0.031–0.104). A supplementary diagnostic analysis including the 12-month CT as an additional input achieved AUC = 0.939 (ΔAUC = +0.010 vs. the forecasting model), suggesting that most predictive information is already captured within the first six months. This diagnostic variant, however, is not intended for clinical application. ROC and Precision-Recall curves with 95% bootstrap confidence bands are provided (Fig. 4 and Figure S2), together with quantile-binned calibration plots (Brier/ECE; Figure S3) and decision-curve analysis illustrating net benefit across thresholds 0.05–0.8 (Figure S4).

Table 3.

Performance comparison between the independent models and the fusion model

Model AUC (95% CI) PR-AUC (95% CI) Brier (95% CI) ECE (95% CI) Sensitivity (95% CI) Specificity (95% CI) PPV (95% CI) NPV (95% CI) Accuracy
Radiomics-Only (Logistic regression) 0.786 [0.698–0.874] 0.742 [0.656–0.823] 0.181 [0.142–0.221] 0.074 [0.033–0.116] 0.906 [0.720–0.980] 0.582 [0.400–0.750] 0.676 [0.500–0.820] 0.865 [0.740–0.950] 0.741
Deep-Learning-Only (3D-ResNet18) 0.861 [0.789–0.933] 0.812 [0.726–0.884] 0.176 [0.135–0.214] 0.069 [0.031–0.104] 0.906 [0.780–0.970] 0.745 [0.610–0.860] 0.774 [0.640–0.880] 0.891 [0.800–0.950] 0.824
Deep Learning-Radiomics (DLR) fusion model 0.929 [0.884–0.975] 0.879 [0.804–0.941] 0.172 [0.132–0.212] 0.061 [0.028–0.102] 0.868 [0.760–0.940] 0.833 [0.720–0.920] 0.836 [0.740–0.910] 0.865 [0.780–0.930] 0.850

Fig. 4.

Performance comparison between the independent models and the fusion model

Interpretability analyses were performed using SHAP value decomposition for radiomic features (Figure S5), highlighting that enhancement heterogeneity and texture-related descriptors during the 3–6-month interval contributed most to recurrence probability. Notably, texture features dominated the global SHAP ranking: gray-level co-occurrence matrix (GLCM) entropy and dissimilarity, gray-level run-length matrix (GLRLM) long-run emphasis, and gray-level size zone matrix (GLSZM) zone entropy exhibited the largest mean |SHAP| values, followed by shape descriptors (sphericity, elongation) and first-order statistics (kurtosis, skewness); wavelet-derived texture metrics (e.g., wavelet-HLL GLCM correlation and wavelet-LHH GLDM dependence non-uniformity) provided additional multi-scale heterogeneity information. These findings support a clinically interpretable rationale that recurrence risk is primarily encoded in subtle spatial heterogeneity of post-ablation enhancement (potentially reflecting patchy residual viable tissue/perfusion at the ablation margin, heterogeneous necrosis, and evolving fibrosis/inflammatory change) that may be visually equivocal but becomes more distinguishable on serial CECT, particularly between 3 and 6 months.

To facilitate dataset-level visualization and qualitative assessment, we present representative longitudinal contrast-enhanced CT (CECT) examples in Fig. 5. Figure 5 shows illustrative axial CECT crops at baseline (pre-MWA) and follow-up at 1, 3, and 6 months post-MWA for a non-recurrence and a recurrence case, with the lesion segmentation contour overlaid in cyan and a model attention/saliency map superimposed to indicate image regions driving the prediction. The recurrence example demonstrates an emerging nodular enhancing component at the lesion margin during the 3–6-month interval, whereas the non-recurrence example shows progressive contraction and homogenization over time. Supplementary Figures S6 and S7 provide complementary best-case and worst-case interpretability examples with aligned SHAP profiles and temporal attention weights.

Fig. 5.

Representative longitudinal contrast-enhanced CT (CECT) examples (illustrative visualization) at pre-ablation (Ⅰ), immediately after ablation (Ⅱ), 1 month post-ablation (Ⅲ), 3 months post-ablation (Ⅳ), and 6 months post-ablation (Ⅴ). Cases A–C achieved local cure, while cases D–E were subsequently confirmed to have residual lesions

Decision-curve analysis (Figure S4) demonstrated that the DLR fusion model yielded the highest net benefit across clinically relevant low-to-intermediate threshold probabilities (0.10–0.40), outperforming both unimodal models as well as the “treat-all” and “treat-none” strategies. This result supports the model’s utility for early post-ablation risk stratification beyond discriminative metrics alone. Corresponding five-fold cross-validation results (mean ± SD), together with Benjamini-Hochberg corrected paired statistical tests against both unimodal baselines, are reported in Supplementary Table S3.
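For readers less familiar with decision-curve analysis, the sketch below computes net benefit at a given threshold probability using the standard definition, TP/n − (FP/n) · p_t/(1 − p_t), together with the "treat-all" reference. The function names and the four-patient toy example are hypothetical and are not study data.

```python
def net_benefit(labels, probs, threshold):
    """Net benefit of acting on every patient whose predicted risk >= threshold."""
    n = len(labels)
    tp = sum(1 for y, p in zip(labels, probs) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(labels, probs) if p >= threshold and y == 0)
    return tp / n - (fp / n) * threshold / (1.0 - threshold)

def net_benefit_treat_all(labels, threshold):
    """Reference strategy: act on everyone regardless of predicted risk."""
    prevalence = sum(labels) / len(labels)
    return prevalence - (1.0 - prevalence) * threshold / (1.0 - threshold)

# Toy example: two recurrences, two non-recurrences.
toy_labels = [1, 1, 0, 0]
toy_probs = [0.9, 0.6, 0.4, 0.1]

nb_model = net_benefit(toy_labels, toy_probs, threshold=0.2)
nb_all = net_benefit_treat_all(toy_labels, threshold=0.2)
nb_none = 0.0  # "treat-none" has zero net benefit by definition
```

A model is useful at a given threshold only if its net benefit exceeds both references, which is the comparison a decision curve plots across the threshold range.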

Five-fold cross-validation

Five-fold cross-validation on the training set (n = 129) yielded stable mean AUC ± SD values: Radiomics = 0.77 ± 0.02, DL-only = 0.84 ± 0.01, DLR fusion = 0.90 ± 0.02. Repeated-measures ANOVA followed by Benjamini-Hochberg-corrected post-hoc tests showed significant differences between models (p < 0.01). No progressive divergence between training and validation fold loss curves was observed, indicating that regularization and early stopping successfully controlled overfitting.
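The Benjamini–Hochberg adjustment applied to the pairwise model comparisons can be written in a few lines. The sketch below uses the standard step-up procedure; the function name and the three example p-values are illustrative and are not results from the study.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values (q-values) in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical p-values for three pairwise model comparisons.
raw_p = [0.01, 0.04, 0.03]
q_values = benjamini_hochberg(raw_p)
significant = [q < 0.05 for q in q_values]
```

Each raw p-value is multiplied by m/rank and then capped by the adjusted value of the next-larger p-value, so the q-values never decrease as the raw p-values increase.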

Calibration and decision-curve analyses

Calibration and decision-curve analyses were performed to further assess the probabilistic reliability and clinical utility of the models (Fig. 6); subgroup analysis results are summarized separately in Table 4. The calibration plot (Fig. 6A) showed that the DLR fusion model achieved the closest alignment between predicted and observed recurrence probabilities, with Brier = 0.172 [0.132–0.212] and ECE = 0.061 [0.028–0.102], compared with the DL-only (Brier = 0.176; ECE = 0.069) and radiomics-only (Brier = 0.181; ECE = 0.074) models. In the decision-curve analysis (Fig. 6B), the DLR model yielded the highest net benefit across threshold probabilities of 0.10–0.40, consistently surpassing both unimodal models and the “treat-all” and “treat-none” strategies, which suggests greater clinical usefulness for individualized post-MWA follow-up. Subgroup analyses (Table 4) showed similar performance across tumor sizes (< 3 cm: AUC 0.924; ≥ 3 cm: AUC 0.936) and histologic types (adenocarcinoma: AUC 0.933; squamous cell carcinoma: AUC 0.921), with no evidence of subgroup-specific degradation.
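The two calibration metrics reported here can be computed as in the following sketch. The Brier score is the mean squared difference between predicted probability and outcome; the ECE shown uses equal-count (quantile) bins, in line with the quantile-binned calibration plots, but the number of bins is an assumption because it is not stated in the text. Function names and toy values are hypothetical.

```python
def brier_score(labels, probs):
    """Mean squared error between predicted probabilities and binary outcomes."""
    return sum((p - y) ** 2 for y, p in zip(labels, probs)) / len(labels)

def expected_calibration_error(labels, probs, n_bins=5):
    """ECE with equal-count bins: weighted mean of |observed rate - mean prediction|."""
    pairs = sorted(zip(probs, labels))  # order patients by predicted risk
    n = len(pairs)
    ece = 0.0
    for b in range(n_bins):
        chunk = pairs[b * n // n_bins:(b + 1) * n // n_bins]
        if not chunk:
            continue  # can happen when there are fewer patients than bins
        mean_pred = sum(p for p, _ in chunk) / len(chunk)
        observed = sum(y for _, y in chunk) / len(chunk)
        ece += len(chunk) / n * abs(observed - mean_pred)
    return ece

# Toy predictions, not study data.
toy_labels = [0, 0, 1, 1]
toy_probs = [0.1, 0.2, 0.8, 0.9]

toy_brier = brier_score(toy_labels, toy_probs)
toy_ece = expected_calibration_error(toy_labels, toy_probs, n_bins=2)
```

The two metrics answer different questions: the Brier score mixes discrimination and calibration, whereas the ECE isolates how far predicted risks sit from observed event rates within each risk stratum.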

Fig. 6.

Calibration and decision-curve analyses of the recurrence prediction models on the independent test set (n = 55). (A) Quantile-binned calibration plot comparing predicted versus observed 12-month local recurrence probabilities for the deep learning–radiomics fusion model (DLR), deep-learning-only model (DL), and radiomics-only model (Rad). Points indicate the mean predicted probability within each bin and the corresponding observed recurrence rate; vertical error bars represent 95% confidence intervals. The diagonal dashed line denotes ideal calibration. The DLR model shows the closest agreement with the ideal line (Brier = 0.172; expected calibration error [ECE] = 0.061), outperforming DL-only (Brier = 0.176; ECE = 0.069) and radiomics-only (Brier = 0.181; ECE = 0.074). (B) Decision-curve analysis (DCA) showing net benefit across threshold probabilities for the three models, with “treat-all” and “treat-none” strategies as references. The DLR model yields the highest net benefit across clinically relevant threshold probabilities (0.10–0.40)

Table 4.

Subgroup performance of the DLR fusion model (Test set n = 55)

Subgroup n AUC (95% CI) PR-AUC (95% CI) Brier ECE
Tumor < 3 cm 27 0.924 [0.871–0.969] 0.872 [0.785–0.935] 0.175 0.058
Tumor ≥ 3 cm 28 0.936 [0.885–0.975] 0.888 [0.803–0.942] 0.169 0.064
Adenocarcinoma 31 0.933 [0.880–0.973] 0.889 [0.812–0.943] 0.172 0.059
Squamous cell carcinoma 24 0.921 [0.862–0.964] 0.871 [0.794–0.932] 0.174 0.063

Ablation studies

To evaluate the contribution of each component, we conducted ablation experiments varying Transformer depth, number of attention heads, gating mechanism, radiomic feature classes, and temporal inputs while keeping all other parameters constant (Table 5). Removing the gating mechanism reduced test-set AUC by 0.060 (0.929 → 0.869), supporting the benefit of context-aware fusion over simple concatenation. Reducing Transformer depth from 4 to 2 layers decreased AUC by 0.025, and using fewer attention heads (4 vs. 8) reduced it by 0.015. Shape and texture features were the most influential radiomic components, with AUC losses of 0.018 and 0.022 when excluded, respectively. The model trained with only baseline + 1 month CT achieved an AUC of 0.821; adding the 3-month scan improved this to 0.876; and using all four time points (0–6 months) yielded the best AUC of 0.929. These results suggest that useful forecasting accuracy can still be achieved with partial follow-up, while complete temporal information provides maximal predictive value. SHAP analysis (Figure S5) further indicated that texture-related radiomic features and the 3–6-month interval contributed most to recurrence prediction.

Table 5.

Ablation study results on the independent test set (n = 55)

| Configuration | Description | Test AUC (95% CI) | ΔAUC vs. Full Model |
|---|---|---|---|
| Full DLR Fusion | 4 layers, 8 heads, gated fusion (0–6 mo) | 0.929 [0.884–0.975] | (reference) |
| No Gating | Simple concatenation | 0.869 [0.816–0.924] | −0.060 |
| 2 Transformer Layers | Reduced temporal depth | 0.904 [0.851–0.951] | −0.025 |
| 4 Layers, 4 Heads | Fewer attention heads | 0.914 [0.860–0.957] | −0.015 |
| Remove Shape Features | Without shape family | 0.911 [0.854–0.956] | −0.018 |
| Remove Texture Features | Without GLCM + GLRLM | 0.907 [0.849–0.951] | −0.022 |
| Baseline + 1 mo | Early follow-up only | 0.821 [0.736–0.893] | −0.108 |
| Baseline + 1 + 3 mo | Intermediate follow-up | 0.876 [0.795–0.940] | −0.053 |
| Baseline + 1 + 3 + 6 mo (Primary) | Full sequence | 0.929 [0.884–0.975] | (reference) |
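The ΔAUC column of Table 5 is the difference between each configuration's point estimate and the full model's AUC of 0.929. The short check below recomputes it from the tabulated values and ranks the ablations by impact; the dictionary layout is only an illustrative convenience.

```python
FULL_AUC = 0.929  # full DLR fusion model, Table 5

# Point estimates copied from Table 5.
ablation_auc = {
    "No Gating": 0.869,
    "2 Transformer Layers": 0.904,
    "4 Layers, 4 Heads": 0.914,
    "Remove Shape Features": 0.911,
    "Remove Texture Features": 0.907,
    "Baseline + 1 mo": 0.821,
    "Baseline + 1 + 3 mo": 0.876,
}

# Difference from the full model, rounded to the table's precision.
deltas = {name: round(a - FULL_AUC, 3) for name, a in ablation_auc.items()}

# Largest loss first.
for name, d in sorted(deltas.items(), key=lambda kv: kv[1]):
    print(f"{name:<24s} {d:+.3f}")
```

Note that these are differences between point estimates on 55 test cases; several of the tabulated 95% intervals overlap the full model's interval, and overlapping marginal intervals do not by themselves establish or rule out a significant paired difference.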

To evaluate whether compressing the 512-dimensional spatiotemporal embeddings and the 30-dimensional radiomic features into 128-dimensional Q/K/V spaces led to information loss, we conducted quantitative and performance-based analyses (Fig. 7). The Pearson correlation between original and projected embeddings averaged 0.982 ± 0.006, indicating a strong linear association (r² ≈ 0.96). Consistent with this, a linear reconstruction probe recovered 96% of the original variance, with a mean-square error (MSE) of 0.042 ± 0.008, indicating negligible distortion. Downstream predictive performance also remained stable across projection widths: AUC = 0.925 for 64-D, 0.929 for 128-D, and 0.931 for 256-D. The marginal (< 0.01) difference in AUC indicated that the 128-D setting maintained nearly complete representational fidelity while improving training efficiency and convergence stability. These findings support the adopted 128-dimensional Q/K/V configuration as a balanced trade-off between computational efficiency and expressive capacity, with no meaningful information loss in the multimodal attention mechanism.
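When reading the correlation and variance figures together, it helps to recall that for a one-variable least-squares fit the fraction of variance recovered equals the square of the Pearson correlation, so r = 0.982 corresponds to roughly 96% of variance. The sketch below demonstrates this identity on synthetic one-dimensional data; the data, noise level, and function names are assumptions for illustration and do not reproduce the study's multi-dimensional probe.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def explained_variance_linear(x, y):
    """Fraction of variance in y recovered by the least-squares line y ~ a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    intercept = my - slope * mx
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - my) ** 2 for b in y)
    return 1 - ss_res / ss_tot


# Synthetic stand-ins: an "original" coordinate and a noisy "projected" copy.
rng = random.Random(0)
original = [rng.gauss(0, 1) for _ in range(500)]
projected = [v + rng.gauss(0, 0.2) for v in original]

r = pearson_r(original, projected)
r2 = explained_variance_linear(projected, original)
print(f"r = {r:.3f}, r^2 = {r * r:.3f}, variance recovered by linear probe = {r2:.3f}")

# Variance share implied by the correlation reported in the text.
implied = round(0.982 ** 2, 3)
print(implied)
```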

Fig. 7.

Information retention and model performance across projection dimensions. (A) Pearson correlation between the original 512-dimensional spatiotemporal embeddings (and 30-dimensional radiomic vectors) and their projected representations for projection widths of 64, 128, and 256. (B) Boxplots of mean-square reconstruction error (MSE) for the same projection settings, demonstrating minimal distortion introduced by the projection layer

Discussion

Principal findings

In this single-center cohort of 184 NSCLC patients undergoing microwave ablation (MWA), our deep learning-radiomics (DLR) fusion model achieved an AUC of 0.92 (95% CI: 0.89–0.95) for 12-month local recurrence prediction, significantly outperforming both the radiomics-only (AUC = 0.78) and deep-learning-only (AUC = 0.85) models. At the optimal Youden threshold, the DLR approach attained sensitivity of 0.89 and specificity of 0.86. Decision-curve analysis demonstrated greater net benefit across a wide range of threshold probabilities compared with standalone algorithms and “treat-all”/“treat-none” strategies.
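The Youden threshold mentioned above is the cut-off that maximizes J = sensitivity + specificity − 1. A minimal sketch of that selection follows; the function names and toy scores are hypothetical, and only the final arithmetic uses the sensitivity and specificity reported in the text.

```python
def sens_spec(y_true, y_prob, threshold):
    """Sensitivity and specificity when risk >= threshold is called positive."""
    tp = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p >= threshold)
    tn = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p < threshold)
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    return tp / n_pos, tn / n_neg


def youden_threshold(y_true, y_prob):
    """Return (threshold, J) maximizing J = sensitivity + specificity - 1."""
    best_t, best_j = None, -1.0
    for t in sorted(set(y_prob)):
        sens, spec = sens_spec(y_true, y_prob, t)
        j = sens + spec - 1
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j


# Hypothetical toy scores (not study data): 1 = recurrence.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_prob = [0.10, 0.20, 0.25, 0.30, 0.40, 0.70, 0.35, 0.60, 0.80, 0.90]

t_star, j_star = youden_threshold(y_true, y_prob)
print(f"Youden threshold = {t_star}, J = {j_star:.3f}")

# J implied by the operating point reported in the text (0.89, 0.86).
reported_j = round(0.89 + 0.86 - 1, 2)
print(f"Reported operating point: J = {reported_j}")
```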

Importantly, this prediction task is prognostic rather than diagnostic: the model forecasts 12-month recurrence using imaging data available up to 6 months post-MWA, enabling early identification of patients at risk well before routine one-year evaluation. A supplementary diagnostic variant using the 12-month scan as input achieved a slightly higher AUC (0.939) but, because it relies on data contemporaneous with outcome adjudication, it has no clinical forecasting value and serves only as an upper-bound reference.

Synergistic fusion of radiomic and deep features

Handcrafted radiomic features capture well-established texture and shape biomarkers, but are limited by static, single-time-point analysis and reproducibility concerns [10, 11]. In our fusion model, global SHAP analysis corroborated the primacy of texture heterogeneity, with co-occurrence/run-length/zone-based metrics (e.g., GLCM entropy/dissimilarity, GLRLM long-run emphasis, and GLSZM zone entropy) contributing most among radiomic predictors (Supplementary Figure S5). This is biologically plausible because heterogeneous enhancement after ablation may arise from uneven necrosis, residual micro-perfused viable tissue at the margins, or heterogeneous healing responses; explicitly encoding these patterns with texture features therefore provides complementary and interpretable cues beyond mean intensity or gross morphology alone. Deep convolutional networks learn hierarchical, data-driven representations that enhance tumor characterization beyond human-designed metrics [12]. In our architecture, the gated cross-attention fusion module enables bidirectional interaction between radiomics and deep representations, allowing each modality to modulate the other through learnable gating coefficients. In ablation experiments, removing the gating mechanism reduced test-set AUC by ≈ 0.06 (ΔAUC = −0.060; 0.929 → 0.869), confirming its quantitative benefit over simple concatenation. As detailed in Supplementary Table S3, this degradation reflects the failure of naive feature concatenation to model cross-modal dependencies. In contrast, the proposed fusion mechanism performs context-aware feature integration, whereby radiomic descriptors selectively influence deep spatiotemporal embeddings (and vice versa) through learned attention and gating operations. This selective modulation preserves spatial–radiomic coherence, suppresses irrelevant modality-specific noise, and emphasizes temporally informative regions, yielding a statistically significant improvement in discriminative performance (p < 0.01).
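To make the idea of bidirectional gated cross-attention concrete, the following is a minimal sketch, not the authors' implementation. It assumes both modalities are already embedded in a shared dimension, omits the learned Q/K/V projections, uses random stand-in weights, and blends each token with its cross-modal context through a scalar sigmoid gate; all names and sizes are hypothetical.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys_values):
    """Scaled dot-product attention; keys and values share one matrix here."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys_values]
        w = softmax(scores)
        out.append([sum(wj * kv[c] for wj, kv in zip(w, keys_values)) for c in range(d)])
    return out


def gated_fuse(own, attended, gate_w, gate_b):
    """Blend each token with its cross-modal context via a scalar sigmoid gate."""
    fused = []
    for x, a in zip(own, attended):
        # The gate sees the token and its context concatenated (length 2*d).
        z = sum(w * v for w, v in zip(gate_w, x + a)) + gate_b
        g = 1.0 / (1.0 + math.exp(-z))
        fused.append([(1 - g) * xi + g * ai for xi, ai in zip(x, a)])
    return fused


rng = random.Random(0)
d = 8
deep = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(4)]      # e.g. 4 time-point tokens
radiomic = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(3)]  # e.g. 3 feature-group tokens
w_deep = [rng.gauss(0, 0.1) for _ in range(2 * d)]  # stand-ins for learned gate weights
w_rad = [rng.gauss(0, 0.1) for _ in range(2 * d)]

# Bidirectional: deep tokens attend to radiomics, and radiomics attend to deep tokens.
deep_fused = gated_fuse(deep, cross_attend(deep, radiomic), w_deep, 0.0)
rad_fused = gated_fuse(radiomic, cross_attend(radiomic, deep), w_rad, 0.0)

# A strongly negative gate bias closes the gate, leaving the tokens unchanged.
closed = gated_fuse(deep, cross_attend(deep, radiomic), w_deep, -50.0)
print(len(deep_fused), len(rad_fused), max(abs(a - b) for r1, r2 in zip(closed, deep) for a, b in zip(r1, r2)))
```

In this formulation, plain concatenation corresponds to passing `x + a` forward without any gate, so the model has no learned way to down-weight an uninformative modality for a given token.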

Importance of temporal modeling with transformers

Previous NSCLC recurrence models have largely relied on static imaging or delta-radiomics between two time points [10, 14]. By feeding serial deep features from baseline through 6 months into a multi-layer Transformer encoder [13], our framework explicitly models lesion evolution within the first half-year after ablation, when most biological remodeling occurs. Unlike delta-radiomics—which computes handcrafted feature differences between two predefined time points and therefore captures only discrete, pairwise, and locally linear changes—temporal self-attention learns sequence-wide dependencies across all follow-up visits. This allows the model to identify non-linear temporal patterns and context-dependent evolution without restricting analysis to any specific timepoint pair. This temporal self-attention highlights clinically critical intervals—such as irregular enhancement at 3–6 months—that often precede visible recurrence. The resulting spatiotemporal representation supports genuine early-warning forecasting of 12-month recurrence rather than retrospective diagnosis, reinforcing the prognostic value of the approach.
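The contrast between pairwise deltas and sequence-wide attention can be illustrated with a small sketch. It assumes a sinusoidal encoding of the actual scan month (0, 1, 3, 6) added to each visit's feature vector, followed by single-head self-attention without learned projections; the encoding choice, dimensions, and random features are illustrative assumptions rather than the published architecture.

```python
import math
import random

MONTHS = [0, 1, 3, 6]  # non-uniform follow-up schedule


def time_encoding(month, dim):
    """Sinusoidal encoding of the actual scan month (dim must be even)."""
    enc = []
    for i in range(dim // 2):
        freq = 1.0 / (10000 ** (2 * i / dim))
        enc += [math.sin(month * freq), math.cos(month * freq)]
    return enc


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention over all visits."""
    d = len(tokens[0])
    weights, outputs = [], []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        w = softmax(scores)
        weights.append(w)
        outputs.append([sum(wj * t[c] for wj, t in zip(w, tokens)) for c in range(d)])
    return outputs, weights


rng = random.Random(0)
dim = 8
# Hypothetical per-visit deep features (random stand-ins, one vector per scan).
features = [[rng.gauss(0, 1) for _ in range(dim)] for _ in MONTHS]

# Delta-style summary: one pairwise difference between two chosen time points.
delta_0_6 = [b - a for a, b in zip(features[0], features[3])]

# Attention-style summary: every visit is compared with every other visit.
tokens = [[f + e for f, e in zip(feat, time_encoding(m, dim))]
          for feat, m in zip(features, MONTHS)]
outputs, weights = self_attention(tokens)

for m, row in zip(MONTHS, weights):
    print(f"month {m}: " + " ".join(f"{w:.2f}" for w in row))
```

Each row of `weights` shows how strongly one visit attends to all four visits; with random stand-in features these values carry no clinical meaning, but in a trained model they are what would indicate which intervals drive the prediction.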

Comparison with prior studies

Previous studies have explored radiomics or deep learning independently for recurrence prediction after local ablation, but none fully integrated temporal dynamics and multimodal information. Zhu et al. developed a preoperative 3D-ResNet model using single-time CT images (AUC = 0.85) but ignored post-MWA evolution [8], while Park et al. implemented sequential radiomics RNNs limited to handcrafted features [14]. Our approach differs conceptually by employing bidirectional gated cross-attention to integrate radiomics and deep features, and experimentally by using a continuous-time Transformer that handles non-uniform intervals (0, 1, 3, and 6 months). This allows the model to capture both spatial heterogeneity and dynamic evolution of ablation zones, which are critical indicators of early recurrence. This unified DLR design demonstrates a statistically significant AUC gain (ΔAUC = +0.068, p = 0.002) on the independent test set.

Strengths and limitations

Key strengths include (1) a rigorous longitudinal CT acquisition protocol, (2) robust feature selection (ICC > 0.80) and cross-validation, and (3) a novel gated cross-attention fusion architecture that enables synergistic multimodal learning. Limitations include the retrospective single-center design, the moderate sample size (184 cases), and class imbalance (20 recurrences in 55 test cases), which may inflate performance variance and contribute to wider confidence intervals. The absence of external validation is a critical limitation that currently precludes clinical deployment; a federated, multi-institutional validation across different scanner vendors and protocols has been planned and is underway. We also note that the primary analysis used scans through 6 months to forecast 12-month outcomes, while the 12-month image served solely for ground-truth adjudication.

Clinical implications and future directions

The DLR model offers a data-driven, personalized risk-stratification tool for post-MWA surveillance and adjuvant therapy planning. By accurately identifying high-risk patients, clinicians can tailor follow-up imaging intervals and avoid unnecessary interventions in low-risk cases. While the Youden threshold provides a convenient operating point by maximizing the sum of sensitivity and specificity, it implicitly assumes equal clinical costs for false negatives and false positives—a condition that does not reflect real-world decision-making in the post-ablation setting. A missed early recurrence (false negative) carries substantially greater clinical consequence, potentially delaying salvage therapy and worsening prognosis, whereas a false positive typically results in additional imaging or temporary intensification of follow-up with far lower harm. Consequently, threshold selection in clinical use should prioritize sensitivity and adopt more conservative decision boundaries than those defined by the Youden index. This interpretation is consistent with our decision-curve analysis, which demonstrates greater net benefit for threshold probabilities in the lower to intermediate range, where sensitivity is maintained without disproportionate increases in false positives.
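The argument about unequal error costs can be made concrete by comparing the Youden cut-off with a cut-off that minimizes a weighted count of errors. In the sketch below, the 5:1 cost ratio for false negatives versus false positives is a purely hypothetical choice for illustration, as are the toy scores and function names; an appropriate ratio would have to come from clinical judgment.

```python
def confusion(y_true, y_prob, t):
    """Return (tp, fp, fn, tn) when risk >= t is called positive."""
    tp = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p >= t)
    fp = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p >= t)
    fn = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p < t)
    tn = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p < t)
    return tp, fp, fn, tn


def youden_threshold(y_true, y_prob):
    """Threshold maximizing sensitivity + specificity - 1."""
    def j(t):
        tp, fp, fn, tn = confusion(y_true, y_prob, t)
        return tp / (tp + fn) + tn / (tn + fp) - 1
    return max(sorted(set(y_prob)), key=j)


def min_cost_threshold(y_true, y_prob, cost_fn, cost_fp):
    """Threshold minimizing cost_fn * FN + cost_fp * FP."""
    def cost(t):
        _, fp, fn, _ = confusion(y_true, y_prob, t)
        return cost_fn * fn + cost_fp * fp
    return min(sorted(set(y_prob)), key=cost)


# Hypothetical toy scores (not study data): 1 = recurrence.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
y_prob = [0.10, 0.15, 0.20, 0.30, 0.45, 0.50, 0.25, 0.55, 0.70, 0.80, 0.90]

t_youden = youden_threshold(y_true, y_prob)
t_cost = min_cost_threshold(y_true, y_prob, cost_fn=5, cost_fp=1)

for name, t in (("Youden", t_youden), ("5:1 cost-weighted", t_cost)):
    tp, fp, fn, tn = confusion(y_true, y_prob, t)
    print(f"{name}: threshold {t}, sensitivity {tp / (tp + fn):.2f}, specificity {tn / (tn + fp):.2f}")
```

On these toy data the cost-weighted rule selects a lower threshold than the Youden index, trading specificity for sensitivity, which is the direction of adjustment the paragraph above argues for.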

Conclusion

We developed a deep learning-radiomics fusion model that dynamically integrates longitudinal 3D-ResNet/Transformer embeddings with handcrafted radiomic features via gated cross-attention to predict post-ablation NSCLC recurrence. This unified framework substantially outperforms standalone radiomics, deep learning, and expert review, demonstrating robust calibration and clinically meaningful net benefit across decision thresholds. By modeling lesion evolution over multiple follow-ups and leveraging complementary feature modalities, our approach can guide personalized surveillance and adjuvant therapy after microwave ablation. Future work will focus on prospective multicenter validation—ideally via federated learning—as well as the incorporation of multi-omic data and explainable AI techniques to facilitate clinical adoption.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary Material 1 (620.3KB, docx)
Supplementary Material 2 (871.3KB, docx)

Acknowledgements

We gratefully acknowledge the radiologists who were involved in the study.

Author contributions

Meng Li and Ying Zhang conceptualized the study, curated the imaging and clinical data, conducted the investigation, developed the core methodology, and drafted the original manuscript. Yongzhao Li and Xiangming Wang contributed to data curation, investigation, and critical review of the manuscript. Hui Feng and Yang Li contributed to methodological development, supervision, and funding acquisition, and participated in manuscript revision. Gaofeng Shi provided overall study supervision, led the refinement of the spatiotemporal modeling framework and multimodal fusion strategy, contributed to result interpretation and additional analyses introduced during the major revision, and critically revised the manuscript for important intellectual content. All authors reviewed and approved the final manuscript and agree to be accountable for all aspects of the work.

Funding

This study was supported by the Medical Science Research Project of Hebei (grant no. 20210837).

Data availability

De-identified exemplar CT volumes with segmentation masks, the radiomics feature table, the preprocessing parameter YAML, and inference scripts are available from the corresponding author upon reasonable request for academic use.

Declarations

Ethical approval

The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. This study was conducted in accordance with the Declaration of Helsinki (as revised in 2013). This retrospective study was approved by the local medical ethics committee of the Institutional Review Board of The Fourth Hospital of Hebei Medical University (file number 2025KT165). Clinical trial number: not applicable.

Consent to participate

This study is a retrospective study. All data used in this study was anonymized and did not involve personal privacy or commercial interests. Thus, the need for informed consent from all patients was waived.

Consent to publication

Not applicable.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Contributor Information

Ying Zhang, Email: 18531164315@163.com.

Gaofeng Shi, Email: gaofengs1962@hebmu.edu.cn.

References

  1. Barta JA, Powell CA, Wisnivesky JP. Global epidemiology of lung cancer. Annals Global Health. 2019;85(1):8. doi: 10.5334/aogh.2419.
  2. Simon CJ, Dupuy DE, Mayo-Smith WW. Microwave ablation: principles and applications. Radiographics. 2005;25(Suppl 1):S69–83. doi: 10.1148/rg.25si05517.
  3. Huang T, Li M, He M, Luo J, Ren Y. Safety and efficacy of microwave ablation for primary and metastatic lung tumors: a systematic review and meta-analysis. J Thorac Disease. 2017;9(8):E787–98. doi: 10.21037/jtd.2017.07.02. PMID: 29221344.
  4. de Baere T, Deschamps F, Teriitehau C, et al. Radiofrequency ablation of lung tumors: local response, late complications, and long-term follow-up. Radiology. 2006;239(1):297–306. doi: 10.1148/radiol.2391040927. PMID: 16567491.
  5. Aerts HJWL, Velazquez ER, Leijenaar RTH, Parmar C, Grossmann P, Carvalho S, et al. Decoding tumour phenotype by noninvasive imaging using a quantitative radiomics approach. Nat Commun. 2014;5:4006. doi: 10.1038/ncomms5006.
  6. Zou Y, Ma S, Chen MH, Wang K, Wang T, Yin SS. Changes and significance of HIF-1α and VEGF expression levels in residual tumor tissues of hepatocellular carcinoma treated with radiofrequency ablation. World J Gastroenterol. 2014;20(47):17570–8. doi: 10.3748/wjg.v20.i47.17570.
  7. Isensee F, Jaeger PF, Kohl S, Petersen J, Maier-Hein KH. nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nat Methods. 2021;18(2):203–11. doi: 10.1038/s41592-020-01008-z.
  8. Zhu J, Chen J, Kong D, Cheng Y, Li X. Preoperative CT-based deep-learning model for predicting postoperative recurrence in early-stage non-small cell lung cancer: a multicenter study. Clin Cancer Res. 2022;28(4):637–46. doi: 10.1158/1078-0432.CCR-21-2456. PMID: 34810217.
  9. Ni Y, Fan W, Shen J, Huang P, Zhou Y. Biological responses after microwave ablation of tumors: an Immunologic perspective. J Med Imaging Radiat Oncol. 2021;65(3):315–22. doi: 10.1111/1754-9485.13103.
  10. Aerts HJWL, Velazquez ER, Leijenaar RTH, et al. Decoding tumour phenotype by noninvasive imaging using a quantitative radiomics approach. Nat Commun. 2014;5:4006.
  11. Gillies RJ, Kinahan PE, Hricak H. Radiomics: images are more than pictures, they are data. Radiology. 2016;278(2):563–77.
  12. Chen H, Dou Q, Yu L, Qin J, Heng PA. MILD-net: multi-input channel attention network for lung nodule segmentation on CT. Sci Rep. 2019;9(1):658. PMID: 30679645.
  13. Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need. Adv Neural Inf Process Syst. 2017;30:5998–6008.
  14. Park JE, Kim HS, Park SY, et al. Recurrent neural network for predicting distant metastasis in nasopharyngeal carcinoma using sequential CT radiomic features. Front Oncol. 2020;10:122. PMID: 32117769.

Articles from BMC Medical Imaging are provided here courtesy of BMC
