Abstract
Objectives
To evaluate the impact of tumor volume segmentation variability on the repeatability of radiomic features (RFs) and to determine how RF repeatability influences the generalizability of radiomic models for predicting overall survival (OS) in patients with oropharyngeal carcinoma (OPC).
Methods
We retrospectively analyzed CT images from 1017 patients with oropharyngeal carcinoma across three institutions. Perturbation methods were applied to simulate variations in gross tumor volume segmentation. RFs were extracted from both the original images and Laplacian of Gaussian‐filtered images using different perturbation masks. RF repeatability was quantified using intra‐class correlation coefficients (ICC). Repeatable RFs were progressively incorporated into the modeling process according to different ICC thresholds to assess the influence of feature repeatability on model generalizability.
Results
Incorporation of RFs with ICC values between 0.7 and 0.8 improved the AUC index of the two‐year and three‐year OS models in external validation cohorts. Using an ICC threshold of 0.7, RFs were classified into high‐ and low‐repeatability groups, and OS models were trained and validated using the training, internal testing, and external validation cohorts. Across all cohorts, the OS model trained with high‐repeatability RFs demonstrated significantly superior performance compared to the model trained with low‐repeatability RFs.
Conclusion
The findings demonstrate that selecting RFs with ICC values greater than 0.7 substantially enhances both the generalizability and predictive performance of CT‐based radiomic models for patients with OPC. This study further underscores the importance of considering RF repeatability, particularly in the presence of tumor volume segmentation variability, to improve the robustness and clinical reliability of radiomic models.
Keywords: generalizability, overall survival, oropharyngeal carcinoma, radiomic feature
Highly repeatable RFs (ICC > 0.7) significantly enhance the generalizability and clinical applicability of radiomic‐based predictive models.

1. INTRODUCTION
Radiomics is an emerging technology that enables the high‐throughput extraction of latent features from medical images and has demonstrated significant potential for clinical applications, including disease prediction, 1 , 2 prognosis, 3 , 4 , 5 and treatment evaluation. 6 , 7 However, radiomic models often exhibit inferior performance during external validation compared to internal testing, highlighting their limited generalizability and raising concerns regarding their reliability for clinical implementation. To address these challenges, substantial efforts have focused on enhancing the reliability, repeatability, and generalizability of radiomic models through the development of advanced artificial intelligence models, 8 the preselection of highly reproducible radiomic features (RFs) prior to model construction, 9 , 10 , 11 and the establishment of standardized feature extraction pipelines. 12 , 13 While the Imaging Biomarker Standardization Initiative (IBSI) has provided guidelines for standardized feature extraction, the reliability of RFs remains a critical determinant of model performance. Consequently, assessing the reliability of RFs before model development is essential for optimizing the performance and translational potential of radiomic models.
The reliability of RFs is influenced by multiple sources of variability, including inter‐observer variability, 14 variations in scanner types, 15 , 16 and heterogeneity in scanning protocols. 17 Fully accounting for all these factors to comprehensively assess the stability of RFs remains challenging. The test‐retest approach is commonly used to evaluate the reliability of RFs; however, this strategy entails additional radiation exposure for patients and requires considerable time investment from oncologists for repeated tumor segmentation, rendering it impractical in routine clinical practice. To overcome the limitations associated with test‐retest scanning, alternative perturbation methods have been developed to assess RF repeatability. 18 For instance, Teng et al. employed a perturbation‐based approach to enhance the reliability of an overall survival (OS) prediction model for patients with head and neck cancer. 10 Similarly, Zhang et al. improved the generalizability of a disease‐free survival model in head and neck cancer by pre‐selecting highly reproducible RFs using a perturbation method. 11 Collectively, these studies demonstrate that perturbation methods represent robust and effective approaches for evaluating RF reliability. Among the various sources of variability that can be simulated using perturbation methods, segmentation is particularly noteworthy. With advancements in automatic and semi‐automatic segmentation techniques, 19 , 20 there is significant potential to achieve more consistent segmentation results across different institutions. Therefore, evaluating the impact of segmentation variability on model generalizability represents a critical step toward facilitating the clinical application of radiomic models.
Numerous studies have focused on the impact of segmentation on the repeatability of RFs. For instance, Gustav Müller‐Franzes et al. demonstrated that models constructed using pre‐selected reliable RFs with respect to segmentation variability exhibited improved performance in predicting patient survival across different clinical imaging modalities and tumor entities. 21 Thomas Louis et al. reported that robust features demonstrated superior predictive potential compared to non‐robust features. 22 Nurin Syazwina Mohd Haniff reported that semi‐automatic segmentation using 3D‐Slicer represents a preferable alternative to manual segmentation, as this can produce more robust and reproducible RFs. 23 Segmentation‐dependent RF analysis has demonstrated potential advantages during the radiomic modeling process. However, the effect of segmentation on model generalizability remains insufficiently validated in multi‐cohort studies, particularly with respect to the selection thresholds used to identify reliable RFs.
In this study, we analyzed CT images from a total of 1017 patients with oropharyngeal carcinoma (OPC) across three institutions and employed a perturbation method to assess the impact of segmentation on the repeatability of RFs. The repeatability of RFs was quantified using the intra‐class correlation coefficient (ICC). To investigate how segmentation‐influenced RF repeatability affects model generalizability, we conducted a multi‐cohort OS analysis incorporating independent external validation cohorts. RFs were initially pre‐selected according to varying ICC thresholds, after which the Max‐Relevance and Min‐Redundancy (mRMR) feature selection method was applied to identify the final set of RFs for OS model construction. Ultimately, we established a definitive ICC threshold to pre‐select highly repeatable RFs, thereby enhancing the generalizability of the OS models.
2. MATERIALS AND METHODS
2.1. Patient cohort
We retrospectively collected CT images from 1017 patients with stage III‐IV OPC from The Cancer Imaging Archive (TCIA). 24 Eligible patients had a primary diagnosis of oropharyngeal cancer and had not undergone surgery. Patients whose deaths were attributed to non‐cancer‐related causes were excluded. This study incorporated three institutional datasets: the single‐institution Radiomic Biomarkers in Oropharyngeal Carcinoma (RBOPC) study, which served as the training cohort for the OS model, and two external validation cohorts—the Head and Neck Squamous Cell Carcinoma (HNSCC) study and the HEAD‐NECK‐RADIOMICS‐HN1 (HN137) study. Detailed demographic and clinical characteristics, including median age, sex distribution, and treatment modalities for each dataset, are summarized in Table 1. Gross tumor volume (GTV) segmentation was manually delineated by experienced radiation oncologists, and the resulting GTV masks were utilized for RF extraction. Overall survival, available in the public TCIA datasets, was defined as the duration from treatment initiation to the earliest occurrence of cancer‐related death. Given the retrospective nature of this study, informed consent was not required.
TABLE 1.
Baseline patient characteristics of the RBOPC (training and internal testing cohorts), HNSCC, and HN137 (external validation) cohorts.
| Characteristic | RBOPC (training) | HNSCC (external validation) | HN137 (external validation) | p value |
|---|---|---|---|---|
| Age | ||||
| Median | 61 | 57 | 62 | <0.01, 0.05, 0.01 |
| Sex | ||||
| Female | 93 | 62 | 17 | 0.11, 0.79, 0.62 |
| Male | 377 | 411 | 57 | |
| Overall stage | ||||
| III | 60 | 67 | 14 | 0.91, 0.04, 0.51 |
| IV | 410 | 406 | 60 | |
| Chemotherapy | ||||
| RT | 219 | 46 | 37 | <0.01, 0.16, 0.10 |
| CCRT | 0 | 257 | 11 | |
| RT+ICT | 251 | 46 | 0 | |
| CCRT+ICT | 0 | 124 | 0 | |
| RT+CON | 0 | 0 | 26 | |
Abbreviations: RT, radiotherapy; CCRT, concurrent chemoradiotherapy; ICT, induction chemotherapy; CON, concomitant.
2.2. Preprocessing and feature extraction
Image preprocessing and feature extraction were performed in accordance with the guidelines of the IBSI. 12 , 13 RFs were extracted using an in‐house Python‐based (version 3.8) pipeline, leveraging the SimpleITK (version 1.2.4) and PyRadiomics (version 2.2.0) libraries. Prior to RF extraction, all images were resampled to a resolution of 1 × 1 × 1 mm³, and voxel intensities were constrained to a range of –150 to 180 Hounsfield Units (HU) through re‐segmentation, effectively excluding non‐tumor tissues such as air and bone from the volume of interest. 10 All shape‐based features, first‐order features, and texture features derived from the Gray‐Level Co‐occurrence Matrix (GLCM), Gray Level Size Zone Matrix (GLSZM), Gray Level Run Length Matrix (GLRLM), and Neighbouring Gray Tone Difference Matrix (NGTDM) were extracted from the original images as well as from three‐dimensional Laplacian of Gaussian (LoG) filtered images with sigma values of 0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5, 5.5, and 6 mm. Each image was discretized using a fixed bin number of 30 before feature extraction. In total, 1226 RFs were computed per patient.
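The re-segmentation and discretization steps described above can be illustrated with a minimal Python sketch. It is not the in-house pipeline: the function names and the toy intensities are hypothetical, and the binning follows the IBSI fixed-bin-number formula, which may handle bin edges slightly differently from the PyRadiomics implementation used in the study.

```python
import math

def resegment(values, lower=-150.0, upper=180.0):
    """Keep only voxel intensities inside the [lower, upper] HU window."""
    return [v for v in values if lower <= v <= upper]

def discretize_fixed_bin_number(values, n_bins=30):
    """Map intensities to gray levels 1..n_bins (IBSI-style fixed bin number)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [1 for _ in values]
    levels = []
    for v in values:
        idx = math.floor(n_bins * (v - lo) / (hi - lo)) + 1
        levels.append(min(idx, n_bins))  # the maximum intensity joins the last bin
    return levels

# Toy voxel intensities in HU (hypothetical values, not study data).
roi_hu = [-900.0, -150.0, -20.0, 15.0, 40.0, 180.0, 700.0]
kept = resegment(roi_hu)                    # air-like and bone-like voxels are dropped
levels = discretize_fixed_bin_number(kept)  # gray levels used for texture matrices
print(kept, levels)
```

With a fixed bin number, the bin width adapts to the intensity range of each volume of interest, so the number of gray levels entering the texture matrices is the same for every patient.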
2.3. Perturbation and RF repeatability assessment
To simulate segmentation variability in the GTV, we employed the perturbation method introduced by Zwanenburg et al. during image preprocessing. 18 This method applies a three‐dimensional random displacement field to deform the segmented masks and generate randomized contours, as illustrated in Figure 1. For each voxel, a random vector component in each spatial dimension was independently sampled from a uniform distribution over the interval [‐1, 1]. To simulate the uniform inter‐slice variability characteristic of slice‐by‐slice manual contouring, all field vectors within the same slice were assigned an identical z‐component. The vector field was subsequently normalized dimension‐wise using the root mean square and smoothed with a Gaussian filter (σ = 10) to ensure spatial continuity and prevent abrupt transitions in the resulting contour deformations. The reliability of the perturbed masks was evaluated using the Dice similarity coefficient (DSC), which quantifies the spatial overlap between the perturbed and original masks. Forty random mask perturbations were applied to assess the impact of segmentation on feature repeatability. The repeatability of RFs derived from the perturbed masks was quantified using a one‐way random absolute agreement ICC.
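The two agreement measures used in this subsection can be written out in a few lines. The sketch below is illustrative only and makes two assumptions that the text does not spell out: the ICC is taken in its single-measurement one-way random form, often written ICC(1,1), and each row of the input holds one patient's feature value under each perturbed mask. Function names and toy data are hypothetical.

```python
def dice(mask_a, mask_b):
    """Dice similarity coefficient of two flattened binary masks."""
    overlap = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    total = sum(1 for a in mask_a if a) + sum(1 for b in mask_b if b)
    return 2.0 * overlap / total if total else 1.0

def icc_one_way(rows):
    """One-way random, single-measurement ICC.

    rows[i][j] is the value of one feature for patient i under perturbation j.
    """
    n, k = len(rows), len(rows[0])
    grand = sum(sum(r) for r in rows) / (n * k)
    means = [sum(r) / k for r in rows]
    # Between-patient and within-patient mean squares of a one-way ANOVA.
    ms_between = k * sum((m - grand) ** 2 for m in means) / (n - 1)
    ms_within = sum((x - m) ** 2 for r, m in zip(rows, means) for x in r) / (n * (k - 1))
    denom = ms_between + (k - 1) * ms_within
    return (ms_between - ms_within) / denom if denom else float("nan")

original = [0, 1, 1, 1, 1, 0]    # toy flattened mask
perturbed = [0, 0, 1, 1, 1, 1]   # same mask shifted by one voxel
dsc = dice(original, perturbed)

feature_values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # 3 patients x 2 perturbations
icc = icc_one_way(feature_values)
print(dsc, icc)
```

In the study setting, each row would contain forty values per feature, one for each perturbed mask.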
FIGURE 1.

Schematic diagram of the research approach.
2.4. Feature selection
RFs extracted from the unperturbed masks were utilized to train the OS models. To identify the optimal ICC threshold for selecting RFs that enhance model generalizability, we initially filtered RFs based on ICC values less than 0.6, 0.7, 0.8, and 0.9. Subsequently, the mRMR feature‐selection method was applied to select the seven most informative RFs for the OS model construction. The optimal ICC threshold was determined by comparing OS model performance in the external validation cohorts. Finally, RFs were categorized into high‐repeatability and low‐repeatability groups based on the selected ICC threshold, and the mRMR feature selection method was employed to select the final seven RFs for constructing the OS model, as illustrated in Figure 1.
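The two-stage selection described above (ICC-based pre-selection followed by mRMR) can be sketched as follows. This is a simplified stand-in rather than the study's implementation: relevance and redundancy are measured here with absolute Pearson correlation instead of mutual information, the greedy score is relevance minus mean redundancy, and all names and toy values are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if one is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def preselect_by_icc(icc, upper):
    """Names of features whose ICC lies below `upper` (progressive release)."""
    return [name for name, value in icc.items() if value < upper]

def mrmr(features, labels, candidates, k=7):
    """Greedy max-relevance, min-redundancy selection (correlation-based variant)."""
    relevance = {f: abs(pearson(features[f], labels)) for f in candidates}
    selected, remaining = [], list(candidates)
    while remaining and len(selected) < k:
        def score(f):
            if not selected:
                return relevance[f]
            redundancy = sum(abs(pearson(features[f], features[s])) for s in selected)
            return relevance[f] - redundancy / len(selected)
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy data: "a_noisy" nearly duplicates "a", so mRMR prefers "b" as second pick.
labels = [0, 0, 0, 1, 1, 1]
features = {"a": [1, 2, 3, 4, 5, 6], "a_noisy": [1, 2, 3, 4, 5, 7], "b": [1, 0, 1, 0, 1, 0]}
icc = {"a": 0.95, "a_noisy": 0.65, "b": 0.75}
released = preselect_by_icc(icc, upper=0.8)
chosen = mrmr(features, labels, list(features), k=2)
print(released, chosen)
```

Raising `upper` step by step from 0.6 to 0.9 enlarges the candidate pool, which mirrors the progressive inclusion of more repeatable RFs before the final seven are chosen.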
2.5. Overall survival model development and evaluation
The RBOPC cohort was randomly divided into training (70%) and internal testing (30%) cohorts, while the HN137 and HNSCC cohorts served as independent external validation datasets to assess the performance of the constructed OS models. Two‐year and three‐year OS models were developed using Logistic Regression (LR) and Support Vector Machine (SVM) algorithms based on the training cohort. Model performance was evaluated using the area under the curve (AUC) across the training, internal testing, and external validation cohorts. Additionally, a random downsampling method was applied to balance positive and negative outcome events in the training data. To ensure the reliability of the internal performance of the OS model, the RBOPC dataset was randomly split into training and testing cohorts 1000 times, and the corresponding 95% confidence intervals (CI) were calculated from these 1000 iterations.
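The evaluation loop described above (random 70/30 splits, class balancing by random downsampling, AUC, and a 95% CI over 1000 repetitions) can be sketched with the standard library alone. Several points are assumptions: the CI is taken as a percentile interval, since the text does not state how it was derived, and a toy risk score replaces the fitted SVM or LR model. All names are hypothetical.

```python
import math
import random

def auc(labels, scores):
    """AUC as the probability that a positive case outscores a negative one."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def downsample(labels, rng):
    """Indices of a class-balanced subset obtained by random undersampling."""
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    m = min(len(pos), len(neg))
    return sorted(rng.sample(pos, m) + rng.sample(neg, m))

def split_indices(n, train_fraction, rng):
    """Random train/test partition of range(n)."""
    order = list(range(n))
    rng.shuffle(order)
    cut = int(round(train_fraction * n))
    return order[:cut], order[cut:]

def percentile_ci(values, level=0.95):
    """Percentile interval with linear interpolation between order statistics."""
    ordered = sorted(values)
    def at(q):
        pos = q * (len(ordered) - 1)
        lo, hi = math.floor(pos), math.ceil(pos)
        return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)
    tail = (1.0 - level) / 2.0
    return at(tail), at(1.0 - tail)

rng = random.Random(0)
labels = [1 if i % 4 == 0 else 0 for i in range(200)]  # toy outcome, 25% events
scores = [y + rng.gauss(0.0, 1.0) for y in labels]      # toy risk score
test_aucs = []
for _ in range(1000):
    train, test = split_indices(len(labels), 0.7, rng)
    balanced = [train[i] for i in downsample([labels[t] for t in train], rng)]
    # A real pipeline would fit SVM or LR on `balanced`; the toy score needs no fitting.
    test_aucs.append(auc([labels[t] for t in test], [scores[t] for t in test]))
low, high = percentile_ci(test_aucs)
print(sum(test_aucs) / len(test_aucs), low, high)
```

Downsampling is applied to the training indices only, so the testing cohort keeps its natural event rate.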
3. RESULTS
As shown in Figure 2(A), the DSCs of the perturbed masks were evaluated across the HN137, HNSCC, and RBOPC cohorts. The mean DSC values (95% CI) for the HN137, HNSCC, and RBOPC cohorts were 0.925 (0.841–0.967), 0.909 (0.824–0.955), and 0.923 (0.854–0.959), respectively. All mean DSC values exceeded 0.9, indicating that the applied mask perturbations remained within a reasonable range for this study. The repeatability of RFs under segmentation variability was quantified using the ICC in the HN137, HNSCC, and RBOPC cohorts. RFs were classified into repeatability categories based on ICC intervals with a fixed bin size of 0.1, as shown in Figure 2(B)‐(D). The majority of RFs had a mean ICC greater than 0.7, accounting for 97.43%, 98.26%, and 97.02% of features in the HN137, HNSCC, and RBOPC cohorts, respectively. When an ICC threshold of 0.9 was applied, 77.01%, 71.29%, and 70.38% of RFs met this criterion in the HN137, HNSCC, and RBOPC cohorts, respectively. Notably, the unfiltered (original) and LoG‐filtered RFs with different sigma values demonstrated distinct statistical patterns. Most of the unfiltered RFs had ICC values greater than 0.9. The repeatability of LoG‐filtered RFs depended on the sigma value, with an increasing number of RFs falling within the ICC > 0.9 range as the sigma value increased. When the sigma value exceeded 1.5, the number of LoG‐filtered RFs within the ICC > 0.9 range reached a plateau, indicating stabilization of feature repeatability.
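The ICC interval counts and threshold percentages reported above correspond to a simple tally. The sketch below uses made-up ICC values and hypothetical function names; it only illustrates the bookkeeping, and assigning negative ICC values to the lowest interval is an assumption.

```python
def icc_histogram(icc_values, bin_width=0.1):
    """Count features per ICC interval [lo, lo + bin_width).

    Values of exactly 1.0 join the top interval and negative values the lowest one.
    """
    n_bins = int(round(1.0 / bin_width))
    counts = [0] * n_bins
    for v in icc_values:
        idx = int(v / bin_width)
        counts[max(0, min(idx, n_bins - 1))] += 1
    return counts

def fraction_above(icc_values, threshold):
    """Share of features whose ICC exceeds the threshold."""
    return sum(1 for v in icc_values if v > threshold) / len(icc_values)

# Toy ICC values for eight features (not study data).
toy_icc = [0.95, 0.91, 0.85, 0.72, 0.65, 0.31, 0.97, 0.93]
counts = icc_histogram(toy_icc)
share_07 = fraction_above(toy_icc, 0.7)
share_09 = fraction_above(toy_icc, 0.9)
print(counts, share_07, share_09)
```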
FIGURE 2.

(A) Dice similarity coefficient (DSC) of perturbed gross tumor volume masks. Distribution of radiomic features according to intra‐class correlation coefficient values with a fixed bin size of 0.1 for the (B) HN137, (C) HNSCC, and (D) RBOPC datasets.
To assess the influence of feature repeatability on the generalizability of the OS models, unfiltered RFs were initially selected using different ICC thresholds. Two‐year and three‐year OS prediction models were subsequently developed using a unified modeling pipeline that incorporated SVM and LR algorithms. RFs with ICC below 0.6 were first included, followed by progressive inclusion of increasingly repeatable RFs through higher ICC thresholds to examine their impact on model generalizability, as shown in Figure 3(A)‐(D). Across all four OS models, the mean AUC showed a gradually increasing trend in both the training and internal testing groups as more repeatable RFs were incorporated. Notably, when the ICC threshold increased from 0.7 to 0.8, three of the four models demonstrated improved mean AUCs in the external validation cohorts (HNSCC and HN137). A distinct pattern was observed in the LR‐based three‐year OS model, in which a marginal decrease in performance was noted within the HN137 cohort with the inclusion of repeatable RFs. However, the mean AUC of the LR‐based three‐year OS model increased significantly in the HNSCC external validation cohort when the ICC threshold was set at 0.9. Despite the moderate performance of the three‐year LR‐based OS model incorporating highly repeatable RFs in the HN137 cohort, these features substantially enhanced the generalizability of the remaining three OS models. The performance trajectories of these models, evaluated through the progressive release of repeatable RFs based on ICC, indicate that RFs with ICC values exceeding 0.7 make a significant contribution to model generalizability. This ICC threshold appears to represent a critical point for feature selection when optimizing OS prediction models.
FIGURE 3.

Changes in the performance of (A) two‐year and (B) three‐year overall survival models based on Support Vector Machine (SVM), and (C) two‐year and (D) three‐year overall survival models based on Logistic Regression (LR), with repeatable radiomic features selected at different ICC thresholds.
To further assess the effect of RF repeatability on model generalizability, unfiltered RFs were categorized into low‐repeatable and high‐repeatable groups based on the ICC threshold of 0.7, as shown in Figure 4(A)‐(D). The high‐repeatable and low‐repeatable features used for model construction are shown in Table 2. Across the training, internal testing, and external validation cohorts, OS models constructed using high‐repeatable RFs consistently demonstrated superior mean AUC values compared with those built using low‐repeatable RFs, regardless of whether SVM or LR algorithms were used to train the OS models. Although the performance of the LR‐based three‐year OS models varied with increasing ICC thresholds in the HN137 and HNSCC external validation cohorts (Figure 3(D)), the mean AUC of these models remained higher when high‐repeatable RFs were used compared to low‐repeatable RFs in both external validation cohorts (Figure 4(D)). These findings suggest that high‐repeatable RFs, which are minimally affected by segmentation variability, contribute to the enhanced generalizability of radiomic models. The detailed mean AUC values and corresponding 95% CI for OS models constructed using high‐repeatable and low‐repeatable RFs are listed in Table 3.
FIGURE 4.

Performance evaluation of overall survival models constructed with low‐repeatable (ICC < 0.7) and high‐repeatable (ICC > 0.7) radiomic features: (A) Support Vector Machine (SVM)‐based two‐year model and (B) SVM‐based three‐year model. Logistic Regression (LR)‐based (C) two‐year model and (D) three‐year model.
TABLE 2.
Features with low repeatability and high repeatability used for model training.
| Features | p value |
|---|---|
| ICC > 0.7 | |
| LoG(sigma:2.5)_glszm_GrayLevelNonUniformity | 0 |
| LoG(sigma:2)_glcm_MCC | 0.045 |
| LoG(sigma:3)_glrlm_LongRunLowGrayLevelEmphasis | 0.027 |
| LoG(sigma:6)_glszm_SizeZoneNonUniformityNormalized | 0 |
| LoG(sigma:1.5)_firstorder_Minimum | 0 |
| Original_glszm_SizeZoneNonUniformityNormalized | 0 |
| Original_glrlm_GrayLevelNonUniformity | 0 |
| ICC < 0.7 | |
| LoG(sigma:3)_glszm_LargeAreaHighGrayLevelEmphasis | 0 |
| LoG(sigma:0.5)_glrlm_RunEntropy | 0.224 |
| LoG(sigma:0.5)_glszm_LowGrayLevelZoneEmphasis | 0.162 |
| Original_firstorder_Minimum | 0.031 |
| LoG(sigma:0.5)_glszm_LargeAreaLowGrayLevelEmphasis | 0.015 |
| LoG(sigma:1)_gldm_DependenceVariance | 0.351 |
| LoG(sigma:0.5)_glrlm_LongRunHighGrayLevelEmphasis | 0.028 |
Note: p values were calculated using a permutation test.
TABLE 3.
The mean AUC values and their 95%CI for the two‐year and three‐year overall survival models built with high‐repeatable and low‐repeatable radiomic features in the training, internal testing, and external validation cohorts.
| Cohort | RF group | Two‐year AUC (95% CI), SVM | Two‐year AUC (95% CI), LR | Three‐year AUC (95% CI), SVM | Three‐year AUC (95% CI), LR |
|---|---|---|---|---|---|
| Training | ICC > 0.7 | 0.6965 (0.6396‐0.7523) | 0.7235 (0.6627‐0.7813) | 0.7243 (0.6717‐0.7751) | 0.7250 (0.6722‐0.7769) |
| Training | ICC < 0.7 | 0.6028 (0.3928‐0.6764) | 0.6205 (0.5539‐0.6881) | 0.5918 (0.4070‐0.6588) | 0.6127 (0.5526‐0.6734) |
| Internal‐testing | ICC > 0.7 | 0.6835 (0.5784‐0.7785) | 0.7031 (0.6047‐0.7946) | 0.7082 (0.6239‐0.7884) | 0.6939 (0.6073‐0.7783) |
| Internal‐testing | ICC < 0.7 | 0.5782 (0.3192‐0.7078) | 0.5890 (0.4798‐0.6856) | 0.5742 (0.3363‐0.6865) | 0.6046 (0.4982‐0.7016) |
| HN137 | ICC > 0.7 | 0.7178 (0.6936‐0.7349) | 0.6736 (0.6334‐0.918) | 0.6689 (0.6509‐0.6856) | 0.6197 (0.5760‐0.6556) |
| HN137 | ICC < 0.7 | 0.6225 (0.3455‐0.6774) | 0.6648 (0.6316‐0.7062) | 0.5875 (0.3782‐0.6454) | 0.6031 (0.5973‐0.6273) |
| HNSCC | ICC > 0.7 | 0.6480 (0.5724‐0.7070) | 0.7162 (0.6745‐0.7639) | 0.7273 (0.6734‐0.7597) | 0.7516 (0.7096‐0.7763) |
| HNSCC | ICC < 0.7 | 0.5905 (0.3718‐0.6431) | 0.6121 (0.5646‐0.6778) | 0.6148 (0.3475‐0.6686) | 0.6556 (0.6324‐0.6939) |
Abbreviations: AUC, area under the receiver operating characteristic curve; SVM, Support Vector Machine; LR, Logistic Regression; ICC, intra‐class correlation coefficient.
The receiver operating characteristic (ROC) curves of OS models built with low‐repeatable and high‐repeatable RFs effectively demonstrate their predictive performance for two‐year and three‐year OS outcomes, as shown in Figure 5. The ROC curves for the SVM‐based two‐year and three‐year OS models (Figure 5(A) and (B)) demonstrate that models incorporating high‐repeatable RFs achieve superior predictive performance compared to those built with low‐repeatable RFs. Similarly, the LR‐based two‐year and three‐year OS models demonstrate performance trends comparable to the SVM‐based models across the training, internal testing, and external validation cohorts when evaluating the impact of high‐repeatable versus low‐repeatable RFs. This observation is supported by the distinct ROC curve patterns shown in Figure 5(C) and (D), which consistently indicate improved model performance associated with high‐repeatable RFs.
FIGURE 5.

Receiver operating characteristic (ROC) curves for two‐year survival models developed using support vector machine (SVM) and logistic regression (LR) algorithms with (A) high‐repeatable and (B) low‐repeatable radiomic features. Similarly, ROC curves for three‐year survival models built with SVM and LR, utilizing (C) high‐repeatable and (D) low‐repeatable radiomic features.
To further validate the impact of high‐repeatable and low‐repeatable features on model generalizability, we separately assessed their ability to distinguish between high‐ and low‐risk survival groups in two external validation cohorts, as shown in Figure 6. High‐repeatable features demonstrated a measurable ability to differentiate risk groups in the HN137 cohort (Figure 6A) and successfully distinguished high‐ and low‐risk groups in the HNSCC external validation cohort (Figure 6C). In contrast, low‐repeatable features failed to achieve effective risk stratification in both the HNSCC and HN137 cohorts (Figure 6B and D). These findings further support the contribution of high‐repeatable features to the enhanced cross‐institutional decision‐making capability of radiomic models.
FIGURE 6.

Kaplan‐Meier survival analysis of low‐ and high‐risk groups stratified by (A, C) high‐repeatable and (B, D) low‐repeatable features in the external validation cohorts HN137 and HNSCC.
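The risk stratification shown in Figure 6 rests on the Kaplan-Meier product-limit estimator, which can be sketched briefly. How patients were assigned to risk groups is not stated in this section, so the cut-off on a predicted risk score used below is purely hypothetical, as are the function names and the toy follow-up data.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time with at least one event."""
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]   # 1 = event, 0 = censored
            removed += 1
            i += 1
        if deaths:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= removed
    return curve

def stratify(risk_scores, cutoff):
    """Label each patient 'high' or 'low' risk (the cut-off is a hypothetical choice)."""
    return ["high" if r >= cutoff else "low" for r in risk_scores]

times = [5, 8, 12, 20, 24, 30, 36, 36]   # months of follow-up (toy data)
events = [1, 1, 0, 1, 0, 1, 0, 0]        # 1 = cancer-related death, 0 = censored
risk = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
groups = stratify(risk, cutoff=0.5)
curves = {}
for name in ("high", "low"):
    idx = [i for i, g in enumerate(groups) if g == name]
    curves[name] = kaplan_meier([times[i] for i in idx], [events[i] for i in idx])
print(curves)
```

A log-rank test on the two groups would normally accompany such curves; it is omitted here to keep the sketch short.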
Clinical features such as age and treatment modality showed significant correlations with patient survival across all three datasets. To compare the generalizability of clinical features and RFs, we developed models based on clinical features and high‐repeatable RFs, with detailed results presented in Table 4. Overall, models built with high‐repeatable RFs demonstrated superior performance compared with clinical feature‐based models in the HNSCC dataset. However, in the HN137 dataset, the clinical feature‐based model demonstrated superior performance. Notably, the combined model integrating both clinical features and RFs achieved improved performance in both external validation cohorts, indicating that the incorporation of clinical features further enhances model generalizability.
TABLE 4.
The mean AUC values and their 95% CI for the two‐year and three‐year overall survival models built with clinical features, high‐repeatable radiomic features, and their combination in the training, internal testing, and external validation cohorts.
| Nomogram | Training (95% CI) | Internal‐testing (95% CI) | External validation, HN137 (95% CI) | External validation, HNSCC (95% CI) |
|---|---|---|---|---|
| Clinical | ||||
| SVM, two‐year | 0.6586 (0.4519‐0.7274) | 0.6509 (0.3550‐0.7562) | 0.6768 (0.3216‐0.6914) | 0.5593 (0.4464‐0.5674) |
| SVM, three‐year | 0.7228 (0.6745‐0.7707) | 0.7206 (0.6316‐0.8056) | 0.7037 (0.7013‐0.7056) | 0.5761 (0.5710‐0.5783) |
| LR, two‐year | 0.6949 (0.6304‐0.7616) | 0.6894 (0.5862‐0.7861) | 0.6787 (0.6725‐0.6833) | 0.6031 (0.5780‐0.6134) |
| LR, three‐year | 0.7473 (0.6994‐0.7970) | 0.7423 (0.6512‐0.8265) | 0.6787 (0.6753‐0.6816) | 0.5659 (0.5503‐0.5770) |
| Radiomics | ||||
| SVM, two‐year | 0.6965 (0.6396‐0.7523) | 0.6835 (0.5784‐0.7785) | 0.7178 (0.6936‐0.7349) | 0.6480 (0.5724‐0.7070) |
| SVM, three‐year | 0.7243 (0.6717‐0.7751) | 0.7082 (0.6239‐0.7884) | 0.6689 (0.6509‐0.6856) | 0.7273 (0.6734‐0.7597) |
| LR, two‐year | 0.7235 (0.6627‐0.7813) | 0.7031 (0.6047‐0.7946) | 0.6736 (0.6334‐0.918) | 0.7162 (0.6745‐0.7639) |
| LR, three‐year | 0.7250 (0.6722‐0.7769) | 0.6939 (0.6073‐0.7783) | 0.6197 (0.5760‐0.6556) | 0.7516 (0.7096‐0.7763) |
| Integrated | ||||
| SVM, two‐year | 0.6967 (0.6396‐0.7528) | 0.6858 (0.5798‐0.7849) | 0.7191 (0.6936‐0.7367) | 0.6481 (0.5756‐0.7076) |
| SVM, three‐year | 0.7254 (0.6746‐0.7716) | 0.7082 (0.6206‐0.7874) | 0.6682 (0.6501‐0.6824) | 0.7279 (0.6760‐0.7595) |
| LR, two‐year | 0.7944 (0.7402‐0.8490) | 0.7682 (0.6848‐0.8450) | 0.7298 (0.6945‐0.7655) | 0.7678 (0.7314‐0.7970) |
| LR, three‐year | 0.8350 (0.7894‐0.8775) | 0.8046 (0.7271‐0.8733) | 0.6688 (0.6375‐0.6966) | 0.7379 (0.6748‐0.7770) |
4. DISCUSSION
This study employed a perturbation‐based approach to simulate inter‐observer variability in tumor delineation among clinical oncologists and to evaluate its impact on the repeatability of RFs. The ICC was used as a quantitative metric of RF repeatability. Appropriate ICC thresholds were identified by progressively incorporating repeatable RFs across increasing ICC thresholds, enabling selection of stable RFs to enhance the generalizability of the OS model. Our analysis demonstrated that RFs with ICC values exceeding 0.7 made a significant contribution to model generalizability. Notably, more than 70% of RFs across all three datasets demonstrated excellent reproducibility, with ICC values exceeding 0.9. From the standpoint of the effect of RF repeatability on model generalizability, more than 95% of RFs in the three datasets satisfied the criterion of ICC values greater than 0.7, supporting their suitability for improving model generalizability. Despite the high proportion of reproducible features, our findings emphasize that segmentation variability still adversely affected a subset of RFs, which may compromise predictive performance. These results underscore the importance of assessing segmentation‐induced variability in RF reproducibility as a critical preprocessing step to enhance model generalizability.
The number of LoG‐filtered RFs demonstrating high repeatability (ICC > 0.9) exhibited a positive correlation with increasing sigma values, as shown in Figure 2(B)‐(D). Notably, this trend was observed despite the use of identical perturbed masks for RF extraction across all LoG‐filtered images, indicating that the observed variations in the number of repeatable features primarily stem from intrinsic differences introduced by the filtering process rather than from segmentation variability. LoG filtering plays a fundamental role through its combined effects of image smoothing and edge contrast enhancement. At smaller sigma values, the filter amplifies high‐frequency intensity variations, leading to increased image heterogeneity and reduced feature repeatability. Conversely, larger sigma values produce more pronounced smoothing effects, resulting in more homogeneous intensity distributions and, consequently, enhanced repeatability of extracted RFs.
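The frequency behavior described above can be illustrated with a one-dimensional analog of the LoG filter. This is a sketch under stated assumptions, not the three-dimensional filter used in the pipeline: the kernel is a sampled second derivative of a Gaussian truncated at about four sigma, sigma is expressed in voxel units, and the test signal alternates between +1 and -1 to represent the highest spatial frequency a voxel grid can carry. Function names are hypothetical.

```python
import math

def log_kernel_1d(sigma):
    """Sampled second derivative of a Gaussian, truncated at about 4 sigma."""
    radius = max(1, math.ceil(4 * sigma))
    kernel = []
    for x in range(-radius, radius + 1):
        gauss = math.exp(-x * x / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))
        kernel.append((x * x - sigma ** 2) / sigma ** 4 * gauss)
    return kernel

def response_to_alternating_signal(sigma):
    """Absolute filter output at the center of the signal +1, -1, +1, ..."""
    kernel = log_kernel_1d(sigma)
    radius = len(kernel) // 2
    offsets = range(-radius, radius + 1)
    return abs(sum(w * (-1) ** abs(x) for w, x in zip(kernel, offsets)))

responses = {sigma: response_to_alternating_signal(sigma) for sigma in (0.5, 1.0, 2.0)}
for sigma, value in responses.items():
    print(sigma, value)
```

The response to voxel-scale fluctuations shrinks rapidly as sigma grows, which is consistent with the smoother, more repeatable feature values observed at larger sigma.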
While numerous studies have investigated the impact of segmentation variability on feature repeatability and have employed ICC thresholds for repeatable RF selection, the optimal ICC threshold for enhancing model generalizability remains to be validated. In our methodology, we implemented a two‐stage feature selection process, consisting of an initial pre‐selection based on ICC values below a specified threshold, followed by refined selection using the mRMR criterion. Using this approach, we systematically evaluated model performance by progressively incorporating repeatable RFs across increasing ICC thresholds. This progressive integration of segmentation‐affected features enabled the identification of ICC thresholds that optimally maximize model generalizability. Our analysis demonstrated a significant improvement in the AUC metrics for both two‐year and three‐year OS prediction models in external validation cohorts when an ICC threshold of 0.8 was applied. This finding suggests that incorporation of RFs with ICC values in the range of 0.7 to 0.8 contributes meaningfully to enhanced model generalizability. Consequently, our results demonstrate that segmentation‐affected RFs retaining ICC values above 0.7 can effectively improve the generalizability of CT‐based OS prediction models for OPC patients. Among the high‐repeatable features listed in Table 2 and used for model construction, six were texture‐based features. These texture features describe tumor heterogeneity from different dimensions, supporting the notion that tumor heterogeneity may be closely associated with radiosensitivity and, thus, plays an important role in the OS model.
Although this study systematically investigated the effects of segmentation variability on the generalizability of OPC OS prediction models, several limitations warrant consideration. First, although we identified the influence of sigma values on RF repeatability, the optimal sigma parameter range for simultaneously enhancing both feature reproducibility and model generalizability has yet to be determined. Second, as this study was retrospective in nature, our findings require validation in prospective clinical studies to confirm their clinical applicability and robustness. Third, variations in imaging equipment, scanning parameters, and acquisition protocols across the three centers may have influenced the stability of RFs in cross‐center applications. Our study did not comprehensively account for these variables but rather focused specifically on assessing the stability of independent target delineation features. Fourth, the appropriate ICC threshold may vary according to imaging modality and tumor type. Therefore, future studies are warranted to validate these thresholds across diverse imaging modalities and tumor entities in order to establish robust, disease‐specific feature selection criteria. Finally, the current OS prediction models demonstrated suboptimal performance, as reflected by relatively low AUC values. Future research should therefore focus on developing more advanced modeling approaches and integrating complementary data sources to enhance the predictive accuracy of OS models.
5. CONCLUSION
Segmentation variability significantly impacts the repeatability of the RFs and, consequently, influences the generalizability of radiomic models. In this study, a perturbation‐based method was used to simulate inter‐observer variability in tumor delineation among oncologists, and ICC was used to quantify the repeatability of RFs. Our findings demonstrate that segmentation‐affected RFs with ICC values above 0.7 contribute substantially to the generalizability of CT‐based OS prediction models for OPC patients, with over 95% of RFs satisfying this reproducibility criterion. Regarding the LoG filtering parameters, sigma values exceeding 1.5 were shown to significantly enhance feature repeatability. However, the relationship between sigma values and model generalizability requires further investigation. Overall, these results underscore the importance of implementing rigorous feature selection strategies prior to model development, as the incorporation of highly repeatable RFs significantly enhances the generalizability and clinical applicability of radiomics‐based predictive models.
CONFLICT OF INTEREST STATEMENT
The authors declare no known competing financial interests or personal relationships that could have influenced the work reported in this paper.
ETHICAL STATEMENT
Not applicable.
ACKNOWLEDGMENTS
This work was supported by the Henan Provincial Science and Technology Research Project (Grant No. 232102310125) and the Youth Natural Science Foundation of Henan Province (Grant No. 242300420413).
Contributor Information
Yongqiang Wang, Email: wyq331@mail.ustc.edu.cn.
Jing Cai, Email: jing.cai@polyu.edu.hk.
Hong Ge, Email: gehong616@126.com.
REFERENCES
- 1. Liu Z, Wang S, Dong D, et al. The Applications of Radiomics in Precision Diagnosis and Treatment of Oncology: Opportunities and Challenges. Theranostics. 2019;9(5):1303‐1322. doi: 10.7150/thno.30309
- 2. Gul M, Bonjoc KC, Gorlin D, et al. Diagnostic Utility of Radiomics in Thyroid and Head and Neck Cancers. Front Oncol. 2021;11:639326. doi: 10.3389/fonc.2021.639326
- 3. Feng C, Zhu L, Mao W, et al. Expression and prognosis of cellular senescence genes in head and neck squamous cell carcinoma. Holist Integr Oncol. 2025;4:2. doi: 10.1007/s44178-024-00115-7
- 4. Huang Y, Liu Z, He L, et al. Radiomics Signature: A Potential Biomarker for the Prediction of Disease‐Free Survival in Early‐Stage (I or II) Non‐Small Cell Lung Cancer. Radiology. 2016;281(3):947‐957. doi: 10.1148/radiol.2016152234
- 5. HajiEsmailPoor Z, Kargar Z, Baradaran M, et al. Prognostic value of CT scan‐based radiomics in intracerebral hemorrhage patients: A systematic review and meta‐analysis. Eur J Radiol. 2024;178:111652. doi: 10.1016/j.ejrad.2024.111652
- 6. Chetan MR, Gleeson FV. Radiomics in predicting treatment response in non‐small‐cell lung cancer: current status, challenges and future perspectives. Eur Radiol. 2021;31(2):1049‐1058. doi: 10.1007/s00330-020-07141-9
- 7. Patel M, Zhan J, Natarajan K, et al. Machine learning‐based radiomic evaluation of treatment response prediction in glioblastoma. Clin Radiol. 2021;76(8):628.e17‐628.e27. doi: 10.1016/j.crad.2021.03.019
- 8. Zhang Y, Lam S, Yu T, et al. Integration of an imbalance framework with novel high‐generalizable classifiers for radiomics‐based distant metastases prediction of advanced nasopharyngeal carcinoma. Knowl‐Based Syst. 2022;235:107649. doi: 10.1016/j.knosys.2021.107649
- 9. Teng X, Zhang J, Zwanenburg A, et al. Building reliable radiomic models using image perturbation. Sci Rep. 2022;12(1):10035. doi: 10.1038/s41598-022-14178-x
- 10. Teng X, Zhang J, Ma Z, et al. Improving radiomic model reliability using robust features from perturbations for head‐and‐neck carcinoma. Front Oncol. 2022;12:974467. doi: 10.3389/fonc.2022.974467
- 11. Zhang J, Lam SK, Teng X, et al. Radiomic feature repeatability and its impact on prognostic model generalizability: A multi‐institutional study on nasopharyngeal carcinoma patients. Radiother Oncol. 2023;183:109578. doi: 10.1016/j.radonc.2023.109578
- 12. Zwanenburg A, Vallières M, Abdalah MA, et al. The Image Biomarker Standardization Initiative: Standardized Quantitative Radiomics for High‐Throughput Image‐based Phenotyping. Radiology. 2020;295(2):328‐338. doi: 10.1148/radiol.2020191145
- 13. Whybra P, Zwanenburg A, Andrearczyk V, et al. The Image Biomarker Standardization Initiative: Standardized Convolutional Filters for Reproducible Radiomics and Enhanced Clinical Insights. Radiology. 2024;310(2):e231319. doi: 10.1148/radiol.231319
- 14. Kocak B, Durmaz ES, Kaya OK, Ates E, Kilickesmez O. Reliability of Single‐Slice‐Based 2D CT Texture Analysis of Renal Masses: Influence of Intra‐ and Interobserver Manual Segmentation Variability on Radiomic Feature Reproducibility. AJR Am J Roentgenol. 2019;213(2):377‐383. doi: 10.2214/AJR.19.21212
- 15. Jha AK, Mithun S, Jaiswar V, et al. Repeatability and reproducibility study of radiomic features on a phantom and human cohort. Sci Rep. 2021;11(1):2055. doi: 10.1038/s41598-021-81526-8
- 16. Pandey U, Saini J, Kumar M, Gupta R, Ingalhalikar M. Normative Baseline for Radiomics in Brain MRI: Evaluating the Robustness, Regional Variations, and Reproducibility on FLAIR Images. J Magn Reson Imaging. 2021;53(2):394‐407. doi: 10.1002/jmri.27349
- 17. Zhang H, Lu T, Wang L, et al. Robustness of radiomics within photon‐counting detector CT: impact of acquisition and reconstruction factors. Eur Radiol. 2025;35(8):4661‐4673. doi: 10.1007/s00330-025-11374-x
- 18. Zwanenburg A, Leger S, Agolli L, et al. Assessing robustness of radiomic features by image perturbation. Sci Rep. 2019;9(1):614. doi: 10.1038/s41598-018-36938-4
- 19. Primakov SP, Ibrahim A, van Timmeren JE, et al. Automated detection and segmentation of non‐small cell lung cancer computed tomography images. Nat Commun. 2022;13(1):3423. doi: 10.1038/s41467-022-30841-3
- 20. Sahlsten J, Jaskari J, Wahid KA, et al. Application of simultaneous uncertainty quantification and segmentation for oropharyngeal cancer use‐case with Bayesian deep learning. Commun Med (Lond). 2024;4(1):110. doi: 10.1038/s43856-024-00528-5
- 21. Müller‐Franzes G, Nebelung S, Schock J, et al. Reliability as a Precondition for Trust‐Segmentation Reliability Analysis of Radiomic Features Improves Survival Prediction. Diagnostics (Basel). 2022;12(2):247. doi: 10.3390/diagnostics12020247
- 22. Louis T, Lucia F, Cousin F, et al. Identification of CT radiomic features robust to acquisition and segmentation variations for improved prediction of radiotherapy‐treated lung cancer patient recurrence. Sci Rep. 2024;14(1):9028. doi: 10.1038/s41598-024-58551-4
- 23. Haniff NSM, Abdul Karim MK, Osman NH, Saripan MI, Che Isa IN, Ibahim MJ. Stability and Reproducibility of Radiomic Features Based Various Segmentation Technique on MR Images of Hepatocellular Carcinoma (HCC). Diagnostics (Basel). 2021;11(9):1573. doi: 10.3390/diagnostics11091573
- 24. Clark K, Vendt B, Smith K, et al. The Cancer Imaging Archive (TCIA): maintaining and operating a public information repository. J Digit Imaging. 2013;26(6):1045‐1057. doi: 10.1007/s10278-013-9622-7
