Abstract
Background
Predictive models using high-dimensional data, such as genomics and transcriptomics, are increasingly used in oncology for time-to-event endpoints. Internal validation of these models is crucial to mitigate optimism bias prior to external validation. Common strategies include train-test, bootstrap, and (nested) cross-validation. However, no benchmark exists for these methods in high-dimensional settings. We aimed to compare these strategies and provide recommendations in the field of transcriptomic analysis.
Method
A simulation study was conducted using data from the SCANDARE head and neck cohort (NCT03017573) including n = 76 patients. Simulated datasets included clinical variables (age, sex, HPV status, TNM staging), transcriptomic data (15,000 transcripts), and disease-free survival, with a realistic cumulative baseline hazard. Sample sizes of 50, 75, 100, 500, and 1000 were simulated, with 100 replicates each. Cox penalized regression was performed for model selection, followed by train-test validation (70 % training), bootstrap (100 iterations), 5-fold cross-validation, and nested cross-validation (5 × 5) to assess discriminative (time-dependent AUC and C-index) and calibration (3-year integrated Brier Score) performance.
Results
Train-test validation showed unstable performance. Conventional bootstrap was over-optimistic, while the 0.632 + bootstrap was overly pessimistic, particularly with small samples (n = 50 to n = 100). The k-fold cross-validation and nested cross-validation improved performance with larger sample sizes, with k-fold cross-validation demonstrating greater stability. Nested cross-validation showed performance fluctuations depending on the regularization method for model development.
Conclusion
The K-fold cross-validation and nested cross-validation are recommended for internal validation of Cox penalized models in high-dimensional time-to-event settings. These methods offer greater stability and reliability compared to train-test or bootstrap approaches, particularly when sample sizes are sufficient.
Keywords: Transcriptomic, Prognosis, Simulation, Head and neck, Penalized regression, Validation
Graphical Abstract
1. Introduction
Multidimensional research has been rapidly evolving since the advent of genomic and transcriptomic sequencing at the end of the 2000s [1]. With the emergence of next-generation sequencing (NGS) and the development of targeted molecular therapies, therapeutic decision-making for many cancer types has become increasingly complex due to the inclusion of molecular biomarkers in prognostication [2]. Predictive models are important tools to provide estimates of patient outcomes [3], but developing such models is particularly challenging when dealing with high-dimensional data such as genomics, epigenomics, transcriptomics, proteomics, and metabolomics, because the large number of variables relative to the number of samples can lead to overfitting, increased noise, and instability in the results. This complexity often reduces model generalizability and makes it difficult to identify truly informative features, thereby complicating model development and interpretation.
Model development aims to detect the relevant features and discard irrelevant ones. Selecting prognostic variables in high-dimensional data is challenging due to correlations between covariates, limited statistical power, and an increased risk of type I errors when selecting variables associated with the prognostic outcome through multiple-testing approaches. Several methods can be used for selection with time-to-event endpoints, such as Cox penalized regression or machine learning approaches [4]. Penalized Cox regression models, such as LASSO (Least Absolute Shrinkage and Selection Operator) and elastic net, exhibit good sparsity and interpretability [5] for time-to-event outcomes. Penalized regression is used to estimate the prognostic importance of biological factors while accounting for correlations between variables, integrating all available data into more comprehensive models.
The validation of a statistical model aims to assess its performance without bias [6]. Internal validation procedures evaluate the model using the same dataset used for its development, which helps estimate how well the model fits the training data. In contrast, external validation involves testing the model on an independent dataset to assess its generalizability to new, unseen data. Although external validation is crucial for confirming the robustness and applicability of prognostic models [7], the assessment of model performance, particularly in high-dimensional settings using time-to-event endpoints, is notably affected by optimism and instability because of right censoring [8]. Commonly used approaches for internal validation include train-test, bootstrap, cross-validation and nested cross-validation. Train-test validation of clinical prediction models, which divides the data into training and testing sets, may reduce predictive accuracy and the precision of validation in prognosis studies [9]. Bootstrap validation methods are also used to validate penalized models but carry a risk of overestimation depending on the bootstrap estimator; correction methods have been described for linear and binary models to limit optimism bias [10], [11]. The 0.632 + bootstrap estimator, which combines the resampling error and bootstrap error with a weighting scheme to reduce bias, has been shown to produce less optimism bias, particularly in non-regularized models using time-to-event (TTE) endpoints, where outcomes are measured as the time until an event such as death or relapse occurs [12]. K-fold cross-validation seems to provide a good balance between bias and stability in high-dimensional settings [13], [14], but it has been poorly evaluated with time-to-event endpoints. Nested cross-validation is also used in small-sample datasets, optimizing hyperparameters with good accuracy; it aims to combine both selection and validation of the best model [15], [16] within the same procedure, but has been under-evaluated with time-to-event endpoints in high-dimensional settings. To date, no studies have directly compared the performance of internal validation methods in omics approaches, particularly those involving the integration of transcriptomic data with a time-to-event endpoint.
The aim of this study is therefore to compare the performance of train-test, bootstrap, cross-validation and nested cross-validation strategies with transcriptomic data using time-to-event endpoints in a head and neck carcinoma simulation study.
2. Methods
2.1. Simulation study and implementation
Datasets of different sample sizes (50, 75, 100, 500, and 1000), inspired by head and neck oncology, were generated, with r = 100 replicates per scenario. An independent dataset of n = 1000 samples was simulated with the same approach as the replicates. All simulations and analyses were performed with R software (version 4.4.0). The entire process of the study is summarized in Fig. 1.
Fig. 1.
Flow chart of statistical process of the simulation.
2.2. Data generating mechanism
2.2.1. Data
The original data from the SCANDARE study (NCT03017573) were collected prospectively from 2017 to 2020 as part of a biobanking initiative in cancer patients. The primary objective was to better understand the interplay between molecular alterations within the tumor, the tumor microenvironment, and the immune response.
SCANDARE is a multicenter, prospective biobanking study collecting tumor samples (with or without nodal tissue), plasma, and blood at various time points. The study included patients with several cancer types: ovarian cancer, triple-negative breast cancer, head and neck cancer, advanced-stage treatment-naïve cervical cancer, sarcomas (including breast angiosarcoma and uterine sarcoma). This study specifically focused on the head and neck cancer population.
Participants met the following criteria: newly diagnosed and treatment-naïve patients eligible for surgical intervention, age ≥ 18 years, both male and female, and signed informed consent.
A multi-platform approach was applied to tumor and blood samples, including: pathological evaluation, genomic profiling, transcriptomic analysis, DNA methylation assessment. These platforms enabled a comprehensive characterization of tumor biology and host immune features in the selected population.
2.2.2. Clinical data simulation
We simulated four independent covariates: age, sex, HPV status and TNM staging. Age was sampled from a normal distribution with a mean of 65 years and a standard deviation of 10 years. Sex and HPV status were simulated from a Bernoulli distribution with p = 0.3 for both. TNM stage (I to IV) was sampled with probabilities of 0.22, 0.13, 0.25 and 0.40, based on the SCANDARE head and neck cohort (NCT03017573).
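As an illustration, a minimal R sketch of this clinical simulation step could look as follows (object names, seed and n are ours and not taken from the original code):

```r
# Clinical covariates: age ~ N(65, 10), sex and HPV status ~ Bernoulli(0.3),
# TNM stage I-IV with the cohort-derived probabilities.
set.seed(1)
n <- 100
clinical <- data.frame(
  age = rnorm(n, mean = 65, sd = 10),
  sex = rbinom(n, size = 1, prob = 0.3),
  hpv = rbinom(n, size = 1, prob = 0.3),
  tnm = sample(1:4, size = n, replace = TRUE,
               prob = c(0.22, 0.13, 0.25, 0.40))
)
```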
2.2.3. Transcriptomic data simulation
We independently simulated log-transformed count data for 15,000 transcripts based on the SCANDARE head and neck distribution and dispersion parameters. We applied a four-step simulation: (i) simulation of the mean expression µ1 of the 15,000 transcripts using a normal distribution with µ0 = 2.3 and σ0 = 1.8; (ii) simulation of a dispersion parameter τ1 following a lognormal distribution with µτ = 2.8 and log(στ) = 0.4; (iii) simulation of the value of each sample (n) of the dataset using a skewed normal distribution with mean µ2 = exp(µ1), standard deviation σ2 = (exp(µ1)/1.3) × τ1 and skewness γ = 2; (iv) replication of this simulation 100 times for each sample size n = 50, 75, 100, 500 and 1000. The logarithm of the variance according to the logarithm of the mean count in TPM (Transcripts Per Million) is shown in supplementary data (eFigure 1) for the simulated data and the original SCANDARE data.
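A minimal R sketch of steps (i)–(iii) is given below. It assumes the lognormal dispersion is parameterized with meanlog = 2.8 and sdlog = 0.4, and uses fGarch::rsnorm with its shape parameter set to 2 as one possible skewed-normal implementation; these choices are ours and the original parameterization may differ (the actual code is available in the repository linked in the Data availability section).

```r
# Transcriptomic simulation sketch (parameterization assumed, see lead-in).
library(fGarch)  # provides rsnorm()

set.seed(1)
n_transcripts <- 15000
n <- 100

mu1  <- rnorm(n_transcripts, mean = 2.3, sd = 1.8)          # (i) mean expression (log scale)
tau1 <- rlnorm(n_transcripts, meanlog = 2.8, sdlog = 0.4)   # (ii) dispersion

# (iii) per-sample values, one column per transcript
counts <- sapply(seq_len(n_transcripts), function(j) {
  rsnorm(n, mean = exp(mu1[j]), sd = (exp(mu1[j]) / 1.3) * tau1[j], xi = 2)
})
log_counts <- log1p(pmax(counts, 0))   # log transform; truncation at 0 is our simplification
```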
2.2.4. Time-to-event simulation
We estimated the coefficients used for the clinical covariates (eTable 1) based on the SCANDARE head and neck cancer cohort with a non-penalized multivariable Cox regression. The coefficients βAge, βSex, βTNM, βHPV were respectively 0.02, −0.8, 0.3 and −0.5. For TNM, a linear relation with the log hazard was assumed and only one coefficient was needed. For each dataset, we assumed that 200 out of the 15,000 transcripts were associated with the risk of cancer recurrence through a Cox regression model, with 100 coefficients generated from a uniform U(−0.1, −0.01) distribution and 100 coefficients generated from a uniform U(0.01, 0.1) distribution, randomly assigned to the logarithm of the counts. The coefficients for the transcriptomic covariates were applied on a logarithmic continuous scale.
We generated individual event times [17] according to the survival model, using the estimated cumulative baseline hazard H0(t) of the SCANDARE head and neck cohort and the coefficients β, with the inversion method:

T_i = H0^(−1)( −log(U_i) · exp(−X_i^T β) ), with U_i ~ Uniform(0, 1),

where X_i = (X_i1, …, X_ip)^T is the vector of covariates for patient i and β = (β_1, …, β_p) the vector of associated coefficients.
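A minimal R sketch of this step, under our own simplifying choices (illustrative object names; H0_time and H0_value stand for the estimated cumulative baseline hazard; clinical and log_counts come from the sketches above; censoring is not shown):

```r
# Coefficients: clinical effects plus 200 informative transcripts out of 15,000
set.seed(1)
beta_clin  <- c(age = 0.02, sex = -0.8, tnm = 0.3, hpv = -0.5)
idx        <- sample(seq_len(15000), 200)
beta_trans <- numeric(15000)
beta_trans[idx] <- c(runif(100, -0.1, -0.01), runif(100, 0.01, 0.1))

# Linear predictor X_i' beta
lp <- as.matrix(clinical[, names(beta_clin)]) %*% beta_clin + log_counts %*% beta_trans

# Inversion method: draw U ~ Uniform(0,1) and solve H0(T) = -log(U) * exp(-lp)
u  <- runif(nrow(clinical))
h  <- as.numeric(-log(u) * exp(-lp))
surv_time <- approx(x = H0_value, y = H0_time, xout = h,
                    method = "constant", rule = 2)$y
# Censoring would then be applied to obtain the observed follow-up time and the
# DFS event indicator ('status' in the sketches below).
```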
The data-generating mechanism was consistent across all 100 replicates for each scenario (n = 50, 75, 100, 500 and 1000) as well as for the independent dataset (n = 1000).
2.3. Outcome
In this study, we selected disease-free survival (DFS) as the outcome of interest for our statistical analysis. DFS was defined as the time from the start of treatment until death from any cause or local, regional, or metastatic recurrence, which was coded as event 1 when it occurred. Time horizons of 3 and 5 years were fixed for the dynamic simulation.
2.4. Model development
The model development aimed to estimate the individual probability of recurrence or death, as assessed by DFS. For each dataset, we fitted a Cox model with an elastic net penalty with a mixing parameter α equal to 0.05 (ridge-like penalty), 0.5 (elastic net) and 0.95 (lasso-like penalty). The penalty λ was set using ten-fold cross-validation with the cv.glmnet function of the glmnet R package, and the best model was selected on the entire dataset before internal validation testing [10].
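A minimal sketch of this development step with glmnet (object names carried over from the sketches above; the DFS event indicator 'status' is a placeholder here, not the actual censoring mechanism):

```r
library(glmnet)

x <- cbind(as.matrix(clinical), log_counts)            # clinical + transcriptomic predictors
status <- rbinom(nrow(clinical), size = 1, prob = 0.7) # placeholder event indicator (illustrative only)
y <- cbind(time = surv_time, status = status)          # survival response for family = "cox"

cv_fit <- cv.glmnet(x, y, family = "cox", alpha = 0.95, nfolds = 10)  # lasso-like penalty
best_lambda <- cv_fit$lambda.min                        # or lambda.1se, as in the application section
coefs <- coef(cv_fit, s = best_lambda)                  # sparse vector of selected coefficients
```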
2.5. Internal validation method
The train-test validation was implemented by randomly dividing the original dataset into two subsamples: a training set (used for model derivation) and a testing set (used for model validation). The individual predicted survival probabilities were estimated on the testing set. We also employed the conventional bootstrap estimator [3], [9], using a fixed number of iterations. In this approach, the training set is the bootstrap sample, and the testing set is the original dataset for each iteration, with performance metrics evaluated on the testing set. Additionally, we implemented the 0.632 + bootstrap estimator described by Efron and Tibshirani [18].
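For reference, the 0.632 + estimator of Efron and Tibshirani [18] combines the apparent (resubstitution) error err_app with the leave-one-out bootstrap error Err(1) through a data-driven weight; in its standard form:

Err(0.632+) = (1 − w) · err_app + w · Err(1), with w = 0.632 / (1 − 0.368 · R) and R = (Err(1) − err_app) / (γ − err_app),

where γ is the no-information error rate and R the relative overfitting rate.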
For the k-fold cross-validation, as commonly performed in the literature [14], the dataset was divided into K equal-sized subsamples. K−1 subsamples were used as the training set, and the remaining subsample served as the testing set for assessing performance metrics. This process was repeated for each of the K subsamples. Lastly, we employed nested cross-validation as described in the literature [19], [20]. This method involves tuning the penalty parameter λ within inner loops to optimize the model, followed by applying the optimized model to the outer testing set to evaluate predictive performance. This procedure was repeated multiple times across the K outer folds.
For all methods, the optimal λ was determined using ten-fold cross-validation on the training set with the cv.glmnet function of the glmnet package prior to performing the validation procedure. The specific parameters used for each method of internal validation were:
i. Train-test partition: The dataset was randomly split into training (70 %) and testing (30 %) sets. Individual predictions were calculated on the testing set.
ii. Conventional bootstrap: After determining λ, 100 bootstrap iterations were performed. Predictions were generated using the conventional bootstrap method [3], [12], [21] with the timeROC and pec R packages.
iii. 0.632 + bootstrap: After determining λ, 100 bootstrap iterations were performed. Predictions were obtained using the 0.632 + bootstrap estimator, and performance was evaluated with the timeROC and pec R packages.
iv. K-fold cross-validation: We performed K-fold cross-validation with λ optimization. We fixed K = 5 with 10 repetitions and obtained the predictions using the test samples of each fold and repetition.
v. Nested cross-validation: We performed 5-times repeated nested cross-validation with K = 5 outer folds and J = 5 inner folds (a minimal sketch is given after this list). In the outer loop, the dataset was divided into K = 5 folds; in each iteration, one fold was set aside as the testing set while the remaining K−1 folds served as the training set. Within the training set of each outer iteration, an inner loop with J = 5 folds was used to tune the hyperparameter λ by cross-validation: in each inner iteration, J−1 folds were used to train the model and the remaining fold was used to validate and evaluate performance for hyperparameter tuning. The "best model" refers to the model trained on the entire training set of the outer loop using the optimal λ determined from the inner loop. This best model was then applied to the testing set of the outer loop to compute individual predictions and evaluate the performance metrics.
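A minimal R sketch of this nested procedure for one repetition, under our own simplifications (random fold assignment, no stratification; x and y as in the model development sketch):

```r
# Nested 5 x 5 cross-validation: lambda is tuned in the inner loop (cv.glmnet)
# and performance is assessed on the held-out outer fold.
library(glmnet)

nested_cv_lp <- function(x, y, alpha = 0.95, K = 5, J = 5) {
  outer_folds <- sample(rep(seq_len(K), length.out = nrow(x)))
  lp_test <- numeric(nrow(x))                       # out-of-fold linear predictors
  for (k in seq_len(K)) {
    train <- outer_folds != k
    # inner loop: tune lambda on the outer-loop training set only
    inner_cv <- cv.glmnet(x[train, ], y[train, ], family = "cox",
                          alpha = alpha, nfolds = J)
    # the cv.glmnet fit on the full outer training set is used at the selected lambda
    lp_test[!train] <- predict(inner_cv, newx = x[!train, ],
                               s = "lambda.min", type = "link")
  }
  lp_test  # AUC(t), C-index and IBS are then computed on these held-out predictions
}
```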
An independent validation of the best models obtained after the internal validation process was performed using the independent simulated dataset (n = 1000), not used in the internal validation process, to assess predictive performance.
2.6. Estimands and performance metrics
We assessed the predictive performance of the estimated models with the time-dependent area under the curve (AUC) and the C-index as measures of model discrimination. For each dynamic AUC analysis, we used a cumulative/dynamic estimator of the time-dependent AUC [22], [23], [24]. The AUC ranges from 0.5 (uninformative model) to 1 (perfect discrimination). The time-dependent cumulative/dynamic AUC between time 0 and a given horizon time was obtained with the timeROC R package and averaged over the 100 replicates for each scenario. Horizon times ranged from 1 to 5 years in 0.1-year increments, and the standard deviation across replicates was estimated at the 3-year horizon. The C-index and its standard deviation were estimated using the R package survcomp. To compare the variance of the 3-year AUC across replicates between the internal validation strategies, we conducted a homogeneity of variance test using Bartlett's test with the null hypothesis of equal variances. If the p-value was less than 0.05, adjusted pairwise comparisons using Fisher's test were conducted. Pairwise equality of variance indicates comparable stability between two methods according to the model development approach and sample size.
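As an illustration, the cumulative/dynamic AUC(t) could be computed on held-out predictions as follows (a sketch with assumed object names; lp_test denotes out-of-fold linear predictors, surv_time and status the observed follow-up time and DFS event indicator):

```r
# Time-dependent cumulative/dynamic AUC with the timeROC package
library(timeROC)

horizons <- seq(1, 5, by = 0.1)
roc_t <- timeROC(T      = surv_time,      # observed follow-up time
                 delta  = status,         # DFS event indicator
                 marker = lp_test,        # predicted risk score (higher = worse)
                 cause  = 1,
                 times  = horizons,
                 iid    = FALSE)
auc_3y <- roc_t$AUC[which.min(abs(horizons - 3))]   # 3-year AUC
```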
We also evaluated calibration performance using the integrated Brier Score (IBS) estimated with inverse probability of censoring weights (IPCW) [25], [26], computing the individual survival probabilities with the pec package from time 0 to 3 years.
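A corresponding sketch for the IPCW integrated Brier score with pec, assuming surv_prob is a matrix of predicted survival probabilities (rows = patients, columns = evaluation times); object names are illustrative:

```r
# IPCW integrated Brier score from 0 to 3 years with the pec package
library(pec)
library(survival)

eval_times <- seq(0, 3, by = 0.1)
dat  <- data.frame(time = surv_time, status = status)
perr <- pec(object     = list(model = surv_prob),    # matrix of predicted S(t | x)
            formula    = Surv(time, status) ~ 1,     # censoring model: marginal Kaplan-Meier
            data       = dat,
            times      = eval_times,
            cens.model = "marginal",
            exact      = FALSE)
ibs_3y <- crps(perr, times = 3)                       # integrated Brier score up to 3 years
```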
2.6.1. Measure of bias
We obtained the oracle AUC by computing linear predictors with the coefficients generated for the data simulation on an independent simulated dataset. We computed the oracle 3-year IBS using the same linear predictors as in the oracle AUC calculation. The oracle performances represent the best possible model outcomes, obtained by using all true coefficients from the simulation process; this oracle model serves as a benchmark that should not be surpassed. We also used an independent validation approach with the independent simulated dataset not used for the internal validation strategies, measuring the AUC(t) and the 3-year IBS with the hyperparameters of the models (α and λ) fixed prior to internal validation, and within the nested cross-validation approach.
We measured optimism bias as the difference between the oracle AUC and the mean AUC obtained with each internal validation strategy: train-test validation, conventional bootstrap, 0.632 + bootstrap, k-fold cross-validation, and nested cross-validation.
2.6.2. Application
We assessed the internal validation methods (train-test validation, conventional bootstrap, 0.632 + bootstrap, repeated K-fold cross-validation and nested cross-validation) on the original SCANDARE clinical and transcriptomic dataset of n = 76 samples. We fitted a lasso-like regularization with α = 0.95 and λ optimization on the entire dataset, incorporating clinical covariates (age, sex, HPV status, TNM staging) and 14,924 transcriptomic variables after logarithmic transformation of the TPM-normalized counts. We measured both the AUC(t) and the 3-year IBS to evaluate the performance of each internal validation method.
3. Results
In this benchmark study, we present the performance of each internal validation strategy by model development regularization, in terms of discrimination and calibration. The non-convergence rate after model selection by penalized regression regularization is reported in eTable 1.
3.1. Time-dependent discriminative performance
The 3-year AUC values are reported in Table 1. The mean 3-year AUC of the independent validation according to sample size was 0.50 [0.48–0.53] (n = 100), 0.55 [0.50–0.58] (n = 500) and 0.60 [0.57–0.63] (n = 1000). The oracle 3-year AUC using the simulation parameters was 0.82.
Table 1.
Three-year AUC according to internal validation strategy and penalized model.
| Validation | Regularization | n = 50 AUC (SD) | n = 50 C-Index (SD) | n = 75 AUC (SD) | n = 75 C-Index (SD) | n = 100 AUC (SD) | n = 100 C-Index (SD) | n = 500 AUC (SD) | n = 500 C-Index (SD) | n = 1000 AUC (SD) | n = 1000 C-Index (SD) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| TTV | Lasso | 0.52 (0.17) | 0.52 (0.13) | 0.50 (0.14) | 0.51 (0.10) | 0.50 (0.12) | 0.50 (0.09) | 0.54 (0.06) | 0.54 (0.05) | 0.58 (0.04) | 0.57 (0.03) |
| | Enet | 0.50 (0.17) | 0.50 (0.15) | 0.51 (0.12) | 0.52 (0.10) | 0.50 (0.11) | 0.50 (0.08) | 0.55 (0.05) | 0.54 (0.05) | 0.58 (0.04) | 0.57 (0.03) |
| | Ridge | 0.53 (0.18) | 0.52 (0.14) | 0.52 (0.13) | 0.52 (0.10) | 0.51 (0.11) | 0.51 (0.08) | 0.55 (0.05) | 0.54 (0.05) | 0.58 (0.04) | 0.57 (0.03) |
| BT | Lasso | 0.85 (0.03) | 0.81 (0.03) | 0.84 (0.04) | 0.79 (0.04) | 0.82 (0.04) | 0.77 (0.04) | 0.77 (0.07) | 0.72 (0.06) | 0.73 (0.06) | 0.69 (0.05) |
| | Enet | 0.88 (0.03) | 0.83 (0.03) | 0.87 (0.04) | 0.81 (0.04) | 0.84 (0.04) | 0.79 (0.04) | 0.77 (0.07) | 0.72 (0.06) | 0.74 (0.06) | 0.70 (0.05) |
| | Ridge | 0.91 (0.02) | 0.86 (0.03) | 0.90 (0.03) | 0.84 (0.03) | 0.88 (0.03) | 0.82 (0.04) | 0.79 (0.08) | 0.74 (0.07) | 0.75 (0.07) | 0.71 (0.06) |
| BT632 + | Lasso | 0.51 (0.05) | 0.51 (0.04) | 0.50 (0.04) | 0.50 (0.03) | 0.50 (0.03) | 0.50 (0.02) | 0.53 (0.02) | 0.52 (0.02) | 0.56 (0.02) | 0.55 (0.02) |
| | Enet | 0.51 (0.05) | 0.51 (0.04) | 0.50 (0.04) | 0.50 (0.03) | 0.50 (0.03) | 0.50 (0.03) | 0.53 (0.02) | 0.52 (0.02) | 0.56 (0.02) | 0.55 (0.02) |
| | Ridge | 0.51 (0.07) | 0.51 (0.06) | 0.51 (0.05) | 0.51 (0.04) | 0.51 (0.04) | 0.50 (0.03) | 0.53 (0.02) | 0.52 (0.02) | 0.56 (0.02) | 0.55 (0.02) |
| CV | Lasso | 0.55 (0.16) | 0.52 (0.07) | 0.53 (0.11) | 0.52 (0.08) | 0.52 (0.10) | 0.52 (0.08) | 0.56 (0.04) | 0.57 (0.05) | 0.61 (0.03) | 0.60 (0.02) |
| | Enet | 0.57 (0.15) | 0.54 (0.07) | 0.53 (0.11) | 0.53 (0.08) | 0.52 (0.09) | 0.51 (0.08) | 0.57 (0.05) | 0.57 (0.06) | 0.61 (0.03) | 0.60 (0.02) |
| | Ridge | 0.51 (0.07) | 0.50 (0.11) | 0.55 (0.11) | 0.54 (0.08) | 0.53 (0.09) | 0.53 (0.08) | 0.57 (0.06) | 0.57 (0.06) | 0.61 (0.02) | 0.60 (0.02) |
| NCV | Lasso | 0.48 (0.14) | 0.49 (0.08) | 0.51 (0.11) | 0.50 (0.06) | 0.50 (0.08) | 0.51 (0.05) | 0.59 (0.05) | 0.58 (0.04) | 0.63 (0.03) | 0.62 (0.03) |
| | Enet | 0.50 (0.12) | 0.50 (0.08) | 0.51 (0.08) | 0.51 (0.07) | 0.50 (0.09) | 0.50 (0.08) | 0.59 (0.05) | 0.58 (0.04) | 0.63 (0.03) | 0.62 (0.03) |
| | Ridge | 0.49 (0.10) | 0.50 (0.09) | 0.50 (0.07) | 0.50 (0.06) | 0.51 (0.07) | 0.51 (0.07) | 0.54 (0.04) | 0.54 (0.05) | 0.57 (0.03) | 0.57 (0.04) |
Abbreviations: BT632 + = 0.632 + bootstrap; BT = conventional bootstrap; CV = K-fold cross-validation; NCV = nested cross-validation; TTV = train-test validation.
* Values are presented as 3-year AUC and C-index with standard deviation (SD).
The time-dependent discriminative AUC for Cox penalized regression according to internal validation method, and its variation over time, is shown in Fig. 2. The AUC(t) between 1 and 5 years was stable for all internal validation strategies across the simulation scenarios, whatever the sample size (n = 50 to n = 1000) and the model development hyperparameter (α = 0.05–0.95).
Fig. 2.
AUC time-dependent discrimination performances by internal validation strategy according to Cox penalized selection model and sample size (n = 50, n = 75, n = 100, n = 500, n = 1000).
The conventional bootstrap method demonstrated the highest AUC(t) in the simulations. Its discriminative performance in the very low sample size setting (n = 50 to n = 100) surpassed that of the oracle model, indicating hyper-optimistic results. In this very low sample size setting (n = 50 to n = 100), the 3-year AUC of the train-test validation, 0.632 + bootstrap and nested cross-validation methods was close to 0.5, as was the discriminative performance of the independent validation set. As the sample size increased from n = 500 to n = 1000, the optimism bias of the conventional bootstrap method became less pronounced, although its 3-year AUC (SD) remained higher than that of the other methods (Table 1), while the discriminative performance of the independent validation set improved.
3.2. Calibration performance
Table 2 shows the 3-year IBS according to scenario, regularization and internal validation method. The oracle 3-year integrated Brier Score was 0.159.
Table 2.
Three-year IBS according to internal validation strategy and penalized model.
| Validation | Regularization | n = 50 IBS (SD)* | n = 75 IBS (SD)* | n = 100 IBS (SD)* | n = 500 IBS (SD)* | n = 1000 IBS (SD)* |
|---|---|---|---|---|---|---|
| TTV | Lasso | 0.156 (0.033) | 0.162 (0.027) | 0.165 (0.022) | 0.166 (0.010) | 0.167 (0.008) |
| | Enet | 0.156 (0.034) | 0.161 (0.028) | 0.165 (0.022) | 0.166 (0.009) | 0.167 (0.008) |
| | Ridge | 0.156 (0.034) | 0.162 (0.027) | 0.165 (0.022) | 0.166 (0.009) | 0.167 (0.008) |
| BT | Lasso | 0.131 (0.015) | 0.136 (0.015) | 0.139 (0.012) | 0.144 (0.011) | 0.149 (0.008) |
| | Enet | 0.128 (0.016) | 0.132 (0.013) | 0.136 (0.013) | 0.144 (0.012) | 0.148 (0.009) |
| | Ridge | 0.123 (0.013) | 0.127 (0.014) | 0.131 (0.012) | 0.142 (0.012) | 0.147 (0.010) |
| BT632 + | Lasso | 0.150 (0.016) | 0.153 (0.014) | 0.157 (0.011) | 0.160 (0.006) | 0.160 (0.004) |
| | Enet | 0.150 (0.016) | 0.153 (0.014) | 0.157 (0.011) | 0.160 (0.010) | 0.160 (0.004) |
| | Ridge | 0.150 (0.016) | 0.153 (0.014) | 0.157 (0.011) | 0.160 (0.006) | 0.160 (0.004) |
| CV | Lasso | 0.126 (0.014) | 0.143 (0.015) | 0.149 (0.011) | 0.151 (0.007) | 0.157 (0.004) |
| | Enet | 0.125 (0.015) | 0.142 (0.014) | 0.148 (0.012) | 0.149 (0.007) | 0.157 (0.004) |
| | Ridge | 0.141 (0.015) | 0.142 (0.014) | 0.148 (0.011) | 0.146 (0.006) | 0.157 (0.004) |
| NCV | Lasso | 0.151 (0.019) | 0.154 (0.016) | 0.159 (0.013) | 0.158 (0.006) | 0.157 (0.005) |
| | Enet | 0.152 (0.019) | 0.154 (0.016) | 0.157 (0.014) | 0.158 (0.006) | 0.157 (0.004) |
| | Ridge | 0.154 (0.018) | 0.155 (0.016) | 0.158 (0.013) | 0.160 (0.007) | 0.159 (0.004) |
Abbreviations: BT632 + = 0.632 + bootstrap; BT = conventional bootstrap; CV = K-fold cross-validation; NCV = nested cross-validation; TTV = train-test validation.
* Values are presented as 3-year IBS with standard deviation (SD).
The choice of penalized selection method had only a minor impact on calibration, with slight variation across scenarios. Internal validation influenced calibration in a manner similar to its effect on discriminative performance. The train-test method exhibited the poorest performance, with a higher 3-year IBS (SD) than the other methods for all sample sizes from n = 50 to n = 1000 and for all penalized regression regularizations. Calibration performance in bootstrap validation declined with increasing sample size, highlighting its susceptibility to hyper-optimism, particularly in high-dimensional, low-sample-size settings (n = 50 to n = 100). The K-fold cross-validation method exhibited a more modest variation in performance, with a slight decline as the sample size increased from n = 100 to n = 500. In contrast, the nested cross-validation method maintained stable calibration performance across sample sizes (n = 50 to n = 1000).
3.3. Variance and stability
The homogeneity tests of the variance of the 3-year AUC comparing all internal validation strategies, according to the regularization of penalized regression and the sample size, are reported in eTable 3 and were statistically significant (p < 0.05) for all combinations of regularization and sample size. The Fisher adjusted pairwise comparisons for all simulation scenarios are presented in eTable 4, showing lower variance for the k-fold cross-validation and nested cross-validation approaches (p > 0.05) at sample sizes n = 50, 75, 100 and 500 for both lasso-like and elastic-net regularization. The train-test validation approach showed a higher variance than the other internal validation strategies, underscoring its instability. In contrast, k-fold cross-validation exhibited stable performance over time and across regularization methods. Nested cross-validation, however, exhibited variability depending on the model development regularization, but was consistent across sample sizes except for ridge-like regression with n = 1000 (supplementary material: eTable 4), where the nested cross-validation approach showed a higher variance.
We summarized the performance of the internal validation process in a combined plot (Fig. 3) according to scenario sample size, regularization of penalized regression, internal validation approach, 3-year AUC and 3-year IBS. Performance estimates and their variation were most consistent for K-fold cross-validation and nested cross-validation across regularizations and sample sizes from n = 50 to n = 1000.
Fig. 3.
Combined plot representing the 3-year AUC, the 3-year IBS and their standard deviations according to sample size, internal validation method and regularization.
3.4. Application on SCANDARE dataset
We estimated the coefficients using the n = 76 samples with the lasso-like approach (α = 0.95) for variable selection. We included a matrix of 4 clinical variables (Age, Sex, HPV status, TNM staging) and 14,928 transcripts normalized on the log-TPM scale for model development.
The AUC(t) of the internal validation of the lasso-like model is shown in Fig. 4. Among the validation strategies, the conventional bootstrap method demonstrated the highest discriminative performance, followed by k-fold cross-validation and then nested cross-validation. Train-test validation showed unstable AUC(t) over time.
Fig. 4.
AUC(t) according to internal validation method of the lasso-like model using SCANDARE dataset.
Table 3 shows the 3-year IBS of the lasso-like model for the five internal validation methods. The conventional bootstrap and k-fold cross-validation showed the best calibration performance. The train-test method was prone to a greater discrepancy between the predicted event probabilities and the actual observations.
Table 3.
Three-year IBS according to internal validation strategy and penalized model of SCANDARE lasso-like model.
| Validation | 3-year IBS * |
|---|---|
| TTV | 0.187 |
| Boot | 0.144 |
| 0.632 + Boot | 0.167 |
| CV | 0.154 |
| Nested CV | 0.167 |
Abbreviation: 0.632 +Boot = 0.632 + Bootstrap; Boot = Conventional Bootstrap; CV = K-fold cross-Validation; TTV = Train-test validation
The non-zero coefficients and their associated variables were obtained by applying a penalized Cox regression (LASSO) model to the SCANDARE head and neck cancer cohort. Selection was performed at the largest lambda within one standard error of the minimum cross-validated error (λ1se), targeting the prediction of disease-free survival at 3 years. Only variables with non-zero coefficients at this threshold were considered part of the prognostic signature. These selected variables and their corresponding non-zero coefficients are detailed in supplementary material (eTable 5), representing the LASSO-derived multi-marker signature, including 29 transcriptomic covariates, applied for the prediction of disease-free survival at 3 years in the SCANDARE cohort.
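A minimal sketch of this extraction step from a fitted cv.glmnet object (illustrative; cv_fit as in the model development sketch):

```r
# Extract the non-zero coefficients of the lasso-like signature at lambda.1se
m  <- as.matrix(coef(cv_fit, s = "lambda.1se"))   # coefficients at lambda.1se
nz <- m[m[, 1] != 0, 1]                           # named vector of non-zero coefficients
signature <- data.frame(variable = names(nz), coefficient = as.numeric(nz))
```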
4. Discussion
In this study, we aimed to compare different internal validation strategies in scenarios where the number of predictors vastly exceeds the sample size, focusing on right-censored time-to-event outcomes and using all available information for the analysis. These complexities make it challenging to accurately assess both discriminative and calibration performance. We demonstrated the impact of the validation method on predictive performance and the potential risk of overestimating model performance prior to validation on an independent set. This study highlights the necessity of standardizing statistical approaches and collaboration in translational research to ensure the most accurate and reliable prognostic models for use in clinical practice [27], [28].
4.1. Parameters of simulation
In this simulation study, we performed penalized regression for prediction model development, as recommended for high-dimensional settings in the statistical literature [29], [30], [31]. Some model development methods include evaluating model performance through internal validation, which is sometimes inseparable from the model's optimization and development phase. When selection occurs prior to validation, there is a risk of selection bias that must be taken into consideration [32]. The final model must be integrated using appropriate methods and metrics to prevent optimism bias. We chose to independently simulate clinical and transcriptomic variables, a strategy that may reduce the overall performance of the final model and introduce some instability in regularization methods such as ridge regression. This approach was deliberately adopted to highlight the challenges of variable selection while preserving the inherent complexity of high-dimensional data. In this simulation study, we used the most common parameter choices for the internal validation methods, namely a train-test split ratio of 70/30 % and K = 5 folds in the k-fold cross-validation, while considering the computational cost of the simulation process. These choices reflect widely accepted practices in biostatistics and machine learning, where a 70 % training and 30 % testing split is commonly used to optimize model learning and evaluation [33], and 5-fold cross-validation [34] provides a robust yet computationally feasible method for performance assessment.
4.2. Performance metrics and bias
We established criteria for metric evaluation using a time-dependent approach to assess the stability of the internal validation methods. The first criterion involves a discriminative approach using the cumulative/dynamic AUC as recommended by Uno et al. [22], Viallon et al. [23] and Kamarudin et al. [35]. We chose the AUC(t) and the C-index [36] as discriminative metrics; the AUC(t) provides a dynamic assessment of the model's discriminative ability at clinically relevant time points. The second criterion was a calibration measure using the integrated Brier Score [21], [22] to quantify prediction errors over time. We measured the optimism bias using the oracle performances, which represent the best possible model performances obtained with all true coefficients used in the simulation process; this oracle model serves as a benchmark that should not be surpassed. Additionally, we employed an independent validation method to assess the reliability of the internal validation strategies, ensuring they do not overestimate model performance, and we evaluated how closely the performance of the internal validation methods aligned with that of the independent validation process. To ensure a direct comparison between the internal validation strategies, the hyperparameters of the model (α and λ of the glmnet package) were fixed in advance, except for the nested cross-validation approach, which was specifically used to assess the extent of optimism bias introduced by hyperparameter estimation and to ensure an unbiased evaluation of model performance.
4.3. Internal validation methods
One of the challenges in evaluating the performance of a statistical model is the potential dependence on random partitioning. Resampling methods help to mitigate this dependence and the instability associated with random partitioning. However, optimism bias is not addressed in the same way across different resampling methods. Authors recently confirmed that resampling approaches perform better for internally validating the association between a time-dependent binary indicator and a time-to-event outcome [37].
This study confirmed the inadequate performance of the train-test validation method in high-dimensional settings. When the data are split in these contexts, there is an increased risk that the model's λ parameter is tuned on insufficient data, resulting in a strong dependency on the splitting point. We chose a 70/30 % train-test ratio, which is the most common in the literature. The randomization of samples into the training and test sets can itself lead to considerable variability in performance. This simulation study confirmed the negative impact of this method with time-to-event endpoints, as previously described in the literature [7]; it is nevertheless still commonly employed in current research, despite its larger instability in high-dimensional settings.
The conventional bootstrap method consistently demonstrated inherent optimism, irrespective of the regularization method employed [9], [11]. This optimism bias was especially pronounced at very small sample sizes. When α is low, the elastic-net regularization leans more towards ridge regression, which tends to include more non-zero coefficients in the model, potentially increasing this optimism bias and yielding a non-reproducible performance assessment in the independent validation approach. Adding clinical variables to the Cox penalized model reduced the optimism bias, but not sufficiently to avoid an important gap between internal and independent validation performance, particularly at very small sample sizes. With the 0.632 + bootstrap estimator, the internal validation performance appeared inferior to that of the independent validation process. We confirmed, in this particular experimental setting, the risk of underfitting described by Iba et al. [12] for settings with > 10 variables per event in high-dimensional space. Underfitting remains problematic given the difficulty of developing a model with good predictive performance.
The k-fold cross-validation is a robust method for assessing model performance that helps reduce the risk of overfitting. It provides stable and reliable performance estimates and is particularly recommended for logistic regression and other binary outcome models [38]. Recently, k-fold cross-validation approaches were described as a reference for a binary variable with a time-to-event endpoint [37]. Unlike other validation techniques that might be prone to optimistic bias, k-fold cross-validation splits the data multiple times, providing an average assessment of model performance across different subsets. This iterative approach helps capture the true variability of the dataset, leading to consistent estimates and reducing optimism bias [14]. Extensive research has demonstrated that k-fold cross-validation [13], [15], [39], particularly in the context of high-dimensional data, offers a balanced evaluation by preventing the model from becoming overly tailored to any single training subset. This stability is crucial for developing predictive models that generalize well to new, unseen data.
Different implementations of K-fold cross-validation are described in the literature, such as stratified cross-validation when unbalanced variables could affect the modelling [40], or repeated cross-validation (RCV) [41] to improve the stability of the predictive performance. RCV amplifies these benefits by iterating the cross-validation process multiple times, which helps stabilize the performance estimates. Leave-one-out cross-validation [38], [42] is a special case of k-fold cross-validation where k = n, but it suffers from a high computational cost in high-dimensional settings and was therefore not tested in this study. Among the various methods evaluated, the nested cross-validation approach [36], which performs both model development and internal validation simultaneously, is a feasible strategy that demonstrated satisfactory stability in high-dimensional TTE settings, with no significant difference in variance compared with K-fold cross-validation, especially for lasso-like and elastic-net regularization at very small sample sizes (n = 50, 75, 100). To avoid overfitting in model development, it is preferable to treat the selection process as an integral part of the model fitting and to evaluate performance on every subsample [43]. The k-fold cross-validation approaches offer stability across the different regularization pathways of penalized regression; this approach reduces the variance associated with a single cross-validation run and mitigates the risk of overfitting. As reported in Fig. 3, performance estimates and their variation were most consistent for k-fold cross-validation and nested cross-validation.
Performance in settings with very small sample sizes may be poor, highlighting the inherent challenge of achieving robust independent validation for models developed under such constraints.
Key points:
1) K-fold cross-validation produced the most stable internal validation results.
2) Nested cross-validation could be performed as an alternative.
3) Conventional bootstrap and train-test validation carry a risk of overfitting or instability.
4) The 0.632 + bootstrap estimator carries a risk of underfitting.
5) Independent validation of models developed with very small samples in high-dimensional settings is hazardous.
Improving the predictive performance of multi-omics models fundamentally depends on the methods used to integrate diverse data types, yet it should not be contingent on the choice of predictive performance evaluation strategy. In high-dimensional settings, where the number of predictors vastly exceeds the number of observations, penalized regression methods like lasso, ridge, and elastic net are commonly used to manage overfitting and enhance variable selection. However, these methods can sometimes exclude clinically relevant variables if they are penalized alongside other predictors. Integrating omics information using adaptive penalized regression that incorporates appropriate weights, especially for clinical predictors, can increase the power to identify prognostic covariates and improve risk prediction [44], as can hierarchical integration such as priority-Lasso [45] or IPF-Lasso [46] (integrative L1-penalized regression with penalty factors). Including clinical variables in high-dimensional selection appears to significantly improve model performance and reduce the risk of overfitting [9]. Clinical variables often hold substantial predictive power and relevance, and their inclusion with different penalties ensures that these important predictors are retained in the model, regardless of the penalization applied to other high-dimensional data. Adaptive methods and hierarchical regularization aim to preserve their predictive influence, leading to more robust and clinically interpretable models and improving the overall predictive accuracy and stability of the model. This approach leverages the inherent value of clinical variables while effectively managing the inclusion of additional high-dimensional data through regularization techniques.
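As an illustration of differential penalization, glmnet's penalty.factor argument can leave the clinical covariates unpenalized while the transcripts keep the elastic-net penalty; this is a simplified sketch in the spirit of the weighting schemes cited above, not the IPF-Lasso implementation itself (x and y as in the earlier model development sketch):

```r
# Clinical covariates (first 4 columns of x) unpenalized, transcripts penalized
library(glmnet)

n_clin  <- 4                                          # Age, Sex, HPV, TNM
pf      <- c(rep(0, n_clin), rep(1, ncol(x) - n_clin))
fit_adj <- cv.glmnet(x, y, family = "cox", alpha = 0.95,
                     penalty.factor = pf, nfolds = 10)
```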
4.4. Perspectives
We confirmed in this study the importance of resampling methods, such as k-fold cross-validation, to develop and assess the predictive performance of prognostic models integrating clinical and continuous variables. These findings could be further evaluated using different distribution settings of covariates and coefficients in the simulation process, with applications across various cancer types. Future studies should aim to validate these results on larger patient cohorts to enhance the generalizability and robustness of the conclusions. Nested cross-validation offers a good trade-off between bias and stability by developing and assessing the model within the same framework. These frameworks should be developed and implemented with different integration pathways, such as hierarchical integration, Bayesian approaches and multi-source integration, to confirm this stability in different settings. A key challenge in improving modelling quality with multi-omics data lies in the creation and implementation of collaborative databases, such as The Cancer Genome Atlas (TCGA) Research Network [47], that are continuously enriched over time. This approach can help address the high-dimensionality issue by increasing the effective sample size, thereby enhancing the robustness and generalizability of personalized prognosis models developed for clinical management.
4.5. Limits
We aimed to fix the parameters of the simulation using the original dataset of 76 patients in order to perform a realistic simulation. Nevertheless, this relatively small sample size poses limitations, including potentially reduced representativeness and limited statistical power to fully capture the diversity of the real-world patient population. While our objective was not to perfectly replicate real-world distributions, we acknowledge that this limitation could impact the generalizability of our findings to populations with distributions similar to those in our simulation study. Moreover, with this approach we did not include some parameters in our scenarios, such as the impact of censoring variation on the validation strategies or the optimization of the K parameter for the K-fold cross-validation approach. Some parameter variations of the simulation, such as varying K or the number of inner-loop folds, affected computational time and increased the stability of the analyses, and were not reported. The censoring probability of the event was also not assessed in different settings. When combining different feature selection methods, the independent validation performance was inconsistent. This variability is likely due to the challenge of selecting non-zero coefficients and accurately estimating prognostic coefficients when the covariate matrix is high-dimensional and not prefiltered using statistical procedures such as univariable selection. Prior dimensionality reduction is critical to improve model stability and generalizability in high-dimensional omics data analyses [48]. However, as the sample size increased, the signal detected during both the development and internal validation processes was more reliably confirmed through independent validation.
5. Conclusions
In our study, we aimed to compare different methods of internal validation for prognostic models to contribute to the reproducibility of translational research. Our findings suggest that, for penalized models, nested cross-validation offers a balanced approach between model development and internal validation, effectively mitigating optimism bias and providing controlled stability. Repeated k-fold cross-validation emerges as a viable alternative maintaining limited optimism bias and stability prior to independent validation.
While these findings underscore the strengths of variations of cross-validation methods in high-dimensional, time-to-event outcome settings, it is important to acknowledge the limitations of our study. For instance, the simulation scenarios may not capture all complexities of real-world data, and the performance of the evaluated methods could vary with different datasets or settings.
We recommend the application of k-fold cross-validation or nested cross-validation as internal validation approaches for penalized models in high-dimensional contexts, while emphasizing the need for external validation to confirm generalizability across different settings.
CRediT authorship contribution statement
Constance LAMY: Writing – review & editing, Visualization, Validation, Resources, Project administration. Maud KAMAL: Writing – review & editing, Visualization, Resources, Project administration. Ladidi AHMANACHE: Writing – review & editing, Visualization, Validation, Resources. Jerzy KLIJANIENKO: Writing – review & editing, Visualization, Validation, Resources. VACHER sophie: Writing – review & editing, Visualization, Validation, Resources. Célia DUPAIN: Writing – review & editing, Visualization, Validation, Resources, Project administration, Methodology. LE TOURNEAU christophe: Writing – review & editing, Writing – original draft, Visualization, Validation, Supervision, Resources, Project administration, Methodology, Conceptualization. Antoine DUBRAY-VAUTRIN: Writing – review & editing, Writing – original draft, Visualization, Validation, Resources, Methodology, Investigation, Formal analysis, Data curation, Conceptualization. Olivier CHOUSSY: Writing – review & editing, Visualization, Validation, Project administration. Nicolas SERVANT: Writing – review & editing, Visualization, Validation, Resources. Mullaert Jimmy: Writing – review & editing, Writing – original draft, Visualization, Validation, Supervision, Project administration, Methodology, Formal analysis, Conceptualization. Victor GRAVRAND: Writing – review & editing, Visualization, Validation, Resources, Methodology, Investigation, Formal analysis. Marret Grégoire: Writing – review & editing, Visualization, Validation, Resources, Investigation.
Ethics statement
Not applicable.
Consent for publication
Not applicable.
Declaration of Competing Interest
The authors, along with all co-authors (Antoine DUBRAY-VAUTRIN, Victor GRAVRAND, Grégoire MARRET, Constance LAMY, Jerzy KLIJANIENKO, Sophie Vacher, Ladidi AHMANACHE, Maud KAMAL, Olivier CHOUSSY, Nicolas SERVANT, Célia DUPAIN, Christophe LE TOURNEAU, Jimmy MULLAERT), declare no conflicts of interest.
Acknowledgements
We would like to acknowledge the patients, their families, and the multidisciplinary and continuously growing SCANDARE team: Julie Flavius, Tiphaine Batteux, Atikatou Mama, Xavier Paoletti, Anne-Sophie Plissonnier, Amir Kadi, Frédérique Berger, Odette Mariani, Sylvie Jovelin, Céline Méaudre, Julie Celini, Magalie Toutain, Charlotte Martinat, Marine Ledru, Rémi Goudefroye.
SCANDARE is supported by the grant ANR-10-EQPX-03 and ANR-10-INBS-09–08 from the Agence Nationale de la Recherche and by the Cancéropôle Ile-de-France.
Footnotes
Supplementary data associated with this article can be found in the online version at doi:10.1016/j.csbj.2025.08.035.
Appendix A. Supplementary material
Data availability
The parameters, functions, codes for data generation and results of the current study are available in the GitHub repository using the link: https://github.com/adubrayv/Internal_validation.git
References
- 1.Hagan T., Pulendran B. Will systems biology deliver its promise and contribute to the development of new or improved vaccines? From data to understanding through systems biology. Cold Spring Harb Perspect Biol. 2018;10:a028894. doi: 10.1101/cshperspect.a028894. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2.Hyman D.M., Solit D.B., Arcila M.E., Cheng D., Sabbatini P., Baselga J., et al. Precision Medicine at memorial sloan kettering cancer center: clinical next-generation sequencing enabling next-generation targeted therapy trials. Drug Discov Today. 2015;20:1422–1428. doi: 10.1016/j.drudis.2015.08.005. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.Harrell F.E., Lee K.L., Mark D.B. Multivariable prognostic models: issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Stat Med. 1996;15:361–387. doi: 10.1002/(SICI)1097-0258(19960229)15:4<361::AID-SIM168>3.0.CO;2-4. [DOI] [PubMed] [Google Scholar]
- 4.Salerno S., Li Y. High-Dimensional survival analysis: methods and applications. Annu Rev Stat Appl. 2023;10:25–49. doi: 10.1146/annurev-statistics-032921-022127. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Fan J., Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. J Am Stat Assoc. 2001;96:1348–1360. doi: 10.1198/016214501753382273. [DOI] [Google Scholar]
- 6.Subramanian J., Simon R. Overfitting in prediction models – is it a problem only in high dimensions? Contemp Clin Trials. 2013;36:636–641. doi: 10.1016/j.cct.2013.06.011. [DOI] [PubMed] [Google Scholar]
- 7.Ramspek C.L., Jager K.J., Dekker F.W., Zoccali C., Diepen M. van. External validation of prognostic models: what, why, how, when and where? Clin Kidney J. 2020;14:49. doi: 10.1093/ckj/sfaa188. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.Riley R.D., Snell K.I.E., Martin G.P., Whittle R., Archer L., Sperrin M., et al. Penalization and shrinkage methods produced unreliable clinical prediction models especially when sample size was small. J Clin Epidemiol. 2021;132:88–96. doi: 10.1016/j.jclinepi.2020.12.005. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Steyerberg E.W. Clinical Prediction Models: A Practical Approach to Development, Validation, and Updating. Springer; 2009. doi: 10.1007/978-0-387-77244-8. [DOI] [Google Scholar]
- 10.Coley R.Y., Liao Q., Simon N., Shortreed S.M. Empirical evaluation of internal validation methods for prediction in large-scale clinical data with rare-event outcomes: a case study in suicide risk prediction. BMC Med Res Method. 2023;23:33. doi: 10.1186/s12874-023-01844-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11.Musoro J.Z., Zwinderman A.H., Puhan M.A., ter Riet G., Geskus R.B. Validation of prediction models based on lasso regression with multiply imputed data. BMC Med Res Method. 2014;14:116. doi: 10.1186/1471-2288-14-116. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12.Iba K., Shinozaki T., Maruo K., Noma H. Re-evaluation of the comparative effectiveness of bootstrap-based optimism correction methods in the development of multivariable clinical prediction models. BMC Med Res Methodol. 2021;21:9. doi: 10.1186/s12874-020-01201-w. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Subramanian J., Simon R. An evaluation of resampling methods for assessment of survival risk prediction in high-dimensional settings. Stat Med. 2011;30:642–653. doi: 10.1002/sim.4106. [DOI] [PubMed] [Google Scholar]
- 14.Simon R.M., Subramanian J., Li M.-C., Menezes S. Using cross-validation to evaluate predictive accuracy of survival risk classifiers based on high-dimensional data. Brief Bioinform. 2011;12:203–214. doi: 10.1093/bib/bbr001. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Tsamardinos I., Rakhshani A., Lagani V. In: Artificial Intelligence: Methods and Applications. Likas A., Blekas K., Kalles D., editors. Springer International Publishing; Cham: 2014. Performance-Estimation properties of Cross-Validation-Based protocols with simultaneous Hyper-Parameter optimization; pp. 1–14. [DOI] [Google Scholar]
- 16.Tsamardinos I. Don’t lose samples to estimation. Patterns (N Y) 2022;3 doi: 10.1016/j.patter.2022.100612. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.Bender R., Augustin T., Blettner M. Generating survival times to simulate cox proportional hazards models. Stat Med. 2005;24:1713–1723. doi: 10.1002/sim.2059. [DOI] [PubMed] [Google Scholar]
- 18.Efron B., Tibshirani R. Improvements on Cross-Validation: The.632+ bootstrap method. J Am Stat Assoc. 1997;92:548–560. doi: 10.2307/2965703. [DOI] [Google Scholar]
- 19.Herrmann M., Probst P., Hornung R., Jurinovic V., Boulesteix A.-L. Large-scale benchmark study of survival prediction methods using multi-omics data. Brief Bioinforma. 2021;22 doi: 10.1093/bib/bbaa167. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20.Lewis M.J., Spiliopoulou A., Goldmann K., Pitzalis C., McKeigue P., Barnes M.R. Nestedcv: an r package for fast implementation of nested cross-validation with embedded feature selection designed for transcriptomics and high-dimensional data. Bioinform Adv. 2023;3 doi: 10.1093/bioadv/vbad048. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21.Steyerberg E.W., Harrell F.E., Borsboom G.J.J.M., Eijkemans M.J.C., Vergouwe Y., Habbema J.D.F. Internal validation of predictive models: efficiency of some procedures for logistic regression analysis. J Clin Epidemiol. 2001;54:774–781. doi: 10.1016/S0895-4356(01)00341-9. [DOI] [PubMed] [Google Scholar]
- 22.Uno H., Cai T., Tian L., Wei L.J. Evaluating prediction rules for t-Year survivors with censored regression models. J Am Stat Assoc. 2007;102:527–537. doi: 10.1198/016214507000000149. [DOI] [Google Scholar]
- 23.Viallon V., Latouche A. Discrimination measures for survival outcomes: connection between the AUC and the predictiveness curve. Biom J. 2011;53:217–236. doi: 10.1002/bimj.201000153. [DOI] [PubMed] [Google Scholar]
- 24.Blanche P., Dartigues J.-F., Jacqmin-Gadda H. Review and comparison of ROC curve estimators for a time-dependent outcome with marker-dependent censoring. Biom J. 2013;55:687–704. doi: 10.1002/bimj.201200045. [DOI] [PubMed] [Google Scholar]
- 25.Gerds T.A., Schumacher M. Consistent estimation of the expected brier score in general survival models with right-censored event times. Biom J. 2006;48:1029–1040. doi: 10.1002/bimj.200610301. [DOI] [PubMed] [Google Scholar]
- 26.Schmid M., Hielscher T., Augustin T., Gefeller O. A robust alternative to the schemper-henderson estimator of prediction error. Biometrics. 2011;67:524–535. doi: 10.1111/j.1541-0420.2010.01459.x. [DOI] [PubMed] [Google Scholar]
- 27.George S.L. Statistical issues in translational cancer research. Clin Cancer Res. 2008;14:5954–5958. doi: 10.1158/1078-0432.CCR-07-4537. [DOI] [PubMed] [Google Scholar]
- 28.Xu W., Huang S.H., Su J., Gudi S. O’Sullivan B. Statistical fundamentals on cancer research for clinicians: working with your statisticians. Clin Transl Radiat Oncol. 2021;27:75–84. doi: 10.1016/j.ctro.2021.01.006. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 29.Tibshirani R. The lasso method for variable selection in the cox model. Stat Med. 1997;16:385–395. doi: 10.1002/(sici)1097-0258(19970228)16:4<385::aid-sim380>3.0.co;2-3. [DOI] [PubMed] [Google Scholar]
- 30.Ambler G., et al. An evaluation of penalised survival methods for developing prognostic models with rare events. Stat Med. 2012. doi: 10.1002/sim.4371. 〈https://onlinelibrary.wiley.com/doi/abs/10.1002/sim.4371〉 [DOI] [PubMed]
- 31.Pavlou M., Ambler G., Seaman S.R., Guttmann O., Elliott P., King M., et al. How to develop a more accurate risk prediction model when there are few events. BMJ. 2015;351:h3868. doi: 10.1136/bmj.h3868. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 32.Ambroise C., McLachlan G.J. Selection bias in gene extraction on the basis of microarray gene-expression data. Proc Natl Acad Sci. 2002;99:6562–6566. doi: 10.1073/pnas.102102699. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 33.Gholamy A., Kreinovich V., Kosheleva O. Why 70/30 or 80/20 relation between training and testing sets: a pedagogical explanation. Departmental Technical Reports (CS). 2018. [Google Scholar]
- 34.Pang H., Jung S.-H. Sample size considerations of Prediction-Validation methods in High-Dimensional data for survival outcomes. Genet Epidemiol. 2013;37:276–282. doi: 10.1002/gepi.21721. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 35.Kamarudin A.N., Cox T., Kolamunnage-Dona R. Time-dependent ROC curve analysis in medical research: current methods and applications. BMC Med Res Methodol. 2017;17:53. doi: 10.1186/s12874-017-0332-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 36.Hartman N., Kim S., He K., Kalbfleisch J.D. Pitfalls of the concordance index for survival outcomes. Stat Med. 2023;42:2179–2190. doi: 10.1002/sim.9717. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 37.Falvey C.A., Todd J.L., Neely M.L. Evaluating the performance of a resampling approach for internally validating the association between a time-dependent binary indicator and time-to-event outcome. J Biopharm Stat. 2025:1–11. doi: 10.1080/10543406.2025.2489293. [DOI] [PubMed]
- 38.Geroldinger A., Lusa L., Nold M., Heinze G. Leave-one-out cross-validation, penalization, and differential bias of some prediction model performance measures-a simulation study. Diagn Progn Res. 2023;7:9. doi: 10.1186/s41512-023-00146-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 39.Bertrand F., Maumy-Bertrand M. Fitting and Cross-Validating cox models to censored big data with missing values using extensions of partial least squares regression models. Front Big Data. 2021;4 doi: 10.3389/fdata.2021.684794. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 40.T r M, V V.K., V D.K., Geman O., Margala M., Guduri M. The stratified K-folds cross-validation and class-balancing methods with high-performance ensemble classifiers for breast cancer classification. Healthc Anal. 2023;4 doi: 10.1016/j.health.2023.100247. [DOI] [Google Scholar]
- 41.Kim J.-H. Estimating classification error rate: repeated cross-validation, repeated hold-out and bootstrap. Comput Stat Data Anal. 2009;53:3735–3745. doi: 10.1016/j.csda.2009.04.009. [DOI] [Google Scholar]
- 42.Cheng H., Garrick D.J., Fernando R.L. Efficient strategies for leave-one-out cross validation for genomic best linear unbiased prediction. J Anim Sci Biotechnol. 2017;8:38. doi: 10.1186/s40104-017-0164-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 43.Cawley G.C., Talbot N.L.C. On Over-fitting in model selection and subsequent selection bias in performance evaluation. J Mach Learn Res. 2010;11:2079–2107. [Google Scholar]
- 44.Madjar K., Rahnenführer J. Weighted cox regression for the prediction of heterogeneous patient subgroups. BMC Med Inf Decis Mak. 2021;21:342. doi: 10.1186/s12911-021-01698-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 45.Klau S., Jurinovic V., Hornung R., Herold T., Boulesteix A.-L. Priority-Lasso: a simple hierarchical approach to the prediction of clinical outcome using multi-omics data. BMC Bioinforma. 2018;19:322. doi: 10.1186/s12859-018-2344-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 46.Boulesteix A.-L., De Bin R., Jiang X., Fuchs M. IPF-LASSO: integrative L1-penalized regression with penalty factors for prediction based on multi-omics data. Comput Math Methods Med. 2017;2017 doi: 10.1155/2017/7691937. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 47.Weinstein J.N., Collisson E.A., Mills G.B., Shaw K.R.M., Ozenberger B.A., Ellrott K., et al. The cancer genome Atlas Pan-Cancer analysis project. Nat Genet. 2013;45:1113–1120. doi: 10.1038/ng.2764. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 48.Simon N., Friedman J., Hastie T., Tibshirani R. Regularization paths for Cox’s proportional hazards model via coordinate descent. J Stat Soft. 2011;39 doi: 10.18637/jss.v039.i05. [DOI] [PMC free article] [PubMed] [Google Scholar]