Radiology: Artificial Intelligence

Letter. 2023 Aug 30;5(5):e230202. doi: 10.1148/ryai.230202

A Remark on Using Data Twice in Cross-Validation Schemes

Aydin Demircioğlu
PMCID: PMC10546366  PMID: 37795144

The recent review published in the July 2023 issue of Radiology: Artificial Intelligence, "A Guide to Cross-Validation for Artificial Intelligence in Medical Imaging" (1), provides a concise introduction to cross-validation (CV). This topic is central to obtaining unbiased models and applying machine learning methods in clinical practice.

Among others, the authors introduced two schemes: nested cross-validation (NCV), which can be understood as the reference standard method (2), and a scheme they call the select-shuffle-test (SST). The SST consists of two steps: in the first, the best algorithm is selected using CV; in the second, the data are shuffled and the selected algorithm is tested using another CV. The data are thus used twice, for both model selection and performance estimation, which violates the basic rule that test data should never be used during training. It can therefore be expected that SST introduces bias.
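For illustration, the two SST steps can be sketched as follows. This is a minimal toy sketch, not the published pipeline: the `majority_vote` and `nearest_neighbour` learners and the Gaussian dataset below are hypothetical stand-ins chosen only to make the select-then-shuffle-then-test structure concrete.

```python
import random
from statistics import mean

random.seed(1)

def kfold_cv(data, fit_predict, k=5):
    """Mean accuracy of a learner over k cross-validation folds."""
    folds = [data[i::k] for i in range(k)]
    accs = []
    for i in range(k):
        test = folds[i]
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        accs.append(mean(fit_predict(train, x) == y for x, y in test))
    return mean(accs)

# Two illustrative candidate "algorithms" (hypothetical stand-ins
# for the real learners compared in a radiomics pipeline):
def majority_vote(train, x):
    labels = [y for _, y in train]
    return max(set(labels), key=labels.count)

def nearest_neighbour(train, x):
    return min(train, key=lambda row: abs(row[0] - x))[1]

algorithms = {"majority": majority_vote, "1-NN": nearest_neighbour}

# Toy one-dimensional dataset: two overlapping Gaussian classes.
data = [(random.gauss(0, 1), 0) for _ in range(60)] + \
       [(random.gauss(2, 1), 1) for _ in range(60)]
random.shuffle(data)

# Step 1 (select): choose the algorithm with the best CV accuracy.
best_name = max(algorithms, key=lambda name: kfold_cv(data, algorithms[name]))

# Step 2 (shuffle + test): re-shuffle the SAME data and estimate the
# winner's accuracy with a second CV -- the data are used twice.
random.shuffle(data)
sst_estimate = kfold_cv(data, algorithms[best_name])
print(best_name, round(sst_estimate, 3))
```

Note that shuffling changes only the fold assignments, not the fact that every sample used to pick the winner also contributes to its final accuracy estimate.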

To illustrate the issue, we applied the code the authors kindly provided to several radiomics datasets (see https://github.com/aydindemircioglu/CVTesting). We employed three feature selection methods and three classifiers and treated these choices, together with the optimal number of features to select, as hyperparameters. We then compared SST and NCV on 14 datasets in terms of accuracy, repeating the experiment 100 times with varying CV splits and applying a one-sided t test. Indeed, on all datasets, SST showed higher accuracy than NCV. The differences were slight on average (+1.6%) but larger on some datasets (up to +4.1%), and they were statistically significant (all P < .02). It is reasonable to expect that with more hyperparameters, the bias will grow. Because NCV is the reference standard, this implies that SST is biased and should not be used.
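For contrast, the NCV reference can be sketched in the same toy setting. This is a minimal illustration, not the actual experiment: the threshold classifiers standing in for hyperparameter candidates and the Gaussian data below are assumptions made for brevity.

```python
import random
from statistics import mean

random.seed(2)

def make_threshold_clf(t):
    """A stand-in 'model with hyperparameter t': predict 1 iff feature > t."""
    return lambda train, x: int(x > t)

def cv_accuracy(data, fit_predict, k=5):
    """Mean accuracy over k cross-validation folds."""
    folds = [data[i::k] for i in range(k)]
    accs = []
    for i in range(k):
        test = folds[i]
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        accs.append(mean(fit_predict(train, x) == y for x, y in test))
    return mean(accs)

def nested_cv(data, candidates, k=5):
    """Outer CV estimates performance; an inner CV on the outer-training
    folds alone selects the hyperparameter, so the outer test fold never
    influences model selection."""
    folds = [data[i::k] for i in range(k)]
    outer_accs = []
    for i in range(k):
        test = folds[i]
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        best = max(candidates, key=lambda c: cv_accuracy(train, c))
        outer_accs.append(mean(best(train, x) == y for x, y in test))
    return mean(outer_accs)

# Toy one-dimensional dataset: two overlapping Gaussian classes.
data = [(random.gauss(0, 1), 0) for _ in range(60)] + \
       [(random.gauss(2, 1), 1) for _ in range(60)]
random.shuffle(data)

candidates = [make_threshold_clf(t) for t in (-1.0, 0.0, 1.0, 2.0, 3.0)]
ncv_estimate = nested_cv(data, candidates)
print(round(ncv_estimate, 3))
```

The key structural difference from SST is that selection happens entirely inside each outer training set, so no outer test sample is ever seen during selection.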

We want to emphasize that any validation scheme must ensure that the data on which the evaluation is based were not used in any way for model selection. Violating this rule will inevitably lead to false-positive findings. We applaud the efforts of Dr Bradshaw and colleagues to provide guidance on the central issue of CV and hope that similar reviews will be published in the future to ensure reproducible research.

Footnotes

Disclosures of conflicts of interest: A.D. No relevant relationships.

References

1. Bradshaw TJ, Huemann Z, Hu J, Rahmim A. A guide to cross-validation for artificial intelligence in medical imaging. Radiol Artif Intell 2023;5(4):e220232.
2. Varma S, Simon R. Bias in error estimation when using cross-validation for model selection. BMC Bioinformatics 2006;7(1):91.

Response

Tyler J Bradshaw, Zachary Huemann, Junjie Hu, Arman Rahmim

We thank the author for their interest in our article, and we commend them for their efforts in applying the code we provided to compare SST CV and NCV.

We would like to address the concerns raised, first with respect to the principles of CV, and second regarding their empirical results. The author stated that SST violates the rule that training and testing data must be kept independent to avoid overoptimistic results. We would like to clarify that overoptimism can occur for several reasons (see our article). One cause is overfitting, which occurs when a model’s weights are trained to effectively “memorize” a dataset. Another cause is tuning to the test set, in which one model is selected out of many candidate models because it has a random but favorable match between the model’s hyperparameters and the test set’s data distribution. The SST method is not susceptible to overfitting because the training and test sets are always kept independent. We hypothesized that SST could mostly avoid the pitfall of tuning to the test set by shuffling the data between the model selection step and the testing step (ie, changing the test set data distribution).
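The "tuning to the test set" effect described above can be demonstrated with a minimal simulation. This is purely illustrative and makes one loud assumption: the candidate "models" below are random guessers, so each has a true accuracy of exactly 50%, yet the one that happens to score best on a fixed test set appears better than that.

```python
import random

random.seed(0)

N_TEST = 200    # test-set size
N_MODELS = 50   # candidate models, all equally (un)informative

# True labels and, for each candidate model, random predictions:
# by construction, every model's true accuracy is exactly 0.5.
y_test = [random.randint(0, 1) for _ in range(N_TEST)]
models = [[random.randint(0, 1) for _ in range(N_TEST)]
          for _ in range(N_MODELS)]

def accuracy(pred, truth):
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)

# Selecting the candidate that happens to score best on this one test set
# yields an apparent accuracy above the true 50%, purely by chance.
best_acc = max(accuracy(m, y_test) for m in models)
print(f"selected-model test accuracy: {best_acc:.2f} (true accuracy: 0.50)")
```

Re-drawing the test labels after selection (the analogue of shuffling) would bring the selected model's measured accuracy back toward 50%, which is the intuition behind the shuffle step in SST.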

In their analysis, the author found a statistically significant bias when comparing SST to NCV. However, it is important to contextualize these results. First, the bias was quite small, averaging 1.6% in classification accuracy for a set of models with accuracies ranging from 50% to 95%. In other words, SST's performance was very similar to that of NCV, but with a small bias. Second, it should be recognized that many CV methods are known to exhibit biases or large variances (1–3). And while NCV is considered unbiased, it is rarely used due to the computational demands of training large neural networks. A useful comparison would be between SST and other, more practical CV approaches, including k-fold and random-sampling CV, while using more contemporary models. These more common CV methods might also compare unfavorably to NCV.

In summary, we agree that results obtained using SST, as well as from other CV methods, should be interpreted in the context of their bias-variance profiles. Moreover, we encourage additional methodological studies that characterize the bias-variance profiles of various model evaluation techniques.

Footnotes

Disclosures of conflicts of interest: T.J.B. Grants/contracts from GE Healthcare, NIH, and Voximetry. Z.H. NVIDIA RTX A6000 Academic Hardware Grant Award. J.H. No relevant relationships. A.R. No relevant relationships.

References

1. Varma S, Simon R. Bias in error estimation when using cross-validation for model selection. BMC Bioinformatics 2006;7(1):91.
2. Rodríguez JD, Pérez A, Lozano JA. Sensitivity analysis of k-fold cross validation in prediction error estimation. IEEE Trans Pattern Anal Mach Intell 2010;32(3):569–575.
3. Wainer J, Cawley G. Nested cross-validation when selecting classifiers is overzealous for most practical applications. Expert Syst Appl 2021;182:115222.
