Institute of Diagnostic and Interventional Radiology and Neuroradiology, University Hospital Essen, University of Duisburg-Essen, Hufelandstrasse 55, D-45147 Essen, Germany
The recent review published in the July 2023 issue of Radiology: Artificial
Intelligence, “A Guide to Cross-Validation for Artificial
Intelligence in Medical Imaging” (1),
provides a concise introduction to cross-validation (CV). This topic is central to
obtaining unbiased models and applying machine learning methods in clinical
practice.
Among others, the authors introduced two schemes: nested cross-validation (NCV),
which can be understood as the reference standard method (2), and a scheme they call the select-shuffle-test (SST).
The SST consists of two steps: in the first step, the best algorithm is
selected using CV; in the second step, the data are shuffled, and the selected
algorithm is tested using another CV. However, the data are thereby used
twice, for model selection and for performance estimation, which
violates the basic rule that test data must never be used during training.
It can therefore be expected that SST introduces bias.
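The two-step SST scheme described above can be sketched as follows. This is a minimal illustration of the procedure, not the authors' actual code; the synthetic dataset and the two candidate models are assumptions chosen only to make the sketch runnable.

```python
# Sketch of the select-shuffle-test (SST) scheme: select the best model by
# CV on the data, shuffle the data, then estimate performance with a second CV.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=20, random_state=0)
candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "tree": DecisionTreeClassifier(random_state=0),
}

# Step 1: model selection via CV on the full data.
select_scores = {name: cross_val_score(m, X, y, cv=5).mean()
                 for name, m in candidates.items()}
best = max(select_scores, key=select_scores.get)

# Step 2: shuffle the data, then test the selected model with another CV.
rng = np.random.default_rng(1)
perm = rng.permutation(len(y))
test_score = cross_val_score(candidates[best], X[perm], y[perm], cv=5).mean()
print(best, round(test_score, 3))
```

Note that both steps see the same samples, which is exactly the double use of the data criticized in this letter.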
To illustrate the issue, we applied the code kindly provided by the authors to several
radiomics datasets (see https://github.com/aydindemircioglu/CVTesting). We
employed three feature selection methods and three classifiers and treated these
choices, together with the optimal number of features to select, as
hyperparameters. We then compared SST and NCV on 14 datasets in terms of accuracy,
repeating each experiment 100 times with varying CV splits and applying a one-sided
t test. Indeed, SST showed higher accuracy than NCV
on all datasets. The differences were small on average (+1.6%) but larger on some datasets
(up to +4.1%), and they were statistically significant (all P <
.02). It is reasonable to expect that the problem will worsen as more hyperparameters are tuned.
Since NCV is the reference standard, this implies that SST is biased and should not
be used.
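For contrast, the reference standard NCV keeps tuning strictly inside an inner loop, so the outer test folds never influence model selection. A minimal sketch (dataset, model, and search grid are illustrative assumptions, not those of the experiments above):

```python
# Nested cross-validation (NCV): hyperparameter search runs in an inner CV,
# and the outer CV folds are used only for the final performance estimate.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Inner loop: select C by 3-fold CV on each outer training split.
inner = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=3)

# Outer loop: 5-fold CV over the tuned estimator.
outer_scores = cross_val_score(inner, X, y, cv=5)
print(round(outer_scores.mean(), 3))
```

Because each outer test fold is held out from the entire inner search, the resulting estimate is unbiased, at the cost of training many more models.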
We want to emphasize that any validation scheme must ensure that the data on which
the evaluation is based were not used in any way for model selection. Violating
this rule will inevitably lead to false-positive results. We applaud the efforts of Dr
Bradshaw and colleagues to provide guidance on the central issue of CV and hope that
similar reviews will be published in the future to ensure reproducible research.
Footnotes
Disclosures of conflicts of interest: A.D. No relevant relationships.
References
1. Bradshaw TJ, Huemann Z, Hu J, Rahmim A. A guide to cross-validation for artificial intelligence in medical imaging. Radiol Artif Intell 2023;5(4):e220232.
2. Varma S, Simon R. Bias in error estimation when using cross-validation for model selection. BMC Bioinformatics 2006;7(1):91.
We thank the author for their interest in our article, and we commend them for their
efforts in applying the code we provided to compare SST CV and NCV.
We would like to address the concerns raised, first with respect to the principles of
CV, and second regarding their empirical results. The author stated that SST
violates the rule that training and testing data must be kept independent to avoid
overoptimistic results. We would like to clarify that overoptimism can occur for
several reasons (see our article). One cause is overfitting, which occurs when a
model’s weights are trained to effectively “memorize” a
dataset. Another cause is tuning to the test set, in which one
model is selected out of many candidate models because it has a random but favorable
match between the model’s hyperparameters and the test set’s data
distribution. The SST method is not susceptible to overfitting because the training
and test sets are always kept independent. We hypothesized that SST could mostly
avoid the pitfall of tuning to the test set by shuffling the data between the model
selection step and the testing step (ie, changing the test set data
distribution).
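The pitfall of tuning to the test set can be made concrete with a toy simulation (our own illustrative sketch, not taken from the article): many useless random predictors are scored against one fixed test set, the best one is selected, and its score looks better than chance; re-scoring the same predictor on fresh labels removes most of that optimism, which is the intuition behind shuffling before the test step.

```python
# Toy simulation of "tuning to the test set": selecting the best of many
# random predictors on a fixed test set yields an optimistic accuracy.
import numpy as np

rng = np.random.default_rng(0)
n_test, n_models = 100, 200
y_test = rng.integers(0, 2, n_test)

# Each "model" is just a fixed random 0/1 prediction vector.
preds = rng.integers(0, 2, (n_models, n_test))
accs = (preds == y_test).mean(axis=1)
best = accs.argmax()

# Fresh labels (analogous to shuffling between selection and testing).
y_fresh = rng.integers(0, 2, n_test)
acc_selected = accs[best]                    # optimistic, via selection
acc_fresh = (preds[best] == y_fresh).mean()  # typically close to chance
print(round(acc_selected, 2), round(acc_fresh, 2))
```

The selected model's score on the original labels reflects a random but favorable match, whereas its score on fresh labels reverts toward 50%.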
In their analysis, the author found a statistically significant bias when comparing
SST to NCV. However, it is important to contextualize these results. First, the bias
was quite small, averaging 1.6% in classification accuracy for a set of models with
accuracies ranging from 50% to 95%. In other words, SST’s performance was
very similar to that of NCV, but with a small bias. Second, it should be recognized
that many CV methods are known to exhibit biases or large variances (1–3). While NCV is considered unbiased, it is rarely used in practice because of the
computational demands of training large neural networks. A useful comparison would
be between SST and other, more practical CV approaches, including k-fold and random
sampling CV, using more contemporary models. These more common CV methods
might also compare unfavorably to NCV.
In summary, we agree that results obtained using SST, as well as from other CV
methods, should be interpreted in the context of their bias-variance profiles.
Moreover, we encourage additional methodological studies that characterize the
bias-variance profiles of various model evaluation techniques.
Footnotes
Disclosures of conflicts of interest: T.J.B. Grants/contracts from GE Healthcare, NIH, and
Voximetry. Z.H. NVIDIA RTX A6000 Academic Hardware Grant
Award. J.H. No relevant relationships. A.R. No
relevant relationships.
References
1. Varma S, Simon R. Bias in error estimation when using cross-validation for model selection. BMC Bioinformatics 2006;7(1):91.
2. Rodríguez JD, Pérez A, Lozano JA. Sensitivity analysis of k-fold cross validation in prediction error estimation. IEEE Trans Pattern Anal Mach Intell 2010;32(3):569–575.
3. Wainer J, Cawley G. Nested cross-validation when selecting classifiers is overzealous for most practical applications. Expert Syst Appl 2021;182:115222.