PLoS One. 2020 Sep 11;15(9):e0238908. doi: 10.1371/journal.pone.0238908

Inconsistency in the use of the term “validation” in studies reporting the performance of deep learning algorithms in providing diagnosis from medical imaging

Dong Wook Kim 1,#, Hye Young Jang 2,#, Yousun Ko 3, Jung Hee Son 1, Pyeong Hwa Kim 1, Seon-Ok Kim 4, Joon Seo Lim 5, Seong Ho Park 1,*
Editor: Julian C Hong
PMCID: PMC7485764  PMID: 32915901

Abstract

Background

The development of deep learning (DL) algorithms is a three-step process—training, tuning, and testing. Studies are inconsistent in the use of the term “validation”, with some using it to refer to tuning and others to testing, which hinders accurate delivery of information and may inadvertently exaggerate the performance of DL algorithms. We investigated the extent of inconsistency in usage of the term “validation” in studies on the accuracy of DL algorithms in providing diagnosis from medical imaging.

Methods and findings

We analyzed the full texts of research papers cited in two recent systematic reviews. The papers were categorized according to whether the term “validation” was used to refer to tuning alone, both tuning and testing, or testing alone. We analyzed whether paper characteristics (i.e., journal category, field of study, year of print publication, journal impact factor [JIF], and nature of test data) were associated with the terminology usage, using multivariable logistic regression analysis with generalized estimating equations. Of 201 papers published in 125 journals, 118 (58.7%), 9 (4.5%), and 74 (36.8%) used the term to refer to tuning alone, both tuning and testing, and testing alone, respectively. A weak association was noted between higher JIF and using the term to refer to testing (i.e., testing alone or both tuning and testing) instead of tuning alone (vs. JIF <5; JIF 5 to 10: adjusted odds ratio 2.11, P = 0.042; JIF >10: adjusted odds ratio 2.41, P = 0.089). Journal category, field of study, year of print publication, and nature of test data were not significantly associated with the terminology usage.

Conclusions

Existing literature has a significant degree of inconsistency in using the term “validation” when referring to the steps in DL algorithm development. Efforts are needed to improve the accuracy and clarity in the terminology usage.

Introduction

Deep learning (DL), often used almost synonymously with artificial intelligence (AI), is the most dominant type of machine learning technique at present. Numerous studies have been published on applying DL to medicine, most prominently regarding the use of DL to provide diagnoses from various medical imaging techniques [1–3]. The development of a DL algorithm for clinical use is a three-step process—training, tuning, and testing [4–6]. Of note is the difference between the second step (tuning) and the third step (testing): in the tuning step, algorithms are fine-tuned by, for example, optimizing hyperparameters; in the testing step, the accuracy of the completed, fine-tuned algorithm is confirmed, typically using datasets that were held out from the training and tuning steps. Clinical adoption of a DL algorithm demands rigorous evaluation of its performance by carefully conducting the testing step, for which the use of independent external datasets that represent the target patients in real-world clinical practice is critical [3, 6–16].
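
To make the three steps concrete, the following minimal Python sketch (a hypothetical illustration using scikit-learn, not code from any of the reviewed papers) splits a dataset into a training set, a tuning set (the set that machine learning convention calls the “validation” set), and a held-out test set, and reserves the test set for a single final performance estimate.

```python
# Hypothetical illustration of the three-step process: the "tuning" set here is
# what machine-learning usage calls the "validation" set; the test set is untouched
# until the final testing step.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Hold out 20% as the test set; it plays no role in training or tuning.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Split the remainder into a training set and a tuning ("validation") set.
X_train, X_tune, y_train, y_tune = train_test_split(
    X_trainval, y_trainval, test_size=0.25, stratify=y_trainval, random_state=0)

# Tuning step: choose a hyperparameter by performance on the tuning set.
best_auc, best_depth = -1.0, None
for depth in (2, 4, 8):
    model = RandomForestClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    auc = roc_auc_score(y_tune, model.predict_proba(X_tune)[:, 1])
    if auc > best_auc:
        best_auc, best_depth = auc, depth

# Testing step: refit with the chosen hyperparameter and report held-out performance once.
final_model = RandomForestClassifier(max_depth=best_depth, random_state=0).fit(X_trainval, y_trainval)
print("held-out test AUC:", roc_auc_score(y_test, final_model.predict_proba(X_test)[:, 1]))
```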

Despite the notable difference between the tuning and testing steps, the existing literature on DL shows inconsistency in the use of the term “validation”, with some studies using it for the tuning step and others for the testing step [6, 12, 17–19]. Such inconsistency in terminology usage, or the inaccurate use of “validation” to refer to testing, is likely due to the fact that the term is typically used in general communication, as well as in medicine, to refer to the testing of the accuracy of a completed algorithm [6, 20], whereas the field of machine learning uses it as a very specific term that refers to the tuning step [4–6, 12, 17, 19, 21]. Also, the tuning step sometimes uses “cross-validation” procedures, which may create further confusion regarding the terminology for researchers who are less familiar with the methods and terms. The mixed usage of the terminology may have substantial repercussions, as it hinders proper distinction between DL algorithms that were adequately tested and those that were not. The real-world performance of a DL algorithm tested on adequate external datasets tends to be lower, often by large degrees, than that obtained with internal datasets during the tuning step [3, 6, 22–24]. Therefore, such mixed usage of the terminology may inadvertently exaggerate the performance of DL algorithms to researchers and the general public alike who are not familiar with machine learning. We thus investigated the extent of inconsistency in usage of the term “validation” in studies on the accuracy of DL algorithms in providing diagnosis from medical imaging.
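
To illustrate why cross-validation adds to the confusion, the sketch below (again hypothetical, using scikit-learn names) applies k-fold cross-validation only within the tuning step to select a hyperparameter, while a held-out test set is still kept aside for the separate testing step.

```python
# Hypothetical sketch: "cross-validation" here is part of the tuning step only;
# a held-out test set is still needed for the separate testing step.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=1)
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1)

# Tuning: 5-fold cross-validation over the development data to pick the regularization strength.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    cv=5, scoring="roc_auc")
search.fit(X_dev, y_dev)

# Testing: evaluate the tuned model once on the untouched held-out set.
test_auc = roc_auc_score(y_test, search.best_estimator_.predict_proba(X_test)[:, 1])
print(f"selected C = {search.best_params_['C']}, held-out test AUC = {test_auc:.3f}")
```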

Methods and materials

Literature selection

We collected all original research papers that were cited in two recent systematic reviews [15, 18] (Fig 1). Both systematic reviews excluded studies that used medical waveform data graphics (e.g., electrocardiography) or those investigating image segmentation rather than the diagnosis and classification of diseases or disease states. The full texts of all collected papers were reviewed to confirm eligibility by four reviewers (each reviewed approximately 150 papers) (Fig 1), three of whom were medical doctors and one of whom held a PhD; all reviewers were familiar with studies reporting the accuracy of machine learning algorithms as well as with systematic reviews of the literature. Prior to participating in the current study, the reviewers received reading materials [12, 15, 17, 18] and held an offline discussion to review their contents.

Fig 1. Study flow diagram. DL, deep learning. *This paper had been incorrectly characterized in the published systematic review.

Specifically, one systematic review [15] searched PubMed MEDLINE and Embase to include 516 original research papers (publication date: January 1st, 2018–August 17th, 2018) on the accuracy of machine learning algorithms (including both DL and non-DL machine learning) for providing diagnosis from medical imaging. Of these, 276 papers were excluded because 274 of them were on non-DL algorithms, one was retracted following the publication of the systematic review, and another had been incorrectly characterized in the systematic review [15]. The other systematic review [18] searched Ovid-MEDLINE, Embase, Science Citation Index, and Conference Proceedings Citation Index to include 82 papers (publication date: January 1st, 2012–June 6th, 2019) that compared the accuracies of DL algorithms and human health-care professionals in providing diagnosis from medical imaging. Further details of the literature search and selection methods are described in each paper [15, 18]. After excluding a total of 16 papers that overlapped between the two systematic reviews, the reviewers checked whether the term “validation” (or “validate” as a verbal form) was used in the papers to describe any step of the three-step process of developing DL algorithms. As a result, 105 papers that did not use the term to describe the steps of DL algorithm development were excluded, and a total of 201 papers were deemed eligible for analysis (Fig 1).

Data extraction

The reviewers further reviewed the eligible papers to extract information for analysis. The reviewers determined whether the term “validation” (or “validate” as a verbal form) was used to indicate the tuning step alone, the testing step alone, or both. We considered tuning as a step for fine-tuning a model and optimizing the hyperparameters, and testing as a step for evaluating the accuracy of a completed algorithm regardless of the nature of the test dataset used. Therefore, we did not limit the testing step to an act of checking the algorithm performance on a held-out dataset, although the use of a held-out dataset is recommended for testing (i.e., by splitting the entire dataset or, more rigorously, by collecting completely external data). We then identified whether a given study used a held-out dataset for testing. “Validation” used as part of a fixed compound term was not considered: for example, in a phrase such as “algorithm tuning used k-fold cross-validation”, we did not consider this a case of “validation” referring to the tuning step, because “cross-validation” is a fixed compound term. Papers judged ambiguous by individual reviewers were re-reviewed at a group meeting involving all four reviewers and a separate reviewer who was experienced in machine learning research and had 13 years of experience as a peer reviewer or editor for scholarly journals.
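
The rule of ignoring fixed compound terms can be illustrated with a small text-screening sketch. This is our own hypothetical helper (the pattern and function name are assumptions); the actual categorization of tuning versus testing in the study was performed by human reviewers rather than by pattern matching.

```python
# Hypothetical helper illustrating the screening rule: flag uses of "validation"/"validate"
# while ignoring the fixed compound term "cross-validation". The categorization of usage
# as tuning vs. testing in the study was judged by human reviewers, not by code.
import re

PATTERN = re.compile(r"(?<!cross-)\bvalidat(?:e|es|ed|ion|ing)\b", flags=re.IGNORECASE)

def uses_validation_term(sentence: str) -> bool:
    """Return True if the sentence uses 'validation'/'validate' outside 'cross-validation'."""
    return bool(PATTERN.search(sentence))

print(uses_validation_term("Algorithm tuning used k-fold cross-validation."))  # False
print(uses_validation_term("The model was validated on an external cohort."))  # True
```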

In addition, the reviewers analyzed other characteristics of the papers, including the journal category (medical vs. non-medical), field of study, year of print publication, and the journal impact factor according to the Journal Citation Reports (JCR) 2018 edition, where available. The distinction between medical and non-medical journals was made according to a method adopted elsewhere [15]: the journals were first classified according to the JCR 2018 edition categories, and those not included in the JCR database were considered medical if the scope/aim of the journal included any field of medicine or if the editor-in-chief was a medical doctor.

Statistical analysis

The proportion of papers using the term “validation” (or “validate” as a verbal form) was calculated according to its usage. The overall results for all papers and separate results broken down by paper characteristics were obtained. We also analyzed whether any characteristics of the papers were associated with the terminology usage. For this analysis, we dichotomized the use of the terminology into tuning alone vs. testing (i.e., testing alone or both tuning and testing), because we considered using the term “validation” for meanings other than tuning, as it is specifically defined in the field of machine learning, to be the source of confusion. We performed logistic regression analysis and used generalized estimating equations with an exchangeable working correlation to estimate the odds ratio (OR) for using the term in the meaning of testing (i.e., OR >1 indicating a greater likelihood of using the term to refer to testing in comparison with the reference category), accounting for the correlation between papers published in the same journals. The paper characteristics were used as categorical independent variables: journal category (medical vs. non-medical), field of study (radiology vs. others), year of print publication (before 2018, 2018, after 2018), journal impact factor (<5, 5 to 10, >10, unavailable), and nature of test data (held-out dataset vs. not held-out dataset). We combined the fields of study into a binary category (radiology vs. others) because the number of papers in each individual medical discipline other than radiology was small. Univariable and multivariable analyses were performed. SAS software version 9.4 (SAS Institute Inc., Cary, NC, USA) was used for statistical analysis. P values smaller than 0.05 were considered statistically significant.
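
For readers who want to reproduce this type of analysis, the following is a minimal sketch of the regression setup using Python and statsmodels rather than the SAS procedure used in the study; the data file and column names (e.g., "papers.csv", "used_for_testing", "journal") are hypothetical placeholders.

```python
# Hypothetical sketch of the analysis setup with Python/statsmodels instead of SAS:
# a binary outcome (term used to mean testing vs. tuning alone) is regressed on paper
# characteristics with a GEE logistic model and an exchangeable working correlation,
# clustering papers by journal. Column names are illustrative only.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

df = pd.read_csv("papers.csv")  # hypothetical file: one row per paper

model = smf.gee(
    "used_for_testing ~ C(journal_category) + C(field) + C(pub_year_cat) "
    "+ C(jif_category) + C(heldout_test_data)",
    groups="journal",                      # papers from the same journal are correlated
    data=df,
    family=sm.families.Binomial(),         # logistic link for the binary outcome
    cov_struct=sm.cov_struct.Exchangeable())
result = model.fit()

# Odds ratios and 95% confidence intervals on the exponentiated scale.
print(np.exp(result.params))
print(np.exp(result.conf_int()))
```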

Results

The characteristics of the 201 papers analyzed, published in 125 journals, are summarized in Table 1 and the raw data are available as supplementary material.

Table 1. Characteristics of the papers.

Number of papers (%)
Journal category
    Medical journals 165 (82.1)
    Non-medical journals 36 (17.9)
Field of study
    Radiology 121 (60.2)
    Others 80 (39.8)
Year of print publication
    Before 2018 10 (5.0)
    2018 150 (74.6)
    After 2018 41 (20.4)
Journal impact factor
    <5 128 (63.7)
    5 to 10 44 (21.9)
    >10 18 (9.0)
    Unavailable 11 (5.5)
Nature of test data
    Held-out dataset 133 (66.2)
    Not held-out dataset 68 (33.8)

Of the papers, 118 (58.7%), 9 (4.5%), and 74 (36.8%) used the term to refer to tuning alone, both tuning and testing, and testing alone, respectively. More than half of the papers used the term to refer specifically to tuning alone, in line with the definition used in the field of machine learning; the proportion was similar in medical journals (97/165, 58.8%) and non-medical journals (21/36, 58.3%). Specific examples of quotes using “validation” (or “validate”) to refer to tuning and testing are shown in Table 2.

Table 2. Example quotes on using “validation” (or “validate” as a verbal form) to refer to tuning or testing.

Meaning First author (year) Quote
Tuning Zhou (2018) [25] We have randomly separated them into three parts: 400 for training, 45 for validation and 95 for independent test.
Testing Bien (2018) [26] The training set was used to optimize model parameters, the tuning set to select the best model, and the validation set to evaluate the model’s performance.
Nam (2019) [27] The results, including AUROCs, JAFROC FOMs, and F1 scores … remained consistent with our results from DLAD among four external validation data sets.
Li (2019) [28] The high performance of the deep learning model we developed in this study was validated in several cohorts.

Table 3 shows the associations between paper characteristics and the terminology usage. Journal impact factor showed a weak association with the terminology usage: papers published in journals with higher impact factors were more likely to use the term to refer to the testing step, i.e., testing alone or both tuning and testing (vs. journal impact factor <5; journal impact factor 5 to 10: adjusted odds ratio 2.11, P = 0.042, statistically significant; journal impact factor >10: adjusted odds ratio 2.41, P = 0.089). Journal category, field of study, year of print publication, and the nature of test data were not significantly associated with the terminology usage.

Table 3. Association between terminology usage and paper characteristics.

Columns, left to right: terminology usage* (tuning alone; both tuning and testing; testing alone), followed by the univariable analysis (unadjusted OR [95% CI]; P value) and the multivariable analysis (adjusted OR [95% CI]; P value).
Total 118 (58.7) 9 (4.5) 74 (36.8)
Journal category
    Medical journals 97 (58.8) 6 (3.6) 62 (37.6) Reference category Reference category
    Non-medical journals 21 (58.3) 3 (8.3) 12 (33.3) 1.04 (0.57, 1.89) 0.896 1.22 (0.66, 2.25) 0.528
Field of study
    Radiology 73 (60.3) 5 (4.1) 43 (35.5) Reference category Reference category
    Other fields 45 (56.3) 4 (5.0) 31 (38.8) 1.23 (0.70, 2.16) 0.465 1.05 (0.59, 1.90) 0.862
Year of print publication
    Before 2018 6 (60.0) 0 (0.0) 4 (40.0) Reference category Reference category
    2018 84 (56.0) 8 (5.3) 58 (38.7) 1.14 (0.31, 4.24) 0.846 1.36 (0.33, 5.53) 0.667
    After 2018 28 (68.3) 1 (2.4) 12 (29.3) 0.70 (0.17, 2.87) 0.615 0.74 (0.18, 3.03) 0.673
Journal impact factor
    <5 82 (64.1) 7 (5.5) 39 (30.5) Reference category Reference category
    5 to 10 22 (50.0) 2 (4.5) 20 (45.5) 1.98 (1.05, 3.73) 0.034 2.11 (1.03, 4.31) 0.042
    >10 8 (44.4) 0 (0.0) 10 (55.6) 2.49 (0.98, 6.27) 0.054 2.41 (0.88, 6.63) 0.089
    Unavailable 6 (54.5) 0 (0.0) 5 (45.5) 1.64 (0.57, 4.76) 0.363 1.57 (0.49, 5.01) 0.447
Nature of test data
    Held-out dataset 73 (54.9) 6 (4.5) 54 (40.6) Reference category Reference category
    Not held-out dataset 45 (66.2) 3 (4.4) 20 (29.4) 0.60 (0.33, 1.08) 0.088 0.62 (0.34, 1.13) 0.119

OR, odds ratio; CI, confidence interval.

*Data are numbers of papers with the % in each row category in parentheses.

From logistic regression analysis with generalized estimating equations. OR >1 indicates a greater likelihood to use the term to refer to testing (i.e., testing alone or both tuning and testing) in comparison with the reference category.

Discussion

We found that the existing literature, whether medical or non-medical, has a significant degree of inconsistency (or inaccuracy) in using the term “validation” to refer to the steps in DL algorithm development, with only 58.7% of the papers using the term to refer to the tuning step alone, as it is specifically defined in the field of machine learning. Interestingly, papers published in journals with higher impact factors were slightly more likely to use the term to refer to the testing step (i.e., testing alone or both tuning and testing).

Inconsistency in terminology use hinders accurate delivery of information. In this regard, some investigators advocate a uniform description of the datasets used in the steps of DL algorithm development as a training set (for training the algorithm), a tuning set (for tuning hyperparameters), and a validation test set (for estimating the performance of the algorithm) [18]. However, others recommend referring to them as training, validation, and test sets [16]. “Validation” is a specific scientific term that is canonically accepted to refer to model tuning in the field of machine learning, and it is also widely used in a colloquial sense to refer to testing outside machine learning; therefore, an attempt to enforce any one way of using the terminology is likely to be futile. The weak association between journal impact factor and the terminology usage observed in this study (i.e., journals with higher impact factors being more likely to use “validation” to refer to testing) should not be interpreted as a rationale for promoting the use of the term to refer to testing; rather, the data merely delineate the current pattern of term usage in the journals included in this analysis.

In order to avoid possible confusion, it would be helpful if academic journals outside the field of machine learning adopted a clear policy on the use of the term “validation” when publishing articles on machine learning, for example by recommending that “validation” be used as a specific scientific term rather than as a general word. At the very least, researchers should clarify the meaning of the term “validation” early in their manuscripts [6, 17]. As long as each paper carefully explains its definition of the term “validation”, the degree and likelihood of confusion would substantially decrease. A useful way to draw researchers’ attention to terminology and to encourage more accurate and clear usage in reports of machine learning research would be through guidelines for reporting research studies, most notably those set forth by the EQUATOR (Enhancing the Quality and Transparency of Health Research) Network. Specifically, a machine learning-specific extension of the TRIPOD (Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis) statement, TRIPOD-ML, is currently under development [29]. Addressing the use of the term “validation” in TRIPOD-ML would therefore likely be an effective approach.

Another important related issue in studies reporting the accuracy of DL algorithms is the distinction between internal testing and external testing. The importance of adequate external testing using independent external datasets that represent the target patients in clinical practice cannot be overstated when testing the performance of DL algorithms for providing diagnosis [3, 6–16]. Testing with a subset split from the entire dataset, even if the subset was held out and unused for training and tuning, is not external testing and is most likely insufficient [9, 30]. DL algorithms for medical diagnosis require a large quantity of data for training, and producing and annotating this magnitude of medical data is highly resource-intensive and difficult [31–34]. Therefore, the data collection process, which is mostly carried out in a retrospective manner, is prone to various selection biases, notably spectrum bias and unnatural prevalence [12, 31, 34]. Additionally, there is often substantial heterogeneity in patient characteristics, equipment, facilities, and practice patterns across hospitals, physicians, time periods, and governmental health policies [3, 35]. These factors, combined with overfitting and the strong data dependency of DL, can substantially undermine the generalizability and usability of DL algorithms for providing diagnosis in clinical practice [3, 8, 9]. Therefore, guidelines for reporting studies on DL algorithms should also instruct authors to clearly distinguish between internal testing, including the use of a held-out subset split from the entire dataset, and external testing on a completely independent dataset, so as not to mislead readers.
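
A minimal sketch of this distinction, assuming hypothetical file names and a binary "label" column, is shown below: the same trained model is evaluated once on an internal held-out split and once on an independent external dataset, and it is the latter figure that speaks to real-world performance.

```python
# Hypothetical sketch distinguishing internal testing (a held-out split of the same
# dataset) from external testing (an independent dataset from another institution).
# File names and the "label" column are illustrative placeholders.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

internal = pd.read_csv("hospital_A.csv")  # data used for development
external = pd.read_csv("hospital_B.csv")  # independent data from another institution

features = [c for c in internal.columns if c != "label"]
X_dev, X_int_test, y_dev, y_int_test = train_test_split(
    internal[features], internal["label"],
    test_size=0.2, stratify=internal["label"], random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_dev, y_dev)

# Internal testing: a held-out subset of the development data.
print("internal test AUC:",
      roc_auc_score(y_int_test, model.predict_proba(X_int_test)[:, 1]))

# External testing: a completely independent dataset; this is the clinically relevant estimate.
print("external test AUC:",
      roc_auc_score(external["label"], model.predict_proba(external[features])[:, 1]))
```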

Our study is limited in that we could not analyze the relevant literature in its entirety. However, the two sets of papers collected from the recent systematic reviews [15, 18] may be representative of the current practice of terminology use in DL algorithm studies, considering that the related research activity is currently most prominent in the field of medical imaging [1–3]. Also, we did not directly assess the effect of the inconsistency (or inaccuracy) in terminology usage; how mixed terminology usage affects the perceived level of confusion among readers in different fields of study would be worth investigating in the future.

In conclusion, our study shows a substantial extent of inconsistency in the usage of the term “validation” in papers on the accuracy of DL algorithms in providing diagnosis from medical imaging. Efforts by both academic journals and researchers are needed to improve the accuracy and clarity of the terminology usage.

Supporting information

S1 File

(XLSX)

Data Availability

All relevant data are within the manuscript and its Supporting Information files.

Funding Statement

The authors received no specific funding for this work.

References

  • 1. Topol EJ. High-performance medicine: the convergence of human and artificial intelligence. Nat Med. 2019; 25: 44–56. doi:10.1038/s41591-018-0300-7
  • 2. Rajkomar A, Dean J, Kohane I. Machine learning in medicine. N Engl J Med. 2019; 380: 1347–1358. doi:10.1056/NEJMra1814259
  • 3. Kelly CJ, Karthikesalingam A, Suleyman M, Corrado G, King D. Key challenges for delivering clinical impact with artificial intelligence. BMC Med. 2019; 17: 195. doi:10.1186/s12916-019-1426-2
  • 4. Chartrand G, Cheng PM, Vorontsov E, Drozdzal M, Turcotte S, Pal CJ, et al. Deep learning: a primer for radiologists. Radiographics. 2017; 37: 2113–2131. doi:10.1148/rg.2017170077
  • 5. Hastie T, Tibshirani R, Friedman J. The elements of statistical learning: data mining, inference, and prediction. 2nd ed. New York: Springer; 2009.
  • 6. Liu Y, Chen P-HC, Krause J, Peng L. How to read articles that use machine learning: users’ guides to the medical literature. JAMA. 2019; 322: 1806–1816. doi:10.1001/jama.2019.16489
  • 7. Doshi-Velez F, Perlis RH. Evaluating machine learning articles. JAMA. 2019; 322: 1777–1779. doi:10.1001/jama.2019.17304
  • 8. Nevin L. Advancing the beneficial use of machine learning in health care and medicine: toward a community understanding. PLoS Med. 2018; 15: e1002708. doi:10.1371/journal.pmed.1002708
  • 9. Van Calster B, Wynants L, Timmerman D, Steyerberg EW, Collins GS. Predictive analytics in health care: how can we know it works? J Am Med Inform Assoc. 2019. doi:10.1093/jamia/ocz130
  • 10. Parikh RB, Obermeyer Z, Navathe AS. Regulation of predictive analytics in medicine. Science. 2019; 363: 810–812. doi:10.1126/science.aaw0029
  • 11. Yu KH, Kohane IS. Framing the challenges of artificial intelligence in medicine. BMJ Qual Saf. 2019; 28: 238–241. doi:10.1136/bmjqs-2018-008551
  • 12. Park SH, Han K. Methodologic guide for evaluating clinical performance and effect of artificial intelligence technology for medical diagnosis and prediction. Radiology. 2018; 286: 800–809. doi:10.1148/radiol.2017171920
  • 13. Park SH, Do K-H, Choi J-I, Sim JS, Yang DM, Eo H, et al. Principles for evaluating the clinical implementation of novel digital healthcare devices. J Korean Med Assoc. 2018; 61: 765–775. doi:10.5124/jkma.2018.61.12.765
  • 14. England JR, Cheng PM. Artificial intelligence for medical image analysis: a guide for authors and reviewers. AJR Am J Roentgenol. 2018: 1–7. doi:10.2214/ajr.18.20490
  • 15. Kim DW, Jang HY, Kim KW, Shin Y, Park SH. Design characteristics of studies reporting the performance of artificial intelligence algorithms for diagnostic analysis of medical images: results from recently published papers. Korean J Radiol. 2019; 20: 405–410. doi:10.3348/kjr.2019.0025
  • 16. Bluemke DA, Moy L, Bredella MA, Ertl-Wagner BB, Fowler KJ, Goh VJ, et al. Assessing radiology research on artificial intelligence: a brief guide for authors, reviewers, and readers-from the Radiology editorial board. Radiology. 2020; 294: 487–489. doi:10.1148/radiol.2019192515
  • 17. Park SH, Kressel HY. Connecting technological innovation in artificial intelligence to real-world medical practice through rigorous clinical validation: what peer-reviewed medical journals could do. J Korean Med Sci. 2018; 33: e152. doi:10.3346/jkms.2018.33.e152
  • 18. Liu X, Faes L, Kale AU, Wagner SK, Fu DJ, Bruynseels A, et al. A comparison of deep learning performance against health-care professionals in detecting diseases from medical imaging: a systematic review and meta-analysis. Lancet Digit Health. 2019; 1: e271–e297. doi:10.1016/S2589-7500(19)30123-2
  • 19. Wang C, Liu C, Chang Y, Lafata K, Cui Y, Zhang J, et al. Dose-Distribution-Driven PET Image-Based Outcome Prediction (DDD-PIOP): a deep learning study for oropharyngeal cancer IMRT application. Front Oncol. 2020; 10: 1592. doi:10.3389/fonc.2020.01592
  • 20. Collins GS, Reitsma JB, Altman DG, Moons KG. Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): the TRIPOD statement. BMJ. 2015; 350: g7594. doi:10.1136/bmj.g7594
  • 21. Do S, Song KD, Chung JW. Basics of deep learning: a radiologist’s guide to understanding published radiology articles on deep learning. Korean J Radiol. 2020; 21: 33–41. doi:10.3348/kjr.2019.0312
  • 22. Zech JR, Badgeley MA, Liu M, Costa AB, Titano JJ, Oermann EK. Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: a cross-sectional study. PLoS Med. 2018; 15: e1002683. doi:10.1371/journal.pmed.1002683
  • 23. Harris M, Qi A, Jeagal L, Torabi N, Menzies D, Korobitsyn A, et al. A systematic review of the diagnostic accuracy of artificial intelligence-based computer programs to analyze chest x-rays for pulmonary tuberculosis. PLOS ONE. 2019; 14: e0221339. doi:10.1371/journal.pone.0221339
  • 24. Ridley EL. Deep-learning algorithms need real-world testing. 27 November 2018 [cited 14 December 2019]. In: AuntMinnie.com [Internet]. Available from: https://www.auntminnie.com/index.aspx?sec=nws&sub=rad&pag=dis&ItemID=123871
  • 25. Zhou Y, Xu J, Liu Q, Li C, Liu Z, Wang M, et al. A radiomics approach with CNN for shear-wave elastography breast tumor classification. IEEE Trans Biomed Eng. 2018; 65: 1935–1942. doi:10.1109/tbme.2018.2844188
  • 26. Bien N, Rajpurkar P, Ball RL, Irvin J, Park A, Jones E, et al. Deep-learning-assisted diagnosis for knee magnetic resonance imaging: development and retrospective validation of MRNet. PLoS Med. 2018; 15: e1002699. doi:10.1371/journal.pmed.1002699
  • 27. Nam JG, Park S, Hwang EJ, Lee JH, Jin KN, Lim KY, et al. Development and validation of deep learning-based automatic detection algorithm for malignant pulmonary nodules on chest radiographs. Radiology. 2019; 290: 218–228. doi:10.1148/radiol.2018180237
  • 28. Li X, Zhang S, Zhang Q, Wei X, Pan Y, Zhao J, et al. Diagnosis of thyroid cancer using deep convolutional neural network models applied to sonographic images: a retrospective, multicohort, diagnostic study. Lancet Oncol. 2019; 20: 193–201. doi:10.1016/s1470-2045(18)30762-9
  • 29. Collins GS, Moons KGM. Reporting of artificial intelligence prediction models. Lancet. 2019; 393: 1577–1579. doi:10.1016/s0140-6736(19)30037-6
  • 30. Luo W, Phung D, Tran T, Gupta S, Rana S, Karmakar C, et al. Guidelines for developing and reporting machine learning predictive models in biomedical research: a multidisciplinary view. J Med Internet Res. 2016; 18: e323. doi:10.2196/jmir.5870
  • 31. Nsoesie EO. Evaluating artificial intelligence applications in clinical settings. JAMA Netw Open. 2018; 1: e182658. doi:10.1001/jamanetworkopen.2018.2658
  • 32. Ngiam KY, Khor IW. Big data and machine learning algorithms for health-care delivery. Lancet Oncol. 2019; 20: e262–e273. doi:10.1016/s1470-2045(19)30149-4
  • 33. Vigilante K, Escaravage S, McConnell M. Big data and the intelligence community—lessons for health care. N Engl J Med. 2019; 380: 1888–1890. doi:10.1056/NEJMp1815418
  • 34. Zou J, Schiebinger L. AI can be sexist and racist—it’s time to make it fair. Nature. 2018; 559: 324–326. doi:10.1038/d41586-018-05707-8
  • 35. Ghassemi M, Naumann T, Schulam P, Beam AL, Chen IY, Ranganath R. Practical guidance on artificial intelligence for health-care data. Lancet Digit Health. 2019; 1: e157–e159. doi:10.1016/S2589-7500(19)30084-6

Decision Letter 0

Julian C Hong

23 Jun 2020

PONE-D-20-14331

Inconsistency in the use of the term “validation” in studies reporting the performance of deep learning algorithms in providing diagnosis from medical imaging

PLOS ONE

Dear Dr. Park,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Aug 07 2020 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols

We look forward to receiving your revised manuscript.

Kind regards,

Julian C Hong

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2.

We note that you have indicated that data from this study are available upon request. PLOS only allows data to be available upon request if there are legal or ethical restrictions on sharing data publicly. For information on unacceptable data access restrictions, please see http://journals.plos.org/plosone/s/data-availability#loc-unacceptable-data-access-restrictions.

In your revised cover letter, please address the following prompts:

a) If there are ethical or legal restrictions on sharing a de-identified data set, please explain them in detail (e.g., data contain potentially identifying or sensitive patient information) and who has imposed them (e.g., an ethics committee). Please also provide contact information for a data access committee, ethics committee, or other institutional body to which data requests may be sent.

b) If there are no restrictions, please upload the minimal anonymized data set necessary to replicate your study findings as either Supporting Information files or to a stable, public repository and provide us with the relevant URLs, DOIs, or accession numbers. Please see http://www.bmj.com/content/340/bmj.c181.long for guidelines on how to de-identify and prepare clinical data for publication. For a list of acceptable repositories, please see http://journals.plos.org/plosone/s/data-availability#loc-recommended-repositories.

We will update your Data Availability statement on your behalf to reflect the information you provide.

Additional Editor Comments (if provided):

Thank you to the authors for their submission. The reviewers were positive regarding this submission with recommended revisions.

In particular, the perspectives regarding the use of the term "validation" varies across the reviewers given their respective backgrounds, and it would be helpful to discuss both viewpoints in the manuscript to accommodate the diversity of potential interested readers.

Please see reviewer responses for specific recommendations.


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: No

Reviewer #2: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The authors are concerned about the varying definitions of the term "validation" in medical imaging AI papers. In the machine learning world, this term usually refers to a dataset used to tune model hyperparameters and decide when to stop training optimization, but in the medical world this term usually refers to the testing of an algorithm on data not used as part of the training process at all. They show that there are many papers that use the term in the first way, and many papers that use it in the second way.

The analysis seems to be appropriate and carefully done, and the writing is very clear. In the spirit of the PLOS One data sharing policy, I think the authors need to share their raw dataset now, so the reviewers can better assess the results. There are no anonymity issues that would keep them from sharing this.

It would be helpful to include a table showing a few specific quotes from the 201 papers that illustrate the two uses of the term "validation". For readers not familiar with machine learning, this will help them understand the problem. It would also be interesting to know if all 201 papers included a separate held-out test set, or if some never tested their model on a held-out set (which would be bad).

Since it is unlikely the use of the term "validation set" to describe a tuning set will change in the machine learning world, I do not think medical AI papers will ever have completely standardized terminology. As long as each paper carefully explains its definitions, I do not think it is a major issue. So I am not completely on board with the authors' conclusions about the need for strict guidelines on this topic.

Reviewer #2: This paper provides a systematic review and meta-analysis of the term ‘validation’ as used in deep learning studies regarding medical diagnosis. There is a major inconsistency in the medical imaging community regarding the correct use of ‘validation’ in the technical sense. This paper is well written and the topic is important to readers of both PLoS One and the medical imaging community. However, there are several aspects of this paper that I feel need to be addressed prior to publication. My general comments are as follows.

First, I don’t believe this is an inconsistency problem (which implies that there is a lack of standard terminology), but rather an inaccuracy problem regarding the correct use of a well-defined, technical term. As the authors correctly point out, the inconsistency of the term ‘validation’ is most likely because the term is used in general communication to refer to testing the accuracy of a completed algorithm. However, the term means something very specific when referring to the science of machine learning, where it is used to define the tuning of hyper-parameters in a model (i.e., validating the training procedure, not the generalization capacity). This definition is canonically accepted, and in general, the term ‘validation’ should not be misused in the colloquial sense when referring to technical work. In my opinion, we simply need to educate the medical community to use the appropriate terminology when applying machine learning approaches, and this confusion won’t be a problem.

As such, I do not agree with lines 204-206, where the authors argue, “It would be ideal if medical journals would unify the use of the term validation to refer to the testing step instead of the tuning step”. In my opinion, there is nothing to ‘unify’ here, and instead the medical journals need to enforce the correct use of terminology. If research related to machine learning (including deep learning) is to be published in scientific journals (including imaging journals), it should use canonical terminology. Superficially changing these technical terms to colloquial terms would only cause more confusion, and question sound scientific reasoning published in the journals.

Next, while I find it interesting that papers published in high impact factor journals tend to use ‘validation’ to refer to model testing (instead of the more appropriate and accurate model tuning), I don’t think this statistic should be used as a rationale to justify the proposal in Lines 206-207 as stated above. If anything, it makes more sense to test if there was a statistically significant difference in use of ‘validation’ between technical machine learning journals and imaging journals, regardless of impact factor.

The authors argue at Lines 206-207, “Such a unification in terminology use may be difficult in disciplines such as machine learning, where the term is relatively widely used to refer to the tuning step”. This line similarly does not make much sense to me; machine learning is the discipline being used, regardless of the application (i.e., medical imaging or otherwise). In the context of this paper, it is all machine learning, and imaging is simply the application. It is not a one-to-one fair comparison.

On Lines 219-236, the authors discuss “internal validation vs. external validation”. Here, they are using the term in the colloquial sense, but in the paragraphs that follow, they actually explain a process that is technically already defined as “internal testing vs. external testing”. By definition, a model that has been validated on an internal dataset (i.e., the learning procedure has been (cross)-validated to identify optimal hyper-parameters) has to be tested on a separate dataset not included in the original training and validation procedure, either internally (on a subset of the dataset held out) or externally (on a completely independent dataset). The notion of “internal validation vs. external validation” is ill-defined and goes against the entire purpose of this paper.

In technical machine learning workflows, it is often implied that model validation is based on some form of a cross-validation procedure, where the training of a machine learning algorithm is applied to subsets (e.g., folds) of the data set to best learn the optimal hyperparameters. The authors do not mention this at all, and it is important as it may help to better define and implement proper terminology.

Finally, I think that the Discussion section needs to be substantially expanded on, to include specific examples that are related to imaging. Further, I would like to see the authors also take more of a stance on promoting the appropriate use of the term ‘validation’, as stated in the above points. While the Discussion section does a nice job at documenting the different uses of the term ‘validation’, in my opinion it falls short in promoting a good solution, and should focus primarily on well-defined definitions as a reference point.

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Michael Gensheimer

Reviewer #2: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2020 Sep 11;15(9):e0238908. doi: 10.1371/journal.pone.0238908.r002

Author response to Decision Letter 0


23 Jul 2020

Reviewer #1:

The authors are concerned about the varying definitions of the term "validation" in medical imaging AI papers. In the machine learning world, this term usually refers to a dataset used to tune model hyperparameters and decide when to stop training optimization, but in the medical world this term usually refers to the testing of an algorithm on data not used as part of the training process at all. They show that there are many papers that use the term in the first way, and many papers that use it in the second way.

1. The analysis seems to be appropriate and carefully done, and the writing is very clear. In the spirit of the PLOS One data sharing policy, I think the authors need to share their raw dataset now, so the reviewers can better assess the results. There are no anonymity issues that would keep them from sharing this.

Author response: Thank you for reviewing our study in detail and providing critical comments. We fully agree with your comment that the raw dataset should be shared. We have included the raw data as supplementary material.

2. It would be helpful to include a table showing a few specific quotes from the 201 papers that illustrate the two uses of the term "validation". For readers not familiar with machine learning, this will help them understand the problem. It would also be interesting to know if all 201 papers included a separate held-out test set, or if some never tested their model on a held-out set (which would be bad).

Author response: This is a great suggestion. As per your suggestion, we have added Table 2 to provide example quotes from the papers regarding the terminology usage.

Also, regarding the use of the held-out test set, a small fraction of the 201 papers did not use a held-out dataset for testing and instead evaluated the performance of a complete algorithm using cross-validation methods. We considered the testing step as a stage for assessing the accuracy of a completed algorithm regardless of the nature of the test dataset used. Therefore, we did not limit the testing step to the act of checking the algorithm performance on a held-out dataset, even though the use of a held-out dataset is recommended for testing.

As this new information has been added, we have also revised the statistical analysis section to explore whether the use of a held-out test set was associated with the terminology usage. We found no significant difference between studies that used a held-out test set and those that did not. We have added these points to the revised manuscript, indicated by the marginal memos reading “R1-2”.

3. Since it is unlikely the use of the term "validation set" to describe a tuning set will change in the machine learning world, I do not think medical AI papers will ever have completely standardized terminology. As long as each paper carefully explains its definitions, I do not think it is a major issue. So I am not completely on board with the authors' conclusions about the need for strict guidelines on this topic.

Author response: We agree with your comment. We have revised the Discussion section according to the reviewer’s opinion (marginal memos “R1-3”).

Thank you again for reviewing our study in detail and providing critically helpful comments. We hope that our responses and revisions are satisfactory.

Reviewer #2:

This paper provides a systematic review and meta-analysis of the term ‘validation’ as used in deep learning studies regarding medical diagnosis. There is a major inconsistency in the medical imaging community regarding the correct use of ‘validation’ in the technical sense. This paper is well written and the topic is important to readers of both PLoS One and the medical imaging community. However, there are several aspects of this paper that I feel need to be addressed prior to publication. My general comments are as follows.

1. First, I don’t believe this is an inconsistency problem (which implies that there is a lack of standard terminology), but rather an inaccuracy problem regarding the correct use of a well-defined, technical term. As the authors correctly point out, the inconsistency of the term ‘validation’ is most likely because the term is used in general communication to refer to testing the accuracy of a completed algorithm. However, the term means something very specific when referring to the science of machine learning, where it is used to define the tuning of hyper-parameters in a model (i.e., validating the training procedure, not the generalization capacity). This definition is canonically accepted, and in general, the term ‘validation’ should not be misused in the colloquial sense when referring to technical work. In my opinion, we simply need to educate the medical community to use the appropriate terminology when applying machine learning approaches, and this confusion won’t be a problem.

Author response: Thank you for reviewing our study in detail and providing critical comments. We agree with your comment, and we revised the manuscript accordingly. Also, considering the reviewer’s opinion, we also thought that it would be more reasonable to consider that using the term “validation” for meanings other than tuning, as it is specifically defined in the field of machine learning, is the source of confusion. Therefore, we revised the statistical analysis using the revised binary categorization of tuning alone vs. testing (testing alone or both tuning and testing) instead of the previous categorization of testing alone vs. tuning (tuning alone or both tuning and testing).

The OR was thus re-calculated for using the term to refer to testing (i.e., OR >1 indicating a greater likelihood to use the term to refer to testing in comparison with the reference category), whereas the same was calculated for using the term to refer to tuning in the original version of the manuscript. The updated statistical results showed consistent results with the previous version, although the exact numerical values have been changed. Please refer to the corresponding revised portions of the manuscript, which are indicated by the marginal memos reading “R2-1”.

2. As such, I do not agree with lines 204-206, where the authors argue, “It would be ideal if medical journals would unify the use of the term validation to refer to the testing step instead of the tuning step”. In my opinion, there is nothing to ‘unify’ here, and instead the medical journals need to enforce the correct use of terminology. If research related to machine learning (including deep learning) is to be published in scientific journals (including imaging journals), it should use canonical terminology. Superficially changing these technical terms to colloquial terms would only cause more confusion, and question sound scientific reasoning published in the journals.

Author response: We agree with your comment on how the medical community should be more aware of the correct use of said terminology. We have revised a paragraph in the Discussion section accordingly (marginal memo “R2-2”).

3. Next, while I find it interesting that papers published in high impact factor journals tend to use ‘validation’ to refer to model testing (instead of the more appropriate and accurate model tuning), I don’t think this statistic should be used as a rationale to justify the proposal in Lines 206-207 as stated above. If anything, it makes more sense to test if there was a statistically significant difference in use of ‘validation’ between technical machine learning journals and imaging journals, regardless of impact factor.

Author response: We agree that the statistical results should not be used as a rationale to justify the proposal as stated above. We have revised the Discussion accordingly (marginal memos “R2-3”).

4. The authors argue at Lines 206-207, “Such a unification in terminology use may be difficult in disciplines such as machine learning, where the term is relatively widely used to refer to the tuning step”. This line similarly does not make much sense to me; machine learning is the discipline being used, regardless of the application (i.e., medical imaging or otherwise). In the context of this paper, it is all machine learning, and imaging is simply the application. It is not a one-to-one fair comparison.

Author response: We agree with your comment. We have revised the Discussion section accordingly (marginal memo “R2-4”).

5. On Lines 219-236, the authors discuss “internal validation vs. external validation”. Here, they are using the term in the colloquial sense, but in the paragraphs that follow, they actually explain a process that is technically already defined as “internal testing vs. external testing”. By definition, a model that has been validated on an internal dataset (i.e., the learning procedure has been (cross)-validated to identify optimal hyper-parameters) has to be tested on a separate dataset not included in the original training and validation procedure, either internally (on a subset of the dataset held out) or externally (on a completely independent dataset). The notion of “internal validation vs. external validation” is ill-defined and goes against the entire purpose of this paper.

Author response: As per your comment, we have revised the terms to “internal testing” and “external testing”. Thank you for the keen comment.

6. In technical machine learning workflows, it is often implied that model validation is based on some form of a cross-validation procedure, where the training of a machine learning algorithm is applied to subsets (e.g., folds) of the data set to best learn the optimal hyperparameters. The authors do not mention this at all, and it is important as it may help to better define and implement proper terminology.

Author response: We have revised the corresponding portions as suggested.

7. Finally, I think that the Discussion section needs to be substantially expanded on, to include specific examples that are related to imaging. Further, I would like to see the authors also take more of a stance on promoting the appropriate use of the term ‘validation’, as stated in the above points. While the Discussion section does a nice job at documenting the different uses of the term ‘validation’, in my opinion it falls short in promoting a good solution, and should focus primarily on well-defined definitions as a reference point.

Author response: We have newly added Table 2 to show the specific examples from the reference papers, and revised the Discussion in multiple locations to incorporate the points raised by the reviewer (marginal memo “R2-7”).

Again, we very much appreciate your insightful inputs. We hope our revisions are satisfactory.

Attachment

Submitted filename: Response to Reviewers.docx

Decision Letter 1

Julian C Hong

27 Aug 2020

Inconsistency in the use of the term “validation” in studies reporting the performance of deep learning algorithms in providing diagnosis from medical imaging

PONE-D-20-14331R1

Dear Dr. Park,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Julian C Hong

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Thank you to the authors for their work and revisions. There are minor recommendations for additional citations to include in the manuscript.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed

Reviewer #2: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: (No Response)

Reviewer #2: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: (No Response)

Reviewer #2: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: (No Response)

Reviewer #2: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: (No Response)

Reviewer #2: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: (No Response)

Reviewer #2: In the Introduction, please include some references when discussing cross-validation, and how it fits into a training/cross validation + testing experimental design. This paper is a good example of a good deep learning implementation:

C. Wang, et. al. Dose-Distribution-Driven PET Image-based Outcome Prediction (DDD-PIOP): A Deep Learning Study for Oropharyngeal Cancer IMRT Application. Frontiers in Oncology. 2020. https://doi.org/10.3389/fonc.2020.01592

Also in the introduction, please cite example references when discussing the fine-tuning of a model during the validation phase. This paper is a good example of a network that was tuned during validation, using the correct terminology very well:

Y. Chang, et. al. Development of realistic multi-contrast textured XCAT (MT-XCAT) phantoms using a dual-discriminator conditional-generative adversarial network (D-CGAN). 2020. Physics in Medicine and Biology. 2020 Mar;65(6).

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Michael F. Gensheimer

Reviewer #2: No

Acceptance letter

Julian C Hong

1 Sep 2020

PONE-D-20-14331R1

Inconsistency in the use of the term “validation” in studies reporting the performance of deep learning algorithms in providing diagnosis from medical imaging

Dear Dr. Park:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Julian C Hong

Academic Editor

PLOS ONE
