2020 Nov 19;31(6):3909–3922. doi: 10.1007/s00330-020-07417-0

Table 1.

Checklist of items to include when reporting ML studies

1. Which clinical problem is being solved?
□ Which patients or disease does the study concern?
□ How can ML improve upon existing diagnostic or prognostic approaches?
□ Which stage of the diagnostic pathway is investigated?
2. Choice of ML model
□ Which ML model is used?
□ Which measures are taken to avoid overfitting?
3. Sample size motivation
□ Is the sample size clearly motivated?
□ Which considerations were used to prespecify a sample size?
□ Is there a statistical analysis plan?
4. Specification of study design and training, validation, and testing datasets
□ Is the study prospective or retrospective?
□ What were the inclusion and exclusion criteria?
□ How many patients were included for training, validation, and testing?
□ Was the test dataset kept separate from the training and validation datasets?
□ Was an external dataset used for validation?*
□ Who performed external validation?
5. Standard of reference
□ What was the standard of reference?
□ Were existing labels used, or were labels newly created for the study?
□ How many observers contributed to the standard of reference?
□ Were observers blinded to the output of the ML algorithm and to labels of other observers?
6. Reporting of results
□ Which measures are used to report diagnostic or prognostic accuracy?
□ Which other measures are used to express agreement between the ML algorithm and the standard of reference?
□ Are contingency tables given?
□ Are confidence estimates given?
7. Are the results explainable?
□ Is it clear how the ML algorithm came to a specific classification or recommendation?
□ Which strategies were used to investigate the algorithm’s internal logic?
8. Can the results be applied in a clinical setting?
□ Is the dataset representative of the clinical setting in which the model will be applied?
□ What are the significant sources of bias?
□ For which patients can it be used clinically?
□ Can the results be implemented at the point of care?
9. Is the performance reproducible and generalizable?
□ Has reproducibility been studied?
□ Has the ML algorithm been validated externally?
□ Which sources of variation have been studied?
10. Is there any evidence that the model has an effect on patient outcomes?
□ Has an effect on patient outcomes been demonstrated?
11. Is the code available?
□ Is the software code available? Where is it stored?
□ Is the fully trained ML model available or should the algorithm be retrained with new data?
□ Is there a mechanism to study the algorithm’s results over time?

*Data from another institute or hospital
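
As an illustration of item 6 (contingency tables and confidence estimates), the following is a minimal Python sketch, not part of the original checklist. It assumes a binary classification task with labels coded as 1 (disease present) and 0 (disease absent). The function names `contingency_table` and `wilson_interval`, the toy labels, and the choice of the Wilson score interval are assumptions made for this example; other interval methods may be equally appropriate.

```python
import math


def contingency_table(truth, predicted):
    """Count (TP, FN, FP, TN) from paired binary labels (1 = disease present)."""
    tp = fn = fp = tn = 0
    for t, p in zip(truth, predicted):
        if t and p:
            tp += 1
        elif t and not p:
            fn += 1
        elif not t and p:
            fp += 1
        else:
            tn += 1
    return tp, fn, fp, tn


def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half, centre + half


# Hypothetical toy data: standard of reference vs. ML algorithm output.
truth = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]

tp, fn, fp, tn = contingency_table(truth, predicted)
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
sens_ci = wilson_interval(tp, tp + fn)
spec_ci = wilson_interval(tn, tn + fp)

print(f"TP={tp} FN={fn} FP={fp} TN={tn}")
print(f"Sensitivity {sensitivity:.2f} (95% CI {sens_ci[0]:.2f} to {sens_ci[1]:.2f})")
print(f"Specificity {specificity:.2f} (95% CI {spec_ci[0]:.2f} to {spec_ci[1]:.2f})")
```

Reporting the four cell counts alongside the derived measures lets readers recompute any accuracy metric themselves, and the intervals make clear how much uncertainty a small test set leaves.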