Table 1.
| Checklist item | Notes |
| --- | --- |
| 1. Which clinical problem is being solved? | |
| □ Which patients or disease does the study concern? | |
| □ How can ML improve upon existing diagnostic or prognostic approaches? | |
| □ Which stage of the diagnostic pathway is investigated? | |
| 2. Choice of ML model | |
| □ Which ML model is used? | |
| □ Which measures are taken to avoid overfitting? | |
| 3. Sample size motivation | |
| □ Is the sample size clearly motivated? | |
| □ Which considerations were used to prespecify the sample size? | |
| □ Is there a statistical analysis plan? | |
| 4. Specification of study design and of the training, validation, and test datasets | |
| □ Is the study prospective or retrospective? | |
| □ What were the inclusion and exclusion criteria? | |
| □ How many patients were included for training, validation, and testing? | |
| □ Was the test dataset kept separate from the training and validation datasets? | |
| □ Was an external dataset used for validation?* | |
| □ Who performed the external validation? | |
| 5. Standard of reference | |
| □ What was the standard of reference? | |
| □ Were existing labels used, or were labels newly created for the study? | |
| □ How many observers contributed to the standard of reference? | |
| □ Were observers blinded to the output of the ML algorithm and to the labels of other observers? | |
| 6. Reporting of results | |
| □ Which measures are used to report diagnostic or prognostic accuracy? | |
| □ Which other measures are used to express agreement between the ML algorithm and the standard of reference? | |
| □ Are contingency tables given? | |
| □ Are confidence estimates given? | |
| 7. Are the results explainable? | |
| □ Is it clear how the ML algorithm arrived at a specific classification or recommendation? | |
| □ Which strategies were used to investigate the algorithm's internal logic? | |
| 8. Can the results be applied in a clinical setting? | |
| □ Is the dataset representative of the clinical setting in which the model will be applied? | |
| □ What are the significant sources of bias? | |
| □ For which patients can the model be used clinically? | |
| □ Can the results be implemented at the point of care? | |
| 9. Is the performance reproducible and generalizable? | |
| □ Has reproducibility been studied? | |
| □ Has the ML algorithm been validated externally? | |
| □ Which sources of variation have been studied? | |
| 10. Is there any evidence that the model affects patient outcomes? | |
| □ Has an effect on patient outcomes been demonstrated? | |
| 11. Is the code available? | |
| □ Is the software code available, and where is it stored? | |
| □ Is the fully trained ML model available, or must the algorithm be retrained on new data? | |
| □ Is there a mechanism to monitor the algorithm's results over time? | |
*Data from another institute or hospital
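Item 4 asks whether the test dataset was kept separate from the training and validation datasets. A minimal sketch of such a partition is below; the function name, split fractions, and seed are illustrative assumptions, not part of the checklist. The key property is that the test indices are drawn once and never reused for model selection.

```python
import random

def split_indices(n, val_frac=0.15, test_frac=0.15, seed=42):
    """Partition n sample indices into disjoint train/validation/test sets.

    The test indices are drawn once, up front, and never touched again,
    mirroring the checklist requirement that the test dataset be kept
    separate from the training and validation data.
    """
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    test = idx[:n_test]              # locked away for final evaluation only
    val = idx[n_test:n_test + n_val]  # for model selection and tuning
    train = idx[n_test + n_val:]      # for fitting the model
    return train, val, test

train, val, test = split_indices(1000)
assert not set(test) & (set(train) | set(val))  # no leakage between splits
```

Note that this describes an internal hold-out only; external validation (marked * in the table) additionally requires data from another institute or hospital.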
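Item 6 asks for contingency tables and confidence estimates when reporting accuracy. A hedged sketch of both for a binary classifier follows; the helper names are illustrative, and the Wilson score interval is one common choice of confidence estimate, not one mandated by the checklist.

```python
import math

def contingency(y_true, y_pred):
    """2x2 contingency table (TP, FP, FN, TN) for binary labels 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

def wilson_ci(k, n, z=1.96):
    """Wilson score interval for a proportion k/n (z=1.96 gives ~95%)."""
    if n == 0:
        return (0.0, 0.0)
    p = k / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half

tp, fp, fn, tn = contingency([1, 1, 1, 0, 0], [1, 1, 0, 1, 0])
lo, hi = wilson_ci(tp, tp + fn)  # confidence interval for sensitivity
```

Reporting the full contingency table alongside interval estimates lets readers recompute any derived measure (sensitivity, specificity, predictive values) themselves.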