TABLE 2.
| Category | Topic | Recommendation |
|---|---|---|
| Study design | Task definition | Collaborate with domain experts and stakeholders |
| | Study types | Identify publications as development studies or evaluation studies |
| | Risk assessment | Assess the degree of risk that the algorithm poses to patients and conduct the study accordingly |
| | Statistical plan | Preregister statistical analysis plans for prospective studies |
| Data collection | Bias anticipation | Collect data belonging to classes or groups that are vulnerable to bias |
| | Training set size estimation | Estimate size on the basis of trial and error or prior similar studies |
| | Evaluation set size estimation* | Use statistical power analysis for guidance (see the first code sketch following the table) |
| | Data decisions | Use justified, objective, and documented inclusion and exclusion criteria |
| Data labeling | Reference standard | Use labels that are regarded as sufficient standards of reference by the field |
| | Label quality | Justify label quality by application, study type, and clinical claim (Fig. 4) |
| | Labeling guide* | Produce a detailed guide for labelers in reader studies |
| | Quantity/quality tradeoff | Favor multiple labelers (quality) over greater numbers (quantity) |
| Model design | Model comparison* | Explore and compare different models for development studies |
| | Baseline comparison | Compare complex models with simpler models or the standard of care |
| | Model selection | Report model selection and hyperparameter tuning techniques |
| | Model stability | Use repeated training with random initialization when feasible (see the second code sketch following the table) |
| | Ablation study* | Perform ablation studies for development studies focusing on novel architectures |
| Model training | Cross-validation* | Use cross-validation for development studies; preserve data distribution across splits (see the third code sketch following the table) |
| | Data leakage | Avoid information leaks from the test set during model training |
| Model testing and interpretability | Test set | Use the same data and class distribution as the target population; use high-quality labels |
| | Target population | Explicitly define the target population |
| | External sets | Use external sets to evaluate model sensitivity to dataset shift |
| | Evaluation metric | Use multiple metrics when appropriate; visually inspect model outputs |
| | Model interpretability* | Use interpretability methods for clinical tasks |
| Reporting and dissemination | Reporting | Follow published reporting guidelines and checklists |
| | Sharing* | Make code and models from development studies accessible |
| | Transparency | Be forthcoming about failure modes and population characteristics in training and evaluation sets |
| | Reproducibility checks | Ensure that materials submitted to journals are sufficient for replication |
| Evaluation† | | |
*Not all recommendations are applicable to all types of studies.
†Addressed in a separate report from the AI Task Force.
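
The evaluation set size row recommends statistical power analysis for guidance. The sketch below is a minimal illustration, assuming the evaluation compares a hypothesized algorithm accuracy against a reference accuracy; the 0.80 and 0.88 values and the choice of a two-proportion comparison are placeholder assumptions for illustration, not figures from this report.

```python
# Minimal sketch: power analysis for sizing an evaluation set, assuming a
# comparison of two proportions (e.g., new algorithm vs. reference accuracy).
# All numeric values below are illustrative placeholders.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline_accuracy = 0.80   # assumed accuracy of the reference / standard of care
expected_accuracy = 0.88   # accuracy the new algorithm is hypothesized to reach
alpha = 0.05               # two-sided significance level
power = 0.80               # desired statistical power

# Cohen's h effect size for the difference between two proportions.
effect_size = proportion_effectsize(expected_accuracy, baseline_accuracy)

# Solve for the number of cases needed per group.
n_per_group = NormalIndPower().solve_power(
    effect_size=effect_size, alpha=alpha, power=power, alternative="two-sided"
)
print(f"Approximately {n_per_group:.0f} cases per group")
```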
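The model stability row suggests repeated training with random initialization. A minimal sketch, assuming a scikit-learn classifier and synthetic data purely for illustration, is to retrain the same model with several seeds and report the spread of a validation metric.

```python
# Minimal sketch: assess model stability by retraining with different random
# initializations. Estimator, data, and metric are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, stratify=y, random_state=0)

scores = []
for seed in range(5):
    # Only the random initialization changes between runs; the data stay fixed.
    model = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=seed)
    model.fit(X_train, y_train)
    scores.append(roc_auc_score(y_val, model.predict_proba(X_val)[:, 1]))

print(f"Validation AUC over {len(scores)} seeds: "
      f"mean {np.mean(scores):.3f}, std {np.std(scores):.3f}")
```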
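The cross-validation and data leakage rows together imply a split procedure in which the test set is held out before any development work, folds are stratified so that the class distribution is preserved across splits, and preprocessing is fitted only on training folds. The sketch below assumes scikit-learn and synthetic data for illustration.

```python
# Minimal sketch: stratified cross-validation with a held-out test set.
# The held-out test set is never touched during development (no leakage),
# and stratified folds preserve the class distribution across splits.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=1000, weights=[0.8, 0.2], random_state=0)

# Split off the test set before model development; stratify so its class
# distribution matches the development data.
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Stratified K-fold preserves the class distribution in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = []
for train_idx, val_idx in cv.split(X_dev, y_dev):
    # The pipeline fits the scaler on the training fold only, so no
    # information from the validation fold leaks into preprocessing.
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    model.fit(X_dev[train_idx], y_dev[train_idx])
    fold_scores.append(
        roc_auc_score(y_dev[val_idx], model.predict_proba(X_dev[val_idx])[:, 1])
    )
print(f"Cross-validated AUC: {np.mean(fold_scores):.3f} ± {np.std(fold_scores):.3f}")

# The test set is evaluated once, after model development is complete.
final_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
final_model.fit(X_dev, y_dev)
test_auc = roc_auc_score(y_test, final_model.predict_proba(X_test)[:, 1])
print(f"Held-out test AUC: {test_auc:.3f}")
```

Fitting the scaler inside the per-fold pipeline, rather than once on the full development set, is what keeps validation-fold information out of preprocessing and mirrors the leakage recommendation in the table.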