TABLE 2.
| Category | Topic | Recommendation |
|---|---|---|
| Study design | Task definition | Collaborate with domain experts and stakeholders |
| | Study types | Identify publications as development studies or evaluation studies |
| | Risk assessment | Assess the degree of risk that the algorithm poses to patients and conduct the study accordingly |
| | Statistical plan | Preregister statistical analysis plans for prospective studies |
| Data collection | Bias anticipation | Collect data belonging to classes or groups that are vulnerable to bias |
| | Training set size estimation | Estimate size on the basis of trial and error or prior similar studies |
| | Evaluation set size estimation* | Use statistical power analysis for guidance (see the first code sketch following the table) |
| | Data decisions | Use justified, objective, and documented inclusion and exclusion criteria |
| Data labeling | Reference standard | Use labels that are regarded as sufficient standards of reference by the field |
| | Label quality | Justify label quality by application, study type, and clinical claim (Fig. 4) |
| | Labeling guide* | Produce a detailed guide for labelers in reader studies |
| | Quantity/quality tradeoff | Favor multiple labelers (quality) over greater numbers (quantity) |
| Model design | Model comparison* | Explore and compare different models for development studies |
| | Baseline comparison | Compare complex models with simpler models or the standard of care |
| | Model selection | Report model selection and hyperparameter tuning techniques |
| | Model stability | Use repeated training with random initialization when feasible (see the second code sketch following the table) |
| | Ablation study* | Perform ablation studies for development studies focusing on novel architectures |
| Model training | Cross-validation* | Use cross-validation for development studies; preserve data distribution across splits (see the third code sketch following the table) |
| | Data leakage | Avoid information leaks from the test set during model training |
| Model testing and interpretability | Test set | Use the same data and class distribution as the target population; use high-quality labels |
| | Target population | Explicitly define the target population |
| | External sets | Use external sets to evaluate model sensitivity to dataset shift |
| | Evaluation metric | Use multiple metrics when appropriate; visually inspect model outputs |
| | Model interpretability* | Use interpretability methods for clinical tasks |
| Reporting and dissemination | Reporting | Follow published reporting guidelines and checklists |
| | Sharing* | Make code and models from development studies accessible |
| | Transparency | Be forthcoming about failure modes and population characteristics in training and evaluation sets |
| | Reproducibility checks | Ensure that materials submitted to journals are sufficient for replication |
| Evaluation† | | |
*Not all recommendations are applicable to all types of studies.
†Addressed in a separate report from the AI Task Force.
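
The evaluation set size row recommends statistical power analysis for guidance. The sketch below is a minimal illustration, assuming the evaluation compares a hypothesized algorithm accuracy against a reference accuracy; the 0.80 and 0.88 values and the choice of a two-proportion comparison are placeholder assumptions for illustration, not figures from this report.

```python
# Minimal sketch: power analysis for sizing an evaluation set, assuming a
# comparison of two proportions (e.g., new algorithm vs. reference accuracy).
# All numeric values below are illustrative placeholders.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline_accuracy = 0.80   # assumed accuracy of the reference / standard of care
expected_accuracy = 0.88   # accuracy the new algorithm is hypothesized to reach
alpha = 0.05               # two-sided significance level
power = 0.80               # desired statistical power

# Cohen's h effect size for the difference between two proportions.
effect_size = proportion_effectsize(expected_accuracy, baseline_accuracy)

# Solve for the number of cases needed per group.
n_per_group = NormalIndPower().solve_power(
    effect_size=effect_size, alpha=alpha, power=power, alternative="two-sided"
)
print(f"Approximately {n_per_group:.0f} cases per group")
```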
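The model stability row suggests repeated training with random initialization. A minimal sketch, assuming a scikit-learn classifier and synthetic data purely for illustration, is to retrain the same model with several seeds and report the spread of a validation metric.

```python
# Minimal sketch: assess model stability by retraining with different random
# initializations. Estimator, data, and metric are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, stratify=y, random_state=0)

scores = []
for seed in range(5):
    # Only the random initialization changes between runs; the data stay fixed.
    model = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=seed)
    model.fit(X_train, y_train)
    scores.append(roc_auc_score(y_val, model.predict_proba(X_val)[:, 1]))

print(f"Validation AUC over {len(scores)} seeds: "
      f"mean {np.mean(scores):.3f}, std {np.std(scores):.3f}")
```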
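The cross-validation and data leakage rows together imply a split procedure in which the test set is held out before any development work, folds are stratified so that the class distribution is preserved across splits, and preprocessing is fitted only on training folds. The sketch below assumes scikit-learn and synthetic data for illustration.

```python
# Minimal sketch: stratified cross-validation with a held-out test set.
# The held-out test set is never touched during development (no leakage),
# and stratified folds preserve the class distribution across splits.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=1000, weights=[0.8, 0.2], random_state=0)

# Split off the test set before model development; stratify so its class
# distribution matches the development data.
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Stratified K-fold preserves the class distribution in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = []
for train_idx, val_idx in cv.split(X_dev, y_dev):
    # The pipeline fits the scaler on the training fold only, so no
    # information from the validation fold leaks into preprocessing.
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    model.fit(X_dev[train_idx], y_dev[train_idx])
    fold_scores.append(
        roc_auc_score(y_dev[val_idx], model.predict_proba(X_dev[val_idx])[:, 1])
    )
print(f"Cross-validated AUC: {np.mean(fold_scores):.3f} ± {np.std(fold_scores):.3f}")

# The test set is evaluated once, after model development is complete.
final_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
final_model.fit(X_dev, y_dev)
test_auc = roc_auc_score(y_test, final_model.predict_proba(X_test)[:, 1])
print(f"Held-out test AUC: {test_auc:.3f}")
```

Fitting the scaler inside the per-fold pipeline, rather than once on the full development set, is what keeps validation-fold information out of preprocessing and mirrors the leakage recommendation in the table.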