Table 1. The REFORMS checklist for ML-based science.
See text S1 for guidelines on how to use the checklist. Alongside each item, authors should report the section or page number where the item is addressed. Some items in the REFORMS checklist may be difficult to report for specific studies. Rather than requiring strict adherence to every item, authors and referees should decide which items are relevant to a study and how the corresponding details can best be reported. To that end, we hope the checklist offers a useful starting point for authors and referees working on ML-based science.
| Module | Item |
| --- | --- |
| Study goals | 1a. State the population or distribution about which the scientific claim is made. |
| | 1b. Describe the motivation for choosing this population or distribution (1a). |
| | 1c. Describe the motivation for the use of ML methods in the study. |
| Computational reproducibility | 2a. Describe the dataset used for training and evaluating the model and provide a link or DOI to uniquely identify the dataset. |
| | 2b. Provide details about the code used to train and evaluate the model and produce the results reported in the paper, along with a link or DOI to uniquely identify the version of the code used. |
| | 2c. Describe the computing infrastructure used. |
| | 2d. Provide a README file that contains instructions for generating the results using the provided dataset and code. |
| | 2e. Provide a reproduction script to produce all results reported in the paper. |
| Data quality | 3a. Describe the source(s) of the data, separately for the training and evaluation datasets (if applicable), along with the time when the dataset(s) were collected, the source and process of ground-truth annotations, and other data documentation. |
| | 3b. State the distribution or set from which the dataset is sampled (i.e., the sampling frame). |
| | 3c. Justify why the dataset is useful for the modeling task at hand. |
| | 3d. State the outcome variable of the model, along with its definition and descriptive statistics (split by class for a categorical outcome variable). |
| | 3e. State the sample size and outcome frequencies. |
| | 3f. State the percentage of missing data, split by class for a categorical outcome variable. |
| | 3g. Justify why the distribution or set from which the dataset is drawn (3b) is representative of the one about which the scientific claim is being made (1a). |
| Data preprocessing | 4a. Describe whether any samples are excluded, with a rationale for why they are excluded. |
| | 4b. Describe how impossible or corrupt samples are dealt with. |
| | 4c. Describe all transformations of the dataset from its raw form (3a) to the form used in the model (for instance, treatment of missing data and normalization), preferably through a flow chart. |
| Modeling | 5a. Describe, in detail, all models trained. |
| | 5b. Justify the choice of model types implemented. |
| | 5c. Describe the method for evaluating the model(s) reported in the paper, including details of train-test splits or cross-validation folds. |
| | 5d. Describe the method for selecting the model(s) reported in the paper. |
| | 5e. For the model(s) reported in the paper, specify details about the hyperparameter tuning. |
| | 5f. Justify that model comparisons are against appropriate baselines. |
| Data leakage | 6a. Justify that preprocessing (Module 4) and modeling (Module 5) steps use information only from the training dataset (and not the test dataset). |
| | 6b. Describe the methods used to address dependencies or duplicates between the training and test datasets (e.g., different samples from the same patient are kept in the same dataset partition). |
| | 6c. Justify that each feature or input used in the model is legitimate for the task at hand and does not lead to leakage. |
| Metrics and uncertainty | 7a. State all metrics used to assess and compare model performance (e.g., accuracy, AUROC). Justify that the metric used to select the final model is suitable for the task. |
| | 7b. State uncertainty estimates (e.g., confidence intervals, standard deviations) and give details of how they are calculated. |
| | 7c. Justify the choice of statistical tests (if used) and check that their assumptions are satisfied. |
| Generalizability and limitations | 8a. Describe evidence of external validity. |
| | 8b. Describe contexts in which the authors do not expect the study’s findings to hold. |
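To make item 6a concrete: preprocessing statistics must be estimated on the training split alone and then reused, unchanged, on the test split. A minimal sketch in Python (the feature values and split below are invented for illustration, not from any real study):

```python
# Leakage-safe normalization: statistics are computed on the
# training split only, then applied unchanged to the test split.
def fit_scaler(train_values):
    """Return (mean, std) estimated from the training data alone."""
    n = len(train_values)
    mean = sum(train_values) / n
    var = sum((x - mean) ** 2 for x in train_values) / n
    return mean, var ** 0.5

def transform(values, mean, std):
    """Apply the training-set statistics to any split."""
    return [(x - mean) / std for x in values]

train = [2.0, 4.0, 6.0, 8.0]   # invented training feature values
test = [5.0, 10.0]             # invented test feature values

mean, std = fit_scaler(train)          # fit on train only
train_z = transform(train, mean, std)  # standardized training data
test_z = transform(test, mean, std)    # test reuses the train statistics
```

Fitting the scaler on the full dataset (train and test pooled) would let test-set information influence the transformation, which is exactly the leakage item 6a asks authors to rule out.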
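Item 6b's example (samples from the same patient kept in one partition) amounts to splitting by group rather than by sample. A sketch of a group-level split, assuming a `group_of` accessor and toy patient data invented for illustration:

```python
import random

def group_split(samples, group_of, test_fraction=0.25, seed=0):
    """Split samples so that every group (e.g., patient) appears in
    exactly one partition, avoiding train/test dependencies."""
    groups = sorted({group_of(s) for s in samples})
    rng = random.Random(seed)
    rng.shuffle(groups)
    n_test = max(1, round(test_fraction * len(groups)))
    test_groups = set(groups[:n_test])
    train = [s for s in samples if group_of(s) not in test_groups]
    test = [s for s in samples if group_of(s) in test_groups]
    return train, test

# Each sample is (patient_id, measurement); patients repeat across samples.
data = [("p1", 0.2), ("p1", 0.3), ("p2", 0.9), ("p3", 0.5), ("p3", 0.4)]
train, test = group_split(data, group_of=lambda s: s[0])
# No patient appears in both partitions:
assert {s[0] for s in train}.isdisjoint({s[0] for s in test})
```

A per-sample random split over the same data would likely place some patients in both partitions, inflating test performance through within-patient correlation.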
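For item 7b, one common way to obtain the requested uncertainty estimates is a percentile bootstrap over per-sample outcomes. A sketch using invented 0/1 correctness indicators (the 80%-accuracy data is fabricated for illustration):

```python
import random

def bootstrap_ci(correct, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for accuracy.
    `correct` is a list of 0/1 indicators (1 = correct prediction)."""
    rng = random.Random(seed)
    n = len(correct)
    # Accuracy of each bootstrap resample (drawn with replacement).
    stats = sorted(
        sum(rng.choice(correct) for _ in range(n)) / n
        for _ in range(n_boot)
    )
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

outcomes = [1] * 80 + [0] * 20   # invented: 80% accuracy on 100 samples
lo, hi = bootstrap_ci(outcomes)  # 95% CI around the observed accuracy
```

Reporting the interval alongside the point estimate, together with `n_boot` and the resampling unit (sample, patient, etc.), satisfies the "give details of how these are calculated" part of item 7b.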