J Med Internet Res. 2016 Dec 16;18(12):e323. doi: 10.2196/jmir.5870

Table 3. Items to include when reporting predictive models in biomedical research: methods section.

Item 5: Describe the setting
- Identify the clinical setting for the target predictive model.
- Identify the modeling context in terms of facility type, size, volume, and duration of available data.
6 Define the prediction problem Define a measurement for the prediction goal (per patient or per hospitalization or per type of outcome).
Determine that the study is retrospective or prospective.a
Identify the problem to be prognostic or diagnostic.
Determine the form of the prediction model: (1) classification if the target variable is categorical, (2) regression if the target variable is continuous, (3) survival prediction if the target variable is the time to an event.
Translate survival prediction into a regression problem, with the target measured over a temporal window following the time of prediction.
Explain practical costs of prediction errors (eg, implications of underdiagnosis or overdiagnosis).
Defining quality metrics for prediction models.b
Define the success criteria for prediction (eg, based on metrics in internal validation or external validation in the context of the clinical problem).
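
To make the survival-to-regression translation concrete, here is a minimal sketch in Python, assuming pandas and hypothetical columns prediction_time, event_time, and follow_up_end; the 30-day window and the handling of censored rows are illustrative assumptions, not part of the checklist.

```python
# Minimal sketch: turn a time-to-event target into a fixed-window label.
# Hypothetical columns: prediction_time, event_time (NaT if no event),
# follow_up_end. The 30-day window is an arbitrary illustrative choice.
import pandas as pd

WINDOW_DAYS = 30  # temporal window following the time of prediction

def add_window_label(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    days_to_event = (out["event_time"] - out["prediction_time"]).dt.days
    days_observed = (out["follow_up_end"] - out["prediction_time"]).dt.days

    # 1 if the event occurs within the window, 0 otherwise
    # (a missing event_time yields NaN, which compares as False).
    out["label"] = (days_to_event <= WINDOW_DAYS).astype(int)

    # Event-free rows whose follow-up ends before the window closes are
    # censored: the window outcome is unknown, so exclude those rows.
    censored = (out["label"] == 0) & (days_observed < WINDOW_DAYS)
    return out.loc[~censored]
```
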
Item 7: Prepare data for model building
- Identify relevant data sources and quote the ethics approval number for data access.
- State the inclusion and exclusion criteria for the data.
- Describe the time span of the data and the sample or cohort size.
- Define the observational units on which the response variable and predictor variables are defined.
- Define the predictor variables. Take extra caution to prevent information leakage from the response variable into the predictor variables.^c
- Describe the data preprocessing performed, including data cleaning and transformation. Remove outliers with impossible or extreme responses, and state any criteria used for outlier removal.
- State how missing values were handled.
- Describe the basic statistics of the dataset, particularly of the response variable: the ratio of positive to negative classes for a classification problem and the distribution of the response variable for a regression problem.
- Define the model validation strategies. Internal validation is the minimum requirement; external validation should also be performed whenever possible.
- Specify the internal validation strategy. Common methods include random split, time-based split, and patient-based split (see the first sketch after this item).
- Define the validation metrics. For regression problems, the normalized root-mean-square error should be used. For classification problems, the metrics should include sensitivity, specificity, positive predictive value, negative predictive value, area under the ROC^d curve, and the calibration plot [19].^e The second sketch after this item computes these metrics.
- For retrospective studies, split the data into a derivation set and a validation set. For prospective studies, define the starting time for validation data collection.
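
To illustrate the patient-based split, a minimal sketch assuming scikit-learn and a hypothetical patient_id column; the 80/20 proportion and fixed seed are illustrative choices. Keeping every record of a patient on one side of the split prevents leakage between the derivation and validation sets.

```python
# Minimal sketch of a patient-based split: all rows belonging to one
# patient end up on the same side of the derivation/validation divide.
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

def patient_based_split(df: pd.DataFrame, test_size: float = 0.2):
    splitter = GroupShuffleSplit(n_splits=1, test_size=test_size,
                                 random_state=0)
    train_idx, valid_idx = next(splitter.split(df, groups=df["patient_id"]))
    return df.iloc[train_idx], df.iloc[valid_idx]
```
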
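And a minimal sketch of the classification metrics named above, again assuming scikit-learn; the 0.5 decision threshold is an illustrative assumption and should itself be reported. A calibration plot can be built from the same probabilities with sklearn.calibration.calibration_curve.

```python
# Minimal sketch: sensitivity, specificity, PPV, NPV, and AUC computed
# from predicted probabilities. The 0.5 threshold is illustrative.
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

def classification_metrics(y_true, y_prob, threshold=0.5):
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
        "ppv": tp / (tp + fp),          # positive predictive value
        "npv": tn / (tn + fn),          # negative predictive value
        "auc": roc_auc_score(y_true, y_prob),
    }
```
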
Item 8: Build the predictive model
- Identify independent variables that predominantly take a single value (eg, a variable that is zero 99% of the time).
- Identify and remove redundant independent variables.
- Identify independent variables that may suffer from the perfect separation problem.^f
- Report the number of independent variables, the number of positive examples, and the number of negative examples.
- Assess whether sufficient data are available for a good fit of the model. In particular, for classification, there should be a sufficient number of observations in both the positive and negative classes.
- Determine a set of candidate modeling techniques (eg, logistic regression, random forest, or deep learning). If only one type of model was used, justify the decision to use that model.^g
- Define the performance metrics used to select the best model.
- Specify the model selection strategy. Common methods include K-fold cross-validation or the bootstrap to estimate the loss function on a grid of candidate parameter values; for K-fold cross-validation, proper stratification by the response variable is needed.^h The sketch after this item illustrates both the near-constant variable screen and a stratified search.
- For model selection, discuss (1) the balance between model accuracy and model simplicity or interpretability, and (2) the end user's familiarity with the modeling techniques.^i
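
A minimal sketch of two of the steps above: screening near-constant independent variables and selecting a model by stratified K-fold search over a parameter grid. It assumes scikit-learn and pandas; the logistic regression model, the grid over C, the AUC scoring, and the 99% cutoff are illustrative assumptions, not prescriptions of the checklist. X would be the predictor matrix after dropping the screened columns, and the winning parameters should be reported alongside the selected model.

```python
# Minimal sketch: screen near-constant variables, then run a stratified
# K-fold grid search. Model, grid, cutoff, and scoring are illustrative.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

def near_constant_columns(df: pd.DataFrame, cutoff: float = 0.99):
    """Columns whose most frequent value covers >= `cutoff` of the rows."""
    return [c for c in df.columns
            if df[c].value_counts(normalize=True, dropna=False).iloc[0]
            >= cutoff]

def select_model(X, y):
    # Stratification keeps the class ratio stable across the 5 folds.
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    search = GridSearchCV(
        LogisticRegression(max_iter=1000),
        param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
        scoring="roc_auc",  # the metric chosen to select the best model
        cv=cv,
    )
    search.fit(X, y)
    return search.best_estimator_, search.best_params_
```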

^a See Figure 1.
^b See some examples in Multimedia Appendix 2.
^c See Textbox 1.
^d ROC: receiver operating characteristic.
^e Also see Textbox 2.
^f See Textbox 3.
^g See Multimedia Appendix 1 for some common methods and their strengths and limitations.
^h See Textbox 4.
^i A desirable but not mandatory item.