Table 1.
Selection of AI approach based on clinical question and data characteristics | Supervised methods are suited to classification and prediction tasks involving “labeled” data: e.g., image segmentation or survival prediction. Unsupervised methods are useful for identifying structures and patterns in unlabeled data: e.g., association and clustering. Reinforcement learning algorithms interact with the environment by producing actions that are rewarded or penalized, while identifying the optimal path to address the problem. DL can be used to accelerate supervised, unsupervised, or reinforcement learning but is better suited to larger, more unstructured datasets. Classical ML is more likely to perform well with smaller training datasets. |
Algorithm selection | Are there “off-the-shelf” algorithms tailored to the same problem or validated on similar data? Transparency, understandability, and performance are all important features. Try to avoid “black box” approaches in high-stakes decision-making, where it is not possible to scrutinize the features that inform the classification or to explain the outputs. |
Data pre-processing | Several steps are likely to be required in the preparation of data, including anonymization, quality control, data normalization and standardization, deciding how to handle missing data points and outliers, imputation of missing values, etc. (a minimal pre-processing sketch follows the table). Are the training data an accurate representation of the wider data/population (e.g., all expected variation present, same technical characteristics)? |
Feature selection | A subset of relevant features (variables or predictors) is selected from high-dimensional data, allowing for a more succinct representation of the dataset (see the feature-selection sketch after the table). |
Data allocation | Evaluate the available data and plan the proportions to be allocated to the training, testing, and validation datasets. Other approaches include cross-validation, stratified cross-validation, leave-one-out, and bootstrapping (a data-allocation sketch follows the table). |
Hardware considerations | Based on the volume of data and the methodological approach, are CPU clusters, GPUs, or cloud computing better suited? |
Evaluation of model performance | Receiver operating characteristic (ROC) curves with accuracy measured by the area under the ROC curve (AUC), C-statistics, negative and positive predictive values, sensitivity, specificity, the Hosmer–Lemeshow goodness-of-fit test, precision, recall, and F-measure. Image segmentation accuracy (comparison between human expert labels and automated labels) reported as the Dice metric, mean contour distance, and Hausdorff distance (a metric-computation sketch follows the table). If accuracy is perfect, have too many predictors been included for the sample size, or are there confounding biases hidden in the data that may cause the model to overfit? Compare performance against standard statistical approaches (e.g., multivariate regression). If several algorithms are tested, report on all of them, not just the best-performing one. |
Publication and transparency | Make the code and an anonymized sample of the data publicly available (e.g., GitHub, Docker containers, R packages, or Code Ocean repositories). Encourage independent scrutiny of the algorithm. |
Generalization and replication of results | Algorithms should be validated by independent researchers on external cohorts and should satisfy the requirements of medical device and software regulatory frameworks. |
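
The pre-processing row mentions imputation of missing values and normalization/standardization. A minimal sketch of that step is shown below; it assumes a small synthetic tabular dataset and the scikit-learn library, neither of which is specified in the table, so it is illustrative rather than a prescribed implementation.

```python
# Minimal pre-processing sketch (hypothetical tabular data):
# impute missing values, then standardize continuous features.
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

X = np.array([[63.0, 1.4, np.nan],
              [71.0, np.nan, 2.3],
              [58.0, 1.1, 1.9]])  # rows = patients, columns = features

preprocess = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # fill missing values
    ("scale", StandardScaler()),                   # zero mean, unit variance
])
X_clean = preprocess.fit_transform(X)
print(X_clean)
```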
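The feature-selection row describes keeping a subset of relevant predictors from high-dimensional data. One common way to do this (among many; the table does not prescribe a method) is a univariate filter, sketched below with scikit-learn on synthetic data.

```python
# Feature-selection sketch: keep the k features most associated with the
# label (univariate ANOVA F-test), shrinking a high-dimensional dataset.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

X, y = make_classification(n_samples=200, n_features=500,
                           n_informative=10, random_state=0)
selector = SelectKBest(score_func=f_classif, k=10)
X_reduced = selector.fit_transform(X, y)
print(X.shape, "->", X_reduced.shape)  # (200, 500) -> (200, 10)
```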
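The data-allocation row lists train/test/validation splits and cross-validation. The sketch below shows one possible arrangement, a held-out test set plus stratified k-fold cross-validation on the remainder, again using synthetic data and scikit-learn as assumed tooling.

```python
# Data-allocation sketch: hold out a test set, then run stratified
# 5-fold cross-validation on the remaining data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (StratifiedKFold, cross_val_score,
                                     train_test_split)

X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000),
                         X_train, y_train, cv=cv, scoring="roc_auc")
print("cross-validated AUC:", scores.mean())
```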
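The evaluation row names AUC, sensitivity/specificity, and the Dice metric for segmentation accuracy. The sketch below computes these three on toy predictions and toy binary masks; the numbers and the `dice` helper are illustrative, not drawn from the article.

```python
# Evaluation sketch: ROC AUC, sensitivity/specificity from a confusion
# matrix, and a Dice coefficient for a binary segmentation mask.
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_prob = np.array([0.1, 0.4, 0.8, 0.7, 0.3, 0.2, 0.9, 0.6])
print("AUC:", roc_auc_score(y_true, y_prob))

tn, fp, fn, tp = confusion_matrix(y_true, (y_prob >= 0.5).astype(int)).ravel()
print("sensitivity:", tp / (tp + fn), "specificity:", tn / (tn + fp))

def dice(mask_a, mask_b):
    """Dice overlap between two binary masks (expert vs automated labels)."""
    intersection = np.logical_and(mask_a, mask_b).sum()
    return 2.0 * intersection / (mask_a.sum() + mask_b.sum())

expert = np.array([[0, 1, 1], [0, 1, 0]])
auto = np.array([[0, 1, 0], [0, 1, 0]])
print("Dice:", dice(expert, auto))
```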