Table 1. Common Measures of Model Performance.
| Measure | Description | Advantages | Limitations |
|---|---|---|---|
| Accuracy | Ratio of correct predictions to the total number of predictions made: (TP + TN)/Total | Easy to understand, works well if there are an equal number of samples in each class | Does not represent a clinically meaningful number if the classes are unbalanced, or if there is a high cost of misclassification (e.g., a rare but fatal disease) |
| Precision (positive predictive value) | Number of correct positive results divided by the number of positive test results: TP/(TP + FP) | Gives information about performance with respect to false positives; the goal is to minimize false positives | No information about false negatives |
| Sensitivity/recall | Number of correct positive results divided by all that are actually positive: TP/(TP + FN) | Gives information about performance with respect to false negatives; the goal is to minimize false negatives | No information about false positives |
| Specificity | Number of correct negative results divided by all that are actually negative: TN/(FP + TN) | Useful to characterize the rate of true negatives among all actual negatives | No information about true positives |
| F1 score | Measure of a test's accuracy; the harmonic mean of precision and recall: F1 = 2 × ((Precision × Recall)/(Precision + Recall)) | Balances both precision and recall; for instance, a model with high precision but low recall may have high accuracy but a lower F1 score | Harder to calculate than an arithmetic mean; may be difficult to interpret without familiarity with the concept |
| Mean absolute error | Average of the absolute differences between observed and predicted values | Measures the distance between predictions and actual values | Does not give any insight into the direction of error (under- or over-prediction) |
| Mean squared error | Similar to mean absolute error but averages the squared differences | Gradients are easier to compute than for mean absolute error | Can emphasize the effect of larger errors over smaller errors |
| Logarithmic loss | Classifier must assign a probability to each prediction; penalizes confident false classifications. Values closer to 0 indicate higher accuracy | Very strong with many observations | Weak with few observations. Minimizing log loss may lead to better probability estimation but at the cost of accuracy |
| Area under the receiver operating characteristic curve (AUROC) | Probability that a true positive will have a higher predicted probability than a true negative across all thresholds | Useful for discrimination. Helpful to visually assess performance over a range of thresholds and to compare across models. Higher is better (1.0 = perfect) | Not clinically relevant on its own; can be biased if classes are unbalanced |
| Area under the precision-recall curve (AUPRC) | Average probability that a positive prediction will be true across all sensitivities | Useful for discrimination. Reflects the overall probability that a positive prediction is a true positive. Higher is better (1.0 = perfect). Better suited to rare events than AUROC and helpful to visually assess performance | May be difficult to interpret; some of the curve reflects performance in clinically irrelevant regions |
| Hosmer–Lemeshow | Compares observed vs predicted probability across ranges of predicted risk | Useful for assessing model calibration. Visually represents the data, allowing easy observation of regions where the model performs poorly | Groupings of ranges are arbitrary. May struggle with smaller datasets |
| Scaled Brier score | Squared difference between observations and predictions, scaled to account for the event rate | Reflects explained variation, so it is useful for both discrimination and calibration. Higher is better (1.0 = perfect). Good measure of overall predictive performance | Does not give information about individual predictions and may reflect performance in clinically irrelevant regions |
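
To make the count-based measures in Table 1 concrete, the following is a minimal sketch in Python, assuming NumPy is available; the labels and thresholded predictions are hypothetical and chosen only to illustrate the formulas for accuracy, precision, sensitivity, specificity, and the F1 score.

```python
import numpy as np

# Hypothetical actual classes and thresholded predictions (illustration only)
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 0, 1])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0, 0, 0])

tp = np.sum((y_pred == 1) & (y_true == 1))   # true positives
tn = np.sum((y_pred == 0) & (y_true == 0))   # true negatives
fp = np.sum((y_pred == 1) & (y_true == 0))   # false positives
fn = np.sum((y_pred == 0) & (y_true == 1))   # false negatives

accuracy    = (tp + tn) / (tp + tn + fp + fn)                   # (TP + TN)/Total
precision   = tp / (tp + fp)                                    # TP/(TP + FP)
sensitivity = tp / (tp + fn)                                    # recall, TP/(TP + FN)
specificity = tn / (fp + tn)                                    # TN/(FP + TN)
f1 = 2 * (precision * sensitivity) / (precision + sensitivity)  # harmonic mean

print(f"accuracy={accuracy:.2f}, precision={precision:.2f}, "
      f"sensitivity={sensitivity:.2f}, specificity={specificity:.2f}, F1={f1:.2f}")
```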
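
The two error measures apply to continuous predictions. A similar sketch, again with hypothetical values, shows how the absolute and squared differences are averaged:

```python
import numpy as np

# Hypothetical observed values and model predictions (illustration only)
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

mae = np.mean(np.abs(y_true - y_pred))   # mean absolute error: ignores error direction
mse = np.mean((y_true - y_pred) ** 2)    # mean squared error: emphasizes large errors

print(f"MAE={mae:.3f}, MSE={mse:.3f}")
```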
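
Finally, the probability-based measures are computed from predicted probabilities rather than thresholded classes. The sketch below assumes scikit-learn is installed and uses hypothetical predicted probabilities; the scaled Brier score here is formed by scaling against a reference model that always predicts the event rate, which is one common formulation.

```python
import numpy as np
from sklearn.metrics import (log_loss, roc_auc_score,
                             average_precision_score, brier_score_loss)

# Hypothetical outcomes and predicted probabilities (illustration only)
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 0, 1])
y_prob = np.array([0.9, 0.2, 0.4, 0.8, 0.1, 0.6, 0.7, 0.3, 0.2, 0.45])

ll    = log_loss(y_true, y_prob)                 # logarithmic loss, lower is better
auroc = roc_auc_score(y_true, y_prob)            # discrimination across all thresholds
auprc = average_precision_score(y_true, y_prob)  # area under the precision-recall curve

# Scaled Brier score: 1 - Brier / Brier of a reference model that always
# predicts the event rate; 1.0 is perfect, 0 is no better than the event rate.
brier        = brier_score_loss(y_true, y_prob)
event_rate   = y_true.mean()
scaled_brier = 1 - brier / (event_rate * (1 - event_rate))

print(f"log loss={ll:.3f}, AUROC={auroc:.3f}, "
      f"AUPRC={auprc:.3f}, scaled Brier={scaled_brier:.3f}")
```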