2022 Nov 14;93(2):405–412. doi: 10.1038/s41390-022-02380-6

Table 1. Common Measures of Model Performance.

| Measure | Description | Advantages | Limitations |
|---|---|---|---|
| Accuracy | Ratio of correct predictions to the total number of predictions made: (TP + TN)/Total | Easy to understand; works well if there is an equal number of samples in each class | Not a clinically meaningful number if the classes are unbalanced or if there is a high cost of misclassification (e.g., a rare but fatal disease) |
| Precision (positive predictive value) | Number of correct positive results divided by the number of positive predictions: TP/(TP + FP) | Gives information about performance with respect to false positives; the goal is to minimize false positives | No information about false negatives |
| Sensitivity/recall | Number of correct positive results divided by all that are actually positive: TP/(TP + FN) | Gives information about performance with respect to false negatives; the goal is to minimize false negatives | No information about false positives |
| Specificity | Number of correct negative results divided by all that are actually negative: TN/(TN + FP) | Useful to characterize the rate of true negatives among all actual negatives | No information about true positives |
| F1 score | Measure of a test's accuracy; the harmonic mean of precision and recall: F1 = 2 × (Precision × Recall)/(Precision + Recall) | Balances precision and recall; for instance, a model with high precision but low recall may have high accuracy but will have a lower F1 score | Harder to calculate than an arithmetic mean; may be difficult to interpret unless familiar with the concept |
| Mean absolute error | Average of the absolute differences between observed and predicted values | Measure of the distance between prediction and actual value | Gives no insight into the direction of error (under- or over-prediction) |
| Mean squared error | Similar to mean absolute error but takes the square of the difference | Easier to compute gradients | Can emphasize the effect of larger errors over smaller errors |
| Logarithmic loss | The classifier must assign a probability to each prediction; penalizes false classifications. Values closer to 0 indicate higher accuracy | Very strong with many observations | Weak with few observations. Minimizing log loss may lead to better probability estimation, but at the cost of accuracy |
| Area under the receiver operating characteristic curve (AUROC) | Probability that a true positive will have a higher predicted probability than a true negative, across all thresholds | Useful for discrimination; helpful to visually assess performance over a range of thresholds and to compare across models. Higher is better (1.0 = perfect) | Not directly clinically relevant; can be biased if classes are unbalanced |
| Area under the precision–recall curve (AUPRC) | Average probability that a positive prediction will be true, across all sensitivities | Useful for discrimination; reflects the overall probability that a positive prediction is a true positive. Higher is better (1.0 = perfect). Better suited to rare events than AUROC, and helpful to visually assess performance | May be difficult to interpret; some performance is graphed at clinically irrelevant regions |
| Hosmer–Lemeshow | Observed probability vs. predicted probability across varying ranges of prediction | Useful for assessing model calibration; visually represents the data, allowing easy identification of regions where the model performs poorly | Groupings of ranges are arbitrary; may struggle with smaller datasets |
| Scaled Brier score | Squared difference between observations and predictions, scaled to account for the event rate | Explains variance, so useful for both discrimination and calibration. Higher is better (1.0 = perfect). Good measure of overall predictive performance | Does not give information about individual predictions and may represent performance at clinically irrelevant regions |
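
To make the confusion-matrix measures in the table concrete, the following is a minimal Python sketch that computes accuracy, precision, sensitivity/recall, specificity, and F1 directly from the formulas above; the TP/TN/FP/FN counts are hypothetical values chosen only for illustration.

```python
# Hypothetical confusion-matrix counts for a binary classifier (illustrative only).
TP, TN, FP, FN = 80, 890, 25, 5
total = TP + TN + FP + FN

accuracy = (TP + TN) / total           # correct predictions / all predictions
precision = TP / (TP + FP)             # positive predictive value
sensitivity = TP / (TP + FN)           # recall / true positive rate
specificity = TN / (TN + FP)           # true negative rate
f1 = 2 * (precision * sensitivity) / (precision + sensitivity)  # harmonic mean of precision and recall

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"sensitivity={sensitivity:.3f} specificity={specificity:.3f} f1={f1:.3f}")
```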
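
For the error- and probability-based measures (mean absolute error, mean squared error, logarithmic loss) and the two discrimination summaries (AUROC and AUPRC), one common choice is scikit-learn's built-in metric functions; this is an implementation assumption rather than the article's method, and the arrays below are toy data. Note that `average_precision_score` is a widely used summary of the precision–recall curve rather than a literal trapezoidal area.

```python
import numpy as np
from sklearn.metrics import (mean_absolute_error, mean_squared_error,
                             log_loss, roc_auc_score, average_precision_score)

# Regression-style errors: observed vs. predicted continuous values (toy data).
y_obs = np.array([1.0, 2.5, 3.0, 4.5])
y_hat = np.array([1.2, 2.0, 3.5, 4.0])
mae = mean_absolute_error(y_obs, y_hat)   # average absolute difference
mse = mean_squared_error(y_obs, y_hat)    # average squared difference (penalizes large errors more)

# Probability-based measures: predicted probabilities vs. binary outcomes (toy data).
y_true = np.array([0, 0, 1, 1, 1, 0])
y_prob = np.array([0.1, 0.4, 0.8, 0.65, 0.3, 0.2])
ll = log_loss(y_true, y_prob)                    # logarithmic loss (lower is better)
auroc = roc_auc_score(y_true, y_prob)            # area under the ROC curve
auprc = average_precision_score(y_true, y_prob)  # summary of the precision-recall curve

print(mae, mse, ll, auroc, auprc)
```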
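
The scaled Brier score has no single standard library implementation; one common formulation scales the raw Brier score against that of a noninformative model that always predicts the event rate, i.e. scaled Brier = 1 - Brier/Brier_max with Brier_max = rate × (1 - rate). The sketch below assumes that formulation.

```python
import numpy as np
from sklearn.metrics import brier_score_loss

def scaled_brier(y_true, y_prob):
    """Scaled Brier score: 1 - Brier / Brier_max, where Brier_max is the Brier score of a
    model that always predicts the event rate. Higher is better (1.0 = perfect)."""
    y_true = np.asarray(y_true, dtype=float)
    brier = brier_score_loss(y_true, y_prob)      # mean squared difference between outcome and probability
    event_rate = y_true.mean()
    brier_max = event_rate * (1.0 - event_rate)   # Brier score of always predicting the event rate
    return 1.0 - brier / brier_max

# Toy example (not data from the article).
print(scaled_brier([0, 0, 1, 1, 1, 0], [0.1, 0.4, 0.8, 0.65, 0.3, 0.2]))
```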
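
A Hosmer–Lemeshow-style calibration check groups predictions into risk bands (conventionally deciles), compares observed with expected event counts in each band, and sums the resulting chi-square contributions. The sketch below is one simple implementation of that idea; the number of groups and the degrees of freedom (g - 2) follow convention rather than anything specified in the article.

```python
import numpy as np
from scipy.stats import chi2

def hosmer_lemeshow(y_true, y_prob, n_groups=10):
    """Hosmer-Lemeshow-style statistic: observed vs. expected events across risk bands."""
    order = np.argsort(y_prob)                              # sort by predicted probability
    y_true = np.asarray(y_true, dtype=float)[order]
    y_prob = np.asarray(y_prob, dtype=float)[order]
    groups = np.array_split(np.arange(len(y_prob)), n_groups)
    stat = 0.0
    for g in groups:
        n = len(g)
        obs = y_true[g].sum()            # observed events in this risk band
        exp = y_prob[g].sum()            # expected events = sum of predicted probabilities
        obs_non, exp_non = n - obs, n - exp
        if exp > 0 and exp_non > 0:      # skip degenerate bands
            stat += (obs - exp) ** 2 / exp + (obs_non - exp_non) ** 2 / exp_non
    p_value = chi2.sf(stat, df=n_groups - 2)   # conventional degrees of freedom: groups - 2
    return stat, p_value

# Toy example: outcomes drawn from the predicted probabilities, so roughly well calibrated.
rng = np.random.default_rng(0)
p_hat = rng.uniform(0.05, 0.95, 500)
y = rng.binomial(1, p_hat)
print(hosmer_lemeshow(y, p_hat))
```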