Table 1. Common Measures of Model Performance.
| Measure | Description | Advantages | Limitations |
|---|---|---|---|
| Accuracy | Ratio of correct predictions to the total number of predictions made: (TP + TN)/Total | Easy to understand, works well if there are an equal number of samples in each class | Does not represent a clinically meaningful number if the classes are unbalanced, or if there is a high cost of misclassification (e.g., a rare but fatal disease) |
| Precision (positive predictive value) | Number of correct positive results divided by the number of positive test results: TP/(TP + FP) | Gives information about performance with respect to false positives; the goal is to minimize false positives | No information about false negatives |
| Sensitivity/recall | Number of correct positive results divided by all that are actually positive: TP/(TP + FN) | Gives information about performance with respect to false negatives; the goal is to minimize false negatives | No information about false positives |
| Specificity | Number of correct negative results divided by all that are actually negative: TN/(FP + TN) | Useful to characterize the rate of true negatives among all actual negatives | No information about true positives |
| F1 score | Measure of a test's accuracy; the harmonic mean of precision and recall: F1 = 2 × ((Precision × Recall)/(Precision + Recall)) | Balances both precision and recall; for instance, a model with high precision but low recall may have high accuracy but a lower F1 score | Harder to calculate than an arithmetic mean; may be difficult to interpret without familiarity with the concept |
| Mean absolute error | Average of the absolute differences between observed and predicted values | Measures the distance between predictions and actual values | Does not give any insight into the direction of error (under- or over-prediction) |
| Mean squared error | Similar to mean absolute error but averages the squared differences | Gradients are easier to compute than for mean absolute error | Can emphasize the effect of larger errors over smaller errors |
| Logarithmic loss | Classifier must assign a probability to each prediction; penalizes confident false classifications. Values closer to 0 indicate higher accuracy | Very strong with many observations | Weak with few observations. Minimizing log loss may lead to better probability estimation but at the cost of accuracy |
| Area under the receiver operating characteristic curve (AUROC) | Probability that a true positive will have a higher predicted probability than a true negative across all thresholds | Useful for discrimination. Helpful to visually assess performance over a range of thresholds and to compare across models. Higher is better (1.0 = perfect) | Not clinically relevant on its own; can be biased if classes are unbalanced |
| Area under the precision-recall curve (AUPRC) | Average probability that a positive prediction will be true across all sensitivities | Useful for discrimination. Reflects the overall probability that a positive prediction is a true positive. Higher is better (1.0 = perfect). Better suited to rare events than AUROC and helpful to visually assess performance | May be difficult to interpret; some of the curve reflects performance in clinically irrelevant regions |
| Hosmer–Lemeshow | Compares observed vs predicted probability across ranges of predicted risk | Useful for assessing model calibration. Visually represents the data, allowing easy observation of regions where the model performs poorly | Groupings of ranges are arbitrary. May struggle with smaller datasets |
| Scaled Brier score | Squared difference between observations and predictions, scaled to account for the event rate | Reflects explained variation, so it is useful for both discrimination and calibration. Higher is better (1.0 = perfect). Good measure of overall predictive performance | Does not give information about individual predictions and may reflect performance in clinically irrelevant regions |
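
To make the count-based measures in Table 1 concrete, the following is a minimal sketch in Python, assuming NumPy is available; the labels and thresholded predictions are hypothetical and chosen only to illustrate the formulas for accuracy, precision, sensitivity, specificity, and the F1 score.

```python
import numpy as np

# Hypothetical actual classes and thresholded predictions (illustration only)
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 0, 1])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0, 0, 0])

tp = np.sum((y_pred == 1) & (y_true == 1))   # true positives
tn = np.sum((y_pred == 0) & (y_true == 0))   # true negatives
fp = np.sum((y_pred == 1) & (y_true == 0))   # false positives
fn = np.sum((y_pred == 0) & (y_true == 1))   # false negatives

accuracy    = (tp + tn) / (tp + tn + fp + fn)                   # (TP + TN)/Total
precision   = tp / (tp + fp)                                    # TP/(TP + FP)
sensitivity = tp / (tp + fn)                                    # recall, TP/(TP + FN)
specificity = tn / (fp + tn)                                    # TN/(FP + TN)
f1 = 2 * (precision * sensitivity) / (precision + sensitivity)  # harmonic mean

print(f"accuracy={accuracy:.2f}, precision={precision:.2f}, "
      f"sensitivity={sensitivity:.2f}, specificity={specificity:.2f}, F1={f1:.2f}")
```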
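
The two error measures apply to continuous predictions. A similar sketch, again with hypothetical values, shows how the absolute and squared differences are averaged:

```python
import numpy as np

# Hypothetical observed values and model predictions (illustration only)
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

mae = np.mean(np.abs(y_true - y_pred))   # mean absolute error: ignores error direction
mse = np.mean((y_true - y_pred) ** 2)    # mean squared error: emphasizes large errors

print(f"MAE={mae:.3f}, MSE={mse:.3f}")
```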
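
Finally, the probability-based measures are computed from predicted probabilities rather than thresholded classes. The sketch below assumes scikit-learn is installed and uses hypothetical predicted probabilities; the scaled Brier score here is formed by scaling against a reference model that always predicts the event rate, which is one common formulation.

```python
import numpy as np
from sklearn.metrics import (log_loss, roc_auc_score,
                             average_precision_score, brier_score_loss)

# Hypothetical outcomes and predicted probabilities (illustration only)
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 0, 1])
y_prob = np.array([0.9, 0.2, 0.4, 0.8, 0.1, 0.6, 0.7, 0.3, 0.2, 0.45])

ll    = log_loss(y_true, y_prob)                 # logarithmic loss, lower is better
auroc = roc_auc_score(y_true, y_prob)            # discrimination across all thresholds
auprc = average_precision_score(y_true, y_prob)  # area under the precision-recall curve

# Scaled Brier score: 1 - Brier / Brier of a reference model that always
# predicts the event rate; 1.0 is perfect, 0 is no better than the event rate.
brier        = brier_score_loss(y_true, y_prob)
event_rate   = y_true.mean()
scaled_brier = 1 - brier / (event_rate * (1 - event_rate))

print(f"log loss={ll:.3f}, AUROC={auroc:.3f}, "
      f"AUPRC={auprc:.3f}, scaled Brier={scaled_brier:.3f}")
```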