Front Digit Health. 2022 May 20;4:923944. doi: 10.3389/fdgth.2022.923944

Table 1.

Evaluation measures from statistics and machine learning fields.

| Evaluation measure | Field (statistics/machine learning) | Definition |
| --- | --- | --- |
| **Discrimination measures (decision-threshold independent)** | | |
| Area under the receiver operating characteristic curve (AUROC) | S/ML | The receiver operating characteristic (ROC) curve plots sensitivity as a function of 1 − specificity. Its baseline (chance performance) is fixed at 0.5 regardless of the event rate, so the AUROC can be compared across settings with different event rates. |
| Area under the precision-recall curve (AUPRC) | ML | The precision-recall curve plots precision (positive predictive value) as a function of sensitivity (recall). Its baseline equals the event rate, i.e., the ratio of positive cases to total cases, so the AUPRC cannot be compared across settings with different event rates; it also ignores true negatives. |
| **Classification measures (decision-threshold dependent)** | | |
| Crude accuracy | ML | Crude accuracy is the number of true positive and true negative predictions divided by the total number of cases. |
| Sensitivity (recall) | S/ML | The sensitivity is the number of true positive predictions divided by the total number of positive cases at a specified probability threshold. |
| Specificity | S/ML | The specificity is the number of true negative predictions divided by the total number of negative cases at a specified probability threshold. |
| Positive predictive value (precision) | S/ML | The positive predictive value (PPV) is the number of true positive predictions divided by the total number of positive predictions at a specified probability threshold. |
| Negative predictive value | S/ML | The negative predictive value (NPV) is the number of true negative predictions divided by the total number of negative predictions at a specified probability threshold. |
| Fβ-score | ML | The Fβ-score is the harmonic mean of sensitivity and positive predictive value, weighted by the β coefficient: Fβ = (1 + β²) · (PPV · sensitivity) / (β² · PPV + sensitivity). When false positives are more important than false negatives, β is set smaller than 1; when false negatives are more important than false positives, β is set larger than 1. Popular instances are the F1- and F2-score. The F1-score implies equal weight for false negative and false positive classifications, which is "absurd" for most medical contexts (7). |
| **Measures related to clinical utility** | | |
| Net Benefit | S | Net Benefit is a weighted sum of true positive (TP) and false positive (FP) predictions at a given decision threshold t: NB = (TP − (t / (1 − t)) · FP) / N, where N is the total number of cases. Net Benefit can be plotted over a range of decision thresholds, resulting in a decision curve (4). |
| Relative utility | S | Relative utility is the maximum net benefit of risk prediction at a given decision threshold divided by the maximum net benefit of perfect prediction. A relative-utility curve plots relative utility over a range of decision thresholds (8). |
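The AUROC in the table admits a direct rank-based computation: it equals the probability that a randomly chosen positive case receives a higher predicted probability than a randomly chosen negative case (the Mann–Whitney U interpretation). A minimal sketch; the function name `auroc` and the example data are illustrative, not from the source article.

```python
def auroc(y_true, y_prob):
    """AUROC as the probability that a random positive case is ranked
    above a random negative case; ties contribute 0.5 (Mann-Whitney U)."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    # Score every (positive, negative) pair: 1 if correctly ordered,
    # 0.5 for a tie, 0 if inverted, then average over all pairs.
    wins = sum(1.0 if pp > pn else 0.5 if pp == pn else 0.0
               for pp in pos for pn in neg)
    return wins / (len(pos) * len(neg))
```

Because this is a pure ranking statistic, it is unchanged by any monotone rescaling of the predicted probabilities, which is why the baseline of 0.5 is fixed across event rates.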
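The Net Benefit definition, NB = (TP − (t / (1 − t)) · FP) / N, can likewise be sketched directly, with a helper that traces it over a range of thresholds to form a decision curve. Function names and the example threshold grid are illustrative assumptions, not from the source article.

```python
def net_benefit(y_true, y_prob, threshold):
    """Net Benefit NB = (TP - (t / (1 - t)) * FP) / N at threshold t.
    The odds t / (1 - t) weights each false positive by the harm-to-benefit
    ratio implied by the chosen decision threshold."""
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    return (tp - (threshold / (1 - threshold)) * fp) / n

def decision_curve(y_true, y_prob, thresholds):
    """Net Benefit over a range of decision thresholds (a decision curve)."""
    return [(t, net_benefit(y_true, y_prob, t)) for t in thresholds]
```

A usage example: `decision_curve(y_true, y_prob, [i / 100 for i in range(1, 100)])` yields the (threshold, Net Benefit) pairs one would plot against the "treat all" and "treat none" reference strategies.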