Pediatr Crit Care Med. 2025 Apr 29;26(6):e855–e859. doi: 10.1097/PCC.0000000000003752

Use of the Area Under the Precision-Recall Curve to Evaluate Prediction Models of Rare Critical Illness Events

Blake Martin 1,2, Tellen D Bennett 1,2, Peter E DeWitt 2, Seth Russell 2, L Nelson Sanchez-Pinto 3
PMCID: PMC12133047  PMID: 40304543

In the critical care setting, where missed diagnoses and false alarms can have significant consequences, evaluating the performance of a classification (or binary prediction) model must be nuanced, critical, and thorough. Traditionally, the area under the receiver operating characteristic (AUROC) curve has been used as the primary evaluation metric for most models. The receiver operating characteristic (ROC) curve is a plot of a model’s sensitivity (also known as recall or true positive rate) and specificity (true negative rate) at each possible model output threshold (typically a score or probability). The AUROC, thus, provides a global evaluation of a model’s ability to discriminate between the two outcome classes by summarizing the sensitivity–specificity tradeoff over all model output thresholds.
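To make the threshold-by-threshold construction of the ROC curve concrete, here is a minimal R sketch (toy data and variable names are ours, not from this article) that tabulates sensitivity and specificity across all observed thresholds and computes the AUROC with the pROC package referenced later:

```r
# Minimal sketch on simulated toy data (not the article's models)
library(pROC)

set.seed(1)
n <- 1000
outcome <- rbinom(n, 1, 0.5)     # balanced toy outcome
score   <- outcome + rnorm(n)    # hypothetical model output (higher = more likely positive)

roc_obj <- roc(response = outcome, predictor = score, quiet = TRUE)

# Sensitivity (recall) and specificity at every observed threshold
head(coords(roc_obj, x = "all",
            ret = c("threshold", "sensitivity", "specificity")))

auc(roc_obj)  # AUROC summarizes the sensitivity-specificity tradeoff over all thresholds
```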

However, AUROC may be misinterpreted clinically when the outcome of interest is rare. Such scenarios are referred to as “imbalanced” because of the large ratio of negative to positive (or vice versa) cases. In these situations, a model may have a high AUROC as a result of the robust true negative rate (specificity) (1, 2) yet be unable to reliably identify positive cases (Fig. 1). Accordingly, AUROC may not give clinicians the insights they seek when deciding whether a classification model will be clinically useful (3).

Figure 1.

Example impact of outcome class imbalance on model predictions. This figure shows how applying model predictions to balanced (A) and imbalanced (B) outcome sets can affect the distribution of true positive (TP), false positive (FP), and false negative (FN) predictions. While the true negative (TN) rate (specificity) remains high in the imbalanced set, the ratio of FP predictions to TP predictions has increased, leading to a low positive predictive value and a potentially less useful model.

Many events of interest in the critical care setting, including mortality, clinical deterioration, and acute kidney injury, are imbalanced in our datasets because only a minority of our patients suffer these events (< 10–20%). In such situations, the precision-recall (PR) curve and area under the PR curve (AUPRC) offer more clinically relevant and operationally useful measures of performance. The sensitivity (recall) of a model indicates the proportion of all positive cases the model can identify. The positive predictive value (PPV or precision) is the proportion of positive predictions that are correct. Both are intuitive concepts for clinicians. By measuring the tradeoff between sensitivity and PPV, AUPRC aligns better with what we care about in imbalanced problems: reliable identification of a rare event. Further, PR curves can illustrate the operational implications of model deployment in critical care settings. Here, we demonstrate the utility of AUPRC and PR curves when predicting rare critical illness events using simulated pediatric data.
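As a hedged illustration of these definitions, the short R sketch below (toy data and thresholds are illustrative only, not the article's simulation) computes recall and PPV at one threshold and the AUPRC for a rare outcome using the PRROC package cited in the next section:

```r
# Toy example with a rare (~1%) outcome; names and numbers are illustrative only
library(PRROC)

set.seed(2)
n <- 20000
outcome <- rbinom(n, 1, 0.01)           # rare event
score   <- 1.5 * outcome + rnorm(n)     # hypothetical model output

# AUPRC: positive-class scores are passed first, negative-class scores second
pr <- pr.curve(scores.class0 = score[outcome == 1],
               scores.class1 = score[outcome == 0],
               curve = TRUE)
pr$auc.integral   # compare against the ~0.01 prevalence expected of a random model

# Recall (sensitivity) and precision (PPV) at one illustrative threshold
pred <- as.integer(score > 1)
tp <- sum(pred == 1 & outcome == 1)
fp <- sum(pred == 1 & outcome == 0)
fn <- sum(pred == 0 & outcome == 1)
c(recall = tp / (tp + fn), precision = tp / (tp + fp))
```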

A PREDICTION MODEL EXAMPLE USING SIMULATED CRITICAL CARE DATA

We synthesized clinical data for 200,000 virtual pediatric patients presenting with diabetic ketoacidosis and built prediction models for the outcome of cerebral edema (CE). Input variables, distributions, and CE prevalence were chosen to reflect observational studies (4–6). Virtual patients were randomly divided into training (80%) and test (20%) sets, and logistic regression (LR), random forest (RF), and extreme gradient boosting (XGBoost) models were trained to predict CE. We calculated AUROC and AUPRC and plotted ROC and PR curves using the pROC and PRROC packages (which use piecewise trapezoidal integration to determine area under the curve). AUROC and AUPRC comparisons and 95% CIs were computed using bootstrapping methods. The R software (Version 4.4.1; R Foundation for Statistical Computing, Vienna, Austria) we used is described in the Supplement (http://links.lww.com/PCC/C622).
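The article's Supplement contains the authors' actual code; as a greatly simplified sketch of this type of workflow (hypothetical predictors and coefficients, logistic regression only, 200 bootstrap replicates), one might write:

```r
# Simplified sketch only: hypothetical predictors/coefficients, not the article's simulation
library(pROC)
library(PRROC)

set.seed(42)
n <- 200000
dat <- data.frame(ph  = rnorm(n, 7.1, 0.15),   # hypothetical admission pH
                  bun = rnorm(n, 25, 10))      # hypothetical blood urea nitrogen
logit  <- -6 + 4 * (7.1 - dat$ph) + 0.05 * dat$bun   # assumed relationship, yields a rare outcome
dat$ce <- rbinom(n, 1, plogis(logit))                # simulated cerebral edema label

train_idx <- sample(n, 0.8 * n)
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

fit <- glm(ce ~ ph + bun, data = train, family = binomial)
p   <- predict(fit, newdata = test, type = "response")

auc(roc(test$ce, p, quiet = TRUE))                       # AUROC on the test set
pr.curve(scores.class0 = p[test$ce == 1],
         scores.class1 = p[test$ce == 0])$auc.integral   # AUPRC on the test set

# Bootstrap 95% CI for AUPRC (the article also bootstraps AUROC and model comparisons)
boot_auprc <- replicate(200, {
  i  <- sample(nrow(test), replace = TRUE)
  pp <- p[i]
  yy <- test$ce[i]
  pr.curve(scores.class0 = pp[yy == 1], scores.class1 = pp[yy == 0])$auc.integral
})
quantile(boot_auprc, c(0.025, 0.975))
```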

The AUROCs for the LR, RF, and XGBoost models were 0.953 (95% CI, 0.939–0.964), 0.874 (0.851–0.897), and 0.947 (0.939–0.964), respectively, while the AUPRCs were 0.116 (0.095–0.142), 0.083 (0.068–0.102), and 0.096 (0.082–0.112), respectively (Fig. 2). Importantly, the AUPRC, and thus each model’s value, is best interpreted in relation to the frequency of the event of interest (specifically, by dividing the AUPRC by that frequency, because a random model’s expected AUPRC equals the event frequency).

Figure 2.

Receiver operating characteristic (ROC) and precision-recall (PR) curves for simulation prediction models. This figure shows ROC (A) and PR (B) curves and the associated area under the ROC curve (AUROC) and area under the PR curve (AUPRC) values (with 95% CIs) for three models (logistic regression, random forest, and extreme gradient boosting [XGBoost]) predicting cerebral edema in the simulated population test set. We have reversed the x-axis in A, which is more commonly shown as 1 – specificity. PPV = positive predictive value.

Dividing the LR model AUPRC (0.116) by the CE outcome frequency (0.007), we find the LR model performs 16.6 times better than a random model would. Additionally, as observed in the PR curve in Figure 2B, a model threshold chosen to achieve a sensitivity of 0.85–0.90 would potentially result in a PPV (precision) 5–10% higher for the LR and XGBoost models than for the RF model. None of this is readily apparent in the ROC curve (Fig. 2A). Finally, although the LR and XGBoost models had similar AUROCs, the LR model AUPRC was statistically significantly higher, primarily as a result of its improved PPV at lower sensitivities. In such a scenario, sole use of AUROC and ROC curves for model selection could lead to selection of a model with a PPV-sensitivity tradeoff that does not provide the clinical utility the clinical team requires.
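The prevalence-normalized interpretation above is simple enough to check directly; a one-line sanity check using the values reported in the text:

```r
# A random (uninformative) model's expected AUPRC equals the event prevalence,
# so dividing AUPRC by prevalence gives the improvement over chance
auprc      <- 0.116   # LR model AUPRC reported above
prevalence <- 0.007   # cerebral edema frequency
auprc / prevalence    # ~16.6-fold better than a random model
```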

CAPTURING WHAT MATTERS: SPECIFICITY VS. POSITIVE PREDICTIVE VALUE

High sensitivity is important for prediction of critical illness events, including CE, because failing to identify events could have severe consequences for patients. Because both AUROC and AUPRC incorporate sensitivity, the main difference between them is their use of specificity (AUROC) vs. PPV (AUPRC). In imbalanced datasets, where negative cases are abundant, this difference becomes consequential (2).

When the number of true negatives vastly exceeds the number of true positives, a model can achieve high specificity (and a high AUROC) while labeling nearly all cases as negative, yet such a model has little clinical value because it cannot reliably detect the rarer positive cases. In our simulation, all three models have excellent AUROCs above 0.85. However, in an application where a sensitivity of 50% is deemed necessary, the resulting PPV will be less than 0.2, meaning fewer than one in five positive predictions will be correct. While this tradeoff may still be acceptable to the clinician, it is a much more sober assessment of clinical utility than the AUROC suggests.
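One way to read off the attainable PPV at a required sensitivity programmatically is to filter the curve matrix returned by PRROC; the sketch below uses toy data (not the article's models) and assumes the documented column order of recall, precision, threshold:

```r
# Toy data; curve matrix columns from PRROC are recall, precision, threshold
library(PRROC)

set.seed(3)
outcome <- rbinom(20000, 1, 0.01)
score   <- 1.5 * outcome + rnorm(20000)

pr <- pr.curve(scores.class0 = score[outcome == 1],
               scores.class1 = score[outcome == 0],
               curve = TRUE)
curve_mat <- pr$curve

# Best precision (PPV) among operating points that still meet the required sensitivity
target_sens <- 0.50
max(curve_mat[curve_mat[, 1] >= target_sens, 2])
```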

Furthermore, a small change in model performance that converts a true positive into a false negative leaves specificity essentially unchanged in imbalanced problems, because specificity depends only on the abundant negative cases. PPV, by contrast, changes more markedly in this situation, offering a more transparent measure of performance when individual positive cases carry high leverage. As illustrated in Figure 1, a “miss” by the model (a change from a true positive to a false negative) has an immediate, noticeable impact on PPV, without masking from true negatives.
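A back-of-the-envelope confusion-matrix example (hypothetical counts, chosen only to illustrate the asymmetry) makes this explicit:

```r
# Hypothetical counts for a rare outcome; one "miss" flips a TP to an FN
tp <- 5; fp <- 15; fn <- 2; tn <- 9978

spec_before <- tn / (tn + fp)    # specificity depends only on the abundant negatives
ppv_before  <- tp / (tp + fp)

tp2 <- tp - 1; fn2 <- fn + 1     # the miss: TP -> FN; TN and FP counts are untouched
spec_after  <- tn / (tn + fp)    # unchanged
ppv_after   <- tp2 / (tp2 + fp)  # drops from 0.25 to ~0.21

c(spec_before = spec_before, spec_after = spec_after,
  ppv_before = ppv_before, ppv_after = ppv_after)
```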

OPERATIONAL RELEVANCE OF AUPRC IN CRITICAL CARE SETTINGS

When deploying a model to detect or predict critical illness events, the primary goals are often two-fold: minimizing the number of missed positive cases (high sensitivity) and ensuring clinicians are not overwhelmed by false positive alerts. A high false positive rate diminishes clinicians’ trust in an alert and increases the time to respond (7).

The PR curve illustrates these priorities effectively, allowing one to gauge, at different sensitivities, how well the model maximizes PPV and minimizes the “number needed to alert” (NNA), defined as 1/PPV (8). The lower the PPV, the higher the number of false positive predictions and the larger the NNA. Both PPV and NNA are intuitive operational metrics for clinicians. NNA reflects the number of alerts the responding clinical team can expect to field for each correct prediction, which is crucial when balancing the cost of alert fatigue against the benefit of accurate predictions.

By illustrating the PPV attainable at different sensitivities, the PR curve lets clinicians judge whether a model’s performance aligns with an acceptable threshold for operational use. Returning to our simulation, the LR model PPV is greater than 0.25 only at low (< 0.05) sensitivities. Conversely, at sensitivities near 0.90, PPV ranges from 0.15 to 0.2. By examining this curve, the clinician can determine the impact of prioritizing PPV vs. sensitivity and consider the operational implications through discussion of the NNA. A clinician may decide, for example, that a threshold providing a sensitivity of 0.90 and an NNA of 6–7 is or is not acceptable depending on the actions involved (e.g., high-risk therapy vs. more intense monitoring).
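As a sketch of how one might operationalize this lookup (toy data; pROC’s coords() finds the operating point, and the nna column is our own addition, not a pROC metric):

```r
# Toy data; look up the threshold, PPV, and NNA at a required sensitivity
library(pROC)

set.seed(4)
outcome <- rbinom(20000, 1, 0.01)
score   <- 1.5 * outcome + rnorm(20000)

roc_obj <- roc(outcome, score, quiet = TRUE)

# Operating point at the required sensitivity, with the corresponding PPV
op <- coords(roc_obj, x = 0.90, input = "sensitivity",
             ret = c("threshold", "sensitivity", "ppv"))
op$nna <- 1 / op$ppv   # number needed to alert: expected alerts per correct prediction
op
```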

In summary, inspection of PR curves allows the clinician to evaluate if a prediction model is appropriate for clinical use and enables selection of a probability threshold suitable for the prediction task and clinical environment. Although inspection of the ROC curve remains a key component of predictive model evaluation, it does not provide the same operational insights as the PR curve because the tradeoff between sensitivity and 1-specificity can be more challenging to translate into real-world clinical scenarios.

While AUPRC is advantageous for evaluating models on imbalanced datasets, it is important to recognize its potential challenges. AUPRC can be sensitive to small changes in precision and recall, especially when positive cases are extremely rare. Therefore, it is important to complement AUPRC with visual inspection of PR curves to ensure that model and threshold selection match the priorities of the prediction task. Further, it is worth noting that a model that consistently outperforms others across all thresholds in ROC space will also outperform those models in PR space.

CONCLUSIONS

AUPRC and the PR curve are informative, practical metrics that should be considered when assessing prediction models for rare clinical events. By prioritizing PPV over specificity, AUPRC aligns with the operational priorities of critical care clinicians, offering a clearer picture of model performance and insight on interpreting positive predictions. While AUROC is an important metric and should be used as one of several measures of discrimination, we recommend investigators working with imbalanced prediction problems in critical care use AUPRC and the PR curve to determine how a model will perform when applied clinically at various thresholds, considering the tradeoff between sensitivity and PPV.

Supplementary Material

pcc-26-e855-s001.docx (25.8KB, docx)

Footnotes

Supplemental digital content is available for this article. Direct URL citations appear in the printed text and are provided in the HTML and PDF versions of this article on the journal’s website (http://journals.lww.com/pccmjournal).

Financial support for this study was provided by the Eunice Kennedy Shriver National Institute of Child Health and Human Development (NICHD) and the National Institutes of Health (NIH), award number R01HD105939 (principal investigators [PIs], Drs. Bennett and Sanchez-Pinto); the Eunice Kennedy Shriver NICHD and the NIH, award number K23HD111616 (PI, Dr. Martin); and the National Heart, Lung, and Blood Institute, award number K24 HL168225 (PI, Dr. Bennett).

Drs. Martin’s, DeWitt’s, Russell’s, and Sanchez-Pinto’s institutions received funding from the Eunice Kennedy Shriver National Institute of Child Health and Human Development (NICHD) (R01HD105939; K23HD111616) and the National Heart, Lung, and Blood Institute (NHLBI) (K24 HL168225). Drs. Martin, Bennett, DeWitt, Russell, and Sanchez-Pinto received support for article research from the National Institutes of Health. Dr. Bennett’s institution received funding from the NICHD, the NHLBI, the Centers for Disease Control and Prevention, and the National Institute of General Medical Sciences. Dr. Sanchez-Pinto disclosed angel investments in several startup companies related to cancer and dementia therapies; they disclosed they served as a consultant for artificial intelligence implementation for patient safety use cases.

This work was performed at University of Colorado School of Medicine, Aurora, CO.

Contributor Information

Tellen D. Bennett, Email: tell.bennett@cuanschutz.edu.

Peter E. DeWitt, Email: peter.dewitt@cuanschutz.edu.

Seth Russell, Email: seth.russell@cuanschutz.edu.

L. Nelson Sanchez-Pinto, Email: lsanchezpinto@luriechildrens.org.

REFERENCES

1. Saito T, Rehmsmeier M: The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS One. 2015; 10:e0118432
2. Liu Z, Bondell HD: Binormal precision–recall curves for optimal classification of imbalanced data. Stat Biosci. 2019; 11:141–161
3. Steyerberg EW, Vergouwe Y: Towards better clinical prediction models: Seven steps for development and an ABCD for validation. Eur Heart J. 2014; 35:1925–1931
4. Glaser N, Barnett P, McCaslin I, et al: Risk factors for cerebral edema in children with diabetic ketoacidosis. N Engl J Med. 2001; 344:264–269
5. Lawrence SE, Cummings EA, Gaboury I, et al: Population-based study of incidence and risk factors for cerebral edema in pediatric diabetic ketoacidosis. J Pediatr. 2005; 146:688–692
6. Edge JA, Hawkins MM, Winter DL, et al: The risk and outcome of cerebral oedema developing during diabetic ketoacidosis. Arch Dis Child. 2001; 85:16–22
7. Bonafide CP, Zander M, Graham CS, et al: Video methods for evaluating physiologic monitor alarms and alarm responses. Biomed Instrum Technol. 2014; 48:220–230
8. Dewan M, Sanchez-Pinto LN: Crystal balls and magic eight balls: The art of developing and implementing automated algorithms in acute care pediatrics. Pediatr Crit Care Med. 2019; 20:1197–1199
