Diagnostic and Prognostic Research. 2026 Feb 17;10:8. doi: 10.1186/s41512-026-00224-z

The continuous net benefit: assessing the clinical utility of prediction models when informing a continuum of decisions

Jose Benitez-Aurioles 1, Laure Wynants 2,3,4, Niels Peek 5, Patrick Goodley 6,7, Philip Crosbie 6,7, Matthew Sperrin 1
PMCID: PMC12911006  PMID: 41703610

Abstract

Background

The net benefit and decision curve analysis are increasingly being used to assess the clinical utility of prognostic models. This metric assesses the value added by a model’s predictions when individuals are treated differently according to whether they are over or under a chosen threshold. Although such ‘treat or not’ decisions are common, prognostic models are also often used to tailor and personalise the care of patients, which implicitly involves the consideration of multiple interventions at different risk thresholds. We aim to extend decision curve analysis to estimate the net benefit of a model over multiple thresholds.

Methods

We take a weighted area under a rescaled version of the net benefit curve, deriving the continuous net benefit. In addition to the consideration of a continuum of interventions, we also show how the continuous net benefit can be used to evaluate single treatments in populations with a range of optimal thresholds, due to individual variations in expected treatment benefit or harm, highlighting limitations of currently proposed methods that calculate the area under the decision curve. We propose this not as a substitute for decision curves, but as a complementary evaluation metric, in lieu of single-threshold point estimates.

Results

We showcase this metric through two examples of model validation in cardiovascular preventive care. The continuous net benefit brought additional insight over point estimates when comparing models over a range of decisions.

Conclusions

The continuous net benefit informs those looking to validate clinical prediction models of their clinical utility, and helps decision makers understand their usefulness, improving their viability towards implementation.

Supplementary Information

The online version contains supplementary material available at 10.1186/s41512-026-00224-z.

Keywords: Clinical prediction models, Net benefit, Decision curve analysis, Prognosis

Introduction

Clinical prediction models are used to determine a patient’s diagnostic or prognostic risk. This can be useful when deciding whether patients should be screened, treated, monitored, or referred for diagnosis. Performance metrics such as the C-statistic or calibration plots can be used to assess the performance of these models irrespective of the clinical context and decisions that the model is meant to support [1, 2], but are insufficient when determining if they will be useful in practice [1, 3].

Decision curve analysis is used to compare the clinical utility of such models using the net benefit [4]. Models are evaluated by binarizing the risk scores at a clinically informed choice of threshold [5], with the decision curve plotting this threshold-specific net benefit for a range of thresholds. Although decision curve analysis was originally designed for models informing a single clinical decision [4], a model’s output is often not used to answer just a single ‘treat or not’ question. For example, in cardiovascular disease prevention, a clinician might use QRISK to decide whether to give a patient lifestyle recommendations, refer them to a smoking cessation programme, prescribe statins and/or antihypertensives, and increase monitoring frequency [6–9]. Within the same clinical setting, thresholds can also differ between patients because of differences in expected treatment benefit or potential harm. For instance, individuals with liver disease might be cautioned against taking statins and therefore require a higher cardiovascular risk before treatment is considered appropriate. In addition, even among clinically similar patients, thresholds may vary because of personal attitudes towards risk. In shared decision-making settings, some patients may prefer to defer treatment until their estimated risk is markedly higher after discussing possible complications with their clinician [10]. Reporting the benefit of a model for a single threshold would not reflect the true threshold variability, would misrepresent the underlying uncertainty, and would overestimate precision. In such cases, a model is considered superior to another model if its corresponding decision curve is above the other model’s curve over the entire range of relevant thresholds. Whilst you can interpret a difference in net benefit between two models for a particular threshold, you cannot meaningfully interpret the difference in net benefit of one model over many thresholds. This introduces difficulties in developing metrics that ‘summarize’ the net benefit over a range, and these approaches are subject to debate. However, a single-metric evaluation of net benefit across multiple thresholds could be useful when optimising or comparing models.

In this paper, we show why, in order to consider the previous formulation of the area under the net benefit curve valid, an assumption of equal value of true positives across the population is needed. We introduce a new metric, called the continuous net benefit, derived from the conditional utility formula, and discuss its use, connecting it to performance metrics like the likelihood function and Brier score. The continuous net benefit requires an appropriately chosen weighting function across thresholds to be specified. We discuss the potential use of continuous net benefit to inform researchers of clinical utility during model validation, including examples of its application.

Methods

Introduction to the net benefit

Before introducing the continuous net benefit, we provide a brief walkthrough of the derivation of the net benefit as introduced in Vickers & Elkin (2006) in order to better position our proposed extension [4]. Here, the focus is not on determining whether a particular treatment is beneficial, but instead on determining the net benefit of a particular model when used to decide whether patients should receive treatment or not.

We explore this problem through an example in cardiovascular prognosis. Consider a population with a binary outcome of interest $Y$, which indicates whether an individual will have a cardiovascular event in the next ten years. A model is developed so that the risk of the outcome is estimated for each individual. Those with model scores above a threshold probability $t$ are prescribed statins, and the rest are not. The performance of that policy is summarised by its per capita true positives $TP_t$ (number of people given statins who would have had a cardiovascular event if they had not been prescribed statins, divided by population size), false positives $FP_t$ (people treated who would not have had the event if they had not been prescribed statins), false negatives $FN_t$ (people who were not treated and who would then have the event) and true negatives $TN_t$ (people who were not treated and would then not have the event). We define utility as a numerical measure of clinical benefit or harm. Its unit may vary depending on context, for example, quality-adjusted life years, probability of a favourable outcome, or patient-reported satisfaction. If $U_{TP}$, $U_{FP}$, $U_{FN}$, and $U_{TN}$ represent the utility for a patient of getting a true positive, false positive, false negative, and true negative result, respectively, the total utility, conditional on the population, of the model is defined as:

$$U = TP_t \, U_{TP} + FP_t \, U_{FP} + FN_t \, U_{FN} + TN_t \, U_{TN} \qquad (1)$$

The net benefit $NB$ is a simplified version of the utility so that if one model has a higher $NB$ than another, it also has a higher $U$. To derive the net benefit, the best choice of threshold, or clinically optimal threshold $t$, is determined by $\frac{t}{1-t} = \frac{U_{TN} - U_{FP}}{U_{TP} - U_{FN}}$, so that knowing the value of the utilities determines $t$, and thus knowing $t$ is informative of the utility values (without fully specifying them, as $t$ only fixes a ratio of the four utilities) [4]. Rescaling the benefit of giving a patient statins when they would have had a cardiovascular event (i.e., $U_{TP} - U_{FN}$) to be 1, the net benefit becomes:

$$NB(t) = TP_t - FP_t \, \frac{t}{1-t} \qquad (2)$$
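As a minimal sketch of the net benefit just described, the following Python snippet computes $TP_t - FP_t \cdot t/(1-t)$ from observed outcomes and predicted risks, and prints a small decision curve. The function name `net_benefit`, the toy data, and the convention of treating everyone with predicted risk greater than or equal to the threshold are assumptions made for illustration; they are not taken from the additional files of the paper.

```python
def net_benefit(y, p, t):
    """Net benefit per person when everyone with predicted risk >= t is treated."""
    n = len(y)
    # Per capita true and false positives at threshold t.
    tp = sum(1 for yi, pi in zip(y, p) if pi >= t and yi == 1) / n
    fp = sum(1 for yi, pi in zip(y, p) if pi >= t and yi == 0) / n
    # False positives are down-weighted by the odds of the threshold.
    return tp - fp * t / (1 - t)


# Hypothetical toy data: observed outcomes and predicted 10-year risks.
y = [1, 0, 1, 0, 0, 0, 1, 0, 0, 0]
p = [0.30, 0.05, 0.15, 0.12, 0.02, 0.08, 0.25, 0.20, 0.04, 0.06]

# A small decision curve: the model against treat-all and treat-none.
for t in (0.05, 0.10, 0.20):
    model = net_benefit(y, p, t)
    treat_all = net_benefit(y, [1.0] * len(y), t)
    treat_none = net_benefit(y, [0.0] * len(y), t)
    print(f"t={t:.2f}  model={model:.3f}  treat-all={treat_all:.3f}  treat-none={treat_none:.3f}")
```

Treat-all and treat-none are obtained here simply by passing constant risks of 1 and 0, which keeps all policies on the same scale.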

The optimal threshold $t$ can be determined from initial choices of the utilities $U_{TP}$ to $U_{TN}$, or more commonly through clinical knowledge and practice. If the optimal threshold $t$ is not known or varies across the population (as the utilities $U_{TP}$ to $U_{TN}$ might not be equal for everyone), the net benefit can be plotted across a reasonable range of values for $t$. This plot is known as the decision curve.

Limitations of previous area under the net benefit curve metrics

A previous motivation for calculating the area under the net benefit curve has been the idea that individuals, due to different underlying utilities ($U_{TP}$, $U_{FP}$, $U_{FN}$ and $U_{TN}$), have different optimal thresholds $t$. For example, there will always be differences between individual patients in the extent to which they benefit from treatment or get harmed by side-effects. Models therefore need to be beneficial across a range of thresholds. One approach to account for this distribution of optimal thresholds $f(t)$ is to check the decision curve throughout a previously determined range of reasonable thresholds (i.e., the range of $t$ for which we think $f(t)$ is non-negligible). If the net benefit of a model is better than another policy across the entire range, the model is considered more beneficial [5]. This assumes that the threshold is a random variable $T$ independent of the performance of the model at that threshold, so that the overall estimate of the performance in terms of $TP_t$ and $FP_t$ is not biased compared to the subgroup-specific performance in those individuals for which the threshold is $t$.

The area under the net benefit curve has been proposed as a measure of the clinical benefit of a model in a population with a distribution of optimal thresholds $f(t)$ by calculating its expected net benefit [11, 12]. It uses an estimated (or assumed) distribution of the optimal threshold $f(t)$ to calculate the expected net benefit in a population:

$$E[NB] = \int_0^1 NB(t) \, f(t) \, dt = \int_0^1 \left( TP_t - FP_t \, \frac{t}{1-t} \right) f(t) \, dt \qquad (3)$$

Here we highlight an underlying assumption of this equation (but not of decision-curve analysis as a whole) which, to our knowledge, has not been previously discussed in the literature. Because the net benefit is in units of true positives (that is, $U_{TP} - U_{FN} = 1$, as mentioned in the previous section), using the integral in Eq. (3) to yield an expectation of the utility is only valid when assuming that true positives are equally beneficial for all patients. The difference in optimal thresholds across the population is thus entirely dictated by variations in the false positive harm ($U_{TN} - U_{FP}$) in each individual. This is unlikely to be true, as expected efficacy of treatment may drive differences in personal threshold.

While this is not an issue when examining the decision curve, as it is meant to be compared at every threshold (it is read ‘vertically’), it becomes an issue when the net benefit is integrated across a range (read ‘horizontally’). Consider, for example, an alternative assumption, where the harm of a false positive $U_{TN} - U_{FP}$ is constant and equal to 1, so that $NB'(t) = TP_t \, \frac{1-t}{t} - FP_t$, and the area under the net benefit is:

$$E[NB'] = \int_0^1 \left( TP_t \, \frac{1-t}{t} - FP_t \right) f(t) \, dt \qquad (4)$$

For any single value of threshold $t$, $NB(t)$ and $NB'(t)$ are equivalent, in that if a model has higher net benefit than another in $NB(t)$, it also has higher net benefit in $NB'(t)$. This is however not true in terms of the integral, as one model might have a higher $E[NB]$ and lower $E[NB']$ than another (we show a proof for this in Additional file 1, Supplementary 1.4).

In general, knowing the distribution of optimal thresholds $f(t)$ in a population of interest is not sufficient to calculate an expected or total benefit of a model in this population without additional knowledge or assumptions. To understand why this is the case, imagine a population of two groups $A$ and $B$ where each makes up half of the population. A prediction model is used to inform whether a preventative drug should be given in both groups, and this model has equal calibration and discrimination in $A$ and $B$. Consider that in group $A$, there is no real value in treatment nor harm in over-treating so that $U_{TP} - U_{FN}$ and $U_{TN} - U_{FP}$ are small, but balanced so that the optimal threshold is $t_A$. On the other hand, group $B$ is considered frail, and the treatment decision is ‘life-or-death’ for them, with large benefit in prevention, but serious harm in being given the drug when not needed, so $U_{TP} - U_{FN}$ and $U_{TN} - U_{FP}$ are large, but balanced so that the optimal threshold is $t_B$. A useful measure of the benefit of the model is not one that averages over the benefit of the model at $t_A$ and $t_B$ with equal weighting. In reality, decisions in group $A$ are inconsequential, and decisions in group $B$ are very important. It is thus important not only to know the distribution $f(t)$ across the population, but also to have more information on the relative scale of the utilities $U_{TP} - U_{FN}$ and $U_{TN} - U_{FP}$, in order to know how useful a model is for a population with varying thresholds. This is not something you need to consider if you can separately optimise the choice of model in both groups but is essential if you are using a single model for both groups.

Combining the net benefit of two decisions

We briefly focus on the related issue of combining the net benefit of multiple decisions. For example, we would now like to assess the net benefit of a cardiovascular model when both informing the prescription of statins and the enrolment of patients into a lifestyle intervention programme. It is often not necessary to combine the benefit of two interventions, as their benefit can be easily considered separately, and different models can be used for each decision [13]. However, considering how to do this is a useful intermediary step for later reasoning.

We assume that both decisions to intervene depend on the risk of the same outcome. The two interventions have different conditional utilities $U_1$ and $U_2$, different outcome utilities $U_{TP,1}$, $U_{FP,1}$, $U_{FN,1}$, $U_{TN,1}$ and $U_{TP,2}$, $U_{FP,2}$, $U_{FN,2}$ and $U_{TN,2}$, different optimal thresholds $t_1$ and $t_2$, and net benefit values of $NB_1$ and $NB_2$. We assume that the effects of the two interventions do not interact, or that at least the utilities of the second intervention are in terms of the added benefit and harm with respect to the first, so that the conditional utility of both model-informed decisions is:

$$U = U_1 + U_2 = \sum_{k=1}^{2} \left( TP_{t_k} U_{TP,k} + FP_{t_k} U_{FP,k} + FN_{t_k} U_{FN,k} + TN_{t_k} U_{TN,k} \right) \qquad (5)$$

The goal is to find an expression of the net benefit $NB_{1+2}$ (i.e., the benefit of a model across the two interventions) so that, if and only if a model has a higher $NB_{1+2}$ than another in a population, it also has a higher $U$.

The first intuition for calculating the total net benefit $NB_{1+2}$ (the difference in benefit between using a model to inform which patients should get which treatments and never giving either treatment) is to choose $NB_{1+2} = NB_1 + NB_2$. However, this is problematic, as $NB_1$ has as unit the benefit of treating a true positive with statins (so $U_{TP,1} - U_{FN,1} = 1$), and $NB_2$ has as unit the benefit of enrolling a true positive in a lifestyle intervention programme ($U_{TP,2} - U_{FN,2} = 1$). The treatment effect of these two interventions is not equal ($U_{TP,1} - U_{FN,1} \neq U_{TP,2} - U_{FN,2}$). Instead, the ratio between the value of both interventions, $r = \frac{U_{TP,2} - U_{FN,2}}{U_{TP,1} - U_{FN,1}}$, needs to be estimated in order to calculate the total net benefit as:

$$NB_{1+2} = NB_1 + r \, NB_2 \qquad (6)$$

In the cardiovascular example, if we estimate that enrolment in a lifestyle intervention programme is half as effective in reducing cardiovascular events compared to treating a patient with statins, an appropriate sum of the two net benefits which ranks models in the same order as $U$ would be $NB_1 + 0.5 \, NB_2$, up to a rescaling and offset factor.
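The weighted sum for two decisions can be sketched in a few lines of Python. This is an illustrative sketch only: the function names, the toy data, and the 5% threshold for the lifestyle programme are hypothetical, while the ratio of 0.5 and the 10% statin threshold follow the cardiovascular example in the text.

```python
def net_benefit(y, p, t):
    """Net benefit per person when everyone with predicted risk >= t is treated."""
    n = len(y)
    tp = sum(1 for yi, pi in zip(y, p) if pi >= t and yi == 1) / n
    fp = sum(1 for yi, pi in zip(y, p) if pi >= t and yi == 0) / n
    return tp - fp * t / (1 - t)


def combined_net_benefit(y, p, t1, t2, ratio):
    """NB of decision 1 plus NB of decision 2, rescaled to decision-1 true positives.

    `ratio` is the true positive benefit of intervention 2 relative to intervention 1.
    """
    return net_benefit(y, p, t1) + ratio * net_benefit(y, p, t2)


# Hypothetical toy data: observed outcomes and predicted 10-year risks.
y = [1, 0, 1, 0, 0, 0, 1, 0, 0, 0]
p = [0.30, 0.05, 0.15, 0.12, 0.02, 0.08, 0.25, 0.20, 0.04, 0.06]

# Statins at a 10% threshold; lifestyle programme at an assumed 5% threshold,
# valued at half the benefit of statins per true positive.
total = combined_net_benefit(y, p, t1=0.10, t2=0.05, ratio=0.5)
print(f"combined net benefit (statin true positives per person): {total:.3f}")
```

The result is expressed in units of the first intervention's true positives, which is why the second net benefit is multiplied by the ratio rather than added as is.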

The continuous net benefit

In reality, cardiovascular prognostic models like QRISK are used for a very wide range of potential treatment strategies and monitoring plans that are generally impossible to enumerate in advance. We model this situation as a continuum of interventions, each with a particular optimal threshold $t$, so that, at each threshold $t$, the corresponding treatments carry an added true positive utility of $U_{TP}(t)$ and false positive utility of $U_{FP}(t)$. The total utility across all potential treatments for the patient is thus the integral of the conditional utility function:

$$U = \int_0^1 \left( TP_t \, U_{TP}(t) + FP_t \, U_{FP}(t) + FN_t \, U_{FN} + TN_t \, U_{TN} \right) dt \qquad (7)$$

Where $U_{FN}$ and $U_{TN}$ are constants, as they relate to the absence of any treatment. Writing $FN_t = \pi - TP_t$ and $TN_t = (1 - \pi) - FP_t$, where $\pi$ is the incidence of the outcome in the population, we get:

$$U = \int_0^1 \left[ TP_t \left( U_{TP}(t) - U_{FN} \right) - FP_t \left( U_{TN} - U_{FP}(t) \right) \right] dt + \pi \, U_{FN} + (1 - \pi) \, U_{TN} \qquad (8)$$

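The term $\pi \, U_{FN} + (1 - \pi) \, U_{TN}$ is equal for all policies or models, and can thus be ignored when comparing models. For any threshold $t$, we still have $\frac{t}{1-t} = \frac{U_{TN} - U_{FP}(t)}{U_{TP}(t) - U_{FN}}$, and thus:

$$cNB = \int_0^1 w(t) \left( \frac{TP_t}{t} - \frac{FP_t}{1-t} \right) dt, \qquad w(t) = \frac{\left( U_{TP}(t) - U_{FN} \right) \left( U_{TN} - U_{FP}(t) \right)}{\left( U_{TP}(t) - U_{FN} \right) + \left( U_{TN} - U_{FP}(t) \right)} \qquad (9)$$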

We call this metric the continuous net benefit. The sum $\frac{TP_t}{t} - \frac{FP_t}{1-t}$ is equal, up to a constant, to the sum of the net benefit in the treated and that in the untreated for an intervention with optimal threshold $t$ [14]. The weighting function $w(t)$ is proportional to the harmonic mean between the utility of identifying a true positive, $U_{TP}(t) - U_{FN}$, and that of avoiding a false positive, $U_{TN} - U_{FP}(t)$. We call the value of this weighting function for a particular threshold its ‘importance’, as it, in practice, determines how much the benefit (and thus performance) of a model at a particular threshold should be weighed against other thresholds. If a normalisation constant is chosen so that $\int_0^1 \left( U_{TP}(t) - U_{FN} \right) dt = \int_0^1 \frac{w(t)}{t} \, dt = 1$, the unit of the continuous net benefit is still true positives (though it will be combined true positives, i.e., the benefit that one patient with $Y = 1$ gets when managed as a high-risk patient for all decisions). The continuous net benefit represents the net utility gain, relative to a baseline ‘treat-none’ strategy, when the model informs all treatments under consideration. A continuous net benefit of one indicates that, compared with classifying everyone as low risk (and therefore treating no one), the model provides a population-level benefit equivalent to correctly identifying and fully treating one additional true positive patient. The continuous net benefit can also be calculated for non-model-based policies, such as treating all patients as low risk (treat-none) or as high risk (treat-all). For the treat-none policy, the continuous net benefit is always zero, as is the case for the standard net benefit, as it represents the reference strategy against which net gain or loss is measured. Comparing any model with these two baseline policies is recommended to determine whether it provides benefit or causes harm relative to much simpler alternatives.

Generally, this weighting function is high only when both the utilities $U_{TP}(t) - U_{FN}$ and $U_{TN} - U_{FP}(t)$ are high, and is low if either of the two is low. To visualise this, consider possible treatments for cardiovascular disease, and the quality-adjusted life years (QALYs) added or lost due to true and false positives. We might expect the benefit of true positives to range between 5 and 15 QALYs compared to not intervening, and the harms of false positives to be lower, between 0.1 and 1 QALYs lost compared to not intervening, as commonly used preventive interventions in cardiovascular care do not usually carry serious side-effects. As seen in Fig. 1, in this case, the importance of a threshold is mostly determined by the severity of the harm of false positives, as it is the smaller effect in the harmonic mean, meaning that interventions with worse potential false positive harm should be weighted more than those with relatively low harm.
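To make the definition concrete, here is a minimal Python sketch that approximates the continuous net benefit by numerical integration. It assumes the form $\int w(t) \left( TP_t/t - FP_t/(1-t) \right) dt$ with the normalisation $\int w(t)/t \, dt = 1$; the function names, the toy data, the midpoint rule, and the linearly increasing weight between 5% and 10% are all hypothetical choices for illustration, not the code of Additional file 2.

```python
def continuous_net_benefit(y, p, w, lo=0.0, hi=1.0, steps=2000):
    """Midpoint-rule estimate of the integral of w(t) * (TP_t / t - FP_t / (1 - t)) over [lo, hi]."""
    n = len(y)
    h = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        t = lo + (k + 0.5) * h  # midpoints never equal 0 or 1, so no division by zero
        tp = sum(yi for yi, pi in zip(y, p) if pi >= t) / n
        fp = sum(1 - yi for yi, pi in zip(y, p) if pi >= t) / n
        total += w(t) * (tp / t - fp / (1 - t)) * h
    return total


def normalised(w, lo, hi, steps=2000):
    """Rescale w so that the integral of w(t) / t over [lo, hi] equals one."""
    h = (hi - lo) / steps
    mass = sum(w(lo + (k + 0.5) * h) / (lo + (k + 0.5) * h) for k in range(steps)) * h
    return lambda t: w(t) / mass


# Hypothetical toy data: observed outcomes and predicted 10-year risks.
y = [1, 0, 1, 0, 0, 0, 1, 0, 0, 0]
p = [0.30, 0.05, 0.15, 0.12, 0.02, 0.08, 0.25, 0.20, 0.04, 0.06]


def weight(t):
    # Hypothetical importance: rises linearly between 5% and 10%, zero elsewhere.
    return (t - 0.05) if 0.05 <= t <= 0.10 else 0.0


w_norm = normalised(weight, 0.05, 0.10)
for name, risks in (("model", p), ("treat-all", [1.0] * len(y)), ("treat-none", [0.0] * len(y))):
    print(name, round(continuous_net_benefit(y, risks, w_norm, 0.05, 0.10), 4))
```

With the weight normalised this way, the result is a weighted average of the standard net benefit over the chosen range, so it can be read on the same scale as a decision curve.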

Fig. 1. Change in the value of the importance of a threshold $t$ as part of the weighting function $w(t)$ over different values of $U_{TP} - U_{FN}$ and $U_{TN} - U_{FP}$, where the two utility differences are in units of quality-adjusted life years added ($U_{TP} - U_{FN}$) or lost ($U_{TN} - U_{FP}$) when a patient is flagged positive. These plots show how, in this example, the value of $w(t)$ changes more as $U_{TN} - U_{FP}$ varies, as it is the smaller of the two utility differences

Some metrics commonly used to assess the performance of prediction models can be related to the continuous net benefit through specific choices of weighting function. For example, consider the choice of a uniform function $w(t) = 1$. The chosen ‘importance’ of all thresholds is thus equal. In this case, $cNB = \int_0^1 \left( \frac{TP_t}{t} - \frac{FP_t}{1-t} \right) dt$, and the difference between the continuous net benefit of two models $M_1$ and $M_2$ is equivalent to the difference between their log-likelihoods, with:

$$cNB_{M_1} - cNB_{M_2} = \frac{1}{n} \left( \log L_{M_1} - \log L_{M_2} \right) \qquad (10)$$

Here, $L$ is the likelihood function of the models given the observed data, and $n$ is the number of individuals. The derivation of this result is shown in Additional file 1, Supplementary 1.1. Models are often fitted to maximise the likelihood function, which could intuitively be interpreted as choosing the best model when considering each threshold equally important. On the other hand, if a weighting function is chosen to be a symmetric parabola $w(t) = 2t(1-t)$ so that $cNB = 2 \int_0^1 \left( (1-t) \, TP_t - t \, FP_t \right) dt$, the difference between the continuous net benefit of two models is equivalent to the negative Brier score difference between them, with:

$$cNB_{M_1} - cNB_{M_2} = -\left( BS_{M_1} - BS_{M_2} \right) \qquad (11)$$

Here, $BS$ is the Brier score. The derivation of this result is shown in Additional file 1, Supplementary 1.2. The Brier score has been found to be a suboptimal metric to represent clinical benefit [15]. This follows from its associated parabolic weighting function: in most clinical prediction model applications, the most important thresholds rarely lie near 0.5, as true positive benefit typically outweighs false positive harm, leading to weighting functions with higher values at lower thresholds. Finally, the binary net benefit of a model at a threshold $t_0$ can be recovered by using a point mass as the weighting function, $w(t) = t_0 \, \delta(t - t_0)$, giving $cNB = NB(t_0)$.
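The two correspondences can be checked numerically. The sketch below assumes the continuous net benefit has the form $\int_0^1 w(t) \left( TP_t/t - FP_t/(1-t) \right) dt$, a uniform weight $w(t) = 1$ for the log-likelihood case, and a parabola scaled as $w(t) = 2t(1-t)$ for the Brier case; the function names, the scaling constants, and the two sets of predictions are assumptions for illustration. It integrates the difference between two models directly, since with a uniform weight each model's integral diverges near $t = 0$ while the difference stays finite.

```python
import math


def cnb_difference(y, p_a, p_b, w, steps=2000):
    """Midpoint-rule integral over (0, 1) of w(t) * (dTP_t / t - dFP_t / (1 - t)),
    where dTP_t and dFP_t are differences in per capita TP and FP between models A and B."""
    n = len(y)
    h = 1.0 / steps
    total = 0.0
    for k in range(steps):
        t = (k + 0.5) * h
        d_tp = sum(yi * ((pa >= t) - (pb >= t)) for yi, pa, pb in zip(y, p_a, p_b)) / n
        d_fp = sum((1 - yi) * ((pa >= t) - (pb >= t)) for yi, pa, pb in zip(y, p_a, p_b)) / n
        total += w(t) * (d_tp / t - d_fp / (1 - t)) * h
    return total


def log_likelihood(y, p):
    return sum(math.log(pi) if yi == 1 else math.log(1 - pi) for yi, pi in zip(y, p))


def brier(y, p):
    return sum((yi - pi) ** 2 for yi, pi in zip(y, p)) / len(y)


# Hypothetical toy data: one outcome vector and predictions from two models.
y = [1, 0, 1, 0, 0, 0, 1, 0, 0, 0]
p_a = [0.30, 0.05, 0.15, 0.12, 0.02, 0.08, 0.25, 0.20, 0.04, 0.06]
p_b = [0.20, 0.10, 0.10, 0.10, 0.05, 0.10, 0.15, 0.10, 0.05, 0.10]

uniform = cnb_difference(y, p_a, p_b, lambda t: 1.0)
parabola = cnb_difference(y, p_a, p_b, lambda t: 2 * t * (1 - t))

print("uniform weight :", uniform, "vs", (log_likelihood(y, p_a) - log_likelihood(y, p_b)) / len(y))
print("parabola weight:", parabola, "vs", -(brier(y, p_a) - brier(y, p_b)))
```

Each printed pair should agree up to the integration error of the midpoint rule.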

To choose a weighting function in practice, we recommend first defining a plausible range of thresholds where clinical decisions are made. Researchers can then divide this range according to intervention types and qualitatively compare expected true positive benefits and false positive harms. Regions where both benefits and harms are large should be assigned greater importance than regions where both are small. Within subranges corresponding to similar interventions, the shape of the weighting function should reflect how these benefits and harms change across thresholds. If higher thresholds correspond to greater false positive harm, the weighting function should increase with the threshold. Conversely, if higher thresholds reflect diminishing returns in true positive benefit, it should decrease. When both vary, the weighting function will be mainly driven by the smaller of the two effects, the benefit $U_{TP}(t) - U_{FN}$ and the harm $U_{TN} - U_{FP}(t)$, as shown in Fig. 1. Researchers can elicit this information from clinicians by asking about the range of risk scores at which decisions are made and the main factors (treatment benefit or risk) influencing threshold changes. These considerations allow researchers to sketch a plausible weighting function, which can then be approximated by a smooth distribution to compute the continuous net benefit. Alternatively, expected changes in the utilities $U_{TP}(t) - U_{FN}$ and $U_{TN} - U_{FP}(t)$ across the range can be simulated, and the formula $w(t) = \frac{\left( U_{TP}(t) - U_{FN} \right) \left( U_{TN} - U_{FP}(t) \right)}{\left( U_{TP}(t) - U_{FN} \right) + \left( U_{TN} - U_{FP}(t) \right)}$ used to derive the corresponding weighting function directly.
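As a small illustration of deriving the weighting function from utilities, the Python sketch below maps a true positive benefit and a false positive harm to an importance and an optimal threshold. It assumes the importance is the product of the two utility differences divided by their sum (half their harmonic mean) and that the threshold odds equal harm over benefit; the function name is hypothetical and the QALY values are the illustrative ranges mentioned in the text.

```python
def importance_and_threshold(benefit_tp, harm_fp):
    """Return (importance w, optimal threshold t) for one intervention.

    benefit_tp: utility gained by treating a true positive (e.g. QALYs added).
    harm_fp:    utility lost by treating a false positive (e.g. QALYs lost).
    """
    w = benefit_tp * harm_fp / (benefit_tp + harm_fp)  # assumed form of the importance
    t = harm_fp / (benefit_tp + harm_fp)               # from t / (1 - t) = harm / benefit
    return w, t


# Illustrative QALY ranges from the text: benefit 5 to 15, harm 0.1 to 1.
for benefit in (5.0, 10.0, 15.0):
    for harm in (0.1, 0.5, 1.0):
        w, t = importance_and_threshold(benefit, harm)
        print(f"benefit={benefit:5.1f}  harm={harm:4.1f}  threshold={t:.3f}  importance={w:.3f}")
```

In this range the importance stays close to the harm value, which mirrors the observation that the smaller of the two effects drives the weighting function.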

Considering single interventions with varying thresholds using the continuous net benefit

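We revisit the motivation of previous literature of calculating an expected net benefit in a population with a distribution of utilities $U_{TP}$, $U_{FP}$, $U_{FN}$ and $U_{TN}$, and a corresponding distribution of optimal threshold $f(t)$. In that case, writing $U_{TP}(t)$, $U_{FP}(t)$, $U_{FN}(t)$ and $U_{TN}(t)$ for the utilities of patients whose optimal threshold is $t$, the expected utility of the population is:

$$E[U] = \int_0^1 f(t) \left( TP_t \, U_{TP}(t) + FP_t \, U_{FP}(t) + FN_t \, U_{FN}(t) + TN_t \, U_{TN}(t) \right) dt \qquad (12)$$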

Following a similar derivation to that from Eq. (7) to (9), detailed in Additional file 1, Supplementary 1.3, the expected net benefit across the population can be obtained as:

$$E[NB] = \int_0^1 f(t) \, w(t) \left( \frac{TP_t}{t} - \frac{FP_t}{1-t} \right) dt \qquad (13)$$

Where the weighting function $f(t) \, w(t)$ includes both the distribution of the thresholds across the population and the utilities for each of these thresholds. In this case, defining a weighting function follows a similar process to that described in the previous section, where now the focus shifts to changes in utility across patients, not treatments. A second step is now to estimate or assume a threshold distribution $f(t)$ and multiply it by the weighting function to obtain $f(t) \, w(t)$. If the weighting function is normalised so that $\int_0^1 \frac{f(t) \, w(t)}{t} \, dt = 1$, the unit of the net benefit will be true positives across everyone in the population. As each patient has distinct utilities ($U_{TP}$ to $U_{TN}$) and corresponding thresholds ($t$), the benefit of treating a true positive is different across individuals. The continuous net benefit unit thus represents the average utility gain from treating a true positive. A continuous net benefit of one true positive indicates that, compared with the treat-none strategy, the model yields an average utility gain equivalent to correctly identifying and treating one true positive patient as high risk. R code showcasing how to calculate the continuous net benefit is included in Additional file 2.

Results

Example 1: developing a cardiovascular risk prognostic model for multiple interventions

We showcase the use of the continuous net benefit through an example. We develop and validate four models and policies (Full Model, Dichotomised Model, Small Model, and Marker-based Policy, detailed in Additional File 3, Supplementary 3.4) to predict the risk that an individual will have a cardiovascular disease event in the next 10 years, using data from the Framingham Heart Study [16].

We evaluate the models for two separate sets of decisions: (1) the prescription of statins, evaluated at the optimal threshold of 10% used in UK practice [7], and (2) overall patient management and lifestyle recommendations. The first is evaluated by placing a point mass (i.e., calculating the standard net benefit) at the 10% threshold, while the second is evaluated using the continuous net benefit, choosing as weighting function $w(t)$ a half-Gaussian, corresponding to a full Gaussian with mean of 10% and standard deviation of 2% with only non-zero values below 10%. This weighting function was chosen after discussion with a clinician, after which we determined that usually clinicians will attempt to reduce the risk of a patient through lifestyle interventions and more frequent monitoring before their risk reaches 10%. We believe that these interventions are especially ‘aggressive’ as the patient gets closer to the threshold of 10%, and carry higher risks, like those of exercise-related injuries or disengagement from healthcare. To capture this, we specified an increasing weighting function over the relevant range (5% to 10%), giving greater importance to higher thresholds where potential harms are larger, without meaningful changes in true positive benefit. The standard deviation of 2% was chosen to reflect that thresholds below 5% are clinically negligible. To assess the sensitivity of the metric to misspecified weighting functions, we also evaluated a uniformly weighted area under the standard net benefit, as defined in Eq. (2), across thresholds from 5% to 10%. Further details on the development and validation of the model are included in Additional file 3, and full code is included in Additional file 4. Confidence intervals (95%CI) and optimism corrections for the continuous net benefit are calculated through bootstrapping [17]. Although, from a ‘pure’ decision theory standpoint, it is not necessary to report confidence intervals for decision curves [18], we still report them for the continuous net benefit to show the expected variance of the proposed metric. Here, the confidence interval should be interpreted in the same way as it is interpreted for the standard net benefit, as reflecting only the variability due to sample selection. A 95% confidence interval indicates the range that would contain the population mean continuous net benefit, given the chosen weighting function, in 95% of repeated samples. Importantly, these intervals do not capture uncertainty about the optimal threshold or about whether individual patients will benefit from the model.
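A half-Gaussian weighting function of this kind, together with a simple percentile bootstrap, could be sketched as follows. This is not the code of Additional file 4: the function names, the toy data, the integration grid, the assumed form $\int w(t) \left( TP_t/t - FP_t/(1-t) \right) dt$ of the continuous net benefit, and the plain percentile interval (without optimism correction or weight normalisation) are assumptions made for illustration.

```python
import math
import random


def half_gaussian_weight(t, peak=0.10, sd=0.02):
    """Gaussian bump centred at `peak`, kept only for thresholds at or below the peak."""
    if t > peak:
        return 0.0
    return math.exp(-0.5 * ((t - peak) / sd) ** 2)


def continuous_net_benefit(y, p, w, lo, hi, steps=400):
    """Midpoint-rule estimate of the integral of w(t) * (TP_t / t - FP_t / (1 - t)) over [lo, hi]."""
    n = len(y)
    h = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        t = lo + (k + 0.5) * h
        tp = sum(yi for yi, pi in zip(y, p) if pi >= t) / n
        fp = sum(1 - yi for yi, pi in zip(y, p) if pi >= t) / n
        total += w(t) * (tp / t - fp / (1 - t)) * h
    return total


def bootstrap_ci(y, p, w, lo, hi, n_boot=200, seed=1):
    """Approximate 95% percentile interval from resampling individuals with replacement."""
    rng = random.Random(seed)
    n = len(y)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(continuous_net_benefit([y[i] for i in idx], [p[i] for i in idx], w, lo, hi))
    stats.sort()
    return stats[int(0.025 * n_boot)], stats[int(0.975 * n_boot) - 1]


# Hypothetical toy data: observed outcomes and predicted 10-year risks.
y = [1, 0, 1, 0, 0, 0, 1, 0, 0, 0]
p = [0.30, 0.05, 0.15, 0.12, 0.02, 0.08, 0.25, 0.20, 0.04, 0.06]

estimate = continuous_net_benefit(y, p, half_gaussian_weight, 0.001, 0.10)
ci_low, ci_high = bootstrap_ci(y, p, half_gaussian_weight, 0.001, 0.10)
print(f"unnormalised cNB = {estimate:.4f}  (95% percentile interval {ci_low:.4f} to {ci_high:.4f})")
```

With only ten individuals the interval is very wide; the sketch is meant to show the mechanics of resampling rather than to reproduce the reported estimates.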

The decision curves of the models, as well as the weighting function chosen, are plotted in Fig. 2a, and the net benefit of both sets of interventions is shown in Fig. 2b. For overall patient management (excluding statins), when the weighting function was appropriately chosen, the continuous net benefit of the Full Model was highest, 8.3 (95%CI: 7.4–9.1) true positives per 100 people, followed by the Dichotomised Model (8.2, 7.3–9.0), the Small Model (7.3, 6.4–8.2) and finally the Marker-based Policy (7.3, 6.4–8.1). The results of the uniformly weighted continuous net benefit produced slightly different absolute values of the net benefit, but preserved the previous ranking, with the Full Model (8.5, 7.6–9.4), the Dichotomised Model (8.4, 7.5–9.2), the Small Model (7.7, 6.7–8.5), and the Marker-based Policy (7.5, 6.6–8.3) ranking from most to least beneficial. When looking at statins prescribing (i.e., the net benefit at the 10% threshold), the Full Model and Dichotomised Model naturally had the same net benefit (7.7, 6.8–8.5), followed by the Marker-based Policy (6.7, 5.8–7.6) and finally the Small Model (6.5, 5.6–7.3).

Fig. 2. Reported benefit of the cardiovascular model development example when considering multiple treatments. a Decision curve of the developed logistic regression model using all available predictors (Full Model), a dichotomised version of the same logistic regression, where patients are only shown as low or high risk (Dichotomised Model), a logistic regression with a limited set of the available predictors (Small Model), or a policy based on whether a patient has one of four markers associated with high risk (Marker-based Policy). We compare these models to the policies of treating all patients (Treat All) and treating no patients (Treat No One). The weighting function to calculate the continuous net benefit function is shown, separated into the components related to the prescription of statins (red) and the recommendation of lifestyle changes (blue). b Continuous net benefit estimates of all policies for the decisions of prescribing statins and overall patient management, as well as a uniformly weighted area under the net benefit curve between the thresholds of 5% and 10%. The unit of the statins component is the benefit of treating a true positive patient with statins per 100 people, and the unit of the two overall management components is the benefit of managing a true positive patient as high risk per 100 people

Example 2: Developing a cardiovascular risk prognostic model for one intervention with a distribution of thresholds

In this second example, we instead consider that not all individuals will be prescribed statins when their risk exceeds the 10% threshold. Usually, the range of thresholds at which cardiovascular models are evaluated in decision curve analysis lies between 0% and 20% [6], although higher thresholds might be considered [19]. We model the distribution of optimal thresholds as a log-normal distribution with a mean of 0.12 and a standard deviation of 0.3 on the log scale, plotted in Fig. 3a. We assume that the main source of variation in thresholds is the relative risk reduction achieved by statins, and thus that the harms due to false positives are constant, so that Eq. (4) can be used to calculate an expected net benefit across the population. If we wanted to consider variation in false positive harm in our analysis (in reality, some patients weigh the harm of unnecessary preventative treatment more heavily than others), Eq. (13) would need to be used. We also evaluated a uniformly weighted area under the standard net benefit curve between thresholds of 5% and 20%.
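As an illustration of the assumed threshold distribution, the sketch below builds log-normal weights over a grid of thresholds and checks their basic properties. It rests on one assumption that the text does not settle: that 0.12 is the mean of the thresholds themselves, so that the log-scale location is ln(0.12) minus half the log-scale variance. If 0.12 were instead the median, `MU` would simply be `math.log(0.12)`. All names in the block are hypothetical, and the expected net benefit of Eq. (4) is not reproduced.

```python
import math

MEAN = 0.12   # mean optimal threshold, as stated in the text
SIGMA = 0.3   # standard deviation on the log scale, as stated in the text

# Assumption: 0.12 is the mean of the thresholds themselves, so the
# log-scale location is mu = ln(mean) - sigma^2 / 2.
MU = math.log(MEAN) - SIGMA ** 2 / 2


def lognormal_pdf(t, mu=MU, sigma=SIGMA):
    """Density of a log-normal distribution at threshold t > 0."""
    z = (math.log(t) - mu) / sigma
    return math.exp(-0.5 * z * z) / (t * sigma * math.sqrt(2 * math.pi))


def lognormal_cdf(t, mu=MU, sigma=SIGMA):
    """Probability that an individual's optimal threshold is below t."""
    return 0.5 * (1 + math.erf((math.log(t) - mu) / (sigma * math.sqrt(2))))


# Midpoint grid over (0, 1) used to turn the density into discrete weights.
n_points = 2000
step = 1 / n_points
grid = [step * (i + 0.5) for i in range(n_points)]
weights = [lognormal_pdf(t) * step for t in grid]

total_mass = sum(weights)
mean_threshold = sum(w * t for w, t in zip(weights, grid)) / total_mass
share_5_to_20 = lognormal_cdf(0.20) - lognormal_cdf(0.05)

print(f"Mass on (0, 1): {total_mass:.4f}")
print(f"Mean threshold: {mean_threshold:.4f}")
print(f"Share of thresholds between 5% and 20%: {share_5_to_20:.3f}")
```

The `weights` list can then play the role of a weighting function over thresholds, in place of equal weights on the 5% to 20% range.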

Fig. 3.

Reported benefit of the cardiovascular model development example when considering a single treatment with a distribution of optimal thresholds across the population. a Decision curves of the developed logistic regression model using all available predictors (Full Model), a dichotomised version of the same logistic regression (Dichotomised Model), a logistic regression with a limited set of the available predictors (Small Model), and a policy based on whether a patient has one of four markers associated with high risk (Marker-based Policy). We compare these models to the policies of treating all patients (Treat All) and treating no patients (Treat No One). The assumed distribution of optimal thresholds across the population is shown in red. b Continuous net benefit estimates of all policies for the decision of prescribing statins, under the assumptions of everyone having an optimal threshold of 10% (left), individuals having optimal thresholds drawn from the distribution shown in (a) (middle), and a uniformly weighted area under the net benefit curve between the thresholds of 5% and 20% (right). The unit is the average benefit of treating a true positive patient with statins per 100 people.

The continuous net benefit, shown in Fig. 3b, in units of average true positives per 100 people, is 7.2 (6.4–8.0) for the Full Model, 7.1 (6.3–8.0) for the Dichotomised Model, 5.8 (5.0–6.6) for the Small Model, and 6.2 (5.3–7.0) for the Marker-based Policy. Generally, the continuous net benefit was in agreement with the net benefit at the threshold of 10% and the uniformly weighted area under the net benefit.

Discussion

In this paper, we present an approach to assess the clinical utility of prognostic models in situations where the model is used for overall patient management, or situations where we believe that a single intervention’s optimal threshold varies across the population. Summarising the total net benefit as a single measure showed the overall usefulness of prediction models in a cardiovascular problem.

In both examples, the continuous net benefit produced results similar to those from the single-threshold net benefit and from a misspecified version using a uniform weighting function. This is in part due to the evaluated policies being ranked consistently across the entire relevant threshold range. Such agreement may not hold when the relevant threshold range is wider or when models differ more substantially. For example, models which rely on different predictor sets may perform differently when discriminating in high-risk or low-risk patients. Further work is needed to identify the conditions under which simpler approaches, such as using a single threshold or an imprecise weighting function, yield acceptable approximations. Even when differences between approaches are small, the continuous net benefit remains, in principle, the more appropriate method for summarising model value across multiple thresholds, as it provides absolute benefit estimates that more accurately represent overall population utility. It can therefore serve as a reference for determining when simplifications, such as single-threshold evaluation, are justified.
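The role of consistent ranking can be illustrated with a small sketch. The net benefit values below are hypothetical and are not taken from either example: curves A and B cross within the threshold range, while C lies below both at every threshold. Under non-negative weights, C ranks last whatever the weighting, whereas the order of A and B depends on where the weight is placed.

```python
thresholds = [0.05, 0.10, 0.15, 0.20]

# Hypothetical net benefit values at each threshold (not taken from the paper).
curves = {
    "A": [0.080, 0.070, 0.050, 0.030],  # better at low thresholds
    "B": [0.075, 0.068, 0.056, 0.040],  # better at high thresholds
    "C": [0.070, 0.060, 0.045, 0.025],  # below A and B at every threshold
}


def weighted_average(values, weights):
    """Weighted average of net benefit values across the threshold grid."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def ranking(weights):
    """Policy names ordered from highest to lowest weighted net benefit."""
    scores = {name: weighted_average(nb, weights) for name, nb in curves.items()}
    return sorted(scores, key=scores.get, reverse=True)


uniform = [1, 1, 1, 1]             # equal weight on every threshold
low_heavy = [0.6, 0.3, 0.1, 0.0]   # most weight on the lowest thresholds

print("Uniform weights:", ranking(uniform))
print("Low-threshold weights:", ranking(low_heavy))
```

In this toy setting a uniform weighting is a harmless simplification for judging C, but not for choosing between A and B.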

We recommend using the continuous net benefit when assessing model benefit across a range of thresholds is important, for example when considering the distribution of optimal thresholds across the population, or when multiple clinical decisions are considered simultaneously. We believe it especially adds value when a quantitative assessment of the overall benefit difference between models or policies is needed, and graphical assessments using decision curves are not enough. For instance, a difficult-to-implement complex model might have a higher decision curve than a simpler one, but without quantifying the magnitude of its added clinical utility, it is difficult to judge whether implementation is justified. Besides using it during the validation stage, the continuous net benefit could also be useful for researchers and methodologists during model development, when comparing modelling choices and when needing a metric to optimise [20], or when summarising the benefit of a model over multiple validation studies using meta-analysis [21].

Use of the continuous net benefit may not be warranted where graphical assessment using decision curves is sufficient, for example when a particular model is vastly superior to all other policies across the entire range. Moreover, if a reasonable weighting function that reflects the clinical problem cannot be at least roughly estimated, the metric loses clinical meaning and practical interpretability. Importantly, we do not recommend using the continuous net benefit as a substitute for decision curves, but rather as an addition to the framework, similar to how calibration-in-the-large and the calibration slope are reported alongside calibration plots. While the ability of the continuous net benefit to assess clinical utility is limited by the assumptions discussed, we nonetheless consider it a useful first step before carrying out a more involved cost-benefit analysis later in the implementation of a clinical prediction model.

Conclusions

An increasing number of clinical prediction models are being created, with few being implemented in practice [22, 23]. In part, this is due to the large number of available models, and to the challenges faced by policymakers when inferring whether models would be clinically useful and beneficial to patients. Our work introduces to clinical prediction model researchers a tool that can help decision makers better consider the use of models beyond a single systematic clinical decision, enabling them to assess the overall clinical value of an algorithm. Further work could investigate how assumptions around the independence of optimal thresholds and predictors can be relaxed with subgroup analysis or explore different approaches to choosing the weighting function.

Supplementary Information

41512_2026_224_MOESM1_ESM.docx (38.6KB, docx)

Additional file 1. Supplementary containing proofs of some statements made in the Methods section

41512_2026_224_MOESM2_ESM.r (3.1KB, r)

Additional file 2. (.R): Code (in R) showcasing the use of continuous net benefit

41512_2026_224_MOESM3_ESM.r (133.8KB, r)

Additional file 3. Supplementary containing additional information of the model development and validation presented in the Results section

41512_2026_224_MOESM4_ESM.docx (30.5KB, docx)

Additional file 4. (.R): Full code (in R) of the model development and validation presented in the Results section

Acknowledgements

We thank Brian McMillan for providing helpful comments during the conception of this work.

Abbreviations

QALY

Quality-Adjusted Life Years

Authors' contributions

JBA developed the proposed approach, performed the analysis, and wrote the original manuscript. LW contributed to the development of the method and reviewed the manuscript. PG and PC reviewed the manuscript. NP and MS were involved in the conception and development of the proposed approach, provided supervision, and reviewed the manuscript. All authors read and approved the final manuscript.

Funding

JBA is the recipient of a studentship award from the Health Data Research UK-The Alan Turing Institute Wellcome PhD Programme in Health Data Science (Grant Ref: 218529/Z/19/Z). We acknowledge the support of the UKRI AI programme, and the Engineering and Physical Sciences Research Council, for CHAI - Causality in Healthcare AI Hub [grant number EP/Y028856/1]. The research was carried out at the National Institute for Health and Care Research (NIHR) Manchester Biomedical Research Centre (BRC) (NIHR203308).

Data availability

The data used in the two examples is publicly available as part of the ‘riskCommunicator’ package in R. All code used for this work, which includes importing the data, is included as an additional file.

Declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  • 1. Steyerberg EW, Vergouwe Y. Towards better clinical prediction models: seven steps for development and an ABCD for validation. Eur Heart J. 2014;35(29):1925–31.
  • 2. Steyerberg EW, Vickers AJ, Cook NR, Gerds T, Gonen M, Obuchowski N, et al. Assessing the performance of prediction models: a framework for traditional and novel measures. Epidemiology. 2010;21(1):128–38.
  • 3. Van Calster B, Collins GS, Vickers AJ, Wynants L, Kerr KF, Barreñada L, et al. Evaluation of performance measures in predictive artificial intelligence models to support medical decisions: overview and guidance. Lancet Digit Health. 2025;7.
  • 4. Vickers AJ, Elkin EB. Decision curve analysis: a novel method for evaluating prediction models. Med Decis Mak. 2006;26(6):565–74.
  • 5. Van Calster B, Wynants L, Verbeek JFM, Verbakel JY, Christodoulou E, Vickers AJ, et al. Reporting and interpreting decision curve analysis: a guide for investigators. Eur Urol. 2018;74(6):796–804.
  • 6. Hippisley-Cox J, Coupland CAC, Bafadhel M, Russell REK, Sheikh A, Brindle P, et al. Development and validation of a new algorithm for improved cardiovascular risk prediction. Nat Med. 2024;30:1440–7. 10.1038/s41591-024-02905-y.
  • 7. Cardiovascular disease: risk assessment and reduction, including lipid modification [Internet]. National Institute for Health and Care Excellence; 2023 Dec [cited 2024 Jun 17]. Report No.: NG238. Available from: https://www.nice.org.uk/guidance/ng238
  • 8. Hypertension in adults: diagnosis and management [Internet]. National Institute for Health and Care Excellence; 2019 Aug [cited 2024 Jun 17]. Report No.: NG136. Available from: https://www.nice.org.uk/guidance/ng136
  • 9. Collins GS, Altman DG. Predicting the 10 year risk of cardiovascular disease in the United Kingdom: independent and external validation of an updated version of QRISK2. BMJ. 2012;344:e4181.
  • 10. Elwyn G, Frosch D, Thomson R, Joseph-Williams N, Lloyd A, Kinnersley P, et al. Shared decision making: a model for clinical practice. J Gen Intern Med. 2012;27(10):1361–7.
  • 11. Zhang Z, Rousson V, Lee WC, Ferdynus C, Chen M, Qian X, et al. Decision curve analysis: a technical note. Annals Translational Med. 2018;6(15):308–308.
  • 12. Talluri R, Shete S. Using the weighted area under the net benefit curve for decision curve analysis. BMC Med Inf Decis Mak. 2016;16(1):94.
  • 13. Chalkou K, Vickers AJ, Pellegrini F, Manca A, Salanti G. Decision curve analysis for personalized treatment choice between multiple options. Med Decis Mak. 2023;43(3):337–49.
  • 14. Rousson V, Zumbrunn T. Decision curve analysis revisited: overall net benefit, relationships to ROC curve analysis, and application to case-control studies. BMC Med Inf Decis Mak. 2011;11:45.
  • 15. Assel M, Sjoberg DD, Vickers AJ. The Brier score does not evaluate the clinical utility of diagnostic tests or prediction models. Diagn Progn Res. 2017;1(1):19.
  • 16. Framingham Heart Study (FHS) | NHLBI, NIH [Internet]. [cited 2024 Apr 15]. Available from: https://www.nhlbi.nih.gov/science/framingham-heart-study-fhs
  • 17. Vickers AJ, Cronin AM, Elkin EB, Gonen M. Extensions to decision curve analysis, a novel method for evaluating diagnostic tests, prediction models and molecular markers. BMC Med Inf Decis Mak. 2008;8:53.
  • 18. Vickers AJ, Van Calster B, Wynants L, Steyerberg EW. Decision curve analysis: confidence intervals and hypothesis testing for net benefit. Diagn Progn Res. 2023;7:11.
  • 19. Yebyo HG, Aschmann HE, Menges D, Boyd CM, Puhan MA. Net benefit of statins for primary prevention of cardiovascular disease in people 75 years or older: a benefit–harm balance modeling study. Ther Adv Chronic Dis. 2019;10:2040622319877745.
  • 20. Vickers A, Hollingsworth A, Bozzo A, Chatterjee A, Chatterjee S. Hypothesis: net benefit as an objective function during development of machine learning algorithms for medical applications. Int J Med Informatics. 2025;197:105844.
  • 21. Wynants L, Riley RD, Timmerman D, Van Calster B. Random-effects meta-analysis of the clinical utility of tests and prediction models. Stat Med. 2018;37(12):2034–52.
  • 22. Wynants L, Van Calster B, Collins GS, Riley RD, Heinze G, Schuit E, et al. Prediction models for diagnosis and prognosis of covid-19: systematic review and critical appraisal. BMJ. 2020;369:m1328.
  • 23. Markowetz F. All models are wrong and yours are useless: making clinical prediction models impactful for patients. Npj Precis Onc. 2024;8(1):1–3.

Articles from Diagnostic and Prognostic Research are provided here courtesy of BMC
