Biostatistics (Oxford, England). 2021 Aug 7;24(1):85–107. doi: 10.1093/biostatistics/kxab023

Tailored Bayes: a risk modeling framework under unequal misclassification costs

Solon Karapanagiotis 1, Umberto Benedetto 2, Sach Mukherjee 3, Paul D W Kirk 4, Paul J Newcombe 5
PMCID: PMC9748575  EMSID: EMS140632  PMID: 34363680

Summary

Risk prediction models are a crucial tool in healthcare. Risk prediction models with a binary outcome (i.e., binary classification models) are often constructed using methodology which assumes that the costs of different classification errors are equal. In many healthcare applications, this assumption is not valid, and the differences between misclassification costs can be quite large. For instance, in a diagnostic setting, the cost of misdiagnosing a person with a life-threatening disease as healthy may be larger than the cost of misdiagnosing a healthy person as a patient. In this article, we present Tailored Bayes (TB), a novel Bayesian inference framework which “tailors” model fitting to optimize predictive performance with respect to unbalanced misclassification costs. We use simulation studies to showcase when TB is expected to outperform standard Bayesian methods in the context of logistic regression. We then apply TB to three real-world applications: a cardiac surgery prognostication task, a breast cancer prognostication task, and a breast cancer tumor classification task. In each case, we demonstrate the improvement in predictive performance over standard methods.

Keywords: Bayesian inference, Binary classification, Misclassification costs, Tailored Bayesian methods

1. Introduction

Risk prediction models are widely used in healthcare (Roques and others, 2003; Hippisley-Cox and others, 2008; Wishart and others, 2012). In both diagnostic and prognostic settings, risk prediction models are regularly developed, validated, implemented, and updated with the aim of assisting clinicians and individuals in estimating probabilities of outcomes of interest which may ultimately guide their decision making (Down and others, 2014; NICE, 2016; Baumgartner and others, 2017). The most common type of risk prediction model is based on binary outcomes, with class labels 0 (negative) and 1 (positive). Models for binary outcomes are often constructed to minimize the expected classification error; that is, the proportion of incorrect classifications (Zhang, 2004; Steinwart, 2005; Bartlett and others, 2006). We refer to this paradigm as the standard classification paradigm. The disadvantage of this paradigm is that it implicitly assumes that all classification errors have equal costs, that is, the cost of misclassification of a positive label equals the cost of misclassification of a negative label. (Throughout the document, we refer to the costs of incorrect classifications as misclassification costs). However, equal costs may not always be appropriate and will depend on the scientific or medical context. For example, in cancer diagnosis, a false negative (i.e., misdiagnosing a cancer patient as healthy) could have more severe consequences than a false positive (i.e., misdiagnosing a healthy individual with cancer); the latter may lead to extra medical costs and unnecessary anxiety for the individual but not result in loss of life.1 For such applications, a prioritized control of asymmetric misclassification costs is needed.

To meet this need, different methods have been developed. In the machine learning literature, they are studied under the term cost-sensitive learning (Elkan, 2001). Existing research on cost-sensitive learning can be grouped into two main categories: direct and indirect approaches. Direct approaches aim to make particular classification algorithms cost-sensitive by incorporating different misclassification costs into the training process. This amounts to changing the objective/likelihood function that is optimized when training the model (e.g., Kukar and others, 1998; Ling and others, 2004; Masnadi-Shirazi and Vasconcelos, 2010). A limitation is that these approaches are problem-specific, requiring considerable knowledge of the model in conjunction with its theoretical properties, and possibly new computational tools. Conversely, indirect approaches are more general because they achieve cost-sensitivity with little or no modification to existing modeling frameworks. In this article, we focus on indirect approaches.

Indirect methods can be further subdivided into thresholding and sampling/weighting. Thresholding is the simpler approach of the two, as it changes the classification threshold of an existing risk prediction model: if the model produces probability estimates, we can use the threshold to classify datapoints into positive or negative status. This strategy would be optimal if the true class probabilities were available. In particular, if the model reports the logarithm of the ratio of true class probabilities, the threshold should be shifted by a value equal to the logarithm of the ratio of misclassification costs (Duda and others, 2012). This is based on decision theoretic arguments, as we show in Section 2 (Pauker and Kassirer, 1975; Duda and others, 2012). In practice, however, this strategy may lead to sub-optimal solutions. We demonstrate this using synthetic (Section 3) and real-life data (Section 4).

Alternatively, sampling methods modify the distribution of the training data according to misclassification costs (see Elkan (2001) for a theoretical justification). This can be achieved by generating new datapoints from the class with the smaller number of datapoints, that is, oversampling from the minority class, or by removing datapoints from the majority class (undersampling). The simplest form is random sampling (over- or under-). However, both come with drawbacks. Duplicating samples from the minority class may cause overfitting (Zadrozny and others, 2003). Similarly, random elimination of samples from the majority class can result in loss of data which might be useful for the learning process. Weighting (e.g., Ting, 1998; Margineantu and Dietterich, 2003) can also be conceptually viewed as a sampling method, where weights are assigned proportionally to misclassification costs. For example, datapoints of the minority class, which usually carries a higher misclassification cost, may be assigned higher weights. Datapoints with high weights can be viewed as sample duplication, and thus as oversampling. In general, random sampling/weighting determines the datapoints to be duplicated or eliminated based on outcome information alone (whether a datapoint belongs to the majority or the minority class). Notably, these methods do not take into account critical regions of the covariate space, such as regions close to the target decision boundary. A decision boundary specifies distinct classification regions in the covariate space based on specified misclassification costs (see Section 3 for details). Taking such regions into account is the goal of the framework presented here.

In this article, we build upon the seminal work of Hand and Vinciotti (2003), and present an umbrella framework that allows us to incorporate misclassification costs into commonly used models for binary outcomes. The framework allows us to tailor model development with the aim of improving performance in the presence of unequal misclassification costs. Although the concepts we discuss are general, and allow for relatively simple tailoring of a wide range of models (essentially whenever the objective function can be expressed as a sum over samples), we focus on a Bayesian regression paradigm. Hence, we present Tailored Bayes (TB), a framework for tailored Bayesian inference when different classification errors incur different penalties. We use a decision theoretic approach to quantify the benefits and costs of correct and incorrect classifications (Section 2). The method is based on the principle that the relative harms of false positives and false negatives can be expressed in terms of a target threshold. We then build a two-stage model (Section 2.3), first introduced by Hand and Vinciotti (2003). In the first stage, the most informative datapoints are identified. A datapoint is treated as informative if it is close to the target threshold of interest. Each datapoint is assigned a weight proportional to its distance from the target threshold. Intuitively, one would expect improvements in performance to be possible by putting decreasing weights on the class labels of the successively more distant datapoints. In the second stage, these weights are used to downweight each datapoint’s likelihood contribution during model training. A key feature is that this changes the estimation output in a way that goes beyond thresholding, and we demonstrate this effect in simple examples (Section 3).

We conduct simulation studies to illustrate the improvement in predictive performance of our proposed TB modeling framework over the standard Bayesian paradigm (Section 3). We then apply the methodology to three real-data applications (Section 4). Our two main case studies are a breast cancer and a cardiac surgery prognostication task where we have information on how clinicians prioritize different classification errors. We show that incorporating this information into the model through our TB approach leads to better treatment decisions. We finish with a discussion of our approach, findings and provide some general remarks in Section 5.

2. Methods

We use a decision theoretic approach to summarize the costs of misclassifications of a binary outcome into a single number, which we refer to as the target threshold (Section 2.1). We later (Section 2.2) define the expected utility of risk prediction and use the target threshold and the never treat policy to simplify the expected utility and derive the Net Benefit of a risk prediction model. We use the Net Benefit as our model evaluation metric throughout the article. In Section 2.3, we incorporate the target threshold in the model formulation which results in the tailored likelihood function (Section 2.4) and the tailored posterior (Section 2.5).

2.1. The target threshold

Let $Y$ represent a binary outcome of interest. The observed $y$ is a realization of a binary random variable following a Bernoulli distribution with $\Pr(Y = 1) = \pi$. This is the marginal probability of the outcome being present, and consequently, the probability of the outcome being absent is $\Pr(Y = 0) = 1 - \pi$.

We introduce utility functions to take into account the benefits or harms of different classifications. A utility function assigns a value to each of the four possible classification-outcome combinations, stating exactly how beneficial or costly each action (treat or not treat) is. We assume that people who are classified as positive receive treatment and people who are classified as negative do not receive treatment. We use “treatment” in the generic sense of a healthcare intervention, which could be a drug, surgery, or further testing. Each possible combination of classification (negative or positive) and outcome status (0, 1) is associated with a utility, where a positive value indicates a benefit and a negative value indicates a cost or harm. The four utilities associated with binary classification problems are: (i) $U_{TP}$, the utility of a true positive classification, that is, administering treatment to a patient who has the outcome (treating when necessary); (ii) $U_{FP}$, the utility of a false positive classification, that is, administering treatment to a patient who does not have the outcome (administering unnecessary treatment); (iii) $U_{FN}$, the utility of a false negative classification, that is, withholding treatment from a patient who has the outcome (withholding beneficial treatment); and (iv) $U_{TN}$, the utility of a true negative classification, that is, withholding treatment from a patient who does not have the outcome (withholding unnecessary treatment).

The expected utilities of the two fixed courses of action (or policies) of always treat and never treat are given by

$$EU_{\text{treat}} = \pi\, U_{TP} + (1 - \pi)\, U_{FP}, \tag{2.1a}$$
$$EU_{\text{no treat}} = \pi\, U_{FN} + (1 - \pi)\, U_{TN}, \tag{2.1b}$$

where $EU_{\text{treat}}$ and $EU_{\text{no treat}}$ are the expected utilities of treating and not treating, respectively. In principle, one should choose the course of action with the highest expected utility. When the expected utilities are equal, the decision maker is indifferent between the courses of action (Pauker and Kassirer, 1975). Based on classical decision theory, we employ the threshold concept and denote by $t$ the value of $\pi$ at which the decision maker is indifferent between the courses of action (Pauker and Kassirer, 1980). This is the principle of clinical equipoise, which exists when all of the available evidence about a course of action does not show that it is more beneficial than an alternative and, equally, does not show that it is less beneficial than the alternative (Turner, 2013). Clinical equipoise is regarded as an “ethically necessary condition in all cases of clinical research” (Freedman, 1987). Based on the threshold concept, an individual should be treated (i.e., classified as positive) if $\pi \geq t$ and should not be treated (i.e., classified as negative) otherwise. Having defined $t$ as the value of $\pi$ at clinical equipoise, where the expected benefit of treatment equals the expected benefit of avoiding treatment, implies $EU_{\text{treat}} = EU_{\text{no treat}}$, or equivalently, $t\, U_{TP} + (1 - t)\, U_{FP} = t\, U_{FN} + (1 - t)\, U_{TN}$. Solving for $t$,

$$t = \frac{U_{TN} - U_{FP}}{(U_{TN} - U_{FP}) + (U_{TP} - U_{FN})} = \frac{H}{H + B}, \tag{2.2}$$

where $B = U_{TP} - U_{FN}$ is the difference between the utility of administering treatment to individuals who have the outcome and the utility of withholding treatment from those who have the outcome. In other words, $B$ is the benefit of a positive prediction, and consequent treatment, among those with the outcome. Equivalently, $B$ can be interpreted as the consequence of failing to treat when it would have been of benefit, that is, the harm from a false negative result (compared to a true positive result). Similarly, $H = U_{TN} - U_{FP}$ is the difference between the utility of avoiding treatment in patients who do not have the outcome and the utility of administering treatment to those who do not have the outcome. In other words, $H$ is the consequence of being treated unnecessarily, that is, the harm associated with a false positive result (compared to a true negative result).

We henceforth refer to $t$ as the target threshold. Alternative names in the literature are risk threshold (Baker and others, 2009) and threshold probability (Tsalatsanis and others, 2010). It is a scalar function of the four utilities $U_{TP}$, $U_{FP}$, $U_{FN}$, and $U_{TN}$ that determines the cut-off point for calling a result positive that maximizes expected utility. Equation (2.2) therefore tells us that the target threshold at which the decision maker will opt for treatment is informative of how they weigh the relative harms of false positive and false negative results. The main advantage of this decision theoretic approach is that there is no need to explicitly specify the four utilities, but only the desired target threshold.

Example: Assume that for every correctly treated patient (true positive) we are willing to incorrectly treat 9 healthy individuals (false positives).2 Then, we consider the benefit of correctly treating a patient to be nine times larger than the harm of an unnecessary treatment: the harm-to-benefit ratio is 1:9. This ratio has a direct relationship to $t$: the odds of $t$ equal the harm-to-benefit ratio, that is, $t/(1 - t) = H/B$, which is implied by (2.2). For example, $t = 10\%$ implies a harm-to-benefit ratio of 1:9 (odds$(10\%) = 10/90 = 1/9$).
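To make the correspondence concrete, the following R sketch (ours, not code from the paper's TailoredBayes package) converts between a target threshold and the implied harm-to-benefit ratio.

```r
# Converting between a target threshold t and the implied
# harm-to-benefit ratio via odds(t) = t / (1 - t) = H / B.
threshold_to_odds <- function(t) t / (1 - t)
odds_to_threshold <- function(odds) odds / (1 + odds)

threshold_to_odds(0.10)    # 0.111... = 1/9, i.e., a harm-to-benefit ratio of 1:9
odds_to_threshold(1 / 9)   # 0.1, recovering the 10% target threshold
```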

2.2. Net benefit for risk prediction

In practice, we do not know the probability of the outcome for any given individual. Instead, we need to estimate it from a set of covariates. Let $\mathbf{x} = (x_1, \ldots, x_p)^\top$ be a vector of $p$ covariates and define $\eta(\mathbf{x}) = \Pr(Y = 1 \mid \mathbf{x})$ as the conditional class 1 probability given the observed values of the covariates. We are concerned with the problem of classifying future values of $Y$ from the information that the covariates contain. Assume we have a prediction model and an estimate of $\eta(\mathbf{x})$, denoted $\hat{\eta}(\mathbf{x})$. We classify an individual as positive if $\hat{\eta}(\mathbf{x}) \geq t$, where $t$ is the target threshold (defined in (2.2)), and as negative otherwise. The expected utility of assigning treatment or not (i.e., classifying positive or negative) at threshold $t$ based on the model’s predictions can be written as

$$EU(\hat{\eta}, t) = \pi \left[\mathrm{TPR}\, U_{TP} + (1 - \mathrm{TPR})\, U_{FN}\right] + (1 - \pi)\left[\mathrm{FPR}\, U_{FP} + (1 - \mathrm{FPR})\, U_{TN}\right], \tag{2.3}$$

where $\mathrm{TPR}$ is the true positive rate, that is, $\mathrm{TPR} = \Pr\{\hat{\eta}(\mathbf{x}) \geq t \mid Y = 1\}$, and $\mathrm{FPR}$ is the false positive rate, that is, $\mathrm{FPR} = \Pr\{\hat{\eta}(\mathbf{x}) \geq t \mid Y = 0\}$. The drawback of this formulation is the need to specify the four utilities. Equation (2.3) can be simplified by considering the expected utility of risk prediction in excess of the expected utility of no treatment. The expected utility of no treatment is given in (2.1b), and so, subtracting this from both sides of (2.3), the expected utility of risk prediction in excess of the expected utility of no treatment is

$$EU(\hat{\eta}, t) - EU_{\text{no treat}} = \pi\, \mathrm{TPR}\, (U_{TP} - U_{FN}) - (1 - \pi)\, \mathrm{FPR}\, (U_{TN} - U_{FP}) = \pi\, \mathrm{TPR}\, B - (1 - \pi)\, \mathrm{FPR}\, H. \tag{2.4}$$

This is a Hippocratic utility function because it is motivated by the Hippocratic oath: do the best in one’s ability (beneficence) and do no harm (nonmaleficence) (Childress and Beauchamp, 2001). To be consistent with the Hippocratic oath, the modeler chooses the model that has the greatest chance of giving an outcome no worse than the outcome of no treatment. With $B = U_{TP} - U_{FN} = 1$, (2.4) is defined as the Net Benefit of risk prediction versus treat none (Vickers and Elkin, 2006; Baker and others, 2009). Setting $B = 1$ as the reference level means that Net Benefit is measured in units of true positive predictions; note also that $B = 1$ together with (2.2) implies $H = t/(1 - t)$. To see this, we re-write (2.4) in its empirical form as

$$\mathrm{NB}(t) = \frac{TP}{n} - \frac{FP}{n} \cdot \frac{t}{1 - t}, \tag{2.5}$$

where $TP$ is the number of patients with true positive results, $FP$ is the number of patients with false positive results, and $n$ is the sample size. To simplify notation, we write NB instead of $\mathrm{NB}(t)$. NB gives the proportion of net true positives in the data set, accounting for the different misclassification costs. In other words, the observed number of true positives is corrected for the observed proportion of false positives weighted by the odds of the target threshold, and the result is divided by the sample size. This net proportion is equivalent to the proportion of true positives in the absence of false positives. For instance, a NB of 0.05 at a given target threshold can be interpreted as meaning that use of the model, as opposed to simply assuming that all patients are negative, leads to the equivalent of an additional 5 net true positives per 100 patients.
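Computing NB from a vector of predicted risks takes only a few lines; the sketch below is our own illustration of (2.5), not code from the TailoredBayes package.

```r
# Empirical Net Benefit (2.5): `risk` holds predicted probabilities,
# `y` the observed 0/1 outcomes, and `t` the target threshold.
net_benefit <- function(risk, y, t) {
  pos <- risk >= t                  # classified positive, hence treated
  tp  <- sum(pos & y == 1)          # true positives
  fp  <- sum(pos & y == 0)          # false positives
  n   <- length(y)
  tp / n - (fp / n) * t / (1 - t)   # net true positives per patient
}
```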

For the remainder of the article, NB will be our main performance measure for model evaluation. We have written NB as a function of the target threshold $t$, which allows information about the relative utilities of treatments to be included in our model formulation, as we now show.

2.3. Model formulation

Denote the data by $D = \{(y_i, \mathbf{x}_i)\}_{i=1}^{n}$, where $y_i \in \{0, 1\}$ is the outcome indicating the class to which the $i$th datapoint belongs and $\mathbf{x}_i$ is its vector of covariates of size $p$. The objective is to estimate the posterior probability of belonging to one of the classes given a set of new datapoints. We use $D$ to fit a model for $p(y = 1 \mid \mathbf{x})$ and use it to obtain $p(y^{*} = 1 \mid \mathbf{x}^{*})$ for a future datapoint $(y^{*}, \mathbf{x}^{*})$ with covariates $\mathbf{x}^{*}$. We simplify the structure using $p(y = 1 \mid \mathbf{x}) = [1 + \exp\{-f(\mathbf{x})\}]^{-1}$, where $f$ is a function that maps the vector of covariates to the real line, that is, the linear predictor used in generalized linear models. To develop the complete model, we need to specify $f(\mathbf{x})$ and the loss (equivalently, the likelihood) linking $y$ to $f(\mathbf{x})$.

In the machine learning literature, most binary classification procedures use a loss-function-based approach. In the same spirit, we model the data according to a loss function $L\{y, f(\mathbf{x})\}$, which measures the loss for reporting $f(\mathbf{x})$ when the truth is $y$. Mathematically, minimizing this loss function can be equivalent to maximizing $\exp[-L\{y, f(\mathbf{x})\}]$, which is proportional to the likelihood function. This duality between “likelihood” and “loss,” that is, viewing the loss as the negative of the log-likelihood, is referred to in the Bayesian literature as a logarithmic score (or loss) function (Bernardo and Smith, 2009; Bissiri and others, 2016). A few popular choices of loss functions for binary classification are the exponential loss used in boosting classifiers (Friedman and others, 2000), the hinge loss of support vector machines (Zhang, 2004), and the logistic loss of logistic regression (Friedman and others, 2000; Zhang, 2004). In this article, we focus on the following loss,

$$L\{y_i, f(\mathbf{x}_i)\} = -\,w_i \left[ y_i \log \sigma\{f(\mathbf{x}_i)\} + (1 - y_i) \log\left(1 - \sigma\{f(\mathbf{x}_i)\}\right) \right], \tag{2.6}$$

where we define $\sigma(z) = (1 + e^{-z})^{-1}$, the logistic function, and the $w_i \geq 0$ are datapoint-specific weights. This is a generalized version of the logistic loss, first introduced by Hand and Vinciotti (2003). We recover the standard logistic loss by setting $w_i = 1$ for all $i$. Note that we specify $f$ as a linear function, that is, $f(\mathbf{x}) = \mathbf{x}^\top \boldsymbol{\beta}$, where $\boldsymbol{\beta}$ is a $(p + 1)$-dimensional vector of regression coefficients (including an intercept). Hence, our objective is to learn $\boldsymbol{\beta}$. We make this explicit by replacing $f(\mathbf{x})$ with $f_{\boldsymbol{\beta}}(\mathbf{x})$ for the rest of this work.

The datapoint-specific weights, $w_i$, allow us to tailor the standard logistic model. We wish to weigh observations based on their vicinity to the target threshold $t$, upweighting observations close to $t$ (the most informative) and downweighting those that are further away. To accomplish this, we set the weights as

$$w_i = \exp\left[-\lambda\, d\{\eta(\mathbf{x}_i), t\}\right], \qquad \lambda \geq 0, \tag{2.7}$$

where $d\{\eta(\mathbf{x}_i), t\} = \{\eta(\mathbf{x}_i) - t\}^2$ is the squared distance (see Supplementary material available at Biostatistics online for other options) and $\eta(\mathbf{x}_i)$ is the true class 1 probability of the $i$th datapoint, as in Section 2.2. Of course, in practice we do not know $\eta(\mathbf{x})$, so we cannot measure the distance between $t$ and each datapoint’s probability in order to derive these weights. To overcome this, we propose a two-stage procedure. First, the distance is measured according to an estimate of $\eta(\mathbf{x})$, denoted $\hat{\eta}(\mathbf{x})$, which can be compared with $t$ to yield the weights. This estimate could be based on any classification method: we use standard unweighted Bayesian logistic regression in the analysis below. If a well-established model of $\eta(\mathbf{x})$ already exists in the literature, it could be used instead (as in our cardiac surgery case study, see Section 4.2), making this estimation step unnecessary. After deriving the weights, they are then used to estimate $\boldsymbol{\beta}$. Finally, under the formulation in (2.7), the weights decrease with increasing distance from the target threshold $t$. The tuning parameter $\lambda$ controls the rate of that decrease. For $\lambda = 0$ we recover the standard logistic regression model. We use cross-validation to choose $\lambda$; see below for details.
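The stage-1 weights are straightforward to compute once $\hat{\eta}(\mathbf{x})$ is available; a minimal sketch (ours, using the squared distance of (2.7)):

```r
# Stage-1 tailoring weights (2.7): eta_hat holds predicted probabilities
# from a standard (unweighted) model, t is the target threshold.
tailored_weights <- function(eta_hat, t, lambda) {
  exp(-lambda * (eta_hat - t)^2)   # decays with squared distance from t
}
# lambda = 0 gives w_i = 1 for all i, recovering standard logistic regression.
```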

2.4. Tailored likelihood function

To gain a better insight into the model, we define the tailored likelihood function as

$$\tilde{L}(\boldsymbol{\beta} \mid D) = \prod_{i=1}^{n} p(y_i \mid \mathbf{x}_i, \boldsymbol{\beta})^{\,w_i}, \qquad p(y_i \mid \mathbf{x}_i, \boldsymbol{\beta}) = \sigma\{f_{\boldsymbol{\beta}}(\mathbf{x}_i)\}^{y_i} \left[1 - \sigma\{f_{\boldsymbol{\beta}}(\mathbf{x}_i)\}\right]^{1 - y_i}. \tag{2.8}$$

Strictly speaking, this quantity is not the standard logistic likelihood function. Nevertheless, it is instructive to see its correspondence with the standard likelihood function. Thus, we rewrite (2.8) (after taking the log of both sides) as

$$\log \tilde{L}(\boldsymbol{\beta} \mid D) = \sum_{i=1}^{n} w_i\, \ell_i(\boldsymbol{\beta}), \tag{2.9}$$

where $\ell_i(\boldsymbol{\beta}) = \log p(y_i \mid \mathbf{x}_i, \boldsymbol{\beta})$ is the $i$th contribution to the standard logistic log-likelihood function. We can further substitute (2.7) into (2.9),

$$\log \tilde{L}(\boldsymbol{\beta} \mid D) = \sum_{i=1}^{n} \exp\left[-\lambda \{\eta(\mathbf{x}_i) - t\}^2\right] \ell_i(\boldsymbol{\beta}),$$

to see that each datapoint’s contribution decays exponentially with its squared distance from the target threshold $t$, which summarizes the four utilities associated with binary classification problems (see (2.2)). One option to proceed is to optimize the tailored likelihood function with respect to the coefficients, in an empirical risk minimization approach (Vapnik, 1998). An attractive feature of (2.9) is that this optimization is computationally efficient, since we can rely on existing algorithmic tools, for example, (stochastic) gradient optimization. However, here we learn the coefficients in a Bayesian formalism.
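As an aside, the empirical risk minimization route takes only a few lines of R; the sketch below is our own illustration of maximizing (2.9), not the paper's implementation.

```r
# Weighted (tailored) log-likelihood (2.9) and its maximization.
# X is an n x p covariate matrix (an intercept column is added inside),
# y is the 0/1 outcome vector, and w are the weights from (2.7).
tailored_loglik <- function(beta, X, y, w) {
  eta <- drop(cbind(1, X) %*% beta)        # linear predictor f_beta(x_i)
  sum(w * (y * eta - log1p(exp(eta))))     # sum_i w_i * logistic loglik_i
}

fit_tailored <- function(X, y, w) {
  p1 <- ncol(X) + 1                        # coefficients incl. intercept
  optim(rep(0, p1), function(b) -tailored_loglik(b, X, y, w),
        method = "BFGS")$par
}
```

Equivalently, `glm(y ~ X, family = binomial, weights = w)` maximizes the same weighted log-likelihood (R warns about non-integer weights, which is harmless here).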

2.5. Bayesian tailoring

Following Bayes’ theorem, the TB posterior is

$$\tilde{\pi}(\boldsymbol{\beta} \mid D) = \frac{\tilde{L}(\boldsymbol{\beta} \mid D)\, \pi(\boldsymbol{\beta})}{\int \tilde{L}(\boldsymbol{\beta} \mid D)\, \pi(\boldsymbol{\beta})\, \mathrm{d}\boldsymbol{\beta}}, \tag{2.10}$$

where $\tilde{L}(\boldsymbol{\beta} \mid D)$ is the tailored likelihood function given in (2.8), $\pi(\boldsymbol{\beta})$ is the prior on the coefficients, and the integral in the denominator is the normalizing constant. In this work, we assume a normal prior distribution for each element of $\boldsymbol{\beta}$, that is, $\beta_j \sim N(\mu_j, \sigma_j^2)$, where $\mu_j$ and $\sigma_j$ are the mean and standard deviation, respectively, of the $j$th element of $\boldsymbol{\beta}$ ($j = 0, 1, \ldots, p$). For all analyses below, we use vague priors with $\mu_j = 0$ and a large standard deviation $\sigma_j$ for all $j$.
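For intuition, the unnormalized tailored log-posterior is simply the weighted log-likelihood plus the log-prior. A sketch, reusing `tailored_loglik` from the Section 2.4 sketch above and using placeholder prior hyperparameters (not the paper's values):

```r
# Unnormalized tailored log-posterior (2.10) with independent normal priors.
# mu and sigma are placeholder prior hyperparameters, not the paper's values.
log_posterior <- function(beta, X, y, w, mu = 0, sigma = 10) {
  tailored_loglik(beta, X, y, w) + sum(dnorm(beta, mu, sigma, log = TRUE))
}
```

Any off-the-shelf MCMC sampler can target this function; the particular scheme used in the paper is described in Section S5 of the Supplementary material.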

Conveniently, we can interpret the choice of prior as a regularizer on a per-datapoint influence/importance (see Section S1). Crucially, this allows us to view the TB posterior as combining a standard likelihood function with a data-dependent prior (Section S1). Hence, even though the tailored likelihood function does not have a probabilistic interpretation the TB posterior is a proper posterior.

In the Supplementary material available at Biostatistics online, we provide details on the model inference and prediction steps (Section S2), the cross-validation scheme for choosing $\lambda$ (Section S3), the data-splitting strategy (Section S4), and the Markov chain Monte Carlo (MCMC) algorithm we implement (Section S5).

3. Simulations

The simulations are designed to provide insight into when TB can be advantageous compared to the standard Bayesian paradigm. Two scenarios where TB is expected to outperform standard Bayes (SB) are the absence of parallelism of the optimal decision boundaries and data contamination. A decision boundary determines distinct classification regions in the covariate space. It provides a rule to classify datapoints based on whether the datapoint’s covariate vector falls inside or outside a classification region: if a datapoint falls inside the region, it is labeled as belonging to class 1 (e.g., positive); if it falls outside, it is labeled as belonging to class 0 (e.g., negative). According to Bayesian decision theory, the optimal decision boundaries determine the classification regions where the expected reward is maximized given prespecified misclassification costs (Duda and others, 2012). More specifically, we classify as positive if $\eta(\mathbf{x}) \geq t$, where $\eta(\mathbf{x})$ denotes the true class 1 probability, as in Section 2.2. Simulations 1 and 2 present two settings where the optimal decision boundaries are not parallel, with their orientation changing as a function of the target threshold. Simulation 3 is an example of data contamination.

3.1. Simulation 1: linear decision boundaries

We first evaluate the performance of tailoring by extending a simulation from Hand and Vinciotti (2003). We simulate $n$ data points on two covariates, $x_1$ and $x_2$, and assign label 1 with a probability $p(y = 1 \mid x_1, x_2)$ that depends on the covariates and on a scalar parameter, denoted here $\gamma$, following the construction of Hand and Vinciotti (2003). The parameter $\gamma$ determines the relative prevalence of the two classes: $\gamma = 1$ gives balanced classes, and smaller values make class 0 increasingly more common than class 1 (see the caption of Figure 2 for the exact correspondence). Figure 1 shows the optimal decision boundaries in the covariate space for a range of target thresholds using $\gamma = 1$ (which leads to a prevalence of 0.5). A key feature is that these boundaries are linear but not parallel. The absence of parallelism renders any linear model unsuitable as a global fit, but the linearity of the decision boundaries allows linear models to describe each individual boundary sufficiently well.
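To build intuition for linear-but-not-parallel optimal boundaries, consider the following toy data-generating process, which is our own illustration in the spirit of Hand and Vinciotti (2003) and not the paper's exact simulation.

```r
# Toy example: with covariates uniform on the unit square and
# p(y = 1 | x) = x1 / (x1 + x2), the optimal boundary for threshold t,
# {x : x1 / (x1 + x2) = t}, is the line x2 = x1 * (1 - t) / t. The lines
# for different t all pass through the origin with different slopes:
# linear, but not parallel.
set.seed(1)
n  <- 1000
x1 <- runif(n)
x2 <- runif(n)
y  <- rbinom(n, 1, x1 / (x1 + x2))
plot(x1, x2, col = y + 1, pch = 20)
for (t in c(0.1, 0.3, 0.5, 0.7, 0.9)) abline(0, (1 - t) / t, lwd = 2)
```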

Fig. 1. Optimal decision boundaries (black lines) for target thresholds 0.1, 0.3, 0.5, 0.7, and 0.9. Posterior mean boundaries for SB (grey) and TB (yellow) when targeting the (a) 0.3 and (b) 0.5 boundary. Shaded regions represent 90% highest predictive density (HPD) regions.

We use the decision boundaries corresponding to the 0.3 and 0.5 target thresholds as exemplars. SB results in a sub-optimally estimated decision boundary for $t = 0.3$ (Figure 1a): the estimated 0.3 boundary from SB is parallel to the optimal 0.5 boundary. This is expected, because under this simulation setting logistic regression is bound to find a compromise model, which should be linear with level lines roughly parallel to the true 0.5 boundary (where misclassification costs are equal). On the other hand, TB allows derivation of a decision boundary which is far closer to the optimum. Note the wider predictive regions under tailoring; this is an expected consequence of our framework, which we comment on in Section S9 of the Supplementary material available at Biostatistics online. When deriving decision boundaries under the equal costs implied by a 0.5 target threshold (Figure 1b), the two models are almost indistinguishable.

To systematically investigate the performance of tailoring across a wide range of settings, we set up different scenarios by varying: (i) the sample size, (ii) the prevalence of the outcome, and (iii) the target threshold. Model performance is evaluated in an independently sampled data set of size 2000. Under most scenarios, tailoring outperforms standard Bayesian regression (Figure 2). The performance gains are evident even for small sample sizes. With a few exceptions (most notably at the most extreme target thresholds, e.g., 0.9), the advantage of tailoring is relatively stable across sample sizes, and it persists when varying the prevalence of the outcome. In fact, under certain scenarios TB is superior to SB even for the 0.5 boundary. Figure S2 of the Supplementary material available at Biostatistics online illustrates such a scenario for $\gamma = 0.1$, which corresponds to a prevalence of 0.15. Under such class imbalance, which is common in medical applications, one might want to use tailoring over standard modeling approaches even when targeting the 0.5 boundary.

Fig. 2. Difference in Net Benefit for sample sizes of 500, 1000, 5000, and 10 000, averaged over 20 repetitions. A positive difference means TB outperforms SB. The values of 0.1, 0.5, and 1 for the $\gamma$ parameter correspond to prevalences of around 0.15, 0.36, and 0.50, respectively.

3.2. Simulation 2: quadratic decision boundaries

Our second simulation is a more pragmatic scenario in which the optimal decision boundaries are a quadratic, rather than linear, function of the covariates. The model is of the form

$$\mathbf{x} \mid y \sim N(\boldsymbol{\mu}_y, \boldsymbol{\Sigma}_y), \qquad y \in \{0, 1\}, \qquad \boldsymbol{\Sigma}_0 \neq \boldsymbol{\Sigma}_1,$$

where $\mathbf{x}$ contains the two continuous-valued predictors, and each class has its own mean vector and covariance matrix. The marginal probabilities of the outcome are equal, that is, $\Pr(Y = 1) = \Pr(Y = 0) = 0.5$. In this case of unequal covariance matrices, the optimal decision boundaries are a quadratic function of $\mathbf{x}$ (Figure 3a) (Duda and others (2012), Chapter 2). A linear model, like the one we implement, is suboptimal. Nevertheless, this example allows us to demonstrate the advantage of tailoring in an analytically tractable way, and it allows us to explore a broader array of generic simulation examples, since arbitrary Gaussian class-conditional distributions lead to decision boundaries that are general hyperquadrics.
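The quadratic boundaries are easy to reproduce numerically; the sketch below uses hypothetical Gaussian parameters (not those of the paper) and the mvtnorm package.

```r
# With unequal class covariance matrices, the Bayes-optimal boundary
# {x : p(y = 1 | x) = t} is a quadratic curve in x.
library(mvtnorm)
mu0 <- c(0, 0); S0 <- diag(2)                          # class 0
mu1 <- c(1, 1); S1 <- matrix(c(1, 0.5, 0.5, 2), 2, 2)  # class 1, unequal cov
posterior_class1 <- function(x, prior1 = 0.5) {
  d1 <- prior1 * dmvnorm(x, mu1, S1)
  d0 <- (1 - prior1) * dmvnorm(x, mu0, S0)
  d1 / (d0 + d1)
}
# Contour posterior_class1 over a grid at levels 0.1, ..., 0.9 to draw
# the optimal decision boundaries for the corresponding target thresholds.
```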

Fig. 3. (a) Optimal decision boundaries for target thresholds 0.1, 0.3, 0.5, 0.7, and 0.9. Posterior median boundaries for (b) SB and (c) TB.

Figure 3(b) and (c) shows the posterior median decision boundaries for SB and TB under the data generating model described above, for a range of target thresholds. It is clear that the orientation of the optimal decision boundary is a function of the costs. The parallel decision boundaries obtained by applying different thresholds to the standard logistic predictions are clearly not an optimal solution when compared against the optimal boundaries depicted in Figure 3(a). Although limited to the estimation of linear boundaries, tailoring is able to adapt the angle of the boundary to better approximate the optimal curves. One exception in comparative performance is the 0.5 threshold, which is estimated essentially perfectly by both models. This is expected, since the standard logistic model targets the 0.5 boundary.

As before, we investigate the performance of tailoring across a wide range of settings by varying: (i) the sample size, (ii) the prevalence of the outcome, and (iii) the target threshold. Performance is evaluated in an independently sampled test set of size 2000. Figure 4 shows the difference in NB between TB and SB. Tailoring performs similarly to or better than standard regression across all target thresholds for the prevalence scenarios of 0.3 and 0.5. For a prevalence of 0.1, the two models are closely matched. A further comparison with a non-linear model, namely Bayesian Additive Regression Trees (BART) (Sparapani and others, 2021), is detailed in the supplementary material (Section S7). Briefly, TB demonstrated equivalent or better performance than BART at the clinically relevant lower disease prevalences of 0.1 and 0.3, indicating that the benefits offered by TB cannot be matched simply by switching to a non-linear modeling framework.

Fig. 4. Difference in Net Benefit for sample sizes of 500, 1000, 5000, and 10 000, averaged over 20 repetitions. A positive difference means TB outperforms SB. Each panel corresponds to a different prevalence setting.

3.3. Simulation 3: data contamination

Our third simulation scenario demonstrates the robustness of tailoring to data contamination, that is, the situation in which a fraction of the data have been mislabeled. The data generating model is a logistic regression with a large fraction of mislabeled datapoints. We simulate two covariates and $n$ datapoints. Figure 5 depicts a scenario with a fraction of datapoints mislabeled among those with high values of both covariates, that is, in the upper right-hand side of the data cloud. For each covariate, $n$ values are independently drawn from a standard Gaussian distribution. Denoting the coefficient vector by $\boldsymbol{\beta}$ (with its first element corresponding to the intercept term), we simulate the outcome as $y_i \sim \text{Bernoulli}[\sigma\{f_{\boldsymbol{\beta}}(\mathbf{x}_i)\}]$. We then corrupt the data with class 0 datapoints, that is, we set $y_i = 0$ for $n\epsilon$ datapoints, where $\epsilon$ is the fraction of contamination, taking values from 0 up to 0.15. The covariates of these corrupted datapoints are generated from identical and independent normal distributions centered at high values of both covariates (the upper right-hand corner of the data cloud in Figure 5). This type of contamination framework was popularized by Huber (1964, 1965) and has been used extensively to study the robustness of learning algorithms to adversarial attacks in general (Balakrishnan and others, 2017; Prasad and others, 2018; Diakonikolas and others, 2019; Osama and others, 2020) and in medical applications (Paschali and others, 2018).
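The mechanism can be sketched as follows; the coefficient values and the contaminating Gaussian below are illustrative placeholders, since the paper's exact values are not reproduced here.

```r
# Sketch of the contamination mechanism (illustrative parameters).
set.seed(1)
n    <- 1000
eps  <- 0.10                                  # contamination fraction
X    <- cbind(rnorm(n), rnorm(n))             # two standard Gaussian covariates
beta <- c(0, 2, 2)                            # intercept + slopes (placeholder)
prob <- plogis(drop(cbind(1, X) %*% beta))
y    <- rbinom(n, 1, prob)

m <- round(n * eps)                           # corrupted datapoints
X[1:m, ] <- matrix(rnorm(2 * m, mean = 2.5, sd = 0.5), m, 2)  # upper right
y[1:m]   <- 0                                 # forced class 0 labels
```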

Fig. 5. Single realization from the contaminated distribution with a fraction of corrupted datapoints. Data with labels 0 and 1 are shown in blue and red, respectively. The corrupted datapoints are depicted as triangles in the upper right-hand corner of the data cloud. The lines correspond to target thresholds 0.1, 0.5, and 0.9.

We derive the optimal NB based on the true probability score in an independent, non-contaminated test data set. Figure 6 shows the results for various contamination fractions. For most fractions, TB outperforms SB. As the contamination fraction grows, the performance of both models degrades, but standard regression degrades at a faster rate. Tailoring can accommodate various degrees of contamination better than standard regression, while generally never resulting in poorer performance.

Fig. 6. Net Benefit of tailoring (red) and standard regression (green) compared to optimal classification (blue), averaged over 20 repetitions. Each panel corresponds to a different contamination fraction.

Note that under no contamination (i.e., $\epsilon = 0$, first panel of Figure 6), SB is an optimal classifier, since the optimal decision boundaries are parallel straight lines (Figure 5). However, in all other scenarios, even a data corruption as small as 5% results in poor performance under SB for the larger target thresholds. On the contrary, tailoring maintains stable performance, close to the optimum at those thresholds, for up to 15% of mislabeled datapoints.

4. Real data applications

We evaluate the performance of TB on three real-data applications involving a breast cancer prognostication task (Section 4.1), a cardiac surgery prognostication task (Section 4.2), and a breast cancer tumor classification task (Section S8 of the Supplementary material available at Biostatistics online). Overall, our empirical results demonstrate the improvement in predictive performance achieved by taking misclassification costs into consideration during model training.

4.1. Real data application 1: Breast cancer prognostication

Here, we apply the TB methodology to predict mortality after a diagnosis of invasive breast cancer. The training data are based on 4718 estrogen receptor positive subjects diagnosed in East Anglia, UK between 1999 and 2003. The outcome modeled is 10-year mortality. The covariates are age at diagnosis, tumor grade, tumor size, number of positive lymph nodes, presentation (screening vs. clinical), and type of adjuvant therapy (chemotherapy, endocrine therapy, or both). We use 20% of the data as a design set and the rest as a development set (see Figure S1 of the Supplementary material available at Biostatistics online), repeating the design/development set split five times. The entire training data set is used to fit SB. Both models are evaluated in an independent test set consisting of 3810 subjects. Detailed information on the data sets can be found in Karapanagiotis and others (2018).

An important part of the TB methodology is the choice of $t$. In breast cancer, accurate predictions are decisive because they guide treatment. In clinical practice, treatment is given if it is expected to reduce the predicted risk by at least some pre-specified magnitude. For instance, clinicians in the Cambridge Breast Unit (Addenbrooke’s Hospital, Cambridge, UK) currently use the absolute 10-year survival benefit from chemotherapy to guide decision making for adjuvant chemotherapy as follows: $<$3%, no chemotherapy; 3%–5%, chemotherapy discussed as a possible option; $>$5%, chemotherapy recommended (Down and others, 2014). Following previous work (Karapanagiotis and others, 2018), we assume that chemotherapy reduces the 10-year risk of death by 22% (Peto and others, 2012). Then, an absolute risk reduction between 3% and 5% corresponds to target thresholds between 14% and 23%. Hence, we explore misclassification cost ratios corresponding to $t$ in the range between 0.1 and 0.5.
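As a quick arithmetic check of this mapping: a patient whose absolute risk reduction from chemotherapy is 3%–5% under a 22% relative risk reduction must have a pre-treatment risk of roughly 14%–23%, which the following one-liner computes.

```r
# Pre-treatment risk r satisfies r * 0.22 = absolute risk reduction,
# so r = (absolute risk reduction) / 0.22.
c(0.03, 0.05) / 0.22   # 0.136 and 0.227, i.e., thresholds of ~14% and ~23%
```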

Figure 7 shows the difference in NB between the two models, averaged over the five splits. TB outperforms SB for most target thresholds, especially in the range where decisions about adjuvant chemotherapy are made. At its best, tailoring achieves up to 3.6 additional true positives per 1000 patients compared to SB; that is, for the same number of unnecessary treatments, TB correctly identifies 3.6 more patients per 1000 who would benefit from treatment.

Fig. 7. Difference in Net Benefit for various $t$ values evaluated on the test set. Error bars correspond to one standard error of the difference: denoting the difference in Net Benefit on split $k$ by $d_k$, $k = 1, \ldots, K$ (here $K = 5$ splits), the standard error of the difference is $\mathrm{sd}(d_1, \ldots, d_K)/\sqrt{K}$. This accounts for the fact that both models have been evaluated on the same data. The units on the y axis may be interpreted as the difference in benefit associated with one patient who would die without treatment and who receives therapy. The 0.14–0.23 shaded area on the x axis corresponds to a 3%–5% absolute difference in the risk of death with versus without chemotherapy; these are the risk ranges where chemotherapy is discussed as a treatment option.

Next, we examine the effect of tailoring on the posterior distributions of the coefficients. As an exemplar, we use the posterior samples from the model corresponding to a single target threshold within the chemotherapy decision range (Figure 8). Tailoring affects both the location and the spread of the estimates compared to standard modeling. First, note the wider spread under tailoring compared to the standard model. Second, the tailored posteriors are centered on different values. The most extreme example is the coefficient for the number of positive lymph nodes, which has a stronger positive association with the risk of death under tailoring. To quantify the discrepancy between the posteriors of the two models, Table 1 shows estimates of the overlapping area between the posteriors for each covariate; these range from 3% to 70%. The relative shifts in the magnitude of the effect sizes indicate a different relative importance of the covariates in terms of their contribution to the predictions from the two models.

Fig. 8. Marginal density plots of the posterior parameters under SB (blue) and TB (red) for the exemplar target threshold.

Table 1. Overlapping area of the posterior distributions for each coefficient, based on Gaussian kernel density estimation (Pastore and Calcagnì, 2020).

Covariate Posterior overlap (%)
Nodes 3.05
Size 23.46
Chemo 41.92
Age 48.78
Hormone 57.76
Grade 62.66
Screen 69.94
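The overlap measure in Table 1 is the shared area under two kernel density estimates. A minimal base-R sketch of the computation (ours; the paper cites the dedicated overlapping package of Pastore and Calcagnì, 2020):

```r
# Overlapping area of two posterior samples, via Gaussian kernel density
# estimates evaluated on a common grid.
posterior_overlap <- function(draws_a, draws_b, n_grid = 2^10) {
  rng <- range(draws_a, draws_b)
  d_a <- density(draws_a, from = rng[1], to = rng[2], n = n_grid)
  d_b <- density(draws_b, from = rng[1], to = rng[2], n = n_grid)
  dx  <- diff(d_a$x[1:2])                 # grid spacing
  sum(pmin(d_a$y, d_b$y)) * dx            # area under the pointwise minimum
}
```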

4.2. Real data application 2: Cardiac surgery prognostication

For our second case study, we investigate whether TB allows for better predictions, and consequently improved clinical decisions, for patients undergoing aortic valve replacement (AVR). Cardiac patients with severe symptomatic aortic stenosis are considered for surgical AVR (SAVR). Given that SAVR is typically a high-risk procedure, transcatheter aortic valve implantation (TAVI) is recommended as a lower-risk alternative, although it is associated with higher rates of complications (Baumgartner and others, 2017). The European System for Cardiac Operative Risk Evaluation (EuroSCORE) is routinely used as a criterion to choose between SAVR and TAVI (Roques and others, 2003). EuroSCORE is an operative mortality risk prediction model which takes into account 17 covariates encompassing patient-related, cardiac, and operation-related characteristics. It was first introduced by Nashef and others (1999) and was updated in 2003 (Roques and others, 2003) and 2012 (Nashef and others, 2012). Published guidelines recommend TAVI over SAVR if a patient’s predicted mortality risk is above 10% (Baumgartner and others, 2017) or 20% (Vahanian and others, 2008). Here, we compare the performance of TB with EuroSCORE and SB given these target thresholds.

We use data ($n = 9031$) from the National Adult Cardiac Surgery Audit (UK) collected between 2011 and 2018. We use 80% of the data for training and the rest for testing, repeating the train/test set split several times. For these data, a design set to estimate $\hat{\eta}(\mathbf{x})$ is not necessary (see Figure S1 of the Supplementary material available at Biostatistics online); instead, we use the predictions from EuroSCORE (Roques and others, 2003), with an extra re-calibration step to account for population/time drift (Cox, 1958; Miller and others, 1993). Figure 9 presents the results. TB outperforms both EuroSCORE and SB when targeting the 0.1 threshold, and only EuroSCORE at $t = 0.2$.

Fig. 9. Difference in Net Benefit ($\Delta$NB) between TB and EuroSCORE (ES) (red), and between TB and SB (green), for various target thresholds evaluated on the test set. Error bars correspond to one standard error of the difference (see the caption of Figure 7 for details).

We further investigate the effect of tailoring on individual parameters. Figure 10 shows the highest posterior density (HPD) regions for a subset of the covariates under SB and under TB with $t = 0.1$ and $t = 0.2$. As in the previous case study, the regions under tailoring are generally wider and centered on different values. For instance, compared to SB, under both $t = 0.1$ and $t = 0.2$ the posteriors of critical operative state and unstable angina are shifted in the same direction (positive for critical operative state and negative for unstable angina). Contrast these with the posterior of emergency status, which, compared to SB, is centered on more positive values under one target threshold and more negative values under the other. On the contrary, extracardiac arteriopathy, recent myocardial infarct, and sex are centered on similar values across the three models. This once more exemplifies the change in the contribution of some covariates towards the predicted risks when misclassification costs are taken into account.

Fig. 10. Highest posterior density (HPD) regions for the parameters. Dots represent medians; thick and thin lines represent the 90% and 95% HPD regions, respectively. The dashed vertical lines pass through the posterior median values of the SB parameters.

5. Discussion

In this work, we present Tailored Bayes, a framework to incorporate misclassification costs into Bayesian modeling. We demonstrate that our framework improves predictive performance compared to standard Bayesian modeling over a wide range of scenarios in which the costs of different classification errors are unbalanced.

The methodology relies solely on the construction of the datapoint-specific weights (see (2.7)). In particular, we need to specify the target threshold $t$, the grid of $\lambda$ values for the cross-validation, a model to estimate $\eta(\mathbf{x})$, and the distance function $d(\cdot, \cdot)$. For some applications there may be a recommended target threshold. For instance, UK national guidelines recommend that clinicians use a risk prediction model (QRISK2; Hippisley-Cox and others, 2008) to determine whether to prescribe statins for primary prevention of cardiovascular disease (CVD) if a person’s CVD risk is 10% or more (NICE, 2016). When guidelines are not available, the specification of $t$ is inevitably subjective, since it reflects the decision maker’s preferences regarding the relative costs of different classification errors. In practice, eliciting these preferences may be challenging, despite the numerous techniques that have been proposed in the literature to help with this (e.g., Tsalatsanis and others, 2010; Hunink and others, 2014). In such situations, we advocate fitting the model for a range of plausible $t$ values that reflect general decision preferences. For example, research in both mammographic (Schwartz and others, 2000) and colorectal cancer screening (Boone and others, 2013) has shown that healthcare professionals and patients alike greatly value gains in sensitivity over loss of specificity. For additional examples of setting $t$, see Vickers and others (2016) and Wynants and others (2019). Further examples in which the benefits and costs associated with an intervention (as well as patients’ preferences) are taken into account are provided by Manchanda and others (2016), Le and others (2017), and Watson and others (2020).

We discuss the remaining elements for the construction of the weights in Section S9 of the Supplementary material available at Biostatistics online. There we define the effective sample size for tailoring and show how to use it to set the upper limit of the grid of $\lambda$ values. In addition, we show that our framework is robust to miscalibration of $\hat{\eta}(\mathbf{x})$ and to the choice of distance function $d$. The framework is therefore flexible, allowing many ways for the user to specify the weights.

In contrast to the work of Hand and Vinciotti (2003), our approach is framed within the Bayesian formalism. Consequently, the tailored posterior integrates the attractive features of Bayesian inference—such as flexible hierarchical modeling, the use of prior information and quantification of uncertainty—while also allowing for tailored inference. Quantification of uncertainty is critically important, especially in healthcare applications (Begoli and others, 2019; Kompa and others, 2021). Whilst two (or more) models can perform similarly in terms of aggregate metrics (e.g., area under ROC curve) they can provide very different individual (risk) predictions for the same patient (Pate and others, 2019; Li and others, 2020). This can ultimately lead to different decisions for the individual, with potential detrimental effects. Uncertainty quantification can mitigate this issue since it allows the clinician to abstain from utilizing the model’s predictions. If there is high predictive uncertainty for an individual, the clinician can discount or even disregard the prediction.

To illustrate this point, we use the SB posterior from the breast cancer prognostication case study. The posterior predictive distributions for two patients are displayed in Figure 11. The average posterior risk for each patient is indicated by the vertical lines at 34% and 35%, respectively. Based solely on these average estimates, chemotherapy should be recommended as a treatment option to both patients (see Section 4.1). It is clear, however, that the predictive uncertainty for these two patients is quite different: the distribution of risk for patient 1 is much more dispersed than that for patient 2. One way to quantify the predictive uncertainty is to calculate the standard deviations of these distributions, which are 6.9% and 2.8% for patient 1 and patient 2, respectively. Even though both estimates are centered at similar values, the predictive uncertainty for patient 1 is more than twice that of patient 2. Using this information, we could flag patient 1 as needing more information before a clinical decision is made.

Fig. 11. Predictive uncertainty for the risk of death in two patients. These posterior predictive distributions reflect the range of risks assigned to the patients, and the mean risks are shown as vertical lines. Despite the fact that both patients have similar mean risks, we may be more inclined to trust the prediction for patient 2, given the lower amount of uncertainty associated with it.
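Per-patient predictive distributions of this kind fall directly out of the MCMC output; a sketch (our own, with illustrative names) for a logistic model:

```r
# Posterior predictive risk distribution for one patient.
# beta_draws: S x (p + 1) matrix of posterior samples (incl. intercept);
# x_new: covariate vector for the patient (illustrative names).
risk_draws <- function(beta_draws, x_new) {
  plogis(drop(beta_draws %*% c(1, x_new)))   # one risk per posterior draw
}
# mean(risk_draws(beta_draws, x_new)) gives the point prediction, and
# sd(risk_draws(beta_draws, x_new)) the predictive uncertainty used to
# flag patients such as patient 1.
```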

A few related comments are in order. In this work, we use vague Gaussian priors, but these could be replaced with other, application-specific choices. For instance, for high-dimensional data, another option is the sparsity-inducing prior used by Bayesian lasso regression (Park and Casella, 2008). Furthermore, we can easily incorporate external information in a flexible manner through $\hat{\eta}(\mathbf{x})$, in addition to the prior on the coefficients. If a well-established model exists, it is natural to consider using it to improve the performance of an expanded model; we implemented such an approach in Section 4.2. Cheng and others (2019) propose several approaches for incorporating published summary associations as prior information when building risk models. A limitation of their approaches is the requirement for a parametric model, that is, information on regression coefficients. Our method places no restriction on the form of $\hat{\eta}(\mathbf{x})$: it can arise from a parametric or a non-parametric model.

We note that we opted to use the same set of covariates, $\mathbf{x}$, to estimate both $\hat{\eta}(\mathbf{x})$ (stage 1) and $\boldsymbol{\beta}$ (stage 2). This does not need to be the case. If available, we could instead use another set of covariates, say $\mathbf{z}$, to estimate $\hat{\eta}(\cdot)$. The set $\mathbf{z}$ could be a superset or a subset of $\mathbf{x}$, or the two sets could be completely disjoint. We also note that in this work we focus on linear logistic regression to showcase the methodology (linear refers to linear combinations of the covariates), because it is widely used and allows analytical and computational tractability. Nevertheless, we stress that our framework is generic and not restricted to linear logistic regression: it can accommodate a wide range of modeling frameworks, from linear to non-linear, and from classical statistical approaches to state-of-the-art machine learning algorithms. As a result, future work could consider such extensions to non-linear models. Future work could also consider the advantages of joint estimation of the two stages, that is, estimating the weights (stage 1) and the weighted prediction probabilities (stage 2) jointly. A further direction is the extension of the framework to high-dimensional settings.

To conclude, in response to recent calls for building clinically useful models (Chatterjee and others, 2016; Shah and others, 2019), we present an overarching Bayesian learning framework for binary classification where we take into account the different benefits/costs associated with correct and incorrect classifications. The framework requires the modelers to first think of how the model will be used and the consequences of decisions arising from its use—which we would argue should be a prerequisite for any modeling task. Instead of fitting a global, agnostic model and then deploying the result in a clinical setting we propose a Bayesian framework to build towards models tailored to the clinical application under consideration.

6. Software

The R code used for the experiments in this article has been made available as an R package, TailoredBayes, on Github: https://github.com/solonkarapa/TailoredBayes.

Supplementary Material

kxab023_Supplementary_Data

Acknowledgments

The authors thank Paul Pharoah for providing the breast cancer data set and Jeremias Knoblauch for the insightful discussions.

Conflict of Interest: None declared.

Footnotes

1 Note that this example is a simplification intended to introduce the main idea of the article, namely that in some applications false positives and false negatives have different costs. We are therefore not considering the negative effects of chemotherapy toxicity, overdiagnosis/unnecessary treatment for certain cancers, quality-of-life issues, etc.

2 The statement is equivalent to the following: assume that not treating an individual with the outcome (false negative) is nine times worse than unnecessarily treating a healthy individual (false positive). Both statements result in the same harm-to-benefit ratio.

Contributor Information

Solon Karapanagiotis, MRC Biostatistics Unit, University of Cambridge, UK and The Alan Turing Institute, UK.

Umberto Benedetto, Bristol Heart Institute, University of Bristol, UK.

Sach Mukherjee, German Center for Neurodegenerative Diseases (DZNE), Bonn, Germany and MRC Biostatistics Unit, University of Cambridge, UK.

Paul D W Kirk, MRC Biostatistics Unit, University of Cambridge, UK.

Paul J Newcombe, MRC Biostatistics Unit, University of Cambridge, UK.

Supplementary material

Supplementary material is available online at http://biostatistics.oxfordjournals.org.

Funding

The Medical Research Council (MC_UU_00002/9 to S.K. and P.J.N.; MC_UU_00002/13 and MR/R014019/1 to P.D.W.K.); the National Institute for Health Research Bristol Biomedical Research Centre (NIHR Bristol BRC) to U.B.; the National Institute for Health Research (Cambridge Biomedical Research Centre at Cambridge University Hospitals NHS Foundation Trust). The views expressed are those of the authors and not necessarily those of the NHS, the NIHR, or the Department of Health and Social Care. The Alan Turing Institute under the EPSRC grant (EP/N510129/1 to S.K.). This work was partly funded by the RESCUER project, which has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No. 847912.

References

1. Baker, S. G., Cook, N. R., Vickers, A. and Kramer, B. S. (2009). Using relative utility curves to evaluate risk prediction. Journal of the Royal Statistical Society: Series A 172, 729–748.
2. Balakrishnan, S., Du, S. S., Li, J. and Singh, A. (2017). Computationally efficient robust sparse estimation in high dimensions. In: Kale, S. and Shamir, O. (editors), Conference on Learning Theory. PMLR, pp. 169–212.
3. Bartlett, P. L., Jordan, M. I. and McAuliffe, J. D. (2006). Convexity, classification, and risk bounds. Journal of the American Statistical Association 101, 138–156.
4. Baumgartner, H., Falk, V., Bax, J. J., De Bonis, M., Hamm, C., Holm, P. J., Iung, B., Lancellotti, P., Lansac, E., Rodriguez Munoz, D. and others. (2017). 2017 ESC/EACTS guidelines for the management of valvular heart disease. European Heart Journal 38, 2739–2791.
5. Begoli, E., Bhattacharya, T. and Kusnezov, D. (2019). The need for uncertainty quantification in machine-assisted medical decision making. Nature Machine Intelligence 1, 20–23.
6. Bernardo, J. M. and Smith, A. F. M. (2009). Bayesian Theory, Volume 405. John Wiley & Sons.
7. Bissiri, P. G., Holmes, C. C. and Walker, S. G. (2016). A general framework for updating belief distributions. Journal of the Royal Statistical Society: Series B 78, 1103–1130.
8. Boone, D., Mallett, S., Zhu, S., Yao, G. L., Bell, N., Ghanouni, A., von Wagner, C., Taylor, S. A., Altman, D. G., Lilford, R. and others. (2013). Patients’ and healthcare professionals’ values regarding true- and false-positive diagnosis when colorectal cancer screening by CT colonography: discrete choice experiment. PLoS One 8, e80767.
9. Chatterjee, N., Shi, J. and García-Closas, M. (2016). Developing and evaluating polygenic risk prediction models for stratified disease prevention. Nature Reviews Genetics 17, 392.
10. Cheng, W., Taylor, J. M. G., Gu, T., Tomlins, S. A. and Mukherjee, B. (2019). Informing a risk prediction model for binary outcomes with external coefficient information. Journal of the Royal Statistical Society: Series C 68, 121–139.
11. Childress, J. F. and Beauchamp, T. L. (2001). Principles of Biomedical Ethics. New York: Oxford University Press.
12. Cox, D. R. (1958). Two further applications of a model for binary regression. Biometrika 45, 562–565.
13. Diakonikolas, I., Kamath, G., Kane, D. M., Li, J., Steinhardt, J. and Stewart, A. (2019). Sever: a robust meta-algorithm for stochastic optimization. arXiv preprint arXiv:1803.02815.
14. Down, S. K., Lucas, O., Benson, J. R. and Wishart, G. C. (2014). Effect of PREDICT on chemotherapy/trastuzumab recommendations in HER2-positive patients with early-stage breast cancer. Oncology Letters 8, 2757–2761.
15. Duda, R. O., Hart, P. E. and Stork, D. G. (2012). Pattern Classification. John Wiley & Sons.
16. Elkan, C. (2001). The foundations of cost-sensitive learning. In: Proceedings of the 17th International Joint Conference on Artificial Intelligence - Volume 2, IJCAI’01. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., pp. 973–978.
17. Freedman, B. (1987). Equipoise and the ethics of clinical research. New England Journal of Medicine 317, 141–145.
18. Friedman, J., Hastie, T., Tibshirani, R. and others. (2000). Additive logistic regression: a statistical view of boosting (with discussion and a rejoinder by the authors). The Annals of Statistics 28, 337–407.
19. Hand, D. J. and Vinciotti, V. (2003). Local versus global models for classification problems: fitting models where it matters. The American Statistician 57, 124–131.
20. Hippisley-Cox, J., Coupland, C., Vinogradova, Y., Robson, J., Minhas, R., Sheikh, A. and Brindle, P. (2008). Predicting cardiovascular risk in England and Wales: prospective derivation and validation of QRISK2. BMJ 336, 1475–1482.
21. Huber, P. J. (1964). Robust estimation of a location parameter. Annals of Mathematical Statistics 35, 73–101.
22. Huber, P. J. (1965). A robust version of the probability ratio test. Annals of Mathematical Statistics 36, 1753–1758.
23. Hunink, M. G. M., Weinstein, M. C., Wittenberg, E., Drummond, M. F., Pliskin, J. S., Wong, J. B. and Glasziou, P. P. (2014). Decision Making in Health and Medicine: Integrating Evidence and Values. Cambridge University Press.
24. Karapanagiotis, S., Pharoah, P. D. P., Jackson, C. H. and Newcombe, P. J. (2018). Development and external validation of prediction models for 10-year survival of invasive breast cancer. Comparison with PREDICT and CancerMath. Clinical Cancer Research 24, 2110–2115.
25. Kompa, B., Snoek, J. and Beam, A. L. (2021). Second opinion needed: communicating uncertainty in medical machine learning. NPJ Digital Medicine 4, 1–6.
26. Kukar, M., Kononenko, I. and others. (1998). Cost-sensitive learning with neural networks. In: Proceedings of the 13th European Conference on Artificial Intelligence (ECAI). John Wiley & Sons, pp. 445–449.
27. Le, P., Martinez, K. A., Pappas, M. A. and Rothberg, M. B. (2017). A decision model to estimate a risk threshold for venous thromboembolism prophylaxis in hospitalized medical patients. Journal of Thrombosis and Haemostasis 15, 1132–1141.
28. Li, Y., Sperrin, M., Ashcroft, D. M. and van Staa, T. P. (2020). Consistency of variety of machine learning and statistical models in predicting clinical risks of individual patients: longitudinal cohort study using cardiovascular disease as exemplar. BMJ 371, m3919.
29. Ling, C. X., Yang, Q., Wang, J. and Zhang, S. (2004). Decision trees with minimal costs. In: Proceedings of the Twenty-First International Conference on Machine Learning. Association for Computing Machinery, p. 69.
30. Manchanda, R., Legood, R., Antoniou, A. C., Gordeev, V. S. and Menon, U. (2016). Specifying the ovarian cancer risk threshold of ‘premenopausal risk-reducing salpingo-oophorectomy’ for ovarian cancer prevention: a cost-effectiveness analysis. Journal of Medical Genetics 53, 591–599.
31. Margineantu, D. and Dietterich, T. (2003). A wrapper method for cost-sensitive learning via stratification. [Online; cited December 2019] Available at: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.27.1102.
32. Masnadi-Shirazi, H. and Vasconcelos, N. (2010). Risk minimization, probability elicitation, and cost-sensitive SVMs. In: Proceedings of the 27th International Conference on Machine Learning. Omnipress, pp. 759–766.
33. Miller, M. E., Langefeld, C. D., Tierney, W. M., Hui, S. L. and McDonald, C. J. (1993). Validation of probabilistic predictions. Medical Decision Making 13, 49–57.
34. Nashef, S. A. M., Roques, F., Michel, P., Gauducheau, E., Lemeshow, S., Salamon, R. and EuroSCORE Study Group. (1999). European system for cardiac operative risk evaluation (EuroSCORE). European Journal of Cardio-Thoracic Surgery 16, 9–13.
35. Nashef, S. A. M., Roques, F., Sharples, L. D., Nilsson, J., Smith, C., Goldstone, A. R. and Lockowandt, U. (2012). EuroSCORE II. European Journal of Cardio-Thoracic Surgery 41, 734–745.
36. NICE. (2016). Cardiovascular disease: risk assessment and reduction, including lipid modification. [Online; cited December 2019] Available at: https://www.nice.org.uk/guidance/cg181/chapter/1-recommendations.
37. Osama, M., Zachariah, D. and Stoica, P. (2020). Robust risk minimization for statistical learning. arXiv preprint arXiv:1910.01544.
38. Park, T. and Casella, G. (2008). The Bayesian Lasso. Journal of the American Statistical Association 103, 681–686.
39. Paschali, M., Conjeti, S., Navarro, F. and Navab, N. (2018). Generalizability vs. robustness: adversarial examples for medical imaging. arXiv preprint arXiv:1804.00504.
40. Pastore, M. and Calcagnì, A. (2020). Measuring distribution similarities between samples: a distribution-free overlapping index. Frontiers in Psychology 10, 1089.
41. Pate, A., Emsley, R., Ashcroft, D. M., Brown, B. and van Staa, T. (2019). The uncertainty with using risk prediction models for individual decision making: an exemplar cohort study examining the prediction of cardiovascular disease in English primary care. BMC Medicine 17, 1–16.
42. Pauker, S. G. and Kassirer, J. P. (1975). Therapeutic decision making: a cost-benefit analysis. New England Journal of Medicine 293, 229–234.
43. Pauker, S. G. and Kassirer, J. P. (1980). The threshold approach to clinical decision making. New England Journal of Medicine 302, 1109–1117.
44. Peto, R., Davies, C., Godwin, J., Gray, R., Pan, H. C., Clarke, M., Cutter, D., Darby, S., McGale, P., Taylor, C. and others. (2012). Comparisons between different polychemotherapy regimens for early breast cancer: meta-analyses of long-term outcome among 100,000 women in 123 randomised trials. Lancet 379, 432–444.
45. Prasad, A., Suggala, A. S., Balakrishnan, S. and Ravikumar, P. (2018). Robust estimation via robust gradient estimation. arXiv preprint arXiv:1802.06485.
46. Roques, F., Michel, P., Goldstone, A. R. and Nashef, S. A. M. (2003). The logistic EuroSCORE. European Heart Journal 24, 882–883.
47. Schwartz, L. M., Woloshin, S., Sox, H. C., Fischhoff, B. and Welch, H. G. (2000). US women’s attitudes to false positive mammography results and detection of ductal carcinoma in situ: cross sectional survey. BMJ 320, 1635–1640.
48. Shah, N. H., Milstein, A. and Bagley, S. C. (2019). Making machine learning models clinically useful. JAMA 322, 1351–1352.
49. Sparapani, R., Spanbauer, C. and McCulloch, R. (2021). Nonparametric machine learning and efficient computation with Bayesian additive regression trees: the BART R package. Journal of Statistical Software 97, 1–66.
50. Steinwart, I. (2005). Consistency of support vector machines and other regularized kernel classifiers. IEEE Transactions on Information Theory 51, 128–142.
51. Ting, K. M. (1998). Inducing cost-sensitive trees via instance weighting. In: Zytkow, J. M. and Quafafou, M. (editors), Principles of Data Mining and Knowledge Discovery. Berlin Heidelberg: Springer, pp. 139–147.
52. Tsalatsanis, A., Hozo, I., Vickers, A. and Djulbegovic, B. (2010). A regret theory approach to decision curve analysis: a novel method for eliciting decision makers’ preferences and decision-making. BMC Medical Informatics and Decision Making 10, 51.
53. Turner, J. R. (2013). Principle of equipoise. In: Gellman, M. D. and Turner, J. R. (editors), Encyclopedia of Behavioral Medicine. New York: Springer, pp. 1537–1538.
54. Vahanian, A., Alfieri, O. R., Al-Attar, N., Antunes, M. J., Bax, J., Cormier, B., Cribier, A., De Jaegere, P., Fournial, G., Kappetein, A. P. and others. (2008). Transcatheter valve implantation for patients with aortic stenosis: a position statement from the European Association of Cardio-Thoracic Surgery (EACTS) and the European Society of Cardiology (ESC), in collaboration with the European Association of Percutaneous Cardiovascular Interventions (EAPCI). European Journal of Cardio-Thoracic Surgery 34, 1–8.
55. Vapnik, V. (1998). Statistical Learning Theory. John Wiley & Sons.
56. Vickers, A. J. and Elkin, E. B. (2006). Decision curve analysis: a novel method for evaluating prediction models. Medical Decision Making 26, 565–574.
57. Vickers, A. J., Van Calster, B. and Steyerberg, E. W. (2016). Net benefit approaches to the evaluation of prediction models, molecular markers, and diagnostic tests. BMJ 352, i6.
58. Watson, V., McCartan, N., Krucien, N., Abu, V., Ikenwilo, D., Emberton, M. and Ahmed, H. U. (2020). Evaluating the trade-offs men with localised prostate cancer make between the risks and benefits of treatments: the COMPARE study. The Journal of Urology 204, 273–280.
59. Wishart, G. C., Bajdik, C. D., Dicks, E., Provenzano, E., Schmidt, M. K., Sherman, M., Greenberg, D. C., Green, A. R., Gelmon, K. A., Kosma, V.-M. and others. (2012). PREDICT Plus: development and validation of a prognostic model for early breast cancer that includes HER2. British Journal of Cancer 107, 800–807.
60. Wynants, L., van Smeden, M., McLernon, D. J., Timmerman, D., Steyerberg, E. W. and Van Calster, B., on behalf of the Topic Group ‘Evaluating diagnostic tests and prediction models’ of the STRATOS initiative. (2019). Three myths about risk thresholds for prediction models. BMC Medicine 17, 192.
61. Zadrozny, B., Langford, J. and Abe, N. (2003). Cost-sensitive learning by cost-proportionate example weighting. In: Third IEEE International Conference on Data Mining. IEEE, pp. 435–442.
62. Zhang, T. (2004). Statistical behavior and consistency of classification methods based on convex risk minimization. Annals of Statistics 32, 56–85.
