Abstract
Objective:
To improve the calibration of logistic regression (LR) estimates using local information.
Background:
Individualized risk assessment tools are increasingly being utilized. External validation of these tools often reveals poor model calibration.
Methods:
We combine a clustering algorithm with an LR model to produce probability estimates that are close to the true probabilities for a particular case. The new method is compared to a standard LR model in terms of calibration, as measured by the sum of absolute differences (SAD) between model estimates and true probabilities, and discrimination, as measured by area under the ROC curve (AUC).
Results:
We evaluate the new method on two synthetic data sets. SADs are significantly lower (p < 0.0001) in both data sets, and AUCs are significantly higher in one data set (p < 0.01).
Conclusion:
The results suggest that the proposed method may be useful to improve the calibration of LR models.
Introduction
There are several examples of clinical decision support tools that are based on classifiers learned from data, such as the Framingham Cardiovascular Disease Risk Calculator.1 The quality of a classifier depends on its discriminatory power (how well it can distinguish between two or more classes) and on its calibration (how close the system output is to the “true” probability of an event). Historically, the question of assessing a model’s discrimination has received much more research interest than the question of assessing its calibration.2–4 This imbalance can be traced, at least in part, to the difficulty of defining exactly what constitutes a true probability in the context of predictive models.
The well-known Hosmer-Lemeshow (HL) goodness-of-fit statistic for logistic regression (LR) models5 sidesteps this question by measuring to what extent a model’s prediction for a case agrees with the relative frequency in the vicinity of the case. This only indirectly addresses the main issue, because “vicinity of the case” is defined by proximity in terms of model output (HL-H) or in terms of which decile of risk a particular case falls into (HL-C), and not in terms of proximity in input space (i.e., how similar two cases are without respect to a particular model). A diagnostic output of 0.2 obtained from a model that is considered well-calibrated in the HL sense does not mean that two out of ten similar cases are diseased, but instead that two out of ten cases with similar model output are diseased.
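To make the grouping by model output concrete, the following is a minimal sketch of a decile-of-risk (HL-C style) statistic. It is an illustration under stated assumptions, not the implementation used in this work: the function name `hosmer_lemeshow_c` is hypothetical, the groups are formed by simply cutting the cases sorted by model output into equal-size chunks, and the p-value uses the closed-form chi-square tail that holds only for an even number of degrees of freedom (here 10 groups, 8 degrees of freedom).

```python
import math

def hosmer_lemeshow_c(probs, labels, groups=10):
    """Return (statistic, p_value) for a decile-of-risk HL-style test.

    Assumption: 'groups' is even, so that groups - 2 degrees of freedom
    is even and the closed-form chi-square tail below applies.
    """
    # Cases are grouped by MODEL OUTPUT only; input-space similarity is ignored.
    pairs = sorted(zip(probs, labels), key=lambda t: t[0])
    n = len(pairs)
    stat = 0.0
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        if not chunk:
            continue
        expected = sum(p for p, _ in chunk)   # expected number of events
        observed = sum(y for _, y in chunk)   # observed number of events
        mean_p = expected / len(chunk)
        denom = len(chunk) * mean_p * (1.0 - mean_p)
        if denom > 0:
            stat += (observed - expected) ** 2 / denom
    # Chi-square survival function for an even number of degrees of freedom.
    df = groups - 2
    half = stat / 2.0
    p_value = math.exp(-half) * sum(half ** i / math.factorial(i)
                                    for i in range(df // 2))
    return stat, p_value
```

Because the chunks depend only on the sorted model outputs, two cases that are far apart in input space but receive similar outputs end up in the same group, which is the limitation discussed above.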
This seemingly minor but important distinction between similarity in input space and similarity in model output space is hardly news to researchers in the field of biomedical informatics. However, one cannot assume that the general public is aware of this distinction. The reason why such a distinction is now relevant to everyone is that recent years have brought a proliferation of self-help websites that calculate risk estimates for a variety of diseases. Although several of these websites are endorsed by reputable institutions, they nevertheless vary quite widely in their risk assessment of the same individual. From a purely technical point of view based on HL, these models may all be well calibrated. However, they may not produce estimates that can be considered calibrated from a common sense perspective. As an example, it cannot easily be explained why, when predicting the risk of coronary heart disease, the personal risk score of 17% obtained from one website (American Heart Association) and the personal risk score of “greater than 30%” obtained from another website (NIH’s National Heart, Lung, and Blood Institute) can both be well-calibrated. In this paper, we seek to ameliorate this situation by presenting a method that modifies LR outputs by considering similarity in input space. This modified output is an approximation of the true probability distribution in input space, and can be considered a hybrid of local non-parametric regression and LR. Using synthetic data for which the true probability distribution is known, we show that our method gives probability estimates that are closer to the true probabilities than the original, unaltered LR outputs.
Method
Our approach to improving the calibration of LR models is divided into three tasks, the first two of which can be done in parallel. In Task 1, we apply a clustering algorithm to identify regions of similar cases within the data set. In Task 2, we train an LR model to rank the cases according to their class-membership probabilities. In Task 3, we subdivide the clusters we identified in Task 1 using the case rankings of the LR model obtained in Task 2, and use the relative frequency of cases in a subcluster as our model output. That is, when applying our final model to a new case, we use the clustering obtained in Task 3 to check whether the new case is sufficiently similar to cases in the training set. If it is, we make a prediction that reflects the local probability distribution in the region of the new case. If it is not, we treat this case as an outlier, and do not make a prediction.
Model construction
In Task 1, we use the EM algorithm to fit a Gaussian mixture model to the data set. The EM algorithm6 is an iterative process for finding the maximum-likelihood estimates of the parameters of an underlying distribution from a given data set. It is often applied to mixture models, in which case it requires knowing the initial number of distributions whose parameters are adjusted to the given data set. In our algorithm, we apply the Gap statistic of Tibshirani et al.7 to find the optimal number of mixture distributions. This method compares the sum of within-cluster distances of the actual data with that of a clustered uniform distribution, and chooses the smallest number of clusters for which this difference is significant.
After running the EM algorithm with the appropriate number of components, we partition the training data according to the mixture model. Each data point is assigned to the cluster for which its posterior membership probability is highest. The result of Task 1 is thus a mixture model that provides an approximation to the density distribution of the data.
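The partitioning step can be sketched as follows. This is a simplified illustration, not the code used in this work: it assumes two-dimensional inputs and spherical Gaussians (covariance σ² times the identity), and the names `spherical_gaussian_pdf`, `assign_cluster`, and `COMPONENTS` are hypothetical. The means and variances are the ones reported for Experiment 1; the mixture weights of 0.25, 0.25, and 0.5 are an assumption.

```python
import math

def spherical_gaussian_pdf(x, mean, var):
    """Density of a 2-D Gaussian with covariance var * I (assumed shape)."""
    r2 = (x[0] - mean[0]) ** 2 + (x[1] - mean[1]) ** 2
    return math.exp(-r2 / (2.0 * var)) / (2.0 * math.pi * var)

def assign_cluster(x, components):
    """Return (index of most probable component, list of posteriors) for x.

    'components' is a list of (weight, mean, variance) triples.
    """
    weighted = [w * spherical_gaussian_pdf(x, mean, var)
                for w, mean, var in components]
    total = sum(weighted)
    posteriors = [v / total for v in weighted]
    best = max(range(len(posteriors)), key=posteriors.__getitem__)
    return best, posteriors

# Illustrative mixture: parameters from Experiment 1, weights assumed.
COMPONENTS = [
    (0.25, (3.5, 4.5), 0.6),  # diseased cluster 1
    (0.25, (6.3, 6.8), 0.4),  # diseased cluster 2
    (0.50, (7.0, 4.0), 1.1),  # healthy cluster
]
```

For example, `assign_cluster((6.3, 6.8), COMPONENTS)` returns the index of the second diseased component together with the three posterior probabilities.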
For Task 2, we train a regularized LR model by maximum likelihood. The appropriate size of the parameter governing the regularization term is chosen by 10-fold cross-validation. The result of Task 2 is thus an LR model for predicting posterior class membership probabilities. Depending on the data set, this model may or may not “pass” the HL goodness-of-fit test for LR models, where “pass” means that the p-value of the HL statistic is above 0.05, a commonly acceptable threshold.
Task 3 represents the novel contribution of our work. Our aim is to improve the calibration of the LR output by taking similarities in input space into account. For this, we combine the mixture model obtained in Task 1 with the estimated probability of the trained LR model of Task 2 in order to refine our mixture model. The combination is done as follows: For every cluster in the mixture model, we sort the data points in that cluster according to their probabilities in the LR model. Then, we divide each cluster into subclusters of at least 10 elements with similar LR-estimated probabilities, beginning at the end of the cluster that is closer to the LR separation line. If the difference between largest and smallest LR probability in the subcluster is smaller than a pre-defined constant (we use 0.05), we keep adding elements to this subcluster until this limit in difference is reached. The result of Task 3 is thus a cluster division that is more fine-grained than the one obtained in Task 1.
In summary, the procedure essentially uses LR-based outputs to order cases within clusters and subdivides the latter into smaller groups. These new groups can be viewed as indirectly supervised subclusters, as they are determined by the estimates produced by the LR model.
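The subdivision of a single cluster can be sketched as below. This is one possible reading of the procedure, not a definitive implementation: the function names are hypothetical, the end "closer to the LR separation line" is taken to be the end whose LR probability is nearer to 0.5, and merging a leftover group of fewer than 10 cases into the preceding subcluster is an assumption, since the text does not specify how such a remainder is handled.

```python
def subdivide_cluster(lr_probs, min_size=10, max_spread=0.05):
    """Split one cluster into subclusters of similar LR probability.

    Returns a list of subclusters, each a list of indices into lr_probs.
    """
    order = sorted(range(len(lr_probs)), key=lambda i: lr_probs[i])
    # Start at the end of the cluster nearer the LR decision boundary (p = 0.5).
    if order and abs(lr_probs[order[-1]] - 0.5) < abs(lr_probs[order[0]] - 0.5):
        order.reverse()
    subclusters, current = [], []
    for idx in order:
        if len(current) >= min_size:
            # Points are sorted, so the spread is first element vs. candidate.
            spread_if_added = abs(lr_probs[idx] - lr_probs[current[0]])
            if spread_if_added >= max_spread:
                subclusters.append(current)
                current = []
        current.append(idx)
    if current:
        if len(current) < min_size and subclusters:
            subclusters[-1].extend(current)  # assumption: merge small remainder
        else:
            subclusters.append(current)
    return subclusters

def local_estimates(subclusters, labels):
    """Proportion of cases labeled 1 in each subcluster."""
    return [sum(labels[i] for i in sub) / len(sub) for sub in subclusters]
```

With 25 cases whose LR probabilities are spaced 0.01 apart starting at 0.5, this sketch yields two subclusters of 10 and 15 cases: the first closes as soon as it holds 10 cases because its spread already exceeds 0.05, and the final 5 cases are merged into the second.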
Model application
The refined clustering that constitutes our improved model allows the calculation of class-membership probabilities that are better calibrated than the LR posterior probabilities. To predict the probability of a new data point, we determine the subcluster to which it belongs. This is done by finding the cluster of the mixture model that contains the data point, and then using the LR output of the data point to locate the appropriate subcluster. If the data point does not belong to any cluster in the mixture model, we treat it as an outlier. We can use this information, for example, to suppress the output in a decision support system because there are not enough similar cases to make an accurate prediction (i.e., there are no other patients like this in the training set and the extrapolation is not warranted).
The check for outlier status of a data point x is done by calculating the squared Mahalanobis distance of x from each multivariate Gaussian in the mixture model. This quantity is known to be chi-square distributed with m degrees of freedom, where m is the dimensionality of the input space. A case that has p < 0.001 for every cluster in the mixture model is considered an outlier.
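A minimal sketch of this outlier check follows. It is restricted, as an assumption, to the two-dimensional case with spherical Gaussians, for which the squared Mahalanobis distance reduces to the squared Euclidean distance divided by σ² and the chi-square tail probability with 2 degrees of freedom has the closed form exp(−d²/2). For general covariance matrices or dimensionality, a matrix inverse and a general chi-square routine would be needed. The name `is_outlier` and the example parameters (taken from Experiment 1) are illustrative.

```python
import math

def is_outlier(x, components, alpha=0.001):
    """Return True if x has tail probability below alpha for every cluster.

    'components' is a list of (mean, variance) pairs for 2-D spherical
    Gaussians (an assumption made to keep the sketch self-contained).
    """
    for mean, var in components:
        # Squared Mahalanobis distance for covariance var * I.
        d2 = ((x[0] - mean[0]) ** 2 + (x[1] - mean[1]) ** 2) / var
        # Chi-square survival function with 2 degrees of freedom.
        p = math.exp(-d2 / 2.0)
        if p >= alpha:
            return False  # close enough to at least one cluster
    return True

# Illustrative clusters using the Experiment 1 parameters.
CLUSTERS = [((3.5, 4.5), 0.6), ((6.3, 6.8), 0.4), ((7.0, 4.0), 1.1)]
```

Under these assumptions, a point at a cluster center is never flagged, whereas the corner point (0, 0) is far from all three clusters and would be treated as an outlier.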
If the data point is not an outlier, and was determined to be in subcluster C, the improved probability estimate we provide is the proportion of cases in subcluster C labeled with “1”.
Experiments
In our experiments, we used two synthetic data sets to assess the performance of our algorithm. Each data set consisted of two compact Gaussian distributions forming the diseased cases, and one wider Gaussian distribution forming the healthy cases. For each of the two data sets, we generated a sample of n = 1000 cases within the [0,9] × [0,9] square, with 500 cases drawn from the diseased distributions and 500 cases from the healthy distribution.
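A data set of this kind could be generated along the following lines. This is a sketch under several assumptions that are not stated in the text: the 500 diseased cases are split evenly between the two diseased distributions, points falling outside the [0,9] × [0,9] square are rejected and redrawn, and the Gaussians are spherical. The function names and the seed are hypothetical; the means and variances are those reported for Experiment 1.

```python
import random

def sample_in_square(rng, mean, var, low=0.0, high=9.0):
    """Draw one 2-D point from a spherical Gaussian, redrawing until it
    falls inside the square (rejection sampling is an assumption)."""
    sd = var ** 0.5
    while True:
        x, y = rng.gauss(mean[0], sd), rng.gauss(mean[1], sd)
        if low <= x <= high and low <= y <= high:
            return (x, y)

def make_experiment1_dataset(seed=0):
    """Return a shuffled list of ((x, y), label) pairs, label 1 = diseased."""
    rng = random.Random(seed)
    # (mean, variance, number of cases); the 250/250 split is assumed.
    diseased = [((3.5, 4.5), 0.6, 250), ((6.3, 6.8), 0.4, 250)]
    healthy = [((7.0, 4.0), 1.1, 500)]
    data = []
    for label, comps in ((1, diseased), (0, healthy)):
        for mean, var, count in comps:
            data.extend((sample_in_square(rng, mean, var), label)
                        for _ in range(count))
    rng.shuffle(data)
    return data
```

Fixing the seed makes the sample reproducible, which is convenient when the same training set has to be passed to both the clustering and the LR steps.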
For both data sets, the Gap statistic of Task 1 correctly identified an optimum of three clusters. The final model found by the EM algorithm was close to the true mixture of distributions. The cluster centers deviated by at most 0.132 from the true cluster centers. The maximal difference in σ² was 0.113.
Experiment 1
Figure 1 depicts the data set used in our first experiment. The cluster centers μ are at (3.5, 4.5) and (6.3, 6.8) with σ² = 0.6 and 0.4, respectively, for the diseased cases, and at μ = (7, 4) with σ² = 1.1 for the healthy cases. The diseased cases are illustrated by red × and the healthy cases are illustrated by green O. The optimal separating line between the two classes (shown as the dotted curve) can be calculated because the underlying distribution of our synthetic data set is known. Because the LR model passes the HL-C test (p = 0.948), its calibration can be considered high. By comparison with the true probabilities we will, however, show that the LR model is not as well calibrated as the local probability estimates of our model.
Figure 1.
Data used for Experiment 1 (500 healthy and 500 diseased cases drawn from three different known distributions). Class separation by the logistic regression model is shown by the dashed line. Optimal separation is shown by the dotted line.
For the highlighted red ×, the LR probability estimate is 0.24. This probability does not reflect the true probability of 0.76. By contrast, our approach returns a probability estimate of 0.70. Because the optimal separating curve is nonlinear, there are several data points for which the LR classifier predicts the wrong class membership.
For the highlighted green O, we estimate a local probability of 0.50. The true probability of this data point is 0.16, and the estimated probability from the LR model is 0.78.
Figure 2 shows the left portion of Figure 1 in more detail. The circle indicates the 0.999 iso-contour line of the left disease cluster, as identified by the EM algorithm. Because of overlap with the other two clusters, not all of the points within this circle actually belong to this cluster. Those data points with higher probability of belonging to one of the other two clusters are shown in lighter color. The solid lines within the cluster borders define the refinement of the cluster. One can see that the cluster is subdivided into five subclusters. Starting with the bottom-most subcluster, the local probability estimates are 0.7, 0.8, 0.9, 0.97 and 1, respectively. The first three of these contain 10 data points each. The last two subclusters are more homogeneous in terms of LR probabilities, resulting in larger (broader) subclusters.
Figure 2.
Details from the identified cluster on the left portion of Figure 1.
To determine the calibration of our model and that of the LR model, we calculated the sum of absolute differences (SAD) between the model’s estimates for each case and the case’s true probability. For our model, we obtained a SAD of 37.05. The SAD of the LR model was 52.44, about 1.4 times higher than that of our approach. This difference is highly significant (p < 0.0001, Wilcoxon signed-rank test). Note that the absolute difference between the estimate produced by the model and the true underlying probability for a particular case (which we here use as a measure of “discalibration”) is different from the residual, the difference between the estimate produced by the model and the label “0” or “1”. The former can only be calculated for problems in which the distributions from which the cases are drawn are all known. It is directly related to the common sense notion of calibration for binary classifiers. The latter can be measured from the observations.
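The two quantities involved can be sketched as follows. The true probability of disease at a point is the posterior obtained from the known mixture by Bayes' rule, and the SAD sums the absolute deviations from it. The sketch assumes two-dimensional spherical Gaussians, ignores the truncation to the [0,9] × [0,9] square, and assumes mixture weights of 0.25, 0.25, and 0.5; the function names are hypothetical.

```python
import math

def true_probability(x, diseased, healthy):
    """Posterior probability of disease at x under a known mixture.

    'diseased' and 'healthy' are lists of (weight, mean, variance) triples
    for 2-D spherical Gaussians (assumed form).
    """
    def weighted_density(components):
        total = 0.0
        for weight, mean, var in components:
            r2 = (x[0] - mean[0]) ** 2 + (x[1] - mean[1]) ** 2
            total += weight * math.exp(-r2 / (2.0 * var)) / (2.0 * math.pi * var)
        return total
    d = weighted_density(diseased)
    h = weighted_density(healthy)
    return d / (d + h)

def sum_of_absolute_differences(estimates, truths):
    """SAD between model estimates and true probabilities."""
    return sum(abs(e - t) for e, t in zip(estimates, truths))

# Experiment 1 parameters; the weights are an assumption.
DISEASED = [(0.25, (3.5, 4.5), 0.6), (0.25, (6.3, 6.8), 0.4)]
HEALTHY = [(0.50, (7.0, 4.0), 1.1)]
```

As a small worked example restricted to the two highlighted cases of Experiment 1 (true probabilities 0.76 and 0.16), the LR estimates 0.24 and 0.78 give a SAD of 0.52 + 0.62 = 1.14, whereas the local estimates 0.70 and 0.50 give 0.06 + 0.34 = 0.40.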
We evaluated the classification performance of our approach on a test set with 500 diseased and 500 healthy cases, sampled from the same mixture distribution as the training set. We classified the test set with a classification accuracy of 94.8% and an area under the ROC curve (AUC) of 0.975. The classification accuracy of the LR model was 93.4%. The AUC was the same as that of our approach.
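For reference, the AUC can be computed directly from its interpretation as the probability that a randomly chosen diseased case receives a higher estimate than a randomly chosen healthy case.4 The sketch below uses this pairwise definition and counts ties as one half; the function name is hypothetical, and the quadratic loop is chosen for clarity rather than speed. Tie handling is likely to matter for estimates that are subcluster proportions, since many cases then share the same value.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

For instance, with diseased scores 0.9, 0.8, 0.3 and healthy scores 0.4, 0.2, 0.1, eight of the nine diseased-healthy pairs are ordered correctly, giving an AUC of 8/9.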
Experiment 2
Figure 3 depicts the data set used in our second experiment. The parameters of the mixture distributions are μ = (3.3, 5.2), σ² = 0.4 and μ = (6.3, 6.5), σ² = 0.3 for the diseased cases, and μ = (5.2, 4.5), σ² = 2 for the healthy cases. As in Figure 1, the diseased cases are illustrated by red × and the healthy cases by green O. For this data set, the LR model outputs do not pass the HL-C test (p < 0.00001).
Figure 3.
Data used in Experiment 2. The class boundary defined by the LR model is depicted by the dashed line. Optimal separation is shown by the two circles.
For the highlighted red ×, our local probability estimate is 0.43. The true probability for this case is 0.65, and the estimated probability from the LR model is 0.36. The advantage of our method is even more pronounced for the highlighted green O: the true probability is 0.08, but LR assigns this case to the diseased distribution with an estimated probability of 0.93. Our approach approximates the true probability more accurately by returning a probability of 0.
For experiment 2, we obtained an SAD of 104.34. In contrast, the LR model’s SAD was 169.14. Again using the Wilcoxon signed-rank test, this difference was highly significant (p < 0.0001).
As in experiment 1, we evaluated the discriminatory power of our approach on a test set sampled from the mixture distribution of the training set. We classified the test data set with an accuracy of 81.9%. The AUC was 0.853. The LR model had a classification accuracy of 76.8%, and an AUC value of 0.801. Thus, we achieved a significantly higher AUC result than that of the LR model (p < 0.01).
Discussion
Embedded in the idea of personalized medicine is the concept of individualization of risk assessment, diagnosis, therapeutic interventions, and prognosis. Since critical decisions (such as whether or not to recommend cholesterol lowering medication to patients who are at-risk for cardiovascular disease) are based on the individual probabilities estimated by LR models, it is important that the estimates be as calibrated as possible.
We showed that our method performs well for two synthetic data sets. The use of synthetic data is necessary because it is the only way to establish the “ground truth”, or true underlying probability for each case. As explained before, the case label “0” or “1” is the result of drawing from this true underlying distribution and it constitutes the only observable outcome. The method modifies estimates from LR models by taking advantage of information related to similarity of cases.
We have taken an initial step towards better calibration of LR-based risk assessment models. However, this study has several limitations: (1) we have only empirically demonstrated the performance of this method, and need to develop its theoretical framework in detail, (2) we have only used two simple data sets in which both the clustering and the LR steps performed well, and we have not explored problems in which either one performs poorly, (3) measuring calibration by SAD, although technically correct, is not applicable in practice, and we are investigating more practical measures, and (4) we have not compared the performance of this method to that of some conventional recalibration methods, because the latter are expected to perform poorly since most are heavily based on recalibration of the intercept or other linear transformation of the estimates,8 which would be of little help for the examples studied here.
Conclusion
We presented a method to improve the calibration of LR models, which are currently widely used in biomedicine and now easily available to anyone on the web. We compared our method’s estimates with those of an LR model in two examples in which the data distributions were known. The proposed method performed significantly better in terms of calibration (SAD), and also had a significantly higher discrimination (AUC) in one of the examples.
Acknowledgments
This work was funded in part by grant 1R01LM009520-01 from the National Library of Medicine, NIH, grant FAS0703850 from the Komen Foundation (LO), and the Austrian Genome Program (GENAU), project Bioinformatics Integration Network (BIN II).
References
1. Wilson PW, D’Agostino RB, Levy D, Belanger AM, Silbershatz H, Kannel WB. Prediction of coronary heart disease using risk factor categories. Circulation. 1998;97:1837–47. doi: 10.1161/01.cir.97.18.1837.
2. Cook NR. Statistical evaluation of prognostic versus diagnostic models: beyond the ROC curve. Clinical Chemistry. 2008;54:17–23. doi: 10.1373/clinchem.2007.096529.
3. Dreiseitl S, Ohno-Machado L, Binder M. Comparing three-class diagnostic tests by three-way ROC analysis. Medical Decision Making. 2000;20:323–331. doi: 10.1177/0272989X0002000309.
4. Hanley JA, McNeil BJ. The meaning and use of the area under the receiver operating characteristic (ROC) curve. Radiology. 1982;143:29–36. doi: 10.1148/radiology.143.1.7063747.
5. Hosmer DW, Hosmer T, le Cessie S, Lemeshow S. A comparison of goodness-of-fit tests for the logistic regression model. Statistics in Medicine. 1997;16:965–980. doi: 10.1002/(sici)1097-0258(19970515)16:9<965::aid-sim509>3.0.co;2-o.
6. Dempster A, Laird N, Rubin D. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B. 1977;39:1–38.
7. Tibshirani R, Walther G, Hastie T. Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society, Series B. 2001;63:411–423.
8. Zadrozny B, Elkan C. Transforming classifier scores into accurate multiclass probability estimates. Proceedings of the 8th ACM conference on Knowledge discovery and data mining. 2002:694–699.



