PLoS ONE. 2011 Feb 23;6(2):e16110. doi: 10.1371/journal.pone.0016110

Calibration Belt for Quality-of-Care Assessment Based on Dichotomous Outcomes

Stefano Finazzi, Daniele Poole, Davide Luciani, Paola E. Cogo, Guido Bertolini

Editor: Mike Gravenor

Abstract

Prognostic models applied in medicine must be validated on independent samples before their use can be recommended. The assessment of calibration, i.e., the model's ability to provide reliable predictions, is crucial in external validation studies. Besides having several shortcomings, statistical techniques such as the computation of the standardized mortality ratio (SMR) and its confidence intervals, the Hosmer–Lemeshow statistics, and the Cox calibration test are all non-informative with respect to calibration across risk classes. Accordingly, calibration plots reporting expected versus observed outcomes across risk subsets have been used for many years. Erroneously, the points in the plot (frequently representing deciles of risk) have been connected with lines, generating false calibration curves. Here we propose a methodology to create a confidence band for the calibration curve, based on a function that relates expected to observed probabilities across classes of risk. The calibration belt allows one to spot the ranges of risk where there is a significant deviation from ideal calibration, and it indicates the direction of the deviation. This method thus offers a more analytical view in the assessment of quality of care, compared to other approaches.

Introduction

Fair, reliable evaluation of quality of care has always been a crucial but difficult task. According to the classical approach proposed by Donabedian [1], indicators of the structure, process, or outcome of care can be variably adopted, depending on the resources available and on the purpose and context of the analysis. Whichever indicator is adopted, quality of care is assessed by comparing the value obtained in the evaluated unit with a reference standard. Unfortunately, this approach is hampered by differences, of varying importance, between the case-mix under scrutiny and the case-mix providing the reference standard, which preclude direct comparison. To solve this problem, multipurpose scoring systems have been developed in different fields of medicine. Their aim is to provide standards tailored to different case-mixes, enabling the quality of care to be measured in varying contexts. Most of these systems are prognostic models designed to estimate the probability of an adverse event occurring (e.g., patient death), thus basing quality-of-care assessment on an outcome indicator. These models are created on cohorts representative of the populations to which they will be applied [2].

A simple tool to measure clinical performance is the ratio between the observed and score-predicted (i.e., standard) probability of the event. For instance, if the observed-to-expected event probability ratio is significantly lower than 1, performance is judged to be higher than standard, and vice versa. A more sophisticated approach is to evaluate the calibration of the score, which represents the level of agreement between the observed and predicted probability of the outcome. Since most prognostic models are developed through logistic regression, calibration is usually evaluated through the two Hosmer–Lemeshow goodness-of-fit statistics, $\hat{C}$ and $\hat{H}$ [3]. The main limitations of this approach [4], [5] are overcome by Cox calibration analysis [6], [7], although this method is less popular. All these tests investigate only the degree of deviation between observed and predicted values, without providing any clue as to the region and the direction of this deviation. Nevertheless, the latter information is of paramount importance in interpreting the calibration of a model. As a result, expected versus observed outcomes across risk subgroups are usually reported in calibration plots, without any formal statistical test. Calibration plots comprise as many points as the number of subgroups considered. Since these points are expected to be related by an underlying curve, they are often connected in the so-called 'calibration curve'. However, one can more correctly estimate this curve by fitting a parametric model to the observed data. In this perspective, the analysis of standard calibration plots can guide the choice of an appropriate model.

In this paper we use two illustrative examples to show how to fit such a model, in order to plot a true calibration curve and estimate its confidence band.

Analysis

Two illustrative examples

Every year GiViTI (Italian Group for the Evaluation of Interventions in Intensive Care Medicine) develops a prognostic model for mortality prediction based on the data collected by the general ICUs that join a quality-of-care assessment project [8]. In our first example, we applied the GiViTI mortality prediction model to 194 patients admitted in 2008 to a single ICU participating in the GiViTI project.

In the second example, we applied the SAPS II [9] scoring system to predict mortality in a cohort of 2644 critically ill patients recruited by 103 Italian ICUs during 2007, to evaluate the calibration of different scoring systems in predicting hospital mortality.

In the two examples we evaluated the calibration of the models through both traditional tools and the methodology we propose. The main difference between the two examples is the sample size: quite small in the former, quite large in the latter. Any valuable approach to quality-of-care assessment should return trustworthy and reliable results irrespective of the level of application (e.g., single physician, single unit, group of units). Unfortunately, because the sample size shrinks, the closer the assessment is to the final healthcare provider (i.e., the single physician), the more the judgment varies. In this sense, it is crucial to understand how different approaches behave with different sample sizes.

In the first example, the overall observed ICU mortality was 32% (62 out of 194), compared to 33% predicted by the GiViTI model. The corresponding standardized mortality ratio (SMR) was 0.96 (95% confidence interval (CI): 0.79, 1.12), suggesting an on-average behavior of the observed unit. However, the SMR does not provide detailed information on the calibration of the model. For instance, an SMR value of 1 (perfect calibration) may be obtained even in the presence of significant miscalibrations across risk classes, which can globally compensate for one another if they go in opposite directions.
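As an illustration of this computation, the following minimal sketch (ours, not the authors' code) estimates the SMR and an approximate confidence interval from arrays of dichotomous outcomes and model predictions; the array names and the binomial approximation to the variance of the observed count are our assumptions.

```python
# A minimal sketch (not the authors' code): SMR = observed/expected deaths,
# with an approximate normal CI. `outcomes` (0/1) and `p_expected` are
# hypothetical numpy arrays; the variance approximation is our assumption.
import numpy as np
from scipy import stats

def smr_with_ci(outcomes, p_expected, level=0.95):
    observed = outcomes.sum()          # total observed deaths
    expected = p_expected.sum()        # total deaths predicted by the model
    smr = observed / expected
    z = stats.norm.ppf(1 - (1 - level) / 2)
    # binomial variance of the observed count, given the model's predictions
    se = np.sqrt((p_expected * (1 - p_expected)).sum()) / expected
    return smr, (smr - z * se, smr + z * se)
```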

The Hosmer–Lemeshow goodness-of-fit statistics are an improvement in this respect. In the two proposed tests ($\hat{C}$ and $\hat{H}$), patients are in fact ordered by risk of dying and then grouped into deciles (of equal size for the $\hat{C}$ test, of equal risk for the $\hat{H}$ test). The statistics are finally obtained by summing the relative squared distances between expected and observed mortality. In this way, every decile-specific miscalibration increases the overall statistic, independently of the sign of the difference between expected and observed mortality. The Hosmer–Lemeshow $\hat{C}$-statistic in our sample yielded a $\chi^2$ value of 32.4 with 10 degrees of freedom ($p < 0.001$), the $\hat{H}$-statistic a $\chi^2$ value of 32.7 ($p < 0.001$). These values contradict the reassuring message given by the SMR and suggest a problem of miscalibration. Unfortunately, the Hosmer–Lemeshow statistics only provide an overall measure of calibration. Hence, any ICU interested in gaining deeper insight into its own performance should explore the data with different techniques. More information is usually obtained by plotting the calibration curve (reported in the left panel of Fig. 1), which is the graphical representation of the raw numbers underlying the $\hat{C}$-statistic. In the example, the curve shows that mortality is greater than expected across low-risk deciles, lower in medium-risk deciles, greater in medium-high-risk deciles and, again, lower in high-risk deciles. Unfortunately, this plot does not provide any information about the statistical significance of deviations from the bisector. In particular, the wide oscillations that appear for expected mortality greater than 0.5 are very difficult to interpret from a clinical perspective and may simply be due to the small sample size of these deciles. Finally, it is worth remarking that connecting the calibration points wrongly suggests that the observed probability corresponding to any expected probability can be read from the curve, even between two points. This is clearly not correct, given the procedure used to build the plot.
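Readers wishing to reproduce this kind of check may start from the following sketch of the $\hat{C}$ statistic on equal-size deciles (again our illustration, not the authors' code); keeping all 10 degrees of freedom is the convention for external validation used in the text.

```python
# A sketch of the Hosmer-Lemeshow C-hat statistic on equal-size deciles
# of risk; `outcomes` and `p_expected` are the hypothetical arrays used
# in the previous sketch.
import numpy as np
from scipy import stats

def hosmer_lemeshow_C(outcomes, p_expected, groups=10):
    order = np.argsort(p_expected)              # sort patients by risk
    chi2 = 0.0
    for idx in np.array_split(order, groups):   # equal-size risk deciles
        o = outcomes[idx].sum()                 # observed events in decile
        e = p_expected[idx].sum()               # expected events in decile
        n = len(idx)
        chi2 += (o - e) ** 2 / (e * (1 - e / n))
    # for external validation all `groups` degrees of freedom are kept
    return chi2, stats.chi2.sf(chi2, df=groups)
```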

Figure 1. Calibration plots of observed mortality versus expected mortality (bisector, dashed line).

Left panel: data of 194 patients staying longer than 24 hours in a single Intensive Care Unit (ICU) taking part in GiViTI (Italian Group for the Evaluation of Interventions in Intensive Care Medicine) in 2008; expected mortality calculated with a prediction model developed by GiViTI in 2008. Right panel: data of 2644 critically ill patients admitted to 103 ICUs in Italy from January to March 2007; expected mortality calculated with SAPS II.

In the second example, the SMR was significantly different from 1 (0.83, 95% CI: 0.79, 0.88), indicating a lower than expected mortality in our sample. The two Hosmer–Lemeshow goodness-of-fit statistics ($\hat{C}$: $\chi^2$ value 226.7, $p < 0.001$; $\hat{H}$: $\chi^2$ value 228.5, $p < 0.001$) confirm poor overall calibration. Finally, the calibration curve (Fig. 1, right panel) tells us that the lower than expected mortality is proportional to patient severity, as measured by expected mortality. The first two dots are so close to the bisector that, despite lying above it, they do not modify the general message. Since expected mortality is calculated using an old model, the most natural interpretation is that, as expected, ICUs performed consistently better in 2007 than in 1993, when the SAPS II score was developed.

In summary, the above-mentioned tools for assessing quality of care based on dichotomous outcomes suffer from various drawbacks, which are only partially offset by using them in combination. The SMR and the Hosmer–Lemeshow goodness-of-fit statistics only provide information on the overall behavior, which is almost invariably insufficient for a good clinical understanding, for which detailed information at specific values of mortality would be necessary. The calibration curve seems to provide complementary information, but at least two main disadvantages undermine its interpretation: first, it is not really a curve; second, it is not accompanied by any information on the statistical significance of deviations from the bisector. In the following sections, we propose a method to fit the calibration curve and to compute its confidence band. The method is applied to both examples.

The calibration curve

We define $P_O$ as the probability of the dichotomous outcome experienced by a patient admitted to the studied unit, and $P_E$ as the expected probability of the same outcome, provided by an external model representing the reference standard of care. The quality of care is assessed by determining the relationship between $P_O$ and $P_E$, described by a function $f$. In the ICU example, if a patient has a theoretical probability $P_E$ of dying, his actual probability $P_O$ differs from $P_E$ depending on the level of care the admitting unit is able to provide. If he has entered a well-performing unit, $P_O$ will be lower than $P_E$, and vice versa. Hence, we can write

$$P_O = f(P_E) \qquad (1)$$

The function $f$, to be determined, represents the level of care provided or, in mathematical terms, the calibration function of the reference model with respect to the given sample.

We start by noting that, from a clinical standpoint, $P_E = 1$ represents an infinitely severe patient with no chance of survival. The opposite happens in the case of $P_E = 0$, an infinitely healthy patient with no chance of dying. Moreover, in the vast majority of real cases, the expected probability of death is provided by a logistic regression model

$$P_E = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k)}} \qquad (2)$$

where $x_1, \dots, x_k$ are the patient's physiological and demographic parameters and $\beta_0, \dots, \beta_k$ are the logistic parameters. In this case the values $P_E = 0$ or $P_E = 1$ can only be obtained with non-physical, infinite values of the variables $x_i$, which therefore correspond to infinite (theoretical) values of the physiological or demographic parameters.

This feature can be made more explicit by a standard change of variables. Instead of $P_E$ and $P_O$, ranging between 0 and 1, we use two new variables $L_E$ and $L_O$, ranging over the whole real axis $(-\infty, +\infty)$, such that $L_E = \operatorname{logit}(P_E)$ and $L_O = \operatorname{logit}(P_O)$. A traditional way of doing so is to log-linearize the probabilities through a logit transformation, where the logit of a probability $P$ is the natural logarithm of $P/(1-P)$. Hence, Eq. (1) is rewritten as

$$L_O = g(L_E), \qquad L_E = \operatorname{logit}(P_E), \quad L_O = \operatorname{logit}(P_O) \qquad (3)$$

In a very general way, one can approximate $g$ with a polynomial $g_m$ of degree $m$:

$$g_m(L_E) = \sum_{j=0}^{m} a_j L_E^{\,j} \qquad (4)$$

Once the relation between the logits, $L_O = g_m(L_E)$, has been determined, the function $f$, as expressed in Eq. (1), is approximated up to order $m$ by

$$f_m(P_E) = \frac{e^{g_m(L_E)}}{1 + e^{g_m(L_E)}} \qquad (5)$$

where $L_E$ is given in Eq. (3).
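In code, Eqs. (3)-(5) amount to evaluating a polynomial on the logit scale and mapping the result back to a probability. A minimal sketch, with the vector `a` holding the coefficients $(a_0, \dots, a_m)$ (names are ours):

```python
# Eqs. (3)-(5) in code: f_m(P_E) = inverse-logit of g_m(logit(P_E)).
# A sketch; `a` is the coefficient vector (a_0, ..., a_m).
import numpy as np

def logit(p):
    return np.log(p / (1 - p))

def calibration_f(p_expected, a):
    L_E = logit(p_expected)
    g = np.polyval(a[::-1], L_E)     # a_0 + a_1*L_E + ... + a_m*L_E**m
    return 1.0 / (1.0 + np.exp(-g))  # back to the probability scale
```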

When $m = 1$, Eq. (5) reduces to the Cox calibration function [6]. In this particular case, the probability $P_O$ is a logistic function of the logit of the expected probability $P_E$. The values of the parameters $a_0, \dots, a_m$ can be estimated through the maximum likelihood method from a given set of observations $\left(P_E^{(i)}, o_i\right)$, $i = 1, \dots, N$, where $o_i$ is the $i$-th patient's final dichotomous outcome (0 or 1). Consequently, the estimators $\hat{a}_j$ are obtained by maximizing

$$\ln \Lambda = \sum_{i=1}^{N} \left[ o_i \ln f_m\!\left(P_E^{(i)}\right) + \left(1 - o_i\right) \ln\!\left(1 - f_m\!\left(P_E^{(i)}\right)\right) \right] \qquad (6)$$

where $\Lambda$ is the likelihood function and $\ln \Lambda$ is its natural logarithm.
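Maximizing Eq. (6) is simply a logistic-regression fit of the outcomes on the powers of $L_E$. A sketch under that reading, with a generic numerical optimizer (function and variable names are ours):

```python
# Maximum-likelihood estimation of the a_j of Eq. (6): a logistic
# regression of the outcomes on 1, L_E, ..., L_E**m. A sketch.
import numpy as np
from scipy.optimize import minimize

def fit_calibration(outcomes, p_expected, m):
    L_E = np.log(p_expected / (1 - p_expected))
    X = np.vander(L_E, m + 1, increasing=True)   # design matrix: powers of L_E

    def neg_loglik(a):
        g = X @ a
        # minus Eq. (6), written in a numerically stable form
        return np.sum(np.logaddexp(0.0, g) - outcomes * g)

    res = minimize(neg_loglik, x0=np.zeros(m + 1), method="BFGS")
    return res.x, -res.fun, res.hess_inv   # estimates, max log-likelihood,
                                           # approximate covariance matrix
```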

The optimal value of $m$ can be determined with a likelihood-ratio test. Defining $\ln \Lambda_m$ as the maximum of the log-likelihood $\ln \Lambda$ for a given $m$, the variable

$$D_{m+1} = 2 \left( \ln \Lambda_{m+1} - \ln \Lambda_m \right) \qquad (7)$$

is distributed as a $\chi^2$ with 1 degree of freedom, under the hypothesis that the system is truly described by a polynomial $g_m$ of order $m$. Starting from $m = 1$, a new parameter $a_{m+1}$ is added to the model only if the improvement in the likelihood provided by this new parameter is significant enough, that is, when

$$D_{m+1} > F^{-1}_{\chi^2_1}(q) \qquad (8)$$

where $F^{-1}_{\chi^2_1}$ is the inverse of the $\chi^2$ cumulative distribution with 1 degree of freedom. In the present paper we use $q = 0.95$. The iterative procedure stops at the first value of $m$ for which the above inequality is not satisfied. That is, the final value of $m$ is such that $D_j > F^{-1}_{\chi^2_1}(q)$ for each $j$ with $2 \le j \le m$, and $D_{m+1} \le F^{-1}_{\chi^2_1}(q)$.
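The forward selection of Eqs. (7)-(8) can then be sketched as follows, reusing `fit_calibration()` from the previous sketch; the cap `m_max` is our own safeguard, not part of the procedure described here.

```python
# Forward selection of the polynomial degree m via Eqs. (7)-(8).
from scipy import stats

def select_degree(outcomes, p_expected, q=0.95, m_max=4):
    m = 1
    _, loglik_m, _ = fit_calibration(outcomes, p_expected, m)
    while m < m_max:
        _, loglik_next, _ = fit_calibration(outcomes, p_expected, m + 1)
        D = 2.0 * (loglik_next - loglik_m)         # Eq. (7)
        if D <= stats.chi2.ppf(q, df=1):           # Eq. (8) fails: stop
            break
        m, loglik_m = m + 1, loglik_next
    return m
```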

The choice of a quite large value of $q$ (i.e., retaining only highly significant coefficients) is supported by clinical reasons. In the quality-of-care setting, the calibration function should indeed avoid multiple changes in the relationship between observed and expected probabilities. Whilst it is untenable to assume that performance is uniform along the whole spectrum of severity, it is even less likely that it changes many times. We can imagine a unit that is better (or worse) at treating sicker patients than healthier ones, but it would be very odd to find a unit that performs well (or poorly) in less severe patients, poorly (or well) in medium-severity patients, and well (or poorly) in more severe patients. Large values of $q$ ensure that only significant phenomena are spotted, without spurious effects related to the statistical noise of the data.

A measure of the quality of care can thus be derived from the coefficients $a_j$. If $a_1 = 1$ and $a_j = 0$ for every $j \neq 1$, the considered unit performs exactly as the general model (i.e., the calibration curve matches the bisector). Overall calibration can be assessed through a likelihood-ratio test or a Wald test applied to the coefficients $a_j$, with the null hypothesis $a_1 = 1$, $a_j = 0$ for $j \neq 1$, which corresponds to perfect calibration. In the particular case in which $m = 1$, $a_0$ and $a_1$ can be respectively identified with the Cox parameters $\alpha$ and $\beta$ [6]. Cox referred to them respectively as the bias and the spread, because $\alpha$ represents the average behavior with respect to perfect calibration, while $\beta$ signals the presence of different behaviors across risk classes.
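A sketch of this Wald test, using the approximate covariance matrix returned by `fit_calibration()` above (the quadratic-form implementation is ours):

```python
# Wald test of perfect calibration (H0: a_1 = 1, all other a_j = 0).
import numpy as np
from scipy import stats

def wald_test(a_hat, cov):
    a_null = np.zeros_like(a_hat)
    a_null[1] = 1.0                        # bisector: g_m(L_E) = L_E
    diff = a_hat - a_null
    W = diff @ np.linalg.solve(cov, diff)  # quadratic form diff' cov^-1 diff
    return stats.chi2.sf(W, df=len(a_hat))
```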

In the first example (single ICU), the iterative procedure described above stops at $m = 1$, that is, the linear approximation of the calibration function. The likelihood-ratio test gives a $p$-value of 0.048, and the Wald test gives a comparable result. Both tests warn that the model does not calibrate well in the sample. Notably, this approach discloses a miscalibration that the SMR fails to detect (see section Two illustrative examples), confirming the result of the $\hat{C}$ and $\hat{H}$ tests. In the second example (a group of ICUs), the iterative procedure stopped at $m = 2$. The likelihood-ratio test and the Wald test both give vanishingly small $p$-values ($p < 0.001$), indicating a miscalibration of the model.

One approach to obtaining more detailed information about the range of probabilities in which the model does not calibrate well is to plot the calibration function of Eq. (5), built with the estimated coefficients $\hat{a}_j$, $j = 0, \dots, m$, where $m$ is fixed by the procedure described above. In Fig. 2, we plot such a curve for our examples in the range of expected probability for which observations are present, in order to avoid extrapolation. The model calibrates well when the calibration curve is close to the bisector. This curve is clearly more informative than the traditional calibration plot of expected against observed outcomes averaged over subgroups (Fig. 1). In fact, spurious effects related to statistical noise due to sparsely populated subgroups (the high-risk deciles) are completely suppressed in this new plot. However, no statistically meaningful information concerning the deviation of the curve from the bisector has yet been provided.

Figure 2. Calibration functions (solid line) compared to the bisector (dashed line) for the two discussed examples.

The stopping criterion yielded $m = 1$ for the left curve and $m = 2$ for the right one. To avoid extrapolation, the curves have been plotted in the range of mortality where data are present. Refer to the caption of Fig. 1 for information about the data sets.

The calibration belt

To estimate the degree of uncertainty around the calibration curve, we have to compute the curve's confidence belt. In general, given a confidence level $1 - \alpha$, if the experiment were repeated many times, the whole unknown true curve $f(P_E)$ would be contained in the confidence belt in a fraction $1 - \alpha$ of the experiments. The problem of drawing a confidence band for a general logistic response curve ($m = 1$) has been solved in [10], [11]. In Appendix S1, the analysis of [10] is generalized to the case in which $m > 1$. In this section we report only the result.

Determining a confidence region for the curve $f_m(P_E)$ is equivalent to determining a confidence region in the $(m+1)$-dimensional space of the parameters $a_0, \dots, a_m$. This is easy once one notes that, for large $N$, the estimators $\hat{a}_j$, obtained by maximizing the likelihood of Eq. (6), have a multivariate normal distribution with mean values $a_j$, variances $\sigma^2_{jj}$, and covariances $\sigma^2_{jk}$ (see Eq. (S2) in Appendix S1).

Given a confidence level $1 - \alpha$, it is possible to show (see Appendix S1) that the confidence band for $f_m(P_E)$ is

$$\frac{e^{L_O^{\min}}}{1 + e^{L_O^{\min}}} \;\le\; f_m(P_E) \;\le\; \frac{e^{L_O^{\max}}}{1 + e^{L_O^{\max}}} \qquad (9)$$

where the confidence interval $\left[L_O^{\min}, L_O^{\max}\right]$ of the logit $L_O = g_m(L_E)$ is

$$L_O^{\min,\,\max} = \hat{g}_m(L_E) \mp \sqrt{F^{-1}_{\chi^2_2}(1-\alpha)\,\sum_{j,k=0}^{m} \hat{\sigma}^2_{jk}\,L_E^{\,j+k}} \qquad (10)$$

and $F^{-1}_{\chi^2_2}$ is the inverse of the $\chi^2$ cumulative distribution with 2 degrees of freedom. The hat above the variances denotes that the values are estimated through the maximum likelihood method.
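In code, Eqs. (9)-(10) can be sketched as follows; `cov` denotes the estimated covariance matrix of the coefficients, e.g. the one returned by `fit_calibration()` above.

```python
# Eqs. (9)-(10) in code: the band for g_m(L_E) from the estimated
# covariance matrix of the coefficients, mapped back to probabilities.
import numpy as np
from scipy import stats

def calibration_belt(p_grid, a_hat, cov, level=0.95):
    L_E = np.log(p_grid / (1 - p_grid))
    X = np.vander(L_E, len(a_hat), increasing=True)
    g = X @ a_hat                                  # fitted logit g_m(L_E)
    var_g = np.einsum("ij,jk,ik->i", X, cov, X)    # sum_jk sigma2_jk L_E^(j+k)
    half = np.sqrt(stats.chi2.ppf(level, df=2) * var_g)   # Eq. (10)
    expit = lambda z: 1.0 / (1.0 + np.exp(-z))
    return expit(g - half), expit(g + half)        # Eq. (9)
```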

It is worth noting the one-to-one correspondence between this procedure to build the confidence band and the Wald test applied to the set of parameters $a_j$. In fact, when the test $p$-value is less than $\alpha$, the band at confidence level $1 - \alpha$ does not include the bisector, and vice versa.

We are now able to plot the confidence belt for the observed probability $P_O$ as a function of the expected probability $P_E$ given by the reference model. Since the parameters of the calibration curve and belt are estimated through a fitting procedure, in order to prevent incorrect extrapolation one must not extend them outside the range of expected probability $P_E$ in which observations are present. In Fig. 3 we plot two confidence belts for both examples, using $1 - \alpha = 0.80$ (inner belt, dark gray) and $1 - \alpha = 0.95$ (outer belt, light gray). Statistically significant information on the regions where the model calibrates poorly can now be derived from this plot: they are the regions where the bisector is not contained in the belt.
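A hypothetical usage example in the style of Fig. 3, drawing the two belts with matplotlib and restricting the grid to the observed range of $P_E$; `a_hat` and `cov` come from `fit_calibration()` and `p_expected` is as in the earlier sketches.

```python
# Example usage (hypothetical data): belts in the style of Fig. 3.
import numpy as np
import matplotlib.pyplot as plt

p_grid = np.linspace(p_expected.min(), p_expected.max(), 200)
fig, ax = plt.subplots()
for level, shade in [(0.95, "0.85"), (0.80, "0.60")]:   # outer, then inner
    lo, hi = calibration_belt(p_grid, a_hat, cov, level)
    ax.fill_between(p_grid, lo, hi, color=shade)
ax.plot([0, 1], [0, 1], "k--", label="bisector")        # ideal calibration
ax.set_xlabel("Expected mortality")
ax.set_ylabel("Observed mortality")
ax.legend()
plt.show()
```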

Figure 3. Calibration belts for the two discussed examples at two confidence levels.

Confidence levels $1 - \alpha = 0.80$ (dark shaded area) and $1 - \alpha = 0.95$ (light shaded area); $m = 1$ for the first example (left panel), $m = 2$ for the second (right panel); bisector, dashed line. As in Fig. 2, the calibration bands have been plotted in the range of mortality where data are present. Refer to the caption of Fig. 1 for information about the data sets.

In the first example ($m = 1$), the confidence belts do not contain the bisector for expected mortality values higher than 0.56 (80% confidence level) and 0.83 (95% confidence level). This clarifies the result of the Hosmer–Lemeshow tests, which had already highlighted the miscalibration of the model for this particular ICU. It is now possible to claim with confidence that this miscalibration corresponds to a better performance of the studied ICU, compared to the national average, for high-severity patients.

In the second example, given the larger sample, the number of significant parameters is 3 ($m = 2$) and the information provided by the calibration belt is very precise, as shown by the very narrow bands. From the calibration belt, the observed mortality is lower than the expected one when the latter is greater than 0.25, while the model is well calibrated for low-severity patients. The lower-than-expected mortality is not surprising and can be attributed to improvements in the quality of care since SAPS II was developed, about 15 years before data collection.

Discussion

Calibration, which is the ability to correctly relate the real probability of an event to its estimation from an external model, is pivotal in assessing the validity of predictive models based on dichotomous variables. This problem can be approached in two ways. The first is to use statistical methods that investigate the overall calibration of the model with respect to an observed sample. This is the case with the SMR, the Hosmer–Lemeshow statistics, and the Cox calibration test. As shown in this paper, all these statistics have drawbacks that limit their usefulness in quality-of-care assessment. The aim of the second approach is to localize possible miscalibration as a function of the expected probability. An easy but misleading way to achieve this is to plot averages of observed and expected probability over subsets. As illustrated above, this procedure might lead to non-informative or even erroneous conclusions.

We propose a solution to assess the dependence of calibration on the expected probability by fitting the observed data with a very general calibration function and plotting the corresponding curve. This method also enables a confidence band to be computed for the curve, which can be plotted as a calibration belt. This approach makes it possible to finely discriminate the ranges in which the model miscalibrates, in addition to indicating the direction of this phenomenon. It thus offers a substantial improvement in the assessment of quality of care, compared to other available tools.

Supporting Information

Appendix S1

Computation of the confidence band. In this Appendix, we compute the confidence band for the calibration curve. By generalizing the procedure given in [10] to the case in which $m > 1$, we demonstrate the results reported in Eqs. (9) and (10).

(PDF)

Acknowledgments

All authors substantially contributed to the conception of the work and the interpretation of the data, and to drafting or critically revising the article. All authors approved the final version of the manuscript. None of the authors has any conflict of interest in relation to this work. The authors thank Laura Bonavera, Marco Morandotti and Carlotta Rossi for stimulating discussions. The authors also thank all the participants from the ICUs who took part in the project, providing the data for the illustrative examples. Finally, the authors wish to thank an anonymous referee whose suggestions considerably helped to improve and generalize our treatment.

Footnotes

Competing Interests: The authors have declared that no competing interests exist.

Funding: The authors have no support or funding to report.

References

1. Donabedian A. The quality of care. How can it be assessed? JAMA. 1988;260:1743–1748. doi:10.1001/jama.260.12.1743.
2. Wyatt J, Altman D. Prognostic models: clinically useful or quickly forgotten? BMJ. 1995;311:1539–1541.
3. Lemeshow S, Hosmer D. A review of goodness of fit statistics for use in the development of logistic regression models. Am J Epidemiol. 1982;115:92–106. doi:10.1093/oxfordjournals.aje.a113284.
4. Bertolini G, D'Amico R, Nardi D, Tinazzi A, Apolone G, et al. One model, several results: the paradox of the Hosmer–Lemeshow goodness-of-fit test for the logistic regression model. J Epidemiol Biostat. 2000;5:251–253.
5. Kramer A, Zimmerman J. Assessing the calibration of mortality benchmarks in critical care: the Hosmer–Lemeshow test revisited. Crit Care Med. 2007;35:2052–2056. doi:10.1097/01.CCM.0000275267.64078.B0.
6. Cox D. Two further applications of a model for a method of binary regression. Biometrika. 1958;45:562–565.
7. Miller M, Hui S, Tierney W. Validation techniques for logistic regression models. Stat Med. 1991;10:1213–1226. doi:10.1002/sim.4780100805.
8. Rossi C, Pezzi A, Bertolini G. Progetto Margherita - Promuovere la ricerca e la valutazione in Terapia Intensiva. RAPPORTO 2008. Bergamo: Edizioni Sestante; 2009.
9. Le Gall JR, Lemeshow S, Saulnier F. A new Simplified Acute Physiology Score (SAPS II) based on a European/North American multicenter study. JAMA. 1993;270:2957–2963. doi:10.1001/jama.270.24.2957.
10. Hauck W. A note on confidence bands for the logistic response curve. The American Statistician. 1983;37:158–160.
11. Brand R, Pinnock D, Jackson K. Large sample confidence bands for the logistic response curve and its inverse. The American Statistician. 1973;27:157–160.
