AMIA Summits on Translational Science Proceedings. 2012 Mar 19;2012:39–46.

I-spline Smoothing for Calibrating Predictive Models

Yuan Wu1,*, Xiaoqian Jiang1,*, Jihoon Kim1, Lucila Ohno-Machado1
PMCID: PMC3392066  PMID: 22779048

Abstract

We proposed the I-spline Smoothing approach for calibrating predictive models by solving a nonlinear monotone regression problem. We took advantage of I-spline properties to obtain globally optimal solutions while keeping the computational cost low. Numerical studies on three data sets provided empirical evidence that I-spline Smoothing improves calibration (by 1.6x, 1.4x, and 1.4x on the three data sets, relative to the average of the competitors: Binning, Platt Scaling, Isotonic Regression, Monotone Spline Smoothing, and Smooth Isotonic Regression) without deterioration of discrimination.

Introduction

Learning models focused on maximizing discrimination (i.e., the ability to separate positive cases from negative cases) often ignore calibration, which relates to the correctness of the predicted values themselves. The latter aspect is important to medical decision making, since clinicians may use predictive model estimates as surrogates for individualized risk scores2,3. If the predictive model is not calibrated (e.g., if raw outputs of a Support Vector Machine are used to represent risk), the resulting decisions may be wrong. As molecular markers from genomics and proteomics are increasingly considered in predictive models and become available to consumers4,5, calibration is even more crucial to enable reliable risk assessment, diagnosis, and prognosis based on individual genomics and proteomics6–8.

This challenge becomes even more critical as medicine grows increasingly “personalized,” since predicted scores must faithfully reflect the probability of outcomes for individual patients. Unfortunately, many popular predictive models (e.g., Decision Trees and Naive Bayes classifiers) do not optimize calibration9. To improve a predictive model's calibration without deteriorating its discrimination, we need to develop novel and practical approaches.

Related work

There have been a number of attempts to improve the calibration of predictive models. To understand the pros and cons of each, we briefly review state-of-the-art methods. The most intuitive idea is binning10, which sorts and groups predicted scores into bins and replaces the predicted scores with the fraction of positive cases within each bin, so as to reduce the discrepancy between predictions and the unknown true probabilities. This approach, although capable of improving calibration, may decrease discrimination due to the loss of rankings within each bin. Platt suggested a rescaling model11 that uses an additional logistic regression model to refit predictions (i.e., predictor variables) against class labels (i.e., the target variable). This method can convert arbitrary predicted scores (e.g., outputs of a Support Vector Machine model) into estimated probabilities. However, the approach is parametric, which limits its ability to calibrate predictive models. Zadrozny and Elkan proposed another approach12 named Isotonic Regression (IR), a model that minimizes the mean squared error while respecting monotonic constraints, i.e., keeping the order of the predictions and hence not altering the ROC curve. These authors showed that IR can achieve superior performance over some baseline methods. A problem with this approach is that its lack of smoothness might decrease its generalization performance13.

Yet another approach, a monotone spline smoothing technique proposed by Wang and Li14, offers both smoothness and non-parametricity. The method is theoretically sound, but it is complicated to implement in practice (the problem is high-dimensional due to the large number of spline knots, the constraints enforcing monotonicity of the estimate are complicated, etc.), and, as acknowledged by the authors, more efficient ways of choosing the penalty term parameters are necessary. A more recent attempt by Jiang et al13 uses a two-step approach to obtain a smoothed isotonic regression: (1) fitting an isotonic regression model to obtain the knot points; (2) using these knot points to refit a Piecewise Cubic Hermite Interpolating Polynomial (PCHIP) model. However, this heuristic approach has no theoretical guarantees of optimality. Table 1 summarizes the characteristics of a number of popular calibration approaches; its last row summarizes the properties of I-spline Smoothing, the new calibration method introduced in this paper.

All of the above-mentioned techniques11–14, except for binning, aim at solving a monotonic regression problem

$$\min_{f} \sum_{k=1}^{n} \left(c_k - f(p_k)\right)^2 \quad \text{s.t.} \quad f(p_k) \le f(p_{k+1}), \; 0 \le f(p_k) \le 1, \; \forall k,$$

where $p_k$ is the pre-calibrated estimate from the model and $c_k$ is the observed binary outcome. Therefore, another way to understand the differences and challenges among the various methods is to look at their specific assumptions about the function. For example, Platt assumes $f(\cdot)$ to be an inverse logit function11, Isotonic Regression assumes a free-form $f(\cdot)$ that only has values at the $p_k$, and Monotone Spline Smoothing assumes $f(\cdot)$ to be a natural cubic spline14. Each of these assumptions brings its own challenge, as discussed earlier. This paper introduces a new approach in which we assume the function $f(\cdot)$ to be a member of the cubic spline family with an I-spline basis. Thanks to its compelling properties, the computation of I-spline Smoothing can be made much easier than optimizing natural cubic splines with Monotone Spline Smoothing14. Figure 1 illustrates the adjusted probability estimates produced by five different calibration approaches applied to the predictions of an LR model on a linearly separable data set.

Methodology

I-splines are monotone splines whose most obvious application is to monotone nonlinear regression problems, as discussed by Ramsay15. Though attractive for their simple expression and theoretical properties, very few articles have described real applications of I-spline techniques. Lu et al16 and Wu17 used I-splines to solve Maximum Likelihood Estimation (MLE) problems. In this paper, we identify a new application of I-splines to solve calibration problems. Let us start with an introduction to basic I-spline concepts. The $l$-th order I-splines based on a knot sequence are defined by Ramsay15 as

$$I_1^l(s) = 1, \tag{1a}$$
$$I_i^l(s) = \int_L^s M_i^l(t)\,dt, \quad 1 < i \le q, \tag{1b}$$

with $L \le s \le U$, where $L$ and $U$ are the left and right end knots of the knot sequence, respectively. The number of I-splines is $q = n(\text{knot}) + l + 1$, where $n(\text{knot})$ is the number of interior knots (i.e., the knots that are not end knots in the knot sequence). Note that $i$ corresponds to the index of the I-splines. Wu17 identified an interesting relationship: each $M_i^l(t)$ is related to a B-spline18 $N_i^l(t)$ through $M_i^l(t) = l\,N_i^l(t)/(u_{i+l} - u_i)$, and therefore B-splines can be used to construct I-splines

$$I_i^l(s) = \sum_{m=i}^{q} N_m^{l+1}(s). \tag{2}$$

It is easy to compute I-splines using formula (2), since B-splines can be computed efficiently and are already available in statistical packages. De Boor19 showed that $\int_L^U M_i^l(t)\,dt = 1$ and that each $M_i^l(t)$ is nonnegative, which implies that the I-splines in (1a) and (1b) are monotone and have function values between 0 and 1. The non-parametricity, monotonicity, and range constraint to $[0,1]$ together make I-splines good candidates for modeling distribution functions.
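To make formula (2) concrete, the following minimal sketch evaluates an I-spline basis from B-splines with SciPy. It is our own illustration, not code from the paper: the function name ispline_basis, the default of cubic splines ($l = 3$), and the interval $[0,1]$ are assumptions.

```python
import numpy as np
from scipy.interpolate import BSpline

def ispline_basis(s, interior_knots, L=0.0, U=1.0, l=3):
    """Evaluate the q I-splines of order l at points s via formula (2):
    I_i^l(s) = sum_{m=i}^{q} N_m^{l+1}(s), where the N_m^{l+1} are
    B-splines of order l+1 (i.e., degree l) on the same knot sequence."""
    # Repeating each end knot l+1 times yields q = n(knot) + l + 1 B-splines.
    t = np.r_[[L] * (l + 1), np.sort(interior_knots), [U] * (l + 1)]
    q = len(interior_knots) + l + 1
    s = np.atleast_1d(np.asarray(s, dtype=float))
    N = np.empty((len(s), q))
    for m in range(q):
        coef = np.zeros(q)
        coef[m] = 1.0                      # pick out the m-th B-spline
        N[:, m] = BSpline(t, coef, l)(s)   # degree l = order l+1
    # Tail sums over m >= i implement formula (2); column 0 is constant 1,
    # matching (1a), because clamped B-splines form a partition of unity.
    return np.cumsum(N[:, ::-1], axis=1)[:, ::-1]
```

For instance, `ispline_basis(p, np.percentile(p, [25, 50, 75]))` returns an $n \times q$ design matrix whose columns are monotone and bounded by $[0,1]$, as required.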

Given pre-calibrated prediction probabilities P = {p1,⋯, pn} and class labels C = {c1,⋯, cn}, we now show how to use I-spline-based smoothing techniques to calibrate predictive models by solving a nonlinear monotone least squares regression problem.

Define $\Omega = \left\{ f : f(t) = \sum_{i=1}^{q} \alpha_i I_i^l(t), \; \alpha_i \ge 0, \; \sum_{i=1}^{q} \alpha_i \le 1 \right\}$ as the space of I-spline functions. The monotone least squares regression finds $f^* \in \Omega$ that minimizes:

$$\sum_{k=1}^{n} \left(c_k - f(p_k)\right)^2. \tag{3}$$

As mentioned before, I-splines are monotone with values between 0 and 1, so the constraints on $\{\alpha_i\}_1^q$ in $\Omega$ guarantee that each $f \in \Omega$ is monotone with function values lying between 0 and 1. Given a knot sequence, the I-splines are fixed; hence this monotone regression problem amounts to minimizing (3) with respect to the I-spline coefficients $\{\alpha_i\}_1^q$ under the constraints $\alpha_i \ge 0$ for $i = 1,\cdots,q$ and $\sum_{i=1}^{q} \alpha_i \le 1$. We can rewrite this problem as a maximization problem:

$$\underset{\{\alpha_i\}_1^q}{\arg\max} \left[ -\sum_{k=1}^{n} \left( c_k - \sum_{i=1}^{q} \alpha_i I_i^l(p_k) \right)^2 \right] \quad \text{s.t.} \quad \alpha_i \ge 0 \; \forall i, \quad \sum_{i=1}^{q} \alpha_i \le 1 \tag{4}$$

The next question is how to pick interior knots, which is always critical to any spline-based technique. Intuitively, there are two general rules:

  1. More interior knots should be added to allow more flexibility;

  2. More interior knots should be added where samples are frequently observed.

The second rule is easy to follow: after deciding the number of interior knots, we can position them according to sample percentiles. How many interior knots to use, however, must be decided on a case-by-case basis. Ramsay15 mentioned that very few interior knots, say 1 or 2, are necessary for I-spline-based regression problems. However, both Lu et al16 and Wu17 chose the cube root of the sample size as the number of interior knots for MLE-based spline estimation, and their experiments supported this choice. Given our sample size $n$, we used $\max\{1, n^{1/3} - 4\}$ as the number of interior knots, which worked best for the estimation proposed in this paper when cubic splines were applied; a small sketch of this rule appears below.
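The following sketch combines the two rules. Rounding $n^{1/3}$ to the nearest integer and spacing the percentiles evenly are our assumptions; the paper does not specify these details.

```python
import numpy as np

def choose_interior_knots(p):
    """Pick interior knots per the paper's rule max{1, n^(1/3) - 4},
    placed at evenly spaced percentiles of the predictions p (rule 2)."""
    n = len(p)
    n_knot = max(1, int(round(n ** (1.0 / 3.0))) - 4)
    # Interior percentiles only, e.g. 25/50/75 for three knots.
    qs = np.linspace(0.0, 100.0, n_knot + 2)[1:-1]
    return np.percentile(p, qs)
```

For example, with $n = 209$ (the GSE2034 sample size), $n^{1/3} \approx 5.9$, giving 2 interior knots placed at the 33rd and 67th percentiles of the predictions.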

The maximization problem (4) can be solved with a generalized gradient projection algorithm1. First we rewrite the constraints in (4) as $X\alpha \le y$, where $X = (x_1, x_2, \cdots, x_{q+1})^T$ with $x_1 = (-1, 0, \cdots, 0)^T$, $x_2 = (0, -1, 0, \cdots, 0)^T$, $\cdots$, $x_q = (0, \cdots, 0, -1)^T$, $x_{q+1} = (1, \cdots, 1)^T$; $\alpha = (\alpha_1, \cdots, \alpha_q)^T$; and $y = (0, \cdots, 0, 1)^T$. If some I-spline coefficients equal 0, or all coefficients sum to 1, we say the corresponding constraints are active, and we let $\bar{X}\alpha = \bar{y}$ represent all active constraints, where the rows of $\bar{X}$ and $\bar{y}$ are a subset of the rows of $X$ and $y$. $\bar{X}$ is used to facilitate the computation.

Initially, we put the integers representing active constraints in a vector $\Lambda$ (including the indexes of I-spline coefficients that equal 0, and $(q+1)$ when all coefficients sum to 1). A vector $\Lambda$ with $r$ scalars corresponds to an $r \times q$ matrix $\bar{X}$. For example, if $\Lambda = (2, 1, (q+1))$, then $\bar{X} = (x_2, x_1, x_{q+1})^T$.

We denote the target function $-\sum_{k=1}^{n}\left(c_k - \sum_{i=1}^{q} \alpha_i I_i^l(p_k)\right)^2$ in (4) as $F(\alpha)$. Let $\partial F(\alpha)$ and $H(\alpha)$ be the gradient and Hessian matrix of $F(\alpha)$ with respect to $\alpha$, respectively. Let $W = -H(\alpha) + \gamma I$, where $I$ is an identity matrix and $\gamma$ is set large enough to make $W$ positive definite. With this notation in place, the generalized gradient projection algorithm is implemented as Algorithm 1.
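In matrix form these quantities are simple to compute. The sketch below assumes a design matrix B with entries $B_{ki} = I_i^l(p_k)$ (e.g., built with the ispline_basis sketch above); the function name is ours.

```python
import numpy as np

def f_grad_hess(alpha, B, c):
    """Target function of (4) with B[k, i] = I_i^l(p_k) and binary
    outcomes c: F(alpha) = -||c - B alpha||^2, so the gradient is
    dF = 2 B^T (c - B alpha) and the Hessian is H = -2 B^T B."""
    r = c - B @ alpha                 # residuals
    F = -float(r @ r)
    grad = 2.0 * (B.T @ r)
    hess = -2.0 * (B.T @ B)           # constant, negative semi-definite
    return F, grad, hess
```

Note that $W = -H + \gamma I = 2B^T B + \gamma I$ is positive definite for any $\gamma > 0$, which is what the algorithm requires.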

Experiments

To compare different calibration methods, we used two indices, the Area Under the ROC Curve (AUC)20 and the decile-based Hosmer-Lemeshow goodness-of-fit test (HL-test)21, to assess model’s discrimination and calibration, respectively. We compared the original logistic regression model and four calibration approaches: Platt Scaling (PS), Isotonic Regression (IR), Smooth Isotonic Regression (SIR), and I-spline Smoothing (IS). Because I-spline Smoothing is strictly monotonic, we expect it would not decrease the AUC of an input model. The experiment is to verify our expectation, and evaluate if IS has the potential to improve calibration.
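For reference, this is a minimal sketch of the decile-based HL statistic we describe above, assuming y (binary outcomes) and p (predicted probabilities) are NumPy arrays; a "pass" at the 0.05 level corresponds to a p-value above 0.05.

```python
import numpy as np
from scipy.stats import chi2

def hl_test(y, p, bins=10):
    """Decile-based Hosmer-Lemeshow statistic: sort cases by predicted
    probability, split into `bins` equal groups, and compare observed
    vs. expected event counts; under good fit the statistic is
    approximately chi-square with bins - 2 degrees of freedom."""
    order = np.argsort(p)
    stat = 0.0
    for g in np.array_split(order, bins):
        n_g = len(g)
        obs, exp = y[g].sum(), p[g].sum()
        pbar = exp / n_g                   # mean predicted risk in group
        stat += (obs - exp) ** 2 / (n_g * pbar * (1.0 - pbar) + 1e-12)
    return stat, chi2.sf(stat, bins - 2)   # (statistic, p-value)
```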

Data

We used three real-world data sets to evaluate the performance of the proposed I-spline Smoothing calibration method. Table 2 summarizes the data in terms of feature dimension, sample size, training-to-test ratio, and a short description.

Table 2:

Summary of the data used to conduct the experiments.

| Data set | Feature dimension | Sample size | Training / Test ratio | Note |
|---|---|---|---|---|
| GSE203422 | 15 | 209 | 6 / 4 | Breast cancer data set from the NCBI Gene Expression Omnibus (GEO), used to construct a decision support system for predicting recurrences of breast cancer from extracted gene expression features. We followed Osl et al23 to select features. |
| Edin (MI)24 | 48 | 1,253 | 6 / 4 | These data contain clinical and electrocardiographic information about 500 patients with and without myocardial infarction (MI) admitted with chest pain to an emergency department in Sheffield, England. The study aimed to determine which, and how many, data items are required to construct a decision support system for early diagnosis of acute myocardial infarction24. |
| PIMATR25 | 8 | 768 | 6 / 4 | Pima Indians Diabetes data set from the National Institute of Diabetes and Digestive and Kidney Diseases. The population lives near Phoenix, Arizona, USA, and all patients are females of Pima Indian heritage at least 21 years old25. The diagnostic binary variable is whether the patient shows signs of diabetes according to World Health Organization criteria (i.e., a 2-hour post-load plasma glucose of at least 200 mg/dl at any survey examination, or diabetes found during routine medical care)25. |

Algorithm 1:

Generalized gradient projection algorithm to solve I-spline Smoothing.

Step 1 (Computing the feasible search direction). Compute
$$\bar{d} = (d_1, d_2, \cdots, d_q)^T = \left\{ I - W^{-1}\bar{X}^T \left( \bar{X} W^{-1} \bar{X}^T \right)^{-1} \bar{X} \right\} W^{-1} \partial F(\alpha).$$
Step 2 (Forcing the updated $\alpha$ to fulfill the constraints). Compute
$$\gamma = \min\left\{ \min_{i: d_i < 0} \left\{ -\frac{\alpha_i}{d_i} \right\}, \; \frac{1 - \sum_{i=1}^{q} \alpha_i}{\sum_{i=1}^{q} d_i} \right\}, \quad \text{if } \sum_{i=1}^{q} d_i > 0,$$
$$\gamma = \min_{i: d_i < 0} \left\{ -\frac{\alpha_i}{d_i} \right\}, \quad \text{otherwise}.$$
This guarantees that $\alpha_i + \gamma d_i \ge 0$ for $i = 1, 2, \cdots, q$, and $\sum_{i=1}^{q} (\alpha_i + \gamma d_i) \le 1$.
Step 3 (Updating the solution by step-halving line search1). Find the smallest integer $\omega$, starting from 0, such that
$$F\left(\alpha + (1/2)^{\omega} \gamma \bar{d}\right) \ge F(\alpha).$$
Replace $\alpha$ by $\hat{\alpha} = \alpha + \min\{(1/2)^{\omega} \gamma, 0.5\} \cdot \bar{d}$.
Step 4 (Updating $\Lambda$, $\bar{X}$). If $\omega = 0$ and $\gamma \le 0.5$, modify $\Lambda$ by adding the indexes of I-spline coefficients that have newly become 0, or adding $(q+1)$ when $\sum_{i=1}^{q} \alpha_i$ becomes 1, and modify $\bar{X}$ accordingly.
Step 5 (Checking the stopping criterion). If $\|\bar{d}\| \ge \varepsilon$, for small $\varepsilon$, go to Step 1; otherwise compute $\lambda = \left( \bar{X} W^{-1} \bar{X}^T \right)^{-1} \bar{X} W^{-1} \partial F(\alpha)$.
  1. If the $j$-th component $\lambda_j \ge 0$ for all $j$, set $\hat{\alpha} = \alpha$ and stop.

  2. If there is at least one $j$ such that $\lambda_j < 0$, let $j^* = \arg\min_{j: \lambda_j < 0} \{\lambda_j\}$, then remove the $j^*$-th component from $\Lambda$ and the $j^*$-th row from $\bar{X}$, and go to Step 1.
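For readers who want to experiment without implementing Algorithm 1, problem (4) is a convex quadratic program and can also be handed to an off-the-shelf constrained optimizer. The sketch below uses SciPy's SLSQP solver instead of the paper's generalized gradient projection algorithm; the function name and starting point are our assumptions, and it relies on the ispline_basis sketch above.

```python
import numpy as np
from scipy.optimize import minimize

def fit_ispline_calibration(p, c, interior_knots, l=3):
    """Fit the I-spline coefficients of (4) with a generic constrained
    optimizer (SLSQP); both solvers target the same convex problem."""
    B = ispline_basis(p, interior_knots, l=l)   # B[k, i] = I_i^l(p_k)
    q = B.shape[1]
    sse = lambda a: float(np.sum((c - B @ a) ** 2))   # minimize SSE
    jac = lambda a: -2.0 * (B.T @ (c - B @ a))
    res = minimize(sse, x0=np.full(q, 1.0 / (q + 1)), jac=jac,
                   method="SLSQP",
                   bounds=[(0.0, None)] * q,          # alpha_i >= 0
                   constraints=[{"type": "ineq",
                                 "fun": lambda a: 1.0 - np.sum(a)}])
    return res.x  # calibrated scores for new p: ispline_basis(p, ...) @ res.x
```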

Results

Following the aforementioned ratios, we randomly divided each data set into training and test sets 100 times to evaluate the performance of the methods. The first row of Figure 2 shows box plots of the AUCs, where each color represents a method and each subplot corresponds to one data set, as denoted in the caption. The second row of Figure 2 shows the rate of ‘passing’ the HL-test at the 0.05 significance level for each method on all three data sets.

Figure 1: Illustration of the calibration functions of five different approaches: Binning, Platt Scaling (PS), Isotonic Regression (IR), Smooth Isotonic Regression (SIR), and our proposed method, I-spline Smoothing (IS).

Figure 2: AUCs and HL-test passing rates of all five methods under comparison, on the three data sets.

The figure visually demonstrates that all five models were comparable in terms of AUC, while I-spline Smoothing stood out in terms of calibration performance. Table 3 lists the actual values behind these comparisons. For the GSE2034 data, I-spline Smoothing ranked second in calibration, with an HL-test passing rate of 43%, compared to LR (28%), PS (44%), IR (17%), and SIR (16%). The AUCs of I-spline Smoothing were not significantly smaller than those of any of the other methods. Note that we used a one-tailed paired t-test to compare AUCs.

Table 3:

Performance of the different models on each data set.

| (AUC±std) / (HL-test pass rate) | Logistic Regression (LR) | Platt Scaling (PS) | Isotonic Regression (IR) | Smooth Isotonic Regression (SIR) | I-spline Smoothing (IS) |
|---|---|---|---|---|---|
| GSE2034 | (0.81±0.04) / (0.28) | (0.81±0.04) / (0.44) | (0.80±0.05) / (0.17) | (0.80±0.05) / (0.16) | (0.81±0.04) / (0.43) |
| Edin (MI) | (0.89±0.02) / (0.14) | (0.89±0.02) / (0.09) | (0.89±0.02) / (0.31) | (0.89±0.02) / (0.30) | (0.89±0.02) / (0.29) |
| PIMATR | (0.82±0.05) / (0.57) | (0.82±0.05) / (0.66) | (0.80±0.05) / (0.37) | (0.81±0.05) / (0.32) | (0.82±0.05) / (0.73) |

The results on the Edin (MI) data showed a similar pattern for calibration. I-spline Smoothing had the third-highest rate of passing the HL-test (29%), close to SIR (30%) and IR (31%), and better than LR (14%) and PS (9%). Regarding discrimination, the AUCs of I-spline Smoothing were not significantly smaller than those of any other model. Finally, in the experiments using the PIMATR data, I-spline Smoothing outperformed all the other methods in calibration, with an HL-test passing rate of 73%, followed by PS (66%), LR (57%), IR (37%), and SIR (32%). The AUCs of I-spline Smoothing were not smaller than those of any of the other models.

Discussion and Conclusion

In this paper, we introduced a novel method called I-spline Smoothing (IS) for calibrating predictive models as an alternative to existing approaches. The advantages of IS lie in the following aspects: (1) IS is a non-parametric monotonic transformation, which provides more flexibility in calibrating predictive models than parametric approaches such as Platt Scaling; (2) IS is globally optimized, as opposed to Smooth Isotonic Regression, which is a heuristic approach; (3) IS is easy to implement, compared to Monotone Spline Smoothing. The results on three real-world data sets empirically showed the advantages of IS in both discrimination and calibration: IS demonstrated superior calibration without significant deterioration of discrimination. Although these experiments were conducted at a small scale, they suggest that further research on IS is warranted.

A limitation of this technique is that the number of interior knots must be set heuristically. Even though our choice worked well in the experiments, a theoretical result describing a systematic way to choose the number of interior knots is needed.

Table 1:

Summary of popular calibration approaches.

| Method | Monotonic | Non-parametric | Non-exponential complexity | Continuous |
|---|---|---|---|---|
| Binning10 | | x | x | |
| Platt scaling11 | x | | x | x |
| Isotonic Regression12 | x | x | x | |
| Smooth Isotonic Regression13 | x | x | x | x |
| Monotone Spline Smoothing14 | x | x | | x |
| I-spline Smoothing | x | x | x | x |

Acknowledgments

The authors were funded in part by the National Library of Medicine (R01LM009520) and NHLBI (U54 HL10846). We thank Dr. Fraser and Dr. El-Kareh for making the data sets available for this study.

References

  1. Jamshidian M. On algorithms for restricted maximum likelihood estimation. Computational Statistics and Data Analysis. 2004;45:137–57.
  2. Kamath PS, Kim W. The model for end-stage liver disease (MELD). Hepatology. 2007;45(3):797–805. doi: 10.1002/hep.21563.
  3. Racowsky C, Ohno-Machado L, Kim J, Biggers JD. Is there an advantage in scoring early embryos on more than one day? Human Reproduction. 2009;24(9):2104. doi: 10.1093/humrep/dep198.
  4. Altman RB, Miller KS. 2010 Translational bioinformatics year in review. Journal of the American Medical Informatics Association. 2011;18(4):358–66. doi: 10.1136/amiajnl-2011-000328.
  5. Sarkar IN, Butte AJ, Lussier YA, Tarczy-Hornoch P, Ohno-Machado L. Translational bioinformatics: linking knowledge across biological and clinical realms. Journal of the American Medical Informatics Association. 2011;18(4):354–57. doi: 10.1136/amiajnl-2011-000245.
  6. Jiang X, Osl M, Kim J, Ohno-Machado L. Calibrating predictive model estimates to support personalized medicine. Journal of the American Medical Informatics Association. 2011 (Epub ahead of print). doi: 10.1136/amiajnl-2011-000291.
  7. Osl M, Ohno-Machado L, Baumgartner C, Tilg B, Dreiseitl S. Improving calibration of logistic regression models by local estimates. AMIA Annual Symposium Proceedings. 2008:535–39.
  8. Wei W, Visweswaran S, Cooper GF. The application of naive Bayes model averaging to predict Alzheimer's disease from genome-wide data. Journal of the American Medical Informatics Association. 2011;18(4):370–75. doi: 10.1136/amiajnl-2011-000101.
  9. Zadrozny B, Elkan C. Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers. Proceedings of the Eighteenth International Conference on Machine Learning. 2001:609–16.
  10. Zadrozny B, Elkan C. Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers. Proceedings of the Eighteenth International Conference on Machine Learning. 2001:609–16.
  11. Platt JC. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in Large Margin Classifiers. 1999:61–74.
  12. Zadrozny B, Elkan C. Transforming classifier scores into accurate multiclass probability estimates. Proceedings of the Eighth International Conference on Knowledge Discovery and Data Mining. 2002:694–99.
  13. Jiang X, Osl M, Kim J, Ohno-Machado L. Smooth Isotonic Regression: a new method to calibrate predictive models. AMIA Summit on Clinical Research Informatics (CRI'11). 2011.
  14. Wang X, Li F. Isotonic smoothing spline regression. Journal of Computational and Graphical Statistics. 2008;17(1):21–37.
  15. Ramsay JO. Monotone regression splines in action. Statistical Science. 1988;3:425–41.
  16. Lu M, Zhang Y, Huang J. Estimation of the mean function with panel count data using monotone polynomial splines. Biometrika. 2007;94:705–18.
  17. Wu Y. The partially monotone tensor spline estimation of joint distribution function with bivariate current status data [dissertation]. Iowa City: University of Iowa, Department of Mathematics; 2010.
  18. Schumaker L. Spline Functions: Basic Theory. New York: Wiley; 1981.
  19. De Boor C. A Practical Guide to Splines. New York: Springer-Verlag; 2001.
  20. Hanley J, McNeil B. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology. 1982;143:29–36. doi: 10.1148/radiology.143.1.7063747.
  21. Hosmer DW, Hosmer T, Le Cessie S, Lemeshow S. A comparison of goodness-of-fit tests for the logistic regression model. Statistics in Medicine. 1997;16(9):965–80.
  22. Wang Y, Klijn J, Zhang Y, et al. Gene-expression profiles to predict distant metastasis of lymph-node-negative primary breast cancer. Lancet. 2005;365(9460):671–79. doi: 10.1016/S0140-6736(05)17947-1.
  23. Osl M, Dreiseitl S, Kim J, Patel K, Baumgartner C, Ohno-Machado L. Effect of data combination on predictive modeling: a study using gene expression data. AMIA Annual Symposium Proceedings. 2010:567–71.
  24. Kennedy RL, Burton AM, Fraser HS, McStay LN, Harrison RF. Early diagnosis of acute myocardial infarction using clinical and electrocardiographic data at presentation: derivation and evaluation of logistic regression models. European Heart Journal. 1996;17(8):1181–91. doi: 10.1093/oxfordjournals.eurheartj.a015035.
  25. Asuncion A, Newman DJ. UCI Machine Learning Repository. 2007.
