Abstract
Medical cost data are often skewed to the right and heteroscedastic, having a nonlinear relation with covariates. To tackle these issues, we consider an extension to generalized linear models by assuming nonlinear associations of covariates in the mean function and allowing the variance to be an unknown but smooth function of the mean. We make no further assumption on the distributional form. The unknown functions are described by penalized splines, and the estimation is carried out using nonparametric quasi-likelihood. Simulation studies show the flexibility and advantages of our approach. We apply the model to the annual medical costs of heart failure patients in the clinical data repository (CDR) at the University of Virginia Hospital System.
Keywords: Generalized linear model, semiparametric regression, health econometrics, smoothing parameter, generalized cross validation
1. Introduction
Medical costs in the U.S. have been rising at a faster pace than general inflation for decades. A recent report from the National Institute for Health Care Management [1] shows that health care costs per person reached 8,100 U.S. dollars, for a total of 2.5 trillion U.S. dollars, which accounted for 17.6% of the gross domestic product in 2009. The rising health care costs are the main obstacle to securing medical coverage for all citizens in the United States [2]. Containing skyrocketing health care costs is one of the central targets of President Obama’s health care reform plan.
The heightened interest in health care cost containment strategies increases the importance of statistical analysis of medical cost data. Medical cost data are routinely collected in billing records of hospitals and claims of health insurance plans (e.g., Medicare, Medicaid, or commercial insurance). The wide availability of such data has motivated the development and application of state-of-the-art statistical and econometric methods in the past.
Although other summary measures of medical costs have been considered, e.g., the median cost [3], the mean cost is used overwhelmingly in health economics studies to measure the financial burden of a disease, because it determines the total cost. We therefore focus on the mean medical cost in this paper.
Many previous studies of health care costs indicate that the estimate of the mean response may be quite sensitive to heteroscedasticity and severe skewness. These data patterns are clearly demonstrated in Figure 1, which shows the histogram of annual medical costs for heart failure patients in the clinical data repository (CDR) at the University of Virginia (UVa) Health System. To overcome these problems, researchers historically relied on a logarithmic or other transformation of the response Y, followed by regressing the transformed Y on covariates X. The main drawback of transforming Y (e.g., log Y) is that the regression no longer yields a model for the mean response in the original scale, which is the scale of interest for most applications. A retransformation is needed in order to draw inference about the mean response in the natural scale of Y [4]. However, the retransformation is complicated in the presence of heteroscedasticity [5]. In practical applications, any retransformation can potentially yield biased estimators unless considerable effort is devoted to studying the appropriate form of heteroscedasticity. Another drawback of transforming the response is that it can mislead hypothesis testing. For example, the t-test of equality of the means of log-transformed costs tests the wrong hypothesis unless the variances in the log scale are equal (Zhou et al. [6]).
To avoid the aforementioned issues associated with retransformation, researchers have focused on the use of generalized linear models (GLMs), e.g., [5, 7, 8, 9, 10]. Instead of estimating E(log Y), the GLMs model log E(Y), so the mean medical cost can be obtained directly. These models simultaneously describe the link and variance structure by pre-specified functions, e.g., a log link with a gamma error distribution. The link function provides the relationship between the linear predictor and the mean of the distribution function. However, the assumption of linear covariate associations in the link function is sometimes violated. This has led to the development and application of various generalized nonparametric models [11].
In the GLMs framework, consistency of the estimated regression coefficients depends on a correctly specified link function regardless of variance function. However, a mis-specified variance function could lead to a substantial loss of efficiency for parameter estimates. Chiou and Muller [12] considered the variance function as an unspecified function of the mean by using local polynomial or kernel smoothing in their estimation. Basu and Rathouz [13] proposed an extension to the estimating equations in the GLMs, with variance structure as a flexible parametric function of the mean.
In this paper, we propose a model with nonparametric components to relax the assumption of linear covariate associations in the mean function, and characterize the variance as an unspecified function of the mean. The unknown functions are assumed to be smooth and estimated using penalized splines (P-splines). The estimates are obtained by maximizing the penalized nonparametric quasi-likelihood (PNQL). We call it PNQL instead of penalized quasi-likelihood (PQL) because the true variance function in the usual definition of quasi-likelihood [7] is replaced by a smooth, unspecified variance function.
The rest of the paper is organized as follows. The model is introduced in Section 2. Section 3 presents the estimation method and inference. In Section 4, we show the results of simulation studies. In Section 5 we apply our method to analyze the annual medical cost data of heart failure patients in the CDR database. Summary of our conclusions and implications of this work are presented in the final section.
2. Model
Let Yi denote the medical cost for subject i, with E(Yi) = μi. We propose a generalized semiparametric model with an unknown variance function (GSUV) as follows:

g(μi) = xiTβ + Σj=1m fj(zji),   (1)

Var(Yi) = σ²(μi),   (2)

where xi denotes a p × 1 covariate vector with linear regression coefficients β, and zi = (z1i, …, zmi)T is an m × 1 covariate vector whose components are related to the mean through unspecified functions fj(·), j = 1, …, m. In model (1), the link g(·) is a known monotone and differentiable function, e.g., the log link for medical costs. This is in contrast to the "unknown link" models, where the link function g(·) either is unspecified (nonparametric), e.g., Chiou and Muller [12] and Zhou et al. [14], or involves some unknown parameters to be estimated, e.g., the Box-Cox transformation form of Basu and Rathouz [13]. In model (2), the variance is modeled as an unknown but smooth function σ²(·) of μ. For medical cost data and many other situations, the functional form of the error variance is of secondary importance: we are mostly interested in the mean structure. Note that in this model we make no further assumption on the specific distribution (e.g., gamma) of Y.
Our model is especially appealing when there exist different functional forms for different covariates on the impact of the mean medical cost, whereas in the “unknown link” models, all covariates are restricted to having a linear relation with g(μ). Therefore, the GSUV provides flexibility in modeling the nonlinear association between response and continuous covariates (e.g., age) by introducing the nonparametric components fj(·). Also, as we use a known log link for the mean medical cost, the interpretation of covariate associations in model (1) is straightforward. For example, we can interpret the covariate associations of xi (e.g., treatment) as the percentage change (e.g., (eβ − 1) × 100%) in the mean cost for the linear parameter β. In contrast, the interpretation of covariate associations in the “unknown link” models could be complicated since it depends on the unknown link function, which has to be estimated from data. For example, Basu and Rathouz [13] relied on marginal and incremental effects to study the impact of covariate on the mean cost, which involves additional complicated derivation.
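The percentage-change interpretation under the log link can be checked with a few lines of code (a minimal illustration; the coefficient values below are hypothetical, not estimates from the paper):

```python
import math

# Under the log link, a linear coefficient beta corresponds to a
# (exp(beta) - 1) * 100% change in the mean cost.
def percent_change(beta):
    return (math.exp(beta) - 1.0) * 100.0

# Hypothetical coefficient values:
print(round(percent_change(0.10), 1))    # -> 10.5 (about 10.5% higher mean cost)
print(round(percent_change(-0.25), 1))   # -> -22.1 (about 22.1% lower mean cost)
```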
In addition, since we make no assumptions on the specific form of the variance structure, the GSUV is more robust in fitting data with heteroscedasticity. This adaptability is especially useful in health cost studies. An alternative approach is to model the dispersion as a semiparametric function of multiple covariates, e.g., [15, 16, 17, 18, 19]. These models are more appealing when one is interested in the associations between covariates and the variance structure. However, while there could exist multiple smooth functions for nonlinear covariate associations in the dispersion function in their approach, our model is simpler in that there is only a single smooth function (of μ) in model (2). Thus, our model avoids the variable selection and model averaging needed for the variance function in their framework (e.g., Yau and Kohn [15]).
Remark
In our model, the focus is the mean structure (1). The variance function is assumed to be a function of the mean. In the health economics literature, Park's test [20] is commonly used in medical cost studies, i.e., regressing the log of the squared residuals (on the scale of the analysis) on the log of the fitted values in the raw (untransformed) scale of y, to estimate and test a very specific form of heteroscedasticity. Our formulation is thus an extension of Manning and Mullahy [9] and Basu and Rathouz [13].
3. Methods
3.1. Penalized Nonparametric Quasi-likelihood
Suppose that the components of the response vector y are independent with mean vector μ and variance vector σ²V(μ), where σ² may be unknown and V(·) is a known function. The quasi-likelihood [7, 21] is

Q(μ; y) = Σi ∫yiμi (yi − t)/{σ²V(t)} dt.

For our GSUV, we assume an unknown variance function instead of σ²V(μ) in Q(μ; y). Since the variance function is unknown, we replace it with a nonparametrically estimated variance function σ̂²(·). Following Chiou and Muller [12], we propose the nonparametric quasi-likelihood

Q̂(μ; y) = Σi ∫yiμi (yi − t)/σ̂²(t) dt.
The unknown univariate function fj(·) in our GSUV can be estimated by a P-spline [22]. Assume that

fj(z) = αj1 z + ⋯ + αjq zq + Σk=1Kj αj,q+k (z − κjk)+q,

where κj1 < ⋯ < κjKj are spline knots and (z − κ)+q = {max(0, z − κ)}q. The basis {z, …, zq, (z − κj1)+q, …, (z − κjKj)+q} is known as the truncated power basis of degree q. We choose q = 2 in this paper. Define the spline coefficient vector αj = (αj1, …, αj,q+Kj)T. Then the mean function of our GSUV is

g(μ) = Xθ,

where X = (x, B1, …, Bm), Bj is the matrix of truncated power basis functions evaluated at zj, and θ = (βT, α1T, …, αmT)T. We do not include an intercept in the P-spline representation of the nonparametric function to avoid the problem of unidentifiability.

Given σ̂²(·), the estimate of θ can be obtained by maximizing the following penalized nonparametric quasi-likelihood

Qp(θ) = Q̂(μ; y) − (1/2) θTΣθ,

where Σ = diag(0p×p, λ1K1, …, λmKm). In other words, we have θTΣθ = Σj λj αjT Kj αj, where Kj is the jth (positive semidefinite) penalty matrix with unknown smoothing parameter λj. In this article, we use Kj = diag(0q×q, IKj), which penalizes only the truncated power coefficients.
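For concreteness, a basis matrix Bj of this form and the corresponding penalty block can be constructed as follows (a sketch in Python with NumPy; the grid of z values and the knot positions are illustrative, not from the paper's data):

```python
import numpy as np

def truncated_power_basis(z, knots, q=2):
    """Truncated power basis of degree q, without an intercept column:
    z, ..., z^q, (z - k_1)_+^q, ..., (z - k_K)_+^q."""
    z = np.asarray(z, dtype=float)
    poly = np.column_stack([z ** d for d in range(1, q + 1)])
    trunc = np.column_stack([np.maximum(z - k, 0.0) ** q for k in knots])
    return np.hstack([poly, trunc])

z = np.linspace(0.0, 1.0, 200)
knots = np.quantile(z, np.linspace(0.1, 0.9, 9))   # knots at sample quantiles
B = truncated_power_basis(z, knots, q=2)

# Penalty matrix K_j: a zero block for the q polynomial terms, identity for
# the truncated-power coefficients (a common P-spline choice).
K = np.diag([0.0] * 2 + [1.0] * len(knots))
print(B.shape, K.shape)   # (200, 11) (11, 11): q + 9 knot columns
```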
The score function for the penalized nonparametric quasi-likelihood is

Σi DiT Vi−1 (yi − μi) − Σθ = 0,   (3)

where Vi = σ̂²(μi), μi = g−1(Xiθ), and Di = ∂μi/∂θ.
In our GSUV, we do not assume any specific form of the variance function σ²(·); i.e., σ²(·) is estimated from the data. Define ri = yi − μi; then E(ri²) = σ²(μi) based on model (2). Consequently, we can form a variance model as:

ri² = σ²(μi) + εi,   (4)

where εi is an error term with E[εi] = 0. Note that we plug in the estimated values r̂i = yi − μ̂i and μ̂i, respectively, in (4) to estimate the unknown functional form of σ²(·).
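A simple numerical sketch of model (4): smooth the squared residuals against the fitted means to recover the variance function. Here a crude moving-average smoother stands in for the P-spline of the actual method, and the data are simulated with true variance V(Y) = μ:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical fitted means; in practice mu_hat comes from the current
# fit of the mean model, and y - mu_hat are its residuals.
mu_hat = rng.uniform(1.0, 5.0, 500)
y = mu_hat + rng.normal(0.0, np.sqrt(mu_hat))   # data with true V(Y) = mu
r2 = (y - mu_hat) ** 2                          # squared residuals

# Crude stand-in for the Step-1 smoother of model (4): a moving average
# of r^2 over neighboring values of mu_hat.
order = np.argsort(mu_hat)
window = 101
sigma2_hat = np.convolve(r2[order], np.ones(window) / window, mode="same")

# Away from the boundaries, the smoothed curve should track V(Y) = mu.
mid = slice(window, len(mu_hat) - window)
mu_mid = mu_hat[order][mid]
rel_err = np.mean(np.abs(sigma2_hat[mid] - mu_mid) / mu_mid)
print(rel_err < 0.5)   # the recovered variance tracks the truth
```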
3.2. Estimation Procedure
To obtain the estimates of β, fj(·), and σ²(·), we develop the following iterative estimation procedure:

- Step 0. Initialize β̂ and f̂j(·), which are estimated by quasi-likelihood using a known working variance, e.g., V(μ) = μ.

- Step 1. Estimate σ̂²(·) by minimizing the penalized least squares

Σi {r̂i² − σ²(μ̂i)}² + Jλ(σ²),

where r̂i = yi − μ̂i, and Jλ(·) is a penalty function with smoothing parameter λ. Note that here we have only one smooth function, which can be estimated by conventional nonparametric methods, e.g., a P-spline with quadratic penalty αTKα as in Wood [23], where α is a vector of spline coefficients and K is similar to the Kj, one of the penalty matrices described above.

- Step 2. Given σ̂²(·), estimate θ̂ by solving the penalized nonparametric quasi-score function (3).

Iterating between Steps 1 and 2 until convergence yields the estimates. A similar algorithm was adopted by Chiou and Muller [12] for a quasi-likelihood regression with link and variance functions both unknown. We note that Step 0 provides consistent initial estimates of β and fj(·) when the link function is correct. As a result, we have not encountered convergence problems in our simulation studies and application.
Estimates in Step 1 can be obtained using existing software, such as the SAS MIXED procedure [22]; we use the gam() function with P-splines in the mgcv package of R [23]. To obtain the estimates in Step 2, we adopt the Newton-Raphson method with Fisher scoring. Given the estimate θ(l) at the lth iterate, the (l + 1)th estimate is updated by:

θ(l+1) = θ(l) + (Σi DiT Vi−1 Di + Σ)−1 {Σi DiT Vi−1 (yi − μi) − Σθ(l)},

with the right-hand side evaluated at θ(l).
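The Step-2 update can be sketched for a log link as follows. This is an illustrative Python implementation; the check at the end uses hypothetical unpenalized Poisson data (Σ = 0, V(μ) = μ), not the paper's simulation design:

```python
import numpy as np

def fisher_scoring(y, X, Vfun, Sigma, n_iter=50, tol=1e-8):
    """Penalized Fisher scoring for a log link, mu_i = exp(x_i' theta).
    Vfun maps the fitted means to the (estimated) variances sigma^2(mu)."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = np.exp(X @ theta)
        V = Vfun(mu)
        D = X * mu[:, None]                    # rows D_i = dmu_i/dtheta = mu_i x_i
        score = D.T @ ((y - mu) / V) - Sigma @ theta
        info = D.T @ (D / V[:, None]) + Sigma
        step = np.linalg.solve(info, score)
        theta = theta + step
        if np.max(np.abs(step)) < tol:
            break
    return theta

# Hypothetical check: unpenalized Poisson regression recovers (0.5, 1.0).
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(2000), rng.normal(size=2000)])
mu = np.exp(X @ np.array([0.5, 1.0]))
y = rng.poisson(mu).astype(float)
theta_hat = fisher_scoring(y, X, Vfun=lambda m: m, Sigma=np.zeros((2, 2)))
print(theta_hat)   # close to (0.5, 1.0)
```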
3.3. Smoothing Parameter and Knots Selections
Selecting appropriate smoothing parameters plays an important role in nonparametric smoothing. When using penalized splines, smoothing parameters control the trade-off between goodness of fit to the data and smoothness of the fitted splines. To implement the estimation procedure introduced in Section 3.2, we need to select two sets of smoothing parameters: the smoothing parameter for the variance function σ²(·), and the smoothing parameters for the regression functions fj(t). We use the gam() function with P-splines in the mgcv package of R to estimate the variance function. Generalized cross validation (GCV) is used for smoothing parameter selection in gam(), providing efficient automatic smoothing parameter selection [24]. However, GCV cannot be applied directly to the PNQL. Consequently, we concentrate on smoothing parameter selection for the regression functions fj(t) in model (1).
In Step 2, we use the approximate generalized cross-validation (AGCV) to select the optimal smoothing parameters for fj(t). We call it AGCV rather than GCV because it is derived based on the nonparametric quasi-likelihood. We define the AGCV score as:

AGCV(λ) = {n−1 Σi (yi − μ̂i)²/V̂i} / {1 − tr(A)/n}²,

where V̂i = σ̂²(μ̂i), and A is the influence (hat) matrix defined in the Appendix. See the Appendix for a detailed justification of the AGCV. The optimal smoothing parameter is found by minimizing the AGCV score over a grid of values of λj. Yu and Ruppert [25] used a 30-point grid where the values of log10(λj) are equally spaced between −6 and 7. Based on our simulated and application data, we find that a 20-point grid from −6 to 3 is a reasonable choice.
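The grid search can be sketched on a simple penalized least squares smoother, which stands in for the full PNQL fit; the GCV-style score below uses the same tr(A) correction, and the data are simulated for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

# A stand-in smoothing problem: penalized least squares with a quadratic
# truncated power basis, choosing lambda by a GCV-style score.
n = 300
z = np.sort(rng.uniform(0.0, 1.0, n))
y = np.sin(2 * np.pi * z) + rng.normal(0.0, 0.3, n)

knots = np.quantile(z, np.linspace(0.1, 0.9, 9))
B = np.column_stack([np.ones(n), z, z ** 2] +
                    [np.maximum(z - k, 0.0) ** 2 for k in knots])
pen = np.diag([0.0] * 3 + [1.0] * len(knots))    # penalize only jump coefficients

def gcv_score(lam):
    A = B @ np.linalg.solve(B.T @ B + lam * pen, B.T)   # influence (hat) matrix
    resid = y - A @ y
    return np.mean(resid ** 2) / (1.0 - np.trace(A) / n) ** 2

grid = 10.0 ** np.linspace(-6, 3, 20)            # 20-point grid, as in the text
best_lam = min(grid, key=gcv_score)
fit = B @ np.linalg.solve(B.T @ B + best_lam * pen, B.T @ y)
print(np.mean((fit - np.sin(2 * np.pi * z)) ** 2) < 0.05)   # good recovery
```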
Of note, the smoothing parameter selection is done separately in Step 1 (for the variance function) and Step 2 (for the mean function). This is also the common practice for this type of model, e.g., Chiou and Muller [12], Rigby and Stasinopoulos [16], and Gijbels et al. [19]. A simultaneous smoothing parameter selection procedure for both steps merits further consideration.
It is recommended [25] that the knots be placed at equally-spaced sample quantiles of the predictor variable. For example, if there are 9 knots, they could be placed at the 10th percentile, 20th percentile, … of the values of the predictor variable. Ruppert [26] has a detailed study on the choice of the total number of knots. For smooth and either monotonic or unimodal regression functions, 10 to 20 knots are often adequate. If the regression function is less smooth somewhere, then it is important to place a knot near that region.
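For example, placing 9 knots at the deciles of a hypothetical age predictor:

```python
import numpy as np

rng = np.random.default_rng(3)

# Place 9 knots at the deciles (10th, 20th, ..., 90th percentiles)
# of a hypothetical age variable.
age = rng.uniform(60.0, 90.0, 1370)
knots = np.quantile(age, np.arange(0.1, 1.0, 0.1))
print(len(knots))   # 9 knots, in increasing order
```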
3.4. Inference
Following McCullagh and Nelder [7] and incorporating the penalty matrix [27], we estimate the covariance matrix of θ̂ as follows:

Ĉov(θ̂) = (Σi DiT Vi−1 Di + Σ)−1 {Σi DiT Vi−2 (yi − μ̂i)² Di} (Σi DiT Vi−1 Di + Σ)−1.   (5)

This sandwich estimate can be used for joint confidence regions and for hypothesis testing, such as the Wald test. For example, if a test of the null hypothesis H0: Rθ0 − q0 = 0 is desired, where R is a d1 × dim(θ) matrix of full rank d1 ≤ dim(θ), then the test can be based on the Wald statistic

W = (Rθ̂ − q0)T {R Ĉov(θ̂) RT}−1 (Rθ̂ − q0),

which has a chi-squared limiting distribution with d1 degrees of freedom.
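A minimal sketch of the Wald test follows. The coefficient values and covariance matrix are hypothetical; with d1 = 2 the chi-square survival function has the closed form exp(−W/2):

```python
import math
import numpy as np

# Wald test of H0: R theta = q0 using the sandwich covariance from (5).
def wald_stat(theta_hat, cov_hat, R, q0):
    diff = R @ theta_hat - q0
    return float(diff @ np.linalg.solve(R @ cov_hat @ R.T, diff))

# Hypothetical estimates: test that the first two coefficients are zero (d1 = 2).
theta_hat = np.array([0.10, -0.05, 1.30])
cov_hat = np.diag([0.04, 0.04, 0.01])
R = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])
W = wald_stat(theta_hat, cov_hat, R, np.zeros(2))
p_value = math.exp(-W / 2.0)    # chi-square sf is exp(-W/2) when d1 = 2
print(W, p_value)
```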
The estimated smoothing function is f̂j(z) = Bj(z)T α̂j, where Bj(z) is the vector of truncated power basis functions evaluated at z. The variance of f̂j(z) can be estimated via Bj(z)T Ĉov(α̂j) Bj(z), where Ĉov(α̂j) is the corresponding (q + Kj) × (q + Kj) block matrix of Ĉov(θ̂).
4. Simulation
Simulation studies are conducted to assess the performance of our estimation method. Two settings are generated from gamma and overdispersed Poisson distributions, respectively. In both settings, we generate 1000 datasets, each with sample size n = 200. For illustrative purposes, we only consider one nonlinear function in the simulations.
Example 1
The simulated data are analogous to medical cost data with a continuous response variable. The response Y is generated from a gamma distribution with the density function

f(y) = yα−1 e−y/γ / {Γ(α) γα},   y > 0,

where α = μ3 is the shape parameter and γ = 1/μ2 is the scale parameter, so that E(Y) = αγ = μ and V(Y) = αγ² = 1/μ. Note this variance is of the power variance form in Basu and Rathouz [13]. We assume a log link with log(μ) = βx + f(z), where β = 1. The continuous covariate Z is uniformly distributed over [0, 1], and f(z) = −0.5 + sin(2πz). The linear covariate X follows the N(0, 1) distribution.
Example 2
We simulate data with an overdispersed Poisson response variable to show the broad applicability of our method. The overdispersed Poisson data are generated via a Gamma-Poisson mixture. The log link is used: log(μ) = βx + f(z), where f(z) = 5 + {sin(2πz)}/2 and β = 1. Covariate Z is uniformly distributed over [0, 1], and X follows the N(0, 1) distribution. The underlying variance function is V(Y) = μ(1 + 6μ). Note this variance is of the quadratic variance form in Basu and Rathouz [13].
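One way to obtain V(Y) = μ(1 + 6μ) is a Gamma-Poisson mixture with a mean-one gamma frailty of variance 6. The mixing parameterization below is our reconstruction for illustration, not stated in the text:

```python
import numpy as np

rng = np.random.default_rng(5)

def overdispersed_poisson(mu, size, rng):
    """Gamma-Poisson mixture: Y | G ~ Poisson(mu * G), G ~ Gamma(1/6, scale=6).
    Then E(G) = 1, Var(G) = 6, so E(Y) = mu and V(Y) = mu + 6*mu^2 = mu(1 + 6*mu)."""
    g = rng.gamma(shape=1.0 / 6.0, scale=6.0, size=size)
    return rng.poisson(mu * g)

mu0 = 3.0
y = overdispersed_poisson(mu0, 200_000, rng)
print(round(y.mean(), 1), y.var() > y.mean())   # mean ~3, heavy overdispersion
```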
We fit each dataset using three methods: (1) our PNQL method assuming an unknown variance function; (2) the PQL method assuming the correct parametric variance structure (termed “PQL-cv”); and (3) the PQL method with a misspecified parametric variance structure (termed “PQL-icv”): we assume V (Y) = μ for both gamma and overdispersed Poisson data. The PQL method with correct variance function is the best method (the gold standard) and serves as a criterion for assessing the performance of other methods.
The results of the estimates of β are shown in Table 1. For both gamma and overdispersed Poisson data, the gold standard PQL-cv performs best. Our PNQL performs very similarly to the gold standard in terms of bias and standard deviation. However, the PQL-icv yields slightly larger biases and much higher variation. We also assess the performance of the asymptotic covariance formula (5). In Table 1, "SD" is the sampling standard deviation of the 1000 coefficient estimates, and "SE" is the average of the estimated standard errors calculated by formula (5). We notice that the standard error formula works well for both gamma and overdispersed Poisson data.
Table 1. Bias, sampling standard deviation (SD), average estimated standard error (SE), and coverage probability (CP) of the estimates of β under each assumed variance function V(Y).

Gamma data:

| Method | V(Y) | Bias | SD | SE | CP |
| --- | --- | --- | --- | --- | --- |
| PQL-cv | 1/μ | 0.009 | 0.173 | 0.166 | 94.9% |
| PNQL | σ²(μ) | 0.005 | 0.184 | 0.174 | 95.3% |
| PQL-icv | μ | 0.047 | 0.285 | 0.269 | 96.9% |

Overdispersed Poisson data:

| Method | V(Y) | Bias | SD | SE | CP |
| --- | --- | --- | --- | --- | --- |
| PQL-cv | μ(1 + 6μ) | −0.055 | 0.948 | 0.907 | 95.2% |
| PNQL | σ²(μ) | −0.062 | 0.981 | 0.948 | 95.9% |
| PQL-icv | μ | −0.094 | 1.265 | 1.246 | 97.1% |
Figure 2 shows point-wise biases, mean squared errors (MSE), and empirical coverage probabilities of the estimated curves for the gamma data. It can be seen that the estimated curve using the PQL-icv shows larger bias compared with the other two, although all biases are relatively small in magnitude. The average coverage probabilities are 95.3%, 95.2%, and 97.7% for PQL-cv, PNQL, and PQL-icv, respectively. The same plots are shown in Figure 3 for the overdispersed Poisson data, where we observe a similar pattern. The estimated curve using the PQL-icv shows larger bias compared with the other two, although still small in magnitude. The average coverage probabilities are 95.7%, 95.7%, and 97.2% for PQL-cv, PNQL, and PQL-icv, respectively. For both gamma and overdispersed Poisson data, the point-wise MSE and coverage probabilities obtained by our method are very close to, or even overlap with, the gold standard. However, the MSE of the PQL-icv is much larger than that of our method and the gold standard.
Based on these results, we conclude that a misspecified variance function in the PQL method does not substantially affect the consistency of the estimates, but results in a loss of efficiency in estimating both the linear coefficients and the smoothing functions. Our PNQL compares reasonably well with the PQL using the correct variance function. For a single dataset in our simulation studies, the implementation takes around 50 seconds for a 20-point grid search of the smoothing parameter (3.30 GHz CPU, 16 GB RAM), and the running time doubles for a 40-point grid search. We acknowledge that a drawback of our method is the increased computational burden of the smoothing parameter selection for multiple nonparametric functions, due to the grid search method introduced in Section 3.3.
We conducted additional simulation studies for other types of outcomes, e.g., binary data with a logistic link. The findings from these simulations are similar and thus omitted (available upon request).
Per reviewers’ suggestions, we compare our method to a standard method and a naive method. We use the gam() function in the mgcv package in R as the standard method (Wood [23]), adopting penalized splines that assume a parametric variance structure and a parametric distribution. The "family" option in the gam() function allows us to specify the parametric gamma and Poisson distributions (no overdispersed Poisson distribution is available). The variance functions are restricted to be μ2 and μ for the gamma and Poisson distributions, respectively, in the gam() function. We find that the mgcv estimates have relatively larger bias and standard deviation compared with the results of PQL-cv and PNQL. For example, the MSEs of the PQL-cv, PNQL, PQL-icv, and mgcv estimates for β are 0.030, 0.034, 0.083, and 0.138 for the gamma data, and 0.902, 0.966, 1.609, and 1.498 for the overdispersed Poisson data, respectively. The estimated nonlinear function from mgcv has relatively larger point-wise biases and mean squared errors, and poorer coverage probabilities (results available upon request). The results are consistent with our expectations. First, mgcv performs much worse than the PQL-cv and PNQL methods when it misspecifies the variance function. Second, between PQL-icv and mgcv, we find that for the overdispersed Poisson data, where PQL-icv and mgcv both misspecify the variance function as V(Y) = μ, the result for mgcv is slightly better than that of PQL-icv, since mgcv assumes a parametric distribution (Poisson) while PQL-icv makes no parametric assumption on the distributional form. For the gamma data, in contrast, mgcv assumes V(Y) = μ2, which is further from the truth (V(Y) = 1/μ) than the PQL-icv assumption (V(Y) = μ). As a result, the PQL-icv method yields a much better result than mgcv.
For the naive method, we compare our method to the GLM of Manning and Mullahy [9]. For Setting 1, we fit the data by their quasi-likelihood method, which uses Z, Z2, and Z3 (i.e., a cubic polynomial in Z) to approximate f(z). Such an approximation, however, shows poor performance (results available upon request), indicating the limitation of this method, especially when there exist complicated nonlinear functions.
5. Application
Heart failure is a chronic illness in which the heart fails to deliver oxygen-rich blood to the body's other cells. It is the only cardiac disease with rising prevalence, affecting over 5 million patients in America. Heart failure is also one of the most expensive health care problems in the U.S.; its economic burden in 2010 was 39.2 billion U.S. dollars [28].
We apply our method to the analysis of annual medical costs for heart failure patients. The data are extracted from the CDR database from UVa Health System. Our dataset includes 1370 patients aged from 60 to 90 years whose first treatment of heart failure at UVa Health System was in 2004 (ICD9 diagnosis code 428.xx). The data consist of medical care costs incurred within one year follow-up. The summary of this cohort is given in Table 2. Medical costs are highly skewed to the right and heteroscedastic (median 9,298 U.S. dollars; mean 22,287 U.S. dollars).
Table 2.

| Covariate | Mean (Percent) | SD |
| --- | --- | --- |
| Age (years) | 72.2 | 7.7 |
| Male | 54.2% | |
| White | 73.6% | |
| Inpatient | 37.7% | |
| Death | 5.9% | |
| Cost (U.S. dollar) | 22,287 | 37,630 |
GSUV with a log link function is used to fit the cost data. The response is annual costs in U.S. dollars. The covariate vector xi in the linear term includes four predictors: gender (1 for male, 0 for female), white (1 for white, 0 for non-white), whether the patient received inpatient service within the 12-month study period (1 for yes, 0 for no), and death during the study period (1 for yes, 0 for no). The continuous predictor in the nonlinear term zi is age.
The estimated coefficients for the linear predictors are shown in Table 3. The standard errors for testing the significance of the coefficients are computed using formula (5). The association between gender and medical costs is not significant (p = 0.858). The associations for "white", "inpatient", and "death" are significant. The mean annual medical cost for white patients is 20% lower than for other races (p = 0.005), indicating the possibility of racial disparity in heart failure. Such racial disparity is in agreement with a study by Jha et al. [29], showing that black women less often received appropriate preventive therapy and adequate risk factor control despite a greater coronary heart disease event risk. Consequently, they were at a more severe disease stage when treated in a hospital, resulting in higher costs. Being hospitalized is associated with a 3.9-fold increase in the annual medical cost (p < 0.0001), while patients who died within the one-year follow-up period incurred 45% higher medical costs than those who survived (p = 0.045).
Table 3. Estimated linear coefficients under GSUV with a nonparametric age effect (GSUV(N)) and a parametric quadratic age effect (GSUV(Q)).

| Covariate | GSUV(N) Estimate | SE | P-value | GSUV(Q) Estimate | SE | P-value |
| --- | --- | --- | --- | --- | --- | --- |
| Gender | 0.013 | 0.074 | 0.858 | 0.011 | 0.074 | 0.892 |
| White | −0.226 | 0.080 | 0.005 | −0.239 | 0.078 | 0.002 |
| Inpatient | 1.360 | 0.072 | < 0.001 | 1.356 | 0.071 | < 0.001 |
| Death | 0.371 | 0.185 | 0.045 | 0.383 | 0.170 | 0.025 |
| Age | — | — | — | 1.486 | 0.581 | 0.010 |
| Age2 | — | — | — | −1.756 | 0.655 | 0.007 |
The estimated curve for age is presented in the left panel of Figure 4. There is a nonlinear age effect: annual medical costs increase from age 60, reach a maximum around age 70, then decrease and level off after age 85. This finding agrees well with Liu [30], who found a quadratic association between age and the monthly medical cost of heart failure patients. An intuitive explanation is: (i) starting from age 60, medical costs increased since the physical conditions of the patients deteriorated with age; (ii) after age 70, older patients with heart diseases were often treated less aggressively, resulting in lower medical costs. For example, Gatsonis et al. [31] showed less frequent utilization of coronary angiography for elderly patients with an acute myocardial infarction. Stukel et al. [32, 33] showed that younger patients with heart diseases were more likely to receive invasive treatment and medical therapy. The right panel of Figure 4 shows the estimated variance curve. It can be seen that the variance increases with the mean, indicating clear evidence of heteroscedasticity. The variance has a nonlinear relation with the mean that might not be described easily by a parametric (e.g., polynomial) model. Our method is flexible enough to capture both the nonlinear association between response and predictors and the dependence of the variance on the mean.
After seeing the left panel of Figure 4, we also fit the medical cost data using GSUV assuming a parametric quadratic effect of age in the mean function, with an unknown variance function. The results are shown in the right panel of Table 3. Clearly there is a strong indication for a quadratic association for age. It can be seen that the estimates of the other linear covariate associations are similar to those using GSUV with an unspecified function of age. Our model can thus be used as an exploratory tool to fit medical cost data more flexibly.
Per the referee’s suggestions, we fit the data assuming the variance is a linear function of the mean, Var(Yi) = 2μi, as is roughly the case in the right panel of Figure 4. We find that the estimates of the linear coefficients have larger standard errors compared to Table 3, which may be an indication of an inappropriate variance structure; the association for race becomes nonsignificant (results available upon request). We also fit the data with additional interaction terms involving death, none of which is significant (results available upon request).
6. Discussion
Motivated by the study of medical costs for heart failure patients, we have proposed a generalized semiparametric model with an unknown variance function. This model allows some of the predictors to be modeled linearly, with others being modeled nonlinearly. We use the penalized nonparametric quasi-likelihood for estimation, and develop an AGCV score for selecting optimal smoothing parameters. Assuming an unknown but smooth variance function provides flexibility and robustness in fitting various data as shown in the simulation studies. This is especially useful when dealing with the problems of heteroscedasticity and severe skewness that are common in medical cost data.
In this paper, we use the truncated power basis due to its ease of implementation. Although there is some concern about possible correlation among the truncated power basis functions, the penalty used in our research is equivalent to assuming that the coefficients of the jth nonparametric function have independent normal distributions with mean zero and constant variance. This penalty stabilizes the computation regardless of the correlation among the truncated power basis functions. In future research, we are interested in other basis functions, e.g., the B-spline and the cubic smoothing spline.
There are two issues of model selection. The first, examined in our simulation studies, is the selection of an appropriate variance functional form; for instance, Manning and Mullahy [9] discuss specification and diagnostics for variance functions. Our method is an extension of PQL with the correct variance function (PQL-cv), relying on the data to estimate the variance function. In other words, our method is more general while retaining performance similar to that of PQL-cv. The other issue is the selection of covariates in the mean function (1). We have not addressed the problem of variable selection in this paper; it will be worth considering in future research.
As raised by a reviewer, in the application study, "inpatient" can be endogenous. Unfortunately, the information available in the CDR database would not allow us to apply econometric methods, e.g., instrumental variables [34], to address the endogeneity issue. We caution that the regressions should be treated as estimating a conditional mean with no clear behavioral interpretation; nevertheless, they provide a useful illustration of the method in a real-world setting.
In this paper we are interested in modeling cross-sectional medical costs, e.g., annual medical costs. We have not considered the analysis of incomplete or censored medical costs, where there could exist induced dependent censoring on medical costs. Many authors, e.g., [35, 36, 37, 38, 39, 40, 41], tackled this issue by the inverse probability of censoring weighting method proposed by Robins and Rotnitzky [42]. In future research, we can adapt this inverse-weighting approach to our model framework. There is also a growing interest in the temporal trend of longitudinal medical costs, e.g., monthly medical costs; see Liu and colleagues [30, 43, 44, 45] and Yabroff et al. [46]. As containing growing health care costs is the single most important fiscal issue in the U.S. [47], we expect considerable theoretical and numerical innovations in modeling medical costs in the coming years. These new developments will also help applied researchers with practical problems encountered currently, such as the availability of easy-to-use software and publicly shared data resources. The CDR database from UVa is available, subject to approval, at http://www.medicine.virginia.edu/clinical/departments/phs/informatics/cdr-page, and the R code of our method is given as supplemental material on the journal's website.
A. Justification of AGCV Score
Estimates are obtained by maximizing the penalized nonparametric quasi-likelihood. Given the estimate $\theta^{(l)}$ at the $l$th iterate, the $(l+1)$th estimate is updated by the Fisher scoring step

$$\theta^{(l+1)} = \Big(\sum_i D_i^T V_i^{-1} D_i + \Lambda\Big)^{-1} \sum_i D_i^T V_i^{-1}\big\{y_i - \mu_i + D_i\theta^{(l)}\big\}, \tag{6}$$

where $\Lambda$ is the smoothing penalty matrix. Denote $\eta = X\theta$ and $D_i = \Delta_i X_i$, where $\Delta_i = \partial\mu_i/\partial\eta_i = 1/g'(\mu_i)$. Equation (6) can be represented as

$$\theta^{(l+1)} = \Big(\sum_i X_i^T W_i X_i + \Lambda\Big)^{-1} \sum_i X_i^T W_i z_i, \qquad z_i = \eta_i + \Gamma_i(y_i - \mu_i),$$

where $W_i = 1/\{g'(\mu_i)^2 V_i\}$ and $\Gamma_i = g'(\mu_i)$. This is equivalent to finding $\hat\theta$ by solving the weighted penalized least squares problem

$$\min_\theta\; (z - X\theta)^T W (z - X\theta) + \theta^T \Lambda \theta,$$

where $W = \mathrm{diag}(W_i)$, $\Gamma = \mathrm{diag}(\Gamma_i)$, and $z = \eta + \Gamma(y - \mu)$. Consequently, we propose our AGCV as

$$\mathrm{AGCV}(\lambda) = \frac{n\sum_i W_i\big(z_i - X_i\hat\theta\big)^2}{\{n - \mathrm{tr}(A)\}^2},$$

where $A$ is the influence (hat) matrix for the model, i.e., $A = X(X^T W X + \Lambda)^{-1} X^T W$, with $W$ and $z$ evaluated at the last iterate.
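The penalized iteratively reweighted least squares iteration and the AGCV score can be sketched numerically. The following is a minimal Python illustration, not the authors' supplemental R code: it assumes a log link with Poisson-type variance V(μ) = μ (the paper instead estimates the variance function nonparametrically), a truncated-line spline basis with a ridge penalty as in Ruppert et al. [22], and a grid search for the smoothing parameter λ that minimizes AGCV.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data with a nonlinear mean on the log scale (illustrative only)
n = 200
x = np.sort(rng.uniform(0, 1, n))
mu_true = np.exp(1.0 + np.sin(2 * np.pi * x))
y = rng.poisson(mu_true).astype(float)

# Truncated-line spline basis X = [1, x, (x - kappa_k)_+], penalizing
# only the truncated-line coefficients (P-spline setup)
knots = np.linspace(0.1, 0.9, 9)
X = np.column_stack([np.ones(n), x, np.clip(x[:, None] - knots, 0, None)])
P = np.diag([0.0, 0.0] + [1.0] * len(knots))

def fit_pirls(lam, n_iter=50):
    """Penalized IRLS update (6) for a log link with V(mu) = mu."""
    eta = np.log(y + 0.5)                      # starting value
    for _ in range(n_iter):
        mu = np.exp(eta)
        # g'(mu) = 1/mu, so W_i = 1/{g'(mu)^2 V(mu)} = mu and Gamma_i = 1/mu
        W = mu
        z = eta + (y - mu) / mu                # working response z = eta + Gamma(y - mu)
        XtW = X.T * W
        theta = np.linalg.solve(XtW @ X + lam * P, XtW @ z)
        eta = X @ theta
    # AGCV at the last iterate
    mu = np.exp(eta)
    W = mu
    z = eta + (y - mu) / mu
    A = X @ np.linalg.solve((X.T * W) @ X + lam * P, X.T * W)  # hat matrix
    agcv = n * np.sum(W * (z - eta) ** 2) / (n - np.trace(A)) ** 2
    return theta, agcv

# Select lambda by minimizing AGCV over a grid
lams = 10.0 ** np.arange(-4, 4)
scores = [fit_pirls(lam)[1] for lam in lams]
best = lams[int(np.argmin(scores))]
print("AGCV-selected lambda:", best)
```

In practice the variance function V would itself be replaced by a penalized spline estimate, as in the paper, rather than fixed at V(μ) = μ.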
Supplementary Material
Acknowledgments
The authors thank the editor, the associate editor, and the three referees for valuable suggestions. We are grateful to Dr. Jason Lyman, Mr. Mac Dent, and Mr. Ken Scully at the clinical data repository of the University of Virginia for preparing the medical cost data. This research is partly supported by NIAAA grant RC1 AA 019274 and AHRQ grant R01 HS020263.
References
- 1. National Institute for Health Care Management. Understanding U.S. health care spending: NIHCM Foundation data brief, July 2011. NIHCM Foundation; 2011. Available at http://nihcm.org/images/stories/NIHCM-CostBrief-Email.pdf.
- 2. Obama B. Remarks by the president on reforming the health care system to reduce costs. Released from the Office of the Press Secretary, The White House; 2009. Available at http://www.whitehouse.gov/the_press_office/Remarks-by-the-President-on-Reforming-the-Health-Care-System-to-Reduce-Costs/
- 3. Bang H, Tsiatis A. Median regression with censored cost data. Biometrics. 2002;58:643–649. doi:10.1111/j.0006-341x.2002.00643.x.
- 4. Duan N. Smearing estimate: a nonparametric retransformation method. Journal of the American Statistical Association. 1983;78:605–610.
- 5. Manning WG. The logged dependent variable, heteroscedasticity, and the retransformation problem. Journal of Health Economics. 1998;17:283–295. doi:10.1016/s0167-6296(98)00025-3.
- 6. Zhou XH, Melfi CA, Hui SL. Methods for comparison of cost data. Annals of Internal Medicine. 1997;127:752–756. doi:10.7326/0003-4819-127-8_part_2-199710151-00063.
- 7. McCullagh P, Nelder JA. Generalized Linear Models. Chapman & Hall; New York: 1989.
- 8. Blough DK, Madden CW, Hornbrook MC. Modeling risk using generalized linear models. Journal of Health Economics. 1999;18:153–171. doi:10.1016/s0167-6296(98)00032-0.
- 9. Manning WG, Mullahy J. Estimating log models: to transform or not to transform? Journal of Health Economics. 2001;20:461–494. doi:10.1016/s0167-6296(01)00086-8.
- 10. Manning WG, Basu A, Mullahy J. Generalized modeling approaches to risk adjustment of skewed outcomes data. Journal of Health Economics. 2005;20:465–488. doi:10.1016/j.jhealeco.2004.09.011.
- 11. Green PJ, Silverman BW. Nonparametric Regression and Generalized Linear Models. Chapman & Hall; New York: 1994.
- 12. Chiou JM, Müller HG. Nonparametric quasi-likelihood. The Annals of Statistics. 1999;27:36–64.
- 13. Basu A, Rathouz PJ. Estimating marginal and incremental effects on health outcomes using flexible link and variance function models. Biostatistics. 2005;6:93–109. doi:10.1093/biostatistics/kxh020.
- 14. Zhou XH, Lin H, Johnson E. Non-parametric heteroscedastic transformation regression models for skewed data with an application to health care costs. Journal of the Royal Statistical Society, Series B. 2009;70:1029–1047.
- 15. Yau P, Kohn R. Estimation and variable selection in nonparametric heteroscedastic regression. Statistics and Computing. 2003;13:191–208.
- 16. Rigby RA, Stasinopoulos DM. Generalized additive models for location, scale and shape. Applied Statistics. 2005;54:507–554.
- 17. Nott D. Semiparametric estimation of mean and variance functions for non-Gaussian data. Computational Statistics. 2006;21:603–620.
- 18. Leng C, Zhang W, Pan J. Semiparametric mean-covariance regression analysis for longitudinal data. Journal of the American Statistical Association. 2010;105:181–193.
- 19. Gijbels I, Prosdocimi I, Claeskens G. Nonparametric estimation of mean and dispersion functions in extended generalized linear models. Test. 2010;19:580–608.
- 20. Park R. Estimation with heteroscedastic error terms. Econometrica. 1966;34:888.
- 21. Wedderburn RWM. Quasi-likelihood functions, generalized linear models, and the Gauss-Newton method. Biometrika. 1974;61:439–447.
- 22. Ruppert D, Wand MP, Carroll RJ. Semiparametric Regression. Cambridge University Press; New York: 2003.
- 23. Wood SN. Modelling and smoothing parameter estimation with multiple quadratic penalties. Journal of the Royal Statistical Society, Series B. 2000;62:413–428.
- 24. Wood SN. Generalized Additive Models: An Introduction with R. Chapman & Hall; New York: 2006.
- 25. Yu Y, Ruppert D. Penalized spline estimation for partially linear single-index models. Journal of the American Statistical Association. 2002;97:1042–1054.
- 26. Ruppert D. Selecting the number of knots for penalized splines. Journal of Computational and Graphical Statistics. 2002;11:735–757.
- 27. Lin X, Zhang D. Inference in generalized additive mixed models by using smoothing splines. Journal of the Royal Statistical Society, Series B. 1999;61:381–400.
- 28. Lloyd-Jones D, Adams RJ, Brown TM, et al. Heart disease and stroke statistics - 2010 update: a report from the American Heart Association Statistics Committee and Stroke Statistics Subcommittee. Circulation. 2010;121:e1–e170. doi:10.1161/CIRCULATIONAHA.109.192667.
- 29. Jha AK, Varosy PD, Kanaya AM, Hunninghake DB, Hlatky MA, Waters DD, Furberg CD, Shlipak MG. Differences in medical care and disease outcomes among black and white women with heart disease. Circulation. 2003;108:1089–1094. doi:10.1161/01.CIR.0000085994.38132.E5.
- 30. Liu L. Joint modeling longitudinal semi-continuous data and survival, with application to longitudinal medical cost data. Statistics in Medicine. 2009;28:972–986. doi:10.1002/sim.3497.
- 31. Gatsonis C, Epstein AM, Newhouse JP, Normand SL, McNeil BJ. Variations in the utilization of coronary angiography for elderly patients with an acute myocardial infarction: an analysis using hierarchical logistic regression. Medical Care. 1995;33:625–642. doi:10.1097/00005650-199506000-00005.
- 32. Stukel TA, Lucas FL, Wennberg DE. Long-term outcomes of regional variations in intensity of invasive vs medical management of Medicare patients with acute myocardial infarction. Journal of the American Medical Association. 2005;293:1329–1337. doi:10.1001/jama.293.11.1329.
- 33. Stukel TA, Fisher ES, Wennberg DE, Alter DA, Gottlieb DJ, Vermeulen MJ. Analysis of observational studies in the presence of treatment selection bias: effects of invasive cardiac management on AMI survival using propensity score and instrumental variable methods. Journal of the American Medical Association. 2007;297:278–285. doi:10.1001/jama.297.3.278.
- 34. Terza JV, Basu A, Rathouz PJ. Two-stage residual inclusion estimation: addressing endogeneity in health econometric modeling. Journal of Health Economics. 2008;27:531–543. doi:10.1016/j.jhealeco.2007.09.009.
- 35. Bang H, Tsiatis A. Estimating medical costs with censored data. Biometrika. 2000;87:329–343.
- 36. Lin DY. Proportional means regression for censored medical costs. Biometrics. 2000;56:775–778. doi:10.1111/j.0006-341x.2000.00775.x.
- 37. Lin DY, Feuer EJ, Etzioni R, Wax Y. Estimating medical costs from incomplete follow-up data. Biometrics. 1997;53:419–434.
- 38. Lin DY, Ying Z. Semiparametric and nonparametric regression analysis of longitudinal data (with discussion). Journal of the American Statistical Association. 2001;96:103–126.
- 39. Lin DY. Regression analysis of incomplete medical cost data. Statistics in Medicine. 2003;22:1181–1200. doi:10.1002/sim.1377.
- 40. Gardiner JC, Luo ZH, Bradley CJ, Sirbu CA, Given CW. A dynamic model for estimating changes in health status and costs. Statistics in Medicine. 2006;25:3648–3667. doi:10.1002/sim.2484.
- 41. Zhao HW, Bang H, Wang H, Pfeifer PE. On the equivalence of some medical cost estimators with censored data. Statistics in Medicine. 2007;26:4520–4530. doi:10.1002/sim.2882.
- 42. Robins J, Rotnitzky A. Recovery of information and adjustment for dependent censoring using surrogate markers. In: AIDS Epidemiology - Methodological Issues. Boston: Birkhäuser; 1992. pp. 297–331.
- 43. Liu L, Wolfe RA, Kalbfleisch JD. A shared random effects model for censored medical costs and mortality. Statistics in Medicine. 2007;26:139–155. doi:10.1002/sim.2535.
- 44. Liu L, Conaway MR, Knaus WA, Bergin J. A random effects four-part model, with application to correlated medical costs. Computational Statistics and Data Analysis. 2008;52:4458–4473.
- 45. Liu L, Huang XL, O'Quigley J. Analysis of longitudinal data in the presence of informative observational times and a dependent terminal event, with application to medical cost data. Biometrics. 2008;64:950–958. doi:10.1111/j.1541-0420.2007.00954.x.
- 46. Yabroff KR, Warren JL, Schrag D, et al. Comparison of approaches for estimating incidence costs of care for colorectal cancer patients. Medical Care. 2009;47(Suppl 1):S56–S63. doi:10.1097/MLR.0b013e3181a4f482.
- 47. Orszag P. Medicare trustees to America: Bend the curve! Statement released from the Office of Management and Budget, The White House; 2009. Available at http://www.whitehouse.gov/omb/blog/09/05/12/MedicareTrusteestoAmericaBendtheCurve/