Abstract
Estimating the average monthly medical costs from disease diagnosis to a terminal event such as death for an incident cohort of patients is a topic of immense interest to researchers in health policy and health economics because patterns of average monthly costs over time reveal how medical costs vary across phases of care. The statistical challenges to estimating monthly medical costs longitudinally are multifold; the longitudinal cost trajectory (formed by plotting the average monthly costs from diagnosis to the terminal event) is likely to be nonlinear, with its shape depending on the time of the terminal event, which can be subject to right censoring. The goal of this paper is to tackle this statistically challenging topic by estimating the conditional mean cost at any month t given the time of the terminal event s. The longitudinal cost trajectories with different terminal event times form a bivariate surface of t and s, under the constraint t ≤ s. We propose to estimate this surface using bivariate penalized splines in an Expectation-Maximization algorithm that treats the censored terminal event times as missing data. We evaluate the proposed model and estimation method in simulations and apply the method to the medical cost data of an incident cohort of stage IV breast cancer patients from the Surveillance, Epidemiology and End Results–Medicare Linked Database.
Keywords: Bivariate Smoothing, Joint Modeling, Lifetime and Survival Analysis, Medical Cost, SEER Medicare
1. INTRODUCTION
Electronic submission of medical billing records enables the automated collection of detailed medical cost data from a large population enrolled in health insurance plans. These data often come in a longitudinal format in the sense that whenever an encounter occurs, the system documents the medical expenditures associated with the encounter with a time stamp. When the medical cost data are structured by following an incident cohort of patients with a specific disease or condition, such data are referred to as the incident cost data (Yabroff et al 2009). A population study of longitudinal incident cost data is of tremendous interest to health policy makers and health economists as patterns of medical costs over time inform healthcare resource use throughout various phases of an illness. For example, the National Cancer Institute (NCI) periodically publishes estimates and projections of the cost of cancer care in the United States based on the analysis of claims data (Mariotto et al 2011; Yabroff et al 2008).
Unlike the total or cumulative medical cost, which summarizes all medical costs of a patient throughout a follow-up period of interest, the longitudinal incident cost provides details on healthcare consumption across different phases of the disease continuum (Yabroff et al 2008). In the motivating application of this paper, for example, the data include the medical costs of an incident cohort of Medicare patients with stage IV breast cancer, from their initial diagnosis until death. Although charges to Medicare as well as Medicare payments may be posted on a daily basis whenever a healthcare encounter occurs, we standardized the unit of analysis to monthly intervals for cost so as to make the data less volatile. The conventional phase-of-care approach of the NCI's cost reporting defines the first 12 months after cancer diagnosis as the initial care phase, the last 12 months before death as the terminal care phase, and the period in between as the continuing care phase. During the initial care phase, cancer patients often receive active treatment, including surgery, chemotherapy, and/or radiation therapy to eradicate or inhibit tumor growth. During the terminal care phase, patients may receive intensive salvage treatments and/or palliative care. During the continuing care phase, clinical visits become less intensive as patients have completed active treatment. In general, higher costs are expected during the initial and terminal care phases, whereas lower costs are anticipated for the continuing care phase, which primarily involves surveillance of the patient. The transition time between care phases as defined in the NCI's phase-of-care costing approach appears somewhat arbitrary, as the choice of a 12-month duration for each of the initial and terminal care phases makes it difficult to capture the heterogeneity of the cost trajectory among patients with various demographic characteristics and disease progression. 
It is thus desirable to estimate the population averaged monthly medical costs as a longitudinal cost trajectory from diagnosis to death, rather than the average total or cumulative costs for each phase of care as defined by pre-specified cutoff times.
There are a number of statistical challenges in estimating the longitudinal medical cost trajectory. First, while it is of interest for policy makers to study the average costs of a patient population rather than the costs of individuals, simply averaging across all the patients in the study sample may not generate the information required for policy decision making. For example, in the breast cancer patient cohort, not all patients contribute cost data to all three care phases. Aggressive or late-stage tumors may result in patient death shortly after the diagnosis of breast cancer, and the cost trajectories for such patients may consist of the initial and terminal care phases with an absence of the continuing care phase. Conversely, for patients with tumors that progress slowly, the continuing care phase could last many years. Therefore, it would not be appropriate to average, for example, the medical costs 18 months after diagnosis over patients who died at month 22 and patients who died at month 60, because the costs at month 18 would be part of the terminal care costs for the former patients but part of the continuing care costs for the latter patients. A more appropriate aggregation strategy is to calculate the mean cost at any month t conditional on the time of death s. In other words, the mean cost at month t should be expressed as a function of both t and s, forming a mean cost surface of t and s. By definition, this surface can only be defined within a triangular region satisfying 0 < t ≤ s. Second, conditional on the time of death, the longitudinal cost trajectory as a function of t may be nonlinear. Brown et al (2002) demonstrated through descriptive analyses of non-censored patients that the average cost trajectory of a breast cancer cohort is U-shaped with the mouth and bottom of the “U” depending on survival times. Hence, a model for the cost trajectory needs to accommodate nonlinear shapes and dependence between the cost trajectory and survival time. 
Lastly, the time of death may not be observed for all patients due to the limited duration of the follow-up as a result of administrative censoring commonly observed in claims data. Proper handling of right censoring is necessary in order to estimate the cost trajectories.
As discussed previously, the NCI's current reporting of the cost of cancer care summarizes such costs by pre-defined care phases (Mariotto et al 2011; Yabroff et al 2008; Yabroff et al 2008). However, their analytical methods do not address censoring explicitly and, as a result, cost data from censored phases are not used in the analyses. The existing health sciences literature on cost trajectory has used descriptive statistics to summarize cost data by survival time (Brown et al 2002; Etzioni et al 1996; Yabroff et al 2008; Yabroff et al 2009) or has used parametric functions to model the cost trajectory without allowing the trajectory to vary with the survival time (Akushevich et al 2011). These studies also only used cost data from uncensored subjects. Statistical literature on medical cost research has often focused on estimating the cumulated cost or studying the association between the cumulated cost and baseline covariates (Bang 2005; Bang and Tsiatis 2000; Baser et al 2006; Lin 2000, 2003; Lin et al 1997; O’Hagan and Stevens 2004; Pullenayegum and Willan 2011; Zhao, Cheng and Bang 2011; Zhao and Tian 2001; Zhao et al 2012), adjusting for “induced dependent censoring” in the total cost (Bang 2005; Lin et al 1997). Another related line of research is regression analyses with longitudinal costs modeled as repeated measures outcomes. A comprehensive overview can be found in a special issue of Medical Care (vol. 47, no. 7 supplement 1, July 2009). More recent work can be found in Chen et al (2016), Chen et al (2013), Liu et al (2008) and the references therein. These models focus on modeling the association between covariates and longitudinal medical cost outcomes; the cost trajectories are usually modeled with simple parametric functions (e.g., linear), without allowing them to depend on the possibly censored survival times. 
Shared parameter models have been used to jointly model longitudinal medical costs and survival times (Liu 2009; Liu, Huang, and O’Quigley 2008; Liu, Wolfe, and Kalbfleisch 2007). While these models allow the shape of the trajectory to be indirectly influenced by the survival time through the shared random effect, the intensive computation involved in fitting such models often limits these analyses to using very simple trajectory shapes such as a linear trajectory. However, this restriction does not fit the scenario described in our motivating application. Chan and Wang (2010) studied nonparametric estimation of the mean cost trajectory for a pre-specified time period prior to the terminal events. Their estimated mean trajectory is not conditional on the survival time. The censored survival time data were used to estimate the inverse probability censoring weights, but the cost data from censored patients were not used in the analysis.
In this article, we propose statistical methodology that addresses the three aforementioned challenges in estimating the longitudinal medical cost trajectory. The proposed cost summarization method is completed conditional on the time from diagnosis to the terminal event to properly average the costs for different phases of therapeutic treatment while utilizing all the available data from the patients in the cohort. We believe that this represents a general class of research problems of interest to health economics and health policy studies for which no published statistical methodology is yet available. We present a model and estimation method in Section 2. In Section 3, we provide a data analysis for the motivating application. We present simulation results to assess the empirical performance of the proposed method in Section 4, and a discussion in Section 5.
2. MODEL AND ESTIMATION
2.1 Notation
Suppose we have independent and identically distributed data from n subjects, {δi, Ti, Yi, ti; i = 1, 2, …, n}, where Si is the time to the terminal event of interest, such as death, Ci is the censoring time, Ti = min(Si, Ci) is the observed time of the terminal event or censoring, whichever is earlier, and δi = 1(Si ≤ Ci) is the event indicator. Let Yi = (Yi1, Yi2, …, Yini)T denote the monthly reported medical cost (after proper transformation to reduce the skewness of the data; see Section 3) up to Ti, and let ti = (ti1, ti2, …, tini)T be the corresponding months, starting from the time of the initial diagnosis of the disease. For notational convenience, we may also express Yij as a function of tij as Yij = Yi(tij), and ti is deterministic given Ti. When every month’s cost data are available, tij = j, j = 1, 2, …, ni and tini ≤ Ti < tini + 1. Since the cost for a fraction of a month is considered undefined, tij can only be integers, but Ti is allowed to be continuous. The notation and methodology in this paper extend naturally to applications with other time units, such as quarterly or yearly incident cost data. Throughout this paper, we assume that the censoring time Ci is independent of Si and Yi. This assumption is likely to hold in insurance claims database research similar to the motivating application in this paper, where the interest is in the medical cost until death and the censoring is caused by the administrative end of the follow-up.
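The observed-data structure described above can be illustrated with a minimal sketch. The event and censoring times below are hypothetical toy values, and the variable names are illustrative rather than taken from the paper.

```python
import numpy as np

# Hypothetical toy values: true event times S_i and censoring times C_i, in months.
S = np.array([7.4, 30.2, 61.0, 12.9])
C = np.array([20.0, 25.5, 80.0, 12.0])

T = np.minimum(S, C)                 # observed time: event or censoring, whichever is earlier
delta = (S <= C).astype(int)         # event indicator: 1 if the terminal event is observed
n_months = np.floor(T).astype(int)   # n_i: number of complete months, so n_i <= T_i < n_i + 1

# Months t_i = (1, ..., n_i) at which a monthly cost is recorded for each subject.
months = [np.arange(1, n + 1) for n in n_months]
```

In this toy example the second and fourth subjects are censored, so their cost records stop at the censoring time rather than at death.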
The goal of the research is to estimate

$$\mu(t, s) = E\{Y_i(t) \mid S_i = s\}, \qquad 0 < t \le s,$$

the mean cost at month t among subjects whose terminal event occurs at time Si = s.
Although t is an integer in this application, for ease of statistical modeling and estimation, we treat both t and s as continuous variables operationally and make the working assumption that μ(t, s) is a smooth surface over the continuous upper triangular area {(t, s): 0 < t ≤ s}. This assumption is reasonable given that the longitudinal cost trajectory is expected to be a visually smooth U-shaped curve that depends on the survival times (Section 1).
2.2 Model
The general idea is to model the marginal distribution of survival and the conditional distribution of the incident costs given survival. In the following presentation, we use the notation $f_{Y|S}(Y_i \mid s)$ to denote the conditional density function of Yi given that the true time of the terminal event, Si, equals s, and $f_S(s)$ to denote the marginal density function of Si. We use the notation Yi, Ti, δi to denote the random variables or their observed values when there is no ambiguity. The log-likelihood of the observed data is

$$\ell = \sum_{i=1}^{n} \left[ \delta_i \log\{ f_{Y|S}(Y_i \mid T_i)\, f_S(T_i) \} + (1-\delta_i) \log \int_{T_i}^{\infty} f_{Y|S}(Y_i \mid s)\, f_S(s)\, ds \right]. \tag{1}$$
The marginal distribution of the time to the terminal event can be estimated nonparametrically (e.g., Kaplan-Meier method) or semiparametrically (e.g., penalized splines; Cai, Hyndman and Wand 2002). Note that when the Kaplan-Meier method is used, we need a regularity condition that the support of the true survival time S lies inside the support of the censoring time C. This condition is needed for evaluating the integral in (1) over (Ti, ∞).
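For readers who want to see the nonparametric option concretely, the following is a minimal sketch of the Kaplan-Meier product-limit estimator written with NumPy only. The function name and the toy data are illustrative assumptions, not part of the paper.

```python
import numpy as np

def kaplan_meier(T, delta):
    """Return the distinct event times and the Kaplan-Meier survival estimate at each."""
    T = np.asarray(T, dtype=float)
    delta = np.asarray(delta, dtype=int)
    event_times = np.unique(T[delta == 1])
    surv = np.empty(len(event_times))
    s = 1.0
    for idx, t in enumerate(event_times):
        at_risk = np.sum(T >= t)                   # subjects still under observation at t
        deaths = np.sum((T == t) & (delta == 1))   # events observed exactly at t
        s *= 1.0 - deaths / at_risk                # product-limit update
        surv[idx] = s
    return event_times, surv

# Toy data: observed times in months and event indicators (0 = censored).
times, surv = kaplan_meier([2, 3, 3, 5, 8], [1, 1, 0, 1, 0])
```

Differences of the estimated survival function over the coarsening intervals give the interval probabilities that the estimation procedure needs for censored subjects.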
For any subject i, the model for the conditional distribution of Yi given that the time of the terminal event equals s follows a multivariate normal distribution. The j-th (j = 1, 2, …, ni) element of Yi, Yij, has the distribution

$$Y_{ij} \mid S_i = s \sim N\{\mu(t_{ij}, s),\ \sigma^2\},$$

where σ² is the variance parameter. We use α to denote the correlation parameters for Yi. The modeling assumption above assumes that the data Yij are normally distributed with a mean function that varies with tij, the month of the incident cost, and true survival time s. This mean function is modeled by flexible bivariate splines. Since the medical costs may be highly skewed, it is often necessary to apply a monotone transformation prior to the data analysis to reduce skewness. In such a situation, Yij denotes the transformed cost data. In subsequent development, we will show that the multivariate normal model not only simplifies the computation, but also is robust against moderate model violation due to either skewness or zero monthly costs.
For the motivating application in particular, it is expected that μ(t, s) is a smooth surface over the upper triangular area and μ(t, s) as a function of t most likely takes a U-shape, which may be bent and twisted by s in a flexible manner. While the proposed methodology below works with any flexible parametric function for μ(t, s), we prefer to express μ(t, s) as a linear combination of unknown parameters for ease of computation. This motivates us to consider approximating μ(t, s) with bivariate penalized splines with a truncated quadratic basis (Ruppert, Wand and Carroll 2003). However, conventional penalized splines for a bivariate surface are defined on a rectangular area and cannot be used directly on surfaces defined only on the upper triangular area. Thus, we propose the following shrinkage-expansion transformation to solve this problem:

$$u = \frac{t\,\tau}{s}, \qquad \tilde{\mu}(u, s) = \mu\!\left(\frac{u\,s}{\tau},\ s\right).$$

The expanded surface $\tilde{\mu}(u, s)$ is defined on the rectangular area 0 < u ≤ τ, 0 < s ≤ τ, where τ is the maximum allowable survival time. We model $\tilde{\mu}(u, s)$ by

$$\tilde{\mu}(u, s) = B(u, s)^T \theta,$$

where B(u, s) is a vector of a bivariate truncated quadratic basis such that B(u, s) consists of the (M1 + 3) × (M2 + 3) elements of the outer product matrix of the vector $(1, s, s^2, (s - c_1)_+^2, \ldots, (s - c_{M_1})_+^2)^T$ and the vector $(1, u, u^2, (u - v_1)_+^2, \ldots, (u - v_{M_2})_+^2)^T$, with $(x)_+ = \max(x, 0)$, where c1, …, cM1 are pre-specified internal knots on the scale of s, and v1, …, vM2 are internal knots on the scale of u. The numbers of internal knots, M1 and M2, must be specified prior to data analysis. θ is a vector of unknown parameters to be estimated.
Let Di(s) be the matrix with the j-th row being B(tij(τ/s), s)T. Under this parameterization, the model can be written as

$$Y_i \mid S_i = s \sim N\{D_i(s)\,\theta,\ \Sigma_i(\alpha, \sigma^2)\}, \tag{2}$$
where Σi(α, σ²) is the variance covariance matrix of Yi, with dimension ni × ni.
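The construction of the design matrix Di(s) can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the knot locations are the ones reported for the data application, and τ = 104 is used only as an example value.

```python
import numpy as np

def truncated_quadratic(x, knots):
    """Columns 1, x, x^2 and (x - k)_+^2 for each internal knot k."""
    x = np.asarray(x, dtype=float)
    cols = [np.ones_like(x), x, x**2]
    cols += [np.clip(x - k, 0.0, None) ** 2 for k in knots]
    return np.column_stack(cols)

def design_matrix(t, s, tau, knots_s, knots_u):
    """Rows B(t_j * tau / s, s)^T for the months t_j of a subject with event time s."""
    t = np.asarray(t, dtype=float)
    u = t * tau / s                                         # maps (0, s] onto (0, tau]
    Bs = truncated_quadratic(np.full_like(t, s), knots_s)   # shape (n, M1 + 3)
    Bu = truncated_quadratic(u, knots_u)                    # shape (n, M2 + 3)
    # Row-wise outer product, flattened to shape (n, (M1 + 3) * (M2 + 3)).
    return np.einsum("ij,ik->ijk", Bs, Bu).reshape(len(t), -1)

knots_s = [15, 35, 55, 75]        # internal knots on the scale of s
knots_u = [10, 25, 45, 65, 85]    # internal knots on the scale of u
D = design_matrix(np.arange(1, 25), s=24.0, tau=104.0,
                  knots_s=knots_s, knots_u=knots_u)
```

With four knots for s and five for u, each row has (4 + 3) × (5 + 3) = 56 entries, and the last month of a subject is always mapped to u = τ regardless of the survival time.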
2.3 Estimation
Even when the marginal distribution of the true survival time Si is known, direct maximization of the log-likelihood (1) is difficult for two reasons. First, the integrals inside each subject’s log-likelihood generally do not have a closed expression, and numerical integration must be used, which increases the complexity and reduces the stability of computation. Second, the marginal log-likelihood (1) is nonlinear with respect to θ, α, and σ², and maximization must be done numerically. To make the cost trajectory model flexible, a large number of parameters must be used in θ, which further increases the computational difficulty.
We propose to make the computation feasible by treating the unobserved survival times Si of the censored subjects as missing data and applying the Expectation-Maximization (EM) algorithm (Dempster, Laird and Rubin 1977). For this approach to work, the continuous survival time must first be “coarsened” into a discrete scale within the interval (Ti, ∞) (Heitjan and Rubin 1991). One advantage of coarsening is that the calculation of the expected complete data log-likelihood in the E-step and its maximization in the M-step can both be done analytically through a closed form formula, as shown below.
If patient i is censored at Ti, we assume that the true time to death Si falls within one of the Ki pre-specified intervals, denoted by Iik (k = 1, 2, …, Ki). The marginal survival probability is estimated up to τ. Operationally, we can set τ = max {Ti|i = 1, 2, …, n}. The coarsening above must ensure that $\bigcup_{k=1}^{K_i} I_{ik} = (T_i, \tau]$. Let lik denote a representative point of interval Iik; we use the middle point throughout this paper. The coarsening approximation implies that if Si ∈ Iik, then Si is approximated by lik. The intervals {Iik} and their middle points {lik} are specified prior to applying the EM algorithm. We define the intervals of each censored subject to have equal lengths of at least one month, and impose a cap K = 10 to be the maximum number of intervals that a censored subject can have. In other words, when both the censoring time Ti and the maximum survival time τ are expressed in months, Ki = min{⌊τ − Ti⌋, K}, where ⌊b⌋ denotes the largest integer that is no greater than b. Therefore, for a censored subject whose Ti is very close to τ, Ki will be smaller than K; for a censored subject whose Ti is far away from τ, Ki may reach the cap K and the intervals are longer. Imposing the cap K is a simple and effective way to control the computational complexity. Smaller K results in quicker computation.
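The coarsening rule can be sketched in a few lines. The function name is hypothetical, and raising an error when fewer than one month remains before τ is an assumption of this sketch, since the text does not describe that boundary case.

```python
import math
import numpy as np

def coarsen(T_i, tau, K=10):
    """Split (T_i, tau] into K_i equal-length intervals; return edges and midpoints."""
    K_i = min(math.floor(tau - T_i), K)
    if K_i < 1:
        # Assumption of this sketch: the paper does not describe this boundary case.
        raise ValueError("less than one month between the censoring time and tau")
    edges = np.linspace(T_i, tau, K_i + 1)
    mids = 0.5 * (edges[:-1] + edges[1:])    # representative points l_ik
    return edges, mids

edges_far, mids_far = coarsen(T_i=20.0, tau=100.0)     # K_i reaches the cap of 10
edges_near, mids_near = coarsen(T_i=96.5, tau=100.0)   # K_i = 3, shorter intervals
```

A subject censored at month 20 with τ = 100 receives ten 8-month intervals, whereas a subject censored at month 96.5 receives three intervals of slightly more than one month each.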
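The coarsening of the survival time implies that if a subject is censored at Ti, the true survival time Si can take on one of the values in {lik; k = 1, 2, …, Ki}. Hence, the indicators 1(Si = lik) are viewed as missing data for censored subjects. Let Θ = (θ, α, σ²). Up to terms that do not involve Θ, the complete data log-likelihood is

$$\sum_{i=1}^{n}\left[\delta_i \log f_{Y|S}(Y_i \mid T_i; \Theta) + (1-\delta_i)\sum_{k=1}^{K_i} 1(S_i = l_{ik}) \log f_{Y|S}(Y_i \mid l_{ik}; \Theta)\right], \tag{3}$$

where $f_{Y|S}(Y_i \mid s; \Theta)$ is the multivariate normal density of Yi given Si = s in (2).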
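In the E-step, we calculate the conditional expectation of 1(Si = lik) given Yi, Ti, δi = 0 and Θ(m), the estimated Θ in the current iteration (m), as follows:

$$w_{ik}^{(m)} = E\{1(S_i = l_{ik}) \mid Y_i, T_i, \delta_i = 0; \Theta^{(m)}\} = \frac{f_{Y|S}(Y_i \mid l_{ik}; \Theta^{(m)})\, \hat{P}(S_i \in I_{ik} \mid S_i > T_i)}{\sum_{k'=1}^{K_i} f_{Y|S}(Y_i \mid l_{ik'}; \Theta^{(m)})\, \hat{P}(S_i \in I_{ik'} \mid S_i > T_i)}, \tag{4}$$

where $f_{Y|S}(Y_i \mid s; \Theta)$ is the multivariate normal density of Yi given Si = s.
Note that before initiating the EM algorithm, the marginal survival function of Si can be estimated consistently from {Ti, δi; i = 1, 2, …, n} using the Kaplan-Meier method. The conditional probability mass functions P(Si ∈ Iik | Si > Ti) in (4) have been replaced by their estimates calculated prior to the EM iteration.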
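The E-step for a single censored subject can be sketched as follows, assuming an exchangeable correlation structure as in the data application. The function names and toy inputs are illustrative assumptions. Because the covariance matrix is the same for every pseudo failure time of a subject, the normalizing constant of the normal density cancels in the ratio.

```python
import numpy as np

def exchangeable_cov(n, sigma2, alpha):
    """Exchangeable covariance: variance sigma2, common correlation alpha."""
    return sigma2 * ((1.0 - alpha) * np.eye(n) + alpha * np.ones((n, n)))

def e_step_weights(y, means, prior, sigma2, alpha):
    """Posterior weights of the pseudo failure times for one censored subject.

    y     : (n,) observed (transformed) monthly costs
    means : (K, n) mean trajectories, one row per pseudo failure time l_k
    prior : (K,) estimated P(S in I_k | S > T), e.g. from Kaplan-Meier
    """
    Sigma_inv = np.linalg.inv(exchangeable_cov(len(y), sigma2, alpha))
    resid = y[None, :] - means                                 # (K, n)
    quad = np.einsum("kj,jl,kl->k", resid, Sigma_inv, resid)   # Mahalanobis distances
    log_w = np.log(prior) - 0.5 * quad     # shared normalizing constant cancels
    log_w -= log_w.max()                   # guard against numerical underflow
    w = np.exp(log_w)
    return w / w.sum()

y = np.array([5.0, 4.0, 4.5])
means = np.array([[5.0, 4.0, 4.5],     # trajectory that matches the data
                  [8.0, 8.0, 8.0]])    # trajectory far from the data
w = e_step_weights(y, means, prior=np.array([0.5, 0.5]), sigma2=4.0, alpha=0.5)
```

The weights sum to one for each censored subject, and pseudo failure times whose implied trajectory is closer to the observed costs receive larger weight.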
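The expectation of the complete data log-likelihood is

$$\sum_{i=1}^{n}\left[\delta_i \log f_{Y|S}(Y_i \mid T_i; \Theta) + (1-\delta_i)\sum_{k=1}^{K_i} w_{ik}^{(m)} \log f_{Y|S}(Y_i \mid l_{ik}; \Theta)\right], \tag{5}$$

where $w_{ik}^{(m)}$ denotes the conditional expectation calculated in (4) and $f_{Y|S}(Y_i \mid s; \Theta)$ is the multivariate normal density of Yi given a terminal event time s.
We call the failure time of the k-th (k = 1, 2, …, Ki) term of a censored subject i the k-th pseudo failure time of subject i. In (5), we replace the failure time of each censored subject with Ki pseudo failure times at lik (k = 1, 2, …, Ki), which are properly weighted, with the total weights summing up to 1, i.e., $\sum_{k=1}^{K_i} w_{ik}^{(m)} = 1$.
The M-step can be accomplished in two steps. In the first step, given the estimated α(m) and σ(m)2 in the current iteration, we estimate θ by maximizing a penalized version of the expected log complete data likelihood above:

$$-\frac{1}{2}\sum_{i=1}^{n}\left[\delta_i\, r_i(T_i)^T \Sigma_i^{-1} r_i(T_i) + (1-\delta_i)\sum_{k=1}^{K_i} w_{ik}^{(m)}\, r_i(l_{ik})^T \Sigma_i^{-1} r_i(l_{ik})\right] - \frac{\lambda}{2}\,\theta^T \Omega\, \theta, \tag{6}$$

where $r_i(s) = Y_i - D_i(s)\theta$ and $\Sigma_i = \Sigma_i(\alpha^{(m)}, \sigma^{(m)2})$.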
In this expression, λ is a non-negative penalty parameter that controls the smoothness of the surface fit. The matrix Ω is a penalty matrix pre-specified according to the basis functions. When the truncated quadratic basis is used, Ω is a block diagonal matrix with the first block being a 3 × 3 matrix of 0’s and the second block being an identity matrix (Ruppert, Wand and Carroll 2003). Under the proposed model, (6) is a quadratic function of θ; hence, the maximizer can be solved analytically as a weighted least squares estimator. The penalty term will balance the tradeoff between overfitting and underfitting the model and controls the smoothness of the fit: a very large λ leads to a quadratic fit and a small λ leads to a more flexible fit.
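Because the penalized objective is quadratic in θ, the update has a closed form. The sketch below assumes the penalty enters as (λ/2)θᵀΩθ, so that the normal equations are (Σ w DᵀΣ⁻¹D + λΩ)θ = Σ w DᵀΣ⁻¹Y; the function name and the toy inputs are hypothetical.

```python
import numpy as np

def update_theta(blocks, Omega, lam):
    """Penalized weighted least squares update for theta.

    blocks : iterable of (w, D, y, Sigma_inv); w = 1 for an uncensored subject at
             its observed event time, and the E-step weight for each pseudo
             failure time of a censored subject.
    """
    A = lam * Omega.astype(float)
    b = np.zeros(Omega.shape[0])
    for w, D, y, Sigma_inv in blocks:
        A = A + w * D.T @ Sigma_inv @ D
        b = b + w * D.T @ Sigma_inv @ y
    return np.linalg.solve(A, b)

# Toy example: intercept plus one penalized slope, identity working covariance.
D = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 2.0, 3.0])
Omega = np.diag([0.0, 1.0])            # intercept unpenalized, slope penalized
blocks = [(1.0, D, y, np.eye(3))]
theta_free = update_theta(blocks, Omega, lam=0.0)    # ordinary least squares fit
theta_heavy = update_theta(blocks, Omega, lam=1e6)   # slope shrunk toward zero
```

In the toy example, λ = 0 reproduces the exact linear fit, while a very large λ shrinks the penalized coefficient toward zero and leaves only the unpenalized part of the basis.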
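In the second step, given the estimated θ(m), we estimate σ² and α by the method of moments similar to a generalized estimating equations model (Liang and Zeger 1986; Fitzmaurice et al 2009; SAS Institute 2008). Let $e_{ij} = Y_{ij} - B(t_{ij}\tau/T_i, T_i)^T \theta^{(m)}$ be the Pearson residual of a non-censored subject. The variance σ² can be estimated as

$$\hat{\sigma}^2 = \frac{\sum_{i:\,\delta_i = 1}\sum_{j=1}^{n_i} e_{ij}^2}{\sum_{i:\,\delta_i = 1} n_i}. \tag{7}$$

For estimating α, the parameters associated with the correlation structure, we define Ui as the ni(ni − 1)/2 vector of the distinct pairwise products of the standardized residuals $e_{ij}/\hat{\sigma}$, and ρi(α), the corresponding vector of expectations according to the chosen correlation structure; α can be estimated as the solution to $\sum_{i:\,\delta_i = 1} \{\partial \rho_i(\alpha)/\partial \alpha\}^T \{U_i - \rho_i(\alpha)\} = 0$ (Liang and Zeger 1986; Fitzmaurice et al 2009).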
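For the exchangeable structure used in the data application, the moment estimators reduce to simple averages of squared residuals and of pairwise residual products. The sketch below assumes equal weighting of all residual pairs and uses a hypothetical function name; it is one plausible instance of the moment approach rather than the authors' exact implementation.

```python
import numpy as np

def moment_estimates(residuals):
    """Moment estimates of sigma^2 and an exchangeable correlation alpha.

    residuals : list of 1-D arrays e_i = Y_i - fitted mean, uncensored subjects only.
    """
    n_obs = sum(len(e) for e in residuals)
    sigma2 = sum(float(e @ e) for e in residuals) / n_obs
    # The sum over j < k of e_ij * e_ik equals ((sum_j e_ij)^2 - sum_j e_ij^2) / 2.
    cross = sum((e.sum() ** 2 - float(e @ e)) / 2.0 for e in residuals)
    n_pairs = sum(len(e) * (len(e) - 1) // 2 for e in residuals)
    alpha = cross / (n_pairs * sigma2)
    return sigma2, alpha

# Toy residuals: perfectly concordant within subject, so alpha should equal 1.
sigma2_hat, alpha_hat = moment_estimates([np.array([1.0, 1.0]),
                                          np.array([-1.0, -1.0])])
```

Only residuals from non-censored subjects would be passed to this function, in line with the restriction noted in Remark 1.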
The penalty parameter λ is selected by M-fold cross-validation, where M = 10 is commonly used (McLachlan, Do and Ambroise 2004), in the following steps.
1. Pre-specify a value for λ. Divide the data set into M subsets with an equal number of subjects in each.
2. For m = 1, 2, …, M, repeat steps (2.1) and (2.2):
2.1 Use the m-th subset as the test data set, and the other M − 1 subsets as the training data set.
2.2 Let {δj, Tj, Yj, tj|j = 1, 2, …, Jm} denote the test data set. The cross-validated score for this test data set is computed from the test data, with the bivariate surface estimated from the training data.
3. The cross-validation score for λ, CV (λ), is obtained by combining the cross-validated scores of the M test data sets.
4. Calculate CV (λ) for a series of candidate values of λ and select the one with the smallest cross-validation score.
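The bookkeeping of the cross-validation steps can be sketched as follows. Folds are formed at the subject level so that all months of a patient fall in the same subset. The function names are hypothetical, and the scoring function is left as a user-supplied callable because its exact form is not reproduced here.

```python
import numpy as np

def subject_folds(n_subjects, n_folds=10, seed=0):
    """Randomly partition subject indices into folds of (nearly) equal size."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_subjects)
    return np.array_split(perm, n_folds)

def select_lambda(lambdas, cv_score):
    """Evaluate cv_score at each candidate and return the minimizer and all scores."""
    scores = np.array([cv_score(lam) for lam in lambdas])
    return lambdas[int(np.argmin(scores))], scores

folds = subject_folds(25, n_folds=10)

# Placeholder score with a minimum at 0.01; a real score would refit the model
# on the training folds and evaluate prediction error on the held-out fold.
best, scores = select_lambda([1e-3, 1e-2, 1e-1],
                             cv_score=lambda lam: (np.log10(lam) + 2.0) ** 2)
```

When the number of subjects is not divisible by M, fold sizes differ by at most one subject.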
Remark 1. While the mean structure of the cost surface is estimated by using all the data, only the non-censored subjects are used to estimate σ² and α. In our preliminary investigation, we found that including the censored subjects in this step slowed the convergence of the EM algorithm and hence is undesirable in computationally demanding tasks such as bootstrapping or cross-validation. A plausible explanation is that the residuals from the pseudo subjects are more variable than those from the non-censored subjects due to the uncertainty in their unobserved survival times.
Remark 2. A degree of freedom correction may be used for the denominator in (7). Let D be the matrix consisting of Di(Ti) (i = 1, 2, …, n; δi = 1) stacked one on top of the other, Σ be the block diagonal matrix with the diagonal blocks being Σi(α, σ²), and Y be the concatenated vector of the Yi’s, with δi = 1. The denominator of (7) can be replaced by the residual degrees of freedom computed from the smoother matrix Sλ, which maps Y to its fitted values under the penalty λ (Ruppert, Wand, and Carroll 2003). In our application, the number of subjects and the average number of months per subject are both large. In such a situation, (7) provides a nearly unbiased estimate, while Sλ is costly to compute due to its large size. Therefore, we used (7) in the numerical studies.
The proposed estimation procedure is essentially a two-step approach. The first step is to use the Kaplan-Meier method to estimate the marginal distribution of survival. The asymptotic behavior of the Kaplan-Meier estimate is well established. The second step is to obtain the maximum likelihood estimator of the parameters that characterize the longitudinal cost distribution given the consistent survival distribution estimator from the first step. For the second step, we used bivariate penalized splines. Penalized splines are widely used in statistical methodological research (Ruppert, Wand and Carroll, 2003, 2009), but their asymptotic theory has only been developed in the context of univariate scatter plot smoothers where both the x- and y-variables are continuous (Wang, Shen, Ruppert 2011; Li and Ruppert 2008; Hall and Opsomer 2005). The theory is not directly applicable to the proposed model because one of the two x-variables is the month in which the incident cost is incurred, which is discrete by definition, and the number of distinct months within the follow-up period does not increase with the sample size. Further work is needed to study the asymptotic theory in the context of the proposed model.
3. DATA APPLICATION
The proposed methodology was motivated by the analysis of medical cost data of a study cohort constructed from the Surveillance, Epidemiology and End Results (SEER)–Medicare linked database. The study cohort consists of an incident cohort of 4,950 patients over the age of 65 who were diagnosed with stage IV breast cancer between January 1, 2002 and December 31, 2009. The Medicare payment amounts for inpatient and outpatient services are available for each patient from the date of diagnosis until death or the end of the follow-up, which was December 31, 2010. The overall censoring proportion was 21% and no patients were censored within the first 12 months following diagnosis. The maximum recorded survival time is 104 months, and the median survival time is estimated to be 16 months. The goal of the research was to estimate the population averaged medical costs among these patients so as to learn how the cost varies after diagnosis. We summarized Medicare payments as monthly costs.
We used a log(1 + x) transformation to reduce the skewness of the medical cost data in the initial model fitting. Figure 1 is a descriptive analysis of the monthly medical costs of individual patients and their averages, stratified by the duration of survival since diagnosis, among the non-censored subjects. While the individual cost trajectories exhibited large variations, the average transformed costs appeared to follow a U-shape, with higher costs at both tails of the distribution. In Figure 1, for each patient subgroup defined by the year of death, we estimated the standard deviations of the residuals as 2.90, 2.92, 3.11, 3.02, 3.24, and 3.10, respectively. These standard deviations are similar across these patient subgroups, supporting the modeling assumption of constant variance. The month-to-month cost fluctuation was affected by many factors. Since the intra-subject correlation is a nuisance parameter, we opted for an exchangeable autocorrelation structure, which can be interpreted as arising from a latent, patient-specific factor that increases or decreases each patient’s monthly costs. The trajectory surface μ(t, s) is modeled by using a truncated quadratic basis with pre-specified internal knots for survival time s at months 15, 35, 55, and 75, and internal knots for time t at months 10, 25, 45, 65, and 85, leading to a total of 56 basis functions and their corresponding coefficients to model the surface. These knots were chosen a priori to be equally spaced and cover the time scale of s and t.
Figure 2 (A) shows estimated average cost trajectories for patients whose survival time is 24, 36, 48, 60 or 72 months after diagnosis, together with pointwise 95% Wald-type confidence intervals obtained from variances estimated from 200 bootstrap samples. After transforming the estimators back to the dollar cost, we used Figure 2 (B) to illustrate the distribution of the monthly costs for each survival time on the scale of the dollar amount through the 25-th, 50-th, and 75-th quantiles at each month. These quantiles can be estimated with the proposed methodology due to the multivariate normal assumption in (2). Considerable heterogeneity in Medicare reimbursement is observed among patients in this cohort. For example, for patients at the 75-th quantile, their medical costs within the first few months after diagnosis could be as high as $60,000 per month, while for patients at the 25-th quantile, the monthly costs were around $1,000. The curves in Figure 2 also show that the monthly Medicare payment for patients with stage IV breast cancer generally follows a U-shaped curve from the initial diagnosis to death, regardless of the length of survival. The estimated cost trajectory suggests that a plausible cutoff for the initial active treatment phase is 12 months, which coincides with the duration that the NCI used to define the initial care phase. During the first 12 months after diagnosis, the patients receive intensive chemotherapy, radiation therapy, and/or surgical treatments. Interestingly, the costs for the active treatment phase are similar, regardless of the patients’ eventual survival times.
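The back-transformation of quantiles can be sketched as follows. Because log(1 + x) is monotone, a quantile on the transformed scale maps directly to a quantile on the dollar scale through exp(·) − 1. The function name and the value of the fitted mean are hypothetical; the variance 8.53 is the estimate reported for this cohort.

```python
import math
from statistics import NormalDist

def dollar_quantile(mu, sigma2, q):
    """q-th quantile of the dollar cost when log(1 + cost) ~ N(mu, sigma2)."""
    z = NormalDist().inv_cdf(q)                    # standard normal quantile
    return math.expm1(mu + z * math.sqrt(sigma2))  # exp(.) - 1 inverts log(1 + x)

mu_hat = 7.0   # hypothetical fitted mean of the transformed cost at some (t, s)
q25, q50, q75 = (dollar_quantile(mu_hat, 8.53, q) for q in (0.25, 0.50, 0.75))
```

With a variance this large on the log scale, the interquartile range on the dollar scale spans well over an order of magnitude, which is in line with the heterogeneity described above.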
Following the active treatment phase is the surveillance phase, during which patients usually receive less intensive treatments, and the main purpose is to monitor tumor recurrence and manage long-term complications. The length of the surveillance phase varies by patient. We observed that patients who lived longer had lower monthly costs in the surveillance phase. However, this observation should not be interpreted as medical costs being predictive of survival. Rather, this pattern is likely driven by the fact that patients who had fewer complications or comorbidities had lower risk of death.
It is important to note that the U-shaped cost trajectories generally are not symmetric. The decline in monthly costs in the active treatment phase appeared to be quicker than the monthly cost increase toward the tail of the surveillance phase, when entering the end-of-life phase. Although almost all patients underwent costly treatments in the active treatment phase, their cost trajectories started to diverge as they entered the surveillance phase. This resulted in lower average monthly costs in the surveillance phase, with cost starting to climb at various time points, depending on the patients’ survival. For this reason, the transition between the surveillance phase and the end-of-life phase happened gradually and was heterogeneous, making it difficult to identify a clear cutoff separating the two phases, as was done in the NCI's cost approach. The rich information offered by longitudinal trajectories of monthly cost data cannot be obtained from the analysis of cumulative or total costs, nor can it be obtained by summarizing costs into pre-defined phases of care. This demonstrates the significance of our research question and the important implication of the proposed methodology in health policy as well as clinical research.
The estimated intra-subject correlation and variance were 0.62 and 8.53, respectively. The moderate positive intra-subject correlation suggested considerable variation in costs between patients. For comparison, we also show a “crude” estimate in Figure 2 (A), obtained by averaging the monthly costs over non-censored subjects whose survival times were within the year of the selected survival times. Since the crude averages were wiggly, we smoothed them with penalized splines (R SemiPar package with the default settings) for better visualization. Even after smoothing, the crude average trajectory estimates showed some local curvy shapes that were difficult to interpret (e.g., the strata corresponding to patients who died at 48 or 60 months after diagnosis).
The proposed methodology can provide an estimated average cost trajectory for each distinct survival time. Putting all of these trajectories together, we obtained a “smooth” surface over the triangular region (Figure 3 (A)). This plot is complementary to Figure 2 because Figure 2 only shows the surface at a few cross-sections. We placed a quotation mark on “smooth” because the “month” variable cannot be continuous in the algebraic sense when we deal with monthly cost data. Since the month of the reported cost must be no greater than the total number of months from diagnosis to death or censoring, the surface is defined only on an upper triangular region. This is different from the conventional bivariate smoothing problems, and we solved this problem by using the shrinkage-expansion transformation in Section 2. We illustrate the estimated surface in both a contour plot and a 3-D plot in Figure 3.
The penalty parameter λ was selected by the proposed cross-validation procedure. We calculated the cross-validation score CV(λ) for λ = 5 × 10^x with x = −6, −5, −4, −3, −2, −1, 0, 1. The corresponding cross-validation scores are 9.14, 9.13, 9.13, 9.15, 9.23, 9.27, 9.28, and 9.28, respectively. This result suggests that the desired smoothness can be obtained when λ is below 5 × 10^−3; the cross-validation score increases quickly as λ increases beyond this threshold. Note that the above grid for λ covers a wide range, as adjacent values represent a 10-fold increase. As λ decreases from 5 × 10^−3 to 5 × 10^−6, the run time for obtaining the point estimator increases from 27 minutes to 81 minutes on a computer with a 3.30 GHz CPU and 16 GB of memory, suggesting that a strong penalty restricts the movement of the coefficients of the basis functions and helps convergence. We also observe more local oscillation when λ is small. Given that a smooth U-shaped trajectory with no local oscillation is supported by the previous literature and clinical practice, we chose the penalty parameter at 0.002, the middle point of the interval (5 × 10^−4, 5 × 10^−3), for the final analysis in this section. The corresponding cross-validation score is 9.14.
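The grid search reported above can be tabulated as follows. The λ grid and the cross-validation scores are the values quoted in the text; the tolerance rule (`tol`) used to flag near-optimal values is an assumption added for illustration and is not the authors' selection rule, which also weighed run time and the smoothness of the fitted trajectories.

```python
# Grid and scores as reported in the text.
exponents = [-6, -5, -4, -3, -2, -1, 0, 1]
lambdas = [5 * 10.0 ** x for x in exponents]
cv_scores = [9.14, 9.13, 9.13, 9.15, 9.23, 9.27, 9.28, 9.28]

best = min(cv_scores)

# Hypothetical rule: treat any lambda whose score is within `tol`
# of the minimum as practically equivalent.
tol = 0.025
near_best = [lam for lam, cv in zip(lambdas, cv_scores) if cv - best <= tol]

# Among near-optimal values, the largest gives the smoothest fit.
largest_near_best = max(near_best)

for lam, cv in zip(lambdas, cv_scores):
    flag = "*" if lam in near_best else " "
    print(f"{flag} lambda = {lam:8.0e}   CV = {cv:.2f}")
```

Under this rule the flagged values are those with x from −6 to −3, which matches the observation that the score rises quickly once λ exceeds 5 × 10^−3.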
4. SIMULATION
We conducted simulations to assess the finite sample performance of the proposed method. We generated the true survival time from an exponential distribution with mean 45, right truncated at 100 so that the maximum survival τ is taken to be 100 months. The log medical cost Yij of a subject with survival time at month tij is generated according to the model in Section 2 that ensures a U-shaped trajectory for any survival time. The residual variance was 4 and the intra-subject correlation was 0.5. The knots for survival are placed at months 15, 35, 55, and 75, and the knots for monthly costs are placed at months 10, 25, 45, 65, and 85. We considered four scenarios with a sample size of 1000 or 5000 and a censoring rate of 20% or 50%. In each scenario, we conducted 200 Monte Carlo simulations and averaged the results. The convergence criterion is that for each element θ in Θ1, |θ(m) − θ(m−1)| should be less than 10^−4. The penalty parameter is chosen as λ1 = 0.01 in preliminary runs.
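Two ingredients of this design are easy to sketch: drawing survival times from the truncated exponential distribution and checking the element-wise convergence criterion. The code below is a minimal sketch, assuming that "right truncated at 100" means conditioning the exponential draw on being at most 100; the function names and the seed are hypothetical, and the cost model and censoring mechanism are not reproduced here.

```python
import math
import random

def truncated_exponential(mean, upper, rng):
    """Draw from an exponential distribution with the given mean,
    conditioned on the draw being <= upper (inverse-CDF method)."""
    u = rng.random()
    mass = 1.0 - math.exp(-upper / mean)  # P(S <= upper) before truncation
    return -mean * math.log(1.0 - u * mass)

def converged(theta_new, theta_old, tol=1e-4):
    """Element-wise criterion: every |theta(m) - theta(m-1)| < tol."""
    return all(abs(a - b) < tol for a, b in zip(theta_new, theta_old))

rng = random.Random(2024)  # arbitrary seed for reproducibility
survival = [truncated_exponential(45.0, 100.0, rng) for _ in range(20000)]
sample_mean = sum(survival) / len(survival)

# Theoretical mean of the truncated distribution, for comparison.
mu, tau = 45.0, 100.0
theory_mean = mu - tau * math.exp(-tau / mu) / (1.0 - math.exp(-tau / mu))
print(f"sample mean {sample_mean:.2f}, theoretical mean {theory_mean:.2f}")
```

Note that truncation pulls the mean survival time below the nominal 45 months, so long survival times are relatively sparse even before censoring is applied.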
Figure 4 shows that the proposed trajectory estimator has little bias when estimating the true trajectory (the red and black curves almost overlap). The crude method has some finite-sample bias, mostly because of its large variation: it does not use the cost data from the censored subjects and the estimation of one trajectory does not borrow information from the estimation of other trajectories. In both the simulation and the data application, the subgroup sample size may be relatively small for subjects with long survival times, and this makes the crude method inefficient. Figure 5 shows that the proposed method has a favorable mean squared error (MSE) compared with the crude method in all scenarios. The MSE improvement is more prominent at the right tails of the curves where the data are sparser due to censoring.
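The pointwise bias and MSE summaries behind comparisons of this kind can be computed as sketched below. This is a generic Monte Carlo summary, not the authors' code; the function name and the toy numbers are hypothetical.

```python
def pointwise_bias_mse(estimates, truth):
    """Monte Carlo bias and MSE at each month.

    estimates: list of replications, each a list of trajectory values.
    truth: the true trajectory evaluated at the same months.
    """
    n_rep = len(estimates)
    bias, mse = [], []
    for j, true_value in enumerate(truth):
        errors = [est[j] - true_value for est in estimates]
        bias.append(sum(errors) / n_rep)
        mse.append(sum(e * e for e in errors) / n_rep)
    return bias, mse

# Toy example with two replications and three months (made-up values).
truth = [9.0, 7.0, 8.0]
estimates = [[9.5, 7.0, 8.0], [8.5, 7.0, 9.0]]
bias, mse = pointwise_bias_mse(estimates, truth)
```

In the toy example the first month has zero bias but nonzero MSE, which illustrates why an estimator with large variation can compare poorly on MSE even when its average is close to the truth.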
We conducted additional simulations to study the robustness of the proposed method in the context of the data application. The simulation design and results are included in Section S1 of the Online Supplementary Materials. Specifically, we studied the performance of the proposed method when the distribution of the data deviates from multivariate normality. We simulated data from either a skew normal distribution or a mixture of two normal distributions. The simulation results show that under these two types of misspecified modeling assumptions, the proposed model produces trajectory estimates that nearly overlap with the true trajectories. Similar robustness of normal theory models has been widely recognized in other statistical contexts (Carroll and Ruppert, 1988).
5. DISCUSSION
Motivated by the challenges in summarizing and modeling Medicare costs for patients with stage IV breast cancer, this paper studied longitudinal medical cost data from a new perspective, i.e., estimating the average incident cost trajectory conditional on survival. The survival time and the mean cost trajectory form a bivariate surface on a triangular region, which is estimated by a semiparametric model that includes a bivariate penalized spline model for the surface and nonparametric estimation for survival. The estimation properly accounts for right censoring of survival times. The rich information offered by longitudinal trajectories of monthly cost data cannot be obtained from the analysis of cumulative or total costs, as is the focus of the statistical literature reviewed in Section 1, nor can it be obtained by summarizing costs into pre-defined phases of care, as done by the NCI’s cost summary. This highlights the significance of our research question and proposed methodology. Using the proposed methods, we found that almost all stage IV breast cancer patients underwent costly treatments during the first 6–12 months after diagnosis, though the amount varied substantially. Their cost trajectories started to diverge as they entered the surveillance phase after completing active treatment, and the treatment cost started to climb at various time points, depending on the patients’ survival time. Because the transition between the surveillance phase and the end-of-life phase happened gradually and was heterogeneous across patients, there is no clear-cut time separating the two phases, as used in the NCI’s cost summary. The R code for implementing the proposed methodology is available upon request.
The assumption that the transformed monthly costs of a patient follow a multivariate normal distribution is both a limitation and an advantage of the proposed method. On the one hand, this assumption may be violated if an excessive proportion of the monthly medical costs are zero. In our data set, 10% of the 103,973 monthly incident costs from the 4,950 patients were zero. The proportion of zero cost months was small in this data set because the study cohort consisted of patients with stage IV breast cancer. Alternatively, we can aggregate the monthly costs into quarterly costs, which resulted in just 3% zero quarterly costs in our study cohort. In populations with less disease burden, there may be more zero costs in the data and it may be necessary to introduce a two-part model (Liu et al 2008) feature to our proposed methodology for adjustment. That will be studied in our future work. On the other hand, an advantage of building this parametric distribution assumption into the proposed semiparametric modeling framework is that once the estimation is complete on the transformed cost data, we can easily obtain the distribution of monthly costs in dollar amounts given any survival time, including medians and various other quantiles, as illustrated in Figure 2 (B). Without this assumption, one would need to jointly model survival and multiple nonlinear quantile functions of skewed longitudinal cost data, for which statistical methodology is yet to be developed. In Section S1 of the Online Supplementary Materials, we present a simulation study with data simulated from either a skew normal distribution or a mixture of two normal distributions. The proposed method is shown to produce estimated trajectories that are robust to those two types of model violation.
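The back-transformation to the dollar scale can be illustrated with a short sketch. It assumes, for illustration only, that the transformation is a plain natural logarithm, so that log cost at a given month and survival time is normal with mean `mu` and variance `sigma2`; the exact transformation used in the paper is not restated in this section, and the value of `mu` below is hypothetical. The variance 8.53 is the estimate reported in Section 3.

```python
import math
from statistics import NormalDist

def dollar_quantile(mu, sigma2, p):
    """p-th quantile of cost in dollars when log(cost) ~ N(mu, sigma2).

    Quantiles are preserved under monotone transformations, so the
    normal quantile is simply exponentiated.
    """
    z = NormalDist().inv_cdf(p)
    return math.exp(mu + math.sqrt(sigma2) * z)

mu = 8.0        # hypothetical conditional mean of log cost
sigma2 = 8.53   # variance estimate reported in Section 3

median = dollar_quantile(mu, sigma2, 0.50)
q25 = dollar_quantile(mu, sigma2, 0.25)
q75 = dollar_quantile(mu, sigma2, 0.75)

# Under log-normality the mean exceeds the median because of skewness.
mean = math.exp(mu + sigma2 / 2.0)
print(f"Q1 {q25:,.0f}  median {median:,.0f}  Q3 {q75:,.0f}  mean {mean:,.0f}")
```

The gap between the mean and the median under this assumption shows why reporting quantiles alongside means is informative for skewed cost data.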
We used the coarsening approximation with the maximum number of intervals K ≡ 10 in the numerical studies. The proposed coarsening approximation is equivalent to using a discrete distribution to approximate the conditional distribution of given the observed data and . We compared the coarsening approximation with K = 10 with another, much finer approximation that sets all the intervals of all subjects to be about one month. Since the proposed coarsening rule (Section 2.3) requires each interval to be at least one month in length, the second approach is equivalent to not setting the cap K, and Ki, the number of intervals per subject, roughly equals the number of months from the censoring time to τ. In the data analysis of Section 3, if we use the monthly coarsening approximation, the EM algorithm converges in 232 minutes on a computer with a 3.30 GHz CPU and 16 GB of memory. For a coarsening approximation with K = 10, the convergence takes 30 minutes. The estimated surfaces with or without the coarsening approximation are nearly the same. Hence, in this example the coarsening approximation reduced the computing time by 87% without notable change in the estimates. In Section S2 of the Online Supplementary Materials, we present a simulation study to compare the coarsening approximation with K = 10 and the monthly coarsening approximation without the cap K. The simulation shows that the two approximations produced very similar estimated trajectories with no noticeable differences. Both are very close to the true trajectories. This robustness and time-saving feature can be important in some research studies. In studies of health economics, medical cost data are usually obtained from large patient registries. Therefore, we believe that developing computationally tractable approaches for large data sets is very important for this kind of problem.
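The idea of capping the number of intervals can be sketched as follows. This is not the coarsening rule of Section 2.3, which is not restated here; it is a simplified stand-in that assumes equal-width intervals between the censoring time and τ, at most `max_intervals` of them, each at least `min_length` months long where the remaining span allows. The function name and the example inputs are hypothetical.

```python
import math

def coarsen(censor_time, tau, max_intervals=10, min_length=1.0):
    """Split (censor_time, tau] into equal-width intervals.

    Returns a list of (left, right, midpoint) tuples. The number of
    intervals is capped at max_intervals and chosen so that each
    interval is at least min_length long when the span permits.
    """
    span = tau - censor_time
    if span <= 0:
        return []
    k = max(1, min(max_intervals, math.floor(span / min_length)))
    width = span / k
    return [
        (censor_time + j * width,
         censor_time + (j + 1) * width,
         censor_time + (j + 0.5) * width)
        for j in range(k)
    ]

early = coarsen(20.0, 104.0)                     # capped at 10 intervals
late = coarsen(100.0, 104.0)                     # only 4 one-month intervals fit
monthly = coarsen(20.0, 104.0, max_intervals=10**6)  # effectively no cap

# Relative reduction in run time reported in the text: 232 vs 30 minutes.
time_saved = (232 - 30) / 232
```

With the cap, a subject censored early contributes 10 support points instead of 84, which gives a sense of where the reported saving in computing time comes from.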
In a shared parameter joint model of longitudinal and survival data, Barrett et al (2015) demonstrated that proper discretization of the time scale can make the computation tractable without notable bias or efficiency loss.
The proposed methodology can be extended to terminal events that are subject to competing risks (e.g., breast cancer death vs. death due to other causes) by replacing the Kaplan-Meier estimate with the cause-specific survival distribution. This will lead to estimated cost trajectories specific to the designated terminal event type. The goal of this paper is to estimate the population averaged longitudinal cost trajectory with proper adjustment for survival and censoring. This type of marginal estimation is useful for health services research and health policy making, to provide a basis for cost projection of a defined population. For future research, it is of interest to study how the shapes of the possibly nonlinear longitudinal cost trajectories are influenced by both covariates and survival.
In Section 2.2, a regularity condition needed for estimating the entire marginal distribution of by the Kaplan-Meier method is that the support of lies inside the support of the censoring time C. This regularity condition holds approximately for the motivating data application, where the terminal event is death and only about 6.5% of the subjects survived beyond the largest observed follow-up time at 104 months. When a substantial proportion of subjects are censored beyond the largest observed failure time, this regularity assumption does not hold, making in (1) not everywhere identifiable in the interval (Ti, ∞) for the integral. The proposed methodology can be extended to accommodate this situation and estimate the cost trajectory up to the maximum follow-up time. Let τC denote the maximum follow-up time, and subjects are censored beyond τC. The complete data log-likelihood (3) can be modified to be
For each censored subject, the coarsening approximation uses Ki + 1 intervals: the Ki intervals that satisfy plus one extra interval to account for the possibility that . Additional parameters Θ2 are introduced to characterize the conditional distribution of Yi when falls in this extra interval (τC, ∞). The designation of the middle point li(Ki + 1) is not needed for this interval. The rest of the EM algorithm is similar to that in Section 2.3.
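The regularity condition discussed above can be made concrete with a small Kaplan-Meier computation. The sketch below is a textbook product-limit estimator written from scratch, with made-up data; it is not the authors' implementation. When the largest observed times are censored, the estimated survival function does not reach zero, and the leftover probability is the mass that an extra interval beyond the maximum follow-up time would need to absorb.

```python
def kaplan_meier(times, events):
    """Product-limit estimate of the survival function.

    times: observed follow-up times.
    events: 1 if the terminal event was observed, 0 if censored.
    Returns [(t, S(t))] at each distinct event time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all subjects sharing the same observed time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            surv *= 1.0 - deaths / at_risk
            curve.append((t, surv))
        at_risk -= removed
    return curve

# Toy data: two subjects are censored at the maximum follow-up (104 months).
times = [5, 10, 10, 20, 30, 104, 104]
events = [1, 1, 0, 1, 1, 0, 0]
curve = kaplan_meier(times, events)
mass_beyond_followup = curve[-1][1]  # survival estimate never reaches zero
```

In this toy example the estimate levels off at 5/14 after month 30, so the distribution of survival beyond 104 months is not identified from the data.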
Supplementary Material
Acknowledgments
This work was supported by the Duncan Family Institute (Shih), the Agency for Healthcare Research and Quality (Shih, R01HS020263) and the Cancer Center Support Grant from the NIH (Li, Ning, Huang, Shih, Shen, P30CA016672). The authors would like to thank Ying Xu, Department of Health Services Research at The University of Texas MD Anderson Cancer Center, for the creation of the analytical sample, and thank Lee Ann Chastain, Department of Biostatistics, for editing assistance. This study used the linked SEER-Medicare database. The interpretation and reporting of these data are the sole responsibility of the authors. The authors acknowledge the efforts of the Applied Research Program, NCI; the Office of Research, Development and Information, CMS; Information Management Services (IMS), Inc.; and the Surveillance, Epidemiology, and End Results (SEER) Program tumor registries in the creation of the SEER-Medicare database. Dr. Shih had full access to all of the data in the study and takes responsibility for the integrity of the data and the accuracy of the data analysis.
Contributor Information
Liang Li, Department of Biostatistics, The University of Texas MD Anderson Cancer Center, 1400 Pressler St., Houston, TX, 77030.
Chih-Hsien Wu, Department of Biostatistics, The University of Texas MD Anderson Cancer Center.
Jing Ning, Department of Biostatistics, The University of Texas MD Anderson Cancer Center, 1400 Pressler St., Houston, TX, 77030.
Xuelin Huang, Department of Biostatistics, The University of Texas MD Anderson Cancer Center, 1400 Pressler St., Houston, TX, 77030.
Ya-Chen Tina Shih, Department of Health Services Research, The University of Texas MD Anderson Cancer Center.
Yu Shen, Department of Biostatistics, The University of Texas MD Anderson Cancer Center, 1400 Pressler St., Houston, TX, 77030.
References
- Akushevich I, Kravchenko J, Akushevich L, Ukraintseva S, Arbeev K, Yashin AI. Medical cost trajectories and onsets of cancer and noncancer diseases in US elderly population. Comput Math Methods Med. 2011;2011:857892. doi: 10.1155/2011/857892.
- Bang H. Medical cost analysis: application to colorectal cancer data from the SEER Medicare database. Contemp Clin Trials. 2005;26(5):586–597. doi: 10.1016/j.cct.2005.05.004.
- Bang H, Tsiatis AA. Estimating medical costs with censored data. Biometrika. 2000;87(2):329–343.
- Barrett J, Diggle P, Henderson R, Taylor-Robinson D. Joint modelling of repeated measurements and time-to-event outcomes: flexible model specification and exact likelihood inference. Journal of the Royal Statistical Society, Series B. 2015;77(1):131–148. doi: 10.1111/rssb.12060.
- Baser O, Gardiner JC, Bradley CJ, Yuce H, Given C. Longitudinal analysis of censored medical cost data. Health Economics. 2006;15(5):513–525. doi: 10.1002/hec.1087.
- Brown ML, Riley GF, Schussler N, Etzioni R. Estimating health care costs related to cancer treatment from SEER-Medicare data. Medical Care. 2002;40(8 Suppl):IV-104–17. doi: 10.1097/00005650-200208001-00014.
- Cai T, Hyndman RJ, Wand MP. Mixed model-based hazard estimation. Journal of Computational and Graphical Statistics. 2002;11(4):784–798.
- Carroll RJ, Ruppert D. Transformation and Weighting in Regression. Chapman & Hall; 1988.
- Chan KCG, Wang M-C. Backward estimation of stochastic processes with failure events as time origins. Annals of Applied Statistics. 2010;4(3):1602–1620. doi: 10.1214/09-AOAS319.
- Chen J, Liu L, Shih YC, Zhang D, Severini TA. A flexible model for correlated medical costs, with application to medical expenditure panel survey data. Statistics in Medicine. 2016;35(6):883–894. doi: 10.1002/sim.6743.
- Chen J, Liu L, Zhang D, Shih YC. A flexible model for the mean and variance functions, with application to medical cost data. Statistics in Medicine. 2013;32(24):4306–4318. doi: 10.1002/sim.5838.
- Dempster AP, Laird NM, Rubin DB. Maximum likelihood estimation from incomplete data via the EM algorithm (with discussion). Journal of the Royal Statistical Society, Series B. 1977;39:1–38.
- Dumont S, Jacobs P, Turcotte V, Anderson D, Harel F. The trajectory of palliative care costs over the last 5 months of life: a Canadian longitudinal study. Palliative Medicine. 2010;24(6):630–640. doi: 10.1177/0269216310368453.
- Etzioni R, Urban N, Baker M. Estimating the costs attributable to a disease with application to ovarian cancer. Journal of Clinical Epidemiology. 1996;49(1):95–103. doi: 10.1016/0895-4356(96)89259-6.
- Fitzmaurice G, Davidian M, Verbeke G, Molenberghs G. Longitudinal Data Analysis. Chapman & Hall/CRC; 2009.
- Hall P, Opsomer JD. Theory for penalized spline regression. Biometrika. 2005;92:105–118.
- Heitjan DF, Rubin DB. Ignorability and Coarse Data. Annals of Statistics. 1991;19(4):2244–2253.
- Li Y, Ruppert D. On the asymptotics of penalized splines. Biometrika. 2008;95(2):415–436.
- Liang KY, Zeger SL. Longitudinal data analysis using generalized linear models. Biometrika. 1986;73:13–22.
- Lin DY. Linear regression analysis of censored medical costs. Biostatistics. 2000;1(1):35–47. doi: 10.1093/biostatistics/1.1.35.
- Lin DY. Regression analysis of incomplete medical cost data. Statistics in Medicine. 2003;22(7):1181–1200. doi: 10.1002/sim.1377.
- Lin DY, Feuer EJ, Etzioni R, Wax Y. Estimating medical costs from incomplete follow-up data. Biometrics. 1997;53:419–434.
- Liu L. Joint modeling longitudinal semi-continuous data and survival, with application to longitudinal medical cost data. Statistics in Medicine. 2009;28(6):972–986. doi: 10.1002/sim.3497.
- Liu L, Conaway MR, Knaus WA, Bergin JD. A random effects four-part model, with application to correlated medical cost. Computational Statistics and Data Analysis. 2008;52:4458–4473.
- Liu L, Huang X, O’Quigley J. Analysis of longitudinal data in the presence of informative observational times and a dependent terminal event, with application to medical cost data. Biometrics. 2008;64(3):950–958. doi: 10.1111/j.1541-0420.2007.00954.x.
- Liu L, Wolfe RA, Kalbfleisch JD. A shared random effects model for censored medical costs and mortality. Statistics in Medicine. 2007;26:139–155. doi: 10.1002/sim.2535.
- Mariotto AB, Yabroff KR, Shao Y, Feuer EJ, Brown ML. Projections of the cost of cancer care in the United States, 2010–2020. Journal of the National Cancer Institute. 2011;103(2):117–128. doi: 10.1093/jnci/djq495.
- McLachlan GJ, Do Kim-Anh, Ambroise C. Analyzing microarray gene expression data. John Wiley & Sons, Inc.; 2004.
- O’Hagan A, Stevens JW. On estimators of medical costs with censored data. Journal of Health Economics. 2004;23:615–625. doi: 10.1016/j.jhealeco.2003.06.006.
- Pullenayegum EM, Willan AR. Marginal models for censored longitudinal cost data: appropriate working variance matrices in inverse-probability-weighted GEEs can improve precision. The International Journal of Biostatistics. 2011;7(1):Article 14.
- Ruppert D, Wand MP, Carroll RJ. Semiparametric Regression. Cambridge University Press; 2003.
- Ruppert D, Wand MP, Carroll RJ. Semiparametric Regression during 2003–2007. Electronic Journal of Statistics. 2009;3:1193–1256. doi: 10.1214/09-EJS525.
- SAS Institute. SAS/STAT 9.2 User’s Guide: The GENMOD Procedure. SAS Institute Inc.; Cary, NC, USA: 2008.
- Wang X, Shen J, Ruppert D. On the asymptotics of penalized spline smoothing. Electronic Journal of Statistics. 2011;5:1–17.
- Yabroff KR, Mariotto AB, Feuer E, Brown ML. Projections of the costs associated with colorectal cancer care in the United States, 2000–2020. Health Economics. 2008;17(8):947–959. doi: 10.1002/hec.1307.
- Yabroff KR, Lamont EB, Mariotto A, Warren JL, Topor M, Meekins A, Brown ML. Cost of care for elderly cancer patients in the United States. Journal of the National Cancer Institute. 2008;100(9):630–641. doi: 10.1093/jnci/djn103.
- Yabroff KR, Warren JL, Schrag D, Mariotto A, Meekins A, Topor M, Brown ML. Comparison of approaches for estimating incidence costs of care for colorectal cancer patients. Medical Care. 2009;47(7 Suppl 1):S56–S63. doi: 10.1097/MLR.0b013e3181a4f482.
- Zhao H, Cheng Y, Bang H. Some insight on censored cost estimators. Statistics in Medicine. 2011;30(19):2381–2388. doi: 10.1002/sim.4295.
- Zhao H, Tian L. On estimating medical cost and incremental cost-effectiveness ratios with censored data. Biometrics. 2001;57:1002–1008. doi: 10.1111/j.0006-341x.2001.01002.x.
- Zhao H, Zuo C, Chen S, Bang H. Nonparametric inference for median costs with censored data. Biometrics. 2012;68(3):717–725. doi: 10.1111/j.1541-0420.2012.01755.x.