Abstract
We consider a random effects model for longitudinal data with the occurrence of an informative terminal event that is subject to right censoring. Existing methods for analyzing such data include the joint modeling approach using latent frailty and the marginal estimating equation approach using inverse probability weighting; in both cases the effect of the terminal event on the response variable is not explicit and thus not easily interpreted. In contrast, we treat the terminal event time as a covariate in a conditional model for the longitudinal data, which provides a straightforward interpretation while keeping the usual relationship of interest between the longitudinally measured response variable and covariates for times that are far from the terminal event. A two-stage semiparametric likelihood-based approach is proposed for estimating the regression parameters: first, the conditional distribution of the right-censored terminal event time given the covariates is estimated; then the regression parameters are estimated by maximizing the likelihood function for the longitudinal data given the terminal event. Desirable asymptotic properties are established, and the method is illustrated by numerical simulations and by analyzing medical cost data for patients with end-stage renal disease.
Keywords: Cox regression, Empirical process, Mixed effects model, Pseudo-maximum likelihood estimation
1 Introduction
In longitudinal studies, the collection of information can be stopped at the end of the study, at the time of dropout of a study participant, or at the time of a terminal event. Death, the most common terminal event, often occurs in cohort studies of older populations and in fatal disease follow-up studies, e.g., organ failure or cancer studies. Other types of terminal events also exist, for example, the final menstrual period is a terminal event for menstrual cycle data.
The current literature has primarily focused on modeling the longitudinally measured response variable and covariates given that the terminal event has not yet happened; see e.g. Tsiatis and Davidian (2004), Hsieh et al. (2006), Ding and Wang (2008), Albert and Shih (2010). If the terminal event is ignorable (Little and Rubin, 2002), then a likelihood-based estimation of regression parameters is straightforward. Oftentimes, however, the terminal event time is non-ignorable. Two types of approaches are widely used for longitudinal data analysis with non-ignorable terminal events: the joint modeling approach using latent frailty and the marginal estimating equation approach using inverse probability weighting. In the former, the relationship between the terminal event and the longitudinal data is indirectly modeled through the shared random effect. The latter approach is appropriate when the terminal event is simply censoring the observations of the longitudinal process, which is in fact continuing but unobserved; its use when the terminal event stops the longitudinal process is more controversial. Similar approaches have also been used in the context of recurrent events correlated with a terminal event; for example, see Ghosh and Lin (2002), Huang and Wang (2004), Zeng and Lin (2009), Albert and Shih (2010), Kalbfleisch et al. (2013), among many others.
These modeling strategies, however, are not as useful as one might wish for many longitudinal studies where the explicit effect of the terminal event time on the longitudinal measures is of interest. For example, medical payments in dialysis patients (Liu et al., 2007) and cancer patients (Chan and Wang, 2010) tend to increase when patients approach death; functional limitations in an aging population (Sowers et al., 2007) become more severe when people are closer to the end of life; and menstrual cycles become longer and more variable when women approach menopause (Harlow et al., 2008). In these cases, a question of interest is how the impending terminal event affects the longitudinal measures; for this question, a model for the longitudinal data conditional on the terminal event seems particularly useful and appropriate.
In this article, we propose a random effects model for repeated measures which includes the event time as an additional (fixed effect) covariate, and thus provides a more intuitive and meaningful interpretation of the effect of the terminal event time. The proposed conditional modeling strategy keeps the usual relationship of interest between the longitudinally measured response variable and covariates when the data collection time is far from the occurrence of the terminal event, but the response variable becomes increasingly dependent on the terminal event time when the data collection time is close to the terminal event. Since the terminal event time is subject to right censoring, the regression model with the terminal event time as a covariate falls into the general framework of regression with a censored covariate. For this situation, the complete case analysis, which drops observations with censored event times, will be shown to be a valid estimation approach under the usual noninformative, conditionally independent censoring assumption.
We propose a semiparametric, likelihood-based approach for parameter estimation in a linear regression model with a nonlinear component for the censored covariate, that utilizes both the complete and censored data. The proposed method is shown to be consistent and asymptotically normal under a set of mild regularity conditions, and is more efficient than the complete case analysis. The proofs of the asymptotic properties rely heavily on empirical process theory. A referee drew our attention to Li et al. (2013), which has a similar aim of recovering information from censored data. We comment further on this work later in the Discussion Section.
The rest of the article is organized as follows. We describe the proposed model in Section 2 and the two-stage estimating method in Section 3. The asymptotic properties are outlined in Section 4 with proofs given in the Appendix. Section 5 contains numerical results followed by a brief discussion. Detailed technical preparations are provided in the online Supplementary Material.
2 A Nonlinear Regression Model with Mixed Effects and Censored Covariate
2.1 Complete data model with observed terminal event time
For a subject i, denote the terminal event time by Si, the baseline covariates by a vector Xi where the first element is 1, the longitudinal response by Yij, and the prespecified visit time by tij, where i = 1, ⋯, n and j = 1, ⋯, ni. For given Si, we model Yij with the following mixed effect model for longitudinal data:
Yij = Xi′β + g(Si − tij, ξ) + Zi′bi + Ui(tij) + εij,  (1)
where β is a vector of regression coefficients with length p1, bi is a random effects vector of length q1 associated with covariates Zi, Ui(t) is an independent stochastic process, εij, j = 1, …, ni, are independent measurement errors, g is a known function that satisfies Condition 1 in the Appendix, and ξ is a vector with length p2. The function g satisfies g(t, ξ) → 0 as t → ∞, so that model (1) reduces to the usual relationship of interest between the longitudinally measured response variable Yij and covariates Xi when tij is distant from Si, while the response becomes increasingly related to the terminal event when tij is close to Si. Motivated by figure 5 in Chan and Wang (2010), we can choose g(S − t, ξ) to be a normal kernel, g(S − t, ξ) = ξ1e−(S−t−ξ2)2ξ3 with ξ = (ξ1, ξ2, ξ3)′. Other examples include an exponential kernel, g(t, ξ) = ξ1e−(t−ξ2) with ξ = (ξ1, ξ2)′.
We make the following additional assumptions: (i) bi follows a normal distribution N(0, D(φ)), where D is a positive definite matrix depending on a parameter vector φ with length q2; (ii) Ui(t) is a mean zero Gaussian process with a given covariance function cov(Ui(t1), Ui(t2)) = κ(ν, ρ; t1, t2) that depends on a parameter vector ν with length q3 and a scalar ρ; for example, Ui(t) can be the nonhomogeneous Ornstein-Uhlenbeck (NOU) process satisfying var(Ui(t)) = ν(t) with log(ν(t)) = ν0 + ν1t and corr(Ui(t1), Ui(t2)) = ρ|t1−t2|; (iii) εij follows a normal distribution N(0, σ2); and (iv) bi, Ui(t), and εij are mutually independent.
For a vector t = (t1, ⋯, tm), denote g(t, ξ) = (g(t1, ξ), ⋯, g(tm, ξ))′, and let Yi = (Yi1, ⋯, Yini)′ and ti = (ti1, ⋯, tini)′. When Si is observed, from (1) we have
Yi | Si, Xi ~ N{(Xi′β)1i + g(Si1i − ti, ξ), Σi},  (2)
where 1i = (1, ⋯, 1)′ with length ni, θ = (β, ξ)′ with length p = p1 + p2, ϕ = (φ, ν, ρ, σ2)′ with length q = q2 + q3 + 2, and Σi = ZiDZi′ + Γi + σ2Ii, where Ii is the ni × ni identity matrix and Γi is the covariance matrix of (U(ti1), ⋯, U(tini))′.
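To make these ingredients concrete, here is a minimal numerical sketch (Python with NumPy; the function names are ours, not from any package) of the normal kernel g and of the marginal covariance Σi = ZiDZi′ + Γi + σ2Ii under the NOU specification:

```python
import numpy as np

def g_normal(u, xi1, xi2, xi3):
    # Normal-kernel terminal-event effect: g(u, xi) = xi1 * exp(-(u - xi2)^2 * xi3),
    # where u = S - t; the effect vanishes as u grows (far from the terminal event).
    return xi1 * np.exp(-((u - xi2) ** 2) * xi3)

def nou_cov(times, nu0, nu1, rho):
    # NOU process covariance: var U(t) = exp(nu0 + nu1 * t),
    # corr(U(t1), U(t2)) = rho ** |t1 - t2|.
    t = np.asarray(times, dtype=float)
    sd = np.exp(0.5 * (nu0 + nu1 * t))
    return np.outer(sd, sd) * rho ** np.abs(t[:, None] - t[None, :])

def marginal_cov(times, Z, D, nu0, nu1, rho, sigma2):
    # Sigma_i = Z D Z' + Gamma_i + sigma^2 I, the covariance in model (2).
    Z = np.atleast_2d(Z)
    return Z @ D @ Z.T + nou_cov(times, nu0, nu1, rho) + sigma2 * np.eye(len(times))

# Example: three visits, a random intercept with variance 0.5,
# a stationary NOU process (nu0 = nu1 = 0) and unit error variance.
Sigma = marginal_cov([0.0, 1.0, 2.0], np.ones((3, 1)), np.array([[0.5]]),
                     nu0=0.0, nu1=0.0, rho=0.5, sigma2=1.0)
```

The example values are illustrative only; in the example, each diagonal entry is 0.5 + 1 + 1 and the (1, 2) entry is 0.5 + 0.5, so the three variance components are easy to trace through the code.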
A semiparametric mixed effects model could also be considered, where g is an unknown function that can be estimated by smoothing splines. We focus on the parametric model (1) to more simply illustrate the proposed methodology.
2.2 Observed data model with potentially censored terminal event time
Let Ci be the censoring time for the ith subject. If Si ≤ Ci, then Si is observed; otherwise Si is right-censored by Ci. We denote the observed time by Vi = min(Si, Ci) and the censoring indicator by Δi = 1(Si ≤ Ci). Note that tij ≤ Vi for all i = 1, ⋯, n and j = 1, ⋯, ni. Here, we assume that Ci and (Si, Yi) are conditionally independent given Xi.
For notational simplicity, assume that the random effect covariate Z is a sub-vector of X. For a single subject, we observe (V, Δ, Y, X). The likelihood function for the observed data (V, Δ, Y, X) can be factored into

f1(V, Δ, Y, X) = f2(V, Δ | Y, X)f3(Y | X)f4(X),
where f1 denotes the joint density of (V, Δ, Y, X), f2 denotes the conditional density of (V, Δ) given (Y, X), f3 denotes the conditional density of Y given X, and f4 denotes the marginal density of X. Since the conditional independence of C and (S, Y) given X implies that C and S are conditionally independent given (Y, X), we have
f2(V, Δ | Y, X) = {fS(V | Y, X)ḠC(V | Y, X)}Δ{gC(V | Y, X)F̄S(V | Y, X)}1−Δ,  (3)
where fS denotes the conditional density of S given (Y, X), gC denotes the conditional density of C given (Y, X), and F̄S and ḠC are the corresponding conditional survival functions. Further assuming noninformative censoring, we can drop the factors gC(V|Y, X) and ḠC(V|Y, X). Going through conditioning arguments using Bayes' rule and dropping f4(X), we obtain the likelihood function
{fθ,ϕ(Y | S, X)f5(S | X)}Δ{∫V∞ fθ,ϕ(Y | s, X) dF5(s | X)}1−Δ,  (4)
where f5(S|X) is the conditional density of S given X, and F5(S|X) is the corresponding cumulative distribution function. In (4), only fθ,ϕ contains the parameter of interest θ and nuisance parameter ϕ, whereas F5 (or f5) is an additional nuisance parameter.
In (4), {fθ,ϕ(Y |S, X)f5(S|X)}Δ is the contribution of a subject with observed terminal event time, which yields the fully observed data likelihood, and {∫V∞ fθ,ϕ(Y | s, X) dF5(s | X)}1−Δ is the contribution of a subject with censored terminal event time. In Section 4, we show that the complete case analysis obtained by dropping the second part in (4) yields a consistent and asymptotically normally distributed estimator, but is inefficient compared to an approach that also utilizes the censored data. From the second part in (4), we see that the amount of efficiency gain depends on how well we can estimate the right tail of the conditional distribution F5(s|X) beyond C. We consider a semiparametric approach that allows reliable extrapolation beyond C without relying on a restrictive parametric assumption.
Since Ci and (Si, Yi) are conditionally independent given Xi and Ci is random, all of the commonly used semiparametric models for right-censored data allow extrapolation beyond Ci. Here we adopt the most widely used one, the Cox regression model (Cox, 1972). Other viable models include the accelerated failure time model, the additive hazards model, and the transformation model (Kalbfleisch and Prentice, 2002). Suppose the hazard function of S given X has the following form:
λ(t | X) = λ(t)eα′X,  (5)
where α is the regression parameter with an unknown true value α0, and λ(·) is the baseline hazard function. The conditional cumulative distribution function is then given by
η(s; X) ≡ F5(s | X) = 1 − exp{−Λ(s)eα′X}, where Λ(t) = ∫0t λ(u)du is the cumulative baseline hazard function with an unknown true value Λ0. Note that X appears in both models (2) and (5), but these two instances may refer to different regressions; for example, X1 might be a covariate in (2) whereas some transformation of X1 is a covariate in (5). The same X is used to denote all fully observed covariates for notational simplicity. The log-likelihood function then becomes
Δ{log fθ,ϕ(Y | S, X) + log η̇(S; X)} + (1 − Δ) log ∫V∞ fθ,ϕ(Y | s, X) dη(s; X).  (6)
A similar idea has been used by Lu et al. (2010), but for a different problem: Lu et al. (2010) considered longitudinal data analysis with an event time that does not terminate the longitudinal measurements.
3 The Pseudo-likelihood Method
The log likelihood function (6) involves an unknown distribution function η and the corresponding density function η̇. Hence a maximum likelihood estimate, if it exists, can be complicated. We propose a tractable two-stage pseudo-likelihood approach in which the nuisance parameters (ϕ, η) are estimated in stage 1, and the parameter of interest θ is then estimated by maximizing (6) in stage 2 with nuisance parameters replaced by their estimators obtained in stage 1 (Kong and Nan, 2016). Details are given below:
Stage 1. Nuisance parameter estimation. The dispersion parameter ϕ is estimated by the complete case analysis of the nonlinear regression model (2); the Cox model regression coefficient α is estimated by maximizing the partial likelihood, and the cumulative baseline hazard Λ is estimated with the Breslow estimator (Breslow, 1972). Denote the estimators by ϕ̃n, α̃n, and Λ̃n, respectively. The c.d.f. η(s; X) is estimated by η̃n(s; X) = 1 − exp{−Λ̃n(s)eα̃n′X}, which is asymptotically equivalent to the product integral expression. It can be shown that all the estimates obtained in Stage 1 have desirable statistical properties. In particular, η̃n is n1/2-consistent in a finite interval, see Lemma A.3 in the Supplementary Material; ϕ̃n obtained from the complete case analysis is n1/2-consistent, see Theorem 4.1 in Section 4.
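A minimal sketch of the Stage 1 survival estimates, assuming the partial-likelihood estimate of α is already available (e.g., from a standard Cox regression package); the helper names below are ours:

```python
import numpy as np

def breslow_cumhaz(t, V, Delta, X, alpha):
    # Breslow estimator of the cumulative baseline hazard Lambda(t),
    # given Cox regression coefficients alpha (assumed already fitted
    # by maximizing the partial likelihood).
    V = np.asarray(V, dtype=float)
    Delta = np.asarray(Delta)
    risk = np.exp(np.asarray(X, dtype=float) @ np.asarray(alpha, dtype=float))
    lam = 0.0
    for v, d in sorted(zip(V, Delta)):
        if v <= t and d == 1:
            lam += 1.0 / risk[V >= v].sum()  # jump size at each event time
    return lam

def eta_tilde(s, x, V, Delta, X, alpha):
    # Estimated conditional c.d.f. of S given X = x under the Cox model:
    # eta(s; x) = 1 - exp{-Lambda(s) * exp(alpha' x)}.
    return 1.0 - np.exp(-breslow_cumhaz(s, V, Delta, X, alpha)
                        * np.exp(np.dot(x, alpha)))

# Toy data: three subjects, all with observed events, no covariate effect
# (alpha = 0), so the Breslow estimator reduces to the Nelson-Aalen estimator.
V = [1.0, 2.0, 3.0]; Delta = [1, 1, 1]; X = [[0.0], [0.0], [0.0]]
Lam2 = breslow_cumhaz(2.0, V, Delta, X, [0.0])   # 1/3 + 1/2 = 5/6
```

In practice one would obtain α̃n and Λ̃n from a standard Cox regression routine; this sketch only makes the mapping from (α̃n, Λ̃n) to η̃n explicit.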
Stage 2. Pseudo-likelihood estimation of θ. Replacing (ϕ, η) by their Stage 1 estimates (ϕ̃n, η̃n) in the log likelihood function yields the following log pseudo-likelihood function for a random sample of n subjects:

ℙn[Δ log fθ,ϕ̃n(Y | S, X) + (1 − Δ) log ∫V∞ fθ,ϕ̃n(Y | s, X) dη̃n(s; X)].  (7)

Note that the term Δ log η̇ in (6) is dropped because it does not involve θ. However, if one wants to maximize the log-likelihood directly without using the two-stage approach, then this term cannot be omitted.
Let θ̂n denote the pseudo-likelihood estimator. Since it is obtained by maximizing the objective function (7), its asymptotic properties can be obtained from M-estimation theory, see van der Vaart (2002), Wellner and Zhang (2007) and Li and Nan (2011).
The estimates (α̃n, Λ̃n), and hence η̃n, are obtained using a standard package for the Cox regression model. The estimates (θ̃n, ϕ̃n) from the complete case analysis are obtained by maximizing the complete case log-likelihood using a Newton-Raphson algorithm, where multiple initial values are tried. The two-stage estimator θ̂n is also obtained from a Newton-Raphson algorithm, with the complete case analysis estimator θ̃n as the initial value.
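The structure of the objective function (7) can be sketched as follows, with the conditional density and the Stage 1 estimate of the conditional law of S abstracted as function arguments (all names are hypothetical, and the toy density below is for illustration only, not the paper's model):

```python
import math

def pseudo_loglik(f_cond, eta_incr, subjects):
    # Log pseudo-likelihood (7): a subject with Delta = 1 contributes
    # log f(Y | S = V, X); a censored subject contributes the log of the
    # estimated-c.d.f.-weighted mixture over candidate event times s_k > V,
    # i.e. log sum_k f(Y | s_k, X) * d_eta_k.
    ll = 0.0
    for y, x, delta, v in subjects:
        if delta == 1:
            ll += math.log(f_cond(y, v, x))
        else:
            s_k, d_k = eta_incr(x)
            mass = sum(f_cond(y, s, x) * d for s, d in zip(s_k, d_k) if s > v)
            ll += math.log(mass)
    return ll

# Toy ingredients:
def f_cond(y, s, x):
    # illustrative density of Y given S = s: N(s, 1)
    return math.exp(-0.5 * (y - s) ** 2) / math.sqrt(2 * math.pi)

def eta_incr(x):
    # discrete Stage-1 estimate of the conditional law of S given x
    return [1.0, 2.0, 3.0], [0.3, 0.4, 0.3]

subjects = [(1.0, None, 1, 1.0),   # observed terminal event at S = 1
            (2.0, None, 0, 1.5)]   # censored at 1.5; mix over s in {2, 3}
ll = pseudo_loglik(f_cond, eta_incr, subjects)
```

Maximizing this objective over θ (here, over the parameters inside `f_cond`) with a Newton-Raphson or quasi-Newton routine gives the two-stage estimator.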
4 Asymptotic Properties
Let l0(θ, ϕ; Y, X, Δ, V) = Δ log fθ,ϕ(Y | S, X). This is the first part in the log-likelihood for the observed data. Then

l(θ, ϕ, η; Y, X, Δ, V) = l0(θ, ϕ; Y, X, Δ, V) + (1 − Δ) log ∫V∞ fθ,ϕ(Y | s, X) dη(s; X),

which is (6) with Δ log η̇ dropped.
A set of regularity conditions is introduced in the Appendix. Some conditions are commonly assumed for the Cox regression model; other conditions are for the mixed effects model, which are easily verified for a smooth function g and the NOU process. We will use standard empirical process notation from now on. In particular, ℙn is the empirical measure and Pf = ∫ fdP for a probability measure P and a function f.
Under the conditional independent censoring assumption, the estimators from the complete case analysis and the two-stage procedure, respectively, are consistent and asymptotically normal. These results are given in the following two theorems.
Theorem 4.1. (Complete case)
Assume that C and (S, Y) are conditionally independent given X. Under Conditions 1, 2(a), and 3–5, the complete case analysis estimator (θ̃n, ϕ̃n) that maximizes ℙnl0(θ, ϕ; Y, X, Δ, V) converges in outer probability to (θ0, ϕ0); and n1/2{(θ̃n, ϕ̃n) − (θ0, ϕ0)} converges in distribution to a mean zero normal random vector with variance J1−1Q1J1−1, where J1 and Q1 are provided in the Appendix.
Theorem 4.2. (Two-stage)
Assume that C and (S, Y) are conditionally independent given X. Under Conditions 1–8, the two-stage pseudo-likelihood estimator θ̂n that maximizes (7) converges in outer probability to θ0; and n1/2(θ̂n − θ0) converges in distribution to a mean zero normal random vector with variance J2−1Q2J2−1, where J2 and Q2 are defined in the Appendix.
The proofs of consistency are similar to those in Li and Nan (2011) and van der Vaart (2002). The proofs of asymptotic normality are based on general M-estimation theory, similar to Li and Nan (2011) and Wellner and Zhang (2007); the detailed proofs rely heavily on empirical process theory and are given in the Appendix.
Because the asymptotic variance of θ̂n has a very complicated expression that does not yield an easily computed estimate from the observed data, we use the bootstrap variance estimator.
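A generic sketch of the nonparametric bootstrap variance estimator (resampling subjects with replacement and re-running the whole two-stage procedure, abstracted here as `estimator`):

```python
import numpy as np

def bootstrap_variance(estimator, data, B=100, seed=0):
    # Nonparametric bootstrap: resample subjects with replacement, re-run the
    # full estimation procedure on each resample, and take the empirical
    # variance of the B replicate estimates.
    rng = np.random.default_rng(seed)
    n = len(data)
    reps = [estimator([data[i] for i in rng.integers(0, n, size=n)])
            for _ in range(B)]
    return np.var(np.asarray(reps), axis=0, ddof=1)

# Sanity check with a simple estimator: the bootstrap variance of the sample
# mean of n = 200 standard normal draws should be close to 1/200 = 0.005.
data = list(np.random.default_rng(1).standard_normal(200))
v = bootstrap_variance(lambda d: float(np.mean(d)), data, B=200)
```

For the two-stage procedure, `estimator` would refit the Cox model, recompute η̃n and ϕ̃n, and re-maximize (7) on each resample, so that the Stage 1 estimation error is reflected in the variance estimate.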
5 Numerical Results
5.1 Simulations
We conduct simulations to investigate the finite sample performance of the proposed method. Simulation data sets are generated from the nonlinear model with mixed effects,
Yij = β0 + β1X1i + β2X2i + γe−(Si−tij−μ)2ξ + bi + Ui(tij) + εij,

where β0 = 1, β1 = 1, β2 = −3, μ = 1, and γ = 4. The random effect bi ~ N(0, exp(−0.5)), the error term εij ~ N(0, exp(−0.1)), and Ui(t) is an NOU process with ν0 = 1, ν1 = −1 and ρ = exp(−1)/(1 + exp(−1)). The two fully observed covariates are X1i and X2i, where X1i ~ Bernoulli(0.5) and X2i ~ N(0, 1) truncated at ±3. The terminal event time is Si = 4 + S0i, where S0i follows an exponential distribution with conditional hazard function exp(−1 − 6X1i + 4X2i). To generate the censoring time Ci, we first generate C0i = κ + C̃0i, where C̃0i follows an exponential distribution with conditional hazard function exp(−3 − X1i + X2i), and then set Ci = tij, where j satisfies tij ≤ C0i and tij+1 > C0i, assuming tini+1 = ∞. The constant κ is chosen to yield 40% censoring. For each subject i, there are 10 scheduled visit times, and the first visit time ti1 is 0. There are two different settings for generating the subsequent visit times: (1) equally spaced time intervals with tij = j − 1, j = 2, ⋯, 10; (2) non-equally spaced time intervals with the subsequent visit times generated recursively from tij = tij−1 + min(4, Wi) for j = 2, ⋯, 10, where Wi follows an exponential distribution with conditional hazard function exp(−3 − X1i + X2i). In each setting, ξ takes two different values, 1.2 and 0.2, corresponding to a sharp and a flat nonlinear predictor in the regression model, respectively.
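A sketch of the data-generating mechanism for one subject in the equally spaced setting (Python/NumPy; the function name is ours, and the shift constant κ is left as an argument):

```python
import numpy as np

rng = np.random.default_rng(2024)

def simulate_subject(kappa=1.0, xi=1.2):
    # One subject from the equally spaced visit design: sharp kernel by
    # default; observation of Y stops at V = min(S, C).
    x1 = float(rng.binomial(1, 0.5))
    x2 = float(np.clip(rng.normal(), -3.0, 3.0))           # truncated N(0, 1)
    S = 4.0 + rng.exponential(1.0 / np.exp(-1 - 6 * x1 + 4 * x2))
    C0 = kappa + rng.exponential(1.0 / np.exp(-3 - x1 + x2))
    t_all = np.arange(10.0)                                 # visits t = 0, ..., 9
    C = t_all[t_all <= C0].max()                            # last visit time <= C0
    V = min(S, C)
    t = t_all[t_all <= V]                                   # visits stop at V
    # random intercept, NOU process and measurement error
    b = rng.normal(0.0, np.exp(-0.25))                      # var = exp(-0.5)
    sd = np.exp(0.5 * (1.0 - t))                            # log var(U(t)) = 1 - t
    rho = np.exp(-1) / (1 + np.exp(-1))
    Gam = np.outer(sd, sd) * rho ** np.abs(t[:, None] - t[None, :])
    U = rng.multivariate_normal(np.zeros(len(t)), Gam)
    eps = rng.normal(0.0, np.exp(-0.05), size=len(t))       # var = exp(-0.1)
    g = 4.0 * np.exp(-((S - t - 1.0) ** 2) * xi)            # gamma = 4, mu = 1
    y = 1.0 + 1.0 * x1 - 3.0 * x2 + g + b + U + eps
    return t, y, S, V, S <= C

sample = [simulate_subject() for _ in range(50)]
```

In the actual study κ is tuned so that roughly 40% of the terminal event times are censored; the sketch takes κ as given.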
We simulate 500 replications for each scenario with sample size 300. The biases and variances of the proposed method are compared with those of the full data and complete case analyses. The full data analysis represents the case where all data are available, in other words there is no censoring; it has more observed visits and serves as a benchmark. The complete case analysis simply eliminates subjects with a censored terminal event time. For the proposed two-stage method, we report the 90% and 95% coverage proportions, for which the variance estimates are obtained from 100 bootstrap samples. The results are presented in Tables 1–4.
Table 1.
Simulation results for equally spaced time intervals with sharp nonlinear term. varb = bootstrap variance estimator; CR = coverage rate
| | | β0 = 1 | β1 = 1 | β2 = −3 | μ = 1 | γ = 4 | ξ = 1.2 |
|---|---|---|---|---|---|---|---|
| Full data | bias | −0.0064 | 0.0011 | −0.0002 | 0.0001 | −0.0032 | 0.0031 |
| | var | 0.0801 | 0.0115 | 0.0082 | 0.0002 | 0.0140 | 0.0032 |
| Two-stage | bias | −0.0153 | 0.0045 | 0.0003 | −0.0010 | −0.0030 | 0.0039 |
| | var | 0.0973 | 0.0144 | 0.0098 | 0.0003 | 0.0166 | 0.0040 |
| | varb | 0.1094 | 0.0161 | 0.0102 | 0.0003 | 0.0151 | 0.0043 |
| | 90% CR | 0.904 | 0.876 | 0.898 | 0.918 | 0.896 | 0.912 |
| | 95% CR | 0.966 | 0.944 | 0.960 | 0.972 | 0.952 | 0.946 |
| Complete case | bias | −0.0092 | 0.0010 | 0.0041 | −0.0008 | −0.0024 | 0.0024 |
| | var | 0.1217 | 0.0208 | 0.0130 | 0.0003 | 0.0242 | 0.0053 |
Table 4.
Simulation results for non-equally spaced time intervals with flat nonlinear term. varb = bootstrap variance estimator; CR = coverage rate
| | | β0 = 1 | β1 = 1 | β2 = −3 | μ = 1 | γ = 4 | ξ = 0.2 |
|---|---|---|---|---|---|---|---|
| Full data | bias | −0.0141 | −0.0044 | 0.0336 | −0.0005 | 0.0015 | 0.0037 |
| | var | 0.1951 | 0.0189 | 0.0789 | 0.0036 | 0.0187 | 0.0013 |
| Two-stage | bias | −0.0371 | 0.0035 | 0.0401 | −0.0055 | −0.0053 | 0.0048 |
| | var | 0.2471 | 0.0239 | 0.1019 | 0.0054 | 0.0225 | 0.0018 |
| | varb | 0.3158 | 0.0227 | 0.1485 | 0.0063 | 0.0220 | 0.0019 |
| | 90% CR | 0.916 | 0.896 | 0.882 | 0.910 | 0.926 | 0.886 |
| | 95% CR | 0.966 | 0.952 | 0.946 | 0.954 | 0.956 | 0.940 |
| Complete case | bias | −0.0306 | −0.0133 | 0.0772 | −0.0059 | −0.0021 | 0.0049 |
| | var | 0.4136 | 0.0388 | 0.1896 | 0.0078 | 0.0356 | 0.0026 |
The results suggest that the biases for the proposed two-stage method are minimal and comparable to both the full data analysis and the complete case analysis. From the tables, it can be seen that the proposed method is much more efficient than the complete case analysis, and the bootstrap method performs well in estimating the variance, yielding reasonable coverage rates for all the scenarios.
We run additional simulations to further investigate the impact of survival model misspecification in (5). The results are provided in the Supplementary Material, where it is shown that misspecification of the Cox regression model can yield biased results, and that the bias increases as the severity of misspecification of (5) grows; this indicates the importance of model-checking before implementing the proposed two-stage method.
5.2 End-stage renal disease
We consider data on inpatient hospital costs of patients with end-stage renal disease (ESRD) as reported in an analysis file provided by the United States Renal Data System (USRDS); this provides an illustrative example of longitudinal data with a terminal event. These costs are of substantial interest, since Medicare paid about $10.5 billion in 2012 for inpatient costs (USRDS 2014 annual report). We focus on the monthly inpatient costs paid by Medicare; these costs are terminated by the occurrence of death, and Chan and Wang (2010) and Liu et al. (2007) suggested that the medical payment pattern changes when patients approach death. We explore this issue taking account of patient level covariates.
For illustrative purposes, we selected a 2% random sample of the white and black patients whose service started in the calendar year 2007, and who were 65 years or older at baseline. The average age at baseline was 76.1 and follow-up ended on December 31st, 2010. Of the 840 patients selected for analysis, 65.5% died during the follow-up period; the others were censored through loss to follow-up or at the end of the study. The average follow-up time for medical payment was 23.4 months. For convenience, we assume that the inpatient cost rate is constant within each hospitalization. For example, if a hospitalization starts on April 11th and ends on May 10th with a total amount of $3,000, then the Medicare payment is $2,000 for April and $1,000 for May. The month of death is usually shorter than other months; for example, a death on April 15th only has half a month to accrue spending. We therefore consider the "spending rate" for the month in which death occurs: for a death on April 15th with an April Medicare payment of $3,000, we scale the payment up to $6,000 for that month in the analysis.

Age, log transformed body mass index (BMI), heart disease and lung disease are used as predictors for the death hazard. All of them are significant, with p-values < 0.0001, 0.0011, < 0.0001 and 0.0127, respectively. The goodness of fit of the Cox regression model is checked in Figure 1. Dotted lines in the first row of Figure 1 are plots of 20 realizations from the distributions of the score processes; the observed score processes, presented with solid lines, fluctuate randomly around zero. From Figure 1, we see that the proportional hazards model for age and log transformed BMI fits the data reasonably well, with goodness-of-fit empirical p-values of 0.485 and 0.284, respectively, based on 1000 simulated martingale residual score processes (Lin et al., 1993). A plot of log Λ̂0(t) versus log t is displayed in the lower panel of Figure 1.
The approximate parallelism of the curves suggests that the proportional hazards model for lung disease and heart disease provides a reasonably good approximation, except at early times for lung disease.
Figure 1.
Goodness of fit of the Cox model for the ESRD data.
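The monthly cost proration and death-month scaling described above can be sketched as follows (hypothetical helper names; we use an April 11th admission so that a constant daily rate gives a $2,000/$1,000 split):

```python
from datetime import date, timedelta
import calendar

def monthly_allocation(start, end, amount):
    # Split a hospitalization's payment across calendar months, assuming a
    # constant daily cost rate over the stay (both endpoint days included).
    n_days = (end - start).days + 1
    rate = amount / n_days
    alloc = {}
    d = start
    while d <= end:
        key = (d.year, d.month)
        alloc[key] = alloc.get(key, 0.0) + rate
        d += timedelta(days=1)
    return alloc

def death_month_rate(month_amount, death_day, year, month):
    # Scale the death month's observed payment up to a full-month "spending
    # rate": a death on the 15th of a 30-day month doubles the amount.
    return month_amount * calendar.monthrange(year, month)[1] / death_day

alloc = monthly_allocation(date(2007, 4, 11), date(2007, 5, 10), 3000.0)
rate = death_month_rate(3000.0, 15, 2007, 4)   # April has 30 days -> 6000.0
```

The 30-day stay splits into 20 April days and 10 May days at $100 per day, and the death-month example reproduces the $6,000 scaled payment in the text.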
Since the distribution of monthly Medicare payment, Y, is highly skewed, we consider a log transformation log(Y/1000 + 1). Figure 2 shows the final six-month trajectories of monthly inpatient costs (log transformed) for 30 randomly selected patients who died during follow-up (dotted lines). Many show an increasing and then decreasing pattern before death. We consider a normal kernel in the nonlinear mixed model.
Figure 2.
Monthly inpatient costs (log transformed). The solid line is the average of the estimated log transformed monthly cost. The shaded area is its 95% pointwise confidence band. The dotted lines are 30 randomly selected subjects with terminal event
Exploration of the data showed a similar pattern after entry as described in Liu et al. (2007): inpatient costs tended to increase over the first two months after entry, and then showed an approximately linear decreasing pattern through to the eighth month. Hence, we create three variables to capture this effect, where Start1 = 1(Month = 1), Start2 = Month × 1(2 ≤ Month ≤ 7) and Start3 = 1(Month ≥ 8). Diabetes, heart disease and race are also covariates of interest, whereas age, BMI, sex and lung disease are not significantly associated with costs. The final longitudinal model thus includes Start1, Start2, Start3, diabetes, heart disease and race as fixed-effect covariates together with the normal kernel term γ exp{−(S − t − μ)2ξ}, and the final Cox model for death includes age, log transformed BMI, heart disease and lung disease.
Table 5 shows the regression coefficient estimates, where we see that the proposed two-stage method yields similar point estimates with smaller estimated variances compared to the complete case analysis, indicating the efficiency gain of the proposed method.
Table 5.
Longitudinal data analysis results for the inpatient cost paid by Medicare with death as a covariate.
| | Complete Case | | | Two-Stage | | |
|---|---|---|---|---|---|---|
| | estimate | var (×10−3) | p-value | estimate | var (×10−3) | p-value |
| Start1 | −0.65 | 4.10 | < 0.0001 | −0.59 | 2.00 | < 0.0001 |
| Start2 | −0.07 | 0.11 | < 0.0001 | −0.07 | 0.04 | < 0.0001 |
| Start3 | −0.46 | 3.54 | < 0.0001 | −0.50 | 2.78 | < 0.0001 |
| Diabetes | 0.06 | 2.51 | 0.20 | 0.09 | 1.46 | 0.01 |
| Heart | 0.12 | 3.31 | 0.04 | 0.11 | 1.80 | 0.01 |
| Race | 0.14 | 2.71 | 0.007 | 0.11 | 1.81 | 0.01 |
| γ | 1.54 | 3.67 | < 0.0001 | 1.57 | 3.59 | < 0.0001 |
| ξ | 0.99 | 42.90 | < 0.0001 | 0.96 | 30.95 | < 0.0001 |
| μ | 0.78 | 5.92 | < 0.0001 | 0.76 | 5.30 | < 0.0001 |
The estimated averages of the log transformed Medicare payments are presented as a solid line in Figure 2. A Q-Q plot is used to check the normal error assumption, where only one residual is randomly selected for each patient to avoid within-subject correlation; see Figure 3. The linear pattern is consistent with the nonlinear mixed model with normal kernel. We checked many random selections, and the displayed plot is typical.
Figure 3.
QQ plot for inpatient cost model.
6 Discussion
We consider the identity link and Gaussian errors in this article. The proposed two-stage method could be generalized to non-Gaussian errors, or to logistic or Poisson regression, provided the model is identifiable and the regularity conditions are suitably modified.
We allow only time-independent covariates in this article for simplicity. Time dependent covariates are often of interest in longitudinal data analysis and survival analysis. Implementation of the two-stage method for time-dependent covariates involves extrapolating η(u; X̄ (C)) beyond C, where X̄ (C) is the history of the time-dependent covariates X up to time C. It involves predicting the censored covariate process, which will be explored elsewhere. An alternative is an estimating equation approach using inverse probability weighting, which would only use the subjects with observed terminal event time.
The function g(S − t, ξ) in (1) that we consider in this article is a known nonlinear function up to the parameter ξ. In practice, smoothing techniques can be used to determine an appropriate parametric functional form of g or to examine the fit of the data to a hypothesized g. For example, we fitted the model (1) but approximated g with cubic B-splines with 20 knots over the entire observation window of 48 months. This yielded an estimate of g that was very similar to the proposed Gaussian form.
We only considered the intercept parameter as a function of S − t in this article to illustrate the basic concept of the proposed methodology. This modeling strategy extends naturally to regression models with a time-varying coefficient for each regressor. Such an extension is under investigation.
The major difference between our work and that of Li et al. (2013) is that the function g in model (1), together with β0, can be viewed as the intercept parameter which depends on S − t, but all other variables in the model are with reference to time t in the same way as in the usual regression models for longitudinal data. On the other hand, none of the regression parameters in Li et al. (2013) varies with S − t, but all the variables in their model (including the error terms) are with reference to the reverse time scale, S − t. See their equation (2). This leads to different model interpretations.
Supplementary Material
Table 2.
Simulation results for equally spaced time intervals with flat nonlinear term. varb = bootstrap variance estimator; CR = coverage rate
| | | β0 = 1 | β1 = 1 | β2 = −3 | μ = 1 | γ = 4 | ξ = 0.2 |
|---|---|---|---|---|---|---|---|
| Full data | bias | 0.0081 | −0.0116 | 0.0209 | −0.0007 | 0.0009 | 0.0001 |
| | var | 0.1279 | 0.0130 | 0.0455 | 0.0009 | 0.0121 | 0.0005 |
| Two-stage | bias | 0.0090 | −0.0121 | 0.0229 | 0.0003 | −0.0046 | 0.0010 |
| | var | 0.1599 | 0.0161 | 0.0561 | 0.0014 | 0.0152 | 0.0007 |
| | varb | 0.1745 | 0.0166 | 0.0641 | 0.0015 | 0.0152 | 0.0007 |
| | 90% CR | 0.902 | 0.890 | 0.912 | 0.904 | 0.910 | 0.902 |
| | 95% CR | 0.958 | 0.948 | 0.946 | 0.950 | 0.964 | 0.952 |
| Complete case | bias | 0.0044 | −0.0196 | 0.0463 | −0.0004 | −0.0062 | −0.0004 |
| | var | 0.2248 | 0.0239 | 0.0821 | 0.0017 | 0.0236 | 0.0008 |
Table 3.
Simulation results for non-equally spaced time intervals with sharp nonlinear term. varb = bootstrap variance estimator; CR = coverage rate
| | | β0 = 1 | β1 = 1 | β2 = −3 | μ = 1 | γ = 4 | ξ = 1.2 |
|---|---|---|---|---|---|---|---|
| Full data | bias | −0.0045 | 0.0020 | 0.0068 | 0.0016 | −0.0051 | 0.0059 |
| | var | 0.1474 | 0.0203 | 0.0187 | 0.0005 | 0.0177 | 0.0084 |
| Two-stage | bias | −0.0238 | 0.0089 | 0.0111 | 0.0015 | −0.0046 | 0.0090 |
| | var | 0.1675 | 0.0235 | 0.0253 | 0.0007 | 0.0229 | 0.0113 |
| | varb | 0.1550 | 0.0220 | 0.0276 | 0.0007 | 0.0213 | 0.0135 |
| | 90% CR | 0.866 | 0.884 | 0.884 | 0.888 | 0.914 | 0.910 |
| | 95% CR | 0.942 | 0.936 | 0.938 | 0.944 | 0.950 | 0.960 |
| Complete case | bias | −0.0295 | 0.0093 | 0.0217 | 0.0004 | 0.0002 | 0.0120 |
| | var | 0.2480 | 0.0366 | 0.0353 | 0.0010 | 0.0340 | 0.0161 |
Acknowledgments
The data used in this paper were made available by the U.S. Renal Data System. This study was supported in part by the U.S. Renal Data System under Contract No. NO1-DK-9-2344 (National Institute of Diabetes and Digestive and Kidney Diseases, National Institutes of Health, Bethesda, Maryland). The data analysis was completed while Shengchun Kong was Assistant Professor of Statistics at Purdue University.
The research is supported in part by NIH grant R01-AG036802 and NSF grants DMS-1007590 and DMS-1407142, and with Federal funds from the National Institute of Diabetes and Digestive and Kidney Diseases, National Institutes of Health, Department of Health and Human Services, under Contract No. HHSN276201400001C. The data reported here have been supplied by the United States Renal Data System (USRDS). The interpretation and reporting of these data are the responsibility of the authors and in no way should be seen as an official policy or interpretation of the U.S. government.
A Appendix
A.1 Regularity conditions
Denote the true value of θ by θ0, the true value of ϕ by ϕ0, the sample space of response variable Y by 𝒴, the sample space of covariate X by 𝒳, the sample space of random effect Z by 𝒵 ⊂ 𝒳, the parameter space of θ by Θ, the parameter space of ϕ by Φ, and the parameter space of η by ℱ. In addition to the assumptions of bounded support for X, bounded parameter spaces Θ and Φ, conditional independence between C and (S, Y) given X, and non-informative censoring, we provide a set of regularity conditions in the following:
Condition 1
The third derivatives |∂³g(t, ξ)/(∂ξi∂ξj∂ξk)| and |∂³g(t, ξ)/(∂t∂ξj∂ξk)| are bounded uniformly for all ξ ∈ Ξ and bounded t.
Condition 2
(a) Pl0(θ, ϕ; Y, X, Δ, V) has a unique maximizer (θ0, ϕ0);
(b) Pl(θ, ϕ0, η0; Y, X, Δ, V) has a unique maximizer θ0.
Condition 3
The eigenvalues of Σ(ϕ) lie in [λ1, λ2], where 0 < λ1 < λ2 < ∞, for any ϕ ∈ Φ and Z ∈ 𝒵.
Condition 4
The absolute values of all the elements of ∂³Σ(ϕ)/(∂ϕi∂ϕj∂ϕk) are bounded uniformly for all ϕ ∈ Φ and Z ∈ 𝒵.
Condition 5
The study stops at a finite time τ > 0 such that infx∈𝒳 P(C ≥ τ |X = x) = ω1 > 0 and infx∈𝒳 P(S ≥ τ|X = x) = ω2 > 0 for constants ω1 and ω2.
Condition 6
The conditional distribution of S given X possesses a continuous Lebesgue density.
Condition 7
The information matrix of the partial likelihood for the Cox regression model at the true parameter values is positive definite.
Condition 8
There exist constants δ1 > 0 and δ2 > 0, such that with probability 1 for any θ ∈ Θ and |ϕ − ϕ0| + ‖η − η0‖ < δ2.
Remark
Condition 1 holds for many smooth functions g, e.g., g(t, ξ) = ξ1 exp{(t − ξ2)²ξ3} or g(t, ξ) = ξ1 exp{−(t − ξ2)}. Bounded third derivatives imply bounded second derivatives, which is adequate for the proof of consistency. We implemented the numerical studies with g being the normal kernel. When g(t, ξ) = ξ1 exp{(t − ξ2)²ξ3}, Condition 2(a) implies ξ1ξ3 ≠ 0; by Theorem 2.1 of Lehmann (1998), this condition holds provided model (1) is identifiable. Condition 2(b) is for the consistency of the proposed two-stage estimator θ̂n and may be stronger than necessary, as can be seen from the following. In the proof of Theorem 4.2, we can show that Pl̈11(θ0, ϕ0, η0; Y, X, Δ, V) = P{{∂²l(θ, ϕ0, η0; Y, X, Δ, V)}/(∂θ∂θ′)|θ=θ0} is negative definite by Condition 2(a). Thus Pl̈11(θ, ϕ0, η0; Y, X, Δ, V), a continuous matrix function of θ, is also negative definite in a neighborhood of θ0, which guarantees that θ0 is a unique maximizer of Pl(θ, ϕ0, η0; Y, X, Δ, V) in that neighborhood. The initial value we use in the algorithm for maximizing (7) is obtained from the complete case analysis, which is shown to be n1/2-consistent; thus, the solution of the proposed two-stage method is likely to lie in the same neighborhood, and therefore to be consistent without the uniqueness requirement in Condition 2(b).
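The boundedness in Condition 1 can be spot-checked numerically for a candidate g by finite differences over a bounded grid of t. A minimal sketch, where the grid, step size, and parameter values are illustrative assumptions rather than the paper's settings:

```python
import math

def g(t, xi1, xi2, xi3):
    # candidate trend function from the Remark: g(t, xi) = xi1 * exp{(t - xi2)^2 * xi3};
    # xi3 < 0 gives the normal-kernel shape used in the numerical studies
    return xi1 * math.exp((t - xi2) ** 2 * xi3)

def third_partial(f, point, i, j, k, h=1e-2):
    """Nested central differences estimating a third mixed partial derivative."""
    def shift(p, idx, s):
        q = list(p)
        q[idx] += s
        return tuple(q)
    def d(fun, idx):
        return lambda p: (fun(shift(p, idx, h)) - fun(shift(p, idx, -h))) / (2 * h)
    return d(d(d(lambda p: f(*p), i), j), k)(point)

# scan |d^3 g / (dxi1 dxi2 dxi3)| over a bounded grid of t with xi fixed
grid = [(-2.0 + 0.5 * m, 1.0, 0.0, -0.5) for m in range(9)]
worst = max(abs(third_partial(g, p, 1, 2, 3)) for p in grid)
print(worst)  # stays finite on bounded t, in line with Condition 1
```

With ξ3 < 0 the exponential factor decays in |t − ξ2|, so the mixed partials remain uniformly bounded on any bounded t-range, which is exactly what Condition 1 requires.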
Conditions 3–4 automatically hold for model (1) with the NOU process if |ρ| ≤ 1 − δ, and ti,k+1 − ti,k ≥ ε, i = 1, ⋯, n, k = 1, ⋯, ni − 1, where δ > 0 and ε > 0; they are parallel to the conditions of bounded derivatives of the log likelihood in Theorem 1.1 and Theorem 2.3 of Lehmann (1998).
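Conditions 3–4 can be illustrated concretely. Assuming a common exponentially decaying parameterization Cov(tj, tk) = σ²ρ^{|tj − tk|} for the NOU-type covariance (an assumption for illustration, not the paper's exact formula), bounded correlation |ρ| ≤ 1 − δ together with gaps ti,k+1 − ti,k ≥ ε keeps the eigenvalues of Σ(ϕ) inside a fixed interval, which can be verified via Gershgorin bounds:

```python
def nou_cov(times, sigma2, rho):
    # illustrative stationary parameterization: Cov(t_j, t_k) = sigma2 * rho^{|t_j - t_k|}
    n = len(times)
    return [[sigma2 * rho ** abs(times[j] - times[k]) for k in range(n)]
            for j in range(n)]

def gershgorin_bounds(A):
    """Gershgorin interval containing every eigenvalue of a symmetric matrix."""
    n = len(A)
    lo = min(A[j][j] - sum(abs(A[j][k]) for k in range(n) if k != j) for j in range(n))
    hi = max(A[j][j] + sum(abs(A[j][k]) for k in range(n) if k != j) for j in range(n))
    return lo, hi

# gaps t_{k+1} - t_k >= 1 and |rho| well below 1, mirroring the Remark's requirement
lo, hi = gershgorin_bounds(nou_cov([0.0, 1.0, 2.5, 4.0], sigma2=1.0, rho=0.3))
print(lo > 0 and hi < float("inf"))  # eigenvalues bounded away from 0 and infinity
```

The positive lower Gershgorin bound comes from strict diagonal dominance: with the stated gap and correlation constraints, the off-diagonal row sums stay below σ², so Condition 3 holds with λ1, λ2 not depending on n.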
Conditions 5–7 are usual assumptions for Cox regression models (Andersen and Gill, 1982; Nan and Wellner, 2013). From Condition 5, we have
| (8) |
Condition 8 is mainly for technical convenience. One way to ensure it is to truncate the response variable Y so that |Y| ≤ M < ∞ for a large constant M. In our simulations, however, we do not implement such truncation and still obtain satisfactory results.
A.2 Proofs of Theorems 4.1 and 4.2
Lemmas A.1–A.5, used in the following proofs, are provided in the online Supplementary Material.
A.2.1 Proof of consistency in Theorem 4.1 for complete case analysis estimator
Proof
From Corollary 3.2.3 in van der Vaart and Wellner (1996), we need to show that (i) Pl0(θ0, ϕ0; Y, X, Δ, V) > sup(θ,ϕ)∉G Pl0(θ, ϕ; Y, X, Δ, V) for any open set G containing (θ0, ϕ0); and (ii) sup(θ,ϕ)‖(ℙn − P)l0(θ, ϕ; Y, X, Δ, V)‖ → 0. Condition (i) follows from Condition 2(a) and the non-informative censoring assumption. Condition (ii) holds because the class of functions {−Δ(Y − Xβ − g(S1 − t, ξ))′Σ(ϕ)⁻¹(Y − Xβ − g(S1 − t, ξ))/2 − log |Σ(ϕ)|/2 : θ ∈ Θ, ϕ ∈ Φ} is Glivenko–Cantelli by Lemma A.4.
A.2.2 Proof of asymptotic normality in Theorem 4.1 for complete case analysis estimator
Denote the element-wise product of two matrices A and B by A * B. Let
Proof
The proof follows Lemma A.1 with ψ = (θ, ϕ). Here
The first order derivative of l0(θ, ϕ; Y, X, Δ, V) equals
where
with
| (9) |
and
| (10) |
with
| (11) |
The second order derivative of l0(θ, ϕ; Y, X, Δ, V) equals
where
with
| (12) |
with
and
with
Condition A1 holds from consistency. Condition A2 holds since for any u,
| (13) |
| (14) |
We have
where
with
Hence,
| (15) |
where
Thus, Pm̈(θ0, ϕ0; Y, X, Δ, V) is negative definite from Condition 2(a).
From (13), Condition A3 holds. Condition A4 holds automatically. Condition A5 holds if the class of functions {−Δ tr[Σ(ϕ)⁻¹Aj(ϕ)]/2 + Δ r(θ; V, Y, X)′Σ(ϕ)⁻¹Aj(ϕ)Σ(ϕ)⁻¹r(θ; V, Y, X)/2 : j = 1, ⋯, q, |θ − θ0| < δ, |ϕ − ϕ0| < δ} is Donsker for some δ > 0 and satisfies P|ṁ(θ, ϕ; Y, X, Δ, V) − ṁ(θ0, ϕ0; Y, X, Δ, V)|² → 0 as |(θ, ϕ) − (θ0, ϕ0)| ≤ δn ↓ 0. Both requirements hold by Conditions 1 and 3–5 and Theorem 2.10.6 of van der Vaart and Wellner (1996). Condition A6 holds by Taylor expansion and Conditions 1 and 3–5. Hence,
which converges weakly to a mean zero normal random variable with the sandwich variance J1⁻¹Q1J1⁻¹, where J1 = −Pl̈0(θ0, ϕ0; Y, X, Δ, V) and Q1 = P{l̇0(θ0, ϕ0; Y, X, Δ, V)⊗2}. Furthermore,
| (16) |
where D5(ϕ0) and C(θ0, ϕ0; Ỹ, X̃, Δ̃, Ṽ) are defined in (15) and (10), respectively.
A.2.3 Proof of consistency in Theorem 4.2 for two-stage estimator
Proof
From Condition 2(b), we have
| (17) |
holds for every δ > 0. By the definition of θ̂n, we have
| (18) |
where the equality is obtained by Lemma A.4 and Lemma A.5. The class of functions {l(θ, ϕ, η; Y, X, Δ, V) : θ ∈ Θ, ϕ ∈ Φ, η ∈ ℱ} is Donsker from Lemma A.4. Hence it is Glivenko-Cantelli, and we then have
| (19) |
| (20) |
where (19) is obtained from (18) and (20) is obtained by Lemma A.5. By inequality (17), for every δ > 0 we have
with the sequence of events on the right converging to a null event in view of inequality (20), which yields the almost sure (and hence in probability) convergence of θ̂n. This argument is taken from the proof of Theorem 5.8 in van der Vaart (2002) and the proof of Theorem 3 in Li and Nan (2011).
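As a reading aid, the argmax argument just used follows the classical Wald-type consistency scheme. A schematic version in generic M-estimation notation (illustrative only; the proof above instantiates it with l(θ, ϕ̂n, η̂n; Y, X, Δ, V) and displays (17)–(20)):

```latex
% Near-maximization: \mathbb{P}_n l(\hat\theta_n) \ge \mathbb{P}_n l(\theta_0) - o_P(1).
\begin{aligned}
Pl(\theta_0) - Pl(\hat\theta_n)
  &\le \bigl[\mathbb{P}_n l(\hat\theta_n) - Pl(\hat\theta_n)\bigr]
     + \bigl[Pl(\theta_0) - \mathbb{P}_n l(\theta_0)\bigr] + o_P(1) \\
  &\le 2 \sup_{\theta \in \Theta} \bigl|(\mathbb{P}_n - P)\, l(\theta)\bigr| + o_P(1)
   \;\xrightarrow{\;P\;}\; 0 ,
\end{aligned}
```

and the well-separation of the maximizer guaranteed by Condition 2(b) then converts convergence of the criterion values into convergence of θ̂n itself.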
A.2.4 Proof of asymptotic normality in Theorem 4.2 for two-stage estimator
Proof
The proof follows Lemma A.2. Here
The partial derivative of l(θ, ϕ, η; Y, X, Δ, V) with respect to θ equals
where D2(θ; u, X) is defined in (9).
The second order derivative of l(θ, ϕ, η; Y, X, Δ, V) with respect to θ equals
where D3(θ, ϕ; V, Y, X) is defined in (12).
B1 holds from Theorem 4.1, Lemma A.3, and the consistency of the two-stage estimator. From (13) and (14),
| (21) |
which is negative definite from Condition 2(b); thus, B2 holds. From (13), B3 holds. B4 holds automatically.
Since
under Conditions 1, 3–5 and 8, we have
as |(θ, ϕ) − (θ0, ϕ0)| ≤ δn ↓ 0 by continuity and Condition 8. Similar to the proof of Lemma A.4, the class of functions { : θ ∈ Θ, ϕ ∈ Φ, η ∈ ℱ} is Donsker. Hence {l̇1(θ, ϕ, η; Y, X, Δ, V) : θ ∈ Θ, ϕ ∈ Φ} is Donsker by Section 2.10.2 of van der Vaart and Wellner (1996) and Condition 8. Furthermore, B5 holds by Corollary 2.3.12 of van der Vaart and Wellner (1996). Under Conditions 3–5 and 8, similar to the proof of Theorem 1 in Kong and Nan (2016), we can show that B6 holds. In particular, in B6,
with
and
| (22) |
where
with
and A1(η0; u, X; X̃, Δ̃, Ṽ) is defined in Lemma A.3.
Hence by Lemma A.2 and the central limit theorem,
which converges weakly to a mean zero normal random variable with variance from (16) and (22), where
with D5(ϕ0) and C(θ0, ϕ0; Ỹ, X̃, Δ̃, Ṽ) defined in (15) and (10), respectively.
Footnotes
The online supplement contains general theorems about M-estimators, technical lemmas, and additional simulation. It also contains R code for implementing the methods developed here.
Contributor Information
Shengchun Kong, Gilead Sciences, Inc., Foster City, CA 94404.
Bin Nan, Departments of Biostatistics, University of Michigan, Ann Arbor, MI 48109.
John D. Kalbfleisch, Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109.
Rajiv Saran, Department of Internal Medicine, University of Michigan, Ann Arbor, MI 48109.
Richard Hirth, Department of Health Management and Policy, University of Michigan, Ann Arbor, MI 48109.
References
- Albert PS, Shih JH. An approach for jointly modeling multivariate longitudinal measurements and discrete time-to-event data. The Annals of Applied Statistics. 2010;4(3):1517–1532. doi: 10.1214/10-AOAS339. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Andersen PK, Gill RD. Cox’s regression model for counting processes: a large sample study. The Annals of Statistics. 1982;10(4):1100–1120. [Google Scholar]
- Breslow NE. Discussion of “Regression models and life-tables” by D. R. Cox. Journal of the Royal Statistical Society, Series B. 1972;34(2):216–217. [Google Scholar]
- Chan K, Wang M. Backward estimation of stochastic processes with failure events as time origins. The Annals of Applied Statistics. 2010;4(3):1602–1620. doi: 10.1214/09-AOAS319. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Cox DR. Regression models and life-tables (with discussion). Journal of the Royal Statistical Society, Series B. 1972;34(2):187–220. [Google Scholar]
- Ding J, Wang JL. Modeling longitudinal data with nonparametric multiplicative random effects jointly with survival data. Biometrics. 2008;64(2):546–556. doi: 10.1111/j.1541-0420.2007.00896.x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Ghosh D, Lin DY. Marginal regression models for recurrent and terminal events. Statistica Sinica. 2002;12:663–688. [Google Scholar]
- Harlow SD, Mitchell ES, Crawford S, Nan B, Little R, Taffe J. The restage collaboration: defining optimal bleeding criteria for onset of early menopausal transition. Fertility and Sterility. 2008;89(1):129–140. doi: 10.1016/j.fertnstert.2007.02.015. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Hsieh F, Tseng YK, Wang JL. Joint modeling of survival and longitudinal data: likelihood approach revisited. Biometrics. 2006;62(4):1037–1043. doi: 10.1111/j.1541-0420.2006.00570.x. [DOI] [PubMed] [Google Scholar]
- Huang CY, Wang MC. Joint modeling and estimation for recurrent event processes and failure time data. Journal of the American Statistical Association. 2004;99(468):1153–1165. doi: 10.1198/016214504000001033. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Kalbfleisch JD, Prentice RL. The Statistical Analysis of Failure Time Data. 2. Hoboken: John Wiley & Sons, Inc; 2002. [Google Scholar]
- Kalbfleisch JD, Schaubel DE, Ye Y, Gong Q. An estimating function approach to the analysis of recurrent and terminal events. Biometrics. 2013;69(2):366–374. doi: 10.1111/biom.12025. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Kong S, Nan B. Semiparametric approach to regression with a covariate subject to a detection limit. Biometrika. 2016;103(1):161–174. [Google Scholar]
- Lehmann EL. Theory of Point Estimation. New York: Springer-Verlag; 1998. [Google Scholar]
- Li Z, Nan B. Relative risk regression for current status data in case-cohort studies. The Canadian Journal of Statistics. 2011;39(4):557–577. [Google Scholar]
- Li Z, Tosteson TD, Bakitas MA. Joint modeling quality of life and survival using a terminal decline model in palliative care studies. Statistics in Medicine. 2013;32(8):1394–1406. doi: 10.1002/sim.5635. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Lin DY, Wei LJ, Ying Z. Checking the Cox model with cumulative sums of martingale-based residuals. Biometrika. 1993;80(3):557–572. [Google Scholar]
- Little RJ, Rubin DB. Statistical Analysis with Missing Data. 2. Hoboken: John Wiley & Sons, Inc; 2002. [Google Scholar]
- Liu L, Wolfe RA, Kalbfleisch JD. A shared random effects model for censored medical costs and mortality. Statistics in Medicine. 2007;26(1):139–155. doi: 10.1002/sim.2535. [DOI] [PubMed] [Google Scholar]
- Lu X, Nan B, Song P, Sowers M. Longitudinal data analysis with event time as a covariate. Statistics in Biosciences. 2010;2(1):65–80. [Google Scholar]
- Nan B, Wellner JA. A general semiparametric z-estimation approach for case-cohort studies. Statistica Sinica. 2013;23:1155–1180. [PMC free article] [PubMed] [Google Scholar]
- Sowers M, Tomey K, Jannausch M, Eyvazzdh A, Crutchfield M, Nan B, Randolph J. Physical functioning and menopause states. Obstet Gynecol. 2007;110(6):1290–1296. doi: 10.1097/01.AOG.0000290693.78106.9a. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Tsiatis AA, Davidian M. Joint modeling of longitudinal and time-to-event data: an overview. Statistica Sinica. 2004;14:809–834. [Google Scholar]
- van der Vaart AW. In: Semiparametric Statistics. In Lectures on Probability Theory and Statistics, Ecole d’Ete de Probabilites de Saint-Flour XXIX99. Bernard P, editor. Berlin Heidelberg: Springer-Verlag; 2002. pp. 330–457. [Google Scholar]
- van der Vaart AW, Wellner JA. Weak Convergence and Empirical Processes. New York: Springer-Verlag; 1996. [Google Scholar]
- Wellner JA, Zhang Y. Two likelihood-based semiparametric estimation methods for panel count data with covariates. Annals of Statistics. 2007;35(5):2106–2142. [Google Scholar]
- Zeng D, Lin DY. Semiparametric transformation models with random effects for joint analysis of recurrent and terminal events. Biometrics. 2009;65(3):746–752. doi: 10.1111/j.1541-0420.2008.01126.x. [DOI] [PMC free article] [PubMed] [Google Scholar]