Abstract
We propose a general class of semiparametric transformation models with random effects to formulate the effects of possibly time-dependent covariates on clustered or correlated failure times. This class encompasses all commonly used transformation models, including proportional hazards and proportional odds models, and it accommodates a variety of random-effects distributions, particularly Gaussian distributions. We show that the nonparametric maximum likelihood estimators of the model parameters are consistent, asymptotically normal and asymptotically efficient. We develop the corresponding likelihood-based inference procedures. Simulation studies demonstrate that the proposed methods perform well in practical situations. An illustration with a well-known diabetic retinopathy study is provided.
Keywords: Correlated failure times, frailty model, nonparametric maximum likelihood estimation, proportional hazards, semiparametric efficiency, survival analysis
1. Introduction
Clustered failure time data arise when the study subjects are sampled in clusters so that the failure times within the same cluster tend to be correlated. Medical examples include the onset of a genetic disease among family members, the appearance of tumors in littermates exposed to a carcinogen, the occurrence of visual loss in left and right eyes, and the initiation of cigarette smoking by classmates. Such failure times are inevitably subject to right censoring. The presence of censoring and intra-class dependence poses serious challenges in the semiparametric regression analysis of these data.
One approach to formulating the effects of covariates on the failure time while accounting for the intra-class dependence is the proportional hazards frailty model, under which the hazard function for the jth subject of the ith cluster associated with covariates Xij(·) takes the form
$$\lambda(t \mid X_{ij}, \xi_i) = \xi_i \lambda_0(t) e^{\beta^T X_{ij}(t)}, \tag{1.1}$$
where λ0(·) is an unspecified baseline hazard function, β is a vector of unknown regression parameters, and ξi is an unobserved frailty for the ith cluster. Statistical inference under model (1.1) turns out to be an interesting and challenging problem. The consistency and asymptotic distribution of the nonparametric maximum likelihood estimator for this model have been rigorously studied by Murphy (1994, 1995) for the case of no covariates, and by Parner (1998) for the case with covariates. All the results are restricted to the special case of gamma frailty.
The proportional hazards model with gamma frailty, although very interesting and useful, has important limitations. First, the proportional hazards assumption on the effects of covariates may not be reasonable in certain applications. Secondly, gamma frailty induces a restrictive form of dependence.
To address the above concerns, we study a broad class of transformation models with random effects. For the jth subject of the ith cluster, let Xij(·) be a d1-vector of (possibly time-dependent) covariates, and Zij(·) be another set of covariates, which may contain 1 and part of Xij(·). Also, let X̄ij(t) and Z̄ij(t) denote the histories of Xij(·) and Zij(·) over [0, t]. The cumulative hazard function of Tij, the jth failure time of the ith cluster, is related to Xij(·) and Zij(·) as follows:
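$$\Lambda(t \mid \bar X_{ij}(t), \bar Z_{ij}(t), b_i) = H_0\left( \int_0^t e^{\beta^T X_{ij}(s) + b_i^T Z_{ij}(s)} \, d\Lambda(s) \right), \tag{1.2}$$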
where H0 is a known increasing function with H0(0) = 0 and H0(∞) = ∞, Λ(·) is an unspecified increasing function, β is a set of unknown regression parameters, and bi is a set of unobserved mean-zero random effects for the ith cluster with a density function ψ(bi;γ) (with respect to a σ-finite measure μ(bi)) indexed by a d2-dimensional parameter γ. Note that (1.2) allows covariate-specific or subject-specific random effects.
Let G0(x) = 1 − exp{−H0(x)}. We may rewrite (1.2) as
$$\int_0^{T_{ij}} e^{\beta^T X_{ij}(s) + b_i^T Z_{ij}(s)} \, d\Lambda(s) = \epsilon_{ij}, \tag{1.3}$$
where the ϵij are i.i.d. random variables from a known distribution with the cumulative distribution function G0(·). If both Xij and Zij are time-independent, then (1.3) reduces to linear transformation models
$$H(T_{ij}) = -\beta^T X_{ij} - b_i^T Z_{ij} + \log \epsilon_{ij}, \tag{1.4}$$
where H(x) = log Λ(x). The choices of the extreme-value and standard logistic distributions for log ϵij, or equivalently G0(x) = 1 − exp(−x) and G0(x) = 1 − (1 + x)^{−1}, correspond to the proportional hazards model (Cox (1972)) and the proportional odds model (Bennett (1983) and Pettitt (1984)), respectively.
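The step from (1.2) to (1.3) and (1.4) can be spelled out as follows. This is only a sketch: it assumes that (1.2) has the form Λ(t | ·) = H0(q_ij(t)), where the shorthand q_ij(t) is introduced here for illustration and is taken to be continuous and strictly increasing in t.

```latex
\begin{align*}
q_{ij}(t) &= \int_0^t e^{\beta^T X_{ij}(s) + b_i^T Z_{ij}(s)} \, d\Lambda(s), \\
P(T_{ij} > t \mid X_{ij}, Z_{ij}, b_i)
  &= \exp\{-H_0(q_{ij}(t))\} = 1 - G_0(q_{ij}(t)), \\
P(q_{ij}(T_{ij}) \le x \mid X_{ij}, Z_{ij}, b_i)
  &= G_0(x), \quad \text{so that } \epsilon_{ij} \equiv q_{ij}(T_{ij}) \sim G_0 . \\
\text{Time-independent covariates:}\quad
q_{ij}(t) &= e^{\beta^T X_{ij} + b_i^T Z_{ij}} \Lambda(t)
  \;\Longrightarrow\;
  \log \Lambda(T_{ij}) = -\beta^T X_{ij} - b_i^T Z_{ij} + \log \epsilon_{ij}, \\
H_0(x) = x &\;\Longrightarrow\; G_0(x) = 1 - e^{-x}
  \quad \text{(proportional hazards)}, \\
H_0(x) = \log(1 + x) &\;\Longrightarrow\; G_0(x) = 1 - (1 + x)^{-1}
  \quad \text{(proportional odds)}.
\end{align*}
```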
Equation (1.4) is reminiscent of the linear mixed-effects model (Laird and Ware (1982)) for longitudinal data. For the latter model, however, the transformation of the response variable is known, and there is no censoring. The presence of censoring and the involvement of an unknown transformation make the estimation of transformation models with random effects for correlated failure time data much harder. In view of the linear model representation given in (1.4), Gaussian random effects are the most natural choice even for the proportional hazards model. The focus of the existing literature on gamma frailty is due to its mathematical simplicity.
Linear transformation models for independent failure time data (i.e., in the absence of random effects) have been studied extensively. In particular, the proportional odds model was studied by Bennett (1983), Pettitt (1984), Cuzick (1988), Wu (1995), Murphy, Rossini and van der Vaart (1997), Shen (1998) and Lam and Leung (2001). Estimation for general linear transformation models was investigated by Bickel (1986), Dabrowska and Doksum (1988), Cheng, Wei and Ying (1995) and Chen, Jin and Ying (2002), among others. Recently, Kosorok, Lee and Fine (2004) considered a class of frailty models for independent observations which is a one-parameter extension of the proportional hazards model.
For clustered failure time data, Cai, Cheng and Wei (2002) considered the class of models given in (1.4) with a scalar random effect (i.e., Zij ≡ 1). They proposed to estimate the parameters by minimizing the empirical sum of squares of the differences between certain observed quantities and their expected values. The estimators are not asymptotically efficient, and the variance estimation is computationally demanding. Furthermore, the censoring mechanism is required to be purely random and independent of covariates. Recently, Zeng, Lin and Yin (2005) studied efficient estimation of a special member of (1.4), namely, the proportional odds model with time-independent covariates and Gaussian random effects. They showed that the estimators of Cai et al. (2002) can be quite inefficient. Efficient estimation of (1.4), let alone (1.2), has not been studied in any generality.
In this paper, we study nonparametric maximum likelihood estimation of (1.2). Rather than focusing on specific models, we identify general conditions on the transformation G0(·) and the distribution of random effects under which the nonparametric maximum likelihood estimators have desirable asymptotic properties. We show that many commonly used transformations, including the familiar Box-Cox transformations and the class of logarithmic transformations studied by Chen et al. (2002), and Gaussian distributions of random effects ensure that the nonparametric maximum likelihood estimators for the regression parameters are asymptotically efficient. Important special cases include the Cox proportional hazards and proportional odds models with Gaussian random effects.
The structure of this paper is as follows. In Section 2, we describe the proposed methodology based on the nonparametric likelihood. In Section 3, we provide the asymptotic theory behind the proposed methodology. In Section 4, we report the results of our simulation studies. In Section 5, we provide an illustration with a medical study. In Section 6, we provide some concluding remarks. We relegate the proofs of the theoretical results to an appendix.
2. Likelihood and Inference
Suppose that there are n independent clusters with potentially different sizes. The relationship between Tij and (Xij, Zij) is given in equation (1.2) or (1.3). Let Cij be the censoring time on Tij. The data consist of (Yij,Δij,X̄ij(Yij),Z̄ij(Yij)) (i = 1,…,n;j = 1,…,ni), where Yij = Tij ∧ Cij and Δij = I(Tij ≤ Cij). Here and in the sequel, a∧b = min(a, b), and I(·) is the indicator function. Our goal is to make inference about the regression parameters (β,γ) and the function Λ(·).
We make the coarsening at random assumption that, conditional on X̄ij(·), Z̄ij(·), Tij and bi, the hazard function of Cij at time t is only a function of X̄ij(t) and Z̄ij(t). Then under (1.2), the likelihood function for the parameters (β,γ,Λ) is proportional to
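$$\prod_{i=1}^{n} \int_b \prod_{j=1}^{n_i} \left\{ \Lambda'(Y_{ij}) e^{\beta^T X_{ij}(Y_{ij}) + b^T Z_{ij}(Y_{ij})} G_0'\left( \int_0^{Y_{ij}} e^{\beta^T X_{ij}(s) + b^T Z_{ij}(s)} \, d\Lambda(s) \right) \right\}^{\Delta_{ij}} \left\{ 1 - G_0\left( \int_0^{Y_{ij}} e^{\beta^T X_{ij}(s) + b^T Z_{ij}(s)} \, d\Lambda(s) \right) \right\}^{1 - \Delta_{ij}} \psi(b; \gamma) \, d\mu(b),$$
where, for any function g, g′(x) is the derivative of g(x).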
It would seem natural to calculate the maximum likelihood estimators of (β,γ,Λ) by maximizing the above likelihood function. The maximum of this function, however, is infinity since we can always choose some function Λ(t) with fixed values at each Yij while letting Λ′(Yij) go to infinity for some Yij with Δij = 1. Thus, we relax Λ(t) to be right-continuous and allow Λ(t) to have jumps at the Yij. We then propose to maximize
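$$L_n(\beta, \gamma, \Lambda) = \prod_{i=1}^{n} \int_b \prod_{j=1}^{n_i} \left\{ \Lambda\{Y_{ij}\} e^{\beta^T X_{ij}(Y_{ij}) + b^T Z_{ij}(Y_{ij})} G_0'\left( \int_0^{Y_{ij}} e^{\beta^T X_{ij}(s) + b^T Z_{ij}(s)} \, d\Lambda(s) \right) \right\}^{\Delta_{ij}} \left\{ 1 - G_0\left( \int_0^{Y_{ij}} e^{\beta^T X_{ij}(s) + b^T Z_{ij}(s)} \, d\Lambda(s) \right) \right\}^{1 - \Delta_{ij}} \psi(b; \gamma) \, d\mu(b), \tag{2.1}$$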
where Λ{t} denotes the jump size of Λ(·) at t. To be specific, we maximize Ln(β,γ,Λ) over the parameter space {(β,γ,Λ) : (β,γ) ∈ Θ, Λ is a right-continuous and increasing function in [0,τ] with Λ(0) = 0}.
The resulting estimators, denoted by β̂n, γ̂n and Λ̂n, correspond to the Kiefer-Wolfowitz nonparametric maximum likelihood estimators (NPMLEs).
We show later that the maximum of (2.1) exists and that the jump sizes of Λ̂n are finite. Thus, the NPMLEs for (β,γ,Λ) can be obtained by maximizing Ln(β,γ,Λ) over the parameter space (β,γ) ∈ Θ and the jump sizes of Λ at the Yij for which Δij = 1. Computationally, to ensure the positiveness of the jump size, we can use the transformed parameter log(Λ{Yij}) instead of Λ{Yij} in the maximization. For a general transformation G0(·), the maximization can be realized via optimization algorithms which consist of optimum search based on the interior-reflective Newton method (Coleman and Li (1994, 1996)). These algorithms are available in the optimization toolbox of MATLAB. In the numerical calculation, the integration over b is replaced by numerical summation, such as the Gaussian quadrature approximation for Gaussian b. In each iteration of the search, a large linear system is approximately solved by using the method of preconditioned conjugate gradients (Coleman and Li (1994, 1996)). This search works very well in our setting. In the special case when the transformation G0(·) induces the proportional hazards model, the maximization can be carried out efficiently by the expectation-maximization (EM) algorithm (Dempster, Laird and Rubin (1977)). In the EM algorithm, random effects are treated as missing data and efficient computation takes advantage of the explicit solution for estimating Λ(·) in the M-step.
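To make the computation concrete, the following is a minimal sketch of the discretized log-likelihood in the proportional hazards case, G0(x) = 1 − exp(−x). It is not the authors' implementation. It assumes time-independent covariates, a scalar Gaussian random intercept (Zij ≡ 1), at least one observed failure, and Gauss-Hermite quadrature for the integral over b; the function name `log_lik_ph_gauss` and its arguments are hypothetical.

```python
import numpy as np
from numpy.polynomial.hermite import hermgauss


def log_lik_ph_gauss(beta, log_sigma, log_jumps, Y, delta, X, cluster, n_quad=20):
    """Log of the discretized likelihood for the proportional hazards case
    with a scalar Gaussian random intercept (illustrative assumptions).

    Y, delta, cluster: arrays of length N (cluster ids are 0, ..., n-1).
    X: (N, p) matrix of time-independent covariates.
    log_jumps: log jump sizes of Lambda at the sorted distinct failure times.
    """
    beta = np.asarray(beta, dtype=float)
    sigma = np.exp(log_sigma)
    jumps = np.exp(np.asarray(log_jumps, dtype=float))  # log scale keeps jumps positive
    fail_times = np.unique(Y[delta == 1])               # sorted distinct failure times
    cum = np.cumsum(jumps)

    # Lambda(Y_ij): sum of the jumps at failure times <= Y_ij.
    idx = np.searchsorted(fail_times, Y, side="right")
    last = np.maximum(idx - 1, 0)
    Lam = np.where(idx > 0, cum[last], 0.0)
    # log Lambda{Y_ij} enters only for observed failures.
    log_jump_at_Y = np.where(delta == 1, np.log(jumps[last]), 0.0)

    # Gauss-Hermite rule for b ~ N(0, sigma^2): b_k = sqrt(2) * sigma * x_k.
    x, w = hermgauss(n_quad)
    b = np.sqrt(2.0) * sigma * x
    log_w = np.log(w) - 0.5 * np.log(np.pi)

    eta = X @ beta                                       # linear predictor, shape (N,)
    lin = eta[:, None] + b[None, :]                      # shape (N, n_quad)
    # Failure: log jump + lin - exp(lin) * Lambda(Y); censored: -exp(lin) * Lambda(Y).
    contrib = delta[:, None] * (log_jump_at_Y[:, None] + lin) - np.exp(lin) * Lam[:, None]

    # Sum subject-level log contributions within each cluster.
    n = int(cluster.max()) + 1
    S = np.zeros((n, n_quad))
    np.add.at(S, cluster, contrib)

    # Stable log of the quadrature sum for each cluster, then sum over clusters.
    A = S + log_w[None, :]
    m = A.max(axis=1, keepdims=True)
    return float(np.sum(m[:, 0] + np.log(np.exp(A - m).sum(axis=1))))
```

A general-purpose optimizer can then be applied to the negative of this function over (β, log σ, log jump sizes). A different transformation G0 would replace the two terms in `contrib` by log G0′ and log(1 − G0) evaluated at exp(lin) · Λ(Y).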
It is desirable to estimate the asymptotic covariance matrices of β̂n and γ̂n. When the nuisance parameter is of high dimension, i.e., the number of jumps in Λ̂n is large, the profile likelihood method (Murphy and van der Vaart (2000)) is particularly useful in estimating the variances. We define the profile log-likelihood function for θ ≡ (β,γ) as
$$pl_n(\theta) = \max_{\Lambda} l_n(\beta, \gamma, \Lambda),$$
where ln(β,γ,Λ) = log Ln(β,γ,Λ), and Λ is any right-continuous and increasing function in [0, τ] with Λ(0) = 0. Theorem 3 of Section 3 states that the asymptotic covariance matrix of θ̂n ≡ (β̂n, γ̂n) can be estimated by the negative inverse of the curvature of pln(θ) around θ̂n. Specifically, to estimate the (s,l)th element of the asymptotic covariance matrix for θ̂n, we choose a constant hn of the order n^{−1/2}, and let es and el be the canonical basis vectors which are one at the sth and the lth coordinates, respectively, and are zero elsewhere. Then the (s,l)th element of the inverse of the asymptotic covariance matrix can be estimated by a second-order numerical difference of pln(θ) at θ̂n along es and el with step size hn.
Thus, we need to evaluate the profile likelihood function pln(θ) in a neighborhood of θ̂n. Computationally, for a general transformation G0(·), the profile likelihood function can be calculated by using the optimization search for fixed θ close to θ̂n. When G0(·) is the transformation corresponding to the proportional hazards model, the profile likelihood function can be calculated via the EM algorithm in which (βT,γT)T is held constant in both the E-step and M-step, so that the only updated parameters are the jump sizes of Λ(·) at the observed failure times. In our experience, the EM algorithm is more efficient than direct optimization.
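The curvature calculation can be sketched as follows. This is one standard central second difference, chosen here for illustration; it is not necessarily the exact formula used in the paper. The function name `profile_information` and the callable `pl`, which returns the profile log-likelihood at a given θ, are hypothetical.

```python
import numpy as np


def profile_information(pl, theta_hat, n, h=None):
    """Estimate the per-cluster information matrix for theta from the
    curvature of a profile log-likelihood `pl` at `theta_hat`.

    pl: callable mapping a parameter vector to the profile log-likelihood.
    n:  number of clusters; the default step size is h = n ** -0.5.
    """
    theta_hat = np.asarray(theta_hat, dtype=float)
    d = theta_hat.size
    if h is None:
        h = n ** -0.5
    eye = np.eye(d)
    info = np.empty((d, d))
    for s in range(d):
        for l in range(s, d):
            es, el = h * eye[s], h * eye[l]
            # Central mixed second difference in the directions e_s and e_l.
            diff = (pl(theta_hat + es + el) - pl(theta_hat + es - el)
                    - pl(theta_hat - es + el) + pl(theta_hat - es - el))
            info[s, l] = info[l, s] = -diff / (4.0 * n * h * h)
    return info
```

Under these assumptions, `np.linalg.inv(profile_information(pl, theta_hat, n)) / n` gives an estimated covariance matrix for θ̂n. Each call to `pl` requires a maximization over the jump sizes of Λ with θ held fixed, so the number of evaluations grows quadratically with the dimension of θ.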
When the number of observed failure times is not large, an alternative way of estimating the asymptotic variance is simply to invert the observed information matrix for all the parameters including β,γ, and the jump sizes of Λ̂n. That is, we treat the likelihood function (2.1) as a likelihood function from a parametric model. One benefit of this approach is that we can estimate the asymptotic variance for Λ̂n. The validity of inverting the observed information matrix is ensured by Theorem 4. Our numerical studies revealed that this approach works very well in practical situations.
3. Asymptotic Theory
We impose the following regularity conditions.
C.1. There exists some positive constant δ0 such that P(Cij ≥ τ|X̄ij(τ),Z̄ij(τ)) = P(Cij = τ|X̄ij(τ),Z̄ij(τ)) ≥ δ0 almost surely, where τ is a constant denoting the end of the study.
C.2. With probability one, Xij(·) and Zij(·) have right-continuous sample paths in [0,τ] and their right derivatives exist. In addition, there exists a constant M0 such that
where Xij′+ and Zij′+ denote the right derivatives.
C.3. The true value Λ0(t) of Λ(t) is a strictly increasing function in [0,τ] and is continuously differentiable. In addition, Λ0(0) = 0 and Λ′0(0) > 0.
C.4. The true values of β and γ, denoted by β0 and γ0, belong to a known compact set
C.5. The size of the cluster is independent of the survival and censoring variables, and max1≤i≤n |ni| ≤ n0 for a constant n0, almost surely.
C.6. The function G0(x) : [0,∞) → [0,1] is four-times continuously differentiable in [0,∞) with for k = 1,2,3,4, where G0^{(k)}(x) denotes the kth derivative of G0(x). The function ψ(b; γ) is thrice-differentiable with respect to γ, and for k = 1,2,3, ∫b|ψ(k)(b;γ)|dμ(b) is uniformly bounded for γ ∈ Γ0.
C.7. There exists a positive constant ρ0 such that
(3.1)
C.8. For any fixed constant K,
(3.2)
C.9. For any pair of parameters (β1,γ1,Λ1) and (β2,γ2,Λ2), if with probability one,
for any k ∈ {1,…, ni} and any t1,…, tk ∈ [0,τ], then β1 = β2, γ1 = γ2 and Λ1(t) = Λ2(t) for t ∈ [0,τ].
C.10. If Xij(t)^T h1 + h(t) = 0 with probability one for some vector h1 and a function h(t), then h1 = 0 and h(t) = 0. In addition, if there exist a vector h2 and functions Aj(t, b), j = 1, …, ni such that with probability one,
for any k ∈ {1,…, ni} and any t1,…, tk ∈ [0, τ], then h2 = 0 and Aj(t, b) = 0, j = 1,…, ni.
Remark 1
C.1–C.5 are standard conditions for clustered failure time data. Conditions C.6 and C.7 are satisfied by all common transformations, including the Box-Cox transformations H0(x) = {(1 + x)^ρ − 1}/ρ, and the logarithmic transformations H0(x) = r^{−1} log(1 + rx) (Chen et al. (2002)). Condition C.8 pertains to the random-effects distribution. This condition is clearly satisfied by the Gaussian distribution and any distribution with tails lighter than exp(−‖b‖^{1+ϵ0}) with ϵ0 > 0 (e.g., log-inverse Gaussian). Condition C.9 pertains to parameter identifiability, while C.10 entails that the Fisher information along any submodel at the true parameters is nonsingular. If X and Z are time-independent, then C.9 and C.10 reduce to C.9′ and C.10′:
C.9′ For any γ1 and γ2 if there exist two constant vectors ϕ1 and ϕ2, such that with probability one,
for any k ∈ {1,…, ni}, then ϕ1 = ϕ2 and γ1 = γ2.
C.10′ If there exist two vectors h1 and h2 such that with probability one,
then h1 = 0 and h2 = 0. When the random effects are Gaussian and the Zij are the same within each cluster, C.9 and C.10 are implied by the linear independence of the covariates.
The following lemma holds under conditions C.7 and C.8.
Lemma 1
With probability one,
(3.3) |
where c0 is a constant independent of β,γ and Λ.
Remark 2
Inequality (3.3) is essential to the consistency of the NPMLEs. In fact, C.7 and C.8 can be replaced by (3.3) in proving consistency. We impose C.7 and C.8 because they are easier to verify. Although the popular Cox proportional hazards model with gamma frailty does not satisfy C.8, we now show that (3.3) still holds for this model. Under this model, the left-hand side of (3.3) is
where O(1) denotes some positive constant. Since
the right-hand side of the above inequality is bounded by
By the concavity of log(x), we obtain the upper bound
This gives the inequality (3.3) in which ρ0 = γ/n0.
As stated in the next lemma, C.9 and C.10 ensure identifiability of parameters and non-singularity of information matrix.
Lemma 2
Under C.9 and C.10, the parameters in (1.2) are identifiable. Furthermore, the Fisher information matrix along any one-dimensional submodel is non-singular.
Our last lemma pertains to the existence of the NPMLEs.
Lemma 3
Under C.1–C.8, the maximum likelihood estimators (β̂n, γ̂n, Λ̂n) exist almost surely.
The following two theorems state our main results about the asymptotic properties of the proposed maximum likelihood estimators.
Theorem 1
Under C.1–C.10, ‖β̂n − β0‖ → 0, ‖γ̂n − γ0‖ → 0 and sup_{t∈[0,τ]} |Λ̂n(t) − Λ0(t)| → 0 almost surely, where ‖ · ‖ is the Euclidean norm.
Theorem 2
Under C.1–C.10, √n(β̂n − β0, γ̂n − γ0, Λ̂n − Λ0) weakly converges to a zero-mean Gaussian process in the metric space R^{d1} × R^{d2} × l∞[0,τ], where l∞[0,τ] is the linear space consisting of all the bounded functions in [0,τ] and is equipped with the supremum norm. Furthermore, β̂n and γ̂n are asymptotically efficient.
Remark 3
Theorem 1 states the consistency of the maximum likelihood estimators. In C.1 to C.10, Λ(·) is not assumed to be a bounded function, which means that the weak-compactness of the parameter Λ(·) is not imposed. Thus, obtaining a bound for Λ̂n(·) is key to the proof of Theorem 1. The consistency proof is based on the essential inequality (3.3), and it adopts the partitioning idea from Murphy's (1994) proof of the consistency in the gamma frailty model. This partitioning idea was also used by Parner (1998), Kosorok et al. (2004) and Zeng et al. (2005). However, we provide a novel justification to avoid the concavity of G0 assumed in all previous papers. Once the consistency is established, the asymptotic distributions of the maximum likelihood estimators stated in Theorem 2 can be proved by verifying the conditions in Theorem 3.3.1 of van der Vaart and Wellner (1996). Our verification of the continuous invertibility of the information operator is specific to model (1.2) and is based on Lemma 2. Moreover, the Donsker property of some new classes of functions is proven. In the statement of Theorem 2, asymptotically efficient estimators mean that the asymptotic variances attain the semiparametric efficiency bounds as defined in Bickel et al. (1993, Chap. 3).
The next two theorems justify the validity of the proposed approach to estimating the asymptotic covariance.
Theorem 3
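Under C.1–C.10,
$$-\frac{pl_n(\hat\theta_n + h_n e) - 2\, pl_n(\hat\theta_n) + pl_n(\hat\theta_n - h_n e)}{n h_n^2} \to e^T I(\theta_0) e$$
in probability, where hn = Op(n^{−1/2}), e is any vector in R^{d1+d2} with norm 1, and I(θ0) is the efficient information matrix for θ.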
Theorem 3 does not deal with the estimation of the asymptotic variance of Λ̂n, which is often desirable when one wishes to predict future survival experience. Theorem 2 suggests that the parameter Λ(·), although infinite-dimensional, can be treated in the same way as the finite-dimensional parameters β and γ. Thus, the asymptotic covariance matrix can be estimated by the inverse of the observed information matrix. Specifically, for any constant vector (h1, h2) ∈ R^{d1} × R^{d2} and any bounded function h3, the asymptotic variance of the linear functional of (β̂n, γ̂n, Λ̂n) determined by (h1, h2, h3) can be estimated by a quadratic form in the inverse of Jn evaluated at hn, where hn is the vector comprising h1, h2 and the h3(Yij) for which Δij = 1, and Jn is the negative Hessian matrix of log Ln(β,γ,Λ) with respect to (β,γ) and the jump sizes of Λ at the Yij for which Δij = 1, evaluated at (β̂n, γ̂n, Λ̂n). The next theorem formalizes this approximation.
Theorem 4
Let V(h1, h2, h3) be the asymptotic variance of . Under C.1 ~ C.10, uniformly in (h1, h2, h3) such that ‖h1‖ ≤ 1,‖h2‖ ≤ 1 and ‖h3‖V ≤ 1, where ‖h‖V denotes the total variation of h(t) in [0,τ].
4. Simulation Studies
We carried out simulation studies to assess the performance of the proposed inference procedures in finite samples. We set the cluster size to two, and generated failure times from the proportional hazards model with a Gaussian random effect
$$\Lambda(t \mid X_{ij}, b_i) = \Lambda_0(t) \exp(\beta_1 X_{1ij} + \beta_2 X_{2ij} + b_i),$$
where Λ0(t) = t, β1 = 1, β2 = −1, X1i1 ≡ X1i2 is a dichotomous variable with half of the subjects taking the value 1, X2ij is an independent uniform(0,1) variable, and bi is normal with mean zero and variance σ². The censoring time was set to be the minimum of 3 and a uniform(0,4) variable, corresponding to a censoring rate of approximately 35%. The MLEs of β and σ² were obtained via the EM algorithm. The standard error estimates were based on the profile likelihood function at some fixed parameter values around the MLEs. This calculation was done through the EM algorithm, where these parameters were held fixed in both the E-step and M-step. Then the variance of the MLE was computed by using the numerical difference of the profile log-likelihood function as stated in Theorem 3 for an appropriate choice of hn. Specifically, we chose . The confidence intervals for β and σ² were based on the normal approximations to β̂n and , respectively. We considered variance estimation with hn ranging from . It turned out that the variance estimation for β̂n is fairly robust to the choice of hn, whereas that of the variance-component estimator is more sensitive. Murphy et al. (1997) suggested a rule of thumb of , where θ̂n is the maximum likelihood estimate.
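The data-generating mechanism described above can be sketched as follows. This is an illustration rather than the authors' code. It assumes that the censoring time is drawn separately for each subject and that the dichotomous covariate is assigned to exactly half of the clusters at random; the function name `simulate_clusters` is hypothetical.

```python
import numpy as np


def simulate_clusters(n=200, beta=(1.0, -1.0), sigma2=1.0, seed=0):
    """Simulate n clusters of size two from the proportional hazards model
    with Lambda0(t) = t and a Gaussian random intercept."""
    rng = np.random.default_rng(seed)
    g = rng.permutation(np.arange(n) % 2)           # cluster-level 0/1 covariate
    X1 = np.column_stack([g, g]).astype(float)      # X1 is shared within a cluster
    X2 = rng.uniform(0.0, 1.0, size=(n, 2))         # subject-level uniform(0,1)
    b = rng.normal(0.0, np.sqrt(sigma2), size=n)    # random effect b_i
    # With Lambda0(t) = t the conditional hazard is constant in t, so T_ij is
    # exponential with rate exp(beta1 * X1 + beta2 * X2 + b_i).
    rate = np.exp(beta[0] * X1 + beta[1] * X2 + b[:, None])
    T = rng.exponential(1.0, size=(n, 2)) / rate
    C = np.minimum(3.0, rng.uniform(0.0, 4.0, size=(n, 2)))  # censoring times
    Y = np.minimum(T, C)
    delta = (T <= C).astype(int)
    return {"Y": Y, "delta": delta, "X1": X1, "X2": X2}
```

The empirical censoring rate of a simulated data set is `1 - data["delta"].mean()`, which can be compared with the rate reported in the text.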
The simulation results with n = 200 are summarized in Table 1. These results demonstrate that the proposed methods work well in that the parameter estimators have little bias, the variance estimators are reasonably accurate and the confidence intervals have proper coverage probabilities. Additional simulation studies (results not shown) revealed that the efficiency gains of the proposed MLEs over the estimators of Cai et al. (2002) can be substantial in realistic situations.
Table 1.
| Setting | Parameter | Bias | SE | SEE | 95% CP |
|---|---|---|---|---|---|
| σ² = 1 | β1 | −0.003 | 0.210 | 0.201 | 0.942 |
| | β2 | −0.015 | 0.267 | 0.285 | 0.953 |
| | σ | −0.018 | 0.151 | 0.144 | 0.960 |
| σ² = 3 | β1 | −0.010 | 0.300 | 0.287 | 0.946 |
| | β2 | −0.012 | 0.313 | 0.328 | 0.958 |
| | σ | −0.025 | 0.190 | 0.180 | 0.935 |
Note: Bias and SE are the bias and standard error of the estimator. SEE is the mean of the standard error estimator, and 95% CP is the coverage probability of the 95% confidence interval. Each entry is based on 1,000 simulated data sets.
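The four summary measures in Table 1 can be computed from the replicate-level output as in the sketch below. The function name `summarize` and its inputs are hypothetical, and the coverage calculation assumes Wald-type intervals on the scale of the estimator.

```python
import numpy as np


def summarize(estimates, std_errors, truth, z=1.959964):
    """Return (Bias, SE, SEE, 95% CP) for one parameter across replicates."""
    est = np.asarray(estimates, dtype=float)
    se = np.asarray(std_errors, dtype=float)
    bias = est.mean() - truth                 # mean estimate minus true value
    se_emp = est.std(ddof=1)                  # empirical standard error
    see = se.mean()                           # mean of the standard error estimates
    covered = (est - z * se <= truth) & (truth <= est + z * se)
    return bias, se_emp, see, covered.mean()  # last entry: coverage probability
```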
In related simulation studies, Zeng et al. (2005) generated failure times from proportional odds models with Gaussian random effects. The MLEs were calculated via the optimization search method and the variance estimates were calculated by inverting the observed information matrix. The conclusions are similar.
5. An Example
We now consider the well-known Diabetic Retinopathy Study (Huster, Brookmeyer and Self (1989)). This study was conducted to assess the effectiveness of laser photocoagulation in delaying visual loss among patients with diabetic retinopathy. The subset of the data that has been analyzed extensively in the statistical literature pertains to 197 high-risk patients. For each patient, one eye was randomly selected to receive the laser treatment while the other eye was observed without treatment. The failure time of interest is the time to visual loss as measured by visual acuity less than 5/200. As in the existing literature, we consider three covariates: X1ij indicates, by the values 1 versus 0, whether or not the jth eye (j = 1 for the left eye and j = 2 for the right eye) of the ith patient was treated with laser photocoagulation, X2i1 ≡ X2i2 indicates, by the values 1 versus 0, whether the ith patient had adult-onset or juvenile-onset diabetes, and X3ij ≡ X1ij X2ij is the interaction between X1ij and X2ij. We fit model (1.2) with these three covariates, along with a Gaussian random effect bi to account for the dependence between the two eyes of the same patient. We consider the transformation G0(·) from the following class: {1 − (1 + ξx)^{−1/ξ}; ξ ∈ [0,1]}, where ξ = 0 (interpreted as the limit ξ → 0) corresponds to the proportional hazards model and ξ = 1 to the proportional odds model.
We vary the value of ξ from 0 to 1 in 0.1 increments and maximize the corresponding likelihood. It turns out that ξ = 0.3 is the best choice in that it yields the maximal value of the observed-data likelihood function. Table 2 summarizes the results under the selected transformation model, as well as the proportional hazards and proportional odds models. There is a high degree of dependence between the two eyes of the same patient in time to visual loss. The treated eye is less likely to suffer visual loss than the untreated eye, and treatment is more effective for adult-onset diabetics than for juvenile-onset diabetics.
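The transformation class and the grid of ξ values can be written down directly. In the sketch below the function names `G0` and `H0` mirror the notation of the paper, but the code itself is only illustrative; the relation H0(x) = −log{1 − G0(x)} gives H0(x) = log(1 + ξx)/ξ for ξ > 0.

```python
import numpy as np


def G0(x, xi):
    """G0(x) = 1 - (1 + xi * x) ** (-1 / xi), with the limit 1 - exp(-x) at xi = 0."""
    x = np.asarray(x, dtype=float)
    if xi == 0.0:
        return 1.0 - np.exp(-x)               # proportional hazards
    return 1.0 - (1.0 + xi * x) ** (-1.0 / xi)


def H0(x, xi):
    """Cumulative-hazard transformation H0(x) = -log(1 - G0(x))."""
    x = np.asarray(x, dtype=float)
    if xi == 0.0:
        return x.copy()
    return np.log1p(xi * x) / xi


# Grid of candidate transformations: xi = 0, 0.1, ..., 1.
xi_grid = np.linspace(0.0, 1.0, 11)
```

For each value in `xi_grid` one would maximize the likelihood (2.1) with the corresponding G0 and retain the ξ with the largest maximized value, as described in the text.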
Table 2.
| Parameter | ξ = 0 (proportional hazards) | ξ = 0.3 | ξ = 1 (proportional odds) |
|---|---|---|---|
| β1 | −0.523 (0.231) | −0.564 (0.250) | −0.659 (0.295) |
| β2 | 0.421 (0.264) | 0.447 (0.288) | 0.496 (0.345) |
| β3 | −0.999 (0.369) | −1.073 (0.398) | −1.234 (0.466) |
| σ | 1.038 (0.191) | 1.114 (0.207) | 1.296 (0.251) |
Note: Standard error estimates are shown in parentheses.
6. Conclusion
The proposed likelihood-based methods have several advantages over the estimating-equations methods of Cai et al. (2002). First, the proposed estimators are more efficient. Second, it is less time-consuming to evaluate the variances of the proposed estimators than those of Cai et al.’s estimators. Third, the assumption on the independence of the censoring time and failure time required in the Cai et al. approach is avoided in the likelihood approach. Finally, the likelihood approach allows one to use AIC and other likelihood-based criteria for model selection, as demonstrated in the example.
Our experience shows that the algorithms described in Section 2 perform very well when the initial values are chosen appropriately. We recommend setting β = 0, σ2 = 1 and the jump sizes of Λ to 1/n. The algorithms are quite fast. It took about 10 hours on an IBM BladeCenter HS20 machine to complete all the simulation studies reported in Table 1. No convergence problem was encountered in any simulation run.
We have found that the estimation of regression parameters is not sensitive to misspecification of the random-effects distribution. For example, when we simulated failure times from the proportional hazards gamma frailty model but fitted the data using the proportional hazards model with a normal random effect, the estimators of the regression parameters had very little bias and the confidence intervals had reasonable coverage probabilities.
An alternative approach to random-effects models is marginal models. Indeed, Cai, Wei and Wilcox (2000) studied marginal linear transformation models for clustered failure time data. There are several reasons for using random-effects models. First, these models allow one to predict survival experience of a subject given the event history of other members of the same cluster. Second, efficient estimation is possible under these models. Third, the dependence structures can be of scientific interest, especially in genetic studies.
Acknowledgements
This research was supported by the National Institutes of Health Grants R01 CA82659 (D. Zeng and D. Y. Lin) and R01 CA76404 (X. Lin). The authors thank an associate editor and the referees for their helpful comments.
Appendix
In this appendix, we outline the proofs of the lemmas and theorems. The detailed proofs are given in a supplementary technical report. We introduce some notation. Let Oi denote the observations in the ith cluster consisting of ni and (Yij, Δij, X̅ij(Yij), Z̅ij(Yij)), j = 1,…, ni. Let 𝒫n and 𝒫 be the empirical measure and the expectation of n i.i.d. observations O1,…, On. That is, for any measurable function g(O), 𝒫n[g(O)] = n^{−1} Σ_{i=1}^{n} g(Oi) and 𝒫[g(O)] = E[g(O)].
Proof of Lemma 1
Under C.7, for some constant c1. Therefore, the left-hand side of (3.3) is bounded by
Let M be a constant larger than 1 such that M^{−1} ≤ exp{Xij(t)^T β} ≤ M and M^{−1} ≤ ‖Zij(t)‖ ≤ M. Then
Thus,
Since Zij and Xij are bounded and ψ(b;γ) satisfies C.8, (3.3) in Lemma 1 holds for some constant c0.
Proof of Lemma 2
Suppose that the parameters (β*,γ*,Λ*) and (β0, γ0, Λ0) yield the same joint density of the data. That is, almost surely,
We wish to show that β* = β0, γ* = γ0 and Λ* = Λ0. For any fixed k ≤ ni, we perform the following actions on both sides of the above equality: for j ≤ k, we let Δij = 1 and integrate Yij from 0 to tj; for j > k, if Δij = 1, we integrate Yij from 0 to τ; otherwise, we let Yij = τ. We then sum over the equalities for all possible {Δij : j = k + 1,…, ni} to obtain
(A.1) |
It follows from C.9 that β* = β0, γ* = γ0 and Λ* = Λ0.
To prove the second half of the lemma, we suppose that there exists a one-dimensional submodel at the true parameters, denoted by (β0 + εh1, γ0 + εh2, Λ0 + ε∫h3(s)dΛ0(s)), ε ∈ R, for which the Fisher information is zero, or equivalently, the score function along this path is zero almost surely. Simple algebraic manipulations yield
(A.2) |
where
We show that (A.2) yields h1 = 0, h2 = 0 and h3 = 0. Fix a k such that 1 ≤ k ≤ ni and, for any function of the type g1(Δi1, Yi1) … gni(Δini, Yini), perform the following action. Partition {(Δij, Yij) : j = 1,…, ni} into three subsets: for j ≤ k, let Δij = 1 and integrate Yij from 0 to tj; for j > k and Δij = 0, let Yij = τ; for j > k and Δij = 1, integrate Yij from 0 to τ. Apply this action to the integrand on the left-hand side of (A.2). Then sum over all possible choices of Δij ∈ {0,1} for j > k and let t1 = … = tni = t. These calculations yield
where
It then follows from C.10 that h2 = 0 and bim(t, b) = 0. The latter implies that Xij(t)Th1 + h3(t) = 0. Thus, C.10 yields h1 = 0 and h3 = 0.
Proof of Lemma 3
Under C.7, is bounded by some constant c1. Thus, (2.1) is bounded from above by
where c2 is some number depending on the observations. On the other hand, C.1 implies that there exists some (i,j) such that Yij = τ with probability tending to one. Therefore, at least one integral in the above expression is present, and such an integral is zero if Λ has an infinite jump size for some failure time. Thus the NPMLE exists and Λ^n has finite jump sizes.
Proof of Theorem 1
Let Ω be the measurable set in the probability space such that all the conditions hold for any fixed ω ∈ Ω. Clearly, P(Ω) = 1. Thus, the following arguments pertain to fixed ω ∈ Ω. We use O(1) to denote some positive constant, which may depend on ω but is independent of parameters and sample size. The proof consists of two steps.
Step 1
We prove that Λ^n(t) has an upper bound in [0, τ] with probability one. Write ln(β, γ, Λ) = log Ln(β, γ, Λ). We prove the boundedness of Λ^n(·) by contradiction. Suppose that Λ^n(τ) → ∞. From the compactness of Θ, we may also assume that β^n → β* and γ^n → γ*. The idea of obtaining a contradiction is the following: we first construct a step function Λ̅n with jumps only at the Yij for which Δij = 1 such that Λ̅n is close to the true function Λ0; then since (β^n, γ^n, Λ^n) maximizes ln(β, γ, Λ), it holds that 0 ≤ {ln(β^n, γ^n, Λ^n) − ln(β0, γ0, Λ̅n)}/n; finally, we show that if Λ^n(τ) → ∞, the right-hand side of the foregoing inequality will eventually be negative, which yields the contradiction.
By differentiating ln(β, γ, Λ) with respect to Λ{Yij} and setting it to zero, we see that Λ^n{Yij} satisfies
(A.3) |
where R1k(·) is defined in the proof of Lemma 2, and
In view of (A.3), we construct a step function Λ̅n(t) with jumps only at the Yij with jump size Λ̅n{Yij} satisfying
(A.4) |
Thus, . By the Glivenko-Cantelli property of the classes R1k and R2k (proved in the appendix of our technical report), we can show that Λ̅n(t) converges uniformly in [0,τ] to Λ0(t).
Clearly, n^{−1}ln(β^n, γ^n, Λ^n) − n^{−1}ln(β0, γ0, Λ̅n) ≥ 0. From the construction of Λ̅n and according to (3.3), this inequality is equivalent to
(A.5) |
We show that if Λ^n(τ) → ∞, the right-hand side of (A.5) is eventually negative. The proof of the divergence of the right-hand side mimics the arguments of Murphy (1994). Specifically, we consider a partition of [0, τ] which consists of a sequence τ = s0 > … > sN = 0. Then the right-hand side of (A.5) can be bounded from above by
which is further bounded by
(A.6) |
Using Murphy’s (1994) idea of constructing the partition, we can choose s0 > s1 > s2 > … > sN such that the first term on the right-hand side of (A.6) diverges to −∞ as Λ^n(τ) → ∞ and the second term and the third term are negative for large n. This contradicts the fact that (A.6) should be non-negative.
Thus we have shown that, with probability one, Λ^n(τ) has an upper bound. By Helly’s Selection Theorem, we can assume that β^n → β*, γ^n → γ*, and Λ^n converges pointwise to some increasing function Λ*.
Step 2
We show that β* = β0, γ* = γ0 and Λ*(t) = Λ0(t). We consider
(A.7) |
Using equations (A.3) and (A.4), we can easily see that Λ^n(t) is absolutely continuous with respect to Λ̅n(t) and
(A.8) |
where
It follows from the Donsker property proved in the appendix of our technical report that
We wish to take limits on both sides of (A.8). We first show that the denominator of the integrand is uniformly bounded away from zero. From (A.8), for any ε > 0,
Let ε → 0 and use the Monotone Convergence Theorem to obtain
(A.9) |
We claim that min_{t∈[0,τ]} |𝒫[ν(O; β*, γ*, Λ*, t)]| > 0. If this inequality does not hold, then there exists some t* ∈ [0,τ] such that 𝒫[ν(O; β*, γ*, Λ*, t*)] = 0. The function 𝒫[ν(O; β*, γ*, Λ*, t)] is right-differentiable almost everywhere provided that δc(t|X̅ij(t), Z̅ij(t)) exists and is uniformly bounded almost everywhere. Thus, there exists a δ > 0 such that for t ∈ (t*, t* + δ),
almost everywhere. Thus (A.9) implies . This is a contradiction.
We can now take the limits on both sides of (A.8) to obtain
We conclude that Λ*(t) is absolutely continuous with respect to Λ0(t), so that Λ*(t) is differentiable with respect to t. In addition, dΛ^n(t)/dΛ̅n(t) converges to dΛ*(t)/dΛ0(t) uniformly in t. Let n → ∞ in (A.7). Then we have
which is the negative Kullback-Leibler information. The identifiability result in Lemma 2 implies that β* = β0, γ* = γ0, and Λ* = Λ0.
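The step from the negative Kullback-Leibler information to equality of the two distributions is the standard Jensen argument, sketched below. The symbols p_0 and p_* are assumed notation introduced here for illustration (they do not appear in the paper): they denote the densities of a single cluster's observation O, with respect to a common dominating measure μ, under (β0, γ0, Λ0) and (β*, γ*, Λ*) respectively.

```latex
% Assumed notation: p_0, p_* are the densities of O under the true and the
% limiting parameters; \mathcal{P} is expectation under the true distribution.
\begin{align*}
\mathcal{P}\left[\log\frac{p_*(O)}{p_0(O)}\right]
  &\le \log \mathcal{P}\left[\frac{p_*(O)}{p_0(O)}\right]
     && \text{(Jensen's inequality; $\log$ is concave)} \\
  &=   \log \int_{\{p_0 > 0\}} p_*(o)\, d\mu(o)
   \;\le\; \log 1 \;=\; 0 .
\end{align*}
% Equality throughout requires p_*(O) = p_0(O) with probability one.
```

Since the normalized log-likelihood difference n^{−1}ln(β^n, γ^n, Λ^n) − n^{−1}ln(β0, γ0, Λ̅n) is non-negative for every n, its limit is non-negative as well. Combined with the display above, the limit must be exactly zero, so the two densities agree almost surely, and this is the point at which the identifiability result is invoked.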
Combining the results from Step 1 and Step 2, we conclude that, almost surely,
The uniform convergence of Λ^n to Λ0 follows from the fact that Λ0 is a continuous function.
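The passage from pointwise to uniform convergence uses only the monotonicity of Λ^n and the continuity of Λ0. A short sketch of this standard argument follows; the partition points t_k and the tolerance ε are notation introduced here for illustration.

```latex
% Fix \varepsilon > 0. By uniform continuity of \Lambda_0 on [0,\tau], choose
% 0 = t_0 < t_1 < \dots < t_K = \tau with
% \Lambda_0(t_k) - \Lambda_0(t_{k-1}) \le \varepsilon for every k.
% For t \in [t_{k-1}, t_k], monotonicity of \hat\Lambda_n and \Lambda_0 gives
\begin{align*}
\hat\Lambda_n(t) - \Lambda_0(t)
  &\le \hat\Lambda_n(t_k) - \Lambda_0(t_{k-1})
   \le \hat\Lambda_n(t_k) - \Lambda_0(t_k) + \varepsilon, \\
\hat\Lambda_n(t) - \Lambda_0(t)
  &\ge \hat\Lambda_n(t_{k-1}) - \Lambda_0(t_k)
   \ge \hat\Lambda_n(t_{k-1}) - \Lambda_0(t_{k-1}) - \varepsilon .
\end{align*}
% Hence
\[
\sup_{t \in [0,\tau]} \bigl|\hat\Lambda_n(t) - \Lambda_0(t)\bigr|
  \le \max_{0 \le k \le K} \bigl|\hat\Lambda_n(t_k) - \Lambda_0(t_k)\bigr| + \varepsilon ,
\]
% and the maximum over the finitely many t_k tends to zero by pointwise convergence.
```

Because ε is arbitrary, the supremum tends to zero.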
Proofs of Theorems 2–4
The proof of the weak convergence in Theorem 2 makes use of Theorem 3.3.1 of van der Vaart and Wellner (1996). The most difficult part is to verify that the information operator at the true parameters is invertible. This can be done by showing that the information operator is the sum of an invertible operator and a compact operator and that the information operator is one to one. The former is derived from the explicit expression of the information operator and the latter follows from the fact, shown in Lemma 2, that any submodel has non-singular information. The details can be found in our technical report.
The proof of Theorem 3 proceeds by verifying the conditions of Murphy and van der Vaart (2000). In particular, we can construct an approximate least favorable submodel using the invertibility of the information operator. The no-bias condition along the least favorable submodel follows from the arguments used in proving Theorem 1. The other regularity conditions follow from the Donsker property of appropriate functional classes proved in our technical report.
The proof of Theorem 4 is essentially the same as that of Theorem 3 of Parner (1998). The main idea is that the empirical information operator based on Jn approximates the true information operator, so that it is invertible; see our technical report.
Contributor Information
Donglin Zeng, Department of Biostatistics, University of North Carolina, Chapel Hill, NC 27599-7420, U.S.A. E-mail: dzeng@bios.unc.edu.
D. Y. Lin, Department of Biostatistics, University of North Carolina, Chapel Hill, NC 27599-7420, U.S.A. E-mail: lin@bios.unc.edu
Xihong Lin, Department of Biostatistics, Harvard University, Boston, MA 02115, U.S.A. E-mail: xlin@hsph.harvard.edu.
References
- Bennett S. Analysis of survival data by the proportional odds model. Statist. Medicine. 1983;2:273–277. doi: 10.1002/sim.4780020223.
- Bickel PJ. Efficient testing in a class of transformation models. Papers on Semiparametric Models at the ISI Centenary Session Amsterdam. 1986:63–81.
- Bickel PJ, Klaassen CAJ, Ritov Y, Wellner JA. Efficient and Adaptive Estimation for Semiparametric Models. Baltimore: Johns Hopkins University Press; 1993.
- Cai T, Cheng SC, Wei LJ. Semiparametric mixed-effects models for clustered failure time data. J. Amer. Statist. Assoc. 2002;97:514–522.
- Cai T, Wei LJ, Wilcox M. Semiparametric regression analysis for clustered failure time data. Biometrika. 2000;87:867–878.
- Chen K, Jin Z, Ying Z. Semiparametric analysis of transformation models with censored data. Biometrika. 2002;89:659–668.
- Cheng SC, Wei LJ, Ying Z. Analysis of transformation models with censored data. Biometrika. 1995;82:835–845.
- Coleman TF, Li Y. On the convergence of reflective Newton methods for large-scale nonlinear minimization subject to bounds. Math. Program. 1994;67:189–224.
- Coleman TF, Li Y. An Interior, Trust region approach for nonlinear minimization subject to bounds. SIAM J. Optim. 1996;6:418–445.
- Cox DR. Regression models and life tables (with discussion). J. Roy. Statist. Soc. Ser. B. 1972;34:187–220.
- Cuzick J. Rank regression. Ann. Statist. 1988;16:1369–1389.
- Dabrowska DM, Doksum KA. Estimation and testing in the two-sample generalized odds-rate model. J. Amer. Statist. Assoc. 1988;83:744–749.
- Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Statist. Soc. Ser. B. 1977;39:1–38.
- Huster WJ, Brookmeyer R, Self SG. Modelling paired survival data with covariates. Biometrics. 1989;45:145–156.
- Kosorok MR, Lee BL, Fine JP. Robust inference for proportional hazards univariate frailty regression models. Ann. Statist. 2004;32:1448–1491.
- Laird NM, Ware JH. Random-effects models for longitudinal data. Biometrics. 1982;38:963–974.
- Lam KF, Leung TL. Marginal likelihood estimation for proportional odds models with right censored data. Lifetime Data Analysis. 2001;7:39–54. doi: 10.1023/a:1009673026121.
- Murphy SA. Consistency in a proportional hazards model incorporating a random effect. Ann. Statist. 1994;22:712–731.
- Murphy SA. Asymptotic theory for the frailty model. Ann. Statist. 1995;23:182–198.
- Murphy SA, Rossini AJ, van der Vaart AW. Maximal likelihood estimate in the proportional odds model. J. Amer. Statist. Assoc. 1997;92:968–976.
- Murphy SA, van der Vaart AW. On the profile likelihood. J. Amer. Statist. Assoc. 2000;95:449–465.
- Parner E. Asymptotic theory for the correlated gamma-frailty model. Ann. Statist. 1998;26:183–214.
- Pettitt AN. Proportional odds models for survival data and estimates using ranks. Appl. Statist. 1984;33:169–175.
- Shen X. Proportional odds regression and sieve maximum likelihood estimation. Biometrika. 1998;85:165–177.
- van der Vaart AW, Wellner JA. Weak Convergence and Empirical Processes. Berlin: Springer; 1996.
- Wu CO. Estimating the real parameter in a two-sample proportional odds model. Ann. Statist. 1995;23:376–395.
- Zeng D, Lin DY, Yin G. Maximum likelihood estimation for the proportional odds model with random effects. J. Amer. Statist. Assoc. 2005;100:470–483.