Abstract
In many biomedical studies, budget constraints commonly dictate that the primary covariate be collected only on a randomly selected subset of the full study cohort. Often, an inexpensive auxiliary covariate for the primary exposure variable is readily available for all cohort subjects, and valid statistical methods that use this auxiliary information to improve study efficiency need to be developed. To this end, we develop an estimated partial likelihood approach for correlated failure time data with auxiliary information, assuming a marginal hazard model with a common baseline hazard function. We establish the asymptotic properties of the proposed estimators. The proofs are nontrivial because the estimating equation is not martingale-based and classical martingale theory is not sufficient; instead, our proofs rely on modern empirical process theory. The proposed estimator is evaluated through simulation studies and is shown to have increased efficiency compared to existing methods. The proposed methods are illustrated with a data set from the Framingham study.
Keywords: Marginal hazard model, Correlated failure time, Validation set, Auxiliary covariate
1 Introduction
Exposure assessment, such as assays of biomarkers or genetic traits, can be prohibitively expensive in modern biomedical studies. Due to budget constraints, the main exposure in many studies can be assembled only on a subset of the full study cohort. This subset is referred to as the validation set. Meanwhile, an inexpensive auxiliary variable for the main exposure is often readily available for all cohort subjects. It is desirable to improve study efficiency by properly utilizing this auxiliary information in the statistical inference.
In failure time studies, several methods have been proposed for this problem. For example, Zhou and Pepe (1995), Zhou and Wang (2000) and Wang et al. (1997) studied the auxiliary covariate problem for a multiplicative semiparametric hazard model using regression calibration techniques. Kulich and Lin (2000) and Jiang and Zhou (2007) proposed corrected pseudo-score estimators for the additive risks model of Lin and Ying (1994). While the aforementioned methods focus on univariate failure time data, correlated failure time data are commonly encountered in practice; for example, in family studies, individuals within a family may be correlated due to shared genetic or environmental factors. Two major classes of models have been proposed for correlated failure time data: frailty models (Clayton and Cuzick, 1985; Nielsen et al., 1992; Hougaard, 2000; Gorfine, Zucker, and Hsu, 2006) and marginal hazard models (Wei, Lin, and Weissfeld, 1989; Lee, Wei, and Amato, 1992; Cai and Prentice, 1995, 1997; Spiekerman and Lin, 1998). When the intracluster correlation is not of interest, marginal hazard models are the preferred approach since they avoid strong assumptions about the dependencies among correlated failure times.
In the literature on marginal hazard models, two types of models have been extensively studied: the different-baseline-hazards model (Wei et al., 1989; referred to as the WLW model) and the common-baseline model (Lee et al., 1992; hereafter referred to as the CBM model). When the main exposure is observed only on a validation set, several methods have been proposed to fit marginal hazard models by making use of the auxiliary information. For example, Hu and Lin (2002) developed a corrected score approach that provides a class of consistent estimators assuming that the auxiliary and true covariates have the same mean. Liu et al. (2009) proposed an estimated pseudo-partial likelihood method assuming a discrete auxiliary covariate. For a continuous auxiliary covariate, Liu et al. (2010) proposed to correct the partial likelihood through a kernel estimation procedure. All of these methods are based on a marginal hazard model with different baseline hazards (the WLW model). However, in many practical situations, including studies of disease occurrence patterns in twins or siblings, littermate experiments, and clustered failure time data in which the subjects within clusters are exchangeable, it is natural to restrict the baseline hazard functions to be common for some or all members of a cluster. To the best of our knowledge, no methods are available for the analysis of correlated failure time data with auxiliary information under the CBM model.
In this paper, we propose a new inference procedure based on a pseudo-partial likelihood for clustered failure time data in which the main exposure is observed only on the validation set. We assume a marginal proportional hazards model with a common baseline hazard and a discrete auxiliary variable. The relative risk function is estimated by a weighted average over the observations in the validation set. Consequently, the resulting estimating equation is not a marginal martingale, and classical martingale theory, the key to the theoretical development of the CBM model, is not sufficient to derive the asymptotic results in this case; we instead employ results from modern empirical process theory. Simulation studies show that the proposed estimator is more efficient than the simple estimator based on the validation data alone, while not much less efficient than the full-data estimator. A merit of our approach is that it does not require specification of the association between the main exposure and its auxiliary.
The rest of the paper is organized as follows. In Section 2, we describe the data structure and propose an estimation procedure for the regression coefficients and the cumulative baseline hazard function. The large-sample properties of the proposed estimators are given in Section 3. Extensive simulation studies examining the finite-sample performance and robustness of the proposed methods are presented in Section 4. We illustrate the proposed method through the analysis of a real data set from the Framingham study in Section 5. Concluding remarks are given in Section 6. All proofs are outlined in the Appendix.
2 Model and Estimation
2.1 Notation and Models
We first set up the requisite notation. Suppose that the full cohort consists of n independent clusters and that the i-th cluster has n_i correlated subjects. We assume that subjects within the same cluster are exchangeable conditional on covariates (Hougaard, 2000). Suppose that each individual has a fixed probability of having the main exposure covariate measured. The set of individuals whose main exposure covariate and other covariates are observed is referred to as the validation set.
Let T_ik and C_ik be the failure time and censoring time for (i, k), where (i, k) denotes the k-th subject in the i-th cluster. The observed time is X_ik = min(T_ik, C_ik). Let Y_ik(t) = I(X_ik ≥ t) be the at-risk indicator process, Δ_ik = I(T_ik ≤ C_ik) the failure indicator, and N_ik(t) = I(X_ik ≤ t, Δ_ik = 1) the standard counting process, where I(·) is the indicator function. Let E_ik(t) and Z_ik(t) denote the possibly time-dependent covariates, where E_ik(t) is the main exposure, which is subject to missingness, and Z_ik(t) is the remaining covariate vector, which is fully observed. Let E_ik = {E_ik(t), t ≥ 0} denote the covariate history, with Z_ik defined similarly. All time-dependent covariates are assumed to be external, i.e., not affected by the disease process (Kalbfleisch and Prentice, 2002). Suppose that the n sets of clustered observations (T, C, E, Z) are independent and identically distributed. Within each cluster, the observed vectors (T, C, E, Z) may be dependent on each other but are identically distributed. The number of subjects in each cluster, n_i, does not depend on the observations of (T, C, E, Z). In addition, the clustered observations of T and C are assumed to be independent conditional on the clustered observations of the covariates E and Z (i.e., independent censoring).
Suppose that the complete covariate histories (E_ik(·), Z_ik(·)) are available for subjects in the validation set, and only Z_ik(·) is available for subjects in the non-validation set. Let η_ik be the indicator of subject (i, k) being selected into the validation set; η_ik is assumed to be independent of {N_ik(·), Y_ik(·), E_ik(·), Z_ik(·), n_i : k = 1, ..., n_i}. In addition, auxiliary information for E_ik(·), denoted A_ik(·), is observed for all cohort subjects. In this paper, we assume that A_ik(·) is categorical. The observed data can therefore be represented by (X_ik, Δ_ik, Z_ik, η_ik E_ik, A_ik). For i = 1, ..., n and k = 1, ..., n_i, we assume that the η_ik's are independent Bernoulli variables with Pr(η_ik = 1) = ρ.
We assume that, conditional on (E_ik, Z_ik), A_ik provides no additional information to the regression model, i.e.,

λ_ik(t | E_ik(t), Z_ik(t), A_ik(t)) = λ_ik(t | E_ik(t), Z_ik(t)),   (1)

where λ_ik(·) denotes the corresponding conditional marginal hazard function.
Suppose that the marginal hazard function of T_ik follows the proportional hazards model (Cox, 1972):

λ_ik(t | E_ik(t), Z_ik(t)) = λ0(t) exp{β1 E_ik(t) + β2′Z_ik(t)},   (2)

where λ0(t) is an unspecified common baseline hazard function and β = (β1, β2′)′ is the vector of parameters to be estimated.
When subject (i, k) belongs to the non-validation set, we observe only Z_ik and A_ik. In this situation, we can derive an induced hazard function as follows:

λ_ik(t | Z_ik(t), A_ik(t)) = λ0(t) exp{β2′Z_ik(t)} E[exp{β1 E_ik(t)} | X_ik ≥ t, A*_ik(t)],   (3)

where A* includes the auxiliary variable A and the part of the information in the covariate Z that, given A, is still related to E, and E[·] denotes expectation. That is, A* satisfies the following condition on conditional densities: f(E_ik(t) | X_ik ≥ t, Z_ik(t), A_ik(t)) = f(E_ik(t) | X_ik ≥ t, A*_ik(t)), where f denotes the conditional density function. Notice that under this formulation, A* still satisfies the auxiliary assumption that, given E and Z, A* does not contribute to the regression model, i.e., λ(t | Z, E, A*) = λ(t | Z, E). In this paper, we assume that A*_ik is categorical.
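For the reader's convenience, (3) follows from a standard double-expectation argument for induced hazards; the following is a sketch under the independent-censoring and auxiliary assumptions above, not a full derivation:

```latex
% Induced hazard via iterated expectation over the at-risk distribution of E:
\begin{aligned}
\lambda_{ik}\{t \mid Z_{ik}(t), A_{ik}(t)\}
  &= \mathbb{E}\!\left[\lambda_{ik}\{t \mid E_{ik}(t), Z_{ik}(t)\}
      \mid X_{ik} \ge t,\ Z_{ik}(t),\ A_{ik}(t)\right] \\
  &= \lambda_0(t)\, e^{\beta_2' Z_{ik}(t)}\,
      \mathbb{E}\!\left[e^{\beta_1 E_{ik}(t)} \mid X_{ik} \ge t,\ A^{*}_{ik}(t)\right],
\end{aligned}
```

where the last equality uses the defining property of A* given above.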
2.2 Proposed estimators
By (2) and (3), the induced relative risk relative to the baseline is

r_ik(β, t) = η_ik exp{β1 E_ik(t) + β2′Z_ik(t)} + (1 − η_ik) exp{β2′Z_ik(t)} φ_ik(β1, t),   (4)

where φ_ik(β1, t) = E[exp{β1 E_ik(t)} | X_ik ≥ t, A*_ik(t)]. Note that φ_ik is a conditional expectation, and exp{β2′Z_ik(t)}φ_ik(β1, t) can be interpreted as the induced relative risk for a subject with missing E. If the data were completely observed, then the regression parameter β of model (2) could be estimated by solving the estimating equation U(β) = 0 (Lee et al., 1992), where

U(β) = Σ_{i=1}^n Σ_{k=1}^{n_i} ∫_0^τ { r_ik^(1)(β, u)/r_ik(β, u) − S^(1)(β, u)/S^(0)(β, u) } dN_ik(u),   (5)

with τ denoting the study end time, r_ik^(d)(β, u) the d-th derivative of r_ik(β, u) with respect to β (d = 0, 1), and S^(d)(β, u) = n^{−1} Σ_{j=1}^n Σ_{l=1}^{n_j} Y_jl(u) r_jl^(d)(β, u). Since the data are not complete, (5) cannot be calculated directly. We first need to estimate r_ik, for which it suffices to estimate φ_ik(β1, t). Before giving the estimator, we define some necessary notation. Suppose A*(t) is discrete with finite support {a_1, ..., a_L} and distribution P(A*(t) = a_m) = p_m, m = 1, ..., L. Let

ψ_am(β1, t) = E[η Y(t) exp{β1 E(t)} I(A*(t) = a_m)]  and  π_am(t) = E[η Y(t) I(A*(t) = a_m)].

Define φ_am(β1, t) = ψ_am(β1, t)/π_am(t). Since η is independent of (N(·), Y(·), E(·), Z(·)), it can be shown that

φ_am(β1, t) = E[exp{β1 E(t)} | X ≥ t, A*(t) = a_m].

Therefore, it is natural to estimate φ_am(β1, t) empirically by averaging over the at-risk validation subjects in category a_m (those with non-zero η_jl Y_jl(t) I(A*_jl(t) = a_m)):

φ̂_am(β1, t) = [Σ_{j=1}^n Σ_{l=1}^{n_j} η_jl Y_jl(t) exp{β1 E_jl(t)} I(A*_jl(t) = a_m)] / [Σ_{j=1}^n Σ_{l=1}^{n_j} η_jl Y_jl(t) I(A*_jl(t) = a_m)].

Then φ_ik(β1, t) can be estimated by

φ̂_ik(β1, t) = Σ_{m=1}^L I(A*_ik(t) = a_m) φ̂_am(β1, t).   (6)

Replacing φ_ik(β1, t) in (4) with its estimator φ̂_ik(β1, t), we obtain the estimator of the relative risk r_ik(β, t):

r̂_ik(β, t) = η_ik exp{β1 E_ik(t) + β2′Z_ik(t)} + (1 − η_ik) exp{β2′Z_ik(t)} φ̂_ik(β1, t).

Define Ŝ^(d)(β, u) = n^{−1} Σ_{j=1}^n Σ_{l=1}^{n_j} Y_jl(u) r̂_jl^(d)(β, u). We can estimate β0, the true parameter, by the solution β̂ of Û(β) = 0, where

Û(β) = Σ_{i=1}^n Σ_{k=1}^{n_i} ∫_0^τ { r̂_ik^(1)(β, u)/r̂_ik(β, u) − Ŝ^(1)(β, u)/Ŝ^(0)(β, u) } dN_ik(u),   (7)

with r̂_ik^(1)(β, u) the first derivative of r̂_ik(β, u) with respect to β.

By plugging the estimated relative risk r̂_ik into the commonly used Breslow estimator of the cumulative baseline hazard, we obtain a natural estimator of Λ0(t) = ∫_0^t λ0(u) du:

Λ̂0(t, β̂) = ∫_0^t [ Σ_{i=1}^n Σ_{k=1}^{n_i} dN_ik(u) ] / [ n Ŝ^(0)(β̂, u) ].
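To make the procedure concrete, here is a minimal Python sketch of the estimator. The data layout, function names, and the use of scipy's root finder are our illustrative assumptions, not the authors' implementation; covariates are taken as time-fixed, E as scalar, and ties are ignored.

```python
import numpy as np
from scipy.optimize import root

def phi_hat(beta1, t, X, E, eta, A, L):
    """Category-wise estimate of phi_am(beta1, t) and its beta1-derivative:
    averages of exp(beta1*E) (and E*exp(beta1*E)) over validation subjects
    still at risk at time t with A* = a_m, as in equation (6)."""
    at_risk = (X >= t) & (eta == 1)
    phi, dphi = np.ones(L), np.zeros(L)
    for m in range(L):
        cell = at_risk & (A == m)
        if cell.any():
            w = np.exp(beta1 * E[cell])
            phi[m], dphi[m] = w.mean(), (E[cell] * w).mean()
    return phi, dphi

def pseudo_score(beta, X, delta, Z, E, eta, A, L):
    """Estimated pseudo-partial score U-hat(beta) of equation (7): a sum over
    observed failures of r-hat^(1)/r-hat minus the at-risk average."""
    b1, b2 = beta[0], beta[1:]
    U = np.zeros(len(beta))
    for i in np.flatnonzero(delta == 1):
        at_risk = X >= X[i]
        phi, dphi = phi_hat(b1, X[i], X, E, eta, A, L)
        lin = Z @ b2
        # r-hat of equation (4): observed exposure if validated, phi-hat otherwise
        r = np.where(eta == 1,
                     np.exp(b1 * np.nan_to_num(E) + lin),
                     np.exp(lin) * phi[A])
        d1 = np.where(eta == 1, np.nan_to_num(E) * r, np.exp(lin) * dphi[A])
        grad = np.column_stack([d1, Z * r[:, None]])   # d r-hat / d beta
        U += grad[i] / r[i] - grad[at_risk].sum(0) / r[at_risk].sum()
    return U

# Usage sketch: beta-hat solves U-hat(beta) = 0, e.g.
# fit = root(pseudo_score, np.zeros(1 + Z.shape[1]),
#            args=(X, delta, Z, E, eta, A, L))
```

The Breslow-type estimator Λ̂0 then follows by accumulating Σ dN_ik(u)/{n Ŝ^(0)(β̂, u)} over the observed failure times.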
3 Asymptotic Properties
To investigate the asymptotic properties of the proposed estimator, we first introduce some notation. Let β0 = (β10, β20′)′ be the true regression parameter. For a vector a = (a1, ..., ap)′, let a^⊗2 = aa′. Unless otherwise stated, all limits are taken as n → ∞.
Let M_ik(t) = N_ik(t) − ∫_0^t Y_ik(u) r_ik(β0, u) λ0(u) du be the marginal martingale and Λ0(t) = ∫_0^t λ0(u) du the baseline cumulative hazard function.
Our main results are given in Theorems 1 and 2 below; the regularity conditions and the proofs are given in the Appendix. We provide only brief remarks about the proofs here.
Theorem 1. Under the regularity conditions in the Appendix, β̂ is a consistent estimator of β0. Also, n^{1/2}(β̂ − β0) is asymptotically normally distributed with mean zero and variance matrix of the sandwich form

Σ_E = Σ(β0)^{−1} {Σ1(β0) + Σ2(β0)} Σ(β0)^{−1},

where Σ(β0) is the limit in probability of −n^{−1}∂Û(β0)/∂β′, and Σ1(β0) and Σ2(β0) are, respectively, the covariance matrices of the martingale component and the validation-sample component of n^{−1/2}Û(β0); their explicit forms involve fixed functions Q_jl(β0) and H_jl(β0) of the limits s^(d)(β, t) of Ŝ^(d)(β, t), where s^(11)(β, t) and s^(12)(β, t) equal the first derivatives of s^(0)(β, t) with respect to β1 and β2, respectively.
It is worth pointing out that when no subject is missing the exposure (i.e., ρ = 1), the asymptotic variance Σ_E of the proposed estimator equals that of the partial likelihood estimator of Lee et al. (1992). The proof of the consistency of β̂ follows from the Inverse Function Theorem (Foutz, 1977). The asymptotic normality follows from the asymptotic normality of n^{−1/2}Û(β0), a Taylor expansion, and the Cramér-Wold device. The asymptotic normality of n^{−1/2}Û(β0) is derived from the multivariate central limit theorem and results from empirical process theory.
To study the asymptotic properties of Λ̂0(t, β̂), we define the following metric space. Let D[0, τ] be the space of right-continuous functions f(t): [0, τ] → R with left limits, made into a metric space by equipping it with the uniform metric d(f, g) = sup_{t∈[0,τ]} |f(t) − g(t)| for f, g ∈ D[0, τ]. The essential asymptotic results for the baseline cumulative hazard estimator are summarized in the following theorem.
Theorem 2. Under the regularity conditions in the Appendix, Λ̂0(t, β̂) converges in probability to Λ0(t) uniformly in t ∈ [0, τ], and n^{1/2}{Λ̂0(t, β̂) − Λ0(t)} converges weakly to a zero-mean Gaussian random field in D[0, τ]; the covariance function between its values at s and t is C(s, t), expressed through the iid terms u_ik(t, β0) and v_ik(t, β0) appearing in the proof in the Appendix.
The asymptotic variance can be consistently estimated by replacing the population quantities with their empirical counterparts.
4 Simulation Studies
Simulation studies are conducted to evaluate the finite sample performance of the proposed estimator and to compare the proposed methods with existing methods.
We generated failure time data from n = 300 clusters, allowing the cluster size n_i to range from 1 to 6 with equal probability for each integer value. The partially observed covariate E for cluster i is generated from an n_i-dimensional multivariate normal distribution N_{n_i}(0, V), where V has unit variances and exchangeable correlation, and the Z_ik's are independent standard normal random variables.
For each cluster i, we use the method of Cai and Shen (2000), an extension of the commonly used multivariate failure time distribution of Clayton and Cuzick (1985), to generate the n_i correlated failure times with λ0 = 1. The joint survival function of the n_i correlated failure times takes the form

S(t_{i1}, ..., t_{in_i}) = [ Σ_{k=1}^{n_i} S_ik(t_ik)^{−1/θ} − (n_i − 1) ]^{−θ},   (8)

where S_ik(·) denotes the marginal survival function of T_ik under model (2).
The positive parameter θ represents the intra-cluster association; θ is chosen to be 0.25, 1.5 and 5.7, representing strong, moderate and weak intra-cluster association, respectively. We assume a uniform distribution over (0, τ) for the censoring time, where τ = 5.96, 1.57 and 0.39 correspond to censoring proportions of approximately 20%, 50% and 80%, respectively.
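As a point of reference, assuming the Clayton-type parametrization shown in (8), these θ values can be translated into Kendall's τ:

```latex
% Kendall's tau under the Clayton model with the parametrization of (8):
\tau_K = \frac{1}{1 + 2\theta}, \qquad
\tau_K(0.25) = \tfrac{2}{3}, \quad \tau_K(1.5) = \tfrac{1}{4}, \quad \tau_K(5.7) \approx 0.08,
```

which matches the strong/moderate/weak labels above.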
The auxiliary variable A_ij is generated as follows: we first generate W_ij = E_ij + e_ij, where e_ij ~ N(0, σ²); we then assign A_ij the value 0, 1, 2 or 3 according to whether W_ij falls in (−∞, a1], (a1, a2], (a2, a3] or (a3, ∞), respectively, where a1, a2, a3 are the quartiles (25%, 50% and 75% quantiles) of W. Here σ controls the strength of the association between E_ij and W_ij, and hence between E_ij and A_ij; smaller σ induces stronger correlation between E and A. Individuals are selected into the validation set by Bernoulli sampling with equal probability. Simulation results are based on 1000 independent runs for each data configuration.
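A minimal Python sketch of this design follows. The Clayton-copula sampler assumes the parametrization of (8) above, and the exchangeable correlation of 0.5 used for V is illustrative only, since the original specification of V is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(2024)

def clayton_cluster_times(lam, theta, rng):
    """Correlated failure times whose joint survivor function is the
    Clayton-type model (8); small theta = strong intra-cluster association.
    lam: marginal hazard rates lambda_0 * exp(beta1*E + beta2*Z)."""
    w = rng.gamma(theta, 1.0)             # frailty for the Clayton copula
    e = rng.exponential(size=lam.size)
    u = (1.0 + e / w) ** (-theta)         # joint survivor values U_k
    return -np.log(u) / lam               # invert exponential margins

beta1, beta2, theta, sigma, tau, rho_v = 0.693, -0.2, 1.5, 0.1, 1.57, 0.3
rows = []
for i in range(300):                      # n = 300 clusters
    K = int(rng.integers(1, 7))           # cluster size uniform on 1..6
    V = 0.5 * np.eye(K) + np.full((K, K), 0.5)   # exchangeable cov (illustrative)
    E = rng.multivariate_normal(np.zeros(K), V)
    Z = rng.standard_normal(K)
    T = clayton_cluster_times(np.exp(beta1 * E + beta2 * Z), theta, rng)
    C = rng.uniform(0, tau, K)            # ~50% censoring at tau = 1.57
    W = E + rng.normal(0, sigma, K)       # noisy surrogate for E
    rows.append(np.column_stack([np.minimum(T, C), T <= C, E, Z, W]))

X, delta, E, Z, W = np.vstack(rows).T
A = np.searchsorted(np.quantile(W, [0.25, 0.5, 0.75]), W)  # 4 categories
eta = rng.binomial(1, rho_v, size=X.size)  # Bernoulli validation sampling
E_obs = np.where(eta == 1, E, np.nan)      # E observed only on validation set
```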
We compare the proposed estimator, denoted β̂_E, with three alternatives: β̂_F, β̂_V and β̂_A. The first two are standard partial likelihood estimators (solutions of (5)) using the full data and the validation data, respectively; β̂_A is the standard partial likelihood estimator with the auxiliary variable substituted for the true exposure variable. In real data settings where E is observed only for a validation set, β̂_F cannot be calculated.
Table 1 summarizes the simulation results for β1 = 0 and 0.693 with validation fraction 30% and censoring rate 50%. We list the empirical mean (Mean), empirical standard error (SD), average of estimated standard errors (SE), empirical coverage rate of the nominal 95% confidence interval, and relative efficiency (RE) with respect to the validation-data estimator. When β1 = 0, all estimators are unbiased. When β1 = 0.693, β̂_A is biased, while both β̂_E and β̂_V are approximately unbiased. For β1, β̂_E is more efficient than the validation-data estimator β̂_V, and the gain is larger when the auxiliary provides more information (i.e., smaller σ) or when the intracluster association is weaker (i.e., larger θ). In all simulated settings, the proposed estimator is not much less efficient than the full-data estimator β̂_F. For β2, β̂_E is almost as efficient as the full-data estimator under all settings considered (results not shown). The proposed estimated standard errors track the true variability of the estimators of β1 and β2 well, and the corresponding 95% confidence intervals have reasonable coverage rates.
Table 2 summarizes the relative efficiency of β̂_E to the validation-data estimator β̂_V for β1 under various validation fractions and censoring rates, with β1 = 0.693, θ = 1.5 and σ = 0.1 fixed. The relative efficiency increases as the censoring rate increases and as the validation fraction decreases. This suggests that, with a low validation fraction and a high censoring rate, the proposed estimator offers even greater efficiency gains over the validation-set method.
Furthermore, as suggested by the referees, we compare the proposed method with the estimated pseudo-partial likelihood estimator (EPPLE, denoted β̂_EPPLE) of Liu et al. (2009), which is based on the marginal hazard model with different baseline hazard functions (the WLW-type model). The failure times satisfy model (8) with fixed cluster size K = 2 and baseline hazards λ01 = λ02 = 1, and the number of clusters is n = 300. The parameter settings are β1 = 0 and 0.693, β2 = −0.2, θ = 0.25 and 1.5, and σ = 0.2. Table 3 shows the results for the estimators of β1, listing the empirical mean (Mean), standard error (SD), and relative efficiency of β̂_E with respect to β̂_EPPLE. As can be seen from Table 3, the proposed estimator of β1 tends to be slightly more efficient than that of Liu et al. (2009); this is natural since the proposed method uses more information in estimating the relative risk. The relative efficiency of β̂_E to β̂_EPPLE becomes smaller for larger θ. For the estimator of β2, both methods are as efficient as the partial likelihood estimator based on the full data (results not listed here).
We also conducted simulation studies to assess the robustness of our approach. The failure times are generated from marginal hazard models with different baseline functions:

λ_ik(t | E_ik(t), Z_ik(t)) = λ_{0k} exp{β1 E_ik(t) + β2 Z_ik(t)}, k = 1, 2,

with β1 = 0.693, β2 = −0.2 and n = 300. The censoring rate is around 50% and the validation fraction is set to 30%. We fix one baseline at λ01 = 1 and vary λ02 from 1 to 2.4 in increments of 0.2. The working model remains the marginal hazard model with common baseline function. Table 4 lists the results. When the working model is not too far from the true model (e.g., λ02 ≤ 2), the proposed estimator still performs well.
5 Analysis of the Framingham Study
We illustrate the proposed method by estimating the effect of cholesterol level on the risk of coronary heart disease (CHD) using a data set from the Framingham study (Dawber, 1980). The Framingham Heart Study, the first prospective study of cardiovascular disease, began in 1948 in the United States with randomly sampled participants from the town of Framingham, Massachusetts. The full cohort includes 2336 men and 2873 women aged between 30 and 62 years. Participants have been examined every two years and followed for morbidity and mortality. Since the primary sampling unit was the family, failure times are likely to be correlated for individuals within a family.
For simplicity, our analysis is restricted to patients who, around age 45, had no history of hypertension or glucose intolerance and no previous coronary heart disease or cerebrovascular accident. The data consist of 1571 patients from 1401 families, among which 1261 families have 1 patient, 113 families have 2, 24 families have 3, and 3 families have 4. Among the full cohort, 250 patients experienced CHD. The time origin is age 45 (Age45), and follow-up information is available up to 1980. In our analysis, the failure time was defined as the time from Age45 to the onset of CHD; observations were censored either at the time of loss to follow-up or at the end of the study.
In addition to the cholesterol level (Chol45), the exposure variable of interest, potential confounding variables available for all subjects include age at the first Framingham exam (Agev1), body mass index (Bmi45), systolic blood pressure (Sbp45), gender (Sex, 1 for female and 0 for male), waiting time from the first exam to age 45 (Wait), and smoking status (Smoke, 0 = no, 1 = yes). Since the patients are clustered by family and family members are exchangeable within each family, it is reasonable to assume the marginal hazard model (2) with common baseline function. We consider model (2) with the seven risk factors above, measured at Age45, as covariates.
In the Framingham study, all covariates were measured for all patients, which provides an opportunity to evaluate the proposed method under various validation sampling fractions against both a validation-set analysis and a full-data analysis. The cholesterol level (Chol45) is an example of a variable that may be moderately expensive to obtain and therefore represents a candidate main exposure observed only in a subcohort. As a practical consideration, smokers usually have higher Chol45; hence we use smoking status as the auxiliary variable for Chol45. We consider all seven covariates in the model. The fitted model is
λ(t | Chol45, Z) = λ0(t) exp{β_chol45 Chol45 + β2′Z},

where Z = (Agev1, Bmi45, Sbp45, Sex, Wait, Smoke)′.
We sampled a sequence of validation sets, with validation fraction ρ equal to 10%, 20%, 30% and 40%, from the full cohort of 1571 patients, and analyzed each using the proposed method under the assumption that the main covariate, cholesterol level, is available only for the validation set.
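A hypothetical sketch of this evaluation loop, reusing the pseudo_score sketch from Section 2.2, is shown below; the array names chol45, X, delta, Z and smoke are our placeholders for the assembled Framingham data, with smoking status playing the role of the discrete auxiliary.

```python
import numpy as np
from scipy.optimize import root

rng = np.random.default_rng(1948)
for rho in (0.1, 0.2, 0.3, 0.4):
    eta = rng.binomial(1, rho, size=chol45.size)   # Bernoulli validation sample
    E_obs = np.where(eta == 1, chol45, np.nan)     # mask Chol45 outside it
    fit = root(pseudo_score, np.zeros(1 + Z.shape[1]),
               args=(X, delta, Z, E_obs, eta, smoke, 2))
    print(rho, fit.x[0])                           # estimated coefficient of Chol45
```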
Table 5 lists the results from the Framingham study for the factor Chol45. The cholesterol level is significant in the full-data analysis (95% CI: [0.001, 0.007]; p-value: 0.007). Comparing the proposed method with the validation-set method, we see that at small validation fractions (ρ < 40%) the proposed estimator does not reach significance in testing β_chol45 = 0; nevertheless, it approaches the significance of the full-cohort analysis as the validation fraction increases. At ρ = 40%, the proposed method rejects the null hypothesis β_chol45 = 0 at the 0.05 significance level. Further inspection of Table 5 also reveals that the validation-set analysis consistently produced smaller Z-scores than the proposed estimator and hence always yielded a larger p-value in testing β_chol45 = 0. At ρ = 40%, the validation estimator did not achieve the significance attained by the full analysis or the proposed estimator.
Table 6 summarizes the results for all factors in the Framingham study using the three methods, with a sample of 615 patients (ρ = 40%) as the validation set. The p-values indicate that, at the 0.05 significance level, Chol45, Bmi45, Sbp45, Sex and smoking status are all statistically significant for CHD using the proposed method, in agreement with the full-data analysis, whereas only Sbp45 and Sex are significant in the validation-set analysis. The proposed estimates are closer to those of the full-data analysis and are more efficient than the validation-set estimates; at this value of ρ, the standard errors of the proposed estimator for all covariates are similar to those of the full-data estimator.
This example confirms that the proposed estimator is more precise: it recovers statistical power that would be lost if one analyzed only the validation-set data without incorporating the auxiliary information.
6 Concluding Remarks
In this paper, we proposed an estimated likelihood approach for the CBM model when the main exposure is only partly observed and a discrete auxiliary variable for it is available. An estimating equation based on the pseudo-partial likelihood is proposed. Our approach is nonparametric with respect to the association between the missing covariate and the observed auxiliary covariate. The proposed estimators are shown to be consistent and asymptotically normal. The theoretical proof is nontrivial because classical martingale theory is not sufficient; instead, we rely on results from modern empirical process theory. Simulation studies and a real data example demonstrate that the proposed method works well in moderate-sized samples and improves statistical efficiency over what would be achieved using only the validation set. The simulations also show that the efficiency of the proposed method depends on the validation fraction: sampling more individuals yields a more efficient estimator.
We have a few recommendations on the application of the proposed method. First, one can discretize a continuous auxiliary variable into categories and then apply the proposed method (see the sketch below); to take full advantage of a continuous auxiliary covariate, a nonlinear smoothing version of equation (6) would need to be developed. Secondly, the number of categories of the auxiliary variable cannot be too large (e.g., no more than 6) if the validation sample size is small (< 60); additional simulations showed that convergence problems can arise when the validation size is less than 60 and the number of categories is greater than 6, so we recommend reducing the number of categories when the sample is small. Thirdly, the estimating equation of Lee et al. (1992) does not take into account the potential correlation among the multivariate failure times. Cai and Prentice (1995) and Xue et al. (2010) showed that more efficient β-estimators can be obtained by introducing weights into the estimating equations, for small and large cluster sizes respectively. In modeling panel count data, which requires accounting for the dependence among successive counts, Wellner and Zhang (2000) showed that the full nonparametric maximum likelihood estimator (NPMLE) improved efficiency relative to the pseudo-likelihood estimator that ignores the correlation between counts. Therefore, introducing suitable weights into the proposed estimating equation could further improve efficiency; future work in this direction is certainly warranted.
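As a minimal illustration of the first recommendation (the helper name is ours):

```python
import numpy as np

def discretize_auxiliary(w, n_bins=4):
    """Cut a continuous auxiliary variable at its empirical quantiles so the
    proposed discrete-A method applies; keep n_bins small (see above) when
    the validation sample is small."""
    cuts = np.quantile(w, np.linspace(0, 1, n_bins + 1)[1:-1])
    return np.searchsorted(cuts, w)     # integer categories 0..n_bins-1
```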
Acknowledgment
The authors are very grateful for the valuable comments and suggestions from the editor and the two referees. This work was partly supported by National Natural Science Foundation of China grant 11171263 (Liu and Yuan), U.S. NIH grant R01 HL 57444 (Cai), and U.S. NIH grant R01 CA 79949 (Zhou).
APPENDIX
Assumptions and Outline of the Proofs of Theorems 1 and 2
We assume that the following conditions hold:
Conditions
(C1) (Finite interval): Λ0(τ) = ∫_0^τ λ0(t) dt < ∞;
(C2) π_am(τ) > 0 for m = 1, ..., L, where π_am(t) = E[η Y(t) I(A*(t) = a_m)] as in Section 2.2;
(C3) For any β in a neighborhood of β0, the processes r_ik(β, ·) and r_ik^(d)(β, ·), d = 1, 2, have uniformly bounded variation almost surely over [0, τ];
(C4) For d = 0, 1, 2, there exists a neighborhood B of β0 such that the s^(d)(β, t) are continuous functions of β ∈ B, uniformly in t ∈ [0, τ], and bounded on B × [0, τ]; s^(0)(β, t) is bounded away from 0 on B × [0, τ]; and Σ(β0) is positive definite.
(C5) For m = 1, ..., L and d = 0, 1, 2, the classes of functions {η Y(t) E(t)^d exp{β1 E(t)} I(A*(t) = a_m) : t ∈ [0, τ], β1 ∈ B1} are P-Glivenko-Cantelli, where B1 is a neighborhood of β10.
The following lemma (Kosorok, 2008, Lemma 4.2, p. 54) will be used repeatedly in proving the theorems:
Lemma A1. Let B_n ∈ D[a, b] and A_n ∈ l∞[a, b] be either cadlag or caglad, where l∞[a, b] is the collection of all bounded functions on [a, b], and assume sup_{t∈[a,b]} |A_n(t) − A(t)| → 0 in probability. Suppose A_n has uniformly bounded variation and B_n converges weakly to a tight, mean-zero process with sample paths in D[a, b]. Then ∫_a^t A_n(s) dB_n(s) − ∫_a^t A(s) dB_n(s) → 0 in probability, uniformly in t ∈ [a, b].
Define φ̂_am^(d)(β1, t) and φ_am^(d)(β1, t) to be the d-th derivatives of φ̂_am(β1, t) and φ_am(β1, t) with respect to β1. Define |b| = Σ_j |b_j| for a vector b = (b1, ..., bp)′ and |B| = Σ_{i,j} |b_ij| for a matrix B = (b_ij). The following asymptotic property plays an important role in proving the theorems.
Lemma A2. For m = 1, ..., L and d = 0, 1, 2,

sup_{t∈[0,τ]} |φ̂_am^(d)(β1, t) − φ_am^(d)(β1, t)| = op(1).   (A1)
Proof: For d = 0, since φ̂_am(β1, t) = ψ̂_am(β1, t)/π̂_am(t) and φ_am(β1, t) = ψ_am(β1, t)/π_am(t), where ψ̂_am and π̂_am denote the empirical counterparts of ψ_am and π_am, we have

φ̂_am(β1, t) − φ_am(β1, t) = {ψ̂_am(β1, t) π_am(t) − ψ_am(β1, t) π̂_am(t)} / {π̂_am(t) π_am(t)}.

The numerator is op(1) uniformly in t by condition (C5). Combining this with condition (C2), we can prove (A1) for d = 0. The same argument works for d = 1, 2.
Define Ŝ^(3)(β, t) = n^{−1} Σ_{i,k} Y_ik(t) r̂_ik^(2)(β, t) and Ŝ^(4)(β, t) = n^{−1} Σ_{i,k} Y_ik(t) r̂_ik^(1)(β, t)^⊗2 / r̂_ik(β, t), where r̂_ik^(2)(β, t) denotes the second derivative of r̂_ik(β, t) with respect to β, and let s^(d)(β, t) denote the limit in probability of Ŝ^(d)(β, t). Following arguments similar to those of Liu et al. (2009), we can prove that, for d = 0, 1, 2, 3, 4,

sup_{t∈[0,τ], β∈B} |Ŝ^(d)(β, t) − s^(d)(β, t)| = op(1),   (A2)

sup_{t∈[0,τ]} |Ŝ^(d)(β0, t) − S^(d)(β0, t)| = op(1),   (A3)

where S^(d)(β, t) denotes the function obtained from Ŝ^(d)(β, t) by substituting r_jl(β, t) for r̂_jl(β, t).
Since the class {M_ik(t) : t ∈ [0, τ]} is Donsker and n^{−1/2} Σ_{i,k} M_ik(t) converges weakly to a tight, mean-zero process, we can prove the following useful property by (A3), condition (C4) and Lemma A1:

sup_{t∈[0,τ]} | ∫_0^t {Ŝ^(1)(β0, u)/Ŝ^(0)(β0, u) − s^(1)(β0, u)/s^(0)(β0, u)} d{n^{−1/2} Σ_{i,k} M_ik(u)} | = op(1).   (A4)
Before proving the theorems, we establish the asymptotic normality of n^{−1/2}Û(β0) in the following lemma:

Lemma A3. Under conditions (C1)-(C5), n^{−1/2}Û(β0) converges to a mean-zero normal distribution with covariance Σ1(β0) + Σ2(β0).
Proof: By simple algebraic manipulation, we have

n^{−1/2}Û(β) = n^{−1/2} Σ_{i,k} ∫_0^τ {r_ik^(1)(β, u)/r_ik(β, u) − S^(1)(β, u)/S^(0)(β, u)} dN_ik(u)
 + n^{−1/2} Σ_{i,k} ∫_0^τ [{r̂_ik^(1)(β, u)/r̂_ik(β, u) − r_ik^(1)(β, u)/r_ik(β, u)} − {Ŝ^(1)(β, u)/Ŝ^(0)(β, u) − S^(1)(β, u)/S^(0)(β, u)}] dN_ik(u).   (A5)

By (A1)-(A3), we can show that the first term of (A5) evaluated at β0 equals

n^{−1/2} Σ_{i,k} ∫_0^τ {r_ik^(1)(β0, u)/r_ik(β0, u) − s^(1)(β0, u)/s^(0)(β0, u)} dM_ik(u) + op(1).

For the second term of (A5) evaluated at β0, we apply first-order expansions of r̂_ik^(1)/r̂_ik around r_ik^(1)/r_ik and of Ŝ^(1)/Ŝ^(0) around S^(1)/S^(0); the second term then asymptotically equals

Ψ^(1)(τ) + Ψ^(2)(τ),   (A6)

where Ψ^(1)(t) collects the terms arising from the expansion of r̂_ik^(1)/r̂_ik − r_ik^(1)/r_ik and Ψ^(2)(t) those arising from Ŝ^(1)/Ŝ^(0) − S^(1)/S^(0).

By conditions (C1)-(C5) and the definition of φ̂_am, Ψ^(1)(τ) can be rewritten asymptotically as n^{−1/2} Σ_{j,l} η_jl Q_jl(β0) + op(1), where Q_jl(β0) is the fixed function appearing in Theorem 1. Similarly, Ψ^(2)(τ) is asymptotically n^{−1/2} Σ_{j,l} η_jl H_jl(β0) + op(1), with H_jl(β0) as in Theorem 1.
It follows that n^{−1/2}Û(β0) is asymptotically equivalent to the sum of

n^{−1/2} Σ_{i,k} ∫_0^τ {r_ik^(1)(β0, u)/r_ik(β0, u) − s^(1)(β0, u)/s^(0)(β0, u)} dM_ik(u)   (A7)

and

n^{−1/2} Σ_{i,k} η_ik {Q_ik(β0) + H_ik(β0)}.   (A8)

(A7) and (A8) are independent. By the martingale central limit theorem, (A7) converges weakly to a continuous normal process with covariance Σ1(β0). (A8) is a summation of iid terms from the validation sample; by the central limit theorem, it converges to a normal distribution with mean

E[η_ik Q_ik(β0)] + E[η_ik H_ik(β0)]   (A9)

and covariance Σ2(β0). Since η_ik and n_i are independent of {N_ik(·), Z_ik(·)} and M_ik(·) is a local martingale, the expected value of the first term in the mean expression (A9) is 0, and it is straightforward to show that the second term is also 0. The covariance matrix can be expressed as Σ2(β0), which is defined in Theorem 1. Therefore the limiting distribution of n^{−1/2}Û(β0) is normal with mean zero and covariance Σ1(β0) + Σ2(β0).
Proof of Theorem 1:
(1) Consistency
To prove the consistency of β̂, we use the Inverse Function Theorem (Foutz, 1977) by verifying the following conditions: (I) ∂Û(β)/∂β′ exists and is continuous in an open neighborhood of β0; (II) n^{−1}∂Û(β)/∂β′ evaluated at β0 is negative definite with probability going to 1; (III) n^{−1}∂Û(β)/∂β′ converges in probability to a fixed function A(β), uniformly for β in an open neighborhood of β0; (IV) n^{−1}Û(β0) → 0 in probability.

Clearly (I) is satisfied due to the continuity of r̂_ik(β, t) and Ŝ^(d)(β, t) in β.
Similarly to Andersen and Gill (1982), we can decompose n^{−1}∂Û(β)/∂β′ into several parts (A10). By (A1)-(A4), the first part on the right side of (A10) is asymptotically equal to a sum of stochastic integrals with respect to the martingales M_ik and converges to zero in probability by the Lenglart inequality. By condition (C1) and (A1), we can then prove that n^{−1}∂Û(β)/∂β′ converges in probability, uniformly in an open neighborhood of β0, to a limit A(β) expressed through the functions s^(d)(β, t), d = 0, 1, ..., 4.

When β = β0, we have s^(4)(β0, t) = s^(2)(β0, t), and A(β0) = −Σ(β0) is negative definite by condition (C4). Thus (II) and (III) are satisfied.

Using the result in the proof of Lemma A3, (IV) holds by Chebyshev's inequality. Having now verified (I)-(IV), we conclude that β̂ converges in probability to β0.
(2) Asymptotic Normality
By a Taylor expansion of n^{−1/2}Û(β̂) with respect to β around β0, we have

n^{1/2}(β̂ − β0) = −{n^{−1}∂Û(β*)/∂β′}^{−1} n^{−1/2}Û(β0),   (A11)

where β* is between β̂ and β0.

By (III) and the asymptotic normality of n^{−1/2}Û(β0) (Lemma A3), Theorem 1 follows.
Proof of Theorem 2: Note that

sup_{t∈[0,τ]} |Λ̂0(t, β̂) − Λ0(t)| ≤ sup_{t∈[0,τ]} |Λ̂0(t, β̂) − Λ̂0(t, β0)| + sup_{t∈[0,τ]} |Λ̂0(t, β0) − Λ0(t)|.

With the consistency of β̂, we can show that the first term on the right-hand side of the above inequality converges to zero by (A2) and the martingale properties, and the second term is also asymptotically negligible by (A1). This proves the uniform consistency of Λ̂0(t, β̂).
We can decompose n^{1/2}{Λ̂0(t, β̂) − Λ0(t)} into three parts: the term obtained with the true relative risk r_ik and true parameter β0, the term due to replacing β0 by β̂, and the term due to replacing r_ik by r̂_ik.

By (A2), the first term equals a sum of stochastic integrals with respect to the martingales M_ik plus op(1). The second term equals, by a Taylor expansion and (A1), a deterministic function of t times n^{1/2}(β̂ − β0) plus op(1), which by (A11) and (III) is asymptotically a normalized sum of iid terms. By similar arguments as in the approximation of Ψ^(1), the third term is asymptotically equal to a normalized sum of iid validation-sample terms.

Therefore, we can rewrite the above sums as two independent parts:

n^{1/2}{Λ̂0(t, β̂) − Λ0(t)} = n^{−1/2} Σ_{i,k} {u_ik(t, β0) + v_ik(t, β0)} + op(1),

where u_ik(t, β0) collects the martingale-based contributions and v_ik(t, β0) the validation-sample contributions, as in Theorem 2. It is easy to show that E{u_ik(t, β0)} = 0 and E{v_ik(t, β0)} = 0. Theorem 2 then follows from the multivariate central limit theorem.
References
- Andersen PK, Gill RD. Cox's regression model for counting processes: a large sample study. Annals of Statistics. 1982;10:1100–1120.
- Cai J, Prentice RL. Estimating equations for hazard ratio parameters based on correlated failure time data. Biometrika. 1995;82:151–164.
- Cai J, Prentice RL. Regression analysis for correlated failure time data. Lifetime Data Analysis. 1997;3:197–213. doi: 10.1023/a:1009613313677.
- Cai J, Shen Y. Permutation tests for comparing marginal survival functions with clustered failure time data. Statistics in Medicine. 2000;19:2963–2973. doi: 10.1002/1097-0258(20001115)19:21<2963::aid-sim593>3.0.co;2-h.
- Clayton D, Cuzick J. Multivariate generalizations of the proportional hazards model (with discussion). Journal of the Royal Statistical Society, Series A. 1985;148:82–117.
- Cox DR. Regression models and life-tables. Journal of the Royal Statistical Society, Series B. 1972;34:187–220.
- Dawber TR. The Framingham Study: The Epidemiology of Atherosclerotic Disease. Harvard University Press; Cambridge, MA: 1980.
- Foutz RV. On the unique consistent solution to the likelihood equations. Journal of the American Statistical Association. 1977;72:147–148.
- Gorfine M, Zucker DM, Hsu L. Prospective survival analysis with a general semiparametric shared frailty model: a pseudo full likelihood approach. Biometrika. 2006;93:735–741.
- Hu C, Lin DY. Cox regression with covariate measurement error. Scandinavian Journal of Statistics. 2002;29:637–655.
- Hougaard P. Analysis of Multivariate Survival Data. Springer-Verlag; New York: 2000.
- Jiang J, Zhou H. Additive hazards regression with auxiliary covariates. Biometrika. 2007;94:359–369.
- Kalbfleisch JD, Prentice RL. The Statistical Analysis of Failure Time Data. 2nd ed. Wiley; New York: 2002.
- Kosorok MR. Introduction to Empirical Processes and Semiparametric Inference. Springer; New York: 2008.
- Kulich M, Lin DY. Additive hazards regression with covariate measurement error. Journal of the American Statistical Association. 2000;95:238–248.
- Lee EW, Wei LJ, Amato DA. Cox-type regression analysis for large numbers of small groups of correlated failure time observations. In: Klein JP, Goel PK, editors. Survival Analysis: State of the Art. Kluwer Academic Publishers; Dordrecht: 1992. pp. 237–247.
- Lin DY, Ying Z. Semiparametric analysis of the additive risk model. Biometrika. 1994;81:61–71.
- Liu Y, Wu Y, Zhou H. Multivariate failure times regression with a continuous auxiliary covariate. Journal of Multivariate Analysis. 2010;101:679–691. doi: 10.1016/j.jmva.2009.09.008.
- Liu Y, Zhou H, Cai J. Estimated pseudo-partial-likelihood method for correlated failure time data with auxiliary covariates. Biometrics. 2009;65:1184–1193. doi: 10.1111/j.1541-0420.2009.01198.x.
- Nielsen GG, Gill RD, Andersen PK, Sørensen TI. A counting process approach to maximum likelihood estimation in frailty models. Scandinavian Journal of Statistics. 1992;19:25–43.
- Spiekerman CF, Lin DY. Marginal regression models for multivariate failure time data. Journal of the American Statistical Association. 1998;93:1164–1175.
- Wang CY, Hsu L, Feng ZD, Prentice RL. Regression calibration in failure time regression. Biometrics. 1997;53:131–145.
- Wellner JA, Zhang Y. Two estimators of the mean of a counting process with panel count data. Annals of Statistics. 2000;28:779–814.
- Wei LJ, Lin DY, Weissfeld L. Regression analysis of multivariate incomplete failure time data by modeling marginal distributions. Journal of the American Statistical Association. 1989;84:1065–1073.
- Xue L, Wang L, Qu A. Incorporating correlation for multivariate failure time data when cluster size is large. Biometrics. 2010;66:393–404. doi: 10.1111/j.1541-0420.2009.01307.x.
- Zhou H, Pepe MS. Auxiliary covariate data in failure time regression analysis. Biometrika. 1995;82:139–149.
- Zhou H, Wang CY. Failure time regression with continuous covariates measured with error. Journal of the Royal Statistical Society, Series B. 2000;62:657–665.
