Abstract
This paper discusses regression analysis of clustered failure time data, in which the failure times of interest are clustered into small groups rather than being independent. Clustering occurs in many fields, such as medical studies. A number of methods have been proposed for this problem, but most of them apply only to clustered right-censored data. In practice, failure time data are often interval-censored; that is, the failure times of interest are known only to lie in certain intervals. We propose an estimating equation-based approach for regression analysis of clustered interval-censored failure time data generated from the additive hazards model. A major advantage of the proposed method is that it does not involve the estimation of any baseline hazard function. Both asymptotic and finite sample properties of the proposed estimates of regression parameters are established, and the method is illustrated with data arising from a lymphatic filariasis study.
Keywords: additive hazards model, clustered data, estimating equation, interval censoring, semi-parametric regression analysis
1. Introduction
This paper discusses regression analysis of clustered failure time data, which occur when the failure times of interest are clustered into small groups or some study subjects are related such as siblings, families, or communities. The subjects from the same cluster or group usually share certain unobserved characteristics and their failure times tend to be correlated as a result. Siblings, for example, share similar genetic and environmental influences. A key feature of the failure time data is censoring and one type of censoring that has been extensively discussed is right censoring. Another complex type is interval censoring that arises when the failure event of interest cannot be observed directly but is known only to have occurred over a time interval. This is common and natural in a clinical trial or longitudinal study in which there is a periodic follow-up. For instance, an individual who is monitored weekly for a response may miss visits for a few weeks and return in a changed response state, thus contributing an interval-censored observation. In the following, we will focus on the regression analysis of such data.
For regression analysis of clustered failure time data, several methods have been proposed when only right censoring is present (Cai and Prentice 1997; Cai, Wei and Wilcox 2000; Zeng, Lin and Lin 2008). For example, Cai et al. (2000) and Zeng et al. (2008) investigated the fitting of semi-parametric linear transformation models to clustered right-censored data, and Rossini and Moore (1999) considered the use of estimating equations and pseudolikelihood for the analysis. For the case of interval-censored data, there also exist some procedures when there is no clustering (Jewell 1994; Wang and Ding 2000; Jewell and van der Laan 2004). For example, Huang (1996) considered the maximum-likelihood approach for Cox model regression of current status data, a special case of interval-censored data, and Sun (2006) gave a comprehensive review of the existing literature on the topic.
The additive hazards model is one of the most commonly used models for regression analysis of failure time data (Lin, Oakes and Ying 1998; Ghosh 2001; Martinussen and Scheike 2002; Wang, Sun and Tong 2010; Cai and Zeng 2011). For instance, Lin et al. (1998) considered the fitting of the model to current status data and developed some estimating equation approaches for the estimation of regression parameters. Wang et al. (2010) discussed the fitting of the model to general interval-censored failure time data and Cai and Zeng (2011) investigated the additive mixed effect model for clustered right-censored failure time data. In this paper, an approach is developed for fitting the additive hazards model to clustered interval-censored failure time data.
The remainder of the paper is organised as follows. We begin in Section 2 by describing the notation and models that will be used throughout the paper. In particular, the failure time of interest is assumed to follow the additive hazards model marginally. In Section 3, an estimating equation-based approach is developed for estimating the regression parameters of interest and the asymptotic properties of the proposed estimates are established. The approach can be easily implemented and does not require the estimation of the baseline hazard function. Section 4 presents some results obtained from a simulation study performed to evaluate the proposed estimation procedure and Section 5 illustrates the proposed approach with a set of clustered interval-censored data arising from a lymphatic filariasis (LF) study. Some concluding remarks and discussion are provided in Section 6.
2. Notation and models
Consider a survival study that involves n clusters of subjects with ni denoting the size of cluster i, where i = 1,…, n. Let Tij denote the failure time of interest for subject j in cluster i, where j = 1,…, ni, and let Zij(t) be a possibly time-dependent p-dimensional vector of covariates that is associated with the subject and assumed to be completely observed. Throughout this paper, the Tijs are considered to be independent for subjects in different clusters but could be dependent for subjects within the same cluster, and only interval-censored data on the Tijs are observed. Specifically, for each Tij there exist two monitoring times denoted by Uij and Vij with Uij ≤ Vij such that Tij is known only to be smaller than Uij, between Uij and Vij, or greater than Vij. That is, we have clustered interval-censored data on the Tijs. Furthermore, it will be assumed that Tij is independent of the monitoring times Uij and Vij given the covariate process Zij(t).
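The three possible observation outcomes described above can be sketched in a few lines. This is only an illustration for simulated data where the true failure time is known; the function name `censoring_indicators` and the treatment of boundary ties (T equal to U or V) are assumptions, not taken from the paper.

```python
from typing import Tuple


def censoring_indicators(t: float, u: float, v: float) -> Tuple[int, int, int]:
    """Return (d1, d2, d3) for a failure time t and monitoring times u <= v.

    d1 = 1 if t falls before u, d2 = 1 if t falls between u and v,
    d3 = 1 if t falls at or after v. Exactly one indicator equals 1.
    The handling of ties at u and v is an assumed convention.
    """
    if u > v:
        raise ValueError("monitoring times must satisfy u <= v")
    d1 = int(t < u)
    d2 = int(u <= t < v)
    d3 = 1 - d1 - d2
    return d1, d2, d3
```

In real data the failure time itself is never seen; only the monitoring times and the three indicators are recorded.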
As mentioned above, our focus will be on the estimation of the covariate effect on the Tijs. For this, we assume that for each cluster, there exists a latent variable bi and all Tijs are independent given bi. Also, the hazard function for Tij is specified by the following additive hazard model:
| $\lambda_{ij}(t \mid Z_{ij}(s), s \le t; b_i) = \lambda_0(t) + \beta_0^{T} Z_{ij}(t) + b_i$ | (1) |
given bi and the covariate process up to time t (Lin et al. 1998; Ghosh 2001). Here, λ0(t) is an unknown baseline hazard function and β0 denotes the p-dimensional vector of regression parameters.
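To make the role of the latent variable concrete, the following worked step derives the conditional survival function. It assumes the additive form in which bi enters the hazard as an extra additive term, λ0(t) + β0ᵀZij(t) + bi; the symbol Λ0 for the cumulative baseline hazard is introduced here for illustration.

```latex
S_{ij}(t \mid Z_{ij}, b_i)
  = \exp\Big\{ -\int_0^t \big[ \lambda_0(s) + \beta_0^{T} Z_{ij}(s) + b_i \big]\, ds \Big\}
  = \exp\Big\{ -\Lambda_0(t) - \int_0^t \beta_0^{T} Z_{ij}(s)\, ds - b_i t \Big\},
\qquad \Lambda_0(t) = \int_0^t \lambda_0(s)\, ds .
```

For a time-independent covariate and a constant baseline λ0(t) = c, this reduces to an exponential distribution with rate c + β0ᵀZij + bi, provided that rate is positive.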
In practice, the monitoring variables Uij and Vij could depend on covariates, too. To model this dependence while respecting the strict order restriction between them, it is natural to treat the monitoring times themselves as failure times. Here, we consider employing the Cox-type models
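| $\lambda_{ij}^{U}(t \mid Z_{ij}(s), s \le t) = \lambda_1(t) \exp\{\gamma_0^{T} Z_{ij}(t)\}$ | (2) |
and
| $\lambda_{ij}^{V}(t \mid U_{ij}, Z_{ij}(s), s \le t) = I(t > U_{ij})\, \lambda_2(t) \exp\{\gamma_0^{T} Z_{ij}(t)\}$ | (3) |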
for their hazard functions. In the above, both λ1(t) and λ2(t) are unspecified baseline hazard functions and γ0 is a p-dimensional vector of regression parameters. Note that model (3) essentially assumes that the gap time between the two monitoring times Uij and Vij follows a Cox-type model conditional on Uij. There are several motivations for considering the models described. First, it is well known that the Cox model is one of the most widely used models partly due to its simplicity and it will be seen that under the models above, an easy estimation procedure can be developed for regression coefficients without the need to estimate the baseline hazard functions. Also, the model assumptions can be easily checked since we have complete data for the monitoring times. Furthermore, the same baseline hazard functions and covariate effects for all subjects were assumed in the models (1)–(3), respectively, for simplicity and the method developed below can be easily generalised to more general situations. Some comments on these will be given below.
For subject j in cluster i, define
where j = 1,…, ni; i = 1,…, n. Then, define the following counting processes:
and conditional on Uij,
where and . Note that the definition of is naturally based on the order restriction between Uij and Vij and indicates that Vij is considered only after Uij has been observed. Based on the models (1)–(3), following the same arguments as those in Wang et al. (2010), one can derive the intensity functions of and as
| (4) |
and
| (5) |
respectively. It can be seen that models (4) and (5) are Cox-type ones similar to models (2) and (3) and model (5) is a conditional one since the starting time point is the observed monitoring time Uij. In the following section, we will establish some estimating equations for the estimation of regression coefficients β0 and γ0.
3. Estimation of regression parameters
Now we consider the estimation of regression parameters β0 and γ0 and for this, we will employ the estimating equation approach. For l = 0, 1 and 2, define
and
where . Here for any vector a, a(0) = 1, a(1) = a and a(2) = aaT. Based on models (4) and (5) and motivated by Zhu, Tong and Sun (2008) and Wang et al. (2010), we propose the following estimating function:
for the estimation of β0 if γ0 is known, where and .
The key idea here is to reduce general interval-censored data to current status data (Zhu et al. 2008). In the expression of Uβ(β, γ), the first term is the partial likelihood score function under model (4) if one observes only current status data at the Uijs, while the second term is the partial likelihood score function under model (5) if one considers only current status data given by the observations on the Vijs. Thus, Uβ(β, γ) is an unbiased estimating function. A similar estimating function can be developed for the estimation of γ0. However, complete data are actually available for the monitoring times Uij and Vij and thus it is more efficient to estimate γ0 based on models (2) and (3).
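The reduction to current status data can be illustrated with a small sketch. It assumes three indicators d1, d2, d3 marking whether the failure occurred before Uij, between Uij and Vij, or after Vij; the function name `to_current_status` and the record layout are hypothetical, and the weighting used inside the paper's estimating function is not reproduced here.

```python
from typing import List, Tuple


def to_current_status(u: float, v: float, d1: int, d2: int, d3: int) -> List[Tuple[float, int]]:
    """Split one interval-censored record into two current status records.

    Each record is (monitoring_time, event_free), where event_free = 1 means
    the failure had not yet occurred at that monitoring time.
    """
    if d1 + d2 + d3 != 1:
        raise ValueError("exactly one of d1, d2, d3 must equal 1")
    at_u = (u, 1 - d1)  # event-free at u unless the failure came before u
    at_v = (v, d3)      # event-free at v only if the failure came after v
    return [at_u, at_v]
```

For an observation with the failure between the two monitoring times, the subject is event-free at the first time and has failed by the second.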
To this end, define and given Uij. Also for l = 0, 1 and 2, define
and
Then, an estimating function for γ0 can be constructed as
Define the estimate of γ0 and to be the solution to Uγ (γ) = 0 and the estimate of β0 and to be the solution to . It can be easily shown that is consistent and has an asymptotic normal distribution (Wei, Lin and Weissfeld 1989; Lin 1994). The consistency of can be similarly proved by noting that converges to a positive matrix at β0. For the asymptotic distribution of , one can first show that converges in distribution to a vector of normal random variables with a zero mean vector and a covariance matrix that can be consistently estimated. It then follows by the Taylor series expansion of around β0 that the distribution of can be asymptotically approximated by the normal distribution with mean zero and a covariance matrix that can be consistently estimated and provided in the appendix. The proof of the result is also sketched in the appendix.
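Solving an estimating equation of this kind is typically done by Newton-Raphson iteration. The sketch below is not the paper's Uγ(γ); it is the ordinary Cox partial-likelihood score for fully observed times with a single scalar covariate, assuming no tied times, and the function names are hypothetical. It is meant only to show the mechanics of finding the root of a score function.

```python
import math
from typing import List, Tuple


def cox_score_and_info(gamma: float, times: List[float], z: List[float]) -> Tuple[float, float]:
    """Partial-likelihood score and information for one covariate, no censoring."""
    # Visit subjects from the largest time down so the risk set grows by one each step.
    order = sorted(range(len(times)), key=lambda i: times[i], reverse=True)
    s0 = s1 = s2 = 0.0
    score = info = 0.0
    for i in order:
        w = math.exp(gamma * z[i])
        s0 += w
        s1 += w * z[i]
        s2 += w * z[i] * z[i]
        zbar = s1 / s0                 # weighted covariate mean over the risk set
        score += z[i] - zbar
        info += s2 / s0 - zbar * zbar  # weighted covariate variance over the risk set
    return score, info


def solve_gamma(times: List[float], z: List[float], tol: float = 1e-8, max_iter: int = 50) -> float:
    """Newton-Raphson root of the score, started at gamma = 0."""
    gamma = 0.0
    for _ in range(max_iter):
        score, info = cox_score_and_info(gamma, times, z)
        step = score / info
        gamma += step
        if abs(step) < tol:
            break
    return gamma
```

The model-based information used here ignores within-cluster correlation, so it would not by itself give a valid variance for clustered data.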
4. A simulation study
We conducted an extensive simulation study to assess the finite sample performance of the estimation procedure proposed in the previous sections. To generate the failure times of interest in the study, it was assumed that Tij followed the model
with λ0(t) = 2. Note that the covariate process was assumed to be time independent for simplicity and was generated from the Bernoulli distribution with success probability p = 0.5. Also, the latent variables bis were assumed to follow a normal distribution with zero mean and variance equal to . The monitoring variables Uijs and Vijs were generated based on the models (2) and (3) with λ1(t) = 4 and λ2(t) = 2, respectively. Finally, the cluster size ni was assumed to follow the uniform distribution U{2, 3, 4}.
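A data-generating step of this kind can be sketched as follows. Several details are assumptions: the latent variable is taken to enter the hazard additively, the gap Vij − Uij is drawn as an exponential variable (which is what a constant λ2 implies when the hazard of Vij is switched on after Uij), and the standard deviation `frailty_sd = 0.1` is a hypothetical placeholder because the variance is not legible in the text. The function name and record layout are illustrative.

```python
import math
import random
from typing import Dict, List


def simulate_cluster_data(n_clusters: int, beta: float, gamma: float,
                          frailty_sd: float = 0.1, seed: int = 0) -> List[Dict[str, float]]:
    """Generate clustered interval-censored records with constant baseline hazards."""
    rng = random.Random(seed)
    rows = []
    for i in range(n_clusters):
        b = rng.gauss(0.0, frailty_sd)        # shared latent variable (assumed sd)
        size = rng.choice([2, 3, 4])          # cluster size, uniform on {2, 3, 4}
        for _ in range(size):
            z = 1 if rng.random() < 0.5 else 0
            rate_t = 2.0 + beta * z + b       # additive hazard with lambda0(t) = 2
            if rate_t <= 0:
                raise ValueError("additive hazard must be positive")
            t = rng.expovariate(rate_t)
            u = rng.expovariate(4.0 * math.exp(gamma * z))      # lambda1(t) = 4
            v = u + rng.expovariate(2.0 * math.exp(gamma * z))  # lambda2(t) = 2, gap after u
            d1 = int(t < u)
            d2 = int(u <= t < v)
            d3 = 1 - d1 - d2
            # The failure time t is discarded: only u, v and the indicators are observed.
            rows.append({"cluster": i, "z": z, "u": u, "v": v,
                         "d1": d1, "d2": d2, "d3": d3})
    return rows
```

Calling `simulate_cluster_data(200, 0.5, -0.25)` would give one replicate in the spirit of the n = 200 setting.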
The following tables give the results based on 1000 replications, including the bias of the estimates given by the average of the estimates minus the true value (BIAS), the sample standard deviation (SSD) of the estimates, the average of the estimated standard errors (ESE) and the 95% empirical coverage probability (95%-CP). In particular, Table 1 presents the results obtained for the estimates of γ0 and β0 based on the simulated data with the true values of β0 being −0.25, 0, 0.25, 0.5 and 1, respectively, with γ0 = −0.25. Two choices of the number of clusters, n = 200 and 400, are considered in this simulation study. These results suggest that the proposed estimates seem to be unbiased and that the SSD is close to the ESE, indicating that the proposed variance estimate and the normal approximation to the distribution of the estimates are reasonable. It is interesting to note that the parameter γ seems to be better estimated than the parameter β. This is reasonable as complete data are available for the former, whereas one only observes incomplete data for the latter.
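The four summary measures can be computed from the replicated estimates as in the sketch below. The coverage calculation assumes a normal-based interval of the form estimate ± 1.96 × standard error, which is a common choice but is not stated explicitly in the text; the function name `summarize` is hypothetical.

```python
from statistics import mean, stdev
from typing import Dict, List


def summarize(estimates: List[float], std_errors: List[float],
              true_value: float, z_crit: float = 1.96) -> Dict[str, float]:
    """Compute BIAS, SSD, ESE and empirical coverage over simulation replications."""
    bias = mean(estimates) - true_value   # average estimate minus the truth
    ssd = stdev(estimates)                # sample standard deviation of the estimates
    ese = mean(std_errors)                # average estimated standard error
    covered = [abs(est - true_value) <= z_crit * se
               for est, se in zip(estimates, std_errors)]
    cp = sum(covered) / len(covered)      # share of intervals containing the truth
    return {"BIAS": bias, "SSD": ssd, "ESE": ese, "CP": cp}
```

Agreement between SSD and ESE indicates that the variance estimator tracks the true sampling variability.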
Table 1.
Estimation of γ and β with binary covariate and γ0 = −0.25.
| Parameter (true value) | BIAS (n = 200) | SSD (n = 200) | ESE (n = 200) | 95%-CP (n = 200) | BIAS (n = 400) | SSD (n = 400) | ESE (n = 400) | 95%-CP (n = 400) |
|---|---|---|---|---|---|---|---|---|
| γ0 = −0.25 | −0.0068 | 0.1107 | 0.1077 | 0.948 | −0.0029 | 0.0772 | 0.0760 | 0.949 |
| β0 = −0.25 | −0.0638 | 0.8182 | 0.7605 | 0.937 | −0.0299 | 0.5587 | 0.5331 | 0.940 |
| γ0 = −0.25 | −0.0084 | 0.1087 | 0.1059 | 0.948 | 0.0058 | 0.0752 | 0.0748 | 0.965 |
| β0 = 0.00 | −0.0694 | 0.8162 | 0.7619 | 0.966 | −0.0310 | 0.5616 | 0.5352 | 0.969 |
| γ0 = −0.25 | 0.0112 | 0.1074 | 0.1046 | 0.941 | 0.0082 | 0.0745 | 0.0739 | 0.952 |
| β0 = 0.25 | −0.0767 | 0.8226 | 0.7672 | 0.965 | −0.0317 | 0.5658 | 0.5376 | 0.960 |
| γ0 = −0.25 | 0.0193 | 0.1063 | 0.1036 | 0.944 | 0.0120 | 0.0734 | 0.0732 | 0.945 |
| β0 = 0.50 | −0.0832 | 0.8256 | 0.7728 | 0.967 | −0.0319 | 0.5640 | 0.5419 | 0.945 |
| γ0 = −0.25 | 0.0341 | 0.1044 | 0.1021 | 0.934 | 0.0259 | 0.0721 | 0.0720 | 0.937 |
| β0 = 1.00 | −0.0962 | 0.8518 | 0.7902 | 0.955 | −0.0480 | 0.5797 | 0.5533 | 0.938 |
Tables 2 and 3 give the results obtained for the estimation of the regression parameters β and γ with γ0 = 0 and 0.25, respectively, and all other set-ups being the same as those in Table 1. They are similar to those given in Table 1 and again suggest that the proposed estimation approach seems to work well for the situations considered.
Table 2.
Estimation of γ and β with binary covariate and γ0 = 0.00.
| Parameter (true value) | BIAS (n = 200) | SSD (n = 200) | ESE (n = 200) | 95%-CP (n = 200) | BIAS (n = 400) | SSD (n = 400) | ESE (n = 400) | 95%-CP (n = 400) |
|---|---|---|---|---|---|---|---|---|
| γ0 = 0.00 | −0.0051 | 0.1126 | 0.1100 | 0.952 | −0.0034 | 0.0799 | 0.0777 | 0.945 |
| β0 = −0.25 | 0.0172 | 0.8776 | 0.8255 | 0.938 | 0.0155 | 0.6065 | 0.5795 | 0.942 |
| γ0 = 0.00 | 0.0019 | 0.1106 | 0.1081 | 0.948 | 0.0039 | 0.0785 | 0.0764 | 0.945 |
| β0 = 0.00 | 0.0315 | 0.8804 | 0.8273 | 0.975 | 0.0065 | 0.6158 | 0.5810 | 0.968 |
| γ0 = 0.00 | 0.0073 | 0.1102 | 0.1065 | 0.946 | 0.0122 | 0.0768 | 0.0753 | 0.943 |
| β0 = 0.25 | −0.0225 | 0.8835 | 0.8303 | 0.973 | −0.0061 | 0.6106 | 0.5838 | 0.963 |
| γ0 = 0.00 | 0.0122 | 0.1087 | 0.1053 | 0.941 | 0.0143 | 0.0755 | 0.0744 | 0.946 |
| β0 = 0.50 | −0.0422 | 0.8903 | 0.8389 | 0.931 | −0.0169 | 0.6110 | 0.5892 | 0.941 |
| γ0 = 0.00 | 0.0263 | 0.1063 | 0.1035 | 0.932 | 0.0212 | 0.0746 | 0.0732 | 0.940 |
| β0 = 1.00 | −0.0748 | 0.9116 | 0.8583 | 0.933 | −0.0498 | 0.6539 | 0.6036 | 0.960 |
Table 3.
Estimation of γ and β with binary covariate and γ0 = 0.25.
| Parameter (true value) | BIAS (n = 200) | SSD (n = 200) | ESE (n = 200) | 95%-CP (n = 200) | BIAS (n = 400) | SSD (n = 400) | ESE (n = 400) | 95%-CP (n = 400) |
|---|---|---|---|---|---|---|---|---|
| γ0 = 0.25 | −0.0036 | 0.1177 | 0.1151 | 0.941 | −0.0009 | 0.0829 | 0.0831 | 0.956 |
| β0 = −0.25 | 0.0992 | 0.9643 | 0.9112 | 0.963 | 0.0328 | 0.7017 | 0.6768 | 0.956 |
| γ0 = 0.25 | 0.0035 | 0.1154 | 0.1126 | 0.941 | 0.0054 | 0.0812 | 0.0762 | 0.948 |
| β0 = 0.00 | 0.0850 | 0.9610 | 0.9150 | 0.940 | 0.0353 | 0.6957 | 0.6747 | 0.956 |
| γ0 = 0.25 | 0.0122 | 0.1138 | 0.1107 | 0.948 | 0.0118 | 0.0802 | 0.0783 | 0.945 |
| β0 = 0.25 | 0.0731 | 0.9634 | 0.9133 | 0.971 | 0.0307 | 0.7038 | 0.6763 | 0.959 |
| γ0 = 0.25 | 0.0193 | 0.1124 | 0.1093 | 0.941 | 0.0186 | 0.0792 | 0.0772 | 0.938 |
| β0 = 0.50 | 0.0627 | 0.9690 | 0.9216 | 0.936 | 0.0262 | 0.7102 | 0.6822 | 0.960 |
| γ0 = 0.25 | 0.0286 | 0.1112 | 0.1070 | 0.926 | 0.0208 | 0.0769 | 0.0757 | 0.935 |
| β0 = 1.00 | 0.0432 | 0.9721 | 0.9289 | 0.972 | 0.0299 | 0.7129 | 0.6964 | 0.969 |
5. An application
In this section, we apply the estimation procedure proposed in the previous sections to a set of clustered interval-censored failure time data arising from an LF study (Williamson, Kim, Manatunga and Addiss 2008). The study followed 47 men with LF, a debilitating parasitic disease in which several worms live together in several nests. An effective treatment is expected to kill the worms in all of the nests. The goal of the study was to compare the effect of the co-administration of diethylcarbamazine (DEC) and albendazole (ALB) (new treatment) versus DEC alone (standard treatment) for the treatment of LF. The patients in the study were followed for a year after their treatment and periodically examined by ultrasound to see if the worms were still alive. Thus, for the times to the clearance of the worms in each nest, the variables of interest, only clustered interval-censored data were observed, with each patient serving as a cluster and the cluster size being the number of nests of adult filarial worms in the body of each patient.
Among 47 patients, 22 received the co-administration of DEC and ALB, while the others were given DEC alone. In total, 78 adult worm nests were detected by ultrasound with the cluster size ni ranging from 1 to 5. In addition to the treatment indicator, the age of each subject in years was also observed, ranging from 16 to 66. In the analysis below, we define X1i to be 0 if subject i was given the co-administration of DEC and ALB and 1 otherwise and let X2i be the age of the corresponding patient. Note that here we only have cluster-specific covariates.
To apply the proposed method, we assume that the times to the clearance of the worms and the monitoring times can be described by models (1)–(3), respectively. Since the observed data were given in the form Tij ∈ [Lij, Rij), that is, we have a mixture of left-, interval-, and right-censored observations, we need to convert them to the form expressed by Uij and Vij to implement the proposed estimation procedure. For an observed interval [Lij, Rij), we set Uij = Rij and Vij to be the largest observation time in the study if Lij = 0; if 0 < Lij < Rij < +∞, we let Uij = Lij and Vij = Rij; if Rij = +∞, we take Vij = Lij and Uij to be the smallest observation time in the study. Correspondingly, the estimating function Uγ (γ) needs to be adjusted to
where and . In contrast, the estimating function Uβ(β, γ) remains the same. This essentially treats Uij as missing in the right-censored case, and Vij as missing in the left-censored case.
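The conversion rule from [Lij, Rij) to (Uij, Vij) can be written directly as a small function. The name `to_monitoring_times`, the returned indicator triple, and the rejection of the uninformative interval [0, +∞) are illustrative choices; `t_min` and `t_max` stand for the smallest and largest observation times in the study.

```python
import math
from typing import Tuple


def to_monitoring_times(left: float, right: float,
                        t_min: float, t_max: float) -> Tuple[float, float, int, int, int]:
    """Map an observed interval [left, right) to (u, v, d1, d2, d3)."""
    if left == 0 and math.isinf(right):
        raise ValueError("interval [0, inf) carries no information")
    if left == 0:
        # Left-censored: failure before right; v is filled with the largest time.
        return right, t_max, 1, 0, 0
    if math.isinf(right):
        # Right-censored: failure after left; u is filled with the smallest time.
        return t_min, left, 0, 0, 1
    if 0 < left < right:
        # Interval-censored: both monitoring times are observed.
        return left, right, 0, 1, 0
    raise ValueError("expected 0 <= left < right")
```

With hypothetical observation times between day 7 and day 365, a right-censored record `[90, inf)` would map to `u = 7`, `v = 90`.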
The results obtained by applying the proposed estimation procedure to the data are presented in Table 4, which includes the estimated treatment and age effects on the time to the clearance of the worms, the estimated standard deviations (SD), and the p-values for testing whether the covariate effects are equal to zero. They suggest that the two treatments do not differ significantly in killing or clearing the worms and that the clearance of the worms does not seem to be significantly related to the age of the patient. However, these conclusions should be interpreted with caution because of the small number of subjects.
Table 4.
Estimates from the LF study.
| Covariate | Estimate | SD | p-Value |
|---|---|---|---|
| Treatment (γ1) | −0.375 | 0.383 | 0.327 |
| Age (γ2) | 0.0061 | 0.020 | 0.755 |
| Treatment (β1) | −0.522 | 0.442 | 0.237 |
| Age (β2) | 0.0108 | 0.415 | 0.794 |
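The p-values in Table 4 are consistent with a two-sided test based on the normal approximation, in which the estimate divided by its SD is compared with a standard normal distribution. That this is the test used is an assumption, and the function name `wald_p_value` is hypothetical.

```python
import math


def wald_p_value(estimate: float, sd: float) -> float:
    """Two-sided p-value for H0: effect = 0 under a normal approximation."""
    z = abs(estimate / sd)
    # 2 * (1 - Phi(z)) equals erfc(z / sqrt(2)) for the standard normal cdf Phi.
    return math.erfc(z / math.sqrt(2.0))
```

For example, `wald_p_value(-0.375, 0.383)` is close to the value reported for the treatment effect under the monitoring model.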
6. Concluding remarks and discussion
As mentioned before, clustered failure time data occur in a study if study subjects are related through being clustered into small groups. In this case, one has to take into account the correlation among them to perform a valid analysis. In the previous sections, an estimation procedure was developed for the problem in the presence of interval censoring for data arising from the additive hazards model and the asymptotic properties of the proposed estimates were established. One major advantage of the presented method is that it does not involve the estimation of the baseline hazard functions.
For the problem considered here, an alternative approach is to employ the full likelihood approach and a main advantage of this method is that one can expect that it could be more efficient than the estimating equation method proposed here. However, the full likelihood approach would be time-consuming and may be infeasible if a nonparametric or semi-parametric approach was adopted since it involves the estimation of the infinite-dimensional functions. Also, it could be difficult to derive the asymptotic properties of the resulting estimates. In contrast, the proposed method can be easily implemented.
The proposed methodology involves modelling gap times between the adjacent monitoring times using the Cox model. An alternative choice is to model the monitoring times using the Cox model marginally. However, such modelling requires stricter conditions due to the order relationship. In contrast, the gap time modelling approach is more flexible. In the presented approach, the latent variables bis are assumed to follow the normal distribution for simplicity and the methodology still applies for other distributions.
Acknowledgements
The authors are grateful to the editor and the reviewers for their insightful comments on the article. This work was partially supported by NIH grant 5 R01 CA152035 to the third author.
Appendix
In the following, we will sketch the proofs of the asymptotic normality of the estimates and . First, we will discuss the asymptotic normality of . For this and l = 0, 1, 2, let and denote the limits of and , respectively. By following Wang et al. (2010), one can obtain that
| (1) |
where
and
with and . Thus, the distribution of can be approximated by the normal distribution with mean zero and covariance matrix which can be consistently estimated by , In the above,
and
For the asymptotic normality of , let denote the limit of and k = 1, 2. By some calculation and Equation (1), we have
Note that here and . Furthermore, define the following for l = 0, 1,
and
By the Taylor series expansion and Equation (1), one obtains that
This yields that
where
This shows that the distribution of can be approximated by the normal distribution with zero mean and the covariance matrix that can be consistently estimated by , where
and
References
- Cai J, Prentice R. Regression Estimation Using Multivariate Failure Time Data and a Common Baseline Hazard Function Model. Lifetime Data Analysis. 1997;3:197–213. doi: 10.1023/a:1009613313677.
- Cai J, Zeng D. Additive Mixed Effect Model for Clustered Failure Time Data. Biometrics. 2011;67:1340–1351. doi: 10.1111/j.1541-0420.2011.01590.x.
- Cai T, Wei L, Wilcox M. Semi-parametric Regression Analysis for Clustered Failure Time Data. Biometrika. 2000;87:867–878.
- Ghosh D. Efficiency Considerations in the Additive Hazard Model with Current Status Data. Statistica Neerlandica. 2001;55:367–376.
- Huang J. Estimation for the Cox Model with Interval Censoring. The Annals of Statistics. 1996;24:540–568.
- Jewell NP. Non-parametric Estimation and Doubly-Censored Data: General Ideas and Applications to AIDS. Statistics in Medicine. 1994;13:2081–2095. doi: 10.1002/sim.4780131917.
- Jewell NP, van der Laan MJ. Current Status Data: Review, Recent Developments and Open Problems. Advances in Survival Analysis. 2004;23:625–642.
- Lin D. Cox Regression Analysis of Multivariate Failure Time Data: The Marginal Approach. Statistics in Medicine. 1994;13:2233–2247. doi: 10.1002/sim.4780132105.
- Lin D, Oakes D, Ying Z. Additive Hazard Regression with Current Status Data. Biometrika. 1998;85:289–298.
- Martinussen T, Scheike TH. Efficient Estimation in Additive Hazard Regression with Current Status Data. Biometrika. 2002;89:649–658.
- Rossini A, Moore D. Modeling Clustered, Discrete, or Grouped Time Survival Data with Covariates. Biometrics. 1999;55:813–819. doi: 10.1111/j.0006-341x.1999.00813.x.
- Sun J. The Statistical Analysis of Interval Censored Failure Time Data. Springer; New York: 2006.
- Wang W, Ding AA. On Assessing the Association for Bivariate Current Status Data. Biometrika. 2000;87:879–893.
- Wang L, Sun J, Tong X. Regression Analysis of Case II Interval-Censored Failure Time Data with the Additive Hazard Model. Statistica Sinica. 2010;20:1709–1723.
- Wei L, Lin D, Weissfeld L. Regression Analysis of Multivariate Incomplete Failure Time Data by Modeling Marginal Distributions. Journal of the American Statistical Association. 1989;84:1065–1073.
- Williamson J, Kim H, Manatunga A, Addiss D. Modeling Survival Data with Informative Cluster Size. Statistics in Medicine. 2008;27:543–555. doi: 10.1002/sim.3003.
- Zeng D, Lin D, Lin X. Semiparametric Transformation Models with Random Effects for Clustered Failure Time Data. Statistica Sinica. 2008;18:355–377.
- Zhu L, Tong X, Sun J. A Transformation Approach for the Analysis of Interval-Censored Failure Time Data. Lifetime Data Analysis. 2008;14:167–178. doi: 10.1007/s10985-007-9075-8.
