Abstract
In large prospective cohort studies, the accumulation of covariate information and follow-up data makes up the majority of the study cost. This can make a study infeasible when some variables are expensive to measure and/or the event is rare. Prentice (1986) proposed the case-cohort study for time-to-event data to tackle this problem. There has been extensive research on the analysis of univariate and clustered failure time data, where the clusters are formed among different individuals, under the case-cohort sampling scheme. However, recurrent event data are quite common in biomedical and public health research. In this paper, we propose case-cohort sampling schemes for recurrent events. We consider a multiplicative rates model for the recurrent events and propose a weighted estimating equations approach for parameter estimation. We show that the estimators are consistent and asymptotically normally distributed. The proposed estimator performed well in finite samples in our simulation studies. For illustration purposes, we examined the association between prior occurrence of measles and acute lower respiratory tract infections (ALRI) among young children in Brazil.
Keywords: Generalized Case-Cohort Design, Recurrent Events, Correlated Data, Acute Lower Respiratory Tract Infections
1. Introduction
In large epidemiological studies and disease prevention trials, the majority of the effort and cost arises from assembling the covariate measurements and follow-up information on all individuals. When the disease incidence is low and some exposures are expensive to measure, it is not cost-effective, and sometimes not feasible, to measure the expensive variables on all individuals in the cohort. To reduce cost while achieving the same study goals as the cohort study, Prentice (1986) proposed the case-cohort study design. Under this design, a random sample, called the subcohort, is selected from the entire cohort, and covariate information is collected only on this subcohort and on the individuals who experience the event.
The development of statistical methods for data from case-cohort studies is an active research area. For univariate failure time data, Self and Prentice (1988), Wacholder et al (1989), and Barlow (1994) considered efficient and robust estimation of the variance of the case-cohort estimator. Borgan et al (1995) considered a more general sampling frame, whereas Lin and Ying (1993) viewed the case-cohort design as a special case of the missing data problem. Borgan et al (2000) developed methods for the analysis of exposure-stratified case-cohort designs, and Breslow and Wellner (2006) considered weighted likelihood for two-phase stratified samples. Chen (2001) and Kulich and Lin (2004) developed sample reuse methods via local averaging, leading to more efficient estimation. Correlated failure time data are also quite common in biomedical and public health research. Lu and Shih (2006) and Zhang et al (2011) developed estimating equations for clustered failure time data assuming a marginal hazards model, accounting for correlation within clusters, which are formed by correlated subjects. Kang and Cai (2009a) considered a marginal hazards model for case-cohort data with multiple disease outcomes. However, methods for analyzing recurrent events data from case-cohort studies are scarce.
Recurrent events are commonly encountered in biomedical research. Our motivating example is from a double-blind, placebo-controlled community trial conducted in northeastern Brazil in a cohort of children aged 6 to 48 months (Barreto et al, 1994). The primary objective of this study was to evaluate the effect of high doses of vitamin A on acute lower respiratory tract infection (ALRI). One thousand two hundred and seven children were randomized to receive either vitamin A supplement or placebo. They were followed for 1 year. An episode of ALRI was defined as cough plus a respiratory rate of 50 breaths per min or higher for children under 12 months, and 40 breaths per min or higher for older children (Barreto et al, 1994). About 15.37% of the children had at least one ALRI episode during their follow-up period. The number of episodes ranged from 1 to 6, resulting in a total of 305 episodes. As a secondary objective, it is of interest to examine whether a child's history of measles is related to ALRI. It can be expensive to verify the measles information because it is based on the parents' acknowledgement. With the relatively low ALRI rate, a case-cohort sampling design can be more cost-effective in this situation.
Various methods have been proposed for analyzing recurrent event data from the full cohort. These include modeling the intensity function of the recurrent event process (Andersen and Gill, 1982), the rate/mean function (Pepe and Cai, 1993; Lawless and Nadeau, 1995; Lin et al, 2000; Cook et al, 2009), and the gap times between successive recurrences (Huang and Chen, 2003; Schaubel and Cai, 2004). However, methods for analyzing recurrent events data from case-cohort studies are limited. Chen and Chen (2014) extended the case-cohort design to recurrent events with a specific clustering feature using a modified Cox-type self-exciting intensity model. Such a model assumes that the dependence among the recurrent events is captured by some time-varying covariates. This assumption may not be easily verifiable. Schaubel et al (2006) further noted that it is reasonable to expect a covariate to affect the entire event history up to time t if it affects the event rate at that time. The overall effect of the covariates on the event process would then be underestimated, since the intensity model assumes that the event rate conditional on the entire event history equals that conditional on only the covariate information at time t. An alternative is to model the marginal rate or mean function. The marginal rates or means model does not require such an assumption (Lin et al, 2000), and its parameters have a population-average interpretation, which is desirable in many population studies. However, analysis methods for the marginal rates model have not been investigated for recurrent events data from the case-cohort study design.
The main goal of this article is to propose case-cohort designs for recurrent events data and estimation procedures for data from such designs. We consider two different situations. The first is when the recurrent events are not very common, in which case we include in the case-cohort sample all individuals who developed events during the follow-up. The second is when the recurrent events are relatively common, that is, when the proportion of subjects who experienced at least one event is about 20% to 50%. In this case, we propose to include in the case-cohort study only a sample of those who developed events during the follow-up. We refer to the first situation as the original case-cohort design and the second as the generalized case-cohort design.
In this paper, we propose weighted estimating equations for estimating the parameters in the marginal rates regression model for recurrent events in case-cohort studies. The article is organized as follows. In Section 2, the design of the study and the estimation procedure are proposed. The asymptotic properties of the estimators are studied in Section 3. The finite sample properties are investigated by simulations in Section 4. In Section 5, we illustrate the proposed method on a case-cohort study based on the ALRI data on children in Brazil. In Section 6, we provide some final remarks.
2. Model and Estimation
Suppose there are n independent individuals in the cohort. Let be the number of recurrent events for individual i over the time interval [0, t), Ci is the censoring time. is the p−dimensional covariate of interest for individual i, where is the set of expensive−to−measure variables and is the set of all other covariates. Let denote the j−th recurrent event time for individual i. The observed time is , j = 1, 2, . . . , ni + 1, where ni is the number of events that are observed for individual i, and is the total number of observed events. Let , , which is the indicator that individual i experienced at least one event, and τ denote the study ending time. The rate function for an individual is denoted as . We assume the following proportional rates model:
| (1) |
where µ0(.)is an unspecified continuous baseline mean function and θ0 and γ0 are the vectors of unknown parameters. Denoting , we can rewrite the rates model as and the mean function is given by for all t ∈ [0, τ ]. Note that the covariates are allowed to be time−dependent. We assume that the possibly time−dependent covariates are external (Kalbfleisch and Prentice (2002)), i.e., they are not affected by the recurrent event process.
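For reference, a proportional rates model of the kind described above is commonly written as in the sketch below. The symbols $N_i^*(t)$ for the underlying event-counting process and $Z_i^{E}(t)$, $Z_i^{C}(t)$ for the expensive and the remaining covariates are assumed notation for this sketch and may differ from the authors' own.

```latex
% Assumed notation: N_i^*(t) counts the events of individual i up to time t;
% Z_i^E(t) and Z_i^C(t) are the expensive and the remaining covariates.
\begin{aligned}
E\{dN_i^*(t) \mid Z_i(t)\}
  &= \exp\{\theta_0^{\top} Z_i^{E}(t) + \gamma_0^{\top} Z_i^{C}(t)\}\, d\mu_0(t)
   = \exp\{\beta_0^{\top} Z_i(t)\}\, d\mu_0(t),
  \qquad \beta_0 = (\theta_0^{\top}, \gamma_0^{\top})^{\top}, \\
E\{N_i^*(t) \mid Z_i\}
  &= \int_0^t \exp\{\beta_0^{\top} Z_i(u)\}\, d\mu_0(u),
  \qquad t \in [0, \tau].
\end{aligned}
```

With time-invariant covariates the second line reduces to $\mu_0(t)\exp(\beta_0^{\top} Z_i)$, so $\exp(\beta_0)$ acts as a ratio of mean event counts.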
2.1. Case-cohort study design for recurrent events
In this subsection, we introduce two sampling schemes for recurrent event data. The first deals with the situation where the event is not common in the population. In this case, we draw a random sample from the full cohort and supplement it with all the cases. We refer to this sampling scheme as the original case-cohort design. The second sampling scheme is for the situation where the event is relatively common and we cannot afford to sample all individuals with events. An example of a common recurrent event is the randomized double-blind trial conducted by Genentech Inc. in the early 1990s to study the effect of rhDNase on pulmonary exacerbations among patients with cystic fibrosis (Therneau and Hamilton, 1997). Even though the pulmonary exacerbation rate is about 40%, genetic information based on biospecimens collected at baseline may be quite expensive to obtain. In such a situation, we propose to sample only a fraction of those who have events for the case-cohort sample. We refer to this sampling scheme as the generalized case-cohort design with recurrent events.
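The two sampling schemes can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `draw_case_cohort_sample` and its arguments are hypothetical, sample sizes are fixed by rounding, and setting `case_fraction=1.0` gives the original design in which every case outside the subcohort is included.

```python
import random

def draw_case_cohort_sample(delta, subcohort_fraction, case_fraction, rng):
    """Return (xi, eta): subcohort indicators and indicators for sampled
    cases outside the subcohort. delta[i] = 1 if individual i had an event."""
    n = len(delta)
    # Simple random sample of the subcohort from the full cohort.
    n_sub = round(subcohort_fraction * n)
    subcohort = set(rng.sample(range(n), n_sub))
    xi = [1 if i in subcohort else 0 for i in range(n)]
    # Cases that were not picked into the subcohort.
    outside_cases = [i for i in range(n) if delta[i] == 1 and xi[i] == 0]
    # Sample a fraction of them (all of them when case_fraction == 1.0).
    n_extra = round(case_fraction * len(outside_cases))
    extra = set(rng.sample(outside_cases, n_extra))
    eta = [1 if i in extra else 0 for i in range(n)]
    return xi, eta

# Example: 100 individuals, 20 of whom are cases.
delta = [1] * 20 + [0] * 80
xi, eta = draw_case_cohort_sample(delta, 0.25, 1.0, random.Random(0))
```

Covariates would then be measured only for individuals with `xi[i] == 1` or `eta[i] == 1`.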
2.1.1. Estimation under the original case-cohort design
Under the case-cohort sampling, we select a subcohort from the entire cohort by simple random sampling. Let ξi denote the indicator for individual i being selected into the subcohort, and let ᾶ = ñ/n be the subcohort proportion, where ñ is the number of individuals selected into the subcohort and n is the number of individuals in the full cohort. We call an individual a case if the individual experienced at least one event and a non-case if the individual did not have an event during the study period. Hence, the covariate information for individual i is observable only if individual i is in the case-cohort sample. In other words, we have {Ti, ∆i, ξi, Zi(t), 0 ≤ t ≤ τ} if ∆i = 1 or ξi = 1, and the expensive covariates are unobserved when ∆i = 0 and ξi = 0. When information on the covariates is available for all individuals, one can consider the following estimating equation for the full cohort data (Lin et al, 2000):
| (2) |
where . One can easily solve the estimating equation by some iterative algorithm, for example, the Newton−Raphson iteration method. However, because the data are not complete in case−cohort studies, (2) cannot be used directly. We consider a weighted estimating equation approach based on the idea of inverse probability of selection weighting. The estimating equation considered for estimating β0 is the following:
| (3) |
where , ∀d = 0, 1, , where is the estimator of the true sampling parameter, α. The weight is 1 for all the cases and is for the non−cases in the sub−cohort. Similar idea for the weights was considered by Kalbfleisch and Lawless (1988).They considered the time−invariant version of , which was given by ᾶ. Borgan et al (2000) used a similar idea for univariate failure time data from stratified case−cohort studies. We denote the solution to this equation by . Our proposed Breslow−Aalen type estimator of the baseline mean function is given by
| (4) |
The estimated mean function for a particular Z(t) is given by .
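The weighting idea above can be illustrated with a small, self-contained Python sketch. It makes several simplifying assumptions that are not taken from the paper: a single time-invariant covariate, the time-invariant weight (1 for cases, the inverse subcohort proportion for subcohort non-cases, 0 otherwise), and at-risk status defined by the censoring time alone. The function names are hypothetical.

```python
import math

def case_cohort_weights(delta, xi):
    """Time-invariant weights; assumes a non-empty subcohort."""
    alpha = sum(xi) / len(xi)  # subcohort sampling proportion
    return [1.0 if d == 1 else x / alpha for d, x in zip(delta, xi)]

def _risk_sums(t, beta, censor_times, z, weights):
    """Weighted sums over individuals still at risk at time t."""
    s0 = s1 = s2 = 0.0
    for c, zk, w in zip(censor_times, z, weights):
        if c >= t:
            r = w * math.exp(beta * zk)
            s0 += r
            s1 += r * zk
            s2 += r * zk * zk
    return s0, s1, s2

def fit_weighted_rates(event_times, censor_times, z, weights, tol=1e-10, max_iter=50):
    """Newton-Raphson solution of a weighted rates estimating equation."""
    beta = 0.0
    for _ in range(max_iter):
        score = info = 0.0
        for times, zi, wi in zip(event_times, z, weights):
            if wi == 0.0:
                continue  # unsampled individuals contribute nothing
            for t in times:
                s0, s1, s2 = _risk_sums(t, beta, censor_times, z, weights)
                zbar = s1 / s0
                score += wi * (zi - zbar)
                info += wi * (s2 / s0 - zbar ** 2)
        if info <= 0.0:
            raise ValueError("no covariate variation in the weighted risk sets")
        step = score / info
        beta += step
        if abs(step) < tol:
            break
    return beta

def baseline_mean(t, beta, event_times, censor_times, z, weights):
    """Breslow-Aalen type estimate of the baseline mean function at time t."""
    total = 0.0
    for times, wi in zip(event_times, weights):
        if wi == 0.0:
            continue
        for u in times:
            if u <= t:
                s0, _, _ = _risk_sums(u, beta, censor_times, z, weights)
                total += wi / s0
    return total
```

With equal follow-up and unit weights, the fitted coefficient reduces to the log ratio of the event rates in the two covariate groups, which gives a quick sanity check on any implementation.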
2.1.2. Estimation under the generalized case-cohort design
For the generalized case−cohort design, we sample a fraction of cases outside of the sub−cohort. Let ηi be an indicator for individual i who is a case but outside the sub−cohort being sampled. Let denote the sampling proportion for the additional cases, where ,n1 and ñ1 are the number of selected individuals who have experienced at least one event but are not in the subcohort, individuals who experienced at least one event in the full cohort and those who were in the subcohort respectively. Under this design, the covariate information is available for the subcohort members and the selected cases (ηi = 1). Hence, the observable information for individual i is {Ti, ∆i, ξi, ηi, Zi(t) : t ∈ [0, τ ]} when ξi = 1 or ηi = 1 and if ξi = 0 and ηi = 0. Using the inverse of probability of being sampled as the weight, our proposed estimating equation for the generalized case−cohort sampling scheme is
| (5) |
where , ∀d = 0, 1, and the weight function is given by where is the estimator of the true sampling parameter q.We denote the solution of this equation by . The Breslow−Aalen type estimator of the baseline mean function is . For given Z(t), the estimated mean function is given by .
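As an illustration of inverse-probability-of-sampling weights under the generalized design, the sketch below uses a time-invariant form that is common in the generalized case-cohort literature: weight 1 for cases in the subcohort, the inverse subcohort proportion for subcohort non-cases, and the inverse case-sampling proportion for sampled cases outside the subcohort. This specific form and the function name are assumptions for illustration; the weights proposed in this section are time-varying.

```python
def generalized_case_cohort_weights(delta, xi, eta):
    """delta: case indicator, xi: subcohort indicator,
    eta: indicator that a case outside the subcohort was sampled."""
    n = len(delta)
    alpha = sum(xi) / n  # subcohort sampling proportion
    cases_outside = sum(1 for d, x in zip(delta, xi) if d == 1 and x == 0)
    sampled_outside = sum(
        1 for d, x, e in zip(delta, xi, eta) if d == 1 and x == 0 and e == 1
    )
    # Sampling proportion for the additional cases (1.0 if there are none).
    q = sampled_outside / cases_outside if cases_outside else 1.0
    weights = []
    for d, x, e in zip(delta, xi, eta):
        if x == 1:
            weights.append(1.0 if d == 1 else 1.0 / alpha)
        elif d == 1 and e == 1:
            weights.append(1.0 / q)
        else:
            weights.append(0.0)  # covariates not collected
    return weights
```

Under this form the expected weight of a case is 1, because a case is either in the subcohort (weight 1) or is sampled outside it with probability q and weight 1/q.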
3. Asymptotic properties
In this section, we investigate the asymptotic properties of the estimators. Define the following terms:
We define the norm for the vector m, matrix M , and function f as the following: , , . The estimator under the original case−cohort sampling scheme is a special case of the generalized case−cohort sampling scheme, so its asymptotic property is a special case of the generalized case−cohort sampling scheme. Hence, in the Appendix, we focus the proofs on the asymptotic properties of the estimators under the generalized case−cohort design, and . The regularity conditions and the outline of the proofs are provided in the Appendix. The asymptotic properties are summarized in the following theorems.
Theorem 1 Under the regularity conditions in the Appendix, for k = I or II, is a consistent estimator of β0. converges to a Gaussian distribution with mean zero and variance given by
where ,,,,.
Each of these terms, A(β0), Q(β0), V I (β0), and V II (β0) can be estimated respectively by their sample counterparts, ,, and . The explicit forms are provided in the Appendix.
To obtain the asymptotic distribution of the baseline mean function , we need to first define the metric space, Ɗ [0, τ ], consisting of right continuous functions f (t) with left−hand limits and f : [0, τ ] → R. The metric for this space is defined by, ,f( t), g(t) ∈ Ɗ [0, τ ] . The following theorem summarizes the asymptotic properties of .
Theorem 2 Under the regularity conditions, for k = I or II, converges in probability to µ0(t) uniformly in t ∈ [0, τ ]. Further, defining , we have n1/2Wn(t) converges to a Gaussian distribution with mean zero. The variance−covariance function between Wn(t) and Wn(s) is given by
where
Similarly, each of these terms can be consistently estimated by their sample counterparts, which are provided in the Appendix.
Studying the variance components, we can identify three sources of variation in ΣII (β0) and ϕII (t, s)(β0). The three components correspond to the different stages of sampling present in the data: the first arises from the cohort itself, the second from the sampling of the subcohort from the cohort, and the third from the sampling of the cases outside the subcohort. Further, note that for the original case-cohort design, since no randomness arises from sampling cases outside the random subcohort, the third term does not appear in the variance of the corresponding estimator.
4. Simulation Results
We conducted extensive simulation studies to examine the finite sample properties of the proposed estimators. To generate the recurrent event times, we adopted the algorithm of Jahn-Eimermacher et al (2015). We consider the following random-effects intensity model to generate the recurrent events:
| (6) |
where ϑ is an unobserved unit−mean positive random variable that is independent of Z. We derived the functions, and , from the intensity process using the formula . Independent random numbers, ai, are drawn from a uniform distribution on [0,1]. The following recursive algorithm is applied to obtain the recurrent event data for individual , j = 1, 2, . . . , ni. We assumed that ϑ has a Gamma distribution with mean 1 and variance σ2. We considered binary covariate generated from Bernoulli (0.5) and continuous covariate from Uniform (0,1). We considered different cohort sizes: 1000, 2000 and 4000 and the number of simulated data sets being considered is 1000. We considered σ2 such that mean recurrence is 3 for those who had at least one event. We considered β0 to be 0.5 or 0.
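The data-generating steps described above can be sketched as follows. This is a simplified illustration, not the exact simulation code: it assumes a constant baseline intensity (`base_rate`), a single scalar covariate, and a fixed censoring time, and it generates successive event times by inverting the cumulative intensity at uniform random draws. The function and argument names are hypothetical.

```python
import math
import random

def simulate_recurrent_events(z, beta, base_rate, frailty_var, censor_time, rng):
    """Return the ordered event times of one individual on [0, censor_time]."""
    # Gamma frailty with mean 1 and variance frailty_var
    # (shape = 1 / variance, scale = variance).
    if frailty_var > 0:
        frailty = rng.gammavariate(1.0 / frailty_var, frailty_var)
    else:
        frailty = 1.0
    # Constant conditional intensity (an assumption of this sketch).
    rate = frailty * base_rate * math.exp(beta * z)
    times = []
    cumulative = 0.0  # cumulative intensity at the most recent event
    while True:
        a = rng.random()                  # uniform draw on [0, 1)
        cumulative += -math.log(1.0 - a)  # 1 - a lies in (0, 1], so the log is finite
        t = cumulative / rate             # invert Lambda(t) = rate * t
        if t > censor_time:
            return times
        times.append(t)

# Example: one individual with a binary covariate equal to 1.
rng = random.Random(2024)
example_times = simulate_recurrent_events(1.0, 0.5, 2.0, 0.5, 1.0, rng)
```

Marginally over the frailty, the rate function under this sketch is `base_rate * exp(beta * z)`, so the covariate acts multiplicatively on the rate, as in the proportional rates model.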
Table 1 summarizes the simulation results for situations where the proportion of individuals who experienced at least one event was low (5%, 10%, 20%). For the case-cohort sampling, the subcohort sampling proportion was 25% and all cases were sampled. The simulation results show that the coefficient estimates are approximately unbiased in all situations considered. From Table 1, we note that the proposed estimated standard errors provide good estimates of the true variability of the estimator in all situations except when both the cohort size and the event proportion are small. As the cohort size increases, the performance of the estimated standard error improves. The variance of the estimator decreases as the cohort size and/or the event proportion increases. The coverage rate of the nominal 95% confidence intervals using the proposed method is in the 92% to 95% range in all situations considered except when both the event proportion and the cohort size are small. As the cohort size or event proportion increases, the coverage rate of the 95% confidence interval improves.
Table 1.
Summary of simulation results for the estimator of β0 under the original case-cohort design
| Z | β0 | Cohort Size | Event proportion | Bias | Model Std. Error | Bootstrap Std. Error | Empirical Std. Dev. | Coverage |
|---|---|---|---|---|---|---|---|---|
| Bern.(0.5) | 0.5 | 1000 | 0.05 | −0.003 | 0.474 | 0.509 | 0.535 | 0.85 |
| Bern.(0.5) | 0.5 | 1000 | 0.10 | −0.031 | 0.342 | 0.329 | 0.323 | 0.91 |
| Bern.(0.5) | 0.5 | 1000 | 0.20 | −0.034 | 0.225 | 0.227 | 0.228 | 0.94 |
| Bern.(0.5) | 0.5 | 2000 | 0.05 | 0.002 | 0.344 | 0.353 | 0.349 | 0.90 |
| Bern.(0.5) | 0.5 | 2000 | 0.10 | −0.023 | 0.227 | 0.230 | 0.231 | 0.93 |
| Bern.(0.5) | 0.5 | 2000 | 0.20 | −0.013 | 0.161 | 0.160 | 0.165 | 0.94 |
| Bern.(0.5) | 0.5 | 4000 | 0.05 | −0.02 | 0.245 | 0.247 | 0.245 | 0.93 |
| Bern.(0.5) | 0.5 | 4000 | 0.10 | −0.025 | 0.161 | 0.162 | 0.153 | 0.95 |
| Bern.(0.5) | 0.5 | 4000 | 0.20 | −0.023 | 0.113 | 0.114 | 0.116 | 0.93 |
| Bern.(0.5) | 0 | 1000 | 0.05 | −0.0006 | 0.360 | 0.370 | 0.368 | 0.90 |
| Bern.(0.5) | 0 | 1000 | 0.10 | −0.0002 | 0.287 | 0.294 | 0.288 | 0.94 |
| Bern.(0.5) | 0 | 1000 | 0.20 | 0.008 | 0.219 | 0.221 | 0.222 | 0.94 |
| Bern.(0.5) | 0 | 2000 | 0.05 | 0.008 | 0.256 | 0.259 | 0.261 | 0.92 |
| Bern.(0.5) | 0 | 2000 | 0.10 | −0.005 | 0.205 | 0.207 | 0.208 | 0.93 |
| Bern.(0.5) | 0 | 2000 | 0.20 | 0.003 | 0.154 | 0.155 | 0.154 | 0.95 |
| Bern.(0.5) | 0 | 4000 | 0.05 | 0.0146 | 0.181 | 0.182 | 0.188 | 0.94 |
| Bern.(0.5) | 0 | 4000 | 0.10 | 0.0013 | 0.145 | 0.146 | 0.146 | 0.95 |
| Bern.(0.5) | 0 | 4000 | 0.20 | −0.0037 | 0.109 | 0.110 | 0.110 | 0.94 |
| Unif(0,1) | 0.5 | 1000 | 0.05 | −0.05 | 0.746 | 0.771 | 0.777 | 0.86 |
| Unif(0,1) | 0.5 | 1000 | 0.10 | −0.022 | 0.566 | 0.566 | 0.589 | 0.90 |
| Unif(0,1) | 0.5 | 1000 | 0.20 | −0.045 | 0.392 | 0.395 | 0.395 | 0.94 |
| Unif(0,1) | 0.5 | 2000 | 0.05 | −0.036 | 0.531 | 0.540 | 0.547 | 0.90 |
| Unif(0,1) | 0.5 | 2000 | 0.10 | −0.02 | 0.404 | 0.398 | 0.416 | 0.92 |
| Unif(0,1) | 0.5 | 2000 | 0.20 | −0.011 | 0.279 | 0.278 | 0.281 | 0.94 |
| Unif(0,1) | 0.5 | 4000 | 0.05 | 0.008 | 0.381 | 0.383 | 0.399 | 0.91 |
| Unif(0,1) | 0.5 | 4000 | 0.10 | −0.025 | 0.279 | 0.281 | 0.274 | 0.95 |
| Unif(0,1) | 0.5 | 4000 | 0.20 | −0.023 | 0.197 | 0.197 | 0.200 | 0.94 |
| Unif(0,1) | 0 | 1000 | 0.05 | −0.006 | 0.627 | 0.633 | 0.636 | 0.89 |
| Unif(0,1) | 0 | 1000 | 0.10 | −0.006 | 0.509 | 0.512 | 0.507 | 0.92 |
| Unif(0,1) | 0 | 1000 | 0.20 | 0.006 | 0.382 | 0.381 | 0.380 | 0.94 |
| Unif(0,1) | 0 | 2000 | 0.05 | 0.005 | 0.443 | 0.444 | 0.466 | 0.92 |
| Unif(0,1) | 0 | 2000 | 0.10 | −0.007 | 0.357 | 0.360 | 0.362 | 0.92 |
| Unif(0,1) | 0 | 2000 | 0.20 | 0.0009 | 0.266 | 0.269 | 0.263 | 0.94 |
| Unif(0,1) | 0 | 4000 | 0.05 | 0.021 | 0.312 | 0.314 | 0.322 | 0.93 |
| Unif(0,1) | 0 | 4000 | 0.10 | 0.004 | 0.255 | 0.255 | 0.254 | 0.94 |
| Unif(0,1) | 0 | 4000 | 0.20 | −0.002 | 0.189 | 0.190 | 0.189 | 0.95 |
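The summary columns reported in such a table can be computed from replicate-level output along the following lines. This is a minimal sketch: the function name and the use of a normal critical value of 1.96 for the 95% intervals are assumptions, and `estimates` and `std_errors` stand for the coefficient estimate and model-based standard error from each simulated data set.

```python
import statistics

def summarize_simulation(estimates, std_errors, true_beta, z_crit=1.96):
    """Bias, average model-based SE, empirical SD and CI coverage."""
    covered = sum(
        1 for b, se in zip(estimates, std_errors)
        if abs(b - true_beta) <= z_crit * se  # Wald interval contains the truth
    )
    return {
        "bias": statistics.fmean(estimates) - true_beta,
        "model_se": statistics.fmean(std_errors),
        "empirical_sd": statistics.stdev(estimates),
        "coverage": covered / len(estimates),
    }

# Toy example with four replicates and true coefficient 0.5.
summary = summarize_simulation([0.4, 0.6, 0.5, 0.9], [0.1, 0.1, 0.1, 0.1], 0.5)
```

A model-based standard error that tracks the empirical standard deviation, together with coverage near 0.95, indicates that the asymptotic variance formula works well at that sample size.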
Table 2 summarizes the simulation results for situations where the event proportion is not low (40%, 30% and 25%). We considered generalized case-cohort sampling. The subcohort sampling proportion is 10% and the sampling proportion for the cases outside the subcohort is also 10%. The simulation results show that the coefficient estimates are approximately unbiased, and the bias decreases as the event proportion increases. The proposed variance estimator decreases as the cohort size or the event proportion increases and is close to the empirical variance. The difference between the model SE and the empirical standard deviation decreases as the cohort size increases or the censoring proportion decreases. The 95% confidence interval coverage is close to the nominal level in all situations considered. Additional simulations were conducted for situations with a larger number of recurrent events, with an average of six recurrent events per person. The conclusions are similar to those above, but with smaller variance, as expected.
Table 2.
Summary of simulation results for the estimator of β0 under the generalized case-cohort design
| Z | β0 | Cohort Size | Event proportion | Bias | Model Std. Error | Bootstrap Std. Error | Empirical Std. Dev. | Coverage |
|---|---|---|---|---|---|---|---|---|
| Bern.(0.5) | 0.5 | 1000 | 0.25 | −0.017 | 0.500 | 0.549 | 0.548 | 0.93 |
| Bern.(0.5) | 0.5 | 1000 | 0.30 | 0.018 | 0.429 | 0.454 | 0.460 | 0.92 |
| Bern.(0.5) | 0.5 | 1000 | 0.40 | 0.014 | 0.358 | 0.374 | 0.370 | 0.93 |
| Bern.(0.5) | 0.5 | 2000 | 0.25 | 0.005 | 0.357 | 0.375 | 0.381 | 0.93 |
| Bern.(0.5) | 0.5 | 2000 | 0.30 | 0.01 | 0.305 | 0.317 | 0.327 | 0.93 |
| Bern.(0.5) | 0.5 | 2000 | 0.40 | 0.001 | 0.253 | 0.258 | 0.263 | 0.94 |
| Bern.(0.5) | 0.5 | 4000 | 0.25 | −0.015 | 0.253 | 0.260 | 0.254 | 0.95 |
| Bern.(0.5) | 0.5 | 4000 | 0.30 | 0.008 | 0.217 | 0.219 | 0.219 | 0.95 |
| Bern.(0.5) | 0.5 | 4000 | 0.40 | −0.003 | 0.178 | 0.182 | 0.179 | 0.96 |
| Bern.(0.5) | 0 | 1000 | 0.25 | −0.004 | 0.500 | 0.533 | 0.540 | 0.93 |
| Bern.(0.5) | 0 | 1000 | 0.30 | 0.004 | 0.434 | 0.439 | 0.439 | 0.94 |
| Bern.(0.5) | 0 | 1000 | 0.40 | −0.002 | 0.348 | 0.39 | 0.38 | 0.93 |
| Bern.(0.5) | 0 | 2000 | 0.25 | −0.009 | 0.358 | 0.365 | 0.373 | 0.93 |
| Bern.(0.5) | 0 | 2000 | 0.30 | −0.007 | 0.307 | 0.302 | 0.320 | 0.94 |
| Bern.(0.5) | 0 | 2000 | 0.40 | 0.001 | 0.249 | 0.270 | 0.254 | 0.94 |
| Bern.(0.5) | 0 | 4000 | 0.25 | 0.007 | 0.253 | 0.255 | 0.259 | 0.95 |
| Bern.(0.5) | 0 | 4000 | 0.30 | −0.004 | 0.218 | 0.211 | 0.222 | 0.94 |
| Bern.(0.5) | 0 | 4000 | 0.40 | 0.006 | 0.177 | 0.19 | 0.178 | 0.95 |
| Unif(0,1) | 0.5 | 1000 | 0.25 | 0.004 | 0.858 | 0.888 | 0.966 | 0.92 |
| Unif(0,1) | 0.5 | 1000 | 0.30 | −0.003 | 0.736 | 0.759 | 0.806 | 0.93 |
| Unif(0,1) | 0.5 | 1000 | 0.40 | 0.003 | 0.610 | 0.605 | 0.625 | 0.94 |
| Unif(0,1) | 0.5 | 2000 | 0.25 | −0.026 | 0.620 | 0.613 | 0.640 | 0.94 |
| Unif(0,1) | 0.5 | 2000 | 0.30 | 0.0099 | 0.530 | 0.526 | 0.550 | 0.94 |
| Unif(0,1) | 0.5 | 2000 | 0.40 | −0.016 | 0.436 | 0.421 | 0.459 | 0.93 |
| Unif(0,1) | 0.5 | 4000 | 0.25 | −0.01 | 0.442 | 0.430 | 0.457 | 0.94 |
| Unif(0,1) | 0.5 | 4000 | 0.30 | 0.0004 | 0.376 | 0.368 | 0.391 | 0.94 |
| Unif(0,1) | 0.5 | 4000 | 0.40 | −0.004 | 0.312 | 0.295 | 0.319 | 0.95 |
| Unif(0,1) | 0 | 1000 | 0.25 | −0.018 | 0.864 | 0.907 | 0.923 | 0.92 |
| Unif(0,1) | 0 | 1000 | 0.30 | 0.004 | 0.742 | 0.752 | 0.770 | 0.94 |
| Unif(0,1) | 0 | 1000 | 0.40 | 0.008 | 0.600 | 0.664 | 0.640 | 0.92 |
| Unif(0,1) | 0 | 2000 | 0.25 | −0.015 | 0.620 | 0.628 | 0.664 | 0.92 |
| Unif(0,1) | 0 | 2000 | 0.30 | −0.0085 | 0.531 | 0.521 | 0.539 | 0.94 |
| Unif(0,1) | 0 | 2000 | 0.40 | −0.0027 | 0.430 | 0.461 | 0.431 | 0.94 |
| Unif(0,1) | 0 | 4000 | 0.25 | 0.012 | 0.437 | 0.440 | 0.457 | 0.94 |
| Unif(0,1) | 0 | 4000 | 0.30 | −0.017 | 0.378 | 0.365 | 0.388 | 0.94 |
| Unif(0,1) | 0 | 4000 | 0.40 | 0.003 | 0.305 | 0.324 | 0.310 | 0.94 |
We also examined the bootstrap variance estimator based on 1000 bootstrap samples. Results are included in Tables 1 and 2. We observed that the performance of the bootstrap variance estimator is very similar to that of our proposed estimator based on asymptotic approximations. However, the computation time for the bootstrap variance estimator is about 10 times that of the variance estimator based on the asymptotic approximation, and it is even longer when the event proportion is relatively large.
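A generic bootstrap standard error can be sketched as below. The resampling unit (whole individuals, drawn with replacement) and the function name are assumptions for illustration; the source does not spell out the resampling scheme, and a design-aware bootstrap for case-cohort data may need to resample within sampling strata. Here `estimator` is any function that maps a list of individual records to a number.

```python
import random
import statistics

def bootstrap_se(records, estimator, n_boot, rng):
    """Standard deviation of the estimator over bootstrap resamples."""
    n = len(records)
    replicates = []
    for _ in range(n_boot):
        # Resample individuals with replacement, keeping each record intact
        # so that all events of one individual stay together.
        resample = [records[rng.randrange(n)] for _ in range(n)]
        replicates.append(estimator(resample))
    return statistics.stdev(replicates)

# Toy example: bootstrap SE of a sample mean.
se = bootstrap_se([1.0, 2.0, 3.0, 4.0, 5.0], statistics.fmean, 200, random.Random(0))
```

Because the estimator is refitted once per resample, the cost grows linearly in the number of bootstrap samples, which is consistent with the longer computation times noted above.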
5. Application to ALRI data
A double-blind, placebo-controlled community trial was conducted in a cohort of 1207 children in northeastern Brazil, who were followed up from December 1990 to December 1991 (Barreto et al, 1994; Amorim and Cai, 2015). The primary aim of the original trial was to study the effect of high doses of vitamin A on diarrhea and acute lower respiratory tract infections (ALRI). The age of the children at baseline ranged from 6 to 48 months. They were randomly assigned to vitamin A supplement or placebo. For the purpose of our analysis, 1190 children were eligible: sixteen subjects had missing information on one of the variables of interest and one child was shifted from placebo to vitamin A. Information on respiratory rates was collected 3 times a week, with a recall period of 48 to 72 hours. An episode of ALRI was defined as cough plus a respiratory rate of 50 breaths per min or higher for children under 12 months of age, and 40 breaths per min or higher for older children (Barreto et al, 1994; Amorim and Cai, 2015). At these visits, if cough was reported for the child, the respiratory rate was measured twice. A new episode of ALRI was defined if there was an interval of 14 or more days since the previous episode (Barreto et al, 1994). Censoring occurred when children were lost to follow-up or the study reached its end. The number of children who had at least one event was 185, with an event proportion of 15.37%.
We constructed a case-cohort sample based on this cohort study to illustrate our proposed method. We consider the indicator variable for children ever having had measles to be the expensive variable, which is available only for the case-cohort sample. We are interested in studying the effect of past occurrence of measles on ALRI. We set the probability of subcohort selection to 0.2. The total sample size for the case-cohort data is 376, with 238 in the subcohort. The following covariates were considered in the analysis: treatment group (vitamin A vs placebo), child's gender (male vs female), age at baseline (dichotomized based on whether the child is older than 12 months or not), an indicator for the presence of a toilet in the child's house (which is considered a proxy for hygienic habits), and the indicator for children experiencing measles in their lifetime (based on the information provided by their parents). For the analysis, we considered placebo, female gender, child's age > 12 months, no toilet at home, and never experiencing measles as the reference groups. Table 3 summarizes the distribution of the baseline variables in the subcohort and the full cohort.
Table 3.
Baseline Characteristics of the Acute Lower Respiratory Tract Infections study
| Variables | Subcohort | Full Cohort (n = 1190) |
|---|---|---|
| Treatment (Vit. A: 1 vs. Placebo: 0) | 0.4790 | 0.5017 |
| Gender (Boys: 1, Girls: 0) | 0.5462 | 0.5244 |
| Age (≤ 1 yr: 1, > 1 yr: 0) | 0.1597 | 0.1311 |
| Toilet at home (Yes: 1) | 0.7479 | 0.7361 |
Table 3 shows that the distributions of the variables in the subcohort are very similar to those in the full cohort. We applied our proposed method to the case-cohort sample.
Table 4 provides results from the model adjusting for covariates. Dichotomized age and presence of a toilet at home are significant predictors of the recurrence of ALRI among young children, adjusting for the other variables in the model. High doses of vitamin A, gender, and the prior measles indicator are not significantly associated with recurrence of ALRI. Controlling for all other variables, the rate of ALRI among children in households with a toilet is exp(−0.9033) ≈ 0.405 times that among children living in households without a toilet, that is, about 59.5% lower. Similarly, the rate of ALRI recurrence among children aged 12 months or younger is exp(1.6859) ≈ 5.397 times that among children older than 12 months.
Table 4.
Estimates and standard errors for the multiplicative rates model with data from the case-cohort sample from the ALRI study
| Effects | Case-cohort (proposed method): Estimate | Case-cohort (proposed method): SE | Full cohort (mean/rates method): Estimate | Full cohort (mean/rates method): SE |
|---|---|---|---|---|
| Treatment (Ref: Placebo) | 0.0534 | 0.1995 | −0.0262 | 0.1552 |
| Gender (Ref: Female) | −0.0056 | 0.1965 | 0.1239 | 0.1580 |
| Age (Ref: > 12 months) | 1.6859 | 0.3102 | 1.6766 | 0.1577 |
| Toilet at home (Ref: Absence) | −0.9033 | 0.1759 | −0.6874 | 0.1605 |
| Measles Indicator (Ref: Never) | 0.0853 | 0.3395 | 0.0429 | 0.2716 |
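The coefficients in Table 4 are on the log rate-ratio scale, so exponentiating them gives multiplicative effects on the ALRI rate. The short sketch below reproduces the rate ratios quoted in the text from the case-cohort estimates; the Wald statistics and 95% intervals use a normal approximation with critical value 1.96, which is an assumption of this sketch rather than a documented detail of the analysis.

```python
import math

# (estimate, standard error) pairs copied from the case-cohort columns of Table 4.
table4_case_cohort = {
    "treatment": (0.0534, 0.1995),
    "gender": (-0.0056, 0.1965),
    "age_le_12_months": (1.6859, 0.3102),
    "toilet_at_home": (-0.9033, 0.1759),
    "measles": (0.0853, 0.3395),
}

def rate_ratio_summary(estimate, std_error, z_crit=1.96):
    """Rate ratio, Wald confidence limits and Wald z statistic."""
    return {
        "rate_ratio": math.exp(estimate),
        "ci_lower": math.exp(estimate - z_crit * std_error),
        "ci_upper": math.exp(estimate + z_crit * std_error),
        "wald_z": estimate / std_error,
    }

summaries = {
    name: rate_ratio_summary(est, se)
    for name, (est, se) in table4_case_cohort.items()
}

for name, s in summaries.items():
    print(f"{name}: RR = {s['rate_ratio']:.3f}, z = {s['wald_z']:.2f}")
```

Only the age and toilet effects have Wald statistics beyond ±1.96 in absolute value, matching the conclusions stated for Table 4.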
We have also included the full cohort results in Table 4 for comparison. We note that, although the standard errors (SE) are larger for the case-cohort analysis, as expected, the conclusions from the case-cohort analysis are similar to those from the full-cohort analysis: dichotomized age and presence of a toilet at home are significant predictors of the recurrence of ALRI among young children, while high doses of vitamin A, gender and the prior measles indicator are not. Based on the final model with the two significant predictors, age (> 12 months vs younger) and presence of a toilet at home, we provide the estimated cumulative rate (or mean) function with 95% confidence intervals in Figure 1. From the plot, we can see that the younger children who lived in homes without a toilet have a higher cumulative rate of ALRI than those living in households with a toilet. The same can be observed in the group of older children (> 12 months). The graphs also show that children who were older than 12 months were at a much lower risk of having ALRI than children who were 12 months old or younger.
Fig. 1.

Mean Function Estimates and Bootstrapped 95% CI for Acute Lower Respiratory Tract Infection in Children in Brazil
6. Final Remarks
This article proposes methods for fitting marginal multiplicative rates models under both the original case-cohort and the generalized case-cohort designs with time-varying weights. The proposed estimators are natural generalizations of the full cohort estimators and have an easy interpretation. The proposed estimators are consistent and asymptotically normally distributed. They perform well in finite samples.
In our approach, we do not use all the covariate information that is available for the entire cohort. Developing a more general method that takes advantage of this covariate information to improve the efficiency of the estimators is worthy of future research. We note that we have considered an event proportion (individuals with at least one event) of around 25% to 50% as a relatively common event. When the event proportion is greater than 50%, there is usually no need to consider case-cohort sampling because a simple random sample of the entire cohort should include sufficient event data to achieve the desired power.
Further investigation is needed for fitting additive rates models to recurrent events under such sampling schemes. Multiplicative rates models are more commonly used, but in many studies investigators may be interested in the rate difference instead of the relative rate. In these situations, one would fit the additive rates model, where the effect of the covariates on the rate function is additive. Further, in our paper, we have considered a single type of recurrent event. Interest may lie in the recurrence of multiple event types in the same study; for example, one may be interested in the recurrence of hospitalizations due to different reasons. Additional research is needed for analyzing recurrent events of different types in such circumstances.
Acknowledgements
The authors thank the Editor, Associate Editor and reviewers for their helpful comments and suggestions that have improved the paper.
7. Appendix
Regularity Conditions : We assume the following regularity conditions:
∀i = 1, 2, …, n are independent and identically distributed.
P(Y(τ) > 0) > 0 and Ni(τ) (∀i = 1, 2, …, n) are bounded by a constant
almost surely for some constant Cz
The matrix is positive definite.
(Finite Interval)
- (Asymptotic Stability) There exists a neighborhood of β0 that satisfies the following:
- There exists functions s(0)(β, t), s(1)(β, t) and s(2)(β, t) defined on such that
- There exists a matrix Q(β) such that
-
(Asymptotic Regularity) For all , t ∈ [0, τ ], , where s(d)(β, t) are continuous functions of , uniformly in t ∈ [0, τ ] and bounded on and s(0)(β, t) is bounded away from zero on .
The following conditions pertain to asymptotic convergence under the case−cohort sampling design.
- (Non−Trivial Subcohort and Cases) ,, as n → ∞, where nc is the number of individuals in the cohort who experienced at least one event.
- (Asymptotic Normality of Subcohort Averages at β0) For ϵ > 0
- (Asymptotic Normality of Samples) As n → ∞
- (Asymptotic Stability) As n → ∞ we have the following
- There exists a positive definite matrix, VI(β0), such that
where .
- There is a positive definite matrix, VII(β0), such that where and .
Lemma A1 Let φ = (φ1, φ2, . . . , φn) be a random vector containing n∗ ones and n ‒ n∗ zeroes, with each permutation equally likely. Let Bi(t) be independent and identically distributed real−valued processes on [0, τ] with E(Bi(t)) = µB(t), var(Bi(0)) < ∞ and var(Bi(t)) < ∞. Assume B(t) = (B1(t), B2(t), . . . , Bn(t)) and φ are independent and all paths of Bi(t) have finite variation. Then converges weakly in l∞[0, τ] to a zero−mean Gaussian process and uniformly in t. This lemma was stated in Kang and Cai (2009a).
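The sampling vector φ in Lemma A1 can be illustrated by a small simulation. The sketch below is an assumption-laden illustration rather than part of the proof: it evaluates the processes at a single fixed time point, draws Bi from a unit-mean exponential distribution chosen arbitrarily, and assumes that the uniformly vanishing quantity has the form n^{-1} Σ φi{Bi(t) − µB(t)}. The function names are hypothetical.

```python
import random


def sampling_indicators(n, n_star, rng):
    """Return a list with n_star ones and n - n_star zeros.

    Shuffling makes every arrangement of the ones equally likely,
    which mimics simple random sampling without replacement.
    """
    phi = [1] * n_star + [0] * (n - n_star)
    rng.shuffle(phi)
    return phi


def centered_sampled_mean(phi, b, mu_b):
    """Compute n^{-1} * sum_i phi_i * (B_i - mu_B) at one fixed time point.

    The form of this statistic is an assumption made for illustration.
    """
    n = len(phi)
    return sum(p * (x - mu_b) for p, x in zip(phi, b)) / n


rng = random.Random(2024)          # fixed seed for reproducibility
n, n_star = 20000, 4000            # hypothetical cohort and sample sizes
phi = sampling_indicators(n, n_star, rng)
b = [rng.expovariate(1.0) for _ in range(n)]   # i.i.d. with mean 1
stat = centered_sampled_mean(phi, b, mu_b=1.0)
```

Because φ is drawn independently of the Bi, the centred average `stat` should be close to zero for large n, in line with the uniform convergence stated in the lemma.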
Lemma A2 Let Wn(t) and Gn(t) be two sequences of bounded processes. For some constant τ , assume that the following conditions hold.
for some bounded process, W(t),
Wn(t) is monotone on [0, τ] and
Gn(t) converges to a zero−mean process with continuous sample paths.
Then
This was stated in Kang and Cai (2009b).
First we examine the asymptotic properties of the time−varying sampling weights, i.e., and . Using the Taylor series expansion, the Glivenko−Cantelli theorem, Lemma A2 and Slutsky’s theorem, we have
| (7) |
| (8) |
Outline of Proof of Theorem 1 Let us define . Based on arguments similar to those in Foutz (1977), the consistency of can be shown by proving the following: (i) exists and is continuous in an open neighborhood of β0 in , (ii) is negative definite with probability tending to 1 as n → ∞, (iii) uniformly for β in a neighborhood of β0, (iv) .
Taking derivative of the expression, we obtain
where . Define . Note that, , we have . (i) is satisfied by the continuity of in a neighborhood of β0. Now, we need to show that converges to A(β) in probability. By condition (vi), it can be shown that for all d = 0, 1, 2. Using the Kolmogorov−Centsov theorem (Karatzas and Shreve, 1988), we can show that converges to a tight zero−mean Gaussian process with continuous sample paths (Lin et al, 2000; Van Der Vaart and Wellner, 1996, Example 2.11.16, p. 215). Finally, using Lemma A1, the asymptotic distributions of and (equations (7) and (8)), the bounded variation of , and the uniform convergence of to s(d)(β, t), we conclude that converges to zero. Hence, we have
Note that, by regularity condition (iv), A(β0) is positive definite and hence (ii) is also satisfied. For (iv), we work with n^{1/2}Un(β). If we can show that n^{1/2}Un(β) converges to a zero−mean Gaussian process, then n → ∞.
| (9) |
The first term converges to a zero−mean Gaussian process with variance (Lin et al, 2000). Using Lemma A2, the third term of equation (9) converges to zero in probability by the uniform consistency of . From Lemma A1, we have that converges to zero in probability uniformly in t. S(0)(β, t) and are bounded away from zero and converges to zero in probability by Slutsky’s theorem.
Hence, the fourth term of equation (9) converges to zero. The second term can be split into two terms,
and
The variance of the first part is and that of the second part is , where VI(β0) and VII(β0) are defined in Theorem 1. The covariance between these two terms is zero. Similarly, we can show that the covariances between and the above terms are zero. Therefore, n^{1/2}Un(β) converges to a zero−mean Gaussian process. Hence, (iv) holds. Based on (i), (ii), (iii) and (iv) and a Taylor series expansion of around β0, the consistency of is established. The asymptotic distribution of also follows from the Taylor series expansion of .
The different parts of the variance term can be estimated by the following:
| (10) |
Outline of Proof of Theorem 2 Next, we examine the distribution of the estimate of the mean function, .
| (11) |
Noting that and have bounded variations, is bounded away from zero, and converges to a zero−mean Gaussian process with continuous sample paths, we can show that the first term converges to 0 as n → ∞ by using Lemma A1. Similarly, using to be a bounded process, Lemmas A1 and A2, converges to E((1 − Δi)dMi(u)) in probability and , uniformly in u, we have that the second term of (11) converges to zero in probability. The third term of (11) can be written as , with , by the consistency of , the uniform consistency of S(d)(β, t) and , and the boundedness of μ0(t). The fourth term of equation (11) can be written as . Using (7), (8), Lemma A1, and the uniform consistency of and S(d)(β, t), the last term in (11) is asymptotically equivalent to the sum of and . Using the asymptotic expansion of , we have . The variances of the three parts are given by E(νi(β0, t)νi(β0, s)), , and respectively, where νi(β0, t), ψi(β0, t) and ζi(β0, t) are defined in Theorem 2. The covariances between the terms are zero, and hence the result follows.
The estimates of the variance components are given by:
where , , , are defined as in (10).
Contributor Information
Poulami Maitra, Department of Biostatistics, University of North Carolina at Chapel Hill, poulamim@live.unc.edu.
Leila DAF Amorim, Department of Statistics, Institute of Mathematics, Federal University of Bahia, Brazil, leila.d.amorim1@gmail.com.
Jianwen Cai, Department of Biostatistics, University of North Carolina at Chapel Hill, cai@bios.unc.edu.
References
- Amorim LD, Cai J (2015) Modelling recurrent events: a tutorial for analysis in epidemiology. International Journal of Epidemiology pp 324–333
- Andersen PK, Gill RD (1982) Cox’s regression model for counting processes: a large sample study. The Annals of Statistics 10(4):1110–1120
- Barlow WE (1994) Robust variance estimation for the case−cohort design. Biometrics 50(4):1064–1072
- Barreto ML, Farenzena GG, Fiaccone RL, Santos LMP, Assis AMO, Araujo MPN, Santos PAB (1994) Effect of vitamin A supplementation on diarrhoea and acute lower−respiratory−tract infections in young children in Brazil. Lancet 344:228–231
- Borgan O, Goldstein L, Langholz B (1995) Methods for the analysis of sampled cohort data in the Cox proportional hazards model. The Annals of Statistics 23(5):1749–1778
- Borgan O, Langholz B, Samuelson S, Goldstein L, Pogoda J (2000) Exposure stratified case−cohort designs. Lifetime Data Analysis 6:39–58
- Breslow NE, Wellner JA (2006) Weighted likelihood for semiparametric models and two−phase stratified samples, with application to Cox regression. Scandinavian Journal of Statistics 34:86–102
- Chen F, Chen K (2014) Case−cohort analysis of clusters of recurrent events. Lifetime Data Analysis 20:1–15
- Chen K (2001) Generalized case−cohort sampling. Journal of the Royal Statistical Society (Series B) 63(4):791–809
- Cook RJ, Lawless JF, Lakhal−Chaieb L, Lee KA (2009) Robust estimation of mean functions and treatment effects for recurrent events under event−dependent censoring and termination: Application to skeletal complications in cancer metastatic to bone. Journal of the American Statistical Association 104:60–75
- Huang Y, Chen YQ (2003) Marginal regression of gaps between recurrent events. Lifetime Data Analysis 9:293–303
- Jahn−Eimermacher A, Ingel K, Ozga AK, Preussler S, Binder H (2015) Simulating recurrent event data with hazard functions defined on a total time scale. BMC Medical Research Methodology 15(1):16–24
- Kalbfleisch JD, Lawless JF (1988) Likelihood analysis of multi−state models for disease incidence and mortality. Statistics in Medicine 7:149–160
- Kalbfleisch JD, Prentice RL (2002) The Statistical Analysis of Failure Time Data, 2nd Ed. John Wiley, New York
- Kang S, Cai J (2009a) Marginal hazards model for case−cohort studies with multiple disease outcomes. Biometrika 96(4):887–901
- Kang S, Cai J (2009b) Marginal hazards regression for retrospective studies within cohort with possibly correlated failure time data. Biometrics 65:405–414
- Karatzas I, Shreve SE (1988) Brownian Motion and Stochastic Calculus. Springer, New York
- Kulich M, Lin D (2004) Improving the efficiency of relative−risk estimation in case−cohort studies. Journal of the American Statistical Association 99(467):832–844
- Lawless JF, Nadeau C (1995) Some simple robust methods for the analysis of recurrent events. Technometrics 37(2):158–168
- Lin DY, Ying Z (1993) Cox regression with incomplete covariate measurements. Journal of the American Statistical Association 88(424):1341–1349
- Lin DY, Wei LJ, Yang I, Ying Z (2000) Semiparametric regression for the mean and rate functions of the recurrent events. Journal of the Royal Statistical Society (Series B) 62(4):711–730
- Lu SE, Shih JH (2006) Case−cohort designs and analysis for clustered failure time data. Biometrics 62:1138–1148
- Pepe MS, Cai J (1993) Some graphical displays and marginal regression analyses for recurrent failure times and time dependent covariates. Journal of the American Statistical Association 88(423):811–820
- Prentice RL (1986) A case−cohort design for epidemiologic cohort studies and disease prevention trials. Biometrika 73(1):1–11
- Schaubel DE, Cai J (2004) Regression methods for gap time hazard functions of sequentially ordered multivariate failure time data. Biometrika 91(2):291–303
- Schaubel DE, Zeng D, Cai J (2006) A semiparametric additive rates model for recurrent event data. Lifetime Data Analysis 12:389–406
- Self SG, Prentice RL (1988) Asymptotic distribution theory and efficiency results for case−cohort studies. The Annals of Statistics 16(1):64–81
- Therneau TM, Hamilton SA (1997) rhDNase as an example of recurrent event analysis. Statistics in Medicine 16:2029–2047
- Van Der Vaart AW, Wellner JA (1996) Weak Convergence and Empirical Processes. Springer, New York
- Wacholder S, Gail MH, Pee D, Brookmeyer R (1989) Alternative variance and efficiency calculations for the case−cohort design. Biometrika 76(1):117–123
- Zhang H, Schaubel D, Kalbfleisch JD (2011) Proportional hazards regression for the analysis of clustered survival data from case−cohort studies. Biometrics 67(1):18–28
