Summary
The Nun Study, a longitudinal study to examine risk factors for the progression of dementia, consists of subjects who were already diagnosed with dementia (i.e., prevalent cohort) and those who do not have dementia (i.e., incident cohort) at study enrollment. When assessing the risk factors’ effects on the survival time from dementia diagnosis until death, utilizing data from both cohorts supports more efficient statistical inference because the two cohorts provide valuable complementary information. A major challenge in analyzing the combined cohort data is that the prevalent cases are not representative of the target population. Moreover, the dates of dementia diagnosis are not ascertained for the prevalent cohort in the Nun Study. Hence, the survival time for the prevalent cohort is only partially observed from study enrollment until death or censoring, with the time from dementia diagnosis to study enrollment missing. In this paper, we propose an efficient estimation method that uses both incident and prevalent cohorts under the proportional mean residual life model. By assuming proportionality of the mean residual life time with covariates in the incident cohort, we can utilize the natural relationship between the mean residual life function and the hazard function of the survival time measured from enrollment until death for the prevalent cohort. We evaluate the efficiency gain from using the combined cohort data through simulations and demonstrate that the proposed method is valid and efficient.
Keywords: Combined cohort data, incident cohort, Nun Study, prevalent cohort, proportional hazards model, proportional mean residual life model
1 | INTRODUCTION
Prospective observational studies are commonly used to identify and evaluate risk factors that are associated with disease-specific survival. Such studies occasionally include both incident and prevalent cohorts. For example, the Nun Study of Aging and Alzheimer’s Disease (Nun Study),1 which motivates this work, involves an incident cohort of subjects who have not experienced dementia onset and are followed over time to monitor the potential diagnosis of dementia and death; and the prevalent cohort of subjects who already have dementia but have not experienced death at the time of study entry. The two cohorts provide valuable complementary information: the incident cohort is a random sample from the target population; and the prevalent cohort includes more deaths since subjects are sampled in the midst of dementia. Thus, analyzing the combined data from both cohorts yields more efficient statistical results. However, statistical analysis using the combined data has received less attention in the literature.
The data from the Nun Study consist of 501 subjects after excluding 177 participants who had missing key covariates (22) or withdrew consent (155). Among the participants represented in the data, 77 (about 15%) already had dementia and 424 were not yet diagnosed with dementia at study entry; these participants comprise the prevalent and incident cohorts, respectively. During the prospective follow-up, 153 subjects in the incident cohort were diagnosed with dementia. The dates of dementia diagnosis were not available for the 77 subjects in the prevalent cohort. The combined cohort data are illustrated in Figure S1 of the web-based supplementary materials. In the statistical literature, the Nun Study data have been used primarily to illustrate Markov transition models,2,3,4,5,6 in analyses that excluded the data from the prevalent cohort. We aim to take advantage of data from both the prevalent and incident cohorts for more efficient evaluation of the relationship between the risk factors and the survival time after diagnosis of dementia. In addition to the challenge of properly adjusting for sampling bias, a major issue when analyzing the combined data from the Nun Study is that the dates of dementia diagnosis for the prevalent cases were not ascertained. Thus, for the prevalent cohort, we observe only the time from study enrollment to death (referred to as the “forward recurrence time”), while the time from diagnosis of dementia to study enrollment (referred to as the “backward recurrence time”) is missing.
We consider the proportional mean residual life (PMRL) model 7 to assess the effect of risk factors on the residual survival time. By assuming proportionality of the mean residual life time with covariates, we can utilize the natural relationship between the mean residual life function and the hazard function of the forward recurrence time to analyze the combined cohort data. In Section 2, we introduce notations to depict the combined cohort data and present the connection between the PMRL model and the proportional hazards (PH) model. We review existing estimation methods for data from the incident cohort only and the prevalent cohort only, and propose efficient estimating equations for the combined cohorts in Section 3. The asymptotic properties are also established in this section. We investigate finite sample properties through simulation studies under various settings in Section 4. In Section 5, we use the proposed method to analyze the Nun Study data. We provide some remarks in Section 6.
2 | NOTATIONS AND MODEL
We consider data from both the incident and prevalent cohorts with respective sample sizes of n1 and n2. For the incident cohort, we denote by T0 the duration from disease diagnosis to death and by X a p × 1 vector of time-independent covariates. Let C be the duration from disease diagnosis to a censoring event. Then, the observed data from the incident cohort consist of independent and identically distributed (i.i.d.) {(Ti, Δi, Xi), i = 1, … , n1}, where $T_i = \min(T_i^0, C_i)$ and $\Delta_i = I(T_i^0 \le C_i)$. We assume that the censoring time C is conditionally independent of T0 given covariates X. We note that the incident cohort is representative of the target population. For prevalent cases, the dates of dementia diagnosis, which occurred prior to enrollment, are unknown. Thus, only partial information on survival times, measured from study enrollment, is available. We introduce additional notations to represent the event times observed from the prevalent cohort. Let V0 and a p × 1 vector Xv denote the duration from enrollment to death and the time-independent covariates for the prevalent cohort, respectively. Unlike the censoring time C for the incident cohort, the censoring time Cv is measured from enrollment until a censoring event. The observed prevalent cohort data are i.i.d. $\{(V_j, \Delta_j^v, X_j^v), j = 1, \ldots, n_2\}$, where $V_j = \min(V_j^0, C_j^v)$ and $\Delta_j^v = I(V_j^0 \le C_j^v)$. The censoring time for the prevalent cohort, Cv, is assumed to be conditionally independent of V0 given covariates Xv. Based on research about dementia,8,9 it is reasonable to assume that the incidence of dementia follows a stationary Poisson process. Under such an assumption, the prevalent cohort is subject to length-biased sampling.
The mean residual life function for the underlying survival time T0 at time t can be defined as m(t | X) = E(T0−t | T0 > t, X). To assess the covariate effects on the mean residual time, we assume the PMRL model 7 as
$$m(t \mid X) = m_0(t)\exp(\beta^{\top} X), \tag{1}$$
where m0(t) is the unspecified positive baseline mean residual life function and β is a p × 1 vector of coefficients. We may use existing methods to fit the model to data from the incident cohort only. 10,11 However, model (1) cannot be fitted directly to the observed data from the prevalent cohort because the survival times are length biased and the backward recurrence times are missing. Under length-biased sampling, it is shown that the conditional density function of the forward recurrence time V0 given covariates X is

$$f_V(t \mid X) = \frac{S(t \mid X)}{m(0 \mid X)},$$

where S(· | X) is the conditional survival function of T0 and m(0 | X) is the mean survival time of T0 given X. 12 It follows that the hazard function of the forward recurrence time is

$$\lambda_V(t \mid X) = \frac{S(t \mid X)}{\int_t^{\tau} S(u \mid X)\,du} = \frac{1}{m(t \mid X)},$$

where τ is the finite upper bound that satisfies Pr(T > τ) > 0. Therefore, as discussed by Maguluri and Zhang,13 Chen and Cheng,10 and Chen et al.,11 the PMRL model for T0 implies the following PH model for the forward recurrence time V0:

$$\lambda_V(t \mid X) = \lambda_{V,0}(t)\exp(-\beta^{\top} X), \tag{2}$$

where $\lambda_{V,0}(t) = 1/m_0(t)$ is the positive unspecified baseline hazard function of the forward recurrence time.
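
The step from the length-biased density to the PH model can be written out in a few lines. The display below is a sketch of that step under the definitions in this section; the symbols $f_V$, $\lambda_V$, and $\lambda_{V,0}$ for the density, hazard, and baseline hazard of the forward recurrence time are notation assumed here for illustration.

```latex
\begin{align*}
\lambda_V(t \mid X)
  &= \frac{f_V(t \mid X)}{\int_t^{\tau} f_V(u \mid X)\,du}
   = \frac{S(t \mid X)/m(0 \mid X)}{\int_t^{\tau} S(u \mid X)\,du \,/\, m(0 \mid X)} \\
  &= \frac{S(t \mid X)}{\int_t^{\tau} S(u \mid X)\,du}
   = \frac{1}{m(t \mid X)}
   = \frac{1}{m_0(t)}\exp(-\beta^{\top} X)
   = \lambda_{V,0}(t)\exp(-\beta^{\top} X).
\end{align*}
```

The equality with $1/m(t \mid X)$ uses $m(t \mid X) = E(T^0 - t \mid T^0 > t, X) = \int_t^{\tau} S(u \mid X)\,du / S(t \mid X)$. The sign flip is the practical point: a covariate that lengthens the mean residual life (positive β) lowers the hazard of the forward recurrence time by the same log-scale amount.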
3 | ESTIMATION METHODS
3.1 | Estimation for Incident Cohort
For data from an incident cohort, Maguluri and Zhang 13 proposed an estimation method under the PMRL model when censoring was absent. Chen et al. 11 extended the method to accommodate right censoring using the inverse probability of censoring weighted (IPCW) approach. The IPCW estimating equation assumes that censoring is independent of the covariates. While the assumption can be relaxed to tackle a censoring distribution that is dependent on the covariates, as discussed in the paper, the censoring mechanism needs to be modelled. An alternative semiparametric estimation procedure was developed based on the counting process theory by Chen and Cheng. 10 We briefly review their method in this section.
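Based on the definition of m(t | X) and using an inversion formula, we can derive the conditional survival function of T0 given X,

$$S(t \mid X) = \frac{m(0 \mid X)}{m(t \mid X)}\exp\left\{-\int_0^t \frac{du}{m(u \mid X)}\right\}.$$

Under model (1), it follows that

$$m_0(t)\,d\Lambda_i(t) = \exp(-\beta^{\top} X_i)\,dt + dm_0(t), \tag{3}$$

where Λi(t) is the cumulative hazard function of $T_i^0$. Let Ni(t) = I(Ti ≤ t)Δi and Yi(t) = I(Ti ≥ t). Define

$$M_i(t; \beta, m_0) = N_i(t) - \int_0^t Y_i(u)\,d\Lambda_i(u; \beta, m_0), \tag{4}$$

where $d\Lambda_i(t; \beta, m_0) = m_0(t)^{-1}\{\exp(-\beta^{\top} X_i)\,dt + dm_0(t)\}$ for i = 1, …, n1. Expression (4) is a zero-mean martingale when β = β* and $m_0 = m_0^*$, where β* and $m_0^*$ are the true parameter and the true baseline mean function, respectively. Based on equation (3) and expression (4), the following estimating equations are constructed to estimate $m_0$ and β,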
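$$\sum_{i=1}^{n_1}\left[m_0(t)\,dN_i(t) - Y_i(t)\{\exp(-\beta^{\top} X_i)\,dt + dm_0(t)\}\right] = 0, \quad 0 \le t \le \tau, \tag{5}$$

$$\sum_{i=1}^{n_1}\int_0^{\tau} X_i\left[m_0(t)\,dN_i(t) - Y_i(t)\{\exp(-\beta^{\top} X_i)\,dt + dm_0(t)\}\right] = 0. \tag{6}$$

A closed form solution is available for m0(·) from equation (5),

$$\hat m_0(t; \beta) = \hat\Phi(t)^{-1}\int_t^{\tau}\hat\Phi(u)\,\hat B(u; \beta)\,du,$$

where $\hat\Phi(t) = \exp\{-\int_0^t \sum_{i=1}^{n_1} dN_i(u) / \sum_{i=1}^{n_1} Y_i(u)\}$ and $\hat B(t; \beta) = \sum_{i=1}^{n_1} Y_i(t)\exp(-\beta^{\top} X_i) / \sum_{i=1}^{n_1} Y_i(t)$. After replacing $m_0$ with $\hat m_0(\cdot; \beta)$ in equation (6), we have the estimating function for β

$$U_I(\beta) = \sum_{i=1}^{n_1}\int_0^{\tau}\{X_i - \bar X(t)\}\left[\hat m_0(t; \beta)\,dN_i(t) - Y_i(t)\exp(-\beta^{\top} X_i)\,dt\right], \tag{7}$$

where $\bar X(t) = \sum_{i=1}^{n_1} Y_i(t) X_i / \sum_{i=1}^{n_1} Y_i(t)$. The estimator $\hat\beta_I$ can be obtained from the solution to $U_I(\beta) = 0$. Chen and Cheng10 showed that $n_1^{1/2}(\hat\beta_I - \beta^*)$ converges weakly to a normal distribution with mean zero and covariance matrix $A_I^{-1}\Sigma_I A_I^{-1}$ under the regularity conditions (C1)–(C5) listed in Appendix A.1. We define matrices AI and ƩI in Appendix A.2. The covariance matrix can be consistently estimated by $\hat A_I^{-1}\hat\Sigma_I\hat A_I^{-1}$, where $\hat A_I$ and $\hat\Sigma_I$ are the empirical counterparts of AI and ƩI evaluated at $\hat\beta_I$; throughout, $a^{\otimes 2} = aa^{\top}$ for any vector a.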
3.2 | Estimation for Prevalent Cohort
As discussed, data arising from prevalent sampling are subject to length bias, which hinders one from applying the method proposed for the incident cohort. Under the PMRL model, Bai et al. 14 proposed a semiparametric method for right-censored length-biased data, adopting the IPCW approach. That method properly addressed the induced dependent censoring issue and sampling bias, which are commonly encountered in length-biased data with right censoring. However, that method is not directly applicable to our motivating data because the survival times are not available due to missing backward recurrence times. Due to the special relationship between the PMRL and the PH models shown in equation (2), it is sufficient to estimate the covariate effects using only the observed forward recurrence times from the prevalent cohort. Note that we are estimating the same regression coefficient β for the target population under model (1) with the prevalent cohort data as with the incident cohort data. This approach has been studied for right-censored length-biased data by Chan et al. 15 for cross-sectional sampled data with no follow-up or data with no information on the disease diagnosis time. The prevalent cohort data in our study belong to the latter case.
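For prevalent subject j with observed time $V_j$, death indicator $\Delta_j^v$, and covariates $X_j^v$, denote $N_j^v(t) = I(V_j \le t)\Delta_j^v$ and $Y_j^v(t) = I(V_j \ge t)$. Define $S^{(k)}(t; \beta) = n_2^{-1}\sum_{j=1}^{n_2} Y_j^v(t)\,(X_j^v)^{\otimes k}\exp(-\beta^{\top} X_j^v)$ for k = 0, 1, and 2, where $a^{\otimes 0} = 1$, $a^{\otimes 1} = a$, and $a^{\otimes 2} = aa^{\top}$ for any vector a. Based on the relationship shown in equation (2), we can estimate the regression parameter β by adopting the partial likelihood score function,

$$U_P(\beta) = \sum_{j=1}^{n_2}\int_0^{\tau}\{X_j^v - \bar X^v(t; \beta)\}\,dN_j^v(t), \tag{8}$$

where $\bar X^v(t; \beta) = S^{(1)}(t; \beta)/S^{(0)}(t; \beta)$. The solution to $U_P(\beta) = 0$ is the estimator $\hat\beta_P$. Under the regularity conditions (C1)–(C4) and (C6) listed in Appendix A.1, the distribution of $n_2^{1/2}(\hat\beta_P - \beta^*)$ converges to a normal distribution with mean zero and covariance matrix $A_P^{-1}\Sigma_P A_P^{-1}$, where AP and ƩP are defined in Appendix A.3. We can consistently estimate the covariance matrix by $\hat A_P^{-1}\hat\Sigma_P\hat A_P^{-1}$, where $\hat A_P$ and $\hat\Sigma_P$ are the empirical counterparts of AP and ƩP evaluated at $\hat\beta_P$.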
Note that the estimating function (8) is equivalent to the score function for conventional survival data under the PH model, except for the unknown regression coefficients being negative of β. Thus, we can implement the estimation method using readily available software.
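
To make the "readily available software" remark concrete, the following is a minimal NumPy sketch of the idea: fit an ordinary PH partial likelihood to the forward recurrence times and flip the sign of the coefficients. The function name `fit_beta_from_forward_times` is hypothetical, the Newton-Raphson loop is a plain textbook implementation (Breslow form for ties) rather than the authors' code, and the returned covariance is the model-based inverse information, which is an assumption and may differ from the variance estimator used in the paper.

```python
import numpy as np

def fit_beta_from_forward_times(time, event, X, max_iter=50, tol=1e-8):
    """Fit the PH model lambda(t | X) = lambda0(t) exp(gamma'X) to forward
    recurrence times by Newton-Raphson and return beta = -gamma together
    with a model-based covariance matrix (inverse observed information)."""
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    X = np.asarray(X, dtype=float)
    gamma = np.zeros(X.shape[1])
    for _ in range(max_iter):
        score = np.zeros_like(gamma)
        info = np.zeros((gamma.size, gamma.size))
        for j in np.flatnonzero(event):
            at_risk = time >= time[j]              # risk set at the j-th death time
            Xr = X[at_risk]
            w = np.exp(Xr @ gamma)
            xbar = (w @ Xr) / w.sum()              # weighted covariate mean S1/S0
            centered = Xr - xbar
            score += X[j] - xbar
            info += (centered * w[:, None]).T @ centered / w.sum()
        step = np.linalg.solve(info, score)
        gamma = gamma + step
        if np.max(np.abs(step)) < tol:
            break
    # info was evaluated just before the last (negligible) Newton step
    cov = np.linalg.inv(info)
    # The PMRL coefficient is the negative of the PH coefficient; the
    # covariance matrix is unchanged by the sign flip.
    return -gamma, cov
```

In practice one would call an established Cox regression routine instead of this loop; the only method-specific step is negating the fitted coefficients, since the PH model in equation (2) is parameterized by −β.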
3.3 | Estimation Using the Combined Cohorts
Although the data arising from the two cohorts have distinct data structures with different time variables, they are from the same target population. Thus, we may use the combined cohort data to make inference for the target cohort under model (1) regarding survival times. To improve statistical efficiency, we propose an estimation method that combines the two weighted estimating functions using data from the incident and prevalent cohorts. We consider a class of weighted linear combinations of the estimating functions (7) and (8):
$$U_W(\beta) = W_1 U_I(\beta) + W_2 U_P(\beta), \tag{9}$$

where $U_I$ and $U_P$ denote the estimating functions (7) and (8), and W1 and W2 are p × p weight matrices. We combine the estimating equations derived from each cohort instead of using the weighted average of the two estimators, $\hat\beta_I$ and $\hat\beta_P$, to avoid imposing a restrictive condition that the optimal estimator is a linear combination of the two estimators. Note that the total sample size increases to n = n1 + n2 by combining the data from the two cohorts. We can obtain a class of estimators $\hat\beta_W$ by solving $U_W(\beta) = 0$.
Among the class of estimators , we derive the estimator with the smallest asymptotic variance by finding the optimal W = (W1, W2). Let . Based on the large sample properties of the estimators and , the asymptotic covariance matrix of is
By the matrix Cauchy–Schwarz inequality,16 for any W,
We can attain the efficiency bound when the weight matrices and , which are the optimal weights. Since the optimal weights depend on the unknown parameter β, we proceed to a two-step estimation. We first derive an estimator that is consistent with β* by solving with W1 = W2 = Ip×p, where Ip×p is the identity matrix, to obtain the first-step estimator . Then, the efficient estimator is the solution to
where and . The asymptotic properties of are summarized in the following theorem.
Theorem 1. Under the regularity conditions listed in Appendix A.1, converges weakly to a normal distribution with mean zero and covariance matrix .
The detailed proofs of Theorem 1 are provided in Appendix A.4. The covariance matrix can be consistently estimated ,
where .
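
The two-step procedure can be illustrated on a deliberately simple scalar analogue. The sketch below is not the paper's estimating functions: it combines two independent samples that share a common mean β, using U_k(β) = Σ(y − β) for each sample. It assumes the standard optimal-weight form W_k = A_k Σ_k^{-1} for independent estimating functions, which here reduces to inverse-variance weights because A_k = 1. The function name `two_step_combined_mean` is hypothetical.

```python
import numpy as np

def two_step_combined_mean(y1, y2):
    """Toy two-step weighted combination of U_1(b) = sum(y1 - b) and
    U_2(b) = sum(y2 - b), two estimating functions for a common mean."""
    y1 = np.asarray(y1, dtype=float)
    y2 = np.asarray(y2, dtype=float)

    # Step 1: identity weights, i.e. solve U_1(b) + U_2(b) = 0 (pooled mean).
    beta_first = (y1.sum() + y2.sum()) / (y1.size + y2.size)

    # Plug the first-step estimate into the variance of each estimating
    # function. With A_k = 1, the assumed optimal weight A_k / Sigma_k is
    # the reciprocal of the per-observation variance.
    sigma1 = np.mean((y1 - beta_first) ** 2)
    sigma2 = np.mean((y2 - beta_first) ** 2)
    w1, w2 = 1.0 / sigma1, 1.0 / sigma2

    # Step 2: solve w1 * U_1(b) + w2 * U_2(b) = 0 in closed form.
    beta_eff = (w1 * y1.sum() + w2 * y2.sum()) / (w1 * y1.size + w2 * y2.size)
    return beta_first, beta_eff
```

The second-step estimate moves toward the sample whose estimating function is less variable, which is the same mechanism by which the optimally weighted combined-cohort estimator gains efficiency over identity weights.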
4 | SIMULATION STUDY
We conducted simulation studies to investigate the finite sample properties of the proposed estimation method for the combined cohort data. We simulated 1000 datasets that consist of n1 subjects from the incident cohort and n2 subjects from the prevalent cohort. Total sample sizes of n = n1 + n2 = 200 and 400 were considered with various combinations. We considered two covariates: X1 from a Bernoulli distribution with probability 0.5 and X2 from a uniform distribution (0, 1) for both cohorts. Conditioning on X1 and X2, the survival time T0 was generated from the same target population under the mean residual life model m(t | X1, X2) = (at+b) exp(β1X1 +β2X2), where the parameters for the baseline mean function are (a, b) = (0.1, 0.5) and the true coefficients are (β1, β2) = (0.5, −0.5). For the incident cohort, we randomly generated n1 observations, $(T_i^0, X_{1i}, X_{2i})$, i = 1, … , n1. For the prevalent cohort, we generated the left truncation time A from a uniform distribution and only kept observations that satisfy T0 > A. We continued the sampling procedure until we sampled n2 observations, j = 1, … , n2, where the forward recurrence time is $V_j^0 = T_j^0 - A_j$ for subject j with $T_j^0 > A_j$. Since both cohorts are subject to right censoring, we generated censoring times C and Cv from a uniform distribution (0, τC ) and chose τC to yield overall censoring rates of 15% and 30%. Under this setting, the censoring rate of each cohort is about the same. The distributions of C and Cv share the same support because the follow-up periods for both cohorts are the same in practice. The generated dataset consists of the observed times, censoring indicators, and covariates from both cohorts.
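
The data-generating scheme above can be sketched as follows. For a linear mean residual life m(t | X) = At + B, with A = a·exp(β'X) and B = b·exp(β'X), the inversion formula gives S(t | X) = {B/(At + B)}^{1 + 1/A}, so T0 can be drawn by inverse transform. The function names, the truncation upper bound `xi`, and the censoring bound `tau_c` are assumptions made for illustration; in particular, `tau_c` is not calibrated to the 15% or 30% censoring rates used in the study.

```python
import numpy as np

def draw_covariates(n, rng):
    """X1 ~ Bernoulli(0.5), X2 ~ Uniform(0, 1)."""
    return np.column_stack([rng.binomial(1, 0.5, n), rng.uniform(0.0, 1.0, n)])

def draw_survival(X, beta, a, b, rng):
    """Inverse-transform draw of T0 when m(t | X) = (a t + b) exp(beta'X)."""
    scale = np.exp(X @ beta)
    A, B = a * scale, b * scale
    u = 1.0 - rng.uniform(size=X.shape[0])        # uniform on (0, 1]
    return (B / A) * (u ** (-A / (A + 1.0)) - 1.0)

def simulate_combined(n1, n2, beta, a=0.1, b=0.5, tau_c=5.0, xi=20.0, seed=0):
    """Return (incident, prevalent) dictionaries of observed data.
    tau_c and xi are illustrative values, not those of the original study."""
    rng = np.random.default_rng(seed)

    # Incident cohort: survival time from diagnosis, uniformly censored.
    X = draw_covariates(n1, rng)
    T0 = draw_survival(X, beta, a, b, rng)
    C = rng.uniform(0.0, tau_c, n1)
    incident = {"time": np.minimum(T0, C), "event": T0 <= C, "X": X}

    # Prevalent cohort: keep only subjects with T0 > A (left truncation),
    # then record the forward recurrence time V0 = T0 - A.
    X_parts, V_parts, kept = [], [], 0
    while kept < n2:
        Xb = draw_covariates(10 * n2, rng)
        Tb = draw_survival(Xb, beta, a, b, rng)
        Ab = rng.uniform(0.0, xi, Tb.size)
        keep = Tb > Ab
        X_parts.append(Xb[keep])
        V_parts.append((Tb - Ab)[keep])
        kept += int(keep.sum())
    Xv = np.vstack(X_parts)[:n2]
    V0 = np.concatenate(V_parts)[:n2]
    Cv = rng.uniform(0.0, tau_c, n2)
    prevalent = {"time": np.minimum(V0, Cv), "event": V0 <= Cv, "X": Xv}
    return incident, prevalent
```

A quick sanity check on the generator is that, with all covariates set to zero, the sample mean of T0 should be close to m0(0) = b = 0.5.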
We denote by $\hat\beta_I$ the estimator using the simulated incident cohort data only, by $\hat\beta_P$ the estimator using the simulated prevalent cohort data only, and by $\hat\beta_C$ and $\hat\beta_E$ the proposed estimators using data from both cohorts with identity weight matrices and the optimal weights, respectively. Tables 1 and 2 summarize the simulation results. When the overall censoring rate is as low as 15%, all estimators present virtually unbiased point estimates, the asymptotic standard errors are close to the empirical standard deviations of the point estimates, and the coverage probabilities are close to the nominal level of 95%. We note that the relative efficiency of the estimators $\hat\beta_I$ and $\hat\beta_P$ depends strongly on the number of samples in each cohort. When there are more samples and hence more failure events in the incident cohort than in the prevalent cohort (i.e., n1 > n2), $\hat\beta_I$ has smaller variance, which indicates that it is more efficient than $\hat\beta_P$, and vice versa. When the proposed method is used for the combined cohort data, we have an increased sample size of n1 + n2. Thus, we observe smaller variance estimates for $\hat\beta_C$ and $\hat\beta_E$ compared to $\hat\beta_I$ and $\hat\beta_P$ under all settings. To assess the efficiency gain of the proposed estimators over $\hat\beta_I$ and $\hat\beta_P$, we compute the relative efficiency, which is defined as the ratio of the mean squared errors of the estimators. For example, when n1 = 100, n2 = 100, and the censoring rate is 15%, $\hat\beta_C$ for β1 is 1.86 and 1.93 times more efficient than $\hat\beta_I$ and $\hat\beta_P$, respectively; and $\hat\beta_E$ is respectively 2.08 and 2.15 times more efficient. The proposed estimator with optimal weights, $\hat\beta_E$, is at least as efficient as $\hat\beta_C$ across all settings. While the point estimates for $\hat\beta_E$ tend to be slightly more biased than those for $\hat\beta_C$ due to the two-step estimation procedure, the mean squared errors of $\hat\beta_E$ are no larger than those of $\hat\beta_C$ in every setting.
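
The relative-efficiency figures quoted above can be reproduced directly from the MSE column of Table 1. The short check below uses the β1 values for n1 = n2 = 100 with 15% censoring; the dictionary keys are labels chosen here for readability.

```python
# MSE of the beta_1 estimates, n1 = n2 = 100, 15% censoring (Table 1)
mse_beta1 = {
    "incident": 0.054,
    "prevalent": 0.056,
    "combined_identity": 0.029,
    "combined_optimal": 0.026,
}

def relative_efficiency(mse_reference, mse_proposed):
    """Ratio of mean squared errors; values above 1 favor the proposed estimator."""
    return mse_reference / mse_proposed

ratios = {
    (ref, new): round(relative_efficiency(mse_beta1[ref], mse_beta1[new]), 2)
    for ref in ("incident", "prevalent")
    for new in ("combined_identity", "combined_optimal")
}

for (ref, new), value in ratios.items():
    print(f"{new} vs {ref}: {value}")
```

Because the tabulated MSEs are rounded to three decimals, ratios computed this way agree with the quoted values only to about two decimals.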
TABLE 1. Simulation results for total sample size n = 200 (true values β1 = 0.5, β2 = −0.5).
cr, overall censoring rate; Est, mean point estimate; SD, empirical standard deviation; SE, asymptotic standard error; MSE, mean squared error; CP, coverage probability of the 95% confidence interval.
n1 | n2 | cr | Estimator | β1 Est | β1 SD | β1 SE | β1 MSE | β1 CP | β2 Est | β2 SD | β2 SE | β2 MSE | β2 CP
---|---|---|---|---|---|---|---|---|---|---|---|---|---
125 | 75 | 15% | Incident only | 0.481 | 0.208 | 0.206 | 0.043 | 0.943 | −0.489 | 0.347 | 0.359 | 0.120 | 0.950
125 | 75 | 15% | Prevalent only | 0.526 | 0.276 | 0.274 | 0.077 | 0.948 | −0.502 | 0.492 | 0.468 | 0.242 | 0.950
125 | 75 | 15% | Combined, identity weights | 0.502 | 0.171 | 0.172 | 0.029 | 0.954 | −0.484 | 0.302 | 0.297 | 0.091 | 0.952
125 | 75 | 15% | Combined, optimal weights | 0.487 | 0.162 | 0.164 | 0.026 | 0.950 | −0.493 | 0.279 | 0.282 | 0.078 | 0.947
125 | 75 | 30% | Incident only | 0.422 | 0.199 | 0.198 | 0.046 | 0.934 | −0.429 | 0.336 | 0.346 | 0.118 | 0.936
125 | 75 | 30% | Prevalent only | 0.519 | 0.306 | 0.300 | 0.094 | 0.951 | −0.499 | 0.547 | 0.517 | 0.299 | 0.940
125 | 75 | 30% | Combined, identity weights | 0.471 | 0.186 | 0.181 | 0.035 | 0.948 | −0.450 | 0.322 | 0.314 | 0.106 | 0.934
125 | 75 | 30% | Combined, optimal weights | 0.443 | 0.164 | 0.165 | 0.030 | 0.940 | −0.456 | 0.281 | 0.285 | 0.081 | 0.934
100 | 100 | 15% | Incident only | 0.476 | 0.232 | 0.229 | 0.054 | 0.945 | −0.470 | 0.390 | 0.401 | 0.153 | 0.950
100 | 100 | 15% | Prevalent only | 0.513 | 0.236 | 0.235 | 0.056 | 0.951 | −0.524 | 0.401 | 0.399 | 0.161 | 0.938
100 | 100 | 15% | Combined, identity weights | 0.498 | 0.169 | 0.171 | 0.029 | 0.955 | −0.499 | 0.286 | 0.294 | 0.081 | 0.951
100 | 100 | 15% | Combined, optimal weights | 0.486 | 0.161 | 0.163 | 0.026 | 0.950 | −0.497 | 0.273 | 0.280 | 0.074 | 0.944
100 | 100 | 30% | Incident only | 0.417 | 0.222 | 0.222 | 0.056 | 0.927 | −0.416 | 0.381 | 0.387 | 0.152 | 0.938
100 | 100 | 30% | Prevalent only | 0.509 | 0.256 | 0.258 | 0.066 | 0.951 | −0.539 | 0.446 | 0.442 | 0.200 | 0.952
100 | 100 | 30% | Combined, identity weights | 0.476 | 0.182 | 0.182 | 0.034 | 0.942 | −0.486 | 0.307 | 0.314 | 0.095 | 0.958
100 | 100 | 30% | Combined, optimal weights | 0.450 | 0.165 | 0.167 | 0.030 | 0.942 | −0.473 | 0.284 | 0.288 | 0.081 | 0.953
75 | 125 | 15% | Incident only | 0.473 | 0.263 | 0.263 | 0.070 | 0.943 | −0.469 | 0.463 | 0.459 | 0.215 | 0.933
75 | 125 | 15% | Prevalent only | 0.512 | 0.217 | 0.209 | 0.047 | 0.939 | −0.514 | 0.365 | 0.353 | 0.134 | 0.950
75 | 125 | 15% | Combined, identity weights | 0.501 | 0.172 | 0.170 | 0.030 | 0.940 | −0.495 | 0.297 | 0.289 | 0.088 | 0.947
75 | 125 | 15% | Combined, optimal weights | 0.489 | 0.163 | 0.163 | 0.027 | 0.946 | −0.492 | 0.286 | 0.277 | 0.082 | 0.943
75 | 125 | 30% | Incident only | 0.421 | 0.252 | 0.254 | 0.069 | 0.942 | −0.415 | 0.452 | 0.443 | 0.212 | 0.940
75 | 125 | 30% | Prevalent only | 0.515 | 0.234 | 0.227 | 0.055 | 0.946 | −0.512 | 0.407 | 0.388 | 0.166 | 0.945
75 | 125 | 30% | Combined, identity weights | 0.491 | 0.185 | 0.181 | 0.034 | 0.944 | −0.478 | 0.326 | 0.311 | 0.107 | 0.953
75 | 125 | 30% | Combined, optimal weights | 0.466 | 0.167 | 0.168 | 0.029 | 0.949 | −0.469 | 0.307 | 0.289 | 0.095 | 0.944
TABLE 2. Simulation results for total sample size n = 400 (true values β1 = 0.5, β2 = −0.5).
cr, overall censoring rate; Est, mean point estimate; SD, empirical standard deviation; SE, asymptotic standard error; MSE, mean squared error; CP, coverage probability of the 95% confidence interval.
n1 | n2 | cr | Estimator | β1 Est | β1 SD | β1 SE | β1 MSE | β1 CP | β2 Est | β2 SD | β2 SE | β2 MSE | β2 CP
---|---|---|---|---|---|---|---|---|---|---|---|---|---
250 | 150 | 15% | Incident only | 0.479 | 0.142 | 0.146 | 0.021 | 0.946 | −0.472 | 0.252 | 0.254 | 0.064 | 0.957
250 | 150 | 15% | Prevalent only | 0.514 | 0.184 | 0.190 | 0.034 | 0.970 | −0.529 | 0.334 | 0.322 | 0.113 | 0.949
250 | 150 | 15% | Combined, identity weights | 0.496 | 0.116 | 0.120 | 0.013 | 0.954 | −0.496 | 0.206 | 0.207 | 0.042 | 0.954
250 | 150 | 15% | Combined, optimal weights | 0.488 | 0.111 | 0.115 | 0.012 | 0.955 | −0.493 | 0.196 | 0.199 | 0.038 | 0.960
250 | 150 | 30% | Incident only | 0.424 | 0.138 | 0.141 | 0.025 | 0.916 | −0.422 | 0.243 | 0.246 | 0.065 | 0.941
250 | 150 | 30% | Prevalent only | 0.509 | 0.204 | 0.208 | 0.042 | 0.962 | −0.534 | 0.371 | 0.357 | 0.138 | 0.941
250 | 150 | 30% | Combined, identity weights | 0.467 | 0.124 | 0.126 | 0.017 | 0.947 | −0.473 | 0.217 | 0.219 | 0.048 | 0.957
250 | 150 | 30% | Combined, optimal weights | 0.446 | 0.113 | 0.116 | 0.016 | 0.927 | −0.460 | 0.198 | 0.201 | 0.041 | 0.953
200 | 200 | 15% | Incident only | 0.475 | 0.162 | 0.163 | 0.027 | 0.945 | −0.469 | 0.280 | 0.284 | 0.079 | 0.947
200 | 200 | 15% | Prevalent only | 0.516 | 0.172 | 0.164 | 0.030 | 0.931 | −0.508 | 0.272 | 0.278 | 0.074 | 0.955
200 | 200 | 15% | Combined, identity weights | 0.501 | 0.121 | 0.120 | 0.015 | 0.940 | −0.490 | 0.199 | 0.206 | 0.040 | 0.961
200 | 200 | 15% | Combined, optimal weights | 0.490 | 0.115 | 0.115 | 0.013 | 0.951 | −0.488 | 0.191 | 0.198 | 0.037 | 0.956
200 | 200 | 30% | Incident only | 0.420 | 0.157 | 0.157 | 0.031 | 0.914 | −0.420 | 0.273 | 0.275 | 0.081 | 0.941
200 | 200 | 30% | Prevalent only | 0.516 | 0.188 | 0.179 | 0.035 | 0.938 | −0.517 | 0.309 | 0.307 | 0.095 | 0.952
200 | 200 | 30% | Combined, identity weights | 0.481 | 0.130 | 0.128 | 0.017 | 0.929 | −0.475 | 0.217 | 0.221 | 0.048 | 0.958
200 | 200 | 30% | Combined, optimal weights | 0.456 | 0.116 | 0.118 | 0.015 | 0.928 | −0.464 | 0.202 | 0.204 | 0.042 | 0.943
150 | 250 | 15% | Incident only | 0.478 | 0.192 | 0.188 | 0.037 | 0.944 | −0.453 | 0.330 | 0.328 | 0.111 | 0.939
150 | 250 | 15% | Prevalent only | 0.508 | 0.144 | 0.145 | 0.021 | 0.963 | −0.512 | 0.245 | 0.246 | 0.060 | 0.946
150 | 250 | 15% | Combined, identity weights | 0.500 | 0.119 | 0.119 | 0.014 | 0.949 | −0.495 | 0.200 | 0.203 | 0.040 | 0.950
150 | 250 | 15% | Combined, optimal weights | 0.493 | 0.116 | 0.115 | 0.014 | 0.943 | −0.491 | 0.194 | 0.196 | 0.038 | 0.949
150 | 250 | 30% | Incident only | 0.422 | 0.187 | 0.182 | 0.041 | 0.918 | −0.410 | 0.323 | 0.318 | 0.112 | 0.933
150 | 250 | 30% | Prevalent only | 0.504 | 0.156 | 0.158 | 0.024 | 0.959 | −0.512 | 0.272 | 0.270 | 0.074 | 0.952
150 | 250 | 30% | Combined, identity weights | 0.483 | 0.125 | 0.127 | 0.016 | 0.955 | −0.482 | 0.215 | 0.218 | 0.047 | 0.950
150 | 250 | 30% | Combined, optimal weights | 0.465 | 0.118 | 0.119 | 0.015 | 0.937 | −0.470 | 0.203 | 0.205 | 0.042 | 0.959
With an increased censoring rate of 30%, we find some bias for the incident-only estimator, where only the incident cohort data are used. A similar trend was observed in the original simulation studies on this estimator conducted by Chen and Cheng. 10 In the simulation results under a censoring rate of 30%, we observe that the two combined-cohort estimators (with identity and optimal weights) are less biased and more efficient than the incident-only estimator. Therefore, combining information from the prevalent cohort data with that from the incident cohort data is desirable, especially under heavy censoring.
5 | APPLICATION
The Nun Study, introduced in Section 1, has been conducted to examine risk factors for the progression of dementia, with a cohort of 678 members of the School Sisters of Notre Dame religious congregation who were 75 years of age or older and recruited between 1991 and 1993. 1 Each participant received an assessment of her cognitive and physical function near-annually up to 10 years. At each examination, the participant’s cognitive status was recorded as one of the five following states: cognitively intact for age, cognitive deficit that does not affect activities of daily living, cognitive deficit in one or more activities of daily living, clinical dementia, and death. Covariates such as age at each exam, presence of the apolipoprotein E-e4 allele (APOE4), and the level of education were collected.
To illustrate the proposed estimation method, we use the combined cohort data, which consist of 501 subjects with complete data from the Nun Study. Among them, 153 incident and 77 prevalent cases were used in the analysis. In the data, the exact time of death was recorded if it occurred before the last follow-up. If a subject had not died by the last examination, her survival time was censored. Among the incident cases, 29 (19%) subjects were right censored; and only two (2.6%) were right censored among the prevalent cases. The overall censoring rate was as low as 13.5%. For the incident cohort, the data include the survival time from dementia diagnosis until death or the censoring event. When a subject was assessed as clinically demented at one of the annual examinations, we assumed that dementia occurred at the midpoint between that examination and the preceding one. However, for the prevalent cohort data, we only have the information that the subject was demented prior to enrollment; hence, the backward recurrence time is missing. Instead, we have the forward recurrence times from study enrollment until death or the censoring event for the prevalent cohort. We considered two covariates of interest: the level of education and the presence of the genetic risk factor APOE4. The distributions of the covariates are summarized in Table 3 by cohort and for the combined cohorts.
TABLE 3.
Variable | Incident only (n1 = 153) | Prevalent only (n2 = 77) | Combined cohorts (n = 230)
---|---|---|---
APOE4 | | |
Presence | 39 (25%) | 29 (38%) | 68 (30%)
Absence | 114 (75%) | 48 (62%) | 162 (70%)
EDCAT | | |
College and higher | 134 (87%) | 49 (64%) | 183 (80%)
Others | 19 (13%) | 28 (36%) | 47 (20%)
We conducted regression analyses to estimate the effects of the educational level and APOE4 on the mean residual survival time under the PMRL model (1). The analyses were carried out using the incident cohort only, the prevalent cohort only, and the combined cohort data with optimal weights. In the analysis of the incident cohort data, the support of the censoring distribution is greater than that of the survival distribution, which satisfies the assumption for the method using only the incident cohort. 10 The estimated distributions of the survival time and the censoring time are provided as Figure S2 in the web-based supplementary materials. We present the results in Table 4. Neither covariate was found to be significantly associated with the mean residual survival time, which is consistent with the findings in the literature. Qiu et al. 17 and Helmer et al. 18 showed that educational level was not significantly correlated with the mortality of subjects who had dementia, while a lower level of education was found to be associated with higher risk of dementia in other studies. 19 Mez et al. 20 suggested that the incidence of dementia may mediate the effect of APOE4 on mortality, given that both APOE4 and dementia are high risk factors for decreased survival times among older adults. Thus, among subjects diagnosed with dementia, APOE4 has not been found to be a significant risk factor for death.

TABLE 4.
Covariate | Incident only Est | Incident only SE | Prevalent only Est | Prevalent only SE | Combined cohorts Est | Combined cohorts SE
---|---|---|---|---|---|---
APOE4 (presence = 1, absence = 0) | 0.279 | 0.154 | −0.278 | 0.245 | 0.148 | 0.131
EDCAT (college and higher = 1, others = 0) | −0.231 | 0.197 | −0.470 | 0.253 | −0.287 | 0.155
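
The lack of significance can be checked from the estimates and standard errors in Table 4. The sketch below assumes a two-sided Wald test at the 5% level based on the normal approximation; that choice of test is an assumption made here, since the text does not say how significance was judged.

```python
import math

def wald(est, se):
    """Return the Wald z statistic and its two-sided normal p-value."""
    z = est / se
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p

# (estimate, standard error) pairs copied from Table 4
table4 = {
    ("APOE4", "incident"): (0.279, 0.154),
    ("APOE4", "prevalent"): (-0.278, 0.245),
    ("APOE4", "combined"): (0.148, 0.131),
    ("EDCAT", "incident"): (-0.231, 0.197),
    ("EDCAT", "prevalent"): (-0.470, 0.253),
    ("EDCAT", "combined"): (-0.287, 0.155),
}

results = {key: wald(est, se) for key, (est, se) in table4.items()}

for (covariate, cohort), (z, p) in results.items():
    print(f"{covariate:6s} {cohort:9s} z = {z:6.2f}  p = {p:.3f}")
```

Under this assumption every |z| is below 1.96. The combined-cohort standard errors are the smallest in both rows, which reflects the efficiency gain from pooling the two cohorts even though neither effect reaches significance.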
Under the assumption that the incident and prevalent cohorts are from the same population, we can examine the proportional means assumption by checking the proportional hazards assumption using the prevalent cohort. We confirmed that the assumption is reasonable: the p-values are 0.66 and 0.19 for the presence of APOE4 and the level of education, respectively; and 0.39 for the global test. However, it should be noted that the model diagnostic test may have low power. Another assumption is that data observed in the prevalent cohort are subject to length bias (i.e., the incidence of dementia follows a stationary Poisson process). However, the information about the time from dementia onset to enrollment is missing for the prevalent cohort data, which hinders one from checking this assumption. As an alternative, we can compute the dementia incidence rate using the incident cohort only data, provided that the two cohorts are from the same population. The incidence rate was fairly constant over the follow-up period, with no specific trend. Hence, the stationarity assumption is reasonable for our application.
6 | CONCLUSION
In observational studies, prevalent samples are commonly collected along with the incident cohort from a single study population. Combining data from incident and prevalent cohorts can substantially improve efficiency and ensure robustness of the estimators when assessing the risk factors’ effects on survival times. This is an efficient way of utilizing the data because the combined cohort data are usually available at no additional cost. While statistical methods for the analysis of the combined cohort data would make invaluable contributions to many studies, such methods are limited in the literature.
In this paper, we assume the PMRL model for the target population. One advantage of assuming such a model is that it directly leads to the PH model on the forward recurrence times for the prevalent cohort. Hence, we can use the conventional survival method for the incident cohort under the PMRL model. For the prevalent cohort data, which has a nonstandard structure with missing backward recurrence times, we can use the PH model without additional assumptions or extra effort. Thus, the proposed estimation method involves two estimating functions that are constructed differently for data from the incident and prevalent cohorts.
In the estimating function for the combined data (9), we only use data from the incident cohort to derive the consistent estimator for $m_0(t)$. To estimate the baseline mean function m0(t) more efficiently, one may consider combining the data from the prevalent cohort. Based on equation (2), m0(t) is the inverse of the baseline hazard function of the forward recurrence time, $\lambda_{V,0}(t)$. Hence, a naive approach is to estimate the inverse of $\lambda_{V,0}(t)$ based on the Nelson–Aalen estimator for the cumulative hazard function. However, the estimated baseline cumulative hazard function is nonsmooth and results in a noisy estimator for $\lambda_{V,0}(t)$, as in conventional survival analyses, which leads to an unstable estimation of $m_0(t)$. As an alternative, one may adopt the kernel smoothing method to estimate the baseline hazard function.21 A major drawback of applying the smoothing method is that the choice of bandwidth, which is crucial, involves computationally intensive procedures. Further studies on combining data for more efficient estimation of $m_0(t)$ are of interest.
Subjects were examined periodically in the Nun Study. While the dates of death were accurately recorded, the onset of dementia is only known to occur within time intervals (i.e., interval censored). In our application, we adopted a simple approach by assuming that the event occurred in the middle of the interval since that was not the focus of the current paper. Further research that tackles the issue of interval censoring is certainly warranted.
Supplementary Material
ACKNOWLEDGMENTS
The work was partially supported by the U.S. National Institutes of Health through grants CA193878 and CA016672. The Nun Study data reported in this article were collected from the SMART project (AG386561). The authors also acknowledge the Texas Advanced Computing Center at The University of Texas at Austin for providing HPC resources that contributed to the research results reported within this paper.
APPENDIX
A. LARGE SAMPLE PROPERTIES OF THE ESTIMATORS
A.1. Regularity conditions
(C1) Given any ; and given any .
(C2) The parameter space of β is a compact subset of , and the true parameter value β* is in the interior of the parameter space.
(C3) The true baseline mean function is continuously differentiable on [0, τ].
(C4) A p × 1 vector of covariates X is bounded by some constant, and not contained in a (p − 1)-dimensional hyperplane.
(C5) dt is nonsingular, where µX(t) is the limit of as and .
(C6) dt is positive definite, where is the limit of for k = 0,1, and 2, and is the limit of as .
A.2. Asymptotic properties of the incident-cohort estimator
The asymptotic properties of the estimator have been established in the appendix of Chen and Cheng. 10 Here, we briefly outline the results. Given that converges to almost surely, we have
Thus converges weakly to a normal distribution with mean zero and covariance matrix
Provided that , it is shown that converges in probability to AI, which is defined in (C5). By applying the Taylor series expansion, one can show that . Hence, it follows that converges weakly to a normal distribution with mean zero and covariance matrix .
A.3. Asymptotic properties of
We establish the asymptotic properties of following the large sample studies conducted by Andersen and Gill22 for conventional survival data. Let , where . We can represent the estimating function evaluated at β* as follows:
The distribution of is asymptotically normal with mean zero and covariance matrix
By the Taylor series expansion,. Note that converges in probability to AP, which is defined in (C6). Thus, asymptotically follows a normal distribution with mean zero and covariance matrix .
A.4. Proofs of Theorem 3.1
Given the optimal weights and , we rewrite the estimating function in summations of i.i.d. vectors, as follows.
Based on the asymptotic properties of and , it follows that is asymptotically normal with mean zero and covariance matrix . This is straightforward because the incident and prevalent cohorts are independent.
By the Taylor series expansion of around β*, we have
where is on the line segment between and β*, and
We can easily show that converges in probability to , which is equal to Ʃ. Therefore, is asymptotically normal with mean zero and covariance .
Denote an arbitrarily small neighborhood of β* as ß. Following the arguments in Chen and Cheng,10 because can be extended to any β ∈ ß under the regularity conditions on uniform convergence. Thus, is consistent for β*.
Given the consistency of ÂI(β), ÂP(β), , and , and assuming that ƩI and ƩP are nonsingular, we can show that the estimators of the optimal weights and converge in probability to and , respectively, where is a consistent estimator of β*.
References
- 1. Snowdon DA, Greiner LH, Mortimer JA, Riley KP, Greiner PA, Markesbery WR. Brain infarction and the clinical expression of Alzheimer disease. The Nun Study. JAMA 1997;277:813–817.
- 2. Tyas SL, Salazar JC, Snowdon DA, et al. Transitions to mild cognitive impairments, dementia, and death: findings from the Nun Study. Am J Epidemiol 2007;165:1231–1238.
- 3. Yu L, Tyas SL, Snowdon DA, Kryscio RJ. Effects of ignoring baseline on modeling transitions from intact cognition to dementia. Comput Stat Data Anal 2009;53:3334–3343.
- 4. Yu L, Griffith WS, Tyas SL, Snowdon DA, Kryscio RJ. A nonstationary Markov transition model for computing the relative risk of dementia before death. Stat Med 2010.
- 5. Wei S, Xu L, Kryscio RJ. Markov transition model to dementia with death as a competing event. Comput Stat Data Anal 2014;80:78–88.
- 6. Wei S, Kryscio RJ. Semi-Markov models for interval censored transient cognitive states with back transitions and a competing risk. Stat Methods Med Res 2016;25:2909–2924.
- 7. Oakes D, Dasu T. A note on residual life. Biometrika 1990;77:409–410.
- 8. Addona V, Wolfson DB. A formal test for the stationarity of the incidence rate using data from a prevalent cohort study with follow-up. Lifetime Data Anal 2006;12:267–284.
- 9. Asgharian M, Wolfson DB, Zhang X. Checking stationarity of the incidence rate using prevalent cohort survival data. Stat Med 2006;25:1751–1767.
- 10. Chen YQ, Cheng S. Semiparametric regression analysis of mean residual life with censored survival data. Biometrika 2005;92:19–29.
- 11. Chen YQ, Jewell NP, Lei X, Cheng SC. Semiparametric estimation of proportional mean residual life model in presence of censoring. Biometrics 2005;61:170–178.
- 12. Cox DR. Renewal Theory. London: Methuen; 1962.
- 13. Maguluri G, Zhang CH. Estimation in the mean residual life regression model. J R Stat Soc Series B Stat Method 1994;56:477–489.
- 14. Bai F, Huang J, Zhou Y. Semiparametric inference for the proportional mean residual life model with right-censored length-biased data. Stat Sin 2016;26:1129–1158.
- 15. Chan KCG, Chen YQ, Di CZ. Proportional mean residual life model for right-censored length-biased data. Biometrika 2012;99:995–1000.
- 16. Chaganty NR, Joe H. Efficiency of generalized estimating equations for binary responses. J R Stat Soc Series B Stat Method 2004;66:851–860.
- 17. Qiu C, Bäckman L, Winblad B, Agüero-Torres H, Fratiglioni L. The influence of education on clinically diagnosed dementia incidence and mortality data from the Kungsholmen Project. Arch Neurol 2001;58:2034–2039.
- 18. Helmer C, Joly P, Letenneur L, Commenges D, Dartigues JF. Mortality with dementia: results from a French prospective community-based cohort. Am J Epidemiol 2001;154:642–648.
- 19. Sharp ES, Gatz M. The relationship between education and dementia: an updated systematic review. Alzheimer Dis Assoc Disord 2011;25:289–304.
- 20. Mez J, Marden JR, Mukherjee S, et al. Alzheimer’s disease genetic risk variants beyond APOE ε4 predict mortality. Alzheimers Dement 2017; doi: 10.1016/j.dadm.2017.07.002.
- 21. Wells MT. Nonparametric kernel estimation in counting processes with explanatory variables. Biometrika 1994;81:759–801.
- 22. Andersen PK, Gill RD. Cox’s regression model for counting processes: a large sample study. Ann Stat 1982;10:1100–1120.