Abstract
It is well known that truncated survival data are subject to sampling bias, where the sampling weight depends on the underlying truncation time distribution. Recently, there has been rising interest in developing methods that better exploit the information about the truncation time, and thus the sampling weight function, to obtain more efficient estimation. In this paper, we propose to treat truncation and censoring as “missing data mechanisms” and apply the missing information principle to develop a unified framework for analyzing left-truncated and right-censored data with unspecified or known truncation time distributions. Our framework is structured in a way that is easy to understand and enjoys great flexibility for handling different types of models. Moreover, a new test for checking the independence between the underlying truncation time and survival time is derived along the same lines. The proposed hypothesis testing procedure utilizes all observed data and hence can yield a much higher power than the conditional Kendall’s tau test, which only involves comparable pairs of observations under truncation. Simulation studies with practical sample sizes are conducted to compare the performance of the proposed method with its competitors. The proposed methodologies are applied to a dementia study and a nursing home study for illustration.
Keywords: Kendall’s tau, Inverse probability weighted estimator, Outcome-dependent sampling, Prevalent sampling, Self-consistency algorithm
1. INTRODUCTION
The prevalent cohort design is frequently used to study the natural history of disease processes. A prevalent cohort consists of individuals with disease at the time of enrollment and is followed for the occurrence of failure events of interest. Compared to the incident cohort approach, which follows initially undiseased individuals from disease onset to failure, the prevalent cohort approach enjoys the advantage of being more efficient and relatively easy to assemble through existing disease registries. However, this design is known to be subject to sampling bias, because diseased individuals who died before the recruitment period would not be qualified to enter the cohort. As a result, the sampling scheme favors individuals who survive longer and thus is outcome dependent. Statistically speaking, the survival time in a prevalent cohort study is subject to left truncation, where the truncation time is the duration from disease onset to enrollment. A survival time can be observed if and only if it is longer than the truncation time. In the case of a stable disease, the truncation times in the unbiased disease population are uniformly distributed under the stationarity assumption with respect to the disease incidence; moreover, the survival time in the prevalent cohort has a length-biased distribution where the probability of a survival time being sampled is proportional to its length (Lancaster, 1990, Chapter 3).
Statistical analysis of truncated survival time data is usually based on nonparametric and semiparametric conditional likelihood methods, conditioning on the observed truncation times (Lynden-Bell, 1971; Wang, 1991; Tsai, Jewell and Wang, 1987). As a result, the inference procedures do not require information about the underlying truncation time distribution. When such information is available, however, the conditional likelihood approaches are known to be inefficient (Wang, 1991); this is in contrast with the analysis of right-censored survival data, where knowledge about the independent censoring time distribution is ancillary. For survival data collected in prevalent cohort studies of a stable disease, various authors, including Vardi (1989), Vardi and Zhang (1992), Asgharian, M’Lan and Wolfson (2002), Luo and Tsai (2009), Tsai (2009), Qin et al. (2011), Huang and Qin (2012), and Ning, Qin and Shen (2014), have developed more efficient methods that exploit the properties of uniformly distributed truncation times in the estimation procedure. Readers are referred to Shen, Ning and Qin (2017) for a comprehensive review of recent developments.
In this paper, we present a unified framework for analyzing left-truncated and right-censored data with unspecified or known (but not necessarily uniform) truncation time distributions. The proposed framework is structured in a way that is easy to understand and enjoys great flexibility for handling different types of models. Our idea is to treat truncation and censoring as “missing data mechanisms” and apply the missing information principle to develop efficient estimation and hypothesis testing procedures. The missing information principle provides a general paradigm for statistical inference in missing data problems. Its theoretical foundation was formally established by Orchard and Woodbury (1972), whose idea dates back to Yates (1933) and Bartlett (1937). Later, Dempster, Laird and Rubin (1977) provided an extensive generalization and named the procedure the EM algorithm. Heuristically, one may replace a complete-data estimating function or an unbiased estimator by its conditional expectation given the observed data to obtain unbiased inference. When applied to the score function, the missing information principle reduces to a single iteration of the EM algorithm (Dempster, Laird and Rubin, 1977). It is worthwhile to point out that, under truncation, the number of truncated individuals is unknown, and thus the sample size needs to be imputed via the missing information principle, adding an additional level of complication compared to usual missing data problems.
We begin by deriving score operators from the nonparametric and semiparametric full likelihood functions based on completely observed survival data from a representative sample. The score functions would be unbiased if there were no missing data, that is, if the survival times were neither truncated nor censored. We then apply the missing information principle to the unbiased estimating functions and, based on the resulting imputed estimating equations, derive iterative self-consistency algorithms to obtain the maximum likelihood estimators. Compared to existing likelihood-based methods, a major advantage of our approach is that the proposed algorithm is formulated in terms of the hazard function, making the extension from nonparametric estimation to semiparametric estimation of the Cox model relatively straightforward. Another important feature of our methodology is that, similar to Vardi (1989) and Qin et al. (2011) for survival data under length-biased sampling, the estimated hazard can have positive mass at both censored and uncensored failure time points; this is in contrast with the pseudo partial likelihood-based approaches considered by Luo and Tsai (2009) and Tsai (2009), which only allow jumps at uncensored failure times.
We further demonstrate the use of the missing information principle in hypothesis testing, which has received less attention in the truncated data analysis literature than model estimation. Specifically, we consider testing the association between the survival time and the truncation time in the target population. Note that, instead of employing the conditional Kendall’s tau statistic based on comparable pairs of survival and truncation times in the prevalent cohort (Tsai, 1990), we evaluate the expected difference between the proportions of concordant and discordant pairs, that is, the unconditional Kendall’s tau statistic, in the unbiased population by applying the missing information principle. Extensive simulation studies show that the new testing procedure outperforms the conditional Kendall’s tau test, especially when the proportion of comparable pairs is small.
The rest of the article is organized as follows. We demonstrate the application of the missing information principle to left-truncated and/or right-censored data in the case of one-sample estimation (Section 2) and semiparametric estimation of the Cox model (Section 3). In both cases, we consider the estimation procedure with and without knowledge of the truncation time distribution. In particular, self-consistency algorithms that are guaranteed to yield positive hazard function estimates are proposed to incorporate the information about the truncation time distribution. In Section 4, a nonparametric association test is proposed to illustrate the missing information principle when the truncation time distribution is not specified. In Section 5, simulation studies are conducted to evaluate the performance of the proposed algorithms. In Section 6, two data examples are presented to illustrate the proposed approaches. A discussion in Section 7 concludes the paper.
2. NONPARAMETRIC ESTIMATION
2.1. Nonparametric estimation with complete data
Let T0 denote the survival time in the population of interest. Note that we use superscript 0 for random variables in the target population. Assume that T0 is absolutely continuous and has support on [0, τ], that is, T0 has a probability density function f(t), 0 ≤ t ≤ τ. Denote, respectively, by F(t), S(t), λ(t), and Λ(t), the distribution function, survival function, hazard function, and cumulative hazard function of the survival time T0. Suppose the data {Ti0, i = 1, …, n} from n subjects are independent and identically distributed (i.i.d.) realizations of T0. Following Murphy and van der Vaart (2000), we consider the nonparametric likelihood

Ln(Λ) = ∏i=1^n Λ{Ti0} exp{−Λ(Ti0)},

with Λ{t} the jump size of Λ at t, so that the likelihood depends smoothly on Λ. It is easy to check that the corresponding score operator (Begun et al., 1983) is given by

Σi=1^n ∫0^τ κ(u) {dNi0(u) − I(Ti0 ≥ u) dΛ(u)},

where Ni0(t) = I(Ti0 ≤ t) and κ(u) is any bounded, measurable function. Setting κ(u) = I(u ≤ t) motivates the unbiased estimating equation

Σi=1^n Mi0(t) = 0,

with Mi0(t) = Ni0(t) − ∫0^t I(Ti0 ≥ u) dΛ(u). As a result, solving the complete-data estimating equation for all t ∊ [0, τ] is equivalent to maximizing the nonparametric likelihood function with respect to Λ(t).
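As a quick check, at each observed failure time the complete-data equation is solved by

dΛ̂(u) = Σi=1^n dNi0(u) / Σi=1^n I(Ti0 ≥ u),

which is the Nelson-Aalen estimator based on completely observed survival times.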
2.2. Left-truncated data, with an unspecified truncation time distribution
We now consider the scenario where the observation of the survival time T0 is subject to an independent truncation time A0, that is, the pair of random variables (T0, A0) is observed if and only if T0 ≥ A0. We drop superscript 0 to indicate random variables in the prevalent population. Denote by T and A the survival time and truncation time for individuals in the prevalent population, then (T, A) has the same joint distribution as (T0, A0) | T0 ≥ A0. For simplicity, we assume that the truncation time A0 also has support on [0, τ]. In what follows, we consider nonparametric estimation with an unspecified truncation time distribution.
Let {(Ti, Ai), i = 1, …, n} be i.i.d. copies of (T, A). Following Turnbull’s argument of ghost observations (Turnbull, 1976), conditioning on the truncation time Ai, the observation (Ti, Ai) can be considered the remnant of a group of mi unobserved subjects whose survival times are smaller than Ai. Specifically, let {T̃ij, j = 1, …, mi} be the ghosts corresponding to (Ti, Ai), where T̃ij < Ai and T̃ij is independent of Ti given Ai for all j = 1, …, mi. Note that, given Ai = a, the sample size mi of the group of unobserved subjects follows a negative binomial distribution with parameters 1 and F(a) and thus E(mi | Ai = a) = F(a)/S(a). Moreover, given Ai = a, the density function of T̃ij is f(t)I(t < a)/F(a).
For the ith observed subject, we define the stochastic process

Mi(t) = Ni(t) − ∫0^t I(Ti ≥ u) dΛ(u), with Ni(t) = I(Ti ≤ t).

Similarly, for truncated observations (the ghosts) we define

M̃ij(t) = Ñij(t) − ∫0^t I(T̃ij ≥ u) dΛ(u), with Ñij(t) = I(T̃ij ≤ t).

Then it follows from the unbiasedness of the score operator with complete data that Mi(t) + Σj=1^mi M̃ij(t) has mean zero. Because the M̃ij’s are unobserved, we apply the missing information principle to replace Σj=1^mi M̃ij(t) with its conditional expectation

E{Σj=1^mi M̃ij(t) | Ai, Ti} = {F(Ai)/S(Ai)} E{M̃i1(t) | Ai} = Λ(t ∧ Ai),

where t ∧ Ai = min(t, Ai), to obtain the imputed stochastic process

M̂i(t) = Ni(t) − ∫0^t I(Ai ≤ u ≤ Ti) dΛ(u).   (2.1)

Solving Σi=1^n M̂i(t) = 0 gives dΛ̂(u) = Σi dNi(u)/Σi I(Ai ≤ u ≤ Ti). As expected, the application of the missing information principle to left-truncated data yields the asymptotically efficient nonparametric maximum likelihood estimator (NPMLE), that is, the Lynden-Bell estimator (Lynden-Bell, 1971).
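For concreteness, a minimal numerical sketch of the Lynden-Bell estimator obtained above is given below (the function and variable names are ours):

```python
import numpy as np

def lynden_bell(A, T):
    """NPMLE of the cumulative hazard for left-truncated, uncensored data:
    dLambda(u) = #{failures at u} / #{i : A_i <= u <= T_i}."""
    A, T = np.asarray(A, float), np.asarray(T, float)
    times = np.unique(T)
    deaths = np.array([(T == u).sum() for u in times])
    # risk set at u: subjects already enrolled (A_i <= u) and still surviving (T_i >= u)
    risk = np.array([((A <= u) & (T >= u)).sum() for u in times])
    return times, np.cumsum(deaths / risk)

# toy prevalent-cohort data: keep (A0, T0) pairs with A0 <= T0
rng = np.random.default_rng(0)
A0, T0 = rng.uniform(0, 4, 2000), 2 * np.sqrt(rng.exponential(size=2000))
keep = A0 <= T0
times, Lambda_hat = lynden_bell(A0[keep], T0[keep])
```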
2.3. Left-truncated data, with a known truncation time distribution
In many applications, it is reasonable to assume that the incidence of disease onset follows a specific distribution. As an example, several authors, including Addona and Wolfson (2006) and Huang and Qin (2012), have argued that the incidence of dementia onset in the Canadian Study of Health and Aging, one of the largest epidemiology studies of dementia (McDowell, Hill and Lindsay, 2001), follows a Poisson process; that is, the disease incidence is stable over time. Under this stable disease condition, the underlying truncation time is uniformly distributed and the probability of a survival time being sampled is proportional to its length.
Let H be the known distribution function of the underlying truncation time A0, and define α = P(T0 < A0), the probability that a subject in the target population is truncated. As an example, under the stable disease condition, A0 is uniformly distributed and hence H(t) = t/τ with density h(t) = 1/τ for t ∊ [0, τ]. Applying Turnbull’s argument of ghost observations, the observed data (Ti, Ai) can be viewed as the remnant of a group of mi independent subjects whose survival times fall below their truncation times and are thus not observed. Moreover, each ghost pair has the same joint distribution as (T0, A0) | T0 < A0. Note that, in contrast to the conditioning argument in Section 2.2, the ghosts corresponding to the observation (Ti, Ai) are not constrained to have the same truncation time Ai. The sample size mi follows a negative binomial distribution with parameters 1 and α, and hence E(mi) = α/(1 − α).
Following the spirit of the missing information principle, we propose to replace Σj=1^mi M̃ij(t) in the unbiased estimating function with its expectation, integrating over the given truncation time density function. Specifically, it follows from the result that T̃ij has the density function f(t){1 − H(t)}/α, with α = ∫0^τ {1 − H(u)} dF(u), that

E{Σj=1^mi M̃ij(t)} = (1 − α)−1 [∫0^t {1 − H(u)} dF(u) − ∫0^t ∫u^τ {1 − H(s)} dF(s) dΛ(u)],

and that the imputed stochastic process is

M̂i(t) = Ni(t) − ∫0^t I(Ti ≥ u) dΛ(u) + (1 − α)−1 [∫0^t {1 − H(u)} dF(u) − ∫0^t ∫u^τ {1 − H(s)} dF(s) dΛ(u)].   (2.2)
Solving Σi=1^n M̂i(t) = 0 would yield the nonparametric maximum likelihood estimator of Λ(·) when the underlying truncation time distribution is known. The estimating equation does not have a closed-form solution, so we propose a self-consistency algorithm for deriving the NPMLE.
Define the stochastic processes

dηi(t) = dNi(t) + (1 − α)−1{1 − H(t)} dF(t) and ξi(t) = I(Ti ≥ t) + (1 − α)−1 ∫t^τ {1 − H(s)} dF(s),

so that M̂i(t) = ∫0^t {dηi(u) − ξi(u) dΛ(u)}. We consider the class of distributions with jumps at the observed failure times. The self-consistency algorithm is described below:
Step 0. Set initial values for the jumps of Λ(0)(t) at observed failure times and obtain S(0)(t) = exp{−Λ(0)(t)}.
Step k. For the k-th iteration, evaluate dηi(k)(t) and ξi(k)(t) by replacing F(t) with 1 − S(k−1)(t) = 1 − exp{−Λ(k−1)(t)} in dηi(t) and ξi(t). Update Λ(t) with

dΛ(k)(t) = Σi=1^n dηi(k)(t) / Σi=1^n ξi(k)(t).
Iterate until a convergence criterion is met.
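The iteration admits a compact implementation when Λ is restricted to jumps at the observed failure times. The sketch below follows the reconstructed forms of dηi and ξi displayed above and should be read as an illustration rather than a definitive implementation; the uniform truncation distribution at the end is only an example.

```python
import numpy as np

def npmle_trunc_known_H(T, H, n_iter=1000, tol=1e-10):
    """Self-consistency algorithm of Section 2.3 (left truncation, no
    censoring, known truncation cdf H).  T: observed failure times;
    H: vectorized cdf of the truncation time A0."""
    T = np.asarray(T, float)
    t, d = np.unique(T, return_counts=True)       # jump points, death counts
    n, atrisk = T.size, np.array([(T >= u).sum() for u in t])
    Hbar = 1.0 - H(t)                             # P(A0 > t_j)
    dLam = d / atrisk                             # Step 0: risk-set start
    for _ in range(n_iter):
        S = np.exp(-np.cumsum(dLam))              # S(t_j) = exp{-Lambda(t_j)}
        f = -np.diff(np.r_[1.0, S])               # jumps of F = 1 - S
        alpha = np.sum(Hbar * f)                  # P(T0 < A0)
        ghost = n * Hbar * f / (1.0 - alpha)      # imputed ghost failure counts
        new = (d + ghost) / (atrisk + np.cumsum(ghost[::-1])[::-1])
        if np.max(np.abs(new - dLam)) < tol:
            return t, np.cumsum(new)
        dLam = new
    return t, np.cumsum(dLam)

# stable disease example: A0 uniform on [0, tau], so H(t) = t / tau
tau = 4.0
H_unif = lambda s: np.clip(s / tau, 0.0, 1.0)
```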
Interestingly, when the distribution of the truncation time A0 is known, the construction of the imputed stochastic process does not require A being observed. A closer examination reveals that the imputed process can be re-expressed as the sum of the imputed process (2.1) for left-truncated data with an unspecified truncation time distribution and a correction term, in which the bracketed function is simply the empirical estimate of the survival function minus the conditional survival function of A0 given T0 ≥ A0.
Recognizing that the Ti’s can be viewed as a biased sample from f(t) with sampling weight function H(t), an inverse-probability weighted estimator for Λ(t) (Wang, 1996) can be given by

dΛ̃(t) = Σi=1^n {H(Ti)}−1 dNi(t) / Σi=1^n {H(Ti)}−1 I(Ti ≥ t).

The assigned weight is inversely proportional to the probability of a subject being sampled. As a result, the weighted risk set has the same probability structure as that which would be formed by an incident cohort. In most cases, this simple estimator, though consistent for Λ(t), is not identical to the NPMLE obtained by solving Σi M̂i(t) = 0 and hence is not expected to be fully efficient.
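A short sketch of this weighted estimator, under our reading of the weighted-risk-set construction (uncensored data for simplicity; names are ours):

```python
import numpy as np

def ipw_cum_hazard(T, H):
    """Inverse-probability weighted cumulative hazard: subject i carries
    weight 1/H(T_i), the inverse of its sampling probability."""
    T = np.asarray(T, float)
    w = 1.0 / H(T)            # longer survival => larger H(T_i) => smaller weight
    t = np.unique(T)
    dLam = [w[T == u].sum() / w[T >= u].sum() for u in t]
    return t, np.cumsum(dLam)
```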
2.4. Left-truncated and right-censored data, with an unspecified truncation time distribution
The observation of left-truncated survival times is usually subject to right censoring due to loss to follow-up or end of study. Let C be the censoring time for the residual survival time V = T − A, where C is assumed to be independent of V given A. Hence we observe {(Ai, Yi, Δi), i = 1, …, n}, where Yi = min(Ti, Ai + Ci) and Δi = I(Ti ≤ Ai + Ci). For censored individuals, the values of Ni(t) and I(Ti ≥ t) in M̂i(t) given by (2.1) cannot be determined completely. It can be verified that E{I(Ti ≤ t) | Ai, Yi, Δi} = Ni(t) + (1 − Δi)I(Yi < t){F(t) − F(Yi)}/S(Yi) and E{I(Ti ≥ t) | Ai, Yi, Δi} = I(Yi ≥ t) + (1 − Δi)I(Yi < t)S(t)/S(Yi), with Ni(t) = ΔiI(Yi ≤ t). We apply the missing information principle to M̂i(t) by replacing I(Ti ≤ t) and I(Ti ≥ t) with their conditional expectations to yield the imputed stochastic process

M̂i(t) = Ni(t) + (1 − Δi)I(Yi < t){F(t) − F(Yi)}/S(Yi) − ∫0^t I(Ai ≤ u){I(Yi ≥ u) + (1 − Δi)I(Yi < u)S(u)/S(Yi)} dΛ(u).   (2.3)

It is easy to see that solving Σi M̂i(t) = 0 yields the Nelson-Aalen estimator, that is, the NPMLE, for left-truncated and right-censored data with an unspecified truncation time distribution.
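Concretely, substituting dF(u) = S(u) dΛ(u) shows that the censoring correction terms in the event and risk parts of (2.3) cancel, so that Σi M̂i(t) = 0 reduces to the familiar risk-set form

dΛ̂(u) = Σi=1^n dNi(u) / Σi=1^n I(Ai ≤ u ≤ Yi).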
2.5. Left-truncated and right-censored data, with a known truncation time distribution
When the truncation time A0 follows a known distribution function H, applying the missing information principle to replace Ni(t) and I(Ti ≥ t) in M̂i(t) in (2.2) with their conditional expectations yields

M̂i(t) = ∫0^t {dηi(u) − ξi(u) dΛ(u)},   (2.4)

where now

dηi(t) = dNi(t) + (1 − Δi)I(Yi < t) dF(t)/S(Yi) + (1 − α)−1{1 − H(t)} dF(t)

and

ξi(t) = I(Yi ≥ t) + (1 − Δi)I(Yi < t)S(t)/S(Yi) + (1 − α)−1 ∫t^τ {1 − H(s)} dF(s).

Similar to (2.2), the evaluation of the imputed process (2.4) does not require Ai being observed.
At any time point t* at which no failure event was observed, that is, Σi dNi(t*) = 0, the equality Σi {dηi(t*) − ξi(t*) dΛ{t*}} = 0 can be satisfied by either dΛ{t*} = 0 or a positive jump dΛ{t*} = Σi dηi(t*)/Σi ξi(t*), because dηi(t*) can carry positive mass at censored time points. In other words, the NPMLE may have jumps at censored survival times. This is in contrast to right-censored survival data, where the NPMLE has jumps only at uncensored failure times.
Similar to Section 2.3, a self-consistency algorithm can be derived to solve Σi M̂i(t) = 0. Here, in all iterative steps, the cumulative hazard function Λ(k)(t), k ≥ 0, is allowed to have jumps at all censored and uncensored failure times. Specifically, in the kth iteration, we evaluate

dΛ(k)(t) = Σi=1^n dηi(k)(t) / Σi=1^n ξi(k)(t),   (2.5)

where dηi(k)(t) and ξi(k)(t) are obtained by substituting F(t) with 1 − S(k−1)(t) in dηi(t) and ξi(t).
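The sketch below extends the implementation given after Section 2.3’s algorithm to right censoring, again using our reconstructed dηi and ξi; jumps are placed at all observed time points, and the iteration is started with strictly positive jumps so that mass can accumulate at censored times.

```python
import numpy as np

def npmle_ltrc_known_H(Y, delta, H, n_iter=2000, tol=1e-10):
    """Self-consistency algorithm of Section 2.5 (a sketch).  Y: observed
    times; delta: event indicators; H: vectorized cdf of A0."""
    Y, delta = np.asarray(Y, float), np.asarray(delta, int)
    t = np.unique(Y)                              # grid: all observed time points
    n = Y.size
    idx = np.searchsorted(t, Y)                   # position of each Y_i on the grid
    d = np.bincount(idx, weights=delta.astype(float), minlength=t.size)
    atrisk = np.array([(Y >= u).sum() for u in t])
    Hbar = 1.0 - H(t)
    dLam = (d + 1.0) / atrisk                     # positive start everywhere
    for _ in range(n_iter):
        S = np.exp(-np.cumsum(dLam))
        f = -np.diff(np.r_[1.0, S])               # jumps of F = 1 - S
        alpha = np.sum(Hbar * f)
        ghost = n * Hbar * f / (1.0 - alpha)      # truncation (ghost) terms
        # censoring terms: sum over censored i with Y_i < t_j of 1/S(Y_i)
        cw = np.bincount(idx, weights=(1 - delta) / S[idx], minlength=t.size)
        cum_cw = np.r_[0.0, np.cumsum(cw)][:-1]   # strictly before t_j
        num = d + f * cum_cw + ghost              # sum_i d-eta_i(t_j)
        den = atrisk + S * cum_cw + np.cumsum(ghost[::-1])[::-1]   # sum_i xi_i(t_j)
        new = num / den
        if np.max(np.abs(new - dLam)) < tol:
            return t, np.cumsum(new)
        dLam = new
    return t, np.cumsum(dLam)
```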
Remark 2.1. It is worthwhile to point out that the proposed method can be applied to analyze right-censored survival data under biased sampling, where the sampling weight is proportional to a known function H(t). Luo and Tsai (2009) considered this setting and proposed a pseudo-partial likelihood approach that allows for jumps only at uncensored failure times. Their estimation procedure, however, requires estimation of the censoring time distribution. In contrast, the proposed estimator is derived directly from the full likelihood of complete data and thus is expected to be more efficient.
3. SEMIPARAMETRIC ESTIMATION OF THE COX MODEL
In this section, we apply the missing information principle, along the same lines as in nonparametric estimation, to estimate the Cox proportional hazards model with left-truncated and right-censored data. To proceed, we assume that, given the p-dimensional covariate vector Z0 = z, the conditional hazard function of the survival time T0 in the target population, λ(t | z), follows the proportional hazards model λ(t | z) = λ(t) exp(β′z), where λ(t) is an unspecified baseline hazard function and β is a p × 1 vector of regression parameters.
We begin by deriving the unbiased estimating equations based on complete data. Let the complete data {(Ti0, Zi0), i = 1, …, n} be i.i.d. copies of (T0, Z0). Define the stochastic process Mi0(t, β) = Ni0(t) − ∫0^t I(Ti0 ≥ u) exp(β′Zi0) dΛ(u), with Ni0(t) = I(Ti0 ≤ t). Denote by Λ{t} the jump size of Λ at t. The score operators for β and Λ derived from the semiparametric likelihood

Ln(β, Λ) = ∏i=1^n [Λ{Ti0} exp(β′Zi0)] exp{−Λ(Ti0) exp(β′Zi0)}

are given by Σi=1^n ∫0^τ Zi0 dMi0(u, β) and Σi=1^n ∫0^τ κ(u) dMi0(u, β), with κ an arbitrary bounded, measurable function. Setting κ(u) = I(u ≤ t) and solving the system of estimating equations Σi ∫0^τ Zi0 dMi0(u, β) = 0 and Σi Mi0(t, β) = 0, for all t ∊ [0, τ], yields the semiparametric maximum likelihood estimator.
3.1. Semiparametric estimation with left-truncated data
Under left truncation, we observe (T0, A0, Z0) if and only if T0 ≥ A0, so the observed triplet (T, A, Z) has the same joint distribution as (T0, A0, Z0) | T0 ≥ A0. Let {(Ai, Ti, Zi), i = 1, …, n} be i.i.d. copies of (A, T, Z). We impose the usual independent truncation assumption by assuming that A0 is independent of T0 conditional on Z0.
We first consider the case where the distribution of A0 is left unspecified. Arguing as in Section 2.2, the observation (Ti, Ai, Zi) corresponds to mi unobserved ghosts {(T̃ij, Ai, Zi), j = 1, …, mi}, where T̃ij < Ai and T̃ij is independent of Ti given (Ai, Zi). Conditioning on Ai = a and Zi = z, the sample size mi follows a negative binomial distribution with parameters 1 and F(a | z) = 1 − exp{−Λ(a | z)}, where Λ(a | z) = Λ(a) exp(β′z). Hence we have E(mi | Ai, Zi) = F(Ai | Zi)/S(Ai | Zi).
Define the stochastic processes Mi(t, β) = Ni(t) − ∫0^t I(Ti ≥ u) exp(β′Zi) dΛ(u) for observed data and M̃ij(t, β) = Ñij(t) − ∫0^t I(T̃ij ≥ u) exp(β′Zi) dΛ(u) for ghost observations. Following the unbiasedness of the score operators with complete data, we have E[∫0^τ Zi d{Mi(u, β) + Σj M̃ij(u, β)}] = 0 and E{Mi(t, β) + Σj M̃ij(t, β)} = 0 at the true parameter values. In the spirit of the missing information principle, we replace Σj=1^mi M̃ij(t, β) with its conditional expectation to obtain the imputed stochastic process

M̂i(t, β) = Ni(t) − ∫0^t I(Ai ≤ u ≤ Ti) exp(β′Zi) dΛ(u).

As expected, solving the imputed estimating equations Σi ∫0^τ Zi dM̂i(u, β) = 0 and Σi M̂i(t, β) = 0 for all t ∊ [0, τ] yields the maximum partial likelihood estimator (Wang, Brookmeyer and Jewell, 1993) that is the solution of the partial score equation

Σi=1^n [Zi − Σl I(Al ≤ Ti ≤ Tl) Zl exp(β′Zl) / Σl I(Al ≤ Ti ≤ Tl) exp(β′Zl)] = 0.
Next, we consider the case where A0 has a known distribution function H(t). Integrating over the given truncation time density function, straightforward algebra gives

E{Σj=1^mi M̃ij(t, β) | Zi} = {1 − α(Zi)}−1 [∫0^t {1 − H(u)} dF(u | Zi) − ∫0^t ∫u^τ {1 − H(s)} dF(s | Zi) exp(β′Zi) dΛ(u)],

where α(z) = P(T0 < A0 | Z0 = z) = ∫0^τ {1 − H(u)} dF(u | z). Thus, by replacing Σj M̃ij(t, β) with its expectation in the unbiased stochastic process, we obtain the imputed stochastic process

M̂i(t, β) = Ni(t) − ∫0^t I(Ti ≥ u) exp(β′Zi) dΛ(u) + {1 − α(Zi)}−1 [∫0^t {1 − H(u)} dF(u | Zi) − ∫0^t ∫u^τ {1 − H(s)} dF(s | Zi) exp(β′Zi) dΛ(u)].
Solving Σi ∫0^τ Zi dM̂i(t, β) = 0 and Σi M̂i(t, β) = 0 for all t ∊ [0, τ] would yield the semiparametric maximum likelihood estimator when the distribution of A0 is known.
Define the stochastic processes

dηi(t, β) = dNi(t) + {1 − α(Zi)}−1{1 − H(t)} dF(t | Zi) and ξi(t, β) = I(Ti ≥ t) + {1 − α(Zi)}−1 ∫t^τ {1 − H(s)} dF(s | Zi),

so that M̂i(t, β) = ∫0^t {dηi(u, β) − ξi(u, β) exp(β′Zi) dΛ(u)}. The solution to Σi ∫0^τ Zi dM̂i(t, β) = 0 and Σi M̂i(t, β) = 0 satisfies

Σi=1^n ∫0^τ Zi {dηi(t, β) − ξi(t, β) exp(β′Zi) dΛ(t)} = 0   (3.1)

and

dΛ(t) = Σi=1^n dηi(t, β) / Σi=1^n ξi(t, β) exp(β′Zi).   (3.2)
Based on (3.1) and (3.2), we propose the following iterative algorithm to obtain estimates of β and Λ(t). As before, we consider the family of Λ with jumps only at the unique observed failure times.
Step 0. Set initial values for β(0) and the jumps of Λ(0)(t) at observed failure times. Compute S(0)(t | Zi) = exp{−Λ(0)(t) exp(β(0)′Zi)}.
Step k. For the k-th iteration, evaluate dηi(k)(t) and ξi(k)(t) by replacing F(t | Zi) with 1 − S(k−1)(t | Zi) = 1 − exp{−Λ(k−1)(t) exp(β(k−1)′Zi)} in dηi(t, β) and ξi(t, β). Solve

Σi=1^n ∫0^τ [Zi − Σl ξl(k)(t) Zl exp(β′Zl) / Σl ξl(k)(t) exp(β′Zl)] dηi(k)(t) = 0

for β to obtain β(k), and update Λ(t) with

dΛ(k)(t) = Σi=1^n dηi(k)(t) / Σi=1^n ξi(k)(t) exp(β(k)′Zi).
Iterate until a convergence criterion is met.
3.2. Semiparametric estimation with left-truncated and right-censored data
In the presence of right censoring C in addition to left truncation, the observed data {(Ai, Yi, Δi, Zi), i = 1, …, n} are i.i.d. copies of (A, Y, Δ, Z), where Y = min(T, A + C) and Δ = I(A + C ≥ T). We assume that the censoring time C is independent of (A, T) given Z. As pointed out by one reviewer, one may also assume that C is independent of T given (A, Z). We adopt the former assumption to be consistent with the existing literature.
We now consider the case where the distribution of the underlying truncation time A0 is not specified. Similarly as before, it can be verified that E{I(Ti ≤ t) | Ai, Yi, Δi, Zi} = Ni(t) + (1 − Δi)I(Yi < t){F(t | Zi) − F(Yi | Zi)}/S(Yi | Zi) and E{I(Ti ≥ t) | Ai, Yi, Δi, Zi} = I(Yi ≥ t) + (1 − Δi)I(Yi < t)S(t | Zi)/S(Yi | Zi), with Ni(t) = ΔiI(Yi ≤ t). Applying the missing information principle to M̂i(t, β) by replacing I(Ti ≤ t) and I(Ti ≥ t) with their conditional expectations yields

M̂i(t, β) = Ni(t) + (1 − Δi)I(Yi < t){F(t | Zi) − F(Yi | Zi)}/S(Yi | Zi) − ∫0^t I(Ai ≤ u){I(Yi ≥ u) + (1 − Δi)I(Yi < u)S(u | Zi)/S(Yi | Zi)} exp(β′Zi) dΛ(u).

As expected, solving the system of imputed estimating equations Σi ∫0^τ Zi dM̂i(u, β) = 0 and Σi M̂i(t, β) = 0 for all t ∊ [0, τ] yields the maximum partial likelihood estimator, which is the solution of the partial score equation

Σi=1^n Δi [Zi − Σl I(Al ≤ Yi ≤ Yl) Zl exp(β′Zl) / Σl I(Al ≤ Yi ≤ Yl) exp(β′Zl)] = 0.
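When the truncation time distribution is unspecified, this partial likelihood estimator is what standard software computes; for instance, with the lifelines package (assuming its entry_col interface for late entry; the data below are simulated for illustration):

```python
import numpy as np, pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(1)
n = 400
Z = rng.uniform(size=n)
T0 = rng.exponential(np.exp(-Z))            # Cox model: hazard exp(Z), baseline 1
A0 = rng.uniform(0, 3, n)
keep = A0 <= T0                             # left truncation: observe only A0 <= T0
A, T = A0[keep], T0[keep]
C = rng.uniform(0, 5, keep.sum())           # residual censoring after entry
Y, delta = np.minimum(T, A + C), (T <= A + C).astype(int)

df = pd.DataFrame({"A": A, "Y": Y, "delta": delta, "Z": Z[keep]})
cph = CoxPHFitter()
cph.fit(df, duration_col="Y", event_col="delta", entry_col="A")  # late entry at A_i
print(cph.summary[["coef", "se(coef)"]])
```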
Finally, we consider the case where the distribution of A0 is known to be H. Replacing I(Ti ≤ t) and I(Ti ≥ t) with their conditional expectations in M̂i(t, β) yields

M̂i(t, β) = ∫0^t {dηi(u, β) − ξi(u, β) exp(β′Zi) dΛ(u)},

where now

dηi(t, β) = dNi(t) + (1 − Δi)I(Yi < t) dF(t | Zi)/S(Yi | Zi) + {1 − α(Zi)}−1{1 − H(t)} dF(t | Zi)

and

ξi(t, β) = I(Yi ≥ t) + (1 − Δi)I(Yi < t)S(t | Zi)/S(Yi | Zi) + {1 − α(Zi)}−1 ∫t^τ {1 − H(s)} dF(s | Zi).

Solving the imputed estimating equations Σi ∫0^τ Zi dM̂i(t, β) = 0 and Σi M̂i(t, β) = 0 for all t ∊ [0, τ] gives estimates of β and Λ(t). Because a closed-form solution does not exist, we develop a self-consistency algorithm for model estimation.
Arguing as in Section 2.5, the estimated baseline cumulative hazard function obtained by solving the imputed estimating equations may have jumps at censored survival times. Hence, in all iterative steps, the baseline cumulative hazard function Λ(k)(t), k ≥ 0, is allowed to have jumps at all censored and uncensored failure times. At the kth iteration, we compute dηi(k)(t) and ξi(k)(t) by substituting {β, F(t | z)} in dηi(t, β) and ξi(t, β) with the estimates from the (k − 1)th iteration, and solve the equations along the same lines as those in Step k of the algorithm in Section 3.1, with dηi(t, β) and ξi(t, β) there replaced by their censored-data counterparts above. Interestingly, in the special case where H is the distribution function of a uniform random variable, the proposed self-consistency algorithm converges to the semiparametric MLE described in Qin et al. (2011).
Denote by (β̂, Λ̂) the estimators obtained by the proposed self-consistency algorithm and by (β0, Λ0) the true parameter values. The large-sample properties of (β̂, Λ̂) are summarized in Theorem 3.1, with regularity conditions and the asymptotic distribution given in the Appendix. The proof closely follows Theorems 1 and 2 in Qin et al. (2011) and thus is omitted in this article.

Theorem 3.1. Under the regularity conditions (A1)–(A6), n1/2(β̂ − β0, Λ̂ − Λ0) converges weakly to a zero-mean Gaussian process defined in the Appendix as n → ∞.
Remark 3.1. The self-consistency algorithm described in this section can also be applied to right-censored survival data under biased sampling. For this problem, Tsai (2009) proposed a pseudo-partial likelihood approach to incorporate the knowledge about the sampling weight function H(t) in the estimation procedure. This approach, however, involves estimating the random censoring time distribution and can be inefficient when the censoring proportion is high. Our estimator, on the other hand, naturally accounts for covariate-dependent censoring and is in general more efficient, as it is derived from the full likelihood of complete data.
When the underlying truncation time distribution depends on the covariates and is left unspecified, application of the missing information principle results in the conditional likelihood approach. When the underlying truncation time distribution depends on the covariates and is specified, the self-consistency algorithm can be easily extended to gain efficiency. For example, if the cumulative distribution function of A0 conditional on Z0 = z is H(· | z), we can replace H(·) in dηi(t, β) and ξi(t, β) with H(· | Zi) to estimate β and Λ. This extension, however, is of limited practical interest, since H(· | z) is usually treated as a nuisance function.
4. NONPARAMETRIC ASSOCIATION TEST FOR INDEPENDENT TRUNCATION
The preceding sections illustrate the application of the missing information principle in model estimation. In this section, we consider a nonparametric test for the association between the underlying truncation time and the failure time. Under left truncation, the validity of most statistical methods for left-truncated survival time data requires the assumption of quasi-independence to hold, that is, the failure time and the truncation time are independent in the observable region. In the literature, Kendall’s tau (Kendall and Gibbons, 1990) is a popular nonparametric measure of association between two failure time random variables because of its rank-invariance property. To measure the association between the underlying truncation time A0 and the underlying survival time T0, Kendall’s tau can be defined as

K = E[sgn{(A10 − A20)(T10 − T20)}],

where (A10, T10) and (A20, T20) are independent copies of (A0, T0) and sgn(u) is the sign of u. Clearly, K does not depend on the marginal distributions; moreover, −1 ≤ K ≤ 1 and K = 0 when A0 and T0 are independent. For completely observed data, K can be consistently estimated by

K̂ = {n(n − 1)/2}−1 Σi<j sgn{(Ai0 − Aj0)(Ti0 − Tj0)}.
A pair of subjects (i, j) is said to be concordant if (Ai0 − Aj0)(Ti0 − Tj0) > 0, and discordant if (Ai0 − Aj0)(Ti0 − Tj0) < 0. As pointed out by many authors, including Tsai (1990), Martin and Betensky (2005), and Oakes (2008), Kendall’s tau is not directly applicable to left-truncated data, as the observed data (A, T) are a biased sample of (A0, T0); moreover, association in the observed bivariate random variable (A, T) arises naturally due to the sampling constraint. Failing to account for sampling bias in the construction of test statistics usually leads to incorrect conclusions.
In the absence of right censoring, Tsai (1990) considered the conditional Kendall’s tau

Kc = E[sgn{(A1 − A2)(T1 − T2)} | max(A1, A2) ≤ min(T1, T2)],

where (A1, T1) and (A2, T2) are independent copies of (A, T),
for testing the association between A0 and T0 under left truncation. It is easy to see that independence of A0 and T0 in the observable region {(a,t) : 0 ≤ a ≤ t ≤ τ} implies Kc = 0 (but not vice versa). Estimation of the conditional Kendall’s tau is based on comparable pairs {(Ai, Ti), (Aj, Tj)} that satisfy max(Ai, Aj) ≤ min(Ti,Tj) (Bhattacharya, Chernoff and Yang, 1983), and thus can be very inefficient when the number of comparable pairs is small. Specifically, with a negative correlation between the underlying truncation time and survival time, Ai ≥ Aj implies that Ti is likely to be smaller than Tj. As a result, the comparability condition is likely to be satisfied and the conditional Kendall’s tau is likely to utilize most available information. On the other hand, with a positive correlation, fewer pairs are expected to satisfy the comparability condition, as the condition further requires Ai ≤ Tj when Ai ≥ Aj. In what follows, we consider alternative tests that can better utilize the observed data.
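For reference, the conditional Kendall’s tau statistic restricted to comparable pairs can be computed directly (a simple O(n²) sketch):

```python
import numpy as np

def conditional_kendall_tau(A, T):
    """Tsai's (1990) conditional Kendall's tau: average concordance score
    over pairs with max(A_i, A_j) <= min(T_i, T_j)."""
    A, T = np.asarray(A, float), np.asarray(T, float)
    num, den = 0.0, 0
    for i in range(len(A)):
        for j in range(i + 1, len(A)):
            if max(A[i], A[j]) <= min(T[i], T[j]):   # comparable pair only
                num += np.sign((A[i] - A[j]) * (T[i] - T[j]))
                den += 1
    return num / den if den else np.nan
```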
Instead of employing the conditional Kendall’s tau for testing association, we propose to apply the missing information principle to construct new test statistics. Arguing as before, we begin by deriving the test statistic using complete data from the enrolled individuals and their corresponding (unobserved) ghosts, that is, {(Ai, Ti) together with its associated ghost observations, i = 1, …, n}. If the complete data were observed, the contribution of any pair of subjects (i, j) to the construction of the Kendall’s tau statistic is given by uij, the sum of the concordance-discordance scores over all (mi + 1)(mj + 1) pairs formed by taking one member (observed or ghost) from group i and one from group j; we write uij = uij(0) + uij(1) + uij(2), grouping the scores according to whether the pair involves zero, one, or two ghosts. Under the null hypothesis that A0 and T0 are independent, it is easy to see that uij has mean zero. Thus a Kendall’s tau type test statistic based on the observed data and “ghost” data is given by

K0 = Σi<j uij / Σi<j (mi + 1)(mj + 1),

where the denominator is the number of comparable pairs and normalizes K0 to be in [−1, 1].
In the absence of right censoring, we apply the missing information principle and replace the unknown quantities in uij with their expectations conditioning on the observed data. Under the null hypothesis, following the arguments in Section 2.2, the conditional expectations of uij(k), k = 0, 1, 2, given the observed pairs (Ai, Ti) and (Aj, Tj), can be evaluated explicitly as functionals of F and S; the quantity (mi + 1)(mj + 1) can be imputed by its conditional expectation in the same manner. Define ûij as the sum of the imputed uij(k), k = 0, 1, 2; then the imputed test statistic given the observed left-truncated data is given by K̂0, the ratio of Σi<j ûij to the imputed number of comparable pairs.
When the observation of left-truncated survival times is further subject to independent right censoring, we apply the missing information principle in the same manner, replacing the unknown quantities in uij with their conditional expectations given the observed data {(Ai, Yi, Δi), i = 1, …, n}, to obtain the imputed test statistic.
The test statistic involves the unknown functions S and F = 1 − S. Intuitively, one may replace the survival function S by the product-limit estimator for left-truncated and right-censored data; our simulations show that the type I error rate of the resulting test is close to the pre-specified nominal level. Following a standard argument and applying the functional delta method, we can show that the suitably normalized test statistic, with S replaced by the product-limit estimator, converges to a zero-mean normal distribution under the null hypothesis. Because the formula for the asymptotic variance is very complicated, we recommend using the nonparametric bootstrap method to obtain a confidence interval for the test statistic and rejecting the null hypothesis at the 0.05 significance level if the 95% confidence interval does not cover 0.
Note that, because a small value of S(Ai) in the denominator can result in very large imputed terms ûij(k) (k = 1, 2), in practice we stabilize the test statistic by evaluating Kendall’s tau only on the region {(a, t) : 0 ≤ a ≤ τ0}, where τ0 is an arbitrary constant smaller than τ. Specifically, all pairwise comparisons are restricted to observations with truncation times not exceeding τ0, and the resulting statistic, with the survival function S estimated as above, is used as the test statistic.
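A schematic of the recommended bootstrap procedure follows; `stat` stands in for the imputed Kendall’s tau statistic of this section and is a hypothetical user-supplied function.

```python
import numpy as np

def bootstrap_independence_test(A, Y, delta, stat, B=1000, level=0.05, seed=0):
    """Percentile bootstrap: reject H0 (independent truncation) when the
    (1 - level) CI of the resampled statistic does not cover 0."""
    rng = np.random.default_rng(seed)
    A, Y, delta = map(np.asarray, (A, Y, delta))
    n, reps = len(Y), np.empty(B)
    for b in range(B):
        k = rng.integers(0, n, n)          # resample subjects with replacement
        reps[b] = stat(A[k], Y[k], delta[k])
    lo, hi = np.quantile(reps, [level / 2, 1 - level / 2])
    return not (lo <= 0.0 <= hi), (lo, hi)
```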
5. SIMULATION STUDY
Numerical simulations were carried out to evaluate the performance of the nonparametric and semiparametric estimators from the iterative algorithms in Section 2.5 and Section 3.2. We considered the following two scenarios for the underlying truncation time random variable: (I) A0 follows an exponential distribution; (II) A0 follows a Weibull distribution. The time from enrollment to loss to follow-up was generated from a uniform distribution so that the censoring rate was approximately 25% or 50%. We generated 1000 datasets, each with a sample size of n = 100 or 400.
We first evaluated the nonparametric estimation procedure for left-truncated and right-censored data given in Section 2.5. To simulate left-truncated data, we generated the unbiased survival time T0 from a Weibull distribution with hazard function 0.5t repeatedly until there were n subjects satisfying the sampling constraint A0 < T0, where A0 was simulated under Scenarios (I) and (II). Table 1 reports the summary statistics for the proposed nonparametric estimator. We compared the proposed estimator with the product-limit estimator for left-truncated and right-censored data and the nonparametric pseudo partial likelihood estimator proposed by Luo and Tsai (2009). It can be seen that both the proposed estimator and the pseudo partial likelihood estimator have smaller mean square error than the product-limit estimator in all the scenarios; thus improvement is gained by using information from the underlying truncation time distribution. In Scenario I, the proposed estimator has performance similar to the pseudo partial likelihood estimator; in Scenario II, the proposed estimator has smaller variance and larger bias compared to the other two estimators, and performs best in terms of mean square error. For all three estimators, the bias decreases as the sample size increases.
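The data-generating scheme can be sketched as follows; the exponential rate for A0 is a placeholder, since the parameter values used in the original simulations are not recoverable from the text.

```python
import numpy as np

def gen_ltrc(n, cens_upper, rate=1.0, seed=0):
    """One simulated dataset: T0 has hazard 0.5*t, A0 is exponential
    (rate is a placeholder), and pairs are kept only when A0 < T0."""
    rng = np.random.default_rng(seed)
    A, T = [], []
    while len(T) < n:                        # rejection sampling under truncation
        a = rng.exponential(1.0 / rate)
        t = 2.0 * np.sqrt(rng.exponential()) # inverts Lambda(t) = t^2 / 4
        if a < t:
            A.append(a); T.append(t)
    A, T = np.array(A), np.array(T)
    C = rng.uniform(0, cens_upper, n)        # time from enrollment to dropout
    Y, delta = np.minimum(T, A + C), (T <= A + C).astype(int)
    return A, Y, delta
```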
Table 1.
Simulation summary statistics of the proposed, pseudo partial likelihood, and product-limit estimators
n | Cen | S(t) | Bias | SE | ESMSE | Bias | SE | ESMSE | Bias | SE | ESMSE
---|---|---|---|---|---|---|---|---|---|---|---
Scenario I | | | | | | | | | | |
100 | 25% | 0.75 | 6 | 55 | 55 | 5 | 56 | 56 | 1 | 60 | 60
 | | 0.5 | 5 | 56 | 56 | 4 | 56 | 56 | 2 | 60 | 60
 | | 0.25 | 3 | 47 | 47 | 3 | 46 | 47 | 4 | 49 | 49
 | 50% | 0.75 | 5 | 57 | 58 | 6 | 58 | 58 | 0.4 | 62 | 62
 | | 0.5 | −1 | 61 | 60 | 3 | 61 | 61 | 1 | 64 | 64
 | | 0.25 | −10 | 55 | 56 | 5 | 56 | 56 | 7 | 59 | 60
400 | 25% | 0.75 | 1 | 28 | 28 | 1 | 28 | 28 | −0.3 | 30 | 30
 | | 0.5 | 1 | 28 | 28 | 0.3 | 28 | 28 | 0.1 | 29 | 29
 | | 0.25 | 0.4 | 23 | 23 | 1 | 23 | 23 | 2 | 24 | 24
 | 50% | 0.75 | 0.4 | 29 | 29 | 1 | 29 | 29 | −1 | 31 | 31
 | | 0.5 | −2 | 30 | 30 | 0.3 | 30 | 30 | 0.2 | 32 | 32
 | | 0.25 | −6 | 27 | 28 | 2 | 28 | 28 | 4 | 29 | 30
Scenario II | | | | | | | | | | |
100 | 25% | 0.75 | 61 | 70 | 93 | 33 | 92 | 97 | 17 | 108 | 109
 | | 0.5 | 48 | 69 | 84 | 27 | 79 | 84 | 16 | 91 | 93
 | | 0.25 | 25 | 47 | 53 | 10 | 50 | 51 | 10 | 56 | 57
 | 50% | 0.75 | 59 | 74 | 94 | 45 | 91 | 101 | 17 | 111 | 112
 | | 0.5 | 45 | 72 | 85 | 37 | 83 | 91 | 16 | 95 | 96
 | | 0.25 | 20 | 54 | 58 | 13 | 59 | 60 | 12 | 65 | 66
400 | 25% | 0.75 | 30 | 42 | 51 | 12 | 55 | 56 | 2 | 64 | 64
 | | 0.5 | 22 | 36 | 42 | 10 | 43 | 44 | 3 | 50 | 50
 | | 0.25 | 10 | 23 | 25 | 2 | 25 | 25 | 1 | 30 | 30
 | 50% | 0.75 | 30 | 43 | 52 | 19 | 56 | 59 | 2 | 65 | 65
 | | 0.5 | 22 | 38 | 44 | 15 | 45 | 48 | 4 | 52 | 52
 | | 0.25 | 8 | 25 | 27 | 1 | 28 | 28 | 1 | 33 | 33
Note: Cen is the censoring rate; Bias is the empirical bias (×1000); SE is the empirical standard error (×1000); ESMSE is the square root of the empirical mean square error (×1000). The three Bias/SE/ESMSE column groups correspond, from left to right, to the proposed estimator, the pseudo partial likelihood estimator, and the product-limit estimator.
In the second set of simulation studies, we evaluated the semiparametric estimation procedure presented in Section 3.2. We generated the unbiased failure time T0 from the proportional hazards model with two covariates, where the continuous covariate Z1 follows a uniform distribution on [0, 1] and the binary covariate Z2 follows a Bernoulli distribution with success probability 0.5. The coefficients are set to be β = (1, 1)′ and the baseline hazard function is set to be 2t. The variance estimation follows the perturbation procedure in Qin et al. (2011). We compared the proposed estimator in Section 3.2 with the pseudo partial likelihood estimator proposed in Tsai (2009) and the partial likelihood estimator for left-truncated and right-censored data. Table 2 reports the summary statistics for the three estimators. It can be observed that our proposed method has negligible bias and, as expected, the bias decreases with the sample size. The variance of the proposed estimator is smaller than that of both competitors, and the coverage probability is close to the nominal level with moderate sample sizes.
Table 2.
Simulation summary statistics of the proposed, pseudo partial likelihood, and partial likelihood estimators
n | Cen | Bias | SE | SEE | CP | Bias | SE | Bias | SE
---|---|---|---|---|---|---|---|---|---
Scenario I | | | | | | | | |
100 | 25% | (2,1) | (38,25) | (37,23) | (95,93) | (3,3) | (40,26) | (3,3) | (45,28)
 | 50% | (11,4) | (41,28) | (40,26) | (94,92) | (3,4) | (50,31) | (2,4) | (55,33)
400 | 25% | (2,0.3) | (18,12) | (18,11) | (95,94) | (1,1) | (19,12) | (0.5,1) | (21,13)
 | 50% | (6,2) | (21,14) | (21,13) | (93,93) | (1,1) | (23,14) | (0.3,1) | (25,16)
Scenario II | | | | | | | | |
100 | 25% | (0.3,2) | (35,23) | (33,22) | (94,94) | (5,4) | (39,25) | (5,3) | (46,29)
 | 50% | (6,0.4) | (37,26) | (35,23) | (92,93) | (9,8) | (53,32) | (5,3) | (55,35)
400 | 25% | (0.3,0.1) | (16,11) | (16,11) | (95,95) | (3,2) | (18,11) | (1,1) | (21,14)
 | 50% | (3,1) | (18,12) | (18,11) | (94,95) | (5,4) | (24,15) | (1,0.4) | (26,16)
Note: Cen is the censoring rate; Bias is the empirical bias (×100); SE is the empirical standard error (×100); SEE is the empirical mean of the standard error estimates; CP is the empirical coverage probability (×100) of the 95% confidence interval. The column groups correspond, from left to right, to the proposed estimator (Bias, SE, SEE, CP), the pseudo partial likelihood estimator (Bias, SE), and the partial likelihood estimator (Bias, SE).
We also evaluated the nonparametric test in Section 4 via a series of simulations, comparing the power of our proposed test with the conditional Kendall’s tau test in Tsai (1990). We generated (A0, T0) from a bivariate log-normal distribution truncated at τ, where the associated normal distribution has mean (μ1, μ2) and variance-covariance matrix (σij)2×2. The censoring time C was generated from a uniform distribution to produce different censoring rates. In all the scenarios, we set σ11 = σ22 = 0.5, and the other parameters were set to produce different associations and truncation proportions α = P(A0 > T0). We set τ = 4 when μ2 = 0 and τ = 6 when μ2 = 0.5. The results are presented in Table 3, with a sample size of 100 and 1000 iterations. Both tests maintain the nominal level under the null hypothesis. As expected, the proposed test substantially outperforms the conditional Kendall’s tau test when the rate of truncation is high and the correlation is positive, while the two tests have similar performance when the correlation is negative.
Table 3.
Simulated power of the proposed test and conditional Kendall’s tau test
(μ1, μ2) | σ12 | α | 0% | 25% | 50% | 0% | 25% | 50%
---|---|---|---|---|---|---|---|---
(0,0) | 0.3 | 0.51 | 98 | 87 | 32 | 44 | 27 | 16 |
(0,0) | 0 | 0.51 | 6 | 7 | 4 | 5 | 6 | 4 |
(0,0) | −0.3 | 0.51 | 84 | 74 | 74 | 85 | 71 | 57 |
(0,0.5) | 0.3 | 0.22 | 91 | 80 | 40 | 76 | 64 | 47 |
(0,0.5) | 0 | 0.32 | 6 | 5 | 7 | 5 | 5 | 4 |
(0,0.5) | −0.3 | 0.36 | 93 | 87 | 82 | 93 | 86 | 73 |
Note: Tsai’s test is based on the conditional Kendall’s tau (Tsai, 1990). 0%, 25%, and 50% are the censoring rates; the first three censoring-rate columns correspond to the proposed test and the last three to Tsai’s test. α is the proportion of truncation. The table presents power (×100) in each scenario.
6. DATA ANALYSIS
6.1. Analysis of Canadian Study of Health and Aging
In this section, we illustrate the proposed methods by analyzing data from the Canadian Study of Health and Aging (CSHA), one of the largest epidemiology studies of dementia (Wolfson et al., 2001; McDowell, Hill and Lindsay, 2001). CSHA recruited a prevalent cohort of individuals aged 65 and older with dementia during the period between February 1991 and May 1992. In our data analysis, the survival time of interest is the time from onset to death and the truncation time in the prevalent cohort is the duration from the onset of dementia to study enrollment. A total of 807 subjects were analyzed; among them, 249 were diagnosed with possible Alzheimer’s disease, 388 had probable Alzheimer’s disease, and 170 had vascular dementia.
To assess the effect of dementia subtypes on mortality, we fit a Cox proportional hazards model with indicators of probable Alzheimer’s disease (X1) and vascular dementia (X2) as covariates. Several authors, including Addona and Wolfson (2006) and Huang and Qin (2012), have examined the stationarity assumption with respect to disease incidence and found that the stable disease condition holds approximately. Instead of imposing a uniform distribution for the underlying truncation time, however, for illustrative purposes we apply the result obtained in Huang, Ning and Qin (2015) and employ the density function h(t) ∝ exp(0.183t − 0.028t2 + 0.001t3)I(0 ≤ t ≤ τ), with τ = 19.85 years. Note that h(t) is a member of Neyman’s smooth alternatives, which include the uniform distribution as a special case.
We compare the proposed method to the pseudo partial likelihood approach (Tsai, 2009) by using the same truncation time density function h(t) in both estimation procedures. Applying the proposed self-consistency algorithm described in Section 3.2, the estimated regression coefficients are 0.151 (asymptotic standard error [ASE], 0.065; 95% confidence interval [CI], 0.023 to 0.278) for probable Alzheimer’s disease and 0.229 (ASE, 0.079; 95% CI, 0.074 to 0.384) for vascular dementia. The estimated covariate effects are similar to the maximum likelihood estimates reported in Qin et al. (2011) obtained under the stable disease assumption. Thus our analysis suggests that probable Alzheimer’s disease and vascular dementia are associated with significantly worse survival compared to possible Alzheimer’s disease. On the other hand, the pseudo partial likelihood method gives regression coefficient estimates of 0.064 (ASE, 0.082) for probable Alzheimer’s disease and 0.161 (ASE, 0.106) for vascular dementia; neither coefficient is significantly different from 0 using Tsai’s method.
6.2. Testing independent truncation for nursing home data
We next illustrate the proposed test of independence by analyzing the well-known Channing House data (Hyde, 1977). The study recorded age at entry and age at death for 462 residents of a retirement center, Channing House, from 1964 to 1975. The survival time is left-truncated by the age at study entry and right-censored by the end of study or loss to follow-up. We apply the test statistic in Section 4 to test the null hypothesis that the underlying survival time and the underlying truncation time are independent of each other within each gender group.
Because the asymptotic variance of the proposed testing procedure is quite complicated, we adopt the nonparametric bootstrap method with 1000 replicates to construct the 95% bootstrap CI for the test statistic. The value of the proposed test statistic is 0.005 (95% bootstrap CI, 0.003 to 0.039) for males and −0.004 (95% bootstrap CI, −0.025 to 0.014) for females. Thus we conclude that the association between the underlying survival time and the underlying truncation time is significantly different from 0 in the male group, whereas the association is not significant in the female group. For comparison, we also apply the conditional Kendall’s tau test developed by Tsai (1990). The conditional Kendall’s tau is 0.198 (95% bootstrap CI, 0.003 to 0.362) for males and 0.051 (95% bootstrap CI, −0.046 to 0.164) for females. Hence the results of our proposed test are consistent with those based on the conditional Kendall’s tau test.
7. REMARKS
The main goal of this paper is to develop a unified framework for analyzing left-truncated and right-censored data with an unspecified or known truncation time distribution. Our methodologies are developed based on the idea of treating truncation and censoring as “missing data mechanisms” and applying the missing information principle to unbiased estimating equations obtained in the absence of left truncation and right censoring. Specifically, we derived imputed estimating functions from the score functions of the full nonparametric likelihood (Section 2) and the semiparametric likelihood (Section 3) with complete data. This is in contrast with the estimation procedures developed in Luo and Tsai (2009) and Tsai (2009), where the authors derived a pseudo-partial likelihood by integrating the partial likelihood over the given truncation time distribution. As a result, their estimators are not expected to be more efficient than the proposed estimators, which are based on the full likelihood of complete data. Moreover, the evaluation of the pseudo-partial likelihood requires estimation of the censoring time distribution and is thus less desirable.
In addition to model estimation, we also demonstrate the application of the missing information principle to hypothesis testing problems. In particular, in Section 4 we derive a new nonparametric test for checking the independence between the underlying survival time and the underlying truncation time based on the Kendall’s tau statistic. Unlike the conditional Kendall’s tau test, which is constructed based on comparable pairs subject to truncation and censoring, our new testing procedure utilizes data from all individuals and hence is expected to be more efficient. Results of simulation studies show that the proposed test enjoys a substantial gain in power, compared to the conditional Kendall’s tau test, when the underlying truncation time and survival time are positively correlated.
Finally, with minor modifications, the missing information principle can be applied to handle more complicated data structures, such as double truncation and competing risk models, as well as non-Cox models, such as accelerated failure time models and additive hazards models. Further research is warranted.
ACKNOWLEDGMENT
The CSHA was supported by the Seniors Independence Research Program, through the National Health Research and Development Program (NHRDP) of Health Canada (project 6606-3954-MC[S]). The progression of dementia project within the CSHA was supported by Pfizer Canada through the Health Activity Program of the Medical Research Council of Canada and the Pharmaceutical Manufacturers Association of Canada; by the NHRDP (project 6603-1417-302[R]); by Bayer; and by the British Columbia Health Research Foundation (projects 38 [93-2] and 34 [96-1]).
APPENDIX
Define θ = (β, Λ), θ0 = (β0, Λ0), and S(· | Z) = exp{−Λ(·) exp(β′Z)}. Denote by ℓn(β, Λ) the log-likelihood function based on the observed data, and by Un(β, Λ) = (U1n(β, Λ), U2n(·, β, Λ)) the corresponding score function of (β, Λ), where U1n is the score vector for β and U2n(·, β, Λ) is the score operator for Λ.
We assume the following regularity conditions for Theorem 3.1.
(A1) The true value of λ0 is continuously differentiable. In addition, the upper bound τ of the support is finite. The parameter space of Λ contains all the nondecreasing functions Λ satisfying Λ(0) = 0 and Λ(τ) < ∞.
(A2) The true value β0 lies in a compact parameter space.
(A3) The truncation time distribution H has a density h on [0, τ].
(A4) The residual censoring time C has a continuous survival function SC.
(A5) The covariate Z is bounded.
(A6) The information matrix evaluated at β0 is positive definite.
Condition (A6) implies that the information matrix of the profile likelihood evaluated at the true value β0 is positive definite, which is a classical condition in the study of the Cox model for traditional survival data (Andersen et al. (1993), page 497). (A6) guarantees the existence and uniqueness of the solution in large samples. (A6) also implies that J0, the Fisher information matrix of β for known Λ0, is positive definite, and thus the component σ11 of the Fréchet derivative discussed below is invertible.
Following Qin et al. (2011), it can be shown that the suitably normalized score converges weakly to W = (W1, W2), where W1 is a zero-mean Gaussian random vector and W2 is a zero-mean Gaussian process. The Fréchet derivative of the limiting score map at θ0, whose β-component involves the map σ11, and its inverse can then be derived; the existence of the inverse operator Φ−1 follows by applying the theory of Fredholm integral equations of the second kind. Thus n1/2(β̂ − β0, Λ̂ − Λ0) converges weakly to a tight zero-mean Gaussian process.
Contributor Information
Yifei Sun, Department of Biostatistics, Mailman School of Public Health, Columbia University, New York, NY 10032.
Jing Qin, Biostatistics Research Branch, National Institute of Allergy and Infectious Diseases, National Institutes of Health, Bethesda, MD 20892.
Chiung-Yu Huang, Department of Epidemiology and Biostatistics, School of Medicine, University of California San Francisco, San Francisco, CA 94158.
REFERENCES
- Addona V and Wolfson DB (2006). A formal test for the stationarity of the incidence rate using data from a prevalent cohort study with follow-up. Lifetime Data Anal 12 267–284. MR2328577
- Andersen PK, Borgan Ø, Gill RD and Keiding N (1993). Statistical Models Based on Counting Processes. Springer Series in Statistics. Springer-Verlag, New York. MR1198884
- Asgharian M, M’Lan CE and Wolfson DB (2002). Length-biased sampling with right censoring: an unconditional approach. J. Amer. Statist. Assoc 97 201–209. MR1947280
- Bartlett MS (1937). Some examples of statistical methods of research in agriculture and applied biology. J. R. Statist. Soc. B 4 137–183.
- Begun JM, Hall WJ, Huang W-M and Wellner JA (1983). Information and asymptotic efficiency in parametric–nonparametric models. Ann. Statist 11 432–452. MR696057
- Bhattacharya PK, Chernoff H and Yang SS (1983). Nonparametric estimation of the slope of a truncated regression. Ann. Statist 11 505–514. MR696063
- Dempster AP, Laird NM and Rubin DB (1977). Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Statist. Soc. Ser. B 39 1–38. With discussion. MR0501537
- Huang C-Y, Ning J and Qin J (2015). Semiparametric likelihood inference for left-truncated and right-censored data. Biostatistics 16 785–798. MR3449843
- Huang C-Y and Qin J (2012). Composite partial likelihood estimation under length-biased sampling, with application to a prevalent cohort study of dementia. J. Amer. Statist. Assoc 107 946–957. MR3010882
- Hyde J (1977). Testing survival under right censoring and left truncation. Biometrika 64 225–230. MR0494775
- Kendall M and Gibbons JD (1990). Rank Correlation Methods, fifth ed. Edward Arnold, London. MR1079065
- Lancaster T (1990). The Econometric Analysis of Transition Data. Cambridge University Press, Cambridge.
- Luo X and Tsai WY (2009). Nonparametric estimation for right-censored length-biased data: a pseudo-partial likelihood approach. Biometrika 96 873–886. MR2767276
- Lynden-Bell D (1971). A method of allowing for known observational selection in small samples applied to 3CR quasars. Mon. Not. R. Astron. Soc 155 95–118.
- Martin EC and Betensky RA (2005). Testing quasi-independence of failure and truncation times via conditional Kendall’s tau. J. Amer. Statist. Assoc 100 484–492. MR2160552
- McDowell I, Hill G and Lindsay J (2001). An overview of the Canadian Study of Health and Aging. International Psychogeriatrics 13 1–18.
- Murphy SA and van der Vaart AW (2000). On profile likelihood. J. Amer. Statist. Assoc 95 449–485. With comments and a rejoinder by the authors. MR1803168
- Ning J, Qin J and Shen Y (2014). Score estimating equations from embedded likelihood functions under accelerated failure time model. J. Amer. Statist. Assoc 109 1625–1635. MR3293615
- Oakes D (2008). On consistency of Kendall’s tau under censoring. Biometrika 95 997–1001. MR2461227
- Orchard T and Woodbury MA (1972). A missing information principle: theory and applications. 697–715. MR0400516
- Qin J, Ning J, Liu H and Shen Y (2011). Maximum likelihood estimations and EM algorithms with length-biased data. J. Amer. Statist. Assoc 106 1434–1449. MR2896847
- Shen Y, Ning J and Qin J (2017). Nonparametric and semiparametric regression estimation for length-biased survival data. Lifetime Data Anal 23 3–24. MR3601682
- Tsai W-Y (1990). Testing the assumption of independence of truncation time and failure time. Biometrika 77 169–177. MR1049418
- Tsai WY (2009). Pseudo-partial likelihood for proportional hazards models with biased-sampling data. Biometrika 96 601–615. MR2538760
- Tsai W-Y, Jewell NP and Wang M-C (1987). A note on the product-limit estimator under right censoring and left truncation. Biometrika 74 883–886.
- Turnbull BW (1976). The empirical distribution function with arbitrarily grouped, censored and truncated data. J. Roy. Statist. Soc. Ser. B 38 290–295. MR0652727
- Vardi Y (1989). Multiplicative censoring, renewal processes, deconvolution and decreasing density: nonparametric estimation. Biometrika 76 751–761. MR1041420
- Vardi Y and Zhang C-H (1992). Large sample study of empirical distributions in a random-multiplicative censoring model. Ann. Statist 20 1022–1039. MR1165604
- Wang M-C (1991). Nonparametric estimation from cross-sectional survival data. J. Amer. Statist. Assoc 86 130–143. MR1137104
- Wang M-C (1996). Hazards regression analysis for length-biased data. Biometrika 83 343–354. MR1439788
- Wang M-C, Brookmeyer R and Jewell NP (1993). Statistical models for prevalent cohort data. Biometrics 49 1–11. MR1221402
- Wolfson C, Wolfson DB, Asgharian M, M’Lan CE, Ostbye T, Rockwood K and Hogan DB (2001). A reevaluation of the duration of survival after the onset of dementia. New England Journal of Medicine 344 1111–1116.
- Yates F (1933). The analysis of replicated experiments when the field results are incomplete. Empire Journal of Experimental Agriculture 1 129–142.