SUMMARY
Length-biased sampling is well recognized in economics, industrial reliability, etiology, epidemiology, genetics, and cancer screening studies. Length-biased right-censored data have a unique structure that differs from that of traditional survival data, so nonparametric and semiparametric estimation and inference methods for traditional survival data are not directly applicable. We propose new expectation-maximization algorithms for estimation based on full likelihoods involving infinite-dimensional parameters under three settings for length-biased data: estimating the nonparametric distribution function, estimating the nonparametric hazard function under an increasing failure rate constraint, and jointly estimating the baseline hazard function and the covariate coefficients under the Cox proportional hazards model. Extensive simulation studies show that the maximum likelihood estimators perform well with moderate sample sizes and are more efficient than the estimating equation approaches; the proposed estimators are also more robust to various right-censoring mechanisms. We prove the strong consistency of the estimators, and establish the asymptotic normality of the semiparametric maximum likelihood estimators under the Cox model using modern empirical process theory. We apply the proposed methods to a prevalent cohort medical study. Supplemental materials are available online.
Keywords: Cox regression model, EM algorithm, Increasing failure rate, Non-parametric likelihood, Profile likelihood, Right-censored data
1. INTRODUCTION
When the observed failure times are not randomly selected from the target population of interest but are sampled with probability proportional to their underlying length, we have length-biased time-to-event data. Length-biased data are naturally encountered in applications of renewal processes (Cox and Miller, 1977; Vardi, 1982; Dewanji and Kalbfleisch, 1987; Vardi, 1989), industrial applications (Kvam, 2008), etiologic studies (Simon, 1980), genome-wide linkage studies (Terwilliger et al., 1997), epidemiologic cohort studies (Keiding, 1991; Gail and Benichou, 2000; Gordis, 2000; Sansgiry and Akman, 2000; Scheike and Keiding, 2006), cancer prevention trials (Zelen and Feinleib, 1969; Zelen, 2004), and studies in labor economics (Lancaster, 1990; McClean and Devine, 1995; De Uña Álvarez et al., 2003). In observational studies, a prevalent cohort design that draws samples from individuals with a condition or disease at the time of enrollment is generally more efficient and practical. The recruited patients, who have already experienced an initiating event, are followed prospectively for the failure event (e.g., disease progression or death) or are right censored. Under this sampling design, individuals with longer survival times measured from the onset of the disease are more likely to be included in the cohort, whereas those with shorter survival times are selectively excluded. Length-biased sampling thereby manifests in the observations, because the “observed” time intervals from initiation to failure within the prevalent cohort tend to be longer than those arising from the underlying distribution of the general population. How to properly adjust for potential selection bias in analyzing length-biased data has been a longstanding statistical problem. Although we use a prevalent cohort study in medical applications here to illustrate length-biased data, it is apparent that the issues caused by biased sampling are common to many potential applications and sampling designs.
In a seminal paper, Vardi (1989) described the multiplicative censorship model, which connected four well-investigated statistical problems: A. Estimating a nonparametric distribution function under multiplicative censoring, B. Estimating the underlying distribution in renewal processes, C. Solving a nonparametric deconvolution problem, and D. Estimating a monotone decreasing density function. Vardi (1989) presented problems A and C, which have a natural connection with the measurement error problem and inverse problem discussed by van der Vaart and Wellner (1992) and Bickel and Ritov (1994). Most importantly, Vardi (1989) and Wang (1991) showed that the nonparametric maximum likelihood estimation (NPMLE) of the survival distribution under multiplicative censoring (problem B) is equivalent to the nonparametric estimation of the survival distribution for observed length-biased data. The large sample properties of the corresponding NPMLE are established in Asgharian et al. (2002), and the asymptotic efficiency of the NPMLE follows from Asgharian and Wolfson (2005) and van der Vaart (1998, Theorem 25.47). In this paper we explore the potential to extend the approach of Vardi (1989) to nonparametric estimation in more general settings and to semiparametric regression models.
The Cox proportional hazards model, the most popular semiparametric model for regression analysis of traditional survival data, assumes a nonparametric baseline hazard function and a regression function of the covariates (Cox, 1972, 1975). Only limited literature exists on modeling risk factors for the distribution of the underlying population when the observed failure times are subject to length bias. Recently, Tsai (2009) generalized the pseudo-partial likelihood approach of Wang (1996) to model right-censored length-biased data. Qin and Shen (2010) proposed inverse-weighted estimating equation approaches for right-censored length-biased data under the proportional hazards model. These approaches do not provide a straightforward way to analyze length-biased data if the censoring time depends on the covariates, and may not yield efficient estimators. For traditional survival data, Zeng and Lin (2007) demonstrated that estimating equation approaches under either the semiparametric Cox model or the transformation models are less efficient than the profile maximum likelihood estimation approach. (For related work on the profile likelihood for traditional survival data, see Nielsen et al. (1992); Klein (1992); Murphy (1994, 1995); Murphy and van der Vaart (2000); Zeng et al. (2005); Zeng and Lin (2007).) For right-censored length-biased data, we expect a similar efficiency advantage for the maximum likelihood estimation (MLE) method, as well as robustness to various assumptions about the censoring distribution.
Implementing the profile likelihood method is much more challenging for right-censored length-biased data than for traditional survival data. One significant difference is that the full likelihood has positive mass at both censored and failure time points for length-biased data, in contrast to the MLEs for traditional survival data and the conditional likelihood estimates for length-biased data. We propose new expectation-maximization (EM) algorithms for the maximum likelihood estimation of the nonparametric and semiparametric Cox regression models for right-censored length-biased data. One new aspect of our method is that we derive the likelihood for the unobserved (i.e., left-truncated) subpopulation given the observed length-biased data in the full likelihood, which serves as the missing data mechanism in the EM algorithm. In contrast to the EM algorithm of Vardi (1989), which estimates the underlying distribution function via estimation of the biased distribution, our EM algorithm directly estimates the target unbiased distribution function. As a result, any model and parameter constraints for the target distribution function can be imposed directly.
The rest of the paper is organized as follows. In Section 2, we introduce a new EM algorithm for the nonparametric estimation of the target distribution given length-biased data. In Section 3, we apply the new EM algorithm to estimate a distribution function with an increasing failure rate constraint. In Section 4, we propose the maximum semiparametric likelihood estimation under the Cox proportional hazards model and derive the large sample properties for length-biased data. We provide a convenient profile estimation approach based on the EM algorithm, with which the standard software for Cox regression can be adapted for right-censored length-biased data. We describe our simulation studies in Section 5 and the application of our method to a data example in Section 6. Section 7 contains some concluding remarks.
2. A NEW EM ALGORITHM FOR ESTIMATING NONPARAMETRIC SURVIVAL FUNCTION
Consider a prevalent cohort study in which the subjects are diagnosed with a disease and are at risk for a failure event. Let T̃ be the duration from the disease onset to failure, with unbiased density function f(t) = dF(t)/dt and survival function S(t). The observed data include the backward recurrence time A (from disease onset to the study entry), the forward recurrence time V (from the study entry to failure), and the length-biased time T = A + V. Based on renewal theory (Vardi, 1982, 1989; Lancaster, 1990, Chapter 3), the joint density of (A, V) is

f(a + v)/μ,  a ≥ 0, v ≥ 0,

where μ = ∫_0^∞ u dF(u) < ∞.
When the prevalent cohort is followed prospectively, V is subject to right censoring. The censoring time, denoted by C, is measured from the study entry. Let δ = I(V < C) be the censoring indicator and assume that (A, V) is independent of C. Let X = min(A + V, A + C). Denote the observed data as (Xi, Ai, δi), i = 1, 2, …, n. The density function of the observed biased T is defined as g(y) = dG(y)/dy, where dG(y) = y dF(y)/μ, and the survival function of T̃ is S(t) = 1 − F(t).
Therefore, the likelihood for the observed data (Xi, Ai, δi), i = 1, ···, n, is proportional to

L_n(F) = ∏_{i=1}^n {f(X_i)}^{δ_i} {S(X_i)}^{1−δ_i}/μ.  (1)
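To see why a censored subject contributes S(X_i)/μ in (1), one can integrate the joint density of (A, V) over the censored region; a one-line LaTeX sketch:

```latex
% A censored subject with entry time a and residual censoring time c
% (so X = a + c) contributes P(V > c, A = a):
\int_{c}^{\infty} \frac{f(a+v)}{\mu}\,dv
  \;=\; \frac{S(a+c)}{\mu} \;=\; \frac{S(X)}{\mu},
% while an uncensored subject contributes f(X)/\mu.
```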
Vardi (1989) proposed an EM algorithm for the NPMLE of G. Using the relationship between G and F, dF(t) = t^{−1} dG(t)/∫ u^{−1} dG(u), the NPMLE for F can be derived. However, it is often difficult to impose constraints on F when F is estimated from the NPMLE of G, because constraints on F may not translate easily into constraints on G.
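To illustrate this relationship, a small R fragment (our sketch; the function name g_to_f is hypothetical) converts jump sizes of an estimate of G into the corresponding estimate of F:

```r
## Convert jumps dG at support points t into jumps of the unbiased F:
## dF(t_j) is proportional to dG(t_j)/t_j, renormalized to sum to one.
g_to_f <- function(t, dG) {
  dF <- (dG / t) / sum(dG / t)
  list(t = t, dF = dF, F = cumsum(dF))
}
```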
As demonstrated in Vardi (1989), to maximize (1) it is sufficient to consider the discrete version of the distribution F, i.e., p(T̃ = t_i) = p_i, nonparametrically, with point masses at t_1 < t_2 < ··· < t_k, where t_1, …, t_k are the ordered unique failure and censoring times of {X_1, …, X_n}, k ≤ n. In principle, the length-biased observations (A, T) can be equivalently generated from a truncation model with
A ∼ Uniform(0, τ̂) and T̃ ∼ F,  (2)

where τ̂ = t_k, A and T̃ are independent, dF(t_i) = p_i with Σ_{i=1}^k p_i = 1, and (A, T̃) is observed if and only if T̃ ≥ A. The probability of observing a length-biased observation under this setting is π = P(T̃ ≥ A) = E(T̃)/τ̂.
We propose an EM algorithm with a different missing-data mechanism to directly estimate the target distribution, F. For a cohort subject to left truncation, the biased samples on n subjects, denoted by O = {(X_1, δ_1, A_1), ···, (X_n, δ_n, A_n): A_i ≤ X_i, i = 1, ···, n}, are observed, whereas the data on m subjects are left truncated. Here the latent left-truncated data are denoted by O* = {(A*_1, T̃*_1), ···, (A*_m, T̃*_m): T̃*_l < A*_l}. The random integer m then follows a negative binomial distribution with parameter π; the probability mass function of m is

P(m) = \binom{n + m − 1}{m} π^n (1 − π)^m,  m = 0, 1, 2, ···.
Following the principle of the EM algorithm, we think of {O, O*} as the ‘complete data’, take the pseudo missing data, also referred to as “ghosts” in Turnbull (1976), to be O*, and take the observed ‘incomplete data’ to be O. We derive the full likelihood including the component for the truncated observations. The log-likelihood based on the complete data {O, O*} is
ℓ_C(p) = Σ_{i=1}^n log dF(T_i) + Σ_{l=1}^m log dF(T̃*_l),  (3)
where T_i ≥ A_i, i = 1, 2, ···, n and T̃*_l < A*_l, l = 1, ···, m. Then, conditional on the observed data,

E{I(T_i = t_j) | O} = δ_i I(X_i = t_j) + (1 − δ_i) I(X_i ≤ t_j) p_j/S(X_i),

because (1 − δ_i)P(T_i = t_j | T_i ≥ A_i, T_i ≥ X_i) = (1 − δ_i)P(T̃ = t_j | T̃ ≥ X_i) = (1 − δ_i)I(X_i ≤ t_j)p_j/S(X_i). Conditional on the observed data O, the expectation for the missing left-truncated data can be expressed as

E{Σ_{l=1}^m I(T̃*_l = t_j) | O} = E(m | O) P(T̃* = t_j | T̃* < A*).

Under the truncation model specified in (2),

P(T̃* = t_j | T̃* < A*) = p_j(1 − t_j/τ̂)/(1 − π).

This together with E(m | O) = n(1 − π)/π gives

E{Σ_{l=1}^m I(T̃*_l = t_j) | O} = n p_j(1 − t_j/τ̂)/π.
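The value E(m | O) = n(1 − π)/π used above is simply the mean of the negative binomial count introduced earlier; a short LaTeX check:

```latex
% m | O ~ NegBin(n, pi) counts the truncated subjects accompanying
% the n observed ones, so its mean is
\mathrm{E}(m \mid O)
  \;=\; \sum_{m=0}^{\infty} m \binom{n+m-1}{m}\,\pi^{n}(1-\pi)^{m}
  \;=\; \frac{n(1-\pi)}{\pi}.
```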
Subject to Σ_{j=1}^k p_j = 1 and p_j ≥ 0, we maximize the expected complete-data log-likelihood conditional on the observed data via the EM algorithm,

ℓ_E(p) = Σ_{j=1}^k w_j log p_j,  (4)

where p = (p_1, ···, p_k), and

w_j = Σ_{i=1}^n {δ_i I(X_i = t_j) + (1 − δ_i) I(X_i ≤ t_j) p_j/S(X_i)} + n p_j(1 − t_j/τ̂)/π.
By simple algebra, p̂_j = w_j/Σ_{l=1}^k w_l. The following iterative EM algorithm can be used to solve for p̂_j, j = 1, ···, k.
Step 1. Select an arbitrary initial value p^{(0)} = (p_1^{(0)}, ···, p_k^{(0)}) satisfying Σ_{j=1}^k p_j^{(0)} = 1 and p_j^{(0)} > 0.

Step 2. Solve p^{(s+1)} by maximizing (4), so that we replace p^{(s)} with

p_j^{(s+1)} = w_j^{(s)}/Σ_{l=1}^k w_l^{(s)},  (5)

where w_j^{(s)} denotes w_j in (4) evaluated at p^{(s)}.
With a given convergence criterion, we can solve the p_j iteratively. Let p̂_j denote the MLE of p_j, j = 1, ···, k; the NPMLE of F is F̂(t) = Σ_{j: t_j ≤ t} p̂_j, with π̂ = ∫ t dF̂(t)/τ̂ and μ̂ = π̂τ̂, where Ŝ(t) = Σ_{j: t_j ≥ t} p̂_j. Thus, the limiting form of (5) is

p̂_j = (μ̂/t_j) n^{−1} Σ_{i=1}^n {δ_i I(X_i = t_j) + (1 − δ_i) I(X_i ≤ t_j) p̂_j/Ŝ(X_i)},  j = 1, ···, k.  (6)
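The iteration (5) is straightforward to implement. Below is a minimal R sketch under the setup above, assuming observed vectors x (= X_i) and delta; the function name em_npmle_lb, the starting value, and the tolerances are our illustrative choices, not part of any package.

```r
em_npmle_lb <- function(x, delta, tol = 1e-8, maxit = 5000) {
  t <- sort(unique(x))                  # jump points t_1 < ... < t_k
  k <- length(t); n <- length(x)
  tau <- max(t)                         # tau-hat = t_k
  p <- rep(1 / k, k)                    # Step 1: arbitrary starting value
  d <- vapply(t, function(s) sum(delta * (x == s)), 0)   # failures at t_j
  for (it in seq_len(maxit)) {
    S  <- vapply(x, function(xi) sum(p[t >= xi]), 0)     # S(X_i)
    pi <- sum(p * t) / tau                               # P(T-tilde >= A)
    w  <- d +
      vapply(seq_len(k), function(j)
        sum((1 - delta) * (x <= t[j]) * p[j] / S), 0) +  # imputed censored
      n * p * (1 - t / tau) / pi                         # expected "ghosts"
    p_new <- w / sum(w)                                  # update (5)
    if (max(abs(p_new - p)) < tol) { p <- p_new; break }
    p <- p_new
  }
  list(t = t, p = p, F = cumsum(p))
}
```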
Remark 1
In contrast to the NPMLE for traditional survival analysis, which has jumps only at the observed failure time points, the proposed NPMLE for length-biased data has jumps at all distinct observed time points, including censored times, similar to that of Vardi (1989).
Remark 2
Equation (6) for the constructed EM algorithm with the unbiased distribution function F is equivalent to that for Vardi’s EM algorithm based on a ‘multiplicative-censorship’ model with the biased distribution function G. Denoting dĜ(t) = t dF̂(t)/μ̂, where μ̂ = π̂τ̂, we re-express equation (6) as an equation of Ĝ,

dĜ(t_j) = n^{−1} Σ_{i=1}^n [δ_i I(X_i = t_j) + (1 − δ_i) I(X_i ≤ t_j) {dĜ(t_j)/t_j}/{Σ_{t_l ≥ X_i} dĜ(t_l)/t_l}],

which is the same equation derived by Vardi (1989) and Vardi and Zhang (1992). The advantage of the new EM algorithm is that it directly estimates the target distribution function of the unbiased data, which allows one to impose constraints on F directly. This advantage will be further elucidated in the next two sections.
Remark 3
The ‘missing’ data (i.e., the left-truncated failure times {T̃*_l, l = 1, ···, m}) are assumed not to be subject to right censoring. It is clear that whether T̃* is subject to right censoring or not is irrelevant in the derivation of the above EM algorithm.
Remark 4
The development of the methods and large sample properties is focused on [0, τ] throughout the paper, where τ is a finite upper bound of the support of the population survival times and Λ(τ) < ∞. In practice, τ can be estimated by τ̂ = t_k = max{X_1, ···, X_n}. We prove (in Appendix A.2) the following lemma, which states that τ̂ ≡ t_k → τ in probability with a convergence rate faster than n^{1/2}.
Lemma 1
Suppose that E(C) > 0 and τ < ∞. Then for 1 > η > 1/2, n^η(τ̂ − τ) = o_p(1).
3. NONPARAMETRIC MAXIMUM LIKELIHOOD ESTIMATION WITH INCREASING FAILURE RATE
In some applications, it is known or assumed that the survival function for the target population has an increasing failure rate (Barlow and Proschan, 1975; Padgett and Wei, 1980; Tsai, 1988). The maximum likelihood estimation of a distribution function with an increasing failure rate was derived for traditional right-censored data by Padgett and Wei (1980), and for left-truncated and right-censored data by Tsai (1988). Using the same notation as in Section 2, the observed right-censored length-biased data are denoted by (X, A, δ). Let λ(t) denote the hazard function for the target cumulative distribution function F. Let z_1 < ··· < z_{k*} denote the distinct ordered failure times among {X_1, ···, X_n}. Let the size of the risk set at time x be denoted by R(x) = Σ_{i=1}^n I(A_i ≤ x ≤ X_i) and the number of failures at time x by D(x) = Σ_{i=1}^n δ_i I(X_i = x). Under the increasing failure rate constraint, Tsai (1988) proposed a maximum conditional likelihood estimator of λ, conditional on the truncation time A,

λ̂_c(z_i) = max_{u ≤ i} min_{v ≥ i} {Σ_{j=u}^v D(z_j)}/{Σ_{j=u}^v R(z_j)},  i = 1, ···, k*,

where the max-min form is the isotonic regression of the raw hazard estimates D(z_j)/R(z_j).
By applying the new EM algorithm, we consider a full likelihood estimation of the hazard function for the target population. Define λ(t_j) = λ_j; p_j can be expressed as p_j = λ_j ∏_{l=1}^{j−1}(1 − λ_l); thus the expected complete-data log-likelihood function in (4) is

ℓ_E(λ) = Σ_{j=1}^k {w_j log λ_j + (Σ_{l=j+1}^k w_l) log(1 − λ_j)},  (7)
where λ = (λ_1, ···, λ_k), and t_1 < ··· < t_k is defined in §2. Because the hazard function λ(·) increases with time, the maximization of (7) is subject to the constraint

λ_0 ≤ λ_1 ≤ ··· ≤ λ_k,

where λ_0 = 0. Taking the partial derivative with respect to λ_j on the right side of (7) and setting it to zero, we have

w_j/λ_j − (Σ_{l=j+1}^k w_l)/(1 − λ_j) = 0,  j = 1, ···, k.  (8)
Using arguments similar to those of Marshall and Proschan (1965) and Padgett and Wei (1980), the solution to equation (8) also maximizes the expected log-likelihood ℓ_E(λ) defined in (7), with unconstrained solution λ̃_j = w_j/Σ_{l=j}^k w_l. Applying the pool-adjacent-violators algorithm, we can then achieve monotonicity for the NPMLE of λ(·),

λ̂_j = max_{u ≤ j} min_{v ≥ j} (Σ_{l=u}^v w_l)/(Σ_{l=u}^v W_l),

where W_l = Σ_{m=l}^k w_m. Although the formula for the proposed NPMLE of the monotone hazard function bears some similarity to that of Tsai (1988), the full likelihood approach is essentially different from the conditional likelihood approach of Tsai (1988), in which the estimated function has jumps only at the distinct failure time points. By using the information from the left-truncated data in the full likelihood function, the NPMLE is expected to be more efficient and smoother than the maximum conditional likelihood estimate. We compare the two approaches further in the empirical studies.
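For concreteness, the pooling step can be coded as a weighted pool-adjacent-violators pass over the unconstrained ratios w_j/W_j with weights W_j; the following R sketch (ifr_hazard is our hypothetical name, assuming E-step weights w from Section 2) illustrates one standard implementation.

```r
ifr_hazard <- function(w) {
  k   <- length(w)
  W   <- rev(cumsum(rev(w)))      # W_l = w_l + ... + w_k
  val <- w / W                    # unconstrained maximizer of (7)
  wt  <- W                        # PAVA weights
  len <- rep(1L, k)               # current block sizes
  i <- 1
  while (i < length(val)) {
    if (val[i + 1] < val[i]) {    # adjacent violation: pool the two blocks
      val[i] <- (wt[i] * val[i] + wt[i + 1] * val[i + 1]) / (wt[i] + wt[i + 1])
      wt[i]  <- wt[i] + wt[i + 1]
      len[i] <- len[i] + len[i + 1]
      val <- val[-(i + 1)]; wt <- wt[-(i + 1)]; len <- len[-(i + 1)]
      if (i > 1) i <- i - 1
    } else i <- i + 1
  }
  rep(val, len)                   # lambda-hat_1 <= ... <= lambda-hat_k
}
```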
As the hazard function λ(t) is increasing on t ∈ [0, τ], the corresponding cumulative hazard function Λ(t) is convex. Let Λ̂n(·) denote the estimator obtained by the EM algorithm together with the pool-adjacent-violators algorithm. Then Λ̂n(·) is the greatest convex minorant of Λn(·), where Λn(·) is the NPMLE of Λ(·) when there is no constraint on its shape. The strong consistency of Λn(·), uniformly on [0, τ], can be derived easily from the uniform consistency of the corresponding survival function estimator, established by Asgharian and Wolfson (2005). The consistency of Λ̂n then follows, because the pool-adjacent-violators algorithm defines a continuous map from Λn(·) to Λ̂n(·). The technical details are provided in Appendix A.3.
4. MLE UNDER COX REGRESSION MODEL
4.1. Full Likelihood and Score Functions
Since the cornerstone work of Cox (1972, 1975), the proportional hazards model has become the standard regression model for analyzing traditional right-censored survival data. Specifically, the covariate-specific hazard function is specified as

λ(t | Z) = λ(t) exp(β′Z),
where Z is a covariate vector and the baseline hazard function λ(t) is not specified parametrically. Breslow (1972) showed that by inserting an estimator (Breslow’s estimator) of the hazard function with a fixed β into the full likelihood, the profile likelihood for β reduces to Cox’s partial likelihood for β. Later, Kalbfleisch and Prentice (1973) and Andersen et al. (1992) proved that the rank-based likelihood method is also equivalent to the partial likelihood method. When survival data are subject to biased sampling, neither Cox’s partial likelihood approach nor Kalbfleisch and Prentice’s rank-based likelihood method can be directly applied: the observed biased data do not follow the proportional hazards model that is assumed for unbiased data from the target population, and the rank-based likelihood method is not applicable because length-biased data depend on the magnitude of the length.
The density function of an unbiased T̃ given Z is denoted by f(t | Z) and the corresponding survival function by S(t | Z). For random but length-biased samples of n subjects, the observed data consist of {𝒪_i ≡ (A_i, X_i, δ_i, Z_i), i = 1, ···, n}, which are n i.i.d. copies of 𝒪 ≡ (A, X, δ, Z). The full likelihood function of the observed data is proportional to

L_n = ∏_{i=1}^n {f(X_i | Z_i)}^{δ_i} {S(X_i | Z_i)}^{1−δ_i}/μ(Z_i),  (9)

where μ(Z) = ∫_0^∞ t f(t | Z) dt. The identifiability of the model can be established similarly to the Cox model for traditional survival data, where identifiability has been established (Elbers and Ridder, 1982). By decomposing the full likelihood into the product of the conditional likelihood of X given A and the marginal likelihood of A, we have

L_n = ∏_{i=1}^n [{f(X_i | Z_i)}^{δ_i} {S(X_i | Z_i)}^{1−δ_i}/S(A_i | Z_i)] × ∏_{i=1}^n [S(A_i | Z_i)/μ(Z_i)].
Although the estimating equation derived from the likelihood conditional on A (the first component in L_n) shares the advantage of Cox’s partial likelihood of canceling the baseline hazard function (Wang et al., 1993; Kalbfleisch and Lawless, 1991), the conditional likelihood approach is generally less efficient than the full likelihood approach.
Using counting process notation, we denote N_i(t) = I(X_i ≤ t)δ_i and Y_i(t) = I(X_i ≥ t) for i = 1, ···, n. The log-likelihood function of (9) can be expressed as

ℓ_n(β, Λ) = Σ_{i=1}^n [∫_0^τ {log λ(u) + β′Z_i} dN_i(u) − ∫_0^τ Y_i(u) e^{β′Z_i} dΛ(u) − log μ(Z_i)],  (10)

where Λ(t) = ∫_0^t λ(u) du and μ(Z) = ∫_0^τ exp{−Λ(u) e^{β′Z}} du. Directly maximizing (10) or solving its score equations for the MLE of β and the infinite-dimensional parameter Λ can be computationally intractable. We therefore propose an alternative computational approach, which generalizes the EM algorithm for the NPMLE discussed in Section 2 to the Cox proportional hazards model.
The semiparametric MLE for the baseline hazard function Λ is obtained by maximizing the likelihood over the set of piecewise-constant functions. Of note, the estimator can have jumps at both censored and uncensored times, because the likelihood function achieves its maximum for a hazard function with jumps on {t_1, ···, t_k}, where t_1 < ··· < t_k denotes the distinct failure and censored time points. Similar to the argument in Vardi (1989, page 754), for any estimator of Λ(t) that jumps outside of {t_1, ···, t_k} one can find an estimator with jumps only on {t_1, ···, t_k} that attains a greater likelihood. A detailed explanation is given in the Supplemental Materials available online.
4.2. MLE and EM Algorithm
For i = 1, ···, n, let T*_{ij}, j = 1, 2, …, m_i, be the truncated latent data corresponding to covariate Z_i. We develop the EM algorithm based on the discretized version Λ(u) = Σ_{t_j ≤ u} λ_j, where λ_j is the positive jump at time t_j for j = 1, ···, k, and λ = (λ_1, ···, λ_k). For notational convenience, denote f_i(t) = dF(t | Z_i). The log-likelihood based on the complete data is then

ℓ_C(β, λ) = Σ_{i=1}^n [log f_i(T_i) + Σ_{j=1}^{m_i} log f_i(T*_{ij})].
Conditional on the observed data for the ith subject, 𝒪_i = {X_i, A_i, δ_i, Z_i}, we obtain the expectation

w_ij = δ_i I(X_i = t_j) + (1 − δ_i) I(X_i ≤ t_j) f_i(t_j)/S(X_i | Z_i) + f_i(t_j)(τ̂ − t_j)/μ(Z_i),  (11)

where the three terms count, respectively, the observed failures, the imputed failure times for censored subjects, and the expected left-truncated (‘ghost’) failure times.
Thus, the expected complete-data log-likelihood function conditional on the observed data is

ℓ_E(β, λ) = Σ_{i=1}^n Σ_{j=1}^k w_ij {log λ_j + β′Z_i − Λ(t_j) e^{β′Z_i}},

where Λ(t_j) = Σ_{l=1}^j λ_l and the w_ij are given by (11). In the M-step, we maximize ℓ_E with respect to the baseline hazard function at t_j, for j = 1, ···, k, by solving ∂ℓ_E/∂λ_j = 0, which leads to a closed form for λ_j as a function of β, denoted by

λ_j(β) = (Σ_{i=1}^n w_ij)/(Σ_{i=1}^n e^{β′Z_i} Σ_{l=j}^k w_il).  (12)
Here, λ_j(β) is the maximizer in the M-step. Next, we maximize the expected complete-data log-likelihood function with respect to β,

∂ℓ_E/∂β = Σ_{i=1}^n Σ_{j=1}^k w_ij Z_i {1 − Λ(t_j) e^{β′Z_i}} = 0.  (13)

By inserting λ_j(β) of (12) into equation (13), β can be solved from the following equation,

Σ_{j=1}^k Σ_{i=1}^n w_ij [Z_i − {Σ_{i′=1}^n Z_{i′} e^{β′Z_{i′}} R_{i′j}}/{Σ_{i′=1}^n e^{β′Z_{i′}} R_{i′j}}] = 0,  where R_{i′j} = Σ_{l=j}^k w_{i′l},  (14)
which is equivalent to maximizing the complete-data likelihood profiled over λ. With the estimated λ_j (j = 1, ···, k) and β, one updates the expectation of the likelihood via w_ij in (11) and repeats the M-step until the estimators of β and λ_j (j = 1, ···, k) converge.
At the M-step, the estimating equation (14) reveals that we may use existing software for conventional right-censored data to estimate the covariate coefficient β under the Cox proportional hazards model. To simplify the description, consider a model with one covariate Z. First we create a vector of length nk for the weight function, W_{nk} = (w_11, ···, w_1k, w_21, ···, w_2k, ···, w_n1, ···, w_nk), which is estimated at the E-step. The corresponding failure time and covariate vectors are constructed with the same length as W_{nk}: T_{nk} = (t_1, ···, t_k, ···, t_1, ···, t_k) and Z_{nk} = (Z_1, ···, Z_1, ···, Z_n, ···, Z_n), respectively. By using the function “coxph” in S-PLUS (or R) with the “weights” option, we obtain the estimator of β at the M-step from

coxph(Surv(T_{nk}, Δ) ~ Z_{nk}, weights = W_{nk}),

where the censoring indicator Δ = (1, ···, 1) is a vector of ones of length nk.
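Putting the E-step and this weighted coxph M-step together, one EM cycle can be sketched in R as follows. This is an illustration only: em_cox_lb is our name, the initialization is arbitrary, and the E-step weights implement the form displayed in (11) with the discrete approximation f_i(t_j) = λ_j e^{β′Z_i} exp{−Λ(t_j)e^{β′Z_i}}.

```r
library(survival)

em_cox_lb <- function(x, delta, z, maxit = 100, tol = 1e-6) {
  t   <- sort(unique(x)); k <- length(t)
  n   <- length(x);       tau <- max(t)
  jx  <- match(x, t)                        # column of X_i among t_1,...,t_k
  Tnk <- rep(t, times = n)                  # stacked times, subject by subject
  Znk <- rep(z, each = k)                   # Z_i repeated k times
  Dnk <- rep(1, n * k)                      # all pseudo-observations are events
  beta <- 0; lam <- rep(1 / k, k)
  for (it in seq_len(maxit)) {
    ## E-step: weights w_ij of (11) at the current (beta, lam)
    ebz <- exp(beta * z)
    Lam <- cumsum(lam)
    f   <- outer(ebz, lam) * exp(-outer(ebz, Lam))           # f_i(t_j)
    Srt <- t(apply(f, 1, function(r) rev(cumsum(rev(r)))))   # sum_{l>=j} f_il
    mu  <- as.vector(f %*% t)                                # mu(Z_i)
    W   <- sweep(f, 2, tau - t, "*") / mu                    # "ghost" term
    W   <- W + (1 - delta) * outer(x, t, "<=") * f / Srt[cbind(1:n, jx)]
    W[cbind(1:n, jx)] <- W[cbind(1:n, jx)] + delta           # delta_i I(X_i = t_j)
    ## M-step: weighted Cox fit for beta, then the closed form (12) for lam
    Wv  <- as.vector(t(W)); pos <- Wv > 0
    fit <- coxph(Surv(Tnk[pos], Dnk[pos]) ~ Znk[pos], weights = Wv[pos])
    beta_new <- unname(coef(fit))
    Rw  <- t(apply(W, 1, function(r) rev(cumsum(rev(r)))))   # sum_{l>=j} w_il
    lam <- colSums(W) / colSums(exp(beta_new * z) * Rw)
    if (abs(beta_new - beta) < tol) { beta <- beta_new; break }
    beta <- beta_new
  }
  list(beta = beta, t = t, lambda = lam)
}
```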
Note that the algorithm computes β and λ iteratively through the EM steps, and the value of the expected complete-data log-likelihood ℓ_E(β, λ) increases with each EM step. More specifically, our EM algorithm falls within the general scheme of the ECM algorithm, a variation of the EM method proposed by Meng and Rubin (1993). The convergence of our EM algorithm to a local maximizer is guaranteed by the same conditions that ensure the convergence of the ECM algorithm, as proved in Meng and Rubin (1993). The uniqueness of the NPMLE is guaranteed by Assumption 5 in Appendix A.1.
4.3. Asymptotic Properties
In this section, we establish the strong consistency and asymptotic normality of the MLE under the regularity conditions listed in Appendix A.1. For the asymptotic results, we denote the MLE by (β̂n, Λ̂n) and let (β0, Λ0) be the true value. In Appendix A.4, we prove the strong consistency by the classical Kullback-Leibler information approach, which has been applied successfully to the NPMLE in traditional survival analysis (Murphy, 1994; Parner, 1998).
Theorem 1
Under the regularity conditions listed in Appendix A.1, the MLE (β̂n, Λ̂n) is consistent: β̂n converges to β0, and Λ̂n(t) converges to Λ0(t) almost surely and uniformly in t ∈ [0, τ] as n → ∞.
The computation of the MLE of Λ is based on the discretized version Λ̂n(t) = Σ_{t_j ≤ t} λ̂_j. The existence and uniqueness of the NPMLE can be proved based on the log-likelihood function ℓ_n(β, λ) in terms of {β, λ}, where λ ≡ {λ_1, …, λ_k} and

ℓ_n(β, λ) = Σ_{i=1}^n [δ_i {log λ(X_i) + β′Z_i} − Λ(X_i) e^{β′Z_i} − log μ(Z_i)],  (15)

with Λ(t) = Σ_{t_j ≤ t} λ_j and μ(Z) = ∫_0^{τ̂} exp{−Λ(u) e^{β′Z}} du. Let λ̂(·, β) be the maximizer of ℓ_n(β, λ) for given β. The existence and uniqueness of the NPMLE are guaranteed by Assumption 5: the information matrix of the profile likelihood evaluated at the true value β0 is positive definite.
Next, we apply the Z-theorem for infinite-dimensional estimating equations to prove the weak convergence of the estimators (van der Vaart and Wellner, 1996, Theorem 3.3.1, p. 310). The score equation for β is

U_1n(β, Λ) = n^{−1} Σ_{i=1}^n Z_i [δ_i − Λ(X_i) e^{β′Z_i} + ∫_0^τ Λ(u) e^{β′Z_i} exp{−Λ(u) e^{β′Z_i}} du/μ(Z_i)] = 0.  (16)

To obtain the MLE of Λ(·), consider a submodel defined by dΛ_η = (1 + ηh)dΛ, where h is a bounded and integrable function. Taking the derivative of ℓ_n(β, Λ_η) with respect to η, evaluating it at η = 0, and setting h(·) = 1(· ≤ t), we have the score equation for Λ,

U_2n(t, β, Λ) = n^{−1} Σ_{i=1}^n [N_i(t) − ∫_0^t Y_i(u) e^{β′Z_i} dΛ(u) + e^{β′Z_i} ∫_0^τ Λ(u ∧ t) exp{−Λ(u) e^{β′Z_i}} du/μ(Z_i)] = 0.  (17)
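The third term in (17) comes from differentiating −log μ(Z) along the submodel; a brief LaTeX sketch of that step, using μ(Z) as defined after (10):

```latex
% With dLambda_eta = (1 + eta h) dLambda and h(.) = 1(. <= t),
% Lambda_eta(u) = Lambda(u) + eta * Lambda(u ^ t), so
\frac{\partial}{\partial\eta}\Big|_{\eta=0}
  \left\{-\log \mu_{\beta,\Lambda_\eta}(Z)\right\}
= \frac{e^{\beta' Z}}{\mu_{\beta,\Lambda}(Z)}
  \int_0^{\tau} \Lambda(u \wedge t)\, e^{-\Lambda(u)e^{\beta' Z}}\, du .
```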
Denote the vector of score functions by U_n(·, β, Λ) ≡ {U_1n(β, Λ), U_2n(t, β, Λ)}, and its expectation under the true values (β0, Λ0) by U_0(·, β, Λ) ≡ {U_10(β, Λ), U_20(·, β, Λ)}, where U_10(β, Λ) = E_0{U_1n(β, Λ)} and U_20(t, β, Λ) = E_0{U_2n(t, β, Λ)}. Both the score function U_n and its expectation U_0 are defined on the parameter set 𝓑 × 𝓐, where the set 𝓑 is assumed to be compact in ℝ^p and the set 𝓐 consists of nondecreasing functions in the space of functions with bounded variation. Let ψ̂_n = (β̂_n, Λ̂_n), ψ = (β, Λ), and ψ_0 = (β_0, Λ_0).
By the definition of the MLE, U_n(·, ψ̂_n) = 0, and the true parameter ψ_0 satisfies U_0(·, ψ_0) = 0. In Appendix A.5, we prove a stochastic approximation for √n(U_n − U_0) in a neighborhood of ψ_0. Denote by U̇_{ψ0} the Fréchet derivative of the map U_0(·, ψ) evaluated at ψ_0; by the definition of the Fréchet derivative, U_0(·, ψ) − U_0(·, ψ_0) − U̇_{ψ0}(ψ − ψ_0) = o(||ψ − ψ_0||) as ψ → ψ_0.
The estimating function evaluated at ψ_0, √n U_n(·, ψ_0), is a sum of i.i.d. random quantities. We will prove by empirical process theory that √n U_n(·, ψ_0) converges weakly to 𝒵 = (𝒵_1, 𝒵_2), where 𝒵_1 is a Gaussian random vector and 𝒵_2 is a tight Gaussian process. The covariance matrix for 𝒵_1 is Σ_11 = E_0{U_1n(β_0, Λ_0)^{⊗2}}, and the covariance between 𝒵_2(s) and 𝒵_2(t) is Σ_22(s, t) = E_0{U_2n(s, β_0, Λ_0)U_2n(t, β_0, Λ_0)}. By the Z-theorem for infinite-dimensional estimating equations (van der Vaart and Wellner, 1996), we have
Theorem 2
Under the regularity conditions listed in Appendix A.1, √n(ψ̂_n − ψ_0) converges weakly to a tight, mean zero Gaussian process −U̇_{ψ0}^{−1}(𝒵).
Note that the asymptotic distribution of the sequence √n(ψ̂_n − ψ_0) is completely determined by the tightness of 𝒵 and its marginal covariance function. We characterize the Fréchet derivative U̇_{ψ0}, viewed as an operator on the parameter space 𝓑 × 𝓐. Define, for l = 0, 1, 2, the weight functions used in (19)–(23) below, where S_C(u | Z) = P(C ≥ u | Z) and μ_0(Z) = μ_{β0, Λ0}(Z).
By Assumption 5, the Fisher information of β for known Λ_0 is positive definite; this matrix is denoted by J_0.  (18)
Then the Fréchet derivative U̇_{ψ0} can be written in the following form:

U̇_{ψ0}(ψ − ψ_0) = (σ_11(β − β_0) + σ_12(Λ − Λ_0), σ_21(β − β_0) + σ_22(Λ − Λ_0)),  (19)

where the component operators σ_11, σ_12, σ_21, and σ_22 are obtained from the Gâteaux derivatives calculated in Appendix A.5.
We show the invertibility of U̇_{ψ0} by translating the operator into a Fredholm integral equation of the second kind (Tricomi, 1985, Chapter 2). We prove in Appendix A.5 that the inverse of U̇_{ψ0} exists and is continuous, with the form given in (20), where the functional Φ is defined there. We show in the appendix that Φ has an inverse Λ → Φ^{−1}(Λ), expressed in the following form as a function of t
(21)

where H(u, v) is the solution of the following integral equations

(22)

(23)
and with the notation for l = 0, 1,
By Theorem 2, √n(β̂_n − β_0) converges in distribution to a mean zero normal random vector characterized by

(24)

where the Gaussian process Φ^{−1}(𝒵_2) is obtained by applying (21) to 𝒵_2. Note that the stochastic integral is well defined via integration by parts, because the functions involved are of bounded variation on [0, τ].
Additionally, the process √n(Λ̂_n − Λ_0) converges weakly to a tight Gaussian process

(25)

where the processes appearing in (25) are determined by 𝒵 and the inverse operator U̇_{ψ0}^{−1}, as in Theorem 2.
If the baseline function Λ_0 were known, then √n(β̂_n − β_0) would converge in distribution to a Gaussian random variate with mean zero and the sandwich covariance matrix J_0^{−1} Σ_11 J_0^{−1}. Because of the variation associated with the profile-likelihood estimator Λ̂_n, the asymptotic variance of √n(β̂_n − β_0) is more complicated, with the extra terms indicated in (24). The variance-covariance matrix may be estimated by its empirical plug-in version, but the computation can be extremely complicated, as it requires solving the integral equation (22). We describe alternative methods for this computation in the following section.
4.4. Variance Estimation
Unlike the estimates of the regression coefficients themselves, the variance of the MLE β̂_n cannot be obtained directly from existing software such as R or SAS, because such software cannot incorporate the variation arising from the profile likelihood estimator. Instead, we can use bootstrapping techniques or the information matrix to estimate the variance of the estimators. When working with the observed full likelihood with unknown parameters (β, Λ), the total number of parameters has the same order as the number of observed distinct times, which often yields an information matrix of high dimension. Murphy and van der Vaart (1999) showed that the inverse of the information matrix for the profile likelihood provides valid variance estimates for the finite-dimensional parameters of interest, i.e., β̂_n, under semiparametric models. There is no general analytical formula for calculating the profile information matrix; thus we describe a numerical EM-aided differentiation approach (Murphy and van der Vaart, 1999; Chen and Little, 1999) to approximate it. Chen and Little (1999) also proved that the score function of the profile likelihood for the observed data equals, at the convergent point, the expected complete-data score function conditional on the observed data. Therefore, the second derivative of the log profile likelihood evaluated at the MLE β̂_n can be approximated by ∂²ℓ_E(β, λ(β))/∂β²|_{β=β̂_n}, which is the first derivative of the expected complete-data score function conditional on the observed data and profiled over λ(β) = (λ_1(β), ···, λ_k(β)) given by (12). By perturbation around the profile MLE β̂_n, the information matrix for β can be calculated as follows:
Step 1. Perturb the lth component of β̂_n = (β̂_1, ···, β̂_p) by a small value ε = 1/n in the neighborhood of β̂_l (in both directions). The perturbed estimator is denoted by β̂_{ε,l} = (β̂_1, ···, β̂_l ± ε, ···, β̂_p), where l = 1, ···, p.

Step 2. Approximate the lth row of the information matrix of β by the central difference of the expected complete-data score,

−[∂ℓ_E/∂β(β̂_{ε,l,+}, λ(β̂_{ε,l,+})) − ∂ℓ_E/∂β(β̂_{ε,l,−}, λ(β̂_{ε,l,−}))]/(2ε),

where the hazard function λ(β) is obtained from (12) using the M-step described in Section 4.2.
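A compact R sketch of this perturbation scheme, assuming a user-supplied function score_E(beta) that returns ∂ℓ_E/∂β at (β, λ(β)) from the M-step (both score_E and profile_info are hypothetical names):

```r
profile_info <- function(beta_hat, score_E, n, eps = 1 / n) {
  p <- length(beta_hat)
  I <- matrix(0, p, p)
  for (l in seq_len(p)) {
    b_plus  <- beta_hat; b_plus[l]  <- b_plus[l]  + eps
    b_minus <- beta_hat; b_minus[l] <- b_minus[l] - eps
    ## central difference of the profile score gives the lth row
    I[l, ] <- -(score_E(b_plus) - score_E(b_minus)) / (2 * eps)
  }
  (I + t(I)) / 2            # symmetrize; var-cov of beta-hat ~ solve(I)
}
```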
When estimating the variance of λ̂_n, the bootstrap approach can be used. This resampling approach is valid given the asymptotic normality established above for the MLEs β̂_n and λ̂_n. In this case, we obtain the variances of both β̂_n and λ̂_n.
5. SIMULATIONS
We performed simulation studies to evaluate the proposed methods and the corresponding EM algorithms in two settings: the nonparametric MLE with an increasing failure rate, and the profile likelihood estimators under the Cox proportional hazards model for length-biased data. We aimed to assess the small-sample accuracy and precision of our estimators, and to compare their performance with that of the existing methods in each setting. Each study comprised 1000 repetitions, with sample sizes of 200 and 400.
5.1. Estimating a Distribution Function with an Increasing Failure Rate
We generated independent pairs of (A, T̃), with failure times from a Weibull distribution, F(t) = 1 − exp{−(t/α_2)^{α_1}}, with α_1 = 2 and α_2 = 1, and truncation times from a uniform distribution to ensure the stationarity assumption. The specified Weibull distribution has an increasing failure rate. The censoring variables, measured from the examination time, were independently generated from uniform distributions.
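One way to generate such a sample by acceptance sampling is sketched below in R; the constants tau_a and cens_upper are our illustrative choices, as the paper does not report its exact uniform ranges.

```r
sim_lb <- function(n, tau_a = 4, cens_upper = 2) {
  x <- a <- delta <- numeric(n); kept <- 0
  while (kept < n) {
    Tt <- rweibull(1, shape = 2, scale = 1)  # unbiased failure time
    A  <- runif(1, 0, tau_a)                 # onset-to-entry (stationarity)
    if (Tt >= A) {                           # length-biased acceptance
      kept <- kept + 1
      V <- Tt - A                            # forward recurrence time
      C <- runif(1, 0, cens_upper)           # residual censoring time
      x[kept] <- A + min(V, C); a[kept] <- A
      delta[kept] <- as.numeric(V <= C)
    }
  }
  data.frame(x = x, a = a, delta = delta)
}
```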
Table 1 compares the performance of our proposed estimator with that of Tsai’s estimator (Tsai, 1988), denoted by F̂c. When F(t) is greater than 0.5, both estimators achieved excellent accuracy. In contrast, when F(t) is small, both had downward biases, which decreased with increasing sample size. As expected, the empirical standard deviations of our estimator were as much as 25% lower than those of Tsai’s estimator.
Table 1.
Summary Statistics of Simulations for the Estimated Distribution Function with Increasing Failure Rate
| Sample Size | C% | F(t) | F̂p(t): Est. | ESD | ESMSE | F̂c(t): Est. | ESD | ESMSE |
|---|---|---|---|---|---|---|---|---|
| 200 | 10% | 0.10 | 0.060 | 0.024 | 0.047 | 0.066 | 0.032 | 0.047 |
| 0.25 | 0.211 | 0.044 | 0.059 | 0.211 | 0.045 | 0.060 | ||
| 0.50 | 0.486 | 0.045 | 0.047 | 0.469 | 0.046 | 0.056 | ||
| 0.75 | 0.745 | 0.030 | 0.030 | 0.732 | 0.034 | 0.038 | ||
| 0.90 | 0.898 | 0.018 | 0.018 | 0.892 | 0.020 | 0.021 | ||
| 200 | 30% | 0.10 | 0.068 | 0.028 | 0.043 | 0.065 | 0.033 | 0.048 |
| 0.25 | 0.222 | 0.045 | 0.053 | 0.210 | 0.046 | 0.061 | ||
| 0.50 | 0.490 | 0.046 | 0.047 | 0.468 | 0.049 | 0.058 | ||
| 0.75 | 0.748 | 0.031 | 0.032 | 0.732 | 0.036 | 0.041 | ||
| 0.90 | 0.901 | 0.019 | 0.019 | 0.892 | 0.022 | 0.024 | ||
| 400 | 10% | 0.10 | 0.065 | 0.019 | 0.040 | 0.075 | 0.025 | 0.035 |
| 0.25 | 0.217 | 0.031 | 0.045 | 0.221 | 0.033 | 0.043 | ||
| 0.50 | 0.495 | 0.033 | 0.033 | 0.478 | 0.033 | 0.039 | ||
| 0.75 | 0.748 | 0.021 | 0.021 | 0.738 | 0.023 | 0.026 | ||
| 0.90 | 0.899 | 0.012 | 0.012 | 0.895 | 0.014 | 0.015 | ||
| 400 | 30% | 0.10 | 0.080 | 0.021 | 0.029 | 0.075 | 0.025 | 0.036 |
| 0.25 | 0.236 | 0.033 | 0.036 | 0.221 | 0.033 | 0.044 | ||
| 0.50 | 0.501 | 0.033 | 0.034 | 0.477 | 0.034 | 0.041 | ||
| 0.75 | 0.751 | 0.022 | 0.022 | 0.737 | 0.025 | 0.028 | ||
| 0.90 | 0.901 | 0.013 | 0.013 | 0.895 | 0.016 | 0.016 | ||
Note: F̂p(t) is the proposed estimator; F̂c(t) is Tsai’s conditional estimator; C% = censoring percentage; Est. = average of estimates; ESD = empirical standard deviation; ESMSE = empirical square root of the mean squared error = (bias² + ESD²)^{1/2}.
5.2. Estimating Regression Coefficients Under the Cox Model
We generated unbiased failure times T̃ from the proportional hazards model with two covariates, where β = (β_1, β_2) = (0.5, 1), the binary covariate Z_1 ~ Bernoulli(0.5), the continuous covariate Z_2 ~ Uniform(−0.5, 0.5), and the baseline hazard function is λ_0(t) = t. The censoring times C were independently generated either from uniform distributions or from the specified covariate-dependent distributions (see Table 2).
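As an illustration of this design (a sketch; tau_a is an assumed upper bound for the uniform onset times), uncensored failure times with baseline hazard λ_0(t) = t can be generated by inverting Λ(t | Z) = e^{β′Z} t²/2:

```r
sim_cox_lb <- function(n, beta = c(0.5, 1), tau_a = 5) {
  a <- time <- z1 <- z2 <- numeric(n); kept <- 0
  while (kept < n) {
    Z1 <- rbinom(1, 1, 0.5)
    Z2 <- runif(1, -0.5, 0.5)
    u  <- exp(beta[1] * Z1 + beta[2] * Z2)
    Tt <- sqrt(2 * rexp(1) / u)        # solves Lambda(t | Z) = Exp(1) draw
    A  <- runif(1, 0, tau_a)           # onset-to-entry time
    if (Tt >= A) {                     # keep only length-biased samples
      kept <- kept + 1
      a[kept] <- A; time[kept] <- Tt; z1[kept] <- Z1; z2[kept] <- Z2
    }
  }
  data.frame(a = a, time = time, z1 = z1, z2 = z2)  # censoring added afterwards
}
```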
Table 2.
Summary Statistics of Simulations for Estimating Regression Coefficients under Cox Model with β = (β1, β2) = (0.5, 1). Mean SE is the mean of the estimated standard errors
| Cohort Size | C% | Proposed: Est. | ESD | Mean SE | 95% CP | EE-I: Est. | ESD | EE-II: Est. | ESD | Tsai’s Method: Est. | ESD |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 200 | 15% | (.49,.98) | (.11,.20) | (.11,.19) | (.96,.95) | (.50,1.01) | (.14,.25) | (.51,1.04) | (.13,.24) | (.51,1.01) | (.12,.22) |
| 30% | (.48,.94) | (.11,.21) | (.11,.20) | (.94,.93) | (.47,.91) | (.19, .33) | (.51,1.01) | (.16,.28) | (.50,1.01) | (.14,.25) | |
| 50% | (.46,.93) | (.12,.21) | (.12,.20) | (.93,.94) | (.42,.85) | (.23,.40) | (.51,1.02) | (.19,.34) | (.49,1.01) | (.18,.31) | |
| 400 | 15% | (.49,.98) | (.08,.14) | (.08,.14) | (.95,.95) | (.50,.99) | (.10,.18) | (.51,1.02) | (.09,.17) | (.52,1.01) | (.09,.15) |
| 30% | (.48,.97) | (.08,.15) | (.08,.14) | (.93,.93) | (.46,.94) | (.14,.24) | (.50,1.01) | (.11,.21) | (.50,1.00) | (.10,.17) | |
| 50% | (.48,.94) | (.08,.15) | (.08,.15) | (.94,.92) | (.42,.84) | (.18,.31) | (.51,1.02) | (.14,.24) | (.49,1.01) | (.13,.22) | |
| λc = t exp (0.5Z1 + 0.5Z2) |
| 200 | 30% | (.48,.95) | (.11,.20) | (.11,.20) | (.94,.93) | (.39,.89) | (.18,.31) | (.46,.97) | (.15,.27) | (.58,1.07) | (.13,.23) |
| 400 | 30% | (.48,.97) | (.08,.15) | (.08,.14) | (.96,.96) | (.39,.91) | (.13,.23) | (.45,.97) | (.11,.20) | (.58,1.08) | (.10,.17) |
| λc = t exp (Z2) | |||||||||||
| 200 | 30% | (.48,.96) | (.11,.20) | (.11,.20) | (.95,.94) | (.52,.82) | (.18,.31) | (.53,.92) | (.15,.28) | (.49,1.15) | (.14,0.24) |
| 400 | 30% | (.48,.96) | (.08,.15) | (.08,.14) | (.93,.93) | (.50,.80) | (.12,.22) | (.51,.91) | (.11,.19) | (.50,1.15) | (.10,.17) |
| C ~ Z1U (0, 1) + (1 − Z1)U (0, 1.8) | |||||||||||
| 200 | 30% | (.49,.96) | (.12,.21) | (.12,.20) | (.95,.93) | (.82,.92) | (.20,.33) | (.71,1.00) | (.17,.28) | (.41,1.02) | (.15,.25) |
| 400 | 30% | (.48,.96) | (.09,.14) | (.08,.14) | (.92,.95) | (.82,.89) | (.11,.22) | (.69,.97) | (.12,.19) | (.70,.96) | (.09,.17) |
For a light (15%) or moderate (30%) censoring percentage, the mean estimates of the coefficients agreed well with the true parameters, whether or not the censoring distribution depended on the covariates. Even with heavy censoring (50%), the inferences associated with the proposed method were fairly accurate for all of the scenarios investigated: the means of the estimated standard errors were close to the empirical standard errors, and the coverage of the 95% confidence intervals was reasonable, ranging from 92% to 96%.
In Table 2, we also compare the performance of the proposed MLE with that of the existing estimation methods for length-biased data. The estimating equation approaches of Qin and Shen (2010) are as follows:
where S_C(·) is the survival function of the censoring time C. Tsai (2009) proposed the pseudo-partial likelihood method, with estimating equations based on the score statistics.
When the censoring was independent of the covariates, all of the estimators had small biases under light censoring (15%). With moderate (30%) or heavy (50%) censoring, the biases associated with the MLE were much smaller than those associated with EE-I. The MLE method always exhibited clearly superior efficiency, with smaller empirical standard errors. For instance, based on a sample size of 200, the standard errors associated with the estimating equations were 1.12 to 1.62 times greater, and those associated with the pseudo-partial likelihood method 1.09 to 1.50 times greater, than those associated with the MLE method. When the censoring distribution depended on the covariates, the estimators obtained by EE-I, EE-II and EE-PL were biased compared with those obtained by the MLE method. In summary, the MLE approach is the most efficient of the four methods, and is also the most robust to various censoring mechanisms.
6. A REAL DATA EXAMPLE
Dementia is a progressive degenerative medical condition and one of the leading causes of death in the United States and Canada. The Canadian Study of Health and Aging was a multicenter epidemiologic study of dementia, in which 14,026 subjects aged 65 years or older were randomly chosen throughout Canada to receive an invitation to a health survey. A total of 10,263 subjects agreed to participate in the study (Wolfson et al., 2001). The participants were then screened for dementia, and 1132 people were identified as having the disease. The individuals with dementia were followed until their deaths or last follow-up dates in 1996, and their dates of dementia onset were ascertained from their medical records.
After excluding subjects with missing dates of disease onset or missing dementia subtype classifications, a total of 818 patients remained (393 with probable Alzheimer’s disease, 252 with possible Alzheimer’s disease, and 173 with vascular dementia). Other study variables included the approximate date of dementia onset, the date of screening for dementia, the date of death or censoring, and a death indicator. Given that the prevalent cases were ascertained cross-sectionally, Asgharian et al. (2006) validated the stationarity assumption, i.e., that the incidence of dementia did not change over the period of the study.
At the end of the study, 638 of the 818 patients had died and the others were right censored. Within this elderly cohort, it seems reasonable to assume that the overall death rate increases with age. Applying the NPMLE approaches described in Sections 2 and 3, we estimated the hazard function for each subtype of dementia, and plotted the survival of patients with probable Alzheimer’s disease, possible Alzheimer’s disease, and vascular dementia with and without the constraint of an increasing risk of death (see Figure 1). It is not surprising that the estimated survival curves with the constraint, i.e., with additional information, have narrower confidence intervals than the corresponding curves without the constraint. As pointed out by one referee, a monotone hazard constraint may not hold for death due to dementia, particularly for patients with vascular dementia, because of the uncertainty associated with the cause of death.
Figure 1.
Estimated survival curves according to subtypes of dementia, with and without the constraint of increasing risk of death with age
Using the diagnosis subtype of possible Alzheimer’s disease as the baseline group, we defined two indicator variables for the other two subtypes of dementia in the Cox proportional hazards model. Applying the method proposed in Section 4, the estimated covariate effects of the two subtypes of dementia and their standard errors are listed in Table 3. The results showed that the long-term survival distributions were statistically significantly different between the group with vascular dementia and the group with possible Alzheimer’s disease, and marginally different between the group with probable Alzheimer’s disease and the group with possible Alzheimer’s disease. We also analyzed the same data set using the estimating equation methods (EE-I and EE-II) of Qin and Shen (2010) and the pseudo-partial likelihood method (EE-PL) of Tsai (2009). The estimated coefficients and associated standard errors obtained by both EE-II and EE-PL indicated no statistically significant survival differences among the three subtypes of dementia. The results from EE-I suggested a statistically significant survival difference between the group with vascular dementia and the group with possible Alzheimer’s disease, but no statistically significant survival difference between the group with vascular dementia and the group with probable Alzheimer’s disease. The discrepancy in the inferences is most likely caused by the loss in efficiency when using the estimating equation or pseudo-partial likelihood methods rather than the MLE method.
Table 3.
Estimates (Standard Errors) of Regression Coefficients Using Length-biased Adjusted Methods for Dementia Data.
| MLE | EE-I | EE-II | EE-PL | |
|---|---|---|---|---|
| Probable Alzheimer | 0.125 (0.062) | 0.109 (0.092) | 0.134 (0.091) | 0.064(0.081) |
| Vascular Dementia | 0.185 (0.077) | 0.245 (0.110) | 0.208 (0.110) | 0.164(0.111) |
7. CONCLUDING REMARKS
We have proposed new EM algorithms for length-biased data to obtain full maximum likelihood estimators under three settings, where the missing-data mechanism in the EM algorithm is the left truncation of the length-biased data. In contrast to Vardi’s (1989) EM algorithm for estimating nonparametric survival distributions, the advantage of the new EM algorithm is that one can directly estimate the nonparametric survival distribution or hazard function of the unbiased failure time T̃.
One major challenge for maximum likelihood estimation involving infinite-dimensional parameters is computational intractability. We have implemented the new EM algorithm together with the profile likelihood method for jointly estimating the baseline hazard function and the covariate coefficients under the Cox regression model for length-biased data. Commercially available statistical software for the Cox model can be adapted for easy computation. The EM algorithm is not computationally intensive even with continuous covariates, since the w_ij and λ_j can be obtained easily from closed-form expressions.
Similar to the NPMLE for traditional survival data, the proposed method requires the observation of at least one failure time for the large sample properties to hold in the length-biased settings. As shown in our empirical studies, the proposed computational algorithms for solving the MLE perform well in terms of accuracy, and are more efficient than the existing estimating equation approaches, which are in turn more efficient than the conditional approach (Wang et al., 1993). Without assuming a known parametric distribution for Z, as in Bergeron et al. (2008), maximizing the likelihood function (1) is as efficient as maximizing the full likelihood that includes the marginal distribution of Z.
Parallel to the observation of Zeng and Lin (2007) for traditional survival data, the estimators obtained from the MLE are much more robust and efficient than those from the estimating equation approaches, and are well suited to the proportional hazards regression model and other nonparametric estimation problems for length-biased right-censored data. The proposed EM algorithm may be generalized further to other semiparametric models, and tools for model checking should be developed for length-biased right-censored data.
Supplementary Material
Acknowledgments
This research was partially supported by National Institute of Health grant R01-CA079466.
We thank one Associate Editor and two Referees for their very constructive comments. We also thank Professor Masoud Asgharian and investigators of the Canadian Study of Health and Aging (CHSA) for providing us the dementia data from CHSA. The data reported in the example were collected as part of the CHSA. The core study was funded by the Seniors' Independence Research Program, through the National Health Research and Development Program of Health Canada (Project no.6606-3954-MC(S)). Additional funding was provided by Pfizer Canada Incorporated through the Medical Research Council/Pharmaceutical Manufacturers Association of Canada Health Activity Program, NHRDP Project 6603-1417-302(R), Bayer Incorporated, and the British Columbia Health Research Foundation Projects 38 (93-2) and 34 (96-1). The study was coordinated through the University of Ottawa and the Division of Aging and Seniors, Health Canada.
A APPENDIX
A.1 Assumptions
Denote the Euclidean norm by |·|; for a vector z = (z_1, …, z_p)′, |z| ≡ (|z_1|² + ··· + |z_p|²)^{1/2}. To avoid measurability issues, probability statements are understood in terms of outer probability (van der Vaart and Wellner, 1996). We adopt the convention that 0/0 = 0. We assume the following regularity conditions:
Assumption 1. The true hazard function λ_0(·) is continuously differentiable. In addition, the upper bound τ of the support of the cumulative hazard function Λ_0 is finite.

Assumption 2. The parameter β is in a compact set 𝓑 that contains β_0. The parameter set 𝓐 for the baseline function contains all nondecreasing functions Λ satisfying Λ(0) = 0 and Λ(τ) < ∞.

Assumption 3. The residual censoring time C satisfies P(C > V) > 0, and its survival function S_C(·) is continuous.

Assumption 4. For the covariate vector Z, the terms E_0|Z|² and E_0|e^{β′Z}| are bounded.

Assumption 5. The information matrix −∂²E[ℓ_n(β, λ̂(·, β))]/∂β² evaluated at the true value β_0 is positive definite.

Assumption 6. If P(b′Z = c_0) = 1 for some constant c_0, then b = 0.
Assumptions 1 and 3 imply that the bivariate function Q(t, u) defined in (23) is continuous on [0, τ] × [0, τ]. Assumption 3 implies that censoring may occur after V. Assumption 5 is a classical condition in the study of the Cox model for traditional survival data (Andersen et al., 1992, Condition VII.2.1(e), page 497). It implies that the matrix J_0 defined in (18), the information matrix for β when the baseline function Λ_0 is known, is positive definite. The positive definiteness of J_0 can also be implied by the identifiability of the model (Rothenberg, 1971). Assumption 6 means that there is no collinearity among the covariates, which ensures model identifiability.
A.2 Proof of the lemma on the convergence of τ̂ = tk to τ
Recall that X_i = min(A_i + V_i, A_i + C_i), and t_k = max{X_1, ···, X_n}. For any arbitrarily small ε > 0, P(t_k ≤ τ − ε) = {P(X_1 ≤ τ − ε)}^n = {1 − P(X_1 > τ − ε)}^n, where the tail probability P(X_1 > τ − ε) can be evaluated via the mean value theorem. Given the above equation, for any arbitrarily small η > 0, P{n^η(τ − t_k) ≥ ε} = {1 − P(X_1 > τ − εn^{−η})}^n, where τ − εn^{−η} ≤ ξ ≤ τ denotes the intermediate point from the mean value theorem. Thus, as long as E[C] > 0, which is implied by Assumption 3, we have w(ξ) → w(τ) > 0 and f(ξ) → f(τ) > 0 as n → ∞, since τ is the upper bound for the support of the population time and Λ(τ) < ∞. Therefore, when n → ∞, P{n^η(τ − t_k) ≥ ε} → 0, which implies that τ̂ → τ at a rate faster than n^{1/2} but slower than n (i.e., for 1/2 < η < 1, the above probability converges to zero).
Note that by the Borel-Cantelli lemma, τ̂_n → τ in probability implies that for every subsequence there is a further subsequence {n′} such that τ̂_{n′} → τ almost surely (Ferguson, 1996, page 8). This, combined with the fact that τ can be consistently estimated by τ̂_n, completes the proof of the strong consistency of Λ̂_n. As τ̂_n ≤ τ, these facts also imply that the mean μ of the population failure time T̃ can be consistently estimated by μ̂ = π̂τ̂_n = ∫ t dF̂(t).
A.3 Consistency of Λ̂n with increasing failure rate
Let ||·||τ denote the supremum norm over [0, τ]. We have the following strong consistency result for the NPMLE Λ̂n(·) proposed in Section 3:
Proposition 1
Suppose Λ is convex on its support [0, τ]. Under the regularity conditions, ||Λ̂_n − Λ||_τ → 0 almost surely as n → ∞.
We adapt the proof of Huang and Wellner (1995). Let ε_n = ||Λ_n − Λ||_τ. As argued in Asgharian and Wolfson (2005), ε_n → 0 almost surely as n → ∞. Since Λ is convex on [0, τ], it must be continuous on [0, τ]. The function Λ − ε_n is convex and is a minorant of Λ_n, i.e., Λ(s) − ε_n ≤ Λ_n(s) for all 0 ≤ s ≤ τ. By the definition of Λ̂_n as the greatest convex minorant, we have, for all 0 ≤ s ≤ τ,

Λ(s) − ε_n ≤ Λ̂_n(s) ≤ Λ_n(s).
It follows that −εn ≤ Λ̂n(s) − Λ(s) ≤ Λn(s) − Λ(s) ≤ εn for all 0 ≤ s ≤ τ. The conclusion of the proposition follows as ||Λ̂n − Λ||τ ≤ εn goes to 0 almost surely.
A.4 Consistency: Proof of Theorem 1
Note that the log-likelihood function ℓ_n(β, λ) is strictly concave in λ, as each term of ℓ_n(β, λ) is concave or strictly concave in λ and a sum of concave functions is concave. Hence, for each β in the compact set 𝓑, we can find a unique maximizer λ̂(·, β) of the likelihood function ℓ_n(β, λ). The existence of the NPMLE for {β, λ} then follows from the compactness of 𝓑 and the continuity in β of the profile likelihood ℓ_n(β, λ̂(·, β)). The uniqueness of the NPMLE for large samples is guaranteed by Assumption 5.
The technical details of the consistency proof are similar to those of Murphy (1995) or Parner (1998). We provide only a sketch of the proof. As the MLE (β̂n, Λ̂n) maximizes the log-likelihood function ℓn, the method is to use the empirical Kullback-Leibler distance ℓn(β̂n, Λ̂n) − ℓn(β0, Λ0), which must always be nonnegative. If (β̂n, Λ̂n) converges at all, say, to (β*, Λ*), then ℓn(β̂n, Λ̂n) − ℓn(β0, Λ0) must converge to the negative Kullback-Leibler distance between Pβ*; Λ* and Pβ0, Λ0 by the strong law of large numbers, where Pβ; Λ is the probability measure under the parameter (β, Λ). The Kullback-Leibler distance between Pβ*; Λ* and Pβ0, Λ0 therefore must be zero, and we conclude that Pβ*; Λ* = Pβ0, Λ0 almost surely. It then follows by model identifiability that β* = β0 and Λ* = Λ0.
We need to find, for any subsequence of (β̂_n, Λ̂_n), a further convergent subsequence. The first step is to show that (β̂_n, Λ̂_n) stays bounded. As β̂_n is in a compact set, it must stay bounded. Because (β̂_n, Λ̂_n) maximizes the likelihood function, ℓ_n(β̂_n, Λ̂_n) − ℓ_n(β̄, Λ̄) ≥ 0 for each (β̄, Λ̄) in the parameter set. Recall that τ̂ = t_k. We show that Λ̂_n(τ̂) stays bounded by the method of contradiction. Suppose that Λ̂_n(τ̂) diverges. Then we can construct some sequence {β̄_n, Λ̄_n} such that the empirical Kullback-Leibler distance ℓ_n(β̂_n, Λ̂_n) − ℓ_n(β̄_n, Λ̄_n) would become negative infinity; this is a contradiction, as the Kullback-Leibler distance is always nonnegative. The construction of the contradiction is along the same lines as in Murphy (1994). Briefly, we choose β̄_n = β_0 and define Λ̄_n as in

(26)
It can be shown easily that Λ̄n converges to Λ0 almost surely and uniformly in t. By a technical argument similar to that of Murphy (1995), we can show that ℓn(β̂n, Λ̂n) − ℓn(β̄n, Λ̄n) → −∞ as n → ∞. This is impossible so Λ̂n must stay bounded.
As Λ̂_n stays bounded, we can apply Helly’s selection principle to find a convergent subsequence of (β̂_{n_k}, Λ̂_{n_k}) for an arbitrary subsequence of the original sequence indexed by {1, ···, n}. By the strong law of large numbers and the classical Kullback-Leibler information approach, such a convergent subsequence must converge to (β_0, Λ_0). Since for any given subsequence {n_k} we can identify a further subsequence of (β̂_{n_k}, Λ̂_{n_k}) that converges to (β_0, Λ_0), Helly’s selection theorem implies that the entire sequence (β̂_n, Λ̂_n(t)) must converge to (β_0, Λ_0(t)) for each t ∈ [0, τ]. Because Λ_0(·) is monotone and continuous, the pointwise convergence of Λ̂_n(t) on [0, τ] is also uniform in t. The convergence holds almost surely, as the proof is carried out for a fixed ω in the underlying probability space Ω, with the law of large numbers invoked only countably many times.
A.5 Asymptotic Normality: Proof of Theorem 2
We prove the asymptotic normality by the Z-theorem for infinite-dimensional estimating equations (van der Vaart and Wellner, 1996, Theorem 3.3.1, page 310). This approach has been applied successfully by Murphy (1995, Theorem 1) and Parner (1998, Theorem 2), among many others. The proof requires the confirmation of the three main conditions of the Z-theorem: Fréchet differentiability of U_0, weak convergence of √n U_n(·, ψ_0), and a stochastic approximation for the estimating equations, which we outline below.
Fréchet Derivative and its Invertibility
We first show that the population estimating equation U_0 is Fréchet differentiable and that its Fréchet derivative is continuously invertible. Recall that U_0(·, β, Λ) = (U_10(β, Λ), U_20(·, β, Λ)), the expectation of U_n under (β_0, Λ_0).
The Fréchet derivative can be calculated from the Gâteaux variations of U_0(β, Λ) at (β_0, Λ_0); that is, we differentiate U_0(β_η, Λ_η) with respect to η and evaluate at η = 0, where the submodels are β_η = β_0 + ηβ and Λ_η = Λ_0 + ηΛ.
The Gâteaux derivatives of U_20(t, β, Λ) and of U_10(β, Λ), evaluated at (β_0, Λ_0), are computed directly from these submodels.
To obtain the weak convergence results, we need to strengthen the Gâteaux differentiability to Fréchet differentiability, essentially for the proof of tightness (van der Vaart and Wellner, 1996, page 310). The Fréchet differentiability of U_0(β, Λ) can be confirmed by definition; its derivative has the form in (19). Note that the operator U̇_{ψ0} is a continuous linear operator defined on the parameter space in the product of ℝ^p and the Banach space L_2[0, τ]. If the inverse operator exists, then it must be continuous by Banach’s continuous inverse theorem (Zeidler, 1995, page 179). Hence, to prove the continuous invertibility of U̇_{ψ0}, we only need to show the existence of the inverse operator U̇_{ψ0}^{−1}.
To show the existence of the inverse operator U̇_{ψ0}^{−1}, it suffices, by the formula in (20), to show that σ_11 and Φ are invertible. The operator σ_11 is linear, with the matrix J_0 defined in (18) being the Fisher information for β for known Λ_0. By Assumption 5, the matrix J_0 has an inverse, and hence σ_11 is invertible. The operator Φ has the following form:
The invertibility of Φ is equivalent to showing that there exists a unique solution to the equation Φ(Λ) = Λ̃ for any function Λ̃ of bounded variation. Taking the derivative with respect to t on both sides of the equation yields

(27)

where Q(t, u) is defined previously in (23). We observe that the integral equation (27) is a Fredholm equation of the second kind. By Assumptions 1 and 3, the bivariate function Q(t, u) defined in (23) is continuous on [0, τ] × [0, τ]. Also by Assumption 3, the function involved is continuous and bounded away from 0 for t > 0. By the classical theory of integral equations (Tricomi, 1985, Chapter 2), there is a unique solution dΛ(t) to the Fredholm integral equation (27), characterized by the kernel H(t, u) satisfying equation (22). Finally, the invertibility of the functional Φ follows, and its inverse operator Φ^{−1}(Λ) has the form expressed in (21).
Weak Convergence of √n U_n(·, β_0, Λ_0)

As the true value (β_0, Λ_0) satisfies U_0(β_0, Λ_0) = 0, the estimating function √n U_n(·, β_0, Λ_0) is centered. By the multivariate central limit theorem for sums of independently and identically distributed (i.i.d.) random vectors, √n U_1n(β_0, Λ_0) converges in law to 𝒵_1, provided that the second moment is finite. The process √n U_2n(·, β_0, Λ_0) is a sum of i.i.d. processes of bounded variation on [0, τ]. By a lemma for the central limit theorem for processes of bounded variation (van der Vaart and Wellner, 1996, Example 2.11.16), it converges to a tight Gaussian process, 𝒵_2, provided that the second moment is finite. The weak convergence of √n U_n(·, β_0, Λ_0) then follows by the continuous mapping theorem.
Stochastic Approximation
To apply the Z-theorem for infinite-dimensional estimating equations (van der Vaart and Wellner, 1996, Theorem 3.3.1), we need to confirm the stochastic approximation condition displayed later in this subsection.
The estimating functions are defined on ℬ × 𝒜, where, by Assumption 2, the set ℬ is compact and contains β0, and 𝒜 is a set of nondecreasing functions such that each Λ ∈ 𝒜 satisfies Λ(0) = 0 and Λ(τ) < ∞; hence 𝒜 contains Λ0. To apply the Z-theorem, we need Λ to range over the closed linear subspace lin 𝒜 generated by the set 𝒜. The subspace lin 𝒜 is viewed in the space of functions of bounded variation on [0, τ] endowed with the variation norm ‖·‖v, defined by the total variation of Λ on [0, τ], that is,

\[
\|\Lambda\|_v = \sup \sum_{j=1}^{m'} \bigl| \Lambda(s_j) - \Lambda(s_{j-1}) \bigr|,
\]

where the supremum is taken over all finite partitions {0 = s0 < s1 < ⋯ < sm′ = τ} of [0, τ].
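As a quick check of this norm on the monotone class 𝒜 itself: for a nondecreasing Λ with Λ(0) = 0, every partition sum telescopes, so

\[
\|\Lambda\|_v = \Lambda(\tau) - \Lambda(0) = \Lambda(\tau) < \infty.
\]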
We first derive the score functions for β and Λ(·) contributed by a single observation, denoted generically by 𝒪. We keep 𝒪 in the notation for the score functions to emphasize their dependence on the data. By straightforward calculation, the score function for β is
(28)
For the infinite-dimensional parameter Λ(·), consider a submodel defined by dΛη = (1 + ηh)dΛ, where h is a bounded and integrable function. By taking the derivative of ℓn(β, Λη) with respect to η and evaluating it at η = 0, we have the score operator for Λ
(29)
Taking h(·) = 1(· ≤ t) in (29), we have the score function for Λ
(30)
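To see how the submodel calculation in (29) and (30) works in a transparent special case, consider the classical Cox log-likelihood for a single right-censored observation (Y, δ, Z) without length bias, ℓ(β, Λ) = δ log dΛ(Y) + δβ′Z − eβ′ZΛ(Y); this is a simplified illustration only, since the full likelihood here carries additional length-bias terms. Under dΛη = (1 + ηh)dΛ,

\[
\frac{\partial}{\partial \eta}\, \ell(\beta, \Lambda_\eta) \Big|_{\eta = 0}
= \delta\, h(Y) - e^{\beta' Z} \int_0^{Y} h(u)\, d\Lambda(u),
\]

and the choice h(·) = 1(· ≤ t) yields the familiar score δ1(Y ≤ t) − eβ′Z Λ(Y ∧ t).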
We can write the estimating functions as empirical averages of these score functions. Let ℓ̇(t, 𝒪, ψ) = {ℓ̇1(𝒪, ψ), ℓ̇2(t, 𝒪, ψ)}, where ψ = (β, Λ). We next introduce some notation from empirical process theory. Let ℙn be the empirical probability measure; then Un(t, β, Λ) = ℙn ℓ̇(t, 𝒪, ψ). Denote the empirical process by 𝔾n = √n(ℙn − P0), where P0 is the expectation under ψ0. Note that √n(Un − U0)(ψ) is the empirical process 𝔾n indexed by the class of functions {ℓ̇(t, 𝒪, ψ): ψ ∈ ℬ × lin 𝒜, t ∈ [0, τ]}.
Let the norm ‖·‖ on ℬ × lin 𝒜 be defined by ‖(β, Λ)‖ = |β| + ‖Λ‖v. Then the stochastic condition that we want to confirm is

\[
\sqrt{n}\,(U_n - U_0)(\hat\beta, \hat\Lambda) - \sqrt{n}\,(U_n - U_0)(\beta_0, \Lambda_0)
= o_P\bigl( 1 + \sqrt{n}\, \| (\hat\beta, \hat\Lambda) - (\beta_0, \Lambda_0) \| \bigr).
\]
To apply the functional central limit theory, we show that the class of functions {ℓ̇(t, 𝒪, ψ) − ℓ̇(t, 𝒪, ψ0): ‖ψ − ψ0‖ < δ, t ∈ [0, τ]} is P0-Donsker. As the component ℓ̇1 can be treated by an analogous and simpler argument, we only need to show that {ℓ̇2(t, 𝒪, ψ) − ℓ̇2(t, 𝒪, ψ0): ‖ψ − ψ0‖ < δ, t ∈ [0, τ]} is P0-Donsker, where ℓ̇2(t, 𝒪, ψ) is the score function for Λ given in (30).
First, the class of functions {exp(β′Z): β ∈ ℬ} is P0-Donsker, as it is indexed by a compact set of finite dimension. The class of functions of bounded variation on [0, τ] is P0-Donsker (van der Vaart, 1998, page 273). As each function Λ(t) is of bounded variation on [0, τ], the class of functions combining Λ(t) with exp(β′Z), indexed by t ∈ [0, τ], β ∈ ℬ, and Λ ∈ 𝒜, is P0-Donsker. Similarly, we can show that the remaining classes of functions entering ℓ̇2, indexed by t ∈ [0, τ], β ∈ ℬ, and Λ ∈ 𝒜, and the class {μZ(β, Λ): β ∈ ℬ, Λ ∈ 𝒜} are P0-Donsker. Note that their envelope functions are (τ − u)eβ′ZΛ(τ) and τ < ∞, respectively. By Assumption 4, (τ − u)2Λ2(τ)E0e2β′Z < ∞ for all β ∈ ℬ and Λ ∈ 𝒜. The verification rests on the permanence properties of Donsker classes: sums, products, and Lipschitz transformations of suitably bounded P0-Donsker classes remain P0-Donsker.
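To illustrate the permanence argument in a self-contained way (a generic sketch; the classes named here are simplified stand-ins for those appearing in ℓ̇2), suppose ℱ and 𝒢 are uniformly bounded P0-Donsker classes. Then {f + g}, {fg}, and {φ ∘ f} for a fixed Lipschitz map φ are again P0-Donsker (van der Vaart and Wellner, 1996, Section 2.10). For instance, with ℱ = {eβ′Z: β ∈ ℬ} for bounded Z and 𝒢 = {Λ(t): Λ ∈ 𝒜, t ∈ [0, τ]} with Λ(τ) uniformly bounded,

\[
\bigl\{ \Lambda(t)\, e^{\beta' Z} \bigr\} \quad \text{and} \quad \bigl\{ \exp\bigl( -\Lambda(t)\, e^{\beta' Z} \bigr) \bigr\}
\]

are P0-Donsker, the latter because x ↦ e−x is Lipschitz on [0, ∞).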
Furthermore, as ‖ψ − ψ0‖ → 0, ℓ̇2(t, 𝒪, ψ) converges to ℓ̇2(t, 𝒪, ψ0) for each t. The convergence also holds in second moment by the dominated convergence theorem. It follows that

\[
\sup_{t \in [0, \tau]} P_0 \bigl\{ \dot{\ell}_2(t, \mathcal{O}, \psi) - \dot{\ell}_2(t, \mathcal{O}, \psi_0) \bigr\}^2 \to 0
\quad \text{as } \|\psi - \psi_0\| \to 0.
\]

The stochastic approximation condition now follows by a technical lemma of van der Vaart and Wellner (1996, Lemma 3.3.5, page 311).
Footnotes
Jumps for the NPMLE of the baseline cumulative hazard function for length-biased data: a detailed description of why the NPMLE has jumps at both censored and uncensored times for length-biased data.
Contributor Information
Jing Qin, Email: jingqin@niaid.nih.gov, Biostatistics Research Branch, National Institute of Allergy and Infectious Diseases, NIH Bethesda, Maryland 20892, USA, Phone: 301-451-2436.
Jing Ning, Division of Biostatistics, The University of Texas Health Science Center at Houston, School of Public Health, Houston, Texas 77030, USA.
Hao Liu, Division of Biostatistics, Dan L. Duncan Cancer Center, Baylor College of Medicine, Houston, Texas 77030, USA.
Yu Shen, Department of Biostatistics, The University of Texas M. D. Anderson Cancer Center, Houston, Texas 77030, USA.
References
- Andersen PK, Borgan O, Gill RD, Keiding N. Statistical Models Based on Counting Processes. New York: Springer; 1992.
- Asgharian M, M'Lan CE, Wolfson DB. Length-biased Sampling with Right Censoring: an Unconditional Approach. Journal of the American Statistical Association. 2002;97:201–209.
- Asgharian M, Wolfson DB. Asymptotic Behavior of the Unconditional NPMLE of the Length-biased Survivor Function From Right Censored Prevalent Cohort Data. The Annals of Statistics. 2005;33:2109–2131.
- Asgharian M, Wolfson DB, Zhang X. Checking Stationarity of the Incidence Rate Using Prevalent Cohort Survival Data. Statistics in Medicine. 2006;25:1751–1767. doi: 10.1002/sim.2326.
- Barlow RE, Proschan F. Statistical Theory of Reliability. New York: Holt, Rinehart & Winston; 1975.
- Bergeron P-J, Asgharian M, Wolfson DB. Covariate Bias Induced by Length-biased Sampling of Failure Times. Journal of the American Statistical Association. 2008;103:737–742.
- Bickel PJ, Ritov Y. Efficient Estimation Using both Direct and Indirect Observations. Theory of Probability and its Applications. 1994;38:194–213.
- Breslow N. Contribution to the Discussion of the Paper by D. R. Cox. Journal of the Royal Statistical Society, Series B. 1972;34:187–220.
- Chen HY, Little RJA. Proportional Hazards Regression with Missing Covariates. Journal of the American Statistical Association. 1999;94:896–908.
- Cox DR. Regression Models and Life Tables (with Discussion). Journal of the Royal Statistical Society, Series B. 1972;34:187–220.
- Cox DR. Partial Likelihood. Biometrika. 1975;62:269–276.
- Cox DR, Miller HD. The Theory of Stochastic Processes. London: Chapman and Hall; 1977.
- De Uña Álvarez J, Otero-Giraldez MS, Alvarez-Llorente G. Estimation Under Length-bias and Right-censoring: An Application to Unemployment Duration Analysis for Married Women. Journal of Applied Statistics. 2003;30:283–291.
- Dewanji A, Kalbfleisch JD. Estimation of Sojourn Time Distributions for Cyclic Semi-Markov Processes in Equilibrium. Biometrika. 1987;74:281–288.
- Elbers C, Ridder G. True and Spurious Duration Dependence: the Identifiability of the Proportional Hazard Model. The Review of Economic Studies. 1982;49:403–409.
- Ferguson TS. A Course in Large Sample Theory. London: Chapman & Hall; 1996.
- Gail MH, Benichou J. Encyclopedia of Epidemiologic Methods. Wiley; 2000.
- Gordis L. Epidemiology. Philadelphia, PA: W. B. Saunders Company; 2000.
- Huang J, Wellner JA. Estimation of a Monotone Density or Monotone Hazard under Random Censoring. Scandinavian Journal of Statistics. 1995;22:3–33.
- Kalbfleisch JD, Lawless JF. Regression Models for Right Truncated Data with Applications to AIDS Incubation Times and Reporting Lags. Statistica Sinica. 1991;1:19–32.
- Kalbfleisch JD, Prentice RL. Marginal Likelihoods Based on Cox's Regression and Life Model. Biometrika. 1973;60:267–278.
- Keiding N. Age-specific Incidence and Prevalence: a Statistical Perspective (with Discussion). Journal of the Royal Statistical Society, Series A. 1991;154:371–412.
- Klein J. Semiparametric Estimation of Random Effects Using the Cox Model Based on the EM Algorithm. Biometrics. 1992;48:795–806.
- Kvam P. Length Bias in the Measurements of Carbon Nanotubes. Technometrics. 2008;50:462–467.
- Lancaster T. The Econometric Analysis of Transition Data. Cambridge: Cambridge University Press; 1990.
- Marshall AW, Proschan F. Maximum Likelihood Estimation for Distributions with Monotone Failure Rate. The Annals of Mathematical Statistics. 1965;36:69–77.
- McClean S, Devine C. A Nonparametric Maximum Likelihood Estimator for Incomplete Renewal Data. Biometrika. 1995;82:791–803.
- Meng X-L, Rubin DB. Maximum Likelihood Estimation via the ECM Algorithm: a General Framework. Biometrika. 1993;80:267–278.
- Murphy SA. Consistency in a Proportional Hazards Model Incorporating a Random Effect. The Annals of Statistics. 1994;22:712–731.
- Murphy SA. Asymptotic Theory for the Frailty Model. The Annals of Statistics. 1995;23:182–198.
- Murphy SA, van der Vaart AW. Observed Information in Semi-parametric Models. Bernoulli. 1999;5:381–412.
- Murphy SA, van der Vaart AW. On Profile Likelihood (with Discussion). Journal of the American Statistical Association. 2000;95:449–485.
- Nielsen GG, Gill RD, Andersen PK, Sorensen TIA. A Counting Process Approach to Maximum Likelihood Estimation in Frailty Models. Scandinavian Journal of Statistics. 1992;19:25–43.
- Padgett WJ, Wei LJ. Maximum Likelihood Estimation of a Distribution Function with Increasing Failure Rate Based on Censored Observations. Biometrika. 1980;67:470–474.
- Parner E. Asymptotic Theory for the Correlated Gamma-frailty Model. The Annals of Statistics. 1998;26:183–214.
- Qin J, Shen Y. Statistical Methods for Analyzing Right-censored Length-biased Data under Cox Model. Biometrics. 2010;66:382–392. doi: 10.1111/j.1541-0420.2009.01287.x.
- Rothenberg TJ. Identification in Parametric Models. Econometrica. 1971;39:577–591.
- Sansgiry P, Akman O. Transformations of the Lognormal Distribution as a Selection Model. The American Statistician. 2000;54:307–309.
- Scheike TH, Keiding N. Design and Analysis of Time-to-pregnancy. Statistical Methods in Medical Research. 2006;15:127–140. doi: 10.1191/0962280206sm435oa.
- Simon R. Length-biased Sampling in Etiologic Studies. American Journal of Epidemiology. 1980;111:444–452. doi: 10.1093/oxfordjournals.aje.a112920.
- Terwilliger J, Shannon W, Lathrop G, Nolan J, Goldin L, Chase G, Weeks D. True and False Positive Peaks in Genomewide Scans: Applications of Length-biased Sampling to Linkage Mapping. American Journal of Human Genetics. 1997;61:430–438. doi: 10.1086/514855.
- Tricomi FG. Integral Equations. New York: Dover Publications; 1985.
- Tsai WY. Estimation of the Survival Function with Increasing Failure Rate Based on Left Truncated and Right Censored Data. Biometrika. 1988;75:319–324.
- Tsai WY. Pseudo-partial Likelihood for Proportional Hazards Models with Biased-sampling Data. Biometrika. 2009;96:601–615. doi: 10.1093/biomet/asp026.
- Turnbull BW. The Empirical Distribution Function with Arbitrarily Grouped, Censored and Truncated Data. Journal of the Royal Statistical Society, Series B. 1976;38:290–295.
- van der Vaart AW. Asymptotic Statistics. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge, UK: Cambridge University Press; 1998.
- van der Vaart AW, Wellner JA. Existence and Consistency of Maximum Likelihood in Upgrade Mixture Models. Journal of Multivariate Analysis. 1992;43:133–146.
- van der Vaart AW, Wellner JA. Weak Convergence and Empirical Processes: With Applications to Statistics. New York: Springer-Verlag; 1996.
- Vardi Y. Nonparametric Estimation in the Presence of Length Bias. The Annals of Statistics. 1982;10:616–620.
- Vardi Y. Multiplicative Censoring, Renewal Processes, Deconvolution and Decreasing Density: Nonparametric Estimation. Biometrika. 1989;76:751–761.
- Vardi Y, Zhang CH. Large Sample Study of Empirical Distributions in a Random-Multiplicative Censoring Model. The Annals of Statistics. 1992;20:1022–1039.
- Wang MC. Nonparametric Estimation From Cross-sectional Survival Data. Journal of the American Statistical Association. 1991;86:130–143. doi: 10.1080/01621459.1999.10473831.
- Wang MC. Hazards Regression Analysis for Length-biased Data. Biometrika. 1996;83:343–354.
- Wang MC, Brookmeyer R, Jewell NP. Statistical Models for Prevalent Cohort Data. Biometrics. 1993;49:1–11.
- Wolfson C, Wolfson DB, Asgharian M, M'Lan CE, Ostbye T, Rockwood K, Hogan DB, the Clinical Progression of Dementia Study Group. A Reevaluation of the Duration of Survival after the Onset of Dementia. The New England Journal of Medicine. 2001;344:1111–1116. doi: 10.1056/NEJM200104123441501.
- Zeidler E. Applied Functional Analysis: Main Principles and Their Applications. Applied Mathematical Sciences, Vol. 109. New York: Springer-Verlag; 1995.
- Zelen M. Forward and Backward Recurrence Times and Length Biased Sampling: Age Specific Models. Lifetime Data Analysis. 2004;10:325–334. doi: 10.1007/s10985-004-4770-1.
- Zelen M, Feinleib M. On the Theory of Screening for Chronic Diseases. Biometrika. 1969;56:601–614.
- Zeng D, Lin DY. Maximum Likelihood Estimation in Semiparametric Regression Models with Censored Data. Journal of the Royal Statistical Society, Series B. 2007;69:507–564.
- Zeng D, Lin DY, Yin G. Maximum Likelihood Estimation for the Proportional Odds Model with Random Effects. Journal of the American Statistical Association. 2005;100:470–483.