Abstract
This paper extends the induced smoothing procedure of Brown & Wang (2006) for the semiparametric accelerated failure time model to the case of clustered failure time data. The resulting procedure permits fast and accurate computation of regression parameter estimates and standard errors using simple and widely available numerical methods, such as the Newton–Raphson algorithm. The regression parameter estimates are shown to be strongly consistent and asymptotically normal; in addition, we prove that the asymptotic distribution of the smoothed estimator coincides with that obtained without the use of smoothing. This establishes a key claim of Brown & Wang (2006) for the case of independent failure time data and also extends such results to the case of clustered data. Simulation results show that these smoothed estimates perform as well as those obtained using the best available methods at a fraction of the computational cost.
Some key words: Censoring, Convex optimization, Multivariate survival data, Rank regression
1. Introduction
The need to analyze failure time data, possibly subject to right-censoring, arises in a number of fields, including medicine, economics, epidemiology, demography and engineering. Semiparametric regression models are commonly used for characterizing the relationship between failure time and covariates, with the Cox proportional hazards regression model (Cox, 1972) being used almost exclusively in practice. The accelerated failure time model (e.g. Kalbfleisch & Prentice, 2002) provides a useful but infrequently used alternative to the Cox proportional hazards model. Letting T̄i and Xi respectively denote the failure time and vector of observed covariates for subject i (i = 1, . . . , n), the accelerated failure time model specifies that log T̄i = X′iβ + ∊i, where the error terms are independent and identically distributed with an unspecified distribution. The regression coefficient β has a nice interpretation and a variety of simple estimators are available when T̄1, . . . , T̄n are fully observed. The infrequent use of this model in applications involving censored failure time data may be largely attributed to the computational challenges that arise in both regression parameter and covariance matrix estimation.
In the presence of censoring, the observed data for subject i can be described by the triplet (Ti, Δi, Xi) where Ti = min(T̄i, Ci), Δi = I (T̄i ⩽ Ci), and Ci denotes the censoring time for subject i. Tsiatis (1990) proposes to estimate β using a weighted estimating equation of the form
Un(β) = Σi Δi wi(β) [Xi − {Σj Xj I(ej(β) ⩾ ei(β))} / {Σj I(ej(β) ⩾ ei(β))}] = 0,   (1)
where ei(β) = log(Ti) − X′iβ, wi(·) are nonnegative weight functions and the sums extend over i, j = 1, . . . , n. Because β appears in this expression only inside indicator functions, Un(β) is not a continuous function of β and an exact solution to Un(β) = 0 typically does not exist. Parameter estimates may instead be obtained by minimizing ||Un(β)||, where ||v|| denotes (v′v)1/2 for a vector v. However, this minimization problem may admit several solutions and, because Un(β) is not necessarily monotone in β, the resulting set of minimizers is not even guaranteed to be convex. Hence, despite the existence of a consistent and asymptotically normal sequence of generalized solutions (e.g. Tsiatis, 1990), identifying this sequence can be challenging in practice.
Fygenson & Ritov (1994) show that using the Gehan weight function wi(β) = n−1 Σj I{ej(β) ⩾ ei(β)} (i = 1, . . . , n) leads to the monotone estimating equation

Wn(β) = n−1 Σi Σj Δi (Xi − Xj) I{ei(β) ⩽ ej(β)} = 0.   (2)
Recognizing that Wn(β) is the gradient of the convex objective function
On(β) = n−1 Σi Σj Δi {ei(β) − ej(β)}−,   (3)

where a− = |a| I(a < 0),
a regression parameter estimate may be obtained by minimizing On(β) with respect to β. The resulting set of solutions is convex and thus easier to locate than in the general case. However, even in this comparatively nice setting, the associated lack of smoothness continues to present computational challenges. Numerous methods have been proposed for finding parameter estimates derived from (2) and (3). The most promising methods to date utilize linear programming techniques (e.g. Jin et al., 2003). However, while such methods can be implemented with relative ease, the computational burden can be high, especially with large datasets.
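To make the computational issue concrete, the following minimal R sketch (our illustration, not code from the paper; logT, delta and X are hypothetical inputs) evaluates the Gehan objective (3) and minimizes it with a generic derivative-free optimizer.

```r
## Sketch only: Gehan objective (3) for independent data.
## Assumed inputs: logT = log(T_i), delta = Delta_i, X = n x p covariate matrix.
gehan_obj <- function(beta, logT, delta, X) {
  e <- logT - as.vector(X %*% beta)   # residuals e_i(beta)
  d <- outer(e, e, "-")               # d[i, j] = e_i(beta) - e_j(beta)
  sum(delta * pmax(-d, 0)) / nrow(X)  # n^{-1} sum_i sum_j Delta_i {e_i - e_j}^-
}

## Because (3) is convex but not differentiable, a derivative-free search
## (or the linear programming approach of Jin et al., 2003) is typically used:
## fit <- optim(rep(0, ncol(X)), gehan_obj, logT = logT, delta = delta, X = X)
```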
Estimating the covariance matrix of the regression parameter estimate obtained under the accelerated failure time model remains a challenging problem. Fygenson & Ritov (1994) show that the regression parameter estimate derived from (3) is asymptotically normal with a covariance matrix that involves the hazard function of the unspecified error distribution. Direct estimation of the covariance matrix thus requires an estimate of this hazard function. Tsiatis (1990) suggests kernel-based estimation, whereas Fygenson & Ritov (1994) suggest a form of numerical differentiation. Both have proven to be unstable choices in the presence of censored data and several authors have since tackled this problem in other ways; see, for example, Jones (1997) and Jin et al. (2003). Jin et al. (2003) propose to randomly reweight the Gehan log-rank objective function (3) and then minimize the resulting perturbed objective function. Repeating this process a large number of times, the covariance matrix may then be estimated using the empirical covariance matrix of these parameter estimates. This interesting and useful approach eliminates the need to estimate the indicated hazard function. However, the computationally intensive nature of this procedure quickly becomes unwieldy, particularly with large datasets. Huang (2002), Strawderman (2005) and Jin et al. (2006a) propose useful alternatives in three related problems.
Several authors have recently proposed useful smoothing methods for nonsmooth estimating equations arising in the accelerated failure time model; see, for example, Brown & Wang (2005; 2006), Heller (2007) and Song et al. (2007). Each of these smoothing methods leads to a continuously differentiable objective or estimating function that can be dealt with using standard numerical methods. Of direct relevance to this paper are Brown & Wang (2006) and Heller (2007). Building on Brown & Wang (2005), Brown & Wang (2006) propose the use of induced smoothing for the Gehan estimating equation (2). This method, described in more detail in §2, involves solving the equation EZ {Wn(β + ΓnZ)} = 0, where Wn(·) is given in (2), Z is a continuous, mean zero normal random vector independent of all of the data, and Γn is a sequence of matrices converging to zero with elements Γn,ij = Op(n−1/2). The smoothed estimating equation EZ {Wn(β + ΓnZ)} reduces to
W̃n(β) = EZ{Wn(β + ΓnZ)} = n−1 Σi Σj Δi (Xi − Xj) Φ{eij(β)/rn,ij} = 0,   (4)
where eij(β) = ej(β) − ei(β), r2n,ij = (Xi − Xj)′Γ2n(Xi − Xj) and Φ(·) denotes the standard normal cumulative distribution function. In a related vein, Heller (2007) directly approximates the indicator function I(u ⩽ 0) in Wn(β) with 1 − ϒ(u/h), where ϒ(·) denotes a local distribution function satisfying certain conditions and the fixed scalar parameter h is used to control the accuracy of approximation. The resulting estimating equation,
n−1 Σi Σj Δi (Xi − Xj) [1 − ϒ{(ei(β) − ej(β))/h}] = 0,   (5)
has the same structure as (4). In fact, upon taking ϒ(·) to be the standard normal distribution function Φ(·), (5) is essentially a special case of (4), utilizing a fixed bandwidth h in place of the covariate-dependent bandwidth rn,ij. Heller (2007) also proposes a robust version of (5) having a bounded influence function. A potential difference between (4) and (5) lies in the ability of the former to employ a smoothing parameter that respects the scaling and covariance structure of the solution sequence. Brown & Wang (2006) claim but do not prove that the sequence of solutions obtained under (4) has the same asymptotic distribution as that obtained in the absence of smoothing. Heller (2007) proves that the solution sequence obtained under (5) is consistent and asymptotically normal, provided that h satisfies nh → ∞ and nh4 → 0 as n → ∞. Interestingly, Heller (2007) further proves that (2) and (5) are asymptotically equivalent but does not establish the equivalence result posited in Brown & Wang (2006).
The problem of regression parameter estimation under the accelerated failure time model with correlated survival data has also been considered. For example, Lin & Wei (1992), Lee et al. (1993) and Jin et al. (2006b) consider the setting in which failure times are grouped into clusters, such that observations within a cluster may be correlated but observations in distinct clusters may be considered independent. Each proposes a marginal method for rank-based estimation of regression parameters, avoiding the need to model the correlation structure among observations. Jin et al. (2006b) also devise a suitable extension of the resampling procedure proposed in Jin et al. (2003). Pan (2001) and Zhang & Peng (2007) instead propose frailty models, handling the dependence among failure times within a cluster using an additive cluster level random effect; see also Strawderman (2006) for related work in the case of a recurrent event outcome. These various methods suffer from estimation and computational challenges that equal or exceed those experienced in the case of independent failure time data. However, to our knowledge, the validity and utility of smoothing methods like those developed in Brown & Wang (2006) and Heller (2007) have not been investigated in connection with the clustered data problem.
This paper extends the smoothing procedure of Brown & Wang (2006) to the problem of marginal estimation of the regression parameter in the presence of clustered data. We prove that the resulting estimator is consistent and asymptotically normal in both the independent and correlated data settings. We further establish the equivalence of these limiting distributions with those arising in the unsmoothed case, providing rigorous justification of the equivalence claim made in Brown & Wang (2006) for the case of independent failure times and its extension to the setting of clustered data. Several possible methods of covariance matrix estimation are evaluated, among them a generalization of the Brown & Wang (2006) procedure and a modification of the resampling procedure due to Jin et al. (2006b). A useful consequence of developing the extended Brown & Wang (2006) estimator is an easy-to-compute sandwich estimator that avoids the need for resampling. The proposed methods substantially ease the computational burden of previously proposed methods for parameter and covariance matrix estimation.
2. Methodology and key results
2.1. Notation and assumptions
Consider a random sample of n independent clusters with Ki members in the ith cluster. Let T̄ik and Cik denote the failure time and censoring time for the kth member of the ith cluster, and let Xik denote the corresponding p × 1 vector of covariates. We assume that (T̄i1, . . . , T̄iKi)′ and (Ci1, . . . , CiKi)′ are independent conditional on the covariates (Xi1, . . . , XiKi)′. Let the survival data for the kth member of the ith cluster be denoted Wik = (log Tik, Δik, Xik)′, where Tik = min(T̄ik, Cik) and Δik = I(T̄ik ⩽ Cik).
We assume that the marginal distribution of T̄ik follows the accelerated failure time model

log T̄ik = X′ikβ0 + ∊ik (k = 1, . . . , Ki; i = 1, . . . , n),
where β0 is a p × 1 vector of unknown regression parameters contained in a compact subset 𝔹 of ℝp and (∊i1, . . . , ∊iKi)′ (i = 1, . . . , n) are independent random error vectors. Within each cluster i, the error terms ∊i1, . . . , ∊iKi may be correlated; however, as in Jin et al. (2006b, § 4), we assume that these error terms are exchangeable with a common, unknown marginal distribution. That is, for any i, j = 1, . . . , n and K ⩽ min(Ki, Kj), the vectors (∊i1, . . . , ∊iK)′ and (∊j1, . . . , ∊jK)′ have the same distribution. Evidently, the case of independent failure time data follows as a special case of the above model upon setting Ki = 1 for all i.
2.2. Estimation for clustered data using the Gehan weight
Let eik(β) = log(Tik) − X′ikβ. Under the assumptions of §2.1, the relevant extension of (3) to the clustered data setting may be written
Ln(β) = n−1 Σi Σk Σj Σl Δik {eik(β) − ejl(β)}−,   (6)

where the sums extend over i, j = 1, . . . , n, k = 1, . . . , Ki and l = 1, . . . , Kj, and a− = |a| I(a < 0);
see, for example, Jin et al. (2006b, §4). Observe that Ln(β) is a continuous convex function for β ∈ 𝔹 and thus differentiable almost everywhere. The derivative of the objective function with respect to β, or Sn(β) = ∇Ln(β), is the discontinuous function
Sn(β) = n−1 Σi Σk Σj Σl Δik (Xik − Xjl) I{eik(β) ⩽ ejl(β)}.   (7)
Let β̂n = argminβ∈𝔹 Ln(β). The solution to this minimization problem may not be unique; however, the convexity of Ln(β) implies that the set of minimizers on 𝔹 is convex (e.g. Fygenson & Ritov, 1994). The lack of smoothness makes minimization of Ln(β) computationally challenging, particularly with multiple covariates. However, under regularity conditions to be described later, the results of Jin et al. (2006b, Theorem 5) imply that there exists a sequence of solutions that is strongly consistent for β0 and, in addition, such that n1/2(β̂n − β0) converges in distribution to a N(0, A−1ΩA−1) random vector, where Ω = limn→∞ var{n1/2 Sn(β0)} and A = ∇S0(β0) for S0(β) = limn→∞ Sn(β). An explicit formula for A is provided in (A1). In addition to the numerical challenges that arise in computing the solution β̂n, variance estimation is difficult due to the dependence on A and the fact that Sn(β) is not differentiable in β.
2.3. Induced smoothing for clustered data
Brown & Wang (2005) propose an induced smoothing method for approximating discontinuous but monotone estimating functions using continuously differentiable functions. Assuming independent failure time observations, Brown & Wang (2006) apply this smoothing method to the problem of estimating the regression parameter in the accelerated failure time model, using (4) in place of (2). As shown below, the extension of this methodology to the problem of estimating β in the clustered data setting under the assumptions of §2.1 is straightforward.
Let Z be a N(0, Ip) random vector independent of the data, where Ip denotes the p × p identity matrix. Let Γ be a p × p matrix such that ||Γ|| = O(1) and Γ2 = Σ, where Σ is some symmetric, positive definite matrix. Then, similarly to Brown & Wang (2005, 2006), a smoothed score function may be constructed by adding the random perturbation n−1/2ΓZ to the argument of the score function Sn(β) in (7) and then taking the expectation with respect to Z. Specifically, with S̃n(β) = EZ{Sn(β + n−1/2ΓZ)}, an easy calculation shows that
S̃n(β) = n−1 Σi Σk Σj Σl Δik (Xik − Xjl) Φ{(ejl(β) − eik(β))/rikjl},   (8)
where r2ikjl = n−1(Xik − Xjl)′Σ(Xik − Xjl). With Ki = Kj = 1 for i, j = 1, . . . , n, this estimating equation reduces to (4). Alternatively, one might work directly with the smoothed objective function L̃n(β) = EZ {Ln(β + n−1/2ΓZ)}. Let ϕ(·) denote the standard normal density function. Then, using standard results for normal random variables and integration by parts, we have
L̃n(β) = n−1 Σi Σk Σj Σl Δik [{ejl(β) − eik(β)} Φikjl(β) + rikjl ϕikjl(β)],   (9)
where
Φikjl(β) = Φ{(ejl(β) − eik(β))/rikjl},   ϕikjl(β) = ϕ{(ejl(β) − eik(β))/rikjl}.   (10)
A straightforward calculation shows that ∇L̃n(β) = S̃n(β).
Let β̃n = argminβ∈𝔹 L̃n(β). The smoothed objective function, L̃n(β), is convex and continuously differentiable and standard numerical methods can be used to efficiently compute β̃n. Alternatively, β̃n can be found as the multivariate root of S̃n(β). The asymptotic results, summarized below and proved in the Appendix, also imply that inference for β̃n is straightforward.
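As an illustration of how easily β̃n can be computed, the R sketch below (ours, not code supplied with the paper) evaluates the smoothed objective (9) and score (8) for clustered data; the inputs logT, delta, X and cluster are hypothetical placeholders, with observations from all clusters stacked, and Sigma plays the role of Σ = Γ2.

```r
## Sketch only: smoothed objective (9) and score (8) for clustered data.
## Assumed inputs: logT, delta, X (all cluster members stacked, N rows),
## cluster (length-N cluster identifiers), Sigma (the p x p matrix Sigma = Gamma^2).
pair_quantities <- function(beta, logT, delta, X, cluster, Sigma) {
  n <- length(unique(cluster))               # number of independent clusters
  e <- logT - as.vector(X %*% beta)          # residuals e_{ik}(beta)
  D <- outer(e, e, function(a, b) b - a)     # D[ab] = e_{jl}(beta) - e_{ik}(beta)
  q <- rowSums((X %*% Sigma) * X)
  R <- sqrt(pmax(outer(q, q, "+") - 2 * X %*% Sigma %*% t(X), 0) / n)  # r_{ikjl}
  list(n = n, D = D, R = R)
}

smooth_obj <- function(beta, logT, delta, X, cluster, Sigma) {
  pq <- pair_quantities(beta, logT, delta, X, cluster, Sigma)
  ## summand of (9); when r_{ikjl} = 0 its limiting value max(D, 0) is used
  terms <- with(pq, ifelse(R > 0, D * pnorm(D / R) + R * dnorm(D / R), pmax(D, 0)))
  sum(delta * terms) / pq$n
}

smooth_score <- function(beta, logT, delta, X, cluster, Sigma) {
  pq <- pair_quantities(beta, logT, delta, X, cluster, Sigma)
  P <- with(pq, ifelse(R > 0, pnorm(D / R), (D > 0) + 0))  # smoothed indicator in (8)
  W <- delta * P                                           # Delta_{ik} * Phi(.)
  drop(t(X) %*% (rowSums(W) - colSums(W))) / pq$n          # sum of Delta (X_ik - X_jl) Phi(.)
}

## beta_tilde via a standard smooth optimizer, e.g.
## nlm(smooth_obj, rep(0, ncol(X)), logT = logT, delta = delta, X = X,
##     cluster = cluster, Sigma = diag(ncol(X)))$estimate
```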
Theorem 1. Let Σ = Γ2 be any symmetric and positive definite matrix with ||Γ|| < ∞. Under conditions A1–A4 of the Appendix, β̃n is a strongly consistent estimator of β0.
Theorem 2. Let Σ = Γ2 be any symmetric and positive definite matrix with ||Γ|| < ∞. Under conditions A1–A6 of the Appendix, n1/2(β̃n − β0) converges in distribution to N (0, Ψ), where Ψ = A−1ΩA−1, Ω = limn→∞ var{n1/2Sn(β0)} and A = ∇S0(β0) is defined in (A1).
The above results provide theoretical justification for the proposed smoothing procedure when estimating regression parameters under the marginal accelerated failure time model with clustered failure time data. Importantly, the matrices A and Ω in Theorem 2 are defined in terms of the limiting behaviour of the unsmoothed estimating function (7), demonstrating that the limiting distribution of n1/2(β̃n − β0) coincides with that of n1/2(β̂n − β0), where β̂n is obtained via the unsmoothed objective function (6). Since justification for the independent data case follows directly from the above theorems upon setting Ki = 1 (i = 1, . . . , n), Theorems 1 and 2 also provide rigorous justification for the claims made in Brown & Wang (2006).
Remark 1. The above results hold for a general smoothing matrix Γ that satisfies certain minimal conditions. Brown & Wang (2006) propose an iterative procedure for estimating β0, in which Σ = Γ2 is updated at each iteration using successive estimates of Ψ. One implementation of this procedure in the clustered data setting is provided in §3.2.
Remark 2. The smoothing bandwidth employed in (8) and (9) is O(n−1/2), where n denotes the number of independent clusters. In the absence of clustering, Heller (2007) recommends the choice h = σ̂n−0.26 in (5), where σ̂ is an estimate of the residual variance obtained using a minimizer of the unsmoothed equation (3). The selection h = O(n−0.26) is motivated as that which provides “the quickest rate of convergence while satisfying the bandwidth constraint nh4 → 0”. In asymptotic terms, Theorem 2 suggests that such oversmoothing is unnecessary.
3. Methods of variance estimation
3.1. The sandwich variance estimator
The sandwich form of the covariance matrix of n1/2(β̃n − β0) in Theorem 2 suggests a natural estimator provided that suitable estimates of both A and Ω can be found. In the independent data case, Brown & Wang (2006) suggest estimating A with Ãn = ∇S̃n(β̃n); Theorems 1 and 2 imply that this remains a consistent estimator in the clustered data setting. Brown & Wang (2006) further suggest several estimates of Ω, including the asymptotic variance of n1/2Sn(β0) provided in Jin et al. (2003) and an estimator of Ω based on the U-statistic structure of the estimating function (4). However, neither estimator of Ω properly accounts for the correlation between observations within a cluster. Lee et al. (1993) show that the asymptotic variance of n1/2Sn(β0) in the clustered data case can be consistently estimated via Ω̂n = Ω̂n(β̂n), where
v⊗2 = vv′ for any vector v, and
Conditions A1–A5 of the Appendix ensure that Ω̂n is a consistent estimator of Ω; with the addition of condition A6, Ψ can be consistently estimated in the clustered data setting using
Ψ̂n = Ã−1n Ω̂n(β̃n) Ã−1n.   (11)
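A rough R sketch of the resulting sandwich computation is given below (our illustration). Ãn is obtained by numerically differentiating the smoothed score sketched in §2.3, and the cluster-level estimate of Ω shown here is only one standard empirical construction for the variance of the Gehan score: it is intended to convey the structure of (11) and may differ from the exact Ω̂n of Lee et al. (1993) in centring and normalizing constants.

```r
## Sketch only; reuses smooth_score from the Section 2.3 sketch.
A_hat <- function(beta, logT, delta, X, cluster, Sigma, eps = 1e-6) {
  p  <- length(beta)
  s0 <- smooth_score(beta, logT, delta, X, cluster, Sigma)
  ## forward-difference approximation to grad S_tilde_n(beta), a p x p matrix
  vapply(seq_len(p), function(j) {
    b <- beta; b[j] <- b[j] + eps
    (smooth_score(b, logT, delta, X, cluster, Sigma) - s0) / eps
  }, numeric(p))
}

## One possible cluster-level variance construction for the Gehan score
## (an assumption for illustration; the paper uses the estimator of Lee et al., 1993).
Omega_hat <- function(beta, logT, delta, X, cluster) {
  n   <- length(unique(cluster))
  e   <- logT - as.vector(X %*% beta)
  Ind <- outer(e, e, "<=")                        # I{e_a <= e_b}
  ## contribution of observation a to the score, collecting pairs (a, b) and (b, a)
  U1  <- delta * (rowSums(Ind) * X - Ind %*% X)
  U2  <- t(Ind) %*% (delta * X) - as.vector(t(Ind) %*% delta) * X
  U   <- (U1 + U2) / n
  Ucl <- rowsum(U, group = cluster)               # sum contributions within clusters
  crossprod(Ucl) / n
}

## Sandwich estimate (11): Psi_hat = A^{-1} Omega A^{-1}
## A     <- A_hat(beta_tilde, logT, delta, X, cluster, Sigma)
## Omega <- Omega_hat(beta_tilde, logT, delta, X, cluster)
## Psi   <- solve(A, Omega) %*% solve(A)
```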
3.2. The Brown & Wang (2006) procedure for clustered data
As suggested in Brown & Wang (2006), an iterative procedure can be used to simultaneously estimate the regression parameters and their covariance matrix. Denoting Ãn(β) = ∇S̃n(β), the proposed procedure consists of the following steps in the clustered data setting:
Step 1. Set i = 0 and initialize Σ̂(0) such that ||Σ̂(0)|| = O(1); for example, Σ̂(0) = Ip.
Step 2. Set i = i + 1 and solve S̃n(β) = 0 for β̃(i) using Γ = (Σ̂(i−1))1/2 in equation (8).
Step 3. Using β̃(i), calculate Ã(i) = Ãn(β̃(i)) and Ω̂(i) = Ω̂(β̃(i)).
Step 4. Compute Σ̂(i) = {Ã(i)}−1 Ω̂(i) {Ã(i)}−1.
Step 5. Repeat steps 2–4 until convergence of both β̃(i) and Σ̂(i) is achieved to a specified tolerance.
In our experience, convergence of this algorithm typically occurs with relatively few iterations, the value of Σ̂(*) at convergence being very close to Ψ̂n in (11).
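For completeness, a minimal R sketch of Steps 1–5 is shown below (our illustration, reusing smooth_obj, A_hat and Omega_hat from the earlier sketches; the starting values, tolerance and iteration cap are arbitrary choices).

```r
## Sketch only: iterative estimation of beta and Sigma (Steps 1-5).
bw_iterate <- function(logT, delta, X, cluster, tol = 1e-6, maxit = 50) {
  p     <- ncol(X)
  Sigma <- diag(p)                        # Step 1: Sigma^(0) = I_p
  beta  <- rep(0, p)
  for (i in seq_len(maxit)) {             # Steps 2-4
    beta_new  <- nlm(smooth_obj, beta, logT = logT, delta = delta, X = X,
                     cluster = cluster, Sigma = Sigma)$estimate
    A         <- A_hat(beta_new, logT, delta, X, cluster, Sigma)
    Omega     <- Omega_hat(beta_new, logT, delta, X, cluster)
    Sigma_new <- solve(A, Omega) %*% solve(A)
    done <- max(abs(beta_new - beta)) < tol && max(abs(Sigma_new - Sigma)) < tol
    beta <- beta_new; Sigma <- Sigma_new
    if (done) break                       # Step 5
  }
  n <- length(unique(cluster))
  ## at convergence Sigma approximates Psi, so var(beta_tilde) is roughly Sigma / n
  list(beta = beta, Sigma = Sigma, se = sqrt(diag(Sigma) / n))
}
```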
Remark 3. The above procedure makes use of a data-dependent smoothing parameter. The proofs of Theorems 1 and 2 assume that the matrix Γ is known; however, since ||Γ̂(*) − (A−1ΩA−1)1/2|| = Op(n−1/2), replacing Γ by Γ̂(*) does not alter these asymptotic results.
3.3. The resampling variance estimator
Jin et al. (2006b, §4) propose a useful resampling method for estimating Ψ in the presence of correlated data. This method, which can be motivated by the conditional multiplier central limit theorem (e.g. Martinussen & Scheike, 2006, p. 43), involves randomly reweighting the Gehan log-rank objective function (6) and then minimizing the resulting perturbed objective function. Specifically, let

L*n(β) = n−1 Σi Σk Σj Σl Zi Zj Δik {eik(β) − ejl(β)}−,
where Z1, . . . , Zn are independent positive random variables with E(Zi) = var(Zi) = 1 (i = 1, . . . , n). Let β̂*n = argminβ∈𝔹 L*n(β). Jin et al. (2006b, Theorem 5) prove that, conditional on the data {Wik; k = 1, . . . , Ki; i = 1, . . . , n}, the distribution of n1/2(β̂*n − β̂n) converges almost surely to the limiting distribution of n1/2(β̂n − β0). Thus, the distribution of β̂n can be approximated by repeatedly generating random samples Z1, . . . , Zn and then minimizing L*n(β) to obtain realizations of β̂*n. The covariance matrix of β̂n can be approximated directly by the empirical covariance matrix of these realizations of β̂*n.
Jin et al. (2006b, §4) work directly with the unsmoothed Gehan objective function and utilize linear programming methods in combination with resampling in order to obtain regression parameter and covariance matrix estimates. Specifically, linear programming is used to minimize Ln(β), obtaining the estimated regression parameter β̂n; it is then applied repeatedly in minimizing each of the realizations of L*n(β) generated for the purposes of covariance matrix estimation. The use of linear programming methods can be avoided by randomly reweighting the smoothed objective function L̃n(β) in (9). Such an approach allows standard numerical methods to be used for minimization, resulting in the potential for computational savings with larger datasets. With Z1, . . . , Zn defined as above, let

L̃*n(β) = n−1 Σi Σk Σj Σl Zi Zj Δik [{ejl(β) − eik(β)} Φikjl(β) + rikjl ϕikjl(β)],
where Φikjl(β) and ϕikjl(β) are defined in (10), and define β̃*n = argminβ∈𝔹 L̃*n(β). Theorems 1 and 2 imply that an argument identical to the one given in Jin et al. (2006b, Theorem 5) can be used to show that the conditional distribution of n1/2(β̃*n − β̃n) converges almost surely to the limiting distribution of n1/2(β̃n − β0). The covariance matrix of β̃n can then be approximated exactly as described above, using the simulated realizations of β̃*n in place of the realizations of β̂*n.
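A compact R sketch of this resampling scheme is given below (our illustration, reusing pair_quantities from the §2.3 sketch; exponential weights are one convenient choice satisfying E(Zi) = var(Zi) = 1, and B = 500 mirrors the number of reweightings used in §4).

```r
## Sketch only: perturbed smoothed objective and resampling covariance estimate.
perturbed_obj <- function(beta, logT, delta, X, cluster, Sigma, Zc) {
  pq    <- pair_quantities(beta, logT, delta, X, cluster, Sigma)
  terms <- with(pq, ifelse(R > 0, D * pnorm(D / R) + R * dnorm(D / R), pmax(D, 0)))
  Z     <- Zc[match(cluster, unique(cluster))]   # expand cluster weights to members
  sum(outer(Z, Z) * delta * terms) / pq$n        # pair (ik, jl) reweighted by Z_i Z_j
}

resample_cov <- function(beta_tilde, logT, delta, X, cluster, Sigma, B = 500) {
  n <- length(unique(cluster))
  draws <- replicate(B, {
    Zc <- rexp(n)                                # positive, with E(Z_i) = var(Z_i) = 1
    nlm(perturbed_obj, beta_tilde, logT = logT, delta = delta, X = X,
        cluster = cluster, Sigma = Sigma, Zc = Zc)$estimate
  })
  cov(t(draws))                                  # empirical covariance of realizations
}
```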
4. Simulation study
Simulation studies were carried out to assess the performance of β̃n as well as to evaluate the covariance matrix estimators described in §3. The proposed simulation studies are modelled after that described in Jin et al. (2006b, §5), allowing for a direct comparison between their simulation results and those to be summarized below.
Specifically, for each cluster, we use the algorithm of Johnson (1987, §10.1) to generate two failure times from the bivariate Gumbel distribution

F(t1, t2) = F1(t1) F2(t2) [1 + θ{1 − F1(t1)}{1 − F2(t2)}],
where −1 ⩽ θ ⩽ 1, Fk (·) is the cumulative distribution function for an exponential random variable with hazard function λk = exp(β1X1k + β2X2k), X1k is Ber(0.5), and X2k is standard normal truncated at ±2 (k = 1, 2). All covariates are generated independently and the correlation between T̄1 and T̄2 is θ/4. The resulting failure time model is a special case of the accelerated failure time model of § 2.1 with true regression parameters β1 = 1 and β2 = 0.5. Censoring times are independently generated from a Un(0,τ) distribution, where τ is selected to achieve a desired level of censoring. Similarly to Jin et al. (2006b, §5), we consider the cases θ = 0 and θ = 1, 50 clusters of size two, and censoring percentages of 0%, 25% and 50%.
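To make the data-generating mechanism explicit, a minimal R sketch follows (our illustration; the function name, default censoring bound tau and sign conventions for the hazard simply follow the description above and are not taken from the paper's code).

```r
## Sketch only: generate n clusters of size two as described in Section 4.
gen_data <- function(n = 50, theta = 1, tau = 5, beta1 = 1, beta2 = 0.5) {
  out <- vector("list", n)
  for (i in seq_len(n)) {
    X1 <- rbinom(2, 1, 0.5)                        # Ber(0.5), members k = 1, 2
    X2 <- qnorm(runif(2, pnorm(-2), pnorm(2)))     # standard normal truncated at +/- 2
    lambda <- exp(beta1 * X1 + beta2 * X2)         # marginal exponential hazards
    ## bivariate Gumbel (FGM) copula via the conditional distribution method
    u1 <- runif(1); v2 <- runif(1)
    b  <- 1 + theta * (1 - 2 * u1)
    u2 <- 2 * v2 / (b + sqrt(b^2 - 4 * (b - 1) * v2))
    Tbar <- qexp(c(u1, u2), rate = lambda)         # correlated failure times
    C    <- runif(2, 0, tau)                       # Un(0, tau) censoring times
    out[[i]] <- data.frame(cluster = i, logT = log(pmin(Tbar, C)),
                           delta = as.numeric(Tbar <= C), X1 = X1, X2 = X2)
  }
  do.call(rbind, out)
}
## Example: dat <- gen_data(theta = 1); X <- as.matrix(dat[, c("X1", "X2")])
```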
Two different estimation methods are considered. Method 1 refers to the iterative method of §3.2 for simultaneously estimating the regression parameters and covariance matrix. Method 2 refers to estimating the regression parameter by minimizing the smoothed objective function (9) with the fixed choice Σ = I2. Within Method 2, we consider estimating the covariance matrix using the resampling-based variance estimate of §3.3 and also using the sandwich variance estimator (11). All simulations were conducted in R using the routine nlm for optimization (R Development Core Team, 2005); the simulation code is available upon request.
Table 1 summarizes the results of two simulation studies. Each row of the table is based on the same 1000 simulated datasets. In the first simulation study, the semiparametric accelerated failure time model of §2.1 is fitted using the covariates X1k and X2k. We report the results for the estimation of the regression parameters β1 = 1 and β2 = 0.5 and associated standard errors using Methods 1 and 2. The second simulation study repeats the first, fitting a model that uses the covariates X*1k = X1k and X*2k = X2k/500. The underlying failure time model is identical to that used in the first simulation study, the true regression parameters now being β*1 = 1 and β*2 = 250. However, in contrast to the first simulation study, the magnitudes of β*2 and X*2k are quite different from those of β2 and X2k. The results for β*1, not shown, are very similar to those reported in Table 1 for β1; hence, we only report the results for β*2. The intent of the second study is to investigate the impact of using the fixed smoothing parameter Σ = I2 versus the data-dependent smoothing parameter of §3.2, a choice that better reflects the covariance structure and scaling of the regression parameter.
Table 1.

| Regression parameter | θ | Censoring % | rbias (M1) | rse (M1) | rsee1 | rbias (M2) | rse (M2) | rsee2a | rsee2b |
|---|---|---|---|---|---|---|---|---|---|
| β1 = 1 | 0 | 0 | 0.32 | 0.25 | 0.25 | 0.21 | 0.25 | 0.25 | 0.25 |
| | | 25 | 3.91 | 0.28 | 0.28 | 3.41 | 0.28 | 0.28 | 0.28 |
| | | 50 | 4.30 | 0.35 | 0.36 | 2.22 | 0.34 | 0.36 | 0.35 |
| | 1 | 0 | 0.91 | 0.26 | 0.25 | 0.78 | 0.25 | 0.25 | 0.25 |
| | | 25 | 1.58 | 0.27 | 0.28 | 1.12 | 0.27 | 0.27 | 0.27 |
| | | 50 | 5.52 | 0.37 | 0.36 | 3.38 | 0.36 | 0.36 | 0.35 |
| β2 = 0.5 | 0 | 0 | 0.74 | 0.28 | 0.26 | 0.78 | 0.28 | 0.26 | 0.26 |
| | | 25 | 1.89 | 0.30 | 0.30 | 1.58 | 0.30 | 0.30 | 0.30 |
| | | 50 | 5.13 | 0.38 | 0.38 | 3.40 | 0.38 | 0.38 | 0.38 |
| | 1 | 0 | 1.23 | 0.26 | 0.26 | 1.20 | 0.26 | 0.26 | 0.26 |
| | | 25 | 1.99 | 0.30 | 0.30 | 1.69 | 0.28 | 0.30 | 0.30 |
| | | 50 | 4.40 | 0.40 | 0.38 | 2.70 | 0.38 | 0.38 | 0.38 |
| β*2 = 250 | 0 | 0 | 0.28 | 0.27 | 0.27 | 0.00 | 0.27 | 0.26 | 0.29 |
| | | 25 | 2.99 | 0.29 | 0.30 | 1.96 | 0.29 | 0.29 | 0.35 |
| | | 50 | 5.77 | 0.38 | 0.38 | 2.76 | 0.37 | 0.38 | 0.46 |
| | 1 | 0 | 5.80 | 0.26 | 0.27 | 2.40 | 0.26 | 0.26 | 0.30 |
| | | 25 | 1.56 | 0.29 | 0.29 | 5.60 | 0.29 | 0.29 | 0.35 |
| | | 50 | 5.25 | 0.37 | 0.38 | 2.20 | 0.36 | 0.37 | 0.46 |

M1, Method 1; M2, Method 2.
rbias, 1000 × absolute relative bias; rse, empirical standard error, relative to parameter; rsee1, standard error relative to parameter, with standard error estimate obtained using iterative method of § 3.2 with Σ̂(0) = I2; rsee2a, standard error relative to parameter, with standard error estimate obtained using the resampling procedure of § 3.3 with 500 random reweightings and regression parameters estimated using the induced smoothing procedure of § 2.3 with the fixed choice Σ = I2; rsee2b, standard error relative to parameter, with standard error estimate based on (11) and regression parameters estimated as described for rsee2a.
Considering only β1 and β2, the relative biases are small, comparable in magnitude and generally increasing with the censoring percentage. In addition, estimates obtained using Method 1 frequently exhibit greater bias than those obtained using Method 2, with no apparent reduction in standard error. The standard error estimates for β1 and β2 are accurate and similar across all estimation methods. Remarkably, the results reported here are also comparable to those summarized in the right panel of Table 1 in Jin et al. (2006b) for the Gehan weight function, where no smoothing is employed.
Turning to the comparison of results for β2 and β*2, biases generally follow the patterns described above. In addition, all methods of standard error estimation perform well, though some evidence of inflation in the relative standard error rsee2b is now present. Overall, the results suggest that the choice of smoothing parameter has minimal impact on the bias or actual standard error of the regression parameter estimates. However, given the relative accuracy of both rsee1 and rsee2a, the discrepancy observed in rsee2b suggests that the scaling of the problem, and hence the choice of smoothing parameter, can adversely affect the accuracy of (11).
On the basis of these results, we recommend using Method 1 as described in §3.2; in comparison with the simulation-based methodology of Jin et al. (2006b), it requires far less computational effort with no evidence of penalty in bias or accuracy of standard error estimation.
5. Concluding remarks and further extensions
The attractive nature of the induced smoothing procedure, in both computational and theoretical terms, stems largely from the convexity of the Gehan-weighted objective function (9). The asymptotic results obtained in this paper make significant use of this convexity. A minor extension of these results can also be used to justify an alternative smoothing methodology for the bounded influence estimator introduced in Heller (2007). Variations on this smoothing methodology may facilitate simpler and more stable estimation procedures for accelerated failure time frailty models; see, for example, Pan (2001), Strawderman (2006) and Zhang & Peng (2007).
The use of the Gehan weight function in (2) has frequently been criticized for the inefficiency of the resulting estimator. The selection of an alternative weight function may result in efficiency improvements at the expense of monotonicity, resulting in weaker asymptotic statements and increased computational challenges. To counteract these drawbacks, Jin et al. (2003) propose to use the Gehan estimator as a starting point for successively solving a sequence of convex optimization problems derived from (1). Jin et al. (2006b) extend these results to the setting of multivariate failure time data. The resulting class of estimation procedures is computationally stable and yields a consistent and asymptotically normal sequence of estimators with reasonably general weight functions. However, it does not lend itself to a simple method of variance estimation. Use of the resampling method described in §3.3 is recommended for this purpose but only amplifies the required computational effort. Jin et al. (2006a) propose a strongly related class of procedures for the Buckley–James estimator. Starting from the Gehan estimator, Strawderman (2005) demonstrates how one may instead use one-step estimation to achieve the same goal and introduces an alternative simulation-based method of variance computation that requires no additional optimization. The results of this paper show that the induced smoothing methodology provides an asymptotically valid and computationally convenient starting point for each of these other methods of estimation. In addition, the methodology itself can be directly incorporated as part of the iterative methods developed in Jin et al. (2003, 2006a,b); the asymptotic results of this paper guarantee that their results also remain valid for the corresponding smoothed version.
A direct extension of this smoothing methodology is available for general weight functions. However, it lacks the same computational convenience due to important structural differences between the Gehan-weighted estimating equation and those used for general weight functions.
Acknowledgments
The authors thank the referees, associate editor and the editor for helpful comments. This work was supported by a grant from the U.S. National Institutes of Health.
Appendix
Proofs
We impose the following regularity conditions:
Condition A1. The parameter space 𝔹 containing β0 is a compact subset of ℝp.
Condition A2. max1⩽k⩽Ki ||Xik|| is bounded almost surely by a nonrandom constant (i = 1, . . . , n).
Condition A3. The assumptions of §2.1 hold with var(∊11) < ∞.
Condition A4. The matrix A = ∇S0(β0), where S0(β) = limn→∞ Sn (β), exists and is nonsingular.
Condition A5. Let f0(·) denote the marginal density associated with the model error term ∊11 and let λ0(·) denote its corresponding hazard function. Then, f0(·) and f′0(·) are bounded functions on ℝ with
Condition A6. The marginal distribution of Crs is absolutely continuous and has a bounded density grs (·) on ℝ (r = 1, . . . , n; s = 1, . . . , Kr).
As indicated in the statements of Theorems 1 and 2, Σ = Γ2 is assumed to be a symmetric and positive definite matrix with ||Γ|| < ∞. Conditions A1, A2, A4, A5 and A6 are standard and ensure consistency and asymptotic normality of the unsmoothed Gehan estimator (Tsiatis, 1990; Ying, 1993; Jin et al., 2006b, for example). Condition A3 implies |cov(∊ik, ∊il)| ⩽ var(∊11) (i = 1, . . . , n; k, l = 1, . . . , Ki); hence, the covariances between all error terms within a cluster are bounded.
The proof of Theorem 1 relies on the following pair of lemmas, both of which hold under conditions A1–A3. The proof of Lemma 1 is a direct consequence of the strong law of large numbers for U-statistics and results in Andersen & Gill (1982, Theorem II.1). The proof of Lemma 2 relies on this result and properties of the normal cumulative distribution and density functions. These proofs are available in a technical report.
Lemma 1. supβ∈𝔹 |Ln(β) − L0(β)| → 0 almost surely, where L0(β) is convex for β ∈ 𝔹.
Lemma 2. supβ∈𝔹 |L̃n(β) − L0(β)| → 0 almost surely, where L0(·) is defined in Lemma 1.
Proof of Theorem 1. Lemmas 1 and 2, respectively, establish the uniform almost sure convergence of Ln(β) and L̃n(β) to the convex function L0(β) for β ∈ 𝔹. By condition A4, L0(β) is strictly convex at β0 and β0 is its unique minimizer. The respective minimizers β̂n and β̃n of Ln(β) and L̃n(β) thus converge almost surely to β0 (Andersen & Gill, 1982, Corollary II.2).
The next lemma is required in order to prove Theorem 2; abbreviated proofs of this result and of Theorem 2 are provided below, with expanded versions of these arguments available in a technical report. A fact used in proving Lemma 3 is that condition A4, in conjunction with (A1), implies that the probability that X1k ≠ X2l for at least one (k, l) pair must be positive.
Lemma 3. Under A1–A6 and as n → ∞, ||∇S̃n (β0) − A|| → 0 almost surely, where A = ∇S0(β0),
(A1) |
Ḡrs (·) denotes the survivor function of log Crs − X′rsβ0 and for every s.
Proof of Lemma 3. Using calculations analogous to those found in Fygenson & Ritov (1994), it can be shown that E{Sn (β)} = S0(β) + O(n−1), where the O(·) term holds uniformly on β ∈ 𝔹 and
(A2) |
for
The outer expectation in (A2) is understood to be taken over the joint distribution of the covariates. Evidently, S0(β0) = 0. Conditions A1–A6 permit us to differentiate (A2) directly; doing so and evaluating the result at β = β0, we obtain (A1) (Fygenson & Ritov, 1994, p. 737).
Recalling notation from §2.1, let the survival data for cluster i be denoted 𝒲i = {Wik, k = 1, . . . , Ki}. The smoothed equation (8) may then be written as
where cn = 1 − n−1, ψ̃β(𝒲i, 𝒲j) = (1/2){h̃β(𝒲i, 𝒲j) + h̃β(𝒲j, 𝒲i)} and
Differentiating this representation with respect to β, setting β = β0, and then using both the strong law of large numbers for independent observations and for U-statistics, we find
(A3) |
almost surely as n → ∞. The random variable 𝒜ab,cd is defined to be zero with probability one if Xab = Xcd; otherwise, one may write
where , , and ξcd (s) = f0(s)Ḡcd (s) + F̄0(s)gcd(s). Under conditions A1–A6, τ(·) is integrable, continuous and bounded on ℝ with τ (0) = 0 and the second term on the right-hand side therefore vanishes (e.g. Kanwal, 1998, p. 11). Using the resulting formula for 𝒜ab,cd and integration by parts, it can now be shown that
Substituting this last result into (A3), we observe agreement with (A1), proving the result.
Proof of Theorem 2. Using notation introduced in §2.2, we have that A−1n1/2Sn (β0) is asymptotically normal with mean zero and variance A−1ΩA−1 under assumptions A1–A5 (Jin et al., 2006b, Theorem 5). Suppose that
n1/2(β̃n − β0) + A−1 n1/2 Sn(β0) → 0   (A4)
in probability. Then, n1/2(β̃n − β0) → N(0, A−1ΩA−1) in distribution, establishing the desired asymptotic result as well as the equality of the limiting distributions of the smoothed and unsmoothed estimators.
To prove that (A4) holds, we can make use of Theorem 3 of Arcones (1998). Using notation from Arcones (1998), define Gn(β) = nL̃n(β) for all β in 𝔹. For n ⩾ 1, define the sequence of p × 1 random vectors ηn = n1/2Sn (β0) and sequences of nonsingular, symmetric p × p matrices Mn = n1/2Ip and Vn = (1/2)A. The required result (A4) becomes
||Mn(β̃n − β0) + (2Vn)−1 ηn|| → 0   (A5)
in probability. The result (A5) follows directly from Arcones (1998, Theorem 3) provided that conditions A1–A6 are sufficient to ensure that the following regularity conditions hold:
Condition B1. Gn(β) is convex and β̃n is a sequence satisfying Gn(β̃n) ⩽ infβ∈𝔹 Gn(β) + op(1).
Condition B2. ηn = Op(1), lim infn→∞ inf|β|=1 β′Vn β > 0 and lim supn→ ∞ sup|β|=1 β′Vn β < ∞.
Condition B3. For each β ∈ ℝp, Gn(β0 + M−1nβ) − Gn(β0) − β′ηn − β′Vnβ → 0 in probability.
Conditions B1 and B2 are immediate consequences of conditions A1–A6. For example, condition B1 follows because Gn (β) is easily shown to be convex, β̃n = argminβ∈𝔹 L̃n (β) = argminβ∈𝔹 Gn(β), Gn (β) is continuous, and 𝔹 is compact. In addition, conditions A1–A5 are sufficient to ensure that ηn = n1/2Sn (β0) converges in distribution, hence Op(1) as required. Since Vn = (1/2)A is a positive definite matrix for every n, condition B2 is also satisfied.
It remains to establish condition B3. Using the definitions of Mn, Gn(·), L̃n(·) and S̃n(·), and a Taylor series expansion, we may write
Gn(β0 + M−1nβ) − Gn(β0) = n1/2 β′S̃n(β0) + (1/2) β′∇S̃n(β̄n)β,   (A6)
where β̄n lies on the line segment joining β0 and β0 + n−1/2β. The triangle inequality, Lemma 3 and the fact that {S̃n} form a sequence of bounded, continuously differentiable functions imply that we can replace ∇S̃n(β̄n) in (A6) by the matrix A without altering this result. Therefore, if
n1/2 ||S̃n(β0) − Sn(β0)|| → 0   (A7)
in probability, the definitions of Vn and ηn imply that condition B3 holds. To see that (A7) holds, write

n1/2 {S̃n(β0) − Sn(β0)} = n1/2 ∫ℝp {Sn(β0 + n−1/2u) − Sn(β0)} ψ(u) du,
where ψ(·) denotes the pdf of ΓZ.
Let Θ be a fixed matrix such that ||Θ|| ⩽ M for some M < ∞ and define, for suitable u, the function Kn(u; β0, Θ) = ||Sn(β0 + n−1/2u) − Sn(β0) − n−1/2Θu||. Then, since ∫ℝp u ψ(u) du = 0, the triangle inequality implies
n1/2 ||∫ℝp {Sn(β0 + n−1/2u) − Sn(β0)} ψ(u) du|| ⩽ n1/2 ∫||u||⩽∊n Kn(u; β0, A) ψ(u) du + n1/2 ∫||u||>∊n Kn(u; β0, A) ψ(u) du   (A8)
for any ∊n > 0. The result (A7) therefore holds if we can find ∊n > 0 such that both integrals on the right-hand side of (A8) converge in probability to zero.
Following Ying (1993, Theorem 2) and Jin et al. (2006b, Theorem 5), the matrix A satisfies
sup||b−β0||⩽dn n1/2 ||Sn(b) − Sn(β0) − A(b − β0)|| / (1 + n1/2||b − β0||) → 0 in probability,   (A9)
for any positive sequence dn → 0. Suppose ∊n = o(n1/2). Then, taking b = β0 + n−1/2u, dn = n−1/2∊n and Θ = A, (A9) implies
sup||u||⩽∊n n1/2 Kn(u; β0, A) / (1 + ||u||) → 0 in probability.   (A10)
An easy calculation, in combination with (A10), now shows that the first integral on the right-hand side of the inequality in (A8) converges in probability to zero, even if ∊n → ∞. With regard to the second term on the right-hand side of (A8), we may use the definition of Kn(·; β0, A) and the triangle inequality to write n1/2 ∫||u||>∊n Kn(u; β0, A) ψ(u) du ⩽ Q3 + Q4, where

Q3 = n1/2 ∫||u||>∊n ||Sn(β0 + n−1/2u) − Sn(β0)|| ψ(u) du,   Q4 = ∫||u||>∊n ||Au|| ψ(u) du.
For all β ∈ 𝔹, ||Sn(β)|| ⩽ Q for some constant Q < ∞ by condition A2; hence, Q3 ⩽ 2Qn1/2 p(||ΓZ|| > ∊n). Letting ∊n → ∞, it follows that n1/2 p(||ΓZ|| > ∊n) → 0 as n → ∞. Similarly, ∫||u||>∊n ||u||ψ(u)du → 0. Thus, provided that n, ∊n → ∞, the bounds Q3 and Q4, hence, the second integral on the right-hand side of the inequality in (A8), also converge in probability to zero. Since we can select a sequence ∊n = o(n1/2) such that both n, ∊n → ∞, it follows that (A8) converges in probability to zero as n → ∞, establishing (A7) and concluding the proof.
Contributor Information
Lynn M. Johnson, Department of Statistical Science, Cornell University, Ithaca, New York 14853, U.S.A. Email: lms86@cornell.edu
Robert L. Strawderman, Department of Biological Statistics and Computational Biology, Cornell University, Ithaca, New York 14853, U.S.A. Email: rls54@cornell.edu
References
- Andersen PK, Gill RD. Cox's regression model for counting processes: a large sample study. Ann Statist. 1982;10:1100–20.
- Arcones MA. Asymptotic theory for M-estimators over a convex kernel. Economet Theory. 1998;14:387–422.
- Brown BM, Wang Y-G. Standard errors and covariance matrices for smoothed rank estimators. Biometrika. 2005;92:149–58.
- Brown BM, Wang Y-G. Induced smoothing for rank regression with censored survival times. Statist Med. 2006;26:828–36. doi: 10.1002/sim.2576.
- Cox DR. Regression models and life-tables (with discussion). J R Statist Soc B. 1972;34:187–220.
- Fygenson M, Ritov Y. Monotone estimating equations for censored data. Ann Statist. 1994;22:732–46.
- Heller G. Smoothed rank regression with censored data. J Am Statist Assoc. 2007;102:552–59.
- Huang Y. Calibration regression of censored lifetime medical cost. J Am Statist Assoc. 2002;97:318–27.
- Jin Z, Lin DY, Wei LJ, Ying Z. Rank-based inference for the accelerated failure time model. Biometrika. 2003;90:341–53.
- Jin Z, Lin DY, Ying Z. On least squares regression with censored data. Biometrika. 2006a;93:147–62.
- Jin Z, Lin DY, Ying Z. Rank regression analysis of multivariate failure time data based on marginal linear models. Scand J Statist. 2006b;33:1–23.
- Johnson ME. Multivariate Statistical Simulation. Wiley Series in Probability and Mathematical Statistics. New York: Wiley; 1987.
- Jones MP. A class of semiparametric regressions for the accelerated failure time model. Biometrika. 1997;84:73–84.
- Kalbfleisch JD, Prentice RL. The Statistical Analysis of Failure Time Data. 2nd ed. Hoboken, New Jersey: Wiley-Interscience; 2002.
- Kanwal RP. Generalized Functions: Theory and Technique. Boston: Birkhäuser; 1998.
- Lee EW, Wei LJ, Ying Z. Linear regression analysis for highly stratified failure time data. J Am Statist Assoc. 1993;88:557–65.
- Lin JS, Wei LJ. Linear regression analysis for multivariate failure time observations. J Am Statist Assoc. 1992;87:1091–97.
- Martinussen T, Scheike TH. Dynamic Regression Models for Survival Data. Statistics for Biology and Health. New York: Springer; 2006.
- Pan W. Using frailties in the accelerated failure time model. Lifetime Data Anal. 2001;7:55–64. doi: 10.1023/a:1009625210191.
- R Development Core Team. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing; 2005.
- Song X, Ma S, Huang J, Zhou X. A semiparametric approach for the nonparametric transformation survival model with multiple covariates. Biostatistics. 2007;6:197–211. doi: 10.1093/biostatistics/kxl001.
- Strawderman RL. The accelerated gap times model. Biometrika. 2005;92:647–66.
- Strawderman RL. A regression model for dependent gap times. Int J Biostatistics. 2006;2(1), Article 1.
- Tsiatis AA. Estimating regression parameters using linear rank tests for censored data. Ann Statist. 1990;18:354–72.
- Ying Z. A large sample study of rank estimation for censored regression data. Ann Statist. 1993;21:76–99.
- Zhang J, Peng Y. An alternative estimation method for the accelerated failure time frailty model. Comp Statist Data Anal. 2007;51:4413–23.