Summary
In some clinical studies comparing treatments in terms of their survival curves, researchers may anticipate that the survival curves will cross at some point, leading to interest in a long-term survival comparison. However, simple comparison of the survival curves at a fixed point may be inefficient, and use of a weighted log-rank test may be overly sensitive to early differences in survival. We formulate the problem as one of testing for differences in survival curves after a prespecified time point, and propose a variety of techniques for testing this hypothesis. We study these methods using simulation and illustrate them on a study comparing survival for autologous and allogeneic bone marrow transplants.
Keywords: Censored data, Crossing hazard functions, Generalized linear models, Log-rank test, Pseudo-value approach, Weibull distribution, Weighted Kaplan–Meier statistic
1. Introduction
When comparing treatments in terms of their time-to-event distribution, there may be reason to believe that the survival curves will cross, and standard comparison techniques in such cases could lead to misleading results. Often researchers in such cases will focus on which treatment has a better long-term survival probability. In particular, this research is motivated by a common scenario in hematopoietic stem cell transplantation, illustrated using a study comparing autologous and allogeneic bone marrow transplants for follicular lymphoma (Van Besien et al., 2003). The sample contained 175 patients with an HLA-identical sibling allogeneic transplant and 596 patients with an unpurged autologous transplant. We are interested in comparing the disease-free survival (DFS) curves (i.e., the probability a patient is alive and disease free) between the two treatment arms. However, this comparison is complicated by the likely possibility that the hazard functions from these two treatments will cross at some point. Allogeneic transplants tend to have a higher mortality early due to the toxicity of the higher doses of chemotherapy used to ablate the immune system as well as graft-versus-host disease from the donor cells. However, the donor cells may provide a graft-versus-lymphoma effect resulting in less relapse of the primary disease in long-term survivors. In contrast, autologous transplants have lower early toxicity because patients do not experience graft-versus-host disease. However, these patients do not benefit from the protection against relapse from the graft-versus-lymphoma effect, so they tend to experience more relapses. These contrasting profiles are illustrated by the Kaplan–Meier curves for this dataset in Figure 1. The DFS of the allogeneic transplant arm drops quickly early, but then levels off, whereas the DFS of the autologous transplant arm decreases more slowly but does not plateau. The two curves appear to cross at about 1 year. The unweighted log-rank test will have poor power to detect such a difference in survival curves.
Generally, the question of interest is: which, if any, of the treatments yields better long-term survival? Several strategies for addressing this question are possible. One may pick a single long-term time point and compare the survival estimates between the treatment groups at this single time point, as is discussed in Klein et al. (2007). However, this approach has some potential problems. First, the results may be sensitive to the time point chosen. Second, this strategy ignores events occurring after the selected time point. For example, in a clinical trial comparing two treatments one might select 3-year survival as the primary endpoint. However, if everyone is followed up for 3 years and accrual occurs over a period of time such as 2 years, then there is substantial information on later events (between 3 and 5 years) for patients enrolled early in the trial. Therefore, selecting a single time point may be inefficient.
Another alternative would be to estimate simultaneous confidence bounds for the difference in survival curves (Parzen, Wei, and Ying, 1997; Zhang and Klein, 2001), which identify time regions where the two treatments are different. However, because of the large number of time points being considered and adjusted for, these tend to be quite wide and may be inefficient in determining late differences between treatments.
Another option would be the weighted log-rank test, with more weight placed on later time points to reflect interest in late events. For example, Fleming and Harrington (1981) proposed a class of weighted log-rank tests with a weight function equal to Ŝ(t)^ρ{1 − Ŝ(t)}^γ. Here setting ρ = 0 and γ = 1 would place more weight on late events and hence late differences in the hazard rates and/or the survival curves. However, even though the weight is placed appropriately, this test is still designed to test the null hypothesis that the entire survival curves are equal. If we are focused on late differences rather than the entire survival curve, even the weighted log-rank test may be overly sensitive to early differences in the survival curves. We will illustrate this point in simulations presented later in the article.
We propose a specific formulation of the hypothesis to focus on late differences in the survival curve. We assume that a time point t0 can be prespecified, chosen so that any crossing of the survival curves is presumed to occur prior to that time point, if at all. The null hypothesis is H0 : S1(t) = S0(t), for all t ≥ t0, where S1(t) and S0(t) denote the survival curves at time t for the treatment and control groups, respectively, versus the alternative, H1 : S1(t) ≠ S0(t), for some t ≥ t0. This formulation allows us to specify exactly over what time range the comparison of treatments is of interest, namely, after t0.
Note that this null hypothesis is equivalent to H0 : {S1(t0) = S0(t0)} ⋂ {λ1(t) = λ0(t), t > t0}, where λk(t) represents the hazard function at time t for group k, k = 0, 1. This formulation allows us to separate the hypotheses into two sub-hypotheses: the hypothesis of equality of survival at t0 and the hypothesis of no difference in the hazard function after t0. The composite hypothesis can then be tested using combinations of test statistics for each of the sub-hypotheses.
In the next section, we describe possible methods for testing this null hypothesis.
2. Methods
2.1 Notation
The data consist of n1 + n0 = n subjects with event times tj. Let the distinct event times be ordered such that t1 < ⋯ < tm. At time tj, let dkj denote the number of events and Ykj denote the number at risk in the kth group, k = 0, 1.
The Kaplan–Meier estimate of survival in group k is given by

$$\hat{S}_k(t) = \prod_{t_j \le t} \left( 1 - \frac{d_{kj}}{Y_{kj}} \right).$$

The variance of the Kaplan–Meier estimate is estimated by Greenwood's formula, given by

$$\widehat{\operatorname{Var}}\{\hat{S}_k(t)\} = \hat{S}_k(t)^2\, \hat{\sigma}_k^2(t),$$

where

$$\hat{\sigma}_k^2(t) = \sum_{t_j \le t} \frac{d_{kj}}{Y_{kj}(Y_{kj} - d_{kj})}.$$

The Nelson–Aalen estimate of the cumulative hazard function is

$$\hat{\Lambda}_k(t) = \sum_{t_j \le t} \frac{d_{kj}}{Y_{kj}},$$

with variance estimated by

$$\hat{\sigma}_{\Lambda_k}^2(t) = \sum_{t_j \le t} \frac{d_{kj}}{Y_{kj}^2}.$$
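To make this notation concrete, the following Python/NumPy sketch computes these quantities from a vector of follow-up times and event indicators. The function names and array-based representation are ours, not the article's; it assumes `time` and `event` are NumPy arrays with `event = 1` for an observed event and `0` for censoring.

```python
import numpy as np

def km_with_greenwood(time, event, t):
    """Kaplan-Meier estimate S(t) and its Greenwood variance estimate."""
    s, rel_var = 1.0, 0.0
    for tj in np.sort(np.unique(time[event == 1])):
        if tj > t:
            break
        d = int(np.sum((time == tj) & (event == 1)))  # events at t_j
        y = int(np.sum(time >= tj))                   # number at risk at t_j
        s *= 1.0 - d / y
        if y > d:                                     # skip the degenerate last risk set
            rel_var += d / (y * (y - d))
    return s, s ** 2 * rel_var                        # (S_hat(t), Greenwood variance)

def nelson_aalen(time, event, t):
    """Nelson-Aalen cumulative hazard estimate and its variance estimate at t."""
    cum_haz, var = 0.0, 0.0
    for tj in np.sort(np.unique(time[event == 1])):
        if tj > t:
            break
        d = int(np.sum((time == tj) & (event == 1)))
        y = int(np.sum(time >= tj))
        cum_haz += d / y
        var += d / y ** 2
    return cum_haz, var
```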
2.2 Comparisons Based on a Single Time Point
The simplest method for testing the null hypothesis that the survival curves after time t0 are equal would be to compare the survival curves at a selected point t′ > t0, using the difference in Kaplan–Meier estimates of survival at t′. One can also construct a test statistic based on transformations of the survival probabilities at a fixed point in time, as described in Klein et al. (2007). They recommended the complementary log–log transformation of the survival probability as working best overall, resulting in the test statistic

$$Z_{CLL}(t') = \frac{\log\{-\log \hat{S}_1(t')\} - \log\{-\log \hat{S}_0(t')\}}{\sqrt{\hat{\sigma}_1^2(t')/\{\log \hat{S}_1(t')\}^2 + \hat{\sigma}_0^2(t')/\{\log \hat{S}_0(t')\}^2}}. \tag{1}$$
Alternatively, one could compare the cumulative hazard functions at a selected time point t′ > t0, using the Nelson–Aalen estimates at t′, Λ̂k(t′). Tests based on the cumulative hazard function should behave similarly to those using a log transformation of the survival function.
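As an illustration of the pointwise comparison (1), here is a hypothetical sketch that reuses the `km_with_greenwood()` helper from the Section 2.1 sketch; it assumes group-specific arrays `(time1, event1)` and `(time0, event0)` and that both Kaplan–Meier estimates lie strictly between 0 and 1 at t′.

```python
import numpy as np
from scipy.stats import norm

def z_cll(time1, event1, time0, event0, t_prime):
    """Complementary log-log comparison of two Kaplan-Meier estimates at t_prime."""
    s1, v1 = km_with_greenwood(time1, event1, t_prime)   # from the Section 2.1 sketch
    s0, v0 = km_with_greenwood(time0, event0, t_prime)
    diff = np.log(-np.log(s1)) - np.log(-np.log(s0))
    # delta-method variance of log(-log S_hat): Var(S_hat) / {S_hat * log(S_hat)}^2
    var = v1 / (s1 * np.log(s1)) ** 2 + v0 / (s0 * np.log(s0)) ** 2
    z = diff / np.sqrt(var)
    return z, 2 * norm.sf(abs(z))                        # statistic and two-sided p-value
```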
2.3 Weighted Kaplan–Meier Test
One way to compare the entire survival curve after t0 is to consider a modification of the weighted Kaplan–Meier statistic (Pepe and Fleming, 1989, 1991), where the integral is taken over the restricted range after t0. The statistic is given by

$$X_{WKM}(t_0) = \sqrt{\frac{n_1 n_0}{n}} \int_{t_0}^{t_m} \hat{w}(t)\,\{\hat{S}_1(t) - \hat{S}_0(t)\}\, dt,$$

where ŵ(t) = {n1Ĝ1(t) + n0Ĝ0(t)}^(−1) nĜ1(t)Ĝ0(t) and Ĝk(t) is the Kaplan–Meier estimate of the censoring distribution in group k. Let ℓ denote the index of the event time such that tℓ−1 ≤ t0 < tℓ. The (unpooled) variance of this statistic, denoted σ̂²WKM(t0), can be estimated from the data; a sketch of the derivation of this variance expression is given in Web Appendix A. Then the standardized weighted K–M statistic follows a standard normal distribution under the null hypothesis and is given by

$$Z_{WKM}(t_0) = \frac{X_{WKM}(t_0)}{\hat{\sigma}_{WKM}(t_0)}. \tag{2}$$
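The sketch below illustrates only the numerator X_WKM(t0) of this statistic; the variance estimate (whose derivation is in Web Appendix A) is not reproduced here, and the grid-based numerical integration up to a crude upper limit is our simplification. The helper `km_curve()` and all variable names are ours.

```python
import numpy as np

def km_curve(time, event, grid):
    """Kaplan-Meier survival probabilities evaluated on an increasing time grid."""
    out = np.empty(len(grid))
    for k, t in enumerate(grid):
        s = 1.0
        for tj in np.sort(np.unique(time[(event == 1) & (time <= t)])):
            d = np.sum((time == tj) & (event == 1))
            y = np.sum(time >= tj)
            s *= 1.0 - d / y
        out[k] = s
    return out

def wkm_numerator(time1, event1, time0, event0, t0, n_grid=500):
    n1, n0 = len(time1), len(time0)
    n = n1 + n0
    upper = min(time1.max(), time0.max())                   # crude upper integration limit
    grid = np.linspace(t0, upper, n_grid)
    s1, s0 = km_curve(time1, event1, grid), km_curve(time0, event0, grid)
    # Kaplan-Meier estimates of the censoring distributions (event indicator reversed)
    g1, g0 = km_curve(time1, 1 - event1, grid), km_curve(time0, 1 - event0, grid)
    w = n * g1 * g0 / (n1 * g1 + n0 * g0)                   # weight function w_hat(t)
    return np.sqrt(n1 * n0 / n) * np.trapz(w * (s1 - s0), grid)
```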
2.4 Tests Based on Pseudo-Value Observations
Another test is based on a pseudo-value regression technique proposed by Andersen, Klein, and Rosthoj (2003) and Klein and Andersen (2005). Originally applied in the context of regression models for multistate models and competing risks data, it can also be used in the simple survival comparison context. For a given time point τj, compute the pooled sample Kaplan–Meier estimator, Ŝp(τj), based on all n1 + n0 observations, and the Kaplan–Meier estimator based on the sample of size n1 + n0 − 1 with the ith observation removed, Ŝp^(−i)(τj), for i = 1, …, n. Define the ith pseudo-value at time τj by θ̂ij = nŜp(τj) − (n − 1)Ŝp^(−i)(τj), for i = 1, …, n.
To perform inference on survival curves after a fixed time t0, we use the pseudo-values defined for event times t > t0. Let τ1 correspond to the earliest event time occurring after t0, τ2 correspond to the next earliest event time after t0, and so forth, so that there are a total of m′ such observed event times in the dataset. We consider a generalized linear model for the pseudo-values, given by g(θij) = αj + βZi, for i = 1, …, n; j = 1, …, m′, where Zi is an indicator with value 1 if the patient is in the treatment group and 0 if they are in the control group. Then given that we are only considering pseudo-values for times t > t0, the null hypothesis H0 of equal survival curves after t0 is equivalent to testing H0 : β = 0.
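A brute-force sketch of the pseudo-value construction follows: for each event time after t0 it recomputes the pooled Kaplan–Meier estimator with each subject deleted in turn. The helper `km_at()` and the O(n²m′) loop are ours and are only meant to make the definition explicit; for large samples one would use a faster leave-one-out update.

```python
import numpy as np

def km_at(time, event, t):
    """Kaplan-Meier estimate of the pooled survival function at time t."""
    s = 1.0
    for tj in np.sort(np.unique(time[(event == 1) & (time <= t)])):
        d = np.sum((time == tj) & (event == 1))
        y = np.sum(time >= tj)
        s *= 1.0 - d / y
    return s

def pseudo_values(time, event, t0):
    """Pseudo-values theta_ij = n*S_p(tau_j) - (n-1)*S_p^{-i}(tau_j) at event times after t0."""
    n = len(time)
    taus = np.sort(np.unique(time[(event == 1) & (time > t0)]))
    theta = np.empty((n, len(taus)))
    for j, tau in enumerate(taus):
        s_full = km_at(time, event, tau)
        for i in range(n):
            keep = np.arange(n) != i                      # delete the i-th subject
            s_loo = km_at(time[keep], event[keep], tau)
            theta[i, j] = n * s_full - (n - 1) * s_loo
    return taus, theta
```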
Inference on β may be performed using generalized estimating equations (GEE; Liang and Zeger, 1986). Let μ(·) = g−1(·) be the mean function. Define dμi(β, α) to be the vector of partial derivatives of μ(·) with respect to (β, α), where α is an m′-dimensional vector of intercepts at times τj, j = 1, …, m′. Let Vi be a working covariance matrix. Express the pseudo-values and their expectations in vector notation as θ̂i = (θ̂i1, …, θ̂im′) and θi = (θi1, …, θim′). Then the estimating equations to be solved are of the form

$$U(\beta, \alpha) = \sum_{i=1}^{n} U_i(\beta, \alpha) = \sum_{i=1}^{n} d\mu_i(\beta, \alpha)^{T} V_i^{-1} \{\hat{\theta}_i - \theta_i\} = 0.$$

Let (β̂, α̂) be the solution to this equation. Using results of Liang and Zeger (1986), under standard regularity conditions, it follows that √n{(β̂, α̂) − (β, α)} is asymptotically multivariate normal with mean 0. The covariance matrix of (β̂, α̂) can be estimated by the sandwich estimator

$$\hat{\Sigma}(\hat{\beta}, \hat{\alpha}) = I(\hat{\beta}, \hat{\alpha})^{-1} \left\{ \sum_{i=1}^{n} U_i(\hat{\beta}, \hat{\alpha})\, U_i(\hat{\beta}, \hat{\alpha})^{T} \right\} I(\hat{\beta}, \hat{\alpha})^{-1},$$

where

$$I(\beta, \alpha) = \sum_{i=1}^{n} d\mu_i(\beta, \alpha)^{T} V_i^{-1}\, d\mu_i(\beta, \alpha)$$

is the model-based equivalent of the information matrix (Andersen et al., 2003).
When the number of time points or pseudo-values being included for each patient is large, this can present numerical difficulties in several respects. Estimation of the parameters can be slow because there are a large number of parameters, and numerical algorithms must be used. Furthermore, calculation of the matrix Σ̂ requires inversion of the high-dimensional matrix I(β̂, α̂). One option is to consider a limited number of points (say 5 or 10) spread out equally on an event scale over the time period after t0. An alternative is to use the generalized score statistic for β (Rotnitzky and Jewell, 1990; Boos, 1992), as considered in Lu (2006) for the pseudo-value regression context. The generalized score statistic for β when there is a single dichotomous predictor can be shown to have a closed form, assuming an independent working correlation matrix and using the complementary log–log link function. Let α̃j = log{−log(θ̄·j)} be the solution for αj in the estimating equation under the null hypothesis, where θ̄·j = n^(−1) Σ_{i=1}^{n} θ̂ij is the average pseudo-value at time τj, and qj = θ̄·j log θ̄·j. The generalized score statistic for testing H0 : β = 0 simplifies to

(3)

where the matrix element (·)11 or vector element (·)1 refers to the β component. This statistic asymptotically follows a χ² distribution with 1 degree of freedom under the null hypothesis that β = 0. Note, however, that the method can be biased when the censoring distribution depends on covariates.
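As a hedged illustration of pseudo-value regression, the sketch below stacks the pseudo-values into long format and fits a GEE with statsmodels. To keep the example simple it uses an identity link (Gaussian working model) and a robust Wald test for β rather than the complementary log–log link and the closed-form score statistic (3) described above, so it is an analogue of the test, not a reimplementation. The thinning of time points follows the suggestion in the text; `pseudo_values()` is the sketch above, and `group` is assumed to be a 0/1 NumPy array.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

def pseudo_value_wald_test(time, event, group, t0, max_times=10):
    taus, theta = pseudo_values(time, event, t0)
    # use a limited number of time points after t0 to ease the numerical burden
    keep = np.unique(np.linspace(0, len(taus) - 1, min(max_times, len(taus))).astype(int))
    n = len(time)
    rows = [{"id": i, "tau": taus[j], "pv": theta[i, j], "z": group[i]}
            for i in range(n) for j in keep]
    df = pd.DataFrame(rows)
    model = smf.gee("pv ~ C(tau) + z", groups="id", data=df,
                    family=sm.families.Gaussian(),
                    cov_struct=sm.cov_struct.Independence())
    res = model.fit()
    return res.params["z"], res.pvalues["z"]   # sandwich-based Wald test for beta
```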
The pseudo-value regression technique offers several potential improvements over the other methods studied in this article. First, it allows for straightforward inclusion of additional covariates in the generalized linear model. Although other methods discussed here also can be extended to include additional covariates, the generalized linear model framework makes this extension very straightforward. Another advantage is that the pseudo-value regression approach allows one to model the effect of treatment as a time-dependent predictor. Even though we are attempting to eliminate the effect of early differences in outcome by comparing the survival curves after t0, there is still the possibility that the effect of treatment is still not constant or even consistent after time t0. Using a single parameter to describe the treatment effect may not be sensitive to these kinds of differences in the late survival curves, whereas allowing for a time-dependent effect in the generalized linear model will have better power to capture such a treatment effect. However, both of these extensions make the analysis more complex, because the simplified form of the generalized score test no longer holds. It is likely that these extensions would require one to use alimited number of time points after t0, rather than all event times as is done here. We do not consider these extensions further in this article.
2.5 Combination Tests
Finally, we consider alternative test statistics based on the formulation of the overall hypothesis as an intersection of two sub-hypotheses, H0 = H01 ∩ H02, given by H01 : S1(t0) = S0(t0) and H02 : λ1(t) = λ0(t), t > t0. Hypothesis H01 can be tested using either a standardized difference in Kaplan–Meier estimates or, alternatively, a standardized difference in Nelson–Aalen estimates of the cumulative hazard function. Let XNA(t0) = Λ̂1(t0) − Λ̂0(t0), and let σ̂²NA(t0) = σ̂²Λ1(t0) + σ̂²Λ0(t0). Then the test statistic for H01 is ZNA(t0) = XNA(t0)/σ̂NA(t0).
Hypothesis H02 can be tested using a log-rank test starting from t0, given by

$$X_{LR}(t_0) = \sum_{j=\ell}^{m} \left( d_{1j} - \frac{Y_{1j} d_j}{Y_j} \right).$$

The log-rank statistic has variance consistently estimated by

$$\hat{\sigma}^2_{LR}(t_0) = \sum_{j=\ell}^{m} \frac{Y_{1j}}{Y_j}\left(1 - \frac{Y_{1j}}{Y_j}\right)\left(\frac{Y_j - d_j}{Y_j - 1}\right) d_j,$$

where dj and Yj are the total number of deaths and the total number at risk, respectively, in the pooled sample at time tj. Then the standardized test statistic is given by

$$Z_{LR}(t_0) = \frac{X_{LR}(t_0)}{\hat{\sigma}_{LR}(t_0)}, \tag{4}$$
which asymptotically follows a standard normal distribution under H0. One also could consider a weighted log-rank test of H02; however, in simulations it made little difference, probably because we are only testing hazard rates after t0, so we do not consider it further.
Note that (XNA(t0), XLR(t0)) asymptotically follows a bivariate normal distribution under H0 with mean (0, 0). The variance–covariance matrix of (XNA(t0), XLR(t0)) can be estimated by the diagonal matrix diag{σ̂²NA(t0), σ̂²LR(t0)}. The off-diagonal elements of the covariance matrix are 0 because the events contributing to each statistic occur over nonoverlapping times; the Nelson–Aalen estimator only considers events occurring up until time t0, whereas the log-rank test only considers events starting after t0. An equivalent result is that (ZNA(t0), ZLR(t0)) follows a bivariate normal distribution with mean (0, 0) and variance–covariance matrix equal to the identity matrix. To test the composite null hypothesis H0, we can consider a number of ways of combining ZNA(t0) and ZLR(t0), including linear combination tests, a quadratic form test, and a Union-Intersection (Roy, 1953) test. However, the Union-Intersection test is omitted for brevity because it did not perform as well as the quadratic test.
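A sketch of the two component statistics follows, reusing `nelson_aalen()` from the Section 2.1 sketch for ZNA(t0) and implementing the log-rank statistic restricted to event times after t0 for ZLR(t0). The function names are ours, and the code assumes events are observed both before and after t0.

```python
import numpy as np

def z_na(time1, event1, time0, event0, t0):
    """Standardized difference in Nelson-Aalen estimates at t0."""
    l1, v1 = nelson_aalen(time1, event1, t0)
    l0, v0 = nelson_aalen(time0, event0, t0)
    return (l1 - l0) / np.sqrt(v1 + v0)

def z_lr_after(time1, event1, time0, event0, t0):
    """Standardized log-rank statistic using only event times after t0."""
    time = np.concatenate([time1, time0])
    event = np.concatenate([event1, event0])
    grp1 = np.concatenate([np.ones(len(time1), bool), np.zeros(len(time0), bool)])
    x, var = 0.0, 0.0
    for tj in np.sort(np.unique(time[(event == 1) & (time > t0)])):
        at_risk = time >= tj
        y, y1 = at_risk.sum(), (at_risk & grp1).sum()
        d = ((time == tj) & (event == 1)).sum()
        d1 = ((time == tj) & (event == 1) & grp1).sum()
        x += d1 - y1 * d / y                      # observed minus expected in group 1
        if y > 1:
            var += (y1 / y) * (1 - y1 / y) * ((y - d) / (y - 1)) * d
    return x / np.sqrt(var)
```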
2.5.1 Linear combination tests
Linear combination tests consist of a linear combination of the test statistics for each component null hypothesis. These tests have the form

$$Z_c(t_0) = \frac{c_{NA} Z_{NA}(t_0) + c_{LR} Z_{LR}(t_0)}{\sqrt{c_{NA}^2 + c_{LR}^2}}.$$

Several sets of weights are possible; we present one that performed well in simulations. Setting cNA = cLR = 1 yields

$$Z_{OLS}(t_0) = \frac{Z_{NA}(t_0) + Z_{LR}(t_0)}{\sqrt{2}}. \tag{5}$$
This statistic is analogous to the ordinary least squares (OLS) test of O'Brien (1984) for the multiple endpoint testing problem.
Sposto, Stablein, and Carter-Campbell (1997) proposed a partially grouped log-rank test, which is another special case of the linear combination test. Instead of testing H01 with a difference in Nelson–Aalen estimates at t0, they used the difference in Kaplan–Meier estimates. They estimate the variance of this difference using the pooled samples, yielding a final statistic
(6)
where Ŝp(t0) is the Kaplan–Meier estimate at t0 based on the pooled sample of data. Under H0 this statistic has a standard normal distribution. One could also consider a version of this test in which the variances of the first term (the survival difference at t0) are not estimated using the pooled sample; in our simulations, this statistic performed almost identically to the pooled variance test, so we do not consider it further.
2.5.2 Quadratic tests
Next we consider a quadratic form in (XNA(t0), XLR(t0)), which results in an asymptotic χ2 test. Here

$$\chi^2(t_0) = Z_{NA}(t_0)^2 + Z_{LR}(t_0)^2. \tag{7}$$

This follows a χ2 distribution with 2 degrees of freedom under H0.
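Given the two standardized components, the linear and quadratic combinations in equations (5) and (7) are immediate; the sketch below reuses `z_na()` and `z_lr_after()` from above.

```python
import numpy as np
from scipy.stats import chi2, norm

def combination_tests(time1, event1, time0, event0, t0):
    zna = z_na(time1, event1, time0, event0, t0)
    zlr = z_lr_after(time1, event1, time0, event0, t0)
    z_ols = (zna + zlr) / np.sqrt(2.0)            # equally weighted linear combination (5)
    x2 = zna ** 2 + zlr ** 2                      # quadratic form (7), 2 df under H0
    return {"Z_OLS": z_ols, "p_OLS": 2 * norm.sf(abs(z_ols)),
            "chi2": x2, "p_chi2": chi2.sf(x2, df=2)}
```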
3. Design of Simulation Study
A simulation study was designed to compare the performance of the different procedures in terms of their type I error rate and power to detect late differences in survival curves. The setup compares the survival curves between two equal-sized groups after t0 = 24 months, assuming that all patients have a maximum of 72 months of follow-up. In addition to censoring at the fixed time of 72 months, we also overlay an exponential censoring pattern prior to 72 months, with rates selected to induce one of three censoring percentages by 24 months: (1) 0% censoring in each group by 24 months, (2) 15% censoring in each group by 24 months, and (3) 10% censoring in the control group and 20% censoring in the treatment group by 24 months. The overall censoring percentage is between 35% and 60% depending on the scenario, and the censoring time was generated independently of the event time.
For the type I error rate simulations, total sample sizes of n = 100, 200, and 400 with equal sample size per group are studied, across four null hypothesis scenarios. To generate these scenarios we assume piecewise exponential survival curves, where the survival curves differ by 0%, 5%, 10%, or 15% at 8 months (scenarios A–D, respectively), but are equal at 24 months and beyond. These curves are shown in Figure 2a.
For the power simulations, a total sample size of n = 400 is studied, across five alternative hypothesis scenarios. We generate these scenarios using Weibull survival curves. The curves are shown in Figure 2b–f for scenarios E–I, respectively. Scenario E refers to a proportional hazards situation with no crossing hazards or survival curves. Scenarios F and G refer to situations where the survival curves cross prior to t0 but to differing degrees. Scenario H is a situation where the survival curves cross exactly at t0, and in Scenario I the survival curves cross after t0. The percentage of patients still at risk at time t0, which impacts the power, varies between 30% and 66% depending on the scenario, and can be roughly obtained for a given scenario by combining the survival probability from Figure 2 at t0 with the censoring percentage at t0 (0%, 15%, or (10%, 20%)).
For comparison purposes, the log-rank test and the weighted log-rank test with Fleming–Harrington weights of ρ = 0, γ = 1 are also shown, even though they are not specifically suited for testing the null hypothesis of interest here. All simulations used 10,000 Monte Carlo samples.
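To indicate how null scenarios of this kind could be generated, the following sketch draws piecewise exponential event times whose hazards differ before 24 months but whose cumulative hazards agree at 24 months and coincide afterwards, with independent exponential censoring truncated at 72 months. The specific hazard and censoring rates below are illustrative choices of ours, not the parameter values used in the article.

```python
import numpy as np

rng = np.random.default_rng(12345)

def piecewise_exp(n, breaks, rates):
    """Event times with hazard rates[k] on the k-th interval defined by `breaks`."""
    t = np.full(n, np.nan)
    pending = np.ones(n, dtype=bool)
    start = 0.0
    for end, rate in zip(list(breaks) + [np.inf], rates):
        draw = start + rng.exponential(1.0 / rate, size=pending.sum())
        idx = np.where(pending)[0]
        hit = draw < end
        t[idx[hit]] = draw[hit]
        pending[idx[hit]] = False
        start = end
    return t

def simulate_group(n, rates, cens_rate, breaks=(8.0, 24.0), admin=72.0):
    events = piecewise_exp(n, breaks, rates)
    cens = np.minimum(rng.exponential(1.0 / cens_rate, size=n), admin)
    return np.minimum(events, cens), (events <= cens).astype(int)

# Illustrative null scenario: early hazards differ, but the cumulative hazards are equal
# at 24 months (8*0.060 + 16*0.020 = 8*0.030 + 16*0.035 = 0.80) and agree afterwards.
time_trt, event_trt = simulate_group(200, rates=(0.060, 0.020, 0.010), cens_rate=0.005)
time_ctl, event_ctl = simulate_group(200, rates=(0.030, 0.035, 0.010), cens_rate=0.005)
```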
To summarize the simulation results, we applied analysis of variance (ANOVA) techniques. For the type I error rate, we defined the outcome Y as the percent rejection rate minus the nominal rate of 5%, so that good performance is indicated by values of the expectation of Y near 0. For the power simulations, the outcome Y is defined as the percent rejection rate, so that good performance is defined by high values of the expectation of Y.
There were 11 test statistics considered in the simulations: three pointwise comparisons based on the complementary log–log transformation (1) evaluated at three different time points t′ after t0, the statistics given in equations (2)-(7), and the log-rank and weighted log-rank tests starting at time 0. In addition, we included in the ANOVA model factors for scenario (A–D; 4 levels), total sample size n (100, 200, and 400; 3 levels), and censoring pattern (0%, 15%, or (10%,20%); 3 levels). We also considered models in which each of these factors interacted with the main test statistic effect to examine performance of these statistics for specific levels of the other factors (lower order terms included when interaction is present). Specifically, we fit the following models:
$$Y = \tau_t + \sigma_s + \nu_n + \gamma_c + (\tau\sigma)_{ts} + \epsilon, \tag{8}$$

$$Y = \tau_t + \sigma_s + \nu_n + \gamma_c + (\tau\nu)_{tn} + \epsilon, \tag{9}$$

$$Y = \tau_t + \sigma_s + \nu_n + \gamma_c + (\tau\gamma)_{tc} + \epsilon, \tag{10}$$

$$Y = \tau_t + \sigma_s + \nu_n + \gamma_c + \epsilon. \tag{11}$$
Here the subscripts t, s, n, and c refer to test (11 levels), scenario (4 levels), n (3 levels), and censoring pattern (3 levels), respectively, as described above. We fit the models without an intercept and normalized the effects of the other factors to sum to zero, so that the estimates of the interaction terms can be interpreted as average deviations from the nominal 5% level, adjusted for the effects of the other factors. The results are shown for each test in Table 1 by scenario (model (8)), by total sample size n (model (9)), by censoring pattern (model (10)), and overall (model (11)).
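A possible way to carry out this summarization with statsmodels is sketched below. The long-format DataFrame, its column names, and the placeholder outcome values are hypothetical; the sum-to-zero contrasts and the no-intercept fit mimic the normalization described above.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# hypothetical long-format summary of the simulation output (one row per cell)
rng = np.random.default_rng(0)
cells = [(t, s, n, c) for t in ["Z_CLL24", "Z_OLS", "chi2"] for s in "ABCD"
         for n in [100, 200, 400] for c in ["0%", "15%", "(10%,20%)"]]
sims = pd.DataFrame(cells, columns=["test", "scenario", "n", "censor"])
sims["y"] = rng.normal(0.0, 0.3, len(sims))           # placeholder rejection-rate deviations

# model (8): test-by-scenario cell effects, adjusted for sum-coded n and censoring effects
fit = smf.ols("y ~ 0 + C(test):C(scenario) + C(n, Sum) + C(censor, Sum)", data=sims).fit()
print(fit.params.filter(like="C(test)"))              # average deviations from the 5% level
```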
Table 1. Estimated type I error rate minus the nominal 5% level (percentage points) for each test, by scenario (A–D), total sample size n, censoring pattern at 24 months, and overall.

| Method | Equation | A | B | C | D | n = 100 | n = 200 | n = 400 | 0% | 15% | (10%, 20%) | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ZCLL(24) | (1) | −0.05 | −0.10 | 0.08 | −0.04 | −0.02 | 0.01 | −0.06 | 0.03 | −0.07 | −0.03 | −0.03 |
| ZCLL(48) | (1) | 0.36 | −0.02 | 0.16 | 0.12 | 0.26 | 0.18 | 0.03 | 0.47 | −0.04 | 0.05 | 0.16 |
| ZCLL(72) | (1) | 0.18 | −0.02 | 0.15 | 0.07 | 0.24 | 0.01 | 0.04 | −0.01 | 0.11 | 0.19 | 0.10 |
| ZWKM(t0) | (2) | 0.84 | 0.59 | 0.71 | 0.61 | 1.18 | 0.59 | 0.29 | 0.55 | 0.73 | 0.78 | 0.69 |
| Pseudo-value score | (3) | 0.24 | −0.01 | 0.04 | −0.04 | 0.12 | 0.04 | 0.01 | 0.04 | 0.09 | 0.04 | 0.06 |
| ZLR(t0) | (4) | 0.21 | 0.09 | −0.04 | −0.09 | 0.08 | −0.03 | 0.07 | −0.03 | 0.05 | 0.10 | 0.04 |
| ZOLS(t0) | (5) | 0.10 | −0.12 | 0.01 | −0.08 | −0.03 | −0.05 | 0.01 | −0.12 | −0.04 | 0.09 | −0.02 |
| ZSP,P(t0) | (6) | 0.21 | 0.10 | 0.15 | 0.11 | 0.27 | 0.08 | 0.07 | 0.06 | 0.19 | 0.18 | 0.14 |
| χ2(t0) | (7) | −0.13 | −0.31 | −0.26 | −0.40 | −0.53 | −0.23 | −0.07 | −0.44 | −0.18 | −0.21 | −0.27 |
| Log rank | | 0.22 | 0.38 | 1.40 | 3.04 | 0.67 | 1.11 | 2.00 | 0.77 | 1.40 | 1.61 | 1.26 |
| Weighted log rank (ρ = 0, γ = 1) | | 0.20 | 0.34 | 1.33 | 2.87 | 0.37 | 0.95 | 2.24 | 1.23 | 1.22 | 1.12 | 1.19 |
We can see that most of the methods control the type I error rate accurately even for a total sample size of n = 100, with the ZOLS(t0) statistic showing the most accurate control. Entries in the table where the type I error rate is more than 2 SEs from the nominal level are marked in bold. The weighted Kaplan–Meier statistic, ZWKM(t0), has a type I error rate inflated by about 1% for n = 100, although this inflation of the error rate dissipates with larger sample size. The χ2(t0) statistic is somewhat conservative for smaller sample sizes. Also, we point out that the log-rank and weighted log-rank tests starting at time 0 do not control the type I error rate for this hypothesis test for scenarios B–D, because of the early differences in the survival curves prior to t0 = 24. The inflation of the type I error rate worsens with larger early differences in the survival curves (scenario D) and with larger sample sizes, because the test has higher power to detect these early differences. However, the log-rank and weighted log-rank procedures are not designed to test the null hypothesis H0 of equal survival curves after t0, so this result is not unexpected.
For the power results, we only considered one sample size (total n = 400) and the same censoring patterns as in the type I error simulations. The power of the various procedures is expected to depend heavily on the scenario; therefore to summarize the power results, we fit an ANOVA model with an interaction between the test statistic and the scenario, adjusting for censoring pattern,
$$Y = \tau_t + \sigma_s + \gamma_c + (\tau\sigma)_{ts} + \epsilon. \tag{12}$$
Here the subscripts t, s, and c refer to test (11 levels), scenario (5 levels), and censoring pattern (3 levels), respectively.
The results are shown for each test and scenario combination in Table 2. Several general patterns emerge from examining this table. The pointwise comparisons based on the complementary log–log transformation, ZCLL(t′), are sensitive only to differences at that point t′ and do not compare the entire curves. The pointwise comparison at 72 months has the highest power among pointwise comparisons because the largest differences in survival curves are seen there. Although the pointwise comparison at 72 months does well in scenarios F and G, it suffers from loss of power compared to some of the other methods for the proportional hazards situation (E) and when the differences at 72 months are not as pronounced (H–I). In these cases, a comparison of the entire curves using one of the other techniques can have more power.
Table 2. Power (%) of each test under the alternative hypothesis scenarios E–I (total sample size n = 400).

| Method | Equation | E | F | G | H | I |
|---|---|---|---|---|---|---|
| ZCLL(24) | (1) | 62.4 | 15.3 | 21.1 | 4.7 | 21.8 |
| ZCLL(48) | (1) | 70.1 | 32.9 | 65.1 | 21.5 | 6.8 |
| ZCLL(72) | (1) | 71.2 | 44.5 | 85.1 | 46.1 | 25.9 |
| ZWKM(t0) | (2) | 75.8 | 35.0 | 66.3 | 20.3 | 6.0 |
| Pseudo-value score | (3) | 74.8 | 32.0 | 61.2 | 16.4 | 4.8 |
| ZLR(t0) | (4) | 30.7 | 36.5 | 85.4 | 71.7 | 82.6 |
| ZOLS(t0) | (5) | 74.7 | 43.9 | 84.1 | 43.4 | 23.6 |
| ZSP,P(t0) | (6) | 76.9 | 40.2 | 74.8 | 29.6 | 10.7 |
| χ2(t0) | (7) | 67.2 | 36.7 | 83.1 | 61.1 | 81.0 |
| Log rank | | 78.0 | 28.9 | 47.0 | 8.6 | 22.2 |
| Weighted log rank (ρ = 0, γ = 1) | | 64.7 | 49.7 | 93.8 | 70.0 | 64.6 |
The weighted Kaplan–Meier comparison, ZWKM(t0), and the pseudo-value technique based on the generalized score statistic (3) both aggregate differences in survival curves across the times after t0, and as expected they perform well when those differences are in a consistent direction (scenarios E–G) and poorly when those differences are not in the same direction (scenario I) or when many of the differences shortly after t0 are very small (scenario H). Similarly, the linear combination tests (ZOLS(t0) and ZSP,P(t0)) combine the test statistics before and after t0 in a linear fashion and would be expected to perform well when those statistics have the same sign (E–G) and poorly when they have the opposite sign (I) or one of them has a zero mean (H). Among these, the OLS test ZOLS(t0) has the best power in scenarios F–I and only slightly less power for scenario E, and would be recommended.
The χ2(t0) combination test performs well for scenarios (H–I), but may be less efficient when the statistics have the same sign (E–G). However, it appears that the magnitude of the relative power loss for scenarios (E–G) is modest, so this may be a good general method. Note that the log-rank test starting at t0, ZLR(t0), has high power for scenarios (G–I); this component of the combination tests is driving the power much more than the other component, and this disparity in the component statistics is what produces good power for the χ2(t0) combination test, as opposed to a linear combination test.
It is also worth noting that although the pseudo-value test had poor power for scenarios (H–I), it was implemented assuming a time-independent effect of treatment on survival. One advantage of these models is that one can test for whether there is a time-dependent effect and incorporate this into the models, thereby improving the sensitivity to the treatment effect captured in these scenarios. Further investigation of this is needed, however.
The power for the log-rank and weighted log-rank tests starting at time 0 is also shown for comparative purposes. As expected, the log-rank test has the highest power for the proportional hazards alternative (E), but performs poorly for the remaining crossing hazards alternatives (H–I). The weighted log-rank test has the highest power among all the tests for scenarios (F–H), but performs somewhat worse for E and I. Also, the weighted log-rank test can be overly sensitive to early differences in survival curves, and is not suitable for testing the null hypothesis H0 of equal survival curves after t0.
Overall, the χ2(t0) test from equation (7) has the highest power in scenarios H and I, and power that is not substantially worse than the other methods for scenarios E–G, and could be recommended as a general omnibus technique against a variety of alternative hypothesis scenarios. However, note that in scenarios E–G there is a survival difference at t0 that is at least maintained or increased, implying a consistent and easily interpretable difference in long-term survival. On the other hand, in scenario I the survival curves come together and cross sometime after t0. This makes it difficult to interpret which treatment is better, because the answer depends on when after t0 one looks and on how much follow-up there is in the study. The power advantage of the χ2(t0) test for scenario I may not be worth the power loss for the more relevant and interpretable scenarios E–G. The OLS linear combination test ZOLS(t0) from equation (5) has high power for scenarios E–G, and is a better choice than the χ2(t0) test for identifying consistent differences in long-term survival after t0.
4. Example
We now return to the motivating example comparing autologous and allogeneic bone marrow transplants for follicular lymphoma. We are interested in comparing the DFS curves (i.e., the probability that a patient is alive and disease free) between the two treatment arms, but in particular we are interested in comparing the DFS curves after 1 year, which should eliminate much of the anticipated differences in early mortality between autologous and allogeneic transplants. There is a modest amount (42%) of censoring present in the dataset.
The usual log-rank test gives p = 0.443, whereas a weighted log-rank test with Fleming–Harrington weights of ρ = 0, γ = 1 gives a p-value of <0.001. Each of the methods described above was applied using t0 = 12 months. The pointwise comparison using the complementary log–log transformation is shown at t′ = 36 months. The p-values are given in Table 3.
Table 3. p-values for each method applied to the bone marrow transplant data, comparing DFS curves after t0 = 12 months.

| Method | Equation | p-value |
|---|---|---|
| ZCLL(36) | (1) | 0.091 |
| ZWKM(12) | (2) | 0.068 |
| Pseudo-value score | (3) | 0.131 |
| ZLR(12) | (4) | <0.001 |
| ZOLS(12) | (5) | 0.020 |
| ZSP,P(12) | (6) | 0.055 |
| χ2(12) | (7) | <0.001 |
The table indicates that the pointwise comparison ZCLL(36) does not find a significant difference in the survival estimates at 36 months. However, when the entire survival curves after 12 months are compared, the weighted Kaplan–Meier comparison, the pseudo-value approach, and the ZSP,P(12) test all have nonsignificant p-values, whereas the χ2(12) test and the linear combination test ZOLS(12) have significant p-values.
To understand the discrepancies in the results, we need to look at the shape of the survival curves and where they cross. The survival curves cross right around t0, and the simple log-rank test of the hazard functions after t0, ZLR(12), is highly significant. This situation is most similar to simulation scenario H. The weighted Kaplan–Meier statistic, the pseudo-value approach, and the ZSP,P(t0) combination test all had low power in this scenario, whereas the χ2(t0) test and the ZOLS(t0) test had higher power. The latter two tests are more sensitive to the component test of the hazard functions after t0. Using these latter tests, one would conclude that there is a significant difference in the survival curves after 12 months.
5. Discussion
We have considered a number of methods for comparing two survival curves after a prespecified time point, t0. This situation may be of interest when the survival curves are expected to cross, so that only late differences are of interest. Our simulations indicate that a simple pointwise comparison of the curves at some time after t0 is sensitive to the time point chosen and may be inefficient in some settings because it ignores differences in the curves at other times after t0. Use of the weighted log-rank test may be overly sensitive to early differences prior to t0. Two of the methods studied stand out, both of which are based on a combination of a pointwise comparison of the survival curves at t0 and a log-rank test after t0. A χ2(t0) combination test given in equation (7) performs well as an omnibus test against a wide variety of alternative hypothesis scenarios. But if the interest is in identifying consistent differences in long-term survival after t0, a simple equally weighted linear combination test ZOLS(t0) given in equation (5) has better power for this alternative hypothesis and is recommended. A pseudo-value regression approach was also discussed, which can be easily extended to account for covariates. Also, the regression model framework can be used to test whether the treatment effect is consistent across time, thereby allowing flexibility for dealing with different alternative hypotheses such as survival curves crossing after the prespecified time t0. Further research is needed on adapting this approach.
Acknowledgements
This research was partially supported by a grant (R01 CA54706-10) from the National Cancer Institute.
6. Supplementary Materials
Web Appendix A, referenced in Section 2.3, is available under the Paper Information link at the Biometrics website http://www.biometrics.tibs.org.
References
- Andersen PK, Klein JP, Rosthoj S. Generalized linear models for correlated pseudo-observations with applications to multi-state models. Biometrika. 2003;90:15–27.
- Boos D. On generalized score tests. American Statistician. 1992;46:327–333.
- Fleming TR, Harrington DP. A class of hypothesis tests for one and two samples of censored survival data. Communications in Statistics. 1981;10:763–794.
- Klein JP, Andersen PK. Regression modeling of competing risks data based on pseudo-values of the cumulative incidence function. Biometrics. 2005;61:223–229. doi: 10.1111/j.0006-341X.2005.031209.x.
- Klein JP, Logan B, Harhoff M, Andersen PK. Analyzing survival curves at a fixed point in time. Statistics in Medicine. 2007;26:4505–4519. doi: 10.1002/sim.2864.
- Liang K-Y, Zeger SL. Longitudinal data analysis using generalized linear models. Biometrika. 1986;73:13–22.
- Lu L. Explained variation in survival analysis and hypothesis testing for current leukemia-free survival. Ph.D. Dissertation. Medical College of Wisconsin; Milwaukee, Wisconsin: 2006.
- O'Brien PC. Procedures for comparing samples with multiple endpoints. Biometrics. 1984;40:1079–1087.
- Parzen MI, Wei LJ, Ying Z. Simultaneous confidence intervals for the difference of two survival functions. Scandinavian Journal of Statistics. 1997;24:309–314.
- Pepe MS, Fleming TR. Weighted Kaplan–Meier statistics: A class of distance tests for censored survival data. Biometrics. 1989;45:497–507.
- Pepe MS, Fleming TR. Weighted Kaplan–Meier statistics: Large sample and optimality considerations. Journal of the Royal Statistical Society, Series B. 1991;53:341–352.
- Rotnitzky A, Jewell NP. Hypothesis testing of regression parameters in semiparametric generalized linear models for cluster correlated data. Biometrika. 1990;77:485–497.
- Roy SN. On a heuristic method of test construction and its use in multivariate analysis. Annals of Mathematical Statistics. 1953;24:220–238.
- Sposto R, Stablein D, Carter-Campbell S. A partially grouped logrank test. Statistics in Medicine. 1997;16:695–704. doi: 10.1002/(sici)1097-0258(19970330)16:6<695::aid-sim436>3.0.co;2-c.
- Van Besien K, Loberiza F, Bajorunaite R, et al. Comparison of autologous and allogeneic hematopoietic stem cell transplantation for follicular lymphoma. Blood. 2003;102:3521–3529. doi: 10.1182/blood-2003-04-1205.
- Zhang M-J, Klein JP. Confidence bands for the difference of two survival curves under proportional hazards model. Lifetime Data Analysis. 2001;7:243–254. doi: 10.1023/a:1011636407958.