. Author manuscript; available in PMC: 2022 Dec 7.
Published in final edited form as: Stat Methods Med Res. 2021 Jul 7;30(9):2057–2074. doi: 10.1177/09622802211017592

Impact of unequal censoring and insufficient follow-up on comparing survival outcomes: Applications to clinical studies

Deo Kumar Srivastava 1, E Olusegun George 2, Zhaohua Lu 1, Shesh N Rai 3
PMCID: PMC9726613  NIHMSID: NIHMS1772939  PMID: 34232837

Abstract

Clinical trials with survival endpoints are typically designed to enroll patients over a specified accrual period (usually 2–3 years), followed by an additional specified duration of follow-up (usually 2–3 years). Under this scheme, patients who are alive or free of the event of interest at the termination of the study are censored. Consequently, a patient may be censored because of insufficient follow-up duration or because of being lost to follow-up. Potentially, this process could lead to unequal censoring in the treatment arms and to inaccurate, and potentially adverse, conclusions about treatment effects. In this article, using extensive simulation studies, we assess the impact of such censoring on statistical procedures (the generalized logrank tests) for comparing two survival distributions and illustrate our observations by revisiting Mukherjee et al.’s1 findings of cardiovascular events in patients who took Rofecoxib (Vioxx).

Keywords: Unequal censoring, insufficient follow-up, proportional hazards, logrank test, VIGOR trial

1. Introduction

A randomized clinical trial for comparing two treatment groups is usually designed with patients entering the study arms in a staggered fashion over a fixed time period until the planned number of patients has been enrolled in each arm. Following treatment, the patients are followed for a preset time, after which the study is terminated and all patients still alive at the time of termination are considered censored. The survival data are then usually presented as {Xi = min(Ti, Ci), i = 1, 2, … , n} for n patients, where the random variables Ti and Ci represent failure and censoring times, respectively, for the ith patient. The censoring time distribution is assumed to be independent of the failure time distribution. The comparison of two or more survival distributions is usually accomplished with the logrank test or its generalizations. This representation of the data is convenient and leads to valid inference irrespective of the underlying failure time or censoring time distributions if the hazard rates for the two groups remain proportional. The underlying premise is that the failure time and censoring time processes occur in parallel, but the observed time for an individual/patient is the minimum of the two. It has been well recognized, for example, see Kalbfleisch and Prentice1 (Chapter 3, pp. 56–57) and Hougaard2 (Chapter 2, pp. 44–45), that even in the simplest case, where the failure times are exponentially distributed, maximum likelihood estimation can be complicated by the underlying censoring mechanism, but inference based on likelihood theory circumvents this problem.

The comparison of two or more survival distributions is usually accomplished with the logrank test or its generalizations. It is well known that as long as the censoring proportions in the two groups remain similar, the inference based on the logrank test remains valid but becomes more conservative (less powerful) with increasing censoring proportions. However, if the censoring proportions in the two groups are different, then the Xi = min(Ti, Ci) representation compels some restrictions on the underlying censoring distributions. For example, suppose the two failure time distributions are exponential with parameters σ1 and σ2 and the two censoring distributions are also exponential with parameters δ1 and δ2. If 20% and 30% censoring is observed in the two groups, respectively, then the parameters of the two censoring distributions will be chosen so that σ1/(σ1 + δ1) = 0.2 and σ2/(σ2 + δ2) = 0.3. On the other hand, if the censoring distributions are uniform, U(0, τ1) and U(0, τ2), respectively, then τ1 and τ2 will be chosen so that P(T1 > C1) = 0.2 and P(T2 > C2) = 0.3. This effectively imposes different follow-up times in the two groups. However, this framework generally captures data as they are recorded, and is generally reasonable when the study is designed to have long follow-up times. Moreover, within this framework of proportional hazards, the logrank test for comparing two or more groups is fairly robust in terms of maintaining type I error control for varying censoring proportions, although power could decrease as the censoring proportion increases.
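The censoring-proportion calculation above can be checked numerically. The sketch below (an illustration, not code from the paper) treats the exponential parameters as means, so that P(C < T) = σ1/(σ1 + δ1), and verifies the 20% target by simulation; the value σ1 = 5.0 is purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# T ~ Exp(mean sigma1), C ~ Exp(mean delta1), so that
# P(censored) = P(C < T) = sigma1 / (sigma1 + delta1).
sigma1 = 5.0                              # illustrative failure-time mean
delta1 = sigma1 * (1 - 0.2) / 0.2         # chosen so sigma1/(sigma1 + delta1) = 0.2

T = rng.exponential(sigma1, n)
C = rng.exponential(delta1, n)
censored = np.mean(C < T)
print(f"target censoring 0.200, observed {censored:.3f}")
```

With one million pairs, the empirical censoring fraction lands within a few tenths of a percent of the 20% target.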

The asymptotic validity of the logrank test under possibly different censoring distributions is given by Kong and Slud3 and Dirienzo and Lagakos.4 Beltangady and Frankowski5 evaluated the performance of the logrank and Wilcoxon-type tests for comparing two survival distributions in the presence of unequal random censoring for small sample sizes and concluded that inequality of the censoring proportions affected the power of all tests, with greater differences in censoring proportions leading to lower power estimates. However, for small samples and under possibly very different censoring distributions, it has been shown that the conditional permutation test could be substantially anticonservative (Latta6 and Kellerer and Chmelevsky7). Jennrich8 has shown that the permutation logrank test could be very conservative when the censoring distributions are very different. The basic premise underlying the permutation approach is to ensure that the two groups being compared are exchangeable under the null hypothesis. Jennrich9 introduces an artificial mechanism to equalize the censoring distribution between the two groups, by randomly selecting a patient for censoring from one group when a failure is observed in the other group, but pays a price in terms of power. This censoring scheme is essentially equivalent to “random censoring” of patients in a clinical trial set-up as described above. Heinze et al.10 have proposed two exact conditional tests that are suitable for comparing r groups with unequal follow-up times. The first test is obtained by conditioning on the risk set and the second by conditioning on the follow-up time, and they demonstrate their superior performance compared to other asymptotic and unconditional exact tests in the presence of unequal lengths of follow-up.
Wang et al.11 adopt novel approaches and propose several improvements of the permutation logrank test for comparing two survival distributions with different censoring distributions. Their approach is based on imputing failure and censoring times using Kaplan-Meier estimates of the survival and censoring distributions. Wang et al.11 also propose a permutation test for unequal censoring distributions using an imputation approach. It is worth noting that the simulation studies presented in almost all the studies mentioned above assume that the failure time and censoring time processes occur simultaneously, and one observes the minimum of the two.

However, as will be highlighted through several examples, it is rare in practice to observe failure and censoring times simultaneously. In general, a study for comparing the survival distributions of two groups is designed as a randomized trial, with or without interim analyses, or by comparison with a historical group. In the actual conduct of the trial, individuals enter the study randomly and accrual takes place over a time period (TA), for example, two or three years. The patients/individuals enter the study at different times, i.e. staggered entry is allowed. Once the required number of patients has been enrolled and the patients not lost to follow-up have been followed for the minimum preset follow-up time TF ≥ 0 (typically 1 to 2 years), the study is terminated and all patients still alive at the time of analysis are considered censored. Thus, the total duration of the study is TD = TA + TF, typically 3–5 years. However, for an individual/patient, the total duration on study is the time from study entry until an event occurred, the study was terminated, or the individual was censored.

It is clear that when a study is conducted in this manner, two different types of censoring can arise: the first occurs when a subject (or patient) is lost to follow-up (CT1) and the other occurs due to insufficient follow-up (CT2). It can be easily envisioned that under CT2 a “true” failure time could be treated as censored, or a “true” censoring time could be censored earlier. Neither of these possible “misclassifications” of the data is captured in the representation Xi = min(Ti, Ci). The data obtained from such studies are nonetheless placed in the framework of min(Ti, Ci), which is then used in the analysis with the logrank test. However, this inadvertent censoring can create unequal censoring in the two groups and influence inference based on the logrank testing procedure. The purpose of this manuscript is to evaluate the impact of unequal censoring as described above on the inferential procedures.

In Section 2 we present some motivating examples for studying the validity of the logrank test in the context of unequal censoring proportions generated under the censoring mechanisms described above. In Section 3 we briefly provide the details of the generalized logrank test and in Section 4 we provide the details of the simulation studies. In Section 5 we summarize the results of the simulation studies. Finally, in Section 6 we revisit the motivating example and provide some conclusions.

2. Motivating examples

The following examples highlight some of the issues commonly encountered in practice and the complications associated with the analysis of such data.

2.1. Example 1: A prospective study

This example is taken from findings of adverse cardiovascular events in patients taking the COX-2 inhibitor, Vioxx. This study, also known as VIGOR (Vioxx Gastrointestinal Outcome Research), was published by Bombardier et al.12 in the New England Journal of Medicine (NEJM). The purpose of the study was to assess if Rofecoxib (Vioxx) was more effective in reducing upper gastrointestinal toxicities in comparison to Naproxen in patients with Rheumatoid Arthritis (RA). Patients were randomized to the two arms and the outcome measure, the time to development of confirmed upper gastrointestinal events, was compared between the two groups. In addition, the study was closely monitored for excessive cardiovascular adverse effects. A total of 8076 patients were enrolled on the study between January 1999 and July 1999, out of which 4047 and 4029 were randomized to be treated with Rofecoxib and Naproxen, respectively. The findings of the study were published with median follow-up of about nine months (range: 0.5–13 months) in both arms. The study reported that treatment with Rofecoxib was associated with significantly fewer clinically important upper gastrointestinal events than Naproxen; see Figure 1 (reproduced from Bombardier et al.12). In addition, the authors reported that the overall mortality rate and the rate of death from cardiovascular causes were similar in the two groups.

Figure 1. Cumulative incidence of the primary end point of a confirmed upper gastrointestinal event among all randomized patients.

However, in a follow-up meta-analysis, Mukherjee et al.13 found a significantly increased risk of cardiovascular events associated with selective COX-2 inhibitors in several clinical trials, including the VIGOR study mentioned above. Based on excessive cardiovascular adverse events in one of the arms, the data safety monitoring board recommended blinded adjudication of cardiovascular events. A total of 66 such patients were chosen, and the event-free survival analysis of these patients showed that the relative risk (RR) of developing cardiovascular events in the Rofecoxib treatment group was 2.38 (P < .001; see Figure 2, reproduced from Mukherjee et al.13). Thus, with just a few more months of follow-up, this finding directly contradicts the finding of the previous study by Bombardier et al.12 This raises the question of why there is such a drastic difference in the conclusions drawn with just a few more months of follow-up.

Figure 2. Time to cardiovascular adverse event in the VIGOR trial.

Although no actual data have been made available to us, for illustration purposes we have recreated a data set, based on a figure published in Mukherjee et al.,13 by assuming all failures to occur at the midpoint of the two-month intervals (showing the number of patients at risk) and all censoring to occur just prior to the upper limit of the interval. We fit exponential and Weibull distributions to both arms; the exponential fit was poor, whereas the Weibull distribution provided a significantly improved fit for both groups, but with very different shape and scale parameters.

Two obvious findings emerge from our reanalysis: (1) the proportional hazards assumption may not be valid (very different shape and scale estimates for the two groups under the Weibull fits) and (2) the follow-up is too short, leading to substantial censoring (only roughly 25% at risk at 12 months of follow-up).

Several reasons may be used to explain the above discrepant finding:

  1. Increased Power with Longer Follow-up: With longer follow-up more events are observed, and this increases the power of the test in discriminating the two groups.

  2. Insufficient Follow-up and Random Censoring: With shorter follow-up it is likely that the failure times of many individuals/patients would be treated as censored observations because of insufficient follow-up, the CT2 type of censoring. That is, several failures observed in Mukherjee et al.13 would have been considered censored in the Bombardier et al.12 analysis, more so for the Rofecoxib arm, which artificially inflates the survival estimates for the Rofecoxib arm and results in no apparent difference in cardiovascular events between the two arms. As seen in our simulation studies, conducting the analysis with relatively short follow-up can artificially inflate the power function. In a randomized trial it can be expected that the follow-up time would be similar in the two groups, but with shorter follow-up, the group with worse survival (increasing failure rate) will have a higher proportion of observations censored due to insufficient follow-up (CT2).

  3. Unequal Censoring Distributions: The performance of the logrank test could be affected by the unequal distribution of the censoring times when different proportion of subjects/patients, independent of the failure time distribution, drop out of the study, which is CT1 type of censoring.

  4. Validity of Proportional Hazards Assumption: The number of failures would be increasing in both groups, but the ratio of the failure rates between the two groups needs to remain constant for the logrank test to be valid. However, it is very clear that the hazard for the Rofecoxib arm is increasing much faster than for the Naproxen arm, suggesting that, with longer follow-up, more cardiovascular events would be observed in the Rofecoxib arm than in the Naproxen arm, and the assumption of a constant hazard ratio is not justified. Thus, the use of the logrank test may not be valid.

It is worth noting that even at 10 months only about 26% of the patients are at risk indicating that at the time of analysis the majority of the patients, who may later succumb to cardiovascular episodes, were censored.

The above example is typical of many clinical trials where the primary objective is to assess the efficacy (Rofecoxib reduces the upper gastrointestinal events compared to Naproxen) but the secondary objective could be to evaluate safety (monitor adverse cardiovascular risk in the two groups). In such scenarios, the equality of censoring distribution in the two treatment groups cannot be guaranteed and it would be important to assess its impact on the logrank test.

2.2. Example 2: A retrospective study/evaluation of secondary objectives

This example identifies the potential problems with retrospective studies in the context of a clinical trial. It is not uncommon for investigators to learn from previous trials and design new trials to improve the outcomes, comparing them to historical data while individuals on the previous trial are still being followed. Naturally, one could see very different censoring proportions depending on when the analysis is conducted.

Similarly, after data from a randomized clinical trial are analyzed, some secondary objectives may be of interest. Some of these secondary objectives may be formulated to assess the effect of some known or unknown factors that may influence the outcome measure, such as Overall Survival (OS). As an example, in a bone marrow transplant setting, once a randomized study is completed, the interest could be in assessing the effect of FAB (French-American-British) classification for AML (JMML: Juvenile Myelomonocytic Leukemia vs. others) on OS, the expectation being that the patients with JMML would have an inferior outcome compared to others. In this and other similar settings, the logrank types of statistics are used to determine the predictive value of the covariates. The issues that were noted for the validity of the logrank test may exist in the present setting as well:

  1. Unequal Censoring Proportions: The proportion of censored observations may not be similar compared to historical data and within several levels of the factors being investigated in the secondary analyses.

  2. Random Censoring: Also, because of fixed study duration some of the individuals will be censored at random, in the sense discussed above. It would be critical to evaluate the impact of such imbalances on the logrank type of test procedures.

In order to better understand the underlying complexity that the above examples illustrate, we start by noting that the data we observe are of the form min(Ti, Ci, TiD), for i = 1, 2, … , n, where Ti, Ci and TiD denote the failure time, censoring time and the study duration time, respectively. If TiD is larger than Ti or Ci, then the traditional framework of observing min(Ti, Ci) holds and inferential procedures such as the logrank test would be valid. However, if TiD is smaller than Ti or Ci, then the traditional framework does not hold and individuals may be randomly censored, i.e. some failure times may be censored and some actual censoring times may be censored earlier. It is interesting to note that an individual whose failure time would have been observed but who is actually censored due to early termination contributes 1 − F(tiD) rather than f(ti) to the likelihood function. This may potentially bias the inferential procedures.
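The misclassification described above can be illustrated with a small simulation (a sketch with illustrative parameters; the loss-to-follow-up mean of 20 years and the 2–5 year on-study durations are our assumptions, not values from the paper). The quantity of interest is the fraction of "hidden" failures: subjects whose failure would have been observed before any loss to follow-up, but whose study time TiD ended first, so they are recorded as censored:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

T = rng.exponential(5.873, n)    # "true" failure times (Group 1 mean from Study 1)
C = rng.exponential(20.0, n)     # loss-to-follow-up times (illustrative mean)
TD = rng.uniform(2.0, 5.0, n)    # per-subject time on study under staggered entry

X = np.minimum.reduce([T, C, TD])     # what is actually recorded
event = (T <= C) & (T <= TD)          # failures observed as failures

# Failures hidden by early termination: the failure would have occurred
# before any loss to follow-up, but the study ended first.
hidden = (T <= C) & (TD < T)
print(f"observed event fraction {event.mean():.3f}, hidden failures {hidden.mean():.3f}")
```

Under these illustrative parameters, a large share of true failures is converted into censored observations purely by the study-duration cutoff, which is exactly the CT2 mechanism discussed above.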

Our focus is not on evaluating the impact of violation of the proportional hazards assumption or on assessing the impact of unequal censoring in small samples on the logrank test. The focus of this research is to evaluate the impact on the logrank test of two distinct censoring mechanisms: one that occurs naturally due to losing patients to follow-up and a second due to insufficient follow-up of the patients. It is clear that even though these two distinct types of censoring are encountered in the conduct of a clinical trial, no distinction is made between them when the analysis is conducted, although under the second type a potential failure time may be treated as censored. This type of censoring essentially amounts to “random censoring” of individuals or patients.

In the following, we examine the operating characteristics of the logrank tests and its generalizations, the Gρ statistics, proposed by Harrington and Fleming14 and others on data obtained under these conditions. A brief description of the generalized logrank test is provided in the next section.

3. Class of linear rank statistics

For comparison of two survival curves, H0 : S1 = S2, the most popular method is the logrank test, which is adapted from the stratified test for 2 × 2 contingency tables; its construction can be briefly outlined as follows:

Let τ1, τ2, … , τL denote L distinct failure times. Then at time τj the 2 × 2 table is given as

           Observed to fail   Did not fail   At risk
Group 1    d1j                n1j − d1j      n1j
Group 2    d2j                n2j − d2j      n2j
Total      dj                 nj − dj        nj

where n1j and n2j denote the number of individuals still at risk just prior to time τj in Groups 1 and 2, respectively. Further, let d1j and d2j denote the observed number of individuals who fail at time τj. It is well known that, conditioned on the marginal totals, d1j follows a hypergeometric distribution with mean Ej(d1j) = dj n1j/nj and variance Vj(d1j) = dj (n1j n2j/nj²) ((nj − dj)/(nj − 1)), and the logrank statistic given by

Z = ∑_{j=1}^{L} (Oj − Ej) / √( ∑_{j=1}^{L} Vj )   (1)

is asymptotically N(0, 1) under H0; equivalently, Z² asymptotically follows a χ² distribution with one degree of freedom under H0. The weighted version of the logrank test can be obtained by attaching weights wj at each failure time τj, i.e.

Z = ∑_{j=1}^{L} wj (Oj − Ej) / √( ∑_{j=1}^{L} wj² Vj )   (2)

and is asymptotically N(0, 1) under H0. By choosing different weights, a wide variety of tests proposed in the literature can be obtained. For example, using wj = (nj/n)^η for η ≥ 0 leads to the class of weighted logrank tests proposed by Tarone and Ware.15 Further, by using the Kaplan-Meier estimate Ŝ(tj) of the survival distribution for the combined samples and the weights wj = [Ŝ(tj)]^ρ [1 − Ŝ(tj)]^γ (ρ ≥ 0, γ ≥ 0), Fleming and Harrington16 (Chapter 7) proposed the Gρ,γ family of test procedures, a generalization of the Gρ family proposed by Harrington and Fleming14 (obtained with γ = 0). For (ρ = 0, γ = 0) the statistic reduces to the usual logrank test. For (ρ = 1, γ = 0), (ρ = 1, γ = 1) and (ρ = 0, γ = 1), early, middle and late differences in the survival curves, respectively, are emphasized in the testing process. A more detailed description of the test statistic and its generalization to K samples can be found in the above-mentioned references or in Bogaerts et al.17
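For concreteness, a minimal implementation of the weighted statistic in equation (2) with the Gρ,γ weights might look as follows (an illustrative sketch, not the authors' code; the function name and its arguments are ours). The left-continuous pooled Kaplan-Meier estimate is updated only after each failure time is processed, so each weight uses Ŝ just before τj:

```python
import numpy as np

def weighted_logrank(time, event, group, rho=0.0, gamma=0.0):
    """Two-sample weighted logrank statistic, equation (2), with
    Fleming-Harrington weights w_j = S(t_j-)^rho * (1 - S(t_j-))^gamma."""
    time, event, group = map(np.asarray, (time, event, group))
    num = den = 0.0
    s_hat = 1.0                                  # pooled KM, left-continuous at tau
    for tau in np.unique(time[event == 1]):      # distinct failure times
        at_risk = time >= tau
        n_j = at_risk.sum()
        n1j = (at_risk & (group == 1)).sum()
        fail = (time == tau) & (event == 1)
        d_j = fail.sum()
        d1j = (fail & (group == 1)).sum()
        w = s_hat ** rho * (1.0 - s_hat) ** gamma
        e1j = d_j * n1j / n_j                    # hypergeometric mean
        v1j = d_j * (n1j / n_j) * (1 - n1j / n_j) * (n_j - d_j) / max(n_j - 1, 1)
        num += w * (d1j - e1j)
        den += w ** 2 * v1j
        s_hat *= 1.0 - d_j / n_j                 # update KM after tau
    return num / np.sqrt(den)
```

With rho = gamma = 0 this reduces to the ordinary logrank statistic of equation (1); positive rho or gamma shifts weight toward early, middle, or late differences as described in the text.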

In our evaluation we have focused our attention on the Gρ family of test statistics proposed by Harrington and Fleming,14 i.e. γ = 0. They have shown that, under equal censoring distributions, the Gρ statistics have good power properties for time-transformed location alternatives. Further, based on a simulation study, they find that for some configurations the logrank test, corresponding to ρ = 0, exhibits good power properties, whereas for other configurations higher values of ρ (1, 2) exhibit better power properties. In their simulation study, they do mention that the Gρ statistics maintain the true size of the test under varying amounts of censorship. However, it is not indicated whether the true size will be maintained if the K samples had different patterns of censoring and the censoring mechanism involved censoring of individuals at random as discussed before. In other words, we want to examine whether the rank statistics are robust to unequal censoring arising as a result of random censoring of individuals in K = 2 samples.

For robustness studies, it is essential to differentiate between validity robustness and efficiency robustness. A test is considered validity robust if the size of the test is maintained even under moderate departures from the underlying assumptions. However, validity robustness may lead to loss of efficiency, i.e. substantial loss in power. On the other hand, a test is considered efficiency robust if, under moderate departures from the underlying assumptions, the test maintains type I error control while retaining good power properties. It is worth noting that if a test lacks validity robustness, then assessment of power properties is a futile exercise. Thus it would be desirable to have tests that are both validity robust and efficiency robust. In the next section, we provide the details of the simulation study to evaluate the robustness of the generalized logrank tests when the censoring distributions could be very different when comparing two or more survival distributions.

4. Simulation experiment

The simulation experiments were conducted for comparing two survival distributions both coming from exponential or Weibull family so that proportional hazard assumption would hold. We have chosen these two distributions since the exponential, piecewise exponential and Weibull are commonly implemented in software such as DSTPLAN,18 EAST6.519 and PASS20 for designing clinical trials with survival outcome. The survival distribution for Weibull can be represented as

S(t) = exp(−(t/σ)^δ), 0 < t < ∞   (3)

where δ is the shape parameter and σ is the scale parameter, denoted by Weib(δ, σ). Further, δ = 1 leads to the survival distribution of the exponential distribution, denoted by Exp(σ).

We conducted two simulation studies. The first study was conducted in the context of a randomized study, where it would be expected that the follow-up time would be similar and unequal censoring in the two groups would essentially be due to subjects being lost to follow-up (CT1). The second study was undertaken in the context of assessing the improvement in survival with a newer treatment plan compared to a historical control.

Study 1. Table 1 provides the sample size requirement for comparing two survival curves in a randomized clinical trial, where n1 and n2 are the sample sizes for the two groups (Group 1 and Group 2) and Si(3), i = 1, 2, represents the hypothesized survival at three years for the two groups, respectively. The accrual time was assumed to be three years with equal numbers of subjects entering in each time period, and the minimum follow-up was two years, for a total study duration of five years. The above formulation of the sample size justification leads to exponential distributions with parameters σ1 = 5.873 and σ2 = 10.931 for Groups 1 and 2, respectively. If the underlying distributions are assumed to be Weibull, then for the choice of parameters (δ1 = 0.667, σ1 = 8.217) and (δ2 = 0.667, σ2 = 20.867), the survival probabilities at three years for Groups 1 and 2 would be 60% and 76%, respectively.
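The design parameters quoted above can be verified directly from the exponential and Weibull survival functions of equation (3) (a quick numerical check; the helper function name is ours):

```python
import math

def weib_surv(t, shape, scale):
    """Weibull survival function S(t) = exp(-(t/scale)^shape), equation (3)."""
    return math.exp(-((t / scale) ** shape))

# Exponential, parameterized by the mean: S(3) = exp(-3/sigma)
print(f"{math.exp(-3 / 5.873):.2f}")           # Group 1: 0.60
print(f"{math.exp(-3 / 10.931):.2f}")          # Group 2: 0.76

# The quoted Weibull parameters hit the same three-year survival targets
print(f"{weib_surv(3, 0.667, 8.217):.2f}")     # Group 1: 0.60
print(f"{weib_surv(3, 0.667, 20.867):.2f}")    # Group 2: 0.76
```

Both parameterizations reproduce the hypothesized three-year survival probabilities of 60% and 76% used in the sample-size calculation.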

Table 1.

Sample size for comparing two survival curves with α = 0.05 and a two-sided test.

n1    n2    S1(3)   S2(3)   Power (DSTPLAN)   Power (EAST6.5)   Power (PASS16)
120   120   60%     76%     79%               76%               82%

Remark: It may be noted that in calculating the sample size no assumptions are made regarding the underlying censoring distribution.

In conducting the simulation studies, two approaches can be adopted. In the first approach (Scenario I), an observation could first be censored based on insufficient follow-up and then by random censoring due to individuals dropping out. Alternatively, censoring could happen randomly first, and all those individuals (patients) still at risk at the end of the study would then be censored due to insufficient follow-up (Scenario II). It may be reasonable to expect that, in practice, Scenario II-type censoring would be more common: individuals/patients would drop out of the study randomly, and then at the end of the study all those still surviving would be censored.

In order to explain the process of generating the samples we outline the process in the following steps while focusing on Group I (control group). Assume that the accrual time (TA) = 3 years and minimum follow-up time (TF) = 2 years with the total duration of the study (TD) = 5 years. The steps for conducting simulation studies for the two Scenarios are provided below and outlined as a flow chart in Figure 3.

Figure 3. Flow chart of simulation set-up for the two scenarios for the control group with three-year uniform enrollment and a five-year study duration.

4.1. Simulation set-up for scenario I

  1. Accrue 40 (one-third of 120) individuals/patients in each of the three years. The entry times for the subgroups of subjects were randomly chosen from uniform distributions: Ei ~ U(0, 1), i = 1, …, 40; Ei ~ U(1, 2), i = 41, …, 80; and Ei ~ U(2, 3), i = 81, …, 120. Let the vector of entry times be denoted by E.

  2. Correspondingly, failure times Ti* were generated from an exponential distribution with parameter σ1 = 5.873, i.e. T* ~ Exp(σ1 = 5.873). In this manner, two vectors E and T* with the entry times and failure times were generated. Also, let the censoring indicator Δ be a vector of 1’s indicating failures.

  3. Censoring due to Insufficient Follow-up (CT2): If for the ith observation Ti* > 5 − Ei, then the observation was censored, the censoring time was set to Ci = 5 − Ei, and the censoring indicator was set to Δi = 0. Say r1 observations, i.e. ξ1 = 100(r1/n1) percent, were censored due to insufficient follow-up.

  4. Random Censoring (CT1): From the remaining (n1 − r1) observations with failure times Ti* ≤ 5 − Ei, some subjects could drop out of the study independent of the failure time process. With ϕ1 representing the censoring percentage, s1 = (n1 − r1) × ϕ1 observations were censored (lost to follow-up). This was achieved by generating Ri ~ U(0, 1), for i = 1, 2, …, n1 − r1, censoring an observation if Ri < ϕ1, and setting the censoring indicator Δi to 0. Whenever an observation was censored, the censoring time Ci was randomly drawn from a uniform distribution U(0, Ti*), where Ti* is the actual failure time, i.e. Ci = U(0, Ti*). For example, if an individual with a failure time of 2.5 is chosen to be censored, then we replace that observation with a uniform random number generated from U(0, 2.5), indicating that the individual could have been lost to follow-up anytime in the 2.5 years before the actual failure time would have been observed. In this manner a random sample of size n1 is generated with (r1 + s1) observations censored: r1 due to insufficient follow-up and s1 due to other causes such as loss to follow-up or withdrawal from the study. Thus, the percentage of observations censored due to insufficient follow-up and random censoring combined is ϖ1 = 100((r1 + s1)/n1).

  5. After censoring subjects due to insufficient follow-up and loss to follow-up (random censoring), the remaining (n1 − r1 − s1) observations were observed with their actual failure times Ti = Ti* and censoring indicator Δi = 1.
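The five steps above can be sketched as follows (an illustrative reimplementation under the stated design with n = 120, σ1 = 5.873, three-year accrual and five-year study duration, not the authors' code; the function and argument names are ours):

```python
import numpy as np

def simulate_arm_scenario1(n=120, sigma=5.873, phi=0.20, TA=3, TD=5.0, rng=None):
    """One simulated arm under Scenario I: censoring for insufficient
    follow-up (CT2) is applied first, then random loss to follow-up (CT1)."""
    if rng is None:
        rng = np.random.default_rng()
    # Step 1: staggered uniform accrual, n/TA subjects per accrual year
    E = np.concatenate([rng.uniform(k, k + 1, n // TA) for k in range(TA)])
    # Step 2: failure times from Exp(mean sigma)
    Tstar = rng.exponential(sigma, n)
    X = Tstar.copy()
    delta = np.ones(n, dtype=int)
    # Step 3 (CT2): censor at study end when the failure falls beyond TD - E
    ct2 = Tstar > TD - E
    X[ct2] = TD - E[ct2]
    delta[ct2] = 0
    # Step 4 (CT1): among the rest, drop out with probability phi,
    # censoring uniformly on (0, Tstar) before the would-be failure
    drop = ~ct2 & (rng.uniform(size=n) < phi)
    X[drop] = rng.uniform(0.0, Tstar[drop])
    delta[drop] = 0
    # Step 5: everyone else keeps the actual failure time, delta = 1
    return X, delta
```

Calling the function once per arm (with each arm's σ and ϕ) yields the paired (Xi, Δi) data to which the generalized logrank statistics are then applied.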

4.2. Simulation set-up for scenario II

In this set-up, the first two steps were the same as outlined above for Scenario I, and the vectors Ti*, Ei and Δi, for i = 1, 2, … , 120, were generated. Then the following steps were followed, with the order of censoring reversed, i.e. random censoring followed by censoring due to insufficient follow-up:

  1. Random Censoring (CT1): Generate Ri ~ U(0, 1) and, if Ri < ϕ1, generate the censoring time Wi ~ U(0, Ti*) and set the censoring indicator Δi = 0. Let s1 be the number of censored observations.
    1. If Wi ≤ 5 − Ei, then the censoring time Ci = Wi and Δi = 0.
    2. However, if Wi > 5 − Ei, then the censoring time is set to Ci = 5 − Ei and Δi = 0.
  2. Censoring Due to Insufficient Follow-up (CT2): For the remaining (n1 − s1) observations, the process described below was followed:
    1. If Ti* > 5 − Ei, then set the censoring time Ci = 5 − Ei and the censoring indicator Δi = 0. Let r1 be the number of observations censored.
    2. For the remaining (n1 − s1 − r1) observations, the failure times Ti = Ti* and the censoring indicator Δi = 1. The percentage of observations censored was ϖ1 = 100((r1 + s1)/n1).

The above process was repeated to obtain the second sample, with ϖ2 = 100((r2 + s2)/n2) observations censored. The censoring process was kept consistent across both samples (Scenario I or II), and then the generalized logrank statistic was applied to test the equality of the two survival distributions under the data representation Xi = min(Ti, Ci), and the p-value was obtained.

4.3. Performance of the test under null hypothesis

Control of the null distribution was assessed by generating two random samples of sizes n1 = n2 = 120 from exponential distributions with parameters σ1 = σ2 = 5.873, using censoring proportions ϕi ∈ {0, 0.10, 0.20, 0.30, 0.40, 0.50}, i = 1, 2. The accrual time and the study duration were fixed as described above. The Gρ (ρ = 0, 0.5, 1, 2) statistics were then applied to each pair of samples and a p-value was obtained. This process was replicated 10,000 times, and the type I error was estimated as the proportion of replications in which the p-value was significant at level 0.05. The same process was repeated for Weibull distributions with parameters δ = 0.667 and σ1 = σ2 = 8.217.
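A small-scale version of this experiment for the plain logrank test (Gρ with ρ = 0) can be run as a cross-check. The sketch below is ours, not the authors' code: it implements the standard two-sample logrank chi-square and the Scenario I censoring steps, and uses far fewer replications than the paper's 10,000, so the estimates are noisier:

```python
import random

def logrank_chisq(times1, events1, times2, events2):
    """Standard two-sample logrank chi-square statistic (1 df)."""
    pooled = sorted([(t, d, 0) for t, d in zip(times1, events1)] +
                    [(t, d, 1) for t, d in zip(times2, events2)])
    n1, n2 = len(times1), len(times2)   # at-risk counts, updated as we sweep
    O = E = V = 0.0
    i = 0
    while i < len(pooled):
        t0 = pooled[i][0]
        d1 = d2 = c1 = c2 = 0           # events/censorings at this time
        while i < len(pooled) and pooled[i][0] == t0:
            _, d, g = pooled[i]
            if g == 0:
                d1, c1 = d1 + d, c1 + (1 - d)
            else:
                d2, c2 = d2 + d, c2 + (1 - d)
            i += 1
        n, d = n1 + n2, d1 + d2
        if d > 0:
            O += d1                      # observed events in group 1
            E += d * n1 / n              # expected under H0
            if n > 1:
                V += d * (n1 / n) * (n2 / n) * (n - d) / (n - 1)
        n1 -= d1 + c1
        n2 -= d2 + c2
    return (O - E) ** 2 / V if V > 0 else 0.0

def scenario1_arm(n, sigma, phi, rng):
    """Scenario I data: insufficient follow-up, then random censoring."""
    times, events = [], []
    for _ in range(n):
        e = rng.uniform(0.0, 3.0)        # staggered entry over 3 years
        t = rng.expovariate(1.0 / sigma)
        if t > 5.0 - e:                  # 5-year study duration
            times.append(5.0 - e); events.append(0)
        elif rng.random() < phi:
            times.append(rng.uniform(0.0, t)); events.append(0)
        else:
            times.append(t); events.append(1)
    return times, events

def type1_error(phi1, phi2, reps=500, seed=7):
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        t1, e1 = scenario1_arm(120, 5.873, phi1, rng)
        t2, e2 = scenario1_arm(120, 5.873, phi2, rng)
        if logrank_chisq(t1, e1, t2, e2) > 3.841:  # chi-square(1), alpha = 0.05
            rejections += 1
    return rejections / reps

print("equal censoring:  ", type1_error(0.2, 0.2))   # near the nominal 0.05
print("unequal censoring:", type1_error(0.0, 0.3))   # inflated, cf. Table 2
```

With equal censoring the rejection rate stays close to 0.05, while an unequal split such as (ϕ1, ϕ2) = (0, 0.30) inflates it severely, matching the pattern in Table 2.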

The type I error control for exponential and Weibull distributions, for Scenario I censoring, is presented in Tables 2 and 3, respectively. Similarly, the results for Scenario II censoring are presented in Tables 4 and 5, respectively.

Table 2.

Type I error control of Gρ statistics for scenario I censoring (B = 10, 000, n1 = n2 = 120).

ξ1 ξ2 ϕ1 ϕ2 ω1 ω2 ρ = 0 ρ = 0.5 ρ = 1.0 ρ = 2.0
55.7 55.7 0 0 55.7 55.7 0.047 0.048 0.049 0.049
55.7 55.7 10 10 60.1 60.1 0.049 0.049 0.049 0.048
55.7 55.7 20 20 64.5 64.5 0.047 0.047 0.046 0.046
55.7 55.7 30 30 69.0 69.0 0.048 0.048 0.047 0.047
55.7 55.7 40 40 73.4 73.4 0.050 0.050 0.050 0.049
55.7 55.7 0 10 55.7 60.1 0.075 0.075 0.075 0.072
55.6 55.7 10 20 60.1 64.6 0.079 0.078 0.078 0.076
55.7 55.7 20 30 64.6 69.0 0.080 0.080 0.080 0.080
55.7 55.7 30 40 69.0 73.4 0.090 0.090 0.090 0.086
55.7 55.7 10 0 60.1 55.7 0.075 0.074 0.074 0.072
55.7 55.8 20 10 64.5 60.2 0.074 0.074 0.074 0.073
55.7 55.6 30 20 69.0 64.5 0.086 0.086 0.086 0.081
55.7 55.6 40 30 73.4 68.9 0.094 0.094 0.093 0.091
55.7 55.7 0 20 55.7 64.5 0.153 0.152 0.151 0.141
55.8 55.6 10 30 60.2 68.9 0.168 0.166 0.164 0.157
55.7 55.8 20 40 64.6 73.5 0.195 0.195 0.192 0.188
55.7 55.8 20 0 64.6 55.8 0.158 0.159 0.158 0.151
55.7 55.6 30 10 69.0 60.0 0.174 0.174 0.172 0.164
55.7 55.7 40 20 73.4 64.5 0.200 0.201 0.196 0.187
55.7 55.7 0 30 55.7 69.0 0.310 0.309 0.303 0.292
55.8 55.7 10 40 60.2 73.4 0.358 0.354 0.349 0.337
55.7 55.7 30 0 69.0 55.7 0.316 0.312 0.307 0.295
55.7 55.7 40 10 73.4 60.2 0.350 0.349 0.345 0.333

T1 and T2 ~ Exp(σ1 = σ2 = 5.873).

ξi, i = 1, 2 represents the percentage of observations censored due to insufficient follow-up; ϕi, i = 1, 2 represents the percentage of observations censored due to random censoring; ωi, i = 1, 2 represents the percentage of observations censored due to insufficient follow-up and random censoring.

Table 3.

Type I error control for Gρ statistics for scenario I censoring (B = 10, 000, n1 = n2 = 120).

ξ1 ξ2 ϕ1 ϕ2 ω1 ω2 ρ = 0 ρ = 0.5 ρ = 1.0 ρ = 2.0
57.3 57.3 0 0 57.3 57.3 0.050 0.050 0.051 0.051
57.3 57.3 10 10 61.6 61.5 0.050 0.049 0.052 0.050
57.2 57.3 20 20 65.7 65.8 0.052 0.051 0.050 0.050
57.3 57.2 30 30 70.1 70.0 0.050 0.051 0.050 0.051
57.3 57.2 40 40 74.4 74.3 0.053 0.054 0.055 0.055
57.2 57.2 0 10 57.2 61.4 0.072 0.072 0.074 0.073
57.3 57.2 10 20 61.5 65.8 0.078 0.078 0.077 0.076
57.2 57.1 20 30 65.8 70.0 0.085 0.085 0.084 0.083
57.3 57.3 30 40 70.1 74.4 0.089 0.089 0.091 0.088
57.1 57.3 10 0 61.4 57.3 0.069 0.069 0.068 0.067
57.2 57.2 20 10 65.8 61.5 0.080 0.080 0.080 0.079
57.2 57.3 30 20 70.0 65.8 0.083 0.081 0.079 0.078
57.3 57.3 40 30 74.4 70.1 0.089 0.090 0.088 0.087
57.3 57.3 0 20 57.3 65.9 0.158 0.158 0.157 0.154
57.3 57.3 10 30 61.6 70.1 0.177 0.176 0.175 0.172
57.3 57.2 20 40 65.8 74.3 0.199 0.200 0.199 0.192
57.3 57.2 20 0 65.8 57.2 0.165 0.166 0.164 0.160
57.3 57.3 30 10 70.1 61.6 0.172 0.172 0.171 0.167
57.2 57.2 40 20 74.3 65.8 0.201 0.200 0.197 0.193
57.3 57.3 0 30 57.3 70.1 0.326 0.323 0.320 0.309
57.3 57.4 10 40 61.6 74.4 0.368 0.367 0.363 0.350
57.2 57.3 30 0 70.0 57.3 0.315 0.315 0.310 0.298
57.3 57.3 40 10 74.4 61.6 0.363 0.363 0.359 0.349

T1 and T2 ~ Weib(δ1 = δ2 = 0.667, σ1 = σ2 = 8.217). ξi, i = 1, 2 represents the percentage of observations censored due to insufficient follow-up; ϕi, i = 1, 2 represents the percentage of observations censored due to random censoring; ωi, i = 1, 2 represents the percentage of observations censored due to insufficient follow-up and random censoring.

Table 4.

Type I error control of Gρ statistics for scenario II censoring (B = 10, 000, n1 = n2 = 120).

ξ1 ξ2 ϕ1 ϕ2 ω1 ω2 ρ = 0 ρ = 0.5 ρ = 1.0 ρ = 2.0
55.7 55.7 0 0 55.7 55.7 0.047 0.047 0.048 0.048
53.0 53.0 10 10 60.1 60.1 0.049 0.049 0.049 0.049
50.4 50.3 20 20 64.6 64.5 0.045 0.046 0.045 0.046
47.7 47.7 30 30 69.0 69.0 0.054 0.052 0.052 0.053
45.1 45.0 40 40 73.5 73.5 0.048 0.049 0.048 0.047
55.7 53.0 0 10 55.7 60.1 0.065 0.066 0.065 0.065
53.1 50.4 10 20 60.2 64.6 0.067 0.066 0.067 0.066
50.4 47.7 20 30 64.6 69.0 0.068 0.069 0.069 0.068
47.7 45.0 30 40 69.0 73.4 0.082 0.081 0.08 0.079
53.0 55.6 10 0 60.1 55.6 0.067 0.068 0.069 0.067
50.3 53.0 20 10 64.6 60.1 0.065 0.067 0.067 0.068
47.6 50.3 30 20 69.0 64.5 0.077 0.078 0.078 0.078
44.9 47.6 40 30 73.4 69.0 0.076 0.076 0.077 0.078
55.7 50.3 0 20 55.7 64.5 0.120 0.12 0.12 0.120
53.0 47.6 10 30 60.1 69.0 0.132 0.133 0.134 0.131
50.4 45.0 20 40 64.6 73.5 0.146 0.149 0.15 0.150
50.3 55.6 20 0 64.5 55.6 0.120 0.123 0.122 0.122
47.6 53.0 30 10 69.0 60.1 0.132 0.135 0.137 0.135
44.9 50.3 40 20 73.4 64.5 0.149 0.152 0.154 0.151
55.7 47.6 0 30 55.7 69.0 0.221 0.225 0.226 0.224
53.1 45.0 10 40 60.2 73.4 0.257 0.261 0.262 0.257
47.6 55.8 30 0 69.0 55.8 0.222 0.226 0.226 0.220
45.0 53.0 40 10 73.4 60.2 0.257 0.260 0.260 0.258

T1 and T2 ~ Exp(σ1 = σ2 = 5.873). ξi, i = 1, 2 represents the percentage of observations censored due to insufficient follow-up; ϕi, i = 1, 2 represents the percentage of observations censored due to random censoring; ωi, i = 1, 2 represents the percentage of observations censored due to insufficient follow-up and random censoring.

Table 5.

Type I error control for Gρ statistics for scenario II censoring (B = 10, 000, n1 = n2 = 120).

ξ1 ξ2 ϕ1 ϕ2 ω1 ω2 ρ = 0 ρ = 0.5 ρ = 1.0 ρ = 2.0
57.3 57.2 0 0 57.3 57.2 0.048 0.049 0.048 0.048
55.1 55.2 10 10 61.5 61.6 0.051 0.051 0.051 0.052
53.2 53.2 20 20 65.8 65.8 0.051 0.051 0.051 0.048
51.1 51.0 30 30 70.1 70.1 0.049 0.050 0.051 0.051
49.0 48.9 40 40 74.3 74.3 0.050 0.049 0.049 0.049
57.2 55.2 0 10 57.2 61.6 0.071 0.070 0.068 0.068
55.1 53.1 10 20 61.5 65.8 0.077 0.077 0.078 0.076
53.1 51.1 20 30 65.8 70.1 0.072 0.073 0.072 0.07
51.0 48.9 30 40 70.0 74.3 0.081 0.082 0.081 0.084
55.2 57.2 10 0 61.5 57.2 0.067 0.069 0.069 0.068
53.1 55.2 20 10 65.8 61.6 0.070 0.070 0.070 0.068
51.1 53.1 30 20 70.0 65.8 0.077 0.077 0.077 0.074
49.0 51.1 40 30 74.3 70.0 0.082 0.083 0.083 0.085
57.3 53.1 0 20 57.3 65.8 0.140 0.141 0.141 0.139
55.1 51.0 10 30 61.5 70.1 0.153 0.154 0.154 0.152
53.2 49.0 20 40 65.8 74.3 0.176 0.177 0.176 0.174
53.1 57.3 20 0 65.9 57.3 0.140 0.142 0.141 0.138
51.0 55.2 30 10 70.1 61.5 0.156 0.157 0.158 0.154
49.0 53.2 40 20 74.3 65.9 0.170 0.171 0.171 0.168
57.2 51.0 0 30 57.2 70.1 0.264 0.265 0.266 0.261
55.2 49.1 10 40 61.5 74.4 0.313 0.315 0.314 0.310
51.1 57.3 30 0 70.1 57.3 0.268 0.270 0.270 0.266
48.9 55.2 40 10 74.3 61.5 0.304 0.306 0.306 0.302

T1 and T2 ~ Weib(δ1 = δ2 = 0.667, σ1 = σ2 = 8.217). ξi, i = 1, 2 represents the percentage of observations censored due to insufficient follow-up; ϕi, i = 1, 2 represents the percentage of observations censored due to random censoring; ωi, i = 1, 2 represents the percentage of observations censored due to insufficient follow-up and random censoring.

4.3.1. Power estimates

To assess the power properties, the process described above for the null distribution was replicated with unequal survival distributions. That is, two samples were generated from exponential distributions with parameters σ1 = 5.873 and σ2 = 10.931, corresponding to survival probabilities of S1(3) = 60% and S2(3) = 76%, respectively. The Gρ (ρ = 0, 0.5, 1, 2) statistics were then applied to each pair of samples and a p-value was obtained. This process was replicated 10,000 times, and power was estimated as the proportion of replications in which the p-value was significant at level 0.05. The process was then repeated for Weibull distributions with parameters (δ1 = 0.667, σ1 = 8.217) and (δ2 = 0.667, σ2 = 20.867), again with survival probabilities of S1(3) = 60% and S2(3) = 76%, respectively, and the power was estimated from 10,000 simulations. The results corresponding to the exponential and Weibull distributions for Scenario I are presented in Tables 6 and 7, and those for Scenario II in Tables 8 and 9, respectively.

Table 6.

Power properties of Gρ statistics for scenario I censoring (B = 10, 000, n1 = n2 = 120).

ξ1 ξ2 ϕ1 ϕ2 ω1 ω2 ρ = 0 ρ = 0.5 ρ = 1.0 ρ = 2.0
55.7 72.8 0 0 55.7 72.8 0.810 0.806 0.800 0.780
55.7 72.8 10 10 60.1 75.5 0.783 0.779 0.772 0.750
55.7 72.8 20 20 64.6 78.2 0.738 0.739 0.735 0.723
55.7 72.9 30 30 69.0 81.0 0.689 0.687 0.682 0.667
55.8 72.8 40 40 73.5 83.7 0.632 0.63 0.626 0.615
55.7 72.8 0 10 55.7 75.5 0.894 0.893 0.889 0.875
55.7 72.8 10 20 60.1 78.2 0.874 0.872 0.868 0.853
55.6 72.9 20 30 64.5 81.0 0.860 0.858 0.855 0.842
55.7 72.9 30 40 69.0 83.7 0.831 0.828 0.826 0.816
55.6 72.8 10 0 60.0 72.8 0.660 0.656 0.650 0.631
55.7 72.7 20 10 64.6 75.4 0.585 0.582 0.578 0.560
55.6 72.9 30 20 68.9 78.3 0.538 0.537 0.533 0.523
55.7 72.9 40 30 73.4 81.0 0.458 0.457 0.455 0.447
55.7 72.8 0 20 55.7 78.2 0.952 0.950 0.947 0.936
55.8 72.9 10 30 60.2 81.0 0.943 0.942 0.940 0.931
55.7 72.8 20 40 64.6 83.7 0.930 0.930 0.929 0.923
55.7 72.8 20 0 64.6 72.8 0.446 0.445 0.444 0.427
55.8 72.8 30 10 69.1 75.5 0.376 0.374 0.373 0.363
55.6 72.8 40 20 73.4 78.2 0.307 0.305 0.303 0.296
55.7 72.8 0 30 55.7 81.0 0.982 0.981 0.98 0.976
55.7 72.8 10 40 60.1 83.7 0.979 0.979 0.977 0.974
55.7 72.9 30 0 69.0 72.9 0.248 0.248 0.246 0.239
55.8 72.8 40 10 73.5 75.5 0.181 0.179 0.178 0.171

T1 ~ Exp(σ1 = 5.873) T2 ~ Exp(σ2 = 10.931). ξi, i = 1, 2 represents the percentage of observations censored due to insufficient follow-up; ϕi, i = 1, 2 represents the percentage of observations censored due to random censoring; ωi, i = 1, 2 represents the percentage of observations censored due to insufficient follow-up and random censoring.

Table 7.

Power properties of Gρ statistics for scenario I censoring (B = 10, 000, n1 = n2 = 120).

ξ1 ξ2 ϕ1 ϕ2 ω1 ω2 ρ = 0 ρ = 0.5 ρ = 1.0 ρ = 2.0
57.3 74.0 0 0 57.3 74.0 0.799 0.796 0.793 0.773
57.2 74.0 10 10 61.5 76.6 0.750 0.749 0.744 0.733
57.2 74.0 20 20 65.8 79.2 0.716 0.714 0.712 0.697
57.2 73.9 30 30 70.0 81.7 0.661 0.659 0.653 0.644
57.2 74.0 40 40 74.3 84.4 0.599 0.598 0.598 0.591
57.2 74.0 0 10 57.2 76.6 0.886 0.882 0.878 0.868
57.3 74.0 10 20 61.6 79.2 0.862 0.862 0.859 0.847
57.3 74.0 20 30 65.8 81.8 0.836 0.834 0.831 0.821
57.3 74.0 30 40 70.1 84.4 0.801 0.801 0.798 0.789
57.2 74.1 10 0 61.5 74.1 0.631 0.630 0.622 0.606
57.2 74.1 20 10 65.8 76.7 0.580 0.577 0.573 0.560
57.3 74.0 30 20 70.1 79.2 0.500 0.500 0.496 0.484
57.3 74.0 40 30 74.4 81.8 0.420 0.421 0.417 0.409
57.2 74.0 0 20 57.2 79.2 0.946 0.944 0.941 0.933
57.3 74.0 10 30 61.6 81.8 0.938 0.937 0.933 0.925
57.2 74.0 20 40 65.8 84.4 0.926 0.926 0.923 0.917
57.2 73.9 20 0 65.8 73.9 0.425 0.423 0.419 0.405
57.2 74.0 30 10 70.0 76.6 0.352 0.350 0.345 0.334
57.2 74.0 40 20 74.3 79.2 0.277 0.278 0.276 0.267
57.3 74.0 0 30 57.3 81.8 0.976 0.976 0.974 0.970
57.3 74.1 10 40 61.6 84.5 0.978 0.977 0.977 0.974
57.2 74.0 30 0 70.0 74.0 0.233 0.233 0.232 0.227
57.2 74.1 40 10 74.3 76.7 0.169 0.168 0.166 0.161

T1 ~ Weib(δ1 = 0.667, σ1 = 8.217) T2 ~ Weib(δ2 = 0.667, σ2 = 20.867). ξi, i = 1, 2 represents the percentage of observations censored due to insufficient follow-up; ϕi, i = 1, 2 represents the percentage of observations censored due to random censoring; ωi, i = 1, 2 represents the percentage of observations censored due to insufficient follow-up and random censoring.

Table 8.

Power properties of Gρ statistics for scenario II censoring (B = 10, 000, n1 = n2 = 120).

ξ1 ξ2 ϕ1 ϕ2 ω1 ω2 ρ = 0 ρ = 0.5 ρ = 1.0 ρ = 2.0
55.8 72.7 0 0 55.8 72.7 0.805 0.803 0.796 0.777
53.0 70.1 10 10 60.2 75.6 0.780 0.779 0.774 0.755
50.3 67.5 20 20 64.6 78.3 0.749 0.747 0.742 0.727
47.7 64.8 30 30 68.9 81.0 0.706 0.706 0.701 0.688
44.9 62.0 40 40 73.4 83.7 0.654 0.649 0.646 0.633
55.7 70.1 0 10 55.7 75.6 0.886 0.884 0.880 0.866
53.0 67.4 10 20 60.1 78.2 0.860 0.859 0.856 0.841
50.3 64.8 20 30 64.6 81.0 0.841 0.840 0.836 0.822
47.6 62.1 30 40 69.0 83.7 0.822 0.823 0.821 0.809
53.1 72.8 10 0 60.2 72.8 0.685 0.680 0.672 0.648
50.3 70.1 20 10 64.6 75.6 0.631 0.627 0.621 0.598
47.7 67.4 30 20 69.0 78.2 0.566 0.562 0.554 0.538
45.0 64.7 40 30 73.5 81.0 0.498 0.496 0.492 0.478
55.8 67.4 0 20 55.8 78.3 0.933 0.933 0.930 0.921
53.0 64.8 10 30 60.2 81.0 0.924 0.923 0.920 0.910
50.3 62.1 20 40 64.6 83.7 0.920 0.920 0.918 0.911
50.3 72.9 20 0 64.6 72.9 0.518 0.511 0.503 0.482
47.6 70.1 30 10 68.9 75.5 0.453 0.448 0.440 0.420
45.0 67.4 40 20 73.4 78.3 0.367 0.363 0.357 0.343
55.7 64.8 0 30 55.7 80.9 0.970 0.970 0.968 0.963
53.1 62.0 10 40 60.2 83.6 0.964 0.963 0.962 0.958
47.6 72.9 30 0 69.1 72.9 0.334 0.330 0.324 0.304
45.0 70.2 40 10 73.4 75.5 0.250 0.246 0.241 0.226

T1 ~ Exp(σ1 = 5.873), T2 ~ Exp(σ2 = 10.931). ξi, i = 1, 2 represents the percentage of observations censored due to insufficient follow-up; ϕi, i = 1, 2 represents the percentage of observations censored due to random censoring; ωi, i = 1, 2 represents the percentage of observations censored due to insufficient follow-up and random censoring.

Table 9.

Power properties of Gρ statistics for scenario II censoring (B = 10, 000, n1 = n2 = 120).

ξ1 ξ2 ϕ1 ϕ2 ω1 ω2 ρ = 0 ρ = 0.5 ρ = 1.0 ρ = 2.0
57.3 74.0 0 0 57.3 74.0 0.788 0.787 0.781 0.764
55.2 72.2 10 10 61.6 76.7 0.763 0.761 0.756 0.740
53.1 70.3 20 20 65.8 79.2 0.726 0.725 0.722 0.709
51.1 68.3 30 30 70.1 81.8 0.670 0.668 0.664 0.654
49.0 66.5 40 40 74.4 84.4 0.616 0.614 0.612 0.602
57.3 72.1 0 10 57.3 76.6 0.872 0.871 0.867 0.855
55.2 70.2 10 20 61.6 79.2 0.854 0.853 0.851 0.839
53.1 68.4 20 30 65.9 81.9 0.832 0.831 0.828 0.818
51.1 66.6 30 40 70.1 84.5 0.805 0.807 0.804 0.796
55.2 74.1 10 0 61.6 74.1 0.656 0.654 0.649 0.629
53.1 72.1 20 10 65.7 76.6 0.596 0.594 0.588 0.569
51.0 70.3 30 20 70.1 79.3 0.545 0.543 0.539 0.526
48.9 68.4 40 30 74.3 81.8 0.457 0.454 0.450 0.444
57.3 70.1 0 20 57.3 79.1 0.930 0.930 0.928 0.921
55.2 68.3 10 30 61.5 81.8 0.926 0.925 0.924 0.915
53.0 66.5 20 40 65.7 84.4 0.920 0.919 0.919 0.911
53.2 74.1 20 0 65.8 74.1 0.466 0.463 0.458 0.442
51.1 72.1 30 10 70.1 76.6 0.389 0.389 0.385 0.371
48.9 70.3 40 20 74.3 79.3 0.320 0.318 0.313 0.301
57.2 68.4 0 30 57.2 81.8 0.968 0.970 0.969 0.964
55.2 66.5 10 40 61.5 84.4 0.967 0.968 0.967 0.964
51.0 74.1 30 0 70.1 74.1 0.284 0.283 0.277 0.268
49.0 72.2 40 10 74.4 76.6 0.202 0.200 0.196 0.188

T1 ~ Weib(δ1 = 0.667, σ1 = 8.217), T2 ~ Weib(δ2 = 0.667, σ2 = 20.867). ξi, i = 1, 2 represents the percentage of observations censored due to insufficient follow-up; ϕi, i = 1, 2 represents the percentage of observations censored due to random censoring; ωi, i = 1, 2 represents the percentage of observations censored due to insufficient follow-up and random censoring.

Study 2. It is not uncommon to encounter a study in which the survival outcome of an ongoing trial is compared to that of a previous (historical) trial. An example would be a trial conducted in a cohort of patients who underwent bone marrow transplantation (BMT) under a particular treatment regimen, with survival at three years estimated to be 60%, i.e. S1(3) = 60%. Once that trial was completed, a therapy adding chimeric antigen receptor (CAR) T cells to the existing regimen was to be tested, under the hypothesis that survival at three years would improve by 16%, i.e. S2(3) = 76%; meanwhile, patients on the previous trial could still be followed while the trial of the newer treatment plan was ongoing.

To assess the effect of insufficient follow-up and to mimic a situation similar to the one discussed above, we conducted the following simulation study. We started with the same set-up of recruiting 120 patients/individuals in each arm, with the parameters specified above, and evaluated the impact of different lengths of follow-up on the null distribution and power by conducting the simulations in the following manner.

4.4. Performance of the test under null hypothesis

We generated two samples from an exponential distribution with parameter σ1 = σ2 = 5.873, corresponding to S1(3) = S2(3) = 60%. The samples were generated using the framework described above, i.e. patients were enrolled uniformly over a three-year period with a minimum follow-up of two years. Thus, those recruited in the first period would be followed for 4–5 years, those in the second period for 3–4 years, and those enrolled in the third period for 2–3 years. The second arm, corresponding to the historical control, was generated similarly but with the follow-up extended by three years: those recruited in the first period would be followed for 6–8 years, those in the second period for 4–6 years, and those in the third period for 2–4 years. The process was repeated with Weibull distributions, with both samples generated from Weib(0.667, 8.217). The results of the simulation studies corresponding to the exponential and Weibull distributions are provided in Tables 10 and 11, respectively.
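The censoring fractions due to insufficient follow-up in these two designs can be checked with a quick Monte Carlo sketch (ours, exponential case only). We assume entry E ~ U(0, 3); a subject is censored when the failure time exceeds the time remaining on study, which is 5 − E under the standard design and, under our reading of the extended-follow-up description above, 8 − 2E for the historical arm (so that follow-up ranges over 6–8, 4–6 and 2–4 years for the three entry periods):

```python
import random

random.seed(3)
sigma = 5.873                 # exponential mean, so S(3) = exp(-3/5.873) ≈ 0.60
n = 200_000
std = ext = 0
for _ in range(n):
    e = random.uniform(0.0, 3.0)          # staggered entry E
    t = random.expovariate(1.0 / sigma)   # failure time T
    if t > 5.0 - e:                       # standard 5-year study
        std += 1
    if t > 8.0 - 2.0 * e:                 # extended follow-up (our reading: 8 - 2E)
        ext += 1
print(f"standard design: {100 * std / n:.1f}% censored")   # ~55.7%
print(f"extended design: {100 * ext / n:.1f}% censored")   # ~44.5%
```

The two fractions reproduce the ξ values of 55.7 and 44.5 reported in the type I error row of Table 10.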

Table 10.

Performance of Gρ statistics with different lengths of follow-up.

T1 T2 ξ1 ξ2 ρ = 0 ρ = 0.5 ρ = 1.0 ρ = 2.0
Type I Error E(5.873) E(5.873) 44.5 55.7 0.046 0.047 0.047 0.048
Power Estimates E(5.873) E(10.931) 44.6 72.8 0.825 0.823 0.815 0.793
E(10.931) E(5.873) 55.7 64.2 0.835 0.833 0.828 0.806

T1, T2 ~ Exp(σ) = E(σ) and n1 = n2 = 120; B = 10, 000 for Type I Error and Power Estimates; follow-up for T1 = 5 years and for T2 = 8 years.

ξi, i = 1, 2 represents the percentage of observations censored due to insufficient follow-up.

Table 11.

Performance of Gρ statistics with different lengths of follow-up.

T1 T2 ξ1 ξ2 ρ = 0 ρ = 0.5 ρ = 1.0 ρ = 2.0
Type I Error Weib(0.667, 8.217) Weib(0.667, 8.217) 50.0 57.3 0.050 0.049 0.050 0.050
Power Estimates Weib(0.667, 8.217) Weib(0.667, 20.867) 50.0 74.0 0.806 0.804 0.799 0.780
Weib(0.667, 20.867) Weib(0.667, 8.217) 57.3 68.7 0.808 0.806 0.802 0.784

T1, T2 ~ Weib(δ, σ) and n1 = n2 = 120; B = 10, 000 for Type I Error and Power Estimates; follow-up for T1 = 5 years and for T2 = 8 years.

ξi, i = 1, 2 represents the percentage of observations censored due to insufficient follow-up.

4.4.1. Power estimates

To assess the power of the test, a simulation study similar to those described above was undertaken. Two samples from Exp1(σ1 = 5.873) and Exp2(σ2 = 10.931), corresponding to S1(3) = 60% and S2(3) = 76%, respectively, were generated, and power was estimated as the proportion of 10,000 trials in which the p-value was significant at the 5% level of significance. In this set-up, the historical arm is assumed to have inferior survival but longer follow-up. The simulation study was then replicated assuming the new treatment arm to have inferior survival with shorter follow-up, where S1(3) = 76% and S2(3) = 60%, i.e. with the new therapy the survival declined. The simulations were repeated in a similar manner with Weibull distributions. The results corresponding to the exponential and Weibull distributions are provided in Tables 10 and 11, respectively.

5. Simulation results

The following conclusions can be gleaned from the simulation experiments conducted in Section 4.

  1. For both scenarios considered in our simulation studies, for the exponential distribution (Tables 2 and 4) and the Weibull distribution (Tables 3 and 5), the null distribution is reasonably well controlled as long as the proportion of censoring is the same in the two groups. However, as the difference in the proportions of censored observations between the two groups increases, the type I error control becomes increasingly anti-conservative. For example, for the exponential distribution (Table 2), with 10% censoring in the first group (ϕ1) and 30% or 40% censoring in the second group (ϕ2), the nominal 5% type I error inflates to 16.8% and 35.8%, respectively. To illustrate the extent of the problem graphically, eight combinations of (ϕ1, ϕ2) from Table 2 (type I error control for the exponential distribution), with ϕ1 fixed at 0 and ϕ2 varying over (0, 10, 20, 30) and then with ϕ2 fixed at 0 and ϕ1 varying over (0, 10, 20, 30), are plotted for the Gρ (ρ = 0) statistic in Figure 4(a). Further, Tables 2 to 5 make clear that the magnitude of the problem does not diminish with either the underlying survival distribution or the censoring scenario. It is also evident from these tables that the anti-conservatism in type I error does not depend on the value of ρ in the Gρ statistics.

  2. As expected, for both distributions the power declines progressively as the censoring in the two groups increases while remaining equal. However, when the censoring proportions differ, the power is substantially inflated or deflated. For example, for the Weibull distribution under Scenario I censoring (Table 7), the power of the logrank test (ρ = 0) with no random censoring (ϕ1 = ϕ2 = 0) is approximately 80%, but with 30% censoring in both groups it declines to about 66%. However, if the random censoring proportions are ϕ1 = 10% and ϕ2 = 30% (more censoring in the arm with better survival at three years), the power estimate is about 94%. Table 7 further shows that when the random censoring proportions are flipped, ϕ1 = 30% and ϕ2 = 10%, i.e. more random censoring in the arm with inferior survival, the power estimate drops to about 35%. Similar phenomena are seen under Scenario II censoring as well (Tables 8 and 9). To further illustrate the problem with the power estimates graphically, the eight combinations of (ϕ1, ϕ2) listed in 1 above, taken from Table 6 for the exponential distribution, are plotted for the Gρ (ρ = 0) statistic in Figure 4(b).

  3. Tables 10 and 11, for the exponential and Weibull distributions, respectively, make it very clear that the type I error and power estimates are not affected when censoring is due only to insufficient follow-up. This is further confirmed by the first lines of Tables 6 and 7 (Scenario I censoring) for the exponential and Weibull distributions, respectively: when censoring is due only to insufficient follow-up (the censoring proportions due to insufficient follow-up differ between the two groups, but the follow-up times are the same and there is no random censoring), the power estimates are unaffected. This shows that as long as censoring arises solely from insufficient follow-up, with no random censoring of failure times, the characteristics of the underlying distribution are preserved; consequently, the parameter estimates are unbiased and the type I error and power are unaffected. Random censoring, in contrast, alters the characteristics of the underlying distribution, and the resulting parameter estimates may be biased, leading to uncontrolled type I error and erroneous power estimates.

Figure 4.


(a) Type I error control for a selection of (ϕ1, ϕ2) values from Table 2 for the Gρ (ρ = 0) statistic; (b) power estimates for a selection of (ϕ1, ϕ2) values from Table 6 for the Gρ (ρ = 0) statistic.

6. Example revisited and conclusions

We return to the motivating example of this article. Figure 5 clearly indicates that the proportional hazards assumption may not hold, since the estimates of the Weibull shape parameters for the two arms are very different: the Rofecoxib arm appears to have a much more steeply increasing failure rate than the Naproxen arm, although the test based on Schoenfeld residuals does not reject the proportionality assumption (p-value = 0.15). Hence, it is possible that with longer follow-up time more events would have been observed, leading to a more significant p-value. It should be noted that the considerably greater number of failures observed with one more year of follow-up in the Rofecoxib arm by Mukherjee et al.13 would have been treated as censored observations in the previous analysis of Bombardier et al.,12 consistent with the concept of “random censoring” discussed in this manuscript. Further, our simulation experiments, in which the proportional hazards assumption is not violated, have shown that if more observations are censored from the worse arm, then the power to detect differences is substantially lowered, depending on the amount of random censoring. It is possible that the lack of a significant difference between the two groups in the earlier analysis resulted from more random censoring in the Rofecoxib arm. Hence, in light of the findings of Mukherjee et al.,13 it seems reasonable to conclude that the insufficient follow-up time in the Bombardier et al.12 article not only led those authors to censor some possible cardiovascular failure times, but also generated highly disproportionate censoring in the two arms, which reduced the apparent impact of failures due to cardiovascular events.

Figure 5.


Exponential (E) and Weibull (W) fits for the two treatment arms.

Our simulation studies were conducted with an accrual period of three years and a total study duration of five years, and the parameters were chosen to provide roughly 80% power to detect an improvement of 16% in survival at three years, i.e. S1(3) = 60% vs. S2(3) = 76%. The choice of parameters was partly motivated by some of the clinical trials conducted at St. Jude Children’s Research Hospital. However, this may not be realistic for some clinical trials, where the accrual period and the follow-up time could be significantly shorter. For example, consider the same effect size but with interest in comparing the survival proportions at two years, i.e. S1(2) = 60% vs. S2(2) = 76%. Assuming an accrual time of two years with a total study duration of three years, and assuming we can accrue 120 participants per year (faster accrual, total sample size of 240), the power estimates obtained using EAST 6.5, DSTPLAN and PASS were 74%, 73% and 77%, respectively. These are somewhat attenuated compared to the corresponding power estimates of 76%, 79% and 82%, respectively, obtained in the simulation study. In such situations we would likely observe more censoring due to insufficient follow-up but, consistent with our simulation studies, censoring due to insufficient follow-up has no significant impact on the power of the test. If, in addition, there is random censoring, then we would expect the power to decline as the proportion of random censoring increases; and if the random censoring proportions differ between the two groups, this can have a significant impact on the type I error and power estimates and may ultimately lead to incorrect conclusions. Interestingly, if the interest were in comparing survival at one year, i.e. S1(1) = 60% vs. S2(1) = 76%, and if all 240 subjects could be accrued in one year, then with one year of follow-up the power would increase to 78%, 86% and 88% using EAST 6.5, DSTPLAN and PASS, respectively. Given the complexities of the censoring pattern, we recommend that large simulation studies be undertaken prior to implementing Phase III clinical trials involving time-to-event outcomes.
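The claim that the chosen parameters give roughly 80% power can also be checked with Schoenfeld's classical approximation for the logrank test, power ≈ Φ(|log HR|·√d/2 − z0.975), where d is the expected total number of events. The sketch below is our back-of-the-envelope check, not one of the software packages cited above; the expected event counts use the fact that, with entry E ~ U(0, 3) and a five-year study, a subject is censored when T > 5 − E:

```python
import math

def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def censor_frac(sigma, accrual=3.0, study=5.0):
    """P(T > study - E) for T ~ Exp(mean sigma), E ~ U(0, accrual):
    (sigma/a) * exp(-D/sigma) * (exp(a/sigma) - 1)."""
    return (sigma / accrual) * math.exp(-study / sigma) * \
           (math.exp(accrual / sigma) - 1.0)

n = 120
s1, s2 = 5.873, 10.931           # exponential means giving S(3) = 60% and 76%
events = n * (1 - censor_frac(s1)) + n * (1 - censor_frac(s2))
log_hr = math.log(s2 / s1)       # hazard ratio for exponentials is s2/s1
power = norm_cdf(abs(log_hr) * math.sqrt(events) / 2.0 - 1.96)
print(f"expected events ≈ {events:.0f}, approximate power ≈ {power:.2f}")
```

This gives roughly 86 expected events and power near 0.82, consistent with the ≈0.80 in the first row of Table 6; the censoring fractions censor_frac(5.873) ≈ 0.557 and censor_frac(10.931) ≈ 0.728 likewise match the ξ1 = 55.7 and ξ2 = 72.8 reported there.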

The extensive simulation studies and the examples presented in this article make it clear that more attention should be given to the underlying censoring mechanism when analyzing survival data. If the censoring mechanism involves censoring individuals/patients at random, it can have severe consequences for the validity of the inferences drawn, particularly when the censoring proportions are not the same in the two groups. This suggests a need for designs that factor the possibility of insufficient follow-up time and random censoring of individuals into the analysis of clinical trials data. In planned future studies with smaller sample sizes, we will use exact logrank tests (Heinze et al.10) to assess the effects of unequal censoring and insufficient follow-up.

Acknowledgements

The authors are thankful to the referees for their constructive comments and suggestions.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The research work of Deo Kumar Srivastava and Zhaohua Lu was supported in part by Grant CA21765 and the American Lebanese Syrian Associated Charities. SN Rai was partly supported by the Wendell Cherry Chair in Clinical Trial Research Fund, multiple National Institutes of Health (NIH) grants (5P20GM113226, PI: McClain; 1P42ES023716, PI: Srivastava; 5P30GM127607-02, PI: Jones; 1P20GM125504-01, PI: Lamont; 2U54HL120163, PI: Bhatnagar/Robertson; 1P20GM135004, PI: Yan; 1R35ES0238373-01, PI: Cave; 1R01ES029846, PI: Bhatnagar; 1R01ES027778-01A1, PI: States; 1P30ES030283, PI: States), and by a Kentucky Council on Postsecondary Education grant (PON2 415 1900002934, PI: Chesney).

Footnotes

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

References

  • 1.Kalbfleisch JD and Prentice RL. The statistical analysis of failure time data. New York, NY: John Wiley & Sons, 2002. [Google Scholar]
  • 2.Hougaard P. Analysis of multivariate survival data. New York, NY: Springer-Verlag, 2000. [Google Scholar]
  • 3.Kong FH and Slud E. Robust covariate-adjusted logrank tests. Biometrika 1997; 84: 847–862. [Google Scholar]
  • 4.DiRienzo AG and Lagakos SW. Effects of model misspecification on tests of no randomized treatment effect arising from Cox’s proportional hazards model. J Royal Stat Soc Ser B 2001; 63: 745–757. [Google Scholar]
  • 5.Beltangady MH and Frankowski RF. Effect of unequal censoring on the size and power of the logrank and Wilcoxon types of tests for survival data. Stat Med 1989; 8: 937–945. [DOI] [PubMed] [Google Scholar]
  • 6.Latta RB. A Monte Carlo study of some two-sample rank tests with censored data. J Am Stat Assoc 1981; 76: 713–719. [Google Scholar]
  • 7.Kellerer EL and Chmelevsky D. Small-sample properties of censored-data rank tests. Biometrics 1983; 39: 675–682. [Google Scholar]
  • 8.Jennrich RI. A note on the behavior of the logrank permutation test under unequal censoring. Biometrika 1983; 70: 133–137. [Google Scholar]
  • 9.Jennrich RI. Some exact tests for comparing survival curves in the presence of unequal censoring. Biometrika 1984; 71: 57–64. [Google Scholar]
  • 10.Heinze G, Gnant M and Schemper M. Exact logrank tests for unequal follow-up. Biometrics 2003; 59: 1151–1157. [DOI] [PubMed] [Google Scholar]
  • 11.Wang R, Lagakos SW and Gray RJ. Testing and interval estimation for two-sample survival comparisons with small sample sizes and unequal censoring. Biostatistics 2010, 11: 676–692. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12.Bombardier C, Laine L, Reicin A, et al. Comparison of upper gastrointestinal toxicity of rofecoxib and naproxen in patients with rheumatoid arthritis. VIGOR Study Group. New Engl J Med 2000; 343: 1520–1528. [DOI] [PubMed] [Google Scholar]
  • 13.Mukherjee D, Nissen SE and Topol EJ. Risk of cardiovascular events associated with selective COX-2 inhibitors. J Am Med Assoc 2001; 286: 954–959. [DOI] [PubMed] [Google Scholar]
  • 14.Harrington DP and Fleming TR. A class of rank test procedures for censored survival data. Biometrika 1982; 69: 553–566. [Google Scholar]
  • 15.Tarone RE and Ware J. On distribution-free tests for equality of survival distributions. Biometrika 1977; 64: 156–160. [Google Scholar]
  • 16.Fleming TR and Harrington DP. Counting processes and survival analysis. New York, NY: John Wiley & Sons, 1991. [Google Scholar]
  • 17.Bogaerts K, Komarek A and Lesaffre E. Survival analysis with interval-censored data. Boca Raton, FL: Chapman & Hall/CRC Press, Taylor & Francis Group, 2018. [Google Scholar]
  • 18.DSTPLAN V4.3. Calculations for sample sizes and related problems. Houston, TX: The University of Texas, M. D. Anderson Cancer Center, 2006. [Google Scholar]
  • 19.EAST 6.5. The complete trial design solution. Cambridge, MA: Cytel, 2020. [Google Scholar]
  • 20.PASS. Power analysis and sample size software. Kaysville, UT: NCSS, LLC, 2020, ncss.com/software/pass. [Google Scholar]
