Abstract
Clinical trials with multiple primary time-to-event outcomes are common. The use of multiple endpoints creates challenges in the evaluation of power and the calculation of sample size during trial design, particularly for time-to-event outcomes. We present methods for calculating the power and sample size for randomized superiority clinical trials with two correlated time-to-event outcomes. We do this under independent and dependent censoring for three censoring scenarios: (i) when both events are non-fatal; (ii) when one event is fatal (semi-competing risk); and (iii) when both are fatal (competing risk). We derive the bivariate log-rank test in all three censoring scenarios and investigate the behavior of power and the required sample sizes. Separate evaluations are conducted for two inferential goals: evaluation of whether the test intervention is superior to the control on (1) all of the endpoints (multiple co-primary) or (2) at least one endpoint (multiple primary).
Keywords: dependent censoring, log-rank test, multiple endpoints, semi-competing risk, time-dependent association
1. Introduction
The use of two time-to-event outcomes as primary endpoints has become common in clinical trials evaluating interventions in many disease areas. For example, co-infection/comorbidity trials may use primary endpoints to evaluate multiple comorbidities: a trial evaluating therapies to treat Kaposi's sarcoma in HIV-infected individuals may have the time to Kaposi's sarcoma progression and the time to HIV virologic failure as primary endpoints. In new anti-cancer drug trials, the most commonly used primary endpoint is overall survival (OS), defined as the time from randomization until death from any cause. However, OS in general requires long follow-up periods after disease progression, leading to long and expensive studies. Therefore, in addition to OS, many clinical trials include progression-free survival (PFS) as a primary endpoint, defined as the time from randomization to the first of tumor progression or death. Meanwhile, in trials aimed at evaluating treatments to reduce a specific type of mortality, the time to disease-specific mortality and the time to all-cause mortality are the two primary endpoints. In the first example of co-infection/comorbidity trials, both events are non-fatal (the 'both non-fatal case', where neither event-time is censored by the other event). However, in the oncology example, one event is fatal (death), potentially censoring the other non-fatal event (the 'one fatal case'). Here, we use the term 'fatal' to describe an event that censors future events of interest. This is referred to as the 'semi-competing risks problem', first introduced by [1]. In the last example, both events are fatal, as each event-time may be censored by the other event (the 'both fatal case').
In earlier work, we developed methods for sizing clinical trials with two primary time-to-event outcomes under time-dependent correlation structures of bivariate exponential distributions for the case where (i) both events are non-fatal [2,3]. In this paper, we discuss the log-rank test-based method using the normal approximation for power and sample size calculations in clinical trials with two time-to-event outcomes, to accommodate two additional situations: (ii) when one event is fatal and (iii) when both are fatal. We also evaluate composite endpoints as a strategy. Hence, six scenarios are evaluated, classified by the three censoring schemes (both non-fatal, one fatal, and both fatal) for the two time-to-event outcomes and by whether a composite endpoint is used. Table I illustrates examples of the six scenarios.
Table I.
Classification of event types by the composite and non-composite examples (six scenarios)

| Type | Non-composite examples | Composite examples |
|---|---|---|
| Both non-fatal events | HIV trial (time to infant HIV infection; time to infant Hepatitis B infection) | HIV trial |
| One fatal event (one non-fatal event) | Oncology trial (TTP; OS) | Oncology trial (PFS; OS) |
| Both fatal events | Cardiovascular trial (time to disease-specific mortality; time to other-cause mortality) | Cardiovascular trial (e.g., major adverse cardiac events) |
We consider two inferential goals for clinical trials with multiple endpoints: (1) ‘multiple co-primary endpoints’ (where the trial is designed to evaluate if the intervention is superior to the control on all of the endpoints) and (2) ‘multiple primary endpoints’ (or ‘alternative primary endpoints’, where the trial is designed to evaluate if the intervention is superior to the control on at least one endpoint [4–6]). When considering two or more endpoints as co-primary, no adjustment is needed to control the Type I error rate if the hypothesis associated with each endpoint is evaluated at the same significance level as that required for all of the objectives. However, the Type II error rate increases as the number of endpoints being evaluated increases. Thus, design adjustments are needed to maintain the overall power. In contrast, when designing the trial to evaluate an effect on at least one of the endpoints, an adjustment is needed to control the Type I error rate because the Type I error rate increases as the number of endpoints being evaluated increases.
The paper is structured as follows: in Section 2, we describe the dependence measure and censoring schemes, and then discuss the correlation structure of the bivariate log-rank statistic. In Sections 3 and 4, we provide the power for comparing two groups with respect to two time-to-event outcomes as co-primary or multiple primary and describe methods for calculating the sample size. We also investigate the behaviors of power and the required sample sizes with a real example. In Section 5, we summarize our findings.
2. Censoring schemes, dependency, and correlation
2.1. Notation and framework
Consider a randomized clinical trial designed to compare two interventions, with a total of N participants being recruited and randomized. Suppose that r(2)N participants are assigned to the test intervention group and r(1)N participants to the control intervention group (r(1) + r(2) = 1). Patients are then followed to evaluate the bivariate survival times for the two endpoints. Let T*ik and Cik be the underlying continuous survival time and potential censoring time of the kth primary endpoint for the ith participant (k = 1, 2; i = 1, … , N). Assume Ci = Ci1 = Ci2, because Ci1 and Ci2 are usually the same time. Hence, we observe the bivariate time-to-event data (Ti1, Δi1, Ti2, Δi2, gi), where Tik and Δik are the ith observable survival time and right-censoring indicator for the kth primary endpoint, respectively, and gi is the group index j (j = 2 if the ith participant belongs to the test, and j = 1 otherwise). For example, typically we see Tik = min(T*ik, Ci) and Δik = 𝟙(T*ik ≤ Ci) (𝟙(·) is the indicator function). The information in (Tik, Δik) is represented by the counting process 𝒩ik(t) = 𝟙(Tik ≤ t, Δik = 1) and the at-risk process 𝒴ik(t) = 𝟙(Tik ≥ t).
Denote the marginal hazard function and its cumulative function for T*ik in the group j by λk(j)(t) and Λk(j)(t) = ∫0t λk(j)(u) du, respectively (k = 1, 2; j = 1, 2).
Let ψk(t) = λk(1)(t)/λk(2)(t) be the hazard ratio (HR) between the two intervention groups. To test the single hypothesis 'H0k : ψk(t) = 1 for all t' restricted to the kth endpoint (k = 1, 2), the standardized log-rank statistic

Zk = Uk(τ)/√V̂kk(τ) | (1)

can be applied to the univariate data set {(Tik, Δik, gi), i = 1, … , N}, where τ is the maximum observed follow-up time, Uk(t) is the log-rank process

Uk(t) = √N ∫0t Ĥk(s){dΛ̂k(1)(s) − dΛ̂k(2)(s)}, with Ĥk(t) = 𝒴̄k(1)(t)𝒴̄k(2)(t)/{N𝒴̄k(t)},

Λ̂k(j)(t) = ∫0t d𝒩̄k(j)(s)/𝒴̄k(j)(s) is the Nelson–Aalen estimator of Λk(j)(t), and V̂kk(t) = ∫0t Ĥk(s) d𝒩̄k(s)/𝒴̄k(s) is the conditional variance of Uk(t) under the null hypothesis H0k,
where 𝒴̄k(j)(t) = Σ{i: gi=j} 𝒴ik(t) and 𝒩̄k(j)(t) = Σ{i: gi=j} 𝒩ik(t) are the at-risk and counting processes for the kth endpoint of individuals belonging to the group j, 𝒴̄k(t) = 𝒴̄k(1)(t) + 𝒴̄k(2)(t), and 𝒩̄k(t) = 𝒩̄k(1)(t) + 𝒩̄k(2)(t).
2.2. Censoring schemes and dependence measures for two time-to-event outcomes
In a trial where the bivariate time-to-event data (Ti1, Δi1, Ti2, Δi2, gi) are observed, various situations are considered. First, consider the censoring scheme. [3] discusses the simplest case, where both events are non-fatal. We consider the two other situations, that is, where one event is fatal but the other is non-fatal (one fatal), and where both events are fatal (both fatal). We also briefly describe the measure of dependence between the two time-to-event outcomes.
Examples of the three censoring scenarios
Both non-fatal outcomes: In HIV clinical trials, if T*i1 is the time to infant HIV infection and T*i2 is the time to infant Hepatitis B infection, neither event-time is censored by the other event. If a subject experiences neither event, both endpoints are censored at the same time at the end of the follow-up period (e.g., by the end of the study or patient drop-out), so that we have Tik = min(T*ik, Ci) and Δik = 𝟙(T*ik ≤ Ci), k = 1, 2.
One fatal outcome: In oncology trials, if T*i1 is the time-to-progression (TTP: defined as the time from randomization until objective tumor progression, not including death) and T*i2 is the OS (time to all-cause death), one endpoint (TTP) is censored by the other endpoint (OS), which is completely observed. So we have Ti1 = min(T*i1, T*i2, Ci) with Δi1 = 𝟙(T*i1 ≤ min(T*i2, Ci)), and Ti2 = min(T*i2, Ci) with Δi2 = 𝟙(T*i2 ≤ Ci). As Ti2 is a competing risk for Ti1 but Ti1 is not for Ti2 (Ti1 ≤ Ti2), this situation describes the semi-competing risk discussed by [1] (see [7] for further discussion). If there is a non-zero correlation between the endpoints (i.e., dependent censoring), the standard log-rank test must be modified to account for the dependent censoring. We may be able to avoid this problem by creating a composite endpoint.
Both fatal outcomes: In trials aimed at reducing a specific type of mortality, if T*i1 is the time to disease-specific mortality and T*i2 is the time to other-cause mortality, each event may be censored by the other event. This situation is called a competing risk. Here, we have Ti1 = Ti2 = min(T*i1, T*i2, Ci), Δi1 = 𝟙(T*i1 ≤ min(T*i2, Ci)), and Δi2 = 𝟙(T*i2 ≤ min(T*i1, Ci)).
Definition for the composite endpoint and handling censoring
We now provide the definition of the composite endpoint in the scenarios with two time-to-event outcomes. See Table I for examples of composite endpoints, such as the PFS composing (combining) the TTP and OS, or major adverse cardiac events; the difference between composing and not composing the endpoints lies in the handling rule for censoring in each endpoint. That is, censoring can be handled in two ways in our time-to-event context. The first way is, as usual, to use the indicators Δik defined in the preceding examples, k = 1, 2, which is the non-composite setting in the context of this paper. The second is to handle the censoring indicators as Δi1 = 𝟙(min(T*i1, T*i2) ≤ Ci) and Δi2 = 𝟙(T*i2 ≤ Ci), which is the definition of the composite setting in this paper. For example, consider the situation where T*i1 and T*i2 are the TTP and OS endpoints, respectively. Then, (Ti1, Δi1) is the ith observation for the TTP endpoint under the former handling of censoring (the TTP outcome), while (Ti1, Δi1) is that for the PFS endpoint under the latter (the PFS outcome). Note that, in this paper, the difference between the TTP and PFS outcomes arises solely from the handling of censoring, but Ti1 is the observable time, with the same length and notation, for both the TTP and PFS endpoints. Alternatively, because one usually defines the PFS endpoint as min(T*i1, T*i2), applying the former (usual) handling of censoring to this composite event time also leads to the PFS outcome. However, to avoid confusion in the derivations, we discuss the composite setting through the handling rule for censoring, without introducing notation such as min(T*i1, T*i2) to define the composite endpoint. Also, in this paper, the observations for the first endpoint, (Ti1, Δi1), are used in either the non-composite or the composite setting; on the other hand, those for the second endpoint, (Ti2, Δi2), are consistently treated in the non-composite setting.
Dependence measure
The two times T*i1 and T*i2 may be correlated, and we consider a correlation structure between them. Let S(j)(t, s) = Pr(T*i1 > t, T*i2 > s | gi = j) and Sk(j)(t) (k = 1, 2) be the joint survival and marginal survival functions for the bivariate survival data in the group j, respectively. We consider the correlation between the two cumulative hazard variates [8], defined by

ρ(j) = corr(Λ1(j)(T*i1), Λ2(j)(T*i2)).

If the marginals of the bivariate survival data are exponential, ρ(j) is the same as the correlation coefficient of the raw data [3]. In order to generate the joint survival function S(j)(t, s) from the marginals S1(j)(t) and S2(j)(s), we prepare a copula function 𝒞(·, ·), which gives

S(j)(t, s) = 𝒞(S1(j)(t), S2(j)(s); θ(j)),

where the association parameter θ(j) included in 𝒞(·, ·) is a one-to-one function of ρ(j).
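As a concrete illustration of the link between θ(j) and ρ(j), the following sketch (illustrative Python; the helper names clayton_sample and rho_from_theta are ours, not from any package) samples pairs from a Clayton copula by conditional inversion and estimates ρ(j) by Monte Carlo, exploiting the fact that Λk(j)(T*ik) = −log Sk(j)(T*ik) is unit exponential whatever the marginals are:

```python
import numpy as np

def clayton_sample(theta, n, rng):
    """Sample n pairs (u, v) from a Clayton copula (theta > 0) by conditional
    inversion: v = {u^(-theta) (w^(-theta/(1+theta)) - 1) + 1}^(-1/theta)."""
    u, w = rng.uniform(size=n), rng.uniform(size=n)
    v = (u ** -theta * (w ** (-theta / (1 + theta)) - 1) + 1) ** (-1 / theta)
    return u, v

def rho_from_theta(theta, n=200_000, seed=1):
    """Monte Carlo estimate of the correlation between the cumulative-hazard
    variates.  Because S(t, s) = C(S1(t), S2(s)), the pair (S1(T*1), S2(T*2))
    follows the copula C, and Lambda_k(T*k) = -log Sk(T*k) is unit exponential,
    so rho = corr(-log U, -log V) with (U, V) drawn from C."""
    rng = np.random.default_rng(seed)
    u, v = clayton_sample(theta, n, rng)
    return np.corrcoef(-np.log(u), -np.log(v))[0, 1]

# Tabulate the (monotone) map theta -> rho for the Clayton copula; the Gumbel
# copula is handled analogously with a positive-stable frailty sampler.
for theta in (0.5, 1.0, 2.0, 5.0):
    print(f"theta = {theta:3.1f}  ->  rho ~ {rho_from_theta(theta):.3f}")
```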
2.3. Bivariate structure of two log-rank test statistics
Consider applying the log-rank statistic (1) to the data for each endpoint (k = 1, 2) extracted from the bivariate time-to-event data obtained under the aforementioned situations. Then, the pair of standardized log-rank statistics (Z1, Z2) is approximately bivariate normally distributed with mean vector (√Nμ1(τ)/σ1, √Nμ2(τ)/σ2)⊤ and variance–covariance matrix Σ, with unit diagonal elements and off-diagonal element γ = V12(τ)/(σ1σ2), when N is sufficiently large (see [3]), where σk = √Vkk(τ);
μk(t) and Vkk(t) are the asymptotic forms of N−1/2Uk(t) and V̂kk(t), respectively, and Vkk(t), k = 1, 2, and V12(t) are the asymptotic variances and covariance of U1(t) and U2(t). These elements can be written, in all censoring scenarios with/without the composite setting, as

μk(t) = ∫0t Hk(s){Λ̃k(1)(ds) − Λ̃k(2)(ds)},  Vkk(t) = ∫0t Hk(s) dΛ̄k(s),
V12(t) = Σj=1,2 {r(j)}−1 ∫0t ∫0t {H1(u)H2(v)/(h1(j)(u)h2(j)(v))} G(u ∨ v) dA(j)(u, v), | (2)

where Hk(t) is the asymptotic form of Ĥk(t), given by Hk(t) = r(1)r(2)hk(1)(t)hk(2)(t)/{r(1)hk(1)(t) + r(2)hk(2)(t)} with hk(j)(t) = E[𝒴ik(t) | gi = j]; Λ̃k(j)(t) is the (scenario-specific) hazard of the kth endpoint in the group j, and Λ̄k(t) is the corresponding pooled hazard; dA(j)(t, s) = G(t ∨ s)−1E[dMi1(t)dMi2(s) | gi = j] is a covariance function for the martingale variation of (Mi1(t), Mi2(s)); G(t) is the survival function of the censoring time Ci; and t ∨ s represents max(t, s). However, note that the forms of Hk(t), Λ̃k(j)(t), and dA(j)(t, s) differ among the censoring scenarios, as provided hereafter. Some details of the derivations are in Appendix A. In advance, let Λ(j)(t, s) = −log S(j)(t, s) be the joint cumulative hazard function in group j. Using this notation, note that we can write Λ(j)(dt, t) = −S(j)(dt, t)/S(j)(t, t) and Λ(j)(t, dt) = −S(j)(t, dt)/S(j)(t, t).
- (i) Both non-fatal outcomes: Neither event-time censors the other, so Tik = min(T*ik, Ci) and Δik = 𝟙(T*ik ≤ Ci), k = 1, 2. The forms of Hk(t), the hazard increments, and dA(j)(t, s) for this case are given in [3] and are summarized in Table II.
- (ii) One fatal outcome: Assume that T*i1 is the time to the non-fatal event while T*i2 is the time to the fatal event, so that Ti1 = min(T*i1, T*i2, Ci) and Ti2 = min(T*i2, Ci). For example, in oncology trials, the TTP and OS endpoints are defined by the non-fatal event time T*i1 and the fatal one T*i2, respectively, while the PFS endpoint is defined by min(T*i1, T*i2) as the composite. The log-rank test is applied to the survival outcomes obtained using these endpoints. The applications to the two sets of bivariate data defined by the TTP and OS endpoints and by the PFS and OS endpoints are discussed in the following non-composite and composite settings, respectively. See Appendix A for the details of what is provided below.
Table II.
Elements of Uk(τ), μk(τ), Vkk(τ), and V12(τ) for the three censoring scenarios under the non-composite setting, written in terms of the hazard increment E[d𝒩ik(t) | 𝒴ik(t) = 1, gi = j], the at-risk expectation hk(j)(t) = E[𝒴ik(t) | gi = j], and dA(j)(t, s); t ∨ s = max(t, s).

| Element | k | Both non-fatal case | One fatal case | Both fatal case |
|---|---|---|---|---|
| Hazard increment | 1 | Λ1(j)(dt) | Λ(j)(dt, t) | Λ(j)(dt, t) |
| | 2 | Λ2(j)(dt) | Λ2(j)(dt) | Λ(j)(t, dt) |
| hk(j)(t) | 1 | S1(j)(t)G(t) | S(j)(t, t)G(t) | S(j)(t, t)G(t) |
| | 2 | S2(j)(t)G(t) | S2(j)(t)G(t) | S(j)(t, t)G(t) |
| dA(j)(t, s) | | see [3] | given in (3a) | 0 |

In each scenario, Hk(t) = r(1)r(2)hk(1)(t)hk(2)(t)/{r(1)hk(1)(t) + r(2)hk(2)(t)}.
Non-composite case
Consider the non-composite setting, that is, set Δi1 = 𝟙(T*i1 ≤ min(T*i2, Ci)) and Δi2 = 𝟙(T*i2 ≤ Ci) for the censoring indicators, where the non-fatal T*i1 is censored when the fatal T*i2 is observed earlier than T*i1. For the non-fatal endpoint, the at-risk expectation is h1(j)(t) = S(j)(t, t)G(t) and the hazard increment is the crude hazard Λ(j)(dt, t), j = 1, 2, both depending on the correlation between T*i1 and T*i2; for the fatal endpoint, h2(j)(t) = S2(j)(t)G(t) and the hazard increment is the marginal hazard Λ2(j)(dt), being the same as those of situation (i). Also, dA(j)(t, s), related to the covariance of the martingale variation, is
| (3a) |
Composite case
In the composite setting, set Δi1 = 𝟙(min(T*i1, T*i2) ≤ Ci) and Δi2 = 𝟙(T*i2 ≤ Ci) for the censoring indicators. The forms of Hk(t) and hk(j)(t) are the same as those in the non-composite case, while the hazard increment for the first endpoint and dA(j)(t, s) change from the non-composite case, with the intensity information on the second endpoint (the fatal event) added into them (k = 1, 2; j = 1, 2). That is, in this case, we have

E[d𝒩i1(t) | 𝒴i1(t) = 1, gi = j] = Λ(j)(dt, t) + Λ(j)(t, dt)

and
| (3b) |
- (iii) Both fatal outcomes: Both T*i1 and T*i2 are the times to fatal events, so that we observe Ti1 = Ti2 = min(T*i1, T*i2, Ci), Δi1 = 𝟙(T*i1 ≤ min(T*i2, Ci)), and Δi2 = 𝟙(T*i2 ≤ min(T*i1, Ci)). See Appendix A for the details of what is provided in the subsequent section.
Non-composite case
Assume the non-composite setting, that is, Δi1 = 𝟙(T*i1 ≤ min(T*i2, Ci)) and Δi2 = 𝟙(T*i2 ≤ min(T*i1, Ci)). The forms of Hk(t), the hazard increments, and hk(j)(t), k = 1, 2, j = 1, 2, are similar to those for the first (non-fatal) endpoint in situation (ii), because, if either of Ti1 or Ti2 is completely observed, the other is always censored. In this case, we can derive hk(j)(t) = S(j)(t, t)G(t) together with the crude hazards Λ(j)(dt, t) (k = 1) and Λ(j)(t, dt) (k = 2), and
| (4a) |
Composite case
Here, set Δi1 = 𝟙(min(T*i1, T*i2) ≤ Ci) and Δi2 = 𝟙(T*i2 ≤ min(T*i1, Ci)) for the composite setting. In this case, only the hazard increment for the first endpoint and dA(j)(t, s), j = 1, 2, change from the non-composite version. Hence, we have

E[d𝒩i1(t) | 𝒴i1(t) = 1, gi = j] = Λ(j)(dt, t) + Λ(j)(t, dt)

and
| (4b) |
The forms of Hk(t) and hk(j)(t), k, j = 1, 2, are the same as those in the non-composite setting.
Table II summarizes the hazard increments, the at-risk expectations hk(j)(t), Hk(t), and dA(j)(t, s) among the three censoring scenarios under the non-composite setting. The table clearly shows how these quantities change across the three scenarios. For all scenarios, the power is given by
| (5) |
because the hypothesis is rejected if the bivariate statistic (Z1, Z2) takes values in the region 𝒵(0), where the integrand in (5) is the bivariate normal density with the mean vector and variance–covariance matrix Σ given above.
3. Co-primary endpoints
3.1. Hypothesis testing, power, and sample sizes
We are interested in testing hypotheses on the HRs to evaluate a joint reduction of the occurrence of events over time on both outcomes, that is, H0: ψ1(t) = 1 or ψ2(t) = 1 versus H1: ψ1(t) > 1 and ψ2(t) > 1 for all t. In all of the censoring scenarios, using the two log-rank statistics Zk, k = 1, 2, for this hypothesis testing, the procedure is to

reject H0 in favour of H1 if Z1 > zα and Z2 > zα | (6)
at the prespecified significance level α, where zα is the 100(1 − α) percentile of N(0, 1). The overall Type I error associated with the null hypothesis H0 is controlled by the maximum of the marginal Type I errors [5]. This means that, to investigate whether the overall Type I error exceeds the nominal level, it is enough to investigate whether the marginal ones do [2]. The behavior of the Type I error for the univariate log-rank test is well known [9–11]. The one-sided test procedure (6) based on the asymptotic normality may inflate the Type I error in some situations, such as small sample sizes and/or unbalanced designs (in particular, with r(1) > 0.5). We can improve the precision by correcting the critical value based on the sample size N and the allocation rate r(1) so as to control the marginal Type I errors. However, the overall Type I error on H0 is, for example, the product of the marginal ones for independent endpoints, and it is usually much smaller than the maximum of the marginal ones as long as ρ(j) is not too high, so that the inflation problem seen in the one-sided log-rank test may be moderately reduced in the multiple co-primary problem.
In the procedure (6), the rejection region of H0 is {Z1 > zα and Z2 > zα}. In all of the scenarios, therefore, the power function for the joint reduction in both time-to-event outcomes is obtained from (5) with this rejection region.
This overall power is referred to as ‘complete power’ [12] or ‘conjunctive power’ [13], which is simply calculated using the cumulative distribution function of the bivariate normal distribution. The power can be approximately calculated (under large samples) by
| (7) |
Let Ncp be the minimum of the total sample size N required for testing H0 against H1. Thus, Ncp is the smallest integer N satisfying the power requirement (7), and it is given by
| (8) |
where ⌈x⌉ is the smallest integer not less than x, σ1 and σ2 are the standard deviations √V11(τ) and √V22(τ), respectively, Kβ is the solution of the integral equation determining the target power, and R is the correlation matrix of (Z1, Z2), with unit diagonal elements and off-diagonal element γ = V12(τ)/(σ1σ2).
A grid search to find the value of Ncp often requires considerable computing time: the search proceeds by gradually increasing Ncp until the power (7) exceeds the desired power. Alternative methods to reduce the computational time are the Newton–Raphson algorithm for finding Kβ in [14] or the basic linear interpolation algorithm in [2].
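To illustrate the search numerically, the sketch below (Python with SciPy; a simplified stand-in for the formula (8), not the authors' implementation) evaluates the approximate conjunctive power (7) as a bivariate normal probability, brackets Ncp between the largest univariate size (a valid lower bound, because each marginal power must reach 1 − β) and a Bonferroni-type size with β/2 per endpoint (a valid upper bound), and then bisects. The per-subject standardized effects δk and the correlation γ are assumed inputs computed from the elements in (2):

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

def conjunctive_power(N, delta, gamma, alpha=0.025):
    """Approximate power (7): P(Z1 > z_alpha, Z2 > z_alpha), where (Z1, Z2)
    is bivariate normal with means sqrt(N)*delta_k, unit variances, and
    correlation gamma (delta_k and gamma are assumed, precomputed inputs)."""
    z = norm.ppf(1 - alpha)
    m = np.sqrt(N) * np.asarray(delta)
    # P(Z1 > z, Z2 > z) equals the lower-orthant probability of the centred
    # pair at (m1 - z, m2 - z), by symmetry of the bivariate normal.
    return multivariate_normal.cdf(m - z, mean=[0.0, 0.0],
                                   cov=[[1.0, gamma], [gamma, 1.0]])

def n_co_primary(delta, gamma, alpha=0.025, beta=0.2):
    """Bracket-and-bisect search for N_cp achieving conjunctive power 1 - beta."""
    za, zb, zb2 = norm.ppf(1 - alpha), norm.ppf(1 - beta), norm.ppf(1 - beta / 2)
    lo = max(((za + zb) / d) ** 2 for d in delta)    # max univariate size
    hi = max(((za + zb2) / d) ** 2 for d in delta)   # Bonferroni on Type II error
    while hi - lo > 0.5:
        mid = 0.5 * (lo + hi)
        if conjunctive_power(mid, delta, gamma, alpha) < 1 - beta:
            lo = mid
        else:
            hi = mid
    return int(np.ceil(hi))

print(n_co_primary(delta=(0.15, 0.13), gamma=0.4))   # hypothetical inputs
```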
One may expect to calculate the required number of participants from the required number of events, as in the procedure of [15] for univariate data. However, we encounter difficulty in finding such a procedure in the bivariate case [3]. One cause is that the relationship between the numbers of events and participants is more complicated than in the univariate case, because there are four observation patterns (each endpoint censored or not) rather than two. Another is that it is difficult for the numbers of events alone to reflect differences between correlation models (such as early or late dependency) on a restricted time interval, because the expected numbers of events are usually considered under the uncensored model (or after sufficiently long follow-up). Even so, the required numbers of events are useful for monitoring a trial, and they can be obtained using D𝚤𝚥 = Ncp × P𝚤𝚥, where P𝚤𝚥 is the proportion of each observation pattern and D𝚤𝚥 is the expected number of events in the case where the first endpoint is observed (𝚤 = 1) or censored (𝚤 = 0) and the second is observed (𝚥 = 1) or not (𝚥 = 0). See Appendix B for details of the calculation of P𝚤𝚥 under the three censoring schemes; a small Monte Carlo sketch of this calculation follows.
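For instance, the pattern proportions P𝚤𝚥 and the expected event counts D𝚤𝚥 in the one fatal, non-composite case can be approximated as below (illustrative Python; the hazard values are hypothetical and the calculation is for a single group):

```python
import numpy as np

def event_pattern_probs(lam1, lam2, theta, tau_a=2.0, tau_f=3.0,
                        n=400_000, seed=3):
    """Monte Carlo estimate of P_ab = Pr(Delta_1 = a, Delta_2 = b) for the one
    fatal, non-composite case: exponential marginals (lam1, lam2), Clayton
    copula (theta), and censoring C = U(0, tau_a) + tau_f.  Computed for one
    group; in practice, average over groups with weights r(1) and r(2)."""
    rng = np.random.default_rng(seed)
    u, w = rng.uniform(size=n), rng.uniform(size=n)
    v = (u ** -theta * (w ** (-theta / (1 + theta)) - 1) + 1) ** (-1 / theta)
    t1, t2 = -np.log(u) / lam1, -np.log(v) / lam2
    c = rng.uniform(0, tau_a, size=n) + tau_f
    d1 = t1 <= np.minimum(t2, c)      # non-fatal event observed
    d2 = t2 <= c                      # fatal event observed
    return {(a, b): float(np.mean((d1 == a) & (d2 == b)))
            for a in (1, 0) for b in (1, 0)}

P = event_pattern_probs(lam1=0.3, lam2=0.2, theta=1.0)   # hypothetical hazards
D = {ab: round(972 * p) for ab, p in P.items()}          # D_ab = N_cp * P_ab
```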
3.2. Behavior of the sample size
We investigate the behavior of the sample size and power for detecting the joint reduction in bivariate time-to-event data under the three censoring schemes with two time-dependent association structures: the asymmetric late (tail) dependency generated by the Clayton copula [16] and the early (tail) dependency generated by the Gumbel copula [17], both of which have been widely used in practice. We generate bivariate time-to-event data by supposing that the marginals of T*i1 and T*i2 are exponential, and that Ci = U(0, τa) + τf (hence, τ = τa + τf), where τa and τf are the lengths of the entry period to the trial and the follow-up period, respectively, and U(0, τa) denotes a uniform random number on (0, τa). The target power 1 − β = 0.8, the significance level α = 0.025, τa = 2, and τf = 3 are used, and all empirical powers are computed by Monte Carlo trials with 100,000 replications throughout; a stripped-down version of this simulation is sketched below.
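The sketch (illustrative Python; all function names are ours, and a small n_rep is used instead of the 100,000 replications of the actual study) samples (T*i1, T*i2) from a Clayton copula with exponential marginals, applies the censoring Ci = U(0, τa) + τf, and computes the two standardized log-rank statistics of (1) for the one fatal, non-composite case:

```python
import numpy as np
from statistics import NormalDist

def clayton_sample(theta, n, rng):
    """Conditional-inversion sampler for the Clayton copula (late dependency)."""
    u, w = rng.uniform(size=n), rng.uniform(size=n)
    v = (u ** -theta * (w ** (-theta / (1 + theta)) - 1) + 1) ** (-1 / theta)
    return u, v

def logrank_z(time, event, group):
    """Standardized log-rank statistic as in (1); positive values indicate
    fewer events in the test group (group == 1)."""
    O = E = V = 0.0
    for t in np.unique(time[event == 1]):
        at_risk = time >= t
        n, n1 = at_risk.sum(), (at_risk & (group == 1)).sum()
        d = ((time == t) & (event == 1)).sum()
        d1 = ((time == t) & (event == 1) & (group == 1)).sum()
        O, E = O + d1, E + d * n1 / n
        if n > 1:
            V += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    return (E - O) / np.sqrt(V)

def empirical_conjunctive_power(N, lam, hr, theta, tau_a=2.0, tau_f=3.0,
                                alpha=0.025, n_rep=2000, seed=11):
    """One fatal, non-composite case: the non-fatal T*_1 is censored by the
    fatal T*_2 and both by C = U(0, tau_a) + tau_f.  lam = (lam1, lam2) are
    control-group hazards; the test-group hazard is lam_k / hr_k, so that
    hr_k > 1 means the test intervention is superior."""
    rng = np.random.default_rng(seed)
    z = NormalDist().inv_cdf(1 - alpha)
    group = (np.arange(N) % 2).astype(int)              # 1:1 allocation
    hits = 0
    for _ in range(n_rep):
        u, v = clayton_sample(theta, N, rng)
        t1 = -np.log(u) / np.where(group == 1, lam[0] / hr[0], lam[0])
        t2 = -np.log(v) / np.where(group == 1, lam[1] / hr[1], lam[1])
        c = rng.uniform(0, tau_a, size=N) + tau_f
        T2, D2 = np.minimum(t2, c), (t2 <= c).astype(int)
        T1 = np.minimum(t1, np.minimum(t2, c))
        D1 = (t1 <= np.minimum(t2, c)).astype(int)
        hits += (logrank_z(T1, D1, group) > z) and (logrank_z(T2, D2, group) > z)
    return hits / n_rep
```

The early dependency (Gumbel copula) case changes only the copula sampler, for example, via a positive-stable frailty.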
Let Nsim be the simulation-based sample size required for testing H1, and let Nk denote the minimum total sample size required to test the single hypothesis H0k for the kth endpoint. To distinguish the two versions of Ncp and N1, we write them as Ñcp and Ñ1 if the first endpoint is replaced by the composite of the first and the second ones (composite setting), and as Ncp and N1 otherwise (non-composite setting). Let P̃cp denote the empirical power (%) for detecting H1 when the total number of participants is the Ncp designed using the formula (8). See Section B.1 of the Supporting Information for results other than the ones provided later.
Formula performance and sample size behavior: one fatal case
Focusing on the one fatal case scenario, suppose that T*i1 and T*i2 are the TTP and OS endpoints, respectively. We evaluate the practical performance of the sample size formula (8) by comparing it with alternative sizing solutions based on the univariate versions (NTTP, NPFS, and NOS) and on simulation (Nsim) under common α and β, where NTTP (= N1), NPFS (= Ñ1), and NOS (= N2) are the values of Nk when the kth endpoint is the TTP, the PFS, and the OS, respectively. In the one fatal case, the bivariate survival data (Ti1, Δi1) and (Ti2, Δi2) are the TTP and OS outcomes under the non-composite setting, but the PFS and OS outcomes under the composite setting.
Table III displays the required total sample sizes Ncp, Nsim, N1, and N2 with the empirical power P̃cp, for fixed τ-time survival rates, when the common HR and ρ(k) vary over the combinations of ψTTP = ψOS = 1.3, 1.5, 1.7 and ρ(1) = ρ(2) = 0, 0.3, 0.5, and 0.8. Note that ψTTP (= ψ1) and ψOS (= ψ2) are the HRs for the TTP and OS, respectively. Similarly, STTP(τ) and SOS(τ) are the τ-time survival rates for the TTP and OS.
Table III.
The case of one fatal outcome (ii) (semi-competing risk): total numbers of participants Ncp calculated from (8), the corresponding empirical powers P̃cp (%), and the alternative sizing solutions Nsim, NTTP, NPFS, and NOS. The first block of columns refers to the composite setting (PFS and OS) and the second to the non-composite setting (TTP and OS).

| Structure | ψTTP = ψOS | ρ(k) | Nsim | Ñcp | P̃cp | NPFS | NOS | Nsim | Ncp | P̃cp | NTTP | NOS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Late (Clayton copula) | 1.3 | 0.0 | 968 | 972 | 80.1 | 262 | 972 | 966 | 972 | 80.1 | 288 | 972 |
| | 1.3 | 0.3 | 968 | 972 | 80.2 | 312 | 972 | 972 | 974 | 80.2 | 314 | 972 |
| | 1.3 | 0.5 | 968 | 972 | 80.2 | 348 | 972 | 970 | 976 | 80.3 | 328 | 972 |
| | 1.3 | 0.8 | 968 | 972 | 79.8 | 418 | 972 | 966 | 972 | 80.3 | 324 | 972 |
| | 1.5 | 0.0 | 430 | 436 | 80.5 | 196 | 432 | 474 | 480 | 80.9 | 284 | 432 |
| | 1.5 | 0.3 | 436 | 440 | 80.6 | 232 | 432 | 492 | 496 | 80.9 | 320 | 432 |
| | 1.5 | 0.5 | 440 | 444 | 80.6 | 260 | 432 | 500 | 506 | 80.8 | 342 | 432 |
| | 1.5 | 0.8 | 450 | 456 | 80.5 | 316 | 432 | 504 | 510 | 80.9 | 362 | 432 |
| | 1.7 | 0.0 | 272 | 278 | 81.2 | 160 | 266 | 354 | 360 | 81.3 | 280 | 266 |
| | 1.7 | 0.3 | 280 | 286 | 80.9 | 190 | 266 | 380 | 388 | 81.0 | 324 | 266 |
| | 1.7 | 0.5 | 288 | 294 | 81.0 | 214 | 266 | 402 | 408 | 81.1 | 354 | 266 |
| | 1.7 | 0.8 | 308 | 314 | 80.9 | 262 | 266 | 426 | 432 | 80.9 | 394 | 266 |
| Early (Gumbel copula) | 1.3 | 0.0 | 968 | 972 | 80.2 | 262 | 972 | 968 | 972 | 80.1 | 288 | 972 |
| | 1.3 | 0.3 | 968 | 972 | 80.3 | 282 | 972 | 966 | 972 | 80.2 | 262 | 972 |
| | 1.3 | 0.5 | 968 | 972 | 80.3 | 294 | 972 | 970 | 972 | 80.1 | 232 | 972 |
| | 1.3 | 0.8 | 970 | 972 | 80.1 | 306 | 972 | 968 | 972 | 80.2 | 158 | 972 |
| | 1.5 | 0.0 | 432 | 436 | 80.4 | 196 | 432 | 474 | 480 | 80.9 | 284 | 432 |
| | 1.5 | 0.3 | 430 | 434 | 80.5 | 212 | 432 | 464 | 472 | 80.7 | 278 | 432 |
| | 1.5 | 0.5 | 430 | 434 | 80.5 | 224 | 432 | 456 | 462 | 80.5 | 268 | 432 |
| | 1.5 | 0.8 | 428 | 432 | 80.3 | 238 | 432 | 432 | 438 | 80.8 | 220 | 432 |
| | 1.7 | 0.0 | 272 | 278 | 81.0 | 160 | 266 | 352 | 360 | 81.2 | 280 | 266 |
| | 1.7 | 0.3 | 274 | 278 | 80.9 | 176 | 266 | 356 | 364 | 81.2 | 294 | 266 |
| | 1.7 | 0.5 | 274 | 278 | 80.8 | 186 | 266 | 358 | 362 | 81.0 | 300 | 266 |
| | 1.7 | 0.8 | 274 | 278 | 80.8 | 206 | 266 | 336 | 342 | 80.9 | 290 | 266 |
The sample sizes Ncp calculated from the formula (8) are usually slightly conservative compared with Nsim, and the corresponding empirical powers P̃cp are preferable, that is, P̃cp are slightly larger than the target powers (although P̃cp tends to move further from the target as the HR increases above 1). The time to compute Nsim is usually much longer than that for Ncp, and it grows as the effect size becomes smaller. Hence, the formula (8) reduces the cost greatly, regardless of the effect size, and it is also useful as an initial value in the search for Nsim. Also, Ñcp (Ncp under the composite setting) is smaller than Ncp (under the non-composite setting) in all cases. As the effect size of the OS decreases below that of the TTP, the value of Ncp approaches NOS and moves away from NPFS and NTTP. If both HRs are approximately equal, ψTTP ≈ ψOS, then Ncp is slightly larger than max(N1, N2) (the ratios Ncp/max(N1, N2) are at most about 1.29 in Table III). Further, when comparing the late dependency (Clayton copula) and the early one (Gumbel copula): if NTTP is relatively close to NOS, Ncp increases proportionally to ρ(j) under the late dependency, while it decreases or does not change as ρ(j) varies under the early dependency. Similar tendencies, but with more moderate variation, are observed for Ñcp. That is, a high value of ρ(j) makes the two log-rank statistics correlated, but a higher ρ(j) also increases the censoring rate, so that a higher ρ(j) does not contribute very much to the reduction of Ncp relative to the sample size at ρ(j) = 0. Hence, Ncp increases under the late dependency, because many observations are censored before the dependence takes effect.
Sample size behavior under correlated both fatal outcomes

Consider the both fatal outcomes scenario in which T*i1 is the time to disease-specific mortality and T*i2 is the time to other-cause mortality. When there are two such fatal endpoints, it is often assumed, for ease, that the endpoints are uncorrelated. But the assumption of no correlation is often unjustified scientifically. We investigate how the required sample sizes Ñcp, Ncp, N1, and N2 behave when the fatal endpoints are correlated. We compute the sample sizes using formula (8) and provide the results in Table IV, displaying Ncp, N1, and N2 with the empirical power P̃cp, when the common HR and ρ(k) vary over the combinations of ψ1 = ψ2 = 1.5, 1.7 and ρ(1) = ρ(2) = 0, 0.3, 0.5, and 0.8.
Table IV.
The case of both fatal outcomes (iii) (competing risk): total numbers of participants Ncp calculated from (8), the corresponding empirical powers P̃cp (%), and the alternative sizing solutions N1 and N2. The first block of columns refers to the composite setting and the second to the non-composite setting.

| Structure | ψ1 = ψ2 | ρ(k) | Ñcp | P̃cp | Ñ1 | N2 | Ncp | P̃cp | N1 | N2 |
|---|---|---|---|---|---|---|---|---|---|---|
| Late (Clayton copula) | 1.5 | 0 | 418 | 80.7 | 254 | 404 | 710 | 80.7 | 648 | 404 |
| | 1.5 | 0.3 | 474 | 80.8 | 294 | 456 | 834 | 80.6 | 774 | 456 |
| | 1.5 | 0.5 | 518 | 80.4 | 328 | 498 | 944 | 80.5 | 886 | 498 |
| | 1.5 | 0.8 | 626 | 80.7 | 414 | 592 | 1258 | 80.2 | 1211 | 592 |
| | 1.7 | 0 | 404 | 80.8 | 200 | 400 | 528 | 81.0 | 400 | 400 |
| | 1.7 | 0.3 | 466 | 80.8 | 232 | 462 | 608 | 81.0 | 462 | 462 |
| | 1.7 | 0.5 | 518 | 80.6 | 258 | 514 | 676 | 80.8 | 514 | 514 |
| | 1.7 | 0.8 | 652 | 80.7 | 324 | 646 | 850 | 80.7 | 646 | 646 |
| Early (Gumbel copula) | 1.5 | 0 | 418 | 80.8 | 254 | 404 | 710 | 80.8 | 648 | 404 |
| | 1.5 | 0.3 | 444 | 81.0 | 288 | 422 | 854 | 80.7 | 814 | 422 |
| | 1.5 | 0.5 | 452 | 80.9 | 312 | 422 | 1010 | 80.5 | 992 | 422 |
| | 1.5 | 0.8 | 436 | 81.0 | 360 | 364 | 1840 | 80.6 | 1840 | 364 |
| | 1.7 | 0 | 404 | 80.7 | 200 | 400 | 528 | 81.1 | 400 | 400 |
| | 1.7 | 0.3 | 458 | 80.8 | 228 | 454 | 598 | 81.3 | 454 | 454 |
| | 1.7 | 0.5 | 496 | 80.8 | 246 | 492 | 648 | 81.1 | 492 | 492 |
| | 1.7 | 0.8 | 566 | 81.0 | 282 | 562 | 740 | 81.2 | 562 | 562 |
The sample sizes Ncp from (8) are usually slightly conservative in the sense that the corresponding empirical powers P̃cp are slightly larger than the target powers, similarly to the one fatal case. The value of Ñcp is smaller than Ncp in every case, and Ncp and N2 usually increase proportionally to ρ(j) under both the late and early dependencies, but Ñcp and N2 decrease only when ψ1 = ψ2 = 1.5, ρ(j) = 0.8, and the Gumbel copula is used. That is, a stronger dependency increases the censoring rate more than in the one fatal case, and hence the reduction of Ncp from correlated log-rank statistics is not obtained. On the other hand, the composite endpoint strategy using Ñcp is the most reasonable in terms of smaller sample sizes. In particular, when ψ1 = ψ2 = 1.7, Ñcp is only slightly larger than N2.
3.3. Illustration: the ICON7 study
We illustrate the sample size methods with an example. Consider 'A Randomized, Two-Arm, Multi-Centre Gynaecologic Cancer Inter Group Trial of Adding Bevacizumab to Standard Chemotherapy (Carboplatin and Paclitaxel) in Patients With Epithelial Ovarian Cancer' (ICON7) [18]. The study was designed to investigate the addition of bevacizumab to standard chemotherapy for first-line treatment of women with ovarian cancer. The primary endpoints of interest were the PFS and OS. The protocol stated that 684 PFS events were needed to detect a 28% change in the PFS from a median value of 18 months in the control group to 23 months in the bevacizumab group (i.e., ψPFS = 1.28), with a power of 90% at the significance level of 5% (a two-sided log-rank test), while 715 OS events were required to detect a 23% improvement in the OS from a median value of 43 months in the control group to 53 months in the bevacizumab group (i.e., ψOS = 1.23), with a power of 80% at a significance level of 5% (two-sided test). The protocol sample size of 1520 patients was determined on the grounds that the required numbers of PFS and OS events were expected to occur by 36 and 60 months after the first randomization, respectively, assuming constant recruitment over 24 months and allowing for some elements of uncertainty.
We set the 60-month PFS and OS rates in the control group based on the aforementioned information, taking account of the hazard rates λPFS = log 2/18 and λOS = log 2/43 per month from the protocol, the exponential assumptions, and their uncertainty. Also, we derive the HR for the TTP, because our calculation is based on the TTP and OS rather than the PFS and OS. That is, the HR and survival rate for the TTP can be computed as ψTTP ≈ 1.31 and STTP(60) ≈ 0.26 from the information on the medians, based on

λTTP(j) = λPFS(j) − λOS(j), j = 1, 2,

which follows from the independence and exponential assumptions. These values of the HR and STTP(60) are used for illustration, although the independence assumption is suspect.
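The arithmetic behind these TTP quantities can be verified directly; a short sketch (illustrative Python) reproduces ψTTP ≈ 1.31 and STTP(60) ≈ 0.26 from the protocol medians under the stated independence and exponential assumptions, under which the PFS hazard is the sum of the TTP and OS hazards:

```python
import numpy as np

log2 = np.log(2.0)
# Exponential hazards per month from the protocol medians: PFS 18 -> 23, OS 43 -> 53.
lam_pfs = log2 / np.array([18.0, 23.0])   # [control, bevacizumab]
lam_os = log2 / np.array([43.0, 53.0])
# Independence + exponentiality: lam_PFS = lam_TTP + lam_OS in each group.
lam_ttp = lam_pfs - lam_os
psi_ttp = lam_ttp[0] / lam_ttp[1]         # HR (control/test) for the TTP
s_ttp_60 = np.exp(-lam_ttp[0] * 60.0)     # 60-month TTP rate, control group
print(f"psi_TTP ~ {psi_ttp:.3f}, S_TTP(60) ~ {s_ttp_60:.2f}")   # ~1.313, ~0.26
```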
Table V shows the total sample size Ncp and the empirical powers P̃cp, P̃TTP, P̃PFS, and P̃OS (%) for evaluating the joint and single hypotheses of the TTP, PFS, and OS given Ncp, respectively. The alternative sizings NTTP, NPFS, and NOS give the sample sizes required to test the single hypotheses of the TTP, PFS, and OS, respectively. The sample sizes were calculated to evaluate the joint reduction in both time-to-event outcomes of the TTP and OS or the PFS and OS, with the target power of 80% at the significance level of 2.5%, based on the assumptions of the ICON7 study, assuming a τa = 24 months accrual duration and an additional τf = 36 months of follow-up. The empirical power P̃PFS is larger than the target power of 90% for testing the PFS written in the protocol, because the analysis times in the protocol design (τf = 12) and ours (τf = 36) differ. Similarly, note that NPFS is calculated under the target power of 80%, smaller than the power of 90% in the protocol. Also, the expected numbers of bivariate events are listed using the notation corresponding to D𝚤𝚥 = Ncp × P𝚤𝚥, in which Ncp is replaced by Ñcp or Ncp as appropriate, and P𝚤𝚥 is the probability calculated under the non-composite setting (see Appendix B). Note that, even in the composite setting with the PFS and OS, D𝚤𝚥 represents the number of participants for whom either or both of the TTP and OS are observed. In addition, the correlation is assumed to be common between the two groups, that is, ρ(1) = ρ(2).
Table V.
Total sample size Ncp and the empirical power for detecting the joint reduction in the TTP (or PFS) and the OS, where Ncp is designed with the target power 80% at α = 0.025, based on the assumptions of the ICON7 study, τa = 24 and τf = 36. D11, D10, D01, and D00 are the expected numbers of events for the (TTP, OS) observation patterns.

Composite setting (PFS and OS):

| Structure | ρ(k) | Ñcp | P̃cp (P̃PFS, P̃OS) | NPFS | NOS | D11 | D10 | D01 | D00 |
|---|---|---|---|---|---|---|---|---|---|
| Late (Clayton copula) | 0.0 | 1510 | 80.1 (99.0, 80.3) | 658 | 1498 | 240 | 476 | 487 | 306 |
| | 0.3 | 1528 | 80.0 (97.4, 80.7) | 789 | 1498 | 275 | 429 | 461 | 363 |
| | 0.5 | 1544 | 80.0 (96.0, 81.1) | 881 | 1498 | 302 | 395 | 441 | 405 |
| | 0.8 | 1564 | 80.0 (93.4, 81.7) | 1024 | 1498 | 370 | 316 | 383 | 495 |
| Early (Gumbel copula) | 0.0 | 1510 | 80.3 (99.0, 80.6) | 658 | 1498 | 240 | 476 | 487 | 306 |
| | 0.3 | 1508 | 80.1 (98.5, 80.3) | 700 | 1498 | 315 | 384 | 411 | 398 |
| | 0.5 | 1506 | 80.2 (98.2, 80.4) | 726 | 1498 | 374 | 327 | 351 | 454 |
| | 0.8 | 1498 | 80.0 (97.9, 80.1) | 744 | 1498 | 510 | 243 | 211 | 533 |

Non-composite setting (TTP and OS):

| Structure | ρ(k) | Ncp | P̃cp (P̃TTP, P̃OS) | NTTP | NOS | D11 | D10 | D01 | D00 |
|---|---|---|---|---|---|---|---|---|---|
| Late (Clayton copula) | 0.0 | 1628 | 80.3 (96.3, 83.3) | 913 | 1498 | 259 | 514 | 525 | 330 |
| | 0.3 | 1674 | 80.2 (94.9, 84.2) | 1023 | 1498 | 301 | 470 | 505 | 397 |
| | 0.5 | 1658 | 80.2 (94.1, 84.7) | 1076 | 1498 | 332 | 434 | 484 | 445 |
| | 0.8 | 1658 | 80.3 (94.1, 84.1) | 1056 | 1498 | 392 | 335 | 406 | 525 |
| Early (Gumbel copula) | 0.0 | 1628 | 80.2 (96.4, 83.3) | 913 | 1498 | 259 | 514 | 525 | 330 |
| | 0.3 | 1594 | 80.1 (96.7, 82.3) | 878 | 1498 | 333 | 406 | 435 | 420 |
| | 0.5 | 1562 | 80.2 (97.0, 81.8) | 830 | 1498 | 388 | 339 | 364 | 471 |
| | 0.8 | 1510 | 80.0 (98.6, 80.2) | 695 | 1498 | 514 | 245 | 213 | 538 |

P̃TTP, P̃PFS, and P̃OS are the empirical powers (%) when the single hypotheses on the TTP, PFS, and OS are tested, given the total sample size Ncp.
In calculating the sample size, we assume that one event is fatal, because the TTP may be censored by the OS (the fatal event). If the association between the TTP and OS is late-time dependent, then the total sample sizes required to test the PFS and OS jointly, with common correlation between the two groups ρ(k) = 0.0, 0.3, 0.5, and 0.8, are 1510, 1528, 1544, and 1564, respectively. The sample size increases monotonically from ρ(k) = 0 to 0.8: the difference between the smallest and largest sample sizes may seem relatively large, although the ratio is only about 1.04. If the association is early-time dependent, then the total sample sizes required under ρ(k) = 0.0, 0.3, 0.5, and 0.8 are 1510, 1508, 1506, and 1498, respectively. The sample size decreases with increasing correlation, but the reduction rate from the largest sample size, given by ρ(k) = 0.0, is quite small. Also, comparing the composite and non-composite endpoints, Ñcp is smaller than the Ncp required to test the TTP and OS jointly, but the difference is slight. The expected numbers of bivariate events D𝚤𝚥 provide useful information for the monitoring process of the trial.
4. Multiple primary endpoints
4.1. Hypothesis testing, power, and sample sizes
We discuss calculating the required sample size for trials with multiple primary endpoints. For simplicity, we consider the power and sample size calculation using the simplest, well-known, and widely used procedure, the (weighted) Bonferroni procedure. Other procedures for controlling the Type I error rate are available [6, 19, 20].
The weighted Bonferroni procedure allocates the Type I error rate α between the endpoints with weight ω, that is, α1 = ωα for the first endpoint and α2 = (1 − ω)α for the second endpoint. We are then interested in testing the null hypothesis of no effect on either endpoint against the alternative that the intervention is superior on at least one endpoint, at the (overall) significance level α, based on the log-rank test statistics Z1 and Z2 given in Section 2. The testing procedure is to

reject H0 if Z1 > zα1 or Z2 > zα2 | (9)

where zα1 and zα2 are the 100(1 − α1) and 100(1 − α2) percentiles of N(0, 1), respectively. Therefore, because the rejection region of H0 is {Z1 > zα1 or Z2 > zα2}, we have the power function for a reduction in at least one of the time-to-event outcomes under the weighted Bonferroni procedure.
This overall power is referred to as 'minimal power' [12] or 'disjunctive power' [13]. Similarly to Section 3.1, for all three censoring scenarios, the power based on the function (5) can be approximated by
| (10) |
Letting Nmp be the minimum of the total sample size N required for testing H0 against H1, the formula for Nmp is given by
| (11) |
where Lβ is the solution of the integral equation
and σk and R are the same as the definitions given in Section 3.1. Similarly to the methods for obtaining Ncp, we can use a search method for Lβ, such as the Newton–Raphson algorithm [14] or the basic linear interpolation algorithm [2], in order to compute Nmp; these generally take less computing time than a direct search that increases Nmp sequentially until (10) exceeds the desired power. Also, similarly to Section 3.1, the required numbers of events that are useful for monitoring a trial can be obtained by D𝚤𝚥 = Nmp × P𝚤𝚥 (see Appendix B for details regarding the calculation of P𝚤𝚥 under the three censoring scenarios). A sketch of this computation follows.
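The sketch below (Python with SciPy; a simplified stand-in for the formula (11), under the same assumed standardized inputs δk and γ as in the Section 3.1 sketch) evaluates the disjunctive power for a given Bonferroni weight ω, bisects on N, and scans ω over the grid 0, 0.05, …, 1 to find ω0:

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

def disjunctive_power(N, delta, gamma, omega, alpha=0.025):
    """Approximate power (10) under alpha_1 = omega*alpha and
    alpha_2 = (1 - omega)*alpha: 1 - P(Z1 <= z_{alpha1}, Z2 <= z_{alpha2})."""
    a = np.array([omega * alpha, (1 - omega) * alpha])
    if np.any(a <= 0):                 # omega of 0 or 1: a single test remains
        k = int(a[0] <= 0)
        return 1 - norm.cdf(norm.ppf(1 - alpha) - np.sqrt(N) * delta[k])
    m = np.sqrt(N) * np.asarray(delta)
    return 1 - multivariate_normal.cdf(norm.ppf(1 - a) - m, mean=[0.0, 0.0],
                                       cov=[[1.0, gamma], [gamma, 1.0]])

def n_multiple_primary(delta, gamma, alpha=0.025, beta=0.2):
    """Smallest N over the grid of Bonferroni weights, returning (N_mp, omega_0)."""
    best = (np.inf, None)
    for omega in np.round(np.arange(0.0, 1.01, 0.05), 2):
        # Generous upper bracket: four times the univariate size at alpha/2
        # for the weaker endpoint; the power is increasing in N, so bisect.
        lo, hi = 1.0, 4.0 * max(((norm.ppf(1 - alpha / 2)
                                  + norm.ppf(1 - beta)) / d) ** 2 for d in delta)
        while hi - lo > 0.5:
            mid = 0.5 * (lo + hi)
            if disjunctive_power(mid, delta, gamma, omega, alpha) < 1 - beta:
                lo = mid
            else:
                hi = mid
        N = int(np.ceil(hi))
        if N < best[0]:
            best = (N, omega)
    return best

print(n_multiple_primary(delta=(0.15, 0.13), gamma=0.4))  # hypothetical inputs
```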
The overall Type I error associated with the null H0 is controlled by the sum of the marginal Type I errors. Similarly to Section 3, the inflation problem of the overall Type I error in the procedure (9) based on the asymptotic normal approximation parallels that of the marginal ones. The overall significance level α1 + α2 on H0 is quite close to the level α1 + α2 − α1α2 obtained when the two endpoints are independent, so that the influence of marginal inflation may be practically larger in the multiple primary problem than in the multiple co-primary one. However, we can control the errors by correcting the critical values based on the sample size N and the allocation rate r(1), using simulation or theoretical results well known for the univariate log-rank statistic [9–11]. The practical need for a sample size formula is a balanced weighting of precision and cost, and so the formula based on the asymptotic normal approximation retains its place at each stage of practice.
4.2. Illustration: one fatal case
We illustrate the behavior of the sample size and power for detecting at least one reduction in bivariate time-to-event data, focusing on the one fatal outcome scenario with the two time-dependent association structures. Similarly to Section 3.2, we generate bivariate time-to-event data and perform Monte Carlo trials. The target power 1 − β = 0.8, the significance level α = 0.025, τa = 2, and τf = 3 are used. The notations NEP, ψEP, and SEP(τ) for EP = TTP, PFS, and OS are the same as those in Section 3. Also, Nmp is written as Nmp when the first endpoint is the TTP, or as Ñmp when the first endpoint is the PFS. Let P̃mp denote the empirical power (%) for detecting H1 with the Nmp participants designed using the formula (11). Further, we select ω = ω0 as an optimal value of ω, to allow a variable weighting strategy for the weighted Bonferroni procedure; that is, ω0 is the ω giving the minimum of Nmp over ω = 0, 0.05, …, 0.95, 1 (ω0 = argminω∈{0,0.05,…,0.95,1}Nmp). One may consider other testing procedures, such as the fixed-sequence procedure, when ω0 = 0 or ω0 = 1 is suggested.
Table VI displays the required total sample sizes Nmp (given ω = ω0), the alternative sizing solutions NPFS, NTTP, and NOS, the selected Bonferroni weight ω0, and the empirical power P̃mp, when the common HR and ρ(k) vary over the combinations of ψTTP = ψOS = 1.3, 1.5, 1.7 and ρ(1) = ρ(2) = 0, 0.3, 0.5, 0.8.
Table VI.
The case of one fatal outcome (ii) (semi-competing risk): total numbers of participants Nmp calculated from (11), the corresponding empirical powers P̃mp (%), and the alternative sizing solutions NPFS, NTTP, and NOS. The first block of columns refers to the composite setting (PFS and OS) and the second to the non-composite setting (TTP and OS).

| Structure | ψTTP = ψOS | ρ(k) | ω0 | Ñmp | P̃mp | NPFS | NOS | ω0 | Nmp | P̃mp | NTTP | NOS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Late (Clayton copula) | 1.3 | 0.0 | 1.00 | 254 | 80.7 | 254 | 1194 | 0.95 | 268 | 80.8 | 270 | 1194 |
| | 1.3 | 0.3 | 1.00 | 284 | 80.5 | 284 | 1194 | 0.95 | 288 | 81.0 | 288 | 1194 |
| | 1.3 | 0.5 | 1.00 | 322 | 80.9 | 322 | 1194 | 0.95 | 296 | 80.5 | 298 | 1194 |
| | 1.3 | 0.8 | 1.00 | 366 | 80.8 | 366 | 1194 | 1.00 | 288 | 80.7 | 288 | 1194 |
| | 1.5 | 0.0 | 1.00 | 200 | 80.9 | 200 | 532 | 0.80 | 242 | 80.2 | 266 | 532 |
| | 1.5 | 0.3 | 1.00 | 232 | 80.7 | 232 | 532 | 0.80 | 266 | 80.4 | 292 | 532 |
| | 1.5 | 0.5 | 1.00 | 256 | 80.5 | 256 | 532 | 0.80 | 282 | 80.5 | 306 | 532 |
| | 1.5 | 0.8 | 1.00 | 300 | 80.9 | 300 | 532 | 0.80 | 296 | 80.5 | 312 | 532 |
| | 1.7 | 0.0 | 1.00 | 168 | 81.1 | 168 | 328 | 0.55 | 208 | 80.4 | 264 | 328 |
| | 1.7 | 0.3 | 1.00 | 196 | 80.9 | 196 | 328 | 0.55 | 228 | 80.7 | 296 | 328 |
| | 1.7 | 0.5 | 0.95 | 218 | 80.6 | 218 | 328 | 0.55 | 242 | 80.7 | 314 | 328 |
| | 1.7 | 0.8 | 0.95 | 260 | 80.8 | 262 | 328 | 0.50 | 264 | 80.7 | 332 | 328 |
| Early (Gumbel copula) | 1.3 | 0.0 | 1.00 | 254 | 80.7 | 254 | 1194 | 0.95 | 268 | 80.9 | 270 | 1194 |
| | 1.3 | 0.3 | 1.00 | 266 | 80.6 | 266 | 1194 | 1.00 | 246 | 81.0 | 246 | 1194 |
| | 1.3 | 0.5 | 1.00 | 272 | 80.6 | 272 | 1194 | 1.00 | 222 | 80.8 | 222 | 1194 |
| | 1.3 | 0.8 | 1.00 | 258 | 80.3 | 258 | 1194 | 1.00 | 174 | 81.0 | 174 | 1194 |
| | 1.5 | 0.0 | 1.00 | 200 | 80.7 | 200 | 532 | 0.80 | 242 | 80.5 | 266 | 532 |
| | 1.5 | 0.3 | 1.00 | 216 | 81.0 | 216 | 532 | 0.90 | 250 | 80.5 | 260 | 532 |
| | 1.5 | 0.5 | 1.00 | 224 | 80.7 | 224 | 532 | 0.95 | 246 | 80.6 | 248 | 532 |
| | 1.5 | 0.8 | 1.00 | 230 | 80.9 | 230 | 532 | 1.00 | 214 | 81.0 | 214 | 532 |
| | 1.7 | 0.0 | 1.00 | 168 | 81.3 | 168 | 328 | 0.55 | 208 | 80.4 | 264 | 328 |
| | 1.7 | 0.3 | 1.00 | 186 | 81.0 | 186 | 328 | 0.60 | 228 | 80.3 | 272 | 328 |
| | 1.7 | 0.5 | 1.00 | 196 | 80.7 | 196 | 328 | 0.65 | 240 | 80.3 | 280 | 328 |
| | 1.7 | 0.8 | 1.00 | 214 | 80.7 | 214 | 328 | 1.00 | 248 | 81.0 | 248 | 328 |
The sample sizes Nmp obtained from the formula (11) consistently provide conservative results, in that the empirical powers P̃mp are slightly larger than the target powers. Although Ñmp ≤ Nmp in many cases, we observe Nmp < Ñmp in some cases where NTTP is relatively close to, or smaller than, NPFS; this occurs when ρ(j) is relatively large and the effect size of the OS is smaller than that of the TTP. Also, if the effect size of the OS is smaller than that of the TTP, then NTTP ≪ NOS or NPFS ≪ NOS occurs, so that we have Nmp ≈ NTTP or Ñmp ≈ NPFS, accompanied by a value of ω0 close to 1. When NTTP (or NPFS) is closer to NOS, ω0 takes a value between 0 and 1.
Further, we summarize in Section B.2 of the Supporting Information how the value of ω0 ranges from 0 to 1, based on underlying results obtained by varying STTP(τ) and ρ(k) with the other factors fixed, parts of which are provided in Figures B.4, B.5, B.6, and B.7. We provide a guideline based on this summary: generally, the correlation ρ(k) and the dependence structure are unknown, so they should be examined using meta-analysis and/or data from a pilot study, similarly to the consideration given to the HRs for the effect sizes. Although it may be desirable to consider a maximum sample size over such unknown factors, the degree of uncertainty must be balanced against the cost. When the TTP and OS are used as the two primary endpoints, ω0 ranges from 0 to 1, but ω0 is approximately 0.5 when NTTP/NOS is close to 1. When the PFS and OS are used as the two primary endpoints, the value of ω0 is usually close to 1 under the early dependency, and under a late dependency with ρ(k) not high, when NPFS is much smaller than NOS; the value of ω0 may be away from 0 and 1 otherwise. These considerations about the value of ω0 suggest that the use of the PFS endpoint is one of the reasonable strategies in the multiple primary problem for the TTP and OS in such settings, although in practice the PFS endpoint may be used without this consideration. Hence, it is useful to incorporate ω0 into the proposed sample size calculation for multiple primary endpoints.
4.3. Behavior of the sample size as a function of the correlation
We extend the study to the both non-fatal and both fatal cases. We use the same settings as in Section 4.2 (1 − β = 0.8, α = 0.025, τa = 2, and τf = 3), while the notations Nk, ψk, and Sk(τ) for the kth endpoint are used for a consistent expression of the results from the three situations. Similarly to Section 4.2, we select the minimum of Nmp over ω = 0, 0.05, …, 0.95, 1 as the weighting strategy for the weighted Bonferroni procedure; Ñmp is the Nmp calculated under the composite setting, and Nmp is that under the non-composite setting.
Figure 1 shows the required total sample sizes Nmp as a function of ρ(1) = ρ(2) = 0, 0.05, …, 0.95 when the common HR varies over 1.3, 1.5, and 1.7 under the early dependency (Gumbel copula). The 12 plots are arranged from the left for Sk(τ) = 0.4, 0.5, and 0.6, and from the top for the four scenarios: the both non-fatal case without the composite setting (Nmp), the one fatal case without and with the composite setting (Nmp and Ñmp), and the both fatal case with the composite setting (Ñmp). The plots for the late dependency case are provided in Figure B.3 in Section B.1 of the Supporting Information (generated under the Clayton copula using the same conditions as Figure 1, except for the copula model).
Figure 1.
Behavior of the total sample sizes Nmp as a function of the correlation ρ(j) for the common HR 1.3, 1.5, and 1.7, arranged from the left for Sk(τ) = 0.4, 0.5, 0.6 and from the top for the both non-fatal case, the one fatal case (non-composite and composite), and the both fatal case (composite), given 1 − β = 0.8, α = 0.025, τa = 2, τf = 3, and the early dependency
As ρ(j) increases, Nmp is monotone increasing up to a constant value (the univariate sample size) in the both non-fatal case, but behaves in a more complicated way in the one fatal case. In the one fatal case, there is a complicated interaction between the effect size and the correlation: a smaller sample size is required as the effect size increases and the correlation decreases, while the censoring rate for the non-fatal event increases with increasing correlation. Consider the case of ρ(j) = 1 as a reference, because the sample size Nmp required for bivariate primary endpoints reduces to that of a single endpoint under the optimal weighting Bonferroni strategy when the correlations ρ(j) are 1. In the one fatal case with the composite setting, the complicated behavior is moderated, and Ñmp tends to decrease as the correlation decreases, but it is sometimes larger than Nmp. The behavior of Ñmp in the both fatal case with the composite setting is similar to that of the one fatal case.
5. Summary
Utilizing multiple endpoints in clinical trials may provide the opportunity to characterize an intervention's multidimensional effects, but it also creates challenges in the design and analysis of clinical trials. Specifically, controlling the Type I and Type II error rates is non-trivial when the multiple primary endpoints are potentially correlated. When designing the trial to detect effects on all of the endpoints, no adjustment is needed to control the Type I error. However, the Type II error increases as the number of endpoints being evaluated increases. In contrast, when designing the trial to detect an effect on at least one of the endpoints, an adjustment is needed to control the Type I error.
We describe an approach to the evaluation of power and sample size for comparing the effects of two interventions in superiority clinical trials with two time-to-event outcomes, for both the multiple co-primary and multiple primary cases. Designing clinical trials with multiple time-to-event outcomes is more complex than with endpoints on other scales, requiring attention to censoring schemes and to time-dependent associations among the outcomes. We consider three censoring scenarios based on the types of outcomes: (i) both outcomes are non-fatal; (ii) one outcome is fatal; and (iii) both outcomes are fatal, and we evaluate their composite and non-composite settings. We discuss two time-dependent association structures: the asymmetric late time-dependency generated by the Clayton copula and the early time-dependency generated by the Gumbel copula. Our findings are summarized as follows.
In the co-primary endpoint situation, if the two time-to-event outcomes are non-fatal, then the required sample size Ncp decreases with increasing correlation ρ(j) under both the late and early time-dependencies, except for the case where one HR is larger than the other; when the correlations are zero, Ncp is the largest. However, when one or both outcomes are fatal, the behaviors of the required sample sizes Ncp and Ñcp have complicated shapes, owing to an interaction between the correlations and the effect sizes based on the HRs, and zero correlation does not provide the largest required sample size. Thus, careful consideration is required in practice.
In the multiple primary endpoint situation based on the optimal weighting Bonferroni strategy, a standard situation occurs when the correlations ρ(j) are one, corresponding to the single endpoint case. In the both non-fatal case, the sample size does not vary or increases monotonically up to a constant (the univariate sample size) as a function of the correlations, under both the late and early time-dependencies. However, when one or both outcomes are fatal, the behaviors of the required sample sizes Nmp and Ñmp are complex, with an interaction between the correlations and the effect size ratios. Thus, the proposed formula is useful for determining the sample size in practice, noting that, when the sample size is small (e.g., < 100), one may have to modify the critical value based on the normal approximation.
Unlike the both non-fatal case, larger correlations do not increase the statistical power in the one fatal or both fatal cases. The reason for this is that higher correlation leads to increased censoring. Hence, the standard log-rank statistic must be corrected by incorporating information from informative censoring. We focus on providing a foundation for designing clinical trials with two time-to-event outcomes. Informative censoring is a topic for future work.
When designing a clinical trial with the proposed methods, one needs parameter estimates obtained using the available methods [7, 21–25, 27, 28]. But specifying the joint distribution is challenging, as the data on which to base the selection are often limited during trial design. One conservative alternative is to select the largest sample size over all of the correlation and joint-distribution combinations, and to stop the clinical trial when the appropriate number of events required for each outcome is observed. Another option is to use group-sequential designs. These may lead to fewer patients than fixed-sample designs when the evidence is overwhelming, and thus offer efficiency, but they introduce other challenges: information on the endpoints may not accrue at the same rate, and may require different information times.
As discussed in [29] and [26], the Type I error rate of the log-rank test for each endpoint may be inflated in small sample sizes or with unequally sized intervention groups. When this occurs in the co-primary endpoint situation, our simulation studies suggest that the overall Type I error associated with the null hypothesis is not larger than the target significance level, except when the correlation is very high (i.e., close to one), even though the marginal Type I error rate is inflated. On the other hand with multiple primary endpoints, the overall Type I error associated with the null hypothesis is larger than the target significance level, particularly when the correlation is small. In these cases, we may consider more direct ways of calculating sample size without using a normal approximation such as the methods in [29] and [26].
Supplementary Material
Acknowledgments
We thank the two anonymous referees for their helpful suggestions and constructive comments, which improved the content and presentation. We also thank Dr. Lu Tian, Dr. H.M. James Hung, and Dr. Sue-Jane Wang for encouraging us with their valuable comments on this research. This work was supported by JSPS KAKENHI Grant Number 26330032.
APPENDIX A. Asymptotic forms in the bivariate log-rank statistic
We provide the details of the asymptotic forms of the bivariate log-rank statistic discussed in Section 2.3. See [3] for the case when both endpoints are non-fatal. We obtain the covariance process between the two incremental differences dMik(t), k = 1, 2, where Mi1(t) and Mi2(t) are martingale processes relative to the filtration generated by the history prior to time t under the null hypothesis. This corresponds to calculating the expectation of
| (A.1) |
One fatal case
Consider the non-composite setting with the censoring indicators Δi1 = 𝟙(T*i1 ≤ min(T*i2, Ci)) and Δi2 = 𝟙(T*i2 ≤ Ci). Because Ti1 = min(T*i1, T*i2, Ci) and Ti2 = min(T*i2, Ci), the expectations of the ith at-risk processes given gi = j, that is, E[𝒴ik(t) | gi = j], k = 1, 2, are S(j)(t, t)G(t) for the non-fatal endpoint and S2(j)(t)G(t) for the fatal endpoint (the latter form is identical to that of the both non-fatal case). Each element in Ĥk(t) converges to the corresponding expectation almost surely by the Glivenko–Cantelli theorem under some regularity conditions, so that we have the asymptotic forms Hk(t) of Ĥk(t), k = 1, 2, with hk(j)(t) = S(j)(t, t)G(t) for k = 1 and hk(j)(t) = S2(j)(t)G(t) for k = 2. Similarly, the conditional expectations of the ith counting processes given 𝒴ik(t) = 1 are Λ(j)(dt, t) for the non-fatal endpoint and Λ2(j)(dt) for the fatal endpoint. Hence, the expectation of (A.1) is
| (A.2) |
because, noting that Ti1 ≤ Ti2 always holds, we have
Therefore, we have the form of dA(j)(t, s) = G(t ∨ s)−1E[dMi1(t)dMi2(s) | gi = j] given in (3a) of Section 2.3.
Next, assume the composite setting with Δi1 = 𝟙(min(T*i1, T*i2) ≤ Ci) and Δi2 = 𝟙(T*i2 ≤ Ci); that is, only the definition of d𝒩i1(t) changes from the non-composite version. Because observation of the 2nd endpoint is composed into the counting process 𝒩i1(t) for the first endpoint, the conditional expectation of d𝒩i1(t) given 𝒴i1(t) = 1 is Λ(gi)(dt, t) + Λ(gi)(t, dt), so the corresponding terms in (A.2) are replaced by Λ(gi)(dt, t) + Λ(gi)(t, dt) in this composite setting. Similarly, the other corrections in (A.2) concern the terms of the conditional expectations involving d𝒩i1(t), and are derived accordingly.
We achieve (3b) by applying these results to the expectation of (A.1). In this composite setting, we can see that the intensity information on the 2nd endpoint (the death event) is added into the intensity of the first endpoint and dA(j)(t, s), compared with the non-composite version.
Both fatal case
We consider the non-composite setting, Δi1 = 𝟙(T*i1 ≤ min(T*i2, Ci)) and Δi2 = 𝟙(T*i2 ≤ min(T*i1, Ci)), and omit the composite setting for simplicity. Because Ti1 = Ti2 = min(T*i1, T*i2, Ci), the expectations of the ith at-risk processes given gi = j take the same form, E[𝒴ik(t) | gi = j] = S(j)(t, t)G(t),
for both endpoints, identical to that obtained for the non-fatal endpoint in the one fatal case. Hence, by the Glivenko–Cantelli theorem under some regularity conditions, the asymptotic forms Hk(t) of Ĥk(t), k = 1, 2, are obtained with hk(j)(t) = S(j)(t, t)G(t).
The conditional expectations of the ith counting processes given 𝒴ik(t) = 1 are Λ(j)(dt, t) for k = 1 and Λ(j)(t, dt) for k = 2,
similar to the result for the non-fatal endpoint in the one fatal case. Hence, the expectation of (A.1) can be derived as
| (A.3) |
because
Now, let us add an assumption, natural for continuous time-to-event data, that we cannot observe T*i1 and T*i2 simultaneously when both event times are fatal. Then, the factors that occur only at t = s, such as 𝟙(t = s)S(j)(dt, ds), do not contribute to the double integral defining the covariance, so that we can ignore 𝟙(t = s)S(j)(dt, ds) in (A.3) as a zero term. We can also apply the relations S(j)(s, ds) = −S(j)(s, s)Λ(j)(s, ds) and S(j)(dt, t) = −S(j)(t, t)Λ(j)(dt, t) to (A.3). Hence, (A.3) becomes zero under continuous time-to-event data. As a result, the form of dA(j)(t, s) = G(t ∨ s)−1E[dMi1(t)dMi2(s) | gi = j] is obtained as (4a) in Section 2.3, namely dA(j)(t, s) = 0.
APPENDIX B. Probability formula for the two observed endpoints
We provide the formula of $P_{ab}^{(j)}$ for the observation $(\delta_{i1}, \delta_{i2})$ of the four patterns. One may monitor a trial based on the number of events expected from the required sample size using this probability formula. Let $P_{ab}^{(j)} = \Pr(\delta_{i1} = a, \delta_{i2} = b \mid g_i = j)$, $a, b = 0, 1$, $j = 1, 2$. Using this notation, we can write the expected numbers of observed events on the two endpoints in group $j$ of size $n^{(j)}$ as $n^{(j)}(P_{11}^{(j)} + P_{10}^{(j)})$ and $n^{(j)}(P_{11}^{(j)} + P_{01}^{(j)})$, and then, for example, in the one fatal case, we can obtain

$$P_{11}^{(j)} = \int_0^\infty \!\! \int_{t_1}^\infty G(t_2)\,f^{(j)}(t_1, t_2)\,dt_2\,dt_1, \qquad P_{10}^{(j)} = \int_0^\infty \!\! \int_{t_1}^\infty \{G(t_1) - G(t_2)\}\,f^{(j)}(t_1, t_2)\,dt_2\,dt_1,$$

and $P_{01}^{(j)} = \int_0^\infty \int_0^{t_1} G(t_2)\,f^{(j)}(t_1, t_2)\,dt_2\,dt_1$, $j = 1, 2$, with $P_{00}^{(j)} = 1 - P_{11}^{(j)} - P_{10}^{(j)} - P_{01}^{(j)}$, where

$$f^{(j)}(t_1, t_2) = S^{(j)}(dt_1, dt_2)/(dt_1\,dt_2)$$

is the density function of $(T_{i1}, T_{i2})$ for the $i$th participant assigned as $g_i = j$.
is the density function of for the ith participant assigned as gi = j. The changes of in the other censoring scenarios under the non-composite setting are summarized as Table BI including the one fatal case. For an example with the composite setting, let and be Pab when the first endpoint corresponds to the TTP and the PFS, respectively, and the second endpoint is consistently of the OS. Then we have , and It would be enough for us to consider Pab under the non-composite setting as the probability formula, because it includes more information than Pab under the composite setting.
Table BI.
The probability formula on observing $(\delta_{i1}, \delta_{i2})$ in the three censoring scenarios under the non-composite setting

| $(\delta_{i1}, \delta_{i2})$ | Both non-fatal case (non-competing model) | One fatal case (semi-competing model) | Both fatal case (full-competing model) |
|---|---|---|---|
| $(1, 1)$ | $\iint G(t_1 \vee t_2)\,f^{(j)}\,dt_1\,dt_2$ | $\iint_{t_1 \le t_2} G(t_2)\,f^{(j)}\,dt_1\,dt_2$ | $0$ |
| $(1, 0)$ | $\iint_{t_1 < t_2} \{G(t_1) - G(t_2)\}\,f^{(j)}\,dt_1\,dt_2$ | $\iint_{t_1 < t_2} \{G(t_1) - G(t_2)\}\,f^{(j)}\,dt_1\,dt_2$ | $\iint_{t_1 < t_2} G(t_1)\,f^{(j)}\,dt_1\,dt_2$ |
| $(0, 1)$ | $\iint_{t_2 < t_1} \{G(t_2) - G(t_1)\}\,f^{(j)}\,dt_1\,dt_2$ | $\iint_{t_2 < t_1} G(t_2)\,f^{(j)}\,dt_1\,dt_2$ | $\iint_{t_2 < t_1} G(t_2)\,f^{(j)}\,dt_1\,dt_2$ |
| $(0, 0)$ | $-\int_0^\infty S^{(j)}(c, c)\,G(dc)$ | $-\int_0^\infty S^{(j)}(c, c)\,G(dc)$ | $-\int_0^\infty S^{(j)}(c, c)\,G(dc)$ |

Here $f^{(j)} = f^{(j)}(t_1, t_2)$, and the double integrals are over $(0, \infty)^2$ restricted as indicated.
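To illustrate how these probabilities support event-count monitoring, a minimal Monte Carlo sketch for the one fatal case (again assuming a Clayton copula with exponential margins and independent exponential censoring; all parameter values are hypothetical and not from the paper):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500_000
lam1, lam2, lam_c, theta = 0.5, 0.3, 0.2, 1.0  # hypothetical

# Clayton-copula event times with exponential margins (conditional inversion)
u = rng.uniform(size=n)
w = rng.uniform(size=n)
v = ((w ** (-theta / (1.0 + theta)) - 1.0) * u ** (-theta) + 1.0) ** (-1.0 / theta)
t1 = -np.log(u) / lam1
t2 = -np.log(v) / lam2
c = rng.exponential(1.0 / lam_c, size=n)

# One fatal case: the second (fatal) event censors the first
d1 = t1 <= np.minimum(t2, c)  # delta_i1
d2 = t2 <= c                  # delta_i2
P = {(a, b): np.mean((d1 == a) & (d2 == b)) for a in (0, 1) for b in (0, 1)}
print(P)  # Monte Carlo estimates of P_ab

# Expected numbers of observed events in an arm of size n_j:
#   endpoint 1: n_j * (P[1, 1] + P[1, 0])
#   endpoint 2: n_j * (P[1, 1] + P[0, 1])
```

Replacing the censoring indicators `d1` and `d2` with those of the other two censoring scenarios reproduces the corresponding columns of Table BI.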
Supporting information
Additional supporting information may be found in the online version of this article at the publisher’s web site.
References
1. Fine JP, Jiang H, Chappell R. On semi-competing risks data. Biometrika. 2001;88:907–919. doi: 10.1093/biomet/88.4.907.
2. Hamasaki T, Sugimoto T, Evans SR, Sozu T. Sample size determination for clinical trials with co-primary outcomes: exponential event times. Pharmaceutical Statistics. 2013;12:28–34. doi: 10.1002/pst.1545.
3. Sugimoto T, Sozu T, Hamasaki T, Evans SR. A logrank test-based method for sizing clinical trials with two co-primary time-to-event endpoints. Biostatistics. 2013;14:409–421. doi: 10.1093/biostatistics/kxs057.
4. Offen W, Chuang-Stein C, Dmitrienko A, Littman G, Maca J, Meyerson L, Muirhead R, Stryszak P, Boddy A, Chen K, Copley-Merriman K, Dere W, Givens S, Hall D, Henry D, Jackson JD, Krishen A, Liu T, Ryder S, Sankoh AJ, Wang J, Yeh CH. Multiple co-primary endpoints: medical and statistical solutions. Drug Information Journal. 2007;41:31–46. doi: 10.1177/009286150704100105.
5. Hung HMJ, Wang SJ. Some controversial multiple testing problems in regulatory applications. Journal of Biopharmaceutical Statistics. 2009;19:1–11. doi: 10.1080/10543400802541693.
6. Dmitrienko A, Tamhane AC, Bretz F. Multiple Testing Problems in Pharmaceutical Statistics. Chapman and Hall; Boca Raton, FL: 2010.
7. Wang W. Estimating the association parameter for copula models under dependent censoring. Journal of the Royal Statistical Society, Series B. 2003;65:257–273. doi: 10.1111/1467-9868.00385.
8. Hsu L, Prentice RL. On assessing the strength of dependency between failure time variates. Biometrika. 1996;83:491–506. doi: 10.1093/biomet/83.3.491.
9. Kellerer AM, Chmelevsky D. Small-sample properties of censored-data rank tests. Biometrics. 1983;39:675–682. doi: 10.2307/2531095.
10. Hsieh FY. Comparing sample size formulae for trials with unbalanced allocation using the logrank test. Statistics in Medicine. 1992;11:1091–1098. doi: 10.1002/sim.4780110810.
11. Strawderman RL. An asymptotic analysis of the logrank test. Lifetime Data Analysis. 1997;3:225–249. doi: 10.1023/A:1009648914586.
12. Westfall PH, Tobias RD, Rom D, Wolfinger RD, Hochberg Y. Multiple Comparisons and Multiple Tests Using the SAS System. SAS Institute; Cary, NC: 2011.
13. Senn S, Bretz F. Power and sample size when multiple endpoints are considered. Pharmaceutical Statistics. 2007;6:161–170. doi: 10.1002/pst.301.
14. Sugimoto T, Sozu T, Hamasaki T. A convenient formula for calculating sample size of clinical trials with multiple co-primary continuous endpoints. Pharmaceutical Statistics. 2012;11:118–128. doi: 10.1002/pst.505.
15. Freedman LS. Table of the number of patients required in clinical trials using the logrank test. Statistics in Medicine. 1982;1:121–129. doi: 10.1002/sim.4780010204.
16. Clayton DG. A model for association in bivariate life tables and its application in epidemiological studies of familial tendency in chronic disease. Biometrika. 1978;65:141–151. doi: 10.1093/biomet/65.1.141.
17. Hougaard P. A class of multivariate failure time distributions. Biometrika. 1986;73:671–678. doi: 10.1093/biomet/71.1.75.
18. Perren TJ, Swart AM, Pfisterer J, Ledermann JA, Pujade-Lauraine E, Kristensen G, Carey MS, Beale P, Cervantes A, Kurzeder C, du Bois A, Sehouli J, Kimmig R, Stähle A, Collinson F, Essapen S, Gourley C, Lortholary A, Selle F, Mirza MR, Leminen A, Plante M, Stark D, Qian W, Parmar MK, Oza AM. A phase 3 trial of bevacizumab in ovarian cancer. New England Journal of Medicine. 2011;365:2484–2496. doi: 10.1056/NEJMoa1103799.
19. Wiens B, Dmitrienko A. On selecting a multiple comparison procedure for analysis of a clinical trial: fallback, fixed-sequence and related procedures. Statistics in Biopharmaceutical Research. 2010;2:22–32. doi: 10.1198/sbr.2010.08035.
20. Bretz F, Hothorn T, Westfall P. Multiple Comparisons Using R. Chapman and Hall; Boca Raton, FL: 2011.
21. Lagakos SW. A stochastic model for censored-survival data in the presence of an auxiliary variable. Biometrics. 1976;32:551–559.
22. Lagakos SW. Using auxiliary variables for improved estimates of survival time. Biometrics. 1977;33:399–404.
23. Lin DY, Robins JM, Wei LJ. Comparing two failure time distributions in the presence of dependent censoring. Biometrika. 1996;83:381–393. doi: 10.1093/biomet/83.2.381.
24. Shih JH. A goodness-of-fit test for association in a bivariate survival model. Biometrika. 1998;85:189–200. doi: 10.1093/biomet/85.1.189.
25. Chang SH. A two-sample comparison for multiple ordered event data. Biometrics. 2000;56:183–189. doi: 10.1111/j.0006-341x.2000.00183.x.
26. Wang R, Lagakos SW, Gray RJ. Testing and interval estimation for two-sample survival comparisons with small sample sizes and unequal censoring. Biostatistics. 2010;11:676–692. doi: 10.1093/biostatistics/kxq021.
27. Siannis F, Farewell VT, Head J. A multi-state model for joint modelling of terminal and non-terminal events with application to Whitehall II. Statistics in Medicine. 2007;26:426–442. doi: 10.1002/sim.2342.
28. Parast L, Tian L, Cai T. Landmark estimation of survival and treatment effect in a randomized clinical trial. Journal of the American Statistical Association. 2014;109:384–394. doi: 10.1080/01621459.2013.842488.
29. Heinze G, Gnant M, Schemper M. Exact log-rank test for unequal follow-up. Biometrics. 2003;59:1151–1157. doi: 10.1111/j.0006-341X.2003.00132.x.