Summary
General expressions are described for the evaluation of sample size and power for the K group Mantel-logrank test or the Cox PH model score test. Under an exponential model, the method of Lachin and Foulkes [1] for the 2 group case is extended to the K ≥ 2 group case using the non-centrality parameter of the K – 1 df chi-square test. Similar results are also shown to apply to the K group score test in a Cox PH model. Lachin and Foulkes [1] employed a truncated exponential distribution to provide for a non-linear rate of enrollment. Expressions for the mean time of enrollment and the expected follow-up time in the presence of exponential losses-to-follow-up are presented. When used with the expression for the non-centrality parameter for the test, equations are derived for the evaluation of sample size and power under specific designs with R years of recruitment and T years total duration.
Sample size and power are also described for a stratified-adjusted K group test and for the assessment of a group by stratum interaction. Similar computations are described for a stratified-adjusted analysis of a quantitative covariate and a test of a stratum by covariate interaction in the Cox PH model.
Keywords: Sample size, power, logrank test, Cox Proportional Hazards Model, multiple groups, exponential survival, stratified analysis, interactions
1 Introduction
In time-to-event studies with multiple (K ≥ 2) independent groups, we wish to test the general null hypothesis H0: λ1(t) = λ2(t) = … = λK(t) against the alternative H1: λj(t) ≠ λk(t) for some 1 ≤ j < k ≤ K, where λj(t) is the time-varying hazard rate within the jth group, 1 ≤ j ≤ K, t > 0. The Mantel-logrank test is commonly used to test H0, which also specifies equality of the survival functions of the K groups over time. This test can also be obtained as the score test for a binary covariate in the Cox proportional hazards (PH) model and is fully efficient under the proportional hazards assumption. The simplest instance is the exponential model in which the hazards are constant over time, or λj(t) = λj for all t > 0. Thus, the exponential model is commonly employed to evaluate the sample size or power for this test.
George and Desu [2] showed that the power of a test of equality of hazards under an exponential survival model is a function of the number of subjects with the outcome event, a result also derived by Schoenfeld [3] for the Cox PH model. Following the work of Lachin [4], Rubenstein, Gail and Santner [5], and Schoenfeld and Richter [6], among others, Lachin and Foulkes [1] described the assessment of sample size and power for the test of two groups under an exponential model with possibly non-linear recruitment over R years and follow-up over T years, exponential losses to follow-up, and a stratified design where these factors may differ over strata. They did so using a test described in terms of the difference in hazards that Freedman [7] had shown to derive from the limiting distribution of the logrank test, and also using a test of the log hazard ratio that Schoenfeld [8] had shown to derive under a proportional hazards model.
Makuch and Simon [9] provide a generalization of the George-Desu result to determine the number of subjects with the event required to provide the desired level of power for the comparison of K ≥ 2 groups using a one-way ANOVA-like chi-square test. They also describe how their results could be applied to the assessment of sample size in cases where the total exposure time is also specified. Liu and Dahlberg [10] describe the power of the Wald test in the PH model with results close to those of Makuch and Simon. Ahnn and Anderson [11] describe the power of the Tarone-Ware [12] family of tests that are a generalization of the Mantel or logrank test and present an expression explicitly for the case of equal sample sizes and equal censoring distributions. Halabi and Singh [13] generalize the Ahnn-Anderson expression for unequal sample fractions and describe the stratified adjusted test power.
Herein these results are further generalized to allow for non-uniform recruitment and losses to follow-up that yield variable exposure times, and also stratification, as in Lachin and Foulkes [1]. First, the power of Cochran’s ANOVA-like χ² test of homogeneity [10] of the hazard rates among the K groups is described as a function of the non-centrality parameter of the distribution of the test, which then permits assessment of sample size and/or study duration. Then, the non-centrality parameter for the K-group score test in a Cox proportional hazards model is shown to be approximately equal to that of Cochran’s test, so that the results apply more generally than to a simple exponential model. An equivalent T²-like contrast test is then used to evaluate power for a stratified-adjusted analysis. Methods are also presented for the assessment of the power of a stratified-adjusted K-group test, and for the assessment of a group by stratum interaction. Similar computations are described for a stratified-adjusted analysis of a quantitative covariate, and the test of a stratum by covariate interaction in the Cox PH model.
2 Lachin-Foulkes Exponential Model
For the jth group, let $\hat\theta_j = \ln(\hat\lambda_j)$ denote the log of the maximum likelihood estimate of the exponential hazard rate, which is asymptotically distributed as

$$\hat\theta_j \sim N\!\left[\theta_j,\; E(D_j)^{-1}\right] \tag{1}$$

where θj = ln(λj) and E(Dj) is the expected number of subjects to experience the outcome event (termed events), for j = 1, … , K ≥ 2. Within the jth group, for a given pattern of enrollment and losses-to-follow-up, and hazard rate for the event, the probability πj that the event is observed is determined. Then, for a given sample size nj within the jth group, the expected number of subjects with the event is obtained as E(Dj) = njπj.
To allow for non-uniform entry over a recruitment period of R years, Lachin and Foulkes [1] employed a truncated exponential distribution for the enrollment time r with shape parameter γ and density g(r) and cumulative distribution G(r)
$$g(r) = \frac{\gamma\, e^{-\gamma r}}{1 - e^{-\gamma R}}, \qquad G(r) = \frac{1 - e^{-\gamma r}}{1 - e^{-\gamma R}} \tag{2}$$
for 0 ≤ r ≤ R and γ ≠ 0, that yields a concave pattern of cumulative enrollment for γ > 0 and a convex pattern (a recruitment lag) for γ < 0. The mean enrollment time is readily shown to be
$$E(r) = \frac{1}{\gamma} - \frac{R\, e^{-\gamma R}}{1 - e^{-\gamma R}} \tag{3}$$
In the case of exponential losses with loss hazard rate η, density h(u) = ηe−ηu and cumulative distribution function H(u) = 1 – e−ηu, then the expected potential exposure time is
$$E(\text{exposure}) = \int_0^R \frac{1 - e^{-\eta (T - r)}}{\eta}\; g(r)\, dr \tag{4}$$
that is easily evaluated numerically.
Then, for a total study duration of T ≥ R years with exponential event hazard rate λ and loss-to-follow-up hazard rate η, Lachin and Foulkes [1] show that the probability π that the event is observed is

$$\pi = \frac{\lambda}{\lambda+\eta}\left[1 + \frac{\gamma\, e^{-(\lambda+\eta)T}\left(1 - e^{(\lambda+\eta-\gamma)R}\right)}{(\lambda+\eta-\gamma)\left(1 - e^{-\gamma R}\right)}\right] \tag{5}$$
For uniform recruitment with density g(r) = 1/R, this simplifies slightly to

$$\pi = \frac{\lambda}{\lambda+\eta}\left[1 - \frac{e^{-(\lambda+\eta)(T-R)} - e^{-(\lambda+\eta)T}}{(\lambda+\eta)\,R}\right] \tag{6}$$
Lachin and Foulkes [1] also show that the probability of loss-to-follow-up (non-administrative right censoring) is simply πη/λ.
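These expressions are easily programmed. As an illustration, the following is a minimal sketch in Python (scipy assumed; the helper names are ours, not from any published software) that reproduces the design quantities used in the GRADE example of Section 9:

```python
import numpy as np
from scipy import integrate

def mean_enroll_time(gamma, R):
    """Mean enrollment time E(r) under the truncated exponential entry, eq. (3)."""
    return 1.0 / gamma - R * np.exp(-gamma * R) / (1.0 - np.exp(-gamma * R))

def mean_exposure(eta, gamma, R, T):
    """Expected potential exposure E[min(T - r, loss time)] of (4), integrated
    numerically over the entry density g(r) of (2)."""
    g = lambda r: gamma * np.exp(-gamma * r) / (1.0 - np.exp(-gamma * R))
    f = lambda r: (1.0 - np.exp(-eta * (T - r))) / eta * g(r)
    return integrate.quad(f, 0.0, R)[0]

def event_prob(lam, eta, gamma, R, T):
    """Probability pi of observing the event, eq. (5)."""
    nu = lam + eta
    num = gamma * np.exp(-nu * T) * (1.0 - np.exp((nu - gamma) * R))
    den = (nu - gamma) * (1.0 - np.exp(-gamma * R))
    return (lam / nu) * (1.0 + num / den)

# GRADE-like design of Section 9: R = 3, T = 7, gamma = -0.27,
# reference hazard 0.0875/year and losses at 4%/year.
lam, eta, gamma, R, T = 0.0875, 0.04, -0.27, 3.0, 7.0
print(mean_enroll_time(gamma, R))       # ~1.70 years
print(mean_exposure(eta, gamma, R, T))  # ~4.8 years of potential exposure
pi = event_prob(lam, eta, gamma, R, T)
print(pi, pi * eta / lam)               # event prob ~0.335, loss prob ~0.153
```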
3 An ANOVA-like Test
Under the assumption that there is no difference among groups, or θ1 = … = θK = θ, it is well known (cf. [14]) that a consistent estimate of the common log hazard rate θ is provided by the minimum variance linear estimator (MVLE)
$$\hat\theta = \frac{\sum_{j=1}^{K} D_j\, \hat\theta_j}{\sum_{j=1}^{K} D_j} \tag{7}$$
that is obtained from the application of weighted least squares. Then, the hypothesis of no differences among groups, or H0: θ1 = … = θK can be tested using Cochran’s [15] test of homogeneity,
$$X^2 = \sum_{j=1}^{K} D_j\left(\hat\theta_j - \hat\theta\right)^2 \tag{8}$$

that is asymptotically distributed as χ² on K − 1 df under H0.
The power of this test is determined by the non-centrality parameter ψ² of the distribution of X², where E[X²] = (K − 1) + ψ² under the alternative, for a given total sample size N under an appropriate model for the other parameters. Makuch and Simon [9] describe the computation of power for given values of {E(Dj), θj}, j = 1, … , K. More generally, the total sample size required, or the total amount of information required, can be obtained from the expressions of Lachin and Foulkes [1] above.
Let ζj denote the sample fraction of subjects assigned to the jth group, j = 1, … , K, where ζ1 + ⋯ + ζK = 1; let θ1, … , θK denote the specified set of log hazard rates that are of interest to detect under the alternative H1: θj ≠ θk for some 1 ≤ j < k ≤ K; and let η1, … , ηK denote the assumed hazard rates of loss-to-follow-up that may vary among groups. Under an exponential model for each group with either a uniform or non-linear rate of entry yielding event probability πj, the resulting non-centrality parameter is

$$\psi^2 = N\phi^2 = N\sum_{j=1}^{K} \zeta_j\,\pi_j\left(\theta_j - \bar\theta\right)^2 \tag{9}$$
where ϕ² is the “non-centrality factor” or the component remaining after factoring N, and where
$$\bar\theta = \frac{\sum_{j=1}^{K} \zeta_j\,\pi_j\,\theta_j}{\sum_{j=1}^{K} \zeta_j\,\pi_j} \tag{10}$$
is the weighted average of the log hazard rate values within the groups under the specified alternative. Thus, the non-centrality factor and power depend on the weighted sum of squares of the deviations of the log hazards within the K groups from the weighted mean log hazard.
Values of the non-centrality parameter ψ²(α, β, m) providing various levels of power for the non-central χ² distribution on m df are readily obtained from programs such as the SAS functions PROBCHI for cumulative probabilities and CINV for quantiles of the χ² distribution, both of which provide computations under the non-central distribution. The SAS function CNONCT then computes the value of the non-centrality parameter ψ² that provides power 1 − β for specified levels of α and m. For a test at level α with central critical value $\chi^2_{1-\alpha,\,m}$, the required non-centrality parameter is the value ψ²(α, β, m) satisfying $P\left[\chi^2_m(\psi^2) \ge \chi^2_{1-\alpha,\,m}\right] = 1 - \beta$, where $\chi^2_m(\psi^2)$ denotes the non-central χ² variate on m df.
To determine sample size for a study, the value ψ2(α, β, m) of the non-centrality parameter is obtained that will provide power 1 – β under the non-central χ2 distribution for a test at level α on m df. Then the value of the non-centrality factor ϕ2 under the alternative hypothesis in (9) is specified as a function of the parameter sets {ζj}, {πj} and {θj}. Given the value of ϕ2, the N required to provide power 1 – β is that value for which ψ2(α, β, m) = Nϕ2, yielding
$$N = \frac{\psi^2(\alpha, \beta, m)}{\phi^2} \tag{11}$$
Alternately, for a given value of the parameter ψ2 in (9), the level of power can be computed as
$$1 - \beta = P\left[\chi^2_m\!\left(N\phi^2\right) \ge \chi^2_{1-\alpha,\,m}\right] \tag{12}$$
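As a sketch of these computations outside of SAS (assuming Python with scipy; the function names are ours), the required non-centrality parameter ψ²(α, β, m), the sample size from (11), and the power from (12) may be obtained as follows:

```python
from scipy.stats import chi2, ncx2
from scipy.optimize import brentq

def required_psi2(alpha, beta, m):
    """Non-centrality psi^2 giving power 1 - beta for a level-alpha chi-square
    test on m df (the analogue of the SAS CNONCT computation)."""
    crit = chi2.ppf(1.0 - alpha, m)        # central critical value
    return brentq(lambda nc: ncx2.sf(crit, m, nc) - (1.0 - beta), 1e-6, 1e3)

def power(N, phi2, alpha, m):
    """Power from (12) for total sample size N and non-centrality factor phi^2."""
    return ncx2.sf(chi2.ppf(1.0 - alpha, m), m, N * phi2)

psi2 = required_psi2(0.05, 0.10, 3)  # 14.1715, as quoted in Section 9
phi2 = 0.004338                      # non-centrality factor of the GRADE example
print(psi2, psi2 / phi2)             # N = psi2/phi2 ~ 3267 from (11)
print(power(3268, phi2, 0.05, 3))    # ~0.90 from (12)
```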
4 Cox PH Model Score Test
The above expressions are based on the large sample test of the difference among the hazard rates under an exponential model. An appropriate expression can also be obtained from the non-central distribution of the score test for the treatment group coefficients in a Cox PH model. With the Kth group as the reference, the model would employ K − 1 binary covariates (X1, … , XK−1) to represent membership in the jth group, with coefficient vector β = (β1 … βK−1)T, where the jth coefficient βj equals the log hazard ratio for the jth group versus the Kth reference group.
Using standard notation, let δi be the binary indicator variable to denote whether the ith subject had the outcome event (δi = 1) or is right censored (δi = 0), let R(ti) denote the set of subjects still at risk at the event time ti, and let n(ti) be the number of subjects in the risk set at that time. For the ith subject, xi = (xi1 … xi(K−1))T, where xij = 1 denotes that the ith subject is a member of the jth group, xij = 0 otherwise. Then the partial likelihood, assuming no tied event times, is
$$L(\beta) = \prod_{i=1}^{N}\left[\frac{e^{x_i^T\beta}}{\sum_{\ell \in R(t_i)} e^{x_\ell^T\beta}}\right]^{\delta_i} \tag{13}$$
Under H0: β1 = … = βK−1 = 0, the score equation for the jth coefficient reduces to
$$U_0(\beta_j) = \sum_{i=1}^{N} \delta_i\left[x_{ij} - p_j(t_i)\right] \tag{14}$$
where
$$p_j(t_i) = \frac{n_j(t_i)}{n(t_i)} \tag{15}$$
is the proportion of subjects still at risk at event time ti that are members of the jth group. Thus,
$$O_j = \sum_{i=1}^{N} \delta_i\, x_{ij} \tag{16}$$
is the observed number of subjects with the outcome event among subjects in the jth group, and
$$E_j = \sum_{i=1}^{N} \delta_i\, p_j(t_i) \tag{17}$$
is the estimate of the expected number of events under the null hypothesis. Thus, U0(βj) = Oj − Ej, and the score vector is U0(β) = [U0(β1) ⋯ U0(βK−1)]T.
Likewise, the expressions for the elements of the information matrix evaluated under H0 reduce to
$$I_0(\beta)_{jk} = \sum_{i=1}^{N} \delta_i\, p_j(t_i)\left[1(j = k) - p_k(t_i)\right], \qquad j, k = 1, \ldots, K-1 \tag{18}$$

where 1(j = k) = 1 if j = k and 0 otherwise.
Under H0 and the assumption that there is a common censoring distribution among the K groups, E[pj(ti)] = ζj for all event times, and I0(β) then has elements

$$I_0(\beta)_{jk} = D\,\zeta_j\left[1(j = k) - \zeta_k\right], \qquad j, k = 1, \ldots, K-1 \tag{19}$$
where D is the total number of outcome events. Then the score test of H0 is provided by
$$X^2 = U_0(\beta)^T\, I_0(\beta)^{-1}\, U_0(\beta) \tag{20}$$
Using the same steps as employed by Schoenfeld [3], under the alternative hypothesis H1: β ≠ 0, it can then be shown that
$$E\left[U_0(\beta)\right] \cong I(\beta)\,\beta \tag{21}$$
Thus, X² is distributed as non-central chi-square with non-centrality parameter

$$\psi^2 = \beta^T I(\beta)\, I_0(\beta)^{-1}\, I(\beta)\, \beta \tag{22}$$
Since I(β) is approximately equal to I0(β) under local alternatives, then
$$\psi^2 \cong \beta^T I_0(\beta)\,\beta = E(D)\left[\sum_{j=1}^{K-1} \zeta_j\,\beta_j^2 - \Big(\sum_{j=1}^{K-1} \zeta_j\,\beta_j\Big)^{2}\right] \tag{23}$$
Now referring to the prior section, under H0, and the assumption that there is a common censoring distribution among the K groups, then E(Dj) ≅ E(D)ζj and the common parameter value is specified as
$$\bar\theta = \sum_{j=1}^{K} \zeta_j\,\theta_j \tag{24}$$
The corresponding non-centrality parameter using this simplification is
$$\psi^2 \cong E(D)\sum_{j=1}^{K} \zeta_j\left(\theta_j - \bar\theta\right)^2 \tag{25}$$
Noting that βj = θj – θK, then it can be shown that this expression equals that provided in (23) above. Thus, under the assumption of a common censoring distribution among groups, a sample size or power computation using the exponential model also applies to the Cox PH model.
Under the assumption of equal censoring among groups and equal sample sizes, i.e. ζj = 1/K for all j, the above expression is also equivalent to that described by Ahnn and Anderson [11].
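As a brief numerical check of the equivalence of (25) with the quadratic form in (23), the following sketch (numpy assumed) uses the alternative that anticipates the GRADE example of Section 9:

```python
import numpy as np

# K = 4 equal groups; one group superior with hazard ratio 0.75 (Section 9).
zeta = np.full(4, 0.25)
theta = np.log(np.array([0.75 * 0.0875, 0.0875, 0.0875, 0.0875]))
beta = theta[:3] - theta[3]              # log hazard ratios vs the 4th group

# Eq. (23) per event: beta' I0 beta with I0 from (19) and E(D) = 1.
z = zeta[:3]
I0 = np.diag(z) - np.outer(z, z)         # elements zeta_j [1(j = k) - zeta_k]
psi2_23 = beta @ I0 @ beta

# Eq. (25) per event: sum_j zeta_j (theta_j - theta_bar)^2 with theta_bar of (24).
theta_bar = zeta @ theta
psi2_25 = np.sum(zeta * (theta - theta_bar) ** 2)

print(psi2_23, psi2_25)   # both ~0.015518
print(14.1715 / psi2_25)  # ~913 events give 90% power on 3 df (cf. Section 10)
```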
5 A Contrast-Based Test
An equivalent form of the test of homogeneity in (8) is obtained as a T²-like quadratic form in contrasts among the K log hazards. Let θ = (θ1 … θK)T designate the vector of log hazard rates within the K groups with sample fractions {ζj} as above. For specified hazard rates {λj} and loss hazard rates {ηj} for the K groups, and recruitment pattern parameter γ, we can compute the event probability πj such that E(Dj) = Nζjπj is the expected number of subjects with the event in the jth group.
Then, the vector of estimated log hazards $\hat\theta = (\hat\theta_1 \cdots \hat\theta_K)^T$ is asymptotically distributed as multivariate normal with expectation θ and covariance matrix

$$\Sigma = \operatorname{diag}\left[E(D_1)^{-1} \cdots E(D_K)^{-1}\right] \tag{26}$$
that is consistently estimated as $\hat\Sigma = \operatorname{diag}\left(D_1^{-1} \cdots D_K^{-1}\right)$. A T²-like contrast test of homogeneity H0: θ1 = … = θK can be constructed as

$$X^2 = \left(C'\hat\theta\right)^T \left[C'\,\hat\Sigma\, C\right]^{-1} \left(C'\hat\theta\right) \tag{27}$$
where
$$C' = \begin{bmatrix} 1 & 0 & \cdots & 0 & -1 \\ 0 & 1 & \cdots & 0 & -1 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \cdots & 1 & -1 \end{bmatrix} \tag{28}$$
is of dimension (K − 1) × K. It is well known that a T²-like test of homogeneity with a contrast matrix of this form is equivalent to one using the difference between the jth group estimate and the overall average as in (8) (Anderson [16], p. 170; cf. Lachin [14], pp. 151–152). For the jth row of C′, the vector product yields the log hazard ratio for the jth group relative to the Kth group, $\hat\beta_j = \hat\theta_j - \hat\theta_K$, for j = 1, … , K − 1. Thus, the vector of log hazard ratios for the first K − 1 groups versus the Kth is $\hat\beta = C'\hat\theta$. Since $V(\hat\beta) = C'\Sigma C$, the estimated covariance matrix $\hat\Omega = C'\hat\Sigma C$ has elements

$$\hat\Omega_{jk} = \begin{cases} D_j^{-1} + D_K^{-1}, & j = k \\ D_K^{-1}, & j \neq k \end{cases} \tag{29}$$
where Ω consists of the like matrix defined in terms of the expected numbers of subjects with the event. Thus the non-centrality parameter of the test is provided by
$$\psi^2 = \beta^T\,\Omega^{-1}\,\beta \tag{30}$$
that equals the expression in (9).
Then, evaluating the above covariance matrix under H0 and the assumption of equal censoring, such that E(Dj) ≅ Dζj, leads to a non-centrality parameter that equals the expressions obtained under the PH model in (23) and (25).
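The equality of (30) and (9) is easily verified numerically. A sketch (numpy assumed; the ζj, πj and θj are those of the Section 9 example):

```python
import numpy as np

# Inputs of the Section 9 example: equal allocation, event probabilities pi_j.
zeta = np.full(4, 0.25)
pi = np.array([0.265, 0.335, 0.335, 0.335])
theta = np.log(np.array([0.065625, 0.0875, 0.0875, 0.0875]))

ED = zeta * pi                                  # E(D_j)/N, events per subject enrolled
Sigma = np.diag(1.0 / ED)                       # N x the covariance matrix (26)
C = np.hstack([np.eye(3), -np.ones((3, 1))])    # contrast matrix C' of (28)
beta = C @ theta                                # log hazard ratios vs group 4
Omega = C @ Sigma @ C.T                         # N x the matrix Omega of (29)
phi2_contrast = beta @ np.linalg.solve(Omega, beta)  # (30) scaled per subject

theta_bar = ED @ theta / ED.sum()                    # (10)
phi2_direct = np.sum(ED * (theta - theta_bar) ** 2)  # (9)
print(phi2_contrast, phi2_direct)  # both ~0.00434 (0.004338 in Section 9)
```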
6 A Stratified-Adjusted Analysis
The above can also be generalized to a stratified-adjusted K group logrank test on K − 1 df over S independent strata. While such a test can be conducted using the Cox PH model, an alternate approach is to consider a test based on the multivariate normal distribution of the hazard rate estimates, and its associated power function. This is easily described using the contrast test formulation.
Within the lth stratum, let θl = (θl1 … θlK)T designate the vector of log hazard rates within the K groups, with sample estimates $\hat\theta_l = (\hat\theta_{l1} \cdots \hat\theta_{lK})^T$, for l = 1, … , S. Denote the stratum sample fraction as ωl = E(Nl/N), Nl being the lth stratum sample size, where ω1 + ⋯ + ωS = 1; and denote the stratum-group sample fraction as ζlj = E(nlj/Nl), nlj being the group sample size within the lth stratum, where ζl1 + ⋯ + ζlK = 1 for each l. For specified hazard rates λlj and loss hazard rates ηlj for the ljth cell, and recruitment pattern parameter γl for the lth stratum, we can compute the event probability πlj such that E(Dlj) = Nωlζljπlj is the expected number of subjects with the event in the ljth cell. Within the lth stratum, asymptotically $\hat\theta_l \sim N(\theta_l, \Sigma_l)$, where Σl = diag[E(Dl1)−1 … E(DlK)−1], which is consistently estimated as $\hat\Sigma_l = \operatorname{diag}(D_{l1}^{-1} \cdots D_{lK}^{-1})$.
As above, the vector of log hazard ratios in the lth stratum is $\hat\beta_l = C'\hat\theta_l$, where $\hat\beta_{lj} = \hat\theta_{lj} - \hat\theta_{lK}$ for j = 1, … , K − 1, with estimated covariance matrix $\hat\Omega_l$ of the same form as (29) as a function of the numbers of events {Dlj}, and where Ωl is the like matrix defined in terms of the expected numbers of subjects with the event {E(Dlj)}. Then, the joint minimum variance or weighted least squares estimate of the vector of adjusted log hazard ratios over strata, and the corresponding covariance matrix, are obtained as

$$\hat\beta = \left(\sum_{l=1}^{S} \hat\Omega_l^{-1}\right)^{-1} \sum_{l=1}^{S} \hat\Omega_l^{-1}\,\hat\beta_l, \qquad \hat\Sigma_{\hat\beta} = \left(\sum_{l=1}^{S} \hat\Omega_l^{-1}\right)^{-1} \tag{31}$$
Such equations are described by Lachin [17], among others, in the setting of a stratified multivariate analysis. Then the stratified-adjusted T²-like test of the hypothesis of no group differences in the stratified-adjusted hazard ratios, i.e. H0: β1 = … = βK−1 = 0, is constructed as in (27) and equals

$$X^2 = \hat\beta^T\,\hat\Sigma_{\hat\beta}^{-1}\,\hat\beta \tag{32}$$

on K − 1 df.
Sample size and power are then evaluated using the non-centrality parameter for this test
$$\psi^2 = \left(\sum_{l=1}^{S} \Omega_l^{-1}\beta_l\right)^T \left(\sum_{l=1}^{S} \Omega_l^{-1}\right)^{-1} \left(\sum_{l=1}^{S} \Omega_l^{-1}\beta_l\right) \tag{33}$$
where βl is the vector of assumed log hazard ratios within the lth stratum and Ωl is of the same form as (29) using E(Dlj) = Nωlζljπlj.
For computation of sample size, denote the cell event probability as νlj = ωlζljπlj such that E(Dlj) = Nνlj. Then Ωl = Υl/N, where Υl is patterned as in (29) with jth diagonal element $\nu_{lj}^{-1} + \nu_{lK}^{-1}$ and off-diagonal elements $\nu_{lK}^{-1}$. It then follows that the non-centrality factor is

$$\phi^2 = \psi^2/N = \left(\sum_{l=1}^{S} \Upsilon_l^{-1}\beta_l\right)^T \left(\sum_{l=1}^{S} \Upsilon_l^{-1}\right)^{-1} \left(\sum_{l=1}^{S} \Upsilon_l^{-1}\beta_l\right) \tag{34}$$
For a given set of parameters and total sample size N, power is readily evaluated from (33), whereas the required sample size N is obtained using (34).
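A sketch of the computations in (31)-(33) (numpy assumed; the two-stratum inputs are those of the illustration in Section 9):

```python
import numpy as np

def Omega_from_events(D):
    """Covariance (29) of the K - 1 log hazard ratios given events per group D."""
    return np.diag(1.0 / D[:-1]) + 1.0 / D[-1]

# Two strata as in Section 9: expected events and log hazard ratios per stratum.
D1, b1 = np.array([122., 140., 140., 140.]), np.array([np.log(0.85), 0.0, 0.0])
D2, b2 = np.array([199., 251., 251., 251.]), np.array([np.log(0.75), 0.0, 0.0])

W1 = np.linalg.inv(Omega_from_events(D1))  # stratum weight matrices Omega_l^-1
W2 = np.linalg.inv(Omega_from_events(D2))
Sigma = np.linalg.inv(W1 + W2)             # adjusted covariance matrix, (31)
beta = Sigma @ (W1 @ b1 + W2 @ b2)         # stratified-adjusted log hazard ratios
psi2 = beta @ (W1 + W2) @ beta             # non-centrality parameter (33)
print(beta[0], np.diag(Sigma))  # ~-0.2407 and ~(0.0057, 0.0051, 0.0051)
print(psi2)                     # ~14.6 (14.58 in Section 9 with unrounded events)
```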
7 Test of Group by Stratum Interaction
Alternately, for K groups and S strata it may be desired to conduct a test of homogeneity of the treatment group differences among strata, or a test of a group by stratum interaction, on (K – 1)(S – 1) df. While such a test is conveniently conducted using the Cox PH model, a large sample test can readily be obtained from the above construction based on the multivariate distribution of the hazard rate estimates.
Within the lth stratum, we have a vector of log hazard ratios $\hat\beta_l$ for each group versus the reference (Kth), with covariance matrix provided by $\hat\Omega_l$ as in (29). These also yield the average estimate $\hat\beta$ of the log hazard ratios over the S strata with estimated covariance matrix $\hat\Sigma_{\hat\beta}$ as shown in (31). Then the test of homogeneity, or no group by stratum interaction, is provided by
$$X^2 = \sum_{l=1}^{S} \left(\hat\beta_l - \hat\beta\right)^T \hat\Omega_l^{-1} \left(\hat\beta_l - \hat\beta\right) \tag{35}$$
with non-centrality factor
$$\phi^2 = \sum_{l=1}^{S} \left(\beta_l - \beta\right)^T \Upsilon_l^{-1} \left(\beta_l - \beta\right) \tag{36}$$

where Ωl = Υl/N as above and β is the stratified-adjusted vector obtained from the {βl} as in (31).
An equivalent test can be obtained from a contrast-based test among the log hazard ratios. Let $\hat\beta_* = (\hat\beta_1^T \cdots \hat\beta_S^T)^T$ denote the column vector of the S(K − 1) log hazard ratios with estimated covariance matrix $\hat\Sigma_* = \operatorname{diag}(\hat\Omega_1 \cdots \hat\Omega_S)$. Then construct the contrast matrix C′ such that the jth row of the lth block consists of a contrast between the jth log hazard ratio in that block and the jth log hazard ratio in the last (reference) block. For example, if (K − 1) = 2 and S = 3 then

$$C' = \begin{bmatrix} 1 & 0 & 0 & 0 & -1 & 0 \\ 0 & 1 & 0 & 0 & 0 & -1 \\ 0 & 0 & 1 & 0 & -1 & 0 \\ 0 & 0 & 0 & 1 & 0 & -1 \end{bmatrix} \tag{37}$$
Then the test of homogeneity among the strata, or of no interaction, is provided by
$$X^2 = \left(C'\hat\beta_*\right)^T \left[C'\,\hat\Sigma_*\,C\right]^{-1} \left(C'\hat\beta_*\right) \tag{38}$$
that is algebraically equivalent to (35). Again partitioning Ωl = Υl/N, such that Σ* = Υ*/N with Υ* = diag(Υ1 … ΥS), the corresponding non-centrality factor is

$$\phi^2 = \left(C'\beta_*\right)^T \left[C'\,\Upsilon_*\,C\right]^{-1} \left(C'\beta_*\right) \tag{39}$$
that equals (36) for a specified vector β* = (β1T … βST)T.
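A companion sketch for the interaction (homogeneity) test, computed in the form (35) (numpy and scipy assumed; inputs as in the two-stratum illustration of Section 9):

```python
import numpy as np
from scipy.stats import chi2, ncx2

def Omega_from_events(D):
    """Covariance (29) of the K - 1 log hazard ratios given events per group D."""
    return np.diag(1.0 / D[:-1]) + 1.0 / D[-1]

D1, b1 = np.array([122., 140., 140., 140.]), np.array([np.log(0.85), 0.0, 0.0])
D2, b2 = np.array([199., 251., 251., 251.]), np.array([np.log(0.75), 0.0, 0.0])
W1 = np.linalg.inv(Omega_from_events(D1))
W2 = np.linalg.inv(Omega_from_events(D2))
beta = np.linalg.solve(W1 + W2, W1 @ b1 + W2 @ b2)  # adjusted estimate as in (31)

# Non-centrality of the homogeneity test (35) on (K - 1)(S - 1) = 3 df.
psi2 = (b1 - beta) @ W1 @ (b1 - beta) + (b2 - beta) @ W2 @ (b2 - beta)
print(psi2, ncx2.sf(chi2.ppf(0.95, 3), 3, psi2))  # ~0.93, power ~0.10 (Section 9)
```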
8 Quantitative Covariate Effects
It may also be of interest to describe the effect of a quantitative covariate on the risk of the outcome event after adjusting for differences among groups or strata, and/or to assess the homogeneity of the covariate effect among groups or the interaction of group with the quantitative covariate effect. Hsieh and Lavori [18] describe the assessment of sample size for the effect of a quantitative covariate in a univariate Cox PH model. From their results, it follows that the estimated coefficient $\hat\beta$, the log hazard ratio per unit increase in the covariate (X), is asymptotically distributed as

$$\hat\beta \sim N\!\left[\beta,\; \left(E(D)\,\sigma^2\right)^{-1}\right] \tag{40}$$
where σ² now denotes the variance of X and D the number of events observed in the cohort. Note that the information in the data that determines power, or the inverse of the variance, is equal to E(D)σ² for a given value of β. For illustration, consider two covariates with the same value of β but with different variances. Then, for a study with D events, the covariate with the larger variance will have greater power because it implies a greater range of risk over the range of covariate values, as reflected by the value of σ².
Likewise, within the jth group (stratum), $\hat\beta_j \sim N\!\left[\beta_j,\; (E(D_j)\,\sigma_j^2)^{-1}\right]$ as above, with covariate variance σj² and Dj events in the jth group. As in the prior sections, E(Dj) = Nζjπj is a function of the group sample fraction (ζj) and the event probability within that group (πj), which in turn is a function of the event hazard rate within that group (and other quantities).
Then the minimum variance linear estimator of the common coefficient among groups is provided by
$$\hat\beta = \frac{\sum_{j=1}^{K} \hat\beta_j\, D_j\,\hat\sigma_j^2}{\sum_{j=1}^{K} D_j\,\hat\sigma_j^2} \tag{41}$$
with variance
$$V(\hat\beta) = \left[\sum_{j=1}^{K} E(D_j)\,\sigma_j^2\right]^{-1} \tag{42}$$
that is consistently estimated from the observed Dj and the estimated covariate variance within groups. This then provides a group- or stratified-adjusted test of the covariate effect as
$$X^2 = \hat\beta^{\,2} \sum_{j=1}^{K} D_j\,\hat\sigma_j^2 \tag{43}$$
that is distributed chi-square on 1 df under H0: β = 0. This test is valid when the true coefficients {βj} may vary among groups, although there will be loss of power as the degree of heterogeneity increases.
For a given set of coefficients {βj}, expected (or realized) numbers of events {E(Dj)} and covariate variances among strata, the non-centrality parameter of the test is
$$\psi^2 = \frac{\left[\sum_{j=1}^{K} \beta_j\, E(D_j)\,\sigma_j^2\right]^2}{\sum_{j=1}^{K} E(D_j)\,\sigma_j^2} \tag{44}$$
from which the power of the test can be obtained. For given E(Dj) = Nζjπj, the non-centrality factor is
$$\phi^2 = \frac{\left[\sum_{j=1}^{K} \beta_j\,\zeta_j\pi_j\sigma_j^2\right]^2}{\sum_{j=1}^{K} \zeta_j\pi_j\sigma_j^2} \tag{45}$$
from which the total required sample size N can be obtained.
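A sketch of the power computation for the stratified-adjusted covariate test based on (44) (numpy and scipy assumed; the event counts, covariate SD and per-SD hazard ratios are the hypothetical values used in Section 9):

```python
import numpy as np
from scipy.stats import chi2, ncx2

# Hypothetical inputs from Section 9: 4 groups, ~394 events each, covariate
# SD of 10 in every group, per-SD hazard ratios 1.25, 1.35, 1.45 and 1.55.
ED = np.full(4, 394.0)                                       # expected events E(D_j)
sigma2 = np.full(4, 100.0)                                   # covariate variances
beta_j = np.log(np.array([1.25, 1.35, 1.45, 1.55])) / 10.0   # per-unit log HRs

w = ED * sigma2                           # information weights E(D_j) sigma_j^2
beta_bar = np.sum(w * beta_j) / w.sum()   # expectation of the MVLE (41)
psi2 = np.sum(w * beta_j) ** 2 / w.sum()  # non-centrality (44) of the 1 df test
print(beta_bar)                             # ~0.0333, i.e. HR per SD ~1.40
print(ncx2.sf(chi2.ppf(0.95, 1), 1, psi2))  # power ~1.0
```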
Alternately, it may be desired to conduct a test of the hypothesis of homogeneity of the covariate effects among groups, or H0: βj = β for all groups (j = 1, … , K). Cochran’s test of homogeneity (no interaction) is provided by
$$X^2 = \sum_{j=1}^{K} D_j\,\hat\sigma_j^2\left(\hat\beta_j - \hat\beta\right)^2 \tag{46}$$
on K − 1 df. For given sets {βj}, {E(Dj)} and {σj²} among the groups (strata), the non-centrality parameter of the test is

$$\psi^2 = \sum_{j=1}^{K} E(D_j)\,\sigma_j^2\left(\beta_j - \bar\beta\right)^2 \tag{47}$$
and the non-centrality factor is
$$\phi^2 = \sum_{j=1}^{K} \zeta_j\pi_j\sigma_j^2\left(\beta_j - \bar\beta\right)^2 \tag{48}$$
where
$$\bar\beta = \frac{\sum_{j=1}^{K} \zeta_j\pi_j\sigma_j^2\,\beta_j}{\sum_{j=1}^{K} \zeta_j\pi_j\sigma_j^2} \tag{49}$$
The above simplifies when the variance of the covariate is the same in all groups, σj² = σ² for all j.
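A corresponding sketch for the K − 1 df homogeneity test based on (47) (same assumed inputs as above):

```python
import numpy as np
from scipy.stats import chi2, ncx2

# Same hypothetical inputs as above: test of homogeneity of covariate effects.
ED, sigma2 = np.full(4, 394.0), np.full(4, 100.0)
beta_j = np.log(np.array([1.25, 1.35, 1.45, 1.55])) / 10.0

w = ED * sigma2
beta_bar = np.sum(w * beta_j) / w.sum()        # weighted mean coefficient (49)
psi2 = np.sum(w * (beta_j - beta_bar) ** 2)    # non-centrality (47) on K - 1 df
print(psi2)                                    # ~10.1 (cf. 10.3 in Section 9)
print(ncx2.sf(chi2.ppf(0.95, 3), 3, psi2))     # power ~0.76 (76.7% in Section 9)
```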
9 Example
The Glycemia Reduction Approaches in Diabetes: A Comparative Effectiveness Study (GRADE) is designed to compare the effectiveness of four classes of drugs commonly used for the treatment of type 2 diabetes. The primary outcome is the time to confirmed inability to maintain adequate glycemic control, which from prior studies is estimated to have a reference hazard rate of λ = 0.0875 per year in the drug group(s) with the least durable effect on glycemic control. Herein we describe sample size and power computations assuming an R = 3 year recruitment interval with a total duration of T = 7 years. To allow for a lag in recruitment it is assumed that 40% of subjects are recruited in the first half of the recruitment period and 60% in the second, corresponding to a recruitment shape parameter of γ = −0.27. From (3) this yields a mean recruitment time of 1.7 years, and thus a mean potential follow-up of 5.3 years in the absence of losses. Allowing for 4% losses per year (η = 0.04), from (4) the mean potential exposure time is reduced to approximately 4.8 years, and the mean time actually at risk, with events occurring at the reference rate, is 3.84 years. The corresponding event probability from (5) is 0.335, with a loss-to-follow-up probability of 0.153.
A single overall global test of the hypothesis of equality among the 4 groups will be conducted on 3 df. The simplest alternative hypothesis is that three treatments all have the reference hazard rate of 0.0875 and one treatment (say the first) is superior to the other three with a hazard ratio of 0.75 versus the others, i.e. with a hazard of 0.0656 for the first and 0.0875 for the other three groups, and the vector of hazard ratios (0.75 1 1)T with group 4 as the reference. With equal sized treatment groups (ζj = 1/4), the expected probabilities of the event are 0.265 in the first and 0.335 in each of the other 3 groups. These yield a weighted mean log hazard θ̄ = −2.496, corresponding to a geometric mean hazard of 0.0824, and a non-centrality factor of ϕ² = 0.004338. The non-centrality parameter value that provides 90% power for a 3 df test at the 0.05 level is ψ²(0.05, 0.10, 3) = 14.1715. Substituting into (11), N = ψ²/ϕ² = 3268 (rounded up from 3267) would be required to provide 90% power to detect the hazard ratio of 0.75 for one therapy versus the others. This yields 216 subjects expected to have the event in the first group and 274 in each of the other 3 groups.
For the case where two therapies are equally superior to the other two with a hazard ratio of 0.75, an N of 2316 (rounded up from 2315) would provide 90% power. Thus, it is conservative to power the study to detect a single isolated superior drug with HR = 0.75, in which case the total sample size selected might be N = 3300 to provide 90% power.
However, it is also desired to conduct the 6 pairwise comparisons among the 4 drug groups. Although the Hochberg closed test procedure will be employed, for the smallest nominal p-value the adjustment is equivalent to the Bonferroni correction, i.e. a two-sided significance level of 0.05/6 is required for adjusted significance at the 0.05 level. Two group calculations with n = 825 per group show that a total N = 3300 provides only 71% power to detect a HR = 0.75 between any two groups with this design. Rather, a sample size of n = 1242 per group is required to provide 90% power to detect a HR = 0.75 in a two-group comparison at the 0.05/6 level under the above assumptions, thus requiring a total sample size of 4968, rounded up to N = 5000 as the target enrollment. In this case, the K − 1 df test of homogeneity would provide 98.3% power to detect a single superior drug group with HR = 0.75, and 90% power to detect a single group with HR = 0.796.
It should be noted that another option might be to conduct 4 pairwise comparisons of each drug group versus the other 3 groups combined. With the smaller total N of 3300, such a test at the 0.05/4 level would provide 93% power to detect HR = 0.75. However, the 6 pairwise comparisons are preferred and thus the larger sample size of N = 5000 will be employed.
The study will evaluate various stratification or subgroup factors in which case a stratified-adjusted test may be conducted. To assess the effect of heterogeneity among strata, consider the case where one stratum consists of approximately 2000 subjects with a 20% lower hazard rate of 0.0875 × 0.8 = 0.07/year and a smaller difference between groups with a hazard ratio of 0.85, and the other stratum consists of approximately 3000 subjects with the same risks assumed above. With the same parameters as above, and assuming that a single drug is superior to the others, the stratum of 2000 subjects would provide an expected 122 events in the first group and 140 events in each of the other three groups, and the stratum of 3000 subjects would provide 199 and 251 events, respectively. The vector of stratified-adjusted log hazard ratios is β = (−0.240713 0 0)T with the first element corresponding to a hazard ratio of 0.786 for the first group versus the reference (i.e. all others). The corresponding covariance matrix has diagonal elements 0.005678 for the first log hazard ratio, 0.005114 for the next 2 diagonal elements, and off-diagonal elements 0.002557. The resulting non-centrality parameter is 14.58, which yields power of 90.1%. Thus, the presence of a mild group by stratum interaction (or heterogeneity) leads to some dilution of power for the 4 group test, but at an acceptable level. However, a test of no interaction or homogeneity would have a low power of only 10% to detect a difference in hazard ratios of 0.75 versus 0.85 between these two strata.
Subgroup analyses will also be performed to assess treatment group differences between segments of the population, such as males and females, with a test of a treatment by subgroup interaction. Again assume an overall hazard rate of 0.0875, losses at 4% per year, and the first group superior to the rest with an overall hazard ratio of 0.75 in the full cohort. Within one subgroup, assume that the hazard ratio is 25% smaller, i.e. a hazard ratio of 0.75 × 0.75 = 0.563, whereas in the other subgroup it is 25% larger, i.e. 0.75 × 1.25 = 0.938. For equally sized subgroups with n = 2500 each, the test of homogeneity (no interaction) provides 93.9% power. For a factor with three subgroups, each with sample size 1666, the study would provide 68.9% power to detect hazard ratios of 0.563, 0.75 and 0.938.
Analyses may also be conducted involving a quantitative covariate. As for a qualitative covariate (the S strata), one analysis could assess the association of the covariate with the outcome adjusted for treatment group, and another could assess homogeneity of the covariate effect among strata (or a group by stratum interaction).
For N = 5000, or 1250 for each of 4 groups, under the above assumptions, approximately 394 events are expected within each group (1576 total). From recent studies, such as of biomarkers in relation to cardiovascular disease, a hazard ratio of 1.4 per standard deviation change (HRSD) in the covariate is desirable to detect. For a standard deviation σ of the covariate, the hazard ratio per unit change in the covariate is HRSD^(1/σ), so that the coefficient is β = log(HRSD)/σ. For σ = 10 the Hsieh-Lavori expression with 1576 events yields virtually 100% power to detect a HRSD = 1.4, and 97% power to detect a smaller HRSD = 1.1. It is also desired to assess the power of the study to detect heterogeneity of a quantitative covariate effect among groups, such as with per-SD hazard ratios of 1.25, 1.35, 1.45 and 1.55 in the four groups, for which the weighted average coefficient is β̄ = 0.0333, corresponding to a HRSD = 1.396. This yields a non-centrality parameter value of ψ² = 10.3 on 3 df for the test of homogeneity, which yields 76.7% power to detect these small differences in the hazard ratios.
10 Discussion
Makuch and Simon [9] describe the assessment of power of the K group logrank test under an exponential model when the numbers of subjects with the event are known (specified), and they provide an equation to compute the average number of such subjects with the event, i.e. assuming that all groups have the same numbers of subjects with the event. Given an assumption about the total exposure time within each group (time to event or censoring), the required sample size can then be obtained. Herein, we take a more general approach by first describing design assumptions including the pattern of recruitment, rate of losses-to-follow-up, and possibly levels of stratification, as in Lachin and Foulkes [1] for the two group case. For a given sample size this then provides the probability in each group that a subject will have the event, which can then be used to determine the required sample size, or to determine power for a given sample size.
The initial approach herein is to describe the non-centrality parameter for a test of homogeneity of the K group hazard rates, which is also shown to apply to a T² or Wald-type contrast test. These expressions employ the variance of the test statistic evaluated under the alternative, i.e. using the set of specified hazard rates that are desired to be detected. We then derive the non-centrality parameter for the score test in the Cox PH model, resulting in an expression identical to that of Ahnn and Anderson [11] when there are equal sized groups and a common pattern of censoring. These expressions employ the variance of the test statistic evaluated under the null hypothesis, i.e. assuming a common probability of the event among the groups. This latter approach based on (23) will provide a smaller number of required events and a smaller total sample size than that using (9). For the above example with 4 groups and a single group superior with a hazard ratio of 0.75, the expression in (9) yields N = 3268 with an expected total of D = 1037 events, whereas (23) yields an expected total number of events of D = 913, for which a smaller sample size N = 2876 would be required. For the comparison of 2 groups, Lachin [4] showed that the comparable expression using the alternative hypothesis variance as in (9) is in general more conservative in that it always provides a larger N and larger required number of events than the expression based on the null hypothesis variance as in (23). On this basis the exponential model based expression might be preferred.
Generalizations then provide the assessment of sample size or power for a stratified-adjusted K-group comparison and for a test of homogeneity or group by stratum interaction, as would be appropriate for a “subgroup” analysis in which the treatment group differences are compared among strata. Likewise, sample size and power are described for a stratified-adjusted analysis of a quantitative covariate effect, and a test of homogeneity of a quantitative covariate effect among strata.
While explicit expressions for a stratified analysis are provided, in many cases a simple approximate computation may suffice. For the above example, with a common recruitment shape parameter in each stratum and a common hazard rate for losses-to-follow-up, the average hazard rate for the event is (3/5)(0.0875) + (2/5)(0.07) = 0.0805 and the average log hazard ratio for one group versus the others is (3/5) [ln(0.75)] + (2/5) [ln(0.85)] = − 0.238 corresponding to an average hazard ratio of 0.788 for one group versus the others. Then a non-stratified computation using (9) with N = 5000 yields power of 90.2%, close to the 90.1% provided by the precise stratified computation. Thus, the principal application of the stratified assessment would be to the case where the strata have different patterns of recruitment and/or different periods of enrollment or follow-up duration and different patterns of losses-to-follow-up. An example of this type is described by Lachin and Foulkes [1] to which the above computations would apply for a K group trial.
The Mantel-logrank test is a member of the family of linear rank tests for survival data described by Andersen, Borgan, Gill and Keiding [19] that includes the Peto-Prentice modified Wilcoxon test that is optimal under a proportional survival odds model. Jung and Hui [20] describe the non-central distribution of this family of tests from which the power of a particular test can be obtained. Their method allows for a period of uniform recruitment and follow-up, and losses-to-follow-up, but it requires numerical integration of stochastic integrals to construct the non-centrality parameter. Conversely, the methods herein are quite simple to apply.
In all cases, the sample size is obtained by solving for the total N that yields a desired number of events. In cases where the number of events is known, or pre-specified in advance, power can be assessed by simply substituting the event numbers into the above expressions, as in (9).
Central to the application of these methods is the precise specification of the log hazard rates within each of the treatment groups (and possibly strata) worthy of detection. In general, from (9) it is clear that the magnitude of the non-centrality parameter, and thus power, depends explicitly on the weighted sum of squares (SS) among the specified log hazards, weighted by the expected number of events within each group. As employed by Makuch and Simon [9], this parameter depends approximately on the unweighted sum of squares that is easily evaluated for a specified set of hazard rates. In this case, well-known results for a balanced one-way ANOVA F-test of equality of K group means will apply approximately. Consider the set of K ordered means (or log hazard rates) with minimum mean θ(1) and maximum θ(K). The maximum power (sum of squares) occurs when the ordered means for half the groups (or for K/2 ± 0.5 if K is odd) equal θ(1) and the other half (or K/2 ± 0.5) equal θ(K). Conversely, the least power occurs when θ(j) = (θ(1) + θ(K))/2 for 1 < j < K. The test also has poor power when the hazards in K − 1 groups are equal and that in the Kth group is different, the so-called case of a single isolated superiority (or inferiority). For example, for a set of K = 5 means with θ(1) = 0 and θ(K) = 4, the maximum SS is 19.2 for ordered mean values of (0, 0, 0, 4, 4) or (0, 0, 4, 4, 4). The SS is 12.8 for the isolated superiority with means (0, 0, 0, 0, 4) or (0, 4, 4, 4, 4); and is a minimum of 8.0 for means of (0, 2, 2, 2, 4). Thus, in the balanced case (ζj = 1/K), the total number of events required using the expression under the null (25) for the minimum SS case is 19.2/8 = 2.4-fold higher than that for the maximum SS case.
Makuch and Simon also proposed that the global test in (8) could be employed with Fisher’s Least Significant Difference (LSD) method to guarantee the experiment-wide type 1 error probability at level α for the set of K(K − 1)/2 pairwise tests. However, from the closed testing principle [21], this is true only for K = 3 groups. For an illustration, see Chi [22]. For example, with K = 4 groups, if the 4 group test is significant at level α, one can then test the 4 separate 3 group differences at level α. A given pairwise comparison, e.g. group 1 versus 2, is a component of the null hypothesis for 2 of the 4 such 3 group tests; specifically, H0: θ1 = θ2 is a component of the “parent” test hypotheses H0: θ1 = θ2 = θ3 and H0: θ1 = θ2 = θ4. If both of these are significant at level α, then the component pairwise test of H0: θ1 = θ2 can also be tested at level α. If either of the two parent 3 group null hypotheses is not rejected at level α, then that pairwise test is declared non-significant.
Finally, the methods herein assume that the exponential model or the proportional hazards model apply. If indeed they do not, then the required sample size will be underestimated, and the study power overestimated. For the simple two group design, Lakatos [23] describes a piecewise interval approach to the power of the logrank test when the hazard rate and/or hazard ratio may vary over intervals of time. Ahnn and Anderson [24] describe a generalization of this approach to the K-group logrank test for specified hazard rates, numbers of events and numbers within each group at risk within each interval of time, quantities that are not easily obtained in a complex design with staggered non-uniform patient entry, varying hazards and losses to follow-up.
Programs from the author for the computations herein will be available from www.bsc.gwu.edu under the link to programs available for download.
ACKNOWLEDGMENTS
This work was partially supported by cooperative agreements from the National Institute of Diabetes, Digestive and Kidney Diseases for the Glycemia Reduction Approaches in Diabetes: A Comparative Effectiveness Study (GRADE) and for the Diabetes Prevention Program Outcomes Study (DPPOS).
REFERENCES
1. Lachin JM, Foulkes MA. Evaluation of sample size and power for analyses of survival with allowance for non-uniform patient entry, losses to follow-up, non-compliance and stratification. Biometrics 1986;42:507–519.
2. George SL, Desu MM. Planning the size and duration of a clinical trial studying the time to some critical event. J Chronic Dis 1974;27:15–29.
3. Schoenfeld DA. Sample-size formula for the proportional-hazards regression model. Biometrics 1983;39:499–503.
4. Lachin JM. Introduction to sample size determination and power analysis for clinical trials. Control Clin Trials 1981;2:93–113.
5. Rubenstein LV, Gail MH, Santner TJ. Planning the duration of a comparative clinical trial with losses to follow-up and a period of continued observation. J Chronic Dis 1981;34:469–479.
6. Schoenfeld DA, Richter JR. Nomograms for calculating the number of patients needed for a clinical trial with survival as the endpoint. Biometrics 1982;38:163–170.
7. Freedman LS. Tables of the number of patients required in clinical trials using the logrank test. Stat Med 1982;1:121–129.
8. Schoenfeld DA. The asymptotic properties of nonparametric tests for comparing survival distributions. Biometrika 1981;68:316–319.
9. Makuch RW, Simon RM. Sample size requirements for comparing time-to-failure among k treatment groups. J Chronic Dis 1982;35:861–867.
10. Liu PY, Dahlberg S. Design and analysis of multiarm clinical trials with survival endpoints. Control Clin Trials 1995;16:119–130.
11. Ahnn S, Anderson SJ. Sample size determination for comparing more than two survival distributions. Stat Med 1995;14:2273–2282.
12. Tarone RE, Ware J. On distribution-free tests for equality of survival distributions. Biometrika 1977;64:156–160.
13. Halabi S, Singh B. Sample size determination for comparing several survival curves with unequal allocations. Stat Med 2004;23:1793–1815.
14. Lachin JM. Biostatistical Methods: The Assessment of Relative Risks. 2nd ed. New York: John Wiley & Sons; 2011.
15. Cochran WG. The combination of estimates from different experiments. Biometrics 1954;10:101–129.
16. Anderson TW. An Introduction to Multivariate Statistical Analysis. 2nd ed. New York: John Wiley & Sons; 1984.
17. Lachin JM. Some large sample distribution-free estimators and tests for multivariate partially incomplete data from two populations. Stat Med 1992;11:1151–1170.
18. Hsieh FY, Lavori PW. Sample-size calculations for the Cox proportional hazards regression model with nonbinary covariates. Control Clin Trials 2000;21:552–560.
19. Andersen PK, Borgan O, Gill RD, Keiding N. Linear nonparametric tests for comparison of counting processes, with applications to censored survival data. Int Statist Rev 1982;50:219–258.
20. Jung S-H, Hui S. Sample size calculations for rank tests comparing K survival distributions. Lifetime Data Anal 2002;8:361–373.
21. Marcus R, Peritz E, Gabriel KR. On closed testing procedures with special reference to ordered analysis of variance. Biometrika 1976;63:655–660.
22. Chi GYH. Multiple testings: multiple comparisons and multiple endpoints. Drug Inf J 1998;32:1347S–1362S.
23. Lakatos E. Sample sizes based on the log-rank statistic in complex clinical trials. Biometrics 1988;44:229–241.
24. Ahnn S, Anderson SJ. Sample size determination in complex clinical trials comparing more than two groups for survival endpoints. Stat Med 1998;17:2525–2534.