Abstract
The effects of interventions are multi-dimensional. The use of more than one primary endpoint is an attractive design feature in clinical trials, as it captures a more complete characterization of the effects of an intervention and provides more informative intervention comparisons. For these reasons, multiple primary endpoints have become a common design feature in many disease areas such as oncology, infectious disease, and cardiovascular disease. More specifically, in medical product development, multiple endpoints are utilized as co-primary to evaluate the effect of new interventions. Although methodologies to address continuous co-primary endpoints are well-developed, methodologies for binary endpoints are limited. In this paper, we describe power and sample size determination for clinical trials with multiple correlated binary endpoints, when relative risks are evaluated as co-primary. We consider a scenario where the objective is to evaluate evidence for superiority of a test intervention compared with a control intervention on all of the relative risks. We discuss normal approximation methods for power and sample size calculations and evaluate how the required sample size, the power, and the Type I error rate vary as a function of the correlations among the endpoints. We also discuss a simple but conservative procedure for appropriate sample size calculation. We then extend the methods to allow for interim monitoring using group-sequential methods.
Keywords: Conjunctive power, Co-primary endpoints, Group-sequential designs, Monte-Carlo simulation, Normal approximation, Type I error
1. Introduction
Traditionally in clinical trials, one important and clinically relevant outcome is selected as the primary endpoint. This endpoint is then used as the basis for the trial design including sample size determination, interim monitoring, final analyses, and the reporting of trial results. However, many recent clinical trials have utilized more than one endpoint as co-primary. “Co-primary” in this setting means that the trial is designed to evaluate if the new intervention is superior to the control on all endpoints.
The need for new approaches to the design and analysis of clinical trials with co-primary endpoints has been noted. When designing the trial to evaluate the joint effects on all of the endpoints, no adjustment is needed to control the Type I error rate if each endpoint is tested at the common prespecified significance level. However, the Type II error rate increases as the number of endpoints to be evaluated increases. Thus, a sample size adjustment is needed to maintain the overall power for detecting the joint effects on all of the endpoints, often resulting in large and impractical sample sizes. In order to provide a more practical sample size, methods for clinical trials with co-primary endpoints have been discussed for fixed sample size designs. Methodologies to address multiple co-primary continuous endpoints are well-developed (Chuang-Stein et al., 2007; Dmitrienko et al., 2010; Eaton and Muirhead, 2007; Hung and Wang, 2009; Julious and McIntyre, 2012; Kordzakhia et al., 2010; Offen et al., 2007; Senn and Bretz, 2007; Sozu et al., 2006; Sugimoto et al., 2012; Xiong et al., 2005). When evaluating relative risks with time-to-event outcomes, Hamasaki et al. (2013) and Sugimoto et al. (2013) have developed methods for sizing clinical trials, focusing on the hazard ratio and logrank test statistics. However, methodology for multiple binary endpoints is limited (Song, 2009; Sozu et al., 2010, 2011, 2015; Xu and Yu, 2013).
The lack of appropriate methodology for multiple binary endpoints is problematic since clinical trials are often conducted with the objective of comparing the effect of a test intervention to that of a control intervention based on several binary outcomes. For example, PLACIDE is a randomized, double-blinded, parallel-group, placebo-controlled clinical trial evaluating lactobacilli and bifidobacteria in the prevention of antibiotic-associated diarrhea in older people admitted to hospital (Allen et al., 2012; Allen et al., 2013). The trial was designed to demonstrate that the administration of a probiotic comprising two strains of lactobacilli and two strains of bifidobacteria alongside antibiotic treatment prevents antibiotic-associated diarrhea. The co-primary outcomes were (1) the occurrence of antibiotic-associated diarrhoea (AAD) within 8 weeks and (2) the occurrence of C. difficile diarrhoea (CDD) within 12 weeks of randomization. Another example can be seen in irritable bowel syndrome (IBS), one of the most common gastrointestinal disorders, characterized by symptoms of abdominal pain, discomfort, and altered bowel function (American College of Gastroenterology, 2013; Grundmann and Yoon, 2010). The comparison of interventions to treat IBS is based on the proportions of participants with: (1) adequate relief of abdominal pain and discomfort, and (2) improvements in urgency, stool frequency, and stool consistency. The U.S. Food and Drug Administration (FDA) recommends the use of two endpoints for assessing IBS signs and symptoms: (1) pain intensity, and (2) stool frequency (Food and Drug Administration, 2012). Meanwhile, the Committee for Medicinal Products for Human Use (2013) also recommends the use of two endpoints for assessing IBS signs and symptoms: (1) global assessment of symptoms, and (2) assessment of symptoms of abdominal discomfort/pain. Offen et al. (2007) provide other examples.
Methodology for sizing trials when evaluating the absolute difference in proportions can be found in Sozu et al. (2010). The objective of this paper is to describe methodology for power and sample size determination in clinical trials with multiple co-primary binary endpoints when the relative risks are evaluated as the contrasts of interest. Methodology for the odds ratio is briefly discussed in Appendix C.
The paper is structured as follows: in Section 2, we describe a normal approximation method for sample size determination that incorporates the correlations among the endpoints into the calculations. We then evaluate the practical utility of the normal approximation method via Monte-Carlo simulation. In Section 3, we discuss a conservative procedure for sample size calculation by treating the endpoints as if they are not correlated. In Section 4, we extend the methodology to a group-sequential setting. In Section 5, we summarize the findings.
2. Methods for calculating the sample size with relative risks
2.1 Statistical Settings
Consider a randomized clinical trial comparing two interventions with K binary endpoints being evaluated as co-primary (K ≥ 2). There are N total participants, with rN participants assigned to the test (T) and (1 − r)N participants to the control (C) intervention groups, where r is the allocation ratio (0 < r < 1). Let the rN responses to the test intervention be denoted by YTki and the (1 − r)N responses to the control intervention by YCkj (k = 1, …, K; i = 1, …, rN; j = 1, …, (1 − r)N). Suppose that YTki and YCkj are independently distributed as Bernoulli distributions with probabilities of success pTk and pCk, but the K binary endpoints are correlated with common pairwise correlations of ρ(kk′) = corr[YTki, YTk′i] = corr[YCkj, YCk′j] (1 ≤ k < k′ ≤ K). Let YTk = Σi YTki and YCk = Σj YCkj denote the numbers of successes under the test and the control interventions, respectively. Then they are distributed as binomial distributions, i.e., YTk ~ B(rN, pTk) and YCk ~ B((1 − r)N, pCk).
We now have the K log-transformed (observed) relative risks logR̂k = log(p̂Tk/p̂Ck), where p̂Tk = (rN)−1YTk and p̂Ck = ((1 − r)N)−1YCk. For large samples, by applying the delta method, we assume that the joint distribution of (logR̂1, …, logR̂K) is approximately K-variate normal with mean vector μ = (logR1, …, logRK)T and covariance matrix Σ whose elements are

var[logR̂k] ≈ (1 − pTk)/(rNpTk) + (1 − pCk)/{(1 − r)NpCk},

cov[logR̂k, logR̂k′] ≈ ρ(kk′)[(1/(rN))√{(1 − pTk)(1 − pTk′)/(pTkpTk′)} + (1/((1 − r)N))√{(1 − pCk)(1 − pCk′)/(pCkpCk′)}],

where Rk = pTk/pCk. Since it is assumed that the correlations among the endpoints are common between the two intervention groups, for large samples, the correlation between logR̂k and logR̂k′, denoted by γ(kk′), is approximately given by

γ(kk′) ≈ cov[logR̂k, logR̂k′]/√(var[logR̂k]var[logR̂k′]),

which does not depend on N. It follows that |γ(kk′)| ≤ |ρ(kk′)|. A detailed calculation for γ(kk′) is provided in Appendix A.
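To make this calculation concrete, the correlation matrix of (logR̂1, …, logR̂K) can be computed as in the following minimal sketch (Python with NumPy; the function name corr_log_rr, the argument layout, and the return of the per-N variances are illustrative choices rather than a prescribed implementation):

```python
import numpy as np

def corr_log_rr(pT, pC, rho, r=0.5):
    """Delta-method correlation matrix of the log relative risks.

    pT, pC : length-K success probabilities under test and control
    rho    : K x K matrix of the pairwise Bernoulli correlations rho_(kk')
    r      : allocation ratio to the test group
    Returns (Gamma, v): the K x K correlation matrix of (log R-hat_1, ..., log R-hat_K)
    and the per-N variances v_k, where var[log R-hat_k] is approximately v_k / N.
    """
    pT, pC, rho = np.asarray(pT, float), np.asarray(pC, float), np.asarray(rho, float)
    vT = (1 - pT) / (r * pT)          # test-group contribution to N * var[log R-hat_k]
    vC = (1 - pC) / ((1 - r) * pC)    # control-group contribution
    v = vT + vC
    cov = rho * (np.sqrt(np.outer(vT, vT)) + np.sqrt(np.outer(vC, vC)))  # N * covariance
    np.fill_diagonal(cov, v)
    return cov / np.sqrt(np.outer(v, v)), v

# e.g. two endpoints with pT = (0.02, 0.05), pC = (0.04, 0.10) and rho_(12) = 0.5:
# Gamma, v = corr_log_rr([0.02, 0.05], [0.04, 0.10], [[1.0, 0.5], [0.5, 1.0]])
```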
There are three measures of the correlation ρ(kk′), i.e., the correlation coefficient of a multivariate Bernoulli distribution, the odds ratio, and the correlation coefficient of a latent normal distribution (Pearson 1900; Bahadur, 1961; Dale 1986; Le Cessie and van Houwelingen, 1994; Prentice, 1988). We consider the correlation from a bivariate Bernoulli distribution because it is intuitively attractive and interpretable even though the range of ρ(kk′) is restricted, depending on the marginal probabilities (Prentice, 1988). However, the results for the power and sample size determination are provided in a general form and can be straightforwardly applicable to the other two correlation measures.
2.2 Power and sample size calculations
When evaluating the joint effects for multiple correlated binary relative risks, e.g., to establish a risk reduction in the test intervention compared with the control intervention, the hypotheses H0: logRk ≥ 0 for at least one k versus H1: logRk < 0 for all k are tested using the test statistics

Zk = logR̂k/√V̂k,   (1)

where V̂k = {1/(rN) + 1/((1 − r)N)}(1 − p̂k)/p̂k is the variance estimate of logR̂k under H0, with p̂k = (YTk + YCk)/N = rp̂Tk + (1 − r)p̂Ck the pooled estimate of the common success probability. If each endpoint is evaluated using a one-sided test at the same significance level of α, then the rejection region of H0 is [{Z1 < −zα} ∩ … ∩ {ZK < −zα}], where zα is the 100(1 − α)th percentile of the standard normal distribution. Therefore, for the true relative risks Rk and for large samples, straightforward algebra with population parameters substituted for their estimates provides the approximate overall power:
1 − β = P[{Z1 < −zα} ∩ … ∩ {ZK < −zα} | H1] ≈ ΦK(c1, …, cK; Γ),   (2)

which is referred to as “conjunctive (or complete) power” (Senn and Bretz, 2007), where ΦK(·; Γ) is the distribution function of the K-variate normal distribution with zero mean vector and correlation matrix Γ = {γ(kk′)}, and

ck = {√N(−logRk) − zα√v̄k}/√vk,   v̄k = (1 − p̄k)/{r(1 − r)p̄k},   vk = (1 − pTk)/(rpTk) + (1 − pCk)/{(1 − r)pCk},

with p̄k = rpTk + (1 − r)pCk. Thus the total required sample size NAN for detecting the joint reduction in all relative risks with the overall power 1 − β at the significance level α, using the normal approximation, is the smallest integer satisfying Equation (2). In addition, we can simplify Equation (2) using the pooled variance estimate under the null hypothesis or the unpooled variance estimate under the alternative hypothesis. When using the unpooled variance estimate, the total sample size NUP is the smallest integer satisfying the power function 1 − β, where ck in Equation (2) is replaced by

ck = √N(−logRk)/√vk − zα.
Similarly, when using the pooled variance estimate, the total sample size NPL is the smallest integer satisfying the power function 1 − β, where ck in Equation (2) is replaced by

ck = √N(−logRk)/√v̄k − zα.
The relationship among these three sample sizes is NPL ≤ NAN ≤ NUP, which follows from the relationship between the pooled and unpooled variances, v̄k ≤ vk, assuming pTk < pCk and 0 < r < 1.
Note that the simplified formula for the sample size with a pooled variance, NPL, results in the smallest sample size, but our simulation studies suggest that it may not achieve the targeted power in many practical situations, as shown in the supplemental document. Thus it is not recommended for use in practice. No closed-form expression is available for calculating NAN (or NUP, NPL), and an iterative procedure is required to find the sample size achieving the targeted power. We provide a practical algorithm for calculating the sample size in Appendix B.
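For illustration, the conjunctive power (2) and the sample size NAN can be evaluated numerically as in the sketch below, which reuses corr_log_rr from Section 2.1 and the distribution function of the multivariate normal distribution in SciPy. The expression for ck follows the form given in Equation (2) above; the function names, the simple grid search, and the step of two (for equal allocation) are illustrative choices.

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

def conjunctive_power(N, R, pC, rho, r=0.5, alpha=0.025):
    """Approximate overall (conjunctive) power of the co-primary tests at total size N."""
    R, pC = np.asarray(R, float), np.asarray(pC, float)
    pT = R * pC
    Gamma, v = corr_log_rr(pT, pC, rho, r)              # from the Section 2.1 sketch
    pbar = r * pT + (1 - r) * pC
    vbar = (1 - pbar) / (r * (1 - r) * pbar)            # pooled (null) per-N variance
    za = norm.ppf(1 - alpha)
    c = (np.sqrt(N) * (-np.log(R)) - za * np.sqrt(vbar)) / np.sqrt(v)
    return multivariate_normal(mean=np.zeros(len(R)), cov=Gamma).cdf(c)

def sample_size_nan(R, pC, rho, r=0.5, alpha=0.025, power=0.80, step=2):
    """Smallest total N (in steps of 2 for equal allocation) with conjunctive power >= target."""
    N = step
    while conjunctive_power(N, R, pC, rho, r, alpha) < power:
        N += step
    return N
```

Appendix B describes a faster search than this simple grid.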
Figure 3.
Behaviors of the overall powers as a function of the effect size(s) for a given sample size in the case of two co-primary relative risks (A) and three co-primary relative risks (B). For the case of two co-primary relative risks, the sample size (equally sized groups; r =0.5) was calculated to detect a reduction in the relative risk R1 with the individual power of (0.8)1/2 =0.894 for a one-sided test at the significance level of α =0.025. For the case of three co-primary relative risks, the sample size (equally sized groups: r =0.5) was calculated to detect a reduction in the relative risk R1 with the individual power of (0.8)1/3 =0.928 for a one-sided test at the significance level of α =0.025.
There are more direct ways of calculating the sample size without using a normal approximation (e.g., see Sozu et al. (2010)). However, such methods are computationally difficult and often impractical, particularly for a large number of endpoints. On the other hand, by analogy with the single binary endpoint case, the normal approximation discussed here may not work well with extremely small event rates or with small sample sizes. We evaluate the utility of the normal approximation in Section 2.3.
2.3 Evaluation of the methodology utility
In order to evaluate the utility of the normal approximation described in the previous sections, the Type I error rate and power under the calculated sample sizes NAN were evaluated using Monte-Carlo simulation. Consider a clinical trial with two interventions being compared using two relative risks as co-primary contrasts. The required total sample size NAN is calculated to detect the joint effect on both relative risks R1 and R2 with control proportions pC1 and pC2, with the desired overall power of 1 − β = 0.80 for a one-sided test at the significance level of α = 0.025. For each parameter configuration, 1,000,000 and 100,000 datasets with a total sample size of NAN were generated for the assessment of the Type I error rate and the power, respectively. The sample sizes calculated using the normal approximation are also compared with the sample sizes derived by a Monte-Carlo simulation-based approach, where 100,000 replications were used for the power evaluation. Bivariate Bernoulli data for the Monte-Carlo simulations were generated using the method described in Emrich and Piedmonte (1991).
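The following sketch illustrates this type of Monte-Carlo check for K = 2. The latent-normal construction follows Emrich and Piedmonte (1991); the test statistics use the pooled standard error as in (1); the number of replications, the seed, and the handling of degenerate samples (counted as non-rejections) are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm, multivariate_normal
from scipy.optimize import brentq

def latent_corr(p1, p2, rho):
    """Latent normal correlation inducing Bernoulli correlation rho (Emrich-Piedmonte).
    Note: rho must lie within the feasible range for the given marginals (Prentice, 1988)."""
    z1, z2 = norm.ppf(p1), norm.ppf(p2)
    target = p1 * p2 + rho * np.sqrt(p1 * (1 - p1) * p2 * (1 - p2))
    f = lambda d: multivariate_normal([0, 0], [[1, d], [d, 1]]).cdf([z1, z2]) - target
    return brentq(f, -0.999, 0.999)

def empirical_power(N, R, pC, rho12, r=0.5, alpha=0.025, nrep=10000, seed=12345):
    """Empirical conjunctive power of the tests based on (1) for K = 2 endpoints."""
    rng = np.random.default_rng(seed)
    pC = np.asarray(pC, float)
    pT = np.asarray(R, float) * pC
    nT = int(round(r * N)); nC = N - nT
    dT = latent_corr(pT[0], pT[1], rho12)
    dC = latent_corr(pC[0], pC[1], rho12)
    covT = [[1, dT], [dT, 1]]; covC = [[1, dC], [dC, 1]]
    za = norm.ppf(1 - alpha)
    reject = 0
    for _ in range(nrep):
        xT = (rng.multivariate_normal([0, 0], covT, size=nT) <= norm.ppf(pT)).mean(axis=0)
        xC = (rng.multivariate_normal([0, 0], covC, size=nC) <= norm.ppf(pC)).mean(axis=0)
        if np.any(xT <= 0) or np.any(xC <= 0) or np.any(xT >= 1) or np.any(xC >= 1):
            continue                      # degenerate sample: count as a non-rejection
        pooled = (nT * xT + nC * xC) / N
        se = np.sqrt((1 - pooled) / pooled * (1 / nT + 1 / nC))  # pooled SE of log RR
        if np.all(np.log(xT / xC) / se < -za):
            reject += 1
    return reject / nrep
```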
Figure 1 displays the behavior of the empirical powers and Type I error rates for the tests based on the test statistics (1) with common correlation ρ12 = 0.0, 0.3, 0.5, 0.8, and 0.99, in the cases of R1 = 0.8, 0.7, 0.6, 0.5 and 0.4 with pC1 = 0.5, 0.2, 0.10 and 0.05, where R1 = R2 = R and pC1 = pC2 = pC, for K = 2, under a given sample size for detecting a joint effect on the two relative risks (using the above parameter configuration) with the desired overall power of 1 − β = 0.80 for a one-sided test at the significance level of α = 0.025. For the evaluation of the Type I error rates, we consider two null hypothesis configurations: (i) logR1 = 0 and logR2 = 0, and (ii) logR1 < 0 but logR2 = 0.
Figure 1.
Behaviors of the empirical powers and Type I error rates for the tests based on the test statistics (1), under a given sample size NAN, with correlation ρ12 = 0.0, 0.3, 0.5, 0.8 and 0.99, in the cases of R1 = 0.8, 0.7, 0.6, 0.5 and 0.4 with pC1 = 0.5, 0.2, 0.10 and 0.05, where R1 = R2 = R and pC1 = pC2 = pC, and equally sized groups (r = 0.5) for K = 2. The given sample size NAN was calculated to detect the joint effect on all of the endpoints for each parameter setting with the desired overall power of 1 − β = 0.80 for a one-sided test at the significance level of α = 0.025. For the evaluation of the Type I error rates, two configurations were considered: (i) logR1 = 0 and logR2 = 0, and (ii) logR1 < 0 but logR2 = 0.
Use of NAN generally achieves the targeted power of 1 − β = 0.80, but the empirical power exceeds the target as R decreases and pC increases, especially when R ≤ 0.5 and pC = 0.5. When R ≤ 0.5 and pC = 0.5, the required sample size is less than 154 (77 per group), where 154 is the sample size calculated to detect a joint effect with R = 0.5 and pC = 0.5 with the targeted power of 0.80 at the significance level of 0.025 for a one-sided test, assuming zero correlation ρ12 = 0.0. The Type I error rate increases as the correlation goes toward one, and becomes greater than the nominal significance level of α = 0.025 as R decreases and pC increases, especially when R ≤ 0.5 and pC = 0.5, in both hypothesis settings (i) and (ii). As shown in the supplemental document, the empirical power for NUP is larger than the targeted power, while the empirical power for NPL is less than the targeted power. However, the Type I error rates for NUP and NPL behave similarly to those seen for NAN: they are greater than the nominal significance level of α = 0.025 when R is small and pC is large, especially when R ≤ 0.5 and pC = 0.5.
Figure 2 displays the ratio of the required sample sizes calculated using the normal approximation to that resulting from the Monte-Carlo simulation-based approach, with a common correlation of ρ12 =0.0, 0.3, 0.5, 0.8 and 0.99, in the cases of R1 =0.8, 0.7, 0.6, 0.5 and 0.4 with pC1 =0.5, 0.2, 0.10 and 0.05, and R1 = R2 = R and pC1 = pC2 = pC, for K = 2, with the desired overall power of 1 − β =0.80 for a one-sided test at the significance level of α =0.025. The ratio is about one when R =0.8 and 0.7. The ratio is slightly larger than one when R =0.6, however, it is much larger than one when R ≤0.5. The most obvious difference occurs when pC =0.5. The absolute difference in the two sample sizes per intervention group lies between 0 and 32.
Figure 2.
Behaviors of the ratio of the required sample size calculated by the normal approximation (NAN) to that by the Monte-Carlo simulation-based approach, with correlation ρ12 = 0.0, 0.3, 0.5, 0.8 and 0.99, in the cases of R1 = 0.8, 0.7, 0.6, 0.5 and 0.4 with pC1 = 0.5, 0.2, 0.10 and 0.05, where R1 = R2 = R and pC1 = pC2 = pC, and equally sized groups (r = 0.5) for K = 2, with the desired overall power of 1 − β = 0.80 for a one-sided test at the significance level of α = 0.025.
2.4 Example
We illustrate the methodology with an example from the PLACIDE study (Allen et al., 2012; Allen et al., 2013) described in the Introduction. Recall that the study was designed to evaluate if the administration of a probiotic comprising two strains of lactobacilli and two strains of bifidobacteria alongside antibiotic treatment prevents antibiotic-associated diarrhea. The co-primary outcomes were (1) the occurrence of antibiotic-associated diarrhoea (AAD) within 8 weeks and (2) the occurrence of C. difficile diarrhoea (CDD) within 12 weeks of recruitment. The contrast measures for the efficacy analysis were the relative risks of these two binary outcomes. The original sample size of 2,478 participants (1,239 in each group) was derived to detect a 50% reduction in CDD in the probiotic group compared with the placebo group, with a desired power of 0.80 using a two-sided test at the 5% significance level, assuming a CDD frequency of 4% in the placebo group. Allen et al. (2013) also mention that a trial of this size would provide a power of more than 0.99 to detect a 50% reduction in AAD (assuming an AAD frequency of 10% in the placebo group) and a power of 0.90 to detect a 25% reduction in AAD (assuming an AAD frequency of 15% in the placebo group). However, if the objective was to evaluate the efficacy of the new intervention based on both endpoints, CDD and AAD, then the power for detecting the joint effects on both endpoints was 0.792 (0.8 × 0.99) when assuming a 50% reduction in AAD with a frequency of 10% in the placebo group, and 0.72 (0.8 × 0.9) when assuming a 25% reduction in AAD with a frequency of 15% in the placebo group, when the correlation between the two endpoints was assumed to be zero. Thus the planned sample size would be insufficient to detect a joint effect with the overall power of 0.80.
Table 1 displays the required sample size for detecting a joint effect on both endpoints with the overall power of 1 − β = 0.80 and 0.90 at the one-sided significance level of α = 0.025, based on their original assumptions: A1) 1. CDD: 50% reduction with a CDD frequency of 4% in placebo, and 2. AAD: 50% reduction with an AAD frequency of 10% in placebo; and A2) 1. CDD: 50% reduction with a CDD frequency of 4% in placebo, and 2. AAD: 25% reduction with an AAD frequency of 15% in placebo, assuming a common correlation between the two endpoints in both intervention groups.
Table 1.
Total sample size (equally sized groups; r =0.5) required for evaluating the joint effects of both relative risks with an overall power of 1 − β =0.80 and 0.90 for a one-sided test at the significance level of α =0.025
| Assumption | Overall power | Sample size | ρ12 = 0.0 | ρ12 = 0.3 | ρ12 = 0.5 | ρ12 = 0.8 | ρ12 = 0.99 |
|---|---|---|---|---|---|---|---|
| A1: 1. CDD: 50% reduction with a CDD frequency of 4% in placebo; 2. AAD: 50% reduction with an AAD frequency of 10% in placebo | 0.80 | Normal approximation | 2,222 | 2,212 | 2,204 | 2,198 | 2,198 |
| | | (Empirical power) | (0.796) | (0.796) | (0.794) | (0.793) | (0.793) |
| | | Simulation-based | 2,260 | 2,240 | 2,238 | 2,234 | 2,234 |
| | | (Empirical power) | (0.802) | (0.800) | (0.801) | (0.802) | (0.802) |
| | 0.90 | Normal approximation | 2,978 | 2,976 | 2,974 | 2,972 | 2,972 |
| | | (Empirical power) | (0.902) | (0.898) | (0.900) | (0.900) | (0.900) |
| | | Simulation-based | 2,968 | 2,972 | 2,974 | 2,972 | 2,972 |
| | | (Empirical power) | (0.900) | (0.900) | (0.900) | (0.900) | (0.900) |
| A2: 1. CDD: 50% reduction with a CDD frequency of 4% in placebo; 2. AAD: 25% reduction with an AAD frequency of 15% in placebo | 0.80 | Normal approximation | 3,130 | 3,052 | 3,014 | 3,014 | 3,014 |
| | | (Empirical power) | (0.799) | (0.799) | (0.798) | (0.798) | (0.798) |
| | | Simulation-based | 3,134 | 3,060 | 3,022 | 3,022 | 3,022 |
| | | (Empirical power) | (0.802) | (0.802) | (0.800) | (0.800) | (0.800) |
| | 0.90 | Normal approximation | 3,940 | 3,886 | 3,858 | 3,858 | 3,858 |
| | | (Empirical power) | (0.902) | (0.900) | (0.900) | (0.900) | (0.900) |
| | | Simulation-based | 3,928 | 3,886 | 3,856 | 3,856 | 3,856 |
| | | (Empirical power) | (0.900) | (0.900) | (0.900) | (0.900) | (0.900) |
Empirical powers under the given sample sizes were computed using 100,000 Monte-Carlo trials; the Monte-Carlo simulation-based sample sizes were also calculated using 100,000 Monte-Carlo trials.
Based on A1, when ρ12 = 0.0, 2,222 participants are required to provide a power of 0.80 (a conservative sample size, since assuming zero correlation is conservative). The required total sample size decreases from 2,212 to 2,198 participants as the correlation varies from ρ12 = 0.3 to 0.99. Based on A2, 3,130 participants are required when ρ12 = 0.0, and the required sample size decreases from 3,052 to 3,014 participants as the correlation varies from ρ12 = 0.3 to 0.99. The empirical powers under the calculated sample sizes achieve the desired overall power, and there is no notable difference between the required sample sizes resulting from the normal approximation and the Monte-Carlo simulation-based approach, as the maximum difference in total sample size is modest (i.e., A1: 38 and 10 for 1 − β = 0.8 and 0.9; A2: 8 and 12 for 1 − β = 0.8 and 0.9).
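For illustration, the sketch from Section 2.2 can be applied to assumption A1 as follows; because of rounding and numerical error in the multivariate normal distribution function, the printed values should be close to, but need not exactly match, the Table 1 entries.

```python
# PLACIDE assumption A1: CDD (50% reduction from 4%) and AAD (50% reduction from 10%),
# equal allocation, one-sided alpha = 0.025, target overall power 0.80.
R, pC = [0.5, 0.5], [0.04, 0.10]
for rho12 in [0.0, 0.3, 0.5, 0.8, 0.99]:
    rho = [[1.0, rho12], [rho12, 1.0]]
    N = sample_size_nan(R, pC, rho, r=0.5, alpha=0.025, power=0.80)
    print(rho12, N, round(float(conjunctive_power(N, R, pC, rho)), 3))
```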
3. A conservative procedure for sample size calculation
We consider a conservative sample size strategy when evaluating significance for all of the relative risks, by using a suggestion from Hung and Wang (2009). Consider a scenario where there are K relative risks as co-primary. Assuming an unpooled variance estimate for simplicity, with common values of the relative risks and control proportions (i.e., R = R1 = ⋯ = RK and pC = pC1 = ⋯ = pCK) and hence a common variance v = (1 − pT)/(rpT) + (1 − pC)/{(1 − r)pC} with pT = RpC, if letting γ = 1 − (1 − β)1/K, then a conservative sample size NC is the smallest integer not less than

(zα + zγ)²v/(logR)²,

where zα and zγ are the 100(1 − α)th and 100(1 − γ)th percentiles of the standard normal distribution, respectively.
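A minimal sketch of this conservative calculation (the function name and the default values of r, α and β are illustrative):

```python
import math
from scipy.stats import norm

def conservative_sample_size(R, pC, K, r=0.5, alpha=0.025, beta=0.20):
    """Conservative total sample size NC: power each endpoint at (1 - beta)**(1/K)
    using the single-endpoint, unpooled-variance formula for the log relative risk."""
    gamma = 1 - (1 - beta) ** (1.0 / K)
    pT = R * pC
    v = (1 - pT) / (r * pT) + (1 - pC) / ((1 - r) * pC)   # per-N variance of log R-hat
    n = (norm.ppf(1 - alpha) + norm.ppf(1 - gamma)) ** 2 * v / math.log(R) ** 2
    return math.ceil(n)

# e.g. K = 2 endpoints with a common R = 0.5 and pC = 0.10:
# conservative_sample_size(0.5, 0.10, K=2)
```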
Furthermore, we discuss guidance on when and how the simplified equation can be used. Now calculate the total sample size NAN(R1) required to detect the reduction in relative risk R1 with the individual power 1 − γ at the significance level of α, assuming zero correlations among the endpoints. The overall power 1 − β under NAN(R1) is then approximately

∏k=1,…,K Φ{(zα + zγ)Ek/E1 − zα},

where NAN(R1) is approximately (zα + zγ)²/E1² and Φ(·) is the cumulative distribution function of the univariate standard normal distribution. In addition, Ek (k = 1, …, K) is the standardized effect size given by

Ek = −logRk/√vk = −logRk/√{(1 − pTk)/(rpTk) + (1 − pCk)/((1 − r)pCk)}.
Therefore, the overall power can be expressed as a function of the ratio of the standardized effect sizes.
Figure 3A illustrates the behavior of the overall power as a function of E2/E1 for a given sample size with two co-primary relative risks where the sample size (equally sized groups) was calculated to detect the reduction in relative risk R1 with the individual power of 0.81/2 = 0.894 for a one-sided test at the significance level of α =0.025. The overall power increases toward 0.894 as the ratio of E2/E1 increases. In particular when the ratio of E2/E1 is greater than 1.639, the overall power reaches 0.894. This is because the individual power for R2 is very close to one under the given sample size calculated for R1.
Figure 3B illustrates the behavior of the overall power as a function of E2/E1 and E3/E1 for a given sample size with three co-primary relative risks, where the sample size (equally sized groups) was calculated to detect the reduction in relative risk R1 with the individual power of (0.8)1/3 = 0.928 for a one-sided test at the significance level of α = 0.025. The overall power increases toward 0.928 as both of the ratios E2/E1 and E3/E1 increase. In particular, when both E2/E1 and E3/E1 are greater than 1.618, the overall power reaches 0.928. This is because both of the individual powers for R2 and R3 are very close to one under the given sample size calculated for R1. In this situation, the required sample size greatly depends on the smallest reduction. If we observe a large difference among the values of Ek, then we could calculate the conservative sample size as the smallest integer not less than

(zα + zβ)²/min(E1², …, EK²),

where zβ is the 100(1 − β)th percentile of the standard normal distribution.
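The dependence of the overall power on the effect-size ratios can be computed directly, as in the following sketch of the product expression given above under zero correlation (names and defaults are illustrative). With a single ratio of 1.0 it returns approximately 1 − β, and large ratios approach the individual power (1 − β)1/K, as in Figure 3.

```python
import numpy as np
from scipy.stats import norm

def overall_power_from_ratios(ratios, K, alpha=0.025, beta=0.20):
    """Overall power under N_AN(R1) as a function of E_k/E_1 (k = 2, ..., K),
    assuming zero correlations among the endpoints."""
    gamma = 1 - (1 - beta) ** (1.0 / K)            # individual power target is 1 - gamma
    za, zg = norm.ppf(1 - alpha), norm.ppf(1 - gamma)
    terms = norm.cdf((za + zg) * np.asarray(ratios, float) - za)
    return (1 - gamma) * np.prod(terms)            # the k = 1 factor equals 1 - gamma

# overall_power_from_ratios([1.0], K=2)    -> about 0.80
# overall_power_from_ratios([1.639], K=2)  -> about (0.8)**0.5 = 0.894 (cf. Figure 3A)
```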
4. An extension to group-sequential designs
As previously noted, the standard methods for sizing trials with co-primary endpoints often result in large sample sizes due to the conservative nature of the testing procedure, even when the correlations among the endpoints are incorporated into the calculation. To improve efficiency, researchers may consider interim analyses to evaluate if the research questions can be answered with fewer trial participants or shorter follow-up. In this section, we extend the methods discussed in Section 2 to a group-sequential setting, allowing for the possibility of stopping a trial early when early evidence is overwhelming. We consider the scenario of a randomized clinical trial comparing two interventions with two binary endpoints being evaluated as co-primary. Suppose that a maximum of L analyses are planned and the two endpoints are analyzed at the same interim timepoints. For more flexible group-sequential designs for clinical trials with co-primary endpoints, please see Asakura et al. (2014, 2015) and Hamasaki et al. (2015).
Let Nl be the cumulative total number of participants at the lth analysis (l = 1, …, L). Hence, up to rNL and (1 − r)NL participants are recruited and randomly assigned to the test and the control intervention groups, respectively. We are interested in conducting a hypothesis test to evaluate if the test intervention is superior to the control, i.e., the hypotheses H0: logR1 ≥ 0 or logR2 ≥ 0 versus H1: logR1 < 0 and logR2 < 0. Let (Z1l, Z2l) be the statistics for testing the hypotheses at the lth analysis, given by Zkl = logR̂kl/√V̂kl, where R̂kl = p̂Tkl/p̂Ckl, p̂Tkl = (rNl)−1YTkl and p̂Ckl = ((1 − r)Nl)−1YCkl are the observed proportions at the lth analysis, YTkl and YCkl are the numbers of successes for the kth endpoint among the first rNl test and (1 − r)Nl control participants, and V̂kl is the variance estimate of logR̂kl defined as in Section 2.2 with N replaced by Nl. The null hypothesis H0 can be rejected if and only if superiority is achieved for the two endpoints simultaneously (i.e., at the same interim timepoint of the trial). If superiority is demonstrated on only one endpoint at a particular interim analysis, then the trial continues and the hypothesis testing is repeated for both endpoints until joint significance for the two endpoints is established simultaneously. The stopping rule is formally given as follows:
- At the lth analysis (l = 1, …, L − 1)
- If Z1l < −z1l and Z2l < −z2l, then reject H0 and stop the trial,
- otherwise, continue to the (l + 1)th analysis,
- at the Lth analysis,
- if Z1L < −z1L and Z2L < −z2L, then reject H0,
- otherwise, do not reject H0,
where z1l and z2l are the critical values for the two endpoints, selected separately using any group-sequential method, such as the Lan-DeMets (LD) alpha-spending method (Lan and DeMets, 1983), to control the overall Type I error rate for each endpoint as if it were a single primary endpoint, ignoring the other co-primary endpoint. The power is approximately
1 − β ≈ P[∪l=1,…,L ({Z1l < −z1l} ∩ {Z2l < −z2l}) | H1],   (3)

where the probability is evaluated under the joint distribution of (Z11, Z21, …, Z1L, Z2L). For large samples, we assume that this joint distribution is approximately 2L-variate normal, with correlations given by

corr[Zkl, Zkl′] ≈ √(Nl/Nl′) and corr[Z1l, Z2l′] ≈ γ(12)√(Nl/Nl′) for l ≤ l′ (k = 1, 2),

where γ(12) is the correlation between the two log-transformed relative risks given in Section 2.1.
Based on the power (3), the two types of sample size, i.e., the maximum sample size (MSS) and the average sample number (ASN) can be calculated. The MSS is the sample size required for the final analysis to achieve the desired power 1 − β. The MSS is the smallest integer not less than NL satisfying the power (3) for a group-sequential design at the prespecified R1, R2, pC1, pC2 and ρ(12), with Fisher’s information time for the interim analyses Nl/NL(l = 1, …, L). The ASN is the expected sample size under a specific hypothetical reference. Similarly as in the fixed sample size designs discussed in Section 2, the closed form solutions for the MSS and ASN are not available, and thus an iterative program is required to find these sample sizes. For more details, please see Asakura et al. (2014).
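Because no closed-form solutions are available, the operating characteristics can be approximated numerically. The sketch below evaluates the overall power, the ASN, and the expected number of analyses by Monte-Carlo simulation of the 2L-variate normal approximation described above. The critical values are taken as given (for example, computed with standard group-sequential software implementing the LD alpha-spending method), the same boundary is applied to both endpoints, and the pooled-variance form of the test statistics from Section 2.2 is assumed; the function name, defaults, and the simulation-based evaluation itself are illustrative choices.

```python
import numpy as np
from scipy.stats import norm

def gs_operating_characteristics(NL, info, crit, R, pC, rho12, r=0.5,
                                 nrep=20000, seed=2015):
    """Monte-Carlo sketch of a two-endpoint group-sequential co-primary design.

    NL    : maximum total sample size (candidate MSS)
    info  : information fractions N_l / N_L, e.g. [0.25, 0.5, 0.75, 1.0]
    crit  : critical values z_l per analysis (applied to both endpoints)
    Returns (overall power, ASN, expected number of analyses).
    """
    rng = np.random.default_rng(seed)
    info = np.asarray(info, float); L = len(info); Nl = info * NL
    R, pC = np.asarray(R, float), np.asarray(pC, float); pT = R * pC
    vT = (1 - pT) / (r * pT); vC = (1 - pC) / ((1 - r) * pC); v = vT + vC
    pbar = r * pT + (1 - r) * pC
    vbar = (1 - pbar) / (r * (1 - r) * pbar)
    g12 = rho12 * (np.sqrt(vT[0] * vT[1]) + np.sqrt(vC[0] * vC[1])) / np.sqrt(v[0] * v[1])
    # correlation matrix of the 2L standardized statistics (endpoint-within-analysis blocks)
    C = np.empty((2 * L, 2 * L))
    for l in range(L):
        for m in range(L):
            s = np.sqrt(min(Nl[l], Nl[m]) / max(Nl[l], Nl[m]))
            C[2 * l:2 * l + 2, 2 * m:2 * m + 2] = s * np.array([[1.0, g12], [g12, 1.0]])
    W = rng.multivariate_normal(np.zeros(2 * L), C, size=nrep)
    power = n_used = analyses = 0.0
    for w in W:
        for l in range(L):
            Z = (np.sqrt(Nl[l]) * np.log(R) + np.sqrt(v) * w[2 * l:2 * l + 2]) / np.sqrt(vbar)
            if np.all(Z < -crit[l]):
                power += 1; n_used += Nl[l]; analyses += l + 1
                break
        else:
            n_used += Nl[-1]; analyses += L
    return power / nrep, n_used / nrep, analyses / nrep
```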
For illustration, consider the PLACIDE study again. Table 2 provides the MSS and ASN, based on the assumption A1. The MSS and ASN were calculated to detect the joint effect in the two endpoints with the overall power of 1 − β =0.80 at the one-sided significance level of α =0.025, where ρ(12) =0.0, 0.3, 0.5, 0.8 and 0.99, L =2, 3, 4 and 5, and r =0.5. The critical values are determined by the O’Brien–Fleming testing procedure (O’Brien and Fleming, 1979) for both endpoints, with the LD alpha-spending method with equally spaced information time.
Table 2.
The MSS and ASN (equally-sized groups), and expected number of stops for detecting the joint effects for both endpoints based on the assumption A1 from the PLACIDE clinical trial, with the overall power of 1 − β =0.80 at the one-sided significance level of α =0.025. The critical values are determined by the O’Brien-Fleming testing procedure for both endpoints, with the LD alpha-spending method with equally spaced information time.
| Correlation | Number of analyses | MSS | ASN | Expected number of stops |
|---|---|---|---|---|
| 0.0 | 2 | 2,232 | 2,099 | 1.88 |
| | 3 | 2,250 | 1,950 | 2.60 |
| | 4 | 2,264 | 1,894 | 3.35 |
| | 5 | 2,270 | 1,858 | 4.09 |
| 0.3 | 2 | 2,220 | 2,056 | 1.85 |
| | 3 | 2,238 | 1,921 | 2.58 |
| | 4 | 2,248 | 1,859 | 3.31 |
| | 5 | 2,260 | 1,827 | 4.04 |
| 0.5 | 2 | 2,212 | 2,030 | 1.84 |
| | 3 | 2,226 | 1,899 | 2.56 |
| | 4 | 2,240 | 1,839 | 3.28 |
| | 5 | 2,250 | 1,805 | 4.01 |
| 0.8 | 2 | 2,200 | 1,997 | 1.82 |
| | 3 | 2,220 | 1,879 | 2.54 |
| | 4 | 2,232 | 1,816 | 3.25 |
| | 5 | 2,240 | 1,780 | 3.97 |
| 0.99 | 2 | 2,200 | 1,993 | 1.81 |
| | 3 | 2,214 | 1,872 | 2.54 |
| | 4 | 2,232 | 1,813 | 3.25 |
| | 5 | 2,240 | 1,777 | 3.97 |
Based on A1 with a fixed sample size design (i.e., L = 1) and ρ(12) = 0.0, the required sample size is 2,222, as shown in Table 1. If four interim and one final analyses are planned (i.e., L = 5), and zero correlation between the endpoints is conservatively assumed, then the MSS is 2,270 and the ASN is 1,858. If the correlation is incorporated into the calculation with ρ(12) = 0.3, 0.5, 0.8, and 0.99, then the MSS is 2,260, 2,250, 2,240, and 2,240, respectively, and the ASN is 1,827, 1,805, 1,780, and 1,777, respectively. The MSS increases as the number of analyses increases and as the correlation decreases, while the ASN decreases as the number of analyses increases and as the correlation increases.
Figure 4 summarizes the probability of rejecting/not rejecting the null hypothesis when ρ(12) =0.0, 0.3, 0.5, 0.8 and 0.99 and L =2, 3, 4 and 5. The figure illustrates that the method offers the possibility to stop a trial early if early evidence is overwhelming and thus potentially requiring fewer patients than the fixed sample size designs. When ρ(12) =0.0, it is difficult to reject the null hypothesis at the earlier analyses, but easier later on. On the other hand, as ρ(12) goes toward one, it is easier to reject the null hypothesis at the earlier analyses.
Figure 4.
The probability of rejecting/not rejecting the null hypothesis when L = 2, 3, 4 and 5. The MSS and ASN (equally sized groups) were calculated to detect the joint effect on both endpoints with the overall power of 1 − β = 0.80 at the one-sided significance level of α = 0.025. The critical values are determined by the O’Brien-Fleming testing procedure for both endpoints with the LD alpha-spending method with equally spaced information time.
5. Summary
Traditionally in a clinical trial, a single primary endpoint is selected and used as the basis for the design, interim data monitoring, final analyses, and the reporting of the trial. However, assessment of an intervention using a single endpoint may not provide a comprehensive picture of the important effects of the intervention. Co-primary endpoints offer an attractive design feature as they capture a more complete characterization of the effect of an intervention. For this reason, co-primary endpoints are becoming a common design feature in many clinical trials. However, these co-primary endpoints are potentially correlated, creating complexities in the evaluation of power and sample size when designing such clinical trials, specifically relating to control of the Type I and Type II error rates (Gong et al., 2000; Senn and Bretz, 2007; Hung and Wang, 2009; Dmitrienko et al., 2010). It is important to note the distinction between multiple co-primary endpoints and multiple primary endpoints, of which many researchers are unaware (Offen et al., 2007).
In this paper, we discuss sample size determination for clinical trials using multiple correlated binary relative risks being evaluated as co-primary contrasts. We consider a scenario where the objective is to evaluate the superiority of the test intervention compared with the control intervention for all of the relative risks. We evaluate the normal approximation methodology and its utility via Monte-Carlo simulation. We then extend the methodology to group-sequential designs. We summarize our findings as follows:
The sample size and power formulas presented here address the challenges associated with sizing trials with potentially correlated co-primary binary endpoints when evaluating effects using relative risks. Incorporating the correlations among the endpoints into the power and sample size calculations leads to increased power and effectively reduces the required sample size. As the correlations are usually unknown at the design stage, incorporating them into the calculations should be done carefully. As the number of co-primary endpoints increases, the complexity associated with the sample size calculations also increases. There may also be range restrictions on the correlations, depending on the marginal probabilities, which can adversely affect the attainable power and the required sample size.
If the standardized effect size for one endpoint, as measured by the relative risk, is substantially smaller (roughly 33% smaller) than the others, then the advantage of incorporating the correlation into the sample size calculation is lessened. The required sample size is primarily determined by the smallest standardized effect size and does not greatly depend on the correlation. In this scenario, the equation for the sample size can be simplified using the equation for a single endpoint without adjustment for the power.
The simulation results suggest that the normal approximation method may work well in most practical situations. However, when the relative risk is 0.5 or smaller and the control rate is relatively large (so that the required sample size is small), the empirical power is slightly above the targeted power and the Type I error rate is slightly inflated. In this scenario, the normal approximation does not work well and cannot be recommended, as the Type I error rates are inflated. An alternative is Monte-Carlo simulation, a computationally intensive approach for power evaluation and sample size calculation. Our experience suggests that it may require users to have considerable mathematical sophistication and appropriate programming knowledge of methods for generating correlated binary data. When using Monte-Carlo simulation, the number of replications should be chosen carefully to control the simulation error in the calculation of the empirical power.
Supplementary Material
Acknowledgements
Research reported in this publication was supported by JSPS KAKENHI under Grant Number 26330038 and the National Institute of Allergy and Infectious Diseases of the National Institutes of Health under Award Numbers UM1AI104681 and UM1AI068634. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
Appendix A: Calculation of the correlation of the log-transformed relative risks
Using the delta method, for large samples, we have an approximation of the covariance between the two log-transformed relative risks logR̂k and logR̂k′ (k = 1, …, K; 1 ≤ k < k′ ≤ K; K ≥ 2):

cov[logR̂k, logR̂k′] ≈ ρ(kk′)[(1/(rN))√{(1 − pTk)(1 − pTk′)/(pTkpTk′)} + (1/((1 − r)N))√{(1 − pCk)(1 − pCk′)/(pCkpCk′)}].

As the variance of logR̂k is approximated by var[logR̂k] ≈ (1 − pTk)/(rNpTk) + (1 − pCk)/{(1 − r)NpCk} (for example, please see Fleiss et al. (2003)), for large samples, we have the correlation between the two log-transformed relative risks

γ(kk′) ≈ cov[logR̂k, logR̂k′]/√(var[logR̂k]var[logR̂k′]),

in which the total sample size N cancels.
Appendix B: Algorithm for the sample size calculation
When using the methods discussed in Section 2, an iterative procedure is required to identify the sample size that achieves the desired overall power. The easiest method involves a grid search that increases sample size gradually until the power exceeds the desired overall power. However, this method requires considerable computing time. Sugimoto et al. (2012) consider a faster Newton–Raphson algorithm with a convenient formula for N. Another faster but simpler method is to use linear interpolation to identify N (Hamasaki et al., 2013). We briefly describe this algorithm using linear interpolation as follows:
Step 0: Select the values of the relative risks Rk (k = 1, …, K) with their correlations ρ(kk′), and the significance level for the one-sided test α and the desired power 1 − β.
Step 1: Select the two initial values N0 and N1. Then, calculate the corresponding overall powers 1 − β(N0) and 1 − β(N1) from Equation (2).
Step 2: Update the value of N using the following linear interpolation (m = 1, 2, …):
Nm+1 = Nm + {(1 − β) − (1 − β(Nm))}(Nm − Nm−1)/{(1 − β(Nm)) − (1 − β(Nm−1))}.
Step 3: Evaluate the overall power 1 − β(Nm+1) with the updated value Nm+1. Note that Nm+1 is rounded up if it is not an integer.
Step 4: If Nm+1 − Nm = 0, then the iteration stops with Nm+1 as the final value. If not, then return to Step 2.
Options for the two initial values N0 and N1 are the smallest of the sample sizes calculated for detecting each relative risk Rk with the power 1 − β at the significance level of α using a sample size equation for a single relative risk (e.g., Chow et al. (2007)), and the largest of the sample sizes calculated for detecting each relative risk Rk with the power (1 − β)1/K at the significance level of α. This is because N lies between these two values. In our experience, the iterative procedure tends to converge in a few steps.
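A sketch of this interpolation search is given below, assuming a function power_fn(N) that evaluates the overall power (for example, conjunctive_power from Section 2.2). The convergence guards and the final step-up to guarantee that the returned N meets the target power are added for robustness and are not part of the algorithm above.

```python
import numpy as np

def interpolated_sample_size(power_fn, N0, N1, target=0.80, max_iter=50):
    """Linear-interpolation (secant) search for the smallest N with power_fn(N) >= target."""
    n_prev, n_cur = N0, N1
    f_prev, f_cur = power_fn(n_prev) - target, power_fn(n_cur) - target
    for _ in range(max_iter):
        if f_cur == f_prev:                       # flat segment: stop interpolating
            break
        n_next = int(np.ceil(n_cur - f_cur * (n_cur - n_prev) / (f_cur - f_prev)))
        if n_next == n_cur:                       # Step 4: no change in N, stop
            break
        n_prev, f_prev = n_cur, f_cur
        n_cur, f_cur = n_next, power_fn(n_next) - target
    while power_fn(n_cur) < target:               # ensure the returned N meets the target
        n_cur += 1
    return n_cur

# e.g. N = interpolated_sample_size(lambda N: conjunctive_power(N, R, pC, rho), N0, N1)
```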
Appendix C: Sample size calculation for the odds ratio
We outline methodology for the calculation of the sample size for the detection of the joint effects in K correlated odds ratios (K ≥ 2). We now have the K log-transformed (observed) odds ratios logψ̂k (k = 1, …, K), where ψ̂k = ÔTk/ÔCk with ÔTk = p̂Tk/(1 − p̂Tk) and ÔCk = p̂Ck/(1 − p̂Ck). For large samples, the joint distribution of (logψ̂1, …, logψ̂K) is approximately K-variate normal with mean vector μ = (logψ1, …, logψK)T and covariance matrix whose elements are

var[logψ̂k] ≈ 1/{rNpTk(1 − pTk)} + 1/{(1 − r)NpCk(1 − pCk)},

cov[logψ̂k, logψ̂k′] ≈ ρ(kk′)[1/{rN√(pTk(1 − pTk)pTk′(1 − pTk′))} + 1/{(1 − r)N√(pCk(1 − pCk)pCk′(1 − pCk′))}],

where ψk = OTk/OCk with OTk = pTk/(1 − pTk) and OCk = pCk/(1 − pCk). Thus, assuming ρ(kk′) = corr[YTki, YTk′i] = corr[YCkj, YCk′j] (1 ≤ k < k′ ≤ K), the correlation between the log-transformed odds ratios is given by

corr[logψ̂k, logψ̂k′] ≈ cov[logψ̂k, logψ̂k′]/√(var[logψ̂k]var[logψ̂k′]).
Let Zk be the test statistic for the log-transformed odds ratio logψ̂k, given by Zk = logψ̂k/√V̂ψk, where V̂ψk is an estimate of var[logψ̂k]. When requiring joint significance for all of the correlated odds ratios, the overall power is approximately given by

1 − β ≈ ΦK(c1, …, cK; Γψ),

where ck is defined analogously to Equation (2) with logRk replaced by logψk and with the pooled and unpooled variances of logR̂k replaced by the corresponding variances of logψ̂k, and Γψ is the correlation matrix of (logψ̂1, …, logψ̂K). The required total sample size N for the detection of the joint effects in all odds ratios with the overall power 1 − β at the significance level α is the smallest integer satisfying this power function.
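Only the variance and covariance expressions change for the odds ratio; a sketch of the analogue of corr_log_rr from Section 2.1 is:

```python
import numpy as np

def corr_log_or(pT, pC, rho, r=0.5):
    """Delta-method correlation matrix and per-N variances of the log odds ratios."""
    pT, pC, rho = np.asarray(pT, float), np.asarray(pC, float), np.asarray(rho, float)
    vT = 1.0 / (r * pT * (1 - pT))          # test-group contribution to N * var[log psi-hat_k]
    vC = 1.0 / ((1 - r) * pC * (1 - pC))    # control-group contribution
    v = vT + vC
    cov = rho * (np.sqrt(np.outer(vT, vT)) + np.sqrt(np.outer(vC, vC)))
    np.fill_diagonal(cov, v)
    return cov / np.sqrt(np.outer(v, v)), v
```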
Footnotes
Views expressed in this paper are the author’s professional opinions and do not necessarily represent the official positions of the Pharmaceuticals and Medical Devices Agency, Japan.
References
- 1. Allen SJ, Wareham K, Bradley C, Harris W, Dhar A, Brown H, Foden A, Cheung WY, Gravenor MB, Plummer S, Phillips CJ, Mack D. A Multicentre Randomised Controlled Trial Evaluating Lactobacilli and Bifidobacteria in the Prevention of Antibiotic-Associated Diarrhoea in Older People Admitted to Hospital: The PLACIDE Study Protocol. BMC Infectious Diseases. 2012;12:108. doi: 10.1186/1471-2334-12-108.
- 2. Allen SJ, Wareham K, Wang D, Bradley C, Hutchings H, Harris W, Dhar A, Brown H, Foden A, Gravenor MB, Mack D. Lactobacilli and Bifidobacteria in the Prevention of Antibiotic-Associated Diarrhoea and Clostridium Difficile Diarrhoea in Older Inpatients (PLACIDE): A Randomised, Double-Blind, Placebo-Controlled, Multicentre Trial. The Lancet. 2013;382:1249–1257. doi: 10.1016/S0140-6736(13)61218-0.
- 3. American College of Gastroenterology. Understanding Irritable Bowel Syndrome. 2013. www.patients.gi.org/gi-health-and-disease/understanding-irritable-bowel-syndrome (accessed December 4, 2014).
- 4. Asakura K, Hamasaki T, Sugimoto T, Hayashi K, Evans SR, Sozu T. Sample Size Determination in Group-Sequential Clinical Trials with Two Co-Primary Endpoints. Statistics in Medicine. 2014;33:2897–2913. doi: 10.1002/sim.6154.
- 5. Asakura K, Hamasaki T, Sugimoto T, Evans SR, Sozu T. Group-Sequential Designs When Considering Two Binary Outcomes as Co-Primary Endpoints. In: Chen Z, Liu A, Qu Y, Tang L, Ting N, Tsong Y, editors. Applied Statistics in Biomedicine and Clinical Trials Design. New York: Springer; 2015 (in press).
- 6. Bahadur RR. A Representation of the Joint Distribution of Responses to n Dichotomous Items. In: Solomon H, editor. Studies in Item Analysis and Prediction, Vol. VI, Stanford Mathematical Studies in the Social Sciences. Stanford, CA: Stanford University Press; 1961. pp. 158–168.
- 7. Chuang-Stein C, Stryszak P, Dmitrienko A, Offen W. Challenge of Multiple Co-Primary Endpoints: A New Approach. Statistics in Medicine. 2007;26:1181–1192. doi: 10.1002/sim.2604.
- 8. Chow SC, Shao J, Wang H. Sample Size Calculations in Clinical Research. 2nd edition. Boca Raton, FL: Chapman and Hall; 2007.
- 9. Committee for Medicinal Products for Human Use. Guideline on the Evaluation of Medicinal Products for the Treatment of Irritable Bowel Syndrome. CPMP/EWP/785/97 Rev. 1; 27 June 2013.
- 10. Dale JR. Global Cross-Ratio Models for Bivariate, Discrete, Ordered Responses. Biometrics. 1986;42:909–917.
- 11. Dmitrienko A, Tamhane AC, Bretz F. Multiple Testing Problems in Pharmaceutical Statistics. Boca Raton, FL: Chapman and Hall; 2010.
- 12. Eaton ML, Muirhead RJ. On Multiple Endpoints Testing Problem. Journal of Statistical Planning & Inference. 2007;137:3416–3429.
- 13. Emrich LJ, Piedmonte MR. A Method for Generating High-Dimensional Multivariate Binary Variates. American Statistician. 1991;45:302–304.
- 14. Fleiss JL, Levin B, Paik MC. Statistical Methods for Rates and Proportions. Third Edition. Hoboken, NJ: John Wiley & Sons; 2003.
- 15. Food and Drug Administration. Guidance for Industry. Irritable Bowel Syndrome: Clinical Evaluation of Products for Treatment. Rockville, MD: Center for Drug Evaluation and Research, Food and Drug Administration; 2012.
- 16. Grundmann O, Yoon SL. Irritable Bowel Syndrome: Epidemiology, Diagnosis, and Treatment: An Update for Health-Care Practitioners. Journal of Gastroenterology and Hepatology. 2010;25:691–699. doi: 10.1111/j.1440-1746.2009.06120.x.
- 17. Gong J, Pinheiro JC, DeMets DL. Estimating Significance Level and Power Comparisons for Testing Multiple Endpoints in Clinical Trials. Controlled Clinical Trials. 2000;21:323–329. doi: 10.1016/s0197-2456(00)00049-0.
- 18. Hamasaki T, Asakura K, Evans SR, Sugimoto T, Sozu T. Group-Sequential Strategies for Clinical Trials with Multiple Co-Primary Endpoints. Statistics in Biopharmaceutical Research. 2015 (in press; accepted 22 December 2014). doi: 10.1080/19466315.2014.1003090.
- 19. Hamasaki T, Sugimoto T, Evans SR, Sozu T. Sample Size Determination for Clinical Trials with Co-Primary Outcomes: Exponential Event-Times. Pharmaceutical Statistics. 2013;12:28–34. doi: 10.1002/pst.1545.
- 20. Hung HMJ, Wang SJ. Some Controversial Multiple Testing Problems in Regulatory Applications. Journal of Biopharmaceutical Statistics. 2009;19:1–11. doi: 10.1080/10543400802541693.
- 21. Julious SA, McIntyre NE. Sample Sizes for Trials Involving Multiple Correlated Must-Win Comparisons. Pharmaceutical Statistics. 2012;11:177–185. doi: 10.1002/pst.515.
- 22. Kordzakhia G, Siddiqui O, Huque MF. Method of Balanced Adjustment in Testing Co-Primary Endpoints. Statistics in Medicine. 2010;29:2055–2066. doi: 10.1002/sim.3950.
- 23. Lan KKG, DeMets DL. Discrete Sequential Boundaries for Clinical Trials. Biometrika. 1983;70:659–663.
- 24. Le Cessie S, van Houwelingen JC. Logistic Regression for Correlated Binary Data. Applied Statistics. 1994;43:95–108.
- 25. O’Brien PC, Fleming TR. A Multiple Testing Procedure for Clinical Trials. Biometrics. 1979;35:549–556.
- 26. Offen W, Chuang-Stein C, Dmitrienko A, Littman G, Maca J, Meyerson L, Muirhead R, Stryszak P, Boddy A, Chen K, Copley-Merriman K, Dere W, Givens S, Hall D, Henry D, Jackson JD, Krishen A, Liu T, Ryder S, Sankoh AJ, Wang J, Yeh CH. Multiple Co-Primary Endpoints: Medical and Statistical Solutions. Drug Information Journal. 2007;41:31–46.
- 27. Pearson K. Mathematical Contributions to the Theory of Evolution. VII. On the Correlation of Characters not Quantitatively Measurable. Philosophical Transactions of the Royal Society, Series A. 1900;19:1–47.
- 28. Pocock SJ. Group Sequential Methods in the Design and Analysis of Clinical Trials. Biometrika. 1977;64:191–199.
- 29. Prentice RL. Correlated Binary Regression with Covariates Specific to Each Binary Observation. Biometrics. 1988;44:1033–1048.
- 30. Senn S, Bretz F. Power and Sample Size when Multiple Endpoints Are Considered. Pharmaceutical Statistics. 2007;6:161–170. doi: 10.1002/pst.301.
- 31. Song JX. Sample Size for Simultaneous Testing of Rate Differences in Non-Inferiority Trials with Multiple Endpoints. Computational Statistics & Data Analysis. 2009;53:1201–1207.
- 32. Sozu T, Sugimoto T, Hamasaki T. Sample Size Determination in Clinical Trials with Multiple Co-Primary Binary Endpoints. Statistics in Medicine. 2010;29:2169–2179. doi: 10.1002/sim.3972.
- 33. Sozu T, Sugimoto T, Hamasaki T. Sample Size Determination in Superiority Clinical Trials with Multiple Co-Primary Correlated Endpoints. Journal of Biopharmaceutical Statistics. 2011;21:650–668. doi: 10.1080/10543406.2011.551329.
- 34. Sozu T, Sugimoto T, Hamasaki T, Evans SR. Sample Size Determination in Clinical Trials with Multiple Primary Endpoints. Springer; 2015 (in press).
- 35. Sugimoto T, Sozu T, Hamasaki T. A Convenient Formula for Sample Size Calculations in Clinical Trials with Multiple Co-Primary Continuous Endpoints. Pharmaceutical Statistics. 2012;11:118–128. doi: 10.1002/pst.505.
- 36. Sugimoto T, Sozu T, Hamasaki T, Evans SR. A Logrank Test-Based Method for Sizing Clinical Trials with Two Co-Primary Time-to-Events Endpoints. Biostatistics. 2013;14:409–421. doi: 10.1093/biostatistics/kxs057.
- 37. Xiong C, Yu K, Gao F, Yan Y, Zhang Z. Power and Sample Size for Clinical Trials when Efficacy is Required in Multiple Endpoints: Application to an Alzheimer’s Treatment Trial. Clinical Trials. 2005;2:387–393. doi: 10.1191/1740774505cn112oa.
- 38. Xu J, Yu M. Sample Size Determination and Re-Estimation for Matched Pair Designs with Multiple Binary Endpoints. Biometrical Journal. 2013;55:430–443. doi: 10.1002/bimj.201100231.