SUMMARY
We address the problem of establishing two-sided equivalence using paired-sample analysis of two treatments or two laboratory tests with a binary endpoint. Through real data examples and Monte Carlo simulations, we compare three commonly used testing parameters, namely the difference of response probabilities, the ratio of response probabilities, and the ratio of discordant probabilities, based on score test statistics for constructing equivalence hypothesis tests with paired binary data. We provide suggestions on the choice among these three testing parameters and on proper equivalence margins in the formulation of equivalence hypotheses. In addition, we describe the implementation of a group sequential design in the context of equivalence testing, with early stopping to reject as well as to declare equivalence.
Keywords: binary data, double one-sided testing, equivalence boundary, interim analysis, sample sizes
1 Introduction
As a result of recent advances in the medical and biological sciences, equivalence studies have been increasingly used in the evaluation of new treatments and new laboratory tests. The primary objective of these equivalence studies is to show that the efficacy or effectiveness of a new method (treatment or laboratory test) is not worse than that of the standard method in terms of a certain testing parameter. A testing parameter specifies the estimand of interest for a particular study, on the basis of which a hypothesis of equivalence is formulated; one example of a commonly used testing parameter is the difference of response rates. Unlike conventional hypothesis testing procedures that intend to show a difference (or unequivalence) stated in the alternative hypothesis, an equivalence testing procedure aims to reject a null hypothesis of unequivalence. Because failure to reject a conventional null hypothesis of no difference only indicates insufficient evidence to conclude a difference, a widely accepted strategy in equivalence testing is to specify an equivalence margin, δ, so that the two methods may be considered equivalent if the testing parameter of interest is shown to fall within δ.
For assessing the equivalence of two methods, paired-sample designs are commonly adopted to minimize possible confounding and to increase the efficiency of comparisons. For example, to reduce the variability of a comparison, subjects are asked to take both a new and the standard treatment in a cross-over clinical trial, or specimens from the same individual are split into two and tested by an alternative and the standard laboratory test. When the study endpoint is dichotomous, different testing parameters have been discussed in the literature. For example, because of its simplicity and ease of interpretation, the difference of response probabilities (i.e., rate difference, θd) has been a commonly used testing parameter in the biomedical field (e.g., [Lu and Bean(1995), Nam(1997), Tango(1998)]). However, researchers have also proposed equivalence testing procedures for binary paired-sample designs that use other testing parameters, such as the ratio of response probabilities (i.e., relative risk, θrr) (e.g., [Lachenbruch and Lynch(1998), Nian-Sheng Tang and Chan(2003)]) or the ratio of discordant probabilities (i.e., ratio of discordance, θrd) ([Liu JP and MC(2005)]). Although the choice of a testing parameter usually depends on the scientific relevance of the conclusion of an equivalence test, it is often unclear to researchers how to translate their knowledge of a particular treatment or laboratory test into an appropriate testing parameter. After choosing a testing parameter, it is sometimes even more challenging to set appropriate equivalence margins based on clinical and regulatory judgments. With these issues in mind, in the early part of this paper we evaluate testing procedures that use the three testing parameters θd, θrr and θrd, and provide suggestions on the choice of a testing parameter and its corresponding equivalence margins.
Because equivalence tests often require much larger sample sizes than ordinary tests to maintain the same type I and II error rates, it is of great interest to adopt group sequential designs that allow early stopping for futility (unequivalence) as well as for success (equivalence) by incorporating the evidence collected stage-wise in equivalence studies. A group sequential design consists of one or more interim analyses, which the investigator can use to determine whether to stop the trial with concrete evidence of success or failure. The ethical argument for early stopping is probably less compelling in the context of equivalence testing; however, such designs are greatly needed in situations where resources are too limited to achieve the desired power, or where one is very uncertain about critical assumptions of the design parameters. For example, to calculate the number of patients needed for an equivalence study of two treatments with a binary endpoint, it is necessary to specify either the response probability on the standard treatment, or the average response probability over the new and standard treatments combined. If these values are misspecified, the resulting fixed sample size test will be under- or over-powered. To this end, in the latter part of this paper, we describe the implementation of a group sequential design in the context of equivalence testing based on paired binary data, and we compare the expected sample size of such a design with that of a traditional fixed sample size design.
This paper focuses on two-sided equivalence hypotheses with symmetric equivalence margins. The principle of double one-sided approach is used to test the null hypothesis of un-equivalence. This approach is indicated in the International Conference on Harmonization (ICH) E9 Guideline “Biostatistical Principles for Clinical Trials,” which states that “Operationally, this (equivalence test) is equivalent to the method of using two simultaneous one-sided tests to test the (composite) null hypothesis that the treatment difference is outside the equivalence margins versus the (composite) alternative hypothesis that the treatment difference is within the margins.” Because of the symmetry of the hypothesis, one of the double one-sided tests is discussed in more detail in this paper.
2 Methods
In the context of paired-sample analysis, let M1k and M2k, k = 1,…,n, denote the binary (0 and 1) response random variables of the standard method (method 1) and a new method (method 2) based on n paired samples. Let π11, π10, π01 and π00 be the response probabilities of the (M1k, M2k) pairs (1, 1), (1, 0), (0, 1) and (0, 0), respectively, and let a, b, c and d be the corresponding observed counts of the four response pairs. Let p1 and p2 be the response probabilities for method 1 and method 2. A typical layout of the outcomes and marginal totals is presented in Table 1. To model the inter-subject variability of the response probabilities and their correlation, a covariance parameter, ϕ, for the response probabilities of method 1 and method 2 conditioning on matched sample characteristics is often introduced in reparameterized multinomial models. By assuming that M1k and M2k are mutually independent given each sample’s unobserved characteristics and that these unobserved characteristics are independent and identically distributed, [Tango(1998)] showed that in the reparameterized multinomial model the expected response probabilities over the matched sample characteristics for the four cells are given as π11 = p1p2 + ϕ, π01 = (1 − p1)p2 − ϕ, π10 = p1(1 − p2) − ϕ and π00 = (1 − p1)(1 − p2) + ϕ. To formulate the hypothesis of equivalence testing between two methods, three testing parameters have mainly been used in the medical field: the rate difference θd = p1 − p2, the rate ratio θrr = p2/p1 and the ratio of discordance θrd = π10/π01. In the case of two-sided equivalence testing, the notations for method 1 and method 2 are interchangeable. Hence, a testing procedure based on the aforementioned measures can easily be adapted to the corresponding measures θd′ = p2 − p1, θrr′ = p1/p2 and θrd′ = π01/π10 by switching the roles of method 1 and method 2.
In the following subsections, we focus on one version of the symmetry and consider the testing parameters θd, θrr and θrd in terms of the formulation of a testing hypothesis and its corresponding test statistic.
Table 1.
Layout of counts (probabilities) associated with testing of two methods using paired samples in a 2×2 table.
| | | Treatment 1 | | |
|---|---|---|---|---|
| | | + | − | Total |
| Treatment 2 | + | a (π11) | b (π01) | a + b (p2) |
| | − | c (π10) | d (π00) | c + d (1 − p2) |
| | Total | a + c (p1) | b + d (1 − p1) | n (1.0) |
2.1 Rate difference
To show that two treatments are equivalent based on the rate difference, θd, the hypothesis can be formulated as follows:

H0 : ∣θd∣ ≥ δd versus H1 : ∣θd∣ < δd, where 0 < δd < 1. (i)
To test the null hypothesis in (i), a “double one-sided test” ([Schuirmann(1987)]) would suggest conducting tests of the following two hypotheses:
H01 : θd ≥ δd versus H11 : θd < δd, (1a)

H02 : θd ≤ −δd versus H12 : θd > −δd. (1b)
Simultaneous rejection of the null hypotheses in (1a) and (1b) at a significance level of α implies a rejection of the null hypothesis in (i), also at level α [Wellek(2003)]. Using the double one-sided tests approach, the p-value of the equivalence test for (i) is the maximum of the p-values of the two one-sided tests (1a) and (1b).
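The double one-sided decision rule can be sketched as follows. The helper names are ours, and the one-sided p-values use the asymptotic standard normal distribution of the score statistics described below:

```python
import math

def one_sided_p(t):
    """Upper-tail p-value P(Z >= t) for a standard normal Z."""
    return 0.5 * math.erfc(t / math.sqrt(2.0))

def dost_p(t_1a, t_1b):
    """p-value of the equivalence test (i): the maximum of the two
    one-sided p-values from tests (1a) and (1b)."""
    return max(one_sided_p(t_1a), one_sided_p(t_1b))

def declare_equivalence(t_1a, t_1b, alpha=0.05):
    """Equivalence is declared only if both one-sided tests reject."""
    return dost_p(t_1a, t_1b) <= alpha
```

For example, the two rate-difference statistics of the first data example in section 3 (6.025 and 8.183) yield an equivalence p-value below 0.001.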
The same procedure applies to both (1a) and (1b) by switching the subscripts, so let us consider the hypothesis in (1a). Let βd = p2 − (p1 − δd). The hypothesis in (1a) is then equivalent to the following:

H01 : βd ≤ 0 versus H11 : βd > 0.
To derive the score statistic for testing the above hypothesis in terms of βd, the log-likelihood of the re-parameterized multinomial model for a random sample with the data structure given in Table 1 can be written, up to an additive constant, as

l(βd, π10, π11) = a log π11 + b log(π10 + βd − δd) + c log π10 + d log(1 − π11 − 2π10 − βd + δd), (2)

where we have used π01 = π10 + βd − δd and π00 = 1 − π11 − π01 − π10.
The corresponding score statistic can be shown to be
Td = (b − c + nδd) / {n(2π̃10 − δd − δd²)}^1/2, (3)

where π̃10 is the restricted maximum likelihood estimate (RMLE) of π10 at the null boundary βd = 0, given by the larger root of the following quadratic equation:

2n x² − [b + c + δd(2n + c − b)] x + c δd(1 + δd) = 0.
For the special case of δd = 0, π̃10 = (b + c)/(2n), and Td reduces to the well-known McNemar’s test statistic. See details in [Tango(1998)].
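A direct implementation of (3) can be sketched as follows (a minimal illustration of ours, not published code); it reproduces the rate-difference entries of Table 4 for the first data example in section 3:

```python
import math

def td_score(a, b, c, d, delta):
    """Score statistic Td of (3) for testing H01: p1 - p2 >= delta
    against H11: p1 - p2 < delta, with a, b, c, d the cell counts
    laid out as in Table 1."""
    n = a + b + c + d
    # RMLE of pi10 at the boundary p1 - p2 = delta: the larger root of
    # 2n x^2 - [b + c + delta(2n + c - b)] x + c delta(1 + delta) = 0
    A = 2.0 * n
    B = -(b + c + delta * (2 * n + c - b))
    C = c * delta * (1 + delta)
    pi10 = (-B + math.sqrt(B * B - 4 * A * C)) / (2 * A)
    var = 2 * pi10 - delta - delta * delta
    return (b - c + n * delta) / math.sqrt(n * var)
```

With δd = 0 the larger root is (b + c)/(2n) and the statistic reduces to (b − c)/√(b + c), the McNemar statistic.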
2.2 Rate ratio
To show that two treatments are equivalent based on the rate ratio, θrr, the hypothesis can be formulated as follows:
H0 : θrr ≤ δrr or θrr ≥ 1/δrr versus H1 : δrr < θrr < 1/δrr, where 0 < δrr < 1. (ii)
The corresponding double one-sided testing hypotheses are:
H01 : θrr ≤ δrr versus H11 : θrr > δrr, (4a)

H02 : θrr ≥ 1/δrr versus H12 : θrr < 1/δrr. (4b)
Let βrr = p2 − p1δrr. Similarly, the hypothesis in (4a) is then equivalent to the following:

H01 : βrr ≤ 0 versus H11 : βrr > 0.
The score statistic for testing the hypothesis in (4a) can be shown to be

Trr = [(a + b) − δrr(a + c)] / {nδrr(π̃01 + π̃10)}^1/2, (5)

where π̃01 and π̃10 are the RMLEs of π01 and π10 at the null boundary p2/p1 = δrr; their closed form, obtained as the larger root of a quadratic equation, is given in [Nian-Sheng Tang and Chan(2003)].
For the special case of δrr = 1, π̃01 = π̃10 = (b + c)/(2n), and Trr reduces to the well-known McNemar’s test statistic. See details in [Nian-Sheng Tang and Chan(2003)].
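Since the closed-form RMLE for this case is given in [Nian-Sheng Tang and Chan(2003)] and omitted here, the sketch below (our own illustration, not the authors’ code) obtains the restricted MLE numerically by grid-refinement maximization of the multinomial log-likelihood under the constraint p2 = δrr·p1; it reproduces the rate-ratio entries of Table 4:

```python
import math

def trr_score(a, b, c, d, delta):
    """Score statistic Trr for testing H01: p2/p1 <= delta against
    H11: p2/p1 > delta with paired binary data (cell counts as in
    Table 1). The restricted MLE is found numerically here."""
    n = a + b + c + d

    def loglik(p1, pi11):
        # cell probabilities under the constraint p2 = delta * p1
        pi01 = delta * p1 - pi11
        pi10 = p1 - pi11
        pi00 = 1.0 - (1.0 + delta) * p1 + pi11
        if min(pi11, pi01, pi10, pi00) <= 0.0:
            return float("-inf")
        return (a * math.log(pi11) + b * math.log(pi01)
                + c * math.log(pi10) + d * math.log(pi00))

    # grid-refinement maximization over (p1, pi11); pi11 is searched on a
    # normalized scale f in [0, 1] of its feasible interval given p1
    p1_lo, p1_hi, f_lo, f_hi = 1e-6, 1.0 - 1e-6, 0.0, 1.0
    for _ in range(40):
        best, best_p1, best_f = float("-inf"), None, None
        for i in range(21):
            p1 = p1_lo + (p1_hi - p1_lo) * i / 20.0
            lo11 = max(0.0, (1.0 + delta) * p1 - 1.0)
            hi11 = delta * p1
            for j in range(21):
                f = f_lo + (f_hi - f_lo) * j / 20.0
                ll = loglik(p1, lo11 + (hi11 - lo11) * f)
                if ll > best:
                    best, best_p1, best_f = ll, p1, f
        w1, w2 = (p1_hi - p1_lo) * 0.25, (f_hi - f_lo) * 0.25
        p1_lo, p1_hi = max(1e-6, best_p1 - w1), min(1.0 - 1e-6, best_p1 + w1)
        f_lo, f_hi = max(0.0, best_f - w2), min(1.0, best_f + w2)

    p1_t = 0.5 * (p1_lo + p1_hi)
    lo11 = max(0.0, (1.0 + delta) * p1_t - 1.0)
    pi11_t = lo11 + (delta * p1_t - lo11) * 0.5 * (f_lo + f_hi)
    pi01_t, pi10_t = delta * p1_t - pi11_t, p1_t - pi11_t
    # variance of p2_hat - delta*p1_hat at the boundary: delta(pi01 + pi10)/n
    return ((a + b) - delta * (a + c)) / math.sqrt(n * delta * (pi01_t + pi10_t))
```

The grid refinement is adequate here because the multinomial log-likelihood is concave in the cell probabilities; a closed-form RMLE, as in the cited reference, would of course be faster.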
2.3 Ratio of discordance
To show that two treatments are equivalent based on the ratio of discordance, θrd, the hypothesis can be formulated as follows:
H0 : θrd ≤ δrd or θrd ≥ 1/δrd versus H1 : δrd < θrd < 1/δrd, where 0 < δrd < 1. (iii)
The corresponding double one-sided testing hypotheses are:
H01 : π01/π10 ≤ δrd versus H11 : π01/π10 > δrd, (6a)

H02 : π10/π01 ≤ δrd versus H12 : π10/π01 > δrd. (6b)
Let βrd = π01 − π10δrd. The hypothesis in (6a) is then equivalent to the following:

H01 : βrd ≤ 0 versus H11 : βrd > 0.
The score statistic for testing the hypothesis in (6a) can be shown to be

Trd = (b − δrd c) / {nδrd(1 + δrd)π̃10}^1/2, (7)

where π̃10 is the RMLE of π10 at the boundary π01/π10 = δrd and is equal to (b + c)/[n(δrd + 1)].
For the special case of δrd = 1, π̃10 = (b + c)/(2n), and Trd reduces to the well-known McNemar’s test statistic. See details in [Liu JP and MC(2005)].
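Because the RMLE is explicit here, the ratio-of-discordance statistic has a simple closed form. A minimal sketch (ours) that reproduces the Trd entries of Tables 4 and 6:

```python
import math

def trd_score(b, c, n, delta):
    """Score statistic Trd of (7) for the ratio of discordance, using
    the explicit RMLE pi10 = (b + c) / (n (delta + 1)). Only the
    discordant counts b, c and the total n are needed."""
    pi10 = (b + c) / (n * (delta + 1.0))
    return (b - delta * c) / math.sqrt(n * delta * (1.0 + delta) * pi10)
```

Algebraically the statistic simplifies to (b − δrd·c)/√(δrd(b + c)), and at δrd = 1 it is exactly the McNemar statistic.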
The score test statistics Td, Trr and Trd all have an asymptotic standard normal distribution at their respective null boundaries. It follows that each one-sided null hypothesis H0 is rejected at significance level α if T ≥ z1−α, where z1−α is the 100(1 − α)th percentile of the standard normal distribution; equivalence is declared only if both one-sided null hypotheses are rejected.
2.4 Equivalence margins and relationships of the three testing parameters
Like any typical equivalence testing procedure, the three testing procedures introduced previously require a pre-specified equivalence margin, δ. Because a different level of δ might completely alter the conclusion of an equivalence study, the margin deserves careful discussion between the statisticians and the investigators planning such a study. Based on guidance from the FDA [FDA/CDER(2001)], Wellek (2003) has provided some proposals for choosing equivalence margins. In the same spirit, Table 2 presents some suggested equivalence margins for testing the null hypotheses in (i), (ii) and (iii). More general discussions on the choice of equivalence margins can be found in the ICH E10 Guideline ([Guideline(2000)]). The proposed equivalence margins in this paper should by no means be regarded as fixed rules; however, they provide researchers with a range of options for each specific question. For example, for a highly effective treatment with a response rate above 85%, it is advisable to use an equivalence margin (liberal or strict) of less than 15%. These proposed values of the equivalence margins are used for illustration in the following sections.
Table 2.
Proposals for choosing the limits of a symmetrical equivalence interval.
| Testing parameter θ | Description | Equivalence interval (strict) | Equivalence interval (liberal) |
|---|---|---|---|
| 1. θd = p1 − p2 (i.e., θd = π10 − π01) | difference in response rates (i.e., difference in the prob. of discordant pairs) | (−.05, .05) | (−.15, .15) |
| 2. θrr = p2/p1 | ratio of response rates | (.95, 1.05) | (.85, 1.18) |
| 3. θrd = π10/π01 | ratio of the prob. of discordant pairs | (.95, 1.05) | (.85, 1.18) |
Besides the choice of an appropriate equivalence margin, a careful and sensible decision on the choice of a testing parameter, θ, is also needed in equivalence testing studies. This is because, in contrast to conventional hypothesis testing, where the common boundary of the null and alternative hypotheses is mostly given by zero, equivalence hypothesis testing is generally not invariant under redefinitions of the testing parameter. For example, the conventional null hypotheses θd = 0 and θrr = 1 correspond to the same set of (p1, p2) in the space [0, 1] × [0, 1]. However, the sets {(p1, p2) ∣ −δd < θd < δd} and {(p1, p2) ∣ δrr < θrr < 1/δrr} are different for any choice of the constants 0 < δd, δrr < 1. To investigate the relationships between the three testing parameters θd, θrr and θrd, we compare the equivalence regions resulting from each testing procedure, re-defined in a common space. Specifically, the equivalence region based on the rate ratio,

{(p1, p2) ∣ δrr < p2/p1 < 1/δrr},

can be expressed in a space defined by θd = p1 − p2 and p2 as

p2(δrr − 1) < θd < p2(1 − δrr)/δrr, (8)

while the equivalence region based on the rate difference can be expressed in the same space as

−δd < θd < δd. (9)
From (8), for a given value of p2, the width of the equivalence boundaries for θrr in terms of θd is p2((1 − δrr)/δrr − (δrr − 1)) = p2(1 − δrr²)/δrr, while the width of the equivalence boundaries for θd is always 2δd regardless of the value of p2. Therefore, the equivalence boundaries for θrr are wider than those for θd as long as p2 > 2δdδrr/(1 − δrr²), and narrower as long as p2 < 2δdδrr/(1 − δrr²). Note that the equivalence boundaries for θrr are always narrower than those for θd whenever this threshold is not attainable, i.e., whenever 2δdδrr/(1 − δrr²) ≥ 1. In the context of equivalence testing, wider (or narrower) equivalence boundaries suggest higher (or lower) power to accept equivalence.
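The width comparison can be sketched numerically (the helper names are ours); the threshold below reproduces the value 4/9 quoted for Figure 1:

```python
def rr_width(p2, delta_rr):
    """Width, in terms of theta_d, of the rate-ratio equivalence region
    (8) at a given p2: p2 (1 - delta_rr^2) / delta_rr."""
    return p2 * (1.0 - delta_rr ** 2) / delta_rr

def crossover_p2(delta_d, delta_rr):
    """Value of p2 above which the rate-ratio region becomes wider than
    the rate-difference region of constant width 2 delta_d."""
    return 2.0 * delta_d * delta_rr / (1.0 - delta_rr ** 2)
```

With δd = 0.1 and δrr = 0.8 the crossover is at p2 = 4/9, while with δd = 0.2 and δrr = 0.9 it exceeds 1, so the rate-ratio boundaries are then narrower for every attainable p2.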
Similarly, because the equivalence boundaries based on the rate difference can also be expressed, using θd = p1 − p2 = π10 − π01, as

−δd < π10 − π01 < δd, (10)

we can express the equivalence boundaries based on the ratio of discordance,

{(π10, π01) ∣ δrd < π10/π01 < 1/δrd},

in a space defined by θd = π10 − π01 and π01 as

π01(δrd − 1) < θd < π01(1 − δrd)/δrd. (11)
From (11), for a given value of π01, the width of the equivalence boundaries for θrd in terms of θd is π01((1 − δrd)/δrd − (δrd − 1)) = π01(1 − δrd²)/δrd, while the width of the equivalence boundaries for θd is always 2δd regardless of the value of π01. Similarly, the equivalence boundaries for θrd are wider than those for θd as long as π01 > 2δdδrd/(1 − δrd²), and narrower as long as π01 < 2δdδrd/(1 − δrd²). Note that the equivalence boundaries for θrd are always narrower than those for θd whenever this threshold exceeds the largest admissible value of π01.
In Figures 1-3, the equivalence regions based on the rate difference are overlaid with those based on the rate ratio (left panel a) and with those based on the ratio of discordance (right panel b) under different settings of the equivalence margins δd, δrr and δrd. Based on Figure 1, the two pairs of equivalence regions overlap substantially with δd = 0.1 and δrr = δrd = 0.8. In this case, the equivalence boundaries based on θrr are wider than those based on θd when 4/9 < p2 < 1 (since 2δdδrr/(1 − δrr²) = 4/9) and narrower when 0 < p2 < 4/9. However, based on Figure 2, where δd = 0.2 and δrr = δrd = 0.9, the equivalence boundaries based on θrr are always narrower than those based on θd because 2δdδrr/(1 − δrr²) ≈ 1.9 > 1. Based on Figure 3, where δd = 0.2 and δrr = δrd = 0.8, the equivalence boundaries based on θrr or θrd are always narrower than those based on θd because 2δdδrr/(1 − δrr²) = 8/9 and the remaining range of p2 (or π01) is truncated by the constraint that all cell probabilities lie in [0, 1]. In summary, the equivalence boundaries for θd are always more liberal than those for θrr and θrd unless p2 (or π01) exceeds 2δdδrr/(1 − δrr²) (or 2δdδrd/(1 − δrd²)). This suggests that, if the response probability of a treatment is assumed to be at least 50%, there is more power to declare equivalence using the rate difference as the testing parameter, given that the equivalence margins are comparable. However, in a comparison of two laboratory tests, it is likely that p2 < 50%, and the equivalence boundaries for the rate ratio and the ratio of discordance could then be more liberal than those for the rate difference, depending on the chosen margins. Therefore, researchers are recommended to take advantage of the information presented in Figures 1-3 when designing an equivalence testing study.
Figure 1.
Equivalence testing regions in terms of the rate difference (black-line shaded area) together with the rate ratio (red-line shaded area) [left] and the ratio of discordance (red-line shaded area) [right], as regions in the p2 × (p1 − p2)-plane [left] and the π10 × (p1 − p2)-plane [right], with δd = 0.1 and δrr = δrd = 0.8. [Green shaded areas = set of possible values of (p2, p1 − p2).]
Figure 2.
Equivalence testing regions in terms of the rate difference (black-line shaded area) together with the rate ratio (red-line shaded area) [left] and the ratio of discordance (red-line shaded area) [right], as regions in the p2 × (p1 − p2)-plane [left] and the π10 × (p1 − p2)-plane [right], with δd = 0.2 and δrr = δrd = 0.9. [Green shaded areas = set of possible values of (p2, p1 − p2).]
Figure 3.
Equivalence testing regions in terms of the rate difference (black-line shaded area) together with the rate ratio (red-line shaded area) [left] and the ratio of discordance (red-line shaded area) [right], as regions in the p2 × (p1 − p2)-plane [left] and the π10 × (p1 − p2)-plane [right], with δd = 0.2 and δrr = δrd = 0.8. [Green shaded areas = set of possible values of (p2, p1 − p2).]
3 Examples
In this section, we look at the impact of using different testing parameters and equivalence margins on the results of an equivalence study, using two data examples. First, we consider the HIV-1 screening test example given in [Lachenbruch and Lynch(1998)]. Table 3 provides the outcomes of the HIV screening test using a particular body fluid versus testing plasma directly from 1157 individuals. We applied to this data set the three testing parameters (rate difference, rate ratio and ratio of discordance) with the suggested liberal and strict equivalence margins outlined in Table 2. The testing results are summarized in Table 4. All test statistics yielded p-values less than 0.05, indicating that, in terms of the response rate and the ratio of discordance, testing the alternative body fluid samples was equivalent to testing plasma samples.
Table 3.
Example 1 – HIV screening test (source: [Lachenbruch and Lynch(1998)]).
| | | Plasma sample | | |
|---|---|---|---|---|
| | | + | − | Total |
| Alternative body fluid sample | + | 446 | 5 | 451 |
| | − | 16 | 690 | 706 |
| | Total | 462 | 695 | 1157 |
Table 4.
Example 1 – results of the equivalence test for the HIV screening test examples.
| | Td | | Trr | | Trd | |
|---|---|---|---|---|---|---|
| Equivalence limit | δd = 0.05 | δd = 0.15 | δrr = 0.95 | δrr = 0.85 | δrd = 0.95 | δrd = 0.85 |
| H01 versus H11 (p-value) | 6.025 (< .001) | 13.184 (< .001) | 2.259 (.012) | 7.308 (< .001) | −2.284 (.011) | −2.036 (.021) |
| H02 versus H12 (p-value) | 8.183 (< .001) | 14.529 (< .001) | 5.449 (< .001) | 9.3 (< .001) | 2.519 (.006) | 2.781 (.003) |
| p-value for equivalence | < .001 | < .001 | .012 | < .001 | .011 | .021 |
Second, we consider the cross-over trial of disinfection systems for soft contact lenses given in [Tango(1998)]. Table 5 provides the outcomes of the study, in which 44 subjects were evaluated to compare a chemical disinfection system with a thermal disinfection system for soft contact lenses. Again we applied to this data set the three testing parameters and the suggested liberal and strict equivalence margins. Table 6 shows the results of the six testing procedures. The three testing procedures that use the strict equivalence margins (δd = 0.05 or δrr = δrd = 0.95) resulted in non-significant p-values at two-sided α = 0.05, most likely due to the small sample size. However, different conclusions could result when a more liberal equivalence margin is used. As shown in columns 3, 5 and 7, with δd = 0.15 and δrr = δrd = 0.85 the p-values of the two tests based on the rate difference and the rate ratio were less than 0.05, while the test based on the ratio of discordance remained non-significant (p-value = 0.178). These two examples demonstrate the importance of pre-specifying the testing parameter and the equivalence margins to avoid ambiguity in the interpretation of an equivalence study.
Table 5.
Example 2 – clinical assessment of treatments in cross-over trials of disinfection systems for soft contact lenses (source: [Tango(1998)]).
| | | Thermal disinfection | | |
|---|---|---|---|---|
| | | + | − | Total |
| Hydrogen peroxide | + | 43 | 0 | 43 |
| | − | 1 | 0 | 1 |
| | Total | 44 | 0 | 44 |
Table 6.
Example 2 – results of the equivalence test for the disinfection systems for soft contact lenses.
| | Td | | Trr | | Trd | |
|---|---|---|---|---|---|---|
| Equivalence limit | δd = 0.05 | δd = 0.15 | δrr = 0.95 | δrr = 0.85 | δrd = 0.95 | δrd = 0.85 |
| H01 versus H11 (p-value) | 0.830 (.203) | 2.364 (.009) | 0.830 (.203) | 2.364 (.009) | −0.975 (.165) | −0.922 (.178) |
| H02 versus H12 (p-value) | 1.835 (.033) | 2.990 (.001) | 1.821 (.034) | 2.961 (.002) | 1.026 (.152) | 1.085 (.139) |
| p-value for equivalence | .203 | .009 | .203 | .009 | .165 | .178 |
4 Simulations
In this section, we study the empirical type I and II error rates of the three testing procedures in various parameter settings. We choose n = 50, 100 and 200, p1 = 0.5, 0.8 and 0.95, and ϕ = 0 and 0.03 when applicable. To study the empirical significance levels of the testing procedures, 10,000 random samples were generated from multinomial distributions of size n with cell probabilities π11 = p1p2 + ϕ, π01 = (1 − p1)p2 − ϕ, π10 = p1(1 − p2) − ϕ and π00 = (1 − p1)(1 − p2) + ϕ, where ϕ is the covariance of the subject-specific response probabilities of the two methods. Note that ϕ = 0 does not necessarily indicate that the two response variables are independent. Table 7 shows the empirical significance level of each testing procedure under β = 0, i.e., at the null boundaries p1 − p2 = δd (βd = 0), p2/p1 = δrr (βrr = 0) and π01/π10 = δrd (βrd = 0), where δd = 0.05 and δrr = δrd = 0.95. The nominal α level is set to 0.05. Overall, these simulation results show that under βd = 0 the empirical type I error rates are well maintained for Td, while Trr and Trd are very liberal, with empirical type I error rates that can be much higher than the nominal 5% level. Under βrr = 0, the empirical type I error rates for Trr are close to the nominal level; Td appears conservative in this case, while Trd is again very liberal. Under the last scenario, βrd = 0, both Td and Trr are slightly conservative except when the sample size is small (n = 50) and p1 = 0.5, while Trd appears liberal under all studied settings. Because of the constraint on the parameter values, p1 + δrdπ01 ≤ 1 (i.e., π01 ≤ (1 − p1)/δrd), results with p1 = 0.95 are not shown in Table 7, to avoid the generation of extremely sparse data.
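The simulation mechanism can be sketched as follows (a seeded toy version of one replicate of the 10,000-replicate study; the function names are ours):

```python
import random

def sample_counts(p1, p2, phi, n, rng):
    """Draw one multinomial sample (a, b, c, d) of size n from the cell
    probabilities pi11 = p1 p2 + phi, pi01 = (1-p1) p2 - phi,
    pi10 = p1 (1-p2) - phi, pi00 = (1-p1)(1-p2) + phi."""
    probs = [p1 * p2 + phi, (1 - p1) * p2 - phi,
             p1 * (1 - p2) - phi, (1 - p1) * (1 - p2) + phi]
    counts = [0, 0, 0, 0]
    for _ in range(n):
        u, acc = rng.random(), 0.0
        for i, p in enumerate(probs):
            acc += p
            if u < acc:
                counts[i] += 1
                break
        else:
            counts[3] += 1  # guard against floating-point round-off
    return tuple(counts)
```

Repeating this under a null boundary (e.g., p2 = p1 − δd) and recording the rejection frequency of each score statistic gives the empirical type I error rates reported in Table 7; repeating it under p1 = p2 gives the empirical power of Table 8.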
Table 7.
Empirical Type I error (%) based on 10,000 trials simulated under H0 with α = 5% (δd = 0.05 and δrr = δrd = 0.95)
| H0 : p1 – p2 = δd | H0 : p2/p1 = δrr | H0 : π01/π10 = δrd | ||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| n | p 1 | ϕ | π 01 | Td | Trr | Trd | π 01 | Td | Trr | Trd | π 01 | Td | Trr | Trd |
| 50 | 0.5 | 0 | 0.28 | 5.5 | 8.9 | 11.1 | 0.26 | 3.6 | 6.1 | 8.1 | 0.28 | 4 | 6.5 | 8.3 |
| 0.5 | 0.03 | 0.25 | 5 | 8.9 | 11.7 | 0.23 | 3.2 | 6.3 | 8.6 | 0.25 | 3.4 | 6.1 | 8.5 | |
| 0.8 | 0 | 0.2 | 5.1 | 6.8 | 13.1 | 0.19 | 4.5 | 5.8 | 11.8 | 0.2 | 2.6 | 3.5 | 8 | |
| 0.8 | 0.03 | 0.17 | 5.4 | 7 | 14.2 | 0.16 | 4.3 | 5.8 | 11.8 | 0.17 | 2.3 | 3 | 7.9 | |
| 100 | 0.5 | 0 | 0.28 | 4.7 | 10.2 | 13.2 | 0.26 | 2.8 | 6.3 | 8.8 | 0.28 | 2.3 | 5 | 6.8 |
| 0.5 | 0.03 | 0.25 | 5.2 | 11.2 | 15.3 | 0.23 | 2.6 | 6.1 | 9.4 | 0.25 | 2 | 4.9 | 6.8 | |
| 0.8 | 0 | 0.2 | 5.1 | 7.5 | 17.9 | 0.19 | 3.5 | 4.9 | 13.6 | 0.2 | 1.5 | 2.1 | 7.2 | |
| 0.8 | 0.03 | 0.17 | 4.9 | 7.6 | 20.5 | 0.16 | 3.2 | 5 | 15.4 | 0.17 | 1.4 | 2.3 | 8.2 | |
| 200 | 0.5 | 0 | 0.28 | 5.2 | 13.3 | 18.3 | 0.26 | 1.8 | 5.3 | 9 | 0.28 | 1.1 | 3.7 | 6.1 |
| 0.5 | 0.03 | 0.25 | 5 | 13.5 | 20.4 | 0.23 | 1.7 | 5.5 | 10 | 0.25 | 0.9 | 3.3 | 6.3 | |
| 0.8 | 0 | 0.2 | 5.3 | 8.5 | 25.5 | 0.19 | 3 | 4.8 | 17.8 | 0.2 | 0.8 | 1.4 | 7 | |
| 0.8 | 0.03 | 0.17 | 4.7 | 8.3 | 29.7 | 0.16 | 2.9 | 5.4 | 21.8 | 0.17 | 0.6 | 1.1 | 7.4 | |
Table 8 shows the empirical power of each testing procedure under p1 = p2. In general, higher power is achieved with larger sample sizes and higher values of p1 and ϕ. This suggests that when the marginal response rates (p1 and p2) are high or when the proportion of concordant pairs is high, there is a greater chance of declaring equivalence with the same equivalence margins. It is also observed that Td is always more powerful than Trr and Trd except when the sample size is small (n = 50) and p1 = 0.5, or when n = 50, p1 = 0.8 and ϕ = 0. In summary, these simulation results show that the rate difference is an attractive testing parameter whose testing procedure generally has well-maintained operating characteristics. Testing procedures based on the rate ratio and the ratio of discordance tend to be more liberal, but could be more powerful when the sample size is small (n = 50) and the response rate is low (p1 = 0.5), or when the sample size is small, the response rate is relatively high (p1 = 0.8) and the two binary outcomes of method 1 and method 2 are uncorrelated.
Table 8.
Empirical power (%) using double one-sided tests based on 10,000 trials simulated under p1 = p2 (i.e., π01 = π10) with δd = 0.15 and δrr = δrd = 0.85 (α = 5%)
| n | p 1 | ϕ | π 01 | δ d | Td | δ rr | Trr | δ rd | Trd |
|---|---|---|---|---|---|---|---|---|---|
| 50 | 0.5 | 0 | 0.25 | 0.15 | 1.2 | 0.85 | 1.5 | 0.85 | 21.6 |
| 0.03 | 0.22 | 0.15 | 3.9 | 0.85 | 1.5 | 0.85 | 20.6 | ||
| 0.8 | 0 | 0.16 | 0.15 | 15.5 | 0.85 | 5 | 0.85 | 19.6 | |
| 0.03 | 0.13 | 0.15 | 26.8 | 0.85 | 10.8 | 0.85 | 18.1 | ||
| 0.95 | 0 | 0.05 | 0.15 | 78.6 | 0.85 | 76.2 | 0.85 | 9.9 | |
| 0.03 | 0.02 | 0.15 | 97.8 | 0.85 | 97.6 | 0.85 | 1.3 | ||
| 100 | 0.5 | 0 | 0.25 | 0.15 | 37 | 0.85 | 0.6 | 0.85 | 29.4 |
| 0.03 | 0.22 | 0.15 | 46.7 | 0.85 | 0.5 | 0.85 | 26.4 | ||
| 0.8 | 0 | 0.16 | 0.15 | 67.4 | 0.85 | 44.4 | 0.85 | 24.3 | |
| 0.03 | 0.13 | 0.15 | 78.5 | 0.85 | 57.6 | 0.85 | 21.6 | ||
| 0.95 | 0 | 0.05 | 0.15 | 99.3 | 0.85 | 99.2 | 0.85 | 18.0 | |
| 0.03 | 0.02 | 0.15 | 100 | 0.85 | 100 | 0.85 | 6.0 | ||
| 200 | 0.5 | 0 | 0.25 | 0.15 | 82.4 | 0.85 | 2.8 | 0.85 | 39.6 |
| 0.03 | 0.22 | 0.15 | 88 | 0.85 | 7.2 | 0.85 | 36.4 | ||
| 0.8 | 0 | 0.16 | 0.15 | 96.5 | 0.85 | 87.9 | 0.85 | 33.2 | |
| 0.03 | 0.13 | 0.15 | 98.6 | 0.85 | 93.5 | 0.85 | 30.8 | ||
| 0.95 | 0 | 0.05 | 0.15 | 100 | 0.85 | 100 | 0.85 | 20.1 | |
| 0.03 | 0.02 | 0.15 | 100 | 0.85 | 100 | 0.85 | 15.0 |
5 Group sequential designs for equivalence testing
The topic of group sequential designs has generated a rich body of research. The papers by Pocock (1977), O’Brien & Fleming (1979), Fleming, Harrington & O’Brien (1984) and Lan & DeMets (1983) concerned methods with two-sided testing boundaries that allow early stopping to reject the null hypothesis. Subsequent researchers (e.g., Wang & Tsiatis (1987) and Pampallona & Tsiatis (1994)) modified the two-sided testing boundary by inserting an “inner wedge” to permit early stopping to accept the null hypothesis. In particular, Pampallona & Tsiatis (1994) described group sequential designs that allow asymmetric tests with unequal Type I and II error probabilities.
Under repeated testing in group sequential designs, the critical values of the test statistics need to be adjusted to maintain the overall type I error rate at an acceptable level (≤ α, the “consumer’s risk”), and the sample sizes need to be adjusted to attain a desired power of testing (≥ 1 − β, where β is the “producer’s risk”). For example, in the context of equivalence testing using the rate difference, given the two-sided equivalence hypothesis

H0 : ∣θd∣ ≥ δd versus H1 : ∣θd∣ < δd,

the type I error requirement is Pr{Declare equivalence ∣ ∣θd∣ = δd} ≤ α and the type II error requirement is Pr{Do not declare equivalence ∣ θd = 0} ≤ β. After reparametrization of the testing parameter, i.e., βd = p2 − (p1 − δd) or βd = p1 − (p2 − δd), each of the two one-sided hypotheses becomes

H0 : βd ≤ 0 versus H1 : βd > 0.

The error requirements become Pr{Declare equivalence ∣ βd = 0} ≤ α and Pr{Do not declare equivalence ∣ βd = δd} ≤ β. Therefore, the score test statistics for the reparametrized testing parameter discussed in the previous sections can be directly used in the adaptation of methods developed for conventional hypotheses of difference. In the following subsections, because the roles of method 1 and method 2 are simply interchanged in the reparametrized tests of the two one-sided hypotheses, we focus on one version of the symmetry and consider the testing of one of the two one-sided hypotheses (1a), (4a) and (6a) for the three testing parameters: rate difference, rate ratio and ratio of discordance, respectively.
5.1 Two-sided inner wedge tests in terms of the score statistics
Most methods discussed in the literature focus on designs for conventional hypothesis testing of a treatment difference between two independent groups. Here we extend Pampallona & Tsiatis (1994) and describe the implementation of a group sequential design for paired binary data in the context of equivalence testing. Common approaches to group sequential designs define appropriate continuation and stopping regions that guarantee the desired error rates under repeated testing. We present the implementation of a sequential design for equivalence testing with early stopping to reject equivalence (futility) as well as to declare equivalence. A general decision rule of such a design is based on pairs of constants (ak, bk) with ak < bk for k = 1,…,K − 1 and aK = bK:
After group k = 1,…,K − 1
if ∣T∣ ≤ ak stop, reject H0 and declare equivalence
if ∣T∣ ≥ bk stop, accept H0 and declare non-equivalence
otherwise continue to group k + 1
after group K
if ∣T∣ ≤ aK stop, reject H0 and declare equivalence
if ∣T∣ > aK stop, accept H0 and declare non-equivalence
where T denotes the score test statistic of section 2 for the chosen testing parameter, standardized so that it is centered at zero under exact equality of the two methods (for the rate difference, the McNemar-type statistic). Since aK = bK, termination at analysis K is ensured. For the test with shape parameter Δ, where larger values of Δ increase the maximum sample size and decrease the expected sample size, the critical values ak and bk are

ak = δd√Ik − C2(K, α, β, Δ)(k/K)^(Δ − 1/2), (12)

bk = C1(K, α, β, Δ)(k/K)^(Δ − 1/2), (13)

where Ik is the Fisher information accumulated at the k-th analysis. The constants C1(K, α, β, Δ) and C2(K, α, β, Δ) are chosen to satisfy the Type I error and power requirements; their values can be found in [Pampallona and Tsiatis(1994)]. The outer boundary with critical values bk can be viewed as a repeated significance test of βd = δd (i.e., θd = 0), with crossing indicating a difference between the two methods; the inner boundary with critical values ak can be viewed as a repeated significance test of the unequivalence boundary βd = 0, with ∣T∣ falling at or below ak rejecting unequivalence.
5.2 Implementation of a group sequential design using the rate difference in equivalence testing
Suppose α = β = 0.05, δd = 0.2 and Δ = 0, and that we wish to design a balanced study with K = 4 equally sized groups. The information needed for a fixed sample size test is
If = {(zα/2 + zβ)/δd}² = {(1.96 + 1.645)/0.2}² = 324.9.
Assuming a per-pair variance of 0.26 (the total probability of discordance), the fixed sample size test requires 324.9 × 0.26 ≈ 85 subjects. Using the power family of inner wedge tests, the information needed for a sequential test is
Imax = {(C1(K, α, β, Δ) + C2(K, α, β, Δ))/δd}² = {(1.995 + 1.708)/0.2}² = 342.8   (14)
Assuming the same per-pair variance of 0.26, the sequential test requires 88 subjects for the final stage analysis, i.e., 4 groups of 22 observations should be planned in the study.
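The information calculations above can be checked with a few lines of code; this sketch assumes the per-pair variance of 0.26 used in this example and the constants C1 = 1.995 and C2 = 1.708 for (K, α, β, Δ) = (4, 0.05, 0.05, 0) from Pampallona and Tsiatis (1994).

```python
import math

# Design values from the example: alpha = beta = 0.05, delta_d = 0.2
z_half_alpha = 1.96   # z_{alpha/2} for alpha = 0.05
z_beta = 1.645        # z_{beta} for beta = 0.05
delta_d = 0.2
var_per_pair = 0.26   # assumed per-pair variance, so n = I * 0.26

# Fixed-sample information and the corresponding number of subjects
I_fixed = ((z_half_alpha + z_beta) / delta_d) ** 2
n_fixed = math.ceil(I_fixed * var_per_pair)

# Maximum information for the sequential test: a_K = b_K in (12)-(13)
# implies delta_d * sqrt(I_max) = C1 + C2
C1, C2 = 1.995, 1.708
I_max = ((C1 + C2) / delta_d) ** 2

print(round(I_fixed, 1), n_fixed, round(I_max, 1))  # 324.9 85 342.8
```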
The test is implemented by applying the decision tree of section 5.1 to the score test statistic Tk, k = 1,…,K. The critical values ak and bk are obtained from (12) and (13) with C1(4, 0.05, 0.05, 0) = 1.995 and C2(4, 0.05, 0.05, 0) = 1.708. Specifically, (a1, a2, a3, a4) = (−1.58, 0.19, 1.21, 1.97) and (b1, b2, b3, b4) = (3.99, 2.82, 2.30, 1.995). Since a1 is negative, it is not possible to stop and declare equivalence at the first analysis. In addition, the expected sample sizes are 65.4, 70.4 and 58.6 under βd = ±0.2, ±0.1 and 0, respectively, compared with the 85 subjects required by a fixed sample procedure, so the sequential design offers a substantial saving in expected sample size. Note that if Δ is increased, the test allows greater opportunity for early stopping, lowering the expected sample size, but the maximum sample size also increases.
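These critical values can be reproduced numerically. The sketch below assumes the power-family boundary form with equally spaced information levels and takes Imax as 88/0.26, the information carried by the 88 planned pairs.

```python
import math

# Boundary constants for K = 4, alpha = beta = 0.05, Delta = 0
K, Delta, delta_d = 4, 0.0, 0.2
C1, C2 = 1.995, 1.708
I_max = 88 / 0.26  # information carried by the 88 planned pairs

a, b = [], []
for k in range(1, K + 1):
    shape = (k / K) ** (Delta - 0.5)  # power-family factor (k/K)^(Delta - 1/2)
    I_k = (k / K) * I_max             # equally spaced information levels
    b.append(C1 * shape)              # outer boundary: test of beta_d = 0
    a.append(delta_d * math.sqrt(I_k) - C2 * shape)  # inner wedge

print([round(x, 2) for x in a])  # [-1.58, 0.19, 1.21, 1.97]
print([round(x, 3) for x in b])  # [3.99, 2.821, 2.304, 1.995]
```

The slight gap between a4 = 1.97 and b4 = 1.995 reflects the rounding of the planned sample size to 4 groups of 22.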
6 Conclusion
In this paper, we investigated three testing parameters, the rate difference, the rate ratio and the ratio of discordance, for matched-pair designs. Based on real data examples and simulation studies, we observed that the score test statistic for the rate difference based on the reparameterized multinomial model is generally a good choice for equivalence testing of two methods with paired binary data. The implementation of a group sequential design discussed in this paper is by no means the only or the best approach; however, it demonstrates the resource savings available in the equivalence testing context when group sequential designs are employed instead of a fixed sample design. Müller and Schäfer (1999) suggested that group sequential equivalence testing is most profitable when the power of the trial to establish equivalence is high. Therefore, in view of the savings that can be achieved in experimental costs and time, we recommend that early stopping to accept the null hypothesis be considered in equivalence testing using group sequential designs. The score test statistics for the testing procedures discussed in this paper are all based on large-sample theory. It would be of interest to conduct similar studies using exact inference when study sample sizes are small or the data structure is sparse. Details on group sequential designs using exact calculations for binary data can be found in Jennison and Turnbull (1999).
References
- FDA/CDER. Guidance for industry: bioanalytical method validation (online document). 2001. http://www.fda.gov/cder/guidance/4252fnl.pdf [last accessed Feb. 2008].
- Fleming TR, Harrington DP, O’Brien PC. Designs for group sequential tests. Controlled Clinical Trials. 1984;5:348–361.
- ICH E10. Choice of control group and related issues in clinical trials. International Conference on Harmonisation (ICH); 2000.
- Jennison C, Turnbull BW. Interim analyses: the repeated confidence interval approach. Journal of the Royal Statistical Society, Series B. 1989;51:305–361.
- Jennison C, Turnbull BW. Group Sequential Methods with Applications to Clinical Trials. Chapman & Hall/CRC; New York: 1999.
- Lachenbruch PA, Lynch CJ. Assessing screening tests: extensions of McNemar’s test. Statistics in Medicine. 1998;17:2207–2217.
- Lan KKG, DeMets DL. Discrete sequential boundaries for clinical trials. Biometrika. 1983;70:659–663.
- Liu JP, et al. Tests for equivalence based on odds ratio for matched-pair design. Journal of Biopharmaceutical Statistics. 2005;15:889–901.
- Lu Y, Bean JA. On the sample size for one-sided equivalence of sensitivities based upon McNemar’s test. Statistics in Medicine. 1995;14:1831–1839.
- Müller HH, Schäfer H. Optimization of testing times and critical values in sequential equivalence testing. Statistics in Medicine. 1999;18:1769–1788.
- Nam J-M. Establishing equivalence of two treatments and sample size requirement in matched-pairs design. Biometrics. 1997;53:1422–1430.
- Tang N-S, Tang M-L, Chan ISF. On tests of equivalence via non-unity relative risk for matched-pair design. Statistics in Medicine. 2003;22:1217–1233.
- O’Brien PC, Fleming TR. A multiple testing procedure for clinical trials. Biometrics. 1979;35:549–556.
- Pampallona S, Tsiatis AA. Group sequential designs for one-sided and two-sided hypothesis testing with provision for early stopping in favor of the null hypothesis. Journal of Statistical Planning and Inference. 1994;42:19–35.
- Pocock SJ. Group sequential methods in the design and analysis of clinical trials. Biometrika. 1977;64:191–199.
- Schuirmann DJ. A comparison of the two one-sided tests procedure and the power approach for assessing the equivalence of average bioavailability. Journal of Pharmacokinetics and Biopharmaceutics. 1987;15:657–680.
- Tango T. Equivalence test and confidence interval for the difference in proportions for the paired-sample design. Statistics in Medicine. 1998;17:891–908.
- Wang SK, Tsiatis AA. Approximately optimal one-parameter boundaries for group sequential trials. Biometrics. 1987;43:193–200.
- Wellek S. Testing Statistical Hypotheses of Equivalence. Chapman & Hall/CRC; 2003.
- Whitehead J. Sequential designs for equivalence studies. Statistics in Medicine. 1996;15:2703–2715.



