SUMMARY
We address the problem of establishing two-sided equivalence using paired-sample analysis of two treatments or two laboratory tests with a binary endpoint. Through real data examples and Monte Carlo simulations, we compare three commonly used testing parameters, namely the difference of response probabilities, the ratio of response probabilities, and the ratio of discordant probabilities, based on score test statistics for constructing equivalence hypothesis tests with paired binary data. We provide suggestions on the choice among these three testing parameters and on proper equivalence margins in the formulation of equivalence hypotheses. In addition, we describe the implementation of a group sequential design in the context of equivalence testing, with early stopping to reject as well as to declare equivalence.
Keywords: binary data, double one-sided testing, equivalence boundary, interim analysis, sample sizes
1 Introduction
As a result of recent advances in the medical and biological sciences, equivalence studies have been increasingly used in the evaluation of new treatments and new laboratory tests. The primary objective of these equivalence studies is to show that the efficacy or effectiveness of a new method (treatment or laboratory test) is not worse than that of the standard method in terms of a certain testing parameter. A testing parameter specifies the estimand of interest for a particular study, on the basis of which a hypothesis of equivalence is formulated; one example of a commonly used testing parameter is the difference of response rates. Unlike conventional hypothesis testing procedures that intend to show a difference (or unequivalence) stated in the alternative hypothesis, an equivalence testing procedure aims to reject a null hypothesis of unequivalence. Because failure to reject a conventional null hypothesis of no difference only indicates insufficient evidence to conclude a difference, a widely accepted strategy in equivalence testing is to specify an equivalence margin, δ, so that the two methods may be considered equivalent if the testing parameter of interest is shown to fall within δ.
For assessing the equivalence of two methods, paired-sample designs are commonly adopted to minimize possible confounding and to increase the efficiency of comparisons. For example, to reduce the variability of a comparison, subjects are asked to take both a new and the standard treatment in a cross-over clinical trial, or specimens from the same individual are split into two and tested by an alternative and the standard laboratory test. When the study endpoint is dichotomous, different testing parameters have been discussed in the literature. For example, because of its simplicity and ease of interpretation, the difference of response probabilities (i.e., rate difference, θd) has been a commonly used testing parameter in the biomedical field (e.g., [Lu and Bean(1995), Nam(1997), Tango(1998)]). However, researchers have also proposed equivalence testing procedures for binary paired-sample designs that use other testing parameters, such as the ratio of response probabilities (i.e., relative risk, θrr) (e.g., [Lachenbruch and Lynch(1998), Nian-Sheng Tang and Chan(2003)]) or the ratio of discordant probabilities (i.e., ratio of discordance, θrd) ([Liu JP and MC(2005)]). Although the choice of a testing parameter usually depends on the scientific relevance of the conclusion of an equivalence test, it is often unclear to researchers how to translate their knowledge of a particular treatment or laboratory test into an appropriate testing parameter. After choosing a testing parameter, it is sometimes even more challenging to set appropriate equivalence margins based on clinical and regulatory judgments. With these issues in mind, in the early part of this paper we evaluate testing procedures that use the three testing parameters θd, θrr and θrd, and provide suggestions on the choice of a testing parameter and its corresponding equivalence margins.
Because equivalence tests often require much larger sample sizes than ordinary tests to maintain the same type I and II error rates, it is of great interest to adopt group sequential designs that allow early stopping for futility (unequivalence) as well as for success (equivalence) by incorporating the evidence collected stage-wise in equivalence studies. A group sequential design consists of one or more interim analyses, which the investigator can use to determine whether to stop the trial with concrete evidence of success or failure. The ethical argument for early stopping is probably less compelling in the context of equivalence testing; however, such designs are greatly needed in situations where resources are too limited to achieve the desired power, or where one is very uncertain about critical assumptions of the design parameters. For example, to calculate the number of patients needed for an equivalence study of two treatments with a binary endpoint, it is necessary to specify either the response probability on the standard treatment, or the average response probability over the new and standard treatments combined. If these values are misspecified, the resulting fixed sample size test will be under- or over-powered. To this end, in the latter part of this paper, we describe the implementation of a group sequential design in the context of equivalence testing based on paired binary data, and we compare the expected sample size of such a design with that of a traditional fixed sample size design.
This paper focuses on two-sided equivalence hypotheses with symmetric equivalence margins. The principle of double one-sided approach is used to test the null hypothesis of un-equivalence. This approach is indicated in the International Conference on Harmonization (ICH) E9 Guideline “Biostatistical Principles for Clinical Trials,” which states that “Operationally, this (equivalence test) is equivalent to the method of using two simultaneous one-sided tests to test the (composite) null hypothesis that the treatment difference is outside the equivalence margins versus the (composite) alternative hypothesis that the treatment difference is within the margins.” Because of the symmetry of the hypothesis, one of the double one-sided tests is discussed in more detail in this paper.
2 Methods
In the context of paired-sample analysis, let M1k and M2k, k = 1,…,n, denote the binary (0 and 1) response random variables of the standard method (method 1) and a new method (method 2) based on n paired samples. Let π11, π10, π01 and π00 be the response probabilities of the (M1k, M2k) pairs (1, 1), (1, 0), (0, 1) and (0, 0), respectively, and let a, b, c and d be the corresponding observed counts of the four response pairs. Let p1 and p2 be the response probabilities for method 1 and method 2. A typical layout of the outcomes and marginal totals is presented in Table 1. To model the inter-subject variability of the response probabilities and their correlation, a covariance parameter, ϕ, for the response probabilities of method 1 and method 2 conditioning on matched sample characteristics is often introduced in reparameterized multinomial models. By assuming that M1k and M2k are mutually independent given each sample’s unobserved characteristics and that these unobserved characteristics are independent and identically distributed, [Tango(1998)] showed that in the reparameterized multinomial model the expected response probabilities over the matched sample characteristics for the four cells are given as π11 = p1p2 + ϕ, π01 = (1 − p1)p2 − ϕ, π10 = p1(1 − p2) − ϕ and π00 = (1 − p1)(1 − p2) + ϕ. To formulate the hypothesis of equivalence testing between two methods, three testing parameters have mainly been used in the medical field: the rate difference θd = p1 − p2, the rate ratio θrr = p2/p1 and the ratio of discordance θrd = π10/π01. In the case of two-sided equivalence testing, the notations for method 1 and method 2 are interchangeable. Hence, a testing procedure based on the aforementioned measures can easily be adapted to the corresponding measures θd′ = p2 − p1, θrr′ = p1/p2 and θrd′ = π01/π10 by switching the roles of method 1 and method 2.
In the following subsections, we focus on one version of the symmetry and consider the testing parameters θd, θrr and θrd in terms of the formulation of a testing hypothesis and its corresponding test statistic.
Table 1.
Layout of counts (probabilities) associated with testing of two methods using paired samples in a 2×2 table.
| | | Treatment 1 | | |
|---|---|---|---|---|
| | | + | − | Total |
| Treatment 2 | + | a (π11) | b (π01) | a + b (p2) |
| | − | c (π10) | d (π00) | c + d (1 − p2) |
| | Total | a + c (p1) | b + d (1 − p1) | n (1.0) |
2.1 Rate difference
To show that two treatments are equivalent based on the rate difference, θd, the hypothesis can be formulated as follows:

H0 : ∣θd∣ ≥ δd versus H1 : ∣θd∣ < δd, where 0 < δd < 1. (i)
To test the null hypothesis in (i), a “double one-sided test” ([Schuirmann(1987)]) would suggest conducting tests of the following two hypotheses:
H01 : θd ≥ δd versus H11 : θd < δd, (1a)

H02 : θd ≤ −δd versus H12 : θd > −δd. (1b)
Simultaneous rejection of the null hypotheses in (1a) and (1b) at a significance level of α implies a rejection of the null hypothesis in (i), also at level α [Wellek(2003)]. Using the double one-sided tests approach, the p-value of the equivalence test for (i) is the maximum of the p-values of the two one-sided tests (1a) and (1b).
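The double one-sided decision rule can be sketched as follows. The helper names are ours, and the one-sided p-values use the asymptotic standard normal distribution of the score statistics described below:

```python
import math

def one_sided_p(t):
    """Upper-tail p-value P(Z >= t) for a standard normal Z."""
    return 0.5 * math.erfc(t / math.sqrt(2.0))

def dost_p(t_1a, t_1b):
    """p-value of the equivalence test (i): the maximum of the two
    one-sided p-values from tests (1a) and (1b)."""
    return max(one_sided_p(t_1a), one_sided_p(t_1b))

def declare_equivalence(t_1a, t_1b, alpha=0.05):
    """Equivalence is declared only if both one-sided tests reject."""
    return dost_p(t_1a, t_1b) <= alpha
```

For example, the two rate-difference statistics of the first data example in section 3 (6.025 and 8.183) yield an equivalence p-value below 0.001.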
The same procedure applies to both (1a) and (1b) by switching the subscripts, so let us consider the hypothesis in (1a). Let βd = p2 − (p1 − δd). The hypothesis in (1a) is then equivalent to the following:

H01 : βd ≤ 0 versus H11 : βd > 0.
To derive the score statistic for testing the above hypothesis in terms of βd, the log-likelihood of the re-parameterized multinomial model for a random sample with the data structure given in Table 1 can be written, up to an additive constant, as

l(βd, π10, π11) = a log π11 + b log(π10 + βd − δd) + c log π10 + d log(1 − π11 − 2π10 − βd + δd), (2)

where we have used π01 = π10 + βd − δd and π00 = 1 − π11 − π01 − π10.
The corresponding score statistic can be shown to be
Td = (b − c + nδd) / {n(2π̃10 − δd − δd²)}^1/2, (3)

where π̃10 is the restricted maximum likelihood estimate (RMLE) of π10 at the null boundary βd = 0, given by the larger root of the following quadratic equation:

2n x² − [b + c + δd(2n + c − b)] x + c δd(1 + δd) = 0.
For the special case of δd = 0, π̃10 = (b + c)/(2n), and Td reduces to the well-known McNemar’s test statistic. See details in [Tango(1998)].
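A direct implementation of (3) can be sketched as follows (a minimal illustration of ours, not published code); it reproduces the rate-difference entries of Table 4 for the first data example in section 3:

```python
import math

def td_score(a, b, c, d, delta):
    """Score statistic Td of (3) for testing H01: p1 - p2 >= delta
    against H11: p1 - p2 < delta, with a, b, c, d the cell counts
    laid out as in Table 1."""
    n = a + b + c + d
    # RMLE of pi10 at the boundary p1 - p2 = delta: the larger root of
    # 2n x^2 - [b + c + delta(2n + c - b)] x + c delta(1 + delta) = 0
    A = 2.0 * n
    B = -(b + c + delta * (2 * n + c - b))
    C = c * delta * (1 + delta)
    pi10 = (-B + math.sqrt(B * B - 4 * A * C)) / (2 * A)
    var = 2 * pi10 - delta - delta * delta
    return (b - c + n * delta) / math.sqrt(n * var)
```

With δd = 0 the larger root is (b + c)/(2n) and the statistic reduces to (b − c)/√(b + c), the McNemar statistic.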
2.2 Rate ratio
To show that two treatments are equivalent based on the rate ratio, θrr, the hypothesis can be formulated as follows:
H0 : θrr ≤ δrr or θrr ≥ 1/δrr versus H1 : δrr < θrr < 1/δrr, where 0 < δrr < 1. (ii)
The corresponding double one-sided testing hypotheses are:
H01 : θrr ≤ δrr versus H11 : θrr > δrr, (4a)

H02 : θrr ≥ 1/δrr versus H12 : θrr < 1/δrr. (4b)
Let βrr = p2 − p1δrr. Similarly, the hypothesis in (4a) is then equivalent to the following:

H01 : βrr ≤ 0 versus H11 : βrr > 0.
The score statistic for testing the hypothesis in (4a) can be shown to be

Trr = [(a + b) − δrr(a + c)] / {nδrr(π̃01 + π̃10)}^1/2, (5)

where π̃01 and π̃10 are the RMLEs of π01 and π10 at the null boundary p2/p1 = δrr; their closed form, obtained as the larger root of a quadratic equation, is given in [Nian-Sheng Tang and Chan(2003)].
For the special case of δrr = 1, π̃01 = π̃10 = (b + c)/(2n), and Trr reduces to the well-known McNemar’s test statistic. See details in [Nian-Sheng Tang and Chan(2003)].
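Since the closed-form RMLE for this case is given in [Nian-Sheng Tang and Chan(2003)] and omitted here, the sketch below (our own illustration, not the authors’ code) obtains the restricted MLE numerically by grid-refinement maximization of the multinomial log-likelihood under the constraint p2 = δrr·p1; it reproduces the rate-ratio entries of Table 4:

```python
import math

def trr_score(a, b, c, d, delta):
    """Score statistic Trr for testing H01: p2/p1 <= delta against
    H11: p2/p1 > delta with paired binary data (cell counts as in
    Table 1). The restricted MLE is found numerically here."""
    n = a + b + c + d

    def loglik(p1, pi11):
        # cell probabilities under the constraint p2 = delta * p1
        pi01 = delta * p1 - pi11
        pi10 = p1 - pi11
        pi00 = 1.0 - (1.0 + delta) * p1 + pi11
        if min(pi11, pi01, pi10, pi00) <= 0.0:
            return float("-inf")
        return (a * math.log(pi11) + b * math.log(pi01)
                + c * math.log(pi10) + d * math.log(pi00))

    # grid-refinement maximization over (p1, pi11); pi11 is searched on a
    # normalized scale f in [0, 1] of its feasible interval given p1
    p1_lo, p1_hi, f_lo, f_hi = 1e-6, 1.0 - 1e-6, 0.0, 1.0
    for _ in range(40):
        best, best_p1, best_f = float("-inf"), None, None
        for i in range(21):
            p1 = p1_lo + (p1_hi - p1_lo) * i / 20.0
            lo11 = max(0.0, (1.0 + delta) * p1 - 1.0)
            hi11 = delta * p1
            for j in range(21):
                f = f_lo + (f_hi - f_lo) * j / 20.0
                ll = loglik(p1, lo11 + (hi11 - lo11) * f)
                if ll > best:
                    best, best_p1, best_f = ll, p1, f
        w1, w2 = (p1_hi - p1_lo) * 0.25, (f_hi - f_lo) * 0.25
        p1_lo, p1_hi = max(1e-6, best_p1 - w1), min(1.0 - 1e-6, best_p1 + w1)
        f_lo, f_hi = max(0.0, best_f - w2), min(1.0, best_f + w2)

    p1_t = 0.5 * (p1_lo + p1_hi)
    lo11 = max(0.0, (1.0 + delta) * p1_t - 1.0)
    pi11_t = lo11 + (delta * p1_t - lo11) * 0.5 * (f_lo + f_hi)
    pi01_t, pi10_t = delta * p1_t - pi11_t, p1_t - pi11_t
    # variance of p2_hat - delta*p1_hat at the boundary: delta(pi01 + pi10)/n
    return ((a + b) - delta * (a + c)) / math.sqrt(n * delta * (pi01_t + pi10_t))
```

The grid refinement is adequate here because the multinomial log-likelihood is concave in the cell probabilities; a closed-form RMLE, as in the cited reference, would of course be faster.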
2.3 Ratio of discordance
To show that two treatments are equivalent based on the ratio of discordance, θrd, the hypothesis can be formulated as follows:
H0 : θrd ≤ δrd or θrd ≥ 1/δrd versus H1 : δrd < θrd < 1/δrd, where 0 < δrd < 1. (iii)
The corresponding double one-sided testing hypotheses are:
H01 : π01/π10 ≤ δrd versus H11 : π01/π10 > δrd, (6a)

H02 : π10/π01 ≤ δrd versus H12 : π10/π01 > δrd. (6b)
Let βrd = π01 − π10δrd. The hypothesis in (6a) is then equivalent to the following:

H01 : βrd ≤ 0 versus H11 : βrd > 0.
The score statistic for testing the hypothesis in (6a) can be shown to be

Trd = (b − δrd c) / {nδrd(1 + δrd)π̃10}^1/2, (7)

where π̃10 is the RMLE of π10 at the boundary π01/π10 = δrd and is equal to (b + c)/[n(δrd + 1)].
For the special case of δrd = 1, π̃10 = (b + c)/(2n), and Trd reduces to the well-known McNemar’s test statistic. See details in [Liu JP and MC(2005)].
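Because the RMLE is explicit here, the ratio-of-discordance statistic has a simple closed form. A minimal sketch (ours) that reproduces the Trd entries of Tables 4 and 6:

```python
import math

def trd_score(b, c, n, delta):
    """Score statistic Trd of (7) for the ratio of discordance, using
    the explicit RMLE pi10 = (b + c) / (n (delta + 1)). Only the
    discordant counts b, c and the total n are needed."""
    pi10 = (b + c) / (n * (delta + 1.0))
    return (b - delta * c) / math.sqrt(n * delta * (1.0 + delta) * pi10)
```

Algebraically the statistic simplifies to (b − δrd·c)/√(δrd(b + c)), and at δrd = 1 it is exactly the McNemar statistic.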
The score test statistics Td, Trr and Trd all have an asymptotic standard normal distribution at their respective null boundaries. It follows that each one-sided null hypothesis H0 is rejected at significance level α if T ≥ z1−α, where z1−α is the 100(1 − α)th percentile of the standard normal distribution; equivalence is declared only if both one-sided null hypotheses are rejected.
2.4 Equivalence margins and relationships of the three testing parameters
Like any typical equivalence testing procedure, the three testing procedures introduced previously require a pre-specified equivalence margin, δ. Because a different level of δ might completely alter the conclusion of an equivalence study, the margin deserves careful discussion between the statisticians and the investigators planning such a study. Based on guidance from the FDA [FDA/CDER(2001)], Wellek (2003) has provided some proposals for choosing equivalence margins. In the same spirit, Table 2 presents some suggested equivalence margins for testing the null hypotheses in (i), (ii) and (iii). More general discussions on the choice of equivalence margins can be found in the ICH E10 Guideline ([Guideline(2000)]). The proposed equivalence margins in this paper should by no means be regarded as fixed rules; however, they provide researchers with a range of options for each specific question. For example, for a highly effective treatment with a response rate above 85%, it is advisable to use an equivalence margin (liberal or strict) of less than 15%. These proposed values of the equivalence margins are used for illustration in the following sections.
Table 2.
Proposals for choosing the limits of a symmetrical equivalence interval.
| Testing parameter θ | Description | Equivalence interval (strict) | Equivalence interval (liberal) |
|---|---|---|---|
| 1. θd = p1 − p2 (i.e., θd = π10 − π01) | difference in response rates (i.e., difference in the prob. of discordant pairs) | (−.05, .05) | (−.15, .15) |
| 2. θrr = p2/p1 | ratio of response rates | (.95, 1.05) | (.85, 1.18) |
| 3. θrd = π10/π01 | ratio of the prob. of discordant pairs | (.95, 1.05) | (.85, 1.18) |
Besides the choice of an appropriate equivalence margin, a careful and sensible decision on the choice of a testing parameter, θ, is also needed in equivalence testing studies. This is because, in contrast to conventional hypothesis testing, where the common boundary of the null and alternative hypotheses is mostly given by zero, equivalence hypothesis testing is generally not invariant under redefinitions of the testing parameter. For example, the conventional null hypotheses θd = 0 and θrr = 1 correspond to the same set of (p1, p2) in the space [0, 1] × [0, 1]. However, the sets {(p1, p2) ∣ −δd < θd < δd} and {(p1, p2) ∣ δrr < θrr < 1/δrr} are different for any choice of the constants 0 < δd, δrr < 1. To investigate the relationships between the three testing parameters θd, θrr and θrd, we compare the equivalence regions resulting from each testing procedure, re-defined in a common space. Specifically, the equivalence region based on the rate ratio,

{(p1, p2) ∣ δrr < p2/p1 < 1/δrr},

can be expressed in a space defined by θd = p1 − p2 and p2 as

p2(δrr − 1) < θd < p2(1 − δrr)/δrr, (8)

while the equivalence region based on the rate difference can be expressed in the same space as

−δd < θd < δd. (9)
From (8), for a given value of p2, the width of the equivalence boundaries for θrr in terms of θd is p2((1 − δrr)/δrr − (δrr − 1)) = p2(1 − δrr²)/δrr, while the width of the equivalence boundaries for θd is always 2δd regardless of the value of p2. Therefore, the equivalence boundaries for θrr are wider than those for θd as long as p2 > 2δdδrr/(1 − δrr²), and narrower as long as p2 < 2δdδrr/(1 − δrr²). Note that the equivalence boundaries for θrr are always narrower than those for θd whenever this threshold is not attainable, i.e., whenever 2δdδrr/(1 − δrr²) ≥ 1. In the context of equivalence testing, wider (or narrower) equivalence boundaries suggest higher (or lower) power to accept equivalence.
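The width comparison can be sketched numerically (the helper names are ours); the threshold below reproduces the value 4/9 quoted for Figure 1:

```python
def rr_width(p2, delta_rr):
    """Width, in terms of theta_d, of the rate-ratio equivalence region
    (8) at a given p2: p2 (1 - delta_rr^2) / delta_rr."""
    return p2 * (1.0 - delta_rr ** 2) / delta_rr

def crossover_p2(delta_d, delta_rr):
    """Value of p2 above which the rate-ratio region becomes wider than
    the rate-difference region of constant width 2 delta_d."""
    return 2.0 * delta_d * delta_rr / (1.0 - delta_rr ** 2)
```

With δd = 0.1 and δrr = 0.8 the crossover is at p2 = 4/9, while with δd = 0.2 and δrr = 0.9 it exceeds 1, so the rate-ratio boundaries are then narrower for every attainable p2.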
Similarly, because the equivalence boundaries based on the rate difference can also be expressed, using θd = p1 − p2 = π10 − π01, as

−δd < π10 − π01 < δd, (10)

we can express the equivalence boundaries based on the ratio of discordance,

{(π10, π01) ∣ δrd < π10/π01 < 1/δrd},

in a space defined by θd = π10 − π01 and π01 as

π01(δrd − 1) < θd < π01(1 − δrd)/δrd. (11)
From (11), for a given value of π01, the width of the equivalence boundaries for θrd in terms of θd is π01((1 − δrd)/δrd − (δrd − 1)) = π01(1 − δrd²)/δrd, while the width of the equivalence boundaries for θd is always 2δd regardless of the value of π01. Similarly, the equivalence boundaries for θrd are wider than those for θd as long as π01 > 2δdδrd/(1 − δrd²), and narrower as long as π01 < 2δdδrd/(1 − δrd²). Note that the equivalence boundaries for θrd are always narrower than those for θd whenever this threshold exceeds the largest admissible value of π01.
In Figures 1-3, the equivalence regions based on the rate difference are overlaid with those based on the rate ratio (left panel a) and with those based on the ratio of discordance (right panel b) under different settings of the equivalence margins δd, δrr and δrd. Based on Figure 1, the two pairs of equivalence regions overlap substantially with δd = 0.1 and δrr = δrd = 0.8. In this case, the equivalence boundaries based on θrr are wider than those based on θd when 4/9 < p2 < 1 (since 2δdδrr/(1 − δrr²) = 4/9) and narrower when 0 < p2 < 4/9. However, based on Figure 2, where δd = 0.2 and δrr = δrd = 0.9, the equivalence boundaries based on θrr are always narrower than those based on θd because 2δdδrr/(1 − δrr²) ≈ 1.9 > 1. Based on Figure 3, where δd = 0.2 and δrr = δrd = 0.8, the equivalence boundaries based on θrr or θrd are always narrower than those based on θd because 2δdδrr/(1 − δrr²) = 8/9 and the remaining range of p2 (or π01) is truncated by the constraint that all cell probabilities lie in [0, 1]. In summary, the equivalence boundaries for θd are always more liberal than those for θrr and θrd unless p2 (or π01) exceeds 2δdδrr/(1 − δrr²) (or 2δdδrd/(1 − δrd²)). This suggests that, if the response probability of a treatment is assumed to be at least 50%, there is more power to declare equivalence using the rate difference as the testing parameter, given that the equivalence margins are comparable. However, in a comparison of two laboratory tests, it is likely that p2 < 50%, and the equivalence boundaries for the rate ratio and the ratio of discordance could then be more liberal than those for the rate difference, depending on the chosen margins. Therefore, researchers are recommended to take advantage of the information presented in Figures 1-3 when designing an equivalence testing study.
Figure 1.
Equivalence testing regions in terms of the rate difference (black-line shaded area) together with the rate ratio (red-line shaded area) [left] and the ratio of discordance (red-line shaded area) [right], as regions in the p2 × (p1 − p2)-plane [left] and the π10 × (p1 − p2)-plane [right], with δd = 0.1 and δrr = δrd = 0.8. [Green shaded areas = set of possible values of (p2, p1 − p2).]
Figure 2.
Equivalence testing regions in terms of the rate difference (black-line shaded area) together with the rate ratio (red-line shaded area) [left] and the ratio of discordance (red-line shaded area) [right], as regions in the p2 × (p1 − p2)-plane [left] and the π10 × (p1 − p2)-plane [right], with δd = 0.2 and δrr = δrd = 0.9. [Green shaded areas = set of possible values of (p2, p1 − p2).]
Figure 3.
Equivalence testing regions in terms of the rate difference (black-line shaded area) together with the rate ratio (red-line shaded area) [left] and the ratio of discordance (red-line shaded area) [right], as regions in the p2 × (p1 − p2)-plane [left] and the π10 × (p1 − p2)-plane [right], with δd = 0.2 and δrr = δrd = 0.8. [Green shaded areas = set of possible values of (p2, p1 − p2).]
3 Examples
In this section, we look at the impact of using different testing parameters and equivalence margins on the results of an equivalence study, using two data examples. First, we consider the HIV-1 screening test example given in [Lachenbruch and Lynch(1998)]. Table 3 provides the outcomes of the HIV screening test using a particular body fluid versus testing plasma directly from 1157 individuals. We applied to this data set the three testing parameters (rate difference, rate ratio and ratio of discordance) with the suggested liberal and strict equivalence margins outlined in Table 2. The testing results are summarized in Table 4. All test statistics yielded p-values less than 0.05, indicating that, in terms of the response rate and the ratio of discordance, testing the alternative body fluid samples was equivalent to testing plasma samples.
Table 3.
Example 1 – HIV screening test (source: [Lachenbruch and Lynch(1998)]).
| | | Plasma sample | | |
|---|---|---|---|---|
| | | + | − | Total |
| Alternative body fluid sample | + | 446 | 5 | 451 |
| | − | 16 | 690 | 706 |
| | Total | 462 | 695 | 1157 |
Table 4.
Example 1 – results of the equivalence test for the HIV screening test examples.
| | Td | | Trr | | Trd | |
|---|---|---|---|---|---|---|
| Equivalence limit | δd = 0.05 | δd = 0.15 | δrr = 0.95 | δrr = 0.85 | δrd = 0.95 | δrd = 0.85 |
| H01 versus H11 (p-value) | 6.025 (< .001) | 13.184 (< .001) | 2.259 (.012) | 7.308 (< .001) | −2.284 (.011) | −2.036 (.021) |
| H02 versus H12 (p-value) | 8.183 (< .001) | 14.529 (< .001) | 5.449 (< .001) | 9.3 (< .001) | 2.519 (.006) | 2.781 (.003) |
| p-value for equivalence | < .001 | < .001 | .012 | < .001 | .011 | .021 |
Second, we consider the cross-over trial of disinfection systems for soft contact lenses given in [Tango(1998)]. Table 5 provides the outcomes of the study, in which 44 subjects were evaluated to compare a chemical disinfection system with a thermal disinfection system for soft contact lenses. Again we applied to this data set the three testing parameters and the suggested liberal and strict equivalence margins. Table 6 shows the results of the six testing procedures. The three testing procedures that use the strict equivalence margins (δd = 0.05 or δrr = δrd = 0.95) resulted in non-significant p-values at two-sided α = 0.05, most likely due to the small sample size. However, different conclusions could result when a more liberal equivalence margin is used. As shown in columns 3, 5 and 7, with δd = 0.15 and δrr = δrd = 0.85 the p-values of the two tests based on the rate difference and the rate ratio were less than 0.05, while the test based on the ratio of discordance remained non-significant (p-value = 0.178). These two examples demonstrate the importance of pre-specifying the testing parameter and the equivalence margins to avoid ambiguity in the interpretation of an equivalence study.
Table 5.
Example 2 – clinical assessment of treatments in cross-over trials of disinfection systems for soft contact lenses (source: [Tango(1998)]).
| | | Thermal disinfection | | |
|---|---|---|---|---|
| | | + | − | Total |
| Hydrogen peroxide | + | 43 | 0 | 43 |
| | − | 1 | 0 | 1 |
| | Total | 44 | 0 | 44 |
Table 6.
Example 2 – results of the equivalence test for the disinfection systems for soft contact lenses.
| | Td | | Trr | | Trd | |
|---|---|---|---|---|---|---|
| Equivalence limit | δd = 0.05 | δd = 0.15 | δrr = 0.95 | δrr = 0.85 | δrd = 0.95 | δrd = 0.85 |
| H01 versus H11 (p-value) | 0.830 (.203) | 2.364 (.009) | 0.830 (.203) | 2.364 (.009) | −0.975 (.165) | −0.922 (.178) |
| H02 versus H12 (p-value) | 1.835 (.033) | 2.990 (.001) | 1.821 (.034) | 2.961 (.002) | 1.026 (.152) | 1.085 (.139) |
| p-value for equivalence | .203 | .009 | .203 | .009 | .165 | .178 |
4 Simulations
In this section, we study the empirical type I and II error rates of the three testing procedures in various parameter settings. We choose n = 50, 100 and 200, p1 = 0.5, 0.8 and 0.95, and ϕ = 0 and 0.03 when applicable. To study the empirical significance levels of the testing procedures, 10,000 random samples were generated from multinomial distributions of size n with cell probabilities π11 = p1p2 + ϕ, π01 = (1 − p1)p2 − ϕ, π10 = p1(1 − p2) − ϕ and π00 = (1 − p1)(1 − p2) + ϕ, where ϕ is the covariance of the subject-specific response probabilities of the two methods. Note that ϕ = 0 does not necessarily indicate that the two response variables are independent. Table 7 shows the empirical significance level of each testing procedure under β = 0, i.e., at the null boundaries p1 − p2 = δd (βd = 0), p2/p1 = δrr (βrr = 0) and π01/π10 = δrd (βrd = 0), where δd = 0.05 and δrr = δrd = 0.95. The nominal α level is set to 0.05. Overall, these simulation results show that under βd = 0 the empirical type I error rates are well maintained for Td, while Trr and Trd are very liberal, with empirical type I error rates that can be much higher than the nominal 5% level. Under βrr = 0, the empirical type I error rates for Trr are close to the nominal level; Td appears conservative in this case, while Trd is again very liberal. Under the last scenario, βrd = 0, both Td and Trr are slightly conservative except when the sample size is small (n = 50) and p1 = 0.5, while Trd appears liberal under all studied settings. Because of the constraint on the parameter values, p1 + δrdπ01 ≤ 1 (i.e., π01 ≤ (1 − p1)/δrd), results with p1 = 0.95 are not shown in Table 7, to avoid the generation of extremely sparse data.
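The simulation mechanism can be sketched as follows (a seeded toy version of one replicate of the 10,000-replicate study; the function names are ours):

```python
import random

def sample_counts(p1, p2, phi, n, rng):
    """Draw one multinomial sample (a, b, c, d) of size n from the cell
    probabilities pi11 = p1 p2 + phi, pi01 = (1-p1) p2 - phi,
    pi10 = p1 (1-p2) - phi, pi00 = (1-p1)(1-p2) + phi."""
    probs = [p1 * p2 + phi, (1 - p1) * p2 - phi,
             p1 * (1 - p2) - phi, (1 - p1) * (1 - p2) + phi]
    counts = [0, 0, 0, 0]
    for _ in range(n):
        u, acc = rng.random(), 0.0
        for i, p in enumerate(probs):
            acc += p
            if u < acc:
                counts[i] += 1
                break
        else:
            counts[3] += 1  # guard against floating-point round-off
    return tuple(counts)
```

Repeating this under a null boundary (e.g., p2 = p1 − δd) and recording the rejection frequency of each score statistic gives the empirical type I error rates reported in Table 7; repeating it under p1 = p2 gives the empirical power of Table 8.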
Table 7.
Empirical Type I error (%) based on 10,000 trials simulated under H0 with α = 5% (δd = 0.05 and δrr = δrd = 0.95)
| H0 : p1 – p2 = δd | H0 : p2/p1 = δrr | H0 : π01/π10 = δrd | ||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| n | p 1 | ϕ | π 01 | Td | Trr | Trd | π 01 | Td | Trr | Trd | π 01 | Td | Trr | Trd |
| 50 | 0.5 | 0 | 0.28 | 5.5 | 8.9 | 11.1 | 0.26 | 3.6 | 6.1 | 8.1 | 0.28 | 4 | 6.5 | 8.3 |
| 0.5 | 0.03 | 0.25 | 5 | 8.9 | 11.7 | 0.23 | 3.2 | 6.3 | 8.6 | 0.25 | 3.4 | 6.1 | 8.5 | |
| 0.8 | 0 | 0.2 | 5.1 | 6.8 | 13.1 | 0.19 | 4.5 | 5.8 | 11.8 | 0.2 | 2.6 | 3.5 | 8 | |
| 0.8 | 0.03 | 0.17 | 5.4 | 7 | 14.2 | 0.16 | 4.3 | 5.8 | 11.8 | 0.17 | 2.3 | 3 | 7.9 | |
| 100 | 0.5 | 0 | 0.28 | 4.7 | 10.2 | 13.2 | 0.26 | 2.8 | 6.3 | 8.8 | 0.28 | 2.3 | 5 | 6.8 |
| 0.5 | 0.03 | 0.25 | 5.2 | 11.2 | 15.3 | 0.23 | 2.6 | 6.1 | 9.4 | 0.25 | 2 | 4.9 | 6.8 | |
| 0.8 | 0 | 0.2 | 5.1 | 7.5 | 17.9 | 0.19 | 3.5 | 4.9 | 13.6 | 0.2 | 1.5 | 2.1 | 7.2 | |
| 0.8 | 0.03 | 0.17 | 4.9 | 7.6 | 20.5 | 0.16 | 3.2 | 5 | 15.4 | 0.17 | 1.4 | 2.3 | 8.2 | |
| 200 | 0.5 | 0 | 0.28 | 5.2 | 13.3 | 18.3 | 0.26 | 1.8 | 5.3 | 9 | 0.28 | 1.1 | 3.7 | 6.1 |
| 0.5 | 0.03 | 0.25 | 5 | 13.5 | 20.4 | 0.23 | 1.7 | 5.5 | 10 | 0.25 | 0.9 | 3.3 | 6.3 | |
| 0.8 | 0 | 0.2 | 5.3 | 8.5 | 25.5 | 0.19 | 3 | 4.8 | 17.8 | 0.2 | 0.8 | 1.4 | 7 | |
| 0.8 | 0.03 | 0.17 | 4.7 | 8.3 | 29.7 | 0.16 | 2.9 | 5.4 | 21.8 | 0.17 | 0.6 | 1.1 | 7.4 | |
Table 8 shows the empirical power of each testing procedure under p1 = p2. In general, higher power is achieved with larger sample sizes and higher values of p1 and ϕ. This suggests that when the marginal response rates (p1 and p2) are high or when the proportion of concordant pairs is high, there is a greater chance of declaring equivalence with the same equivalence margins. It is also observed that Td is always more powerful than Trr and Trd except when the sample size is small (n = 50) and p1 = 0.5, or when n = 50, p1 = 0.8 and ϕ = 0. In summary, these simulation results show that the rate difference is an attractive testing parameter whose testing procedure generally has well-maintained operating characteristics. Testing procedures based on the rate ratio and the ratio of discordance tend to be more liberal, but could be more powerful when the sample size is small (n = 50) and the response rate is low (p1 = 0.5), or when the sample size is small, the response rate is relatively high (p1 = 0.8) and the two binary outcomes of method 1 and method 2 are uncorrelated.
Table 8.
Empirical power (%) using double one-sided tests based on 10,000 trials simulated under p1 = p2 (i.e., π01 = π10) with δd = 0.15 and δrr = δrd = 0.85 (α = 5%)
| n | p 1 | ϕ | π 01 | δ d | Td | δ rr | Trr | δ rd | Trd |
|---|---|---|---|---|---|---|---|---|---|
| 50 | 0.5 | 0 | 0.25 | 0.15 | 1.2 | 0.85 | 1.5 | 0.85 | 21.6 |
| 0.03 | 0.22 | 0.15 | 3.9 | 0.85 | 1.5 | 0.85 | 20.6 | ||
| 0.8 | 0 | 0.16 | 0.15 | 15.5 | 0.85 | 5 | 0.85 | 19.6 | |
| 0.03 | 0.13 | 0.15 | 26.8 | 0.85 | 10.8 | 0.85 | 18.1 | ||
| 0.95 | 0 | 0.05 | 0.15 | 78.6 | 0.85 | 76.2 | 0.85 | 9.9 | |
| 0.03 | 0.02 | 0.15 | 97.8 | 0.85 | 97.6 | 0.85 | 1.3 | ||
| 100 | 0.5 | 0 | 0.25 | 0.15 | 37 | 0.85 | 0.6 | 0.85 | 29.4 |
| 0.03 | 0.22 | 0.15 | 46.7 | 0.85 | 0.5 | 0.85 | 26.4 | ||
| 0.8 | 0 | 0.16 | 0.15 | 67.4 | 0.85 | 44.4 | 0.85 | 24.3 | |
| 0.03 | 0.13 | 0.15 | 78.5 | 0.85 | 57.6 | 0.85 | 21.6 | ||
| 0.95 | 0 | 0.05 | 0.15 | 99.3 | 0.85 | 99.2 | 0.85 | 18.0 | |
| 0.03 | 0.02 | 0.15 | 100 | 0.85 | 100 | 0.85 | 6.0 | ||
| 200 | 0.5 | 0 | 0.25 | 0.15 | 82.4 | 0.85 | 2.8 | 0.85 | 39.6 |
| 0.03 | 0.22 | 0.15 | 88 | 0.85 | 7.2 | 0.85 | 36.4 | ||
| 0.8 | 0 | 0.16 | 0.15 | 96.5 | 0.85 | 87.9 | 0.85 | 33.2 | |
| 0.03 | 0.13 | 0.15 | 98.6 | 0.85 | 93.5 | 0.85 | 30.8 | ||
| 0.95 | 0 | 0.05 | 0.15 | 100 | 0.85 | 100 | 0.85 | 20.1 | |
| 0.03 | 0.02 | 0.15 | 100 | 0.85 | 100 | 0.85 | 15.0 |
5 Group sequential designs for equivalence testing
The topic of group sequential designs has generated a rich body of research. The papers by Pocock (1977), O’Brien & Fleming (1979), Fleming, Harrington & O’Brien (1984) and Lan & DeMets (1983) concerned methods with two-sided testing boundaries that allow early stopping to reject the null hypothesis. Subsequent researchers (e.g., Wang & Tsiatis (1987) and Pampallona & Tsiatis (1994)) modified the two-sided testing boundary by inserting an “inner wedge” to permit early stopping to accept the null hypothesis. In particular, Pampallona & Tsiatis (1994) described group sequential designs that allow asymmetric tests with unequal Type I and II error probabilities.
Under repeated testing in group sequential designs, the critical values of the test statistics need to be adjusted to maintain the overall type I error rate at an acceptable level (≤ α, the “consumer’s risk”), and the sample sizes need to be adjusted to attain a desired power of testing (≥ 1 − β, where β is the “producer’s risk”). For example, in the context of equivalence testing using the rate difference, given the two-sided equivalence hypothesis

H0 : ∣θd∣ ≥ δd versus H1 : ∣θd∣ < δd,

the type I error requirement is Pr{Declare equivalence ∣ ∣θd∣ = δd} ≤ α and the type II error requirement is Pr{Do not declare equivalence ∣ θd = 0} ≤ β. After reparametrization of the testing parameter, i.e., βd = p2 − (p1 − δd) or βd = p1 − (p2 − δd), each of the two one-sided hypotheses becomes

H0 : βd ≤ 0 versus H1 : βd > 0.

The error requirements become Pr{Declare equivalence ∣ βd = 0} ≤ α and Pr{Do not declare equivalence ∣ βd = δd} ≤ β. Therefore, the score test statistics for the reparametrized testing parameter discussed in the previous sections can be directly used in the adaptation of methods developed for conventional hypotheses of difference. In the following subsections, because the roles of method 1 and method 2 are simply interchanged in the reparametrized tests of the two one-sided hypotheses, we focus on one version of the symmetry and consider the testing of one of the two one-sided hypotheses (1a), (4a) and (6a) for the three testing parameters: rate difference, rate ratio and ratio of discordance, respectively.
5.1 Two-sided inner wedge tests in terms of the score statistics
Most methods discussed in the literature focus on designs for conventional hypothesis testing of a treatment difference between two independent groups. Here we extend Pampallona & Tsiatis (1994) and describe the implementation of a group sequential design for paired binary data in the context of equivalence testing. Common approaches to group sequential designs define appropriate continuation and stopping regions that guarantee the desired error rates under repeated testing. We present the implementation of a sequential design for equivalence testing with early stopping to reject equivalence (futility) as well as to declare equivalence. A general decision rule of such a design is based on pairs of constants (ak, bk) with ak < bk for k = 1,…,K − 1 and aK = bK:
After group k = 1,…,K − 1
if ∣T∣ ≤ ak stop, reject H0 and declare equivalence
if ∣T∣ ≥ bk stop, accept H0 and declare non-equivalence
otherwise continue to group k + 1
after group K
if ∣T∣ ≤ aK stop, reject H0 and declare equivalence
if ∣T∣ > aK stop, accept H0 and declare non-equivalence
where T denotes the score test statistic of section 2 for the chosen testing parameter, standardized so that it is centered at zero under exact equality of the two methods (for the rate difference, the McNemar-type statistic). Since aK = bK, termination at analysis K is ensured. For the test with shape parameter Δ, where larger values of Δ increase the maximum sample size and decrease the expected sample size, the critical values ak and bk are

ak = δd√Ik − C2(K, α, β, Δ)(k/K)^(Δ − 1/2), (12)

bk = C1(K, α, β, Δ)(k/K)^(Δ − 1/2), (13)

where Ik is the Fisher information accumulated at the k-th analysis. The constants C1(K, α, β, Δ) and C2(K, α, β, Δ) are chosen to satisfy the Type I error and power requirements; their values can be found in [Pampallona and Tsiatis(1994)]. The outer boundary with critical values bk can be viewed as a repeated significance test of βd = δd (i.e., θd = 0), with crossing indicating a difference between the two methods; the inner boundary with critical values ak can be viewed as a repeated significance test of the unequivalence boundary βd = 0, with ∣T∣ falling at or below ak rejecting unequivalence.
5.2 Implementation of a group sequential design using the rate difference in equivalence testing
Suppose α = β = 0.05, δd = 0.2 and Δ = 0, and that we wish to design a balanced study with K = 4 equally sized groups. The information needed for a fixed sample size test is
If = {(zα/2 + zβ)/δd}² = {(1.96 + 1.645)/0.2}² = 324.9.
Assuming a per-pair variance of 0.26 (the total probability of discordance), the fixed sample size test requires 324.9 × 0.26 ≈ 85 subjects. Using the power family of inner wedge tests, the information needed for a sequential test is
Imax = {(C1(K, α, β, Δ) + C2(K, α, β, Δ))/δd}² = {(1.995 + 1.708)/0.2}² = 342.8   (14)
Assuming the same per-pair variance of 0.26, the sequential test requires 88 subjects for the final stage analysis, i.e., 4 groups of 22 observations should be planned in the study.
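The information calculations above can be checked with a few lines of code; this sketch assumes the per-pair variance of 0.26 used in this example and the constants C1 = 1.995 and C2 = 1.708 for (K, α, β, Δ) = (4, 0.05, 0.05, 0) from Pampallona and Tsiatis (1994).

```python
import math

# Design values from the example: alpha = beta = 0.05, delta_d = 0.2
z_half_alpha = 1.96   # z_{alpha/2} for alpha = 0.05
z_beta = 1.645        # z_{beta} for beta = 0.05
delta_d = 0.2
var_per_pair = 0.26   # assumed per-pair variance, so n = I * 0.26

# Fixed-sample information and the corresponding number of subjects
I_fixed = ((z_half_alpha + z_beta) / delta_d) ** 2
n_fixed = math.ceil(I_fixed * var_per_pair)

# Maximum information for the sequential test: a_K = b_K in (12)-(13)
# implies delta_d * sqrt(I_max) = C1 + C2
C1, C2 = 1.995, 1.708
I_max = ((C1 + C2) / delta_d) ** 2

print(round(I_fixed, 1), n_fixed, round(I_max, 1))  # 324.9 85 342.8
```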
The test is implemented by applying the decision tree of section 5.1 to the score test statistic Tk, k = 1,…,K. The critical values ak and bk are obtained from (12) and (13) with C1(4, 0.05, 0.05, 0) = 1.995 and C2(4, 0.05, 0.05, 0) = 1.708. Specifically, (a1, a2, a3, a4) = (−1.58, 0.19, 1.21, 1.97) and (b1, b2, b3, b4) = (3.99, 2.82, 2.30, 1.995). Since a1 is negative, it is not possible to stop and declare equivalence at the first analysis. In addition, the expected sample sizes are 65.4, 70.4 and 58.6 under βd = ±0.2, ±0.1 and 0, respectively, compared with the 85 subjects required by a fixed sample procedure, so the sequential design offers a substantial saving in expected sample size. Note that if Δ is increased, the test allows greater opportunity for early stopping, lowering the expected sample size, but the maximum sample size also increases.
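These critical values can be reproduced numerically. The sketch below assumes the power-family boundary form with equally spaced information levels and takes Imax as 88/0.26, the information carried by the 88 planned pairs.

```python
import math

# Boundary constants for K = 4, alpha = beta = 0.05, Delta = 0
K, Delta, delta_d = 4, 0.0, 0.2
C1, C2 = 1.995, 1.708
I_max = 88 / 0.26  # information carried by the 88 planned pairs

a, b = [], []
for k in range(1, K + 1):
    shape = (k / K) ** (Delta - 0.5)  # power-family factor (k/K)^(Delta - 1/2)
    I_k = (k / K) * I_max             # equally spaced information levels
    b.append(C1 * shape)              # outer boundary: test of beta_d = 0
    a.append(delta_d * math.sqrt(I_k) - C2 * shape)  # inner wedge

print([round(x, 2) for x in a])  # [-1.58, 0.19, 1.21, 1.97]
print([round(x, 3) for x in b])  # [3.99, 2.821, 2.304, 1.995]
```

The slight gap between a4 = 1.97 and b4 = 1.995 reflects the rounding of the planned sample size to 4 groups of 22.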
6 Conclusion
In this paper, we investigated three testing parameters, the rate difference, the rate ratio and the ratio of discordance, for matched-pair designs. Based on real data examples and simulation studies, we observed that the score test statistic for the rate difference based on the reparameterized multinomial model is generally a good choice for equivalence testing of two methods with paired binary data. The implementation of a group sequential design discussed in this paper is by no means the only or the best approach; however, it demonstrates the resource savings available in the equivalence testing context when group sequential designs are employed instead of a fixed sample design. Müller and Schäfer (1999) suggested that group sequential equivalence testing is most profitable when the power of the trial to establish equivalence is high. Therefore, in view of the savings that can be achieved in experimental costs and time, we recommend that early stopping to accept the null hypothesis be considered in equivalence testing using group sequential designs. The score test statistics for the testing procedures discussed in this paper are all based on large-sample theory. It would be of interest to conduct similar studies using exact inference when study sample sizes are small or the data structure is sparse. Details on group sequential designs using exact calculations for binary data can be found in Jennison and Turnbull (1999).
References
- FDA/CDER. Guidance for industry: bioanalytical method validation (online document). 2001. http://www.fda.gov/cder/guidance/4252fnl.pdf [last accessed Feb. 2008].
- Fleming TR, Harrington DP, O’Brien PC. Designs for group sequential tests. Controlled Clinical Trials. 1984;5:348–361.
- ICH E10. Choice of control group and related issues in clinical trials. International Conference on Harmonisation (ICH); 2000.
- Jennison C, Turnbull BW. Interim analyses: the repeated confidence interval approach. Journal of the Royal Statistical Society, Series B. 1989;51:305–361.
- Jennison C, Turnbull BW. Group Sequential Methods with Applications to Clinical Trials. Chapman & Hall/CRC; New York: 1999.
- Lachenbruch PA, Lynch CJ. Assessing screening tests: extensions of McNemar’s test. Statistics in Medicine. 1998;17:2207–2217.
- Lan KKG, DeMets DL. Discrete sequential boundaries for clinical trials. Biometrika. 1983;70:659–663.
- Liu JP, et al. Tests for equivalence based on odds ratio for matched-pair design. Journal of Biopharmaceutical Statistics. 2005;15:889–901.
- Lu Y, Bean JA. On the sample size for one-sided equivalence of sensitivities based upon McNemar’s test. Statistics in Medicine. 1995;14:1831–1839.
- Müller HH, Schäfer H. Optimization of testing times and critical values in sequential equivalence testing. Statistics in Medicine. 1999;18:1769–1788.
- Nam J-M. Establishing equivalence of two treatments and sample size requirement in matched-pairs design. Biometrics. 1997;53:1422–1430.
- Tang N-S, Tang M-L, Chan ISF. On tests of equivalence via non-unity relative risk for matched-pair design. Statistics in Medicine. 2003;22:1217–1233.
- O’Brien PC, Fleming TR. A multiple testing procedure for clinical trials. Biometrics. 1979;35:549–556.
- Pampallona S, Tsiatis AA. Group sequential designs for one-sided and two-sided hypothesis testing with provision for early stopping in favor of the null hypothesis. Journal of Statistical Planning and Inference. 1994;42:19–35.
- Pocock SJ. Group sequential methods in the design and analysis of clinical trials. Biometrika. 1977;64:191–199.
- Schuirmann DJ. A comparison of the two one-sided tests procedure and the power approach for assessing the equivalence of average bioavailability. Journal of Pharmacokinetics and Biopharmaceutics. 1987;15:657–680.
- Tango T. Equivalence test and confidence interval for the difference in proportions for the paired-sample design. Statistics in Medicine. 1998;17:891–908.
- Wang SK, Tsiatis AA. Approximately optimal one-parameter boundaries for group sequential trials. Biometrics. 1987;43:193–200.
- Wellek S. Testing Statistical Hypotheses of Equivalence. Chapman & Hall/CRC; 2003.
- Whitehead J. Sequential designs for equivalence studies. Statistics in Medicine. 1996;15:2703–2715.



