Abstract
Manufacturing and testing of pharmaceutical products frequently occur in multiple facilities within a company’s network. It is of interest to demonstrate equivalence among the alternative testing/manufacturing facilities to ensure product consistency and quality regardless of the facility where the product was manufactured or tested. In the Frequentist framework, equivalence testing is well established when comparing two labs or manufacturing facilities; however, when considering more than two labs or production sites, the Frequentist approach may not always offer appropriate or interpretable estimates for demonstrating equivalence among all of them simultaneously. This paper demonstrates the utility of Bayesian methods for the equivalence assessment of multiple group means, with a comparison against traditional Frequentist methods. We conclude that a Bayesian strategy is very useful for addressing the problem of multi-group equivalence. While it is not our intention to argue that Bayesian methods should always replace Frequentist ones, we show that among the advantages of a Bayesian analysis is that it provides a more nuanced understanding of the degree of similarity among sites than the hypothesis testing underpinning the Frequentist approach.
KEYWORDS: Multiple comparison of means, Bayesian methods, frequentist methods, multivariate t-distribution, technology transfer, comparability
1. Introduction
Equivalence testing is often conducted for the comparison of means of two groups. There are many applications for this. A simple example is the use of a validated analytical method at two sites, that should ideally yield the same result when testing the same sample. Another example is when there are two production sites manufacturing the same product which is to be characterized with the same analytical method, and in such a case it is to be demonstrated that the results are sufficiently similar. That is, due to the inherent differences such as location, equipment and personnel, the population means of the results from two sites are not expected to be exactly the same, but the difference should be small enough as to be practically unimportant.
Thus, in equivalence testing the interest is in determining whether the true mean difference (e.g. between sites or labs) is ‘small enough’ (practically ignorable) or not. More specifically, equivalence is demonstrated when it is shown that the true mean difference is smaller than a pre-set threshold, referred to as an ‘equivalence margin.’ It is perfectly acceptable that the population means are actually different, as long as it can be guaranteed that this difference is not too large. Experimental and practical considerations define what ‘too large’ is in practice.
Multiple statistical methods to assess equivalence between two groups using Frequentist methodology are well described in the literature. The most commonly used method is the TOST approach for comparing the means of two groups (Two One-Sided T-tests; see, for example [1,2,4,14,19]); other approaches have been promoted for comparing entire distributions (see, for example [8,12,15,17,20]). TOST, being the most widely used approach, will be the focus of the Frequentist aspects of this paper. Under the TOST procedure, the null hypothesis is that the difference between the population means of the two sites (referred to as $\mu_1$ and $\mu_2$) is as large as or larger than the equivalence margin (given by Δ), which is to be rejected in favor of the alternative hypothesis that the difference is smaller than the equivalence margin. Mathematically, these hypotheses can be written as follows:
$$H_0: |\mu_1 - \mu_2| \ge \Delta \quad \text{versus} \quad H_1: |\mu_1 - \mu_2| < \Delta \tag{1}$$
In practice, the hypothesis is tested by constructing a two-sided 100(1 − 2α)% confidence interval for the difference $\mu_1 - \mu_2$ and comparing its limits with the equivalence margin: equivalence is declared when the interval lies entirely within (−Δ, +Δ).
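The interval-based TOST decision for two groups can be sketched as follows. This is a minimal illustration, assuming a common variance in both groups and a pooled-variance t interval; the function name `tost_two_groups` and the data values are made up for the example and do not come from the paper.

```python
import numpy as np
from scipy import stats


def tost_two_groups(x, y, delta, alpha=0.05):
    """Return the two-sided 100(1 - 2*alpha)% CI for mean(x) - mean(y)
    and whether it lies entirely inside (-delta, +delta)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n1, n2 = x.size, y.size
    df = n1 + n2 - 2
    # Pooled variance, assuming both groups share one variance.
    sp2 = ((n1 - 1) * x.var(ddof=1) + (n2 - 1) * y.var(ddof=1)) / df
    se = np.sqrt(sp2 * (1.0 / n1 + 1.0 / n2))
    t_crit = stats.t.ppf(1.0 - alpha, df)  # one-sided critical value
    diff = x.mean() - y.mean()
    lower, upper = diff - t_crit * se, diff + t_crit * se
    return lower, upper, bool(-delta < lower and upper < delta)


# Hypothetical results from two sites (illustrative numbers only).
site_a = [10.1, 9.9, 10.0, 10.2, 9.8]
site_b = [10.3, 10.1, 10.2, 10.4, 10.0]

lo, hi, ok_wide = tost_two_groups(site_a, site_b, delta=0.5)
_, _, ok_narrow = tost_two_groups(site_a, site_b, delta=0.3)
print(f"90% CI: [{lo:.3f}, {hi:.3f}]")
print(f"equivalent with margin 0.5: {ok_wide}; with margin 0.3: {ok_narrow}")
```

With the same data, the conclusion depends only on where the interval limits fall relative to the chosen margin, which is why the margin has to be fixed before the data are examined.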
Though this TOST procedure works well for equivalence testing involving two groups, in real world applications comparisons of multiple means are likely. Straightforward application of the TOST approach would then lead to the construction of several confidence intervals for group mean differences, all of which are to be compared with the equivalence margin. There are Frequentist techniques that control the family-wise error rate in pair-wise comparisons of multiple group means when it comes to significance testing, but it is not clear how to deal with multiplicity in equivalence testing of more than two means. The literature is scarce on this subject. Lauzon and Caffo [13] suggest a method to apply when the individual group population means, if sorted in ascending sequence, all differ from each neighbor by Δ, but in that case other differences between population means are a multiple of Δ. Rusticus and Lovato [18] propose using the Games-Howell post-hoc correction used in significance testing to allow for heterogeneity of variances and declare equivalence if all corrected two-sided confidence intervals fall within the pre-defined margin ±Δ. However, the authors do not address the question of whether such a correction is needed in the first place, nor whether multiplicity corrections developed for significance testing can appropriately be applied in equivalence testing in the same way they are applied for significance testing. After all, the null hypothesis in significance testing is that all population means are identical, which is a well-defined situation regardless of the number of groups. For equivalence testing, the null hypothesis is that the absolute difference between the population means is larger than some pre-set value (equation 1), which leads to a large number of different scenarios if one tries to define population means for more than 2 groups satisfying this.
As an alternative to the Frequentist approach, Bayesian techniques offer a more clear and flexible assessment of the probability of satisfying an equivalence criterion. The literature provides examples in which Bayesian approaches are employed in comparing multiple means [6,16], for equivalence testing of two group means [10,11,21], and for omnibus non-inferiority testing for ANOVA and linear regression analyses [3]. This paper focuses on the application of Bayesian methods to an equivalence assessment with multiple group means, while performing a comparison against traditional Frequentist-based methods. This comparison is intended to help practitioners understand (1) the similarities in numerical results, but differences in interpretation and richness of information, between the two approaches when diffuse or non-informative priors are used for the Bayesian model and (2) the advantages of being able to incorporate prior information into a Bayesian model when available, thereby reducing statistical uncertainty for a given sample size. Ultimately, our hope is that statistical practitioners, scientists, and regulators will recognize the utility of Bayesian modeling as another statistical ‘tool’ in the toolbox to be used when useful, rather than feel a need to choose between Bayesian and Frequentist methods for all statistical applications. Additionally, simulations are presented which demonstrate that the correction for multiple comparisons is not necessary even in a Frequentist framework. The paper is focused on the relatively simple case of a common replicate variability among the groups, and Normally distributed data.
2. Theoretical background
2.1. Symbols and definitions
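In the following formulas, N indicates the number of sites, with site-dependent population means $\mu_i$ ($i = 1, \ldots, N$) and a common variance $\sigma^2$. Each site contributes $n_i$ observations $y_{ij}$ ($j = 1, \ldots, n_i$). Note that it is not required that all sites contribute the same number of observations. The equivalence margin is referred to as Δ.
The observed mean $\bar{y}_i$ and corrected sum of squares $\mathrm{CSS}_i$ for the ith site are calculated according to
$$\bar{y}_i = \frac{1}{n_i} \sum_{j=1}^{n_i} y_{ij}, \qquad \mathrm{CSS}_i = \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_i)^2 \tag{2}$$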
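Further, define the total number of observations and the total corrected sum of squares over all sites as
$$n = \sum_{i=1}^{N} n_i, \qquad \mathrm{CSS} = \sum_{i=1}^{N} \mathrm{CSS}_i \tag{3}$$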
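The individual observed site variances ($s_i^2$) and the pooled variance over all sites ($s_p^2$) are computed according to
$$s_i^2 = \frac{\mathrm{CSS}_i}{n_i - 1}, \qquad s_p^2 = \frac{\sum_{i=1}^{N} \mathrm{CSS}_i}{\sum_{i=1}^{N} n_i - N} \tag{4}$$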
2.2. A Frequentist approach
The null hypothesis for equivalence testing for two groups (see equation 1) can be generalized for the multi-group case to
$$H_0: \max_{i \ne j} |\mu_i - \mu_j| \ge \Delta \quad \text{versus} \quad H_1: \max_{i \ne j} |\mu_i - \mu_j| < \Delta \tag{5}$$
where $\mu_i$ and $\mu_j$ are the population means of groups $i$ and $j$, respectively. In other words, all possible differences between the population means should be below the equivalence margin to reject the null hypothesis. After the data have been collected, the confidence intervals for the differences $\mu_i - \mu_j$ can be calculated, according to the general equation
$$(\bar{y}_i - \bar{y}_j) \pm t^{*}\, s_p \sqrt{\frac{1}{n_i} + \frac{1}{n_j}} \tag{6}$$
In this equation $s_p$ is the pooled standard deviation over all groups (equation 4), and $t^{*}$ is a critical value of the t-distribution, which in significance testing may be adjusted for multiple testing procedures. Equivalence over all groups is declared if all confidence intervals are fully within the range from −Δ to +Δ. In the TOST approach, the one-sided critical t-value corresponding to the significance level α is used; the rationale for this is discussed extensively in literature about the TOST approach (see, for example, [1]).
2.3. A Bayesian approach
2.3.1. Preliminaries
The problem of equivalence testing can also be treated from a Bayesian point of view. The result of such an analysis could be a posterior distribution of the maximum absolute difference (see equation 5) between the site means and using the information from this posterior a conclusion as to equivalence can be drawn. The first step is to set prior distributions for the parameters (informative or non-informative), in this case the analytical standard deviation (assumed to be site-independent) and the individual site means. The likelihood of the experimental data follows from the assumption of normality of the individual data and the common variance. Combining the prior and the likelihood results in the posterior distribution of the analytical standard deviation and the individual site means. From this, the posterior for the maximum difference between the site means easily follows.
2.3.2. Mathematical details
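For the current paper, we assume that the original data as observed at the different sites are from Normal distributions. Using the definitions of Section 2.1 it follows that the likelihood for a given site $i$ is given by the expression
$$L_i(\mu_i, \sigma^2) = (2\pi\sigma^2)^{-n_i/2} \exp\!\left( -\frac{\mathrm{CSS}_i + n_i(\mu_i - \bar{y}_i)^2}{2\sigma^2} \right) \tag{7}$$
The total likelihood for the complete data set can be expressed as
$$L(\mu_1, \ldots, \mu_N, \sigma^2) = \prod_{i=1}^{N} L_i(\mu_i, \sigma^2) = (2\pi\sigma^2)^{-n/2} \exp\!\left( -\frac{\mathrm{CSS} + \sum_{i=1}^{N} n_i(\mu_i - \bar{y}_i)^2}{2\sigma^2} \right) \tag{8}$$
where $n = \sum_i n_i$ and $\mathrm{CSS} = \sum_i \mathrm{CSS}_i$ are the totals over all sites (equation 3).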
For the current paper we will not address the use of prior knowledge on the site means, so we will use non-informative priors for the site means, but we will explore the situations both with and without relevant prior information on the common within-site variance. The conjugate prior for the variance of a Normal likelihood is the inverse gamma distribution. The parameterization used in this paper is
$$p(\sigma^2) \propto (\sigma^2)^{-(a/2 + 1)} \exp\!\left( -\frac{b}{2\sigma^2} \right) \tag{9}$$
for some pre-defined non-negative parameters $a$ and $b$. For this distribution, the expectation of $1/\sigma^2$ equals $a/b$. The default non-informative reference prior, $p(\sigma^2) \propto 1/\sigma^2$, is obtained by choosing $a = b = 0$.
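Combining the priors (flat priors on the site means and equation 9 for the variance) with the likelihood (equation 8) yields the posterior distribution
$$p(\mu_1, \ldots, \mu_N, \sigma^2 \mid y) \propto (\sigma^2)^{-\left(\frac{n+a}{2} + 1\right)} \exp\!\left( -\frac{b + \mathrm{CSS} + \sum_{i=1}^{N} n_i(\mu_i - \bar{y}_i)^2}{2\sigma^2} \right) \tag{10}$$
where, as before, $n$ and $\mathrm{CSS}$ are the totals over all sites and $a$ and $b$ are the parameters of the prior.
By sampling from this posterior distribution, for example via MCMC procedures, the posterior distribution of the maximal absolute difference between the site means follows by assessing the maximum of all possible absolute differences $|\mu_i - \mu_j|$, which is the most relevant posterior distribution in the equivalence question.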
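It is relevant to derive the marginal posterior multivariate distribution of the site means, by integrating the posterior distribution (equation 10) over $\sigma^2$. With some calculus, the marginal posterior can be written as
$$p(\mu_1, \ldots, \mu_N \mid y) \propto \left[ b + \mathrm{CSS} + \sum_{i=1}^{N} n_i(\mu_i - \bar{y}_i)^2 \right]^{-(n+a)/2} \tag{11}$$
This is a multivariate t-distribution (see, for example, [9]), which becomes apparent after rewriting the equation as
$$p(\boldsymbol{\mu} \mid y) \propto \left[ 1 + \frac{1}{n + a - N}\, (\boldsymbol{\mu} - \bar{\boldsymbol{y}})^{\top} \Sigma^{-1} (\boldsymbol{\mu} - \bar{\boldsymbol{y}}) \right]^{-\frac{(n + a - N) + N}{2}} \tag{12}$$
This multivariate t-distribution for the column vector $\boldsymbol{\mu} = (\mu_1, \ldots, \mu_N)^{\top}$ has $n + a - N$ degrees of freedom, location vector $\bar{\boldsymbol{y}} = (\bar{y}_1, \ldots, \bar{y}_N)^{\top}$ and a diagonal scale matrix $\Sigma$ with entries
$$\Sigma_{ii} = \frac{1}{n_i} \cdot \frac{\mathrm{CSS} + b}{n + a - N} \tag{13}$$
In the case of a non-informative reference prior, hence $a = b = 0$, the right-most factor in equation 13 reduces to $\mathrm{CSS}/(n - N)$, which is the pooled variance over all sites $s_p^2$ (equation 4), and the entries are $s_p^2/n_i$, reflecting the variances of the site means.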
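The marginal posterior distribution of the variance follows by rewriting the expression for the posterior (equation 10) to
$$p(\mu_1, \ldots, \mu_N, \sigma^2 \mid y) \propto (\sigma^2)^{-\left(\frac{n+a}{2} + 1\right)} \exp\!\left( -\frac{b + \mathrm{CSS}}{2\sigma^2} \right) \prod_{i=1}^{N} \exp\!\left( -\frac{n_i(\mu_i - \bar{y}_i)^2}{2\sigma^2} \right) \tag{14}$$
This expression is to be integrated over all $\mu_i$ over the full range from −∞ to +∞. Each of these integrals is the integral of a normally distributed variable with variance $\sigma^2/n_i$ over its whole range, and therefore contributes a factor proportional to $(\sigma^2)^{1/2}$, so it follows that
$$p(\sigma^2 \mid y) \propto (\sigma^2)^{-\left(\frac{n + a - N}{2} + 1\right)} \exp\!\left( -\frac{b + \mathrm{CSS}}{2\sigma^2} \right) \tag{15}$$
This is an inverse gamma distribution with parameters $n + a - N$ and $\mathrm{CSS} + b$. It follows that the posterior expectation of $1/\sigma^2$ equals $(n + a - N)/(\mathrm{CSS} + b)$. In case of a non-informative prior (hence $a = b = 0$), this is the reciprocal of the pooled variance over all sites.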
3. Power calculations for the Frequentist and Bayesian approaches
3.1. Preliminaries
In the Frequentist approach, the power of a procedure is the conditional probability of rejecting the null hypothesis when the null hypothesis is in fact false. The Type I error, on the other hand, is the conditional probability of rejecting the null hypothesis under conditions for which the null hypothesis holds; this probability of rejection should be at most equal to the significance level α, usually set to 5%.
In the Bayesian context one could say that equivalence is demonstrated if the (1 – α)-upper percentile of the posterior distribution of the maximum absolute group difference is below the equivalence margin Δ, which amounts to the Bayesian statement of the equivalence success criterion stated in a very similar way to the Frequentist alternative hypothesis. In Bayesian theory, assurance is the probability of satisfying this equivalence criterion over all possible values of the parameters of interest (the priors). Bayesian assurance is an unconditional probability, whereas Frequentist power is a conditional probability calculated at a specific value of the mean difference.
As noted before, testing the null hypothesis of equation 5 in the Frequentist TOST approach involves the construction of multiple confidence intervals for the differences between the site means, and they are all to be compared with the equivalence margin. This raises the question of whether the confidence intervals need to be corrected for multiplicity, as is often done in Frequentist significance testing. We performed extensive simulations to demonstrate that the Type I error of the Frequentist equivalence test is well controlled at or below 5% without multiplicity correction.
The next subsections deal with the probability of declaring equivalence in the multi-group equivalence case as a function of the population means $\mu_i$ and the common variance $\sigma^2$. These simulations were designed to demonstrate and visualize the concept by fixing some of the parameters and varying the rest; conceptually, they can be extended to settings in which all parameters vary. Due to graphical limitations, the analyses are restricted to 3 and 4 groups. Here we show simulation results using 20 replicates per group, and with an equivalence margin of Δ = 2. More examples for unbalanced cases are included in the electronic supplement.
3.2. Analyses concerning 3 groups
When studying 3 groups, it suffices, without loss of generality, to set $\mu_1 = 0$, whereas the other two population means are chosen over a range from just below –Δ to just above +Δ (from −2.3 to +2.3 with steps of 0.05). Subsequently, as a function of the population means $\mu_2$ and $\mu_3$ and some value for the common population standard deviation σ, the probability of declaring equivalence can be assessed by simulation. This was investigated for σ = 0.5 and 1, with a simulation size of $10^6$. The limiting conditions for the null hypothesis (equations 1 and 5) are those for which at least one absolute difference between the population means equals Δ, leading to a hexagon in the $(\mu_2, \mu_3)$ plane; horizontal lines for $\mu_3 = \pm\Delta$, vertical lines for $\mu_2 = \pm\Delta$ and diagonal lines for $\mu_2 - \mu_3 = \pm\Delta$. The results (see Figure 1) indicate that the combinations with 5% Type I error are on or just inside the hexagon when no post-hoc correction is used for the confidence intervals (equation 6), indicating that even without any correction the Type I error rate is ≤ 5% in this scenario.
Figure 1.
Contour plots showing the probability to reject the null hypothesis (equation 1) for 3-group equivalence (n = 20 per group) for two values of σ. Contour lines are at probabilities of 1, 5, 25, 50, 75, 95 and 99%. The hexagon identifies the range for which equivalence holds.
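The behaviour at the edge of the hexagon can be checked with a small simulation. The sketch below is not the authors' simulation code: it assumes a balanced design, draws the sufficient statistics (site means and pooled variance) directly instead of raw data, and uses a far smaller simulation size than the paper. The function name and the two evaluated points are chosen for illustration.

```python
import numpy as np
from scipy import stats


def prob_declare_equivalence(mu, sigma, n, delta, alpha=0.05,
                             n_sim=20_000, seed=1):
    """Estimate P(all uncorrected 100(1 - 2*alpha)% CIs lie within
    (-delta, +delta)) for a balanced design with n replicates per group."""
    rng = np.random.default_rng(seed)
    mu = np.asarray(mu, float)
    k = mu.size
    df = k * (n - 1)
    # Shortcut (assumption): simulate sample means and the pooled SD
    # from their sampling distributions rather than raw observations.
    means = rng.normal(mu, sigma / np.sqrt(n), size=(n_sim, k))
    sp = sigma * np.sqrt(rng.chisquare(df, size=n_sim) / df)
    half_width = stats.t.ppf(1 - alpha, df) * sp * np.sqrt(2.0 / n)
    # In a balanced design every pair has the same half-width, so all
    # intervals fit iff the largest absolute difference plus the
    # half-width stays below delta.
    max_abs_diff = means.max(axis=1) - means.min(axis=1)
    return float(np.mean(max_abs_diff + half_width < delta))


# A point on the hexagon boundary (|mu1 - mu2| = delta) and its centre.
rate_boundary = prob_declare_equivalence([0.0, 2.0, 1.0], sigma=0.5, n=20, delta=2.0)
power_center = prob_declare_equivalence([0.0, 0.0, 0.0], sigma=0.5, n=20, delta=2.0)
print(f"boundary: {rate_boundary:.3f}, centre: {power_center:.3f}")
```

At the boundary point only one pairwise difference sits at the margin, so the rejection rate is expected to be close to the nominal 5%; points where several differences are simultaneously at the margin can only lower it.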
In the simulations, the group sample sizes are balanced, with 20 observations each. The factor that multiplies the pooled standard deviation in equation 6, that is, the critical value times $\sqrt{1/n_i + 1/n_j}$, equals 0.5287 in this case when no post-hoc correction is used. Using the commonly used corrections of Tukey, Bonferroni and Scheffé the values are 0.6623, 0.6897 and 0.6926, respectively. Of these three, Tukey is the least conservative in the sense that the shortest intervals are obtained, and it is therefore included in Figure 1 (1c and 1d) for comparison. The contour line corresponding to a 5% probability when using Tukey’s correction lies further inside the hexagon, as expected given the conservativeness that results from the wider confidence intervals.
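The four multipliers quoted above can be recomputed with SciPy. The conventions used here (one-sided level 0.05 for the uncorrected and Bonferroni values, family-wise two-sided 90% coverage for Tukey and Scheffé) are assumptions on our part, chosen because they reproduce the quoted numbers; the paper does not spell them out.

```python
import numpy as np
from scipy import stats

k, n, alpha = 3, 20, 0.05       # groups, replicates per group, one-sided level
df = k * (n - 1)                # 57 residual degrees of freedom
m = k * (k - 1) // 2            # 3 pairwise comparisons
root = np.sqrt(2.0 / n)         # sqrt(1/n_i + 1/n_j) for a balanced design

multipliers = {
    # Plain one-sided t critical value (TOST without correction).
    "none": stats.t.ppf(1 - alpha, df) * root,
    # Studentized range quantile divided by sqrt(2).
    "Tukey": stats.studentized_range.ppf(1 - 2 * alpha, k, df) / np.sqrt(2) * root,
    # One-sided level split over the m comparisons.
    "Bonferroni": stats.t.ppf(1 - alpha / m, df) * root,
    # Scheffe: sqrt((k - 1) * F quantile).
    "Scheffe": np.sqrt((k - 1) * stats.f.ppf(1 - 2 * alpha, k - 1, df)) * root,
}

for name, value in multipliers.items():
    print(f"{name:10s} {value:.4f}")
```

Multiplying any of these values by the observed pooled standard deviation gives the half-width of the corresponding interval for a pair of sites.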
In the Bayesian context, the criterion for equivalence could be for example that the 95th percentile of the posterior distribution of the maximum absolute difference between the group means is below the equivalence margin Δ. The information provided is much richer than in the Frequentist framework in that the probability of equivalence can be estimated directly rather than simply rejecting the null hypothesis or not, but the probability of meeting this criterion (assurance) can be evaluated analogously to the Frequentist power calculations. The results of the simulations are presented in the contour plots of Figure 2. Use was made of the non-informative priors $p(\mu_i) \propto 1$ and $p(\sigma^2) \propto 1/\sigma^2$ (equation 9 with both prior parameters set to zero).
Figure 2.
Contour plots showing the probability that the upper 95th percentile of the posterior distribution of the maximum difference between the site means is below 2 for two values of σ. Contour lines are at probabilities of 1, 5, 25, 50, 75, 95 and 99%. The hexagon identifies the range for which equivalence holds.
3.3. Analyses concerning 4 groups
For 4 groups it is not possible to graphically construct plots concerning the power or assurance over multiple values of the population means, but it is possible to focus on the limiting conditions just satisfying the null hypothesis. To this end, it suffices to set $\mu_1 = -\Delta/2 = -1$, $\mu_2 = +\Delta/2 = +1$, and both $\mu_3$ and $\mu_4$ somewhere in between (in the simulations from –1 to +1 with steps of 0.05). The results are presented in Figure 3 for the Frequentist approach using no multiplicity correction or Tukey’s correction, and for the Bayesian approach. For the Frequentist approach, it follows that the probability to declare equivalence over this range of hypothesized true means (the Type I error) is 5% or less if no use is made of a multiplicity correction; using Tukey’s correction the highest Type I error values go down to about 1.1% for both σ = 0.5 and 1. For the Bayesian procedure, it similarly follows that the probability to declare equivalence is 5% or less over the whole range.
Figure 3.
Surface plots of the probability to reject the null hypothesis with 4 groups.
4. Simulations based on observed statistics
4.1. Preliminaries
Based on a given set of experimental results, Frequentists conclude whether or not to reject the null hypothesis (equation 5), whereas Bayesians assess the posterior probability that the maximum absolute difference between the site means is below the threshold Δ. To directly compare the results of the Frequentist and Bayesian analyses of a given set of experimental data, two sets of simulations were performed.
4.2. Example with 3 groups
For the simulation example, an unbalanced design was used with n1 = 10, n2 = 15 and n3 = 20. The observed pooled standard deviation was set to 1 for simplicity, and the equivalence margin was chosen to be Δ = 2. Simulations were performed over differences from –2 to +2 with steps of 0.01. The results are presented in Figure 4. Using the Frequentist approach, all observed differences within the hexagon will lead to rejection of the null hypothesis (which corresponds to a 2-sided 90% confidence interval being entirely within (-2, 2)), hence declaring equivalence at the 5% significance level. The corner points are not at an observed difference of zero due to the unbalanced design. The contour lines present the calculated Bayesian posterior probability that the maximum absolute difference between the site means is below the equivalence margin Δ. Note that the Frequentist analysis leads to a binary conclusion to reject the null hypothesis or not, whereas the Bayesian analysis yields a more nuanced result in terms of the posterior probability that equivalence is shown. Note that the contour line for the 95% posterior probability and the hexagon define similar areas, with some differences at the corner points, making the Bayesian approach somewhat more conservative than the Frequentist.
Figure 4.
Comparing the Frequentist and Bayesian results as a function of the observed differences between the site means at an observed pooled standard deviation of 1. Using the Frequentist approach, all results within the hexagon (thick black lines) would be interpreted as showing equivalence over the three groups. Using the Bayesian approach, posterior probabilities on equivalence are calculated, as indicated by the contour lines.
4.3. Example with 2 groups
When restricting to 2 groups, the Bayesian posterior probabilities can be calculated as a function of the observed difference between the site means and the observed pooled standard deviation. For the example, n1 = 10 and n2 = 15 were used. The results are presented in Figure 5, in the same way as for Figure 4. The area defined by a Bayesian posterior probability of at least 95% that the two site means are equivalent is similar to the triangular area within which equivalence would be declared by the Frequentist procedure, with the Bayesian area a bit smaller than the Frequentist one at the larger values of the pooled standard deviation. Note that at the top of the triangle the two-sided 90% confidence interval coincides with the interval (–Δ,+Δ).
Figure 5.
Comparing the Frequentist and Bayesian results as a function of the observed differences between the site means and the observed pooled standard deviation. Using the Frequentist approach, all results in the triangular area limited by the thick black lines would be interpreted as showing equivalence over the groups. Using the Bayesian approach, posterior probabilities on equivalence are calculated, as indicated by the contour lines.
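For two groups the posterior probability of equivalence has a closed form, because under a reference prior the difference of the two site means has a scaled t posterior with n1 + n2 − 2 degrees of freedom. The following sketch uses that fact, which follows from the multivariate t result of Section 2.3.2 with both prior parameters set to zero; the function name is illustrative.

```python
import numpy as np
from scipy import stats


def posterior_prob_equivalent(diff, sp, n1, n2, delta):
    """P(|mu1 - mu2| < delta | data) under flat priors on the means and
    the reference prior on the common variance."""
    df = n1 + n2 - 2
    se = sp * np.sqrt(1.0 / n1 + 1.0 / n2)
    # mu1 - mu2 has a t posterior centred at diff with scale se.
    upper = stats.t.cdf((delta - diff) / se, df)
    lower = stats.t.cdf((-delta - diff) / se, df)
    return float(upper - lower)


n1, n2, delta = 10, 15, 2.0
df = n1 + n2 - 2

# Pooled SD at the top of the Frequentist triangle: observed difference 0
# and a two-sided 90% CI that coincides with (-delta, +delta).
sp_apex = delta / (stats.t.ppf(0.95, df) * np.sqrt(1.0 / n1 + 1.0 / n2))
p_apex = posterior_prob_equivalent(0.0, sp_apex, n1, n2, delta)

# Same observed difference with a much smaller pooled SD.
p_small_sd = posterior_prob_equivalent(0.0, 1.0, n1, n2, delta)
print(f"apex: {p_apex:.4f}, sp = 1: {p_small_sd:.6f}")
```

At the apex the posterior probability is 90% rather than 95%, because both tails of the posterior fall outside the margin there. This is one way to see why the 95% contour lies below the top of the triangle.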
5. Analysis of the example data
5.1. Example data
The example used in this section is a hypothetical study in which an analytical procedure was applied at 4 different sites, each analyzing samples from one homogeneous lot in 8–20 replicates. Equivalence is defined as a difference between the site means that is less than Δ = 2 units. Table 1 shows the individual results, with a graphical presentation in Figure 6. Some relevant summary statistics of these data are presented in Table 2. Note that the results do not reflect real data but were generated to illustrate the proposed methods.
Table 1.
Measurement results.
| Site | Measurement results |
|---|---|
| 1 | 30.97, 31.56, 31.89, 32.11, 32.28, 32.57, 33.22, 33.48 |
| 2 | 32.64, 33.03, 33.18, 33.32, 33.55, 33.91, 33.99, 34.40, 34.56 |
| 3 | 30.47, 31.14, 31.29, 31.64, 31.69, 31.74, 31.85, 31.90, 31.99, 32.26, 32.37, 32.43, 32.68, 32.76, 32.86, 33.06, 33.26, 33.85 |
| 4 | 32.47, 32.69, 32.71, 32.76, 32.97, 33.11, 33.11, 33.14, 33.39, 33.71, 33.76, 33.95, 34.12, 34.16, 34.25, 34.46, 34.64, 34.69, 34.69, 34.82 |
Figure 6.

Graphical presentation of the example data.
Table 2.
Summary statistics of the example data.
| Site | n | Mean | CSS | Variance |
|---|---|---|---|---|
| 1 | 8 | 32.26 | 4.82 | 0.6886 |
| 2 | 9 | 33.62 | 3.31 | 0.4138 |
| 3 | 18 | 32.18 | 11.63 | 0.6841 |
| 4 | 20 | 33.68 | 11.46 | 0.6032 |
5.2. Frequentist analysis
For the Frequentist testing of the hypothesis of equation 5 the two-sided 90% confidence intervals for the differences between the site means are needed, and they can be calculated by applying a straightforward one-way analysis of variance to the example data, leading to the relevant estimates as presented in Table 3. Using no correction for multiplicity, as would be the desired approach since it has been shown to be unnecessary, all comparisons (just) meet the criterion, as the confidence intervals are all within the equivalence margin (which was set to 2 for this example). Note that the confidence limit is closest to the equivalence margin for the comparison of site 1 versus 2, whereas the largest observed difference is between sites 3 and 4; the widths of the confidence intervals depend on the numbers of replicates, which differ between the groups. Using the Tukey correction, the confidence intervals become wider, and 4 out of the 6 comparisons fail the equivalence requirement.
Table 3.
Frequentist analysis of the example results.
| Site | Mean | 95% Conf.Int. | Comparison site pairs | Difference between means | 90% Conf.Int. (no correction) | 90% Conf.Int. (Tukey correction) |
|---|---|---|---|---|---|---|
| 1 | 32.26 | [31.705; 32.815] | 1 vs 2 | -1.36 | [-1.997; -0.723] | [-2.254; -0.466] |
| 2 | 33.62 | [33.096; 34.144] | 1 vs 3 | 0.08 | [-0.477; 0.637] | [-0.702; 0.862] |
| 3 | 32.18 | [31.810; 32.550] | 1 vs 4 | -1.42 | [-1.968; -0.872] | [-2.189; -0.651] |
| 4 | 33.68 | [33.329; 34.031] | 2 vs 3 | 1.44 | [0.905; 1.975] | [0.689; 2.191] |
| | | | 2 vs 4 | -0.06 | [-0.586; 0.466] | [-0.798; 0.678] |
| | | | 3 vs 4 | -1.50 | [-1.926; -1.074] | [-2.098; -0.902] |
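The uncorrected intervals in Table 3 can be recomputed from the summary statistics in Table 2 alone. The sketch below assumes the interval is the observed difference plus or minus a one-sided t critical value times the pooled standard deviation times the square root of the summed reciprocal sample sizes; the variable names are illustrative.

```python
import itertools
import numpy as np
from scipy import stats

# Summary statistics copied from Table 2.
n = np.array([8, 9, 18, 20])
mean = np.array([32.26, 33.62, 32.18, 33.68])
css = np.array([4.82, 3.31, 11.63, 11.46])
delta, alpha = 2.0, 0.05

df = n.sum() - n.size                  # 51 residual degrees of freedom
sp = np.sqrt(css.sum() / df)           # pooled standard deviation
t_crit = stats.t.ppf(1 - alpha, df)    # uncorrected one-sided critical value

intervals = {}
for i, j in itertools.combinations(range(n.size), 2):
    diff = mean[i] - mean[j]
    half = t_crit * sp * np.sqrt(1 / n[i] + 1 / n[j])
    intervals[(i + 1, j + 1)] = (diff - half, diff + half)
    print(f"{i + 1} vs {j + 1}: {diff:6.2f}  [{diff - half:.3f}; {diff + half:.3f}]")

# Equivalence over all sites requires every interval inside (-delta, +delta).
all_equivalent = all(-delta < lo and hi < delta for lo, hi in intervals.values())
print("all pairs within margin:", all_equivalent)
```

Replacing `t_crit` by a multiplicity-corrected critical value widens every interval by the same factor, which is what moves four of the six comparisons outside the margin under Tukey's correction.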
5.3. Bayesian analysis
5.3.1. Analysis without prior knowledge
Sampling from the joint posterior distribution (using, for example, SAS (via PROC MCMC), as detailed in the electronic supplement) and using a non-informative prior for the variance, the results summarized in Table 4 are obtained.
Table 4.
Posterior distribution parameter summary statistics and derived properties using a non-informative prior for the variance. Simulation size: $10^6$.
| Variable | Mean | 95% Cred.Int. | Variable | Mean | 90% Cred.Int. | Criterion | Fraction |
|---|---|---|---|---|---|---|---|
| μ1 | 32.260 | [31.705; 32.815] | μ1 – μ2 | -1.360 | [-1.997; -0.723] | \|μ1 – μ2\| < 2 | 0.9508 |
| μ2 | 33.620 | [33.097; 34.144] | μ1 – μ3 | 0.080 | [-0.477; 0.637] | \|μ1 – μ3\| < 2 | 1.0000 |
| μ3 | 32.180 | [31.810; 32.549] | μ1 – μ4 | -1.420 | [-1.968; -0.871] | \|μ1 – μ4\| < 2 | 0.9589 |
| μ4 | 33.680 | [33.329; 34.031] | μ2 – μ3 | 1.440 | [0.906; 1.975] | \|μ2 – μ3\| < 2 | 0.9571 |
| ν | 0.637 | [0.430; 0.942] | μ2 – μ4 | -0.060 | [-0.586; 0.466] | \|μ2 – μ4\| < 2 | 1.0000 |
| | | | μ3 – μ4 | -1.500 | [-1.927; -1.075] | \|μ3 – μ4\| < 2 | 0.9727 |
| | | | | | | All diffs < 2 | 0.8769 |
ν, μi: posterior estimates of population variance and mean of site i; Cred.Int: Credible Interval.
After the joint posterior distribution of mean differences has been generated, the fraction of all draws in which the absolute differences between the population means of 2 selected sites is below Δ = 2 is the estimate of the probability of similarity between those 2 sites; the fraction for which all absolute site-differences are below 2 is the estimate of the probability of similarity among all sites. These results are also presented in Table 4. The posterior probability that sites 1 and 2 produce equivalent results is close to 95%. This is in line with the Frequentist result for sites 1 and 2, which is that equivalence at the 95% confidence level is borderline based on the unadjusted 2-sided 90% confidence interval in Table 3 (the lower limit is close to the equivalence margin –Δ, whereas the upper limit is far from the upper equivalence limit +Δ). Similar results are observed for the comparisons between sites 1 and 4 and sites 2 and 3. The posterior probabilities are higher for the other comparisons, well in line with the Frequentist results, which is not at all surprising using a non-informative Bayesian analysis. However, the Bayesian analysis provides more nuanced information in the probabilities of the pairwise differences being within the equivalence margin, as well as the posterior probability that all sites produce equivalent results – an estimate not possible at all in the Frequentist approach – which is estimated as 88%. Note that several Frequentist confidence limits are close to the equivalence margin, so the current set of results can be interpreted as the four-dimensional analogue of the results close to the corner points of the hexagon as shown in Figure 4, or close to the top of the triangle as shown in Figure 5, where the Bayesian posterior probability is below 95%. 
Whether or not this 88% posterior probability should be interpreted as sufficient evidence for equivalence is to be decided in a scientific discussion with the relevant people involved. Thus, even without the use of prior information, the Bayesian analysis provides richer information than the Frequentist analysis, while clearly generating the same estimates of the means and variability.
The mathematical derivations reveal that the marginal multivariate posterior distribution of the site means is a multivariate t-distribution when using an inverse gamma or non-informative reference prior for the variance (equations 11 and 12). In our experience, direct sampling from this distribution is much faster than sampling using PROC MCMC, which is useful when performing large-scale computer simulations. The only drawback is that no information on the measurement variance is obtained. Inferences on the variance can nevertheless be obtained by using the inverse-gamma marginal posterior distribution given in equation 15.
The PDF and CDF of the posterior distribution of the maximum absolute site difference are presented in Figure 7. The preset equivalence margin Δ is indicated with a vertical reference line. The information provided with these distributions is much richer than a fraction alone. For example, the 95th percentile of the distribution is at 2.15, indicating that the data reveal that there is a 95% posterior probability that the true maximum absolute difference between the site means is less than 2.15, and this information can be included in final decision making on acceptable similarity. The 99th percentile is at 2.37.
Figure 7.
The posterior distribution of the maximum difference between the site means when using a non-informative prior for the measurement variance. The vertical reference line is the pre-set equivalence margin Δ.
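Direct sampling without MCMC can be sketched in a few lines. This is not the authors' SAS code: it assumes flat priors on the site means and an inverse-gamma prior on the variance parameterized by two hypothetical parameters `a` and `b` (degrees of freedom and sum of squares, both zero for the reference prior), and it draws the variance first and the means conditionally, which yields the multivariate t marginal for the means.

```python
import numpy as np

# Summary statistics from Table 2.
n = np.array([8, 9, 18, 20])
mean = np.array([32.26, 33.62, 32.18, 33.68])
css_total = 31.22        # sum of the CSS column
a, b = 0.0, 0.0          # assumed prior parameters; zero = reference prior
delta = 2.0

rng = np.random.default_rng(2024)
n_draws = 400_000
df_post = n.sum() - n.size + a

# sigma^2 | data is inverse gamma: (sum of squares) / chi-square(df).
sigma2 = (css_total + b) / rng.chisquare(df_post, size=n_draws)
# mu_i | sigma^2, data: independent normals around the observed means.
mu = rng.normal(mean, np.sqrt(sigma2[:, None] / n))

# Largest absolute pairwise difference equals max minus min per draw.
max_abs_diff = mu.max(axis=1) - mu.min(axis=1)

p_all = float(np.mean(max_abs_diff < delta))
p_12 = float(np.mean(np.abs(mu[:, 0] - mu[:, 1]) < delta))
q95 = float(np.quantile(max_abs_diff, 0.95))
print(f"P(all diffs < 2) = {p_all:.4f}, P(|mu1 - mu2| < 2) = {p_12:.4f}, "
      f"95th percentile = {q95:.2f}")
```

Up to Monte Carlo error, the printed values should be close to the corresponding entries of Table 4 and to the 95th percentile quoted in the text; the draws of `sigma2` additionally give the posterior of the variance, so nothing is lost relative to MCMC in this conjugate setting.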
Though equivalence is generally based on the difference between site means, it is often more relevant to quantify what the difference between individual observations over the sites of study could be. The importance of evaluating similarity of the distribution of individual units is recognized in, for example, the 2021 EMA document Reflection Paper on Statistical Methodology for the Comparative Assessment of Quality Attributes in Drug Development [7], in which it is noted that ‘ … there is interest in drawing conclusions on similarity between the entirety of the material (which will ever be) produced by each of [the sites].’ This is readily possible in the Bayesian setting, as from the posterior distribution a predictive distribution of individual observations can be generated, while accounting for the variability at each site and the statistical uncertainty of the estimate of the mean and variability at each site. Each draw in the simulated posterior consists of values for the individual site means and for the common measurement variance. For each draw in the posterior distribution, given those population values, individual values can be drawn for each site, simulating possible outcomes at the individual sites, and from these the maximum absolute difference between the individual results can be assessed (Figure 8). From this distribution it follows that 95% of the maximum absolute differences are below 4.02, and 99% are below 4.78. As an alternative to the traditional approach of focusing comparability analyses on differences in means, a study could be set up with the goal of demonstrating that individual observations will differ by no more than a prespecified amount, with high probability (see, for example, [5]). The fact that statistical uncertainty of the parameters is accounted for in generating the posterior predictive distributions from a Bayesian model makes the resulting probability estimate importantly different from, and more realistic than, the results of ‘bootstrap’ simulations using point estimates of the site means and variances as could be done using a Frequentist approach. Thus, particularly for the goal of understanding the similarity among sites with regard to not just means but distributions of individual observations, a Bayesian analysis is an obvious choice. This is especially true with more than 2 sites, as shown in this simple example even with the assumption of common variance among the sites.
Figure 8.
The predictive distribution of the maximum absolute difference between the individual observations over the four sites when using a non-informative prior for the measurement variance.
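The simulation described above can be sketched in a few lines of Python. This is a minimal illustration rather than the code used for the paper: it assumes a common-variance normal model with a non-informative prior (flat on the site means, proportional to 1/σ² on the variance), and the function name, the site sizes and the pooled variance in the usage example are hypothetical.

```python
import numpy as np


def simulate_max_differences(site_means, site_sizes, pooled_var,
                             n_draws=100_000, seed=0):
    """Posterior and posterior predictive draws of the maximum absolute
    difference between sites, assuming a common variance and a
    non-informative prior (an assumption of this sketch)."""
    rng = np.random.default_rng(seed)
    means = np.asarray(site_means, dtype=float)
    sizes = np.asarray(site_sizes, dtype=float)
    df = sizes.sum() - len(means)      # residual degrees of freedom
    resid_ss = pooled_var * df         # residual sum of squares

    # Common variance: sigma^2 | data ~ SS / chi-square(df)
    sigma2 = resid_ss / rng.chisquare(df, size=n_draws)
    # Site means: mu_i | sigma^2, data ~ Normal(ybar_i, sigma^2 / n_i)
    mu = means + rng.standard_normal((n_draws, len(means))) * np.sqrt(
        sigma2[:, None] / sizes)
    # One future individual observation per site for every posterior draw
    y_new = mu + rng.standard_normal(mu.shape) * np.sqrt(sigma2)[:, None]

    # The maximum absolute pairwise difference equals max minus min
    max_diff_means = mu.max(axis=1) - mu.min(axis=1)
    max_diff_obs = y_new.max(axis=1) - y_new.min(axis=1)
    return max_diff_means, max_diff_obs


if __name__ == "__main__":
    # Hypothetical inputs: four sites with five results each
    d_means, d_obs = simulate_max_differences(
        [32.26, 33.62, 32.18, 33.68], [5, 5, 5, 5], pooled_var=0.5)
    print("95th/99th percentile, site means:", np.percentile(d_means, [95, 99]))
    print("95th/99th percentile, individual results:", np.percentile(d_obs, [95, 99]))
```

Because every predictive draw reuses the variance and means of its own posterior draw, parameter uncertainty is carried into the predicted individual results, which is the feature that distinguishes this from resampling around point estimates.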
5.3.2. Analysis with prior knowledge
Suppose that the analytical method used in the example had been validated at one of the sites, and that this validation yielded an estimate for the variance of 0.25 units, based on 80 degrees of freedom. This knowledge can be used as prior knowledge in the Bayesian equivalence analysis. According to equation 9, an inverse gamma distribution can be used with the proper parameterization. Given equation 15, the logical choices for the two parameters of the prior distribution are the number of degrees of freedom and the corrected sum of squares of the historic variance estimate, so a reasonable choice for the prior for the variance is an inverse-gamma distribution with these parameters set to 80 and 20 (= 0.25 × 80), respectively. This does not affect the estimates for the means and mean differences (as is clear from comparing Table 4 and Table 5), yet the credible intervals of these posterior distributions do change, due to the incorporation of prior information into the model. Note that the historical estimate for the variance is considerably lower than the value observed in the example data (see Table 2), and in line with this the estimate for the residual variance also decreases compared to the analysis without informative prior. The posterior probabilities of equivalence between the sites increase (Table 5); now the Bayesian analysis supports declaring equivalence over all sites. The 95th and 99th percentiles for the maximum absolute difference between the site means are 1.99 and 2.16, respectively, whereas for the maximum absolute differences between individual observations these percentiles are 3.46 and 4.03.
Table 5.
Posterior distribution parameter summary statistics and derived properties using an informative prior for the variance (prior degrees of freedom = 80, prior sum of squares = 20). Simulation size: 10⁶.
| Variable | Mean | 95% Cred.Int. | Variable | Mean | 90% Cred.Int. | Criterion | Fraction |
|---|---|---|---|---|---|---|---|
| μ1 | 32.260 | [31.823; 32.697] | μ1 – μ2 | -1.360 | [-1.863; -0.856] | \|μ1 – μ2\| < 2 | 0.9814 |
| μ2 | 33.620 | [33.207; 34.033] | μ1 – μ3 | 0.080 | [-0.360; 0.520] | \|μ1 – μ3\| < 2 | 1.0000 |
| μ3 | 32.180 | [31.888; 32.472] | μ1 – μ4 | -1.420 | [-1.853; -0.986] | \|μ1 – μ4\| < 2 | 0.9856 |
| μ4 | 33.680 | [33.403; 33.957] | μ2 – μ3 | 1.440 | [1.016; 1.863] | \|μ2 – μ3\| < 2 | 0.9850 |
| ν | 0.397 | [0.311; 0.506] | μ2 – μ4 | -0.060 | [-0.477; 0.356] | \|μ2 – μ4\| < 2 | 1.0000 |
| | | | μ3 – μ4 | -1.500 | [-1.837; -1.163] | \|μ3 – μ4\| < 2 | 0.9923 |
| | | | | | | All diffs < 2 | 0.9535 |
ν, μi: posterior estimates of population variance and mean of site i; Cred.Int: Credible Interval.
In this example the historic variance estimate is smaller than the values observed in the study, but the reverse is also possible, with the opposite effect on the posterior estimate of the variance, the credible intervals and the probabilities of declaring equivalence.
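How the historical estimate enters the analysis can be illustrated with a short sketch of the conjugate update. It assumes that the inverse-gamma prior is parameterised by a number of degrees of freedom and a sum of squares, so that prior and data simply add; this parameterisation, the function names and the residual values in the usage example are assumptions made for illustration, not taken from the paper.

```python
import numpy as np


def update_variance_prior(prior_df, prior_ss, resid_df, resid_ss):
    """Combine prior and study information on a common variance.
    Assumes a df / sum-of-squares parameterisation (hypothetical)."""
    return prior_df + resid_df, prior_ss + resid_ss


def sample_variance(df, ss, n_draws, rng):
    """Draw the variance as SS / chi-square(df), which is the same as an
    inverse-gamma distribution with shape df/2 and scale SS/2."""
    return ss / rng.chisquare(df, size=n_draws)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Prior from the validation study: variance 0.25 on 80 df -> SS = 20
    prior_df, prior_ss = 80, 0.25 * 80
    # Hypothetical study data: 16 residual df, residual variance 0.9
    resid_df, resid_ss = 16, 0.9 * 16
    post_df, post_ss = update_variance_prior(prior_df, prior_ss, resid_df, resid_ss)
    draws = sample_variance(post_df, post_ss, 200_000, rng)
    print("posterior df and SS:", post_df, post_ss)
    print("posterior mean of the variance:", draws.mean())
```

With a prior carrying 80 degrees of freedom against far fewer in the study, the posterior variance is pulled strongly toward the historical value, which is why the credible intervals narrow when the historical variance is the smaller one and widen in the reverse case.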
6. Analysis of the data reported by Rusticus and Lovato
The paper of Rusticus & Lovato [18] presents some summary statistics of exam scores of students at different campus sites of the University of British Columbia (UBC). Based on the reported summary statistics it is possible to perform the equivalence analysis using the Frequentist TOST (based on equation 6) and using the proposed Bayesian procedure, by sampling from the multivariate t-distribution presented in equations 12 and 13. As the paper does not provide any information that could be used to define an informative prior for the variance, the Bayesian analysis was done using the non-informative prior (both prior parameters equal to 0). A summary of the main results for a few courses is presented in Table 6. The equivalence margin was set at Δ = 5 [18]. The intervals based on the Frequentist TOST are all narrower than those reported by Rusticus & Lovato, which is not surprising as these authors used a multiplicity correction whereas we did not. The Bayesian approach allows calculation of the posterior probability that all sites are equivalent; only for Blood and Lymphatics does the Bayesian analysis indicate that the maximum difference between any pair of sites is less than 5 with more than 98% probability.
Table 6.
Applying the Frequentist and Bayesian analyses to data reported by Rusticus & Lovato [18].
| Course | Comparison | Published [18]: 90% Conf.Int. | Published [18]: Equiv? | TOST: 90% Conf.Int. | TOST: Equiv? | Bayesian: 90% Cred.Int. | Bayesian: Posterior Prob. |
|---|---|---|---|---|---|---|---|
| Gastroenterology | 1-2 | [0.43; 5.54] | No | [1.01; 4.95] | Yes | [1.01; 4.94] | 95.5% |
| | 1-3 | [-2.31; 1.38] | Yes | [-1.97; 1.03] | Yes | [-1.97; 1.03] | 100.0% |
| | 2-3 | [-5.46; -1.44] | No | [-4.93; -1.97] | Yes | [-4.93; -1.97] | 95.8% |
| | (all) | | | | | | 92.7% |
| Blood and Lymphatics | 1-2 | [-0.06; 4.44] | Yes | [0.44; 3.94] | Yes | [0.44; 3.94] | 99.6% |
| | 1-3 | [-2.68; 0.70] | Yes | [-2.33; 0.35] | Yes | [-2.33; 0.35] | 100.0% |
| | 2-3 | [-4.90; -1.47] | Yes | [-4.50; -1.86] | Yes | [-4.50; -1.87] | 98.9% |
| | (all) | | | | | | 98.6% |
| Musculoskeletal and locomotor | 1-2 | [1.80; 6.34] | No | [2.35; 5.79] | No | [2.35; 5.79] | 81.4% |
| | 1-3 | [-0.77; 2.69] | Yes | [-0.35; 2.27] | Yes | [-0.35; 2.27] | 100.0% |
| | 2-3 | [-4.79; -1.42] | Yes | [-4.40; -1.82] | Yes | [-4.40; -1.82] | 99.2% |
| | (all) | | | | | | 81.3% |
| Endocrine and metabolism | 1-2 | [2.15; 7.12] | No | [2.77; 6.49] | No | [2.77; 6.49] | 62.8% |
| | 1-3 | [-1.74; 1.42] | Yes | [-1.58; 1.26] | Yes | [-1.58; 1.26] | 100.0% |
| | 2-3 | [-6.92; -2.67] | No | [-6.19; -3.39] | No | [-6.19; -3.39] | 59.8% |
| | (all) | | | | | | 48.2% |
Conf.Int: Confidence Interval; Cred.Int: Credible Interval; Posterior Prob: Posterior probability that the difference is within the equivalence margin, based on the Bayesian analysis with non-informative prior (both prior parameters equal to 0).
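A pairwise analysis from summary statistics can be sketched as follows. The function and argument names are hypothetical, and the Bayesian part assumes that under the non-informative prior the posterior of a site difference is a location-scale t-distribution with the same centre, scale and degrees of freedom as the Frequentist interval, which is consistent with the near-identical TOST and credible intervals in Table 6. The numbers in the usage example are made up and are not the Rusticus & Lovato data.

```python
import math

from scipy import stats


def compare_pair(mean_a, mean_b, n_a, n_b, pooled_var, df, margin, level=0.90):
    """Interval for a pairwise difference, the TOST decision, and the
    posterior probability that the difference is within +/- margin."""
    diff = mean_a - mean_b
    se = math.sqrt(pooled_var * (1.0 / n_a + 1.0 / n_b))
    half_width = float(stats.t.ppf(0.5 + level / 2.0, df)) * se
    interval = (diff - half_width, diff + half_width)

    # TOST at 5% per side: the 90% interval must lie inside the margins
    tost_equivalent = bool(-margin < interval[0] and interval[1] < margin)

    # Posterior probability under the assumed t-distributed posterior
    post_prob = float(stats.t.cdf((margin - diff) / se, df)
                      - stats.t.cdf((-margin - diff) / se, df))
    return interval, tost_equivalent, post_prob


if __name__ == "__main__":
    # Hypothetical summary statistics for two sites
    interval, equivalent, prob = compare_pair(
        mean_a=74.0, mean_b=71.5, n_a=60, n_b=45,
        pooled_var=36.0, df=150, margin=5.0)
    print(interval, equivalent, prob)
```

The probability that all sites are equivalent at once cannot be obtained from these pairwise results alone; it requires joint draws of all site means from the multivariate posterior.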
7. Discussion
In this paper, we discussed equivalence tests in the Frequentist and Bayesian perspectives, with a focus on testing more than 2 groups assuming the measurements from all groups (or sites, as used in the current paper) are normally distributed and with a common variance. The Bayesian approach provides more detailed information about similarity of the sites than the Frequentist approach through the posterior distributions; specifically, using a Bayesian approach the probability of any pairwise difference or of the maximum site difference being within limits of interest can be estimated, which is more nuanced information than simply assessing compliance with a decision rule such as comparing a statistical interval to preset boundaries. In addition, because the Bayesian method yields the posterior distribution of the maximum pairwise difference of site means, an easy inference on the overall equivalence of multiple groups can be obtained. Through the posterior draws, it is further possible to obtain posterior predictive distributions of the individual observations for each site, allowing one to make inferences on the maximum differences of the individual future observations from different sites – an assessment directly relevant to understanding patient and manufacturer risk. In the example presented herein, we assumed that all site-to-site comparisons are relevant, but it is possible that one site is a reference site and interest is only on comparisons of all other sites to this reference site; this is also readily possible using the Bayesian approach. We also illustrated how an informative prior can be used to decrease the uncertainty of the analysis by incorporating historical knowledge into the model. Additionally, we derived that the marginal posterior distribution of the site means is a multivariate t-distribution when a common variance of the normal distributions is assumed to have an inverse-gamma prior. 
In case of an informative prior for the variance, we have chosen to use the inverse gamma distribution, because it is the conjugate distribution, allowing exact mathematical treatment. Other priors are also possible, at the expense of the absence of an exact solution.
We showed through simulations that the corrections for multiple pairwise comparisons that are commonly applied in the Frequentist approach for null hypothesis significance testing are not necessary for equivalence tests, as they lead to an unnecessarily conservative procedure with a Type I error far below 5%. Using the example data, we showed that a Bayesian equivalence test without informative priors yields similar results to the Frequentist equivalence test (with no multiplicity correction) for individual pairwise comparisons; yet, as noted above, the additional information available from the Bayesian analysis still makes it advantageous in this case. Also, Bayesian methods are readily available that can handle different variances across groups and/or random effects. Again, our purpose in comparing Bayesian and Frequentist methods is not to prove that one is better than the other, but rather to convince statistical practitioners, scientists, and regulators of the utility of Bayesian modeling as another statistical ‘tool’ in the toolbox to be used when useful, rather than feeling a need to choose between Bayesian and Frequentist methods for all statistical applications.
It is fairly easy to explain why no post-hoc correction is needed for equivalence in the Frequentist approach. Consider the general situation in which the difference between the population means of two groups is equal to the equivalence margin and all other group means lie between these two; the largest difference will then, with large probability, be observed for exactly those two limiting groups. To declare equivalence, the confidence interval for that difference needs to fall within the equivalence range, which can only happen if these two observed means are closer together than their underlying population values, in combination with an estimate for the pooled variance that is sufficiently small to enable the complete confidence interval to fall in the equivalence range. Under those conditions the probability that all other confidence intervals for the site differences fall within the equivalence margin will be large. Only when multiple groups have population means near the limiting value will the probability of declaring equivalence drop.
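The argument can be checked with a small simulation of the limiting scenario. The sketch below is illustrative only and is not the simulation study reported in this paper: it assumes equal group sizes, normal data with a common variance, and uncorrected 90% intervals for all pairwise differences; the function name and the parameter values in the usage example are hypothetical.

```python
import numpy as np
from scipy import stats


def multi_group_tost_rate(pop_means, n_per_group, sigma, margin,
                          alpha=0.05, n_sim=20_000, seed=0):
    """Fraction of simulated studies in which every uncorrected pairwise
    interval falls inside +/- margin (equal group sizes assumed)."""
    rng = np.random.default_rng(seed)
    pop_means = np.asarray(pop_means, dtype=float)
    k = len(pop_means)
    df = k * (n_per_group - 1)
    t_crit = stats.t.ppf(1.0 - alpha, df)

    # Sample means and pooled variance are independent under normality,
    # so they are drawn directly from their sampling distributions
    ybar = pop_means + rng.standard_normal((n_sim, k)) * sigma / np.sqrt(n_per_group)
    pooled_var = sigma**2 * rng.chisquare(df, size=n_sim) / df
    half_width = t_crit * np.sqrt(pooled_var * 2.0 / n_per_group)

    # With equal group sizes all intervals share one half-width, so all of
    # them are inside the margins exactly when the largest one is
    max_diff = ybar.max(axis=1) - ybar.min(axis=1)
    return float(np.mean(max_diff + half_width < margin))


if __name__ == "__main__":
    # Two groups exactly one margin apart, the other two in between
    print(multi_group_tost_rate([0.0, 1.0, 1.0, 2.0], 10, 0.5, margin=2.0))
    # All groups identical: the rate is now the power to declare equivalence
    print(multi_group_tost_rate([1.0, 1.0, 1.0, 1.0], 10, 0.5, margin=2.0))
```

In the first scenario the population means are not equivalent, so the returned fraction estimates the Type I error of the uncorrected procedure, which can be compared with the nominal 5%.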
Comparison of the Frequentist versus Bayesian results in Section 4 reveals that the Bayesian procedure is slightly more conservative than the Frequentist one, in the sense that some results that would be declared as providing sufficient evidence for equivalence in the Frequentist procedure yield a posterior probability below 95% in the Bayesian procedure. Though less easy to observe visually, the same conclusion on relative conservativeness can be drawn from Figure 3 (panel a vs e and b vs f). A 2-site example leading to different conclusions can be any result that in Figure 5 ends up in the top of the triangle above the Bayesian curve corresponding to 95% (for example, observed difference = 0 and pooled SD = 2.5). In this case, the two-sided Frequentist 95% confidence interval (here [−2.11; +2.11]) fully contains the equivalence range of –Δ to +Δ. At the same time the two-sided Frequentist 90% confidence interval (here [−1.75; +1.75]) lies fully between –Δ and +Δ, so this would be declared equivalent by the Frequentist method. The Bayesian analysis leads to a posterior probability of 93.8% that the absolute site difference is below Δ, which is below 95%. Interestingly, the top of the triangle corresponds to the situation in which the Frequentist two-sided 90% confidence interval coincides with the equivalence margin, whereas the result at the top of the 95% posterior probability curve corresponds to the situation in which the Frequentist two-sided 95% confidence interval coincides with the equivalence margin. The Bayesian analysis simply focuses on the question of what the posterior probability is that the true difference is between the equivalence margins, independently of where in the range the point estimate is.
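The correspondence between the two boundaries can be made explicit for the two-site case. The sketch below assumes the non-informative prior, under which the posterior of the site difference δ is taken to be a location-scale t-distribution centred at the observed difference d, with scale s_d (the standard error of the difference) and f residual degrees of freedom. The symbols d, s_d, f and F_f (the cumulative distribution function of the t-distribution with f degrees of freedom) are introduced here for illustration and are not the paper's notation.

```latex
\begin{align*}
P\bigl(|\delta| < \Delta \mid \text{data}\bigr)
  &= F_f\!\left(\frac{\Delta - d}{s_d}\right)
   - F_f\!\left(\frac{-\Delta - d}{s_d}\right), \\[4pt]
\text{for } d = 0:\qquad
P\bigl(|\delta| < \Delta \mid \text{data}\bigr)
  &= 2\,F_f\!\left(\frac{\Delta}{s_d}\right) - 1, \\[4pt]
\Delta = t_{0.95,\,f}\; s_d
  &\;\Longrightarrow\; P\bigl(|\delta| < \Delta \mid \text{data}\bigr) = 0.90, \\[4pt]
\Delta = t_{0.975,\,f}\; s_d
  &\;\Longrightarrow\; P\bigl(|\delta| < \Delta \mid \text{data}\bigr) = 0.95 .
\end{align*}
```

Under these assumptions, a result for which the 90% confidence interval just touches the margins carries a posterior probability of 90%, and a posterior probability of 95% is reached only when the 95% confidence interval fits within the margins, which is the sense in which the Bayesian criterion is the stricter of the two at d = 0.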
Future work includes investigations on equivalence tests in case there are multiple sources of variation in the data, for example when the analyses in the sites are performed using replicates over multiple analytical runs. In addition, the variability of the analytical method can depend on the site, which is not included in the current study. Finally, there are real-life examples for which the Normality assumption does not hold (for example, when results are expressed as fraction of successes in a number of trials, or analyses on purities that are close to 100% with a natural upper limit of 100%, and others).
8. Conclusion
It is concluded that a Bayesian strategy is very useful to deal with the problem of multi-site equivalence, providing more knowledge than a hypothesis test as used in the Frequentist approach. The Bayesian approach allows the construction of distributions of the individual site-to-site differences and of the maximum absolute difference over all sites. In addition, it allows assessment of the potential differences between individually observed results, and it allows incorporation of prior knowledge in the analyses.
If one is to use a Frequentist approach, then our simulation studies suggest there is no need to make use of multiplicity corrections that are commonly used in significance testing to control the Type I error. We acknowledge that the simulation results are not conclusive by themselves, but they are highly suggestive. Further research is needed to formally prove the observations.
Disclosure statement
No potential conflict of interest was reported by the author(s).
References
- 1. Berger R.L., and Hsu J.C., Bioequivalence trials, intersection-union tests and equivalence confidence sets. Stat. Sci. 11 (1996), pp. 283–319.
- 2. Borman P.J., Chatfield M.J., Damjanov I., and Jackson P., Design and analysis of method equivalence studies. Anal. Chem. 81 (2009), pp. 9849–9857.
- 3. Campbell H., and Lakens D., Can we disregard the whole model? Omnibus non-inferiority testing for R² in multi-variable linear regression and η² in ANOVA. Br. J. Math. Stat. Psychol. 74 (2021), pp. 64–89. doi: 10.1111/bmsp.12201.
- 4. Chambers D., Kelly G., Limentani G., Lister A., Lung K.R., and Warner E., Analytical method equivalency: An acceptable analytical practice. Pharm. Technol. (September 2005), pp. 64–80.
- 5. Dewé W., Govaerts B., Boulanger B., Rozet E., Chiap P., and Hubert P., Using total error as decision criterion in analytical method transfer. Chemom. Intell. Lab. Syst. 85 (2007), pp. 262–268.
- 6. Duncan D., A Bayesian approach to multiple comparisons. Technometrics 7 (1965), pp. 171–222.
- 7. EMA document “Reflection Paper on Statistical Methodology for the Comparative Assessment of Quality Attributes in Drug Development”, EMA/CHMP/138502/2017, 26 July 2021 (www.ema.europa.eu/en/documents/scientific-guideline/reflection-paper-statistical-methodology-comparative-assessment-quality-attributes-drug-development_en.pdf)
- 8. Giacoletti K., and Heyse J., Proportion of similar responses to evaluate similarity of immune response in vaccine clinical trials. Presentation, Joint Statistical Meetings, Toronto, Ontario, Canada, 2004.
- 9. Golam Kibria B.M., and Joarder A.H., A short review of multivariate t-distribution. J. Stat. Res. 40 (2006), pp. 59–72.
- 10. Kelter R., Bayesian Hodges-Lehmann tests for statistical equivalence in the two-sample setting: power analysis, type I error rates and equivalence boundary selection in biomedical research. BMC Medical Research Methodology 21 (2021), pp. 1–26.
- 11. Kruschke J.K., Rejecting or accepting parameter values in Bayesian estimation. Advances in Methods and Practices in Psychological Science 1 (2018), pp. 270–280.
- 12. Lachenbruch P.A., Rida W., and Kou J., Lot consistency as an equivalence problem. J. Biopharm. Stat. 14 (2004), pp. 275–290.
- 13. Lauzon C., and Caffo B., Easy multiplicity control in equivalence testing using two one-sided tests. Am. Stat. 63 (2009), pp. 147–154.
- 14. Limentani G.B., Ringo M.C., Ye F., Bergquist M.L., and McSorley E.O., Beyond the t-test: statistical equivalence testing. Anal. Chem. 77 (2005), pp. 221A–226A.
- 15. Linde M., Tendeiro J.N., Selker R., Wagenmakers E.J., and Van Ravenzwaaij D., Decisions about equivalence: a comparison of TOST, HDI-ROPE, and the Bayes factor. Psychol. Methods 28 (2023), pp. 740–755. doi: 10.1037/met0000402.
- 16. Neath A.A., and Cavanaugh J.E., A Bayesian approach to the multiple comparisons problem. J. Data. Sci. 4 (2006), pp. 131–146.
- 17. Rom D.M., and Hwang E., Testing for individual and population equivalence based on the proportion of similar responses. Stat. Med. 15 (1996), pp. 1489–1505.
- 18. Rusticus S.A., and Lovato C.Y., Applying tests of equivalence for multiple group comparisons: demonstration of the confidence interval approach. Practical Assessment, Research, and Evaluation 16 (2011).
- 19. Schuirmann D.J., A comparison of the two one-sided tests procedure and the power approach for assessing the equivalence of average bioavailability. J. Pharmacokinet. Biopharm. 15 (1987), pp. 657–680.
- 20. Stine R.A., and Heyse J.F., Non-parametric estimates of overlap. Stat. Med. 20 (2001), pp. 215–236.
- 21. Van Ravenzwaaij D., Monden R., Tendeiro J.N., and Ioannidis J.P.A., Bayes factors for superiority, non-inferiority, and equivalence designs. BMC Medical Research Methodology 19 (2019), pp. 1–12.