Skip to main content
PLOS ONE logoLink to PLOS ONE
. 2018 Apr 13;13(4):e0195145. doi: 10.1371/journal.pone.0195145

Conditional equivalence testing: An alternative remedy for publication bias

Harlan Campbell 1,*, Paul Gustafson 1
Editor: Daniele Marinazzo2
PMCID: PMC5898747  PMID: 29652891

Abstract

We introduce a publication policy that incorporates “conditional equivalence testing” (CET), a two-stage testing scheme in which standard NHST is followed conditionally by testing for equivalence. The idea of CET is carefully considered as it has the potential to address recent concerns about reproducibility and the limited publication of null results. In this paper we detail the implementation of CET, investigate similarities with a Bayesian testing scheme, and outline the basis for how a scientific journal could proceed to reduce publication bias while remaining relevant.

Introduction

Poor reliability, within many scientific fields, is a major concern for researchers, scientific journals and the public at large. In a highly cited essay, Ioannidis (2005) [1] uses Bayes theorem to claim that more than half of published research findings are false. While not all agree with the extent of this conclusion, e.g. [2, 3], the argument raises concerns about the trustworthiness of science, amplified by a disturbing prevalence of scientific misconduct [4]. That a result may be substantially less reliable than the p-value suggests has been “underappreciated” (Goodman & Greenland, 2007) [2], to say the least. To address this problem, some journal editors have taken radical measures [5], while the reputations of researchers are tarnished [6]. The merits of null hypothesis significance testing (NHST) [710] and the p-value are now vigorously debated [1113].

The limited publication of null results is certainly one of the most substantial factors contributing to low reliability. Whether due to a reluctance of journals to publish null results or to a reluctance of investigators to submit their null research [14, 15], the consequence is severe publication bias [16, 17]. Despite repeated warnings, publication bias persists and, to a certain degree, this is understandable. Accepting a null result can be difficult, owing to the well-known fact that “absence of evidence is not evidence of absence” [18, 19]. As Greenwald (1975) [20] writes: “it is inadvisable to place confidence in results that support a null hypothesis because there are too many ways (including incompetence of the researcher), other than the null hypothesis being true, for obtaining a null result.” Indeed, this is the foremost critique of NHST, that it cannot provide evidence in favour of the null hypothesis. The commonly held belief that for a non-significant result to show high “retrospective power” [21] implies support in favour of the null, is problematic [22, 23].

In order to address publication bias, it is often suggested [2427] that publication decisions should be made without regards to the statistical significance of results, i.e. “result-blind peer review” [28]. In fact, a growing number of psychology and neuroscience journals are requiring pre-registration [29] by means of adopting the publication policy of “Registered Reports” (RR) [30], in which authors “pre-register their hypotheses and planned analyses before the collection of data” [31]. If the rationale and methods are sound, the RR journal agrees (before any data are collected) to publish the study regardless of the eventual data and outcome obtained. Among many potential pitfalls with result-blind peer review [32], a genuine and substantial concern is that, if a journal were to adopt such a policy, it might quickly become a “dumping ground” [28] for null and ambiguous findings that do little to contribute to the advancement of science. To address these concerns, RR journals require that, for any manuscript to be accepted, authors must provide a-priori (before any data are collected) sample size calculations that show statistical power of at least 90% (in some cases 80%). This is a reasonable remedy to a difficult problem. Still, it is problematic for two reasons.

First, it is acknowledged that this policy will disadvantage researchers who work with “expensive techniques or who have limited resources” (Chambers et al., 2014) [31]. While not ideal, small studies can provide definitive value and potential for learning [33]. Some go as far as arguing against any requirement for a-priori sample size calculations (i.e. needing to show sufficient power as a requisite for publication). For example, Bacchetti (2002) [34] writes: “If a study finds important information by blind luck instead of good planning, I still want to know the results”; see also Aycaguer and Galbán (2013) [35]. While unfortunate, the loss of potentially valuable “blind luck” results and small sample studies [36] appears to be a necessary price to pay for keeping a “result-blind peer review” journal relevant. Is this too high a price? Based on simulations, Borm et al. (2009) [37] conclude that the negative impact of publication bias does not warrant the exclusion of studies with low power.

Second, a-priori power calculations are often flawed, due to the unfortunate “sample size samba” [38]: the practice of retrofitting the anticipated effect size in order to obtain an affordable sample size. Even under ideal circumstances, a-priori power estimation is often “wildly optimistic” [39] and heavily biased due to “follow-up bias” (see Albers and Lakens, 2018 [40]) and the “illusion of power” [41]. This “illusion” (also known as “optimism bias”) occurs when the estimated effect size is based on a literature filled with overestimates (to be expected in many fields due, somewhat ironically, to publication bias). Djulbegovic et al. (2011) [42] conduct a retrospective analysis of phase III randomized controlled trials (RCTs) and conclude that optimism bias significantly contributes to inconclusive results; see also Chalmers and Matthews (2006) [43]. What’s more, oftentimes due to unanticipated difficulties with enrolment, the actual sample size achieved is substantially lower than the target set out a-priori [44]. In these situations, RR requires that either the study is rejected/withdrawn for publication or a certain leeway is given under special circumstances (Chambers (2017), personal communication). Neither option is ideal. Given these difficulties with a-priori power calculations, it remains to be seen to what extent the 90% power requirement will reduce the number of underpowered publications that could lead a journal to be a dreaded “dumping ground”.

An alternative proposal to address publication bias and the related issues surrounding low reliability is for researchers to adopt Bayesian testing schemes; e.g. [4547]. It has been suggested that with Bayesian methods, publication bias will be mitigated “because the evidence can be measured to be just as strong either way” [48]. Bayesian methods may also provide for a better understanding of the strength of evidence [49]. However, sample sizes will typically need to be larger than with equivalent frequentist testing in situations when there is little prior information incorporated [50]. Furthermore, researchers in many fields remain uncomfortable with the need to define (subjective) priors and are concerned that Bayesian methods may increase “researcher degrees of freedom” [51]. These issues may be less of a concern if studies are pre-registered, and indeed, a number of RR journals allow for a Bayesian option. At registration (before any data are collected), rather than committing to a specific sample size, researchers commit to attaining a certain Bayes Factor (BF). For example, the journals Comprehensive Results in Social Psychology (CRSP) and NFS Journal (the official journal of the Society of Nutrition and Food Science) require that one pledges to collect data until the BF is more than 3 (or less than 1/3) [52]. The journals BMC Biology and the Journal of Cognition require, as a requisite for publication, a BF of at least 6 (or less than 1/6) [53, 54].

In this paper, we propose an alternate option made possible by adapting NHST to conditionally incorporate equivalence testing. While equivalence testing is by no means a novel idea, previous attempts to introduce equivalence testing have “largely failed” [55] (although recently, there has been renewed interest). In Section 2, in order to further the understanding of conditional equivalence testing, we provide a brief overview including how to establish appropriate equivalence margins.

One reason conditional equivalence testing (CET) is an appealing approach is that it shares many of the properties that make Bayesian testing schemes so attractive. As such, the publication policy we put forward is somewhat similar to the RR “Bayesian option”. With CET, evidence can be measured in favour of either the alternative or the null (at least in some pragmatic sense), and as such is “compatible with a Bayesian point of view” [56]. In Section 3, we conduct a simple simulation study to demonstrate how one will often arrive at the same conclusion whether using CET or a Bayesian testing scheme. In Section 4, we outline how a publication policy, similar to RR, could be framed around CET to encourage the publication of null results. Finally, Section 5 concludes with suggestions for future research.

Conditional equivalence testing overview

Standard equivalence testing is essentially NHST with the hypotheses reversed. For example, for a two-sample study of means, the equivalence testing null hypothesis would be a difference (beyond a given margin) in population means, and the alternative hypothesis would be equal (within a given margin) population means. Conditional equivalence testing (CET) is the practice of standard NHST followed conditionally (if one fails to reject the null) by equivalence testing. CET is not an altogether new way of testing. Rather it is the usage of established testing methods in a way that permits better interpretation of results. In this regard, it is similar to other proposals such as the “three-way testing” scheme proposed by Goeman et al. (2010) [57], and Zhao (2016) [58]’s proposal for incorporating both statistical and clinical significance into one’s testing. To illustrate, we briefly outline CET for a two-sample test of equal means (assuming equal variance). Afterwards, we discuss defining the equivalence margin. In Appendix 1, we provide further details on CET including a look at operating characteristics and sample size calculations.

Let xi1, for i = 1, …, n1 and xi2, for i = 1, …, n2 be independent random samples from two normally distributed populations of interest with μ1, the true mean of population 1; μ2, the true mean of population 2; and σ2, the true common population variance. Let n = n1 + n2 and define sample means and sample variances as follows: x¯g=i=1ngxgi/ng, and sg2=i=1ng(xgi-x¯g)2/(ng-1), for g = 1, 2. Also, let sp=((n1-1)s12+(n2-1)s22)/(n1+n2-2). The true difference in population means, μd = μ1μ2, under the standard null hypothesis, H0, is equal to zero. Under the standard alternative, H1, we have that μd ≠ 0.

The term equivalence is not used in the strict sense that μ1 = μ2. Instead, equivalence in this context refers to the notion that the two means are “close enough”, i.e. their difference is within the equivalence margin, δ = [−Δ, Δ], chosen to define a range of values considered equivalent. The value of Δ is ideally chosen to be the “minimum clinically meaningful difference” [59, 60] or as the “smallest effect size of interest” [61].

Let FTdf() be the cumulative distribution function (cdf) of the t distribution with df degrees of freedom and define the following critical t values: tα1/2*=FTn-2-1(1-0.5α1) (i.e. the upper 100·α12-th percentile of the t-distribution with n − 2 degrees of freedom) and tα2*=FTn-2-1(1-α2), (i.e. the upper 100 ⋅ α2-th percentile of the t-distribution with n − 2 degrees of freedom). As such, α1 is the maximum allowable type I error (e.g. α1 = 0.05) and α2 is the maximum allowable “type E” error (erroneously concluding equivalence), possibly set equal to α1. If a type E error is deemed less costly than a type I error, α2 may be set higher (e.g. α2 = 0.10). CET is the following conditional procedure consisting of five steps:

Step 1- A two-sided, two-sample t-test for a difference of means.

Calculate the t-statistic, T=(x¯1-x¯2)/(sp1/n1+1/n2) and associated p-value, p1 = 2 ⋅ FTn−2(−|T|).

Step 2- If |T|>tα1/2*, then declare a positive result. There is evidence of a statistically significant difference, p-value = p1.

Otherwise, if |T|tα1/2*, proceed to Step 3.

Step 3- Two one-sided tests (TOST) for equivalence of means [62, 63].

Calculate two t-statistics: T1=(x¯1-x¯2+Δ)/(sp1/n1+1/n2), and T2=(x¯1-x¯2-Δ)/(sp1/n1+1/n2). Calculate an associated p-value, p2 = max(FTn−2(−T1), FTn−2(T2)). Note: p2 is a marginal p-value, in that it is not calculated under the assumption that p1 > α1.

Step 4- If T1>tα2* and T2<(-tα2*), declare a negative result. There is evidence of a statistically significant equivalence (with margin δ = [−Δ, Δ]), p-value = p2.

Otherwise, proceed to Step 5.

Step 5- Declare an inconclusive result. There is insufficient evidence to support any conclusion.

For ease of explanation, let us define pCET = p1 if the result is positive, and pCET = 1 − p2 if the result is negative or inconclusive. Thus, a small value of pCET suggests evidence in favour of a positive result, whereas a large value of pCET suggests evidence in favour of a negative result. As is noted above, it is important to acknowledge that, despite the above procedure being dubbed “conditional”, p2 is a marginal p-value, i.e. it is not calculated under the assumption that p1 > α1. The interpretation of p2 is taken to be the same regardless of whether it was obtained following Steps 1 and 2 or was obtained “on its own” via standard equivalence testing.

Standard NHST involves the same first and second steps and ends with an alternative Step 3 which states that if p1 > α1, one declares an inconclusive result (‘there is insufficient evidence to reject the null.’). Similar to two-sided CET, one-sided CET is straightforward, making use of non-inferiority testing in Step 3.

CET is a procedure applicable to any type of outcome. Let θ be the parameter of interest in a more general testing setting. CET can be described in terms of calculating confidence intervals around θ^, the statistic of interest. For example, θ may be defined as the the difference in proportions, the hazard ratio, the risk ratio, or the slope of a linear regression model, etc. [6470]. Note that in these other cases, the equivalence margin, δ, may not necessarily be centred at zero, but will be a symmetric interval around θ0, the value of θ under the standard null hypothesis. In general, let δ = [δL, δU], with Δ equal to half the length of the interval. Consider, more generally, CET as the following procedure:

Step 1- Calculate a (1 − α1)% Confidence Interval for θ.

Step 2- If this C.I. excludes θ0, then declare a positive result.

Otherwise, if θ0 is within the C.I., proceed to Step 3.

Step 3- Calculate a (1 − 2α2)% Confidence Interval for θ.

Step 4- If this C.I. is entirely within δ, declare a negative result.

Otherwise, proceed to Step 5.

Step 5- Declare an inconclusive result.

There is insufficient evidence to support any conclusion.

Fig 1 illustrates the three conclusions with their associated confidence intervals. We once again consider two-sample testing of normally distributed data as before, with α1 = 0.05 and α2 = 0.10. Situation “a”, in which the (1 − α1)% C.I. is entirely outside the equivalence margin is what Guyatt et al. (1995) [71] call a “definitive-positive” result. The lower confidence limit of the parameter is not only larger than zero (the null value, θ0), implying a “positive” result, but also is above the Δ threshold. Situations “b” and “c”, in which the (1 − α1)% C.I. excludes zero and the (1 − 2α2)% C.I. is within [−Δ, Δ] are considered “positive” results but require some additional interpretation. One could describe the effect in these cases as “significant yet not meaningful” or conclude that there is evidence of a significant effect, yet the effect is likely “not minimally important”. A “positive” result such as “d”, with a wider confidence interval, represents a significant, albeit imprecisely estimated, effect. One could conclude that additional studies, with larger sample sizes, are required to determine if the effect is of a meaningful magnitude.

Fig 1. Left panel—CET point estimates and confidence intervals.

Fig 1

Point estimates and confidence intervals of thirteen possible results from two-sample testing of normally distributed data are presented alongside their corresponding conclusions. Let n = 90, Δ = 0.5, α1 = 0.05 and α2 = 0.10. Black points indicate point estimates; blue lines (wider intervals) represent (1-α1)% confidence intervals; and orange lines (shorter intervals) represent (1-2α2)% confidence intervals. Right panel- Rejection regions. Let t1=tα1/2*, t2=tα2*. The values of μ^d=x¯1-x¯2 (on the x-axis) and s*=sp(1/n1+1/n2) (on the y-axis) are obtained from the data. The three conclusions, positive, negative, inconclusive correspond to the three areas shaded in green, blue and red respectively. The black lettered points correspond to the thirteen scenarios on the left panel.

Evidently, in some cases, the three categories are not sufficient on their own for adequate interpretation. For example, without any additional information the “positive” vs. “negative” distinction between cases “b” and “f” is incomplete and potentially misleading. These two results appear very similar: in both cases a substantial (i.e. greater than Δ-sized) effect can be ruled out. Indeed, positive result “b” appears much more like the negative result “f” than positive result “a”. Additional language and careful attention to the estimated effect size is required for correct interpretation. While case “k” has a similar point estimate to “b” and “f”, the wider C.I. (Δ is within (1 − 2α2)% C.I.) means that one cannot rule out the possibility of a meaningful effect, and so it is rightly categorized as “inconclusive”.

It may be argued that CET is simply a recalibration of the standard confidence interval, like converting Fahrenheit to Celsius. This is valid commentary and in response, it should be noted that our suggestion to adopt CET (in the place of standard confidence intervals) is not unlike suggestions that confidence intervals should replace p-values [7274]. One advantage of CET over confidence intervals is that it may improve the interpretation of null results [75, 76]. By clearly distinguishing between what is a negative versus an inconclusive result, CET serves to simplify the long “series of searching questions” necessary to evaluate a “failed outcome” [77]. However, as can be seen with the examples of Fig 1, the use of CET should not rule out the complementary use of confidence intervals. Indeed, the best interpretation of a result will be when using both tools together.

As Dienes and Mclatchie (2017) [45] clearly explain, with standard NHST, one is unable to make the “three-way distinction” between the positive, inconclusive and negative (“evidence for H1”, “no evidence to speak of”, and “evidence for H0”). Fig 2 shows the distribution of standard two-tailed p-values under NHST and corresponding pCET values under CET for three different sample sizes. These are the result of two-sample testing of normally distributed data with n1 = n2.

Fig 2. NHST p-values vs. pCET values.

Fig 2

Distribution of two-tailed p-values from NHST and pCET values from CET, with varying n (total sample size) and δ = [−0.5, 0.5]. Data are the results from 10,000 Monte Carlo simulations of two-sample normally distributed data with equal variance, σ2 = 1. The true difference in means, μd = μ1μ2, is drawn from a uniform distribution between -2 and 2. Black points (= 1 − p2) might indicate evidence in favour of equivalence (“negative result”), whereas blue points (= p1) indicate evidence in favour of non-equivalence (“positive result”). An “inconclusive result” would occur for all black points falling in between the two dashed horizontal lines at α1 = 0.05 and α2 = 0.10, (i.e for p2 > α2). The format of this plot is based on Fig 1, “Distribution of two-tailed p-values from Student’s t-test”, from Lew (2013) [11].

Under the standard null (μd = 0) with NHST, one is just as likely to obtain a large p-value as one is to obtain a small p-value. Under the standard null with CET, a small p2 value (a large pCET value) will indicate evidence in favour of equivalence given sufficient data. In Fig 2, note that the blue points that fall below the α1 = 0.05 threshold in the upper panels (NHST) remain unchanged in the lower panels (CET). However, blue points that fall above the α1 = 0.05 threshold in the upper panels (NHST) are no longer present in the lower panels (CET). They have been replaced by black points (= pCET = 1 − p2) which, if near the top, suggest evidence in favour of equivalence. This treatment of the larger p-values is more conducive to the interpretation of null results, bringing to mind the thoughts of Amrhein et al. (2017) [78] who write “[a]s long as we treat our larger p-values as unwanted children, they will continue disappearing in our file drawers, causing publication bias, which has been identified as the possibly most prevalent threat to reliability and replicability.”

Defining the equivalence margin

As with standard equivalence and non-inferiority testing, defining the equivalence margin will be one of the “most difficult issues” [79] for CET. If the margin is too large, then a claim of equivalence is meaningless. If the margin is too small, then the probability of declaring equivalence will be substantially reduced [80]. As stated earlier, the margin is ideally chosen as a boundary to exclude the smallest effect size of interest [61]. However, these can be difficult to define, and there is generally no clear consensus among stakeholders [81]. Furthermore, previously agreed-upon meaningful differences may be difficult to ascertain as they are rarely specified in protocols and published results [42].

In some fields, there are some generally accepted norms. For example, in bioavailability studies, equivalence is routinely defined (and listed by regulatory authorities) as a difference of less than 20%. In oncology trials, a small effect size has been defined as odds ratio or hazard ratio of 1.3 or less [82]. In ecology, a proposed equivalence region for trends in population size (the log-linear population per year regression slope) is δ = [−0.0346, 0.0346] [68].

In cases when a specified equivalence margin may not be as clear-cut, less conventional options have been put forth (e.g. the “equivalence curve” of Hauck and Anderson (1986) [76], and the least equivalent allowable difference (LEAD) of Meyners (2007) [83]). One important choice a researcher must make in defining the equivalence margin is whether the margin should be defined on a raw or standardized scale; for further details, see [55, 61]. For example, in our two-sample normal case, one could define equivalence to be a difference within half the estimated standard deviation, i.e. defining Δ = qsp, with pre-specified q = 0.5. Note, that the probability of obtaining a negative result may be zero (or negligible) for certain combinations of values of α1, α2, n1, n2 and q. For example, with q = 0.5, α1 = 0.05, and α2 = 0.10, Pr(negative) = 0 for all n ≤ 26 (with n1 = n2). As such, q must be chosen with additional practical considerations (see additional details Appendix 1). For binary and time-to-event outcomes, there is an even greater number of different ways one can define the margin [65, 8486].

Since the choice of margin is often difficult in the best of circumstances, a retrospective choice is not ideal as there will be ample room for bias in one’s choice, regardless of how well intentioned one may be. For this reason, for equivalence and non-inferiority RCTs, it is generally expected (and we recommend, when possible with CET) that margins are to be pre-specified [87].

A comparison with Bayesian testing

Recently, Bayesian statistics have been advocated for as a “possible solution to publication bias” [88]. In particular, there have been many Bayesian testing schemes proposed in the psychology literature [89, 90]. What’s more, publication policies based on Bayesian testing schemes are currently in use by a small number of journals and are the preferred approach for some [45]. In response to these developments, we will compare, with regards to their operating characteristics, CET and a Bayesian testing scheme. This brings to mind Dienes (2014) [91] who compares testing with Bayes Factors (BF) to testing with “interval methods” and notes that with interval methods, a study result is a “reflection of the data”, whereas with BFs the result reflects the “evidence of one theory over another”. What follows is a brief overview of one Bayesian scheme and an investigation of how it compares to CET. (We recognize that this scheme is based on reference priors while some advocate for weakly informative priors or elicitation of subjective priors.)

The Bayes Factor is a valuable tool for determining the relative evidence of a null model compared to an alternative model [92]. Consider, for the two-sample testing of normally distributed data, a Bayes Factor testing scheme in which we take the JZS (Jeffreys-Zellner-Siow) prior (recommended as a reasonable “objective prior”) for the alternative hypothesis; see Appendix 2 and Rouder et al. (2009) [93] for details.

Fig 3(A) shows how the JZS Bayes Factor changes with sample size for four different values of the observed difference in means, μ^d=x¯1-x¯2; the observed variance is held constant at sp2=1. When the observed mean difference is exactly 0, the BF increases logarithmically with n. For small to moderate μ^d, the BF supports the null for small values of n. However, as n becomes larger, the BF yields less support for the null and eventually favours the alternative. The horizontal lines mark the 3:1 and 1/3 thresholds (“moderate evidence”) as well as the 10:1 and 1/10 thresholds (“strong evidence”).

Fig 3. JZS Bayes Factor vs. pCET.

Fig 3

Based partially on Fig 5 from Rouder et al. (2009) [93]. Left (middle; right) panel shows how the JZS Bayes Factor (the posterior probability of H0; pCET) changes with sample size for four different observed mean differences, μ^d=x¯1-x¯2. The observed variance is constant, sp2=1.

Fig 3(C) shows pCET-values for the same four values of μ^d, constant sp2=1, and the equivalence margin of [−0.50, 0.50]. The curves suggest that CET possesses similar “ideal behaviour” [93] as is observed with the BF. When the observed mean difference is exactly zero, CET provides increasing evidence in favour of equivalence with increasing n. For small to moderate μ^d, CET supports the null at first and then, as n becomes larger, at a certain point favours the alternative. The sharp change-point represents border cases. Consider case “f” in Fig 1: if n increased, the confidence intervals would shrink and at a certain point, the (1 − α1)% C.I. would be exclude 0 (similar to case “b”). At that point, the result abruptly changes from “negative” to “positive”. While this abrupt change may appear odd at first, it may in fact be more desirable than the smooth transition of the BF. Consider for example when μ^d=0.25. Then for n between 112 and 496, the BF will be strictly above 1/3 and strictly below 3, and as such the result, by BF, is inconclusive. In contrast, for the same range in sample size, pCET will be either above 0.90 or below 0.05. As such, with α1 = 0.05 and α2 = 0.10, a conclusive result is obtained. While careful interpretation is required (e.g. “the effect is significant yet not of a meaningful magnitude”), this may be preferable in some settings to the BF’s inconclusive result.

We can also consider the posterior probability of H0 (i.e. μd = 0) equal to B01/(1 + B01) (when the prior probabilities Pr(H0) and Pr(H1) are equal), plotted in Fig 3(B). The similarities and differences between p-values and posterior probabilities have been widely discussed previously [9496]. Fig 3 suggests that the JZS-BF and CET testing may often result in similar conclusions. We investigate this further by means of a simple simulation study.

Simulation study

We conducted a small simulation study to compare the operating characteristics of testing with the JZS-BF relative to with the CET approach. CET conclusions were based on setting Δ = 0.50, α1 = 0.05 and α2 = 0.10. JZS BF conclusions were based on a threshold of 3 or greater for evidence in favour of the a negative result and less than 1/3 for evidence in favour of a positive result. BFs in the 1/3–3 range correspond to an inconclusive result. A threshold of 3:1 can be considered “substantial evidence” [97]. We also conducted the simulations with a 6:1 threshold for comparison.

Note that one advantage of Bayesian methods, is that sample sizes need not be determined in advance [98]. With frequentist testing, interim analyses to re-evaluate one’s sample size can be performed (while maintaining the Type 1 error). However, these must be carefully planned in advance [99, 100]. Schönbrodt and Wagenmakers (2016) [101] list three ways one might design the sample size for a study using the BF for testing. For the simulation study here we examine only the “fixed-n design”.

For a range of μd (= 0, 0.07, 0.09, 0.13, 0.18, 0.25, 0.35, 0.48, 0.67) and 14 different sample sizes (n ranging from 10 to 5,000, with n1 = n2) we simulated normally distributed two-sample datasets (with σ2 = 1). For each dataset, we obtained CET p-values, JZS BFs and declared the result to be positive, negative or inconclusive accordingly. Results are presented in Fig 4 (and Figs 57), based on 10,000 distinct simulated datasets per scenario.

Fig 4. Simulation study results: Conclusions; with BF threshold of 3:1.

Fig 4

The probability of obtaining each conclusion by Bayesian testing scheme (JZS-BF with fixed sample size design, BF threshold of 3:1) and CET (α1 = 0.05, α2 = 0.10). Each panel displays the results of simulations with true mean difference, μd = 0, 0.07, 0.09, 0.13, 0.18, 0.25, 0.35, 0.48, and 0.67.

Fig 5. Simulation study results: Levels of agreement; with BF threshold of 3:1.

Fig 5

The black line (“overall”) indicates the probability that the conclusion reached by CET is in agreement with the conclusion reached by the JZS-BF Bayesian testing scheme; with JZS-BF with fixed sample size design, BF threshold of 3:1; and CET with α1 = 0.05, and α2 = 0.10. The blue line (“positive”) indicates the probability that both methods agree with respect to whether or or not a result is positive. The crimson line (“negative”) indicates the probability that both methods agree with respect to wether or or not a result is negative. Each panel displays the results of simulations with true mean difference, μd = 0, 0.07, 0.09, 0.13, 0.18, 0.25, 0.35, 0.48, and 0.67.

Fig 7. Simulation study results: Levels of agreement; with BF threshold of 6:1.

Fig 7

The black line (“overall”) indicates the probability that the conclusion reached by CET is in agreement with the conclusion reached by the JZS-BF Bayesian testing scheme; with JZS-BF with fixed sample size design, BF threshold of 6:1; and CET with α1 = 0.05, andα2 = 0.10. The blue line (“positive”) indicates the probability that both methods agree with respect to wether or or not a result is positive. The crimson line (“negative”) indicates the probability that both methods agree with respect to wether or or not a result is negative. Each panel displays the results of simulations with true mean difference, μd = 0, 0.07, 0.09, 0.13, 0.18, 0.25, 0.35, 0.48, and 0.67.

Several findings merit comment (unless otherwise noted, comments refer to results displayed in Fig 4):

  • In this simulation study, the JZS-BF admits a very low frequentist type I error, recorded at most ≈0.01, for a sample size of n = 110. As the sample size increases, the frequentist type I error diminishes to a negligible level.

  • The JZS-BF requires less data to reach a negative conclusion than the CET. However, with moderate to large sample sizes (n = 100 to 5,000) and small true mean differences (μd = 0 to 0.25), both methods are approximately equally likely to deliver a negative conclusion.

  • While the JZS-BF requires less data to reach a conclusion when the true mean difference is small (μd = 0 to 0.25) (see how solid black curve drops more rapidly than the dashed grey line), there are scenarios in which larger sample sizes will surprisingly reduce the likelihood of obtaining a conclusive result (see how the solid black curve drops abruptly then rises slightly as n increases for μd = 0.07, 0.09, 0.13, and 0.18.)

  • The JZS-BF is always less likely to deliver a positive conclusion (see how the dashed blue curve is always higher than the solid blue curve). In the scenarios like those considered, JZS-BF may require larger sample sizes for reaching a positive conclusion and may be considered “less powerful” in a traditional frequentist sense.

  • Fig 5 shows the probability that, for a given dataset, both methods agree as to whether the result is “positive”, “negative”, or “inconclusive”. Across the range of effect sizes and samples sizes considered, the probability of agreement is always higher than 50%. For the strictly dichotomous choice between “positive” and “not positive”, the probability that both methods agree is above 70% for all cases considered except a few scenarios in which effect size is small (μd = 0.07, 0.09, 0.13, and 0.18) and sample size is large (n > 500).

  • With the 6:1 BF threshold, results are similar. The main difference with this more stringent threshold is that much larger sample sizes are required in order to a achieve high probabilities of obtaining a conclusive result, see Fig 6; and agreement with CET is substantially lower, see Fig 7.

Fig 6. Simulation study results: Conclusions; with BF threshold of 6:1.

Fig 6

The probability of obtaining each conclusion by Bayesian testing scheme (JZS-BF with fixed sample size design, BF threshold of 6:1) and CET (α1 = 0.05, α2 = 0.10). Each panel displays the results of simulations with true mean difference, μd = 0, 0.07, 0.09, 0.13, 0.18, 0.25, 0.35, 0.48, and 0.67.

The results of the simulation study suggest that, in many ways, the JZS-BF and CET operate very similarly. Think of JZS-BF and CET as two pragmatically similar, yet philosophically different, tools for making “trichotomous significance-testing decisions”.

A CET publication policy

Many researchers have put forth ideas for new publication policies aimed at addressing the issue of publication bias. There is a wide range of opinions on how to incorporate more null results into the published literature [102104]. RR is one of many proposed “two-step” manuscript review schemes in which acceptance for publication is granted prior to obtaining the results, e.g. [24, 105107]. Fig 8 (left panel) illustrates the RR procedure and Chambers et al. (2014) [31] provide an in-depth explanation answering a number of frequently asked questions. The RR policy has two central components: (1) pre-registration and (2) the “RR commitment to publish”.

Fig 8. Left panel—The RR publication policy.

Fig 8

The RR policy has two central components: (1) pre-registration and (2) the “RR commitment to publish”. Right panel—The CET publication policy. The CET policy the registration will not require a sample size calculation showing a specific level of statistical power. Instead, the researcher will define an equivalence margin for each hypothesis test to be carried out. A study will only be published if a conclusive result is obtained.

Pre-registration can be extremely beneficial as it reduces “researcher degrees of freedom” [51] and prevents, to a large degree, many questionable research practices including data-dredging [108], the post-hoc fabrication of hypotheses, (“HARKing”) [109], and p-hacking [110]. However, on its own, pre-registration does little to prevent publication bias. This is simply because pre-registration: (a1) cannot prevent authors from disregarding negative results [111]; (a2) does nothing to prevent reviewers and editors from rejecting studies for lack of significance; and (a3) does not guarantee that peer reviewers consider compliance with the pre-registered analysis plan [44, 112114].

Consider the field of medicine as a case study. For over a decade, pre-registration of clinical trials has been required by major journals as a prerequisite for publication. Despite this heralded policy change, selective outcome reporting remains ever prevalent [115118]. (This being said, new 2017/2018 guidelines for the clinicaltrial.gov registry show much promise in addressing a1, a2, and a3; see [119].)

In order to prevent publication bias, RR complements pre-registration with a “commitment to publish”. In practice this consists of an “in principle acceptance” policy along with the policy of publishing “withdrawn registration” (WR) studies. In order to counter authors who may simply shelve negative results following pre-registration (a1), RR journals commit to publishing the abstracts of all withdrawn studies as WR papers. By guaranteeing that, should a study follow its pre-registered protocol, it will be accepted for publication (“in principle acceptance”), RR prevents reviewers and editors from rejecting a study based on the results (a2). Finally, RR requires that a study is in strict compliance with the pre-registered protocol if it is to be published (a3). In order to keep a RR journal relevant (not simply full of inconclusive studies), RR requires, as part of registration, that a researcher commits to a sample size large enough of achieve 90% (in some cases 80%) statistical power. In a small number of RR journals, a Bayesian alternative option is offered. Instead of committing to a specific sample size, researchers commit to achieving a certain BF. (Note that currently in some RR journals, power requirements are less well defined.)

The policy we put forth here is not meant to be an alternative to the “pre-registration” component of RR. Its benefits are clear, and in our view is most often “worth the effort” [120]. If implemented properly, pre-registration should not “stifle exploratory work” [121]. Instead, what follows is an alternative to the second “commitment to publish” component.

Outline of a CET-based publication policy

Fig 8 (right panel) illustrates the steps of our proposed policy. What follows is a general outline.

Registration- In the first stage of a CET-based policy, before any data are collected, a researcher will register the intent to conduct a study with a journal’s editor. As in the RR policy, this registration process will (1) detail the motivations, merits, and methods of the study, (2) list the defined outcomes and various hypotheses to be tested, and (3) state a target sample size, justified based on multiple grounds. However, unlike in the RR policy, failure to meet a given power threshold (e.g. failure to show calculations that give 80% power) will not generally be grounds for rejection (see the Conclusion for further commentary on this point). Rather, registration under a CET-based policy will be based on defining an equivalence margin for each hypothesis test to be carried out. For example, if a researcher intends to fit a linear regression model with five explanatory variables, a margin should be defined for each of the five variables. The researcher will also need to note if there are plans for any sample size reassessments, interim and/or futility analyses.

Editorial and Peer Review- If the merits of the study satisfy the editorial board, the registration study plan will then be sent to reviewers to assess whether the methods for analysis are adequate and whether the equivalence margins are sufficiently narrow. Once the peer reviewers are satisfied (possibly after revisions to the registration plan), the journal will then agree, in principle, to accept the study for eventual publication, on condition that either a positive or a negative result is obtained.

Data collection- Armed with this “in principle acceptance”, the researcher will then collect data in an effort to meet the established sample size target. Once the data are collected and analyses complete, the study will be published if and only if either a positive or negative result is obtained as defined by the pre-specified equivalence margins. Inconclusive results will not generally be published thus protecting a journal from becoming a “dumping ground” for failed studies. In very rare circumstances, however, it may be determined that an inconclusive study offers a substantial contribution to the field and should therefore be considered for publication. As is required practice for good reporting, any failure to meet the target sample size should be stated clearly, along with a discussion of the reasons for failure and the consequences with regards to the interpretation of results [122].

This proposed policy is most similar to the RR Bayesian option outlined in Section 3. Journals only commit (“in principle acceptance”) to publish conclusive studies (small p-value/ BF above or below a certain threshold) and no a-priori sample size requirements are forced upon a researcher. As noted in Section 3, we expect that the conclusions obtained under either policy will often be the same. The CET-based policy therefore represents a frequentist alternative to the RR-BF policy.

A journal may wish to require stricter or weaker requirements for publication of results and this can be done by setting thresholds for α1 and α2 accordingly, (e.g. stricter thresholds: α1 = 0.01, α2 = 0.05, Δ = 0.5sp; weaker thresholds: α1 = 0.05, α2 = 0.10, Δ = 0.5sp). Also, note that a p-value may be used in one way for the interpretation of results and another way to inform publication decisions. Recently, a large number of researchers [123] have publically suggested that the threshold for defining “statistical significance” be changed from p < 0.05 to p < 0.005. However, they are careful to emphasize that while this stricter threshold should change the description and interpretation of results, it should not change “standards for policy action nor standards for publication”.

“Gaming the system”- When evaluating a new publication policy one should always ask how (and how easily) a researcher could -unintentionally or not- “game the system”. For the CET-policy as described above, we see one rather obvious strategy.

Consider the following, admittedly extreme, example. A researcher submits 20 different study protocols for pre-registration each with very small sample size targets (e.g. n = 8). Suspend any disbelief, and suppose these studies all have good merit, are all well written, and are all well designed (besides being severely underpowered), and are therefore all granted in principle acceptance. Then, in the event that the null is true (i.e. μ = 0) for all 20 studies (and with α1 = 0.05), it is expected that at least one study out of the 20 will obtain a positive result and thus be published. This is publication bias at its worst.

In order to discourage this unfortunate practice, we suggest making a researcher’s history of pre-registered studies available to the public (e.g. using ORCID [124]). However, while potentially beneficial, we do not anticipate this type of action being necessary. With the CET-policy, it is no longer in a researcher’s interest to “game the system”.

Consider once again, our extreme example. The strategy has an approximately 64% chance of obtaining at least one publication at a cost of 20 submissions and a total of 160 = 8 ⋅ 20 observations. If the goal is to maximize the probability of being published [125], then it is far more efficient to submit a single study with n = 160, in which case there is an approximately 98% chance of obtaining a publication (i.e. with α2 = 0.10, Pr(inconclusive|μ = 0, σ2 = 1, Δ = 0.5) = 0.02; 92% chance with α2 = 0.05, Pr(inconclusive|μ = 0, σ2 = 1, Δ = 0.5) = 0.08).

Conclusion

Publication bias has been recognized as a serious problem for several decades now [126]. Yet, in many fields, it is only getting worse [127]. An investigation by Kühberger et al. (2014) [128] concludes that the “entire field of psychology” is now tainted by “pervasive publication bias”.

There remains substantial disagreement on the merits of pre-registration and result-blind peer-review (e.g. [129131]). Yet, all can agree that innovative publication policy prescriptions can be part of the solution to the “reproducibility crisis”. While some call for dropping p-values and strict thresholds of evidence altogether, we believe that it is not worthwhile to fight “the temptation to discretize continuous evidence and to declare victory” [132]. Instead, the research community should embrace this “temptation” and work with it to achieve desirable outcomes. Indeed, one way to address the “practical difficulties that reviewers face with null results” [32] is to further discretize continuous evidence by means of equivalence testing and we submit that CET can be an effective tool for distinguishing those “high-quality null results” [133] worthwhile of publication.

Recently, a number of influential researchers have argued that to address low reliability, scientists, reviewers and regulators should “abandon statistical significance” [134]. (Somewhat ironically, in some fields, such as reinforcement learning, the currently proposed solution is just the opposite: the adoption of “significance metrics and tighter standardization of experimental reporting” [135].) We recognize that current publication policies, in which evidence is dichotomized (without any “ontological basis”) may be highly unsatisfactory. However, one benefit to adopting distinct categories based on clearly defined thresholds (as in the CET-policy) is that one can assess, in a systematic way, the state of published research (in terms of reliability, power, reproducibility, etc.). While using a “more holistic view of the evidence” [134] to inform publication decisions may (or may not?) prove effective, “meta-research” under such a paradigm is clearly less feasible. As such, the question of effectiveness may perhaps never be adequately answered.

Bayesian approaches offer many benefits. However, we see three main drawbacks. First, adopting Bayesian testing requires a substantial paradigm shift. Since the interpretation of findings deemed significant by traditional NHST may differ with Bayes, some will no doubt be reluctant to accept the shift. With CET, the traditional usage and interpretation of the p-value remains unchanged, except in circumstances when one fails to reject the null. As such, CET does not change the interpretation of findings already established as significant. Indeed, CET simply “extend[s] the arsenal of confirmatory methods rooted in the frequentist paradigm of inference” [136]. Second, as we observed in our simulation study, Bayesian testing is potentially less powerful than NHST (in the traditional frequentist sense, with a fixed sample size design) and as such could require substantially larger sample sizes. Finally, we share the concern of Morey and Rouder (2011) [137] who write that Bayesian testing “provides no means of assessing whether rejections of the nil [null] are due to trivial or unimportant effect sizes or are due to more substantial effect sizes”.

There are many potential areas for further research. Determining whether Δ and/or α1 and/or α2 should be chosen with consideration of the sample size is important and not trivial; related work includes Pérez et al. (2014) [138] who put forward a “Bayes/non-Bayes compromise” in which the α-level of a confidence interval changes with n. Issues which have proven problematic for standard equivalence testing must also be addressed for CET. These include multiplicity control [139] and potential problems with interpretation [140]. It would also be worthwhile considering whether CET is appropriate for testing for baseline balance [141]. The impact of a CET policy on meta-analysis should also be examined [142], i.e. how should one account for the exclusion of inconclusive results in the published literature when deriving estimates in a meta-analysis? Finally, Bloomfield et al. (2018) [143] consider how different publication policies will lead to “changes [in] the nature and timing of authors’ investment in their work.” It remains to be seen to what extent the CET-based policy, in which the decision to publish is based both on the up-front study registration and on the results obtained after analysis, will lead to a “shift from follow-up investment to up-front investment”.

The publication policy outlined here should be welcomed by journal editors, researchers and all those who wish to see more reliable science. Research journals which wish to remain relevant and gain a high impact factor should welcome the CET-policy as it offers a mechanism for excluding inconclusive results while providing a space for potentially impactful negative studies. As such, embracing equivalence testing is an effective way to make publishing null results “more attractive” [144]. Researchers should be pleased with a policy that provides a “conditional in principle acceptance” and does not insist on specific sample size requirements that may not be feasible or desirable.

It must be noted that publication policies need not be implemented as “blanket rules” that must apply to all papers submitted to a given journal. The RR policy for instance, is often only one of several options offered to those submitting research to a participating journal. Furthermore, RR is not uniformly implemented across participating journals. Each journal has adapted the general RR framework to suit its objectives and the given research field. We envision similar possibilities for a CET policy in practice. Journals may choose to offer it as one of many options for submissions and will no doubt adapt the policy specifics as needed. For example, in a field where underpowered research is of particular concern, a journal may wish to implement a modified CET publication policy that does include strict sample size/power requirements.

The requirement to specify an equivalence margin prior to collecting data will have the additional benefit of forcing researchers and reviewers to think about what would represent a meaningful effect size before embarking on a given study. While there will no doubt be pressure on researchers to “p-hack” in order to meet either the α1 or α2 threshold, this can be discouraged by insisting that an analysis strictly follows the pre-registered analysis plan. Adopting strict thresholds for significance can also act as a deterrent. Finally, we believe that the CET-policy will improve the reliability of published science by not only allowing for more negative research to be published, but by modifying the incentive structure driving research [145].

Using an optimality model, Higginson and Munafò (2016) [146] conclude that, given current incentives, the rational strategy of a scientist is to “focus almost all of their research effort on underpowered exploratory work [… and] carry out lots of underpowered small studies to maximize their number of publications, even though this means around half will be false positives.” This result is in line with the views of many (e.g. [147149]), and provides the basis for why statistical power in many fields has not improved [150] despite being highlighted as an issue over five decades ago [151]. A CET-based policy may provide the incentive scientists need to pursue higher statistical power. If CET can change the incentives driving research, the reliability of science will be further improved. More research on this question (i.e. “meta-research”) is needed.

Appendix 1

CET operating characteristics and sample size calculations

It is worth noting that there is already a large literature on TOST and non-inferiority testing. See Walker and Nowacki (2011) [62] and Meyners (2012) [63] for overviews that cover the basics as well as more subtle issues. There are also many proposed alternatives to TOST for equivalence testing, what are known as “the emperor’s new tests” [152]. These alternative are offered as marginally more powerful options, yet are more complex and are not widely used. Finally, note that one of the reasons equivalence testing has not been widely used, is that until recently, it has not been included in popular statistical software packages. Packages such as JAMOVI and TOSTER are now available.

Our proposal to systematically incorporate equivalence testing into a two-stage testing procedure has not been extensively pursued elsewhere (one exception may be [76]) and there has not been any discussion of how such a testing procedure could facilitate publication decisions for peer-review journals. One reason for this is a poor understanding of the conditional equivalence testing strategy whereby testing traditional non-equivalence is followed conditionally (if one fails to reject the null), by testing equivalence. In fact, whether or not such a two-stage approach is beneficial has been somewhat controversial. As the sample-size is typically determined based only on the primary test, the power of the secondary equivalence (non-inferiority) test is not controlled, thereby potentially increasing the false discovery rate [153]. Koyama and Westfall (2005) [154] investigate and conclude that, in most situations, such concern is unwarranted. A better understanding of the operating characteristics of CET is required.

What follows is a brief overview of how to calculate the probabilities of obtaining each of the three conclusions (positive, negative, and inconclusive) for CET for two-sample testing of normal data with equal variance (as described in the five steps listed in the text). More in-depth related work includes Shieh (2016) [155] who establishes exact power calculations for TOST for equivalence of normally distributed data; da Silva et al. (2009) [65] who review power and sample size calculations for equivalence testing of binary and time-to-event outcomes; and Zhu (2017) [156] who considers sample size calculations for equivalence testing of poisson and negative binomial rates.

As before, let the true population mean difference be μd = μ1μ2 and the true population variance equal σ2. Let σ*=σ(1/n1+1/n2) and s*=sp(1/n1+1/n2). Fig 1 (right panel) illustrates how each of the three conclusions can be reached based on the values of s* and μ^d=x¯1-x¯2 obtained from the data. For this plot, n = 90 (with n1 = n2), Δ = 0.5, α1 = 0.05 and α2 = 0.10. The black lettered points correspond to the scenarios of Fig 1 (left panel). The interior diamond, ◇, (with corners located at [0, 0], [-Δ/(tα2*/tα1/2*+1),Δ/(tα1/2*+tα2*)],[0,Δ/tα2*],[Δ/(tα2*/tα1/2*+1),Δ/(t1+t2)]) covers the values for which a negative conclusion is obtained. A positive conclusion corresponds to when |x¯1-x¯2| is large and s* is relatively small. Note how Δ and σ impact the conclusion. If the equivalence margin is sufficiently wide, one will have Pr(inconclusive) approach zero as σ approaches zero. Indeed, the ratio of Δ/σ determines, to a large extent, the probability of a negative result. If the equivalence margin is sufficiently narrow and the variance relatively large (i.e. Δ/σ is very small), one will have Pr(negative) ≈ 0.

Let us assume for simplicity that n1 = n2. Then, the sampling distributions of the sample mean and sample variance are well established: μ^dN(μd,σ*2) and (n-2)sp2σ2χn-22. Therefore, given fixed values for μ and σ2, we can calculate the probability of obtaining a positive result, Pr(positive). In Fig 1 (right panel), Pr(positive) equals the probability of μ^d and s* falling into either the left or right “positive” corners and is calculated (as in a usual power calculation for NHST):

Pr(positive;μd,σ)=(1-Fn-2,μdσ*(tα1/2*))+Fn-2,μdσ*(-tα1/2*) (1)

where Fdf,ncp(x) is the cdf of the non-central t distribution with df degrees of freedom and non-centrality parameter ncp.

One can calculate the probability of obtaining a negative result, Pr(negative), as the probability of μ^d and s* falling into the “negative” diamond, ◇. Since μ^d and s* are independent statistics, we can write their joint density as the product of a normal probability density function, fN(), and a chi-squared probability density function, fχ2(). However, the resulting double integral will remain difficult to evaluate algebraically over the boundary, ◇. Therefore, the probability is best approximated numerically, for example, by Monte Carlo integration, as follows:

Pr(negative;μd,σ)=fN(u;μd,σ*2)fχ2(v;n-2)dudv=(Φ(h2(v);μd,σ*)-Φ(h1(v);μd,σ*))fχ2(v;n-2)dvj=1M(Φ(h2(cj);μd,σ*)-Φ(h1(cj);μd,σ*))/M

where Φ() is the normal cdf and Monte Carlo draws from a chi-squared distribution provide cj=σ2q[j]/(n-2), with q[j]χn-22 for j = 1, …, M. The left and right-hand boundaries of the diamond-shaped “negative region”, are defined by h1(cj) = min(0, max(+cj t2 − Δ, −cj t1)) and h2(cj) = max(0, min(−cj t2 + Δ, +cjt1)).

Defining the boundary with h1() and h2() allows for three distinct cases as seen in Fig 1 (right panel):

(1) s*>Δ/tα2*, in which case h1(s*) = h2(s*) = 0;

(2) Δ/(tα1/2*+tα2*)<s*<Δ/tα2*, in which case h1(s*)=-Δ+tα2*s* and h2(s*)=Δ-tα2*s*; and

(3) s*<Δ/(tα1/2*+tα2*), in which case h1(s*)=-tα1/2*s* and h2(s*)=+tα1/2*s*.

When the equivalence boundaries are defined as a function of sp (e.g. Δ = qsp), the calculations are somewhat different; Fig 9 illustrates. In particular, a negative conclusion requires: qs*/1/n1+1/n2>|μ^d±s*tα2*| (i.e. the (1 − 2α2)% C.I. is entirely within [−Δ, Δ]). As such, for a given sample size, it will only be possible to obtain a negative result if q>tα2*1/n1+1/n2. Likewise, an inconclusive result will only be possible if (q/1/n1+1/n2)<(tα1/2*+tα2*).

Fig 9. Rejection regions with the equivalence boundaries are defined as a function of sp.

Fig 9

Let n = 90, Δ = 0.5sp, α1 = 0.05 and α2 = 0.10. The three conclusions, positive, negative, inconclusive correspond to the three areas shaded in green, blue and red respectively.

In order to determine an appropriate sample size for CET, one must replace μd with an a-priori estimate, μ˜d, the “anticipated effect size”, and replace σ2 with an a-priori estimate, σ˜2, the “anticipated variance”. Then one might be interested in calculating six values: the probabilities of obtaining each of the three possible results (positive, negative and inconclusive) under two hypothetical scenarios, (1) where μd = 0, and (2) where μd equal to μ˜d, the value expected given results in the literature. One might also be interested in a hybrid approach whereby one specifies a hybrid null and alternative distribution. Since the objective of any study should be to obtain a conclusive result, sample size could also be calculated with the objective to maximize the likelihood of success, i.e. to minimize Pr(inconclusive).

The quantity:

Pr(success)=1-12Pr(inconclusive;μd=μ˜d,σ˜2)-12Pr(inconclusive;μd=0,σ˜2) (2)

represents the probability of a “successful” study under the assumption that the null (μd = 0) and alternative (with specified μd=μ˜d) are equally likely. (Note that both false positive-, and false negative- studies in this equation are considered “successful”.) This weighted average could be considered a simple version of what is known as “assurance” [157]. Fig 10 shows how consideration of Pr(success) as the criteria for determining sample size attenuates the effect of μ˜d on the required sample size. Suppose one calculates that the required sample size is n = 1,000 based on the desire for 90% statistical power and the belief that σ˜2 = 1 and μd = 0.205, with n1 = n2 as before. This corresponds to a 61% probability of success for Δ = 0.1025 (= 12μd). If the true variance is slightly larger than anticipated, σ2 = 1.25, and the difference in means smaller, μd = 0.123, the actual sample size needed for 90% power is in fact n = 3,476, while the actual sample size needed for Pr(success|Δ, d, s) = 61% is n = 1,770. On the other hand, if the true variance is slightly smaller than anticipated, σ2 = 0.75 and the difference in means greater, μd = 0.33, the actual sample size needed for 90% power is only n = 288, while the actual sample size needed for Pr(success|δ, d, s) = 61% is n = 638. It follows that, if one has little certainty in μd and σ2, calculating the required sample size with consideration of Pr(success) may be less risky.

Fig 10. Required sample size for 90% Power, or Pr(Success) = 61%.

Fig 10

Suppose the anticipated effect size is μ˜d=0.205, the anticipated variance is σ˜2=1 and the equivalence margin is pre-specified with Δ = 0.1025 (= 12μ˜d). Then, based on the desire for Pr(positive) = 90% (i.e. “power” = 0.90) (dashed blue lines) or a 61% “probability of success” (solid red lines) a sample size of n = 1,000 (with n1 = n2) would be required. If the true variance is slightly larger than anticipated, σ2 = 1.25, and the effect size smaller, μd = 0.123, the actual sample size needed for 90% power is in fact 3,476, while the actual sample size needed for Pr(success) = 61% is 1,770; see points “a1” and “a2”. On the other hand, if the true variance is slightly smaller than anticipated, σ2 = 0.75 and the effect size greater, μd = 0.33, the actual sample size needed for 90% power is only 288, while the actual sample size needed for Pr(success|δ, d, s) = 61% is 638; see points “b1” and “b2”.

For related work on statistical power, see Shao et al. (2008) [158] who propose a hybrid Bayesian-frequentist approach to evaluate power for testing both superiority and non-inferiority. Jia and Lynn (2015) [159] discuss a related sample size planning approach that considers both statistical significance and clinical significance. Finally, Jiroutek et al. (2003) [160] advocate that, rather than calculate statistical power (i.e. the probability of rejecting θ = θ0 should the alternative be true), one should calculate the probability that the width of a confidence interval is less than a fixed constant and the null hypothesis is rejected, given that the confidence interval contains the true parameter.

Appendix 2

JZS Bayes Factor details and additional simulation study results

Note that, for the Bayes Factor, the null hypothesis, H0, corresponds to μd = 0; and the alternative, H1, corresponds to μd ≠ 0. The JZS testing scheme involves placing a normal prior on η = (μ2μ1)/σ, ηNormal(0,ση2), and for the hyper-parameter ση, placing an inverse chi-squared prior, ση2inv.χ2(1). Integrating out ση shows that this is equivalent to having a Cauchy prior, ηCauchy. We can write the JZS Bayes Factor in terms of the standard t-statistic, T=(x¯1-x¯2)/s*, with n* = n1 n2/(n1 + n2), and v = n1 + n2 − 2, as follows:

B01=(1+T2v)-(v+1)/20(1+n*g)-1/2(1+T2(1+n*g)v)-(v+1)/2(2π)-1/2g-3/2e-1/(2g)dg (3)

See Rouder et al. (2009) [93] for additional details. Figs 47 plot results from the simulation study. Note that while we resorted to simulation, since the B01 is a function of the t statistic, it would be possible to use numerical methods to compute its operating characteristics.

Supporting information

S1 File. R-code is available to reproduce all figures within this paper on the Open Science Framework (OSF) repository under the “Conditional Equivalence Testing” project (Identifiers: DOI 10.17605/OSF.IO/UXN2Q | ARK c7605/osf.io/uxn2q).

(ZIP)

Acknowledgments

We gratefully acknowledge support from Natural Sciences and Engineering Research Council of Canada. We also wish to thank Drs. John Petkau and Will Welch for their valuable feedback.

Data Availability

There is no original data discussed in this manuscript. Simulated data was generated to produce a number of the figures presented. We have moved all of our code that can reproduce all of our results and figures to a publicly available repository on the Open Science Framework (Identifiers: DOI 10.17605/OSF.IO/UXN2Q | ARK c7605/osf.io/uxn2q).

Funding Statement

This research was funded by an award from the Natural Sciences and Engineering Research Council of Canada (NSERC), http://www.nserc-crsng.gc.ca/, RGPIN 183772-13. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

S1 File. R-code is available to reproduce all figures within this paper on the Open Science Framework (OSF) repository under the “Conditional Equivalence Testing” project (Identifiers: DOI 10.17605/OSF.IO/UXN2Q | ARK c7605/osf.io/uxn2q).

(ZIP)

Data Availability Statement

There is no original data discussed in this manuscript. Simulated data was generated to produce a number of the figures presented. We have moved all of our code that can reproduce all of our results and figures to a publicly available repository on the Open Science Framework (Identifiers: DOI 10.17605/OSF.IO/UXN2Q | ARK c7605/osf.io/uxn2q).


Articles from PLoS ONE are provided here courtesy of PLOS

RESOURCES