Abstract
Bayesian hypothesis testing procedures have gained increased acceptance in recent years. A key advantage that Bayesian tests have over classical testing procedures is their potential to quantify information in support of true null hypotheses. Ironically, default implementations of Bayesian tests prevent the accumulation of strong evidence in favor of true null hypotheses because associated default alternative hypotheses assign a high probability to data that are most consistent with a null effect. We propose the use of “non-local” alternative hypotheses to resolve this paradox. The resulting class of Bayesian hypothesis tests permits more rapid accumulation of evidence in favor of both true null hypotheses and alternative hypotheses that are compatible with standardized effect sizes of most interest in psychology.
Keywords: Bayes factor, Bayesian hypothesis test, non-local prior, null hypothesis significance test, sequential Bayes factor, sequential test
Introduction
Innovative statistical methods to evaluate the plausibility of scientific theories have attracted increased attention over the last decade. This attention has resulted in renewed interest in Bayesian methods for assessing evidence (e.g., Rouder et al., 2009), and several novel approaches to sequential testing procedures have recently been proposed (Schönbrodt et al., 2017; Schnuerch and Erdfelder, 2020; Pramanik et al., 2021). As Schönbrodt et al. (2017) point out, each of these sequential testing methods can be motivated from a Bayesian perspective towards testing.
Rouder et al. (2009) provide a useful summary of Bayesian methodology and, through a series of examples, argue that “advances in science often come from identifying invariances,” or “statements of equality, sameness, or lack of association.” As examples, they cite interest in determining “whether cognitive skills vary with gender”; whether subliminal priming occurs; whether detectability of a “briefly flashed stimulus” is invariant to the ratio of the intensity of the stimulus to background, as predicted by the Weber-Fechner law (Fechner, 1966); and whether the exponent in the power function of intensity used to predict sensation is constant for a given intensity variable (Stevens, 1957; Augustin, 2008). To identify invariances, hypothesis testing procedures must permit accumulation of evidence in support of both null and alternative hypotheses (see also Kass and Raftery (1995); Dienes (2011); Etz and Vandekerckhove (2018)). In this regard, Bayesian testing procedures differ from classical testing procedures, in which one can only fail to reject the null hypothesis (e.g., Cover et al., 2012), by allowing researchers to quantify evidence in favor of true null hypotheses, which can reflect the presence of an invariance or lack of an effect.
In the Bayesian paradigm, the posterior odds in favor of an alternative hypothesis H1, based on data x, can be expressed as the product of the Bayes factor and the prior odds in favor of H1; that is
P(H1 | x) / P(H0 | x) = [m1(x) / m0(x)] × [P(H1) / P(H0)],   (1)
or
posterior odds of H1 = Bayes factor × prior odds of H1.   (2)
It is important to note that this equation can be interpreted from both a frequentist and a subjective view of probability. From the frequentist perspective, all probabilities can be interpreted as the limiting proportion of the occurrence of an event. That is, if the null hypothesis H0 is repeatedly sampled with probability P(H0) (or H1 with probability P(H1) = 1 − P(H0)), and data x are generated according to m0(x) (or m1(x)), then the posterior probability that the data were generated under H1, for a given x, converges in probability to
BF10(x) P(H1) / [BF10(x) P(H1) + P(H0)],   (3)
where BF10(x) = m1(x)/m0(x) is the Bayes factor in favor of H1.
When Bayesian methods are applied to Null Hypothesis Significance Tests (NHSTs), controversy arises in the “subjective” specification of two quantities in these equations. First, the prior odds in favor of H1 must be specified. This specification is equivalent to specifying either the prior probability of the alternative hypothesis, P(H1), or the prior probability of the null hypothesis, P(H0), since P(H0) + P(H1) = 1. A simple approach to setting the prior odds is to assume P(H0) = P(H1) = 0.5, leading to prior odds of 1.0. However, recent evidence gleaned from analyses of replicated experiments suggests that the prior odds in favor of the alternative hypotheses studied in psychology and other social sciences might be closer to 1/9 (Dreber et al., 2015; Johnson et al., 2017). Although it is necessary to set a value of the prior odds in order to calculate the posterior odds, evaluation of the prior odds is not considered further here. Instead, we encourage researchers to perform their own sensitivity analyses to evaluate how various assumptions regarding the prior odds affect the posterior odds for a given Bayes factor.
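For readers who wish to carry out such a sensitivity analysis, the conversion from a Bayes factor and an assumed prior odds to a posterior probability is a one-line calculation. The following R code is a minimal illustrative sketch (the function name is ours, not part of any package):

```r
# Posterior probability of H1 implied by a Bayes factor and assumed prior odds;
# a minimal sketch for prior-odds sensitivity analyses
posterior_prob_h1 <- function(bf10, prior_odds = 1) {
  post_odds <- bf10 * prior_odds
  post_odds / (1 + post_odds)
}

posterior_prob_h1(bf10 = 20, prior_odds = 1)      # approx 0.95
posterior_prob_h1(bf10 = 20, prior_odds = 1 / 9)  # approx 0.69
```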
The second point of controversy arises in the definition of the marginal density of the data under the alternative hypothesis, given by
m1(x) = ∫ f(x | θ) π1(θ) dθ.   (4)
Here π1(θ) represents the prior density for the parameter of interest θ under the alternative hypothesis, i.e., the alternative prior density. (A more detailed description of the Bayesian hypothesis testing framework may be found in, for example, Jeffreys (1961) or Kass and Raftery (1995).) In NHSTs, the quantity m0(x) simply represents the sampling density of the data, say f(x|θ0), evaluated at the parameter value that defines the null hypothesis, θ0.
The two-sample t test provides a useful context to discuss the specification of the alternative prior density, π1(θ). In this setting, the parameter of interest is usually defined to be either the difference in population means, μ2 − μ1, or the standardized difference in population means δ = (μ2 − μ1)/σ. The former is called the effect size, while the latter is called the standardized effect size. Sample estimates of δ are called Cohen's d (Cohen, 1988). (Explicit modeling assumptions on x, μ1, μ2, σ2 and δ are provided in the next section.) For purposes of the present discussion, we assume that the null hypothesis requires that δ = 0, and that under the alternative hypothesis δ ≠ 0. In the absence of prior information regarding the value of δ, a common default choice for the alternative prior density on δ is a Cauchy distribution. The Cauchy distribution is a unimodal density that takes its maximum value at 0 and has heavy tails that assign significant mass to large values of the standardized effect size (i.e., |δ| > 1). When the observational variance σ2 is unknown, a default prior density on σ2 is the Jeffreys (or non-informative) prior, given by p(σ2) ∝ 1/σ2. Although improper, this prior has attractive theoretical properties, provided that it is used as the prior model for the variance under both the null and alternative hypotheses.
If the Jeffreys prior is assumed for the observational variance σ2 and a Cauchy prior is assumed for δ, then the resulting prior on μ2 − μ1 is called the JZS prior, in deference to Jeffreys, Zellner and Siow (Jeffreys, 1961; Zellner and Siow, 1980). It is the default prior recommended in Rouder et al. (2009) for one- and two-sample t tests and by Schönbrodt et al. (2017) in their definition of a sequential Bayes factor (SBF). The JZS prior is an example of a local alternative prior, or a prior density that is positive at parameter values that are consistent with the null hypothesis.
Intrinsic priors (e.g., Berger and Pericchi, 1996) are another class of commonly used default priors in Bayesian testing of a normal mean. The operating characteristics of Bayes factors obtained using intrinsic priors and other default local priors are similar to those obtained using the JZS prior. For brevity, we therefore do not consider them separately here.
The focus of this article is the description of a new approach to specifying alternative hypotheses in Bayesian tests of a normal mean or difference between means. The approach is based on the use of non-local alternative prior densities (NAPs; Johnson and Rossell (2010)). A NAP is a density that exactly equals 0 at parameter values that are consistent with the null hypothesis. For the two-sample t test, this means that the prior density on δ is identically 0 when δ = 0. As we demonstrate below, tests specified with NAPs offer several advantages over tests defined with alternative hypotheses based on local priors. These include the following:
Stronger evidence for true null hypotheses. Because local alternative priors, like the JZS prior, assign high prior probability to parameter values that are consistent with the null hypothesis, data that support the null hypothesis also provide support to the alternative hypothesis. This makes it difficult to obtain evidence that favors a true null hypothesis. We note that accumulating evidence for true null hypotheses is often cited as a primary rationale for the use of Bayes factors in hypothesis testing (e.g., Rouder et al. (2009)). Ironically, local alternative priors are particularly ill-suited for this task (see, for example, Fig. 2). These properties of Bayes factors are discussed below for tests in which sample sizes are fixed at the beginning of a study and for sequential tests.
Comparable or stronger evidence for true alternative hypotheses. Because NAPs assign negligible probability to parameter values that are consistent with the null hypothesis, they are able to assign more prior probability to parameter values that support the alternative hypothesis. When data support the alternative hypothesis, the marginal density for the data under the alternative thus tends to be higher than it is with a local alternative prior, which increases the Bayes factor in favor of the alternative hypothesis. This is especially true when the NAP assigns high prior probability to standardized effect sizes that are common in the psychological and social sciences.
Smaller Average Sample Number (ASN) in sequential tests. Because NAPs tend to provide more evidence in favor of true null hypotheses and comparable evidence in favor of true alternative hypotheses, sequential tests based on them are likely to reach termination thresholds more quickly. This means that sequential tests based on NAPs often require fewer subjects to make a decision.
Logical consistency. In a properly specified hypothesis test, null and alternative hypotheses are mutually exclusive. That is, if the alternative hypothesis is true, then the null hypothesis cannot be. Despite this truism, local alternative priors assign prior mass to neighborhoods of parameter values that are consistent with the null hypothesis. Indeed, in many cases their densities take their maximum value at the parameter value that defines the null hypothesis. In this regard, the use of NAPs more accurately reflects the prior belief, under the alternative hypothesis, that the tested parameter does not equal a value specified under the null hypothesis.
Figure 2.

Average weight of evidence against alternative hypotheses when the null hypothesis is true. Curves depicted in the plot correspond to normal moment priors with modes at ±0.3 and ±0.5; JZS priors with scales √2/2 and 1; and a composite alternative hypothesis that places one-half mass at ±0.3σ. The horizontal axis is displayed on the logarithmic scale because of the large differences in sample sizes required by the different methods to obtain, on average, strong or very strong weight of evidence against each alternative hypothesis. The JZS priors do not, on average, yield very strong weight of evidence until sample sizes exceed 40,000.
With regard to the last item, proponents of local alternative prior densities (e.g., Jeffreys (1961); Rouder et al. (2009); Berger and Pericchi (1996)) might argue that local alternative priors like the JZS prior reflect a belief that the true parameter value is “close” to the null value. That is, the fact that a hypothesis test is being conducted at all suggests that the tested effect size must not be too large. Thus, it is reasonable for the prior density for the alternative model to take its maximum at the null value. This was the argument originally posited by Jeffreys (1961) in proposing a Cauchy prior for the unknown mean of a normal population.
We have two objections to this perspective. First, as we demonstrate below it is generally not feasible to detect small standardized effect sizes without very large sample sizes. As a consequence, investigators who wish to detect small effects are compelled to design studies with large sample sizes. If such studies are planned, investigators can also specify non-local alternative prior densities that are appropriately scaled to detect the targeted effects. The resulting NAPs are sharply spiked at small effect sizes, which increases the Bayes factor in favor of the alternative hypothesis when it is supported by data. The use of appropriately scaled NAPs can thus lead to savings in sample size and experimental cost.
Second, point null hypotheses are often used to approximate a belief that a standardized effect size is small. When this is the case, the use of local alternative prior densities is even more problematic because they then concentrate prior probability on a range of parameter values that are consistent with the null hypothesis.
The simplest NAP densities are simple alternative hypotheses. For example, in a one-sided test of whether a standardized effect size is zero, a simple alternative hypothesis might be H1 : δ = 0.3σ. We demonstrate below that simple alternative hypotheses make it easy to collect evidence in favor of both true null and true alternative hypotheses, but that they can lack power in detecting true alternative hypotheses defined by other parameter values (e.g., δ = 0.15σ). For this reason, we describe a class of continuous NAPs called normal moment densities that are strictly positive at all non-null parameter values.
The rest of the article is organized as follows. In the next section, we describe the class of normal moment densities that can be used to define alternative hypotheses for standardized effect sizes in one- and two-sample t and z tests. Unlike simple alternative hypotheses, tests constructed with these alternative prior densities provide support for a range of true alternative hypotheses. They also permit rapid accumulation of evidence in favor of true null hypotheses. Conveniently, these densities lead to explicit expressions for Bayes factors in one- and two-sample z and t tests. In the next two sections we compare the empirical properties of tests defined with NAPs to tests defined with default, objective local alternative priors in fixed and sequential design settings. The third section examines tests in which the sample size is fixed prior to analyses of data. We refer to such tests as fixed design tests. The fourth section examines the performance of NAP-based tests in sequential designs. We conclude with a discussion of results.
Non-local alternative prior densities
NAPs are probability density functions that take the value 0 at parameter values that are consistent with the null hypothesis (Johnson and Rossell, 2010). A simple example of a NAP for the test of a normal mean can be described as follows.
Suppose x = (x1, …, xn) are independent and identically distributed (iid) Gaussian random variables with mean μ and known variance σ2, i.e., xi ~ N(μ, σ2) for i = 1, …, n. Suppose we wish to test the null hypothesis H0 : μ = 0 against a two-sided alternative H1 : μ ≠ 0. Let ϕ(a | m, c2) denote the normal density function evaluated at a with mean m and variance c2. A NAP density that can be used to define an alternative hypothesis for this test is the normal moment prior density, which can be expressed as
π1(μ) = [μ² / (τ²σ²)] ϕ(μ | 0, τ²σ²),  −∞ < μ < ∞.   (5)
A plot of this density for τ² = 0.3²/2 = 0.045 and σ² = 1 is provided in Fig. 1. For comparison, the dashed curve in this plot depicts the Cauchy density with scale parameter √2/2, a local default alternative prior density that is often used to define the alternative hypothesis for this test (see, for example, Morey and Rouder (2015); Schönbrodt et al. (2017); Stefan et al. (2021)). We denote the distribution associated with a generic normal moment density (5) by NM(0, τ²σ²). The density depicted in Fig. 1 has modes at ±τσ√2. For τ² = 0.3²/2, the modes of the density occur at ±0.3σ, or at standardized effect sizes of ±0.3. The area of the shaded region in Fig. 1, which represents the prior probability assigned to standardized effect sizes between (−0.8, −0.2) and (0.2, 0.8), is 0.825. That is, this normal moment prior assigns approximately 83% of its prior probability between "small" (±0.2) and "large" (±0.8) effect sizes (Cohen, 1988). These values approximately match the median and interquartile ranges of non-null standardized effect sizes reported in meta-analyses in psychology (Bakker et al., 2012; Anderson et al., 1999; Hall, 1998; Lipsey and Wilson, 1993; Meyer et al., 2001; Richard et al., 2003; Tett et al., 1994).
Figure 1.

Normal moment prior. This is an example of a NAP that can be used to define the alternative hypothesis in a test for a normal mean. The shaded area in the figure depicts the prior probability assigned to standardized effect sizes having magnitude between 0.2 and 0.8.
The NAP density depicted in Fig. 1 assigns only 17.2% of its prior mass to standardized effect sizes less than 0.2 in magnitude, and is identically 0 when the standardized effect size is 0. In contrast, the Cauchy density depicted in this figure assigns about 36.3% of its prior probability to effect sizes that fall in the range "small" to "large," and assigns 17.5% of its probability to effect sizes that are less than 0.2 in magnitude. Under the Cauchy prior, the probability assigned to standardized effect sizes greater than 1 in magnitude is 0.392, and the mode of the density is 0, the parameter value that defines the null hypothesis.
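These prior probabilities are easy to verify numerically. The following R sketch (illustrative only) evaluates the default normal moment density and the Cauchy density with scale √2/2, assuming σ = 1 so that μ and δ coincide:

```r
# Default NAP (normal moment) density on the standardized effect size,
# with tau^2 = 0.045 and sigma = 1
nap_density <- function(delta, tau2 = 0.045)
  delta^2 / tau2 * dnorm(delta, 0, sqrt(tau2))

# Prior mass the NAP assigns to "small"-to-"large" effects and to |delta| < 0.2
2 * integrate(nap_density, 0.2, 0.8)$value    # approx 0.825
2 * integrate(nap_density, 0.0, 0.2)$value    # approx 0.172

# Corresponding probabilities for the Cauchy prior with scale sqrt(2)/2
s <- sqrt(2) / 2
2 * (pcauchy(0.8, scale = s) - pcauchy(0.2, scale = s))  # approx 0.363
2 * (pcauchy(0.2, scale = s) - 0.5)                      # approx 0.175
2 * (1 - pcauchy(1, scale = s))                          # approx 0.392
```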
Throughout the remainder of this paper we define the prior depicted in Fig. 1 as the default NAP prior for defining the alternative hypothesis in testing whether the mean of a normal sample, or difference between means of normal samples, is 0. When the observational variance is unknown, we assume the Jeffreys’ prior for σ2 under both null and alternative hypotheses.
From a computational perspective, an advantage of the normal moment alternative prior density in normal models is that it results in closed form expressions for the Bayes factors in both one- and two-sided tests. In contrast, Bayes factors based on the JZS and other default priors (e.g., intrinsic priors) do not have closed-form expressions and so must be computed using numerical integration routines. For one-sided tests, we use the positive half of the density to define the alternative hypothesis. That is, for testing H1 : μ > 0 we define
π1+(μ) = 2 [μ² / (τ²σ²)] ϕ(μ | 0, τ²σ²) for μ > 0, and π1+(μ) = 0 otherwise.   (6)
A similar definition is used to test H1 : μ < 0. We denote the distribution corresponding to this density as NM+(0, τ2σ2) (or NM−(0, τ2σ2) for μ < 0).
We now define the specific assumptions used to perform one and two sample tests for normal means against a two-sided alternative hypothesis, both when the variance is known and unknown. We also provide explicit expressions for the resulting Bayes factors. Expressions for one-sided tests are provided in the supplemental material.
For tests conducted in the psychological sciences with small to moderate sample sizes, and for which no specific prior information regarding the magnitude of standardized effect size is available, we recommend a default value of τ2 = 0.045.
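Because the modes of the NM(0, τ²σ²) density fall at ±τσ√2, a researcher who anticipates standardized effect sizes near ±d can simply set τ² = d²/2. A minimal sketch of this choice:

```r
# tau^2 that places the normal moment prior modes at standardized
# effect sizes of +/- d (the modes fall at +/- tau * sqrt(2) when sigma = 1)
tau2_for_mode <- function(d) d^2 / 2
tau2_for_mode(0.3)   # 0.045, the recommended default
tau2_for_mode(0.5)   # 0.125, the prior with modes at +/- 0.5
```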
1. One-sample, known variance test.
Suppose x = (x1, …, xn) denote iid N(μ, σ2) random variables and that σ2 is known. The Bayes factor of the test H1 : μ ~ NM(0, τ2σ2) versus H0 : μ = 0 is given by
BF10(x) = (1 + nτ²)^(−3/2) (1 + rZ²) exp(rZ²/2),   (7)
where
r = nτ²/(1 + nτ²)   and   Z = √n x̄ / σ.   (8)
Here, Z is the test statistic used in the frequentist z test.
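The following R code provides an illustrative sketch of this test, evaluating the closed form in (7) and checking it against brute-force numerical integration of the marginal density of the sample mean under H1 (assuming σ = 1 and the default τ² = 0.045):

```r
# Bayes factor for H1: mu ~ NM(0, tau^2 sigma^2) vs H0: mu = 0, known sigma
bf10_nap_z <- function(xbar, n, sigma = 1, tau2 = 0.045) {
  Z <- sqrt(n) * xbar / sigma        # frequentist z statistic
  r <- n * tau2 / (1 + n * tau2)
  (1 + n * tau2)^(-3 / 2) * (1 + r * Z^2) * exp(r * Z^2 / 2)
}

# The same Bayes factor by numerical integration over mu
bf10_nap_z_numeric <- function(xbar, n, sigma = 1, tau2 = 0.045) {
  m1 <- integrate(function(mu)
          dnorm(xbar, mu, sigma / sqrt(n)) *
          mu^2 / (tau2 * sigma^2) * dnorm(mu, 0, sqrt(tau2) * sigma),
        -Inf, Inf)$value
  m0 <- dnorm(xbar, 0, sigma / sqrt(n))
  m1 / m0
}

bf10_nap_z(xbar = 0.25, n = 100)           # approx 6.1
bf10_nap_z_numeric(xbar = 0.25, n = 100)   # agrees with the closed form
```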
2. One-sample, unknown variance test.
Suppose the conditions in [1] hold, except that σ2 is unknown. Suppose further that the Jeffreys’ prior density for σ2, proportional to 1/σ2, is assumed under both hypotheses. Then the Bayes factor in favor of the alternative hypothesis can be expressed as
| (9) |
where r and the remaining quantities are defined as in (8), and
| (10) |
| (11) |
Here, T is the test statistic used in the frequentist t test.
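This Bayes factor can also be obtained by direct numerical integration under the stated model assumptions: a normal moment prior on μ given σ2 and the Jeffreys prior on σ2 under both hypotheses. The R sketch below is a transparent but unoptimized illustration of that calculation:

```r
# One-sample t-test Bayes factor with a NM(0, tau^2 sigma^2) alternative
# and Jeffreys prior 1/sigma^2 under both hypotheses, by brute force
bf10_nap_t <- function(x, tau2 = 0.045) {
  n    <- length(x)
  xbar <- mean(x)
  S    <- sum((x - xbar)^2)
  scl  <- (2 * pi * S / n)^(-n / 2) * exp(-n / 2)  # rescaling constant; cancels in the ratio

  # likelihood kernel, rescaled by scl to keep the integrands O(1)
  lik <- function(mu, sig2)
    (2 * pi * sig2)^(-n / 2) * exp(-(S + n * (xbar - mu)^2) / (2 * sig2)) / scl

  m1_given_sig2 <- function(sig2) {
    # integrate mu against the normal moment prior NM(0, tau2 * sig2)
    integrate(function(mu) lik(mu, sig2) * mu^2 / (tau2 * sig2) *
                dnorm(mu, 0, sqrt(tau2 * sig2)), -Inf, Inf)$value
  }

  # integrate sigma^2 against the (improper) Jeffreys prior; its constant cancels
  m1 <- integrate(Vectorize(function(s2) m1_given_sig2(s2) / s2), 0, Inf)$value
  m0 <- integrate(function(s2) lik(0, s2) / s2, 0, Inf)$value
  m1 / m0
}

set.seed(1)
x <- rnorm(50)         # data simulated under a true null
log(bf10_nap_t(x))     # weight of evidence; negative values favor H0
```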
3. Two-sample, known variance test.
Suppose x1 = (x1,1, …, x1,n) denote iid N(μ1, σ2) random variables, x2 = (x2,1, …, x2,m) denote iid N(μ2, σ2) random variables, that x1 and x2 are independent of each other, and that σ2 is known. We assume that the prior density for μ1 is uniformly distributed on an interval (−a, a) for a large value of a under both hypotheses. Then the Bayes factor for the test H1 : μ2−μ1 ~ NM(0, τ2σ2) versus H0 : μ2 − μ1 = 0 can be expressed as
| (12) |
where for i = 1, 2,
| (13) |
| (14) |
The value Z is the test statistic in the classical z test. We note that the labeling of samples is arbitrary and the marginal prior density on μ2 is also approximately uniform on (−a, a). The Bayes factor is obtained by taking the limit a → ∞.
4. Two-sample, unknown variance test.
Suppose the conditions in [3] hold, except now that σ2 is unknown. Suppose further that the Jeffreys’ prior density for σ2 is assumed under both hypotheses. Then the Bayes factor in favor of the alternative hypothesis can be expressed as
| (15) |
where r, n, m, and the remaining quantities are defined in (13)–(14), and
| (16) |
| (17) |
Here T is the test statistic in the classical t test. As in [3] the labeling of samples is arbitrary, and the Bayes factor is obtained by taking the limit a → ∞.
Fixed design tests
Classical tests of a normal mean parameter are most commonly based on z or t tests. These tests are designed to control Type I (α) and Type II (β) error probabilities at pre-specified levels. A key disadvantage of these tests is that they do not quantify evidence in favor of true null hypotheses. Instead, they may simply “fail to reject” the null hypothesis. Psychology and other social science researchers often have a need to quantify evidence in favor of true null hypotheses (for example, Rouder et al., 2009). Bayes factors provide such a measure.
“Weight of evidence” as a measure of evidence
To summarize the performance of various Bayesian tests, we adopt the measurement scale for evidence based on the natural logarithm of the Bayes factors, ln(BF10). This quantity, called the “weight of evidence”, has the advantage of being on the same scale as the classical likelihood ratio statistic (Jeffreys, 1961; Kass and Raftery, 1995).3 Because −ln(x) = ln(1/x), the weight of evidence in favor of the alternative hypothesis is equal to the negative of the weight of evidence in favor of the null hypothesis (and vice versa). Descriptors for the weight of evidence were proposed by Kass and Raftery (1995) and Jeffreys (1961). Under the former, weight of evidence between 0 and 1 in magnitude is considered “not worth more than a bare mention”; weight of evidence between 1 and 3 is considered “positive”; weight of evidence between 3 and 5 is “strong”, and above 5 is labeled as “very strong”. At the border between positive and strong (3), the corresponding Bayes factor is about 20, and at the border between strong and very strong, the Bayes factor is about 150. Strong and very strong weights of evidence in favor of the null hypothesis are −3 and −5, or Bayes factors of approximately 1/20 and 1/150.
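The following short R snippet (a convenience function of our own, not part of any package) maps a weight of evidence onto these descriptors and shows the Bayes factors at the category boundaries:

```r
# Kass and Raftery (1995) descriptor for a weight of evidence ln(BF10)
woe_label <- function(woe) {
  a <- abs(woe)
  lab <- if (a < 1) "not worth more than a bare mention" else
         if (a < 3) "positive" else if (a < 5) "strong" else "very strong"
  paste(lab, "for", if (woe >= 0) "H1" else "H0")
}

woe_label(3.5)     # "strong for H1"
woe_label(-2.81)   # "positive for H0"
exp(c(3, 5))       # Bayes factors at the strong / very strong boundaries
```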
Bayes factors must be multiplied by the prior odds that the alternative hypothesis is true to determine the posterior odds. If the prior odds are 1 (that is, P(H0) = P(H1) = 0.5), then weight of evidence equal to 3 implies a Bayes factor and posterior odds of about 20, and posterior probability of the alternative hypothesis equal to 0.95. Similarly, weight of evidence of −5 implies a Bayes factor and posterior odds of about 1/150, and posterior probability of the null hypothesis equal to 1 − 0.0066 = 0.9934. This probability is very close to 1.0, but it is predicated on the assumption that the prior odds are 1.0.
Recent evidence from replication of experiments in psychology and social sciences suggests that the prior probability of a null hypothesis examined in these fields is likely between 0.80 and 0.95 (Dreber et al., 2015; Open Science Collaboration, 2015; Johnson et al., 2017). If P(H0) = 0.9, then weight of evidence equal to 3 implies that the posterior probability of the alternative hypothesis is only 0.69, while weight of evidence equal to 5 implies that the posterior probability of the alternative hypothesis is 0.94.
Performance comparison
With the background from the above section in place, we now consider the average weight of evidence that is obtained from a two-sided, one-sample t test that a normal mean is equal to 0 when the true mean is 0. We assume the conditions of test [2] above hold. Operating characteristics for two-sided z tests and two-sided, two-sample t tests are very similar to those obtained for the two-sided one-sample t test. Corresponding results for these tests are provided in the supplemental materials. The R package BayesFactor (Morey et al., 2018) was used to compute the Average Sample Number (ASN) for the JZS alternatives.
True null hypothesis.
Fig. 2 displays the average weight of evidence obtained under several alternative hypotheses when the null hypothesis of no effect is true. These curves were based on simulating one-million standard normal random deviates at each sample size. The alternative hypotheses considered in this plot include the following:
The default NAP (normal moment) prior with τ² = 0.3²/2 = 0.045 (modes at ±0.3),
The default JZS prior based on a Cauchy with scale √2/2,
A normal moment prior with τ² = 0.5²/2 = 0.125 (modes at ±0.5),
The JZS prior based on a Cauchy with scale r = 1,
A composite alternative hypothesis that assigns 1/2 mass to standardized effect sizes of ±0.3. The mass of a simple hypothesis is split between ±0.3 to reflect the two-sided specification of the test. Approximate Bayes factors for this test were computed as the ratio of non-central and central t distributions evaluated at simulated t statistics.
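The Bayes factor for the composite alternative in the last item is simply the ratio of an equal mixture of non-central t densities (non-centrality parameters ±0.3√n) to the central t density, evaluated at the observed t statistic. The following R code is an illustrative reconstruction of that calculation on a much smaller scale than the simulations reported here:

```r
# Approximate ln Bayes factor for the two-point composite alternative
# (one-half mass at standardized effect sizes +/- d) in a one-sample t test
composite_lnbf <- function(x, d = 0.3) {
  n   <- length(x)
  tt  <- sqrt(n) * mean(x) / sd(x)       # one-sample t statistic
  ncp <- d * sqrt(n)
  m1  <- 0.5 * dt(tt, df = n - 1, ncp = ncp) +
         0.5 * dt(tt, df = n - 1, ncp = -ncp)
  log(m1 / dt(tt, df = n - 1))
}

# Average weight of evidence under a true null (cf. Fig. 2), small scale
set.seed(1)
mean(replicate(2000, composite_lnbf(rnorm(100))))   # negative, i.e., favors H0
```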
Fig. 2 illustrates a critical deficiency of the JZS priors (and related local priors): The use of such priors to define the alternative hypothesis makes it difficult to obtain "very strong" weight of evidence in favor of a true null hypothesis. For two-sided t tests, the default JZS prior requires about 80,000 subjects, on average, to obtain very strong weight of evidence in favor of a true null hypothesis, and the JZS prior with r = 1 requires about 40,000 subjects. In contrast, the NAP priors with modes at ±0.3 and ±0.5 require about 1,200 and 300 subjects, on average, for the same purpose.
Obtaining even strong weight of evidence in favor of a true null hypothesis is difficult when standard JZS priors are used to define the alternative hypothesis. On average, 1,400 subjects are required to obtain strong weight of evidence when the default JZS prior is used, and on average 750 subjects are needed when the JZS prior with scale r = 1 is used to define the alternative hypothesis. In contrast, alternative hypotheses defined with NAP priors require about 300 subjects at the default scaling, and 110 subjects if the prior mode is set to 0.5.
Like the continuous NAP priors, the composite hypothesis that places one-half point mass at ±0.3σ is also able to quickly obtain evidence in favor of a true null hypothesis. Indeed, because this nonlocal composite hypothesis places no prior mass in the interval (−0.3, 0.3), it accumulates evidence in favor of a true null faster than the normal moment priors do.
True alternative hypotheses.
What is the cost of using NAP priors to detect true alternative hypotheses? As it turns out, not too much for normal moment alternative prior specifications, but more for the two-point composite alternative hypothesis. Fig. 3 shows the average weights of evidence obtained under these prior specifications for a range of sample sizes in fixed-design tests as a function of the true standardized effect size. It shows that the NAP priors (based on normal moment priors) achieve strong or very strong weight of evidence in favor of the alternative hypothesis for smaller standardized effect sizes than the JZS priors do. Alternative hypotheses defined with the JZS priors provide, on average, higher weights of evidence for larger standardized effect sizes, but this additional evidence tends to occur when the evidence for the alternative hypothesis provided by the NAP priors is also very strong. For sample sizes greater than about 40 and standardized effect sizes between about 0.10 and 0.65, the default NAP prior produces, on average, higher weight of evidence against the null hypothesis than do default JZS priors.
Figure 3.

Weight of evidence for true alternative hypotheses. Curves depicted in the plots denote the average weight of evidence versus true effect size when the alternative hypothesis was defined by various NAP and JZS densities.
The properties of tests defined using the two-point composite hypothesis are more ambiguous. For standardized effect sizes within (−0.15σ, 0.15σ) (or 1/2 the magnitude of the simple alternatives that comprise the composite alternative), tests based on the composite hypothesis provide, on average, support for a false null hypothesis (that is, a negative weight of evidence). For sample sizes of 200 and 400, the average weight of evidence in favor of the false null hypothesis can even be very strong for smaller effect sizes. This phenomenon is not unexpected, however, because the null hypothesis in these cases is "closer" to the data-generating parameter than the composite alternative is. The composite alternative hypothesis also provides, on average, substantially less weight of evidence for large standardized effect sizes. This happens because the composite alternative hypothesis assigns no prior probability to standardized effect sizes greater than 0.3σ in magnitude.
Local priors, like the JZS prior, provide more support for very small standardized effect sizes. However, strong evidence in favor of very small standardized effect sizes can only be obtained with very large sample sizes. When the sample size is 500 and the standardized effect size is less than 0.10, all four of the Bayes factors based on alternative hypothesis defined by the JZS and NAP priors in Fig. 4(a) yield average weights of evidence that are negative, thus favoring the null hypothesis of no effect. Indeed, for standardized effect sizes less than about 0.045, use of the default NAP prior provides, on average, “strong” support for the null hypothesis, and when the standardized effect size is less than about 0.023 the NAP prior with mode at 0.5 provides “very strong” support for the null hypothesis. This misleading performance of the NAP priors for true standardized effect sizes less than 0.05 persists, and even degrades, for sample sizes up to 4,000. When the sample size is 1,000 (Fig. 4(b)), the default NAP prior and the JZS priors begin to show positive support (i.e., log(BF10(x)) > 1) for standardized effect sizes greater than about 0.09. None of the alternative models depicted in Fig. 4(b) provide, on average, strong support for the alternative hypothesis for any standardized effect size less than 0.1. If the sample size is increased to 2,000 (Fig. 4(c)), then the JZS priors and default NAP prior provide, on average, strong evidence for standardized effect sizes greater than about 0.08, and positive evidence for effect sizes greater than about 0.065 (JZS) or 0.07 (default NAP). Increasing the sample size to 4,000 (Fig. 4(d)) yields a similar pattern, except that “very strong” weight of evidence is obtained, on average, for standardized effect sizes greater than about 0.07 if the default NAP or JZS priors are used to define the alternative hypothesis.
Figure 4.

Weight of evidence for true alternative hypotheses with very small effect sizes. Curves depicted in the plots denote the average weight of evidence versus true effect size when the alternative hypothesis was defined by various NAP and JZS prior densities. Dashed lines at ±3 provide boundaries for strong support of the alternative hypothesis (> 3) or null hypothesis (< −3).
The conclusions from Figs. 2–4 might be simply stated as follows. Alternative hypotheses defined with NAP priors can provide strong or very strong weight of evidence in favor of true null hypotheses for small or moderate sample sizes (i.e., < 400). In many practical settings (i.e., n < 2000), JZS or other local priors cannot. For small to medium standardized effects (i.e., in the range 0.2–0.5), alternatives defined with the default NAP prior provide, on average, slightly higher weight of evidence for small to moderate sample sizes than do JZS priors with standard scale specifications. For medium or larger standardized effect sizes (> 0.6), alternative hypotheses defined with JZS priors provide higher average weights of evidence, with all specifications providing strong or very strong weights of evidence for sample sizes greater than 40. Alternative hypotheses defined with JZS priors provide higher average weight of evidence for very small effect sizes (i.e., < 0.10), but require large or very large sample sizes (> 2000) to provide strong support. Nearly identical conclusions apply to two-sample t tests and z tests.
An Application to incidental disfluency studies
To illustrate the use of NAP-based Bayes factors on real data, we applied them to replications of an incidental disfluency study, one of the 28 studies included in the “Many Labs 2” project (Klein et al., 2018). Data for this example are available from the Open Science Framework (OSF) (https://osf.io/8cd4r/). For purposes of illustration, we restrict attention to Bayes factors based on default NAP and JZS priors.
In the original disfluency study, Alter et al. (2007) investigated whether a slow, analytical, and deliberate processing style can be activated by metacognitive experiences of difficulty or disfluency during the process of reasoning. To test this hypothesis, participants in the study were asked questions after reading two-statement syllogisms presented in either a hard-to-read or an easy-to-read font. Forty-one undergraduates from Princeton University completed a questionnaire that contained one of six syllogistic reasoning problems. The syllogisms were selected based on their accuracy rates established in prior research: two were easy, two were moderately difficult, and two were very difficult. Alter’s original study compared responses based on the two moderately difficult syllogisms. Participants were randomly assigned to a questionnaire printed in either a hard-to-read (disfluent) or an easy-to-read (fluent) font. Each questionnaire contained 6 questions and the number of questions correctly answered by each participant was recorded as the response.
The study was subsequently replicated 13 times by researchers in multiple countries. To minimize differences between replications, “English in-lab” questionnaires were used in the following analyses (i.e., on the OSF website, “English” from the “Language” column and “In a lab” from the “setting” column). In total, these studies collected 2,580 responses, 1,268 from the fluent condition and 1,312 from the disfluent condition.
In previous analyses of these data, authors of the original and replication studies used two-sample t tests to test the null hypothesis that the mean numbers of correct responses from the two conditions were the same. Following their lead, we assume that the sample means of correct responses under fluent and disfluent conditions are independently and normally distributed with means μf and μd, respectively, and variances σ2/nf and σ2/nd, where σ2 is an unknown common variance and nf and nd are the numbers of subjects responding under the fluent and disfluent conditions. The tested hypotheses can then be expressed in frequentist terms as
H0 : μf = μd   versus   H1 : μf ≠ μd.   (18)
The P-value for the two-sample t-test is 0.43, which does not support the rejection of the null hypothesis of no effect. Neither does it provide an interpretable summary of evidence in favor of the null.
Taking a Bayesian perspective, we computed Bayes factors from these data by using default NAP and JZS priors on the standardized difference (μf − μd)/σ to define the alternative hypothesis H1.
Table 1 displays the weight of evidence accumulated by each test using all 2,580 responses. The table shows that both priors favor the null hypothesis. The Bayes factor based on the default JZS prior fails to provide "strong" evidence in favor of the null, whereas the NAP-based Bayes factor does. These values correspond to odds (i.e., Bayes factors) of about 17:1 in favor of H0 using the JZS prior, and 76:1 using the NAP-based Bayes factor. In other words, over four times more support is obtained in favor of the null hypothesis when the NAP is used to define the alternative hypothesis.
Table 1:
Weight of evidence accumulated by the default NAP and JZS priors in favor of H1 in (18) in a fixed-design test.
| Prior | Weight of evidence |
|---|---|
| Default NAP | −4.33 |
| Default JZS | −2.81 |
Sequential tests
Unlike fixed sample size tests, sequential testing procedures are designed to terminate as soon as compelling evidence has been collected in favor of either the null or alternative hypothesis. After each subject or group of subjects is observed, they employ a rule that determines whether to (i) continue to collect data, (ii) stop data collection and reject the null hypothesis, or (iii) stop data collection and reject the alternative hypothesis. An important advantage of sequential designs is that they offer a potential mechanism for reducing the number of subjects that are needed to perform statistical tests.
Sequential tests have been developed extensively since their introduction by Wald in the 1940s (e.g., Wald, 1945). For a comprehensive review of developments in the statistical theory underlying sequential probability ratio tests (SPRT), see Siegmund (2013). More recently, sequential designs have been proposed for application in psychology and other social sciences by Schönbrodt et al. (2017), Schnuerch and Erdfelder (2020), Pramanik et al. (2021), and Stefan et al. (2021).
Schönbrodt et al. (2017) proposed a sequential Bayes factor (SBF) procedure in which data are collected until the Bayes factor crosses predefined thresholds. They discuss a variety of prior densities on the standardized effect size that might be used to define the alternative model and Bayes factor, but recommend as a default the JZS prior with scale parameter √2/2. Schnuerch and Erdfelder (2020) discuss a modification of Wald's SPRT that applies to two-sample t tests (Hajnal, 1961). The thresholds for making a decision in this test are chosen to maintain Type I and Type II error control. Hajnal's test is based on computing the ratio of non-central t and F sampling densities under the alternative hypotheses to central t and F densities that apply under the null. Schnuerch and Erdfelder (2020) do not provide objective criteria or default values for the standardized effect sizes that define the non-centrality parameters in these tests. Stefan et al. (2021) discuss the connections between Schönbrodt et al. (2017) and Schnuerch and Erdfelder (2020), pointing out that the thresholds for the Bayes factors in the former can be adjusted to control Type I and II error probabilities. Readers interested in more detailed descriptions of these and related sequential testing procedures are encouraged to consult Schönbrodt et al. (2017) and Schnuerch and Erdfelder (2020).
Pramanik et al. (2021) propose a modification of the SPRT that they call the modified sequential probability ratio test (MSPRT). There are three innovations in this test. First, unlike the SPRT and SBF, the maximum sample size for the MSPRT is set in advance. Second, if a decision has not been reached by the time the maximum sample size is attained, a decision threshold is used at the end of the test to determine whether to accept or reject the null hypothesis. This threshold is estimated numerically so that the Type II error probability is minimized under the constraint that the targeted Type I error is maintained. Finally, the simple alternative hypothesis for the test is determined using the uniformly most powerful Bayesian test (UMPBT; Johnson (2013)) at the maximum sample size, say N. In the test of whether a normal mean equals 0, the UMPBT alternative is of order N^(−1/2). Successful application of the MSPRT thus implicitly depends on the selection of a maximum sample size that is commensurate with the anticipated magnitude of the standardized effect size. For example, if N = 10,000, the point alternative hypothesis for the standardized effect size in an MSPRT for a one-sided z-test of size 0.05 is very small. If the magnitude of the anticipated standardized effect size is substantially larger than this, then the UMPBT default value should not be used. Pramanik et al. (2021) do not provide guidance on the selection of alternative values or maximum sample size for the test. Because the alternative hypotheses defined in this procedure depend on the maximum sample size specified for the test, it is difficult to compare its operating characteristics to the other sequential procedures, and so it is not considered in the comparisons below.
Results presented for fixed design tests suggest that the use of the JZS prior (as well as other local alternative priors) to define alternative hypotheses makes it difficult to accumulate evidence in favor of true null hypotheses and “very small” effect sizes. In sequential tests, this means that sequential procedures may not reach termination criteria before available sample sizes are expended when the null hypothesis is true. To resolve this difficulty, we propose using the default NAP prior to define the alternative hypothesis in these tests.
We now explore this proposal in the two contexts suggested by Schönbrodt et al. (2017) and Schnuerch and Erdfelder (2020). First, we examine the Bayesian approach and the SBF proposed in Schönbrodt et al. (2017). In this test, data is accumulated until the weight of evidence exceeds specified thresholds. After this, we examine the performance of both methods viewed from the frequentist perspective of Schnuerch and Erdfelder (2020) in which SPRT-type thresholds are determined so as to maintain specified Type I and Type II error probabilities.
Sequential design with symmetric evidence thresholds
Performance comparison.
In this section, we again consider a one-sample two-sided t test of a normal mean μ, with H0 : μ = 0. In each simulated replication of the test, samples are collected until the weight of evidence in favor of the alternative hypothesis exceeds 3 or 5, or falls below −3 or −5. The performances of the tests are summarized over 50,000 simulations of the tests at each standardized effect size.4
Several alternative hypotheses were considered. Following Schönbrodt et al. (2017), we computed Bayes factors using the default JZS prior with scale parameter √2/2. We also examined the default NAP prior with τ² = 0.045. Bayes factors for these procedures were computed using formulae provided in section two. We refer to the sequential testing procedures based on these prior assumptions as SBF-JZS and SBF-NAP. To facilitate comparisons with Schnuerch and Erdfelder (2020), we also tested the SPRT proposed in Hajnal (1961) with an alternative hypothesis defined by point masses concentrated on standardized effect sizes of ±0.3. These values match the modes of the default NAP prior and the composite hypothesis examined in the previous section. As noted in Schnuerch and Erdfelder (2020), the Bayes factors from these tests are most efficient when the true standardized effect size is close to the assumed alternative hypothesis. For ease of exposition, we refer to the sequential test based on Hajnal's approximate Bayes factor with alternative hypothesis equal to a standardized effect size of magnitude d as the "Hajnal(d)" test.
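The logic of these sequential procedures can be sketched generically: observations accrue one at a time and sampling stops once the weight of evidence crosses a symmetric threshold. The R code below is an illustrative sketch (parameter names are ours) that uses the composite ln Bayes factor sketched earlier; the NAP or JZS Bayes factors could be substituted for the lnbf argument:

```r
# Sequential Bayes factor test with symmetric weight-of-evidence thresholds
sbf_sequential <- function(rdata, lnbf, threshold = 3, n_min = 3, n_max = 5000) {
  x <- rdata(n_min)
  repeat {
    w <- lnbf(x)
    if (abs(w) >= threshold || length(x) >= n_max) {
      decision <- if (w >= threshold) "reject H0" else
                  if (w <= -threshold) "accept H0" else "no decision"
      return(list(n = length(x), weight = w, decision = decision))
    }
    x <- c(x, rdata(1))   # collect one more observation
  }
}

# One simulated run under a true null (standardized effect size 0)
set.seed(1)
sbf_sequential(rdata = rnorm, lnbf = composite_lnbf)
```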
True null hypothesis.
Fig. 5 presents the boxplot of sample sizes and the ASN required by the SBF-NAP, SBF-JZS and Hajnal(0.3) tests to exceed thresholds of ±3 and ±5 when the null hypothesis is true. As the plots show, the SBF-JZS test typically requires significantly more samples to reach a decision. In the case of thresholds of ±3, the ASN’s for the SBF-JZS test and Hajnal(0.3) test were 968 and 99, respectively, while the ASN for the SBF-NAP test was 239. For a threshold of ±5, the corresponding ASN’s were 54,833, 164, and 1,026, respectively. These trends mimic those observed for fixed design studies.
Figure 5.

ASN for sequential procedures under a true null hypothesis. The plots are truncated at 1500 and 80,000 to enhance comparisons at moderate sample sizes. Panel (a) provides a boxplot estimate of the distribution of sample sizes required for the SBF-NAP, SBF-JZS and Hajnal(0.3) procedures to cross an exceedance threshold of ±3. About 0.3% of SBF-NAP tests and 11% of SBF-JZS tests required more than 1500 samples to reach a decision. All Hajnal(0.3) tests terminated by 550 samples. Panel (b) provides the corresponding boxplots when the exceedance threshold is ±5. About 12% of SBF-JZS tests required more than 80,000 samples to reach a decision. The black diamonds show the ASN's for each procedure. All SBF-NAP tests reached a decision by 54,750 samples, and all Hajnal(0.3) tests terminated by observation 980.
True alternative hypothesis.
The increased efficiency of the SBF-NAP and Hajnal(0.3) tests under the null hypothesis is offset by decreased power to detect smaller standardized effect sizes. This phenomenon is illustrated in Fig. 6. The panels on the left side of this figure represent the ASN and power achieved by the three sequential tests when an evidence threshold of ±3 was imposed, while the panels on the right correspond to evidence thresholds of ±5.
Figure 6.

Operating characteristics under true alternative hypotheses. Panels (a) and (b) depict the ASN’s for three sequential tests when the exceedance thresholds are ±3 and ±5, respectively, versus the data-generating value of the standardized effect size. Panels (c) and (d) provide the corresponding probabilities that each test rejects the null hypothesis as a function of the standardized effect size.
The general take-away from this figure is that the SBF-JZS provides substantially better power than the SBF-NAP test for standardized effect sizes less than 0.25 (left) or 0.10 (right). However, the cost of the additional power can be very high in terms of the ASN required to reach a decision. For instance, SBF-JZS requires ASN’s that are greater than 50,000 to reach a decision for standardized effect sizes less than about 0.02 and weight of evidence thresholds of ±5, even though the power at these smaller effect sizes can be well below 0.5.
Sequential analysis of the incidental disfluency study.
In this section we compare the performances of the SBF-JZS and the SBF-NAP priors using the disfluency data described earlier. For brevity, we again only compare the default choices of NAP and JZS priors.
To perform a sequential analysis of the data collected in the 13 replicated studies, we assume for illustration purposes that data from these studies was collected sequentially according to study number, and that all data from each study was collected simultaneously.
Given this ordering, we calculated the weight of evidence against the null hypothesis specified in (18) after data from each study "arrived." The weight of evidence was then computed using all available data. If the weight of evidence was strong for either hypothesis (i.e., > 3 or < −3), the test was terminated. The time courses for the accumulation of weight of evidence for the SBF-NAP and SBF-JZS procedures are displayed in Figure 7.
Figure 7.

A comparison of the SBF-JZS and the SBF-NAP with symmetric "strong" thresholds in the case of the replicated incidental disfluency data. For each prior, the natural logarithm of the Bayes factor in favor of the alternative hypothesis that incidental disfluency activates a deliberate, analytic processing style is calculated. The curve corresponding to each prior depicts the sequentially calculated values after observing each of the 13 studies, until the values exceed ±3. The horizontal axis displays the studies in the assumed order they were observed.
From the figure we see that weight of evidence from the SBF-JZS approaches, but never crosses, the strong weight of evidence threshold. In contrast, the SBF-NAP test provides strong weight of evidence in favor of the null hypothesis after Study 3, using only 588 of the 2,580 combined study participants. Thus, application of the SBF-NAP procedure uses nearly 2,000 fewer subjects to conclude that there is a negligible disfluency effect, while at the same time providing stronger evidence in favor of this conclusion.
Sequential design with the SPRT thresholds
Performance comparison.
The sequential probability ratio test, as proposed by Wald (1945), is based on comparing the likelihood ratio between a simple null and a simple alternative hypothesis and terminating an experiment when the likelihood ratio strongly favors one of the two. More specifically, let x1, x2, … represent independent, identically distributed realizations from a distribution with density function f(x; θ) under both hypotheses. Suppose the null hypothesis H0 stipulates that θ = θ0 and the alternative hypothesis H1 that θ = θ1. Then the likelihood ratio statistic in favor of the alternative hypothesis based on the first n observations may be expressed as
L(θ0, θ1; xn) = ∏_{i=1}^{n} [f(xi; θ1) / f(xi; θ0)].   (19)
Wald's SPRT continues data collection until L(θ0, θ1; xn) > A, in which case the null hypothesis is rejected, or L(θ0, θ1; xn) < B, in which case the alternative hypothesis is rejected. The decision thresholds are defined as
A = (1 − β)/α   and   B = β/(1 − α),   (20)
where α and β denote the targeted Type I and Type II error probabilities.
Typical design parameters in the social sciences and medicine often assume that Type I and Type II errors fall in the range (0.005, 0.05) and (0.05, 0.2), respectively. It follows that the SPRT thresholds for (α, β) = (0.05, 0.2) are A = 16 and B = 0.21, and for (α, β) = (0.005, 0.05) are A = 190 and B = 0.05.
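These threshold values follow directly from (20), as the short R check below illustrates:

```r
# Wald's SPRT thresholds from (20) for targeted error probabilities (alpha, beta)
wald_thresholds <- function(alpha, beta)
  c(A = (1 - beta) / alpha, B = beta / (1 - alpha))

wald_thresholds(0.05, 0.2)        # A = 16,   B approx 0.21
wald_thresholds(0.005, 0.05)      # A = 190,  B approx 0.05
log(wald_thresholds(0.05, 0.2))   # approx 2.77 and -1.56 on the weight-of-evidence scale
```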
Stefan et al. (2021) point out that the SPRT can be modified for use with composite hypotheses by replacing the likelihood ratio with the Bayes factor between hypotheses. Hajnal (1961) and Schnuerch and Erdfelder (2020) extended the SPRT to t tests by replacing the likelihood ratio for normally distributed data with unknown means and common variance by the ratio of a non-central t density to a central t density, evaluated at the t statistic for the experiment. Stefan et al. (2021) provided numerical comparisons of the Hajnal test to the SPRT based on the Bayes factor defined with the JZS prior (and several other prior choices). We now extend this comparison to include the SPRT obtained by using the default normal moment prior (default NAP) density to define the alternative hypothesis. Before doing so, however, it is useful to compare the SPRT thresholds to the symmetric thresholds examined in the previous section.
For (α, β) = (0.05, 0.2), the Bayes factor thresholds are A = 16 and B = 0.21, with ln(A) = 2.77 and ln(B) = −1.56. The latter value represents the threshold at which the alternative hypothesis is rejected. It is substantially smaller in magnitude than the thresholds of −3 and −5 examined previously. With prior odds equal to 1, weight of evidence equal to −1.56 implies that the posterior probability of the alternative hypothesis is 0.17, which might be considered too high for rejection. The use of this less stringent threshold for “accepting” the null hypothesis reduces the ASN required by the SBF-JZS test. Values of (α, β) = (0.005, 0.05) yield weight-of-evidence thresholds that are more similar to those studied in the previous section. With prior odds equal to 1, the alternative hypothesis is not rejected unless it has posterior probability less than 0.05, and the null hypothesis is not rejected unless it has posterior probability less than 0.0052.
True null hypothesis.
Fig. 8 depicts the ASN for three sequential tests when Type I and Type II error probabilities were constrained to (0.05, 0.20) (left panel) and (0.005, 0.05) (right panel). As in the previous section, all three sequential tests were designed to test the null hypothesis that the mean of a sample of normal random variables with unknown variance was equal to 0. The boxplots in this figure were based on 50,000 replications of each test. The decision thresholds for each test were set according to (20), and data for each test were simulated under the null hypothesis that the standardized effect size was 0.
Figure 8.

ASN for SPRT procedures when the null hypothesis is true. Panel (a) provides a boxplot estimate of the distribution of sample sizes required for the SBF-NAP, SBF-JZS and Hajnal(0.3) procedures to cross Wald’s decision thresholds at α = 0.05 and β = 0.2. The plot is truncated at 150 samples (5.49% of SBF-NAP tests, 3.35% of SBF-JZS tests, and 1.75% of Hajnal(0.3) tests required more than 150 samples). Panel (b) provides the corresponding estimate when Wald’s decision thresholds were based on α = 0.005 and β = 0.05. The plot is truncated at 1500 samples (0.54% of SBF-NAP and 11.1% of SBF-JZS tests required more than 1500 samples; none of Hajnal(0.3) tests did). The black diamonds show the ASN for each procedure.
The three tests included in the plot are the SPRT based on the Bayes factor obtained by defining the alternative hypothesis with the default NAP on the standardized effect size (i.e., a normal moment prior with τ² = 0.045), the SPRT based on the Bayes factor obtained by defining the alternative hypothesis with the default JZS prior on the standardized effect size (scale √2/2), and the Schnuerch and Erdfelder (2020) version of Hajnal's two-sided t-test with a composite hypothesis that assigned one-half probability to ±0.3σ.
The left panel of Fig. 8 shows that the test based on the JZS alternative required the smallest mean and median ASN when the targeted Type I and Type II errors were 0.05 and 0.2, respectively. The realized Type I errors for the tests were 0.035, 0.043, and 0.044 for the alternative hypotheses defined by the JZS, composite, and NAP priors.
The right panel depicts similar findings when the targeted Type I and Type II errors were 0.005 and 0.05, respectively. With thresholds again determined from (20), the ASN required by the JZS test jumps significantly at the more stringent significance threshold, requiring an average of over 1,000 observations before reaching a decision. The NAP and composite tests required an average of 253 and 103 observations, respectively.
True alternative hypothesis.
Fig. 9 provides the ASN and power of each of the three sequential tests as a function of true standardized effect size. As in Fig. 8, the panels on the left (a,c) reflect the operating characteristics of the test with targeted Type I and Type II error probabilities equal to 0.05 and 0.2, while panels (b,d) targeted to error probabilities of 0.005 and 0.05.
Figure 9.

Operating characteristics under true alternative hypotheses. Panels (a) and (b) depict the ASN for three SPRT procedures based on Wald’s decision thresholds for (α, β) = (0.05, 0.2) and (0.005, 0.05), respectively, versus the data-generating value of the standardized effect size. Panels (c) and (d) provide the probability that each procedure rejected the null hypothesis as a function of the standardized effect size.
From panels (a) and (c), we see that the NAP prior requires, on average, more samples to reach a decision than the JZS prior for standardized effect sizes less than about 0.3, and more than the composite alternative for effect sizes less than about 0.42, although it provides better power over the range of standardized effect sizes depicted. True standardized effect sizes of 0.27, 0.29, and 0.33 are needed for the NAP, composite, and JZS tests to reach their targeted power of 80%. For the composite alternative hypothesis, this value is close to the point mass alternatives used to define the test.
Panels (b) and (d) reveal a somewhat different trend for the more stringent tests. With error probability targets of (0.005, 0.05), the ASN for the test defined with the JZS alternative can be as large as 3,500. However, these larger sample sizes provide higher power, with 95% power achieved for standardized effect sizes greater than 0.1, whereas the tests defined with the composite and NAP priors only provide 95% power for standardized effect sizes greater than 0.29 and 0.19, respectively. As in the less stringent test, the composite hypothesis achieves its targeted power at the point mass alternatives used in its definition.
Sequential analysis of the incidental disfluency study (continued).
We previously examined the efficacy of the SBF-NAP and SBF-JZS tests in accumulating strong evidence in favor of the null hypothesis against a disfluency effect using symmetric exceedance thresholds. We now consider a similar analysis using the weak ((α, β) = (0.05, 0.20)) and stringent ((α, β) = (0.005, 0.05)) Wald thresholds. Note that the weight-of-evidence curves displayed in Fig. 7 for the SBF-JZS and SBF-NAP tests do not change according to the termination thresholds that are used.
The weight-of-evidence thresholds that correspond to the less stringent criterion of (α, β) = (0.05, 0.20) are A = 2.77 and B = −1.56. As Fig. 7 suggests, both the SBF-JZS and SBF-NAP curves fall below the lower threshold after the first study (weights of evidence equal to −1.81 and −1.63, respectively).
The weight-of-evidence thresholds that correspond to the more stringent criterion of (α, β) = (0.005, 0.05) are A = 5.24 and B = −2.99. Because the lower threshold is close to −3.0, the conclusions from the last section apply here also: the SBF-NAP test terminates after the third study in favor of the null hypothesis and uses only 588 subjects, while the SBF-JZS test does not terminate even after responses from all 2,580 subjects are accumulated.
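For reference, the thresholds quoted above follow from Wald's classical formulas on the weight-of-evidence scale; the short R check below (the function name is ours) reproduces them.

```r
# Wald's weight-of-evidence thresholds for targeted error rates (a sketch;
# log() is the natural logarithm, matching the values quoted above).
wald_thresholds <- function(alpha, beta) {
  c(A = log((1 - beta) / alpha), B = log(beta / (1 - alpha)))
}
wald_thresholds(0.05, 0.20)    # A ~ 2.77, B ~ -1.56
wald_thresholds(0.005, 0.05)   # A ~ 5.25, B ~ -2.99
```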
Discussion
This article has explored the use of non-local alternative prior densities, or NAPs, to define alternative models in Bayesian z and t tests. From a subjective perspective, evidence suggests that NAPs approximate the marginal distribution of non-null effect sizes observed in the psychology and social science literature (Bakker et al., 2012; Anderson et al., 1999; Hall, 1998; Lipsey and Wilson, 1993; Meyer et al., 2001; Richard et al., 2003; Tett et al., 1994). Viewed more objectively, the operating characteristics of Bayesian tests based on NAPs provide an opportunity for researchers to more rapidly accumulate evidence in favor of true null hypotheses and alternative hypotheses in which standardized effect sizes are moderate in magnitude.
Table 2 illustrates this effect when the null hypothesis is true. Sample sizes required to obtain strong weight of evidence, on average, are nearly 5 times larger using the JZS specification than the NAP specification. To obtain very strong weight of evidence, the sample size required by the JZS specification is roughly 65 times larger. Table 3 provides a similar comparison when the alternative hypothesis is true. Evidence for small and medium standardized effect sizes accumulates faster under the NAP specification, although the gains for these alternative hypotheses are less pronounced. Bayes factors based on default JZS priors outperform those based on default NAP priors for large effect sizes, an advantage that increases with increasing standardized effect size. Of course, the sample sizes required to detect large effects tend to be fairly small no matter which alternative is specified.
Table 2:
Average sample numbers required for fixed-design tests under true null hypotheses. This table displays the minimum sample sizes required for Bayes factors to achieve, on average, strong (log(BF01) ≥ 3) or very strong weight of evidence (log(BF01) ≥ 5) in favor of true null hypotheses.
| Prior | Strong | Very strong |
|---|---|---|
| Default NAP | 294 | 1,208 |
| Default JZS | 1,445 | 79,424 |
Table 3:
Average sample numbers required for fixed-design tests under true alternative hypotheses. This table displays the average sample sizes required for Bayes factors to achieve strong (log(BF10) ≥ 3) or very strong weight of evidence (log(BF10) ≥ 5) for small (0.2), medium (0.5) and large (0.8) standardized effect sizes.
| Prior | Small: Strong | Small: Very strong | Medium: Strong | Medium: Very strong | Large: Strong | Large: Very strong |
|---|---|---|---|---|---|---|
| Default NAP | 225 | 335 | 37 | 56 | 21 | 31 |
| Default JZS | 267 | 379 | 42 | 62 | 19 | 28 |
Tables 4 and 5 demonstrate that similar trends persist for sequential tests based on Bayes factors. In the case of tests with symmetric thresholds, however, the smaller ASNs achieved by the NAP-based Bayes factors should be balanced against the fact that these tests have a high probability of generating evidence in favor of the null hypothesis when the magnitude of the standardized effect size is less than 0.1.
Table 4:
Average sample numbers for sequential tests under true null hypotheses. Columns refer to the average sample sizes required for Bayes factors to exceed, on average, strong (|log(BF10)| ≥ 3) or very strong (|log(BF10)| ≥ 5) weight-of-evidence thresholds when termination thresholds are symmetric.
| Prior | Strong | Very strong |
|---|---|---|
| Default NAP | 238 | 1,026 |
| Default JZS | 968 | 54,832 |
Table 5:
Maximum average sample numbers for sequential tests under true alternative hypotheses. This table does not reflect the power of the tests, which for standardized effect sizes less than 0.2 is greater for the default JZS prior with symmetric thresholds. Columns list the maximum of the ASN required for Bayes factors to exceed, on average, strong (|log(BF10)| ≥ 3) or very strong (|log(BF10)| ≥ 5) weight-of-evidence thresholds. The power and standardized effect sizes at which these values are obtained can be discerned from Fig. 6.
| Prior | Strong | Very strong |
|---|---|---|
| Default NAP | 458 | 3,853 |
| Default JZS | 2,399 | 158,235 |
Perhaps related to this trade-off, Tendeiro and Kiers (2019) argue that Bayes factors can either favor a point null hypothesis (Issue 9) or an alternative hypothesis (Issue 10). With regard to the latter, they cite Johnson and Rossell (2010) and express concern that evidence is accumulated asymmetrically in favor of the alternative model. van Ravenzwaaij and Wagenmakers (2019) correctly point out that ‘the claim that something is absent is more difficult to support than the claim that something is present, at least when one is uncertain about the size of the phenomenon that is present. Consider, for instance, the null hypothesis “There is no animal in this room,” tested against the alternative hypothesis: “There is an animal in this room, but it could be as small as an ant or as big as a cow”. Now if the “effect” is of medium size (say a cat), it can be quickly discovered and H1 then receives decisive support. But if a cursory inspection does not reveal any animal, then support for H0 will only be weak (after all, it is easy to miss an ant). Now there is a way to collect strong evidence for H0, but it requires more effort – a systematic search with a magnifying glass, for instance.’
Theoretical support for this statement can be found in the pioneering work of Bahadur (1967) and Bahadur and Bickel (1967), who showed that likelihood ratios and Bayes factors in favor of true null hypotheses and true alternative hypotheses increase exponentially fast with sample size when the parameter spaces associated with the two hypotheses are separated. Sub-exponential convergence occurs when the parameter defining one hypothesis falls on the boundary between the spaces. This is the case in tests of point null hypotheses, where the null parameter value is not separated from the parameter values that define the alternative hypothesis. One objective of NAP-based tests is to approximately "separate" the hypotheses. This goal is complicated by the desire to avoid discontinuities in the prior densities that define the alternative hypotheses. For example, assigning positive prior density to, say, 0.3, and zero density to all smaller values may not make sense.
The comments of van Ravenzwaaij and Wagenmakers (2019) illustrate this principle well. If only animals larger than cats are considered, so that the hypotheses are well separated, then one can test "no animal present" versus "animal present" very quickly. If ants and even smaller animals count, then the null hypothesis is not well separated from the alternative and testing takes longer. For the NAP-based tests proposed in this article, Bayes factors in favor of true null hypotheses increase at a rate of n^{3/2}. For local alternative hypotheses, this rate is only n^{1/2} (Johnson and Rossell, 2010). In contrast, the rate for any true alternative hypothesis, whose parameter value is necessarily distinct from the null value, increases exponentially fast with n.
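The contrast in rates can be illustrated with the known-variance z test, for which the moment-prior and local-normal-prior Bayes factors have simple closed forms. The sketch below (our notation; τ² = 0.045 is the default used in the text, and the formulas follow from the standard conjugate normal calculation rather than the article's t-test expressions) fixes the z statistic at its null-expected magnitude and shows the weight of evidence for a true null growing like (3/2) log n under the non-local prior versus (1/2) log n under a local prior.

```r
# Sketch: growth of log(BF01) under a true null (z fixed at its null-expected
# magnitude, z^2 = 1) for a non-local moment prior versus a local normal prior
# on the standardized effect size, known-variance z test, tau2 = 0.045.
log_bf01_moment <- function(n, z2 = 1, tau2 = 0.045) {
  c1 <- n * tau2 / (1 + n * tau2)
  1.5 * log(1 + n * tau2) - log(1 + c1 * z2) - c1 * z2 / 2
}
log_bf01_local <- function(n, z2 = 1, tau2 = 0.045) {
  c1 <- n * tau2 / (1 + n * tau2)
  0.5 * log(1 + n * tau2) - c1 * z2 / 2
}
n <- c(100, 1000, 10000)
round(cbind(n, moment = log_bf01_moment(n), local = log_bf01_local(n)), 2)
# The moment-prior column grows by about 1.5*log(10) per decade of n,
# the local-prior column by about 0.5*log(10).
```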
The default NAP-based tests proposed in this article should not be categorized as objective Bayesian tests because they explicitly target the detection of standardized effect sizes of most interest in psychology and other social sciences. Nevertheless, it is interesting to examine their properties using criteria that are sometimes used to judge the performance of objective Bayesian tests. As summarized in, for example, Bayarri et al. (2012) and Consonni et al. (2018), such criteria include basic (Bayesian) consistency, model selection consistency, intrinsic consistency, information consistency, predictive matching, and scale-location invariance.
NAP-based tests satisfy basic and intrinsic consistency because they are Bayesian tests that do not depend on arbitrary normalizing constants, training sample sizes, or other arbitrary quantities whose influence does not vanish with increasing sample size. Model selection consistency requires that the posterior probability of the true model converges to 1 as the sample size increases. The NAP-based z and t tests proposed here satisfy this criterion; the Hajnal tests do not.
The NAP-based z tests proposed in this article satisfy information consistency. That is, they are able to obtain unbounded evidence against the null hypothesis for arbitrarily extreme observations based on any given sample size. When the observational variance is known, an arbitrarily large sample mean (or difference in sample means) can provide arbitrarily high evidence against the null hypothesis, regardless of the sample size.
NAP-based t tests are not information consistent. We do not regard this as a shortcoming of the tests, however. In our view, it should not necessarily be possible to obtain unbounded information in favor of an alternative hypothesis from a finite sample of measurements if the properties of the measuring device or error structure (e.g., the variance) are not known. This is particularly true when prior knowledge suggests that the values of the tested parameter under the null and alternative hypotheses are not too dissimilar.
As a final comment on this issue, we note that lack of information consistency for NAP-based t tests cannot be attributed to the improper prior assumed for the observational variance. Even if a proper inverse gamma distribution is assumed on the observational variance, NAP-based t tests do not attain information consistency (see Theorem S2.8 of Supplemental Materials). That is, formal Bayesian tests based on fully specified statistical models with proper priors on all unknown parameters may not satisfy the information consistency criterion.
Because the NAP-based tests are functions of z and t statistics, they inherit the invariance properties of those test statistics and therefore satisfy the scale-location invariance criterion listed above.
Predictive matching requires that Bayes factors between models based on “minimal” sample sizes should approximately equal 1. Exact predictive matching requires that they exactly equal 1. Minimal sample sizes can be loosely interpreted as the smallest sample size that makes maximum likelihood estimation possible for all parameters in all models. In the case of a one-sample t test, for example, the minimal sample size necessary to estimate the mean and variance is 2 if improper priors are specified on both parameters.
Predictive matching and information consistency are antithetical for minimal sample sizes. Predictive matching requires that the Bayes factor remain close to 1 whenever a minimal sample has been obtained, while information consistency requires that the Bayes factor can become unbounded for extreme data. Given the discussion above, it is therefore not surprising that NAP-based z tests are not predictively matched and that NAP-based t tests are. In the former case, the minimal sample size is 1 and the Bayes factor grows exponentially with the magnitude of a single observation. For one-sample t tests and a minimal sample size of 2, the NAP-based Bayes factor is bounded, ranging from (1 + 2τ²)^{−3/2} up to a finite upper bound; for the default value τ² = 0.045, the corresponding range is approximately (0.88, 1.13).
Our interpretation of these results is that the predictive matching and information consistency desiderata are not useful as general criteria for defining Bayesian tests. On the one hand, a single large normal observation with known variance can provide very strong evidence against a null hypothesis that a normal mean equals 0; if the minimal sample size is 1, then accepting such evidence violates the predictive matching criterion. On the other hand, the posterior probability of a null hypothesis of no effect should not necessarily become arbitrarily small based on a finite sample when there is uncertainty regarding the precision of the values that were measured.
This article has concentrated on default NAP-based tests in which targeted standardized effect sizes fall in the range (0.2, 0.8). However, in some testing contexts specific prior information regarding the magnitude of a standardized effect size may be known. For instance, a researcher may wish to detect a very small standardized effect size (e.g., < 0.2). In such cases, we recommend defining τ² = δp²/2, where δp denotes the prior estimate of the standardized effect size or difference in standardized effect sizes.
To illustrate, suppose the magnitude of a standardized effect size is expected to be approximately δp = 0.05 in a one-sample test of a normal mean with unknown variance. Then a good choice for τ² is 0.05²/2 = 0.00125. A plot of this NAP density is provided in Fig. 11. If a NAP-based test is conducted with a normal moment prior with this value of τ², then the average weight of evidence from a fixed-design test with n = 4000 observations is slightly greater than 4.0. In contrast, the average weight of evidence for the default NAP-based test with τ² = 0.045 is approximately −0.017, and for the default JZS-based test it is 1.46 (see Fig. 4d). Thus, by including subjective prior information into a test, an investigator can substantially increase the evidence collected in favor of a very small standardized effect size.
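The recommendation τ² = δp²/2 reflects the shape of the normal moment prior, whose density is proportional to δ² times a normal density and therefore peaks at ±√(2τ²); setting τ² = δp²/2 places those peaks at ±δp (for the default τ² = 0.045, the peaks fall at ±0.3). A small R sketch, with helper names of our choosing:

```r
# Sketch: the normal moment prior pi(delta) = (delta^2 / tau2) * dnorm(delta, 0, sqrt(tau2))
# has modes at +/- sqrt(2 * tau2), so tau2 = delta_p^2 / 2 places its peaks at the
# anticipated standardized effect size delta_p. Names are illustrative.
nap_tau2    <- function(delta_p) delta_p^2 / 2
nap_density <- function(delta, tau2) (delta^2 / tau2) * dnorm(delta, 0, sqrt(tau2))

tau2 <- nap_tau2(0.05)   # 0.00125, as in the example above
sqrt(2 * tau2)           # modes at +/- 0.05
curve(nap_density(x, tau2), from = -0.2, to = 0.2,
      xlab = "standardized effect size", ylab = "density")
```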
Figure 11.

Normal moment prior for detecting a very small standardized effect. This normal moment prior density has peaks at ±0.05 and places most of its prior mass on standardized effect sizes with magnitudes in the interval (0.02, 0.10).
Although this article has focused on two-sided tests, one-sided tests can also be conducted using formulae for Bayes factors provided in the supplemental information. The NAP priors used for one-sided tests are twice as large as the densities used for two-sided tests for positive (or negative) standardized effect sizes, and 0 for negative (or positive) standardized effect sizes. This implies that the weight of evidence in favor of a true alternative in a one-sided test can be as much as ln(2) ≈ 0.69 higher than in a two-sided test, and that the average weight of evidence in favor of true null hypotheses can also be higher, particularly when the sign of the sample mean disagrees with the sign of the standardized effect size assumed under the alternative hypothesis. Figures summarizing simulation studies for one-sided tests are provided in the supplemental materials.
R functions (R Core Team, 2021) for implementing the NAP and Hajnal tests described in this article are available on CRAN and GitHub.
Supplementary Material
Figure 10.

A comparison of the SBF-JZS and SBF-NAP tests with SPRT thresholds for the replicated incidental disfluency data. For each prior, the natural logarithm of the Bayes factor in favor of the alternative hypothesis that incidental disfluency activates a deliberate, analytic processing style is calculated. The curve corresponding to each prior depicts the sequentially calculated values after observing each of the 13 studies, until the values exceed the SPRT thresholds corresponding to (α, β) = (0.005, 0.05). The horizontal axis displays the studies in the assumed order in which they were observed.
Footnotes
Kass and Raftery (1995) propose 2 ln(BF10(x)) as a default measure, but by omitting the factor of 2 their descriptors are more compatible with the measure proposed by Jeffreys (1961).
To manage simulation time for the SBF-JZS, the sample size at each sequential step is increased following Schönbrodt et al. (2017): we add 1 new observation per comparison until the total sample size (n) reaches 100, 5 new observations until n reaches 1,000, 10 until n reaches 2,500, 20 until n reaches 5,000, and 50 thereafter.
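Under one reading of this schedule, the step sizes can be encoded as follows (the helper name is ours); a sketch:

```r
# Sketch of the batch-size schedule described in the footnote: add 1 observation
# per step up to n = 100, then 5 up to 1,000, 10 up to 2,500, 20 up to 5,000,
# and 50 thereafter.
next_batch_size <- function(n) {
  if (n < 100) 1
  else if (n < 1000) 5
  else if (n < 2500) 10
  else if (n < 5000) 20
  else 50
}

# Example: build the sequence of monitored sample sizes up to 1,500 observations.
n <- 0
sizes <- c()
while (n < 1500) {
  n <- n + next_batch_size(n)
  sizes <- c(sizes, n)
}
head(sizes); tail(sizes)
```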
References
- Alter AL, Oppenheimer DM, Epley N, and Eyre RN (2007). Overcoming intuition: metacognitive difficulty activates analytic reasoning. Journal of Experimental Psychology: General, 136(4):569.
- Anderson C, Lindsay J, and Bushman B (1999). Research in the psychological laboratory. Current Directions in Psychological Science, 8:3–9.
- Augustin T (2008). Stevens' power law and the problem of meaningfulness. Acta Psychologica, 128.
- Bahadur R (1967). Rates of convergence of estimates and test statistics. Annals of Mathematical Statistics, 38(2):303–324.
- Bahadur R and Bickel P (1967). Asymptotic optimality of Bayes' test statistics. Technical report, The University of Chicago.
- Bakker M, van Dijk A, and Wicherts JM (2012). The rules of the game called psychological science. Perspectives on Psychological Science, 7(6):543–554.
- Bayarri M, Berger J, Forte A, and Garcia-Donato G (2012). Criteria for Bayesian model choice with application to variable selection. Annals of Statistics, 40(3):1550–1577.
- Berger J and Pericchi L (1996). On the justification of default and intrinsic Bayes factors. In Lee J, Johnson W, and Zellner A, editors, Modelling and Prediction Honoring Seymour Geisser, pages 173–204. Springer, New York.
- Cohen J (1988). Statistical power analysis for the behavioral sciences, 2nd edition. Erlbaum, Hillsdale, N.J.
- Consonni G, Fouskakis D, Liseo B, and Ntzoufras I (2018). Prior distributions for objective Bayesian analysis. Bayesian Analysis, 13(2):627–679.
- Cover J, Curd M, and Pincock C (2012). Philosophy of Science: The Central Issues, 2nd edition. W.W. Norton and Company, New York.
- Dienes Z (2011). Bayesian versus orthodox statistics: Which side are you on? Perspectives on Psychological Science, 6:274–290.
- Dreber A, Pfeiffer T, Almenberg J, Isaksson S, Wilson B, Chen Y, Nosek BA, and Johannesson M (2015). Using prediction markets to estimate the reproducibility of scientific research. Proceedings of the National Academy of Sciences, 112(50):15343–15347.
- Etz A and Vandekerckhove J (2018). Introduction to Bayesian inference for psychology. Psychonomic Bulletin and Review, 25(1):5–34.
- Fechner G (1966). Elements of Psychophysics. Holt, Rinehart & Winston, New York.
- Gradshteyn IS and Ryzhik IM (2014). Table of Integrals, Series, and Products. Academic Press.
- Hajnal J (1961). A two-sample sequential t-test. Biometrika, 48:65–75.
- Hall J (1998). How big are nonverbal sex differences? In Canary D and Dindia K, editors, Sex differences and similarities in communication, pages 155–177. Erlbaum, Mahwah, N.J.
- Hankin R (2016). The Gauss hypergeometric function. R package hypergeo, https://CRAN.R-project.org/package=hypergeo.
- Jeffreys H (1961). Theory of Probability. Oxford University Press, New York.
- Johnson V (2013). Uniformly most powerful Bayesian tests. The Annals of Statistics, 41(4):1716–1741.
- Johnson V, Payne R, Wang T, Asher A, and Mandal S (2017). On the reproducibility of psychological science. Journal of the American Statistical Association, 112(517):1–10.
- Johnson V and Rossell D (2010). On the use of non-local prior densities in Bayesian hypothesis tests. Journal of the Royal Statistical Society: Series B, 72:143–170.
- Kass R and Raftery A (1995). Bayes factors. Journal of the American Statistical Association, 90(430):773–795.
- Klein RA, Vianello M, Hasselman F, Adams BG, Reginald B Adams J, Alper S, Aveyard M, Axt JR, Babalola MT, Štěpán Bahník, Batra R, Berkics M, Bernstein MJ, Berry DR, Bialobrzeska O, Binan ED, Bocian K, Brandt MJ, Busching R, Rédei AC, Cai H, Cambier F, Cantarero K, Carmichael CL, Ceric F, Chandler J, Chang J-H, Chatard A, Chen EE, Cheong W, Cicero DC, Coen S, Coleman JA, Collisson B, Conway MA, Corker KS, Curran PG, Cushman F, Dagona ZK, Dalgar I, Rosa AD, Davis WE, de Bruijn M, Schutter LD, Devos T, de Vries M, Doğulu C, Dozo N, Dukes KN, Dunham Y, Durrheim K, Ebersole CR, Edlund JE, Eller A, English AS, Finck C, Frankowska N, Ángel Freyre M, Friedman M, Galliani EM, Gandi JC, Ghoshal T, Giessner SR, Gill T, Gnambs T, Gómez Ángel, González R, Graham J, Grahe JE, Grahek I, Green EGT, Hai K, Haigh M, Haines EL, Hall MP, Heffernan ME, Hicks JA, Houdek P, Huntsinger JR, Huynh HP, IJzerman H, Inbar Y, Innes-Ker Åse H., Jiménez-Leal W, John M-S, Joy-Gaba JA, Kamiloğlu RG, Kappes HB, Karabati S, Karick H, Keller VN, Kende A, Kervyn N, Knežević G, Kovacs C, Krueger LE, Kurapov G, Kurtz J, Lakens D, Lazarević LB, Levitan CA, Neil A Lewis J, Lins S, Lipsey NP, Losee JE, Maassen E, Maitner AT, Malingumu W, Mallett RK, Marotta SA, Međedović J, Mena-Pacheco F, Milfont TL, Morris WL, Murphy SC, Myachykov A, Neave N, Neijenhuijs K, Nelson AJ, Neto F, Nichols AL, Ocampo A, O’Donnell SL, Oikawa H, Oikawa M, Ong E, Orosz G, Osowiecka M, Packard G, Pérez-Sánchez R, Petrović B, Pilati R, Pinter B, Podesta L, Pogge G, Pollmann MMH, Rutchick AM, Saavedra P, Saeri AK, Salomon E, Schmidt K, Schönbrodt FD, Sekerdej MB, Sirlopú D, Skorinko JLM, Smith MA, Smith-Castro V, Smolders KCHJ, Sobkow A, Sowden W, Spachtholz P, Srivastava M, Steiner TG, Stouten J, Street CNH, Sundfelt OK, Szeto S, Szumowska E, Tang ACW, Tanzer N, Tear MJ, Theriault J, Thomae M, Torres D, Traczyk J, Tybur JM, Ujhelyi A, van Aert RCM, van Assen MALM, van der Hulst M, van Lange PAM, van ‘t Veer AE, Vásquez-Echeverría A, Vaughn LA, Vázquez A, Vega LD, Verniers C, Verschoor M, Voermans IPJ, Vranka MA, Welch C, Wichman AL, Williams LA, Wood M, Woodzicka JA, Wronska MK, Young L, Zelenski JM, Zhijia Z, and Nosek BA (2018). Many labs 2: Investigating variation in replicability across samples and settings. Advances in Methods and Practices in Psychological Science, 1(4):443–490. [Google Scholar]
- Korotkov NE and Korotkov AN (2020). Integrals Related to the Error Function. CRC Press.
- Lipsey M and Wilson D (1993). The efficacy of psychological, educational and behavioral treatment: Confirmation from meta-analysis. American Psychologist, 48:1181–1209.
- Meyer J, Finn S, Eyde L, Kay G, Moreland K, Dies R, others, and Reed G (2001). Psychological testing and psychological assessment: A review of evidence and issues. American Psychologist, 56:128–156.
- Morey R and Rouder J (2015). BayesFactor: Computation of Bayes factors for common designs. R package, https://CRAN.R-project.org/package=BayesFactor.
- Morey R, Rouder J, Jamil T, Urbanek S, Forner K, and Ly A (2018). Package "BayesFactor". R Foundation for Statistical Computing.
- Open Science Collaboration (2015). Estimating the reproducibility of psychological science. Science, 349(6251).
- Pramanik S, Johnson V, and Bhattacharya A (2021). A modified sequential probability ratio test. Journal of Mathematical Psychology, 101.
- R Core Team (2021). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.
- Richard F, Bond C, and Stokes-Zoota J (2003). One hundred years of social psychology quantitatively described. Review of General Psychology, 7:331–363.
- Rouder J, Speckman P, Sun D, and Morey R (2009). Bayesian t tests for accepting and rejecting the null hypothesis. Psychonomic Bulletin and Review, 16:225–237.
- Schnuerch M and Erdfelder E (2020). Controlling decision errors with minimal costs: The sequential probability ratio t-test. Psychological Methods, 25:206–226.
- Schönbrodt F, Wagenmakers E-J, Zehetleitner M, and Perugini M (2017). Sequential hypothesis testing with Bayes factors: Efficiently testing mean differences. Psychological Methods, 22(2):322–339.
- Siegmund D (2013). Sequential Analysis: Tests and Confidence Intervals. Springer Science & Business Media, New York.
- Stefan A, Schönbrodt F, Evans N, and Wagenmakers E-J (2021). Efficiency in sequential testing: Comparing the sequential probability ratio test and sequential Bayes factor. Unpublished paper.
- Stevens S (1957). On the psychophysical law. Psychological Review, 64:153–181.
- Tendeiro J and Kiers H (2019). A review of issues about null hypothesis Bayesian testing. Psychological Methods, 24(6):774–795.
- Tett R, Meyers J, and Roese N (1994). Applications of meta-analysis: 1987–1992. International Review of Industrial and Organizational Psychology, 9:71–112.
- van Ravenzwaaij D and Wagenmakers E-J (2019). Advantages masquerading as "issues" in Bayesian hypothesis testing: A commentary on Tendeiro and Kiers (2019). To appear in Psychological Methods.
- Wald A (1945). Sequential tests of statistical hypotheses. The Annals of Mathematical Statistics, 16(2):117–186.
- Zellner A and Siow A (1980). Posterior odds ratio for selected regression hypotheses. In Bernardo J, DeGroot M, Lindley D, and Smith A, editors, Bayesian Statistics 1, pages 585–603. University Press, Valencia.
- Zellner A and Siow A (1986). Basic Issues in Econometrics. University of Chicago, Chicago.