Significance
Bayes factors represent an informative alternative to P-values for reporting outcomes of hypothesis tests. They provide direct measures of the relative support that data provide to competing hypotheses and are able to quantify support for true null hypotheses. However, their use has been limited by several factors, including the requirement to specify alternative hypotheses and difficulties encountered in their calculation. Bayes factor functions (BFFs) overcome these difficulties by defining Bayes factors from classical test statistics and using standardized effect sizes to define alternative hypotheses. BFFs provide clear summaries of the outcome from a single experiment, eliminate arbitrary significance thresholds, and are ideal for combining evidence from replicated studies.
Keywords: Bayes factors, meta-analysis, P-value, replication study, significance threshold
Abstract
Bayes factors represent a useful alternative to P-values for reporting outcomes of hypothesis tests by providing direct measures of the relative support that data provide to competing hypotheses. Unfortunately, the competing hypotheses have to be specified, and the calculation of Bayes factors in high-dimensional settings can be difficult. To address these problems, we define Bayes factor functions (BFFs) directly from common test statistics. BFFs depend on a single noncentrality parameter that can be expressed as a function of standardized effects, and plots of BFFs versus effect size provide informative summaries of hypothesis tests that can be easily aggregated across studies. Such summaries eliminate the need for arbitrary P-value thresholds to define “statistical significance.” Because BFFs are defined using nonlocal alternative prior densities, they provide more rapid accumulation of evidence in favor of true null hypotheses without sacrificing efficiency in supporting true alternative hypotheses. BFFs can be expressed in closed form and can be computed easily from z, t, χ2, and F statistics.
Two approaches are commonly used to summarize evidence from statistical hypothesis tests: P-values and Bayes factors. P-values are more frequently reported. As noted in the American Statistical Association Statement on Statistical Significance and P-Values, the significance of many published scientific findings is based on P-values, even though this index “is commonly misused and misinterpreted. This has led to some scientific journals discouraging the use of P-values, and some scientists and statisticians recommending their abandonment ... Informally, a P-value is the probability under a specified statistical model that a statistical summary of the data would be equal to or more extreme than its observed value” (1). P-values do not provide a direct measure of support for either the null or alternative hypotheses, and their use in defining arbitrary thresholds for statistical significance has long been a subject of intense debate; see, for example, refs. 2–8. Interpreting evidence provided by P-values across replicated studies is also challenging.
Bayes factors represent the ratio of the marginal probabilities assigned to the data under competing hypotheses and, when combined with the prior odds assigned to those hypotheses, yield an estimate of the posterior odds that each hypothesis is true. That is,
[1] \text{posterior odds} \;=\; \text{Bayes factor} \times \text{prior odds},
or, more precisely,
[2] \frac{P(H_1 \mid x)}{P(H_0 \mid x)} \;=\; \frac{m_1(x)}{m_0(x)} \times \frac{P(H_1)}{P(H_0)}.
Here, P(Hi | x) denotes the posterior probability of hypothesis Hi given data x; P(Hi) denotes the prior probability assigned to Hi; and mi(x) denotes the marginal probability (or probability density function) assigned to the data under hypothesis Hi, for i = 0 (null) or i = 1 (alternative).
The marginal density of the data under the alternative hypothesis is given by
[3] m_1(x) = \int f(x \mid \theta)\, \pi_1(\theta)\, d\theta,
where f(x | θ) denotes the sampling density of the data given an unknown parameter θ. In null hypothesis significance tests (NHSTs), the marginal density of the data under the null hypothesis, m0(x), is simply the sampling density of the data assumed under the null hypothesis. That is, if H0 : θ = θ0, then m0(x) = f(x | θ0). The function π1(θ) represents the prior density for the parameter of interest θ under the alternative hypothesis, i.e., the alternative prior density.
The specification of π1(θ) is problematic. As a consequence, numerous Bayes factors based on “default” alternative prior densities have been proposed; these include, among others, the proposals of refs. 9–16. Nonetheless, the value of a Bayes factor depends on the alternative prior density used in its definition, and it is generally difficult to justify or interpret any single default choice. In addition, the numerical calculation of Bayes factors can be difficult, often requiring specialized software, and each of these problems is exacerbated in high-dimensional settings, e.g., ref. 17.
We propose several modifications to existing Bayes factor methodology designed to enhance the reporting of scientific findings. First, we define Bayes factors directly from standard z, t, χ², and F test statistics (18). Under the null hypothesis, the distributions of these test statistics are known. Under alternative hypotheses, the asymptotic distributions of these test statistics depend only on scalar-valued noncentrality parameters. Thus, the specification of the prior density that defines the alternative hypothesis is simplified, and no prior densities need to be specified under the null hypothesis.
Second, for a given test statistic, we calculate a range of Bayes factors by varying the prior densities imposed on the noncentrality parameter used to define the alternative hypothesis. The families of prior densities used to define Bayes factors are indexed by standardized effect size. Bayes factor functions (BFFs) are defined as the mapping of standardized effects to Bayes factors (or, more formally, the mapping of prior densities centered on standardized effect sizes to Bayes factors). BFFs thus make the connection between Bayes factors and prior assumptions more transparent by allowing Bayes factors to be interpreted in the context of the prior densities used in their definition. Because BFFs provide Bayes factors as a function of standardized effect size, they also facilitate the integration of evidence across multiple studies of the same phenomenon.
Third, the prior densities we propose for noncentrality parameters are special cases of nonlocal alternative prior (NAP) densities. These densities are identically zero when the noncentrality parameter is zero. This property makes it possible to more quickly accumulate evidence in favor of both true null and true alternative hypotheses (19–21); this particular feature of NAP densities is discussed in detail in ref. 22.
Finally, we provide closed-form expressions for Bayes factors and BFFs. These expressions eliminate computational difficulties sometimes encountered when calculating other Bayes factors.
1. Mathematical Framework
We write a|b ∼ D(b) to indicate that a random variable a has distribution D that depends on a parameter vector b. N(a, b) denotes the normal distribution with mean a and variance b; T(ν, λ) denotes a t distribution with ν degrees-of-freedom and noncentrality parameter λ; F(k, m, λ) denotes an F distribution on k, m degrees-of-freedom and noncentrality parameter λ; G(α, λ) denotes a gamma random variable with shape parameter α and rate parameter λ; H(ν, λ) denotes a χ2 distribution on ν degrees of freedom and noncentrality parameter λ; and J(μ0, τ2) denotes a normal moment distribution with mean μ0 and rate parameter τ2 (19). The density of a J(μ0, τ2) random variable can be written as
[4] j(x \mid \mu_0, \tau^2) = \frac{(x-\mu_0)^2}{\tau^2\sqrt{2\pi\tau^2}}\, \exp\!\left[-\frac{(x-\mu_0)^2}{2\tau^2}\right].
The modes of this distribution occur at μ0 ± √2 τ.
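To make the normal moment prior concrete, the following short Python sketch (illustrative code; the function name is ours) evaluates the density in Eq. 4 and confirms numerically that it integrates to one and that its modes sit at μ0 ± √2 τ, the fact used later when matching prior modes to standardized effect sizes.

```python
import numpy as np
from scipy.integrate import quad

def normal_moment_pdf(x, mu0=0.0, tau2=1.0):
    """Density of the J(mu0, tau^2) normal moment distribution (Eq. 4)."""
    return ((x - mu0) ** 2 / (tau2 * np.sqrt(2 * np.pi * tau2))
            * np.exp(-(x - mu0) ** 2 / (2 * tau2)))

tau2 = 2.5
# The density integrates to one ...
total, _ = quad(normal_moment_pdf, -np.inf, np.inf, args=(0.0, tau2))
# ... and its positive mode sits at sqrt(2 * tau2) (located here by a grid search).
grid = np.linspace(0.0, 10.0, 100001)
mode = grid[np.argmax(normal_moment_pdf(grid, 0.0, tau2))]
print(total)                    # ~1.0
print(mode, np.sqrt(2 * tau2))  # both ~2.236
```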
We use lowercase letters to denote corresponding densities; e.g., a gamma density evaluated at x is written g(x|α, λ).
With this notation, theorems describing Bayes factors based on z, t, F, and χ2 statistics are provided below. Each Bayes factor depends on a hyperparameter τ2 and is denoted by BF10(x|τ2). Procedures for setting τ2 as a function of standardized effect size are described in Bayes Factors as Functions of Standardized Effect Size. Proofs of the theorems appear in (SI Appendix).
Theorem 1.
z test. Assume that the distributions of a random variable z under the null and alternative hypotheses are described by
[5] z | H0 ∼ N(0, 1),
[6] z | λ, H1 ∼ N(λ, 1),  λ | τ² ∼ J(0, τ²).
Then, the Bayes factor in favor of the alternative hypothesis is
[7] \mathrm{BF}_{10}(z \mid \tau^2) = (\tau^2+1)^{-3/2} \left(1 + \frac{\tau^2 z^2}{\tau^2+1}\right) \exp\!\left[\frac{\tau^2 z^2}{2(\tau^2+1)}\right].
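As a check on Eq. 7, the sketch below (illustrative Python, with function names of our choosing) implements the closed form and compares it with the Bayes factor obtained by numerically integrating the N(λ, 1) sampling density against the J(0, τ²) prior, as in Eq. 3. The values z = 2 and τ² = 1.125 anticipate the Fig. 1 example.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def bf10_z(z, tau2):
    """Closed-form Bayes factor for a z statistic (Eq. 7)."""
    a = tau2 + 1.0
    return a ** -1.5 * (1.0 + tau2 * z ** 2 / a) * np.exp(tau2 * z ** 2 / (2.0 * a))

def bf10_z_numeric(z, tau2):
    """Same Bayes factor by direct marginalization over the J(0, tau^2) prior."""
    prior = lambda lam: (lam ** 2 / (tau2 * np.sqrt(2 * np.pi * tau2))
                         * np.exp(-lam ** 2 / (2 * tau2)))
    m1, _ = quad(lambda lam: norm.pdf(z, loc=lam) * prior(lam), -np.inf, np.inf)
    return m1 / norm.pdf(z)        # m1(z) / m0(z)

z, tau2 = 2.0, 1.125
print(bf10_z(z, tau2))             # ~2.90
print(bf10_z_numeric(z, tau2))     # agrees to numerical precision
```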
Theorem 2.
t test. Assume that the distributions of a random variable t under the null and alternative hypotheses are described by
[8] t | H0 ∼ T(ν, 0),
[9] t | λ, H1 ∼ T(ν, λ),  λ | τ² ∼ J(0, τ²).
Then, the Bayes factor in favor of the alternative hypothesis is
[10] \mathrm{BF}_{10}(t \mid \tau^2) = (\tau^2+1)^{\nu/2-1} \left(\frac{\nu+t^2}{w+t^2}\right)^{(\nu+1)/2} \left[1 + \frac{(\nu+1)\,\tau^2 t^2}{w+t^2}\right], where w = ν(τ² + 1).
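A similar hedged sketch for Eq. 10 (function names ours): the closed form is checked against a direct marginalization of the noncentral t density over the J(0, τ²) prior.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import nct, t as t_dist

def bf10_t(t, nu, tau2):
    """Closed-form Bayes factor for a t statistic (Eq. 10)."""
    w = nu * (tau2 + 1.0)
    return ((tau2 + 1.0) ** (nu / 2.0 - 1.0)
            * ((nu + t ** 2) / (w + t ** 2)) ** ((nu + 1.0) / 2.0)
            * (1.0 + (nu + 1.0) * tau2 * t ** 2 / (w + t ** 2)))

def bf10_t_numeric(t, nu, tau2):
    """Same Bayes factor by integrating the noncentral t density over the prior."""
    prior = lambda lam: (lam ** 2 / (tau2 * np.sqrt(2 * np.pi * tau2))
                         * np.exp(-lam ** 2 / (2 * tau2)))
    m1, _ = quad(lambda lam: nct.pdf(t, nu, lam) * prior(lam), -np.inf, np.inf)
    return m1 / t_dist.pdf(t, nu)

print(bf10_t(2.0, 30, 1.0))          # closed form
print(bf10_t_numeric(2.0, 30, 1.0))  # numerical check; the two should agree closely
```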
Theorem 3.
χ² test. Assume that the distributions of a random variable h under the null and alternative hypotheses are described by
[11] h | H0 ∼ H(k, 0),
[12] h | λ, H1 ∼ H(k, λ),  λ | τ² ∼ G(k/2 + 1, 1/(2τ²)).
Then, the Bayes factor in favor of the alternative hypothesis is
[13] \mathrm{BF}_{10}(h \mid \tau^2) = (\tau^2+1)^{-(k/2+1)} \left(1 + \frac{\tau^2 h}{k(\tau^2+1)}\right) \exp\!\left[\frac{\tau^2 h}{2(\tau^2+1)}\right].
For k = 1 and z² = h, the Bayes factor in Eq. 13 has the same value as the Bayes factor specified in Eq. 7. The choice of the shape parameter as k/2 + 1 for the gamma density (a scaled χ²_{k+2} random variable) in Eq. 12 was based on the fact that χ²_ν distributions are not 0 at the origin for integer degrees of freedom ν < 3. Thus, they are not NAP densities for ν < 3 and typically cannot provide strong evidence in favor of true null hypotheses without large sample sizes (19).
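The following illustrative sketch implements Eq. 13, checks it against a direct marginalization of the noncentral χ² density over the G(k/2 + 1, 1/(2τ²)) prior of Eq. 12, and verifies the k = 1, h = z² equivalence with Eq. 7 noted above.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import ncx2, chi2, gamma

def bf10_chi2(h, k, tau2):
    """Closed-form Bayes factor for a chi-squared statistic (Eq. 13)."""
    a = tau2 + 1.0
    return a ** -(k / 2.0 + 1.0) * (1.0 + tau2 * h / (k * a)) * np.exp(tau2 * h / (2.0 * a))

def bf10_chi2_numeric(h, k, tau2):
    """Same Bayes factor by integrating the noncentral chi^2 density over the
    G(k/2 + 1, 1/(2 tau^2)) prior on the noncentrality parameter."""
    prior = gamma(a=k / 2.0 + 1.0, scale=2.0 * tau2)  # rate 1/(2 tau^2) == scale 2 tau^2
    m1, _ = quad(lambda lam: ncx2.pdf(h, k, lam) * prior.pdf(lam), 0.0, np.inf)
    return m1 / chi2.pdf(h, k)

h, k, tau2 = 12.65, 6, 0.85
print(bf10_chi2(h, k, tau2))          # closed form
print(bf10_chi2_numeric(h, k, tau2))  # numerical check

# For k = 1 and h = z^2, Eq. 13 reproduces the z-statistic Bayes factor of Eq. 7.
z = 2.0
bf_z = (tau2 + 1) ** -1.5 * (1 + tau2 * z ** 2 / (tau2 + 1)) * np.exp(tau2 * z ** 2 / (2 * (tau2 + 1)))
print(bf10_chi2(z ** 2, 1, tau2), bf_z)  # identical values
```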
Theorem 4.
F test. Assume that the distributions of a random variable f under the null and alternative hypotheses are described by
[14] f | H0 ∼ F(k, m, 0),
[15] f | λ, H1 ∼ F(k, m, λ),  λ | τ² ∼ G(k/2 + 1, 1/(2τ²)).
Then, the Bayes factor in favor of the alternative hypothesis is
[16] \mathrm{BF}_{10}(f \mid \tau^2) = (\tau^2+1)^{m/2-1} \left(\frac{kf+m}{kf+v}\right)^{(k+m)/2} \left[1 + \frac{(k+m)\,\tau^2 f}{kf+v}\right], where v = m(τ² + 1).
For k = 1 and t² = f, the Bayes factor in Eq. 16 has the same value as the Bayes factor specified in Eq. 10.
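An analogous sketch for Eq. 16 (illustrative; the gamma prior is parameterized by its rate, 1/(2τ²), as in Eq. 15): the closed form, with v = m(τ² + 1), is checked against marginalization of the noncentral F density.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import ncf, f as f_dist, gamma

def bf10_f(x, k, m, tau2):
    """Closed-form Bayes factor for an F statistic (Eq. 16), with v = m(tau^2 + 1)."""
    v = m * (tau2 + 1.0)
    return ((tau2 + 1.0) ** (m / 2.0 - 1.0)
            * ((k * x + m) / (k * x + v)) ** ((k + m) / 2.0)
            * (1.0 + (k + m) * tau2 * x / (k * x + v)))

def bf10_f_numeric(x, k, m, tau2):
    """Same Bayes factor by integrating the noncentral F density over the gamma prior."""
    prior = gamma(a=k / 2.0 + 1.0, scale=2.0 * tau2)
    m1, _ = quad(lambda lam: ncf.pdf(x, k, m, lam) * prior.pdf(lam), 0.0, np.inf)
    return m1 / f_dist.pdf(x, k, m)

x, k, m, tau2 = 4.05, 2, 82, 0.8      # illustrative values
print(bf10_f(x, k, m, tau2))          # closed form
print(bf10_f_numeric(x, k, m, tau2))  # numerical check; the two should agree closely
```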
A. Bayes Factors as Functions of Standardized Effect Size.
Theorems 1–4 describe Bayes factors based on classical test statistics. Like other Bayes factors, these Bayes factors depend on a hyperparameter τ2. Rather than ignoring this dependence and simply reporting a single Bayes factor, we construct BFFs that vary with τ2. Unfortunately, τ2 is difficult to interpret scientifically, and, as we show below, its interpretation changes from one test to another. For this reason, we report BFFs as functions of standardized effect sizes.
To illustrate this procedure, consider a z test of a null hypothesis H0 : μ = 0 based on a random sample x1, …, xn, where xi ∼ N(μ, σ²) and σ² is known. For this test, z = √n x̄/σ, and the distribution of z is
[17] z | μ ∼ N(√n μ/σ, 1).
Under the null hypothesis, z ∼ N(0, 1). Under the alternative hypothesis, μ defines the deviation from the null value 0 and is called the effect size. The noncentrality parameter is λ = √n μ/σ.
Effect sizes are often standardized. Standardization can serve two purposes. First, it makes the effect size invariant to units of measurement—for example, whether weights are measured in ounces or grams. Second, standardization scales the effect size according to the random variation between observational units. For the z test, the effect size μ can be standardized by dividing it by the SD of the observations, leading to a standardized effect ω = μ/σ. The noncentrality parameter for the test can then be expressed as λ = √n ω. Cohen categorizes standardized effect sizes as small (0.2), medium (0.5), and large (0.8), and effect sizes larger than 1.0 are not common in the social and behavioral sciences (23, 24).
Given a relationship between a standardized effect size ω and a noncentrality parameter λ, we compute the BFF by setting τ2 so that the modes of the prior density on λ occur at values defined by standardized effect sizes ω.
More explicitly, suppose that the noncentrality parameter λ can be written as a function of the standardized effect size ω as λ = r(ω), and let π(λ | τ²) denote a generic prior density on λ given τ². For given ω, τω² is implicitly defined such that
[18] \arg\max_{\lambda}\ \pi(\lambda \mid \tau_\omega^2) = r(\omega),
that is, τω² is the value of τ² that makes the prior modes equal to r(ω). Given τω² for a range of ω values, the BFF based on x consists of ordered pairs (BF10(x | τω²), ω).
To illustrate this procedure, consider again the test of a normal mean. The noncentrality parameter for this test is λ = √n ω. The default prior on the noncentrality parameter λ is a J(0, τ²) distribution, which has maxima at ±√(2τ²) (that is, at ±τ√2). Equating the noncentrality parameter to the modes of the prior density (i.e., setting √n ω = √(2τω²)) implies that τω² = nω²/2.
Fig. 1 displays the BFF in favor of the alternative hypothesis using the mapping τω2 = nω2/2 when z = 2.0 and n = 100. From the BFF, we can conclude that the maximum Bayes factor in favor of the alternative hypothesis is 2.90 and that this Bayes factor is achieved when the prior modes on the standardized effect size are ±0.15. That is, the value of 2.9 is obtained conditionally on the (data-dependent) selection of τω2 = 1.125 and a J(0, 1.125) alternative prior on the noncentrality parameter λ. More generally, the selection of the maximum Bayes factor from the BFF provides the strongest information in favor of the alternative hypothesis that can be obtained from within the specified family of prior densities on the noncentrality parameter.
Fig. 1.
Plot of the BFF, BF10(2|100ω2/2), against ω for a z test with z = 2 and n = 100. Bayes factors are displayed as odds in favor of the alternative hypothesis. The vertical axis is displayed on the logarithmic scale. The vertical dotted line indicates that the maximum Bayes factor of 2.90 corresponds to a standardized effect size of 0.15. The horizontal line at 1 : 1 denotes a Bayes factor of 1.0.
The odds in favor of alternative hypotheses fall below 1:1 when the alternative hypotheses are centered on standardized effect sizes greater than 0.4; they fall below 1:5 when the prior modes of the alternative prior density on standardized effect size are greater than 0.8. Other examples of BFFs for z statistics are provided in SI Appendix.
As the standardized effect size approaches 0, the alternative hypothesis becomes indistinguishable from the null hypothesis, driving the Bayes factor to 1. The red, orange, blue, and green zones in this figure are arbitrarily colored and correspond to very small (0 to 0.1), small (0.1 to 0.35), medium (0.35 to 0.65), and large (> 0.65) standardized effect sizes, respectively.
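The Fig. 1 calculation can be reproduced with the short sketch below (illustrative code; bf10_z is redefined so the snippet stands alone). It maps standardized effect sizes to τω² = nω²/2, evaluates Eq. 7 along a grid of ω, and locates the maximum quoted above.

```python
import numpy as np

def bf10_z(z, tau2):
    """Closed-form Bayes factor for a z statistic (Eq. 7)."""
    a = tau2 + 1.0
    return a ** -1.5 * (1.0 + tau2 * z ** 2 / a) * np.exp(tau2 * z ** 2 / (2.0 * a))

z, n = 2.0, 100
omega = np.linspace(0.01, 1.0, 991)   # grid of standardized effect sizes
tau2 = n * omega ** 2 / 2.0           # mode-matching relation: tau_omega^2 = n * omega^2 / 2
bff = bf10_z(z, tau2)                 # the BFF evaluated along the grid

i = np.argmax(bff)
print(omega[i], bff[i])               # ~0.15 and ~2.90
print(omega[bff >= 1.0].max())        # odds fall below 1:1 beyond roughly omega = 0.4
```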
Table 1 provides a mapping between standardized effect sizes ω and τω² for several common statistical tests. Special cases of tests in the “Multinomial/Poisson” row include Pearson’s χ² test for goodness of fit (s = 0 and the null cell probabilities fully specified) and tests for independence in contingency tables (k − s − 1 = (# rows − 1)(# columns − 1)). Recall that a test for the value of a binomial proportion can be framed as Pearson’s χ² test. In contrast, a test for a difference in proportions can be specified as a test for independence in contingency tables.
Table 1.
Default choices for τω2
Test | Statistic | Standardized effect (ω) | τω² |
---|---|---|---|
1-sample z | |||
1-sample t | |||
2-sample z | |||
2-sample t | |||
Multinomial/Poisson | |||
Linear model | |||
Likelihood ratio |
For one-sample tests, x1, …, xn are assumed to be iid N(μ, σ²), where n refers to sample size. In two-sample tests, xj,1, …, xj,nj, j = 1, 2, are assumed to be iid N(μj, σ²). Integers n1 and n2 refer to sample sizes in each group. A bar over a variable denotes the sample mean. The variance of normal observations is denoted by σ² and is assumed to be equal in both groups in two-sample tests. Standard deviations are denoted by s and are the pooled estimate in the two-sample t test. In multinomial/Poisson tests, p(θ) maps an s × 1 parameter vector θ into a k × 1 probability vector, where k denotes the number of cells. The degrees of freedom ν equals k − s − 1. The quantities pi and ni represent cell probabilities and counts, respectively, and n is the sum of all cell counts. In the linear model, the alternative hypothesis is H1 : Aβ ≠ a, where A is a k × p matrix of rank k, β is a p × 1 vector of regression coefficients, and a is a k × 1 vector. The quantities RSS0 and RSS1 denote the residual sum of squares under the null and alternative hypotheses, respectively. The quantity n is the number of observations, and σ² is the observational variance. In the likelihood ratio test, l(⋅) denotes the likelihood function for a parameter vector θ. The k × 1 subvector of tested parameters in θ is fixed at its null value under the null hypothesis. The maximum likelihood estimate of θ under the alternative hypothesis is unrestricted, and under the null hypothesis it is computed subject to the null constraint. In the linear model and likelihood ratio tests, the matrix L−1 represents the Cholesky decomposition of the covariance matrix for the tested parameters, scaled to a single observation. Further explanation of the τω² values appears in SI Appendix.
The last three rows in Table 1 contain vectors of standardized effects ω. Because the recommended value of τω² depends on ω only through the inner product ω′ω, in many applications it is easier to study the BFF as a function of the root mean square effect size (RMSES), ω̃, defined as
[19] \tilde{\omega} = \left(\omega'\omega / k\right)^{1/2}.
The last entry in Table 1 provides default choices for τω2 based on the asymptotic distribution of the likelihood ratio statistic. This choice is based on classical results summarized in, for example, ref. 25. This entry is of particular interest due to the widespread application of the likelihood ratio test statistic in nonlinear models.
Justification for the values of τω2 in Table 1 appears in SI Appendix.
2. Applications
The following examples show how BFFs can be used to summarize outcomes of hypothesis tests based on χ2 and F test statistics.
A. Cancer Sites and Blood Type Association.
White and Eisenberg (26) collected data from 707 patients with stomach cancer and investigated the association between cancer site and blood type. Data from their study are summarized in Table 2. The χ² statistic for testing independence for these data is 12.65 on 6 degrees of freedom (P = 0.049).
Table 2.
White and Eisenberg’s classification of cancer patients
Results for the following blood groups | |||
---|---|---|---|
Site | O | A | B or AB |
Pylorus and antrum | 104 | 140 | 52 |
Body and fundus | 116 | 117 | 52 |
Cardia | 28 | 39 | 11 |
Extensive | 28 | 12 | 8 |
Bayes factors to test the independence of cancer site and blood type were previously calculated by refs. 27, 28, and 18. The Bayes factors reported in refs. 27 and 28 require the specification of prior distributions on the marginal probabilities of blood type and cancer site under the null hypothesis and the specification of a Dirichlet distribution on all combinations of blood type × cancer site probabilities under an alternative model. Johnson (18) maximized a Bayes factor based on the chi-squared statistic similar to that proposed in Theorem 3, except that the prior on the noncentrality parameter was a scaled chi-squared distribution on 6 degrees of freedom (rather than 8). The scale of the chi-squared prior was chosen to maximize the Bayes factor against the null hypothesis of independence. The Bayes factors reported in refs. 18, 27, and 28 were 2.97, 3.02, and 3.06, respectively.
Fig. 2 displays the BFF as a function of the RMSES using results from Theorem 3 and the τω² values provided in Table 1. The maximum Bayes factor in favor of dependence equals 3.07 and occurs at a very small value of the RMSES. The Bayes factor favors the independence model against alternatives with ω̃ > 0.07. A standardized effect size of 0.2 represents what is often considered a small standardized effect in the social science and medical literature (23), and the Bayes factor against such an effect size for these data is greater than 400:1. The Bayes factors reported in refs. 18, 27, and 28 and the maximum value of 3.07 reported here are, for many practical purposes, similar in their scientific interpretation. They all suggest that the data support an alternative hypothesis of nonindependence three times more than the null hypothesis of independence. However, the previous methods do not emphasize that evidence against the null hypothesis is garnered only for alternative hypotheses representing very small effect sizes (ω̃ < 0.07), and even for such small effect sizes, the evidence is weak.
Fig. 2.
Plot of the BFF, BF10(12.65 | τω̃²), against ω̃ for a χ²₆ test with h = 12.65. Bayes factors are displayed as odds in favor of dependence between patient blood type and cancer site. The vertical axis is displayed on the logarithmic scale. The color coding is consistent with Fig. 1. The horizontal line in the plot corresponds to a Bayes factor of 1.0 (odds of 1:1).
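A sketch of the Fig. 2 computation (illustrative; because the Table 1 entries are not shown above, the mapping τω̃² = nω̃² used here is an assumption, although it recovers the values quoted in the text): Eq. 13 is evaluated at h = 12.65 and k = 6 along a grid of RMSES values with n = 707.

```python
import numpy as np

def bf10_chi2(h, k, tau2):
    """Closed-form Bayes factor for a chi-squared statistic (Eq. 13)."""
    a = tau2 + 1.0
    return a ** -(k / 2.0 + 1.0) * (1.0 + tau2 * h / (k * a)) * np.exp(tau2 * h / (2.0 * a))

h, k, n = 12.65, 6, 707                     # test statistic, degrees of freedom, total count
omega = np.linspace(0.005, 0.25, 2000)      # grid of RMSES values
tau2 = n * omega ** 2                       # assumed mapping from RMSES to tau^2 (see caveat above)
bff = bf10_chi2(h, k, tau2)

print(bff.max())                            # ~3.07, the maximum evidence for dependence
print(omega[bff >= 1.0].max())              # support for dependence vanishes near 0.07
print(1.0 / bf10_chi2(h, k, n * 0.2 ** 2))  # >400:1 against a "small" effect of 0.2
```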
B. Biases Associated with Confirmatory Information Processing.
To illustrate BFFs based on F statistics in replicated studies, we turn to a study reported in ref. 29 that was replicated in ref. 30. Both studies sought to determine whether states of self-regulation depletion or ego threat caused participants to exhibit more bias in confirmatory information processing. The studies compared preferences for decision-consistent and decision-inconsistent information processing between three groups: high depletion of self-regulation, low depletion of self-regulation, and ego-threatened subjects. The dependent variable consisted of a normalized score for participants’ selection of decision-consistent and decision-inconsistent reports regarding a hiring decision upon which they had made a preliminary decision. The original study’s authors recruited 85 undergraduate students as subjects, while 140 subjects participated in the replicated study. Differences between the outcomes in the three groups were assessed using one-way ANOVA. The F statistics reported in the original and replicated studies were F2, 82 = 4.05 (P = 0.021) and F2, 137 = 1.99 (P = 0.141), respectively.
To construct a Bayes factor from these studies, it is necessary to define the hypotheses being tested. To make the discussion more general, we assume a total of S studies; S = 2 in this example.
Let x1, …, xS denote S independent F statistics with numerator degrees of freedom k1, …, kS and denominator degrees of freedom m1, …, mS, respectively. In the present case, k1 = k2 = 2, and m1 = n1 − 3 and m2 = n2 − 3, where n1 = 85 and n2 = 140 are the sample sizes in the two studies.
Under the null hypothesis, we assume
[20] xs | H0 ∼ F(ks, ms, 0),  s = 1, …, S.
Given the independence of the {xs}, the marginal density of the data under the null hypothesis is
[21] m_0(x_1, \ldots, x_S) = \prod_{s=1}^{S} f(x_s \mid k_s, m_s, 0),
where f(⋅ | k, m, λ) denotes the density of an F(k, m, λ) random variable.
Under the alternative hypothesis, we assume
[22] xs | λs, H1 ∼ F(ks, ms, λs),  λs | τs² ∼ G(ks/2 + 1, 1/(2τs²)),  s = 1, …, S.
Different prior distributions are specified for the noncentrality parameters {λs} to account for the dependence of these parameters on each study’s sample size. However, the rate parameters τs² that define these prior densities were determined from a common RMSES, ω̃. This stipulation models the belief that the interventions have similar effects across studies.
Assuming that the noncentrality parameters are conditionally independent across studies, it follows that the marginal density of the data under the alternative hypothesis is
[23] m_1(x_1, \ldots, x_S \mid \tilde{\omega}) = \prod_{s=1}^{S} \int_0^{\infty} f(x_s \mid k_s, m_s, \lambda_s)\, \pi(\lambda_s \mid \tau_s^2)\, d\lambda_s = \prod_{s=1}^{S} m_1(x_s \mid \tau_s^2).
Here, the dependence of the marginal densities on the assumed values of ω̃ and the τs² has been indicated. Dividing Eq. 23 by Eq. 21 leads to
[24] \mathrm{BF}_{10}(x_1, \ldots, x_S \mid \tilde{\omega}) = \prod_{s=1}^{S} \mathrm{BF}_{10}(x_s \mid \tau_s^2).
Thus, the Bayes factor for the combined study can be obtained by multiplying the Bayes factors from the individual studies.
Eq. 24 can be applied generally to obtain Bayes factors based on independent z, t, χ2, and F statistics using Theorems 1–4 and Table 1, under the assumption that noncentrality parameters are drawn independently from their prior distributions.
Returning to our example, the Bayes factors for the two studies can be combined according to Eq. 24 and Theorem 4 as follows:
[25] \mathrm{BF}_{10}(x_1, x_2 \mid \tilde{\omega}) = \mathrm{BF}_{10}(x_1 \mid \tau_1^2) \times \mathrm{BF}_{10}(x_2 \mid \tau_2^2),
where each factor on the right-hand side is computed from Eq. 16 with τs² determined by ω̃ and Table 1.
Fig. 3 depicts the BFF versus RMSES for the combined data alongside the BFFs from the original and replication studies. By combining information from the two studies, we see that there is support for very small or small standardized effect sizes, with greater than 2:1 support for RMSES greater than 0.05 and less than 0.26. There is no support for standardized effect sizes greater than 0.31. The maximum Bayes factor against the null hypothesis of no effect was obtained for an RMSES of 0.14, where it was 5.75:1 against the null hypothesis of no group effect.
Fig. 3.
BFFs for confirmatory information processing studies. The upper dotted line depicts the BFF from the original study; the lower dotted line is for the replicated study. The solid line represents the BFF obtained by multiplying the Bayes factors from the two studies.
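The combined BFF in Fig. 3 can be sketched as follows (illustrative code; the per-study mapping τs² = ns ω̃²/2 is an assumption standing in for the relevant Table 1 entry, although it reproduces the quoted maximum of roughly 5.75:1 near ω̃ = 0.14). Each study’s Bayes factor is computed from Eq. 16, and the two are multiplied as in Eq. 24.

```python
import numpy as np

def bf10_f(x, k, m, tau2):
    """Closed-form Bayes factor for an F statistic (Eq. 16), with v = m(tau^2 + 1)."""
    v = m * (tau2 + 1.0)
    return ((tau2 + 1.0) ** (m / 2.0 - 1.0)
            * ((k * x + m) / (k * x + v)) ** ((k + m) / 2.0)
            * (1.0 + (k + m) * tau2 * x / (k * x + v)))

# Original and replication studies: (F statistic, numerator df, denominator df, sample size).
studies = [(4.05, 2, 82, 85), (1.99, 2, 137, 140)]

omega = np.linspace(0.01, 0.5, 981)         # grid of RMSES values
combined = np.ones_like(omega)
for f_stat, k, m, n in studies:
    tau2 = n * omega ** 2 / 2.0             # assumed mapping from RMSES to tau_s^2
    combined *= bf10_f(f_stat, k, m, tau2)  # Eq. 24: multiply the per-study Bayes factors

i = np.argmax(combined)
print(omega[i], combined[i])                # ~0.14 and ~5.7:1 against the null
print(omega[combined >= 2.0].min(), omega[combined >= 2.0].max())  # ~0.05 to ~0.26
```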
3. Discussion
The Bayes factors and BFFs described above are based on the specification of normal-moment and gamma prior densities imposed on scalar noncentrality parameters. Other prior specifications on noncentrality parameters are, of course, possible. However, the proposed prior densities possess several attractive features. Among these, they represent NAP densities, making it possible to accumulate evidence more rapidly in favor of true null hypotheses. They also yield closed-form expressions for Bayes factors, which facilitates BFF calculation. The coefficients of variation of the gamma priors in Theorems 3 and 4 are equal to (k/2 + 1)^(−1/2) and so depend only on the (numerator) degrees of freedom of the test statistics. Under the proposed framework, the standard deviations of prior distributions on noncentrality parameters are thus scaled according to sample size. When used in conjunction with the normal moment priors specified in Theorems 1 and 2, these choices also yield Bayes factors that are invariant to the choice of the test statistic in the sense that z and z² = χ²₁ tests and tν and tν² = F1,ν tests produce the same Bayes factors when a common value of τ² is selected.
For the test statistics considered above, it is possible to compute a “maximum BFF” as the ratio of that test statistic’s noncentral alternative density to its central density under the null hypothesis (without averaging over a prior density). This procedure essentially produces a plot of the likelihood ratio for each test statistic. Because the probability that the test statistic matches this maximum value is either zero (continuous data) or small (discrete data), the maximum Bayes factor reported from such a procedure overstates evidence in favor of the alternative hypothesis. This procedure also precludes the collection of evidence in favor of true null hypotheses and fails to model the variability of standardized effect sizes across studies.
To a lesser extent, similar concerns also affect the interpretation of the maximum of the BFF defined using Theorems 1–4. However, these Bayes factors are obtained by averaging over prior densities and should be interpreted from a conditional perspective. For example, under the specified model assumptions, an appropriate interpretation of the data collected to study confirmatory information processing biases is that the maximum Bayes factor against the null hypothesis is at most 5.75:1; this Bayes factor occurs for the prior density corresponding to an RMSES of 0.14; and the data do not support alternative hypotheses with priors centered on RMSES greater than 0.31. Bayes factors against the null hypothesis are greater than 2:1 for alternative hypotheses corresponding to RMSES between approximately 0.05 and 0.26. Along similar lines, in the study of associations between cancer sites and blood types, there is no support for alternative hypotheses centered on RMSES greater than 0.07, and there is greater than 400:1 support against alternatives representing even small standardized effect sizes (ω̃ = 0.2).
An advantage of the conditional approach inherent to BFFs is that they provide evidence supporting specific alternative hypotheses. This is important when subjective prior information regarding effect size magnitude is unavailable. By integrating the likelihood function with respect to parameter values most consistent with a given effect size, Bayes factors in favor of plausible alternative hypotheses are thus not adversely affected by default prior specifications that place significant mass on unrealistically large or small parameter values under the given hypotheses.
Many scientists now acknowledge the critical role that replication studies play in improving the reproducibility of scientific studies (31, 32). The final example demonstrates that BFFs provide a formal mechanism to combine information collected across replicated experiments using only reported test statistics. Their use thus provides a potential tool for enhancing the reproducibility of scientific research.
Finally, BFFs provide a viable alternative to the report of P-values. Using BFFs, researchers can quickly assess the level of support that data provide to alternative hypotheses centered on a range of standardized effect sizes as well as the scientific significance of those standardized effects.
An R package to calculate default BFFs for tests described in this article, “BFF,” is available for download at cran.r-project.org.
Supplementary Material
Appendix 01 (PDF)
Acknowledgments
We thank M. Pourahmadi, A. Bhattacharya, and two anonymous reviewers for their careful reading of the manuscript and helpful comments. This research was supported by NIH CA R01 158113.
Author contributions
V.E.J. designed research; V.E.J., S.P., and R.S. performed research; V.E.J. and R.S. analyzed data; and V.E.J. wrote the paper.
Competing interests
The authors declare no competing interest.
Footnotes
This article is a PNAS Direct Submission.
Data, Materials, and Software Availability
All data described in the article are contained within the article. An R package to implement methods is available in CRAN at https://cran.r-project.org/web/packages/BFF/index.html.
References
- 1. Wasserstein R., Lazar N., The ASA statement on P-values: Context, process, and purpose. Am. Stat. 70, 129–133 (2016).
- 2. Edwards W., Lindman H., Savage L., Bayesian statistical inference for psychological research. Psychol. Rev. 70, 193–242 (1963).
- 3. Berger J. O., Sellke T., Testing a point null hypothesis: The irreconcilability of P values and evidence. J. Am. Stat. Assoc. 82, 112–122 (1987).
- 4. Johnson V. E., Revised standards for statistical evidence. Proc. Natl. Acad. Sci. U.S.A. 110, 19313–19317 (2013).
- 5. Nuzzo R., Scientific method: Statistical errors. Nature 506, 150–152 (2014).
- 6. Greenland S., et al., Statistical tests, P values, confidence intervals, and power: A guide to misinterpretations. Eur. J. Epidemiol. 31, 337–350 (2016).
- 7. Benjamin D. J., Berger J. O., Johannesson M., Johnson V. E., Redefine statistical significance. Nat. Hum. Behav. 2, 6–10 (2017).
- 8. Lakens D., et al., Justify your alpha. Nat. Hum. Behav. 2, 168–171 (2018).
- 9. O’Hagan A., Fractional Bayes factors for model comparison. J. R. Stat. Soc. Ser. B 57, 99–118 (1995).
- 10. Berger J., Pericchi L., “On the justification of default and intrinsic Bayes factors” in Modelling and Prediction Honoring Seymour Geisser, J. Lee, W. Johnson, A. Zellner, Eds. (Springer, New York, 1996), pp. 173–204.
- 11. Berger J. O., Pericchi L. R., The intrinsic Bayes factor for model selection and prediction. J. Am. Stat. Assoc. 91, 109–122 (1996).
- 12. Liang F., Paulo R., Molina G., Clyde M., Berger J., Mixtures of g priors for Bayesian variable selection. J. Am. Stat. Assoc. 103, 410–423 (2008).
- 13. Rouder J., Speckman P., Sun D., Morey R., Bayesian t tests for accepting and rejecting the null hypothesis. Psychonomic Bull. Rev. 16, 225–237 (2009).
- 14. Wagenmakers E. J., Lodewyckx T., Kuriyal H., Grasman R., Bayesian hypothesis testing for psychologists: A tutorial on the Savage-Dickey method. Cognit. Psychol. 60, 158–189 (2010).
- 15. Consonni G., Fouskakis D., Liseo B., Ntzoufras I., Prior distributions for objective Bayesian analysis. Bayesian Anal. 13, 627–679 (2018).
- 16. Etz A., Vandekerckhove J., Introduction to Bayesian inference for psychology. Psychonomic Bull. Rev. 25, 5–34 (2018).
- 17. Morey R., Rouder J., Pratte M., Speckman P., Using MCMC chain outputs to efficiently estimate Bayes factors. J. Math. Psychol. 55, 368–378 (2011).
- 18. Johnson V. E., Bayes factors based on test statistics. J. R. Stat. Soc. Ser. B 67, 689–701 (2005).
- 19. Johnson V. E., Rossell D., On the use of non-local prior densities in Bayesian hypothesis tests. J. R. Stat. Soc. Ser. B 72, 143–170 (2010).
- 20. Rossell D., Telesca D., Non-local priors for high-dimensional estimation. J. Am. Stat. Assoc. 112, 254–265 (2017).
- 21. Cao X., Yang F., On the non-local priors for sparsity selection in high-dimensional Gaussian DAG models. Stat. Theory Relat. Fields 5, 332–345 (2021).
- 22. Pramanik S., Johnson V. E., Efficient alternatives for Bayesian hypothesis tests in psychology. Psychol. Methods (2023).
- 23. Cohen J., Statistical Power Analysis for the Behavioral Sciences (Erlbaum, Hillsdale, N.J., ed. 2, 1988).
- 24. Hedges L., Higgins J., Rothstein H., Borenstein M., Introduction to Meta-Analysis (John Wiley & Sons, Hoboken, N.J., 2021).
- 25. Stuart A., Ord J. K., Kendall’s Advanced Theory of Statistics (Oxford University Press, New York, 1991), vol. 2.
- 26. White C., Eisenberg H., ABO blood groups and cancer of the stomach. Yale J. Biol. Med. 32, 58–61 (1959).
- 27. Albert J., A Bayesian test for a contingency table using independence priors. Can. J. Stat. 18, 347–363 (1990).
- 28. Good I., Crook J., The robustness and sensitivity of the mixed Dirichlet Bayesian test for ‘independence’ in contingency tables. Ann. Stat. 15, 670–693 (1987).
- 29. Fischer P., Greitemeyer T., Frey D., Self-regulation and selective exposure: The impact of depleted self-regulation resources on confirmatory information processing. J. Personality Soc. Psychol. 94, 382–395 (2008).
- 30. Open Science Collaboration, Estimating the reproducibility of psychological science. Science 349 (2015).
- 31. Munafò M. R., et al., A manifesto for reproducible science. Nat. Hum. Behav. 1, 1–9 (2017).
- 32. Ioannidis J., Why replication has more scientific value than original discovery. Behav. Brain Sci. 41, E137 (2018).