Scientific Reports. 2023 Mar 23;13:4731. doi: 10.1038/s41598-023-31838-8

Simple nested Bayesian hypothesis testing for meta-analysis, Cox, Poisson and logistic regression models

Klaus Rostgaard 1,2
PMCID: PMC10036629  PMID: 36959371

Abstract

Many would probably be content to use Bayesian methodology for hypothesis testing, if it were easy, objective and with trustworthy assumptions. The Bayesian information criterion and some simple bounds on the Bayes factor come closest to fitting this bill, but with clear limitations. Here we develop an approximation of the so-called Bayes factor applicable in any biostatistical setting where we have a d-dimensional parameter estimate of interest and the d × d (co-)variance matrix of it. By design the approximation is monotone in the p value. It is thus a tool to transform p values into evidence (probabilities of the null and the alternative hypothesis, respectively). It is an improvement on the aforementioned techniques by being more flexible, intuitive and versatile but just as easy to calculate, requiring only statistics that will typically be available: e.g. a p value or test statistic and the dimension of the alternative hypothesis.

Subject terms: Medical research, Risk factors, Mathematics and computing

Introduction

The majority of epidemiological studies are exercises in measurement; i.e., we try to estimate as accurately as we can some potential (possibly causal) association between an exposure and an outcome. Occasionally it is also of substantive interest to assess the evidence in favor of the null hypothesis, e.g. in terms of a probability of the null hypothesis. When assessed as probabilities, this requires that the evidence in favor of the alternative(s) is assessed too. This is not possible in the traditional frequentist approach to statistical inference, as it is based only on the expectations that flow from assuming a particular null data generating mechanism/model. In the Bayesian approach it is possible, but at the cost of having to specify some priors as input to the calculations. In the standard Bayesian paradigm these priors are supposed to model the beliefs of the investigator or client based on all relevant knowledge, not just studies or experiments similar to the one being analyzed. The subjectivism that flows from that is anathema to the standard scientific learning process, which is one reason why the standard frequentist approach is still dominant today. See Gilboa1 p. 40–48 for an excellent presentation of why you would want to act as a Bayesian in some situations and as a frequentist in other situations regarding the same substantive matters. However, it has often been demonstrated that the evidence for the alternative is weaker than usually recognized in the classical p value based scenario2–4. This would suggest that a Bayesian approach to model assessment would be preferable, if at all feasible.

In the following we shall use the terms model/data generating mechanism M0 as synonymous with a null hypothesis H0, and model/data generating mechanism M1 as synonymous with the alternative hypothesis Ha, also denoted H1. The full-blown Bayesian approach provides the probability of the null hypothesis after seeing the data D, pr(M0|D), from the ratio pr(M1|D)/pr(M0|D), which in turn is constructed as the product of the so-called Bayes factor and the prior odds pr(M1)/pr(M0), see Eqs. (3) and (4). As Eq. (4) states, the Bayes factor is the ratio of the probability of observing the data D under the alternative model to the probability of observing the data under the null model, i.e. the Bayes factor is a ratio of predictive performance on the data D of the data generating mechanisms/models M1 and M0. Hence in the Bayesian framework the Bayes factor is the sole modifier of prior beliefs about the model probabilities into posterior beliefs after having seen the data.

We will argue that it is possible to choose an objective informative “consensus” prior, essentially defined by requiring that the expression for Bayes factor in the case of a univariate interest parameter generalizes in the natural way to the multivariate case, thereby ensuring that the evidence in favor of the alternative is monotone in the likelihood-ratio and hence the p value. Unlike the situation for parameter estimation, Bayes factors depend critically on the priors over the interest parameter θ: $p_1(\theta)$ for M1 and $p_0(\theta)$ for M0 (the latter is trivial), which therefore cannot just be made “uninformative” at no cost, see Kass & Raftery5.

We will argue that the ultimate (pre-data) prior odds (pr(M1)/pr(M0)) in an objective scientific setting should be set to 1. Readers may enter their own subjective (pre-data) prior odds into Eq. (3) and revise posterior inferences accordingly. Often this is all we ask for when we are discussing what our particular study adds (through Bayes factor) to the body of knowledge about the potential association between exposure X and outcome Y.

We provide a simple, defensible, objective way of generating $p_1(\theta)$ and the ensuing inferences, including the Bayes factor, applicable to situations where the data likelihood does not contain a dispersion parameter or where its value can be assumed effectively known. Thus our methodology is immediately applicable in the many epidemiological studies where the interest parameters are estimated using e.g. logistic regression, Poisson regression or Cox regression.

For an accessible overview of Bayesian methodology for epidemiologists and contrasts to traditional statistics, see Wagenmakers et al.6.

The disposition of the paper is as follows. The next section develops and motivates the method. In the first subsection we introduce the setting and notation. The next subsection fully develops our consensus priors for the case of a univariate interest parameter, including choosing the only free parameter λ, which expresses a balance of information content between prior and data. The next subsection swiftly generalizes this methodology to the general multivariate case. The following section compares our approach to existing approaches, including the Bayesian Information Criterion (BIC). In the next section we elaborate our view on how to choose the pre-data prior odds, and discuss alternatives. The next section provides an epidemiological example of why we need this Bayesian approach (we believe in H0) and what comes out of it, and illustrates considerations regarding alternative values of λ. We end the paper with a discussion mainly of how the inferences obtained from using our machinery differ from those obtained with traditional frequentist means.

The method

Setting and notation

We only consider interest parameters summarized in a parameter (vector) θ and assume large-sample asymptotics, i.e. everything is treated as multivariate normal Nd(·,·). Thus we use the same assumptions underlying the standard statistical software output of parameter estimates with associated standard errors, confidence limits, χ²-based test statistics etc. used when analyzing Cox, Poisson and logistic regression models. Stated differently, we at most assume known the maximum likelihood interest parameter estimates and their associated observed covariance matrix (a submatrix of the inverted observed Fisher information matrix), as would be used as the input for a (multivariate) meta-analysis7. These statistics and various test statistics based upon them are the only data that you are always likely to be allowed to communicate in studies of humans. This is a first order approximation to a much more elaborate and accurate calculation of the Bayes factor that would only be possible for someone with access to all the original raw data. On the other hand, this asymptotic approximation yields fully efficient parameter estimation under very reasonable assumptions8. The approach here is an easy addition on top of standard analysis output, and in the end allows us to retrospectively apply it to previous studies using only a few test statistics that should often be available to us (e.g. p values), in line with other model selection criteria like the Akaike Information Criterion (AIC), the BIC and various test-based bounds on the Bayes factor, as surveyed in Held & Ott9. The methodology developed here applies equally to the summary of a single study and to a meta-analysis-style summary of multiple studies.

The notation for the univariate case (d = 1), where all the relevant vectors and matrices can be treated as scalars (numbers), is as follows:

  • Data D: $L(D,\theta) = L_0\exp\left(-\tfrac{1}{2}(\theta-\hat\theta)\,V^{-1}(\theta-\hat\theta)\right)$,

  • Prior M0: $p_0(\theta) = \delta_0$ (all probability mass in the point 0),

  • Prior M1: $p_1(\theta) = N(\theta_1, W)$.

Let $K \equiv V^{-1}$ and $P \equiv W^{-1}$.

We have:

$$\frac{pr(M_1\mid D)}{pr(M_0\mid D)} = BF_{10} \times \frac{pr(M_1)}{pr(M_0)} \tag{1}$$

$$BF_{10} \equiv \frac{pr(D\mid M_1)}{pr(D\mid M_0)} = \frac{\int L(D,\theta)\,p_1(\theta)\,d\theta}{\int L(D,\theta)\,p_0(\theta)\,d\theta} \tag{2}$$

In order to calculate $BF_{10}$ we have to choose $\theta_1$ and W, the mean and covariance, respectively, of the a priori distribution of θ.

The notation for the general case is as follows:

  • Data D: $L(D,\theta) = L_0\exp\left(-\tfrac{1}{2}(\theta-\hat\theta)^t V^{-1}(\theta-\hat\theta)\right)$,

  • Prior M0: $p_0(\theta) = \delta_0$ (all probability mass in the point 0),

  • Prior M1: $p_1(\theta) = N_d(\theta_1, W)$.

Let $K \equiv V^{-1}$ and $P \equiv W^{-1}$.

We have:

$$\frac{pr(M_1\mid D)}{pr(M_0\mid D)} = BF_{10} \times \frac{pr(M_1)}{pr(M_0)} \tag{3}$$

$$BF_{10} \equiv \frac{pr(D\mid M_1)}{pr(D\mid M_0)} = \frac{\int L(D,\theta)\,p_1(\theta)\,d\theta}{\int L(D,\theta)\,p_0(\theta)\,d\theta} \tag{4}$$

In order to calculate $BF_{10}$ we have to choose $\theta_1$ and W, of dimension d and d × d, respectively. θ and other vectors are column vectors; $\theta^t$ denotes θ transposed.

Note that much of the literature on Bayes factors, including Held & Ott9 and Wagenmakers et al.6, gives formulas for and bounds on $BF_{01} = BF_{10}^{-1}$, while we prefer to use $BF_{10}$ to highlight similarities to the usual penalized likelihood methods.
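For concreteness, Eq. (3) is trivial to apply once a Bayes factor is in hand. The R sketch below is ours (not from the paper's Supplementary Methods), with the consensus prior odds of 1 discussed later as the default:

    # Posterior probability of H0 from a Bayes factor BF10 (Eq. 3).
    # prior_odds = pr(M1)/pr(M0); the consensus choice discussed below is 1.
    post_prob_H0 <- function(bf10, prior_odds = 1) {
      post_odds <- bf10 * prior_odds  # pr(M1|D)/pr(M0|D)
      1 / (1 + post_odds)             # pr(M0|D)
    }
    post_prob_H0(0.1484)  # ~0.871, anticipating the practical example below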

An asymptotic Bayes factor for a univariate hypothesis (d=1)

Taking as starting point the typical epidemiological research question “Does X affect the risk of Y in any way?” and its classical statistical formulation as $H_0: \theta = 0$ versus an alternative with no such constraint on θ clearly suggests that the prior $p_1(\theta)$ should be centered at $\theta_1 = 0$. We may have a hunch about the direction of an effect, but noting how rarely anyone dares to consider only one-sided hypotheses etc., it seems irrelevant to consider other values than 0 as the center of the prior. Or stated differently: if we were very certain about where $\theta_1 \neq 0$ should be located, we probably would not need to assess pr(M0) or the Bayes factor in the first place. Other desiderata that we may consider, e.g. that $\theta_1$ should be simple, unique, self-evident, biased towards H0/M0 etc., would point in the same direction. 0 simply seems to be the only point that could possibly fulfill most desiderata.

Assume the above expressions for the priors and likelihood, and $\theta_1 = 0$. For reasons to become apparent, let $P = \lambda K$, $\psi \equiv \lambda/(1+\lambda)$ and $LR = \exp\left(\tfrac{1}{2}\hat\theta K\hat\theta\right)$. Then

$$BF_{10} = LR\,\psi^{1/2}\exp\left(-\tfrac{1}{2}\psi\,\hat\theta K\hat\theta\right) \tag{5}$$

$$BF_{10} = \psi^{1/2}\,LR^{1-\psi} \tag{6}$$

In deviance form, (6) is $\log BF_{10} = \tfrac{1}{2}\log\psi + \tfrac{1-\psi}{2}\chi^2$, where χ² is the difference in deviance between models 0 and 1. See Supplementary Eqs. E1 & E2 for derivations.
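In R the deviance form of Eq. (6) is a one-liner; the function name is ours, for illustration:

    # log Bayes factor in deviance form (Eq. 6), univariate case (d = 1):
    # log BF10 = (1/2) log(psi) + ((1 - psi)/2) * chisq, psi = lambda/(1 + lambda).
    log_bf10_uni <- function(chisq, lambda) {
      psi <- lambda / (1 + lambda)
      0.5 * log(psi) + 0.5 * (1 - psi) * chisq
    }
    exp(log_bf10_uni(chisq = 3.84, lambda = 0.255))  # BF10 at the p = 0.05 boundary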

λ is a ratio between the information in the prior and the data, formalized as $P = \lambda K$. So any formula for calculating λ should reflect this, e.g. $\lambda \propto K^{-1}$, with $\lambda \to 0$ as more data are gathered. If we choose λ large, the alternative θ will be shrunk very much towards 0, M1 will look very similar to M0, and Bayes factor will by necessity be close to 1; i.e. we essentially learn nothing from our data, the inference is what we put into the model in the form of the prior. If on the other hand we make λ too small, we are always going to prefer M0, due to having spread out the probability mass too thinly and hence placed very little in the vicinity of $\hat\theta$. This never-vanishing importance of the choice of the prior when testing hypotheses stands in glaring contrast to the situation where we estimate parameters. There the choice of prior is usually not very important, because as the amount of data increases the posterior distribution will converge to the same limiting distribution5,6.

Bayes factor is maximized by $\psi = 1/(\hat\theta K\hat\theta) \approx \lambda$ for $\lambda \ll 1$, so $\lambda = 1/(\hat\theta K\hat\theta)$ is not too small, shrinks at the right pace as more data are gathered, and effectively maximizes the evidence in favor of the alternative. However, this λ may be too large. We may actually believe in the null as an appropriate approximation of the truth and want Bayes factor to favor the null ($BF_{10} < 1$), the more so the smaller $\hat\theta K\hat\theta$ is below some value. We may obtain this by introducing an upper limit on how large λ may be. Consider $\log BF_{10} = \tfrac{1}{2}\log\psi + \tfrac{1-\psi}{2}\chi^2$. $BF_{10} = 1$ when $\log\psi = (\psi - 1)\chi^2$. This only has a solution besides $\psi = 1$ when $\chi^2 > 1$. E.g. the solution to the equation with $\chi^2 = 2$, corresponding to $\lambda_{max} \approx 0.255$, yields preferences similar to applying the Akaike information criterion (AIC): when the decrease in deviance per dimension is larger than 2 we prefer the alternative, more complicated model; when the decrease in deviance per dimension is smaller than 2 we prefer the simpler model (H0). Likewise, if we choose $\chi^2 = 3.92$ as our “watershed”, corresponding to the usual p = 0.05 accept/reject dichotomy for a one-dimensional hypothesis, this corresponds to $\lambda_{max} \approx 0.022$.
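The quoted values of $\lambda_{max}$ are easy to reproduce numerically: for a chosen watershed χ² one solves $\log\psi = (\psi - 1)\chi^2$ for the non-trivial root $\psi \in (0, 1)$ and maps back to λ. A sketch in base R (our own code):

    # Solve log(psi) = (psi - 1) * chisq for the non-trivial root psi in (0, 1);
    # only meaningful for chisq > 1. Then lambda = psi/(1 - psi).
    lambda_max <- function(chisq) {
      f <- function(psi) log(psi) - (psi - 1) * chisq
      psi <- uniroot(f, interval = c(1e-9, 1 - 1e-9))$root
      psi / (1 - psi)
    }
    lambda_max(2)     # ~0.255, the AIC-like watershed
    lambda_max(3.92)  # ~0.022, the p = 0.05 watershed for d = 1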

Examining the proposal of employing a $\lambda_{max}$ more closely reveals features that may guide its choice. When $\lambda = 1/\chi^2$ throughout, Bayes factor is not completely monotone in χ² (Fig. 1), yielding an argument for introducing a $\lambda_{max}$ small enough to ensure monotonicity. Furthermore, if we require that Bayes factor becomes 1 at some prespecified watershed, then we have to require $\lambda_{max} < 1/1.54 = 0.65$, corresponding to $\chi^2 > 1.54$. Thus there is actually little leeway to choose a sensible $\lambda_{max} > 0.255$, and we would therefore argue against that. It may however make good sense in specific situations to choose $\lambda_{max} < 0.255$.

Figure 1. Bayes factor as a function of χ² and λ. BF0: λ = 1/χ², BF1: λ = 1, BF2: λ = 0.255, BF3: λ = 0.063, BF4: λ = 0.65.

We therefore propose as default

$$\lambda = \min\left(1/(\hat\theta K\hat\theta),\ \lambda_{max}\right) \tag{7}$$

or more generally

$$\lambda = \min\left(1/\Delta DEV_{01},\ \lambda_{max}\right) \tag{8}$$

where $\Delta DEV_{01}$ is the change in deviance between models 0 and 1, and $\lambda_{max} \approx 0.255$. The corresponding Bayes factor is now a continuous, monotonically increasing function of $\Delta DEV_{01}$. For $\Delta DEV_{01} < 1/\lambda_{max}$, Bayes factor is a simple exponential function, reaching its minimum value of $\psi_{max}^{1/2}$ for $\Delta DEV_{01} = 0$.
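Combining Eqs. (6) and (8) gives a complete default recipe for d = 1; a self-contained R sketch (function name ours):

    # Default univariate Bayes factor: Eq. (6) with lambda from Eq. (8).
    bf10_uni <- function(dev01, lambda_max = 0.255) {
      lambda <- min(1 / dev01, lambda_max)  # Eq. (8); dev01 = 0 gives lambda_max
      psi <- lambda / (1 + lambda)
      exp(0.5 * log(psi) + 0.5 * (1 - psi) * dev01)
    }
    bf10_uni(0)  # psi_max^(1/2) ~0.45, the minimum value
    bf10_uni(2)  # ~1 by construction of lambda_max = 0.255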

In some studies we would be more lenient towards formally statistically significant results, either because we would suspect various biases that we could not mitigate, or because the effect sizes we detect as statistically significant would not amount to a practically or clinically meaningful difference. It is the same kind of logic that persuades professional surveyors not to make their studies as large as logistically possible, because random fluctuations are soon swamped by inevitable biases as sources of error10. So we could augment λmax according to such a “practically null” criterion. The watershed χ² would then be of the form $T^2/V$, where V is the variance of the parameter, e.g. estimated from the width of the relevant confidence limits, and |T| is the largest effect size we would tolerate as being in favor of the null hypothesis.

λ being a ratio of information in the prior and the data suggests choosing $\lambda = \nu/\mu$, where ν and μ are counts of some information carrying unit, as yet another way of specifying λ and the watershed in a way that is objective, transparent and transportable. In survival analysis (Cox regression, Poisson regression) the growth of statistical information as the sample grows is reflected more accurately in the number of events observed (= the number of uncensored survival times) than in the number of observational units11,12. This suggests that μ be the number of observed events in the data in survival analysis, perhaps just the number of observed events among the exposed if exposure is rare. ν would then be the equivalent postulated information content in the prior; it would seem equivalent to empirical data containing ν events of the type counted by μ.

Suppose that $\hat\theta \to \tilde\theta \neq 0$ as more data are gathered. Then $W = \lambda^{-1}V = (\hat\theta V^{-1}\hat\theta)V \to \tilde\theta^2$, i.e. the limiting prior variance is then well-defined and constant, thus mimicking having chosen a priori and subjectively a fixed $W = \tilde\theta^2$. Furthermore, the region around 0 where we prefer H0 is of the form $\{\theta : \theta K\theta < c\} = \{\theta : |\theta| < \sqrt{cV}\}$, where V is halved every time we double the number of observations (n) or other information carrying units. Thus the size of this region will be shrinking at the pace of $\sqrt{n}$. Through the device of requiring $\lambda \leq V/T^2$ to accommodate a practical null result, this shrinkage can be halted, to make this region asymptotically fixed. If $\tilde\theta$ is within it we will asymptotically prefer H0; if it is outside that region we will asymptotically end up preferring H1, and the evidence in favor of H1 measured by the Bayes factor will become infinite.

Our approach has been to constrain a hypothetical subjectively specified prior in ways that would make it objective. Evidently we have succeeded in generating a recipe for such a prior that asymptotically behaves as if fixed a priori and subjectively. Conversely, one may ask if this prior is likely also to be a consensus subjective prior, in the sense of representing subjective beliefs in the scientific community on the subject matter to an acceptable degree. The traditional subjective prior $N(\theta_1, W)$ allows us to specify beliefs about θ as a location ($\theta_1$) and a degree of uncertainty about this location (W). Our new prior has introduced the constraint $E\theta = \theta_1 = 0$, and thus we are forced to express our prior beliefs about θ by specifying beliefs only about $E\theta^2 = Var_S(\theta) + (E_S\theta)^2 = \lambda^{-1}V$, where we have used S to designate subjective quantities. This will only potentially be very different from employing the prior $N(\theta_1, W)$ when $|\theta_1|/\sqrt{W}$ is large. However, if we were so sure about where θ was located (without having looked at the data!), it would seem more appropriate to make the comparison between H0 and H1 a comparison between two simple hypotheses, i.e. H1 would be the hypothesis that $\theta = \theta_1$. We have elaborated on bounds on how large λ should be allowed to be. Imagine a subjectively specified $\lambda_S \ll \min(1/\Delta DEV_{01}, \lambda_{max})$. This would almost surely be a consequence of believing in numerically larger effect sizes than what turned out to be the case, and as such signify beliefs in H1, i.e. that θ was far from 0. A person holding such beliefs should be happy to be “corrected” in the direction of more evidence for H1 by our consensus prior. Altogether, we believe that many would-be practitioners of subjective Bayesianism in science would be relieved and happy to employ our admittedly flexible consensus prior for scientific nested model comparison.

Classical Bayesian inference using Bayes factors may suffer from Lindley’s paradox, which has caused some to suggest abandoning Bayes factors altogether for nested hypothesis testing13. In our setup the paradox corresponds to imagining a sequence of test statistics (χ²s) that are constant, but corresponding to monotonically decreasing Vs as more data are gathered13. In that case the Bayes factor will at some point start to favor the null hypothesis over the alternative to an arbitrary degree, despite the test statistic being fixed at some level that would usually be considered by the scientist as strongly in favor of H1. If we use $\lambda = \nu/\mu$, letting $\mu \to \infty$ and hence $\lambda \to 0$, and fix the test statistic and hence $LR = LR_0$, we have $BF_{10} \approx \sqrt{\lambda}\,LR_0 \to 0$, indeed exhibiting Lindley’s paradox. However, the standard data-driven version of Bayes factor proposed here does not suffer from the paradox: the only χ²s that will make us favor H0 are those that we deliberately, through our choice of a watershed χ², have designated as being in favor of H0.

A general asymptotic Bayes factor (d ≥ 1)

When generalizing our Bayes factor from one dimension to multiple dimensions it would seem natural to have a formula and priors that do so too. This is indeed possible. Let $\theta_1 = 0$ and $P = \lambda K$, let $\psi \equiv \lambda/(1+\lambda)$ and $LR = \exp\left(\tfrac{1}{2}\hat\theta^t K\hat\theta\right)$, and assume the above expressions for the priors and likelihood. Then

$$BF_{10} = LR\,\psi^{d/2}\exp\left(-\tfrac{1}{2}\psi\,\hat\theta^t K\hat\theta\right) \tag{9}$$

$$BF_{10} = \psi^{d/2}\,LR^{1-\psi} \tag{10}$$

In deviance form, (10) is $\log BF_{10} = \tfrac{d}{2}\log\psi + \tfrac{1-\psi}{2}\chi^2$, where χ² is the difference in deviance between models 0 and 1. See Supplementary Eqs. E3–E5 for derivations. Obviously monotonicity in LR and hence the p value has been maintained.

P may also be obtained, essentially, from requiring monotonicity of the Bayes factor in the p value, rather than for esthetic and computational reasons, as elaborated below.

The precision matrix of the study, $K = V^{-1}$, is often called the information matrix: it tells us what our study is most informative about, i.e. which parameters can be estimated with the greatest precision. Thus the standard Wald test statistic $\hat\theta^t K\hat\theta$ essentially collects evidence against the null, penalizing deviations from the null of a given size harder in directions where the sample/study is informative than in directions where it is less informative. To be more specific: any covariance matrix, and its inverse, has a representation as a diagonal matrix, i.e. $K = O^t\Lambda O$, where O is a rotation matrix and $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_d)$ with $\lambda_1 \geq \cdots \geq \lambda_d$. Thus the Wald test statistic $\hat\theta^t K\hat\theta$ has a representation in another basis (specified by O) of the form $\sum_{i=1}^d w_i^2\lambda_i$. Noting that most sources of information about θ are likely to resemble our study, and the correlations between components of θ to be roughly similar, choosing the prior precision $P \propto K$ therefore seems an obvious idea. Also, if we view the Bayes factor as an extension of the traditional significance test, we may insist that the Bayes factor in favor of the alternative should increase monotonically as the p value decreases. In our set-up such a constraint can be honored, but it requires that P is diagonal in the same basis as K, and it further puts restrictions on the ranks of the eigenvalues of P. If we furthermore require that any scale copy of P should obey the p value monotonicity constraint, the eigenvalues of P have to obey the same ranking as the eigenvalues of K; see Supplementary Equation E6 for derivations. The only practically viable option for obtaining this is to have $P = c_1 I + c_2 K$, with $c_1, c_2 \geq 0$, noting that $c_1 I$ has the same representation in all bases. The mean of the posterior of θ under M1 will be $m = (K + P)^{-1}K\hat\theta$, which will be exactly in the direction of $\hat\theta$ only when $P \propto K$; or stated differently: m can only be interpreted as merely shrinking $\hat\theta$ towards 0 in case $P \propto K$, which is trivially fulfilled when d = 1.
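The geometric claim at the end of the preceding paragraph is easy to verify numerically: with $P = \lambda K$ the posterior mean is $\hat\theta/(1+\lambda)$, a pure shrinkage towards 0, whereas a prior precision not proportional to K also rotates $\hat\theta$. A small R sketch with arbitrary illustrative numbers:

    # Posterior mean under M1: m = (K + P)^{-1} K theta_hat.
    K <- matrix(c(4, 1, 1, 2), 2, 2)  # illustrative 2 x 2 precision matrix
    theta_hat <- c(1, -0.5)
    lambda <- 0.25

    # P proportional to K: both components shrink by the same factor 1/(1 + lambda).
    m1 <- solve(K + lambda * K, K %*% theta_hat)
    m1 / theta_hat  # both ratios equal 0.8 = 1/(1 + lambda)

    # P not proportional to K: the "shrinkage" differs by component (a rotation).
    P2 <- diag(c(3, 0.1))
    m2 <- solve(K + P2, K %*% theta_hat)
    m2 / theta_hat  # ratios differ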

Further in favor of using $P \propto K$, we note that the information matrix K for Poisson and logistic regression models is derived from a larger information matrix of the form (in standard notation) $X^t W X$, where X is an a priori known design matrix and W is a diagonal matrix of weights of each observation in the data set. These weights will in general depend on θ, but even then K will typically deviate little from $K(\theta_1 = 0)$, i.e. K calculated based on the weights corresponding to the null hypothesis, especially when the effect sizes are small. The expression for the information matrix in Cox regression is more complicated than for Poisson and logistic regression, but the same argument applies: the difference between K and $K(\theta_1 = 0)$ is likely to be small, vindicating the use of $P \propto K$.

So P=λK is not only very convenient, yielding simple formulas; it is also deeply meaningful as the best suggestion absent other prior knowledge about what P should be. For other examples of approaches where the specification of the details of the prior distribution regarding correlation structure etc is based on the data see Chen & Ibrahim14 and Bedrick et al.15.

We have used Eqs. (9) and (10) and derivatives thereof interchangeably. The former is what comes out of our modeling, while the latter is an interpretation of it that provides the more general, robust and accurate way to calculate Bayes factor according to our ideas. Use of Eq. (10), and hence deviance (χ²), firstly guarantees that our inference is indeed monotone in the LR and hence in the p value; secondly, it is in better correspondence with the Savage–Dickey density ratio theorem, which states that Bayes factor quite generally can be calculated by dividing the prior in 0 by the posterior in 0, both under H1, i.e. $BF_{10} = p_1(0)/p_1(0|D)$6. The Savage–Dickey density ratio theorem also provides an argument why the calculation of Bayes factor should be insensitive to the specification of priors and likelihoods for nuisance parameters6.

In Supplementary Equation E7 we provide further heuristic arguments why a universal Bayes factor should look like Eq. (10), using the Savage–Dickey density ratio theorem6.

If λ is not chosen prior to seeing $\hat\theta$, then it must be a function of $\hat\theta^t K\hat\theta$ to ensure monotonicity in the p value; i.e. all $\hat\theta$ for which $\hat\theta^t K\hat\theta = c$ should lead to the same λ, to obtain the same inference.

Bayes factor is maximized by $\psi = d/(\hat\theta^t K\hat\theta) \approx \lambda$ for $\lambda \ll 1$, so $\lambda = d/(\hat\theta^t K\hat\theta)$ is not too small, shrinks at the right pace as more data are gathered, and effectively maximizes the evidence in favor of the alternative.

So we immediately arrive at the natural generalization of our proposed λ estimator for d = 1:

$$\lambda = \min\left(d/(\hat\theta^t K\hat\theta),\ \lambda_{max}\right) \tag{11}$$

or more generally

$$\lambda = \min\left(d/\Delta DEV_{01},\ \lambda_{max}\right) \tag{12}$$

where $\Delta DEV_{01}$ is the change in deviance between models 0 and 1, and $\lambda_{max} \approx 0.255$. The corresponding Bayes factor is now a continuous, monotonically increasing function of $\Delta DEV_{01}$. For $\Delta DEV_{01} < d/\lambda_{max}$, Bayes factor is a simple exponential function, reaching its minimum value of $\psi_{max}^{d/2}$ for $\Delta DEV_{01} = 0$.
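The full default recipe (Eqs. (10) and (12)) thus needs only the change in deviance and the dimension d. A self-contained R sketch (function names ours), also accepting a p value via the χ² quantile:

    # General asymptotic Bayes factor: Eq. (10) with lambda from Eq. (12).
    bf10 <- function(dev01, d, lambda_max = 0.255) {
      lambda <- min(d / dev01, lambda_max)
      psi <- lambda / (1 + lambda)
      exp(0.5 * d * log(psi) + 0.5 * (1 - psi) * dev01)
    }
    # Convenience wrapper: from a p value and the dimension d of H1.
    bf10_from_p <- function(p, d, lambda_max = 0.255) {
      bf10(qchisq(p, df = d, lower.tail = FALSE), d, lambda_max)
    }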

In the general case we may also want to use a “practically null” criterion to put an upper bound on λ. For ease of interpretation and communication we suggest that such a criterion should typically be based on a one-dimensional margin of the interest parameter.

Connections to other theory

Most approaches to “objective” statistical inference have more or less equated “objective” with using minimally informative or even improper priors, including the fiducial approach by Fisher and, in the same spirit, the p value function by Fraser16–18. Inherently, this in many cases clearly favors the null hypothesis in nested hypothesis testing. Our approach to “objective” statistical inference here is the complete opposite. Our starting point is that modern “subjective” Bayesian statistics in the tradition of Savage and others works fine, logically and otherwise1,19; the only real defect in the scientific context is that it may be accused of being “subjective” in its choice of priors. So our project here has been to examine which sensible constraints on the priors could turn this methodology into an objective methodology in the spirit and self-image of hard science. As argued earlier, we believe we have managed to provide not just a data-driven objective prior, but at the same time a likely consensus subjective prior.

In its classical form, the Minimum Description Length (MDL) principle used for model selection corresponds, to a first approximation, to using the AIC20, and as such is likely to yield inferences very similar to those we would obtain from the default version of our Bayes factor. Later developments of the MDL principle have had a less Bayesian flavor21,22. However, the MDL principle and similar-looking penalized likelihood methods22,23 do not seem to match our consensus prior regarding flexibility, ease of calculation and ease of interpretation.

In essence the BIC is obtained from our approach by insisting on $\lambda = \nu/\mu$, where ν and μ are counts of some information carrying unit. In the BIC ν = 1, corresponding to a prior with the information content of a single average observation (or whatever unit we are counting), and as such as little information in the prior as empirically conceivable.

A frequently proposed upper bound on Bayes factor is $1/(-e\,p\log(p))$24. This and other bounds on Bayes factor are surveyed in Held & Ott9. However, neither these bounds nor the BIC admit the flexibility and realism of our approximation of Bayes factor. E.g. the BIC has a clear tendency to favor H0, and in the opposite direction the aforementioned bound on Bayes factor always yields $BF_{10} \geq 1$, corresponding to letting $\psi \to 1$ and thus using improbably precise priors, rendering the data irrelevant for inference.
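The behavior of this bound is easy to inspect: $-e\,p\log(p)$ attains its maximum of 1 at $p = 1/e$, so the bound never falls below 1. A quick check (our own snippet):

    # Upper bound on BF10 from a p value (see Held & Ott, ref. 9).
    bf10_bound <- function(p) 1 / (-exp(1) * p * log(p))
    p <- c(0.001, 0.01, 0.05, exp(-1), 0.5)
    round(bf10_bound(p), 2)  # 53.26  7.99  2.46  1.00  1.06 -- never below 1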

We also note that $BF_{10} = \psi^{d/2}LR^{1-\psi}$ (Eq. 10) also appears as an approximate Bayes factor in work on the fractional Bayes approach (section 2 in O’Hagan25), suggesting a wider applicability of Eq. (10) than stated here. We don’t find this surprising, since we also learn the prior from the data, although we have argued that this is merely for convenience; often the result should be very close to what we could learn from fitting M0. However, we do part company with O’Hagan in our recommendations regarding ψ (section 6 in O’Hagan25).

In Supplementary Equation E4 we have adapted our machinery for use in classical subjective Bayesian inference with a prior of the form $N_d(\theta_1, \lambda^{-1}V)$, thus solving an escalating logistical problem of eliciting/specifying very many parameters of the prior. Specifying d+1 meaningful parameters should certainly be doable.

Much of what we have developed here is foreshadowed, by several decades, at least in the univariate case, by work from the inventor of the Bayes factor, Sir Harold Jeffreys26. Among other things Jeffreys developed approximate expressions for the Bayes factor very similar to (5) and (6), i.e. in its simplest form, in our notation, $BF_{10} \approx \sqrt{\lambda}\,LR$, thus realizing that the Bayes factor roughly is the product of a term that is a function of the p value (LR) and a term ($\sqrt{\lambda}$) that depends on the square root of the information, usually proportional to the number of observations n26. He also realized that large and middle p values represent evidence in favor of H0, not just absence of evidence for H126. Thus the novelty of the present contribution lies only in generalizing and specializing this type of approximate expression for Bayes factor to a broad class of regression models that are completely dominant in e.g. epidemiological research, providing arguments for choosing and scaling (through λ) the priors to be objective and possibly quite informative at the same time.

Odds prior to data—and final inference

Before discussing pre-data prior odds of M0 and M1 we need to understand what the hypotheses really mean. If we were in a position to collect as much data as we would want, we would probably in all but the rarest cases be able to identify an effect size different from 0. So the meaning of M0 is not really that we believe it to be absolutely true, but rather that we believe the effect to be so small as to be predictively indistinguishable from 0 on potentially available data. Bayes factor in our context is a simplifying device. It collects evidence in favor of each hypothesis, and therefore the null hypothesis of a simple model need not be abandoned until the evidence in favor of the alternative that you consider likely is much larger. As such, a non-vanishing pr(M0|D) is a license to ignore the true non-null effect size that we have not been able to pinpoint with sufficient precision. Just as a model is a simplification of reality, the null model is a simplification of an extended model. H0 may be a hypothesis we wish to entertain, for convenience or simplicity, or something we wish to refute in order to demonstrate that some exposure affects the probability of some outcome, e.g. that some treatment is better than another treatment.

In epidemiology we are only likely to identify very small effect sizes with certainty when both the outcome and the exposure are very common, say when studying 30-day mortality following blood transfusion. Then even a tiny apparent relative difference in probability of the outcome by blood product characteristic would, if true and causal, translate into an actionable possibility of avoiding x adverse events per year. Non-trivial decision making is best done using decision theory. But if our interest lies in using Bayes factor as a simplifying device unless overwhelmed by evidence for the alternative, we may do so by a slight change in the meaning of H0 and Ha, to something closer to our implicit interpretation of H0 and Ha also in the situation with very abundant data, by instead considering the posterior probability of θ being 0 or practically 0 ($\theta \in R_\epsilon$), i.e. $pr(M_0\mid D) + pr(\theta \in R_\epsilon \mid M_1, D)\,pr(M_1\mid D)$. This is in the spirit of “modernizations” of the traditional significance test, as advocated in Goodman et al.27 and Blume et al.28.

In order to obtain proper posterior odds and probabilities of hypotheses, you need to assess (pre-data) prior odds of the models/hypotheses. In analogy with our choice of $\theta_1 = 0$ for the consensus parameter prior, we consider pre-data prior model odds pr(M1)/pr(M0) = 1 the best possible practical universal consensus pre-data prior odds. Setting the prior odds equal to 1 corresponds to evaluating the posterior odds at the boundary between your a priori position and the position of your adversary (who favors the opposite hypothesis), where the odds are as far in favor of your adversary’s point of view as you can accommodate. Further, it could be argued that, in keeping with the role of models as simplifying devices, Occam’s razor and the special role assigned to H0 in science, we should always have pr(M1)/pr(M0) ≤ 1. And on the other hand, if we see the point of the test to be to possibly falsify/reject H0, we should have pr(M1)/pr(M0) ≥ 1. For an opposing view in favor of assessing/discussing the true pre-data prior odds in epidemiological studies, see Goodman et al.29.

There are other methods for determining pre-data prior odds, but they do not seem particularly reliable and objective in our view3033. Anyway, the reader of your results can multiply their own pre-data prior odds with your objective Bayes factor to obtain their subjective posterior odds and probabilities.

Finally, we could avoid specifying pre-data prior odds of hypotheses altogether if we instead asked what the expected posterior loss would be if we acted as if some simplifying or interesting assumption (H0) were true, measured in a big estimated model (M1) we believe in30. But this of course requires an elaborate M1 model, and that you can obtain consensus with your readers/clients about what loss function to use.

A practical example

We will illustrate the use of our methodology in an example concerning an eight-dimensional interest parameter, where we believe the null hypothesis to be a good approximation of the truth. We will show the simple calculations involved in assessing Bayes factor based only on statistics published in an epidemiological paper34.

Most people become infected by Epstein-Barr virus (EBV); once infected, the virus persists in the host. In the western world primary EBV infection occurs mostly in infancy (0–3 years) and in the teenage years. Occasionally primary EBV infection is accompanied by infectious mononucleosis; this happens rarely in infancy, but commonly in the teenage years and later. EBV is mostly transmitted through saliva and is not very contagious. Having siblings reduces the risk of infectious mononucleosis, since each sibling may infect you with EBV in infancy, thereby pre-empting primary EBV infection in the teenage years with its associated larger risk of infectious mononucleosis. The protection against infectious mononucleosis obtained from each sibling varies widely by age difference: the smaller the difference in age, the more protection, with younger siblings being more protective than older siblings with the same absolute age difference to the followed-up person. This has been modeled in multiplicative (Poisson or Cox regression) models with time-varying counts of siblings in each of eight disjoint categories of age difference as predictors34.

Infectious mononucleosis is a well-known risk factor for multiple sclerosis, with hazard ratios (HRs) consistently in the range 2–3. Whether the culprit is infectious mononucleosis (an exaggerated immune reaction) per se, or infectious mononucleosis as a marker of so-called delayed EBV infection, is unclear. But based on other evidence the latter seems most likely. If the latter were the case, one should expect the HRs of multiple sclerosis as a function of sibship constellation to be the same as the HRs for infectious mononucleosis as a function of sibship constellation, as modeled by the aforementioned eight-dimensional predictor. This would then be our H0. This was examined in a population-based study of persons born in Denmark since 1971, in a stratified Cox regression model with hospital contacts for multiple sclerosis and infectious mononucleosis, respectively, as outcomes34. In this joint model the interest parameters are $\theta_{IM} = \theta$ and $\theta_{MS} = \theta + \Delta\theta$, with $H_0: \Delta\theta = 0$. The details of the modeling are unimportant here; it suffices to know that the hypothesis of common sibling parameter estimates for the two outcomes was examined using a likelihood-ratio test34.

It is obvious from the paper34 that the alternative hypothesis is 8-dimensional (d = 8). And we are informed that the p value is 0.19, which with this d corresponds to a deviance $\chi^2 = 11.21$. We thus obtain $\lambda = \min(8/11.21, 0.255) = 0.255$ and hence $\psi = \lambda/(1+\lambda) = 0.255/1.255 = 0.203$. Plugging into $\log BF_{10} = \tfrac{d}{2}\log\psi + \tfrac{1-\psi}{2}\chi^2$ yields $BF_{10} = \exp(-1.908) = 0.1484$, and hence [under the assumption of uniform prior odds pr(H1)/pr(H0) = 1] we obtain the posterior probabilities pr(H1|D) = 0.129 and pr(H0|D) = 0.871. We consider these calculations uncontroversial and therefore perfectly adequate for the situation, suggesting that H0 is indeed likely to be true. In this example the AIC-based bound actually constrains λ. If we had not used this bound, we would instead have ended up with $BF_{10} = 0.7923$ and thus pr(H1|D) = 0.442 and hence pr(H0|D) = 0.558, i.e. a result much closer to equiprobability of the two hypotheses, as expected.
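These numbers are reproduced by a few lines of R (a sketch of ours, independent of the paper's Supplementary Methods code):

    # p = 0.19 on an 8-dimensional alternative hypothesis (d = 8).
    d <- 8
    chisq <- qchisq(0.19, df = d, lower.tail = FALSE)  # ~11.21
    lambda <- min(d / chisq, 0.255)                    # = 0.255: the AIC-like bound binds
    psi <- lambda / (1 + lambda)                       # ~0.203
    bf <- exp(0.5 * d * log(psi) + 0.5 * (1 - psi) * chisq)  # ~0.148
    c(prH1 = bf / (1 + bf), prH0 = 1 / (1 + bf))       # ~0.129 and ~0.871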

In the following we will examine various considerations that could potentially suggest a lower λ than the one based on the AIC.

As an example of accommodating a practical null result, let us consider the parameter for the effect of each additional 0–2 years younger sibling. This parameter has the largest effect size for the infectious mononucleosis outcome: $|\log(HR)| = -\log(0.80) = 0.223$. If we take T to be 20% of this effect size, to accommodate for instance that not all infectious mononucleosis is due to EBV, we obtain $T = 0.2 \times 0.223 = 0.045$. The relevant standard error is obtained as $\sqrt{V} = (\log(1.19) - \log(0.89))/3.92 = 0.074$, where the confidence limits 0.89 and 1.19 belong to the estimate of the HR between the common HR for the two outcomes (= the HR for infectious mononucleosis) and the HR for multiple sclerosis per sibling 0–2 years younger. The resulting $\chi^2 = 0.045^2/0.074^2 = 0.362$ is useless. To obtain a useful χ² would require a much larger study (lower V), or that we were much more lenient in our choice of effect sizes favoring H0 (larger T), or both.
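The arithmetic, re-derived in R (variable names ours):

    # "Practically null" watershed for the 0-2 years younger sibling parameter.
    T_tol <- 0.2 * (-log(0.80))           # tolerated effect size T, ~0.045
    se <- (log(1.19) - log(0.89)) / 3.92  # sqrt(V) from the 95% CI, ~0.074
    (T_tol / se)^2                        # watershed chi-squared, ~0.36: useless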

Considering the one-dimensional case, the AIC (χ² = 2) corresponds to using a significance level of 0.16 to distinguish between accepting and rejecting H0. In keeping with the idea of sticking with H0 in the absence of strong evidence against it (low p values), it could be sensible to let e.g. the 90th or 95th percentile of the χ²-distribution be the watershed between supporting H0 and H1. However, we think this type of argument is most reasonable when arguing for convenience null hypotheses, e.g. as a license to avoid modeling and reporting interactions if they are deemed inconsequential and not important for the study. In this case, when it is the central hypothesis we are discussing, it would seem like tilting the scales in the direction of a desired result.

The prior $p_1(\theta)$ employed in our formula is supposed to represent pre-data prior knowledge pertinent to the study. A priori we know more or less the distribution of the interest parameter for the infectious mononucleosis outcome. But we do not know it for the interest parameter regarding the multiple sclerosis outcome (4442 cases). In the study cohort we found 103 cases of multiple sclerosis following infectious mononucleosis at age 12+ years, yielding a standardized incidence ratio of 2.35. This elevated incidence is one of the key inspirations for our hypothesis: the hypothesized protection from having siblings is supposed, in a way, to explain the elevated standardized incidence ratio in people having had infectious mononucleosis as a marker of delayed primary EBV infection. So according to this view, a sensible value on a grid G of the form $\lambda = \nu/\mu \leq \lambda_{max}$, with $\nu \in \{1, 2, 5, 10, 20, 50, \ldots\}$ and μ being counts of some information carrying unit in a reasonable prior and the data, respectively, would be $\lambda = 100/4442 = 0.023$. Using this λ yields $BF_{10} = 5.65 \times 10^{-5}$ and hence a vanishing probability of the alternative hypothesis. However, there are many more studies available on the risk of multiple sclerosis following infectious mononucleosis, yielding remarkably similar results34. Taking these into account would quickly increase ν to a degree where the resulting λ would be the same as when using the AIC. We also note that the way an inconspicuous χ² was in this case turned into overwhelming evidence in favor of H0 exemplifies Lindley’s paradox13.
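Again the calculation is elementary (a sketch, using the same deviance and d as above):

    # lambda = nu/mu: nu = 100 prior-equivalent cases, mu = 4442 MS cases.
    d <- 8; chisq <- 11.21
    lambda <- 100 / 4442                                # ~0.023
    psi <- lambda / (1 + lambda)
    exp(0.5 * d * log(psi) + 0.5 * (1 - psi) * chisq)   # ~5.65e-05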

R code for this example is provided in Supplementary Methods.

Discussion

The traditional frequentist hypothesis test works by collecting evidence against the null (“model criticism”); the methodology of rejecting the null hypothesis when the p value becomes small is the statistical equivalent of Popper’s paradigm of falsifying hypotheses. The Bayesian learning process collects evidence in favor of hypotheses; it is symmetric in the models. The frequentist approach, on the other hand, is designed to prefer a null model (for simplicity), and only grudgingly to let us be persuaded in favor of an (unspecified) alternative when the evidence is very much against the null. Our proposals are primarily intended to enhance the traditional frequentist methodology in a way that only causes us to abandon the null hypothesis if we have a specific alternative that performs noticeably better in terms of predicting the data at hand.

Bayesians have always told frequentists that it is logically wrong and unsound just to consider inferences based on the null model24. Viewed in that context it is slightly embarrassing to end up with an approximate expression for Bayes factor that depends only on the dimension and p value of the hypothesis. Having swallowed this embarrassment, it is however very comforting to be able to translate or calibrate the objective p value for any given hypothesis into a Bayes factor, and thus achieve a more realistic picture of the evidence conveyed by the data in favor of the null and the alternative hypothesis, respectively. It is actually a Bayesian solution to the Fisherian project of making statistical inference using only likelihood functions35!

Having constrained our Bayes factor to be monotonically increasing in the likelihood-ratio has ensured that the formula for Bayes factor for d > 1 is the natural generalization of the case d = 1, where monotonicity is always fulfilled. It has also made our Bayes factor objective, meaningful and corresponding to both frequentist likelihood-ratio-based and pure likelihood inference, and thus likely to be accepted by the scientific community3,4,36. Furthermore, this Bayes factor is easily calculated from standard statistical output, e.g. an uncategorized p value and the dimension of the hypothesis, which is usually available in epidemiological papers, and certainly in statistical software.

How is the Bayesian inference proposed here quantitatively different from classical frequentist inference? If we divide the parameter space into regions where we either reject or accept H0, it is clear from the formula $BF_{10} = \psi^{d/2}LR^{1-\psi}$ that the “accept” regions for the Bayesian and frequentist approaches would be of the same shape and orientation (asymptotically an ellipsoid centered at 0), but with different boundaries, so that the regions where we accept H0 would tend to be larger in the Bayesian approach. For example, if we use Eq. (8) with $\lambda_{max} = 0.255$ and pr(H1)/pr(H0) = 1 as our default methodology, $BF_{10} = 1$ would correspond to p values of 0.1573 and 0.0293 under 1- and 10-dimensional hypotheses, respectively. And $BF_{10} = 19$, corresponding to pr(M0|D) = 0.05, would correspond to p values of 0.0026 and $1.66 \times 10^{-7}$ under 1- and 10-dimensional hypotheses, respectively. These differences in quantitative behavior between our Bayesian proposal and p value based methodology are illustrated in Fig. 2. If the p value is either very large or very small we would of course reach qualitatively the same conclusion irrespective of the chosen method. Thus if we use Bayes factor merely to choose the preferred/most likely model, then the standard inference is exactly the same as when using the AIC. If we instead use Occam’s razor and only deviate from H0 if there is strong evidence against it, then the Bayes factor would lead to fewer rejections of H0 than when using significance testing ($p \leq \alpha$ vs. $Pr(H_0|D) \leq \alpha$). And the evidence for the null and the alternative model is quantified in a meaningful way, as probabilities; something that traditional frequentist inference never came close to.

Figure 2. Lack of support for H0, measured as $-\log(p)$ (Pd) and $-\log(Pr(H_0|D))$ (BFd), respectively, as a function of χ² and dimension d of H1.

The mapping $(p, d) \mapsto BF_{10}$ is monotone in p for fixed d, but is otherwise non-trivial, and should therefore not be guessed at rather than calculated. It would be a grave mistake, missing the point entirely, to just go on using p values in the belief that, due to monotonicity, this would lead to the same statistical inferences as using our Bayes factor.

There have been many attempts to unseat p values as the main vehicle for statistical inference besides confidence intervals24,37. This is yet another attempt to do that, and based on history it is likely to fail. If it fails again, it will only be because many researchers actually love all these statistically significant false positive findings in the quest for funding, promotion and whatnot, or perhaps out of sheer lazy inertia. On the other hand, it would be quite simple for journal editors and other stakeholders to recommend or require “the Bayesian version” of statistical inference to be presented, or taken as the starting point, whenever a test or model choice is deemed relevant. And maybe this methodology could also stop authors from sprinkling their texts with the (usually superfluous) words “statistically significant” when they are in fact only estimating quantities, and there is no strong evidence for some hypothesis in need of being communicated.

Author contributions

K.R. conceived and wrote the paper.

Funding

This paper was not supported by any specific grants.

Data availability

All data generated or analysed during this study are included in this published article and its supplementary information files.

Competing interests

The author declares no competing interests.

Footnotes

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

The online version contains supplementary material available at 10.1038/s41598-023-31838-8.

References

  • 1.Gilboa I. Theory of Decision under Uncertainty (Econometric Society Monographs) New York, NY: Cambridge University Press; 2009. [Google Scholar]
  • 2.Benjamin DJ, et al. Redefine statistical significance. Nat. Hum. Behav. 2018;2:6–10. doi: 10.1038/s41562-017-0189-z. [DOI] [PubMed] [Google Scholar]
  • 3.Wasserstein RL, Schirm AL, Lazar NA. Moving to a world beyond “p < 0.05”. Am. Stat. 2019;73:1–19. doi: 10.1080/00031305.2019.1583913. [DOI] [Google Scholar]
  • 4.Johnson VE. Revised standards for statistical evidence. Proc. Natl. Acad. Sci. 2013;110:19313–19317. doi: 10.1073/pnas.1313476110. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5.Kass RE, Raftery AE. Bayes factors. J. Am. Stat. Assoc. 1995;90:773–795. doi: 10.1080/01621459.1995.10476572. [DOI] [Google Scholar]
  • 6.Wagenmakers EJ, Lodewyckx T, Kuriyal H, Grasman R. Bayesian hypothesis testing for psychologists: A tutorial on the Savage–Dickey method. Cogn. Psychol. 2010;60:158–189. doi: 10.1016/j.cogpsych.2009.12.001. [DOI] [PubMed] [Google Scholar]
  • 7.Jackson D, Riley R, White IR. Multivariate meta-analysis: potential and promise. Stat. Medicine. 2011;30:2481–98. doi: 10.1002/sim.4172. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8.Lin DY, Zeng D. On the relative efficiency of using summary statistics versus individual-level data in meta-analysis. Biometrika. 2010;97:321–332. doi: 10.1093/biomet/asq006. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9.Held L, Ott M. On p-values and Bayes factors. Annu. Rev. Stat. Appl. 2018;5:393–419. doi: 10.1146/annurev-statistics-031017-100307. [DOI] [Google Scholar]
  • 10.Groves, R. M. Survey Errors and Survey Costs. Wiley Series in Probability and Statistics (John Wiley & Sons, Inc., Hoboken, NJ, USA, 1989).
  • 11.Raftery AE. Approximate Bayes factors and accounting for model uncertainty in generalized linear models. Biometrika. 1996;83:251–266. doi: 10.1093/biomet/83.2.251. [DOI] [Google Scholar]
  • 12.Volinsky CT, Raftery AE. Bayesian information criterion for censored survival models. Biometrics. 2000;56:256–262. doi: 10.1111/j.0006-341X.2000.00256.x. [DOI] [PubMed] [Google Scholar]
  • 13.Bernardo, J. M. Nested hypothesis testing: The Bayesian reference criterion. In Bernardo, J. M., Berger, J. O., Dawid, A. & Smith, A. (eds.) Bayesian Statistics 6, 101–130 (Oxford University Press, 1999).
  • 14.Chen M, Ibrahim J. Conjugate priors for generalized linear models. Statistica Sinica. 2003;13:461–476. [Google Scholar]
  • 15.Bedrick EJ, Christensen R, Johnson W. A new perspective on priors for generalized linear models. J. Am. Stat. Assoc. 1996;91:1450–1460. doi: 10.1080/01621459.1996.10476713. [DOI] [Google Scholar]
  • 16.Berger JO. The case for objective Bayesian analysis. Bayesian Anal. 2006;1:385–402. doi: 10.1214/06-BA115. [DOI] [Google Scholar]
  • 17.Fraser D. p values: The insight to modern statistical inference. Annu. Rev. Stat. Appl. 2017;4:1–14. doi: 10.1146/annurev-statistics-060116-054139. [DOI] [Google Scholar]
  • 18.Fraser DAS. The p value function and statistical inference. Am. Stat. 2019;73:135–147. doi: 10.1080/00031305.2018.1556735. [DOI] [Google Scholar]
  • 19.Diaconis P, Skyrms B. Ten Great Ideas About Chance. Princeton University Press; 2017. [Google Scholar]
  • 20.Rissanen J. A universal prior for integers and estimation by minimum description length. Ann. Stat. 1983;11:416–431. doi: 10.1214/aos/1176346150. [DOI] [Google Scholar]
  • 21.Grünwald PD. The Minimum Description Length principle. Cambridge, Massachusetts: The MIT Press; 2007. [Google Scholar]
  • 22.Lanterman AD. Schwarz, Wallace, and Rissanen: Intertwining themes in theories of model selection. Int. Stat. Rev. 2001;69:185–212. doi: 10.1111/j.1751-5823.2001.tb00456.x. [DOI] [Google Scholar]
  • 23.Gilboa I, Schmeidler D. Simplicity and likelihood: An axiomatic approach. J. Econ. Theory. 2010;145:1757–1775. doi: 10.1016/j.jet.2010.03.010. [DOI] [Google Scholar]
  • 24.Benjamin DJ, Berger JO. Three Recommendations for improving the use of p values. Am. Stat. 2019;73:186–191. doi: 10.1080/00031305.2018.1543135. [DOI] [Google Scholar]
  • 25.O’Hagan A. Fractional Bayes factors for model comparison. J. R. Stat. Soc. B. 1995;57:99–138. [Google Scholar]
  • 26.Wagenmakers, E.-J. Approximate objective Bayes factors from p values and sample size: The 3p√n rule. PsyArXiv Preprints (2022).
  • 27.Goodman WM, Spruill SE, Komaroff E. A proposed hybrid effect size plus p value criterion: Empirical evidence supporting its use. Am. Stat. 2019;73:168–185. doi: 10.1080/00031305.2018.1564697. [DOI] [Google Scholar]
  • 28.Blume JD, Greevy RA, Welty VF, Smith JR, Dupont WD. An introduction to second-generation p values. Am. Stat. 2019;73:157–167. doi: 10.1080/00031305.2018.1537893. [DOI] [Google Scholar]
  • 29.Goodman SN. Toward evidence-based medical statistics. 2: The Bayes factor. Ann. Intern. Med. 1999;130:1005–1013. doi: 10.7326/0003-4819-130-12-199906150-00019. [DOI] [PubMed] [Google Scholar]
  • 30.Bernardo, J. M. Integrated objective bayesian estimation and hypothesis testing. In Bernardo, J. M. et al. (eds.) Bayesian Statistics 9, 1–68 (Oxford University Press, 2011).
  • 31.Dreber A, et al. Using prediction markets to estimate the reproducibility of scientific research. Proc. Natl. Acad. Sci. 2015;112:15343–15347. doi: 10.1073/pnas.1516179112. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 32.Nosek BA, Ebersole CR, DeHaven AC, Mellor DT. The preregistration revolution. Proc. Natl. Acad. Sci. 2018;115:2600–2606. doi: 10.1073/pnas.1708274114. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 33.Good, I. J. Some Logic and History of Hypothesis Testing (#1234). In Good Thinking: The Foundation of Probability and Its Applications, chap. 14, 129–148 (Dover Publications, Mineola, New York, 1983), dover edn.
  • 34.Rostgaard, K., Nielsen, N. M., Melbye, M., Frisch, M. & Hjalgrim, H. Siblings reduce multiple sclerosis risk by preventing delayed primary Epstein-Barr virus infection. Brain (2022). [DOI] [PubMed]
  • 35.Efron B. R. A. Fisher in the 21st century. Stat. Sci. 1998;13:95–114. [Google Scholar]
  • 36.Johnson VE. Uniformly most powerful Bayesian tests. Ann. Stat. 2013;41:1716–1741. doi: 10.1214/13-AOS1123. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 37.Goodman SN. Why is getting rid of p values so hard? Musings on science and statistics. Am. Stat. 2019;73:26–30. doi: 10.1080/00031305.2018.1558111. [DOI] [Google Scholar]
