Introduction
Panagiotou and Ioannidis1 (PI) have examined what they term ‘borderline associations’, based on P-value thresholds, and conclude that current genome-wide significance (GWS) guidelines (e.g. 5 × 10⁻⁸) are too stringent. To remedy this problem, PI recommend a P-value threshold of 10⁻⁷, though the statistical rationale for this particular figure is not clear. I agree with PI that the current criteria for declaring an association as ‘significant’ are often not appropriate and will here lay out the arguments for a Bayesian remedy.
In general, a major difficulty with P-values lies in deciding on a threshold for significance, and this is true even for a single test. The current norm within the genome-wide association study (GWAS) literature is for P-value thresholds to be based upon controlling the family-wise error rate (FWER), i.e. the probability of at least one incorrect null rejection, at a level such as 0.05. For example, with m = 1 000 000 tests the Bonferroni approach to controlling the FWER at 0.05 gives a threshold of 0.05/1 000 000 = 5 × 10⁻⁸. This approach, I will argue, is ill-advised since it pays no regard to the power of the tests. In particular, the same threshold is suggested for all sample sizes. A more appealing approach is to have a threshold that changes with power. In this way, a procedure can be formulated that produces both type I and type II errors that decrease to zero as hypothetical studies of increasing sample size are conducted.
PI highlight various approaches to the generic problem of determining significance but do not discuss Bayesian approaches. Below, a Bayesian approach for testing many hypotheses is described which clearly illustrates the role of power. Put simply, all P-values are not born equal, because knowledge of the associated power is vital for interpretation. PI have carried out a tremendous amount of work in cataloging associations that are borderline significant and in tracking down those associations to see if they are reproducible. Unfortunately, however, the power associated with the analysis of each SNP (which may be calculated from the sample size and minor allele frequency) is not reported. The approach detailed here can either be used within a fully Bayesian analysis or, for those not willing to fully commit, can be viewed as a mechanism by which a ‘sensible’ threshold for significance can be determined.
A Bayesian alternative to P-values
For simplicity, we suppose that a case–control study has been carried out and that Δ represents a univariate measure of association. For example, if P is the probability of disease, we may fit the model

$$\log\left(\frac{P}{1-P}\right) = \alpha + \Delta x, \qquad (1)$$

where x is the number of copies of the minor allele for the SNP under examination. Hence, exp(Δ) is the change in the odds associated with each additional copy of the minor allele. The hypotheses of interest are: H0 : Δ = 0 vs HA : Δ ≠ 0. In a frequentist setting, one may calculate the P-value from a Wald test, i.e. calculate the Z-score $Z = \hat{\Delta}/\sqrt{V}$, where $\hat{\Delta}$ is the maximum likelihood estimator (MLE) and $\sqrt{V}$ is the standard error of $\hat{\Delta}$. Under this model, with n cases and n controls, the standard error is

$$\sqrt{V} = \left[\,n \times \text{MAF} \times (1-\text{MAF})\,\right]^{-1/2}, \qquad (2)$$

where MAF is the minor allele frequency. The P-value is calculated as 2[1 − Φ(| z |)], where Φ(·) is the distribution function of a standard normal random variable.
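As a concrete illustration, the following R snippet mirrors these calculations; the sample size, MAF and estimated odds ratio below are invented for illustration:

```r
## Wald test for a log odds ratio: standard error (2), Z-score, P-value.
## The inputs are hypothetical.
n   <- 1000                          # number of cases (= number of controls)
maf <- 0.3                           # minor allele frequency
delta_hat <- log(1.2)                # estimated log odds ratio
se <- 1 / sqrt(n * maf * (1 - maf))  # sqrt(V) from equation (2)
z  <- delta_hat / se                 # Wald Z-score
p  <- 2 * (1 - pnorm(abs(z)))        # two-sided P-value, 2[1 - Phi(|z|)]
c(z = z, p = p)
```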
We now describe a Bayesian approach to the examination of hypotheses, based on the Bayes factor2,3, which is the ratio of the probability of the data under the null to the probability of the data under the alternative:

$$\text{BF} = \frac{\Pr(\text{data} \mid H_0)}{\Pr(\text{data} \mid H_A)}.$$
The BF summarizes the evidence for the hypotheses in the data, with BF > 1 giving evidence for the null and BF < 1 evidence for the alternative. To obtain the posterior probability of H0, we need to specify a prior probability for our belief that the null is true. We denote this prior by π0 = Pr(H0), so that the prior on the alternative hypothesis is 1 − π0. Bayes theorem gives the posterior probability of the null as

$$\Pr(H_0 \mid \text{data}) = \frac{\text{BF} \times \pi_0}{\text{BF} \times \pi_0 + (1 - \pi_0)}.$$

An alternative, perhaps more intuitive, form is

$$\frac{\Pr(H_0 \mid \text{data})}{\Pr(H_A \mid \text{data})} = \text{BF} \times \frac{\pi_0}{1 - \pi_0}$$

or, in words,

Posterior odds of H0 = Bayes factor × prior odds of H0.

We let PO = π0 /(1 − π0) represent the prior odds of no association.
PI discuss the importance of the relative costs of type I errors (false discoveries) and type II errors (false non-discoveries). These relative costs can be included in a formal Bayes decision theory approach; Table 1 gives the possible costs of incorrect decisions in the case of a pair of hypotheses. A Bayes decision theory solution is to report H0 if the posterior odds exceed the ratio of costs, i.e. if

$$\text{BF} \times \text{PO} > R,$$

where R = CII/CI is the ratio of the costs of type II to type I errors. For example, if type II errors are 4 times as bad as type I errors (this asymmetry may be appropriate when we are in discovery mode and do not wish to miss associations), then R = 4 and we should call a SNP significant if the posterior odds on H0 drop below 4 (or, equivalently, if the posterior probability on the null drops below 0.8).
Table 1.

2 × 2 table of costs when there are two hypotheses; CII is the cost of a type II error and CI the cost of a type I error

| | Truth: H0 | Truth: HA |
|---|---|---|
| Decision: H0 | 0 | CII |
| Decision: HA | CI | 0 |
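To make the decision rule concrete, here is a small R sketch of the posterior calculation and the cost-weighted rule; the Bayes factor and prior below are placeholder values, not taken from any real SNP:

```r
## Posterior probability of H0 and the Bayes decision rule:
## report H0 if BF x PO > R, where R = C_II / C_I.
## bf and pi0 are placeholders for illustration.
bf  <- 0.5                                 # Bayes factor (evidence for H_A)
pi0 <- 0.5                                 # prior probability of the null
PO  <- pi0 / (1 - pi0)                     # prior odds of no association
R   <- 4                                   # type II errors 4x as costly
post0 <- bf * pi0 / (bf * pi0 + 1 - pi0)   # posterior probability of H0
c(posterior_H0 = post0,
  significant  = bf * PO < R)              # TRUE: call the SNP significant
```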
Bayes factors are very appealing, but their practical use faces a number of challenges. The first concerns computation. The Bayes factor is calculated as

$$\text{BF} = \frac{\int p(\text{data} \mid \theta_0)\,\pi(\theta_0)\,d\theta_0}{\int p(\text{data} \mid \theta_A)\,\pi(\theta_A)\,d\theta_A}, \qquad (3)$$

where θ0 and θA are the parameters under the null and alternative, and π(θ0) and π(θA) are the prior distributions on the parameters included in the null and alternative. Hence, integration is required. The second difficulty concerns priors. As (3) makes clear, to follow a fully Bayesian approach we need to specify priors for all the parameters under both H0 and HA. With respect to the case–control logistic regression model (1), this corresponds to a prior on α under the null, and on both α and Δ under the alternative. The third and final difficulty concerns how to decide upon a significance threshold. If one specifies PO and R, a threshold is immediately obtained: reject H0 if BF < R/PO. But specification of each of PO and R is non-trivial.
To overcome the first two difficulties, following from previous work,4–6 I suggested a simple approach that is relevant when the sample sizes are large.7 The key is to summarize the information in the data through the normal likelihood

$$\hat{\Delta} \mid \Delta \sim N(\Delta, V_n),$$

where $\sqrt{V_n}$ is again the standard error of the estimator (with the notation now emphasizing the dependence on the sample size n). This formulation sidesteps the need to specify priors on parameters that are not of interest, since now a prior for Δ only is required. The natural choice is the normal prior Δ ~ N(0, W), so that W is the prior variance of the log odds ratio. A large value of W means we believe a priori that the log odds ratio can be relatively large or small. For example, if we believe that the log odds ratio lies between [−log 1.5, +log 1.5] with probability 0.95, we obtain W = 0.25². A more formal justification for this likelihood-prior model is available elsewhere.8 These choices lead to a simple form for the Bayes factor:

$$\text{BF} = \sqrt{\frac{V_n + W}{V_n}}\,\exp\left\{-\frac{Z^2}{2} \times \frac{W}{V_n + W}\right\}. \qquad (4)$$

Calculation only requires three inputs: the Z-score, the standard error $\sqrt{V_n}$ and the prior variance W. A crucial observation is that the evidence is based on the Z-score and on the power. Only a confidence interval is required for calculation since, given such an interval, we can solve for the required $\hat{\Delta}$ and $V_n$. Appendix A, available as supplementary data at IJE online, gives a simple example, with accompanying code, to illustrate the calculations required, using data from PI.
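Appendix A contains the accompanying code; as a minimal stand-alone sketch, the Bayes factor (4) can be computed in R from a reported confidence interval as follows (the interval used here is hypothetical):

```r
## Bayes factor (4) from three inputs: Z-score, standard error, W.
abf <- function(z, se, W) {
  V <- se^2
  sqrt((V + W) / V) * exp(-0.5 * z^2 * W / (V + W))
}
## Recover the estimate and standard error from a hypothetical 95% CI
## for the odds ratio, here (1.05, 1.50).
ci        <- log(c(1.05, 1.50))     # limits on the log odds ratio scale
delta_hat <- mean(ci)               # point estimate (midpoint)
se        <- diff(ci) / (2 * 1.96)  # half-width divided by 1.96
abf(z = delta_hat / se, se = se, W = 0.25^2)  # < 1: evidence for H_A
```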
A simple example
We present a sanity check via a simple example. Assume observations Y1, … , Yn are a random sample from the normal distribution N(Δ, σ²), with σ² known. The maximum likelihood estimator (MLE) is $\hat{\Delta} = \bar{Y}$ and has distribution N(Δ, V_n), with standard error $\sqrt{V_n} = \sigma/\sqrt{n}$. The null and alternative hypotheses are H0 : Δ = 0, HA : Δ ≠ 0, and under HA we assume Δ ~ N(0, W). We assume W = σ², which corresponds to the so-called unit information prior of Kass and Wasserman,9 a relatively uninformative choice. From (4) we derive the Bayes factor as

$$\text{BF} = \sqrt{n+1}\,\exp\left\{-\frac{Z^2}{2} \times \frac{n}{n+1}\right\}.$$

The decision rule is to reject the null if BF × PO < R, i.e. if

$$\sqrt{n+1}\,\exp\left\{-\frac{Z^2}{2} \times \frac{n}{n+1}\right\} \times \text{PO} < R.$$

Rearrangement gives a Z² statistic threshold of

$$Z^2 > \frac{n+1}{n}\left[\log(n+1) + 2\log\left(\frac{\text{PO}}{R}\right)\right],$$

from which the P-value threshold may be derived.
In Table 2 below, we assume R = 1, so that type I and type II errors are equally costly. The table shows the P-value that corresponds to the Z² threshold for a range of sample sizes and prior probabilities π0 = Pr(H0). Notice first how the P-value thresholds go to zero as the sample size increases. For π0 = 0.5 (PO = 1) and n = 20, 50, 100, the thresholds are ~0.05. It is interesting that these sample sizes are in the ballpark of those that Fisher (who is credited with popularizing the 0.05 threshold) would have been working with, and presumably in the experiments in which he was involved the prior on the null would not have been close to 1 or 0. For example, in Tables 29 and 30 of Statistical Methods for Research Workers10 the sample sizes were 30 and 17, and Fisher discusses the 0.05 limit in each case, though in both cases he concentrates more on the context than on the absolute value of 0.05.
Table 2.

P-value thresholds when the costs of type I and type II errors are equal, as a function of the prior on the null, π0, and the sample size n

| | π0 = 0.25 | π0 = 0.50 | π0 = 0.95 |
|---|---|---|---|
| n = 10 | 0.64 | 0.10 | 0.0025 |
| n = 20 | 0.35 | **0.074** | 0.0022 |
| n = 50 | 0.18 | **0.045** | 0.0016 |
| n = 100 | 0.12 | **0.031** | 0.0011 |
| n = 1000 | 0.030 | 0.0085 | 0.00034 |

The figures in bold represent situations in which the 0.05 threshold is approximately appropriate.
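The entries of Table 2 can be reproduced directly from the Z² threshold derived above; a short R sketch:

```r
## P-value thresholds of Table 2, from
## Z^2 > (n + 1)/n * [log(n + 1) + 2 log(PO/R)] with R = 1.
p_threshold <- function(n, pi0, R = 1) {
  PO <- pi0 / (1 - pi0)
  z2 <- (n + 1) / n * (log(n + 1) + 2 * log(PO / R))
  2 * (1 - pnorm(sqrt(z2)))          # convert Z^2 threshold to a P-value
}
ns <- c(10, 20, 50, 100, 1000)
round(outer(ns, c(0.25, 0.50, 0.95), p_threshold), 4)  # rows: n; cols: pi0
```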
To conclude: the 0.05 P-value threshold can be justified in some situations, but it is not reasonable to use this value as a universal rule. The GWAS situation is far different from that considered above because the prior on the null is so much closer to 1, and the sample sizes vary over a larger range.
Determining Bayes factor boundaries in a GWAS context
We now show how the above methodology can be applied in a GWAS. As in the previous simple example, we can rearrange the Bayes factor form given in (4) to obtain a P-value threshold. The Z² score threshold is

$$Z^2 > \frac{V_n + W}{W}\left[\log\left(\frac{V_n + W}{V_n}\right) + 2\log\left(\frac{\text{PO}}{R}\right)\right], \qquad (5)$$

which is an explicit function of sample size, R and PO. We note that:
- If the prior odds of the null, PO, increase, the threshold increases, corresponding to a more stringent rule.
- If the relative cost of type II to type I errors, R, increases, the threshold decreases, to give a more liberal procedure.
- Beyond a certain point, as n increases the type I error decreases to zero.
- The threshold depends crucially on the sample size, but not on the number of tests being performed. In contrast, frequentist procedures depend on the number of tests, but not on the sample size.
The multiple testing aspect is not the important factor for the Bayesian analysis, rather it is the prior on each association. This is clear if one considers an imaginary experiment in which a researcher presents a statistician with the data from a single, randomly selected SNP. A frequentist analysis would presumably use a conventional level of significance (such as 0.05), whereas a Bayesian would use a threshold based on a prior on H0 that is close to 1. If instead a million SNPs are presented then the Bayesian analysis is unchanged (i.e. the same threshold is used as in the single test) whereas a frequentist analysis would need to adjust for the multiple testing aspect.
We now turn to the thorny issue of how to decide upon a threshold. If one has a good prior estimate of π0 and is willing to specify the ratio of costs R, then one can proceed directly using equation (5) (the Bayes factor is relatively insensitive to the choice of prior variance W). A more pragmatic approach is to treat the Bayes factor as a device by which the ‘correct’ behaviour of the threshold can be obtained as a function of sample size and MAF.
To illustrate, and to compare with Bonferroni, consider the situation in which we carry out a case–control study with n cases and n controls. We use the form (2) and assume a MAF of 0.5. Further suppose that we have m = 1 000 000 tests, we set the ratio of costs at R = 10 (so that type II errors are 10 times as costly as type I errors) and we take π0 = 1 − 1/100 000. For the prior on the effect size, suppose that there is a 95% chance that the log odds ratio lies between [− log 2, + log 2], to give W = 0.42². For comparison, we base the Bonferroni threshold on controlling the FWER at 0.05, so that the P-value threshold is 0.05/1 000 000 = 5 × 10⁻⁸. In Figure 1A we plot the type I error for both procedures; the type I error is constant under Bonferroni by construction, while the Bayes type I error decreases as a function of sample size. The Bonferroni procedure with an FWER of 0.05 is highly conservative and so the power is reduced, as is demonstrated in Figure 1B, which shows the type II error as a function of sample size.

Type I and type II errors can be difficult to translate into practical implications. The expected number of false discoveries (EFD) is m0α, where m0 is the true number of null associations out of m and α is the type I error. Similarly, the expected number of true discoveries (ETD) is m1(1 − β), where m1 is the number of true signals and β is the type II error. In practice, m0 and m1 are unknown but, as an illustration, suppose there are m1 = 50 true signals out of m = 1 000 000 tests. In Figure 1C we plot the EFD versus sample size and see that the Bayes procedure starts with around two false discoveries, which then decrease as n increases. Bonferroni has an EFD of 0.04999 for all sample sizes. The benefit of the increased number of false discoveries under the Bayes procedure is the increased number of expected true discoveries, as illustrated in Figure 1D. For example, for n = 3000 the expected number of true discoveries is 27.9 for Bayes and 16.2 for Bonferroni. Hence, in this example, trading around two false discoveries for over 10 extra true discoveries would seem beneficial. These sorts of simulation experiments can be carried out before analysis, with the required operating characteristics being examined by altering π0 and R.
Figure 1.
Operating characteristics of Bonferroni and the threshold rule based on the Bayes factor. Type I and type II errors are displayed in the top row, and the expected number of false discoveries (EFD) and expected number of true discoveries (ETD) on the bottom row. This simulation is based on a situation in which the total number of tests is m = 1 000 000 and the true number of associations is m1 = 50
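The operating characteristics in Figure 1 can be approximated with a few lines of R. The settings below follow the text (m = 10⁶ tests, m1 = 50 true signals, R = 10, π0 = 1 − 1/100 000, W = 0.42², MAF = 0.5); the true odds ratio behind the power calculation is not stated in the text and is assumed here to be 1.2, a value that reproduces the expected true discoveries quoted above:

```r
## EFD and ETD at sample size n for the Bayes rule (5) and Bonferroni.
## The true odds ratio of 1.2 is an assumption (see the lead-in text).
ops <- function(n, z_thr, m0 = 1e6 - 50, m1 = 50,
                maf = 0.5, logOR = log(1.2)) {
  V     <- 1 / (n * maf * (1 - maf))   # variance (2), n cases and n controls
  alpha <- 2 * (1 - pnorm(z_thr))      # type I error at this threshold
  mu    <- logOR / sqrt(V)             # mean of Z under the alternative
  power <- pnorm(mu - z_thr) + pnorm(-mu - z_thr)
  c(EFD = m0 * alpha, ETD = m1 * power)
}
bayes_z <- function(n, PO = 99999, R = 10, W = 0.42^2, maf = 0.5) {
  V <- 1 / (n * maf * (1 - maf))
  sqrt((V + W) / W * (2 * log(PO / R) + log((V + W) / V)))  # from (5)
}
n <- 3000
rbind(Bayes      = ops(n, bayes_z(n)),
      Bonferroni = ops(n, qnorm(1 - 5e-8 / 2)))
# Bayes: ETD ~27.9; Bonferroni: EFD ~0.05, ETD ~16.2 (as in the text)
```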
One can even explicitly set the required operating characteristic at a particular sample size n⋆ and MAF M⋆, in order to indirectly specify the ratio PO/R. For example, we may control the type I error rate, α⋆, or the expected number of false discoveries, EFD. To illustrate, suppose we wish to have a certain type I error rate α⋆ at a particular sample size n⋆. Then, from (4), we can equate the Z²-statistic at the type I error we wish to control,

$$Z^2_{\alpha^\star} = \left[\Phi^{-1}\left(1 - \alpha^\star/2\right)\right]^2,$$

to the threshold determined by U = log(PO/R), i.e.

$$Z^2_{\alpha^\star} = \frac{V^\star + W}{W}\left[2U + \log\left(\frac{V^\star + W}{V^\star}\right)\right], \qquad (6)$$

where V⋆ is the variance (2) evaluated at n⋆ and M⋆, which gives

$$U = \frac{1}{2}\left[\frac{W}{V^\star + W}\,Z^2_{\alpha^\star} - \log\left(\frac{V^\star + W}{V^\star}\right)\right].$$

As an illustration, we take the type I error as 5 × 10⁻⁸ at n⋆ = 3000 and for a MAF M⋆ = 0.25. Solving for PO/R using (6), we obtain PO/R = 243 801. For example, if R = 1 this gives π0 = 0.9999959. One interpretation of this figure is that, if the same prior were used for m = 1 000 000 tests, then we would expect m × (1 − π0) = 4.1 true associations to be present (which is another indicator of the conservative nature of this particular threshold). Figure 2 shows how this rule translates to a type I error for different sample sizes.
Figure 2.
Type I error as a function of sample size for Bonferroni and for the rule in which we fixed the type I error at 5 × 10⁻⁸ at n⋆ = 3000 (which is indicated by the vertical line)
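A quick R check of the PO/R calculation above, using the variance form (2) for V⋆ (a sketch; the numbers follow the text):

```r
## Solve (6) for PO/R given a target type I error at n* and MAF M*.
alpha_star <- 5e-8; n_star <- 3000; M_star <- 0.25; W <- 0.42^2
V  <- 1 / (n_star * M_star * (1 - M_star))  # V* from equation (2)
z2 <- qnorm(1 - alpha_star / 2)^2           # Z^2 at the target type I error
U  <- 0.5 * (W / (V + W) * z2 - log((V + W) / V))
exp(U)                                      # PO/R: approximately 243 800
```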
Combination of data across studies
A key element in the article of PI is how information is combined across many studies, an endeavor which is likely to become more and more popular with the increasing number of consortia and the decreased costs of genotyping. In Appendix B, available as Supplementary Data at IJE online, we detail the calculations required to extend the above approach, and report the key formulas here. In the case of two studies, the approximate Bayes factor is

$$\text{ABF} = \sqrt{\frac{V_1 W + V_2 W + V_1 V_2}{V_1 V_2}}\,\exp\left\{-\frac{R}{2}\left(Z_1\sqrt{V_2} + Z_2\sqrt{V_1}\right)^2\right\},$$

where R = W/(V1W + V2W + V1V2), and $Z_1 = \hat{\Delta}_1/\sqrt{V_1}$ and $Z_2 = \hat{\Delta}_2/\sqrt{V_2}$ are the usual Z statistics. The ABF will be small, and thus give evidence for HA, when the absolute values of Z1 and Z2 are large and they are of the same sign. For the case of K studies,

$$\text{ABF} = \sqrt{1 + W\sum_{k=1}^{K} V_k^{-1}}\,\exp\left\{-\frac{W}{2\left(1 + W\sum_{k=1}^{K} V_k^{-1}\right)}\left(\sum_{k=1}^{K} \frac{Z_k}{\sqrt{V_k}}\right)^2\right\},$$

where Z_k and $\sqrt{V_k}$ are the Z-scores and standard errors from each of the studies, k = 1, … , K.
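A sketch of the K-study calculation in R (the Z-scores and standard errors below are invented for illustration):

```r
## K-study approximate Bayes factor from Z-scores and standard errors.
abf_meta <- function(z, se, W) {
  V <- se^2
  A <- 1 + W * sum(1 / V)              # 1 + W * sum_k 1/V_k
  sqrt(A) * exp(-0.5 * (W / A) * sum(z / sqrt(V))^2)
}
## Three hypothetical studies with same-sign Z-scores: the ABF is
## small, i.e. evidence for H_A.
abf_meta(z = c(2.1, 1.8, 2.4), se = c(0.10, 0.12, 0.08), W = 0.25^2)
```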
We finally note that more sophisticated Bayesian methods for meta-analysis in a GWAS context have recently been described by Wen and Stephens.11
Concluding comments
PI state that ‘The GWS should account for the multiplicity of comparisons’, but as we have seen, this is not the case in a Bayesian approach. The Bayes threshold boundary depends crucially on the sample size n but not on the number of tests m; the usual frequentist boundaries depend on m but not on n.
We have concentrated on emphasizing that boundaries should depend on sample size, but the power also depends on the MAF. For MAFs in the range 0.1 to 0.5 the power does not change too greatly. However, in the future it is likely that reliable data on SNPs with low MAF will be obtained, and in this case the implications for a threshold will be more marked. In these cases, it would be beneficial to allow the variance of the prior, W, to depend on the MAF. For example, we might anticipate larger effect sizes at small MAFs. The information contained in Table 1 of PI is helpful in this regard, as it gives both the MAF and the estimated effect size; see also Park et al.12 Of course, when collecting together results on effect sizes and MAFs in order to specify a form for W that depends on the MAF, one must account for the fact that we are unlikely to be seeing effects at low MAFs because of low power. In other words, the selection bias of the signals we are seeing must be considered.
Software to carry out the calculations described in this article, written in the R language, is available at:
Supplementary Data
Supplementary Data are available at IJE online.
Funding
This work was funded by the National Institutes of Health (grant NIH U01 HG 005157).
Conflict of interest:
None declared.
References
- 1. Panagiotou OA, Ioannidis JPA. What should the genome-wide significance threshold be? Int J Epidemiol 2012;41:273–86. doi: 10.1093/ije/dyr178.
- 2. Kass R, Raftery A. Bayes factors. J Am Stat Assoc 1995;90:773–95.
- 3. Goodman SN. Toward evidence-based medical statistics. 2: The Bayes factor. Ann Intern Med 1999;130:1005–13. doi: 10.7326/0003-4819-130-12-199906150-00019.
- 4. Johnson VE. Properties of Bayes factors based on test statistics. Scand J Stat 2007;35:354–68.
- 5. Wacholder S, Chanock S, Garcia-Closas M, El-ghormli L, Rothman N. Assessing the probability that a positive report is false: an approach for molecular epidemiology studies. J Natl Cancer Inst 2004;96:434–42. doi: 10.1093/jnci/djh075.
- 6. Johnson VE. Bayes factors based on test statistics. J R Stat Soc Ser B 2005;67:689–701.
- 7. Wakefield J. A Bayesian measure of the probability of false discovery in genetic epidemiology studies. Am J Hum Genet 2007;81:208–27. doi: 10.1086/519024.
- 8. Wakefield J. Bayes factors for genome-wide association studies: comparison with P-values. Genet Epidemiol 2009;33:79–86. doi: 10.1002/gepi.20359.
- 9. Kass RE, Wasserman L. A reference Bayesian test for nested hypotheses and its relationship to the Schwarz criterion. J Am Stat Assoc 1995;90:928–34.
- 10. Fisher RA. Statistical Methods, Experimental Design and Scientific Inference. Oxford: Oxford Science Publications; 1990.
- 11. Wen X, Stephens M. Bayesian methods for genetic association analysis with heterogeneous subgroups: from meta-analyses to gene-environment interactions. arXiv:1111.1210v2 [stat.ME]. http://arxiv.org/abs/1111.1210 (2 February 2012, date last accessed).
- 12. Park J, Wacholder S, Gail M, et al. Estimating effect size distribution from genome-wide association studies and implications for future discoveries. Nat Genet 2010;42:570–75. doi: 10.1038/ng.610.