Skip to main content
NIHPA Author Manuscripts logoLink to NIHPA Author Manuscripts
. Author manuscript; available in PMC: 2013 Dec 11.
Published in final edited form as: Am Stat. 2012 Jan 24;65(3):10.1198/tast.2011.10117. doi: 10.1198/tast.2011.10117

A Note on Comparing the Power of Test Statistics at Low Significance Levels

Nathan Morris 1, Robert Elston 1
PMCID: PMC3859431  NIHMSID: NIHMS532048  PMID: 24347668

Abstract

It is an obvious fact that the power of a test statistic is dependent upon the significance (alpha) level at which the test is performed. It is perhaps a less obvious fact that the relative performance of two statistics in terms of power is also a function of the alpha level. Through numerous personal discussions, we have noted that even some competent statisticians have the mistaken intuition that relative power comparisons at traditional levels such as α = 0.05 will be roughly similar to relative power comparisons at very low levels, such as the level α = 5 × 10−8, which is commonly used in genome-wide association studies. In this brief note, we demonstrate that this notion is in fact quite wrong, especially with respect to comparing tests with differing degrees of freedom. In fact, at very low alpha levels the cost of additional degrees of freedom is often comparatively low. Thus we recommend that statisticians exercise caution when interpreting the results of power comparison studies which use alpha levels that will not be used in practice.

Keywords: Power, Small Significance Levels

1 Introduction

Hypothesis testing is an established part of any conventional statistics education, as are the familiar phrases “P-Value,” “type I error,” “alpha level” and “power.” Every first-year student of statistics learns that if the P-Value is less than α = 0.05 one should “reject the null hypothesis.” While this cutoff is arbitrary, it is nevertheless tradition, and so we have grown used to it. We have developed our professional intuitions around this sacred, but quite arbitrary, standard. For example, traditional claims about the robustness of the central limit theorem in terms of Type I error are based on α = 0.05. While such claims may be challenged even when α = 0.05, one could certainly wonder if such rules of thumb are even remotely valid when alpha is orders of magnitude smaller than traditional levels (Bradley, 1978). In this note, however, we will pretend that all such considerations relating to the proper control of type I error at extremely small alpha levels are adequately addressed. Instead, we will consider power, and show that intuitions developed for α = 0.05 may not apply at much lower alpha levels.

To motivate our discussion, let us consider genome-wide association studies (Wang et al., 2005). There are positions in the genome where the DNA of some individuals differs from that of others by a single nucleotide. These variations among individuals are referred to as single nucleotide polymorphisms (SNPs). Usually, SNPs have only two possible variations (i.e., two possible alleles), and we will refer to the more common allele as “A” and the less common allele as “a.” Because humans have two copies of most of their DNA, there are three possible combinations of alleles (i.e., three possible genotypes) that an individual can have: “AA,” “Aa” and “aa.” We make no distinction between “Aa” and “aA.” A typical genome-wide association study will assay something on the order of 106 SNPs for each individual. A hypothesis test is then performed for each SNP to determine if it is associated in some way with the trait of interest. This leads to a Bonferoni corrected alpha level of α = 0.05/106 = 5 × 10−8 to control the family wise error rate at 0.05.

If the trait of interest (y) is quantitative, then we may easily view the test in the linear model framework. For SNP j and individual i we may code a predictive variable xij as 0, 1 or 2 for genotypes “AA,” “Aa” or “aa,” respectively (i.e., xij represents the number of “a” alleles carried by individual i at SNP j). Two possible single SNP linear models are: (m1) yi = β0+β1xij+εi and (m2) yi=β0+β1xij+β2xij2+i. For model m1 we would test the null hypothesis H0(1):β1=0 vs. Ha(1):β0, while for model m2 we would test the hypothesis H0(2):β1=β2=0vs. Ha(2):β10 or β2 ≠ 0. Comparing the power of a test of H0(1) with a test of H0(2) involves comparing two tests with differing degrees of freedom. It is true that in some sense these are distinct hypotheses, and technically, H0(1) may be true at the same time H0(2) is false. However, the rejection of either hypothesis will lead us to the conclusion that the SNP being investigated is associated with the disease. Some have suggested testing for association between y and two neighboring markers at a time (Kim et al., 2010). In this case we have numerous possible linear models such as: (m3) yi = β0 + β1xij + β2xi(j+1) + εi and (m4) yi = β0 + β1xij + β2xi(j+1) + β3xi(j+1)xij + εi. For m3 we would test the hypothesis H0(3):β1=β2=0vs. Ha(3):β10 or β2 ≠ 0, while for m4 we would test the hypothesis H0(4):β1=β2=β3=0vs. Ha(4):β10 or β2 ≠ 0 or β3 ≠ 0. Again, comparing these two tests involves comparing two tests with differing degrees of freedom, though from a biological viewpoint they are equivalent in that they both may be used to demonstrate that a particular genomic region is associated with a trait. Clearly more complicated tests may be created with quadratic terms and higher-order interaction terms leading to tests with a large number of degrees of freedom.

Suppose we are comparing the power of two test statistics that are based upon two models ma and mb. For example, ma and mb could be m1 and m2 above. We define νa as the number of coefficients in model ma, and νb is defined similarly. We will assume that νa < νb. For ma we perform a test (using, say, the Wald test) of H0(a):β1==βνa, and for mb we perform a test of H0(b)=β1==βνb. We refer to the test statistics as Ta and Tb, respectively. Let χν,δ2 represent a non-central chi-squared distribution with degrees of freedom ν and a non-centrality parameter (NCP) Δ. Also, let n represent the sample size. In order to make our discussion simple we will simply assume that under the null hypothesis Taχνa2 and Tbχνb2. Under a given alternative, we will assume that the distributions can be adequately characterized by the two non-centrality parameters (NCPs) Δa = a and Δb = b, so that Taχνa,δa2 and Taχνb,δb2. Some discussion of how to calculate such NCPs for genome-wide association studies may be found for example in Kim et al. (2010) and Yang et al. (2010).

We now introduce a key piece of notation:

q=(δbδa)δa, (1)

or equivalently δb = δa(1 + q). Thus q, which is independent of the sample size, represents the fractional increase in the NCP that can be achieved by using a hypothesis test based on the higher dimensional model mb instead of the lower dimensional model ma. When q = 0 then the NCPs of the two models are equal. Unfortunately, Tb has additional degrees of freedom compared to Ta, and therefore at a given alpha level the cutoff for rejecting H0 using Tb is larger than the cutoff for rejecting using Ta. There is a cost/benefit analogy to be drawn; we can purchase a larger NCP (as quantified by q) at the cost of increased degrees of freedom. We will only be willing to pay this price if q is large compared to the increase in degrees of freedom.

In what follows, we pose two simple questions, and show that the answers to these questions depend on alpha. All of the computations below were performed in R 2.9.2. The numerical accuracy of the chi-squared CDF at low alpha levels in R has been evaluated by Bangalore et al. (2009).

2 Results

First, we ask: what must q be for Ta and Tb to have equal power? That is, to borrow from the cost benefit analogy, we ask what value of q would allow us to break even with costs equal to the benefits. To answer this question, we first consider a test statistic (T) such that Tχν2 distribution under H0 and Tχν,δ2 under some alternative hypothesis. In this case, the power for a given α and ν is 1Fν,δ(Fν,01(1α)) where Fν,δ(Fν,01(1α)) is the cumulative distribution function of the non-central chi-squared distribution with degrees of freedom ν and NCP = Δ evaluated at the 1 − α quantile of the central chi-squared distribution with ν degrees of freedom. Because the power of T at a fixed α is a strictly increasing (thus invertible) function of Δ, we may solve for the Δ which would yield a power (p) where p ∈ (α, 1). We define the function Δ (α, p, ν) as the solution to this equation (i.e. the Δ such that 1Fν,δ(Fν,01(1α))=p. We wish to find the value of q that would give equal power (p) for Ta and to Tb at a given α, and we define the function Q(α, p, νa, νb) as the answer to this question. From equation (1) we have

Q(α,p,νa,νb)=[Δ(α,p,νb)Δ(α,p,νa)]Δ(α,p,νa). (2)

Figures 1a and 1b show plots of Q(α, p, 1, 2) and Q(α, p, 1, 3), which are obviously dependent on alpha. Figure 1a and 1b may be interpreted as saying that, in general, for a given p, α, νa and νb, we should use Tb if q is above what is suggested by the corresponding curve and we should use Ta otherwise. What is perhaps most remarkable about Figures 1a and 1b is the decreasing behavior of the function. This indicates that the cost of an additional degree of freedom, at least in this context, is less at a low alpha level than at a standard 0.05 alpha level. That is very small. The general behavior of the curves is similar over a broad range of choices for νa and νb.

Figure 1.

Figure 1

Power comparisons are a function of alpha level. By definition q = (δb − δa)/δa is the relative increase in the NCP purchased by the additional degrees of freedom. Note the log scale on the x-axis of all four graphs. (a,b) For different values of α, power (p) of Ta, and degrees of freedom, we show how large q must be for Tb and Ta to have equal power. (c,d) For different values of q and α and with the power of Ta fixed at 65% we show the percentage of power gained by using Tb (i.e., 100% × [[1Fνb(1+q)Δ(α,0.65,νa)(Fνb,01(1α))0.65]/0.65).

Our second question is this: given some values of q and the power of Ta, how much power would be gained or lost by using Tb instead of Ta? Using the same notation described above, we want to know

[1Fνb,(1+q)Δ(α,p,νa)p]p (3)

In equation (3) the non-centrality parameter (1 + q) Δ (α, p, νa) comes from equation (1) because Δb = (1 + q)Δa. In Figures 1c and 1d we show that, in the worst case scenario, when q = 0 (i.e., the extra degree(s)of freedom buy(s) no increase in the NCP) the percentage of power lost is fairly constant across alpha levels. However, if there is even a small increase in the NCP, the behavior at different alpha levels may be quite different. The curves shown in Figures 1c and 1d have the power (p) of Ta fixed at 65%. As may be expected, for larger p the dependence on alpha is less dramatic (i.e., the slopes are not as large) because a power cannot be larger than 100%.

3 Conclusion

We may wonder why we instinctively believe that power comparisons between different statistics will not be dependent on alpha. It may be our natural propensity to generalize, or it may be our experience with point hypotheses and uniformly most powerful (UMP) tests (Lehmann and Romano, 2005). UMP tests are typically not only most powerful for every value of the parameter in the parameter space, but also for every “attainable” value of α ∈ [0, 1] (Hogg and Craig, 1995, p. 411). In the examples we present, no UMP test exists because we can easily find genetic models that would make the test of H0(a) more powerful than H0(b) and vice versa. Therefore, power comparison studies must compare power over a broad range of plausible genetic models.

With the development of numerous high throughput molecular techniques such as array-based genotyping, many modern applications in statistics involve high dimensional scans, and thus require multiple-test corrections leading to extremely small alpha levels. Many methodological papers set out to compare the power of various test statistics for such high dimensional applications. This note serves as a warning that power comparisons done with α = 0.05 may not generalize to these realistic applications, especially when comparing statistics with differing degrees of freedom.

Acknowledgments

Morris was supported by NCI Grant R25T CA094186. Elston was supported by the NCRR grant P41 RR03655 and NCI grant P30 CAD43703.

References

  1. Bangalore S, Wang J, Allison D. How Accurate are the Extremely Small P-values Used in Genomic Research: An Evaluation of Numerical Libraries. Computational Statistics and Data Analysis. 2009;53:2446–2452. doi: 10.1016/j.csda.2008.11.028. [DOI] [PMC free article] [PubMed] [Google Scholar]
  2. Bradley J. Robustness. British Journal of Mathematical and Statistical Psychology. 1978;31:144–152. [Google Scholar]
  3. Hogg R, Craig . Introduction to Mathematical Statistics. Fifth Edition Prentice-Hall; New Jersey: 1995. [Google Scholar]
  4. Kim S, Morris N, Won S, Elston R. Single marker and two marker association tests for unphased case control genotype data, with a power comparison. Genetic epidemiology. 2010;34:67–77. doi: 10.1002/gepi.20436. [DOI] [PMC free article] [PubMed] [Google Scholar]
  5. Lehmann E, Romano J. Testing statistical hypotheses. Springer Verlag; New York: 2005. [Google Scholar]
  6. Wang W, Barratt B, Clayton D, Todd J. Genome-wide Association Studies: Theoretical and Practical Concerns. Nature Reviews Genetics. 2005;6:109–118. doi: 10.1038/nrg1522. [DOI] [PubMed] [Google Scholar]
  7. Yang J, Wray N, Visscher P. Comparing apples and oranges: equating the power of case-control and quantitative trait association studies. Genetic epidemiology. 2010;34(3):254–257. doi: 10.1002/gepi.20456. [DOI] [PubMed] [Google Scholar]

RESOURCES