Journal of Computational Biology. 2009 Apr;16(4):625–638. doi: 10.1089/cmb.2008.0144

Properties of Balanced Permutations

Lucinda K. Southworth, Stuart K. Kim, and Art B. Owen
PMCID: PMC3148117  PMID: 19361331

Abstract

This paper takes a close look at balanced permutations, a recently developed sample reuse method with applications in bioinformatics. It turns out that balanced permutation reference distributions do not have the correct null behavior, which can be traced to their lack of a group structure. We find that they can give p-values that are too permissive to varying degrees. In particular the observed test statistic can be larger than that of all B balanced permutations of a data set with a probability much higher than 1/(B + 1), even under the null hypothesis.

Key words: functional genomics, gene expression, genomics, statistics

1. Introduction

Sample reuse methods such as permutation testing and the bootstrap are invaluable tools in high-throughput genomic settings, such as microarray analyses. They are used to test hypotheses and compute p-values without making strong parametric assumptions about the data, and they adapt readily to complicated test statistics.

Permutation tests, described in more detail below, work by permuting the treatment labels of the data and comparing the resulting values of a test statistic to the original one. Suppose, for example, that there are n observations $X_1, \dots, X_n$ in the treatment group and also n observations $Y_1, \dots, Y_n$ in the control group. We might measure the treatment effect via $\hat\Delta = \bar X - \bar Y$, where $\bar X$ and $\bar Y$ are averages of the treatment and control observations, respectively. There are $P = \binom{2n}{n}$ ways to redo the assignment of treatment versus control labels, and each gives a value for the treatment effect. If the actual treatment effect is larger than that from all of the other permutations, then we may claim a p-value of 1/P. More generally, if the observed effect beats (is larger than) exactly b of these values we can claim p = 1 − b/P. The smallest available p-value is 1/P because the actual treatment allocation is always included in the reference set.
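To make the recipe concrete, here is a minimal sketch in Python (the example data and the function name are ours, purely for illustration) that enumerates all P relabelings and computes the one-sided permutation p-value for the difference in means.

```python
from itertools import combinations
import numpy as np

def permutation_pvalue(x, y):
    """One-sided permutation p-value for the mean difference x.bar - y.bar.

    Enumerates all C(2n, n) ways of relabeling the pooled data and counts how
    many relabelings give a treatment effect at least as large as the observed
    one.  The observed labeling is always in the reference set, so the smallest
    attainable p-value is 1/P.
    """
    z = np.concatenate([x, y])
    n, total = len(x), len(z)
    observed = x.mean() - y.mean()
    count = 0
    num = 0
    for idx in combinations(range(total), n):
        num += 1
        trt = z[list(idx)]
        ctl = np.delete(z, list(idx))
        if trt.mean() - ctl.mean() >= observed:
            count += 1
    return count / num

# Hypothetical example with n = 4 observations per group.
rng = np.random.default_rng(0)
x = rng.normal(0.5, 1.0, size=4)   # "treatment"
y = rng.normal(0.0, 1.0, size=4)   # "control"
print(permutation_pvalue(x, y))
```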

One reason why permutation tests work and are intuitively reasonable is their symmetry. If $\hat\Delta$ is the actual difference and $\hat\Delta^*$ is the difference for any reassignment of labels, chosen without looking at the X and Y values, then

$$\Pr(\hat\Delta \le D) = \Pr(\hat\Delta^* \le D) \qquad (1)$$

holds for all D, under the null hypothesis $H_{0S}$ that the $X_i$ and $Y_i$ are all independent and identically distributed. That is, $\hat\Delta$ and $\hat\Delta^*$ have the same distribution.

Recently, a special form of permutation analysis, called balanced permutation, has been employed. In a balanced permutation, we make sure that after relabeling, the new treatment group has exactly n/2 members that came from the original treatment group and n/2 from the original control group. The number of balanced permutations is $B = \binom{n}{n/2}^2$, which can be much smaller than the total number P of permutations. Balanced permutations require n to be even, but nearly balanced permutations are possible for odd n.
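A balanced relabeling can be enumerated by choosing n/2 indices from the original treatment group and n/2 from the original control group to form the new treatment group. The sketch below (Python; hypothetical data) does exactly that and confirms the count $\binom{n}{n/2}^2$.

```python
from itertools import combinations
import numpy as np

def balanced_effects(x, y):
    """Treatment effects for all balanced relabelings of (x, y).

    A balanced relabeling puts exactly n/2 original treatment observations and
    n/2 original control observations into the new treatment group.
    """
    n = len(x)
    assert n == len(y) and n % 2 == 0
    z = np.concatenate([x, y])
    effects = []
    for tx in combinations(range(n), n // 2):              # from old treatment
        for ty in combinations(range(n, 2 * n), n // 2):   # from old control
            new_trt = list(tx) + list(ty)
            new_ctl = [i for i in range(2 * n) if i not in set(new_trt)]
            effects.append(z[new_trt].mean() - z[new_ctl].mean())
    return np.array(effects)

rng = np.random.default_rng(1)
x, y = rng.normal(size=4), rng.normal(size=4)
eff = balanced_effects(x, y)
print(len(eff))   # 36 = C(4, 2)**2 balanced permutations when n = 4
```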

These balanced permutations are mentioned in remark E on page 1159 of Efron et al. (2001). This is the earliest mention we have found. Since then, they have been applied numerous times in the literature on statistical analysis of microarrays. As of 2008, the National Cancer Institute describes balanced permutations in their page on statistical tests at http://discover.nci.nih.gov/microarrayAnalysis/Statistical.Tests.jsp. They say that more extreme p-values typically result, but they remark on a granularity problem. For fixed n, 1/B is bigger than 1/P, and so balanced permutations require larger sample sizes n than ordinary ones do, if one is to attain very small p-values.

Balanced permutations are a subset of all possible permutations, and so equation (1) holds for them too. Intuitively, a balanced permutation analysis should be even better than a permutation analysis. If $E(X_i) - E(Y_i) = \Delta \ne 0$, then $E(\hat\Delta) = \Delta$ while $E(\hat\Delta^*)$ remains at 0 due to cancellation. By contrast, for ordinary permutations some of the $E(\hat\Delta^*)$ are positive and some are negative.

Equation (1) is not enough to make balanced permutations give properly calibrated p-values. The theory of permutation tests also requires a group structure for the set of permutations used, and balanced permutations do not satisfy this condition.

Though they fail a sufficient condition for giving exact p values, that does not in itself mean that they give bad p values. The goal of this paper is to investigate the p-values produced via balanced permutations. What we find is that those p values can be much smaller than they should be. This is most extreme in the case where the original test statistic beats all of the balanced permutations.

The outline of this paper is as follows. Section 2 describes permutation based inferences including balanced permutations and it distinguishes balanced from stratified permutations. Section 3 shows through theory and simulations that the reference distribution provided by balanced permutations can lead to permissive p-values, sometimes by astonishingly large factors.

The major use of balanced permutations is in estimating a pooled p-value distribution for N test statistics assumed to have the same distribution, and for estimating the false discovery rate (FDR) of a procedure. Section 4 looks at balanced permutations for pooled reference distributions. It is well known that including non-null cases makes a permutation analysis conservative. We find that including non-null cases in the balanced permutation scheme has the opposite effect, making it overly permissive. Even purely null cases can give rise to a permissive test, though this problem diminishes in large scale problems, when the null cases are independent of the one being tested. Section 5 summarizes our conclusions on balanced permutations, recommending against their use.

2. Background

To be concrete, suppose that we are comparing gene expression for two groups of subjects, treatment and control. Gene expression is a proxy measurement for the amount of product, either protein or RNA, made by a gene. In the past decade, various high-throughput technologies, generally termed microarrays, have made it possible to measure the expression activity of all genes in a genome with a single assay. We will later look at the issues that arise when there are many gene expression measurements to consider. For simplicity, we begin with the expression of a single gene.

We have n ≥ 2 subjects in each group. For the treatment subjects we measure $X_i$ for $i = 1, \dots, n$, and for the control group we measure $Y_i$ for $i = 1, \dots, n$. We assume an unpaired framework, so there is no connection between $X_i$ and $Y_i$. The $X_i$ are modeled as n independent samples from a distribution F and the $Y_i$ are similarly sampled from G. We will always assume that F and G both have finite means. This is reasonable for the applications we have in mind and it rules out uninteresting corner cases.

We are usually interested in the hypothesis $H_0: E(X_i) = E(Y_i)$. We also consider the stricter hypothesis $H_{0S}: F = G$. There is no exact nonparametric test for a one-sample mean (Bahadur and Savage 1956) and so we cannot expect one for $H_0$ either. One often makes do with exact tests for $H_{0S}$.

The customary test for $H_0$ is Student's t test. Let $\bar X = \frac{1}{n}\sum_{i=1}^n X_i$ and $\bar Y = \frac{1}{n}\sum_{i=1}^n Y_i$. Then under $H_0$, the statistic

$$t = \frac{\bar X - \bar Y}{s\sqrt{2/n}}, \qquad (2)$$

where

$$s^2 = \frac{1}{2n-2}\left(\sum_{i=1}^n (X_i - \bar X)^2 + \sum_{i=1}^n (Y_i - \bar Y)^2\right),$$

has the t(2n−2) distribution. The t test rejects $H_0$ if $|t| \ge t_{2n-2}^{1-\alpha/2}$. There are also one tailed versions.

The t test is designed for the case where Xi and Yi are normally distributed with common variance and possibly different means. The test holds up fairly well for non-normal data. In this particular setting with equal sample sizes of n, it is also reasonably accurate when Xi and Yi have unequal variances. For a discussion of the properties of t tests when assumptions may be violated, see Miller (1986).

2.1. Permutation tests

Permutation tests are very simple and intuitive. The original treatment assignment is one of $P = \binom{2n}{n}$ possible assignments. Under the null hypothesis, they all have an equal chance to be the most extreme. The explanation is not based on the P test statistics being independent and identically distributed. In fact, they are not independent. The explanation comes from a symmetry argument, which in turn stems from the group structure of the set of permutations.

Here we make a formal presentation of permutation tests of $H_{0S}$, so that we may later highlight their differences from balanced permutations. Balanced permutations also seem intuitively reasonable, and even intuitively better than ordinary permutations. Readers who primarily want to apply permutation methods may prefer to skip this subsection.

We suppose as before that $X_i \sim F$ and $Y_i \sim G$ are all independent, for $i = 1, \dots, n$. The sample mean difference is $\hat\Delta = \bar X - \bar Y$, where

$$\bar X = \frac{1}{n}\sum_{i=1}^n X_i \quad\text{and}\quad \bar Y = \frac{1}{n}\sum_{i=1}^n Y_i.$$

We use $Z = (Z_1, \dots, Z_{2n}) = (X_1, \dots, X_n, Y_1, \dots, Y_n)$ to represent both samples combined.

Suppose that we relabel the data, changing which subjects are treatment and which are control. Any relabeling of the data can be obtained via a permutation of indices as follows. Let $\pi$ be a permutation of the integers $1, \dots, 2n$. We shuffle the components of $Z$, obtaining the vector $\pi Z = (Z_{\pi(1)}, \dots, Z_{\pi(2n)})$. We use the symbol π for two meanings, a permutation of indices and the resulting rearrangement of a vector, but this should cause no confusion.

Now let

$$\hat\Delta(\pi Z) = \frac{1}{n}\sum_{i=1}^{n} Z_{\pi(i)} - \frac{1}{n}\sum_{i=n+1}^{2n} Z_{\pi(i)}. \qquad (3)$$

Under $H_{0S}$, the distribution of $\hat\Delta(\pi Z)$ is the same as that of $\hat\Delta(Z)$, for all permutations π.

The set of all permutations of $\{1, 2, \dots, 2n\}$ comprises a group, called $S_{2n}$. The group operation is composition. For two permutations $\pi$ and $\tilde\pi$, their composition, denoted by $\pi\tilde\pi$, is the permutation one gets by applying $\tilde\pi$ first and then applying π to the result.

Although there are M = (2n)! permutations of $\{1, 2, \dots, 2n\}$, there are only $P = \binom{2n}{n}$ distinct cases for $\hat\Delta(\pi Z)$, determined by which n observations get π(i) ≤ n and which get π(i) > n. The ratio $M/P = (n!)^2$ reflects the fact that there are n! ways to permute the observations with π(i) ≤ n (and similarly the ones with π(i) > n) without affecting $\hat\Delta(\pi Z)$.

A permutation test uses the uniform distribution on all M possible sample values of $\hat\Delta(\pi Z)$ as a reference distribution for $\hat\Delta(Z)$, to test $H_{0S}$. In practice we accomplish the permutation test by the shortcut of computing only the P distinct cases. Those cases all have multiplicity M/P and so we can still use a uniform distribution on them. When even P is too large for enumeration to be feasible, then one commonly substitutes a very large random sample of permutations for the complete enumeration.

Here we construct permutation tests of $H_{0S}$ at level $\alpha \in (0, 1)$ based on Chapter 15.2 of Lehmann and Romano (2005). We will work with all M relabelings, instead of reducing to the P distinct ones, in order to use the simple group structure of $S_{2n}$.

To test $H_{0S}$, let T be a real-valued function such that $T(\hat\Delta(Z))$ takes larger values in response to stronger evidence against $H_{0S}$. For example, T(Δ) = |Δ| for a usual two tailed test and T(Δ) = Δ or −Δ for one tailed tests. To get a test with level α we reject $H_{0S}$ if the observed $T(\hat\Delta(Z))$ is larger than all but at most a fraction α of the permuted values. We sort the M values of the permuted test statistics $T(\hat\Delta(\pi Z))$, as π ranges through $S_{2n}$, getting

$$T^{(1)} \le T^{(2)} \le \cdots \le T^{(M)}.$$

Let $k = M - \lfloor M\alpha \rfloor$. Roughly speaking, the test will reject $H_{0S}$ if $T(\hat\Delta(Z))$ is larger than $T^{(k)}$, but the recipe below takes care of ties, the number of which can in principle depend on the data Z. We let $M^+$ be the number of $T(\hat\Delta(\pi Z))$ values strictly larger than $T^{(k)}$ and $M^0$ be the number of $T(\hat\Delta(\pi Z))$ values exactly equal to $T^{(k)}$. Then define the tie-breaker quantity $a = (M\alpha - M^+)/M^0$ and let

$$\psi(Z) = \begin{cases} 1, & T(\hat\Delta(Z)) > T^{(k)}, \\ a, & T(\hat\Delta(Z)) = T^{(k)}, \\ 0, & T(\hat\Delta(Z)) < T^{(k)}. \end{cases}$$

By construction,

$$\sum_{\pi \in S_{2n}} \psi(\pi Z) = M^+ + a M^0 = M\alpha \qquad (4)$$

holds for any $Z \in \mathbb{R}^{2n}$.

The interpretation of ψ is as follows. If ψ = 1 then we reject $H_{0S}$. If ψ = 0 then we don't reject it. If 0 < ψ < 1 then we reject $H_{0S}$ with probability ψ. It is best to have all the ψ values equal to 0 or 1 to avoid randomized decisions. We can usually do this for special choices of α. For continuously distributed data all P possible sample differences are distinct with probability one. Then taking α = A/M for a positive integer A < M yields $\psi(Z) \in \{0, 1\}$ and so eliminates randomized decisions.
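The construction above translates directly into code. The following sketch (Python; the function name and argument layout are ours) takes the pooled data, a collection of relabelings given as index sets for the new treatment group, a test statistic T applied to the mean difference, and a level α, and returns M⁺, M⁰, the tie-breaker a, and ψ.

```python
import math
import numpy as np

def randomization_psi(z, treat_index_sets, T, alpha):
    """Randomization-test quantity psi in the style of Lehmann and Romano (2005), Ch. 15.2.

    z                 pooled data vector of length 2n, treatment first (positions 0..n-1)
    treat_index_sets  iterable of index tuples; each tuple lists the positions of z
                      relabeled as treatment (the observed labeling should be included)
    T                 function mapping a mean difference to a test statistic
    alpha             nominal level
    """
    n = len(z) // 2
    stats = []
    for idx in treat_index_sets:
        mask = np.zeros(len(z), dtype=bool)
        mask[list(idx)] = True
        stats.append(T(z[mask].mean() - z[~mask].mean()))
    stats = np.sort(np.array(stats))
    M = len(stats)
    k = M - math.floor(M * alpha)
    t_k = stats[k - 1]                          # the k-th smallest value, T^(k)
    M_plus = int(np.sum(stats > t_k))
    M_zero = int(np.sum(stats == t_k))
    a = (M * alpha - M_plus) / M_zero
    observed = T(z[:n].mean() - z[n:].mean())   # original labeling
    if observed > t_k:
        psi = 1.0
    elif observed == t_k:
        psi = a
    else:
        psi = 0.0
    return M_plus, M_zero, a, psi
```

Passing only the P distinct relabelings, each once, gives the same ψ here, since every distinct case has the same multiplicity M/P.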

Theorem 1

Suppose that $X_i \sim F$ and $Y_i \sim G$ for $i = 1, \dots, n$ are all independent and that $H_{0S}: F = G$ holds. For a test statistic T(z) and the randomization test quantity ψ constructed above from the permutation group $S_{2n}$, $E(\psi(Z)) = \alpha$.

Remark

Theorem 1 follows from Theorem 15.2.1 of Lehmann and Romano (2005) of which it is a special case. We recap their proof, for later use.

Proof

Using $H_{0S}$ and then equation (4), we get

$$E(\psi(Z)) = \frac{1}{M}\sum_{\pi \in S_{2n}} E(\psi(\pi Z)) = \frac{1}{M} E\Bigl(\sum_{\pi \in S_{2n}} \psi(\pi Z)\Bigr) = \frac{M\alpha}{M} = \alpha.$$

When we have arranged for ψ to be either 0 or 1, then $E(\psi(Z)) = \Pr(\psi(Z) = 1)$. Therefore, the test rejects $H_{0S}$ with probability α. The p value is the smallest α rejected (among the non-randomized choices). ▪

2.2. Stratified permutations and balanced permutations

Two different ideas have been called “balanced permutation” in the literature. In one kind of balanced permutation, the subjects can be split into two groups in two different ways. The split of primary interest is treatment versus control. The second split has levels that are variously called strata or blocks. It may already be known to be quite important, but it is not of primary interest. For example, test and control might correspond to young and old mice while the strata might correspond to two different strains. We might want a test for age that adjusts for known differences between strains.

Suppose that we have n test subjects, where n is even for simplicity, with n/2 from stratum A and n/2 from stratum B. Similarly, there are n/2 control subjects from each stratum. This study design has balanced the treatment versus control dichotomy with respect to the strata, and it is reasonable to enforce such a balance on all the permutations. To get a permutation that is balanced with respect to the stratification, we may select n/2 of the subjects from stratum A and n/2 of the subjects from stratum B to be relabeled as test subjects. The remaining n subjects are relabeled as controls, and of course are also balanced. Stratified permutation tests of this type can be generalized to stratification factors with more than 2 levels and also to settings with more than 1 factor.
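To make the stratified scheme concrete, the sketch below (Python; the labels and function name are hypothetical) draws one relabeling that selects n/2 new treatment subjects from each stratum, as just described.

```python
import numpy as np

def stratified_relabeling(strata, n_treat_per_stratum, rng):
    """Randomly relabel subjects as treatment/control within each stratum.

    strata                array of stratum labels (e.g. 'A'/'B'), one per subject
    n_treat_per_stratum   number of subjects per stratum relabeled as treatment
    rng                   a numpy Generator
    Returns a boolean array: True = relabeled treatment, False = relabeled control.
    """
    strata = np.asarray(strata)
    new_treat = np.zeros(len(strata), dtype=bool)
    for s in np.unique(strata):
        members = np.flatnonzero(strata == s)
        chosen = rng.choice(members, size=n_treat_per_stratum, replace=False)
        new_treat[chosen] = True
    return new_treat

# Hypothetical example: 8 subjects, 4 in stratum A and 4 in stratum B;
# each stratum contributes 2 subjects to the new treatment group.
rng = np.random.default_rng(2)
labels = np.array(list("AAAABBBB"))
print(stratified_relabeling(labels, 2, rng))
```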

Tusher et al. (2001) employed this form of stratification to compare effects of radiation on genes, adjusting for cell lines. Their subjects had a third dichotomy, corresponding to aliquots, that need not concern us here.

The other method called balanced permutations is the one described in the introduction. These balanced permutations are not the same as stratified permutations. The difference is that stratified permutations stratify on a covariate dichotomy, while balanced permutations stratify on the dichotomy of primary interest. Stratification on a covariate is sometimes studied as a restricted randomization and has a long history in nonparametric methods; for example, see Good (2000) and Edgington and Onghena (2007).

We will use the term "balanced permutation" to describe only the new method of stratifying on the variable of interest. As described in the introduction, there is reason to be optimistic about the performance of balanced permutations. On the other hand, an ordinary unbalanced permutation test applied to $\hat\Delta = \bar X - \bar Y$ is asymptotically equivalent to Student's t-test. That test is the appropriate one for Gaussian data, and is asymptotically well behaved otherwise, under mild moment conditions. We might therefore expect that there is little room for improvement from balanced permutations, at least for $\hat\Delta$ and moderately large n.

To justify using balanced permutations to construct a reference distribution, we need to apply a result like Theorem 1. The key ingredient is the group structure. Stratification has a group structure. In the setting above we may employ permutations $\pi_A$ and $\pi_B$ within strata A and B, respectively. The permutation of the whole data set may be represented by the pair $(\pi_A, \pi_B)$. These pairs have a group structure. Formally, it is the direct sum of the groups to which $\pi_A$ and $\pi_B$ belong. Practically, the permutations for strata A and B are just handled separately and independently. The group structure can extend to three or more strata.

The set $\Pi_{\mathrm{bal}}$ of balanced permutations is not a group. To begin with, the original data Z are the identity permutation of the data, but the identity permutation is not a balanced permutation. We might simply include the identity permutation, with the same multiplicity as each balanced permutation, into $\Pi_{\mathrm{bal}}$, getting the set $\Pi_{\mathrm{bal}}^{0}$. Then our putative p value is the proportion of permutations $\pi \in \Pi_{\mathrm{bal}}^{0}$ for which $T(\hat\Delta(\pi Z)) \ge T(\hat\Delta(Z))$ holds.

The problem with this approach is that the set $\Pi_{\mathrm{bal}}^{0}$ is not a group either, under composition. A balanced permutation of a balanced permutation is not necessarily another balanced permutation. Therefore, $\Pi_{\mathrm{bal}}$ and $\Pi_{\mathrm{bal}}^{0}$ both fail the closure requirement of a group. An example is illustrated in Figure 1.

FIG. 1. This figure shows that the result of applying two balanced permutations in succession need not be another balanced permutation. The first column shows treatment and control groups, T and C, of four observations each. The middle column shows the results of a balanced permutation that swapped two of the treatment subjects for two of the controls, as indicated by line segments. The third column shows the result of a second such swap. The final treatment group has three members of the original treatment group and one member of the original control group, and so it is unbalanced.

It is not obvious at first sight where the proof of Theorem 1 makes use of the group property of $S_{2n}$. But if we replace the group $S_{2n}$ by $\Pi_{\mathrm{bal}}$ (or $\Pi_{\mathrm{bal}}^{0}$), and follow the steps, the proof fails at the third equality sign. We can still pick k so that $T^{(k)}$ is beaten by a fraction α of the values $T(\hat\Delta(\pi Z))$ taken over balanced permutations π. For a group of permutations, that means $T^{(k)}$ will be beaten by a fraction α of all $T(\hat\Delta(\tilde\pi(\pi Z)))$ as $\tilde\pi$ ranges over the group and π is any fixed group member. This happens for the group $S_{2n}$, because composing every member of the group with a fixed π simply gives back the group. But no such result holds for $\Pi_{\mathrm{bal}}$ or $\Pi_{\mathrm{bal}}^{0}$. For balanced permutations, the set of ordered values of $T(\hat\Delta(\tilde\pi(\pi Z)))$ as $\tilde\pi$ varies for fixed π is not constant. That set depends on π and so its 1 − α quantile will in general depend on π.
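The closure failure is easy to check by direct computation. The sketch below (Python; the index conventions are ours) composes two balanced relabelings of 2n = 8 subjects, in the spirit of Figure 1, and finds that the composite relabeling leaves three original treatment members in the treatment group, so it is not balanced.

```python
n = 4   # group size: indices 0..3 are original treatment, 4..7 original control

def is_balanced(perm):
    """perm[i] is the old index placed at new position i; positions 0..n-1 form
    the new treatment group.  Balanced: exactly n/2 of those are old treatment
    indices (i.e. less than n)."""
    return sum(1 for i in range(n) if perm[i] < n) == n // 2

def then(first, second):
    """Net relabeling from applying `first` to the data and then `second` to the
    result: new position i ends up holding old index first[second[i]]."""
    return [first[second[i]] for i in range(len(first))]

p1 = [0, 1, 4, 5, 2, 3, 6, 7]   # balanced: swaps treatment subjects 2,3 for controls 4,5
p2 = [0, 1, 4, 6, 2, 3, 5, 7]   # balanced: swaps treatment subjects 2,3 for controls 4,6
net = then(p1, p2)
print(is_balanced(p1), is_balanced(p2), is_balanced(net))   # True True False
print([i for i in net[:n] if i < n])   # [0, 1, 2]: three original treatment members
```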

3. Coverage Properties of Balanced Permutations

Because balanced permutations do not satisfy the conditions of Theorem 1, we cannot be sure that they provide a suitable p value for $H_{0S}$. We look at some theoretical and empirical results to see how they do.

In microarray experiments, n is often small, because arrays are costly. For example, Tusher et al. (2001) compare two groups of n = 4 arrays. Our main interest is therefore in values of n between 4 and 20. We also look briefly at n = 2 because a mathematical answer is possible there and because it illustrates the issues.

3.1. Theory for n = 2

For n = 2 there are only B = 4 distinct balanced permutations, and so we can easily look at all of them. Suppose that $Z = (X_1, X_2, Y_1, Y_2)$. Then we can write out the results from all of the balanced permutations as a vector

$$(\hat\Delta^*_1, \hat\Delta^*_2, \hat\Delta^*_3, \hat\Delta^*_4) = \tfrac12\bigl(X_1 - X_2 + Y_1 - Y_2,\; X_1 - X_2 - Y_1 + Y_2,\; X_2 - X_1 + Y_1 - Y_2,\; X_2 - X_1 - Y_1 + Y_2\bigr).$$

Theorem 2

Suppose that X1, X2, Y1, and Y2 are independent random variables with the same continuous distribution F. Let $b = \#\{j : \hat\Delta > \hat\Delta^*_j\}$ be the number of balanced permutations that the original permutation beats. Then $\Pr(b = k) = 1/6$ for $k \in \{0, 1, 3, 4\}$ and $\Pr(b = 2) = 1/3$.

Proof

We easily find that $\hat\Delta > \hat\Delta^*_1$ if and only if $X_2 > Y_1$. From the other three cases, we find that $\hat\Delta$ beats all 4 balanced permutations if and only if both $X_i$ are larger than both $Y_i$. This has probability 1/6 by symmetry. A similar argument shows that $\hat\Delta$ has probability 1/6 of beating none of the $\hat\Delta^*_j$. For $\hat\Delta$ to beat exactly 3 of the $\hat\Delta^*_j$, the X's must rank first and third among the four data points. Thus Pr(b = 3) = 1/6 as well. Similarly Pr(b = 1) = 1/6, and by subtracting we find Pr(b = 2) = 1/3. ▪

Thus, for n = 2, the balanced permutation histogram for comparing means is not uniform. It is slightly conservative. When the original permutation beats all 4 balanced permutations, we would naively claim a p value of 1/5. But this event only has probability 1/6 of happening. Balanced permutations are not likely to be applied when n = 2, except perhaps when results from multiple genes are to be pooled as described in Section 4. Unfortunately, when n gets larger, the non-uniformity does not disappear, and the method switches from conservative to permissive.

It is interesting to note that the number of permutations beaten has a double-high spike in the middle at 2. If we insert a pseudo-value $\hat\Delta^*_5 = 0$, then by symmetry we split that spike into two equal pieces and obtain a uniform histogram for b. This spike at the median appears in all of the balanced permutation histograms for treatment effects that we have investigated. It is not a simple consequence of the histogram being symmetric. For instance, the histogram for the full permutation set is symmetric but it does not have a double-high spike in the middle.

3.2. Numerical results for small n>2

The method used to prove Theorem 2 does not give us an answer for n > 2. When n = 4 the argument there shows that the observed difference will beat all 36 balanced permutations if and only if the smallest n/2 = 2 of the Xi have a higher sum than the largest 2 of the Yi. The probability of this event depends on the underlying distribution of Xi and Yi. We proceed numerically, using the N(0, 1) distribution for illustration.

For n = 4 observations per group there are now $B = \binom{4}{2}^2 = 36$ balanced permutations. The observed $\hat\Delta$ will beat some number in $\{0, 1, \dots, 36\}$ of them. Ideally, the number of balanced permutations giving $\hat\Delta^* < \hat\Delta$ should have the uniform distribution on those 37 levels. The top plot in Figure 2 shows the actual distribution, from 10,000 simulations of the case where $X_1, \dots, X_4, Y_1, \dots, Y_4$ are independent N(0, 1) random variables. The histogram is nearly uniform except that the bars at 0, 18, and 36 are about twice as high as desired. There is also a slight tendency for small and large multiples of 3 to have higher bars than their neighbors.
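A scaled-down version of this simulation is easy to reproduce. The sketch below (Python; the seed is arbitrary) tabulates how many of the 36 balanced permutation effects fall below the observed effect for n = 4 groups of N(0, 1) data.

```python
from itertools import combinations
import numpy as np

def count_beaten(x, y):
    """Number of balanced-permutation mean differences strictly below the observed one."""
    n = len(x)
    z = np.concatenate([x, y])
    observed = x.mean() - y.mean()
    beaten = 0
    for tx in combinations(range(n), n // 2):
        for ty in combinations(range(n, 2 * n), n // 2):
            new_trt = list(tx) + list(ty)
            new_ctl = [i for i in range(2 * n) if i not in set(new_trt)]
            if z[new_trt].mean() - z[new_ctl].mean() < observed:
                beaten += 1
    return beaten

rng = np.random.default_rng(3)
n, reps = 4, 10_000
counts = np.array([count_beaten(rng.normal(size=n), rng.normal(size=n)) for _ in range(reps)])
freqs = np.bincount(counts, minlength=37) / reps
print(freqs * 37)   # ideally about 1 everywhere; spikes appear near 0, 18, and 36
```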

FIG. 2. These figures plot the probability mass function of the number of balanced permutations of X − Y beaten by the original mean difference. The scale is chosen so that the heights of the bars average to 1 in each figure. The top figure is for n = 4 N(0,1) random variables contributing to each of $\bar X$ and $\bar Y$. With n = 4 there are 36 balanced permutations. The bottom figure is for n = 6, where there are 400 balanced permutations. Both figures differ from the ideal in which all bars have height 1.

The bottom plot in Figure 2 considers the case of n = 6. Now there are B = 400 balanced permutations. The bar at 200 is about twice the desired height, while the ones at 0 and 400 are about 6.8 times as high as they should be.

Histograms for n = 8 and 10 (not shown) are also nearly flat except for extra height in the middle and much greater height at the ends. The heights at the extreme ends are described in Section 3.3, for values of n from 2 to 20.

The p values quoted above are one-tailed. For two-tailed p-values one might simply double both the nominal and observed quantities. The true two-tailed p value for a mean that beats all 400 balanced permutations would be about 2 × 6.8/401 instead of 2/401. In principle we might be better off putting both $\hat\Delta^*$ and $-\hat\Delta^*$ into the reference group. Then the two-tailed p-value would be about 2 × 6.8/402, but changing the denominator from 401 to 402 is a much smaller effect than the others we're studying, so we don't consider it further.

For n = 2 there is enough symmetry to determine the histogram of p values. For larger n and Gaussian data it may be possible to apply computational geometry. For example, when the treatment difference beats all balanced permutations, the combined vector Z lies in an intersection of $B$ half-spaces in $\mathbb{R}^{2n}$. But as remarked above, the answer is not distribution free. To confirm this distribution effect, we compared results for exponential data to those for normal data for n = 4.

Results of a simulation of 100,000 cases for n = 4 are reported in Table 1. The simulations were run with independent N(0, 1) data for all Xi and Yi. We counted how many times the observed $\hat\Delta$ beat all 36 balanced permutation values. Then the simulation was rerun changing the N(0, 1) values to exponential (mean 1) values. The second simulations were coupled to the first by the probability integral transformation $X \mapsto -\log(1 - \Phi(X))$, which turns a N(0, 1) random variable X into an exponential one. There were 370 cases where the Gaussian simulation beat all 36 of its balanced permutations while the exponential did not. There were 801 reverse cases. This matched pair difference is statistically significant by McNemar's test. It is equivalent to more than 12 standard deviations using equation 10.4 of Agresti (2002). The size of the effect however is not extremely large. For Gaussian data, the observed treatment effect beats all 36 balanced permutations about 1.843 times as often as it should, while for exponential data the ratio is about 2.003.

Table 1.

Results of 100,000 Simulations of a Balanced Permutation Test for the Difference of Means with n = 4 per Group

                 Exp. < 36   Exp. = 36
Gaussian < 36        94217         801
Gaussian = 36          370        4612

The simulation was run with Gaussian values (all observations in both groups were independent N(0, 1)). Then it was rerun with values from the exponential distribution with mean 1. The simulations were coupled. The table shows counts of the number of times that the original permutation beat all 36 balanced permutations.

3.3. Results for larger n

The most interesting feature of the balanced permutation histogram is the probability that $\hat\Delta$ is larger than all $B_n = \binom{n}{n/2}^2$ balanced permutations. It is fairly easy to investigate this probability in the Gaussian setting because of the following lemma.

Lemma 1

Let n ≥ 2 be an even integer, and suppose that $X_1, \dots, X_n$ and $Y_1, \dots, Y_n$ are independent $N(\mu, \sigma^2)$ random variables for some σ > 0. Then the observed treatment effect $\hat\Delta = \bar X - \bar Y$ is statistically independent of the whole collection of balanced permutation treatment effects $\hat\Delta^*_1, \dots, \hat\Delta^*_{B_n}$.

Proof

Under the hypothesis, $Z = (X_1, \dots, X_n, Y_1, \dots, Y_n)$ has a multivariate normal distribution. Therefore, so does $(\hat\Delta, \hat\Delta^*_1, \dots, \hat\Delta^*_{B_n})$. Because of the balance, the correlation between $\hat\Delta$ and each $\hat\Delta^*_j$ is zero. Therefore, $\hat\Delta$ is independent of $(\hat\Delta^*_1, \dots, \hat\Delta^*_{B_n})$. ▪

Lemma 1 lets us estimate the probability that $\hat\Delta$ beats all $B_n$ balanced permutations by sampling $X_i$ and $Y_i$ for $i = 1, \dots, n$ many times, computing $\hat\Delta^*_{\max} = \max_{1 \le j \le B_n} \hat\Delta^*_j$ each time, and then averaging the values of

$$\Pr\bigl(\hat\Delta > \hat\Delta^*_{\max} \mid \hat\Delta^*_{\max}\bigr) = 1 - \Phi\Bigl(\sqrt{\tfrac{n}{2}}\,\frac{\hat\Delta^*_{\max}}{\sigma}\Bigr),$$

which uses the fact that $\hat\Delta \sim N(0, 2\sigma^2/n)$ independently of $\hat\Delta^*_{\max}$.
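In code, the Lemma 1 device amounts to averaging a normal tail probability evaluated at the maximum balanced effect. The sketch below (Python with scipy; far fewer replications than the 100,000 used for Table 2, so it is only indicative) illustrates the computation for σ = 1.

```python
from itertools import combinations
import numpy as np
from scipy.stats import norm

def max_balanced_effect(x, y):
    """Largest treatment effect over all balanced relabelings of (x, y)."""
    n = len(x)
    z = np.concatenate([x, y])
    best = -np.inf
    for tx in combinations(range(n), n // 2):
        for ty in combinations(range(n, 2 * n), n // 2):
            trt = list(tx) + list(ty)
            ctl = [i for i in range(2 * n) if i not in set(trt)]
            best = max(best, z[trt].mean() - z[ctl].mean())
    return best

def estimate_pn(n, reps, rng):
    """Monte Carlo estimate of p_n = Pr(observed effect beats all balanced effects)
    for N(0,1) data.  By Lemma 1, given the maximum balanced effect m, we have
    Pr(effect > m) = 1 - Phi(m * sqrt(n/2)) because the effect is N(0, 2/n)."""
    total = 0.0
    for _ in range(reps):
        m = max_balanced_effect(rng.normal(size=n), rng.normal(size=n))
        total += 1.0 - norm.cdf(m * np.sqrt(n / 2.0))
    return total / reps

rng = np.random.default_rng(4)
print(estimate_pn(4, 2000, rng))   # should come out near 0.051 (Table 2)
```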

The results of this computation are shown in Table 2. For each even n from 2 to 20 inclusive, 100,000 data sets were investigated. The probability $p_n$ of beating all balanced permutations is much larger than $1/(B_n + 1)$, and the discrepancy grows quickly with n. By n = 20 the p values are off by a factor of roughly 400,000. A least squares fit of $\log \theta_n$ versus n, where $\theta_n = (B_n + 1)p_n$, has intercept −2.335, slope 0.745, and an R² of 0.995. While the fit is good over the range 2 to 20, there is also positive curvature, suggesting that extrapolating this model will underestimate the ratio.

Table 2.

Rounded Numerical Estimates of the Probability that an Estimated Treatment Effect from n Observations per Group Beats All $B_n = \binom{n}{n/2}^2$ of its Balanced Permutation Treatment Effects

n   $\hat{p}_n$   $(B_n + 1)\hat{p}_n$   $2.57\,\mathrm{CV}_n$
 2 0.166000 0.83 0.006
 4 0.051000 1.89 0.008
 6 0.016700 6.71 0.011
 8 0.005700 27.90 0.014
10 0.001980 125.00 0.017
12 0.000699 596.00 0.021
14 0.000248 2,920.00 0.026
16 0.000089 14,800.00 0.030
18 0.000032 75,500.00 0.037
20 0.000012 401,000.00 0.046

The estimate $\hat{p}_n$ is in the second column. The normalized estimate $\hat\theta_n = (B_n + 1)\hat{p}_n$ is in the third column. Ideally this should be near 1. The fourth column has $\Phi^{-1}(0.995) = 2.57$ times the sampling coefficient of variation of $\hat{p}_n$. For example, the approximate 99% confidence interval for θ8 is 1.4% higher or lower than the estimate 27.9.

3.4. Balanced permutation tests based on t statistics

In high-throughput settings, permutation tests are often run on the t-statistic of equation (2), instead of the raw treatment effect $\hat\Delta$. Let t* be the value of t computed after a reassignment of the treatment versus control labels.

For a single test statistic this may seem redundant, because the permutation test on $\hat\Delta$ already behaves similarly to the t statistic. But in high throughput settings we often have many test statistics, for instance one per probe, whose underlying data have extremely different variances. The $\hat\Delta$ from different probes are not comparable. The t* values are more comparable because they normalize $\hat\Delta^*$ by a standard error. Permutations of t* from one probe may thus be useful in forming a test for another probe.

The theorem below shows that balanced permutations of t statistics have the same calibration problem that balanced permutations of treatment differences do. As a result, the values in Table 2 apply to t tests too.

Theorem 3

Let n ≥ 2 be an even number. Let $X_1, \dots, X_n, Y_1, \dots, Y_n$ be distinct numbers and let $X^*_1, \dots, X^*_n, Y^*_1, \dots, Y^*_n$ be those same values after a possible relabeling. Let $\hat\Delta = \bar X - \bar Y$, let t be the t statistic in (2), and take $\hat\Delta^*$ and t* to be the analogous quantities on the relabeled values. Then $\hat\Delta \ge \hat\Delta^*$ holds if and only if t ≥ t*.

Proof

The value $\hat\Delta^*$ is a strictly increasing function of $\bar X^*$, as we vary the labelling. The proof follows by showing that t* is also strictly increasing in $\bar X^*$. For notational simplicity we show that t is strictly increasing in $\bar X$.

Let $m = \frac{1}{2n}\sum_{i=1}^{n}(X_i + Y_i)$. Then $\hat\Delta = 2(\bar X - m)$, where m does not depend on the labeling. Let D be the positive square root of $D^2 = \sum_{i=1}^n (X_i - \bar X)^2 + \sum_{i=1}^n (Y_i - \bar Y)^2$, so that $t = \sqrt{n(n-1)}\,(\bar X - \bar Y)/D$. We need to show that $(\bar X - m)/D$ is strictly increasing in $\bar X$, where D depends on the labeling.

Introduce $V = \sum_{i=1}^n\bigl((X_i - m)^2 + (Y_i - m)^2\bigr)$, which does not depend on the labeling. Replacing $X_i - \bar X$ by $(X_i - m) - (\bar X - m)$ in $D^2$, and similarly for the Y values, we get $D^2 = V - 2n(\bar X - m)^2$ after simplification. Now

$$\frac{\bar X - m}{D} = \frac{\bar X - m}{\sqrt{V - 2n(\bar X - m)^2}}. \qquad (5)$$

The derivative of (5) with respect to $\bar X$ is $V/D^3$, which is strictly positive, and so t is increasing in $\bar X$. ▪

Theorem 3 shows that we get the same calibration problem whether we use t or $\hat\Delta$ when using a one tailed criterion. The same equivalence holds for the two tailed criteria $|t|$ and $|\hat\Delta|$. Both $|t|$ and $|\hat\Delta|$ are increasing in $\bar X$ over the range $\bar X > m$, both are decreasing in $\bar X$ when $\bar X < m$, and both are 0 when $\bar X = m$.

The assumption that all 2n values are distinct can be weakened. It ensures that D is never 0, but that can be enforced with a weaker condition.

3.5. Practical consequences

What we see in the figures is that the reference histogram is roughly uniform but with a spike in the middle and two more at the ends that grow quickly with n.

The spikes at the ends are the most serious because they strongly affect p-values. Suppose for example that n = 10. Then when the treatment effect beats all balanced permutations we might naively quote a one-tailed p value of $1/(B_{10} + 1) = 1/63{,}505 \approx 1.6 \times 10^{-5}$. However, the proper p-value in that setting is about $1.98 \times 10^{-3}$, roughly 125 times larger.

In studies that screen for interesting genes, using a small number n of arrays per group, the case where the observed effect beats all permutations is the most important one. When many genes are investigated, one has to make adjustments for multiple comparisons. Step down methods start by looking at the most significant genes first before choosing a cutoff and quoting a false discovery rate. For an account of multiple testing procedures, see Dudoit and van der Laan (2008). When the very smallest p-values are permissive, the whole process will be adversely affected.

The problem is not confined to the very smallest p-value. If the original treatment effect beats all but one of the balanced permutations, then the claimed p-value is about (125 + 1) / 2 = 63 times as small as it should be, assuming that the second bar from the right in the reference histogram has nearly its proper height. Similarly, outcomes that are near the ends of the histogram will get permissive p-values.

4. Pooled Balanced Permutations

As mentioned previously, microarray technologies can measure the expression levels of thousands of genes at once. To this end, most microarrays use multiple measuring points (probes) for every gene. The main use of balanced permutations in high throughput problems is to calibrate a p value for one probe, by pooling data from a large number N of similar probes. It is also used to compute summaries of N linked tests, such as estimated FDRs. Sometimes there is a near one to one match relating probes to genes, but in other settings there can be numerous probes per gene. Typically, N is in the thousands, while n is much smaller.

Fan et al. (2004) use balanced permutations to get a pooled reference distribution to set N p-values. Jones-Rhoades et al. (2007) use them to estimate a rate for false discovery of genes, in a procedure that merges nearby differentially expressed probes into up and downregulated genomic intervals and then searches for new genes in those intervals.

Suppose that we are making N tests of type H0, one for each probe. For the i'th probe we get a t statistic ti. The null distribution of ti will not be exactly the t(2n−2) distribution, if the underlying data are not normally distributed. Furthermore, one might replace the t test by some quantity that is more robust to very noisy estimates of probe specific standard deviations.

Either a permutation or a balanced permutation analysis generates B permutations of the data and forms a reference distribution of BN test statistics $t_{ij}$ for probes $i = 1, \dots, N$ and permutations $\pi_j$ for $j = 1, \dots, B$. With such a reference distribution, the p value for $t_{i'}$ is

$$p_{i'} = \frac{1}{BN}\,\#\bigl\{(i, j) : T(t_{ij}) \ge T(t_{i'})\bigr\}, \qquad (6)$$

where the usual choice for T is $T(t) = |t|$.

where the usual choice for T is Inline graphic. In using equation (6) the original permutation of the data should be one of the B included.
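Given the N × B matrix of permuted statistics, equation (6) is a single counting operation. The sketch below (Python; the inputs are simulated placeholders rather than real microarray data) computes pooled p-values for all probes at once with T(t) = |t|.

```python
import numpy as np

def pooled_pvalues(t_obs, t_perm):
    """Pooled permutation p-values following equation (6).

    t_obs   length-N vector of observed statistics, one per probe
    t_perm  N x B matrix; t_perm[i, j] is the statistic for probe i under
            permutation j (the original labeling should be one of the B columns)
    Uses T(t) = |t|: p_i is the fraction of all N*B reference values whose
    absolute value is at least |t_obs[i]|.
    """
    ref = np.abs(t_perm).ravel()
    ref.sort()
    # For each observed |t|, count reference values >= it via binary search.
    counts = ref.size - np.searchsorted(ref, np.abs(t_obs), side="left")
    return counts / ref.size

# Hypothetical example: 1000 probes, 20 permutations each.
rng = np.random.default_rng(5)
t_perm = rng.standard_t(df=6, size=(1000, 20))
t_obs = t_perm[:, 0]          # pretend column 0 is the original labeling
print(pooled_pvalues(t_obs, t_perm)[:5])
```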

4.1. Non-null genes

There are known problems with pooling permutation-based statistics from N problems, whether the permutations are balanced or not. The main one is that the presence of non-null genes among the N cases tested distorts the reference distribution. In a full permutation analysis, many of the sampled permutations are very unbalanced for the treatment effect and they generate a heavy tailed reference distribution. The result is a conservative method that over-estimates FDR. Fan et al. (2005) and Xie et al. (2005) both raise this issue and seek to apply the permutations only to those genes that are not differentially-expressed. Identifying null genes is at least as hard as identifying non-null genes. Yang and Churchill (2007) compare methods that attempt to remove non-null genes.

The opposite occurs in balanced permutations. The balance has the effect of inflating the denominators of t statistics while the treatment effect cancels out of the numerators. The result is a resampled test statistic distribution with tails that are too short. Then, to the extent that these genes are included in the permutation, the FDR will be underestimated.

Table 3 shows some results of this type. Let the non-null data have $X_i \sim N(\mu, 1)$ and $Y_i \sim N(0, 1)$, all independent, and suppose that the null data have a common $N(\tau, \sigma^2)$ distribution for all of the $X_i$ and $Y_i$. The table presents the probability that t computed from the null data beats all the balanced permutation t statistics from the non-null data. The effect grows with μ and with n. The case with μ = 0 and n = 4 is not permissive, but μ = 0 is a null case, described in more detail below.

Table 3.

This Table Presents the Probability that a Null t Statistic Beats the Largest t Statistic Among Balanced Permutations of Independent Non-null Data

              n = 4                       n = 6
     μ = 0    μ = 1    μ = 2      μ = 0    μ = 1    μ = 2
     0.971    1.936    4.234      1.813    6.144    20.64
     0.030    0.025    0.017      0.043    0.037    0.021

The values shown in the top row are $(B + 1)\Pr\bigl(t > \max_{1 \le j \le B} t^*_j\bigr)$, which should be 1 for a properly calibrated reference set. Here t is a t statistic computed with independent $X_i, Y_i \sim N(\tau, \sigma^2)$ for $i = 1, \dots, n$, and $t^*_1, \dots, t^*_B$ are all of the balanced permutations of a t statistic, on data where $X_i \sim N(\mu, 1)$ and $Y_i \sim N(0, 1)$, independently of each other and of t. The second row gives 99% relative error values. There were 10,000 simulations of each case.

There is a partial remedy for the non-null case, described by Pan (2003), which should be helpful when n is large. Notice that each observation belongs to either the test or control group in the original data and to either the test or control group in the permuted data. This yields four different sets of observations in any balanced permutation. The idea of Pan (2003) is to form the variance estimate in the denominator of the t statistic by pooling sample variances from these four subsamples instead of just the permuted treatment and control groups. This removes mean differences from the variance estimate in the denominator of t. Unfortunately, the degrees of freedom for the resulting reference distribution are 2n − 4 instead of 2n − 2, which is an important difference when n is small.
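The following sketch (Python) shows the mechanics of this remedy as we read it: sums of squares are pooled over the four subsamples formed by crossing the original labels with the permuted labels, giving a denominator on 2n − 4 degrees of freedom. The exact statistic in Pan (2003) may differ in detail, so treat this only as an illustration.

```python
import numpy as np

def four_group_t(z, orig_treat, perm_treat):
    """t-like statistic for a permuted labeling, with the variance pooled over the
    four subsamples formed by crossing original and permuted labels (in the spirit
    of Pan, 2003).  The denominator has 2n - 4 degrees of freedom.

    z           pooled data vector of length 2n
    orig_treat  boolean mask of the original treatment group
    perm_treat  boolean mask of the permuted treatment group
    """
    z = np.asarray(z, dtype=float)
    orig_treat = np.asarray(orig_treat, dtype=bool)
    perm_treat = np.asarray(perm_treat, dtype=bool)
    n = perm_treat.sum()
    delta = z[perm_treat].mean() - z[~perm_treat].mean()
    ss, df = 0.0, 0
    for a in (orig_treat, ~orig_treat):
        for b in (perm_treat, ~perm_treat):
            cell = z[a & b]
            if cell.size > 0:
                ss += np.sum((cell - cell.mean()) ** 2)   # within-cell sum of squares
                df += cell.size - 1
    s2 = ss / df          # df = 2n - 4 when all four cells are non-empty
    return delta / np.sqrt(s2 * 2.0 / n)
```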

Scheid and Spang (2006, 2007) also note the problem from including non-null genes in the reference distribution. They have proposed a method of filtered permutations. Instead of filtering the list of genes to be mostly null, they filter the set of permutations to be the ones π for which a subsequent permutation $\tilde\pi$ gives nearly U(0, 1) distributed p values for a comparison of the two permuted groups. That is, they employ an iterated permutation analysis. By considering permutations of permutations their method might be accomplishing numerically what the group structure provides mathematically. But there is as yet no theoretical analysis of filtered permutations.

4.2. Null genes

In the case of balanced permutations, there are also difficulties in the distribution of p values from the null genes. For any probe i, we know that the uniform distribution on its own balanced permutation values, whether of $\hat\Delta$ or of t, fails to be a good reference distribution. However, when we combine all N of these test distributions into a single reference distribution, then for whichever probe i′ we are testing, our reference distribution is dominated by data from probes i ≠ i′. Hence we need to look at how a reference distribution from one probe works for data from another probe.

A precise comparison of $T(t_{i'})$ with the reference values $T(t_{ij})$ depends on the dependence between data from probes i and i′. There is usually some dependence, because different genes on the same array tend to be correlated. The case where probes i and i′ are highly correlated should be similar to that in which a probe is compared to its own balanced permutations. The case of independent probes is easier to study than probes with arbitrary dependence, and this case should be favorable to the use of balanced permutations. So we consider reference data $X_i$, $Y_i$ that are independent of the data for the probe i′ being tested.

For the test statistic $\hat\Delta$, on normal data, we find from Lemma 1 that it makes no difference whether probe i is compared to balanced permutations of itself or of another null probe i′.

As mentioned above, when multiple genes are used to calibrate each other, it is more reasonable to permute t statistics. In Table 3, the values for μ = 0 correspond to independent null probes. Figure 3 shows histograms of the number of balanced permuted t statistics from one sample that are beaten by a t statistic from another independently generated sample. Because the second sample is independent of the first we can use the known t distribution to advantage. In this simulation, we generated the list of balanced t values 1000 times. For each of these times, we sorted the 36 possible values, splitting the real line into 37 bins, and evaluated the probability that an independent t random variable would land in each of those bins. The bars shown are the averages of these probabilities over the 1000 simulations.

FIG. 3. These plots show histograms of the number of balanced permutations of the t statistic from one data set that are beaten by a t test from an independently sampled data set. All data values were independent N(0,1) in both data sets. The top figure is for n = 4, and the bottom one is for n = 6.

When n = 4 the spikes at the ends have vanished, but the one in the middle remains at about 2.0. So for this case we would get very nearly the desired flat histogram from the device of inserting 0 as a pseudo value. When we repeat the simulation for n = 6, there are spikes at the end. The t statistic computed on one data set with n = 6 has about twice the chance it should have of beating all the balanced permutations from another data set.

4.3. Multiple independent null genes

Now suppose that probe i′ under test is compared to a reference distribution made up of numerous probes indexed by $i = 1, \dots, I$, and that all I + 1 of the probes are statistically independent. Probe i′ may tend to beat all balanced permutations of any probe i more often than it should. But when the I probes are independent of each other it becomes less likely that i′ beats all balanced permutations of all I other probes.

We generated 1,000,000 normally distributed data sets of two groups with n = 6 observations each. From each of these, we found $|t|^*_{\max}$, the maximum of 200 two-tailed t statistics on 10 degrees of freedom each. These million values were then arranged into roughly $10^6/I$ lists of I values each. The probability that an independently generated value of |t(10)| exceeds the largest of these I values should average to 1/(200I) for proper calibration. Figure 4 shows corresponding sample averages presented as a multiple of 1/(200I), as I varies from 1 to 25. We see that pooling independent null data from balanced permutations is somewhat permissive but that the effect disappears by around I = 15 to 25.
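A heavily scaled-down version of this simulation is sketched below (Python with scipy; only 1,000 data sets rather than 1,000,000, so the estimate is noisy). For lists of I = 10 independent maxima of balanced permutation |t| values, it reports 200I times the probability that an independent |t(10)| exceeds the list maximum.

```python
from itertools import combinations
import numpy as np
from scipy.stats import t as t_dist

def two_sample_t(a, b):
    """Equal-variance two-sample t statistic, as in equation (2)."""
    n = len(a)
    s2 = (np.sum((a - a.mean()) ** 2) + np.sum((b - b.mean()) ** 2)) / (2 * n - 2)
    return (a.mean() - b.mean()) / np.sqrt(s2 * 2.0 / n)

def max_balanced_abs_t(x, y):
    """Largest |t| over all balanced relabelings of (x, y)."""
    n = len(x)
    z = np.concatenate([x, y])
    best = 0.0
    for tx in combinations(range(n), n // 2):
        for ty in combinations(range(n, 2 * n), n // 2):
            trt = list(tx) + list(ty)
            ctl = [i for i in range(2 * n) if i not in set(trt)]
            best = max(best, abs(two_sample_t(z[trt], z[ctl])))
    return best

rng = np.random.default_rng(6)
n, I, n_lists = 6, 10, 100
maxima = np.array([max_balanced_abs_t(rng.normal(size=n), rng.normal(size=n))
                   for _ in range(I * n_lists)]).reshape(n_lists, I)
tail = 2.0 * t_dist.sf(maxima.max(axis=1), df=2 * n - 2)   # Pr(|t(10)| > list maximum)
print(tail.mean() * 200 * I)   # near 1 for large I, above 1 for small I
```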

FIG. 4. This figure shows that a reference distribution based on balanced permutation t tests of I > 1 independent null probes is not as permissive as one using a single independent null probe. The horizontal axis shows the number I of independent probes used in a simulation of balanced permutation t tests for two groups of n = 6 observations each. The vertical axis shows 200I times the probability that the original data has a value of |t| larger than that of all 200I balanced permutation |t| values in the reference set. The dots show estimated probability ratios, the curves show upper and lower 99% pointwise confidence limits, and there is a horizontal reference line at 1.

The same pattern holds for n = 8 observations per group. There the ratio starts nearer to 4.0 but is again close to 1 for I = 20.

If we can successfully eliminate the non-null cases and then pool a large number of independent probes, we will get a suitable reference distribution. Then again we might also do that with ordinary permutations, and should we miss some non-null cases, the consequence would be a conservative test instead of a permissive one.

5. Discussion

We find no reason to prefer balanced permutations to ordinary permutations for genomic applications. The intuitive argument that leads one to expect greater power from them is countered by the permissive p-values they produce when there is no effect. At a minimum, more work is required to justify their use or properly calibrate them.

When used to judge the sampling distribution of a test statistic from N different genes, balanced permutation reference sets are adversely affected by non-null cases among those permuted. So are ordinary permutations. But where ordinary permutations yield a conservative reference distribution, balanced permutations provide a permissive one, when based on the t statistic or on a difference in means. Such permissive methods are more likely to lead to false discoveries.

In practice, it is important to attempt to remove the non-null cases before constructing a permutation-based reference distribution. When the balanced permutations are applied to null cases, we find that balanced permutations can still be mildly permissive. Fortunately, the effect seems to disappear when a large number of null cases have been pooled. However, that disappearance was for null cases that were independent of the case being tested, a condition not required for ordinary permutations. Correlations among the null cases and between the null cases and the gene being tested could result in a distortion that does not decrease for large numbers of pooled reference genes.

For the problems that we have analyzed, it is clear that balanced permutations cannot simply be used as a substitute for full permutation methods. Our analysis does not cover very complicated methods such as the one Jones-Rhoades et al. (2007) used or the empirical Bayes analysis proposed by Efron et al. (2001). We would hesitate to use balanced permutations as an ingredient in more complicated analyses, because they work poorly in simple situations.

We started looking at balanced permutations as an ingredient in a similarly complicated analysis, involving data from the AGEMAP study of Zahn et al. (2007). We considered near balanced permutations with respect to age (old versus very old mice) within strata defined by male and female mice. It could only be near balance because there were 5 mice of each sex at each age. For each mouse, there were 14 tissues. The effects of interest were related to differences in the gene-gene correlation matrices at the two ages. Simulations of even a simple before versus after correlation for just one pair of genes whose correlation does not change revealed permissive p-values like the ones we report here. As a consequence that work continued with ordinary permutations instead of balanced ones.

A final disadvantage of balanced permutations is the one mentioned at the NCI website. There are fewer balanced permutations, so that granularity properties are exacerbated.

Acknowledgments

We thank Sarah Emerson for helpful discussions. This work was supported by the U.S. National Science Foundation (grant DMS-0604939), the U.S. National Institutes of Health, and a training fellowship from the National Library of Medicine to the Biomedical Informatics Training Program at Stanford (NLM grant no. 07033).

Disclosure Statement

No competing financial interests exist.

References

  1. Agresti A. Categorical Data Analysis. 2nd ed. Wiley; New York: 2002.
  2. Bahadur R.R., Savage L.J. The nonexistence of certain statistical procedures in nonparametric problems. Ann. Math. Stat. 1956;27:1115–1122.
  3. Dudoit S., van der Laan M.J. Multiple Testing Procedures with Applications to Genetics. Springer-Verlag; New York: 2008.
  4. Edgington E.S., Onghena P. Randomization Tests. 4th ed. Chapman and Hall/CRC; Boca Raton, FL: 2007.
  5. Efron B., Tibshirani R., Storey J.D., et al. Empirical Bayes analysis of a microarray experiment. J. Am. Stat. Assoc. 2001;96:1151–1160.
  6. Fan J., Tam P., Woude G.V., et al. Normalization and analysis of cDNA microarrays using within-array replications applied to neuroblastoma cell response to a cytokine. Proc. Natl. Acad. Sci. USA. 2004;101:1135–1140. doi: 10.1073/pnas.0307557100.
  7. Fan J., Chen Y., Chan H.M., et al. Removing intensity effects and identifying significant genes for Affymetrix arrays in macrophage migration inhibitory factor-suppressed neuroblastoma cells. Proc. Natl. Acad. Sci. USA. 2005;102:17751–17756. doi: 10.1073/pnas.0509175102.
  8. Good P. Permutation Tests: A Practical Guide to Resampling Methods for Testing Hypotheses. 2nd ed. Springer-Verlag; New York: 2000.
  9. Jones-Rhoades M.W., Borevitz J.O., Preuss D. Genome-wide expression profiling of the Arabidopsis female gametophyte identifies families of small, secreted proteins. PLOS Genet. 2007;3:1848–1861. doi: 10.1371/journal.pgen.0030171.
  10. Lehmann E.L., Romano J.P. Testing Statistical Hypotheses. 3rd ed. Springer; New York: 2005.
  11. Miller R.G. Beyond ANOVA, Basics of Applied Statistics. Wiley; New York: 1986.
  12. Pan W. On the use of permutation in and the performance of a class of nonparametric methods to detect differential gene expression. Bioinformatics. 2003;19:1333–1340. doi: 10.1093/bioinformatics/btg167.
  13. Scheid S., Spang R. Permutation filtering: a novel concept for significance analysis of large-scale genomic data. Lect. Notes Comput. Sci. 2006;3909:338–347.
  14. Scheid S., Spang R. Compensating for unknown confounders in microarray data analysis using filtered permutations. J. Comput. Biol. 2007;14:669–681. doi: 10.1089/cmb.2007.R009.
  15. Tusher V.G., Tibshirani R., Chu G. Significance analysis of microarrays applied to the ionizing radiation response. Proc. Natl. Acad. Sci. USA. 2001;98:5116–5121. doi: 10.1073/pnas.091062498.
  16. Xie Y., Pan W., Khodursky A.B. A note on using permutation-based false discovery estimates to compare different analysis methods for microarray data. Bioinformatics. 2005;21:4280–4288. doi: 10.1093/bioinformatics/bti685.
  17. Yang H., Churchill G. Estimating p-values in small microarray experiments. Bioinformatics. 2007;23:38–43. doi: 10.1093/bioinformatics/btl548.
  18. Zahn J., Poosala S., Owen A.B., et al. AGEMAP: a gene expression database for aging in mice. PLOS Genet. 2007;3:2326–2337. doi: 10.1371/journal.pgen.0030201.
