Biometrika. 2015 Nov 4;102(4):753–766. doi: 10.1093/biomet/asv050

Optimal multiple testing under a Gaussian prior on the effect sizes

Edgar Dobriban 1, Kristen Fortney 2, Stuart K Kim 2, Art B Owen 3
PMCID: PMC4813057  PMID: 27046938

Abstract

We develop a new method for large-scale frequentist multiple testing with Bayesian prior information. We find optimal p-value weights that maximize the average power of the weighted Bonferroni method. Due to the nonconvexity of the optimization problem, previous methods that account for uncertain prior information are suitable for only a small number of tests. For a Gaussian prior on the effect sizes, we give an efficient algorithm that is guaranteed to find the optimal weights nearly exactly. Our method can discover new loci in genome-wide association studies and compares favourably to competitors. An open-source implementation is available.

Keywords: Genome-wide association study, Multiple testing, Nonconvex optimization, p-value weighting, Weighted Bonferroni method

1. Introduction

The research presented in this paper is motivated by the genetics of human longevity. Genome-wide association studies of longevity compare long-lived individuals with matched controls (Brooks-Wilson, 2013). More than 500 000 genetic variants have been tested for their association with longevity, which amounts to a large multiple hypothesis testing problem. In addition to multiplicity, the sample size is small, usually of the order of a few hundred. As a consequence, only a few loci have been replicably associated with human longevity, and they do not explain the heritability of the trait (Hjelmborg et al., 2006).

The multiplicity may be countered by testing only a few candidate variants selected based on prior scientific knowledge. In a separate work in preparation, led by the second author, we find that a more general genome-wide test helps to improve power in a study of longevity. We leverage prior information from genome-wide association studies of age-related diseases, such as coronary artery disease and diabetes. For this task, we develop a new large-scale method of frequentist multiple testing with Bayesian prior information. In this paper we provide the theory for this method.

Our method is a novel p-value weighting scheme; p-value weighting is a general methodology for multiple testing that leverages independent prior information to improve power (Roeder & Wasserman, 2009; Gui et al., 2012). Suppose that we test the hypotheses H_i via the p-values P_i for i = 1, …, J. For a significance level q, the weighted Bonferroni method declares the ith hypothesis to be significant if P_i ≤ q w_i/J. The weights w_i ≥ 0 are based on independent data. The familywise error rate, the probability of making at least one false rejection, is controlled at q if the weights average to 1, as it equals at most ∑_{i=1}^J q w_i/J = q.
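For concreteness, the weighted Bonferroni rule can be sketched in a few lines of Python; the function name and example numbers below are ours, chosen purely for illustration:

```python
def weighted_bonferroni(pvals, weights, q=0.05):
    """Weighted Bonferroni: reject H_i when P_i <= q * w_i / J.

    The familywise error rate is at most q provided the weights
    are nonnegative and average to 1."""
    J = len(pvals)
    if any(w < 0 for w in weights):
        raise ValueError("weights must be nonnegative")
    if abs(sum(weights) / J - 1.0) > 1e-8:
        raise ValueError("weights must average to 1")
    return [i for i, (p, w) in enumerate(zip(pvals, weights))
            if p <= q * w / J]

# Prior knowledge up-weights the first two hypotheses.
rejected = weighted_bonferroni([0.001, 0.02, 0.5, 0.9],
                               [2.0, 1.0, 0.5, 0.5])
```

With these weights the per-test thresholds are 0·025, 0·0125, 0·00625 and 0·00625, so only the first hypothesis is rejected in this toy example.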

In previous work, optimal weights have been found in a Gaussian model of hypothesis testing. Let the test statistics in the current study be T_i ∼ N(μ_i, 1), where the μ_i are the means, or effect sizes; we test the null hypotheses H_i: μ_i = 0 against the alternatives μ_i < 0. We have some information about the μ_i from prior studies. Roeder & Wasserman (2009) and Rubin et al. (2006) considered a model where the μ_i are known exactly from the prior data, and the weights are allowed to depend on the μ_i. In such a model they found the optimal weights for the weighted Bonferroni method, which maximize the expected number of discoveries. We show that this amounts to solving a convex optimization problem.

The assumption that the μ_i are known precisely is problematic: if they were known, there would be no need for a follow-up study. In practice, empirical estimates of the μ_i are used. However, the fixed-μ weights do not take into account the uncertainty in these estimates. Instead, we account for uncertainty explicitly by considering the model with uncertain prior information in the form μ_i ∼ N(η_i, σ_i²). Only the prior means η_i and standard errors σ_i are known from independent data, not the precise effect sizes. Finding the optimal weights, which we call Bayes weights, is then a nonconvex optimization problem.

Westfall et al. (1998) formulated a general framework that includes this problem as a special case and allows, for instance, for Student t-distributed priors. They used a direct numerical solver, a quasi-Newton optimization method whose cost grows rapidly with the number of tests, to find the weights. Published examples using this approach are typically small (Westfall et al., 1998; Westfall & Soper, 2001). This method of computing the weights does not scale up to our problems, which involve more than 500 000 genetic variants. Further, a generic quasi-Newton method has no guarantee of finding the global optimum of the nonconvex problem.

Our key contribution here is to provide an efficient method of finding the weights that maximize average power for the weighted Bonferroni method, in the model with Gaussian priors. We solve the optimization problem exactly when the level q is below a problem-dependent value, which is typically well above the conventional q = 0·05. For larger q, we can solve the problem for a nearby level q′ such that |q − q′| ≤ 1/2. The cost per iteration of our algorithm is O(J) in the first case and O(J log J) in the second, where J is the number of tests. We observe that a nearly constant number of iterations is used, regardless of J. We find it remarkable that this problem admits a near-exact solution.

For large-scale problems, this approach leads to a method for multiple testing that controls a frequentist error measure while also taking into account Bayesian prior information. This method follows George Box's advice to be Bayesian when predicting but frequentist when testing (Box, 1980). Similar ideas were used previously by Carlin & Louis (1985); see §2. As mentioned, a more general formulation was also considered in Westfall et al. (1998) and Westfall & Soper (2001). We show that our approach is feasible for large-scale problems.

When prior information is uncertain, we show via simulations that the new method has more power and is more stable than competitors. We also show theoretically that weighting leads to substantially improved power. We apply the method to genome-wide association studies. By analysing several such datasets, we show that our method has advantages in terms of power and easier tuning compared to other methods.

With rapidly increasing volumes of data available as prior information for any given study, our method should be useful for other problems in biology and elsewhere. The data analysis and computational results in this paper are reproducible, and an open-source implementation of the method is available from the authors.

2. Related work

There is a large literature on statistical methods for multiple testing with prior information, some of which is reviewed in Roeder & Wasserman (2009) and Gui et al. (2012). Spjøtvoll (1972) devised optimal single-step multiple testing procedures maximizing average or minimal power and controlling the familywise error rate. Later it was recognized that Spjøtvoll's results are equivalent to optimal p-value weighting methods. For instance, Benjamini & Hochberg (1997) developed extensions of Spjøtvoll's methods for p-value weighting, allowing the weights also to reflect the importance of the hypotheses.

Leveraging Spjøtvoll's results, Rubin et al. (2006) and Roeder & Wasserman (2009) found an explicit formula for the optimal weights of the weighted Bonferroni method in the Gaussian model T_i ∼ N(μ_i, 1), assuming that the effects μ_i are known exactly. In practice the effects are estimated, but these weights do not take the estimation uncertainty into account. The weights are optimal for average power, and the method is efficient and suitable for large applications. Eskin (2008) and Darnell et al. (2012) applied the framework of Roeder & Wasserman (2009) to genome-wide association studies; they accounted for correlations between the tests but assumed that the effects are known exactly.

Another popular approach is to test the top candidates from a prior study, often known as two-stage testing or a candidate study. It can be viewed as a p-value weighting method where some of the weights equal zero. A specific version for genome-wide association studies has been called the proxy-phenotype method (Rietveld et al., 2014).

In the literature on carcinogenicity trials, related methods have been devised to select tumour sites based on historical data (Carlin & Louis, 1985; Louis & Bailey, 1990); the methods are explicitly Bayesian with regard to historical data and frequentist in analysing current data. These models and methods differ from ours, and focus on pairwise comparisons based on Fisher's exact test (Louis & Bailey, 1990).

Westfall et al. (1998) considered a Gaussian model T_i ∼ N(μ_i, 1) for the effects in hypothesis testing, where prior distributions are known for the means. They formulated the problem of finding the weights that maximize expected power for the weighted Bonferroni method, and this was followed up for binary data in Westfall & Soper (2001), motivated by carcinogenicity trials. As mentioned in §1, published studies using their optimization methods are typically small.

Less work exists on weighted methods beyond the single-step Bonferroni method, or beyond the control of the familywise error rate. The step-down method of Holm (1979) can use weights, and Westfall & Krishen (2001) and Westfall et al. (2004) discuss the choice of optimal weights. Genovese et al. (2006) showed that the weighted Benjamini–Hochberg procedure controls the false discovery rate, and Roquain & Van De Wiel (2009) proposed a method of choosing weights optimally, assuming fixed known effects. Peña et al. (2011) developed a general framework for optimal multiple decision functions for the control of familywise error rate and false discovery rate, assuming exact knowledge of the alternatives.

In this paper we focus on the familywise error rate, because it is the standard measure of error controlled in our motivating application, and because in this case it is already challenging to find the optimal weights accounting for uncertainty on a large scale. Extension of this work to the Benjamini–Hochberg procedure and to false discovery rate control is left for future research.

3. Theoretical results

3.1. Background

We work in the Gaussian means model of hypothesis testing: we observe test statistics T_i ∼ N(μ_i, 1) for i = 1, …, J, and test each null hypothesis H_i: μ_i = 0 against the alternative μ_i < 0. The p-value for testing H_i is P_i = Φ(T_i), where Φ denotes the standard normal cumulative distribution function.

For a weight vector w = (w_1, …, w_J) and a significance level q > 0, the weighted Bonferroni procedure rejects H_i if P_i ≤ q w_i/J. With uniform weights w_i = 1, this reduces to the usual Bonferroni procedure. For general weights, the expected number of false rejections, known as the per-family error rate, equals ∑_{i∈I₀} q w_i/J, where I₀ denotes the set of true null hypotheses. If ∑_{i=1}^J w_i ≤ J, the expected number of false rejections is at most q. By Markov's inequality, this implies that the familywise error rate is at most q. Hence the weighted Bonferroni method controls the familywise error rate. This result does not require independence of the P_i. We assume always that w_i ≥ 0, and usually that ∑_{i=1}^J w_i = J. Without loss of generality, we restrict the weights to the interval [0, J/q].

Let us denote the number of rejections by R(w) = ∑_{i=1}^J 1{P_i ≤ q w_i/J}, where 1{·} is the indicator function. The optimal weights maximizing the expected number of discoveries, assuming a priori known effects μ_i, were found explicitly by Roeder & Wasserman (2009) and independently by Rubin et al. (2006). Denoting by E_μ the expectation under the means μ = (μ_1, …, μ_J), they solved the constrained optimization problem

maximize E_μ{R(w)} = ∑_{i=1}^J Φ{Φ⁻¹(q w_i/J) − μ_i}   subject to   ∑_{i=1}^J w_i = J,   w_i ≥ 0.   (1)

It was not noted previously that this problem is convex. The objective is a sum of terms of the form Φ{Φ⁻¹(q w/J) − μ}, whose concavity in w, for μ < 0, follows directly by differentiation. Yet, by simple Lagrangian optimization, the above papers showed that if all μ_i < 0, the optimal weights are w_i = w(μ_i; c), where

w(μ; c) = (J/q) Φ(μ/2 + c/μ).   (2)

Here c is the unique normalizing constant such that the weights sum to J. Interestingly, the weights are not monotonic as a function of μ, but are largest for intermediate values of μ. As noted by Roeder & Wasserman (2009), formula (2) is a direct consequence of Spjøtvoll's theory of optimality in multiple testing (Spjøtvoll, 1972). Accordingly, we call these weights the Spjøtvoll weights.
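Formula (2) is easy to evaluate numerically: for negative means the weights are decreasing in the constant c, so a one-dimensional bisection recovers the normalizing constant. The following sketch is our illustration of this computation, not the authors' implementation:

```python
import math

def Phi(x):
    # standard normal cumulative distribution function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def spjotvoll_weights(mu, q=0.05):
    """Weights w_i = (J/q) * Phi(mu_i/2 + c/mu_i) for negative means,
    with c chosen by bisection so that the weights sum to J."""
    J = len(mu)
    assert all(m < 0 for m in mu)
    def total(c):
        return sum(J / q * Phi(m / 2.0 + c / m) for m in mu)
    lo, hi = -50.0, 50.0          # total(c) is decreasing in c here
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if total(mid) > J:
            lo = mid
        else:
            hi = mid
    c = 0.5 * (lo + hi)
    return [J / q * Phi(m / 2.0 + c / m) for m in mu]

w = spjotvoll_weights([-1.0, -2.0, -3.0])
```

Note the non-monotonicity mentioned above: here the intermediate mean −2 receives the largest weight.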

3.2. Weighting leads to substantial power gain

To illustrate theoretically that p-value weighting can lead to increased power, we compare the power of optimal weighting with that of unweighted testing in a sparse mixture model.

First, we note that p-value weighting exploits the heterogeneity of the tests. In the simplest case there are only two distinct effect sizes: a large negative effect μ < 0 and a zero effect. We consider the regime of a large number of tests J, and for simplicity we suppose that πJ is an integer. Let the fractions of large and zero effects be π and 1 − π, respectively, so that πJ of the means equal μ and the remaining (1 − π)J equal zero. We solve for the optimal weights.

Proposition 1 —

There is a set of optimal p-value weights that gives the same weight to hypotheses with the same mean, i.e., weights w₀ and w₁ to the means 0 and μ, respectively, where

w₁ = min{1/π, (J/q) Φ(μ/2)},   w₀ = (1 − π w₁)/(1 − π).

Further, the power of the optimal p-value weighting method is

Pow(μ, π) = π Φ{Φ⁻¹(q/(πJ)) − μ}  if Φ(μ/2) ≥ q/(πJ),   and   Pow(μ, π) = π{Φ(−μ/2) − Φ(μ/2)} + q/J  otherwise.

If the absolute effect size |μ| is small enough that Φ(μ/2) ≥ q/(πJ), all the weight is placed on the larger means, which is the behaviour we would expect intuitively. However, if |μ| is large enough that Φ(μ/2) < q/(πJ), then it is advantageous to place some weight on the small means, because a large absolute effect size will be detected with high probability even with a small weight.

In Fig. 1(a) we plot the ratio Pow/Pow_u as a function of μ and π, where Pow_u is the power of unweighted Bonferroni testing. For most effect sizes μ and sparsity levels π, we see a power gain of at least 50% relative to unweighted Bonferroni testing. Moreover, there is a hotspot where the power gain can be three- to four-fold. Optimal weighting can lead to a substantial gain in power.

Fig. 1.


Power gain and nonconvexity: (a) contour plot of the ratio of the power of optimal weighting to that of unweighted testing for sparse means; (b) four instances of the function that is summed in the optimization objective; the nonconvex summand w ↦ Φ[{Φ⁻¹(qw/J) − η}/(1 + σ²)^{1/2}] is plotted for four different pairs (η, σ), shown as solid, dashed, dotted and dot-dashed curves.

3.3. Weights with imperfect prior knowledge

In the previous sections it was assumed that the effects μ_i are known precisely. We now assume that we have uncertain prior information in the form μ_i ∼ N(η_i, σ_i²).

Following Westfall et al. (1998), we maximize the expected power E{R(w)}, averaged with respect to both the random means μ_i and the test statistics T_i. Introducing γ_i = (1 + σ_i²)^{1/2}, the optimization problem, which we call the Bayes weights problem, becomes

maximize ∑_{i=1}^J Φ[{Φ⁻¹(q w_i/J) − η_i}/γ_i]   subject to   ∑_{i=1}^J w_i = J,   w_i ≥ 0.   (3)

This objective function is not concave if any σ_i > 0. To help with visualization, the summand w ↦ Φ[{Φ⁻¹(qw/J) − η}/γ] is plotted in Fig. 1(b) for four parameter pairs (η, σ). On the interval [0, J/q], the function is first concave and then convex.
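The concave-then-convex shape can be checked numerically via second differences. In the sketch below, the parameter pair (η, σ) = (−2, 2) and the probe points are our own illustrative choices, not the ones in the figure:

```python
import math

def Phi(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def Phi_inv(p, lo=-12.0, hi=12.0):
    # inverse normal cdf by bisection; adequate for illustration
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if Phi(mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def summand(w, eta=-2.0, sigma=2.0, q=0.05, J=1.0):
    # one term of objective (3), as a function of the weight w
    gamma = math.sqrt(1.0 + sigma * sigma)
    return Phi((Phi_inv(q * w / J) - eta) / gamma)

def second_diff(f, w, h):
    return f(w + h) - 2.0 * f(w) + f(w - h)

# On [0, J/q] = [0, 20]: concave near the left end, convex near the right.
concave_part = second_diff(summand, 1.0, 0.3)
convex_part = second_diff(summand, 19.0, 0.3)
```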

Our main contribution is to solve this problem efficiently for large J. The results in this respect are two-fold. First, we can solve the problem exactly in the special case where the level q is sufficiently small. Second, we have a nearly exact solution for arbitrary q. Starting with the simpler first case, we define, for a parameter λ > 0,

c(η, σ; λ) = −[η + γ{η² + 2σ² log(λγ)}^{1/2}]/σ²,   where γ = (1 + σ²)^{1/2}.

A weighted one-sided test P_i ≤ q w_i/J can be written equivalently in terms of critical values as T_i ≤ c_i, where c_i = Φ⁻¹(q w_i/J). It turns out that the critical values corresponding to the optimal Bayes weights can be expressed in terms of c(η_i, σ_i; λ), when q is small enough that

∑_{i=1}^J Φ{η_i/(1 + γ_i)} ≥ q.   (4)

In our data analysis examples and simulations, this mild restriction requires only that q be below values well above the conventional significance levels. In the next result we give the exact optimal weights for small q when all η_i < 0.

Theorem 1 —

If the significance level q is small enough that (4) holds, then the optimal Bayes weights maximizing the average power in (3) are w_i = (J/q) Φ{c(η_i, σ_i; λ)}, where λ is the unique constant such that ∑_{i=1}^J w_i = J.

In the Supplementary Material, we solve this problem by maximizing the Lagrangian. Two key properties that we use are joint separability of the objective function and constraint, and analytic tractability of the Gaussian density.
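A minimal sketch of this computation, under our reading of Theorem 1: the weights are monotone in λ, so a one-dimensional search (here a bisection on a log scale, with a bracket that we choose for illustration) recovers the normalizing constant:

```python
import math

def Phi(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def crit(eta, sigma, lam):
    """Critical value c(eta, sigma; lam), for eta < 0 (our reading
    of Theorem 1)."""
    g2 = 1.0 + sigma * sigma                     # gamma^2
    inner = eta * eta + sigma * sigma * math.log(lam * lam * g2)
    return (-eta - math.sqrt(g2 * inner)) / (sigma * sigma)

def bayes_weights(eta, sigma, q=0.05):
    """Bayes weights w_i = (J/q) * Phi(c(eta_i, sigma_i; lam)),
    with lam found by bisection so that sum(w) = J."""
    J = len(eta)
    def total(lam):
        return sum(J / q * Phi(crit(e, s, lam)) for e, s in zip(eta, sigma))
    lo = max(1.0 / math.sqrt(1.0 + s * s) for s in sigma)  # keep sqrt real
    hi = 1e8                                  # total(lam) decreases in lam
    for _ in range(200):
        mid = math.sqrt(lo * hi)              # bisection on a log scale
        if total(mid) > J:
            lo = mid
        else:
            hi = mid
    lam = math.sqrt(lo * hi)
    return [J / q * Phi(crit(e, s, lam)) for e, s in zip(eta, sigma)]

w = bayes_weights([-2.0, -3.0, -1.0], [0.5, 0.5, 0.5])
```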

Figure 2 displays an instance of the optimal weights w(η, σ) as a function of the prior mean η and the standard deviation σ. In the theorem the weights are a function of the pairs (η_i, σ_i), but they can also be viewed as a function of (η, σ) via the natural map (η, σ) ↦ (J/q)Φ{c(η, σ; λ)}. As the standard error σ becomes small, our weights tend to the Spjøtvoll weights.

Proposition 2 —

For any λ > 0 and η < 0, the Bayes weight function defined by w(η, σ) = (J/q)Φ{c(η, σ; λ)} tends to the Spjøtvoll weight function defined in (2), with c = log λ, as σ → 0.

Fig. 2.


Bayes weights: (a) surface plot and (b) contour plot of the Bayes weight function w(η, σ) defined via Theorem 1; Spjøtvoll weights are on the segment σ = 0.

With σ > 0, the weights are regularized: more extreme weights are shrunk towards a common value in a nonlinear way. For finite σ, our weights can be viewed as a smooth interpolation between Spjøtvoll weights and uniform weights. It is reasonable to think at first that as all σ_i → ∞, the best weight allocation becomes the uniform one. However, this is not the case: a symmetry-breaking phenomenon occurs due to nonconvexity.

Consider a weight vector w that equals J/q for ⌊q⌋ indices, and assume that q is not an integer. Distribute the remaining strictly positive weight equally among the remaining hypotheses. It is now easy to see that the hypotheses with weights equal to J/q are always rejected, since their p-value thresholds equal 1, so their power equals 1. For the remaining hypotheses the power Φ[{Φ⁻¹(q w_i/J) − η_i}/γ_i] tends to Φ(0) = 1/2 as σ_i → ∞. This shows that the limiting average power of this unbalanced weighting scheme is 1/2 + ⌊q⌋/(2J). For uniform weighting, the power tends to 1/2 as σ_i → ∞, for each hypothesis. This shows that the limiting average power of uniform weighting is 1/2. Hence, the power of the skewed weighting scheme is larger than that of uniform weighting. This illustrates the symmetry-breaking phenomenon caused by the extreme nonconvexity of the optimization problem.

Fortunately, the situation is better when condition (4) holds. In addition to being easy to check for any given parameters η_i and σ_i, we now show that the constraint is mild. Often we want to keep q small even if J is large, because q is the number of false rejections that we tolerate. In this regime, the condition holds as long as there are a few average-sized negative prior means η_i. We denote by z_β = Φ⁻¹(β) the normal quantile function.

Proposition 3 —

Condition (4) holds if there are m distinct indices i with negative η_i, for which η_i ≤ (1 + γ_i) z_{q/m}.

If η_i < 0, then Φ{η_i/(1 + γ_i)} ≥ q/m whenever η_i ≤ (1 + γ_i) z_{q/m}, so the simple condition holds provided that there are m such indices. For instance, if q = 0·05 and m = 10, then z_{q/m} = z_{0·005} ≈ −2·58. If, moreover, σ_i ≤ 1, so that 1 + γ_i ≤ 1 + 2^{1/2} ≈ 2·41, then we need only ten effect sizes with η_i ≤ 2·41 × (−2·58) ≈ −6·2. This is a weak requirement.

When q is small, we use a damped Newton's method to find the right constant λ from Theorem 1 via a one-dimensional line search. The function evaluations cost O(J) per iteration, and empirically we find that the algorithm takes only a small number of iterations to converge, independently of J. We can solve problems involving more than two million tests in a few seconds on a desktop computer.

Now we present our result for the general case.

Theorem 2 —

For any q > 0, the nonconvex Bayes weights problem can be solved for a nearby level q′ for which |q − q′| ≤ 1/2. The optimal weights and q′ can be found in O(J log J) steps.

This result is relevant when q, the expected number of errors under the null hypothesis, is controlled at a threshold greater than 1/2. Our weights will be optimal for a q′ that is close to q. We see from the proof that even for large J, q′ often equals q. The method also returns the value of q′, which the user can inspect. It is then the user's decision as to whether to perform multiple testing adjustment at the original level q or at the new level q′.

The analysis of nonconvex optimization problems is challenging. It seems remarkable that the nonconvex Bayes weights problem admits a nearly exact solution.

4. Simulation studies

4.1. Bayes weights are more powerful than competing weighting schemes

We perform two simulation studies to explore the empirical performance of our method. First, we show that Bayes weights increase power more reliably than two other weighting schemes, namely exponential weights and filtering.

For Bayes weights, we multiply the variances by a dispersion factor φ ≥ 0, i.e., σ_i² ↦ φσ_i². The default value for this tuning parameter is φ = 1 and, as discussed in §5.3, we recommend use of the default value in most cases. The purpose of changing the dispersion is to explore the robustness of our method with respect to misspecification of the prior variances. The dispersion ranges from 0 to 4, and Spjøtvoll weights correspond to φ = 0.

Exponential weights with tilt parameter β are defined as w_i = J exp(β|η_i|)/s, where s = ∑_{j=1}^J exp(β|η_j|). This weighting scheme was proposed by Roeder et al. (2006), who also recommend a default value for β. We consider a range of values of β. As noted by Roeder et al. (2006), exponential weights are sensitive to large means. To guard against this sensitivity, we truncate weights larger than J/q and redistribute their excess weight among the next largest weights.
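The truncation-and-redistribution safeguard can be sketched as follows. The functional form w_i ∝ exp(β|η_i|), the cap and the sequential redistribution from largest to smallest are our assumptions for illustration:

```python
import math

def exponential_weights(eta, beta=2.0, wmax=None):
    """Exponential p-value weights, truncated at wmax with the excess
    passed to the next largest weights (illustrative sketch)."""
    J = len(eta)
    raw = [math.exp(beta * abs(e)) for e in eta]
    s = sum(raw)
    w = [J * r / s for r in raw]          # average weight 1
    if wmax is None:
        return w
    # Cascade the excess down the sorted weights; any excess left after
    # the smallest weight would be dropped (rare in practice).
    order = sorted(range(J), key=lambda i: -w[i])
    excess = 0.0
    for i in order:
        w[i] += excess
        excess = max(0.0, w[i] - wmax)
        w[i] = min(w[i], wmax)
    return w

w_exp = exponential_weights([3.0, 1.0, 0.0, 0.0], beta=2.0, wmax=2.0)
```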

Filtering methods test only the most significant prior effects, those with η_i below a threshold η₀, using the unweighted Bonferroni method. These methods can be viewed as weighting schemes in which some weights are zero. Such methods are known under many names, such as two-stage testing, screening, or proxy-phenotype methods (Rietveld et al., 2014). We adopt the term filtering used by Bourgon et al. (2010), who filter based on independent information in the current dataset rather than prior information. The threshold η₀ ranges over negative values up to 0. If η₀ is very negative and only a few hypotheses would pass the filter, then we instead test a fixed minimum number of the most significant hypotheses.

In the simulation, we generate J random prior means η_i and standard errors σ_i independently from fixed distributions, and we fix the level q. For any weight vector w, we calculate the power as the objective in (3) divided by J, to reflect the average power per test.
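The power criterion used to compare the methods is just the objective in (3) divided by J. A sketch of this computation, with Φ⁻¹ evaluated by bisection so the code stays self-contained:

```python
import math

def Phi(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def Phi_inv(p, lo=-12.0, hi=12.0):
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if Phi(mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def average_power(weights, eta, sigma, q=0.05):
    """Objective (3) divided by J: mean rejection probability when
    mu_i ~ N(eta_i, sigma_i^2) and T_i ~ N(mu_i, 1)."""
    J = len(weights)
    tot = 0.0
    for w, e, s in zip(weights, eta, sigma):
        gam = math.sqrt(1.0 + s * s)
        tot += Phi((Phi_inv(q * w / J) - e) / gam)
    return tot / J

# Sanity check: uniform weights on pure nulls give power q/J per test.
p_null = average_power([1.0, 1.0], [0.0, 0.0], [0.0, 0.0], q=0.05)
```

Up-weighting a hypothesis with a strong negative prior mean raises this criterion relative to uniform weighting, which is the effect Fig. 3(a) quantifies.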

The results are shown in Fig. 3(a). Each method can improve the power over unweighted testing. However, Bayes weights yield more power than the other methods. The best power is attained when the dispersion φ is equal to 1, but good power is reached in a large neighbourhood of φ = 1. Our weights are robust with respect to misspecification of the tuning parameter.

Fig. 3.


(a) Power of four p-value weighting methods plotted as a function of their tuning parameter: unweighted (solid), Bayes (dashed) as a function of the dispersion φ, exponential (dotted) as a function of the tilt β, and filtering (dot-dashed) as a function of the threshold η₀; the Spjøtvoll weights correspond to the point φ = 0 on the Bayes weights curve. (b) Power comparison for sparse means: deterministic (left) and average (right) power plotted as a function of the proportion of large means π, for the unweighted (solid), Spjøtvoll (dashed) and Bayes (dotted) methods.

In particular, taking uncertainty into account helps. Spjøtvoll weights, which assume fixed and known effects and are represented in the figure as Bayes weights with dispersion φ = 0, have less power than Bayes weights with positive φ, for a wide range of φ.

The remaining two methods, filtering and exponential weights, have disadvantages. While filtering yields a gain in power for well-chosen thresholds η₀, it also leads to a substantial power loss for poorly chosen ones. For sufficiently stringent thresholds the power is constant, because only the fixed minimum number of top hypotheses is selected. Another significant disadvantage is that there seems to be no principled way to choose η₀ a priori without additional assumptions. Similarly, exponential weighting leads to at most a small gain in power, and it usually leads to a power loss.

We conclude that Bayes weights are robust with respect to the choice of the tuning parameter and have uniformly good power. In contrast, exponential weighting and filtering are more sensitive, and their power can drop substantially.

4.2. Bayes weights have a worst-case advantage

We show that Bayes weights have a worst-case advantage compared to Spjøtvoll weights. We use the sparse means model of §3.2 and generate J means μ_i, a fraction π of which equal a large negative value μ while the remaining 1 − π equal zero. We fix the level q and vary π from 0 to 0·1. We set all prior standard errors σ_i to a common value σ.

Spjøtvoll weights are optimal for the deterministic power in (1), while Bayes weights are optimal for the average power in (3). We evaluate the two weighting schemes by calculating the power that they do not maximize, i.e., the average power (3) for Spjøtvoll weights and the deterministic power (1) for Bayes weights. We also compute the power of the unweighted Bonferroni method.

The results are displayed in Fig. 3(b). Bayes weights lose only a little power compared to the optimal Spjøtvoll weights. In contrast, Spjøtvoll weights lose a lot of power relative to Bayes weights, which maximize the worst-case power. Bayes weights show a maximin property. Further, as shown in the Supplementary Material, Spjøtvoll weights lose power because they set the weights equal to zero on the small means.

5. Application to genome-wide association studies

5.1. Review of genome-wide association studies

We adapt our framework to genome-wide association studies, relying on basic notions of quantitative genetics (see, e.g., Lynch & Walsh, 1998). In this section we present in detail the methodology for this application, while also illustrating the steps of using our framework for specific problems.

We study a quantitative trait Y in a population, with the goal of understanding the effects of single nucleotide polymorphisms X_1, …, X_J on the trait. We assume that X_i has mean 0 and known variance σ_{X_i}²; here X_i denotes the centred minor allele count of variant i for an individual. We rely on the linear model for the effect of the ith variant on the trait: Y = β_i X_i + ε_i. In this model Y is the phenotype of a randomly sampled individual from the population, so Y is random, β_i is a fixed unknown constant, and ε_i is the residual error. This error is a zero-mean random variable that is independent of X_i, with variance σ_{ε_i}².

Suppose that we observe a sample of n independent and identically distributed observations from this model. We use the standard linear regression estimate β̂_i, which for a large sample size has the approximate distribution β̂_i ∼ N(β_i, σ_{ε_i}²/(n σ_{X_i}²)). To standardize, we divide by the standard error σ_{ε_i}/(n^{1/2} σ_{X_i}), where σ_{X_i}² is the variance of X_i.

With these steps, we have framed our problem in the Gaussian means model. Writing T_i = n^{1/2} σ_{X_i} β̂_i/σ_{ε_i} and μ_i = n^{1/2} σ_{X_i} β_i/σ_{ε_i}, we have T_i ∼ N(μ_i, 1), which has the required form. Let us also define the standardized effect size δ_i = σ_{X_i} β_i/σ_{ε_i}, so that μ_i = n^{1/2} δ_i; this quantity will be of key importance.
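The standardization above is the familiar regression z-statistic. A sketch on a toy sample, with data and names of our own choosing:

```python
import math

def z_statistic(x, y):
    """OLS slope and its z-statistic for y = beta*x + eps,
    after centring both variables."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    beta_hat = sxy / sxx
    resid = [b - my - beta_hat * (a - mx) for a, b in zip(x, y)]
    s2 = sum(r * r for r in resid) / (n - 2)     # residual variance
    se = math.sqrt(s2 / sxx)                     # standard error of beta_hat
    return beta_hat, beta_hat / se

# Nearly noiseless toy data with true slope 2: the z-statistic is large.
beta_hat, t = z_statistic([-2, -1, 0, 1, 2],
                          [-4.1, -1.9, 0.1, 2.0, 3.9])
```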

5.2. Prior information

To use prior information, assume that we also have a prior trait Y′, measured on a different, independent sample from the same population. With the same assumptions on X_i, we can write Y′ = β′_i X_i + ε′_i. Here β′_i is a fixed unknown constant, and ε′_i is random. Suppose that we have independent samples of size n₁ and n₂ for the prior and current traits, respectively. If we define δ′_i and T′_i by analogy to the definitions for Y, we can write T′_i ∼ N(n₁^{1/2} δ′_i, 1).

We model the relatedness of the two traits as a relation between the standardized effect sizes δ_i and δ′_i, which do not depend on the sample sizes. If the two traits are closely related, the first-order approximation is equality, δ_i = δ′_i. This model captures the pleiotropy between the two traits (Solovieff et al., 2013).

The final step is to compute the distribution of μ_i given the prior data T′_i. For this we need to choose a prior for δ′_i, and for simplicity we use a flat prior.

We now have all the ingredients for the model of Gaussian hypothesis testing with uncertain prior information. Specifically, we have μ_i ∼ N(η_i, σ_i²), where μ_i = n₂^{1/2} δ_i, η_i = (n₂/n₁)^{1/2} T′_i and σ_i² = n₂/n₁.

The variance of the prior statistics may differ from 1, and may exceed it owing to overdispersion. Allowing for this is one way to weaken the first-order approximation δ_i = δ′_i. To allow for overdispersion, we recall the parameter φ used in our simulations: we model the prior data as T′_i ∼ N(n₁^{1/2} δ′_i, φ), and then the variance becomes σ_i² = φ n₂/n₁. The default value φ = 1 is recommended in most cases. Finally, we compute the Bayes weights w_i with parameters η_i, σ_i and q, and we run the weighted Bonferroni method on the current p-values. This fully specifies the method, which is summarized in the following algorithm.

Algorithm 1 —

Bayes-weighted Bonferroni multiple testing in genome-wide association studies

  1. Let Inline graphic be the prior effect sizes for Inline graphic.

  2. Let Inline graphic and Inline graphic be the prior and current sample sizes.

  3. Let Inline graphic be the current Inline graphic-values.

  4. Let Inline graphic be the significance threshold; the default value is Inline graphic.

  5. Let Inline graphic be the dispersion; the default value is Inline graphic.

  6. Set the prior means and variances: Inline graphic and Inline graphic.

  7. Compute the Bayes weights Inline graphic, defined via (3), with parameters Inline graphic and Inline graphic.

  8. Output indices Inline graphic such that Inline graphic.
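The rejection step of Algorithm 1 is the weighted Bonferroni rule. The following minimal sketch illustrates it in Python; the function name and the mean-one normalization convention are ours for illustration, and the weight computation of step 7 is taken as given.

```python
def weighted_bonferroni(p_values, weights, q):
    """Reject hypothesis i when p_i <= q * w_i.

    With the weights normalized to have mean one, the expected number
    of false rejections under the global null is at most J * q, the
    same guarantee as for unweighted Bonferroni testing.
    """
    mean_w = sum(weights) / len(weights)   # enforce the mean-one convention
    return [i for i, (p, w) in enumerate(zip(p_values, weights))
            if p <= q * (w / mean_w)]

# Upweighting the third test allows its p-value of 0.012 to be rejected.
rejected = weighted_bonferroni([0.004, 0.5, 0.012], [0.5, 0.5, 2.0], q=0.01)
```

Unweighted Bonferroni testing is the special case in which all weights equal one, in which case only the first p-value above would be rejected.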

5.3. Practical remarks

It is important that we retain Type I error control even when the modelling assumptions fail. The only requirement is that we have marginally valid p-values. We list two common deviations from our model. First, summary data for genome-wide association studies sometimes include only the magnitudes of the effects and not their signs. In this case we have two choices: we could assume that the directions of the effects are the same and perform a one-tailed test of the current effect in the prior direction; alternatively, we could do a two-tailed test by including the tests with prior parameters Inline graphic and Inline graphic for each Inline graphic, for a total of Inline graphic tests. Large effects will often be in the same direction, whereas small effects may change direction between the prior and current studies. Our procedure for dealing with two-sided effects may lead to a minor loss of power while retaining Type I error control. Second, in some cases the prior and current traits may be of different types; for instance, the prior trait could be binary and the current trait quantitative. In such a situation the model Inline graphic should be re-examined, but it remains convenient as a first approximation.
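The bookkeeping for the two-tailed option can be sketched as follows; the helper below is ours and only performs the doubling of the candidate list, since the Bayes weight computation itself depends on quantities not reproduced here.

```python
def two_sided_candidates(prior_magnitudes):
    """Expand J unsigned prior effects into 2J signed candidates.

    Each hypothesis enters twice, once with each sign of the prior
    effect, so the multiplicity correction is then applied over 2J
    tests rather than J.
    """
    J = len(prior_magnitudes)
    signed = list(prior_magnitudes) + [-m for m in prior_magnitudes]
    index = list(range(J)) * 2      # original hypothesis of each candidate
    return signed, index

signed, index = two_sided_candidates([1.2, 0.3])
```

A rejection of either signed candidate then counts as a discovery for the corresponding hypothesis.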

We recommend using the default value of the tuning parameter, Inline graphic, in all but exceptional cases. This value was derived from a natural Bayesian model, and our simulations and data analysis show that it performs well in most settings. The same numerical results show that our method is not overly sensitive to the choice of tuning parameter. If the relationship between the two traits is thought to be weak, one could use a larger Inline graphic, such as Inline graphic. If the uncertainty in the prior information is less than that suggested by the usual model, one could use a smaller Inline graphic, such as Inline graphic. If the value Inline graphic was tried first, the results of that analysis should also be reported.

One may wish to use the weighted Benjamini–Hochberg method with our weights (Genovese et al., 2006), but in general this will be underpowered, as optimal weights for stepwise methods differ greatly from those for single-step methods (Westfall & Soper, 2001). However, in the special case of very small Inline graphic, we have observed in our data analysis examples that the weights often become monotone increasing in the magnitude of the effect size, and hence resemble the optimal weights for stepwise methods.
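For readers who wish to try that combination despite the caveat, here is a sketch of the weighted Benjamini–Hochberg procedure of Genovese et al. (2006): the ordinary step-up rule applied to the weight-adjusted values Inline graphic with mean-one weights, which controls the false discovery rate under independence.

```python
def weighted_bh(p_values, weights, alpha):
    """Weighted Benjamini-Hochberg step-up procedure.

    Sorts the weight-adjusted p-values p_i / w_i (weights normalized
    to mean one) and rejects the k smallest, where k is the largest
    rank whose adjusted value is at most alpha * k / J.
    """
    J = len(p_values)
    mean_w = sum(weights) / J
    q = [p / (w / mean_w) for p, w in zip(p_values, weights)]
    order = sorted(range(J), key=q.__getitem__)
    k = 0
    for rank, i in enumerate(order, start=1):
        if q[i] <= alpha * rank / J:
            k = rank                       # largest passing rank so far
    return sorted(order[:k])
```

With all weights equal to one this reduces to the usual Benjamini–Hochberg procedure.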

6. Data analysis

6.1. Data sources

We illustrate the application of our method by analysing data from publicly available genome-wide association studies. We use the p-values, recorded for 500 000 to 2.5 million genetic variants, from five studies: CARDIoGRAM and C4D for coronary artery disease (Schunkert et al., 2011; Coronary Artery Disease Genetics Consortium, 2011), blood lipids (Teslovich et al., 2010), schizophrenia (Schizophrenia Psychiatric Genome-Wide Association Study Consortium, 2011), and estimated glomerular filtration rate creatinine, eGFRcrea (Köttgen et al., 2010); see the Supplementary Material.

We analyse three pairs of datasets, with a specific motivation for each. First, we use CARDIoGRAM as prior information for C4D. This is a positive control for our method, since both studies measure coronary artery disease. We choose C4D as the target because it has a smaller sample; hence prior information may increase power more substantially.

Second, we use the blood lipids study as prior information for the schizophrenia study. Andreassen et al. (2013) demonstrated improved power with this pair. They used a fully Bayesian method, and our goal is to evaluate the power improvement using a frequentist method. There is a small overlap between the controls of the two studies.

Third, we use the creatinine study as prior information for the C4D study. Heart disease and renal disease are comorbid (Silverberg et al., 2004), so this set-up may improve power.

6.2. Methods and additional details

We run weighted Bonferroni multiple testing for each of five weighting schemes. The prior data are Inline graphic, where Inline graphic is the Inline graphicth prior p-value. The familywise error rate is controlled at Inline graphic, so the p-value thresholds are approximately Inline graphic to Inline graphic.
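The studies release p-values rather than effect estimates, so the prior p-values must be converted to standardized effects. The exact transformation is not reproduced in this excerpt; a standard choice, shown here purely for illustration, is the two-sided normal-quantile map.

```python
from statistics import NormalDist

def z_from_pvalue(p, sign=1.0):
    """Map a two-sided p-value to a signed z-score, |z| = Phi^{-1}(1 - p/2).

    The sign cannot be recovered from the p-value alone; it must be
    supplied from the reported direction of effect when available.
    """
    return sign * NormalDist().inv_cdf(1.0 - p / 2.0)
```

For example, `z_from_pvalue(0.05)` is roughly 1.96, the familiar two-sided critical value.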

The first four weighting schemes are: unweighted Bonferroni testing, where all weights equal unity; Spjøtvoll weights with parameters Inline graphic; Bayes weights with dispersion Inline graphic or 10; and exponential weights (Roeder et al., 2006), introduced in §4.1, with tilt Inline graphic or 4.

The fifth and last weighting scheme is filtering, which selects the smallest p-values from the prior study and tests only their hypotheses in the current study. We use three p-value thresholds, Inline graphic, Inline graphic and Inline graphic. Rietveld et al. (2014) proposed a method for choosing the optimal p-value threshold for filtering; it requires the genotypic correlation between the two traits and the additive heritability of the current trait. For complex traits these parameters are usually estimated with large uncertainty, and substantial domain expertise is needed to specify them.
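The selection-then-test logic of filtering admits a very short sketch; the function name is ours, and the validity of the Bonferroni correction over the kept set relies on the independence of the prior and current samples assumed throughout.

```python
def filter_and_test(prior_p, current_p, threshold, alpha=0.05):
    """Keep the hypotheses whose prior p-value falls below the threshold,
    then Bonferroni-test only those hypotheses in the current study."""
    kept = [i for i, p in enumerate(prior_p) if p < threshold]
    if not kept:
        return []
    cutoff = alpha / len(kept)    # Bonferroni correction over the kept set
    return [i for i in kept if current_p[i] <= cutoff]

# Two of three hypotheses pass the prior filter; one survives the
# current-study test at the corrected level 0.05 / 2.
rejected = filter_and_test([1e-5, 0.2, 1e-7], [0.01, 0.001, 0.5],
                           threshold=1e-4)
```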

We prune the significant single nucleotide polymorphisms for linkage disequilibrium using the DistiLD database (Palleja et al., 2012). Specifically, for each weighting scheme we select one locus from each linkage disequilibrium block that contains significant loci. Our data analysis pipeline is given in the Supplementary Material.
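The pruning step can be sketched as follows. The rule for choosing the representative locus within a block is not specified in this excerpt, so keeping the locus with the smallest current p-value is our illustrative assumption; the actual analysis uses the DistiLD block definitions.

```python
def prune_by_block(significant, block_of, p_value):
    """One locus per linkage-disequilibrium block: among the significant
    loci in each block, keep the one with the smallest current p-value."""
    best = {}
    for i in significant:
        b = block_of[i]
        if b not in best or p_value[i] < p_value[best[b]]:
            best[b] = i
    return sorted(best.values())

# Four significant loci in two blocks collapse to one locus per block.
pruned = prune_by_block([0, 1, 2, 3],
                        ['blk1', 'blk1', 'blk2', 'blk2'],
                        [0.01, 0.001, 0.04, 0.02])
```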

We compute a score Inline graphic for each weighting scheme Inline graphic with parameters Inline graphic, on each dataset Inline graphic. This is defined as +1 if the weighting scheme increases the number of detections relative to unweighted testing, 0 if it leaves the number unchanged, and -1 otherwise. The score Inline graphic of a weighting scheme Inline graphic with parameters Inline graphic is the sum of its scores across datasets. The total Inline graphic of the weighting scheme Inline graphic is the sum of the scores Inline graphic across parameter settings.
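The scoring rule is simple enough to state in code; the detection counts in the example below are made up and serve only to show the sign convention.

```python
def _sign(x):
    """Return +1, 0 or -1 according to the sign of x."""
    return (x > 0) - (x < 0)

def scheme_score(counts, baseline):
    """Sum of per-dataset scores: +1 when the weighting scheme detects
    more loci than unweighted testing on a dataset, 0 when the same
    number, and -1 when fewer."""
    return sum(_sign(c - b) for c, b in zip(counts, baseline))

# Hypothetical counts on three datasets versus the unweighted baseline:
# one gain, one tie, one loss, for a score of 0.
score = scheme_score([12, 5, 7], [10, 5, 9])
```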

6.3. Results

Table 1 shows the number of significant loci for each pair of studies and for each weighting scheme. We also present the results pruned for linkage disequilibrium, which act as a proxy for the number of independent loci found.

Table 1.

Number of significant loci for five methods on three examples: the top portion of the table shows results pruned for linkage disequilibrium, the middle portion shows results without pruning, and the bottom portion reports the score of each method

                              BayesInline graphic        ExpInline graphic       FilterInline graphic
Parameter        Un  Spjot  Inline graphic   1   10     1    2    4     2    4    6
Pruned
CG → C4D          4     11     10    8    4     4    5    4    10   10    6
Lipids → SCZ      4      1      1    1    5     1    0    0     2    2    2
eGFRcrea → C4D    4      2      2    4    4     4    5    4     1    0    1
Unpruned
CG → C4D         29     45     44   39   29    32   34   27    40   48   34
Lipids → SCZ    116    214    214  223  123    92    0    0   217   96   39
eGFRcrea → C4D   29     18     18   23   29    29   28   19     1    0    1
Scoring
Score             0      0      0    1    1     0    0   -1     0   -1   -1
Total             0      0           2               -1              -2

Un, unweighted; Spjot, Spjøtvoll; BayesInline graphic, Bayesian with Inline graphic or 10; ExpInline graphic, exponential with Inline graphic or 4; FilterInline graphic, filtering with Inline graphic or 6; CG, CARDIoGRAM; SCZ, schizophrenia study; eGFRcrea, creatinine study.

The results are somewhat inconclusive. In the positive control example, all weighting schemes except exponential weighting detect more loci than unweighted testing; Spjøtvoll weighting and filtering yield the largest numbers of loci. In the blood lipids example, the methods generally detect fewer pruned loci, the exception being Bayes weights with Inline graphic. The unpruned counts move in both directions across methods, except for Bayes weights, which uniformly increase the number of loci. In the eGFR creatinine example, exponential weights perform best. We also see that the default Inline graphic never performs worse than both unweighted testing and Spjøtvoll weights simultaneously, and for the unpruned lipids example it outperforms both.

If we allow tuning of the parameters of the three weighting schemes that have one, Bayes weights show good performance: they rank either first or second in all examples. This suggests that our method is robust to the choice of tuning parameter.

Finally, only Bayes weights with Inline graphic or 10 have a positive score. The total score, summed across parameter settings, is also positive only for Bayes weights. Judging from these results, our method shows promise; however, this analysis alone cannot conclusively establish the relative merits of the methods. In future work it will be necessary to evaluate p-value weighting methods on more datasets.

Supplementary material

Supplementary material available at Biometrika online includes proofs of the theoretical results, software implementations in R and MATLAB, and code to reproduce the simulations and data analysis results.

Acknowledgments

Kristen Fortney and Stuart Kim are also affiliated with the Department of Genetics, Stanford University. This research was partially supported by the U.S. National Science Foundation and National Institutes of Health. We are grateful for the reviewers' constructive comments, which have helped to improve the paper.

References

  1. Andreassen O. A., Djurovic S., Thompson W. K., Schork A. J., Kendler K. S., O'Donovan M. C., Rujescu D., Werge T., van de Bunt M. & Morris A. P. et al. (2013). Improved detection of common variants associated with schizophrenia by leveraging pleiotropy with cardiovascular-disease risk factors. Am. J. Hum. Genet. 92, 197–209. [DOI] [PMC free article] [PubMed] [Google Scholar]
  2. Benjamini Y. & Hochberg Y. (1997). Multiple hypotheses testing with weights. Scand. J. Statist. 24, 407–18. [Google Scholar]
  3. Bourgon R., Gentleman R. & Huber W. (2010). Independent filtering increases detection power for high-throughput experiments. Proc. Nat. Acad. Sci. 107, 9546–51. [DOI] [PMC free article] [PubMed] [Google Scholar]
  4. Box G. E. P. (1980). Sampling and Bayes' inference in scientific modelling and robustness (with Discussion). J. R. Statist. Soc. A 143, 383–430. [Google Scholar]
  5. Brooks-Wilson A. R. (2013). Genetics of healthy aging and longevity. Hum. Genet. 132, 1323–38. [DOI] [PMC free article] [PubMed] [Google Scholar]
  6. Carlin B. J. & Louis T. A. (1985). Controlling error rates by using conditional expected power to select tumor sites. In Proc. Biopharm. Sect., Am. Statist. Assoc. Alexandria, Virginia: American Statistical Association, pp. 11–8.
  7. Coronary Artery Disease Genetics Consortium (2011). A genome-wide association study in Europeans and South Asians identifies five new loci for coronary artery disease. Nature Genet. 43, 339–44. [DOI] [PubMed] [Google Scholar]
  8. Darnell G., Duong D., Han B. & Eskin E. (2012). Incorporating prior information into association studies. Bioinformatics 28, i147–53. [DOI] [PMC free article] [PubMed] [Google Scholar]
  9. Eskin E. (2008). Increasing power in association studies by using linkage disequilibrium structure and molecular function as prior information. Genome Res. 18, 653–60. [DOI] [PMC free article] [PubMed] [Google Scholar]
  10. Genovese C. R., Roeder K. & Wasserman L. (2006). False discovery control with p-value weighting. Biometrika 93, 509–24. [Google Scholar]
  11. Gui J., Tosteson T. D. & Borsuk M. E. (2012). Weighted multiple testing procedures for genomic studies. BioData Mining 5, article no. 4. [DOI] [PMC free article] [PubMed] [Google Scholar]
  12. Hjelmborg J., Iachine I., Skytthe A., Vaupel J. W., McGue M., Koskenvuo M., Kaprio J., Pedersen N. L. & Christensen K. (2006). Genetic influence on human lifespan and longevity. Hum. Genet. 119, 312–21. [DOI] [PubMed] [Google Scholar]
  13. Holm S. (1979). A simple sequentially rejective multiple test procedure. Scand. J. Statist. 6, 65–70. [Google Scholar]
  14. Köttgen A., Pattaro C., Böger C. A., Fuchsberger C., Olden M., Glazer N. L., Parsa A., Gao X., Yang Q. & Smith A. V. et al. (2010). New loci associated with kidney function and chronic kidney disease. Nature Genet. 42, 376–84. [DOI] [PMC free article] [PubMed] [Google Scholar]
  15. Louis T. A. & Bailey J. K. (1990). Controlling error rates using prior information and marginal totals to select tumor sites. J. Statist. Plan. Infer. 24, 297–316. [Google Scholar]
  16. Lynch M. & Walsh B. (1998). Genetics and Analysis of Quantitative Traits. Sunderland: Sinauer Associates. [Google Scholar]
  17. Palleja A., Horn H., Eliasson S. & Jensen L. J. (2012). DistiLD Database: Diseases and traits in linkage disequilibrium blocks. Nucleic Acids Res. 40, D1036–40. [DOI] [PMC free article] [PubMed] [Google Scholar]
  18. Peña E. A., Habiger J. D. & Wu W. (2011). Power-enhanced multiple decision functions controlling family-wise error and false discovery rates. Ann. Statist. 39, 556–83. [DOI] [PMC free article] [PubMed] [Google Scholar]
  19. Rietveld C. A., Esko T., Davies G., Pers T. H., Turley P., Benyamin B., Chabris C. F., Emilsson V., Johnson A. D. & Lee J. J. et al. (2014). Common genetic variants associated with cognitive performance identified using the proxy-phenotype method. Proc. Nat. Acad. Sci. 111, 13790–4. [DOI] [PMC free article] [PubMed] [Google Scholar]
  20. Roeder K. & Wasserman L. (2009). Genome-wide significance levels and weighted hypothesis testing. Statist. Sci. 24, 398–413. [DOI] [PMC free article] [PubMed] [Google Scholar]
  21. Roeder K., Bacanu S.-A., Wasserman L. & Devlin B. (2006). Using linkage genome scans to improve power of association in genome scans. Am. J. Hum. Genet. 78, 243–52. [DOI] [PMC free article] [PubMed] [Google Scholar]
  22. Roquain E. & Van De Wiel M. A. (2009). Optimal weighting for false discovery rate control. Electron. J. Statist. 3, 678–711. [Google Scholar]
  23. Rubin D., Dudoit S. & Van der Laan M. (2006). A method to increase the power of multiple testing procedures through sample splitting. Statist. Applic. Genet. Molec. Biol. 5, 1–19. [DOI] [PubMed] [Google Scholar]
  24. Schizophrenia Psychiatric Genome-Wide Association Study Consortium (2011). Genome-wide association study identifies five new schizophrenia loci. Nature Genet. 43, 969–76. [DOI] [PMC free article] [PubMed] [Google Scholar]
  25. Schunkert H., König I. R., Kathiresan S., Reilly M. P., Assimes T. L., Holm H., Preuss M., Stewart A. F., Barbalic M. & Gieger C. et al. (2011). Large-scale association analysis identifies 13 new susceptibility loci for coronary artery disease. Nature Genet. 43, 333–8. [DOI] [PMC free article] [PubMed] [Google Scholar]
  26. Silverberg D., Wexler D., Blum M., Schwartz D. & Iaina A. (2004). The association between congestive heart failure and chronic renal disease. Curr. Opin. Nephrol. Hypertens. 13, 163–70. [DOI] [PubMed] [Google Scholar]
  27. Solovieff N., Cotsapas C., Lee P. H., Purcell S. M. & Smoller J. W. (2013). Pleiotropy in complex traits: Challenges and strategies. Nature Rev. Genet. 14, 483–95. [DOI] [PMC free article] [PubMed] [Google Scholar]
  28. Spjøtvoll E. (1972). On the optimality of some multiple comparison procedures. Ann. Math. Statist. 43, 398–411. [Google Scholar]
  29. Teslovich T. M., Musunuru K., Smith A. V., Edmondson A. C., Stylianou I. M., Koseki M., Pirruccello J. P., Ripatti S., Chasman D. I. & Willer C. J. et al. (2010). Biological, clinical and population relevance of 95 loci for blood lipids. Nature 466, 707–13. [DOI] [PMC free article] [PubMed] [Google Scholar]
  30. Westfall P. H. & Krishen A. (2001). Optimally weighted, fixed sequence and gatekeeper multiple testing procedures. J. Statist. Plan. Infer. 99, 25–40. [Google Scholar]
  31. Westfall P. H. & Soper K. A. (2001). Using priors to improve multiple animal carcinogenicity tests. J. Am. Statist. Assoc. 96, 827–34. [Google Scholar]
  32. Westfall P. H., Krishen A. & Young S. S. (1998). Using prior information to allocate significance levels for multiple endpoints. Statist. Med. 17, 2107–19. [DOI] [PubMed] [Google Scholar]
  33. Westfall P. H., Kropf S. & Finos L. (2004). Weighted FWE-controlling methods in high-dimensional situations. In Recent Developments in Multiple Comparison Procedures, Y. Benjamini, F. Bretz and S. Sarkar, eds. Beachwood, Ohio: Institute of Mathematical Statistics, pp. 143–54.

Articles from Biometrika are provided here courtesy of Oxford University Press
