Abstract
We develop a new method for large-scale frequentist multiple testing with Bayesian prior information. We find optimal p-value weights that maximize the average power of the weighted Bonferroni method. Due to the nonconvexity of the optimization problem, previous methods that account for uncertain prior information are suitable for only a small number of tests. For a Gaussian prior on the effect sizes, we give an efficient algorithm that is guaranteed to find the optimal weights nearly exactly. Our method can discover new loci in genome-wide association studies and compares favourably to competitors. An open-source implementation is available.
Keywords: Genome-wide association study, Multiple testing, Nonconvex optimization, p-value weighting, Weighted Bonferroni method
1. Introduction
The research presented in this paper is motivated by the genetics of human longevity. Genome-wide association studies of longevity compare long-lived individuals with matched controls (Brooks-Wilson, 2013). More than 500 000 genetic variants have been tested for their association with longevity, which amounts to a large multiple hypothesis testing problem. In addition to multiplicity, the sample size is small, usually of the order of a few hundred. As a consequence, only a few loci have been replicably associated with human longevity, and they do not explain the heritability of the trait (Hjelmborg et al., 2006).
The multiplicity may be countered by testing only a few candidate variants selected based on prior scientific knowledge. In a separate work in preparation, led by the second author, we find that a more general genome-wide test helps to improve power in a study of longevity. We leverage prior information from genome-wide association studies of age-related diseases, such as coronary artery disease and diabetes. For this task, we develop a new large-scale method of frequentist multiple testing with Bayesian prior information. In this paper we provide the theory for this method.
Our method is a novel p-value weighting scheme; p-value weighting is a general methodology for multiple testing that leverages independent prior information to improve power (Roeder & Wasserman, 2009; Gui et al., 2012). Suppose that we test the hypotheses H_i via the p-values P_i for i = 1, ..., J. For a significance level α, the weighted Bonferroni method declares the ith hypothesis to be significant if P_i ≤ αw_i/J. The nonnegative weights w_i are based on independent data. The familywise error rate, the probability of making at least one error, is controlled if the weights average to 1, as it is at most Σ_{i=1}^J αw_i/J = α.
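As a sketch of this rule (not the authors' implementation), the weighted Bonferroni test can be written as follows, with the weights required to be nonnegative and to average to one:

```python
import numpy as np

def weighted_bonferroni(pvals, weights, alpha=0.05):
    """Reject hypothesis i when P_i <= alpha * w_i / J.

    The weights must be nonnegative and average to one, so that the
    familywise error rate is at most alpha.
    """
    pvals = np.asarray(pvals, dtype=float)
    weights = np.asarray(weights, dtype=float)
    J = len(pvals)
    assert np.all(weights >= 0) and abs(weights.mean() - 1.0) < 1e-8
    return pvals <= alpha * weights / J

# Unit weights recover the ordinary Bonferroni method.
rej = weighted_bonferroni([0.004, 0.2, 0.03], [1.0, 1.0, 1.0], alpha=0.05)
```

With unequal weights, hypotheses receiving weight above one are tested at a more generous threshold, paid for by the hypotheses receiving weight below one.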
In previous work, optimal weights have been found in a Gaussian model of hypothesis testing. Let the test statistics in the current study be T_i ~ N(μ_i, 1), where the μ_i are the means, or effect sizes; we test the null hypotheses H_i: μ_i ≥ 0 against the alternatives μ_i < 0. We have some information about the μ_i from prior studies. Roeder & Wasserman (2009) and Rubin et al. (2006) considered a model where the μ_i are known exactly from the prior data, and the weights are allowed to depend on the μ_i. In such a model they found the optimal weights for the weighted Bonferroni method, which maximize the expected number of discoveries. We show that this amounts to solving a convex optimization problem.
The assumption that the μ_i are known precisely is problematic: if they were known, there would be no need for a follow-up study. In practice, empirical estimates of the μ_i are used. However, the fixed-μ_i weights do not take into account the uncertainty in the estimates. Instead, we account for uncertainty explicitly by considering the model with uncertain prior information in the form μ_i ~ N(η_i, σ_i²). Only the prior means η_i and standard errors σ_i are known from independent data, not the precise effect sizes. Finding the optimal weights, which we call Bayes weights, is then a nonconvex optimization problem.
Westfall et al. (1998) formulated a general framework that includes this problem as a special case and allows, for instance, for Student t-distributed priors. They used a direct numerical solver, a quasi-Newton optimization method whose cost grows rapidly with the number of tests, to find the weights. Published examples using this approach are typically small (Westfall et al., 1998; Westfall & Soper, 2001). This method of computing the weights does not scale up to our problems, which involve more than 500 000 genetic variants. Further, the generic quasi-Newton method has no guarantee of finding the global optimum of the nonconvex problem.
Our key contribution here is to provide an efficient method of finding the weights that maximize average power for the weighted Bonferroni method, in the model with Gaussian priors. We solve the optimization problem exactly for small per-test significance levels q = α/J, below a problem-dependent value that is large in practice. For larger q, we can solve the problem for a nearby level q′ close to q. The cost per iteration of our algorithm is O(J) in the first case and O(J log J) in the second case. We observe that a nearly constant number of iterations is used, regardless of J. We find it remarkable that this problem admits a near-exact solution.
For large-scale problems, this approach leads to a method for multiple testing that controls a frequentist error measure while also taking into account Bayesian prior information. This method follows George Box's advice to be Bayesian when predicting but frequentist when testing (Box, 1980). Similar ideas were used previously by Carlin & Louis (1985); see §2. As mentioned, a more general formulation was also considered in Westfall et al. (1998) and Westfall & Soper (2001). We show that our approach is feasible for large-scale problems.
When prior information is uncertain, we show via simulations that the new method has more power and is more stable than competitors. We also show theoretically that weighting leads to substantially improved power. We apply the method to genome-wide association studies. By analysing several such datasets, we show that our method has advantages in terms of power and easier tuning compared to other methods.
With rapidly increasing volumes of data available as prior information for any given study, our method should be useful for other problems in biology and elsewhere. The data analysis and computational results in this paper are reproducible, and an open-source implementation of the method is available from the authors.
2. Related work
There is a large literature on statistical methods for multiple testing with prior information, some of which is reviewed in Roeder & Wasserman (2009) and Gui et al. (2012). Spjøtvoll (1972) devised optimal single-step multiple testing procedures maximizing average or minimal power and controlling the familywise error rate. Later it was recognized that Spjøtvoll's results are equivalent to optimal p-value weighting methods. For instance, Benjamini & Hochberg (1997) developed extensions of Spjøtvoll's methods for p-value weighting, allowing for weights also in the importance of the hypotheses.
Leveraging Spjøtvoll's results, Rubin et al. (2006) and Roeder & Wasserman (2009) found an explicit formula for optimal weights of the weighted Bonferroni method in the Gaussian model T_i ~ N(μ_i, 1), assuming that the effects are known exactly. In practice the effects are estimated, but the weights do not take this into account. These weights are optimal for average power, and the method is efficient enough for large applications. Eskin (2008) and Darnell et al. (2012) applied the framework of Roeder & Wasserman (2009) to genome-wide association studies; they accounted for correlations between the tests but assumed that the effects are known exactly.
Another popular approach is to test only the top candidates from a prior study, often known as two-stage testing or a candidate study. It can be viewed as a p-value weighting method where some of the weights equal zero. A specific version for genome-wide association studies has been called the proxy-phenotype method (Rietveld et al., 2014).
In the literature on carcinogenicity trials, related methods have been devised to select tumour sites based on historical data (Carlin & Louis, 1985; Louis & Bailey, 1990); the methods are explicitly Bayesian with regard to historical data and frequentist in analysing current data. These models and methods differ from ours, and focus on pairwise comparisons based on Fisher's exact test (Louis & Bailey, 1990).
Westfall et al. (1998) considered a Gaussian model for the effects in hypothesis testing, in which prior distributions are known for the means. They formulated the problem of finding the weights that maximize expected power for the weighted Bonferroni method, and this was followed up for binary data in Westfall & Soper (2001), motivated by carcinogenicity trials. As mentioned in §1, published studies using their optimization methods are typically small.
Less work exists on weighted methods beyond the single-step Bonferroni method, or beyond the control of the familywise error rate. The step-down method of Holm (1979) can use weights, and Westfall & Krishen (2001) and Westfall et al. (2004) discuss the choice of optimal weights. Genovese et al. (2006) showed that the weighted Benjamini–Hochberg procedure controls the false discovery rate, and Roquain & Van De Wiel (2009) proposed a method of choosing weights optimally, assuming fixed known effects. Peña et al. (2011) developed a general framework for optimal multiple decision functions for the control of familywise error rate and false discovery rate, assuming exact knowledge of the alternatives.
In this paper we focus on the familywise error rate, because it is the standard measure of error controlled in our motivating application, and because in this case it is already challenging to find the optimal weights accounting for uncertainty on a large scale. Extension of this work to the Benjamini–Hochberg procedure and to false discovery rate control is left for future research.
3. Theoretical results
3.1. Background
We work in the Gaussian means model of hypothesis testing: we observe test statistics T_i ~ N(μ_i, 1) (i = 1, ..., J) and test each null hypothesis H_i: μ_i ≥ 0 against the alternative μ_i < 0. The p-value for testing H_i is P_i = Φ(T_i), where Φ denotes the standard normal cumulative distribution function.
For a weight vector w = (w_1, ..., w_J) and a significance level q, the weighted Bonferroni procedure rejects H_i if P_i ≤ qw_i. Usually this corresponds to q = α/J for a familywise error rate of α. For general weights, the expected number of false rejections, known as the per-family error rate, equals Σ_{i: μ_i ≥ 0} pr(P_i ≤ qw_i) ≤ q Σ_{i=1}^J w_i. If Σ_{i=1}^J w_i ≤ J, the expected number of false rejections is at most qJ = α. By Markov's inequality, this implies that the familywise error rate is at most α. Hence the weighted Bonferroni method controls the familywise error rate. This result does not require independence of the T_i. We assume always that w_i ≥ 0, and usually that Σ_{i=1}^J w_i = J. Without loss of generality, we restrict the weights to the interval [0, 1/q].
Let us denote the number of rejections by R(w) = Σ_{i=1}^J 1{P_i ≤ qw_i}, where 1{·} is the indicator function. The optimal weights maximizing the expected number of discoveries, assuming a priori known effects μ_i, were found explicitly by Roeder & Wasserman (2009) and independently by Rubin et al. (2006). Denoting by E_μ the expectation with respect to T_i ~ N(μ_i, 1), they solved the constrained optimization problem

maximize_w E_μ{R(w)} = Σ_{i=1}^J Φ{Φ^{-1}(qw_i) − μ_i}  subject to  Σ_{i=1}^J w_i = J,  w_i ≥ 0.   (1)
It was not noted previously that this problem is convex. The objective is a sum of terms of the form Φ{Φ^{-1}(qw) − μ}, whose concavity in w for μ < 0 follows directly by differentiation. Yet, by simple Lagrangian optimization, the above papers showed that if all μ_i < 0, the optimal weights are w_i = w(μ_i), where

w(μ) = q^{-1} Φ(μ/2 + c/μ).   (2)

Here c is the unique normalizing constant such that the weights sum to J. Interestingly, the weights are not monotonic as a function of μ, but are largest for intermediate values of |μ|. As noted by Roeder & Wasserman (2009), formula (2) is a direct consequence of Spjøtvoll's theory of optimality in multiple testing (Spjøtvoll, 1972). Accordingly, we call these weights the Spjøtvoll weights.
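A minimal numerical sketch of formula (2), not the authors' code: the normalizing constant c can be found by a scalar root-finder, since the mean weight is monotone in c.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq

def spjotvoll_weights(mu, q):
    """Spjotvoll weights w_i = Phi(mu_i/2 + c/mu_i)/q for negative means,
    with c chosen so that the weights average to one (formula (2))."""
    mu = np.asarray(mu, dtype=float)
    assert np.all(mu < 0)

    def excess(c):
        # Mean weight minus one; decreasing in c because mu < 0.
        return norm.cdf(mu / 2 + c / mu).mean() / q - 1.0

    c = brentq(excess, -50.0, 50.0)
    return norm.cdf(mu / 2 + c / mu) / q

w = spjotvoll_weights([-1.0, -2.0, -3.0, -4.0], q=0.05 / 4)
```

On this example the largest weight goes to the intermediate mean −2, illustrating the non-monotonicity noted above.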
3.2. Weighting leads to substantial power gain
To illustrate theoretically that p-value weighting can lead to increased power, we compare the power of optimal weighting with that of unweighted testing in a sparse mixture model.

First, we note that p-value weighting exploits the heterogeneity of the tests. In the simplest case there are only large and small effects, say a common negative mean μ < 0 and the null value 0. We consider the J → ∞ limit, and for simplicity we suppose that εJ is an integer. Let the fractions of large and small effects be ε and 1 − ε, respectively, so that εJ of the means equal μ and the remaining (1 − ε)J equal zero. We solve for the optimal weights.
Proposition 1 —
There is a set of optimal p-value weights that gives the same weights to the same means, i.e., weights w_0 and w_1 to the means 0 and μ, respectively, where

w_1 = min{1/ε, Φ(μ/2)/q},   w_0 = (1 − εw_1)/(1 − ε).

Further, the per-test power of the optimal p-value weighting method is

εΦ{Φ^{-1}(q/ε) − μ}  if q ≤ εΦ(μ/2),   and   q + ε{1 − 2Φ(μ/2)}  otherwise.

If the absolute effect size |μ| is small enough that q ≤ εΦ(μ/2), all the weight is placed on the larger means, which is the behaviour we would expect intuitively. However, if |μ| is large enough that εΦ(μ/2) < q, then it is advantageous to place some weight on the small means, because a large absolute effect size will be detected with high probability even with reduced weight.
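The two-group trade-off can be explored numerically. The sketch below (hypothetical parameter values, not the paper's code) maximizes the per-test power directly over the weight w_1 placed on the nonzero means, and compares the result with unweighted Bonferroni testing.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize_scalar

def two_group_power(eps, mu, q):
    """Maximum per-test power when a fraction eps of the means equal
    mu < 0 and the rest equal 0, with common weights w1 (on mu) and
    w0 (on 0) satisfying (1 - eps) * w0 + eps * w1 = 1."""
    def neg_power(w1):
        w0 = (1.0 - eps * w1) / (1.0 - eps)
        # Null means contribute q * w0; alternatives contribute
        # Phi(Phi^{-1}(q * w1) - mu).
        return -((1 - eps) * q * w0 + eps * norm.cdf(norm.ppf(q * w1) - mu))
    upper = min(1.0 / eps, 1.0 / q)
    res = minimize_scalar(neg_power, bounds=(1e-12, upper), method="bounded")
    return -res.fun

pow_w = two_group_power(eps=0.01, mu=-3.0, q=1e-4)
# Unweighted Bonferroni power in the same model:
pow_u = (1 - 0.01) * 1e-4 + 0.01 * norm.cdf(norm.ppf(1e-4) + 3.0)
```

For these values the optimally weighted procedure is roughly three times as powerful as the unweighted one, consistent with the power gains discussed next.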
In Fig. 1(a) we plot the ratio P_w/P_u as a function of (ε, μ) for a fixed q, where P_w is the optimal power from Proposition 1 and P_u is the power of unweighted Bonferroni testing. For most effect sizes μ and fractions ε, we see a power gain of at least 50% relative to unweighted Bonferroni testing. Moreover, there is a hotspot where the power gain can be three- to four-fold. Optimal weighting can lead to a significant gain in power.
Fig. 1.
Power gain and nonconvexity: (a) contour plot of the power ratio of optimal to unweighted testing for sparse means; (b) plots of four different instances of the function that is summed in the optimization objective: the nonconvex summand w ↦ Φ[{Φ^{-1}(qw) − η}/γ] is plotted for four representative pairs (η, σ) (solid, dashed, dotted and dot-dashed).
3.3. Weights with imperfect prior knowledge
In the previous sections it was assumed that the effects
are known precisely. We now assume that we have uncertain prior information in the form
.
Following Westfall et al. (1998), we maximize the expected power
averaged with respect to the random
and
. Introducing
, the optimization problem, which we call the Bayes weights problem, becomes
![]() |
(3) |
This objective function is not concave if any σ_i > 0. To help with visualization, the summand w ↦ Φ[{Φ^{-1}(qw) − η}/γ] is plotted in Fig. 1(b) for four parameter pairs (η, σ). On the interval [0, 1/q], the function is first concave and then convex.
Our main contribution is to solve this problem efficiently for large J. The results in this respect are two-fold. First, we can solve the problem exactly in the special case where q is sufficiently small. Second, we have a nearly exact solution for arbitrary q. Starting with the simpler first case, we define, for λ > 0,

c(η, σ; λ) = −η/σ² − (γ/σ²){η² + 2σ² log(λγ/q)}^{1/2},   γ = (1 + σ²)^{1/2}.

A weighted one-sided test rejecting when P_i ≤ qw_i can be written equivalently in terms of the critical values as T_i ≤ c_i, with c_i = Φ^{-1}(qw_i). It turns out that the critical values corresponding to the optimal Bayes weights can be expressed in terms of c(η_i, σ_i; λ), when q is small enough that

Σ_{i=1}^J Φ{c(η_i, σ_i; q)} ≥ qJ.   (4)
In our data analysis examples and simulations, this mild restriction requires only that α = qJ be below thresholds far larger than the significance levels used in practice. In the next result we give the exact optimal weights for small q when all σ_i > 0.
Theorem 1 —
If the significance level q is small enough that (4) holds, then the optimal Bayes weights maximizing the average power (3) are w_i = Φ{c(η_i, σ_i; λ)}/q, where λ is the unique constant such that Σ_{i=1}^J w_i = J.
In the Supplementary Material, we solve this problem by maximizing the Lagrangian. Two key properties that we use are joint separability of the objective function and constraint, and analytic tractability of the Gaussian density.
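A numerical sketch of this construction, under our reading of the closed form and assuming all η_i < 0, all σ_i > 0 and a small level q (not the authors' released implementation): the constant λ is found by a one-dimensional search, since the mean weight is decreasing in λ.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq

def bayes_weights(eta, sigma, q):
    """Bayes p-value weights w_i = Phi{c(eta_i, sigma_i; lam)}/q, with the
    Lagrange constant lam chosen so that the weights average to one."""
    eta = np.asarray(eta, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    gamma = np.sqrt(1.0 + sigma ** 2)

    def weights(lam):
        inner = eta ** 2 + 2.0 * sigma ** 2 * np.log(lam * gamma / q)
        w = np.full(eta.shape, 1.0 / q)   # no stationary point: boundary weight
        ok = inner >= 0
        c = (-eta[ok] - gamma[ok] * np.sqrt(inner[ok])) / sigma[ok] ** 2
        w[ok] = norm.cdf(c) / q
        return w

    # The mean weight decreases as lam grows; bisect on log(lam).
    excess = lambda t: weights(np.exp(t)).mean() - 1.0
    t = brentq(excess, np.log(q) - 30.0, np.log(q) + 30.0)
    return weights(np.exp(t))

w = bayes_weights(eta=[-1.0, -2.0, -3.0], sigma=[1.0, 1.0, 1.0], q=1e-3)
```

Each evaluation of the mean weight costs O(J), in line with the per-iteration cost stated in §1.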
Figure 2 displays an instance of the optimal weights as a function of the prior mean η and the standard deviation σ. In the theorem the weights are a function of (η_i, σ_i), but they can also be viewed as a function of a generic pair (η, σ) via the natural map w(η, σ) = Φ{c(η, σ; λ)}/q. As the standard error σ becomes small, our weights tend to the Spjøtvoll weights.
Proposition 2 —
For any λ > 0 and η < 0, the Bayes weight function defined by w(η, σ) = Φ{c(η, σ; λ)}/q tends, as σ → 0, to the Spjøtvoll weight function defined in (2) with constant c = log(λ/q).
Fig. 2.
Bayes weights: (a) surface plot and (b) contour plot of the Bayes weight function w(η, σ) defined in Theorem 1; the Spjøtvoll weights lie on the segment σ = 0.
With σ_i > 0, the weights are regularized: more extreme weights are shrunk towards a common value in a nonlinear way. For finite σ_i, our weights can be viewed as a smooth interpolation between Spjøtvoll weights and uniform weights. It is reasonable to think at first that as all σ_i → ∞, the best weight allocation becomes the uniform one. However, this is not the case: a symmetry-breaking phenomenon occurs due to nonconvexity.

Consider a weight vector that equals 1/q for ⌊qJ⌋ indices, and assume that qJ is not an integer. Distribute the remaining strictly positive weight equally among the remaining hypotheses. It is now easy to see that the hypotheses with weights equal to 1/q are always rejected, so their power equals 1. For the remaining hypotheses the power Φ[{Φ^{-1}(qw_i) − η_i}/γ_i] tends to Φ(0) = 1/2 as σ_i → ∞. This shows that the limiting power of this unbalanced weighting scheme is ⌊qJ⌋ + (J − ⌊qJ⌋)/2. For uniform weighting, the power tends to 1/2 as σ_i → ∞, for each hypothesis. This shows that the limiting power of uniform weighting is J/2. Hence, the power of the skewed weighting scheme is larger than that of uniform weighting. This illustrates the symmetry-breaking phenomenon caused by the extreme nonconvexity of the optimization problem.
Fortunately, the situation is better when condition (4) holds. In addition to being easy to check for any given parameters η_i and σ_i, we now show that the constraint is mild. Often we want to keep α = qJ small even if J is large, because α bounds the expected number of false rejections that we tolerate. In this regime, the condition holds as long as there are a few average-sized negative prior means η_i. We denote by Φ^{-1} the normal quantile function.
Proposition 3 —
Condition (4) holds if there are l distinct indices i with negative η_i, for which

c(η_i, σ_i; q) ≥ Φ^{-1}(qJ/l).

If |η_i| ≥ σ_i(2 log γ_i)^{1/2}, then {η_i² + 2σ_i² log γ_i}^{1/2} ≤ 2^{1/2}|η_i|, so the simple condition holds provided that

(2^{1/2}γ_i − 1)|η_i|/σ_i² ≤ −Φ^{-1}(qJ/l).

For instance, if α = qJ = 0·05 and l = 10, then Φ^{-1}(α/l) = Φ^{-1}(0·005) ≈ −2·58. If, moreover, σ_i = 1, so that γ_i = 2^{1/2} and 2^{1/2}γ_i − 1 = 1, and |η_i| ≥ (log 2)^{1/2} ≈ 0·83, then we need only ten effect sizes with η_i between about −2·58 and −0·83. This is a weak requirement.
When q is small, we use a damped Newton method to find the right constant λ from Theorem 1 via a one-dimensional line search. The function evaluations cost O(J) per iteration, and empirically we find that the algorithm takes only a small number of iterations to converge, independently of J. We can solve problems involving more than two million tests in a few seconds on a desktop computer.
Now we present our result for the general case.
Theorem 2 —
For any q, the nonconvex Bayes weights problem can be solved for a nearby level q′ for which the computed weights are exactly optimal. The optimal weights and q′ can be found in O(J log J) steps.
This result is relevant when α = qJ, the expected number of errors under the null hypothesis, is controlled at a threshold greater than 1/2. Our weights will be optimal for a q′ that is close to q. We see from the proof that even for large q, q′ often equals q. The method also returns the value of q′, which the user can inspect. It is then the user's decision as to whether to perform multiple testing adjustment at the original level q or at the new level q′.
The analysis of nonconvex optimization problems is challenging. It seems remarkable that the nonconvex Bayes weights problem admits a nearly exact solution.
4. Simulation studies
4.1. Bayes weights are more powerful than competing weighting schemes
We perform two simulation studies to explore the empirical performance of our method. First, we show that Bayes weights increase power more reliably than two other weighting schemes, namely exponential weights and filtering.
For Bayes weights, we multiply the prior variances by a dispersion factor φ, i.e., we use φσ_i² in place of σ_i². The default value for this tuning parameter is φ = 1 and, as discussed in §5.3, we recommend use of the default value in most cases. The purpose of changing the dispersion is to explore the robustness of our method with respect to misspecification of the prior variances. The dispersion ranges from 0 to 4, and Spjøtvoll weights correspond to φ = 0.
Exponential weights with tilt parameter β are defined as w_i ∝ exp(β|η_i|), normalized so that the weights average to one. This weighting scheme was proposed by Roeder et al. (2006), who also suggested a default value for the tilt; we consider a range of tilts β. As noted by Roeder et al. (2006), exponential weights are sensitive to large means. To guard against this sensitivity, we truncate weights larger than 1/q and redistribute their excess weight among the next largest weights.
Filtering methods test only the most significant prior effects, those with η_i below a threshold η_0 ≤ 0, using the unweighted Bonferroni method. These methods can be viewed as weighting schemes in which some weights are zero. Such methods are known under many names, such as two-stage testing, screening, or proxy-phenotype methods (Rietveld et al., 2014). We adopt the term filtering used by Bourgon et al. (2010), who filter based on independent information in the current dataset rather than prior information. The threshold η_0 ranges over negative values up to 0. If η_0 is so extreme that fewer than a fixed minimal number of hypotheses would be tested, then we instead test that many of the most significant hypotheses.
In the simulation, we generate J random prior means η_i and variances σ_i² independently across tests, and we set q = α/J. For any weight vector w, we calculate the power as the objective in (3) divided by J, to reflect the average power per test.
The results are shown in Fig. 3(a). Each method can improve the power over unweighted testing. However, Bayes weights yield more power than the other methods. The best power is attained when the dispersion φ is equal to 1, but good power is reached in a large neighbourhood of this value. Our weights are robust with respect to misspecification of the tuning parameter.
Fig. 3.
(a) Power of four p-value weighting methods plotted as a function of their parameter: unweighted (solid), Bayes (dashed) as a function of the dispersion φ, exponential (dotted) as a function of the tilt β, and filtering (dot-dashed) as a function of the threshold η_0; the Spjøtvoll weights correspond to the point φ = 0 on the Bayes weights curve. (b) Power comparison for sparse means: deterministic (left) and average (right) power plotted as a function of the proportion of large means ε, for the unweighted (solid), Spjøtvoll (dashed) and Bayes (dotted) methods.
In particular, taking uncertainty into account helps. Spjøtvoll weights, which assume fixed and known effects and are represented in the figure as regularized weights with φ = 0, have less power than Bayes weights with positive φ, for a wide range of φ.
The remaining two methods, filtering and exponential weights, have disadvantages. While filtering yields a gain in power for well-chosen thresholds η_0, it also leads to a substantial power loss for poorly chosen ones. For sufficiently extreme η_0 the power levels off at a low value, because only the fixed minimal number of top hypotheses are selected. Another significant disadvantage is that there seems to be no principled way to choose η_0 a priori without additional assumptions. Similarly, exponential weighting leads to at most a small gain in power, and it usually leads to a power loss.
We conclude that Bayes weights are robust with respect to the choice of the tuning parameter and have uniformly good power. In contrast, exponential weighting and filtering are more sensitive, and their power can drop substantially.
4.2. Bayes weights have a worst-case advantage
We show that Bayes weights have a worst-case advantage compared with Spjøtvoll weights. We use the sparse means model of §3.2: we generate J means μ_i, each equal to a common negative value μ with probability ε and to 0 otherwise, and take η_i = μ_i as prior means. We set α = 0·05 and vary ε from 0 to 0·1. We set all σ_i equal to a common value σ and consider σ = 1/2 or 1.
Spjøtvoll weights are optimal for the deterministic problem (1), while Bayes weights are optimal for the average problem (3). We evaluate these weighting schemes by calculating the power that they do not maximize, i.e., the average power (3) for Spjøtvoll weights and the deterministic power (1) for Bayes weighting. We also compute the power of the unweighted Bonferroni method.
The results are displayed in Fig. 3(b). Bayes weights lose only a little power compared to the optimal Spjøtvoll weights. In contrast, Spjøtvoll weights lose a lot of power relative to Bayes weights, which maximize the worst-case power. Bayes weights show a maximin property. Further, as shown in the Supplementary Material, Spjøtvoll weights lose power for small ε because they set the weights equal to zero on the small means.
5. Application to genome-wide association studies
5.1. Review of genome-wide association studies
We adapt our framework to genome-wide association studies, relying on basic notions of quantitative genetics (see, e.g., Lynch & Walsh, 1998). In this section we present in detail the methodology for this application, while also illustrating the steps of using our framework for specific problems.
We study a quantitative trait y in a population, with the goal of understanding the effects of single nucleotide polymorphisms x_i on the trait. We assume that x_i has mean 0 and known variance; here x_i denotes the centred minor allele count of variant i for an individual. We rely on the linear model for the effect of the ith variant on the trait: y = x_iβ_i + ε_i. In this model y is the phenotype of a randomly sampled individual from the population, so y is random, β_i is a fixed unknown constant, and ε_i is the residual error. This error is a zero-mean random variable that is independent of x_i, with variance σ²_{ε,i}.
Suppose that we observe a sample of n independent and identically distributed observations from this model. We use the standard linear regression estimate β̂_i, which for a large sample size has an approximate distribution β̂_i ~ N[β_i, σ²_{ε,i}/{n var(x_i)}]. To standardize, we divide by σ_{ε,i}/{n var(x_i)}^{1/2}, where var(x_i) is the variance of x_i.
With these steps, we have framed our problem in the Gaussian means model. Writing T_i for the standardized estimate and μ_i = β_i{n var(x_i)}^{1/2}/σ_{ε,i}, we have T_i ~ N(μ_i, 1), which has the required form. Let us also define the standardized effect size b_i = β_i{var(x_i)}^{1/2}/σ_{ε,i}, so that μ_i = n^{1/2}b_i; this quantity will be of key importance.
5.2. Prior information
To use prior information, assume that we also have a prior trait y′ which is measured on a different, independent sample from the same population. With the same assumptions on y′, we can write y′ = x_iβ_i′ + ε_i′. Here β_i′ is a fixed unknown constant, and ε_i′ is the residual error. Suppose that we have independent samples of size n′ and n for the prior and current traits. If we define T_i′ and b_i′ by analogy to the definitions for y, we can write T_i′ ~ N(n′^{1/2}b_i′, 1).
We model the relatedness of the two traits as a relation between the standardized effect sizes b_i and b_i′, which do not depend on the sample size. If the two traits are closely related, the first-order approximation is equality, or b_i = b_i′. This model captures the pleiotropy between the two traits (Solovieff et al., 2013).

The final step is to compute the distribution of μ_i given the prior data T_i′. For this we need to choose a prior for b_i′, and for simplicity we will use a flat prior.
We now have all ingredients for the model of Gaussian hypothesis testing with uncertain information. Specifically, we have μ_i | T_i′ ~ N(η_i, σ_i²), where η_i = (n/n′)^{1/2}T_i′ and σ_i² = n/n′.
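This conditional distribution can be sanity-checked by simulation. The sketch below uses hypothetical sample sizes, with a very diffuse normal on the shared standardized effect b as a stand-in for the flat prior:

```python
import numpy as np

# Monte Carlo check (sketch) of mu | T' ~ N{(n/n')^{1/2} T', n/n'}.
rng = np.random.default_rng(0)
n, n_prior = 1000, 4000
b = rng.normal(0.0, 2.0, size=2_000_000)          # diffuse stand-in prior
t_prior = rng.normal(np.sqrt(n_prior) * b, 1.0)   # prior-study z-scores
mu = np.sqrt(n) * b                               # current-study means

sel = np.abs(t_prior - 3.0) < 0.05                # condition on T' near 3
post_mean, post_var = mu[sel].mean(), mu[sel].var()
# Theory: mean (n/n')^{1/2} * 3 = 1.5 and variance n/n' = 0.25.
```

The empirical conditional mean and variance of μ agree with (n/n′)^{1/2}T′ and n/n′ up to Monte Carlo error.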
The uncertainty in the prior information may be larger than n/n′, owing to overdispersion; allowing for this is one way to weaken the first-order approximation b_i = b_i′. To allow for overdispersion, we recall the dispersion parameter φ used in our simulations, and model the prior variance as σ_i² = φn/n′. The default value φ = 1 is recommended in most cases. Finally, we compute the Bayes weights w_i with parameters η_i and σ_i², and we run the weighted Bonferroni method on the current p-values. This fully specifies the method, which is summarized in the following algorithm.
Algorithm 1 —
Bayes-weighted Bonferroni multiple testing in genome-wide association studies
Let T_i′ be the prior effect size estimates for i = 1, ..., J.
Let n′ and n be the prior and current sample sizes.
Let P_i be the current p-values.
Let α be the significance threshold; the default value is α = 0·05.
Let φ be the dispersion; the default value is φ = 1.
Set the prior means and variances: η_i = (n/n′)^{1/2}T_i′ and σ_i² = φn/n′.
Compute the Bayes weights w_i, defined via (3), with parameters η_i and σ_i².
Output the indices i such that P_i ≤ αw_i/J.
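The plumbing of Algorithm 1 can be sketched as follows (hypothetical function names; `weight_fn(eta, sigma, q)` stands in for any routine that solves the Bayes weights problem (3) and returns nonnegative weights averaging to one):

```python
import numpy as np

def gwas_weighted_bonferroni(t_prior, n_prior, n_curr, pvals,
                             weight_fn, alpha=0.05, phi=1.0):
    """Sketch of Algorithm 1: build prior means and variances from prior
    GWAS z-scores, compute weights, run the weighted Bonferroni test."""
    t_prior = np.asarray(t_prior, dtype=float)
    pvals = np.asarray(pvals, dtype=float)
    J = len(pvals)
    ratio = n_curr / n_prior
    eta = np.sqrt(ratio) * t_prior               # prior means
    sigma = np.sqrt(phi * ratio) * np.ones(J)    # prior standard deviations
    w = weight_fn(eta, sigma, alpha / J)
    return np.flatnonzero(pvals <= alpha * w / J)

# With unit weights the procedure reduces to ordinary Bonferroni testing.
uniform = lambda eta, sigma, q: np.ones_like(eta)
hits = gwas_weighted_bonferroni([-4.0, -1.0, 0.5], 4000, 1000,
                                [1e-8, 0.2, 0.9], uniform)
```

Passing the weight computation as an argument keeps the error-controlling part of the pipeline independent of how the nonconvex weight problem is solved.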
5.3. Practical remarks
It is important that we retain Type I error control even when the modelling assumptions fail. The only requirement is that we have marginally valid p-values. We list two common deviations from our model. First, summary data for genome-wide association studies sometimes include only the magnitude of the effects and not their sign. In this case we have two choices: we could assume that the directions of effects are the same, and perform a one-tailed test of the current effect in the prior direction; alternatively, we could do a two-tailed test by including the tests with prior parameters (η_i, σ_i²) and (−η_i, σ_i²) for each i, for a total of 2J tests. Large effects will often be in the same direction, whereas small effects may change direction between the prior and current studies. Our procedure for dealing with two-sided effects may lead to minor power loss while retaining Type I error control. Second, in some cases the prior and current traits can be of different types; for instance, the prior trait could be binary and the current trait quantitative. In such a situation, the model b_i = b_i′ should be re-examined, but it is still convenient to use as a first approximation.
We recommend using the default value of the tuning parameter, φ = 1, in all but exceptional cases. This value was derived from a natural Bayesian model, and our simulations and data analysis show that it provides good performance in most cases. The same numerical results demonstrate that our method is not too sensitive to the choice of tuning parameter. If the relationship between the two traits is thought to be weak, one could use a larger value of φ. If the uncertainty in the prior information is less than that suggested by the usual model, one could use a smaller value of φ. If the value φ = 1 was tried first, the results of that analysis should also be reported.
One may wish to use the weighted Benjamini–Hochberg method with our weights (Genovese et al., 2006), but in general this will be underpowered, as optimal weights for stepwise methods differ greatly from those for single-step methods (Westfall & Soper, 2001). However, in the special case of very small q, in our data analysis examples we have observed that the weights often become monotonically increasing in the magnitude of the prior effect size, and thus are similar to the optimal weights for stepwise methods.
6. Data analysis
6.1. Data sources
We illustrate the application of our method by analysing data from publicly available genome-wide association studies. We use the p-values, recorded for 500 000 to 2·5 million genetic variants, from five studies: CARDIoGRAM and C4D for coronary artery disease (Schunkert et al., 2011; Coronary Artery Disease Genetics Consortium, 2011), blood lipids (Teslovich et al., 2010), schizophrenia (Schizophrenia Psychiatric Genome-Wide Association Study Consortium, 2011), and estimated glomerular filtration rate creatinine (Köttgen et al., 2010); see the Supplementary Material.
We analyse three pairs of datasets, with a specific motivation for each. First, we use CARDIoGRAM as prior information for C4D. This is a positive control for our method, since both studies measure coronary artery disease. We choose C4D as the target because it has a smaller sample; hence prior information may increase power more substantially.
Second, we use the blood lipids study as prior information for the schizophrenia study. Andreassen et al. (2013) demonstrated improved power with this pair. They used a fully Bayesian method, and our goal is to evaluate the power improvement using a frequentist method. There is a small overlap between the controls of the two studies.
Third, we use the creatinine study as prior information for the C4D study. Heart disease and renal disease are comorbid (Silverberg et al., 2004), so this set-up may improve power.
6.2. Methods and additional details
We run weighted Bonferroni multiple testing for each of five weighting schemes. The prior data are the z-scores T_i′ = Φ^{-1}(P_i′), where P_i′ is the ith prior p-value. The familywise error rate is controlled at α = 0·05, so that the per-test p-value thresholds α/J are approximately 2 × 10^{-8} to 10^{-7}.
The first four weighting schemes are: unweighted Bonferroni testing, where all weights equal unity; Spjøtvoll weights with parameters η_i; Bayes weights with dispersion φ = 0·1, 1 or 10; and exponential weights (Roeder et al., 2006), introduced in §4.1, with tilt β = 1, 2 or 4.
The fifth and last weighting scheme is filtering, which selects the smallest p-values from the prior study and tests their hypotheses in the current study. We use three prior p-value thresholds, 10^{-2}, 10^{-4} and 10^{-6}. Rietveld et al. (2014) proposed a method for choosing the optimal p-value threshold for filtering, which requires the genotypic correlation between the two traits and the additive heritability of the current trait. For complex traits, these parameters are usually estimated with large uncertainty, and substantial domain expertise is needed to specify them.
We prune the significant single nucleotide polymorphisms for linkage disequilibrium using the DistiLD database (Palleja et al., 2012). Specifically, for each weighting scheme we select one locus from each linkage disequilibrium block that contains significant loci. Our data analysis pipeline is given in the Supplementary Material.
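The pruning step can be sketched as keeping one representative locus per linkage disequilibrium block. The block identifiers and locus names below are hypothetical stand-ins for a DistiLD lookup.

```python
def prune_by_ld_block(significant_loci, block_of):
    """Keep one locus per linkage disequilibrium block.

    `block_of` maps each significant locus to its LD block identifier
    (a hypothetical stand-in for a DistiLD database query); the first
    significant locus seen in each block is retained."""
    seen, kept = set(), []
    for locus in significant_loci:
        block = block_of[locus]
        if block not in seen:
            seen.add(block)
            kept.append(locus)
    return kept

# Four significant loci falling into two LD blocks yield two
# approximately independent loci after pruning.
loci = ["rs1", "rs2", "rs3", "rs4"]
blocks = {"rs1": "chr9:22k", "rs2": "chr9:22k",
          "rs3": "chr1:55k", "rs4": "chr1:55k"}
print(prune_by_ld_block(loci, blocks))
```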
We compute a score for each weighting scheme, with each of its parameter settings, on each dataset. The score on a dataset equals +1 if the weighting scheme increases the number of detections relative to unweighted testing, 0 if it leaves the number unchanged, and -1 otherwise. The score of a weighting scheme with a given parameter setting is the sum of these scores across datasets, and the total score of the weighting scheme is the sum of its scores across parameter settings.
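The scoring rule can be illustrated numerically; the detection counts below are toy values, not those of Table 1.

```python
def dataset_score(n_weighted, n_unweighted):
    """+1 if the weighting scheme finds more loci than unweighted
    testing on this dataset, 0 if the same number, -1 otherwise."""
    return int(n_weighted > n_unweighted) - int(n_weighted < n_unweighted)

def scheme_score(weighted_counts, unweighted_counts):
    """Score of one parameter setting: sum of per-dataset scores."""
    return sum(dataset_score(w, u)
               for w, u in zip(weighted_counts, unweighted_counts))

# Toy counts on three datasets: more, fewer, and equally many loci
# than unweighted testing, giving a score of +1 - 1 + 0 = 0.
unweighted = [4, 4, 4]
weighted   = [10, 1, 4]
print(scheme_score(weighted, unweighted))
```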
6.3. Results
Table 1 shows the number of significant loci for each pair of studies and for each weighting scheme. We also present the results pruned for linkage disequilibrium, which act as a proxy for the number of independent loci found.
Table 1.
Number of significant loci for five methods on three examples: the top portion of the table shows results pruned for linkage disequilibrium, the middle portion shows results without pruning, and the bottom portion reports the score of each method

| Parameter | Un | Spjot | Bayes 1/10 | Bayes 1 | Bayes 10 | Exp 1 | Exp 2 | Exp 4 | Filter 10^-2 | Filter 10^-4 | Filter 10^-6 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| *Pruned* | | | | | | | | | | | |
| CG → C4D | 4 | 11 | 10 | 8 | 4 | 4 | 5 | 4 | 10 | 10 | 6 |
| Lipids → SCZ | 4 | 1 | 1 | 1 | 5 | 1 | 0 | 0 | 2 | 2 | 2 |
| eGFRcrea → C4D | 4 | 2 | 2 | 4 | 4 | 4 | 5 | 4 | 1 | 0 | 1 |
| *Unpruned* | | | | | | | | | | | |
| CG → C4D | 29 | 45 | 44 | 39 | 29 | 32 | 34 | 27 | 40 | 48 | 34 |
| Lipids → SCZ | 116 | 214 | 214 | 223 | 123 | 92 | 0 | 0 | 217 | 96 | 39 |
| eGFRcrea → C4D | 29 | 18 | 18 | 23 | 29 | 29 | 28 | 19 | 1 | 0 | 1 |
| *Scoring* | | | | | | | | | | | |
| Score | 0 | 0 | 0 | 1 | 1 | 0 | 0 | -1 | 0 | -1 | -1 |
| Total | 0 | 0 | 2 | | | -1 | | | -2 | | |

Un, unweighted; Spjot, Spjøtvoll; Bayes, Bayesian with dispersion 1/10, 1 or 10; Exp, exponential with tilt 1, 2 or 4; Filter, filtering with prior p-value threshold 10^-2, 10^-4 or 10^-6; CG, CARDIoGRAM; SCZ, schizophrenia study; eGFRcrea, creatinine study. The total score of each weighting scheme is the sum of its scores across parameter settings.
The results are somewhat inconclusive. In the positive control example, all weighting schemes except exponential weighting detect more loci than unweighted testing; Spjøtvoll weighting and filtering lead to the largest number of loci. In the blood lipids example, the methods generally detect fewer pruned loci, except for Bayes weights with dispersion 10. The methods can detect both a larger and a smaller number of unpruned loci, except in the case of Bayes weights, which uniformly increase the number of loci. For the eGFR creatinine example, exponential weights behave best. We also see that Bayes weights with the default dispersion never perform worse than both unweighted testing and Spjøtvoll weights, and for the unpruned lipids example they are better.
If we allow tuning of parameters for the three weighting schemes that have such a parameter, Bayes weights show good performance: they are either first or second in all examples. This shows that our method is robust with respect to the choice of tuning parameter.
Finally, only Bayes weights with dispersion 1 or 10 have a positive score. The total score, summed across parameter settings, is also positive only for Bayes weights. Judging from these results, our method shows promise. However, from this analysis alone we cannot establish conclusively the relative merits of the methods. In future work it will be necessary to evaluate p-value weighting methods on more datasets.
Supplementary material
Supplementary material available at Biometrika online includes proofs of the theoretical results, software implementations in R and MATLAB, and code to reproduce the simulations and data analysis results.
Acknowledgments
Kristen Fortney and Stuart Kim are also affiliated with the Department of Genetics, Stanford University. This research was partially supported by the U.S. National Science Foundation and National Institutes of Health. We are grateful for the reviewers' constructive comments, which have helped to improve the paper.
References
- Andreassen O. A., Djurovic S., Thompson W. K., Schork A. J., Kendler K. S., O'Donovan M. C., Rujescu D., Werge T., van de Bunt M. & Morris A. P. et al. (2013). Improved detection of common variants associated with schizophrenia by leveraging pleiotropy with cardiovascular-disease risk factors. Am. J. Hum. Genet. 92, 197–209.
- Benjamini Y. & Hochberg Y. (1997). Multiple hypotheses testing with weights. Scand. J. Statist. 24, 407–18.
- Bourgon R., Gentleman R. & Huber W. (2010). Independent filtering increases detection power for high-throughput experiments. Proc. Nat. Acad. Sci. 107, 9546–51.
- Box G. E. P. (1980). Sampling and Bayes' inference in scientific modelling and robustness (with Discussion). J. R. Statist. Soc. A 143, 383–430.
- Brooks-Wilson A. R. (2013). Genetics of healthy aging and longevity. Hum. Genet. 132, 1323–38.
- Carlin B. J. & Louis T. A. (1985). Controlling error rates by using conditional expected power to select tumor sites. In Proc. Biopharm. Sect., Am. Statist. Assoc. Alexandria, Virginia: American Statistical Association, pp. 11–8.
- Coronary Artery Disease Genetics Consortium (2011). A genome-wide association study in Europeans and South Asians identifies five new loci for coronary artery disease. Nature Genet. 43, 339–44.
- Darnell G., Duong D., Han B. & Eskin E. (2012). Incorporating prior information into association studies. Bioinformatics 28, i147–53.
- Eskin E. (2008). Increasing power in association studies by using linkage disequilibrium structure and molecular function as prior information. Genome Res. 18, 653–60.
- Genovese C. R., Roeder K. & Wasserman L. (2006). False discovery control with p-value weighting. Biometrika 93, 509–24.
- Gui J., Tosteson T. D. & Borsuk M. E. (2012). Weighted multiple testing procedures for genomic studies. BioData Mining 5, article no. 4.
- Hjelmborg J., Iachine I., Skytthe A., Vaupel J. W., McGue M., Koskenvuo M., Kaprio J., Pedersen N. L. & Christensen K. (2006). Genetic influence on human lifespan and longevity. Hum. Genet. 119, 312–21.
- Holm S. (1979). A simple sequentially rejective multiple test procedure. Scand. J. Statist. 6, 65–70.
- Köttgen A., Pattaro C., Böger C. A., Fuchsberger C., Olden M., Glazer N. L., Parsa A., Gao X., Yang Q. & Smith A. V. et al. (2010). New loci associated with kidney function and chronic kidney disease. Nature Genet. 42, 376–84.
- Louis T. A. & Bailey J. K. (1990). Controlling error rates using prior information and marginal totals to select tumor sites. J. Statist. Plan. Infer. 24, 297–316.
- Lynch M. & Walsh B. (1998). Genetics and Analysis of Quantitative Traits. Sunderland: Sinauer Associates.
- Palleja A., Horn H., Eliasson S. & Jensen L. J. (2012). DistiLD Database: Diseases and traits in linkage disequilibrium blocks. Nucleic Acids Res. 40, D1036–40.
- Peña E. A., Habiger J. D. & Wu W. (2011). Power-enhanced multiple decision functions controlling family-wise error and false discovery rates. Ann. Statist. 39, 556–83.
- Rietveld C. A., Esko T., Davies G., Pers T. H., Turley P., Benyamin B., Chabris C. F., Emilsson V., Johnson A. D. & Lee J. J. et al. (2014). Common genetic variants associated with cognitive performance identified using the proxy-phenotype method. Proc. Nat. Acad. Sci. 111, 13790–4.
- Roeder K. & Wasserman L. (2009). Genome-wide significance levels and weighted hypothesis testing. Statist. Sci. 24, 398–413.
- Roeder K., Bacanu S.-A., Wasserman L. & Devlin B. (2006). Using linkage genome scans to improve power of association in genome scans. Am. J. Hum. Genet. 78, 243–52.
- Roquain E. & Van De Wiel M. A. (2009). Optimal weighting for false discovery rate control. Electron. J. Statist. 3, 678–711.
- Rubin D., Dudoit S. & Van der Laan M. (2006). A method to increase the power of multiple testing procedures through sample splitting. Statist. Applic. Genet. Molec. Biol. 5, 1–19.
- Schizophrenia Psychiatric Genome-Wide Association Study Consortium (2011). Genome-wide association study identifies five new schizophrenia loci. Nature Genet. 43, 969–76.
- Schunkert H., König I. R., Kathiresan S., Reilly M. P., Assimes T. L., Holm H., Preuss M., Stewart A. F., Barbalic M. & Gieger C. et al. (2011). Large-scale association analysis identifies 13 new susceptibility loci for coronary artery disease. Nature Genet. 43, 333–8.
- Silverberg D., Wexler D., Blum M., Schwartz D. & Iaina A. (2004). The association between congestive heart failure and chronic renal disease. Curr. Opin. Nephrol. Hypertens. 13, 163–70.
- Solovieff N., Cotsapas C., Lee P. H., Purcell S. M. & Smoller J. W. (2013). Pleiotropy in complex traits: Challenges and strategies. Nature Rev. Genet. 14, 483–95.
- Spjøtvoll E. (1972). On the optimality of some multiple comparison procedures. Ann. Math. Statist. 43, 398–411.
- Teslovich T. M., Musunuru K., Smith A. V., Edmondson A. C., Stylianou I. M., Koseki M., Pirruccello J. P., Ripatti S., Chasman D. I. & Willer C. J. et al. (2010). Biological, clinical and population relevance of 95 loci for blood lipids. Nature 466, 707–13.
- Westfall P. H. & Krishen A. (2001). Optimally weighted, fixed sequence and gatekeeper multiple testing procedures. J. Statist. Plan. Infer. 99, 25–40.
- Westfall P. H. & Soper K. A. (2001). Using priors to improve multiple animal carcinogenicity tests. J. Am. Statist. Assoc. 96, 827–34.
- Westfall P. H., Krishen A. & Young S. S. (1998). Using prior information to allocate significance levels for multiple endpoints. Statist. Med. 17, 2107–19.
- Westfall P. H., Kropf S. & Finos L. (2004). Weighted FWE-controlling methods in high-dimensional situations. In Recent Developments in Multiple Comparison Procedures, Y. Benjamini, F. Bretz and S. Sarkar, eds. Beachwood, Ohio: Institute of Mathematical Statistics, pp. 143–54.