Skip to main content
Bioinformatics logoLink to Bioinformatics
. 2012 Jun 9;28(12):i147–i153. doi: 10.1093/bioinformatics/bts235

Incorporating prior information into association studies

Gregory Darnell 1, Dat Duong 2, Buhm Han 1, Eleazar Eskin 1,3,*
PMCID: PMC3371867  PMID: 22689754

Abstract

Summary: Recent technological developments in measuring genetic variation have ushered in an era of genome-wide association studies which have discovered many genes involved in human disease. Current methods to perform association studies collect genetic information and compare the frequency of variants in individuals with and without the disease. Standard approaches do not take into account any information on whether or not a given variant is likely to have an effect on the disease. We propose a novel method for computing an association statistic which takes into account prior information. Our method improves both power and resolution by 8% and 27%, respectively, over traditional methods for performing association studies when applied to simulations using the HapMap data. Advantages of our method are that it is as simple to apply to association studies as standard methods, the results of the method are interpretable as the method reports p-values, and the method is optimal in its use of prior information in regards to statistical power.

Availability: The method presented herein is available at http://masa.cs.ucla.edu

Contact: eeskin@cs.ucla.edu

1 INTRODUCTION

The cost of collecting genetic information has been decreasing with advances in high-throughput genomic technology (Matsuzaki et al., 2004). Over the last few years, hundreds of genes have been identified as being associated with common human disease (Risch and Merikangas, 1996; Visscher et al., 2012). Traditionally, association studies are performed without making any assumptions about which variants are more or less likely to be involved in the disease. These methods evaluate an association statistic at each single nucleotide polymorphism (SNP), and only take into account one SNP at a time. The Bonferroni correction is often used to control the overall false-positive rate by uniformly limiting the significance threshold at each SNP (Franke et al., 2010).

Current standard approaches report a p-value for each variant and there is a good understanding in the community of what significance levels are required for genome-wide association (Pe'er et al., 2008). Virtually all association studies report p-values as their results which allows investigators to interpret their findings in the context of other groups' findings.

Although the lack of assumptions has the advantage of being unbiased in the search for variants involved in the disease, we know that not all SNPs contribute equally to the disease (Adzhubei et al., 2010). Recent studies (Eskin, 2008) have shown that incorporating prior information such as the results of functional studies (ENCODE Project Consortium, 2007) can increase statistical power. Unfortunately, the methods for incorporating prior information are complicated and difficult to apply in practice. Methods that use prior information are typically Bayesian association study methods (Pe'er et al., 2006; Fridley et al., 2010) and instead report Bayes factors which are not usually reported in other studies. Although standard methods do not take into account prior information, they do have the advantage of being simple.

We present a novel method for performing association studies using prior information, which is as simple to apply as standard association statistics and reports p-values, yet, optimally incorporates prior information. We extend our method to take advantage of the correlation structure to consider multiple markers in a region (de Bakker et al., 2005; Devlin and Risch, 1995). When considering multiple markers, we compute an association statistic at each SNP in the region, not only the collected SNPs. Incorporating prior SNP information increases power over traditional association studies while maintaining the same overall false-positive rate.

Our method can be used in association studies to improve both the power and resolution of a study. When our method is applied to simulations using data from individuals in the HapMap, we demonstrate a significant increase in power and resolution compared with the methods used in a traditional association study. Our method also has the advantage of maintaining the inherent simplicity of a traditional association study. As a result, the computational complexity of our method matches that of a traditional association study.

We measure resolution by calculating the distance between the location of the assumed causal SNP and the location of the SNP corresponding to the maximum-likelihood ratio. Our method increases the average resolution of the four HapMap populations by 27% and the average power by 8% over the traditional method.

Our method has a connection to multithreshold associations presented in (Eskin, 2008). In this work, we show that the multithreshold association method which uses the prior information optimally to maximize statistical power can be interpreted as a likelihood ratio test (LRT). This is the key observation underlying our approach and allows us to propose a very simple method which is also optimal with respect to statistical power, but has the advantage of being simple and easy to interpret.

2 METHODS

2.1 Traditional association studies

Traditional association studies collect m markers in N/2 case and N/2 control individuals. We assume a low-disease prevalence, however, our method can be easily extended to higher prevalence. For each marker i and relative risk γ, the true case allele frequency is defined as follows: p+ifi/((γ−1)fi+1), where fi is the population allele frequency. The control allele frequency for the causal maker, pi, is approximately equal to the population allele frequency, pifi. The non-causal markers have equal case and control allele frequencies, that is, p+i = pi. The overall allele frequency in the case–control sample is, pi = (p+i+pi)/2. Let Inline graphic, Inline graphic and Inline graphic be the observed values of the corresponding statistic.

The statistic

graphic file with name bts235m1.jpg (1)

for each marker i is approximately normally distributed with variance 1 and mean non-centrality parameter Inline graphic. The power of a standard association study to detect a significant association at a maker i, relies on the non-centrality parameter and is

graphic file with name bts235um1.jpg

where the value of t is the significance threshold, which serves to control the overall false-positive rate, and Φ and Φ−1 denote the cumulative density function (CDF) and the inverse of CDF of the standard normal distribution, respectively. The false-positive rate represents the probability of rejecting the null hypothesis for any marker, assuming there is no causal marker. In our case, we control this probability to α.

If markers are assumed to be independent, the significance threshold is computed using the Bonferroni correction, αs=α/M. It is important to note that in traditional association studies, the significance threshold, αs, is fixed for all markers, M.

When computing the overall power of a study, the power is first computed for each marker, Inline graphic. The overall power, defined as Inline graphic, is the average of the power computed at each marker. For clarity, we have assumed that only the markers are the causal variants, which is clearly not realistic; we drop this assumption below.

2.2 Association studies with prior information

Obviously, we do not know which variant is causal and which variant is not causal. However, some variants are more likely involved in the disease than others based on information on how much of a molecular effect that variant has. Let us assume that a marker i has probability ci of being causal. We define a revised power function that includes prior probabilities

graphic file with name bts235m2.jpg (2)

where αs is set to control the overall false-positive rate to α.

2.3 Maximizing power in a multithreshold association study

One way to incorporate prior information into association studies is to use multithreshold association (Eskin, 2008). In this approach, we use a different significance threshold at each marker and these thresholds are set to maximize the statistical power taking into account prior information (Adzhubei et al., 2010). We denote the new set of thresholds as variables t1tm. Overall power can be defined as a function of the thresholds t1tm.

graphic file with name bts235m3.jpg (3)

By optimizing the threshold at each marker, we can increase the overall power of a standard association study. Our task is to find the t*1t*m that maximizes (3) under the condition that Σt*i=α. This constrained optimization can be solved using the method of Lagrange multipliers, assuming the markers are not correlated. Below we show how to use this method to maximize our object function, and as a result find the per marker threshold, t*1t*m.

The objective function to maximize is

graphic file with name bts235um2.jpg

We take the partial derivative of the objective function with respect to ti and l and set them equal to 0 to obtain

graphic file with name bts235m4.jpg (4)

where ϕ is the probability density function (PDF) of the standard normal distribution. See Appendix for the detailed derivation.

As (4) is true for all marker i∈1…m, we set up the following equality:

graphic file with name bts235m5.jpg (5)

where Σti=α. We can numerically find t*1t*m satisfying (5). We begin with a guess for c*, set it equal to (5), and solve for t1tm simultaneously. If Σti>α, we decrease c* (when Σti<α, we do otherwise) and repeat the process. The resulting t1tm satisfying the constraint Σti=α are denoted t*1t*m.

We gain intuition on our method by imagining that we have a total budget of α to distribute among a portfolio of i assets or stocks. Each asset has a certain return on investment, which depends on the ti, the fraction of the total budget that we allocate to the asset. In order to optimize the total return on our budget of α, we allocate our funds such that the marginal return on investment for each asset is equal. Intuitively, this is because if the marginal return is not equal between investments, we can always increase the overall return by taking out funds from the smaller return investment and put them into the larger return investment. In the case of power, the budget is our overall significance threshold, α. The power return for each marker again depends on the ti we allocate to each marker. In setting the optimal significance threshold for each marker, we consider the rate of return, or power, which depends on the amount invested, or significance threshold. In determining how to allocate the fraction of the overall significance threshold to each marker, we compute the partial derivative of the power function at each marker and set them equal to each other, which is equivalent to the investment-return analogy.

2.4 Connection to LRT

The key observation in this article is that the multithreshold association has a close connection to the LRT. The LRT compares the likelihood ratio of a statistic with a given threshold C*, where the likelihood ratio is a direct comparison of the probability of observing the statistic under the null distribution versus the alternative distribution. It is possible to apply the LRT to determine an LRT statistic of a given marker, and thus determine a significance designation for that marker.

Consider the probability of observing the statistic si in Equation (1). The null distribution is

graphic file with name bts235um3.jpg

and the alternative distribution is, given the non-centrality parameter Inline graphic,

graphic file with name bts235um4.jpg

where the 50:50 mixture is taken assuming that we do not know the direction of the effect (two-sided test). A standard LRT will reject the null hypothesis at si if Inline graphic. C* can be set to control the overall false-positive rate to α.

For the purposes of a multithreshold association study, we will modify the likelihood ratio and the LRT such that the prior information is included,

graphic file with name bts235um5.jpg

This likelihood ratio is exactly the same term found in (5). LRT in this case will be done by comparing this likelihood ratio to a number C*, where

graphic file with name bts235um6.jpg

where C* is the threshold that controls overall false-positive rate to α.

Note that when observed value si at marker i is >−Φ−1(t*i/2) or <Φ−1(t*i/2), then

graphic file with name bts235um7.jpg

and we can reject the null hypothesis at i. Thus, LRT and multithreshold association test are equivalent.

When there are correlations between markers, we can still find a C* such that the chance of  rejecting any of the null hypotheses is α. C* can be easily calculated using permutation (see Appendix).

2.5 Maximizing power for tag SNPs

Previously, we made the assumption that the markers themselves are causal. Usually, markers are more likely to be tags for the causal variation. Using this information, we can assign each potential polymorphism to the best marker, or tag. We use notation vkTi to associate each set of polymorphisms vk to a single marker i. The effect of non-centrality parameter of indirect association is reduced by a factor |rki|, where |rki| is the correlation coefficient between polymorphism k and marker i (Pritchard and Przeworski, 2001). This correlation coefficient can be determined from reference data such as the (HapMap et al., 2005). We can give each polymorphism a prior probability of being causal ck. If a polymorphism k is causal, the power function when observing marker i is Inline graphic. Let us denote the total power captured by a marker i as P(ti,Ti,N). In our case, the total power function of the association study is

graphic file with name bts235um8.jpg

This power function can be maximized with respect to t1tm using the same approach as before. There is a constraint that Σti=α. The objective function now becomes

graphic file with name bts235um9.jpg

We take partial derivatives of this objective function with respect to ti and α and set them equal to zero

graphic file with name bts235um10.jpg

Similarly to Equation (A.3), we can obtain

graphic file with name bts235m6.jpg (6)

In Section 2.4, when markers are assumed to be causal, we detect an association at marker i if the observed statistic si at i satisfies

graphic file with name bts235um11.jpg

Similarly, when makers themselves are not assumed to be causal, we determine an association at maker i if the statistic si at i satisfies

graphic file with name bts235um12.jpg

M* is the threshold that controls false-positive rate to α, the overall significant threshold of the association study. We can determine this M* by permutation even when there are correlations between markers.

2.6 Multiple-testing adjusted p-value

We can obtain multiple-testing adjusted p-values in our multithreshold association study as follows. In an association study with only one marker, we define the test to be significant if the observed statistic Inline graphic is greater than its threshold Φ−1(α/2). Then, we can use Inline graphic to compute a p-value Inline graphic which measures how significant of an association we observed by using the relationship Inline graphic. For example, in a traditional association study with one marker, if the Inline graphic is ≪0.05, then we can say that this marker strongly associates with the disease. In an association study with m markers, we identify the associated markers by comparing each of the observed statistics Inline graphic against its corresponding cutoff Φ−1(t1/2),…,Φ−1(tm/2). To determine how significant the association at each marker is, we need to compute the p-value at that marker. As the cutoff values Φ−1(t1/2),…,Φ−1(tm/2) are usually not identical in our multithreshold association study, we compute the multiple-testing adjusted p-values Inline graphic at m markers separately. The multiple-testing adjusted p-value is the probability under the null hypothesis of observing a significant association at any marker.

We compute the adjusted p-value Inline graphic at a marker i by using its observed Inline graphic. If Inline graphic, then the multiple-testing adjusted p-value is α. For Inline graphic, we need to find the p-value Inline graphic. Estimating this significance level is equivalent to finding the Inline graphic and a new set of thresholds t1*,…,tm* such that when we maximize equation (3) with the constraint Inline graphic, then the cutoff for marker i is Inline graphic. We denote the gradient in Equation (5) at the observed marker i as Inline graphic. As all the partial derivatives of Equation (3) are equal at the optimal solution [Equation (5)], we can determine a new set t1*,…,tm*, where Inline graphic, by setting each of the gradients in Equation (5) equal to Inline graphic. Finally, the p-value Inline graphic is the sum of t1*,…,tm*. Once the Inline graphic have been determined, these values are sorted. The marker corresponding to the lowest p-value is the one that is most strongly associated to the disease. Using this approach, we can report a p-value which is adjusted for prior information for each variant.

2.7 New multivariate normal distribution method for correlated markers

In the previous sections, we assumed independent markers 1…m that are possibly correlated to causal variation. However, marker themselves can be correlated to each other. If the markers are correlated, the statistics at the markers follow a multivariate normal distribution (MVN) with variance–covariance matrix Σ, where the entries of Σ are the correlation coefficients between the markers (Han et al., 2009).

Here, we propose a new LRT method designed for the situation that the markers are correlated. By using information from all the markers, we can have better resolution than when inspecting one of the markers.

If variation i is causal and the marker j is correlated to i, the non-centrality parameter at marker j is Inline graphic (Pritchard and Przeworski, 2001). If we consider correlations between variation i and all the markers, the vector of non-centrality parameters with respect to causal variation i will be Inline graphicInline graphic.

We modify the LRT of the previous section to take into account multiple correlated markers. Given a putative causal variation i, we have the following null hypothesis Inline graphic and the alternate hypothesis Inline graphic at markers 1…m.

Let Inline graphic be the vector of the observed statistics of all markers. The LRT is performed by comparing the probability of Inline graphic under the null and alternative hypothesis, using the formula

graphic file with name bts235m7.jpg (7)

where C* is set to control the false-positive rate to α and can be found by permutation. ϕ is the PDF of the MVN distribution. Again, the 50:50 mixture of the distributions is taken for the alternative hypothesis to perform two-sided test. When the condition in Equation (7) is satisfied, we have sufficient evidence to reject the null hypothesis at variation i.

It should be noted that the new method is different from the methods we described previously in that the testing is performed per each putative causal variant instead of per each marker. The information of putative causal variants is obtained from the reference dataset, for example, by considering all known variants. This implies increased multiple-testing burden because the number of known variants is much greater than the number of markers in general. However, although the number of tests are considerably increased, the test statistics are highly correlated and therefore the actual multiple-testing burden increases less steeply than the number of tests if we use permutation. Our results show that the new method outperforms previous methods in terms of both power and resolution even after we take into account the increased multiple testing burden.

3 RESULTS

3.1 Candidate gene study associations Project

We follow the evaluation protocol shown in (de Bakker et al., 2005) to simulate association studies in a candidate gene-sized region using the HapMap data ENCODE (ENCODE Project Consortium, 2007). In these simulations, we make case and control individuals by randomly sampling from the pool of haplotypes from HapMap samples in the ENCODE regions. The disease status for each individual is decided by randomly assigning an SNP from this region as causal with a certain relative risk. We assume the scenario where we are using a whole-genome genotyping product such as the Affymetrix 500k SNP chip (Matsuzaki et al., 2004) and where a subset of SNPs from this region are collected as markers. Simulation studies are done over the four HapMap populations in each of the 10 ENCODE regions.

We compare power of four different methods. The first method is the traditional association test where the Bonferroni correction is used. The second method is the multithreshold association test where the markers are assumed to be causal. Since this method only takes into account the non-centrality parameter at each marker, it is equivalent to accounting for the minor allele frequencies (MAFs) of the markers to determine the optimal multithresholds. We call this method multithreshold method with MAF prior. The third method is the multithreshold association test where we assume causal variants are in LD with markers. We use the HapMap data and assign each SNP to a marker by choosing the marker with the highest correlation coefficient with the SNP. This method takes into account not only the non-centrality parameters (or MAF) at the causal variants but also how many causal variants are assigned to each marker and how much the marker and the assigned causal variants are correlated. We call this method multithreshold method with LD and MAF prior. This method is equivalent to the method presented in (Eskin, 2008). The fourth method is the new MVN method where we assume the markers are correlated. The testing is performed at each putative causal variant.

To measure power of each method while correctly accounting for the multiple testing burden of each method, we perform the following simulation procedure. Assuming the null hypothesis of no association, we generate 1000 null panels. We compute the maximum statistic among all markers for all null panels to obtain the empirical null distribution of maximum statistic for each method. The top 5% quantile of the distribution gives us the empirically estimated threshold for α=0.05. Then, we generate 1000 alternative panels assuming the disease model. The power is measured as the number of alternative panels whose maximum statistic exceeds the threshold corresponding to α=0.05. This empirical procedure is shown to accurately control the false-positive rate nearly identical to permuting each panel (de Bakker et al., 2005).

Table 1 shows the power of all four methods. In general, the multithreshold method with MAF prior does not show a better performance than the traditional method. This shows that taking into account only MAF and not the correlations between the marker and the causal variants can be not helpful. The multithreshold method with both LD and MAF prior increases power compared with the traditional method, on average by 1.2%. When we consider only the SNPs whose traditional method's power is in mid-range (between 0.1 and 0.9), the power is increased by a greater amount, 4.0%, which goes along with the results of (Eskin, 2008).

Table 1.

Summary of the power comparison among all 10 ENCODE regions

Pop. No. of No. of All SNPs
SNPs with power between 0.1 and 0.9
tags SNPs Trad Mult (MAF), Mult (LD), MVN, Trad Mult (MAF), Mult (LD), MVN,
n (%) n (%) n (%) n (%) n (%) n (%)
CEU 678 10 710 0.664 0.660 (−0.6) 0.672 (1.3) 0.697 (5.0) 0.698 0.706 (1.1) 0.733 (5.0) 0.784 (12.2)
YRI 708 13 176 0.445 0.437 (−1.6) 0.451 (1.5) 0.537 (20.8) 0.586 0.577 (−1.6) 0.604 (3.1) 0.731 (24.7)
CHB 606 8934 0.716 0.710 (−0.9) 0.726 (1.4) 0.760 (6.1) 0.708 0.708 (−0.1) 0.740 (4.5) 0.813 (14.7)
JPT 608 9248 0.684 0.675 (−1.4) 0.690 (0.9) 0.722 (5.6) 0.712 0.696 (−2.3) 0.736 (3.4) 0.803 (12.8)
Average 0.627 0.621 (−1.1) 0.635 (1.2) 0.679 (8.3) 0.676 0.671 (−0.7) 0.703 (4.0) 0.782 (15.7)

The numbers in parentheses are the power gain compared with the traditional method

The power increase of the new MVN method is the greatest among all methods. It increases power by 8.3% on average when considering all SNPs and by 15.7% on average when considering the SNPs with mid-range power. The power increase is even as great as 24.7% in the YRI population for the SNPs with mid-range power. This shows that the new MVN method can be very helpful in detecting associations.

We also look at the resolution, by how far the peak association statistic is located from the actual causal variant. We measure this distance in the unit of basepairs and report the results in Table 2. The multithreshold method with MAF prior does not help the resolution either. On the other hand, the multithreshold method with both LD and MAF prior improves the resolution by 2.4% on average for all SNPs and by 4.0% on average for SNPs with mid-range power. However, it failed to improve the resolution in one case (CEU population and for the SNPs with mid-range power).

Table 2.

Summary of the resolution comparison among all 10 ENCODE regions

Pop. All SNPs
SNPs with power between 0.1 and 0.9
Trad Mult (MAF), Mult (LD), MVN, Trad Mult (MAF), Mult (LD), MVN,
n (%) n (%) n (%) n (%) n (%) n (%)
CEU 33 334 33 377 (−0.1) 33 078 (0.8) 27 153 (18.5) 42 983 43 658 (−1.6) 43 040 (−0.1) 27 505 (36.0)
YRI 47 017 49 128 (−4.5) 45 561 (3.1) 30 247 (35.7) 49 245 50 828 (−3.2) 45 690 (7.2) 27 299 (44.6)
CHB 26 582 25 340 (4.7) 25 977 (2.3) 20 045 (24.6) 36 633 35 182 (4.0) 34 875 (4.8) 19 981 (45.5)
JPT 30 740 30 195 (1.8) 29 808 (3.0) 22 917 (25.4) 42 342 40 771 (3.7) 40 731 (3.8) 20 183 (52.3)
Average 34 418 34 510 (−0.3) 33 606 (2.4) 25 090 (27.1) 42 801 42 610 (0.4) 41 084 (4.0) 23 742 (44.5)

The unit of resolution is basepairs. The numbers in parentheses are the improvement percentage in resolution compared with the traditional method

The new MVN method shows the greatest amount of improvement in resolution. The resolution is improved by 27.1% on average for all SNPs and by 44.5% on average for SNPs with mid-range power. This shows that the resolution is so dramatically improved that the distance between the peak association statistic and the causal variant became, on average, almost half compared with the traditional method in the case of the SNPs with mid-range power.

We make an unrealistic assumption that we know the relative risk of the causal polymorphism to be 1.30 and use this assumption to determine the optimal thresholds. We measure the effect of an incorrect assumption by obtaining optimal thresholds at a relative risk of 1.30 and measure the power of these thresholds under a wide range of relative risks. Figure 1 shows the total average power under different relative risks for the traditional method, the multithreshold method with LD and MAF prior, and the new MVN method. Even when the assumed relative risk is incorrect, the multithreshold method and the new MVN method outperform the traditional method.

Fig. 1.

Fig. 1.

Average power under varying relative risks

3.2 Extrinsic information on candidate gene-sized regions

We measure the impact of extrinsic information on candidate gene-sized regions using the HapMap data ENCODE regions by simulating association studies with unequal priors at the polymorphisms. We first consider the assumption that causal SNPs can be anywhere in the genome, and randomly pick 10% of the variants. Then, we upweight the likelihoods of being causal of these polymorphisms by 25 times. We simulate association studies by picking the causal SNP among these polymorphisms. As we upweight the likelihood of being causal for the actual causal SNP in this simulation, we expect that the methods accounting for the prior information will show a better performance.

In our results, the power increase of the multithreshold method with LD and MAF prior and the MVN method are 2.0% and 11.3% relative to the traditional method, respectively. The amount of power increase is greater compared with when no extrinsic prior information was given (1.2% and 8.3%), as expected. The resolution improvements of the two methods are 5.6% and 55.5%, which are also greater than the improvements we had when no extrinsic prior information was given (4.0% and 15.7%). The most notable change by incorporating extrinsic prior information occurred on the resolution improvement percentage of the MVN method (15.7–55.5%). When considering the SNPs with mid-range power, the resolution improvement of the MVN method relative to the traditional method is as great as 70.4% after incorporating the prior information. This shows that the use of correct prior information can help both the power and the resolution of the methods we proposed, especially the resolution of the MVN method. On the other hand, the power and resolution of the multithreshold method with only MAF prior did not benefit from the extrinsic prior information, possibly because the method only takes into account the markers and not the uncollected putative causal variants.

4 DISCUSSION

We have presented a novel statistical method for incorporating prior information into association studies. The advantage of our method is that we can optimally incorporate prior information with respect to statistical power but still report p-values for each variant. By incorporating the correlation structure underlying the HapMap, we manipulate the significance threshold at each SNP to improve the overall power of a study. Experiments show that our method has a similar computational overhead yet greatly increased power and resolution to traditional association study methods.

Our method has similarities with, the method presented in (Roeder et al., 2007) and (Roeder and Wasserman, 2009). The difference is that while Roeder et al. present the general framework for optimally setting multithresholds for tests, our method is specifically focusing on the context of genetic association studies using available information of both MAF and LD. For example, if only the non-centrality parameter of the statistic at the tested markers is considered as presented in the general framework of Roeder et al., the method will be exactly equivalent to the multithreshold method with MAF prior that we examined in our simulations. We show that this method does not achieve a better performance than the traditional method in terms of both power and resolution. Moreover, we present the novel MVN method that assumes correlation between markers and that outperforms both the traditional method and the multithreshold method of (Eskin, 2008), which is distinctive from the framework assuming independency between the tests.

The new MVN method has some similarities with the weighted haplotype test (Zaitlen et al., 2007) or imputation method (Marchini et al., 2007). In both the new method and their methods, the unobserved causal variant is tested using the observed markers. However, the intrinsic difference is that our method takes into account prior information to optimally set significance thresholds differently to each causal variant in the context of multiple-testing correction. Moreover, the application of the method is much simpler than those methods, requiring only the MAF and LD information from the reference dataset but not the actual haplotype data.

As we use prior information to improve power and resolution, the drawback is that the performance will not be optimal if the prior information is incorrect. For example, the MAF or LD information from the HapMap data can have sampling variation and the extrinsic information about the deleterious effect of the variant can be inaccurate. A possible approach dealing with inaccurate prior information can be explicitly accounting for the uncertainty. For example, we reduce the correlation coefficient estimated from the HapMap by a small amount because the likelihood of the MVN distribution becomes zero if the perfectly correlated markers in the reference is not perfectly correlated in the sample. A systematic approach dealing with prior uncertainty will be an interesting subject of the future research. However, we assume that the uncertainty in the prior information will be decreased in the future as the sample size of the reference dataset increases (1000 Genomes Project Consortium, 2010).

Funding: G.D., D.D., B.H. and E.E. are supported by National Science Foundation grants 0513612, 0731455, 0729049, 0916676 and 1065276, and National Institutes of Health grants K25-HL080079, U01-DA024417, P01-HL30568 and PO1-HL28481. B.H. is supported by the Samsung Scholarship.

Conflict of Interest: None declared.

APPENDIX

A1. Maximizing power in a multithreshold association study

Our task is to find t*1t*m that maximize (3) under the condition Σti=α. We use the method of Lagrange multipliers to find such t*1t*m. This is our objective function

graphic file with name bts235um13.jpg

When we take partial derivative with respect to ti and set it equal to 0, we observe

graphic file with name bts235um14.jpg

First, we present the definition of the CDF of the normal distribution:

graphic file with name bts235m8.jpg (A.1)

Now, we present the properties of erf(x).

graphic file with name bts235m9.jpg (A.2)

Now, we consider the power function

graphic file with name bts235um15.jpg

where k is the non-centrality parameter.

To maximize the power of a standard association study with respect to t, we find

graphic file with name bts235um16.jpg

By applying (A.1), we get

graphic file with name bts235um17.jpg

By using (A.2), we get

graphic file with name bts235m10.jpg (A.3)

Equation (A.3) can be simplified using the property Φ−1 (1−t/2)=−Φ−1(t/2),

graphic file with name bts235um18.jpg
graphic file with name bts235m11.jpg (A.4)

Now, we solve for Inline graphic. We use the property:

graphic file with name bts235m12.jpg (A.5)

and the property:

graphic file with name bts235m13.jpg (A.6)

Using, (A.5) and (A.6), we find

graphic file with name bts235um19.jpg

By plugging this into (A.4), we have

graphic file with name bts235um20.jpg

This equation is the likelihood ratio of a random variable s at the value Φ−1(t/2) or equivalently at the value −Φ−1(t/2), between the alternative hypothesis s~0.5N(k,1)+0.5N(−k,1) and the null hypothesis s~N(0,1). The 50:50 mixture of the two symmetrically positioned distributions under the alternative hypothesis can be considered due to the two-sided testing that we perform when we do not have any prior knowledge about the direction of the allele effect.

A2. Controlling false-positive rate using permutation

When considering many correlated markers in an association study, we need to find a valid threshold of statistic, C*, to set the significance threshold for each test such that the overall false-positive rate of the study is controlled to α. The following algorithm is the general method for the permutation procedure, for which the resulting vector allows us to determine a certain C* to limit the overall false-positive rate of the study. In our example, the 5% quantile of S yields an α level of 0.05.

graphic file with name bts235i37.jpg

Footnotes

The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint First Authors.

REFERENCES

  1. 1000 Genomes Project Consortium. A map of human genome variation from population-scale sequencing. Nature. 2010;467:1061–1073. doi: 10.1038/nature09534. [DOI] [PMC free article] [PubMed] [Google Scholar]
  2. Adzhubei I.A., et al. A method and server for predicting damaging missense mutations. Nat. Methods. 2010;7:248–249. doi: 10.1038/nmeth0410-248. [DOI] [PMC free article] [PubMed] [Google Scholar]
  3. Altshuler D., et al. A haplotype map of the human genome. Nature. 2005;437:1299–1320. doi: 10.1038/nature04226. [DOI] [PMC free article] [PubMed] [Google Scholar]
  4. de Bakker P.I.W., et al. Efficiency and power in genetic association studies. Nat. Genet. 2005;37:1217–1223. doi: 10.1038/ng1669. [DOI] [PubMed] [Google Scholar]
  5. Devlin B., Risch N. A comparison of linkage disequilibrium measure for fine-scale mapping. Genomics. 1995;29:311–322. doi: 10.1006/geno.1995.9003. [DOI] [PubMed] [Google Scholar]
  6. ENCODE Project Consortium. Identification and analysis of functional elements in 1% of the human genome by the encode pilot project. Nature. 2007;447:799–816. doi: 10.1038/nature05874. [DOI] [PMC free article] [PubMed] [Google Scholar]
  7. Eskin E. Increasing power in association studies by using linkage disequilibrium structure and molecular function as prior information. Genome Res. 2008;18:653–660. doi: 10.1101/gr.072785.107. [DOI] [PMC free article] [PubMed] [Google Scholar]
  8. Franke A., et al. Genome-wide meta-analysis increases to 71 the number of confirmed crohn's disease susceptibility loci. Nat. Genet. 2010;4:1118–1125. doi: 10.1038/ng.717. [DOI] [PMC free article] [PubMed] [Google Scholar]
  9. Fridley B.L., et al. Bayesian mixture models for the incorporation of prior knowledge to inform genetic association studies. Genet. Epidemiol. 2010;34:418–426. doi: 10.1002/gepi.20494. [DOI] [PMC free article] [PubMed] [Google Scholar]
  10. Han B., et al. Rapid and accurate multiple testing correction and power estimation for millions of correlated markers. PLoS Genet. 2009;5:1–14. doi: 10.1371/journal.pgen.1000456. [DOI] [PMC free article] [PubMed] [Google Scholar]
  11. Marchini J., et al. A new multipoint method for genome-wide association studies by imputation of genotypes. Nat. Genet. 2007;39:906–913. doi: 10.1038/ng2088. [DOI] [PubMed] [Google Scholar]
  12. Matsuzaki H., et al. Genotyping over 100,000 SNPS on a pair of oligonucleotide arrays. Nat. Methods. 2004;1:109–111. doi: 10.1038/nmeth718. [DOI] [PubMed] [Google Scholar]
  13. Pe'er I., et al. Evaluating and improving power in whole-genome association studies using fixed marker sets. Nat. Genet. 2006;38:663–667. doi: 10.1038/ng1816. [DOI] [PubMed] [Google Scholar]
  14. Pe'er I., et al. Estimation of the multiple testing burden for genomewide association studies of nearly all common variants. Genet. Epidemiol. 2008;32:381–385. doi: 10.1002/gepi.20303. [DOI] [PubMed] [Google Scholar]
  15. Pritchard J., Przeworski M. Linkage disequilibrium in humans: models and data. Am. J. Hum. Genet. 2001;69:1–4. doi: 10.1086/321275. [DOI] [PMC free article] [PubMed] [Google Scholar]
  16. Roeder K., Wasserman L. Genome-wide significance levels and weighted hypothesis testing. Stat. Sci. 2009;24:398–413. doi: 10.1214/09-STS289. [DOI] [PMC free article] [PubMed] [Google Scholar]
  17. Risch N., Merikangas K. The future of genetic studies of complex human diseases. Science. 1996;273:1516–1517. doi: 10.1126/science.273.5281.1516. [DOI] [PubMed] [Google Scholar]
  18. Roeder K., et al. Improving power in genome-wide association studies: weights tip the scale. Genet. Epidemiol. 2007;31:741–747. doi: 10.1002/gepi.20237. [DOI] [PubMed] [Google Scholar]
  19. Visscher P.M., et al. Five years of GWAS discovery. Am. J. Hum. Genet. 2012;90:7–24. doi: 10.1016/j.ajhg.2011.11.029. [DOI] [PMC free article] [PubMed] [Google Scholar]
  20. Zaitlen N., et al. Leveraging the hapmap correlation structure in association studies. Am. J. Hum. Genet. 2007;80:683–691. doi: 10.1086/513109. [DOI] [PMC free article] [PubMed] [Google Scholar]

Articles from Bioinformatics are provided here courtesy of Oxford University Press

RESOURCES