Abstract
Objective
To develop effective methods for genome-wide association studies (GWAS) in admixed populations, such as African Americans.
Methods
We show that, when testing the null hypothesis that the test single nucleotide polymorphism (SNP) is not in background linkage disequilibrium (LD) with the causal variants, several existing methods fail to control the family-wise error rate (FWER) in the strong sense in GWAS. These methods include association tests adjusting for global ancestry and joint association tests that combine statistics from admixture mapping tests with statistics from association tests that correct for local ancestry. Furthermore, we describe a smooth generalized sequential Bonferroni (smooth-GSB) procedure for GWAS that incorporates smoothed weights calculated from admixture mapping tests into association tests that correct for local ancestry. We applied the smooth-GSB procedure to GWAS data on African Americans from the Atherosclerosis Risk in Communities (ARIC) Study.
Results
Our simulation studies indicate that the smooth-GSB procedure not only controls the FWER but also improves statistical power compared with association tests correcting for local ancestry.
Conclusion
The smooth-GSB procedure can perform better than several existing methods for GWAS in admixed populations.
Keywords: Admixture mapping, GWAS, sequential Bonferroni procedures, admixture LD, background LD
INTRODUCTION
Admixed populations are populations formed by the recent admixture of two or more ancestral populations. For example, African Americans often have ancestries from West Africans and Europeans. The global ancestry of an admixed individual is defined as the proportion of his/her genome inherited from a specific ancestral population, and can be estimated as the average proportion of alleles inherited from one ancestry across the whole genome. The local ancestry of an individual at a specific marker is the proportion of alleles at the marker that are inherited from the given ancestral population with a true value of 0, 0.5 or 1. The difference between the local ancestry at a specific marker and the global ancestry of an individual is referred to as the local deviation of ancestry at the marker.
Three sources of linkage disequilibrium in admixed populations
For admixed populations, there are three types of linkage disequilibrium (LD) [1]. The first LD source is variation in global ancestry among sampled individuals, which leads to dependence (i.e., LD) among markers across the genome, even though they are from different chromosomes. Individuals with a large global ancestry from a specific ancestral population have an excess of alleles that are common in the ancestral population. This type of LD is called mixture LD. In association studies, mixture LD can generate spurious associations (false positive findings) and adjusting for global ancestry is able to control the false positive findings caused by mixture LD.
The second type of LD, admixture LD, is formed in local chromosome regions as a result of admixture over the past several hundred years when large chromosomal segments were inherited from a particular ancestral population, resulting in the temporary generation of long haplotype blocks (usually several megabases (Mbs) or longer) [2]. The local ancestry (proportion) from a specific ancestral population in these blocks may differ from the global ancestry (proportion). The admixture LD only exists in local regions and can be used to identify a chromosomal region (usually several Mbs) harboring a causal variant.
The third type of LD is background LD, which is inherited by admixed populations from ancestral populations. The background LD is the traditional LD that exists in much shorter haplotype blocks (usually less than a few hundred kilobases (Kbs)) in homogeneous ancestral populations and is the result of recombination over hundreds to thousands of generations [2]. To illustrate admixture LD and background LD, we show a special case in Figure 1 where a large chromosomal region with admixture LD contains a small region with background LD and a causal variant is located inside the small background LD region. For association studies, we hope to identify SNPs that are in background LD with causal variants.
Figure 1.

A large admixture LD (ALD) region (whole white bar) that contains a small background LD (BLD) region (grey bar) with a causal variant inside the small BLD region.
Ancestry-trait admixture mapping tests
Admixture LD has been exploited to locate causal variants that have different allele frequencies among different ancestral populations [3–7]. Mapping by admixture linkage disequilibrium is also called admixture mapping [2]. Admixture mapping can only map a causal variant into a wide region of 4–10 cM [2, 8]. Roughly speaking, most admixture mapping tests are based on testing the association between a trait and the local ancestry deviation at a marker. For example, the null hypothesis in admixture mapping tests can be H01: the test SNP is not in admixture LD with the causal variants. A main advantage of admixture mapping is that only ancestry informative markers (AIMs) are required to be genotyped and tested. An AIM is a marker that has a substantial allele frequency difference between two ancestral populations [9]. If a marker shows a signal in an admixture mapping test, the marker is said to have an admixture mapping signal. A necessary condition for a marker to have an admixture mapping signal is that the marker is in admixture LD with a causal variant that has different allele frequencies between the ancestral populations.
Genotype-trait association tests correcting for ancestries
Since admixture mapping tests can only identify wide regions (several Mbs) harboring causal variants, (genotype-trait) association tests have been employed to map causal variants into small regions (of a few hundred Kbs) by using background LD. For an association test, if no confounding effects exist, the null hypothesis of no association between the test SNP and the trait is approximately equivalent to the null hypothesis H02: the test SNP is not in background LD with causal variants. To control for the confounding effect of global ancestry, association tests that adjust for global ancestry (or for principal components of genome-wide genotype scores) have been developed [10–12]. These tests can remove the confounding effect of mixture LD but, in a special situation, cannot remove the confounding effect of local admixture LD. Here we illustrate the special situation with an example.
Example 1
Suppose that admixture LD extends across a region of 20 Mb. In the region, we consider a causal variant with a strong admixture mapping signal and an AIM SNP that is about 10 Mb away from the causal variant but not in background LD with it (see also Figure 1). With association tests adjusting for global ancestry, the AIM may show association signals because the local ancestry at the AIM may be associated with the trait due to admixture LD between the AIM and the causal variant. This results in an association between the genotype of the AIM and the trait. Therefore, association tests adjusting for global ancestry may identify large regions (of several Mbs) harboring causal variants.
To correct for the confounding effect of admixture LD in local regions and therefore map causal variants into small regions with a few hundred Kbs, association tests that adjust for local ancestries have been developed [13, 14]. However, these methods can have relatively low power in detecting causal variants with admixture mapping signals.
Joint association tests combining admixture mapping tests and association tests with correction for local ancestry
To increase power, Pasaniuc et al. proposed a joint association test for binary traits, MIXSCORE (or MIX), which combines an admixture mapping test statistic (using admixture LD) with an association test statistic that is conditioned on local ancestry (i.e., correcting for local ancestry) [10]. Let Ω(R) denote an ancestry odds ratio, which is the relative increase in risk per extra allele from an ancestral population such as Europeans. MIXSCORE is based on an implied assumption that the ancestry odds ratio Ω(R) is a function of the SNP odds ratio R (see the section "MIX: mixed SNP and admixture association" in [10]). Based on this function, if R = 1 then Ω(R) = 1. Under this assumption, if a test SNP is not in background LD with a causal variant, then it is also not in admixture LD with the causal variant. However, this assumption may not always be true. Example 1 (see above) shows that we may find an AIM not very far (about 10 Mb) from a causal variant with a strong admixture mapping signal, such that the AIM and the causal variant are in admixture LD but not in background LD (see also Figure 1). Therefore, the AIM has an admixture mapping signal. Using MIXSCORE, the AIM may be called significant (i.e., having an association with the trait). Therefore, the joint association test MIXSCORE may be more suitable for identifying large chromosome regions harboring causal variants (usually several Mbs) than small chromosome regions (such as < a few hundred Kbs).
Another joint method is the two-stage approach [15, 16], which selects promising regions with admixture mapping signals in the first stage by admixture mapping and which then tests markers in the selected regions in the second stage by association tests that correct for global or local ancestries. A limitation of this joint method is that it has almost no power to detect causal variants without admixture mapping signals.
It is therefore imperative to develop effective association tests that incorporate information from the admixture mapping test into association tests that correct for local ancestry, so that the new association tests can improve power for mapping causal variants into small regions (with length < a few hundred Kbs) while controlling the FWER in GWAS analysis. To achieve this goal, in this study, we propose a novel application of the generalized sequential Bonferroni (GSB) procedure of Holm (1979) [17] to GWAS in admixed populations. We propose to calculate smoothed weights from the p-values of admixture mapping tests and to use these weights in the GSB procedure to adjust the p-values of association tests that correct for local ancestry. We applied the proposed methods to GWAS data on African Americans from the Atherosclerosis Risk in Communities (ARIC) Study.
METHODS
In this section, we first describe the concepts of the type I error rate and the FWER in the strong sense, together with some existing admixture mapping tests and association tests adjusting for ancestries. We then describe a smooth generalized sequential Bonferroni procedure for GWAS that incorporates information from admixture mapping tests into association tests that correct for local and global ancestries.
Type I error rate and FWER in the strong sense for association tests
Type I errors under different hypotheses
Admixture mapping tests often identify wide chromosomal regions of several Mbs harboring causal variants, with corresponding null hypothesis H01: the test SNP is not in admixture LD with the causal variants. If a test SNP is in admixture LD with causal variants, rejecting the null hypothesis (i.e., calling the test SNP significant) is considered a true positive finding. However, in the context of GWAS analysis, we aim to identify small regions of a few hundred Kbs harboring causal variants by testing the null hypothesis H02: the test SNP is not in background LD with the causal variants. Under the null hypothesis H02, if a test SNP is in admixture LD with causal variants but not in background LD with any causal variants, calling the test SNP significant (i.e., rejecting the null hypothesis H02) will be regarded as a type I error or a false positive finding. To evaluate this type of error, we describe the concept of the type I error rate and the FWER in the strong sense.
Type I error rate and FWER in the strong sense
To evaluate the type I error rate of an association test for GWAS, researchers often assume that there are no causal variants in the genome. In this study, we consider the type I error rate and the FWER in the strong sense (see also [18], p10), which are the type I error rate and the FWER for null SNPs, respectively, while some causal variants exist elsewhere in the genome. Under the null hypothesis H02: the test SNP is not in background LD with the causal variants, we define a SNP as a null SNP if it is not in background LD (traditional LD) with any causal variants in either of the ancestral populations, irrespective of whether it is in admixture LD with the causal variants in the admixed population.
Association tests for admixed populations
Below we describe association tests for admixed populations in generalized linear model (GLM) frameworks. The GLM can be applied to any traits that follow distributions from the exponential family [19], such as binary traits from case-control designs and continuous traits following normal distributions.
Notation
Let Yi denote the phenotypic value of individual i. For example, in a case-control design, Yi = 1 (0) denotes case (control) status. For quantitative traits, Yi has a continuous value. Let Gij denote the coded genotypic score (0, 1, 2) of individual i at the j-th SNP marker under the assumption of the additive model. Let Aij denote the local ancestry (i.e., proportion of alleles) of individual i at the j-th marker inherited from a given ancestral population. Let Qi denote the global ancestry proportion of individual i inherited from the given ancestral population (such as Europeans). The local deviation of ancestry of individual i at the j-th marker is defined as Dij = Aij − Qi. The ancestries Aij and Qi can be estimated by existing software such as SABER [20], HAPMIX [21] or LAMP [22, 23].
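The notation above can be illustrated with a minimal sketch. The matrix `A` below is a hypothetical toy example, and the helper name `local_deviation` is ours, not part of any ancestry-inference software. When no global ancestry is supplied, the sketch assumes that Qi is estimated as the average local ancestry across the available markers, in line with the description of global ancestry in the Introduction.

```python
import numpy as np

# Hypothetical local-ancestry matrix: rows are individuals, columns are markers.
# Each entry is the proportion of alleles (0, 0.5 or 1) inherited from the
# reference ancestral population at that marker.
A = np.array([
    [0.0, 0.5, 0.5, 1.0],
    [0.0, 0.0, 0.5, 0.5],
    [1.0, 1.0, 0.5, 0.5],
])

def local_deviation(A, Q=None):
    """Return (Q, D) with D[i, j] = A[i, j] - Q[i]."""
    A = np.asarray(A, dtype=float)
    if Q is None:
        # Assumption: estimate global ancestry as the mean local ancestry
        # over all markers of the individual.
        Q = A.mean(axis=1)
    Q = np.asarray(Q, dtype=float)
    D = A - Q[:, None]  # broadcast Q_i across the markers of individual i
    return Q, D

Q, D = local_deviation(A)
print("global ancestry Q:", Q)
print("local deviation D:\n", D)
```

With this estimate of Qi, the local deviations of each individual sum to zero across markers, which is one way to see that Dij captures only the marker-specific departure from the genome-wide average.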
Ancestry-trait admixture mapping test Tadmix
We describe a test Tadmix here for admixture mapping based on a GLM. We assume a link function (see also [15])
| $h(\mu_i) = \alpha_0 + \alpha_1 Q_i + \alpha_2 D_{ij}$ | (1) |
where μi is the expected value of Yi (i.e., μi = E(Yi)), α0 is the intercept, α1 is the coefficient for the global ancestry Qi, and α2 is the coefficient for local deviation of ancestry Dij at the j-th SNP. For traits following different distributions, we can use different link functions in model (1). For example, for binary traits from case-control designs, we can use a logit link function, i.e., h(μi) = logit Pr(Yi = 1) = log (μi/(1−μi)). For quantitative traits following normal distributions, we can use an identity link function, i.e., h(μi) = μi.
In the test Tadmix, a likelihood ratio test statistic (or another related statistic) can be used to test the null hypothesis of no association of the trait with the local deviation of ancestry (i.e. testing the null hypothesis α2 = 0 against the alternative hypothesis: α2 ≠ 0). This test Tadmix can also be used to test the null hypothesis H01: the test SNP is not in admixture LD with the causal variants. However, the Tadmix test can only map a causal variant with admixture mapping signal into a large admixture LD region with several Mbs.
Association test Tglobal correcting for global ancestry
An association test correcting for global ancestry (Tglobal) can be constructed based on a GLM with a link function (see also [19])
| $h(\mu_i) = \alpha_0 + \alpha_1 Q_i + \beta G_{ij}$ | (2) |
A likelihood ratio test statistic (or another statistic) based on model (2) can be used to test the null hypothesis of no association between the trait and the genotype (i.e., testing the coefficient β = 0).
The association test Tglobal is approximately equivalent to the following two methods: 1) EIGENSTRAT [11] that adjusts for the first principal component of genome-wide SNP scores and 2) Armitage trend test with correction for global ancestry (ATT) proposed by Pasaniuc et al. [10]. The ATT method adjusts the phenotypic values and genotypic scores for the effects of global ancestry separately and then constructs a statistic using the adjusted phenotypic values and adjusted genotypic scores. As stated earlier, Tglobal is more suitable for identifying large admixture LD regions with several Mbs that harbor causal variants, because it may not be able to control the confounding effect from the (local) admixture LD.
Association test Tlocal correcting for both global and local ancestries
We describe a test that corrects for both global and local ancestries (Tlocal), which is based on a GLM with a link function
| $h(\mu_i) = \alpha_0 + \alpha_1 Q_i + \alpha_2 D_{ij} + \beta G_{ij}$ | (3) |
The Tlocal test uses a likelihood ratio test statistic (or another related statistic) to test the null hypothesis of no association between the trait and the genotype (i.e., testing the coefficient β = 0). The model (3) is a variant of the logistic model proposed by Wang et al. [13] who stated that adjusting for local ancestry can control the confounding due to either global or local ancestries. The Tlocal test can also be used to test the null hypothesis H02: the test SNP is not in background LD with the causal variants and to map a causal variant into a small background LD region (such as with less than a few hundred Kbs). However, the Tlocal test may have relatively low power for detecting causal variants with admixture mapping signals. To increase power to identify small background LD regions harboring causal variants, below we propose a novel application of the generalized sequential Bonferroni (GSB) procedure of Holm (1979) to incorporate information from admixture mapping test Tadmix into the test Tlocal in GWAS for admixed populations.
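For a quantitative trait with an identity link, the three tests can be sketched as likelihood ratio tests between nested linear models. This is a minimal illustration under stated assumptions, not the authors' implementation: we assume that model (1) contains Qi and Dij, model (2) contains Qi and Gij, and model (3) contains Qi, Dij and Gij, and the function names (`rss`, `lrt`, `ancestry_tests`) and the toy data generator are hypothetical.

```python
import math
import numpy as np

def rss(y, X):
    """Residual sum of squares from a least-squares fit of y on X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return float(resid @ resid)

def lrt(y, X_null, X_full):
    """Gaussian likelihood ratio test for one extra column in X_full."""
    n = len(y)
    stat = max(n * math.log(rss(y, X_null) / rss(y, X_full)), 0.0)
    # Upper tail of a chi-square with 1 df: P(X > s) = erfc(sqrt(s / 2)).
    pval = math.erfc(math.sqrt(stat / 2.0))
    return stat, pval

def ancestry_tests(y, G, A, Q):
    """Return (statistic, p-value) for Tadmix, Tglobal and Tlocal at one SNP."""
    one = np.ones(len(y))
    D = A - Q  # local deviation of ancestry
    x_q = np.column_stack([one, Q])
    x_qd = np.column_stack([one, Q, D])
    return {
        "Tadmix": lrt(y, x_q, x_qd),                               # tests alpha2 = 0
        "Tglobal": lrt(y, x_q, np.column_stack([one, Q, G])),      # tests beta = 0
        "Tlocal": lrt(y, x_qd, np.column_stack([one, Q, D, G])),   # tests beta = 0
    }

# Toy data (hypothetical): the genotype frequency depends on local ancestry.
rng = np.random.default_rng(0)
n = 500
Q = rng.beta(3, 12, size=n)
A = rng.binomial(2, Q) / 2.0
G = rng.binomial(2, 0.2 + 0.5 * A).astype(float)
y = 1.0 * G + rng.normal(size=n)

results = ancestry_tests(y, G, A, Q)
for name, (stat, pval) in results.items():
    print(f"{name}: LRT = {stat:.2f}, p = {pval:.3g}")
```

For binary traits the same nesting applies with a logit link, in which case the residual sums of squares would be replaced by fitted log-likelihoods from a logistic regression routine.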
GSB procedures for GWAS for admixed populations
Suppose that there are m SNPs in GWAS such that m null hypotheses (H1, H2, …, Hm) are tested, where Hj is the null hypothesis of no association between the j-th SNP and the disease status (j = 1, 2, …, m). Let α be the nominal FWER level for the multiple testing of the m hypotheses. The GSB procedure for GWAS can be implemented by the following steps:
Step 1: Given a weight wj for the j-th marker, we adjust the corresponding p-value pj from the association test Tlocal at the j-th marker by the weight wj, i.e., we calculate a B-value as Bj = pj/wj for the j-th marker.
Step 2: Order the B-values as B(1) ≤ B(2) ≤ … ≤ B(m). Let w(1), w(2), …, w(m) and H(1), H(2), …, H(m) denote the corresponding weights and hypotheses of the ordered B-values.
Step 3: Starting from j = 1, given that H(1), H(2), …, H(j-1) have been tested and rejected, if $B_{(j)} \le \alpha / \sum_{k=j}^{m} w_{(k)}$, reject H(j); otherwise, accept H(j), H(j+1), …, H(m), and stop the GSB procedure.
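The steps above can be sketched in a few lines. This is a minimal illustration, assuming the weighted Holm rejection threshold α divided by the sum of the weights of the hypotheses not yet rejected; the function name `gsb_reject` and the example p-values and weights are hypothetical.

```python
import numpy as np

def gsb_reject(p, w, alpha=0.05):
    """Weighted sequential Bonferroni (Holm-type) procedure.

    p : p-values from the association test (one per marker)
    w : positive weights (one per marker)
    Returns a boolean array marking the rejected hypotheses.
    """
    p = np.asarray(p, dtype=float)
    w = np.asarray(w, dtype=float)
    B = p / w                                   # B-values
    order = np.argsort(B, kind="stable")        # indices of ordered B-values
    # tail[r] = sum of the weights of the ordered hypotheses r, r+1, ..., m-1
    tail = np.cumsum(w[order][::-1])[::-1]
    reject = np.zeros(p.size, dtype=bool)
    for rank, idx in enumerate(order):
        # Assumed threshold: alpha over the total weight of untested hypotheses.
        if B[idx] <= alpha / tail[rank]:
            reject[idx] = True
        else:
            break                               # accept this and all later ones
    return reject

p = [0.001, 0.02, 0.04, 0.5]
print(gsb_reject(p, [1, 1, 1, 1]))   # equal weights: ordinary Holm procedure
print(gsb_reject(p, [1, 3, 1, 1]))   # second marker up-weighted
```

With equal weights the procedure reduces to the ordinary Holm procedure. Because the threshold depends only on relative weights, rescaling all weights by a common constant does not change the decisions.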
Calculating weights by use of p-values from the admixture mapping test Tadmix
The original GSB procedure of Holm (1979) does not provide a method to calculate weights. Here we propose to calculate the weight wj = 1/qj for the j-th marker, where qj is the p-value from the admixture mapping test Tadmix. If the weights calculated from the Tadmix test are independent of the p-values calculated from the test Tlocal, then the GSB procedure controls the FWER. For quantitative traits following normal distributions, we can prove the asymptotic independence between the weights and the p-values from the association test that adjusts for local ancestry (see Appendix I). However, for other traits, such as binary traits from case-control designs, it is not easy to prove the independence theoretically. From Holm (1979), we can see that independence between the weights and the p-values of the tests is a sufficient but not necessary condition for the GSB procedure to control the FWER. Our simulation studies [24] showed that in some situations, even when the weights were weakly correlated with the p-values, the original GSB procedure of Holm (1979) still controlled the FWER well. We will show below by simulation studies that the GSB procedure for GWAS can control the FWER well under the null hypothesis H02.
Smooth-GSB procedure for GWAS using smoothed weights
One concern is that the GSB procedure may give too much weight to SNPs in regions with admixture mapping signals and therefore may markedly reduce the power to detect causal variants located outside these regions. To address this concern, we adopt a method of Roeder et al. (2007) [25] to smooth the weight wj at the j-th marker. We calculate the smoothed weight at the j-th marker as $\tilde{w}_j = (1 - \lambda) w_j + \lambda \bar{w}$, which is a linear combination of the original weight wj and the average weight of all markers, $\bar{w} = \frac{1}{m} \sum_{k=1}^{m} w_k$. We refer to the GSB procedure using the smoothed weights as the smooth-GSB procedure (or S-GSB). The parameter λ determines how much the original weight wj influences the smoothed weight. If the j-th marker has an admixture mapping signal, then wj provides useful information and a small λ value (a large value of 1 − λ) will result in increased power of the smooth-GSB procedure; otherwise, if the j-th marker does not have an admixture mapping signal, wj provides only noise, such that a small λ value will result in a loss of power. An open question is how to determine the optimal value of λ. We will discuss how to select the value of λ based on our simulation studies.
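As a small numerical illustration of the smoothing, the sketch below computes weights from admixture mapping p-values and shrinks them toward their average. It assumes that the smoothed weight places (1 − λ) on the original weight and λ on the average weight, which matches the statement that a small λ gives the original weight more influence; the function name `smoothed_weights` and the p-values are hypothetical.

```python
import numpy as np

def smoothed_weights(q, lam):
    """Weights w_j = 1 / q_j shrunk toward their mean.

    q   : p-values from the admixture mapping test (one per marker)
    lam : smoothing parameter in [0, 1]; lam = 0 keeps the original weights,
          lam = 1 gives every marker the same (average) weight.
    """
    q = np.asarray(q, dtype=float)
    w = 1.0 / q
    # Assumed form: (1 - lam) * original weight + lam * average weight.
    return (1.0 - lam) * w + lam * w.mean()

q = [0.001, 0.2, 0.5, 1.0]          # hypothetical admixture mapping p-values
for lam in (0.0, 0.7, 1.0):
    print(lam, smoothed_weights(q, lam))
```

The smoothing leaves the average weight unchanged, so it redistributes weight between markers with and without admixture mapping signals rather than changing the overall stringency.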
SIMULATION STUDIES
We conducted simulation studies to evaluate the type I error rate and/or FWER in the strong sense and the power of several existing methods, the GSB procedure, and the smooth-GSB procedure, under the null hypothesis H02: the test SNP was not in background LD with the causal variants. In our simulation studies, we considered binary traits from case-control designs and quantitative traits that follow normal distributions. Below we focus on describing the simulation studies on binary traits. The simulation results for quantitative traits had similar patterns to those for case-control designs and are described in the Supplementary material.
Evaluation of the type I error and the FWER in the strong sense
It is computationally intensive to evaluate the type I error or the FWER for whole-genome data by simulation studies. Therefore, we estimated the type I error rate and/or FWER by using the SNPs in the first chromosome. In addition, we only considered SNPs with no background LD in the ancestral populations, because it is also computationally intensive to evaluate the type I error rates or FWERs on a large number of simulated dense SNPs with background LD.
Simulating SNP data assuming no background LD for the first chromosome
We simulated SNP data sets on African Americans assuming that no background LD existed in the ancestral populations. Each data set consisted of 1,000 cases and 1,000 controls. For each individual, we simulated the first chromosome consisting of 1,030 SNPs selected from the HapMap II data (http://hapmap.ncbi.nlm.nih.gov/), which included 100 AIMs. The distance between adjacent SNPs was approximately ≥ 200kb (about 0.2cM). We assumed that there was no background LD (i.e. no dependence) among the 1,030 SNPs in the two ancestral populations. However, there might be correlations (admixture LD) among the SNPs in the simulated admixed individuals (African Americans) as a result of admixture.
We generated the genotype data in the first chromosome based on a simulation method of Price et al. [21]. For each admixed individual, we randomly sampled a value from a Beta distribution Beta (3, 12) to represent the global ancestry proportion (Qi) that individual i inherited from the given ancestral population. The Beta distribution had a mean of 0.2 and a standard deviation of 0.1. To generate each haploid chromosome in an admixed individual, we randomly generated the number of crossover points from a Poisson distribution with mean μ = l × g, where l is the genetic length (in Morgans) of the chromosome and g is the number of generations since admixture, which we set to 7 in our simulation studies. We then randomly generated an ancestry indicator according to the global ancestry proportion Qi to represent whether the haplotype between two crossover points is of European or West African ancestry. If the haplotype was of European (or West African) ancestry, then it was generated by sampling alleles at different SNPs independently using the allele frequencies in the CEU (or YRI) samples from the HapMap project. We used the allele frequencies in the CEU and YRI samples to approximate the frequencies in the two ancestral populations, Europeans and West Africans, respectively.
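The simulation scheme described above can be sketched as follows. This is a simplified illustration rather than the method of Price et al. as implemented by the authors: it assumes that crossover points are placed uniformly along the genetic map, that the Beta(3, 12) draw is the European ancestry proportion, and it uses hypothetical marker positions and allele frequencies. The function names are ours.

```python
import numpy as np

def simulate_haplotype(pos, freq_eur, freq_afr, q_eur, generations, rng):
    """Simulate one admixed haploid chromosome without background LD.

    pos      : genetic positions of the SNPs in Morgans (map starts at 0)
    freq_eur : allele frequencies in the European ancestral population
    freq_afr : allele frequencies in the West African ancestral population
    q_eur    : global European ancestry proportion of the individual
    Returns (ancestry, alleles), both 0/1 arrays; ancestry 1 means European.
    """
    pos = np.asarray(pos, dtype=float)
    length = pos.max()
    # Number of crossover points ~ Poisson(l * g).
    n_cross = rng.poisson(length * generations)
    # Assumption: crossover points are uniform along the chromosome.
    breaks = np.sort(rng.uniform(0.0, length, size=n_cross))
    # One ancestry indicator per segment between crossover points.
    seg_is_eur = rng.random(n_cross + 1) < q_eur
    seg_of_snp = np.searchsorted(breaks, pos, side="right")
    ancestry = seg_is_eur[seg_of_snp].astype(int)
    # Alleles are sampled independently given the segment ancestry.
    freq = np.where(ancestry == 1, freq_eur, freq_afr)
    alleles = (rng.random(pos.size) < freq).astype(int)
    return ancestry, alleles

def simulate_individual(pos, freq_eur, freq_afr, rng, generations=7):
    """Return (Q, local ancestry A, genotype G) for one admixed individual."""
    q_eur = rng.beta(3, 12)  # mean 0.2, standard deviation 0.1
    haps = [simulate_haplotype(pos, freq_eur, freq_afr, q_eur, generations, rng)
            for _ in range(2)]
    local_ancestry = (haps[0][0] + haps[1][0]) / 2.0
    genotype = haps[0][1] + haps[1][1]
    return q_eur, local_ancestry, genotype

rng = np.random.default_rng(2)
pos = np.linspace(0.0, 2.5, 50)            # hypothetical map of 50 SNPs
freq_eur = rng.uniform(0.1, 0.9, size=50)  # hypothetical frequencies
freq_afr = rng.uniform(0.1, 0.9, size=50)
Q_i, A_i, G_i = simulate_individual(pos, freq_eur, freq_afr, rng)
print(Q_i, A_i[:10], G_i[:10])
```

Because ancestry is constant within each segment between crossover points, nearby markers share ancestry and become correlated in the admixed sample even though alleles are sampled independently within each ancestral population.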
Simulating phenotypes influenced by one causal variant with an admixture signal
To evaluate the type I error rate and FWER in the strong sense, when simulating the phenotype for each individual, we chose one AIM (at rs10465723) from the 1,030 simulated SNPs in the first chromosome as a causal variant. This AIM SNP had a strong admixture mapping signal, with corresponding allele frequencies of 0.858 in the YRI data and 0.283 in the CEU data. The causal AIM SNP was not in background LD with any of the other simulated SNPs in the two ancestral populations, and therefore all simulated SNPs except the causal AIM SNP were null SNPs. Calling any of these null SNPs significant by a test was treated as a type I error.
For case-control designs, the case-control status (Yi =1 or 0) of individual i was simulated based on a logistic model
| $\operatorname{logit} \Pr(Y_i = 1) = \beta_0 + \beta G_i$ | (4) |
where β0 was the intercept, β was the log-odds ratio of the causal allele, and Gi was the genotype score of individual i at the causal SNP. Note that the right side of model (4) did not include Qi or Aij, because the effects of Qi and Aij on phenotypes were included in the effect of the genotype Gi. We set the odds ratio of the risk allele to 1.5, so that β = ln(1.5). We set Pr(Yi = 1) = 0.1 when Gi = 0. Model (4) is different from the models that Wang et al. (2011) and Qin et al. (2011) used for evaluating the type I error rate in their simulation studies. Their models included a covariate of local ancestry.
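The parameter settings above translate into case probabilities as in the following sketch. It assumes a logistic model with an intercept fixed by the baseline risk of 0.1 at Gi = 0 and a slope of ln(1.5) per risk allele; the function name `case_probability` is hypothetical, and the sketch draws phenotypes for a random sample rather than reproducing the sampling of 1,000 cases and 1,000 controls.

```python
import numpy as np

def case_probability(G, odds_ratio=1.5, baseline=0.1):
    """Pr(Y = 1 | G) under a logistic model with a per-allele odds ratio."""
    beta0 = np.log(baseline / (1.0 - baseline))  # logit of the risk at G = 0
    beta = np.log(odds_ratio)                    # log-odds ratio per risk allele
    eta = beta0 + beta * np.asarray(G, dtype=float)
    return 1.0 / (1.0 + np.exp(-eta))

probs = case_probability([0, 1, 2])
print(probs)

# Draw case-control status for hypothetical genotypes.
rng = np.random.default_rng(4)
G = rng.binomial(2, 0.5, size=10)
Y = (rng.random(G.size) < case_probability(G)).astype(int)
print(G, Y)
```

Under these settings the odds are 1/9 at Gi = 0 and are multiplied by 1.5 for each additional risk allele, so the risk rises from 0.1 to 1/7 and then to 0.2.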
Evaluation of type I error rate in the strong sense
We simulated 10,000 replicated data sets and estimated empirical type I error rates for the (1,030-1) null SNPs in the first chromosome under the null hypothesis H02 for the following five methods: 1) the test Tglobal associated with model (2); 2) the test Tlocal associated with model (3); 3) the popular EIGENSTRAT method that corrects for principal components of genome-wide genotypic scores [11]; 4) the MIXSCORE method that combines an admixture mapping test statistic with an association test statistic that is conditioned on local ancestry [10]; and 5) the ATT method with correction for global ancestry [10]. Both Tglobal and ATT are approximately equivalent to EIGENSTRAT with correction for the first principal component [11]. For estimating the type I error rate, we set the nominal significance level for single tests at 0.05. To save computing time, we assumed that global ancestry Qi and local ancestry Aij were known. We did not estimate the type I error rates for the GSB procedures because they are multiple testing procedures. Instead we estimated the FWERs for the GSB procedures (see below).
In Table 1 we list the estimated type I error rates in the strong sense at four selected null AIM SNPs with different distances from the causal AIM SNP. From Table 1 we can observe that under the null hypothesis H02, Tlocal controlled the type I error rate for all null SNPs. However, Tglobal, MIXSCORE, ATT, and EIGENSTRAT had inflated type I error rates (much larger than the nominal level of 0.05) for null AIM SNPs that were not in background LD with but close to the causal AIM SNP. For example, at the null AIM SNP that was 9 Mb away from the causal AIM SNP, the estimated type I error rates for Tglobal, MIXSCORE, and ATT were 0.1448, 0.2357, and 0.1447, respectively. In addition, the estimated type I error rates for ES1, ES2, and ES10 were 0.1398, 0.1171, and 0.0733, respectively, where ESk (k = 1, 2, 10) denotes the EIGENSTRAT method with correction for the top k principal components of genome-wide genotypic scores. When k increased, the inflated type I error rates of EIGENSTRAT decreased. These methods (Tglobal, MIXSCORE, ATT, and EIGENSTRAT) had inflated type I error rates at the null AIM SNPs because the null AIM SNPs were not far from the causal AIM SNP and therefore were in admixture LD (but not in background LD) with the causal AIM SNP. Since the causal AIM SNP had a strong admixture mapping signal, this could lead the null AIM SNPs to have admixture mapping signals. The results also showed that as the distance from the causal SNP increased, the type I error rates decreased. Therefore, when a causal variant has a strong admixture mapping signal, the four methods (Tglobal, MIXSCORE, ATT, and EIGENSTRAT) are more suitable for identifying a wide chromosome region (of up to about 20 Mbs) that harbors the causal variant.
On the other hand, our simulation results showed that Tglobal, MIXSCORE, ATT, and EIGENSTRAT all controlled the type I error rate well at all null non-AIM SNPs and at the null AIM SNPs that were 24 Mb or more away from the causal AIM SNP (data not shown).
Table 1.
Empirical type I error rates of methods for case-control designs at four selected null AIM SNPs based on 10,000 replicated data sets (α = 0.05)^a.
| Dist (Mb)^b | Tlocal | Tglobal | MIXSCORE | ATT | ES1^c | ES2 | ES10 |
|---|---|---|---|---|---|---|---|
| 2.5 | .0508 | .2622 | .4538 | .2613 | .2517 | .2162 | .1378 |
| 5 | .0474 | .2221 | .3865 | .2214 | .2102 | .1689 | .0968 |
| 9 | .0504 | .1448 | .2375 | .1447 | .1398 | .1171 | .0733 |
| 24 | .0483 | .0590 | .0525 | .0509 | .0502 | .0503 | .0491 |
^a Each data set included 1,000 cases and 1,000 controls; the causal variant was an AIM SNP (with allele frequencies fE = 0.283 in CEU and fA = 0.858 in YRI); the odds ratio for the causal allele was 1.5.
^b The distances (Mb) between the null AIM SNPs and the causal AIM SNP.
^c ES1, ES2 and ES10 are the EIGENSTRAT tests with correction for the first PC, the top two PCs and the top 10 PCs. The PCs are computed using 13,056 simulated SNPs across 22 chromosomes (data not shown).
Evaluation of FWER in the strong sense
We estimated FWERs in the strong sense for the GSB procedures. Because the methods Tglobal, MIXSCORE, ATT, and EIGENSTRAT showed inflated type I error rates, it follows that they will also have inflated FWERs. As an example to illustrate the degree of inflation of the FWER of these methods, we report the FWER of Tglobal. In addition, we estimated the FWER of Tlocal. We set the nominal FWER equal to 0.05. This criterion is much stricter than setting the nominal type I error rate to 0.05 for single tests as described above. We simulated 100,000 replicated data sets. The estimated FWERs are listed in Table 2. We can see that Tlocal (with the Bonferroni correction), the GSB procedure, and the smooth-GSB procedure controlled the FWERs well. As expected, Tglobal failed to control the FWER, because it cannot control the type I error rate at some null AIM SNPs, as shown in Table 1.
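The empirical FWER is the proportion of replicated data sets in which at least one null SNP is called significant. The sketch below shows this bookkeeping only; it uses independent uniform p-values with a Bonferroni threshold as a stand-in for null test results, which is an assumption for illustration and not the simulation design described above. The function name `empirical_fwer` is hypothetical.

```python
import numpy as np

def empirical_fwer(reject):
    """Fraction of replicates with at least one rejected null hypothesis.

    reject : boolean array of shape (n_replicates, n_null_snps)
    """
    reject = np.asarray(reject, dtype=bool)
    return float(reject.any(axis=1).mean())

# Stand-in null data: independent uniform p-values for 1,029 null SNPs.
rng = np.random.default_rng(1)
n_rep, m, alpha = 2000, 1029, 0.05
p = rng.uniform(size=(n_rep, m))
bonferroni_reject = p <= alpha / m
fwer_hat = empirical_fwer(bonferroni_reject)
print(round(fwer_hat, 4))
```

The same function applies to the rejection indicators produced by any multiple testing procedure, as long as only null SNPs are included in the columns.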
Table 2.
Empirical FWER of methods for case-control designs for the (1,030-1) null SNPs in chromosome 1 based on 100,000 replicated data sets (nominal FWER = 0.05).
| Tlocal | Tglobal | GSB | S-GSB^a (λ=0.6) | S-GSB^a (λ=0.7) | S-GSB^a (λ=0.8) |
|---|---|---|---|---|---|
| .0497 | .0819 | .0489 | .0497 | .0498 | .0504 |
^a The smooth-GSB procedure with different λ values.
Evaluation of power of the GSB procedures
We evaluated the power of the Tlocal test, the GSB procedure and the smooth-GSB procedure in the analysis of simulated dense SNP data that have LD patterns similar to those of real GWAS data. We did not estimate power for Tglobal, ATT, and the EIGENSTRAT method ESk because they cannot control the FWERs well. We simulated replicated GWAS data sets on African Americans with background LD among SNPs in the ancestral populations only for the first chromosome.
Simulating SNP data with background LD for the first chromosome
We simulated SNP data in the first chromosome with background LD for African Americans in two steps:
Step 1: we used the HapGen software [26, 27] to generate a large pool of 20,000 unrelated haploid chromosomes for each of the two ancestral populations (West Africans and Europeans). We used the haplotype information on the CEU and YRI populations from the phased HapMap2 data, which can be downloaded from the IMPUTE website (http://mathgen.stats.ox.ac.uk/impute/impute_v2.html#reference), for simulating the haploid chromosomes for West Africans and Europeans, respectively. Therefore, these simulated haploid chromosome pools for the two ancestral populations had similar background LD patterns to those in the data sets of CEU and YRI haplotypes. We then thinned the simulated haploid chromosomes and kept only the SNPs that appeared in Affymetrix 6.0. Rare variants (with minor allele frequency <5%) in each ancestry population were excluded. This resulted in 58,441 SNPs left on chromosome 1. The genetic distance among SNPs was calculated based on the combined genetic map downloaded from the IMPUTE website.
Step 2: We adapted the method of Price et al. [21] to generate two haploid chromosomes for each African American (admixed individual). Each haploid chromosome was generated as a mixture of two ancestral haploid chromosomes sampled separately, without replacement, from the two pools of ancestral haploid chromosomes.
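The mosaic construction in Step 2 can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name, the switch probability 1 − exp(−g·d) for an interval of d Morgans, and the values used for `afr_prop` and `generations` are assumptions introduced here, and the two input haplotypes stand in for draws from the simulated ancestral pools.

```python
import math
import random

def simulate_admixed_haplotype(hap_afr, hap_eur, pos_morgans, afr_prop, generations, rng):
    """Build one admixed haploid chromosome as a mosaic of two ancestral haplotypes.

    hap_afr, hap_eur : allele lists (0/1) for one West African and one European haplotype
    pos_morgans      : genetic position of each SNP in Morgans (increasing)
    afr_prop         : probability that a segment has West African ancestry (assumed value)
    generations      : number of generations since admixture (assumed value)
    Returns (alleles, ancestry); ancestry[j] is 1 for West African and 0 for European.
    """
    alleles, ancestry = [], []
    current = 1 if rng.random() < afr_prop else 0
    for j, pos in enumerate(pos_morgans):
        if j > 0:
            d = pos - pos_morgans[j - 1]
            # With probability 1 - exp(-generations * d), redraw the segment ancestry.
            if rng.random() < 1.0 - math.exp(-generations * d):
                current = 1 if rng.random() < afr_prop else 0
        ancestry.append(current)
        alleles.append(hap_afr[j] if current == 1 else hap_eur[j])
    return alleles, ancestry

# Toy usage: five SNPs, hypothetical map positions, illustrative parameters.
rng = random.Random(2014)
pos = [0.00, 0.02, 0.05, 0.30, 0.31]
alleles, ancestry = simulate_admixed_haplotype([1] * 5, [0] * 5, pos, 0.8, 6, rng)
```

Because the toy West African haplotype carries only 1s and the European haplotype only 0s, the returned alleles coincide with the local ancestry track, which makes the mosaic structure easy to inspect.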
Evaluation of power when phenotypes were influenced by one causal variant with an admixture mapping signal
We estimated the power of the Tlocal test, the GSB procedure, and the smooth-GSB procedure by analyzing the simulated data sets under the assumption that the phenotypes were influenced by one causal variant with an admixture mapping signal.
We simulated 1,000 replicated data sets; each data set was composed of 1,000 cases and 1,000 controls. To simulate the phenotypes, we chose the AIM SNP rs2806404 in the first chromosome from the Affymetrix 6.0 SNP set as the causal SNP (we did not use the previously selected causal SNP rs10465723, because it was not in the Affymetrix 6.0 SNP set). The risk allele frequencies were 0.292 in CEU and 0.825 in YRI.
We simulated phenotypes by model (4) with a set of β values corresponding to odds ratios at the causal SNP ranging from 1.3 to 1.6 (Table 3). We considered two scenarios. In scenario 1, we assumed that the risk allele at the causal SNP had the same odds ratio in the two ancestral populations. In scenario 2, we assumed that the causal SNP was an African-specific causal variant: we set the odds ratio of the risk allele inherited from Europeans to ORE = 1 and the odds ratio of the risk allele inherited from West Africans to ORA = 1.5 (see the last row in Table 3). In this scenario, we set the genotypic score Gi equal to the number of risk alleles inherited from West Africans and the coefficient β of Gi equal to ln(ORA). Torgerson et al. [28] and Lettre et al. [29] reported African-specific causal variants that are associated with disease in African Americans but not in European Americans.
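The genotypic score used in scenario 2 can be illustrated with a short sketch. The function name and the encoding of alleles and ancestries are hypothetical choices made here for illustration; only the rule (count risk alleles inherited from West Africans, with β = ln(ORA)) comes from the text.

```python
import math

def ancestry_specific_score(alleles, ancestries, risk_allele=1):
    """Count risk alleles carried on West African-ancestry chromosomes.

    alleles    : the two alleles of one individual at the causal SNP
    ancestries : local ancestry of each allele, "A" (West African) or "E" (European)
    """
    return sum(1 for a, anc in zip(alleles, ancestries)
               if a == risk_allele and anc == "A")

beta = math.log(1.5)  # beta = ln(OR_A) with OR_A = 1.5, OR_E = 1

# An individual with two risk alleles, one on each ancestry background:
# only the copy inherited from West Africans contributes to the score.
g = ancestry_specific_score((1, 1), ("A", "E"))
odds_multiplier = math.exp(beta * g)  # multiplies the baseline odds of disease
```

Under scenario 1, by contrast, the score would simply be the total number of risk alleles regardless of ancestry.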
Table 3.
Estimated power of methods for case-control designs based on 1,000 replicated data sets for the first chromosome with one causal SNP that had an admixture mapping signal^a.
| ORE^b | ORA^b | Tadmix^c | Tlocal | GSB | S-GSB^d (λ=0.6) | S-GSB^d (λ=0.7) | S-GSB^d (λ=0.8) |
|---|---|---|---|---|---|---|---|
| 1.3 | 1.3 | .005 | .013 | .017 | .012 | .013 | .015 |
| 1.4 | 1.4 | .015 | .102 | .149 | .134 | .127 | .122 |
| 1.5 | 1.5 | .034 | .314 | .429 | .391 | .379 | .376 |
| 1.6 | 1.6 | .059 | .588 | .732 | .696 | .687 | .676 |
| 1 | 1.5 | .312 | .051 | .138 | .116 | .112 | .105 |

^a Each data set consisted of 1,000 cases and 1,000 controls. The significance threshold for Tlocal at individual SNPs was 5×10⁻⁸ and the nominal FWER for GSB and S-GSB was 5×10⁻⁸×m, where m = 58,441 is the total number of SNPs.
^b ORE (ORA) denotes the odds ratio in Europeans (West Africans) at the causal SNP. The causal SNP rs2806404 had an admixture mapping signal, with risk allele frequencies fE = 0.292 in CEU and fA = 0.825 in YRI.
^c The power of Tadmix was evaluated at a significance level of 10⁻⁵, indicating the strength of the admixture mapping signal.
^d The smooth-GSB procedure with different λ values.
To mimic real GWAS data analysis, we set the significance threshold for the test Tlocal at individual SNPs to 5×10⁻⁸, which is a very strict threshold. Correspondingly, we set the nominal FWER to 5×10⁻⁸ × m for the GSB and S-GSB procedures, where m = 58,441 is the total number of SNPs in the first chromosome.
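As a quick check of the arithmetic, the nominal FWER implied by these settings can be computed directly. The variable names below are chosen here for illustration.

```python
m = 58_441            # number of SNPs on chromosome 1 after filtering
per_snp_alpha = 5e-8  # significance threshold for Tlocal at each SNP

# Nominal FWER used for the GSB and S-GSB procedures on this chromosome.
nominal_fwer = per_snp_alpha * m
print(f"{nominal_fwer:.5f}")  # prints 0.00292
```

Matching the FWER to the single-SNP threshold in this way puts the weighted procedures and Tlocal on the same overall error budget, so their power can be compared directly.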
To illustrate that there is an admixture mapping signal at the causal SNP, we also estimated the power of the admixture mapping test Tadmix. Following Pasaniuc et al. [10], we used a significance threshold of 10⁻⁵ for Tadmix, which differs from the threshold of 5×10⁻⁸ for Tlocal. The less strict threshold for Tadmix is motivated by the smaller number of independent admixture mapping tests across the genome due to admixture LD.
Estimating individuals' ancestries in simulated data sets with background LD among SNPs usually takes a very long time with existing software such as LAMP [23], HAPMIX [21], or SABER [20]. To save computing time, we assumed that the true global and local ancestries were known.
The estimated power is shown in Table 3. Both the GSB procedure and the smooth-GSB procedure had higher power than the test Tlocal. For example, when the odds ratio was 1.6 in both ancestral populations, the power of Tlocal, the GSB procedure, and the smooth-GSB procedure (with λ = 0.7) was 0.588, 0.732, and 0.687, respectively. The smooth-GSB procedure was robust to the choice of λ when 0.6 ≤ λ ≤ 0.8: when λ changed from 0.6 to 0.8, the power of the smooth-GSB procedure always changed by less than 0.03 in absolute terms (see also the section Influence of the parameter λ on the smooth-GSB procedure below). The power of the admixture mapping test Tadmix, evaluated at the threshold 10⁻⁵, ranged from 0.005 to 0.312 and was therefore always much greater than the threshold itself, which indicates that there was an admixture mapping signal at the causal SNP.
From the last row in Table 3, we can observe that at the African-specific causal variant, where the risk allele of European ancestry had an odds ratio of ORE = 1 and the risk allele of West African ancestry had an odds ratio of ORA = 1.5, the power of the smooth-GSB procedure (with λ = 0.7) was 0.112, much higher than the power of Tlocal (0.051).
Evaluation of power when phenotypes were influenced by two causal variants (one with and the other one without an admixture mapping signal)
As mentioned earlier, one concern is that these GSB procedures may give too much weight to SNPs with admixture mapping signals and therefore markedly reduce the power to detect causal variants without showing admixture mapping signals. To address this concern, we evaluated the power of Tlocal, Tadmix, and the GSB and smooth-GSB procedures based on analyzing simulated case-control data sets under the assumption that the phenotypes were influenced by two causal variants (one with and one without an admixture mapping signal).
Simulating phenotypes influenced by two causal variants
To simulate the phenotypes, we chose two specific SNPs (about 150 Mbs apart) from the first chromosome as causal variants. The first causal SNP was SNP rs2806404 with allele frequencies fE = 0.292 in CEU, and fA = 0.825 in YRI; the second causal SNP was SNP rs12748791 with allele frequency fE = fA = 0.133. There was an admixture mapping signal at the first causal SNP but no admixture mapping signal at the second causal SNP.
We simulated 1,000 replicated sets, with each data set composed of 1,000 cases and 1,000 controls; the phenotypic value Yi of individual i was simulated based on a logistic model,
| (5) |
where Gi1 (Gi2) was the genotype score at the first (second) causal SNP.
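The following sketch shows one way such case-control data could be simulated. It assumes, for illustration only, that the logistic model has the form logit P(Yi = 1) = β0 + β1·Gi1 + β2·Gi2 with βk = ln(ORk); the intercept `beta0`, the function name, and the use of a single allele frequency per SNP under Hardy-Weinberg proportions (which ignores local ancestry) are simplifying assumptions and not details taken from the paper.

```python
import math
import random

def simulate_case_control(n_cases, n_controls, f1, f2, or1, or2, beta0, rng):
    """Draw individuals until the case and control quotas are both filled."""
    b1, b2 = math.log(or1), math.log(or2)
    cases, controls = [], []
    while len(cases) < n_cases or len(controls) < n_controls:
        # Genotype scores: number of risk alleles (0, 1, or 2) at each causal SNP.
        g1 = (rng.random() < f1) + (rng.random() < f1)
        g2 = (rng.random() < f2) + (rng.random() < f2)
        # Disease probability under the assumed two-SNP logistic model.
        p = 1.0 / (1.0 + math.exp(-(beta0 + b1 * g1 + b2 * g2)))
        if rng.random() < p:
            if len(cases) < n_cases:
                cases.append((g1, g2))
        elif len(controls) < n_controls:
            controls.append((g1, g2))
    return cases, controls

# Small illustrative run; f1 = 0.7 and beta0 = -2.0 are hypothetical values.
cases, controls = simulate_case_control(50, 50, 0.7, 0.133, 1.5, 1.5, -2.0, random.Random(1))
```

Sampling continues until both quotas are met, which mirrors a retrospective case-control design with fixed numbers of cases and controls.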
Estimated power when phenotypes were influenced by two causal variants
We estimated the power of Tlocal, Tadmix, and the GSB and smooth-GSB procedures by analyzing the 1,000 simulated case-control data sets described in the previous section. We used the significance threshold of 5×10⁻⁸ for Tlocal at individual SNPs and the nominal FWER of 5×10⁻⁸×m for the GSB and smooth-GSB procedures, where m = 58,441 is the total number of SNPs in the first chromosome. We used the significance threshold of 10⁻⁵ for Tadmix.
The empirical power at the two causal SNPs is shown in Table 4. At the first causal SNP, Tadmix always had power much higher than the threshold 10⁻⁵, indicating an admixture mapping signal at this causal variant; at the second causal SNP, Tadmix had power very close to zero, indicating no admixture mapping signal. Compared with the association test Tlocal, the GSB procedure had much higher power at the first causal variant (with an admixture mapping signal) but much lower power at the second causal variant (without an admixture mapping signal). For example, when the odds ratio was 1.6, the GSB procedure had power of 0.272 at the second causal variant, much lower than the power of 0.414 for the test Tlocal. Therefore the GSB procedure is not appropriate for GWAS analysis in admixed populations.
In contrast, compared with the association test Tlocal, our smooth-GSB procedure not only had much higher power at the first causal SNP with an admixture mapping signal, but also had comparable (slightly lower) power at the second causal SNP without an admixture mapping signal. For example, when the odds ratio was 1.5, the test Tlocal and our smooth-GSB procedure (λ = 0.7) had power of 0.264 and 0.351 at the first causal SNP, respectively, and power of 0.172 and 0.167 at the second causal SNP, respectively.
Table 4.
Estimated power of methods for case-control designs based on 1,000 replicated GWAS data sets for the first chromosome with two causal SNPs^a.
| OR1,E^b | OR1,A^b | OR2^b | SNP1^c: Tadmix^d | SNP1: Tlocal | SNP1: GSB | SNP1: S-GSB^e (λ=.6) | SNP1: S-GSB (λ=.7) | SNP1: S-GSB (λ=.8) | SNP2^c: Tadmix^d | SNP2: Tlocal | SNP2: GSB | SNP2: S-GSB^e (λ=.6) | SNP2: S-GSB (λ=.7) | SNP2: S-GSB (λ=.8) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1.3 | 1.3 | 1.3 | .001 | .016 | .020 | .018 | .018 | .018 | 0 | .005 | .003 | .004 | .005 | .005 |
| 1.4 | 1.4 | 1.4 | .012 | .092 | .150 | .135 | .129 | .120 | 0 | .043 | .029 | .037 | .039 | .040 |
| 1.5 | 1.5 | 1.5 | .028 | .264 | .392 | .363 | .351 | .332 | 0 | .172 | .118 | .162 | .167 | .170 |
| 1.6 | 1.6 | 1.6 | .071 | .524 | .687 | .646 | .634 | .617 | 0 | .414 | .272 | .389 | .397 | .404 |
| 1 | 1.5 | 1.5 | .309 | .053 | .141 | .113 | .102 | .096 | 0 | .167 | .052 | .146 | .148 | .155 |

^a Each data set consisted of 1,000 cases and 1,000 controls. The significance threshold for Tlocal at individual SNPs was 5×10⁻⁸ and the nominal FWER for GSB and S-GSB was 5×10⁻⁸×m, where m = 58,441 is the total number of SNPs in the first chromosome.
^b OR1,E (OR1,A) denotes the odds ratio in Europeans (West Africans) at the first causal SNP; OR2 denotes the odds ratio at the second causal SNP (assuming that the risk allele had the same odds ratio in the two ancestral populations).
^c The first causal SNP had an admixture mapping signal, with allele frequencies fE = 0.292 in CEU and fA = 0.825 in YRI; the second causal SNP had no admixture mapping signal, with allele frequency fE = fA = 0.133 in the two ancestral populations.
^d The power of Tadmix was evaluated at a significance level of 10⁻⁵, indicating the strength of the admixture mapping signal.
^e The smooth-GSB procedure with different λ values.
Influence of the parameter λ on the smooth-GSB procedure
We also evaluated the influence of the parameter λ on the power of the smooth-GSB procedure. We considered nine values of λ (0.1, 0.2, …, 0.9) and calculated the corresponding power of the smooth-GSB procedure. We found that when 0.6 ≤ λ ≤ 0.8, the smooth-GSB procedure had power comparable with that of Tlocal in detecting the causal variant without an admixture mapping signal (the second causal SNP), and had substantially higher power than Tlocal at the causal variant with an admixture mapping signal (see Table 4). In addition, relative to the power of the smooth-GSB procedure with λ = 0.7, decreasing λ to 0.6 or increasing it to 0.8 changed the power at the first causal variant (with an admixture mapping signal) by less than 0.03 in absolute value, and changed the power at the second causal variant (without an admixture mapping signal) by no more than about 0.02. In other words, when the value of λ varied around 0.7, the power of the smooth-GSB procedure changed only slightly. Therefore, we recommend using a value of λ around 0.7 in the smooth-GSB procedure.
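The robustness statement for 0.6 ≤ λ ≤ 0.8 can be checked against the S-GSB columns of Table 4. The short script below copies those power values and reports the largest absolute change in power when λ moves from 0.7 to 0.6 or 0.8; the helper name `max_shift` is introduced here for illustration.

```python
# S-GSB power from Table 4 as (lambda=0.6, lambda=0.7, lambda=0.8), one tuple per row.
snp1 = [(.018, .018, .018), (.135, .129, .120), (.363, .351, .332),
        (.646, .634, .617), (.113, .102, .096)]
snp2 = [(.004, .005, .005), (.037, .039, .040), (.162, .167, .170),
        (.389, .397, .404), (.146, .148, .155)]

def max_shift(rows):
    """Largest absolute change in power when lambda moves from 0.7 to 0.6 or 0.8."""
    return max(max(abs(lo - mid), abs(hi - mid)) for lo, mid, hi in rows)

print(round(max_shift(snp1), 3), round(max_shift(snp2), 3))  # prints 0.019 0.008
```

Both maxima fall below the bounds quoted in the text (0.03 at the first causal SNP and about 0.02 at the second).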
REAL DATA ANALYSIS
We applied the smooth-GSB procedure and the Tlocal test to GWAS data sets on African Americans drawn from the Atherosclerosis Risk in Communities (ARIC) Study [30]. We did not apply the other existing methods described above because they may not control the FWER. We performed quality control (QC) filtering and excluded A/T and G/C SNPs to avoid strand complementarity issues, as described in [10]; SNPs with minor allele frequencies less than 0.01 were also removed. After the QC filtering, 3,075 individuals and 584,535 SNPs remained. We analyzed three phenotypes: type 2 diabetes (T2D), LDL-cholesterol (LDL), and HDL-cholesterol (HDL). The data set for T2D consisted of 531 cases and 1,887 controls. The data sets for the quantitative phenotypes LDL and HDL consisted of 2,897 and 2,924 individuals, respectively.
When the original GSB procedure of Holm is applied to testing association at genome-wide markers, it does not provide a new p-value for each marker. In our previous contribution [24], we proposed an adjusted p-value Pgenome for each marker for the GSB procedure, which is compared sequentially with the nominal FWER level α (see also Westfall and Young [18], pages 64–65). For comparison with the Tlocal test, we calculated a new p-value at each SNP for the smooth-GSB procedure, equal to the adjusted p-value Pgenome divided by the number of test SNPs (m). This new p-value at each SNP is compared sequentially with the Bonferroni-corrected threshold (α/m) for single SNPs.
Table 5 lists the SNPs that had p-values less than 10⁻⁶ in the GWAS using either the test Tlocal or the smooth-GSB procedure. We also list the p-values from the Tadmix test at these SNPs to show whether there were admixture signals. Because of the small sample size for the T2D phenotype, no SNPs had p-values less than 10⁻⁶ for T2D. With the genome-wide FWER set to 0.05, the corresponding Bonferroni-corrected threshold was 8.55×10⁻⁸. The Tlocal test identified only one significant SNP (rs247617), with a p-value of 6.02×10⁻²² for the phenotype HDL; this SNP was also identified by our S-GSB procedure, with a p-value of 8.52×10⁻²² (when λ = 0.7). The SNP rs247617 was also reported as significantly associated with HDL by Lettre et al. [29] in their meta-analysis on African Americans (see their Table 3). Table 5 also shows that when the value of λ in our S-GSB procedure changed from 0.6 to 0.8, the corresponding p-values were very close to each other.
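The relationship between the adjusted p-value, the rescaled per-SNP p-value, and the Bonferroni-corrected threshold can be written out in a few lines. This is a sketch of the arithmetic only; the function names are hypothetical, and the sequential ordering of the comparisons in the actual procedure is not modeled here.

```python
m = 584_535   # SNPs remaining after QC
fwer = 0.05   # genome-wide FWER

# Bonferroni-corrected threshold for single SNPs.
bonferroni_threshold = fwer / m
print(f"{bonferroni_threshold:.2e}")  # prints 8.55e-08

def rescaled_p(p_genome, m):
    """Per-SNP p-value: the adjusted p-value P_genome divided by the number of SNPs."""
    return p_genome / m

def is_significant(p_genome, m, fwer):
    # Comparing P_genome with fwer is equivalent to comparing P_genome/m with fwer/m.
    return rescaled_p(p_genome, m) <= fwer / m
```

Dividing by m puts the S-GSB p-values on the same scale as the single-SNP p-values from Tlocal, which is what allows the two to be listed side by side.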
Table 5.
SNPs with P-values less than 10⁻⁶ in GWAS for the phenotypes LDL or HDL.
| Phenotype | Chrom^1 | SNP | CEU freq^2 | YRI freq^3 | P-value (Tlocal) | P-value (Tadmix) | P-value of S-GSB^4 (λ=.6) | P-value of S-GSB^4 (λ=.7) | P-value of S-GSB^4 (λ=.8) |
|---|---|---|---|---|---|---|---|---|---|
| LDL | 2 | rs1485059 | 0.088 | 0.167 | 6.15×10⁻⁷ | 0.97 | 9.75×10⁻⁷ | 8.51×10⁻⁷ | 7.54×10⁻⁷ |
| LDL | 2 | rs1485055 | 0.093 | 0.197 | 5.93×10⁻⁷ | 0.97 | 9.40×10⁻⁷ | 8.20×10⁻⁷ | 7.27×10⁻⁷ |
| LDL | 19 | rs1160985 | 0.566 | 0.33 | 2.63×10⁻⁷ | 0.028 | 1.58×10⁻⁷ | 1.76×10⁻⁷ | 1.97×10⁻⁷ |
| HDL | 3 | rs4462984 | 0 | 0.33 | 6.20×10⁻⁷ | 0.169 | 9.66×10⁻⁷ | 8.48×10⁻⁷ | 7.55×10⁻⁷ |
| HDL | 7 | rs6950206 | 0.04 | 0.686 | 3.85×10⁻⁷ | 0.073 | 5.53×10⁻⁷ | 4.98×10⁻⁷ | 4.54×10⁻⁷ |
| HDL | 8 | rs2916700 | 0.531 | 0.187 | 4.38×10⁻⁷ | 0.439 | 7.11×10⁻⁷ | 6.15×10⁻⁷ | 5.42×10⁻⁷ |
| HDL | 16 | rs247617 | 0.341 | 0.241 | 6.02×10⁻²² | 0.809 | 9.88×10⁻²² | 8.52×10⁻²² | 7.48×10⁻²² |
| HDL | 16 | rs6499863 | 0.173 | 0.299 | 1.57×10⁻⁷ | 0.809 | 2.58×10⁻⁷ | 2.22×10⁻⁷ | 1.95×10⁻⁷ |

^1 Chromosome number.
^2, ^3 The allele frequencies in the HapMap CEU and YRI populations, respectively.
^4 P-value calculated as the adjusted p-value divided by the number of tests (see [24]).
In addition, Table 5 shows that at a SNP with an admixture signal (p-value from Tadmix < 0.05), the S-GSB procedure had a smaller p-value than Tlocal. For example, at the SNP rs1160985, the p-value from Tlocal for LDL was 2.63×10⁻⁷. Since this SNP showed a weak admixture signal, with a p-value of 0.028 (< 0.05) from the test Tadmix, the p-value from our S-GSB procedure (λ = 0.7) was 1.76×10⁻⁷, which was closer to the strict Bonferroni-corrected threshold (8.55×10⁻⁸) than the p-value from the test Tlocal. The SNP rs1160985 was also reported significantly associated with HDL by Lettre et al. [29] (see their Table 3). On the other hand, at a SNP with no admixture signal, the p-value from the smooth-GSB procedure was slightly larger than that from the test Tlocal. For example, for the phenotype HDL, at the SNP rs2916700, which did not show an admixture signal (the p-value from Tadmix was 0.439), the p-value from the smooth-GSB procedure (λ = 0.7) was 6.15×10⁻⁷, slightly larger than the p-value from the test Tlocal (4.38×10⁻⁷).
DISCUSSION
In this study, we describe a smooth-GSB procedure for GWAS in admixed populations that incorporates information from admixture mapping tests into association tests that adjust for local ancestry. The smooth-GSB procedure can be applied to identify small regions (a few hundred Kbs) harboring causal variants for binary traits, quantitative traits, and any other traits that follow distributions from the exponential family. Our simulation studies indicate that the smooth-GSB procedure can control the FWER well. Compared with association tests that adjust for local ancestry, the smooth-GSB procedure can attain substantially improved power at causal variants with admixture mapping signals and comparable power at causal variants with no admixture mapping signals.
In the smooth-GSB procedure, we calculate the smoothed weight as (1 − λ)w + λw̄, which is a linear combination of the original weight w and the average weight of all markers, w̄. In our future study, we will also investigate whether the smooth-GSB procedure performs better when the smoothed weight is calculated as (1 − λ)w + λE(w), where E(w) is the expected value of w under the null hypothesis of no association of the trait with the local deviation of ancestry.
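A minimal sketch of the smoothing step is given below. It assumes the smoothed weight takes the form (1 − λ)w + λw̄, with w̄ the average weight over all markers, mirroring the (1 − λ)w + λE(w) variant mentioned above; the function name is hypothetical, and how the original weights are derived from the admixture mapping tests is not shown.

```python
def smooth_weights(weights, lam):
    """Shrink each weight toward the average weight of all markers.

    lam = 0 returns the original weights; lam = 1 gives every marker the
    average weight, which corresponds to an unweighted procedure.
    """
    mean_w = sum(weights) / len(weights)
    return [(1.0 - lam) * w + lam * mean_w for w in weights]

# Toy example: one marker with a strong admixture mapping signal among four.
original = [3.4, 0.2, 0.2, 0.2]
smoothed = smooth_weights(original, 0.7)
```

Under this form the total weight is unchanged by smoothing, while the spread between high-weight and low-weight markers is compressed by the factor 1 − λ. This is consistent with the pattern in Table 4, where larger λ brings the power at the second causal SNP closer to that of Tlocal.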
We have evaluated the impact of the parameter λ on the power of the smooth-GSB procedure by simulation studies. When 0.6 ≤ λ ≤ 0.8, the power of the smooth-GSB procedure changed only slightly, and the procedure performed well in detecting causal variants both with and without admixture mapping signals. Based on our simulation studies, we recommend using a value of λ around 0.7 in the smooth-GSB procedure. To evaluate the sensitivity of the results to λ in a specific real data analysis, investigators can use different values of λ, such as 0.6, 0.7, and 0.8, and compare the corresponding adjusted p-values for each of the most significant SNPs (see also the section Real data analysis). How to determine the optimal value of λ remains an open question. We will explore data-driven methods for determining the optimal value in our future research.
As we mentioned earlier, there are three types of LD in admixed populations: mixture LD, admixture LD, and background LD. The mixture LD is caused by variation of global ancestry during the admixture; admixture LD is generated in local regions during the admixture; and background LD is the traditional LD inherited from ancestral populations. For GWAS in admixed populations, we aim to identify SNPs in background LD with the causal variants and locate each causal variant in a small chromosomal region (often with less than a few hundred Kbs). It is well known that the variation of global ancestry or mixture LD can generate confounding effects. In this study, we show that the admixture LD existing in a local region (with several Mbs or longer) may also cause spurious association (false positive) findings at null AIMs that are located in the local region with the admixture LD, if the local region includes a causal variant with a strong admixture mapping signal. Adjusting for global ancestry can control spurious association findings caused by the mixture LD, but may not be able to control the false positive findings caused by admixture LD. Therefore it is necessary to adjust for local ancestry in the association tests to strictly control the false positive findings caused by the admixture LD.
In GWAS analysis, for each test SNP, the null hypothesis of no association is approximately equivalent to the hypothesis H02: the SNP is not in background LD with the causal variants. The association tests that adjust for global ancestry, such as Tglobal, ATT, and the EIGENSTRAT method, are often directly or indirectly used to test H02. However, our simulation studies indicate that when a causal variant is an AIM and has a strong admixture mapping signal, these tests can have inflated type I error rates under the null hypothesis H02 at the null AIM SNPs that are in admixture LD but not in background LD with the causal variant. Similar results have also been reported in [13, 14]. When evaluating the type I error rate by simulation studies, Wang et al. [13] and Qin et al. [14] used models including a covariate of local ancestry to simulate the case-control status. In contrast, in our simulation studies, we used a model including a covariate of genotypic score but not a covariate of local ancestry (see model (4)). In the association tests adjusting for global ancestry, calling a SNP significant cannot guarantee that the SNP is in background LD with the causal variants; it may be only in admixture LD with the causal variants. Therefore, it may be more appropriate to use the association tests adjusting for global ancestry to locate a causal variant in a large chromosomal region (several Mbs).
As previously described, the joint test MIXSCORE of Pasaniuc et al. [10] is based on an implied assumption that the ancestry odds ratio Ω(R) = 1 if the SNP odds ratio is R = 1. In other words, the joint association test is approximately based on testing an implied null hypothesis HJoint: H01 and H02, that is, the test SNP is not in admixture LD with the causal variants and the test SNP is not in background LD with the causal variants. Rejecting the null hypothesis HJoint (i.e., calling the test SNP significant) is equivalent to stating that the test SNP is in admixture LD or in background LD with the causal variants, and cannot guarantee that the test SNP is in background LD with the causal variants. Therefore this joint association test is more appropriate for identifying large chromosomal regions (usually several Mbs) harboring the causal variants.
For weighted multiple testing procedures (such as the GSB procedure and the smooth-GSB procedure), a traditional criterion for type I error control is the FWER for all tests rather than the type I error rate for single tests as used in the Bonferroni procedure. In weighted multiple testing procedures for GWAS, the actual significance levels used for different null SNPs may differ because of their different weights: some null SNPs may be tested at a significance level higher than the Bonferroni-corrected threshold, while other null SNPs may be tested at significance levels lower than the Bonferroni-corrected threshold.
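To make the role of the weights concrete, the sketch below implements the textbook weighted version of Holm's sequentially rejective procedure [17]. It is offered as a generic illustration of how weights change the per-SNP significance levels, not as the authors' exact GSB or smooth-GSB implementation; the function name and the toy p-values and weights are hypothetical.

```python
def weighted_holm(pvalues, weights, alpha):
    """Return the indices rejected by Holm's weighted sequentially rejective procedure.

    Hypotheses are visited in increasing order of p_i / w_i. At each step the
    current hypothesis j is tested at level alpha * w_j / (sum of the weights of
    the hypotheses not yet rejected); the procedure stops at the first failure.
    """
    order = sorted(range(len(pvalues)), key=lambda i: pvalues[i] / weights[i])
    remaining_w = float(sum(weights))
    rejected = []
    for j in order:
        if pvalues[j] > alpha * weights[j] / remaining_w:
            break
        rejected.append(j)
        remaining_w -= weights[j]
    return rejected

p = [0.001, 0.02, 0.04, 0.5]
equal = weighted_holm(p, [1.0, 1.0, 1.0, 1.0], 0.05)     # ordinary Holm
weighted = weighted_holm(p, [1.0, 2.0, 0.5, 0.5], 0.05)  # second test up-weighted
```

With equal weights only the first hypothesis is rejected, because 0.02 exceeds 0.05/3. With the second hypothesis up-weighted, it is tested at level 0.05 × 2/3 and is rejected as well, while the down-weighted hypotheses face stricter levels than they would under equal weights.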
Our smooth-GSB procedures for GWAS are based on using the p-values from single SNP-based association tests for genome-wide SNPs and ignoring the correlations among SNPs. However, ignoring the correlations can result in reduced power. In our future research, we will explore using p-values from SNP set (or gene)-based association tests in the smooth-GSB procedures. The SNP set (or gene)-based association tests can account for the correlations among SNPs.
In addition, in real GWAS data on admixed populations, there may be cryptic relatedness among individuals. In our future research, we will also account for the cryptic relatedness among individuals in the smooth-GSB procedure by using an efficient mixed-model analysis such as the method GEMMA [31].
Acknowledgments
This research was supported by the National Institutes of Health grants: R01GM073766 and R01GM081488 from the National Institute of General Medical Sciences, U01HL101064 from the National Heart, Lung, and Blood Institute, and R01 HD060913 from the Eunice Kennedy Shriver National Institute of Child Health & Human Development. The research was also partly supported by grant UL1TR000058 from the National Center for Advancing Translational Sciences. The Atherosclerosis Risk in Communities Study was carried out as a collaborative study supported by National Heart, Lung, and Blood Institute contracts (HHSN268201100005C, HHSN268201100006C, HHSN268201100007C, HHSN268201100008C, HHSN268201100009C, HHSN268201100010C, HHSN268201100011C, and HHSN268201100012C). The authors thank the staff and participants of the ARIC study for their important contributions.
Appendix I: Null independence between statistics Tadmix and Tlocal for quantitative traits
For individual i, we let yi, Qi, Di, and Gi denote the quantitative trait value, global ancestry, local ancestry deviation, and genotype at a test SNP, respectively. For n unrelated individuals, we write y = (y1, …, yn)′, x1 = (1, …, 1)′, x2 = (Q1, …, Qn)′, x3 = (D1, …, Dn)′, and x4 = (G1, …, Gn)′. Moreover, we write Xk = [x1, …, xk] and

$$H_k = X_k (X_k' X_k)^{-1} X_k'$$

for k = 2, 3, 4. All Hk and I − Hk are idempotent matrices, and HjHk = HkHj = Hj for j ≤ k, and thus Hj(I − Hk) = 0 for each j ≤ k (= 2, 3, 4). In the Tadmix test for quantitative traits, we test the null hypothesis α3 = 0 under

$$\text{Model I:}\quad y = x_1\alpha_1 + x_2\alpha_2 + x_3\alpha_3 + \varepsilon,$$

where the random vector ε follows the multivariate normal distribution N(0, σ²I). We adopt the partial F test statistic

$$F_1 = \frac{y'(H_3 - H_2)y}{y'(I - H_3)y/(n-3)}$$

to measure the admixture mapping information after adjusting for the global ancestry. Model I is a reduced model of

$$\text{Model II:}\quad y = x_1\alpha_1 + x_2\alpha_2 + x_3\alpha_3 + x_4\alpha_4 + \varepsilon,$$

where the random vector ε follows the multivariate normal distribution N(0, σ²I). For quantitative traits, the Tlocal test tests the null hypothesis α4 = 0 under Model II. We use the partial F test statistic

$$F_2 = \frac{y'(H_4 - H_3)y}{y'(I - H_4)y/(n-4)}.$$

We prove that F1 and F2 are asymptotically independent if the null hypothesis α4 = 0 is true (whether α3 = 0 or α3 ≠ 0). For this purpose, we rewrite the F statistics as

$$F_1 = \frac{c_{1n}}{d_{1n}}, \qquad F_2 = \frac{c_{2n}}{d_{2n}},$$

where c1n = y′(H3 − H2)y/σ², d1n = y′(I − H3)y/(σ²(n − 3)), c2n = y′(H4 − H3)y/σ², and d2n = y′(I − H4)y/(σ²(n − 4)).
If α4 = 0 is true under Model II, then y ~ N(γ, σ²In) with γ = x1α1 + x2α2 + x3α3. By Craig's theorem in the general case, c1n and c2n are independent of each other if and only if (H3 − H2)(H4 − H3) = 0. Since Hj(I − Hk) = 0 for each j ≤ k (= 2, 3, 4), we observe that

$$(H_3 - H_2)(H_4 - H_3) = (H_3 - H_2)(I - H_3) - (H_3 - H_2)(I - H_4) = 0.$$

Since I − H3 is an idempotent matrix of rank n − 3, there exists an n × (n − 3) matrix U3 such that $U_3 U_3' = I - H_3$ and $U_3' U_3 = I_{n-3}$. Under Model II, if α4 = 0, then y = X3(α1, α2, α3)′ + ε, (I − H3)y = (I − H3)ε, and hence,

$$d_{1n} = \frac{\varepsilon'(I - H_3)\varepsilon}{\sigma^2 (n-3)} = \frac{z'z}{n-3}, \qquad z = U_3'\varepsilon/\sigma.$$

Since ε ~ N(0, σ²In), we have $z \sim N(0, I_{n-3})$; that is, the n − 3 elements of z are i.i.d. N(0, 1) variables. Hence $d_{1n} \to 1$ almost surely by the strong law of large numbers. In words, d1n converges to 1 almost surely. Similarly, we observe that $d_{2n} \to 1$ almost surely under Model II (whether α4 = 0 or α4 ≠ 0).
Since H3 − H2 and H4 − H3 are orthogonal idempotent matrices of rank 1, there are unit length vectors u, v such that H3 − H2 = uu′, H4 − H3 = vv′, and u′v = 0. Under the given condition, we obtain (H3 − H2)y/σ = (I − H2)x3α3/σ + uu′(ε/σ) and hence

$$c_{1n} = (u'y/\sigma)^2 = \zeta_1^2, \qquad \zeta_1 = \mu_1 + u'(\varepsilon/\sigma),$$

where μ1 = u′(I − H2)x3α3/σ. Under the same condition, we obtain the following representation:

$$c_{2n} = (v'y/\sigma)^2 = \zeta_2^2, \qquad \zeta_2 = v'(\varepsilon/\sigma).$$

From u′v = 0 and (ε/σ) ~ N(0, I), we obtain that (ζ1 − μ1, ζ2)′ = (u′(ε/σ), v′(ε/σ))′ ~ N(0, I2), so that ζ1 and ζ2 are independent. Since $d_{1n} \to 1$ and $d_{2n} \to 1$ almost surely, according to Slutsky's theorem we have

$$\left(\frac{c_{1n}}{d_{1n}}, \frac{c_{2n}}{d_{2n}}\right)' \xrightarrow{d} (\zeta_1^2, \zeta_2^2)'.$$

Therefore, c1n/d1n and c2n/d2n are asymptotically independent. Since F1 = c1n/d1n and F2 = c2n/d2n, the two F statistics are asymptotically independent.
References
- 1.Falush D, Stephens M, Pritchard JK. Inference of population structure using multilocus genotype data: linked loci and correlated allele frequencies. Genetics. 2003;164:1567–87. doi: 10.1093/genetics/164.4.1567. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2.Smith MW, O’Brien SJ. Mapping by admixture linkage disequilibrium: advances, limitations and guidelines. Nat Rev Genet. 2005;6:623–32. doi: 10.1038/nrg1657. [DOI] [PubMed] [Google Scholar]
- 3.Montana G, Pritchard JK. Statistical tests for admixture mapping with case-control and cases-only data. Am J Hum Genet. 2004;75:771–89. doi: 10.1086/425281. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Patterson N, Hattangadi N, Lane B, Lohmueller KE, Hafler DA, Oksenberg JR, Hauser SL, Smith MW, O’Brien SJ, Altshuler D, Daly MJ, Reich D. Methods for high-density admixture mapping of disease genes. Am J Hum Genet. 2004;74:979–1000. doi: 10.1086/420871. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Zhu X, Cooper RS, Elston RC. Linkage analysis of a complex disease through use of admixed populations. Am J Hum Genet. 2004;74:1136–53. doi: 10.1086/421329. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Zhu X, Zhang S, Tang H, Cooper R. A classical likelihood based approach for admixture mapping using EM algorithm. Hum Genet. 2006;120:431–45. doi: 10.1007/s00439-006-0224-z. [DOI] [PubMed] [Google Scholar]
- 7.Zhu X, Tang H, Risch N. Admixture mapping and the role of population structure for localizing disease genes. Adv Genet. 2008;60:547–69. doi: 10.1016/S0065-2660(07)00419-1. [DOI] [PubMed] [Google Scholar]
- 8.Hoggart CJ, Shriver MD, Kittles RA, Clayton DG, McKeigue PM. Design and analysis of admixture mapping studies. Am J Hum Genet. 2004;74:965–78. doi: 10.1086/420855. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Tian C, Hinds DA, Shigeta R, Kittles R, Ballinger DG, Seldin MF. A genomewide single-nucleotide-polymorphism panel with high ancestry information for African American admixture mapping. Am J Hum Genet. 2006;79:640–9. doi: 10.1086/507954. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.Pasaniuc B, Zaitlen N, Lettre G, Chen GK, Tandon A, Kao WH, Ruczinski I, Fornage M, Siscovick DS, Zhu X, Larkin E, Lange LA, Cupples LA, Yang Q, Akylbekova EL, Musani SK, Divers J, Mychaleckyj J, Li M, Papanicolaou GJ, Millikan RC, Ambrosone CB, John EM, Bernstein L, Zheng W, Hu JJ, Ziegler RG, Nyante SJ, Bandera EV, Ingles SA, Press MF, Chanock SJ, Deming SL, Rodriguez-Gil JL, Palmer CD, Buxbaum S, Ekunwe L, Hirschhorn JN, Henderson BE, Myers S, Haiman CA, Reich D, Patterson N, Wilson JG, Price AL. Enhanced statistical tests for GWAS in admixed populations: assessment using African Americans from CARe and a Breast Cancer Consortium. PLoS Genet. 2011;7:e1001371. doi: 10.1371/journal.pgen.1001371. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11.Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D. Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet. 2006;38:904–9. doi: 10.1038/ng1847. [DOI] [PubMed] [Google Scholar]
- 12.Zhu X, Zhang S, Zhao H, Cooper RS. Association mapping, using a mixture model for complex traits. Genet Epidemiol. 2002;23:181–96. doi: 10.1002/gepi.210. [DOI] [PubMed] [Google Scholar]
- 13.Wang X, Zhu X, Qin H, Cooper RS, Ewens WJ, Li C, Li M. Adjustment for local ancestry in genetic association analysis of admixed populations. Bioinformatics. 2011;27:670–7. doi: 10.1093/bioinformatics/btq709. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Qin H, Morris N, Kang SJ, Li M, Tayo B, Lyon H, Hirschhorn J, Cooper RS, Zhu X. Interrogating local population structure for fine mapping in genome-wide association studies. Bioinformatics. 2010;26:2961–8. doi: 10.1093/bioinformatics/btq560. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Zhu X, Young JH, Fox E, Keating BJ, Franceschini N, Kang S, Tayo B, Adeyemo A, Sun YV, Li Y, Morrison A, Newton-Cheh C, Liu K, Ganesh SK, Kutlar A, Vasan RS, Dreisbach A, Wyatt S, Polak J, Palmas W, Musani S, Taylor H, Fabsitz R, Townsend RR, Dries D, Glessner J, Chiang CW, Mosley T, Kardia S, Curb D, Hirschhorn JN, Rotimi C, Reiner A, Eaton C, Rotter JI, Cooper RS, Redline S, Chakravarti A, Levy D. Combined admixture mapping and association analysis identifies a novel blood pressure genetic locus on 5p13: contributions from the CARe consortium. Hum Mol Genet. 2011;20:2285–95. doi: 10.1093/hmg/ddr113. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16.Qin H, Zhu X. Power comparison of admixture mapping and direct association analysis in genome-wide association studies. Genet Epidemiol. 2012;36:235–43. doi: 10.1002/gepi.21616. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17. Holm S. A simple sequentially rejective multiple test procedure. Scand J Stat. 1979;6:65–70.
- 18. Westfall PH, Young SS. Resampling-based multiple testing: examples and methods for P-value adjustment. New York: Wiley; 1993.
- 19. McCullagh P, Nelder JA. Generalized linear models. London, New York: Chapman and Hall; 1989.
- 20. Tang H, Coram M, Wang P, Zhu X, Risch N. Reconstructing genetic ancestry blocks in admixed individuals. Am J Hum Genet. 2006;79:1–12. doi: 10.1086/504302.
- 21. Price AL, Tandon A, Patterson N, Barnes KC, Rafaels N, Ruczinski I, Beaty TH, Mathias R, Reich D, Myers S. Sensitive detection of chromosomal segments of distinct ancestry in admixed populations. PLoS Genet. 2009;5:e1000519. doi: 10.1371/journal.pgen.1000519.
- 22. Sankararaman S, Sridhar S, Kimmel G, Halperin E. Estimating local ancestry in admixed populations. Am J Hum Genet. 2008;82:290–303. doi: 10.1016/j.ajhg.2007.09.022.
- 23. Baran Y, Pasaniuc B, Sankararaman S, Torgerson DG, Gignoux C, Eng C, Rodriguez-Cintron W, Chapela R, Ford JG, Avila PC, Rodriguez-Santana J, Burchard EG, Halperin E. Fast and accurate inference of local ancestry in Latino populations. Bioinformatics. 2012;28:1359–67. doi: 10.1093/bioinformatics/bts144.
- 24. Chen W, Chen X, Archer KJ, Liu N, Li Q, Zhao Z, Sun S, Gao G. A rapid association test procedure robust under different genetic models accounting for population stratification. Hum Hered. 2013;75:23–33. doi: 10.1159/000350109.
- 25. Roeder K, Devlin B, Wasserman L. Improving power in genome-wide association studies: weights tip the scale. Genet Epidemiol. 2007;31:741–7. doi: 10.1002/gepi.20237.
- 26. Spencer CC, Su Z, Donnelly P, Marchini J. Designing genome-wide association studies: sample size, power, imputation, and the choice of genotyping chip. PLoS Genet. 2009;5:e1000477. doi: 10.1371/journal.pgen.1000477.
- 27. Su Z, Marchini J, Donnelly P. HAPGEN2: simulation of multiple disease SNPs. Bioinformatics. 2011;27:2304–5. doi: 10.1093/bioinformatics/btr341.
- 28. Torgerson DG, Ampleford EJ, Chiu GY, Gauderman WJ, Gignoux CR, Graves PE, Himes BE, Levin AM, Mathias RA, Hancock DB, Baurley JW, Eng C, Stern DA, Celedon JC, Rafaels N, Capurso D, Conti DV, Roth LA, Soto-Quiros M, Togias A, Li X, Myers RA, Romieu I, Van Den Berg DJ, Hu D, Hansel NN, Hernandez RD, Israel E, Salam MT, Galanter J, Avila PC, Avila L, Rodriquez-Santana JR, Chapela R, Rodriguez-Cintron W, Diette GB, Adkinson NF, Abel RA, Ross KD, Shi M, Faruque MU, Dunston GM, Watson HR, Mantese VJ, Ezurum SC, Liang L, Ruczinski I, Ford JG, Huntsman S, Chung KF, Vora H, Li X, Calhoun WJ, Castro M, Sienra-Monge JJ, del Rio-Navarro B, Deichmann KA, Heinzmann A, Wenzel SE, Busse WW, Gern JE, Lemanske RF, Jr, Beaty TH, Bleecker ER, Raby BA, Meyers DA, London SJ, Mexico City Childhood Asthma Study (MCAAS) Gilliland FD, Children’s Health Study (CHS) and HARBORS study. Burchard EG, Genetics of Asthma in Latino Americans (GALA) Study, Study of Genes-Environment and Admixture in Latino Americans (GALA2) and Study of African Americans, Asthma, Genes & Environments (SAGE) Martinez FD, Childhood Asthma Research and Education (CARE) Network. Weiss ST, Childhood Asthma Management Program (CAMP) Williams LK, Study of Asthma Phenotypes and Pharmacogenomic Interactions by Race-Ethnicity (SAPPHIRE) Barnes KC, Genetic Research on Asthma in African Diaspora (GRAAD) Study. Ober C, Nicolae DL. Meta-analysis of genome-wide association studies of asthma in ethnically diverse North American populations. Nat Genet. 2011;43:887–92. doi: 10.1038/ng.888.
- 29. Lettre G, Palmer CD, Young T, Ejebe KG, Allayee H, Benjamin EJ, Bennett F, Bowden DW, Chakravarti A, Dreisbach A, Farlow DN, Folsom AR, Fornage M, Forrester T, Fox E, Haiman CA, Hartiala J, Harris TB, Hazen SL, Heckbert SR, Henderson BE, Hirschhorn JN, Keating BJ, Kritchevsky SB, Larkin E, Li M, Rudock ME, McKenzie CA, Meigs JB, Meng YA, Mosley TH, Newman AB, Newton-Cheh CH, Paltoo DN, Papanicolaou GJ, Patterson N, Post WS, Psaty BM, Qasim AN, Qu L, Rader DJ, Redline S, Reilly MP, Reiner AP, Rich SS, Rotter JI, Liu Y, Shrader P, Siscovick DS, Tang WH, Taylor HA, Tracy RP, Vasan RS, Waters KM, Wilks R, Wilson JG, Fabsitz RR, Gabriel SB, Kathiresan S, Boerwinkle E. Genome-wide association study of coronary heart disease and its risk factors in 8,090 African Americans: the NHLBI CARe Project. PLoS Genet. 2011;7:e1001300. doi: 10.1371/journal.pgen.1001300.
- 30. Musunuru K, Lettre G, Young T, Farlow DN, Pirruccello JP, Ejebe KG, Keating BJ, Yang Q, Chen MH, Lapchyk N, Crenshaw A, Ziaugra L, Rachupka A, Benjamin EJ, Cupples LA, Fornage M, Fox ER, Heckbert SR, Hirschhorn JN, Newton-Cheh C, Nizzari MM, Paltoo DN, Papanicolaou GJ, Patel SR, Psaty BM, Rader DJ, Redline S, Rich SS, Rotter JI, Taylor HA, Jr, Tracy RP, Vasan RS, Wilson JG, Kathiresan S, Fabsitz RR, Boerwinkle E, Gabriel SB, NHLBI Candidate Gene Association Resource. Candidate gene association resource (CARe): design, methods, and proof of concept. Circ Cardiovasc Genet. 2010;3:267–75. doi: 10.1161/CIRCGENETICS.109.882696.
- 31. Zhou X, Stephens M. Genome-wide efficient mixed-model analysis for association studies. Nat Genet. 2012;44:821–4. doi: 10.1038/ng.2310.
