Abstract
Individuals of admixed ancestries (for example, African Americans) inherit a mosaic of ancestry segments (local ancestry) originating from multiple continental ancestral populations. This offers the unique opportunity of investigating the similarity of genetic effects on traits across ancestries within the same population. Here we introduce an approach to estimate the correlation of causal genetic effects across local ancestries and analyze 38 complex traits in African-European admixed individuals (N = 53,001), observing very high correlations (meta-analysis estimate 0.95, 95% credible interval 0.93–0.97), much higher than the correlation of causal effects across continental ancestries. We replicate our results using regression-based methods applied to marginal genome-wide association study summary statistics. We also report realistic scenarios where regression-based methods yield inflated heterogeneity-by-ancestry due to ancestry-specific tagging of causal effects and/or polygenicity. Our results motivate genetic analyses that assume minimal heterogeneity in causal effects by ancestry, with implications for the inclusion of ancestry-diverse individuals in studies.
Large-scale genotype–phenotype studies are increasingly analyzing diverse sets of individuals of various continental and subcontinental ancestries1-4. A fundamental open question in these studies is to what extent the genetic basis of common human diseases and traits is shared or distinct across ancestry populations, and how this affects genetic discovery and prediction5-9. For example, it is unclear how much of the low portability of polygenic scores can be attributed to differences in genetic causal effects across ancestries5,10,11. Hence, understanding the role of ancestry in the variability of causal effect sizes has major implications for understanding the genetic basis of disease and for the portability of genetic risk scores in personalized and equitable genomic medicine1,10-13.
The standard approach to estimating similarity in causal effects across ancestries has focused on cross-population analyses (typically at continental level) in which effect sizes estimated by large-scale genome-wide association studies (GWAS) are compared across continental-level ancestry groups5-8,14,15. Such studies have found significant differences, albeit with modest magnitude, of causal effects in cross-continental comparisons. However, a main drawback of such studies is the differences in definition of environment/phenotype across such broad units of ancestry that can reduce the observed similarity; for example, the low estimated similarity in causal genetic effects for major depressive disorder across Europeans and East Asians may be attributed to different diagnostic criteria in the two populations8,16.
As an alternative to studying populations across different continents, causal effects similarity by ancestry can also be studied within recently admixed populations. Recently admixed individuals have the unique feature of having their genomes as mosaic of ancestry segments (local ancestry) originating from the ancestral populations within the past few dozen generations; for example, African American genomes are composed of segments of African and European ancestries within the past 5–15 generations17. Unfortunately, admixed populations are vastly underrepresented in genomic studies18, partly because of the lack of understanding of how the genetic causal effects vary across ancestries10,17,19-22. For example, heterogeneity of marginal effects (which is estimated in GWAS single variant scan and can tag effects from nearby variants due to linkage disequilibrium (LD)) for a few traits and loci has been reported23-26, but it remains unknown whether this reflects true difference in causal genetic effects or confounding due to different allele frequencies and/or LD by ancestry. Recent work15 has reported evidence of causal effect heterogeneity for single nucleotide polymorphisms (SNPs) in regions of European ancestries comparing individuals of European versus African American ancestries; however, these studies focused on cross-population comparisons instead of comparing effects across local ancestries within admixed populations. Estimating the magnitude of similarity in causal effects across ancestries is important for all genotype–phenotype studies in admixed populations from mapping to polygenic prediction, particularly within methods that allow for effects to vary across local ancestry segments19-22.
In this Article, we quantify the similarity in the causal effects (that is, change in phenotype per allele substitution) across local ancestries within admixed populations; such similarity can be defined as the correlation of ancestral causal genetic effects across African and European local ancestries. We develop a method that leverages the polygenic architecture of complex traits to model all variants (GWAS-significant and non-significant); this approach is accurate and robust across a wide range of realistic simulated genetic architectures. We also investigate regression-based approaches that use marginal effects of SNPs prioritized in GWAS risk regions. Through simulation studies, we find that regression-based methods can yield deflated estimates of similarity (that is, inflated heterogeneity) especially for highly polygenic traits.
We analyze complex traits in African-European admixed individuals in Population Architecture using Genomics and Epidemiology (PAGE)1 (24 traits, average N = 9,296), UK Biobank (UKBB)2 (26 traits, average N = 3,808) and All of Us (AoU)3 (10 traits, average N = 20,496); there are 38 unique traits in total. We find that causal effects are largely consistent across local ancestries within admixed individuals (meta-analysis across 38 traits: estimated correlation of 0.95, 95% credible interval 0.93–0.97). In addition, we find that the heterogeneity in marginal effects exhibited at several trait–locus pairs can be explained by multiple nearby causal variants within a region, consistent with our simulation studies. Our results suggest that causal effects are largely consistent across local ancestries within African-European admixed individuals, and this motivates future genetic analyses in admixed populations that assume similar effects across ancestries for improved power.
Results
Overview
We start by describing the statistical model we use to relate genotype to phenotype in two-way admixed individuals; we focus on two-way African-European admixture because local ancestries in this population can be accurately inferred (Methods; for extension to other admixed populations, see Discussion). For a given individual, at each SNP $s$, we denote the number of minor alleles on the maternal and paternal haplotypes as $x_{s,M}, x_{s,P} \in \{0, 1\}$ and the corresponding local ancestries as $\gamma_{s,M}, \gamma_{s,P} \in \{1, 2\}$ (1 for African, 2 for European). Denoting $\mathbb{1}(\cdot)$ as the indicator function, we define the local ancestry dosage as the allele counts from each ancestry; for example, $\ell_{s,1} := \mathbb{1}(\gamma_{s,M} = 1) + \mathbb{1}(\gamma_{s,P} = 1)$ for African (similarly for European). For modeling convenience, we use variables that encode the genotypes conditional on local ancestries as the allele counts specific to each local ancestry: $g_{s,1} := x_{s,M}\,\mathbb{1}(\gamma_{s,M} = 1) + x_{s,P}\,\mathbb{1}(\gamma_{s,P} = 1)$ (similarly for $g_{s,2}$). The phenotype $y$ of an admixed individual is modeled as a function of allelic effect sizes that are allowed to vary across ancestries:

$$y = \sum_{s=1}^{S} \left( g_{s,1}\,\beta_{s,1} + g_{s,2}\,\beta_{s,2} \right) + \mathbf{c}^{\top}\boldsymbol{\alpha} + \epsilon \qquad (1)$$

where $\beta_{s,1}$, $\beta_{s,2}$ are the causal effects at SNP $s$, $S$ is the total number of causal SNPs in the genome, $\mathbf{c}$ and $\boldsymbol{\alpha}$ are other covariates (for example, age, sex and genome-wide ancestries) and their effects, and $\epsilon$ is the environmental noise. $\beta_{s,1}$, $\beta_{s,2}$ are usually referred to as allelic effects: the change in phenotype with each additional allele. This is in contrast with standardized effects, defined as the change in phenotype per standard deviation increase of genotype, where genotypes at each SNP are standardized to have unit variance5,27. We refrain from using standardized effects in this work because different ancestries yield different ancestry-specific frequencies for the same SNP, which complicates standardization5 (Methods).
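The ancestry-specific allele counts used in the model can be illustrated with a minimal sketch. It assumes phased 0/1 minor-allele indicators and local ancestry labels coded 1 and 2 for the two haplotypes; the function names, toy genotypes and effect sizes are hypothetical and chosen only for illustration.

```python
import numpy as np

def ancestry_specific_counts(x_mat, x_pat, anc_mat, anc_pat, ancestry):
    """Minor-allele count carried on haplotypes assigned to `ancestry`.

    x_mat, x_pat     : 0/1 minor-allele indicators on the two haplotypes
    anc_mat, anc_pat : local ancestry labels (1 or 2) of the two haplotypes
    """
    x_mat, x_pat = np.asarray(x_mat), np.asarray(x_pat)
    anc_mat, anc_pat = np.asarray(anc_mat), np.asarray(anc_pat)
    return x_mat * (anc_mat == ancestry) + x_pat * (anc_pat == ancestry)

def local_ancestry_dosage(anc_mat, anc_pat, ancestry):
    """Number of haplotypes (0, 1 or 2) assigned to `ancestry` at each SNP."""
    return (np.asarray(anc_mat) == ancestry).astype(int) + (
        np.asarray(anc_pat) == ancestry
    ).astype(int)

# One individual, three SNPs (toy values, not real data).
x_mat, x_pat = [1, 0, 1], [1, 1, 0]
anc_mat, anc_pat = [1, 1, 2], [2, 1, 2]

g1 = ancestry_specific_counts(x_mat, x_pat, anc_mat, anc_pat, ancestry=1)
g2 = ancestry_specific_counts(x_mat, x_pat, anc_mat, anc_pat, ancestry=2)
l1 = local_ancestry_dosage(anc_mat, anc_pat, ancestry=1)

# Genetic component of the phenotype with hypothetical ancestry-specific effects.
beta1 = np.array([0.10, -0.20, 0.05])
beta2 = np.array([0.10, -0.15, 0.05])
genetic_value = g1 @ beta1 + g2 @ beta2
```

By construction the two ancestry-specific counts sum to the ordinary genotype dosage, so setting the two effect vectors equal recovers a standard additive model.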
Our goal is to estimate the similarity in the causal effects across local ancestries in admixed populations (Fig. 1); the similarity can be evaluated across all genome-wide causal SNPs that are common across ancestries in the form of a cross-ancestry genetic correlation5,8 (for consistency with previous works we use ‘genetic correlation’ to refer to the correlation of genetic effects across ancestries). The ancestry-specific causal effects at each SNP are modeled as random variables following a bivariate Gaussian distribution parametrized by $\sigma_g^2$ and $\rho_g$, denoting the variance and covariance of the effects:

$$\begin{bmatrix} \beta_{s,1} \\ \beta_{s,2} \end{bmatrix} \sim \mathcal{N}\!\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix},\; \tau_s \begin{bmatrix} \sigma_g^2 & \rho_g \\ \rho_g & \sigma_g^2 \end{bmatrix} \right), \quad s = 1, \dots, S \qquad (2)$$

where $\tau_s$ are variant-specific parameters determined by the genetic architecture assumption (Methods). Under this model, the genome-wide causal effects correlation is defined as $r_{\text{admix}} := \rho_g / \sigma_g^2$; $r_{\text{admix}} = 1$ indicates the same causal effects across local ancestries, while $r_{\text{admix}} < 1$ indicates differences across ancestries. To estimate $r_{\text{admix}}$, given the genotype and phenotype data for a trait, we calculate the profile likelihood curve of $r_{\text{admix}}$, obtained by maximizing the likelihood of the model defined by equations (1) and (2) with regard to $\sigma_g^2$ and the environmental variance $\sigma_e^2$ for each fixed $r_{\text{admix}}$. We assume $0 \le r_{\text{admix}} \le 1$ a priori, both because causal effects are unlikely to be negatively correlated across ancestries and because a smaller search space reduces computational cost; we have also performed real data analyses to verify this assumption (see below). We obtain the point estimate and credible interval and perform hypothesis testing either for each individual trait, using the trait-specific profile likelihood curve, or for a meta-analysis across multiple traits, using the product of the likelihood curves across traits (analogous to inverse variance weighted meta-analysis; Methods).
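The profile likelihood idea can be sketched as follows. This is a simplified illustration, not the implementation used in the study: it assumes equal per-SNP weights, no covariates other than an intercept, and a correlation grid restricted to [0, 1], where the combined kinship matrix is positive semidefinite. All function and variable names are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def profile_loglik_curve(y, G1, G2, r_grid):
    """Profile log-likelihood of the effect correlation r on a grid.

    Marginal model: y ~ N(0, sg2 * (K1 + r * K12) + se2 * I), where
    K1 = G1 G1' + G2 G2' and K12 = G1 G2' + G2 G1' (both scaled by #SNPs).
    For each fixed r, sg2 and se2 are maximized out.
    """
    n, n_snps = G1.shape
    y = y - y.mean()                      # crude intercept adjustment
    K1 = (G1 @ G1.T + G2 @ G2.T) / n_snps
    K12 = (G1 @ G2.T + G2 @ G1.T) / n_snps
    curve = []
    for r in r_grid:
        d, U = np.linalg.eigh(K1 + r * K12)
        d = np.clip(d, 0.0, None)         # guard against tiny negative eigenvalues
        yt2 = (U.T @ y) ** 2

        def neg_loglik(log_delta):
            # delta = se2 / sg2; sg2 has a closed form given delta
            v = d + np.exp(log_delta)
            sg2 = np.mean(yt2 / v)
            return 0.5 * (n * np.log(2 * np.pi * sg2) + np.sum(np.log(v)) + n)

        fit = minimize_scalar(neg_loglik, bounds=(-8.0, 8.0), method="bounded")
        curve.append(-fit.fun)
    return np.array(curve)

# Toy data: random haplotypes and local ancestries, identical effects by ancestry.
rng = np.random.default_rng(0)
n, n_snps = 200, 400
anc = rng.integers(1, 3, size=(2, n, n_snps))     # ancestry labels 1 or 2
hap = rng.binomial(1, 0.3, size=(2, n, n_snps))   # 0/1 minor alleles
G1 = (hap * (anc == 1)).sum(axis=0).astype(float)
G2 = (hap * (anc == 2)).sum(axis=0).astype(float)
beta = rng.normal(0.0, np.sqrt(0.5 / n_snps), size=n_snps)
y = (G1 + G2) @ beta + rng.normal(0.0, np.sqrt(0.5), size=n)

r_grid = np.linspace(0.0, 1.0, 11)
curve = profile_loglik_curve(y, G1, G2, r_grid)
r_hat = r_grid[np.argmax(curve)]
```

With a sample this small the curve is nearly flat, so the grid maximizer is noisy; the sketch is meant to show the structure of the computation rather than to reproduce the accuracy reported in the text.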
We organize the next sections as follows. First, we show that our proposed approach provides accurate estimation of the causal effects correlation $r_{\text{admix}}$ in extensive simulations. Second, we show that $r_{\text{admix}}$ is very close to 1 in real data of African-European admixed individuals from PAGE, UKBB and AoU. Third, we replicate our findings using methods that rely on GWAS summary data (marginal SNP effects at GWAS significant loci). Finally, we investigate pitfalls of methods4,14,15,28 that use marginal SNP effects and can report inflated heterogeneity; we find that Deming regression is the only approach robust enough to quantify effect similarity by ancestry from marginal GWAS effects in admixed individuals.
Polygenic method for $r_{\text{admix}}$ is accurate in simulations
We performed simulations to evaluate our proposed polygenic method using real genome-wide genotypes. We simulated phenotypes using genotypes and inferred local ancestries of N = 17,299 individuals at millions of SNPs (those with MAF >0.5% in both ancestries in the PAGE dataset; we omitted population-specific rare SNPs to reduce estimation variance; Methods). Phenotypes were simulated under a range of genetic architectures with a frequency-dependent causal effects distribution29,30, varying the proportion of causal variants $p_{\text{causal}}$, the heritability $h_g^2$ and the true causal effects correlation $r_{\text{admix}}$ (Methods). In our main simulations we fixed $p_{\text{causal}}$ at a value representative of a typical polygenic complex trait31. When estimating $r_{\text{admix}}$, we either used all SNPs in the imputed genotypes that were used to simulate phenotypes, or restricted to HapMap3 (HM3) SNPs32 to simulate scenarios where causal variants are not perfectly typed in the data (Methods).
Our method produced accurate point estimates and well-calibrated credible intervals of the causal effects correlation $r_{\text{admix}}$ across a range of simulation settings (Fig. 2a and Supplementary Tables 1 and 2). We first evaluated our method in simulations with a realistic range of heritability $h_g^2$ (including 0.25 and 0.5) and $r_{\text{admix}}$ (including 0.95 and 1.0). When using the imputed SNPs for estimation, results were approximately unbiased (average and maximal relative biases across simulation settings were −0.42% and −1.8%, respectively). Credible intervals of $r_{\text{admix}}$ meta-analyzed across simulations approximately covered the true $r_{\text{admix}}$: for the most biased setting, the 95% credible interval was 0.915–0.948. When using the HM3 SNPs for estimation, there was a consistent but small downward bias (Fig. 2a; average and maximal relative biases were −1.0% and −2.0%, respectively). This small downward bias was due to imperfect tagging, because some of the causal SNPs were not included in the HM3 SNPs. Nonetheless, the magnitude of bias using either imputed or HM3 SNPs was small, indicating that our method was accurate and robust to imperfect tagging. We next performed simulations to investigate the potential bias in estimating $r_{\text{admix}}$ due to omitting population-specific rare variants. We re-applied our method to the same simulated data using SNPs with MAF >1% and MAF >5% in both populations (in addition to the default MAF >0.5%). We observed a downward bias in estimated $r_{\text{admix}}$ as a more stringent MAF threshold was used and more SNPs were filtered out of the estimation procedure. For example, the mode of the estimates was 0.966 when the method was applied with MAF >5% (Fig. 2b and Supplementary Table 3). This indicates that omitting population-specific rare variants can lead to downward bias (Discussion). We also investigated the impact of the prior assumption on $r_{\text{admix}}$: we applied a revised methodology that allows for negative $r_{\text{admix}}$ and found that estimated $r_{\text{admix}}$ were highly consistent when assuming $0 \le r_{\text{admix}} \le 1$ (default method) versus when assuming $-1 \le r_{\text{admix}} \le 1$ (Fig. 2c).
We performed several secondary analyses. We determined that our method remained accurate at other simulated proportions of causal variants $p_{\text{causal}}$ (Supplementary Table 2; ranging from 0.001% to 1%) and over a broader range of simulated causal effects correlation $r_{\text{admix}}$ (Supplementary Table 4; ranging from −0.5 to 1). In null simulations ($r_{\text{admix}} = 1$), we determined that the false positive rate of the hypothesis test of $r_{\text{admix}} = 1$ was properly controlled for most simulation settings, and was only slightly inflated when HM3 SNPs were used and/or when an extremely low $p_{\text{causal}}$ was simulated. In simulations with $r_{\text{admix}} < 1$, power to detect $r_{\text{admix}} < 1$ increased with increasing heritability and decreasing $r_{\text{admix}}$ (Supplementary Tables 1 and 2). In addition, we found that heritability can be accurately estimated in these simulations (Supplementary Tables 5 and 6, and Methods). In summary, our method can be reliably used to estimate $r_{\text{admix}}$.
Causal effects are similar across local ancestries
We applied our polygenic method to estimate the causal effects correlation $r_{\text{admix}}$ within African-European admixed individuals in PAGE1 (24 traits, average N = 9,296, average fraction of African ancestries 78%), UKBB2 (26 traits, average N = 3,808, average fraction of African ancestries 59%) and AoU3 (10 traits, average N = 20,496, average fraction of African ancestries 74%) (Methods). Meta-analyzing across 38 traits from PAGE, UKBB and AoU (60 study–trait pairs), we observed a high similarity in causal effects across ancestries ($r_{\text{admix}}$ = 0.95, 95% credible interval 0.93–0.97). Results were highly consistent across datasets despite different ancestry compositions (95% credible intervals: PAGE, 0.85–0.94; UKBB, 0.91–1; AoU, 0.94–1) as well as across traits (Fig. 3a, Table 1 and Supplementary Table 7). Height was the only trait with $r_{\text{admix}}$ significantly below 1 after Bonferroni correction (nominal P = 4.3 × 10⁻⁴; meta-analyzed across three datasets; Table 1), albeit with a high estimated $r_{\text{admix}}$ of 0.94 (95% credible interval 0.89–0.97). Estimates for the same traits across datasets were only weakly correlated (Extended Data Fig. 1), suggesting that causal effects are consistently similar by ancestry across traits (that is, the true $r_{\text{admix}}$ is close to 1 for all traits).
Table 1 ∣.
Trait | N | $r_{\text{admix}}$ mode | 95% credible interval(s) | P value | $h_g^2$ ± s.e. |
---|---|---|---|---|---|
BMD | 1,668 | 0.000 | 0.00–0.78 | 0.012 | 0.34 ± 0.16 |
Neuroticism | 3,044 | 1.000 | 0.36–1.00 | 1 | 0.36 ± 0.11 |
Education years | 3,324 | 0.000 | 0.00–0.94 | 0.4 | 0.055 ± 0.075 |
MCHC | 3,650 | 0.228 | 0.00–0.87 | 0.061 | 0.21 ± 0.092 |
Type 1 diabetes | 3,767 | 0.381 | 0.00–0.95 | 0.77 | −0.033 ± 0.016 |
HLR count | 3,852 | 1.000 | 0.07–1.00 | 1 | 0.12 ± 0.086 |
RBC distribution width | 3,925 | 1.000 | 0.27–1.00 | 1 | 0.28 ± 0.087 |
Lymphocyte count | 3,935 | 1.000 | 0.00–0.60, 0.66–1.00 | 1 | 0.13 ± 0.086 |
Monocyte count | 3,935 | 0.972 | 0.26–1.00 | 0.82 | 0.3 ± 0.087 |
MCH | 3,948 | 0.829 | 0.07–1.00 | 0.36 | 0.2 ± 0.076 |
RBC count | 3,948 | 1.000 | 0.37–1.00 | 1 | 0.31 ± 0.09 |
Hypothyroidism | 4,063 | 1.000 | 0.05–1.00 | 1 | 0.046 ± 0.07 |
PR interval | 4,071 | 0.844 | 0.08–1.00 | 0.36 | 0.22 ± 0.084 |
QRS interval | 4,078 | 1.000 | 0.07–1.00 | 1 | 0.12 ± 0.082 |
Asthma | 4,079 | 1.000 | 0.15–1.00 | 1 | 0.21 ± 0.087 |
Ever smoked | 4,083 | 0.764 | 0.04–0.98 | 0.31 | 0.17 ± 0.082 |
QT interval | 4,089 | 0.920 | 0.07–1.00 | 0.69 | 0.16 ± 0.083 |
HbA1c | 5,353 | 0.954 | 0.08–1.00 | 0.77 | 0.19 ± 0.078 |
Cigarettes per day | 6,995 | 0.999 | 0.08–1.00 | 1 | 0.097 ± 0.047 |
Fasting insulin | 7,753 | 1.000 | 0.21–1.00 | 1 | 0.13 ± 0.044 |
eGFR | 7,978 | 0.805 | 0.16–1.00 | 0.09 | 0.19 ± 0.046 |
C-reactive protein | 8,321 | 0.995 | 0.82–1.00 | 0.94 | 0.28 ± 0.046 |
Fasting glucose | 9,646 | 0.695 | 0.00–0.93 | 0.27 | 0.064 ± 0.035 |
Coffee consumption | 11,587 | 0.982 | 0.10–1.00 | 0.9 | 0.074 ± 0.0 3 |
Platelet count | 12,545 | 0.783 | 0.20–0.98 | 0.025 | 0.19 ± 0.038 |
White blood cell count | 12,755 | 0.931 | 0.70–1.00 | 0.26 | 0.23 ± 0.036 |
Type 2 diabetes | 18,630 | 0.897 | 0.49–1.00 | 0.23 | 0.12 ± 0.024 |
Hypertension | 20,744 | 0.929 | 0.30–1.00 | 0.45 | 0.08 ± 0.027 |
LDL | 21,979 | 0.958 | 0.70–1.00 | 0.55 | 0.14 ± 0.046 |
HDL | 22,039 | 0.961 | 0.82–1.00 | 0.46 | 0.22 ± 0.057 |
Triglycerides | 22,494 | 0.843 | 0.54–0.98 | 0.012 | 0.18 ± 0.027 |
Total cholesterol | 22,555 | 0.818 | 0.50–0.97 | 0.007 | 0.18 ± 0.039 |
Heart rate | 28,764 | 0.980 | 0.82–1.00 | 0.74 | 0.099 ± 0.015 |
WHR | 36,756 | 0.973 | 0.86–1.00 | 0.55 | 0.12 ± 0.015 |
Diastolic blood pressure | 43,787 | 1.000 | 0.90–1.00 | 1 | 0.077 ± 0.024 |
Systolic blood pressure | 43,788 | 1.000 | 0.88–1.00 | 1 | 0.071 ± 0.013 |
BMI | 49,521 | 0.974 | 0.92–1.00 | 0.33 | 0.22 ± 0.02 |
Height | 49,605 | 0.936 | 0.89–0.97 | 0.00043 | 0.4 ± 0.014 |
Meta-analysis | | 0.947 | 0.93–0.97 | 8.7 × 10⁻⁷ | |
For each trait, we report the number of individuals (N), the posterior mode and 95% credible interval(s) for the estimated causal effects correlation $r_{\text{admix}}$, the nominal one-sided P value for rejecting the null hypothesis of $r_{\text{admix}} = 1$ (unadjusted for multiple testing; Methods), and the estimated heritability and its standard error. Meta-analysis results across the 38 traits are shown in the last row. Traits are ordered by number of individuals. For each trait, we perform meta-analysis across studies if the trait is available in multiple studies (Methods). Lymphocyte count has two credible intervals because its profile likelihood curve is non-concave, a result of the small sample size. BMD, bone mineral density; HLR, high light scattering reticulocytes; MCHC, mean corpuscular hemoglobin concentration.
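The way a likelihood curve on a grid can yield a mode, one or more credible intervals, and a meta-analysis across traits can be sketched as below. The sketch assumes a uniform prior over the grid and highest-posterior-density intervals, which is one natural reading of the procedure rather than a confirmed description of it; function names and the toy curves are hypothetical.

```python
import numpy as np

def posterior_from_loglik(loglik):
    """Normalize a log-likelihood curve into posterior weights (uniform prior)."""
    w = np.exp(loglik - np.max(loglik))
    return w / w.sum()

def hpd_intervals(grid, post, level=0.95):
    """Highest-posterior-density region on a grid, as a list of (low, high) runs."""
    order = np.argsort(post)[::-1]                 # grid points by decreasing density
    k = np.searchsorted(np.cumsum(post[order]), level) + 1
    keep = np.zeros(len(grid), dtype=bool)
    keep[order[:k]] = True
    intervals, start = [], None
    for i, flag in enumerate(keep):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            intervals.append((grid[start], grid[i - 1]))
            start = None
    if start is not None:
        intervals.append((grid[start], grid[-1]))
    return intervals

def meta_analyze(logliks):
    """Multiply likelihood curves across traits (sum of log-likelihoods)."""
    return np.sum(np.asarray(logliks), axis=0)

grid = np.linspace(0.0, 1.0, 101)

# A non-concave (bimodal) toy curve gives two disjoint intervals.
bimodal = np.logaddexp(-((grid - 0.2) ** 2) / (2 * 0.02**2),
                       -((grid - 0.9) ** 2) / (2 * 0.02**2))
two_intervals = hpd_intervals(grid, posterior_from_loglik(bimodal))

# Combining two concave toy curves sharpens the estimate around their shared peak.
curve_a = -((grid - 0.95) ** 2) / (2 * 0.05**2)
curve_b = -((grid - 0.95) ** 2) / (2 * 0.03**2)
combined = meta_analyze([curve_a, curve_b])
combined_mode = grid[np.argmax(combined)]
```

A non-concave curve, as can arise with small samples, naturally produces more than one interval under this construction.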
We performed several secondary analyses. As in the simulation studies, we determined that the prior assumption on the causal effects correlation $r_{\text{admix}}$ had minimal impact on results: estimated $r_{\text{admix}}$ of 24 traits in PAGE were highly consistent when assuming $0 \le r_{\text{admix}} \le 1$ (default method) versus when assuming $-1 \le r_{\text{admix}} \le 1$ (Extended Data Fig. 2). Such consistency between the two methods again indicates similar genetic causal effects across local ancestries and shows that estimation is robust to the choice of statistical prior on $r_{\text{admix}}$. Our results were robust to different assumptions about the effects distribution (Extended Data Fig. 3 and Supplementary Table 8), consistent with previous work33. Results were also robust to the SNP set used in the estimation (Extended Data Fig. 3 and Supplementary Table 8) and to the inclusion criterion for admixed individuals (Extended Data Fig. 4). Additionally, an alternative formulation of the method assuming different variance components by ancestry did not outperform our default method, which assumes the same variance component by ancestry (Extended Data Fig. 5, Supplementary Table 9 and Supplementary Note).
Next, we contrasted the causal effects correlation within admixed individuals, $r_{\text{admix}}$, with transcontinental genetic correlations of (1) European versus African and (2) European versus East Asian populations (Fig. 3b and Methods). We found a much higher similarity across local ancestries within admixed populations ($r_{\text{admix}}$ = 0.95, 95% credible interval 0.93–0.97) than for transcontinental correlations of African versus European within UKBB (meta-analysis across 26 traits, 95% confidence interval 0.43–0.56) and of East Asian (Biobank Japan) versus European (UKBB)8 (meta-analysis across 31 traits, 95% confidence interval 0.83–0.87) (Supplementary Table 10). Overall, our results are consistent with $r_{\text{admix}}$ being less susceptible to the heterogeneity due to differences in phenotyping/environment that affects transcontinental comparisons.
We sought to replicate the high causal effects correlation $r_{\text{admix}}$ using regression-based methods that leverage estimated ancestry-specific marginal effects at GWAS loci (Methods). Specifically, we used the following marginal regression equation (restricting equation (1) to each GWAS SNP $s$): $y = g_{s,1}\,\beta^{(m)}_{s,1} + g_{s,2}\,\beta^{(m)}_{s,2} + \mathbf{c}^{\top}\boldsymbol{\alpha} + \epsilon$, where $g_{s,1}$ and $g_{s,2}$ are the ancestry-specific allele counts at SNP $s$ (we distinguish marginal effects $\beta^{(m)}_{s,1}, \beta^{(m)}_{s,2}$ from causal effects $\beta_{s,1}, \beta_{s,2}$; Methods). Across 60 study–trait pairs, we detected 217 GWAS significant clumped trait–SNP pairs and estimated the ancestry-specific marginal effects for each SNP (Fig. 3c and Supplementary Table 11). We determined that the estimated marginal effects are largely consistent by local ancestry at these GWAS clumped SNPs, with a Deming regression slope34 of 0.82 (standard error 0.06) (applied to the estimated marginal effects of the two ancestries; Deming regression properly accounts for uncertainty in both dependent and independent variables; Methods). Mean corpuscular hemoglobin (MCH)-associated SNPs at 16p13.3 drove most of the differences by ancestry: the Deming regression slope was 0.93 (standard error 0.04) on the remaining 193 SNPs after excluding 24 MCH-associated SNPs; MCH-associated SNPs also had the strongest heterogeneity in marginal effects by ancestry (using a heterogeneity score test (HET) for testing effects heterogeneity at each SNP35; Supplementary Table 11 and Methods). By performing statistical fine-mapping analysis, we found multiple conditionally independent association signals at MCH-associated and other loci with heterogeneity by ancestry (Extended Data Fig. 6 and Supplementary Note). In fact, the MCH-associated loci are located in a region harboring the alpha-globin gene cluster (HBZ–HBM–HBA2–HBA1–HBQ1), known to contain multiple causal variants36. These results suggest that, similar to causal effects, marginal effects at GWAS loci are also largely consistent by local ancestry across multiple traits, with the exception of the 16p13.3 loci for MCH in our study, where multiple large-effect causal variants drive some heterogeneity by ancestry in marginal effects.
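To make concrete why Deming regression is preferred over ordinary least squares when both sets of effect estimates are noisy, here is a minimal sketch of the textbook closed-form Deming estimator. It assumes a single known ratio `delta` of error variances (y-errors over x-errors) shared by all SNPs, which is a simplification of an analysis that uses per-SNP standard errors; the function names and simulated effects are hypothetical.

```python
import numpy as np

def deming_fit(x, y, delta=1.0):
    """Closed-form Deming regression; delta = var(error in y) / var(error in x).

    Assumes the sample covariance of x and y is non-zero.
    """
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    sxx, syy, sxy = np.mean(xc**2), np.mean(yc**2), np.mean(xc * yc)
    diff = syy - delta * sxx
    slope = (diff + np.sqrt(diff**2 + 4.0 * delta * sxy**2)) / (2.0 * sxy)
    return slope, y.mean() - slope * x.mean()

def ols_slope(x, y):
    """Ordinary least squares slope, which ignores noise in x."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return np.sum(xc * yc) / np.sum(xc**2)

# Toy example: identical true effects in both ancestries, equal estimation noise.
rng = np.random.default_rng(1)
true_effect = rng.normal(0.0, 1.0, size=5000)
est_anc1 = true_effect + rng.normal(0.0, 0.5, size=5000)
est_anc2 = true_effect + rng.normal(0.0, 0.5, size=5000)

slope_deming, intercept_deming = deming_fit(est_anc1, est_anc2, delta=1.0)
slope_ols = ols_slope(est_anc1, est_anc2)
```

In this toy setting the OLS slope is attenuated toward zero by the noise in the x-axis estimates, whereas the Deming slope stays close to the true value of 1, mirroring the distinction drawn in the text.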
Pitfalls of using marginal effects to estimate heterogeneity
Next, we focused on thoroughly evaluating methods that use marginal effects at GWAS significant variants to estimate heterogeneity. Marginal effects are frequently used to compare effect sizes across populations or across studies4,14,15,28 and enjoy popularity for their simplicity and requirement of only GWAS summary statistics (estimated effect sizes and standard errors).
We first note that heterogeneities in marginal effects can be induced due to different LD patterns across ancestries even when the underlying causal effects are identical, especially when multiple causal variants are nearby in the same LD block (Fig. 4). We investigate the extent of heterogeneity by ancestry that can be induced in simulations with identical causal effects across ancestries, due to (1) local ancestry adjustment; (2) unknown causal variants coupled with ancestry-specific LD patterns; (3) highly polygenic genetic architectures with multiple causal SNPs within the same LD block; (4) standard errors in estimated marginal effects across ancestries. Our following simulations were based on real imputed genotypes from African-European individuals in PAGE data (17,299 individuals, average fraction of African ancestries 78%).
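The first point, that differing LD alone can produce different marginal effects at a tag SNP, can be illustrated with a short simulation. The sketch uses standardized Gaussian variables in place of genotypes and a single causal variant with the same effect in both ancestry backgrounds; these are simplifying assumptions, and the function name and LD values are hypothetical.

```python
import numpy as np

def simulate_marginal_effect(ld_corr, beta_causal=1.0, n=200_000, seed=0):
    """Marginal regression slope of the phenotype on a tag variant.

    The tag variant has correlation `ld_corr` with the single causal variant;
    both are standardized Gaussian stand-ins for genotypes.
    """
    rng = np.random.default_rng(seed)
    causal = rng.normal(size=n)
    tag = ld_corr * causal + np.sqrt(1.0 - ld_corr**2) * rng.normal(size=n)
    y = beta_causal * causal + rng.normal(size=n)
    return np.cov(tag, y)[0, 1] / np.var(tag, ddof=1)

# Same causal effect, different LD between tag and causal variant by ancestry.
marginal_anc1 = simulate_marginal_effect(ld_corr=0.9, seed=1)
marginal_anc2 = simulate_marginal_effect(ld_corr=0.3, seed=2)
```

Under these assumptions the expected marginal effect at the tag variant is the causal effect multiplied by the LD correlation, so the two ancestry backgrounds yield clearly different marginal effects even though the causal effect is identical.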
Regressing out local ancestry can deflate the observed similarity in causal effects across ancestries.
We first discuss the use of local ancestry in heterogeneity estimation, which is a unique and important component to consider when studying admixed populations. We used simulations to investigate the role of local ancestry adjustment using three main approaches: (1) ignoring local ancestry altogether (‘w/o’); (2) including local ancestry as a covariate in the model (‘lanc-included’); (3) regressing out local ancestry from the phenotype, followed by heterogeneity estimation on the residuals (‘lanc-regressed’) (Methods). First, in null simulations with identical causal effects across ancestries, we observed that ignoring local ancestry or including local ancestry as a covariate yielded well-calibrated HET tests; in contrast, regressing out the local ancestry effect induced inflated HET test statistics (Fig. 5 and Supplementary Table 12). Next, in power simulations with varying amounts of heterogeneity (defined through the ratio of causal effects between the two ancestries), including local ancestry as a covariate reduced the power of the HET test by up to 50% at high magnitudes of heterogeneity (Fig. 5 and Supplementary Table 12; see more details in Supplementary Note). Thus, with respect to local ancestry, we recommend either not using it or including it as a covariate in the model, and we recommend against regressing out its effect before heterogeneity estimation, as that will bias heterogeneity estimation.
Having investigated the role of local ancestry adjustment, we next turn to heterogeneity estimation for GWAS SNPs. We focused on investigating properties of the HET test and Deming regression in null simulations with identical causal effects across ancestries. Since the true causal variants are usually uncertain, we investigated each method either at the true simulated causal variants or at the LD-clumped variants (Methods).
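As a rough illustration of what a per-SNP heterogeneity test does, the sketch below fits ancestry-specific effects jointly by least squares and applies a Wald test to their difference. This is a simplified stand-in and not the heterogeneity score test (HET) used in the analyses; the function name, the simulated frequencies and the effect sizes are hypothetical.

```python
import numpy as np
from scipy.stats import chi2

def wald_heterogeneity_test(y, g1, g2, covars=None):
    """Wald test of equal ancestry-specific effects at one SNP.

    y      : phenotype vector
    g1, g2 : ancestry-specific allele counts
    covars : optional (n, k) covariate matrix, e.g. local ancestry dosage
    """
    n = len(y)
    parts = [np.ones(n), g1, g2] + ([covars] if covars is not None else [])
    X = np.column_stack(parts)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - X.shape[1])
    cov = sigma2 * np.linalg.inv(X.T @ X)
    contrast = np.zeros(X.shape[1])
    contrast[1], contrast[2] = 1.0, -1.0          # tests beta1 - beta2 = 0
    stat = (contrast @ beta) ** 2 / (contrast @ cov @ contrast)
    return beta[1], beta[2], chi2.sf(stat, df=1)

# Toy data: one SNP, 80% ancestry-1 haplotypes, strongly heterogeneous effects.
rng = np.random.default_rng(2)
n = 5000
anc = rng.binomial(1, 0.8, size=(2, n))           # 1 = ancestry 1, 0 = ancestry 2
hap = rng.binomial(1, 0.3, size=(2, n))
g1 = (hap * (anc == 1)).sum(axis=0).astype(float)
g2 = (hap * (anc == 0)).sum(axis=0).astype(float)
lanc = anc.sum(axis=0).astype(float)              # local ancestry dosage

y = 0.5 * g1 + 0.0 * g2 + rng.normal(size=n)
b1, b2, p_het = wald_heterogeneity_test(y, g1, g2, covars=lanc[:, None])
```

Passing the local ancestry dosage through `covars` corresponds to including it as a covariate in the model; regressing it out of the phenotype beforehand would be a different procedure.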
Uncertainty in which variants are causal can deflate the observed similarity in effects by ancestry.
We first performed simulations with single causal variant: we randomly selected one SNP as causal in each simulation. Evaluated at the causal SNPs (Methods), we found that HET test and Deming slope were well-calibrated (Fig. 6a-c, Extended Data Fig. 7 and Supplementary Table 13). However, evaluated at the clumped variants, as a more realistic setting (because causal variants need to be inferred), we found HET test became increasingly miscalibrated with increased , while Deming slope remained relatively robust (with an upward but not statistically significant trend with increasing ). Ordinary least squares (OLS) slope had bias even when evaluated at causal variants because of its ignorance of the standard errors in the estimated effects (Methods and Supplementary Note); such bias became smaller with increased .
High polygenicity can deflate the observed similarity in effects by ancestry.
Next, we performed simulations where multiple causal variants locate nearby within the same LD block (typical for polygenic complex traits37,38; Methods). In this scenario, marginal GWAS effects could tag multiple causal effects, thus potentially inflating the observed heterogeneity (Fig. 4c). In simulations, we varied the number of causal SNPs from 0.25 to 4.0 per Mb to span most polygenic architectures. In contrast to simulations with a single causal variant, all three methods (HET test, Deming slope and OLS slope) were biased in the presence of multiple nearby causal variants; the miscalibration/bias increased with number of causal variants per region, and LD clumping did not alleviate the miscalibration/bias (Fig. 6d-f). Such miscalibrations occurred irrespective of sample size (Extended Data Fig. 8), or simulated heritability (Supplementary Table 14).
In summary, we find that methods for heterogeneity-by-ancestry estimation based on marginal GWAS SNP effects are susceptible to inflated estimates of heterogeneity. HET test is susceptible to false positives when causal variants are unknown. Deming regression was robust in scenarios with low polygenicity, but was still susceptible to inflated estimates of heterogeneity for highly polygenic traits; the inflated estimates can be explained by differential tagging of causal effects across ancestries among causal SNPs. OLS slope had bias because it did not account for uncertainty in estimated effects. We also performed additional simulations with less than identical causal effects and broader range of per-SNP and we determined Deming regression was robust to quantify the heterogeneity level at the marginal effects in simulations of different , (Extended Data Fig. 9 and Supplementary Table 15).
Discussion
In this work, we developed a polygenic method that models genome-wide causal effects on complex traits in admixed individuals. We determined that causal effects are largely similar across local ancestries in an analysis of 53,001 African-European admixed individuals across 38 complex traits in PAGE, UKBB and AoU. In addition to causal effects, we also replicated this consistency by ancestry for marginal effects at GWAS loci. We highlighted realistic simulation scenarios where regression-based methods using marginal effects can report false heterogeneity when causal effects are identical across ancestries.
Our study has several implications for future genetic studies of admixed populations and, more broadly, of ancestrally diverse individuals. First, reduced accuracy of polygenic scores has been observed in African-European admixed populations with increasing proportion of non-European ancestries21; our results suggest that differences in causal effects make a limited contribution to this reduced accuracy. Second, there has been recent work on incorporating local ancestry in statistical modeling of admixed populations, for example, in association testing19 and polygenic scores21,22, based on the hypothesis that effects may differ across ancestries. Our results indicate largely consistent causal effects across local ancestries (and also consistent marginal effects at most GWAS loci). The robustness of our results to imperfect tagging also suggests that imperfect tagging induces limited effects heterogeneity across local ancestries, once SNPs are properly modeled in a polygenic model. The small heterogeneity-by-ancestry in causal effects or marginal effects suggests that association tests that do not model heterogeneity-by-ancestry should be preferred in most cases19,20 for improved statistical power. On the other hand, including local ancestry in association models could be useful in correcting for LD induced by admixture39 and could lead to improved causal effect estimation. Full consideration of incorporating local ancestry in statistical models should also take into account the extent of confounding and heterogeneity in the data40. Third, our study further motivates studies of ancestrally diverse individuals to identify population-specific risk variants that cannot be investigated in European individuals because they are rare; for example, inclusion of individuals from diverse populations could further disentangle causal from tagging effects, thus increasing the power of heterogeneity-by-ancestry estimation. More importantly, larger and more robust trans-ancestry studies may allow for the examination of differential causal effects on a locus-by-locus basis, in addition to the genome-wide approach presented in this work.
Our results add to the existing literature to further delineate sources of causal effects differences. Previous works have shown moderate causal effects differences across transcontinental populations5,6,8,28, with part of differences being induced by heterogeneity in the definition of environment/phenotype across continental ancestries. Similarly, a recent work15 concluded differences between causal effects in European local ancestries within African American admixed individuals and that in European American individuals. Our results showcase that, if environments are well controlled (as is the case for genetic variants across local ancestries within admixed populations), causal effects are highly similar across genetic ancestries, agreeing with a recent study finding similar effects across ancestries at level of gene expression in controlled environments41. Moreover, our results suggest that local epistatic interaction, if any, does not lead to large causal effects differences across genetic ancestries. By contrasting the high genetic correlation within admixed populations and the low genetic correlation across continental populations, our results support the hypothesis that different environments modify the genetic effects to complex traits (gene-by-environment interaction) across populations.
We note several limitations and future directions of our work. First, we have analyzed SNPs with MAF ≥0.5% in both ancestries. We excluded population-specific SNPs (with MAF <0.5% in one of the ancestries) because these SNPs provide little information for estimating the correlation of causal effects across local ancestries ($r_\text{admix}$), since their effects are estimated with large noise. We used simulations to show that omitting these rare variants could lead to downward bias in estimation because of population-specific tagging of shared causal variants (Supplementary Note). However, it remains possible that causal variants themselves are rare and population-specific, in which case upward bias in the estimation of $r_\text{admix}$ may be present. While in this work we focused on estimating $r_\text{admix}$ for common variants, future work with larger sample sizes is needed to further investigate the impact of population-specific causal SNPs on its estimation. Second, we have considered two-way African-European admixed individuals. Several practical considerations remain before applying this method to other admixed populations, such as those with three-way admixture: local ancestries are typically inferred with larger errors42, and this should be accounted for in statistical modeling (it may be possible to incorporate posterior probabilities of the estimated local ancestries to obtain calibrated estimates); additional parameters also need to be estimated (for example, three pairwise correlation parameters across ancestries for three-way admixed populations). We note that our methods can be readily applied to these populations when reliable local ancestry calls can be obtained. Third, our modeling can be extended to estimate correlations in causal effects stratified by functional annotation categories, and we leave that as future work.
Fourth, our polygenic method requires individual-level genotype and phenotype data; if these are not available, we found that Deming regression may be applied, with caution, to evaluate heterogeneity: in our simulations, Deming regression was the only method robust to most scenarios, except for high polygenicity. In our analysis of marginal effects, we found that LD clumping can produce clusters of nearby SNPs that are probably dependent on each other, as a combined result of multiple causal variants within a region and long-range LD in admixed populations. Such dependence may induce bias for methods such as Deming regression, highlighting the need for improved methods of identifying conditionally independent SNPs in admixed populations. Fifth, we have meta-analyzed three publicly available studies (PAGE, UKBB and AoU) with large cohorts of African-European admixed individuals. This meta-analysis, with its greatly increased total sample size, enabled us to conclude that causal effects are highly similar by local ancestry across a broad range of traits. However, our estimates for each individual trait were still associated with large standard errors and can be further improved by analyzing more individuals. Additional limitations are discussed in the Supplementary Note. Despite these limitations, our study has shown that causal effects on complex traits are highly similar across local ancestries, and this knowledge can be used to guide future genetic studies of ancestrally diverse populations.
Methods
Ethical approval
This research complies with all relevant ethical regulations. The ethics committee/institutional review board (IRB) of PAGE gave ethical approval for collection of PAGE data. The ethics committee/IRB of UKBB gave ethical approval for collection of UKBB data (https://www.ukbiobank.ac.uk/learn-more-about-uk-biobank/about-us/ethics). Approval to use UKBB individual-level data in this work was obtained under application 33297 at http://www.ukbiobank.ac.uk. The ethics committee/IRB of AoU gave ethical approval for collection of AoU data (https://allofus.nih.gov/about/who-we-are/institutional-review-board-irb-of-all-of-us-research-program). Approval to use AoU controlled tier data in this work was obtained through application at https://www.researchallofus.org.
Statistical model of phenotype for admixed individuals
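For individual $i$ and SNP $s$, we denote $x^{M}_{i,s}$, $x^{P}_{i,s} \in \{0, 1\}$ as the number of minor alleles at the maternal and paternal haplotypes, respectively. We denote the corresponding local ancestries as $\gamma^{M}_{i,s}$, $\gamma^{P}_{i,s} \in \{1, 2\}$ (we focus on two-way admixture here; for example, ‘1’ and ‘2’ denote African and European ancestries for African-European admixture). Then we use $g_{i,s,1}$, $g_{i,s,2}$ to encode allele counts that are specific to each local ancestry:
$$g_{i,s,k} = x^{M}_{i,s}\,\mathbb{1}[\gamma^{M}_{i,s} = k] + x^{P}_{i,s}\,\mathbb{1}[\gamma^{P}_{i,s} = k], \quad k = 1, 2,$$
where $\mathbb{1}[\cdot]$ denotes the indicator function. Denoting causal allelic effects as $\beta_{s,1}$, $\beta_{s,2}$ for the two ancestries, we model the phenotype of each individual as
$$y_i = c_i^\top \alpha + \sum_{s=1}^{S} \left( g_{i,s,1}\beta_{s,1} + g_{i,s,2}\beta_{s,2} \right) + \epsilon_i, \quad i = 1, \dots, N, \tag{1}$$
where $c_i$, $\alpha$ denote covariates (including all ‘1’ intercepts) and their effects, and $\epsilon_i$ denotes environmental noise. By further aggregating $g_{i,s,1}$, $g_{i,s,2}$ into matrices $G_1$ and $G_2$ (each of dimension $N \times S$) for ancestry 1 and 2, and $c_i$ into $C$, equation (1) becomes
$$y = C\alpha + G_1\beta_1 + G_2\beta_2 + \epsilon. \tag{3}$$
We pose the following distributional assumptions: $\epsilon_i \sim \mathcal{N}(0, \sigma_e^2)$ for $i = 1, \dots, N$, and
$$\begin{bmatrix} \beta_{s,1} \\ \beta_{s,2} \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix},\ \tau_s \begin{bmatrix} \sigma_g^2 & \rho_g \\ \rho_g & \sigma_g^2 \end{bmatrix} \right), \quad s = 1, \dots, S, \tag{4}$$
where $\sigma_g^2$ denotes the variance of effects for both populations, $\rho_g$ denotes the covariance capturing similarity of effect sizes by ancestry, and $\sigma_e^2$ denotes the variance of environmental noise. $\tau_s$ denote SNP-specific parameters (fixed a priori) for the effect size distribution (see ‘Specifying τs under different heritability models’ below). We define the correlation of causal genetic effects as $r_\text{admix} = \rho_g / \sigma_g^2$. $r_\text{admix} = 1$ indicates $\beta_{s,1} = \beta_{s,2}$ for all variants $s$, that is, causal effects are the same across ancestries; $r_\text{admix} < 1$ indicates differences in causal effects across ancestries.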
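The ancestry-specific encoding of allele counts described above can be sketched in a few lines. This is a minimal illustration rather than the implementation used in the study; the function name `ancestry_specific_counts` and the toy arrays are hypothetical, and haplotypes and local ancestries are assumed to be already phased and stored as individual-by-SNP arrays.

```python
import numpy as np

def ancestry_specific_counts(hap_m, hap_p, lanc_m, lanc_p):
    """Split phased minor-allele counts by local ancestry.

    hap_m, hap_p   : (N, S) arrays of 0/1 minor-allele indicators on the
                     maternal and paternal haplotypes.
    lanc_m, lanc_p : (N, S) arrays of local ancestry labels (1 or 2).
    Returns G1, G2 : (N, S) arrays with entries in {0, 1, 2}.
    """
    hap_m, hap_p = np.asarray(hap_m), np.asarray(hap_p)
    lanc_m, lanc_p = np.asarray(lanc_m), np.asarray(lanc_p)
    # An allele contributes to ancestry k only if its haplotype has ancestry k.
    G1 = hap_m * (lanc_m == 1) + hap_p * (lanc_p == 1)
    G2 = hap_m * (lanc_m == 2) + hap_p * (lanc_p == 2)
    return G1, G2

# Toy example (hypothetical data): 2 individuals, 3 SNPs.
hap_m = np.array([[1, 0, 1], [0, 1, 1]])
hap_p = np.array([[1, 1, 0], [0, 0, 1]])
lanc_m = np.array([[1, 1, 2], [2, 2, 2]])
lanc_p = np.array([[2, 1, 2], [1, 1, 2]])
G1, G2 = ancestry_specific_counts(hap_m, hap_p, lanc_m, lanc_p)
```

By construction, the two ancestry-specific counts always sum to the usual genotype dosage, so a model that forces equal effects across ancestries reduces to a standard additive model.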
Calculating and filtering by ancestry-specific allele frequencies.
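For each SNP $s$, we calculated the MAF as $f_s = \sum_i (g_{i,s,1} + g_{i,s,2}) / (2N)$, where $g_{i,s,k}$ is the number of minor alleles that individual $i$ carries on haplotypes of local ancestry $k$ and $N$ is the number of individuals. We also calculated the ancestry-specific MAF as $f_{s,k} = \sum_i g_{i,s,k} / \sum_i n_{i,s,k}$ for ancestries $k = 1$ and $2$, where $n_{i,s,k} \in \{0, 1, 2\}$ is the number of haplotypes of individual $i$ with local ancestry $k$ at SNP $s$. For a SNP with close-to-zero frequency in either ancestry, its effect will be estimated with very large noise. Therefore, we used SNPs with MAF >0.5% in both ancestries in analyses.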
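The ancestry-specific frequency filter can be sketched as follows. This is an illustrative sketch under stated assumptions: the function name `ancestry_specific_maf` is hypothetical, the inputs are phased 0/1 haplotype arrays with matching local ancestry labels, and the frequency is folded with `min(f, 1 - f)` so that an allele that is minor overall but major within one ancestry is still treated as a minor-allele frequency.

```python
import numpy as np

def ancestry_specific_maf(hap_m, hap_p, lanc_m, lanc_p, ancestry):
    """Minor-allele frequency among haplotypes of one local ancestry, per SNP."""
    haps = np.concatenate([hap_m, hap_p], axis=0)              # (2N, S)
    on_anc = np.concatenate([lanc_m, lanc_p], axis=0) == ancestry
    n_hap = on_anc.sum(axis=0)                # haplotypes of this ancestry
    n_minor = (haps * on_anc).sum(axis=0)     # alleles carried on them
    # SNPs with no haplotype of this ancestry get frequency 0 and are dropped.
    freq = np.divide(n_minor, n_hap, out=np.zeros(n_hap.shape), where=n_hap > 0)
    return np.minimum(freq, 1.0 - freq)

# Toy example (hypothetical data): 2 individuals, 2 SNPs.
hap_m = [[1, 0], [0, 0]]
hap_p = [[0, 0], [1, 1]]
lanc_m = [[1, 1], [1, 1]]   # maternal haplotypes carry ancestry 1
lanc_p = [[2, 2], [2, 2]]   # paternal haplotypes carry ancestry 2
maf1 = ancestry_specific_maf(hap_m, hap_p, lanc_m, lanc_p, ancestry=1)
maf2 = ancestry_specific_maf(hap_m, hap_p, lanc_m, lanc_p, ancestry=2)
keep = (maf1 > 0.005) & (maf2 > 0.005)   # MAF > 0.5% in both ancestries
```

In the toy data the second SNP is monomorphic on ancestry-1 haplotypes, so it is removed even though it is common on ancestry-2 haplotypes.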
Specifying τs under different heritability models.
The SNP-specific parameters $\tau_s$ model the coupling of SNP effect variance with MAF, local LD or other functional annotations. Commonly used heritability models include the GCTA43, frequency-dependent29,30, LDAK44 and S-LDSC45 models. While the heritability model is important for estimating heritability and functional enrichment of heritability33,46,47, genetic correlation estimation, the main focus of this study, has been shown to be robust to different heritability models33. In this work, we mainly used the frequency-dependent model $\tau_s \propto [f_s(1 - f_s)]^{\alpha}$ for both simulations and real data analyses (where $f_s$ is the MAF of SNP $s$ and $\alpha = -0.38$ is estimated in a meta-analysis across 25 UKBB complex traits30). For real data analysis, we additionally used the GCTA model for estimation and found that results are robust to the choice of heritability model (Extended Data Fig. 3).
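As a small illustration of how such SNP-specific weights might be computed, the sketch below implements a frequency-dependent weight of the form $[f(1-f)]^{\alpha}$ and a GCTA-style weight in which every standardized genotype has the same effect variance (equivalent to $\alpha = -1$ on the allelic scale). The function name `snp_weights` is hypothetical, and the default exponent of −0.38 is taken from the frequency-dependent model literature and should be read as an assumption of this sketch.

```python
import numpy as np

def snp_weights(maf, model="freq", alpha=-0.38):
    """Per-SNP prior variance weights (known up to a constant) for allelic effects.

    model="freq": weight proportional to [f (1 - f)]^alpha.
    model="gcta": weight 1 / [2 f (1 - f)], i.e. equal variance per
                  standardized genotype (alpha = -1 on the allelic scale).
    """
    maf = np.asarray(maf, dtype=float)
    het = maf * (1.0 - maf)
    if model == "freq":
        return het ** alpha
    if model == "gcta":
        return 1.0 / (2.0 * het)
    raise ValueError(f"unknown heritability model: {model}")

maf = np.array([0.1, 0.5])
tau_freq = snp_weights(maf, model="freq")
tau_gcta = snp_weights(maf, model="gcta")
```

Under both choices rarer variants receive larger allelic effect variance, but the GCTA-style weighting does so much more steeply than the frequency-dependent weighting with a mild negative exponent.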
Alternative choice of genotype normalization by ancestry.
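We discuss an alternative choice of normalization by ancestry, in which we have two parameters $\tau_{s,1}$ and $\tau_{s,2}$, one for each ancestry, for each SNP. For example, $\tau_{s,1} = [f_{s,1}(1 - f_{s,1})]^{-1}$, $\tau_{s,2} = [f_{s,2}(1 - f_{s,2})]^{-1}$ (with $f_{s,k}$ the ancestry-specific MAF), parametrizing the effects distribution
$$\begin{bmatrix} \beta_{s,1} \\ \beta_{s,2} \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix},\ \begin{bmatrix} \tau_{s,1}\sigma_g^2 & \sqrt{\tau_{s,1}\tau_{s,2}}\,\rho_g \\ \sqrt{\tau_{s,1}\tau_{s,2}}\,\rho_g & \tau_{s,2}\sigma_g^2 \end{bmatrix} \right).$$
This implies that effects per genotype standard deviation are being modeled (ref.5 termed this the correlation of allelic impact). While genetic correlation estimation is robust to genotype standardization (Supplementary Table 8; refs. 5,33), we recommend modeling allelic effects via the same $\tau_s$ across ancestries (as used in our default method).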
Evaluation of genome-wide genetic effects consistency
We discuss parameter estimation and hypothesis testing in equations (3) and (4). Marginalizing over the random effects $\beta_1$ and $\beta_2$ in equation (3), the distribution of $y$ is
$$y \sim \mathcal{N}\left( C\alpha,\ \sigma_g^2 \left( G_1 T G_1^\top + G_2 T G_2^\top \right) + \rho_g \left( G_1 T G_2^\top + G_2 T G_1^\top \right) + \sigma_e^2 I \right),$$
where $T$ is a diagonal matrix with $T_{ss} = \tau_s$. By denoting $K_1 = G_1 T G_1^\top + G_2 T G_2^\top$, $K_2 = G_1 T G_2^\top + G_2 T G_1^\top$, and $r_\text{admix} = \rho_g / \sigma_g^2$, the distribution of $y$ is simplified as
$$y \sim \mathcal{N}\left( C\alpha,\ \sigma_g^2 \left( K_1 + r_\text{admix} K_2 \right) + \sigma_e^2 I \right). \tag{5}$$
The maximum likelihood estimates of $(\sigma_g^2, r_\text{admix}, \sigma_e^2)$ can be found by directly maximizing the corresponding likelihood function. However, the constraint that the correlation parameter is bounded above by 1 cannot be easily incorporated here. Instead, we use the profile likelihood and perform a grid search over $r_\text{admix}$ to maximize it (similar to ref.30): for each candidate $r_\text{admix}$, we compute $K_1 + r_\text{admix} K_2$, and solve for the single variance component model in equation (5) using GCTA27 (v1.94.0beta). In practice, we calculate the profile likelihood for a predefined grid of $r_\text{admix}$ values (restricting $r_\text{admix}$ to this grid is a reasonable prior assumption here; we alternatively used an extended range of $r_\text{admix}$ in simulation studies (Supplementary Table 4) and real data analyses (Extended Data Fig. 2)). We use a natural cubic spline to interpolate the pairs of candidate $r_\text{admix}$ and profile log-likelihood values to get a likelihood curve of $r_\text{admix}$. Then we obtain the estimate $\hat{r}_\text{admix}$ as the value that maximizes the likelihood curve, and a credible interval by combining the likelihood curve with a uniform prior on $r_\text{admix}$ and calculating the highest posterior density interval. To perform the meta-analysis across independent estimates, we obtain the joint likelihood by calculating the product of likelihood curves across estimates (or equivalently, the sum of log-likelihood curves), and similarly calculate the estimate and credible interval.
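The post-processing of the profile likelihood (spline interpolation, point estimate, highest posterior density interval and meta-analysis) can be sketched as below. This is not the code used in the study: the function names, the 21-point grid on [0, 1], and the toy quadratic log-likelihood are all assumptions made for illustration, and the log-likelihood values would in practice come from fitting the single variance component model at each grid point.

```python
import numpy as np
from scipy.interpolate import CubicSpline

def summarize_profile_likelihood(r_grid, loglik, level=0.95, n_fine=2001):
    """Point estimate and HPD credible interval from a profile log-likelihood."""
    spline = CubicSpline(r_grid, loglik, bc_type="natural")  # natural cubic spline
    r_fine = np.linspace(r_grid[0], r_grid[-1], n_fine)
    ll = spline(r_fine)
    r_hat = r_fine[np.argmax(ll)]
    # Uniform prior on the grid range: posterior is proportional to likelihood.
    post = np.exp(ll - ll.max())
    post /= post.sum()
    # HPD set: keep the highest-density points until they hold `level` mass.
    order = np.argsort(post)[::-1]
    n_keep = np.searchsorted(np.cumsum(post[order]), level) + 1
    kept = r_fine[order[:n_keep]]
    return r_hat, (kept.min(), kept.max())

def meta_analyze(r_grid, loglik_curves, level=0.95):
    """Combine independent estimates by summing their log-likelihood curves."""
    return summarize_profile_likelihood(r_grid, np.sum(loglik_curves, axis=0), level)

# Toy example: a quadratic log-likelihood peaked at 0.9 (hypothetical values).
r_grid = np.linspace(0.0, 1.0, 21)
loglik = -0.5 * ((r_grid - 0.9) / 0.05) ** 2
r_hat, (lo, hi) = summarize_profile_likelihood(r_grid, loglik)
r_meta, (lo_meta, hi_meta) = meta_analyze(r_grid, [loglik, loglik])
```

Reporting the minimum and maximum of the retained points as the interval assumes a unimodal curve; summing two identical curves doubles the curvature and therefore narrows the interval.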
Evaluation of genetic effects consistency at individual variant with marginal effects
Parameter estimation and hypothesis testing.
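We use a model between an individual SNP and the phenotype by restricting equation (1) to the SNP of interest $s$, as
$$y_i = c_i^\top \alpha + g_{i,s,1}\beta^{(m)}_{s,1} + g_{i,s,2}\beta^{(m)}_{s,2} + \epsilon_i,$$
or in vector form,
$$y = C\alpha + g_{s,1}\beta^{(m)}_{s,1} + g_{s,2}\beta^{(m)}_{s,2} + \epsilon, \tag{6}$$
where $y$, $g_{s,1}$, $g_{s,2}$, $\epsilon$ contain $y_i$, $g_{i,s,1}$, $g_{i,s,2}$, $\epsilon_i$ for all individuals $i$, respectively. We distinguish the marginal effects $\beta^{(m)}_{s,1}$, $\beta^{(m)}_{s,2}$ in equation (6) from the causal effects $\beta_{s,1}$, $\beta_{s,2}$ in equation (1): marginal effects tag effects from nearby causal SNPs, with taggability being a function of the ancestry-specific correlation between the focal SNP and nearby causal SNPs. Therefore, heterogeneity in marginal effects by local ancestry can be induced even if causal effects are the same (see extensive simulations in Results and more details in Supplementary Note). We estimate $\beta^{(m)}_{s,1}$, $\beta^{(m)}_{s,2}$ using least squares (jointly for the two effects) and perform hypothesis testing of $H_0: \beta^{(m)}_{s,1} = \beta^{(m)}_{s,2}$ with a likelihood ratio test by comparing equation (6) to a restricted model where the allelic effects are the same, $\beta^{(m)}_{s,1} = \beta^{(m)}_{s,2} = \beta^{(m)}_{s}$:
$$y = C\alpha + (g_{s,1} + g_{s,2})\beta^{(m)}_{s} + \epsilon. \tag{7}$$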
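The heterogeneity (HET) test for a single SNP can be sketched as a comparison of two least-squares fits. The sketch below assumes Gaussian noise with the residual variance profiled out, so that the likelihood ratio statistic is $N \log(\mathrm{RSS}_0 / \mathrm{RSS}_1)$ and is referred to a chi-squared distribution with one degree of freedom; the function name `het_test` and the simulated data are hypothetical.

```python
import math
import numpy as np

def het_test(y, g1, g2, C):
    """Likelihood ratio test of equal marginal effects across local ancestries."""
    def fit(X):
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        return float(resid @ resid), beta

    rss_full, beta_full = fit(np.column_stack([C, g1, g2]))   # separate effects
    rss_null, _ = fit(np.column_stack([C, g1 + g2]))          # one shared effect
    lrt = len(y) * math.log(rss_null / rss_full)
    # Survival function of chi-squared with 1 d.f.: P(Z^2 > x) = erfc(sqrt(x / 2)).
    pval = math.erfc(math.sqrt(max(lrt, 0.0) / 2.0))
    return beta_full[-2], beta_full[-1], lrt, pval

# Toy data (hypothetical): opposite effects on the two ancestry backgrounds.
rng = np.random.default_rng(0)
n = 2000
n_anc1 = rng.binomial(2, 0.8, size=n)        # haplotypes with ancestry 1
g1 = rng.binomial(n_anc1, 0.3)               # minor alleles on ancestry 1
g2 = rng.binomial(2 - n_anc1, 0.2)           # minor alleles on ancestry 2
C = np.ones((n, 1))                          # intercept only
y = 0.5 * g1 - 0.5 * g2 + rng.normal(size=n)
b1, b2, lrt, pval = het_test(y, g1, g2, C)
```

Because the restricted model is nested in the full model, the statistic is non-negative by construction; with strongly discordant simulated effects the test rejects decisively.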
Marginal effects-based methods for estimating heterogeneity.
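We describe details of the marginal effects-based methods to estimate heterogeneity, which take as input a set of estimated marginal effect sizes $\hat{\beta}^{(m)}_{s,1}$, $\hat{\beta}^{(m)}_{s,2}$ and corresponding estimated standard errors $\widehat{\mathrm{se}}_{s,1}$, $\widehat{\mathrm{se}}_{s,2}$ for a set of SNPs.
Pearson correlation: obtained by calculating the Pearson correlation of $\hat{\beta}^{(m)}_{s,1}$, $\hat{\beta}^{(m)}_{s,2}$ across SNPs. Pearson correlation does not model errors in the estimated effects; it is therefore expected to be smaller than 1 and to decrease with increasing error magnitude.
OLS regression slope: obtained by regressing $\hat{\beta}^{(m)}_{s,2} \sim \hat{\beta}^{(m)}_{s,1}$ ($\hat{\beta}^{(m)}_{s,2}$ as dependent variable, $\hat{\beta}^{(m)}_{s,1}$ as independent variable) or $\hat{\beta}^{(m)}_{s,1} \sim \hat{\beta}^{(m)}_{s,2}$. It does not model errors in the independent variable. Moreover, it assumes homogeneous errors in the dependent variable across SNPs. Therefore, it is susceptible to these error terms, and notably results can vary when one exchanges the regression order48 ($\hat{\beta}^{(m)}_{s,2} \sim \hat{\beta}^{(m)}_{s,1}$ versus $\hat{\beta}^{(m)}_{s,1} \sim \hat{\beta}^{(m)}_{s,2}$; for example, $\hat{\beta}^{(m)}_{s,1}$ and $\hat{\beta}^{(m)}_{s,2}$ are associated with different standard errors when estimated in an admixed population with unequal ancestry proportions).
Deming regression slope: obtained with Deming regression34 of $\hat{\beta}^{(m)}_{s,1}$, $\hat{\beta}^{(m)}_{s,2}$ and estimated standard errors $\widehat{\mathrm{se}}_{s,1}$, $\widehat{\mathrm{se}}_{s,2}$. Deming regression models heterogeneous error terms in both the independent and dependent variables, and is therefore more robust than Pearson correlation and OLS regression. Specifically, given a set of data $(x_j, y_j)$ and estimated standard errors $(\sigma_{x,j}, \sigma_{y,j})$ for SNPs $j = 1, \dots, n$ (we use a different set of notations for simplicity), Deming regression optimizes the following objective function to obtain the estimated intercept $\hat{\beta}_0$ and slope $\hat{\beta}_1$:
$$(\hat{\beta}_0, \hat{\beta}_1) = \arg\min_{\beta_0, \beta_1} \min_{x^*_1, \dots, x^*_n} \sum_{j=1}^{n} \left[ \frac{(y_j - \beta_0 - \beta_1 x^*_j)^2}{\sigma_{y,j}^2} + \frac{(x_j - x^*_j)^2}{\sigma_{x,j}^2} \right].$$
Standard errors of $\hat{\beta}_0$, $\hat{\beta}_1$ can be obtained with bootstrapping.
Notably, the Deming regression slope is symmetric with respect to the regression order (the slopes obtained under the two orders are reciprocals of each other). However, Deming regression can still produce biased results when the standard errors $\widehat{\mathrm{se}}_{s,1}$, $\widehat{\mathrm{se}}_{s,2}$ are misspecified48.
False positive rate of the HET test, as described above in ‘Parameter estimation and hypothesis testing’. It is expected to be well calibrated under the null because it is derived as a likelihood ratio test. Similar to Deming regression, the HET test properly models heterogeneous standard errors.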
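A Deming-type regression with SNP-specific standard errors on both axes can be sketched as follows. The sketch uses the standard result that, for a fixed intercept and slope, minimizing over the latent true values leaves the weighted residual sum $\sum_j (y_j - \beta_0 - \beta_1 x_j)^2 / (\sigma_{y,j}^2 + \beta_1^2 \sigma_{x,j}^2)$, which is then minimized numerically. The function names, the Nelder-Mead optimizer and the simulated inputs are choices made for this illustration and are not taken from the study's code.

```python
import numpy as np
from scipy.optimize import minimize

def deming_regression(x, y, se_x, se_y):
    """Errors-in-variables line fit with heterogeneous errors on both axes."""
    x, y, se_x, se_y = (np.asarray(a, dtype=float) for a in (x, y, se_x, se_y))

    def objective(params):
        b0, b1 = params
        # Latent true x values are profiled out analytically.
        return np.sum((y - b0 - b1 * x) ** 2 / (se_y ** 2 + b1 ** 2 * se_x ** 2))

    start = np.polyfit(x, y, 1)[::-1]          # OLS start: [intercept, slope]
    res = minimize(objective, start, method="Nelder-Mead",
                   options={"xatol": 1e-10, "fatol": 1e-12, "maxiter": 10000})
    return res.x[0], res.x[1]

def bootstrap_slope_se(x, y, se_x, se_y, n_boot=200, seed=0):
    """Bootstrap standard error of the slope, resampling SNPs with replacement."""
    rng = np.random.default_rng(seed)
    x, y, se_x, se_y = (np.asarray(a, dtype=float) for a in (x, y, se_x, se_y))
    slopes = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(x), size=len(x))
        slopes.append(deming_regression(x[idx], y[idx], se_x[idx], se_y[idx])[1])
    return float(np.std(slopes, ddof=1))

# Toy data (hypothetical): identical true effects, noisy estimates on both axes.
rng = np.random.default_rng(1)
truth = rng.normal(size=500)
se = np.full(500, 0.3)
x_obs = truth + rng.normal(scale=0.3, size=500)
y_obs = truth + rng.normal(scale=0.3, size=500)
_, slope_yx = deming_regression(x_obs, y_obs, se, se)
_, slope_xy = deming_regression(y_obs, x_obs, se, se)
slope_se = bootstrap_slope_se(x_obs, y_obs, se, se, n_boot=20)
```

Swapping the roles of the two variables leaves the objective unchanged after reparameterization, which is why the two fitted slopes are reciprocals of each other up to optimizer tolerance.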
Genotype data processing
PAGE genotype.
We analyzed 17,299 genotyped individuals who self-identified as African American in the PAGE study1. These individuals were from three studies: Women’s Health Initiative (N = 6,820), Multi-ethnic Cohort (N = 5,325) and the Icahn School of Medicine at Mount Sinai BioMe biobank in New York City (BioMe) (N = 5,154). See more details in ref.1. The genotypes were imputed to the TOPMed reference panel, and we retained well-imputed SNPs (those passing the imputation-quality threshold) with MAF >0.5%. We further retained variants with ancestry-specific MAF >0.5% in both ancestries. This resulted in ~6.9 million variants and 17,299 individuals in our analysis.
UKBB genotype.
We analyzed individuals with African-European admixed ancestries in UKBB. We first inferred the proportion of ancestries for each individual in UKBB using SCOPE49 (https://github.com/sriramlab/SCOPE; version 6 December 2021), supervised using 1,000 Genomes Phase 3 allele frequencies (AFR, EUR, EAS and SAS). We retained 4,327 African-European admixed individuals with more than 5% of both AFR and EUR ancestries, and with less than 5% of both EAS and SAS ancestries. We retained well-imputed SNPs (those passing the imputation-quality threshold) with MAF >0.5%. We further retained variants with ancestry-specific MAF >0.5% in both ancestries. This resulted in ~6.6 million variants and 4,327 individuals in our analysis.
AoU genotype.
We analyzed individuals with African-European admixed ancestries in AoU. We first performed principal component analysis of all 165,208 individuals in the AoU microarray data (release v5) jointly with the 1,000 Genomes Phase 3 reference panel. Then we identified 31,375 individuals with African-European admixed ancestries (with at least 10% European ancestry and at least 10% African ancestry, and who were within 2× normalized distance from the line connecting individuals of European ancestries and African ancestries in the 1,000 Genomes reference panel; Supplementary Note). For these individuals, we performed quality control using PLINK2 (ref.50) (v2.0a3) with --geno 0.05 --max-alleles 2 --maf 0.001, and statistical phasing using Eagle2 (ref.51) (v2.4.1) with default settings. We retained variants with ancestry-specific MAF >0.5% in both ancestries. This resulted in ~0.65 million variants and 31,375 individuals in our analysis. For AoU, we chose to use microarray data instead of whole genome sequencing data because the microarray data contained more individuals and analyzing them reduced the computational cost.
Local ancestry inference.
We performed local ancestry inference using RFMix52 (https://github.com/slowkoni/rfmix; v2) with default parameters (eight generations since admixture). We used 99 CEU individuals (Utah residents with Northern and Western European ancestry) and 108 YRI individuals (individuals from Yoruba in Ibadan, Nigeria) from the unrelated individuals in the 1,000 Genomes Project Phase 3 (ref.53) as our reference populations, similar to previous works52,54. We used HapMap3 SNPs32 in inference, and then interpolated the inferred local ancestry results to other variants in both the PAGE and UKBB data sets. The accuracy of RFMix for local ancestry inference has been validated for African-European admixed individuals19 (for example, ~98% accuracy for simulations with a realistic demographic model for African American individuals). We performed additional analyses using PAGE African American individuals to assess the robustness of local ancestry inference to an alternative set of reference data. We used all European and African individuals in the 1,000 Genomes Project (excluding African Caribbean in Barbados and African Ancestry in SW USA because they were admixed). We determined a high consistency of 98.9% between the local ancestries inferred using CEU/YRI and those inferred using all European/African individuals as reference data. We used the inferred local ancestry for both the simulation study and the real data analysis described below.
Simulation study
We describe methods for the simulations that correspond to each section of Results.
Pitfalls of including local ancestry in estimating heterogeneity.
We first describe strategies of including local ancestry in estimating heterogeneity.
For ‘lanc included’, we follow common practices17,19,39,55 to use a local ancestry term $\ell_{i,s}$ (defined above) in equation (1):
$$y_i = c_i^\top \alpha + \ell_{i,s}\beta_{\ell,s} + g_{i,s,1}\beta_{s,1} + g_{i,s,2}\beta_{s,2} + \epsilon_i,$$
where $\beta_{\ell,s}$ denotes the effect of local ancestry.
For ‘lanc regressed’, we use the residual phenotype $y - \ell_s \hat{\beta}_{\ell,s}$. We first estimate $\hat{\beta}_{\ell,s}$ in the regression of $y \sim \ell_s$, and then estimate $\beta_{s,1}$, $\beta_{s,2}$ in the regression of $(y - \ell_s \hat{\beta}_{\ell,s}) \sim g_{s,1} + g_{s,2}$.
To assess the impact of including the local ancestry term when applying the HET test, we randomly selected 1,000 SNPs on chromosome 1 from the PAGE genotypes. We simulated traits with a single causal SNP. For each SNP, we simulated a quantitative trait with that SNP as the single causal SNP, with varying ancestry-specific causal effects $\beta_{s,1}$, $\beta_{s,2}$. We scaled $\beta_{s,1}$, $\beta_{s,2}$ such that the causal SNP explained the given amount of heritability. For each SNP, simulations of $\beta_{s,1}$, $\beta_{s,2}$ and environmental noises were repeated 30 times. We then applied the different strategies of including local ancestry to these simulations and obtained the $P$ value of the HET test of $H_0: \beta_{s,1} = \beta_{s,2}$. We additionally included the top principal component as a covariate throughout. We evaluated the distribution of the FPR or power of the HET test by subsampling without replacement: we drew 100 random samples, each consisting of 500 SNPs randomly drawn from the pool of 1,000 SNPs × 30 simulations; such sampling accounts for the randomness from both the environmental noises and SNP MAF. We calculated the FPR or power for each sample of 500 SNPs, obtained empirical distributions of the FPR or power (100 points each), and then calculated the mean and SE (using the empirical standard deviation) from the empirical distribution.
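The subsampling step used to summarize false positive rate or power can be sketched as below. The function name `subsampled_rate`, the significance threshold of 0.05, and the uniformly distributed toy P values are assumptions of this sketch; only the sampling scheme (100 samples of 500 draws without replacement from the pooled SNP-by-replicate results) follows the description above.

```python
import numpy as np

def subsampled_rate(pvals, n_samples=100, sample_size=500, alpha=0.05, seed=0):
    """Mean and SE of the rejection rate over subsamples drawn without replacement.

    pvals : (n_snps, n_reps) array of HET-test P values.
    """
    rng = np.random.default_rng(seed)
    pool = np.asarray(pvals, dtype=float).ravel()   # pool SNPs x replicates
    rates = np.empty(n_samples)
    for b in range(n_samples):
        draw = rng.choice(pool, size=sample_size, replace=False)
        rates[b] = np.mean(draw < alpha)            # FPR under the null, else power
    return rates.mean(), rates.std(ddof=1)          # SE = empirical SD of the rates

# Toy null P values (hypothetical): 1,000 SNPs x 30 replicates, uniform on [0, 1].
null_pvals = np.random.default_rng(42).uniform(size=(1000, 30))
fpr_mean, fpr_se = subsampled_rate(null_pvals)
```

With calibrated null P values the mean rejection rate should sit near the nominal threshold, which is what the toy example checks.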
Simulations with single causal variant.
We performed simulations with a single causal variant to assess the properties of methods based on estimated marginal effects. We randomly selected 100 regions, each spanning 20 Mb, on chromosome 1 (approximately 120,000 SNPs per region on average, standard deviation 6,000). For each region, the causal variant was located at the middle of the region; it had the same causal effect across local ancestries and was expected to explain a fixed amount of heritability (0.2%, 0.6% and 1.0%); the sign of the causal effect and the environmental noises were randomly drawn 100 times. We evaluated four metrics at both causal variants and clumped variants; clumped variants were obtained with regular LD clumping (index $P < 5 \times 10^{-8}$, $r^2 = 0.1$, window size 10 Mb) using PLINK (v1.90b6.24): --clump --clump-p1 5e-8 --clump-p2 1e-4 --clump-r2 0.1 --clump-kb 10000. We used a 10 Mb clumping window to account for the larger LD window within admixed individuals; other parameters were adopted from ref.56. We found that, when the simulated heritability was large, LD clumping can return multiple SNPs because secondary SNPs can still reach genome-wide significance under a commonly used clumping threshold. Therefore, for each region, we either retained only the SNP with the strongest association (matching the simulation setup of a single simulated causal variant), or retained all the SNPs from the clumping results. As above, we evaluated the distribution of the four metrics by subsampling without replacement: we drew 100 random samples, each consisting of 500 regions (each region has one causal SNP) randomly drawn from the pool of 100 regions × 100 simulations; such sampling accounted for the randomness from both the environmental noises and SNP MAF. We then calculated the mean and SE from the 100 random samples.
Simulation with multiple causal variants.
We performed simulations with multiple causal variants randomly distributed on chromosome 1 (515,087 SNPs). We drew increasing numbers of causal variants, up to 1,000, to simulate different levels of polygenicity, such that on average there were approximately 0.25, 0.5, 1.0, 2.0 and 4.0 causal variants per Mb. We fixed the heritability explained by all variants on chromosome 1 at a set of values up to 20%. We performed subsampling without replacement to estimate the average and standard errors of the four metrics (each sample consisted of 1,000 SNPs, randomly drawn from SNPs across 500 simulations). We found that, when the simulated heritability was small, very few SNPs reached genome-wide significance in these simulations because of the limited sample size of the PAGE data (n = 17,299); consequently, standard errors were very large and results could not be reliably reported. Therefore, we chose to report results only for the larger heritability settings (up to 20%) in Supplementary Table 14.
Genome-wide simulation for evaluating our polygenic method.
We performed simulations to evaluate our polygenic method in terms of parameter estimation of $r_\text{admix}$ and hypothesis testing, using real genome-wide genotypes. We simulated quantitative phenotypes using genotypes and inferred local ancestries from the PAGE dataset. The phenotypes were simulated under a wide range of genetic architectures, varying the proportion of causal variants $p_\text{causal}$, heritability $h_g^2$ and true correlation $r_\text{admix}$, with a frequency-dependent effects distribution for causal variants: in each simulation, we randomly drew a proportion $p_\text{causal}$ of variants to be causal. Given the set of causal variants, we simulated quantitative phenotypes on the basis of equations (3) and (4). The environmental noises were then simulated according to the desired heritability $h_g^2$.
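The simulation design above can be sketched as follows. This is a simplified illustration, not the study's simulation code: the function name `simulate_phenotype` is hypothetical, the frequency-dependent exponent defaults to an assumed −0.38, correlated effects are built from two independent standard normal draws, and the toy genotypes are random binary matrices rather than real ancestry-specific allele counts.

```python
import numpy as np

def simulate_phenotype(G1, G2, maf, p_causal, h2, r_admix, alpha=-0.38, seed=0):
    """Simulate a quantitative trait with correlated ancestry-specific effects.

    Requires -1 <= r_admix <= 1 and 0 < h2 < 1.
    Returns the phenotype, the causal SNP indices and the (n_causal, 2) effects.
    """
    rng = np.random.default_rng(seed)
    n, s = G1.shape
    n_causal = max(1, int(round(p_causal * s)))
    causal = rng.choice(s, size=n_causal, replace=False)
    # Frequency-dependent per-SNP effect variance.
    tau = (maf[causal] * (1.0 - maf[causal])) ** alpha
    # Two effect vectors with correlation r_admix and equal variance.
    z1 = rng.normal(size=n_causal)
    z2 = r_admix * z1 + np.sqrt(1.0 - r_admix ** 2) * rng.normal(size=n_causal)
    beta = np.column_stack([z1, z2]) * np.sqrt(tau)[:, None]
    genetic = G1[:, causal] @ beta[:, 0] + G2[:, causal] @ beta[:, 1]
    # Scale the environmental noise so the genetic component explains h2.
    noise_sd = np.sqrt(genetic.var() * (1.0 - h2) / h2)
    y = genetic + rng.normal(scale=noise_sd, size=n)
    return y, causal, beta

# Toy genotypes (hypothetical): 2,000 individuals, 200 SNPs.
rng = np.random.default_rng(1)
maf = rng.uniform(0.05, 0.5, size=200)
G1 = rng.binomial(1, maf, size=(2000, 200))
G2 = rng.binomial(1, maf, size=(2000, 200))
y, causal, beta = simulate_phenotype(G1, G2, maf, p_causal=0.1, h2=0.3, r_admix=1.0)
genetic = G1[:, causal] @ beta[:, 0] + G2[:, causal] @ beta[:, 1]
```

Setting the correlation to 1 makes the two effect vectors identical, which corresponds to the null scenario of no heterogeneity by local ancestry.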
Real data analysis
Phenotype processing.
For PAGE, we analyzed 24 heritable traits based on ref.1. For UKBB, we analyzed 26 heritable traits selected on the basis of heritability and number of individuals with non-missing phenotype values, following ref.57. For AoU, we analyzed ten heritable traits, including physical measurement and lipid phenotypes, which were straightforward to phenotype and have large sample sizes. Physical measurement phenotypes were extracted from Participant Provided Information in the AoU dataset. Lipid phenotypes (including LDL, HDL, TC and TG) were extracted following https://github.com/all-of-us/ukb-cross-analysis-demo-project/tree/main/aou_workbench_siloed_analyses, including extracting the most recent measurement per person and correcting values for statin usage. These traits included both quantitative and binary traits; it was previously shown that genetic correlation methodology can be directly applied to binary traits58. For each trait, we quantile normalized phenotype values. We included age, sex, age*sex and the top ten in-sample principal components (and ‘study center’ for PAGE) as covariates. We quantile normalized each covariate and used the average of each covariate to impute missing values in covariates.
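The phenotype and covariate preprocessing can be sketched as below. The text does not specify the exact form of quantile normalization, so this sketch assumes a rank-based inverse normal transform with the $(\text{rank} - 0.5)/n$ offset; the function names are hypothetical, and ties are broken by order of appearance rather than averaged.

```python
import numpy as np
from statistics import NormalDist

def quantile_normalize(values):
    """Rank-based inverse normal transform; missing values (NaN) stay missing."""
    values = np.asarray(values, dtype=float)
    out = np.full(values.shape, np.nan)
    observed = ~np.isnan(values)
    n_obs = int(observed.sum())
    ranks = values[observed].argsort().argsort() + 1      # ranks 1..n_obs
    inv_cdf = NormalDist().inv_cdf
    out[observed] = [inv_cdf((r - 0.5) / n_obs) for r in ranks]
    return out

def mean_impute(covariates):
    """Replace missing covariate entries with the column mean."""
    cov = np.asarray(covariates, dtype=float).copy()
    col_mean = np.nanmean(cov, axis=0)
    rows, cols = np.where(np.isnan(cov))
    cov[rows, cols] = col_mean[cols]
    return cov

normalized = quantile_normalize([3.0, 1.0, np.nan, 2.0])
imputed = mean_impute([[1.0, np.nan], [3.0, 4.0]])
```

The transform depends only on ranks, so it removes skew and outliers from the trait distribution while preserving the ordering of individuals.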
Genome-wide genetic correlation estimation.
We calculated the $K_1$, $K_2$ matrices in equation (5) using either imputed SNPs and HapMap3 SNPs (for PAGE and UKBB), or microarray SNPs (for AoU). We used either the frequency-dependent or the GCTA heritability model via specifying $\tau_s$. The $K_1$, $K_2$ matrices were calculated separately for individuals within the PAGE, UKBB and AoU studies. For each given $r_\text{admix}$, we used GCTA27 (v1.94.0beta) to fit a single variance component model with the calculated $K_1 + r_\text{admix} K_2$ using gcta64 --reml --reml-no-constrain. We additionally included the causal signal at the Duffy SNP (rs2814778) in 1q23.2 as a covariate for the analysis of white blood cell count and C-reactive protein because of the known strong admixture peak59,60. Specifically, we used the local ancestries of the SNP closest to the Duffy SNP in our data as proxies for the Duffy SNP (the Duffy SNP itself is not typed or imputed in our data). The local ancestries are valid proxies because the Duffy SNP is known to be highly differentiated across ancestries (alternate allele frequency is 0.006 versus 0.964 in ref.53), and local ancestries are therefore highly correlated with the Duffy SNP. We excluded closely related individuals from the analysis (<3rd-degree relatives; using ref.61 with plink2 --king-cutoff 0.0884). We note that our meta-analysis credible interval across traits can be anti-conservative (that is, the actual coverage probability is less than the nominal coverage probability) because we did not account for the genetic correlation across traits.
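The two relationship matrices can be sketched as weighted genotype cross-products, one collecting within-ancestry terms and one collecting cross-ancestry terms. The sketch assumes this reading of the model, leaves the matrices unnormalized and uncentered, and uses the hypothetical function name `admix_kernels`; any scaling or centering applied in the actual software is not reproduced here.

```python
import numpy as np

def admix_kernels(G1, G2, tau):
    """Within-ancestry (K1) and cross-ancestry (K2) weighted cross-products.

    G1, G2 : (N, S) ancestry-specific allele counts.
    tau    : length-S per-SNP weights (the diagonal of T).
    """
    G1 = np.asarray(G1, dtype=float)
    G2 = np.asarray(G2, dtype=float)
    A1 = G1 * tau                      # same as G1 @ diag(tau)
    A2 = G2 * tau
    K1 = A1 @ G1.T + A2 @ G2.T
    K2 = A1 @ G2.T + A2 @ G1.T
    return K1, K2

# Toy inputs (hypothetical): 5 individuals, 8 SNPs.
rng = np.random.default_rng(0)
G1 = rng.integers(0, 2, size=(5, 8))
G2 = rng.integers(0, 2, size=(5, 8))
tau = rng.uniform(0.5, 2.0, size=8)
K1, K2 = admix_kernels(G1, G2, tau)
K_at_one = K1 + 1.0 * K2               # combined matrix for a correlation of 1
G = (G1 + G2).astype(float)            # ordinary genotype dosage
```

When the correlation equals 1, the combined matrix coincides with the cross-product of the ordinary genotype dosages, so the model reduces to a standard single-component model that ignores local ancestry.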
Individual trait-SNP analysis.
We evaluated the consistency of effects at individual SNPs that were significantly associated with each trait. First, we performed GWAS and LD clumping with the same parameters described above. Even though LD clumping was performed using stringent parameters, we found clusters of clumped SNPs that were probably dependent on each other, as a combined result of multiple causal variants within a region and the long-range LD in admixed populations (Supplementary Table 11 and Discussion). For each clumped trait–SNP pair, we estimated ancestry-specific effects and standard errors.
Statistical fine-mapping analysis.
We performed fine-mapping analysis for each trait–SNP pair with significant heterogeneity by ancestry using SuSiE62 (v0.12) (for PAGE and UKBB, for which we used genotype data with high SNP density). For each trait–SNP pair, we included all imputed SNPs in a 3 Mb window. We ran SuSiE with individual-level genotype and phenotype data (covariates were regressed out of genotype and phenotype), using default settings with a maximum of ten non-zero effects. We obtained posterior inclusion probabilities and credible sets.
Statistics and reproducibility
We analyzed three publicly available datasets of PAGE, UKBB and AoU, and sample sizes were determined in these studies. We did not use randomization or blinding. We focused on analyzing individuals with admixed African-European ancestries, and individuals with other genetic ancestries were not included in analyses of this work. We replicate our findings across these three independent datasets.
Acknowledgements
We thank A. Price, M. J. Zhang, R. Patel, J. Pritchard, A. Durvasula, J. Cai and E. Petter for helpful suggestions. This research was funded in part by the National Institutes of Health under awards U01-HG011715 (B.P.), R01-HG009120 (B.P.), R01-MH115676 (B.P.), R01-HL151152 (C.K.), P01-CA196569 (D.V.C.) and U01-CA261339 (D.V.C.). Y.W. and S.S. were supported in part by NIH R35-GM125055 and NSF CAREER-1943497. The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript. PAGE is supported by the National Institutes of Health under awards R01-HG010297. This research was conducted using the UKBB Resource under application 33297. We thank the participants of UKBB for making this work possible. The All of Us Research Program is supported by the National Institutes of Health, Office of the Director: Regional Medical Centers: 1 OT2 OD026549; 1 OT2 OD026554; 1 OT2 OD026557; 1 OT2 OD026556; 1 OT2 OD026550; 1 OT2 OD 026552; 1 OT2 OD026553; 1 OT2 OD026548; 1 OT2 OD026551; 1 OT2 OD026555; IAA #: AOD 16037; Federally Qualified Health Centers: HHSN 263201600085U; Data and Research Center: 5 U2C OD023196; Biobank: 1 U24 OD023121; The Participant Center: U24 OD023176; Participant Technology Systems Center: 1 U24 OD023163; Communications and Engagement: 3 OT2 OD023205; 3 OT2 OD023206; and Community Partners: 1 OT2 OD025277; 3 OT2 OD025315; 1 OT2 OD025337; 1 OT2 OD025276. In addition, the All of Us Research Program would not be possible without the partnership of its participants.
Footnotes
Online content
Any methods, additional references, Nature Portfolio reporting summaries, source data, extended data, supplementary information, acknowledgements, peer review information; details of author contributions and competing interests; and statements of data and code availability are available at https://doi.org/10.1038/s41588-023-01338-6.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Competing interests
E.E.K. has received personal fees from Regeneron Pharmaceuticals, 23&Me and Illumina, and serves on the advisory boards for Encompass Biosciences, Foresite Labs and Galateo Bio. The remaining authors declare no competing interests.
Extended data is available for this paper at https://doi.org/10.1038/s41588-023-01338-6.
Supplementary information The online version contains supplementary material available at https://doi.org/10.1038/s41588-023-01338-6.
Data availability
PAGE individual-level genotype and phenotype data are available through dbGaP https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000356.v2.p1. UKBB individual-level genotype and phenotype data are available through application at https://www.ukbiobank.ac.uk/. AoU individual-level genotype and phenotype are available through application at https://www.researchallofus.org/. The set of preprocessed HapMap3 variants used in this manuscript is retrieved from https://ndownloader.figshare.com/files/25503788.
Code availability
Software implementing genome-wide genetic correlation estimation method: https://github.com/kangchenghou/admix-kit (ref. https://doi.org/10.5281/ZENODO.7482679) Code for replicating analyses: https://github.com/kangchenghou/admix-genet-cor (ref. https://doi.org/10.5281/ZENODO.7482683).
References
- 1.Wojcik GL et al. Genetic analyses of diverse populations improves discovery for complex traits. Nature 570, 514–518 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2.Bycroft C. et al. The UK Biobank resource with deep phenotyping and genomic data. Nature 562, 203–209 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.Ramirez AH et al. The All of Us Research Program: data quality, utility, and diversity. Patterns 3, 100570 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Zhou W. et al. Global Biobank Meta-analysis Initiative: Powering genetic discovery across human disease. Cell Genomics 2, 100192 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Brown BC, Ye CJ, Price AL & Zaitlen N Transethnic genetic-correlation estimates from summary statistics. Am. J. Hum. Genet 99, 76–88 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Galinsky KJ et al. Estimating cross-population genetic correlations of causal effect sizes. Genet. Epidemiol 43, 180–188 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Shi H. et al. Localizing components of shared transethnic genetic architecture of complex traits from GWAS summary data. Am. J. Hum. Genet 106, 805–817 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.Shi H. et al. Population-specific causal disease effect sizes in functionally important regions impacted by selection. Nat. Commun 12, 1098 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Kanai M. et al. Insights from complex trait fine-mapping across diverse populations. Preprint at medRxiv 10.1101/2021.09.03.21262975 (2021). [DOI] [Google Scholar]
- 10.Wang Y et al. Theoretical and empirical quantification of the accuracy of polygenic scores in ancestry divergent populations. Nat. Commun 11, 3865 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11.Martin AR et al. Clinical use of current polygenic risk scores may exacerbate health disparities. Nat. Genet 51, 584–591 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12.Gurdasani D, Barroso I, Zeggini E & Sandhu MS Genomics of disease risk in globally diverse populations. Nat. Rev. Genet 20, 520–535 (2019). [DOI] [PubMed] [Google Scholar]
- 13.Sirugo G, Williams SM & Tishkoff SA The missing diversity in human genetic studies. Cell 177, 1080 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Marigorta UM & Navarro A High trans-ethnic replicability of GWAS results implies common causal variants. PLoS Genet. 9, e1003566 (2013). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Patel RA et al. Genetic interactions drive heterogeneity in causal variant effect sizes for gene expression and complex traits. Am. J. Hum. Genet 109, 1286–1297 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16.Cai N et al. Minimal phenotyping yields genome-wide association signals of low specificity for major depression. Nat. Genet 52, 437–447 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17. Seldin MF, Pasaniuc B & Price AL New approaches to disease mapping in admixed populations. Nat. Rev. Genet 12, 523–528 (2011).
- 18. Mills MC & Rahal C The GWAS Diversity Monitor tracks diversity by disease in real time. Nat. Genet 52, 242–243 (2020).
- 19. Atkinson EG et al. Tractor uses local ancestry to enable the inclusion of admixed individuals in GWAS and to boost power. Nat. Genet 53, 195–204 (2021).
- 20. Hou K, Bhattacharya A, Mester R, Burch KS & Pasaniuc B On powerful GWAS in admixed populations. Nat. Genet 53, 1631–1633 (2021).
- 21. Bitarello BD & Mathieson I Polygenic scores for height in admixed populations. G3 10, 4027–4036 (2020).
- 22. Marnetto D et al. Ancestry deconvolution and partial polygenic score can improve susceptibility predictions in recently admixed individuals. Nat. Commun 11, 1628 (2020).
- 23. Bentley AR et al. Gene-based sequencing identifies lipid-influencing variants with ethnicity-specific effects in African Americans. PLoS Genet. 10, e1004190 (2014).
- 24. Rajabli F et al. Ancestral origin of ApoE ε4 Alzheimer disease risk in Puerto Rican and African American populations. PLoS Genet. 14, e1007791 (2018).
- 25. Blue EE, Horimoto ARVR, Mukherjee S, Wijsman EM & Thornton TA Local ancestry at APOE modifies Alzheimer’s disease risk in Caribbean Hispanics. Alzheimers Dement. 15, 1524–1532 (2019).
- 26. Naslavsky MS et al. Global and local ancestry modulate APOE association with Alzheimer’s neuropathology and cognitive outcomes in an admixed sample. Mol. Psychiatry 27, 4800–4808 (2022).
- 27. Yang J, Lee SH, Goddard ME & Visscher PM GCTA: a tool for genome-wide complex trait analysis. Am. J. Hum. Genet 88, 76–82 (2011).
- 28. Sakaue S et al. A cross-population atlas of genetic associations for 220 human phenotypes. Nat. Genet 53, 1415–1424 (2021).
- 29. Zeng J et al. Signatures of negative selection in the genetic architecture of human complex traits. Nat. Genet 50, 746–753 (2018).
- 30. Schoech AP et al. Quantification of frequency-dependent genetic architectures in 25 UK Biobank traits reveals action of negative selection. Nat. Commun 10, 790 (2019).
- 31. Zhang Y, Qi G, Park J-H & Chatterjee N Estimation of complex effect-size distributions using summary-level statistics from genome-wide association studies across 32 complex traits. Nat. Genet 50, 1318–1326 (2018).
- 32. The International HapMap 3 Consortium. Integrating common and rare genetic variation in diverse human populations. Nature 467, 52–58 (2010).
- 33. Speed D & Balding DJ SumHer better estimates the SNP heritability of complex traits from summary statistics. Nat. Genet 51, 277–284 (2019).
- 34. Deming WE Statistical adjustment of data. Wiley (1943).
- 35. Pasaniuc B et al. Enhanced statistical tests for GWAS in admixed populations: assessment using African Americans from CARe and a Breast Cancer Consortium. PLoS Genet. 7, e1001371 (2011).
- 36. Hodonsky CJ et al. Ancestry-specific associations identified in genome-wide combined-phenotype study of red blood cell traits emphasize benefits of diversity in genomics. BMC Genomics 21, 228 (2020).
- 37. Loh P-R et al. Contrasting genetic architectures of schizophrenia and other complex diseases using fast variance-components analysis. Nat. Genet 47, 1385–1392 (2015).
- 38. Johnson R et al. Estimation of regional polygenicity from GWAS provides insights into the genetic architecture of complex traits. PLoS Comput. Biol 17, e1009483 (2021).
- 39. Zhang J & Stram DO The role of local ancestry adjustment in association studies using admixed populations. Genet. Epidemiol 38, 502–515 (2014).
- 40. Liu J, Lewinger JP, Gilliland FD, Gauderman WJ & Conti DV Confounding and heterogeneity in genetic association studies with admixed populations. Am. J. Epidemiol 177, 351–360 (2013).
- 41. Saitou M, Dahl A, Wang Q & Liu X Allele frequency differences of causal variants have a major impact on low cross-ancestry portability of PRS. Preprint at medRxiv 10.1101/2022.10.21.22281371 (2022).
- 42. Pasaniuc B et al. Analysis of Latino populations from GALA and MEC studies reveals genomic loci with biased local ancestry estimation. Bioinformatics 29, 1407–1415 (2013).
- 43. Yang J et al. Common SNPs explain a large proportion of the heritability for human height. Nat. Genet 42, 565–569 (2010).
- 44. Speed D, Hemani G, Johnson MR & Balding DJ Improved heritability estimation from genome-wide SNPs. Am. J. Hum. Genet 91, 1011–1021 (2012).
- 45. Gazal S et al. Linkage disequilibrium-dependent architecture of human complex traits shows action of negative selection. Nat. Genet 49, 1421–1427 (2017).
- 46. Gazal S, Marquez-Luna C, Finucane HK & Price AL Reconciling S-LDSC and LDAK functional enrichment estimates. Nat. Genet 51, 1202–1204 (2019).
- 47. Hou K et al. Accurate estimation of SNP-heritability from biobank-scale data irrespective of genetic architecture. Nat. Genet 51, 1244–1251 (2019).
- 48. Linnet K Performance of Deming regression analysis in case of misspecified analytical error ratio in method comparison studies. Clin. Chem 44, 1024–1031 (1998).
- 49. Chiu AM, Molloy EK, Tan Z, Talwalkar A & Sankararaman S Inferring population structure in biobank-scale genomic data. Am. J. Hum. Genet 109, 727–737 (2022).
- 50. Chang CC et al. Second-generation PLINK: rising to the challenge of larger and richer datasets. Gigascience 4, 7 (2015).
- 51. Loh P-R et al. Reference-based phasing using the Haplotype Reference Consortium panel. Nat. Genet 48, 1443–1448 (2016).
- 52. Maples BK, Gravel S, Kenny EE & Bustamante CD RFMix: a discriminative modeling approach for rapid and robust local-ancestry inference. Am. J. Hum. Genet 93, 278–288 (2013).
- 53. The 1000 Genomes Project Consortium et al. A global reference for human genetic variation. Nature 526, 68–74 (2015).
- 54. Schubert R, Andaleon A & Wheeler HE Comparing local ancestry inference models in populations of two- and three-way admixture. PeerJ 8, e10090 (2020).
- 55. Gay NR et al. Impact of admixture and ancestry on eQTL analysis and GWAS colocalization in GTEx. Genome Biol. 21, 233 (2020).
- 56. Pardiñas AF et al. Common schizophrenia alleles are enriched in mutation-intolerant genes and in regions under strong background selection. Nat. Genet 50, 381–389 (2018).
- 57. Schoech AP et al. Negative short-range genomic autocorrelation of causal effects on human complex traits. Preprint at bioRxiv 10.1101/2020.09.23.310748 (2020).
- 58. Bulik-Sullivan B et al. An atlas of genetic correlations across human diseases and traits. Nat. Genet 47, 1236–1241 (2015).
- 59. Reich D et al. Reduced neutrophil count in people of African descent is due to a regulatory variant in the Duffy antigen receptor for chemokines gene. PLoS Genet. 5, e1000360 (2009).
- 60. Reiner AP et al. Genome-wide association and population genetic analysis of C-reactive protein in African American and Hispanic American women. Am. J. Hum. Genet 91, 502–512 (2012).
- 61. Manichaikul A et al. Robust relationship inference in genome-wide association studies. Bioinformatics 26, 2867–2873 (2010).
- 62. Wang G, Sarkar A, Carbonetto P & Stephens M A simple new approach to variable selection in regression, with application to genetic fine mapping. J. R. Stat. Soc. Ser. B 82, 1273–1300 (2020).
- 63. Cook JP & Morris AP Multi-ethnic genome-wide association study identifies novel locus for type 2 diabetes susceptibility. Eur. J. Hum. Genet 24, 1175–1180 (2016).
Data Availability Statement
PAGE individual-level genotype and phenotype data are available through dbGaP at https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000356.v2.p1. UKBB individual-level genotype and phenotype data are available through application at https://www.ukbiobank.ac.uk/. AoU individual-level genotype and phenotype data are available through application at https://www.researchallofus.org/. The set of preprocessed HapMap3 variants used in this manuscript was retrieved from https://ndownloader.figshare.com/files/25503788.
Software implementing the genome-wide genetic correlation estimation method is available at https://github.com/kangchenghou/admix-kit (ref. https://doi.org/10.5281/ZENODO.7482679). Code for replicating the analyses is available at https://github.com/kangchenghou/admix-genet-cor (ref. https://doi.org/10.5281/ZENODO.7482683).