Abstract
SNP heritability is defined as the proportion of phenotypic variance explained by genotyped SNPs and is believed to be a lower bound of heritability (), being equal to it if all causal variants are known. Despite the simple intuition behind , its interpretation and equivalence to is unclear, particularly in the presence of population structure and assortative mating. It is well known that population structure can lead to inflation in estimates because of confounding due to linkage disequilibrium (LD) or shared environment. Here we use analytical theory and simulations to demonstrate that estimates can be biased in admixed populations, even in the absence of confounding and even if all causal variants are known. This is because admixture generates LD, which contributes to the genetic variance, and therefore to heritability. Genome-wide restricted maximum likelihood (GREML) does not capture this contribution leading to under- or over-estimates of relative to , depending on the genetic architecture. In contrast, Haseman-Elston (HE) regression exaggerates the LD contribution leading to biases in the opposite direction. For the same reason, GREML and HE estimates of local ancestry heritability are also biased. We describe this bias in and as a function of admixture history and the genetic architecture of the trait and show that it can be recovered under some conditions. We clarify the interpretation of in admixed populations and discuss its implication for genome-wide association studies and polygenic prediction.
Introduction
The ability to estimate (narrow-sense) heritability () from unrelated individuals was a major advance in genetics. Traditionally, was estimated from family-based studies in which the phenotypic resemblance between relatives could be modeled as a function of their expected genetic relatedness [1]. However, this approach was limited to analysis of closely related individuals where pedigree information is available and the realized genetic relatedness is not too different from expectation [2]. With the advent of genome-wide association studies (GWAS), we hoped that many of the variants underlying this heritability would be uncovered. However, when genome-wide significant SNPs explained a much smaller fraction of the phenotypic variance, it became important to explain the missing heritability – were family-based estimates inflated or were GWAS just underpowered, limited by variant discovery?
Yang et al. (2010) [3] made the key insight that one could estimate the portion of tagged by genotyped SNPs, regardless of whether or not they were genome-wide significant, by exploiting the subtle variation in the realized genetic relatedness among apparently unrelated individuals [3–5]. This quantity came to be known colloquially as ‘SNP heritability’ () and it is believed to be equal to if all causal variants are included among genotyped SNPs [3]. Indeed, estimates of explain a much larger fraction of trait heritability than GWAS SNPs [3], approaching family-based estimates of when whole genome sequence data, which captures rare variants, are used [6]. This has made it clear that GWAS have yet to uncover more variants with increasing sample size. Now, has become an important aspect of the design of genetic studies and is often used to define the power of variant discovery in GWAS and the upper limit of polygenic prediction accuracy.
Despite the utility and simple intuition of , there is much confusion about its interpretation and equivalence to , particularly in the presence of population structure and assortative mating [7–12]. But much of the discussion of heritability in structured populations has focused on biases in – the estimator – due to confounding effects of shared environment and linkage disequilibrium (LD) with other variants [7, 9–11, 13]. There is comparatively little discussion, at least in human genetics, on the fact that LD due to population structure also contributes to genetic variance, and therefore, is a component of heritability [1] (but see [14–16] for a rigorous discussion). We think this is at least partly due to the fact that most studies are carried out in cohorts with primarily European ancestry, where the degree of population structure is minimal and large effects of LD can be ignored. However, that is not the case for diverse, multi-ethnic cohorts, which have historically been underrepresented in genetic studies, but thanks to a concerted effort in the field, are now becoming increasingly common [17–23]. The complex structure in these cohorts also brings unique methodological challenges and it is imperative that we understand whether existing methods, which have largely been evaluated in more homogeneous groups, generalize to more diverse cohorts.
Our goal in this paper is to study the behavior of in admixed populations. What is its interpretation in the ideal situation where causal variants are known? Is it an unbiased estimate of ? To answer these questions, we derived a general expression for the genetic variance in admixed populations, decomposing it in terms of the contribution of population structure, which influences both the genotypic variance at individual loci and the LD across loci. We used theory and simulations to show that estimated with genome-wide restricted maximum likelihood (GREML) [3, 5] and Haseman-Elston (HE) regression [24] – two widely used approaches – can be biased in admixed and other structured populations, even in the absence of confounding and when all causal variants are known. We explain this in terms of the discrepancy between the model assumed in estimation and the generative model from which the genetic architecture of the trait in the population may have been sampled. We describe the bias in as a function of admixture history and genetic architecture and discuss its implications for GWAS and polygenic prediction accuracy.
Model
Genetic architecture
We begin by describing a generative model for the phenotype. Let , where is the phenotypic value of an individual, is the genotypic value, and is random error. We assume additive effects such that where is the effect size of the ith biallelic locus and is the number of copies of the trait-increasing allele. Importantly, the effect sizes are fixed quantities and differences in genetic values among individuals are due to random variation in genotypes. Note, that this is different from the model assumed by GREML where genotypes are fixed and effect sizes are random [14].
We denote the mean, variance, and covariance with , , and , respectively, where the expectation is measured over random draws from the population rather than random realizations of the evolutionary process. We can express the additive genetic variance of a quantitative trait as follows:
Here the first term represents the contribution of individual loci (genic variance) and the second term is the contribution of linkage disequilibrium (LD contribution). We make the assumption that loci are unlinked and therefore, the LD contribution is entirely due to population structure. We describe the behavior of in a population that is a mixture of two previously isolated populations A and B that diverged from a common ancestor. To do this, we denote as the fraction of the genome of an individual with ancestry from population A. Thus, if the individual is from population A, 0 if they are from population B, and if they are admixed. Then, can be expressed in terms of ancestry as (Appendix):
(1.1) |
(1.2) |
(1.3) |
(1.4) |
where and are the allele frequencies in populations A and B, and and are the mean and variance of individual ancestry. The sum of the first three terms represents the genic variance and the last term represents the LD contribution.
Demographic history
From Eq. 1, it is clear that, conditional on the genetic architecture in the source populations is a function of the mean, , and variance, , of individual ancestry in the admixed population. We consider two demographic models that affect and in qualitatively different ways. In the first model, the source populations meet once generations ago (we refer to this as ) in proportions and , after which there is no subsequent admixture (Fig. 1A). In the second model, there is continued gene flow in every generation from one of the source populations such that the mean overall amount of ancestry from population A is the same as in the first model (Fig. 1A). For brevity, we refer to these as the hybrid-isolation (HI) and continuous gene flow (CGF) models, respectively, following Pfaff et al. (2001) [25]. is also affected by ancestry-based assortative mating, where individuals are more likely to partner with others of similar ancestry. We refer to this simply as assortative mating for brevity and model this following Zaitlen et al. (2017) using a parameter , which represents the correlation of the ancestry of individuals across mating pairs in the population [26].
Under these conditions, the behavior of and has been described previously [26, 27] (Fig. 1B and C). Briefly, in the HI model, remains constant at in the generations after admixture as there is no subsequent gene flow. is at its maximum at when each individuals carries chromosomes either from population A or B, but not both. This genome-wide correlation in ancestry breaks down in subsequent generations as a function of mating, independent assortment, and recombination, leading to a decay in , the rate depending on the strength of assortative mating (Fig. 1C). In the CGF model, both and increase with time as new chromosomes are introduced from the source populations. But while continues to increase monotonically, will plateau and decrease due to the countervailing effects of independent assortment and recombination which redistribute ancestry in the population, reaching equilibrium at zero if there is no more gene flow and the population is mating randomly. provides an intuitive and quantitative measure of the degree of population structure (along the axis of ancestry) in admixed populations.
Results
Genetic variance in admixed populations
To understand the expectation of genetic variance in admixed populations, it is first worth discussing its behavior in the source populations. In Eq. 1, the first term represents the within-population component and the last three terms altogether represent the component of genetic variance between populations A and B . Note that is positive only if there is a difference in the mean genotypic values (Fig. 2). This variance increases with genetic divergence since the expected values of both and are functions of . While is expected to increase monotonically with increasing divergence, is expected to be zero under neutrality because the direction of frequency change will be uncorrelated across loci. In this case, the LD contribution, i.e., (1.4), is expected to be zero and . However, this is true only in expectation over the evolutionary process and the realized LD contribution may be non-zero even for neutral traits.
For traits under selection, the LD contribution is expected to be greater or less than zero, depending on the type of selection. Under divergent selection, trait-increasing alleles will be systematically more frequent in one population over the other, inducing positive LD across loci [28, 29], increasing the LD contribution, i.e., term (1.4). Stabilizing selection, on the other hand, induces negative LD [30, 31]. In the extreme case, the mean genetic values of the two populations are exactly equal and . For this to be true, (1.4) has to be negative and equal to (1.2) + (1.3), which are both positive, and the total genetic variance is reduced to the within-population variance, i.e., term (1.1) (Fig. 2). This is relevant because, as we show in the following sections, the behavior of the genetic variance in admixed populations depends on the magnitude of between the source populations.
We illustrate this by tracking the genetic variance in admixed populations for two traits, both with the same mean at causal loci but with different LD contributions (term 1.4): one where the LD contribution is positive (Trait 1) and the other where it is negative (Trait 2). Thus, traits 1 and 2 can be thought of as examples of phenotypes under divergent and stabilizing selection, respectively, and we refer to them as such from hereon. To simulate the genetic variance of such traits, we drew the allele frequencies ( and ) in populations A and B for 1,000 causal loci with using the Balding-Nichols model [32]. We drew their effects from where is the mean allele frequency between the two populations, is the number of loci. To simulate positive and negative LD, we permuted the effect signs across variants 100 times and selected the combinations that gave the most positive and negative LD contribution to represent the genetic architecture of traits that might be under directional (Trait 1) and stabilizing (Trait 2) selection, respectively (Methods). We simulated the genotypes of 10,000 individuals under the HI and CGF models for generations post-admixture and calculated genetic values for both traits using , where (Method). The observed genetic variance at any time can then be calculated simply as the variance in genetic values, i.e. .
In the HI model, does not change (Fig. 1B) so terms (1.1) and (1.2) are constant through time. Terms (1.3) and (1.4) decay towards zero as the variance in ancestry goes to zero and ultimately converges to (1.1) + (1.2) (Fig. 3). This equilibrium value is equal to the (Appendix) and the rate of convergence depends on the strength of assortative mating, which slows the rate at which decays. approaches equilibrium from a higher value for traits under divergent selection and lower value for traits under stabilizing selection because of positive and negative LD contributions, respectively, at (Fig. 3). In the CGF model, increases initially for both traits with increasing gene flow (Fig. 3). This might seem counter-intuitive at first because gene flow increases admixture LD, which leads to more negative values of the LD contribution for traits under stabilizing selection (Fig. S1). But this is outweighed by positive contributions from the genic variance – terms (1.1) + (1.2) + (1.3) – all of which initially increase with gene flow (Fig. S1). After a certain point, the increase in slows down as any increase in due to gene flow is counterbalanced by recombination and independent assortment. Ultimately, will decrease if there is no more gene flow, reaching the same equilibrium value as in the HI model, i.e., . Because the loci are unlinked, we refer to the sum (1.3) + (1.4) as the contribution of population structure.
GREML estimation
In their original paper, Yang et al. (2010) defined as the variance explained by genotyped SNPs and not as heritability [3]. This is because is the genetic variance explained by causal variants, which are unknown. Genotyped SNPs may not overlap with or tag all causal variants and thus, is understood to be a lower bound of , both being equal if causal variants are known [3]. Our goal is to demonstrate that this may not be true in structured populations and quantify the bias in , even in the ideal situation when causal variants are known.
We used GREML, implemented in GCTA [3, 5], to estimate the genetic variance for our simulated traits. GCTA assumes the following model: where is an standardized genotype matrix such that the genotype of the kth individual at the ith locus is being the allele frequency. The SNP effects corresponding to the scaled genotypes are assumed to be random and independent such that and is random environmental error. Then, the phenotypic variance can be decomposed as:
where is the genetic relationship matrix (GRM), the variance components and are estimated using restricted maximum likelihood, and is calculated as . We are interested in asking whether is an unbiased estimate of . To answer this, we constructed the GRM with causal variants and estimated using GCTA [3, 4].
GCTA under- and over-estimates the genetic variance in admixed populations for traits under divergent (Trait 1) and stabilizing selection (Trait 2), respectively, when there is population structure, i.e., when (Fig. 4A). One reason for this bias is that the GREML model assumes that the effects are independent, and therefore the LD contribution is zero. This, as discussed in the previous section, is not true for traits under divergent or stabilizing selection between the source populations, and only true for neutral traits in expectation. Because of this, does not capture the LD contribution, i.e. term (1.4) (Fig. 4A). But can be biased even if the LD contribution is zero if the genotypes are scaled with – the standard practice – where is the frequency of the allele in the population. This scaling assumes that , which is true only if the population were mating randomly. In an admixed population , where , and correspond to frequency in the admixed population, and source populations, A and B, respectively (Appendix). Alternatively, if the genotypes are scaled, where is the at the ith locus. We show that this assumption biases downwards by a factor of (or if genotypes are scaled) – term (1.3) (Fig. 4B, Appendix). Thus, with the standard scaling, gives a biased estimate in the presence of population structure, even of the genic variance.
The overall bias in is determined by the relative magnitude and direction of terms (1.3) and (1.4), both of which are functions of , and therefore, of the degree of structure in the population. The contribution of term (1.3) will be modest, even in highly structured populations (Fig. S1) and therefore, the overall bias is largely driven by the LD contribution. If there is no more gene flow, will ultimately go to zero and will converge towards . Thus, is more accurately interpreted as the genetic variance expected if the LD contribution were zero and if the population were mating randomly. In other words, (Fig. 4B).
In principle, we can recover the missing components of by scaling the genotypes appropriately. For example, we can recover term (1.3) by scaling the genotype at each variant by its sample variance, i.e., (Fig. 4C) (Appendix). We can also recover term (1.4) by scaling the genotypes with the covariance between SNPs, i.e., the LD matrix, as previously proposed [33, 34] (Methods). In matrix form, the ‘LD-scaled’ genotypes can be written as where is an matrix such that all elements of the ith column contain the frequency of the ith SNP and is the (upper triangular) square root matrix of the LD matrix, i.e., [33]. GREML recovers the LD contribution under this scaling, resulting in unbiased estimates of for both traits (Fig. 4D, Appendix).
In practice, however, the LD contribution may not be fully recoverable for two reasons. One, the LD-scaled GRM requires computing the inverse of or which may not exist, especially if the number of markers is greater than the sample size – the case for most human genetic studies. Second, it is common to include individual ancestry or principal components of the GRM as fixed effects in the model to account for inflation in heritability estimates due to shared environment. This should also have the effect of removing the components of genetic variance along the ancestry axes, the residual variance being equal to (Appendix). Indeed, this is what we observe in Fig. 4H. Thus, if ancestry is included as a fixed effect, we expect to be underestimated in the presence of population structure, regardless of genetic architecture.
HE estimation
Haseman-Elston (HE) regression also assumes a random-effects model but uses a method-of-moments approach, as opposed to GREML, which maximizes the likelihood to estimate . Previous work has shown that as long as all causal variants are included in the GRM calculation, the HE estimator will not be biased, even if they are in LD with each other [35]. We show that in the presence of positive and negative LD between causal loci, as exemplified by traits under divergent and stabilizing selection, respectively, the HE estimates of are biased upwards and downwards, respectively (Fig. 5A–B). To understand this discrepancy and the source of bias in our simulations, recall that HE estimates from the regression of the (pairwise) phenotypic covariance between individuals on their genotypic covariance [24]. More specifically, if we denote as the product of the (centered) phenotypes of kth and lth individuals, and as the kth and lth entry of the GRM, then the HE estimator can be written as:
(2) |
Where the first and second terms represent the genic and LD components, respectively, of the estimate. Population structure induces correlations between the alleles at a given locus as well as across loci (i.e., LD). But the LD may not be directional, i.e., trait-increasing alleles may be as likely to be co-inherited with each other as they are to trait-decreasing alleles, and vice versa – implicit under the standard random-effects model. Thus, in the absence of directional LD, the second term is zero and the first term is unaffected as long as all causal variants are included in the GRM, because the increase in the numerator due to population structure is proportional to the denominator [35]. Directional LD does not affect the first term but exaggerates the contribution from the second term, i.e., the LD component (see Appendix section A3.2). Consequently, HE regression over- and under-estimates for traits with positive and negative LD, respectively. Note that this bias is in the opposite direction of the bias observed with GREML, which fails to capture the LD contribution. Scaling the genotype at a locus by its LD with other loci, as discussed in the previous section, corrects for the bias in HE regression regardless of genetic architecture, yielding estimates consistent with GREML (Fig. 5C). Thus, GREML and HE regression are guaranteed to yield the same estimates only if the underlying model specifying the distribution of effects is consistent with the true architecture of the trait.
The practice of including individual ancestry as a covariate in HE regression to account for shared environment [11] reduces the bias from exaggerated LD contributions (Fig. 5D–F). But, as with GREML, this also removes any genetic variance that may exist along the ancestry axis, yielding underestimates of , regardless of genetic architecture.
Local ancestry heritability
A related quantity of interest in admixed populations is local ancestry heritability , which is defined as the proportion of phenotypic variance that can be explained by local ancestry. Zaitlen et al. (2014) [36] showed that this quantity is related to, and can be used to estimate, in admixed populations. The advantage of this approach is that local ancestry segments shared between individuals are identical by descent and are therefore, more likely to tag causal variants compared to array markers, allowing one to potentially capture the contributions of rare variants [36]. Here, we show that in the presence of population structure, (i) the relationship between and is not straightforward and (ii) may be a biased estimate of local ancestry heritability under the random effects model for the same reasons that is biased.
We define local ancestry as the number of alleles at locus that trace their ancestry to population A. Thus, ancestry at the ith locus in individual is a binomial random variable with being the ancestry of the kth individual. Similar to genetic value, the ‘ancestry value’ of an individual can be defined as , where is the effect size of local ancestry (Appendix). Then, the genetic variance due to local ancestry can be expressed as (Appendix):
and heritability explained by local ancestry is simply the ratio of and the phenotypic variance. Note that and therefore its behavior is similar to in that the terms (1.3) and (1.4) decay towards zero as , and converges to (1.2) (Fig. S2). Additionally, the dependence of on both and precludes a straightforward derivation between local ancestry heritability and .
GREML estimation of is similar to that of , the key difference being that the former involves constructing the GRM using local ancestry instead of genotypes [36]. The following model is assumed: where is an standardized local ancestry matrix, are local ancestry effects, and . Note that captures both environmental noise as well as any genetic variance independent of local ancestry. The phenotypic variance is decomposed as where is the local ancestry GRM and is the parameter of interest, which is believed to be equal to – the genetic variance due to local ancestry.
We show that, in the presence of population structure, i.e., when , GREML is biased downwards relative to for traits under divergent selection and upwards for traits under stabilizing selection because it does not capture the contribution of LD (Fig. 6A). But there is another source of bias in , which tends to be inflated in the presence of population structure if individual ancestry is not included as a covariate, even with respect to the expectation of under equilibrium (seen more clearly in Fig. 6B–C). We suspect this inflation is because of strong correlations between local ancestry – local ancestry disequilibrium – across loci that inflates in a way that is not adequately corrected even when all causal variants are included in the model [4, 10]. Scaling local ancestry by its covariance removes this bias and recovers the contribution of LD (Fig. 6D) presumably because this accounts for the correlation in genotypes across loci. Including individual ancestry as a fixed effect also corrects for the inflation in (Fig. 6E–H). But as with , this practice will underestimate the genetic variance due to local ancestry in the presence of population structure because it removes the variance along the ancestry axis (Fig. 6E–H).
Based on the above, GREML and corresponding estimates of are more accurately interpreted as the heritability due to local ancestry and heritability, respectively, expected in the absence of population structure. We believe is still useful in that, because it should capture the effects of rare variants, it can be used to estimate the upper bound of .
In a previous paper, we suggested that local ancestry heritability could potentially be used to estimate the genetic variance between populations [37]. Our results suggest this is not possible for two reasons. First, the GREML estimator of local ancestry heritability, as we show in this section is biased and does not capture the LD contribution. But even if we were able to recover the LD component, our decomposition shows that local ancestry is equal to the genetic variance between populations only when and , which is only possible at in the HI model. After admixture, decays and the equivalence between and is lost, making it impossible to estimate the latter from admixed populations, especially for traits under divergent or stabilizing selection, even if the environment is randomly distributed with respect to ancestry. We note that this conclusion was recently reached independently by Schraiber and Edge (2023) [38].
How much does LD contribute to in practice?
In the previous sections, we showed theoretically that may be biased in admixed populations even if the causal variants are known and in the absence of confounding by shared environment. GREML fails to capture the LD contribution whereas HE regression overestimates it. The extent to which is biased because of this reason in practice is ultimately an empirical question, which is difficult to answer because the true genetic architecture – the LD contribution in particular – is unknown. In this section, we develop some intuition for this contribution among individuals with mixed African and European ancestry using a combination of simulations and empirical data analysis.
First, we simulated a neutral trait using genotype data from the African Americans (ASW) from the 1,000 Genomes Project (1KGP) [39]. To do this, we sampled causal loci from a set of common (MAF > 0.01), LD pruned variants and assigned them effects such that , i.e., the expected genic variance is (Methods). We computed the genic and LD contributions and repeated this process 1,000 times where each replicate can be thought of as an independent realization of the genetic architecture of a neutrally evolving trait. We show that the LD contribution may be zero in expectation but can be substantial for a given trait (up to 50% of the genic variance, Fig. S4), even in the absence of selection.
Second, we estimated the LD contribution of genome-wide significant SNPs for 26 quantitative traits from the GWAS catalog [40]. To do this, we decomposed the variance explained in ASW into the four components in Equation 1 using allele frequencies ( and ) from the YRI and CEU and the mean and variance of individual ancestry from ASW (Methods). We show that for skin pigmentation – a trait under strong divergent selection – the LD contribution, i.e. term (1.4), is positive and accounts for ≈ 40 — 50% of the total variance explained. This is because of large allele frequency differences between Africans and Europeans that are correlated across skin pigmentation loci, consistent with strong polygenic selection favoring alleles for darker pigmentation in regions with high UV exposure and vice versa [37, 41–44]. But for most other traits, LD contributes relatively little, explaining a modest, but non-negligible proportion of the genetic variance in height, LDL and HDL cholestrol, mean corpuscular hemoglobin (MCH), neutrophil count (NEU), and white blood cell count (WBC) (Fig. 7). Because we selected independent associations for this exercise (Methods), the LD contribution is driven entirely due to population structure in ASW. The contribution of population structure to the genic variance, i.e., term (1.3) is also small even for traits like skin pigmentation and neutrophil count with large effect alleles that are highly diverged in frequency between Africans and Europeans [42, 43, 45–47]. Overall, this suggests that population structure contributes relatively little, as least to the variance explained by GWAS SNPs.
Discussion
Despite the growing size of GWAS and discovery of thousands of variants for hundreds of traits [40], the heritability explained by GWAS SNPs remains a fraction of twin-based heritability estimates. Yang et al. (2010) introduced the concept of SNP heritability () that does not depend on the discovery of causal variants but assumes that they are numerous and are more or less uniformly distributed across the genome (the infinitesimal model), their contributions to the genetic variance ‘tagged’ by genotyped SNPs [3]. is now routinely estimated in most genomic studies and at least for some traits (e.g. height and BMI), these estimates now approach twin-based heritability [6]. But despite the widespread use of , its interpretation remains unclear, particularly in the presence of admixture and population structure. It is generally accepted that can be biased in structured populations because of confounding effects of unobserved environmental factors and LD between causal variants [4, 7, 9–11, 48]. But may be biased even in the absence of confounding because of misspecification of the underlying random-effects model, i.e., if the model does not represent the genetic architecture from which the trait is sampled [14–16, 49, 50].
Under the standard GREML model, SNP effects are assumed to be uncorrelated and the total genetic variance can be represented as the sum of the variance explained by individual loci, i.e. the genic variance [14–16]. In admixed populations, there is substantial LD, which can contribute to the genetic variance, and can persist for a number of generations, despite recombination, due to continued gene flow and/or ancestry-based assortative mating. GREML does not capture this LD contribution [12, 15], and therefore, may lead to biased estimates of . The LD contribution can be negative for traits under stabilizing selection, and positive for traits under divergent selection between the source populations, leading to over- or under-estimates, respectively. Thus, GREML estimates of , assuming genotypes are scaled properly (see below), is better interpreted as the proportion of phenotypic variance explained by the genic variance. Estimates of local ancestry heritability [36, 51] should be interpreted similarly.
We show that with GREML, can be biased even when the LD contribution is zero if the genotypes are scaled by – the standard approach, which implicitly assumes a randomly mating population. In the presence of population structure, the variance in genotypes can be higher and does not capture this additional variance, which we show can be recovered by scaling genotypes by the SNP variance . In principle, the LD contribution can also be recovered by scaling genotypes by the SNP covariance, i.e., the LD matrix, as previously suggested [33, 34]. But this approach is limited to situations where the sample size is much larger than the number of markers.
We also investigated the behavior of another widely used approach to estimate – Haseman-Elston regression. We show that estimated with HE regression is also biased, but for different reasons and in the opposite direction of the bias observed with GREML. HE regression exaggerates the LD contribution, leading to over- and under-estimates of for traits where the causal loci are in positive and negative LD, respectively. Approaches that correct for population structure [35] should remove this source of bias but would also remove any genetic variance in the trait along the ancestry axis, including the LD contribution. This results in underestimates of , regardless of trait architecture.
One limitation of this paper is that we have focused on random-effects estimators of because of their widespread use. Estimators of can be broadly grouped into random- and fixed effect estimators based on how they treat SNP effects [35]. Fixed effect estimators make fewer distributional assumptions but they are not as widely used because they require conditional estimates of all variants – a high-dimensional problem where the number of markers is often far larger than the sample size [52]. This is one reason why random effect estimators, such as GREML, are popular – because they reduce the number of parameters that need to be estimated by assuming that the effects are drawn from some distribution where the variance is the only parameter of interest. Fixed effects estimators, in principle, should be able to capture the LD contribution but this is not obvious in practice since the simulations used to evaluate the accuracy of such estimators still assume uncorrelated effects [35, 52, 53]. Further research is needed to clarify the interpretation of the different estimators of in structured populations under a range of genetic architectures.
Does the LD contribution to the genetic variance have practical implications? The answer to this depends on the context in which SNP heritability is used. can be useful in quantifying the power to detect variants in GWAS where the quantity of interest is the genic variance. But can lead to misleading conclusions if used to measure the extent to which genetic variation contributes to phenotypic variation, in predicting the response to selection, or in defining the upper limit of polygenic prediction accuracy [2] – applications where the LD contribution is important.
Ultimately, the discrepancy between and in practice is an empirical question, the answer to which depends on the degree of population structure (which we can measure) and the genetic architecture of the trait (which we do not know a priori). We show that for most traits, the contribution of population structure to the variance explained by GWAS SNPs is modest among African Americans. Thus, if we assume that the genetic architecture of GWAS SNPs represents that of all causal variants, then despite incorrect assumptions, the discrepancy between and should be fairly modest. But this assumption is unrealistic given that GWAS SNPs are common variants that in most cases cumulatively explain a fraction of trait heritability. What is the LD contribution of the rest of the genome, particularly rare variants? This is not obvious and will become clearer in the near future through large sequence-based studies [54]. While these are underway, theoretical studies are needed to understand how different selection regimes influence the directional LD between causal variants – clearly an important aspect of the genetic architecture of complex traits.
Methods
Simulating genetic architecture
We first drew the allele frequency of 1,000 biallelic causal loci in the ancestor of populations A and B from a uniform distribution, . Then, we simulated their frequency in populations A and B ( and ) under the Balding-Nichols model [32], such that where is the inbreeding coefficient. We implemented this using code adapted from Lin et al. (2021) [55]. To avoid drawing extremely rare alleles, we continued to draw and until we had 1,000 loci with .
We generated the effect size () of each locus by sampling from , where is the number of loci and is the mean allele frequency across populations A and B. Thus, rare variants have larger effects than common variants and the total genetic variance sums to 1. Given these effects, we simulated two different traits, one with a large difference in means between populations A and B (Trait 1) and the other with roughly no difference (Trait 2). This was achieved by permuting the signs of the effects 100 times to get a distribution of – the genetic variance between populations. This has the effect of varying the LD contribution without changing the at causal loci. We selected the maximum and minimum of to represent Traits 1 and 2.
Simulating admixture
We simulated the genotypes, local ancestry, and phenotype for 10,000 admixed individuals per generation under the hybrid isolation (HI) and continuous gene flow (CGF) models by adapting the code from Zaitlen et al. (2017) [26]. We denote the ancestry of a randomly selected individual with , the fraction of their genome from population A. At under the HI model, we set to 1 for individuals from population A and 0 if they were from population B such that with no further gene flow from either source population. In the CGF model, population B receives a constant amount from population A in every generation starting at . The mean overall proportion of ancestry in the population is kept the same as the HI model by setting where is the number of generations of gene flow from A. In every generation, we simulated ancestry-based assortative mating by selecting mates such that the correlation between their ancestries is in every generation. We do this by repeatedly permuting individuals with respect to each other until falls within ±0.01 of the desired value. It becomes difficult to meet this criterion when is small (Fig. 1C). To overcome this, we relaxed the threshold up to 0.04 for some conditions, i.e., when and . We generated expected variance in individual ancestry using the expression in Zaitlen et al. (2017) [26]. At time since admixture, under the HI model where measures the strength of assortative mating, i.e, the correlation between the ancestry between individuals in a mating pair. Under the CGF model, (Appendix).
We sampled the local ancestry at each ith locus as where and and represent the ancestry of the maternal and paternal chromosome, respectively. The global ancestry of the individual is then calculated as , where is the number of loci. We sample the genotypes and from a binomial distribution conditioning on local ancestry. For example, the genotype on the maternal chromosome is if and if where and represent the allele frequency in populations A and B, respectively. Then, the genotype can be obtained as the sum of the maternal and paternal genotypes: . We calculate the genetic value of each individual as and the genetic variance as .
Heritability estimation with GREML
We used the --reml and --reml-no-constrain flags in GCTA [5] to estimate and , the genetic variance due to genotypes and local ancestry, respectively. We could not run GCTA without noise in the genetic values so we simulated individual phenotypes with a heritability of by adding random noise . We computed three different GRMs, which correspond to different transformations of the genotypes: (i) standard, (i) Variance or scaled, and (ii) LD-scaled.
For the standard GRM, the genotypes at the ith SNP are standardized such that . For the variance scaled GRM, we computed where is the sample variance of the genotypes at the ith SNP. The LD-scaled GRM conceptually corresponds to standardizing the genotypes by the SNP covariance, rather than its variance. Let represent the unstandardized matrix of genotypes and represent an matrix where the ith column contains the allele frequency of that SNP. Let be the upper triangular ‘square root’ matrix of the SNP covariance matrix such that . Then, the standardized genotypes are computed as and the GRM becomes [33]. Similarly, the three GRMs for local ancestry were computed by scaling local ancestry with (i) where we denote as the mean local ancestry at the ith SNP, or with the (ii) variance, or (iii) covariance of local ancestry, respectively. We estimated and with and without individual ancestry as a fixed effect to correct for any confounding due to genetic stratification. This was done by using the --qcovar flag.
Heritability estimation with HE regression
Haseman-Elston regression with and without ancestry correction was implemented using custom scripts in R [56]. To estimate without ancestry correction, we first computed the cross-product of the centered phenotypes , resulting in an matrix . We stacked the upper-triangular matrix of into a vector and regressed it on the corresponding elements of the GRM , taking the slope as an estimate of :
To correct for individual ancestry, we followed the approach of Min et al. (2022) [35]. To do this, we first regressed out the effect of individual ancestry on phenotype. The regression coefficient can be expressed as and the residuals as . Then, we fit the following model:
where represents the cross-product of individual ancestry, represents its corresponding regression coefficient, and represents the parameter of interest, i.e., the genetic variance and , the residual variance.
To demonstrate that the bias in HE estimates arises because of a bias in the estimate of LD contribution, not the genic variance, we carried out a simple simulation where half of the individuals in the population derive their ancestry from population A and the rest from population B. This is equivalent to the meta-population under the HI model at where . We simulated genotypes for 1, 000 individuals for loci where the allele frequencies in populations A and B were set to and , respectively. We standardized the genotypes at each locus using the square-root of the sample variance and assigned effect sizes such that the total genetic variance explained by all loci is equal to 1, i.e., the effect of the scaled genotype at the ith locus is . This is equivalent to the effect size of the unscaled genotypes being where is the sample variance at the ith locus. We introduced randomness in the direction of the effect by assigning a negative or positive sign to each locus uniformly at random 100 times to generate 100 traits with the same genic variance but varying LD contributions. Then, for each trait we computed the two terms in Eq. 2, which should converge to the genic variance and LD contributions, which represent the genic and LD components to the HE regression estimate. Fig. S5 shows that in the presence of directional LD, the overall bias is in the HE regression estimate is due to an exaggerated estimate of the LD contribution.
Estimating variance explained by GWAS SNPs
To decompose the variance explained by GWAS SNPs in African Americans, we needed four quantities: (i) effect sizes of GWAS SNPs, (ii) their allele frequencies in Africans and Europeans, and (iii) the mean and variance of global ancestry in African Americans (Equation 1).
We retrieved the summary statistics of 26 traits from GWAS catalog [40]. Full list of traits and the source papers [44, 57–64] are listed in Table S1. To maximize the number of variants discovered, we chose summary statistics from studies that were conducted in both European and multi-ancestry samples and that reported the following information: effect allele, effect size, p-value, and genomic position. For birth weight, we downloaded the data from the Early Growth Genetics (EGG) consortium website [61] since the version reported on the GWAS catalog is incomplete. For skin pigmentation, we chose summary statistics from the UKB [65] released by the Neale Lab (http://www.nealelab.is/uk-biobank) and processed by Ju and Mathieson [44] to represent effect sizes estimated among individuals of European ancestry. We also selected summary statistics from Lona-Durazo et al. (2019) where effect sizes were meta-analyzed across four admixed cohorts [57]. Lona-Durazo et al. provide summary statistics separately with and without conditioning on rs1426654 and rs35397 – two large effect variants in SLC24A5 and SLC45A2. We used the ‘conditioned’ effect sizes and added in the effects of rs1426654 and rs35397 to estimate genetic variance.
We selected independent hits for each trait by pruning and thresholding with PLINK v1.90b6.21 [66] in two steps as in Ju et al. (2020) [44]. We used the genotype data of GBR from the 1000 genome project [39] as the LD reference panel. We kept only SNPs (indels were removed) that passed the genome-wide significant threshold (--clump-p1 5e-8) with a pairwise LD cutoff of 0.05 (--clump-r2 0.05) and a physical distance threshold of 250Kb (--clump-kb 250) for clumping. Second, we applied a second round of clumping (--clump-kb 100) to remove SNPs within 100kb.
When GWAS was carried out separately in different ancestry cohorts in the same study, we used inverse-variance weighting to meta-analyze effect sizes for variants that were genome-wide significant (p-value < 5 × 10−8) in at least one cohort. This allowed us to maximize the discovery of variants such as the Duffy null allele that are absent among individuals of European ancestry but polymorphic in other populations [47].
We used allele frequencies from the 1000 Genomes CEU and YRI to represent the allele frequencies of GWAS SNPs in Europeans and Africans, respectively, making sure that the alleles reported in the summary statistics matched the alleles reported in the 1000 Genomes. We estimated the global ancestry of ASW individuals (N = 74) with CEU and YRI individuals from 1000 genome (phase 3) using ADMIXTURE 1.3.0 [67] with k=2 and used it to calculate the mean (proportion of African ancestry = 0.767) and variance (0.018) of global ancestry in ASW. With the effect sizes, allele frequencies, and the mean and variance in ancestry, we calculated the four components of genetic variance using Equation 1 and expressed them as a fraction of the total genetic variance.
Initially, the multi-ancestry summary statistics for a few traits (NEU, WBC, MON, MCH, BAS) yielded values > 1 for the proportion of variance explained. This is likely because, despite LD pruning, some of the variants in the model are not independent and tag large effect variants under divergent selection such as the Duffy null allele, leading to an inflated contribution of LD. We checked this by calculating the pairwise contribution, i.e., , of all SNPs in the model and show long-range positive LD between variants on chromosome 1 for NEU, WBC, and MON, especially with the Duffy null allele (Fig. S6A–C). A similar pattern was observed on chromosome 16 for MCH, confirming our suspicion. This also suggests that for certain traits, pruning and thresholding approaches are not guaranteed to yield independent hits. To get around this problem, we retained only one association with the lowest p-value, each from chromosome 1 (rs2814778 for NEU, WBC, and MON) and chromosome 16 (rs13331259 for MCH) (Fig. S6D). For BAS, we observed that the variance explained was driven by a rare variant (rs188411703, MAF = 0.0024) of large effect . We believe this effect estimate to be inflated and therefore, we removed it from our calculation.
As a sanity check, we independently estimated the genetic variance as the variance in polygenic scores, calculated using --score flag in PLINK, [66] in ASW individuals. We compared the first estimate of the genetic variance to the second (Fig. S7) to confirm two things: (i) the allele frequencies, and mean and variance in ancestry are estimated correctly, and (ii) the variants are more or less independent in that they do not absorb the effects of other variants in the model. We show that the two estimates of the genetic variance are strongly correlated (, Fig. S7). The 95% confidence intervals were calculated by sampling individuals with replacement 10,000 times.
Supplementary Material
Acknowledgements
We thank Iain Mathieson and Doc Edge for helpful comments on the manuscript. This study was funded by National Institute of General Medical Sciences award R00GM137076 to A.A.Z. The content of this paper is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
Appendix
A1. Variance in ancestry
We denote variance and covariance with and and used the expressions in [26] to generate the expected value for the variance in ancestry, i.e., . This is straightforward for the HI model, where at time measures the strength of assortative mating, i.e, the correlation between the ancestry across mating pairs at time . For simplicity, we assumed this to be constant in every generation, i.e. following [26]. Since our notation slightly differs from [26], we re-derived the expression for for the CGF model where population B receives a constant amount of gene flow from population A in every generation. Note, that . Then,
A2 Genetic variance
Let , where is the phenotypic value of an individual, is the genotypic value, and is random error. We assume additive effects such that where is the effect size of the ith biallelic locus and is the number of copies of the trait-increasing allele. Then, the genetic variance is:
In the following sections, we decompose and further as functions of ancestry.
A2.1.
We first derive as a function of ancestry using the law of total variance:
where represents the expectation taken over .
A2.1.1.
We derive by further conditioning on the local ancestry at each locus.
where represents expectation taken over local ancestry. Since we are interested in the variance at a single locus, we will ignore the subscript and denote the frequency of the trait-increasing allele in populations A and B with and , respectively.
To derive , note that
And,
Putting this together,
A2.1.2.
Recall from the previous section that . Then,
We are now ready to express :
Note, that we can also express as:
where the second term is the contribution of population structure to the genetic variance at locus .
A2.2.
We can derive using the law of total covariance:
because we assume that the loci are unlinked and therefore, and are conditionally independent. Putting this all together, we get the genetic variance in admixed populations as presented in the main text:
The only difference being that in the main text we use instead of for simplicity. With two ‘unadmixed’ source populations with equal number of individuals, and and reduces to:
A3. The effect of genotype scale on
In the main text, we showed that both GREML and Haseman-Elston regression estimates of depend on how the genotypes are scaled. We provide an explanation of this behavior using the Haseman-Elston (HE) regression estimator, which is asymptotically equivalent to the GREML estimator if effects are uncorrelated [69] but which, unlike GREML, has a closed-form solution.
A3.1. No directional LD
A3.1.1. Scaling by
First, let’s assume a genetic architecture where all loci contribute equally to the genetic variance and there is no LD contribution. With the standard scaling, the genotype at a given locus is where is the frequency of the allele in the population. Under the random-effects model, this is equivalent to saying that the unscaled effects are: being the parameter of interest. In a panmictic population,
In an admixed population,
The HE estimator of is based on the regression of products of (centered) phenotypes for all pairs of individuals on the corresponding entries of the GRM where and is the centered and scaled genotype of individual for locus :
Where represents the expectation over all pairwise comparisons between individuals. It is simpler to express the HE estimator in terms of the scaled effects .
Where and represent expectations over random realizations of effect sizes. Thus, the last line follows from our assumption that the effect sizes are independent in expectation, i.e., . Note, that the estimate is still biased since it does not capture the contribution of population structure.
A3.1.2. Scaling by
Next, we consider the case where the genotypes are standardized instead by the sample variance, i.e., such that . We can derive corresponding to this scaling by noting that the genetic variance is invariant under linear transformations of the genotype [14]:
Then, the HE estimator becomes:
Which provides an unbiased estimate of the genic variance. It’s important to note that even though we assumed effect sizes under a random-effect model, the above result holds under a fixed-effect model as long as there is no directional LD. We discuss the implications of directional LD in the following section.
A3.2. Directional LD
Under the standard random-effect model, the effect sizes are assumed to be independent in expectation. We discussed in the main text how certain processes (e.g. selection and assortative mating) can induce directional LD across causal loci. But directional LD might arise even for neutral traits and under the random-effects model for any given realization of effects. This can lead to biases both HE and GREML estimates of , though the direction and reason for bias is different for the two methods. GREML does not have a closed-form solution so the exact estimand is difficult to derive. Instead, we develop some intuition based on HE regression.
A3.2.1. Scaling by
To do this, let represent the vector of a given realization of (fixed) effects corresponding to the standardized genotypes such that each locus contributes equally to , the genic variance, i.e., . Let there be positive LD across loci such that all cross-product terms are . Then, the genetic variance explained by all loci is:
(3) |
where is the LD between the ith and jth loci that ranges from 0 (no LD) to 1 (perfect LD). Thus, the LD contribution to ranges from 0 to . In comparison, the HE estimator is:
(4) |
This shows that the bias due to directional LD in the HE estimate of does not come from the genic, but the LD component. When there is no LD, e.g. if the population has reached equilibrium after generations of random mating, this component goes to zero and both the estimate and converge to the same value – the genic variance. The LD component is maximum when the ith and jth loci are in perfect LD. In this case, and are exchangeable and the second term of the estimator reduces to . Thus, HE regression should give an unbiased estimate of , even in the presence of directional LD, but only when LD is perfect. For any other value , the estimate is biased (Fig. A1). An interpretable, analytical derivation of the second term in Eq. 4 is complicated but we illustrate the bias with simulations below.
For unlinked markers, is a function of (see A2.1). Perfect LD arises when (i) both and and (ii) is maximum, which occurs at the time of admixture when source populations mix equally, i.e, . To generate a range of LD, we simulated an admixed population with equal number of individuals from populations A and B. Thus, . We simulated genotypes for each individual at 50 ‘causal’ loci where the difference between the frequencies in the source populations, with the condition that . We assigned each locus the same effect size (on the variance-standardized scale) of summing up to a genic variance of 1. The positive sign ensures positive LD across loci, i.e, all off-diagonal elements of are set to . For each simulation, we computed the expected and estimated LD component using the second terms in Eqs. 3 and 4, respectively, and averaged the results over 100 replications.
A3.2.2. Scaling by LD
In the main text, we showed that standardizing the genotypes at a locus by its covariance with other loci accounts for the bias for GREML and HE estimators. More specifically, the ‘LD-scaled’ genotypes can be written as where is an matrix such that all elements of the ith column contain the frequency of the ith SNP and is the (upper triangular) square root of the LD matrix, i.e, . Under this scheme, the standardized genotypes are uncorrelated and therefore, the second term in Eqs. 3 and 4 are zero. This reduces the estimator to the first term, representing the sum of squares of effect sizes, i.e. . The effect sizes corresponding to the LD scaled genotypes are and the sum of squares is:
Which captures both the genic and LD contributions and therefore, provides an unbiased estimate of .
A4. Genetic variance after correction for individual ancestry
In the main text, we stated that including individual ancestry as a fixed effect in GREML can lead to an underestimate of in the presence of population structure. Mixed effect models deal with fixed effects (ancestry in our case) by projecting them out of the phenotypes, and estimating the residual variance. This is conceptually equivalent to measuring the residual variance of the regression between phenotype and ancestry. As a result, any variance in the phenotype that is explained by ancestry is removed. To understand this quantitatively, it is helpful to decompose into components of variance explained by and variance orthogonal to ancestry:
We can express the residual variance as:
Note, that this represents the following components of from the main text: (1.1) + (1.2) − (1.3). Note, that (1.3), which is subtracted out, is always positive and depends on . Thus, the residual genetic variance will be underestimated, regardless of trait architecture, in the presence of population structure, i.e. when .
A5. Effect size of local ancestry
We define local ancestry as the number of alleles at locus that trace their ancestry to population A. Thus, the local ancestry at locus in individual is a Binomial random variable with . We define the ancestry value of an individual as the weighted sum of their local ancestry: where .
To show this, note that where and is a density function. Our goal is to express in terms of , which is equal to . Furthermore, . We can express in terms of as follows:
Similary, and
A6. Genetic variance due to local ancestry
(5) |
We use the law of total variance and covariance to derive and :
Footnotes
Code availability
We carried out all analyses in R version 4.2.3 [56], PLINK v1.90b6.21 and PLINK 2.0 [66, 68], and GCTA version 1.94.1 [5]. All code is freely available on https://github.com/jinguohuang/admix_heritability.git.
References
- 1.Lynch M. & Walsh B. Genetics and analysis of quantitative traits 1–980 (Sinauer Associates, Inc, 1998). [Google Scholar]
- 2.Visscher P. M., Hill W. G. & Wray N. R. Heritability in the genomics era–concepts and misconceptions. Nature reviews. Genetics 9, 255–66 (April 2008). [DOI] [PubMed] [Google Scholar]
- 3.Yang J., Benyamin B., et al. Common SNPs explain a large proportion of the heritability for human height. Nature Genetics 42, 565–569 (July 2010). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Yang J., Manolio T. A., et al. Genome partitioning of genetic variation for complex traits using common SNPs. Nature Genetics 2011 43:6 43, 519–525 (June 2011). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Yang J., Lee S. H., Goddard M. E. & Visscher P. M. GCTA: a tool for genome-wide complex trait analysis. American journal of human genetics 88, 76–82 (January 2011). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Wainschtein P., Jain D., et al. Assessing the contribution of rare variants to complex trait heritability from whole-genome sequence data. Nature Genetics 2022 54:3 54, 263–273 (March 2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Browning S. R. & Browning B. L. Population structure can inflate SNP-based heritability estimates. American Journal of Human Genetics 89, 191–193 (January 2011). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.Goddard M. E., Lee S. H., Yang J., Wray N. R. & Visscher P. M. Response to browning and browning. American Journal of Human Genetics 89, 193–195 (January 2011). [Google Scholar]
- 9.Kumar S. K., Feldman M. W., Rehkopf D. H. & Tuljapurkar S. Limitations of GCTA as a solution to the missing heritability problem. Proceedings of the National Academy of Sciences of the United States of America 113, E61–E70 (January 2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.Yang J., Leec S. H., Wraya N. R., Goddardd M. E. & Visscher P. M. GCTA-GREML accounts for linkage disequilibrium when estimating genetic variance from genome-wide SNPs. Proceedings of the National Academy of Sciences 113, E4579–E4580 (32 2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11.Lin Z., Seal S. & Basu S. Estimating SNP heritability in presence of population substructure in biobank-scale datasets. Genetics 220 (April 2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12.Border R., O’Rourke S., et al. Assortative mating biases marker-based heritability estimators. Nature Communications 2022 13:1 13, 1–10 (January 2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Visscher P. M., Yang J. & Goddard M. E. A Commentary on ‘Common SNPs Explain a Large Proportion of the Heritability for Human Height’ by Yang et al. (2010). Twin Research and Human Genetics 13, 517–524 (June 2010). [DOI] [PubMed] [Google Scholar]
- 14.De los Campos G., Sorensen D. & Gianola D. Genomic Heritability: What Is It? PLOS Genetics 11, e1005048 (May 2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Rawlik K., Canela-Xandri O., Oolliams J. W. & Tenesa A. SNP heritability: What are we estimating? bioRxiv, 2020.09.15.276121 (2020). [Google Scholar]
- 16.Lara L. A. C., Pocrnic I., de P. Oliveira T., Gaynor R. C. & Gorjanc G. Temporal and genomic analysis of additive genetic variance in breeding programmes. Heredity 2021 128:1 128, 21–32 (January 2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.Wojcik G. L., Graff M., et al. Genetic analyses of diverse populations improves discovery for complex traits. Nature 570, 514–518 (7762 2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18.Ben-Eghan C., Sun R., et al. Don’t ignore genetic data from minority populations. Nature 2021 585:7824 585, 184–186 (7824 2020). [DOI] [PubMed] [Google Scholar]
- 19.Verma A., Damrauer S. M., et al. The Penn Medicine BioBank: Towards a Genomics-Enabled Learning Healthcare System to Accelerate Precision Medicine in a Diverse Population. Journal of Personalized Medicine 12, 1974 (December 2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20.Fatumo S., Chikowore T., et al. A roadmap to increase diversity in genomic studies. Nature Medicine 2022 28:2 28, 243–250 (February 2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21.The All of Us Research Program Investigators. The “All of Us” Research Program. New England Journal of Medicine 381, 668–676 (July 2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22.Sohail M., Chong A. Y., et al. Nationwide genomic biobank in Mexico unravels demographic history and complex trait architecture from 6,057 individuals. bioRxiv, 2022.07.11.499652 (2022). [Google Scholar]
- 23.Johnson R., Ding Y., et al. The UCLA ATLAS Community Health Initiative: Promoting precision health research in a diverse biobank. Cell Genomics 3, 100243 (January 2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24.Haseman J. K. & Elston R. C. The investigation of linkage between a quantitative trait and a marker locus. Behavior genetics 2, 3–19 (January 1972). [DOI] [PubMed] [Google Scholar]
- 25.Pfaff C. L., Parra E. J., et al. Population structure in admixed populations: effect of admixture dynamics on the pattern of linkage disequilibrium. American journal of human genetics 68, 198–207 (January 2001). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26.Zaitlen N., Huntsman S., et al. The Effects of Migration and Assortative Mating on Admixture Linkage Disequilibrium. Genetics 205, 375–383 (January 2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 27.Verdu P. & Rosenberg N. A. A General Mechanistic Model for Admixture Histories of Hybrid Populations. Genetics 189, 1413–1426 (April 2011). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28.Corre V. L. & Kremer A. Genetic Variability at Neutral Markers, Quantitative Trait Loci and Trait in a Subdivided Population Under Selection. Genetics 164, 1205–1219 (March 2003). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 29.Berg J. J. & Coop G. A Population Genetic Signal of Polygenic Adaptation. PLoS Genetics 10, e1004412 (August 2014). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 30.Bulmer M. G. The Effect of Selection on Genetic Variability. The American Naturalist 105, 201–211 (943 1971). [Google Scholar]
- 31.Yair S. & Coop G. Population differentiation of polygenic score predictions under stabilizing selection. Philosophical Transactions of the Royal Society B 377 (1852 2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 32.Balding D. J. & Nichols R. A. A method for quantifying differentiation between populations at multi-allelic loci and its implications for investigating identity and paternity. Genetica 96, 3–12 (1–2 1995). [DOI] [PubMed] [Google Scholar]
- 33.Mathew B., Léon J. & Sillanpää M. J. A novel linkage-disequilibrium corrected genomic relationship matrix for SNP-heritability estimation and genomic prediction. Heredity 2017 120:4 120, 356–368 (April 2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 34.Ma R. & Dicker L. H. The Mahalanobis kernel for heritability estimation in genome-wide association studies: fixed-effects and random-effects methods (2019). [Google Scholar]
- 35.Min A., Thompson E. & Basu S. Comparing heritability estimators under alternative structures of linkage disequilibrium. G3 Genes|Genomes|Genetics 12 (August 2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 36.Zaitlen N., Pasaniuc B., et al. Leveraging population admixture to characterize the heritability of complex traits. Nature Genetics 46, 1356–1362 (December 2014). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 37.Zaidi A. A., Mattern B. C., et al. Investigating the case of human nose shape and climate adaptation. PLoS Genetics 13, 2017 (March 2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 38.Schraiber J. G. & Edge M. D. Heritability within groups is uninformative about differences among groups: cases from behavioral, evolutionary, and statistical genetics. bioRxiv, 2023.11.06.565864 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 39.Auton A., Abecasis G. R., et al. A global reference for human genetic variation. Nature 2015 526:7571 526, 68–74 (7571 2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 40.Sollis E., Mosaku A., et al. The NHGRI-EBI GWAS Catalog: knowledgebase and deposition resource. Nucleic Acids Research 51, D977–D985 (D1 2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 41.Jablonski N. G. the Evolution of Human Skin and Skin Color. Annual Review of Anthropology 33, 585–623 (January 2004). [Google Scholar]
- 42.Lamason R. L., Mohideen M. A. P., et al. SLC24A5, a putative cation exchanger, affects pigmentation in zebrafish and humans. Science 310, 1782–1786 (5755 2005). [DOI] [PubMed] [Google Scholar]
- 43.Beleza S., Johnson N. A., et al. Genetic architecture of skin and eye color in an African-European admixed population. PLoS genetics 9 (ed Spritz R. A.) e1003372 (March 2013). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 44.Ju D. & Mathieson I. The evolution of skin pigmentation-associated variation in West Eurasia. Proceedings of the National Academy of Sciences of the United States of America 118, e2009227118 (January 2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 45.Nalls M. A., Wilson J. G., et al. Admixture Mapping of White Cell Count: Genetic Locus Responsible for Lower White Blood Cell Count in the Health ABC and Jackson Heart Studies. The American Journal of Human Genetics 82, 81–87 (January 2008). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 46.Reich D., Nalls M. A., et al. Reduced Neutrophil Count in People of African Descent Is Due To a Regulatory Variant in the Duffy Antigen Receptor for Chemokines Gene. PLoS Genetics 5 (ed Visscher P. M.) e1000360 (January 2009). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 47.McManus K. F., Taravella A. M., et al. Population genetic analysis of the DARC locus (Duffy) reveals adaptation from standing variation associated with malaria resistance in humans. PLOS Genetics 13, e1006560 (March 2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 48.Kumar S. K., Feldman M. W., Rehkop D. H. & Tuljapurkar S. Reply to Yang et al.: GCTA produces unreliable heritability estimates. Proceedings of the National Academy of Sciences 113, E4581–E4581 (32 2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 49.Speed D., Hemani G., Johnson M. R. & Balding D. J. Improved heritability estimation from genome-wide SNPs. American journal of human genetics 91, 1011–21 (June 2012). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 50.Speed D., Cai N., Johnson M. R., Nejentsev S. & Balding D. J. Reevaluation of SNP heritability in complex human traits. Nature Genetics 49, 986–992 (July 2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 51.Chan T. F., Rui X., et al. Estimating heritability explained by local ancestry and evaluating stratification bias in admixture mapping from summary statistics. bioRxiv, 2023.04.10.536252 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 52.Schwartzman A., Schork A. J., Zablocki R. & Thompson W. K. A simple, consistent estimator of SNP heritability from genome-wide association studies. 10.1214/19-AOAS1291 13, 2509–2538 (April 2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 53.Hou K., Ding Y., et al. Causal effects on complex traits are similar for common variants across segments of different continental ancestries within admixed individuals. Nature Genetics 2023 55:4 55, 549–558 (April 2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 54.Backman J. D., Li A. H., et al. Exome sequencing and analysis of 454,787 UK Biobank participants. Nature 2021 599:7886 599, 628–634 (7886 2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 55.Lin M., Park D. S., Zaitlen N. A., Henn B. M. & Gignoux C. R. Admixed Populations Improve Power for Variant Discovery and Portability in Genome-Wide Association Studies. Frontiers in Genetics 12, 673167 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 56.R Core Team. R: A Language and Environment for Statistical Computing R Foundation for Statistical Computing; (Vienna, Austria, 2023). [Google Scholar]
- 57.Lona-Durazo F., Hernandez-Pacheco N., et al. Meta-analysis of GWA studies provides new insights on the genetic architecture of skin pigmentation in recently admixed populations. BMC Genetics 20, 1–16 (January 2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 58.Yengo L., Vedantam S., et al. A saturated map of common genetic variants associated with human height. Nature 610, 704–712 (7933 2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 59.Hoffmann T. J., Choquet H., et al. A large multiethnic genome-wide association study of adult body mass index identifies novel loci. Genetics 210, 499–515 (February 2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 60.Pulit S. L., Stoneman C., et al. Meta-Analysis of genome-wide association studies for body fat distribution in 694 649 individuals of European ancestry. Human Molecular Genetics 28, 166–174 (January 2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 61.Warrington N. M., Beaumont R. N., et al. Maternal and fetal genetic effects on birth weight and their relevance to cardio-metabolic risk factors. Nature Genetics 2019 51:5 51, 804–814 (May 2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 62.Surendran P., Feofanova E. V., et al. Discovery of rare variants associated with blood pressure regulation through meta-analysis of 1.3 million individuals. Nature Genetics 52, 1314–1332 (December 2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 63.Graham S. E., Clarke S. L., et al. The power of genetic diversity in genome-wide association studies of lipids. Nature 600, 675–679 (7890 2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 64.Chen M. H., Raffield L. M., et al. Trans-ethnic and Ancestry-Specific Blood-Cell Genetics in 746,667 Individuals from 5 Global Populations. Cell 182, 1198–1213.e14 (May 2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 65.Bycroft C., Freeman C., et al. The UK Biobank resource with deep phenotyping and genomic data. Nature 562, 203–209 (7726 2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 66.Chang C. C., Chow C. C., et al. Second-generation PLINK: rising to the challenge of larger and richer datasets. GigaScience 4, s13742-015–0047-8 (January 2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 67.Alexander D. H., Novembre J. & Lange K. Fast model-based estimation of ancestry in unrelated individuals. Genome research 19, 1655–64 (September 2009). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 68.Purcell S., Neale B., et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. American journal of human genetics 81, 559–75 (March 2007). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 69.Chen G. B. Estimating heritability of complex traits from genome-wide association studies using IBS-based Haseman-Elston regression. Frontiers in Genetics 5, 72296 (APR 2014). [DOI] [PMC free article] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.