Skip to main content

This is a preprint.

It has not yet been peer reviewed by a journal.

The National Library of Medicine is running a pilot to include preprints that result from research funded by NIH in PMC and PubMed.

bioRxiv logoLink to bioRxiv
[Preprint]. 2024 Aug 6:2023.08.04.551959. Originally published 2023 Aug 4. [Version 3] doi: 10.1101/2023.08.04.551959

Interpreting SNP heritability in admixed populations

Jinguo Huang 1,2, Nicole Kleman 3, Saonli Basu 4, Mark D Shriver 2, Arslan A Zaidi 3,5,
PMCID: PMC10418213  PMID: 37577588

Abstract

SNP heritability hsnp2 is defined as the proportion of phenotypic variance explained by genotyped SNPs and is believed to be a lower bound of heritability (h2), being equal to it if all causal variants are known. Despite the simple intuition behind hsnp2, its interpretation and equivalence to h2 is unclear, particularly in the presence of population structure and assortative mating. It is well known that population structure can lead to inflation in hˆsnp2 estimates because of confounding due to linkage disequilibrium (LD) or shared environment. Here we use analytical theory and simulations to demonstrate that hsnp2 estimates can be biased in admixed populations, even in the absence of confounding and even if all causal variants are known. This is because admixture generates LD, which contributes to the genetic variance, and therefore to heritability. Genome-wide restricted maximum likelihood (GREML) does not capture this contribution leading to under- or over-estimates of hsnp2 relative to h2, depending on the genetic architecture. In contrast, Haseman-Elston (HE) regression exaggerates the LD contribution leading to biases in the opposite direction. For the same reason, GREML and HE estimates of local ancestry heritability hγ2 are also biased. We describe this bias in hˆsnp2 and hˆγ2 as a function of admixture history and the genetic architecture of the trait and show that it can be recovered under some conditions. We clarify the interpretation of hˆsnp2 in admixed populations and discuss its implication for genome-wide association studies and polygenic prediction.

Introduction

The ability to estimate (narrow-sense) heritability (h2) from unrelated individuals was a major advance in genetics. Traditionally, h2 was estimated from family-based studies in which the phenotypic resemblance between relatives could be modeled as a function of their expected genetic relatedness [1]. However, this approach was limited to analysis of closely related individuals where pedigree information is available and the realized genetic relatedness is not too different from expectation [2]. With the advent of genome-wide association studies (GWAS), we hoped that many of the variants underlying this heritability would be uncovered. However, when genome-wide significant SNPs explained a much smaller fraction of the phenotypic variance, it became important to explain the missing heritability – were family-based estimates inflated or were GWAS just underpowered, limited by variant discovery?

Yang et al. (2010) [3] made the key insight that one could estimate the portion of h2 tagged by genotyped SNPs, regardless of whether or not they were genome-wide significant, by exploiting the subtle variation in the realized genetic relatedness among apparently unrelated individuals [35]. This quantity came to be known colloquially as ‘SNP heritability’ (hsnp2) and it is believed to be equal to h2 if all causal variants are included among genotyped SNPs [3]. Indeed, estimates of hsnp2 explain a much larger fraction of trait heritability than GWAS SNPs [3], approaching family-based estimates of h2 when whole genome sequence data, which captures rare variants, are used [6]. This has made it clear that GWAS have yet to uncover more variants with increasing sample size. Now, hsnp2 has become an important aspect of the design of genetic studies and is often used to define the power of variant discovery in GWAS and the upper limit of polygenic prediction accuracy.

Despite the utility and simple intuition of hsnp2, there is much confusion about its interpretation and equivalence to h2, particularly in the presence of population structure and assortative mating [712]. But much of the discussion of heritability in structured populations has focused on biases in hˆsnp2 – the estimator – due to confounding effects of shared environment and linkage disequilibrium (LD) with other variants [7, 911, 13]. There is comparatively little discussion, at least in human genetics, on the fact that LD due to population structure also contributes to genetic variance, and therefore, is a component of heritability [1] (but see [1416] for a rigorous discussion). We think this is at least partly due to the fact that most studies are carried out in cohorts with primarily European ancestry, where the degree of population structure is minimal and large effects of LD can be ignored. However, that is not the case for diverse, multi-ethnic cohorts, which have historically been underrepresented in genetic studies, but thanks to a concerted effort in the field, are now becoming increasingly common [1723]. The complex structure in these cohorts also brings unique methodological challenges and it is imperative that we understand whether existing methods, which have largely been evaluated in more homogeneous groups, generalize to more diverse cohorts.

Our goal in this paper is to study the behavior of hˆsnp2 in admixed populations. What is its interpretation in the ideal situation where causal variants are known? Is it an unbiased estimate of h2? To answer these questions, we derived a general expression for the genetic variance in admixed populations, decomposing it in terms of the contribution of population structure, which influences both the genotypic variance at individual loci and the LD across loci. We used theory and simulations to show that hˆsnp2 estimated with genome-wide restricted maximum likelihood (GREML) [3, 5] and Haseman-Elston (HE) regression [24] – two widely used approaches – can be biased in admixed and other structured populations, even in the absence of confounding and when all causal variants are known. We explain this in terms of the discrepancy between the model assumed in hˆsnp2 estimation and the generative model from which the genetic architecture of the trait in the population may have been sampled. We describe the bias in hˆsnp2 as a function of admixture history and genetic architecture and discuss its implications for GWAS and polygenic prediction accuracy.

Model

Genetic architecture

We begin by describing a generative model for the phenotype. Let y=g+e, where y is the phenotypic value of an individual, g is the genotypic value, and e is random error. We assume additive effects such that g=i=1mβixi where βi is the effect size of the ith biallelic locus and xi{0,1,2} is the number of copies of the trait-increasing allele. Importantly, the effect sizes are fixed quantities and differences in genetic values among individuals are due to random variation in genotypes. Note, that this is different from the model assumed by GREML where genotypes are fixed and effect sizes are random [14].

We denote the mean, variance, and covariance with E(.), V(.), and C(.), respectively, where the expectation is measured over random draws from the population rather than random realizations of the evolutionary process. We can express the additive genetic variance of a quantitative trait as follows:

Vg=Vi=1mβixi=i=1mβi2Vxi+jiβiβjCxi,xj

Here the first term represents the contribution of individual loci (genic variance) and the second term is the contribution of linkage disequilibrium (LD contribution). We make the assumption that loci are unlinked and therefore, the LD contribution is entirely due to population structure. We describe the behavior of Vg in a population that is a mixture of two previously isolated populations A and B that diverged from a common ancestor. To do this, we denote θ as the fraction of the genome of an individual with ancestry from population A. Thus, θ=1 if the individual is from population A, 0 if they are from population B, and θ(0,1) if they are admixed. Then, Vg can be expressed in terms of ancestry as (Appendix):

Vg=2Eθi=1mβi2fiA1fiA+21Eθi=1mβi2fiB1fiB (1.1)
+2Eθ1Eθi=1mβi2fiAfiB2 (1.2)
+2Vθi=1mβi2fiAfiB2 (1.3)
+4VθijβiβjfiAfiBfjAfjB (1.4)

where fiA and fiB are the allele frequencies in populations A and B, and E(θ) and V(θ) are the mean and variance of individual ancestry. The sum of the first three terms represents the genic variance and the last term represents the LD contribution.

Demographic history

From Eq. 1, it is clear that, conditional on the genetic architecture in the source populations (β,fA,fB),Vg is a function of the mean, E(θ), and variance, V(θ), of individual ancestry in the admixed population. We consider two demographic models that affect E(θ) and V(θ) in qualitatively different ways. In the first model, the source populations meet once t generations ago (we refer to this as t=0) in proportions p and 1p, after which there is no subsequent admixture (Fig. 1A). In the second model, there is continued gene flow in every generation from one of the source populations such that the mean overall amount of ancestry from population A is the same as in the first model (Fig. 1A). For brevity, we refer to these as the hybrid-isolation (HI) and continuous gene flow (CGF) models, respectively, following Pfaff et al. (2001) [25]. V(θ) is also affected by ancestry-based assortative mating, where individuals are more likely to partner with others of similar ancestry. We refer to this simply as assortative mating for brevity and model this following Zaitlen et al. (2017) using a parameter P(0,1), which represents the correlation of the ancestry of individuals across mating pairs in the population [26].

Figure 1:

Figure 1:

The behavior of mean and variance of individual ancestry as a function of admixture history. (A) Shows the demographic models under which simulations were carried out. Admixture might occur once (Hybrid Isolation, HI, left column) or continuously (Continuous Gene Flow, CGF, right column). (B) The mean individual ancestry, E(θ) remains constant over time in the HI model and increases in the CGF model with continued gene flow. (C) The variance in individual ancestry, V(θ) is maximum at t=0 in the HI model, decaying subsequently. V(θ) increases with gene flow in the CGF model and will subsequently decrease with time. P measures the strength of assortative mating, which slows the decay of V(θ). P=0.6 is missing for simulations run for 50 and 100 generations and θ{0.1,0.2} due to the difficulty in finding mate pairs (Methods).

Under these conditions, the behavior of E(θ) and V(θ) has been described previously [26, 27] (Fig. 1B and C). Briefly, in the HI model, E(θ) remains constant at p in the generations after admixture as there is no subsequent gene flow. V(θ) is at its maximum at t=0 when each individuals carries chromosomes either from population A or B, but not both. This genome-wide correlation in ancestry breaks down in subsequent generations as a function of mating, independent assortment, and recombination, leading to a decay in V(θ), the rate depending on the strength of assortative mating (Fig. 1C). In the CGF model, both E(θ) and V(θ) increase with time as new chromosomes are introduced from the source populations. But while E(θ) continues to increase monotonically, V(θ) will plateau and decrease due to the countervailing effects of independent assortment and recombination which redistribute ancestry in the population, reaching equilibrium at zero if there is no more gene flow and the population is mating randomly. V(θ) provides an intuitive and quantitative measure of the degree of population structure (along the axis of ancestry) in admixed populations.

Results

Genetic variance in admixed populations

To understand the expectation of genetic variance in admixed populations, it is first worth discussing its behavior in the source populations. In Eq. 1, the first term represents the within-population component Vgw and the last three terms altogether represent the component of genetic variance between populations A and B Vgb. Note that Vgb=gAgB22 is positive only if there is a difference in the mean genotypic values (Fig. 2). This variance increases with genetic divergence since the expected values of both (fiAfiB)2 and (fiAfiB)(fjAfjB) are functions of FST. While βi2(fiAfiB)2 is expected to increase monotonically with increasing divergence, βiβj(fiAfiB)(fjAfjB) is expected to be zero under neutrality because the direction of frequency change will be uncorrelated across loci. In this case, the LD contribution, i.e., (1.4), is expected to be zero and Vgb=(1.1)+(1.2)+(1.3). However, this is true only in expectation over the evolutionary process and the realized LD contribution may be non-zero even for neutral traits.

Figure 2:

Figure 2:

Decomposing genetic variance in a two-population system. The plot illustrates the expected distribution of genetic values in two populations under different selective pressures and the terms on the right list the total Vg and between-population genetic variance Vgb expected over the evolutionary process. For neutrally evolving traits (top row), we expect there to be an absolute difference in the mean genetic values (|gAgB|) that is proportional to FST. For traits under divergent selection (middle), |gAgB| is expected to be greater than that expected under genetic drift. For traits under stabilizing selection, |gAgB| will be less than that expected under genetic drift, and zero in the extreme case.

For traits under selection, the LD contribution is expected to be greater or less than zero, depending on the type of selection. Under divergent selection, trait-increasing alleles will be systematically more frequent in one population over the other, inducing positive LD across loci [28, 29], increasing the LD contribution, i.e., term (1.4). Stabilizing selection, on the other hand, induces negative LD [30, 31]. In the extreme case, the mean genetic values of the two populations are exactly equal and Vgb=(1.2)+(1.3)+(1.4)=0. For this to be true, (1.4) has to be negative and equal to (1.2) + (1.3), which are both positive, and the total genetic variance is reduced to the within-population variance, i.e., term (1.1) (Fig. 2). This is relevant because, as we show in the following sections, the behavior of the genetic variance in admixed populations depends on the magnitude of Vgb between the source populations.

We illustrate this by tracking the genetic variance in admixed populations for two traits, both with the same mean FST at causal loci but with different LD contributions (term 1.4): one where the LD contribution is positive (Trait 1) and the other where it is negative (Trait 2). Thus, traits 1 and 2 can be thought of as examples of phenotypes under divergent and stabilizing selection, respectively, and we refer to them as such from hereon. To simulate the genetic variance of such traits, we drew the allele frequencies (fA and fB) in populations A and B for 1,000 causal loci with FST~0.2 using the Balding-Nichols model [32]. We drew their effects (β) from 𝒩(0,12mf(1f)) where f is the mean allele frequency between the two populations, m is the number of loci. To simulate positive and negative LD, we permuted the effect signs across variants 100 times and selected the combinations that gave the most positive and negative LD contribution to represent the genetic architecture of traits that might be under directional (Trait 1) and stabilizing (Trait 2) selection, respectively (Methods). We simulated the genotypes of 10,000 individuals under the HI and CGF models for t{10,20,50,100} generations post-admixture and calculated genetic values for both traits using g=i=1mβixi, where m=1,000 (Method). The observed genetic variance at any time can then be calculated simply as the variance in genetic values, i.e. Vg=V(g).

In the HI model, E(θ) does not change (Fig. 1B) so terms (1.1) and (1.2) are constant through time. Terms (1.3) and (1.4) decay towards zero as the variance in ancestry goes to zero and Vg ultimately converges to (1.1) + (1.2) (Fig. 3). This equilibrium value is equal to the EVgθ (Appendix) and the rate of convergence depends on the strength of assortative mating, which slows the rate at which V(θ) decays. Vg approaches equilibrium from a higher value for traits under divergent selection and lower value for traits under stabilizing selection because of positive and negative LD contributions, respectively, at t=0 (Fig. 3). In the CGF model, Vg increases initially for both traits with increasing gene flow (Fig. 3). This might seem counter-intuitive at first because gene flow increases admixture LD, which leads to more negative values of the LD contribution for traits under stabilizing selection (Fig. S1). But this is outweighed by positive contributions from the genic variance – terms (1.1) + (1.2) + (1.3) – all of which initially increase with gene flow (Fig. S1). After a certain point, the increase in Vg slows down as any increase in V(θ) due to gene flow is counterbalanced by recombination and independent assortment. Ultimately, Vg will decrease if there is no more gene flow, reaching the same equilibrium value as in the HI model, i.e., EVgθ=(1.1)+(1.2). Because the loci are unlinked, we refer to the sum (1.3) + (1.4) as the contribution of population structure.

Figure 3:

Figure 3:

Genetic variance in admixed populations under the (A) HI and (B) CGF models. Dotted lines represent the expected genetic variance based on Eq. (1) and solid lines represent results of simulations averaged over ten replicates. Red and blue lines represent traits under divergent and stabilizing selection, respectively. P=0.6 is missing for simulations run for 50 and 100 generations and θ{0.1,0.2} due to the difficulty in finding mate pairs (Methods)

GREML estimation

In their original paper, Yang et al. (2010) defined hsnp2 as the variance explained by genotyped SNPs and not as heritability [3]. This is because h2 is the genetic variance explained by causal variants, which are unknown. Genotyped SNPs may not overlap with or tag all causal variants and thus, hsnp2 is understood to be a lower bound of h2, both being equal if causal variants are known [3]. Our goal is to demonstrate that this may not be true in structured populations and quantify the bias in hˆsnp2, even in the ideal situation when causal variants are known.

We used GREML, implemented in GCTA [3, 5], to estimate the genetic variance for our simulated traits. GCTA assumes the following model: y=Zu+ϵ where Z is an n×m standardized genotype matrix such that the genotype of the kth individual at the ith locus is zik=xik2fi2fi1fi, fi being the allele frequency. The SNP effects corresponding to the scaled genotypes are assumed to be random and independent such that u~𝒩(0,Iσu2m) and ϵ~𝒩0,Iσϵ2 is random environmental error. Then, the phenotypic variance can be decomposed as:

V(y)=V(Zu)+V(e)=ZZmσu2+σϵ2

where ZZm is the genetic relationship matrix (GRM), the variance components σu2 and σϵ2 are estimated using restricted maximum likelihood, and hˆsnp2 is calculated as σˆu2σˆu2+σˆϵ2. We are interested in asking whether σˆu2 is an unbiased estimate of Vg. To answer this, we constructed the GRM with causal variants and estimated σˆu2 using GCTA [3, 4].

GCTA under- and over-estimates the genetic variance in admixed populations for traits under divergent (Trait 1) and stabilizing selection (Trait 2), respectively, when there is population structure, i.e., when V(θ)>0 (Fig. 4A). One reason for this bias is that the GREML model assumes that the effects are independent, and therefore the LD contribution is zero. This, as discussed in the previous section, is not true for traits under divergent or stabilizing selection between the source populations, and only true for neutral traits in expectation. Because of this, σˆu2 does not capture the LD contribution, i.e. term (1.4) (Fig. 4A). But σˆu2 can be biased even if the LD contribution is zero if the genotypes are scaled with 2fi1fi – the standard practice – where fi is the frequency of the allele in the population. This scaling assumes that Vxi=2fi1fi, which is true only if the population were mating randomly. In an admixed population Vxi=2fi1fi+2V(θ)(fiAfiB)2, where fi, fiA, and fiB correspond to frequency in the admixed population, and source populations, A and B, respectively (Appendix). Alternatively, if the genotypes are scaled, Vzi=1+2V(θ)Fst(i) where Fst(i) is the Fst at the ith locus. We show that this assumption biases σˆu2 downwards by a factor of 2V(θ)(fiAfiB)2 (or 2V(θ)Fst(i) if genotypes are scaled) – term (1.3) (Fig. 4B, Appendix). Thus, with the standard scaling, σˆu2 gives a biased estimate in the presence of population structure, even of the genic variance.

Figure 4:

Figure 4:

The behavior of GREML estimates of the genetic variance σˆu2 in admixed populations under the HI (left column) and CGF (right column) models either without (A-D) or with (E-H) individual ancestry as a fixed effect. The solid lines represent estimates from simulated data averaged across ten replicates with red and blue colors representing estimates for traits under divergent and stabilizing selection, respectively. P indicates the strength of assortative mating. The shaded area represents the 95% confidence bands generated by bootstrapping (sampling with replacement 100 times) the point estimate reported by GCTA. The dotted lines either represent the expected variance in the population based on Eq. 1 (A & B) or the expected estimate for three different ways of scaling genotypes (B-D & F-H). (A-B & E-F) show the behavior of σˆu2 for the default scaling, (C, G) shows σˆu2 when the genotype at a locus is scaled by its sample variance (V(x) scaled), and (D, H) when it is scaled by the sample covariance (LD scaled).

The overall bias in σˆu2 is determined by the relative magnitude and direction of terms (1.3) and (1.4), both of which are functions of V(θ), and therefore, of the degree of structure in the population. The contribution of term (1.3) will be modest, even in highly structured populations (Fig. S1) and therefore, the overall bias is largely driven by the LD contribution. If there is no more gene flow, V(θ) will ultimately go to zero and Vg will converge towards σˆu2. Thus, σˆu2 is more accurately interpreted as the genetic variance expected if the LD contribution were zero and if the population were mating randomly. In other words, Eσˆu2=(1.1)+(1.2)Vg (Fig. 4B).

In principle, we can recover the missing components of Vg by scaling the genotypes appropriately. For example, we can recover term (1.3) by scaling the genotype at each variant i by its sample variance, i.e., zik=xik2fiVxi (Fig. 4C) (Appendix). We can also recover term (1.4) by scaling the genotypes with the covariance between SNPs, i.e., the LD matrix, as previously proposed [33, 34] (Methods). In matrix form, the ‘LD-scaled’ genotypes can be written as Z=(X2P)U1 where P is an n×m matrix such that all elements of the ith column contain the frequency of the ith SNP and U is the (upper triangular) square root matrix of the LD matrix, i.e., Σ=UU [33]. GREML recovers the LD contribution under this scaling, resulting in unbiased estimates of Vg for both traits (Fig. 4D, Appendix).

In practice, however, the LD contribution may not be fully recoverable for two reasons. One, the LD-scaled GRM requires computing the inverse of Σ or U which may not exist, especially if the number of markers is greater than the sample size – the case for most human genetic studies. Second, it is common to include individual ancestry or principal components of the GRM as fixed effects in the model to account for inflation in heritability estimates due to shared environment. This should also have the effect of removing the components of genetic variance along the ancestry axes, the residual variance being equal to E{V(gθ)}=(1.1)+(1.2)(1.3) (Appendix). Indeed, this is what we observe in Fig. 4H. Thus, if ancestry is included as a fixed effect, we expect Vg to be underestimated in the presence of population structure, regardless of genetic architecture.

HE estimation

Haseman-Elston (HE) regression also assumes a random-effects model but uses a method-of-moments approach, as opposed to GREML, which maximizes the likelihood to estimate Vg. Previous work has shown that as long as all causal variants are included in the GRM calculation, the HE estimator will not be biased, even if they are in LD with each other [35]. We show that in the presence of positive and negative LD between causal loci, as exemplified by traits under divergent and stabilizing selection, respectively, the HE estimates of Vg are biased upwards and downwards, respectively (Fig. 5AB). To understand this discrepancy and the source of bias in our simulations, recall that HE estimates Vg from the regression of the (pairwise) phenotypic covariance between individuals on their genotypic covariance [24]. More specifically, if we denote Ykl=ykyl as the product of the (centered) phenotypes of kth and lth individuals, and ψkl as the kth and lth entry of the GRM, then the HE estimator can be written as:

Figure 5:

Figure 5:

Genetic variance (Vˆg) estimated with HE regression in admixed populations under the HI (left column) and CGF (right column) models either without (A-C) or with (D-F) adjustment for individual ancestry. The solid lines represent estimates from simulated data averaged across ten replicates with red and blue colors representing estimates for traits under divergent and stabilizing selection, respectively. P indicates the strength of assortative mating. (A & D) show the behavior of Vˆg for the default scaling, (B, E) shows Vˆg when the genotype at a locus is scaled by its sample variance (V(x) scaled), and (C, F) when it is scaled by the sample covariance (LD scaled). The dotted lines in A-E represent the expected Vg in the population based on Eq. 1 and in F, represent the expected Vg after removing any genetic variance along the ancestry axis. The shaded areas represent the 95% bootstrapped confidence bands of the estimate.

Vˆg=CovYkl,ψklVarψkl=Eykylw=1MzwkzwlEi=1Mzikzilw=1Mzwkzwl=Egk+ekgl+elw=1MzwkzwlEi=1Mzikzilw=1Mzwkzwl=Egkglw=1MzwkzwlEi=1Mzikzilj=1Mzwkzwl=Ei=1Mj=1Muiujzikzjlw=1MzwkzwlEi=1Mzikzilw=1Mzwkzwl=Ei=1Mj=1Muiujw=1MzikzjlzwkzwlEi=1Mw=1Mzikzilzwkzwl=Ei=1Mui2w=1MzikzjlzwkzwlEi=1Mw=1Mzikzilzwkzwl+Ei=1Mjiuiujw=1MzikzjlzwkzwlEi=1Mw=1Mzikzilzwkzwl (2)

Where the first and second terms represent the genic and LD components, respectively, of the estimate. Population structure induces correlations between the alleles at a given locus as well as across loci (i.e., LD). But the LD may not be directional, i.e., trait-increasing alleles may be as likely to be co-inherited with each other as they are to trait-decreasing alleles, and vice versa – implicit under the standard random-effects model. Thus, in the absence of directional LD, the second term is zero and the first term is unaffected as long as all causal variants are included in the GRM, because the increase in the numerator due to population structure is proportional to the denominator [35]. Directional LD does not affect the first term but exaggerates the contribution from the second term, i.e., the LD component (see Appendix section A3.2). Consequently, HE regression over- and under-estimates Vg for traits with positive and negative LD, respectively. Note that this bias is in the opposite direction of the bias observed with GREML, which fails to capture the LD contribution. Scaling the genotype at a locus by its LD with other loci, as discussed in the previous section, corrects for the bias in HE regression regardless of genetic architecture, yielding estimates consistent with GREML (Fig. 5C). Thus, GREML and HE regression are guaranteed to yield the same estimates only if the underlying model specifying the distribution of effects is consistent with the true architecture of the trait.

The practice of including individual ancestry as a covariate in HE regression to account for shared environment [11] reduces the bias from exaggerated LD contributions (Fig. 5DF). But, as with GREML, this also removes any genetic variance that may exist along the ancestry axis, yielding underestimates of Vg, regardless of genetic architecture.

Local ancestry heritability

A related quantity of interest in admixed populations is local ancestry heritability hγ2, which is defined as the proportion of phenotypic variance that can be explained by local ancestry. Zaitlen et al. (2014) [36] showed that this quantity is related to, and can be used to estimate, h2 in admixed populations. The advantage of this approach is that local ancestry segments shared between individuals are identical by descent and are therefore, more likely to tag causal variants compared to array markers, allowing one to potentially capture the contributions of rare variants [36]. Here, we show that in the presence of population structure, (i) the relationship between hγ2 and h2 is not straightforward and (ii) hˆγ2 may be a biased estimate of local ancestry heritability under the random effects model for the same reasons that hˆsnp2 is biased.

We define local ancestry γi{0,1,2} as the number of alleles at locus i that trace their ancestry to population A. Thus, ancestry at the ith locus in individual k is a binomial random variable with Eγik=2θk, θk being the ancestry of the kth individual. Similar to genetic value, the ‘ancestry value’ of an individual can be defined as i=1mϕiγi, where ϕi=βi(fiAfiB) is the effect size of local ancestry (Appendix). Then, the genetic variance due to local ancestry can be expressed as (Appendix):

Vγ=Vi=1mϕiγi=i=1mϕi2Vγi+i=1mjiϕiϕjCγi,γj=2E(θ){1E(θ)}i=1mϕi2+2V(θ)i=1mϕi2+4V(θ)i=1mjiϕiϕj=2E(θ){1E(θ)}i=1mβi2(fiAfiB)2+2V(θ)i=1mβi2(fiAfiB)2+4V(θ)i=1mjiβiβj(fiAfiB)(fjAfjB)

and heritability explained by local ancestry is simply the ratio of Vγ and the phenotypic variance. Note that Vγ=(1.2)+(1.3)+(1.4) and therefore its behavior is similar to Vg in that the terms (1.3) and (1.4) decay towards zero as V(θ)0, and Vγ converges to (1.2) (Fig. S2). Additionally, the dependence of Vγ on both E(θ) and V(θ) precludes a straightforward derivation between local ancestry heritability and h2.

GREML estimation of hˆγ2 is similar to that of hˆsnp2, the key difference being that the former involves constructing the GRM using local ancestry instead of genotypes [36]. The following model is assumed: y=Wv+ξ where W is an n×m standardized local ancestry matrix, v~𝒩(0,Iσv2m) are local ancestry effects, and ξ~𝒩(0,Iσξ2). Note that σξ2 captures both environmental noise as well as any genetic variance independent of local ancestry. The phenotypic variance is decomposed as V(y)=V(Wv)+V(ξ)=WWmσv2+σξ2 where WWm is the local ancestry GRM and σv2 is the parameter of interest, which is believed to be equal to Vγ – the genetic variance due to local ancestry.

We show that, in the presence of population structure, i.e., when V(θ)>0, GREML σˆv2 is biased downwards relative to Vγ for traits under divergent selection and upwards for traits under stabilizing selection because it does not capture the contribution of LD (Fig. 6A). But there is another source of bias in σˆv2, which tends to be inflated in the presence of population structure if individual ancestry is not included as a covariate, even with respect to the expectation of Vγ under equilibrium (seen more clearly in Fig. 6BC). We suspect this inflation is because of strong correlations between local ancestry – local ancestry disequilibrium – across loci that inflates σˆv2 in a way that is not adequately corrected even when all causal variants are included in the model [4, 10]. Scaling local ancestry by its covariance removes this bias and recovers the contribution of LD (Fig. 6D) presumably because this accounts for the correlation in genotypes across loci. Including individual ancestry as a fixed effect also corrects for the inflation in σˆv2 (Fig. 6EH). But as with σˆu2, this practice will underestimate the genetic variance due to local ancestry in the presence of population structure because it removes the variance along the ancestry axis (Fig. 6EH).

Figure 6:

Figure 6:

The behavior of GREML estimates of the variance due to local ancestry σˆv2 in admixed populations under the HI (left column) and CGF (right column) models either without (A-D) or with (E-H) individual ancestry included as a fixed effect. The solid lines represent estimates from simulated data averaged across ten replicates with red and blue colors representing estimates for traits under divergent and stabilizing selection, respectively. P indicates the strength of assortative mating. The dotted lines either represent the expected variance in the population (A & B) or the expected estimate for three different ways of scaling local ancestry (B-D & F-H). (A-B & E-F) show the behavior of σˆv2 for the default scaling, (C, G) shows σˆv2 when local ancestry is scaled by the sample variance, and (D, H) when it is scaled by the sample covariance. Shaded regions represent the 95% confidence bands. Some runs in (D & H) failed to converge as seen by the missing segments of the solid lines because the expected variance in such cases was too small.

Based on the above, GREML hˆγ2 and corresponding estimates of h2 are more accurately interpreted as the heritability due to local ancestry and heritability, respectively, expected in the absence of population structure. We believe hˆγ2 is still useful in that, because it should capture the effects of rare variants, it can be used to estimate the upper bound of hˆsnp2.

In a previous paper, we suggested that local ancestry heritability could potentially be used to estimate the genetic variance between populations [37]. Our results suggest this is not possible for two reasons. First, the GREML estimator of local ancestry heritability, as we show in this section is biased and does not capture the LD contribution. But even if we were able to recover the LD component, our decomposition shows that local ancestry is equal to the genetic variance between populations Vgb only when E(θ)=0.5 and V(θ)=E(θ){1E(θ)}=0.25, which is only possible at t=0 in the HI model. After admixture, V(θ) decays and the equivalence between Vγ and Vgb is lost, making it impossible to estimate the latter from admixed populations, especially for traits under divergent or stabilizing selection, even if the environment is randomly distributed with respect to ancestry. We note that this conclusion was recently reached independently by Schraiber and Edge (2023) [38].

How much does LD contribute to Vg in practice?

In the previous sections, we showed theoretically that hˆsnp2 may be biased in admixed populations even if the causal variants are known and in the absence of confounding by shared environment. GREML fails to capture the LD contribution whereas HE regression overestimates it. The extent to which hˆsnp2 is biased because of this reason in practice is ultimately an empirical question, which is difficult to answer because the true genetic architecture – the LD contribution in particular – is unknown. In this section, we develop some intuition for this contribution among individuals with mixed African and European ancestry using a combination of simulations and empirical data analysis.

First, we simulated a neutral trait using genotype data from the African Americans (ASW) from the 1,000 Genomes Project (1KGP) [39]. To do this, we sampled m{10,100,1,000} causal loci from a set of common (MAF > 0.01), LD pruned variants and assigned them effects such that βi~𝒩(0,1mVxi), i.e., the expected genic variance is Ei=1mβi2Varxi=1 (Methods). We computed the genic and LD contributions and repeated this process 1,000 times where each replicate can be thought of as an independent realization of the genetic architecture of a neutrally evolving trait. We show that the LD contribution may be zero in expectation but can be substantial for a given trait (up to 50% of the genic variance, Fig. S4), even in the absence of selection.

Second, we estimated the LD contribution of genome-wide significant SNPs for 26 quantitative traits from the GWAS catalog [40]. To do this, we decomposed the variance explained in ASW into the four components in Equation 1 using allele frequencies (fA and fB) from the YRI and CEU and the mean (E(θ)0.77) and variance (V(θ)0.02) of individual ancestry from ASW (Methods). We show that for skin pigmentation – a trait under strong divergent selection – the LD contribution, i.e. term (1.4), is positive and accounts for ≈ 40 — 50% of the total variance explained. This is because of large allele frequency differences between Africans and Europeans that are correlated across skin pigmentation loci, consistent with strong polygenic selection favoring alleles for darker pigmentation in regions with high UV exposure and vice versa [37, 4144]. But for most other traits, LD contributes relatively little, explaining a modest, but non-negligible proportion of the genetic variance in height, LDL and HDL cholestrol, mean corpuscular hemoglobin (MCH), neutrophil count (NEU), and white blood cell count (WBC) (Fig. 7). Because we selected independent associations for this exercise (Methods), the LD contribution is driven entirely due to population structure in ASW. The contribution of population structure to the genic variance, i.e., term (1.3) is also small even for traits like skin pigmentation and neutrophil count with large effect alleles that are highly diverged in frequency between Africans and Europeans [42, 43, 4547]. Overall, this suggests that population structure contributes relatively little, as least to the variance explained by GWAS SNPs.

Figure 7:

Figure 7:

Decomposing the genetic variance explained by GWAS SNPs in the 1000 Genomes ASW (African Americans from Southwest). We calculated the four variance components listed in Eq. 1, their values shown on the y-axis as a fraction of the total variance explained (shown as percentage at the bottom). The LD contribution, which can be positive or negative, is shown in yellow. The number of variants used to calculate variance components for each trait is also shown at the bottom.

Discussion

Despite the growing size of GWAS and discovery of thousands of variants for hundreds of traits [40], the heritability explained by GWAS SNPs remains a fraction of twin-based heritability estimates. Yang et al. (2010) introduced the concept of SNP heritability (hsnp2) that does not depend on the discovery of causal variants but assumes that they are numerous and are more or less uniformly distributed across the genome (the infinitesimal model), their contributions to the genetic variance ‘tagged’ by genotyped SNPs [3]. hsnp2 is now routinely estimated in most genomic studies and at least for some traits (e.g. height and BMI), these estimates now approach twin-based heritability [6]. But despite the widespread use of hˆsnp2, its interpretation remains unclear, particularly in the presence of admixture and population structure. It is generally accepted that hˆsnp2 can be biased in structured populations because of confounding effects of unobserved environmental factors and LD between causal variants [4, 7, 911, 48]. But hˆsnp2 may be biased even in the absence of confounding because of misspecification of the underlying random-effects model, i.e., if the model does not represent the genetic architecture from which the trait is sampled [1416, 49, 50].

Under the standard GREML model, SNP effects are assumed to be uncorrelated and the total genetic variance can be represented as the sum of the variance explained by individual loci, i.e. the genic variance [1416]. In admixed populations, there is substantial LD, which can contribute to the genetic variance, and can persist for a number of generations, despite recombination, due to continued gene flow and/or ancestry-based assortative mating. GREML does not capture this LD contribution [12, 15], and therefore, may lead to biased estimates of hsnp2. The LD contribution can be negative for traits under stabilizing selection, and positive for traits under divergent selection between the source populations, leading to over- or under-estimates, respectively. Thus, GREML estimates of hsnp2, assuming genotypes are scaled properly (see below), is better interpreted as the proportion of phenotypic variance explained by the genic variance. Estimates of local ancestry heritability (hˆγ2) [36, 51] should be interpreted similarly.

We show that with GREML, hˆsnp2 can be biased even when the LD contribution is zero if the genotypes are scaled by 2f(1f) – the standard approach, which implicitly assumes a randomly mating population. In the presence of population structure, the variance in genotypes can be higher and hˆsnp2 does not capture this additional variance, which we show can be recovered by scaling genotypes by the SNP variance (Var(x)). In principle, the LD contribution can also be recovered by scaling genotypes by the SNP covariance, i.e., the LD matrix, as previously suggested [33, 34]. But this approach is limited to situations where the sample size is much larger than the number of markers.

We also investigated the behavior of another widely used approach to estimate hsnp2 – Haseman-Elston regression. We show that hˆsnp2 estimated with HE regression is also biased, but for different reasons and in the opposite direction of the bias observed with GREML. HE regression exaggerates the LD contribution, leading to over- and under-estimates of hsnp2 for traits where the causal loci are in positive and negative LD, respectively. Approaches that correct for population structure [35] should remove this source of bias but would also remove any genetic variance in the trait along the ancestry axis, including the LD contribution. This results in underestimates of hsnp2, regardless of trait architecture.

One limitation of this paper is that we have focused on random-effects estimators of hsnp2 because of their widespread use. Estimators of hsnp2 can be broadly grouped into random- and fixed effect estimators based on how they treat SNP effects [35]. Fixed effect estimators make fewer distributional assumptions but they are not as widely used because they require conditional estimates of all variants – a high-dimensional problem where the number of markers is often far larger than the sample size [52]. This is one reason why random effect estimators, such as GREML, are popular – because they reduce the number of parameters that need to be estimated by assuming that the effects are drawn from some distribution where the variance is the only parameter of interest. Fixed effects estimators, in principle, should be able to capture the LD contribution but this is not obvious in practice since the simulations used to evaluate the accuracy of such estimators still assume uncorrelated effects [35, 52, 53]. Further research is needed to clarify the interpretation of the different estimators of hsnp2 in structured populations under a range of genetic architectures.

Does the LD contribution to the genetic variance have practical implications? The answer to this depends on the context in which SNP heritability is used. hˆsnp2 can be useful in quantifying the power to detect variants in GWAS where the quantity of interest is the genic variance. But hˆsnp2 can lead to misleading conclusions if used to measure the extent to which genetic variation contributes to phenotypic variation, in predicting the response to selection, or in defining the upper limit of polygenic prediction accuracy [2] – applications where the LD contribution is important.

Ultimately, the discrepancy between hˆsnp2 and h2 in practice is an empirical question, the answer to which depends on the degree of population structure (which we can measure) and the genetic architecture of the trait (which we do not know a priori). We show that for most traits, the contribution of population structure to the variance explained by GWAS SNPs is modest among African Americans. Thus, if we assume that the genetic architecture of GWAS SNPs represents that of all causal variants, then despite incorrect assumptions, the discrepancy between hˆsnp2 and h2 should be fairly modest. But this assumption is unrealistic given that GWAS SNPs are common variants that in most cases cumulatively explain a fraction of trait heritability. What is the LD contribution of the rest of the genome, particularly rare variants? This is not obvious and will become clearer in the near future through large sequence-based studies [54]. While these are underway, theoretical studies are needed to understand how different selection regimes influence the directional LD between causal variants – clearly an important aspect of the genetic architecture of complex traits.

Methods

Simulating genetic architecture

We first drew the allele frequency f0 of 1,000 biallelic causal loci in the ancestor of populations A and B from a uniform distribution, U(0.001,0.999). Then, we simulated their frequency in populations A and B (fA and fB) under the Balding-Nichols model [32], such that fA,fB~Beta(f0(1F)F,1f0(1F)F) where F=0.2 is the inbreeding coefficient. We implemented this using code adapted from Lin et al. (2021) [55]. To avoid drawing extremely rare alleles, we continued to draw fA and fB until we had 1,000 loci with fA,fB(0.01,0.99).

We generated the effect size (β) of each locus by sampling from 𝒩(0,12mf(1f)), where m is the number of loci and f is the mean allele frequency across populations A and B. Thus, rare variants have larger effects than common variants and the total genetic variance sums to 1. Given these effects, we simulated two different traits, one with a large difference in means between populations A and B (Trait 1) and the other with roughly no difference (Trait 2). This was achieved by permuting the signs of the effects 100 times to get a distribution of Vgb – the genetic variance between populations. This has the effect of varying the LD contribution without changing the FST at causal loci. We selected the maximum and minimum of Vgb to represent Traits 1 and 2.

Simulating admixture

We simulated the genotypes, local ancestry, and phenotype for 10,000 admixed individuals per generation under the hybrid isolation (HI) and continuous gene flow (CGF) models by adapting the code from Zaitlen et al. (2017) [26]. We denote the ancestry of a randomly selected individual k with θ, the fraction of their genome from population A. At t=0 under the HI model, we set θ to 1 for individuals from population A and 0 if they were from population B such that E(θ)=p{0.1,0.2,0.5} with no further gene flow from either source population. In the CGF model, population B receives a constant amount q from population A in every generation starting at t=0. The mean overall proportion of ancestry in the population is kept the same as the HI model by setting q=1(1p)1t where t is the number of generations of gene flow from A. In every generation, we simulated ancestry-based assortative mating by selecting mates such that the correlation between their ancestries is P{0,0.3,0.6,0.9} in every generation. We do this by repeatedly permuting individuals with respect to each other until P falls within ±0.01 of the desired value. It becomes difficult to meet this criterion when V(θ) is small (Fig. 1C). To overcome this, we relaxed the threshold up to 0.04 for some conditions, i.e., when θ{0.1,0.2} and t50. We generated expected variance in individual ancestry using the expression in Zaitlen et al. (2017) [26]. At time t since admixture, Vθt=Vθt1(1+P)2 under the HI model where P measures the strength of assortative mating, i.e, the correlation between the ancestry between individuals in a mating pair. Under the CGF model, Vθt=q(1q)Eθt12+q(1q)12Eθt1+(1q)Vθt1(1+P)2 (Appendix).

We sampled the local ancestry at each ith locus as γi=γif+γim where γim~Bin1,θm, γif~Bin1,θf and θm and θf represent the ancestry of the maternal and paternal chromosome, respectively. The global ancestry of the individual is then calculated as θk=i=1mγim+γif2m, where m is the number of loci. We sample the genotypes xim and xif from a binomial distribution conditioning on local ancestry. For example, the genotype on the maternal chromosome is xim~Bin(1,fiA) if γim=1 and xim~Bin1,fiB if γim=0 where fiA and fiB represent the allele frequency in populations A and B, respectively. Then, the genotype can be obtained as the sum of the maternal and paternal genotypes: xi=xim+xip. We calculate the genetic value of each individual as g=i=1mβixi and the genetic variance as V(g).

Heritability estimation with GREML

We used the --reml and --reml-no-constrain flags in GCTA [5] to estimate σu2 and σv2, the genetic variance due to genotypes and local ancestry, respectively. We could not run GCTA without noise in the genetic values so we simulated individual phenotypes with a heritability of h2=0.8 by adding random noise e~𝒩(0,Vg1h2h2). We computed three different GRMs, which correspond to different transformations of the genotypes: (i) standard, (i) Variance or V(x) scaled, and (ii) LD-scaled.

For the standard GRM, the genotypes at the ith SNP are standardized such that zi=xi2fi2fi1fi. For the variance scaled GRM, we computed zi=xi2fiVxi where Vxi is the sample variance of the genotypes at the ith SNP. The LD-scaled GRM conceptually corresponds to standardizing the genotypes by the SNP covariance, rather than its variance. Let X represent the n×m unstandardized matrix of genotypes and P represent an n×m matrix where the ith column contains the allele frequency of that SNP. Let U be the upper triangular ‘square root’ matrix of the m×m SNP covariance matrix Σ such that Σ=UU. Then, the standardized genotypes are computed as Z=(X2P)U1 and the GRM becomes (X2P)Σ1(X2P) [33]. Similarly, the three GRMs for local ancestry were computed by scaling local ancestry with (i) 2γi1γi where we denote γi as the mean local ancestry at the ith SNP, or with the (ii) variance, or (iii) covariance of local ancestry, respectively. We estimated σu2 and σv2 with and without individual ancestry as a fixed effect to correct for any confounding due to genetic stratification. This was done by using the --qcovar flag.

Heritability estimation with HE regression

Haseman-Elston regression with and without ancestry correction was implemented using custom scripts in R [56]. To estimate Vg without ancestry correction, we first computed the cross-product of the centered phenotypes (y), resulting in an n×n matrix yy. We stacked the upper-triangular matrix of yy into a vector and regressed it on the corresponding elements of the GRM (ψ), taking the slope as an estimate of Vg:

Vˆg=k=1l<kykylψklkl<kψkl2

To correct for individual ancestry, we followed the approach of Min et al. (2022) [35]. To do this, we first regressed out the effect of individual ancestry (θ) on phenotype. The regression coefficient can be expressed as θθθ1θ and the residuals as y*=(Iθθθ1θ)y. Then, we fit the following model:

Ey*y*=Vgψ+VeI+δθθ

where θθ represents the cross-product of individual ancestry, δ represents its corresponding regression coefficient, and Vg represents the parameter of interest, i.e., the genetic variance and Ve, the residual variance.

To demonstrate that the bias in HE estimates arises because of a bias in the estimate of LD contribution, not the genic variance, we carried out a simple simulation where half of the individuals in the population derive their ancestry from population A and the rest from population B. This is equivalent to the meta-population under the HI model at t=0 where E(θ)=0.5. We simulated genotypes for 1, 000 individuals for m=100 loci where the allele frequencies in populations A and B were set to fA=0.1 and fB=0.8, respectively. We standardized the genotypes at each locus i using the square-root of the sample variance and assigned effect sizes such that the total genetic variance explained by all loci is equal to 1, i.e., the effect of the scaled genotype at the ith locus is ui=1m. This is equivalent to the effect size of the unscaled genotypes being βi=1mVxi where Vxi is the sample variance at the ith locus. We introduced randomness in the direction of the effect by assigning a negative or positive sign to each locus uniformly at random 100 times to generate 100 traits with the same genic variance but varying LD contributions. Then, for each trait we computed the two terms in Eq. 2, which should converge to the genic variance and LD contributions, which represent the genic and LD components to the HE regression estimate. Fig. S5 shows that in the presence of directional LD, the overall bias is in the HE regression estimate is due to an exaggerated estimate of the LD contribution.

Estimating variance explained by GWAS SNPs

To decompose the variance explained by GWAS SNPs in African Americans, we needed four quantities: (i) effect sizes of GWAS SNPs, (ii) their allele frequencies in Africans and Europeans, and (iii) the mean and variance of global ancestry in African Americans (Equation 1).

We retrieved the summary statistics of 26 traits from GWAS catalog [40]. Full list of traits and the source papers [44, 5764] are listed in Table S1. To maximize the number of variants discovered, we chose summary statistics from studies that were conducted in both European and multi-ancestry samples and that reported the following information: effect allele, effect size, p-value, and genomic position. For birth weight, we downloaded the data from the Early Growth Genetics (EGG) consortium website [61] since the version reported on the GWAS catalog is incomplete. For skin pigmentation, we chose summary statistics from the UKB [65] released by the Neale Lab (http://www.nealelab.is/uk-biobank) and processed by Ju and Mathieson [44] to represent effect sizes estimated among individuals of European ancestry. We also selected summary statistics from Lona-Durazo et al. (2019) where effect sizes were meta-analyzed across four admixed cohorts [57]. Lona-Durazo et al. provide summary statistics separately with and without conditioning on rs1426654 and rs35397 – two large effect variants in SLC24A5 and SLC45A2. We used the ‘conditioned’ effect sizes and added in the effects of rs1426654 and rs35397 to estimate genetic variance.

We selected independent hits for each trait by pruning and thresholding with PLINK v1.90b6.21 [66] in two steps as in Ju et al. (2020) [44]. We used the genotype data of GBR from the 1000 genome project [39] as the LD reference panel. We kept only SNPs (indels were removed) that passed the genome-wide significant threshold (--clump-p1 5e-8) with a pairwise LD cutoff of 0.05 (--clump-r2 0.05) and a physical distance threshold of 250Kb (--clump-kb 250) for clumping. Second, we applied a second round of clumping (--clump-kb 100) to remove SNPs within 100kb.

When GWAS was carried out separately in different ancestry cohorts in the same study, we used inverse-variance weighting to meta-analyze effect sizes for variants that were genome-wide significant (p-value < 5 × 10−8) in at least one cohort. This allowed us to maximize the discovery of variants such as the Duffy null allele that are absent among individuals of European ancestry but polymorphic in other populations [47].

We used allele frequencies from the 1000 Genomes CEU and YRI to represent the allele frequencies of GWAS SNPs in Europeans and Africans, respectively, making sure that the alleles reported in the summary statistics matched the alleles reported in the 1000 Genomes. We estimated the global ancestry of ASW individuals (N = 74) with CEU and YRI individuals from 1000 genome (phase 3) using ADMIXTURE 1.3.0 [67] with k=2 and used it to calculate the mean (proportion of African ancestry = 0.767) and variance (0.018) of global ancestry in ASW. With the effect sizes, allele frequencies, and the mean and variance in ancestry, we calculated the four components of genetic variance using Equation 1 and expressed them as a fraction of the total genetic variance.

Initially, the multi-ancestry summary statistics for a few traits (NEU, WBC, MON, MCH, BAS) yielded values > 1 for the proportion of variance explained. This is likely because, despite LD pruning, some of the variants in the model are not independent and tag large effect variants under divergent selection such as the Duffy null allele, leading to an inflated contribution of LD. We checked this by calculating the pairwise contribution, i.e., βiβj(fiAfiB)(fjAfjB), of all SNPs in the model and show long-range positive LD between variants on chromosome 1 for NEU, WBC, and MON, especially with the Duffy null allele (Fig. S6AC). A similar pattern was observed on chromosome 16 for MCH, confirming our suspicion. This also suggests that for certain traits, pruning and thresholding approaches are not guaranteed to yield independent hits. To get around this problem, we retained only one association with the lowest p-value, each from chromosome 1 (rs2814778 for NEU, WBC, and MON) and chromosome 16 (rs13331259 for MCH) (Fig. S6D). For BAS, we observed that the variance explained was driven by a rare variant (rs188411703, MAF = 0.0024) of large effect (β=2.27). We believe this effect estimate to be inflated and therefore, we removed it from our calculation.

As a sanity check, we independently estimated the genetic variance as the variance in polygenic scores, calculated using --score flag in PLINK, [66] in ASW individuals. We compared the first estimate of the genetic variance to the second (Fig. S7) to confirm two things: (i) the allele frequencies, and mean and variance in ancestry are estimated correctly, and (ii) the variants are more or less independent in that they do not absorb the effects of other variants in the model. We show that the two estimates of the genetic variance are strongly correlated (r0.85, Fig. S7). The 95% confidence intervals were calculated by sampling individuals with replacement 10,000 times.

Supplementary Material

Supplement 1

Acknowledgements

We thank Iain Mathieson and Doc Edge for helpful comments on the manuscript. This study was funded by National Institute of General Medical Sciences award R00GM137076 to A.A.Z. The content of this paper is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

Appendix

A1. Variance in ancestry

We denote variance and covariance with V(.) and C(.) and used the expressions in [26] to generate the expected value for the variance in ancestry, i.e., V(θ). This is straightforward for the HI model, where at time tVθt=Vθt11+Pt12.Pt=Corθm,θf measures the strength of assortative mating, i.e, the correlation between the ancestry across mating pairs θm,θf at time t. For simplicity, we assumed this to be constant in every generation, i.e. Pt=Pt1=P following [26]. Since our notation slightly differs from [26], we re-derived the expression for Vθt for the CGF model where population B receives a constant amount q of gene flow from population A in every generation. Note, that Eθt=q+(1q)Eθt1. Then,

Vθt=Eθt2Eθt2=q+(1q)Eθt1m+θt1f2θt1m+θt1f2q+(1q)Eθt12=q+(1q)42Eθt12+2Eθt1mθt1fq2+2q(1q)Eθt1+(1q)2Eθt12=q+1q2Eθt12+1q2Eθt1mθt1fq22q(1q)Eθt1(1q)2Eθt12=q(1q)+1q2Vθt1+Eθt12+1q2Cθt1m,θt1f+Eθt122q(1q)Eθt1Eθt12=q(1q)+1q2Vθt1+1q2Eθt12+1q2Pt1Vθt1+1q2Eθt122q(1q)Eθt1Eθt12=q(1q)+1q2Vθt11+Pt1+(1q)Eθt122q(1q)Eθt1(1q)2Eθt12=q(1q)Eθt12+q(1q)12Eθt1+1q2Vθt11+Pt1

A2 Genetic variance

Let y=g+e, where y is the phenotypic value of an individual, g is the genotypic value, and e is random error. We assume additive effects such that g=i=1mβixi where βi is the effect size of the ith biallelic locus and xi{0,1,2} is the number of copies of the trait-increasing allele. Then, the genetic variance Vg is:

Vg=Vi=1mβixi=i=1mβi2Vxi+jiβiβjCxi,xj

In the following sections, we decompose Vxi and Cxi,xj further as functions of ancestry.

A2.1. Vxi

We first derive Vxi as a function of ancestry (θ) using the law of total variance:

Vxi=Eθ{Vxiθ}+VEθxiθ

where Eθ represents the expectation taken over θ.

A2.1.1. EθVxiθ

We derive Vxiθ by further conditioning on the local ancestry at each locus.

Vxiθ=EγVxiγ,θ+V{Eγxiγ,θ}

where Eγ represents expectation taken over local ancestry. Since we are interested in the variance at a single locus, we will ignore the subscript i and denote the frequency of the trait-increasing allele in populations A and B with fA and fB, respectively.

EγVxiγ,θ=Vxiγ=0,θP(γ=0θ)+Vxiγ=1,θP(γ=1θ)+Vxiγ=2,θP(γ=2θ)=2fB1fB(1θ)2+fA1fA+fB1fB2θ(1θ)+2fA1fAθ2=2fB2fA212θ+θ2+fAfA2+fBfB22θ2θ2+2fA2fA2θ2=2fB2θfB2fB2+2θfB2+2θfA2θfA2=2fB(1θ)2fB2(1θ)+2θfA1fA=2fB1fB(1θ)+2θfA1fA

To derive VEγ(xγ,θ), note that

Eγ(xγ,θ)=Eγ{E(xθ)}=E(xγ=0,θ)P(γ=0θ)+E(xγ=1,θ)P(γ=1θ)+E(xγ=2,θ)P(γ=2θ)=2θfA+2(1θ)fB

And,

V{Eγ(xγ,θ)}=[E(xγ=0,θ)E(xθ)]2P(γ=0θ)+[E(xγ=1,θ)E(xθ)]2P(γ=1θ)+[E(xγ=2,θ)E(xθ)]2P(γ=2θ)=θ22fA2θfA+2(1θ)fB2+2θ(1θ)fA+fB2θfA+2(1θ)fB2+(1θ)22fB2θfA+2(1θ)fB2=2θ(1θ)fAfB2

Putting this together,

EθVxiθ=Eθ2fB1fB(1θ)+2θfA1fA+2θ(1θ)fAfB2=2fB1fB{1Eθ(θ)}+2Eθ(θ)fA1fA+2Eθθθ2fAfB2=2fB1fB{1Eθ(θ)}+2Eθ(θ)fA1fA+2Eθ(θ)Eθθ2fAfB2=2fB1fB{1Eθ(θ)}+2Eθ(θ)fA1fA+2Eθ(θ)V(θ)Eθ(θ)2fAfB2=2fB1fB{1Eθ(θ)}+2Eθ(θ)fA1fA+2Eθ(θ)(1Eθ(θ))fAfB22V(θ)fAfB2
A2.1.2. VEθxiθ

Recall from the previous section that Eθxiθ=2θfA+2(1θ)fB. Then,

VEθxiθ=V2θfA+2(1θ)fB=4V(θ)fA2+4V(1θ)fB2+2C2θfA,21θfB=4V(θ)fA2+4V(1θ)fB28fAfBV(θ)=4V(θ)fAfB2

We are now ready to express Vxi:

Vxi=2fB1fB{1Eθ(θ)}+2Eθ(θ)fA1fA+2Eθ(θ)(1Eθ(θ))fAfB22V(θ)fAfB2+4V(θ)fAfB2=2Eθ(θ)fiA1fiA+2{1Eθ(θ)}fiB1fiB+2Eθ(θ){1Eθ(θ)}fiAfiB22V(θ)fiAfiB2

Note, that we can also express Vxi as:

Vxi=2fi1fi+2V(θ)fiAfiB2

where the second term is the contribution of population structure to the genetic variance at locus i.

A2.2. Cxi,xj

We can derive Cxi,xj using the law of total covariance:

Cxi,xj=EθCxi,xjθ+CEθxiθ,Eθxjθ=0+C2fiAθ+2fiB(1θ),2fjAθ+2fjB(1θ)=C2fiAθ,2fjAθ+C2fiAθ,2fjB(1θ)+C2fiB(1θ),2fjAθ+C2fiB(1θ),2fjB(1θ)=4V(θ)fiAfiBfjAfjB

EθCxi,xjθ=0 because we assume that the loci are unlinked and therefore, xi and xj are conditionally independent. Putting this all together, we get the genetic variance in admixed populations as presented in the main text:

Vg=i=1mβi2Vxi+jiβiβjCxi,xj=i=1mβi22Eθ(θ)fiA1fiA+i=1mβi22{1Eθ(θ)}fiB1fiB+i=1mβi22Eθ(θ){1Eθ(θ)}fiAfiB2++i=1mβi22V(θ)fiAfiB2+jiβiβj4V(θ)fiAfiBfjAfjB

The only difference being that in the main text we use E instead of Eθ for simplicity. With two ‘unadmixed’ source populations with equal number of individuals, E(θ)=0.5 and V(θ)=E(θ){1E(θ)}=0.25 and Vg reduces to:

Vg=Vi=1mβixi=i=1mβi2Vxi+jiβiβjCxi,xj=i=1mβi2fiA1fiA+fiB1fiB+i=1mβi2fiAfiB2+ijβiβjfiAfiBfjAfjB

A3. The effect of genotype scale on Vˆg

In the main text, we showed that both GREML and Haseman-Elston regression estimates of Vg depend on how the genotypes are scaled. We provide an explanation of this behavior using the Haseman-Elston (HE) regression estimator, which is asymptotically equivalent to the GREML estimator if effects are uncorrelated [69] but which, unlike GREML, has a closed-form solution.

A3.1. No directional LD

A3.1.1. Scaling by 2fi1fi

First, let’s assume a genetic architecture where all loci contribute equally to the genetic variance and there is no LD contribution. With the standard scaling, the genotype at a given locus i is zi=xi2fi2fi1fi where fi is the frequency of the allele in the population. Under the random-effects model, this is equivalent to saying that the unscaled effects are: βi𝒩0,σu22mfi1fi, σu2 being the parameter of interest. In a panmictic population,

Vg=Vi=1mβixi=i=1mβi2Vxi=i=1mσu22mfi1fi2fi1fi=σu2

In an admixed population,

Vg=i=1mβi22fi1fi+2V(θ)fiAfiB2=i=1mσu22mfi1fi2fi1fi+2V(θ)fiAfiB2=σu2mi=1m1+V(θ)fiAfiB2fi1fi=σu2+V(θ)σu2mi=1mfiAfiB2fi1ficontributionofpopulationstructuretothegenicvariance

The HE estimator of Vg is based on the regression of products of (centered) phenotypes ykyl for all pairs of individuals kl on the corresponding entries of the GRM (ψ) where ψkl=i=1mzikzilm and zik is the centered and scaled genotype of individual k for locus i:

Vˆg=Cykyl,ψklVψkl=EklykylψklEklykylEklψklEklψkl2Eψkl2=EklykylψklEklψkl2

Where Ekl represents the expectation over all k×l pairwise comparisons between individuals. It is simpler to express the HE estimator in terms of the scaled effects ui𝒩(0,σu2m).

Vˆg=EklykylψklEklψkl2=Ekli=1muiziki=1muizilψklEklψkl2=Ekli=1mui2zikzilψklEklψkl2+Ekli=1mjiuiujzikzjlψklEklψkl2=Eui2Ekli=1mzikzilψklEklψkl2+EuiujEkli=1mjizikzjlψklEklψkl2=Eui2Eklmψkl2Eklψkl2+0=σu2

Where Eui and Euiuj represent expectations over random realizations of effect sizes. Thus, the last line follows from our assumption that the effect sizes are independent in expectation, i.e., Euiuj=0. Note, that the estimate is still biased since it does not capture the contribution of population structure.

A3.1.2. Scaling by Vxi

Next, we consider the case where the genotypes are standardized instead by the sample variance, i.e., zkl=xik2fiVxi such that Vzi=1. We can derive Eui2 corresponding to this scaling by noting that the genetic variance is invariant under linear transformations of the genotype [14]:

i=1mβi2Vxi=i=1mui2VzimEui2=σu2+V(θ)σu2mi=1mfiAfiB2fi1fiEui2=σu2m+V(θ)σu2m2i=1mfiAfiB2fi1fi

Then, the HE estimator becomes:

Vˆg=mEui2=mσu2m+V(θ)σu2m2i=1mfiAfiB2fi1fi=σu2+V(θ)σu2mi=1mfiAfiB2fi1fi

Which provides an unbiased estimate of the genic variance. It’s important to note that even though we assumed effect sizes under a random-effect model, the above result holds under a fixed-effect model as long as there is no directional LD. We discuss the implications of directional LD in the following section.

A3.2. Directional LD

Under the standard random-effect model, the effect sizes are assumed to be independent in expectation. We discussed in the main text how certain processes (e.g. selection and assortative mating) can induce directional LD across causal loci. But directional LD might arise even for neutral traits and under the random-effects model for any given realization of effects. This can lead to biases both HE and GREML estimates of Vg, though the direction and reason for bias is different for the two methods. GREML does not have a closed-form solution so the exact estimand is difficult to derive. Instead, we develop some intuition based on HE regression.

A3.2.1. Scaling by Vxi

To do this, let u=u1,u2,,um represent the vector of a given realization of (fixed) effects corresponding to the standardized genotypes such that each locus contributes equally to σu2, the genic variance, i.e., ui2=σu2m. Let there be positive LD across loci such that all cross-product terms are uiuj=σu2m. Then, the genetic variance explained by all loci is:

Vg=i=1mui2Vzi+jiuiujzi,zj=i=1mui2+jiuiujzi,zj=σu2+σu2mjizi,zj (3)

where Czi,zj is the LD between the ith and jth loci that ranges from 0 (no LD) to 1 (perfect LD). Thus, the LD contribution to Vg ranges from 0 to (m1)σm2. In comparison, the HE estimator is:

Vˆg=Ekli=1mui2zikzilψklEklψkl2+Ekli=1mjiuiujzikzjlψklEklψkl2=Ekli=1mσu2mzikzilw=1mzwkzwl/mEkli=1mzikzil/mw=1mzwkzwl/m+Ekli=1mjiσu2mzikzjlw=1mzwkzwl/mEkli=1mzikzil/mw=1mzwkzwl/m=σu2+σu2Ekli=1mjizikzjlzwkzwlEkli=1mzikzilw=1mzwkzwl (4)

This shows that the bias due to directional LD in the HE estimate of Vg does not come from the genic, but the LD component. When there is no LD, e.g. if the population has reached equilibrium after generations of random mating, this component goes to zero and both the estimate and Vg converge to the same value – the genic variance. The LD component is maximum when the ith and jth loci are in perfect LD. In this case, i and j are exchangeable and the second term of the estimator reduces to (m1)σm2. Thus, HE regression should give an unbiased estimate of Vg, even in the presence of directional LD, but only when LD is perfect. For any other value 0<Czi,zj<1, the estimate is biased (Fig. A1). An interpretable, analytical derivation of the second term in Eq. 4 is complicated but we illustrate the bias with simulations below.

For unlinked markers, Czi,zj is a function of 4V(θ)fiAfiBfjAfjB (see A2.1). Perfect LD arises when (i) both fiAfiB=1 and fjAfjB=1 and (ii) V(θ) is maximum, which occurs at the time of admixture when source populations mix equally, i.e, E(θ)=0.5. To generate a range of LD, we simulated an admixed population (N=1,000) with equal number of individuals from populations A and B. Thus, 4V(θ)=4E(θ){1E(θ)}=1. We simulated genotypes for each individual at 50 ‘causal’ loci where the difference between the frequencies in the source populations, fiAfiB[0,1] with the condition that fiA+fiB2=0.5. We assigned each locus the same effect size (on the variance-standardized scale) of +1/m summing up to a genic variance of 1. The positive sign ensures positive LD across loci, i.e, all off-diagonal elements of uu are set to 1/m. For each simulation, we computed the expected and estimated LD component using the second terms in Eqs. 3 and 4, respectively, and averaged the results over 100 replications.

Figure A1:

Figure A1:

The behavior of the LD contribution (y-axis) to the genetic variance (red) and the Haseman-Elston regression estimate (blue) as a function of LD, i.e. Czi,zj (x-axis). Each point represents the contribution calculated from a random draw of genotypes, given Czi,zj(fiAfiB)(fjAfjB). The red line represents the expected LD contribution and the black dashed line represents the contribution expected in the case of perfect LD.

A3.2.2. Scaling by LD

In the main text, we showed that standardizing the genotypes at a locus by its covariance with other loci accounts for the bias for GREML and HE estimators. More specifically, the ‘LD-scaled’ genotypes can be written as Z=(X2P)U1 where P is an n×m matrix such that all elements of the ith column contain the frequency of the ith SNP and U is the (upper triangular) square root of the LD matrix, i.e, Σ=UU. Under this scheme, the standardized genotypes are uncorrelated and therefore, the second term in Eqs. 3 and 4 are zero. This reduces the estimator to the first term, representing the sum of squares of effect sizes, i.e. uu=i=1mui2. The effect sizes corresponding to the LD scaled genotypes are u=Uβ and the sum of squares is:

uu=(Uβ)(Uβ)=βUUβ=βΣβ=i=1mj=1mβiβjCxi,xj

Which captures both the genic and LD contributions and therefore, provides an unbiased estimate of Vg.

A4. Genetic variance after correction for individual ancestry

In the main text, we stated that including individual ancestry as a fixed effect in GREML can lead to an underestimate of Vg in the presence of population structure. Mixed effect models deal with fixed effects (ancestry in our case) by projecting them out of the phenotypes, and estimating the residual variance. This is conceptually equivalent to measuring the residual variance of the regression between phenotype and ancestry. As a result, any variance in the phenotype that is explained by ancestry is removed. To understand this quantitatively, it is helpful to decompose Vg into components of variance explained by and variance orthogonal to ancestry:

V(g)=V{Eθ(gθ)}variancealongancestryaxis+Eθ{V(gθ)}varianceorthogonaltoancestryaxis

We can express the residual variance as:

Eθ{V(gθ)}=EθVi=1Mβi2xiθ=Eθi=1Mβi2Vxiθ+EθijβiβjCxi,xjθ=i=1Mβi2EθVxiθ+0=2Eθ(θ)i=1Mβ2fiA1fiA+2{1Eθ(θ)}i=1Mβ2fiB1fiB+2Eθ(θ)i=1Mβ2{1Eθ(θ)}fiAfiB22V(θ)i=1Mβ2fiAfiB2

Note, that this represents the following components of Vg from the main text: (1.1) + (1.2) − (1.3). Note, that (1.3), which is subtracted out, is always positive and depends on V(θ). Thus, the residual genetic variance will be underestimated, regardless of trait architecture, in the presence of population structure, i.e. when V(θ)>0.

A5. Effect size of local ancestry

We define local ancestry γi{0,1,2} as the number of alleles at locus i that trace their ancestry to population A. Thus, the local ancestry at locus i in individual k is a Binomial random variable with Eγi,k=2θk. We define the ancestry value of an individual as the weighted sum of their local ancestry: i=1mϕiγi where ϕi=βi(fiBfiA).

To show this, note that ϕ=E(yγ=1)E(yγ=0) where E(yγ=1)=yh(yγ=1) and h is a density function. Our goal is to express ϕ in terms of β, which is equal to E(yx=1)E(yx=0). Furthermore, E(yx=1)=yh(yx=1). We can express h(yγ) in terms of h(yx) as follows:

h(yγ=1)=h(yx=0)P(x=0γ=1)+h(yx=1)P(x=1γ=1)+h(yx=2)P(x=2γ=1)=h(yx=0)21fA1fA+h(yx=1)fA1fB+fB1fA+h(yx=2)2fAfB
E(yγ=1)=yh(yγ=1)dy=1fA1fByh(yx=0)dy+fA1fB+fB1fAyh(yx=1)dy+fAfByh(yx=2)dy=1fA1fBE(yx=0)+fA1fB+fB1fAE(yx=1)+fAfBE(yx=2)=0+fA1fB+fB1fAβ+fAfB2β=βfA+βfB

Similary, E(yγ=0)=2βfB and ϕ=E(yγ=1)E(yγ=0)=βfBfA

A6. Genetic variance due to local ancestry

Vγ=Vi=1mϕiγi=i=1mϕi2Vγi+i=1mjiϕiϕjγi,γj (5)

We use the law of total variance and covariance to derive Vγi and Cγi,γj:

Vγi=EVγiθ+VEγiθ=E{2θ(1θ)}+V(2θ)=2E(θ)2Eθ2+4V(θ)=2E(θ)2V(θ)2E(θ)2+4V(θ)=2E(θ){1E(θ)}+2V(θ)
Cγi,γj=ECγi,γjθ+CEγi,γjθ=0+C(2θ,2θ)=4V(θ)
Vγ=2E(θ){1E(θ)}i=1mϕi2+2V(θ)i=1mϕi2+4V(θ)i=1mjiϕiϕj=2E(θ){1E(θ)}i=1mβi2fiBfiA2+2V(θ)i=1mβi2fiBfiA2+4V(θ)i=1mjiβiβjfiBfiAfjBfjA

Footnotes

Code availability

We carried out all analyses in R version 4.2.3 [56], PLINK v1.90b6.21 and PLINK 2.0 [66, 68], and GCTA version 1.94.1 [5]. All code is freely available on https://github.com/jinguohuang/admix_heritability.git.

References

  • 1.Lynch M. & Walsh B. Genetics and analysis of quantitative traits 1–980 (Sinauer Associates, Inc, 1998). [Google Scholar]
  • 2.Visscher P. M., Hill W. G. & Wray N. R. Heritability in the genomics era–concepts and misconceptions. Nature reviews. Genetics 9, 255–66 (April 2008). [DOI] [PubMed] [Google Scholar]
  • 3.Yang J., Benyamin B., et al. Common SNPs explain a large proportion of the heritability for human height. Nature Genetics 42, 565–569 (July 2010). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4.Yang J., Manolio T. A., et al. Genome partitioning of genetic variation for complex traits using common SNPs. Nature Genetics 2011 43:6 43, 519–525 (June 2011). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5.Yang J., Lee S. H., Goddard M. E. & Visscher P. M. GCTA: a tool for genome-wide complex trait analysis. American journal of human genetics 88, 76–82 (January 2011). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6.Wainschtein P., Jain D., et al. Assessing the contribution of rare variants to complex trait heritability from whole-genome sequence data. Nature Genetics 2022 54:3 54, 263–273 (March 2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7.Browning S. R. & Browning B. L. Population structure can inflate SNP-based heritability estimates. American Journal of Human Genetics 89, 191–193 (January 2011). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8.Goddard M. E., Lee S. H., Yang J., Wray N. R. & Visscher P. M. Response to browning and browning. American Journal of Human Genetics 89, 193–195 (January 2011). [Google Scholar]
  • 9.Kumar S. K., Feldman M. W., Rehkopf D. H. & Tuljapurkar S. Limitations of GCTA as a solution to the missing heritability problem. Proceedings of the National Academy of Sciences of the United States of America 113, E61–E70 (January 2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10.Yang J., Leec S. H., Wraya N. R., Goddardd M. E. & Visscher P. M. GCTA-GREML accounts for linkage disequilibrium when estimating genetic variance from genome-wide SNPs. Proceedings of the National Academy of Sciences 113, E4579–E4580 (32 2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11.Lin Z., Seal S. & Basu S. Estimating SNP heritability in presence of population substructure in biobank-scale datasets. Genetics 220 (April 2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12.Border R., O’Rourke S., et al. Assortative mating biases marker-based heritability estimators. Nature Communications 2022 13:1 13, 1–10 (January 2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13.Visscher P. M., Yang J. & Goddard M. E. A Commentary on ‘Common SNPs Explain a Large Proportion of the Heritability for Human Height’ by Yang et al. (2010). Twin Research and Human Genetics 13, 517–524 (June 2010). [DOI] [PubMed] [Google Scholar]
  • 14.De los Campos G., Sorensen D. & Gianola D. Genomic Heritability: What Is It? PLOS Genetics 11, e1005048 (May 2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15.Rawlik K., Canela-Xandri O., Oolliams J. W. & Tenesa A. SNP heritability: What are we estimating? bioRxiv, 2020.09.15.276121 (2020). [Google Scholar]
  • 16.Lara L. A. C., Pocrnic I., de P. Oliveira T., Gaynor R. C. & Gorjanc G. Temporal and genomic analysis of additive genetic variance in breeding programmes. Heredity 2021 128:1 128, 21–32 (January 2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17.Wojcik G. L., Graff M., et al. Genetic analyses of diverse populations improves discovery for complex traits. Nature 570, 514–518 (7762 2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18.Ben-Eghan C., Sun R., et al. Don’t ignore genetic data from minority populations. Nature 2021 585:7824 585, 184–186 (7824 2020). [DOI] [PubMed] [Google Scholar]
  • 19.Verma A., Damrauer S. M., et al. The Penn Medicine BioBank: Towards a Genomics-Enabled Learning Healthcare System to Accelerate Precision Medicine in a Diverse Population. Journal of Personalized Medicine 12, 1974 (December 2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 20.Fatumo S., Chikowore T., et al. A roadmap to increase diversity in genomic studies. Nature Medicine 2022 28:2 28, 243–250 (February 2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21.The All of Us Research Program Investigators. The “All of Us” Research Program. New England Journal of Medicine 381, 668–676 (July 2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22.Sohail M., Chong A. Y., et al. Nationwide genomic biobank in Mexico unravels demographic history and complex trait architecture from 6,057 individuals. bioRxiv, 2022.07.11.499652 (2022). [Google Scholar]
  • 23.Johnson R., Ding Y., et al. The UCLA ATLAS Community Health Initiative: Promoting precision health research in a diverse biobank. Cell Genomics 3, 100243 (January 2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24.Haseman J. K. & Elston R. C. The investigation of linkage between a quantitative trait and a marker locus. Behavior genetics 2, 3–19 (January 1972). [DOI] [PubMed] [Google Scholar]
  • 25.Pfaff C. L., Parra E. J., et al. Population structure in admixed populations: effect of admixture dynamics on the pattern of linkage disequilibrium. American journal of human genetics 68, 198–207 (January 2001). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26.Zaitlen N., Huntsman S., et al. The Effects of Migration and Assortative Mating on Admixture Linkage Disequilibrium. Genetics 205, 375–383 (January 2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27.Verdu P. & Rosenberg N. A. A General Mechanistic Model for Admixture Histories of Hybrid Populations. Genetics 189, 1413–1426 (April 2011). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 28.Corre V. L. & Kremer A. Genetic Variability at Neutral Markers, Quantitative Trait Loci and Trait in a Subdivided Population Under Selection. Genetics 164, 1205–1219 (March 2003). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29.Berg J. J. & Coop G. A Population Genetic Signal of Polygenic Adaptation. PLoS Genetics 10, e1004412 (August 2014). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30.Bulmer M. G. The Effect of Selection on Genetic Variability. The American Naturalist 105, 201–211 (943 1971). [Google Scholar]
  • 31.Yair S. & Coop G. Population differentiation of polygenic score predictions under stabilizing selection. Philosophical Transactions of the Royal Society B 377 (1852 2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 32.Balding D. J. & Nichols R. A. A method for quantifying differentiation between populations at multi-allelic loci and its implications for investigating identity and paternity. Genetica 96, 3–12 (1–2 1995). [DOI] [PubMed] [Google Scholar]
  • 33.Mathew B., Léon J. & Sillanpää M. J. A novel linkage-disequilibrium corrected genomic relationship matrix for SNP-heritability estimation and genomic prediction. Heredity 2017 120:4 120, 356–368 (April 2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 34.Ma R. & Dicker L. H. The Mahalanobis kernel for heritability estimation in genome-wide association studies: fixed-effects and random-effects methods (2019). [Google Scholar]
  • 35.Min A., Thompson E. & Basu S. Comparing heritability estimators under alternative structures of linkage disequilibrium. G3 Genes|Genomes|Genetics 12 (August 2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 36.Zaitlen N., Pasaniuc B., et al. Leveraging population admixture to characterize the heritability of complex traits. Nature Genetics 46, 1356–1362 (December 2014). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 37.Zaidi A. A., Mattern B. C., et al. Investigating the case of human nose shape and climate adaptation. PLoS Genetics 13, 2017 (March 2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 38.Schraiber J. G. & Edge M. D. Heritability within groups is uninformative about differences among groups: cases from behavioral, evolutionary, and statistical genetics. bioRxiv, 2023.11.06.565864 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 39.Auton A., Abecasis G. R., et al. A global reference for human genetic variation. Nature 2015 526:7571 526, 68–74 (7571 2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 40.Sollis E., Mosaku A., et al. The NHGRI-EBI GWAS Catalog: knowledgebase and deposition resource. Nucleic Acids Research 51, D977–D985 (D1 2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 41.Jablonski N. G. the Evolution of Human Skin and Skin Color. Annual Review of Anthropology 33, 585–623 (January 2004). [Google Scholar]
  • 42.Lamason R. L., Mohideen M. A. P., et al. SLC24A5, a putative cation exchanger, affects pigmentation in zebrafish and humans. Science 310, 1782–1786 (5755 2005). [DOI] [PubMed] [Google Scholar]
  • 43.Beleza S., Johnson N. A., et al. Genetic architecture of skin and eye color in an African-European admixed population. PLoS genetics 9 (ed Spritz R. A.) e1003372 (March 2013). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 44.Ju D. & Mathieson I. The evolution of skin pigmentation-associated variation in West Eurasia. Proceedings of the National Academy of Sciences of the United States of America 118, e2009227118 (January 2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 45.Nalls M. A., Wilson J. G., et al. Admixture Mapping of White Cell Count: Genetic Locus Responsible for Lower White Blood Cell Count in the Health ABC and Jackson Heart Studies. The American Journal of Human Genetics 82, 81–87 (January 2008). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 46.Reich D., Nalls M. A., et al. Reduced Neutrophil Count in People of African Descent Is Due To a Regulatory Variant in the Duffy Antigen Receptor for Chemokines Gene. PLoS Genetics 5 (ed Visscher P. M.) e1000360 (January 2009). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 47.McManus K. F., Taravella A. M., et al. Population genetic analysis of the DARC locus (Duffy) reveals adaptation from standing variation associated with malaria resistance in humans. PLOS Genetics 13, e1006560 (March 2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 48.Kumar S. K., Feldman M. W., Rehkop D. H. & Tuljapurkar S. Reply to Yang et al.: GCTA produces unreliable heritability estimates. Proceedings of the National Academy of Sciences 113, E4581–E4581 (32 2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 49.Speed D., Hemani G., Johnson M. R. & Balding D. J. Improved heritability estimation from genome-wide SNPs. American journal of human genetics 91, 1011–21 (June 2012). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 50.Speed D., Cai N., Johnson M. R., Nejentsev S. & Balding D. J. Reevaluation of SNP heritability in complex human traits. Nature Genetics 49, 986–992 (July 2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 51.Chan T. F., Rui X., et al. Estimating heritability explained by local ancestry and evaluating stratification bias in admixture mapping from summary statistics. bioRxiv, 2023.04.10.536252 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 52.Schwartzman A., Schork A. J., Zablocki R. & Thompson W. K. A simple, consistent estimator of SNP heritability from genome-wide association studies. 10.1214/19-AOAS1291 13, 2509–2538 (April 2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 53.Hou K., Ding Y., et al. Causal effects on complex traits are similar for common variants across segments of different continental ancestries within admixed individuals. Nature Genetics 2023 55:4 55, 549–558 (April 2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 54.Backman J. D., Li A. H., et al. Exome sequencing and analysis of 454,787 UK Biobank participants. Nature 2021 599:7886 599, 628–634 (7886 2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 55.Lin M., Park D. S., Zaitlen N. A., Henn B. M. & Gignoux C. R. Admixed Populations Improve Power for Variant Discovery and Portability in Genome-Wide Association Studies. Frontiers in Genetics 12, 673167 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 56.R Core Team. R: A Language and Environment for Statistical Computing R Foundation for Statistical Computing; (Vienna, Austria, 2023). [Google Scholar]
  • 57.Lona-Durazo F., Hernandez-Pacheco N., et al. Meta-analysis of GWA studies provides new insights on the genetic architecture of skin pigmentation in recently admixed populations. BMC Genetics 20, 1–16 (January 2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 58.Yengo L., Vedantam S., et al. A saturated map of common genetic variants associated with human height. Nature 610, 704–712 (7933 2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 59.Hoffmann T. J., Choquet H., et al. A large multiethnic genome-wide association study of adult body mass index identifies novel loci. Genetics 210, 499–515 (February 2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 60.Pulit S. L., Stoneman C., et al. Meta-Analysis of genome-wide association studies for body fat distribution in 694 649 individuals of European ancestry. Human Molecular Genetics 28, 166–174 (January 2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 61.Warrington N. M., Beaumont R. N., et al. Maternal and fetal genetic effects on birth weight and their relevance to cardio-metabolic risk factors. Nature Genetics 2019 51:5 51, 804–814 (May 2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 62.Surendran P., Feofanova E. V., et al. Discovery of rare variants associated with blood pressure regulation through meta-analysis of 1.3 million individuals. Nature Genetics 52, 1314–1332 (December 2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 63.Graham S. E., Clarke S. L., et al. The power of genetic diversity in genome-wide association studies of lipids. Nature 600, 675–679 (7890 2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 64.Chen M. H., Raffield L. M., et al. Trans-ethnic and Ancestry-Specific Blood-Cell Genetics in 746,667 Individuals from 5 Global Populations. Cell 182, 1198–1213.e14 (May 2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 65.Bycroft C., Freeman C., et al. The UK Biobank resource with deep phenotyping and genomic data. Nature 562, 203–209 (7726 2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 66.Chang C. C., Chow C. C., et al. Second-generation PLINK: rising to the challenge of larger and richer datasets. GigaScience 4, s13742-015–0047-8 (January 2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 67.Alexander D. H., Novembre J. & Lange K. Fast model-based estimation of ancestry in unrelated individuals. Genome research 19, 1655–64 (September 2009). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 68.Purcell S., Neale B., et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. American journal of human genetics 81, 559–75 (March 2007). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 69.Chen G. B. Estimating heritability of complex traits from genome-wide association studies using IBS-based Haseman-Elston regression. Frontiers in Genetics 5, 72296 (APR 2014). [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Supplement 1

Articles from bioRxiv are provided here courtesy of Cold Spring Harbor Laboratory Preprints

RESOURCES