This is a preprint.

It has not yet been peer reviewed by a journal.

The National Library of Medicine is running a pilot to include preprints that result from research funded by NIH in PMC and PubMed.

[Preprint]. 2023 Aug 4:2023.08.04.551959. [Version 1] doi: 10.1101/2023.08.04.551959

Interpreting SNP heritability in admixed populations

Jinguo Huang 1,2, Saonli Basu 3, Mark D Shriver 2, Arslan A Zaidi 4,5
PMCID: PMC10418213  PMID: 37577588

Abstract

SNP heritability $h^2_{snp}$ is defined as the proportion of phenotypic variance explained by genotyped SNPs and is believed to be a lower bound of heritability $h^2$, being equal to it if all causal variants are known. Despite the simple intuition behind $h^2_{snp}$, its interpretation and equivalence to $h^2$ are unclear, particularly in the presence of population structure and assortative mating. It is well known that population structure can lead to inflation in $\hat{h}^2_{snp}$ estimates. Here we use analytical theory and simulations to demonstrate that estimates of $h^2_{snp}$ are not guaranteed to be equal to $h^2$ in admixed populations, even in the absence of confounding and even if the causal variants are known. We interpret this discrepancy as arising not because the estimate is biased, but because the estimand itself, as defined under the random effects model, may not be equal to $h^2$. The model assumes that SNP effects are uncorrelated, which may not be true even for unlinked loci in admixed and structured populations, leading to over- or under-estimates of $h^2_{snp}$ relative to $h^2$. For the same reason, local ancestry heritability $h^2_\gamma$ may also not be equal to the variance explained by local ancestry in admixed populations. We describe the quantitative behavior of $h^2_{snp}$ and $h^2_\gamma$ as a function of admixture history and the genetic architecture of the trait and discuss its implications for genome-wide association and polygenic prediction.

Introduction

The ability to estimate heritability $h^2$ from unrelated individuals was a major advance in genetics. Traditionally, $h^2$ was estimated from family-based studies in which the phenotypic resemblance between relatives could be modeled as a function of their expected genetic relatedness [1]. But this approach was limited to analysis of closely related individuals where pedigree information is available and the realized genetic relatedness is not too different from expectation [2]. With the advent of genome-wide association studies (GWAS), we hoped that many of the variants underlying this heritability would be uncovered. But when genome-wide significant SNPs explained a much smaller fraction of the phenotypic variance, it became important to explain the missing heritability: were family-based estimates inflated or were GWAS just underpowered, limited by variant discovery?

Yang et al. (2010) [3] made the key insight that one could estimate the portion of $h^2$ tagged by genotyped SNPs, regardless of whether or not they were genome-wide significant, by exploiting the subtle variation in the realized genetic relatedness among apparently unrelated individuals [3–5]. This quantity came to be known colloquially as ‘SNP heritability’ ($h^2_{snp}$) and it is believed to be equal to $h^2$ if all causal variants are included among genotyped SNPs [3]. Indeed, estimates of $h^2_{snp}$ explain a much larger fraction of trait heritability [3], approaching family-based estimates of $h^2$ when whole genome sequence data are used [6]. This has made it clear that many variants remain to be uncovered by GWAS as sample sizes increase. Now, $h^2_{snp}$ has become an important aspect of the design of genetic studies and is often used to define the power of variant discovery in GWAS and the upper limit of polygenic prediction accuracy.

Despite the utility and simple intuition of $h^2_{snp}$, there is much confusion about its interpretation and equivalence to $h^2$, particularly in the presence of population structure and assortative mating [7–12]. But much of the discussion of heritability in structured populations has focused on biases in the estimator $\hat{h}^2_{snp}$ due to confounding effects of shared environment and linkage disequilibrium (LD) with other variants [7, 9–11, 13]. There is comparatively little discussion, at least in human genetics, on the fact that LD due to population structure also contributes to genetic variance, and therefore, is a component of heritability [1] (but see also [14, 15]). We think this is at least partly due to the fact that most studies are carried out in cohorts with primarily European ancestry, where the degree of population structure is minimal and large effects of LD can be ignored. But that is not the case for diverse, multi-ethnic cohorts, which have historically been underrepresented in genetic studies, but thanks to a concerted effort in the field, are now becoming increasingly common [16–22]. The complex structure in these cohorts also brings unique methodological challenges and it is imperative that we understand whether existing methods, which have largely been evaluated in more homogeneous groups, generalize to more diverse cohorts.

Our goal in this paper is to study the behavior of $h^2_{snp}$ in admixed populations in relation to $h^2$. Should we expect the two to be equal in the ideal situation where causal variants are known? If not, how should we interpret $h^2_{snp}$? To answer these questions, we derived a general expression for the genetic variance in admixed populations, decomposing it in terms of the contribution of population structure, which influences both the genotypic variance at individual loci and the LD across loci. We used extensive simulations, where the ground truth is known, to test if $h^2_{snp}$ estimated with genomic relatedness-based restricted maximum likelihood (GREML) [3, 5], arguably still the most widely used method, is equal to $h^2$ and to explore this equivalence as a function of admixture history and genetic architecture. We show that GREML-based $\hat{h}^2_{snp}$ is not guaranteed to be equal to $h^2$ in admixed and other structured populations, even in the absence of confounding and when all causal variants are known. We explain this discrepancy in terms of the generative model underlying GREML, which assumes that (i) the effects of causal variants are random and uncorrelated, and (ii) the population is mating randomly. As a result, the GREML estimand $h^2_{snp}$ by definition equals $h^2$ only if these conditions are met. We discuss the implications of this for GWAS and polygenic prediction accuracy.

Model

Genetic architecture

We begin by describing a generative model for the phenotype. Let $y = g + e$, where $y$ is the phenotypic value of an individual, $g$ is the genotypic value, and $e$ is random error. We assume additive effects such that $g = \sum_{i=1}^{m} \beta_i x_i$, where $\beta_i$ is the effect size of the $i$th biallelic locus and $x_i \in \{0, 1, 2\}$ is the number of copies of the trait-increasing allele. Importantly, the effect sizes are fixed quantities and differences in genetic values among individuals are due to variation in genotypes. Note that this is different from the model assumed by GREML, where genotypes are fixed and effect sizes are random [14].

We denote the mean, variance, and covariance with $E(\cdot)$, $V(\cdot)$, and $C(\cdot)$, respectively, where the expectation is measured over random draws from the population rather than random realizations of the evolutionary process. We can express the additive genetic variance of a quantitative trait as follows:

$$V_g = V\left(\sum_{i=1}^{m} \beta_i x_i\right) = \sum_{i=1}^{m} \beta_i^2 V(x_i) + \sum_{i=1}^{m}\sum_{j \neq i} \beta_i \beta_j C(x_i, x_j)$$

Here the first term represents the contribution of individual loci (genic variance) and the second term is the contribution of linkage disequilibrium (LD contribution). We make the assumption that loci are unlinked and therefore, the LD contribution is entirely due to population structure. We describe the behavior of $V_g$ in a population that is a mixture of two previously isolated populations A and B that diverged from a common ancestor. To do this, we denote $\theta$ as the fraction of the genome of an individual with ancestry from population A. Thus, $\theta = 1$ if the individual is from population A, $\theta = 0$ if they are from population B, and $\theta \in (0, 1)$ if they are admixed. Then, $V_g$ can be expressed in terms of ancestry as (Appendix):

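$$
\begin{aligned}
V_g ={}& 2E(\theta)\sum_{i=1}^{m}\beta_i^2 f_i^A(1-f_i^A) + 2\{1-E(\theta)\}\sum_{i=1}^{m}\beta_i^2 f_i^B(1-f_i^B) && (1.1)\\
&+ 2E(\theta)\{1-E(\theta)\}\sum_{i=1}^{m}\beta_i^2 (f_i^A-f_i^B)^2 && (1.2)\\
&+ 2V(\theta)\sum_{i=1}^{m}\beta_i^2 (f_i^A-f_i^B)^2 && (1.3)\\
&+ 4V(\theta)\sum_{i=1}^{m}\sum_{j\neq i}\beta_i\beta_j (f_i^A-f_i^B)(f_j^A-f_j^B) && (1.4)
\end{aligned}
$$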

where $f_i^A$ and $f_i^B$ are the allele frequencies in populations A and B, and $E(\theta)$ and $V(\theta)$ are the mean and variance of individual ancestry. The first three terms represent the sum of the genic variance and the last term represents the LD contribution.
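The four terms of Eq. 1 can be evaluated directly once the effect sizes, source-population allele frequencies, and the mean and variance of ancestry are given. The sketch below is illustrative only: the function name `vg_terms` and the three-locus inputs are hypothetical, and the double sum in term (1.4) is taken over ordered pairs of distinct loci.

```python
import numpy as np

def vg_terms(beta, f_a, f_b, e_theta, v_theta):
    """Return the four terms of Eq. 1, keyed '1.1' to '1.4'."""
    beta, f_a, f_b = (np.asarray(v, dtype=float) for v in (beta, f_a, f_b))
    bd = beta * (f_a - f_b)  # beta_i * (f_i^A - f_i^B)
    # (1.1): within-population genic variance, weighted by ancestry proportions
    t11 = (2 * e_theta * np.sum(beta**2 * f_a * (1 - f_a))
           + 2 * (1 - e_theta) * np.sum(beta**2 * f_b * (1 - f_b)))
    # (1.2): genic variance due to allele frequency differences
    t12 = 2 * e_theta * (1 - e_theta) * np.sum(bd**2)
    # (1.3): extra genic variance due to variance in ancestry
    t13 = 2 * v_theta * np.sum(bd**2)
    # (1.4): LD contribution; sum over i != j equals (sum)^2 - sum of squares
    t14 = 4 * v_theta * (np.sum(bd)**2 - np.sum(bd**2))
    return {"1.1": t11, "1.2": t12, "1.3": t13, "1.4": t14}

# Hypothetical three-locus example
beta = np.array([0.5, -0.3, 0.2])
f_a = np.array([0.8, 0.4, 0.1])
f_b = np.array([0.2, 0.5, 0.6])
terms = vg_terms(beta, f_a, f_b, e_theta=0.5, v_theta=0.05)
vg = sum(terms.values())
print(terms, vg)
```

Setting `v_theta=0` removes terms (1.3) and (1.4), leaving the genic variance expected under random mating.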

Demographic history

From Eq. 1, it is clear that, conditional on the genetic architecture in the source populations ($\beta$, $f^A$, $f^B$), $V_g$ in the admixed populations is a function of the mean, $E(\theta)$, and variance, $V(\theta)$, of individual ancestry. We consider two demographic models that affect $E(\theta)$ and $V(\theta)$ in qualitatively different ways. In the first model, the source populations meet once $t$ generations ago (we refer to this as $t = 0$) in proportions $p$ and $1 - p$, after which there is no subsequent admixture (Fig. 1A). In the second model, there is continued gene flow in every generation from one of the source populations such that the mean overall amount of ancestry from population A is the same as in the first model (Fig. 1A). For brevity, we refer to these as the hybrid-isolation (HI) and continuous gene flow (CGF) models, respectively, following Pfaff et al. (2001) [23]. $V(\theta)$ is also affected by assortative mating based on ancestry (hereafter referred to as assortative mating for brevity) and we model this following Zaitlen et al. (2017) using a parameter $P \in (0, 1)$, which represents the correlation in the ancestry of individuals in a mating pair [24].

Figure 2:

Decomposing genetic variance in a two-population system. The plot illustrates the expected distribution of genetic values in two populations under different selective pressures and the terms on the right list the total $V_g$ and between-population genetic variance $V_{gb}$ expected over the evolutionary process. For neutrally evolving traits (top row), we expect there to be an absolute difference in the mean genetic values ($|\bar{g}_1 - \bar{g}_2|$) that is proportional to $F_{ST}$. For traits under divergent selection (middle), $|\bar{g}_1 - \bar{g}_2|$ is expected to be greater than that expected under genetic drift. For traits under stabilizing selection, $|\bar{g}_1 - \bar{g}_2|$ will be less than that expected under genetic drift, and zero in the extreme case.

Under these conditions, the behavior of E(θ) and V(θ) has been described previously [24, 25] (Fig. 1A and B). Briefly, in the HI model, E(θ) remains constant at p in the generations after admixture as there is no subsequent gene flow. V(θ) is at its maximum at t=0 when all individuals either carry chromosomes from population A or B, but not both. This genome-wide correlation in ancestry breaks down in subsequent generations as a function of mating, independent assortment, and recombination, leading to a decay in V(θ), the rate depending on the strength of assortative mating (Fig. 1). In the CGF model, both E(θ) and V(θ) increase with time as new chromosomes are introduced from the source populations. But while E(θ) continues to increase monotonically, V(θ) will plateau and decrease due to the countervailing effects of independent assortment and recombination which redistribute ancestry in the population, reaching equilibrium at zero if there is no more gene flow and the population is mating randomly. V(θ) provides an intuitive and quantitative measure of the degree of population structure (along the axis of ancestry) in admixed populations.
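The decay of V(θ) in the HI model can be illustrated with a deliberately simplified recursion. The sketch below is not the derivation in [24, 25]: it assumes that an offspring's ancestry is the mean of its parents' ancestries (ignoring segregation variance within individuals), so that with a correlation P between mates the variance changes by a factor (1 + P)/2 per generation, starting from p(1 - p) at t = 0. The function name `v_theta_hi` is hypothetical, and the CGF model is not covered.

```python
def v_theta_hi(p, t, P=0.0):
    """Approximate variance in ancestry t generations after a single pulse.

    Assumes the midparent recursion V_t = V_{t-1} * (1 + P) / 2 with
    V_0 = p * (1 - p); an illustrative simplification, not the exact model.
    """
    return p * (1 - p) * ((1 + P) / 2) ** t

# Stronger assortative mating (larger P) slows the decay of V(theta)
for P in (0.0, 0.3, 0.6, 0.9):
    trajectory = [round(v_theta_hi(0.5, t, P), 4) for t in (0, 1, 5, 10, 20)]
    print(f"P={P}: {trajectory}")
```

Under this approximation V(θ) halves every generation when mating is random (P = 0), whereas it persists for many generations when P is close to one.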

Figure 1:

The behavior of mean and variance of individual ancestry as a function of admixture history. (A) Shows the demographic models under which simulations were carried out. Admixture might occur once (Hybrid Isolation, HI, left column) or continuously (Continuous Gene Flow, CGF, right column). (B) The mean individual ancestry, E(θ) remains constant over time in the HI model and increases in the CGF model with continued gene flow. (C) The variance in individual ancestry, V(θ) is maximum at t=0, decaying subsequently. V(θ) increases with gene flow in the CGF model and will subsequently decrease with time. P measures the strength of assortative mating, which slows the decay of V(θ).

Results

Genetic variance in admixed populations

To understand the expectation of genetic variance in admixed populations, it is first worth discussing its behavior in the source populations. In Eq. 1, the first term represents the within-population component $V_{gw}$ and the last three terms altogether represent the component of genetic variance between populations A and B, $V_{gb}$. Note that $V_{gb} = \frac{(\bar{g}_1 - \bar{g}_2)^2}{2}$ is positive only if there is a difference in the mean genotypic values (Fig. 2B). This variance increases with the degree of genetic drift since the expected values of both $(f_i^A - f_i^B)^2$ and $(f_i^A - f_i^B)(f_j^A - f_j^B)$ are functions of $F_{ST}$. But while $(f_i^A - f_i^B)^2$ is expected to increase monotonically with increasing genetic drift, $(f_i^A - f_i^B)(f_j^A - f_j^B)$ is expected to be zero under neutrality because the direction of frequency change will be uncorrelated across loci. In this case, the LD contribution, i.e., (1.4), is expected to be zero and $V_{gb} = (1.2) + (1.3)$. However, this is true only in expectation over the evolutionary process and the realized LD contribution may be non-zero even for neutral traits.

For traits under selection, the LD contribution is expected to be greater or less than zero, depending on the type of selection. Under divergent selection, trait-increasing alleles will be systematically more frequent in one population than in the other, inducing positive LD across loci and increasing the LD contribution, i.e., term (1.4). Stabilizing selection, on the other hand, induces negative LD, reducing (1.4) [26, 27]. In the extreme case, the mean genetic values of the two populations are exactly equal and $V_{gb} = (1.2) + (1.3) + (1.4) = 0$. For this to be true, (1.4) has to be negative and equal in magnitude to (1.2) + (1.3), which are both positive, and the total genetic variance is reduced to the within-population variance, i.e., term (1.1) (Fig. 2). This is relevant because, as we show in the following sections, the behavior of the genetic variance in admixed populations depends on the magnitude of $V_{gb}$ between the source populations.

We illustrate this by tracking the genetic variance in admixed populations for two traits, both with the same mean $F_{ST}$ at causal loci but with different LD contributions (term 1.4): one where the LD contribution is positive (Trait 1) and the other where it is negative (Trait 2). Thus, Traits 1 and 2 can be thought of as examples of phenotypes under divergent and stabilizing selection, respectively, and we refer to them as such from here on. To simulate the genetic variance of such traits, we drew the allele frequencies ($f^A$ and $f^B$) in populations A and B for 1,000 causal loci with $F_{ST} \approx 0.2$ using the Balding-Nichols model [28]. We drew their effects ($\beta$) from $\mathcal{N}\left(0, \frac{1}{2f(1-f)}\right)$, where $f$ is the mean allele frequency between the two populations. To simulate positive and negative LD, we permuted the effect signs across variants 100 times and selected the combinations that gave the most positive and most negative LD contribution to represent the genetic architecture of traits that might be under divergent (Trait 1) and stabilizing (Trait 2) selection, respectively (Methods). We simulated the genotypes of 10,000 individuals under the HI and CGF models for $t \in \{10, 20, 50, 100\}$ generations post-admixture and calculated genetic values for both traits using $g = \sum_{i=1}^{m} \beta_i x_i$, where $m = 1{,}000$ (Methods). The observed genetic variance at any time can then be calculated simply as the variance in genetic values, i.e., $V_g = V(g)$.
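The construction of the two trait architectures can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' simulation code: the ancestral allele frequencies are drawn from a uniform distribution on (0.1, 0.9), the second argument of the normal distribution is read as a variance, and all variable names are hypothetical. The Balding-Nichols model is implemented as a Beta distribution with parameters $f(1-F_{ST})/F_{ST}$ and $(1-f)(1-F_{ST})/F_{ST}$.

```python
import numpy as np

rng = np.random.default_rng(1)
m, fst = 1000, 0.2

# Ancestral allele frequencies (the uniform range is an assumption)
f_anc = rng.uniform(0.1, 0.9, size=m)

# Balding-Nichols: independent Beta draws for populations A and B
a = f_anc * (1 - fst) / fst
b = (1 - f_anc) * (1 - fst) / fst
f_a = rng.beta(a, b)
f_b = rng.beta(a, b)

# Effects with variance 1 / (2 f (1 - f)), f being the mean of f_a and f_b
f_bar = (f_a + f_b) / 2
beta0 = rng.normal(0.0, np.sqrt(1 / (2 * f_bar * (1 - f_bar))))

def ld_term(beta):
    """Sum over i != j of beta_i beta_j (f_i^A - f_i^B)(f_j^A - f_j^B).

    Term (1.4) is this quantity multiplied by 4 V(theta).
    """
    bd = beta * (f_a - f_b)
    return np.sum(bd) ** 2 - np.sum(bd ** 2)

# Permute the effect signs across variants 100 times
magnitudes, signs = np.abs(beta0), np.sign(beta0)
candidates = [magnitudes * rng.permutation(signs) for _ in range(100)]
ld = np.array([ld_term(c) for c in candidates])

beta_trait1 = candidates[int(np.argmax(ld))]  # most positive LD contribution
beta_trait2 = candidates[int(np.argmin(ld))]  # most negative LD contribution
print(ld.max(), ld.min())
```

Because only the signs are permuted, both traits share the same allele frequencies and effect magnitudes, and therefore the same genic variance; they differ only in the LD contribution.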

In the HI model, $E(\theta)$ does not change (Fig. 1) so terms (1.1) and (1.2) are constant through time. Terms (1.3) and (1.4) decay towards zero as the variance in ancestry goes to zero and $V_g$ ultimately converges to (1.1) + (1.2) (Fig. 3). This equilibrium value is equal to $E(V_g \mid \theta)$ (Appendix) and the rate of convergence depends on the strength of assortative mating, which slows the rate at which $V(\theta)$ decays. $V_g$ approaches equilibrium from a higher value for traits under divergent selection and from a lower value for traits under stabilizing selection because of positive and negative LD contributions, respectively, at $t = 0$ (Fig. 3). In the CGF model, $V_g$ increases initially for both traits with increasing gene flow (Fig. 3). This might seem counter-intuitive at first because gene flow increases admixture LD, which leads to more negative values of the LD contribution for traits under stabilizing selection (Fig. S1). But this is outweighed by positive contributions from the genic variance, i.e., terms (1.1) + (1.2) + (1.3), all of which initially increase with gene flow (Fig. S1). After a certain point, the increase in $V_g$ slows down as any increase in $V(\theta)$ due to gene flow is counterbalanced by recombination and independent assortment. Ultimately, $V_g$ will decrease if there is no more gene flow, reaching the same equilibrium value as in the HI model, i.e., $E(V_g \mid \theta) = (1.1) + (1.2)$. Because the loci are unlinked, we refer to the sum (1.3) + (1.4) as the contribution of population structure.

Figure 3:

Genetic variance in admixed populations under the (A) HI and (B) CGF models. Solid lines represent the expected genetic variance based on Eq. (1) and dashed lines represent results of simulations averaged over ten replicates. Red and blue lines represent traits under divergent and stabilizing selection, respectively.

GREML estimation of hsnp2

In their original paper, Yang et al. (2010) defined $h^2_{snp}$ as the variance explained by genotyped SNPs and not as heritability [3]. This is because $h^2$ is the genetic variance explained by causal variants, which are unknown. Genotyped SNPs may not overlap with or tag all causal variants and thus, $h^2_{snp}$ is understood to be a lower bound of $h^2$, both being equal if causal variants are known [3]. Our goal is to demonstrate that this may not be true in structured populations and to quantify the discrepancy between $h^2_{snp}$ and $h^2$, even in the ideal situation when causal variants are known.

We used GREML, implemented in GCTA [3, 5], to estimate the genetic variance for our simulated traits. GCTA assumes the following model: $y = Zu + \epsilon$, where $Z$ is an $n \times m$ standardized genotype matrix such that the genotype of the $k$th individual at the $i$th locus is $z_{ik} = \frac{x_{ik} - 2f_i}{\sqrt{2f_i(1-f_i)}}$, $f_i$ being the allele frequency. The SNP effects are assumed to be random and independent such that $u \sim \mathcal{N}\left(0, I\frac{\sigma_u^2}{m}\right)$, and $\epsilon \sim \mathcal{N}(0, I\sigma_\epsilon^2)$ is random environmental error. Then, the phenotypic variance can be decomposed as:

$$V(y) = V(Zu) + V(\epsilon) = \frac{ZZ^{\top}}{m}\sigma_u^2 + \sigma_\epsilon^2$$

where $\frac{ZZ^{\top}}{m}$ is the genetic relationship matrix (GRM), the variance components $\hat{\sigma}_u^2$ and $\hat{\sigma}_\epsilon^2$ are estimated using restricted maximum likelihood, and $\hat{h}^2_{snp}$ is calculated as $\frac{\hat{\sigma}_u^2}{\hat{\sigma}_u^2 + \hat{\sigma}_\epsilon^2}$. Really, the key estimate is $\hat{\sigma}_u^2$ since $\hat{\sigma}_\epsilon^2 = 1 - \hat{\sigma}_u^2$, and we are interested in asking whether $\hat{\sigma}_u^2$ is equal to $V_g$. To answer this, we constructed the GRM with causal variants and estimated $\hat{\sigma}_u^2$ using GCTA, including individual ancestry in the model as a fixed effect to correct for any confounding due to genetic stratification [3, 4].
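For concreteness, the GRM and the heritability ratio defined above can be written in a few lines. This is only a sketch of the two formulas: it does not perform the REML fit that GCTA carries out, the function names `grm` and `h2_snp` are hypothetical, and the toy genotypes are drawn under random mating with sample allele frequencies used for standardization.

```python
import numpy as np

def grm(x, f=None):
    """Genetic relationship matrix Z Z^T / m from an n x m matrix of allele counts."""
    x = np.asarray(x, dtype=float)
    if f is None:
        f = x.mean(axis=0) / 2  # sample allele frequency at each locus
    # Standard scaling: centre by 2f and divide by sqrt(2 f (1 - f))
    z = (x - 2 * f) / np.sqrt(2 * f * (1 - f))
    return z @ z.T / x.shape[1]

def h2_snp(sigma2_u, sigma2_e):
    """SNP heritability as the ratio of variance components."""
    return sigma2_u / (sigma2_u + sigma2_e)

# Toy data: 50 individuals, 500 loci, random mating (illustrative values)
rng = np.random.default_rng(0)
freqs = rng.uniform(0.2, 0.8, size=500)
x = rng.binomial(2, freqs, size=(50, 500))

A = grm(x)
print(A.shape, np.diag(A).mean())
print(h2_snp(0.4, 0.6))
```

The denominator in `grm` is where the random-mating assumption enters: it equals the standard deviation of the genotype only if $V(x_i) = 2f_i(1-f_i)$.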

We show that GCTA under- and over-estimates the genetic variance in admixed populations for traits under divergent (Trait 1) and stabilizing selection (Trait 2), respectively, when there is population structure, i.e., when $V(\theta) > 0$ (Fig. 4A). One reason for this bias is that the GREML model assumes that the effects are independent, which, as discussed in the previous section, is not true for traits under divergent or stabilizing selection between the source populations, and only true for neutral traits in expectation. Because of this, $\hat{\sigma}_u^2$ does not capture the LD contribution, i.e., term (1.4) (Fig. S3A). In our simulations, the LD across loci is entirely due to population structure since they are unlinked. Thus, the LD contribution, and therefore the bias in $\hat{\sigma}_u^2$, is larger in the presence of population structure.

Figure 4:

The behavior of GREML estimates of the genetic variance $\hat{\sigma}_u^2$ using global ancestry as a covariate in admixed populations under the HI (left column) and CGF (right column) models. The solid lines represent values observed in simulations averaged across ten replicates and the dotted lines represent the expected values based on Eq. 1. Red and blue lines represent values for traits under divergent and stabilizing selection, respectively. P indicates the strength of assortative mating. (A) and (B) show the behavior of $\hat{\sigma}_u^2$ using the standard $2f(1-f)$ genotype scaling. In (C), we show $\hat{\sigma}_u^2$ when genotypes are scaled by the square root of their sample variance, i.e., $\sqrt{V(x)}$.

But $\hat{\sigma}_u^2$ can be biased, even if the effects are uncorrelated and the LD contribution is zero. It is standard in GREML to scale genotypes with $\sqrt{2f_i(1-f_i)}$, where $f_i$ is the frequency of the allele in the population. This scaling assumes that $V(x_i) = 2f_i(1-f_i)$, which is true only if the population is mating randomly. In an admixed population, $V(x_i) = 2f_i(1-f_i) + 2V(\theta)(f_i^A - f_i^B)^2$, where $f_i$, $f_i^A$, and $f_i^B$ correspond to the frequency in the admixed population and in source populations A and B, respectively. We show that this assumption biases $\hat{\sigma}_u^2$ downwards by the contribution of $2V(\theta)(f_i^A - f_i^B)^2$ at each locus, i.e., term (1.3) (Fig. S3B, Appendix). We confirm this by showing that we can fix this bias if we scale genotypes by the square root of their sample variance, i.e., $\sqrt{V(x_i)}$ (Fig. S3C) (Appendix). Thus, with the standard scaling, $\hat{\sigma}_u^2$ is not even equal to the genic variance in the presence of population structure.
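The excess genotypic variance in an admixed population can be checked numerically at a single locus. The sketch below is illustrative: individual ancestry is drawn from a Beta(2, 2) distribution (an arbitrary choice, not an admixture model from the paper), each allele copy is assumed to derive from population A with probability θ, and the allele frequencies 0.8 and 0.2 are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200_000
f_a, f_b = 0.8, 0.2  # hypothetical source-population frequencies

# Individual ancestry; Beta(2, 2) is an arbitrary illustrative distribution
theta = rng.beta(2.0, 2.0, size=n)

# Each of the two allele copies carries the allele with probability
# theta * f_a + (1 - theta) * f_b
x = rng.binomial(2, theta * f_a + (1 - theta) * f_b)

f = x.mean() / 2                      # allele frequency in the admixed sample
hwe_var = 2 * f * (1 - f)             # variance assumed by the standard scaling
expected = hwe_var + 2 * theta.var() * (f_a - f_b) ** 2

print("sample variance:      ", x.var())
print("2f(1-f):              ", hwe_var)
print("2f(1-f) + 2V(th)d^2:  ", expected)
```

The sample variance tracks the admixed-population expression rather than $2f(1-f)$, which is why dividing by $\sqrt{2f_i(1-f_i)}$ leaves standardized genotypes with variance greater than one.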

The overall bias in $\hat{\sigma}_u^2$ is determined by the relative magnitude and direction of terms (1.3) and (1.4), both of which are functions of $V(\theta)$, and therefore, of the degree of structure in the population. If there is no more gene flow, $V(\theta)$ will ultimately go to zero and $V_g$ will converge towards $\hat{\sigma}_u^2$. But note that even though $\hat{\sigma}_u^2$ may be biased relative to $V_g$, it is not biased relative to $\sigma_u^2$, the estimand under the model assumed by GREML, which is more accurately interpreted as the genetic variance expected if there were no correlation in effect sizes and if the population were mating randomly. In other words, $E(\hat{\sigma}_u^2) = \sigma_u^2 = (1.1) + (1.2) \neq V_g$ (Fig. S3B).

Local ancestry heritability

A related quantity of interest in admixed populations is local ancestry heritability $h^2_\gamma$, which is defined as the proportion of phenotypic variance that can be explained by local ancestry. Zaitlen et al. (2014) [29] showed that this quantity is related to, and can be used to estimate, $h^2_{snp}$ in admixed populations. The advantage of this ‘indirect’ approach is that local ancestry segments shared between individuals are identical by descent and are therefore more likely to tag causal variants compared to array markers, allowing one to potentially capture the contributions of rare and structural variants [29]. Here, we show that under the random effects model, $h^2_\gamma$ may not be equal to the local ancestry heritability in the population, for the same reasons that $h^2_{snp}$ is not equal to the heritability.

We define local ancestry $\gamma_i \in \{0, 1, 2\}$ as the number of alleles at locus $i$ that trace their ancestry to population A. Thus, ancestry at the $i$th locus in individual $k$ is a binomial random variable with $E(\gamma_{ik}) = 2\theta_k$. Similar to the genetic value of an individual, we define ‘ancestry value’ as $\sum_{i=1}^{m} \phi_i \gamma_i$, where $\phi_i = \beta_i (f_i^A - f_i^B)$ is the effect size of local ancestry (Appendix). Then, the genetic variance due to local ancestry can be expressed as:

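$$
\begin{aligned}
V_\gamma &= V\left(\sum_{i=1}^{m}\phi_i\gamma_i\right) = \sum_{i=1}^{m}\phi_i^2 V(\gamma_i) + \sum_{i=1}^{m}\sum_{j\neq i}\phi_i\phi_j C(\gamma_i,\gamma_j)\\
&= 2E(\theta)\{1-E(\theta)\}\sum_{i=1}^{m}\phi_i^2 + 2V(\theta)\sum_{i=1}^{m}\phi_i^2 + 4V(\theta)\sum_{i=1}^{m}\sum_{j\neq i}\phi_i\phi_j\\
&= 2E(\theta)\{1-E(\theta)\}\sum_{i=1}^{m}\beta_i^2(f_i^A-f_i^B)^2 + 2V(\theta)\sum_{i=1}^{m}\beta_i^2(f_i^A-f_i^B)^2 + 4V(\theta)\sum_{i=1}^{m}\sum_{j\neq i}\beta_i\beta_j(f_i^A-f_i^B)(f_j^A-f_j^B)
\end{aligned}
$$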

and heritability explained by local ancestry is simply the ratio of $V_\gamma$ and the phenotypic variance. Note that $V_\gamma = (1.2) + (1.3) + (1.4)$, the genetic variance between the source populations, and therefore its behavior is similar to $V_g$ in that the terms (1.3) and (1.4) decay towards zero as $V(\theta) \to 0$, and $V_\gamma$ converges to (1.2) (Fig. S2).
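The expression for the variance due to local ancestry can be evaluated in the same way as Eq. 1. The sketch below is illustrative, with a hypothetical function name `v_gamma` and made-up inputs; the double sum over distinct loci is computed as the square of the sum minus the sum of squares.

```python
import numpy as np

def v_gamma(beta, f_a, f_b, e_theta, v_theta):
    """Variance of the 'ancestry value', i.e., terms (1.2) + (1.3) + (1.4)."""
    phi = np.asarray(beta, dtype=float) * (np.asarray(f_a) - np.asarray(f_b))
    s2 = np.sum(phi ** 2)              # sum of phi_i^2
    cross = np.sum(phi) ** 2 - s2      # sum over i != j of phi_i * phi_j
    return (2 * e_theta * (1 - e_theta) * s2   # (1.2)
            + 2 * v_theta * s2                 # (1.3)
            + 4 * v_theta * cross)             # (1.4)

# Hypothetical three-locus example
beta = [0.5, -0.3, 0.2]
f_a = [0.8, 0.4, 0.1]
f_b = [0.2, 0.5, 0.6]

print(v_gamma(beta, f_a, f_b, e_theta=0.5, v_theta=0.05))
# With no variance in ancestry only term (1.2) remains
print(v_gamma(beta, f_a, f_b, e_theta=0.5, v_theta=0.0))
```

A useful check is the unadmixed mixture, where θ is 0 or 1 and V(θ) = E(θ){1 - E(θ)}: the three terms then collapse to 4E(θ){1 - E(θ)}(Σφ)², which is zero whenever the mean genetic values of the two source populations are equal.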

GREML estimation of $h^2_\gamma$ is similar to the estimation of $h^2_{snp}$, the key difference being that the former involves constructing the GRM using local ancestry instead of genotypes [29]. The following model is assumed: $y = Wv + \xi$, where $W$ is an $n \times m$ standardized local ancestry matrix, $v \sim \mathcal{N}\left(0, I\frac{\sigma_v^2}{m}\right)$ are local ancestry effects, and $\xi \sim \mathcal{N}(0, I\sigma_\xi^2)$. The phenotypic variance is decomposed as $V(y) = V(Wv) + V(\xi) = \frac{WW^{\top}}{m}\sigma_v^2 + \sigma_\xi^2$, where $\frac{WW^{\top}}{m}$ is the local ancestry GRM and $\sigma_v^2$ is the parameter of interest, which is believed to be equal to $V_\gamma$, the genetic variance due to local ancestry. We show here that this equivalence is not guaranteed in the presence of population structure and/or correlated effects.

We calculated the GRM from local ancestry at causal variants with our simulated data, and estimated $\sigma_v^2$ with individual ancestry as a fixed effect to correct for genetic stratification [29]. We show that, in the presence of population structure, i.e., when $V(\theta) > 0$, $\hat{\sigma}_v^2$ is biased downwards relative to $V_\gamma$ for traits under divergent selection and upwards for traits under stabilizing selection (Fig. 5A). In addition, if local ancestry is scaled by its expectation under random mating rather than the square root of the sample variance, $\hat{\sigma}_v^2$ will be underestimated even if the effects are uncorrelated (Fig. 5B). The overall bias is equal to the terms weighted by $V(\theta)$, i.e., (1.3) + (1.4). As a result, $\hat{h}^2_\gamma$ is not guaranteed to be equal to the heritability explained by local ancestry, even in the absence of confounding and even if local ancestry at causal variants is known without error. This is not because of an inherent bias in the estimation procedure, since $E(\hat{\sigma}_v^2) = \sigma_v^2 = (1.2)$, but because the estimand itself, as defined under the random-effects model, does not capture the variance due to LD and population structure.

Figure 5:

The behavior of GREML estimates of the variance due to local ancestry $\hat{\sigma}_v^2$ using global ancestry as a covariate in admixed populations under the HI (left column) and CGF (right column) models. The solid lines represent values observed in simulations averaged across ten replicates and the dotted lines represent the expected values based on Eq. 1. Red and blue lines represent values for Traits 1 and 2, respectively. P indicates the strength of assortative mating. (A) and (B) show the behavior of $\hat{\sigma}_v^2$ when the default scaling of local ancestry is used. In (C), we show $\hat{\sigma}_v^2$ when local ancestry is scaled with the square root of the sample variance, i.e., $\sqrt{V(\gamma_i)}$.

We note that if individual ancestry is not included as a covariate, $\hat{\sigma}_v^2$ tends to be biased even relative to $\sigma_v^2$ due to confounding effects of genetic stratification (Fig. S4). We did not see the same level of confounding when individual ancestry was not included in the model estimating $\sigma_u^2$ (Fig. S3). We do not fully understand the reason for this but we think it might be because genetic stratification leads to more inflation in the effect size of local ancestry compared to the effect size of genotype.

How much does population structure contribute in practice?

In the previous sections, we showed theoretically that $h^2_{snp}$ is not equal to $h^2$ in admixed populations even if the causal variants are known. Ultimately, whether or not this is true in practice is an empirical question, which is difficult to answer because the causal variants, their $F_{ST}$, and the correlation between their effect sizes are unknown. Here, we sought to answer a related question: to what extent does population structure contribute to the variance explained by GWAS SNPs in African Americans? To answer this, we used independent genome-wide significant SNPs for 26 quantitative traits from the GWAS catalog [30]. We calculated the total genetic variance explained and decomposed it into the four components in Equation 1 using allele frequencies ($f^A$ and $f^B$) from the 1000 Genomes YRI and CEU [31], and the mean ($E(\theta) \approx 0.77$) and variance ($V(\theta) \approx 0.02$) of individual ancestry from the 1000 Genomes ASW (Methods).

We show that for skin pigmentation, a trait under strong divergent selection, the LD contribution, i.e., term (1.4), is positive and accounts for ≈ 40–50% of the total variance explained. This is because of large allele frequency differences between Africans and Europeans that are correlated across skin pigmentation loci due to strong selection favoring alleles for darker pigmentation in regions with high UV exposure and vice versa [32–35]. But for most other traits, LD contributes relatively little, explaining a modest but non-negligible proportion of the genetic variance in height, LDL and HDL cholesterol, mean corpuscular hemoglobin (MCH), neutrophil count (NEU), and white blood cell count (WBC) (Fig. 6). Because we selected independent associations for this exercise (Methods), the LD contribution is driven entirely by population structure among African Americans. The contribution of population structure to the genic variance, i.e., term (1.3), is also small even for traits like skin pigmentation and neutrophil count with large effect alleles that are highly diverged in frequency between Africans and Europeans [33, 34, 36–38]. Overall, this suggests that population structure contributes relatively little, at least to the variance explained by GWAS SNPs.
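A back-of-the-envelope calculation suggests why term (1.3) is bound to be small for this sample regardless of the trait. Terms (1.2) and (1.3) of Equation 1 share the same sum over loci and differ only in their ancestry weights, so their ratio depends on the mean and variance of ancestry alone. Plugging in the approximate ASW values quoted above (this arithmetic is ours, not a result reported in the paper):

$$\frac{(1.3)}{(1.2)} = \frac{2V(\theta)}{2E(\theta)\{1-E(\theta)\}} = \frac{V(\theta)}{E(\theta)\{1-E(\theta)\}} \approx \frac{0.02}{0.77 \times 0.23} \approx 0.11$$

That is, with these ancestry parameters term (1.3) is roughly a tenth of term (1.2) for any set of effect sizes and allele frequencies, and a smaller fraction still of the total genic variance once term (1.1) is included.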

Figure 6:

Decomposing the genetic variance explained by GWAS SNPs in African Americans. We calculated the four variance components listed in Equation 1, their values shown on the y-axis as a fraction of the total variance explained (shown as percentage at the bottom). The number of variants used to calculate variance components for each trait is also shown at the bottom.

Discussion

Despite the growing size of GWAS and discovery of thousands of variants for hundreds of traits [30], the heritability explained by GWAS SNPs remains a fraction of twin-based heritability estimates. Yang et al. (2010) introduced the concept of SNP heritability $h^2_{snp}$, which does not depend on the discovery of causal variants but assumes that they are numerous and more or less uniformly distributed across the genome (the infinitesimal model), with their contributions to the genetic variance ‘tagged’ by genotyped SNPs [3]. $h^2_{snp}$ is now routinely estimated in most genomic studies and at least for some traits (e.g. height and BMI), these estimates now approach twin-based heritability [6]. But despite the widespread use of $h^2_{snp}$, its interpretation remains unclear, particularly its equivalence to heritability in the presence of population structure. It is generally accepted that $\hat{h}^2_{snp}$, the estimator, can be biased in structured populations [4, 7, 9–11, 39]. Here, we show how $\hat{h}^2_{snp}$ may not be equal to $h^2$ in admixed populations even in the absence of confounding and even if causal variants are known.

GREML assumes that SNP effects are random and independent, an assumption that may not be true, especially in the presence of admixture and population structure, which create LD across unlinked loci. This LD contributes to the genetic variance and can persist, despite recombination, for a number of generations due to continued gene flow and/or assortative mating. Because the LD contribution can be positive or negative, $\hat{h}^2_{snp}$ can under- or over-estimate $h^2$. But $\hat{h}^2_{snp}$ can be biased even when effects are uncorrelated if the genotypes are scaled by $2f(1-f)$, the standard approach, which implicitly assumes a randomly mating population. In the presence of population structure, the variance in genotypes can be higher and $\hat{h}^2_{snp}$ does not capture this additional variance. For these reasons, there is no guarantee that $\hat{h}^2_{snp}$ will be equal to $h^2$, even if the causal variants are known. But technically this is not because the estimate is biased, but because the estimand itself as defined under the random effects model is not equal to the heritability [14, 15]. $h^2_{snp}$, assuming the genotypes are scaled properly, is better interpreted as the proportion of phenotypic variance explained by the genic variance. We show that $h^2_\gamma$ as defined under random effects models [29, 40] should be interpreted similarly.

Does the LD contribution to the genetic variance have practical implications? The answer depends on how one intends to use SNP heritability. $h^2_{snp}$ can be useful in quantifying the power to detect variants in GWAS, where the quantity of interest is the genic variance. But if one is interested in using $h^2_{snp}$ to measure the extent to which genetic variation contributes to phenotypic variation, to predict the response to selection, or to define the upper limit of polygenic prediction accuracy [2] (applications where the LD contribution is important), then $h^2_{snp}$ is technically not the relevant quantity.

Ultimately, the discrepancy between $h^2_{snp}$ and $h^2$ in practice is an empirical question, the answer to which depends on the degree of population structure (which we can measure) and the genetic architecture of the trait (which we do not know a priori). We show that for most traits, the contribution of population structure to the variance explained by GWAS SNPs is modest among African Americans. Thus, if we assume that the genetic architecture of GWAS SNPs represents that of all causal variants, then despite incorrect assumptions, the discrepancy between $h^2_{snp}$ and $h^2$ should be fairly modest. But this assumption is obviously unrealistic given that GWAS SNPs are common variants that in most cases cumulatively explain a small fraction of trait heritability. We know that rare variants contribute disproportionately to the genic variance because of negative selection [41]. What is their LD contribution? This will become clearer in the near future with the discovery of rare variants through large sequence-based studies [42]. While these are underway, theoretical studies are needed to understand how different selection regimes influence the LD patterns between causal variants, an important aspect of the genetic architecture of complex traits.

One limitation of this research is that we studied only the GREML estimator of $h^2_{snp}$ because of its widespread use. There are many estimators, which can be broadly grouped into random- and fixed-effect estimators based on how they treat SNP effects [43]. Fixed-effect estimators make fewer distributional assumptions, but they are not as widely used because they require conditional estimates of all variants, a high-dimensional problem where the number of markers is often far larger than the sample size [44]. This is one reason why random-effect estimators, such as GREML, are popular: they reduce the dimensionality by assuming that the effects are drawn from some distribution where the variance is the only parameter of interest. Fixed-effect estimators should be able to capture the LD contribution in principle, but this is not obvious in practice since the simulations used to evaluate the accuracy of such estimators still assume uncorrelated effects [43–45]. Further research is needed to clarify the interpretation of the different estimators of $h^2_{snp}$ in structured populations under a range of genetic architectures.

Methods

Simulating genetic architecture

We first drew the allele frequency $f_0$ of 1,000 biallelic causal loci in the ancestor of populations A and B from a uniform distribution, $U(0.001, 0.999)$. Then, we simulated their frequency in populations A and B ($f_A$ and $f_B$) under the Balding-Nichols model [28], such that $f_A, f_B \sim \text{Beta}\left(\frac{f_0(1-F)}{F}, \frac{(1-f_0)(1-F)}{F}\right)$, where $F = 0.2$ is the inbreeding coefficient. We implemented this using code adapted from [46]. To avoid drawing extremely rare alleles, we continued to draw $f_A$ and $f_B$ until we had 1,000 loci with $f_A, f_B \in (0.01, 0.99)$.
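The sampling scheme above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code adapted from [46]: the function name `draw_bn_frequencies` is hypothetical, and the sketch redraws $f_0$ together with $f_A$ and $f_B$ whenever a locus falls outside the accepted range, which is one possible reading of the rejection step.

```python
import random

def draw_bn_frequencies(n_loci, fst, lo=0.01, hi=0.99, rng=None):
    """Draw (f0, fA, fB) triples under the Balding-Nichols model.

    Assumption: a rejected locus is redrawn from scratch (including f0).
    """
    rng = rng or random.Random()
    scale = (1.0 - fst) / fst
    loci = []
    while len(loci) < n_loci:
        f0 = rng.uniform(0.001, 0.999)          # ancestral frequency
        a, b = f0 * scale, (1.0 - f0) * scale   # Beta shape parameters
        f_a = rng.betavariate(a, b)
        f_b = rng.betavariate(a, b)
        if lo < f_a < hi and lo < f_b < hi:     # drop extremely rare alleles
            loci.append((f0, f_a, f_b))
    return loci

freqs = draw_bn_frequencies(1000, fst=0.2, rng=random.Random(1))
```

Passing a seeded `random.Random` instance keeps the draw reproducible.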

We generated the effect size ($\beta$) of each locus by sampling from $\mathcal{N}\left(0, \frac{1}{2mf(1-f)}\right)$, where $m$ is the number of loci and $f$ is the mean allele frequency across populations A and B. Thus, rare variants have larger effects than common variants and the total genetic variance sums to 1. Given these effects, we simulated two different traits, one with a large difference in means between populations A and B (Trait 1) and the other with roughly no difference (Trait 2). This was achieved by permuting the signs of the effects 100 times to get a distribution of $V_{gb}$, the genetic variance between populations. This has the effect of varying the LD contribution without changing the $F_{ST}$ at causal loci. We selected the maximum and minimum of $V_{gb}$ to represent Traits 1 and 2.
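A minimal sketch of this step is given below. It rests on two assumptions that are not taken from the paper: the signs are shuffled across loci with the magnitudes held fixed, and the squared difference in mean genetic value between A and B is used as a stand-in for $V_{gb}$ (the two are proportional when the populations are equally sized). All function names and the toy frequencies are hypothetical.

```python
import math
import random

def sample_effects(mean_freqs, rng):
    """Draw beta_i ~ N(0, 1 / (2 m f_i (1 - f_i))) for each locus."""
    m = len(mean_freqs)
    return [rng.gauss(0.0, math.sqrt(1.0 / (2 * m * f * (1 - f))))
            for f in mean_freqs]

def mean_difference(betas, f_a, f_b):
    """Difference in mean genetic value between populations A and B."""
    return sum(2 * b * (fa - fb) for b, fa, fb in zip(betas, f_a, f_b))

def pick_extreme_traits(betas, f_a, f_b, n_perm, rng):
    """Shuffle signs n_perm times; return the effect sets with the largest
    and smallest squared mean difference (proxy for V_gb)."""
    signs = [1 if b >= 0 else -1 for b in betas]
    scored = []
    for _ in range(n_perm):
        rng.shuffle(signs)
        signed = [abs(b) * s for b, s in zip(betas, signs)]
        scored.append((mean_difference(signed, f_a, f_b) ** 2, signed))
    scored.sort(key=lambda pair: pair[0])
    return scored[-1][1], scored[0][1]   # (Trait 1, Trait 2)

rng = random.Random(7)
f_a = [0.10, 0.50, 0.80, 0.30]   # toy frequencies (hypothetical)
f_b = [0.40, 0.50, 0.20, 0.60]
betas = sample_effects([(a + b) / 2 for a, b in zip(f_a, f_b)], rng)
trait1, trait2 = pick_extreme_traits(betas, f_a, f_b, n_perm=100, rng=rng)
```

Because only the signs change, the genic variance and the allele frequencies at the causal loci are identical for the two traits.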

Simulating admixture

We simulated the genotypes, local ancestry, and phenotype for 10,000 admixed individuals per generation under the hybrid isolation (HI) and continuous gene flow (CGF) models by adapting the code from Zaitlen et al. (2017) [24]. We denote the ancestry of a randomly selected individual $k$ with $\theta$, the fraction of their genome from population A. At $t = 0$ under the HI model, we set $\theta$ to 1 for individuals from population A and 0 if they were from population B such that $E(\theta) = p \in \{0.1, 0.2, 0.5\}$, with no further gene flow from either source population. In the CGF model, population B receives a constant amount $q$ from population A in every generation starting at $t = 0$. The mean overall proportion of ancestry in the population is kept the same as in the HI model by setting $q = 1 - (1-p)^{1/t}$, where $t$ is the number of generations of gene flow from A. In every generation, we simulated ancestry-based assortative mating by selecting mates such that the correlation between their ancestries is $P \in \{0, 0.3, 0.6, 0.9\}$. We do this by repeatedly permuting individuals with respect to each other until $P$ falls within $\pm 0.01$ of the desired value. It becomes difficult to meet this criterion when $V(\theta)$ is small (Fig. 2C). To overcome this, we relaxed the threshold up to 0.04 for some conditions, i.e., when $\theta \in \{0.1, 0.2\}$ and $t \geq 50$. We generated the expected variance in individual ancestry using the expression in [24]. At time $t$ since admixture, $V(\theta_t) = \frac{V(\theta_{t-1})(1+P)}{2}$ under the HI model, where $P$ measures the strength of assortative mating, i.e., the correlation between the ancestries of individuals in a mating pair. Under the CGF model, $V(\theta_t) = q(1-q)E(\theta_{t-1})^2 + q(1-q)\{1 - 2E(\theta_{t-1})\} + \frac{(1-q)V(\theta_{t-1})(1+P)}{2}$ (Appendix).
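The two recursions for the expected variance in ancestry can be iterated directly. The sketch below assumes particular starting values that are our reading of the models rather than something stated explicitly: $V(\theta_0) = p(1-p)$ under HI (unadmixed founders), and $E(\theta_0) = V(\theta_0) = 0$ under CGF (only population B present before gene flow begins). The function names are hypothetical.

```python
def expected_var_hi(p, assort, generations):
    """V(theta_t) under hybrid isolation for t = 0..generations."""
    var = p * (1 - p)            # assumed V(theta_0): unadmixed founders
    out = [var]
    for _ in range(generations):
        var = var * (1 + assort) / 2
        out.append(var)
    return out

def expected_var_cgf(p, assort, generations):
    """(E(theta_t), V(theta_t)) under continuous gene flow."""
    q = 1 - (1 - p) ** (1 / generations)   # per-generation gene flow
    mean, var = 0.0, 0.0                   # assumed start: population B only
    out = [(mean, var)]
    for _ in range(generations):
        new_var = (q * (1 - q) * mean ** 2
                   + q * (1 - q) * (1 - 2 * mean)
                   + (1 - q) * var * (1 + assort) / 2)
        mean = q + (1 - q) * mean
        var = new_var
        out.append((mean, var))
    return out

hi_curve = expected_var_hi(0.5, 0.0, 3)
cgf_curve = expected_var_cgf(0.2, 0.3, 20)
```

With this choice of $q$, the mean ancestry under CGF reaches $p$ after the specified number of generations, which matches the constraint described in the text.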

We sampled the local ancestry at each locus $i$ as $\gamma_i = \gamma_i^f + \gamma_i^m$, where $\gamma_i^m \sim \text{Bin}(1, \theta^m)$, $\gamma_i^f \sim \text{Bin}(1, \theta^f)$, and $\theta^m$ and $\theta^f$ represent the ancestry of the maternal and paternal chromosome, respectively. The global ancestry of the individual is then calculated as $\theta_k = \frac{\sum_{i=1}^{m} (\gamma_i^m + \gamma_i^f)}{2m}$, where $m$ is the number of loci. We sample the genotype $x_i$ from a binomial distribution conditioning on local ancestry. Thus, $x_i^m \sim \text{Bin}(1, f_i^A)$ if $\gamma_i^m = 1$ and $x_i^m \sim \text{Bin}(1, f_i^B)$ if $\gamma_i^m = 0$, and similarly for $x_i^f$. Then, the genotype can be obtained as the sum of the maternal and paternal genotypes: $x_i = x_i^m + x_i^f$. We calculate the genetic value of each individual as $g = \sum_{i=1}^{m} \beta_i x_i$ and the genetic variance as $V(g)$.
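As an illustration of the sampling described above, the following sketch draws local ancestry and genotypes for a single individual from the ancestries of the two parental chromosomes. It is a simplified stand-in for the simulation code adapted from [24], and the function names and toy inputs are hypothetical.

```python
import random

def simulate_individual(theta_m, theta_f, f_a, f_b, rng):
    """Sample local ancestry and genotypes for one individual.

    theta_m, theta_f: ancestry of the maternal and paternal chromosome.
    Returns (local_ancestry, genotypes, global_ancestry).
    """
    local, geno = [], []
    for fa, fb in zip(f_a, f_b):
        gamma_i, x_i = 0, 0
        for theta in (theta_m, theta_f):
            from_a = rng.random() < theta       # Bin(1, theta)
            freq = fa if from_a else fb         # frequency given ancestry
            gamma_i += int(from_a)
            x_i += int(rng.random() < freq)     # Bin(1, freq)
        local.append(gamma_i)
        geno.append(x_i)
    theta_k = sum(local) / (2 * len(local))
    return local, geno, theta_k

def genetic_value(betas, geno):
    """g = sum_i beta_i * x_i."""
    return sum(b * x for b, x in zip(betas, geno))

rng = random.Random(3)
local, geno, theta_k = simulate_individual(
    0.7, 0.4, [0.2, 0.9, 0.5], [0.6, 0.1, 0.5], rng)
```

Repeating the call over many individuals and taking the variance of `genetic_value` gives an empirical counterpart to $V(g)$.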

Heritability estimation with GCTA

We used GCTA [5] to estimate $\sigma^2_u$ and $\sigma^2_v$ using the --reml and --reml-no-constrain flags. Because the model cannot be fit when the genetic values carry no error, we simulated individual phenotypes with a heritability of 0.8 by adding random noise $e \sim \mathcal{N}(0, V_g/4)$ to the genetic value. We constructed the standard GRM with the --make-grm flag. We also constructed an 'adjusted' GRM, where instead of standardizing the genotype of the $i$th SNP with $\sqrt{2f_i(1-f_i)}$, we used $\sqrt{V(x_i)}$. Similarly, the 'adjusted' local ancestry GRM was constructed by scaling local ancestry with $\sqrt{V(\gamma_i)}$. For both $\sigma^2_u$ and $\sigma^2_v$, we included individual ancestry as a covariate to correct for any confounding due to genetic stratification.
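The difference between the standard and 'adjusted' scaling can be illustrated with a small, pure-Python GRM. This is a sketch for intuition, not a reimplementation of GCTA: the function `grm` is hypothetical, allele frequencies are taken from the sample, monomorphic loci are skipped, and at least one polymorphic locus is assumed.

```python
import math

def grm(genotypes, adjusted=False):
    """Genetic relationship matrix from an n x m list of genotypes.

    adjusted=False scales locus i by sqrt(2 f_i (1 - f_i));
    adjusted=True scales by the square root of the sample variance.
    """
    n, m = len(genotypes), len(genotypes[0])
    z_cols = []
    for i in range(m):
        col = [row[i] for row in genotypes]
        mean = sum(col) / n                     # equals 2 * f_i
        f = mean / 2
        if adjusted:
            var = sum((x - mean) ** 2 for x in col) / n
        else:
            var = 2 * f * (1 - f)
        if var <= 0:                            # monomorphic locus
            continue
        sd = math.sqrt(var)
        z_cols.append([(x - mean) / sd for x in col])
    used = len(z_cols)
    return [[sum(z[k] * z[l] for z in z_cols) / used for l in range(n)]
            for k in range(n)]

toy = [[0, 1, 2], [1, 1, 0], [2, 0, 1], [1, 2, 2]]   # hypothetical genotypes
standard_grm = grm(toy)
adjusted_grm = grm(toy, adjusted=True)
```

With the adjusted scaling the diagonal of the GRM averages exactly 1 in the sample, whereas with the standard scaling it exceeds 1 on average whenever the genotype variance is inflated by population structure.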

Estimating variance explained by GWAS SNPs

To decompose the variance explained by GWAS SNPs in African Americans, we needed four quantities: (i) effect sizes of GWAS SNPs, (ii) their allele frequencies in Africans and Europeans, and (iii) the mean and (iv) the variance of global ancestry in African Americans (Equation 1).

We retrieved the summary statistics of 26 traits from the GWAS Catalog [30]. The full list of traits and the source papers [47–55] is given in Table S1. To maximize the number of variants discovered, we chose summary statistics from studies that were conducted in both European and multi-ancestry samples and that reported the following information: effect allele, effect size, p-value, and genomic position. For birth weight, we downloaded the data from the Early Growth Genetics (EGG) consortium website [52] since the version reported on the GWAS Catalog is incomplete. For skin pigmentation, we chose summary statistics from the UKB [56] released by the Neale Lab (http://www.nealelab.is/uk-biobank) and processed by Ju and Mathieson [47] to represent effect sizes estimated among individuals of European ancestry. We also selected summary statistics from Lona-Durazo et al. (2019), where effect sizes were meta-analyzed across four admixed cohorts [48]. Lona-Durazo et al. provide summary statistics separately with and without conditioning on rs1426654 and rs35397, two large-effect variants in SLC24A5 and SLC45A2. We used the 'conditioned' effect sizes and added in the effects of rs1426654 and rs35397 to estimate genetic variance.

We selected independent hits for each trait by pruning and thresholding with PLINK v1.90b6.21 [57] in two steps as in Ju and Mathieson [47]. We used the genotype data of GBR from the 1000 Genomes Project [31] as the LD reference panel. First, we kept only SNPs (indels were removed) that passed the genome-wide significance threshold (--clump-p1 5e-8) with a pairwise LD cutoff of 0.05 (--clump-r2 0.05) and a physical distance threshold of 250 kb (--clump-kb 250) for clumping. Second, we applied another round of clumping (--clump-kb 100) to remove SNPs within 100 kb.

When GWAS was carried out separately in different ancestry cohorts in the same study, we used inverse-variance weighting to meta-analyze effect sizes for variants that were genome-wide significant (p-value $< 5 \times 10^{-8}$) in at least one cohort. This allowed us to maximize the discovery of variants such as the Duffy null allele that are absent among individuals of European ancestry but polymorphic in other populations [38].
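For reference, a standard fixed-effect inverse-variance-weighted combination of per-cohort estimates can be written as follows. This is a generic sketch of the technique, assuming each cohort reports an effect size and its standard error for the same effect allele; it is not taken from the analysis code, and the function name is hypothetical.

```python
import math

def ivw_meta(estimates):
    """Fixed-effect inverse-variance-weighted meta-analysis.

    estimates: list of (beta, standard_error) pairs, one per cohort.
    Returns (pooled_beta, pooled_standard_error).
    """
    weights = [1.0 / se ** 2 for _, se in estimates]
    total = sum(weights)
    beta = sum(w * b for w, (b, _) in zip(weights, estimates)) / total
    return beta, math.sqrt(1.0 / total)

# Two hypothetical cohorts with equal precision.
pooled_beta, pooled_se = ivw_meta([(0.1, 0.1), (0.3, 0.1)])
```

Cohorts with smaller standard errors receive proportionally more weight, so the pooled estimate is dominated by the largest or most informative cohort.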

We used allele frequencies from the 1000 Genomes CEU and YRI to represent the allele frequencies of GWAS SNPs in Europeans and Africans, respectively, making sure that the alleles reported in the summary statistics matched the alleles reported in the 1000 Genomes. We estimated the global ancestry of ASW individuals (N = 74) with CEU and YRI individuals from the 1000 Genomes (phase 3) using ADMIXTURE 1.3.0 [58] with k = 2 and used it to calculate the mean (proportion of African ancestry = 0.767) and variance (0.018) of global ancestry in ASW. With the effect sizes, allele frequencies, and the mean and variance in ancestry, we calculated the four components of genetic variance using Equation 1 and expressed them as a fraction of the total genetic variance.
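The calculation can be sketched as below. The function follows the expression for $V_g$ derived in the Appendix and returns its five additive terms separately (two within-population terms, two terms in $(f_i^A - f_i^B)^2$, and the LD term); how these are grouped into the four components of Equation 1 is left to the main text. The function name and the toy inputs are hypothetical, and $\theta$ here denotes ancestry from population A.

```python
def variance_terms(betas, f_a, f_b, mean_theta, var_theta):
    """Additive terms of V_g for unlinked loci in an admixed population."""
    d = [fa - fb for fa, fb in zip(f_a, f_b)]
    bd = [b * di for b, di in zip(betas, d)]
    sum_bd_sq = sum(v * v for v in bd)
    terms = {
        "within_a": 2 * mean_theta * sum(
            b * b * fa * (1 - fa) for b, fa in zip(betas, f_a)),
        "within_b": 2 * (1 - mean_theta) * sum(
            b * b * fb * (1 - fb) for b, fb in zip(betas, f_b)),
        "between_mean": 2 * mean_theta * (1 - mean_theta) * sum_bd_sq,
        "between_var": 2 * var_theta * sum_bd_sq,
        # sum over i != j of b_i b_j d_i d_j = (sum b d)^2 - sum (b d)^2
        "ld": 4 * var_theta * (sum(bd) ** 2 - sum_bd_sq),
    }
    total = sum(terms.values())
    fractions = {name: value / total for name, value in terms.items()}
    return terms, fractions

# Toy example: two loci, equal mixture of two unadmixed populations.
terms, fractions = variance_terms(
    [1.0, -1.0], [0.8, 0.3], [0.2, 0.6], mean_theta=0.5, var_theta=0.25)
```

The LD term can be negative, so the individual fractions are not guaranteed to lie between 0 and 1 even though they sum to 1.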

Initially, the multi-ancestry summary statistics for a few traits (NEU, WBC, MON, MCH, BAS) yielded values >1 for the proportion of variance explained. This is likely because, despite LD pruning, some of the variants in the model are not independent and tag large-effect variants under divergent selection such as the Duffy null allele, leading to an inflated contribution of LD. We checked this by calculating the pairwise contribution, i.e., $\beta_i \beta_j (f_i^A - f_i^B)(f_j^A - f_j^B)$, of all SNPs in the model and show long-range positive LD between variants on chromosome 1 for NEU, WBC, and MON, especially with the Duffy null allele (Fig. S5). A similar pattern was observed on chromosome 16 for MCH, confirming our suspicion. This also suggests that for certain traits, pruning and thresholding approaches are not guaranteed to yield independent hits. To get around this problem, we retained only the one association with the lowest p-value on each of chromosome 1 (rs2814778 for NEU, WBC, and MON) and chromosome 16 (rs13331259 for MCH) (Fig. S5). For BAS, we observed that the variance explained was driven by a rare variant (rs188411703, MAF = 0.0024) of large effect ($\beta = -2.27$). We believe this effect estimate to be inflated and therefore removed it from our calculation.

As a sanity check, we independently estimated the genetic variance as the variance in polygenic scores in ASW individuals, calculated using the --score-sum flag in PLINK [57]. We compared the first estimate of the genetic variance to the second (Fig. S6) to confirm two things: (i) the allele frequencies, and the mean and variance in ancestry, are estimated correctly, and (ii) the variants are more or less independent in that they do not absorb the effects of other variants in the model. We show that the two estimates of the genetic variance are strongly correlated ($r \sim 0.85$, Fig. S6).

Supplementary Material

Supplement 1

Acknowledgements

We thank Iain Mathieson for helpful comments on the manuscript. This study was funded by National Institute of General Medical Sciences award R00GM137076 to A.A.Z. The content of this paper is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

Appendix

Variance in ancestry

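We denote variance and covariance with $V(\cdot)$ and $C(\cdot)$ and used the expressions in [24] to generate the expected value for the variance in ancestry, i.e., $V(\theta)$. This is straightforward for the HI model, where at time $t$, $V(\theta_t) = \frac{V(\theta_{t-1})(1 + P_{t-1})}{2}$. $P = \text{Cor}(\theta^m, \theta^f)$ measures the strength of assortative mating, i.e., the correlation between the ancestries in a mating pair $(\theta^m, \theta^f)$. Since our notation differs slightly from [24], we re-derived the expression for $V(\theta_t)$ for the CGF model, where population B receives a constant amount $q$ of gene flow from population A in every generation. Note that $E(\theta_t) = q + (1-q)E(\theta_{t-1})$. Then,

$$
\begin{aligned}
V(\theta_t) &= E(\theta_t^2) - E(\theta_t)^2 \\
&= q + (1-q)E\left\{\left(\frac{\theta^m_{t-1} + \theta^f_{t-1}}{2}\right)\left(\frac{\theta^m_{t-1} + \theta^f_{t-1}}{2}\right)\right\} - \{q + (1-q)E(\theta_{t-1})\}^2 \\
&= q + \frac{1-q}{4}\{2E(\theta_{t-1}^2) + 2E(\theta^m_{t-1}\theta^f_{t-1})\} - \{q^2 + 2q(1-q)E(\theta_{t-1}) + (1-q)^2E(\theta_{t-1})^2\} \\
&= q + \frac{1-q}{2}E(\theta_{t-1}^2) + \frac{1-q}{2}E(\theta^m_{t-1}\theta^f_{t-1}) - q^2 - 2q(1-q)E(\theta_{t-1}) - (1-q)^2E(\theta_{t-1})^2 \\
&= q(1-q) + \frac{1-q}{2}\{V(\theta_{t-1}) + E(\theta_{t-1})^2\} + \frac{1-q}{2}\{C(\theta^m_{t-1}, \theta^f_{t-1}) + E(\theta_{t-1})^2\} - 2q(1-q)E(\theta_{t-1}) - (1-q)^2E(\theta_{t-1})^2 \\
&= q(1-q) + \frac{1-q}{2}V(\theta_{t-1}) + \frac{1-q}{2}E(\theta_{t-1})^2 + \frac{1-q}{2}P_{t-1}V(\theta_{t-1}) + \frac{1-q}{2}E(\theta_{t-1})^2 - 2q(1-q)E(\theta_{t-1}) - (1-q)^2E(\theta_{t-1})^2 \\
&= q(1-q) + \frac{1-q}{2}V(\theta_{t-1})(1 + P_{t-1}) + (1-q)E(\theta_{t-1})^2 - 2q(1-q)E(\theta_{t-1}) - (1-q)^2E(\theta_{t-1})^2 \\
&= q(1-q)E(\theta_{t-1})^2 + q(1-q)\{1 - 2E(\theta_{t-1})\} + \frac{1-q}{2}V(\theta_{t-1})(1 + P_{t-1})
\end{aligned}
$$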

Genetic variance

Let $y = g + e$, where $y$ is the phenotypic value of an individual, $g$ is the genotypic value, and $e$ is random error. We assume additive effects such that $g = \sum_{i=1}^{m} \beta_i x_i$, where $\beta_i$ is the effect size of the $i$th biallelic locus and $x_i \in \{0, 1, 2\}$ is the number of copies of the trait-increasing allele. Then, the genetic variance $V_g$ is:

$$
V_g = V\left(\sum_{i=1}^{m} \beta_i x_i\right) = \sum_{i=1}^{m} \beta_i^2 V(x_i) + \sum_{i=1}^{m}\sum_{j \neq i} \beta_i \beta_j C(x_i, x_j)
$$

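We first derive $V(x_i)$ as a function of ancestry ($\theta$) using the law of total variance:

$$
V(x_i) = E\{V(x_i \mid \theta)\} + V\{E(x_i \mid \theta)\}
$$

where $\theta$ represents global ancestry. Let $f_i^A$ and $f_i^B$ represent the allele frequency at the $i$th locus in populations A and B, respectively. Then,

$$
\begin{aligned}
E\{V(x_i \mid \theta)\} &= E\{2\theta f_i^A(1 - f_i^A) + 2(1-\theta) f_i^B(1 - f_i^B) + 2\theta(1-\theta)(f_i^A - f_i^B)^2\} \\
&= 2E(\theta) f_i^A(1 - f_i^A) + 2\{1 - E(\theta)\} f_i^B(1 - f_i^B) + 2\{E(\theta) - E(\theta^2)\}(f_i^A - f_i^B)^2 \\
&= 2E(\theta) f_i^A(1 - f_i^A) + 2\{1 - E(\theta)\} f_i^B(1 - f_i^B) + 2\{E(\theta) - V(\theta) - E(\theta)^2\}(f_i^A - f_i^B)^2 \\
&= 2E(\theta) f_i^A(1 - f_i^A) + 2\{1 - E(\theta)\} f_i^B(1 - f_i^B) + 2E(\theta)\{1 - E(\theta)\}(f_i^A - f_i^B)^2 - 2V(\theta)(f_i^A - f_i^B)^2
\end{aligned}
$$

$$
\begin{aligned}
V\{E(x_i \mid \theta)\} &= V\{2\theta f_i^A + 2(1-\theta) f_i^B\} \\
&= 4(f_i^A)^2 V(\theta) + 4(f_i^B)^2 V(1-\theta) + 2C\{2\theta f_i^A, 2(1-\theta) f_i^B\} \\
&= 4(f_i^A)^2 V(\theta) + 4(f_i^B)^2 V(\theta) - 8 f_i^A f_i^B V(\theta) \\
&= 4V(\theta)(f_i^A - f_i^B)^2
\end{aligned}
$$

$$
V(x_i) = 2E(\theta) f_i^A(1 - f_i^A) + 2\{1 - E(\theta)\} f_i^B(1 - f_i^B) + 2E(\theta)\{1 - E(\theta)\}(f_i^A - f_i^B)^2 + 2V(\theta)(f_i^A - f_i^B)^2
$$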
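The closed-form expression for the genotype variance can be checked numerically without simulation. The sketch below assumes, for illustration only, that $\theta$ takes a small number of discrete values with known probabilities, and that given $\theta$ each of the two allele copies carries the trait-increasing allele with probability $\theta f^A + (1-\theta) f^B$; the exact variance computed by enumeration should then agree with the formula. Function names and the example distribution are hypothetical.

```python
def genotype_variance_closed_form(f_a, f_b, mean_theta, var_theta):
    """V(x) from the law-of-total-variance expression."""
    d2 = (f_a - f_b) ** 2
    return (2 * mean_theta * f_a * (1 - f_a)
            + 2 * (1 - mean_theta) * f_b * (1 - f_b)
            + 2 * mean_theta * (1 - mean_theta) * d2
            + 2 * var_theta * d2)

def genotype_variance_enumerated(f_a, f_b, theta_dist):
    """Exact V(x) when theta takes value t with probability w.

    theta_dist: list of (t, w) pairs with weights summing to 1.
    Given theta, x ~ Bin(2, p) with p = t * f_a + (1 - t) * f_b.
    """
    e_x, e_x2 = 0.0, 0.0
    for t, w in theta_dist:
        p = t * f_a + (1 - t) * f_b
        e_x += w * 2 * p
        e_x2 += w * (2 * p * (1 - p) + (2 * p) ** 2)   # V + E^2 given theta
    return e_x2 - e_x ** 2

theta_dist = [(0.1, 0.3), (0.5, 0.4), (0.9, 0.3)]      # hypothetical
mean_theta = sum(t * w for t, w in theta_dist)
var_theta = sum(t * t * w for t, w in theta_dist) - mean_theta ** 2
closed = genotype_variance_closed_form(0.8, 0.2, mean_theta, var_theta)
exact = genotype_variance_enumerated(0.8, 0.2, theta_dist)
```

The two values agree to floating-point precision, and the difference from $2f(1-f)$ evaluated at the admixed-population frequency is exactly the $2V(\theta)(f^A - f^B)^2$ term.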

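Similarly, we derive $C(x_i, x_j)$ using the law of total covariance:

$$
\begin{aligned}
C(x_i, x_j) &= E\{C(x_i, x_j \mid \theta)\} + C\{E(x_i \mid \theta), E(x_j \mid \theta)\} \\
&= 0 + C\{2f_i^A\theta + 2f_i^B(1-\theta), 2f_j^A\theta + 2f_j^B(1-\theta)\} \\
&= C(2f_i^A\theta, 2f_j^A\theta) + C\{2f_i^A\theta, 2f_j^B(1-\theta)\} + C\{2f_i^B(1-\theta), 2f_j^A\theta\} + C\{2f_i^B(1-\theta), 2f_j^B(1-\theta)\} \\
&= 4V(\theta)(f_i^A - f_i^B)(f_j^A - f_j^B)
\end{aligned}
$$

$E\{C(x_i, x_j \mid \theta)\} = 0$ because we assume that the loci are unlinked and therefore $x_i$ and $x_j$ are conditionally independent. Putting this all together, we get the genetic variance in admixed populations as presented in the main text:

$$
\begin{aligned}
V_g &= \sum_{i=1}^{m} \beta_i^2 V(x_i) + \sum_{i=1}^{m}\sum_{j \neq i} \beta_i \beta_j C(x_i, x_j) \\
&= 2E(\theta)\sum_{i=1}^{m} \beta_i^2 f_i^A(1 - f_i^A) + 2\{1 - E(\theta)\}\sum_{i=1}^{m} \beta_i^2 f_i^B(1 - f_i^B) + 2E(\theta)\{1 - E(\theta)\}\sum_{i=1}^{m} \beta_i^2 (f_i^A - f_i^B)^2 \\
&\quad + 2V(\theta)\sum_{i=1}^{m} \beta_i^2 (f_i^A - f_i^B)^2 + 4V(\theta)\sum_{i=1}^{m}\sum_{j \neq i} \beta_i \beta_j (f_i^A - f_i^B)(f_j^A - f_j^B)
\end{aligned}
$$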

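With two 'unadmixed' source populations with equal numbers of individuals, $E(\theta) = 0.5$ and $V(\theta) = E(\theta)\{1 - E(\theta)\} = 0.25$, and $V_g$ reduces to:

$$
V_g = \sum_{i=1}^{m} \beta_i^2 \{f_i^A(1 - f_i^A) + f_i^B(1 - f_i^B)\} + \sum_{i=1}^{m} \beta_i^2 (f_i^A - f_i^B)^2 + \sum_{i=1}^{m}\sum_{j \neq i} \beta_i \beta_j (f_i^A - f_i^B)(f_j^A - f_j^B)
$$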

The effect of genotype scaling on $\sigma^2_u$

In the main text we showed that when we scale genotypes by $\sqrt{2f_i(1-f_i)}$ in calculating the GRM, we get biased GREML estimates even if the effects are uncorrelated. We can fix this issue by scaling with the square root of the sample variance, $\sqrt{V(x_i)}$ (Fig. S3). We provide an explanation of this behavior using the Haseman-Elston (HE) regression estimator, which is asymptotically equivalent to the GREML estimator if effects are uncorrelated [61] but which, unlike GREML, has a closed-form solution. We assume that the random effects corresponding to the unscaled genotypes are $\beta_i \sim \mathcal{N}\left(0, \frac{\sigma^2_u}{2mf_i(1-f_i)}\right)$. With uncorrelated effects, the genetic variance is

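$$
V_g = V\left(\sum_{i=1}^{m} \beta_i x_i\right) = \sum_{i=1}^{m} \beta_i^2 V(x_i)
$$

In expectation,

$$
\begin{aligned}
V(x_i) &= 2E(\theta) f_i^A(1 - f_i^A) + 2\{1 - E(\theta)\} f_i^B(1 - f_i^B) + 2E(\theta)\{1 - E(\theta)\}(f_i^A - f_i^B)^2 + 2V(\theta)(f_i^A - f_i^B)^2 \\
&= 2f_i(1 - f_i) + 2V(\theta)(f_i^A - f_i^B)^2
\end{aligned}
$$

And

$$
\begin{aligned}
V_g &= \sum_{i=1}^{m} \beta_i^2 \{2f_i(1 - f_i) + 2V(\theta)(f_i^A - f_i^B)^2\} \\
&= \sum_{i=1}^{m} \frac{\sigma^2_u}{2mf_i(1 - f_i)} \{2f_i(1 - f_i) + 2V(\theta)(f_i^A - f_i^B)^2\} \\
&= \frac{\sigma^2_u}{m} \sum_{i=1}^{m} \left\{1 + \frac{V(\theta)(f_i^A - f_i^B)^2}{f_i(1 - f_i)}\right\} \\
&= \sigma^2_u + \underbrace{\frac{V(\theta)\sigma^2_u}{m} \sum_{i=1}^{m} \frac{(f_i^A - f_i^B)^2}{f_i(1 - f_i)}}_{\text{contribution of population structure to the genic variance}}
\end{aligned}
$$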

Thus, the genetic variance can be decomposed into two terms, one that depends on the degree of population structure and one that does not.

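The HE estimator of $V_g$ is based on the regression of products of (centered) phenotypes $y_k y_l$ for all pairs of individuals $k \neq l$ on the corresponding entries of the GRM ($\psi$), where $\psi_{kl} = \frac{1}{m}\sum_{i=1}^{m} z_{ik} z_{il}$ and $z_{ik}$ is the centered and scaled genotype of individual $k$ for locus $i$:

$$
\hat{V}_g = \frac{C(y_k y_l, \psi_{kl})}{V(\psi_{kl})} = \frac{E(y_k y_l \psi_{kl}) - E(y_k y_l)E(\psi_{kl})}{E(\psi_{kl}^2) - E(\psi_{kl})^2} = \frac{E(y_k y_l \psi_{kl})}{E(\psi_{kl}^2)}
$$

With the standard scaling, $z_{ik} = \frac{x_{ik} - 2f_i}{\sqrt{2f_i(1 - f_i)}}$ and the corresponding effects are $\alpha_i \sim \mathcal{N}\left(0, \frac{\sigma^2_u}{m}\right)$. In this case, the HE estimator is:

$$
\begin{aligned}
\hat{V}_g &= \frac{E(y_k y_l \psi_{kl})}{E(\psi_{kl}^2)} = \frac{E\left\{\left(\sum_{i=1}^{m} \alpha_i z_{ik}\right)\left(\sum_{i=1}^{m} \alpha_i z_{il}\right)\psi_{kl}\right\}}{E(\psi_{kl}^2)} = \frac{E\left(\sum_{i=1}^{m} \alpha_i^2 z_{ik} z_{il} \psi_{kl}\right)}{E(\psi_{kl}^2)} \\
&= \frac{E(\alpha_i^2) E\left(\sum_{i=1}^{m} z_{ik} z_{il} \psi_{kl}\right)}{E(\psi_{kl}^2)} = \frac{\sigma^2_u}{m} \frac{E(m\psi_{kl}^2)}{E(\psi_{kl}^2)} = \sigma^2_u \frac{E(\psi_{kl}^2)}{E(\psi_{kl}^2)} = \sigma^2_u
\end{aligned}
$$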

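This is not equal to $V_g$ because it does not capture the contribution of population structure. Next, we consider the case where the genotypes are standardized instead by the sample variance, i.e., $z_{ik} = \frac{x_{ik} - 2f_i}{\sqrt{V(x_i)}}$ such that $V(z_i) = 1$. We can derive $E(\alpha_i^2)$ corresponding to this scaling by noting that the genetic variance remains the same:

$$
\begin{aligned}
\sum_{i=1}^{m} \beta_i^2 V(x_i) &= \sum_{i=1}^{m} \alpha_i^2 V(z_i) \\
mE(\alpha^2) &= \sigma^2_u + \frac{V(\theta)\sigma^2_u}{m} \sum_{i=1}^{m} \frac{(f_i^A - f_i^B)^2}{f_i(1 - f_i)} \\
E(\alpha^2) &= \frac{\sigma^2_u}{m} + \frac{V(\theta)\sigma^2_u}{m^2} \sum_{i=1}^{m} \frac{(f_i^A - f_i^B)^2}{f_i(1 - f_i)}
\end{aligned}
$$

Note that because the genotypes are scaled properly, $E(\psi_{kl}) = 0$ and $V(\psi_{kl}) = 1$. Then, the HE estimator becomes:

$$
\begin{aligned}
\hat{V}_g &= \frac{E(y_k y_l \psi_{kl})}{E(\psi_{kl}^2)} = \frac{E(\alpha_i^2) E\left(\sum_{i=1}^{m} z_{ik} z_{il} \psi_{kl}\right)}{E(\psi_{kl}^2)} \\
&= \left\{\frac{\sigma^2_u}{m} + \frac{V(\theta)\sigma^2_u}{m^2} \sum_{i=1}^{m} \frac{(f_i^A - f_i^B)^2}{f_i(1 - f_i)}\right\} \frac{mE(\psi_{kl}^2)}{E(\psi_{kl}^2)} \\
&= \sigma^2_u + \frac{V(\theta)\sigma^2_u}{m} \sum_{i=1}^{m} \frac{(f_i^A - f_i^B)^2}{f_i(1 - f_i)}
\end{aligned}
$$

This provides an unbiased estimate of the genic variance.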
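For concreteness, the through-the-origin form of the HE estimator used in this section, $E(y_k y_l \psi_{kl}) / E(\psi_{kl}^2)$, can be computed from a phenotype vector and a GRM as sketched below. The function name is hypothetical and the sketch assumes $E(\psi_{kl}) = 0$, so no intercept is fitted; it is meant to make the estimator explicit, not to replace GCTA.

```python
def he_regression(phenotypes, grm):
    """Haseman-Elston estimate of V_g from off-diagonal GRM entries.

    Regresses products of centered phenotypes y_k * y_l on psi_kl through
    the origin: sum(y_k y_l psi_kl) / sum(psi_kl^2) over pairs k < l.
    """
    n = len(phenotypes)
    mean = sum(phenotypes) / n
    y = [v - mean for v in phenotypes]
    num = den = 0.0
    for k in range(n):
        for l in range(k + 1, n):
            num += y[k] * y[l] * grm[k][l]
            den += grm[k][l] ** 2
    return num / den

# Toy check: if psi_kl = 0.5 * y_k * y_l exactly, the slope must be 2.
y_toy = [1.0, -2.0, 1.0]
psi_toy = [[0.0 if k == l else 0.5 * y_toy[k] * y_toy[l] for l in range(3)]
           for k in range(3)]
slope = he_regression(y_toy, psi_toy)
```

Only the off-diagonal entries enter the estimate, which is why the choice of genotype scaling in the GRM changes what the slope captures.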

Effect size of local ancestry

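We define local ancestry $\gamma_i \in \{0, 1, 2\}$ as the number of alleles at locus $i$ that trace their ancestry to population A. Thus, the local ancestry at locus $i$ in individual $k$ is a Binomial random variable with $E(\gamma_{i,k}) = 2\theta_k$. We define the ancestry value of an individual as the weighted sum of their local ancestry, $\sum_{i=1}^{m} \phi_i \gamma_i$, where $\phi_i = \beta_i(f_i^A - f_i^B)$.

To show this, note that $\phi = E(y \mid \gamma = 1) - E(y \mid \gamma = 0)$, where $E(y \mid \gamma = 1) = \int_{-\infty}^{\infty} y\, h(y \mid \gamma = 1)\, dy$ and $h$ is a density function. Our goal is to express $\phi$ in terms of $\beta$, which is equal to $E(y \mid x = 1) - E(y \mid x = 0)$. Furthermore, $E(y \mid x = 1) = \int_{-\infty}^{\infty} y\, h(y \mid x = 1)\, dy$. We can express $h(y \mid \gamma)$ in terms of $h(y \mid x)$ as follows:

$$
\begin{aligned}
h(y \mid \gamma = 1) &= h(y \mid x = 0)P(x = 0 \mid \gamma = 1) + h(y \mid x = 1)P(x = 1 \mid \gamma = 1) + h(y \mid x = 2)P(x = 2 \mid \gamma = 1) \\
&= h(y \mid x = 0)(1 - f^A)(1 - f^B) + h(y \mid x = 1)\{f^A(1 - f^B) + f^B(1 - f^A)\} + h(y \mid x = 2) f^A f^B
\end{aligned}
$$

$$
\begin{aligned}
E(y \mid \gamma = 1) &= \int_{-\infty}^{\infty} y\, h(y \mid \gamma = 1)\, dy \\
&= (1 - f^A)(1 - f^B)\int_{-\infty}^{\infty} y\, h(y \mid x = 0)\, dy + \{f^A(1 - f^B) + f^B(1 - f^A)\}\int_{-\infty}^{\infty} y\, h(y \mid x = 1)\, dy + f^A f^B \int_{-\infty}^{\infty} y\, h(y \mid x = 2)\, dy \\
&= (1 - f^A)(1 - f^B)E(y \mid x = 0) + \{f^A(1 - f^B) + f^B(1 - f^A)\}E(y \mid x = 1) + f^A f^B E(y \mid x = 2) \\
&= 0 + \{f^A(1 - f^B) + f^B(1 - f^A)\}\beta + 2 f^A f^B \beta \\
&= \beta f^A + \beta f^B
\end{aligned}
$$

Similarly, $E(y \mid \gamma = 0) = 2\beta f^B$ and $\phi = E(y \mid \gamma = 1) - E(y \mid \gamma = 0) = \beta(f^A - f^B)$.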

Genetic variance due to local ancestry

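$$
V_\gamma = V\left(\sum_{i=1}^{m} \phi_i \gamma_i\right) = \sum_{i=1}^{m} \phi_i^2 V(\gamma_i) + \sum_{i=1}^{m}\sum_{j \neq i} \phi_i \phi_j C(\gamma_i, \gamma_j) \qquad (1)
$$

We use the law of total variance and covariance to derive $V(\gamma_i)$ and $C(\gamma_i, \gamma_j)$:

$$
\begin{aligned}
V(\gamma_i) &= E\{V(\gamma_i \mid \theta)\} + V\{E(\gamma_i \mid \theta)\} \\
&= E\{2\theta(1 - \theta)\} + V(2\theta) \\
&= 2E(\theta) - 2E(\theta^2) + 4V(\theta) \\
&= 2E(\theta) - 2V(\theta) - 2E(\theta)^2 + 4V(\theta) \\
&= 2E(\theta)\{1 - E(\theta)\} + 2V(\theta)
\end{aligned}
$$

$$
C(\gamma_i, \gamma_j) = E\{C(\gamma_i, \gamma_j \mid \theta)\} + C\{E(\gamma_i \mid \theta), E(\gamma_j \mid \theta)\} = 0 + C(2\theta, 2\theta) = 4V(\theta)
$$

$$
\begin{aligned}
V_\gamma &= 2E(\theta)\{1 - E(\theta)\}\sum_{i=1}^{m} \phi_i^2 + 2V(\theta)\sum_{i=1}^{m} \phi_i^2 + 4V(\theta)\sum_{i=1}^{m}\sum_{j \neq i} \phi_i \phi_j \\
&= 2E(\theta)\{1 - E(\theta)\}\sum_{i=1}^{m} \beta_i^2 (f_i^B - f_i^A)^2 + 2V(\theta)\sum_{i=1}^{m} \beta_i^2 (f_i^B - f_i^A)^2 + 4V(\theta)\sum_{i=1}^{m}\sum_{j \neq i} \beta_i \beta_j (f_i^B - f_i^A)(f_j^B - f_j^A)
\end{aligned}
$$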
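The final expression for $V_\gamma$ translates directly into code. The sketch below is illustrative only: the function name and toy inputs are hypothetical, and because every term involves a product of two $\phi$ values, the result does not depend on whether $\phi_i$ is written with $(f_i^A - f_i^B)$ or $(f_i^B - f_i^A)$.

```python
def local_ancestry_variance(betas, f_a, f_b, mean_theta, var_theta):
    """V_gamma for unlinked loci, with phi_i = beta_i * (f_a_i - f_b_i)."""
    phi = [b * (fa - fb) for b, fa, fb in zip(betas, f_a, f_b)]
    sum_sq = sum(p * p for p in phi)
    cross = sum(phi) ** 2 - sum_sq      # sum over i != j of phi_i * phi_j
    return (2 * mean_theta * (1 - mean_theta) * sum_sq
            + 2 * var_theta * sum_sq
            + 4 * var_theta * cross)

# Toy example: equal mixture of two unadmixed populations, so gamma = 2 * theta
# and V_gamma should equal 4 * V(theta) * (sum of phi)^2.
v_gamma = local_ancestry_variance(
    [1.0, -1.0], [0.8, 0.3], [0.2, 0.6], mean_theta=0.5, var_theta=0.25)
```

In the toy example every individual has $\gamma_i = 2\theta$ at all loci, which provides an independent check on the three-term formula.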

Footnotes

Code availability

We carried out all analyses in R version 4.2.3 [59], PLINK v1.90b6.21 and PLINK 2.0 [57, 60], and GCTA version 1.94.1 [5]. All code is freely available on https://github.com/jinguohuang/admix_heritability.git.

References

  • 1. Lynch M. & Walsh B. Genetics and analysis of quantitative traits 1–980 (Sinauer Associates, Inc, 1998).
  • 2. Visscher P. M., Hill W. G., et al. Heritability in the genomics era–concepts and misconceptions. Nature Reviews Genetics 9, 255–66 (4 2008).
  • 3. Yang J., Benyamin B., et al. Common SNPs explain a large proportion of the heritability for human height. Nature Genetics 42, 565–569 (7 2010).
  • 4. Yang J., Manolio T. A., et al. Genome partitioning of genetic variation for complex traits using common SNPs. Nature Genetics 43, 519–525 (6 2011).
  • 5. Yang J., Lee S. H., et al. GCTA: a tool for genome-wide complex trait analysis. American Journal of Human Genetics 88, 76–82 (1 2011).
  • 6. Wainschtein P., Jain D., et al. Assessing the contribution of rare variants to complex trait heritability from whole-genome sequence data. Nature Genetics 54, 263–273 (3 2022).
  • 7. Browning S. R. & Browning B. L. Population structure can inflate SNP-based heritability estimates. American Journal of Human Genetics 89, 191–193 (1 2011).
  • 8. Goddard M. E., Lee S. H., et al. Response to Browning and Browning. American Journal of Human Genetics 89, 193–195 (1 2011).
  • 9. Kumar S. K., Feldman M. W., et al. Limitations of GCTA as a solution to the missing heritability problem. Proceedings of the National Academy of Sciences of the United States of America 113, E61–E70 (1 2016).
  • 10. Yang J., Lee S. H., et al. GCTA-GREML accounts for linkage disequilibrium when estimating genetic variance from genome-wide SNPs. Proceedings of the National Academy of Sciences 113, E4579–E4580 (32 2016).
  • 11. Lin Z., Seal S., et al. Estimating SNP heritability in presence of population substructure in biobank-scale datasets. Genetics 220 (4 2022).
  • 12. Border R., O’Rourke S., et al. Assortative mating biases marker-based heritability estimators. Nature Communications 13, 1–10 (1 2022).
  • 13. Visscher P. M., Yang J., et al. A Commentary on ‘Common SNPs Explain a Large Proportion of the Heritability for Human Height’ by Yang et al. (2010). Twin Research and Human Genetics 13, 517–524 (6 2010).
  • 14. De los Campos G., Sorensen D., et al. Genomic Heritability: What Is It? PLOS Genetics 11, e1005048 (5 2015).
  • 15. Rawlik K., Canela-Xandri O., et al. SNP heritability: What are we estimating? bioRxiv, 2020.09.15.276121 (2020).
  • 16. Wojcik G. L., Graff M., et al. Genetic analyses of diverse populations improves discovery for complex traits. Nature 570, 514–518 (7762 2019).
  • 17. Ben-Eghan C., Sun R., et al. Don’t ignore genetic data from minority populations. Nature 585, 184–186 (7824 2020).
  • 18. Verma A., Damrauer S. M., et al. The Penn Medicine BioBank: Towards a Genomics-Enabled Learning Healthcare System to Accelerate Precision Medicine in a Diverse Population. Journal of Personalized Medicine 12, 1974 (12 2022).
  • 19. Fatumo S., Chikowore T., et al. A roadmap to increase diversity in genomic studies. Nature Medicine 28, 243–250 (2 2022).
  • 20. The All of Us Research Program Investigators. The “All of Us” Research Program. New England Journal of Medicine 381, 668–676 (7 2019).
  • 21. Sohail M., Chong A. Y., et al. Nationwide genomic biobank in Mexico unravels demographic history and complex trait architecture from 6,057 individuals. bioRxiv, 2022.07.11.499652 (2022).
  • 22. Johnson R., Ding Y., et al. The UCLA ATLAS Community Health Initiative: Promoting precision health research in a diverse biobank. Cell Genomics 3, 100243 (1 2023).
  • 23. Pfaff C. L., Parra E. J., et al. Population structure in admixed populations: effect of admixture dynamics on the pattern of linkage disequilibrium. American Journal of Human Genetics 68, 198–207 (1 2001).
  • 24. Zaitlen N., Huntsman S., et al. The Effects of Migration and Assortative Mating on Admixture Linkage Disequilibrium. Genetics 205, 375–383 (1 2017).
  • 25. Verdu P. & Rosenberg N. A. A General Mechanistic Model for Admixture Histories of Hybrid Populations. Genetics 189, 1413–1426 (4 2011).
  • 26. Bulmer M. G. The Effect of Selection on Genetic Variability. The American Naturalist 105, 201–211 (943 1971).
  • 27. Yair S. & Coop G. Population differentiation of polygenic score predictions under stabilizing selection. Philosophical Transactions of the Royal Society B 377 (1852 2022).
  • 28. Balding D. J. & Nichols R. A. A method for quantifying differentiation between populations at multi-allelic loci and its implications for investigating identity and paternity. Genetica 96, 3–12 (1–2 1995).
  • 29. Zaitlen N., Pasaniuc B., et al. Leveraging population admixture to characterize the heritability of complex traits. Nature Genetics 46, 1356–1362 (12 2014).
  • 30. Sollis E., Mosaku A., et al. The NHGRI-EBI GWAS Catalog: knowledgebase and deposition resource. Nucleic Acids Research 51, D977–D985 (D1 2023).
  • 31. Auton A., Abecasis G. R., et al. A global reference for human genetic variation. Nature 526, 68–74 (7571 2015).
  • 32. Jablonski N. G. The Evolution of Human Skin and Skin Color. Annual Review of Anthropology 33, 585–623 (1 2004).
  • 33. Lamason R. L., Mohideen M. A. P., et al. SLC24A5, a putative cation exchanger, affects pigmentation in zebrafish and humans. Science 310, 1782–1786 (5755 2005).
  • 34. Beleza S., Johnson N. A., et al. Genetic architecture of skin and eye color in an African-European admixed population. PLoS Genetics 9 (ed Spritz R. A.) e1003372 (3 2013).
  • 35. Zaidi A. A., Mattern B. C., et al. Investigating the case of human nose shape and climate adaptation. PLoS Genetics 13, 2017 (3 2017).
  • 36. Nalls M. A., Wilson J. G., et al. Admixture Mapping of White Cell Count: Genetic Locus Responsible for Lower White Blood Cell Count in the Health ABC and Jackson Heart Studies. The American Journal of Human Genetics 82, 81–87 (1 2008).
  • 37. Reich D., Nalls M. A., et al. Reduced Neutrophil Count in People of African Descent Is Due To a Regulatory Variant in the Duffy Antigen Receptor for Chemokines Gene. PLoS Genetics 5 (ed Visscher P. M.) e1000360 (1 2009).
  • 38. McManus K. F., Taravella A. M., et al. Population genetic analysis of the DARC locus (Duffy) reveals adaptation from standing variation associated with malaria resistance in humans. PLOS Genetics 13, e1006560 (3 2017).
  • 39. Kumar S. K., Feldman M. W., et al. Reply to Yang et al.: GCTA produces unreliable heritability estimates. Proceedings of the National Academy of Sciences 113, E4581–E4581 (32 2016).
  • 40. Chan T. F., Rui X., et al. Estimating heritability explained by local ancestry and evaluating stratification bias in admixture mapping from summary statistics. bioRxiv, 2023.04.10.536252 (2023).
  • 41. Schoech A. P., Jordan D. M., et al. Quantification of frequency-dependent genetic architectures in 25 UK Biobank traits reveals action of negative selection. Nature Communications 10, 790 (1 2019).
  • 42. Backman J. D., Li A. H., et al. Exome sequencing and analysis of 454,787 UK Biobank participants. Nature 599, 628–634 (7886 2021).
  • 43. Min A., Thompson E., et al. Comparing heritability estimators under alternative structures of linkage disequilibrium. G3 Genes|Genomes|Genetics 12 (8 2022).
  • 44. Schwartzman A., Schork A. J., et al. A simple, consistent estimator of SNP heritability from genome-wide association studies. The Annals of Applied Statistics 13, 2509–2538 (4 2019).
  • 45. Hou K., Ding Y., et al. Causal effects on complex traits are similar for common variants across segments of different continental ancestries within admixed individuals. Nature Genetics 55, 549–558 (4 2023).
  • 46. Lin M., Park D. S., et al. Admixed Populations Improve Power for Variant Discovery and Portability in Genome-Wide Association Studies. Frontiers in Genetics 12, 673167 (2021).
  • 47. Ju D. & Mathieson I. The evolution of skin pigmentation-associated variation in West Eurasia. Proceedings of the National Academy of Sciences of the United States of America 118, e2009227118 (1 2020).
  • 48. Lona-Durazo F., Hernandez-Pacheco N., et al. Meta-analysis of GWA studies provides new insights on the genetic architecture of skin pigmentation in recently admixed populations. BMC Genetics 20, 1–16 (1 2019).
  • 49. Yengo L., Vedantam S., et al. A saturated map of common genetic variants associated with human height. Nature 610, 704–712 (7933 2022).
  • 50. Hoffmann T. J., Choquet H., et al. A large multiethnic genome-wide association study of adult body mass index identifies novel loci. Genetics 210, 499–515 (2 2018).
  • 51. Pulit S. L., Stoneman C., et al. Meta-Analysis of genome-wide association studies for body fat distribution in 694 649 individuals of European ancestry. Human Molecular Genetics 28, 166–174 (1 2019).
  • 52. Warrington N. M., Beaumont R. N., et al. Maternal and fetal genetic effects on birth weight and their relevance to cardio-metabolic risk factors. Nature Genetics 51, 804–814 (5 2019).
  • 53. Surendran P., Feofanova E. V., et al. Discovery of rare variants associated with blood pressure regulation through meta-analysis of 1.3 million individuals. Nature Genetics 52, 1314–1332 (12 2020).
  • 54. Graham S. E., Clarke S. L., et al. The power of genetic diversity in genome-wide association studies of lipids. Nature 600, 675–679 (7890 2021).
  • 55. Chen M. H., Raffield L. M., et al. Trans-ethnic and Ancestry-Specific Blood-Cell Genetics in 746,667 Individuals from 5 Global Populations. Cell 182, 1198–1213.e14 (5 2020).
  • 56. Bycroft C., Freeman C., et al. The UK Biobank resource with deep phenotyping and genomic data. Nature 562, 203–209 (7726 2018).
  • 57. Chang C. C., Chow C. C., et al. Second-generation PLINK: rising to the challenge of larger and richer datasets. GigaScience 4, s13742–015–0047–8 (1 2015).
  • 58. Alexander D. H., Novembre J., et al. Fast model-based estimation of ancestry in unrelated individuals. Genome Research 19, 1655–64 (9 2009).
  • 59. R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing (Vienna, Austria, 2023).
  • 60. Purcell S., Neale B., et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. American Journal of Human Genetics 81, 559–75 (3 2007).
  • 61. Chen G. B. Estimating heritability of complex traits from genome-wide association studies using IBS-based Haseman-Elston regression. Frontiers in Genetics 5, 72296 (APR 2014).


Articles from bioRxiv are provided here courtesy of Cold Spring Harbor Laboratory Preprints
