Abstract
Polygenic scores link the genotypes of ancient individuals to their phenotypes, which are often unobservable, offering a tantalizing opportunity to reconstruct complex trait evolution. In practice, however, interpretation of ancient polygenic scores is subject to numerous assumptions. For one, the genome-wide association (GWA) studies from which polygenic scores are derived can only estimate effect sizes for loci segregating in contemporary populations. Therefore, a GWA study may not correctly identify all loci relevant to trait variation in the ancient population. In addition, the frequencies of trait-associated loci may have changed in the intervening years. Here, we devise a theoretical framework to quantify the effect of this allelic turnover on the statistical properties of polygenic scores as functions of population genetic dynamics, trait architecture, power to detect significant loci, and the age of the ancient sample. We model the allele frequencies of loci underlying trait variation using the Wright-Fisher diffusion, and employ the spectral representation of its transition density to find analytical expressions for several error metrics, including the expected sample correlation between the polygenic scores of ancient individuals and their true phenotypes, referred to as polygenic score accuracy. Our theory also applies to a two-population scenario and demonstrates that allelic turnover alone may explain a substantial percentage of the reduced accuracy observed in cross-population predictions, akin to those performed in human genetics. Finally, we use simulations to explore the effects of recent directional selection, a bias-inducing process, on the statistics of interest. We find that even in the presence of bias, weak selection induces minimal deviations from our neutral expectations for the decay of polygenic score accuracy.
By quantifying the limitations of polygenic scores in an explicit evolutionary context, our work lays the foundation for the development of more sophisticated statistical procedures to analyze both temporally and geographically resolved polygenic scores.
Author summary
The genomes of ancient organisms document, albeit imperfectly, the migrations, admixture events, and displacements that may have occurred in a given species’ history. Researchers also use these ancient genomes to learn whether genetic changes underlie the evolution of polygenic traits, like height and disease susceptibility, which are affected by many genetic variants of small effects. Such analyses rely on models, often garnered from large-scale genetic association studies, that predict the phenotypes of ancient individuals from their genotypes. Yet, theoretical and empirical research suggests that prediction accuracy depends on the relationship between the sample in which the model was built and the sample in which it is applied. In this vein, we quantify the effects of one fundamental limitation on prediction accuracy: the fact that allele frequencies may differ across time and geographic space. As a consequence, a given prediction model may not capture all of the genetic variation relevant to phenotypic variation in the focal, ancient sample.
Introduction
Decay in linkage disequilibrium (LD) between tagging and causal sites, population stratification, variation in allele frequencies within and across populations, and environmental heterogeneity, among other factors, are all thought to negatively impact the prediction accuracy of polygenic scores (see e.g., [1–7], and more recently in humans, e.g., [8–13]). Many of these issues likely influence both within- and out-of-sample predictions, where out-of-sample may refer to an individual sampled from a distinct time or location relative to that of the GWA study. While empirical [12, 14] and simulation [1, 13, 15] or combined [16] studies have explored particular population genetic scenarios or experimental contexts, we still do not know the extent to which each of these factors compromises prediction accuracy in general.
In this work, we address an issue pertinent to out-of-sample prediction: that causal loci may have different allele frequencies in the GWA study and focal populations. Variants common in the GWA study may be rare in the focal population, and vice versa. We refer to this phenomenon as allelic turnover. Allelic turnover implies that effect estimates ported across space and time, or both, may not reflect all of the genetic variation relevant to phenotypic variation in an ancient or geographically distinct population. Allelic turnover further suggests that the statistical properties of ancient polygenic scores depend on when an ancient individual was sampled—a feature not currently accounted for in ancient DNA analyses. Similarly, statistical properties of geographically disparate polygenic scores depend on the divergence time between the GWA study and focal populations. An understanding of allelic turnover in these contexts may ultimately improve statistical analyses of temporally (e.g., [17–20]) and geographically resolved polygenic scores (e.g., [9, 10]), analyses which are increasingly commonplace.
We aim to quantify the effect of allelic turnover on the polygenic scores of such out-of-sample individuals when they are computed using effect estimates from a contemporary population. We expect that increases in ancient sampling time or divergence time will be associated with declines in polygenic score accuracy due exclusively to allelic turnover. The question is, by how much does accuracy decline? And can allelic turnover alone explain the reduced accuracy of out-of-sample predictions observed in numerous human (e.g., [15, 16]), animal (e.g., [1, 2, 4]) and plant (e.g., [21, 22]) experiments and simulation studies? The answer is likely to depend on the particular population genetic, trait, and GWA study features of the system under study [3]. We attempt to capture some important aspects of this diversity in our modeling framework.
Here, we consider a standard implementation of the polygenic score, which attributes non-zero effects to a particular set of loci. An individual’s polygenic score is a weighted sum of its genotypes, where the weights are the estimated allelic effects. The loci in this set and their estimated effects are usually identified in large-scale GWA studies, often performed in regional biobanks with sample sizes in the tens to hundreds of thousands of individuals (e.g., the UK Biobank [23], BioBank Japan [24]). Frequently, the set includes loci which are approximately independent and surpass some allele frequency and p-value thresholds. Though there are numerous ways to define a polygenic score (e.g., [25, 26] and see Section 4 in S1 Text), the “prune and threshold” method is commonly used and proves analytically tractable in our framework.
Previous quantitative genetic approaches, such as [27] and [16], largely ignore the underlying population genetic dynamics. For example, Wang et al. [16] estimate the reduction in polygenic score accuracy in a focal population relative to the GWA study population as a function of the fixed population-specific trait heritabilities, allele frequencies, and LD patterns, and the estimated per-locus effects. In contrast, we embed the ancient polygenic score in an explicit population genetic framework, allowing us to take into account changes in allele frequency as well as the statistical constraint imposed by a finite GWA study sample size. And, distinct from previous approaches to the evolutionary modeling of polygenic scores [28], we track the frequencies of all loci that potentially contribute to a trait, not just the loci included in the polygenic score.
Henceforth, we frame our study in terms of ancient polygenic scores. However, we formally demonstrate that our theoretical results apply to out-of-space polygenic scores, where the population divergence time multiplied by two is analogous to the ancient sampling time (see Fig 1 and Section 1 in S1 Text). The latter scenario can represent an ancient individual sampled from a population not directly ancestral to that of the GWA study as the two populations must have diverged at some point in the past. This scenario, to a first approximation, describes the population displacement events thought to be ubiquitous in the history of humans (e.g., [29]). However, human history is additionally characterized by numerous admixture events and population size changes (e.g., [29]) which are not yet captured within our modeling framework.
We use several statistics to characterize ancient polygenic score error in distinct population genetic and GWA study scenarios. Each statistic is indexed by the ancient sampling time τ: the bias, bias(τ); the mean-squared error, mse(τ); the estimated additive genetic variance; and the polygenic score accuracy, ρ²(τ), which approximates the expectation of the squared sample correlation coefficient between the polygenic scores and phenotypes of an ancient sample. In addition, we can readily express these statistics as functions of the genetic divergence between the ancient and GWA study populations, as measured by the fixation index, FST (Section 11 in S1 Text). We first derive general forms for these statistics that are agnostic to almost all of our modeling assumptions and which provide conceptual insights into the effects of allelic turnover. Next, we derive explicit, parameter-dependent expressions for each statistic when the trait is neutrally evolving in a population of constant size subject to recurrent mutation, which for small mutation rates approximates the infinite sites model. We take advantage of the spectral representation of the transition density function of the Wright-Fisher diffusion (tdf) to execute these computations [30–33]. We then find interpretable linear approximations for the initial rate of increase (or decrease) of the metrics under study. These approximations apply for the small ancient sampling times typical of ancient human remains (e.g., see [18]).
Consistent with our expectations, mse(τ) increases and the estimated additive genetic variance decreases with increasing sampling age τ. Although mse(τ) and the estimated additive genetic variance measure distinct quantities, and indeed have different functional forms, our linear approximations reveal that, under our assumptions, both statistics initially change at approximately the same rate. This rate is proportional to the product of the mutation rate and the power to detect trait-associated loci in the GWA study, which, in turn, is influenced by the study size, the magnitude of the true per-locus effect, and the underlying distribution of the allele frequencies of causal loci.
Moreover, we show that polygenic score accuracy ρ²(τ) is proportional to the estimated additive genetic variance, which, as stated, is sensitive to the GWA study and evolutionary parameters. Unlike the estimated additive genetic variance, ρ²(τ) depends on the trait heritability h², with larger values of h² increasing its rate of decay. In contrast, for small mutation rates, relative accuracy, defined as the ratio of ρ²(τ) to the accuracy measured in a present-day sample, ρ²(0), is insensitive to h², the true per-locus effect size, and the GWA study parameters, as long as the GWA study size n exceeds some minimum threshold. We show that this result likely holds for an arbitrary distribution of effects. Importantly, accuracy and relative accuracy decay considerably over the short time spans characteristic of ancient human samples and geographically distinct human populations.
With equal probability of detecting positive versus negative effect alleles, and under neutrality, the bias of the polygenic score is zero for all ancient sampling times. In practice, both of these conditions are likely violated. For example, detection imbalances have been observed in case-control GWA studies [34], and many polygenic traits are likely under some form of selection [35, 36]. While unequal thresholds do not precisely capture the phenomena described in [34], they do yield a non-zero bias(τ) within our framework. The magnitude of this bias is small, implying that other perturbations would be necessary to explain an observed, appreciable bias. To relax the neutrality assumption, we simulate recent directional selection. We find that when the selection coefficient is large enough (4Ns ≥ 1), selection indeed yields biased polygenic scores. Though this selection-induced bias is several orders of magnitude larger than that induced by asymmetry in the detection thresholds, it is still small relative to the variance explained by segregating genetic variants. Additionally, weak selection only induces small deviations from neutral theoretical expectations for the other statistics, suggesting that our neutral theory may still accurately capture accuracy declines in the presence of weak directional selection. Altogether, our theoretical results suggest that allelic turnover may make large contributions to out-of-sample reductions in accuracy, even under neutrality.
Model and metrics
Our modeling framework readily encompasses two demographic scenarios. In the first, the focal individual is sampled from the same population in which the GWA study was performed, but at a previous point in time τ (Fig 1A). We specify τ in coalescent time units: an ancient sampling time of τ corresponds to 2N ⋅ τ generations in the past, where 2N is the number of chromosomes in the diploid population. When τ = 0, the focal individual is an independent sample from the GWA study population. In the second scenario (Fig 1B), the focal individual is sampled at τ′ from a population that diverged from the GWA study population at τsplit (in coalescent time units) in the past. However, we show in Section 1 in S1 Text that scenario (A) is equivalent to scenario (B) if the ancient sampling time τ is equal to 2τsplit − τ′. Therefore, we proceed according to the first scenario, while emphasizing that our conclusions readily translate to the second.
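The time scaling and the mapping between the two scenarios can be sketched in a few lines. This is only an illustration: the function names and the example values (2N = 20,000 chromosomes, τsplit = 0.05, τ′ = 0.02) are hypothetical and do not come from the study.

```python
def coalescent_to_generations(tau, two_n):
    """Convert a time in coalescent units to generations (2N * tau)."""
    return two_n * tau


def equivalent_sampling_time(tau_split, tau_prime=0.0):
    """Sampling time in scenario (A) equivalent to scenario (B): 2*tau_split - tau_prime."""
    return 2.0 * tau_split - tau_prime


two_n = 20_000  # hypothetical number of chromosomes (N = 10,000 diploid individuals)

# A focal individual sampled at tau' = 0.02 from a population that split at 0.05.
tau = equivalent_sampling_time(tau_split=0.05, tau_prime=0.02)
generations = coalescent_to_generations(tau, two_n)
print(tau, generations)
```

With τ′ = 0 (a contemporary individual from the diverged population), the equivalent sampling time reduces to twice the divergence time.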
We summarize the full model in Fig 1C and detail its constituent parts in the following subsections. Briefly, the genotype of the ancient individual is sampled conditional on the population allele frequencies at τ. The ancient individual’s phenotype is then sampled conditional on its genotype. Population allele frequencies for all loci that potentially affect the trait evolve until present day, at which point the GWA study is conducted. In particular, the effect sizes included in the polygenic score model are estimated from the genotypes and phenotypes of n contemporary individuals. Finally, the ancient polygenic score is computed from the ancient individual’s genotype and the polygenic score model derived from the results of a contemporary GWA study.
Sampling the genotype of a time-indexed individual
We assume that each site is at most bi-allelic, with possible alleles A1 and A2. We denote the genotype of an individual sampled at some time t (in coalescent units) as Xiℓ(t), where i indexes the individual, and ℓ the locus. For the ancient individual(s), t = τ; for the participants in the GWA study, t = 0. For mathematical convenience, we use a symmetric genotype encoding, that is Xiℓ(t) ∈ {−1, 0, 1}, corresponding to genotypes A1 A1, A1 A2, and A2 A2, respectively. Conditional on the population allele frequency of allele A2 at t, Zℓ(t), the distribution of Xiℓ(t) is given by the Hardy-Weinberg sampling probabilities: P(Xiℓ(t) = −1 | Zℓ(t)) = (1 − Zℓ(t))², P(Xiℓ(t) = 0 | Zℓ(t)) = 2Zℓ(t)(1 − Zℓ(t)), and P(Xiℓ(t) = 1 | Zℓ(t)) = Zℓ(t)².
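A minimal sketch of this sampling step, assuming Hardy-Weinberg proportions and the symmetric encoding described above. The function names are hypothetical; drawing two alleles independently with probability z of being A2 is one convenient way to obtain the Hardy-Weinberg genotype distribution.

```python
import random


def genotype_probabilities(z):
    """Hardy-Weinberg probabilities of X = -1, 0, 1 given the A2 frequency z."""
    return {-1: (1.0 - z) ** 2, 0: 2.0 * z * (1.0 - z), 1: z ** 2}


def sample_genotype(z, rng=random):
    """Draw one genotype by sampling two alleles independently."""
    allele_1 = rng.random() < z  # True if the first allele is A2
    allele_2 = rng.random() < z  # True if the second allele is A2
    return int(allele_1) + int(allele_2) - 1  # A1A1 -> -1, A1A2 -> 0, A2A2 -> 1


probs = genotype_probabilities(0.3)
# Under this encoding the mean genotype is 2z - 1.
mean_genotype = sum(x * p for x, p in probs.items())
print(probs, mean_genotype, sample_genotype(0.3))
```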
Modeling the true phenotype
The genetic basis of a polygenic trait, Y, is determined by a set of L distinct genetic loci, each with a true per-locus additive effect βℓ (for ℓ = 1, 2, …, L). We further assume that the L loci contribute linearly to the trait, such that the true phenotype of the i-th individual sampled at t is specified by the commonly used additive genetic model [37],
$$Y_i(t) = C + \sum_{\ell=1}^{L} \beta_\ell X_{i\ell}(t) + \varepsilon_i, \tag{1}$$
where C is a constant; βℓ is the true additive effect of locus ℓ; and εi is a normally distributed random variable that incorporates variance in the phenotype due to the environment. The summation in Eq 1 is often referred to as an individual’s genetic value [25]. A locus ℓ contributes ±βℓ to the genetic value (and phenotype) of an individual who is homozygous at ℓ, and zero to that of a heterozygous individual. C is thus the phenotype of a hypothetical all-heterozygous individual. Without loss of generality, we set C = 0. In addition, we assume, without loss of generality, that all βℓ ≥ 0 such that locus ℓ contributes −βℓ to the genetic values of A1A1 individuals and +βℓ to the genetic values of A2A2 individuals.
A fixed locus, Zℓ(t) ∈ {0, 1}, will affect the mean phenotype of the population at t by ±βℓ but will not contribute to phenotypic variation. We illustrate this fact by conditioning on the allele frequencies of all L loci at t, Z(t) ∈ [0, 1]^L. Assuming linkage equilibrium between loci as well as independence between the environmental and genetic effects, we have,
$$\mathrm{Var}[Y_i(t) \mid Z(t)] = \sum_{\ell=1}^{L} 2\beta_\ell^2 Z_\ell(t)\,(1 - Z_\ell(t)) + V_E, \tag{2}$$
where VE denotes the variance in the phenotype due to the environment. The summation in Eq 2 is the additive genetic variance at t, VA(t). For a segregating site, the summand is proportional to Zℓ(t)(1 − Zℓ(t)), with 0 < Zℓ(t)(1 − Zℓ(t)) ≤ 1/4. For a fixed site, the summand is zero and the site does not contribute to the additive genetic variance VA(t). An important feature of our model is that some of the L loci may not exhibit genetic variation in the population at a given time. More concretely, the set of loci with non-zero estimated effects in the polygenic score may be only a small subset of the L loci underlying the trait. Thus, we assume that the full set of L loci is a superset of the loci included in the polygenic score.
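The per-locus contribution to the additive genetic variance can be checked numerically. The sketch below assumes that each summand in Eq 2 equals 2βℓ²Zℓ(t)(1 − Zℓ(t)), which is what the symmetric {−1, 0, 1} encoding gives under Hardy-Weinberg proportions; the function names and example values are hypothetical.

```python
def additive_genetic_variance(betas, freqs):
    """Sum over loci of 2 * beta^2 * z * (1 - z); fixed loci contribute zero."""
    return sum(2.0 * b * b * z * (1.0 - z) for b, z in zip(betas, freqs))


def genotype_variance(z):
    """Exact variance of X in {-1, 0, 1} under Hardy-Weinberg proportions."""
    probs = {-1: (1.0 - z) ** 2, 0: 2.0 * z * (1.0 - z), 1: z ** 2}
    mean = sum(x * p for x, p in probs.items())
    return sum((x - mean) ** 2 * p for x, p in probs.items())


betas = [0.1, 0.1, 0.1]
freqs = [0.0, 0.5, 1.0]  # two fixed loci and one segregating locus
v_a = additive_genetic_variance(betas, freqs)

# Enumerating the three genotypes recovers Var[X | z] = 2 z (1 - z).
print(v_a, genotype_variance(0.3), 2 * 0.3 * 0.7)
```

Only the segregating locus contributes to `v_a`, mirroring the point that fixed loci shift the mean phenotype but add no variance.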
Constructing a model for the polygenic score
As our aim is to isolate the effects of allelic turnover on the statistical properties of polygenic scores, we make the additional assumption that the genotyped sites are the causal sites. (We have already assumed that all loci are in linkage equilibrium.) Akin to [38], we employ a simple threshold model for the effect estimates. For a GWA study consisting of n individuals (and 2n chromosomes),
$$\hat{\beta}_\ell = \begin{cases} \beta_\ell, & d_{\ell 1} < D_\ell < 2n - d_{\ell 2}, \\ 0, & \text{otherwise}, \end{cases} \tag{3}$$
where β̂ℓ is the estimated effect of locus ℓ; Dℓ is the allele count of the trait-increasing allele A2 at the ℓ-th site in the GWA study sample; and dℓ1 and dℓ2 are the site-specific detection thresholds. In this simplified model, the true effect is estimated perfectly for all sites with allele counts within the interval (dℓ1, 2n − dℓ2), and the estimate is zero for all other sites. In Section 4 in S1 Text, we relate Eq 3 to two alternative estimation procedures: maximum likelihood estimation (MLE) and the best linear unbiased predictor (BLUP).
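The threshold model can be written as a short function. This is a sketch under the assumption that detection requires the allele count to lie strictly inside (dℓ1, 2n − dℓ2), as stated in the text above; the function name is hypothetical, and the example threshold d = 4142 for n = 10⁴ is the value quoted later in the text.

```python
def threshold_effect_estimate(beta, allele_count, n, d1, d2):
    """Return beta if d1 < allele_count < 2n - d2, and zero otherwise."""
    if d1 < allele_count < 2 * n - d2:
        return beta
    return 0.0


n = 10_000  # GWA study size (2n = 20,000 chromosomes)
d = 4_142   # symmetric detection threshold, d1 = d2 = d

# Sites that are fixed or at extreme sample frequencies receive an estimate of
# zero; only the site at intermediate frequency is estimated (perfectly).
estimates = [threshold_effect_estimate(0.1, count, n, d, d)
             for count in (0, 500, 10_000, 19_500, 20_000)]
print(estimates)
```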
We allow the two thresholds to differ in order to encompass scenarios in which power is an asymmetric function of the sample allele frequencies, e.g., there is more power to detect low frequency (Dℓ < n) versus high frequency (Dℓ > n) trait-increasing alleles. Such situations may arise with polygenic disease inheritance and imbalanced case and control sample sizes [34]. In most cases, however, we will consider symmetric detection thresholds, with dℓ1 = dℓ2 = dℓ. The threshold dℓ depends on the phenotypic variance, genome-wide significance threshold, true per-locus effect βℓ, and GWA study size n. In Section 2 in S1 Text, we give an explicit form for this dependency for a continuous focal trait and equal detection thresholds. Varying dℓ while keeping the GWA sample size fixed is equivalent to varying the true per-locus effect βℓ. Varying the GWA study size n while keeping βℓ and the other parameters fixed is akin to varying the GWA study’s power to detect loci of a particular effect size. In Analytical Results, we do both.
The threshold model arises in the large GWA study size n limit for the model of the effect estimate β̂ℓ provided in Equation 5 in S1 Text. Namely, as long as Dℓ is not too small, the variance of β̂ℓ goes to zero as n grows. Thus, the threshold model in Eq 3 will necessarily underestimate the true variance of β̂ℓ (Section 4 in S1 Text). Still, this model captures the dependency of β̂ℓ on the GWA study sample size n and the true per-locus effect βℓ, while facilitating our analytical treatment.
In order to compare the polygenic score with an individual’s true phenotype, we need to account for all sites in the mutational target, not just the sites with non-zero effect estimates in the polygenic score. As the estimated effect β̂ℓ is zero for any site in the mutational target that is not included in the polygenic score, we express the polygenic score as a function of all L loci. The ancient polygenic score of individual i sampled at time τ in the past is then given by,
$$\hat{Y}_i(\tau) = \hat{C} + \sum_{\ell=1}^{L} \hat{\beta}_\ell X_{i\ell}(\tau), \tag{4}$$
where Ĉ is the average phenotype of the GWA sample after subtracting the estimated genetic effects at all loci,
$$\hat{C} = \bar{Y} - \sum_{\ell=1}^{L} \hat{\beta}_\ell \bar{X}_\ell, \tag{5}$$
with Ȳ and X̄ℓ as the mean phenotype and the mean genotype at locus ℓ in the GWA study sample, respectively. Here, and in the remainder of our study, we omit time-indexing for random variables associated with the GWA study at t = 0. By design, the estimated intercept Ĉ absorbs the effects of all loci which were not detected as significant in the GWA study, i.e., those sites for which β̂ℓ = 0. Its presence in the polygenic score of Eq 4 is necessitated by the fact that, to facilitate our analytical treatment, we neither centered nor scaled the genotypes and phenotypes in the GWA study. Importantly, all of our results are independent of this choice (Section 5 in S1 Text). Henceforth, unless otherwise noted, we refer to Eq 4 as the polygenic score and to the summation in Eq 4 as the genetic prediction.
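The construction of the polygenic score from GWA summary quantities can be sketched as follows, assuming the intercept is the mean GWA phenotype minus the estimated genetic effects evaluated at the mean GWA genotypes, as described above. Function names and the numerical inputs are hypothetical.

```python
def estimated_intercept(mean_phenotype, effect_estimates, mean_genotypes):
    """Mean GWA phenotype minus the estimated genetic effects at all loci."""
    return mean_phenotype - sum(
        b * x for b, x in zip(effect_estimates, mean_genotypes))


def polygenic_score(intercept, effect_estimates, genotypes):
    """Intercept plus the genetic prediction (weighted sum of genotypes)."""
    return intercept + sum(b * x for b, x in zip(effect_estimates, genotypes))


effect_estimates = [0.1, 0.0, 0.2]  # the second locus was not detected
mean_genotypes = [0.2, -0.9, -0.4]  # mean genotype per locus in the GWA sample
ancient_genotype = [1, -1, 0]       # genotypes coded as -1, 0, 1

c_hat = estimated_intercept(0.05, effect_estimates, mean_genotypes)
score = polygenic_score(c_hat, effect_estimates, ancient_genotype)
print(c_hat, score)
```

Equivalently, the score is the mean GWA phenotype plus the estimated effects applied to the deviation of the ancient genotype from the GWA mean genotype, so undetected loci enter only through the mean phenotype.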
Modeling population genetic dynamics
Population genetic processes govern the correlations between allele frequencies at distinct points in time. We model this correlation using the Wright-Fisher diffusion with recurrent mutation. As we assumed all loci were in linkage equilibrium, their allele frequencies evolve forward in time independently, subject to genetic drift and mutation. At each site, alleles mutate from A1 → A2 with rate μ, and from A2 → A1 with rate ν. While our results readily generalize to arbitrary μ and ν, we restrict ourselves to equal mutation rates, μ = ν.
We further assume that the population is at equilibrium. In this setting, the marginal allele frequencies are beta-distributed, with both shape parameters specified by the population-scaled mutation rate; we denote the latter quantity by a, with a = 4Nμ = 4Nν.
The relative magnitudes of mutation and genetic drift determine which force dominates an allele frequency trajectory. For example, as a approaches 0, the effects of mutation on the frequencies of segregating mutations become negligible and genetic drift dominates. In this low mutation regime (a ≪ 1, or equivalently μ ≪ 1/(4N)), the recurrent mutation model approximates the infinite sites model, while still retaining the features that make it attractive for our analytical treatment. In particular, the stationary allele frequency distribution is a well-defined probability distribution under the recurrent mutation model, but not under the infinite sites model. We concern ourselves almost exclusively with the low mutation regime.
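To give a feel for the low mutation regime, the sketch below computes the expected heterozygosity E[2Z(1 − Z)] at stationarity. It assumes the stationary allele frequency distribution is Beta(a, a), consistent with the density κ(z) ∝ z^(a−1)(1 − z)^(a−1) used in Analytical Results; the function names are hypothetical.

```python
import math


def log_beta(x, y):
    """Logarithm of the beta function B(x, y)."""
    return math.lgamma(x) + math.lgamma(y) - math.lgamma(x + y)


def expected_heterozygosity(a):
    """E[2 Z (1 - Z)] for Z ~ Beta(a, a), via a ratio of beta functions."""
    return 2.0 * math.exp(log_beta(a + 1.0, a + 1.0) - log_beta(a, a))


for a in (1e-4, 1e-3, 1e-2):
    # Closed form for comparison: a / (1 + 2a), which is roughly a when a << 1.
    print(a, expected_heterozygosity(a), a / (1.0 + 2.0 * a))
```

For a ≪ 1 the expected heterozygosity is close to a itself, reflecting that most sites are fixed or nearly fixed at any given time.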
Quantifying out-of-sample prediction errors
To quantify how well the polygenic score approximates the true phenotype of an individual sampled uniformly at random from the population at time τ before the present, we use several statistics:
Bias
We define the bias as the expectation of the difference between the polygenic score, Ŷ(τ), and the true phenotype, Y(τ),
$$\mathrm{bias}(\tau) = \mathbb{E}\big[\hat{Y}(\tau) - Y(\tau)\big], \tag{6}$$
where, here and elsewhere, we omit the subscript when there is only one sample. The expectation in Eq 6 is with respect to the entire random process, encompassing the underlying population genetic dynamics, estimation of the per-locus effects in the GWA study, and computation of the ancient polygenic score (illustrated in Fig 1C).
Mean-squared error (mse)
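We define the mse as the expectation of the squared prediction error, where Ŷ(τ) is the polygenic score and Y(τ) the true phenotype,
$$\mathrm{mse}(\tau) = \mathbb{E}\big[(\hat{Y}(\tau) - Y(\tau))^2\big]. \tag{7}$$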
As in Eq 6, the expectation in Eq 7 is with respect to all sources of randomness in the model. The variance of the prediction error equals the difference of the mse and the square of the bias, and thus it is fully characterized by these two metrics.
Expected estimated additive genetic variance
The estimated additive genetic variance is an estimate of the amount of phenotypic variance in the ancient population explained by additive genetic effects alone. We consider the expectation of this quantity,
(8) |
where the allele frequency of each locus in the ancient population is estimated from a sample of na individuals sampled at τ. The expected true additive genetic variance can be found by taking the expectation of the summation in Eq 2.
Polygenic score accuracy (ρ²)
Practitioners often compute the squared sample correlation coefficient r² to measure the accuracy of a predictor in a sample. Here, our sample is na ancient individuals sampled at time τ, thus,
$$r^2(\tau) = \frac{\mathrm{Cov}\big[\hat{\mathbf{Y}}(\tau), \mathbf{Y}(\tau)\big]^2}{\mathrm{Var}\big[\hat{\mathbf{Y}}(\tau)\big]\,\mathrm{Var}\big[\mathbf{Y}(\tau)\big]}, \tag{9}$$
where Cov[⋅, ⋅] and Var[⋅] are the sample covariance and variance operators, respectively, and Ŷ(τ) and Y(τ) are the na-dimensional vectors of polygenic scores and phenotypes of the ancient individuals, respectively. Ideally, we would compute the expectation of this quantity, but this is challenging because the expectation of a ratio of random variables is generally difficult to compute. Thus, we approximate the expectation of r²(τ) as the ratio of expectations,
$$\rho^2(\tau) = \frac{\mathbb{E}\big[\mathrm{Cov}[\hat{\mathbf{Y}}(\tau), \mathbf{Y}(\tau)]\big]^2}{\mathbb{E}\big[\mathrm{Var}[\hat{\mathbf{Y}}(\tau)]\big]\,\mathbb{E}\big[\mathrm{Var}[\mathbf{Y}(\tau)]\big]}, \tag{10}$$
where, as above, the covariance and variances are taken with respect to the sample of na ancient individuals, while the expectation is over all sources of randomness in Fig 1C (see Section 7.4 in S1 Text for more details). We present simulations in the section Polygenic score accuracy of Analytical Results showing that ρ²(τ) is a good approximation for the expectation of r²(τ) in the parameter regimes of interest.
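For concreteness, the squared sample correlation between a vector of polygenic scores and a vector of phenotypes can be computed as below. This is a plain standard-library sketch with hypothetical function names and made-up example values; it computes the sample statistic r² only, not the ratio-of-expectations approximation ρ²(τ).

```python
def sample_covariance(xs, ys):
    """Unbiased sample covariance of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def squared_sample_correlation(scores, phenotypes):
    """Squared sample correlation between polygenic scores and phenotypes."""
    cov = sample_covariance(scores, phenotypes)
    var_scores = sample_covariance(scores, scores)
    var_phenotypes = sample_covariance(phenotypes, phenotypes)
    return cov ** 2 / (var_scores * var_phenotypes)


scores = [0.1, 0.3, -0.2, 0.0]      # hypothetical ancient polygenic scores
phenotypes = [0.4, 0.9, -0.6, 0.1]  # hypothetical ancient phenotypes
r2 = squared_sample_correlation(scores, phenotypes)
print(r2)
```

Because r² is a ratio of sample moments, its expectation is not the ratio of the expected moments, which is why the approximation above is checked against simulations.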
Analytical Results
By how much does the prediction accuracy of a polygenic score decrease as the time between sampling the ancient individual and conducting the GWA study increases? To answer this question, we consider a trait potentially influenced by L genetic loci, each with true effect βℓ ≥ 0, ℓ = 1, …, L. The forward evolution of sites underlying this trait is modulated by a per site, per generation mutation rate, μ, and a population scaled rate of a = 4Nμ. The diploid population of size 2N chromosomes is assumed to be at equilibrium. The parameters dictating the GWA study are the sample size n and the detection thresholds specified by d1, d2 ∈ {1, …, n}^L. The metrics are indexed by the ancient sampling time τ in coalescent time units. An ancient sampling time of τ corresponds to 2N ⋅ τ generations in the past. We omit the time index for variables associated with the GWA study, which occurs at present day (t = 0). (We show in Section 11 in S1 Text that the metrics can also be expressed as a function of divergence or FST between the ancient and contemporary populations.)
Each subsection is structured as follows: We first derive a general expression for the statistic that does not depend on how we model the population genetic dynamics nor the GWA study. Second, we derive an analytical expression for the statistic under the population genetic assumptions and the GWA study threshold model described in Model and metrics.
Bias
We can rewrite the sampling time-dependent bias defined in Eq 6 as,
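$$\mathrm{bias}(\tau) = \sum_{\ell=1}^{L} \mathbb{E}\big[(\hat{\beta}_\ell - \beta_\ell)(X_\ell(\tau) - \bar{X}_\ell)\big] = \sum_{\ell=1}^{L} \mathrm{bias}_\ell(\tau), \tag{11}$$
where biasℓ(τ) is the contribution of locus ℓ to bias(τ), β̂ℓ is the estimated effect of locus ℓ, and X̄ℓ is the mean genotype at locus ℓ in the GWA study sample. From Eq 11, we see that biasℓ(τ) ≈ 0 when either or both of β̂ℓ ≈ βℓ and Xℓ(τ) ≈ X̄ℓ are true. Thus, biasℓ(τ) is minimal when (i) effect estimates are accurate, and (ii) the allele frequencies have not changed substantially in the interval [τ, 0].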
Under the assumption of equal mutation rates and detection thresholds (dℓ1 = dℓ2), biasℓ(τ) = 0 for τ ≥ 0 for a reason distinct from those stated above. Trait-increasing alleles at high frequencies (Dℓ > n) and low frequencies (Dℓ < n) are detected as significant (β̂ℓ ≠ 0) with equal probability. An equivalent assumption is that power is not affected by whether the most prevalent allele is trait-increasing or decreasing. Subsequent evolution of the allele frequencies preserves this symmetry and bias(τ) remains equal to zero for all τ. It follows that in the absence of additional perturbing forces, an estimate of the mean polygenic score from a sample of na ancient individuals will also be unbiased, and therefore will on average accurately reflect the lack of change in the mean phenotype.
However, if we introduce asymmetry in the detection thresholds (dℓ1 ≠ dℓ2), bias(τ) is non-zero for all τ (Section 7.1 in S1 Text). Using the spectral representation of the transition density of the Wright-Fisher diffusion (tdf), we derive the per-locus contribution to the bias, biasℓ(τ) (Section 7.1 in S1 Text). For a small population-scaled mutation rate a and a large GWA study size n, we approximate this expression (given in Equation 45 in S1 Text) as,
(12) |
where,
(13) |
is the probability that the allele count of site ℓ is less than dℓi, i.e., Dℓ < dℓi for i = 1, 2; and B(⋅, ⋅) is the beta function. Thus, the magnitude of biasℓ(τ) is approximately proportional to the difference in the probability of detecting high (Dℓ > n) versus low (Dℓ < n) frequency alleles, and increases exponentially with τ. With a large GWA study size n and a small mutation rate a, this difference is small relative to the square root of the additive genetic variance (Fig S1a in S1 Text). This is because, when the mutation rate is small, most alleles are fixed or close to fixation: the stationary population allele frequency density κ(z) ∝ z^(a−1)(1 − z)^(a−1) behaves like z^(−1)(1 − z)^(−1) for small a. Varying dℓi then has relatively little impact on the probabilities defined in Eq 13, constraining the difference between the one-sided detection probabilities (Fig S1b in S1 Text).
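The one-sided probability that an allele count falls below a threshold can be evaluated numerically. The sketch below is not the expression from the study: it assumes a Beta(a, a) stationary allele frequency and binomial sampling of 2n chromosomes, so that the allele count is beta-binomial, and the function names are hypothetical.

```python
import math


def log_beta(x, y):
    """Logarithm of the beta function B(x, y)."""
    return math.lgamma(x) + math.lgamma(y) - math.lgamma(x + y)


def log_binomial(m, k):
    """Logarithm of the binomial coefficient C(m, k)."""
    return math.lgamma(m + 1) - math.lgamma(k + 1) - math.lgamma(m - k + 1)


def prob_count_below(d, n, a):
    """P(D < d) when Z ~ Beta(a, a) and D | Z ~ Binomial(2n, Z)."""
    m = 2 * n
    log_norm = log_beta(a, a)
    total = 0.0
    for k in range(d):
        # Beta-binomial probability mass at allele count k.
        total += math.exp(
            log_binomial(m, k) + log_beta(k + a, m - k + a) - log_norm)
    return total


# With a small mutation rate most of the mass sits at or near fixation, so the
# one-sided probability changes little as the threshold varies.
for d in (100, 1_000, 4_142):
    print(d, prob_count_below(d, n=10_000, a=1e-3))
```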
Mean-squared error
The sampling time-dependent mean-squared error mse(τ) can be expressed as,
(14) |
where VE is the variance in the phenotype due to the environment (Section 7.2 in S1 Text). Note the similarity of the left term in Eq 14 to the form of bias(τ) given in Eq 11; similar heuristics apply. Under the threshold model specified in Eq 3, sites at moderate frequencies in the GWA study sample, Dℓ ∈ [dℓ, 2n − dℓ], will not contribute to mse(τ) since β̂ℓ = βℓ, where β̂ℓ is the estimated effect of locus ℓ. Only sites with frequencies outside this interval (including sites invariant in the GWA study sample) will contribute, and their contributions will be proportional to the squared difference between Xℓ(τ) and X̄ℓ, the mean genotype at locus ℓ in the GWA study sample. In practice, moderate frequency loci will also contribute to mse(τ) due to errors in the effect estimates and any difference between the ancient genotypes and the average genotypes in the GWA study sample at these sites (Section 4 in S1 Text).
We use the spectral representation of the tdf (Section 6 in S1 Text) to derive an analytical expression for mseℓ(τ), the per-locus contribution to the mse (Section 7.2 in S1 Text). From this expression, Equation 50 in S1 Text, we derive a linear approximation for the initial per-locus increase in this statistic, Δmseℓ(τ). With a symmetric detection threshold (dℓ1 = dℓ2 = dℓ) we have,
(15) |
where mseℓ(0) is the contribution of site ℓ to mse(τ) for τ = 0 (Equation 76 in S1 Text); and the probability defined in Eq 13 is the probability that the allele count of site ℓ is outside the detection interval, such that the estimated effect of site ℓ is zero. Both mseℓ(0) and this probability depend on the mutation rate a, the GWA study size n, and the detection threshold dℓ.
Δmseℓ(τ) reflects the time-dependent contributions of sites not detected in the GWA study. To see this, we condition on the effect estimate. Eq 15 then implies that, for small τ, the combined effects of drift and mutation on mseℓ(τ) are captured in the product of the mutation rate and sampling time, aτ.
In addition, Eq 15 suggests that the rate at which mseℓ(τ) increases will be shared across parameter regimes when the rate term in Eq 15 is similar (Fig S4a in S1 Text). To illustrate this, we use our analytic formula (given in Equation 50 in S1 Text) to compute mseℓ(τ) for several low mutation rates, a ∈ {10⁻⁴, 10⁻³, 10⁻²}, and three GWA study sizes, n ∈ {10⁴, 10⁵, 10⁶} (Fig 2A). These mutation rates and sample sizes span the range of parameter values appropriate for human data. We depict our results in two ways: (i) we plot the change in mseℓ(τ), and (ii) we plot mseℓ(τ) normalized by the expected additive genetic variance contributed by a single site. At stationarity the expected additive genetic variance contributed by a single site with effect βℓ is constant and equal to,
$$\frac{a\,\beta_\ell^2}{1 + 2a},$$
for a scaled mutation rate a. The plot of the former, Fig 2A, exhibits the functional relationship revealed by Eq 15, while the latter, Fig 2B, approximates the noise-to-signal ratio. In Section 9 in S1 Text, we demonstrate that Eq 15 is a good approximation to mse(τ) for τ ≤ 0.2, particularly when the GWA study size n is large (in particular, see Fig S5 in S1 Text).
To find the GWA study size specific detection thresholds used in Fig 2A and 2B, we solve Equation 11 in S1 Text for a given effect size β, phenotypic variance Vp, and significance threshold α, while varying the GWA study sample size. For β² = 0.01, Vp = 1, and α = 10⁻⁸, the detection thresholds are d = 4142, 3340, and 3290 in order of increasing sample size, which correspond to sample allele frequencies of approximately 0.2, 0.02, and 0.002, respectively. Thus, for a given effect size, larger sample sizes will lead to the detection of alleles at more extreme allele frequencies, while smaller samples will restrict detection to alleles at more intermediate frequencies. Because different combinations of β, Vp, and α can yield the same detection threshold, these particular parameter choices are fairly arbitrary.
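The correspondence between the quoted detection thresholds and sample allele frequencies can be checked with a few lines of arithmetic. The sketch below assumes that a threshold d is an allele count out of the 2n allele copies carried by n diploid individuals, consistent with the detection interval [dℓ, 2n − dℓ] used above; the function and variable names are illustrative only.

```python
# Detection thresholds d (allele counts) quoted in the text, keyed by the
# GWA study sample size n (number of diploid individuals).
thresholds = {10**4: 4142, 10**5: 3340, 10**6: 3290}

def threshold_frequency(d, n):
    """Sample allele frequency d / (2n) at the detection threshold."""
    return d / (2 * n)

frequencies = {n: threshold_frequency(d, n) for n, d in thresholds.items()}

for n, freq in frequencies.items():
    print(f"n = {n:>7}: d = {thresholds[n]}, threshold frequency = {freq:.4f}")
```

The three frequencies come out close to 0.2, 0.02, and 0.002, matching the approximate values stated in the text.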
We find that for small mutation rates, the cumulative change in the mse, Δmseℓ(τ), is mostly insensitive to differences in the GWA study sample size (Fig 2A and 2B). The approximation in Eq 15 helps to explain this result. The rate of increase is approximately proportional to . For small mutation rates (a ≪ 1) and an arbitrary detection threshold dℓ, the probability of not detecting a locus as significantly associated with the trait is roughly for all sufficiently large n (Fig S1b in S1 Text). In this regime, increasing the GWA study sample size only yields small increases in the probability of detecting a locus as significant. Thus, for small mutation rates, the product of this quantity with the mutation rate is , and indeed, we observe a cumulative increase in mseℓ(τ) that is for τ = 1 (Fig 2A). We note that increasing the GWA study sample size does enable detection of loci with smaller effects.
The result in Fig 2A, however, hides the fact that a small absolute increase in mse(τ) may correspond to a substantial increase in the noise-to-signal ratio. Indeed, for a = 10⁻³ (blue lines throughout), mseℓ(τ) ultimately exceeds the expected additive genetic variance for all GWA study sample sizes (Fig 2B). By τ = 0.2, a sampling time characteristic of ancient humans, mseℓ(τ) due to allelic turnover is approximately 20% of the additive genetic variance. For sufficiently large τ, mseℓ(τ) is at least the same order of magnitude as the expected additive genetic variance. In addition, while mseℓ(τ) increases at approximately the same rate irrespective of study size, its initial value mseℓ(0) is sample size dependent (Fig 2B and see Fig S4b and S4e in S1 Text for a larger parameter space). Yet, for a given value of dℓ, reductions in mseℓ(0) mediated by sample size diminish once n is large enough (Fig S4b and S4e in S1 Text).
Further, Fig 2A obscures the fact that different mutation rates may yield similar noise-to-signal ratios. As discussed, for small a, mseℓ(τ) increases with τ at a rate that is . For small a, the additive genetic variance is likewise , yielding a relative increase that is mostly insensitive to the mutation rate. Normalized mseℓ(0) is also similar across small mutation rates (Fig S4b and S4e in S1 Text), rendering relative mseℓ(τ) mostly insensitive to a. We thus omitted the other two mutation rates from Fig 2B.
Lastly, we fix the GWA study sample size at n = 10⁵ and vary the detection threshold d (Fig 2C). Varying d while keeping n fixed is analogous to varying the true per-locus effect size β, or keeping β fixed while varying the significance threshold α. The minimum threshold is d = 10, whereas d = n = 10⁵ maximizes mseℓ(τ) since would equal zero for all ℓ. Consistent with our analysis above, for small a, (i) mseℓ(0) depends critically on d, while (ii) mseℓ(τ)’s approximately linear growth rate is largely insensitive to d. Furthermore, by our previous arguments, relative mseℓ(τ) is similar across small mutation rates, and they are also omitted in Fig 2C. For independent and identically distributed (iid) loci and , the per-locus mseℓ(τ) values presented in Fig 2B and 2C are equal to the corresponding trait-wide statistics mse(τ).
Additive genetic variance
The per-locus contribution to the expected estimated additive genetic variance is,
(16)
where is the estimated allele frequency at τ, computed in a sample of na ancient individuals. When or Zℓ(τ) ∈ {0, 1}, site ℓ will not contribute to . Thus, a site ℓ has a non-zero contribution to the estimated additive genetic variance only when it is segregating at both the present day and τ. This condition is necessary for both and to be true.
As with the two previous statistics, we use the spectral representation of the tdf to derive an analytical expression for under our population genetic assumptions (Section 7.3 in S1 Text). The resulting expression, Equation 54 in S1 Text, indicates that the expected additive genetic variance decays exponentially. We then, to first order in the ancient sampling time τ, approximate the initial decrease in the per-locus estimated additive genetic variance ,
(17)
where is evaluated at τ = 0 (Equation 77 in S1 Text); and 2P(dℓ), defined in Eq 6, is the probability that . The factor due to finite sampling, 2na/(2na − 1), is ≈1 when the ancient sample size na is large. Thus, apart from sign, is equal to Δmseℓ(τ) of Eq 15. Therefore, for small τ, decreases at approximately the same rate as mse(τ) increases. This result further suggests that for a ≪ 1 and a large GWA study size n, for small τ (Fig 2C and 2F). This relationship, however, trivially breaks down for large τ, as mseℓ(τ) is not bounded by one.
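To see how quickly the finite-sampling factor 2na/(2na − 1) approaches one, it can be evaluated for a few ancient sample sizes. This is a minimal sketch; the function name and the sample sizes chosen are illustrative and are not taken from the original analysis.

```python
def sampling_factor(n_a):
    """Finite-sampling factor 2*n_a / (2*n_a - 1) for n_a ancient diploid individuals."""
    return 2 * n_a / (2 * n_a - 1)

for n_a in (1, 5, 10, 100, 1000):
    print(f"n_a = {n_a:>4}: factor = {sampling_factor(n_a):.4f}")
```

For a single ancient individual the factor is exactly 2, whereas for 100 individuals it is already within about half a percent of one.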
To compare across mutation rates, we mirror our treatment of mseℓ(τ) in the previous section. We plot (i) its increase (Fig 2D); (ii) normalized by the expectation of the true additive genetic variance at stationarity (Fig 2E); and (iii) normalized , varying the detection threshold for a fixed GWA study sample size (Fig 2F). Akin to mseℓ(τ), normalized is very similar across small mutation rates. And, while the GWA study size n and the detection threshold d influence the initial estimated additive genetic variance , its rate of change is mostly insensitive to the two GWA study parameters.
As largely recapitulates our results for mse(τ) with opposing sign, we focus on their differences. Indeed, they have different functional forms and behave differently for modest or large τ (see Equations 50 and 54 in S1 Text, respectively). Conceptually, this discrepancy is not unexpected: In the previous section, we showed that a site only contributes to mse(τ) if its allele count falls outside the detection interval and . Thus, mse(τ) increases with τ due to alleles shifting from intermediate frequencies in the ancient population to frequencies outside of the detection region in the contemporary population. For the expected estimated additive genetic variance , the converse is true: The slope represents the decline in due to alleles changing from frequencies near or at fixation in the ancient population to frequencies within the detection interval in the contemporary population. While our results reveal similar functional behavior for these two quantities (with opposing signs) that applies for small τ, we caution that statements about do not immediately translate to statements about mse(τ), particularly for τ ⪆ 0.2.
Polygenic score accuracy
While our framework, in principle, encompasses a trait with varying effect sizes, we will first assume that all sites are iid with true effect size β. Our approximation to the expectation of the sample correlation coefficient simplifies to,
(18)
where the compound parameter is the environmental variance normalized by the product of the number of loci in the mutational target L and the squared per-locus effect size β (Section 7.4 in S1 Text). By comparing Eq 18 with Eq 16, we can see that ρ²(τ) is closely related to the estimated additive genetic variance. Thus, like , ρ²(τ) will decrease with τ due to loci having changed from frequencies close to zero or one in the ancient population to intermediate frequencies in the contemporary population. However, unlike , ρ²(τ) does not depend on the ancient sample size. Therefore, to relate the two statistics, we multiply by the inverse of the ancient sample size dependent factor implicit in ,
(19)
For , barring the sample size factor, Eq 19 is equal to normalized by the expected additive genetic variance. By extension, this quantity approximates the expected sample correlation coefficient r²(τ). By invoking our additional population genetic and GWA study assumptions, we arrive at an approximation for the decrease in polygenic score accuracy,
(20)
Now, to relate our theory to empirical and simulation studies, we compute ρ²(τ) for a given pair of narrow-sense heritability h² and mutation rate a. We define h² for a trait with a mutational target of L loci of equal effects β,
where the equality follows from our population genetic assumptions. Together with a, h² fully specifies the compound parameter with,
We plot our analytical expressions for both accuracy (Fig 3A) and relative accuracy (Fig 3B), defined as the ratio of ρ²(τ) to ρ²(0), for τ ∈ [0, 1], spanning 2N generations. For humans, this time span corresponds to approximately 500,000 years in the past, encompassing the “Out-of-Africa” migration event estimated to have occurred 50,000–100,000 years ago [39]. As with the preceding statistics, when τ = 0, ρ²(τ) approximates the accuracy of the polygenic score within the GWA study population. Relative accuracy then directly measures reductions in accuracy relative to the GWA study population. We set h² = 0.5 and a = 10⁻³, and fix the GWA study sample size at n = 10⁵. We then compute ρ²(τ), varying the detection threshold over several orders of magnitude (Fig 3A). (See Fig S6 in S1 Text for accuracy as a function of the fixation index, or FST.) Our results for ρ²(τ) necessarily recapitulate those of the estimated additive genetic variance: While increasing the detection threshold d reduces accuracy substantially, it does not have a large impact on relative accuracy for n = 10⁵ (Fig 3A). Indeed, for small mutation rates, relative accuracy is insensitive to the mutation rate and threshold, and is well approximated by exp(−τ) (Equation 68 in S1 Text). Thus, its derivative is also exponential. Absolute accuracy ρ²(τ) likewise decays exponentially, but its derivative is scaled by a quantity that reflects features of the GWA study and the phenotypic variance. For a small mutation rate a ≪ 1, its derivative is approximately , which, in turn, is approximately 2P(d)h² exp(−τ) (Equation 67 in S1 Text). The latter expression suggests that the probability of not detecting a significant association P(d) and trait heritability h² are the key determinants of prediction accuracy. Importantly, ρ²(τ) declines considerably over the interval τ ∈ [0, 1] irrespective of the detection threshold d.
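The small-mutation-rate approximation for relative accuracy, an exponential decay in τ, is simple enough to tabulate directly. The sketch below does so and converts τ into years. The effective size N = 10,000 and the generation time of 25 years are assumptions made here for illustration, chosen only because they reproduce the roughly 500,000 years quoted for τ = 1; they are not parameters reported in the text.

```python
import math

def relative_accuracy(tau):
    """Small-mutation-rate approximation: rho^2(tau) / rho^2(0) ~ exp(-tau)."""
    return math.exp(-tau)

def tau_to_years(tau, N=10_000, generation_time=25.0):
    """Convert tau (measured in units of 2N generations) to years.

    N and generation_time are illustrative assumptions, not values from the text.
    """
    return tau * 2 * N * generation_time

for tau in (0.0, 0.1, 0.2, 0.5, 1.0):
    years = tau_to_years(tau)
    print(f"tau = {tau:.1f}  (~{years:>9,.0f} years)  "
          f"relative accuracy = {relative_accuracy(tau):.3f}")
```

Under this approximation, relative accuracy at τ = 0.2 is about 0.82, and it falls to about 0.37 by τ = 1, whatever the detection threshold.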
In addition, we glean from Eq 18 that while heritability affects the magnitude of ρ²(τ) through the compound parameter , it does not influence the relative accuracy, consistent with previous results [16]. Our simulations suggest that this is also true of the sample correlation coefficient, as simulated estimates of r²(τ) agree extremely well with our theory for ρ²(τ) (Fig 3B). We note that this result is contingent on the fact that the environmental variance only enters our simple threshold model in the specification of the threshold d (Equation 11 in S1 Text), and does not contribute directly to the variance of the polygenic score (Section 7.4 in S1 Text). Therefore, we expect this result to hold only for large GWA study sample sizes for which the threshold model is a good approximation to the distribution of . While the finding that relative accuracy is insensitive to the GWA study parameters relies on the assumption that all loci are iid and share a causal effect β, we provide preliminary theoretical evidence that our results will hold when β varies across loci (see Equation 69 in S1 Text and ensuing comments).
Simulation results for recent directional selection
We use simulations to explore if and how the statistics under study deviate from their neutral expectations in the presence of recent directional selection. Each copy of the A2 allele at the ℓ-th site confers a fitness advantage of +sℓ, and so the fitness ratio of the three possible genotypes A1A1:A1A2:A2A2 is 1: (1 + sℓ): (1 + 2sℓ). In our simulations, the population evolves neutrally until the onset of selection at N generations (or τs = 0.5 in coalescent time units) before present. Thereafter, the population evolves according to discrete Wright-Fisher dynamics with selection.
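The single-locus dynamics described above can be sketched in a few lines. The following is not the authors' simulation code (see Section 3 in S1 Text and the linked repository for that); it is a minimal illustration that assumes random mating, symmetric recurrent mutation at rate μ per allele copy per generation, a deliberately small population so that it runs quickly, and a starting frequency of 0.5 rather than a draw from the stationary distribution. All function and parameter names are hypothetical.

```python
import random

def selection_step(x, s):
    """Deterministic change in the A2 frequency x under genotype fitnesses
    1 : (1 + s) : (1 + 2s) for A1A1 : A1A2 : A2A2, assuming random mating."""
    mean_fitness = 1.0 + 2.0 * s * x
    return x * (1.0 + s + s * x) / mean_fitness

def mutation_step(x, mu):
    """Symmetric recurrent mutation at rate mu per copy per generation (assumed)."""
    return x * (1.0 - mu) + (1.0 - x) * mu

def simulate(N, s, mu, generations, onset, x0=0.5, seed=1):
    """Forward-in-time trajectory of the A2 frequency in a population of N diploids.
    The locus is neutral before generation `onset` and under selection afterwards."""
    rng = random.Random(seed)
    x, path = x0, [x0]
    for g in range(generations):
        s_now = s if g >= onset else 0.0
        p = mutation_step(selection_step(x, s_now), mu)
        # Genetic drift: binomial sampling of 2N allele copies.
        copies = sum(rng.random() < p for _ in range(2 * N))
        x = copies / (2 * N)
        path.append(x)
    return path

# Scaled parameters: sigma = 4Ns = 10 and a = 4N*mu = 1e-3, with selection
# starting N generations before the final (present-day) generation.
N = 100
path = simulate(N=N, s=10 / (4 * N), mu=1e-3 / (4 * N), generations=2 * N, onset=N)
print(f"A2 frequency at the present day: {path[-1]:.3f}")
```

The selection step follows from the mean fitness 1 + 2sx under the stated fitness scheme. Running the sketch over many seeds and starting frequencies would be needed before drawing any quantitative conclusion.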
In the presence of selection, the allele frequency distribution is no longer symmetric; rather, it is skewed toward the beneficial allele. The severity of the skew depends on the selection coefficient and mutation rate, as well as the amount of time that selection has been acting. As we restrict sℓ to positive values, designating the A2 or + allele as beneficial, the allele frequency distribution will be skewed toward one. If we instead designated the A1 allele as the beneficial allele, the allele frequency distribution would be skewed toward zero. The former models “positive” selection whereas the latter models “negative” selection. Because bias(τ) is proportional to β, its sign will be sensitive to this choice, but its magnitude will be unaltered. The other statistics will not be affected as long as the detection thresholds are symmetric. Therefore, our results are general up to the sign of bias(τ).
We conduct simulations over a range of selection coefficients, σ = 4Ns ∈ {0, 0.1, 1, 10}, for a mutation rate of a = 10⁻³. Under directional selection, σ is proportional to the locus effect size β; mutations with larger effect sizes will be more likely to establish and achieve appreciable frequencies [40]. In addition, we plot results for two different detection thresholds, d ∈ {10³, 10⁴}, in a GWA study sample of size n = 10⁴. More details on the simulation procedures are provided in Section 3 in S1 Text.
When σ ≥ 1, the polygenic score is biased towards positive values for τ > 0 for both detection thresholds (Fig 4A). In other words, with directional selection acting to increase the trait value, the polygenic score tends to overestimate Y(τ). The magnitude of biasℓ(τ) depends critically on the strength of selection relative to mutation: We observe a larger bias for σ = 10 relative to σ = 1, and likewise the bias is larger for σ = 1 relative to σ = 0.1. In fact, the smaller selection coefficient σ = 0.1 is not distinguishable from neutral expectations. For 0 ≤ τ < τs, biasℓ(τ) increases at an accelerating rate; for τ ≥ τs, bias(τ) appears constant in this parameter regime.
A higher detection threshold decreases the detection probability. Thus, we expect that the magnitude of biasℓ(τ) will increase with the detection threshold. Indeed, biasℓ(τ) is larger and increases more quickly for the larger detection threshold d = 10⁴ compared to d = 10³ (Fig 4A). Further, our simulations suggest that the detection threshold coupled with the time of the onset of selection govern the magnitude of the bias for τ > τs. For some large τ, biasℓ(τ) will reach an equilibrium value that depends approximately on the asymmetry of the detection thresholds at the present day, which in turn, depends on both the timing and strength of selection (Section 10 in S1 Text).
The underlying allele frequency dynamics provide some insight into these patterns. Before the onset of selection, the allele frequency distribution is stationary and symmetric around 0.5. After the onset of selection, trait-increasing alleles tend to increase in frequency, skewing the distribution toward one. Thus, alleles not detected in the GWA study will tend to be at higher versus lower frequencies at t = 0, yielding for σ > 0. For large τ, the allele frequencies of sites not detected in the GWA study, i.e., with , may have been substantially different in the ancient population. Each one of these sites will make a contribution to bias(τ) that is proportional to (Eq 11). Looking backward in time, the shift in the allele frequency distribution ensures that the conditional expectation of Xℓ(τ) is smaller than that of , yielding a positive biasℓ(τ) for τ > 0. Notably, the magnitude of biasℓ(τ) induced by selection is several orders of magnitude larger than that induced by asymmetry in the detection threshold alone (Fig S1a in S1 Text).
The effects of selection on mseℓ(τ) are qualitatively consistent with those on biasℓ(τ) (Fig 4B), although here the only selection coefficient that induces significant deviations from neutral expectations is σ = 10. In addition, mse(τ) is larger for d = 10⁴ compared to d = 10³. As with bias(τ), for 0 ≤ τ < τs, mseℓ(τ) increases at an accelerating rate; for samples that predate the onset of selection (τ ≥ τs), mseℓ(τ) appears to increase linearly. Values of σ < 10 do not induce noticeable deviations from neutrality for the correlation coefficient ρ²(τ) either (Fig 4C). However, strong selection (σ = 10) does lead to substantially larger reductions in accuracy relative to our neutral expectations. In addition, for σ = 10, relative accuracy is sensitive to the detection threshold, with accuracy decreasing faster for the larger detection threshold (Fig 4D).
Discussion
In this work, we devised a theoretical framework to quantify the effect of allelic turnover on the error and accuracy of out-of-sample polygenic scores. Unlike previous theoretical approaches [16, 27], we averaged over the evolutionary process governing trait evolution, the GWA study from which a polygenic score model is constructed, and the ancient individual’s genotype and phenotype. In doing so, we found explicit expressions for several commonly used metrics that depend on the focal individual’s sampling time, as well as the parameters governing the population genetic dynamics and power to detect trait-associated loci in the GWA study. Mathematical properties of the recurrent mutation model at stationarity enabled us to compute analytical expressions for the metrics of interest under neutrality, and approximations thereof.
Our analytical expressions suggest that allelic turnover alone may be responsible for large reductions in accuracy: For small mutation rates, ρ²(τ) (and r²(τ)) decreases substantially within short time-spans, by about 20 percent in 0.2N generations (corresponding to approximately 120,000 years in humans). In addition, increasing the detection threshold yielded lower polygenic score accuracy, as a locus was less likely to have a non-zero effect. These results are broadly consistent with a concurrent study by Yair and Coop [41], in which the authors used simulations to assess cross-population prediction accuracy, defined as the ratio of the variance of an individual’s polygenic score to that of their genetic value, under neutrality and in the presence of stabilizing selection. When Yair and Coop restricted the polygenic score to the top one percent of SNPs, roughly analogous to altering the detection threshold, they similarly found that the accuracy declined in the focal population.
Yet, while the detection threshold influenced the magnitude of the polygenic score accuracy, relative accuracy was insensitive to this parameter. In other words, under neutrality, relative accuracy is insensitive to the magnitude of the per-locus effect and only depends on the underlying allele frequency distribution. In addition, relative accuracy was independent of the size of the mutational target when the constituent loci were iid. Our theory suggests that these results will hold for arbitrary distributions of the true effect β. Consideration of several effect size distributions in a parameter regime consistent with the UK Biobank further supports this conjecture (Section 8 in S1 Text), although more work is required to fully substantiate this claim.
Selection, however, induces a dependency between an allele’s effect and its frequency, and may thereby render relative accuracy sensitive to the detection threshold. Our simulations provide preliminary evidence in support of this claim. For a small mutation rate of a = 4Nμ = 10⁻³ and a large per-locus selection coefficient σ = 4Ns = 10, relative accuracy was lower for the larger detection threshold of d = 10⁴ compared to d = 10³. Yet, the difference between detection thresholds was small relative to that induced by selection, and was negligible for smaller selection coefficients. Indeed, smaller selection coefficients (σ ≤ 1) did not yield appreciable deviations from our neutral expectations for the mse, accuracy, or relative accuracy. Therefore, excluding strong selection (σ > 1), our neutral expectations for these statistics appear to be good approximations to their true values. Our theoretical results under neutrality thus may prove an accurate description of temporally-resolved polygenic scores when polygenic adaptation is achieved by concurrent small frequency changes at numerous small effect loci, a plausible scenario [28, 35]. In addition, the simple patterns revealed by our simulations suggest that it may be possible to derive (approximate) analytic expressions for the given metrics in the presence of strong selection, when loci exhibit selective sweep-like behavior.
It is unclear whether our neutral expectations will hold in the context of more sophisticated polygenic trait modeling. In our simulation study, as in our theoretical work, we focus on dynamics at a single locus. Thus, our results are most relevant to scenarios in which single locus dynamics can be decoupled from the evolution of the mean phenotype and the genetic background [40]. Namely, the effect of an individual locus must be small relative to the mean phenotype [38, 40]. Future work will assess polygenic score accuracy under more sophisticated models of polygenic adaptation (e.g., [38, 42]).
Of the two bias-inducing processes explored, detection threshold asymmetry and directional selection, the latter induced much larger deviations from our neutral expectation for the bias, i.e., under neutrality bias(τ) = 0 for all ancient sampling times τ. In the presence of detection asymmetry, bias(τ) is approximately proportional to the difference between the one-sided detection probabilities, which in turn is constrained by the shape of the allele frequency distribution. Under neutrality, and for small mutation rates, most alleles are at very low frequencies or fixed, such that changing the detection threshold minimally influences the one-sided detection probabilities. Selection, however, perturbs the underlying allele frequency density. At equilibrium, this density is proportional to exp(σz) z⁻¹(1 − z)⁻¹ for small a, where σ = 4Ns. Depending on σ, the one-sided detection probabilities may differ markedly, yielding larger values of bias(τ). We thus suspect that detection asymmetry has the potential to further exacerbate any bias induced by selection. These results are interesting in light of those of Chan et al. [34], who demonstrated that polygenic disease inheritance under the liability threshold model induced differences in the power to detect protective versus susceptible alleles. In Chan et al., this effect was further increased by imbalances in the case and control sample sizes in the GWA study. Additional work is needed to incorporate these features of case-control studies into our modeling framework.
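One way to get a feel for the skew implied by the small-a equilibrium density, proportional to exp(σz)/(z(1 − z)), is to compare its value at frequency 1 − z with its value at frequency z. The z(1 − z) terms cancel in this ratio, leaving exp(σ(1 − 2z)). The sketch below evaluates the ratio numerically from the unnormalized density; the function names and the example frequency z = 0.1 are illustrative choices, not quantities from the text.

```python
import math

def log_density(z, sigma):
    """Unnormalized log equilibrium density: sigma*z - log(z) - log(1 - z)."""
    return sigma * z - math.log(z) - math.log(1.0 - z)

def skew_ratio(z, sigma):
    """Density at frequency 1 - z relative to frequency z.
    Algebraically this equals exp(sigma * (1 - 2*z))."""
    return math.exp(log_density(1.0 - z, sigma) - log_density(z, sigma))

for sigma in (0.0, 0.1, 1.0, 10.0):
    print(f"sigma = {sigma:>4}: density(0.9) / density(0.1) = {skew_ratio(0.1, sigma):.3g}")
```

For σ = 0 the ratio is one (a symmetric density), while for σ = 10 it is exp(8), roughly three thousand, which illustrates why the two one-sided detection probabilities can differ markedly under strong selection.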
The effects of selection on the bias have implications for assessments of mean differences between ancient polygenic scores from distinct time points. In particular, our results suggest that sufficiently strong positive directional selection will lead to overestimation of the difference between the polygenic scores of ancient individuals sampled before and after the onset of selection. Likewise, in the presence of negative selection, the polygenic score will underestimate this difference. At the same time, as discussed above, estimation error (as measured by mse(τ)) increases and accuracy (as measured by ρ²(τ)) decreases as the ancient sampling time increases.
Our results clarify relationships between various commonly used metrics of prediction error and accuracy. For example, we demonstrated an approximate functional relationship between the mean-squared error mse(τ) and the expected estimated additive genetic variance that applies for small ancient sampling times and mutation rates. This shared initial rate emerged despite fundamental differences between these statistics: mse(τ) measures error due to variants near or at fixation in the contemporary sample, which were segregating at intermediate frequencies in the ancient sample. In contrast, the estimated additive genetic variance measures error due to variants segregating in the contemporary sample, which were near or at fixation in the ancient sample. This conceptual result does not rely on any of our population genetic or GWA modeling assumptions, and perhaps could be exploited to learn about the genetic architecture of quantitative traits from multi-population data. In addition, we showed formally that polygenic score accuracy ρ²(τ), an approximation to the expectation of the sample correlation coefficient r²(τ), is proportional to the ratio of the estimated additive genetic variance to the total phenotypic variance. We believe that these relations, and their evolutionary and GWA study dependent forms, may facilitate the development of novel, more principled statistical procedures for the analysis of out-of-sample polygenic scores.
At the same time, the simplifying assumptions underlying our results indicate that significant challenges remain. For one, our model does not incorporate the complex demographic processes, such as admixture and population size changes, inherent in human history. This implies that an ancient sampling time of t years in the past likely does not correspond to a sampling time of τ = t/2N in our model, where 2N is the contemporary population size. Indeed, allelic turnover cannot explain all of the reductions in accuracy observed in out-of-sample predictions in humans. For example, our neutral theory predicts an approximately fifty percent reduction in accuracy when FST between the focal and GWA study populations is comparable to African-European divergence (FST ≈ 0.1). This more severely overestimates the prediction accuracy of height in a sample of individuals with African ancestry compared to the Wang et al. predictions, which take into account both LD and allele frequency changes (Section 12 in S1 Text). Thus, to achieve the same accuracy reductions observed in both simulated, e.g., [15, 16] and empirical, e.g., [14, 16, 43], studies of cross-population polygenic scores for contemporary humans, allelic turnover under neutrality would require population divergence times that far exceed their estimated values (Fig S7 in S1 Text).
Differences in LD between contemporary human populations may largely explain this discrepancy as most trait-associated loci are likely to be tagging rather than causal sites [12, 16]. As with geographically distinct populations, if LD between the genotyped and causal sites differed in the ancient population, then polygenic score accuracy would suffer [1]. We did not model this effect and assumed that the genotyped site was the causal site. This assumption may be justified when ancient sampling or population divergence times are recent, as high marker density in the GWA study may mitigate accuracy losses due to LD decay, but more theoretical work is required to substantiate this claim. While our framework can readily incorporate LD, it is difficult to obtain analytical results when the genotyped marker is not the causal site. In lieu of theoretical results, large-scale simulations in simple population genetic scenarios may provide insight into the relative contributions of LD—which depends on the allele frequencies of the tagging and causal sites—and allelic turnover to declines in polygenic score accuracy.
Furthermore, our assumption of linkage equilibrium between loci roughly equates to assuming that each LD block contains only a single causal site. Thus, our results will be most applicable to traits with relatively sparse genetic architectures for which the distance between any two causal loci is large compared to the scale of LD. In contrast, when the trait architecture is dense, a large number of variants have non-zero effect on the trait. Causal sites in close proximity are necessarily linked, and our assumption of linkage equilibrium would be violated. In addition, under a dense trait architecture, the “prune and threshold” polygenic score described herein may achieve lower accuracy than a best linear unbiased predictor (BLUP) that allows all segregating loci to have non-zero effects. In Section 4 in S1 Text, we speculate on the accuracy of BLUP in the context of our modeling framework when the trait has a dense architecture.
In addition, we assumed that per-locus causal effects were shared by the ancient and contemporary samples. Differences in causal effects across contemporary populations, perhaps due to changes in the environment, epistasis, or gene-by-environment interactions, likely contribute to accuracy reductions [8, 12]. Indeed, Cox et al. [18] found that trends in the polygenic scores of temporally disparate ancient samples did not always recapitulate those of the true phenotype. We conjecture that fluctuations in the per-locus effects would increase mse(τ) and decrease accuracy, but not profoundly alter our conclusions. Perhaps, if the fluctuations were asymmetric, e.g., effect sizes tended to increase in time, then bias(τ) may be non-zero under neutrality. Population stratification in the GWA study population may also lead to biased ancient polygenic scores, as has been observed in cross-population predictions in humans [9, 10]. Lastly, technical challenges inherent to the extraction and sequencing of ancient DNA often result in noisy estimates of the ancient genotypes. This additional source of randomness is likely to reduce accuracy and increase mse(τ), but otherwise should not substantially alter our conclusions.
Supporting information
Acknowledgments
We thank members of the Berg, Novembre, and Steinrücken labs, and the Cummings fourth floor for helpful discussions throughout the development of this project. In addition, we thank Jennifer Blanc, Adam Fine, Evan Koch, Zachary Miller, and John Novembre for comments on earlier (or very early) versions of this manuscript. We also give a special thanks to Carlos A. Serván and Micol Tresoldi for numerous insightful discussions over the course of this project.
Data Availability
All code to generate the results is available at https://github.com/marync/ancient_polygenic.
Funding Statement
MOC received funding from a National Institute of Health training grant T32 GM07197. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
References
- 1. Habier D, Fernando RL, Dekkers JCM. The impact of genetic relationship information on genome-assisted breeding values. Genetics. 2007;177(4):2389–2397. doi: 10.1534/genetics.107.081190
- 2. De Roos APW, Hayes BJ, Spelman RJ, Goddard ME. Linkage disequilibrium and persistence of phase in Holstein-Friesian, Jersey and Angus cattle. Genetics. 2008;179(3):1503–1512. doi: 10.1534/genetics.107.084301
- 3. Hamblin MT, Buckler ES, Jannink JL. Population genetics of genomics-based crop improvement methods. Trends in Genetics. 2011;27(3):98–106. doi: 10.1016/j.tig.2010.12.003
- 4. Erbe M, Hayes BJ, Matukumalli LK, Goswami S, Bowman PJ, Reich CM, et al. Improving accuracy of genomic predictions within and between dairy cattle breeds with imputed high-density single nucleotide polymorphism panels. Journal of Dairy Science. 2012;95(7):4114–4129. doi: 10.3168/jds.2011-5019
- 5. Carlson CS, Matise TC, North KE, Haiman CA, Fesinmeyer MD, Buyske S, et al. Generalization and Dilution of Association Results from European GWAS in Populations of Non-European Ancestry: The PAGE Study. PLoS Biology. 2013;11(9):e1001661. doi: 10.1371/journal.pbio.1001661
- 6. Wray NR, Yang J, Hayes BJ, Price AL, Goddard ME, Visscher PM. Pitfalls of predicting complex traits from SNPs. Nature Reviews Genetics. 2013;14(7):507–515. doi: 10.1038/nrg3457
- 7. Guo Z, Tucker DM, Basten CJ, Gandhi H, Ersoz E, Guo B, et al. The impact of population structure on genomic prediction in stratified populations. TAG Theoretical and applied genetics. 2014;127(3):749–762. doi: 10.1007/s00122-013-2255-x
- 8. Galinsky KJ, Reshef YA, Finucane HK, Loh PR, Zaitlen N, Patterson NJ, et al. Estimating cross-population genetic correlations of causal effect sizes. Genetic Epidemiology. 2019;43(2):180–188. doi: 10.1002/gepi.22173
- 9. Berg JJ, Harpak A, Sinnott-Armstrong N, Joergensen AM, Mostafavi H, Field Y, et al. Reduced signal for polygenic adaptation of height in UK Biobank. eLife. 2019;8. doi: 10.7554/eLife.39725
- 10. Sohail M, Maier RM, Ganna A, Bloemendal A, Martin AR, Turchin MC, et al. Polygenic adaptation on height is overestimated due to uncorrected stratification in genome-wide association studies. eLife. 2019;8:1–17. doi: 10.7554/eLife.39702
- 11. Mostafavi H, Harpak A, Agarwal I, Conley D, Pritchard JK, Przeworski M. Variable prediction accuracy of polygenic scores within an ancestry group. eLife. 2020;9. doi: 10.7554/eLife.48376
- 12. Bitarello BD, Mathieson I. Polygenic scores for height in admixed populations. G3: Genes, Genomes, Genetics. 2020;10(11):4027–4036. doi: 10.1534/g3.120.401658
- 13. Durvasula A, Lohmueller KE. Negative selection on complex traits limits phenotype prediction accuracy between populations. American Journal of Human Genetics. 2021;108(4):620–631. doi: 10.1016/j.ajhg.2021.02.013 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14. Martin AR, et al. Human Demographic History Impacts Genetic Risk Prediction across Diverse Populations. American Journal of Human Genetics. 2017;100:635–649. doi: 10.1016/j.ajhg.2017.03.004 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15. Ragsdale AP, Nelson D, Gravel S, Kelleher J. Lessons Learned from Bugs in Models of Human History. American Journal of Human Genetics. 2020;107(4):583–588. doi: 10.1016/j.ajhg.2020.08.017 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16. Wang Y, Guo J, Ni G, Yang J, Visscher PM, Yengo L. Theoretical and empirical quantification of the accuracy of polygenic scores in ancestry divergent populations. Nature Communications. 2020;11(1). doi: 10.1038/s41467-020-17719-y [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17. Swarts K, Gutaker RM, Benz B, Blake M, Bukowski R, Holland J, et al. Genomic estimation of complex traits reveals ancient maize adaptation to temperate North America. Science. 2017;357(6350):512–515. doi: 10.1126/science.aam9425 [DOI] [PubMed] [Google Scholar]
- 18. Cox SL, Ruff CB, Maier RM, Mathieson I. Genetic contributions to variation in human stature in prehistoric Europe. Proceedings of the National Academy of Sciences of the United States of America. 2019;116(43):21484–21492. doi: 10.1073/pnas.1910606116 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19. Colbran LL, Gamazon ER, Zhou D, Evans P, Cox NJ, Capra JA. Inferred divergent gene regulation in archaic hominins reveals potential phenotypic differences. Nature Ecology and Evolution. 2019;3(11):1598–1606. doi: 10.1038/s41559-019-0996-x [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20. Cox SL, Moots H, Stock JT, Shbat A, Bitarello BD, Haak W, et al. Predicting skeletal stature using ancient DNA. bioRxiv. 2021; p. 2021.03.31.437877. 10.1101/2021.03.31.437877 [DOI] [Google Scholar]
- 21. Windhausen VS, Atlin GN, Hickey JM, Crossa J, Jannink JL, Sorrells ME, et al. Effectiveness of genomic prediction of maize hybrid performance in different breeding populations and environments. G3: Genes, Genomes, Genetics. 2012;2(11):1427–1436. doi: 10.1534/g3.112.003699 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22. Lorenz AJ, Smith KP, Jannink JL. Potential and optimization of genomic selection for Fusarium head blight resistance in six-row barley. Crop Science. 2012;52(4):1609–1621. doi: 10.2135/cropsci2011.09.0503 [DOI] [Google Scholar]
- 23. Bycroft C, Freeman C, Petkova D, Band G, Elliott LT, Sharp K, et al. The UK Biobank resource with deep phenotyping and genomic data. Nature. 2018;562(7726):203–209. doi: 10.1038/s41586-018-0579-z [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24. Kanai M, Akiyama M, Takahashi A, Matoba N, Momozawa Y, Ikeda M, et al. Genetic analysis of quantitative traits in the Japanese population links cell types to complex human diseases. Nature Genetics. 2018;50(3):390–400. doi: 10.1038/s41588-018-0047-6 [DOI] [PubMed] [Google Scholar]
- 25. Meuwissen TH, Hayes BJ, Goddard ME. Prediction of total genetic value using genome-wide dense marker maps. Genetics. 2001;157(4):1819–1829. doi: 10.1093/genetics/157.4.1819 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26. de los Campos G, Hickey JM, Pong-Wong R, Daetwyler HD, Calus MPL. Whole-genome regression and prediction methods applied to plant and animal breeding. Genetics. 2013;193(2):327–345. doi: 10.1534/genetics.112.143313 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 27. Daetwyler HD, Villanueva B, Woolliams JA. Accuracy of predicting the genetic risk of disease using a genome-wide approach. PLoS ONE. 2008;3(10). doi: 10.1371/journal.pone.0003395 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28. Berg JJ, Coop G. A Population Genetic Signal of Polygenic Adaptation. PLoS Genetics. 2014;10(8):1004412. doi: 10.1371/journal.pgen.1004412 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 29. Liu Y, Mao X, Krause J, Fu Q. Insights into human history from the first decade of ancient human genomics. Science. 2021;373(6562):1479–1484. doi: 10.1126/science.abi8202 [DOI] [PubMed] [Google Scholar]
- 30. Ewens WJ. Mathematical Population Genetics I: Theoretical Introduction. New York: Springer-Verlag; 2004. [Google Scholar]
- 31. Durrett R. Probability Models for DNA Sequence Evolution. 2nd ed. New York: Springer-Verlag; 2008. [Google Scholar]
- 32.Griffiths RC, Spano D. Diffusion processes and coalescent trees. arXiv. 2010. http://arxiv.org/abs/1003.4650.
- 33. Song YS, Steinrücken M. A simple method for finding explicit analytic transition densities of diffusion processes with general diploid selection. Genetics. 2012;190(3):1117–1129. doi: 10.1534/genetics.111.136929 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 34. Chan Y, Lim ET, Sandholm N, Wang SR, McKnight AJ, Ripke S, et al. An excess of risk-increasing low-frequency variants can be a signal of polygenic inheritance in complex diseases. American Journal of Human Genetics. 2014;94(3):437–452. doi: 10.1016/j.ajhg.2014.02.006 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 35. Pritchard JK, Pickrell JK, Coop G. The Genetics of Human Adaptation: Hard Sweeps, Soft Sweeps, and Polygenic Adaptation. Current Biology. 2010;20(4):208–215. doi: 10.1016/j.cub.2009.11.055 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 36. Boyle EA, Li YI, Pritchard JK. An Expanded View of Complex Traits: From Polygenic to Omnigenic. Cell. 2017;169(7):1177–1186. doi: 10.1016/j.cell.2017.05.038 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 37. Lynch M, Walsh B. Genetics and Analysis of Quantitative Traits. 1st ed. Sinauer Associates; 1998. [Google Scholar]
- 38. Simons YB, Bullaughey K, Hudson RR, Sella G. A population genetic interpretation of GWAS findings for human quantitative traits. PLoS Biology. 2018;16. doi: 10.1371/journal.pbio.2002985 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 39. Jouganous J, Long W, Ragsdale AP, Gravel S. Inferring the joint demographic history of multiple populations: Beyond the diffusion approximation. Genetics. 2017;206(3):1549–1567. doi: 10.1534/genetics.117.200493 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 40. Chevin LM, Hospital F. Selective sweep at a quantitative trait locus in the presence of background genetic variation. Genetics. 2008;180(3):1645–1660. doi: 10.1534/genetics.108.093351 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 41. Yair S, Coop G. Population differentiation of polygenic score predictions under stabilizing selection. bioRxiv. 2021; p. 2021.09.10.459833. 10.1101/2021.09.10.459833 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 42. Hayward LK, Sella G. Polygenic adaptation after a sudden change in environment. bioRχiv. 2019. https://www.biorxiv.org/content/10.1101/792952v2. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 43. Duncan L, Shen H, Gelaye B, Meijsen J, Ressler K, Feldman M, et al. Analysis of polygenic risk score usage and performance in diverse human populations. Nature Communications. 2019;10(1). doi: 10.1038/s41467-019-11112-0 [DOI] [PMC free article] [PubMed] [Google Scholar]