PLOS Genetics. 2022 May 6;18(5):e1010170. doi: 10.1371/journal.pgen.1010170

Polygenic score accuracy in ancient samples: Quantifying the effects of allelic turnover

Maryn O Carlson 1,*, Daniel P Rice 2, Jeremy J Berg 1,2, Matthias Steinrücken 1,2,3,*
Editor: Kirk E Lohmueller 4
PMCID: PMC9116686  PMID: 35522704

Abstract

Polygenic scores link the genotypes of ancient individuals to their phenotypes, which are often unobservable, offering a tantalizing opportunity to reconstruct complex trait evolution. In practice, however, interpretation of ancient polygenic scores is subject to numerous assumptions. For one, the genome-wide association (GWA) studies from which polygenic scores are derived can only estimate effect sizes for loci segregating in contemporary populations. Therefore, a GWA study may not correctly identify all loci relevant to trait variation in the ancient population. In addition, the frequencies of trait-associated loci may have changed in the intervening years. Here, we devise a theoretical framework to quantify the effect of this allelic turnover on the statistical properties of polygenic scores as functions of population genetic dynamics, trait architecture, power to detect significant loci, and the age of the ancient sample. We model the allele frequencies of loci underlying trait variation using the Wright-Fisher diffusion, and employ the spectral representation of its transition density to find analytical expressions for several error metrics, including the expected sample correlation between the polygenic scores of ancient individuals and their true phenotypes, referred to as polygenic score accuracy. Our theory also applies to a two-population scenario and demonstrates that allelic turnover alone may explain a substantial percentage of the reduced accuracy observed in cross-population predictions, akin to those performed in human genetics. Finally, we use simulations to explore the effects of recent directional selection, a bias-inducing process, on the statistics of interest. We find that even in the presence of bias, weak selection induces minimal deviations from our neutral expectations for the decay of polygenic score accuracy. By quantifying the limitations of polygenic scores in an explicit evolutionary context, our work lays the foundation for the development of more sophisticated statistical procedures to analyze both temporally and geographically resolved polygenic scores.

Author summary

The genomes of ancient organisms document, albeit imperfectly, the migrations, admixture events, and displacements that may have occurred in a given species’ history. Researchers also use these ancient genomes to learn whether genetic changes underlie the evolution of polygenic traits, like height and disease susceptibility, which are affected by many genetic variants of small effects. Such analyses rely on models, often garnered from large-scale genetic association studies, that predict the phenotypes of ancient individuals from their genotypes. Yet, theoretical and empirical research suggests that prediction accuracy depends on the relationship between the sample in which the model was built and the sample in which it is applied. In this vein, we quantify the effects of one fundamental limitation on prediction accuracy: the fact that allele frequencies may differ across time and geographic space. As a consequence, a given prediction model may not capture all of the genetic variation relevant to phenotypic variation in the focal, ancient sample.

Introduction

Decay in linkage disequilibrium (LD) between tagging and causal sites, population stratification, variation in allele frequencies within and across populations, and environmental heterogeneity, among other factors, are all thought to negatively impact the prediction accuracy of polygenic scores (see e.g., [1–7], and more recently in humans, e.g., [8–13]). Many of these issues likely influence both within- and out-of-sample predictions, where out-of-sample may refer to an individual sampled from a distinct time or location relative to that of the GWA study. While empirical [12, 14] and simulation [1, 13, 15] or combined [16] studies have explored particular population genetic scenarios or experimental contexts, we still do not know the extent to which each of these factors compromises prediction accuracy in general.

In this work, we address an issue pertinent to out-of-sample prediction: that causal loci may have different allele frequencies in the GWA study and focal populations. Variants common in the GWA study may be rare in the focal population, and vice versa. We refer to this phenomenon as allelic turnover. Allelic turnover implies that effect estimates ported across space or time, or both, may not reflect all of the genetic variation relevant to phenotypic variation in an ancient or geographically distinct population. Allelic turnover further suggests that the statistical properties of ancient polygenic scores depend on when an ancient individual was sampled, a feature not currently accounted for in ancient DNA analyses. Similarly, statistical properties of geographically disparate polygenic scores depend on the divergence time between the GWA study and focal populations. An understanding of allelic turnover in these contexts may ultimately improve statistical analyses of temporally (e.g., [17–20]) and geographically resolved polygenic scores (e.g., [9, 10]), analyses which are increasingly commonplace.

We aim to quantify the effect of allelic turnover on the polygenic scores of such out-of-sample individuals when they are computed using effect estimates from a contemporary population. We expect that increases in ancient sampling time or divergence time will be associated with declines in polygenic score accuracy due exclusively to allelic turnover. The question is, by how much does accuracy decline? And can allelic turnover alone explain the reduced accuracy of out-of-sample predictions observed in numerous human (e.g., [15, 16]), animal (e.g., [1, 2, 4]) and plant (e.g., [21, 22]) experiments and simulation studies? The answer is likely to depend on the particular population genetic, trait, and GWA study features of the system under study [3]. We attempt to capture some important aspects of this diversity in our modeling framework.

Here, we consider a standard implementation of the polygenic score $\hat{Y}$, which attributes non-zero effects to a particular set of loci, S. An individual’s polygenic score is a weighted sum of its genotypes, where the weights are the estimated allelic effects. The loci in S and their estimated effects are usually identified in large-scale GWA studies, often performed in regional biobanks with sample sizes in the tens to hundreds of thousands of individuals (e.g., the UK Biobank [23], BioBank Japan [24]). Frequently, the set S includes loci which are approximately independent and surpass some allele frequency and p-value thresholds. Though there are numerous ways to define a polygenic score (e.g., [25, 26] and see Section 4 in S1 Text), the “prune and threshold” method is commonly used and proves analytically tractable in our framework.

Previous quantitative genetic approaches, such as [27] and [16], largely ignore the underlying population genetic dynamics. For example, Wang et al. [16] estimate the reduction in polygenic score accuracy in a focal population relative to the GWA study population as a function of the fixed population-specific trait heritabilities, allele frequencies, and LD patterns, and the estimated per-locus effects. In contrast, we embed the ancient polygenic score in an explicit population genetic framework, allowing us to take into account changes in allele frequency as well as the statistical constraint imposed by a finite GWA study sample size. And, distinct from previous approaches to the evolutionary modeling of polygenic scores [28], we track the frequencies of all loci that potentially contribute to a trait—not just the loci included in the polygenic score (i.e., loci in S).

Henceforth, we frame our study in terms of ancient polygenic scores. However, we formally demonstrate that our theoretical results apply to out-of-space polygenic scores, where the population divergence time multiplied by two is analogous to the ancient sampling time (see Fig 1 and Section 1 in S1 Text). The latter scenario can represent an ancient individual sampled from a population not directly ancestral to that of the GWA study as the two populations must have diverged at some point in the past. This scenario, to a first approximation, describes the population displacement events thought to be ubiquitous in the history of humans (e.g., [29]). However, human history is additionally characterized by numerous admixture events and population size changes (e.g., [29]) which are not yet captured within our modeling framework.

Fig 1. A population genetic model for an ancient polygenic score.

Fig 1

Panels (A) and (B) portray the two demographic scenarios encompassed by our modeling framework. In (A), the ancient individual is sampled at an earlier time τ from the same population in which the GWA study is conducted. In (B), the ancient individual is sampled at an arbitrary time τ′ from a population that split from the population in which the GWA study was conducted at some time $\tau_{\text{split}}$ in the past. The dotted line schematically relates τ′ to the ancient sampling time τ of (A), i.e., $\tau = 2\tau_{\text{split}} - \tau'$. In (C), a graphical model relates the random variables explicit and implicit in the polygenic score $\hat{Y}(\tau)$ and phenotype Y(τ) of an ancient individual sampled at time τ in the past, as in (A). Darkly shaded and thickly bordered nodes are observed quantities. Unshaded and thinly bordered nodes are unobserved. Lightly shaded nodes bordered by dashed lines denote estimated quantities. Edges denote direct dependencies between connected nodes. For example, conditional on the ancient genotype X(τ), the polygenic score $\hat{Y}(\tau)$ is independent of the population allele frequencies Z(τ). Quantities in blue are associated with the present day only, and include the population allele frequencies Z(0); the genotypes of the n individuals in the GWA study, $\{X_i\}_{i=1}^{n}$, and their phenotypes, $\{Y_i\}_{i=1}^{n}$; and the effects and intercept term estimated in the GWA study, $\hat{\beta}$ and $\hat{C}$, respectively.

We use several statistics to characterize ancient polygenic score error in distinct population genetic and GWA study scenarios. Each statistic is indexed by the ancient sampling time τ: the bias, bias(τ), mean-squared error, mse(τ), estimated additive genetic variance, $\hat{V}_A(\tau)$, and polygenic score accuracy, $\rho^2(\tau)$, which approximates the expectation of the squared sample correlation coefficient between the polygenic scores and phenotypes of an ancient sample. In addition, we can readily express these statistics as functions of the genetic divergence between the ancient and GWA study populations, as measured by the fixation index, $F_{ST}$ (Section 11 in S1 Text). We first derive general forms for these statistics that are agnostic to almost all of our modeling assumptions and which provide conceptual insights into the effects of allelic turnover. Next, we derive explicit, parameter-dependent expressions for each statistic when the trait is neutrally evolving in a population of constant size subject to recurrent mutation, a model which, for small mutation rates, approximates the infinite sites model. We take advantage of the spectral representation of the transition density function of the Wright-Fisher diffusion (tdf) to execute these computations [30–33]. We then find interpretable linear approximations for the initial rate of increase (or decrease) of the metrics under study. These approximations apply for the small ancient sampling times typical of ancient human remains (e.g., see [18]).

Consistent with our expectations, mse(τ) increases and the estimated additive genetic variance $\hat{V}_A(\tau)$ decreases with increasing sampling age τ. Although mse(τ) and $\hat{V}_A(\tau)$ measure distinct quantities (and indeed have different functional forms), our linear approximations reveal that, under our assumptions, both statistics initially change at approximately the same rate. This rate is proportional to the product of the mutation rate and the power to detect trait-associated loci in the GWA study, which, in turn, is influenced by the study size, the magnitude of the true per-locus effect, and the underlying distribution of the allele frequencies of causal loci.

Moreover, we show that polygenic score accuracy $\rho^2(\tau)$ is proportional to $\hat{V}_A(\tau)$, which, as stated, is sensitive to the GWA study and evolutionary parameters. Unlike $\hat{V}_A(\tau)$, $\rho^2(\tau)$ depends on the trait heritability $h^2$, with larger values of $h^2$ increasing its rate of decay. In contrast, for small mutation rates, relative accuracy, defined as the ratio of $\rho^2(\tau)$ to the accuracy measured in a present-day sample, $\rho^2(0)$, is insensitive to $h^2$, the true per-locus effect size, and the GWA study parameters, as long as the GWA study size n exceeds some minimum threshold. We show that this result likely holds for an arbitrary distribution of effects. Importantly, accuracy and relative accuracy decay considerably over the short time spans characteristic of ancient human samples and geographically distinct human populations.

With equal probability of detecting positive versus negative effect alleles, and under neutrality, the bias of the polygenic score is zero for all ancient sampling times. In practice, both of these conditions are likely violated. For example, detection imbalances have been observed in case-control GWA studies [34], and many polygenic traits are likely under some form of selection [35, 36]. While unequal thresholds do not precisely capture the phenomena described in [34], they do yield a non-zero bias(τ) within our framework. The magnitude of this bias is small, implying that other perturbations would be necessary to explain an observed, appreciable bias. To relax the neutrality assumption, we simulate recent directional selection. We find that when the selection coefficient is large enough (4Ns ≥ 1), selection indeed yields biased polygenic scores. Though this selection-induced bias is several orders of magnitude larger than that induced by asymmetry in the detection thresholds, it is still small relative to the variance explained by segregating genetic variants. Additionally, weak selection only induces small deviations from neutral theoretical expectations for the other statistics, suggesting that our neutral theory may still accurately capture accuracy declines in the presence of weak directional selection. Altogether, our theoretical results suggest that allelic turnover may make large contributions to out-of-sample reductions in accuracy, even under neutrality.

Model and metrics

Our modeling framework readily encompasses two demographic scenarios. In the first, the focal individual is sampled from the same population in which the GWA study was performed, but at a previous point in time τ (Fig 1A). We specify τ in coalescent time units: an ancient sampling time of τ corresponds to 2Nτ generations in the past, where 2N is the number of chromosomes in the diploid population. When τ = 0, the focal individual is an independent sample from the GWA study population. In the second scenario (Fig 1B), the focal individual is sampled at τ′ from a population that diverged from the GWA study population at $\tau_{\text{split}}$ (in coalescent time units) in the past. However, we show in Section 1 in S1 Text that scenario (A) is equivalent to scenario (B) if the ancient sampling time τ is equal to $2\tau_{\text{split}} - \tau'$. Therefore, we proceed according to the first scenario, while emphasizing that our conclusions readily translate to the second.
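The two time conversions described above can be sketched in a few lines. This is only an illustration: the function names and the numerical values are hypothetical, and the check that the focal individual is sampled after the split is an assumption of the sketch rather than a statement from the text.

```python
def generations_before_present(tau: float, N: int) -> float:
    """Convert a sampling time in coalescent units into generations (2N * tau)."""
    return 2 * N * tau


def equivalent_sampling_time(tau_split: float, tau_prime: float) -> float:
    """Map scenario (B) onto scenario (A) via tau = 2 * tau_split - tau'."""
    # Assumption of this sketch: the focal individual is sampled after the split.
    if not 0.0 <= tau_prime <= tau_split:
        raise ValueError("expected 0 <= tau_prime <= tau_split")
    return 2.0 * tau_split - tau_prime


# Hypothetical numbers, chosen only for illustration.
N = 10_000
tau = equivalent_sampling_time(tau_split=0.10, tau_prime=0.04)
print(tau, generations_before_present(tau, N))
```

Intuitively, the equivalent time is the total branch length separating the focal individual from the present-day GWA study population: back to the split, then forward to the sampling time.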

We summarize the full model in Fig 1C and detail its constituent parts in the following subsections. Briefly, the genotype of the ancient individual is sampled conditional on the population allele frequencies at τ. The ancient individual’s phenotype is then sampled conditional on its genotype. Population allele frequencies for all loci that potentially affect the trait evolve until present day, at which point the GWA study is conducted. In particular, the effect sizes included in the polygenic score model are estimated from the genotypes and phenotypes of n contemporary individuals. Finally, the ancient polygenic score is computed from the ancient individual’s genotype and the polygenic score model derived from the results of a contemporary GWA study.

Sampling the genotype of a time-indexed individual

We assume that each site is at most bi-allelic, with possible alleles A1 and A2. We denote the genotype of an individual sampled at some time t (in coalescent units) as $X_{i\ell}(t)$, where i indexes the individual and ℓ the locus. For the ancient individual(s), t = τ; for the participants in the GWA study, t = 0. For mathematical convenience, we use a symmetric genotype encoding, that is, $X_{i\ell}(t) \in \{-1, 0, 1\}$, corresponding to genotypes A1A1, A1A2, and A2A2, respectively. Conditional on the population frequency of allele A2 at t, $Z_\ell(t)$, the distribution of $X_{i\ell}(t)$ is given by the Hardy-Weinberg sampling probabilities: $P\{X_{i\ell}(t) = -1 \mid Z_\ell(t) = z\} = (1-z)^2$, $P\{X_{i\ell}(t) = 0 \mid Z_\ell(t) = z\} = 2z(1-z)$, and $P\{X_{i\ell}(t) = 1 \mid Z_\ell(t) = z\} = z^2$.
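As a minimal sketch of this sampling step, the snippet below draws a genotype under the symmetric encoding from the Hardy-Weinberg probabilities. The function names and the example frequency are hypothetical and serve only to illustrate the encoding.

```python
import random


def hardy_weinberg_probs(z: float) -> dict:
    """Genotype probabilities under the symmetric encoding, given frequency z of A2."""
    return {-1: (1.0 - z) ** 2, 0: 2.0 * z * (1.0 - z), 1: z ** 2}


def sample_genotype(z: float, rng: random.Random) -> int:
    """Draw one genotype in {-1, 0, 1} by inverting the cumulative probabilities."""
    probs = hardy_weinberg_probs(z)
    u = rng.random()
    if u < probs[-1]:
        return -1
    if u < probs[-1] + probs[0]:
        return 0
    return 1


rng = random.Random(1)  # arbitrary seed, for reproducibility only
genotypes = [sample_genotype(0.3, rng) for _ in range(5)]
print(genotypes)
```

One convenience of the symmetric encoding is that the expected genotype is 2z − 1, so a locus at frequency 1/2 has mean genotype zero.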

Modeling the true phenotype

The genetic basis of a polygenic trait, Y, is determined by a set $\mathcal{L}$ consisting of L distinct genetic loci ($|\mathcal{L}| = L$), each with a true per-locus additive effect $\beta_\ell \in \mathbb{R}$ (for ℓ = 1, 2, …, L). We further assume that the L loci contribute linearly to the trait, such that the true phenotype of the i-th individual sampled at t is specified by the commonly used additive genetic model [37],

$$Y_i(t) = C + \sum_{\ell=1}^{L} X_{i\ell}(t)\,\beta_\ell + \epsilon_i(t), \tag{1}$$

where C is a constant; $\beta_\ell$ is the true additive effect of locus ℓ; and $\epsilon_i(t) \sim \mathcal{N}(0, \sigma_e^2)$ is a normally distributed random variable that incorporates variance in the phenotype due to the environment. The summation in Eq 1 is often referred to as an individual’s genetic value [25]. A locus contributes $\pm\beta_\ell$ to the genetic value (and phenotype) of an individual who is homozygous at ℓ, and zero to that of a heterozygous individual. C is thus the phenotype of a hypothetical all-heterozygous individual. Without loss of generality, we set C = 0. In addition, we assume, without loss of generality, that all $\beta_\ell \ge 0$, such that locus ℓ contributes $-\beta_\ell$ to the genetic values of A1A1 individuals and $+\beta_\ell$ to the genetic values of A2A2 individuals.
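The additive model of Eq 1 can be illustrated with a short sketch. The function names, effect sizes, and environmental standard deviation below are hypothetical values chosen for illustration only.

```python
import random


def genetic_value(genotype, beta):
    """Sum of genotype-weighted true effects (the summation in Eq 1)."""
    return sum(x * b for x, b in zip(genotype, beta))


def sample_phenotype(genotype, beta, sigma_e, rng, C=0.0):
    """Phenotype = constant + genetic value + Gaussian environmental noise."""
    return C + genetic_value(genotype, beta) + rng.gauss(0.0, sigma_e)


rng = random.Random(2)          # arbitrary seed
genotype = [-1, 0, 1]           # A1A1, A1A2, A2A2 at three loci
beta = [0.5, 0.2, 0.1]          # hypothetical non-negative effects
print(genetic_value(genotype, beta), sample_phenotype(genotype, beta, 1.0, rng))
```

Heterozygous loci drop out of the genetic value entirely, which is why C can be read as the phenotype of an individual heterozygous at every locus.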

A fixed locus, $Z_\ell(t) \in \{0, 1\}$, will affect the mean phenotype of the population at t by $\pm\beta_\ell$ but will not contribute to phenotypic variation. We illustrate this fact by conditioning on the allele frequencies of all loci in $\mathcal{L}$ at t, $\mathbf{Z}(t) \in [0, 1]^L$. Assuming linkage equilibrium between loci as well as independence between the environmental and genetic effects, we have,

$$\mathbb{V}[Y_i(t) \mid \mathbf{Z}(t)] = 2\sum_{\ell=1}^{L} \beta_\ell^2\, Z_\ell(t)\,(1 - Z_\ell(t)) + \sigma_e^2. \tag{2}$$

The summation in Eq 2, together with its prefactor, is the additive genetic variance at t, $V_A(t)$. For a segregating site, the summand is proportional to $Z_\ell(t)(1 - Z_\ell(t))$, with $0 < Z_\ell(t)(1 - Z_\ell(t)) \le 1/4$. For a fixed site, the summand is zero and the site does not contribute to the additive genetic variance $V_A(t)$. An important feature of our model is that some of the L loci may not exhibit genetic variation in the population at a given time. More concretely, the set of loci with non-zero estimated effects on the polygenic score, S, may only be a small subset of $\mathcal{L}$. Thus, we assume that $\mathcal{L}$ is a superset of S.
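Eq 2 is simple enough to evaluate directly. The sketch below, with hypothetical function names and example values, computes the additive genetic variance and the conditional phenotypic variance from per-locus effects and allele frequencies, and shows that a fixed locus contributes nothing.

```python
def additive_genetic_variance(beta, z):
    """V_A = 2 * sum_l beta_l^2 * z_l * (1 - z_l), as in Eq 2."""
    return 2.0 * sum(b * b * zl * (1.0 - zl) for b, zl in zip(beta, z))


def phenotypic_variance(beta, z, sigma_e):
    """Conditional phenotypic variance: additive genetic plus environmental."""
    return additive_genetic_variance(beta, z) + sigma_e ** 2


beta = [0.1, 0.2]   # hypothetical effects
z = [0.5, 1.0]      # second locus is fixed and adds no variance
print(additive_genetic_variance(beta, z), phenotypic_variance(beta, z, 1.0))
```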

Constructing a model for the polygenic score

As our aim is to isolate the effects of allelic turnover on the statistical properties of polygenic scores, we make the additional assumption that the genotyped sites are the causal sites. (We have already assumed that all loci are in linkage equilibrium.) Akin to [38], we employ a simple threshold model for the effect estimates. For a GWA study consisting of n individuals (and 2n chromosomes),

$$\hat{\beta}_\ell := \begin{cases} \beta_\ell & \text{if } D_\ell \in (d_{\ell 1},\, 2n - d_{\ell 2}), \\ 0 & \text{else,} \end{cases} \tag{3}$$

where $D_\ell$ is the allele count of the trait-increasing allele A2 at the ℓ-th site in the GWA study sample, and $d_{\ell 1}$ and $d_{\ell 2}$ are the site-specific detection thresholds. In this simplified model, the true effect is estimated perfectly for all sites with allele counts within the intervals $(d_{\ell 1}, 2n - d_{\ell 2})$ for $\ell \in \mathcal{L}$. In Section 4 in S1 Text, we relate Eq 3 to two alternative estimation procedures: maximum likelihood estimation (MLE) and the best linear unbiased predictor (BLUP).

We allow the two thresholds to differ in order to encompass scenarios in which power is an asymmetric function of the sample allele frequencies, e.g., there is more power to detect low frequency ($D_\ell < n$) versus high frequency ($D_\ell > n$) trait-increasing alleles. Such situations may arise with polygenic disease inheritance and imbalanced case and control sample sizes [34]. In most cases, however, we will consider symmetric detection thresholds, with $d_{\ell 1} = d_{\ell 2} = d_\ell$. The threshold $d_\ell$ depends on the phenotypic variance, genome-wide significance threshold, true per-locus effect $\beta_\ell$, and GWA study size n. In Section 2 in S1 Text, we give an explicit form for this dependency for a continuous focal trait and equal detection thresholds. Varying $d_\ell$ while keeping the GWA sample size fixed is equivalent to varying the true per-locus effect $\beta_\ell$. Varying the GWA study size n while keeping $\beta_\ell$ and the other parameters fixed is akin to varying the GWA study’s power to detect loci of a particular effect size. In Analytical Results, we do both.
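The threshold model of Eq 3 amounts to a single comparison per locus. The following sketch uses a hypothetical function name and hypothetical threshold values; the open interval follows Eq 3 as written.

```python
def threshold_estimate(beta_l: float, D_l: int, n: int, d1: int, d2: int) -> float:
    """Return the true effect if the allele count lies in (d1, 2n - d2), else zero."""
    return beta_l if d1 < D_l < 2 * n - d2 else 0.0


n = 100                      # hypothetical GWA study size (200 chromosomes)
d1 = d2 = 5                  # symmetric detection thresholds
for count in (3, 5, 50, 194, 195, 200):
    print(count, threshold_estimate(0.3, count, n, d1, d2))
```

Setting d1 different from d2 reproduces the asymmetric-power scenario described above, in which low and high frequency trait-increasing alleles are detected with unequal probability.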

The threshold model arises in the large GWA study size n limit for the model of $\hat{\beta}_\ell$ provided in Equation 5 in S1 Text. Namely, as long as $D_\ell$ is not too small, the variance of $\hat{\beta}_\ell$ goes to zero as n grows. Thus, the threshold model in Eq 3 will necessarily underestimate the true variance of $\hat{\beta}_\ell$ (Section 4 in S1 Text). Still, this model captures the dependency of $\hat{\beta}_\ell$ on the GWA study sample size n and the true per-locus effect $\beta_\ell$, while facilitating our analytical treatment.

In order to compare the polygenic score with an individual’s true phenotype, we need to account for all sites in the mutational target $\mathcal{L}$, not just those in S, the set of sites with non-zero effect estimates in the polygenic score. As $\hat{\beta}_\ell = 0$ for any site in $\mathcal{L}$ but not S, we express the polygenic score as a function of all loci in $\mathcal{L}$. The ancient polygenic score of individual i sampled at time τ in the past is then given by,

$$\hat{Y}_i(\tau) := \hat{C} + \sum_{\ell=1}^{L} X_{i\ell}(\tau)\,\hat{\beta}_\ell, \tag{4}$$

where $\hat{C}$ is the average phenotype of the GWA sample after subtracting the estimated genetic effects at all loci,

$$\hat{C} := \bar{Y} - \sum_{\ell=1}^{L} \hat{\beta}_\ell\, \bar{X}_\ell, \tag{5}$$

with $\bar{Y} = \frac{1}{n}\sum_{j=1}^{n} Y_j$ and $\bar{X}_\ell = \frac{1}{n}\sum_{j=1}^{n} X_{j\ell}$ as the mean phenotype and the mean genotype at locus ℓ in the GWA study sample, respectively. Here, and in the remainder of our study, we omit time-indexing for random variables associated with the GWA study at t = 0. By design, the estimated intercept $\hat{C}$ absorbs the effects of all loci which were not detected as significant in the GWA study, i.e., those sites for which $\hat{\beta}_\ell = 0$. Its presence in the polygenic score of Eq 4 is necessitated by the fact that, to facilitate our analytical treatment, we neither centered nor scaled the genotypes and phenotypes in the GWA study. Importantly, all of our results are independent of this choice (Section 5 in S1 Text). Henceforth, unless otherwise noted, we refer to Eq 4 as the polygenic score and to the summation in Eq 4 as the genetic prediction.
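Eqs 4 and 5 can be illustrated with a small sketch that computes the estimated intercept from a toy GWA sample and then scores an ancient genotype. All names and numbers here are hypothetical; the toy data are not taken from the study.

```python
def estimated_intercept(gwas_genotypes, gwas_phenotypes, beta_hat):
    """C_hat = mean phenotype minus sum_l beta_hat_l * mean genotype at l (Eq 5)."""
    n = len(gwas_phenotypes)
    y_bar = sum(gwas_phenotypes) / n
    x_bar = [sum(ind[l] for ind in gwas_genotypes) / n for l in range(len(beta_hat))]
    return y_bar - sum(b * x for b, x in zip(beta_hat, x_bar))


def polygenic_score(ancient_genotype, beta_hat, C_hat):
    """Y_hat = C_hat + sum_l X_l(tau) * beta_hat_l (Eq 4)."""
    return C_hat + sum(x * b for x, b in zip(ancient_genotype, beta_hat))


# Toy GWA sample: four individuals, two loci, symmetric {-1, 0, 1} encoding.
genotypes = [[1, 0], [-1, 0], [1, 1], [1, -1]]
phenotypes = [1.0, -1.0, 1.5, 0.5]
beta_hat = [1.0, 0.0]   # second locus undetected, so its estimate is zero

C_hat = estimated_intercept(genotypes, phenotypes, beta_hat)
print(C_hat, polygenic_score([1, 1], beta_hat, C_hat))
```

When no locus is detected, every estimate is zero and the score collapses to the mean phenotype of the GWA sample, which is how the intercept absorbs the undetected effects.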

Modeling population genetic dynamics

Population genetic processes govern the correlations between allele frequencies at distinct points in time. We model this correlation using the Wright-Fisher diffusion with recurrent mutation. As we assumed all loci were in linkage equilibrium, their allele frequencies evolve forward in time independently, subject to genetic drift and mutation. At each site, alleles mutate from A1 → A2 with rate μ, and from A2 → A1 with rate ν. While our results readily generalize to arbitrary μ and ν, we restrict ourselves to equal mutation rates, μ = ν.

We further assume that the population is at equilibrium. In this setting, the marginal allele frequencies are beta-distributed, with both shape parameters equal to the population-scaled mutation rate; we denote the latter quantity by a, with a = 4Nμ = 4Nν.

The relative magnitudes of mutation and genetic drift determine which force dominates an allele frequency trajectory. For example, as a approaches 0, the effects of mutation on the frequencies of segregating mutations become negligible and genetic drift dominates. In this low mutation regime (a ≪ 1, or equivalently $\mu \ll \frac{1}{2N}$), the recurrent mutation model approximates the infinite sites model, while still retaining the features that make it attractive for our analytical treatment. In particular, the stationary allele frequency distribution is a well-defined probability distribution under the recurrent mutation model, but not under the infinite sites model. We concern ourselves almost exclusively with the low mutation regime.
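To make the low mutation regime concrete, the sketch below draws stationary allele frequencies from a symmetric beta distribution with both shape parameters equal to a (matching the stationary density proportional to z^(a−1)(1−z)^(a−1)) and takes one generation of a discrete Wright-Fisher step with symmetric mutation. The discrete step is a stand-in for the diffusion used in the analysis, not the authors' method, and all names and parameter values are hypothetical.

```python
import random


def scaled_mutation_rate(N: int, mu: float) -> float:
    """Population-scaled mutation rate a = 4 * N * mu."""
    return 4 * N * mu


def sample_stationary_frequency(a: float, rng: random.Random) -> float:
    """Draw an equilibrium allele frequency from Beta(a, a)."""
    return rng.betavariate(a, a)


def wright_fisher_generation(z: float, N: int, mu: float, rng: random.Random) -> float:
    """One generation: symmetric mutation, then binomial sampling of 2N chromosomes."""
    p = z * (1.0 - mu) + (1.0 - z) * mu
    copies = sum(1 for _ in range(2 * N) if rng.random() < p)
    return copies / (2 * N)


rng = random.Random(7)                      # arbitrary seed
a = scaled_mutation_rate(N=500, mu=1e-5)    # hypothetical values giving a << 1
draws = [sample_stationary_frequency(a, rng) for _ in range(2000)]
near_boundary = sum(1 for z in draws if z < 0.1 or z > 0.9) / len(draws)
print(a, near_boundary)
```

For a well below one, most draws fall near 0 or 1, which reflects the statement that drift dominates and most sites are fixed or nearly fixed at any given time.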

Quantifying out-of-sample prediction errors

To quantify how well the polygenic score approximates the true phenotype of an individual sampled uniformly at random from the population at time τ before the present, we use several statistics:

Bias

We define the bias as the expectation of the difference between the polygenic score and true phenotype,

$$\mathrm{bias}(\tau) := E[\hat{Y}(\tau) - Y(\tau)], \tag{6}$$

where, here and elsewhere, we omit the subscript when there is only one sample. The expectation in Eq 6 is with respect to the entire random process, encompassing the underlying population genetic dynamics, estimation of the per-locus effects in the GWA study, and computation of the ancient polygenic score (illustrated in Fig 1C).

Mean-squared error (mse)

We define the mse as the expectation of the squared prediction error,

$$\mathrm{mse}(\tau) := E[(\hat{Y}(\tau) - Y(\tau))^2]. \tag{7}$$

As in Eq 6, the expectation in Eq 7 is with respect to all sources of randomness in the model. The variance of the prediction error equals the difference of the mse and the square of the bias, and thus it is fully characterized by these two metrics.
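The relationship between the three quantities can be checked empirically. The sketch below replaces the expectations in Eqs 6 and 7 with averages over a set of paired scores and phenotypes; this is a Monte Carlo style stand-in rather than the analytical expectation, and the function name and data are hypothetical.

```python
def error_summary(y_hat, y):
    """Return (bias, mse, error variance) with expectations replaced by averages."""
    errors = [a - b for a, b in zip(y_hat, y)]
    m = len(errors)
    bias = sum(errors) / m
    mse = sum(e * e for e in errors) / m
    return bias, mse, mse - bias ** 2   # variance = mse - bias^2


# Hypothetical polygenic scores and phenotypes for three individuals.
bias, mse, err_var = error_summary([1.0, 2.0, 3.0], [0.0, 2.0, 5.0])
print(bias, mse, err_var)
```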

Expected estimated additive genetic variance ($\hat{V}_A$)

The estimated additive genetic variance is an estimate of the amount of phenotypic variance in the ancient population explained by additive genetic effects alone. We use $\hat{V}_A(\tau)$ to represent the expectation of this quantity,

$$\hat{V}_A(\tau) := \sum_{\ell=1}^{L} \hat{V}_{A\ell}(\tau) = 2\sum_{\ell=1}^{L} E\!\left[\hat{\beta}_\ell^2\, \hat{Z}_\ell(\tau)\,(1 - \hat{Z}_\ell(\tau))\right], \tag{8}$$

where $\hat{Z}_\ell(\tau)$ is an estimate of the ancient population allele frequency at locus ℓ computed from a sample of $n_a$ individuals sampled at τ. The expected true additive genetic variance, $E[V_A]$, can be found by taking the expectation of the summation in Eq 2.

Polygenic score accuracy ($\rho^2$)

Practitioners often compute the squared sample correlation coefficient $r^2$ to measure the accuracy of a predictor in a sample. Here, our sample is $n_a$ ancient individuals sampled at time τ, thus,

$$r^2(\tau) := \frac{\mathrm{Cov}[\hat{\mathbf{Y}}(\tau), \mathbf{Y}(\tau)]^2}{\mathrm{Var}[\hat{\mathbf{Y}}(\tau)]\,\mathrm{Var}[\mathbf{Y}(\tau)]}, \tag{9}$$

where Cov[⋅, ⋅] and Var[⋅] are the sample covariance and variance operators, respectively, and $\hat{\mathbf{Y}}(\tau), \mathbf{Y}(\tau) \in \mathbb{R}^{n_a}$ are the $n_a$-dimensional vectors of polygenic scores and phenotypes of the ancient individuals, respectively. Ideally, we would compute the expectation of this quantity, but this is challenging because the expectation of a ratio of random variables is generally not the ratio of their expectations. Thus, we approximate the expectation of $r^2(\tau)$ as the ratio of expectations,

$$E[r^2(\tau)] \approx \frac{E\big[\mathrm{Cov}[\hat{\mathbf{Y}}(\tau), \mathbf{Y}(\tau)]\big]^2}{E\big[\mathrm{Var}[\hat{\mathbf{Y}}(\tau)]\big]\,E\big[\mathrm{Var}[\mathbf{Y}(\tau)]\big]} =: \rho^2(\tau), \tag{10}$$

where, as above, the covariance and variances are taken with respect to the sample of $n_a$ ancient individuals, while the expectation is over all sources of randomness in Fig 1C (see Section 7.4 in S1 Text for more details). We present simulations in the section Polygenic score accuracy of Analytical Results showing that $\rho^2(\tau)$ is a good approximation for the expectation of $r^2(\tau)$ in the parameter regimes of interest.

Analytical Results

By how much does the prediction accuracy of a polygenic score decrease as the time between sampling the ancient individual and conducting the GWA study increases? To answer this question, we consider a trait potentially influenced by L genetic loci, each with true effect $\beta_\ell \ge 0$, ℓ = 1, …, L. The forward evolution of sites underlying this trait is modulated by a per-site, per-generation mutation rate, μ, and a population-scaled rate of a = 4Nμ. The diploid population of size 2N chromosomes is assumed to be at equilibrium. The parameters dictating the GWA study are the sample size n and the detection thresholds specified by $\mathbf{d}_1, \mathbf{d}_2 \in \{1, \ldots, n\}^L$. The metrics are indexed by the ancient sampling time τ in coalescent time units. An ancient sampling time of τ corresponds to 2Nτ generations in the past. We omit the time index for variables associated with the GWA study, which occurs at present day (t = 0). (We show in Section 11 in S1 Text that the metrics can also be expressed as a function of divergence, or $F_{ST}$, between the ancient and contemporary populations.)

Each subsection is structured as follows: We first derive a general expression for the statistic that does not depend on how we model the population genetic dynamics nor the GWA study. Second, we derive an analytical expression for the statistic under the population genetic assumptions and the GWA study threshold model described in Model and metrics.

Bias

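We can rewrite the sampling time-dependent bias defined in Eq 6 as,

$$\mathrm{bias}(\tau) = \sum_{\ell=1}^{L} \mathrm{bias}_\ell(\tau) = \sum_{\ell=1}^{L} E\!\left[(\bar{X}_\ell - X_\ell(\tau))\,(\beta_\ell - \hat{\beta}_\ell)\right], \tag{11}$$

where $\mathrm{bias}_\ell(\tau)$ is the contribution of locus ℓ to bias(τ). From Eq 11, we see that $\mathrm{bias}_\ell(\tau) \approx 0$ when either or both of $\hat{\beta}_\ell \approx \beta_\ell$ and $\bar{X}_\ell \approx X_\ell(\tau)$ are true. Thus, bias(τ) is minimal when (i) effect estimates are accurate, and (ii) the allele frequencies have not changed substantially in the interval [τ, 0].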

Under the assumption of equal mutation rates and detection thresholds ($d_{\ell 1} = d_{\ell 2}$), bias(τ) = 0 for τ ≥ 0 for a reason distinct from those stated above. Trait-increasing alleles at high frequencies ($D_\ell > n$) and low frequencies ($D_\ell < n$) are detected as significant ($\hat{\beta}_\ell \neq 0$) with equal probability. An equivalent assumption is that power is not affected by whether the most prevalent allele is trait-increasing or decreasing. Subsequent evolution of the allele frequencies preserves this symmetry and bias(τ) remains equal to zero for all τ. It follows that in the absence of additional perturbing forces, an estimate of the mean polygenic score from a sample of $n_a$ ancient individuals will also be unbiased, and therefore will on average accurately reflect the lack of change in the mean phenotype.

However, if we introduce asymmetry in the detection thresholds ($d_{\ell 1} \neq d_{\ell 2}$), bias(τ) is non-zero for all τ (Section 7.1 in S1 Text). Using the spectral representation of the transition density of the Wright-Fisher diffusion (tdf), we derive the per-locus contribution to the bias, $\mathrm{bias}_\ell(\tau)$ (Section 7.1 in S1 Text). For a small population-scaled mutation rate a and a large GWA study size n, we approximate this expression (given in Equation 45 in S1 Text) as,

$$\mathrm{bias}_\ell(\tau) \approx \beta_\ell \left(e^{-a\tau} - 1\right)\left(P(d_{\ell 1}) - P(d_{\ell 2})\right), \tag{12}$$

where,

$$P(d_{\ell i}) = \sum_{j=0}^{d_{\ell i} - 1} \binom{2n}{j} \frac{B(a + j,\, a + 2n - j)}{B(a, a)} \tag{13}$$

is the probability that the allele count of site ℓ is less than $d_{\ell i}$, i.e., $D_\ell < d_{\ell i}$, for i = 1, 2; and B(⋅, ⋅) is the beta function. Thus, the magnitude of $\mathrm{bias}_\ell(\tau)$ is approximately proportional to the difference in the probability of detecting high ($D_\ell > n$) versus low ($D_\ell < n$) frequency alleles, and grows with τ as $1 - e^{-a\tau}$. With a large GWA study size n and a small mutation rate a, this difference is small relative to the square root of the additive genetic variance: the ratio of these two quantities is smaller than O(a) (Fig S1a in S1 Text). This is because, when the mutation rate is small, most alleles are close to fixation or fixed. The stationary population allele frequency density $\kappa(z) \propto z^{a-1}(1-z)^{a-1}$ behaves like $z^{-1}(1-z)^{-1}$ for small a. Varying $d_{\ell i}$ then has relatively little impact on $P(d_{\ell i})$, constraining the difference between the one-sided detection probabilities (Fig S1b in S1 Text).
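The probability in Eq 13 is a beta-binomial cumulative probability and can be evaluated numerically. The sketch below works in log space for stability and also evaluates the time- and threshold-dependent factor $(e^{-a\tau} - 1)(P(d_{\ell 1}) - P(d_{\ell 2}))$ that appears in Eq 12. Function names and parameter values are hypothetical, and the sketch is not the authors' implementation.

```python
from math import exp, expm1, lgamma


def log_beta(x: float, y: float) -> float:
    """Natural log of the beta function B(x, y)."""
    return lgamma(x) + lgamma(y) - lgamma(x + y)


def log_binom(m: int, k: int) -> float:
    """Natural log of the binomial coefficient C(m, k)."""
    return lgamma(m + 1) - lgamma(k + 1) - lgamma(m - k + 1)


def prob_count_below(d: int, n: int, a: float) -> float:
    """P(D < d) for the allele count D among 2n chromosomes (Eq 13)."""
    log_norm = log_beta(a, a)
    return sum(
        exp(log_binom(2 * n, j) + log_beta(a + j, a + 2 * n - j) - log_norm)
        for j in range(d)
    )


def bias_time_factor(tau: float, d1: int, d2: int, n: int, a: float) -> float:
    """(exp(-a*tau) - 1) * (P(d1) - P(d2)), the factor appearing in Eq 12."""
    return expm1(-a * tau) * (prob_count_below(d1, n, a) - prob_count_below(d2, n, a))


n, a = 50, 0.01   # hypothetical study size and scaled mutation rate
print(prob_count_below(1, n, a), prob_count_below(5, n, a))
print(bias_time_factor(0.1, 3, 8, n, a))
```

For small a, the probability that a site is monomorphic for A1 in the sample is already close to one half, so raising the threshold changes P only slightly, consistent with the constraint on the difference between the one-sided detection probabilities noted above.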

Mean-squared error

The sampling time-dependent mean-squared error mse(τ) can be expressed as,

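$$\mathrm{mse}(\tau) = \sum_{\ell=1}^{L} \mathrm{mse}_\ell(\tau) + \left(\frac{n-1}{n}\right)\sigma_e^2 = \sum_{\ell=1}^{L} E\!\left[(X_\ell(\tau) - \bar{X}_\ell)^2 (\hat{\beta}_\ell - \beta_\ell)^2\right] + \left(\frac{n-1}{n}\right)\sigma_e^2, \tag{14}$$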

where σ_e^2 is the variance in the phenotype due to the environment (Section 7.2 in S1 Text). Note the similarity of the left term in Eq 14 to the form of bias(τ) given in Eq 11; similar heuristics apply. Under the threshold model specified in Eq 3, sites at moderate frequencies in the GWA study sample, D ∈ [d, 2n − d], will not contribute to mse(τ) since β̂ = β. Only sites with frequencies outside this interval (including sites invariant in the GWA study sample) will contribute, and their contributions will be proportional to the squared difference between X(τ) and X̄. In practice, moderate frequency loci will also contribute to mse(τ) due to errors in the estimation of the effect sizes and any difference between the ancient genotypes and the average genotypes in the GWA study sample at these sites (Section 4 in S1 Text).

We use the spectral representation of the tdf (Section 6 in S1 Text) to derive an analytical expression for mse_ℓ(τ), the per-locus contribution to the mse (Section 7.2 in S1 Text). From this expression, Equation 50 in S1 Text, we derive a linear approximation for the initial per-locus increase in this statistic, Δmse_ℓ(τ). With a symmetric detection threshold (d_1 = d_2 = d) we have,

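$$\Delta\mathrm{mse}_\ell(\tau) := \mathrm{mse}_\ell(\tau) - \mathrm{mse}_\ell(0) \approx 2\beta_\ell^2\, a\, P(d_\ell)\, \tau, \tag{15}$$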

where mse_ℓ(0) is the contribution of site ℓ to mse(τ) at τ = 0 (Equation 76 in S1 Text); and 2P(d_ℓ), with P defined in Eq 13, is the probability that the allele count of site ℓ is outside the detection interval such that β̂ = 0. Both mse_ℓ(0) and P(d_ℓ) depend on the mutation rate a, the GWA study size n, and the detection threshold d.

Δmse_ℓ(τ) reflects the time-dependent contributions of sites not detected in the GWA study. To see this, we condition on the effect estimate β̂_ℓ: mse_ℓ(τ) = β_ℓ^2 E[(X_ℓ(τ) − X̄_ℓ)^2 | β̂_ℓ = 0] · 2P(d_ℓ) + 0 · (1 − 2P(d_ℓ)). Thus, Eq 15 implies that dE[(X_ℓ(τ) − X̄_ℓ)^2 | β̂_ℓ = 0]/dτ ≈ a for small τ, and consequently, the combined effects of drift and mutation on mse(τ) are captured in the product of the mutation rate and sampling time, aτ.

In addition, Eq 15 suggests that the rate at which mse(τ) increases will be shared across parameter regimes when aP(d_ℓ) is similar (Fig S4a in S1 Text). To illustrate this, we use our analytic formula (given in Equation 50 in S1 Text) to compute mse(τ) for several low mutation rates, a ∈ {10^−4, 10^−3, 10^−2}, and three GWA study sizes, n ∈ {10^4, 10^5, 10^6} (Fig 2A). These mutation rates and sample sizes span the range of parameter values appropriate for human data. We depict our results in two ways: (i) we plot the change in mse(τ), and (ii) we plot mse(τ) normalized by the expected additive genetic variance contributed by a single site. At stationarity the expected additive genetic variance is constant and equal to,

$$E[V_A] = E\!\left[2\beta^2 Z(1-Z)\right] = \beta^2\left(\frac{a}{2a+1}\right)$$

for a scaled-mutation rate a. The plot of the former, Fig 2A, exhibits the functional relationship revealed by Eq 15, while the latter, Fig 2B, approximates the noise-to-signal ratio. In Section 9 in S1 Text, we demonstrate that Eq 15 is a good approximation to mse(τ) for τ ≤ 0.2, particularly when the GWA study size n is large (in particular, see Fig S5 in S1 Text).
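The closed form for the expected additive genetic variance can be checked numerically. This sketch assumes, as the stationary density quoted earlier implies, that the population allele frequency Z follows a Beta(a, a) distribution, for which E[Z(1 − Z)] = B(a + 1, a + 1)/B(a, a). The function names are hypothetical.

```python
import math


def log_beta(x: float, y: float) -> float:
    """Natural log of the beta function B(x, y)."""
    return math.lgamma(x) + math.lgamma(y) - math.lgamma(x + y)


def expected_va_from_moments(beta: float, a: float) -> float:
    """2 * beta^2 * E[Z(1 - Z)] for Z ~ Beta(a, a), computed from the
    beta-function identity E[Z(1 - Z)] = B(a + 1, a + 1) / B(a, a)."""
    expected_zq = math.exp(log_beta(a + 1, a + 1) - log_beta(a, a))
    return 2 * beta ** 2 * expected_zq


def expected_va_closed_form(beta: float, a: float) -> float:
    """The closed form stated in the text: beta^2 * a / (2a + 1)."""
    return beta ** 2 * a / (2 * a + 1)
```

For small a the closed form is approximately β^2 a, which is why both the mse increase and the additive genetic variance scale as O(a) in the discussion that follows.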

Fig 2. Per locus contributions to the mean-squared error and estimated additive genetic variance across sample sizes, mutation rates, and detection thresholds.

Fig 2

In (A), we plot the per-locus increase in mse, Δmse_ℓ(τ), normalized by β^2, for three mutation rates, a = 10^−4, 10^−3, 10^−2 (by color), and three sample sizes, n = 10^4, 10^5, 10^6 (by shape). For a squared effect size of β^2 = 0.01, each sample size, in part, specifies a value of d, with d = 4142, 3340, 3290, or sample allele frequencies of approximately 0.2, 0.02, and 0.002, in order of increasing sample size. In (B-C), we restrict ourselves to a = 10^−3, as the lines for different mutation rates would otherwise largely coincide. In (B), we plot mse_ℓ(τ) normalized by the expected additive genetic variance at stationarity, E[V_A] = β^2 a/(2a + 1). In (C), we fix n = 10^5 and vary the detection threshold over several orders of magnitude, d ∈ {10, …, 10^5}, plotting mse_ℓ(τ) normalized by E[V_A]. In (D-F), we repeat (A-C), but for the statistic V̂_A(τ), with the following exception: because V̂_A(τ) decreases with τ, we plot the absolute value of its difference from V̂_A(0) in (D). For all plots the ancient sampling time τ ∈ [1, 0], which corresponds to a time span of 2N generations.

To find the GWA study size specific detection thresholds used in Fig 2A and 2B, we solve Equation 11 in S1 Text for a given effect size β, phenotypic variance V_p, and significance threshold α, while varying the GWA study sample size. For β^2 = 0.01, V_p = 1, and α = 10^−8, the detection thresholds are d = 4142, 3340, 3290 in order of increasing sample size, which correspond to sample allele frequencies of approximately 0.2, 0.02, and 0.002, respectively. Thus, for a given effect size, larger sample sizes will lead to the detection of alleles at more extreme allele frequencies, while smaller samples will restrict detection to alleles at more intermediate frequencies. Due to non-identifiability, the parameter choices are fairly arbitrary.
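The correspondence between the quoted thresholds and sample allele frequencies is simple arithmetic, assuming d is an allele count among the 2n chromosomes of the GWA study sample (consistent with the detection interval [d, 2n − d] used above). A short check, with hypothetical names:

```python
# (GWA study size n, detection threshold d) pairs quoted in the text
thresholds = {10**4: 4142, 10**5: 3340, 10**6: 3290}


def threshold_frequency(d: int, n: int) -> float:
    """Sample allele frequency corresponding to an allele count of d
    among the 2n chromosomes of a diploid sample of size n."""
    return d / (2 * n)


freqs = {n: threshold_frequency(d, n) for n, d in thresholds.items()}
for n, f in freqs.items():
    print(f"n = {n:>7}: d / 2n = {f:.4f}")
```

The threshold d itself comes from Equation 11 in S1 Text, which is not reproduced here, so this sketch only converts given thresholds to frequencies.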

We find that for small mutation rates, the cumulative change in the mse, Δmse_ℓ(τ), is mostly insensitive to differences in the GWA study sample size (Fig 2A and 2B). The approximation in Eq 15 helps to explain this result. The increase is approximately proportional to 2aP(d_ℓ)τ. For small mutation rates (a ≪ 1) and an arbitrary detection threshold d, the probability of not detecting a locus as significantly associated with the trait is roughly 2P(d_ℓ) ≈ 1 for all sufficiently large n (Fig S1b in S1 Text). In this regime, increasing the GWA study sample size only yields small increases in the probability of detecting a locus as significant. Thus, for small mutation rates, the product of this quantity with the mutation rate is 2aP(d_ℓ) ≈ a, and indeed, we observe a cumulative increase in mse(τ) that is O(a) for τ = 1 (Fig 2A). We note that increasing the GWA study sample size does enable detection of loci with smaller effects.

The result in Fig 2A, however, hides the fact that a small absolute increase in mse(τ) may correspond to a substantial increase in the noise-to-signal ratio. Indeed, for a = 10^−3 (blue lines throughout), mse(τ) ultimately exceeds the expected additive genetic variance E[V_A] for all GWA study sample sizes (Fig 2B). By τ = 0.2, a sampling time characteristic of ancient humans, mse(τ) due to allelic turnover is approximately 20% of the additive genetic variance E[V_A]. For sufficiently large τ, mse(τ) is at least the same order of magnitude as the expected additive genetic variance. In addition, while mse(τ) increases at approximately the same rate irrespective of study size, its initial value mse(0) is sample size dependent (Fig 2B and see Fig S4b and S4e in S1 Text for a larger parameter space). Yet, for a given value of d, reductions in mse(0) mediated by sample size diminish once n is large enough (Fig S4b and S4e in S1 Text).
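A rough way to see where a figure of about 20% at τ = 0.2 could come from is to divide the linear approximation of Eq 15 by the stationary additive genetic variance. This is a back-of-the-envelope sketch, assuming that the linear approximation still applies at τ = 0.2, that nearly all loci go undetected (2P(d) ≈ 1), and that a ≪ 1:

```latex
\frac{\Delta\mathrm{mse}_\ell(\tau)}{E[V_A]}
  \approx \frac{2\beta^2\, a\, P(d)\, \tau}{\beta^2\, a/(2a+1)}
  = 2P(d)\,(2a+1)\,\tau
  \approx \tau ,
\qquad\text{so at } \tau = 0.2:\quad
\frac{\Delta\mathrm{mse}_\ell(0.2)}{E[V_A]} \approx 0.2 .
```

The mutation rate cancels to leading order, which is consistent with the normalized curves for different small mutation rates largely coinciding.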

Further, Fig 2A obscures the fact that different mutation rates may yield similar noise-to-signal ratios. As discussed, for small a, mse(τ) increases with τ at a rate that is O(a). For small a, the additive genetic variance is likewise O(a), yielding a relative increase that is mostly insensitive to the mutation rate. Normalized mse(0) is also similar across small mutation rates (Fig S4b and S4e in S1 Text), rendering relative mse(τ) mostly insensitive to a. We thus omitted the other two mutation rates from Fig 2B.

Lastly, we fix the GWA study sample size at n = 10^5 and vary the detection threshold d (Fig 2C). Varying d while keeping n fixed is analogous to varying the true per-locus effect size β, or keeping β fixed while varying the significance threshold α. The minimum threshold is d = 10, whereas d = n = 10^5 maximizes mse(τ) since β̂_ℓ would equal zero for all ℓ. Consistent with our analysis above, for small a, (i) mse(0) depends critically on d, while (ii) the approximately linear growth rate of mse(τ) is largely insensitive to d. Furthermore, by our previous arguments, relative mse(τ) is similar across small mutation rates, and they are also omitted in Fig 2C. For independent and identically distributed (iid) loci and σ_e^2 = 0, the per-locus mse_ℓ(τ) values presented in Fig 2B and 2C are equal to the corresponding trait-wide statistics mse(τ).

Additive genetic variance

The per-locus contribution to the expected estimated additive genetic variance V̂_A(τ) is,

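$$\hat{V}_A(\tau) = 2E\!\left[\hat{\beta}^2 \hat{Z}(\tau)\left(1 - \hat{Z}(\tau)\right)\right] = 2\left(\frac{2n_a - 1}{2n_a}\right) E\!\left[\hat{\beta}^2 Z(\tau)\left(1 - Z(\tau)\right)\right], \tag{16}$$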

where Ẑ(τ) = (1/(2n_a)) Σ_{i=1}^{n_a} (X_i(τ) + 1) is the estimated allele frequency at τ, computed in a sample of n_a ancient individuals. When β̂ = 0 or Z(τ) ∈ {0, 1}, site ℓ will not contribute to V̂_A(τ). Thus, a site has a non-zero contribution to the estimated additive genetic variance only when it is segregating at both the present day and τ. This condition is necessary for both Ẑ(τ)(1 − Ẑ(τ)) > 0 and β̂ ≠ 0 to be true.
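The finite-sample factor (2n_a − 1)/(2n_a) in Eq 16 is the usual downward bias of sample heterozygosity. The sketch below checks it by exact enumeration, under two assumptions that are ours rather than stated here: genotypes are coded as X ∈ {−1, 0, 1} (suggested by the X_i + 1 term in the estimator), and the 2n_a sampled alleles are binomially distributed given the population frequency z. Function names are hypothetical.

```python
import math


def allele_frequency_estimate(genotypes: list[int]) -> float:
    """Estimated allele frequency (1 / 2n_a) * sum_i (X_i + 1), assuming
    genotypes are coded -1, 0, 1 for the three diploid genotypes."""
    return sum(x + 1 for x in genotypes) / (2 * len(genotypes))


def expected_sample_heterozygosity(z: float, n_a: int) -> float:
    """Exact E[Zhat * (1 - Zhat)] when the allele count among the 2*n_a
    sampled chromosomes is Binomial(2*n_a, z)."""
    m = 2 * n_a
    return sum(
        math.comb(m, k) * z ** k * (1 - z) ** (m - k) * (k / m) * (1 - k / m)
        for k in range(m + 1)
    )


def finite_sample_factor(n_a: int) -> float:
    """The factor (2n_a - 1) / (2n_a) appearing in Eq 16."""
    return (2 * n_a - 1) / (2 * n_a)
```

For a sample of n_a = 100 individuals the factor is 199/200, so the correction is negligible at the sample sizes used in the simulations but matters for very small ancient samples.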

As with the two previous statistics, we use the spectral representation of the tdf to derive an analytical expression for V̂_A(τ) under our population genetic assumptions (Section 7.3 in S1 Text). The resulting expression, Equation 54 in S1 Text, indicates that the expected additive genetic variance decays exponentially. We then approximate, to first order in the ancient sampling time τ, the initial decrease in the per-locus estimated additive genetic variance ΔV̂_A(τ),

$$\Delta\hat{V}_A(\tau) := \hat{V}_A(\tau) - \hat{V}_A(0) \approx -2\left(\frac{2n_a - 1}{2n_a}\right)\beta^2\, a\, P(d)\, \tau, \tag{17}$$

where V̂_A(0) is V̂_A(τ) evaluated at τ = 0 (Equation 77 in S1 Text); and 2P(d), defined in Eq 6, is the probability that β̂ = 0. The factor due to finite sampling, (2n_a − 1)/(2n_a), is ≈ 1 when the ancient sample size n_a is large. Thus, apart from sign, ΔV̂_A(τ) is equal to Δmse(τ) of Eq 15. Therefore, for small τ, V̂_A(τ) decreases at approximately the same rate as mse(τ) increases. This result further suggests that for a ≪ 1 and a large GWA study size n, V̂_A(τ)/E[V_A] ≈ 1 − mse(τ)/E[V_A] for small τ (Fig 2C and 2F). This relationship, however, trivially breaks down for large τ, as mse(τ) is not bounded by one.

To compare V̂_A(τ) across mutation rates, we mirror our treatment of mse(τ) in the previous section. We plot (i) its increase ΔV̂_A(τ) (Fig 2D); (ii) V̂_A(τ) normalized by the expectation of the true additive genetic variance at stationarity (Fig 2E); and (iii) normalized V̂_A(τ), varying the detection threshold for a fixed GWA study sample size (Fig 2F). Akin to mse(τ), normalized V̂_A(τ) is very similar across small mutation rates. And, while the GWA study size n and the detection threshold d influence the initial estimated additive genetic variance V̂_A(0), its rate of change is mostly insensitive to the two GWA study parameters.

As V̂_A(τ) largely recapitulates our results for mse(τ) with opposing sign, we focus on their differences. Indeed, they have different functional forms and behave differently for modest or large τ (see Equations 50 and 54 in S1 Text, respectively). Conceptually, this discrepancy is not unexpected: In the previous section, we showed that a site only contributes to mse(τ) if its allele count falls outside the detection interval and β̂ = 0. Thus, mse(τ) increases with τ due to alleles shifting from intermediate frequencies in the ancient population to frequencies outside of the detection region in the contemporary population. For the expected estimated additive genetic variance V̂_A(τ), the converse is true: The slope represents the decline in V̂_A(τ) due to alleles changing from frequencies near or at fixation in the ancient population to frequencies within the detection interval in the contemporary population. While our results reveal similar functional behavior for these two quantities (with opposing signs) that applies for small τ, we caution that statements about V̂_A(τ) do not immediately translate to statements about mse(τ), particularly for τ ⪆ 0.2.

Polygenic score accuracy

While our framework, in principle, encompasses a trait with varying effect sizes, we will first assume that all sites are iid with true effect size β. Our approximation to the expectation of the sample correlation coefficient simplifies to,

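$$\rho^2(\tau) = \frac{L\beta\, E\!\left[\hat{\beta}\,(X(\tau) - \bar{X}(\tau))^2\right]}{L\beta^2\, E\!\left[(X(\tau) - \bar{X}(\tau))^2\right] + \sigma_e^2} = \frac{E\!\left[\hat{\beta}\, Z(\tau)(1 - Z(\tau))\right]/\beta}{E\!\left[Z(\tau)(1 - Z(\tau))\right] + \tilde{\sigma}_e^2}, \tag{18}$$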

where the compound parameter σ̃_e^2 = σ_e^2/(Lβ^2) is the environmental variance normalized by the product of the number of loci in the mutational target L and the squared per-locus effect size β^2 (Section 7.4 in S1 Text). By comparing Eq 18 with Eq 16, we can see that ρ^2(τ) is closely related to the estimated additive genetic variance. Thus, like V̂_A(τ), ρ^2(τ) will decrease with τ due to loci having changed from frequencies close to zero or one in the ancient population to intermediate frequencies in the contemporary population. However, unlike V̂_A(τ), ρ^2(τ) does not depend on the ancient sample size. Therefore, to relate the two statistics, we multiply by the inverse of the ancient sample size dependent factor implicit in V̂_A(τ),

$$\rho^2(\tau) = \left(\frac{2n_a}{2n_a - 1}\right)\frac{\hat{V}_A(\tau)/\beta^2}{E[V_A(\tau)]/\beta^2 + \tilde{\sigma}_e^2}. \tag{19}$$

For σ_e^2 = 0, barring the sample size factor, Eq 19 is equal to V̂_A(τ) normalized by the expected additive genetic variance. By extension, this quantity approximates the expected sample correlation coefficient r^2(τ). By invoking our additional population genetic and GWA study assumptions, we arrive at an approximation for the decrease in polygenic score accuracy,

$$\Delta\rho^2(\tau) := \rho^2(\tau) - \rho^2(0) \approx -\frac{2aP(d)\,\tau}{\dfrac{a}{2a+1} + \tilde{\sigma}_e^2}. \tag{20}$$

Now, to relate our theory to empirical and simulation studies, we compute ρ^2(τ) for a given pair of narrow-sense heritability h^2 and mutation rate a. We define h^2 for a trait with a mutational target of L loci of equal effects β,

$$h^2 := \frac{E[V_A]}{E[V_A] + \sigma_e^2} = \frac{a/(2a+1)}{a/(2a+1) + \tilde{\sigma}_e^2},$$

where the equality follows from our population genetic assumptions. Together with a, h^2 fully specifies the compound parameter σ̃_e^2 with,

$$\tilde{\sigma}_e^2 = \left(\frac{a}{2a+1}\right)\left(\frac{1 - h^2}{h^2}\right).$$
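The mapping between heritability and the compound environmental-variance parameter, and its use in the approximation of Eq 20, can be sketched as follows. The function names are hypothetical, and the formulas are our reading of the expressions above; the one-sided probability P(d) is passed in as a number rather than recomputed.

```python
def compound_env_variance(a: float, h2: float) -> float:
    """Environmental variance divided by L * beta^2, obtained from the
    heritability h2 and the scaled mutation rate a:
    (a / (2a + 1)) * (1 - h2) / h2."""
    return (a / (2 * a + 1)) * ((1 - h2) / h2)


def heritability(a: float, env_var: float) -> float:
    """Inverse mapping: h2 = g / (g + env_var), with g = a / (2a + 1)."""
    g = a / (2 * a + 1)
    return g / (g + env_var)


def delta_accuracy(tau: float, a: float, p_d: float, env_var: float) -> float:
    """Linear approximation to the change in accuracy (our reading of
    Eq 20): -2 * a * P(d) * tau / (a / (2a + 1) + env_var)."""
    return -2 * a * p_d * tau / (a / (2 * a + 1) + env_var)


# Illustrative values only: h2 = 0.5, a = 1e-3, and P(d) = 0.5
env = compound_env_variance(1e-3, 0.5)
print(delta_accuracy(0.2, 1e-3, 0.5, env))
```

Setting h2 = 1 gives a compound parameter of zero, in which case the denominator of Eq 20 reduces to the normalized additive genetic variance alone.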

We plot our analytical expressions for both accuracy (Fig 3A) and relative accuracy (Fig 3B), defined as the ratio of ρ^2(τ) to ρ^2(0), for τ ∈ [1, 0] spanning 2N generations. For humans, this time span corresponds to approximately 500,000 years in the past, encompassing the “Out-of-Africa” migration event estimated to have occurred 50,000–100,000 years ago [39]. As with the preceding statistics, when τ = 0, ρ^2(τ) approximates the accuracy of the polygenic score within the GWA study population. Relative accuracy then directly measures reductions in accuracy relative to the GWA study population. We set h^2 = 0.5 and a = 10^−3, and fix the GWA study sample size at n = 10^5. We then compute ρ^2(τ), varying the detection threshold over several orders of magnitude (Fig 3A). (See Fig S6 in S1 Text for accuracy as a function of the fixation index, or FST.) Our results for ρ^2(τ) necessarily recapitulate those of V̂_A(τ): While increasing the detection threshold d reduces accuracy substantially, it does not have a large impact on relative accuracy for n = 10^5 (Fig 3A). Indeed, for small mutation rates, relative accuracy is insensitive to the mutation rate and threshold, and is well approximated by e^(−τ) (Equation 68 in S1 Text). Thus, its derivative is also exponential. Absolute accuracy ρ^2(τ) likewise decays exponentially, but its derivative is scaled by a quantity that reflects features of the GWA study and the phenotypic variance. For a small mutation rate a ≪ 1, its derivative is approximately 2P(d)(a/(a + σ̃_e^2))e^(−τ), which, in turn, is approximately 2P(d)h^2 e^(−τ) (Equation 67 in S1 Text). The latter expression suggests that the probability of not detecting a significant association P(d) and trait heritability h^2 are the key determinants of prediction accuracy. Importantly, ρ^2(τ) declines considerably over the interval τ ∈ [1, 0] irrespective of the detection threshold d.
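To get a feel for the exponential approximation to relative accuracy, the sketch below evaluates e^(−τ) at a few sampling times. The conversion from years to τ is an assumption for illustration only: it takes the approximate figure quoted above (τ = 1 corresponding to roughly 500,000 years in humans) at face value and ignores demographic history. Function and parameter names are hypothetical.

```python
import math


def relative_accuracy(tau: float) -> float:
    """Small-mutation-rate approximation to relative accuracy,
    rho^2(tau) / rho^2(0) ~ exp(-tau)."""
    return math.exp(-tau)


def years_to_tau(years: float, years_per_unit: float = 500_000.0) -> float:
    """Convert years before present to coalescent units, assuming one
    unit of tau spans `years_per_unit` years (an illustrative value)."""
    return years / years_per_unit


for years in (10_000, 50_000, 100_000, 500_000):
    tau = years_to_tau(years)
    print(f"{years:>7} years  tau = {tau:.2f}  "
          f"relative accuracy ~ {relative_accuracy(tau):.3f}")
```

Under this approximation, τ = 0.2 gives a relative accuracy of about 0.82, and τ = 1 about 0.37.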

Fig 3. Polygenic score accuracy.

Fig 3

We plot our theoretical results for both absolute (A, main) and relative accuracy ρ^2(τ) (A, inset) for ancient sampling times τ ∈ [1, 0] (or a time span of 2N generations) with a mutation rate of a = 10^−3. The GWA study size is shared in all plots, with n = 10^5. In (A), we vary the detection threshold over the range of possible values, d ∈ {10, …, 10^5}. In (B), we compare our theoretical expectations with simulated estimates of the approximate sample correlation coefficient ρ^2(τ) (circles) and the statistic itself r^2(τ) (crosses) for a threshold of d = 10^4 (a minimum sample allele frequency of 0.05), and two values of heritability, h^2 = 0.5, 1 (in blue and gold, respectively). The ancient sample size is n_a = 100. In the inset of (B), we normalize the estimates by their initial (estimated) values. Theoretical expressions for ρ^2(τ) are also plotted in (B). Each simulated point is the average of K = 5000 simulations of L = 5000 iid loci.

In addition, we glean from Eq 18 that while heritability affects the magnitude of ρ^2(τ) through the compound parameter σ̃_e^2, it does not influence the relative accuracy, consistent with previous results [16]. Our simulations suggest that this is also true of the sample correlation coefficient, as simulated estimates of r^2(τ) agree extremely well with our theory for ρ^2(τ) (Fig 3B). We note that this result is contingent on the fact that the environmental variance σ_e^2 only enters our simple threshold model in the specification of the threshold d (Equation 11 in S1 Text), and does not contribute directly to the variance of the polygenic score (Section 7.4 in S1 Text). Therefore, we expect this result to hold only for large GWA study sample sizes for which the threshold model is a good approximation to the distribution of β̂. While the finding that relative accuracy is insensitive to the GWA study parameters relies on the assumption that all loci are iid and share a causal effect β, we provide preliminary theoretical evidence that our results will hold when β varies across loci (see Equation 69 in S1 Text and ensuing comments).

Simulation results for recent directional selection

We use simulations to explore if and how the statistics under study deviate from their neutral expectations in the presence of recent directional selection. Each copy of the A2 allele at the ℓ-th site confers a fitness advantage of +s, and so the fitness ratio of the three possible genotypes A1A1:A1A2:A2A2 is 1 : (1 + s) : (1 + 2s). In our simulations, the population evolves neutrally until the onset of selection at N generations (or τ_s = 0.5 in coalescent time units) before present. Thereafter, the population evolves according to discrete Wright-Fisher dynamics with selection.
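A single-locus version of this scheme can be sketched as below. This is an illustrative toy and not the simulation procedure of Section 3 in S1 Text: the symmetric per-generation mutation rate `mu`, the order of selection, mutation, and sampling within a generation, and all function names are assumptions. Binomial sampling is done by summing Bernoulli draws so that only the standard library is needed.

```python
import random


def deterministic_step(p: float, s: float, mu: float) -> float:
    """Expected frequency of allele A2 after one generation of genic
    selection (fitnesses 1 : 1+s : 1+2s) followed by symmetric mutation
    at per-generation rate mu (an assumed mutation scheme)."""
    # Marginal fitness of A2 is 1 + s + s*p; mean fitness is 1 + 2*s*p.
    p_sel = p * (1 + s + s * p) / (1 + 2 * s * p)
    return p_sel * (1 - mu) + (1 - p_sel) * mu


def simulate_locus(p0: float, two_n: int, s: float, mu: float,
                   generations: int, onset: int,
                   rng: random.Random) -> list[float]:
    """Forward-in-time Wright-Fisher trajectory of the A2 frequency.
    The locus is neutral until generation `onset`, after which selection
    with coefficient s acts."""
    p = p0
    trajectory = [p]
    for g in range(generations):
        s_now = s if g >= onset else 0.0
        p_expected = deterministic_step(p, s_now, mu)
        # Binomial(two_n, p_expected) sampling of the next generation
        count = sum(rng.random() < p_expected for _ in range(two_n))
        p = count / two_n
        trajectory.append(p)
    return trajectory


rng = random.Random(1)
traj = simulate_locus(p0=0.5, two_n=200, s=0.01, mu=1e-4,
                      generations=50, onset=25, rng=rng)
print(len(traj), min(traj), max(traj))
```

In this toy, the population-scaled quantities used in the text would be recovered as σ = 4Ns; relating the scaled mutation rate a to `mu` (for example through a = 4N·mu) is likewise an assumption of the sketch.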

In the presence of selection, the allele frequency distribution is no longer symmetric; rather, it is skewed toward the beneficial allele. The severity of the skew depends on the selection coefficient and mutation rate, as well as the amount of time that selection has been acting. As we restrict s to positive values, designating the A2 or + allele as beneficial, the allele frequency distribution will be skewed toward one. If we instead designated the A1 allele as the beneficial allele, the allele frequency distribution would be skewed toward zero. The former models “positive” selection whereas the latter models “negative” selection. Because bias(τ) is proportional to β, its sign will be sensitive to this choice, but its magnitude will be unaltered. The other statistics will not be affected as long as the detection thresholds are symmetric. Therefore, our results are general up to the sign of bias(τ).

We conduct simulations over a range of selection coefficients, σ = 4Ns ∈ {0, 0.1, 1, 10}, for a mutation rate of a = 10^−3. Under directional selection, σ is proportional to the locus effect size β; mutations with larger effect sizes will be more likely to establish and achieve appreciable frequencies [40]. In addition, we plot results for two different detection thresholds, d ∈ {10^3, 10^4}, in a GWA study sample of size n = 10^4. More details on the simulation procedures are provided in Section 3 in S1 Text.

When σ ≥ 1, the polygenic score is biased towards positive values for τ > 0 for both detection thresholds (Fig 4A). In other words, with directional selection acting to increase the trait value, Ŷ(τ) tends to overestimate Y(τ). The magnitude of bias(τ) depends critically on the strength of selection relative to mutation: We observe a larger bias for σ = 10 relative to σ = 1, and likewise the bias is larger for σ = 1 relative to σ = 0.1. In fact, the smaller selection coefficient σ = 0.1 is not distinguishable from neutral expectations. For 0 ≤ τ < τ_s, bias(τ) increases at an accelerating rate; for τ ≥ τ_s, bias(τ) appears constant in this parameter regime.

Fig 4. Ancient polygenic scores in the presence of genic selection.

Fig 4

We conduct K = 5000 simulations, each with a mutational target of L = 5000 loci, in a population of size 2N = 2 ⋅ 10^3, with a population-scaled mutation rate, a = 10^−3. We consider four selection coefficients, σ = 4Ns ∈ {0, 0.1, 1, 10} (indicated by color). The GWA study sample size is 2n = 2 ⋅ 10^5, with d equal to either 10^3 or 10^4. In (A-D), we plot the various simulated statistics along with their neutral expectations (solid or dashed black lines). The vertical gray lines indicate the onset of selection at τ_s = 0.5, which corresponds to N = 1000 generations. The ancient sample times are τ ∈ [1, 0], corresponding to a time span of 2N = 2000 generations. We computed, but did not plot, 95% confidence intervals for bias(τ), mse(τ), and r^2(τ), as they largely overlapped with the symbols. We note that the oscillations observed in (A) and (B) are not statistically significant.

A higher detection threshold decreases the detection probability. Thus, we expect that the magnitude of bias(τ) will increase with the detection threshold. Indeed, bias(τ) is larger and increases more quickly for the larger detection threshold d = 10^4 compared to d = 10^3 (Fig 4A). Further, our simulations suggest that the detection threshold coupled with the time of the onset of selection governs the magnitude of the bias for τ > τ_s. For some large τ, bias(τ) will reach an equilibrium value that depends approximately on the asymmetry of the detection thresholds at the present day, which, in turn, depends on both the timing and strength of selection (Section 10 in S1 Text).

The underlying allele frequency dynamics provide some insight into these patterns. Before the onset of selection, the allele frequency distribution is stationary and symmetric around 0.5. After the onset of selection, trait-increasing alleles tend to increase in frequency, skewing the distribution toward one. Thus, alleles not detected in the GWA study will tend to be at higher rather than lower frequencies at t = 0, yielding E[X̄ | β̂ = 0] > 0 for σ > 0. For large τ, the allele frequencies of sites not detected in the GWA study, i.e., with β̂ = 0, may have been substantially different in the ancient population. Each one of these sites will make a contribution to bias(τ) that is proportional to βE[(X̄ − X(τ)) | β̂ = 0] (Eq 11). Looking backward in time, the shift in the allele frequency distribution ensures that the conditional expectation of X(τ) is smaller than that of X̄, yielding a positive bias(τ) for τ > 0. Notably, the magnitude of bias(τ) induced by selection is several orders of magnitude larger than that induced by asymmetry in the detection threshold alone (Fig S1a in S1 Text).

The effects of selection on mse(τ) are qualitatively consistent with those on bias(τ) (Fig 4B). Here, however, the only selection coefficient that induces significant deviations from neutral expectations is σ = 10. In addition, mse(τ) is larger for d = 10^4 compared to d = 10^3. As with bias(τ), for 0 ≤ τ < τ_s, mse(τ) increases at an accelerating rate; before the onset of selection (τ ≥ τ_s), mse(τ) appears to increase linearly. Values of σ < 10 do not induce noticeable deviations from neutrality for the correlation coefficient ρ^2(τ) either (Fig 4C). However, strong selection (σ = 10) does lead to substantially larger reductions in accuracy relative to our neutral expectations. In addition, for σ = 10, relative accuracy is sensitive to the detection threshold, with accuracy decreasing faster for the larger detection threshold (Fig 4D).

Discussion

In this work, we devised a theoretical framework to quantify the effect of allelic turnover on the error and accuracy of out-of-sample polygenic scores. Unlike previous theoretical approaches [16, 27], we averaged over the evolutionary process governing trait evolution, the GWA study from which a polygenic score model is constructed, and the ancient individual’s genotype and phenotype. In doing so, we found explicit expressions for several commonly used metrics that depend on the focal individual’s sampling time, as well as the parameters governing the population genetic dynamics and power to detect trait-associated loci in the GWA study. Mathematical properties of the recurrent mutation model at stationarity enabled us to compute analytical expressions for the metrics of interest under neutrality, and approximations thereof.

Our analytical expressions suggest that allelic turnover alone may be responsible for large reductions in accuracy: For small mutation rates, ρ^2(τ) (and r^2(τ)) decreases substantially within short time-spans, by about 20 percent in 0.2N generations (corresponding to approximately 120,000 years in humans). In addition, increasing the detection threshold yielded lower polygenic score accuracy, as a locus was less likely to have a non-zero effect. These results are broadly consistent with a concurrent study by Yair and Coop [41], in which the authors used simulations to assess cross-population prediction accuracy, defined as the ratio of the variance of an individual’s polygenic score to that of their genetic value, under neutrality and in the presence of stabilizing selection. When Yair and Coop restricted the polygenic score to the top one percent of SNPs, roughly analogous to altering the detection threshold, they similarly found that the accuracy declined in the focal population.

Yet, while the detection threshold influenced the magnitude of the polygenic score accuracy, relative accuracy was insensitive to this parameter. In other words, under neutrality, relative accuracy is insensitive to the magnitude of the per-locus effect and only depends on the underlying allele frequency distribution. In addition, relative accuracy was independent of the size of the mutational target when the constituent loci were iid. Our theory suggests that these results will hold for arbitrary distributions of the true effect β. Consideration of several effect size distributions in a parameter regime consistent with the UK Biobank further supports this conjecture (Section 8 in S1 Text), although more work is required to fully substantiate this claim.

Selection, however, induces a dependency between an allele’s effect and its frequency, and may thereby render relative accuracy sensitive to the detection threshold. Our simulations provide preliminary evidence in support of this claim. For a small mutation rate of a = 10^−3 and a large per-locus selection coefficient σ = 4Ns = 10, relative accuracy was lower for the larger detection threshold of d = 10^4 compared to d = 10^3. Yet, the difference between detection thresholds was small relative to that induced by selection, and was negligible for smaller selection coefficients. Indeed, smaller selection coefficients (σ ≤ 1) did not yield appreciable deviations from our neutral expectations for the mse, accuracy, or relative accuracy. Therefore, excluding strong selection (σ > 1), our neutral expectations for these statistics appear to be good approximations to their true values. Our theoretical results under neutrality thus may prove an accurate description of temporally-resolved polygenic scores when polygenic adaptation is achieved by concurrent small frequency changes at numerous small effect loci, a plausible scenario [28, 35]. In addition, the simple patterns revealed by our simulations suggest that it may be possible to derive (approximate) analytic expressions for the given metrics in the presence of strong selection, when loci exhibit selective sweep-like behavior.

It is unclear whether our neutral expectations will hold in the context of more sophisticated polygenic trait modeling. In our simulation study, as in our theoretical work, we focus on dynamics at a single locus. Thus, our results are most relevant to scenarios in which single locus dynamics can be decoupled from the evolution of the mean phenotype and the genetic background [40]. Namely, the effect of an individual locus must be small relative to the mean phenotype [38, 40]. Future work will assess polygenic score accuracy under more sophisticated models of polygenic adaptation (e.g., [38, 42]).

Of the two bias-inducing processes explored, detection threshold asymmetry and directional selection, the latter induced much larger deviations from our neutral expectation for the bias, i.e., under neutrality bias(τ) = 0 for all ancient sampling times τ. In the presence of detection asymmetry, bias(τ) is approximately proportional to the difference between the one-sided detection probabilities, which in turn is constrained by the shape of the allele frequency distribution. Under neutrality, and for small mutation rates, most alleles are at very low frequencies or fixed, such that changing the detection threshold minimally influences the one-sided detection probabilities. Selection, however, perturbs the underlying allele frequency density. At equilibrium, this density is proportional to e^(σz) z^(−1)(1 − z)^(−1) for small a, where σ = 4Ns. Depending on σ, the one-sided detection probabilities may differ markedly, yielding larger values of bias(τ). We thus suspect that detection asymmetry has the potential to further exacerbate any bias induced by selection. These results are interesting in light of those of Chan et al. 2014 [34], who demonstrated that polygenic disease inheritance under the liability threshold model induced differences in the power to detect protective versus susceptible alleles. In Chan et al., this effect was further increased by imbalances in the case and control sample sizes in the GWA study. Additional work is needed to incorporate these features of case-control studies into our modeling framework.
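The degree of asymmetry implied by the equilibrium density can be illustrated by comparing its value at a frequency z with its value at the mirrored frequency 1 − z. This is a sketch assuming the small-a equilibrium form quoted above; the simulations themselves involve a recent onset of selection and need not be at equilibrium.

```latex
\kappa_\sigma(z) \propto e^{\sigma z}\, z^{-1}(1-z)^{-1}
\quad\Longrightarrow\quad
\frac{\kappa_\sigma(1-z)}{\kappa_\sigma(z)} = e^{\sigma(1-2z)}
\;\xrightarrow[\;z \to 0\;]{}\; e^{\sigma},
\qquad
e^{0.1} \approx 1.1,\quad e^{1} \approx 2.7,\quad e^{10} \approx 2.2\times 10^{4}.
```

Under these assumptions, the density near fixation of the beneficial allele exceeds that near its loss by a factor that is close to one for σ = 0.1 but of order 10^4 for σ = 10, in line with weak selection being hard to distinguish from neutrality.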

The effects of selection on the bias have implications for assessments of mean differences between ancient polygenic scores from distinct time points. In particular, our results suggest that sufficiently strong positive directional selection will lead to overestimation of the difference between the polygenic scores of ancient individuals sampled before and after the onset of selection. Likewise, in the presence of negative selection, the polygenic score will underestimate this difference. At the same time, as discussed above, estimation error (as measured by mse(τ)) increases and accuracy (as measured by ρ^2(τ)) decreases as the ancient sampling time increases.

Our results clarify relationships between various commonly used metrics of prediction error and accuracy. For example, we demonstrated an approximate functional relationship between the mean-squared error mse(τ) and the expected additive genetic variance V̂_A(τ) that applies for small ancient sampling times and mutation rates. This shared initial rate emerged despite fundamental differences between these statistics: mse(τ) measures error due to variants near or at fixation in the contemporary sample, which were segregating at intermediate frequencies in the ancient sample. In contrast, V̂_A(τ) measures error due to variants segregating in the contemporary sample, which were near or at fixation in the ancient sample. This conceptual result does not rely on any of our population genetic or GWA modeling assumptions, and perhaps could be exploited to learn about the genetic architecture of quantitative traits from multi-population data. In addition, we showed formally that polygenic score accuracy ρ^2(τ), an approximation to the expectation of the sample correlation coefficient r^2(τ), is proportional to the ratio of V̂_A(τ) to the total phenotypic variance. We believe that these relations, and their evolutionary and GWA study dependent forms, may facilitate the development of novel, more principled statistical procedures for the analysis of out-of-sample polygenic scores.

At the same time, the simplifying assumptions underlying our results indicate that significant challenges remain. For one, our model does not incorporate the complex demographic processes, such as admixture and population size changes, inherent in human history. This implies that an ancient sampling time of t years in the past likely does not correspond to a sampling time of τ = t/2N in our model, where 2N is the contemporary population size. Indeed, allelic turnover cannot explain all of the reductions in accuracy observed in out-of-sample predictions in humans. For example, our neutral theory predicts an approximately fifty percent reduction in accuracy when FST between the focal and GWA study populations is comparable to African-European divergence (FST ≈ 0.1). This prediction overestimates the accuracy for height in a sample of individuals with African ancestry more severely than the predictions of Wang et al. [16], which take into account both LD and allele frequency changes (Section 12 in S1 Text). Thus, to achieve the same accuracy reductions observed in both simulated (e.g., [15, 16]) and empirical (e.g., [14, 16, 43]) studies of cross-population polygenic scores for contemporary humans, allelic turnover under neutrality would require population divergence times that far exceed their estimated values (Fig S7 in S1 Text).
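To make the time scaling concrete, the sketch below converts a calendar age into the diffusion time τ under a constant population size, and reports the standard pure-drift expectation for the fraction of heterozygosity lost over that time, 1 − exp(−τ). This is an illustration only: the generation time of 29 years and the diploid size N = 10,000 are assumed values that do not come from the paper, the function names are hypothetical, and the drift quantity is not necessarily the FST definition used in S1 Text.

```python
import math


def scaled_time(years, generation_time=29.0, n_diploid=10_000):
    """Convert calendar years to tau = t / (2N), with t measured in generations.

    generation_time and n_diploid are assumed, illustrative values.
    """
    generations = years / generation_time
    return generations / (2 * n_diploid)


def drift_f(tau):
    """Expected fraction of heterozygosity lost after tau units of pure drift."""
    return 1.0 - math.exp(-tau)


# The same calendar age maps to different tau under different assumed sizes,
# which is why a single constant-N conversion can mislead.
for n in (5_000, 10_000, 20_000):
    tau = scaled_time(10_000, n_diploid=n)
    print(f"N = {n:>6}: tau = {tau:.4f}, drift F = {drift_f(tau):.4f}")
```

Because τ scales inversely with N, halving the assumed population size doubles the scaled age of the same sample. Population size changes and admixture therefore make a single conversion factor unreliable.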

Differences in LD between contemporary human populations may largely explain this discrepancy as most trait-associated loci are likely to be tagging rather than causal sites [12, 16]. As with geographically distinct populations, if LD between the genotyped and causal sites differed in the ancient population, then polygenic score accuracy would suffer [1]. We did not model this effect and assumed that the genotyped site was the causal site. This assumption may be justified when ancient sampling or population divergence times are recent, as high marker density in the GWA study may mitigate accuracy losses due to LD decay, but more theoretical work is required to substantiate this claim. While our framework can readily incorporate LD, it is difficult to obtain analytical results when the genotyped marker is not the causal site. In lieu of theoretical results, large-scale simulations in simple population genetic scenarios may provide insight into the relative contributions of LD—which depends on the allele frequencies of the tagging and causal sites—and allelic turnover to declines in polygenic score accuracy.

Furthermore, our assumption of linkage equilibrium between loci roughly equates to assuming that each LD block contains only a single causal site. Thus, our results will be most applicable to traits with relatively sparse genetic architectures for which the distance between any two causal loci is large compared to the scale of LD. In contrast, when the trait architecture is dense, a large number of variants have non-zero effect on the trait. Causal sites in close proximity are necessarily linked, and our assumption of linkage equilibrium would be violated. In addition, under a dense trait architecture, the “prune and threshold” polygenic score described herein may achieve lower accuracy than a best linear unbiased predictor (BLUP) that allows all segregating loci to have non-zero effects. In Section 4 in S1 Text, we speculate on the accuracy of BLUP in the context of our modeling framework when the trait has a dense architecture.

In addition, we assumed that per-locus causal effects were shared by the ancient and contemporary samples. Differences in causal effects across contemporary populations, perhaps due to changes in the environment, epistasis, or gene-by-environment interactions, likely contribute to accuracy reductions [8, 12]. Indeed, Cox et al. [18] found that trends in the polygenic scores of temporally disparate ancient samples did not always recapitulate those of the true phenotype. We conjecture that fluctuations in the per-locus effects would increase mse(τ) and decrease accuracy, but not profoundly alter our conclusions. Perhaps, if the fluctuations were asymmetric, e.g., effect sizes tended to increase in time, then bias(τ) may be non-zero under neutrality. Population stratification in the GWA study population may also lead to biased ancient polygenic scores, as has been observed in cross-population predictions in humans [9, 10]. Lastly, technical challenges inherent to the extraction and sequencing of ancient DNA often result in noisy estimates of the ancient genotypes. This additional source of randomness is likely to reduce accuracy and increase mse(τ), but otherwise should not substantially alter our conclusions.

Supporting information

S1 Text. Extended model, methods, and results.

This supplementary text contains detailed derivations and additional analyses.

(PDF)

Acknowledgments

We thank members of the Berg, Novembre, and Steinrücken labs, and the Cummings fourth floor for helpful discussions throughout the development of this project. In addition, we thank Jennifer Blanc, Adam Fine, Evan Koch, Zachary Miller, and John Novembre for comments on earlier (or very early) versions of this manuscript. We also give a special thanks to Carlos A. Serván and Micol Tresoldi for numerous insightful discussions over the course of this project.

Data Availability

All code to generate the results is available at https://github.com/marync/ancient_polygenic.

Funding Statement

MOC received funding from a National Institutes of Health training grant T32 GM07197. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. Habier D, Fernando RL, Dekkers JCM. The impact of genetic relationship information on genome-assisted breeding values. Genetics. 2007;177(4):2389–2397. doi: 10.1534/genetics.107.081190
  • 2. De Roos APW, Hayes BJ, Spelman RJ, Goddard ME. Linkage disequilibrium and persistence of phase in Holstein-Friesian, Jersey and Angus cattle. Genetics. 2008;179(3):1503–1512. doi: 10.1534/genetics.107.084301
  • 3. Hamblin MT, Buckler ES, Jannink JL. Population genetics of genomics-based crop improvement methods. Trends in Genetics. 2011;27(3):98–106. doi: 10.1016/j.tig.2010.12.003
  • 4. Erbe M, Hayes BJ, Matukumalli LK, Goswami S, Bowman PJ, Reich CM, et al. Improving accuracy of genomic predictions within and between dairy cattle breeds with imputed high-density single nucleotide polymorphism panels. Journal of Dairy Science. 2012;95(7):4114–4129. doi: 10.3168/jds.2011-5019
  • 5. Carlson CS, Matise TC, North KE, Haiman CA, Fesinmeyer MD, Buyske S, et al. Generalization and Dilution of Association Results from European GWAS in Populations of Non-European Ancestry: The PAGE Study. PLoS Biology. 2013;11(9):e1001661. doi: 10.1371/journal.pbio.1001661
  • 6. Wray NR, Yang J, Hayes BJ, Price AL, Goddard ME, Visscher PM. Pitfalls of predicting complex traits from SNPs. Nature Reviews Genetics. 2013;14(7):507–515. doi: 10.1038/nrg3457
  • 7. Guo Z, Tucker DM, Basten CJ, Gandhi H, Ersoz E, Guo B, et al. The impact of population structure on genomic prediction in stratified populations. TAG Theoretical and applied genetics. 2014;127(3):749–762. doi: 10.1007/s00122-013-2255-x
  • 8. Galinsky KJ, Reshef YA, Finucane HK, Loh PR, Zaitlen N, Patterson NJ, et al. Estimating cross-population genetic correlations of causal effect sizes. Genetic Epidemiology. 2019;43(2):180–188. doi: 10.1002/gepi.22173
  • 9. Berg JJ, Harpak A, Sinnott-Armstrong N, Joergensen AM, Mostafavi H, Field Y, et al. Reduced signal for polygenic adaptation of height in UK Biobank. eLife. 2019;8. doi: 10.7554/eLife.39725
  • 10. Sohail M, Maier RM, Ganna A, Bloemendal A, Martin AR, Turchin MC, et al. Polygenic adaptation on height is overestimated due to uncorrected stratification in genome-wide association studies. eLife. 2019;8:1–17. doi: 10.7554/eLife.39702
  • 11. Mostafavi H, Harpak A, Agarwal I, Conley D, Pritchard JK, Przeworski M. Variable prediction accuracy of polygenic scores within an ancestry group. eLife. 2020;9. doi: 10.7554/eLife.48376
  • 12. Bitarello BD, Mathieson I. Polygenic scores for height in admixed populations. G3: Genes, Genomes, Genetics. 2020;10(11):4027–4036. doi: 10.1534/g3.120.401658
  • 13. Durvasula A, Lohmueller KE. Negative selection on complex traits limits phenotype prediction accuracy between populations. American Journal of Human Genetics. 2021;108(4):620–631. doi: 10.1016/j.ajhg.2021.02.013
  • 14. Martin AR, et al. Human Demographic History Impacts Genetic Risk Prediction across Diverse Populations. American Journal of Human Genetics. 2017;100:635–649. doi: 10.1016/j.ajhg.2017.03.004
  • 15. Ragsdale AP, Nelson D, Gravel S, Kelleher J. Lessons Learned from Bugs in Models of Human History. American Journal of Human Genetics. 2020;107(4):583–588. doi: 10.1016/j.ajhg.2020.08.017
  • 16. Wang Y, Guo J, Ni G, Yang J, Visscher PM, Yengo L. Theoretical and empirical quantification of the accuracy of polygenic scores in ancestry divergent populations. Nature Communications. 2020;11(1). doi: 10.1038/s41467-020-17719-y
  • 17. Swarts K, Gutaker RM, Benz B, Blake M, Bukowski R, Holland J, et al. Genomic estimation of complex traits reveals ancient maize adaptation to temperate North America. Science. 2017;357(6350):512–515. doi: 10.1126/science.aam9425
  • 18. Cox SL, Ruff CB, Maier RM, Mathieson I. Genetic contributions to variation in human stature in prehistoric Europe. Proceedings of the National Academy of Sciences of the United States of America. 2019;116(43):21484–21492. doi: 10.1073/pnas.1910606116
  • 19. Colbran LL, Gamazon ER, Zhou D, Evans P, Cox NJ, Capra JA. Inferred divergent gene regulation in archaic hominins reveals potential phenotypic differences. Nature Ecology and Evolution. 2019;3(11):1598–1606. doi: 10.1038/s41559-019-0996-x
  • 20. Cox SL, Moots H, Stock JT, Shbat A, Bitarello BD, Haak W, et al. Predicting skeletal stature using ancient DNA. bioRxiv. 2021; p. 2021.03.31.437877. doi: 10.1101/2021.03.31.437877
  • 21. Windhausen VS, Atlin GN, Hickey JM, Crossa J, Jannink JL, Sorrells ME, et al. Effectiveness of genomic prediction of maize hybrid performance in different breeding populations and environments. G3: Genes, Genomes, Genetics. 2012;2(11):1427–1436. doi: 10.1534/g3.112.003699
  • 22. Lorenz AJ, Smith KP, Jannink JL. Potential and optimization of genomic selection for Fusarium head blight resistance in six-row barley. Crop Science. 2012;52(4):1609–1621. doi: 10.2135/cropsci2011.09.0503
  • 23. Bycroft C, Freeman C, Petkova D, Band G, Elliott LT, Sharp K, et al. The UK Biobank resource with deep phenotyping and genomic data. Nature. 2018;562(7726):203–209. doi: 10.1038/s41586-018-0579-z
  • 24. Kanai M, Akiyama M, Takahashi A, Matoba N, Momozawa Y, Ikeda M, et al. Genetic analysis of quantitative traits in the Japanese population links cell types to complex human diseases. Nature Genetics. 2018;50(3):390–400. doi: 10.1038/s41588-018-0047-6
  • 25. Meuwissen TH, Hayes BJ, Goddard ME. Prediction of total genetic value using genome-wide dense marker maps. Genetics. 2001;157(4):1819–1829. doi: 10.1093/genetics/157.4.1819
  • 26. de los Campos G, Hickey JM, Pong-Wong R, Daetwyler HD, Calus MPL. Whole-genome regression and prediction methods applied to plant and animal breeding. Genetics. 2013;193(2):327–345. doi: 10.1534/genetics.112.143313
  • 27. Daetwyler HD, Villanueva B, Woolliams JA. Accuracy of predicting the genetic risk of disease using a genome-wide approach. PLoS ONE. 2008;3(10). doi: 10.1371/journal.pone.0003395
  • 28. Berg JJ, Coop G. A Population Genetic Signal of Polygenic Adaptation. PLoS Genetics. 2014;10(8):1004412. doi: 10.1371/journal.pgen.1004412
  • 29. Liu Y, Mao X, Krause J, Fu Q. Insights into human history from the first decade of ancient human genomics. Science. 2021;373(6562):1479–1484. doi: 10.1126/science.abi8202
  • 30. Ewens WJ. Mathematical Population Genetics I: Theoretical Introduction. New York: Springer-Verlag; 2004.
  • 31. Durrett R. Probability Models for DNA Sequence Evolution. 2nd ed. New York: Springer-Verlag; 2008.
  • 32. Griffiths RC, Spano D. Diffusion processes and coalescent trees. arXiv. 2010. http://arxiv.org/abs/1003.4650.
  • 33. Song YS, Steinrücken M. A simple method for finding explicit analytic transition densities of diffusion processes with general diploid selection. Genetics. 2012;190(3):1117–1129. doi: 10.1534/genetics.111.136929
  • 34. Chan Y, Lim ET, Sandholm N, Wang SR, McKnight AJ, Ripke S, et al. An excess of risk-increasing low-frequency variants can be a signal of polygenic inheritance in complex diseases. American Journal of Human Genetics. 2014;94(3):437–452. doi: 10.1016/j.ajhg.2014.02.006
  • 35. Pritchard JK, Pickrell JK, Coop G. The Genetics of Human Adaptation: Hard Sweeps, Soft Sweeps, and Polygenic Adaptation. Current Biology. 2010;20(4):208–215. doi: 10.1016/j.cub.2009.11.055
  • 36. Boyle EA, Li YI, Pritchard JK. An Expanded View of Complex Traits: From Polygenic to Omnigenic. Cell. 2017;169(7):1177–1186. doi: 10.1016/j.cell.2017.05.038
  • 37. Lynch M, Walsh B. Genetics and Analysis of Quantitative Traits. 1st ed. Sinauer Associates; 1998.
  • 38. Simons YB, Bullaughey K, Hudson RR, Sella G. A population genetic interpretation of GWAS findings for human quantitative traits. PLoS Biology. 2018;16. doi: 10.1371/journal.pbio.2002985
  • 39. Jouganous J, Long W, Ragsdale AP, Gravel S. Inferring the joint demographic history of multiple populations: Beyond the diffusion approximation. Genetics. 2017;206(3):1549–1567. doi: 10.1534/genetics.117.200493
  • 40. Chevin LM, Hospital F. Selective sweep at a quantitative trait locus in the presence of background genetic variation. Genetics. 2008;180(3):1645–1660. doi: 10.1534/genetics.108.093351
  • 41. Yair S, Coop G. Population differentiation of polygenic score predictions under stabilizing selection. bioRxiv. 2021; p. 2021.09.10.459833. doi: 10.1101/2021.09.10.459833
  • 42. Hayward LK, Sella G. Polygenic adaptation after a sudden change in environment. bioRxiv. 2019. https://www.biorxiv.org/content/10.1101/792952v2
  • 43. Duncan L, Shen H, Gelaye B, Meijsen J, Ressler K, Feldman M, et al. Analysis of polygenic risk score usage and performance in diverse human populations. Nature Communications. 2019;10(1). doi: 10.1038/s41467-019-11112-0

Decision Letter 0

Bret Payseur, Kirk E Lohmueller

24 Nov 2021

Dear Dr Steinrücken,

Thank you very much for submitting your Research Article entitled 'Polygenic score accuracy in ancient samples: quantifying the effects of allelic turnover' to PLOS Genetics.

The manuscript was fully evaluated at the editorial level and by independent peer reviewers. The reviewers appreciated the attention to an important topic but identified some concerns that we ask you address in a revised manuscript.

We therefore ask you to modify the manuscript according to the review recommendations. Your revisions should address the specific points made by each reviewer.

In addition we ask that you:

1) Provide a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript.

2) Upload a Striking Image with a corresponding caption to accompany your manuscript if one is available (either a new image or an existing one from within your manuscript). If this image is judged to be suitable, it may be featured on our website. Images should ideally be high resolution, eye-catching, single panel square images. For examples, please browse our archive. If your image is from someone other than yourself, please ensure that the artist has read and agreed to the terms and conditions of the Creative Commons Attribution License. Note: we cannot publish copyrighted images.

We hope to receive your revised manuscript within the next 30 days. If you anticipate any delay in its return, we would ask you to let us know the expected resubmission date by email to plosgenetics@plos.org.

If present, accompanying reviewer attachments should be included with this email; please notify the journal office if any appear to be missing. They will also be available for download from the link below. You can use this link to log into the system when you are ready to submit a revised version, having first consulted our Submission Checklist.

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Please be aware that our data availability policy requires that all numerical data underlying graphs or summary statistics are included with the submission, and you will need to provide this upon resubmission if not already present. In addition, we do not permit the inclusion of phrases such as "data not shown" or "unpublished results" in manuscripts. All points should be backed up by data provided with the submission.

To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.

PLOS has incorporated Similarity Check, powered by iThenticate, into its journal-wide submission system in order to screen submitted content for originality before publication. Each PLOS journal undertakes screening on a proportion of submitted articles. You will be contacted if needed following the screening process.

To resubmit, you will need to go to the link below and 'Revise Submission' in the 'Submissions Needing Revision' folder.

[LINK]

Please let us know if you have any questions while making these revisions.

Yours sincerely,

Kirk E Lohmueller

Guest Editor

PLOS Genetics

Bret Payseur

Section Editor: Evolution

PLOS Genetics

Thank you for submitting this work to PLOS Genetics. It has been evaluated by 3 reviewers. All generally liked the manuscript. The reviewers provide a number of comments to improve the manuscript. None stood out as being more critical than others. I think addressing these comments, even those that were presented as suggestions/not required by the reviewers, in your revision will improve the readability and applicability of your work.

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: This manuscript by Carlson and colleagues considers the effect of "allelic turnover" on polygenic scores fit to ancient samples. "Allelic turnover," as the authors define it, is relevant to the problem of polygenic score portability in part because causal variants that explain trait variation in an ancient sample may experience frequency changes that make them undetectable in modern GWAS. The authors use a diffusion approach to study several summaries of polygenic score performance. I have not checked all the equations, but the work appears to be well done. I have a few comments, which I list not as requirements for publication (which I leave up to the editor), but rather as friendly reactions and suggestions.

1) The model here assumes that the ancient sample is from a population directly ancestral to a contemporary population, when in fact many ancient samples may not fit this description straightforwardly. (See e.g. Schraiber 2018, PMID: 29167200). I think it is totally fair for the authors to model things the way they do, as it captures a lot of the intuition and allows them just to think about one forward diffusion. However, the paper might benefit from a clearer discussion of this potential caveat.

2) Along similar lines to (1), I was struck by how much of the framing is in terms of ancient samples when a lot of the intuitions developed are relevant to pairs of contemporary populations. The model is more straightforwardly related to ancient populations (though see (1)), but I think it'd be interesting for the authors to give more space to thinking about the relevance for contemporary populations. There is a paragraph on this in the discussion, but only a hint of a quantitative comparison is given, and as far as I can tell mainly for empirical results. It would be useful to include a comparison with the modeling approach of Wang, Guo et al (ref 16).

3) arXiv:1909.00892 has a nice discussion of the relevance of allelic turnover to polygenic score comparisons as proxies for phenotype comparisons. This is captured in the various summaries studied, but I would have liked more emphasis on distinguishing within-population accuracy vs. getting changes in the mean over time correct in the discussion.

4) A new preprint by Yair & Coop is relevant for this study and takes a complementary modeling approach https://doi.org/10.1101/2021.09.10.459833 . (I know what this is starting to look like, and I swear I'm not Graham Coop.)

5) I found the explanations of the results in the main text to be, in some places, a little wordy and hard to follow. I think it is worthwhile to communicate the parts of the intuition that have to do with allele frequencies changing into or out of the GWAS-detectable range, but I got lost in some of the verbal explanations. I also thought the summary in the intro was in too much depth for not having seen the model yet. The repeated structure of the results subsections, though helpful for comprehension, also gets to feel redundant. I would urge the authors to cut down the main text, lean more on their figures, and even possibly make use of tables to summarize some of the main results (e.g. in the intro).

6) I appreciated that the authors brought in literature on genetic prediction outside humans.

7) The final paragraph is perhaps missing a mention of environmental differences and GxE.

Minor comments

a) line 139, I assume that N is a diploid population size because time units are scaled in 2N generations, but please specify.

b) line 167, is it meant that this quantity is often called a "genetic value" specifically in this paper? I have not seen it much in other places. If it's meant that it's a general term, please provide a citation.

c) p. 5, I might have missed it, but I don't see it explicitly stated here that eq. 2 assumes linkage equilibrium as well. (It shows up in the beginning of 2.3 but should probably be here first.)

Reviewer #2: In this manuscript, Carlson et al. provide a masterful assessment of polygenic score accuracy decreases due to allelic turnover, in the context of ancient populations that are ancestral to the GWAS panel on which a phenotype was measured. They quantify this decrease as a function of various parameters of interest, including the temporal distance between the ancient and present-day population, the power to detect significantly-associated loci, and trait heritability. They also produce simulations to determine how much weak selection on trait-associated variants biases the quantities of interest.

The manuscript is well written and it is clear that the authors have done a great deal of work to reach their conclusions, and evaluate various quantities of interest for geneticists to assess how reliable a score for an ancient individual would be. I only have very minimal suggestions for changes or additions to the manuscript, that I don't deem crucial for the manuscript to be accepted for publication:

To provide a clearer picture for researchers choosing to do (or not do) polygenic prediction on ancient samples, it would be nice to work with values from a concrete phenotype and GWAS study: given the size and heritability estimates from the latest UK Biobank GWAS or Biobank Japan on a particular phenotype (e.g. height), how much would one mis-estimate this phenotype in a population that is X years away from UKBB or BBJ, in the past, solely due to allelic turnover?

Rarely are polygenic scores in the ancient DNA literature computed for populations that are directly ancestral to the present-day GWAS population, as there has been substantial population turnover and admixture in human populations during the Holocene and Pleistocene. It might be worth emphasizing this more strongly in the Discussion so that the reader is aware that a population sampled X years in the past (in the same location as a present-day population) does not necessarily correspond to your scenario of an ancestral population that is separated by X years from the present-day GWAS populations.

In this vein, the manuscript would benefit from moving some of the results from Supplementary Text S1.1 to the main text (at a minimum the figure and the final conclusion), as those would be more generally applicable to the larger set of researchers not necessarily working with directly ancestral populations, or working with diverged populations from the GWAS one, outside the ancient genomic literature.

Following up on this, could the authors also express the turnover-caused decrease in accuracy in terms of Fst values, rather than the ancient sampling time in coalescent units? This would make it a lot easier for an empirical researcher to assess how much of a decrease in accuracy due to allelic turnover they would expect for their ancient (or out-of-space) polygenic predictions, given that Fst values between present-day and ancient populations are readily obtainable (assuming stationarity and that allelic turnover is the only contributor to the decrease in accuracy).

I would be curious to see how accuracy decreases with increasing the number of ancient individuals used in the computation of polygenic score, e.g. how much more / less problematic is it to compute the mean polygenic score for a population of ancient hunter-gatherers vs. computing a polygenic score individually for each ancient hunter-gatherer in the population. How is the variance for these different estimates affected by the sampling time?

Beyond recent directional selection, could the authors provide some intuition as to how persistent negative selection would affect their results? i.e. How would these be affected when the magnitude of the effect size estimates determines the probability that the individual variants will be removed from the population?

Reviewer #3: See attached

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Large-scale datasets should be made available via a public repository as described in the PLOS Genetics data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: Yes: Fernando Racimo

Reviewer #3: Yes: Luke J. O'Connor

Attachment

Submitted filename: Maryn_PG_PRS_review.docx

Decision Letter 1

Bret Payseur, Kirk E Lohmueller

3 Mar 2022

Dear Dr Steinrücken,

Thank you very much for submitting your Research Article entitled 'Polygenic score accuracy in ancient samples: quantifying the effects of allelic turnover' to PLOS Genetics.

The manuscript was fully evaluated at the editorial level and by independent peer reviewers. The reviewers appreciated the attention to an important topic but identified some concerns that we ask you address in a revised manuscript.

We therefore ask you to modify the manuscript according to the review recommendations. Your revisions should address the specific points made by each reviewer.

In addition we ask that you:

1) Provide a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript.

2) Upload a Striking Image with a corresponding caption to accompany your manuscript if one is available (either a new image or an existing one from within your manuscript). If this image is judged to be suitable, it may be featured on our website. Images should ideally be high resolution, eye-catching, single panel square images. For examples, please browse our archive. If your image is from someone other than yourself, please ensure that the artist has read and agreed to the terms and conditions of the Creative Commons Attribution License. Note: we cannot publish copyrighted images.

We hope to receive your revised manuscript within the next 30 days. If you anticipate any delay in its return, we would ask you to let us know the expected resubmission date by email to plosgenetics@plos.org.

If present, accompanying reviewer attachments should be included with this email; please notify the journal office if any appear to be missing. They will also be available for download from the link below. You can use this link to log into the system when you are ready to submit a revised version, having first consulted our Submission Checklist.

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, log in and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Please be aware that our data availability policy requires that all numerical data underlying graphs or summary statistics are included with the submission, and you will need to provide this upon resubmission if not already present. In addition, we do not permit the inclusion of phrases such as "data not shown" or "unpublished results" in manuscripts. All points should be backed up by data provided with the submission.

To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.

PLOS has incorporated Similarity Check, powered by iThenticate, into its journal-wide submission system in order to screen submitted content for originality before publication. Each PLOS journal undertakes screening on a proportion of submitted articles. You will be contacted if needed following the screening process.

To resubmit, you will need to go to the link below and 'Revise Submission' in the 'Submissions Needing Revision' folder.

[LINK]

Please let us know if you have any questions while making these revisions.

Yours sincerely,

Kirk E Lohmueller

Guest Editor

PLOS Genetics

Bret Payseur

Section Editor: Evolution

PLOS Genetics

Thank you for submitting your revised manuscript. Two of the three reviewers are satisfied with the changes. Reviewer 3 has some additional comments on the BLUP model in the Supplement. Please provide some qualitative description of what would change under a BLUP-type estimator with no threshold.

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: The authors have addressed my first-round comments sufficiently, and I have no further comments. This was a nice paper even before the revision.

Reviewer #2: The authors have addressed all of my concerns. I recommend this paper for acceptance.

Reviewer #3: In their revised manuscript, Carlson et al. have addressed most of my comments from the initial submission, but I was not completely satisfied by their treatment of the BLUP PRS estimator, which I still think would produce qualitatively different results than the threshold-style PRS estimator they primarily analyze.

The approximation the authors use for the PRS is that there is some detection threshold based on the true effect size (and allele frequency) of a variant, and we get a good estimate of the effect size for variants above the threshold, but a zero estimate for variants below it. This approximation is aligned with a threshold-style estimator, where you throw out variants using a threshold for their estimated (rather than true) effect size/significance.

It contrasts with the BLUP estimator, which does not throw out any variants. The new supplementary note seems to be arguing that the BLUP is approximately the same as the estimator under consideration because its weights are well approximated by the OLS effect size estimates. This is true (perhaps even obvious), but the OLS effect size estimates are only used in the threshold approximation *for variants passing the threshold*, and this threshold seems to be a critical reason for the qualitative findings throughout the paper.

As I noted before, the difference between these estimators corresponds to the difference between a sparse vs. infinitesimal genetic architecture: under a sparse architecture, a method with some sort of feature selection is called for; under an infinitesimal architecture, when every SNP is causal, the BLUP estimator is optimal.

I do not at all think it would detract from the findings of the paper if they are only applicable to sparse architectures and threshold/variable selection style estimators. We know that real genetic architectures are not infinitesimal (especially for non-psychiatric traits), and in general, models with some thresholding or variable selection perform better than BLUP.

I don’t request new analyses to address this, but I am concerned that the new supplementary note misses the mark, and I recommend at least some qualitative description of what would change under a BLUP-type estimator with no threshold.
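The contrast drawn in this comment between a threshold-style estimator and a BLUP-type estimator can be illustrated with a minimal Python sketch. It is not the authors' or the reviewer's code. It assumes unlinked SNPs with standardized genotypes (so the ridge/BLUP solution reduces to a uniform shrinkage of the OLS estimates), noise-free OLS estimates, and a phenotype with unit variance. The function names, the cutoff on variance contributed per locus, and all numerical values are hypothetical and chosen only for illustration.

```python
import math

def standardized_effects(betas, freqs):
    """Convert per-allele effects to effects on standardized genotypes:
    beta * sqrt(2p(1-p)), where p is the allele frequency."""
    return [b * math.sqrt(2.0 * p * (1.0 - p)) for b, p in zip(betas, freqs)]

def threshold_weights(std_effects, cutoff):
    """Threshold-style approximation (assumed form): keep an effect when the
    variance its locus contributes (squared standardized effect) reaches
    `cutoff`; otherwise set the weight to zero."""
    return [e if e ** 2 >= cutoff else 0.0 for e in std_effects]

def blup_weights(ols_estimates, n_samples, h2):
    """BLUP (ridge) weights under the assumption of unlinked, standardized SNPs.
    Every OLS estimate is shrunk by the same factor n / (n + lambda), with
    lambda = M * (1 - h2) / h2, and no variant is discarded."""
    m = len(ols_estimates)
    lam = m * (1.0 - h2) / h2
    shrink = n_samples / (n_samples + lam)
    return [shrink * b for b in ols_estimates]

# Hypothetical per-allele effects and allele frequencies for four loci.
betas = [0.5, 0.05, -0.3, 0.01]
freqs = [0.5, 0.5, 0.1, 0.3]

effects = standardized_effects(betas, freqs)
thresholded = threshold_weights(effects, cutoff=0.01)
blup = blup_weights(effects, n_samples=1000, h2=0.5)

for name, weights in (("threshold", thresholded), ("BLUP", blup)):
    print(name, [round(w, 4) for w in weights])
```

In this toy setting the threshold rule assigns zero weight to the two small-effect loci, whereas the BLUP-type weights retain all four loci with mild, uniform shrinkage. This is the qualitative difference the reviewer asks the authors to describe.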

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Large-scale datasets should be made available via a public repository as described in the PLOS Genetics data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: None

Reviewer #2: Yes

Reviewer #3: Yes

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: Yes: Fernando Racimo

Reviewer #3: Yes: Luke J O'Connor

Decision Letter 2

Bret Payseur, Kirk E Lohmueller

26 Mar 2022

Dear Dr Steinrücken,

We are pleased to inform you that your manuscript entitled "Polygenic score accuracy in ancient samples: quantifying the effects of allelic turnover" has been editorially accepted for publication in PLOS Genetics. Congratulations!

Before your submission can be formally accepted and sent to production you will need to complete our formatting changes, which you will receive in a follow up email. Please be aware that it may take several days for you to receive this email; during this time no action is required by you. Please note: the accept date on your published article will reflect the date of this provisional acceptance, but your manuscript will not be scheduled for publication until the required changes have been made.

Once your paper is formally accepted, an uncorrected proof of your manuscript will be published online ahead of the final version, unless you’ve already opted out via the online submission form. If, for any reason, you do not want an earlier version of your manuscript published online or are unsure if you have already indicated as such, please let the journal staff know immediately at plosgenetics@plos.org.

In the meantime, please log into Editorial Manager at https://www.editorialmanager.com/pgenetics/, click the "Update My Information" link at the top of the page, and update your user information to ensure an efficient production and billing process. Note that PLOS requires an ORCID iD for all corresponding authors. Therefore, please ensure that you have an ORCID iD and that it is validated in Editorial Manager. To do this, go to ‘Update my Information’ (in the upper left-hand corner of the main menu), and click on the Fetch/Validate link next to the ORCID field.  This will take you to the ORCID site and allow you to create a new iD or authenticate a pre-existing iD in Editorial Manager.

If you have a press-related query, or would like to know about making your underlying data available (as you will be aware, this is required for publication), please see the end of this email. If your institution or institutions have a press office, please notify them about your upcoming article at this point, to enable them to help maximise its impact. Inform journal staff as soon as possible if you are preparing a press release for your article and need a publication date.

Thank you again for supporting open-access publishing; we are looking forward to publishing your work in PLOS Genetics!

Yours sincerely,

Kirk E Lohmueller

Guest Editor

PLOS Genetics

Bret Payseur

Section Editor: Evolution

PLOS Genetics

www.plosgenetics.org

Twitter: @PLOSGenetics

----------------------------------------------------

Comments from the reviewers (if applicable):

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #3: The authors have addressed my comments.

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Large-scale datasets should be made available via a public repository as described in the PLOS Genetics data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #3: Yes

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #3: Yes: Luke J O'Connor

----------------------------------------------------

Data Deposition

If you have submitted a Research Article or Front Matter that has associated data that are not suitable for deposition in a subject-specific public repository (such as GenBank or ArrayExpress), one way to make that data available is to deposit it in the Dryad Digital Repository. As you may recall, we ask all authors to agree to make data available; this is one way to achieve that. A full list of recommended repositories can be found on our website.

The following link will take you to the Dryad record for your article, so you won't have to re‐enter its bibliographic information, and can upload your files directly: 

http://datadryad.org/submit?journalID=pgenetics&manu=PGENETICS-D-21-01313R2

More information about depositing data in Dryad is available at http://www.datadryad.org/depositing. If you experience any difficulties in submitting your data, please contact help@datadryad.org for support.

Additionally, please be aware that our data availability policy requires that all numerical data underlying display items are included with the submission, and you will need to provide this before we can formally accept your manuscript, if not already present.

----------------------------------------------------

Press Queries

If you or your institution will be preparing press materials for this manuscript, or if you need to know your paper's publication date for media purposes, please inform the journal staff as soon as possible so that your submission can be scheduled accordingly. Your manuscript will remain under a strict press embargo until the publication date and time. This means an early version of your manuscript will not be published ahead of your final version. PLOS Genetics may also choose to issue a press release for your article. If there's anything the journal should know or you'd like more information, please get in touch via plosgenetics@plos.org.

Acceptance letter

Bret Payseur, Kirk E Lohmueller

26 Apr 2022

PGENETICS-D-21-01313R2

Polygenic score accuracy in ancient samples: quantifying the effects of allelic turnover

Dear Dr Steinrücken,

We are pleased to inform you that your manuscript entitled "Polygenic score accuracy in ancient samples: quantifying the effects of allelic turnover" has been formally accepted for publication in PLOS Genetics! Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out or your manuscript is a front-matter piece, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Genetics and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Andrea Szabo

PLOS Genetics

On behalf of:

The PLOS Genetics Team

Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom

plosgenetics@plos.org | +44 (0) 1223-442823

plosgenetics.org | Twitter: @PLOSGenetics

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Text. Extended model, methods, and results.

    This supplementary text contains detailed derivations and additional analyses.

    (PDF)

    Attachment

    Submitted filename: Maryn_PG_PRS_review.docx

    Attachment

    Submitted filename: response_to_reviewers.pdf

    Attachment

    Submitted filename: responses_to_reviewers.pdf

    Data Availability Statement

    All code to generate the results is available at https://github.com/marync/ancient_polygenic.


    Articles from PLoS Genetics are provided here courtesy of PLOS
