This is a preprint.

It has not yet been peer reviewed by a journal.

[Preprint]. 2021 Jan 18:2020.04.28.066365. [Version 4] doi: 10.1101/2020.04.28.066365

The impact of purifying and background selection on the inference of population history: problems and prospects

Parul Johri 1,*, Kellen Riall 1, Hannes Becher 2, Laurent Excoffier 3,4, Brian Charlesworth 2, Jeffrey D Jensen 1,*
PMCID: PMC7836109  PMID: 33501439

Abstract

Current procedures for inferring population history generally assume complete neutrality - that is, they neglect both direct selection and the effects of selection on linked sites. Here we examine how the presence of direct purifying selection and background selection may bias demographic inference by evaluating two commonly-used methods (MSMC and fastsimcoal2), specifically studying how the underlying shape of the distribution of fitness effects (DFE) and the fraction of directly selected sites interact with demographic parameter estimation. The results show that, even after masking functional genomic regions, background selection may cause the mis-inference of population growth under models of both constant population size and decline. This effect is amplified as the strength of purifying selection and the density of directly selected sites increase, as indicated by the distortion of the site frequency spectrum and levels of nucleotide diversity at linked neutral sites. We also show how simulated changes in background selection effects caused by population size changes can be predicted analytically. We propose a potential method for correcting for the mis-inference of population growth caused by selection. By treating the DFE as a nuisance parameter and averaging across all potential realizations, we demonstrate that even directly selected sites can be used to infer demographic histories with reasonable accuracy.

Keywords: demographic inference, background selection, distribution of fitness effects, MSMC, fastsimcoal2, approximate Bayesian computation (ABC)

INTRODUCTION

The characterization of past population size change is a central goal of population genomic analysis, with applications ranging from anthropological to agricultural to clinical (see review by Beichman et al. 2018). Furthermore, use of an appropriate demographic model provides a necessary null model for assessing the impact of selection across the genome (e.g., Teshima et al. 2006; Thornton and Jensen 2007; Jensen et al. 2019). Multiple strategies have been proposed for performing demographic inference, utilizing expectations related to levels of variation, the site frequency spectrum, linkage disequilibrium, and within- and between-population divergence (e.g., Gutenkunst et al. 2009; Li and Durbin 2011; Lukic and Hey 2012; Excoffier et al. 2013; Harris and Nielsen 2013; Bhaskar et al. 2015; Boitard et al. 2016; Sheehan and Song 2016; Ragsdale and Gutenkunst 2017; Kelleher et al. 2019; Speidel et al. 2019; Steinrücken et al. 2019).

Although many methods perform well when evaluated under the standard assumption of neutrality, it is difficult in practice to assure that the nucleotide sites used in empirical analyses experience neither direct selection nor the effects of selection at linked sites. For example, inference is often performed using intergenic, 4-fold degenerate, or intronic sites. While there is evidence for weak direct selection on all of these categories in multiple organisms (e.g., Haddrill et al. 2005; Chamary and Hurst 2005; Andolfatto 2005; Lynch 2007; Zeng and Charlesworth 2010; Choi and Aquadro 2016; Jackson et al. 2017), it is also clear that such sites near or in coding regions will also experience background selection (BGS; Charlesworth et al. 1993; Charlesworth 2013), and may periodically be affected by selective sweeps as well (Messer and Petrov 2013; Schrider et al. 2016). These effects are known to affect the local underlying effective population size, and alter both the levels and patterns of variation and linkage disequilibrium (Charlesworth et al. 1993; Kaiser and Charlesworth 2009; O’Fallon et al. 2010; Charlesworth 2013; Nicolaisen and Desai 2013; Ewing and Jensen 2016; Johri et al. 2020).

However, commonly-used approaches for performing demographic inference that assume complete neutrality, including fastsimcoal2 (Excoffier et al. 2013) and MSMC/PSMC (Li and Durbin 2011; Schiffels and Durbin 2014), have yet to be thoroughly evaluated in the light of this assumption, which is likely to be violated in practice. There are, however, some exceptions, as well as subsequent suggestions on how best to choose the least-affected genomic data for analysis (Pouyet et al. 2018). Rather than investigating existing software, Ewing and Jensen (2016) implemented an approximate Bayesian computation (ABC) approach quantifying the impact of BGS effects, demonstrating that weak purifying selection can generate a skew towards rare alleles that would be mis-interpreted as population growth. Under certain scenarios, this resulted in a many-fold mis-inference of population size change. However, the effects of the density of directly selected sites and the shape of the distribution of fitness effects (DFE), which are probably of great importance, have yet to be fully considered. Spanning the range of these potential parameter values is important for understanding the implications for empirical applications. For example, the proportion of the genome experiencing direct purifying selection can vary greatly between species, with estimates ranging from ~3–8% in humans, ~12% in rice, 37–53% in D. melanogaster, and 47–68% in S. cerevisiae (Siepel et al. 2005; Liang et al. 2018). Furthermore, many organisms have highly compact genomes, with ~88% of the E. coli genome (Blattner et al. 1997), and effectively all of many virus genomes, being functional (e.g., >95% of the SARS-CoV-2 genome, Wu et al. 2020).

While such estimates allow us to approximate the effects of BGS in some model organisms, in which recombination and mutation rates are well known, it is difficult to predict these effects in the vast majority of study systems. Moreover, while the genome-wide mean of B, a widely-used measure of BGS effects that measures the level of variability relative to neutral expectation, can range from ~0.45 in D. melanogaster to ~0.94 in humans (Charlesworth 2013; but see Pouyet et al. 2018), existing demographic inference approaches are usually applied across organisms without considering this important source of differences in levels of bias. Here, we examine the effects of the DFE shape and functional density on two common demographic inference approaches - the multiple sequentially Markovian coalescent (MSMC) and fastsimcoal2. Finally, we propose an extension within the approximate Bayesian computation (ABC) framework to address this issue, treating the DFE as a nuisance parameter and demonstrating greatly improved demographic inference even when using directly selected sites alone.

RESULTS AND DISCUSSION

Effects of SNP numbers, density and genome size on inference under neutral equilibrium

The accuracy and performance of demographic inference were evaluated using two popular methods, MSMC (Schiffels and Durbin 2014) and fastsimcoal2 (Excoffier et al. 2013). In order to assess performance, it was first necessary to determine how much genomic information is required to make accurate inference when the assumptions of neutrality are met. Chromosomal segments of varying sizes (1 Mb, 10 Mb, 50 Mb, 200 Mb, and 1 Gb) were simulated under neutrality and demographic equilibrium (i.e., a constant population size of 5000 diploid individuals) with 100 independent replicates each. For a single diploid individual, the mean [SD] number of segregating sites per replicate was 1,944 [283], 9,996 [418], 40,046 [957] and 200,245 [1887] for 10 Mb, 50 Mb, 200 Mb and 1 Gb, respectively; for 50 diploid individuals, the corresponding values were 10,354 [225], 51,863 [567], 207,118 [1139] and 1,035,393 [2476]. Use of MSMC resulted in incorrect inferences for all segments smaller than 1 Gb (Supp Figures 1, 2). Specifically, very strong recent growth was inferred instead of demographic equilibrium, although ancestral population sizes were correctly estimated. In addition, when two or four diploid genomes were used for inference, MSMC again inferred a recent many-fold growth for all segment sizes even when the true model was equilibrium, but performed well when using 1 diploid genome with large segments (Supp Figures 1, 2). These results suggest caution when performing inference with MSMC on smaller regions or genomes, specifically when the number of SNPs is less than ~200,000 per single diploid individual. Extra caution should be used when interpreting population size changes inferred by MSMC when using more than 1 diploid individual.
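The reported SNP counts can be checked against the standard neutral (Watterson) expectation for the number of segregating sites, E[S] = 4NμL·a_n, where a_n is the sum of 1/i for i from 1 to n−1 and n is the number of sampled chromosomes. The following is a minimal sketch of that check. The per-site mutation rate of 1e-8 is an assumption: it is not stated in this section and was chosen only because it reproduces the reported means; the function name is illustrative.

```python
def expected_segregating_sites(n_diploids, seq_len, pop_size=5000, mu=1e-8):
    """Expected number of segregating sites under neutrality and constant size.

    pop_size is the diploid population size given in the text.
    mu is an ASSUMED per-site, per-generation mutation rate (not given here).
    """
    n_chrom = 2 * n_diploids                        # sampled chromosomes
    a_n = sum(1.0 / i for i in range(1, n_chrom))   # harmonic sum over 1..n-1
    theta = 4 * pop_size * mu                       # per-site scaled mutation rate
    return theta * seq_len * a_n


if __name__ == "__main__":
    segments = [("10 Mb", 10e6), ("50 Mb", 50e6), ("200 Mb", 200e6), ("1 Gb", 1e9)]
    for label, length in segments:
        one = expected_segregating_sites(1, length)
        fifty = expected_segregating_sites(50, length)
        print(f"{label}: 1 diploid ~{one:,.0f}; 50 diploids ~{fifty:,.0f}")
```

Under these assumptions the expectation for 50 diploid individuals over 10 Mb is about 10,355, close to the simulated mean of 10,354, which suggests the simulated counts behave as expected under neutral equilibrium.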

When using fastsimcoal2 to perform demographic inference, parameters were accurately estimated for all chromosomal segment sizes when the correct model (i.e., equilibrium) was specified (Supp Table 1). However, when model selection was performed using a choice of four models (equilibrium, instantaneous size change, exponential size change, and instantaneous bottleneck), the correct model was chosen more often (~30% of replicates) when the simulated chromosome sizes were small (1 and 10 Mb), while an alternative model of either instantaneous size change or instantaneous bottleneck was increasingly preferred for larger regions (Supp Tables 2, 3), although the estimates of ancestral sizes were correct. This finding suggests that the non-independence of SNPs may result in model mis-identification. Indeed, since the model choice procedure assumes that SNPs are independent, the true number of independent SNPs is overestimated, which inflates confidence in the chosen model as the amount of data increases. However, it is interesting to note that the parameter values underlying the preferred non-constant size model often pointed towards a constant population size (see below). When model selection was performed using sparser SNP densities (i.e., 1 SNP per 5 kb, 50 kb or 100 kb), the correct model was recovered for longer chromosomes up to 200 Mb (Supp Tables 2, 3; Supp Figures 3, 4), although model selection was slightly less accurate for smaller chromosomes due to the decrease in the total amount of data. As suspected, the biases introduced by the non-independence of SNPs were found to be concordant with the level of linkage disequilibrium amongst SNPs used for the analysis (for 10 SNP windows, in which SNPs were separated by 50 kb (100 kb), mean r² = 0.027 (0.020), compared to the all-SNP mean r² of 0.118, and to the completely unlinked SNPs mean r² of 0.010; Supp Table 4).
Additionally, the Akaike information criterion (AIC) computed on partially linked SNPs may impose an insufficient penalty on a larger number of parameters, resulting in an undesirable preference for parameter-rich models. We found that implementing a more severe penalty improved inference considerably, even for 1 Gb chromosome sizes (Supp Tables 5, 6). This model selection performance, the potential corrections related to increased penalties, as well as the total number of SNPs and SNP thinning, should be investigated on a case-by-case basis in empirical applications, owing to the contribution of multiple underlying parameters (e.g., chromosome length, recombination rates, and SNP densities).
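To illustrate how a heavier penalty can change model choice, the sketch below scores models with an AIC-style criterion in which the per-parameter penalty is adjustable. The exact form of the more severe penalty used in the study is not given in this section, so the `penalty` multiplier is a hypothetical stand-in, and the likelihood values in the usage note are invented for illustration only. The conversion assumes likelihoods are supplied on a log10 scale, which is how fastsimcoal2 output is commonly treated.

```python
import math


def penalized_aic(log10_lhood, n_params, penalty=2.0):
    """AIC-style score: penalty * k - 2 * ln(L); penalty=2 is standard AIC.

    log10_lhood is assumed to be a maximized likelihood on a log10 scale.
    """
    ln_lhood = log10_lhood * math.log(10)   # convert log10(L) to ln(L)
    return penalty * n_params - 2.0 * ln_lhood


def choose_model(fits, penalty=2.0):
    """fits maps model name -> (log10 likelihood, number of parameters).

    Returns the name of the lowest-scoring model and all scores.
    """
    scores = {name: penalized_aic(lh, k, penalty) for name, (lh, k) in fits.items()}
    best = min(scores, key=scores.get)
    return best, scores
```

With invented fits of `{"equilibrium": (-1000.0, 1), "instantaneous_change": (-998.0, 3)}`, standard AIC (`penalty=2.0`) prefers the three-parameter model, whereas `penalty=10.0` prefers equilibrium; the appropriate penalty for real data would need to be calibrated by simulation.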

In the light of this performance assessment, all further analyses were restricted to characterizing demographic inference on data that far exceeded 1 Gb and roughly matched the structure and size of the human genome - for every diploid individual, 22 chromosomes (autosomes) of size 150 Mb each were simulated, which amounted to roughly 3 Gb of total sequence. Ten independent replicates of each parameter combination were performed throughout, and inference utilized one and fifty diploid individuals for MSMC and fastsimcoal2, respectively.

Effect of the strength of purifying selection on demographic inference

In order to test the accuracy of demographic inference in the presence of BGS, all 22 chromosomes were simulated with exons of size 350 bp each, with varying sizes of introns and intergenic regions (see Methods) in order to vary the fraction (5%, 10% and 20%) of the genome under selection. Because the strength of selection acting on deleterious mutations affects the distance over which the effects of BGS extend, demographic inference was evaluated for various DFEs (Table 1). The DFE was modelled as a discrete distribution with four fixed classes: 0 ≤ 2Nancs < 1, 1 ≤ 2Nancs < 10, 10 ≤ 2Nancs < 100 and 100 ≤ 2Nancs ≤ 2Nanc, where Nanc is the ancestral effective population size and s is the reduction in the fitness of the homozygous mutant relative to wildtype. The fitness effects of mutations were uniformly distributed within each class, and assumed to be semi-dominant, following a multiplicative fitness model for multiple loci; the DFE shape was altered by varying the proportion of mutations belonging to each class, given by f0, f1, f2, and f3, respectively (see Methods). Three DFEs highly skewed towards a particular class were initially used to assess the impact of the strength of selection on demographic inference (with the remaining mutations equally distributed amongst the other three classes): DFE1: a DFE in which 70% of mutations have weakly deleterious fitness effects (i.e., f1 = 0.7); DFE2: a DFE in which 70% of mutations have moderately deleterious fitness effects (i.e., f2 = 0.7); and DFE3: a DFE in which 70% of mutations have strongly deleterious fitness effects (i.e., f3 = 0.7). A DFE with equal proportions of all four classes (i.e., DFE4: f0 = f1 = f2 = f3 = 0.25) was also simulated to evaluate the combined effect of different selective strengths.
In addition, two bimodal DFEs consisting of only the neutral and the strongly deleterious class of mutations were simulated to characterize the role of strongly deleterious mutations (DFE5: a DFE in which 50% of mutations have strongly deleterious effects (i.e., f3 = 0.5) with the remaining being neutral; and DFE6: a DFE in which 30% of mutations were strongly deleterious (i.e., f3 = 0.3) with the remaining being neutral).

Table 1:

Proportion (fi) of mutations in each class of the discrete distribution of fitness effects (DFE) simulated in this study.

DFE     f0      f1      f2      f3
DFE1    0.1     0.7     0.1     0.1
DFE2    0.1     0.1     0.7     0.1
DFE3    0.1     0.1     0.1     0.7
DFE4    0.25    0.25    0.25    0.25
DFE5    0.5     0.0     0.0     0.5
DFE6    0.7     0.0     0.0     0.3
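The discrete DFE described above can be sketched as a two-step draw: choose a class with probabilities f0 to f3, then draw 2Nancs uniformly within that class. The code below is an illustrative sketch rather than the simulation code used in the study; the function names are hypothetical, and the ancestral size of 10,000 used in the example is a placeholder because the value is not stated in this section.

```python
import random

# Class proportions (f0, f1, f2, f3) taken from Table 1.
DFES = {
    "DFE1": (0.1, 0.7, 0.1, 0.1),
    "DFE2": (0.1, 0.1, 0.7, 0.1),
    "DFE3": (0.1, 0.1, 0.1, 0.7),
    "DFE4": (0.25, 0.25, 0.25, 0.25),
    "DFE5": (0.5, 0.0, 0.0, 0.5),
    "DFE6": (0.7, 0.0, 0.0, 0.3),
}


def dfe_bins(n_anc):
    """Class boundaries on the 2*N_anc*s scale; the last class ends at 2*N_anc."""
    return [(0.0, 1.0), (1.0, 10.0), (10.0, 100.0), (100.0, 2.0 * n_anc)]


def sample_selection_coefficients(dfe, n_anc, n_mut, rng=None):
    """Draw n_mut selection coefficients s (homozygous fitness reduction)."""
    rng = rng or random.Random()
    bins = dfe_bins(n_anc)
    classes = rng.choices(range(4), weights=DFES[dfe], k=n_mut)
    coefficients = []
    for c in classes:
        low, high = bins[c]
        scaled = rng.uniform(low, high)             # 2*N_anc*s, uniform within class
        coefficients.append(scaled / (2.0 * n_anc)) # convert back to s
    return coefficients


if __name__ == "__main__":
    n_anc = 10_000  # PLACEHOLDER ancestral size, not taken from the paper
    draws = sample_selection_coefficients("DFE5", n_anc, 5, random.Random(42))
    print(draws)
```

Because the strongly deleterious class spans 100 to 2Nanc on the scaled axis, its mean effect depends strongly on the ancestral size, which is one reason the placeholder value should be replaced before any quantitative use.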

In order to understand the effects of BGS, exonic sites were masked, and only linked neutral intergenic and intronic sites were used for demographic inference by both MSMC and fastsimcoal2 (although, under certain models, comparisons with analyses of non-masked datasets are also presented). The three demographic models examined were (1) demographic equilibrium, (2) a 30-fold exponential growth, mimicking the recent growth experienced by European human populations, and (3) ~6-fold instantaneous decline, mimicking the out-of-Africa bottleneck in human populations (Figure 1a). Although these models were parameterized using previous estimates of human demographic history (Supp Table 7; Gutenkunst et al. 2009), these basic demographic scenarios are applicable to many organisms, although the magnitudes of population size changes in this case may represent an extreme. Under neutrality, inference of parameters of all three simulated demographic models was highly accurate with both MSMC and fastsimcoal2 (Figure 1a; Supp Table 8). However, when inferring parameters using fastsimcoal2, the time of change in the case of the population decline model was consistently over-estimated when SNPs separated by 5 kb were used, while the time was accurately inferred when using all SNPs (Supp Table 8). We therefore present our results using all SNPs throughout (with comparisons to 1 SNP per 5 kb and 1 SNP per 100 kb thinning, under certain models), and recommend caution when implementing thinning procedures.
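For readers unfamiliar with SNP thinning, the sketch below shows one simple way to retain roughly one SNP per fixed distance. The exact thinning procedure used in the study is not described in this section, so the greedy rule here is an assumption, and the function name is hypothetical.

```python
def thin_snps(positions, min_spacing):
    """Greedy thinning of SNP positions (in bp) along one chromosome.

    Keeps a SNP only if it lies at least min_spacing bp beyond the last
    SNP that was kept. positions must be sorted in ascending order.
    """
    kept = []
    last = None
    for pos in positions:
        if last is None or pos - last >= min_spacing:
            kept.append(pos)
            last = pos
    return kept


if __name__ == "__main__":
    snps = [100, 2000, 5200, 9000, 10300, 20000]
    print(thin_snps(snps, 5000))   # 1 SNP per 5 kb
```

Thinning of this kind reduces linkage between the retained SNPs but also discards data, which is consistent with the recommendation above to apply it with caution.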

Figure 1:

Inference of demography by MSMC (red lines; 10 replicates) and fastsimcoal2 (blue lines; 10 replicates) with and without BGS, under demographic equilibrium (left column), 30-fold exponential growth (middle column), and ~6-fold instantaneous decline (right column). The true demographic models are depicted as black lines, with the x-axis origin representing the present day. (a) All genomic sites are strictly neutral. Exonic sites experience purifying selection specified by (b) DFE1, (c) DFE2, and (d) DFE3 (see Table 1). Exons represent 20% of the genome, and exonic sites were masked/excluded when performing demographic inference, quantifying the effects of BGS alone. The dashed lines represent indefinite extensions of the ancestral population sizes. Detailed methods including command lines can be found at: https://github.com/paruljohri/demographic_inference_with_selection/blob/main/CommandLines/Figure1.txt

Under demographic equilibrium, when 20% of the genome experiences direct selection (with masking of the directly selected sites), we found the true population size to be underestimated as expected, and recent population growth mis-inferred (Figure 1, Supp Figure 5), even when only 1 SNP per 100 kb was used and a higher AIC penalty was employed (Supp Figure 6). Conversely, when the true demographic model was characterized by a recent 30-fold growth, demographic inference was accurate and performed equally well for both MSMC and fastsimcoal2, with the exception of a slight underestimation of the ancestral population size for all DFE types. When the true model was population decline, weakly deleterious mutations alone did not affect inference drastically with either method, and it was possible to recover the true model (i.e., decline vs growth) by fastsimcoal2 in all replicates (Supp Figure 7). However, moderately and strongly deleterious mutations resulted in an underestimation of population size and the inference of an instantaneous bottleneck and strong recent growth, respectively, to the extent that population decline was misinterpreted as a bottleneck/growth in all replicates (Supp Figures 5, 7). Strong recent growth was inferred (in the presence of moderately and strongly deleterious mutations) even when SNPs separated by 100 kb were used, and an increased penalty was employed against parameter-rich models (Supp Figure 6). We further tested the effect of BGS on demographic inference when changes in population size were less severe, namely, when population growth and decline were only 2-fold, with qualitatively similar results (Supp Figure 8).

Finally, given the strong evidence that most organisms have a bi-modal DFE with a significant proportion of strongly deleterious or lethal mutations (Sanjuán 2010; Jacquier et al. 2013; Kousathanas and Keightley 2013; Bank et al. 2014; Charlesworth 2015; Galtier and Rousselle 2020), we investigated the effect of this strongly deleterious class further. Thus, for comparison with the above, we simulated a rather extreme case in which 30% or 50% of exonic mutations were strongly deleterious with fitness effects uniformly sampled between 100 ≤ 2Nancs ≤ 2Nanc, with the remaining mutations being neutral (i.e., DFE5 and DFE6; see Table 1). As with the above results, both equilibrium and decline models were falsely inferred as growth, with an order of magnitude underestimation of the true population size (Figure 2).

Figure 2:

Inference of demography by MSMC (red lines; 10 replicates) and fastsimcoal2 (blue lines; 10 replicates) in the presence of BGS generated by strongly deleterious mutations. Directly selected sites comprised 20% of the genome and were masked when performing demographic inference. Exons experience purifying selection specified by (a) DFE6 and (b) DFE5 (see Table 1). The true demographic models are given as black lines, with the x-axis origin representing the present day. The dashed lines represent indefinite extensions of the ancestral population sizes. Detailed methods including command lines can be found at: https://github.com/paruljohri/demographic_inference_with_selection/blob/main/CommandLines/Figure2.txt

In sum, neglecting BGS frequently results in the inference of population growth, almost regardless of the true underlying demographic model.

Effects of density and inclusion/exclusion of directly selected sites on inference

Although we have shown that the presence of purifying selection biases demographic inference, the extent of mis-inference necessarily depends on the fraction of the genome experiencing direct selection. We therefore compared models in which 5%, 10% or 20% of the genome was functional. For this comparison, equal proportions of mutations in each DFE bin were assumed corresponding to DFE4 (Table 1). As before, when the true model was growth, inference was unbiased, with a slight underestimation of ancestral population size when 20% of the genome experienced selection (Figure 3). Population decline was inferred reasonably well if less than 10% of the genome experienced direct selection, but could be mis-inferred as growth with greater functional density, as shown in Figure 3. Similarly, the extent to which population size was under-estimated at demographic equilibrium increased with the fraction of the genome under selection. Finally, it is noteworthy that many changes in population size that were falsely inferred were greater than 2-fold in size, suggesting the need for great caution when inferring such changes from real data.

Figure 3:

Inference of demography by MSMC (red lines; 10 replicates) and fastsimcoal2 (blue lines; 10 replicates) in the presence of BGS with varying proportions of the genome under selection, for demographic equilibrium (left column), exponential growth (middle column), and instantaneous decline (right column). Exonic sites were simulated with purifying selection with all fi values equal to 0.25 (DFE4; see Table 1), and were masked when performing inference. Directly selected sites comprise (a) 20% of the simulated genome, (b) 10% of the simulated genome, and (c) 5% of the simulated genome. The true demographic models are given by the black lines, with the x-axis origin representing the present day. The dashed lines represent indefinite extensions of the ancestral population sizes. Detailed methods including command lines can be found at: https://github.com/paruljohri/demographic_inference_with_selection/blob/main/CommandLines/Figure3.txt

Importantly, the results presented do not significantly differ between inference performed while including directly selected sites (i.e., no masking of functional regions; Supp Figure 9) versus inference performed using linked neutral sites (i.e., masking functional regions; Figures 1–3). These results suggest that the exclusion of exonic sites, which is often assumed to provide a sufficiently neutral dataset to enable accurate demographic inference, is not necessarily a satisfactory solution unless gene density is low. For example, demographic inference would naturally be expected to be less biased by BGS for human-like genomes with a relatively low functional density, and more biased in genomes with higher functional density like D. melanogaster.

Effect of BGS on model selection and inferred time of size change using fastsimcoal2

In order to quantify the effects of BGS on model selection, four competing models were used for inference: equilibrium, instantaneous size change (growth/decline), exponential size change (growth/decline), and an instantaneous bottleneck. Although demographic equilibrium was almost always inferred as an instantaneous size change (70–100% of replicates), the fitted parameters of the size change model were nearly indistinguishable from the correct model (Figure 1a). In other words, the inferred size change was so inconsequential as to be nearly a constant-size model, suggesting that parameter estimation is usually more reliable than model selection. When there was a substantial proportion of highly deleterious mutations (DFE3 and DFE5), exponential growth was generally inferred. However, when there was a true size change, fastsimcoal2 performed well in distinguishing between exponential vs. instantaneous change models even in the presence of BGS (Supp Figures 5, 6), provided that the magnitude of size change was large. When size changes were on the order of 2-fold, exponential growth was consistently inferred to be instantaneous.

With respect to model choice between growth and decline in the presence of BGS (irrespective of instantaneous vs. exponential change), as the density of selected sites and strength of purifying selection increased, both equilibrium and decline models were more likely to be inferred as growth and occasionally as instantaneous bottlenecks (Supp Figure 7), while true growth models were generally chosen correctly. It should be added that with such large chromosome sizes (3 Gb of total sequence data), model selection was not observed to vary between replicates using fastsimcoal2 for any given parameter combination. Thus, in the presence of BGS, high-confidence calls of an incorrect underlying demographic model appear likely.

With regard to the time of inferred size change, when the true model was exponential growth, the model was always correctly identified and the time of change was slightly under-estimated in the presence of BGS (Supp Figures 10, 11), consistent with the fact that BGS will further skew the site frequency spectrum towards rare alleles. When the true model was decline, and the model was correctly identified as such, the time of change was modestly over-estimated (Supp Figures 12, 13): up to ~2-fold for the 6× decline and 2.5-fold for the 2× decline (when 20% of the genome was exonic).

Effect of heterogeneity in recombination rates, mutation rates, and repeat masking

Variation in recombination and mutation rates, as well as the masking of repeat regions, may also affect demographic inference procedures. We evaluated this issue by simulating heterogeneity in both mutation and recombination rates (based on estimated human genome maps, as described in the Methods section), and masking 10% of each simulated segment drawing from the empirical distribution of repeat lengths in the human genome (Supp Figure 14). In general, inferences under neutrality (Supp Figures 15–17) as well as under BGS (Supp Figures 18–20) were not affected to a great extent, suggesting such heterogeneity to have a comparatively minor role for the parameter space considered in this study. Thus, serious mis-inference is more likely to be caused by selection. These observations also suggest that simulations performed with mean rates of recombination and mutation, as in this study, are sufficient to evaluate biases caused by BGS.

Effects of BGS on diversity and the SFS under various demographic models: theoretical expectations versus simulation results

To better understand how BGS can lead to different biases in the inference of population history, we investigated the extent of BGS effects under all three demographic models, with respect to both the expected diversity in the presence of BGS relative to neutrality (B), as well as the shape of the SFS at linked neutral sites. First, we found that B differed among demographic scenarios, with much lower values in the case of equilibrium and decline, concordant with stronger demographic mis-inference (Figure 4). After a population decline, B was lower than that before the size change, while after population expansion, B increased relative to that in the ancestral population, sometimes approaching 1 (Figure 4). This may seem paradoxical, given that the magnitude of the scaled selection coefficient (2Nes) decreases with decreasing Ne (i.e., the efficacy of purifying selection decreases, and could thus be expected to result in larger values of B under population decline). Conversely, with increasing Ne, B would be expected to decrease.

Figure 4:

Nucleotide site diversity with BGS (B) relative to its purely neutral expectation (π0) for varying DFEs (specified in Table 1) and demographic scenarios. The results are shown for (a) demographic equilibrium, (b) population growth, and (c) population decline. All cases refer to size changes forward in time, the ancestral B (i.e., B pre-change in population size) is shown in white bars, B post-change in population size is shown in solid gray bars, and the analytical expectations for the post-size change B is shown as red bars. Exonic sites comprised ~10% of the genome, roughly mimicking the density of the human genome. Detailed methods including command lines can be found at: https://github.com/paruljohri/demographic_inference_with_selection/blob/main/CommandLines/Figure4.txt

However, these expectations apply only once a population has maintained a given Ne for sufficient time that mutation-drift equilibrium has been approached. During the initial stages of population size change, and shortly afterwards, the dynamics of B tend to show a trend opposing these long-term expectations (see also Figure 5 of Torres et al. 2020). This is because differences in Ne caused by different initial levels of BGS cause differences in the rates of response to changes in population size: a small value of Ne (corresponding to low B) results in a faster response compared with a high value (Fay and Wu 1999; Hey and Harris 1999; Pool and Nielsen 2007; Pool and Nielsen 2009; Campos et al. 2014; Torres et al. 2020). In other words, diversity in a growing population will increase more rapidly in regions experiencing stronger BGS than in completely neutral regions, while diversity in a declining population will decrease at a faster rate in regions with BGS relative to those with neutrality, resulting in temporarily higher and lower B, respectively. The relative diversity values observed with different initial equilibrium B values after a short period of population size change may thus be very different from both the initial and final equilibrium values. The overall effect is that there is an apparent decrease in B immediately following a population decline, and an apparent increase immediately following an expansion. Analytical models describing these effects are presented in the Appendix. These models used the simulated values of B at equilibrium before the population size changes to predict the apparent B values at the ends of the periods of size change (see the Methods and Appendix). It can be seen from Figure 4 that there is good agreement between these predictions and the simulation results.
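The direction of the transient change in B can be illustrated with the standard approach-to-equilibrium result for pairwise diversity after an instantaneous size change, π(t) = π1 + (π0 − π1)·exp(−t/(2N1)), where π0 and π1 are the equilibrium values before and after the change. The sketch below is not the Appendix model: it assumes that BGS acts purely as a constant rescaling of N by the ancestral B, that the size change is instantaneous, and that the mutation rate is 1e-8 (the ratio returned by `apparent_b` does not depend on this last value). All function names are illustrative.

```python
import math


def diversity_after_change(n_before, n_after, t, b=1.0, mu=1e-8):
    """Expected pairwise diversity t generations after an instantaneous change.

    BGS is approximated as a constant rescaling of population size by b
    (ASSUMPTION); b=1 corresponds to a purely neutral region.
    """
    ne_before, ne_after = b * n_before, b * n_after
    pi_before = 4.0 * ne_before * mu          # equilibrium diversity before
    pi_after = 4.0 * ne_after * mu            # equilibrium diversity after
    decay = math.exp(-t / (2.0 * ne_after))   # smaller Ne -> faster response
    return pi_after + (pi_before - pi_after) * decay


def apparent_b(n_before, n_after, t, b_anc):
    """Diversity with BGS relative to neutrality, t generations after the change."""
    with_bgs = diversity_after_change(n_before, n_after, t, b=b_anc)
    neutral = diversity_after_change(n_before, n_after, t, b=1.0)
    return with_bgs / neutral


if __name__ == "__main__":
    print("decline:", apparent_b(10_000, 10_000 / 6, t=1000, b_anc=0.5))
    print("growth: ", apparent_b(10_000, 20_000, t=5000, b_anc=0.5))
```

Under these assumptions the apparent B falls below its ancestral value shortly after a decline and rises above it shortly after an expansion, because the region with the smaller Ne moves towards its new equilibrium more quickly.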

Because several demographic estimation methods are based on fitting a demographic model to the SFS, it is also of interest to determine whether BGS can skew the SFS to different extents under different demographic models. Although it is well known that BGS causes a skew of the SFS towards rare variants under equilibrium models (Charlesworth et al. 1995; Nicolaisen and Desai 2013), the effect of BGS on the SFS with population size change has not been much explored (but see Johri et al. 2020; Torres et al. 2020). As shown in Figure 5, with a population size decline, the SFS of derived alleles is more skewed towards rare variants when BGS is operating, especially when B is initially small, since the effects of BGS work in opposition to the effects of the population size reduction. This difference in the left skew of the SFS with and without BGS is much less noticeable in the case of population expansion, since here the effects of BGS and the expansion act in a similar direction.

Figure 5:

The site frequency spectrum (SFS) of derived allele frequencies at neutral sites from 10 diploid genomes under (a) demographic equilibrium, (b) population growth, and (c) population decline, under the same DFEs as shown in Figure 4. The x-axis indicates the number of sample alleles (out of 20) carrying the derived variant. Exonic sites comprised ~10% of the genome, roughly mimicking the density of the human genome. The red solid circles give the values predicted analytically with a purely neutral model, but correcting for BGS by using the B values of the ancestral population (i.e., pre-change in population size) obtained from simulations, in order to quantify the effective population size. Detailed methods including command lines can be found at: https://github.com/paruljohri/demographic_inference_with_selection/blob/main/CommandLines/Figure5.txt

As with the estimates of the apparent B values discussed above, analytical predictions of the expected SFS after an instantaneous / exponential change in population size can be made from the values of B and the SFS at equilibrium in the ancestral population before the population size change, using the formulae of Polanski and Kimmel (2003) and Polanski et al. (2003) for the purely neutral case, as described in the Methods section. Importantly, the use of the B parameter does not in itself cause a skew in the SFS; it merely affects overall diversity values. Figure 5 shows that the overall shape of the SFS is predicted reasonably well by the analytical results, although deviations are to be expected for the rare allele classes, which are the most sensitive to demographic change and selection. Overall, the results imply that BGS is more likely to bias demographic inference post-decline compared with post-expansion, consistent with the performance of the methods described above. Although it is notable that the SFS can be reasonably well predicted by correcting for the re-scaling effects of BGS if the effects of BGS in the ancestral population are accurately known, the exact allele frequency patterns observed will depend on the timing of population size changes relative to the time of sampling, as well as the value of B prior to the size change. The patterns described here thus represent only a small subset of the possibilities.

A potential solution: averaging across all possible DFEs

As shown above, demographic inference can be strongly affected by BGS effects that have not been taken into account, as well as by direct purifying selection. A potential solution is thus to correct for these effects when performing inferences of population history. A widely-used approach to estimating direct selection effects, DFE-alpha, takes a stepwise approach to inferring demography, by using a presumed neutral class (synonymous sites); conditional on that demography, it then estimates the parameters of the DFE (Keightley and Eyre-Walker 2007; Eyre-Walker and Keightley 2009; Schneider et al. 2011; Kousathanas and Keightley 2013). However, this approach does not include the possibility of effects of selection at linked sites, which can result in an over-estimate of population growth, and while the DFE may not be mis-inferred strongly (Kim et al. 2017), there is substantial mis-inference of the DFE if synonymous sites experience direct selection (Johri et al. 2020).

Building on this idea, Johri et al. (2020) recently proposed an approach that includes both direct and background effects of purifying selection, and simultaneously infers the deleterious DFE and demography. By utilizing the decay of BGS effects around functional regions, they demonstrated high accuracy under the simple demographic models examined. Moreover, the method makes no assumptions about the neutrality of synonymous sites, and can thus be used to estimate selection acting on these sites, as well as in non-coding functional elements. However, this computationally-intensive approach is specifically concerned with jointly inferring the DFE and demographic parameters. As such, if an unbiased characterization of the population history is the sole aim, this procedure may be needlessly involved. We thus here examine the possibility of instead treating the DFE as an unknown nuisance parameter, averaging across all possible DFE shapes, in order to assess whether demographic inference may be improved simply by correcting for these selection effects without inferring their underlying parameter values. This approach utilizes functional (i.e., directly selected) regions, a potential advantage in populations for which only coding data may be available (e.g., exome-capture data; see Jones and Good 2016), or more generally in organisms with largely functional genomes.

In order to illustrate this approach, a functional genomic element was simulated under demographic equilibrium, 2-fold exponential population growth and 2-fold exponential population decline with four different DFE shapes (as described previously, and shown in Figure 6). A number of summary statistics were calculated (see Methods) for the entire functional region. Inference was first performed assuming strict neutrality, and inferring a one-epoch size change (thus estimating the ancestral (Nanc) and current population sizes (Ncur)). As was found with the other inference approaches examined, population sizes were underestimated and a false inference of population growth was observed in almost all cases when selective effects are ignored (Figure 6).

Figure 6:

Comparison of estimates of ancestral (Nanc) and current (Ncur) population sizes when assuming neutrality vs when varying the DFE shape as a nuisance parameter, using an ABC framework. Inference is shown for demographic equilibrium (left column), 2-fold exponential growth (middle column), and 2-fold population decline (right column), for five separate DFE shapes that define the extent of direct purifying selection acting on the genomic segment for which demographic inference is performed: (a) neutrality, (b) DFE1, (c) DFE2, (d) DFE3, and (e) DFE4 (see Table 1). In each case, the horizontal lines give the true values (black for Nanc; and gray for Ncur) and the box-plots give the estimated values. Black and gray boxes represent estimates when assuming neutrality, while red boxes represent estimates when the DFE is treated as a nuisance parameter. Detailed methods including command lines can be found at: https://github.com/paruljohri/demographic_inference_with_selection/blob/main/CommandLines/Figure6.txt

Next, the assumption of neutrality was relaxed, and mutations were simulated with fitness effects characterized by a discrete DFE, with the fitness classes used above (f0, f1, f2, f3). Values for fi were drawn from a uniform prior between 0 and 1, such that ∑fi = 1. Note that no assumptions were made about which sites in the genomic region were functionally important, or regarding the presence/absence of a neutral class. These directly selected sites were then used to infer demographic parameters. We found that, by varying the shape of the DFE, averaging across all realizations, and only estimating parameters related to population history, highly accurate inference of modern and ancestral population sizes is possible (Figure 6). These results demonstrate that, even if the true DFE of a population is unknown (as will always be the case in reality), it is possible to infer demographic history with reasonable accuracy by approximately correcting for these selective effects.

This proposed method is most applicable to organisms in which recombination rates are reasonably well known. If the assumed recombination rate is 2-fold lower than the true rate, the ABC approach infers growth by over-estimating the current population size; correspondingly, if the assumed recombination rate is higher than the true rate, the current population size is underestimated (Supp Figure 21). Interestingly, in both cases the ancestral population sizes are correctly inferred, consistent with previous results (Johri et al. 2020).

CONCLUSIONS

While commonly used approaches for inferring demography assume neutrality and independence among segregating sites, these assumptions are likely to be violated in practice. In addition, there is considerable evidence for wide-spread effects of selection at linked sites in many commonly studied organisms (Hernandez et al. 2011; Cutter and Payseur 2013; Williamson et al. 2014; Elyashiv et al. 2016; Campos et al. 2017; Booker and Keightley 2018; Pouyet et al. 2018; Ragsdale et al. 2018; Torres et al. 2018; Castellano et al. 2020). Accordingly, we have explored how violations of the assumption of neutrality may affect demographic inference, particularly with regard to the underlying strength of purifying selection and the genomic density of directly selected sites. Generally speaking, the neglect of these effects (i.e., background selection) results in an inference of population growth, with the severity of the growth model roughly scaling with selection strength and density, as well as the inference of historical bottlenecks with some frequency. Thus, when the true underlying model is in fact growth, demographic mis-inference is not particularly severe; when the true underlying model is constant size or decline, the mis-inference can be extreme, with a many-fold underestimation of population size.

However, given that BGS will lead to the false inference of recent growth nearly regardless of the true history, it would be difficult in practice to determine the accuracy of this model without independent information on any given empirical application. Moreover, as the two very different methods investigated here result in highly similar mis-inference, we propose that this performance is unlikely to be a feature of these specific approaches, but rather a quantification of the fact that the underlying genealogies are distorted in the presence of BGS. Thus, these problems are likely to be common to all demographic inference based on polymorphism data.

It is important to note that BGS effects extend over genomic distances in a way that is positively related to the strength of purifying selection. For instance, strongly and moderately deleterious mutations affect patterns of diversity at large genomic distances, whereas mildly deleterious mutations primarily skew allele frequencies at adjacent sites. Thus, if intergenic regions further away from exons are used to perform demographic inference, it is predominantly moderately deleterious mutations that are likely to bias inferences; if these are relatively rare, they may not cause significant problems. In contrast, if synonymous sites are used to infer demographic history, mildly deleterious mutations arising in the coding sequences to which they belong may have significant effects. As we have focused here on relatively sparsely-coding genomes (with human-like gene densities) and used intergenic sites for inference, moderately deleterious mutations resulted in more severe mis-inference. The effect of the decay with distance of BGS due to mildly deleterious mutations depends on multiple parameters. For instance, with an exon of length 500 bp and Drosophila-like parameters (e.g., Ne = 10⁶; recombination rate = mutation rate = 10⁻⁸ / site / generation), B increases from 0.53 (at 10 bases from the end of the exon) to 0.94 at a distance of 1000 bases. On the other hand, with human-like parameters (Ne = 10⁴; recombination rate = mutation rate = 10⁻⁸ / site / generation) the corresponding change in B is only from 0.981 to 0.982 (Supp Table 9).

Thus, mildly deleterious mutations have drastically different effects, depending on the underlying population parameters. While these results certainly suggest that demographic inference ought to be less biased by BGS in neutral regions very distant from functional elements (for species with sufficiently high recombination / functionally sparse genomes), it is noteworthy that purifying selection on moderately and strongly deleterious mutations can have long-range effects, and that the complex interaction of population history with purifying / background selection necessitates a consideration of this topic in any given empirical application.

Comparing the two inference methods investigated here, it appears that fastsimcoal2 is less prone to inferring false fluctuations in population size. However, both methods falsely infer growth in the presence of BGS, with increasing severity as the density of coding regions increases. The times of population growth inferred by both methods appear to be affected in unpredictable ways when the inferred model is incorrect. When the general model is correctly identified, BGS leads to inference of more recent growth, and more ancient decline, than the reality. In addition, although variation in mutation and recombination rates across the genome alone did not strongly affect demographic inference, our evaluations in the current study are restricted to a specific parameter space resembling those of human populations. The effects of this variation on organisms with more extreme rate fluctuations remain in need of investigation.

It is noteworthy that, even when all sites are strictly neutral or only 5% of the genome experiences direct selection, demographic equilibrium is mis-estimated by MSMC as a series of size changes. The pattern of these erroneous size changes lends a characteristic shape to the MSMC curve (i.e., ancient decline and recent growth) which appears to resemble the demographic history previously inferred for the Yoruban population (Schiffels and Durbin 2014), including the time at which changes in population size occurred (Supp Figure 22). Previous work has demonstrated that the resulting demographic model does not in fact fit the observed SFS in the Yoruban population (Beichman et al. 2017; Lapierre et al. 2017). A similar shape has also been inferred in the vervet subspecies (Warren et al. 2015; Figure 4), in passenger pigeons (Hung et al. 2014; Figure 2), in elephants (Palkopoulou et al. 2018; Figure 4), in Arabidopsis (Fulgione et al. 2018; Figure 3), and in grapevines (Zhou et al. 2017; Figure 2A).

Although the inferred population size fluctuations under simulated neutrality are only ~1.2-fold, in most empirical applications the fluctuations are of a somewhat larger magnitude (~2-fold in pigeons, Arabidopsis, and grapevines). Nonetheless, this performance of MSMC under neutral demographic equilibrium is concerning, and adds to the other previously published cautions concerning the interpretation of MSMC results. For example, Mazet et al. (2016) and Chikhi et al. (2018) demonstrated that, under constant population size with hidden structure, MSMC may suggest false size changes (see also Orozco-terWengel 2016). In addition, MSMC has been reported to falsely infer growth prior to instantaneous bottlenecks (Bunnefeld et al. 2015). Furthermore, we observed that, if insufficient genomic data are used, or more than one diploid genome is used to perform inference, MSMC falsely infers recent growth of varying magnitudes, the latter having been previously observed by Beichman et al. (2017) and Adrion et al. (2020).

In sum, we find that the effects of purifying and background selection result in similar demographic mis-inference across approaches, and that masking functional sites does not yield accurate parameter estimates. In order to side-step many of these difficulties, our proposed approach of inferring demography by averaging selection effects across all possible DFE shapes within an ABC framework appears to be promising. Utilizing only functional regions, we found a great improvement in accuracy, without making any assumptions regarding the true underlying shape of the DFE or the neutrality of particular classes of sites. As such, this approach represents a more computationally efficient avenue if only demographic parameters are of interest, and ought to be particularly useful in the great majority of organisms in which independent neutral sites either do not exist, or are difficult to identify and verify.

METHODS

Simulations of chromosomal segments under neutral equilibrium:

When assessing the amount of genomic information required for accurate demographic inference, chromosomal segments of varying sizes (1 Mb, 10 Mb, 50 Mb, 200 Mb and 1 Gb) were simulated under neutral equilibrium. In all cases, the effective population size (Ne) simulated was 5000, and mutation and recombination rates were both 1 × 10⁻⁸ per site per generation. Simulations were performed with both SLiM 3.1 (Haller and Messer 2019) for a 10Ne generation burn-in, and with msprime 0.7.3 (Kelleher et al. 2016). In all cases 100 replicates were simulated, with the exception of 1 Gb chromosomes simulated by SLiM, in which only 10 replicates were obtained.

Simulations of human-like chromosomes (with and without selection):

Simulations were performed using SLiM 3.1 (Haller and Messer 2019) for a burn-in of 10Nanc generations, with 10 replicates per evolutionary scenario. For every replicate, 22 chromosomes of 150Mb each were simulated, totaling ~3 Gb of information per individual genome (similar to the amount of information in a human genome). Within each chromosome, 3 different types of regions were simulated, representing non-coding intergenic, intronic, and exonic regions. Based on the NCBI RefSeq human genome annotation, downloaded from the UCSC genome browser for hg19 (http://genome.ucsc.edu/; Kent et al. 2002), mean values of exon sizes and intron numbers per gene were calculated. To represent mean values for the human genome (Lander et al. 2001), each gene comprised 8 exons and 7 introns, and exon lengths were fixed at 350 bp. By varying the lengths of the intergenic and intronic regions, three different genomic configurations with varying densities of functional elements were simulated and compared - with 5%, 10% and 20% of the genome being under direct selection - hereafter referred to as genome5, genome10, and genome20, respectively. Genome5 was comprised of introns of 3000 bp and intergenic sequence of 31000 bp, genome10 of introns of 1500 bp and intergenic sequence of 15750 bp, while genome20 was comprised of introns of 600 bp and intergenic sequence of 6300 bp. The total chromosome sizes of these genomes were approximately 150 Mb (150,018,599 bp, 150,029,949 bp, and 150,003,699 bp) with 2737, 5164, and 11278 genes per chromosome in genome5, genome10, and genome20, respectively. In order to be conservative with respect to the performance of existing demographic estimators, intronic and intergenic regions were assumed to be neutral.
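The three genomic configurations can be checked with simple arithmetic. The sketch below is illustrative only: it assumes that each gene unit consists of 8 exons, 7 introns, and one intergenic spacer, which is our reading of the description above rather than the layout used in the authors' scripts. The function names are hypothetical.

```python
# Illustrative arithmetic only. The layout rule (one gene of 8 exons and
# 7 introns followed by one intergenic spacer) is an assumption.
EXONS_PER_GENE = 8
INTRONS_PER_GENE = 7
EXON_LEN = 350  # bp, as stated in the text

# (intron length, intergenic length, genes per chromosome) from the text
CONFIGS = {
    "genome5": (3000, 31000, 2737),
    "genome10": (1500, 15750, 5164),
    "genome20": (600, 6300, 11278),
}


def gene_unit_length(intron_len, intergenic_len):
    """Length in bp of one gene plus one intergenic spacer."""
    return (EXONS_PER_GENE * EXON_LEN
            + INTRONS_PER_GENE * intron_len
            + intergenic_len)


def exonic_fraction(intron_len, intergenic_len):
    """Fraction of sites that are exonic (directly selected)."""
    exon_bp = EXONS_PER_GENE * EXON_LEN
    return exon_bp / gene_unit_length(intron_len, intergenic_len)


def approx_chromosome_length(intron_len, intergenic_len, n_genes):
    """Approximate chromosome length, ignoring any flanking sequence."""
    return n_genes * gene_unit_length(intron_len, intergenic_len)


if __name__ == "__main__":
    for name, (intron, inter, genes) in CONFIGS.items():
        print(name,
              round(exonic_fraction(intron, inter), 3),
              approx_chromosome_length(intron, inter, genes))
```

Under these assumptions the exonic fractions come out close to the nominal 5%, 10% and 20%, and the chromosome lengths close to 150 Mb.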

Recombination and mutation rates were assumed to be equal to 1 × 10⁻⁸ / site / generation. Neither crossover interference nor gene conversion was modeled (see the discussion in Campos and Charlesworth 2019). Exonic regions in the genomes experienced direct purifying selection given by a discrete DFE comprised of 4 fixed classes (Johri et al. 2020), whose frequencies are denoted by fi: f0, with 0 ≤ 2Nes < 1 (i.e., effectively neutral mutations), f1, with 1 ≤ 2Nes < 10 (i.e., weakly deleterious mutations), f2, with 10 ≤ 2Nes < 100 (i.e., moderately deleterious mutations), and f3, with 100 ≤ 2Nes < 2Ne (i.e., strongly deleterious mutations), where Ne is the effective population size and s is the reduction in fitness of the mutant homozygote relative to wild-type. Within each bin, the distribution of s was assumed to be uniform. All mutations were assumed to be semi-dominant. In all cases, the Ne corresponding to the DFE refers to the ancestral effective population size.
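As a minimal sketch of how a selection coefficient could be drawn from such a discrete DFE, the following Python function picks a class with probabilities (f0, f1, f2, f3) and then draws 2Nes uniformly within that class. The function names are hypothetical, and s is returned as a positive reduction in fitness; a forward simulator may expect the opposite sign.

```python
import random


def dfe_bins(ne):
    """Class boundaries on the 2*Ne*s scale, as given in the text."""
    return [(0.0, 1.0), (1.0, 10.0), (10.0, 100.0), (100.0, 2.0 * ne)]


def sample_selection_coefficient(freqs, ne, rng=random):
    """Draw one s (reduction in fitness of the mutant homozygote).

    freqs: the class frequencies (f0, f1, f2, f3), which must sum to 1.
    ne:    the ancestral effective population size.
    rng:   the random module or a random.Random instance.
    """
    if abs(sum(freqs) - 1.0) > 1e-9:
        raise ValueError("class frequencies must sum to 1")
    # Choose a class, then draw 2*Ne*s uniformly within it; this is
    # equivalent to drawing s uniformly within the class.
    lo, hi = rng.choices(dfe_bins(ne), weights=freqs, k=1)[0]
    return rng.uniform(lo, hi) / (2.0 * ne)


if __name__ == "__main__":
    rng = random.Random(42)
    print([sample_selection_coefficient([0.25] * 4, 5000, rng)
           for _ in range(5)])
```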

Six different types of DFE were simulated, described by the parameters provided in Table 1. Three different demographic models were tested for each of these DFEs (Supp Table 7): 1) demographic equilibrium, 2) recent exponential 30-fold growth, resembling that estimated for the human CEU population (Gutenkunst et al. 2009), and 3) ~6-fold instantaneous decline, resembling the out-of-Africa bottleneck in humans (Gutenkunst et al. 2009). For simulations of demographic equilibrium and decline, population sizes and time of change were scaled down by a factor of 10 (with corresponding scaling of the recombination rate, mutation rate, and selection coefficients), while simulations of growth were not scaled.

Running MSMC:

In order to quantify the effect of purifying selection on demographic inference, we used entire chromosomes generated by SLiM to generate input files for MSMC. For comparison, and in order to quantify the effect of BGS alone on demographic inference, we masked the exonic regions to generate input files. For all parameters, MSMC was run on a single diploid genome, as the results for this case were the most accurate (Supp Figure 1, 2). Input files were made using the script ms2multihetsep.py provided in the msmc-tools repository downloaded from https://github.com/stschiff/msmc-tools. MSMC1 and 2 were run as follows: msmc_1.1.0_linux64bit -t 5 -r 1.0 -o output_genomeID input_chr1.tab input_chr2.tab … input_chr22.tab. Population sizes obtained from MSMC were plotted up to the maximum number of generations obtained from MSMC, and the final value of the ancestral population size was extended indefinitely as a dashed line.

Running fastsimcoal2:

Inference was performed by masking all exonic SNPs and using all intronic and intergenic SNPs in order to obtain the most accurate estimates. In order to minimize the effects of linkage disequilibrium (LD), SNPs separated by 5 kb or 100 kb were also used for inference in some cases to assess the impact of violating the assumption of independence. When choosing SNPs separated by a particular distance, the first SNP from each chromosome was chosen and if the distance to the next consecutive SNP was greater than or equal to 5 kb/100 kb, that SNP was included; otherwise the next downstream SNP was evaluated. Site frequency spectra (SFS) were obtained for all sets of SNPs for all 10 replicates of every combination of demographic history and DFE. SNPs from all 22 chromosomes were pooled together to calculate the SFS. In the case of SNPs separated by 5 kb/100 kb, the “0” class of the SFS was scaled down by the same extent as the decrease in the total number of SNPs. Fastsimcoal2 was used to fit each SFS to 4 distinct models: (a) equilibrium, which estimates only a single population size parameter (N); (b) instantaneous size change (decline/growth), which fits 3 parameters - ancestral population size (Nanc), current population size (Ncur), and time of change (T); (c) exponential size change (decline/growth), which also estimates 3 parameters - Nanc, Ncur and T; and (d) an instantaneous bottleneck model with 3 parameters – Nanc, intensity, and time of bottleneck. The parameter search ranges for both ancestral and current population sizes in all cases were specified to be uniformly distributed between 100–500000 individuals, while the parameter range for time of change was specified to be uniform between 100–10000 generations in all models. The intensity of the bottleneck was sampled from a log-uniform distribution between 10⁻⁵ and 2. The following command line was used to run fastsimcoal2: fsc26 -t demographic_model.tpl -n 150000 -d -e demographic_model.est -M -L 50 -q
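The SNP-thinning rule and the rescaling of the monomorphic ("0") class described above can be sketched as follows. This is a hedged illustration, not the authors' script: it assumes that distances are measured from the last retained SNP, and the function names are hypothetical.

```python
def thin_snps(positions, min_dist):
    """Greedy thinning of SNP positions on one chromosome.

    Keeps the first SNP, then every SNP lying at least min_dist bp
    downstream of the last SNP that was kept (an assumption about how
    "distance to the next consecutive SNP" is measured).
    """
    kept = []
    for pos in sorted(positions):
        if not kept or pos - kept[-1] >= min_dist:
            kept.append(pos)
    return kept


def rescale_monomorphic_class(n_monomorphic, n_snps_before, n_snps_after):
    """Scale the "0" class of the SFS by the fraction of SNPs retained."""
    if n_snps_before == 0:
        return float(n_monomorphic)
    return n_monomorphic * n_snps_after / n_snps_before


if __name__ == "__main__":
    snps = [100, 2000, 5100, 5200, 10100, 20000]
    kept = thin_snps(snps, 5000)
    print(kept)
    print(rescale_monomorphic_class(1_000_000, len(snps), len(kept)))
```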

Model selection was performed as recommended by Excoffier et al. (2013). For each demographic model, the maximum of maximum likelihoods from all replicates was used to calculate the Akaike Information Criterion (AIC) = 2 × number of parameters − 2 × ln(likelihood) = 2 × number of parameters − 2 × ln(10) × L10, where L10 is the logarithm (with respect to base 10) of the best likelihood provided by fastsimcoal2. For model choice comparison, we also implemented a stricter penalty of 25× (see Supp Table 5, 6), in which case AIC = 25 × number of parameters − 2 × ln(likelihood). The relative likelihoods (Akaike’s weight of evidence) in favor of the ith model were then calculated as:

$$w(i) = \frac{e^{-0.5\Delta_i}}{\sum_{j=1}^{4} e^{-0.5\Delta_j}}$$

where $\Delta_i = \mathrm{AIC}_i - \mathrm{AIC}_{\min}$. The model with the highest relative likelihood was selected as the best model, and the parameters estimated using that model were used to plot the final inferred demography.
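The model-choice arithmetic can be illustrated with a short Python sketch. It converts a base-10 log-likelihood (as reported by fastsimcoal2) to an AIC with an adjustable per-parameter penalty, then computes Akaike weights as exp(−0.5Δ) normalized over the candidate models. The function names and the example likelihood values are hypothetical.

```python
import math


def aic_from_log10_likelihood(log10_lik, n_params, penalty=2.0):
    """AIC = penalty * k - 2 * ln(10) * log10(likelihood).

    penalty=2 gives the standard AIC; penalty=25 gives the stricter
    variant mentioned in the text.
    """
    return penalty * n_params - 2.0 * math.log(10.0) * log10_lik


def akaike_weights(aics):
    """Relative likelihood of each model given a list of AIC values."""
    best = min(aics)
    rel = [math.exp(-0.5 * (a - best)) for a in aics]
    total = sum(rel)
    return [r / total for r in rel]


if __name__ == "__main__":
    # Hypothetical log10 likelihoods for four models with 1, 3, 3 and 3
    # parameters (equilibrium, instantaneous, exponential, bottleneck).
    log10_liks = [-1052.3, -1049.8, -1049.9, -1051.0]
    n_params = [1, 3, 3, 3]
    aics = [aic_from_log10_likelihood(l, k)
            for l, k in zip(log10_liks, n_params)]
    print(akaike_weights(aics))
```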

Simulations of variable recombination and mutation rates, and repeat masking:

In order to simulate variation in recombination and mutation rates, all 22 chromosomes were simulated by mimicking chromosome 6 (~171Mb) of the human genome. Recombination rates (HapMap) obtained from Yoruban populations (McVean et al. 2004; Myers et al. 2005) were obtained from the UCSC genome browser, while the mutation rate map (https://molgenis26.target.rug.nl/downloads/gonl_public/mutation_rate_map/release2/) was assumed to correspond to estimates obtained from de novo mutations (Francioli et al. 2015), as in Castellano et al. (2020). Absolute values of mutation rates were normalized in order to maintain the mean mutation rate across the genome at ~1.0 × 10⁻⁸ per site per generation. Recombination and mutation rate estimates were taken from positions of approximately 10 Mb to 160 Mb, with the recombination map starting at 10010063 bp and the mutation map starting at 10010001 bp. Regions with missing data for either of the two estimates were simulated with rates corresponding to the previous window, except for the case of centromeres in which no recombination was assumed. In order to understand the effect of excluding centromeric regions in empirical studies, the 4Mb region corresponding to the centromere was masked, corresponding to 48.5 to 52.5 Mb of the simulated 150Mb chromosomes. In order to evaluate the effect of masking repeat regions, random segments comprising 10% of each chromosome were masked. The lengths of these segments were drawn from the lengths of repeat regions found in the human genome (Supp Figure 14), as obtained from the repeat regions in the hg19 assembly of the human genome from the UCSC genome browser.

Performing inference by approximate Bayesian computation (ABC):

ABC was performed using the R package “abc” (Csilléry et al. 2010), and non-linear regression aided by a neural net (used with default parameters as provided by the package) was used to correct for the relationship between parameters and statistics (Johri et al. 2020). To infer posterior estimates, a tolerance of 0.1 was applied (i.e., 10% of the total number of simulations were accepted by ABC in order to estimate the posterior probability of each parameter). The weighted medians of the posterior estimates for each parameter were used as point estimates. ABC inference was performed under two conditions: (1) complete neutrality, or (2) the presence of direct purifying selection. In both cases only 2 parameters were inferred - ancestral (Nanc) and current (Ncur) population sizes. However, in scenario 2, the shape of the DFE was also varied. Specifically, the parameters f0, f1, f2, and f3 were treated as nuisance parameters and were sampled such that 0 ≤ fi ≤ 1, and Σi fi = 1, for i = 0 to 3. In addition, in order to limit the computational complexity involved in the ABC framework, values of fi were restricted to multiples of 0.05 (i.e., fi ∈ {0.0, 0.05, 0.10, …, 0.95, 1.0} ∀ i), which allowed us to sample 1,771 different DFE realizations. Simulations were performed with functional genomic regions, and the demographic model was characterized by 1-epoch changes in which the population either grows or declines exponentially from ancestral to current size, beginning at a fixed time in the past.
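The count of 1,771 DFE realizations follows from enumerating all ways of splitting 20 units of 0.05 among four classes, which is C(23, 3) = 1,771. A minimal Python sketch of this enumeration is given below; the function name is hypothetical and this is not the authors' sampling code.

```python
from itertools import product


def dfe_grid(n_steps=20, n_classes=4):
    """All (f0, ..., f3) with each f_i a multiple of 1/n_steps, summing to 1.

    With n_steps=20 the grid spacing is 0.05, as in the text.
    """
    grid = []
    # Enumerate the first n_classes - 1 counts; the last is the remainder.
    for combo in product(range(n_steps + 1), repeat=n_classes - 1):
        rest = n_steps - sum(combo)
        if rest >= 0:
            grid.append(tuple(k / n_steps for k in combo + (rest,)))
    return grid


if __name__ == "__main__":
    grid = dfe_grid()
    print(len(grid))
    print(grid[:3])
```

Treating the DFE as a nuisance parameter then amounts to simulating under every grid point (or a random draw from the grid) alongside each draw of Nanc and Ncur, while only the two population sizes are estimated.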

For the purpose of illustration, and for a contrast with the human-like parameter set above, parameters for ABC testing were selected to resemble those of D. melanogaster African populations. Priors on ancestral and current population sizes were drawn from a uniform distribution between 10⁵ and 10⁷ diploid individuals, while the time of change was fixed at 10⁶ (~Ne) generations. In order to simulate functional regions, 94 single-exon genes, as described in Johri et al. (2020) and provided in https://github.com/paruljohri/BGS_Demography_DFE/blob/master/DPGP3_data.zip, were simulated with recombination rates specific to those exons (https://petrov.stanford.edu/cgi-bin/recombination-rates_updateR5.pl) (Fiston-Lavier et al. 2010; Comeron et al. 2012). Mutation rates were assumed to be fixed at 3 × 10⁻⁹ per site per generation (Keightley et al. 2009; Keightley et al. 2014).

All parameters were scaled by the factor 320 in order to decrease computational time, using the principle first described by Hill and Robertson (1966), and subsequently employed by others (Comeron and Kreitman 2002; Hoggart et al. 2007; Kaiser and Charlesworth 2009; Kim and Wiehe 2009; Uricchio and Hernandez 2014; Campos and Charlesworth 2019). The scaled population sizes thus ranged between ~300–30000 and were reported as scaled values in the main text. One thousand replicate simulations were performed for every parameter combination (Nanc, Ncur, f0, f1, f2, f3); for performing ABC inference, 50 diploid genomes were randomly sampled without replacement, and summary statistics were calculated using pylibseq 0.2.3 (Thornton 2003). The following summary statistics were calculated across the entire exonic region for every exon: nucleotide site diversity (π), Watterson’s θ, Tajima’s D, Fay and Wu’s H (both absolute and normalized), number of singletons, haplotype diversity, LD-based statistics (r², D, D′), and divergence (i.e., number of fixed mutations per site per generation after the burn-in period). Means and variances (between exons) of all of the above (a total of 22) were used as final summary statistics to perform ABC. As opposed to the above examples, in this inference scheme only exonic data (i.e., directly selected sites) were utilized. Test datasets were generated in exactly the same fashion as described above.
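The rescaling by a factor of 320 can be illustrated with the conventional rule for forward simulations: population sizes and times are divided by the factor, while mutation rates, recombination rates and selection coefficients are multiplied by it, so that products such as Neμ, Ner and Nes are unchanged. The sketch below assumes this convention applies here; the dictionary keys and function name are hypothetical.

```python
def rescale(params, factor):
    """Rescale simulation parameters, keeping N*mu, N*rec and N*s fixed.

    Assumes the conventional rule: sizes and times are divided by the
    factor; per-generation rates and selection coefficients are
    multiplied by it.
    """
    return {
        "N": params["N"] / factor,
        "generations": params["generations"] / factor,
        "mu": params["mu"] * factor,
        "rec": params["rec"] * factor,
        "s": params["s"] * factor,
    }


if __name__ == "__main__":
    # Prior bounds on population size from the text; the recombination
    # rate and s are placeholder values for illustration.
    for n in (1e5, 1e7):
        unscaled = {"N": n, "generations": 1e6, "mu": 3e-9,
                    "rec": 1e-8, "s": 1e-5}
        print(rescale(unscaled, 320))
```

With the prior bounds of 10⁵ and 10⁷, the scaled sizes are 312.5 and 31,250, matching the range of ~300–30000 quoted above.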

Analytical expectations for the relative site frequencies:

To compute the expected relative frequencies of site frequency classes, the approach of Polanski and Kimmel (2003) was followed. They describe a method for computing the “probability that a SNP has b mutant bases”, which is equivalent to the expected site frequency spectrum (SFS) of derived variants. This method (their equations 3–10) allows for the specification of arbitrary population size histories and sample sizes. For reasons of computational precision, a sample size of 10 diploid genomes was chosen. The demographic scenarios were implemented as piecewise functions of the effective population size (counting haploid genomes), and the effect of BGS was included by scaling these functions by values of B before population size change as obtained from the forwards-in-time simulations described above. A Mathematica notebook detailing these results is available online (see data availability statement). In addition, analytical expressions can be obtained for pairwise diversity values when there are step changes or exponential growth in population size, as described in the Appendix and in an example program that calculates diversity values after exponential growth.

Data availability:

The following data are publicly available: (1) The general workflow for simulating and performing demographic inference; (2) Scripts used for performing simulations, masking genomes, calculating the SFS, performing model selection and plotting the final results; (3) Input files used to run fastsimcoal2, including for the calculated SFS; (4) Output files of MSMC and fastsimcoal2 for all simulated scenarios; (5) A Mathematica (version 12.1) notebook detailing the calculations of analytical expectations for the relative SFS; (6) An example program (Fortran script) demonstrating how to obtain analytical expressions for values of B after exponential growth. All supplemental files, scripts, command lines, and descriptions may be found at: https://github.com/paruljohri/demographic_inference_with_selection

ACKNOWLEDGEMENTS

We would like to thank Susanne Pfeifer for helpful discussions related to this project, and for feedback on the manuscript. This research was conducted using resources provided by Research Computing at Arizona State University (http://www.researchcomputing.asu.edu) and the Open Science Grid, which is supported by the National Science Foundation and the U.S. Department of Energy’s Office of Science. This work was funded by National Institutes of Health grants R01GM135899 and 1R35GM139383-01 to JDJ.

APPENDIX

There are two scenarios of population size change for which simple explicit expressions for the expected pairwise coalescent time or diversity can be obtained, without using the methodology of Polanski and Kimmel (2003) and Polanski et al. (2003) – a step change in N or an exponential growth in N. First consider the coalescent process for a step change, where the current and initial effective population sizes are denoted by Ne1 and Ne0, respectively. Let B be the background selection parameter at the start of the process of change, corresponding to effective size Ne0. For convenience, time is scaled in units of 2Ne1 generations, and the time of the change in population size on this scale is denoted by T0, counting back from the present time, T = 0. T0 is assumed to be sufficiently small that B remains approximately constant during the period since the change in size. Denote the ratio Ne0/Ne1 by R. The derivation for the case of a step change in population size is similar to that given by Pool and Nielsen (2009) for the purpose of comparing X chromosomes and autosomes.

Between times T = 0 and T0, coalescence occurs at a rate $B^{-1}$ on the chosen timescale, so that the contribution to the net coalescent time for a pair of alleles sampled at T = 0 is:

$$B^{-1}\int_0^{T_0} T \exp(-B^{-1}T)\,dT = B - B\exp(-B^{-1}T_0) - T_0\exp(-B^{-1}T_0)$$

There is a probability of $\exp(-B^{-1}T_0)$ that there is no coalescence when T lies between 0 and T0, after which coalescence occurs at a rate 1/(BR), giving a net contribution to the coalescence time of:

$$(BR + T_0)\exp(-B^{-1}T_0)$$

The net coalescence time for the stepwise change with BGS is given by the sum of these two expressions:

$$B\left[1 + (R - 1)\exp(-B^{-1}T_0)\right] \tag{1a}$$

If this expression is compared to the corresponding equation with B = 1, the apparent value of B at the time of sampling of the pair of alleles is given by:

$$B_S = \frac{B\left[1 + (R - 1)\exp(-B^{-1}T_0)\right]}{1 + (R - 1)\exp(-T_0)} \tag{1b}$$
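As a hedged illustration of the step-change result, the following minimal Python sketch evaluates the two contributions to the coalescence time, their sum (Equation 1a), and the apparent B (Equation 1b). The function names are hypothetical and are not taken from the authors' scripts; B, R and T0 have the meanings defined above, with T0 in units of 2Ne1 generations.

```python
import math


def step_change_components(B, R, T0):
    """Return the two contributions to the expected pairwise coalescence time.

    Time is in units of 2*Ne1 generations; R = Ne0/Ne1 (hypothetical helper).
    """
    e = math.exp(-T0 / B)          # probability of no coalescence before T0
    recent = B - B * e - T0 * e    # coalescence between T = 0 and T0
    ancient = (B * R + T0) * e     # coalescence before the size change
    return recent, ancient


def step_change_coal_time(B, R, T0):
    """Net expected coalescence time for the step change (Equation 1a)."""
    return B * (1.0 + (R - 1.0) * math.exp(-T0 / B))


def apparent_B_step(B, R, T0):
    """Apparent B at sampling (Equation 1b): Equation 1a over its B = 1 value."""
    return step_change_coal_time(B, R, T0) / step_change_coal_time(1.0, R, T0)


if __name__ == "__main__":
    # Illustrative values only: a tenfold expansion (R = 0.1) with B = 0.5.
    print(apparent_B_step(B=0.5, R=0.1, T0=0.5))
```

With no size change (R = 1), or with T0 = 0, the apparent value reduces to B itself, which gives a quick check on any implementation.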

Next, consider a process of exponential change in population size, starting at an initial effective size of Ne0 at t0 generations in the past and ending at size Ne1, such that the instantaneous growth rate r per generation is r = ln(Ne1/Ne0)/t0. The effective population size at time t in the past is Ne(t) = Ne1 exp(−rt); with BGS, the rate of coalescence at time t is 1/[2BNe(t)]. As before, the BGS parameter is assumed to remain constant over the period of population size change. It follows that the probability of no coalescence by generation t in the past is:

$$P_{nc}(t) = \exp\left[-\int_0^{t} (2BN_{e1})^{-1}\exp(rt')\,dt'\right] = \exp\left[c^{-1}\left(1 - e^{rt}\right)\right] \tag{2}$$

where c = 2BNe1r.

The pre-growth period with t > t0 contributes an expected coalescent time of (2BNe0 + t0)Pnc(t0), on the scale of generations.

Following Slatkin and Hudson (1991), to obtain the contribution from the growth period with t < t0, it is convenient to measure time as τ = rt. The probability of coalescence between τ and τ + dτ is then given by:

$$P_c(\tau) = c^{-1} e^{\tau} \exp\left[c^{-1}\left(1 - e^{\tau}\right)\right] d\tau \tag{3}$$

The contribution from this period to the expected coalescent time is given by the integral of τPc(τ) between 0 and τ0 = rt0. Following Slatkin and Hudson (1991), by transforming to u = exp(τ), with u0 = exp(τ0), this contribution can be expressed as the following integral:

$$\bar{\tau}_1 = c^{-1} e^{c^{-1}} \int_1^{u_0} \ln(u)\, e^{-u c^{-1}}\,du \tag{4}$$

This integral can easily be evaluated numerically. The corresponding mean coalescent time on the scale of generations is obtained by division by r, and the result can be added to (2BNe0 + t0)Pnc(t0), yielding the net expected coalescent time. By dividing the resulting expression by the corresponding expression with B = 1, the apparent BGS effect at the time of sampling can be obtained, in the same way as for the step change model.
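The numerical evaluation described above can be sketched in Python as follows. This is an illustrative sketch, not the authors' Fortran program: the function names, the use of composite Simpson's rule, and the number of intervals are all assumptions made here. The factor exp(1/c) is folded into the exponent of the integrand so that it cannot overflow when c is small.

```python
import math


def p_no_coal(t, B, Ne1, r):
    """Probability of no coalescence by generation t in the past (Equation 2)."""
    c = 2.0 * B * Ne1 * r
    return math.exp((1.0 - math.exp(r * t)) / c)


def growth_phase_time(B, Ne1, r, t0, n=20000):
    """Growth-phase contribution (Equation 4), returned in generations.

    Uses composite Simpson's rule with n (even) intervals on u in [1, u0].
    """
    c = 2.0 * B * Ne1 * r
    u0 = math.exp(r * t0)

    def integrand(u):
        # ln(u) * exp(-u/c) * exp(1/c) / c, with the exponentials combined
        return math.log(u) * math.exp((1.0 - u) / c) / c

    h = (u0 - 1.0) / n
    total = integrand(1.0) + integrand(u0)
    for i in range(1, n):
        total += (4.0 if i % 2 else 2.0) * integrand(1.0 + i * h)
    tau_bar = total * h / 3.0
    return tau_bar / r  # divide by r to convert to generations


def expected_coal_time(B, Ne0, Ne1, t0):
    """Net expected pairwise coalescent time in generations."""
    r = math.log(Ne1 / Ne0) / t0
    pre_growth = (2.0 * B * Ne0 + t0) * p_no_coal(t0, B, Ne1, r)
    return growth_phase_time(B, Ne1, r, t0) + pre_growth


def apparent_B_growth(B, Ne0, Ne1, t0):
    """Apparent B at sampling: ratio to the corresponding value with B = 1."""
    return expected_coal_time(B, Ne0, Ne1, t0) / expected_coal_time(1.0, Ne0, Ne1, t0)


if __name__ == "__main__":
    # Illustrative values only: tenfold growth over 1000 generations, B = 0.5.
    print(apparent_B_growth(B=0.5, Ne0=1000.0, Ne1=10000.0, t0=1000.0))
```

The sketch requires Ne1 ≠ Ne0, since r = 0 makes c zero. When Ne1 is only marginally larger than Ne0, the net coalescent time with B = 1 should be close to 2Ne0 generations, which provides a simple sanity check.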

REFERENCES

  1. Adrion JR, Cole CB, Dukler N, Galloway JG, Gladstein AL, Gower G, Kyriazis CC, Ragsdale AP, Tsambos G, Baumdicker F, et al. 2020. A community-maintained standard library of population genetic models. Coop G, Wittkopp PJ, Novembre J, Sethuraman A, Mathieson S, editors. eLife 9:e54967.
  2. Andolfatto P. 2005. Adaptive evolution of non-coding DNA in Drosophila. Nature 437:1149–1152.
  3. Bank C, Ewing GB, Ferrer-Admettla A, Foll M, Jensen JD. 2014. Thinking too positive? Revisiting current methods of population genetic selection inference. Trends Genet. 30:540–546.
  4. Beichman AC, Huerta-Sanchez E, Lohmueller KE. 2018. Using genomic data to infer historic population dynamics of nonmodel organisms. Annu. Rev. Ecol. Evol. Syst. 49:433–456.
  5. Beichman AC, Phung TN, Lohmueller KE. 2017. Comparison of single genome and allele frequency data reveals discordant demographic histories. G3 7:3605–3620.
  6. Bhaskar A, Wang YXR, Song YS. 2015. Efficient inference of population size histories and locus-specific mutation rates from large-sample genomic variation data. Genome Res. 25:268–279.
  7. Blattner FR, Plunkett G, Bloch CA, Perna NT, Burland V, Riley M, Collado-Vides J, Glasner JD, Rode CK, Mayhew GF, et al. 1997. The complete genome sequence of Escherichia coli K-12. Science 277:1453–1462.
  8. Boitard S, Rodríguez W, Jay F, Mona S, Austerlitz F. 2016. Inferring population size history from large samples of genome-wide molecular data - an approximate Bayesian computation approach. PLoS Genet. 12:e1005877.
  9. Booker TR, Keightley PD. 2018. Understanding the factors that shape patterns of nucleotide diversity in the house mouse genome. Mol. Biol. Evol. 35:2971–2988.
  10. Bunnefeld L, Frantz LAF, Lohse K. 2015. Inferring bottlenecks from genome-wide samples of short sequence blocks. Genetics 201:1157–1169.
  11. Campos JL, Charlesworth B. 2019. The effects on neutral variability of recurrent selective sweeps and background selection. Genetics 212:287–303.
  12. Campos JL, Halligan DL, Haddrill PR, Charlesworth B. 2014. The relation between recombination rate and patterns of molecular evolution and variation in Drosophila melanogaster. Mol. Biol. Evol. 31:1010–1028.
  13. Campos JL, Zhao L, Charlesworth B. 2017. Estimating the parameters of background selection and selective sweeps in Drosophila in the presence of gene conversion. Proc. Natl. Acad. Sci. U.S.A. 114:E4762–E4771.
  14. Castellano D, Eyre-Walker A, Munch K. 2020. Impact of mutation rate and selection at linked sites on DNA variation across the genomes of humans and other Homininae. Genome Biol. Evol. 12:3550–3561.
  15. Chamary J, Hurst LD. 2005. Evidence for selection on synonymous mutations affecting stability of mRNA secondary structure in mammals. Genome Biol. 6:R75.
  16. Charlesworth B. 2013. Background selection 20 years on. The Wilhelmine E. Key 2012 invitational lecture. J. Hered. 104:161–171.
  17. Charlesworth B. 2015. Causes of natural variation in fitness: evidence from studies of Drosophila populations. Proc. Natl. Acad. Sci. U.S.A. 112:1662–1669.
  18. Charlesworth B, Morgan MT, Charlesworth D. 1993. The effect of deleterious mutations on neutral molecular variation. Genetics 134:1289–1303.
  19. Charlesworth D, Charlesworth B, Morgan MT. 1995. The pattern of neutral molecular variation under the background selection model. Genetics 141:1619–1632.
  20. Chikhi L, Rodríguez W, Grusea S, Santos P, Boitard S, Mazet O. 2018. The IICR (inverse instantaneous coalescence rate) as a summary of genomic diversity: insights into demographic inference and model choice. Heredity 120:13–24.
  21. Choi JY, Aquadro CF. 2016. Recent and long term selection across synonymous sites in Drosophila ananassae. J. Mol. Evol. 83:50–60.
  22. Comeron JM, Kreitman M. 2002. Population, evolutionary and genomic consequences of interference selection. Genetics 161:389–410.
  23. Comeron JM, Ratnappan R, Bailin S. 2012. The many landscapes of recombination in Drosophila melanogaster. PLoS Genet. 8:e1002905.
  24. Csilléry K, Blum MGB, Gaggiotti OE, François O. 2010. Approximate Bayesian Computation (ABC) in practice. Trends Ecol. Evol. 25:410–418.
  25. Cutter AD, Payseur BA. 2013. Genomic signatures of selection at linked sites: unifying the disparity among species. Nat. Rev. Genet. 14:262–274.
  26. Elyashiv E, Sattath S, Hu TT, Strutsovsky A, McVicker G, Andolfatto P, Coop G, Sella G. 2016. A genomic map of the effects of linked selection in Drosophila. PLoS Genet. 12:e1006130.
  27. Ewing GB, Jensen JD. 2016. The consequences of not accounting for background selection in demographic inference. Mol. Ecol. 25:135–141.
  28. Excoffier L, Dupanloup I, Huerta-Sánchez E, Sousa VC, Foll M. 2013. Robust demographic inference from genomic and SNP data. PLoS Genet. 9:e1003905.
  29. Eyre-Walker A, Keightley PD. 2009. Estimating the rate of adaptive molecular evolution in the presence of slightly deleterious mutations and population size change. Mol. Biol. Evol. 26:2097–2108.
  30. Fay JC, Wu CI. 1999. A human population bottleneck can account for the discordance between patterns of mitochondrial versus nuclear DNA variation. Mol. Biol. Evol. 16:1003–1005.
  31. Fiston-Lavier A-S, Singh ND, Lipatov M, Petrov DA. 2010. Drosophila melanogaster recombination rate calculator. Gene 463:18–20.
  32. Francioli LC, Polak PP, Koren A, Menelaou A, Chun S, Renkens I, van Duijn CM, Swertz M, Wijmenga C, van Ommen G, et al. 2015. Genome-wide patterns and properties of de novo mutations in humans. Nat. Genet. 47:822–826.
  33. Fulgione A, Koornneef M, Roux F, Hermisson J, Hancock AM. 2018. Madeiran Arabidopsis thaliana reveals ancient long-range colonization and clarifies demography in Eurasia. Mol. Biol. Evol. 35:564–574.
  34. Galtier N, Rousselle M. 2020. How much does Ne vary among species? Genetics 216:559–572.
  35. Gutenkunst RN, Hernandez RD, Williamson SH, Bustamante CD. 2009. Inferring the joint demographic history of multiple populations from multidimensional SNP frequency data. PLoS Genet. 5:e1000695.
  36. Haddrill PR, Thornton KR, Charlesworth B, Andolfatto P. 2005. Multilocus patterns of nucleotide variability and the demographic and selection history of Drosophila melanogaster populations. Genome Res. 15:790–799.
  37. Haller BC, Messer PW. 2019. SLiM 3: Forward genetic simulations beyond the Wright–Fisher model. Mol. Biol. Evol. 36:632–637.
  38. Harris K, Nielsen R. 2013. Inferring demographic history from a spectrum of shared haplotype lengths. PLoS Genet. 9:e1003521.
  39. Hernandez RD, Kelley JL, Elyashiv E, Melton SC, Auton A, McVean G, 1000 Genomes Project, Sella G, Przeworski M. 2011. Classic selective sweeps were rare in recent human evolution. Science 331:920–924.
  40. Hey J, Harris E. 1999. Population bottlenecks and patterns of human polymorphism. Mol. Biol. Evol. 16:1423–1426.
  41. Hill WG, Robertson A. 1966. The effect of linkage on limits to artificial selection. Genet. Res. 8:269–294.
  42. Hoggart CJ, Chadeau-Hyam M, Clark TG, Lampariello R, Whittaker JC, Iorio MD, Balding DJ. 2007. Sequence-level population simulations over large genomic regions. Genetics 177:1725–1731.
  43. Hung C-M, Shaner P-JL, Zink RM, Liu W-C, Chu T-C, Huang W-S, Li S-H. 2014. Drastic population fluctuations explain the rapid extinction of the passenger pigeon. Proc. Natl. Acad. Sci. U.S.A. 111:10636–10641.
  44. Jackson BC, Campos JL, Haddrill PR, Charlesworth B, Zeng K. 2017. Variation in the intensity of selection on codon bias over time causes contrasting patterns of base composition evolution in Drosophila. Genome Biol. Evol. 9:102–123.
  45. Jacquier H, Birgy A, Nagard HL, Mechulam Y, Schmitt E, Glodt J, Bercot B, Petit E, Poulain J, Barnaud G, et al. 2013. Capturing the mutational landscape of the beta-lactamase TEM-1. Proc. Natl. Acad. Sci. U.S.A. 110:13067–13072.
  46. Jensen JD, Payseur BA, Stephan W, Aquadro CF, Lynch M, Charlesworth D, Charlesworth B. 2019. The importance of the Neutral Theory in 1968 and 50 years on: A response to Kern and Hahn 2018. Evolution 73:111–114.
  47. Johri P, Charlesworth B, Jensen JD. 2020. Toward an evolutionarily appropriate null model: jointly inferring demography and purifying selection. Genetics 215:173–192.
  48. Jones MR, Good JM. 2016. Targeted capture in evolutionary and ecological genomics. Mol. Ecol. 25:185–202.
  49. Kaiser VB, Charlesworth B. 2009. The effects of deleterious mutations on evolution in non-recombining genomes. Trends Genet. 25:9–12.
  50. Keightley PD, Eyre-Walker A. 2007. Joint inference of the distribution of fitness effects of deleterious mutations and population demography based on nucleotide polymorphism frequencies. Genetics 177:2251–2261.
  51. Keightley PD, Ness RW, Halligan DL, Haddrill PR. 2014. Estimation of the spontaneous mutation rate per nucleotide site in a Drosophila melanogaster full-sib family. Genetics 196:313–320.
  52. Keightley PD, Trivedi U, Thomson M, Oliver F, Kumar S, Blaxter ML. 2009. Analysis of the genome sequences of three Drosophila melanogaster spontaneous mutation accumulation lines. Genome Res. 19:1195–1201.
  53. Kelleher J, Etheridge AM, McVean G. 2016. Efficient coalescent simulation and genealogical analysis for large sample sizes. PLoS Comput. Biol. 12:e1004842.
  54. Kelleher J, Wong Y, Wohns AW, Fadil C, Albers PK, McVean G. 2019. Inferring whole-genome histories in large population datasets. Nat. Genet. 51:1330–1338.
  55. Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, Haussler D. 2002. The Human Genome Browser at UCSC. Genome Res. 12:996–1006.
  56. Kim BY, Huber CD, Lohmueller KE. 2017. Inference of the distribution of selection coefficients for new nonsynonymous mutations using large samples. Genetics 206:345–361.
  57. Kim Y, Wiehe T. 2009. Simulation of DNA sequence evolution under models of recent directional selection. Brief. Bioinformatics 10:84–96.
  58. Kousathanas A, Keightley PD. 2013. A comparison of models to infer the distribution of fitness effects of new mutations. Genetics 193:1197–1208.
  59. Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, Devon K, Dewar K, Doyle M, FitzHugh W, et al. 2001. Initial sequencing and analysis of the human genome. Nature 409:860–921.
  60. Lapierre M, Lambert A, Achaz G. 2017. Accuracy of demographic inferences from the site frequency spectrum: the case of the Yoruba population. Genetics 206:439–449.
  61. Li H, Durbin R. 2011. Inference of human population history from individual whole-genome sequences. Nature 475:493–496.
  62. Liang P, Saqib HSA, Zhang X, Zhang L, Tang H. 2018. Single-base resolution map of evolutionary constraints and annotation of conserved elements across major grass genomes. Genome Biol. Evol. 10:473–488.
  63. Lukic S, Hey J. 2012. Demographic inference using spectral methods on SNP data, with an analysis of the human out-of-Africa expansion. Genetics 192:619–639.
  64. Lynch M. 2007. The Origins of Genome Architecture. Sunderland, Massachusetts: Sinauer Associates.
  65. Mazet O, Rodríguez W, Grusea S, Boitard S, Chikhi L. 2016. On the importance of being structured: instantaneous coalescence rates and human evolution--lessons for ancestral population size inference? Heredity 116:362–371.
  66. McVean GAT, Myers SR, Hunt S, Deloukas P, Bentley DR, Donnelly P. 2004. The fine-scale structure of recombination rate variation in the human genome. Science 304:581–584.
  67. Messer PW, Petrov DA. 2013. Frequent adaptation and the McDonald–Kreitman test. Proc. Natl. Acad. Sci. U.S.A. 110:8615–8620.
  68. Myers S, Bottolo L, Freeman C, McVean G, Donnelly P. 2005. A fine-scale map of recombination rates and hotspots across the human genome. Science 310:321–324.
  69. Nicolaisen LE, Desai MM. 2013. Distortions in genealogies due to purifying selection and recombination. Genetics 195:221–230.
  70. O’Fallon BD, Seger J, Adler FR. 2010. A continuous-state coalescent and the impact of weak selection on the structure of gene genealogies. Mol. Biol. Evol. 27:1162–1172.
  71. Orozco-terWengel P. 2016. The devil is in the details: the effect of population structure on demographic inference. Heredity 116:349–350.
  72. Palkopoulou E, Lipson M, Mallick S, Nielsen S, Rohland N, Baleka S, Karpinski E, Ivancevic AM, To T-H, Kortschak RD, et al. 2018. A comprehensive genomic history of extinct and living elephants. Proc. Natl. Acad. Sci. U.S.A. 115:E2566–E2574.
  73. Polanski A, Bobrowski A, Kimmel M. 2003. A note on distributions of times to coalescence, under time-dependent population size. Theor. Popul. Biol. 63:33–40.
  74. Polanski A, Kimmel M. 2003. New explicit expressions for relative frequencies of single-nucleotide polymorphisms with application to statistical inference on population growth. Genetics 165:427–436.
  75. Pool JE, Nielsen R. 2007. Population size changes reshape genomic patterns of diversity. Evolution 61:3001–3006.
  76. Pool JE, Nielsen R. 2009. Correction for Pool and Nielsen (2007). Evolution 63:1671.
  77. Pouyet F, Aeschbacher S, Thiéry A, Excoffier L. 2018. Background selection and biased gene conversion affect more than 95% of the human genome and bias demographic inferences. eLife 7:e36317.
  78. Ragsdale AP, Gutenkunst RN. 2017. Inferring demographic history using two-locus statistics. Genetics 206:1037–1048.
  79. Ragsdale AP, Moreau C, Gravel S. 2018. Genomic inference using diffusion models and the allele frequency spectrum. Curr. Opin. Genet. Dev. 53:140–147.
  80. Sanjuán R. 2010. Mutational fitness effects in RNA and single-stranded DNA viruses: common patterns revealed by site-directed mutagenesis studies. Phil. Trans. R. Soc. B. 365:1975–1982.
  81. Schiffels S, Durbin R. 2014. Inferring human population size and separation history from multiple genome sequences. Nat. Genet. 46:919–925.
  82. Schneider A, Charlesworth B, Eyre-Walker A, Keightley PD. 2011. A method for inferring the rate of occurrence and fitness effects of advantageous mutations. Genetics 189:1427–1437.
  83. Schrider DR, Shanku AG, Kern AD. 2016. Effects of linked selective sweeps on demographic inference and model selection. Genetics 204:1207–1223.
  84. Sheehan S, Song YS. 2016. Deep learning for population genetic inference. PLoS Comput. Biol. 12:e1004845.
  85. Siepel A, Bejerano G, Pedersen JS, Hinrichs AS, Hou M, Rosenbloom K, Clawson H, Spieth J, Hillier LW, Richards S, et al. 2005. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res. 15:1034–1050.
  86. Slatkin M, Hudson RR. 1991. Pairwise comparisons of mitochondrial DNA sequences in stable and exponentially growing populations. Genetics 129:555–562.
  87. Speidel L, Forest M, Shi S, Myers SR. 2019. A method for genome-wide genealogy estimation for thousands of samples. Nat. Genet. 51:1321–1329.
  88. Steinrücken M, Kamm J, Spence JP, Song YS. 2019. Inference of complex population histories using whole-genome sequences from multiple populations. Proc. Natl. Acad. Sci. U.S.A. 116:17115–17120.
  89. Teshima KM, Coop G, Przeworski M. 2006. How reliable are empirical genomic scans for selective sweeps? Genome Res. 16:702–712.
  90. Thornton K. 2003. Libsequence: a C++ class library for evolutionary genetic analysis. Bioinformatics 19:2325–2327.
  91. Thornton KR, Jensen JD. 2007. Controlling the false-positive rate in multilocus genome scans for selection. Genetics 175:737–750.
  92. Torres R, Stetter MG, Hernandez RD, Ross-Ibarra J. 2020. The temporal dynamics of background selection in nonequilibrium populations. Genetics 214:1019–1030.
  93. Torres R, Szpiech ZA, Hernandez RD. 2018. Human demographic history has amplified the effects of background selection across the genome. PLoS Genet. 14:e1007387.
  94. Uricchio LH, Hernandez RD. 2014. Robust forward simulations of recurrent hitchhiking. Genetics 197:221–236.
  95. Warren WC, Jasinska AJ, García-Pérez R, Svardal H, Tomlinson C, Rocchi M, Archidiacono N, Capozzi O, Minx P, Montague MJ, et al. 2015. The genome of the vervet (Chlorocebus aethiops sabaeus). Genome Res. 25:1921–1933.
  96. Williamson RJ, Josephs EB, Platts AE, Hazzouri KM, Haudry A, Blanchette M, Wright SI. 2014. Evidence for widespread positive and negative selection in coding and conserved noncoding regions of Capsella grandiflora. PLoS Genet. 10:e1004622.
  97. Wu F, Zhao S, Yu B, Chen Y-M, Wang W, Song Z-G, Hu Y, Tao Z-W, Tian J-H, Pei Y-Y, et al. 2020. A new coronavirus associated with human respiratory disease in China. Nature 579:265–269.
  98. Zeng K, Charlesworth B. 2010. Studying patterns of recent evolution at synonymous sites and intronic sites in Drosophila melanogaster. J. Mol. Evol. 70:116–128.
  99. Zhou Y, Massonnet M, Sanjak JS, Cantu D, Gaut BS. 2017. Evolutionary genomics of grape (Vitis vinifera ssp. vinifera) domestication. Proc. Natl. Acad. Sci. U.S.A. 114:11715–11720.


Articles from bioRxiv are provided here courtesy of Cold Spring Harbor Laboratory Preprints
