Abstract
The recent increase in time-series population genomic data from experimental, natural, and ancient populations has been accompanied by a promising growth in methodologies for inferring demographic and selective parameters from such data. However, these methods have largely presumed that the populations of interest are well-described by the Kingman coalescent. In reality, many groups of organisms, including viruses, marine organisms, and some plants, protists, and fungi, typified by high variance in progeny number, may be best characterized by multiple-merger coalescent models. Estimation of population genetic parameters under Wright-Fisher assumptions for these organisms may thus be prone to serious mis-inference. We propose a novel method for the joint inference of demography and selection under the -coalescent model, termed Multiple-Merger Coalescent Approximate Bayesian Computation, or MMC-ABC. We first demonstrate mis-inference under the Kingman, and then exhibit the superior performance of MMC-ABC under conditions of skewed offspring distributions. In order to highlight the utility of this approach, we reanalyzed previously published drug-selection lines of influenza A virus. We jointly inferred the extent of progeny-skew inherent to viral replication and identified putative drug-resistance mutations.
Keywords: time-sampled inference, selection, population genetics, coalescent theory, sweepstakes reproduction
ELUCIDATION of the underlying processes of evolution through the measurement of temporal changes in allele frequencies has remained a major focus of population genetics since the founding of the field (Fisher 1930; Wright 1931). Advancements in sequencing technologies over the last decade have dramatically increased the availability of genome-wide time-sampled polymorphism data for a wide variety of organisms, and several methods have been developed to analyze such data (Malaspinas et al. 2012; Mathieson and McVean 2013; Foll et al. 2014a; Lacerda and Seoighe 2014; Steinrücken et al. 2014; Ferrer-Admetlla et al. 2016; Schraiber et al. 2016; Shim et al. 2016; Rousseau et al. 2017). Of primary interest is the estimation of site-specific selection coefficients, and new methods account for nonequilibrium demography and environmental fluctuations by, for example, accounting for effective population size, population structure, and changing selection intensities.
Time-series polymorphism data are generally available from three sources: experimentally evolved populations, clinical patient samples, and ancient specimens. Viruses are well-represented among such data, owing both to their obvious clinical relevance, as well as their short generation times, small genomes, and relatively high mutation rates. However, aspects of viral biology render the application of standard population genetic inference methods problematic. Namely, existing methodologies for analyzing time-sampled polymorphism data are generally developed around the Kingman coalescent framework and the Wright-Fisher (WF) model (Wright 1931; Kingman 1982) and are of questionable applicability to organisms typified by large variances in offspring distributions, or so-called “sweepstakes reproduction,” including not only viruses but many classes of prokaryotes, fungi, plants, and animals (reviewed in Tellier and Lemaire 2014; Irwin et al. 2016).
In particular, the WF model assumes constant population size, random mating, nonoverlapping generations, and Poisson offspring distributions with equal mean and variance. The Kingman coalescent is derived in the limit of the WF model and shares its assumptions. Reassuringly, population genetic statistics and methods developed under the Kingman have been shown to be robust to many violations of WF assumptions (Möhle 1998, 1999), and have been extended to incorporate selection, migration, and population structure (Neuhauser and Krone 1997; Nordborg 1997; Wilkinson-Herbots 1998). However, large variance in offspring number (Eldon and Wakeley 2006; Matuszewski et al. 2018), strong selection (Neher and Hallatschek 2013; Schweinsberg 2017), large sample sizes (Wakeley and Takahashi 2003; Bhaskar et al. 2014), and recurrent selective sweeps (Durrett and Schweinsberg 2004, 2005) may violate the critical assumption underlying the Kingman coalescent that only two lineages may coalesce at a time. Such a violation may produce genealogies that are characterized by multiple-lineage mergers. Thus, the analysis of genomic data from organisms characterized by highly skewed offspring distributions—such as viruses—may be prone to serious mis-inference if examined with traditional WF and Kingman based approaches, even under neutrality. In particular, the neutral multiple merger events induced by the reproductive biology of these organisms may be mistaken for multiple-merger events induced by positive selection (Hallatschek 2018).
Though not widely utilized for inference, an alternative class of multiple-merger coalescent (MMC) models have been developed that are more general than the Kingman (e.g., Bolthausen and Sznitman 1998; Pitman 1999; Sagitov 1999; Schweinsberg 2000; Möhle and Sagitov 2001), many being derived from Moran models generalized to allow multiple offspring per individual. Many of the recently derived MMC models form specific sub-classes of the -coalescent, of which the Kingman is also a specific case, in which only two lineages are allowed to merge in a generation (Donnelly and Kurtz 1999; Pitman 1999; Sagitov 1999). It has been demonstrated that expectations under MMC models differ from those of the Kingman coalescent in several significant ways: effective population size does not scale linearly with census size (N) as it does under the Kingman (Huillet and Möhle 2011); the site frequency spectrum (SFS) is skewed toward an excess of low- and high-frequency variants relative to the standard WF expectations, even under equilibrium neutrality (Eldon and Wakeley 2006; Blath et al. 2016); and the fixation probability of new beneficial mutations approaches one as population size increases (Der et al. 2011).
Eldon and Wakeley (2006, 2008, 2009) introduced a specific case of the broader class of MMC models, the -coalescent, under which the parameter Ψ describes the proportion of offspring in the population originating from a single parent in the previous generation. The -coalescent has been used in several instances to infer the strength and frequency of sweepstake events in marine organisms typified by Type-III survivorship curves (Eldon and Wakeley 2006; Birkner et al. 2013; Blath et al. 2016; Matuszewski et al. 2018), and the expected SFS has been determined under both standard and nonequilibrium demography (Matuszewski et al. 2018).
Thus, we here introduce a novel statistical inference approach, termed Multiple-Merger Coalescent Approximate Bayesian Computation (MMC-ABC), for inferring population genetic parameters from time-sampled polymorphism data in populations subject to sweepstakes reproduction. MMC-ABC first characterizes the neutral demography of the population by generating genome-wide estimates of N and Ψ. It then estimates site-specific selection coefficients under the inferred sweepstakes model. We demonstrate that failing to account for skewed offspring distributions results in strong mis-inference of both demography and selection, and that MMC-ABC is capable of accurate joint estimation of offspring skew and selection coefficients even when the population size is not precisely known.
Materials and Methods
Forward simulation of populations under the -coalescent
Eldon and Wakeley (2006) described a model, the -coalescent, where each reproductive event in a population of size N is either, with probability , a standard WF event yielding a single offspring, or, with probability ϵ, a multiple-merger event yielding offspring. The probability such that the coalescent history of a sample is dominated by multiple-merger events when , and produces a coalescent history typical of the Kingman. The rate at which k out of n lineages merge under the -coalescent is therefore (Tellier and Lemaire 2014):
Under this model, Ψ has a straightforward biological interpretation. Namely, it is equal to the proportion of individuals in generation who are the offspring of a single individual in (Eldon and Wakeley 2006). We simulated populations evolving under a -coalescent model with SLiM version 3 (Haller and Messer 2019). To circumvent the WF framework of SLiM, we utilized a system of subpopulations with migration to achieve the same effect as sweepstakes reproduction events. Each generation consists of three steps:
One individual is chosen from the population (A) and placed in a separate subpopulation (B) of size . The unidirectional migration rate from B to A is set to Ψ.
One WF generation occurs, with migration from subpopulation B resulting in the chosen individual contributing of the individuals of the next generation of A. A series of mate choice callbacks within SLiM force the migration rate to be exact, rather than stochastic (see source code in the Supplemental Materials). Thus, each generation is a mix of individual WF reproductive events and a single sweepstakes event of magnitude .
Subpopulation B is removed, and the next generation begins.
-based ABC method
The data X consist of allele frequency trajectories measured at L loci: . The -based ABC methodology (modified from the method of Foll et al. 2014a) infers genome-wide values of N and Ψ and L locus-specific selection coefficients . At a particular locus i, we can approximate the joint posterior distribution as:
where denotes summary statistics chosen to be informative about N and Ψ that are a function of all loci, and denotes locus-specific summary statistics chosen to be informative about . A two-step ABC algorithm as proposed by Bazin et al. (2010) is used to approximate this posterior:
Step 1. Obtain an approximation of the density
Simulate L trajectories for J populations using the starting frequencies from the first time point in each trajectory with N and Ψ for each trajectory sampled randomly from their priors, and J equal to the total number of simulation replicates.
Compute for each simulated population.
Retain the simulations with the smallest Euclidian distance between and to obtain a sample from an approximation to .
Step 2. For loci to :
Simulate K trajectories from a -coalescent model, with sampled randomly from its prior, and N and Ψ from the joint density obtained in Step 1.
Compute for each simulated trajectory.
Retain the simulations with the smallest Euclidian distance between and to obtain a sample from an approximation to .
In Step 1 of MMC-ABC, a population of size N and skew Ψ (chosen from their prior distributions) is evolved with variants matching those described by the empirical data. The starting frequencies are identical to those observed during the first sampled time point. The frequency of each allele under consideration is output at each generation of the trajectories in X.
As in the WF-ABC methodology of Foll et al. (2014b), we define as a single statistic, , an unbiased estimator of under the WF model, given by Jorde and Ryman (2007):
where x and y are the minor allele frequencies at the two time points separated by t generations, , and is the harmonic mean of the sample sizes and at the two time points expressed in the number of chromosomes (twice the number of individuals for diploids). We averaged values over sites and times to obtain a genome-wide estimator of for haploids and for diploids (Jorde and Ryman 2007). Note that we use the common notation where corresponds to the effective number of individuals, and the corresponding number of chromosomes for diploids is .
In Step 2 of MMC-ABC, simulations are performed for each site with an initial allele frequency and sample size matching those observed, and with N and Ψ drawn from a joint posterior derived during Step 1 and the selection coefficient s chosen from its prior. At each site we utilize two summary statistics derived from : with and calculated, respectively, between pairs of time points where the allele considered is decreasing and increasing in frequency, such that, at a given site, . For the diploid model, we define the relative fitness as , , and where h denotes the dominance ratio (1 = dominant, 0.5 = codominance, 0 = recessive), and as and for the haploid model (Ewens 2004).
Simulated data sets for testing performance of MMC-ABC
The data used for testing the performance of MMC-ABC were generated in one of two ways:
A diploid population of size N was first evolved under standard, neutral WF conditions for a burn-in period of 50,000 generations, and then evolved for a period of time under sweepstakes conditions. The frequencies of every segregating allele were output at the onset of sweepstakes conditions and at predetermined intervals for a set number of generations, including mutations present at the start of output as well as mutations that arose or fixed during the output period. Trajectories meeting minimum criteria (at least three informative time points, at least two consecutive time points with frequency >0.01, and at least one time point with frequency higher than 0.025) were retained. Data were generated in this manner for testing the performance of Step 1 of MMC-ABC (joint estimation of N and Ψ). Unfiltered single-time point population data were used to generate the observed SFS data in Figure 1.
Individual trajectories of mutations of a given starting frequency with selection coefficient s were modeled in a diploid population of size N with free recombination so that all sites were unlinked, with allele frequency trajectories and sweepstakes dynamics beginning in generation one. Trajectories generated in this manner were pooled into larger data sets for use in testing the performance of Step 2 of MMC-ABC (estimation of site-specific selection coefficients), with all allele trajectories beginning at a minor allele frequency of —a frequency low enough that most neutral mutations should not fix, but high enough to ensure the availability of multiple informative time points for most trajectories.
We frequently used a fixed value of throughout our study, as this is close to the value estimated for experimentally evolved lines of influenza analyzed below. Additionally, at this level of skew, multiple mergers should dominate the coalescent history of a population without entirely eliminating all segregating variation. When , variation is generally eliminated from the population more quickly than it can be generated, and we therefore restricted most of our analyses performed over a range of Ψ to values of .
Analysis of drug-resistance in influenza A virus
We applied MMC-ABC to time-series polymorphism data from experimentally evolved populations of influenza A virus, originally described by Foll et al. (2014a). The data consist of population genomic sequencing from two control lineages and two lineages exposed to exponentially increasing concentrations of the influenza drug oseltamivir, reared on Madin-Darby canine kidney (MDCK) cells and sampled every 13 generations. The data were previously analyzed with WF-ABC and putative drug-resistance mutations were identified. We reanalyzed the data with MMC-ABC for comparison.
Data availability
The source code and manual for MMC-ABC, along with the SLiM and python scripts used to generate our simulated data, are publicly available at https://github.com/sackmana/MMC-ABC/. The raw data from the experimentally evolved influenza virus populations can be found at the ALiVE repository at http://bib.umassmed.edu/influenza/. Supplemental material available at Figshare: https://doi.org/10.25386/genetics.7579943.
Results and Discussion
Effects of skewed offspring distributions on variation within populations
To underscore the importance of properly accounting for skewed offspring distributions when inferring selection from population genetic data, we briefly illustrate the effects of sweepstakes reproduction on two population genetic summary statistics. Under a model of sweepstakes reproduction where the variable Ψ describes the proportion of individuals in a generation that are the offspring of a single individual in the previous generation, we summarize in Figure 1 the SFS and Tajima’s D, averaged over 100 replicate populations of size under a broad range of Ψ.
The primary points of note are that, under equilibrium neutrality, nonzero values of Ψ skew the SFS toward an excess of singletons and high-frequency variants, and that Tajima’s D is negatively correlated with Ψ. The reader may note that Tajima’s D is slightly negative for , as should be expected given that Tajima’s D is a biased summary of the SFS dependent upon the recombination rate (Thornton 2005).
Hence, it is clear that failure to account for offspring skew may result in mis-inference, as null model expectations differ strongly from those of the WF model. In the following sections, we will demonstrate that accounting for sweepstakes reproduction simply as a decrease in (as in WF-ABC) results in highly biased estimates of selection. However, explicitly incorporating the underlying processes of MMC events can correctly adjust for their effects and yield accurate and precise estimates of s from time-series data.
Estimation of Ψ with MMC-ABC
In Step 1 of MMC-ABC, the trajectories of all sites included in the data are used to estimate using the unbiased estimator of Jorde and Ryman (2007). In the case where the census size or harmonic mean of the population size across all time points is known, as is often the case in experimental lineages, populations of census size N with sweepstakes parameter Ψ drawn from its prior and mutational frequencies matching those at the first time point of the data are simulated for the same number of generations as the original data. The best of simulations are retained to generate a posterior for Ψ.
MMC-ABC is able to accurately infer Ψ over a broad parameter space. Figure 2 shows the mean of the posterior distribution of Ψ averaged over 1000 replicate populations each at in the case where the correct value of N is specified. These illustrative parameter values were chosen to match general features of common viral experimental evolution studies (e.g., Foll et al. 2014a; Bank et al. 2016; Ormond et al. 2017).
Although in cases of experimental evolution precise measurements of N may be available to inform the prior used in Step 1 of MMC-ABC, knowledge of the size of the population in question may not be available. Therefore, we determined the power of MMC-ABC to accurately estimate Ψ in the absence of knowledge about the true value of N. In this case, both N and Ψ are drawn from priors, and MMC-ABC generates a joint posterior for the two parameters. We found that MMC-ABC is a good estimator of Ψ even when a large, uniform prior is used (Figure 3). MMC-ABC likewise performs well in the case where a single, incorrect value of N is specified, particularly for high values of Ψ, at which converges at the true value due to the nonlinear relationship between Ψ and (Figure 3).
We assessed the performance of MMC-ABC over a range of data types, including cases with 5, 11, or 21 time points over a span of 100 generations, as well as for sample sizes of 25, 100, and 250 for populations of at (Supplemental Material, Figures S1 and S2). As expected, the estimation of Ψ improves with larger sample sizes and more densely sampled time points. However, MMC-ABC remains a good estimator of Ψ, even with as few as five time points or a sample size of 25.
To assess the ability of MMC-ABC to perform accurate inference from time-series data including ancient samples, we estimated Ψ for 500 replicate simulated populations with data from 10 time points spaced 10 generations apart, and a single time point 500 generations in the past. The true value of Ψ was 0.1 for all populations, with roughly one third of sites being nonzero at the ancient time point. The average value of Ψ estimated across all replicate populations was 0.10 (Figure S3).
Estimation of site-specific selection coefficients
In the second step of MMC-ABC, the posterior distributions of N and Ψ obtained in Step 1 are used to simulate 10,000 trajectories at each site with the alleles introduced in the population at the initial frequency provided in the data. The best of simulations are retained to generate a posterior for s. Foll et al. (2014b) previously demonstrated WF-ABC to be a good estimator of genome-wide and site-specific selection coefficients in populations well-described by the Kingman coalescent. The performance of WF-ABC matched or exceeded that of similar methods. Therefore, we restrict our comparison of the performance of MMC-ABC to that of WF-ABC. For a detailed comparison of the performance of WF-ABC with that of other methods, including those of Bollback et al. (2008), Malaspinas et al. (2012), and Mathieson and McVean (2013), see the results of Foll et al. (2014b).
To compare the ability of MMC-ABC and WF-ABC to infer site-specific selection coefficients, we estimated s for 1000 trajectories simulated under the -coalescent with and for . All allele trajectories began from a minor allele frequency of . Because the summary statistics used by MMC-ABC and WF-ABC assume that the majority of sites are neutral, we provided true values of N and Ψ to MMC-ABC and of N to WF-ABC in this initial comparison. As shown in Figure 4, MMC-ABC is very accurate at estimating s under recurrent and strong sweepstakes reproduction, while WF-ABC consistently overestimates selection coefficients for positively selected sites and underestimates s for neutral sites. The same is true for small values of . The results of the same analysis performed over a broad range of demonstrate that the performance of WF-ABC rapidly deteriorates when Ψ is as large as 0.04 (Figure S4).
Estimating s for single trajectories of mutations covering a wider range of true selection coefficients from −0.1 to 0.4, it is evident that MMC-ABC is not only a good estimator under sweepstakes reproduction of selection for sites under positive selection and neutrality, but is also accurate for sites under negative selection. WF-ABC, however, in addition to having a strong bias toward overestimation of s for sites under positive selection, is negatively biased for neutral and negatively selected sites (Figure 5). Inference under the Kingman for organisms that violate the assumption of small variance in progeny distributions is thus prone to serious over- or under-estimation of selection coefficients, while correctly accounting for reproductive skew produces accurate estimates of selective strength. This mis-inference under the WF model results from the acceleration of transit times under sweepstakes reproduction, which are interpreted by WF-ABC as an amplification of positive or negative selection.
As with the estimation of Ψ, we assessed the performance of Part 2 of MMC-ABC over a range of data types, including cases with 5, 11, or 21 time points over a span of 100 generations, as well as for sample sizes of 25, 100, and 250 for populations of at (Figures S5 and S6). Again, as expected, the estimation of s improves with larger sample sizes and more densely sampled time points. However, MMC-ABC is a reasonably good estimator of s even with as few as five time points or a sample size of 25.
Joint estimation of N, Ψ, and s
We simulated trajectories for 9500 neutral loci and 500 selected loci for which , under conditions in which and . MMC-ABC estimated first N and Ψ over priors of and , respectively, and then estimated s for each site with values of N and Ψ drawn from the joint posterior (Figure 6). The estimated value of , the mean estimated value of s for neutral sites was and for positively selected sites was 0.0924, highlighting the ability of MMC-ABC to jointly estimate the magnitude of skewed offspring distributions and site-specific selection coefficients with accuracy, even when a relatively large proportion ( in this case) of sites are under strong positive selection.
Results of similar analyses comparing the performance of MMC-ABC and WF-ABC for sets of simulated trajectories of 1900 neutral and 100 selected loci with and demonstrate good performance of MMC-ABC over a broad range of Ψ, and poor performance of WF-ABC when (Figures S7 and S8).
These results are notable, given that both recurrent positive selection and skewed progeny distributions can result in coalescent trees dominated by multiple-mergers (Durrett and Schweinsberg 2004, 2005). Different features of the data—resulting from the localized effects of selection and the genome-wide effects of sweepstakes reproduction—allow us to disentangle the MMC behavior of neutral offspring skew from that of non-neutral offspring skew generated by positive selection.
Application to data from influenza A
We applied MMC-ABC to time-series data from the experimental evolution of influenza A. These data were collected under standard culture conditions and during a period of exposure to exponentially increasing concentrations of the drug oseltamivir (Foll et al. 2014a; Renzette et al. 2014).
The data consist of time-sampled minor allele frequencies for two control lineages and two drug-selected lineages. Using WF-ABC, Foll et al. (2014a) previously estimated the effective population sizes of the control and selected populations to be 176 and 226, respectively, with values of derived from the harmonic means of the population sizes during passaging being 737 and 696, respectively. They hypothesized that the discrepancies in measurements of were likely due to the large variance in viral burst sizes, yielding skewed offspring distributions. These experimentally evolved populations are therefore well-suited to the application of MMC-ABC.
We first obtained estimates of Ψ for each population, using the harmonic population size means as a prior for N. We then obtained posterior distributions of s for all mutations segregating in at least two time points and with a minimum frequency of for at least one time point. We define Bayesian “P-values” for s as and consider a trajectory to be “significant at level p” if its equal-tailed posterior interval excludes zero (Beaumont and Balding 2004).
The mean posterior estimate of Ψ for the two control lines was 0.067, and the mean value of Ψ across both drug-treatment lines was 0.084. MMC-ABC recovered two of the same six control line mutations and 7 of the 15 mutations from the drug selection lines identified by Foll et al. (2014a) as being beneficial at the level (Table 1, summarizing all 8 mutations significant under MMC-ABC and 8 of the 20 significant under WF-ABC, sites of known drug-resistance mutations shown in bold font). The mutations of significant beneficial effect under WF-ABC had an average effect of for control line mutations and for drug-selection mutations. The same sets of mutations (including those that did not achieve significance under MMC-ABC) had average effects of and , as estimated by MMC-ABC.
Table 1. Influenza A virus mutations identified as significantly beneficial by MMC-ABC.
Segment | Position | Substitution type | Initial freq. (%) | Final freq. (%) | WF-ABC s estimates (99 HPDIs) | MMC-ABC s estimates (99 HPDIs) | |
---|---|---|---|---|---|---|---|
Control 1 | HA | 1395 | Nonsynonymous | 0.12 (0.05, 0.19) | 0.14 (0.06, 0.21) | ||
Control 2 | HA | 1211 | Nonsynonymous | 0.20 (0.08, 0.35) | 0.23 (0.15, 0.32) | ||
Drug 1 | PA | 2194 | Synonymous | 0.09 (0.02, 0.17) | 0.11 (0.05, 0.18) | ||
HA | 48 | Synonymous | 0.14 (0.06, 0.27) | 0.16 (0.05, 0.24) | |||
HA | 1395 | Nonsynonymous | 0.22 (0.08, 0.34) | 0.27 (0.13, 0.42) | |||
NA | 582 | Synonymous | 0.29 (0.15, 0.45) | 0.43 (0.28, 0.56) | |||
NA | 823a | Nonsynonymous | 0.15 (0.06, 0.24) | 0.18 (0.08, 0.28) | |||
Drug 2 | NA | 823a | Nonsynonymous | 0.27 (0.12, 0.48) | 0.26 (0.13, 0.42) |
Sites of known drug-resistance mutations
The beneficial mutations identified in the control lines are likely adaptations to the MDCK cells used in serial passaging. One mutation at nucleotide position 1395 of the hemagglutinin segment, which rose to high frequency in the first control and drug lines, has been widely observed across influenza strains and is a common adaptation to tissue culture (Daniels et al. 1985; Reed et al. 2009; Foll et al. 2014a). Another mutation, which reached high frequency in the second control line, has likewise been associated with adaptation to culture conditions (Lin et al. 1997; Ilyushina et al. 2007). Notably, the mutation at position 823 of the neuraminidase segment (identified as H275Y under the N2 numbering system) achieved high frequency in both drug lineages, and is a well-documented resistance mutation for oseltamivir (Sha and Luo 1997; Arzt et al. 2001; Collins et al. 2008).
Six of the eight synonymous mutations found to be significantly beneficial by WF-ABC were not significantly beneficial under MMC-ABC. By estimating an appropriate neutral null model under the -coalescent, we reduced the list of candidate resistance mutations, thus likely minimizing the rate of false positives and excluding many hitchhiking mutations (as the synonymous sites are likely to be). This is supported by an analysis of the proportion of neutral and positively selected mutations that are significantly beneficial or deleterious under WF-ABC and MMC-ABC, which demonstrated a high false-positive rate of neutral mutations classified as strongly beneficial by WF-ABC under moderately strong offspring skew (Table S2). Several experimentally validated mutations known to improve either infectivity in tissue culture or resistance to oseltamivir were retained under MMC-ABC, as were a handful of other potential candidate resistance mutations.
Conclusions
The revolution in sequencing technology has increased the availability of time-series polymorphism data by orders of magnitude, but the utility of such data relies upon the derivation and development of appropriate inference methodologies. The neutral biology of large swaths of the tree of life renders the most common class of method based on the Kingman coalescent of questionable use. We have demonstrated here that performing inference under the assumptions of the Wright-Fisher model and the Kingman coalescent leads to an incorrect understanding of both population size and selection coefficients in such organisms. Matuszewski et al. (2018) have also shown this to be true for the demographic history of the population. Fortunately, the theoretical details are in place to develop similar inference of demography and selection under biologically appropriate alternative coalescent models (Wakeley 2013).
We have shown that MMC-ABC is able to jointly estimate N, Ψ, and site-specific selection coefficients accurately, even under high levels of reproductive skew and with an unknown population size. Notably, we were able to distinguish selection-induced offspring skew from skew originating from the neutral reproductive biology of populations, largely due to the genome-wide scale of MMC events relative to the localized effects of selection. We were also able to differentiate drift-induced effects imposed by small population sizes from those induced by sweepstakes reproduction events.
Very little is known regarding the extent of progeny skew across groups of viruses, bacteria, and plants, or the extent of skew artificially induced by domestication and cultivation. However, this work demonstrates that, at least with time-sampled allele frequency data, such inference is now possible. Moreover, our method will allow for the construction of much more accurate neutral null models in these organisms, which will greatly reduce false-positive rates in scans for selection, provide a more accurate picture of demographic history, and reveal previously hidden details regarding variance in offspring number.
Acknowledgments
We thank Stefan Laurent for helpful discussion. This work was funded by grants from the European Research Council, the Swiss National Science Foundation, and the U.S. Department of Defense to J.D.J.
Footnotes
Supplemental material available at Figshare: https://doi.org/10.25386/genetics.7579943.
Communicating editor: M. Beaumont
Literature Cited
- Arzt S., Baudin F., Barge A., Timmins P., Burmeister W. P., et al. , 2001. Combined results from solution studies on intact influenza virus M1 protein and from a new crystal form of its N-terminal domain show that M1 is an elongated monomer. Virology 279: 439–446. 10.1006/viro.2000.0727 [DOI] [PubMed] [Google Scholar]
- Bank C., Renzette N., Liu P., Matuszewski S., Shim H., et al. , 2016. An experimental evaluation of drug-induced mutational meltdown as an antiviral treatment strategy. Evolution 70: 2470–2484. 10.1111/evo.13041 [DOI] [PubMed] [Google Scholar]
- Bazin E., Dawson K. J., Beaumont M. A., 2010. Likelihood-free inference of a population structure and local adaptation in a Bayesian hierarchical model. Genetics 185: 587–602. 10.1534/genetics.109.112391 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Beaumont M., Balding D., 2004. Identifying adaptive genetic divergence among population from genome scans. Mol. Ecol. 13: 969–980. 10.1111/j.1365-294X.2004.02125.x [DOI] [PubMed] [Google Scholar]
- Bhaskar A., Clark A. G., Song Y. S., 2014. Distortion of genealogical properties when the sample is very large. Proc. Natl. Acad. Sci. USA 111: 2385–2390. 10.1073/pnas.1322709111 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Birkner M., Blath J., Eldon B., 2013. Statistical properties of the site-frequency spectrum associated with Λ-coalescents. Genetics 195: 1037–1053. 10.1534/genetics.113.156612 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Blath J., Cronjäger M. C., Eldon B., Hammer M., 2016. The site-frequency spectrum associated with Ξ-coalescents. Theor. Popul. Biol. 110: 36–50. 10.1016/j.tpb.2016.04.002 [DOI] [PubMed] [Google Scholar]
- Bollback J. P., York T. L., Nielsen R., 2008. Estimation of 2Nes from temporal allele frequency data. Genetics 179: 497–502. 10.1534/genetics.107.085019 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Bolthausen E., Sznitman A., 1998. On Ruelle’s probability cascades and an abstract cavity method. Commun. Math. Phys. 197: 247–276. 10.1007/s002200050450 [DOI] [Google Scholar]
- Collins P. J., Haire L. F., Lin Y. P., Liu J., Russell R. J., et al. , 2008. Crystal structures of oseltamivir-resistant influenza virus neuraminidase mutants. Nature 453: 1258–1261. 10.1038/nature06956 [DOI] [PubMed] [Google Scholar]
- Daniels R. S., Downie J. C., Hay A. J., Knossow M., Skehel J. J., et al. , 1985. Fusion mutants of the influenza virus hemagglutinin glycoprotein. Cell 40: 431–439. 10.1016/0092-8674(85)90157-6 [DOI] [PubMed] [Google Scholar]
- Der R., Epstein C. L., Plotkin J. B., 2011. Generalized population models and the nature of genetic drift. Theor. Popul. Biol. 80: 80–99. 10.1016/j.tpb.2011.06.004 [DOI] [PubMed] [Google Scholar]
- Donnelly P., Kurtz T. G., 1999. Particle representations for measure-valued population models. Ann. Probab. 27: 166–205. 10.1214/aop/1022677258 [DOI] [Google Scholar]
- Durrett R., Schweinsberg J., 2004. Approximating selective sweeps. Theor. Popul. Biol. 66: 129–138. 10.1016/j.tpb.2004.04.002 [DOI] [PubMed] [Google Scholar]
- Durrett R., Schweinsberg J., 2005. A coalescent model for the effect of advantageous mutations on the genealogy of a population. Stochastic Process. Appl. 115: 1628–1657. 10.1016/j.spa.2005.04.009 [DOI] [Google Scholar]
- Eldon B., Wakeley J., 2006. Coalescent processes when the distribution of offspring number among individuals is highly skewed. Genetics 172: 2621–2633. 10.1534/genetics.105.052175 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Eldon B., Wakeley J., 2008. Linkage disequilibrium under skewed offspring distribution among individuals in a population. Genetics 178: 1517–1532. 10.1534/genetics.107.075200 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Eldon B., Wakeley J., 2009. Coalescence times and FST under a skewed offspring distribution among individuals in a population. Genetics 181: 615–629. 10.1534/genetics.108.094342 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Ewens W. J., 2004. Mathematical Population Genetics: Theoretical Introduction. Springer, New York: 10.1007/978-0-387-21822-9 [DOI] [Google Scholar]
- Ferrer-Admetlla A., Leuenberger C., Jensen J. D., Wegmann D., 2016. An approximate Markov model for the Wright-Fisher diffusion and its application to time series data. Genetics 203: 831–846. 10.1534/genetics.115.184598 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Fisher R. A., 1930. The Genetical Theory of Natural Selection. Oxford University Press, Oxford: 10.5962/bhl.title.27468 [DOI] [Google Scholar]
- Foll M., Poh Y.-P., Renzette N., Ferrer-Admetlla A., Bank C., et al. , 2014a Influenza virus drug resistance: a time-sampled population genetics perspective. PLoS Genet. 10: e1004185 10.1371/journal.pgen.1004185 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Foll M., Shim H., Jensen J. D., 2014b WFABC: a Wright-Fisher ABC-based approach for inferring effective population sizes and selection coefficients from time-sampled data. Mol. Ecol. Resour. 15: 87–98. 10.1111/1755-0998.12280 [DOI] [PubMed] [Google Scholar]
- Hallatschek O., 2018. Selection-like biases emerge in population models with recurrent jackpot events. Genetics 210: 1053–1073. 10.1534/genetics.118.301516 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Haller B. C., Messer P. W., 2019. Slim 3: forward genetic simulations beyond the Wright-Fisher model. Mol. Biol. Evol. 10.1093/molbev/msy228 [DOI] [PMC free article] [PubMed]
- Huillet T., Möhle M., 2011. Population genetics models with skewed fertilities: a forward and backward analysis. Stoch. Models 27: 521–554. 10.1080/15326349.2011.593411 [DOI] [Google Scholar]
- Ilyushina N. A., Govorkova E. A., Russell C. J., Hoffmann E., Webster R. G., 2007. Contribution of H7 haemagglutinin to amantadine resistance and infectivity of influenza virus. J. Gen. Virol. 88: 1266–1274. 10.1099/vir.0.82256-0 [DOI] [PubMed] [Google Scholar]
- Irwin K., Laurent S., Matuszewski S., Vuilleumier S., Ormond L., et al. , 2016. On the importance of skewed offspring distributions and background selection in virus population genetics. Heredity 117: 393–399. 10.1038/hdy.2016.58 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Jorde P. E., Ryman N., 2007. Unbiased estimator for genetic drift and effective population size. Genetics 177: 927–935. 10.1534/genetics.107.075481 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Kingman J. F. C., 1982. The coalescent. Stochastic Process. Appl. 13: 235–248. 10.1016/0304-4149(82)90011-4 [DOI] [Google Scholar]
- Lacerda M., Seoighe C., 2014. Population genetics inference of longitudinally-sampled mutants under strong selection. Genetics 198: 1237–1250. 10.1534/genetics.114.167957 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Lin Y. P., Wharton S. A., Martin J., Skehel J. J., Wiley D. C., et al. , 1997. Adaptation of egg-grown and transfectant influenza viruses for growth in mammalian cells: selection of hemagglutinin mutants with elevated pH of membrane fusion. Virology 233: 402–410. 10.1006/viro.1997.8626 [DOI] [PubMed] [Google Scholar]
- Malaspinas A.-S., Malaspinas O., Evans S. N., Slatkin M., 2012. Estimating allele age and selection coefficient from time-serial data. Genetics 192: 599–607. 10.1534/genetics.112.140939 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Mathieson I., McVean G., 2013. Estimating selection coefficients in spatially structured populations from time series data of allele frequencies. Genetics 193: 973–984. 10.1534/genetics.112.147611 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Matuszewski S., Hildebrandt M. E., Achaz G., Jensen J. D., 2018. Coalescent processes with skewed offspring distributions and nonequilibrium demography. Genetics 208: 323–338. 10.1534/genetics.117.300499 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Möhle M., 1998. Robustness results for the coalescent. J. Appl. Probab. 35: 438–447. 10.1239/jap/1032192859 [DOI] [Google Scholar]
- Möhle M., 1999. Weak convergence to the coalescent in neutral population models. J. Appl. Probab. 36: 446–460. 10.1239/jap/1032374464 [DOI] [Google Scholar]
- Möhle M., Sagitov S., 2001. A classification of coalescent processes for haploid exchangeable population models. Ann. Probab. 29: 1547–1562. [Google Scholar]
- Neher R. A., Hallatschek O., 2013. Genealogies of rapidly adapting populations. Proc. Natl. Acad. Sci. USA 110: 437–442. 10.1073/pnas.1213113110 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Neuhauser C., Krone S. M., 1997. The genealogy of samples in models with selection. Genetics 145: 519–534. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Nordborg M., 1997. Structured coalescent processes on different time scales. Genetics 146: 1501–1514. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Ormond L., Liu P., Matuszewski S., Renzette N., Bank C., et al. , 2017. The combined effect of oseltamivir and favipiravir on influenza A virus evolution. Genome Biol. Evol. 9: 1913–1924. 10.1093/gbe/evx138 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Pitman J., 1999. Coalescents with multiple collisions. Ann. Probab. 27: 1870–1902. 10.1214/aop/1022874819 [DOI] [Google Scholar]
- Reed M. L., Yen H., DuBois R. M., Bridges O. A., Salomon R., et al. , 2009. Amino acid residues in the fusion peptide pocket regulate the pH of activation of the H5N1 influenza virus hemagglutinin protein. J. Virol. 83: 3568–3580. 10.1128/JVI.02238-08 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Renzette N., Caffrey D. R., Zeldovich K. B., Liu P., Gallagher G. R., et al. , 2014. Evolution of the influenza A virus genome during development of oseltamivir resistance in vitro. J. Virol. 88: 272–281. 10.1128/JVI.01067-13 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Rousseau E., Moury B., Malleret L., Senoussi R., Palloix A., et al. , 2017. Estimating virus effective population size and selection without neutral markers. PLoS Pathog. 13: e1006702 10.1371/journal.ppat.1006702 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Sagitov S., 1999. The general coalescent with asynchronous mergers of ancestral lines. J. Appl. Probab. 36: 1116–1125. 10.1239/jap/1032374759 [DOI] [Google Scholar]
- Schraiber J. G., Evans S. N., Slatkin M., 2016. Bayesian inference of natural selection from allele frequency time series. Genetics 203: 493–511. 10.1534/genetics.116.187278 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Schweinsberg J., 2000. Coalescents with simultaneous multiple collisions. Electron. J. Probab. 5: 50 10.1214/EJP.v5-68 [DOI] [Google Scholar]
- Schweinsberg J., 2017. Rigorous results for a population model with selection II: genealogy of the population. Electron. J. Probab. 22: 54. [Google Scholar]
- Sha B., Luo M., 1997. Structure of a bifunctional membrane-RNA binding protein, influenza virus matrix protein M1. Nat. Struct. Mol. Biol. 4: 239–244. 10.1038/nsb0397-239 [DOI] [PubMed] [Google Scholar]
- Shim H., Laurent S., Matuszewski S., Foll M., Jensen J. D., 2016. Detecting and quantifying changing selection intensities from time-sampled polymorphism data. G3 (Bethesda) 6: 893–904. 10.1534/g3.115.023200 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Steinrücken M., Bhaskar A., Song Y. S., 2014. A novel spectral method for inferring general diploid selection from time series genetic data. Ann. Appl. Stat. 8: 2203–2222. 10.1214/14-AOAS764 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Tellier A., Lemaire C., 2014. Coalescence 2.0: a multiple branching of recent theoretical developments and their applications. Mol. Ecol. 23: 2637–2652. 10.1111/mec.12755 [DOI] [PubMed] [Google Scholar]
- Thornton K., 2005. Recombination and the properties of Tajima’s D in the context of approximate-likelihood calculation. Genetics 171: 2143–2148. 10.1534/genetics.105.043786 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Wakeley J., 2013. Coalescent theory has many new branches. Theor. Popul. Biol. 87: 1–4. 10.1016/j.tpb.2013.06.001 [DOI] [PubMed] [Google Scholar]
- Wakeley J., Takahashi T., 2003. Gene genealogies when the sample size exceeds the effective size of the population. Mol. Biol. Evol. 20: 208–213. 10.1093/molbev/msg024 [DOI] [PubMed] [Google Scholar]
- Wilkinson-Herbots H. M., 1998. Genealogy and subpopulation differentiation under various models of population structure. J. Math. Biol. 37: 535–585. 10.1007/s002850050140 [DOI] [Google Scholar]
- Wright S. G., 1931. Evolution in Mendelian populations. Genetics 15: 97–159. [DOI] [PMC free article] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Data Availability Statement
The source code and manual for MMC-ABC, along with the SLiM and python scripts used to generate our simulated data, are publicly available at https://github.com/sackmana/MMC-ABC/. The raw data from the experimentally evolved influenza virus populations can be found at the ALiVE repository at http://bib.umassmed.edu/influenza/. Supplemental material available at Figshare: https://doi.org/10.25386/genetics.7579943.