Nature Communications. Letter. 2024 Apr 16;15:3240. doi: 10.1038/s41467-024-46261-4

Population genetic considerations regarding the interpretation of within-patient SARS-CoV-2 polymorphism data

Vivak Soni 1, John W Terbot II 1,2, Jeffrey D Jensen 1
PMCID: PMC11021480  PMID: 38627371

arising from H. Gu et al. Nature Communications 10.1038/s41467-023-37468-y (2023)

With the recent onset of the SARS-CoV-2 pandemic, there has been great interest in interpreting the within-patient evolutionary dynamics of this virus. Indeed, the accurate identification of genomic regions experiencing positive selection, and the quantification of these selective effects, is of crucial importance for both evolutionary and clinical interpretation. With this goal, the recently published work of Gu et al. (ref. 1) collected 2820 respiratory samples to investigate observed levels of within-patient synonymous relative to non-synonymous variation, and relied upon this comparison to assign genomic regions as evolving under purifying selection, neutrality, or positive selection. Specifically, they interpreted πN − πS > 0 as being indicative of positive selection, ~0 as being indicative of neutrality, and <0 as being indicative of purifying selection (e.g., see Fig. 2 of Gu et al.). Using this criterion when performing sliding window analyses, the authors claimed that multiple genomic regions are experiencing positive selection. Crucially, the authors relied upon the selection inference derived from these π-based comparisons to support conclusions regarding infection dynamics in vaccinated vs. unvaccinated patients, a focal point of their publication.
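As a minimal sketch of the statistic under discussion, the following computes per-site nucleotide diversity from sampled allele frequencies and applies the sign-based reading of πN − πS described above. The function names, the small-sample correction, and the tolerance used for "approximately zero" are illustrative assumptions; this is not the pipeline of either study.

```python
from typing import Sequence


def per_site_pi(freqs: Sequence[float], n_sites: int, n_samples: int,
                min_freq: float = 0.025) -> float:
    """Per-site nucleotide diversity from allele frequencies at biallelic SNPs.

    freqs     : frequency of one allele at each variable site of this class
    n_sites   : total number of sites of this class (variable or not)
    n_samples : number of sampled genomes
    min_freq  : SNPs with minor-allele frequency below this are masked
    """
    correction = n_samples / (n_samples - 1)  # small-sample correction (assumed)
    total = 0.0
    for p in freqs:
        if min(p, 1.0 - p) < min_freq:
            continue  # masked SNP: treated as an invariant site
        total += correction * 2.0 * p * (1.0 - p)
    return total / n_sites


def sign_based_call(pi_n: float, pi_s: float, tol: float = 1e-6) -> str:
    """Sign-based reading of piN - piS; tol is a hypothetical tolerance."""
    diff = pi_n - pi_s
    if diff > tol:
        return "positive selection"
    if diff < -tol:
        return "purifying selection"
    return "neutrality"


# Hypothetical frequencies for one sample of 100 genomes.
pi_n = per_site_pi([0.05, 0.30], n_sites=20_000, n_samples=100)
pi_s = per_site_pi([0.10, 0.01], n_sites=10_000, n_samples=100)
print(pi_n - pi_s, sign_based_call(pi_n, pi_s))
```

Note that the call depends on only a handful of SNP frequencies, and that the 0.01 variant in the second list is masked entirely by the 2.5% filter.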

Fig. 2. πN, πS, and πN − πS values simulated under a model of a partial selective sweep (top panels) and a complete selective sweep (bottom panels), both on a weakly deleterious background as well as a strongly deleterious background (as given by the deleterious DFEs in Fig. 1).

The top and bottom 2 × 2 plots present per-site πN (red) and πS (blue) values for 10 kb non-overlapping windows, as well as genome-wide (30 kb) values (left), and πN − πS values (right) across the same scales. Selective sweeps were modeled as a beneficial mutation with selection coefficient (s) = 10 introduced after 168N generations (7 days post-infection), in the middle of the simulated genome; sampling occurred when the beneficial mutation reached 50% frequency (partial sweep), and again at fixation (complete sweep). On average the beneficial mutation reached 50% frequency 14.8N generations and fixed 21.9N generations after introduction on the weakly deleterious background, and 15N generations and 22N generations, respectively, on the strongly deleterious background. All other parameter details are in Fig. 1. Source data are provided as a Source Data file. All code for replicating these results is available on GitHub (https://github.com/vivaksoni/Gu_etal_2023_response).

There is a long history in the field of population genetics of comparing non-synonymous and synonymous divergence in this regard (i.e., dN/dS), as well as in jointly interpreting non-synonymous to synonymous divergence relative to polymorphism (e.g., as implemented in the McDonald-Kreitman test (ref. 2), as well as numerous other related implementations; see refs. 3,4). In this framework, assuming that synonymous sites are evolving neutrally, the neutral divergence at these sites under genetic drift alone will be equal to the neutral mutation rate (ref. 5), and thus non-synonymous divergence may be interpreted as being depressed by purifying selection or accelerated by positive selection relative to this synonymous/neutral standard.
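The statement that neutral divergence accrues at the neutral mutation rate follows from a short, standard argument, sketched here for a haploid population. The symbols N (population size), μ (neutral mutation rate per site per generation), k (neutral substitution rate), and k_N (non-synonymous substitution rate) are notation assumed for this illustration rather than taken from the text.

```latex
\begin{align}
  k &= \underbrace{N\mu}_{\text{new neutral mutations per generation}}
       \times
       \underbrace{\frac{1}{N}}_{\text{fixation probability of each}}
     = \mu , \\[4pt]
  \frac{d_N}{d_S} &\approx \frac{k_N}{\mu}
  \quad
  \begin{cases}
    < 1 & \text{non-synonymous divergence depressed (purifying selection)} \\
    = 1 & \text{non-synonymous sites evolving at the neutral standard} \\
    > 1 & \text{non-synonymous divergence accelerated (positive selection)}
  \end{cases}
\end{align}
```

Because N cancels in the first line, the neutral substitution rate does not depend on population size.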

However, this divergence-based interpretation does not correctly extend to a comparison of πN and πS as utilized by Gu et al. As one example, the effects of selection at linked sites (reviewed in ref. 6) render this polymorphism-level interpretation problematic. Namely, even if mutations at synonymous sites are themselves neutral (and see ref. 7), their observed frequency in the population may be shaped by the episodic genetic hitchhiking effects associated with positive selection (i.e., selective sweeps; ref. 8), and will be shaped by the constantly occurring genetic hitchhiking effects associated with purifying selection (i.e., background selection; ref. 9). Importantly, these genetic hitchhiking effects will not impact divergence-based comparisons such as dN/dS (ref. 10; though there are nonetheless important considerations, see refs. 11,12), but they will strongly impact polymorphism-based comparisons such as the πN − πS of Gu et al.

For these reasons, one must account for the myriad evolutionary forces shaping observed levels of within-patient nucleotide variation when performing population genomic inference of this sort (refs. 13,14). In SARS-CoV-2 specifically, this evolutionary baseline model will necessarily include the underlying mutation and recombination rates, the history of population size change associated with infection, as well as the constant purging of deleterious mutations and the resulting effects on linked sites (refs. 15,16). Only by accounting for these certain-to-be-operating evolutionary processes may one determine if episodic or hypothesized processes (such as positive or balancing selection) need to be invoked to explain observed levels and patterns of variation (refs. 17–20).

Thus, to investigate the claims of Gu et al., we simulated this SARS-CoV-2 baseline model in both the presence and absence of positive selection, in order to better interpret the behavior of πN − πS. As shown in Figs. 1 and 2, these simulations reveal multiple reasons to question their interpretations. Firstly, because of the small number of variable sites observed in the SARS-CoV-2 genome in any given patient sample, particularly after their filtering for SNPs segregating at greater than 2.5% frequency in a folded site frequency spectrum (i.e., resulting in a median of ~5 SNPs/sampled genome in the patient data), there is an extremely large variance associated with πN and πS, which is only exacerbated by further reducing the scale of inference to specific genomic windows. For example, as shown in Fig. 1, in the complete absence of positive selection, purifying selection will naturally on average reduce the frequencies of non-synonymous relative to synonymous variants (though the latter will be experiencing background selection effects). However, the variance is such that there is an appreciable probability of observing πN values that are larger than πS (i.e., their criterion for identifying positive selection), particularly on a sliding-window scale.
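The variance argument can be illustrated with a deliberately crude Monte Carlo toy, which is not the SLiM-based simulation behind Figs. 1 and 2. Everything specific here is an assumption made for illustration: the Poisson SNP counts (2.5 per class, so roughly 5 SNPs per sample in total), the two frequency densities (proportional to 1/x for synonymous and 1/x² for non-synonymous variants, the latter mimicking purifying selection), and all function names. Only the site counts and the 2.5% folded-frequency cutoff are taken from the text.

```python
import math
import random

SYN_SITES = 10_000      # every third site of a 30 kb genome
NONSYN_SITES = 20_000   # all remaining sites
MIN_FREQ, MAX_FREQ = 0.025, 0.5   # folded frequencies that pass the 2.5% mask


def poisson(rng: random.Random, lam: float) -> int:
    """Poisson draw by Knuth's multiplication method (fine for small lam)."""
    limit, k, prod = math.exp(-lam), 0, 1.0
    while True:
        prod *= rng.random()
        if prod <= limit:
            return k
        k += 1


def draw_freq(rng: random.Random, skewed: bool) -> float:
    """Inverse-CDF draw on [MIN_FREQ, MAX_FREQ].

    skewed=False: density proportional to 1/x   (assumed neutral-like shape)
    skewed=True : density proportional to 1/x^2 (assumed shift to low frequency)
    """
    u = rng.random()
    if not skewed:
        return MIN_FREQ * (MAX_FREQ / MIN_FREQ) ** u
    inv_lo, inv_hi = 1.0 / MIN_FREQ, 1.0 / MAX_FREQ
    return 1.0 / (inv_lo - u * (inv_lo - inv_hi))


def class_pi(rng: random.Random, mean_snps: float, skewed: bool,
             n_sites: int) -> float:
    """Per-site diversity for one site class in one simulated sample."""
    n_snps = poisson(rng, mean_snps)
    het = 0.0
    for _ in range(n_snps):
        p = draw_freq(rng, skewed)
        het += 2.0 * p * (1.0 - p)
    return het / n_sites


def simulate(n_reps: int = 20_000, mean_syn: float = 2.5,
             mean_nonsyn: float = 2.5, seed: int = 1):
    """Return (mean of piN - piS, fraction of replicates with piN > piS)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_reps):
        pi_n = class_pi(rng, mean_nonsyn, True, NONSYN_SITES)
        pi_s = class_pi(rng, mean_syn, False, SYN_SITES)
        diffs.append(pi_n - pi_s)
    mean_diff = sum(diffs) / n_reps
    frac_positive = sum(d > 0 for d in diffs) / n_reps
    return mean_diff, frac_positive


mean_diff, frac_positive = simulate()
print(f"mean piN - piS: {mean_diff:.2e}; share with piN > piS: {frac_positive:.3f}")
```

Under these assumed settings the average of πN − πS is negative, yet a non-trivial share of replicates still shows πN > πS purely because so few SNPs contribute to each estimate. The printed values belong to this toy and should not be read as results of the study.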

Fig. 1. Per-site πN and πS values simulated under a model of primarily weakly deleterious mutations (top row), and a model of primarily strongly deleterious mutations (bottom row), occurring in the SARS-CoV-2 genome.

The leftmost column provides the deleterious distribution of fitness effects (DFE) from which non-synonymous mutations were sampled under these two respective models; the middle column presents πN (red) and πS (blue) values for 10 kb non-overlapping windows of the genome, as well as the genome-wide values (30 kb); the rightmost column presents πN − πS values across the same genomic windows, and genome-wide. Point estimates represent mean values across 200 simulation replicates, with the standard deviation plotted as 68% confidence intervals. Simulations were performed using SLiM 4.1 (ref. 26). Every third site of the genome was simulated as being strictly neutral (i.e., synonymous for the purpose of analysis), while all other sites were drawn from the respective DFE (i.e., non-synonymous for the purpose of analysis). Following the baseline model recommendations of refs. 15,16, the following parameterizations were utilized: infection bottleneck size = 1; recombination rate = 5.5e-5 events/site/cycle; mutation rate/site/replication = 2.135e-6; carrying capacity = 1e5. Simulations were run for 168N generations (corresponding to an infection of 7 days), with 100 genomes sampled at the end-point. As per ref. 1, SNPs with an allele frequency less than 2.5% were masked when estimating π. Source data are provided as a Source Data file. All code for replicating these results is available on GitHub (https://github.com/vivaksoni/Gu_etal_2023_response).
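For readers who want the legend's parameterization in one place, the values can be gathered into a small configuration object, sketched below. The class and field names are hypothetical and are not those used in the authors' SLiM scripts; only the numerical values come from the legend, and the derived quantities are simple arithmetic on them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BaselineParams:
    """Values quoted in the Fig. 1 legend; all field names are hypothetical."""
    genome_length: int = 30_000          # genome-wide scale (30 kb)
    window_length: int = 10_000          # non-overlapping windows (10 kb)
    bottleneck_size: int = 1             # infection bottleneck size
    recombination_rate: float = 5.5e-5   # events/site/cycle
    mutation_rate: float = 2.135e-6      # per site per replication
    carrying_capacity: int = 100_000     # 1e5
    n_sampled_genomes: int = 100         # genomes sampled at the end-point
    n_replicates: int = 200              # simulation replicates
    min_allele_freq: float = 0.025       # SNPs below this are masked

    @property
    def synonymous_sites(self) -> int:
        # every third site was simulated as strictly neutral
        return self.genome_length // 3

    @property
    def nonsynonymous_sites(self) -> int:
        return self.genome_length - self.synonymous_sites

    @property
    def n_windows(self) -> int:
        return self.genome_length // self.window_length

    @property
    def mutations_per_genome_per_replication(self) -> float:
        return self.mutation_rate * self.genome_length


params = BaselineParams()
print(params.synonymous_sites, params.nonsynonymous_sites, params.n_windows)
print(params.mutations_per_genome_per_replication)
```

With the quoted values this gives 10,000 synonymous and 20,000 non-synonymous sites in three windows, and roughly 0.064 new mutations per genome per replication.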

Secondly, even in the presence of positive selection (Fig. 2), the expectation of πN − πS > 0 implemented by Gu et al. would not successfully identify this evolutionary process. As shown for both a partial selective sweep (i.e., a beneficial mutation having reached 50% frequency in the patient population) and a complete selective sweep (i.e., a beneficial mutation having reached fixation in the patient population immediately prior to sampling), the expectation of πN − πS remains negative. This observation partly owes to the fact that linked synonymous variants will be increased in frequency via genetic hitchhiking more readily than linked non-synonymous variants, which are likely deleterious; as such, synonymous variation in the hitchhiked region of the genome may be augmented more than non-synonymous variation. In addition, these models are similarly characterized by a large variance.

We additionally extended this model to consider recurrent beneficial mutations. Specifically, we evaluated scenarios in which 1% of new mutations are beneficial and in which 10% of new mutations are beneficial, occurring on the strongly or weakly deleterious DFE backgrounds given in Figs. 1 and 2, or occurring on the DFE background recently estimated for SARS-CoV-2 experimentally (ref. 21). As shown in Supplementary Fig. 1, all scenarios produced genomic windows in which πN − πS is greater than 0 as well as windows in which it is less than 0, and even genome-wide there is no significant differentiation in these distributions. It is worth emphasizing that while an extreme scenario in which 10% of all newly arising mutations are strongly beneficial and simultaneously segregating in the population may indeed elevate πN relative to πS, even this unrealistic parameter space does not reliably produce this pattern. Furthermore, given that elevated πN may also be readily generated by models lacking positive selection entirely as shown, this π-based approach of Gu et al. remains inappropriate owing to issues of identifiability.

In summary, πN − πS is not a reliable indicator of selective effects and dynamics. As shown in the specific case of SARS-CoV-2, the large variance associated with relatively few genomic SNPs renders the interpretation highly tenuous, leading to a situation in which values greater than 0 and less than 0 are both associated with appreciable probabilities in the presence of purifying selection alone. Furthermore, even with the addition of positive selection, the observation of πN > πS is unreliable owing partly to the effects of genetic hitchhiking. For these reasons, statistical inference procedures which directly account for multiple competing evolutionary processes (see refs. 22,23), and which utilize more sophisticated expectations associated with patterns of variation in the site frequency spectrum and linkage disequilibrium associated with positive selection (as reviewed by ref. 24, and see ref. 25), would be required to evaluate the claims of Gu et al.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Supplementary information

Supplementary Information (143.8KB, pdf)
Reporting Summary (73.7KB, pdf)

Source data

Source Data (70KB, zip)

Acknowledgements

This work was supported by the National Institutes of Health grant R35GM139383 to J.D.J.

Author contributions

VS, JWT and JDJ conceived the project; VS performed simulations with input from JWT and JDJ; VS, JWT and JDJ wrote the manuscript; JDJ provided funding for the project.

Peer review

Peer review information

Nature Communications thanks the anonymous reviewer(s) for their contribution to the peer review of this work.

Data availability

Datasets generated and/or analyzed during the current study are available in the paper. Source data are provided with this paper.

Code availability

All scripts and data underlying the simulations, analyses, and Figures may be found at: https://github.com/vivaksoni/Gu_etal_2023_response.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

The online version contains supplementary material available at 10.1038/s41467-024-46261-4.

References

1. Gu H, et al. Within-host genetic diversity of SARS-CoV-2 lineages in unvaccinated and vaccinated individuals. Nat. Commun. 2023;14:1793. doi: 10.1038/s41467-023-37468-y.
2. McDonald JH, Kreitman M. Adaptive protein evolution at the Adh locus in Drosophila. Nature. 1991;351:652–654. doi: 10.1038/351652a0.
3. Charlesworth B, Charlesworth D. Elements of Evolutionary Genetics. W. H. Freeman and Company, New York; 2010.
4. Walsh B, Lynch M. Evolution and Selection of Quantitative Traits. Oxford University Press, Oxford; 2018.
5. Kimura M. The Neutral Theory of Molecular Evolution. Cambridge University Press, Cambridge; 1983.
6. Charlesworth B, Jensen JD. Effects of selection at linked sites on patterns of genetic variability. Annu. Rev. Ecol. Evol. Syst. 2021;52:177–197. doi: 10.1146/annurev-ecolsys-010621-044528.
7. Wang H, Pipes L, Nielsen R. Synonymous mutations and the molecular evolution of SARS-CoV-2 origins. Virus Evol. 2021;7:1–11. doi: 10.1093/ve/veaa098.
8. Maynard Smith J, Haigh J. The hitch-hiking effect of a favourable gene. Genet. Res. 1974;23:23–35. doi: 10.1017/S0016672300014634.
9. Charlesworth B, Morgan MT, Charlesworth D. The effect of deleterious mutations on neutral molecular variation. Genetics. 1993;134:1289–1303. doi: 10.1093/genetics/134.4.1289.
10. Birky CW, Walsh JB. Effects of linkage on rates of molecular evolution. Proc. Natl Acad. Sci. 1988;85:6414–6418. doi: 10.1073/pnas.85.17.6414.
11. Eyre-Walker A. Changing effective population size and the McDonald-Kreitman test. Genetics. 2002;162:2017–2024. doi: 10.1093/genetics/162.4.2017.
12. Kryazhimskiy S, Plotkin JB. The population genetics of dN/dS. PLoS Genet. 2008;4:e1000304. doi: 10.1371/journal.pgen.1000304.
13. Johri P, Eyre-Walker A, Gutenkunst RN, Lohmueller KE, Jensen JD. On the prospect of achieving accurate joint estimation of selection with population history. Genome Biol. Evol. 2022;14:evac088. doi: 10.1093/gbe/evac088.
14. Johri P, et al. Recommendations for improving statistical inference in population genomics. PLOS Biol. 2022;20:e3001669. doi: 10.1371/journal.pbio.3001669.
15. Terbot JW, et al. Developing an appropriate evolutionary baseline model for the study of SARS-CoV-2 patient samples. PLOS Pathog. 2023;19:e1011265. doi: 10.1371/journal.ppat.1011265.
16. Terbot JW, et al. A simulation framework for modeling the within-patient evolutionary dynamics of SARS-CoV-2. Genome Biol. Evol. 2023;15:evad204. doi: 10.1093/gbe/evad204.
17. Irwin KK, et al. On the importance of skewed offspring distributions and background selection in virus population genetics. Heredity. 2016;117:393–399. doi: 10.1038/hdy.2016.58.
18. Jensen JD, Kowalik TF. A consideration of within-host human cytomegalovirus genetic variation. Proc. Natl Acad. Sci. 2020;117:816–817. doi: 10.1073/pnas.1915295117.
19. Jensen JD. Studying population genetic processes in viruses: from drug-resistance evolution to patient infection dynamics. In: Bamford DH, Zuckerman M (eds.) Encyclopedia of Virology, 4th edition. 2021;5:227–232.
20. Johri P, Stephan W, Jensen JD. Soft selective sweeps: addressing new definitions, evaluating competing models, and interpreting empirical outliers. PLOS Genet. 2022;18:e1010022. doi: 10.1371/journal.pgen.1010022.
21. Flynn JA, et al. Comprehensive fitness landscape of SARS-CoV-2 Mpro reveals insights into viral resistance mechanisms. Elife. 2022;11:e77433. doi: 10.7554/eLife.77433.
22. Johri P, Charlesworth B, Jensen JD. Toward an evolutionarily appropriate null model: jointly inferring demography and purifying selection. Genetics. 2020;215:173–192. doi: 10.1534/genetics.119.303002.
23. Howell AA, et al. Developing an appropriate evolutionary baseline model for the study of human cytomegalovirus. Genome Biol. Evol. 2023;15:evad059. doi: 10.1093/gbe/evad059.
24. Stephan W. Selective sweeps. Genetics. 2019;211:5–13. doi: 10.1534/genetics.118.301319.
25. Soni V, Johri P, Jensen JD. Evaluating power to detect recurrent selective sweeps under increasingly realistic evolutionary null models. Evolution. 2023;77:2113–2127. doi: 10.1093/evolut/qpad120.
26. Haller BC, Messer PW. SLiM 4: Multispecies eco-evolutionary modeling. Am. Nat. 2023;201:E127–E139. doi: 10.1086/723601.
