Your father may have given you your nose, but your sneeze owes its genes to its neighbors. From family trees to the tree of life, evolution is commonly depicted as an exclusively “vertical” process in which novel features are stochastically acquired and may be inherited only by direct descendants. This intuition is shaped by centuries of observing multicellular behavior. Charismatic megafauna like ourselves display several dozen de novo mutations in each newborn germline (1) for every recombination event (between the parent genomes). While viral recombination rates are highly variable, some examples, notably HIV, demonstrate parity between rates of recombination and point mutation (2). In PNAS, Steinberg et al. (3) suggest that recombination among multiple common human pathogenic viruses, including SARS-like viruses, exceeds even this threshold, exhibiting several recombination events per synonymous substitution.
This result is achieved through the application and further development of a computationally efficient method, first introduced by the same group to evaluate rates of bacterial recombination (4), which makes it possible to work with the massive genomic datasets fundamental to pandemic preparedness. More broadly, these strikingly high predicted recombination rates motivate a robust reconsideration of the role of horizontally transmitted genomic information in microbial populations (5, 6).
Recombination has played an important part in the human host adaptation of past pandemic viruses (7). Among nonsegmented RNA viruses, including SARS-like viruses, recombination is usually the result of template switching, in which a polymerase detaches from the genome being copied before the copy is complete and attaches to another template (Fig. 1A). Reattachment to the original position in the new template, resulting in homologous recombination, is most commonly observed, presumably due to constraints on sequence similarity. This provides a mechanism to mitigate the accumulation of weakly deleterious “passenger” mutations across a genome which has acquired a substantial adaptation within a specific gene. Recombination also influences the impact of positive epistasis, a major driver of molecular evolution (8), where the increased fitness resulting from an entire ensemble of mutations is greater than the summed effects of the constituent mutations, any of which may individually be deleterious. When recombination is frequent relative to mutation, ensembles of mutations connected through positive epistasis are more likely to emerge. Accordingly, estimating viral recombination rates is an important part of assessing pathogen evolutionary risk.
Fig. 1.
(A) Cartoon of copy-choice recombination. The black oval denotes RNA-dependent RNA polymerase (RdRp). (B) Illustration of coalescence. The last common ancestors of arbitrary pairs (circles and squares) are identified. (C) Schematic of the method. Begin with a multisequence alignment containing N genomes of length L, with a possible nucleotides at each site (here a = 4). Compute binary difference vectors for all pairs of genomes. Average over pairs and then sites to obtain the pairwise diversity, dS. Repeat a similar procedure for paired-site statistics to obtain P(l). Fit the analytical form of P(l) to the data via least squares to infer values for the two free parameters, θS and φS, the expected numbers of synonymous substitutions and recombination events, respectively, per site since coalescence within the sample (the subscript S differentiates the sample from the broader “pool” described in the main text).
Most available tools for estimating recombination rates rely on the explicit construction of phylogenetic trees. When the ancestry inferred for a target gene is substantially different from that for an entire genome, or the core reference gene, this phylogenetic incongruence indicates that the target gene has likely been subject to horizontal transmission. These classical methods are capable of yielding valuable biological insight into specific recombination events potentially linked to epidemiological consequences (7) but come with three caveats. First, the statistical power is generally low (9), leading to a systematic underestimation of recombination (10). Second, the construction of and statistical inference over large phylogenies are scientifically challenging and computationally intensive (11). Third, these models assess the possibility of recombination only among observed sequences and not between known isolates and the larger, unknown viral reservoir or broader “gene pool.”
The method applied by Steinberg et al. (3) addresses these complications with a phylogeny-free approach. Briefly, the conditional probability, P(l), of observing a synonymous substitution in site i + l given a synonymous substitution in site i is directly computed from a multisequence alignment of all known isolates. Only fourfold degenerate, third-codon positions are included. Greater degrees of recombination result in a higher conditional probability for small l due to the possibility of multiple horizontally transmitted substitutions present in a single sequence fragment. A simplified model of copy-choice recombination among a pool of genomes including a subset of observed isolates is then constructed, within the broader framework of coalescence theory (12), and an analytical form for P(l) is derived.
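As a rough illustration of the empirical side of this step, the sketch below estimates P(l) from a precomputed table of binary pairwise differences. It is not the authors' implementation: the function name `conditional_substitution_probability`, the input layout (one row per pair of isolates, one column per retained fourfold degenerate site), and the choice to pool counts over all pairs before dividing are assumptions made for the example, and the published method may average differently.

```python
import numpy as np

def conditional_substitution_probability(diff, max_l):
    """Estimate P(l) for l = 1..max_l.

    diff: boolean array of shape (n_pairs, n_sites); diff[p, i] is True when
          the two genomes of pair p differ at site i (hypothetical layout).
    Returns an array whose entry l - 1 estimates the probability of a
    difference at site i + l given a difference at site i, pooled over pairs.
    """
    diff = np.asarray(diff, dtype=bool)
    n_sites = diff.shape[1]
    profile = np.full(max_l, np.nan)  # NaN where P(l) cannot be estimated
    for l in range(1, max_l + 1):
        if l >= n_sites:
            break
        first = diff[:, : n_sites - l]   # site i
        second = diff[:, l:]             # site i + l
        given = first.sum()              # number of (pair, site i) differences
        if given > 0:
            profile[l - 1] = (first & second).sum() / given
    return profile

# Toy input: two pairs of genomes compared at four sites.
toy_diff = [[1, 1, 0, 0],
            [1, 0, 1, 0]]
toy_profile = conditional_substitution_probability(toy_diff, max_l=3)
```

In a real analysis the profile would be computed over many more sites and distances, and it is this observed curve to which the analytical form of P(l) is later fit.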
The analytical form of P(l) depends on just three measurable quantities and two free parameters. The measured quantities are the genome length, the number of alleles at each locus (four in this case, corresponding to the four possible nucleotides), and the pairwise diversity of all known isolates. The pairwise diversity is computed by mapping each pair of isolates to a binary vector (same nucleotide or different at each site), averaging over all vectors at each site, and then averaging over all sites to yield a constant. The free parameters are the expected numbers of synonymous substitutions and recombination events, respectively, per site since coalescence among the sampled isolates (that is, between the last common ancestor of any pair of genomes within the sample and either extant individual; Fig. 1B). These are fit through least-squares minimization (13) between the observed and predicted P(l). See Fig. 1C for a simplified schematic of the method.
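The pairwise diversity calculation described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' code: the function name `pairwise_diversity` is hypothetical, and the input is assumed to be a list of equal-length aligned strings that has already been restricted to fourfold degenerate third-codon positions.

```python
from itertools import combinations

import numpy as np

def pairwise_diversity(sequences):
    """Average fraction of differing sites over all pairs of aligned sequences.

    sequences: list of equal-length strings (assumed pre-filtered to the
               synonymous sites of interest).
    """
    if len(sequences) < 2:
        raise ValueError("at least two sequences are required")
    seqs = np.array([list(s) for s in sequences])  # shape (N, L)
    # One binary difference vector per pair: True where the nucleotides differ.
    diffs = np.array([seqs[i] != seqs[j]
                      for i, j in combinations(range(len(seqs)), 2)])
    per_site = diffs.mean(axis=0)  # average over pairs at each site
    return per_site.mean()         # then average over sites

# Toy alignment of three genomes and four sites.
toy_alignment = ["ACGT", "ACGA", "TCGA"]
toy_d = pairwise_diversity(toy_alignment)
```

Enumerating every pair is quadratic in the number of genomes, which is acceptable for a sketch; for large alignments the same per-site average can be obtained from nucleotide counts at each site without forming the pairs explicitly.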
This model fit yields, among other useful quantities, the relative recombination rate of the viral reservoir [recombination per synonymous substitution]. Recombination was observed for several RNA viruses analyzed but not for West Nile or yellow fever virus, qualitatively consistent with prior estimates (14). Quantitatively, recombination was predicted to be more frequent than synonymous substitution in every case where it was observed, including among SARS-like coronaviruses, a dramatic finding.
As described above, homologous recombination complicates phylogenomic reconstruction. Methods which assume no homologous recombination may result in the inference of global phylogenies which do not correspond to the true phylogeny of any segment (15). In the analysis of SARS-like coronaviruses presented, incorporating homologous recombination with the estimated viral reservoir results in an inferred ancestry which is not reproduced by standard methods (16). These discrepancies have practical implications for public health. Identifying the most recent common ancestor (MRCA) of a group of epidemic isolates is key to determining how long a pathogen had been circulating and how far it may have spread, by the time the first few sequences are obtained (17). Similarly, finding the MRCA between a novel virus, like SARS-CoV-2, and related animal viruses is a critical step toward the identification, and potential containment, of an intermediate host reservoir (18).
Steinberg et al. (3) demonstrate both a high rate of homologous recombination among SARS-like coronaviruses globally and a complex landscape of structured gene pools between which recombination is minimal. At some level, the boundaries of these gene pools must be determined by the host population structure, with the strongest barrier to recombination being cross-species transmission. Within a single species, perhaps such a structure of distinct gene pools is indicative of separable epidemiological niches, broadly characteristic of human RNA viruses (19).
It is important to acknowledge that recombination between identical sequences is undetectable, and classical methods for the inference of recombination requiring explicit phylogenetic reconstruction typically produce lower bound estimates. While the approach used in this work is likely more accurate and certainly more computationally practical for many datasets, it may result in the overestimation of recombination rates. Contemporary estimates for recombination rates among SARS-like coronaviruses using phylogenetic approaches are several orders of magnitude lower (20) than reported in this study, and it will take time to build consensus regarding best practices.
Recombination will continue to play a principal role in the evolution of pathogenic viruses. Steinberg et al. (3) demonstrate that a simplified model of copy-choice recombination not only enables efficient rate estimation but also reveals information about the larger, unobserved viral reservoir invisible to classical, phylogenetic approaches. The results presented suggest that recombination is dramatically underestimated for many well-known pathogens among which horizontal transmission of information is more frequent than vertical inheritance.
Acknowledgments
N.D.R. is supported by the Intramural Research Program of the NIH (National Library of Medicine).
Author contributions
N.D.R. wrote the paper.
Competing interests
The author declares no competing interest.
Footnotes
See companion article, “Correlated substitutions reveal SARS-like coronaviruses recombine frequently with a diverse set of structured gene pools,” 10.1073/pnas.2206945119.
References
1. Smits R., et al., De novo mutations in children born after medical assisted reproduction. Hum. Reprod. 37, 1360–1369 (2022).
2. Song H., et al., Tracking HIV-1 recombination to resolve its contribution to HIV-1 evolution in natural infection. Nat. Commun. 9, 1–15 (2018).
3. Steinberg A. P., Silander O. K., Kussell E., Correlated substitutions reveal SARS-like coronaviruses recombine frequently with a diverse set of structured gene pools. bioRxiv [Preprint] (2022). 10.1101/2022.08.26.505425.
4. Lin M., Kussell E., Inferring bacterial recombination rates from large-scale sequencing datasets. Nat. Methods 16, 199–204 (2019).
5. Lin M., Kussell E., Correlated mutations and homologous recombination within bacterial populations. Genetics 205, 891–917 (2017).
6. Steinberg A. P., Lin M., Kussell E., Core genes can have higher recombination rates than accessory genes within global microbial populations. Elife 11, e78533 (2022).
7. Rochman N. D., Wolf Y. I., Koonin E. V., Molecular adaptations during viral epidemics. EMBO Rep. 23, e55393 (2022).
8. Rochman N. D., Wolf Y. I., Koonin E. V., Deep phylogeny of cancer drivers and compensatory mutations. Commun. Biol. 3, 1–11 (2020).
9. Posada D., Crandall K. A., Evaluation of methods for detecting recombination from DNA sequences: Computer simulations. Proc. Natl. Acad. Sci. U.S.A. 98, 13757–13762 (2001).
10. Posada D., Crandall K. A., Holmes E. C., Recombination in evolutionary genomics. Annu. Rev. Genetics 36, 75–97 (2002).
11. Rochman N. D., et al., Ongoing global and regional adaptive evolution of SARS-CoV-2. Proc. Natl. Acad. Sci. U.S.A. 118, e2104241118 (2021).
12. Tajima F., Evolutionary relationship of DNA sequences in finite populations. Genetics 105, 437–460 (1983).
13. Newville M., et al., LMFIT: Non-Linear Least-Square Minimization and Curve-Fitting for Python (Astrophysics Source Code Library:ascl:1606.1014, 2016).
14. Twiddy S. S., Holmes E. C., The extent of homologous recombination in members of the genus Flavivirus. J. Gen. Virol. 84, 429–440 (2003).
15. Posada D., Crandall K. A., The effect of recombination on the accuracy of phylogeny estimation. J. Mol. Evol. 54, 396–402 (2002).
16. Minh B. Q., et al., IQ-TREE 2: New models and efficient methods for phylogenetic inference in the genomic era. Mol. Biol. Evol. 37, 1530–1534 (2020).
17. Pekar J., Worobey M., Moshiri N., Scheffler K., Wertheim J. O., Timing the SARS-CoV-2 index case in Hubei province. Science 372, 412–417 (2021).
18. Wacharapluesadee S., et al., Evidence for SARS-CoV-2 related coronaviruses circulating in bats and pangolins in Southeast Asia. Nat. Commun. 12, 1–9 (2021).
19. Mutz P., et al., Human pathogenic RNA viruses establish noncompeting lineages by occupying independent niches. Proc. Natl. Acad. Sci. U.S.A. 119, e2121335119 (2022).
20. Müller N. F., Kistler K. E., Bedford T., A Bayesian approach to infer recombination patterns in coronaviruses. Nat. Commun. 13, 1–9 (2022).

