A reanalysis of SARS-CoV-2 deep sequencing data from donor-recipient pairs indicates that transmission bottlenecks are very narrow (1–3 virions).
In their recent research article (1), Popa, Genger et al. combined epidemiological and viral genetic data to characterize the transmission dynamics of SARS-CoV-2 in Austria between February and April 2020. The genetic data they analyzed comprised >500 deep-sequenced virus samples. Beyond using consensus-level SARS-CoV-2 sequences to infer transmission clusters within Austria and to examine the role that Austria played in seeding regional epidemics elsewhere in Europe, the authors used their sequenced samples to characterize mutational dynamics within hosts and along short transmission chains. Although we believe that the findings from their consensus-level genetic analysis are robust, we here revisit their analyses of mutational dynamics at the below-the-consensus level. Specifically, we consider their estimates of the viral transmission bottleneck size, defined as the number of virions that successfully seed infection in a recipient individual following infection from a donor individual. Equivalently, it is the number of viral particles from a person who transmits the infection that contribute genetically to the viral population in the recipient who contracts it. From our reanalysis, we conclude that transmission bottleneck sizes for SARS-CoV-2 are not on the order of 1000 virions as concluded by the authors, but instead much smaller.
Our decision to revisit Popa, Genger et al.’s conclusions on transmission bottleneck sizes stems from certain patterns present in some of their figures. First, inferred bottleneck size estimates using a 3% variant calling threshold were bimodal, with 14 of the 39 transmission pairs having an inferred bottleneck size (Nb) of <10 and the remaining 25 pairs having Nb estimates of 115–5000 (their fig. S4G). Further, when a 1% variant calling threshold was used, only a single transmission pair retained an Nb estimate of <10 (their Figure 5B). In an attempt to understand these patterns, we first reanalyzed their deep sequencing data and re-called variants using their pipeline. In the analyses presented below, we use these re-called variant frequencies, which appear to be highly similar to those presented in Popa, Genger et al. based on the plots published as part of their article that show the frequencies of called variants in donor individuals against those in recipient individuals in their identified transmission pairs (10.5281/zenodo.4247401).
As expected, re-estimation of transmission bottleneck sizes at variant calling thresholds of 1% and 3% yielded similar results to those shown in (1) (fig. S1A,B; data file S1). During this analysis, we noticed that bottleneck size estimates dropped, sometimes precipitously, when going from a 1% cutoff to a 3% cutoff for every one of the 13 transmission pairs that had donors with a maximum intrahost single nucleotide variant (iSNV) frequency of >6% (Figure 1A; p = 0.004, paired t-test). Because increasing the variant calling threshold removes low-frequency iSNVs from the analysis, these consistent decreases in Nb estimates could come about if low-frequency donor iSNVs pointed towards bottleneck sizes being large whereas high-frequency donor iSNVs instead pointed towards bottleneck sizes being small. Examination of low-frequency iSNVs across donor-recipient pairs indeed indicates a high degree of congruence between their frequencies (Figure 1B inset and fig. S2 in data file S2), which would suggest wide transmission bottlenecks. In contrast, high-frequency donor iSNVs rarely appeared to be transmitted to their corresponding recipient (fig. S2), suggesting narrow transmission bottlenecks.
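The intuition that untransmitted high-frequency donor iSNVs point towards narrow bottlenecks can be illustrated with a minimal sketch. It assumes simple binomial sampling of Nb founding virions from the donor population and ignores within-recipient growth, sequencing noise, and variant calling thresholds, so it is a simplification and not the beta-binomial method used in this reanalysis. The function name and the example donor frequencies are illustrative assumptions.

```python
def prob_variant_lost(donor_freq: float, nb: int) -> float:
    """Probability that none of nb founding virions carries the variant,
    assuming each founder is drawn independently from the donor population."""
    return (1.0 - donor_freq) ** nb


# Illustrative donor frequencies (assumptions): one low (1.5%), one high (20%).
for donor_freq in (0.015, 0.20):
    for nb in (1, 10, 100, 1000):
        p_lost = prob_variant_lost(donor_freq, nb)
        print(f"donor freq {donor_freq:.3f}, Nb={nb:>4}: P(lost) = {p_lost:.3g}")
```

Under these simplifying assumptions, the loss of a donor variant at 20% becomes very unlikely once Nb reaches the tens or hundreds, whereas the loss of a 1.5% variant remains plausible at small Nb and so carries little information.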
To come to terms with these conflicting patterns, we considered genetic variation that appeared de novo in recipient hosts. This genetic variation appears in the donor-against-recipient variant frequency plots as iSNVs absent from a donor but present in a corresponding recipient. When a de novo variant is observed as fixed in a recipient sample, we should not observe any shared iSNVs between a donor and a recipient that are present in the recipient at subclonal (that is, not fixed) frequencies unless within-host recombination occurred extremely rapidly or the fixed de novo variant arose multiple times in different genetic backgrounds. However, in the transmission pairs analyzed in Popa, Genger et al., shared subclonal iSNVs are observed in several transmission pairs where there is also a fixed de novo variant present in the recipient. The transmission pair CoV_162 → CoV_161 provides an example (Figure 1B). This means that the low-frequency iSNVs shared between CoV_162 and CoV_161 either are spurious or arose independently in the recipient (that is, they are homoplasies). Although iSNV homoplasies have been documented in a number of recent SARS-CoV-2 studies (2, 3), we believe that these low-frequency iSNVs in the Popa, Genger et al. transmission pairs are likely spurious, potentially arising from systematic issues related to the sequencing protocol. This is because these low-frequency iSNVs occur at extremely similar frequencies between the donor sample and the recipient sample (Figure 1B inset; fig. S2), which is unlikely if the iSNVs were homoplasies. In either case, however, the low-frequency shared iSNVs in transmission pair CoV_162 → CoV_161 and in other transmission pairs with fixed de novo variants in the recipient could only constitute transmitted genetic variation under scenarios that are highly implausible from a biological perspective, and as such should be excluded from a transmission bottleneck analysis involving these transmission pairs.
A comprehensive analysis of all transmission pairs identified in Popa, Genger et al. indicates that patterns of low-frequency shared genetic variation are quantitatively highly similar across transmission pairs. To illustrate this, we categorized transmission pairs into three groups: transmission pairs with de novo fixed variants in the recipient (here, defined as >94% in frequency), transmission pairs with de novo high-frequency (6–94%) variants in the recipient, and transmission pairs with only low-frequency de novo variants (≤6%). Figure 1C shows that the shared low-frequency iSNVs across these three groups are quantitatively extremely similar: most shared iSNVs in each of these groups have frequencies falling between 1% and 2% in the donor, although some have frequencies of up to 6%. Whereas we should expect no transmitted subclonal genetic variation for the transmission pairs falling in the first group, we expect any shared iSNVs between transmission pairs belonging to the second group to have markedly different frequencies between the donor and the recipient because of genetic linkage with the high-frequency de novo variant in the recipient. The third group in principle could have very similar iSNV frequencies if bottleneck sizes were sufficiently large. In contrast to these expectations, Figure 1C shows that all shared iSNVs (regardless of which group is being considered) are highly congruent in their frequencies between donors and recipients, indicating again that these iSNVs are very likely spurious. Indeed, when we calculate the proportion of the low-frequency donor iSNVs that are observed in a corresponding recipient (at ≥1%) versus observed in an epidemiologically unlinked recipient, we find that the distributions of these proportions are highly similar (Figure 1D).
This finding again suggests that these shared low-frequency iSNVs do not constitute true shared genetic variation within transmission pairs; if these shared iSNVs were transmitted, we would expect the proportion of shared low-frequency variants to be higher for the corresponding recipient compared to an epidemiologically unlinked one.
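The linked-versus-unlinked comparison described above can be sketched as follows. This is a minimal illustration and not the analysis code deposited with the article: the function name, the dictionary representation of a sample, and all positions and frequencies are hypothetical toy values chosen only to show the calculation.

```python
def shared_proportion(donor, recipient, low=0.01, high=0.06, call_threshold=0.01):
    """Fraction of the donor's low-frequency iSNVs (low <= f <= high) that are
    also called in the recipient at or above call_threshold.
    Returns None when the donor has no iSNVs in the low-frequency range."""
    low_freq_sites = [site for site, f in donor.items() if low <= f <= high]
    if not low_freq_sites:
        return None
    shared = sum(1 for site in low_freq_sites
                 if recipient.get(site, 0.0) >= call_threshold)
    return shared / len(low_freq_sites)


# Hypothetical toy data: genome position -> variant frequency.
donor = {1001: 0.015, 2002: 0.012, 3003: 0.020, 4004: 0.30}
linked_recipient = {1001: 0.014, 2002: 0.013, 5005: 0.99}
unlinked_recipient = {1001: 0.016, 2002: 0.011, 3003: 0.005}

print(shared_proportion(donor, linked_recipient))    # linked pair
print(shared_proportion(donor, unlinked_recipient))  # unlinked pair
```

If the shared low-frequency iSNVs were truly transmitted, the proportion for linked pairs would be expected to exceed that for unlinked pairs across the dataset; in this toy example the two proportions are equal by construction.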
Given these findings that cast doubt on low-frequency iSNVs constituting transmitted genetic variation, we decided to quantify the extent to which particular iSNVs were present across the samples used in the transmission pair analyses. We found that 5 iSNVs were present in ≥40 of the 43 samples analyzed, at frequencies that fell into a very narrow range (1–2.2%) (Figure 1E). Many other iSNVs were also present across numerous samples (Figure 1E; fig. S3 in data file S3 and fig. S4), with the frequencies of any particular iSNV being highly similar across the samples that it appears in. This similarity in iSNV frequencies again argues against these low-frequency iSNVs being homoplasies and strongly argues for these iSNVs being spurious. To assess the evidence for this, we plotted the genome location of all variants observed in between 1% and 99% of reads in at least 10 samples against the read depth at those positions (fig. S5). Although these variants do not tend to appear in areas of particularly low sequencing coverage, they do cluster within a small number of sequenced amplicons, which are distributed across the genome (fig. S6).
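The recurrence tally described above can be sketched in a few lines. This is an illustrative sketch under stated assumptions and not the deposited analysis code: the function name, the sample names, and all positions and frequencies are hypothetical.

```python
from collections import defaultdict


def recurrent_isnvs(samples, min_samples, low=0.01, high=0.99):
    """Return {site: (n_samples, min_freq, max_freq)} for variants observed at
    frequencies between low and high in at least min_samples samples."""
    freqs_by_site = defaultdict(list)
    for variant_freqs in samples.values():
        for site, f in variant_freqs.items():
            if low <= f <= high:
                freqs_by_site[site].append(f)
    return {site: (len(fs), min(fs), max(fs))
            for site, fs in freqs_by_site.items()
            if len(fs) >= min_samples}


# Hypothetical toy data: sample name -> {genome position: variant frequency}.
samples = {
    "sample_A": {1001: 0.012, 2002: 0.40},
    "sample_B": {1001: 0.015, 3003: 0.02},
    "sample_C": {1001: 0.013, 2002: 0.995},
}
print(recurrent_isnvs(samples, min_samples=3))
```

A variant that recurs in nearly every sample within a narrow frequency band, like position 1001 in this toy input, is the kind of pattern the text interprets as spurious.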
Last, a comparison between observed patterns of iSNV frequencies between donors and recipients versus those expected under large transmission bottleneck sizes as inferred in Popa, Genger et al. further argues against the transmission of the low-frequency shared iSNVs. Specifically, observed iSNV frequencies from transmission pairs with inferred bottleneck sizes of Nb ≥ 1000 show that iSNVs are present in both donor and recipient at highly similar frequencies or are observed exclusively in the donor or recipient (Figure 1F). On this figure, we overlaid simulated iSNV frequencies under the assumption of a bottleneck size of Nb = 1000. Juxtaposition of the observed versus theoretically predicted iSNV frequencies highlights an inconsistency: at Nb values of ~1000, we should expect almost all (at least 96.1%) of the iSNVs present in the donor at ≥2% to be transmitted and also observed above the variant calling threshold of 1% in the recipient. However, only 77.5% of donor iSNVs within the 2–6% frequency range were observed in the corresponding recipients at ≥1% frequency. This inconsistency indicates that the low-frequency iSNVs themselves show patterns that cannot be parsimoniously explained by large transmission bottleneck sizes. Moreover, bottleneck sizes of around Nb = 3000 are needed to quantitatively reproduce patterns of shared iSNV frequencies (fig. S7); at this bottleneck size, nearly 100% of iSNVs present in the donor at ≥2% should be transmitted to the recipient, but this is not the case.
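The expectation that most donor iSNVs at ≥2% should be transmitted under a large bottleneck can be approximated with a simple calculation. The sketch below assumes only binomial sampling of Nb founders and treats the founding frequency as the observed recipient frequency. It omits within-recipient growth and sequencing noise, so it is not the simulation overlaid on Figure 1F and is not expected to reproduce the 96.1% value. Function names are illustrative.

```python
import math


def log_binom_pmf(k: int, n: int, p: float) -> float:
    """Log of the Binomial(n, p) probability mass at k, computed in log space."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log1p(-p))


def prob_detected(donor_freq: float, nb: int, threshold: float = 0.01) -> float:
    """P(founding frequency k/nb >= threshold) with k ~ Binomial(nb, donor_freq)."""
    k_min = math.ceil(threshold * nb - 1e-9)  # tolerance guards against float rounding
    return sum(math.exp(log_binom_pmf(k, nb, donor_freq))
               for k in range(k_min, nb + 1))


# Donor iSNV at 2%, recipient calling threshold of 1%, several bottleneck sizes.
for nb in (1, 10, 100, 1000, 3000):
    print(f"Nb={nb:>4}: P(detected in recipient) = {prob_detected(0.02, nb):.4f}")
```

Even this crude approximation shows the qualitative point: at bottleneck sizes of about 1000 or more, nearly all donor iSNVs at 2% would be expected above the 1% threshold in the recipient, well above the 77.5% observed.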
Given our finding that the shared low-frequency iSNVs called in Popa, Genger et al. are likely spurious, we re-estimated transmission bottleneck sizes using the beta-binomial method (4) at a conservative variant calling threshold of 6% (Figure 1A; fig. S1C; data file S1). Increasing the variant calling threshold does not bias bottleneck size estimates, but it does increase statistical uncertainty in the estimated values. At this 6% cutoff, only 13 transmission pairs had one or more donor iSNVs remaining, such that bottleneck sizes could only be estimated for these pairs. The maximum likelihood estimate for Nb was 1 for 12 out of these 13 transmission pairs (with the largest upper bound of the 95% CI being Nb = 181 virions); for the remaining transmission pair (CoV_198 → CoV_230), the estimate was Nb = 143 virions (95% CI = 4 to 951). This transmission pair was the only one where a donor iSNV (at a frequency of ~22%) was transmitted to a recipient but remained subclonal (at a frequency of ~17%). Because the confidence intervals around these estimates were large, we also estimated an overall transmission bottleneck size using the data from these 13 transmission pairs. We arrived at an estimate of a mean bottleneck size of 1.21, with 3 or fewer viral particles successfully seeding infection in >99% of successful transmissions (Figure 1G). Of note, this estimate depends on patterns of genetic variation observed between donors and recipients of transmission pairs. We here relied on the transmission pairs specified in Popa, Genger et al.; misspecification of these pairs could result in erroneously small bottleneck estimates.
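To see how a mean bottleneck size of 1.21 can coexist with 3 or fewer virions in >99% of transmissions, the following sketch assumes, purely for illustration, that bottleneck sizes follow a zero-truncated Poisson distribution. The text above does not state which distribution underlies the overall estimate, so this distributional choice and all function names are assumptions.

```python
import math


def ztp_mean(lam: float) -> float:
    """Mean of a zero-truncated Poisson distribution with rate lam."""
    return lam / -math.expm1(-lam)  # -expm1(-lam) equals 1 - exp(-lam)


def ztp_pmf(k: int, lam: float) -> float:
    """P(N = k) for k >= 1 under a zero-truncated Poisson distribution."""
    return math.exp(-lam) * lam ** k / (math.factorial(k) * -math.expm1(-lam))


def solve_rate_for_mean(target_mean: float, lo: float = 1e-9, hi: float = 50.0) -> float:
    """Bisection for lam with ztp_mean(lam) == target_mean (the mean increases with lam)."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if ztp_mean(mid) < target_mean:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


lam = solve_rate_for_mean(1.21)  # 1.21 is the mean bottleneck size reported above
p_at_most_3 = sum(ztp_pmf(k, lam) for k in (1, 2, 3))
print(f"lambda = {lam:.3f}, P(Nb <= 3) = {p_at_most_3:.4f}")
```

Under this assumed distribution, a mean of 1.21 places more than 99% of the probability mass on bottleneck sizes of 1 to 3, in line with the summary given in the text.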
Our finding of a very tight transmission bottleneck from a reanalysis of the viral deep-sequencing data from Popa, Genger et al. is consistent with conclusions from other recent studies that have quantified SARS-CoV-2 transmission bottleneck sizes in humans (3, 5) and other mammals (6). These results indicate that SARS-CoV-2 has a narrow transmission bottleneck, similar in size to that of influenza A viruses (7). Small bottleneck sizes also mean that infections generally start off with very little – if any – viral genetic diversity, such that acute infections will likely be characterized by low levels of viral diversity except in instances of superinfection, consistent with other recent studies (2, 8). Our reanalysis thus parsimoniously adds to a growing understanding of SARS-CoV-2 evolution between and within infected individuals.
Acknowledgments
We thank Andreas Bergthaler and his group for providing clarification on the SARS-CoV-2 deep-sequencing data submitted as part of their research article. We also thank Carl Bergstrom for a clear definition of transmission bottleneck size and three anonymous reviewers for their insightful recommendations to improve this work. The research reported in this technical comment was supported by National Institute of Allergy and Infectious Diseases Centers of Excellence for Influenza Research and Surveillance (CEIRS) grant HHSN272201400004C and by the US National Institutes of Health National Institute of General Medical Sciences grant 1R01 GM124280-03S1.
Footnotes
Competing interests
KK consults for Moderna on SARS-CoV-2 epidemiology and evolution.
Data and materials availability
All raw sequencing data used in this study are available from the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA) BioProject #PRJEB39849. Associated metadata and analysis code to recreate the analysis are available at https://doi.org/10.5281/zenodo.5224640.
References and notes
- 1. Popa A, Genger J-W, Nicholson MD, Penz T, Schmid D, Aberle SW, Agerer B, Lercher A, Endler L, Colaço H, Smyth M, Schuster M, Grau ML, Martínez-Jiménez F, Pich O, Borena W, Pawelka E, Keszei Z, Senekowitsch M, Laine J, Aberle JH, Redlberger-Fritz M, Karolyi M, Zoufaly A, Maritschnik S, Borkovec M, Hufnagl P, Nairz M, Weiss G, Wolfinger MT, von Laer D, Superti-Furga G, Lopez-Bigas N, Puchhammer-Stöckl E, Allerberger F, Michor F, Bock C, Bergthaler A, Genomic epidemiology of superspreading events in Austria reveals mutational dynamics and transmission properties of SARS-CoV-2, Sci. Transl. Med 12, eabe2555 (2020).
- 2. Valesano AL, Rumfelt KE, Dimcheff DE, Blair CN, Fitzsimmons WJ, Petrie JG, Martin ET, Lauring AS, Pekosz A, Ed. Temporal dynamics of SARS-CoV-2 mutation accumulation within and across infected hosts, PLoS Pathog 17, e1009499 (2021).
- 3. Lythgoe KA, Hall M, Ferretti L, de Cesare M, MacIntyre-Cockett G, Trebes A, Andersson M, Otecko N, Wise EL, Moore N, Lynch J, Kidd S, Cortes N, Mori M, Williams R, Vernet G, Justice A, Green A, Nicholls SM, Ansari MA, Abeler-Dörner L, Moore CE, Peto TEA, Eyre DW, Shaw R, Simmonds P, Buck D, Todd JA, on behalf of the Oxford Virus Sequencing Analysis Group (OVSG), Connor TR, Ashraf S, da Silva Filipe A, Shepherd J, Thomson EC, The COVID-19 Genomics UK (COG-UK) Consortium, Bonsall D, Fraser C, Golubchik T, SARS-CoV-2 within-host diversity and transmission, Science 372, eabg0821 (2021).
- 4. Sobel Leonard A, Weissman D, Greenbaum B, Ghedin E, Koelle K, Transmission Bottleneck Size Estimation from Pathogen Deep-Sequencing Data, with an Application to Human Influenza A Virus, Journal of Virology, JVI.00171–17 (2017).
- 5. Braun K, Moreno G, Wagner C, Accola MA, Rehrauer WM, Baker D, Koelle K, O’Connor DH, Bedford T, Friedrich TC, Moncla LH, Limited within-host diversity and tight transmission bottlenecks limit SARS-CoV-2 evolution in acutely infected individuals (Evolutionary Biology, 2021; 10.1371/journal.ppat.1009849).
- 6. Braun KM, Moreno GK, Halfmann PJ, Hodcroft EB, Baker DA, Boehm EC, Weiler AM, Haj AK, Hatta M, Chiba S, Maemura T, Kawaoka Y, Koelle K, O’Connor DH, Friedrich TC, Pekosz A, Ed. Transmission of SARS-CoV-2 in domestic cats imposes a narrow bottleneck, PLoS Pathog 17, e1009373 (2021).
- 7. McCrone JT, Woods RJ, Martin ET, Malosh RE, Monto AS, Lauring AS, Stochastic processes constrain the within and between host evolution of influenza virus, eLife 7 (2018), doi: 10.7554/eLife.35962.
- 8. Tonkin-Hill G, Martincorena I, Amato R, Lawson ARJ, Gerstung M, Johnston I, Jackson DK, Park NR, Lensing SV, Quail MA, Gonçalves S, Ariani C, Chapman MS, Hamilton WL, Meredith LW, Hall G, Jahun AS, Chaudhry Y, Hosmillo M, Pinckert ML, Georgana I, Yakovleva A, Caller LG, Caddy SL, Feltwell T, Khokhar FA, Houldcroft CJ, Curran MD, Parmar S, The COVID-19 Genomics UK (COG-UK) Consortium, Alderton A, Nelson R, Harrison E, Sillitoe J, Bentley SD, Barrett JC, Torok ME, Goodfellow IG, Langford C, Kwiatkowski D, Wellcome Sanger Institute COVID-19 Surveillance Team, Patterns of within-host genetic diversity in SARS-CoV-2 (Genomics, 2020; 10.7554/eLife.66857).
- 9. Wu F, Zhao S, Chen Y-M, Wang W, Song Z-G, Hu Y, Tao Z-W, Tian J-H, Pei Y-Y, Yuan M-L, Zhang Y-L, Dai F-H, Liu Y, Wang Q-M, Zheng J-J, Xu L, Holmes EC, Zhang Y-Z, A new coronavirus associated with human respiratory disease in China, Nature 579, 265–269 (2020).
- 10. Sobel Leonard A, Weissman DB, Greenbaum B, Ghedin E, Koelle K, Lyles DS, Ed. Transmission Bottleneck Size Estimation from Pathogen Deep-Sequencing Data, with an Application to Human Influenza A Virus, Journal of Virology 91 (2017), doi: 10.1128/JVI.00171-17.