Author manuscript; available in PMC: 2025 May 1.
Published in final edited form as: Nat Microbiol. 2024 Apr 22;9(5):1382–1392. doi: 10.1038/s41564-024-01655-4

Targeted Accurate RNA Consensus sequencing (tARC-seq) reveals mechanisms of replication error affecting SARS-CoV-2 divergence

Catherine C Bradley 1,2,3, Chen Wang 1,*, Alasdair JE Gordon 1,*, Alice X Wen 1,2,3,*, Pamela N Luna 1, Matthew B Cooke 1, Brendan F Kohrn 4, Scott R Kennedy 4, Vasanthi Avadhanula 7, Pedro A Piedra 7, Olivier Lichtarge 1,8, Chad A Shaw 1, Shannon E Ronca 5,6,7, Christophe Herman 1,7,8
PMCID: PMC11384275  NIHMSID: NIHMS2004148  PMID: 38649410

Abstract

RNA viruses, like SARS-CoV-2, depend on their RNA-dependent RNA polymerases (RdRp) for replication, which is error-prone. Monitoring replication errors is crucial for understanding the virus’s evolution. Current methods lack the precision to detect rare de novo RNA mutations, particularly in low-input samples such as those from patients. Here, we introduce a new targeted Accurate RNA Consensus sequencing method (tARC-seq) to accurately determine the mutation frequency and types in SARS-CoV-2, both in cell culture and clinical samples. Our findings show an average of 2.68 × 10−5 new errors per cycle with a C>T bias that cannot be solely attributed to APOBEC editing. We identified hotspots and cold spots throughout the genome, correlating with high or low GC content, and pinpointed transcription regulatory sites as regions more susceptible to errors. tARC-seq captured template switching events including insertions, deletions, and complex mutations. These insights shed light on the genetic diversity generation and evolutionary dynamics of SARS-CoV-2.

Introduction.

During the COVID-19 pandemic, we have witnessed the repeated emergence of SARS-CoV-2 lineages and viral variants of concern (VOC). SARS-CoV-2 relies on viral RNA-dependent RNA polymerase (RdRp) for replication1 (Extended Data Fig. 1a). RNA polymerases (RNAP) misincorporate nucleotides at much higher frequencies than their DNA counterparts2, generating RNA variants which can fuel viral evolution (Extended Data Fig. 1a, Supplementary Table 1). RNA viral mutation rates have been inferred3, estimated4, and directly determined5, and they range from 10−6 to 10−4 per base per cell infection. The estimated mutation rates of two viruses related to SARS-CoV-2, murine hepatitis virus (MHV) and SARS-CoV, are approximately 10−6 and 10−7, respectively6, due to the presence of proofreading activity by an exonuclease (ExoN). Inactivation of ExoN activity within MHV and SARS-CoV increases the estimated mutation rates7. Therefore, SARS-CoV-2, which has ExoN, is believed to acquire mutations more slowly than other RNA viruses. Furthermore, estimates of the SARS-CoV-2 evolutionary mutation rate vary from ~7 × 10−4 to 1.1 × 10−3 substitutions per site per year8,9. This translates to a predicted average of two mutations per genome per month, which is lower than what has been observed in some of the lineages during the pandemic10. In addition, the preponderance of C>T mutational events in pandemic sequencing data suggests a role for APOBEC RNA editing rather than polymerase errors11, but this remains speculative because of the inaccurate sequencing methods used in these studies. Moreover, no empirical studies have directly measured the frequency and spectrum of de novo RdRp errors during SARS-CoV-2 replication.

Results.

tARC-seq for low input viral sample

Circle sequencing (CirSeq) was developed to eliminate technical artifacts from library preparation and next-generation sequencing, a major challenge for de novo RNA variant detection, and was initially applied to poliovirus5. Molecular barcoding was later developed, enhancing the detection of insertions and deletions through ARC-seq12. While these advances in consensus sequencing reduced technical noise, they require considerable substrate (≥1 μg of RNA) and are unfeasible for low input samples13, such as clinical samples. Another major constraint is that variant discovery is directly correlated with sequencing depth, which can be difficult to achieve for rare transcripts or organisms with large genomes. To overcome these limitations, we developed targeted Accurate RNA Consensus sequencing (tARC-seq) (Fig. 1). tARC-seq combines ARC-seq features with hybrid capture technology for target enrichment to enable deep variant interrogation of low input SARS-CoV-2 samples. Here, we applied tARC-seq to establish the frequency and mutation spectrum of ancestral SARS-CoV-2 (WT), Alpha and Omicron variants, as well as clinical Omicron samples.

Fig. 1 |. SARS-CoV-2 library preparation for tARC-seq.

(1) SARS-CoV-2 RNA is added to a carrier and the sample is fragmented. (2) Fragments are ligated to barcoded adapters, circularized and primed for rolling-circle reverse-transcription. (3) The resulting cDNA multimers are restriction digested into monomer copies. (4) Sequencing adapters and additional barcodes are added through subsequent PCR steps. (5) SARS-CoV-2 reads are enriched through hybrid capture, followed by post-capture PCR. (6) Final library is sequenced, reads are organized into families by barcode, and collapsed into consensus sequences. This process of error correction removes technical artifacts and identifies true RNA variants that occur at the same position across duplicates. The non-targeted sister protocol to tARC-seq is outlined in grey (steps 2–4, 6).

We first validated tARC-seq in Escherichia coli, where hybrid capture produced a >30-fold enrichment on average in unique consensus reads across a panel of twelve genes with different gene expression strength (Extended Data Fig. 2a). RNA variant frequency analysis by gene is poorly powered using the original ARC-seq method because coverage is highly variable between targets leading to inaccurate estimates of the true variant frequency. Combining ARC-seq with hybrid capture (tARC-seq) enriches for reads across the twelve-gene panel in E. coli. This enrichment allows high confidence measurements of variant frequencies across the enriched genes (Extended Data Fig. 2b,c, Supplementary Table 2). These results validate hybrid capture for E. coli mRNA error detection by increasing the statistical significance for variant calling.

RNA variants in Ancestral SARS-CoV-2

We next used tARC-seq to sequence WT SARS-CoV-2 RNA virus obtained after four infectious cycles, yielding 9 × 10^5 pfu and corresponding to picograms of viral RNA (Supplementary Table 1). Since viral RNA yield is low, E. coli mRNA was added for library preparation to serve as a carrier to facilitate enzyme chemistry. With hybrid capture some E. coli carrier RNA sequences are found in the final library; these are analyzed separately and serve as internal technical controls. These off-target reads recapitulate the previously reported variant frequency and spectrum obtained by CirSeq for total E. coli mRNA14. Furthermore, E. coli error frequencies and mutational spectra were comparable between tARC-seq carrier control (6.4 × 10−5 overall frequency) and WT E. coli tARC-seq (7.5 × 10−5 overall frequency) samples (Extended Data Fig. 2b,c). The molecular spectrum of transcript errors in E. coli indicates a general C>U substitution bias, which has been proposed to be due to spontaneous deamination or non-Watson-Crick base mispairing between a G and U during rNTP incorporation15. The spectrum is also strongly biased towards G>A substitutions, which could arise primarily from error-prone transcription machinery16.

Using tARC-seq, we achieved on average >16,000X depth in unique consensus reads across the 29,903-nucleotide genome of WT SARS-CoV-2 (Supplementary Table 3). Positions were filtered for ≥50X depth, which excluded only 0.1% of the genome. Clonal and subclonal variants, present at >5% allele frequency, were discarded to enrich for de novo events (Extended Data Fig. 3a; Supplementary Table 4). We determined that three cDNA copies (i.e., minimum family size) was sufficient to filter out most technical artifacts during consensus calling without compromising read depth (Extended Data Fig. 3b).
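
The consensus-calling step described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function name `call_consensus`, the `(barcode, sequence)` input shape, the requirement that all copies in a family agree, and the masking of disagreements with `N` are assumptions made for illustration. Only the minimum family size of three comes from the text.

```python
from collections import defaultdict

MIN_FAMILY_SIZE = 3  # minimum cDNA copies per barcode family (value from the text)


def call_consensus(reads, min_family_size=MIN_FAMILY_SIZE):
    """Collapse (barcode, sequence) reads into one consensus per barcode family.

    Assumption for this sketch: a base is kept only if every copy in the
    family agrees; any disagreement is treated as a technical artifact and
    masked with 'N'.
    """
    families = defaultdict(list)
    for barcode, seq in reads:
        families[barcode].append(seq)

    consensus = {}
    for barcode, seqs in families.items():
        if len(seqs) < min_family_size:
            continue  # too few copies to error-correct this family
        if len({len(s) for s in seqs}) != 1:
            continue  # simplification: only equal-length copies are compared
        consensus[barcode] = "".join(
            column[0] if len(set(column)) == 1 else "N"
            for column in zip(*seqs)
        )
    return consensus


example_reads = [
    ("AA", "ACGT"), ("AA", "ACGT"), ("AA", "ACTT"),  # one copy disagrees at base 3
    ("BB", "ACGT"), ("BB", "ACGT"),                  # family too small, dropped
    ("CC", "TCGT"), ("CC", "TCGT"), ("CC", "TCGT"),  # variant shared by all copies
]
consensus_by_family = call_consensus(example_reads)
```

In this toy input, the mismatch seen in only one copy of family `AA` is masked, while the change shared by all three copies of family `CC` survives into the consensus, which mirrors the distinction between technical artifacts and true RNA variants.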

The overall RNA variant frequency in WT virus, calculated as the number of variants called divided by the number of consensus nucleotides sequenced, is 1.07 × 10−4 per base (Fig. 2a). Most variants are base substitutions (8.38 × 10−5), followed by deletions (2.14 × 10−5) and insertions (2.52 × 10−6). All base substitution types were observed with C>T and G>A transitions dominating the mutational landscape, accounting for 44% and 9% of all events, respectively (Fig. 2b). The frequency and mutational spectrum of WT virus differ from the E. coli off-target reads (Fig. 2a,b, Extended Data Fig. 2b,c), which were processed together with the virus sample, suggesting that these mutational events are true viral mutations and are not artifacts introduced during library preparation.

Fig. 2 |. RNA variant frequencies and mutational spectra in SARS-CoV-2.

a, RNA variant frequencies were measured in ancestral SARS-CoV-2 (WT), the B.1.1.7 lineage (Alpha) and the B.1.1.529 lineage (Omicron) using tARC-seq (n = 2 biologically independent samples). b, Spectrum of mutation of RNA variants, C>T and G>A transitions are the major events (n = 2 biologically independent samples). c, Most variants create nonsynonymous amino acid substitutions. d, The mutation frequencies are compared for each base substitution type observed in the coding regions (REF, original base; ALT, altered base substitution) (n = 2 biologically independent samples). e, The variant frequency (VF) is computed by position (VF = count / depth) and graphed along the genome. Two-sided Fisher’s exact test with Benjamini–Hochberg correction was performed position-wise across the genome to determine hotspots for RNA variants. Hotspot positions with depth ≥5,000X and significantly increased VFs relative to the genome-wide average are indicated with black dots (P < 0.05). One representative sample of WT virus is shown. The broken black line represents the genome-wide average VF. f, RNA VF vary by ORF. The horizontal dashed line marks the genome-wide average across all replicates (n = 2 biologically independent samples). g, All significant hotspots (e) and cold spots, from Extended Data Fig. 4c, were then analyzed by GC content. Compared to cold spots, hotspots are GC-enriched (n = 2 biologically independent samples). h, Venn diagram showing recurrent hotspots across lineages. Significant sites were combined between replicates before searching for overlaps between lineages.

Nearly 70% of all base substitutions are nonsynonymous (Fig. 2c; Supplementary Table 5), which is expected by random mutations alone altering the universal genetic code. Classic mutagenesis studies have shown that most novel mutations are deleterious or, more rarely, neutral in RNA viruses17,18; thus, if selection exerted a strong influence on tARC-seq data, this nonsynonymous fraction should be reduced. To further consider the role of selection in tARC-seq data, the frequencies of synonymous, nonsynonymous, and nonsense variants detected by tARC-seq were mapped across nsp12 – an essential gene encoding the viral RdRp (Extended Data Fig. 4a). Truncating mutations in nsp12 would render new virions inviable and would be under purifying selection. The random distribution and similar frequencies of the three types of mutations observed in this essential gene indicate that most RNA variants detected by tARC-seq are de novo replication errors. Another metric for selection is Evolutionary Action (EA)19, which estimates the impact of nonsynonymous variants on protein function using amino acid conservation (Supplementary Table 5). We calculated the EA scores and variant frequencies of all nonsynonymous and nonsense SNPs observed in nsp12 and S. We observed no difference in variant frequency between SNPs with low EA scores (predicted neutral effects) and high EA scores (predicted deleterious effects) across the spectrum of base substitutions, suggesting a limited role for selection (Extended Data Fig. 4b). Finally, we calculated the average mutation frequencies observed across all ORFs in WT virus by base alteration and mutation type and found no significant difference between synonymous, nonsynonymous, and nonsense mutations across the spectrum of base substitutions (Fig. 2d). Since mutation frequency is independent of phenotypic outcome, we conclude that the effects of selection are negligible in the final pool of virus harvested for tARC-seq.
Assuming that the initial pool of viral particles, which infected Vero cells at a multiplicity of infection of 0.1, must be under purifying selection (e.g. selection against RdRp nonsense variants), and knowing that these viral particles underwent four infectious cycles, the derived variant rate per cycle is a minimum of 2.68×10−5.
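
As a worked check of the arithmetic, the reported per-cycle rate is consistent with dividing the overall WT variant frequency by the four infectious cycles. The short Python sketch below makes that explicit; the even division across cycles is an assumption about how the figure was derived, and the function names are illustrative only.

```python
def variant_frequency(variant_count, consensus_bases):
    """Variants called divided by consensus nucleotides sequenced."""
    return variant_count / consensus_bases


def per_cycle_rate(frequency, cycles):
    """Assumed derivation: variants accumulate evenly over infectious cycles."""
    return frequency / cycles


wt_frequency = 1.07e-4  # overall WT variant frequency per base (from the text)
infectious_cycles = 4   # number of infectious cycles (from the text)

rate = per_cycle_rate(wt_frequency, infectious_cycles)
print(f"{rate:.3e}")  # 2.675e-05, which rounds to the reported 2.68 x 10^-5
```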

To assess whether RNA variants are randomly distributed across the SARS-CoV-2 genome, frequencies were calculated by position. We found that variant frequencies are highly variable across positions. After filtering for high-confidence positions (depth ≥5,000X), our analysis identified 643 loci in WT virus duplicates with significantly elevated base substitution frequencies (Fig. 2e), of which 80 were recurrent across both WT replicates (Supplementary Table 6). These hot spots were distributed across the entire SARS-CoV-2 genome. Cold spots were also detected along portions of the genome with high coverage (Extended Data Fig. 4c). Moreover, frequencies differed among ORFs (Fig. 2f, Supplementary Table 7). To find a molecular basis for this variability, nucleotide identity was analyzed across all hot and cold spots, and we observed a strong GC bias at positions with significantly elevated variant frequencies (Fig. 2g). Several ORFs with high GC content, such as S and ORF3a, also have higher-than-average SNP frequencies (Supplementary Table 7). However, differences between ORFs cannot be explained solely by GC content—for instance, N, which has a high GC content (47%), does not exhibit a particularly elevated SNP frequency, and ORF7b, which has a relatively low GC content (31%), still exhibits a relatively high SNP frequency (Supplementary Table 7).
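
The position-wise hotspot test can be sketched in Python as follows. This is an illustrative reconstruction, not the authors' code: the 2×2 table (variant versus non-variant calls at one position against the rest of the genome), the function names, and the pure-Python Fisher implementation are assumptions. The depth threshold of 5,000X, the two-sided Fisher's exact test, the Benjamini–Hochberg correction and the P < 0.05 cut-off come from the text.

```python
from math import exp, lgamma


def _log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def fisher_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    log_denominator = _log_comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell.
        return exp(_log_comb(row1, x) + _log_comb(row2, col1 - x) - log_denominator)

    p_observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum all tables that are at least as unlikely as the observed one.
    total = sum(
        p for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-7)
    )
    return min(1.0, total)


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def find_hotspots(counts, depths, min_depth=5000, alpha=0.05):
    """Positions with significantly elevated variant frequency (VF = count / depth)."""
    total_count, total_depth = sum(counts.values()), sum(depths.values())
    genome_vf = total_count / total_depth
    positions = [p for p in depths if depths[p] >= min_depth]
    pvalues = []
    for p in positions:
        a = counts.get(p, 0)
        b = depths[p] - a
        c = total_count - a
        d = (total_depth - depths[p]) - c
        pvalues.append(fisher_two_sided(a, b, c, d))
    adjusted = benjamini_hochberg(pvalues)
    return [
        p for p, q in zip(positions, adjusted)
        if q < alpha and counts.get(p, 0) / depths[p] > genome_vf
    ]


toy_counts = {1: 50, 2: 1, 3: 1}
toy_depths = {1: 6000, 2: 6000, 3: 6000, 4: 100}
hotspots = find_hotspots(toy_counts, toy_depths)
```

The pure-Python test keeps the sketch dependency-free but is slow at genome scale; a library routine such as `scipy.stats.fisher_exact` would be a more practical choice for real data. Cold spots could be found the same way by keeping positions whose frequency is below the genome-wide average.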

RNA variants in Alpha and Omicron

We next investigated whether the frequencies of variants and mutation spectra vary among the evolving lineages of SARS-CoV-2, each characterized by distinct VOC, as some viral variants exhibited enhanced infectivity and replication20. A tradeoff between replication speed and accuracy has been shown to influence mutation rates in RNA viruses21. Applying tARC-seq to the B.1.1.7 isolate (Alpha), we measured an overall RNA variant frequency of 1.38 × 10−4 (Fig. 2a). The B.1.1.529 isolate (Omicron) had a comparable variant frequency of 1.40 × 10−4 (Fig. 2a). The mutational spectra were similar between WT, Alpha and Omicron (Fig. 2b). Analysis of mutation frequencies by base alteration and synonymous/nonsynonymous/nonsense mutation type likewise revealed the lack of apparent selection in the sequenced viral pool (Fig. 2d). Variant frequencies by ORF were mostly concordant between WT, Alpha, and Omicron, with a few exceptions (Fig. 2f, Supplementary Table 7). Position-wise calculations in Alpha and Omicron again revealed several hot and cold spots for RNA substitutions. As with WT virus, a bias towards high GC content was observed in hotspots versus cold spots (Fig. 2g). 1007 hotspots were identified in Alpha and Omicron combined, of which 255 were identified across the WT RNA pool, and 51 were found in all three strains (Fig. 2h).

To determine the effect of lineage on variant frequency, we constructed a negative binomial model with genomic position and lineage as explanatory variables (see Methods; Supplementary Table 8). Considering the 1000 most significant positions across all samples, the mutation frequency of Alpha (P = 0.003), but not Omicron (P = 0.071), was increased over WT virus (Extended Data Fig. 5a–c; Supplementary Table 9). Regression modeling also corroborated our findings that variant frequencies are increased at cytosine residues and vary between ORFs (Extended Data Fig. 5d,e). Furthermore, some genomic positions were more prone to mutations than others in a manner that could not be explained by base identity or ORF alone (Extended Data Fig. 5f). This may be reflective of RNA secondary structures or environmental factors during viral replication that are not immediately apparent in our data.

RNA variants in clinical COVID-19 samples

To investigate de novo mutation in the context of natural infection, tARC-seq was applied to two clinical samples. Both patients were confirmed PCR positive in mid-2022 during the Omicron wave. Each nasal swab yielded sufficient RNA for tARC-seq. Both infections were confirmed to be Omicron and mapped to different Pango lineages by Nextclade analysis22. Overall variant frequencies and mutational distributions by base substitution type were similar between the clinical samples (1.45 × 10−4 [22C]; 1.31 × 10−4 [22B]) and when compared to laboratory-grown Omicron (1.40 × 10−4 [22K]) (Extended Data Fig. 6a,b). Indel frequencies in the clinical samples were skewed by a handful of common deletions with allele fractions between 1–5% (beneath our cut-off for omission). These are likely not independent events but instead occurred early and were propagated during infection. Thus, tARC-seq can capture a snapshot of de novo mutations in clinical samples from individual patients and can be useful for monitoring viral microevolution in the human population.

Origin of High C>T and G>A mutation frequencies

The C>T and G>A de novo viral variants could arise from RdRp errors during genome replication (Extended Data Fig. 1a) or from host RNA editing in the (+) and (−) strand by APOBEC enzymes (Fig. 3a). The high C>T variant count may also arise from spontaneous deamination of cytosine in cells. However, this is unlikely since the C>T transition frequency in human mRNA23, which includes spontaneous deamination in addition to other mechanisms, is 10−6, at least an order of magnitude lower than what we observed for SARS-CoV-2 originating from Vero or Calu-2 cells (Fig. 2a and Extended Data Fig. 6).

Fig. 3 |. APOBEC editing does not account for the majority of C>T mutations observed by tARC-seq.

Comparison of tARC-seq C>T, G>A transition hotspots to the APOBEC3A (A3A) editing signature. a, Schematic of A3A editing of + and - strand sequences based on the known preference of A3A for C sites with a 5’U26,25. The final base edits that are detectable by tARC-seq are shown on the bottom row. Green: edited bases before A3A deamination. Red: edited bases after A3A deamination. Blue: unedited bases. b, Sequence context for SNP hotspots in the WT1-Vero dataset for which C>T or G>A transitions comprise over 50% of the mutations observed versus genome-wide C and G sites. The sequence context known to be recognized by A3A at C sites of SARS-CoV2 and its reverse complement are shown for comparison26. c, The top ten highest frequency C>T mutation hotspots versus the top ten known A3A editing sites of WT SARS-CoV2 virus25 graphed by the frequency observed in the WT1-Vero dataset. d, Breakdown of the frequency of all C>T mutations by 5’ base and all G>A mutations by 3’ base in WT-Vero, Omicron-Vero, and Omicron-Clinical datasets. Bars represent lineage average of two biological samples which are indicated as dots. Dashed red lines indicate the average overall C>T or G>A mutation frequency in each lineage (n = 2 biologically independent samples).

Bioinformatic analysis of pandemic data proposed that C>T transitions amongst lineages originated from APOBEC editing11, leading to a debate over the RdRp error/editing origin of these transitions24. Recent studies suggest that APOBEC3A (A3A) is the primary source promoting C>T transitions25. A3A strongly favors C sites preceded by T in the –1 position, and it also strongly disfavors C sites preceded by G in the –1 position25,26 (Fig. 3a). When editing is performed on the (−) strand, this preference should be reflected by G>A transitions on the (+) strand at sites where A occupies the +1 position downstream of the mutation (Fig. 3a). tARC-seq SNP hotspots that are predominated by the occurrence of C>T transitions (n=343) do not show the A3A editing signature (Fig. 3b). Instead, these C>T hotspots resemble the genome-wide base composition preceding C sites, with only a 6% higher fraction of T in the –1 position compared to the genome-wide control. Furthermore, about 17% of C>T hotspots possess an A3A-disfavored G at the −1 position, which indicates they did not arise due to A3A editing (Fig. 3b). Likewise, tARC-seq SNP hotspots that are predominated by the occurrence of G>A transitions (n=39) resemble the genome wide base composition of G sites more closely than the A3A editing signature (Fig. 3b).
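
The sequence-context comparison above amounts to tallying the base at the –1 position for a set of C sites and comparing it with the genome-wide tally. A minimal Python sketch follows; the function names, the 0-based coordinates and the toy sequence are assumptions for illustration and do not reproduce the authors' analysis.

```python
from collections import Counter


def all_c_sites(genome):
    """0-based positions of every C that has an upstream neighbor."""
    return [i for i, base in enumerate(genome) if base == "C" and i > 0]


def upstream_base_fractions(genome, positions):
    """Fraction of A, C, G and T at the -1 position of the given C sites."""
    counts = Counter(
        genome[p - 1] for p in positions if p > 0 and genome[p] == "C"
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {base: counts.get(base, 0) / total for base in "ACGT"}


toy_genome = "TCGCACTCCC"   # hypothetical sequence, not SARS-CoV-2
toy_hotspots = [1, 7]       # hypothetical C>T hotspot positions (0-based)

genome_wide = upstream_base_fractions(toy_genome, all_c_sites(toy_genome))
at_hotspots = upstream_base_fractions(toy_genome, toy_hotspots)
```

Under A3A editing, the hotspot fractions would be expected to shift strongly towards T at –1 and away from G relative to the genome-wide values; hotspot fractions that track the genome-wide composition point instead to a context-independent source such as polymerase error.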

Next, we compared the top ten highest-frequency tARC-seq C>T hotspots to the top ten known A3A editing sites in WT virus25 and found no overlap between the two lists (Fig. 3c, Supplementary Table 10). The tARC-seq C>T frequencies at the A3A editing sites were on average one to two orders of magnitude lower than the C>T frequencies of the top ten highest-frequency tARC-seq C>T hotspots (Fig. 3c,d). Only four of the top ten A3A editing sites were identified as hotspots by tARC-seq, and their C>T frequencies ranked between 55 and 327 out of a total of 343 tARC-seq hotspots at C sites (Fig. 3c, Supplementary Table 10).

Finally, we calculated the frequency of all NC>NT and GN>AN transitions, including non-hotspots, detected by tARC-seq in WT and Omicron virus grown in Vero cells (WT-Vero and Omicron-Vero) and the clinical samples of Omicron (Omicron-Clinical) (Fig. 3d). We found that in the Vero samples, the frequencies of TC>TT and GA>AA, which represent potential A3A editing sites, were not substantially higher compared to the other transitions (Fig. 3d). However, the frequency of TC>TT in the Omicron-Clinical samples was higher than CC>CT and GC>GT transitions, suggesting a potential preference for C>T transitions at possible A3A editing sites (Fig. 3d). No preference towards GA>AA transitions was observed (Fig. 3d). Surprisingly, Omicron-Clinical samples also showed a bias towards AC>AT transitions, which could be due to APOBEC1 (A1) editing although A1 has been dismissed as a major source of APOBEC editing in SARS-CoV-225 (Fig. 3d). We mapped the genome-wide occurrence of TC>TT mutations in WT-Vero cells, indicative of possible A3A editing, alongside the distribution of GC>GT mutations, which are less likely to result from A3A editing (Extended Data Fig. 4d). Our analysis indicates a uniform distribution of mutations across the genome, and while tARC-seq identifies potential A3A edits in both cell-cultured and clinical samples, most C>T and G>A changes in cell-cultured samples are unlikely caused by A3A editing.

Viral RdRp prone to template switching

During negative strand synthesis in coronaviruses, RdRp jumps from transcription regulatory sequences (TRS) located upstream of most gene bodies (TRS-B) to a leader sequence (TRS-L) in the 5’ untranslated region (UTR) to generate subgenomic mRNAs (sgRNA)27. This programmed template switching is driven by sequence complementarity between TRS sequences28. tARC-seq detected fusion transcripts in WT SARS-CoV-2 with junctions mapping to canonical TRS sites (Extended Data Fig. 7a). Moreover, programmed template switching impacts RdRp fidelity as TRS-flanking regions exhibit significantly higher variant frequencies in WT, Alpha and Omicron virus (Extended Data Fig. 7b–d).

Non-programmed template switching has been implicated in insertion events and the emergence of novel coronavirus strains29. Analyzing tARC-seq data, we observe many recurrent junctions outside canonical TRS sites in fusion transcripts (Fig. 4b), as reported27. Compellingly, several large insertions and deletions were observed in WT (Extended Data Fig. 7e), many of which can be templated from within the SARS-CoV-2 genome.

Fig. 4 |. RdRp template switching at sites of sequence complementarity models rare events in SARS-CoV-2.

a, Indel hotspots, corresponding to transcription skip sites, are mapped across the genome for one representative replicate each of WT, Alpha, and Omicron. These loci have significantly elevated frequencies of insertions and deletions as calculated by Fisher’s exact test (P < 0.05 with Benjamini–Hochberg correction). b, Template switching at TRS (denoted as vertical lines) drives sgRNA synthesis. However, many chimeric junctions observed by tARC-seq lie outside canonical regions, indicating non-programmed template switching. Chimeric reads are detected by mapping with a spliced aligner (STAR). Each arc represents a chimeric alignment where the left and right x-intercepts correspond to the junction coordinates and line shading reflects frequency. One representative replicate of WT SARS-CoV-2 is shown. c,d, Three events from tARC-seq data are modeled involving g.23308 (Panel a, red arrow): two deletions (c) and one insertion (d). The 41-nt insertion is: TGGTTAAAAACAAATGTGTCAATTTCAACTTCAATGGTTTA. e, The overlap of slippage-induced deletions (SID) between tARC-seq and outbreak data with complementarity size ≥2 bp. The expected overlap if SID events are independent is shown in red.

We model two deletions at neighboring repeat sequences (Fig. 4c) and a large 41-nucleotide insertion (Fig. 4d), which can be modeled by complementarity-mediated template switching involving three sequential jumps between discrete genomic loci. While simple RdRp slippage downstream can explain many deletions, templated insertions are less common and are introduced when RdRp slips upstream or otherwise realigns to the correct sequence after a switching event, requiring two or more jumps.

As further evidence for template switching, we found that many indels are clustered around certain sequences, which we have termed transcription “skip sites” (Fig. 4a, Extended Data Fig. 7e). Jumpy RdRp activity at skip sites fuels a diverse repertoire of indels detectable by tARC-seq at a single locus. For example, the indel frequency at position 23308, pictured in Fig. 4a and modeled in Fig. 4c,d, is elevated ten-fold over the genome-wide average in both WT (g.23303: 2.97 × 10−4) and Alpha (g.23308: 3.11 × 10−4). Skip sites often sit adjacent to regions of microhomology (Supplementary Table 11) and homopolymeric nucleotide runs (Supplementary Table 12), which drive up local indel frequencies (Supplementary Table 13).

For templated mutational events, sequence complementarity between donor and acceptor sites facilitates the jump; however, the minimum number of complementary bases is unknown, as is tolerance for mismatches. While these constraints hinder a comprehensive analysis of all non-programmed template switching, some events are more readily studied. Simple slippage-induced deletions (SIDs) as in Fig. 4c can be identified through an unbiased search based on a clear event definition (see Methods). To better understand the rules governing template switching, all SID events with ≥2 nucleotides of complementarity between the donor site and acceptor site downstream were analyzed in WT, Alpha and Omicron. We found significant recurrence of specific SIDs across lineages, suggesting that these events are nonrandom (Fig. 4e). Recurrence was associated with low GC content and increased complementarity (Extended Data Fig. 8a–c, Supplementary Table 14).
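
One way to picture an SID is a deletion flanked by a short direct repeat: the polymerase copies the first repeat, slips to the matching downstream repeat, and skips what lies between. The Python sketch below scores that repeat for a given deletion. It is an illustration under stated assumptions, not the event definition from the Methods: the function names, the 0-based coordinates, the way the repeat length is counted, and the toy sequence are all hypothetical. Only the ≥2-nucleotide threshold comes from the text.

```python
def flanking_repeat_length(genome, start, length):
    """Size of the direct repeat shared by a deletion and its flanks.

    The deletion removes genome[start:start + length] (0-based). Matches are
    counted in both directions: deleted bases that repeat just downstream of
    the deletion, and upstream bases that repeat at the end of the deletion.
    """
    end = start + length
    forward = 0
    while (forward < length and end + forward < len(genome)
           and genome[start + forward] == genome[end + forward]):
        forward += 1
    backward = 0
    while (backward < length and start - 1 - backward >= 0
           and genome[start - 1 - backward] == genome[end - 1 - backward]):
        backward += 1
    return min(length, forward + backward)


def is_candidate_sid(genome, start, length, min_repeat=2):
    """Flag deletions with at least `min_repeat` nt of flanking repeat."""
    return flanking_repeat_length(genome, start, length) >= min_repeat


def apply_deletion(genome, start, length):
    """Sequence left after removing genome[start:start + length]."""
    return genome[:start] + genome[start + length:]


toy = "GGACTCCCCACTTT"  # hypothetical sequence with the repeat ACT at 2 and 9
repeat_with = flanking_repeat_length(toy, 2, 7)     # deletes ACTCCCC
repeat_without = flanking_repeat_length(toy, 5, 4)  # deletes CCCC only
```

In the toy sequence, deleting `ACTCCCC` joins the flanks through the shared `ACT` repeat and would pass the ≥2-nucleotide filter, whereas deleting `CCCC` alone has no flanking repeat and would not.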

Template switching contributes to pandemic genomic change

DNA polymerase has been shown to template switch by misalignment and realignment, via sequence complementarity, leading to contiguous multiple nucleotide alterations that result in nonsynonymous amino acid substitutions and protein innovation30. Signatures of aberrant RdRp template switching are present in sequences from real-world pandemic data as well (Fig. 5a). For example, one event, a GGG to AAC substitution, defines the 20B clade (Fig. 5b) from which the Alpha, Gamma, Lambda and Omicron lineages evolved. This multiple mutation creates the tandem amino acid substitution R203K/G204R in the Nucleocapsid gene that increases the infectivity, fitness, and virulence of SARS-CoV-231. All of the subsequent VOC derived from 20B also contain lineage-specific multiple nucleotide alterations that lead to nonsynonymous amino acid substitutions and can be modeled as singular RdRp misalignment and realignment events templated from within the SARS-CoV-2 genome (Fig. 5c–e; Extended Data Fig. 9a–d). Moreover, many of the recurrent SID sites identified by tARC-seq were also found in outbreak data (Fig. 4e). Thus, aberrant RdRp template switching contributes to SARS-CoV-2 evolution.

Fig. 5 |. RdRp template switching contributes to genomic change during the COVID-19 pandemic.

Pandemic data shows that many complex mutations in SARS-CoV-2 appeared suddenly (red arrows). They likely did not accumulate gradually but were driven by a single event: RdRp template switching. In the events modeled here, 3’ complementarity facilitates the misalignment and realignment of RdRp, creating complex mutations that have fueled viral evolution. a, Phylogenetic tree based on sequence alterations that define the 20B and Omicron clades; not drawn to any scale. Discrete, coordinated nucleotide alterations are coded by color, and each template switching event is mapped out below (b-e). b, The GGG>AAC mutation in the N gene occurred once early in the pandemic and defines the 20B clade. It creates a tandem amino acid substitution that increases the infectivity, fitness, and virulence of SARS-CoV-231. c, RdRp replication across a small hairpin in S has spawned the same 6-nt deletion on more than three separate occasions in different VOC, while other singular events are specific to Omicron (d-e). Phylogenetic trees were constructed in Nextstrain v2.35.043 from genomes sequenced between Dec. 2019 and March 2022.

Discussion.

We have sequenced SARS-CoV-2 laboratory and clinical isolates and established a baseline mutation frequency of 1.07 × 10−4 and mutation rate of 2.68 × 10−5 per infectious cycle for the virus that is variable between genomic loci and some lineages. These results could reflect differences between loci and lineages such as alterations in local RNA secondary structure or GC content. While higher than other predictions32, this frequency is comparable to similar observations in poliovirus, which lacks proofreading activity5. However, SARS-CoV-2 has exonuclease proofreading capability provided by the nonstructural protein ExoN (nsp14), which reduces misincorporation during replication33. In contrast to related coronaviruses, ExoN is essential for SARS-CoV-2 replication, possibly because it removes roadblocks during RNA synthesis34,35. The essential nature of ExoN in SARS-CoV-2 might be due to the high error rate of RdRp, which, without ExoN, could lead to an excess number of deleterious mutations. Comparing RdRp error rates across coronaviruses may further illuminate ExoN’s critical function.

Using tARC-seq, we captured de novo insertions and deletions in SARS-CoV-2 generated during cell culture infection, recapitulating observations made with worldwide pandemic sequencing data. Many large indels and complex mutations can be modeled as non-programmed RdRp template switching events mediated by 3’ nucleotide complementarity, akin to accepted mechanisms for DNA instability36. We report that specific slippage-induced deletions recur across lineages, suggesting that these events are nonrandom, associated with low GC content and increased complementarity.

We further demonstrated that the switching events driving complex variants in our data could also explain some mutations observed during the evolution of the different SARS-CoV-2 lineages worldwide (GISAID https://www.gisaid.org/). This includes the origin of the 20B Clade which subsequently gave rise to many VOC, including Alpha, Gamma, Lambda and Omicron. Each VOC has subsequently acquired additional nonsynonymous base substitutions that can also be modeled via template switching. Therefore, our study links de novo RNA mutation data from cell cultures with global genomic data and emphasizes the significance of template switching in the evolution of SARS-CoV-2, which is critical for propagating beneficial mutations that enhance the virus’s ability to adapt37. Altogether, our work highlights the promiscuous nature of SARS-CoV-2 RdRp driven by nucleotide misincorporation and erroneous template switching, both linked to the same exonuclease. Interestingly, the proofreading activity of ExoN has also been implicated in promoting template switching38, which we show here is error-prone. Consequently, ExoN might be a pivotal protein influencing viral evolvability.

The most frequent base changes observed with tARC-seq are C>T and G>A transitions, agreeing with early studies reporting that C>T largely contributes to the mutation spectrum of SARS-CoV-2 during the pandemic. C>T and G>A transitions are the most common errors made by DNA/RNA polymerase during replication/transcription both in vivo5,39,23 and in vitro40,41,42, but could also be introduced by APOBEC editing enzymes. Our analysis indicates that most C>T transitions observed in cell culture for SARS-CoV-2 are likely due to replication errors by RdRp rather than A3A, the primary APOBEC RNA editing enzyme25. We showed that the distribution of bases preceding the C>T events followed the genome distribution anticipated for replication errors, but not for A3A editing. A guanine preceding C>T editing, which accounts for 17% of our C>T events, has not been reported for A3A editing, further suggesting the influence of replication errors. However, Omicron-Clinical samples showed a slight preference for C>T transitions at potential A3A editing sites compared to our cell culture condition, suggesting a role for APOBEC editing during the pandemic. G>A transitions are the second most frequent mutational events observed and can be generated either during replication or by C>T editing by APOBEC on the minus strand. Again, our data suggest that APOBEC editing is not the main driver of G>A transitions in cell culture. In fact, the G>A transition is the most common error made by RNAP defective or lacking in proofreading activity23,16. Together, these data suggest that RdRp replication errors may be a major force of SARS-CoV-2 evolution.

Beyond the COVID-19 pandemic, our data have revealed principles concerning basic RdRp biology. RNA errors are linked to nucleotide content, transcriptional patterns and sequence complementarity. RdRp is capable of extensive non-programmed template switching to form structural variants, insertions, and deletions, ultimately fueling virus evolution.

Methods

Our research complies with all relevant ethical regulations; the use of clinical samples was approved by the Institutional Review Board for Human Subject Research for Baylor College of Medicine and Affiliated Hospitals (H-47423- BCM IRB). The IRB protocol was approved with waiver of consent.

RNA extraction

All RNA samples were processed under RNase-free conditions using dedicated equipment and reagents. To maintain RNA integrity, samples were limited to ≤1 freeze-thaw cycle, kept on ice whenever possible, and not subjected to high temperatures (≥65°C) in the presence of metal cations44. RNA integrity was confirmed via Agilent Tapestation prior to sequencing library preparation.

SARS-CoV-2

Ancestral SARS-CoV-2 was received from the World Reference Center for Emerging Viruses and Arboviruses at The University of Texas Medical Branch (Galveston, TX, USA) under the direction of Drs. Scott Weaver and Kenneth Plante. Alpha variant (Isolate hu/USA/CA_CDC_5574/2020, Lineage B.1.1.7, Source: Centers for Disease Control and Prevention, BEI Catalogue number NR-54011), and Omicron variant (Isolate hCoV-19/USA/MD-HP20874/2021, Lineage B.1.1.529, Source: Johns Hopkins University; BEI Catalogue Number NR-56461) were received from BEI Resources. Viral stocks were prepared by infecting Vero E6 CCL-81 cells as previously described45. Briefly, for the in vitro infection protocol, ~1 × 10^6 Vero E6 cells were infected with SARS-CoV-2 at a 0.1 multiplicity of infection for 1 h. Cells were washed and reconstituted with cell culture media. The supernatant from infected cultures was collected between 48–72 h depending on the lineage to obtain a final titer of ~9 × 10^5 pfu/ml, using Vero E6 cells to measure viral titer post infection. Virus was inactivated in TRIzol reagent before freezing at −80°C for storage. RNA was extracted from thawed TRIzol preps46 at the time of library preparation.

Clinical samples

RT-PCR testing was performed as a service to Baylor College of Medicine and affiliated institutions, while the collection of metadata was performed under an institutional review board–approved protocol with waiver of consent. Briefly, RNA was extracted from nasal swabs using the Qiagen Viral RNA Mini Kit (Qiagen Sciences) as previously described47. Both clinical samples sequenced for the present study were collected in 2022 and correspond to vaccinated females aged 25–30 years old without comorbidities and with symptomatic, PCR-confirmed SARS-CoV-2 infection.

Escherichia coli

Luria broth was inoculated from isolated colonies and grown for 16 h at 37°C. The next day, overnight cultures were washed, diluted 100-fold in fresh LB, and grown at 37°C to mid-log phase (OD600 ~0.4). 1 ml culture aliquots were then harvested in duplicate and placed on ice preceding RNA extraction using the RNAsnap protocol48. Following harvest, sample cleanup was performed with the RNA Clean & Concentrator Kit (Zymo Research) and DNase treatment (TURBO DNase) was applied off-column at 37°C for 1 h. The ribosomal RNA fraction was depleted via the RiboMinus Transcriptome Isolation Kit for bacteria (Invitrogen), and the resulting mRNA was concentrated by ethanol precipitation for downstream library preparation.

Library preparation and sequencing

Total accurate RNA consensus sequencing (ARC-seq)

1 μg of RNA was enzymatically fragmented with RNaseIII for 7 min to an average size of 450 nt44. Following phenol:chloroform:isoamyl alcohol cleanup, the remaining steps of library construction were guided by ARC-seq design12. Briefly, 200 fmol of RNA was 5’ adenylated with Mth RNA ligase and ligated with T4 RNA Ligase 2 truncated KQ to ARC-seq RCRT-01 barcodes containing unique molecular identifiers (UMIs). Barcoded RNA fragments were 5’ phosphorylated with T4 polynucleotide kinase and circularized with T4 RNA ligase 1. Circularized fragments were primed with ARC-seq RCRT-02 and rolling-circle reverse-transcribed with ProtoScript® II Reverse Transcriptase, yielding multiple conjoined, single-stranded cDNA copies of the original fragment. The conjoined cDNA copies were hybridized to AscI X oligos at barcode-adjacent AscI cut sites and digested into monomers. Individual monomers were tailed with ARC-seq RCRT-05 to add sequencing adapters and additional indexes for each monomer. The library was PCR amplified 13–15 cycles and analyzed by Tapestation prior to paired-end sequencing on the Illumina NextSeq 550 system (NextSeq Control Software 4.0.2.7). Importantly, reaction conditions throughout were optimized to reduce RNA damage from heat and metal ion catalysis.

Targeted accurate RNA Consensus sequencing (tARC-seq)

For concentrated samples (≥ 100 ng/μl), fragmentation can proceed as described above. However, for low-abundance samples it is recommended to add carrier RNA up to 1 μg. For the SARS-CoV-2 experiments, a previously sequenced E. coli sample was mixed 4:1 with viral RNA and served as both the carrier and internal control. Once fragmented, the targeted libraries are prepared by the total RNA protocol up through library amplification. SARS-CoV-2 reads were then enriched using the COVID xGen Hybrid Capture Kit (IDT) following the standard kit procedure. Briefly, 500 ng of each replicate were pooled up to a total of 6 μg DNA, mixed with hybridization blocking DNA, and hybridized to the probe pool. Bead capture followed by 7–9 cycles of post-capture on-bead PCR was performed to generate the final library.

Data analysis

Data collection and analysis were not performed blind to the conditions of the experiments.

Data distribution was assumed to be normal but this was not formally tested.

Error-correction and variant calling

Illumina BCL files were converted to FASTQ and demultiplexed from the 6 nt sample barcode in the i7 index read (base masking: Y*,I6N*,Y*,Y*). Next, UMIs were extracted and appended to the read headers before converting to unaligned BAM format. Specifically, the 11 nt PCR UMI was read from the i5 index, and the 16 nt cDNA UMI was clipped from the leading 16 bases of Read 2. Reads with identical UMI tags, originating from the same RNA fragment, were grouped into families and collapsed into consensus reads using a custom Python script. Consensus FASTQ sequences were aligned to the appropriate reference genome using BWA-MEM (accessions: NC_000913.3, NC_045512.2). Reads were then quality filtered, clipped, and realigned with GATK v3.8. Finally, variants were called using mpileup and tabulated via a custom R script. The frequency was calculated by dividing the read count of variants sharing a 5’ or 3’ sequence context by the adjusted sequencing depth of all bases sharing that 5’ or 3’ sequence context.

Software versions were as follows. FASTQ files were generated from raw BCL files using bcl2fastq (v2.19.0). Molecular barcodes were extracted from FASTQ reads via UMI-tools (v1.0.1). Reads were converted to unaligned BAM format with Picard (v2.18.29). Consensus calling was performed using a custom Python script adapted from published sources (https://github.com/Kennedy-Lab-UW). Consensus reads were aligned with BWA-MEM (v0.7.17). BAM files were quality filtered with samtools (v1.7). GATK (v3.8) was used for read soft-clipping and indel realignment. Samtools mpileup (v1.7) was used for variant calling. Variant frequency calculations, Fisher testing, and template switching analysis were performed using custom R scripts (v4.1.0). R was also used for data visualization. Phylogenetic trees were constructed in Nextstrain (v2.35.0).
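The family-grouping and consensus step can be illustrated with a minimal Python sketch. This is not the authors' script: it collapses the two UMIs into a single tag, assumes reads within a family are already the same length and orientation, and uses hypothetical parameter names and thresholds (min_family_size, min_agreement) chosen only for illustration.

```python
from collections import Counter, defaultdict


def call_consensus(reads, min_family_size=3, min_agreement=0.9):
    """Collapse reads that share a UMI tag into one consensus sequence.

    reads: iterable of (umi, sequence) pairs. Thresholds are illustrative
    assumptions, not values taken from the paper.
    """
    # Group reads into families keyed by their UMI tag.
    families = defaultdict(list)
    for umi, seq in reads:
        families[umi].append(seq)

    consensus = {}
    for umi, seqs in families.items():
        if len(seqs) < min_family_size:
            continue  # too few copies to error-correct this fragment
        bases = []
        # Walk the family column by column and keep the majority base,
        # masking positions where the copies disagree too often.
        for column in zip(*seqs):
            base, count = Counter(column).most_common(1)[0]
            bases.append(base if count / len(column) >= min_agreement else "N")
        consensus[umi] = "".join(bases)
    return consensus
```

A base seen in only one copy of a family is treated as a technical artifact and either outvoted or masked as N, which is why larger minimum family sizes lower the artifact rate at the cost of consensus depth.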

Statistical information

Frequencies were calculated as variant count / total consensus nucleotides sequenced. Two-tailed proportion confidence intervals were calculated for each frequency, and error bars represent 95% Wilson score intervals. To detect positions in the SARS-CoV-2 genome that have a base substitution frequency different from the sample average, we constructed a contingency table and performed Fisher’s exact test comparing each position to the genome-wide average. All positions that passed our initial filter (clonality ≤0.05, raw depth ≥50, fraction of N ≤0.05) were included in this analysis. To control the false discovery rate, P-values were adjusted using the Benjamini–Hochberg procedure. Positions with adjusted P-values <0.05, depth ≥5000 and substitution frequencies that are elevated or reduced compared to the genome-wide average were called as hot and cold spots, respectively. E. coli hybrid capture was performed on two biological replicates. Viral analysis was performed on two biological replicates per genotype. No statistical methods were used to pre-determine sample sizes but our sample sizes are similar to those reported in previous publications14,16.
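Two of the calculations named above, the Wilson score interval and the Benjamini–Hochberg adjustment, are standard and can be sketched in a few lines of Python. The function names are illustrative, and the sketch is not the authors' R code; Fisher's exact test is omitted here and would normally come from a statistics library.

```python
import math


def wilson_interval(variant_count, depth, z=1.959964):
    """95% Wilson score interval for a frequency variant_count / depth."""
    if depth <= 0:
        raise ValueError("depth must be positive")
    p = variant_count / depth
    denom = 1 + z * z / depth
    centre = (p + z * z / (2 * depth)) / denom
    half = z * math.sqrt(p * (1 - p) / depth + z * z / (4 * depth * depth)) / denom
    return centre - half, centre + half


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Step down from the largest P-value so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

Unlike the simple normal approximation, the Wilson interval stays within [0, 1] and remains informative when the variant count is very small, which suits rare-variant frequencies.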

Template switching analysis

A custom R script was used to identify the maximum nucleotides of complementarity between each deletion event and its upstream and downstream sequences. Complementarity was searched from 3’ to 5’ between each deletion and the sequence upstream, and from 5’ to 3’ between the deletion and the sequence downstream. Complementarity identified at either end of the deletion was combined to determine the maximum length of complementarity for each event, and the GC content was calculated. We identified all slippage-induced deletions (SIDs) with ≥2 nucleotides of complementarity across all virus samples, then summarized the unique events for each lineage. SIDs were also analyzed in real-world outbreak data. Deletion events from the outbreak were obtained from the CNCB database (https://ngdc.cncb.ac.cn/ncov/variation/annotation) on November 1, 2022.49
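One possible reading of the complementarity search is sketched below in Python. It assumes that "complementarity" at a deletion is scored as the run of bases at each edge of the deleted segment that matches the adjacent flanking sequence, which is how nascent-strand re-pairing appears on the genomic strand. The function name and the 0-based coordinates are illustrative, and the authors' R script may differ in detail.

```python
def flank_match_lengths(genome, start, size):
    """Score edge matches between a deletion and its flanking sequence.

    genome: sequence string; start: 0-based index of the first deleted
    base; size: deletion length. Returns (upstream, downstream) match
    lengths. Interpretation and naming are assumptions for illustration.
    """
    n = len(genome)

    # 5' to 3': compare the start of the deletion with the sequence
    # immediately downstream of the deletion.
    downstream = 0
    while (downstream < size
           and start + size + downstream < n
           and genome[start + downstream] == genome[start + size + downstream]):
        downstream += 1

    # 3' to 5': compare the end of the deletion with the sequence
    # immediately upstream of the deletion.
    upstream = 0
    while (upstream < size
           and start - 1 - upstream >= 0
           and genome[start + size - 1 - upstream] == genome[start - 1 - upstream]):
        upstream += 1

    return upstream, downstream
```

Under this reading, the combined length for an event is the sum of the two values, and an event with a combined length of at least 2 would be counted as a slippage-induced deletion.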

To test whether SIDs are enriched relative to all-type deletions, we simulated all possible deletion events in the SARS-CoV-2 genome (g.1–29870, excluding the poly-A tail) with size 2–67 nt – the maximum size detected by tARC-seq. Within the simulated dataset, we searched for SIDs with ≥2 nucleotides of complementarity. The frequency of SIDs in simulated data (predicted) was compared to the actual frequency (observed) to determine enrichment. Analysis of overlapping SIDs (≥2 nt) across lineages was performed using the R SuperExactTest package.50
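The predicted-versus-observed comparison can be sketched as an exhaustive enumeration. The Python below is a minimal illustration under the same assumed scoring as before (edge matches between the deleted segment and its flanks); the function names are hypothetical, and the observed counts would come from the sequencing data rather than from this code.

```python
def combined_match(genome, start, size):
    """Combined upstream + downstream edge-match length for one deletion."""
    n = len(genome)
    right = 0
    while (right < size and start + size + right < n
           and genome[start + right] == genome[start + size + right]):
        right += 1
    left = 0
    while (left < size and start - 1 - left >= 0
           and genome[start + size - 1 - left] == genome[start - 1 - left]):
        left += 1
    return left + right


def simulated_sid_fraction(genome, min_size=2, max_size=67, min_match=2):
    """Fraction of all possible deletions that qualify as SIDs."""
    total = sids = 0
    for size in range(min_size, max_size + 1):
        for start in range(len(genome) - size + 1):
            total += 1
            if combined_match(genome, start, size) >= min_match:
                sids += 1
    return sids / total if total else 0.0


def enrichment(observed_sids, observed_deletions, predicted_fraction):
    """Observed SID frequency divided by the simulated (predicted) one."""
    return (observed_sids / observed_deletions) / predicted_fraction
```

An enrichment above 1 indicates that deletions with flanking complementarity occur more often than expected if deletion endpoints were placed without regard to sequence.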

Negative binomial model

A negative binomial regression analysis of RNA variants in SARS-CoV-2 was performed fitting site (i.e. position) and strain (i.e. lineage) effects as explanatory variables. Sites sequenced to a mean depth >5000X and with ≥1 variant observed across samples were included in the analysis. To address the high dimensionality driven by the large number of sites, we utilized a resampling approach to consider the 12,093 sites passing filter. For 10,000 iterations we sampled 300 sites from the 12,093 and constructed a negative binomial regression model using the count of observed variants as the dependent variable and site and strain as the independent variables, adjusting for sequencing depth using the log depth as an offset variable. We constructed a dummy reference site with average VF equal to the mean VF across all sites and used the WT strain as the reference strain in all iterations. All sites were included in at least 200 regression models. This initial analysis found that Alpha trended towards increased variants compared to WT (βAlpha =0.13, P = 0.12), whereas Omicron did not appear different (βOmicron =−0.04, P = 0.68).
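The resampling scheme itself can be sketched separately from the model fit. With 10,000 iterations of 300 sites drawn from 12,093, each site is expected to appear in about 10,000 × 300 / 12,093 ≈ 248 models. The Python below only tracks which sites enter each iteration; the negative binomial fit is left as a comment because it needs a statistics library, and the function name and seed are illustrative assumptions.

```python
import random
from collections import Counter


def resample_sites(n_sites, subset_size, n_iterations, seed=0):
    """Count how many resampled models each site would be included in."""
    rng = random.Random(seed)
    inclusion = Counter()
    for _ in range(n_iterations):
        # Draw a random subset of sites without replacement.
        subset = rng.sample(range(n_sites), subset_size)
        inclusion.update(subset)
        # In the full analysis, a negative binomial regression of variant
        # count on site and strain, with log depth as an offset, would be
        # fitted to this subset and its coefficients stored per site.
    return inclusion
```

Calling resample_sites(12093, 300, 10000) reproduces the scale described above; per-site effect sizes would then be averaged over the models that include each site.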

To improve our power to detect strain effects across samples, we next focused exclusively on the most significant sites as determined by the mean log-P-value for each site across all inclusive models. Among these influential sites, Alpha had significantly more variant observations compared to WT (βAlpha = 0.34, P = 0.003). Omicron had fewer variants compared to WT (βOmicron = −0.25, P = 0.07), although the difference was not significant.

Extended Data

Extended Data Fig. 1 |. Replication cycle of SARS-CoV-2 virus.

a, As a positive-strand RNA virus, SARS-CoV-2 encodes an RNAP (blue) that is responsible for both replication and gene expression. After entering the cell, the virus releases its (+) strand RNA into the host cell’s cytoplasm. Using its own polymerase (RdRp), the viral RNA replicates into a (−) strand then back to a (+) strand, producing more viral RNA genomes for new virus particles. RNAP errors (red) generate genetic diversity in SARS-CoV-2 at any step of replication and fuel the evolution of novel strains. b, Definition of the terms used in this study. c, Plaque forming units over time of WT, Alpha, and Omicron SARS-CoV-2 grown in Vero cells (n=3), consistent with previous data52.

Extended Data Fig. 2 |. Hybrid capture of specific E. coli mRNAs for tARC-seq validation.

a, Hybrid capture in tARC-seq produces a >30-fold enrichment in post-consensus nucleotides across a panel of twelve E. coli genes (n = 2). PCR duplicates account for most of the pre-consensus nucleotides sequenced, and fold-enrichment drops during consensus calling as duplicates of the same parent RNA fragment are collapsed into a single read. The drop in enrichment between pre- and post-consensus reads is more pronounced for low-expression genes like marR. Fold enrichment was calculated from the cumulative, normalized sequencing depth across each gene in tARC-seq samples versus matched bulk ARC-seq controls. b,c, Each biological replicate represents one WT E. coli sample sequenced two ways to generate paired data. Purified mRNA was either fragmented individually and prepared for ARC-seq (control), or it was used as a carrier for SARS-CoV-2 fragmentation and tARC-seq (carrier). Libraries were sequenced separately and aligned to the E. coli reference for variant calling. Mutation frequencies were comparable between carrier (6.4 × 10−5) and control (7.5 × 10−5) samples (b) and reproduced the known variant frequency for WT E. coli (8.2 × 10−5)14. The mutation distribution across all base substitution types was also comparable between carrier and control (c). d,e, Comparison of the variant frequencies observed between ARC-seq and tARC-seq over the twelve genes in biological replicate 1 (d) and 2 (e). (n = 2; mean ± SD).

Extended Data Fig. 3 |. Empirical validation of tARC-seq data analysis parameters.

In contrast to de novo variants, clonal and subclonal variants are not independent events and should be filtered out during analysis. a, To determine an appropriate cutoff, all variants were graphed by the cumulative base substitution frequency as a function of each variant’s clonality. A cutoff of 0.05 – or ≤5% allele fraction – counted most variants on the curve while excluding clonal outliers. b, The overall variant frequency (left y-axis, grey bars) in WT SARS-CoV-2 is graphed by consensus read depth (right y-axis, purple line) over a series of minimum cDNAcs family sizes (minmem2). Minmem2 is an expression of the minimum number of PCR copies required to form a cDNA consensus sequence during consensus calling. A family size of 1 is equivalent to traditional RNA-seq without error correction, while a family size of 3 reduces the frequency of technical artifacts to <10−9[13].

Extended Data Fig. 4 |. De novo mutation frequencies in SARS-CoV-2 vary by feature independent of variant effect.

a, Synonymous (gray), nonsynonymous (blue) and nonsense mutation frequencies (red) from a single biological replicate of WT SARS-CoV-2 are mapped across nsp12, which encodes the viral RdRp. b, Substitution frequency is analyzed by Evolutionary Action (EA) for the S gene and nsp1219 (see also: http://cov.lichtargelab.org/). Higher EA scores correspond to residues with greater impact on evolutionary divergences and variants at these positions are predicted to be more deleterious. The lack of relationship between variant frequency and EA score (m ≈ 0) further suggests a limited role for selection in tARC-seq data. c, The variant frequency (VF) is computed by position (VF = count / depth) and graphed along the genome. Positions were filtered for depth ≥5,000X. Fisher’s exact test with Benjamini–Hochberg correction was performed position-wise across the genome to determine cold spots for RNA variants. Cold spot positions with significantly decreased VFs relative to the genome-wide average are indicated with black dots (P < 0.05). One representative sample of WT virus is shown. The broken black line represents the genome-wide average VF. d, RNA variant frequencies by ORF in WT, Alpha and Omicron SARS-CoV-2. e, C>T mutations observed across the WT1-Vero genome were graphed by 5’ base: T (red dot) or G (blue dot). Gray lines indicate every C>T mutation observed.

Extended Data Fig. 5 |. Effect of lineage on variant frequency using a Negative binomial regression model.

To handle the high dimensionality of the genome wide data a resampling regression approach was employed to estimate strain and site effects. Negative binomial regression analyses modeling the effects of explanatory variables on variant count were iteratively performed on random subsets of sites. Estimated effect sizes across all regression models in the resampling regression approach are calculated by taking the mean of the predicted effect sizes across all models that include the site. a-c, Strain effects are predicted for the 1000 most significant sites. The distributions of estimated strain effects show that Alpha (P = 0.003) has significantly higher variant frequency compared to WT (b), and Omicron (P = 0.07) has fewer variants than WT (c). Red dashed lines indicate 95% confidence intervals; for Alpha this confidence interval excludes zero indicating significance; for Omicron zero is within the interval but close to the boundary. d,e, Site effects estimated using all genomic positions in the resampling regression analysis demonstrate systematic patterns when grouped by features of the variant. In panel (d) differences are shown in the estimated site effects by mutation type broken down by reference base and alternative allele combinations. In panel (e) differences are shown in site effects across genomic features. f, Estimated site effects with P < 0.05 after adjusting for multiple testing are plotted along the genome.

Extended Data Fig. 6 |. Frequency and spectrum of RNA variants in two clinical samples.

Omicron data are graphed by sample source and clade. For both clinical samples, patients were female, age 25–30 years, fully vaccinated and without comorbidities. In both cases, symptoms were reported as “mild/moderate”. (BA.5.6 → 22B clade) (BA.2.12.1 → 22C) (B.1.1.529 → 21K) (n = 1, mean ± Wilson score 95% confidence interval). a,b, Genome-wide mutation frequencies (a) and frequencies by base substitution type (b) in the two clinical samples.

Extended Data Fig. 7 |. TRS activity fuels RNA variants.

Recombination at transcription regulatory sequences (TRS) drives sgRNA synthesis and is central to SARS-CoV-2 gene expression27. a, Canonical junctions associated with sgRNA synthesis are shown (black lines). Chimeric reads are detected by mapping with a spliced aligner (STAR). Each arc represents a chimeric alignment where the left and right x-intercepts correspond to the junction coordinates and line shading reflects frequency. Data is from one biological replicate of WT SARS-CoV-2. b-d, RNA errors are increased at TRS sites and flanking regions compared to genome-wide rates in WT (b), Alpha (c), and Omicron (d) SARS-CoV-2 (n = 2, mean ± SD). Each TRS region (n = 10) is small (<115 nt) and composed of one canonical TRS site plus 100 flanking nucleotides. WT TRS site error bars appear large because of high-frequency outliers. e, Indels are mapped by size (y-axis) and count (dot size) across the SARS-CoV-2 genome for one representative replicate of WT virus. Promiscuous RdRp activity fuels a diverse repertoire of indels detectable by tARC-seq at a single locus, visible as a vertical smear.

Extended Data Fig. 8 |. Template switching is a driving mechanism for deletions.

a, To test whether template switching is a contributing factor for deletions, we determined the number of SIDs expected by chance. Each deletion event of size ≥2 bp in the WT1 sample was assigned a random genomic position in SARS-CoV-2 while preserving the deletion size. The occurrence of SIDs and the maximum nucleotides of complementarity were then determined. This process was repeated 1000 times to obtain a null distribution for each complementarity size. The observed SID occurrences are indicated as red lines. b, Z-scores, normal distribution P values and empirical P values were determined by comparing the SID occurrence in the WT1 sample to the null distribution. A significant P value suggests that more SIDs than expected are observed in the viral sample, and that slippage is a significant contributing mechanism for deletion events. c, SIDs observed in more than two lineages have significantly lower GC content than those observed in no more than one lineage (two-sided t-test). d, SIDs observed in more than two lineages have significantly longer complementarity size than those observed in no more than one lineage (two-sided t-test).
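The comparison of an observed count to a permutation null can be sketched as follows in Python. The caption does not state whether the P values are one- or two-sided or how ties are handled, so this sketch assumes a one-sided upper-tail test and the common add-one correction for the empirical P value; the function name is illustrative.

```python
import statistics


def compare_to_null(observed, null_counts):
    """Summarize an observed SID count against a permutation null.

    Returns (z_score, normal_p, empirical_p). One-sided upper-tail
    testing and the add-one correction are assumptions of this sketch.
    """
    mean = statistics.mean(null_counts)
    sd = statistics.stdev(null_counts)
    z_score = (observed - mean) / sd
    # Upper-tail probability under a normal approximation to the null.
    normal_p = 1 - statistics.NormalDist().cdf(z_score)
    # Share of permutations at least as extreme as the observation.
    at_least = sum(count >= observed for count in null_counts)
    empirical_p = (1 + at_least) / (1 + len(null_counts))
    return z_score, normal_p, empirical_p
```

With 1000 permutations, the smallest empirical P value this correction can return is 1/1001, which is why a normal-approximation P value is often reported alongside it.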

Extended Data Fig. 9 |. RdRp template switching contributes to genomic change during the COVID-19 pandemic.

VOC-specific multiple nucleotide alterations can be modeled as singular RdRp template switching events based on 3’ micro-complementarity that facilitates RdRp misalignments/realignments. a, Phylogenetic tree based on sequence alterations observed in VOC arising from the 20B clade; not drawn to any scale. The different colors indicate VOC-specific nucleotide alterations. b-d, Top panels, proposed template switching events that explain multiple nucleotide alterations in the Lambda (b), Alpha (c), and Gamma (d) lineages. Bottom panels, phylogenetic trees that establish the singular origin (red arrows) of the coordinated multiple nucleotide alterations in each lineage. Phylogenetic trees were constructed in Nextstrain v2.35.0 from genomes sequenced between Dec. 2019 and March 2022.

Acknowledgments:

We would like to thank Susan Rosenberg, Mary Estes, Carol Gross, Peter Hotez, Jennifer Halliday, and Herman Dierick for critical reading. The study was supported by NIH grants R01GM088653 (C.H.), 3R01AG061105-03S1 (O.L.), 1R21CA259780 (S.R.K.), and 1R21HG011229 (S.R.K.); and by NSF grant DBI-2032904 (O.L.). The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript. The authors thank the Reviewers for their helpful comments.

Footnotes

Code availability: Python and R code are available on GitHub (https://github.com/chermanlab/SarsCov2-ArcSeq).

Competing interests: The authors declare no competing interests.

Data availability

Sequencing data are available through the Sequence Read Archive under BioProject PRJNA824595.

References

1. Snijder EJ, Decroly E. & Ziebuhr J. Chapter Three - The Nonstructural Proteins Directing Coronavirus RNA Synthesis and Processing. in Advances in Virus Research (ed. Ziebuhr J) vol. 96 59–126 (Academic Press, 2016).
2. Bradley CC, Gordon AJE, Halliday JA & Herman C. Transcription fidelity: New paradigms in epigenetic inheritance, genome instability and disease. DNA Repair (Amst) 81, 102652 (2019).
3. Drake JW Rates of spontaneous mutation among RNA viruses. Proc Natl Acad Sci U S A 90, 4171–4175 (1993).
4. Sanjuán R, Nebot MR, Chirico N, Mansky LM & Belshaw R. Viral Mutation Rates. Journal of Virology 84, 9733–9748 (2010).
5. Acevedo A, Brodsky L. & Andino R. Mutational and fitness landscapes of an RNA virus revealed through population sequencing. Nature 505, 686–690 (2014).
6. Smith EC, Sexton NR & Denison MR Thinking Outside the Triangle: Replication Fidelity of the Largest RNA Viruses. Annu Rev Virol 1, 111–132 (2014).
7. Eckerle LD et al. Infidelity of SARS-CoV Nsp14-exonuclease mutant virus replication is revealed by complete genome sequencing. PLoS Pathog 6, e1000896 (2010).
8. Koyama T, Platt D. & Parida L. Variant analysis of SARS-CoV-2 genomes. Bull World Health Organ 98, 495–504 (2020).
9. Wang S. et al. Molecular evolutionary characteristics of SARS-CoV-2 emerging in the United States. J Med Virol 94, 310–317 (2022).
10. Tay JH, Porter AF, Wirth W. & Duchene S. The Emergence of SARS-CoV-2 Variants of Concern Is Driven by Acceleration of the Substitution Rate. Mol Biol Evol 39, msac013 (2022).
11. Di Giorgio S, Martignano F, Torcia MG, Mattiuz G. & Conticello SG Evidence for host-dependent RNA editing in the transcriptome of SARS-CoV-2. Sci Adv 6, eabb5813 (2020).
12. Reid-Bayliss KS & Loeb LA Accurate RNA consensus sequencing for high-fidelity detection of transcriptional mutagenesis-induced epimutations. Proc Natl Acad Sci U S A 114, 9415–9420 (2017).
13. Acevedo A. & Andino R. Library preparation for highly accurate population sequencing of RNA viruses. Nat Protoc 9, 1760–1769 (2014).
14. Traverse CC & Ochman H. Conserved rates and patterns of transcription errors across bacterial growth states and lifestyles. Proc Natl Acad Sci U S A 113, 3311–3316 (2016).
15. Li W. & Lynch M. Universally high transcript error rates in bacteria. Elife 9, e54898 (2020).
16. Traverse CC & Ochman H. A Genome-Wide Assay Specifies Only GreA as a Transcription Fidelity Factor in Escherichia coli. G3 (Bethesda) 8, 2257–2264 (2018).
17. Eyre-Walker A. & Keightley PD The distribution of fitness effects of new mutations. Nat Rev Genet 8, 610–618 (2007).
18. Sanjuán R, Moya A. & Elena SF The distribution of fitness effects caused by single-nucleotide substitutions in an RNA virus. Proc Natl Acad Sci U S A 101, 8396–8401 (2004).
19. Wang C. et al. Identification of evolutionarily stable functional and immunogenic sites across the SARS-CoV-2 proteome and greater coronavirus family. Bioinformatics btab406 (2021) doi: 10.1093/bioinformatics/btab406.
20. Hou YJ et al. SARS-CoV-2 D614G variant exhibits efficient replication ex vivo and transmission in vivo. Science 370, 1464–1468 (2020).
21. Fitzsimmons WJ et al. A speed–fidelity trade-off determines the mutation rate and virulence of an RNA virus. PLoS Biol 16, e2006459 (2018).
22. Aksamentov I, Roemer C, Hodcroft EB & Neher RA Nextclade: clade assignment, mutation calling and quality control for viral genomes. Journal of Open Source Software 6, 3773 (2021).
23. Chung C. et al. Evolutionary conservation of the fidelity of transcription. Nat Commun 14, 1547 (2023).
24. Wei L. Retrospect of the Two-Year Debate: What Fuels the Evolution of SARS-CoV-2: RNA Editing or Replication Error? Curr Microbiol 80, 151 (2023).
25. Nakata Y. et al. Cellular APOBEC3A deaminase drives mutations in the SARS-CoV-2 genome. Nucleic Acids Res 51, 783–795 (2023).
26. Kim K. et al. The roles of APOBEC-mediated RNA editing in SARS-CoV-2 mutations, replication and fitness. Sci Rep 12, 14972 (2022).
27. Kim D. et al. The Architecture of SARS-CoV-2 Transcriptome. Cell 181, 914–921.e10 (2020).
28. Alonso S, Izeta A, Sola I. & Enjuanes L. Transcription Regulatory Sequences and mRNA Expression Levels in the Coronavirus Transmissible Gastroenteritis Virus. J Virol 76, 1293–1308 (2002).
29. Garushyants SK, Rogozin IB & Koonin EV Template switching and duplications in SARS-CoV-2 genomes give rise to insertion variants that merit monitoring. Commun Biol 4, 1343 (2021).
30. Abraham M. & Hazkani-Covo E. Protein innovation through template switching in the Saccharomyces cerevisiae lineage. Sci Rep 11, 22558 (2021).
31. Wu H. et al. Nucleocapsid mutations R203K/G204R increase the infectivity, fitness, and virulence of SARS-CoV-2. Cell Host Microbe 29, 1788–1801.e6 (2021).
32. Bar-On YM, Flamholz A, Phillips R. & Milo R. SARS-CoV-2 (COVID-19) by the numbers. Elife 9, e57309 (2020).
33. Moeller NH et al. Structure and dynamics of SARS-CoV-2 proofreading exoribonuclease ExoN. Proc Natl Acad Sci U S A 119, e2106379119 (2022).
34. Baddock HT et al. Characterization of the SARS-CoV-2 ExoN (nsp14ExoN–nsp10) complex: implications for its role in viral genome stability and inhibitor identification. Nucleic Acids Research 50, 1484–1500 (2022).
35. Ogando NS et al. The Enzymatic Activity of the nsp14 Exoribonuclease Is Critical for Replication of MERS-CoV and SARS-CoV-2. J Virol 94, e01246–20 (2020).
36. Hastings PJ, Ira G. & Lupski JR A microhomology-mediated break-induced replication model for the origin of human copy number variation. PLoS Genet 5, e1000327 (2009).
37. Xiao Y. et al. RNA Recombination Enhances Adaptability and Is Required for Virus Spread and Virulence. Cell Host Microbe 19, 493–503 (2016).
38. Gribble J. et al. The coronavirus proofreading exoribonuclease mediates extensive viral recombination. PLoS Pathog 17, e1009226 (2021).
39. Lee H, Popodi E, Tang H. & Foster PL Rate and molecular spectrum of spontaneous mutations in the bacterium Escherichia coli as determined by whole-genome sequencing. Proc Natl Acad Sci U S A 109, E2774–2783 (2012).
40. Kunkel TA The mutational specificity of DNA polymerases-alpha and -gamma during in vitro DNA synthesis. J Biol Chem 260, 12866–12874 (1985).
41. Imashimizu M, Oshima T, Lubkowska L. & Kashlev M. Direct assessment of transcription fidelity by high-resolution RNA sequencing. Nucleic Acids Res 41, 9090–9104 (2013).
42. Hestand MS, Van Houdt J, Cristofoli F. & Vermeesch JR Polymerase specific error rates and profiles identified by single molecule sequencing. Mutat Res 784–785, 39–45 (2016).
43. Hadfield J. et al. Nextstrain: real-time tracking of pathogen evolution. Bioinformatics 34, 4121–4123 (2018).
44. Gout J-F et al. The landscape of transcription errors in eukaryotic cells. Science Advances 3, e1701484 (2017).
45. Harcourt J. et al. Severe Acute Respiratory Syndrome Coronavirus 2 from Patient with Coronavirus Disease, United States. Emerg Infect Dis 26, 1266–1273 (2020).
  • 46.Rio DC, Ares M, Hannon GJ & Nilsen TW Purification of RNA Using TRIzol (TRI Reagent). Cold Spring Harb Protoc 2010, pdb.prot5439 (2010). [DOI] [PubMed] [Google Scholar]
  • 47.Avadhanula V. et al. Viral load of SARS-CoV-2 in adults during the first and second wave of COVID-19 pandemic in Houston, TX: the potential of the super-spreader. J Infect Dis jiab097 (2021) doi: 10.1093/infdis/jiab097. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 48.Stead MB et al. RNAsnap: a rapid, quantitative and inexpensive, method for isolating total RNA from bacteria. Nucleic Acids Res 40, e156 (2012). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 49.Zhao W-M et al. The 2019 novel coronavirus resource. Yi Chuan 42, 212–221 (2020). [DOI] [PubMed] [Google Scholar]
  • 50.Wang M, Zhao Y. & Zhang B. Efficient Test and Visualization of Multi-Set Intersections. Sci Rep 5, 16923 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 51.Mautner L. et al. Replication kinetics and infectivity of SARS-CoV-2 variants of concern in common cell culture models. Virol J 19, 76 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

Data Availability Statement

Sequencing data are available through the Sequence Read Archive under BioProject PRJNA824595.
