Abstract
Sequencing the SARS-CoV-2 genome from clinical samples can be challenging, especially in specimens with low viral titer. Here we report Accurate SARS-CoV-2 genome Reconstruction (ACoRE), an amplicon-based viral genome sequencing workflow for the complete and accurate reconstruction of SARS-CoV-2 sequences from clinical samples, including suboptimal ones that would usually be excluded even if unique and irreplaceable. The protocol was optimized to improve flexibility, and the combination of technical replicates was established as the central strategy to achieve accurate analysis of low-titer/suboptimal samples. We demonstrated the utility of the approach by achieving complete genome reconstruction and the identification of false-positive variants in >170 clinical samples, thus avoiding the generation of inaccurate and/or incomplete sequences. Most importantly, ACoRE was crucial to identify the correct viral strain responsible for a relapse case, which would otherwise have been misclassified as a re-infection due to missing or incorrect variant identification by a standard workflow.
Keywords: SARS-CoV-2 genome sequencing, Genetic variants, Re-infection, Suboptimal samples, Low-viral titer
1. Introduction
The coronavirus disease 2019 (COVID-19) pandemic has thus far resulted in the infection of more than 84 million people, causing at least 1.8 million deaths (Johns Hopkins University, 1/1/2021) [1]. The agent responsible for COVID-19 is a β-coronavirus known as severe acute respiratory syndrome-associated coronavirus 2 (SARS-CoV-2) with a compact single-stranded RNA genome of 29,903 nucleotides. The first SARS-CoV-2 genome sequence was published soon after the initial outbreak [2], and more than 260,000 complete genome sequences have subsequently been deposited in the GISAID database [3]. The phylogenetic analysis of genomic sequences provides a valuable tool to track viral diversity during the course of a pandemic and to identify the emergence of prevalent strains characterized by lineage-specific single nucleotide variants (SNVs), such as the D614G variant in the SARS-CoV-2 spike protein gene (23,403,A → G) [[4], [5], [6]]. As the virus propagates in human-to-human transmission, changes in the reference genome sequence must be recorded to monitor correlations between viral genotype and disease communicability, manifestation and severity [4,[7], [8], [9]]. The combination of genomic analysis and epidemiological data can also reliably determine the extent of SARS-CoV-2 transmission in different nations [[10], [11], [12]] and thus facilitates early decision-making to control local transmission [13]. Finally, mutations that may be relevant to the fitness or antigenic profile of the virus can be identified to ensure the efficacy of vaccines and immunotherapeutic interventions in the clinic [4,14].
Consensus variations reflect the analysis of virus sequences that differ between patients, but the analysis of intra-individual single nucleotide variations (iSNVs) is also important because it helps us to understand more about virus–host interactions, as previously demonstrated for Ebola, Zika, influenza and HIV [[15], [16], [17], [18], [19]]. The analysis of iSNVs during the COVID-19 pandemic may also provide data about the potential of SARS-CoV-2 for immunological escape and resistance to therapy, as well as on the sensitivity of molecular diagnostic assays [[20], [21], [22]]. However, the identification of iSNVs in clinical samples can be challenging because current protocols often feature enrichment and amplification steps that introduce technical errors indistinguishable from true biological variants [23].
The reconstruction of complete and accurate genomic sequences to detect both SNVs and iSNVs is therefore necessary to produce reliable data for all these purposes. In addition, the accumulation of meaningful data during pandemics requires the analysis of many samples, and the corresponding methods must therefore be cost-effective, straightforward and suitable for high multiplexing [24]. The protocols must also be sensitive enough to detect low viral titers but applicable over a wide dynamic range of virus concentrations to allow the analysis of clinical samples with different viral loads, ideally including samples from early and late infection stages, which usually show lower viral loads, or from re-infection/relapse cases [25,26].
Among the many approaches available for SARS-CoV-2 whole-genome analysis, the amplicon-based sequencing method developed by the ARTIC Network [27] is currently the most widely used [13,24,[28], [29], [30], [31], [32]]. Based on the PrimalSeq protocol originally developed for Zika virus [23,33], the ARTIC Network designed a set of 98 tiled amplicons in two PCR pools for the targeted whole-genome amplification of SARS-CoV-2 [27]. This approach is simple and highly sensitive, but it suffers from technical biases leading to uneven genome coverage, thus reducing the completeness and accuracy of genome sequencing, especially for the identification of iSNVs in samples with low viral titers [[34], [35], [36]]. Sequencing technical replicates of multiple cDNAs generated from the same sample has been proposed as a mitigation strategy to identify iSNVs more reliably [23]. However, whereas amplicon-based sequencing has been widely used for the analysis of low-frequency variants [[20], [21], [22],37,38], only a few studies thus far have evaluated the confidence of such calls and have implemented the sequencing of cDNA replicates to ensure accuracy [23]. False positives have also been reported among high-frequency variants supported by good sequencing depth, indicating that the risks of inaccurate sequencing are not limited to suboptimal samples [39].
To avoid the generation of incomplete genomic sequences typically associated with poor genome coverage [[40], [41], [42]], the sequencing of samples with fewer than 1000 virus copies per RT-qPCR reaction (Ct > 30) is currently discouraged [23,43]. However, the strict implementation of such recommendations would lead to the exclusion of many clinical samples, which are often unavoidably collected or stored under suboptimal conditions. Since specimens with these features may be unique, irreplaceable and central to the investigation, numerous studies report sequencing data from samples with (very) low viral titers (Ct > 30) despite this advice [26,44,45]. To address these challenges, we set out to develop an optimized workflow, ACoRE (Accurate SARS-CoV-2 genome Reconstruction), for the reliable reconstruction of complete and accurate SARS-CoV-2 genomes from clinical samples with a broad range of Ct values, aiming to improve the flexibility, accuracy and throughput of amplicon-based sequencing.
2. Materials and methods
2.1. Clinical samples
A total of 178 nasopharyngeal swabs (eSwab, Copan, Italy) were obtained from 172 COVID-19 patients diagnosed at the Department of Infectious, Tropical Diseases and Microbiology of the IRCCS Sacro Cuore Don Calabria Hospital, qualified for SARS-CoV-2 molecular diagnosis by the regional reference laboratory (Department of Microbiology, University Hospital of Padua). After collection, swabs were stored at 4 °C for a maximum of 48 h, analyzed by the molecular diagnostic method described in the following section and subsequently stored at −80 °C. The study was approved by the competent Ethical Committee for Clinical Research of Verona and Rovigo Provinces (Prot N° 39,528/2020).
2.2. RNA extraction and RT-qPCR analysis
The routine RT-qPCR protocol was based on the WHO guidelines [46]. Briefly, RNA was extracted from 200 μL of swabs using the automated Microlab Nimbus workstation (Hamilton, Reno, NV, USA) coupled to a Kingfisher Presto system (Thermo Fisher Scientific, Waltham, MA, USA) or using the MagnaMax Viral/Pathogen extraction kit (Thermo Fisher Scientific) according to the manufacturer's instructions. RT-qPCR was carried out using the CDC 2019-nCoV rRT-PCR Diagnostic Panel assay and protocol [47], targeting the nucleocapsid protein gene regions N1 and N2 (with the human RNase P gene as the internal control) on a CFX96 Touch system (Bio-Rad Laboratories, Hercules, California, USA). The amplification cycle threshold (Ct) was determined using CFX Maestro (Bio-Rad), setting a baseline threshold at 200 relative fluorescence units (RFU). A standard curve from 5 to 500 genome copies per reaction was generated by serial dilution of the CDC control plasmid (2019-nCoV_N_Positive Control, Integrated DNA Technologies, Coralville, Iowa, USA), which contains the complete nucleocapsid gene of SARS-CoV-2 and therefore the CDC qPCR assay target regions (Table S1).
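The way a standard curve converts Ct values into copy numbers can be sketched as follows. This is a minimal illustration that assumes the usual log-linear model, Ct = slope × log10(copies) + intercept; the text does not state how the curve was fitted, and the dilution Ct values and function names below are hypothetical.

```python
import math

def fit_standard_curve(copies, ct_values):
    """Least-squares fit of Ct = slope * log10(copies) + intercept."""
    xs = [math.log10(c) for c in copies]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ct_values) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ct_values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def copies_from_ct(ct, slope, intercept):
    """Invert the fitted line to estimate genome copies per reaction."""
    return 10 ** ((ct - intercept) / slope)

# Hypothetical dilution series across the 5-500 copies range (made-up Ct values).
dilution_copies = [5, 50, 500]
dilution_ct = [35.6, 32.3, 29.0]
slope, intercept = fit_standard_curve(dilution_copies, dilution_ct)
estimated = copies_from_ct(32.3, slope, intercept)
```

Extrapolating far outside the fitted range (for example to very high viral loads) relies on the linear model continuing to hold, which this sketch does not check.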
2.3. Reverse transcription and amplification of the SARS-CoV-2 genome
Samples with Ct values of 15–18 were diluted 10-fold as suggested by the ARTIC Network [27]. RNA (5 or 10 μL) was incubated with 1 μL of 60 μM Random Primer Mix (New England Biolabs, Ipswich, MA, USA) and 1 μL of 10 mM dNTPs (New England Biolabs) at 65 °C for 5 min followed by 1 min on ice. Subsequently, 4 μL of 5× SSIV buffer, 1 μL of 100 mM DTT, 1 μL of 40 U/μL RNaseOUT, 1 μL of 200 U/μL SSIV enzyme (Thermo Fisher Scientific) and 6 μL nuclease-free water (total reaction volume = 20 μL) were added and the reaction conducted at 23 °C for 10 min, 52 °C for 10 min and 80 °C for 10 min. Two or three cDNAs were generated from every sample, each of which was amplified 2–3 times using the ARTIC protocol. 2.5 or 5 μL cDNA were mixed with 3.7 μL of 10 μM primer pools A and B from the ARTIC nCoV-2019 V3 panel (IDT, Coralville, IA, USA), 12.5 μL Q5 high-fidelity DNA polymerase 2× (New England Biolabs) for each of the primer pools, and nuclease-free water to a final volume of 25 μL. The reaction was conducted at 98 °C for 30 s, followed by 25 cycles (sample Ct ≤ 21) or 35 cycles (sample Ct > 21) of 98 °C for 15 s and 65 °C for 5 min. PCR products were then combined and cleaned up using 1× AMPure XP beads (Beckman Coulter, Brea, CA, USA) and eluted in 15 μL of water. Resulting amplicons were analyzed on the 4150 TapeStation System (Agilent Technologies, Santa Clara, CA, USA) and quantified using the Qubit dsDNA HS Assay kit (Thermo Fisher Scientific).
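The Ct-dependent choices in this section (10-fold dilution for Ct 15–18, 25 PCR cycles up to Ct 21 and 35 cycles above) can be written as a small lookup. This is only a sketch of the rules as stated here; the function name is hypothetical, and treating samples with Ct below 15 as undiluted is an assumption because the text does not cover that case.

```python
def amplification_plan(ct):
    """Return the dilution factor and PCR cycle count for a sample Ct value."""
    # Assumption: only the stated 15-18 window is diluted; Ct < 15 is left as-is.
    dilution_factor = 10 if 15 <= ct <= 18 else 1
    # 25 cycles for Ct <= 21, 35 cycles for Ct > 21, as described in the text.
    pcr_cycles = 25 if ct <= 21 else 35
    return {"dilution_factor": dilution_factor, "pcr_cycles": pcr_cycles}

high_titer = amplification_plan(16.5)
low_titer = amplification_plan(33.0)
```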
2.4. Full-length amplicon sequencing
Libraries were prepared from 50 ng of virus amplicons using the KAPA Hyper Prep kit and unique dual-indexed adapters (5 μL of a 15 μM stock) according to the supplier's protocol (Roche, Basel, Switzerland). Post-ligation products were cleaned up using 0.8× AMPure XP beads followed by library amplification (six cycles) with the KAPA Library Amplification Primer Mix (Roche). After a clean-up with 1× AMPure XP beads, libraries were analyzed on the 4150 TapeStation System (average size 526–573 bp) and quantified using the Qubit dsDNA BR Assay kit (Thermo Fisher Scientific). Barcoded libraries were pooled at equimolar concentrations and sequenced on the MiSeq platform (Illumina, San Diego, CA, USA) with the MiSeq Reagent Kit v2 in 250PE mode.
2.5. Fragmented amplicon sequencing
Libraries were prepared from 10 μL of purified viral amplicons using the Illumina DNA Prep kit (Illumina, former NexteraFlex kit) according to the manufacturer's recommendations, and combinatorial dual indexes were added in six cycles of PCR. 10-μL aliquots of each amplified library were cleaned up with 1× beads (Illumina) and eluted in 20 μL of resuspension buffer (Illumina). Resulting libraries were analyzed on the 4150 TapeStation System (average size 335–369 bp), pooled and quantified using the Qubit dsDNA BR Assay kit. Libraries were sequenced on a Novaseq 6000 device (Illumina) using a SP flow cell in 100PE mode, or on a NextSeq500 (Illumina) in 150PE mode.
2.6. Data filtering and reference genome alignment
Full-length amplicon sequencing data were randomly downsampled using seqtk sample v1.3 (https://github.com/lh3/seqtk). To compare sequencing data from the full-length and fragmented amplicons, KAPA library reads were downsampled to the same mean mapped coverage as the corresponding Illumina library replicates using sambamba v0.6.7 [48]. To simulate sequencing using 100PE reads, data from the fragmented amplicon libraries were trimmed using a custom script. All sequencing datasets were trimmed for quality and adapters were removed using Trimmomatic v0.39 [49] with the following parameters: ILLUMINACLIP:adapters_file:2:30:10 LEADING:5 TRAILING:5 SLIDINGWINDOW:4:20. Filtered reads were aligned to the SARS-CoV-2 reference genome (GenBank ID: MN908947.3) using BWA MEM v0.7.17 [50] with default parameters and the resulting alignment file was converted to BAM format using SAMtools v1.9 [51]. For the fragmented libraries, duplicate reads were identified and discarded using Picard v2.21.1 (http://broadinstitute.github.io/picard). Subsequently, iVar v1.2.2 trim [23] was used to remove ARTIC v3 primer sequences from the BAM files. For the fragmented libraries, the -e parameter was used to include reads without primers. Finally, overlapping portions of reads were clipped using fgbio ClipBam v1.1.0 (https://github.com/fulcrumgenomics/fgbio) with the following parameters: --clip-overlapping-reads -c Hard. Coverage and genotypability statistics were calculated from the BAM files using bedtools genomecov v2.19.1 [52] and GATK CallableLoci v3.8 [53], respectively. Raw genomic sequencing data were deposited in NCBI GenBank (BioProject no. PRJNA690890).
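The custom script used to simulate 100PE reads is not shown, so the following is only a hypothetical stand-in for that step: it truncates every read (and its quality string) in a FASTQ stream to its first 100 bases. It assumes well-formed four-line FASTQ records; the function name is illustrative.

```python
import io

def trim_fastq(handle_in, handle_out, max_len=100):
    """Copy a FASTQ stream, keeping only the first max_len bases of each read."""
    while True:
        header = handle_in.readline()
        if not header:
            break  # end of input
        seq = handle_in.readline().rstrip("\n")
        plus = handle_in.readline()
        qual = handle_in.readline().rstrip("\n")
        handle_out.write(header)
        handle_out.write(seq[:max_len] + "\n")
        handle_out.write(plus)
        handle_out.write(qual[:max_len] + "\n")

# In-memory demonstration with a single synthetic 150-base read.
demo_in = io.StringIO("@read1\n" + "A" * 150 + "\n+\n" + "I" * 150 + "\n")
demo_out = io.StringIO()
trim_fastq(demo_in, demo_out)
trimmed_lines = demo_out.getvalue().splitlines()
```

For paired-end data the same function would be applied to the R1 and R2 files separately, which keeps the read pairing intact because no records are dropped.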
2.7. Consensus variant calling and generation of the consensus sequence
A pileup was calculated for each position in the BAM file of each replicate using the SAMtools v1.9 mpileup option with parameters -aa -A -d 0 -Q 0. The resulting files were used as input for iVar consensus v1.2.2 [23] to generate consensus sequences, considering those positions covered by at least three reads (parameters: -t 0 -m 3). The most abundant nucleotide for each position was reported in the consensus sequence, whereas positions covered by fewer than three reads or reporting an equal proportion of nucleotides were represented by the ambiguous character N.
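The per-position rule described above (most abundant nucleotide, with N for positions covered by fewer than three reads or with tied counts) can be illustrated with a short sketch. It is not the iVar implementation; it simply applies the stated rule to hypothetical per-position base counts.

```python
def consensus_base(counts, min_depth=3):
    """Return the consensus base for one position given {base: read count}."""
    depth = sum(counts.values())
    if depth < min_depth:
        return "N"  # too few reads to call a base
    top = max(counts.values())
    winners = [base for base, n in counts.items() if n == top]
    # A tie between nucleotides is reported as ambiguous.
    return winners[0] if len(winners) == 1 else "N"

def consensus_sequence(pileup, min_depth=3):
    """Build a consensus string from a list of per-position count dicts."""
    return "".join(consensus_base(counts, min_depth) for counts in pileup)

# Hypothetical four-position pileup: clear call, low depth, tie, mixed site.
toy_pileup = [{"A": 120}, {"C": 2}, {"G": 5, "T": 5}, {"T": 90, "C": 10}]
toy_consensus = consensus_sequence(toy_pileup)
```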
To call variants present in the consensus sequences (consensus variants), sequences were aligned to the SARS-CoV-2 reference genome using Minimap2 v2.17 [54] and the alignment file was converted to the BAM format using SAMtools v1.9. Consensus variants were then called using bcftools call v1.10.2 [51] with the following parameters: --ploidy 1 -A -m -P 0.05 -M -Oz.
Final consensus sequences from the cohort of 170 samples and the relapse case were called after merging sequencing data for each individual replicate. False-positive variants in the consensus sequence were identified manually by comparing the presence of discordant iSNVs at the same genomic position between replicates of the same sample and considering only positions genotyped in both replicates. False-positive variants were removed from consensus sequences and replaced with the reference allele.
2.8. iSNV variant calling
Alignment BAM files were used to call iSNVs present in each replicate with a minimum minor allele frequency (MAF) threshold of 3%. Joint variant calling of the 30 entire amplicon libraries, and between replicates of the same sample for fragmented amplicon libraries, was achieved by generating a pileup using SAMtools mpileup v1.9 [51] with the following parameters: -A -d 600,000 -B -Q 0. The output file was used to detect iSNVs with VarScan mpileup2cns v2.3.9 [55] and the following parameters: --min-var-freq 0.03 --min-avg-qual 20.
For each sample, inter-replicate discordant variants were identified by iSNV variant calling after merging sequencing data from all replicates, considering only genotyped positions. A discordant variant was defined as a variant called in one replicate, whereas the same position in the other replicate reported the reference allele.
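The two ideas in this section, a 3% frequency threshold for iSNVs and the definition of a discordant variant, can be sketched as follows. This is a simplified illustration rather than the VarScan logic: it ignores base quality, the input structures and function names are hypothetical, and the sets of genotyped positions are supplied by the caller.

```python
def call_isnvs(pileup, reference, min_freq=0.03):
    """Return {position: set of alt bases} for non-reference alleles >= min_freq.

    pileup maps 1-based positions to {base: read count}.
    """
    calls = {}
    for pos, counts in pileup.items():
        depth = sum(counts.values())
        if depth == 0:
            continue
        ref = reference[pos - 1]
        alts = {b for b, n in counts.items() if b != ref and n / depth >= min_freq}
        if alts:
            calls[pos] = alts
    return calls

def discordant_variants(calls_a, calls_b, genotyped_a, genotyped_b):
    """Variants called in one replicate while the other reports the reference,
    restricted to positions genotyped in both replicates."""
    discordant = set()
    for pos in genotyped_a & genotyped_b:
        a = calls_a.get(pos, set())
        b = calls_b.get(pos, set())
        if a and not b:
            discordant.update((pos, alt) for alt in a)
        elif b and not a:
            discordant.update((pos, alt) for alt in b)
    return discordant

# Hypothetical example: position 3 is not genotyped in replicate 2.
reference = "ACGT"
rep1 = {1: {"A": 95, "G": 5}, 2: {"C": 100}, 3: {"G": 50, "T": 50}}
rep2 = {1: {"A": 100}, 2: {"C": 100}}
calls1 = call_isnvs(rep1, reference)
calls2 = call_isnvs(rep2, reference)
flagged = discordant_variants(calls1, calls2, {1, 2, 3}, {1, 2})
```

In this toy case the variant at position 1 is flagged because replicate 2 shows only the reference allele there, whereas the variant at position 3 is left unjudged because the position was not genotyped in both replicates.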
2.9. Calculation of the concordance rate
The concordance rate (Rc) between replicate samples was calculated as follows:
Nc represents (i) the number of shared variants (consensus variants or iSNVs) excluding positions that could not be genotyped in at least one replicate, or (ii) the number of shared genotypable bases, excluding positions marked N in at least one replicate, or (iii) the number of shared amplicons with coverage higher than three reads in all replicates. N1 and N2 represent the total number of iSNVs, consensus variants, genotypable bases or covered amplicons detected in each of the two samples in the analysis. Rc was calculated by comparing couples of replicates generated from the same cDNA (intra-cDNA concordance) and triplets of replicates generated from different cDNAs (inter-cDNA concordance) as shown in Table S2.
2.10. Statistical analysis
The non-parametric Wilcoxon signed rank test and the Mann Whitney U test were used to compare matched pairs and non-matched data, respectively. The non-parametric Friedman test was used to compare multiple paired groups. Significance of pairing was confirmed by calculating Spearman's rho. We used GraphPad Prism 6.0 (GraphPad Software, San Diego, CA, USA) for all statistical analysis, with a significance threshold of p < 0.05.
3. Results
3.1. Accuracy of SARS-CoV-2 genome reconstruction
The original PrimalSeq protocol stipulates two independent reverse transcriptions per sample and the subsequent amplification of the separate cDNAs in order to reduce technical errors. In this study, we initially tested replicate amplifications from the same cDNA to investigate whether this alternative approach could affect the reproducibility in the generation of SARS-CoV-2 consensus sequences and in the identification of intra-host variants. To this end, we selected five COVID-19-positive swabs representing viral loads ranging from ~500 to ~2 million, based on Ct values (determined by RT-qPCR) ranging from 15.07 to 28.5 (Table S1). For each sample, we generated three cDNAs and carried out two separate amplifications, resulting in six replicates per starting RNA (Fig. 1A). An individual KAPA library was prepared from each replicate, and sequencing in 250PE mode produced an average of 1 million fragments. The dataset was normalized to ~800,000 fragments per library, corresponding to ~7800× coverage per sample after alignment to the SARS-CoV-2 reference genome (Table S3).
The sequencing coverage was variable across the different amplicons of the ARTIC panel, particularly in samples with a higher Ct value (Fig. 2 and Fig. S1). Interestingly, most amplicons showed either high (>500×) or very low (≤10×) to zero coverage, and amplicons absent in one replicate could be present in another, even when produced from the same cDNA. The concordance (Rc) in sequencing coverage was high for replicates of four samples (Rc ~ 0.99–1) but lower in sample S5 (Rc ~ 0.95), which had the lowest viral load (Fig. 1B and Table S4). There was no significant difference between replicates from the same or different cDNAs (p = 0.25, Wilcoxon test). Variations in coverage can affect genotyping accuracy, so we evaluated reproducibility in terms of genotypability by calculating the fraction of genomic positions where it is possible to call a genotype after aligning reads to the reference genome. The genotypability Rc was at or just below 1 in most samples (Rc = 0.99–1), but lower in sample S5, which also showed the lowest sequencing coverage Rc (Fig. 1C and Table S5). Reproducibility was similar between inter-cDNA replicates and intra-cDNA replicates (p > 0.99, Wilcoxon test). To assess how fluctuations in genotypability and coverage affect the final viral genome sequences, we generated a consensus sequence for each replicate. The reproducibility among consensus variants was optimal in the first four samples, but dropped to ~0.3 for sample S5 (Fig. 1D and Table S6). Nevertheless, reproducibility was again similar between inter-cDNA replicates and intra-cDNA replicates (p > 0.99, Wilcoxon test).
The number of iSNVs (frequency > 3%) varied significantly between technical replicates, with a small fraction of iSNVs shared by different replicates compared to the total number of iSNVs identified (Table S7). The Rc was suboptimal (<0.95) for all samples and steadily decreased as the Ct value increased (Fig. 1E and Table S8), but there was no significant difference between replicates generated from the same or different cDNAs (p = 0.44, Wilcoxon test). In summary, consensus sequences and intra-host variants can be strongly affected by uneven amplicon representation and PCR errors (Fig. 2) confirming the need to sequence at least two replicates to achieve an accurate characterization of the SARS-CoV-2 genome. However, the two amplifications can be generated from the same starting cDNA, thus reducing sample consumption and costs.
3.2. Improvement of genome reconstruction by merging technical replicates
While addressing the reproducibility issues observed for both SNVs and iSNVs in samples with low viral loads, we also tested whether merging two or more technical replicates could improve coverage and genotypability. The rationale was the observation that amplicons with the lowest coverage varied across different replicates, and amplicons missing in one replicate could have a coverage >100× or > 1000× in others (Fig. S1). All possible combinations of two replicates for each sample were merged and downsampled to 800,000 fragments (400,000 for each replicate) to obtain the same sequencing input data as the initial analysis based on a single replicate (Table S9). When considering the merged datasets rather than single-replicate data, the average coverage consistently increased in the sample with the highest Ct value (p < 0.0001, Mann Whitney U test), confirming that merging two amplification replicates (intra-cDNA or inter-cDNA) could mitigate the technical variability in amplicon coverage (Fig. 3A-C) as well as significantly (p < 0.0001, Mann Whitney U test) enhance the genotypability (Fig. 3B). Merging up to six replicates achieved a slight further improvement in both coverage and genotypability (Fig. 3A-B), indicating that both properties can be maximized by analyzing replicates of samples with low viral loads. Indeed, merging all sequence data available for sample S5 (with the lowest reproducibility) increased coverage sufficiently to achieve >96.98% non-ambiguous bases in the consensus sequence (Fig. 3C-D), which is the GISAID threshold for classifying a SARS-CoV-2 genome as complete [3]. Similar improvement was achieved in a panel of 170 clinical samples analyzed in duplicate or quadruplicate (Fig. 3E-G shows three representative samples).
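The effect of merging replicates on completeness can be illustrated with a toy calculation. This sketch assumes that a position counts as non-ambiguous when it is covered by at least three reads (the consensus rule given in the methods) and applies the 96.98% threshold quoted above; the depth profiles and names are hypothetical, and the downsampling step used in the study is omitted.

```python
COMPLETE_FRACTION = 0.9698  # completeness threshold quoted in the text
MIN_DEPTH = 3               # minimum reads needed to call a consensus base

def merge_depths(*replicates):
    """Sum per-position read depth across replicates of the same sample."""
    return [sum(depths) for depths in zip(*replicates)]

def callable_fraction(depths, min_depth=MIN_DEPTH):
    """Fraction of positions with enough reads to call a base."""
    return sum(d >= min_depth for d in depths) / len(depths)

def is_complete(depths):
    """True when the callable fraction exceeds the completeness threshold."""
    return callable_fraction(depths) > COMPLETE_FRACTION

# Toy 100-position genome: each replicate loses a different amplicon.
rep_a = [0] * 10 + [500] * 90
rep_b = [500] * 50 + [0] * 10 + [500] * 40
merged = merge_depths(rep_a, rep_b)
```

Here neither replicate alone passes the threshold (90% callable each), whereas the merged profile does, because the dropouts fall in different regions.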
3.3. Improvement of the technical workflow for viral genome sequencing
One drawback of the ARTIC protocol on the Illumina platform is the need for 250PE sequencing to cover the full length of the amplicons (400 bp). This type of sequencing is currently available only for MiSeq and NovaSeq6000 SP flow cells, increasing the cost per sample and reducing the sample throughput. We therefore generated shorter libraries using a tagmentase-based approach (Illumina DNA Prep, former NexteraFlex) and tested the use of alternative flow cells (NextSeq500/550 and NovaSeq6000 SP) and sequencing mode (150PE) on the 30 samples originally tested using the KAPA library (Fig. 1A). Despite skipping the laborious input DNA and library quantification steps before sequencing, the variability in the number of fragments analyzed per sample was lower (CV = 22.5%) than the full-amplicon approach (CV = 38.3%) described above (Fig. 4A). The sequencing data were mapped to the reference genome (Table S10) and compared to the 250PE dataset (KAPA Hyper Prep library) normalized with the same average-mapped coverage as the 150PE dataset (Illumina DNA Prep library) (Table S11). Sequencing coverage was evenly distributed along the amplicons even when the Illumina protocol was used, because the partial overlap of ARTIC amplicons compensated for the expected loss of sequence representation at the amplicon ends due to tagmentation (Fig. 4B). The sequencing of fragmented amplicons had no adverse impact on genome coverage and genotypability, which were significantly higher compared to the full-length amplicon sequencing (p < 0.001 and p = 0.024, respectively, Friedman test; Fig. 4C-D). Despite the lower coverage, similar results were observed with 100PE sequencing simulated after trimming the 150PE dataset (Fig. 4C-D).
The fragmented-amplicon approach was therefore advantageous for multiple aspects of SARS-CoV-2 sequencing, by increasing coverage, genotypability and throughput (allowing higher multiplexing) while reducing sequencing costs and eliminating unnecessary protocol steps such as DNA quantification after PCR and library quantification before pooling.
Although the Illumina DNA Prep protocol saves on costs, this is offset by the requirement for multiple sequencing replicates from the same sample to improve genome coverage. We therefore compared the effect of sequencing a library generated from two replicates (each amplified from 5 μL of cDNA) and a standard library prepared from a single amplification generated from double the amount of cDNA (10 μL). Because samples with a low viral load benefit the most from multiple replicates, we analyzed 20 samples with a Ct range of 25–35 (Fig. S2A). Two samples showed a lower coverage in libraries produced from a single cDNA, but overall there was little difference in coverage (p = 0.1) or genotypability (p = 0.09) when comparing the two conditions (Wilcoxon test; Fig. S2B-C). This result confirmed that the reconstruction of SARS-CoV-2 genomes can also be maximized by increasing the amount of template cDNA through the use of more complex samples. Although such adjustments can improve coverage and genotypability, technical replicates are still required for the identification of true-positive variants.
3.4. Application of the optimized workflow to large sets of samples
Next we applied the optimized workflow to a set of 170 clinical samples representing a wide range of viral loads, with Ct values in the range 15–40 (Fig. S3). Each sample was amplified in duplicate or quadruplicate starting from 10 μL cDNA, and 100PE sequencing was carried out on a NovaSeq6000 SP flow cell using Illumina libraries, generating an average of ~2.8 million fragments per replicate (Table S12). After pooling data from the replicates, ~75% of the samples showed both coverage and genotypability >96.98% (Fig. 5A-B), which is a clear improvement over the sequencing of a single cDNA (Fig. 5C-D). Most (90.9%) of the samples that were not fully reconstructed were characterized by a low viral load (Ct > 30), but almost half (45%) of the samples in this Ct value range were nevertheless reconstructed optimally (Fig. 5E-F). In particular, five of the seven viral genomes from swabs with a Ct value ≥38 were completely reconstructed (>96.98%), indicating that the outcome is not solely determined by the viral titer in the starting material. In order to generate accurate consensus sequences, we applied the same approach used to identify true-positive iSNVs (only variants in both replicates were included in the final consensus). This approach revealed that 22 samples (12.94%), with Ct 25.9–40, would have included at least one false-positive variant in the consensus sequences based on single-cDNA analysis, but these were efficiently removed by considering the concordance between replicates (Table S13).
3.5. Impact of genome reconstruction accuracy on the evaluation of a potential re-infection case
The identification of SARS-CoV-2 genetic variants at different time points can reveal whether recurrent infections are relapses caused by the same strain or independent infections with a different strain. We therefore evaluated our optimized workflow in a case study of relapse/re-infection involving a 48-year-old female patient who was hospitalized with mild COVID-19 symptoms following a positive nasopharyngeal swab on 4/3/2020, discharged with no symptoms on 11/3/2020 followed by two consecutive negative swab tests, but readmitted with mild COVID-19 symptoms 12 days later. During the second hospital stay, the nasopharyngeal swab test results fluctuated, and the patient was finally discharged on 21/4/2020 with no symptoms and two consecutive negative molecular tests. Three swab samples (one from the first and two from the second hospitalization period) were sequenced to identify the viral strain responsible for infection (Table 1). All samples were sequenced in duplicate or quadruplicate (Table S14), and consensus variants were called in order to identify the viral strains. Depending on the replicate, some consensus variants identified in the first hospitalization period were missing or could not be genotyped in the second hospitalization period, leading to the hypothesis that different strains could be responsible for each infection (Table 1). In contrast, when merging sequencing replicates, the same variants were identified in all three samples (Table 1) and a very high-frequency (99.95%) false-positive variant could be identified at position 12,890 (Table S13). Based on this analysis, we concluded that the same viral strain was responsible for both the first and second infection, and that the latter should therefore not be classified as a re-infection.
Table 1.
Genome position | Reference allele | 9075 1.1 | 9075 1.2 | 9075 2.1 | 9075 2.2 | 9075 merged | 9076 1.1 | 9076 1.2 | 9076 merged | 9078 1.1 | 9078 1.2 | 9078 2.1 | 9078 2.2 | 9078 merged |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Hospitalization | | 1° | 1° | 1° | 1° | 1° | 2° | 2° | 2° | 2° | 2° | 2° | 2° | 2° |
Date | | 05/03/2020 | 05/03/2020 | 05/03/2020 | 05/03/2020 | 05/03/2020 | 22/03/2020 | 22/03/2020 | 22/03/2020 | 03/04/2020 | 03/04/2020 | 03/04/2020 | 03/04/2020 | 03/04/2020 |
Ct | | 27 | 27 | 27 | 27 | 27 | 34 | 34 | 34 | 35.7 | 35.7 | 35.7 | 35.7 | 35.7 |
241 | C | T | T | T | T | T | T | – | T | – | – | – | T | T |
3037 | C | T | T | T | T | T | – | – | – | – | T | – | – | T |
13,620 | C | T | T | T | T | T | T | – | T | – | – | T | T | T |
14,408 | C | T | T | T | T | T | T | T | T | – | – | – | T | T |
23,403 | A | G | G | G | G | G | G | G | G | – | – | G | – | G |
28,881 | G | A | A | A | A | A | – | A | A | – | A | – | – | A |
28,882 | G | A | A | A | A | A | – | A | A | – | A | – | – | A |
28,883 | G | C | C | C | C | C | – | C | C | – | C | – | – | C |
4. Discussion
4.1. Protocol optimization for simplicity, flexibility, throughput and cost-efficiency
Amplicon-based sequencing (originally called PrimalSeq) is the most sensitive and widely-used protocol for SARS-CoV-2 whole-genome analysis from clinical isolates, but its disadvantages include uneven amplicon coverage and poor accuracy when the viral load is low [23]. We addressed these limits by improving the accuracy and completeness of sequencing, as well as the cost-efficiency and throughput, thus achieving the highly reliable analysis of SARS-CoV-2 genomes. This benchmarking analysis established a robust workflow, ACoRE, that allowed the complete and accurate characterization of SARS-CoV-2 genomes in 170 clinical samples, including a subset (42%) with very low viral titers (Ct ≥ 30). We were also able to properly categorize an infection-relapse case study.
The protocol optimized by the ARTIC Network for SARS-CoV-2 genome sequencing utilizes a tiling primer scheme generating 400-bp viral amplicons for adaptor ligation and 250PE sequencing [33]. This limits the sequencing options on Illumina platforms because this read type is compatible only with the MiSeq v2 chemistry and NovaSeq6000 SP flow cells. To increase flexibility, we used the Illumina DNA Prep kit to prepare amplicon libraries with shorter inserts (170–200 bp) suitable for 150PE sequencing without loss of performance. This also confers the ability to pool up to 384 samples in a single run using unique dual indexes, reducing costs from €80 per sample to €3.5 on the NovaSeq6000 with S1 flow cell or €12 on the NextSeq500 with HighOutput flow cell. Even shorter sequencing reads (100PE) resulted in shorter overlap of paired ends, reducing the number of sequencing fragments required per sample and translating to even lower costs of €3 per sample. Because the Illumina tagmentase-based method does not require the quantification of starting amplicons or final sequencing libraries, this further reduces costs and processing time. Further savings could potentially be achieved by using half the volume of tagmentase reagent, but testing is required to ensure that accuracy and coverage are maintained. The generation of amplification replicates from a single starting cDNA (instead of multiple cDNAs, as recommended by the original protocol [23]) would also save time and costs, while preserving the sample for additional tests. The fragmented amplicon approach and other adjustments therefore improved protocol simplicity, flexibility, multiplexing and economy, allowing the cost-effective and timely processing of larger cohorts of samples by ACoRE.
4.2. Sequencing multiple replicates to increase accuracy and completeness
Clinical specimens with low viral loads reduce the accuracy of variant calling and the completeness of genome reconstruction, both of which are inversely correlated with the quality and quantity of starting material [23,30,43]. Current guidelines for viral genotyping recommend a lower limit of 1000 virus copies per reaction [23,43] but this would rule out a large proportion of clinical samples, including ~53% of the samples in our cohort. A Ct value of ~25 was identified as the median for virus detection in symptomatic patients, with a substantial proportion of samples (15–25%) falling above Ct 30 [25,56]. Low viral loads are often found in patients with prolonged COVID-19 infection [[57], [58], [59]], and five of six reported cases of potential re-infection involved samples with Ct values >30 [60], but whole-genome sequencing is nevertheless recommended to differentiate between relapse and new infections caused by different SARS-CoV-2 variants [60,61]. The ability to sequence SARS-CoV-2 genomes in low-titer samples is therefore necessary to track infections and correlate different strains with disease communicability, manifestation and severity.
Increasing the depth of sequencing has been proposed as a strategy to achieve complete genome reconstruction in low-titer samples, but this does not overcome limitations caused by missing amplicons [43]. Similarly, improvements in ARTIC primer design and compatibility (currently version 3) can improve genome coverage, but again cannot make up for missing amplicons [24,30]. We found that only a few specific amplicons were reproducibly suboptimal (64, 70 and 91) whereas most showed coverage variations limited to particular samples or replicates. We therefore merged the sequencing data from two or more replicates as a simple solution to enhance coverage and genotypability, achieving a more homogeneous representation of the viral genome and rescuing the suboptimal samples. The random amplification observed in low-titer samples most likely reflects the low sample complexity rather than poor assay sensitivity or performance. Accordingly, the sampled RNA and corresponding cDNA fragments before amplification are unlikely to represent the complete genome, based on our observation that the coverage achieved by sequencing two amplification replicates (each from 5 μL of cDNA) was similar to that achieved with a single amplification starting from double the amount of cDNA (10 μL). Therefore, to optimize genome reconstruction, a single large cDNA batch should be amplified in several parallel reactions, using as much sample volume as possible to increase complexity. The multiple PCR products can then be pooled before library preparation and sequenced as a single sample to avoid increasing costs.
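The effect of merging replicates on genotypability can be illustrated with a minimal Python sketch. It is not the ACoRE pipeline: the function names, the toy 10-nt genome and the per-position depth dictionaries are hypothetical, and the threshold of 8 reads is used here only as an assumed genotypability cutoff.

```python
from typing import Dict, Iterable

GENOME_LENGTH = 29_903  # SARS-CoV-2 reference genome length (nt)
MIN_DEPTH = 8           # reads needed to call a genotype (assumed cutoff)


def merge_depths(replicates: Iterable[Dict[int, int]]) -> Dict[int, int]:
    """Sum the per-position read depth across amplification replicates."""
    merged: Dict[int, int] = {}
    for depth in replicates:
        for pos, reads in depth.items():
            merged[pos] = merged.get(pos, 0) + reads
    return merged


def genotypable_fraction(
    depth: Dict[int, int],
    genome_length: int = GENOME_LENGTH,
    min_depth: int = MIN_DEPTH,
) -> float:
    """Fraction of genome positions whose depth reaches the cutoff."""
    callable_sites = sum(
        1 for pos in range(1, genome_length + 1) if depth.get(pos, 0) >= min_depth
    )
    return callable_sites / genome_length


# Toy 10-nt "genome": each replicate loses a different amplicon.
rep1 = {pos: 20 for pos in range(1, 7)}   # positions 1-6 covered
rep2 = {pos: 20 for pos in range(5, 11)}  # positions 5-10 covered
merged = merge_depths([rep1, rep2])

print(genotypable_fraction(rep1, genome_length=10))    # 0.6
print(genotypable_fraction(rep2, genome_length=10))    # 0.6
print(genotypable_fraction(merged, genome_length=10))  # 1.0
```

In this toy case each replicate alone leaves 40% of positions uncallable, whereas the merged depth covers every position, mirroring the observation that amplicon dropout is largely replicate-specific.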
It must be noted that low viral loads do not correlate linearly with poor sequencing results, because some samples with Ct > 30 showed complete genome reconstruction even when only one replicate was considered. Therefore, besides sample complexity and concentration, other factors may play a role, such as the integrity of the initial RNA sample or the presence of contaminants, whose effects may be more evident in low-concentration samples. Since these factors would affect the completeness of genome reconstruction in much the same way as low titers, the ACoRE workflow could also provide an experimental solution for highly degraded samples or specimens containing inhibitors.
In addition to improving coverage and genotypability, the analysis of at least two amplification reactions is required to achieve accurate variant calling (SNVs and iSNVs). It is well established that the analysis of viral iSNVs down to 3% frequency requires the generation of multiple replicates to distinguish true-positive iSNVs from low-frequency PCR or sequencing errors [23]. In contrast, the generation of consensus sequences for the analysis of SNVs in epidemiological studies requires the identification of the most frequent nucleotide at each position and is typically based on single replicates [12,45]. However, we discovered that consensus sequences also contain frequent SNV errors (>12% in our cohort) and the comparison of technical replicates is required to ensure accuracy. This was not confined to low-titer samples (Ct > 30) but also included some samples with moderate viral loads (Ct = 25–30), potentially leading to the submission of inaccurate consensus sequences to public repositories such as GISAID. These false-positive variants probably arose due to PCR errors because they were not found in other amplification replicates (either from the same or different cDNA). However, studies reporting SARS-CoV-2 consensus sequences thus far have not included the analysis of technical replicates, even in the case of low-titer samples (Ct > 30) [26,62]. The accuracy of SARS-CoV-2 consensus sequences deposited in GISAID has been called into question because some documented sequences contain putative errors or a significant number of variants found in only one submission (singletons) [35], and the use of stringent filters and bioinformatic tools has been proposed as a solution [62,63]. Instead, with ACoRE we propose the use of replicates as a simple experimental solution to avoid the generation of incorrect consensus sequences prior to database submission.
Since similar errors and amplification biases have been reported to limit the analysis of other viral genomes as well, such as HIV, influenza or Zika virus [23,64,65], we predict that the benefits of the ACoRE approach are not limited to SARS-CoV-2, but may extend to the NGS-based analysis of viral infections in general.
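The comparison of consensus SNVs between two technical replicates can be sketched as follows. This is an illustrative Python example under stated assumptions rather than the implementation used in this study: the function name, the three labels and the toy calls at positions 1000 and 2000 are hypothetical, and the minimum depth of 8 reads is assumed as the cutoff for deciding whether the second replicate is informative at a site.

```python
from typing import Dict, Set, Tuple

Variant = Tuple[int, str, str]  # (position, reference base, alternative base)


def classify_variants(
    calls_a: Set[Variant],
    calls_b: Set[Variant],
    depth_a: Dict[int, int],
    depth_b: Dict[int, int],
    min_depth: int = 8,  # assumed genotypability cutoff
) -> Dict[Variant, str]:
    """Label each consensus SNV by its agreement between two replicates."""
    labels: Dict[Variant, str] = {}
    for variant in sorted(calls_a | calls_b):
        in_a, in_b = variant in calls_a, variant in calls_b
        if in_a and in_b:
            labels[variant] = "confirmed"
            continue
        # Called in one replicate only: inspect coverage in the other one.
        other_depth = depth_b if in_a else depth_a
        if other_depth.get(variant[0], 0) >= min_depth:
            labels[variant] = "discordant"  # candidate PCR/sequencing error
        else:
            labels[variant] = "unresolved"  # other replicate not genotypable
    return labels


# Toy data: D614G (23403 A>G) in both replicates, two calls in replicate A only.
calls_a = {(23403, "A", "G"), (1000, "C", "T"), (2000, "G", "A")}
calls_b = {(23403, "A", "G")}
depth_a = {23403: 500, 1000: 40, 2000: 30}
depth_b = {23403: 450, 1000: 60, 2000: 3}

labels = classify_variants(calls_a, calls_b, depth_a, depth_b)
for variant, label in labels.items():
    print(variant, label)
```

Separating "discordant" from "unresolved" calls keeps a well-covered disagreement, which points to an amplification error, distinct from a site that simply dropped out in one replicate and would need a further replicate to settle.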
4.3. The assessment of re-infections
Reconstruction of highly accurate sequences from suboptimal samples was crucial to identify the correct viral strain responsible for a second hospitalization, which was hypothesized to be a re-infection. A standard workflow would have missed variants or included incorrect ones, lending support to this hypothesis, whereas ACoRE correctly recognized that the samples from the different time points contained the same viral strain.
Another interesting example that would certainly benefit from ACoRE comes from a publication reporting the first individual in North America to have symptomatic reinfection with SARS-CoV-2, for whom “…genomic analysis of SARS-CoV-2 showed genetically significant differences between each variant associated with each instance of infection…” suggesting that “…the patient was infected by SARS-CoV-2 on two separate occasions by a genetically distinct virus…” [45]. The viral load of the swab samples analyzed in that study was very low (Ct > 35), and the analysis was based on a protocol with 14–22 PCR cycles and no amplification replicates; therefore, potential false-positive variants and/or regions with low genotypability may have influenced the results. We reanalyzed the data and noted that two of the four variants specifically associated with the first infection had insufficient sequencing coverage to achieve confident variant calling in the sample from the second infection (Table S15). In particular, our bioinformatic pipeline revealed that position 539 was covered by only five reads, so a genotype could not be properly called, whereas variant 16741G → T (supported by 10 reads) was only just above the genotypability threshold of 8 (Table S15). These positions were genotyped by the bioinformatic pipeline used by the authors because its limit was set to five reads. Furthermore, variant 4113C → T showed a frequency of 67.82% in the first infection, suggesting that two viral strains were already present: a predominant strain carrying the identified variant and a less-abundant strain lacking the variant that became prevalent in the second infection (Table S15). However, the absence of replicate analysis makes it impossible to confirm this hypothesis. Similarly, although the final variant (7921A → G) was abundant, the absence of replication makes it impossible to rule out the possibility of an amplification error, as frequently observed in our low-titer samples.
These questions could be resolved by sequencing two technical replicates rather than analyzing data from one sequencing library using two different pipelines (as reported by the authors). The conclusions put forward by the authors therefore appear to be only weakly supported by the raw data, but would nevertheless have a major impact on future research by highlighting the possibility of re-infection and thus possibly questioning the efficacy of vaccines. The analysis of such critical samples would greatly benefit from the use of technical replicates, and robust evaluation is particularly important due to the ramifications of the conclusions for the global research and biomedical communities.
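The dependence of genotype calls on the minimum-depth setting, as discussed above for positions 539 and 16741, can be shown with a short Python sketch. The function name is hypothetical, and the read counts quoted in the text are treated here as the depth at each site, which is an assumption for illustration.

```python
from typing import Dict


def calls_at_threshold(site_depths: Dict[int, int], min_depth: int) -> Dict[int, bool]:
    """Return, for each position, whether its depth reaches min_depth."""
    return {pos: depth >= min_depth for pos, depth in site_depths.items()}


# Read counts quoted above for the second-infection sample (assumed to be depth).
site_depths = {539: 5, 16741: 10}

for min_depth in (5, 8):
    print(min_depth, calls_at_threshold(site_depths, min_depth))
# 5 {539: True, 16741: True}
# 8 {539: False, 16741: True}
```

With a limit of five reads both positions are genotyped, whereas a limit of 8 excludes position 539 and leaves position 16741 only marginally callable, which is why the choice of cutoff alone can change the apparent differences between two samples.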
5. Conclusions
We have optimized ACoRE, a workflow for SARS-CoV-2 sequencing, to improve flexibility and throughput, thus reducing assay time and costs and facilitating the robust analysis of suboptimal samples that would normally be excluded from sequencing even if they are central and irreplaceable specimens. The sequencing of such low-titer samples without replication risks the generation of consensus sequences containing false-positive SNVs and iSNVs, but we found that the inclusion of technical replicates improves both the accuracy and completeness of viral genome analysis. This reduces the risk of generating inaccurate and incomplete genomic sequences, favoring the submission of robust sequences to public databases and enhancing the downstream analysis of SARS-CoV-2 genotyping data.
Availability of data and materials
The raw sequencing data supporting the conclusions of this article are available at the NCBI SRA repository under BioProject ID PRJNA690890.
Funding
The work performed at IRCCS Sacro Cuore Don Calabria Hospital was supported by the Italian Ministry of Health “Fondi Ricerca corrente—L1P5”.
Declaration of Competing Interest
The authors declare that they have no competing interests.
Acknowledgements
We gratefully acknowledge the Centro Piattaforme Tecnologiche (CPT) for granting access to the genomic facility of University of Verona.
Footnotes
Supplementary data to this article can be found online at https://doi.org/10.1016/j.ygeno.2021.04.008.
References
- 1. COVID-19 Dashboard by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University (JHU) [Internet]. Available from: https://coronavirus.jhu.edu/map.html
- 2. Wu F., Zhao S., Yu B., Chen Y.M., Wang W., Song Z.G., et al. A new coronavirus associated with human respiratory disease in China. Nature. 2020;579:265–269. doi: 10.1038/s41586-020-2008-3.
- 3. GISAID Initiative [Internet]. Available from: https://www.gisaid.org/
- 4. Plante J.A., Liu Y., Liu J., Xia H., Johnson B.A., Lokugamage K.G., et al. Spike mutation D614G alters SARS-CoV-2 fitness. Nature. 2020. doi: 10.1038/s41586-020-2895-3.
- 5. Korber B., Fischer W.M., Gnanakaran S., Yoon H., Theiler J., Abfalterer W., et al. Tracking changes in SARS-CoV-2 spike: evidence that D614G increases infectivity of the COVID-19 virus. Cell. 2020;182:812–827.e19. doi: 10.1016/j.cell.2020.06.043.
- 6. van Dorp L., Acman M., Richard D., Shaw L.P., Ford C.E., Ormond L., et al. Emergence of genomic diversity and recurrent mutations in SARS-CoV-2. Infect. Genet. Evol. 2020;83:104351. doi: 10.1016/j.meegid.2020.104351.
- 7. Li Q., Wu J., Nie J., Zhang L., Hao H., Liu S., et al. The impact of mutations in SARS-CoV-2 spike on viral infectivity and antigenicity. Cell. 2020;182:1284–1294.e9. doi: 10.1016/j.cell.2020.07.012.
- 8. Yao H., Lu X., Chen Q., Xu K., Chen Y., Cheng M., et al. Patient-derived SARS-CoV-2 mutations impact viral replication dynamics and infectivity in vitro and with clinical implications in vivo. Cell Discov. 2020;6:1–16. doi: 10.1038/s41421-020-00226-1.
- 9. Rahman M.S., Islam M.R., Alam A.S.M.R.U., Islam I., Hoque M.N., Akter S., et al. Evolutionary dynamics of SARS-CoV-2 nucleocapsid protein and its consequences. J. Med. Virol. 2021;93(4):2177–2195. doi: 10.1002/jmv.26626.
- 10. Geoghegan J.L., Ren X., Storey M., Hadfield J., Jelley L., Jefferies S., et al. Genomic epidemiology reveals transmission patterns and dynamics of SARS-CoV-2 in Aotearoa New Zealand. medRxiv. 2020:2020.08.05.20168930. doi: 10.1101/2020.08.05.20168930v3.
- 11. Rockett R.J., Arnott A., Lam C., Sadsad R., Timms V., Gray K.A., et al. Revealing COVID-19 transmission in Australia by SARS-CoV-2 genome sequencing and agent-based modeling. Nat. Med. 2020;26(9):1398–1404. doi: 10.1038/s41591-020-1000-7.
- 12. Gudbjartsson D.F., Helgason A., Jonsson H., Magnusson O.T., Melsted P., Norddahl G.L., et al. Spread of SARS-CoV-2 in the Icelandic population. N. Engl. J. Med. 2020;382:2302–2315. doi: 10.1056/NEJMoa2006100.
- 13. Oude Munnink B.B., Nieuwenhuijse D.F., Stein M., O’Toole Á., Haverkate M., Mollers M., et al. Rapid SARS-CoV-2 whole-genome sequencing and analysis for informed public health decision-making in the Netherlands. Nat. Med. 2020;26:1405–1410. doi: 10.1038/s41591-020-0997-y.
- 14. Houldcroft C.J., Beale M.A., Breuer J. Clinical and biological insights from viral genome sequencing. Nat. Rev. Microbiol. 2017;15:183–192. doi: 10.1038/nrmicro.2016.182.
- 15. Gire S.K., Goba A., Andersen K.G., Sealfon R.S.G., Park D.J., Kanneh L., et al. Genomic surveillance elucidates Ebola virus origin and transmission during the 2014 outbreak. Science. 2014;345:1369–1372. doi: 10.1126/science.1259657.
- 16. McCrone J.T., Woods R.J., Martin E.T., Malosh R.E., Monto A.S., Lauring A.S. Stochastic processes constrain the within and between host evolution of influenza virus. Elife. 2018;7:1–19. doi: 10.7554/eLife.35962.
- 17. Gardy J., Loman N.J., Rambaut A. Real-time digital pathogen surveillance - the time is now. Genome Biol. 2015;16:15–17. doi: 10.1186/s13059-015-0726-x.
- 18. Park D.J., Dudas G., Wohl S., Goba A., Whitmer S.L.M., Andersen K.G., et al. Ebola virus epidemiology, transmission, and evolution during seven months in Sierra Leone. Cell. 2015;161:1516–1526. doi: 10.1016/j.cell.2015.06.007.
- 19. Dube Mandishora R.S., Gjøtterud K.S., Lagström S., Stray-Pedersen B., Duri K., Chin’ombe N., et al. Intra-host sequence variability in human papillomavirus. Papillomavirus Res. 2018;5:180–191. doi: 10.1016/j.pvr.2018.04.006.
- 20. Karamitros T., Papadopoulou G., Bousali M., Mexias A., Tsiodras S., Mentis A. SARS-CoV-2 exhibits intra-host genomic plasticity and low-frequency polymorphic quasispecies. J. Clin. Virol. 2020;131:104585. doi: 10.1016/j.jcv.2020.104585.
- 21. Shen Z., Xiao Y., Kang L., Ma W., Shi L., Zhang L., et al. 2019. Genomic Diversity of SARS-CoV-2 in COVID-19 patients; pp. 1–27.
- 22. Sashittal P., Luo Y., Peng J., El-Kebir M. 2020. Characterization of SARS-CoV-2 Viral Diversity within and across Hosts.
- 23. Grubaugh N.D., Gangavarapu K., Quick J., Matteson N.L., De Jesus J.G., Main B.J., et al. An amplicon-based sequencing framework for accurately measuring intrahost virus diversity using PrimalSeq and iVar. Genome Biol. 2019;20:1–19. doi: 10.1186/s13059-018-1618-7.
- 24. Tyson J.R., James P., Stoddart D., Sparks N., Wickenhagen A., Hall G., et al. Improvements to the ARTIC multiplex PCR method for SARS-CoV-2 genome sequencing using nanopore. bioRxiv [Internet]. 2020. doi: 10.1101/2020.09.04.283077. Available from: http://www.ncbi.nlm.nih.gov/pubmed/32908977 and http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=PMC7480024
- 25. Walsh K.A., Jordan K., Clyne B., Rohde D., Drummond L., Byrne P., et al. SARS-CoV-2 detection, viral load and infectivity over the course of an infection. J. Infect. 2020;81:357–371. doi: 10.1016/j.jinf.2020.06.067.
- 26. Lescure F.X., Bouadma L., Nguyen D., Parisey M., Wicky P.H., Behillil S., et al. Clinical and virological data of the first cases of COVID-19 in Europe: a case series. Lancet Infect. Dis. 2020;20:697–706. doi: 10.1016/S1473-3099(20)30200-0.
- 27. ARTIC Network [Internet]. Available from: https://artic.network/ncov-2019
- 28. Li C., Debruyne D., Spencer J., Kapoor V., Liu L., Zhou B., et al. 2020. Highly Sensitive and Full-Genome Interrogation of SARS-CoV-2 Using Multiplexed PCR Enrichment Followed by Next-Generation Sequencing.
- 29. Resende P.C., Motta F.C., Roy S., Appolinario L., Fabri A., Xavier J., et al. 2020. SARS-CoV-2 genomes recovered by long amplicon tiling multiplex approach using nanopore sequencing and applicable to other sequencing platforms; pp. 1–11.
- 30. Itokawa K., Sekizuka T., Hashino M., Tanaka R., Kuroda M. Disentangling primer interactions improves SARS-CoV-2 genome sequencing by multiplex tiling PCR. PLoS One. 2020;15:1–11. doi: 10.1371/journal.pone.0239403.
- 31. McNamara R.P., Caro-Vegas C., Landis J.T., Moorad R., Pluta L.J., Eason A.B., et al. High-density amplicon sequencing identifies community spread and ongoing evolution of SARS-CoV-2 in the southern United States. Cell Rep. 2020;33:108352. doi: 10.1016/j.celrep.2020.108352.
- 32. Klempt P., Brož P., Kašný M., Novotný A., Kvapilová K., Kvapil P. Performance of targeted library preparation solutions for SARS-CoV-2 whole genome analysis. Diagnostics. 2020;10:1–12. doi: 10.3390/diagnostics10100769.
- 33. Quick J., Grubaugh N.D., Pullan S.T., Claro I.M., Smith A.D., Gangavarapu K., et al. Multiplex PCR method for MinION and Illumina sequencing of Zika and other virus genomes directly from clinical samples. Nat. Protoc. 2017;12:1261–1266. doi: 10.1038/nprot.2017.066.
- 34. Turakhia Y., De Maio N., Thornlow B., Gozashti L., Lanfear R., Walker C.R., et al. Stability of SARS-CoV-2 phylogenies. PLoS Genet. 2020. doi: 10.1371/journal.pgen.1009175.
- 35. Rayko M., Komissarov A. 2020. Quality Control of Low-Frequency Variants in SARS-CoV-2 Genomes.
- 36. Mercatelli D., Giorgi F.M. Geographic and genomic distribution of SARS-CoV-2 mutations. Front. Microbiol. 2020;11:1–13. doi: 10.3389/fmicb.2020.01800.
- 37. Moreno G., Braun K., Halfmann P., Prall T., Riemersma K., Haj A., et al. 2020. Limited SARS-CoV-2 Diversity within Hosts and Following Passage in Cell Culture.
- 38. Andrés C., Garcia-Cehic D., Gregori J., Piñana M., Rodriguez-Frias F., Guerrero-Murillo M., et al. Naturally occurring SARS-CoV-2 gene deletions close to the spike S1/S2 cleavage site in the viral quasispecies of COVID19 patients. Emerg. Microb. Infect. 2020;9:1900–1911. doi: 10.1080/22221751.2020.1806735.
- 39. Liu T., Chen Z., Chen W., Chen X., Hosseini M., Yang Z., et al. A benchmarking study of SARS-CoV-2 whole-genome sequencing protocols using COVID-19 patient samples. bioRxiv. 2020:2020.11.10.375022. doi: 10.1101/2020.11.10.375022.
- 40. Doddapaneni H., Cregeen S.J., Sucgang R., Meng Q., Qin X., Avadhanula V., et al. Oligonucleotide capture sequencing of the SARS-CoV-2 genome and subgenomic fragments from COVID-19 individuals. bioRxiv. 2020:2020.07.27.223495. doi: 10.1101/2020.07.27.223495.
- 41. Lu J., du Plessis L., Liu Z., Hill V., Kang M., Lin H., et al. Genomic epidemiology of SARS-CoV-2 in Guangdong Province, China. Cell. 2020;181:997–1003.e9. doi: 10.1016/j.cell.2020.04.023.
- 42. Pillay S., Giandhari J., Tegally H., Wilkinson E., Chimukangara B., Lessells R., et al. Whole genome sequencing of SARS-CoV-2: adapting Illumina protocols for quick and accurate outbreak investigation during a pandemic. Genes (Basel). 2020;11:1–13. doi: 10.3390/genes11080949.
- 43. Kubik S., Marques A.C., Xing X., Silvery J., Bertelli C., De Maio F., et al. Guidelines for accurate genotyping of SARS-CoV-2 using amplicon-based sequencing of clinical samples. bioRxiv [Internet]. 2020:2020.12.01.405738. doi: 10.1101/2020.12.01.405738. Available from: http://biorxiv.org/content/early/2020/12/01/2020.12.01.405738.abstract
- 44. de Torres D.A., do CB Ribeiro L., de FL Riello A.P., DDG Horovitz, LFR Pinto, Croda J. Reinfection of COVID-19 after 3 months with a distinct and more aggressive clinical presentation: case report. J. Med. Virol. 2021;93(4):1857–1859. doi: 10.1002/jmv.26637.
- 45. Tillett R.L., Sevinsky J.R., Hartley P.D., Kerwin H., Crawford N., Gorzalski A., et al. Genomic evidence for reinfection with SARS-CoV-2: a case study. Lancet Infect. Dis. 2020;21:52–58. doi: 10.1016/S1473-3099(20)30764-7.
- 46. Molecular assays to diagnose COVID-19: Summary table of available protocols [Internet]. Available from: https://www.who.int/publications/m/item/molecular-assays-to-diagnose-covid-19-summary-table-of-available-protocols
- 47. CDC 2019-Novel Coronavirus (2019-nCoV) Real-Time RT-PCR Diagnostic Panel [Internet]. Available from: https://www.fda.gov/media/134922/download
- 48. Tarasov A., Vilella A.J., Cuppen E., Nijman I.J., Prins P. Sambamba: fast processing of NGS alignment formats. Bioinformatics. 2015;31:2032–2034. doi: 10.1093/bioinformatics/btv098.
- 49. Bolger A.M., Lohse M., Usadel B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics. 2014;30:2114–2120. doi: 10.1093/bioinformatics/btu170.
- 50. Li H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. 2013;00:1–3. Available from: http://arxiv.org/abs/1303.3997
- 51. Li H. A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics. 2011;27:2987–2993. doi: 10.1093/bioinformatics/btr509.
- 52. Quinlan A.R., Hall I.M. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010;26:841–842. doi: 10.1093/bioinformatics/btq033.
- 53. McKenna A., Hanna M., Banks E., Sivachenko A., Cibulskis K., Kernytsky A., et al. The genome analysis toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. [Internet]. 2010;20:1297–1303. doi: 10.1101/gr.107524.110. Available from: https://pubmed.ncbi.nlm.nih.gov/20644199
- 54. Li H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics. 2018;34:3094–3100. doi: 10.1093/bioinformatics/bty191.
- 55. Koboldt D.C., Zhang Q., Larson D.E., Shen D., McLellan M.D., Lin L., et al. VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing. Genome Res. 2012;22:568–576. doi: 10.1101/gr.129684.111.
- 56. Buchan B.W., Hoff J.S., Gmehlin C.G., Perez A., Faron M.L., Munoz-Price L.S., et al. Distribution of SARS-CoV-2 PCR cycle threshold values provide practical insight into overall and target-specific sensitivity among symptomatic patients. Am. J. Clin. Pathol. 2020;154:479–485. doi: 10.1093/ajcp/aqaa133.
- 57. Zhang R.Z., Deng W., He J., Song Y.Y., Qian C.F., Yu Q., et al. Case report: recurrence of positive SARS-CoV-2 results in patients recovered from COVID-19. Front. Med. 2020;7:1–5. doi: 10.3389/fmed.2020.585485.
- 58. Li Q., Zheng X.S., Shen X.R., Si H.R., Wang X., Wang Q., et al. Prolonged shedding of severe acute respiratory syndrome coronavirus 2 in patients with COVID-19. Emerg. Microb. Infect. 2020;9:2571–2577. doi: 10.1080/22221751.2020.1852058.
- 59. Zapor M. Persistent detection and infectious potential of SARS-CoV-2 virus in clinical specimens from COVID-19 patients. Viruses. 2020;12:1–17. doi: 10.3390/v12121384.
- 60. European Centre for Disease Prevention and Control. Threat Assessment Brief: Reinfection with SARS-CoV-2: considerations for public health response. ECDC; 2020. Available from: https://www.ecdc.europa.eu/en/publications-data/threat-assessment-brief-reinfection-sars-cov-2
- 61. Lu J., Tillett R., Long Q., Kong H., Kong H., Kong H., et al. COVID-19 reinfection: are we ready for winter? EBioMedicine. 2020;62. doi: 10.1016/j.ebiom.2020.103173.
- 62. Voloch C.M., da Silva Jr R.F., de Almeida L.G., Brustolini O.J., Cardoso C.C., Gerber A.L., et al. Intra-host evolution during SARS-CoV-2 persistent infection. medRxiv. 2020:2020.11.13.20231217. doi: 10.1101/2020.11.13.20231217.
- 63. Forster P., Forster L., Renfrew C., Forster M. Phylogenetic network analysis of SARS-CoV-2 genomes. Proc. Natl. Acad. Sci. U. S. A. 2020;117:9241–9243. doi: 10.1073/pnas.2004999117.
- 64. Illingworth C.J.R., Roy S., Beale M.A., Tutill H., Williams R., Breuer J. On the effective depth of viral sequence data. Virus Evol. 2017;3:1–9. doi: 10.1093/ve/vex030.
- 65. Zanini F., Brodin J., Albert J., Neher R.A. Error rates, PCR recombination, and sampling depth in HIV-1 whole genome deep sequencing. Virus Res. 2017;239:106–114. doi: 10.1016/j.virusres.2016.12.009.