To the Editor,
I read with great interest the recent study by Poterico and Mestanza, 1 who described mutations in 30 SARS‐CoV‐2 genomes from South American countries (Argentina, Brazil, Chile, Colombia, Ecuador, and Peru). Next‐generation sequencing (NGS) technologies have accelerated genomic and metagenomic studies, providing affordable tools to obtain pathogen genomes and improving diagnosis and surveillance efforts. 2 However, many downstream analyses performed after genome assembly are affected by low‐quality sequences and sequence contamination, which can lead to incorrect conclusions.
The authors mentioned the poor quality of some viral genomes as a limitation of their study, along with other issues such as the lack of epidemiological metadata, possible primer design variations, and the limited number of South American samples. To overcome this problem, the authors used high‐coverage complete genomes (>29 000 bp) from GISAID (https://www.gisaid.org/). In addition, they mentioned that they did not use genomes from Colombia and Ecuador in their phylogenetic analyses because of the poor quality of those sequences. Although GISAID provides information on genome length and coverage, it does not provide raw sequence reads, which are important for validating the observed mutations. As an alternative, it is possible to use another genomic database, for instance, the Sequence Read Archive (SRA) (https://www.ncbi.nlm.nih.gov/sra), which is the largest publicly available repository of NGS data. Nevertheless, most public genome sequences are stored in GISAID (25 369) rather than in SRA (4904) or GenBank (3812) (data retrieved on 15 May 2020). This could be problematic if independent research groups try to find similar mutations using public data.
I looked for South American SARS‐CoV‐2 genome sequences in the SRA database and found only three records from two countries, Brazil (SRR11365239 and SRR11365239) and Peru (SRR11508492). The raw reads from Brazil were obtained using Ion Torrent sequencing technology and did not correspond to any of the 92 Brazilian records on GISAID. In contrast, the Peruvian sequence on GISAID (EPI_ISL_415787) is identical to the GenBank record (MT263074.1). The raw reads of this assembly were obtained using Illumina technology and are stored in the SRA database (SRR11508492).
I downloaded the raw reads from the Peruvian genome (2 359 909 paired‐end reads) and trimmed them with Trimmomatic 0.39 3 using the following parameters: ILLUMINACLIP:NexteraPE-PE.fa:2:30:10:2:keepBothReads LEADING:20 TRAILING:20 SLIDINGWINDOW:4:20 MINLEN:30. I then conducted a de novo assembly using SPAdes 3.13. 4 In addition, I mapped the trimmed reads to the reference genome (NC_045512.2) using Bowtie2. 5 Then, I aligned the reference genome against the Peruvian genome and the newly obtained reassembly using MAFFT. 6 The alignment and the mapping were visualized in Geneious R7. Both the alignment and the mapping results can be obtained from FigShare. 7
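The steps described above can be sketched as a list of command lines. This is only a minimal illustration, not the exact invocation used for this letter: the accession, reference identifier, tool names, and trimming parameters are taken from the text, whereas every file name, the `trimmomatic` wrapper name (as installed by some package managers), the Bowtie2 index name, and the `--auto` option for MAFFT are assumptions. The script only builds and prints the commands; it does not run them.

```python
import shlex
from pathlib import Path

# From the letter: accession, reference identifier, trimming parameters.
ACCESSION = "SRR11508492"
REFERENCE = "NC_045512.2.fasta"  # hypothetical local file name for the reference
TRIM_STEPS = [
    "ILLUMINACLIP:NexteraPE-PE.fa:2:30:10:2:keepBothReads",
    "LEADING:20",
    "TRAILING:20",
    "SLIDINGWINDOW:4:20",
    "MINLEN:30",
]


def build_commands(accession: str = ACCESSION, reference: str = REFERENCE) -> list[list[str]]:
    """Return the trimming, assembly, mapping, and alignment commands in order."""
    # All file names below are hypothetical placeholders.
    raw1, raw2 = f"{accession}_1.fastq", f"{accession}_2.fastq"
    paired1, unpaired1 = f"{accession}_1P.fastq", f"{accession}_1U.fastq"
    paired2, unpaired2 = f"{accession}_2P.fastq", f"{accession}_2U.fastq"
    index = Path(reference).stem  # assumed Bowtie2 index prefix
    return [
        # 1. Quality and adapter trimming of the paired-end reads.
        ["trimmomatic", "PE", raw1, raw2,
         paired1, unpaired1, paired2, unpaired2, *TRIM_STEPS],
        # 2. De novo assembly of the trimmed pairs.
        ["spades.py", "-1", paired1, "-2", paired2, "-o", f"{accession}_spades"],
        # 3. Index the reference, then map the trimmed pairs to it.
        ["bowtie2-build", reference, index],
        ["bowtie2", "-x", index, "-1", paired1, "-2", paired2,
         "-S", f"{accession}.sam"],
        # 4. Align reference, published genome, and reassembly
        #    (MAFFT writes the alignment to standard output).
        ["mafft", "--auto", "genomes_to_align.fasta"],
    ]


if __name__ == "__main__":
    for command in build_commands():
        print(shlex.join(command))
```

Keeping each command as a list of arguments makes it straightforward to pass them to `subprocess.run` once the tools and input files are actually available.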
The de novo reassembly and mapped reads provided independent evidence to validate the mutations reported by Poterico and Mestanza based on the Peruvian SARS‐CoV‐2 genome. 8 First, mutation N2894D in nsp4 (Table 1 in ref. 1), corresponding to a change from A to G at nucleotide position 8945, occurs in only a few reads (4 out of 33 mapped reads) and is not included in the consensus sequence of the de novo reassembly (Figure 1A). Thus, we should be very careful in considering this mutation a real variant. Second, the authors reported a non‐synonymous mutation E1207E in the S gene (Table 1 in ref. 1). This corresponds to a change from T to C at nucleotide position 24022. Again, this mutation occurred in only 4 of 29 mapped reads and is not present in the consensus sequence (Figure 1B).
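The read counts above translate into alternative‐allele fractions of roughly 12% and 14%. The following minimal sketch makes that arithmetic explicit. The positions and read counts come from the text; the simple majority rule (more than 50% of reads) is an assumption used for illustration only and is not necessarily the rule applied by the assembler or the viewer used here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SiteSupport:
    """Read support for an alternative base at one reference position."""
    position: int
    alt_reads: int
    depth: int

    @property
    def alt_fraction(self) -> float:
        return self.alt_reads / self.depth


def in_majority_consensus(site: SiteSupport, threshold: float = 0.5) -> bool:
    # Assumed rule: the alternative base enters the consensus only if it is
    # carried by more than `threshold` of the mapped reads.
    return site.alt_fraction > threshold


# Counts reported in the letter for the two positions examined.
SITES = [
    SiteSupport(position=8945, alt_reads=4, depth=33),
    SiteSupport(position=24022, alt_reads=4, depth=29),
]

for site in SITES:
    print(
        f"position {site.position}: {site.alt_reads}/{site.depth} reads "
        f"= {site.alt_fraction:.1%}, "
        f"in consensus: {in_majority_consensus(site)}"
    )
```

Under this assumed rule, neither position would be called as a variant in the consensus, which is consistent with the reassembly described above.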
Figure 1.

A portion of the trimmed reads from the Peruvian SARS‐CoV‐2 sequence (SRR11508492) compared against the reference genome (NC_045512.2). A, Position 8945, change from A to G. B, Position 24022, change from T to G. Only a few reads showed the mutations described in ref. 1.
This evidence supports the need to use the original sequence reads to verify whether previously described mutations in SARS‐CoV‐2 genomes are accurate or are instead assembly artifacts or sequencing errors. Erroneous conclusions, such as the presence of high mutation rates, spurious evolutionary relationships among lineages, and flawed target sites for vaccines and antiviral drugs, could be drawn from problematic data and would impede the urgent development of further initiatives to respond to SARS‐CoV‐2.
Additionally, the authors performed phylogenetic analyses but did not mention whether they assessed statistical branch support, for example, with bootstrap replicates. Such results would provide a better assessment of the significance 9 of the groups or clades described in their work.
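For readers less familiar with the procedure, the resampling step behind bootstrap support can be sketched as follows. This is a generic illustration of the nonparametric bootstrap on alignment columns, not the workflow of the original study: the toy alignment, the sequence names, the seed, and the number of replicates are all hypothetical, and the tree inference and clade counting that would follow are left to dedicated phylogenetic software.

```python
import random


def bootstrap_alignment(alignment: dict[str, str], rng: random.Random) -> dict[str, str]:
    """Build one pseudo-replicate by sampling alignment columns with replacement."""
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    n_cols = lengths.pop()
    # Draw as many column indices as the alignment has columns.
    columns = [rng.randrange(n_cols) for _ in range(n_cols)]
    return {
        name: "".join(seq[i] for i in columns)
        for name, seq in alignment.items()
    }


# Hypothetical toy alignment; real input would be the aligned genomes.
TOY_ALIGNMENT = {
    "reference": "ATGCATGCAT",
    "sample_A": "ATGCGTGCAT",
    "sample_B": "ATGCATGTAT",
}

rng = random.Random(2020)  # fixed seed so the replicates are reproducible
replicates = [bootstrap_alignment(TOY_ALIGNMENT, rng) for _ in range(100)]
```

A tree would then be inferred from each pseudo‐replicate, and the support for a clade is the proportion of those trees in which the clade is recovered.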
Funding information Fondo Nacional de Desarrollo Científico y Tecnológico y de Innovación Tecnológica (Fondecyt ‐ Perú) ‐ “Proyecto de Mejoramiento y Ampliación de los Servicios del Sistema Nacional de Ciencia, Tecnología e Innovación Tecnológica”: Contract number 34–2019
REFERENCES
- 1. Poterico JA, Mestanza O. Genetic variants and source of introduction of SARS‐CoV‐2 in South America. J Med Virol. 2020. doi:10.1002/jmv.26001
- 2. Gardy JL, Loman NJ. Towards a genomics‐informed, real‐time, global pathogen surveillance system. Nat Rev Genet. 2018;19(1):9–20. doi:10.1038/nrg.2017.88
- 3. Bolger AM, Lohse M, Usadel B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics. 2014;30(15):2114–2120. doi:10.1093/bioinformatics/btu170
- 4. Bankevich A, Nurk S, Antipov D, et al. SPAdes: a new genome assembly algorithm and its applications to Single‐Cell Sequencing. J Comput Biol. 2012;19(5):455–477. doi:10.1089/cmb.2012.0021
- 5. Langmead B, Salzberg SL. Fast gapped‐read alignment with Bowtie 2. Nat Methods. 2012;9(4):357–359. doi:10.1038/nmeth.1923
- 6. Katoh K, Standley DM. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol Biol Evol. 2013;30(4):772–780. doi:10.1093/molbev/mst010
- 7. Romero PE. 2020. Supporting information for “Comment on Genetic variants and source of introduction of SARS‐CoV‐2 in South America”. figshare collection. doi:10.6084/m9.figshare.c.4981178
- 8. Padilla‐Rojas C, Lope‐Pari P, Vega‐Chozo K, et al. Near‐complete genome sequence of a 2019 Novel coronavirus (SARS‐CoV‐2) strain causing a COVID‐19 case in Peru. Microbiol Resour Announc. 2020;9(19):e00303020‐1–e00303020‐3. doi:10.1128/MRA.00303-20
- 9. Efron B, Halloran E, Holmes S. Bootstrap confidence levels for phylogenetic trees. Proc Natl Acad Sci USA. 1996;93(23):13429–13434. doi:10.1073/pnas.93.23.13429
