Abstract
There is rising global concern about the recently emerged novel coronavirus (2019‐nCoV). Full genomic sequences have been released by the worldwide scientific community in the last few weeks to understand the evolutionary origin and molecular characteristics of this virus. Taking advantage of all the genomic information currently available, we constructed a phylogenetic tree that also includes representatives of other coronaviridae, such as Bat coronavirus (BCoV) and the severe acute respiratory syndrome (SARS) virus. We confirm high sequence similarity (>99%) between all sequenced 2019‐nCoV genomes available, with the closest BCoV sequence sharing 96.2% sequence identity, confirming the notion of a zoonotic origin of 2019‐nCoV. Despite the low heterogeneity of the 2019‐nCoV genomes, we could identify at least two hypervariable genomic hotspots, one of which is responsible for a Serine/Leucine variation in the viral ORF8‐encoded protein. Finally, we perform a full proteomic comparison with other coronaviridae, identifying key amino acid differences to be considered for antiviral strategies deriving from previous anti‐coronavirus approaches.
Keywords: biostatistics & bioinformatics, CLUSTAL analysis, coronavirus, data visualization, virus classification
Highlights
56 genomic sequences from distinct 2019‐nCoV patients were analyzed, showing very high (99%) sequence similarity.
There exist few variable genomic regions within the 2019‐nCoV population. One of these affects the ORF8 locus.
The closest publicly available genomic sequences to 2019‐nCoV appear to be coronaviruses infecting bats, while SARS and MERS viruses are more distantly related.
1. INTRODUCTION
Coronaviridae (CoVs) are the largest known single‐stranded RNA viruses. 1 They have been categorized in three groups, based on phylogenetic analyses and antigenic criteria, 2 specifically: (a) alpha‐CoVs, responsible for gastrointestinal disorders in humans, dogs, pigs, and cats; (b) beta‐CoVs, including the Bat coronavirus (BCoV), the human severe acute respiratory syndrome (SARS) virus and the Middle East respiratory syndrome (MERS) virus; (c) gamma‐CoVs, which infect avian species.
Very recently, a novel beta‐CoV (2019‐nCoV) originating from the city of Wuhan, China, has been causally linked to severe respiratory infections in humans. At the time of writing, 14 441 cases of 2019‐nCoV‐related pneumonia have been reported in China, plus 118 cases from 23 other countries. There are currently 315 deaths linked to this pathogen (source: World Health Organization report, 02 February 2020). Phylogenetic relationships between Bat and Human coronaviridae have been discovered for SARS 3 and more recently also for 2019‐nCoV, 4 suggesting events of inter‐species transmission. 5
No vaccine for 2019‐nCoV has been publicly released, but a worldwide effort has arisen toward the characterization of the molecular determinants and evolutionary features of this novel virus. An initial comparison of 10 genomic sequences from 2019‐nCoV specimens has reported a low heterogeneity of this virus, with intersample sequence identity above 99.9%. 6 There are currently 54 complete 2019‐nCoV genome sequences available from the Global Initiative for Sharing all Influenza Data (Gisaid 7 ) and from GenBank, 8 plus two partial sequences obtained by the Spallanzani hospital in Rome, Italy (also from Gisaid).
In this short report, we set out to characterize the heterogeneity of all 2019‐nCoV genomes and proteomes available at the moment of the study, comparing them to other representative coronaviridae, specifically SARS, MERS, and BCoV. We will generate phylogenetic trees of the 2019‐nCoV cases and apply entropy‐based analyses of position‐wise variance and categorical principal component analysis (CATPCA) as an alternative method to estimate the sequence distance between all analyzed viruses.
2. METHODS
All genomic sequences were collected on 02 February 2020 from GenBank 8 or Gisaid. 7
Multiple sequence alignment (MSA) was performed using MUSCLE v3.8.31. 9
MSA visualization was generated via Jalview v 2.11.0. 10
Phylogenetic model generation and tree visualization were done using MEGAX v 10.1.7, 11 using the Maximum Likelihood method and Tamura‐Nei model. 12 The tree structure was validated by running the analysis on 100 bootstrapped input datasets. 13
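The bootstrap validation described above can be pictured as resampling alignment columns with replacement before each tree inference. 13 The following is a minimal Python sketch of that resampling step only; it is not the MEGA X implementation, and the function name `bootstrap_alignment` and the toy alignment are hypothetical.

```python
import random

def bootstrap_alignment(alignment, rng):
    """Return one bootstrap replicate of an alignment.

    alignment: dict mapping sequence name -> aligned sequence (equal lengths).
    Columns are sampled with replacement; rows (sequences) are kept intact.
    """
    names = list(alignment)
    length = len(alignment[names[0]])
    if any(len(alignment[n]) != length for n in names):
        raise ValueError("all aligned sequences must have the same length")
    # Draw as many column indices as the alignment has columns, with replacement.
    cols = [rng.randrange(length) for _ in range(length)]
    return {n: "".join(alignment[n][c] for c in cols) for n in names}

# Toy alignment (hypothetical data, for illustration only).
toy = {"seqA": "ACGTAC", "seqB": "ACGTTC", "seqC": "ACCTTC"}
rng = random.Random(0)  # fixed seed so the sketch is reproducible
replicates = [bootstrap_alignment(toy, rng) for _ in range(100)]
```

Each replicate would then be passed to the tree-building step, and the fraction of replicates supporting a given branch gives its bootstrap support.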
CATPCA was performed in R version 3.6.1 using the package FactoMineR. 14 Specifically, an MSA FASTA file from MUSCLE is loaded in R and converted into a categorical matrix, with genomes as rows and nucleotide coordinates as columns. Factors are defined as A, C, G, T, N, or "-" (gap), as described in Results. Then, the FactoMineR multiple correspondence analysis algorithm is run with default parameters, and custom R functions are used to plot the component values for each genome.
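The conversion of an aligned FASTA file into a categorical matrix can be sketched as follows. This is an illustrative Python sketch, not the R code used in the study; the function names, the toy input, and the choice to map U to T and any other unexpected symbol to N are assumptions made here for the example.

```python
LEVELS = ("A", "C", "G", "T", "N", "-")  # factor levels named in the text

def parse_aligned_fasta(text):
    """Parse aligned FASTA text into a dict: sequence name -> sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]  # first word of the header as identifier
            records[name] = []
        elif name is not None:
            records[name].append(line.upper())
    return {n: "".join(parts) for n, parts in records.items()}

def to_categorical_matrix(records):
    """Rows = genomes, columns = alignment coordinates, values = factor levels."""
    names = list(records)
    if len({len(records[n]) for n in names}) != 1:
        raise ValueError("sequences are not aligned to a common length")
    matrix = []
    for n in names:
        seq = records[n].replace("U", "T")  # assumption: treat RNA U as T
        # Assumption: symbols outside LEVELS (e.g. ambiguity codes) become N.
        matrix.append([ch if ch in LEVELS else "N" for ch in seq])
    return names, matrix

toy_fasta = """>g1
ACG-
TN
>g2
ACGT
TA
"""
names, matrix = to_categorical_matrix(parse_aligned_fasta(toy_fasta))
```

A matrix of this shape is what a multiple correspondence analysis routine would take as input, with one categorical variable per alignment column.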
Pairwise protein identity and coverage were calculated using BLAST protein v2.6.0 15 with BLOSUM62 matrix and default parameters. Nucleotide sequence identity and coverage were calculated using BLAST nucleotide v2.6.0. 15
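To make the two reported quantities concrete, the sketch below computes percent identity and query coverage for a pair of sequences that have already been aligned. It is a simplified Python illustration and does not reproduce BLAST, which works on local alignments and has its own conventions; the function name `identity_and_coverage`, the toy peptides, and the decision to ignore gap columns when computing identity are assumptions of this example.

```python
def identity_and_coverage(query, subject, gap="-"):
    """Percent identity and query coverage for two pre-aligned sequences.

    identity = matches / columns where both sequences have a residue
    coverage = query residues aligned to a subject residue / all query residues
    """
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have equal length")
    matches = aligned = query_residues = 0
    for q, s in zip(query, subject):
        if q != gap:
            query_residues += 1
        if q != gap and s != gap:
            aligned += 1
            if q == s:
                matches += 1
    identity = 100.0 * matches / aligned if aligned else 0.0
    coverage = 100.0 * aligned / query_residues if query_residues else 0.0
    return identity, coverage

# Toy aligned peptides (hypothetical): one mismatch, one gap in the query,
# two gaps in the subject.
ident, cov = identity_and_coverage("MKFLV-FLGI", "MKFIVLF--I")
```

In this toy case, 6 of the 7 columns with residues in both sequences match, and 7 of the 9 query residues are aligned to a subject residue.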
Prediction of structural protein disorder was performed using GLOBPLOT2, an implementation of the Russell/Linding algorithm. 16
3. RESULTS
3.1. Phylogenetic analysis
We collected 53 full genomic 2019‐nCoV sequences from the Gisaid database (Table S1), plus the GenBank‐deposited sequence from the Wuhan seafood market pneumonia virus isolate Wuhan‐Hu‐1 (NC_045512.2) and two partial sequences from Italian isolates (EPI_ISL_406959 and EPI_ISL_406960). To compare 2019‐nCoVs with closely related viral species, we obtained six sequences from distinct human SARS genomes from GenBank (the reference NC_004718.3, plus the genomes AY274119.3, GU553363.1, DQ182595.1, AY297028.1, and AY515512.1). We also obtained six BCoV genomic sequences (DQ022305.2, DQ648857.1, JX993987.1, KJ473816.1, MG772934.1, EPI_ISL_402131). Finally, as more distantly related beta‐CoVs we analyzed also MERS genomes from GenBank entries JX869059.2 and KT368829.1.
Similarly to a previous report with 10 virus specimens, 6 we detected very high conservation between the 56 analyzed 2019‐nCoV genomes, with sequence identity above 99%. We found a bat CoV genome (Gisaid EPI_ISL_402131) with 96.2% sequence identity (and query coverage above 99%) to the 2019‐nCoV reference sequence (NC_045512.2), while the previously reported closest bat CoV (bat‐SL‐CoVZC45) has a sequence similarity of 88%. 6 The reference human SARS genome (NC_004718.3) appears more distant from the reference 2019‐nCoV, with sequence identity of 80.26% and query coverage of 98%.
We aligned all the 70 coronavirus sequences using MUSCLE 9 and inferred the evolutionary relationships between these sequences with a Tamura‐Nei Maximum Likelihood estimation 12 with 100 bootstraps for model robustness estimation.
The results are shown in Figure 1 as a phylogenetic tree representation. All the human 2019‐nCoV genomes appear very similar to each other, despite the different locations of sampling. Bat coronaviridae appear to be the closest homologs. Two specific specimens gathered in 2013 and 2015 in China from the bat species Rhinolophus affinis and Rhinolophus sinicus appear to be located between the BCoV and the human 2019‐nCoV groups, supporting the notion of a zoonotic transfer from bats to humans. 4 Human SARS sequences group with BCoV sequences more distantly related to 2019‐nCoV genomes. Finally, MERS genomes are the most genetically distinct among all analyzed sequences.
A purely topological representation of a bootstrapped Maximum Likelihood tree (Figure S1) shows that 2019‐nCoV sequences are highly similar to each other, with poor support to the existence of distinct subgroups.
The global multiple sequence alignment (MSA) is available as Supporting Information File S1.
3.2. Genomic divergence from other beta‐coronaviridae
Given the high homogeneity between 2019‐nCoV genomes, we developed a novel method to classify genomic sequences, based on CATPCA. 14 Briefly, this analysis finds the eigenvectors describing the highest variance within a categorical dataset, like ours. Our dataset derived from the MUSCLE MSA of 70 genomes and comprised 32 206 positions. The categories at each coordinate could be A (Adenine), C (Cytosine), G (Guanine), T (Thymine; since this is a single‐stranded RNA virus, U [Uracil] would be more appropriate), or N (Nucleotide, uncertain location: very rare in this dataset, accounting for only nine positions, or 0.0004% of all the data).
Our analysis shows results similar to those of the phylogenetic tree representations. In Figure 2A, we show the CATPCA of the first components for all analyzed genomes. The MERS/non‐MERS grouping accounts for the largest variance, while SARS and SARS‐like BCoVs cluster together. While the 2019‐nCoV genomes constitute a tight cluster, the two bat virus sequences MG772934.1 and EPI_ISL_402131 appear to link the human 2019‐nCoV to the Bat coronaviridae.
A CATPCA restricted to the 2019‐nCoV sequences highlights some internal variability (Figure 2B), with two likely outliers identified in the genomes EPI_ISL_406862 (collected in Germany) and EPI_ISL_406592 (collected in Shenzhen, China).
3.3. Genomic variance estimation within 2019‐nCoV genomes
Although the variability within the 2019‐nCoV genomes is very low, we set out to discover possible hotspots of hypervariability within the viral population. We analyzed the approximately 30 000 nt of multiple genome alignments performed on the 54 full 2019‐nCoV genomes. Our analysis shows that these viruses have largely the same genomic arrangement as the SARS species. 17 A large gene encoding for a polyprotein (ORF1ab) at the 5′ end of the genome is followed by four major structural protein‐coding genes: S = Spike protein, E = Envelope protein, M = Membrane protein, and N = Nucleocapsid protein. There are also at least six other accessory open reading frames (ORFs) (Figure 3A).
For each position of the multialigned 54 2019‐nCoV, we calculated Shannon Entropy as a measure of the position variability. 18 Apart from the 5′ and 3′ ends, likely nonprotein coding and less homogeneous, we identified two hotspots of hypervariability at positions 8789 and 28151 (Figure 3B,C).
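The position‐wise entropy calculation can be sketched in a few lines of Python. This is a minimal illustration on toy data, not the code used in the study; the function names, the use of base‐2 logarithms, the treatment of every observed symbol (including gaps) as a category, and the 0.9 threshold are assumptions made for the example.

```python
import math
from collections import Counter

def shannon_entropy(column):
    """Shannon entropy (in bits) of the symbols observed in one alignment column."""
    counts = Counter(column)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def positional_entropy(sequences):
    """Entropy at every coordinate of a list of equal-length aligned sequences."""
    return [shannon_entropy(col) for col in zip(*sequences)]

# Toy alignment (hypothetical): column 2 is split evenly between C and T.
toy = ["ACGT", "ACGT", "ATGT", "ATGC"]
entropies = positional_entropy(toy)
# 1-based coordinates whose entropy exceeds an arbitrary illustrative threshold.
hotspots = [i + 1 for i, h in enumerate(entropies) if h > 0.9]
```

A fully conserved column has entropy 0, while a column split evenly between two nucleotides reaches 1 bit, so peaks in the entropy profile mark candidate hypervariable positions.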
Position 8789 carries either T (U) or C, and it falls within the polyprotein gene. It causes a synonymous variation in the nucleotide triplet encoding Serine 2839 (amino acid coordinates based on the reference genome NC_045512.2), so it is unlikely to introduce phenotypic differences between the different strains.
On the other hand, position 28151 falls within ORF8 and is characterized by the presence of either a C or a U. This causes a Ser/Leu change in amino acid (aa) 84, which can affect the conformation of the peptide, given that Serine is a polar amino acid, and Leucine is nonpolar. Aa84 appears to be nonconserved also across other coronaviridae (Figure 4A, black arrow).
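Whether a single‐nucleotide change is synonymous depends on the codon it falls in. The Python sketch below classifies a one‐base substitution using a subset of the standard genetic code. The codons shown are illustrative examples of a Ser/Ser and a Ser/Leu pair and are not taken from the alignment; the function name `classify_substitution` is hypothetical.

```python
# Subset of the standard genetic code (DNA alphabet), enough for this example.
CODON_TABLE = {
    "TCA": "Ser", "TCC": "Ser", "TCG": "Ser", "TCT": "Ser",
    "AGC": "Ser", "AGT": "Ser",
    "TTA": "Leu", "TTG": "Leu", "CTA": "Leu", "CTC": "Leu",
    "CTG": "Leu", "CTT": "Leu",
}

def classify_substitution(codon, offset, new_base):
    """Classify a single-base change within a codon.

    codon: three-letter codon; offset: 0-based position inside the codon;
    new_base: replacement nucleotide. Both codons must be in CODON_TABLE.
    """
    mutated = codon[:offset] + new_base + codon[offset + 1:]
    before, after = CODON_TABLE[codon], CODON_TABLE[mutated]
    kind = "synonymous" if before == after else "nonsynonymous"
    return mutated, before, after, kind

# Third-position change that leaves the amino acid unchanged (illustrative codon).
silent = classify_substitution("AGC", 2, "T")
# Second-position C -> T change that swaps Serine for Leucine (illustrative codon).
missense = classify_substitution("TCA", 1, "T")
```

The first call mirrors the kind of silent change described for the polyprotein gene, the second the kind of Ser/Leu change described for ORF8.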
We analyzed the two alternative isoforms of 2019‐nCoV ORF8 at aa84, dubbed ORF8‐L (Leucine) and ORF8‐S (Serine). Unfortunately, no crystal structures of close homologs to the ORF8 protein are available for a reliable homology modeling to measure the structural impact of this amino acid substitution. The closest 3D model to 2019‐nCoV ORF8 available in Protein Data Bank 19 is a short 22 amino acid stretch in the protein entry 6P65, with a nonsignificant E‐value of 0.848. We, therefore, employed de novo methods to infer the structural features of ORF8. One important effect we could detect, using the Russell/Linding algorithm, 16 is that Serine in ORF8‐S induces structural disorder in the C‐terminal portion of the protein, which is not predicted to be present in ORF8‐L (Figure 4B). Moreover, it did not escape our attention that ORF8‐S could theoretically generate a novel phosphorylation target for the Serine/Threonine kinases of the mammalian host. So, we searched for ORF8 homologous substrates in the Mammalia NCBI nr protein database, but could not find matches within the E‐value threshold of 1.
3.4. Protein conservation within 2019‐nCoV and between other beta‐coronaviridae
We performed a cross‐species analysis for all proteins encoded by the 2019‐nCoV and its relatives. We, therefore, analyzed the protein sequences encoded by all ORFs in these genomes:
Wuhan NC_045512.2 (GenBank reference genome for 2019‐nCoV)
BCoV bat‐SL‐CoVZXC21 MG772934.1 (Bat virus similar to 2019‐nCoV)
Bat SARS coronavirus HKU3‐1 DQ022305.2 (Bat virus more distantly related to 2019‐nCoV)
SARS NC_004718.3 (GenBank reference genome for SARS)
Our analysis shows a close homology for all proteins with Bat sequence MG772934.1 (>80% identity, with the exception of ORF10), and a more distant one with the other Bat sequence and the SARS reference. Query (2019‐nCoV) coverage was always above 99.0%. Generally, we could observe high conservation for structural proteins E, M, and N across the beta‐coronavirus family, while accessory proteins (especially ORF8) seem to have much looser evolutionary constraints (Table 1).
Table 1.
Gene | Bat MG772934.1 | Bat DQ022305.2 | SARS NC_004718.3 |
---|---|---|---|
ORF1ab (polyprotein) | 95.15% | 85.78% | 86.12% |
S (Spike) | 80.32% | 76.04% | 75.96% |
Orf3a | 92.00% | 72.99% | 72.36% |
E (Envelope) | 100% | 94.74% | 94.74% |
M (Membrane) | 98.65% | 90.99% | 90.54% |
ORF6 | 93.44% | 67.21% | 68.85% |
ORF7a | 88.43% | 88.52% | 85.25% |
ORF7b | 93.02% | 79.07% | 81.40% |
ORF8 | 94.21% | 57.02% | 30.16% |
N (Nucleocapsid) | 94.27% | 89.55% | 90.52% |
ORF10 | 73.20% | 74.23% | 72.45% |
On average, 2019‐nCoV proteins share 91.1% sequence identity with Bat virus MG772934.1, 79.7% with Bat virus DQ022305.2, and 77.1% with the SARS proteome.
For further visual confirmation, via MSA, of the conservation between SARS and 2019‐nCoV, especially in structural proteins, please refer also to Figure 4C and Figures S2 and S3. As previously observed, the N protein in 2019‐nCoV differs from the SARS ortholog in the structurally relevant amino acids 380 and 410. 4
4. DISCUSSION
Our results highlight a high level of conservation within 2019‐nCoV genomes sequenced so far, and a clear origin from other beta‐CoVs, specifically BCoVs, SARS and MERS. Our analysis confirms previous results highlighting the BCoV as a likely evolutionary link between the SARS viruses and the current epidemic 2019‐nCoV. 4 We could confirm this result both with standard phylogenetic analysis and a newly developed visualization method for genomic distances, based on CATPCA.
The similarity between 2019‐nCoV and the closest Bat relative is very high: all proteins in the coronavirus proteome (with the exception of ORF10) have identities above 80%, with full conservation of the genome length (~30 kb). We could also report the specific amino acids that changed between SARS and 2019‐nCoV, with potential implications in epitope definition and possible repurposing of anti‐SARS drugs and vaccines.
Our analysis found low variability (>99% sequence identity) within the 56 2019‐nCoV genomes available at the time of writing, with only two core positions of high variability, one a silent variant in the ORF1ab locus, and the other an amino acid polymorphism in ORF8. The mutation in ORF8 resulting in one of its two variants, ORF8‐L and ORF8‐S, is predicted to affect the structural disorder of the protein. Specifically, the amino acid region aa83‐aa89 is more likely to be disordered in the ORF8‐S isoform.
In conclusion, our analysis confirms low variability within the sequenced specimens of the new epidemic virus 2019‐nCoV, while highlighting at least two nucleotide positions of higher variability within protein‐coding regions, and specific amino acid divergences compared to BCoVs and SARS. 4 These findings shed a cautiously optimistic light on the possibility of finding effective treatment for this novel coronavirus, starting from already existing anti‐beta‐coronaviridae compounds, 20 which will be dealing with a relatively homogeneous viral population.
Supporting information
ACKNOWLEDGMENTS
We would like to thank the Italian Ministry of University and Research for funding. Also, we would like to acknowledge the fruitful discussions with our colleagues Daniele Mercatelli, Simone Di Giacomo and Giorgio Milazzo. Also, a big acknowledgment to Eleonora Fornasari for help with graphics.
Ceraolo C, Giorgi FM. Genomic variance of the 2019‐nCoV coronavirus. J Med Virol. 2020;92:522–528. 10.1002/jmv.25700
REFERENCES
1. Cui J, Li F, Shi Z‐L. Origin and evolution of pathogenic coronaviruses. Nat Rev Microbiol. 2019;17(3):181‐192. 10.1038/s41579-018-0118-9
2. Schoeman D, Fielding BC. Coronavirus envelope protein: current knowledge. Virol J. 2019;16(1):69. 10.1186/s12985-019-1182-0
3. Hu B, Ge X, Wang L‐F, Shi Z. Bat origin of human coronaviruses. Virol J. 2015;12(1):221. 10.1186/s12985-015-0422-1
4. Benvenuto D, Giovannetti M, Ciccozzi A, Spoto S, Angeletti S, Ciccozzi M. The 2019‐new coronavirus epidemic: evidence for virus evolution. J Med Virol. 2020. 10.1002/jmv.25688
5. Lu H, Stratton CW, Tang Y. Outbreak of pneumonia of unknown etiology in Wuhan China: the mystery and the miracle. J Med Virol. 2020. 10.1002/jmv.25678
6. Lu R, Zhao X, Li J, et al. Genomic characterisation and epidemiology of 2019 novel coronavirus: implications for virus origins and receptor binding. The Lancet. 2020:S0140673620302518. 10.1016/S0140-6736(20)30251-8
7. Shu Y, McCauley J. GISAID: global initiative on sharing all influenza data - from vision to reality. Euro Surveill. 2017;22(13):1‐3. 10.2807/1560-7917.ES.2017.22.13.30494
8. Sayers EW, Cavanaugh M, Clark K, Ostell J, Pruitt KD, Karsch‐Mizrachi I. GenBank. Nucleic Acids Res. 2019;47(D1):D94‐D99. 10.1093/nar/gky989
9. Edgar RC. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 2004;32(5):1792‐1797. 10.1093/nar/gkh340
10. Waterhouse AM, Procter JB, Martin DMA, Clamp M, Barton GJ. Jalview Version 2 - a multiple sequence alignment editor and analysis workbench. Bioinformatics. 2009;25(9):1189‐1191. 10.1093/bioinformatics/btp033
11. Kumar S, Stecher G, Li M, Knyaz C, Tamura K. MEGA X: molecular evolutionary genetics analysis across computing platforms. Mol Biol Evol. 2018;35(6):1547‐1549. 10.1093/molbev/msy096
12. Tamura K, Nei M. Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees. Mol Biol Evol. 1993;10(3):512‐526. 10.1093/oxfordjournals.molbev.a040023
13. Felsenstein J. Confidence limits on phylogenies: an approach using the bootstrap. Evolution. 1985;39(4):783‐791. 10.1111/j.1558-5646.1985.tb00420.x
14. Lê S, Josse J, Husson F. FactoMineR: an R package for multivariate analysis. J Stat Softw. 2008;25(1):1‐18.
15. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol. 1990;215(3):403‐410. 10.1016/S0022-2836(05)80360-2
16. Linding R. GlobPlot: exploring protein sequences for globularity and disorder. Nucleic Acids Res. 2003;31(13):3701‐3708. 10.1093/nar/gkg519
17. Narayanan K, Huang C, Makino S. SARS coronavirus accessory proteins. Virus Res. 2008;133(1):113‐121. 10.1016/j.virusres.2007.10.009
18. Manaresi E, Conti I, Bua G, Bonvicini F, Gallinella G. A Parvovirus B19 synthetic genome: sequence features and functional competence. Virology. 2017;508:54‐62. 10.1016/j.virol.2017.05.006
19. Burley SK, Berman HM, Kleywegt GJ, Markley JL, Nakamura H, Velankar S. Protein Data Bank (PDB): the single global macromolecular structure archive. In: Wlodawer A, Dauter Z, Jaskolski M, eds. Protein Crystallography. Vol 1607. New York, NY: Springer New York; 2017:627‐641.
20. Anand K. Coronavirus main proteinase (3CLpro) structure: basis for design of anti‐SARS drugs. Science. 2003;300(5626):1763‐1767. 10.1126/science.1085658