2020 Nov 21;129:104131. doi: 10.1016/j.compbiomed.2020.104131

DBCOVP: A database of coronavirus virulent glycoproteins

Susrita Sahoo a, Soumya Ranjan Mahapatra a, Bikram Kumar Parida c, Satyajit Rath c, Budheswar Dehury a, Vishakha Raina a, Nirmal Kumar Mohakud d, Namrata Misra a,b, Mrutyunjay Suar a,b
PMCID: PMC7679231  PMID: 33276297

Abstract

Since the emergence of SARS-CoV-1 (2002), novel coronaviruses have emerged periodically, such as MERS-CoV (2012) and now SARS-CoV-2, whose outbreak has posed a global threat to public health. Although this is the third zoonotic coronavirus outbreak within the last two decades, only a few platforms provide information about coronavirus genomes. None of them is specific to the virulence glycoproteins, and complete sequence-structural features of these virulence factors across the betacoronavirus family, including SARS-CoV-2 strains, are lacking. Against this backdrop, we present DBCOVP (http://covp.immt.res.in/), the first manually curated, web-based resource to provide extensive information on the complete repertoire of structural virulent glycoproteins from coronavirus genomes belonging to the betacoronavirus genus. The database provides various sequence-structural properties, which users can browse and analyze in different ways. Furthermore, many conserved T-cell and B-cell epitopes predicted for each protein are included, which may play a significant role in eliciting humoral and cellular immune responses. The tertiary structure of the epitopes, together with the docked epitope-HLA binding complex, is made available to facilitate further analysis. DBCOVP presents an easy-to-use interface with in-built tools for similarity search, cross-genome comparison, phylogenetic analysis, and multiple sequence alignment. DBCOVP will be an important resource for experimental biologists engaged in coronavirus research and will aid vaccine development.

Keywords: Coronavirus, Glycoproteins, Database, Bioinformatics, Immunoinformatics, COVID-19

Highlights

  • DBCOVP is the first manually curated resource to provide information on the entire repertoire of structural glycoproteins of betacoronaviruses.

  • The database provides complete functional annotation of the proteins highlighting fourteen sequence-structural properties.

  • Immunoinformatics data on potent T-cell & B-cell epitopes for each protein along with the population coverage analysis are exclusively included.

  • In-built tools to perform similarity search, cross genome comparison, phylogenetic analysis and multiple sequence alignment are included.

  • DBCOVP will become a valuable resource for experimental biologists engaged in coronavirus research and will aid vaccine development.

1. Introduction

Coronaviruses, which belong to the Coronaviridae family, are the causative agents of neurologic, enteric, hepatic, and upper respiratory tract diseases in a wide range of hosts including humans, cattle, camels, swine, bats, cats, dogs, rabbits, snakes, and several other wild animal and avian host species [1]. The genome is a single positive-stranded RNA, 26 to 32 kilobases in length, with G + C content varying from 32 to 43% [1,2]. Among the coronaviruses that infect humans, the majority are associated with mild clinical symptoms, unlike the Severe Acute Respiratory Syndrome (SARS) coronavirus (SARS-CoV-1) and the Middle East Respiratory Syndrome (MERS) coronavirus (MERS-CoV) [3], which cause high morbidity and mortality in human populations. SARS-CoV-1 was first reported in November 2002 in Guangdong, Southern China, and resulted in around 8000 human infections with 744 deaths, a mortality rate of around 9.5% [4,5]. Later, a similar epidemic outbreak (MERS-CoV) was first detected in Saudi Arabia in September 2012 and resulted in a higher mortality rate [6,7,8]. Recently, in late December 2019, patients with viral pneumonia symptoms of unidentified etiology were first reported in Wuhan City, China [9]. A novel coronavirus, provisionally named 2019-nCoV and later renamed SARS-CoV-2, was identified as the causative pathogen, and the outbreak was declared a Public Health Emergency of International Concern by the World Health Organization (WHO) on 30 January 2020 [9,10]. As of 1st August 2020, the virus has spread worldwide, affecting 213 countries with more than 6 million cases of infected patients. According to comparative genomic analysis, SARS-CoV-2 shares 79.5% nucleotide identity with SARS-CoV-1 and 96% identity with bat-CoV-RaTG13. Therefore, SARS-CoV-2 is considered a SARS-related coronavirus, with bats as the most probable source of infection [11].
SARS-CoV-2, SARS-CoV-1, and MERS-CoV show several similarities in clinical presentation, with pneumonia-like symptoms, evidence of zoonotic transmission as the route of disease origin, and human-to-human transmission [12]. Furthermore, all three coronaviruses belong to the genus betacoronavirus, which is further classified into five subgenera, namely Sarbecovirus, Embecovirus, Hibecovirus, Merbecovirus, and Nobecovirus. SARS-CoV-2 and SARS-CoV-1 belong to the Sarbecovirus subgenus [9]. Despite the great threat to public health around the world and the global concern to combat the spread of the ongoing outbreak, to date there are no clinically approved vaccines available for SARS-CoV-2, SARS, or MERS; further research is therefore imperative to identify appropriate therapeutic targets for the development of safe, stable vaccines against human coronavirus infections [12,13].

Advances in molecular biology and the use of bioinformatics resources, particularly the immunoinformatics approach, have resulted in a deluge of genomic data that can provide prior information on the efficacy of potential vaccine targets worthy of subsequent validation through wet-lab experiments, thus saving considerable time and effort in the vaccine discovery process [13,14]. The prediction and characterization of immunogenic epitopes that can induce antibody production from B-cells, and cellular response and cytokine secretion from T-cells, is a critical step in the in silico identification and assessment of potential vaccine targets. The epitope-driven vaccine concept has already been successfully employed against many infectious diseases in recent years [15,16,17]. As the first step in this direction, it is essential to find proteins that play a definite role in the pathogenesis of a virus. The first requirement of any viral infection is to bind a receptor on the host cell surface, which enables entry of the virus into the host cell. In most cases, glycoproteins are involved in host binding and subsequent virus-host membrane fusion to establish the pathogenesis of the virus [18]. The four important glycoproteins that contribute most to the structure of all coronaviruses are the spike protein (S), small envelope protein (E), membrane protein (M), and nucleocapsid (N) protein [13]. The S protein mediates receptor binding and membrane fusion and is vital for determining host tropism and transmission capacity [19,20,21]. Mutations in the gene encoding the spike protein have resulted in altered pathogenesis and virulence in other coronaviruses [22].
It is believed that three molecules of the spike protein form the characteristic ‘spikes’ or crown-like appearance specific to this virus family [13]. The majority of candidate vaccines being developed against coronaviruses target the spike protein, as it is the major inducer of neutralizing antibodies [23,24]. The association of the spike with the membrane protein is crucial for the formation of the viral envelope and the accumulation of both glycoproteins at the site of virus assembly [22]. The gene encoding the nucleocapsid protein in the SARS-CoV-1 virus is believed to possess a novel nuclear function, which could play a role in pathogenesis. Additionally, the basic nature of this protein implies that it may assist in RNA binding [22,23]. Lastly, the envelope protein has been shown to play an important role in the assembly of the virion and its replication [25,26]. These structural proteins have diverse functional roles in viral pathogenesis; therefore, a dedicated database on all four of these major structural glycoproteins will provide the scientific community with a timely and valuable source of detailed sequence-structural properties of these virulence factors and will aid the development of vaccines against coronaviruses.

Despite the repeated emergence and re-emergence of deadly coronaviruses over the last two decades, to date there are only a few dedicated web resources exclusively available to study coronavirus genes and proteins. For instance, the Comprehensive Database for Comparative Analysis of Coronavirus Genes and Genomes (CoVDB) performs fast and precise batch sequence retrieval, the basis for comparative gene or genome analysis [27]. CoVDB has not been updated since 2007 and provides limited annotation features including cleavage sites, genome information, tandem repeat sequences, transcription regulatory sequences, and RNA structures. The Virus Pathogen Database and Analysis Resource (VipR) covers a wide range of human pathogenic viruses and includes sequence records, a few genome and protein annotations, tertiary protein structures, immune epitopes, surveillance data, and clinical metadata derived from comparative genomics analysis [28]. Although very useful, VipR does not hold any information specific to the virulence glycoproteins and lacks details on secondary structure properties, subcellular location, molecular function, biological process, domain, cluster, superfamily, physicochemical properties, epitope conservancy, allergenicity, antigenicity, toxicity, 3D epitope structure, and population coverage analysis. Similarly, ViralZone (https://viralzone.expasy.org/), a web resource for viral genera and families hosted by the Swiss Institute of Bioinformatics, provides general molecular and epidemiological information on viruses [29]. Since March 2020, ViralZone has held Covid-19 genome expression details, protein sequence records, host-virus interactions, and general information on coronaviruses belonging to the betacoronavirus genus. A GenBank submission tool, Viral Annotation DefineR (VADR, https://github.com/nawrockie/vadr), was specifically designed to validate and annotate viral sequences.
VADR has been used to check norovirus (May 2018), dengue virus (January 2019), and SARS-CoV-2 (March 2020) sequence submissions [30]. Likewise, the Viral Bioinformatics Resource Center (VBRC, https://4virology.net/), funded by the National Institute of Allergy and Infectious Diseases, holds curated viral genomes (belonging to the families Coronaviridae, Asfarviridae, and Poxviridae) and a range of bioinformatics tools for genome analysis [31]. Presently, VBRC redirects to various SARS-CoV-2-specific resources, viz., genomes, scientific literature, Worldometers, case trackers, and COVID-19-specific news. Rfam, developed earlier, is an online resource providing access to families of structural RNAs, where each family is characterized by a covariance model and a multiple sequence alignment [32]. Its current special release, Rfam 14.2, includes details on untranslated regions of all five families of coronavirus. The recently released CORona Drug InTEractions database (CORDITE, https://cordite.mathematik.unimarburg.de/#/) collects and aggregates details of in vitro, computational, or case analyses of promising drugs for COVID-19 from PubMed (https://www.ncbi.nlm.nih.gov/pubmed/), chemRxiv (https://www.chemrxiv.org/), bioRxiv (https://www.biorxiv.org/), and medRxiv (https://www.medrxiv.org/) to support further meta-analyses and new clinical trials [33]. To find putative drug targets and further explore the molecular mechanisms of pathogenicity, Sadegh et al. developed CoronaVirus Explorer (CoVeX, https://exbio.wzw.tum.de/covex/), which includes information on drug candidates and experimentally validated virus-human interaction data for both SARS-CoV-2 and SARS-CoV-1 with the human interactome [34].
Apart from the above-mentioned web resources, a few other platforms focused exclusively on coronavirus research have recently been made available, such as Coronavirus Database V3 (http://covdb.popgenetics.net/v3/ [35]), which contains only genomic data; COVID-Profiler (http://genomics.lshtm.ac.uk/), which analyses SARS-CoV-2 sequencing and a few immunological data; COVIEdb (http://biopharm.zju.edu.cn/coviedb/help/ [36]), which holds only some potential B/T cell epitopes for SARS-CoV-2, RaTG13-CoV, SARS-CoV, and MERS-CoV; CoVIDep (https://covidep.ust.hk/ [37]), which consists of genetic data for SARS-CoV-2 and immunological data for the 2003 SARS virus, used to identify B-cell and T-cell epitopes; CoV3D (https://cov3d.ibbr.umd.edu/cov3d), which contains structures of SARS-CoV-2, SARS-CoV, and MERS-CoV proteins without any other structural details; and CoronaVIR (https://webs.iiitd.edu.in/raghava/coronavir/index.html [38]), which contains some genomic, proteomic, diagnostic, and therapeutic knowledge about novel SARS-CoV-2 coronaviruses. Moreover, none of them is specific to the virulence proteins encompassing the spike protein, small envelope protein, membrane protein, and nucleocapsid protein, and an in-depth investigation of the complete sequence-structural features of these virulence factors across the betacoronavirus family, including the newly identified SARS-CoV-2 strains, is lacking. Although sequencing efforts have resulted in a marked increase in SARS-CoV-2 sequence data, functional annotation of the encoded proteins in primary databases such as GenBank and the UniProt Knowledgebase remains limited. To address this issue, we developed a specifically designed web-accessible resource, DBCOVP (http://covp.immt.res.in/), to integrate in-depth functional annotation of coronavirus virulence glycoproteins (Fig. 1).
DBCOVP is the first manually curated data repository that provides comprehensive details on the entire repertoire of structural glycoproteins from coronavirus genomes of the betacoronavirus genus, including the SARS-CoV-1, MERS-CoV, and SARS-CoV-2 strains. The database provides complete functional annotation of the proteins, highlighting fourteen sequence-structural properties. A comparative overview of DBCOVP and other platforms is presented in Table 1.

Fig. 1.

Schematic representation of the complete protocol employed for the identification of promiscuous epitope-based vaccine candidates present in DBCOVP.

Table 1.

Comparison of DBCOVP with the existing coronavirus web repositories.

| Resource | URL | Specificity |
|---|---|---|
| Covdb | http://covdb.microbiology.hku.hk | Includes annotated coronavirus genes and genomes belonging to six coronavirus species |
| VipR | https://www.viprbrc.org/brc/home.spg?Decorator=vipr | Contains information for human pathogenic viruses |
| CoronaVIR | https://webs.iiitd.edu.in/raghava/coronavir/ | Contains genomic, proteomic, diagnostic and therapeutic knowledge about novel SARS-CoV-2 coronaviruses |
| Covdb (Coronavirus Database V3) | http://covdb.popgenetics.net/v3/index | Contains coronavirus genomic data belonging to 32 organisms |
| COVID-Profiler | http://genomics.lshtm.ac.uk/ | Allows analysis of SARS-CoV-2 sequencing and immunological data |
| Coviedb | http://biopharm.zju.edu.cn/coviedb/ | Potential B/T cell epitopes for SARS-CoV-2, RaTG13-CoV, SARS-CoV, and MERS-CoV |
| Covidep | https://covidep.ust.hk/ | Consists of genetic data for SARS-CoV-2 and immunological data for the 2003 SARS virus, to identify B-cell and T-cell epitopes |
| Cov3d | https://cov3d.ibbr.umd.edu/cov3d | Includes structures of SARS-CoV-2, SARS-CoV, and MERS-CoV proteins |
| ViralZone | https://viralzone.expasy.org/ | Contains Covid-19 genome expression details, protein sequence records, host-virus interaction, and general information on coronaviruses belonging to betacoronavirus genera |
| VADR | https://github.com/nawrockie/vadr | Designed to validate and annotate viral sequences |
| VBRC | https://4virology.net/ | Validates and annotates viral sequences in GenBank submissions |
| Rfam | https://rfam.org/covid-19 | Provides access to families of structural RNAs |
| CORDITE | https://cordite.mathematik.unimarburg.de/#/ | Collects and aggregates details on in vitro, computational, or case analyses on promising drugs for COVID-19 from PubMed, bioRxiv, medRxiv |
| CoVex | https://exbio.wzw.tum.de/covex/ | Contains information on drug candidates and experimentally validated virus-human interaction data for both SARS-CoV-2 and SARS-CoV-1 with human interactome |
| DBCOVP | http://covp.immt.res.in/ | The only database of structural glycoproteins from coronavirus genomes belonging to 137 strains from betacoronavirus genera |

| Feature | Covdb | VipR | CoronaVIR | Covdb V3 | COVID-Profiler | Coviedb | Covidep | Cov3d | ViralZone | VADR | VBRC | Rfam | CORDITE | CoVex | DBCOVP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Strain Information | | | | | | | | | | | | | | | |
| Description, Isolation Source, Collection Date, Host, Country | × | Available | × | Available | × | × | × | × | × | × | × | × | × | × | Available |
| Transmission, Epidemiology, Clinical symptoms | × | Available | × | × | × | × | × | × | × | × | × | × | × | × | Available |
| Associated Glycoprotein | × | × | Available | × | × | × | × | × | × | × | × | × | × | × | Available |
| Tools | | | | | | | | | | | | | | | |
| Search and Advanced Search | × | Available | × | × | × | × | × | × | × | × | × | × | × | × | Available |
| BLAST | Available | Available | × | × | × | × | × | × | × | × | × | × | × | × | Available |
| Phylogeny | × | Available | × | Available | × | × | × | × | × | × | × | × | × | × | Available |
| Compare | × | × | × | × | × | × | × | × | × | × | × | × | × | × | Available |
| MSA | × | Available | × | Available | × | × | × | × | × | × | × | × | × | × | Available |
| Covid-19 Tracker | × | × | × | × | × | × | × | × | × | × | × | × | × | × | Available |
| Proteins Details | | | | | | | | | | | | | | | |
| Taxonomic lineage | × | Available | × | × | × | × | × | × | × | × | × | × | × | × | Available |
| Subcellular location | × | × | × | Available | × | × | × | × | × | × | × | × | × | × | Available |
| Molecular Function | × | × | × | × | × | × | × | × | × | × | × | × | × | × | Available |
| Biological Process | × | × | × | × | × | × | × | × | × | × | × | × | × | × | Available |
| Domain | × | Available | × | × | × | × | × | × | × | × | × | × | × | × | Available |
| Cluster | × | × | × | × | × | × | × | × | × | × | × | × | × | × | Available |
| Super family | × | × | × | × | × | × | × | × | × | × | × | × | × | × | Available |
| Protein FASTA Sequence | × | Available | × | × | × | × | × | × | × | × | × | × | × | × | Available |
| Secondary Structure details | × | × | × | × | × | × | × | × | × | × | × | × | × | × | Available |
| Disordered Region | × | × | × | × | × | × | × | × | × | × | × | × | × | × | Available |
| Disulfide Bond | × | × | × | × | × | × | × | × | × | × | × | × | × | × | Available |
| Transmembrane Helix | × | Available | × | Available | × | × | × | × | × | × | × | × | × | × | Available |
| Ubiquitination Site | × | × | × | × | × | × | × | × | × | × | × | × | × | × | Available |
| Proteinase Cleavage Sites | Available | × | × | × | × | × | × | × | × | × | × | × | × | × | Available |
| Internal repeats | Available | × | × | × | × | × | × | × | × | × | × | × | × | × | Available |
| 3D protein structure | × | Available | 12 protein structures | × | Available | × | × | Available | × | × | × | × | × | × | Available |
| Physicochemical properties | × | × | × | × | × | × | × | × | × | × | × | × | × | × | Available |
| Epitope Details (MHC-I; MHC-II & B-Cell Epitope) | × | Available | Available | × | × | × | Available | × | × | × | × | × | × | × | Available |
| Associated alleles | × | Available | × | × | × | × | × | × | × | × | × | × | × | × | Available |
| Epitope Conservancy | × | × | × | × | × | × | × | × | × | × | × | × | × | × | Available |
| Allergenicity | × | × | × | × | × | × | × | × | × | × | × | × | × | × | Available |
| Antigenicity | × | × | × | × | × | × | × | × | × | × | × | × | × | × | Available |
| Toxicity | × | × | × | × | × | × | × | × | × | × | × | × | × | × | Available |
| Hydropathicity | × | × | × | × | × | × | × | × | × | × | × | × | × | × | Available |
| Hydrophilicity | × | × | × | × | × | × | × | × | × | × | × | × | × | × | Available |
| Charge | × | × | × | × | × | × | × | × | × | × | × | × | × | × | Available |
| Molecular Weight | × | × | × | × | × | × | × | × | × | × | × | × | × | × | Available |
| 3D epitope structure | × | × | × | × | × | × | × | × | × | × | × | × | × | × | Available |
| Population Coverage | × | × | × | × | × | × | × | × | × | × | × | × | × | × | Available |
| Links to external database | Available | Available | Available | × | × | × | × | × | Available | Available | Available | Available | Available | × | Available |

Furthermore, computational identification of antigenic epitopes requires a combination of several different tools and is a time-consuming and complex process. Therefore, to enable researchers to better understand the immunological properties and identify suitable vaccine candidates in coronaviruses, we have mapped the potential conserved T-cell and B-cell epitopes on all the antigenic protein sequences, along with information on epitope conservancy, potential immunogenicity, allergenicity, toxicity, and antigenicity. Since HLA allele distribution differs among geographic regions and ethnic groups around the world, population coverage analysis is an important factor in vaccine development. Thus, the cumulative percentage of population coverage across the world was estimated for the predicted epitopes, and these results are freely available in the database. Besides, we determined the 3D structure of the epitopes and their binding interaction with HLA molecules using in silico docking techniques. To our knowledge, DBCOVP is the first database with a special focus on SARS and MERS betacoronavirus virulence proteins, containing detailed physicochemical and structural information on the spike, envelope, membrane, and nucleocapsid protein sequences derived from 137 strains belonging to diverse host organisms. Most importantly, it is the only database to provide computed, high-confidence, complete immunological data on the coronavirus antigenic proteins in one platform. The annotation data were manually curated from public databases and published literature and were also computationally predicted using various bioinformatics tools and databases for complete functional annotation of each protein.
Additionally, to facilitate further comparative data analysis, DBCOVP supports multiple search and browsing options, with integrated tools for multiple sequence alignment, phylogenetic tree construction, local BLAST alignment search, and an in-house compare tool for comparative genomic analysis. To promote its usability, an ‘Exclusive Entries for COVID-19’ section has been included, which consists of proteomic, genomic, and immunoinformatics details of virulent glycoproteins specific to SARS-CoV-2. Moreover, DBCOVP maintains a ‘Data Submission Form’ that enables users to submit a protein sequence in FASTA format for sequence-structure analysis. With the rapidly increasing global demand for a vaccine against SARS-CoV-2, this database will act as a one-stop resource for virologists and vaccinologists for understanding the pathogenesis of this epidemic disease and for accelerating rational vaccine design through subsequent in vitro and in vivo experimental validation of the identified promiscuous vaccine targets.

2. Database contents and web interface

Currently, DBCOVP contains 185 protein sequences, including spike (47), envelope (43), membrane (46), and nucleocapsid (49) proteins, from 137 strains originating from eight host species (human, bat, murine, bovine, rat, rabbit, equine, hedgehog) across all five subgenera of betacoronavirus, viz., Sarbecovirus, Embecovirus, Hibecovirus, Merbecovirus, and Nobecovirus. Sequences were collected from the National Center for Biotechnology Information (NCBI; https://www.ncbi.nlm.nih.gov/) and the UniProt Knowledgebase (UniProtKB; https://www.uniprot.org/).

All backend data are organized into a set of relational tables in a SQL Server database. Stored procedures were implemented to improve the scalability and efficiency of the database. The graphical interface was developed using HTML5.0, ASP.NET (C#), CSS 3.0, JavaScript, and AJAX to provide a rich user experience. DBCOVP provides user-friendly browsing, searching, and data download functionalities, which are highly interactive to facilitate data extraction for each coronavirus virulence protein. The “Search” option on the homepage enables users to search by a variety of keywords, including host species, proteins, and subgenus. An “Advanced” search is also provided on the search page for more specific requirements, where users can obtain the desired information by entering multiple combinations of keywords (Fig. 2a), with AJAX-driven auto-suggestions. The database can be browsed via the ‘Browse’ tab, from either the navigation menu or the home page, where options such as Browse by Host species, Proteins, and Epitopes are available (Fig. 2b). Selecting any strain in the list brings up the corresponding strain details page, containing information on taxonomic lineage, genome size, etc., as shown in Fig. 2c. The virus strains and the encoded protein sequences in the database are also classified, based on sequence similarity and phylogeny, into the five subgenera of Betacoronavirus (Sarbecovirus, Embecovirus, Hibecovirus, Merbecovirus, and Nobecovirus), which can be viewed by clicking the Browse by “Subgenus” option (Fig. 2d). To view detailed information on any protein in the search results, users can click the corresponding UniProt ID or the ‘READ MORE’ hyperlink, which links to the detailed annotation page of that protein (Fig. 2e).
Also, users can search for both T-cell and B-cell epitopes present in the database by first selecting the epitope type, then one or more proteins from the four categories (spike, envelope, membrane, and nucleocapsid), and finally the host strain. The resulting page contains detailed immunogenic information for the desired protein sequence, as shown in Fig. 3. Furthermore, as a publicly released scientific database, the full dataset of DBCOVP is available for batch download in several forms, including FASTA sequences and tabular (Excel) files.
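The multi-keyword search described above can be illustrated with a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for the SQL Server backend; the table name, column names, accessions, and the `advanced_search` helper are all hypothetical and are not taken from the DBCOVP implementation.

```python
import sqlite3

# Hypothetical schema and demo rows; names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE protein (
           accession    TEXT PRIMARY KEY,
           protein_type TEXT,
           host         TEXT,
           subgenus     TEXT
       )"""
)
conn.executemany(
    "INSERT INTO protein VALUES (?, ?, ?, ?)",
    [
        ("DEMO_S1", "spike", "human", "Sarbecovirus"),
        ("DEMO_E1", "envelope", "human", "Sarbecovirus"),
        ("DEMO_S2", "spike", "bat", "Merbecovirus"),
        ("DEMO_N1", "nucleocapsid", "bovine", "Embecovirus"),
    ],
)


def advanced_search(conn, **criteria):
    """Return accessions that match every supplied keyword (AND semantics)."""
    allowed = {"protein_type", "host", "subgenus"}
    unknown = set(criteria) - allowed
    if unknown:
        raise ValueError(f"unsupported search fields: {sorted(unknown)}")
    # Column names come from the whitelist above; values are bound parameters.
    clauses = [f"{field} = ?" for field in criteria]
    sql = "SELECT accession FROM protein"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    sql += " ORDER BY accession"
    rows = conn.execute(sql, tuple(criteria.values())).fetchall()
    return [row[0] for row in rows]


print(advanced_search(conn, protein_type="spike"))
print(advanced_search(conn, protein_type="spike", host="human"))
```

Combining keywords narrows the result set, which mirrors the behaviour described for the "Advanced" search; a production system would typically wrap such a query in a stored procedure, as the text notes.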

Fig. 2.

Screenshot of DBCOVP Web-interface: a) ‘Advanced Search’ allows users to enter multiple search queries simultaneously to retrieve specific proteins of interest. b) Browse by ‘Subgenus’, ‘Epitope’ and ‘Proteins’. c) Details page for a coronavirus strain. d) Result page from Browse by ‘Subgenus’. e) The detailed annotation page of a protein.

Fig. 3.

Detailed immunogenic information obtained for proteins using Browse by Epitope.

3. Database content and annotation features

Each protein entry in the database has five important annotation components as discussed below. The detailed annotation has been manually predicted using various tools and databases as described in Supplementary Table 1.

a. Summary: This section presents general information about the protein sequence as retrieved from UniProt and NCBI GenBank, such as accession ID, strain name, host species, taxonomic lineage, subcellular localization, genomic location, Gene Ontology terms, Pfam domain, family description, cross-referenced links to external databases such as NCBI, KEGG, and UniProt, and the protein and nucleotide sequences in FASTA format (Fig. 4a).

Fig. 4.

Schematic representation of database content and annotation features: a) Summary tab of protein annotation page. b) Structural Details tab of protein annotation page. c) Physicochemical properties tab of protein annotation page. d) Epitopes tab of protein annotation page.

b. Structural Details: This includes the secondary and tertiary structure details of each protein, stating the number of helices, beta-sheets, and turns, predicted disordered regions, disulfide bond positions, transmembrane helices, presence of signal peptides and cleavage sites, ubiquitination site details, the position of repeat sequences, and the crystal structure of the protein if available in the Protein Data Bank. Users can view the 3D structure with the integrated Jmol program and can also download the structure in PDB format (Fig. 4b).

c. Physicochemical properties: This includes the physicochemical properties of the proteins, comprising pI, number of positively/negatively charged amino acids, instability index, aliphatic index, GRAVY, hydropathy plot, and solubility (Fig. 4c).

d. Epitopes: Each spike, membrane, envelope, and nucleocapsid protein sequence was analyzed to identify the most immunogenic and antigenic T-cell epitopes along with B-cell epitopes. We have also predicted the binding Class I and Class II HLA alleles, conservancy score, allergenicity, antigenicity, toxicity, hydropathicity, hydrophilicity, charge, and molecular weight of the predicted peptides. In addition, the population coverage analysis of the promiscuous epitopes is available in the database. Furthermore, the 3D structure of the epitopes along with the docked complex of the epitope and binding HLA have been developed, and users can download the structures for further analysis. The detailed immunogenic results obtained from the epitope analysis are described in the next section (Fig. 4d).
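Some of the sequence-derived descriptors listed above (GRAVY or hydropathicity, aliphatic index, and counts of charged residues) follow directly from the amino acid composition. The sketch below is a minimal illustration, assuming the standard Kyte-Doolittle hydropathy scale and the Ikai aliphatic index formula; the function names are hypothetical, and the tools actually used by DBCOVP are those listed in its supplementary tables.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(seq):
    """Grand average of hydropathy: mean Kyte-Doolittle value per residue."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)


def aliphatic_index(seq):
    """Ikai aliphatic index: X(Ala) + 2.9*X(Val) + 3.9*(X(Ile) + X(Leu)),
    where X is the mole percent of each residue."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")

    def mole_percent(aa):
        return 100.0 * seq.count(aa) / len(seq)

    return (mole_percent("A") + 2.9 * mole_percent("V")
            + 3.9 * (mole_percent("I") + mole_percent("L")))


def charged_residue_counts(seq):
    """Return (positive, negative) counts: Arg + Lys and Asp + Glu."""
    seq = seq.upper()
    positive = sum(seq.count(aa) for aa in "RK")
    negative = sum(seq.count(aa) for aa in "DE")
    return positive, negative


peptide = "AVIL"  # toy peptide, not a DBCOVP epitope
print(gravy(peptide), aliphatic_index(peptide), charged_residue_counts(peptide))
```

A positive GRAVY value indicates an overall hydrophobic peptide and a negative value a hydrophilic one, which is why the descriptor is useful when screening candidate epitopes.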

4. Immunoinformatics data

The immunoinformatics data are organized in the database for easy analysis and retrieval. Users can retrieve these resources either from the annotation detail page of each protein entry or by clicking the epitope option in the browse section of the database homepage. Each of the 185 sequences encompassing coronavirus spike, membrane, envelope, and nucleocapsid proteins was analyzed with several immunoinformatics algorithms and tools, listed in Supplementary Table 2. The complete protocol employed for the identification of promiscuous epitope-based vaccine candidates is shown in Fig. 1.

For each protein sequence, the most promiscuous T-cell and B-cell epitopes were selected: those that were recognized by a considerable number of HLA alleles, had the highest immunogenicity and antigenicity values, and were nontoxic to humans, and were hence considered the most promising epitopes to induce a strong immune response. Furthermore, the epitopes were selected based on the consensus results of all the employed tools. HLA allele distribution differs among geographic regions and ethnic groups around the world, so population coverage analysis of the epitopes must be taken into consideration during the development of an effective vaccine. For all the predicted epitopes, the cumulative percentage of population coverage across the world was therefore measured, and the results are displayed in a graphical format as shown in Fig. 2d. The results indicate that all the predicted epitopes and their binding HLA alleles covered more than 80% of the world's population, which is important for a vaccine candidate since the emerging SARS-CoV-2 strain has affected human populations across the world. Besides, the three-dimensional structure of each predicted epitope was determined, and its binding interaction with the most conserved HLA allele was studied by docking. The PDB structures are available for download. The ribbon representations of the structures were prepared and visualized with the PyMOL molecular graphics system.
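The idea behind population coverage can be illustrated with a simplified calculation. The sketch below is not the algorithm of any specific tool: it assumes Hardy-Weinberg proportions within each HLA locus and independence between loci, and the allele frequencies, the `covered` set, and the function names are made up for illustration.

```python
from math import prod


def locus_coverage(allele_freqs, covered):
    """Fraction of individuals carrying at least one covered allele at a
    single locus, assuming Hardy-Weinberg proportions (two allele copies)."""
    f = sum(freq for allele, freq in allele_freqs.items() if allele in covered)
    f = min(f, 1.0)
    return 1.0 - (1.0 - f) ** 2


def population_coverage(freqs_by_locus, covered):
    """Combine loci under the assumption that they are independent."""
    uncovered = prod(
        1.0 - locus_coverage(freqs, covered)
        for freqs in freqs_by_locus.values()
    )
    return 1.0 - uncovered


# Made-up allele frequencies for one hypothetical population.
freqs_by_locus = {
    "HLA-A": {"A*02:01": 0.25, "A*01:01": 0.15, "A*03:01": 0.10},
    "HLA-B": {"B*07:02": 0.10, "B*08:01": 0.05},
}
# Alleles predicted to bind at least one epitope in the candidate set.
covered = {"A*02:01", "A*01:01", "B*07:02"}

print(f"{population_coverage(freqs_by_locus, covered):.2%}")
```

Adding epitopes that bind further alleles enlarges the covered set and raises the estimate, which is why promiscuous epitopes are favoured when a broad population must be protected.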

5. Integrated tools

To facilitate further in-depth analysis of virulence proteins from coronaviruses, four analysis tools have been integrated. Sequence similarity searches of both nucleotide and amino acid sequences can be performed with the basic local alignment search tool (BLAST) algorithm through an integrated BLAST module within the database. The BLAST interface allows alignment of a user-provided sequence against a customized BLAST library containing all sequences present in the DBCOVP database. This helps to identify the similarity of any unknown sequence to known annotated proteins. The user may specify BLAST parameters and upload or paste the query sequences. The output is given in the standard format with the BLAST score and is ordered by ascending e-value. Each hit is hyperlinked to that entry's browser page. Analysis of the variability of virulence proteins is important for understanding the emergence of novel strains and for deciphering sequence-level variations that lead to changes in pathogenicity. To facilitate such cross-genome comparative analysis, a COMPARE tool has been integrated, with which users can analyze variations in targeted sequences across multiple strains belonging to the same or different host species. Additionally, multiple sequence alignments and phylogenetic trees can be constructed using the embedded MUSCLE and PhyML tools, respectively.
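Ordering hits by ascending e-value can be sketched as follows. The example assumes the 12-column tab-separated layout produced by NCBI BLAST+ with `-outfmt 6`; the format actually returned by the DBCOVP BLAST module may differ, and the hit rows, the `Hit` type, and the `parse_tabular_hits` helper are hypothetical.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    subject: str
    identity: float
    evalue: float
    bitscore: float


def parse_tabular_hits(text):
    """Parse 12-column tabular BLAST output and sort hits by ascending
    e-value, breaking ties by descending bit score."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        cols = line.split("\t")
        # Columns: qseqid sseqid pident length mismatch gapopen
        #          qstart qend sstart send evalue bitscore
        hits.append(Hit(cols[1], float(cols[2]), float(cols[10]), float(cols[11])))
    return sorted(hits, key=lambda h: (h.evalue, -h.bitscore))


# Made-up hit rows for illustration.
rows = [
    ["query1", "DEMO_S2", "85.0", "100", "15", "0", "1", "100", "1", "100", "1e-30", "200"],
    ["query1", "DEMO_S1", "99.0", "100", "1", "0", "1", "100", "1", "100", "1e-50", "350"],
    ["query1", "DEMO_N1", "30.0", "50", "35", "0", "1", "50", "1", "50", "0.5", "25"],
]
example = "\n".join("\t".join(row) for row in rows)

for hit in parse_tabular_hits(example):
    print(hit.subject, hit.evalue, hit.bitscore)
```

The lowest e-value corresponds to the most significant match, so the best candidate for annotating an unknown sequence appears first in the list.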

6. Discussion and future directions

The COVID-19 pandemic has resulted in an exponential increase in the number of novel SARS-CoV-2 genomes being sequenced. Computational methods and databases are therefore needed to organize, explore, and analyze large volumes of biological data, both to aid in understanding the mechanisms of disease pathogenesis and, most importantly, to speed up vaccine development by providing adequate information on the efficacy and immunogenicity of potential molecular targets critical for subsequent clinical validation. A growing number of studies have shown that the four major structural glycoproteins, namely the spike, envelope, membrane, and nucleocapsid proteins, play vital roles in viral infection. The spike protein in particular has been shown to elicit T-cell responses, suggesting it as a potential vaccine candidate against SARS infection [39].

In this study, we developed DBCOVP, the first manually curated database to provide comprehensive information on the entire repertoire of structural glycoproteins from coronavirus genomes of the betacoronavirus genera, including the newly sequenced SARS-CoV-2 strains, which are chiefly responsible for the atypical severe acute respiratory syndrome. Compared with the few existing databases for coronavirus research, DBCOVP is a specialized database focused on coronavirus spike, envelope, membrane, and nucleocapsid proteins and excels in the following aspects: (i) A substantially extended data volume, consisting of a total of 185 structural proteins from 137 strains, including sequences from the SARS-CoV-2 strains recently deposited in NCBI. (ii) Complete functional annotation of the proteins, highlighting 14 sequence-structural properties that are only partially addressed in some of the existing coronavirus sequence data resources. Basic information about each protein is manually curated from known databases, while more specific and source-dependent annotation features have been computationally predicted using various bioinformatics tools and methods. (iii) Support for knowledge discovery from coronavirus antigen data, the major purpose of the database, with particular emphasis on applications in immunology and vaccinology. Each spike, membrane, envelope, and nucleocapsid protein sequence has been mapped to highlight the most promiscuous epitopic regions (T-cell and B-cell), along with the conservancy score, allergenicity, antigenicity, toxicity, hydropathicity, hydrophilicity, charge, molecular weight, and population coverage analysis of the predicted peptides. In addition, the 3D structures of the epitopes, along with the docked epitope-HLA binding complexes, are available for further analysis. This is the first database to bring together the aforementioned immunogenic data specific to coronavirus virulent glycoproteins on a single platform. (iv) Multiple search and browse options to facilitate data extraction. (v) Links to resources pertinent to coronavirus research. (vi) A user-friendly interface that incorporates an application for BLAST similarity search and integrates tools for cross-genome comparison, phylogenetic analysis, and multiple sequence alignment to facilitate further studies on structural glycoproteins and their functional role in virulence.
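Two of the peptide properties listed in aspect (iii) can be illustrated with a minimal sketch. It is not the pipeline used to build DBCOVP, whose tools are not named in this section. Hydropathicity is computed here as the GRAVY value (mean Kyte-Doolittle hydropathy), and conservancy is simplified to the fraction of sequences containing an exact copy of the epitope, whereas dedicated conservancy tools typically also allow identity thresholds. The function names and example sequences are hypothetical.

```python
# Kyte-Doolittle hydropathy index for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(peptide: str) -> float:
    """Grand average of hydropathy: mean Kyte-Doolittle value of the residues."""
    if not peptide:
        raise ValueError("peptide must not be empty")
    return sum(KYTE_DOOLITTLE[aa] for aa in peptide.upper()) / len(peptide)


def conservancy(epitope: str, sequences: list) -> float:
    """Fraction of sequences containing an exact match of the epitope.

    Simplified definition for illustration (100% identity only).
    """
    if not sequences:
        raise ValueError("at least one sequence is required")
    matches = sum(1 for seq in sequences if epitope in seq)
    return matches / len(sequences)


if __name__ == "__main__":
    # Hypothetical peptide and strain fragments, not DBCOVP entries.
    peptide = "KLMV"
    strains = ["AAKLMVAA", "GGKLMVGG", "AAKLTVAA"]
    print(f"GRAVY: {gravy(peptide):.2f}")
    print(f"conservancy: {conservancy(peptide, strains):.2f}")
```

A positive GRAVY value indicates a predominantly hydrophobic peptide and a negative value a hydrophilic one, so the score gives a quick check on whether a candidate epitope is likely to be solvent exposed.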

Research on viable therapeutics and vaccine targets against human coronavirus infection is probably only beginning to unfold. In the future, we will continue to update the database, include sequences from other coronavirus strains, and integrate further valuable resources. We will also try to combine the steps and tools employed in this study for epitope analysis into one automated tool. Such a tool would be particularly useful for researchers with little knowledge of bioinformatics, who could rapidly analyze the immunogenic properties of uncharacterized sequences on one platform without moving data between different analysis tools. DBCOVP will certainly be an important resource when prioritizing vaccine candidates against coronavirus infection.

Declaration of competing interest

The authors declare that there is no conflict of interest.

Acknowledgments

The authors acknowledge the School of Biotechnology, Kalinga Institute of Industrial Technology (KIIT), Deemed to be University, Bhubaneswar 751024, India for providing the necessary infrastructure to carry out this work. The authors also acknowledge Informatics Lab, CSIR-Institute of Minerals and Materials Technology (CSIR-IMMT), Bhubaneswar-751013, India for providing the computational facilities and hosting the database.

Footnotes

Appendix A

Supplementary data to this article can be found online at https://doi.org/10.1016/j.compbiomed.2020.104131.

Author contributions

Conception and design: MS, NM; Computational work: SS, SM; Data analysis and curation: NM, BD, VR; Original draft preparation: NM; Writing, reviewing, and editing: MS, VR, SS. The manuscript has been read and approved by all authors.

Funding statement

The authors received no funding from an external source.

Appendix A. Supplementary data

The following are the Supplementary data to this article:

Multimedia component 1
mmc1.docx (13.3KB, docx)
Multimedia component 2
mmc2.docx (12.5KB, docx)

References

  • 1. Malik Y.S., Sircar S., Bhat S., Sharun K., Dhama K., Dadar M. Emerging novel coronavirus (SARS-CoV-2)—current scenario, evolutionary perspective based on genome analysis and recent developments. Vet. Q. 2020;40:68–76. doi: 10.1080/01652176.2020.1727993.
  • 2. Su S., Wong G., Shi W., Liu J., Lai A.C.K., Zhou J. Epidemiology, genetic recombination, and pathogenesis of coronaviruses. Trends Microbiol. 2016;24:490–502. doi: 10.1016/j.tim.2016.03.003.
  • 3. Lu R., Zhao X., Li J., Niu P., Yang B., Wu H. Genomic characterisation and epidemiology of 2019 novel coronavirus: implications for virus origins and receptor binding. Lancet. 2020;395:565–574. doi: 10.1016/S0140-6736(20)30251-8.
  • 4. Peiris J.S., Guan Y., Yuen K.Y. Severe acute respiratory syndrome. Nat. Med. 2004;10(suppl 12):S88–S97. doi: 10.1038/nm1143.
  • 5. Chan-Yeung M., Xu R.H. SARS: epidemiology. Respirology. 2003;8(suppl):S9–S14. doi: 10.1046/j.1440-1843.2003.00518.x.
  • 6. Zaki A.M., van Boheemen S., Bestebroer T.M., Osterhaus A.D., Fouchier R.A. Isolation of a novel coronavirus from a man with pneumonia in Saudi Arabia. N. Engl. J. Med. 2012;367:1814–1820. doi: 10.1056/NEJMoa1211721.
  • 7. Lee J., Chowell G., Jung E. A dynamic compartmental model for the Middle East respiratory syndrome outbreak in the Republic of Korea: a retrospective analysis on control interventions and superspreading events. J. Theor. Biol. 2016;408:118–126. doi: 10.1016/j.jtbi.2016.08.009.
  • 8. Lee J.Y., Kim Y.J., Chung E.H. The clinical and virological features of the first imported case causing MERS-CoV outbreak in South Korea, 2015. BMC Infect. Dis. 2017;17:498. doi: 10.1186/s12879-017-2576-5.
  • 9. Wu F., Zhao S., Yu B., Chen Y.M., Wang W., Song Z.G. A new coronavirus associated with human respiratory disease in China. Nature. 2020;580:E7. doi: 10.1038/s41586-020-2008-3.
  • 10. Li J.Y., You Z., Wang Q., Zhou Z.J., Qiu Y., Luo R. The epidemic of 2019-novel-coronavirus (SARS-CoV-2) pneumonia and insights for emerging infectious diseases in the future. Microb. Infect. 2020;22:80–85. doi: 10.1016/j.micinf.2020.02.002.
  • 11. Zhou P., Yang X.L., Wang X.G., Hu B., Zhang L., Zhang W. A pneumonia outbreak associated with a new coronavirus of probable bat origin. Nature. 2020;579:270–273. doi: 10.1038/s41586-020-2012-7.
  • 12. Liu J., Zheng X., Tong Q., Li W., Wang B., Sutter K. Overlapping and discrete aspects of the pathology and pathogenesis of the emerging human pathogenic coronaviruses SARS-CoV, MERS-CoV, and SARS-CoV-2. J. Med. Virol. 2020;92:491–494. doi: 10.1002/jmv.25709.
  • 13. Li G., Fan Y., Lai Y., Han T., Li Z., Zhou P. Coronavirus infections and immune responses. J. Med. Virol. 2020;92:424–432. doi: 10.1002/jmv.25685.
  • 14. Woo P.C.Y., Huang Y., Lau S.K.P., Yuen K.Y. Coronavirus genomics and bioinformatics analysis. Viruses. 2010;2:1804–1820. doi: 10.3390/v2081803.
  • 15. Bourdette D.N., Edmonds E., Smith C. A highly immunogenic trivalent T cell receptor peptide vaccine for multiple sclerosis. Mult. Scler. 2005;11:552–561. doi: 10.1191/1352458505ms1225oa.
  • 16. Lopez J.A., Weilenman C., Audran R. A synthetic malaria vaccine elicits a potent CD8(+) and CD4(+) T lymphocyte immune response in humans. Implications for vaccination strategies. Eur. J. Immunol. 2001;31:1989–1998. doi: 10.1002/1521-4141(200107)31:7<1989::aid-immu1989>3.0.co;2-m.
  • 17. Knutson K.L., Schiffman K., Disis M.L. Immunization with a HER-2/neu helper peptide vaccine generates HER-2/neu CD8 T-cell immunity in cancer patients. J. Clin. Invest. 2001;107:477–484. doi: 10.1172/JCI11752.
  • 18. Banerjee N., Mukhopadhyay S. Viral glycoproteins: biological role and application in diagnosis. Virus Dis. 2016;27:1–11. doi: 10.1007/s13337-015-0293-5.
  • 19. Li F. Structure, function, and evolution of coronavirus spike proteins. Ann. Rev. Virol. 2016;3:237–261. doi: 10.1146/annurev-virology-110615-042301.
  • 20. Lu G., Wang Q., Gao G.F. Bat-to-human: spike features determining ‘host jump’ of coronaviruses SARS-CoV, MERS-CoV, and beyond. Trends Microbiol. 2015;23:468–478. doi: 10.1016/j.tim.2015.06.003.
  • 21. Wang Q., Wong G., Lu G., Yan J., Gao G.F. MERS-CoV spike protein: targets for vaccines and therapeutics. Antivir. Res. 2016;133:165–177. doi: 10.1016/j.antiviral.2016.07.015.
  • 22. Marra M.A., Jones S.J., Astell C.R., Holt R.A., Brooks-Wilson A., Butterfield Y.S. The genome sequence of the SARS-associated coronavirus. Science. 2003;300:1399–1404. doi: 10.1126/science.1085953.
  • 23. Jiang S., He Y., Liu S. SARS vaccine development. Emerg. Infect. Dis. 2005;11:1016–1020. doi: 10.3201/eid1107.050219.
  • 24. Salvatori G., Luberto L., Maffei M., Aurisicchio L., Roscilli G., Palombo F., Marra E. SARS-CoV-2 SPIKE PROTEIN: an optimal immunological target for vaccines. J. Transl. Med. 2020;18:1–3. doi: 10.1186/s12967-020-02392-y.
  • 25. Schoeman D., Fielding B.C. Coronavirus envelope protein: current knowledge. Virol. J. 2019;16:69. doi: 10.1186/s12985-019-1182-0.
  • 26. Ruch T.R., Machamer C.E. The coronavirus E protein: assembly and beyond. Viruses. 2012;4:363–382. doi: 10.3390/v4030363.
  • 27. Huang Y., Lau S.K., Woo P.C., Yuen K.Y. CoVDB: a comprehensive database for comparative analysis of coronavirus genes and genomes. Nucleic Acids Res. 2008;36:D504–D511. doi: 10.1093/nar/gkm754.
  • 28. Pickett B.E., Greer D.S., Zhang Y., Stewart L., Zhou L., Sun G. Virus pathogen database and analysis resource (ViPR): a comprehensive bioinformatics database and analysis resource for the coronavirus research community. Viruses. 2012;4:3209–3226. doi: 10.3390/v4113209.
  • 29. Hulo C., De Castro E., Masson P., Bougueleret L., Bairoch A., Xenarios I., Le Mercier P. ViralZone: a knowledge resource to understand virus diversity. Nucleic Acids Res. 2011;39(suppl_1):D576–D582. doi: 10.1093/nar/gkq901.
  • 30. Schäffer A.A., Hatcher E.L., Yankie L., Shonkwiler L., Brister J.R., Karsch-Mizrachi I., Nawrocki E.P. VADR: validation and annotation of virus sequence submissions to GenBank. BMC Bioinf. 2020;21:1–23. doi: 10.1186/s12859-020-3537-3.
  • 31. Amgarten D., Upton C. Bioinformatic approaches for comparative analysis of viruses. In: Comparative Genomics. Humana Press, New York, NY; 2018:401–417. doi: 10.1007/978-1-4939-7463-4_15.
  • 32. Kalvari I., Argasinska J., Quinones-Olvera N., Nawrocki E.P., Rivas E., Eddy S.R., Bateman A., Finn R.D., Petrov A.I. Rfam 13.0: shifting to a genome-centric resource for non-coding RNA families. Nucleic Acids Res. 2018;46(D1):D335–D342. doi: 10.1093/nar/gkx1038.
  • 33. Martin R., Löchel H.F., Welzel M., Hattab G., Hauschild A.C., Heider D. CORDITE: the curated CORona drug InTERactions database for SARS-CoV-2. Iscience. 2020;23(7):101297. doi: 10.1016/j.isci.2020.101297.
  • 34. Sadegh S., Matschinske J., Blumenthal D.B., Galindez G., Kacprowski T., List M., Nasirigerdeh R., Oubounyt M., Pichlmair A., Rose T.D., Salgado-Albarrán M. Exploring the SARS-CoV-2 virus-host-drug interactome for drug repurposing. Nat. Commun. 2020;11:3518. doi: 10.1038/s41467-020-17189-2.
  • 35. Zhu Zhenglin, Meng Kaiwen, Geng Meng. A database resource for Genome-wide dynamics analysis of Coronaviruses on a historical and global scale. 2020.
  • 36. Wu J., Chen W., Zhou J., Zhao W., Chen S., Zhou Z. COVIEdb: a database for potential immune epitopes of coronaviruses. bioRxiv. 2020. doi: 10.1101/2020.05.24.096164.
  • 37. Ahmed S.F., Quadeer A.A., McKay M.R. COVIDep: a web-based platform for real-time reporting of vaccine target recommendations for SARS-CoV-2. Nat. Protoc. 2020. doi: 10.1038/s41596-020-0358-9.
  • 38. Patiyal S., Kaur D., Kaur H., Sharma N., Dhall A., Sahai S. A web-based platform on COVID-19 to maintain Predicted Diagnostic, Drug and Vaccine candidates. OSF Preprints. 2020. doi: 10.31219/osf.io/xegzu.
  • 39. Huang J., Cao Y., Du J., Bu X., Ma R., Wu C. Priming with SARS CoV S DNA and boosting with SARS CoV S epitopes specific for CD4+ and CD8+ T cells promote cellular immune responses. Vaccine. 2007;25:6981–6991. doi: 10.1016/j.vaccine.2007.06.047.


Articles from Computers in Biology and Medicine are provided here courtesy of Elsevier
