Abstract
The continuing threat of infectious disease and future pandemics, coupled to the continuous increase of drug-resistant pathogens, makes the discovery of new and better vaccines imperative. For effective vaccine development, antigen discovery and validation is a prerequisite. The compilation of information concerning pathogens, virulence factors and antigenic epitopes has resulted in many useful databases. However, most such immunological databases focus almost exclusively on antigens where epitopes are known and ignore those for which epitope information is unavailable. We have compiled more than 500 antigens into the AntigenDB database, making use of the literature and other immunological resources. These antigens come from 44 important pathogenic species. In AntigenDB, a database entry contains information regarding the sequence, structure, origin, etc. of an antigen with additional information such as B and T-cell epitopes, MHC binding, function, gene expression and post-translational modifications, where available. AntigenDB also provides links to major internal and external databases. We shall update AntigenDB on a rolling basis, regularly adding antigens from other organisms and extra data analysis tools. AntigenDB is available freely at http://www.imtech.res.in/raghava/antigendb and its mirror site http://www.bic.uams.edu/raghava/antigendb.
INTRODUCTION
The term vaccine can be applied to all agents, either of a molecular or supramolecular nature, used to stimulate specific, protective immunity against pathogenic microbes and the disease they cause. It is clear that vaccines form the most powerful and cost-effective prophylactic therapy for infectious disease. Vaccines work to militate against the effects of subsequent infection as well as blocking the ability of a pathogen to kill its host. The availability of entire genomes corresponding to many important pathogens has instigated a new research initiative able to discover a wide array of antigens, which can act as potential vaccine candidates. Bioinformatics, in the form of comprehensive immunological databases and analysis tools, has hastened both the identification and the validation of candidate vaccines.
Previously, the principal means of antigen discovery was empirical. A live, virulent pathogen, for example, is now considered a poor candidate vaccine since despite its potent immunogenicity it is more liable to induce disease than to prevent or treat it. Thus, vaccines have, until recently, been primarily attenuated or chemically inactivated whole pathogen vaccines such as Sabin’s polio vaccine or BCG. More recently, safety concerns have fostered alternate strategies for vaccine development. The most successful has focused on the antigen, acellular or subunit vaccine, which includes recombinant vaccines against hepatitis B, human papillomavirus and Haemophilus influenzae B. Subunit vaccines are typically composed of immunogenic protein or carbohydrate, such as cell wall components, or a bio-conjugated combination of both. Many antigens are highly immunogenic, while others stimulate measurable yet often weak responses, requiring boosting to perpetuate long-term protection and the addition of adjuvants (1–3).
Antigen-based subunit vaccines long ago became a prime focus of vaccine discovery. Approaches to the identification of antigens include the identification of immunodominant epitopes, assaying for enhanced correlates of protection such as antibody levels and cytokine production, and testing for enhanced survival in disease challenge models. The long, cumbersome and inconclusive procedures currently used for antigen validation, compounded by the failure to identify protective B cell or T cell mediated epitopes, often fail to establish efficiently whether an antigen is an efficacious vaccine candidate. Thus the identification of antigens as putative whole protein subunit vaccines is also now a key goal of immunoinformatics and computational vaccinology.
The in silico identification of antigens offers the hope of eliciting significant responses from both humoral and cellular immune systems, far exceeding the efficacy of peptide vaccines, while avoiding the potential toxicity problems associated with whole microbe vaccines. A necessary step towards systematizing antigen discovery is the rigorous compilation and annotation of antigens.
Recently, massive factory-scale experimentation coupled to literature mining has allowed numerous functional immunology databases to emerge. Databases such as The Immune Epitope Database and Analysis Resource (IEDB), MHCBN, AntiJen and BCIPEP (4–7) are large, robust data repositories and provide copious information. Concerning antigens, however, these databases, although replete with information concerning individual B cell epitopes, T cell epitopes and Major Histocompatibility Complex (MHC) binding peptides, remain otherwise partial and incomplete. Their focus is on the epitope, and not the antigen. There are many antigens for which specific epitope or MHC binding information is not currently available, yet many such antigens are known experimentally to induce innate or adaptive immune responses, or both. Such antigens, or similar pathogenic proteins, might prove useful in vaccine design. These antigens require urgent and rigorous cataloguing.
In order to address this pressing issue, the present work describes the database AntigenDB. It is a specialized, value-added database of antigens derived from pathogenic organisms. This resource is intended to be a repository for all experimentally determined antigens, irrespective of whether such an antigen is associated within the extant knowledge-base with known epitope data. The database is freely accessible through a web browser at http://www.imtech.res.in/raghava/antigendb/.
SYSTEM AND METHODS
Database construction and architecture
Experimentally validated antigens were collected from the literature (PubMed: http://www.ncbi.nlm.nih.gov/pubmed/; ScienceDirect: http://www.sciencedirect.com/). Additional information about these antigens was collected from various public databases including IEDB, MHCBN, AntiJen and BCIPEP. We developed PERL scripts to extract sequence, structural, functional and gene expression information from the SwissProt, GenBank, PDB (Protein Data Bank) and GEO databases (8–11). AntigenDB runs on a SUN T-1000 system under the Solaris 10.0 environment. The front end was developed using HTML and the back end using PostgreSQL, a relational database management system. All common gateway interface (CGI) and database interfacing scripts were written in the PERL programming language. The architecture of the AntigenDB database is shown in Figure 1a and b.
Organization of data
AntigenDB collects and compiles comprehensive information concerning antigens. Most of the antigens in AntigenDB come from the genera Mycobacterium and Plasmodium and from the Influenza A virus (Figure 2). The up-to-date status of the database is available at http://www.imtech.res.in/raghava/antigendb/info.html. Data for each antigen can be categorized as primary data (antigen sequence and structural information), B-cell epitope (epitope sequence and antibody information), T-cell epitope (T helper and T killer cell epitopes, MHC I/II binding, TAP binding and cleavage site information), function (cellular location, function, functional sites and similarity with host and pathogens), gene expression (nucleotide sequence, codon frequency and expression profiles) and the different types of post-translational modification (PTM) associated with antigens. Each antigen is assigned a unique entry number and information is divided into six tables, each table providing unique information.
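The six data categories described above can be pictured as one record per antigen. The following is a minimal sketch in Python (not the PERL and PostgreSQL used for AntigenDB itself); the class name, field names and the example record are hypothetical and are not taken from the actual AntigenDB schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class AntigenEntry:
    """One antigen record; field names are illustrative, not the real schema."""
    entry_id: str                       # unique entry number
    name: str
    antigen_type: str                   # "protein", "carbohydrate" or "lipid"
    organism: str
    sequence: Optional[str] = None      # primary data: amino acid sequence
    pdb_ids: List[str] = field(default_factory=list)          # structural data
    b_cell_epitopes: List[str] = field(default_factory=list)
    t_cell_epitopes: List[str] = field(default_factory=list)
    function: Optional[str] = None
    gene_sequence: Optional[str] = None                        # gene expression data
    ptms: List[str] = field(default_factory=list)

    def has_epitope_data(self) -> bool:
        """True when at least one B-cell or T-cell epitope is recorded."""
        return bool(self.b_cell_epitopes or self.t_cell_epitopes)


# Hypothetical record: an antigen without known epitopes is still a valid entry.
example = AntigenEntry(
    entry_id="00001",
    name="demo antigen",
    antigen_type="protein",
    organism="Mycobacterium tuberculosis",
    sequence="MKTAYIAKQR",
)
```

Making every epitope-related field optional reflects the stated aim of the database: an antigen is recorded whether or not epitope information exists for it.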
General information
The main or primary table contains general information about antigens. This table has the following main fields: (i) antigen name: the name of the antigen with its synonyms; (ii) antigen type: whether it is protein, carbohydrate or lipid; (iii) amino acid sequence of a protein antigen and (iv) source organism, with a link to the NCBI taxonomy database.
Structural information
Within the database, detailed crystal structure information is available for 290 molecules out of 504 antigens. This information is supplemented by the OCA web browser (http://oca.weizmann.ac.il), with surface accessibility provided by ASAView tools (12). We also link to the Swiss-model repository database (13), which provides hypothetical structures for unsolved protein using protein-modeling techniques. We also provide secondary structure information in the form of percent content as calculated by DSSP (14).
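As an illustration of what "percent content" of secondary structure means, the sketch below reduces a per-residue DSSP state string to helix, strand and coil percentages. It is written in Python and is only a sketch: the three-class grouping (G, H, I as helix; E, B as strand; everything else as coil) is a common convention assumed here, not necessarily the one used by AntigenDB, and the function name is hypothetical.

```python
from collections import Counter
from typing import Dict

# Assumed reduction of DSSP states to three classes.
HELIX = set("GHI")
STRAND = set("EB")


def secondary_structure_percent(dssp_states: str) -> Dict[str, float]:
    """Return percent helix, strand and coil for a per-residue DSSP string."""
    if not dssp_states:
        raise ValueError("empty DSSP string")
    counts = Counter()
    for state in dssp_states:
        if state in HELIX:
            counts["helix"] += 1
        elif state in STRAND:
            counts["strand"] += 1
        else:
            counts["coil"] += 1          # turns, bends, loops and blanks
    total = len(dssp_states)
    return {k: 100.0 * counts[k] / total for k in ("helix", "strand", "coil")}
```

For a ten-residue string such as "HHHHEE--TT" this gives 40% helix, 20% strand and 40% coil.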
B-cell epitope
A principal challenge for immunology is to identify the antigenic regions responsible for stimulating B-cells, also called B-cell epitopes (15). We have collected B-cell epitopes reported for antigens available in AntigenDB (Figure S1). This table has the following major fields: (i) the capability of the antigen to induce B-cell or humoral immune responses; (ii) specific B-cell epitopes within the antigen; (iii) the antibodies that recognize these epitopes and (iv) PTMs associated with such epitopes. Most of the data for this table have been obtained from the primary literature or from secondary sources (BCIPEP and IEDB). Many antigens have no known B-cell epitopes; these antigens are not covered by B-cell epitope databases.
T-cell epitope
Most extant vaccines are mediated by antibodies. However, responses to diseases without effective vaccines are largely mediated by cellular, not humoral, immunity. Thus, to develop subunit vaccines one needs to identify T-cell epitopes within an antigen. Figure S2 shows the distribution of T-cell epitopes in AntigenDB; most antigens have fewer than 10 T-cell epitopes. The T-cell epitope table contains the following main fields: (i) immunogenicity induced by T helper or cytotoxic T cells; (ii) helper or cytotoxic T-cell epitopes; (iii) PTMs associated with these epitopes; (iv) MHC I/II binders, with experimental IC50 values if available; (v) mapping of TAP binders and (vi) mapping of cleavage sites. Data for this table were obtained from the literature and from the MHCBN, IEDB and AntiJen databases. Both B and T-cell epitopes are linked directly to other epitope databases, such as the IEDB. There are ∼358 antigens for which there is no epitope information; that is, these antigens are not covered by existing epitope databases. In the genus Mycobacterium alone, about 95 antigens have been shown to induce an immune response but have no epitope yet identified. This demonstrates the importance of AntigenDB, which covers experimentally validated, protective antigens not covered in epitope-oriented databases.
Functional information
The function of an antigen determines its potential candidacy as a vaccine. If the antigen is involved in house-keeping or is present at the cell surface and easily available to surveillance by the host immune system, then the probability that it will be a suitable vaccine candidate increases. Therefore, there is an imperative need for insight into the function and subcellular localization of each antigen. The function table provides the following information: (i) functional annotation of antigens using the SwissProt and GO databases; (ii) cellular localization (secreted, cytoplasmic and membrane bound) as described in the DBSubLoc and PSORTdb databases (16,17) and (iii) functional sites and domains obtained from InterPro (18). The development of a better vaccine requires knowledge of the similarity between a candidate antigen and proteins derived from the host or other pathogenic organisms. If an antigen is similar to one or more host proteins, there exists the possibility of autoimmune responses; less similarity with other pathogens is also advantageous and would qualify the antigen as a better vaccine candidate. Therefore, we provide BLAST (19) results for each antigen against pathogenic and human proteins.
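To make the host-similarity check concrete, the sketch below (in Python) reads BLAST hits in the standard 12-column tab-separated layout and flags those that look similar enough to a host protein to merit caution. The identity and E-value cut-offs, the function names and the sample hit lines are assumptions made for illustration; the paper does not state which thresholds, if any, AntigenDB applies.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    subject_id: str
    percent_identity: float
    evalue: float


def parse_tabular_hits(text: str) -> List[Hit]:
    """Parse 12-column tab-separated BLAST hits, skipping blank and comment lines."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        # Column 2 is the subject ID, column 3 the percent identity, column 11 the E-value.
        hits.append(Hit(cols[1], float(cols[2]), float(cols[10])))
    return hits


def flag_host_similarity(hits: List[Hit],
                         min_identity: float = 40.0,
                         max_evalue: float = 1e-5) -> List[Hit]:
    """Keep hits that meet both (assumed) thresholds for similarity to host proteins."""
    return [h for h in hits
            if h.percent_identity >= min_identity and h.evalue <= max_evalue]


# Made-up hits of one antigen against two hypothetical human proteins.
sample = (
    "ag1\thuman_P1\t62.5\t120\t45\t0\t1\t120\t10\t129\t1e-50\t180\n"
    "ag1\thuman_P2\t28.0\t50\t36\t0\t5\t54\t3\t52\t0.5\t25.1\n"
)
flagged = flag_host_similarity(parse_tabular_hits(sample))
```

Here only the first hit is flagged; the second falls below the identity cut-off and above the E-value cut-off.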
Gene expression
Suitable antigens are often highly expressed and thus optimally available to host surveillance. Expression of any protein is largely dependent on the codon usage of that organism (20). Therefore, AntigenDB provides information about: (i) gene sequence as obtained from GenBank; (ii) codon frequency of genes; (iii) codon bias calculated using Graphical codon usage analyzer (GCUA) (21) and (iv) expression profiles of antigens obtained from databases such as GEO (11).
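Codon frequency, item (ii) above, is straightforward to compute from a coding sequence. The Python sketch below counts codons in an in-frame gene and reports each as a frequency per thousand codons, a common way of tabulating codon usage. It is an illustrative stand-in and does not reproduce the codon bias statistics calculated by GCUA; the function name is hypothetical.

```python
from collections import Counter
from typing import Dict


def codon_frequencies(gene: str) -> Dict[str, float]:
    """Per-thousand frequency of each codon in an in-frame coding sequence."""
    seq = gene.upper().replace("U", "T")      # accept RNA or DNA letters
    usable = len(seq) - len(seq) % 3          # ignore a trailing partial codon
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    if not codons:
        raise ValueError("sequence shorter than one codon")
    counts = Counter(codons)
    return {codon: 1000.0 * n / len(codons) for codon, n in counts.items()}
```

For the toy sequence "ATGGCCGCCTAA" (four codons) the result assigns 500 per thousand to GCC and 250 per thousand each to ATG and TAA.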
Post-translational modification
PTMs affect the expression and function of antigens. Therefore different types of PTMs are covered in the database. The major PTMs covered include: (i) N/O/C/S-Glycosylation; (ii) Phosphorylation; (iii) Amidation site; (iv) N-Myristoylation; (v) Tyrosine Sulfation and (vi) Methylation and other PTMs. These are compiled from the literature as well as specialized databases such as dbPTM, PhosphoELM, and RESID (22–24).
IMPLEMENTATION
Currently, AntigenDB contains an extensive collection of proteins, glycoproteins and lipoproteins (in excess of 500), extracted from 44 important pathogenic species. These cover the following major groups: Mycobacterium, Influenza A virus, Helicobacter, Bacillus, Brucella, Clostridium, Hepatitis, Plasmodium, Streptococcus, Yersinia and Vibrio (Figure 2).
AntigenDB entries are cross-linked to a variety of key databases, such as SwissProt, GenBank, IMGT (25), SYFPEITHI (26), AntiJen, DBSubLoc, PSORT and the PDB. Several web tools are integrated into AntigenDB for the extraction and analysis of antigens. These include searching the AntigenDB database, the analysis of antigen data through mapping, and data submission.
Data search and analysis
AntigenDB is complemented by an array of tools, which facilitate further analysis of antigens (Figure 3). The database can be searched using antigen name, SwissProt accession number, PDB ID or organism name. Users can extract information from the database in two ways: (i) via keyword search and (ii) via browsing.
Keyword search
The keyword search allows a user to search for antigens using the five-digit AntigenDB entry number, antigen name or SwissProt accession number, or by entering a source microorganism to list all the available antigens present in that pathogen. A query can be filtered further by selecting the antigen type (protein, carbohydrate or lipid) or by selecting a specific organism from the drop-down menu. A query returns a tabular output whose format depends upon the fields selected initially, with an option of exporting the search result as a data file.
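The keyword search with an optional antigen-type filter can be sketched as a parameterized SQL query. The example below is in Python and uses an in-memory SQLite database purely so that it is self-contained; AntigenDB itself uses PostgreSQL with PERL CGI scripts, and the table name, column names and demo rows here are hypothetical.

```python
import sqlite3
from typing import List, Optional, Tuple


def build_demo_db() -> sqlite3.Connection:
    """Create a tiny in-memory table with made-up antigen rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE antigen (entry_id TEXT PRIMARY KEY, name TEXT, "
        "antigen_type TEXT, organism TEXT)"
    )
    conn.executemany(
        "INSERT INTO antigen VALUES (?, ?, ?, ?)",
        [
            ("00001", "demo antigen A", "protein", "Mycobacterium tuberculosis"),
            ("00002", "demo antigen B", "lipid", "Mycobacterium tuberculosis"),
            ("00003", "demo antigen C", "protein", "Plasmodium falciparum"),
        ],
    )
    return conn


def keyword_search(conn: sqlite3.Connection, keyword: str,
                   antigen_type: Optional[str] = None) -> List[Tuple]:
    """Match the keyword against entry number, name or organism; optionally filter by type."""
    sql = ("SELECT entry_id, name, antigen_type, organism FROM antigen "
           "WHERE (entry_id = ? OR name LIKE ? OR organism LIKE ?)")
    params = [keyword, f"%{keyword}%", f"%{keyword}%"]
    if antigen_type is not None:
        sql += " AND antigen_type = ?"
        params.append(antigen_type)
    sql += " ORDER BY entry_id"
    return conn.execute(sql, params).fetchall()


conn = build_demo_db()
rows = keyword_search(conn, "Mycobacterium", antigen_type="protein")
```

Binding the user's keyword through placeholders, rather than pasting it into the SQL string, keeps a web-facing search form safe from SQL injection.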
Browsing
Browsing lets a user list antigens by selecting the organism of interest and the output fields. It returns all the available antigens present for that pathogen.
Antigenic BLAST
The Antigenic BLAST page allows users to search a query protein sequence against the AntigenDB database for the purposes of sequence comparison. The standalone WWW-BLAST program is integrated into AntigenDB for this purpose, with a customizable weight matrix and E-value threshold.
Peptide mapping
The peptide-mapping tool allows users to ask whether a query protein contains any already known antigenic epitopes. The tool scans all the epitopes present in the AntigenDB database against the query protein and returns the starting and ending positions of experimentally defined epitopes found in the query protein sequence. It also links to the AntigenDB antigens from which the returned epitopes were derived.
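The peptide-mapping step amounts to locating each stored epitope within the query sequence. A minimal Python sketch is given below, assuming exact string matching and 1-based inclusive positions; the function name, the epitope-to-antigen dictionary and the sequences are hypothetical, and the actual AntigenDB tool may match differently.

```python
from typing import Dict, List, Tuple


def map_epitopes(query: str, epitopes: Dict[str, str]) -> List[Tuple[str, str, int, int]]:
    """Return (epitope, source antigen ID, start, end) for every exact occurrence.

    Positions are 1-based and inclusive. `epitopes` maps an epitope
    sequence to the ID of the antigen it was derived from.
    """
    query = query.upper()
    hits = []
    for epitope, antigen_id in epitopes.items():
        pep = epitope.upper()
        if not pep:
            continue                               # skip empty entries
        pos = query.find(pep)
        while pos != -1:
            hits.append((pep, antigen_id, pos + 1, pos + len(pep)))
            pos = query.find(pep, pos + 1)         # allow overlapping occurrences
    return sorted(hits, key=lambda h: (h[2], h[3]))


# Made-up query and epitope set.
known = {"AYIA": "00001", "KQR": "00002", "WWW": "00003"}
mapped = map_epitopes("MKTAYIAKQRAYIA", known)
```

In this toy case the epitope AYIA is reported twice (positions 4 to 7 and 11 to 14) and KQR once (positions 8 to 10), each with the ID of its source antigen.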
Epitope search
The epitope search enables users to ask whether a query epitope sequence is present in known antigens within the database. This tool searches for exact or similar epitopes in the database. It returns AntigenDB antigens in which such epitopes are present.
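A search for "exact or similar" epitopes can be sketched as a sliding-window comparison that tolerates a small number of mismatches. In the Python example below, similarity is defined as the number of mismatched positions between the query and an equal-length window of the antigen; this definition, the function names and the sequences are assumptions, since the similarity criterion used by AntigenDB is not specified here.

```python
from typing import Dict, List, Tuple


def mismatches(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def epitope_search(query: str, antigens: Dict[str, str],
                   max_mismatches: int = 1) -> List[Tuple[str, int, int]]:
    """Return (antigen ID, 1-based start, mismatch count) for each matching window.

    With max_mismatches=0 only exact occurrences are returned.
    """
    query = query.upper()
    k = len(query)
    results: List[Tuple[str, int, int]] = []
    if k == 0:
        return results
    for antigen_id, seq in antigens.items():
        seq = seq.upper()
        for start in range(len(seq) - k + 1):
            d = mismatches(query, seq[start:start + k])
            if d <= max_mismatches:
                results.append((antigen_id, start + 1, d))
    # Best matches first, then by antigen ID and position.
    return sorted(results, key=lambda r: (r[2], r[0], r[1]))


# Made-up antigen sequences keyed by hypothetical entry numbers.
demo_antigens = {"00001": "MKTAYIAKQR", "00002": "GGAYLAGG"}
found = epitope_search("AYIA", demo_antigens, max_mismatches=1)
```

Here the query is found exactly in entry 00001 at position 4 and, with one substitution, in entry 00002 at position 3.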
Data submission
The online data submission tool allows users to submit a newly identified antigen not present in the AntigenDB database. Submitted antigens are included in AntigenDB after validation.
DISCUSSION AND CONCLUSIONS
AntigenDB should prove of value to a variety of researchers: immunoinformaticians developing new and enhanced methods for the prediction of antigens (27,28); vaccine development scientists searching for antigens in newly defined pathogens or novel candidate vaccine antigens in well known pathogenic microorganisms and microbiologists analyzing virulence mechanisms in pathogenic microorganisms, amongst others.
AntigenDB is a repository for experimentally determined antigens. It is the first of its kind; in contrast to existing epitope-orientated databases, AntigenDB emphasizes experimentally verified antigens, irrespective of whether epitope information is known or unknown. The database provides useful analysis tools able to search and map an unknown protein for similarity to known antigens or experimentally determined epitopes. We have undertaken, and continue to undertake, exhaustive literature searches in order to cover as many antigens as possible, yet it remains possible that certain antigens are missing from the database. Initially, we intended to cover all pathogenic species, but realized quickly that this was an impractical approach. Instead, we have selected species with an impact on global health and antigens that are available in the literature and other sources. Several species were also selected whose close relatives were already present in the database. To enable faster database access, we have created a mirror site at http://www.bic.uams.edu/raghava/antigendb/. The current database mainly contains protein antigens. This is influenced by the easy availability of such antigens within the current literature.
We anticipate that this thorough and comprehensive database will be extended towards effective completeness and then maintained, with its content expanded and enhanced search and analysis features added on a rolling basis. For example, in the next release of AntigenDB we will add, where available, experimental data derived from the literature and archived microarray experiments relating to the expression of antigens. Certain important viral pathogens (such as Hepatitis A, Hepatitis B, Hepatitis E, Herpes simplex virus and Human adenovirus) and other antigen types will be included in future releases of the database.
SUPPLEMENTARY DATA
Supplementary Data are available at NAR Online.
FUNDING
Council of Scientific and Industrial Research (CSIR). Funding for open access charge: CSIR.
Conflict of interest statement. None declared.
ACKNOWLEDGEMENTS
While this project has not been funded directly by it, we gratefully acknowledge the Biotechnology and Biological Sciences Research Council (BBSRC) grant India PA 1713. Dr Flower is a Jenner Investigator supported by the Jenner Foundation.
REFERENCES
1. Pashine A, Valiante NM, Ulmer JB. Targeting the innate immune response with improved vaccine adjuvants. Nat. Med. 2005;11:S63–S68. doi: 10.1038/nm1210.
2. Lata S, Raghava GPS. PRRDB: a comprehensive database of Pattern-Recognition Receptors and their ligands. BMC Genomics. 2008;9:180. doi: 10.1186/1471-2164-9-180.
3. Bayry J, Tchilian EZ, Davies MN, Forbes EK, Draper SJ, Kaveri SV, Hill AV, Kazatchkine MD, Beverley PC, Flower DR, et al. In silico identified CCR4 antagonists target regulatory T cells and exert adjuvant activity in vaccination. Proc. Natl Acad. Sci USA. 2008;105:10221–10226. doi: 10.1073/pnas.0803453105.
4. Peters B, Sidney J, Bourne P, Bui HH, Buus S, Doh G, Fleri W, Kronenberg M, Kubo R, Lund O, et al. The immune epitope database and analysis resource: from vision to blueprint. PLoS Biol. 2005;3:e91. doi: 10.1371/journal.pbio.0030091.
5. Bhasin M, Singh H, Raghava GP. MHCBN: a comprehensive database of MHC binding and non-binding peptides. Bioinformatics. 2003;19:665–666. doi: 10.1093/bioinformatics/btg055.
6. Toseland CP, Clayton DJ, McSparron H, Hemsley SL, Blythe MJ, Paine K, Doytchinova IA, Guan P, Hattotuwagama CK, Flower DR. AntiJen: a quantitative immunology database integrating functional, thermodynamic, kinetic, biophysical, and cellular data. Immunome Res. 2005;1:4. doi: 10.1186/1745-7580-1-4.
7. Saha S, Bhasin M, Raghava GP. Bcipep: a database of B-cell epitopes. BMC Genomics. 2005;6:79. doi: 10.1186/1471-2164-6-79.
8. Boeckmann B, Bairoch A, Apweiler R, Blatter MC, Estreicher A, Gasteiger E, Martin MJ, Michoud K, O'Donovan C, Phan I, et al. The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res. 2003;31:365–370. doi: 10.1093/nar/gkg095.
9. Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Sayers EW. GenBank. Nucleic Acids Res. 2009;37:D26–D31. doi: 10.1093/nar/gkn723.
10. Bernstein FC, Koetzle TF, Williams GJ, Meyer EF, Jr, Brice MD, Rodgers JR, Kennard O, Shimanouchi T, Tasumi M. The Protein Data Bank: a computer-based archival file for macromolecular structures. J. Mol. Biol. 1977;112:535–542. doi: 10.1016/s0022-2836(77)80200-3.
11. Barrett T, Troup DB, Wilhite SE, Ledoux P, Rudnev D, Evangelista C, Kim IF, Soboleva A, Tomashevsky M, Edgar R. NCBI GEO: mining tens of millions of expression profiles–database and tools update. Nucleic Acids Res. 2007;35:D760–D765. doi: 10.1093/nar/gkl887.
12. Ahmad S, Gromiha M, Fawareh H, Sarai A. ASAView: database and tool for solvent accessibility representation in proteins. BMC Bioinformatics. 2004;5:51. doi: 10.1186/1471-2105-5-51.
13. Kopp J, Schwede T. The SWISS-MODEL Repository of annotated three-dimensional protein structure homology models. Nucleic Acids Res. 2004;32:D230–D234. doi: 10.1093/nar/gkh008.
14. Kabsch W, Sander C. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers. 1983;22:2577–2637. doi: 10.1002/bip.360221211.
15. Saha S, Raghava GPS. Prediction of continuous B-cell epitopes in an antigen using recurrent neural network. Proteins. 2006;65:40–48. doi: 10.1002/prot.21078.
16. Guo T, Hua S, Ji X, Sun Z. DBSubLoc: database of protein subcellular localization. Nucleic Acids Res. 2004;32:D122–D124. doi: 10.1093/nar/gkh109.
17. Rey S, Acab M, Gardy JL, Laird MR, deFays K, Lambert C, Brinkman FS. PSORTdb: a protein subcellular localization database for bacteria. Nucleic Acids Res. 2005;33:D164–D168. doi: 10.1093/nar/gki027.
18. Apweiler R, Attwood TK, Bairoch A, Bateman A, Birney E, Biswas M, Bucher P, Cerutti L, Corpet F, Croning MD, et al. The InterPro database, an integrated documentation resource for protein families, domains and functional sites. Nucleic Acids Res. 2001;29:37–40. doi: 10.1093/nar/29.1.37.
19. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389–3402. doi: 10.1093/nar/25.17.3389.
20. Manoj S, Babiuk LA, van Drunen Littel-van den Hurk S. Approaches to enhance the efficacy of DNA vaccines. Crit. Rev. Clin. Lab. Sci. 2004;41:1–39. doi: 10.1080/10408360490269251.
21. McInerney JO. GCUA: general codon usage analysis. Bioinformatics. 1998;14:372–373. doi: 10.1093/bioinformatics/14.4.372.
22. Lee TY, Huang HD, Hung JH, Huang HY, Yang YS, Wang TH. dbPTM: an information repository of protein post-translational modification. Nucleic Acids Res. 2006;34:D622–D627. doi: 10.1093/nar/gkj083.
23. Diella F, Cameron S, Gemund C, Linding R, Via A, Kuster B, Sicheritz-Ponten T, Blom N, Gibson TJ. Phospho.ELM: a database of experimentally verified phosphorylation sites in eukaryotic proteins. BMC Bioinformatics. 2004;5:79. doi: 10.1186/1471-2105-5-79.
24. Garavelli JS. The RESID Database of Protein Modifications as a resource and annotation tool. Proteomics. 2004;4:1527–1533. doi: 10.1002/pmic.200300777.
25. Lefranc MP, Giudicelli V, Ginestoux C, Jabado-Michaloud J, Folch G, Bellahcene F, Wu Y, Gemrot E, Brochet X, Lane J, et al. IMGT, the international ImMunoGeneTics information system. Nucleic Acids Res. 2009;37:D1006–D1012. doi: 10.1093/nar/gkn838.
26. Rammensee H, Bachmann J, Emmerich NP, Bachor OA, Stevanovic S. SYFPEITHI: database for MHC ligands and peptide motifs. Immunogenetics. 1999;50:213–219. doi: 10.1007/s002510050595.
27. Doytchinova IA, Flower DR. Identifying candidate subunit vaccines using an alignment-independent method based on principal amino acid properties. Vaccine. 2007;25:856–866. doi: 10.1016/j.vaccine.2006.09.032.
28. Doytchinova IA, Flower DR. VaxiJen: a server for prediction of protective antigens, tumour antigens and subunit vaccines. BMC Bioinformatics. 2007;8:4. doi: 10.1186/1471-2105-8-4.