Abstract
CharProtDB (http://www.jcvi.org/charprotdb/) is a curated database of biochemically characterized proteins. It provides a source of direct rather than transitive assignments of function, designed to support automated annotation pipelines. The initial data set in CharProtDB was collected through manual literature curation over the years by analysts at the J. Craig Venter Institute (JCVI) [formerly The Institute for Genomic Research (TIGR)] as part of their prokaryotic genome sequencing projects. CharProtDB has been expanded by import of selected records from publicly available protein collections whose biocuration indicated direct rather than homology-based assignment of function. Annotations in CharProtDB include gene name, symbol and various controlled vocabulary terms, including Gene Ontology terms, Enzyme Commission number and TransportDB accession. Each annotation is referenced with the source: ideally a journal reference or, if imported and lacking one, the original database source.
INTRODUCTION
The process of biocuration can create a set of high-confidence annotations for a protein, separately asserting molecular function, preferred nomenclature for protein name and for gene symbol and assignment to one or more biological processes. Each of these annotations may be exploited for different purposes, such as supporting machine annotation of newly sequenced genomes or decorating nodes in multiple sequence alignment-based phylogenetic trees for protein functional inference (1). The advent of next-generation sequencing technologies and cheaper sequencing costs during the past decade has paved the way for sequencing a vast variety of genomes; completed prokaryotic genomes now number in the low thousands. This new abundance places every characterized protein into (often implicit) protein families and sets the stage for comparative genomics studies. Protein family co-occurrence across multiple taxa (phylogenetic profiling), conserved gene neighborhoods and metabolic context derived from pathway reconstruction can provide extensive guidance for the tricky process of using one characterized protein to annotate another. Previous attempts to select blanket generalizations such as prediction of equivalent enzymatic function at 50% identity or greater are too strict for some protein families, and too permissive for others (2). It is likely that new generations of annotation tools will use comparative genomics, libraries of prebuilt protein clusters and improved statistical models to achieve more accurate machine annotation directly from characterized proteins than has been possible from reliance on legacy annotation sets of mixed but unknown provenance.
Protein functional annotations deposited in public databases often represent inference by greatest sequence similarity to a protein with an ostensibly informative name and themselves lack traceable origins. These protein sequences then become the fundamental source for further BLAST-based propagation of protein functional assignments. Multiple types of ‘transitive annotation error’ can occur during such propagation of putative function, including overly specific annotation (3), founder effects that obscure functional diversity in large families such as radical SAM (4), daisy-chain inference that passes through non-overlapping regions of a multidomain protein (5) and faults from successive rounds of reinterpretation of an original protein name. Protein functional inference through computation will benefit, in the future, from increasingly deep comparative genomics resources. Conserved gene neighborhoods, pathway reconstructions and hole filling, multiple sequence alignments and molecular phylogenetic trees, identification of orthologs and paralogs and other data-driven techniques will help propagate information with improving reach and accuracy. The sparse resource of proteins whose functions are known from direct laboratory characterizations will continue to grow in importance.
Anticipating that next-generation annotation tools will need to track which sequences carry primary annotations and to compute confidences during propagation, we have created a database architecture for representing experimentally derived protein characterizations, in which the original source of individual annotation fields is included. Gene Ontology (GO) (6) terms for both molecular function and biological process are presented with both provenance and GO evidence codes (ev-codes) to facilitate their use in machine annotation. We have established two methods for populating CharProtDB: manually, as a synergistic benefit of biocuration of prokaryotic genomes, and by import from various publicly accessible resources after filtering, processing, validation and consolidation of GO term assignments.
HISTORY
CharProtDB arose from the need for high-quality annotations in the prokaryotic annotation projects at the J. Craig Venter Institute (JCVI). It began as a simple listing of characterized accessions, each with a standardized name, but grew to include the annotation types listed in the content section (below). The primary emphasis at first was on experimentally characterized proteins useful for annotating prokaryotic pathogens, with a special focus on characterizations relevant for Escherichia coli, Burkholderia, Bacillus and Clostridium. The use of CharProtDB within JCVI automated annotation pipelines necessitated importing additional protein data sets with experimental evidence codes, especially from model organism databases.
DATABASE
The central unit of CharProtDB is the protein record. Each protein record in CharProtDB (Figure 1) must have an assigned organism (by taxon ID), and at least one public accession, protein name and GO annotation complete with an experimental evidence code and an associated reference. The protein may also have one or more gene symbols assigned. Additional synonymous accessions are added to proteins, either automatically as the proteins are entered, or manually by curators. These synonymous accessions, in the context of CharProtDB, are limited to public accessions with both identical sequence and taxon ID.
Figure 1.
Detailed view of individual protein record.
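The minimum content of a protein record described above can be sketched as a small validation check. This is an illustrative sketch only, not the CharProtDB schema: the class names, field names and the `validation_errors` helper are assumptions, and the example accession and reference are placeholders. The set of evidence codes lists the experimental codes defined by the GO Consortium.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Experimental evidence codes as defined by the GO Consortium.
EXPERIMENTAL_EV_CODES = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"}


@dataclass
class GoAnnotation:
    go_id: str                # GO term identifier, e.g. "GO:0003674"
    ev_code: str              # GO evidence code
    reference: Optional[str]  # journal reference or source database


@dataclass
class ProteinRecord:
    # Hypothetical field names; the real schema is not given in the text.
    taxon_id: Optional[int]
    accessions: List[str] = field(default_factory=list)
    names: List[str] = field(default_factory=list)
    go_annotations: List[GoAnnotation] = field(default_factory=list)
    gene_symbols: List[str] = field(default_factory=list)  # optional


def validation_errors(rec: ProteinRecord) -> List[str]:
    """Return the list of unmet requirements; empty means the record passes."""
    errors = []
    if rec.taxon_id is None:
        errors.append("missing taxon ID")
    if not rec.accessions:
        errors.append("no public accession")
    if not rec.names:
        errors.append("no protein name")
    # At least one GO term needs an experimental ev-code AND a reference.
    has_experimental_go = any(
        a.ev_code in EXPERIMENTAL_EV_CODES and a.reference
        for a in rec.go_annotations
    )
    if not has_experimental_go:
        errors.append("no referenced GO annotation with experimental evidence")
    return errors


example = ProteinRecord(
    taxon_id=562,
    accessions=["EXAMPLE_0001"],        # placeholder accession
    names=["example protein"],
    go_annotations=[GoAnnotation("GO:0003674", "IDA", "PMID:0000000")],
)
print(validation_errors(example))
```

Returning a list of problems rather than raising on the first failure lets a curator see every missing field at once.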
Annotating with the GO system is important for several reasons. The GO system captures defined concepts (the GO terms) with unique IDs, which can be attached to specific genes. The three controlled vocabularies of the GO also allow the capture of much more annotation information than is traditionally captured in protein common names, including, for example, not just the function of the protein but also its location. GO evidence codes implemented in CharProtDB correspond directly to the GO Consortium definitions of experimental evidence codes (6).
Beyond GO annotations, the protein may be assigned one or more controlled vocabulary terms for enzyme functional classification, as Enzyme Commission (EC) numbers (http://www.chem.qmul.ac.uk/iubmb/) (7), or transporter functional classification, as Transport Classification (TC) numbers (8). Except for GO assignments, which must have a reference, any or all of these annotations may be linked to a reference. If a record was imported from an external database, the annotations coming from that database will be referenced back to the original source. Any additional references found, including those not directly linked to an annotation, will be attached to the applicable protein(s).
To support the use of CharProtDB in automated annotation, one protein name, one gene symbol and one or more GO terms, EC or TC annotations are marked as ‘primary’. These represent the preferred choices for assignment to a predicted gene being automatically annotated. Apart from the primary annotations, CharProtDB also stores alternate protein names (synonyms) and alternate gene symbols.
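As an illustration of how a pipeline might consume the ‘primary’ flag, the sketch below collects the primary values per annotation type and rejects records with more than one primary name or gene symbol. The dictionary layout (`type`, `value`, `primary`) and the function name are assumptions made for this example, not the CharProtDB data model.

```python
from typing import Dict, List


def primary_annotations(annotations: List[Dict]) -> Dict[str, List[str]]:
    """Group the values flagged as primary by annotation type."""
    selected: Dict[str, List[str]] = {}
    for ann in annotations:
        if ann.get("primary"):
            selected.setdefault(ann["type"], []).append(ann["value"])
    # Only one primary protein name and one primary gene symbol are allowed;
    # GO, EC and TC annotations may have several primary values.
    for single in ("protein_name", "gene_symbol"):
        if len(selected.get(single, [])) > 1:
            raise ValueError(f"more than one primary {single}")
    return selected


# Illustrative values only.
annotations = [
    {"type": "protein_name", "value": "beta-galactosidase", "primary": True},
    {"type": "protein_name", "value": "lactase", "primary": False},
    {"type": "gene_symbol", "value": "lacZ", "primary": True},
    {"type": "ec_number", "value": "3.2.1.23", "primary": True},
]
print(primary_annotations(annotations))
```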
CONTENT
Data sources
The core of CharProtDB is a collection of prokaryotic proteins manually curated at JCVI. To that, we have added entries from the following databases that show explicit reference to a physical characterization: UniProtKB (9), EcoCyc (10), TCDB (Transporters) (8), MGOS (Magnaporthe oryzae) (11), AspGD (Aspergillus) (12), CGD (Candida albicans) (13) and GeneDB (Schizosaccharomyces pombe) (14).
To each of these, we have added characterized data from the GO Associations database. These entries have been flagged as being either fully characterized, or characterized for only one of the base GO assignments: process, function or component. At a lower level of confidence, we have added records from the above databases that have been marked as curated, but do not have biochemical characterizations (Table 1). We have developed an extensive list of controlled vocabulary terms that indicate the level of characterization for a protein record and the source database from which it has been imported. A complete list of such ‘status’ terms is given in Table 1. We have begun adding records for proteins that may not have been functionally characterized through biochemical experiments, but only structurally through crystallography or proteomics, or whose functional assertions have been made through bioinformatics analysis (15).
Table 1.
CharProtDB protein assertions
| Status | Description | Proteins |
|---|---|---|
| curated | Proteins manually annotated by a JCVI annotator that contain both function and process annotation. | 1075 |
| curated_function | Proteins with only functional annotation, added by a JCVI annotator. | 297 |
| curated_process | Proteins with only biological process annotation, added by a JCVI annotator. | 339 |
| curated_component | Proteins with only cellular component annotation, added by a JCVI annotator. | 7 |
| curated_structure | Proteins with only structural annotation (e.g. proteomics or crystallographic data), added by a JCVI annotator. | 39 |
| curated_source | Proteins from a ‘source’ database marked as experimentally validated with added Gene Ontology annotation data. | 6183 |
| trusted_source | Proteins from a ‘source’ database marked as curated but without fully traceable experimental validation with added Gene Ontology annotation data. | 8396 |
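The status terms in Table 1 lend themselves to simple filtering. The sketch below, which assumes each record is a dictionary with a `status` key (a hypothetical layout), drops the lower-confidence `trusted_source` entries and rejects any status not listed in the table.

```python
from typing import Dict, Iterable, List

# Status terms copied from Table 1.
STATUS_TERMS = (
    "curated",
    "curated_function",
    "curated_process",
    "curated_component",
    "curated_structure",
    "curated_source",
    "trusted_source",
)


def exclude_trusted(records: Iterable[Dict]) -> List[Dict]:
    """Keep every record except those with status 'trusted_source'."""
    kept = []
    for rec in records:
        status = rec["status"]
        if status not in STATUS_TERMS:
            raise ValueError(f"unknown status term: {status}")
        if status != "trusted_source":
            kept.append(rec)
    return kept


# Placeholder accessions for illustration.
records = [
    {"accession": "EXAMPLE_A", "status": "curated"},
    {"accession": "EXAMPLE_B", "status": "trusted_source"},
    {"accession": "EXAMPLE_C", "status": "curated_source"},
]
print([r["accession"] for r in exclude_trusted(records)])
```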
CharProtDB tools can link characterization data from multiple input streams through synonymous accessions or direct sequence identity. CharProtDB can represent multiple characterizations of the same protein, with proper attribution and links to database sources. As of publication, CharProtDB contains 16 046 proteins from 1588 species; 9185 proteins are bacterial in origin, with about one-third from each of Enterobacteriales and Bacillales; 5238 are eukaryotic in origin, with over three-fourths of those being fungal proteins; 931 proteins are viral in origin, primarily bacteriophage. Only 622 are archaeal in origin and are almost entirely imported from Swiss-Prot. Because of CharProtDB's origins and use as an internal resource for annotation, the species breakdown strongly reflects projects done at JCVI.
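One way to picture the linking of input streams by sequence identity is to key each incoming entry on its taxon ID and sequence, then accumulate accessions and characterizations under that key. The sketch below is a simplified illustration under that assumption. It does not reproduce the CharProtDB tools, it does not follow pre-existing synonymous accession lists, and all field names and values are hypothetical.

```python
from typing import Dict, List, Tuple


def consolidate(entries: List[Dict]) -> List[Dict]:
    """Merge entries that share both taxon ID and identical sequence."""
    merged: Dict[Tuple[int, str], Dict] = {}
    for entry in entries:
        sequence = entry["sequence"].strip().upper()
        key = (entry["taxon_id"], sequence)
        record = merged.setdefault(
            key,
            {
                "taxon_id": entry["taxon_id"],
                "sequence": sequence,
                "accessions": [],
                "characterizations": [],
            },
        )
        # Accessions with identical sequence and taxon become synonyms.
        if entry["accession"] not in record["accessions"]:
            record["accessions"].append(entry["accession"])
        # Keep every characterization with its own source for attribution.
        record["characterizations"].append(
            {
                "source": entry["source"],
                "annotation": entry["annotation"],
                "reference": entry["reference"],
            }
        )
    return list(merged.values())


# Placeholder entries from two hypothetical input streams.
entries = [
    {"accession": "ACC_A", "taxon_id": 562, "sequence": "MKTAYIAK",
     "source": "JCVI", "annotation": "GO:0003674", "reference": "PMID:0000000"},
    {"accession": "ACC_B", "taxon_id": 562, "sequence": "mktayiak",
     "source": "UniProtKB", "annotation": "GO:0003674", "reference": "UniProtKB"},
]
for rec in consolidate(entries):
    print(rec["accessions"], len(rec["characterizations"]))
```

Keeping each characterization as a separate item, rather than collapsing duplicates, preserves the attribution of every source.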
Database access and interface
CharProtDB is a standalone database that supports searching and retrieval of data using different search terms. A web interface allows users to search by protein name, protein accession, GO term, GO evidence code, gene symbol, EC number, organism name (genus or species), PubMed identifier or a combination of search terms. The complete list of protein records in CharProtDB broken down by taxonomic groups can be viewed on the website.
Data validation
Data imported into CharProtDB is extensively cross-validated, verified and standardized. We have developed several automated data consistency checks to resolve problems related to data discrepancy.
BLAST
Users can search CharProtDB by sequence similarity using BLAST. A BLAST search form on the CharProtDB web interface accepts a user-submitted query sequence and searches it against the entire CharProtDB data set. Likewise, a BLAST utility on each single-protein view page searches that protein's sequence against the entire database. Results are provided in standard BLAST text output, with links in the summary to the alignments and back to the protein details in CharProtDB.
Use of CharProtDB in automated annotation
AutoAnnotate is JCVI's automated prokaryotic functional annotation program, designed for high-throughput annotation of complete and draft bacterial genome sequences (16). Designed to assign ‘heuristic annotation’ controlled by parameters within the pipeline, the program weighs evidence from a ranked list of evidence types and annotates proteins according to molecular function and biological process. It attaches both controlled vocabulary terms, such as GO terms supported by their appropriate GO evidence codes, and more human-readable fields such as the gene/protein name, gene symbol and EC number. AutoAnnotate primarily uses homology-based methods for automatic annotation. Homology evidence to CharProtDB proteins is given highest precedence in the ranking order. AutoAnnotate is the primary functional annotation pipeline adopted by the genome centers of the Human Microbiome Project (HMP) to generate automated annotation of reference genomes (17). We have distributed the CharProtDB data set as part of JCVI's annotation pipeline to all the participating centers.
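The idea of choosing an annotation from a ranked list of evidence types can be sketched as follows. This is not AutoAnnotate's implementation. Only the top position of CharProtDB homology is stated in the text; the other evidence type names, their order and the record layout are hypothetical.

```python
from typing import Dict, List, Optional

# Only the first entry reflects the text; the rest are hypothetical.
EVIDENCE_RANK = [
    "charprotdb_homology",
    "protein_family_model",
    "other_homology",
]


def choose_annotation(evidence: List[Dict]) -> Optional[Dict]:
    """Return the evidence item whose type ranks highest, or None."""
    rank = {name: position for position, name in enumerate(EVIDENCE_RANK)}
    usable = [item for item in evidence if item["type"] in rank]
    if not usable:
        return None
    # Lower position means higher precedence; ties keep the first item seen.
    return min(usable, key=lambda item: rank[item["type"]])


# Illustrative evidence for one predicted gene.
evidence = [
    {"type": "other_homology", "name": "hypothetical protein"},
    {"type": "charprotdb_homology", "name": "beta-galactosidase"},
]
best = choose_annotation(evidence)
print(best["name"] if best else "no usable evidence")
```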
Access
The CharProtDB website can be accessed at http://www.jcvi.org/charprotdb/. CharProtDB is currently freely available for download as Swiss-Prot format records with all annotations, or as sequences only in FASTA format. Users can choose to download any displayed record, or the entire data set.
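A downloaded FASTA file can be loaded with a few lines of code. The sketch below is a generic FASTA reader. It assumes only that the first whitespace-delimited token of each header line serves as an identifier; the actual header layout of the CharProtDB download is not described here, and the example lines are placeholders.

```python
from typing import Dict, Iterable


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each record identifier to its concatenated sequence."""
    sequences: Dict[str, str] = {}
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Store the previous record before starting a new one.
            if header is not None:
                sequences[header] = "".join(chunks)
            parts = line[1:].split()
            header = parts[0] if parts else ""
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        sequences[header] = "".join(chunks)
    return sequences


# Placeholder records; with a real file, pass an open file handle instead.
example_lines = [">EXAMPLE_A example protein", "MKTAYIAK", "QRQISFVK", ">EXAMPLE_B", "MSDE"]
for identifier, sequence in read_fasta(example_lines).items():
    print(identifier, len(sequence))
```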
DISCUSSION AND CONCLUSIONS
CharProtDB is similar in goals to several other biocuration efforts that aim to provide computational access to assertions about experimentally verified protein function. NeXtProt (this issue), a resource for human proteins, is an example of an organism-specific database. Improvements to UniProtKB have made proteins with experimental evidence easier to retrieve by query. COMBREX has begun an effort to enlist community annotators to contribute ‘gold standard’ biocuration of experimentally characterized proteins (18). Nevertheless, much work remains to be done to link experimental characterizations of protein function as reported in the literature with computationally accessible protein sequences, and much of the content of CharProtDB is unique. CharProtDB entries bring together consolidated protein annotations, including sequence, synonymous accessions and GO annotations, for experimentally characterized proteins curated from the scientific literature, a resource we found essential to enable best practices in microbial annotation. The CharProtDB proteins are available to the public as a source of computable objects: a BLAST-ready, freely distributable protein set supported by querying interfaces. Although the set of ‘trusted’ category proteins obtained from external resources does not necessarily have direct experimental validation of function, it expands the collection of validated, certified entries in CharProtDB that can be used to annotate other proteins in a reliable way by automated annotation pipelines. The ‘trusted’ set can be filtered easily from the main curated data set using specific queries, and a prefiltered set with only curated entries is provided for separate download.
FUNDING
National Human Genome Research Institute (NHGRI) (R01 HG004881); National Institute of Allergy and Infectious Diseases (contract HHSN266200100038C). Funding for open access charge: NHGRI.
Conflict of interest statement. None declared.
ACKNOWLEDGEMENTS
The authors would like to thank past and present colleagues at the JCVI Bioinformatics and Information Technology departments for scientific contributions and technical support including Peter Rosanelli and Su Qi.
REFERENCES
- 1. Engelhardt BE, Jordan MI, Muratore KE, Brenner SE. Protein molecular function prediction by Bayesian phylogenomics. PLoS Comput. Biol. 2005;1:e45. doi: 10.1371/journal.pcbi.0010045.
- 2. Rost B. Enzyme function less conserved than anticipated. J. Mol. Biol. 2002;26:312–318. doi: 10.1016/S0022-2836(02)00016-5.
- 3. Louie B, Higdon R, Kolker E. A statistical model of protein sequence similarity and function similarity reveals overly-specific function predictions. PLoS One. 2009;4:e7546. doi: 10.1371/journal.pone.0007546.
- 4. Haft DH, Basu MK. Biological systems discovery in silico: radical S-adenosylmethionine protein families and their target peptides for posttranslational modification. J. Bacteriol. 2011;193:2745–2755. doi: 10.1128/JB.00040-11.
- 5. Galperin MY, Koonin EV. Sources of systematic error in functional annotation of genomes: domain rearrangement, non-orthologous gene displacement and operon disruption. In Silico Biol. 1998;1:55–67.
- 6. Gene Ontology Consortium. The Gene Ontology in 2010: extensions and refinements. Nucleic Acids Res. 2010;38:D331–D335. doi: 10.1093/nar/gkp1018.
- 7. McDonald AG, Boyce S, Tipton KF. ExplorEnz: the primary source of the IUBMB enzyme list. Nucleic Acids Res. 2009;37:D593–D597. doi: 10.1093/nar/gkn582.
- 8. Saier MH, Jr, Noto K, Tamang DG, Elkan C. The Transporter Classification Database: recent advances. Nucleic Acids Res. 2009;37:D274–D278. doi: 10.1093/nar/gkn862.
- 9. The UniProt Consortium. Ongoing and future developments at the Universal Protein Resource. Nucleic Acids Res. 2011;39:D214–D219. doi: 10.1093/nar/gkq1020.
- 10. Keseler IM, Santos-Zavaleta A, Peralta-Gil M, Gama-Castro S, Muñiz-Rascado L, Bonavides-Martinez C, Paley S, Krummenacker M, Altman T, Kaipa P, et al. EcoCyc: a comprehensive database of Escherichia coli biology. Nucleic Acids Res. 2011;39:D583–D590. doi: 10.1093/nar/gkq1143.
- 11. Soderlund CHK, Pampanwar V, Ebbole D, Farman M, Orbach MJ, Wang GL, Wing R, Xu JR, Brown D, Mitchell T, et al. MGOS: a resource for studying Magnaporthe grisea and Oryza sativa interactions. Mol. Plant Microb. Interact. 2006;19:1055–1061. doi: 10.1094/MPMI-19-1055.
- 12. Arnaud MB, Costanzo MC, Crabtree J, Inglis DO, Lotia A, Orvis J, Shah P, Skrzypek MS, Binkley G, Miyasato SR, et al. The Aspergillus Genome Database, a curated comparative genomics resource for gene, protein and sequence information for the Aspergillus research community. Nucleic Acids Res. 2010;38:D420–D427. doi: 10.1093/nar/gkp751.
- 13. Skrzypek MS, Costanzo MC, Inglis DO, Shah P, Binkley G, Miyasato SR, Sherlock G. New tools at the Candida Genome Database: biochemical pathways and full-text literature search. Nucleic Acids Res. 2010;38:D428–D432. doi: 10.1093/nar/gkp836.
- 14. Aslett M. Gene Ontology annotation status of the fission yeast genome: preliminary coverage approaches 100%. Yeast. 2006;23:913–919. doi: 10.1002/yea.1420.
- 15. Selengut JD, Davidsen T, Ganapathy A, Gwinn-Giglio M, Nelson WC, Richter AR, White O. TIGRFAMs and Genome Properties: tools for the assignment of molecular function and biological process in prokaryotic genomes. Nucleic Acids Res. 2007;35:D260–D264. doi: 10.1093/nar/gkl1043.
- 16. Davidsen T, Ganapathy A, Montgomery R, Zafar N, Yang Q, Madupu R, Goetz P, Galinsky K, White O, Sutton G. The comprehensive microbial resource. Nucleic Acids Res. 2010;38:D340–D345. doi: 10.1093/nar/gkp912.
- 17. Nelson KE, Weinstock GM, Highlander SK, Worley KC, Creasy HH, Wortman JR, Rusch DB, Mitreva M, Sodergren E, Chinwalla AT, et al. A catalog of reference genomes from the human microbiome. Science. 2010;328:994–999. doi: 10.1126/science.1183605.
- 18. Roberts RJ, Chang YC, Hu Z, Rachlin JN, Anton BP, Pokrzywa RM, Choi HP, Faller LL, Guleria J, Housman G, et al. COMBREX: a project to accelerate the functional annotation of prokaryotic genomes. Nucleic Acids Res. 2011;39:D11–D14. doi: 10.1093/nar/gkq1168.

