Abstract
The Saccharomyces Genome Database (SGD; http://www.yeastgenome.org/) collects and organizes biological information about the chromosomal features and gene products of the budding yeast Saccharomyces cerevisiae. Although published data from traditional experimental methods are the primary sources of evidence supporting Gene Ontology (GO) annotations for a gene product, high-throughput experiments and computational predictions can also provide valuable insights in the absence of an extensive body of literature. Therefore, GO annotations available at SGD now include high-throughput data as well as computational predictions provided by the GO Annotation Project (GOA UniProt; http://www.ebi.ac.uk/GOA/). Because the annotation method used to assign GO annotations varies by data source, GO resources at SGD have been modified to distinguish data sources and annotation methods. In addition to providing information for genes that have not been experimentally characterized, GO annotations from independent sources can be compared to those made by SGD to help keep the literature-based GO annotations current.
INTRODUCTION
Since 2001, the Saccharomyces Genome Database (SGD; http://www.yeastgenome.org/) has used the Gene Ontology (GO) to annotate gene products in the budding yeast Saccharomyces cerevisiae (1,2). GO consists of three sets of structured, controlled vocabularies, also known as ontologies: the Molecular Function ontology describes the activities of gene products; the Biological Process ontology places molecular functions in a biological context; and the Cellular Component ontology describes the subcellular localizations of gene products (3). The selection of a GO term from one of these ontologies to annotate a gene product must be supported by a reference, such as a peer-reviewed research article or an abstract, as well as by an evidence code that describes the type of evidence present in that reference (4).
At SGD, results from traditional experimental methods published in the scientific literature are the primary sources of evidence used to support the GO annotation of gene products. If no experimental data are available for a gene, it is annotated to the terms ‘biological_process’, ‘molecular_function’ or ‘cellular_component’ (the root terms of the three ontologies) with the evidence code ‘ND’ to indicate there are ‘No Biological Data Available’. While this does not describe the biology of the gene product, it indicates that no experimental results are available in the published literature at the time of annotation (Table 1). Using this curatorial process, every S. cerevisiae gene product has been assigned at least one GO term in each of the three ontologies since 2003.
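The fallback rule described above can be sketched in a few lines of Python. This is a minimal illustration, not SGD's curation software: the function name `fill_missing_aspects` and the dictionary layout are hypothetical, chosen only for this example. The three identifiers are the standard GO IDs of the root terms.

```python
# Root terms of the three ontologies, keyed by the one-letter aspect code.
ROOT_TERMS = {
    "P": ("GO:0008150", "biological_process"),
    "F": ("GO:0003674", "molecular_function"),
    "C": ("GO:0005575", "cellular_component"),
}


def fill_missing_aspects(annotations):
    """Return the annotations plus an 'ND' root-term entry for every
    ontology in which the gene product has no annotation yet.

    `annotations` is a list of dicts with the keys 'aspect', 'go_id'
    and 'evidence' (a hypothetical layout assumed for this sketch).
    """
    covered = {a["aspect"] for a in annotations}
    result = list(annotations)
    for aspect, (go_id, _name) in ROOT_TERMS.items():
        if aspect not in covered:
            result.append({"aspect": aspect, "go_id": go_id, "evidence": "ND"})
    return result


# Illustrative input: one localization annotation and nothing else.
example = [{"aspect": "C", "go_id": "GO:0005634", "evidence": "IDA"}]
filled = fill_missing_aspects(example)
```

In this example the gene product keeps its Cellular Component annotation and receives ND placeholders for Biological Process and Molecular Function, so it has at least one term in each ontology.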
Table 1.
Annotation method | Data source (No. of annotations) | Evidence code |
---|---|---|
Manually curated* | SGD (35 684); UniProt (93); MGI (8) | IDA: Inferred from Direct Assay; IGI: Inferred from Genetic Interaction; IMP: Inferred from Mutant Phenotype; IPI: Inferred from Physical Interaction; IEP: Inferred from Expression Pattern; ISS: Inferred from Sequence/Structural Similarity; IC: Inferred by Curator; RCA: Reviewed Computational Analysis; NAS: Non-traceable Author Statement; TAS: Traceable Author Statement; ND: No Biological Data Available |
High-throughput* | SGD (4203) | IDA, IMP, IGI, IPI, IEP |
Computational | UniProt (30 959) | IEA: Inferred from Electronic Annotation |
*Annotations generated by the manually curated and high-throughput methods are available from the GO Consortium (http://www.geneontology.org/GO.current.annotations.shtml). The total numbers of annotations are current as of September 2007. Numbers of manually curated annotations from GOA UniProt are cumulative since the January 2007 GOA UniProt data release. Because GOA UniProt compiles GO annotations from many sources, the manually curated annotations obtained through it are assigned either by GOA UniProt itself or by the Mouse Genome Informatics group (MGI; http://www.informatics.jax.org/). Numbers of computational annotations from UniProt are from the June 2007 GOA UniProt data release. Documentation about evidence codes is available at http://www.geneontology.org/GO.evidence.shtml.
In recent years, results from comparative sequence and genomic studies, as well as analyses of functional genomic and proteomic data, have provided valuable insights into the biological roles of gene products, especially when data from traditional experimental approaches are unavailable (5,6). In order to provide greater access to these results, SGD now incorporates these data as GO annotations. Because the process of assigning GO annotations from high-throughput experimental data and computational predictions differs from the process of assigning annotations from traditional experimental studies, GO annotations in SGD are now distinguished by their annotation method.
INCORPORATING HIGH-THROUGHPUT DATA AT SGD
Traditional experimental methods, focusing on in-depth characterization of small numbers of genes, have been and will continue to be the primary source of evidence for GO annotations. However, modern techniques allow experiments to be designed on a genome-wide scale, generating data for large numbers of genes. SGD now assigns GO annotations based on data from such high-throughput experiments. These data sources have been particularly valuable in providing a nearly comprehensive set of Cellular Component GO annotations: according to the GO annotation summary on SGD's Genome Snapshot, 5474 of 6301 gene products had been assigned at least one Cellular Component GO term as of September 2007, and 2238 of these are supported by data from high-throughput methods (7–9).
INCORPORATING GO ANNOTATIONS FROM GOA UNIPROT
In addition to data from high-throughput experimental methods, GO annotations can also be generated by computational analyses. For example, the Gene Ontology Annotation Project generates computationally predicted GO annotations for UniProt proteins based on sequence similarity algorithms (GOA UniProt; http://www.ebi.ac.uk/GOA/) (10,11). In order to provide greater access to these predictions, GOA UniProt annotations are now incorporated into SGD. Because these computationally predicted GO annotations are added without being reviewed in the context of literature-based GO annotations, they retain the ‘Inferred from Electronic Annotation’ (‘IEA’) evidence code assigned by GOA UniProt (Table 1).
Note that GOA UniProt also compiles literature-based GO annotations from many data sources (10). These annotations are also available at SGD, along with their original evidence codes and data sources, but are reviewed for redundancy with current SGD GO annotations before being incorporated (Table 1).
DIFFERENTIATING ANNOTATION METHODS
In addition to GO annotations derived from the manual curation of traditional experimental approaches published in the literature, SGD now contains GO annotations derived from data from high-throughput experiments as well as computational predictions provided by GOA UniProt, creating a central repository for all S. cerevisiae GO annotations. Although all of these annotations are supported by references and evidence codes, the basis for any differences among the GO annotations for any given gene may not be immediately clear. The curation process used for assigning GO annotations from these data varies according to the experimental approach. Therefore, in order to indicate how the data were curated, and to facilitate identification and comparison of these annotations, each GO annotation is now categorized in one of three annotation methods: manually curated, high-throughput or computational (Table 1).
The manually curated method indicates that the evidence in a publication has been individually reviewed to generate an annotation. Types of evidence can include experimental results in published literature that focuses on single genes or small sets of genes, author statements in a publication and sequence similarities that have been analyzed by the authors [for examples, see (12,13), shown in Figure 1B].
The high-throughput method indicates that, although the evidence for a subset of results from a high-throughput or genome-wide experimental approach may have been reviewed, results for each gene product in the dataset have not been individually reviewed. Generally, this annotation method includes data from experimental approaches in which all significant results were produced using the same condition or analysis [for examples, see (7,8)].
In contrast, annotations generated by the computational method are not supported by direct experimental evidence and are not individually reviewed. These annotations include predictions generated by sequence similarity algorithms or by the integrated computational analyses of different sets of high-throughput experimental data that have not been individually reviewed [for examples, see (11,14–17)].
All literature-based GO annotations from SGD and GOA UniProt are classified either as manually curated or high-throughput. Computational predictions provided by GOA UniProt are classified as computational (Table 1).
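The three-way classification can be illustrated with a short Python sketch. It assumes the evidence codes listed in Table 1. Because the manually curated and high-throughput methods share several evidence codes, the code alone cannot separate them, so the sketch takes a hypothetical flag, `from_high_throughput_study`, standing in for the curator's judgment about the source publication. The function name is likewise an assumption made for this example.

```python
# Evidence codes per annotation method, as listed in Table 1.
MANUAL_CODES = {"IDA", "IGI", "IMP", "IPI", "IEP", "ISS",
                "IC", "RCA", "NAS", "TAS", "ND"}
HIGH_THROUGHPUT_CODES = {"IDA", "IMP", "IGI", "IPI", "IEP"}


def annotation_method(evidence_code, from_high_throughput_study=False):
    """Classify one annotation as 'manually curated', 'high-throughput'
    or 'computational'.

    `from_high_throughput_study` is a hypothetical flag: it represents a
    curator's decision that the reference is a genome-wide study whose
    per-gene results were not individually reviewed.
    """
    if evidence_code == "IEA":
        return "computational"
    if from_high_throughput_study:
        if evidence_code not in HIGH_THROUGHPUT_CODES:
            raise ValueError(f"unexpected high-throughput code: {evidence_code}")
        return "high-throughput"
    if evidence_code in MANUAL_CODES:
        return "manually curated"
    raise ValueError(f"unknown evidence code: {evidence_code}")


methods = [
    annotation_method("TAS"),
    annotation_method("IDA", from_high_throughput_study=True),
    annotation_method("IEA"),
]
```

The same evidence code, such as IDA, therefore falls under different annotation methods depending on how the underlying results were reviewed.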
MODIFICATIONS TO INTERFACES
SGD has changed several web interfaces in order to display data sources and annotation methods. The Locus Summary lists each manually curated and high-throughput GO annotation and indicates when computational GO annotations are available (Figure 1A). The phrases ‘All GO Evidence and References’ and ‘View Computational GO annotations’ are both hyperlinked to a detailed Gene Ontology Annotations page, which is subdivided into sections according to each annotation method. Because annotations no longer come solely from SGD, an ‘Assigned by’ column now indicates the data source (Figure 1B).
From the Locus Summary and GO Annotations pages, each GO term is hyperlinked to its GO Term page, which now lists all annotation methods used to generate that annotation for a particular gene. Annotations may be downloaded, according to annotation method, from the summary table at the top of the page (Figure 1C).
To ensure that data analyzed at SGD or by others in the scientific community are based on GO annotations supported by evidence in the published literature, only manually curated and high-throughput GO annotations are publicly available from the GO Consortium (http://www.geneontology.org/GO.current.annotations.shtml). They are also the default annotation sets used for SGD's GO Term Finder (http://www.yeastgenome.org/TermFinder) and GO Slim Mapper (http://www.yeastgenome.org/SlimMapper).
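A user who downloads a combined annotation set could reproduce this default by dropping the computational annotations. The sketch below assumes the tab-delimited gene association file format distributed by the GO Consortium, in which lines beginning with `!` are comments and the seventh column holds the evidence code. The function name and the two sample rows are hypothetical and contain no real annotation data.

```python
EVIDENCE_COLUMN = 6  # zero-based index of column 7, the evidence code


def literature_supported(lines):
    """Yield annotation lines whose evidence code is not IEA,
    skipping comment lines that start with '!'."""
    for line in lines:
        if line.startswith("!"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) > EVIDENCE_COLUMN and fields[EVIDENCE_COLUMN] != "IEA":
            yield line


# Two made-up rows for a placeholder gene: one experimental, one electronic.
experimental = ["SGD", "S000000001", "GENE1", "", "GO:0005634",
                "SGD_REF:0000001", "IDA", "", "C", "", "", "gene",
                "taxon:4932", "20070901", "SGD"]
electronic = list(experimental)
electronic[6] = "IEA"
electronic[14] = "UniProt"

sample_lines = [
    "!illustrative header line",
    "\t".join(experimental),
    "\t".join(electronic),
]
kept = list(literature_supported(sample_lines))
```

Only the row carrying the IDA evidence code survives the filter. The header line and the IEA row are dropped.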
FUTURE DIRECTIONS
SGD will continue to update manually curated GO annotations as new experimental data are published and will add more sources of high-throughput and computational GO annotations. Discrepancies between annotations may become evident as GO annotations are made from different data sources and annotation methods. These differences can help refine GO and individual annotations by indicating areas in the ontology that require modification and gene products whose annotations need to be reviewed and updated to reflect the current literature. SGD will use this method of comparison to identify under-annotated gene products and areas in the GO structure that need to be reviewed.
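One simple form such a comparison could take is sketched below in Python. It is an illustration under stated assumptions, not a description of SGD's procedure: the function name, the dictionary layout and the placeholder gene names are hypothetical, and terms are compared by exact GO identifier only, ignoring parent and child relationships in the ontology, which a real comparison would need to consider.

```python
def predicted_only_terms(curated, predicted):
    """For each gene, list GO terms present in the computational set but
    absent from the literature-based set.

    Both arguments map a gene name to a set of GO identifiers
    (a hypothetical layout assumed for this sketch).
    """
    report = {}
    for gene in sorted(set(curated) | set(predicted)):
        extra = predicted.get(gene, set()) - curated.get(gene, set())
        if extra:
            report[gene] = sorted(extra)
    return report


# Made-up example with placeholder gene names.
curated = {
    "GENE1": {"GO:0005634"},
    "GENE2": {"GO:0005575"},  # root term only, i.e. no data available
}
predicted = {
    "GENE1": {"GO:0005634"},
    "GENE2": {"GO:0005739"},
    "GENE3": {"GO:0003677"},
}
report = predicted_only_terms(curated, predicted)
```

Genes that appear in the report are candidates for a curator to revisit. Whether a predicted term reflects new literature, an ontology issue or an unsupported prediction still requires manual review.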
SUMMARY
The incorporation of annotations from additional data sources makes SGD a central source for S. cerevisiae GO annotations. Differentiating these annotations by annotation method distinguishes what has been experimentally determined for each gene from what has only been computationally predicted. This knowledge will spur experimental research by contributing valuable information for genes that have not been experimentally characterized, and by suggesting additional roles for others (6).
SGD is committed to maintaining high-quality GO annotations and welcomes all comments or questions. Please contact us at: yeast-curator@genome.stanford.edu.
ACKNOWLEDGEMENTS
The SGD project is supported by a P41 grant from the NHGRI HG001315 (J.M.C.) and through the GO Consortium P41 grant from NHGRI HG002273 (co-PI J.M.C.). Funding to pay the Open Access publication charges for this article was provided by the National Human Genome Research Institute.
Conflict of interest statement. None declared.
REFERENCES
- 1. Gene Ontology Consortium. The Gene Ontology (GO) project in 2006. Nucleic Acids Res. 2006;34:326. doi: 10.1093/nar/gkj021.
- 2. Dwight SS, Dolinski K, Ball CA, Binkley G, Christie KR, Fisk DG, Issel-Tarver L, Schroeder M, et al. Saccharomyces Genome Database (SGD) provides secondary gene annotation using the Gene Ontology (GO). Nucleic Acids Res. 2002;30:69–72. doi: 10.1093/nar/30.1.69.
- 3. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet. 2000;25:25–29. doi: 10.1038/75556.
- 4. Gene Ontology Consortium. Creating the gene ontology resource: design and implementation. Genome Res. 2001;11:1425–1433. doi: 10.1101/gr.180801.
- 5. Dolinski K. Changing perspectives in yeast research nearly a decade after the genome sequence. Genome Res. 2005;15:1611–1619. doi: 10.1101/gr.3727505.
- 6. Pena-Castillo L, Hughes TR. Why are there still over 1000 uncharacterized yeast genes? Genetics. 2007;176:7–14. doi: 10.1534/genetics.107.074468.
- 7. Huh WK, Falvo JV, Gerke LC, Carroll AS, Howson RW, Weissman JS, O'Shea EK. Global analysis of protein localization in budding yeast. Nature. 2003;425:686–691. doi: 10.1038/nature02026.
- 8. Kumar A, Agarwal S, Heyman JA, Matson S, Heidtman M, Piccirillo S, Umansky L, Drawid A, Jansen R, et al. Subcellular localization of the yeast proteome. Gen. Dev. 2002;16:707–719. doi: 10.1101/gad.970902.
- 9. Hirschman JE, Balakrishnan R, Christie KR, Costanzo MC, Dwight SS, Engel SR, Fisk DG, Hong EL, Livstone MS, et al. Genome Snapshot: a new resource at the Saccharomyces Genome Database (SGD) presenting an overview of the Saccharomyces cerevisiae genome. Nucleic Acids Res. 2006;34:D442–D445. doi: 10.1093/nar/gkj117.
- 10. Camon E, Magrane M, Barrell D, Lee V, Dimmer E, Maslen J, Binns D, Harte N, Lopez R, et al. The Gene Ontology Annotation (GOA) Database: sharing knowledge in UniProt with Gene Ontology. Nucleic Acids Res. 2004;32:D262–D266. doi: 10.1093/nar/gkh021.
- 11. Quevillon E, Silventoinen V, Pillai S, Harte N, Mulder N, Apweiler R, Lopez R. InterProScan: protein domains identifier. Nucleic Acids Res. 2005;33:W116–W120. doi: 10.1093/nar/gki442.
- 12. Hanna JS, Kroll ES, Lundblad V, Spencer FA. Saccharomyces cerevisiae CTF18 and CTF4 are required for sister chromatid cohesion. Mol. Cell. Biol. 2001;21:3144–3158. doi: 10.1128/MCB.21.9.3144-3158.2001.
- 13. Mayer ML, Gygi SP, Aebersold R, Hieter P. Identification of RFC(Ctf18p, Ctf8p, Dcc1p): an alternative RFC complex required for sister chromatid cohesion in S. cerevisiae. Mol. Cell. 2001;7:959–970. doi: 10.1016/s1097-2765(01)00254-4.
- 14. Jansen R, Yu H, Greenbaum D, Kluger Y, Krogan NJ, Chung S, Emili A, Snyder M, Greenblatt JF, et al. A Bayesian networks approach for predicting protein-protein interactions from genomic data. Science. 2003;302:449–453. doi: 10.1126/science.1087361.
- 15. Lee I, Date SV, Adai AT, Marcotte EM. A probabilistic functional network of yeast genes. Science. 2004;306:1555–1558. doi: 10.1126/science.1099511.
- 16. Mulder NJ, Apweiler R, Attwood TK, Bairoch A, Bateman A, Binns D, Bradley P, Bork P, Bucher P, et al. InterPro, progress and status in 2005. Nucleic Acids Res. 2005;33:D201–D205. doi: 10.1093/nar/gki106.
- 17. Troyanskaya OG, Dolinski K, Owen AB, Altman RB, Botstein D. A Bayesian framework for combining heterogeneous data sources for gene function prediction (in Saccharomyces cerevisiae). Proc. Natl Acad. Sci. USA. 2003;100:8348–8353. doi: 10.1073/pnas.0832373100.