Abstract
To increase the utility of Gene Ontology annotations for interpretation of genome-wide experimental data, we have developed GO-CAM, a structured framework for linking multiple GO annotations into an integrated model of a biological system. We expect that GO-CAM will enable new applications in pathway and network analysis, as well as improve standard GO annotations for traditional GO-based applications.
Introduction
The Gene Ontology was created as a computational structure for conceptualizing and describing gene function (1). The broad aims were 1) to create an ontology of gene function, a comprehensive set of terms and relationships between them, and 2) to support functional annotation of genes. At the time the GO was developed, the first whole genomes were being sequenced, and statements about gene function were conceived of as “annotations” on the “book” of the genome. The goal was to apply a consistent set of concepts describing gene function to a broad range of eukaryotic model organisms (later extended to prokaryotes and viruses). This application would enable the identification of evolutionarily shared genetic programs, with the ultimate goal being to shed light on the functions of human genes based on knowledge about genes in model organisms.
The development of the GO has always been tightly coupled to its use in describing the functions of genes across a wide variety of organisms. New biological concepts, and the revision of existing ones, were and still are driven primarily by requests from expert biocurators, who read published scientific articles reporting discoveries of the functions of gene products and “annotate” the gene with terms selected from the GO. Thus, the GO ontology enumerates the universe of possible functions performed by genes, while GO annotations specify the functions that have been experimentally observed or otherwise inferred for a particular gene. In the initial publication, Ashburner et al. (1) emphasized the independence of each of the aspects of the GO. This was an important advance, because it clarified the diverse uses of the word “function” in the biological literature. In the GO, molecular functions (the activities of gene products at the molecular level, such as catalysis of a reaction) are distinct from cellular components (the location, relative to cellular structures, where the gene product is active), and distinct from biological processes (the larger biological programs carried out by a series of molecular functions).
At its core, a GO annotation is an association between a single gene and a single GO term (Figure 1a), and a record of the supporting scientific evidence for the association. This association is a statement about some aspect of the function of that gene. However, because it refers to a single GO term, each GO annotation is necessarily a partial functional description, and there is no representation of how different annotations for the same gene fit together into a more complete description. As a result, a GO annotation often represents a minimal, discrete piece of biological knowledge that can be determined from one, or at most a few, experiments that appear in a typical scientific paper. The simplicity of the GO annotation structure was a key driver for its success. Over the past 20 years, the GO knowledgebase has become indisputably the largest repository of computational representations of gene functions (2). The ontology currently contains roughly 45,000 terms, and the annotation database has over 750,000 experimental gene annotations, taken from 150,000 distinct scientific publications and contributed by biocurators from around the globe. During this period, we have made many advances in the Gene Ontology itself to facilitate computational analysis (3,4). In contrast, the representation of statements about gene function as separate “annotations” has remained essentially unchanged, until now.
In order to represent more complex statements about biological functions in a way that is scalable and structured, we introduce here a framework we call Gene Ontology Causal Activity Modeling (GO-CAM). GO-CAM extends the existing annotation paradigm by introducing the concept of a model, which is a collection of connected GO annotations (plus contextual information from other ontologies) linked together according to a defined schema. Figure 1 illustrates how multiple GO annotations for the function of NEDD4 in UV-induced transcriptional arrest (5) are linked together in GO-CAM into a more complete, integrated model. If standard GO annotations are analogous to phrases of text, GO-CAM allows us to use these phrases to build sentences, paragraphs and whole documents.
The core structure of GO-CAM
The GO-CAM formalism defines a schema that combines multiple simple GO annotations into an integrated, semantically precise and computable model of biological function. It formalizes the relationships between annotations by integrating different aspects of function, as shown in Figure 2. Each element of GO-CAM refers to terms from an ontology or other standard identifier (Table 1). As originally defined by Ashburner et al. (1) and further elaborated by Thomas (6), a molecular activity (GO molecular function annotation) of a gene product occurs in a location (GO cellular component annotation) and is part of a larger biological program (GO biological process annotation). In GO-CAM, relations to terms from other ontologies can provide additional specificity: for location, a cellular component can be part of a specified cell type, which in turn can be part of a specified anatomical structure; an activity can occur during a specified temporal period (biological phase).
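The schema described above can be pictured as a small record type. The following is a minimal sketch, not the GO-CAM implementation (models are formally expressed in RDF/OWL): the class and field names (`Term`, `ActivityUnit`, `during`) are hypothetical, and the example simply combines identifiers from the Example column of Table 1 for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Term:
    """An ontology term or other standard identifier (CURIE plus label)."""
    curie: str
    label: str


@dataclass(frozen=True)
class ActivityUnit:
    """Hypothetical container for one molecular activity and its context."""
    activity: Term                     # GO molecular function
    enabled_by: Term                   # active entity (gene, protein, RNA or complex)
    occurs_in: Optional[Term] = None   # GO cellular component
    part_of: Optional[Term] = None     # GO biological process
    cell_type: Optional[Term] = None   # e.g. a Cell Ontology term
    anatomy: Optional[Term] = None     # e.g. an UBERON term
    during: Optional[Term] = None      # biological phase

    def describe(self) -> str:
        """Render the unit as a readable chain of relations."""
        parts = [f"{self.activity.label} enabled by {self.enabled_by.label}"]
        for relation, term in (
            ("occurs in", self.occurs_in),
            ("part of", self.part_of),
            ("during", self.during),
        ):
            if term is not None:
                parts.append(f"{relation} {term.label}")
        return ", ".join(parts)


# Illustrative values only, combined from the examples in Table 1.
nedd4_unit = ActivityUnit(
    activity=Term("GO:0004842", "ubiquitin-protein transferase activity"),
    enabled_by=Term("HGNC:7727", "NEDD4"),
    occurs_in=Term("GO:0005634", "nucleus"),
    part_of=Term("GO:0034644", "cellular response to UV"),
)
print(nedd4_unit.describe())
```

Keeping the contextual fields optional mirrors the fact that a model can be built up incrementally, starting from a single activity and its active entity.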
Table 1. GO-CAM elements and ontologies used.
GO-CAM element (Figure 2) | Ontology or identifier source(s) | Example |
---|---|---|
Molecular activity | GO molecular function | ubiquitin-protein transferase activity (GO:0004842) |
Biological process | GO biological process | cellular response to UV (GO:0034644) |
Location | GO cellular component | nucleus (GO:0005634) |
Location | Cell Type Ontology (CL) (8) | retinal cell (CL:0009004) |
Location | anatomy ontologies, e.g. UBERON (9), C. elegans gross anatomy (10), EMAPA (11) | eye (UBERON:0000970) |
Active entity | Gene, protein, RNA or complex identifier from a standard source, e.g. HGNC for a human gene | NEDD4 (HGNC:7727) |
Target entity | Same as active entity, or chemical from ChEBI (12) | MAP2K1 (HGNC:6840) |
Biological phase | GO biological phase (GO:0044848) | mitotic G1 phase (GO:0000080) |
Biological phase | Developmental phase ontology, e.g. Mouse Developmental Stage | Theiler stage 02 (MmusDv:0000005) |
Relations (arrows in Figure 2) | Relations Ontology | occurs in (BFO:0000066) |
In addition, a molecular activity can have a causal effect on another molecular activity. Previously, these were represented as annotations to GO terms from the regulation of molecular function branch of the ontology, but in GO-CAM we represent these instead as separate activities linked by a relation from the causal relation branch of the Relations Ontology (7). Note that causal relations can have a positive or negative direction of effect, and encompass many different terms such as directly regulates, or causally upstream of. By linking together chains of effects, GO-CAM models can specify causal pathways of arbitrary size and branching.
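To illustrate how chains of causal effects form pathways, the sketch below enumerates every acyclic path downstream of a starting activity in a small directed graph. It is an illustration under stated assumptions rather than part of GO-CAM: the node names are placeholders, the function `causal_paths` is hypothetical, and multiplying the signs of signed relation labels (such as directly positively regulates) to obtain a net direction of effect is a common simplification that the framework itself does not prescribe.

```python
from collections import defaultdict

# Numeric encoding of the direction of effect (an assumption of this sketch).
SIGN = {
    "directly positively regulates": +1,
    "positively regulates": +1,
    "directly negatively regulates": -1,
    "negatively regulates": -1,
}


def causal_paths(edges, start):
    """Yield (path, net_sign) for every acyclic causal path leaving `start`.

    `edges` is an iterable of (upstream activity, relation label, downstream activity).
    """
    graph = defaultdict(list)
    for src, rel, dst in edges:
        graph[src].append((rel, dst))

    stack = [([start], 1)]
    while stack:
        path, sign = stack.pop()
        for rel, dst in graph[path[-1]]:
            if dst in path:          # skip cycles such as feedback loops
                continue
            extended = (path + [dst], sign * SIGN[rel])
            yield extended
            stack.append(extended)


# Placeholder activities; a real model would use ontology-typed activity nodes.
edges = [
    ("activity_A", "directly positively regulates", "activity_B"),
    ("activity_B", "directly negatively regulates", "activity_C"),
    ("activity_A", "directly negatively regulates", "activity_D"),
]
for path, sign in sorted(causal_paths(edges, "activity_A")):
    print(" -> ".join(path), "net effect:", "+" if sign > 0 else "-")
```

Because branches are followed independently, the same traversal handles both linear cascades and branching pathways.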
GO-CAM records the evidence for each element of a model
GO-CAM preserves and extends the way in which GO annotations are currently supported by scientific evidence. As described above, each GO-CAM model is composed of “triples” that specify a subject, a relation and an object (e.g. in Figure 1b, ubiquitin-protein transferase activity enabled by NEDD4), and each triple must be supported by evidence. As is currently done for all GO annotations, GO-CAM models use the Evidence and Conclusion Ontology (ECO) for specifying the type of evidence (13). An advance in GO-CAM over simple annotations is that a triple can be supported by more than one piece of evidence. Furthermore, like standard GO annotations, GO-CAM triples may not be completely consistent. We recognize that current knowledge of biological systems is incomplete, and in some cases contradictory models may have been proposed. In these cases, multiple alternative models (or different triples in the same GO-CAM model) will co-exist in the GO knowledgebase, and can be revised later in response to additional experimental evidence.
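The evidence requirement can be sketched as a simple invariant on a triple: at least one piece of evidence, with no upper limit. The class names below are hypothetical, the ECO identifier is shown only as an example of an evidence code, and the references are placeholders rather than real citations.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Evidence:
    """One piece of supporting evidence: an ECO term plus a reference."""
    eco_code: str
    reference: str


@dataclass(frozen=True)
class Triple:
    """Subject, relation and object, supported by one or more Evidence items."""
    subject: str
    relation: str
    obj: str
    evidence: Tuple[Evidence, ...]

    def __post_init__(self):
        # Every triple must be supported by evidence.
        if not self.evidence:
            raise ValueError("each triple must be supported by evidence")


triple = Triple(
    subject="ubiquitin-protein transferase activity",
    relation="enabled by",
    obj="NEDD4",
    evidence=(
        Evidence("ECO:0000314", "REF:placeholder-1"),  # placeholder reference
        Evidence("ECO:0000314", "REF:placeholder-2"),  # a second, independent source
    ),
)
print(len(triple.evidence), "pieces of evidence")
```

Storing evidence on the triple, rather than on the model as a whole, is what allows alternative or conflicting statements to coexist and be revised individually.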
Modeling biological pathways in GO-CAM
As an example of the power of GO-CAM models to represent more complex processes such as signaling pathways, we consider a model of the canonical Wnt signaling pathway (Figure 3). The pathway was constructed by combining standard annotations (one gene to one GO term, e.g. receptor ligand activity enabled by WNT3). Causal relations between activities were then added manually using Noctua, the collaborative web curation platform we have developed to support GO-CAM modeling (http://noctua.geneontology.org).
Figure 3 shows the “curator view” of a portion of the GO-CAM model for the initial steps in the canonical Wnt signaling pathway using FZD1 and WNT3 as the receptor-ligand pair. The model comprises multiple molecular activities linked by causal relationships (directly positively regulates, directly negatively regulates, positively regulates, negatively regulates); direct relations indicate regulation via direct physical interactions. Each molecular activity is carried out by either a single gene product (e.g. WNT3) or a complex of gene products (e.g. the beta-catenin destruction complex). A distinct sub-process (regulation of proteasomal protein catabolic process, GO:0010498) represents the use of the relatively general “constitutive” proteasomal degradation process to negatively regulate beta-catenin activity.
As Figures 1b and 3 show graphically, a GO-CAM model has similarities to the “cartoons” published in many molecular biology papers showing how gene product activities causally relate to each other; the primary differences are that, in GO-CAM, 1) the model explicitly represents dynamic molecular activities instead of using gene names to stand in for activities, and 2) all entities, activities, processes, locations, and relations are specified from ontologies rather than free text or ambiguous symbols. The GO-CAM schema thus provides a defined, structured representation that makes it computable, i.e. usable in computational analyses, such as complex queries and searches including across causal paths, as well as enrichment analysis tools for analyzing genomics data sets. It utilizes the extensive structure of the Gene Ontology to simplify and abstract away the explicit biochemical details without losing that information; for example, the GO term protein kinase activity is already defined in terms of the reaction it catalyzes, including reactants (ATP and a protein substrate) and products (ADP and a phosphorylated protein).
The GO-CAM model repository
Currently there are over 2,300 GO-CAMs of varying complexity, containing over 11,000 distinct triples, encompassing 16 species and over 1,600 gene products. These are available from the GO-CAM public site (http://geneontology.org/go-cam), where they can be browsed and visualized. GO-CAM models are created as part of the existing GO annotation curation process, by trained GO curators from multiple groups that are distributed internationally and meet regularly to ensure a consistent process. Moving forward, all GO annotations will be represented using GO-CAM. We are currently beginning the process of importing legacy standard annotations to the GO-CAM repository, with most existing standard GO annotations initially grouped into a single model per gene product. Ongoing curation will move toward models for the most specific GO biological process terms in the ontology (pathways and other coordinated processes). Formally, the GO-CAM models are expressed in RDF/OWL (14), a semantic web standard that makes them interoperable with a large set of computational tools. To enable use of GO-CAM in Cytoscape and other network analysis tools (15,16), we also provide the causal network in Simple Interaction Format (for more information on conversion and information loss, see http://geneontology.org/go-cam/docs).
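As a rough illustration of what a Simple Interaction Format export involves, the sketch below writes causal edges as tab-separated `source`, `interaction`, `target` lines. It assumes nothing about the actual GO export: the function `to_sif` and the node names are hypothetical, and replacing spaces in relation labels with underscores is a choice made here for readability in downstream tools. The real conversion rules are described in the GO-CAM documentation.

```python
def to_sif(edges):
    """Serialize (source, relation, target) edges as tab-delimited SIF lines."""
    lines = []
    for src, rel, dst in edges:
        interaction = rel.replace(" ", "_")   # keep the interaction type a single token
        lines.append("\t".join((src, interaction, dst)))
    return "\n".join(lines)


# Placeholder node names; context such as location and evidence is not carried over.
edges = [
    ("activity_A", "directly positively regulates", "activity_B"),
    ("activity_B", "directly negatively regulates", "activity_C"),
]
print(to_sif(edges))
```

Only the causal skeleton survives in this form, which is one way to see why the conversion is lossy.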
GO-CAMs are converted to standard GO annotations
Because GO-CAM links together GO annotations, each model can be decomposed into its constituent standard GO annotations. The GO-CAM-derived annotations are integrated into the standard GO annotation releases, and so are already in widespread use. The conversion process inevitably loses some of the information in the full GO-CAM (see http://geneontology.org/go-cam/docs for more detail). Briefly, the conversion involves following chains of multiple relations in the GO-CAM model (e.g. making a GO biological process annotation requires following the enabled by relation to a molecular activity, then a part of relation to a GO biological process term, see Figure 2), as well as logical reasoning (the conversion uses “logical definitions” of GO terms to infer, for example, that if a molecular activity directly regulates a protein kinase activity, then that activity can also be classified as a protein kinase regulator activity).
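The chain-following part of this conversion can be sketched as follows. This is a simplified illustration, not the GO release pipeline: the function `derive_annotations` is hypothetical, activities are identified directly by their GO term rather than as separate instances, and evidence handling and logical reasoning are omitted.

```python
# Relations from an activity that yield annotations in other GO aspects.
ASPECT_BY_RELATION = {
    "part of": "biological process",
    "occurs in": "cellular component",
}


def derive_annotations(triples):
    """Return a set of (gene, GO term, aspect) tuples derived from model triples."""
    # Step 1: find which gene products enable each activity.
    genes_by_activity = {}
    for subj, rel, obj in triples:
        if rel == "enabled by":
            genes_by_activity.setdefault(subj, set()).add(obj)

    annotations = set()
    # Step 2: each enabled activity gives a molecular function annotation.
    for activity, genes in genes_by_activity.items():
        for gene in genes:
            annotations.add((gene, activity, "molecular function"))

    # Step 3: follow part of / occurs in from the activity to the other aspects.
    for subj, rel, obj in triples:
        aspect = ASPECT_BY_RELATION.get(rel)
        if aspect is None:
            continue
        for gene in genes_by_activity.get(subj, ()):
            annotations.add((gene, obj, aspect))
    return annotations


# Illustrative identifiers taken from the examples in Table 1.
model = [
    ("GO:0004842", "enabled by", "HGNC:7727"),
    ("GO:0004842", "part of", "GO:0034644"),
    ("GO:0004842", "occurs in", "GO:0005634"),
]
for annotation in sorted(derive_annotations(model)):
    print(annotation)
```

The links between the three derived annotations are not retained in the output, which reflects the information loss noted above.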
We have found that the GO-CAM curation process of specifying an explicit biological model is leading to improved quality and consistency of GO annotations. For biological process annotations, GO-CAM modeling aids curators in determining which gene functions are parts of a process, which ones regulate that process, and which are part of upstream processes that otherwise affect the process. For example, Wnt ligands are post-translationally processed and trafficked through the secretory system by enzymes such as acyltransferases and carrier proteins, respectively. In the past, curators often annotated these upstream gene products to Wnt signaling pathway, or regulation of Wnt signaling pathway, to capture the idea that they are “in some way related” to Wnt signaling; with GO-CAM, upstream causal activities can be represented without losing the distinction between gene products that execute a given biological program and those that affect that program. Further, a GO-CAM model can be used as a reference, or template, for new curation of homologous or analogous biological systems. As a result, similar processes and pathways can be annotated much more consistently.
Conclusion
GO-CAM provides a computational framework for representing integrated models of the activities of specific genes as well as the larger biological programs to which they contribute. This framework formalizes and extends GO annotations (statements about specific gene functions) analogously to how, starting 20 years ago, the Gene Ontology formalized an ontology of gene function descriptions. GO-CAM explicitly defines the relationships between: 1) different aspects (molecular function, biological process, cellular component) of the function of each gene, 2) the functions of different genes in a larger system, and 3) functions and critical context such as cell type and developmental stage. GO-CAM provides a framework for representing (and answering complex queries about) qualitative, causal models of how activities of gene products work together to execute a biological program, but does not represent biochemical details like stoichiometry or reaction kinetics.
By clarifying how a basic, building-block GO annotation relates to a description of overall gene function, GO-CAM leads to increased quality and consistency of GO annotations. As the rate-limiting step in creating GO annotations is reading the primary scientific literature, we do not expect any loss in curation productivity using GO-CAM. Instead, we expect that the ability to link together standard GO annotations into larger models will obviate the need for adding increasingly complex, combinatorial terms (e.g. Wnt signaling involved in kidney development, Wnt signaling involved in heart development, etc.) to the GO ontology itself, thus simplifying its maintenance and use. Because GO-CAMs are automatically converted (with some loss) into standard GO annotations as part of the GO release pipeline, the new formalism will continue to support the many current applications of GO annotations. The causal networks in GO-CAM models will also enable entirely new applications, such as network-based analysis of genomic data (17-22), and logical modeling of biological systems (23,24). In addition, the models may also prove useful for pathway visualization. For example, the activity-based representation of GO-CAMs is compatible with the “activity flow” diagrams of the Systems Biology Graphical Notation (SBGN) standard (25). With GO-CAM, the massive knowledgebase of GO annotations collected over the past 20 years can be used as the basis not only for a “genomic biology” representation of gene function, but also for a more expansive “systems biology” representation and its emerging applications to the interpretation of large-scale experimental data.
Acknowledgments
This work was supported by NIH NHGRI grant U41 HG002273 (co-funded by NIGMS) to the Gene Ontology Consortium (PIs: J Blake, J Cherry, C Mungall, P Sternberg and P Thomas). The authors would like to thank P D’Eustachio for helpful discussions, T Mushayahama for work on the Noctua user interface, D Ebert for work on conversion of standard annotations to GO-CAM, and all the GO curators who have extensively tested the GO-CAM framework and provided valuable feedback on it, and on the Noctua tool: S Aleksander, G Antonazzo, H Attrill, T Berardini, L Breuza, A Bridge, A Britan, J Cho, K Christie, M Courtot, I Cusin, B Czub, H Dietze, P Jaiswal, R Dodson, H Drabkin, S Engel, P Fey, M Feuermann, M Fisher, P Garmiri, G Georghiou, D Gonzalez, C Grove, E Hatton-Ellis, M Harris, M-C Harrison, J Hayles, T Hayman, V Hinard, D Howe, X Huang, R Huntley, H Bye-A-Jee, R Kishore, O Lang, R Lee, A Lock, R Lovering, A MacDougall, M Martin, P Masson, J Mendel, M Munoz-Torres, R Nash, L Ni, A Nikjenad, C O’Donovan, B Palka, C Pich, K Pichler, S Poux, L Reiser, P Roncaglia, T Sawford, A Shypitsyna, D Sitnikov, E Speretta, N Tyagi, S Toro, M Tuli, K Warner, E Wong, V Wood, R Zaru.
Footnotes
Competing interests statement
The authors declare no competing interests.
References
- 1. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, et al. Gene ontology: Tool for the unification of biology. The gene ontology consortium. Nat Genet 2000, May;25(1):25–9.
- 2. The Gene Ontology Consortium. The Gene Ontology resource: 20 years and still GOing strong. Nucleic Acids Res 2019;47(D1):D330–8.
- 3. Hill DP, Adams N, Bada M, Batchelor C, Berardini TZ, Dietze H, et al. Dovetailing biology and chemistry: Integrating the gene ontology with the chebi chemical ontology. BMC Genomics 2013, July 29;14:513.
- 4. Mungall CJ, Dietze H, Osumi-Sutherland D. Use of OWL within the gene ontology. BioRxiv 2014, October:010090.
- 5. Anindya R, Aygün O, Svejstrup JQ. Damage-induced ubiquitylation of human RNA polymerase II by the ubiquitin ligase nedd4, but not cockayne syndrome proteins or BRCA1. Mol Cell 2007, November 9;28(3):386–97.
- 6. Thomas PD. The gene ontology and the meaning of biological function. Methods Mol Biol 2017;1446:15–24.
- 7. Smith B, Ceusters W, Klagges B, Kohler J, Kumar A, Lomax J, et al. Relations in biomedical ontologies. Genome Biol 2005;6(5):R46.
- 8. Diehl AD, Meehan TF, Bradford YM, Brush MH, Dahdul WM, Dougall DS, et al. The cell ontology 2016: Enhanced content, modularization, and ontology interoperability. J Biomed Semantics 2016;7(1):44.
- 9. Mungall CJ, Torniai C, Gkoutos GV, Lewis SE, Haendel MA. Uberon, an integrative multi-species anatomy ontology. Genome Biol 2012, January 31;13(1):R5.
- 10. Lee RY, Sternberg PW. Building a cell and anatomy ontology of caenorhabditis elegans. Comp Funct Genomics 2003;4(1):121–6.
- 11. Hayamizu TF, Baldock RA, Ringwald M. Mouse anatomy ontologies: Enhancements and tools for exploring and integrating biomedical data. Mamm Genome 2015, October;26(9-10):422–30.
- 12. Hastings J, Owen G, Dekker A, Ennis M, Kale N, Muthukrishnan V, et al. ChEBI in 2016: Improved services and an expanding collection of metabolites. Nucleic Acids Res 2016, January 4;44(D1):D1214–9.
- 13. Chibucos MC, Siegele DA, Hu JC, Giglio M. The evidence and conclusion ontology (ECO): Supporting GO annotations. Methods Mol Biol 2017;1446:245–59.
- 14. OWL; Available from: https://www.w3.org/OWL/.
- 15. Franz M, Lopes CT, Huck G, Dong Y, Sumer O, Bader GD. Cytoscape.Js: A graph theory library for visualisation and analysis. Bioinformatics 2016, January 15;32(2):309–11.
- 16. Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, et al. Cytoscape: A software environment for integrated models of biomolecular interaction networks. Genome Res 2003, November;13(11):2498–504.
- 17. Hu Z. Using visant to analyze networks. Curr Protoc Bioinformatics 2014;45:8.8.1–39.
- 18. Gosline SJ, Oh C, Fraenkel E. SAMNetWeb: Identifying condition-specific networks linking signaling and transcription. Bioinformatics 2015, April 1;31(7):1124–6.
- 19. Cornish AJ, Markowetz F. SANTA: Quantifying the functional content of molecular networks. PLoS Comput Biol 2014, September;10(9):e1003808.
- 20. Xia J, Benner MJ, Hancock RE. NetworkAnalyst--integrative approaches for protein-protein interaction network analysis and visual exploration. Nucleic Acids Res 2014, July;42(Web Server issue):W167–74.
- 21. Chen J, Bardes EE, Aronow BJ, Jegga AG. ToppGene suite for gene list enrichment analysis and candidate gene prioritization. Nucleic Acids Res 2009, July;37(Web Server issue):W305–11.
- 22. Cowley MJ, Pinese M, Kassahn KS, Waddell N, Pearson JV, Grimmond SM, et al. PINA v2.0: Mining interactome modules. Nucleic Acids Res 2012, January;40(Database issue):D862–5.
- 23. Büchel F, Rodriguez N, Swainston N, Wrzodek C, Czauderna T, Keller R, et al. Path2Models: Large-scale generation of computational models from biochemical pathway maps. BMC Syst Biol 2013, November 1;7:116.
- 24. Naldi A, Monteiro PT, Müssel C, Kestler HA, Thieffry D, Xenarios I, et al. Cooperative development of logical modelling standards and tools with colomoto. Bioinformatics 2015, April 1;31(7):1154–9.
- 25. Le Novère N, Hucka M, Mi H, Moodie S, Schreiber F, Sorokin A, et al. The systems biology graphical notation. Nat Biotechnol 2009, August;27(8):735–41.