microPublication Biology. 2023 Apr 18;2023:10.17912/micropub.biology.000811. doi: 10.17912/micropub.biology.000811

rrvgo: a Bioconductor package for interpreting lists of Gene Ontology terms

Sergi Sayols
Reviewed by: Raymond Lee
PMCID: PMC10155054  PMID: 37151216

Abstract

Gene Ontology (GO) annotation is often used to guide the biological interpretation of high-throughput omics experiments, e.g. by analysing lists of differentially regulated genes for enriched GO terms. Due to the hierarchical nature of GO, the resulting lists of enriched terms are usually redundant and difficult to summarise and interpret. To facilitate the interpretation of large lists of GO terms, I developed rrvgo, a Bioconductor package that aims at reducing the redundancy of GO lists by grouping similar terms based on their semantic similarity. rrvgo also provides different visualization options to guide the interpretation of the summarized GO terms. Although several software tools have been developed for this purpose, rrvgo is unique in combining powerful visualizations in a programmatic interface with the up-to-date GO gene annotation provided by the Bioconductor project.


Figure 1. Different visualizations of the reduced terms provided by rrvgo.


(A) scatter plot represented by the first 2 components of a PCoA of the dissimilarity matrix. (B) space-filling visualization (treemap) of terms grouped by the representative term. (C) word cloud emphasizing frequent words in GO terms. (D) heatmap representation of the similarity matrix. (E) Companion Shiny App for interactive visualization of similarity between GO terms.

Description

Introduction

Structured vocabularies such as GO (The Gene Ontology Consortium 2019) are important tools for the biological interpretation of high-throughput omics experiments. Due to the hierarchical nature of GO annotation, lists of enriched GO terms are usually large and redundant. One approach to simplify GO analysis is to use GO Slims (Carbon et al. 2009), which represent a subset of the full GO. However, using such reduced versions of GO may hide interesting findings represented by more specific terms that were excluded. Hence, methods based on semantic similarity may better account for the complex structure of the GO graph and be more effective (Pesquita et al. 2009).

Several online tools to compute semantic similarity between GO terms exist, such as REVIGO (Supek et al. 2011). The accessibility of such tools comes at a price: they usually offer a limited programmatic interface that is difficult to integrate into pipelines, and they provide pre-packaged GO annotations which cannot be overridden. Offline tools also exist, such as clusterProfiler (Yu et al. 2012) or ViSEAGO (Brionne et al. 2019), which include useful but limited exploration capabilities.

Conveniently, the Bioconductor project (Huber et al. 2015) implements several semantic similarity methods and provides up-to-date GO annotations for a number of model organisms, along with the possibility of preparing custom annotations. I developed rrvgo to integrate in a single package access to the semantic similarity methods and annotations implemented in the Bioconductor project, coupled with highly effective visualizations, providing a one-stop-shop for the interpretation of large lists of GO terms in R.

Implementation

rrvgo requires a list of GO terms, usually identified in an overrepresentation analysis, from any of the three orthogonal taxonomies: Biological Process (BP), Molecular Function (MF) or Cellular Component (CC). Each term in the list may optionally include a score (e.g. a minus log-transformed p-value). In this case, rrvgo will prefer terms with higher scores to identify the most representative term of a group; otherwise, higher-level terms (i.e. those comprising more genes) are preferred by default.
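As a minimal illustration of the optional score, the following Python sketch turns enrichment p-values into minus log10-transformed scores keyed by GO identifier. It is not rrvgo code (rrvgo is an R package); the function name `scores_from_pvalues` and the example p-values are hypothetical, and the base-10 logarithm is an assumption of this sketch.

```python
import math

def scores_from_pvalues(pvalues):
    """Map GO term -> -log10(p-value); smaller p-values give higher scores."""
    scores = {}
    for term, p in pvalues.items():
        if not 0.0 < p <= 1.0:
            raise ValueError(f"p-value for {term} must be in (0, 1], got {p}")
        scores[term] = -math.log10(p)
    return scores

# Hypothetical enrichment results (illustrative values only).
example_pvalues = {"GO:0006915": 1e-8, "GO:0012501": 1e-5, "GO:0008219": 0.01}
example_scores = scores_from_pvalues(example_pvalues)

# The term with the smallest p-value receives the highest score.
best_term = max(example_scores, key=example_scores.get)
```

With scores of this kind, the most significant term of a group would be the natural candidate for its representative.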

rrvgo uses the GOSemSim package (Yu et al. 2010) under the hood, which implements methods to compute semantic similarity between pairs of GO terms, and the OrgDb packages of the organisms of interest provided within Bioconductor.

Similarity measures

The application of semantic similarity methods, originally used in Natural Language Processing, to ontological annotation has already been investigated (Lord et al. 2003). Some of these measures are based on the Information Content of the terms (Resnik 1999; Lin 1998; Jiang and Conrath 1997; Schlicker et al. 2006), while others are graph-based (Wang et al. 2007). These measures are implemented in the GOSemSim package.
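To make the idea of Information Content (IC) concrete, the sketch below computes three IC-based similarities, as they are commonly formulated, on a toy ontology. It is written in Python for illustration only and is not the GOSemSim implementation; the ontology, the annotation probabilities and all function names are hypothetical.

```python
import math

# Toy ontology: term -> set of direct parents (a hypothetical DAG, not real GO).
PARENTS = {
    "root": set(),
    "A": {"root"},
    "B": {"root"},
    "A1": {"A"},
    "A2": {"A"},
}
# Hypothetical probabilities p(t): fraction of genes annotated to a term
# or to any of its descendants (the root covers every gene).
PROB = {"root": 1.0, "A": 0.4, "B": 0.6, "A1": 0.1, "A2": 0.2}

def ancestors(term):
    """Return the term itself plus all of its ancestors."""
    seen, stack = {term}, [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def ic(term):
    """Information content, -log p(t): rarer terms are more informative."""
    return math.log(1.0 / PROB[term])

def mica(t1, t2):
    """Most informative common ancestor of two terms."""
    return max(ancestors(t1) & ancestors(t2), key=ic)

def resnik(t1, t2):
    """Resnik similarity: IC of the most informative common ancestor."""
    return ic(mica(t1, t2))

def lin(t1, t2):
    """Lin similarity: shared IC relative to the IC of both terms."""
    denom = ic(t1) + ic(t2)
    # Sketch-only convention: return 0 when both terms carry no information.
    return 2 * resnik(t1, t2) / denom if denom > 0 else 0.0

def rel(t1, t2):
    """Relevance similarity: Lin weighted by (1 - p(MICA))."""
    return lin(t1, t2) * (1 - PROB[mica(t1, t2)])
```

In this toy example, the siblings `A1` and `A2` share the informative ancestor `A` and therefore score higher than `A1` and `B`, which only share the uninformative root.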

rrvgo uses the similarity between pairs of terms to compute a matrix of dissimilarities. The terms are then hierarchically clustered using complete linkage, and the resulting tree is cut at the desired threshold. Within each group, the term with the highest score is picked as the representative.
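The reduction step described above can be sketched as follows. This is a plain Python illustration of the technique, not the rrvgo source: the function name `reduce_terms`, the toy similarities and the default threshold of 0.7 are assumptions, as is the choice to define dissimilarity as 1 minus similarity and to cut the tree at a height equal to the threshold.

```python
def reduce_terms(sim, scores, threshold=0.7):
    """Group terms by complete-linkage clustering of 1 - similarity.

    sim: dict of dicts holding symmetric similarities in [0, 1].
    scores: dict term -> score; the top-scoring term represents its group.
    Returns a dict mapping every term to its representative term.
    """
    def dist(a, b):
        return 1.0 - sim[a][b]

    def linkage(c1, c2):
        # Complete linkage: distance between the two farthest members.
        return max(dist(a, b) for a in c1 for b in c2)

    clusters = [[t] for t in sorted(sim)]
    while len(clusters) > 1:
        # Find the closest pair of clusters under complete linkage.
        d, i, j = min(
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if d > threshold:  # equivalent to cutting the tree at this height
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]

    representative = {}
    for members in clusters:
        best = max(members, key=lambda t: scores[t])
        for t in members:
            representative[t] = best
    return representative

# Hypothetical similarities between four terms (illustrative values only).
toy_sim = {
    "a": {"a": 1.0, "b": 0.9, "c": 0.1, "d": 0.1},
    "b": {"a": 0.9, "b": 1.0, "c": 0.1, "d": 0.1},
    "c": {"a": 0.1, "b": 0.1, "c": 1.0, "d": 0.8},
    "d": {"a": 0.1, "b": 0.1, "c": 0.8, "d": 1.0},
}
toy_scores = {"a": 1.0, "b": 5.0, "c": 3.0, "d": 2.0}
toy_groups = reduce_terms(toy_sim, toy_scores, threshold=0.7)
```

In this sketch, a larger threshold merges more terms into fewer groups, while a smaller threshold keeps more terms as their own representatives.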

Organisms supported and creating a custom OrgDb

As of Bioconductor 3.16, OrgDb packages are available for the organisms most commonly used in the lab. Consult the OrgDb BiocView for a full list of current OrgDb packages. The list is expected to fluctuate between versions, but the most common species should remain well supported as long as the project remains healthy.

For organisms without an OrgDb package in Bioconductor, it is still possible to create custom OrgDb packages using the AnnotationForge package (Carlson and Pagès 2019).

Visualizations

rrvgo provides the following visualizations of the reduced terms (Figure 1A-D): (i) a scatter plot of the first 2 components of a PCoA of the dissimilarity matrix; (ii) a space-filling visualization (treemap) of terms grouped by the representative term; (iii) a word cloud emphasizing frequent words in GO terms; and (iv) a heatmap representation of the similarity matrix.
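The data behind two of these views can be sketched in a few lines. The Python code below is an illustration only and not how rrvgo draws its plots: it groups terms under their representative, as a treemap would, and counts word frequencies across term names, as a word cloud would. The function names, the stopword list and the example term names are hypothetical.

```python
from collections import Counter, defaultdict

def group_by_representative(representative):
    """Invert term -> representative into representative -> sorted members."""
    groups = defaultdict(list)
    for term, rep in representative.items():
        groups[rep].append(term)
    return {rep: sorted(members) for rep, members in groups.items()}

def word_frequencies(term_names, stopwords=("of", "to", "in", "the", "and")):
    """Count words across GO term names, ignoring a few common stopwords."""
    counts = Counter()
    for name in term_names:
        counts.update(w for w in name.lower().split() if w not in stopwords)
    return counts

# Hypothetical reduced terms: term name -> name of its representative.
example_reduced = {
    "regulation of cell death": "cell death",
    "programmed cell death": "cell death",
    "cell death": "cell death",
    "apoptotic process": "apoptotic process",
}
example_groups = group_by_representative(example_reduced)
example_words = word_frequencies(example_reduced)
```

Here the tile for "cell death" would contain three terms, and "cell" and "death" would be the most prominent words in the cloud.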

Alternatively, the results can be interactively explored using the companion Shiny app (Figure 1E).

Conclusion

rrvgo is a Bioconductor package that aims at providing a one-stop-shop for the biological interpretation of large lists of GO terms. It integrates access to semantic similarity methods and visualization in a coherent and intuitive manner. This software is heavily influenced by REVIGO, mimicking a good part of its core functionality and some of the visualizations. The strength of rrvgo is its programmatic interface coupled with the up-to-date GO gene annotation provided by the Bioconductor project.

Reagents

rrvgo is available as a Bioconductor package at http://bioconductor.org/packages/rrvgo/ and released under the GPL-3 License. The version of the software used in this article (rrvgo 1.10.0, Bioconductor 3.16) is also available in the Extended Data Section.

Extended Data

Description: Source Package. Resource Type: Software. DOI: 10.22002/xa9g7-5mm38

Acknowledgments


I would like to thank the members of the IMB Core Facilities for discussion, input and proof-reading. I would also like to thank Dr. Raymond Lee (California Institute of Technology) for taking the time and effort to review the manuscript.

Funding Statement

Funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) – Project-ID 393547839 – SFB 1361.

References

  1. Brionne A, Juanchich A, Hennequet-Antier C. ViSEAGO: a Bioconductor package for clustering biological functions using Gene Ontology and semantic similarity. BioData Min. 2019 Aug 6;12:16. doi: 10.1186/s13040-019-0204-1.
  2. Carbon S, Ireland A, Mungall CJ, Shu S, Marshall B, Lewis S, AmiGO Hub, Web Presence Working Group. AmiGO: online access to ontology and annotation data. Bioinformatics. 2008 Nov 25;25(2):288–289. doi: 10.1093/bioinformatics/btn615.
  3. Carlson M, Pagès H. AnnotationForge. 2017. doi: 10.18129/b9.bioc.annotationforge.
  4. Huber W, Carey VJ, Gentleman R, Anders S, Carlson M, Carvalho BS, Bravo HC, Davis S, Gatto L, Girke T, Gottardo R, Hahne F, Hansen KD, Irizarry RA, Lawrence M, Love MI, MacDonald J, Obenchain V, Oleś AK, Pagès H, Reyes A, Shannon P, Smyth GK, Tenenbaum D, Waldron L, Morgan M. Orchestrating high-throughput genomic analysis with Bioconductor. Nat Methods. 2015 Feb 1;12(2):115–121. doi: 10.1038/nmeth.3252.
  5. Jiang JJ, Conrath DW. Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy. In: Proceedings of the 10th Research on Computational Linguistics International Conference. Taipei, Taiwan: The Association for Computational Linguistics and Chinese Language Processing (ACLCLP); 1997. p. 19–33.
  6. Lin D. An Information-Theoretic Definition of Similarity. In: Proceedings of the Fifteenth International Conference on Machine Learning. ICML ’98. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.; 1998. p. 296–304.
  7. Lord PW, Stevens RD, Brass A, Goble CA. Investigating semantic similarity measures across the Gene Ontology: the relationship between sequence and annotation. Bioinformatics. 2003 Jul 1;19(10):1275–1283. doi: 10.1093/bioinformatics/btg153.
  8. Pesquita C, Faria D, Falcão AO, Lord P, Couto FM. Semantic similarity in biomedical ontologies. PLoS Comput Biol. 2009 Jul 31;5(7):e1000443. doi: 10.1371/journal.pcbi.1000443.
  9. Resnik P. Semantic Similarity in a Taxonomy: An Information-Based Measure and its Application to Problems of Ambiguity in Natural Language. Journal of Artificial Intelligence Research. 1999 Jul 1;11:95–130. doi: 10.1613/jair.514.
  10. Schlicker A, Domingues FS, Rahnenführer J, Lengauer T. A new measure for functional similarity of gene products based on Gene Ontology. BMC Bioinformatics. 2006 Jun 15;7:302. doi: 10.1186/1471-2105-7-302.
  11. Supek F, Bošnjak M, Škunca N, Šmuc T. REVIGO summarizes and visualizes long lists of gene ontology terms. PLoS One. 2011 Jul 18;6(7):e21800. doi: 10.1371/journal.pone.0021800.
  12. The Gene Ontology Consortium. The Gene Ontology Resource: 20 years and still GOing strong. Nucleic Acids Res. 2019 Jan 8;47(D1):D330–D338. doi: 10.1093/nar/gky1055.
  13. Wang JZ, Du Z, Payattakool R, Yu PS, Chen CF. A new method to measure the semantic similarity of GO terms. Bioinformatics. 2007 Mar 7;23(10):1274–1281. doi: 10.1093/bioinformatics/btm087.
  14. Yu G, Li F, Qin Y, Bo X, Wu Y, Wang S. GOSemSim: an R package for measuring semantic similarity among GO terms and gene products. Bioinformatics. 2010 Feb 23;26(7):976–978. doi: 10.1093/bioinformatics/btq064.
  15. Yu G, Wang LG, Han Y, He QY. clusterProfiler: an R package for comparing biological themes among gene clusters. OMICS. 2012 Mar 28;16(5):284–287. doi: 10.1089/omi.2011.0118.

Articles from microPublication Biology are provided here courtesy of California Institute of Technology
