Abstract
Synteny conservation analysis is a well-established methodology to investigate the potential functional role of unknown prokaryotic genes. However, bioinformatic tools to reconstruct and visualise genomic contexts usually depend on slow computations, are restricted to narrow taxonomic ranges, and/or do not allow for the functional and interactive exploration of neighbouring genes across different species. Here, we present GeCoViz, an online resource built upon 12 221 reference prokaryotic genomes that provides fast and interactive visualisation of custom genomic regions anchored by any target gene, which can be searched by name, orthologous group (KEGG, eggNOG), protein domain (PFAM) or sequence. To facilitate functional and evolutionary interpretation, GeCoViz allows users to customise the taxonomic scope of each analysis and provides comprehensive annotations of the neighbouring genes. Interactive visualisation options include, among others, scaled representations of gene lengths and genomic distances, and on-the-fly calculation of synteny conservation of neighbouring genes, which can be highlighted based on custom thresholds. The resulting plots can be downloaded as high-quality images for publication. Overall, GeCoViz offers an easy-to-use, comprehensive, fast and interactive web-based tool for investigating the genomic context of prokaryotic genes, and is freely available at https://gecoviz.cgmlab.org.
INTRODUCTION
Bacterial and archaeal microorganisms possess compact genomes in which functionally related genes, or those whose products physically interact, tend to cluster together (1,2), sharing regulatory mechanisms (3) and occasionally leading to gene fusion events (4). Thus, genomic context analysis has been extensively applied to predict the putative functional role of unknown genes (5–7). To obtain reliable predictions, comparative genomics methods use synteny conservation across multiple species as a strong indication of functional relationship (2,8,9). This approach has proven useful in predicting protein-protein interactions (10), discovering novel functional roles (11), finding orphan enzymes (12) and characterising unknown metagenomic sequences (13).
The invaluable information derived from synteny conservation analysis has led to the development of numerous bioinformatic tools that automate the reconstruction of the genomic context of specific genes across multiple genomes (14,15) and allow it to be explored visually. Most notably, STRING uses genomic context conservation to predict protein-protein interactions (16), and can also display the neighbourhood of specific genes across different branches of the tree of life as a schematic representation. WebFlaGs (17) and TREND (18) can be used to generate static images representing the genomic context of custom genes based on previous computations of homologous sequences. GeConT2 (19) focuses on providing online searches on reference genomes, while GeneSpy (20) can be used to generate custom plots based on local computations. However, despite the unquestionable value of previous software, tools that provide fast and interactive exploration of genomic context conservation of prokaryotic genes while keeping a comprehensive phylogenetic and functional scope are still missing.
Here, we present GeCoViz, a highly interactive web application for visualising genomic context conservation that offers a responsive and easy-to-use interface accessible to non-expert users. To provide fast searches, GeCoViz uses precomputed orthology assignments, phylogenetic information and functional annotations for over 42 million genes extracted from 12 221 reference prokaryotic genomes. Moreover, GeCoViz offers highly customizable searches, allowing users to easily adjust the phylogenetic scope of each analysis and select which functional annotations of neighbouring genes are used to highlight synteny conservation. Notably, GeCoViz uses eggNOG v5 (21) for its orthology assignments, enabling the exploration of thousands of hypothetical genes that are missing from other databases (COG, KEGG, PFAM) but are still classified into orthologous groups of unknown function in eggNOG. This is particularly relevant for the characterization of unknown genes without experimentally validated homologs in any reference genome, which account for nearly one third of all the predicted genes in current databases.
RESULTS
Phylogenetic and functional scope
GeCoViz is built upon the reference set of genomes provided in proGenomes v2.0 (22), which includes 11 710 and 511 representative bacterial and archaeal species, respectively. In total, GeCoViz covers 42 542 377 protein coding genes, which were pinpointed to their location in their respective genomes. Comprehensive functional predictions and orthology assignments were computed de novo for all gene entries using eggNOG-mapper v2.1 (23). In total, 34 554 188 genes (81.2%) were mapped to at least one orthologous group in eggNOG v5, 21 671 011 (50.9%) were annotated to KEGG modules and pathways (24), and 32 215 373 to PFAM domains (25). Overall, we estimate that besides the core set of genes putatively assigned to known KEGG pathways or having known domains, GeCoViz enables the exploration of 5 390 779 highly hypothetical genes (no KEGG or known PFAM domains), spanning 1 304 714 eggNOG orthologous groups of unknown function.
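The figures above can be cross-checked with a few lines of arithmetic. The sketch below only uses the counts quoted in this section; the variable and function names are illustrative, and the PFAM percentage is derived here rather than stated in the text.

```python
# Counts quoted in the text for the proGenomes v2.0 reference set.
TOTAL_GENES = 42_542_377
BACTERIAL_GENOMES = 11_710
ARCHAEAL_GENOMES = 511

# Number of genes carrying each type of annotation (from the text).
annotated = {
    "eggNOG": 34_554_188,
    "KEGG": 21_671_011,
    "PFAM": 32_215_373,
}


def coverage_percent(count: int, total: int = TOTAL_GENES) -> float:
    """Share of all genes carrying an annotation, as a percentage with one decimal."""
    return round(100 * count / total, 1)


total_genomes = BACTERIAL_GENOMES + ARCHAEAL_GENOMES
coverage = {source: coverage_percent(n) for source, n in annotated.items()}

print(total_genomes)
for source, pct in coverage.items():
    print(f"{source}: {pct}%")
```

The genome total reproduces the 12 221 genomes mentioned in the abstract, and the eggNOG and KEGG shares reproduce the 81.2% and 50.9% reported above.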
Hypothesis driven exploration of genomic context
GeCoViz allows users to look up the genomic context of any bacterial or archaeal gene, automatically estimating the conservation of its genomic neighbours across a custom set of species. Searches can be performed using gene names, protein sequences or orthologous group identifiers from either the eggNOG v5 or KEGG databases as a query. When a single gene name or protein sequence is queried, GeCoViz uses precomputed orthology assignments to automatically identify equivalent genes (putative orthologs) across the selected genomes. In addition, PFAM names can be queried to explore the context of genes sharing the same protein domain. Since many hypothetical proteins are covered by eggNOG groups of unknown function, GeCoViz can be easily used to explore their genomic context and obtain hints on their possible functional role. Thus, GeCoViz allows users to perform hypothesis-driven searches to either inspect the functional context and synteny conservation of uncharacterized genes, or to explore differences in the genomic organisation of known molecular functions and pathways across a custom set of genomes and organisms (see example use cases).
Interactive exploration of genomic neighbourhood
GeCoViz offers a highly interactive and customizable exploration panel. When a new search is triggered, genomic context and synteny conservation are shown for up to 100 homologous genes, automatically selected from all major lineages in the prokaryotic phylogeny where the query term was found. The taxonomic scope can then be adjusted by means of the interactive sunburst chart available in the taxonomic control panel (Figure 1), allowing users to automatically add or remove representative species from custom clades, or to manually select specific genomes.
Genes matching the original query—which are automatically grouped by either orthology, domain or metabolic pathway annotation—are vertically aligned in the genomic context panel and used as an anchoring point to display up- and down-stream loci for each genomic region (Figure 2). The genomic window size and graphical aspect of the genomic context representation can also be adjusted by the user. For instance, scaled gene lengths and genomic distances are displayed by default, but an alternative unscaled visualisation mode can be selected from the context visualisation menu. Moreover, a guiding tree sorting genomic regions by the NCBI Taxonomy, as well as additional habitat information for each species, can be optionally shown in the genomic context panel.
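The anchoring idea described above can be illustrated with a minimal sketch. It assumes each genomic region is available as an ordered list of gene identifiers; the function name `anchored_window`, the gene labels and the padding with `None` are hypothetical choices made for illustration, and strand orientation and scaled coordinates are ignored. It is not the GeCoViz implementation.

```python
from typing import Dict, List, Optional


def anchored_window(genes: List[str], anchor: str, n_flank: int) -> List[Optional[str]]:
    """Return 2 * n_flank + 1 slots centred on `anchor`.

    Positions that fall beyond either end of the region are filled with None,
    so that the anchor always sits in the middle slot.
    """
    i = genes.index(anchor)  # raises ValueError if the anchor is absent
    window: List[Optional[str]] = []
    for pos in range(i - n_flank, i + n_flank + 1):
        window.append(genes[pos] if 0 <= pos < len(genes) else None)
    return window


# Toy regions (hypothetical gene labels); "query" is the gene matching the search.
regions: Dict[str, List[str]] = {
    "genome_1": ["a", "b", "query", "c", "d"],
    "genome_2": ["query", "c", "d", "e"],
    "genome_3": ["x", "a", "b", "query"],
}

# Every row has the same length and the query gene in the same column,
# which is what makes the regions vertically comparable.
aligned = {name: anchored_window(genes, "query", 2) for name, genes in regions.items()}
for name, row in aligned.items():
    print(name, row)
```

Changing `n_flank` in this sketch corresponds to widening or narrowing the genomic window around the anchor.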
To facilitate the analysis of synteny conservation of particular genes across selected genomes, GeCoViz dynamically calculates a vertical conservation score (VCS) for each gene entry. The VCS can be estimated based on any of the annotations associated with the genes shown, such as the eggNOG orthologous groups restricted to custom taxonomic levels, KEGG orthologs and pathways, and PFAM domains. VCS is calculated as the percentage of genomes that contain the selected term (e.g. eggNOG, KEGG, PFAM), over the total number of genomes shown. Users can adjust the threshold of this simple score to highlight and colour genes that are more or less prevalent across the selected set of genomes, facilitating the identification of conserved patterns.
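As a rough illustration of the score defined above, the following sketch computes a VCS for each annotation term from the set of genomes shown. The data layout (one list of terms per genome), the function names and the use of a greater-than-or-equal comparison for the threshold are assumptions made for this example, not details taken from the GeCoViz code.

```python
from typing import Dict, Iterable, Set


def vertical_conservation_scores(genome_terms: Dict[str, Iterable[str]]) -> Dict[str, float]:
    """Percentage of displayed genomes whose neighbourhood contains each term."""
    n_genomes = len(genome_terms)
    if n_genomes == 0:
        return {}
    counts: Dict[str, int] = {}
    for terms in genome_terms.values():
        for term in set(terms):  # a genome counts once per term, even with duplicates
            counts[term] = counts.get(term, 0) + 1
    return {term: 100 * c / n_genomes for term, c in counts.items()}


def highlighted(scores: Dict[str, float], threshold: float) -> Set[str]:
    """Terms whose VCS reaches the user-selected threshold (assumed inclusive)."""
    return {term for term, score in scores.items() if score >= threshold}


# Toy neighbourhoods: terms could be eggNOG groups, KEGG orthologs or PFAM domains.
neighbourhoods = {
    "genome_1": ["OG_query", "OG_a", "OG_b"],
    "genome_2": ["OG_query", "OG_a"],
    "genome_3": ["OG_query", "OG_a", "OG_a"],
    "genome_4": ["OG_query"],
    "genome_5": ["OG_query", "OG_b"],
}

scores = vertical_conservation_scores(neighbourhoods)
print(scores)
print(highlighted(scores, 50))
print(highlighted(scores, 20))
```

In this toy case, lowering the threshold from 50 to 20 brings the moderately conserved term `OG_b` into the highlighted set, mirroring how a relaxed threshold reveals less prevalent neighbours.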
Moreover, users can interact with each gene entry by either clicking or hovering on its graphical representation. Clicking on a gene displays a window with a detailed description of its location and annotated function, whereas hovering over it immediately highlights other genes belonging to the same orthologous group across all the genomes. In addition, users can interact with the genomic context panel by: (i) hiding specific genomes by clicking on their respective tree nodes, (ii) showing their isolation source, (iii) expanding the size of the genomic window by increasing the number of up- and down-stream neighbouring genes shown and (iv) displaying genes in schematic format. Finally, users may download: (i) a table with the complete genomic context information, including functional annotations, gene order and all the gene sequences and (ii) high-quality images of any custom view of the genomic context analysis.
EXAMPLE CASE STUDIES
Predicting the functional role of hypothetical proteins
The Salmonella enterica sv. typhimurium LT2 coding gene STM0239 (yaeQ) is currently annotated as a hypothetical protein in NCBI. A sequence search in GeCoViz assigns STM0239 to the orthologous group of unknown function COG4681 (eggNOG), which can be directly queried in GeCoViz. Exploration of COG4681 shows that yaeQ has 1959 orthologs widely distributed across the bacterial phylogeny and is embedded in a highly conserved region in all species of the family Enterobacteriaceae (Figure 3A). Most of the conserved neighbouring genes of yaeQ are involved in tRNA metabolism, and the gene itself is in an operon-like structure with yaeJ, which encodes a translation release factor. This suggests that yaeQ might indeed be involved in translation processes, which supports previous works reporting that YaeQ has a role in regulating the expression of virulence factors in both Escherichia coli and Salmonella typhimurium (26,27).
Discovering novel genes associated with known pathways
Similarly, genomic context can provide valuable insights into not only functional but also regulatory relationships between neighbouring genes. GeCoViz facilitates the discovery of putative target genes of known regulatory systems by means of the custom adjustment of taxonomic scopes and VCS thresholds. To illustrate this, we analysed the genomic context of PetRP, a recently described regulatory system involved in the plastocyanin (PC)/cytochrome c6 (C6) switch in cyanobacteria (28). Although the copper regulation of these two proteins in cyanobacteria was established 30 years ago (29), the regulatory system remained elusive until its recent identification by a genomic context approach (28). In that work, the role of PetR (the Slr0240 protein of Synechocystis sp. PCC 6803), a homolog of the copper transcriptional regulator CopY, as a potential regulator of the PC/C6 switch was explored. Searching Slr0240 by protein sequence in GeCoViz assigned it to the eggNOG orthologous group COG3682, a transcriptional negative regulator in bacteria. An initial visualisation of the genomic context of COG3682 in the phylum Cyanobacteria reveals that petR (Slr0240 in Synechocystis) is always adjacent to the metallopeptidase coding gene petP (eggNOG 1G0TE) (Figure 3B), with no other genes highly conserved in their context. However, by decreasing the VCS threshold to ∼20%, GeCoViz highlights two other neighbouring genes moderately conserved around petRP: comB, encoding a 2-phosphosulfolactate phosphatase, and the experimentally validated target of PetR, petJ (C6; Figure 3C) (28).
Identifying gene fusion events
Besides genomic context conservation, GeCoViz also allows users to easily spot possible gene fusions. For instance, the exploration of genes of unknown function under the orthologous group COG3220 reveals that the target gene is tightly coupled to a putative DNA binding protein (COG3219; Figure 3D), occasionally leading to fusion events in several bacterial orders. This reinforces the functional relationship between the two genes and suggests a role for the target gene as a DNA-interacting protein.
IMPLEMENTATION DETAILS
GeCoViz uses MongoDB (https://www.mongodb.com/) for storing precomputed genomic data and Django (https://www.djangoproject.com/) on the server side. The web frontend uses Vue.js (https://vuejs.org/) as its JavaScript framework. The code for generating genomic context layouts relies on the data visualisation library D3.js (https://d3js.org). Data flow and technical implementations are depicted in Supplementary Figure S1.
DATA AVAILABILITY
GeCoViz is available at https://gecoviz.cgmlab.org.
ACKNOWLEDGEMENTS
We thank Dr Daniel Mende for his greatly appreciated help in retrieving raw genome information from proGenomes 2.0.
Contributor Information
Jorge Botas, Centro de Biotecnología y Genómica de Plantas, Universidad Politécnica de Madrid (UPM) - Instituto Nacional de Investigación y Tecnología Agraria y Alimentaria (INIA-CSIC), Campus de Montegancedo-UPM, Madrid, 28223, Spain.
Álvaro Rodríguez del Río, Centro de Biotecnología y Genómica de Plantas, Universidad Politécnica de Madrid (UPM) - Instituto Nacional de Investigación y Tecnología Agraria y Alimentaria (INIA-CSIC), Campus de Montegancedo-UPM, Madrid, 28223, Spain.
Joaquín Giner-Lamia, Centro de Biotecnología y Genómica de Plantas, Universidad Politécnica de Madrid (UPM) - Instituto Nacional de Investigación y Tecnología Agraria y Alimentaria (INIA-CSIC), Campus de Montegancedo-UPM, Madrid, 28223, Spain; Departamento de Biotecnología-Biología Vegetal, Escuela Técnica Superior de Ingeniería Agronómica, Alimentaria y de Biosistemas, Universidad Politécnica de Madrid (UPM), Madrid, 28040, Spain.
Jaime Huerta-Cepas, Centro de Biotecnología y Genómica de Plantas, Universidad Politécnica de Madrid (UPM) - Instituto Nacional de Investigación y Tecnología Agraria y Alimentaria (INIA-CSIC), Campus de Montegancedo-UPM, Madrid, 28223, Spain.
SUPPLEMENTARY DATA
Supplementary Data are available at NAR Online.
FUNDING
National Programme for Fostering Excellence in Scientific and Technical Research [PGC2018-098073-A-I00 MCIU/AEI/FEDER, UE to J.HC., J.JL.]; ‘la Caixa’ Foundation [100010434, fellowship code LCF/BQ/DI18/11660009 to A.R.dR.]; European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie [713673].
Conflict of interest statement. None declared.
REFERENCES
- 1. Sela I., Wolf Y.I., Koonin E.V. Theory of prokaryotic genome evolution. Proc. Natl. Acad. Sci. U.S.A. 2016; 113:11399–11407.
- 2. Dandekar T., Snel B., Huynen M., Bork P. Conservation of gene order: a fingerprint of proteins that physically interact. Trends Biochem. Sci. 1998; 23:324–328.
- 3. Teichmann S.A., Babu M.M. Conservation of gene co-regulation in prokaryotes and eukaryotes. Trends Biotechnol. 2002; 20:407–410.
- 4. Pasek S., Risler J.-L., Brézellec P. Gene fusion/fission is a major contributor to evolution of multi-domain bacterial proteins. Bioinformatics. 2006; 22:1418–1423.
- 5. Huynen M., Snel B., Lathe W. 3rd, Bork P. Predicting protein function by genomic context: quantitative evaluation and qualitative inferences. Genome Res. 2000; 10:1204–1210.
- 6. Korbel J.O., Jensen L.J., von Mering C., Bork P. Analysis of genomic context: prediction of functional associations from conserved bidirectionally transcribed gene pairs. Nat. Biotechnol. 2004; 22:911–917.
- 7. Wolf Y.I., Rogozin I.B., Kondrashov A.S., Koonin E.V. Genome alignment, evolution of prokaryotic genome organization, and prediction of gene function using genomic context. Genome Res. 2001; 11:356–372.
- 8. Marcotte C.J.V., Marcotte E.M. Predicting functional linkages from gene fusions with confidence. Appl. Bioinformatics. 2002; 1:93–100.
- 9. Marcotte E.M., Pellegrini M., Ng H.L., Rice D.W., Yeates T.O., Eisenberg D. Detecting protein function and protein-protein interactions from genome sequences. Science. 1999; 285:751–753.
- 10. Rao V.S., Srinivasa Rao V., Srinivas K., Sujini G.N., Sunand Kumar G.N. Protein-Protein interaction detection: methods and analysis. International Journal of Proteomics. 2014; 2014:1–12.
- 11. Jimmy S., Saha C.K., Kurata T., Stavropoulos C., Oliveira S.R.A., Koh A., Cepauskas A., Takada H., Rejman D., Tenson T. et al. A widespread toxin-antitoxin system exploiting growth control via alarmone signaling. Proc. Natl. Acad. Sci. U.S.A. 2020; 117:10500–10510.
- 12. Yamada T., Waller A.S., Raes J., Zelezniak A., Perchat N., Perret A., Salanoubat M., Patil K.R., Weissenbach J., Bork P. Prediction and identification of sequences coding for orphan enzymes using genomic and metagenomic neighbours. Mol. Syst. Biol. 2012; 8:581.
- 13. Sberro H., Fremin B.J., Zlitni S., Edfors F., Greenfield N., Snyder M.P., Pavlopoulos G.A., Kyrpides N.C., Bhatt A.S. Large-Scale analyses of human microbiomes reveal thousands of small, novel genes. Cell. 2019; 178:1245–1259.
- 14. Yelton A.P., Thomas B.C., Simmons S.L., Wilmes P., Zemla A., Thelen M.P., Justice N., Banfield J.F. A semi-quantitative, synteny-based method to improve functional predictions for hypothetical and poorly annotated bacterial and archaeal genes. PLoS Comput. Biol. 2011; 7:e1002230.
- 15. Anand S., Kuntal B.K., Mohapatra A., Bhatt V., Mande S.S. FunGeCo: a web-based tool for estimation of functional potential of bacterial genomes and microbiomes using gene context information. Bioinformatics. 2020; 36:2575–2577.
- 16. Szklarczyk D., Gable A.L., Nastou K.C., Lyon D., Kirsch R., Pyysalo S., Doncheva N.T., Legeay M., Fang T., Bork P. et al. The STRING database in 2021: customizable protein-protein networks, and functional characterization of user-uploaded gene/measurement sets. Nucleic Acids Res. 2021; 49:D605–D612.
- 17. Saha C.K., Sanches Pires R., Brolin H., Delannoy M., Atkinson G.C. FlaGs and webFlaGs: discovering novel biology through the analysis of gene neighbourhood conservation. Bioinformatics. 2021; 37:1312–1314.
- 18. Gumerov V.M., Zhulin I.B. TREND: a platform for exploring protein function in prokaryotes based on phylogenetic, domain architecture and gene neighborhood analyses. Nucleic Acids Res. 2020; 48:W72–W76.
- 19. Martinez-Guerrero C.E., Ciria R., Abreu-Goodger C., Moreno-Hagelsieb G., Merino E. GeConT 2: gene context analysis for orthologous proteins, conserved domains and metabolic pathways. Nucleic Acids Res. 2008; 36:W176–W180.
- 20. Garcia P.S., Jauffrit F., Grangeasse C., Brochier-Armanet C. GeneSpy, a user-friendly and flexible genomic context visualizer. Bioinformatics. 2019; 35:329–331.
- 21. Huerta-Cepas J., Szklarczyk D., Heller D., Hernández-Plaza A., Forslund S.K., Cook H., Mende D.R., Letunic I., Rattei T., Jensen L.J. et al. eggNOG 5.0: a hierarchical, functionally and phylogenetically annotated orthology resource based on 5090 organisms and 2502 viruses. Nucleic Acids Res. 2019; 47:D309–D314.
- 22. Mende D.R., Letunic I., Maistrenko O.M., Schmidt T.S.B., Milanese A., Paoli L., Hernández-Plaza A., Orakov A.N., Forslund S.K., Sunagawa S. et al. proGenomes2: an improved database for accurate and consistent habitat, taxonomic and functional annotations of prokaryotic genomes. Nucleic Acids Res. 2020; 48:D621–D625.
- 23. Cantalapiedra C.P., Hernández-Plaza A., Letunic I., Bork P., Huerta-Cepas J. eggNOG-mapper v2: functional annotation, orthology assignments, and domain prediction at the metagenomic scale. Mol. Biol. Evol. 2021; 38:5825–5829.
- 24. Kanehisa M., Furumichi M., Tanabe M., Sato Y., Morishima K. KEGG: new perspectives on genomes, pathways, diseases and drugs. Nucleic Acids Res. 2017; 45:D353–D361.
- 25. Mistry J., Chuguransky S., Williams L., Qureshi M., Salazar G.A., Sonnhammer E.L.L., Tosatto S.C.E., Paladin L., Raj S., Richardson L.J. et al. Pfam: the protein families database in 2021. Nucleic Acids Res. 2021; 49:D412–D419.
- 26. Vicari D., Artsimovitch I. Virulence regulators RfaH and YaeQ do not operate in the same pathway. Mol. Genet. Genomics. 2004; 272:489–496.
- 27. Wong K.R., Hughes C., Koronakis V. A gene, yaeQ, that suppresses reduced operon expression caused by mutations in the transcription elongation gene rfaH in Escherichia coli and Salmonella typhimurium. Mol. Gen. Genet. 1998; 257:693–696.
- 28. García-Cañas R., Giner-Lamia J., Florencio F.J., López-Maury L. A protease-mediated mechanism regulates the cytochrome c6/plastocyanin switch in Synechocystis sp. PCC 6803. Proc. Natl. Acad. Sci. U.S.A. 2021; 118:5.
- 29. Zhang L., McSpadden B., Pakrasi H.B., Whitmarsh J. Copper-mediated regulation of cytochrome c553 and plastocyanin in the cyanobacterium Synechocystis 6803. J. Biol. Chem. 1992; 267:19054–19059.