Abstract
Background
It has been repeatedly observed that gene order is rapidly lost in prokaryotic genomes. However, persistent synteny blocks are found when comparing more or less distant species. These genes that remain consistently adjacent are appealing candidates for the study of genome evolution and a more accurate definition of their functional role. Such studies require visualizing conserved synteny blocks in a large number of genomes at all taxonomic distances.
Results
After comparing nearly 600 completely sequenced genomes encompassing the whole prokaryotic tree of life, the computed synteny data were assembled in a relational database, SynteBase. SynteView was designed to visualize conserved synteny blocks in a large number of genomes after choosing one of them as a reference. SynteView functions with data stored either in SynteBase or in a home-made relational database of personal data. In addition, this software can compute on-the-fly and display the distribution of synteny blocks which are conserved in pairs of genomes. This tool has been designed to provide a wealth of information on each positional orthologous gene, to be user-friendly and customizable. It is also possible to download sequences of genes belonging to these synteny blocks for further studies. SynteView is accessible through Java Webstart at http://www.synteview.u-psud.fr.
Conclusion
SynteBase answers queries about gene order conservation and SynteView visualizes the obtained results in a flexible and powerful way which provides a comparative overview of the conserved synteny in a large number of genomes, whatever their taxonomic distances.
Background
As prokaryotic species diverge, their gene order is increasingly fading away, except in rare locations where a few genes retain their neighborhood. Such observations gave rise to the concept of genomic context [1-9]. Accordingly, it is assumed that a small number of genes remain adjacent either because their expressions occur at the same time, or because they encode proteins that are constituents of the same molecular machine (e.g. membrane ATPase) or involved in the same cellular function [10]. These genes that remain persistently adjacent in constantly moving genomes form synteny blocks. In a recent work [11], we have identified such synteny blocks in a large and diverse set of nearly 600 microbial genomes using a three-step process. In step one, we compared each protein encoded by a completely sequenced genome with all other available microbial proteomes in order to identify the full set of homologous proteins they share. In step two, we outlined an approach allowing the identification of bona fide orthologues among all recognized homologues when comparing many pairs of genomes. This second step is based on an adaptation of the method designed by Wall et al. [12] to compute the reciprocal smallest distance (RSD) that separates the homologues present in a pair of genomes. Step three allowed further research among the correctly identified orthologues to pinpoint those that belong to a minimal unit that is conserved in each pair of genomes, i.e., a pair of positional orthologous genes (POGs) that remain adjacent in each genome. Then, after extending these minimal units as far as possible, it becomes feasible to assess the relative amount and size of synteny blocks in close and distant species. Such synteny blocks are appealing candidates in the study of the mechanisms of genome evolution and in the verification of the functional annotation of neighboring genes. Accordingly, visualizing these blocks in a large number of genomes at various taxonomic distances help to study their features. In this paper, we describe how to assemble all these synteny data in a relational database (SynteBase) and we develop a tool (SynteView) to visualize all conserved synteny blocks in a large number of completely sequenced prokaryotic genomes.
Implementation
SynteView was designed to display homology and gene context data that are organized in a relational database, SynteBase, described in detail below.
Creating a relational database for synteny data and populating its tables with a dedicated suite of softwares and other tools
We installed PostgreSQL [13], one of the most advanced open source relational database management systems, on a Linux platform and used it to create SynteBase, which is made up of five tables (Fig. 1). The database can be further populated with home-made data using the different tools we developed (see the user guide [Additional file 1]). Alternatively, one can directly use the SynteBase version we built for our own usage (this paper and [11]).
Step one: searching for homologues
Raw data extracted from public genomic databanks (GenBank/EMBL/DDBJ) were organized into two tables. The genome table contains information for the 598 prokaryotic genomes that were compared. The gene/protein table contains many features of their 1,928,135 encoded proteins, such as amino acid sequence, length, species name, location of encoding gene, etc. An exhaustive comparison of all these proteins led to the identification of all homologues. A complete suite of programs (Table 1) was used to compare each pair of proteomes using the following criteria: a pair of aligned proteins was retained as a couple of homologues if their E-values were smaller than 10-5, and if the alignment extended for at least 80% of the length of the shorter matching protein [11,14].
Table 1.
Step | Step designation | Tool | Reference |
1 | identifying homologues | protein Blast | [22] |
2a | identifying orthologues by RBH | Perl script rsd ortho | FL, this work |
2b | clustering homologues | Perl script famtrans | FL, this work |
2c | breaking bridges | graph algorithm (Perl library) | FL, this work |
2d | extracting significant clusters | MCL algorithm | [23] |
3a | identifying pairs of adjacent orthologous genes | SQL query on SynteBase | FL, this work |
3b | discovering synteny blocks | Perl script synblock | FL, this work |
Step two: identifying orthologues among the collected homologues
We further adapted the Reciprocal Best Blast Hit approach [12] to analyze the Blast results obtained in the first step. The best RSD orthologous pairs were determined in each comparison of two proteomes as follows. Protein a encoded by genome GA and protein b encoded by genome GB form the best pair of orthologues if the distance separating a from b is smaller than the distance separating both a from any other protein encoded by GB and b from any other protein encoded by GA. We automated this search (Table 1, step 2a). The data obtained were used to populate the ortho table (Fig. 1).
Step three: identifying positional orthologous genes among the collected orthologues
Once populated, the first three tables were used to identify the synteny blocks. We devised a specific SQL query (see [Additional file 2]) to discover the pairs of adjacent orthologous genes (Table 1, step 3a). Then, blocks of size greater than 2 were detected by progressive accretion of blocks of size 2 which shared a common pair of orthologues (Table 1, step 3b). These computed data were entered in the neighborpairs and synteny blocks tables, respectively (Fig. 1).
Architecture of SynteView
To implement SynteView, we applied an object oriented programming paradigm using the Java programming language [15]. In this way, SynteView may be run either as a Java Webstart application or as a local application (Fig. 2). In both cases, SynteView can be used to query SynteBase through a web service (web service mode), or used to query a local synteny database (loca access mode). The web service mode allows the user to visualize the precomputed data that are present in our version of SynteBase. To do so, SynteView connects to the SynteView web service to retrieve synteny data present in SynteBase. The local access mode will be useful for those who wish to work online, with home-made computed data. This mode requires the local installation of the Data Base Management System PostgreSQL [13], and the creation of a committed SynteBase-like database that must be populated with home made synteny data after applying the following mandatory requirements to visualize these data. SynteView requires information on proteins (identifier, coding strand, sequence, function, and length), genomes (species name, species name abbreviation, strain name, taxonomy), and synteny blocks (identifier of the blocks, and pairs of identifiers of orthologous proteins belonging to this block). Note that it does not matter how the data are organized in the underlying local database. SynteView parameters can be set to retrieve the data it needs. However, while SynteView is independent of the name of the selected fields, their order is of importance for correct functioning. Components required to set up a local database are described in detail in the Additional file 1. Once the custom-made database has been built, SynteView can connect to it, after the settings, including connection information (server, login, etc) and all the mandatory queries have been filled out.
Results
Visualizing synteny data with SynteView
The whole set of synteny data that was stored in SynteBase was further examined using SynteView. This tool was designed to provide a wealth of information on each positional orthologous gene, to be user-friendly and customizable. For example, the user can choose the set of genomes to be studied by defining either an array of species names or a taxonomic sampling. The procedure used to visualize synteny between a reference species s1 and a set of species (s2, s3, s4, s5) is straightforward. The user first chooses a reference species, in the "select reference genome panel" by selecting nodes in the species tree (Fig. 3). Clicking on a node produces a list of all the species that are its leaves (right panel). Then, the reference species is chosen by clicking on the species name in this list. Next, the set of compared species is determined by means of the "select compared species" tab (Fig. 3). As previously noted, the user browses the taxonomic tree of prokaryotes. When the user clicks on one node of the tree (e.g. Enterobacteriales), all the descendants of this node appear in the bottom panel. To choose one or several species, a drag and drop of the selected names will move the corresponding species into the right panel. This can be repeated several times, until the required set of species is selected. When this step is accomplished, clicking on the "Start data retrieval process" button on the bottom of the panel will launch the visualization step. The speed of this process depends on the number and nature of the chosen species. Once the retrieval process is completed, all regions of each compared genome become accessible for visualization in a scrollable window using the following features as shown in Fig. 4. Each line corresponds to a genome. The first line from the top (light blue background) shows gene adjacency in the chromosome of the reference species. Dark blue (positive DNA strand) and yellow (negative strand) rectangles stand for genes belonging to a synteny block that is conserved in at least one other species. Gray rectangles are genes of this reference genome that do not have any POGs in the set of compared genomes. Respective gene names are labeled on each rectangle. The following lines contain the different species that are compared to the reference genome. SynteView automatically sorts the chosen species by their taxonomic proximity to the reference genome. For each gene of the reference genome, columns contain the orthologous genes belonging to a synteny block found to be conserved in the different analyzed genomes with their respective names. The same color code (blue or yellow) helps to discriminate the strand of their respective location on each genome. The number of genes present in a block is displayed when the cursor is run over this block. Note that synteny blocks in compared genomes are defined exclusively with respect to the gene order in the reference genome. Thus, in a SynteView window of synteny blocks, the apparent proximity in compared genomes does not imply that they are as physically close in these genomes as their POGs are in the reference genome. By opening the Settings panel (to do so, click on the "settings" button in the left toolbar menu) the user accesses a Dialog box where it is possible to modify various default parameters. For example, clicking on the "Database" tab allows the user to choose the retrieval mode (database or web service). Once these various parameters have been customized, it is possible to navigate along the reference genome to estimate the density of the synteny blocks present in the other genomes. For example, and as expected, comparing E. coli with the other gammaproteobacteria reveals a rather high density of gene conservation. The bottom blue background shape portrays this rate of conservation in the compared genomes as a histogram (Fig. 4).
Using SynteView for comparative analysis of gene context
Information about any annotated gene is immediately available by clicking on the corresponding rectangle. This opens, to the left of the window, the "gene information" panel (Fig. 4) in which, for the selected gene, its GenBank PID, its name, the species name and the replicon to which it belongs are given; the function of its product (if available), and its exact location on the chromosome are also mentioned. This information panel also contains a text field which permits simple queries such as a search for a protein function, a gene name or a PID in the analyzed genomes, as well as a search for synteny blocks containing at least x adjacent genes and having orthologous genes in at least y species. Moreover, clicking on a gene delivers complete information on its neighbors. For instance, it is possible to estimate the various levels of conservation of detected operons when comparing organisms separated by various taxonomic distances. While the operon histidine is rather well conserved in proteobacteria (Fig. 5, panel A), the neighboring clusters of genes involved in the O-specific lipopolysaccharide biosynthesis (rfb cluster) and the production of extracellular polysaccharide colanic acid (cluster wca), which are located at a short distance and on the other strand, are rapidly fragmented to a scarce number of 2–4 genes such as the rml genes in Pseudomonas aeruginosa (Fig. 5, panel B). In addition, clicking on the "get sequences" button in the information panel opens a dialog box. SynteView shows the sequence of the clicked gene in the first tab and that of its orthologues in the second tab in Fasta format. This further allows downloading of all these amino acid sequences for future work.
Using SynteView for comparative analysis of multiple views
SynteView was also designed to allow complex studies by means of easy and simple operations. For example, looking at a peculiar set of species makes it possible to immediately visualize new assortments of synteny blocks. This is done simply by selecting a new reference species by clicking on a species name on the left of the display and/or by changing the list of compared genomes. Moreover, contrary to challenging tools (see Discussion below), SynteView allows global analyses of the synteny data using various points of view. Scrolling up and down the same window, one can assess the level of conservation of gene order at various taxonomic depths, the relative density of the synteny blocks along the whole genome, the relative size of the blocks, and the respective events of gene insertion/deletion in close and distant species.
Using SynteView to quantify synteny data
Besides being a visualization tool, SynteView can display various kinds of histograms which are computed on-the-fly. For example, the percentage of species displaying POGs in the same synteny block in the reference species is automatically computed and displayed as a histogram (blue background shape at the bottom of the main display). It is also easy to display the distribution of the size of synteny blocks which are conserved between genomes by selecting a pair of species and clicking on the Histogram button in the left toolbar. This histogram may be saved for further use by selecting the "Save as" button located in the contextual menu in the window. Table 2 summarizes the data obtained when comparing the model organism Bacillus subtilis with various bacteria and archaea. It appears that the number of genes present in conserved synteny blocks depends on the phylogenetic (taxonomic) distance between species. Indeed, the mean size of synteny blocks is close to 3.3 genes when comparing two closely related bacteria such as the Bacillaceae B. subtilis and Oceanobacillus iheyensis, whereas it diminishes to nearly 2 when comparing a bacterium (B. subtilis) and an archaeon (Methanosarcina acetivorans), although these genomes encode a similar range of proteins (3000–4500). Likewise, the longest block ranges from 19 to 4 for the same species comparisons.
Table 2.
species 1 | species 2 | synteny blocks | |||
B. subtilisa Taxonomy | Species name | Taxonomy | Proteome size | average size | longest size |
Bacillaceae | Oceanobacillus iheyensis | Bacillaceae | 3500 | 3.6 | 23 |
Firmicutes | Shewanella oneidensis | Proteobacteria | 4471 | 2.4 | 8 |
Firmicutes | Synechocystis species | Cyanobacteria | 3167 | 2.8 | 8 |
Firmicutes | Mycobacterium tuberculosis | Actinobacteria | 4187 | 2.5 | 11 |
Bacteria | Methanosarcina acetivorans | Archaea | 4540 | 2.3 | 7 |
The average size (number of genes) of synteny blocks is dependent on the taxonomic distance separating a pair of genomes that have been compared at the level of their genetic context.
a The size of the B. subtilis proteome is 4112.
Discussion
SynteView was designed to allow fast and easy visualization of the conservation of gene adjacency in many genomes for which orthology and neighborhood data were computed and stocked in a dedicated relational database SynteBase. Our goal was to develop a flexible yet powerful tool to work directly with home-computed data obtained after comparing large and diverse sets of species. Indeed, our tool can be easily installed on any personal computer endowed with one of the main operating systems (Windows, Mac OS X or Linux). Moreover, SynteView can be customized in many aspects. In particular, it can be used with another, home-made, database in place of SynteBase. We observed that among the other tools to visualize synteny data [16-20] that have been designed to be locally installed, not one is adapted to the use of the abundant genomic data for prokaryotic species. Contrary to these previously published softwares [16-20], SynteView allows the user to compare the gene order in many different genomes in the same window. Finally, the strict relationship between SynteBase and SynteView allows their user to enlarge the study of gene order by means of specific queries on SynteBase. In addition to the visualization of synteny blocks, it is possible to obtain productive information through various requests such as "How many genes are involved in a neighbouring relationship, for each pair of genomes?"
Conclusion
We anticipate that we will be inundated by thousands of completely sequenced genomes in the next few years [21]. Our tool SynteBase/SynteView has been designed to support such large sets of prokaryotic data. This tool will serve to quickly evaluate the conservation of gene order in newly-published genomes as soon as they have been compared to those already analyzed.
Availability and requirements
• Project name: SynteView/SynteBase
• Project home page: http://www.synteview.u-psud.fr
• Operating System(s): Windows, Linux, MacOS X (Java web start)
• Programming Language: Java
• Other requirements: Java 1.5
• License: GNU GPL
• Any restrictions to use by non-academics: none
• Perl scripts: available on request
Abbreviations
POGs: Positional Orthologous Genes; RSD: Reciprocal Smallest Distance; SQL: Structured Query Language.
Authors' contributions
FL wrote the different programs necessary to collect all synteny data and to build up the relational database and the visualizing tool. He is responsible for the website. The three authors participated in the design of the experimental approach, the conception of the tools, and the data analysis. Together, the three authors wrote this manuscript.
Supplementary Material
Acknowledgments
Acknowledgements
FL is a PhD student supported by the French Ministry of Research. This work was funded by the CNRS (UMR 8621) and the Agence Nationale de la Recherche (ANR-05-MMSA-0009 MDMS NV 10). We gratefully acknowledge Stéphane Descorps-Declère for his help in designing the genome comparison pipeline and Mary Bouley (Université de Bourgogne) for her aid in improving the quality of our manuscript.
Contributor Information
Frédéric Lemoine, Email: frederic.lemoine@igmors.u-psud.fr.
Bernard Labedan, Email: bernard.labedan@igmors.u-psud.fr.
Olivier Lespinet, Email: olivier.lespinet@igmors.u-psud.fr.
References
- Dandekar T, Snel B, Huynen M, Bork P. Conservation of gene order: a fingerprint of proteins that physically interact. Trends in Biochemical Sciences. 1998;23:324–328. doi: 10.1016/S0968-0004(98)01274-2. [DOI] [PubMed] [Google Scholar]
- Huynen MA, Bork P. Measuring genome evolution. Proc Natl Acad Sci USA. 1998;95:5849–5856. doi: 10.1073/pnas.95.11.5849. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Enright AJ, Iliopoulos I, Kyrpides NC, Ouzounis CA. Protein interaction maps for complete genomes based on gene fusion events. Natur. 1999;402:86–90. doi: 10.1038/47056. [DOI] [PubMed] [Google Scholar]
- Marcotte EM, Pellegrini M, Ng HL, Rice DW, Yeates TO, Eisenberg D. Detecting protein function and protein-protein interactions from genome sequences. Science. 1999;285:751–753. doi: 10.1126/science.285.5428.751. [DOI] [PubMed] [Google Scholar]
- Overbeek R, Fonstein M, D'Souza M, Pusch GD, Maltsev N. The use of gene clusters to infer functional coupling. Proc Natl Acad Sci USA. 1999;96:2896–2901. doi: 10.1073/pnas.96.6.2896. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Pellegrini M, Marcotte EM, Thompson MJ, Eisenberg D, Yeates TO. Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proc Natl Acad Sci USA. 1999;96:4285–4288. doi: 10.1073/pnas.96.8.4285. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Galperin MY, Koonin EV. Who's your neighbor? New computational approaches for functional genomics. Nat Biotechnol. 2000;18:609–613. doi: 10.1038/76443. [DOI] [PubMed] [Google Scholar]
- Huynen M, Snel B, Lathe W, Bork P. Exploitation of gene context. Curr Opin Struct Biol. 2000;10:366–370. doi: 10.1016/S0959-440X(00)00098-1. [DOI] [PubMed] [Google Scholar]
- Wolf Y, Rogozin I, Kondrashov A, Koonin E. Genome alignment, evolution of prokaryotic genome organization, and prediction of gene function using genomic context. Genome Research. 2001;3:356–372. doi: 10.1101/gr.GR-1619R. [DOI] [PubMed] [Google Scholar]
- Huynen M, Snel B, Lathe W, Bork P. Predicting Protein Function by Genomic Context: Quantitative Evaluation and Qualitative Inferences. Genome Research. 2000;10:1204–1210. doi: 10.1101/gr.10.8.1204. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Lemoine F, Lespinet O, Labedan B. Assessing the evolutionary rate of positional orthologous genes in prokaryotes using synteny data. BMC Evol Biol. 2007;7:237. doi: 10.1186/1471-2148-7-237. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Wall D, Fraser H, Hirsh A. Detecting putative orthologs. Bioinformatic. 2003;19:1710–1711. doi: 10.1093/bioinformatics/btg213. [DOI] [PubMed] [Google Scholar]
- PostgreSQL database management systems http://www.postgresql.org/
- Le Bouder-Langevin S, Capron-Montaland I, De Rosa R, Labedan B. A strategy to retrieve the whole set of protein modules in microbial proteomes. Genome Research. 2002;12:1961–1973. doi: 10.1101/gr.393902. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Java Technology http://java.sun.com/
- Sinha AU, Meller J. Cinteny: flexible analysis and visualization of synteny and genome rearrangements in multiple organisms. BMC Bioinformatics. 2007;8:82. doi: 10.1186/1471-2105-8-82. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Wang H, Su Y, Mackey AJ, Kraemer ET, Kissinger JC. SynView: a GBrowse-compatible approach to visualizing comparative genome data. Bioinformatic. 2006;22:2308–2309. doi: 10.1093/bioinformatics/btl389. [DOI] [PubMed] [Google Scholar]
- Hunt E, Hanlon N, Leader DP, Bryce H, Dominiczak AF. The visual language of synteny. OMIC. 2004;8:289–305. doi: 10.1089/omi.2004.8.289. [DOI] [PubMed] [Google Scholar]
- Pan X, Stein L, Brendel V. SynBrowse: a synteny browser for comparative sequence analysis. Bioinformatic. 2005;21:3461–3468. doi: 10.1093/bioinformatics/bti555. [DOI] [PubMed] [Google Scholar]
- Byrne KP, Wolfe KH. The Yeast Gene Order Browser: Combining curated homology and syntenic context reveals gene fate in polyploid species. Genome Research. 2005;15:1456–1461. doi: 10.1101/gr.3672305. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Overbeek R, Begley T, Butler RM, Choudhuri JV, Chuang HY, Cohoon M, de Crecy-Lagard V, Diaz N, Disz T, Edwards R, Fonstein M, Frank ED, Gerdes S, Glass EM, Goesmann A, Hanson A, Iwata-Reuyl D, Jensen R, Jamshidi N, Krause L, Kubal M, Larsen N, Linke B, McHardy AC, Meyer F, Neuweger H, Olsen G, Olson R, Osterman A, Portnoy V, Pusch GD, Rodionov DA, Ruckert C, Steiner J, Stevens R, Thiele I, Vassieva O, Ye Y, Zagnitko O, Vonstein V. The subsystems approach to genome annotation and its use in the project to annotate 1000 genomes. Nucleic Acids Res. 2005;33:5691–5702. doi: 10.1093/nar/gki866. [DOI] [PMC free article] [PubMed] [Google Scholar]
- protein BLAST http://blast.ncbi.nlm.nih.gov/
- MCL – a cluster algorithm for graphs http://micans.org/mcl/
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.