Abstract
In the past few decades, scientists worldwide have taken a keen interest in novel functional units within bacterial intergenic regions (IGRs), such as small regulatory RNAs, small open reading frames, pseudogenes, transposons, integrase binding attB/attP sites and repeat elements, and in the analysis of those “junk” regions for genomic complexity. Here we have developed a web server, named Junker, to facilitate the in-depth analysis of IGRs by examining their length distribution, four-quadrant plots, GC percentage and repeat details. Upon selection of a particular bacterial genome, the physical genome map is displayed as multiple loci, with options to view any locus of interest in detail. In addition, an IGR statistics module has been implemented in the web server to analyze the length distribution of the IGRs and, by generating four-quadrant plots, to understand the disordered grouping of IGRs across the genome. The proposed web server is freely available at the URL http://pranag.physics.iisc.ernet.in/junker/.
Key words: bacterial genome, intergenic region, web server, statistics module
Introduction
The genomic era has witnessed the sequencing of over 1,400 prokaryotic genomes, which enables scientists to analyze these genomes and gain clear insight into their functional aspects. Prokaryotic intergenic regions (IGRs) are a natural home to a variety of functional elements; thus, the annotation of IGRs is essential for a complete understanding of bacterial physiology. In past years, bacterial IGRs were routinely analyzed to identify structural non-coding RNAs (tRNA, rRNA and sRNA), which have multiple roles in the survival of the cell (1, 2). IGRs were found to carry important functional units such as transposons (3), integrase binding sites (4), transcription factor binding sites, small open reading frames (ORFs), pseudogenes and inverted repeats (5). Recently, traces of potential coding genes were also detected in IGRs (6). Consequently, a few qualitative and quantitative studies were performed to identify the dynamics of bacterial IGRs. One such study on the Escherichia coli K12-MG1655 genome (7) compared the cumulative length distribution of IGRs between the two replicores (left and right) to identify the impact of IGRs on the distribution of sRNA-encoding genes. That study found that most of the sRNA genes were located in the left replicore, although the proportions of IGRs were equal on both segments. It also pointed out that a high number of sRNAs reside within IGRs between 300 and 900 nucleotides in length. On the other hand, the total non-coding DNA or IGR content was found to be associated with the increased biological complexity of organisms (8). Although a few computational methods were developed to retrieve genes and their intergenic contexts (9, 10), no specific tool is available for identifying the distribution pattern of IGRs and analyzing them statistically at the genome level. Thus, we have developed a web-based tool, named Junker, to identify the length distribution pattern of IGRs in a complete genome.
The proposed server can also be used to calculate the cumulative intergenic content of the four equal segments of the genome (quadrants) or of the left and right replicores (2, 7). The flanking distance between neighboring genes provides a measurement of local gene density (LGD) (11), which indicates that the quadrant-specific intergenic content has an inverse relationship with the LGD and is positively correlated with variable segments of the genome (12, 13). The proposed web server is freely available at the URL http://pranag.physics.iisc.ernet.in/junker/.
Web Server
Implementation and utilities
The proposed web server integrates and reports information about the IGRs present in bacterial genomes. To create a local intergenic database, all the available bacterial genomes were downloaded from the NCBI portal (ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/). Next, the corresponding IGRs were extracted from the bacterial genomes by excluding the protein- and RNA-encoding regions. Accordingly, two options are implemented in the server: IGRs defined by protein annotations alone, and IGRs defined by protein and RNA annotations. The web interface thus enables users to search for distinct IGRs based on their length and location. By default, Junker searches for IGRs with a minimum length of 20 nucleotides, but an option is also provided to increase the minimum length. In addition, the nucleotide sequences extracted from IGRs are subjected to various analyses. For example, the experimentally determined pseudogenes are mapped using the annotated gene information from the GenBank file. The presence of functional protein-coding regions or ORFs in the IGRs is predicted using the gene prediction tools GeneMark2.5 (14), Glimmer3 (15) and Prodigal2.50 (16). In addition, the identical, tandem and inverted repeats in the IGR sequences are identified using FAIR (17) and the “etandem” and “einverted” programs from the EMBOSS suite (18).
All the IGR extraction and file handling modules and the search engine were designed and implemented using Perl/CGI scripts. The histograms and circular maps presented were created using the GD::Graph module (v1.43) (http://search.cpan.org/~mverb/GDGraph-1.43/). The web server runs under the Solaris (v10.0) operating system on a 64-bit quad-core Intel Xeon 5430 processor of 2.67 GHz with 4 GB of random access memory. The web server is implemented with user-friendly options to give explicit results. Presently, the local genome database of Junker contains 1,023 bacterial genomes.
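The extraction step described above can be sketched as follows. This is a minimal Python illustration, not the Perl/CGI code used by Junker; the function name `intergenic_regions`, the 1-based inclusive coordinates and the treatment of the genome as linear (no gap wrapping around the origin) are assumptions made for the example.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # hypothetical convention: 1-based, inclusive (start, end)


def intergenic_regions(features: List[Interval],
                       genome_length: int,
                       min_length: int = 20) -> List[Interval]:
    """Return the gaps between annotated features that span at least min_length bases.

    `features` holds the coordinates of the annotated regions to exclude
    (protein-coding genes only, or protein-coding plus RNA genes).
    Overlapping or nested features are handled by tracking the furthest
    covered position. The genome is treated as linear (an assumption).
    """
    igrs = []
    cursor = 1  # first position not yet covered by any feature
    for start, end in sorted(features):
        if start - cursor >= min_length:   # gap runs from cursor to start - 1
            igrs.append((cursor, start - 1))
        cursor = max(cursor, end + 1)
    if genome_length - cursor + 1 >= min_length:  # gap after the last feature
        igrs.append((cursor, genome_length))
    return igrs


if __name__ == "__main__":
    genes = [(100, 200), (150, 300), (310, 400), (450, 500)]
    print(intergenic_regions(genes, genome_length=600))
```

Passing only protein-coding features, or protein-coding plus RNA features, corresponds to the two annotation options; raising `min_length` mirrors the option to increase the default threshold of 20 nucleotides.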
Features
Users can select their genome of interest from the list provided on the index page of the server. Additional options allow users to change the minimum length of the IGRs and their location.
Selection of IGR of interest
The web server enables users to select a particular region of the whole genome using the physical genome map viewer. In general, the selected region covers an interval of 200,000 base pairs and is used to list the IGRs present within it. The detailed report of the IGRs extracted from the selected map position contains their start and end positions, the IDs of the adjacent flanking genes with their length information, the different types of repeat elements and the known pseudogenes present in the IGR sequence (Figure S1). In addition, options are provided for users to download (in FASTA format) or display the IGR sequence of interest.
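As an illustration of the region selection and FASTA export described above, the following Python sketch filters IGRs to a 200,000 bp window and formats one of them as a FASTA record. The function names, the header layout and the rule that an IGR must lie entirely inside the window are assumptions made for the example; the server's own behaviour may differ.

```python
def igrs_in_window(igrs, window_start, window_size=200_000):
    """Keep the IGRs (1-based, inclusive) that lie entirely inside the window."""
    window_end = window_start + window_size - 1
    return [(s, e) for s, e in igrs if s >= window_start and e <= window_end]


def to_fasta(genome_id, genome_seq, igr, width=60):
    """Format one IGR as a FASTA record; the header layout is hypothetical."""
    start, end = igr
    seq = genome_seq[start - 1:end]  # convert 1-based inclusive to a Python slice
    header = f">{genome_id}_IGR_{start}_{end} length={end - start + 1}"
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return "\n".join([header] + lines)


if __name__ == "__main__":
    igrs = [(1, 99), (150_000, 150_100), (199_990, 200_050)]
    print(igrs_in_window(igrs, window_start=1))
    print(to_fasta("demo", "ACGTACGTAC", (3, 6), width=2))
```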
IGR statistics module
The IGR statistics module has two major utilities: the length distribution and the four-quadrant plot. The length distribution of the IGRs over different length intervals is represented as an interactive histogram, which also enables users to retrieve the IGR sequences in FASTA format. Similarly, the cumulative lengths of the IGRs within the four quadrants of the genome are displayed using a pie chart known as a four-quadrant plot. Four scale points are used in the pie chart to divide the complete genome into four quadrants.
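The two statistics can be sketched in Python as below. This is an illustrative sketch only: the bin width, the choice to start the quadrants at position 1, and the assignment of each IGR to a quadrant by its midpoint are assumptions, since the text does not specify these details.

```python
from collections import Counter


def length_histogram(igrs, bin_width=100):
    """Count IGRs per length interval; keys are the first length of each bin
    (1 for lengths 1-100, 101 for lengths 101-200, and so on)."""
    counts = Counter()
    for start, end in igrs:
        length = end - start + 1
        bin_start = ((length - 1) // bin_width) * bin_width + 1
        counts[bin_start] += 1
    return dict(sorted(counts.items()))


def quadrant_totals(igrs, genome_length):
    """Sum IGR lengths in each of four equal genome segments.

    Each IGR is assigned to the quadrant containing its midpoint
    (an assumption made for this sketch).
    """
    quarter = genome_length / 4
    totals = [0, 0, 0, 0]
    for start, end in igrs:
        midpoint = (start + end) / 2
        quadrant = min(int((midpoint - 1) // quarter), 3)
        totals[quadrant] += end - start + 1
    return totals


if __name__ == "__main__":
    igrs = [(1, 50), (90, 120), (350, 400)]
    print(length_histogram(igrs, bin_width=50))
    print(quadrant_totals(igrs, genome_length=400))
```

The four totals correspond to the slices of the pie chart; dividing each by their sum gives the share of intergenic content per quadrant.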
Application
The genome of Sodalis glossinidius str. Morsitans (NC_007712) is reported to have the least coding capacity among the prokaryotes (19). Analysis of the S. glossinidius genome using the method indicated by Taft et al. (8) shows that the genome has an ncDNA/tgDNA ratio of 50.91%. This was confirmed in our study by comparing the S. glossinidius genome with other Gammaproteobacteria genomes (Figure S2). We analyzed the S. glossinidius genome sequence using Junker with the default options and found a total of 1,837 IGRs. The length and positional distribution of these IGRs were then analyzed using the IGR statistics module (Figure S3). Figure S3A indicates that the genome contains many IGRs of different lengths, with a maximum of 16 kb. In addition, the four-quadrant plot of the genome indicates that the IGRs accumulate mostly in the fourth quadrant compared with the others (Figure S3B). Furthermore, a similar analysis of other genomes has shown that Orientia tsutsugamushi Boryong (NC_009488) (20) and Thermocrinis albus DSM14484 (NC_013894) have the highest (51.21%) and the lowest (3.98%) IGR ratios, respectively.
The calculated percentages of known CDS (percentage of genome coding for proteins) and IGR (percentage of IGR in the genome) ratios for 1,023 bacterial genomes are available in the form of a table in the web server (http://pranag.physics.iisc.ernet.in/cgi-bin/junker/table.pl).
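The kind of calculation behind such a table can be illustrated with a short Python sketch. The exact procedure used for the published values is not described here, so the merging of overlapping intervals before summing and the function names below are assumptions.

```python
def covered_length(intervals):
    """Total number of bases covered by 1-based inclusive intervals,
    counting overlapping positions only once."""
    total = 0
    cursor = 0  # last position already counted
    for start, end in sorted(intervals):
        if end > cursor:
            total += end - max(start, cursor + 1) + 1
            cursor = end
    return total


def percent_of_genome(intervals, genome_length):
    """Percentage of the genome covered by the given intervals
    (CDS coordinates for the CDS ratio, IGR coordinates for the IGR ratio)."""
    return 100.0 * covered_length(intervals) / genome_length


if __name__ == "__main__":
    cds = [(1, 100), (50, 150), (200, 249)]
    print(percent_of_genome(cds, genome_length=1000))
```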
Conclusion
Junker is a web-based tool designed to efficiently access and analyze the IGRs in bacterial genomes. The selected query genome is represented as a physical genome map, which helps users select a genome region of interest. In addition, the IGR sequences are checked for the presence of known pseudogenes, probable coding regions and repetitive elements. Moreover, the length distribution of IGRs over the whole genome is shown as a histogram and their disordered grouping is plotted on a four-quadrant pie chart. We believe that Junker will be helpful for the in-depth analysis of IGRs.
Authors’ contributions
JS conceived and coordinated the construction of the web server. JS and RS drafted the manuscript. RS and SSB developed the web interface and the scripts for prediction. ZAR and KS improved the web server and revised the manuscript. PG conceived the idea of the study and helped in the revision of the manuscript. All authors read and approved the final manuscript.
Acknowledgements
This paper is dedicated to the memory of the late Prof. Ziauddin Ahamed Rafi, who was the inspiration behind this study. We acknowledge the support of the Bioinformatics Centre and the Interactive Graphics Facility, Indian Institute of Science. JS thanks the Department of Biotechnology, Government of India for funding the projects. PG and JS thank the University Grants Commission for funding the Networking Resource Centre in Biological Sciences, Madurai Kamaraj University, Madurai.
Supplementary Material
Figures S1-S3
References
- 1. Wassarman K.M. Identification of novel small RNAs using comparative genomics and microarrays. Genes Dev. 2001;15:1637–1651. doi: 10.1101/gad.901001.
- 2. Hershberg R. A survey of small-RNA encoding genes in Escherichia coli. Nucleic Acids Res. 2003;31:1813–1820. doi: 10.1093/nar/gkg297.
- 3. Siguier P. Insertion sequences in prokaryotic genomes. Curr. Opin. Microbiol. 2006;9:526–531. doi: 10.1016/j.mib.2006.08.005.
- 4. Doublet B. Secondary chromosomal attachment site and tandem integration of the mobilizable Salmonella genomic island 1. PLoS One. 2008;3. doi: 10.1371/journal.pone.0002060.
- 5. Sharples G.J., Lloyd R.G. A novel repeated DNA sequence located in the intergenic regions of bacterial chromosomes. Nucleic Acids Res. 1990;18:6503–6508. doi: 10.1093/nar/18.22.6503.
- 6. Fu L.M., Shinnick T.M. Genome-wide analysis of intergenic regions of Mycobacterium tuberculosis H37Rv using Affymetrix GeneChips. EURASIP J. Bioinform. Syst. Biol. 2007;2007:23054. doi: 10.1155/2007/23054.
- 7. Blattner F.R. The complete genome sequence of Escherichia coli K-12. Science. 1997;277:1453–1462. doi: 10.1126/science.277.5331.1453.
- 8. Taft R.J. The relationship between non-protein-coding DNA and eukaryotic complexity. Bioessays. 2007;29:288–299. doi: 10.1002/bies.20544.
- 9. Ray W.C., Daniels C.J. PACRAT: a database and analysis system for archaeal and bacterial intergenic sequence features. Nucleic Acids Res. 2003;31:109–113. doi: 10.1093/nar/gkg013.
- 10. Oberto J. BAGET: a web server for the effortless retrieval of prokaryotic gene context and sequence. Bioinformatics. 2008;24:424–425. doi: 10.1093/bioinformatics/btm600.
- 11. Haas B.J. Genome sequence and analysis of the Irish potato famine pathogen Phytophthora infestans. Nature. 2009;461:393–398. doi: 10.1038/nature08358.
- 12. Chiapello H. MOSAIC: an online database dedicated to the comparative genomics of bacterial strains at the intra-species level. BMC Bioinformatics. 2008;9:498. doi: 10.1186/1471-2105-9-498.
- 13. Banos R.C. Differential regulation of horizontally acquired and core genome genes by the bacterial modulator H-NS. PLoS Genet. 2009;5. doi: 10.1371/journal.pgen.1000513.
- 14. Borodovsky M., McIninch J. GeneMark: parallel gene recognition for both DNA strands. Comput. Chem. 1993;17:123–133.
- 15. Delcher A.L. Improved microbial gene identification with GLIMMER. Nucleic Acids Res. 1999;27:4636–4641. doi: 10.1093/nar/27.23.4636.
- 16. Hyatt D. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics. 2010;11:119. doi: 10.1186/1471-2105-11-119.
- 17. Banerjee N. An algorithm to find all identical internal sequence repeats. Curr. Sci. 2008;95:188–195.
- 18. Rice P. EMBOSS: the European Molecular Biology Open Software Suite. Trends Genet. 2000;16:276–277. doi: 10.1016/s0168-9525(00)02024-2.
- 19. Toh H. Massive genome erosion and functional adaptations provide insights into the symbiotic lifestyle of Sodalis glossinidius in the tsetse host. Genome Res. 2006;16:149–156. doi: 10.1101/gr.4106106.
- 20. Cho N.H. The Orientia tsutsugamushi genome reveals massive proliferation of conjugative type IV secretion system and host-cell interaction genes. Proc. Natl. Acad. Sci. USA. 2007;104:7981–7986. doi: 10.1073/pnas.0611553104.
