Abstract
The availability of a large number of sequenced bacterial genomes allows researchers not only to derive functional and regulation information about specific organisms but also to study the fundamental properties of the organization of a genome. Here we address an important and challenging question regarding the global arrangement of operons in a bacterial genome: why operons in a bacterial genome are arranged in the way they are. We have previously studied this question and found that operons of more frequently activated pathways tend to be more clustered together in a genome. Specifically, we have developed a simple sequential distance-based pseudo energy function and found that the arrangement of operons in a bacterial genome tend to minimize the clusteredness function (C value) in comparison with artificially-generated alternatives, for a variety of bacterial genomes. Here we extend our previous work, and report a number of new observations: (a) operons of the same pathways tend to group into a few clusters rather than one; and (b) the global arrangement of these operon clusters tend to minimize a new “energy” function (C+ value) that reflects the efficiency of the transcriptional activation of the encoded pathways. These observations provide insights into further study of the genomic organization of genes in bacteria.
Keywords: Global genomic arrangement, Bacterial genome, Chromosomal supercoils, Biological pathways, Gene expression
Introduction
Up till very recently, our understanding about the global arrangement of operons in a bacterial genome has been very limited, mainly because of the lack of relevant data. The availability of both genomic and transcriptomic data for a large number of bacterial organisms (http://www.ncbi.nlm.nih.gov/genome/browse/) provides unprecedented opportunities for researchers to derive novel and deeper understanding about what determines the organization of the genomic information (such as operons and associated regulatory systems) in a genome. Understanding such an issue not only satisfies our scientific curiosity but also provides useful information that can guide the synthesis of artificially-designed chromosomes and organisms in the emerging field of synthetic biology [1–3]. A number of computational studies have been carried out aimed at understanding the genomic organization of genes. These include the studies of gene expression periodicities along a bacterial genome [4–8], transcriptional regulation [9–11], functional relatedness [5,12,13] and co-evolution relationship versus genomic co-locations [14]. While substantial efforts have been invested in studying the genome organization since early 90’s [15–18], our understanding about the factors and rules that may determine the global organization of a genome is still fragmented.
We have previously proposed the first model aimed to explain the organizational principle of a bacterial genome [19]. Our main finding was that the global arrangement of operons in a bacterial genome is tightly determined by the activation frequencies of the biological processes encoded in the genome, including metabolic and transcriptional regulation pathways [19,20]. This relationship can be captured to a large degree using a simple mathematical function, based on the sequential relationships among the relevant operons [19].
Here we extend our previous work to provide new and deeper understanding about the global arrangement of operons. Specifically we developed an improved scoring function which makes the current genomic arrangement of operons stand out more substantially against the alternative arrangements. This allows us to discover that operons in the same metabolic or transcriptional regulation pathway tend to group into a few, rather than one, sequentially nearby clusters, and the global arrangement of these operon clusters tends to make the transcriptional activation of the encoded pathways as efficient as possible.
Results
We have carried out our study on Escherichia coli K-12, which has 4144 protein-encoding genes [21] and 2462 predicted operons [22]. The 347 biological pathways encoded in E. coli K12 are retrieved from EcoCyc [23] and expression data for each gene under 466 growth conditions from the M3D database [24] are used to estimate the activation frequency of each pathway.
Operons encoding the same pathways tend to be clustered locally, but not globally, along the genome
We have previously used the following function to measure the global clusteredness of the component operons of the ith pathway, [19],
(1) |
where Mi represents the number of operons in the ith pathway and dij is the distance between the jth operon and operon (i.e., median operon) in the ith pathway. Then the overall clusteredness of the operons encoding all the pathways, C, can be measured using the following function:
(2) |
where N is the number of pathways under consideration (N = 347 in this study). A key discovery that we made was that the current genomic locations of all the operons in E. coli K12 tend to minimize this function in comparison with artificially-generated alternatives [19].
Our recent study indicates that while our initial one-cluster model per pathway provides an informative approximation, it is too crude. Our current analysis suggests that each pathway, particularly the one encoded by a large number of operons, tends to consist of a few operon clusters along the genome, as illustrated in Figure 1 (taking the four largest pathways in EcoCyc as examples). Based on this realization and analysis of the M3D gene expression data, we hypothesize that all the operons are arranged at specific genomic locations to facilitate the most efficient transcription of the operons encoding each pathway during the life cycle of the bacteria.
We aim to use the following C+ function (to distinguish with previous C value, C+ was used here) to assess the validity of this hypothesis by identifying a partition of the genome into set of genomic segments such that the function is minimized:
(3) |
where N is the number of known pathways encoded in the target genome; M is the number of partitioned segments of the genome to be determined; fi represents the activation frequency of the ith pathway estimated from the gene expression data in M3D following the method described previously [19] (with default value 1 for each pathway if such information is not available); ti is the number of partitioned segments containing operons of the ith pathway; sj is the number of operons in the jth partitioned segment (), and sij is the number of operons of the ith pathway covered in the jth partitioned segment; is defined to be 0 if ; otherwise ; and α is a scaling factor whose default value is set as 1. Intuitively, this function measures the total number of partitioned segments that encode the operons of each pathway across all the pathways, along with a measurement of the density of such operons among all the operons encoded in each partitioned segment. A detailed dynamic programming procedure for calculating an optimal partition of a target genome, along with the associated M value, is given in Materials and methods section.
We have randomly reshuffled the genomic locations of X% (X = 10, 20,…,100) of all the E. coli K12 operons with the locations of the remaining operons fixed, and calculated the C+ value of each reshuffled genome. We reshuffled the genomic locations of selected operons 100,000 times for each X, as we did in our original study [19]. Figure 2A shows the C+ value distribution for all the selected Xs versus the current genome of E. coli K12. Clearly we see that the C+ value of the current genome tends to be smaller than the C+ values of the alternatives, i.e., the artificially-generated genomes. To compare the current scoring function with our previous scoring function C, we did the same calculation using the previous function C (shown in Figure 2B). We have estimated the tail probability of “X < the dash line” (dash line denotes the current genomic arrangement), i.e., a P-value, for each score distribution in Figure 2, which is summarized in Table 1.
Table 1.
% Of reshuffling | P (C value) | P (C+ value) | P (C value)/P (C+ value) |
---|---|---|---|
10 | 0.06947282 | 0.004927 | 14.09904 |
20 | 0.01698723 | 0.000201 | 84.51098 |
30 | 0.003641975 | 2.09E−06 | 1740.591 |
40 | 0.000705596 | 3.44E−08 | 20492.59 |
50 | 0.000120973 | 1.42E−11 | 8523197 |
60 | 1.75E−05 | 2.72E−14 | 6.43E+08 |
70 | 2.77E−06 | 4.31E−18 | 6.42E+11 |
80 | 7.47E−07 | 6.50E−22 | 1.15E+15 |
90 | 2.28E−07 | 1.01E−24 | 2.25E+17 |
100 | 1.45E−07 | 1.09E−25 | 1.33E+18 |
Note: P(X) is the tail probability (P-value) of “X < the dash line” (dash line denotes the current genomic arrangement) for each score distribution. The P-values are calculated based on the observations that all the score distributions follow a normal distribution by the skewness and kurtosis test [19].
Through the above analysis, we have demonstrated that the current genomic arrangement of operons in the genome of E. coli K12 tends to minimize the C+ value compared to those of the alternative genomic arrangements of operons; and the C+ function makes the actual genome stands out more significantly than previous C function against the alternatives. The effect of increasing the impact of defined in Eq. (3) or estimating the actual pathway activation frequency based on M3D microarray data is presented in Figure S1 and Table S1.
Pathway-based analysis of the genomic partition
The above calculation also gives rise to an optimal partition of the E. coli K12 genome into 104 segments, which optimizes the C+ value (see details in Table S2). We have examined how the operons encoding the 347 biological pathways are located in the 104 partitioned segments of the E. coli genome. For each of the randomly-shuffled genomes and the original genome of the E. coli K12 shown in Figure 2, we retrieved the corresponding partitioned segments and calculated the total number of partitioned segments covered by each pathway across all the 347 pathways under consideration. From Figure 3, we can see that this total number of the current genome of E. coli K12 is clearly smaller than any of the randomly-reshuffled genomes.
In a separate study (Ma et al., unpublished data), we provided strong evidence that each of the partitioned segments corresponds to a supercoiled domain in the folded chromosome of E. coli K12 [25–28], each being independently foldable. Clearly when a pathway needs to be activated, the (folded) segments containing operons of the pathway need to unfold first to make all the relevant operons transcriptionally accessible, a process that consumes energy. The data shown in Figure 3 suggests that the organism has evolved to minimize the total number of segments that need to be unfolded in order to activate each pathway, hence making this part of the operating cost for the living bacteria as small as possible.
Concluding remarks
We have previously demonstrated that operons encoding more frequently activated pathways tend to be more clustered together in a genome [19]. We have also noted that what prevents such operons from forming one cluster is that many individual operons may each be involved in multiple pathways. Hence they form multi-operon clusters as we have demonstrated in this work that they tend to form a few operon clusters each contained in a folded genomic segment. To activate a pathway, energy is required to unfold the genomic segments that contain the operons encoding the pathway, so that the relevant operons are transcriptionally accessible. By putting all these results together, we speculate, with strong evidence, that operons encoding various pathways are specifically located in a given genome such there is a tendency to minimize the overall energy needed to have the relevant pathways transcribed, during the life cycle of the bacteria.
Materials and methods
A dynamic programming method for genomic partition
The optimization problem defined in Eq. (3) can be solved using a dynamic programming approach. The following recurrence equation gives rise to a simple dynamic programming algorithm for finding the optimal C+ (S, E) value, which generates a partition of a target genome and hence the number M of the partitioned segments, where S and E represent the first and the last inter-operonic regions of the genome, respectively.
(4) |
where L and U are set at 10 k bp and 100 k bp, defining the lower and the upper bounds of a partitioned segment [10–12].
Authors’ contributions
YX conceived the basic idea and planned the project. QM developed the algorithm, analyzed the data and carried out the computational experiments. Both authors edited the manuscript and approved the final manuscript.
Competing interests
The authors have declared that no competing interests exist.
Acknowledgements
This research was supported in part by National Science Foundation (Grant No. NSF DEB-0830024 and NSF MCB-0958172) and by Grant from the BioEnergy Science Center (BESC) of US Department of Energy through the Office of Biological and Environmental Research.
Footnotes
Peer review under responsibility of Beijing Institute of Genomics, Chinese Academy of Sciences and Genetics Society of China.
Appendix Supplementary. material
References
- 1.Peccoud J., Isalan M. The PLoS One synthetic biology collection: 6 years and counting. PLoS One. 2012;7:e43231. doi: 10.1371/journal.pone.0043231. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2.Képès F., Jester B.C., Lepage T., Rafiei N., Rosu B., Junier I. The layout of a bacterial genome. FEBS Lett. 2012;586:2043–2048. doi: 10.1016/j.febslet.2012.03.051. [DOI] [PubMed] [Google Scholar]
- 3.Juhas M., Eberl L., Church G.M. Essential genes as antimicrobial targets and cornerstones of synthetic biology. Trends Biotechnol. 2012;30:601–607. doi: 10.1016/j.tibtech.2012.08.002. [DOI] [PubMed] [Google Scholar]
- 4.Campbell A.M. Genome organization in prokaryotes. Curr Opin Genet Dev. 1993;3:837–844. doi: 10.1016/0959-437x(93)90002-7. [DOI] [PubMed] [Google Scholar]
- 5.Sukhodolets V.V. Principles of prokaryotic genome organization. Genetika. 1992;28:28–37. [PubMed] [Google Scholar]
- 6.Pavitt G.D., Higgins C.F. Chromosomal domains of supercoiling in Salmonella typhimurium. Mol Microbiol. 1993;10:685–696. doi: 10.1111/j.1365-2958.1993.tb00940.x. [DOI] [PubMed] [Google Scholar]
- 7.Kolsto A.B. Dynamic bacterial genome organization. Mol Microbiol. 1997;24:241–248. doi: 10.1046/j.1365-2958.1997.3501715.x. [DOI] [PubMed] [Google Scholar]
- 8.Lim H.N., Lee Y., Hussein R. Fundamental relationship between operon organization and gene expression. Proc Natl Acad Sci U S A. 2011;28:10626–10631. doi: 10.1073/pnas.1105692108. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Mathelier A., Carbone A. Chromosomal periodicity and positional networks of genes in Escherichia coli. Mol Syst Biol. 2010;6:366. doi: 10.1038/msb.2010.21. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.Wright M.A., Kharchenko P., Church G.M., Segre D. Chromosomal periodicity of evolutionarily conserved gene pairs. Proc Natl Acad Sci U S A. 2007;104:10559–10564. doi: 10.1073/pnas.0610776104. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11.Képès F. Periodic transcriptional organization of the E. coli genome. J Mol Biol. 2004;340:957–964. doi: 10.1016/j.jmb.2004.05.039. [DOI] [PubMed] [Google Scholar]
- 12.Jeong K.S., Ahn J., Khodursky A.B. Spatial patterns of transcriptional activity in the chromosome of E. coli. Genome Biol. 2004;5:R86. doi: 10.1186/gb-2004-5-11-r86. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Janga S.C., Salgado H., Martinez-Antonio A. Transcriptional regulation shapes the organization of genes on bacterial chromosomes. Nucleic Acids Res. 2009;37:3680–3688. doi: 10.1093/nar/gkp231. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Kolesov G., Wunderlich Z., Laikova O.N., Gelfand M.S., Mirny L.A. How gene order is influenced by the biophysics of transcription regulation. Proc Natl Acad Sci U S A. 2007;104:13948–13953. doi: 10.1073/pnas.0700672104. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Elati M., Nicolle R., Junier I., Fernández D., Fekih R., Font J. PreCisIon: PREdiction of CIS-regulatory elements improved by gene’s positION. Nucleic Acids Res. 2012 doi: 10.1093/nar/gks1286. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16.Fang G., Rocha E.P., Danchin A. Persistence drives gene clustering in bacterial genomes. BMC Genomics. 2008;9:4. doi: 10.1186/1471-2164-9-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.Junier I., Hérisson J., Képès F. Periodic pattern detection in sparse boolean sequences. Algorithms Mol Biol. 2010;5:31. doi: 10.1186/1748-7188-5-31. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18.Ling X., He X., Xin D. Detecting gene clusters under evolutionary constraint in a large number of genomes. Bioinformatics. 2009;25:571–577. doi: 10.1093/bioinformatics/btp027. [DOI] [PubMed] [Google Scholar]
- 19.Yin Y., Zhang H., Olman V., Xu Y. Genomic arrangement of bacterial operons is constrained by biological pathways encoded in the genome. Proc Natl Acad Sci U S A. 2010;107:6310–6315. doi: 10.1073/pnas.0911237107. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20.Zhang H., Yin Y., Olman V., Xu Y. Genomic arrangement of regulons in bacterial genomes. PLoS One. 2012;7:e29496. doi: 10.1371/journal.pone.0029496. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21.Riley M., Abe T., Arnaud M.B., Berlyn M.K., Blattner F.R., Chaudhuri R.R. E. coli K-12: a cooperatively developed annotation snapshot-2005. Nucleic Acids Res. 2006;34:1–9. doi: 10.1093/nar/gkj405. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22.Mao F., Dam P., Chou J., Olman V., Xu Y. DOOR: a database for prokaryotic operons. Nucleic Acids Res. 2009;37:D459–D463. doi: 10.1093/nar/gkn757. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23.Karp P.D., Riley M., Saier M., Paulsen I.T., Collado-Vides J., Paley S.M. The EcoCyc database. Nucleic Acids Res. 2002;30:56–58. doi: 10.1093/nar/30.1.56. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24.Faith J.J., Driscoll M.E., Fusaro V.A., Cosgrove E.J., Hayete B., Juhn F.S. Many microbe microarrays database: uniformly normalized Affymetrix compendia with structured experimental metadata. Nucleic Acids Res. 2008;36:D866–D870. doi: 10.1093/nar/gkm815. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 25.Browning D.F., Grainger D.C., Busby S.J. Effects of nucleoid-associated proteins on bacterial chromosome structure and gene expression. Curr Opin Microbiol. 2010;13:773–780. doi: 10.1016/j.mib.2010.09.013. [DOI] [PubMed] [Google Scholar]
- 26.Dillon S.C., Dorman C.J. Bacterial nucleoid-associated proteins, nucleoid structure and gene expression. Nat Rev Microbiol. 2010;8:185–195. doi: 10.1038/nrmicro2261. [DOI] [PubMed] [Google Scholar]
- 27.Dame R.T., Espeli O., Grainger D.C., Wiggins P.A. Multidisciplinary perspectives on bacterial genome organization and dynamics. Mol Microbiol. 2012;86:1023–1030. doi: 10.1111/mmi.12055. [DOI] [PubMed] [Google Scholar]
- 28.Rocha E.P. The organization of the bacterial genome. Annu Rev Genet. 2008;42:211–233. doi: 10.1146/annurev.genet.42.110807.091653. [DOI] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.