Relative entropy differences in bacterial chromosomes, plasmids, phages and genomic islands

Jon Bohlin; Mark WJ van Passel; Lars Snipen; Anja B Kristoffersen; David Ussery; Simon P Hardy

BMC Genomics. 2012 Feb 10;13:66. doi: 10.1186/1471-2164-13-66

PMCID: PMC3305612 PMID: 22325062

Abstract

Background

We sought to assess whether the concept of relative entropy (information capacity) could aid our understanding of the process of horizontal gene transfer in microbes. We analyzed the differences in information capacity between prokaryotic chromosomes, genomic islands (GI), phages, and plasmids. Relative entropy was estimated using the Kullback-Leibler measure.

Results

Relative entropy was highest in bacterial chromosomes and followed the order chromosomes > GI > phage > plasmid. There was an association between relative entropy and AT content in chromosomes, phages, plasmids and GIs, with the strongest association being in phages. Relative entropy was also found to be lower in the obligate intracellular Mycobacterium leprae than in the related M. tuberculosis when measured on a shared set of highly conserved genes.

Conclusions

We argue that relative entropy differences reflect how plasmids, phages and GIs interact with microbial host chromosomes and that all these biological entities are, or have been, subjected to different selective pressures. The rate at which amelioration of horizontally acquired DNA occurs within the chromosome is likely to account for the small differences between chromosomes and stably incorporated GIs compared to the transient or independent replicons such as phages and plasmids.

Background

Horizontal gene transfer in microbial communities has been recognized as a key driver of evolutionary change in microbes [1,2]. In addition to plasmids and phages, regions within the bacterial chromosomes are assumed to have been horizontally acquired [3]. Such putatively horizontally transferred regions are termed Genomic Islands (GI). GIs originate from different sources [4] including plasmids and phages (prophages) and carry traits that have important biological phenotypes such as virulence determinants and antibiotic resistance genes. Genetic material is most readily exchanged between related genetic elements [5], i.e. chromosomes exchange DNA with chromosomes, plasmids with plasmids, and phages with phages. However, this exchange is not entirely restrictive, with low frequency transfer occurring between chromosomes on one hand and plasmids and phages on the other [5]. Mathematical models predict plasmids to be the predominant means of genetic variation among bacteria [5]. Based on findings from genomic signatures (and analyses of CRISPRs in bacteria [6]), phages, and viruses in general, have been found to co-evolve with their hosts [7]. Plasmids on the other hand, although sharing some similarities with their hosts, differ more in DNA composition from the host's chromosome than would be expected [8]. In fact, genomic signature based methods reveal prokaryotic plasmid-host similarity to correlate with genomic GC content, i.e. the more GC rich an organism is the more compositionally similar it tends to be with its plasmid(s) [9]. GC content has also been associated with genome wide rates of mutation, where organisms of low GC content tend to have more random genomes than GC rich ones [10,11], i.e. the signal-to-noise ratio is lower in AT rich genomes. An organism's DNA sequence that has been subjected to numerous random mutations is assumed to possess less information than the DNA of an organism under strong selective pressure. In other words, due to more accumulated mutations, it appears as if less information is carried by the DNA sequences of AT rich microbes compared to GC rich microbes. Thus, to test the assertion that accumulated mutations lower the information capacity we explored the use of information theory as a means of measuring information capacity in DNA sequences.

The concept of information theory was originally introduced by Claude E. Shannon as a tool to systematically analyze data flow in general communication systems [12]. The theory has been extended and subsequently applied to many fields including DNA sequence analysis [13-15]. Information theory methods focusing on DNA sequence compression have found differences between coding and non-coding sequences as well as between prokaryotic and eukaryotic organisms [16].

These results led us to apply information theoretical methods to examine the extent to which information content differed between the genomes of bacterial chromosomes, plasmids, phages and GIs, and whether such differences could be related to distinct genomic properties of bacterial chromosomes and mobile genomic elements. We used the Kullback-Leibler divergence measure (D_KL) of tetranucleotide frequencies within genomic DNA sequences, similar to that described by Sadovsky [15], but using tetranucleotide frequencies and a zero order Markov model instead of a second order Markov model. These alterations increase the sensitivity of detection [17]. The zero order Markov model assumes no dependence between neighboring nucleotides. This means that D_KL will be higher than in models that do account for dependence between adjacent nucleotides, like the first or second order Markov models [17]. The expected tetranucleotide frequencies, statistically speaking, are thus calculated from mononucleotide frequencies, implying that the bases are independent of each other. Thus, D_KL reflects relative entropy in the sense that the genomic sequences are compared to a random sequence sharing only the same AT content. Low D_KL means low relative entropy and high D_KL means high relative entropy [18]. Since the DNA sequence from the biological entity is compared to a random, zero order Markov based sequence (sharing only total AT content), a lower D_KL reflects a greater independence between nucleotides in the corresponding tetranucleotides, and hence that less information is carried by the DNA sequence. Conversely, higher D_KL is taken to mean that more information is carried by the DNA sequence since the adjacent nucleotides in the corresponding tetranucleotides are more dependent on each other.

We sought to use methods from information theory to examine information capacity (relative entropy) in chromosomes, plasmids, phages and GIs. We investigated possible influences affecting relative entropy in the different types of DNA sequences and how relative entropy varies along bacterial chromosomes, focusing particularly on the AT rich Bacillus cereus, the medium AT:GC Escherichia coli and the GC rich Mycobacterium tuberculosis. We also examined the relative entropy of highly conserved genes in two closely related species (M. tuberculosis and M. leprae) of which one has presumably undergone considerable genome reduction [19,20].

Results

A note on the calculation of D_KL

The relative entropy of a DNA sequence, which we refer to as D_KL, is measured as the divergence of observed tetranucleotide frequencies from expected (approximated) tetranucleotide frequencies obtained with a zero order Markov model. The zero order Markov model assumes that every base in the sequence occurs with a probability independent of all neighboring bases. It is reasonable to assume that this is a good description in regions of high mutation activity [11]. We compare the tetranucleotide frequencies computed from the zero order Markov model to the tetranucleotide frequencies counted in each DNA sequence. The information capacity in a DNA sequence is thus positively associated with the magnitude of the divergence from the approximated sequence. Hence, the higher the divergence between observed and expected (approximated) tetranucleotide frequencies the more information potential in the DNA sequence, and vice versa.
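The calculation described above can be sketched in a few lines of Python. This is a minimal illustration and not the in-house software used in the study: the function name `kl_divergence_tetra`, the use of overlapping tetranucleotide windows on a single strand, the natural logarithm, and the skipping of ambiguous bases are all assumptions made for the example.

```python
import math
from collections import Counter

def kl_divergence_tetra(seq: str) -> float:
    """D_KL of observed tetranucleotide frequencies against a zero order
    Markov expectation built from the sequence's own base frequencies."""
    seq = seq.upper()
    # Mononucleotide frequencies (assumption: only A, C, G, T are counted).
    base_counts = Counter(b for b in seq if b in "ACGT")
    total_bases = sum(base_counts.values())
    p = {b: base_counts[b] / total_bases for b in "ACGT"}

    # Observed tetranucleotide counts (assumption: overlapping windows).
    tetra_counts = Counter()
    for i in range(len(seq) - 3):
        word = seq[i:i + 4]
        if all(c in "ACGT" for c in word):
            tetra_counts[word] += 1
    total_words = sum(tetra_counts.values())

    d_kl = 0.0
    for word, count in tetra_counts.items():
        observed = count / total_words
        # Zero order model: expected frequency is the product of base frequencies.
        expected = p[word[0]] * p[word[1]] * p[word[2]] * p[word[3]]
        d_kl += observed * math.log(observed / expected)
    return d_kl
```

Tetranucleotides that never occur contribute nothing to the sum, so only observed words are visited. A sequence whose tetranucleotide usage matches the product of its base frequencies gives a value near zero, while a strongly patterned sequence gives a larger value.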

D_KL differences between chromosomes, GIs, phages and plasmids

We examined whether information capacity varied between chromosomes and two potential 'vectors', i.e. phages and plasmids, as well as GIs. Figure 1 shows that the D_KL was slightly lower amongst GIs than chromosomes (p~0.004, see the Methods section for more details on the statistical methods). Phages were in turn found to have a lower D_KL than GIs (p < 0.001), and plasmids had slightly lower D_KL than phages (p~0.004). Hence, the largest difference in D_KL (the most divergent tetranucleotide frequencies compared to a random sequence) was between chromosomes and plasmids (p < 0.001). In other words, chromosomes were, on average, the most biased DNA sequences while the plasmids had the most random (least biased) DNA composition.

Relative entropy vs AT content

An association between information capacity and AT content has been found for chromosomes in previous studies using slightly different methods than those described here (see Methods section) [10,11]. Since there was a statistically significant difference in relative entropy between vectors (plasmids and phages) and chromosomes we explored whether similar associations could be found between the vectors and AT content. Figure 2 shows that relative entropy, D_KL, in chromosomes, plasmids, phages and GIs is negatively correlated with AT content: D_KL tends to decrease with increasing AT content. Regression analyses with D_KL as the response and AT content as the predictor gave R² = 0.33 for chromosomes, R² = 0.21 for plasmids, R² = 0.56 for phages, and R² = 0.22 for GIs. A likelihood ratio test between ANOVA models with size plus AT content versus AT content alone showed that adding size did not improve the fit. All statistical results mentioned were significant, p < 0.001.

*D_KL* vs AT content. Log-transformed *D_KL* (vertical axis) is plotted against AT content (horizontal axis) with accompanying regression lines and 99% prediction intervals for chromosomes, plasmids, phages, and GIs. A clear correlation can be observed for all DNA sequences between (log-transformed) *D_KL* and AT content, meaning that randomness in DNA sequences increases with AT content. The highest correlation was observed between relative entropy in phages and AT content.

Relative entropy comparisons of shared genes between M. tuberculosis and M. leprae

It has been shown that the genomes of intracellular microbes have a tendency to reduce in size due in part to more mutations and eventual loss of DNA repair genes [21,22]. We examined whether these changes are reflected in relative entropy of the genomes of M. tuberculosis, a facultative intracellular pathogen, and M. leprae, an obligate intracellular pathogen considered to be in a transitional state between free living and intracellular lifestyles [19,20]. M. leprae has a smaller genome than M. tuberculosis (3.3 Mb vs. 4.4 Mb) and it is more AT rich (42.3% vs 34.4%). Figure 3 shows that D_KL taken from highly conserved coding regions was also lower in M. leprae than in M. tuberculosis, implying that M. leprae has a more random base composition, possibly due to an increased number of accumulated mutations. The fact that relative entropy was taken from shared functional genes between the two organisms supports the existing model of genome decay in intracellular microbes [21] resulting in increased randomness amongst the protein coding regions.

***D_KL* differences in *M. leprae* and *M. tuberculosis***. The figure shows *D_KL* (vertical axis) for shared highly conserved genes of *M. leprae* and *M. tuberculosis* (horizontal axis).

Phylogenetic influence on relative entropy

Using comparable methods to D_KL, Reva and Tümmler argued that DNA sequence bias appears to be a taxon-specific phenomenon within bacteria [10]. To assess whether D_KL was influenced by taxonomy (Figure 4) we picked out one strain from each species to decrease bias from multiple strains, reducing the dataset to 709 chromosomes. We found that phylogenetic relationship did significantly influence D_KL, but only slightly (R² = 0.21) and comparable to that of GC content (R² = 0.22). The phyla and %GC factors did, however, not interact and a model including both GC content and phyla as predictors explained approximately 40% (R² = 0.4) of the variance observed. All results were statistically significant with p < 0.001. No significant difference (p~0.87, Welch two-sample t-test) in relative entropy was found between archaea and bacteria.

*D_KL* vs Phyla. The figure shows a boxplot of *D_KL* (vertical axis) plotted against phylogenetic groups (horizontal axis, number of genomes from each phylogenetic group in parentheses). The dashed horizontal red line is the average *D_KL* for all groups.

D_KL changes within genomes

To assess how relative entropy varied within bacterial chromosomes we examined the chromosomes of GC-rich Mycobacterium tuberculosis (65% GC), Escherichia coli K-12 with approximately 50% AT/GC, and AT rich Bacillus cereus (65% AT) using a sliding window of 5 kbp with D_KL from each window compared to D_KL for the whole chromosome. The aim was to examine whether D_KL could be regarded as a stable measure within bacterial chromosomes, similar to the genome signature [23]. Figure 5 shows how D_KL changed within the three species compared to a randomly constructed 50% GC chromosome of equivalent size to E. coli (5 Mbp). Notice that although D_KL varied within the chromosomes the level of variance was stable, indicating that average D_KL is a robust property for the whole DNA sequence.

Profiles of *D_KL* differences within *M. tuberculosis, E. coli* and *B. cereus*. Profiles made from the *D_KL* values of non-overlapping sliding windows in *M. tuberculosis, E. coli* and *B. cereus*. It can be seen that *D_KL* values within the chromosomes are remarkably stable. *B. cereus* has noticeably lower *D_KL* values than the other genomes, indicating that the chromosome has a comparably more random base composition. The *D_KL* values of a 50% GC content random genome are also included for comparison. For all chromosomes, the black horizontal line represents mean *D_KL*.

In addition, Figure 5 shows that although M. tuberculosis and E. coli had similar D_KL measures throughout the chromosome, the B. cereus chromosome exhibited considerably lower D_KL. This was especially pronounced in the middle of the chromosome. The accompanying BLAST atlas (Figure 6) [24] shows that the DNA molecule in this area was more AT rich, had more pronounced intrinsic curvature, increased stacking energy (making the double stranded DNA string easier to melt), higher position preference, and a higher occurrence of quasi- and perfect palindromes.

**BLAST atlas of *B. cereus* ATCC 10987**. BLAST atlas for *B. cereus* ATCC 10987 depicting several structural and sequential features. Of special interest is the region approximately between 2 Mb and 3 Mb, which has a substantially higher occurrence of both quasi- and perfect palindromes, increased stacking energy and intrinsic curvature, as well as higher position preference than the rest of the chromosome.

Size vs AT content

Although it has been demonstrated that AT content and chromosome sizes are inversely correlated in prokaryotes, we carried out additional tests for plasmids, phages, GIs as well as chromosomes. From Figure 7 it can be seen, as expected, that we found an association between chromosome size and AT content (R²~0.22, p < 0.001). In addition, we found a significant association between plasmid size and AT content, although it was weak (R²~0.16), which could be due to the increased variance. With an R² of approximately 0.01 or less, the sizes of phages and GIs were not associated with AT content. All results were statistically significant (p < 0.001).

Size vs. relative entropy

Since the correlation between DNA sequence size and GC content is well established [25,26] we examined whether D_KL was affected by DNA sequence size. We performed regression analyses with D_KL of chromosomes, GIs, phages and plasmids as the response and the corresponding sequence size as the predictor variable, measuring, in effect, the correlation between D_KL and sequence size. In all instances R² (the coefficient of determination) was found to be lower than 0.05, meaning that less than 5% (p < 0.001) of the variance observed in the data was explained by the regression models. A regression analysis with GC content as outcome indicated that variance explained increased additively as DNA sequence size (21% and 15% (p < 0.001) for bacterial chromosomes and plasmids, respectively) and D_KL (48% and 29% (p < 0.001) for chromosomes and plasmids, respectively) were added to the model. Hence, AT content has an independent effect on DNA sequence size and relative entropy in bacterial chromosomes and plasmids, while D_KL was not affected by DNA sequence size regardless of DNA sequence type examined. It should be noted that for the combined regression model including both D_KL and DNA sequence size the %-variance explained metrics (i.e. R²) were slightly different from the individual models discussed in the above sections due to the different types of transformations used (see Methods section for further details).

Discussion

Relative entropy in chromosomes, plasmids, phages and GIs

Chromosomes were, on average, the most biased sequences (i.e. least similar to a random sequence) and therefore presumably the most subjected to selective pressures of the sequences examined here. In terms of D_KL there was a small, but significant difference between GIs and chromosomes. This difference is expected since GIs are found within chromosomes and have ameliorated over time, that is, their base composition has tended towards that of the host chromosome [27]. Indeed, a number of studies indicate that GIs consist of horizontally acquired mobile genetic fragments [22,28], but our data do not identify what type of vector has brought these GIs to their respective chromosomes.

The reduced D_KL of plasmids compared to phages was small but statistically significant. In contrast to phages, plasmids exist independently of the host chromosome and are generally non-lethal [29]. When the phenotypic features of the plasmid are not required for bacterial survival, the plasmid will exist only in a small minority of the total microbial population [30]. In this way the forces of selective pressure are reduced compared to the host chromosome. Phages also exist independently of bacterial chromosomes but rely on the bacterial machinery for replication [29,30]. However, those phages that are lytic will be under greater selective pressure than plasmids. Which particular features of phages result in their reduced information content remains to be clarified.

It should be noted that the comparisons were between all deposited DNA sequences, which means that the results reflect the distributions of chromosomes, GIs, phages and plasmids that were originally selected and sequenced for a purpose. The effect of this bias is not clear.

Association between D_KL and AT content

Figure 2 shows that decreased relative entropy (D_KL) is associated with increasing AT content. An example of this was demonstrated in Figure 3, where the more AT rich M. leprae was found to have lower D_KL in genes that are also shared with the more GC rich M. tuberculosis.

Although the coefficient of determination, R², varied between GIs, phages, plasmids and chromosomes, Figure 2 shows that the trend remained for all DNA sequences examined. Phages obtained a surprisingly high coefficient of determination, R² = 0.56, implying that relative entropy was more closely linked to changes in AT content in phages.

D_KL variation within chromosomes

The D_KL profile of the B. cereus chromosome may imply that areas of low relative entropy (low D_KL) might be indicators of genetic regions especially prone to rearrangement. This propensity for rearrangements may be due to the increased stacking energy, position preference and amount of quasi-palindromes observed in the region, all of which are determinants of genomic rearrangement. The relatively high occurrence of both palindromes and quasi-palindromes in the region of B. cereus with low relative entropy may indicate that the mechanisms leading to quasi-palindrome correction have not been operating properly in these regions as compared to the chromosome in general [31], possibly resulting also in a higher number of accumulated mutations [17]. A similar region has been found for all sequenced members of the B. cereus group, which implies that the genetic region has been selected and kept, possibly due to some unknown advantage. As can be seen from Figure 6, the region is predominantly gene coding. Since the genomes of the B. cereus group are relatively large compared with the distantly related B. subtilis it can be speculated that the region is an acquired phage or plasmid.

Connections between DNA sequence and structure

Although relative entropy has some mathematical associations with thermodynamics the two concepts are, in general, independent of each other [18]. However, it is known that greater energy is required to melt GC rich sequences than AT rich sequences [32]. Considering that our results showed a negative correlation between D_KL and AT content, it is possible that DNA structure energetics and DNA sequence relative entropy are connected, providing a link between DNA structure and sequence. This is supported by the findings shown in Figure 6 where a genetic region of low relative entropy was found to have more intrinsic DNA structural curvature, increased stacking energies and higher position preference. Hence, our findings may point to possible DNA structural differences between bacterial chromosomes, plasmids and phages that could have implications for how these biological entities are integrated into their hosts.

Phylogenetic influences on relative entropy

Our measure of relative entropy revealed that approximately 21% of the variation in D_KL could be explained by a close phylogenetic relationship. This value compares well with the 22% of the variation that is explained by GC content. Thus, D_KL appears to be as much influenced by phyla as by GC content, while approximately 60% is accounted for by other factors, given that the two predictors together explained about 40% of the variance. Using a method that is strongly associated with relative entropy (OUV, oligonucleotide usage variance), 55% of the variance could be explained by environment, phyla and AT content [17]. If non-coding regions were excluded 67% of the variance could be explained using environment, phylum and AT content. The above mentioned study also discusses possible influences between environmental factors and possible implications of high and low OUV for a number of microbes that are relevant to the present exposition. The difference between OUV and relative entropy is explained in the Methods section.

Relation between relative entropy and DNA sequence size

Although a possible link between plasmid size and ecology has been reported [29], and a correlation between microbial chromosome size and GC content has been established previously, to the best of our knowledge no such correlation has been reported between plasmid size and GC content. It can also be seen from Figure 7 that plasmid sizes vary considerably more with respect to AT content than chromosomes, which could indicate that the DNA sequences of plasmids are less stable and more prone to genetic exchange than the DNA sequences of chromosomes.

Lack of correlation between relative entropy and DNA sequence size

Although a correlation between DNA sequence size and D_KL in bacterial chromosomes and plasmids could be expected due to the correlation found between these factors and genomic AT content, no such correlation was found. This may imply that the relation between genomic AT content and DNA sequence size is independent of the relation between genomic AT content and relative entropy. In other words, genomic AT content may be differently related to DNA sequence size than to relative entropy in bacterial chromosomes and plasmids (no correlation was found between AT content and DNA sequence size in GIs and phages). This claim was further strengthened by a linear regression analysis, which indicated that the variance explained increased additively with DNA sequence size and relative entropy added as predictors. Hence, our models indicate that the mechanisms connecting AT content with DNA sequence size are unrelated to and different from the mechanisms linking AT content with relative entropy.

Connections to other studies

By using BLAST and graph/network analyses it has been found that the different groups, i.e. chromosomes, plasmids and phages, share, in the majority of cases, DNA amongst themselves. In other words, chromosomes share DNA with chromosomes, plasmids share DNA with plasmids and phages share DNA with phages [5]. Variation among bacterial chromosomes however is predominantly mediated by genetic exchange from plasmids and only transiently so by phages [5]. Our results indicated that plasmids, on average, had significantly lower D_KL than any of the other types of DNA sequences. This could mean that plasmids are more tolerant to genetic alterations, something that may be crucial to maximize host range [33]. A previous study has reported a correlation between plasmid-host similarity and GC content, i.e. the more similar the plasmids and hosts were in terms of genomic signatures, the more GC rich they tended to be [9]. Phages have been found to have a narrow host range, in fact even more so than plasmids [5], in spite of their larger numbers (estimations go as high as 5-10 phages for each bacterium on earth [34-36]), which may indicate that they have been subjected to increased selective pressures resulting, in turn, in significantly higher D_KL than for plasmids. Due to the possible link between relative entropy and DNA sequence mutations it can be speculated whether phages are more vulnerable to genetic rearrangements than plasmids, resulting in higher D_KL, on average, in phages.

Conclusions

In conclusion, we find that GIs and chromosomes have similar relative entropy (D_KL), which may be due to amelioration of the foreign DNA towards the base composition of the host chromosome. Both plasmids and phages had significantly lower relative entropy than GIs and chromosomes. Plasmids had the lowest D_KL of all types of DNA sequences examined, meaning that plasmids contained, on average, the most mutated DNA sequences. Relative entropy decreased in all types of DNA sequences in concordance with increasing AT content, possibly implying that the number of accumulated mutations increases with AT content regardless of the (prokaryotic) biological entity. This was also demonstrated on a shared set of highly conserved genes from M. tuberculosis and M. leprae, of which the latter, known to have undergone considerable genome reduction, was found to have significantly lower relative entropy (i.e. more random DNA sequences possibly due to mutation) in the protein coding genes. The association between AT content and D_KL was especially pronounced for phages, which may reflect an evolutionary strategy that associates the number of accumulated mutations with AT content to a substantially larger extent in phages than in bacteria.

Methods

Chromosomes, plasmids and phages were downloaded from the NCBI website (http://www.ncbi.nlm.nih.gov/genome/), while the GIs were downloaded from the Islandviewer website (http://www.pathogenomics.sfu.ca/islandviewer/query.php). Only DNA sequences larger than 10 kb were considered due to limitations of the method. Single copy orthologs were assigned by OrthoMCL [37] for the genomes of Mycobacterium tuberculosis F11 (CP000717.1), M. tuberculosis H37Rv (AL123456.2), M. leprae Br4923 (FM211192.1) and M. leprae TN (AL450380.1). Statistical analyses were carried out with R (http://www.r-project.org/), which was also used to create all figures except the BLAST atlas (Figure 6). The BLAST atlas was made using CBS in-house software [24,38].

The Kullback-Leibler divergence (D_KL, also referred to as the relative entropy) is a measure of difference between two discrete probability mass functions [18]. Let s be a DNA sequence, and z₁,...,z₂₅₆ be all possible tetramers of the DNA alphabet (4⁴ = 256). The observed frequencies of tetranucleotides from DNA sequence s are written as O(z_i|s). The expected frequencies of tetranucleotides from DNA sequence s found using a zero order Markov model are written as E(z_i|s). The KL divergence for the sequence s is given as:

D_{KL}(s) = \sum_{i=1}^{256} O(z_i | s) \log \left( \frac{O(z_i | s)}{E(z_i | s)} \right)
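As a worked illustration of the expected term, the zero order Markov model makes E(z_i|s) the product of the frequencies of the four bases in z_i. The notation p_s(b) for the frequency of base b in s, the index z_{i,j} for the j-th base of z_i, and the example base frequencies are assumptions introduced here for illustration only:

```latex
E(z_i | s) = \prod_{j=1}^{4} p_s(z_{i,j}), \qquad \sum_{i=1}^{256} E(z_i | s) = 1

% Hypothetical sequence with 60% AT: p_s(A) = p_s(T) = 0.3, p_s(C) = p_s(G) = 0.2
E(\mathrm{AATT} | s) = 0.3^{4} = 0.0081, \qquad E(\mathrm{GCGC} | s) = 0.2^{4} = 0.0016
```

A tetranucleotide whose observed frequency equals this product contributes zero to the sum, so D_KL grows only to the extent that neighboring bases depart from independence.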

A lower D_KL is interpreted as lesser information potential being carried by the DNA sequence s due to lesser dependence between the nucleotides in the corresponding tetranucleotides. Conversely, a higher D_KL is taken to mean that higher information potential is carried by the DNA sequence (higher relative entropy), since the nucleotides in the corresponding tetranucleotides are more dependent on each other. The OUV measure [17] described in the Discussion section and compared to relative entropy is calculated as follows (O, E, z_i and s are the same as above):

D_{OUV}(s) = \frac{1}{256} \sum_{i=1}^{256} \frac{O(z_i | s)}{E(z_i | s)}

Although the OUV measure is similar to relative entropy, we use the latter here due to the larger theoretical framework and tools available from information theory [12,18].

Comparisons between D_KL and factors such as phyla, AT content, DNA sequence size, etc. were carried out using linear regression with transformations applied to correct for non-normality where needed.

D_KL was computed for each DNA sequence (chromosome, plasmid, phage and GI) and compared to AT content, size and phyla using linear regression:

Y = a + b X + ε

For comparisons between chromosome, plasmid, GI and phage size (Y = Y_Size) versus D_KL (X_KL) no transformation was used.

To examine the relationship between D_KL, DNA sequence size and AT content for bacterial chromosomes and plasmids, a linear regression model was used without transformations on the response:

Y_{AT} = a + b X_{KL} + c X_{Size} + d (X_{Size})^{2} + ε

For the linear regression with D_KL as outcome (Y = Y_KL) and AT content as predictor (X = X_AT), the outcome was log-transformed:

\log Y_{KL} = a + b X_{AT} + ε
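The log-linear model above can be fitted by ordinary least squares. The sketch below uses only the Python standard library rather than R, which the study used; the function name `fit_log_linear`, the choice of the natural logarithm, and the toy values are assumptions for illustration and are not data from the study.

```python
import math

def fit_log_linear(at_content, d_kl):
    """Least-squares fit of log(D_KL) = a + b * AT; returns (a, b, r_squared)."""
    x = list(at_content)
    y = [math.log(v) for v in d_kl]  # log-transform the outcome only
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    syy = sum((yi - mean_y) ** 2 for yi in y)
    b = sxy / sxx                    # slope
    a = mean_y - b * mean_x          # intercept
    r_squared = (sxy * sxy) / (sxx * syy)
    return a, b, r_squared

# Made-up values that follow the model exactly (not measurements).
toy_at = [0.30, 0.40, 0.50, 0.60, 0.70]
toy_dkl = [math.exp(-1.0 - 2.0 * x) for x in toy_at]
a, b, r2 = fit_log_linear(toy_at, toy_dkl)
```

A negative slope b corresponds to D_KL decreasing as AT content increases, and r_squared is the coefficient of determination reported as R² in the Results.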

Several transformations were used to assess associations between chromosome, plasmid, phage and GI size (Y_Size) vs AT content (X_AT) using the following regression equation:

Y_{Size} = a + b X_{AT} + ε

A square root transform was used when the response was sequence sizes for chromosomes; log transformations for both phage and plasmid sizes; and (1/ Y_Size) transform for GI sizes as outcome.

Comparisons of D_KL between chromosomes, plasmids, phages and GIs, as seen in Figure 1, were carried out using the non-parametric Wilcoxon (Mann-Whitney) test due to skewed (but similar) distributions.

All statistical results presented were found to be statistically significant with p < 0.001, unless otherwise stated in the text.

All D_KL measurements of DNA sequences were carried out using in-house software. The profiles measuring D_KL changes within bacterial chromosomes as seen in Figure 5 were performed using non-overlapping sliding windows of 5 kbp compared to average chromosomal D_KL.
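For readers unfamiliar with the rank-based comparison named above, the following is a minimal standard-library sketch of a two-sided Mann-Whitney U test. It is not the R routine used in the study: the function name `mann_whitney_u` is hypothetical, and the p-value relies on the normal approximation without tie or continuity correction, which is an assumption that suits large samples only.

```python
import math

def mann_whitney_u(sample_a, sample_b):
    """Two-sided Mann-Whitney U test (normal approximation); returns (U, p_value)."""
    combined = sorted([(v, 0) for v in sample_a] + [(v, 1) for v in sample_b])
    # Assign midranks so that tied values share the average of their ranks.
    ranks = [0.0] * len(combined)
    i = 0
    while i < len(combined):
        j = i
        while j + 1 < len(combined) and combined[j + 1][0] == combined[i][0]:
            j += 1
        midrank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[k] = midrank
        i = j + 1
    n_a, n_b = len(sample_a), len(sample_b)
    rank_sum_a = sum(r for r, (_, group) in zip(ranks, combined) if group == 0)
    u_a = rank_sum_a - n_a * (n_a + 1) / 2
    u = min(u_a, n_a * n_b - u_a)        # smaller of the two U statistics
    mean_u = n_a * n_b / 2
    sd_u = math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12)
    z = (u - mean_u) / sd_u
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return u, p_value
```

Because the test works on ranks, it does not require the D_KL values in each group to be normally distributed, which is why it suits skewed distributions.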
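The window profiles can be sketched as follows. This is an illustration and not the in-house software: the function names are hypothetical, the sequence is assumed to be an uppercase string containing only A, C, G and T, tetranucleotides are counted in overlapping steps, and "compared to" is taken here to mean the difference between each window's D_KL and the whole-sequence D_KL.

```python
import math
from collections import Counter

def d_kl(seq):
    """D_KL of tetranucleotide frequencies vs. a zero order Markov expectation."""
    bases = Counter(seq)
    n = len(seq)
    words = Counter(seq[i:i + 4] for i in range(n - 3))
    total = sum(words.values())
    return sum(
        (c / total) * math.log((c / total) / math.prod(bases[b] / n for b in w))
        for w, c in words.items()
    )

def window_profile(seq, window=5000):
    """D_KL for consecutive non-overlapping windows; a trailing partial window is dropped."""
    return [d_kl(seq[start:start + window])
            for start in range(0, len(seq) - window + 1, window)]

def profile_vs_chromosome(seq, window=5000):
    """Difference between each window's D_KL and the whole-sequence D_KL."""
    whole = d_kl(seq)
    return [value - whole for value in window_profile(seq, window)]
```

Each window uses its own base frequencies for the expected term, so a locally AT rich window is compared against a random sequence of the same local AT content.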

Authors' contributions

JB, LS and ABK carried out statistical analyses. JB, MWJvP, SPH, and DU contributed to data analyses and discussion. All authors participated in the writing of the manuscript. The study was initiated by JB. All authors have read and approved the final manuscript.

Contributor Information

Jon Bohlin, Email: jon.bohlin@nvh.no.

Mark WJ van Passel, Email: Mark.vanPassel@wur.nl.

Lars Snipen, Email: lars.snipen@umb.no.

Anja B Kristoffersen, Email: anja.kristoffersen@vetinst.no.

David Ussery, Email: dave@cbs.dtu.dk.

Simon P Hardy, Email: simon-paul.hardy@vetinst.no.

Acknowledgements

The authors wish to thank the referees as well as Hilde Mellegård and Torunn Dønsvik for their helpful comments. MWJvP is funded by the Netherlands Organization for Scientific Research (NWO) via a VENI grant.

References

1. van Passel MW, Marri PR, Ochman H. The emergence and fate of horizontally acquired genes in Escherichia coli. PLoS Comput Biol. 2008;4(4):e1000059. doi: 10.1371/journal.pcbi.1000059.
2. Roos TE, van Passel MW. A quantitative account of genomic island acquisitions in prokaryotes. BMC Genomics. 2011;12:427. doi: 10.1186/1471-2164-12-427.
3. Fournier PE, Drancourt M, Raoult D. Bacterial genome sequencing and its use in infectious diseases. 2007. pp. 711–723.
4. Langille MG, Hsiao WW, Brinkman FS. Evaluation of genomic island predictors using a comparative genomics approach. BMC Bioinformatics. 2008;9:329. doi: 10.1186/1471-2105-9-329.
5. Halary S, Leigh JW, Cheaib B, Lopez P, Bapteste E. Network analyses structure genetic diversity in independent genetic worlds. Proc Natl Acad Sci USA. 2010;107(1):127–132. doi: 10.1073/pnas.0908978107.
6. Haerter JO, Trusina A, Sneppen K. Targeted bacterial immunity buffers phage diversity. J Virol. 2011;85(20):10554–10560. doi: 10.1128/JVI.05222-11.
7. Pride DT, Wassenaar TM, Ghose C, Blaser MJ. Evidence of host-virus co-evolution in tetranucleotide usage patterns of bacteriophages and eukaryotic viruses. BMC Genomics. 2006;7:8. doi: 10.1186/1471-2164-7-8.
8. van Passel MW, Bart A, Luyf AC, van Kampen AH, van der Ende A. Compositional discordance between prokaryotic plasmids and host chromosomes. BMC Genomics. 2006;7(1):26. doi: 10.1186/1471-2164-7-26.
9. Bohlin J, Skjerve E, Ussery DW. Reliability and applications of statistical methods based on oligonucleotide frequencies in bacterial and archaeal genomes. BMC Genomics. 2008;9:104. doi: 10.1186/1471-2164-9-104.
10. Reva ON, Tummler B. Global features of sequences of bacterial chromosomes, plasmids and phages revealed by analysis of oligonucleotide usage patterns. BMC Bioinformatics. 2004;5:90. doi: 10.1186/1471-2105-5-90.
11. Bohlin J, Skjerve E, Ussery DW. Investigations of oligonucleotide usage variance within and between prokaryotes. PLoS Comput Biol. 2008;4(4):e1000057. doi: 10.1371/journal.pcbi.1000057.
12. Shannon CE. The mathematical theory of communication. 1963. MD Comput. 1997;14(4):306–317.
13. Yockey HP. Origin of life on earth and Shannon's theory of communication. Comput Chem. 2000;24(1):105–123. doi: 10.1016/S0097-8485(99)00050-9.
14. Schneider TD. Information content of individual genetic sequences. J Theor Biol. 1997;189(4):427–441. doi: 10.1006/jtbi.1997.0540.
15. Sadovsky MG. Information capacity of nucleotide sequences and its applications. Bull Math Biol. 2006;68(4):785–806. doi: 10.1007/s11538-005-9017-0.
16. Menconi G, Marangoni R. A compression-based approach for coding sequences identification. I. Application to prokaryotic genomes. J Comput Biol. 2006;13(8):1477–1488. doi: 10.1089/cmb.2006.13.1477.
17. Bohlin J, Skjerve E. Examination of genome homogeneity in prokaryotes using genomic signatures. PLoS One. 2009;4(12):e8113. doi: 10.1371/journal.pone.0008113.
18. Cover TM, Thomas JA. Elements of Information Theory: Wiley; 1991.
19. Vissa VD, Brennan PJ. The genome of Mycobacterium leprae: a minimal mycobacterial gene set. Genome Biol. 2001;2(8):REVIEWS1023. doi: 10.1186/gb-2001-2-8-reviews1023.
20. Gomez-Valero L, Rocha EP, Latorre A, Silva FJ. Reconstructing the ancestor of Mycobacterium leprae: the dynamics of gene loss and genome reduction. Genome Res. 2007;17(8):1178–1185. doi: 10.1101/gr.6360207.
21. Moran NA. Microbial minimalism: genome reduction in bacterial pathogens. Cell. 2002;108(5):583–586. doi: 10.1016/S0092-8674(02)00665-7.
22. Rocha EP, Danchin A. Base composition bias might result from competition for metabolic resources. Trends Genet. 2002;18(6):291–294. doi: 10.1016/S0168-9525(02)02690-2.
23. Karlin S, Campbell AM. Which bacterium is the ancestor of the animal mitochondrial genome? Proc Natl Acad Sci USA. 1994;91(26):12842–12846. doi: 10.1073/pnas.91.26.12842.
24. Hallin PF, Binnewies TT, Ussery DW. The genome BLASTatlas-a GeneWiz extension for visualization of whole-genome homology. Mol Biosyst. 2008;4(5):363–371. doi: 10.1039/b717118h.
25. Mitchell D. GC content and genome length in Chargaff compliant genomes. Biochem Biophys Res Commun. 2007;353(1):207–210. doi: 10.1016/j.bbrc.2006.12.008.
26. Musto H, Naya H, Zavala A, Romero H, Alvarez-Valin F, Bernardi G. Genomic GC level, optimal growth temperature, and genome size in prokaryotes. Biochem Biophys Res Commun. 2006;347(1):1–3. doi: 10.1016/j.bbrc.2006.06.054.
27. Lawrence JG, Ochman H. Amelioration of bacterial genomes: rates of change and exchange. J Mol Evol. 1997;44(4):383–397. doi: 10.1007/PL00006158.
28. Langille MG, Hsiao WW, Brinkman FS. Detecting genomic islands using bioinformatics approaches. Nat Rev Microbiol. 2010;8(5):373–382. doi: 10.1038/nrmicro2350.
29. Slater FR, Bailey MJ, Tett AJ, Turner SL. Progress towards understanding the fate of plasmids in bacterial communities. FEMS Microbiol Ecol. 2008;66(1):3–13. doi: 10.1111/j.1574-6941.2008.00505.x.
30. Bahl MI, Hansen LH, Sorensen SJ. Persistence mechanisms of conjugative plasmids. Methods Mol Biol. 2009;532:73–102. doi: 10.1007/978-1-60327-853-9_5.
31. van Noort V, Worning P, Ussery DW, Rosche WA, Sinden RR. Strand misalignments lead to quasipalindrome correction. Trends Genet. 2003;19(7):365–369. doi: 10.1016/S0168-9525(03)00136-7.
32. Sinden RR. DNA Structure and Function: Academic Press; 1994.
33. Kirzinger MW, Stavrinides J. Host specificity determinants as a genetic continuum. Trends Microbiol. 2012;20(2):88–93. doi: 10.1016/j.tim.2011.11.006.
34. Brussow H, Hendrix RW. Phage genomics: small is beautiful. Cell. 2002;108(1):13–16. doi: 10.1016/S0092-8674(01)00637-7.
35. Paul JH, Sullivan MB, Segall AM, Rohwer F. Marine phage genomics. Comp Biochem Physiol B Biochem Mol Biol. 2002;133(4):463–476. doi: 10.1016/S1096-4959(02)00168-9.
36. Lima-Mendez G, Van Helden J, Toussaint A, Leplae R. Reticulate representation of evolutionary and functional relationships between phage genomes. Mol Biol Evol. 2008;25(4):762–777. doi: 10.1093/molbev/msn023.
37. Fischer S, Brunk BP, Chen F, Gao X, Harb OS, Iodice JB, Shanmugam D, Roos DS, Stoeckert CJ Jr. Using OrthoMCL to assign proteins to OrthoMCL-DB groups or to cluster proteomes into new ortholog groups. Curr Protoc Bioinformatics. 2011. pp. Unit 6.12.1–19.
38. Hallin PF, Ussery DW. CBS Genome Atlas Database: a dynamic storage for bioinformatic results and sequence data. Bioinformatics. 2004;20(18):3682–3686. doi: 10.1093/bioinformatics/bth423.

