PLOS One. 2010 Apr 14;5(4):e9844. doi: 10.1371/journal.pone.0009844

Universal Global Imprints of Genome Growth and Evolution – Equivalent Length and Cumulative Mutation Density

Hong-Da Chen 1,2, Wen-Lang Fan 2,3, Sing-Guan Kong 1,2, Hoong-Chien Lee 1,2,4,5,*
Editor: Josh Bongard 6
PMCID: PMC2854691  PMID: 20418954

Abstract

Background

Segmental duplication is widely held to be an important mode of genome growth and evolution. Yet how this would affect the global structure of genomes has been little discussed.

Methods/Principal Findings

Here, we show that equivalent length, or Inline graphic, a quantity determined by the variance of the fluctuating part of the distribution of the Inline graphic-mer frequencies in a genome, characterizes the latter's global structure. We computed the Inline graphic's of 865 complete chromosomes and found that they have nearly universal (but Inline graphic-dependent) values. The differences among the Inline graphic of a chromosome and those of its coding and non-coding parts were found to be slight.

Conclusions

We verified that these non-trivial results are natural consequences of a genome growth model characterized by random segmental duplication and random point mutation, but not of any model whose dominant growth mechanism is not segmental duplication. Our study also indicates that genomes have a nearly universal cumulative “point” mutation density of about 0.73 mutations per site that is compatible with the relatively low mutation rates of (1Inline graphic5)Inline graphic10Inline graphic/site/Mya previously determined by sequence comparison for the human and E. coli genomes.

Introduction

Evolution has many facets, and one that is particularly accessible to quantitative analysis is the evolution of genomic sequences. In particular, the study of point mutations (here used in the sense that includes relatively small insertions and deletions, or indels) on genes has led to a deep understanding of many aspects of genome evolution [1], [2]. Point mutation, however, cannot be the main force driving genome growth, because it does not give rise to gene duplication [3]–[8], and because the pace of evolution based on point mutation alone would be too slow. Gene duplication is a product of segmental duplication (SD). In fact, genomes are replete with vestiges of duplication [9]–[11], not only in the form of homologous genes, but also as transposons [12]–[14], pseudogenes [15]–[18], and many other types of coding and non-coding repeats [19]–[22]. There is also evidence of large-scale genomic rearrangements [23]–[27] and whole genome duplications [3], [28]–[30]. This has led to the generally held view that SD is an important mode of genome growth and evolution.

If products of SD are so prevalent in genomes, we expect the SD's in a genome, collectively, to leave a large imprint on the global structure of its host, one that is detectable using means not relying on sequence alignment, which in any case is not suitable for global studies. One may reasonably expect a study to understand the formation of such an imprint to yield useful insights into the global pattern of genome growth and evolution, yet no such effort has been made.

Here, we study the statistical properties of genomes by analyzing the distribution of the frequency of occurrence, or FD, of Inline graphic-letter words, or Inline graphic-mers, in the sequence. Although genomic FDs have been much studied before [31]–[36], the method and focus of the present study are both distinct from all previous studies. A novel approach we use, crucial to our ability to extract the results presented here, is the separation of the contributions to the variance from the fluctuating part of an FD (FFD) and the non-fluctuating part (NFFD). We show that NFFD is entirely understood; it carries no statistical information other than the base composition of a sequence. A genomic sequence and its matching random sequence have essentially the same NFFD. The contribution from NFFD overwhelmingly dominates the variance (of an FD) of a random sequence in all cases and dominates the variance of a genome except when its base composition is approximately even. As a consequence, if the separation mentioned above is not carried out, then it is sometimes easy to distinguish genomic from random sequences and sometimes not, a situation that has confounded many previous studies. We will demonstrate that the very special characteristics of genomic FFDs sharply distinguish them from their random counterparts under all circumstances.
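As a concrete illustration of the frequency of occurrence of fixed-length words, the following minimal Python sketch counts overlapping words in a sequence and then tabulates how many words share each frequency. The function names and the symbol `k` for the word length are our own illustrative choices; the choice of overlapping windows and the treatment of non-ACGT letters are assumptions, not details taken from the Methods.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k):
    """Count overlapping k-letter words in seq.

    Returns a dict over all 4**k words on the alphabet ACGT, so that
    words that never occur appear with frequency 0. Windows containing
    other letters (e.g. N) are counted internally but not reported.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    words = ("".join(w) for w in product("ACGT", repeat=k))
    return {w: counts.get(w, 0) for w in words}


def frequency_distribution(freqs):
    """Map each frequency value to the number of words having it."""
    return Counter(freqs.values())


# Toy usage on a short sequence (hypothetical data).
freqs = kmer_frequencies("ACGTACGTAC", 2)
fd = frequency_distribution(freqs)
print(freqs["AC"], fd[0])
```

For a real chromosome the same dictionary would have 4**k entries, and it is the spread of those entries that the variance discussed in the text describes.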

In this study we used the FFD to define the equivalent lengths (Inline graphic's; one for each Inline graphic) of a sequence and discovered a universality in these quantities. We then identify these Inline graphic's, and their small values, as clear and distinct global imprints of genome growth and evolution. (The Inline graphic of a sequence is inversely proportional to the FFD part of the variance and is defined such that the Inline graphic of a random sequence is its own true length. Therefore, a sequence whose equivalent length is Inline graphic has the characteristic randomness of a random sequence of length Inline graphic.) We computed the Inline graphic of about 900 complete chromosomes, all the complete sequences at the time of download from GenBank, for Inline graphic = 2 to 10, and found some unexpected and useful results: roughly, the complete set of about 7400 Inline graphic-dependent whole-chromosome Inline graphic's is well represented by the universal formula Inline graphic(Inline graphic) = Inline graphic Inline graphic where Inline graphic b (base pair) and Inline graphic = 0.92. The formula means that, for the smaller Inline graphic's, the universal genomic Inline graphic is only a small fraction of the genome length even for the shortest genomes. Another unexpected result is the small difference between the Inline graphic's of coding and non-coding parts. In our successful attempt to describe these results in a simple genome growth model driven by random segmental duplication, we obtained a universal cumulative point mutation density of Inline graphic = 0.73Inline graphic0.07/site for genomes. This value is compatible with the relatively low mutation rates previously determined by sequence comparison for the human and E. coli genomes [37]–[39].

Results

Only FFD contains non-trivial information

A key to our approach to the analysis of genomic sequences is the decomposition of Inline graphic (Inline graphic is the coefficient of variation of an FD) into FFD and NFFD components (Methods). This is illustrated in Fig. 1, which shows the values of Inline graphic for 2-mers; results for other Inline graphic's are similar. The full Inline graphic of genomic sequences (Fig. 1(a)) differs from that of their matching random sequences (Fig. 1(b)) clearly only when Inline graphic Inline graphic Inline graphic Inline graphic0.1, where Inline graphic is the fractional A/T-content. (A genome and its matching random sequence have the same length and base composition.) The situation becomes much clearer when Inline graphic is decomposed into its FFD and NFFD parts, Inline graphic and Inline graphic, respectively. While the values of Inline graphic for the two types of sequences are almost indistinguishable ((red) triangles, Fig. 1(c,d); the two “volcano” curves are identical, being both given by the theoretical prediction, Eq. (12)), the values of Inline graphic for genomes and random sequences are drastically different ((blue) bullets, Fig. 1(c,d)). The genomic Inline graphic span a narrow band ranging from 0.01 to 0.1, while the random Inline graphic are several orders of magnitude smaller. In fact for random sequences the value of Inline graphic is well understood to be inversely proportional to sequence length (Eq. (13), and below). Clearly, if random sequences are used as controls to discuss the non-random properties of genomic sequences when the distinction between FFD and NFFD is not made, then it is possible that conflicting conclusions [32], [40]–[43] may be drawn.
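The idea of a matching random sequence, and of splitting a variance into a composition-driven part and a residual part, can be sketched in Python as follows. This is only an illustration under stated assumptions: the matching random sequence is produced here by shuffling (which preserves length and base composition exactly), and the split groups words by their number of A/T letters and applies the ordinary between-group/within-group variance identity. The actual definitions used in the paper are in its Methods (Eqs. (12) to (14)), which are not reproduced in this section, so the function `split_variance` should be read as a hypothetical stand-in.

```python
import random
from collections import Counter, defaultdict
from itertools import product


def matching_random_sequence(seq, seed=0):
    """Shuffle seq: same length and base composition, order randomized."""
    letters = list(seq)
    random.Random(seed).shuffle(letters)
    return "".join(letters)


def kmer_counts(seq, k):
    """Overlapping k-mer counts over all 4**k words on ACGT."""
    c = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    words = ("".join(w) for w in product("ACGT", repeat=k))
    return {w: c.get(w, 0) for w in words}


def split_variance(counts):
    """Return (total, between, within) population variances of the counts.

    Words are grouped by their number of A/T letters (an assumption).
    'between' depends only on group means, which for a random sequence
    are set by base composition; 'within' is the residual fluctuation.
    """
    n = len(counts)
    mean = sum(counts.values()) / n
    groups = defaultdict(list)
    for word, c in counts.items():
        groups[sum(ch in "AT" for ch in word)].append(c)
    between = sum(len(g) * (sum(g) / len(g) - mean) ** 2
                  for g in groups.values()) / n
    within = sum((c - sum(g) / len(g)) ** 2
                 for g in groups.values() for c in g) / n
    total = sum((c - mean) ** 2 for c in counts.values()) / n
    return total, between, within


# Hypothetical AT-rich toy sequence and its matching random sequence.
rng = random.Random(1)
toy = "".join(rng.choices("ACGT", weights=[4, 1, 1, 4], k=5000))
rand = matching_random_sequence(toy)
total, between, within = split_variance(kmer_counts(rand, 2))
print(total, between, within)
```

In this toy case the composition-driven part dominates the total for the AT-rich random sequence, which is the qualitative behaviour the text attributes to the non-fluctuating part.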

Figure 1. Fluctuating and non-fluctuating parts of variance.

(a) Variances of 2-mer frequency distribution of 865 complete sequences. (b) Same as (a) but for 865 matching random sequences. Bottom: same data as in top plots, but with each variance split into non-fluctuating (triangles) and fluctuating (bullets) parts, for (c) genomes and (d) matching random sequences. The “volcano” curves through the non-fluctuating data in (c) and (d) plot theoretical values given by Eq. (12).

Genomic Inline graphic is approximately a constant of sequence length

Throughout this paper we use Inline graphic to denote generically the equivalent length of any sequence (Eq. (14), Methods), and reserve Inline graphic for denoting entire sequences such as complete chromosomes. Fig. 2 shows Inline graphic versus segment length Inline graphic for segments taken from the chromosomes of four model organisms: E. coli Inline graphic; C. elegans, Chr. (chromosome) 1; A. thaliana, Chr. 1; H. sapiens, Chr. 1, and matching random sequences. The computation is carried out only when Inline graphic is at least four times Inline graphic, since for shorter lengths the systematic error becomes too large. It is seen that whereas the Inline graphic of random sequences closely tracks Inline graphic, as expected, the Inline graphic of genomic sequences quickly levels off to a saturation value Inline graphic. These results for Inline graphic Inline graphic5 kb may be summarized in terms of the scaling relation Inline graphic Inline graphic Inline graphic. Then we have the two distinct classes Inline graphic Inline graphic1 for random sequences and Inline graphic Inline graphic0 for genomic sequences. This scaling relation is not the same as the long-range correlation and scale-invariance observed in binary analyses of long genomic sequences [44]–[46]. In Fig. 2 Inline graphic is seen not to depend strongly on organism. For small Inline graphic, Inline graphic is diminutive relative to genome length: Inline graphic0.35 and Inline graphic1.0 kb when Inline graphic = 2 and 4, respectively, growing to Inline graphic600 kb when Inline graphic = 10. Within a genome, the apparent invariance of Inline graphic (not Inline graphic) with respect to segment length was noted in [47]–[49] and the relation between Shannon information and a quantity similar to Inline graphic was discussed in [50].

Figure 2. Segmental equivalent lengths from four model organisms.

Equivalent length Inline graphic versus sequence length Inline graphic for genomic (hollow symbols) and matching random (solid symbols) sequences. Genomic segments are from E. coli (Inline graphic), worm (C. elegans (chromosome) I, Inline graphic), mustard (A. thaliana I, Inline graphic), and human (H. sapiens I, Inline graphic). Each Inline graphic in the form of meanInline graphicSD is averaged over the maximum number of non-overlapping segments (of length Inline graphic) in the chromosome or, if the chromosome is longer than 20Inline graphic, 20 randomly selected segments.
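The segment-selection rule in this caption can be sketched as follows. This is a minimal Python illustration, assuming that the threshold is the segment length multiplied by 20, that the random segments are drawn uniformly and are allowed to overlap, and that a fixed seed is used for reproducibility; none of these details is stated in the caption, and the function name `segment_starts` is hypothetical.

```python
import random


def segment_starts(chrom_len, seg_len, max_random=20, seed=0):
    """Start positions of the segments used for averaging.

    If the chromosome is longer than max_random * seg_len, draw
    max_random random start positions (overlap allowed: an assumption).
    Otherwise use every non-overlapping segment that fits.
    """
    if chrom_len > max_random * seg_len:
        rng = random.Random(seed)
        last_start = chrom_len - seg_len
        return sorted(rng.randrange(last_start + 1) for _ in range(max_random))
    return [i * seg_len for i in range(chrom_len // seg_len)]


# Toy usage: a short "chromosome" tiles fully, a long one is sampled.
short_case = segment_starts(100, 30)
long_case = segment_starts(10_000, 100)
print(short_case, len(long_case))
```

The mean and SD quoted for each point would then be taken over the quantity computed on each selected segment.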

Whole chromosomes have nearly universal Inline graphic

A list of the 865 complete chromosomes studied here is given in Table S1, and a list of Inline graphic's, Inline graphic = 2 to 10, for the chromosomes is given in Table S2. Fig. 3 shows Inline graphic, as a function of Inline graphic (top panels) and chromosome length Inline graphic (bottom panels), computed from the complete chromosomes for even Inline graphic's up to Inline graphic = 10. Table 1 gives the Inline graphic, Inline graphic = 2 to 10, of chromosomes of seven model organisms. It is seen that Inline graphic has a clear dependence on Inline graphic, is essentially independent of sequence length, and has a weak dependence on Inline graphic. Fig. 4 gives Inline graphic for odd Inline graphic's averaged over categories of organisms and over chromosomes in model organisms (for more detailed results see Table S3). The Inline graphic = 5 data reconfirms the absence in Inline graphic of a systematic dependence on chromosome length (similarly for other Inline graphic's). In the Inline graphic = 3 and 7 plots Inline graphic's are given separately for the whole chromosome, and genic (gn), and inter-genic (ig), exon (ex) and intron (in, when applicable) concatenates (Methods). The unicellulars are seen to have the largest variation in Inline graphic, especially for the ig and in regions. This partly reflects the fact that this category includes two phylogenetically remote groups, protists and fungi. In contrast, the relatively small variation in the vertebrate Inline graphic reflects the fact that, compared to organisms in other categories, vertebrates are phylogenetically very close. Two examples in opposite extremes are shown in the bottom panel of Fig. 4 (Inline graphic = 7): the malaria-causing parasite P. falciparum with especially small Inline graphic's, and the fungus S. pombe with relatively large Inline graphic's. This indicates that the chromosomes of P. falciparum and S. pombe are much less and much more random, respectively, than the genomic norm.
Although such inter-category, inter-species and inter-regional differences are significant, they pale when compared with the difference between Inline graphic and true chromosome lengths. Table 2 lists Inline graphic, Inline graphic = 2, 5, 7 and 10, averaged over all 865 sequences, for whole chromosome and the four types of concatenates.

Figure 3. Chromosomal equivalent length (Inline graphic) versus Inline graphic and Inline graphic.

Top panels: Inline graphic versus Inline graphic; bottom panels: Inline graphic versus Inline graphic. Each piece of data gives the Inline graphic from a complete chromosome: Inline graphic (red), Inline graphic = 2; Inline graphic (gray), Inline graphic = 4; Inline graphic (blue), Inline graphic = 6; Inline graphic (green), Inline graphic = 8; Inline graphic (orange), Inline graphic = 10. Lines in top-left panel represent the “universality class” Inline graphic (Inline graphic;Inline graphic) (Eq. (1)). The right panels show the collapse of genomic data to around unity when the genomic Inline graphic is divided by Inline graphic (Inline graphic;Inline graphic).

Table 1. Genomic equivalent lengths for model organisms.

Inline graphic (kb)Inline graphic
Organism Inline graphic Inline graphic 2 3 4 5 6 7 8 9 10
H. sapiens (24)Inline graphic .188Inline graphic.021 .448Inline graphic.046 1.22Inline graphic.13 3.39Inline graphic.41 9.34Inline graphic1.36 23.8Inline graphic4.4 53.9Inline graphic12.6 103Inline graphic29 170Inline graphic54
H. sapiens (gn; 43.2%)Inline graphic .185Inline graphic.022 .440Inline graphic.048 1.20Inline graphic.14 3.31Inline graphic.42 9.02Inline graphic1.33 22.4Inline graphic4.0 49.2Inline graphic10.9 90.5Inline graphic23.7 144Inline graphic42
H. sapiens (ig; 63.6%)Inline graphic .190Inline graphic.021 .452Inline graphic.045 1.24Inline graphic.13 3.44Inline graphic.41 9.51Inline graphic1.36 24.5Inline graphic4.5 56.6Inline graphic13.4 111Inline graphic32 186Inline graphic61
H. sapiens (ex; 2.1%)Inline graphic .171Inline graphic.019 .412Inline graphic.042 1.12Inline graphic.12 3.07Inline graphic.39 8.21Inline graphic1.26 19.9Inline graphic3.8 41.9Inline graphic10.3 72.2Inline graphic21.6 117Inline graphic22
H. sapiens (in; 37%)Inline graphic .182Inline graphic.020 .434Inline graphic.043 1.18Inline graphic.13 3.26Inline graphic.40 8.84Inline graphic1.34 21.9Inline graphic4.2 47.7Inline graphic11.5 87.2Inline graphic24.9 139Inline graphic45
A. thaliana (5)Inline graphic .373Inline graphic.005 .871Inline graphic.013 2.20Inline graphic.04 5.89Inline graphic.10 16.0Inline graphic.3 42.1Inline graphic.8 109Inline graphic2 273Inline graphic7 642Inline graphic20
A. thaliana (gn; 55.8%)Inline graphic .333Inline graphic.004 .822Inline graphic.011 2.06Inline graphic.03 5.57Inline graphic.08 15.9Inline graphic.2 44.9Inline graphic.7 129Inline graphic2 367Inline graphic6 981Inline graphic22
A. thaliana (ig; 44.1%)Inline graphic .394Inline graphic.007 .798Inline graphic.014 1.94Inline graphic.04 4.95Inline graphic.10 12.3Inline graphic.2 28.9Inline graphic.6 66.1Inline graphic1.5 144Inline graphic4 296Inline graphic12
A. thaliana (ex; 32.9%)Inline graphic .288Inline graphic.003 .715Inline graphic.007 1.75Inline graphic.02 4.72Inline graphic.05 13.6Inline graphic.1 38.9Inline graphic.4 113Inline graphic2 326Inline graphic7 865Inline graphic35
A. thaliana (in; 16.1%)Inline graphic .350Inline graphic.003 .752Inline graphic.006 1.80Inline graphic.02 4.42Inline graphic.04 11.1Inline graphic.1 27.3Inline graphic.4 68.1Inline graphic1.0 167Inline graphic3 400Inline graphic1
Inline graphic (4)Inline graphic .409Inline graphic.142 .957Inline graphic.213 2.54Inline graphic.46 6.90Inline graphic1.17 18.7Inline graphic3.2 48.2Inline graphic9.5 117Inline graphic31 268Inline graphic102 676Inline graphic294
Inline graphic (gn; 56.4%)Inline graphic .432Inline graphic.108 1.02Inline graphic.15 2.71Inline graphic.30 7.35Inline graphic.85 20.0Inline graphic2.8 51.6Inline graphic9.9 127Inline graphic35 326Inline graphic120 756Inline graphic321
Inline graphic (ig; 43.5%)Inline graphic .392Inline graphic.194 .882Inline graphic.305 2.30Inline graphic.66 6.15Inline graphic1.57 16.1Inline graphic3.3 39.4Inline graphic7.5 90.0Inline graphic28.1 235Inline graphic87 536Inline graphic231
Inline graphic (ex; 23.9%)Inline graphic .478Inline graphic.023 1.16Inline graphic.09 2.82Inline graphic.41 7.55Inline graphic1.39 21.0Inline graphic4.2 55.6Inline graphic10.7 140Inline graphic29 377Inline graphic111 907Inline graphic324
Inline graphic (in; 34.8%)Inline graphic .378Inline graphic.145 .833Inline graphic.168 2.15Inline graphic.30 5.65Inline graphic.73 14.8Inline graphic2.3 36.2Inline graphic7.9 84.0Inline graphic26.2 207Inline graphic79 458Inline graphic198
C. elegans (6)Inline graphic .119Inline graphic.012 .258Inline graphic.032 .624Inline graphic.089 1.63Inline graphic.26 4.46Inline graphic.78 12.6Inline graphic2.3 35.5Inline graphic6.9 98.8Inline graphic21.0 264Inline graphic63
C. elegans (gn; 58.6%)Inline graphic .126Inline graphic.017 .284Inline graphic.047 .697Inline graphic.135 1.83Inline graphic.40 5.06Inline graphic1.21 14.3Inline graphic3.7 40.8Inline graphic11.1 114Inline graphic34 306Inline graphic99
C. elegans (ig; 41.3%)Inline graphic .109Inline graphic.009 .226Inline graphic.022 .539Inline graphic.061 1.39Inline graphic.18 3.78Inline graphic.51 10.5Inline graphic1.5 29.3Inline graphic4.5 79.5Inline graphic13.6 202Inline graphic41
C. elegans (ex; 27.5%)Inline graphic .184Inline graphic.010 .483Inline graphic.025 1.28Inline graphic.07 3.64Inline graphic.23 10.9Inline graphic.7 33.2Inline graphic2.4 102Inline graphic8 306Inline graphic25 822Inline graphic58
C. elegans (in; 32.3%)Inline graphic .085Inline graphic.015 .169Inline graphic.037 .382Inline graphic.096 .939Inline graphic.265 2.44Inline graphic.73 6.52Inline graphic1.99 17.4Inline graphic5.3 45.4Inline graphic14.1 113Inline graphic37
S. pombe (3)Inline graphic .362Inline graphic.010 .894Inline graphic.030 2.41Inline graphic.09 6.74Inline graphic.28 19.2Inline graphic.9 54.6Inline graphic3.0 153Inline graphic11 402Inline graphic39 1013Inline graphic39
S. pombe (gn; 57.8%)Inline graphic .339Inline graphic.002 .880Inline graphic.006 2.38Inline graphic.01 6.82Inline graphic.05 20.2Inline graphic.2 59.6Inline graphic.8 173Inline graphic6 455Inline graphic42
S. pombe (ig; 42.1%)Inline graphic .364Inline graphic.019 .812Inline graphic.045 2.08Inline graphic.12 5.31Inline graphic.32 13.5Inline graphic.8 33.6Inline graphic2.1 81.7Inline graphic5.8 187Inline graphic16
S. pombe (ex; 53.9%)Inline graphic .357Inline graphic.007 .889Inline graphic.018 2.40Inline graphic.06 6.73Inline graphic.18 19.2Inline graphic.6 54.4Inline graphic2.3 149Inline graphic10 374Inline graphic42
S. pombe (in; 3%)Inline graphic .361Inline graphic.007 .898Inline graphic.017 2.41Inline graphic.06 6.53Inline graphic.14 17.0Inline graphic.4 38.2Inline graphic3.1
Inline graphic (14)Inline graphic 1.40Inline graphic.20 .287Inline graphic.019 .376Inline graphic.023 .512Inline graphic.036 .729Inline graphic.059 .998Inline graphic.089 1.34Inline graphic.13 1.73Inline graphic.19
Inline graphic (gn; 56%)Inline graphic .595Inline graphic.118 .659Inline graphic.085 1.02Inline graphic.12 1.86Inline graphic.29 3.59Inline graphic.74 6.73Inline graphic1.86 12.3Inline graphic4.3 16.3Inline graphic10.4
Inline graphic (ig; 44%)Inline graphic .665Inline graphic.108 .111Inline graphic.017 .130Inline graphic.017 .162Inline graphic.022 .212Inline graphic.031 .276Inline graphic.042 .357Inline graphic.057 .398Inline graphic.032
Inline graphic (ex; 53%)Inline graphic .515Inline graphic.058 .717Inline graphic.060 1.12Inline graphic.07 2.10Inline graphic.11 4.21Inline graphic.23 8.30Inline graphic.56 16.0Inline graphic1.3 32.0Inline graphic1.6
Inline graphic (in; 5.7%)Inline graphic .163Inline graphic.019 .052Inline graphic.002 .064Inline graphic.003 .076Inline graphic.003 .095Inline graphic.004 .116Inline graphic.003
E. coli (1)Inline graphic .373 .729 1.74 4.52 12.6 37.0 111 328 879
E. coli (gn; 88.7%)Inline graphic .346 .656 1.56 4.05 11.3 33.0 98.9 292
E. coli (ig; 11.2%)Inline graphic .553 1.22 2.60 6.33 16.0 39.3 83.9

Inline graphic, Inline graphic = 2 to 10, of chromosomes of model organisms. The Inline graphic's given are meanInline graphicSD averaged over chromosomes of the organism, except for the single-chromosome E. coli. See Table S2 for list of all computed Inline graphic's. (Inline graphic) Number in parentheses indicates total number of complete chromosomes in organism. (Inline graphic) Abbreviations: gn, gene; ig, intergenic; ex, exon; in, intron. Percentage given indicates portion of complete sequence. “N-runs” or gaps in sequences are not counted. (Inline graphic) Ex and in segments selected as given by GenBank; sum of percentages for ex and in may be less than or exceed that of gn due to incomplete or duplicated segments. (Inline graphic) Inline graphic(Inline graphic) computed only if category has more than one sequence whose length exceeds Inline graphic.

Figure 4. Averaged equivalent lengths for complete chromosomes and concatenates.

The concatenates are: “gene” (gn in main text), coding regions; “intergene” (ig), non-coding or intergenic regions; “exon” (ex), exons in gn (for eukaryotes); “intron” (in), introns in gn. Top left, Inline graphic (Inline graphic = 3) averaged over phylogenetic categories (Uni, unicellulars; Pla, plants; Ins, insects; Ver, vertebrates; Pro, prokaryotes); top right, Inline graphic (Inline graphic = 5) versus chromosome length averaged over categories; bottom, Inline graphic (Inline graphic = 7) for seven model organisms averaged over chromosomes. Boxes indicate data in the 10, 25, 50, 75 and 90% range.

Table 2. Average genomic equivalent lengths.

Inline graphic (kb)
Category (Inline graphic = ) 2 5 7 10
All Inline graphic Inline graphic Inline graphic Inline graphic
gn (41.8%) Inline graphic Inline graphic Inline graphic Inline graphic
ig (59.6%) Inline graphic Inline graphic Inline graphic Inline graphic
ex (3.3%) Inline graphic Inline graphic Inline graphic Inline graphic
in (31.8%) Inline graphic Inline graphic Inline graphic Inline graphic
Inline graphic (Inline graphic = 0.5) Inline graphic Inline graphic Inline graphic Inline graphic
RSD model Inline graphic Inline graphic Inline graphic Inline graphic

Inline graphic, Inline graphic = 2, 5, 7 and 10, averaged over 865 chromosomes. Total sequence length is about 2.2Inline graphic10Inline graphic bases. Abbreviations: All, complete chromosome; gn, genes; ig, intergenic; ex, exons; in, introns. Percentage given indicates portion of complete sequence. Inline graphic is defined in Eq. (1) and RSD results are averaged over 200 model sequences. See Table S4 for Inline graphic of other Inline graphic values.

Summary of genomic data

We summarize the trends of genomic data: (a) Inline graphic increases with Inline graphic. (b) For given Inline graphic, Inline graphic has no systematic dependence on Inline graphic and has a weak dependence on Inline graphic. (c) For given Inline graphic, Inline graphic for different organisms are of the same order of magnitude. (d) Within a genome, Inline graphic differs little among chromosomes. (e) There is remarkable agreement between the gn and ex data sets. (f) There is not a significant difference between the Inline graphic's for coding (Inline graphic and Inline graphic) and non-coding (Inline graphic and Inline graphic) regions, and the agreement between the two regions improves when the fact that coding regions tend to be GC-rich is taken into account (Text S1 and Fig. S1). We remark that in splicing the Inline graphic concatenate, genes in positive and negative orientations from a Inline graphic strand of DNA are concatenated, without inverting the negatively oriented genes (Methods). The same holds for the Inline graphic concatenate.

Discussion

Universal Inline graphic is not a result of inter-chromosome similarity in Inline graphic-mer-content

Fig. 5 shows intra-chromosome Inline graphic-mer-content similarity plots (Methods) for six representative chromosomes. In the plots, a small value of Inline graphic (Inline graphic0.2, black-blue) indicates a high degree of similarity, and a large value (Inline graphic1, cyan to red) indicates the opposite. A general trend is that local Inline graphic-mer-content within a chromosome is fairly homogeneous [51], [52] on a scale as small as 50 kb. When Inline graphic-mer-contents of coding and non-coding parts show a significant difference, as is seen in the case of P. falciparum, M. stadtmanae, and E. coli, it is mainly caused by the gn part being substantially richer in GC content than the Inline graphic part (Table 3). Nevertheless, because Inline graphic is defined such that first-order dependence in base composition is removed, within a chromosome the Inline graphic's for the Inline graphic and Inline graphic parts and for the whole chromosome generally have similar values (Table S3, Inline graphic).

Figure 5. Intra-chromosome similarity plots.

Plots are for Inline graphic = 2 (Methods). Sliding window has width 25 kb and slide 10 kb; pixel size is 10 kb by 10 kb. In each plot, the coordinates for the upper-left triangle are sites along the chromosome (chr), and those for the lower-right triangle are along a concatenate composed of gene (gn, left side) and intergene (ig, right side) parts. In effect, the upper-left triangle shows chr-chr similarity, and the lower-right triangle shows gn-gn (lower-left sub-triangle), ig-ig (upper-right sub-triangle), and gn-ig (rectangular) similarities in three separate regions. The lengths of the gn and ig parts are given in Table 3.
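The sliding-window construction behind such a plot can be sketched in Python as follows. The window width and slide are parameters, as in the caption. The pairwise measure used here (the sum of absolute differences between normalized word frequencies) is only a hypothetical stand-in: the similarity index actually plotted is defined in the Methods, which are not reproduced in this section. All function names are our own.

```python
from collections import Counter


def windows(length, width, step):
    """(start, end) pairs of sliding windows that fit inside the sequence."""
    return [(s, s + width) for s in range(0, length - width + 1, step)]


def kmer_profile(seq, k):
    """Normalized overlapping k-mer frequencies of one window."""
    c = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(c.values())
    return {w: n / total for w, n in c.items()}


def profile_distance(p, q):
    """Stand-in dissimilarity: 0 for identical profiles, 2 for disjoint ones."""
    return sum(abs(p.get(w, 0.0) - q.get(w, 0.0)) for w in set(p) | set(q))


def similarity_matrix(seq, width, step, k=2):
    """Matrix of pairwise window dissimilarities (one pixel per window pair)."""
    profiles = [kmer_profile(seq[a:b], k)
                for a, b in windows(len(seq), width, step)]
    return [[profile_distance(p, q) for q in profiles] for p in profiles]


# Toy usage: two homogeneous halves with different content.
toy = "A" * 100 + "C" * 100
matrix = similarity_matrix(toy, width=50, step=50)
print(matrix[0][1], matrix[0][2])
```

With a 25 kb window and a 10 kb slide, the same construction would produce one row and one column per 10 kb step along the chromosome or concatenate.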

Table 3. Intra-chromosome similarity indexes.

Length (Mb)/Inline graphic Average Inline graphic
Organism chr gn ig chr-chr gn-gn ig-ig gn-ig
S. pombe Chr. 1 2.45/0.64 1.40/0.61 1.05/0.69 0.648 0.569 0.615 0.647
E. cuniculi (genome) 2.50/0.53 2.15/0.53 0.35/0.55 0.527 0.481 0.450 0.666
P. falciparum Chr. 13 2.73/0.82 1.55/0.79 1.18/0.87 0.801 0.742 0.641 2.11
M. stadtmanae 1.77/0.73 1.51/0.71 0.26/0.83 0.805 0.782 0.757 2.52
S. glossinidius morsitans 4.17/0.46 2.15/0.44 2.02/0.47 0.638 0.510 0.635 0.729
E. coli K12 4.64/0.50 4.12/0.49 0.52/0.58 0.517 0.481 0.548 1.63

Compositions and average regional similarity indexes of sequences shown in Fig. 5; chr, chromosome; gn, gene; ig, intergenic.

Fig. 6 compares the intra-E. coli plot with inter-chromosome plots of E. coli versus seven other organisms whose phylogenetic distances to E. coli range from close to remote. The approximate monochromaticity of each plot reconfirms our previous observation that Inline graphic-mer-content within a chromosome has a high degree of homogeneity (on a scale of 100 kb). We see a close correlation between phylogenetic distance and the shades (colors) of the seven inter-chromosome plots. Fig. 7 gives the mean Inline graphic for the plots and P-values from Student t-tests for the null hypothesis that the inter-chromosome plots are the same as the intra-E. coli plot. These results verify that the observed near universal value in Inline graphic is not caused by similarity in Inline graphic-mer-content among chromosomes.

Figure 6. Intra- E. coli and inter-chromosome similarity plots.

The plots are those of E. coli chromosome Inline graphic the chromosomes of, left to right and top to bottom, E. coli, E. coli UT189, Salmonella, the delta-proteobacteria S. aciditrophicus, the cyanobacteria Synechocystis, the archaea P. aerophilum, chromosome 5 of the fungus A. fumigatus, and the first 4.5 Mb segment from chromosome 1 of H. sapiens. Coordinates are sites along the sequence. Sliding window width is 100 kb and slide is 25 kb, pixel size is 25 kb by 25 kb.

Figure 7. Comparison of inter-chromosome similarity matrices.

Mean values and SD of the eight Inline graphic-plots (of Inline graphic-matrices) shown in Fig. 6 and P-values for the null assumption that the 2nd to 7th cases are the same as the 1st case.

As an aside, we note that in Fig. 5 the plot for S. pombe indicates that a Inline graphic100 kb ig segment around the 1.1 Mb site has extraordinarily low similarity with respect to all other regions of the chromosome. This could be the result of a non-genic horizontal/lateral transfer [53], [54] and suggests that similarity plots may be useful for locating such events.

A universal formula for Inline graphic

The 7360 pieces of data in the “All” set in Table 2 are well represented by the empirical formula,

graphic file with name pone.0009844.e550.jpg (1)
graphic file with name pone.0009844.e551.jpg (2)

where Inline graphic = 0.92, Inline graphic b, and Inline graphic = 0.50Inline graphic0.05. The central values of the formula are shown as solid lines in Fig. 3 and listed as the entries in the row labeled Inline graphic in Table 2. The denominator in Eq. (2) represents the residual Inline graphic-dependence indicated in the data in Fig. 3; it works well even for chromosomes with large Inline graphic Inline graphic0.5Inline graphic (Table S4, Inline graphic). For the vast majority of genomic Inline graphic's, Inline graphic Inline graphic Inline graphic(Inline graphic/Inline graphic (Text S1) is less than 1 (Fig. S2) and, averaged over the 7360 pieces of data in the “All” set, Inline graphic = 0.43. This means that on average the genomic Inline graphic is within a factor of two of Inline graphic. In recognizing that genomes as a category exhibit such a non-trivial common feature, which is itself the manifestation of an underlying but as yet undetermined cause, we say genomes belong to a universality class. It is realized that Eq. (1) cannot be extended to Inline graphic much greater than 10 (and not even to 10 for some of the smaller chromosomes), because a meaningful value for Inline graphic may be extracted only when a sequence is at least Inline graphic bases long.

A universal formula for the standard deviation from the fluctuating part in Inline graphic-mer frequency

The short genomic Inline graphic (relative to actual chromosome length) is a direct consequence of the genomic Inline graphic being much larger than its random-sequence counterpart. If we approximate Inline graphic in Eq. (1) by Inline graphic and approximate the factor Inline graphic in Eq. (14) (Methods) by unity, then through Eq. (14) we convert Eq. (1) to a universal formula for the Inline graphic-set-averaged standard deviation for the Inline graphic-mer FFD:

graphic file with name pone.0009844.e582.jpg (3)

where Inline graphic is the sequence length. The formula is meant to be applicable so long as Inline graphic is several times greater than Inline graphic. For sequences with Inline graphic Inline graphic0.5, Inline graphic reduces to the usual variance. Note that for random sequences Inline graphic Inline graphic Inline graphic. Since Inline graphic is large, genomic Inline graphic can be orders of magnitude greater than its random counterpart. For instance, for the 4.6 Mb chromosome, the Inline graphic = 4 values for Inline graphic given by Eq. (3), the actual chromosome (Inline graphic-averaged), and a random sequence are 6440 b, 6230 b, and 134 b, respectively, and for the 228 Mb human chromosome 1, the corresponding values are 319,000 b, 380,000 b, and 943 b, respectively. To give statistical meaning to such differences, Table 4 examines universal genomes of various lengths and gives the fractions of 2-mers and 9-mers (in the genomes) whose frequencies have P-values that are less than PInline graphic – the P-value corresponding to Inline graphic standard deviations away from the expected frequency in a random sequence – for Inline graphic = 3, 6, and 8, respectively. Because Inline graphic Inline graphic Inline graphic, the fraction increases with decreasing Inline graphic and increasing Inline graphic (for a given Inline graphic). For instance, for a sequence 4.6 Mb long (length of E. coli chromosome), fourteen of the sixteen 2-mers have PInline graphicPInline graphic ( = 1.3Inline graphic), whereas only 26,000 of the 262,144 9-mers are so. In comparison, for a sequence 226 Mb long (length of human chromosome 1), all sixteen 2-mers and 213,000 of the 9-mers are so.

Table 4. P-values for Inline graphic-mer distribution in universality class.

Fraction of k-mers whose P-value is less than P_3, P_6, or P_8
Columns 2 to 4: k = 2 (equivalent length = 310 b); columns 5 to 7: k = 9 (equivalent length = 194 kb)
Length (Mb) P<P_3 P<P_6 P<P_8 P<P_3 P<P_6 P<P_8
0.8 0.953 0.906 0.875 0.139 0.0031 0.0001
4.6 0.980 0.960 0.955 0.538 0.418 0.100
30 0.992 0.985 0.979 0.809 0.628 0.519
226 0.997 0.994 0.992 0.930 0.860 0.815

P-values for Inline graphic-mer distribution given by Eq. (1) (at Inline graphic = 0.5). Null theory assumes genomes are random sequences. The P-values PInline graphic = 2.7Inline graphic, PInline graphic = 2.0Inline graphic, and PInline graphic = 1.3Inline graphic correspond to Inline graphic-values of three, six and eight, respectively.
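The three threshold P-values quoted in the caption can be checked with a short calculation. The sketch below assumes they are two-sided Gaussian tail probabilities for deviations of three, six and eight standard deviations; the function name `two_sided_p` is ours, not the paper's.

```python
import math

def two_sided_p(n_sigma: float) -> float:
    """Two-sided Gaussian tail probability for a deviation of n_sigma
    standard deviations (assumed normal approximation)."""
    return math.erfc(n_sigma / math.sqrt(2.0))

for n in (3, 6, 8):
    # Prints the thresholds in scientific notation for comparison with the caption.
    print(n, f"{two_sided_p(n):.2e}")
```

Under this assumption the three- and six-sigma thresholds reproduce the quoted values, and the eight-sigma threshold comes out near 1.2×10^-15, close to the quoted 1.3×10^-15.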

Segmental duplication shortens Inline graphic

We now discuss probable causes for the formation of the universality class. We first list some general properties of the ratio Inline graphic of Inline graphic to the sequence length Inline graphic: if the sequence is (nearly) random then Inline graphic( = Inline graphic/Inline graphic)Inline graphic1; if it is far less random than a random sequence of length Inline graphic then Inline graphic Inline graphic1; if it is essentially ordered then Inline graphic Inline graphic0; if it is the Inline graphic-fold replication of a random sequence, then Inline graphic Inline graphic1/Inline graphic. We illustrate how segmental duplication can cause a sequence to have Inline graphic much less than one by considering the effect of a generalization of the operation of replication on Inline graphic. To be specific, we denote by XY the concatenate composed of X and Y. If Y is a coarse-grained rearrangement of X, then, provided the scale of the rearrangements is not too small, Inline graphic(X)Inline graphic Inline graphic(Y) and concatenating X and Y is similar to doubling X by replication, hence Inline graphic(XY) will be nearly equal to Inline graphic(X).

In general, if the Inline graphic-mer-contents of X and Y are similar, then (provided the sequences are sufficiently long) we expect Inline graphic(XY)Inline graphic Inline graphic(X)Inline graphic Inline graphic(Y). Conversely, if the Inline graphic-mer-contents of X and Y are significantly different, then we expect Inline graphic(XY)Inline graphic Inline graphic(Inline graphic(X), Inline graphic(Y)) (see Text S1 for an expanded discussion, including formulas given in Table S5). Results for testing these simple rules with real sequences are shown in Table 5. We expect agreement with theory to improve with increasing sequence length (Inline graphic). The first two rows of results in Table 5 verify that for random sequence Inline graphic is always close to one, or Inline graphic Inline graphic Inline graphic. The results for AAInline graphic and BBInline graphic show that concatenating two equal-length segments from the same chromosome is indeed like doubling a sequence by replication. Chromosomes labeled CInline graphic have Inline graphic-mer-contents relatively more similar to A (Figs. 4 and 5), therefore Inline graphic(ACInline graphic)Inline graphic Inline graphic(AAInline graphic)Inline graphic Inline graphic(A) as expected. Chromosomes labeled DInline graphic and B have Inline graphic-mer-contents more dissimilar to A, therefore Inline graphic(AX)Inline graphic Inline graphic(Inline graphic(A), Inline graphic(X)). The case of ADInline graphic, where DInline graphic is H. sapiens chr. 1, is not an exception to the rule even for Inline graphic = 2, because Inline graphic(DInline graphic)Inline graphic Inline graphic(A). 
In the bottom portion of Table 5 the approximate relation Inline graphic Inline graphic Inline graphic Inline graphic (Table S5; Inline graphic is the equivalent length of the genomic portion and Inline graphic is the ratio of the length of the concatenate to that of the genomic portion) is seen to hold: Inline graphic(RX)Inline graphic4Inline graphic(X) (X being A or B), Inline graphic(RAB)Inline graphic2.3Inline graphic(AB), and Inline graphic(RR'X)Inline graphic9Inline graphic(X).

Table 5. Equivalent lengths of composite sequences.

Inline graphic
Columns 2 and 3: k = 2; columns 4 and 5: k = 6
Sequence L = 50 L = 200 L = 50 L = 200
R 47.5±28.2 154±126 48.6±1.5 192±5
RR' 37.0±16.2 124±46 48.2±1.2 197±5
A .348±.037 .360±.033 9.55±.69 11.7±.7
AA' .357±.046 .352±.023 9.88±1.07 11.1±.7
AC_1 .351±.061 .361±.021 9.37±1.01 11.5±.6
AC_2 .354±.043 .384±.045 9.18±.83 11.6±.9
AC_3 .359±.051 .371±.034 11.0±.9 14.2±1.5
AD_1 .411±.044 .423±.024 11.8±.9 14.3±.6
AD_2 .942±.275 1.05±.09 14.9±1.4 20.4±1.1
AD_3 .598±.104 .613±.052 17.9±1.6 24.0±1.6
AD_4 .324±.052 .383±.055 11.2±1.9 16.9±1.9
B .124±.029 .166±.099 5.17±.68 6.54±2.00
BB' .232±.155 .258±.183 6.16±1.94 7.54±2.30
AB .463±.241 .502±.263 11.2±1.9 15.2±3.5
RA 1.19±.09 1.34±.20 22.6±1.2 38.5±3.0
RB .575±.321 .754±.637 15.6±4.2 23.3±8.5
RAB .873±.424 1.10±.49 18.4±3.2 31.3±6.0
RR'A 2.63±.66 3.16±.30 31.5±2.1 72.2±6.8
RR'B 1.03±.62 1.37±.70 22.9±4.5 44.7±14.3

Equivalent lengths Inline graphic of composite sequences of total length Inline graphic (in kb). The composite XY is the concatenation of two equal-length components X and Y. Similarly for the composite XYZ. A and AInline graphic are segments from E. coli, and B and BInline graphic are from C. tetani (2.80 Mb, Inline graphic = 0.70). CInline graphic and DInline graphic are the seven “other” chromosomes in Fig. 6, in the order given there. R and RInline graphic are Inline graphic = 0.5 random sequences. Results are averaged over 10 samples in all cases.

Artificial sequences generated by RSD growth model exhibit universal Inline graphic

We show that a very simple growth model, the minimum random segmental duplication (RSD) model [49] (Methods; Text S1), generates chromosome-length sequences that have Inline graphic's very close to the universal Inline graphic given by Eq. (1). In the model, simple segmental duplication (SD) serves to represent the numerous modes of DNA copying processes known to occur in genomes [9][11], [55], [56], and point mutation represents all small non-duplicating events. We consider random events because it is the simplest assumption and because it generates sequences with a reasonable degree of homogeneity [51], [52]. (It is known that genomes have long-range correlations that require tandem SDs to generate [46], [57]. Since tandem duplications do not affect Inline graphic, for simplicity they are not given special treatment in this study.) The three parameters of the model are Inline graphic (initial length), Inline graphic (average duplicated segment length), and Inline graphic (cumulative point mutation per-base density) (Methods). Inline graphic generated by the model is insensitive to sequence length provided it is longer than 0.5 Mb, allows a generous range in Inline graphic and a tighter range in Inline graphic, and is highly sensitive to Inline graphic (Fig. S3, Inline graphic). (Because RSD will at least initially cause Inline graphic to be longer than Inline graphic and because Inline graphic (Inline graphic = 2)Inline graphic300 b, Inline graphic must be significantly less than 300 b.) Fig. 8 shows that, at Inline graphic = 64, the model admits a basin of good values delimited by Inline graphic = 120 to 5000 and Inline graphic = 0.65 to 0.80. Inline graphic's of model sequences obtained using the “best set” of parameters Inline graphic = 64, Inline graphic = 1000, and Inline graphic = 0.73 are shown in the right panel in Fig. 8, where the lines represent the universality class Inline graphic (Eq. (1)). The Inline graphic for these Inline graphic's is 0.18 and implies that on average, the model Inline graphic and Inline graphic agree to within a factor of 1.6. This small Inline graphic can easily be increased to match that of the genomic data (Inline graphic = 0.43) by using model parameters that cover suitable ranges of values centered around the best values.

Figure 8. Results from minimal RSD model.

Figure 8

Left: Equi-Inline graphic contour on the Inline graphic-Inline graphic plane, with Inline graphic = 64 (bases). Right: Inline graphic, Inline graphic = 2, 4, 6, 8, 10 from 200 model sequences of length 2 Mb generated using the “best set” of parameters Inline graphic = 64, Inline graphic = 1000 (b) and Inline graphic = 0.73 (bInline graphic). Lines in right panel are Inline graphic (Eq. (1)).

The range of Inline graphic within the basin of good values seems biologically realistic, for it is consistent with the range of the characteristic lengths of genes. The isolated basin near Inline graphic = 30, Inline graphic = 0.3 allows copious duplication of regulatory sequences, including microRNAs [58], that are much shorter than genes. The considerable size of the main basin implies that it is easily accessible in an evolutionary selective process. On the other hand, that Inline graphic increases sharply outside the basin of good values demonstrates that even in the context of the RSD model it is very easy to generate sequences that are far outside the universality class.

Rates of genome growth and duplication

The parameters of the RSD model are compatible with rates of genome growth and duplication determined using sequence comparison [37][39]. In a model where a genome grows at a constant per-time rate Inline graphic, we have Inline graphic = Inline graphic where Inline graphic is the length of the genome at time Inline graphic (Eq. (16), Methods). For the human genome we can take Inline graphic to be the current time because the human genome has grown 15% to 20% in the last 50 Mya (10Inline graphic years) [39]. The ancestors of eubacteria and archaea-eukaria diverged Inline graphic3.4 Gya (10Inline graphic years) ago [59][61], and before that proto-genomes most likely evolved as communities [62][64], and hence had a different growth regime than in later times. The smallest bacterial genome is about 0.2 Mb; we take Inline graphic to be from 0.05 to 0.2 Mb and Inline graphic = 3 Gb. Then Inline graphic = 2.7Inline graphic3.7/Mya. These rates imply the human genome grew 14Inline graphic20% in the last 50 Mya, in agreement with [39]. If we assume the growth is purely by SD and take the length of duplicated segment Inline graphic to be 500 b to 2 kb, then the rate of SD events is Inline graphic = Inline graphic = 1.4Inline graphic7.4/Mb/Mya. These values are comparable to the estimates of 3.9/Mb/Mya (from animal gene duplication rate of Inline graphic0.01 per gene per Mya [6] and human coding region Inline graphic3% of genome), and 2.8/Mb/Mya (from human retrotransposition event rate [39]).
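The arithmetic in this paragraph can be retraced with a few lines of code. The sketch below rests on two assumptions, since the corresponding equations are not legible here: that Eq. (16) is simple exponential growth, so the rate is ln(L/L0) divided by the elapsed time, and that the SD event rate is the growth rate divided by the mean duplicated-segment length. It also reads the quoted growth rate as (2.7 to 3.7)×10^-3 per Mya, which is the value that the stated 14 to 20% growth over 50 Mya implies. All function names are illustrative.

```python
import math

def growth_rate(l0_mb: float, l_mb: float, elapsed_mya: float) -> float:
    """Per-Mya exponential growth rate, assuming L = L0 * exp(r * t)."""
    return math.log(l_mb / l0_mb) / elapsed_mya

def fractional_growth(rate_per_mya: float, interval_mya: float) -> float:
    """Fraction by which the genome lengthens over the given interval."""
    return math.exp(rate_per_mya * interval_mya) - 1.0

def sd_event_rate_per_mb(rate_per_mya: float, mean_segment_b: float) -> float:
    """SD events per Mb per Mya, assuming all growth is by SD."""
    return rate_per_mya * 1e6 / mean_segment_b

# Initial length 0.2 Mb, final length 3 Gb, about 3.4 Gy of growth.
r = growth_rate(0.2, 3000.0, 3400.0)
print(r, fractional_growth(r, 50.0), sd_event_rate_per_mb(r, 1000.0))
```

With rates of 2.7×10^-3 and 3.7×10^-3 per Mya and segment lengths of 2 kb and 500 b, the last function gives about 1.4 and 7.4 events per Mb per Mya, the range quoted above.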

Cumulative mutation density and mutation rates

The parameter Inline graphic in the RSD model, the cumulative point mutation density, is related to the (per-site per-time) rate density Inline graphic of “point mutations” – including small deletion and insertion but excluding SD – by Inline graphic Inline graphic Inline graphic (Eq. (19), Methods). If we take the best value Inline graphic = 0.73 from the RSD model then Inline graphic = 0.98Inline graphic1.4Inline graphic10Inline graphic/site/Mya. This agrees well with the value Inline graphic Inline graphic1Inline graphic10Inline graphic/site/Mya [37][39] determined by sequence comparison.
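As a hedged check of these numbers, the sketch below assumes that Eq. (19) amounts to taking the mutation rate density as half the product of the cumulative density and the growth rate. The factor of one half is inferred from the statement in Methods that the cumulative number of mutated sites is twice the number of mutation events, and from the fact that it reproduces the quoted values; it is not read off the equation itself. The growth rates are again taken as (2.7 to 3.7)×10^-3 per Mya.

```python
def point_mutation_rate(rho: float, growth_rate_per_mya: float) -> float:
    """Per-site per-Mya point mutation rate, assuming rate = rho * r / 2."""
    return 0.5 * rho * growth_rate_per_mya

# Human genome: rho = 0.73 with the two ends of the assumed growth-rate range.
for r in (2.7e-3, 3.7e-3):
    print(f"{point_mutation_rate(0.73, r):.2e}")
```

Under these assumptions the two rates come out near 0.98×10^-3 and 1.4×10^-3 per site per Mya.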

We cannot assume the E. coli genome is still growing, as the human genome appears to be. Instead, like most bacteria, E. coli probably acquired its full length in antiquity, not too long after ancestors of eubacteria and archaea-eukaria diverged [61]. If we assume E. coli acquired its current length of 4.6 Mb about 0.4 to 0.6 Gya after that, then with Inline graphic as before, we have Inline graphic = 5.4Inline graphic11/Mya, and Inline graphic = 2.0Inline graphic4.0Inline graphic10Inline graphic/site/Mya. Whether fortuitously or because this range of rates represents an equilibrium value, it is compatible with the sequence-comparison E. coli rate of Inline graphic Inline graphic5Inline graphic10Inline graphic/site/Mya based on mutations that (putatively) occurred in the last 0.5 Gya or less [37], [38]. There is some evidence that natural selection does cause genomes to have a relatively low and stable mutation rate. For instance, laboratory-measured spontaneous mutation rates of E. coli [65], C. elegans [65], [66], and Inline graphic [65], [67] tend to be two or three orders of magnitude higher than the characteristic rates of Inline graphic0.001/site/Mya of wild types.

Presumably the same selective force is what causes the Inline graphic's, hence the cumulative mutation density Inline graphic, of coding and non-coding regions of a chromosome to be nearly equal. Such a force must be acting, for otherwise we would expect non-coding regions to have a significantly higher Inline graphic, which is not the case.

Materials and Methods

Complete genome sequences

A total of 865 complete chromosomes were downloaded from the genome database [68] on 2006/10/01. The set is composed of 467 prokaryotic chromosomes (435 eubacteria and 32 archaea) and 398 chromosomes from 28 eukaryotes including: 12 unicellulars (A. fumigatus (8 chromosomes), C. albicans (1), C. glabrata (13), C. neoformans (14), D. hansenii (7), E. cuniculi (11), E. gossypii (7), Kluyveromyces lactis (6), S. cerevisiae (16), S. pombe (3), Y. lipolytica (6), P. falciparum (14)), 5 insects (A. gambiae (3), A. mellifera (16), C. elegans (6), D. melanogaster (4), T. castaneum (10)), 2 plants (A. thaliana (5), O. sativa (12)), 9 vertebrates (B. taurus (30), C. familiaris (39), D. rerio (25), G. gallus (30), H. sapiens (24), M. mulatta (21), M. musculus (21), P. troglodytes (25), R. norvegicus (21)). The complete list of sequences, their accession numbers, lengths and other properties relevant to this study are given in Table S1.

Partition of Inline graphic-mers into Inline graphic-sets

We always speak of single-stranded sequences. We refer to a Inline graphic-base nucleic word as a Inline graphic-mer and denote the set of all Inline graphic Inline graphic Inline graphic types of Inline graphic-mers by Inline graphic. Given a sequence, we count the frequency of occurrence (or frequency) Inline graphic of each Inline graphic-mer-type Inline graphic in Inline graphic using an overlapping sliding window of width Inline graphic and a slide of one [36]. Then the sum of the frequencies is Inline graphic Inline graphic = Inline graphicInline graphic+1, here approximated by Inline graphic, and the mean frequency is Inline graphic = Inline graphic. Let the fractional AT- and CG-content of a sequence be Inline graphic and Inline graphic = 1−Inline graphic, respectively. We say a sequence has an even-base composition when Inline graphic is equal to or very close to 0.5; otherwise it has a biased base composition. Owing to Chargaff's second parity rule [69], Inline graphic is an accurate and efficient classifier of base composition for statistical analysis. The Inline graphic-mers in a sequence are naturally partitioned into Inline graphic+1 “Inline graphic-sets”, Inline graphic, Inline graphic = 0,1,Inline graphic Inline graphic, where each Inline graphic-mer in Inline graphic has Inline graphic and only Inline graphic AT's; Inline graphic. For example, in the case of Inline graphic = 2, Inline graphic is the set {CC, CG, GC, GG}; Inline graphic is the set {CA, CT, GA, GT, AC, AG, TC, TG}; and Inline graphic is the set {AA, AT, TA, TT}. The number of types of Inline graphic-mers in Inline graphic is Inline graphic, which satisfies the sum-rule Inline graphic Inline graphic = Inline graphic = Inline graphic. These relations derive from the binomial expansion (for given Inline graphic)

graphic file with name pone.0009844.e971.jpg (4)
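The counting and partitioning just described can be sketched as follows. The function names are ours, and the closed form checked in the final comment, that the m-th set holds C(k, m)·2^k word types, follows from choosing which m positions carry A or T and then picking one of two bases at every position.

```python
from itertools import product
from math import comb

def m_sets(k: int) -> dict:
    """Group all 4**k k-mers by m, the number of A/T bases they contain."""
    sets = {m: [] for m in range(k + 1)}
    for letters in product("ACGT", repeat=k):
        word = "".join(letters)
        m = sum(base in "AT" for base in word)
        sets[m].append(word)
    return sets

def count_kmers(seq: str, k: int) -> dict:
    """Overlapping sliding-window counts (window width k, slide of one)."""
    counts = {}
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        counts[word] = counts.get(word, 0) + 1
    return counts

sizes = [len(words) for words in m_sets(5).values()]
# Each size equals comb(5, m) * 2**5, and the sizes add up to 4**5.
print(sizes, sum(sizes), [comb(5, m) * 2 ** 5 for m in range(6)])
```

For k = 2 the three sets returned by `m_sets` are the ones listed in the text.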

Let Inline graphic = Inline graphic Inline graphic be the sum frequency of the Inline graphic-mers in Inline graphic. Then Inline graphic Inline graphic = Inline graphic and the mean frequency of the Inline graphic-mers in Inline graphic is Inline graphic = Inline graphic. The large-Inline graphic limit of Inline graphic for a random sequence, Inline graphic, is obtained from the binomial expansion

graphic file with name pone.0009844.e987.jpg (5)

That is,

graphic file with name pone.0009844.e988.jpg (6)

Depending on Inline graphic, Inline graphic can vary widely, all collapsing to Inline graphic when Inline graphic = 0.5. Eq. (6) not only provides a highly accurate estimate of the value of Inline graphic for genome-size random sequences, it also gives a reasonable estimate for genomic Inline graphic (Table 6).

Table 6. Average frequency of occurrence (Inline graphic) of 5-mers in Inline graphic Inline graphic0.5 and Inline graphic Inline graphic0.7 sequence.

Inline graphic
Sequence (Inline graphic = ) 0 1 2 3 4 5
Inline graphic
E. coli 2509 2245 1877 1760 1944 2656
Random 2101 2044 1987 1922 1857 1795
Inline graphic RandomInline graphic 2114 2048 1983 1920 1860 1801
Inline graphic
C. acetobutylicum 154 397 918 1951 4272 10300
Random 176 394 882 1970 4400 9832
Inline graphic RandomInline graphic 176 393 880 1968 4402 9845

All sequences normalized to a length of 2 Mb; Inline graphic = 2Inline graphic10Inline graphic/4Inline graphic = 1953. Random means matching random sequence, or sequence obtained by scrambling the genome. Inline graphicValues of Inline graphic given by Eq. (6).
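The rows of Table 6 computed from Eq. (6) can be reproduced to within rounding under one assumption about its form, which is not legible here: that in a long random sequence each A or T occurs with probability p/2 and each C or G with probability q/2, so a k-mer with m A/T bases has expected frequency L·(p/2)^m·(q/2)^(k−m). The function name is illustrative.

```python
def mean_frequency_limit(length: float, k: int, p: float, m: int) -> float:
    """Expected frequency of one k-mer with m A/T bases in a long random
    sequence of the given length and AT fraction p (assumed form of Eq. (6))."""
    q = 1.0 - p
    return length * (p / 2.0) ** m * (q / 2.0) ** (k - m)

# 5-mers in a 2 Mb sequence at the E. coli and C. acetobutylicum compositions.
for p in (0.492, 0.691):
    print(p, [round(mean_frequency_limit(2e6, 5, p, m)) for m in range(6)])
```

At p = 0.5 every m gives the same value, L/4^k, which is the collapse described in the text.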

Fluctuation in occurrence frequency

The coefficient of variation of the frequency distribution is Inline graphic = Inline graphic, where Inline graphic is the standard deviation. For random events of equal probability, here translated to Inline graphic-mer frequencies of a (long) random sequence with even-base composition, the distribution is Poisson and Inline graphic = Inline graphic, hence Inline graphic = Inline graphic = Inline graphic, which tends to zero in the large-Inline graphic limit. This no longer holds when the random sequence has a biased base composition. As controls we consider random sequences that match genomes, namely those whose lengths and base compositions are the same as their genomic counterparts. In particular, such sequences obey Chargaff's second parity rule [69] in that their A and T, and C and G, separately have nearly equal probabilities. For any sequence whose Inline graphic-mers are partitioned into Inline graphic-sets, using a generalization of the parallel axis theorem, we write as follows:

graphic file with name pone.0009844.e1026.jpg (7)

The second term vanishes upon summing over Inline graphic, so Inline graphic is composed of two parts,

graphic file with name pone.0009844.e1029.jpg (8)

a non-fluctuating part determined by average frequencies Inline graphic and Inline graphic,

graphic file with name pone.0009844.e1032.jpg (9)

and a fluctuating part determined by the fluctuation of Inline graphic (in an Inline graphic-set) around an average frequency,

graphic file with name pone.0009844.e1035.jpg (10)

Thus,

graphic file with name pone.0009844.e1036.jpg (11)
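The decomposition can be verified numerically. The sketch below assumes the natural reading of Eqs. (8) to (10): the total variance of the k-mer frequencies is the sum of a between-set part (each m-set mean against the overall mean, weighted by set size) and a within-set part (each frequency against its own m-set mean), both divided by the number of k-mer types. The names and the test sequence are illustrative.

```python
import random
from itertools import product

def kmer_counts(seq, k):
    """Overlapping sliding-window k-mer counts."""
    counts = {}
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        counts[word] = counts.get(word, 0) + 1
    return counts

def decompose_variance(seq, k):
    """Return (total, non-fluctuating, fluctuating) variance of k-mer frequencies."""
    tau = 4 ** k
    counts = kmer_counts(seq, k)
    groups = {}  # m -> frequencies of all k-mer types with m A/T bases
    for letters in product("ACGT", repeat=k):
        word = "".join(letters)
        m = sum(base in "AT" for base in word)
        groups.setdefault(m, []).append(counts.get(word, 0))
    all_freqs = [f for fs in groups.values() for f in fs]
    mean = sum(all_freqs) / tau
    total = sum((f - mean) ** 2 for f in all_freqs) / tau
    nonfluct = 0.0
    fluct = 0.0
    for fs in groups.values():
        set_mean = sum(fs) / len(fs)
        nonfluct += len(fs) * (set_mean - mean) ** 2 / tau
        fluct += sum((f - set_mean) ** 2 for f in fs) / tau
    return total, nonfluct, fluct

# A random sequence with AT fraction 0.7 (fixed seed for reproducibility).
rng = random.Random(0)
biased = "".join(
    rng.choice("AT") if rng.random() < 0.7 else rng.choice("CG")
    for _ in range(20000)
)
total, nonfluct, fluct = decompose_variance(biased, 3)
print(total, nonfluct, fluct)
```

For this biased random sequence the non-fluctuating part dominates, which is the situation the following paragraphs describe.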

The non-fluctuating, or “non-statistical”, part, Inline graphic, has a well-defined value in the large-Inline graphic limit, obtained by replacing Inline graphic by Inline graphic in Eq. (9):

graphic file with name pone.0009844.e1041.jpg (12)

which has a strong dependence on Inline graphic and vanishes at Inline graphic = 0.5. Because genomes are large, Inline graphic gives an accurate description of Inline graphic for genome-size random sequences; it also happens to do almost as well for genomes (Fig. 1). Owing to the existence of this term, the Inline graphic for a genomic sequence may be much greater than that of its matching random sequence (when Inline graphic Inline graphic0.5; see, e.g., Fig. 9 (A)), or quite similar (when Inline graphic differs significantly from 0.5; see, e.g., Fig. 9 (B)). Because Inline graphic hardly depends on the distribution of the Inline graphic-mers, it should be considered a background in Inline graphic in relation to the signal, which is Inline graphic.

Figure 9. Frequency distributions of 5-mers.

Figure 9

Frequency occurrence distributions, or spectra, of 5-mers from the genomes of two prokaryotes, (A) E. coli (with (A+T) content Inline graphic Inline graphic0.5) and (B) C. acetobutylicum (Inline graphic Inline graphic0.7), normalized to a sequence length of 2 Mb. Abscissa give occurrence frequency and ordinates give number of 5-mers averaged, for better viewing, over a range of 21 frequencies to reduce fluctuation. The black, green and red curves represent spectra of the complete genomes, the randomized genome sequences and sequences generated in a model (see text), respectively. (C) Details of the m = 2 subspectra from (B).

For a random sequence, the frequency distribution in the subset Inline graphic is nearly Poisson, hence Inline graphic Inline graphic Inline graphic in the large-Inline graphic limit. Therefore, from Eq. (10),

graphic file with name pone.0009844.e1063.jpg (13)

which is exactly the limit expected of Inline graphic for an even-base (Inline graphic = 0.5) random sequence. In other words, for random sequences Inline graphic, but not Inline graphic, has the correct large-Inline graphic limit expected of a random system. The right-hand side does not depend on Inline graphic, which is a reflection of the fact that for genomic as well as random sequences, Inline graphic has at most a weak Inline graphic-dependence, the main Inline graphic-dependence having been removed when Inline graphic is subtracted from Inline graphic. Because (for random sequences) Inline graphic decreases with increasing Inline graphic but Inline graphic does not, there is a crossover value of Inline graphic beyond which Inline graphic becomes the leading term in Inline graphic (when Inline graphic Inline graphic0.5). When Inline graphic = 0.7, this crossover value is 42, 316 and 2851 (bases) for Inline graphic = 2, 4, and 6, respectively, which are orders of magnitude shorter than even the smallest chromosomes. To summarize, if one wants to compare the statistical properties in the frequency distributions of Inline graphic-mers in genomic and random sequences, one must use Inline graphic, not Inline graphic.
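The crossover lengths can be estimated with a short calculation. The sketch assumes that the large-length non-fluctuating term of Eq. (12), expressed as a squared coefficient of variation, is the size-weighted sum over m-sets of (2^k·p^m·q^(k−m) − 1)^2, and that the fluctuating term for a random sequence is 4^k/L as in Eq. (13); the crossover is where the two are equal. Both forms are reconstructions, and the function names are ours.

```python
import math
from math import comb

def nonfluctuating_cv2(k: int, p: float) -> float:
    """Squared coefficient of variation of the non-fluctuating part in the
    large-length limit (assumed form of Eq. (12) divided by the mean squared)."""
    q = 1.0 - p
    tau = 4 ** k
    total = 0.0
    for m in range(k + 1):
        weight = comb(k, m) * 2 ** k / tau      # fraction of k-mer types in set m
        ratio = 2 ** k * p ** m * q ** (k - m)  # set mean over overall mean
        total += weight * (ratio - 1.0) ** 2
    return total

def crossover_length(k: int, p: float) -> float:
    """Length at which 4**k / L equals the non-fluctuating term."""
    cv2 = nonfluctuating_cv2(k, p)
    return math.inf if cv2 == 0.0 else 4 ** k / cv2

for k in (2, 4, 6):
    print(k, round(crossover_length(k, 0.7), 1))
```

The sum collapses to the closed form (2(p^2 + q^2))^k − 1. With p = 0.7 the sketch gives about 316 and 2852 bases for k = 4 and 6, in line with the quoted values; for k = 2 it gives about 46 rather than 42, so the quoted k = 2 figure may rest on a slightly different definition. For 5-mers the same function gives about 0.976 at p = 0.691 and 0.0013 at p = 0.492.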

Two examples: E. coli and C. acetobutylicum

We explain the formulation presented in the last two sections by presenting results of distributions, or spectra, of frequency of 5-mers (as an example), and values of quantities such as Inline graphic, Inline graphic, and Inline graphic for two genomes with very different base compositions: E. coli (Inline graphic = 0.492) and C. acetobutylicum (Inline graphic = 0.691). Here, a spectrum is the number of Inline graphic-mers plotted against occurrence frequency. The spectra for the two genomes are shown as black curves in panels (A) and (B) of Fig. 9. The solid green curves characterized by narrow peaks are the spectra for random sequences obtained by scrambling the genomes. (The red curves are for sequences generated in the RSD model, see text.) In (A) the mean frequency of both spectra is Inline graphic = 2Inline graphic10Inline graphic/4Inline graphic = 1953. However, the genomic spectrum is seen to be much broader than the random-sequence spectrum, indicating that whereas in the random sequence frequencies (Inline graphic) of individual 5-mers deviate little from the mean (Inline graphic), in the genomic sequence that is not the case; frequencies of individual 5-mers fluctuate widely around the mean. Drastically different from (A), the overall widths of genome and random-sequence spectra in (B) are similar. Instead of having a single peak, the random-sequence spectrum is composed of six widely spread narrow subspectra whose peaks are near the theoretical mean frequencies (for Inline graphic = 0.7) of the Inline graphic-sets, Inline graphic Inline graphic152, 354, 827, 1930, 4500, 10500, for Inline graphic = 0 to 5, respectively. Eq. (6) shows that these mean values are determined by Inline graphic and the base composition of the sequence, or Inline graphic, and do not depend on the fluctuation of frequencies of Inline graphic-specific 5-mers. Panels (B) and (C) of Fig. 9 show that in the random sequence frequency fluctuation within an Inline graphic-set is again small. In contrast, and just as in (A), frequency fluctuations of Inline graphic-specific 5-mers in the genomic sequence are large (Fig. 9 (C) and Fig. 10 [70]).

Figure 10. Frequency distributions of 5-mers in Inline graphic-sets.

Figure 10

Details of Inline graphic = 5, Inline graphic-specific subspectra from the C. acetobutylicum genome (broken green curves) and matching random sequence (solid green curves); black curve is the same as in (B) Fig. 9. The five narrow subspectra peak (approximately) at Inline graphic, Inline graphic = 0 to 4, or at 152, 354, 827, 1930, 4500, respectively; the Inline graphic = 5 peak at 10500 is off scale (see Fig. 9 (B)).

Table 6 shows that Inline graphic gives a very accurate estimate of Inline graphic for random sequences and a fair one for genomic sequences. In the Inline graphic = 0.492 case, the relation Inline graphic Inline graphic Inline graphic for all the Inline graphic's explains the narrowness of the random spectrum in Fig. 9 (A): like its counterpart in (B), it is also composed of six subspectra, but unlike (B) whose subspectra are spread widely, now the subspectra are superimposed. Table 7 highlights important aspects of our formulation: (i) Inline graphic has a strong dependence on Inline graphic but not on whether a sequence is genomic or random; (ii) Inline graphic gives an excellent estimate of Inline graphic for random sequences, and a fair estimate for genomes; (iii) Inline graphic depends weakly on Inline graphic but strongly on whether a sequence is genomic (relatively large value) or random (several orders of magnitude smaller, and much smaller than Inline graphic except when Inline graphic Inline graphic0.5); (iv) for random sequences Eq. (13) is a fairly accurate relation.

Table 7. Values of Inline graphic's from 5-mers in Inline graphic Inline graphic0.5 and Inline graphic Inline graphic0.7 sequences.

Inline graphic (in units of 10Inline graphic) Inline graphic Inline graphic Inline graphic
Sequence (Inline graphic = ) 0 1 2 3 4 5
Inline graphic
E. coli 144 141 74.2 58.4 66.4 83.7 0.212 0.013 Inline graphic
Random .174 0.203 0.185 0.177 0.144 0.110 4.6Inline graphic10Inline graphic 0.0012 0.0013
Inline graphic
C. acetobutylicum 0.60 6.95 26.1 65.4 97.1 336 0.145 1.00 Inline graphic
Random 0.011 0.038 0.102 0.218 0.500 1.24 5.8Inline graphic10Inline graphic 0.969 0.976

All sequences normalized to a length of 2 Mb; for Inline graphic = 5, Inline graphic = 1953, Inline graphic = 1024, and Inline graphic = 32, 160, 320, 320, 160, 32, for Inline graphic = 0 to 5.

Equivalent length

The Inline graphic-mer equivalent length of a sequence is defined as

graphic file with name pone.0009844.e1157.jpg (14)

where Inline graphic is given by the frequency distribution of Inline graphic-mers. Recalling that for a random sequence Inline graphic is inversely proportional to sequence length (Eq. (13)), we see that Inline graphic is the length of a random sequence whose Inline graphic has the same value as that of the genome. The empirical factor Inline graphic = 1−Inline graphic, instead of the theoretical binomial factor 1Inline graphic Inline graphic, is used to ensure that for a random sequence, regardless of base composition, Inline graphic approximates the true sequence length with a high degree of accuracy. With the signal term Inline graphic included but the strongly Inline graphic-dependent background term Inline graphic excluded in its definition, Inline graphic is expected to have at most a weak Inline graphic-dependence. That is, Inline graphic is a quantity with which we can compare on the same footing genomes with widely disparate base compositions.
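A minimal sketch of this definition follows. Because Eq. (14) is not legible here, the sketch sets the empirical factor to unity (the same approximation used in Results when deriving Eq. (3)) and takes the equivalent length to be 4^k times the squared mean frequency divided by the fluctuating variance; this is an assumed simplification, not the exact formula. The function name is illustrative.

```python
import random
from itertools import product

def equivalent_length(seq: str, k: int) -> float:
    """Approximate k-mer equivalent length: the length of a random sequence
    with the same relative within-m-set fluctuation (empirical factor set to 1)."""
    tau = 4 ** k
    counts = {}
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        counts[word] = counts.get(word, 0) + 1
    mean = (len(seq) - k + 1) / tau
    groups = {}  # m -> frequencies of all k-mer types with m A/T bases
    for letters in product("ACGT", repeat=k):
        word = "".join(letters)
        m = sum(base in "AT" for base in word)
        groups.setdefault(m, []).append(counts.get(word, 0))
    var_fl = 0.0
    for fs in groups.values():
        set_mean = sum(fs) / len(fs)
        var_fl += sum((f - set_mean) ** 2 for f in fs) / tau
    return tau * mean ** 2 / var_fl

rng = random.Random(0)
x = "".join(rng.choice("ACGT") for _ in range(100000))
le_single = equivalent_length(x, 4)
le_doubled = equivalent_length(x + x, 4)  # replicate the sequence once
print(len(x), round(le_single), round(le_doubled))
```

For the random sequence the result should be of the order of the true length, and doubling the sequence by replication should leave it essentially unchanged, which is the behaviour the Results section relies on when arguing that duplication shortens the equivalent length relative to the actual length.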

Genic, non-genic, exon, and intron concatenates

These various concatenates are formed by splicing corresponding sections from a single strand of the DNA sequence and then stitching the sections together in the order and orientation they appear in the sequence. In particular, the genic and exon concatenates include genetic codes in positive and negative orientations.

Similarity index and similarity matrix

Given a pair of equal-length sequences Inline graphic and Inline graphic, the similarity index Inline graphic for the pair is defined as

graphic file with name pone.0009844.e1177.jpg (15)

where Inline graphic is an Inline graphic-set and Inline graphic is the variance of the frequency of the Inline graphic-mers in Inline graphic. The pair are similar (in Inline graphic-mer-content) when Inline graphic Inline graphic1, are (considered to be) identical when Inline graphic = 0, and are highly dissimilar when Inline graphic Inline graphic1. If we divide Inline graphic and Inline graphic into (possibly overlapping) segments {Inline graphic,Inline graphic,Inline graphic} and {Inline graphic,Inline graphic,Inline graphic}, respectively, then we call the matrix whose element (Inline graphic,Inline graphic) is valued Inline graphic a similarity matrix. In Fig. 6, similarity matrices are displayed as similarity plots by color coding elements of similarity matrices.

Minimum RSD model for genome growth

We denote by L the designated length of a sequence and p the designated AT-fraction of the sequence. We call the pair (L, p) the profile of a sequence; in our model, the two profiles (L, p) and (L, 1−p) are mathematically equivalent. By a growth model we mean a computer algorithm for generating, from an initial sequence, a target sequence that has a given profile and other specific genome-like attributes. Ours is a model of random segmental duplication (RSD) [49] in which the three main steps are: (i) randomly select a site from the sequence; (ii) from that site cull a segment of random length (but from a given length distribution) for duplication; (iii) reinsert the duplicated segment into the sequence at a (second) randomly selected site. The model has three explicit parameters: L0, the initial sequence length; <d>, the average length of duplicated segments; r, the cumulative point mutation density (replacement only), or number of mutations per site. The generation of a model sequence involves three stages: selection of an initial sequence, growth by RSD, and point mutation. An initial sequence (of length L0) is chosen such that it has the target value p but is otherwise random. The lengths d of the duplicated segments are selected with uniform probability within the range 1 to 2<d>, unless the current length of the genome is less than 2<d>, in which case d is selected from within the range 1 to the current length. Growth is stopped when the length of the sequence exceeds the target length for the first time. Point mutations have a base bias defined by p and are administered after the growth is complete.
That is, the administration of point mutations on the sequence is not meant to emulate point mutations suffered by a genome during its growth. Rather, r is meant to indicate the average cumulative number of point mutations per site experienced by the genome throughout its life. Because RSD causes drifts in base composition, the generated sequence will have a profile that is a close approximation of, but not exactly equal to, the target profile.
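The three stages described above can be sketched in Python as follows. This is a minimal illustration, not the authors' program, and several details are assumptions not stated in the text: a segment that would run past the end of the sequence is truncated; the number of point mutations is taken as round(r × final length), with sites drawn with replacement; and a mutated site receives a base drawn from the biased composition (A or T with total probability p), which may coincide with the base already there. The names `rsd_sequence`, `random_base`, `p_at`, `l0`, and `d_mean` are hypothetical.

```python
import random

def random_base(p_at, rng):
    """Draw one base: A or T with total probability p_at, otherwise C or G."""
    if rng.random() < p_at:
        return rng.choice("AT")
    return rng.choice("CG")

def rsd_sequence(target_len, p_at, l0=64, d_mean=1000, r=0.73, seed=None):
    """Generate a model sequence by random segmental duplication (sketch)."""
    rng = random.Random(seed)

    # Stage 1: random initial sequence of length l0 with AT-fraction near p_at.
    seq = [random_base(p_at, rng) for _ in range(l0)]

    # Stage 2: grow until the length first exceeds the target length.
    while len(seq) <= target_len:
        n = len(seq)
        # Segment length is uniform on 1..2*d_mean, or 1..n while n < 2*d_mean.
        max_d = 2 * d_mean if n >= 2 * d_mean else n
        d = rng.randint(1, max_d)
        start = rng.randrange(n)            # (i) random site
        segment = seq[start:start + d]      # (ii) copy; truncated at the end (assumption)
        insert_at = rng.randrange(n + 1)    # (iii) second random site
        seq[insert_at:insert_at] = segment

    # Stage 3: replacement-only point mutations applied after growth.
    n = len(seq)
    for _ in range(round(r * n)):
        seq[rng.randrange(n)] = random_base(p_at, rng)

    return "".join(seq)
```

List insertion costs time proportional to the sequence length, so this sketch is suited to short test sequences; generating megabase-scale sequences would call for a more efficient data structure.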

Mutation rates

We derive formulas for computing the rate density, or per site rate, of duplication events, Inline graphic, and the rate density of “point mutation” – including small deletion and insertion but excluding SD – events, Inline graphic. If the genome grows from time Inline graphic to time Inline graphic at a rate proportional to its length Inline graphic, that is, Inline graphic = Inline graphic where Inline graphic is the event rate (number of events per unit of time), then

graphic file with name pone.0009844.e1229.jpg (16)

If the growth is purely by SD and the average length of the duplicated segment is Inline graphic, then

graphic file with name pone.0009844.e1231.jpg (17)

If Inline graphic is the cumulative number of point mutations, then Inline graphic = Inline graphic. In SD-dominated growth, the effect of point mutation on the overall length of a genome is negligible, so integrating the relation yields

graphic file with name pone.0009844.e1235.jpg (18)

For any Inline graphic such that Inline graphic Inline graphic Inline graphic, Inline graphic = Inline graphic. The cumulative number of mutation sites is greater than Inline graphic because mutation sites are copied during SD. The number of copied mutation sites satisfies Inline graphic = Inline graphic Inline graphic Inline graphic (for large Inline graphic). Therefore Inline graphic Inline graphic Inline graphic, that is, the cumulative number of mutated sites is twice Inline graphic. At full genome length Inline graphic, this number is Inline graphic, hence

graphic file with name pone.0009844.e1254.jpg (19)
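The chain of reasoning above can be sketched in LaTeX under stated assumptions. The symbols here are illustrative choices, not necessarily the paper's notation: λ is the constant of proportionality in the growth law, ρ_SD and ρ_pm are the per-site rates of duplication and point-mutation events, d̄ is the average duplicated segment length, N is the cumulative number of point mutations, and L_0 is the length at time t_0. The last line takes at face value the statement in the text that the cumulative number of mutated sites is twice the number of point mutations, and assumes this number equals rL at full length, where r is the cumulative mutation density.

```latex
\begin{align}
\frac{dL}{dt} = \lambda L
  &\;\Rightarrow\; L(t) = L_0\, e^{\lambda (t - t_0)},
  \qquad \lambda = \frac{\ln (L/L_0)}{t - t_0}, \\
\lambda = \rho_{SD}\, \bar{d}
  &\;\Rightarrow\; \rho_{SD} = \frac{\ln (L/L_0)}{\bar{d}\,(t - t_0)}, \\
\frac{dN}{dt} = \rho_{pm} L
  &\;\Rightarrow\; N(t) = \frac{\rho_{pm}}{\lambda}\,(L - L_0)
  \approx \frac{\rho_{pm}}{\lambda}\, L \quad (L \gg L_0), \\
r L \approx 2N
  &\;\Rightarrow\; \rho_{pm} \approx \frac{r \lambda}{2}
  = \frac{r \ln (L/L_0)}{2\,(t - t_0)}.
\end{align}
```

The second line assumes that each duplication event adds d̄ bases on average, so that the growth constant is the per-site event rate times the mean segment length.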

Supporting Information

Figure S1

Category Le for coding and non-coding parts. Averages of p (fractional A/T-content) and Le for k = 7 (situations for other ks are similar) for the coding parts (solid symbols; ex for eukaryotes and gn for prokaryotes) and non-coding parts (hollow symbols; in for eukaryotes and ig for prokaryotes) of chromosomes. Symbols for categories are: vertebrates, red (square); unicellulars, blue (triangle-up); insects, orange (triangle-down); plants, green; prokaryotes, gray (bullet/circle). Numeral indicates number of chromosomes in each category. The curve represents Le for the universality class: Le{uc}(k; p).

(0.26 MB TIF)

Figure S2

Distributions of χ² versus L and p. Each symbol gives the χ² for one chromosomal Le. Top panels, for genic (gn) and exon (ex) concatenates. Bottom panels, for intergenic (ig) and intron (in) concatenates. Symbols, with color, number of data in group, and number of data whose χ² is less than 10⁻³ given in brackets, stand for: diamond, gn (blue; 7100; 229); square, ex (red; 2844; 95); triangle-down, ig (green; 6377; 270); triangle-up, in (orange; 2960; 104).

(0.77 MB TIF)

Figure S3

Results from minimal RSD model. Top-left: Equi-χ² contour as function of r and d, with L0 = 64 (bases); length (L) of generated model sequence is 2 Mb and only Le(k) results for k = 7 are used. Top-right: Le(k), k = 2, 4, 6, 8, 10 from 200 model sequences generated using the “best” parameters L0 = 64, <d> = 1000 (b) and r = 0.73 (cumulative point mutations per base). The lines are Le{uc}(k; p) that represent the universality class given in the main text. The χ² for the model sequences is 0.18. Bottom-left: χ² versus L0 (otherwise best parameters); model sequences have L = 2 Mb and p = 0.5. Bottom-right: Le versus L, for a p = 0.5 model sequence generated using the best parameters.

(1.17 MB TIF)

Table S1

List of complete sequences included in the study (20 pp).

(0.13 MB PDF)

Table S2

Equivalent lengths of complete sequences (100 pp).

(0.36 MB PDF)

Table S3

Le(k), k = 2 to 10, averaged over categories of organisms.

(0.06 MB PDF)

Table S4

Le of sequences with highly biased compositions.

(0.06 MB PDF)

Table S5

Effect of replication and segmental duplication on le.

(0.04 MB PDF)

Text S1

(0.07 MB PDF)

Footnotes

Competing Interests: The authors have declared that no competing interests exist.

Funding: This work was funded by the National Science Council (ROC) (http://web1.nsc.gov.tw/mp.aspx?mp=7), Cathay General Hospital (http://www.cgh.org.tw/en/index.html), and National Central University (http://www.ncu.edu.tw/e_web/index.php). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. Nei M, Li WH. Mathematical model for studying genetic variation in terms of restriction endonucleases. Proc Natl Acad Sci U S A. 1979;76:5269–5273. doi: 10.1073/pnas.76.10.5269.
  • 2. Li WH. Molecular Evolution. Sunderland, MA: Sinauer Associates; 1997.
  • 3. Ohno S. Evolution by Gene Duplication. Berlin: Springer-Verlag; 1970.
  • 4. Hansche PE, Beres V, Lange P. Gene duplication in Saccharomyces cerevisiae. Genetics. 1978;88:673–687.
  • 5. Yamanaka K, Fang L, Inouye M. The CspA family in Escherichia coli: multiple gene duplication for stress adaptation. Mol Microbiol. 1998;27(2):247–255. doi: 10.1046/j.1365-2958.1998.00683.x.
  • 6. Lynch M, Conery JS. The evolutionary fate and consequences of duplicate genes. Science. 2000;290:1151–1155. doi: 10.1126/science.290.5494.1151.
  • 7. Gu Z, Steinmetz LM, Gu X, Scharfe C, Davis RW, et al. Role of duplicate genes in genetic robustness against null mutations. Nature. 2003;421:63–66. doi: 10.1038/nature01198.
  • 8. Zhang J. Evolution by gene duplication: an update. Trends Ecol Evol. 2003;18(6):292–298.
  • 9. Lewin B. Genes VII. Oxford Univ Press; 2000. pp. 89–115.
  • 10. Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, et al. Initial sequencing and analysis of the human genome. Nature. 2001;409:860–921. doi: 10.1038/35057062.
  • 11. Venter JC, Adams MD, Myers EW, Li PW, Mural RJ, et al. The sequence of the human genome. Science. 2001;291:1304–1351. doi: 10.1126/science.1058040.
  • 12. Kleckner N. Transposable elements in prokaryotes. Ann Rev Genet. 1981;15:341–404. doi: 10.1146/annurev.ge.15.120181.002013.
  • 13. Castilho BA, Olfson P, Casadaban MJ. Plasmid insertion mutagenesis and lac gene fusion with mini-mu bacteriophage transposons. J Bacteriol. 1984;158(2):488–495. doi: 10.1128/jb.158.2.488-495.1984.
  • 14. Levis RW, Ganesan R, Houtchens K, Tolar LA, Sheen FM. Transposons in place of telomeric repeats at a Drosophila telomere. Cell. 1993;75(6):1083–1093. doi: 10.1016/0092-8674(93)90318-k.
  • 15. Li WH, Gojobori T, Nei M. Pseudogenes as a paradigm of neutral evolution. Nature. 1981;292:237–239. doi: 10.1038/292237a0.
  • 16. Vanin EF. Processed pseudogenes: Characteristics and evolution. Annu Rev Genet. 1985;19:253–272. doi: 10.1146/annurev.ge.19.120185.001345.
  • 17. Weiner AM, Deininger PL, Efstratiadis A. Nonviral retroposons: genes, pseudogenes, and transposable elements generated by the reverse flow of genetic information. Annu Rev Biochem. 1986;55:631–661. doi: 10.1146/annurev.bi.55.070186.003215.
  • 18. Bensasson D, Zhang DX, Hartl DL, Hewitt GM. Mitochondrial pseudogenes: evolution's misplaced witnesses. Trends Ecol Evol. 2001;16(6):314–321. doi: 10.1016/s0169-5347(01)02151-6.
  • 19. McGrath JM, Jancso MM, Pichersky E. Duplicate sequences with a similarity to expressed genes in the genome of Arabidopsis thaliana. Theor Appl Genet. 1993;86:880–888. doi: 10.1007/BF00212616.
  • 20. Bailey JA, Yavor AM, Massa HF, Trask BJ, Eichler EE. Segmental duplications: Organization and impact within the current human genome project assembly. Genome Res. 2001;11:1005–1017. doi: 10.1101/gr.187101.
  • 21. Bailey JA, Gu Z, Clark RA, Reinert K, Samonte RV, et al. Recent segmental duplications in the human genome. Science. 2002;297:1003–1007. doi: 10.1126/science.1072047.
  • 22. Sharp AJ, Locke DP, McGrath SD, Cheng Z, Bailey JA, et al. Segmental duplications and copy-number variation in the human genome. Am J Human Genet. 2005;77:78–88. doi: 10.1086/431652.
  • 23. Gaut BS, Doebley JF. DNA sequence evidence for the segmental allotetraploid origin of maize. Proc Natl Acad Sci U S A. 1997;94:6809–6814. doi: 10.1073/pnas.94.13.6809.
  • 24. Gale MD, Devos KM. Comparative genetics in the grasses. Proc Natl Acad Sci U S A. 1998;95:1971–1974. doi: 10.1073/pnas.95.5.1971.
  • 25. Mochizuki K, Fine NA, Fujisawa T, Gorovsky MA. Analysis of a piwi-related gene implicates small RNAs in genome rearrangement in Tetrahymena. Cell. 2002;110:689–699. doi: 10.1016/s0092-8674(02)00909-1.
  • 26. Coghlan A, Wolfe KH. Fourfold faster rate of genome rearrangement in nematodes than in Drosophila. Genome Res. 2002;12:857–867. doi: 10.1101/gr.172702.
  • 27. Pevzner P, Tesler G. Genome rearrangements in mammalian evolution: Lessons from human and mouse genomes. Genome Res. 2003;13:37–45. doi: 10.1101/gr.757503.
  • 28. Grant D, Cregan P, Shoemaker RC. Genome organization in dicots: Genome duplication in Arabidopsis and synteny between soybean and Arabidopsis. Proc Natl Acad Sci U S A. 2000;97:4168–4173. doi: 10.1073/pnas.070430597.
  • 29. Spring J. Genome duplication strikes back. Nat Genet. 2002;31:128–129. doi: 10.1038/ng0602-128.
  • 30. Kellis M, Birren BW, Lander ES. Proof and evolutionary analysis of ancient genome duplication in the yeast Saccharomyces cerevisiae. Nature. 2004;428:617–624. doi: 10.1038/nature02424.
  • 31. Peng CK, Buldyrev SV, Goldberger AL, Havlin S, Simons M, et al. Finite-size effects on long-range correlations: Implications for analyzing DNA sequences. Phys Rev E. 1993;47:3730–3733. doi: 10.1103/physreve.47.3730.
  • 32. Mantegna RN, Buldyrev SV, Goldberger AL, Havlin S, Peng CK, et al. Linguistic features of noncoding DNA sequences. Phys Rev Lett. 1994;73:3169–3172. doi: 10.1103/PhysRevLett.73.3169.
  • 33. Forsdyke D. Relative roles of primary sequence and (G+C)% in determining the hierarchy of frequencies of complementary trinucleotide pairs in DNAs of different species. J Mol Evol. 1995;41:573–581. doi: 10.1007/BF00175815.
  • 34. Karlin S, Mrazek J. Compositional differences within and between eukaryotic genomes. Proc Natl Acad Sci U S A. 1997;94:10227–10232. doi: 10.1073/pnas.94.19.10227.
  • 35. Deschavanne PJ, Giron A, Vilain J, Fagot G, Fertil B. Genomic signature: characterization and classification of species assessed by chaos game representation of sequences. Mol Biol Evol. 1999;16(10):1391–1399. doi: 10.1093/oxfordjournals.molbev.a026048.
  • 36. Hao BL, Lee HC, Zhang SY. Fractals related to long DNA sequences and complete genomes. Chaos, Solitons and Fractals. 2000;11:825–836.
  • 37. Ochman H, Elwyn S, Moran NA. Calibrating bacterial evolution. Proc Natl Acad Sci U S A. 1999;96:12638–12643. doi: 10.1073/pnas.96.22.12638.
  • 38. Nachman MW, Crowell SL. Estimate of the mutation rate per nucleotide in humans. Genetics. 2000;156:297–304. doi: 10.1093/genetics/156.1.297.
  • 39. Liu G, Program NCS, Zhao S, Bailey JA, Sahinalp SC, et al. Analysis of primate genomic variation reveals a repeat-driven expansion of the human genome. Genome Res. 2003;13:358–368. doi: 10.1101/gr.923303.
  • 40. Voss RF. Comment on “Linguistic features of noncoding DNA sequences”. Phys Rev Lett. 1996;76:1978. doi: 10.1103/PhysRevLett.76.1978.
  • 41. Bonhoeffer S, Herz AV, Boerlijst MC, Nee S, Nowak MA, et al. No signs of hidden language in noncoding DNA. Phys Rev Lett. 1996;76:1977. doi: 10.1103/PhysRevLett.76.1977.
  • 42. Israeloff NE, Kagalenko M, Chan K. Can Zipf distinguish language from noise in noncoding DNA? Phys Rev Lett. 1996;76:1976. doi: 10.1103/PhysRevLett.76.1976.
  • 43. Mantegna RN, Buldyrev SV, Goldberger AL, Havlin S, Peng CK, et al. Mantegna et al. reply. Phys Rev Lett. 1996;76:1979–1981. doi: 10.1103/PhysRevLett.76.1979.
  • 44. Peng CK, Buldyrev SV, Havlin S, Simons M, Stanley HE, et al. Mosaic organization of DNA nucleotides. Phys Rev E. 1994;49:1685–1689. doi: 10.1103/physreve.49.1685.
  • 45. Bernaola-Galván P, Carpena P, Roman-Roldan R, Oliver JL. Study of statistical correlations in DNA sequences. Gene. 2002;300:105–115. doi: 10.1016/s0378-1119(02)01037-5.
  • 46. Messer PW, Arndt PF, Lassig M. Solvable sequence evolution models and genomic correlations. Phys Rev Lett. 2005;94:138103. doi: 10.1103/PhysRevLett.94.138103.
  • 47. Fickett JW, Torney DC, Wolf DR. Base compositional structure of genomes. Genomics. 1992;13:1056–1064. doi: 10.1016/0888-7543(92)90019-o.
  • 48. Xie HM, Hao BL. Visualization of k-tuple distribution in procaryote complete genomes and their randomized counterparts. Proceedings of the IEEE Computer Society Bioinformatics Conference. 2002:31–42.
  • 49. Hsieh LC, Luo LF, Lee HC. Genomes are large systems with small-system statistics: Segmental duplication in the growth of microbial chromosomes. AAPPS Bulletin. 2003;13:22–27.
  • 50. Chen TY, Hsieh LC, Lee HC. Shannon information and self-similarity in complete chromosomes. Comput Phys Commun. 2005;169:218–221.
  • 51. Zhou F, Olman V, Xu Y. Barcodes for genomes and applications. BMC Bioinformatics. 2008;9:546. doi: 10.1186/1471-2105-9-546.
  • 52. Kong SG, Chen HD, Fan WL, Wigger J, Torda AE, et al. Quantitative measure of randomness and order for complete genomes. Phys Rev E. 2009;79:061911. doi: 10.1103/PhysRevE.79.061911.
  • 53. Bapteste E, Boucher Y, Leigh J, Doolittle WF. Phylogenetic reconstruction and lateral gene transfer. Trends Microbiol. 2004;12:406–411. doi: 10.1016/j.tim.2004.07.002.
  • 54. Delsuc F, Brinkmann H, Philippe H. Phylogenomics and the reconstruction of the tree of life. Nat Rev Genet. 2005;6:361–375. doi: 10.1038/nrg1603.
  • 55. Lynch M, Conery JS. The origins of genome complexity. Science. 2003;302:1401–1404. doi: 10.1126/science.1089370.
  • 56. Coghlan A, Eichler EE, Oliver SG, Paterson AH, Stein L. Chromosome evolution in eukaryotes: a multi-kingdom perspective. Trends Genet. 2005;21:673–682. doi: 10.1016/j.tig.2005.09.009.
  • 57. Messer PW, Bundschuh R, Vingron M, Arndt PF. Effects of long-range correlations in DNA on sequence alignment score statistics. J Comput Biol. 2007;14:655–668. doi: 10.1089/cmb.2007.R008.
  • 58. Bartel DP. MicroRNAs: Genomics, biogenesis, mechanism, and function. Cell. 2004;116:281–297. doi: 10.1016/s0092-8674(04)00045-5.
  • 59. Doolittle WF. Fun with genealogy. Proc Natl Acad Sci U S A. 1997;94:12751–12753. doi: 10.1073/pnas.94.24.12751.
  • 60. Feng DF, Cho G, Doolittle RF. Determining divergence times with a protein clock: Update and reevaluation. Proc Natl Acad Sci U S A. 1997;94:13028–13033. doi: 10.1073/pnas.94.24.13028.
  • 61. Hedges SB. The origin and evolution of model organisms. Nat Rev Genet. 2002;3:838–849. doi: 10.1038/nrg929.
  • 62. Woese CR. The universal ancestor. Proc Natl Acad Sci U S A. 1998;95:6854–6859. doi: 10.1073/pnas.95.12.6854.
  • 63. Woese CR. On the evolution of cells. Proc Natl Acad Sci U S A. 2002;99:8742–8747. doi: 10.1073/pnas.132266999.
  • 64. Glansdorff N, Xu Y, Labedan B. The last universal common ancestor: emergence, constitution and genetic legacy of an elusive forerunner. Biol Direct. 2008;3:29. doi: 10.1186/1745-6150-3-29.
  • 65. Drake JW, Charlesworth B, Charlesworth D, Crow JF. Rates of spontaneous mutation. Genetics. 1998;148:1667–1686. doi: 10.1093/genetics/148.4.1667.
  • 66. Denver DR, Morris K, Lynch M, Thomas WK. High mutation rate and predominance of insertions in the Caenorhabditis elegans nuclear genome. Nature. 2004;430:679–682. doi: 10.1038/nature02697.
  • 67. Haag-Liautard C, Dorris M, Maside X, Macaskill S, Halligan DL, et al. Direct estimation of per nucleotide and genomic deleterious mutation rates in Drosophila. Nature. 2007;445:82–85. doi: 10.1038/nature05388.
  • 68. GenBank. The GenBank genome database. 2009. URL http://www.ncbi.nlm.nih.gov/sites/entrez?db=genome.
  • 69. Rudner R, Karkas JD, Chargaff E. Separation of B. subtilis DNA into complementary strands. III. Direct analysis. Proc Natl Acad Sci U S A. 1968;60:921–922. doi: 10.1073/pnas.60.3.921.
  • 70. Chen HD, Chang CH, Hsieh LC, Lee HC. Divergence and Shannon information in genomes. Phys Rev Lett. 2005;94:178103. doi: 10.1103/PhysRevLett.94.178103.


Articles from PLoS ONE are provided here courtesy of PLOS
