Salerno et al. 10.1073/pnas.0605735103.
Fig. 6. Maximal L-mer length distributions for intersections of X chromosomes and fragments of X chromosomes of rat and mouse.
Fig. 7. (a) Length distribution of masked and unmasked intervals from repeat-masked mouse X chromosome. (b) Log-log plot of normalized power spectra for mouse/human maximal L-mer locations over the mouse genome for different window sizes. Mouse/human maximal L-mers for window size: black, $2^{18}$; red, $2^{16}$; green, $2^{14}$; brown, $2^{12}$. Power spectra for random locations from Fig. 2, but here displaced vertically for clarity: violet, excluded from masked regions of genome; orange, minimally separated and excluded from masked regions. The repeat-masked mouse genome was represented as a series of 1s and 0s, with a 1 corresponding to the right-hand terminal base spanned by each maximal L-mer of length $\geq 30$, and this series was Fourier-transformed. Other representations are possible; for example, a maximal L-mer could instead be represented by a sequence of L contiguous 1s spanning the entire L-mer. However, because power-law correlations are exhibited in the spectrum at lengths exceeding 5,000 bases, and there are few L-mers with L > 400, one might naively expect such an alternative to have little impact at the long wavelengths of primary interest here, and this expectation has been confirmed explicitly.
Fig. 8. Maximal L-mer length distribution from Drosophila melanogaster/Drosophila yakuba intersection.
Supporting Methods
Maximal L-mers.
Sequence perfectly conserved between two genomes was computed in several different ways: (i) as described in ref. 1 by L-mer intersection, a process that neglects sequence location; (ii) based on whole-genome alignments obtained from UCSC, again as described in ref. 1; and (iii) in such a way as to retain location as follows. For fixed K, we identify every sequence of length K shared by both (repeat-masked) mouse and human genomes and record the chromosomal position(s) of each K-mer. Within each genome, adjacent K-mers are assembled into L-mers, where $L = (K - 1) +$ (the number of assembled adjacent K-mers). In order of decreasing L, $L_{\max} \geq L \geq L_{\min}$, where $L_{\max}$ is the length of the longest mouse L-mer, we report as maximal L-mers all L-mers contained in both human and mouse lists. L-mers in the mouse list that are also in the human list are broken up into their two constituent $(L - 1)$-mers, and these $(L - 1)$-mers are deleted from the list of mouse $(L - 1)$-mers. Finally, we set $L \to L - 1$ and iterate. For the studies described here, $L_{\min} = K = 23$.
We obtained 1,987,268 maximal L-mers in the mouse genome, each with a unique chromosomal position. A set of these sequences common to both human and mouse genomes was derived by removing all positional information, representing each distinct sequence only once, and discarding any sequence that is a subsequence of another member of the set, yielding 1,553,444 position-independent sequences. The procedure yielding the latter set is symmetric under interchanging mouse with human, and these sequences were used to compute length distributions. The former set of sequences was used for clustering and power spectra.
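The first two steps of procedure iii (finding shared K-mers and assembling adjacent ones within a genome) might be sketched as follows. This is a toy illustration under stated assumptions, not the code used for the study: the function names `kmer_positions` and `assemble_runs`, the short example strings, and the choice K = 4 are all hypothetical, and the subsequent iteration over decreasing L is not implemented.

```python
from collections import defaultdict


def kmer_positions(seq, k):
    """Map each k-mer of seq to the list of its start positions."""
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)
    return positions


def assemble_runs(seq, shared, k):
    """Merge adjacent shared k-mers of seq into (start, length) intervals.

    A run of m shared k-mers at consecutive start positions is reported
    with length L = (k - 1) + m.
    """
    runs = []
    run_start, count = None, 0
    for i in range(len(seq) - k + 1):
        if seq[i:i + k] in shared:
            if count == 0:
                run_start = i
            count += 1
        elif count:
            runs.append((run_start, (k - 1) + count))
            count = 0
    if count:
        runs.append((run_start, (k - 1) + count))
    return runs


# Toy sequences standing in for the two genomes (hypothetical data).
mouse = "TTACGTACGGAT"
human = "CCACGTACGGCC"
k = 4

shared = set(kmer_positions(mouse, k)) & set(kmer_positions(human, k))
mouse_runs = assemble_runs(mouse, shared, k)
human_runs = assemble_runs(human, shared, k)
```

In this toy case the assembled mouse interval is one base longer than the human one, because the 4-mer TACG occurs at two mouse positions. An assembled L-mer in one genome therefore need not occur contiguously in the other, which is why the assembled lists still have to be compared in order of decreasing L.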
Single-Copy Sequences.
Sets of single-copy sequence from genomes that have not been repeat-masked were generated in two different ways: (a) by iii above, except that before assembling adjacent K-mers into L-mers, any K-mer occurring more than once in either genome was discarded; and (b) subsequent to iii, any maximal L-mer occurring more than once in either genome was discarded, the entropy of base-composition was computed for each sequence, and those sequences with entropies less than a specified cut-off were discarded.
For method i, we chose K = 23 to generate Fig. 2e. Increasing K successively cuts off the leftmost parts of the distribution, without altering the distribution for large L. Choosing K = 20, on the other hand, leads to significant degradation of the power law for all L. Because the choice of K has no effect on the assembly of the L-mers for $L \geq K$, this degradation must be a consequence of ultraconserved elements containing 20-mers that occur multiple times in at least one of the two genomes (whereas evidently few such 23-mers are subsequences of an ultraconserved element).
The base-composition entropy was defined as $-\sum_n f_n \log_2(f_n)$, where $f_n$ is the fraction of base n in the sequence, $n \in \{A, G, C, T\}$. This quantity is a naive measure of the log probability of obtaining the given sequence by randomly picking from a uniform distribution of bases.
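A minimal sketch of the entropy filter, assuming the conventional sign (so that the entropy runs from 0 bits for a homopolymer to 2 bits for equal base fractions). The function name `base_composition_entropy` and the cut-off value in the example are hypothetical; the source does not state the cut-off that was used.

```python
import math
from collections import Counter


def base_composition_entropy(seq):
    """Return -sum_n f_n * log2(f_n) over the bases present in seq, in bits."""
    counts = Counter(seq.upper())
    total = sum(counts.values())
    # Bases that do not occur contribute nothing (f * log2(f) -> 0 as f -> 0).
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical cut-off, for illustration only.
cutoff_bits = 1.5
candidates = ["ACGTACGTACGT", "AAAAAAAAAAAA", "ATATATATATAT"]
kept = [s for s in candidates if base_composition_entropy(s) >= cutoff_bits]
```

With this cut-off only the first candidate survives: the homopolymer has entropy 0 bits and the AT repeat has entropy 1 bit.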
Random Signals.
The ratio of the number of mouse/human maximal L-mers with length $\geq 30$ bases ($\sim 3 \times 10^5$) to the number of bases in the repeat-masked mouse genome ($\sim 1.4 \times 10^9$) yields a mean density $\rho$ of maximal L-mers per unmasked base. Random sequences with lengths of mouse chromosomes were generated by setting the value at each position to 1 with probability $\rho$ and 0 otherwise. As indicated in the text, sequences were also generated subject to a minimum-separation requirement, such that no two "1" values were within 30 bases of one another. Both of these signals were generated subject to the additional condition that the sequence vanish at all repeat-masked locations.
Spectral Analysis.
We generated sequences of chromosomal length with a 0 in every position except those of the right-hand terminal base of a maximal L-mer, which was assigned the value 1. Following ref. 2, each of these chromosomes was broken into just under $10^3$ nonoverlapping windows of length $2^{18}$ bases. Fourier coefficients $a(k)$ were calculated for each window by fast Fourier transform (3), and the power spectrum computed as $S_k = a^*(k)\,a(k)$. The $S_k$ were then averaged over all windows. The slope in the long-wavelength regime in Fig. 3 was confirmed by examining several smaller window sizes (Fig. 7b).
Scaling.
Two sets of local densities were scaled: (i) local maximal L-mer densities and (ii) local densities of positions randomly chosen within the repeat-masked mouse genome such that the overall total number of positions was equal to the total number of maximal L-mers. The randomly chosen positions are expected to yield a distribution of local densities that, above the mean, exhibits a Gaussian tail; consequently, the sampling of the local density distribution in the tail may be much poorer than for the stretched exponential yielded by the maximal L-mers. In addition, the mean density is low enough that for this range of window sizes, the full local density distribution for randomly chosen positions is just crossing over from Poisson to Gaussian. Consequently, corrections to scaling may not be negligible in this regime.
Denoting the window size by n, our procedure for both data sets was as follows. (i) For each window size, normalize the x coordinates of each data point by n to yield spatial densities, and the y coordinates by the total number of windows of size n to yield probability densities. (ii) For each n, translate the curves linearly in x so that their maxima line up at x = 0. (iii) For each n, collapse by scaling in x and y directions.
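Steps i to iii can be sketched as below for a list of positions on a single sequence. This is an illustrative sketch only: the function names, the use of nonoverlapping windows, the dropping of a trailing partial window, and the free exponents `x_exponent` and `y_exponent` in the final step are assumptions, since the procedure above does not fix these details.

```python
from collections import Counter


def local_density_distribution(positions, genome_length, n):
    """Step (i): sorted (density, probability) pairs for windows of size n."""
    num_windows = genome_length // n  # a trailing partial window is dropped
    per_window = Counter(p // n for p in positions if p // n < num_windows)
    # Histogram of counts per window, including windows with no positions.
    histogram = Counter(per_window.get(w, 0) for w in range(num_windows))
    # x -> count / n (spatial density); y -> fraction of windows.
    return sorted((c / n, h / num_windows) for c, h in histogram.items())


def center_on_maximum(curve):
    """Step (ii): translate in x so the most probable density sits at x = 0."""
    x_peak = max(curve, key=lambda point: point[1])[0]
    return [(x - x_peak, y) for x, y in curve]


def rescale(curve, n, x_exponent=0.0, y_exponent=0.0):
    """Step (iii): scale x by n**x_exponent and y by n**y_exponent."""
    return [(x * n ** x_exponent, y * n ** y_exponent) for x, y in curve]


# Toy data (hypothetical): five positions on a sequence of length 40.
toy_positions = [0, 5, 12, 21, 35]
curve = local_density_distribution(toy_positions, genome_length=40, n=10)
centered = center_on_maximum(curve)
collapsed = rescale(centered, n=10, x_exponent=0.5)
```

Repeating this for several window sizes and overlaying the resulting curves is one way to judge which exponents give the best collapse.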
In contrast to the local densities of randomly chosen positions, which as expected collapsed optimally when scaled in the x direction as $n^{1/2}$, the local maximal L-mer densities collapsed best on naive scaling in the x direction (i.e., as a density, with no further scaling in x necessary beyond step i above). No significant differences in the scaled local density distributions were observed between positions randomly chosen in (i) the whole mouse genome and (ii) the repeat-masked mouse genome.
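The $n^{1/2}$ scaling expected for randomly chosen positions follows from a standard counting argument, given here as a worked sketch rather than as part of the original methods. The symbols are assumptions introduced for illustration: $\rho$ is the mean number of positions per base, $c$ is the number of positions falling in a window of size $n$, and $x = c/n$ is the local density. For independent positions at low density, $c$ is approximately Poisson-distributed, so:

```latex
\begin{align}
  \langle c \rangle &= \rho n, &
  \operatorname{Var}(c) &\approx \rho n, \\
  \langle x \rangle &= \rho, &
  \operatorname{Var}(x) &= \frac{\operatorname{Var}(c)}{n^{2}} \approx \frac{\rho}{n}, \\
  \sigma_x &\approx \sqrt{\rho}\; n^{-1/2}
  &\Longrightarrow\quad
  \sigma\!\left[(x - \rho)\, n^{1/2}\right] &\approx \sqrt{\rho}.
\end{align}
```

The width of the centered density distribution thus shrinks as $n^{-1/2}$, and multiplying the centered x coordinate by $n^{1/2}$ removes the dependence on window size in the Gaussian regime. A distribution that collapses without this factor, as reported above for the maximal L-mers, has a width in density that does not narrow with window size in this way.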
1. Tran, T., Havlak, P. & Miller, J. (2006) Nucleic Acids Res. 34, e65.
2. Voss, R. F. (1992) Phys. Rev. Lett. 68, 3805-3808.
3. Press, W., Teukolsky, S., Vetterling, W. & Flannery, B. (2002) Numerical Recipes in C++: The Art of Scientific Computing (Cambridge Univ. Press, Cambridge, U.K.).