Bioinformatics. 2008 Jul 28;24(19):2157–2164. doi: 10.1093/bioinformatics/btn391

Is there an acceleration of the CpG transition rate during the mammalian radiation?

M Peifer 1,*, J E Karro 2,3, H H von Grünberg 1
PMCID: PMC2553435  PMID: 18662928

Abstract

Motivation: In this article we build a model of the CpG dinucleotide substitution rate and use it to challenge the claim that this rate underwent a sudden mammalian-specific increase approximately 90 million years ago. The evidence supporting this hypothesis comes from the application of a model of neutral substitution rates able to account for elevated CpG dinucleotide substitution rates. With the initial goal of improving that model's accuracy, we introduced a modification that accounts for boundary effects arising from the truncation of the Markov field, and we improved the optimization procedure required for estimating the substitution rates.

Results: When using this modified method to reproduce the supporting analysis, the evidence of the rate shift vanished. Our analysis suggests that the CpG-specific rate has been constant over the relevant time period and that the asserted acceleration of the CpG rate is likely an artifact of the original model.

Contact: peifer@uni-graz.at

Supplementary information: Supplementary data are available at Bioinformatics online.

1 INTRODUCTION

In much of the bioinformatics literature addressing mammalian neutral nucleotide substitution rates, researchers make the simplifying assumption that the rate at which a given base undergoes substitution is unaffected by the content of its neighbors, i.e. that it is free of its context (Hardison et al., 2003; Hasegawa et al., 1985; Jukes and Cantor, 1969; Lio and Goldman, 1998; Tyekucheva et al., 2008). However, it is well established that this assumption only approximates the biological reality. The best known example of a context-sensitive substitution is triggered by the methylation of CpG dinucleotides in vertebrates. Spontaneous, hydrolytic deamination then converts the 5-methylcytosine into thymine, while the same process would result in a uracil were methylation not in play. Since the repair mechanism is considerably less efficient in correcting the C→T transition than the C→U mutation, a particularly high rate for CpG→TpG (and CpG→CpA on the reverse strand) is observed (Coulondre et al., 1978; Ehrlich et al., 1986; Razin and Riggs, 1980; Wiebauer et al., 1993). Estimates of the rate for this context-dependent transition range from 10 to 60 times the rate of the other, single-nucleotide transitions (Arndt et al., 2003a; Blake et al., 1992; Fryxell and Zuckerkandl, 2000; Hess et al., 1994; Hwang and Green, 2004; Meunier et al., 2005; Siepel and Haussler, 2004).

In a recent series of studies, Arndt and colleagues have gone to great effort to model the rate of CpG dinucleotide substitutions. To this end, they incorporate the CpG effect into their method of estimating substitution rates and then apply that method to repeat data of mammalian genomes (Arndt and Hwa, 2004; Arndt et al., 2005; Arndt et al., 2003a, Arndt et al., 2003b). The results provide a picture of the behavior of the CpG substitution rate over time by relating it to the average transversion frequency. From this analysis they conclude that CpG-related substitutions occur at rates as high as 40 times the rate of the average transversions in modern mammals. Further, by applying the model to repeats of differing ages, they provide evidence that the rate of CpG substitution dramatically increased along the mammalian lineage about 90 Mya—roughly the time of the mammalian radiation (Arndt and Hwa, 2004; Arndt et al., 2003a).

Expecting to advance the results of the Arndt et al. studies, we implemented an alternative model for the analysis of context-sensitive substitution rates. We followed their methodology in that we restricted the context-dependent substitution process to a triplet of bases. However, we also tried to account for the ‘truncation errors’ that result from the consideration of triplets in isolation [as opposed to considering the whole sequence at once, as done in Hwang and Green (2004)]. In addition, we chose a different objective function used in the estimation of the substitution rates. Unexpectedly, when we fitted the boundary-corrected model to the data, the evidence for the shift in the CpG substitution rate vanished and we found a considerably higher CpG rate of substitutions in older repeats than predicted in Arndt et al. (2003a), suggesting the CpG rate has in fact been constant in time, not changing as has been suggested.

We will first compare two substitution rate models: the one used in the studies of Arndt et al., which we refer to as the simple context-dependent model (Arndt and Hwa, 2004; Arndt et al., 2003a), and our boundary-corrected model. Development of our model also requires the use of a different objective function for estimating the substitution rates than that used by Arndt et al. (Arndt and Hwa, 2004; Arndt et al., 2003a). A comparison of the estimates produced by each model will then shed light on the claim of a mammal-specific shift in CpG substitution rates and help us determine whether this shift is real.

2 METHODS

2.1 Substitution models

The technique for estimating substitution rates proposed by Arndt et al. (2003a), and our variation on that technique, are built on Markov chain theory that has been used in numerous other studies [e.g. Jukes and Cantor (1969); Hasegawa et al. (1985); see Ewens and Grant (2005) for an overview]. The Markov chain model is defined by a rate matrix R, from which we can derive P(t), a matrix of the state-transition probabilities over a time period t. Each state of the chain represents one possible nucleotide configuration. When the common simplifying assumption of base independence is made (disregarding CpG or other context-sensitive effects), we have a four-state Markov chain, and the R and P matrices are of size 4 × 4. If we instead account for limited contextual effects by assuming that the transition probability of a base depends on its left and right neighbors, we require a state for each possible triplet of nucleotides, and the matrices are of size 64 × 64. The resulting 64-state model has been explored in several studies (Arndt et al., 2003a; Arndt et al., 2003b; Lunter and Hein, 2004; Siepel and Haussler, 2004).

Previous works have shown that when looking at genome-wide average substitution rates (as opposed to local rates), we can use a model conforming to strand symmetry (Lobry and Lobry, 1999; Sueoka, 1995). That is, we can assume the rates of complementary substitutions (e.g. A→C and T→G) are equal. When substitution rates are independent of context, strand symmetry implies that the model can be described by six rates: two distinct transition rates (r5: A/T→G/C and r6: G/C→A/T), and four distinct transversion rates (which we denote r1,…,r4, but have no need to distinguish between in the following).
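The relationship between the six strand-symmetric rates, the rate matrix R and P(t) can be illustrated with a minimal sketch. It assumes that P(t) is the matrix exponential exp(Rt), consistent with the "exponential relationship" referred to later in the text, and it approximates that exponential with a plain scaled Taylor series. The assignment of r1,…,r4 to particular transversion pairs and the function names are assumptions made for illustration only.

```python
BASES = "ACGT"

def strand_symmetric_rate_matrix(r):
    """Build the 4x4 rate matrix R from six strand-symmetric rates r[1]..r[6]."""
    # ASSUMPTION: which transversion pair gets r1, r2, r3 or r4 is illustrative;
    # the text does not distinguish between the four transversion rates.
    rate_of = {
        ("A", "C"): r[1], ("T", "G"): r[1],
        ("A", "T"): r[2], ("T", "A"): r[2],
        ("C", "A"): r[3], ("G", "T"): r[3],
        ("C", "G"): r[4], ("G", "C"): r[4],
        ("A", "G"): r[5], ("T", "C"): r[5],  # transition A/T -> G/C
        ("G", "A"): r[6], ("C", "T"): r[6],  # transition G/C -> A/T
    }
    R = [[0.0] * 4 for _ in range(4)]
    for (src, dst), rate in rate_of.items():
        R[BASES.index(src)][BASES.index(dst)] = float(rate)
    for i in range(4):
        R[i][i] = -sum(R[i])  # each row of a rate matrix sums to zero
    return R

def mat_mul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def transition_matrix(R, t, terms=20, squarings=8):
    """Approximate P(t) = exp(R t) by a scaled Taylor series and repeated squaring."""
    n = len(R)
    scaled = [[x * t / 2 ** squarings for x in row] for row in R]
    P = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in P]
    for m in range(1, terms + 1):
        # term becomes scaled^m / m!
        term = [[x / m for x in row] for row in mat_mul(term, scaled)]
        P = [[P[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        P = mat_mul(P, P)
    return P

rates = {1: 1.0, 2: 1.0, 3: 1.0, 4: 1.0, 5: 3.0, 6: 5.0}
R = strand_symmetric_rate_matrix(rates)
P = transition_matrix(R, 0.025)
```

Each row of the resulting P sums to one, so row i can be read as the distribution of the modern base given ancestral base i.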

Modeling full context dependency is difficult, as it is not clear what dinucleotide substitutions exist. In non-coding regions the CpG substitution is the most clearly identifiable example of a context-sensitive substitution (Siepel and Haussler, 2004). Hence, we follow the common practice of adding only this one context-dependent rate to our model (Arndt and Hwa, 2004; Arndt et al., 2003b). Specifically: for single base substitutions we continue to use our ri parameters defined above (e.g. r5 represents the rate of an AAA→AGA transition); for unique CpG transitions (up to strand symmetry), we introduce a seventh parameter r7 (representing the rate for transitions such as CGA→TGA/CAA); it is assumed that no other substitution (e.g. AAA→TTT) can occur in a single step. This model, which has been the subject of numerous studies (Arndt and Hwa, 2004; Arndt et al., 2003a, 2003b, 2005), will be referred to as the simple context-dependent model. The model structure is described in the Supplementary Materials, Equation (S8).
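To make the structure of a 64-state triplet model concrete, the following sketch assembles a rate matrix in which every triplet can change by exactly one base per step. It is not a transcription of Equation (S8), which is not reproduced here. Two points are assumptions: that the CpG rate r7 is added on top of the single-nucleotide rate r6 whenever the CpG lies fully inside the triplet, and the assignment of r1,…,r4 to particular transversions.

```python
from itertools import product

BASES = "ACGT"
TRIPLETS = ["".join(p) for p in product(BASES, repeat=3)]
INDEX = {t: n for n, t in enumerate(TRIPLETS)}

# Index of the single-nucleotide rate for each substitution (strand symmetric).
# ASSUMPTION: the mapping of r1..r4 to transversion pairs is illustrative.
RATE_INDEX = {
    ("A", "C"): 1, ("T", "G"): 1, ("A", "T"): 2, ("T", "A"): 2,
    ("C", "A"): 3, ("G", "T"): 3, ("C", "G"): 4, ("G", "C"): 4,
    ("A", "G"): 5, ("T", "C"): 5, ("G", "A"): 6, ("C", "T"): 6,
}

def context_rate_matrix(r):
    """64x64 rate matrix for triplets with one CpG-specific rate r[7]."""
    R = [[0.0] * 64 for _ in range(64)]
    for src in TRIPLETS:
        for pos in range(3):
            for new in BASES:
                if new == src[pos]:
                    continue
                rate = float(r[RATE_INDEX[(src[pos], new)]])
                # ASSUMPTION: r7 acts in addition to r6 for a CpG inside the triplet.
                if src[pos] == "C" and new == "T" and pos < 2 and src[pos + 1] == "G":
                    rate += r[7]  # CpG -> TpG
                if src[pos] == "G" and new == "A" and pos > 0 and src[pos - 1] == "C":
                    rate += r[7]  # CpG -> CpA
                dst = src[:pos] + new + src[pos + 1:]
                R[INDEX[src]][INDEX[dst]] = rate
        # Multi-base changes (e.g. AAA -> TTT) keep rate zero; rows sum to zero.
        R[INDEX[src]][INDEX[src]] = -sum(R[INDEX[src]])
    return R

rates = {1: 1.0, 2: 1.0, 3: 1.0, 4: 1.0, 5: 3.0, 6: 5.0, 7: 50.0}
R64 = context_rate_matrix(rates)
```

In this sketch a C at the right edge of the triplet (as in AAC→AAT) always receives the plain rate r6, because its right-hand neighbor is invisible to the model. That is exactly the truncation issue the boundary-corrected model is meant to address.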

The simple context-dependent model is an improvement over the context-free model, but it still suffers from truncation error. Because we are considering isolated triplets, the neighbor effect is only partially corrected for in the flanking bases: while the substitution ACG→ATG is correctly reflected by r7, the substitution AAC→AAT is reflected in r6 even though the flanking C→T transition may have been influenced by a neighboring G (Hwang and Green, 2004). In the Supplementary Materials, we describe a strategy for dealing with this boundary effect. In short, we add six new parameters (rb1,…,rb6) which reflect the single-base substitution rates of the flanking bases. Differences between ri and rbi result from the truncation error, and can be used to measure the boundary effect. The derived rate matrix, given by Equation (S12) of the Supplementary Materials, will be referred to as the boundary-corrected model and is denoted Rb in the rest of this study.

2.2 Estimation of substitution frequencies

Following studies such as Hardison et al. (2003), Arndt et al. (2005), Gaffney and Keightley (2005) and Karro et al. (2008), we assume interspersed repeats reflect the neutral substitution rates. For each family of repeats, the RepeatMasker tool (Smit et al., 1996–2004) provides us with the alignment of a derived ancestral sequence to each modern interspersed repeat sequence, allowing us to calculate the number of substitutions from ancestral state (i,j,k) to modern state (i′,j′,k′). After discarding all low-complexity repeats, functional repeats and repeat families covering <100 kb on the modern genome, we can calculate this information for 494 repeat families, covering roughly 40% of the human genome. Note that the presence of repeats in CpG islands can be safely neglected since those repeats are statistically underrepresented; they should therefore not have a significant impact on the result.

For each repeat family, denoted α and inserted at time −tα (i.e. tα time units in the past), the number of substitutions N(i,j,k→i′,j′,k′; tα) is calculated from the alignments. The substitution frequencies Rtα and Rbtα can then be determined by making use of the exponential relationship P(Rtα) and P(Rbtα), respectively, as given in Equation (S9) (Supplementary Material) for each repeat family inserted at time −tα. More specifically, we need to estimate substitution frequencies such that P(Rtα) or P(Rbtα) fits the observed number of substitutions N(i,j,k→i′,j′,k′; tα). This leads to an optimization problem for which a suitable objective function is required. Since in the following we need to reproduce the results presented in the studies of Arndt et al. (Arndt and Hwa, 2004; Arndt et al., 2003a, 2003b, 2005), when using the context-dependent approach without boundary correction we adopt the same objective function used by Arndt et al. (2003a) and Arndt and Hwa (2004):

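$$\mathcal{L}_1 = -\sum_{i,j,k}\sum_{j'} N(i,j,k \to \cdot,j',\cdot\,;\,t_\alpha)\,\log P(i,j,k \to \cdot,j',\cdot \mid R\,t_\alpha) \qquad (1)$$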

where N(i,j,k→·,j′,·;tα)=∑i′,k′ N(i,j,k→i′,j′,k′;tα) and in a similar fashion P(i,j,k→·,j′,·∣Rtα)=∑i′,k′ P(i,j,k→i′,j′,k′∣Rtα), with P(i,j,k→i′,j′,k′∣Rtα) denoting the components of P(Rtα). It is important to note that summing over the flanking bases i′,k′ within N and P (represented by the dot symbol in both expressions) can be viewed as an observation function, an idea which was introduced in Arndt et al. (2003a).
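The counts N and their contraction over the modern flanking bases can be sketched as follows. The sketch assumes a gap-free pairwise alignment of an ancestral and a modern sequence and counts all overlapping triplets. Both choices, and the function names, are assumptions made for illustration, since the handling of gaps and overlapping windows is not specified in the text.

```python
from collections import Counter

def count_triplet_substitutions(ancestral, modern):
    """Count N(i,j,k -> i',j',k') over all overlapping triplets.

    ASSUMPTION: the two sequences are aligned without gaps.
    """
    if len(ancestral) != len(modern):
        raise ValueError("aligned sequences must have equal length")
    counts = Counter()
    for p in range(len(ancestral) - 2):
        counts[(ancestral[p:p + 3], modern[p:p + 3])] += 1
    return counts

def contract_flanks(counts):
    """Sum over the modern flanking bases i', k' to get N(i,j,k -> ., j', .)."""
    contracted = Counter()
    for (src, dst), n in counts.items():
        contracted[(src, dst[1])] += n  # keep only the modern middle base j'
    return contracted

N = count_triplet_substitutions("ACGAT", "ATGAT")
N_contracted = contract_flanks(N)
```

The contracted table has at most 64 × 4 entries instead of 64 × 64, which is the sense in which the summation acts as an observation function.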

When using the boundary-corrected model, the use of ℒ1 as an objective function would make it impossible to estimate the additional set of parameters rb1,…, rb6. We therefore choose the objective function which follows directly from the principle of maximum likelihood:

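$$\mathcal{L}_2 = -\sum_{i,j,k}\sum_{i',j',k'} N(i,j,k \to i',j',k'\,;\,t_\alpha)\,\log P(i,j,k \to i',j',k' \mid R_b\,t_\alpha) \qquad (2)$$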

where again the matrix components of P(Rbtα) are denoted by P(i,j,k→i′,j′,k′∣Rbtα). Note that the summation over the flanking bases (i′ and k′) has been dropped. Having thus defined the objective functions, substitution frequencies r1tα, …, r7tα are obtained by minimizing Equation (1), and r1tα, …, r7tα, rb1tα, …, rb6tα are estimated by minimizing Equation (2). The minimization of both ℒ1 and ℒ2 was performed using the same optimization algorithm employed in Arndt et al. (2003a).
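The difference between the two objective functions can be illustrated with a small sketch. It assumes that both take the form of a negative multinomial log-likelihood, −∑ N log P, with ℒ1 evaluated after summing N and P over the modern flanking bases and ℒ2 evaluated on the full triplet-to-triplet table. The function names and the toy numbers are hypothetical.

```python
import math

def objective_full(N, P):
    """L2-style objective: -sum N(src -> dst) * log P(src -> dst)."""
    return -sum(n * math.log(P[key]) for key, n in N.items() if n > 0)

def objective_contracted(N, P):
    """L1-style objective: contract N and P over modern flanks, then -sum N log P."""
    N_c, P_c = {}, {}
    for (src, dst), n in N.items():
        N_c[(src, dst[1])] = N_c.get((src, dst[1]), 0) + n
    for (src, dst), p in P.items():
        P_c[(src, dst[1])] = P_c.get((src, dst[1]), 0.0) + p
    return -sum(n * math.log(P_c[key]) for key, n in N_c.items() if n > 0)

# Toy example for a single ancestral triplet ACG (hypothetical numbers).
P_toy = {("ACG", "ACG"): 0.90, ("ACG", "ATG"): 0.05,
         ("ACG", "ACA"): 0.03, ("ACG", "TCG"): 0.02}
N_toy = {("ACG", "ACG"): 90, ("ACG", "ATG"): 6,
         ("ACG", "ACA"): 2, ("ACG", "TCG"): 2}

l1 = objective_contracted(N_toy, P_toy)
l2 = objective_full(N_toy, P_toy)
```

In the toy example, the outcomes ACG, ACA and TCG all share the modern middle base C and are therefore merged in the contracted objective. Changes in the flanking bases no longer influence the fit, which is why the flanking rates rb1,…,rb6 could not be estimated from it.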

2.3 Simulations

For simplicity we will refer to the method of Arndt et al. (2003a) that uses the simple context-dependent model and objective function ℒ1 as Method 1, or M1 [see Equation (1)]; we further refer to the method using the boundary-corrected model and objective function ℒ2 as Method 2, or M2 [see Equation (2)]. To test the performance of both methods, we simulated sequences using certain predefined rates and then estimated these rates from the synthetic data using either M1 or M2. The test is successful if predefined and estimated rates agree. A sequence of 10^6 bases was generated from a random consensus sequence. Simulations were performed by mutating an ancestral sequence and then stepping forward in time by increments of Δt=10^−5. At each step, every base is given a chance to mutate under each of the seven possible substitution processes, with a probability computed by multiplying the corresponding substitution rate by Δt. By repeating this procedure the sequence is propagated through time until the desired divergence (in terms of average transversion frequency) is reached. In accordance with the results presented further below, we used the following substitution rates in the simulation: r1, …, r4=1, r5=3, r6=5 and r7=50 and varied the time t over a range of t=0.001 to t=0.1. The result of our test is shown in Figure 1, where straight lines represent the rates used in the simulation while symbols show the rates as estimated by both methods. Note that for each data point in Figure 1 the consensus sequences used in the simulation had a different (randomly assigned) base composition. We also computed 95% confidence intervals by using 10 independent replications for each data point; these intervals are only shown for the rate estimates in which we see a significant deviation from the predefined values.
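The simulation procedure described above can be sketched in a few lines. The sketch is not the authors' code and rests on three stated assumptions: all bases are updated synchronously from the sequence at the start of each step, the CpG rate r7 is added to r6 at CpG sites, and the mapping of r1,…,r4 to transversions is arbitrary. A pure-Python loop is far too slow for 10^6 bases with Δt=10^−5, so the usage example runs on a short sequence with a coarser step.

```python
import random

# Index of the single-nucleotide rate for each substitution (strand symmetric).
RATE_INDEX = {
    ("A", "C"): 1, ("T", "G"): 1, ("A", "T"): 2, ("T", "A"): 2,
    ("C", "A"): 3, ("G", "T"): 3, ("C", "G"): 4, ("G", "C"): 4,
    ("A", "G"): 5, ("T", "C"): 5, ("G", "A"): 6, ("C", "T"): 6,
}

def step(seq, r, dt, rng):
    """Advance the sequence by one time increment dt."""
    new = list(seq)
    n = len(seq)
    for p, base in enumerate(seq):
        u = rng.random()
        cumulative = 0.0
        for target in "ACGT":
            if target == base:
                continue
            rate = r[RATE_INDEX[(base, target)]]
            # ASSUMPTION: r7 is added to r6 when the base is part of a CpG.
            if base == "C" and target == "T" and p + 1 < n and seq[p + 1] == "G":
                rate += r[7]
            if base == "G" and target == "A" and p > 0 and seq[p - 1] == "C":
                rate += r[7]
            cumulative += rate * dt  # probability of this substitution in dt
            if u < cumulative:
                new[p] = target
                break
    return "".join(new)

def evolve(seq, r, t, dt=1e-5, seed=0):
    """Propagate seq through time t in increments of dt."""
    rng = random.Random(seed)
    for _ in range(round(t / dt)):
        seq = step(seq, r, dt, rng)
    return seq

rates = {1: 1.0, 2: 1.0, 3: 1.0, 4: 1.0, 5: 3.0, 6: 5.0, 7: 50.0}
ancestor = "".join(random.Random(1).choice("ACGT") for _ in range(2000))
descendant = evolve(ancestor, rates, t=0.02, dt=1e-3)
```

The per-step probabilities are valid only while the total rate times Δt stays well below one, which holds for the rates and step sizes used here.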

Fig. 1.


Results from simulation study. Sequences having a length of 10^6 bases were generated and evolved over a time ranging between 0.001 and 0.1. All data are given as a function of the mean transversion frequency (r1+r2+r3+r4)t/4. Straight lines show the transition frequencies r5t, r6t and r7t as used in the simulation generating the sequences, while symbols represent the substitution frequencies that one obtains applying method 1 (open symbols) and method 2 (full symbols) to the synthetic data. Ideally, predefined and estimated rates should agree. Such an agreement or disagreement is quantified by calculating the 95% confidence intervals over 10 independent replications. For the sake of readability, we show the confidence intervals only for the rate estimate for which a significant deviation from the predefined value occurred (that is, the value of r7 as predicted by method 1). The * symbols indicate those spots in which the difference between the method 1 estimate and the actual value is statistically significant. We note that a log-scale has been used to help show that the single nucleotide transitions are accurately estimated for both methods.

The two single nucleotide transitions r5 and r6 are accurately reproduced by both methods. For the CpG-specific rate, however, only M2 could reproduce the expected frequencies over the entire range, while M1 produced a slight but significant bias for t > 0.03. Interestingly, there is a specific time range where M1 seems to be susceptible to biases resulting from the summation over the flanking neighbors in the objective function without correcting for the truncation error. To explore whether the summation over the flanking bases or the truncation error is primarily responsible for the bias produced by M1, we simulated an ensemble of trinucleotides using the same parameters as stated above. Since this dataset is already restricted to a trinucleotide scale, a truncation error cannot arise. This allows us to study in isolation the effect of the summation over the flanking neighbors. We observe that a similar bias (shown in Fig. 1) occurs when applying M1 to the data, which vanishes again if M2 is used instead (data not shown). We conclude from this that the bias produced by M1 must be primarily due to the summation over the flanking neighbors.

Returning to the simulation results shown in Figure 1, it is worth noting that the estimation error of the CpG-related rate was about an order of magnitude lower for M2 than for M1. Additionally, the importance of the boundary effect can also be seen by calculating the difference between ri and rbi: |1−rbi/ri| averaged over the observed range. For r6 this yields 72%, but only 6% for r5 and < 1% for the average transversion frequency.

2.4 Materials and statistics

Genome builds hg18 (human) and mm8 (mouse) were downloaded from the UCSC browser (Kent et al., 2002). Human and mouse repeat information was extracted with RepeatMasker v. 3.1.2 and RM database version 20051025. All regression analyses were performed in R (http://www.r-project.org). In order to test for differences in the slopes of two linear relations, a bootstrap test was implemented. Here, the residuals from the regression, rather than the data themselves, are resampled with replacement, as described in Davison and Hinkley (1997). To achieve a higher convergence rate of the bootstrap procedure and a more powerful test (Hall and Wilson, 1991; Peifer et al., 2005), a studentized bootstrap was used as well. For each test, 10 000 bootstrap samples were drawn.
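A residual bootstrap test for a difference in slopes might look like the following sketch. It is written in Python rather than R for consistency with the other examples, uses ordinary least-squares lines with intercepts, and centers the bootstrap statistic on the observed difference. Studentization is omitted for brevity. All names and the toy data are assumptions, and the sketch should not be read as the exact procedure used in the analysis.

```python
import random

def fit_line(x, y):
    """Ordinary least-squares fit y = a + b x; returns (a, b)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return my - b * mx, b

def bootstrap_slope(x, a, b, residuals, rng):
    """Refit the slope after adding resampled residuals to the fitted values."""
    y_star = [a + b * xi + rng.choice(residuals) for xi in x]
    return fit_line(x, y_star)[1]

def slope_difference_test(x1, y1, x2, y2, n_boot=10000, seed=0):
    """Return (observed slope difference, bootstrap p-value)."""
    rng = random.Random(seed)
    a1, b1 = fit_line(x1, y1)
    a2, b2 = fit_line(x2, y2)
    res1 = [yi - a1 - b1 * xi for xi, yi in zip(x1, y1)]
    res2 = [yi - a2 - b2 * xi for xi, yi in zip(x2, y2)]
    observed = b1 - b2
    extreme = 0
    for _ in range(n_boot):
        d_star = (bootstrap_slope(x1, a1, b1, res1, rng)
                  - bootstrap_slope(x2, a2, b2, res2, rng))
        # Centered statistic: compare sampling fluctuations with the observed gap.
        if abs(d_star - observed) >= abs(observed):
            extreme += 1
    return observed, (extreme + 1) / (n_boot + 1)

# Toy data: two lines with slopes 3 and 5 and small alternating noise.
x = [0.01 * i for i in range(1, 21)]
noise = [0.001 * (-1) ** i for i in range(1, 21)]
y_a = [3.0 * xi + e for xi, e in zip(x, noise)]
y_b = [5.0 * xi + e for xi, e in zip(x, noise)]
diff, p_value = slope_difference_test(x, y_a, x, y_b, n_boot=200)
```

Resampling residuals rather than data pairs keeps the design points (here the transversion frequencies) fixed, which is the property attributed to the Davison and Hinkley scheme in the text.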

3 RESULTS

First we compare and test the two context-dependent models discussed in Section 2. Since M1 is the basis for the results of Arndt et al. (2003a), by using that method we should be able to reproduce the main result of that work: the sudden decrease in the CpG-related frequency r7tα after a given point in the past. To this end, we used M1 to analyze 494 repeat families, including all 38 families used in the study of Arndt et al. (Arndt and Hwa, 2004; Arndt et al., 2003a). This allows us to calculate the context-dependent genome-wide transition frequencies r5tα, r6tα and r7tα, and the average transversion frequency rtrtα=(r1+r2+r3+r4)tα/4. In Figure 2 we reproduce Figure 4 from Arndt et al. (2003a), marking the 38 families used by that study with crosses. As expected, the 38 data points used by Arndt et al. (2003a) clearly reproduce the figure of the previous study, including the sudden change in the number of CpG-related transitions at rtrtα≈0.025. For the transition frequencies r5tα and r6tα, a linear relationship becomes apparent. The slopes of the straight lines yield r5/rtr=2.73 ± 0.04 and r6/rtr=5.35 ± 0.07, results compatible with the Arndt study. Including the other 456 repeat families confirms the decrease of r7tα after rtrtα≈0.025. While r5/rtr=3.00±0.03 changes slightly (p < 10^−4), r6/rtr=5.28±0.03 does not undergo any significant change at all.

Fig. 2.


Genome-wide averages of substitution frequencies for the two single nucleotide transitions (triangles) and the CpG-related transition (circles) against the average transversion frequency. Method 1, as suggested in Arndt et al. (2003a), has been applied. Each data point corresponds to a single (out of 494) repeat family of the human genome. Cross symbols refer to a selection of families used in Arndt et al. (2003a); all other families are displayed with open symbols. Lines are fits to the data as suggested in Arndt et al. (2003a).

Fig. 4.


Here we have shown the same comparison of method 1 against method 2 as in Figure 3, fitted to mouse genomic data. Again, applying method 1, a shift of the CpG-related rate becomes apparent (open symbols), whereas this shift disappears when method 2 is applied (closed symbols). Interestingly, the sudden change in r7tα of the M1-based estimates is roughly at the same average transversion frequency as for the human genome.

Our simulations indicate that results obtained using M1 will be biased for rtrtα > 0.03, but that M2 effectively negates this bias. Indeed a different result is found when we apply M2 (Fig. 3). Now the CpG-related substitution frequency r7tα no longer exhibits the sudden decrease after rtrtα=0.025. Instead, it shows an uninterrupted linear relationship with rtrtα, suggesting that r7 is approximately constant over the observed time range. This is verified by the following procedure: we fitted two models to the CpG-related substitution frequency; one model assumes a linear relationship and the other, more complex model, assumes separate linear relations before and after rtrtα=0.025. To select the more suitable model we consulted the Bayesian information criterion, or BIC (Schwarz, 1978). It turns out that the simple model for which a constant CpG rate is assumed is indeed sufficient to explain the data (having the lower BIC score of 324.6, compared to a score of 326.3 for the model allowing a change of the CpG rate). The linear fit of r7t against rtrt results in a slope of r7/rtr=48.3 ± 0.4. The M2 estimate of one of the single nucleotide transition rates (r5/rtr=3.02 ± 0.03) is not significantly different from that calculated with M1, while its estimate of r6/rtr=5.08 ± 0.03 is slightly smaller than in the calculation using M1 (p < 10^−4). The relative absolute differences between the frequencies r1tα, …, r6tα and the boundary-specific parameters rb1tα, …, rb6tα are on average: 3% for r1tα, 7% for r2tα, 7% for r3tα, 2% for r4tα, 6% for r5tα and 47.3% for r6tα. This again illustrates the importance of correcting for truncation effects. In addition, the huge difference between r6tα and rb6tα is expected as this rate represents G/C→A/T, the substitution type that includes the boundary-sensitive CpG-dinucleotide transitions.
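The model selection step can be sketched as follows, assuming ordinary least-squares lines with intercepts and the common Gaussian form BIC = n ln(RSS/n) + k ln n. The parameter counts (k = 2 for one line, k = 4 for two lines), the function names and the toy data are assumptions. The scores will not match the values reported above, which depend on the exact likelihood used in the analysis.

```python
import math

def line_rss(x, y):
    """Residual sum of squares of an ordinary least-squares line y = a + b x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    a = my - b * mx
    return sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))

def bic(rss, n, k):
    """Gaussian BIC up to an additive constant shared by both models."""
    return n * math.log(rss / n) + k * math.log(n)

def compare_single_vs_split(x, y, breakpoint=0.025):
    """Return (BIC of one line, BIC of separate lines left and right of breakpoint)."""
    n = len(x)
    left = [(xi, yi) for xi, yi in zip(x, y) if xi <= breakpoint]
    right = [(xi, yi) for xi, yi in zip(x, y) if xi > breakpoint]
    rss_single = line_rss(x, y)
    rss_split = line_rss(*zip(*left)) + line_rss(*zip(*right))
    return bic(rss_single, n, 2), bic(rss_split, n, 4)

# Toy data: a constant-rate relation and one with a kink at 0.025.
x = [0.004 * i for i in range(1, 26)]
noise = [0.001 * (-1) ** i for i in range(1, 26)]
y_constant = [48.3 * xi + e for xi, e in zip(x, noise)]
y_kinked = [(50.0 * xi if xi <= 0.025 else 1.25 + 10.0 * (xi - 0.025)) + e
            for xi, e in zip(x, noise)]

bic_constant = compare_single_vs_split(x, y_constant)
bic_kinked = compare_single_vs_split(x, y_kinked)
```

The split model always fits at least as well as the single line, so the comparison is decided by whether the gain in fit outweighs the penalty for the two extra parameters.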

Fig. 3.


Substitution frequencies as in Figure 2, computed now with two alternative methods. Method 1 (open symbols) has been used in Arndt et al. (2003a) and consists of a fitting procedure optimizing a contracted log-likelihood function [see Equation (1)], while method 2 (closed symbols) corrects for truncation errors arising from the restriction to trinucleotides and uses the objective function given in Equation (2).

Given the differences resulting from the application of two methods to the same input data, the simulation study suggests that the results derived from M2 are more reliable than those derived from M1. We repeated the same analysis for the mouse genome; results are shown in Figure 4. As with the human genome, there is a significant difference in the CpG-specific rate estimations resulting from the application of each method to the data. In fact, the results of Figures 3 and 4 are very similar. We observe a sudden shift of r7tα at about rtrtα=0.025 for the estimates based on M1, which disappears if M2 is used instead. Interestingly, the average transversion frequency where this shift occurs appears to be at the same value of rtrtα in both genomes, a finding we argue is in direct conflict with the Arndt et al. (2003a) interpretation of this value as the point of the mammalian radiation. We illustrate this conflict with the following argument. We know that there is a large difference in the total substitution rate between the primate and the rodent lineage. In fact, the murid substitution rate is at least twice that of human (Waterston et al., 2002). This must be taken into account when the average transversion frequencies of Figures 3 and 4 are translated into physical time. In other words, if we mapped the data points of Figures 3 and 4 to a common time axis, we would have to expand the values on the x-axis of Figure 4 relative to those of Figure 3 by roughly a factor of two (up to the mouse–human speciation). Hence any point on the x-axis which is the same in both Figures 3 and 4 must map to significantly different points on this common time axis. If the predicted, sudden change in the CpG substitution rates for human and mouse occurs at roughly the same x-axis point in both figures, it follows that these changes must map to two different points on the temporal axis. Therefore, the two shifts cannot both have happened at the time of the mammalian radiation. A more plausible explanation is that the sudden change is an artifact of the method of calculation.
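The core of this argument can be written as a short worked relation. It is a simplification that assumes a constant, lineage-specific average transversion rate and ignores the portion of history shared before the mouse–human speciation. The symbols x, x* and the lineage superscripts are introduced here for illustration only.

```latex
% x: average transversion frequency, r_tr: average transversion rate, t: physical time
\begin{align}
  x &= r_{tr}\, t
  \quad\Longrightarrow\quad
  t = \frac{x}{r_{tr}}, \\
  % the same shift position x^* in both genomes, with r_tr^mouse \approx 2 r_tr^human
  t^{\mathrm{mouse}}(x^{*})
    &= \frac{x^{*}}{r_{tr}^{\mathrm{mouse}}}
    \approx \frac{x^{*}}{2\, r_{tr}^{\mathrm{human}}}
    = \tfrac{1}{2}\, t^{\mathrm{human}}(x^{*}).
\end{align}
```

Under these assumptions, a shift located at the same x* ≈ 0.025 in both genomes corresponds to physical times that differ by roughly a factor of two, so it cannot mark a single shared event.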

Next we look at a human–mouse comparison of r5/rtr, r6/rtr and r7/rtr obtained by using M2. For r5/rtr, we observe a slight upward change to r5/rtr=3.14±0.02 (P < 10^−4), whereas for the other two rates we obtain r6/rtr=4.69±0.03 and r7/rtr=42.3±0.5 for the mouse, which are significantly smaller (P < 10^−4) than the corresponding values obtained for the human genome. The factors potentially responsible for this change are investigated in the discussion.

The most direct way to check the performance of a fitting procedure is to compare the fitted results to the original data. This is done in Figure 5 on the basis of the transition probabilities towards CpG sites. Aligning the ancestral consensus sequence to the individual repeat copies, one can empirically compute the transition probability from a CpG site (or non-CpG site) on the consensus sequence to a CpG site on today's genome; thus computing the CpG-content restricted to locations where either the consensus sequence has a CpG site or not. This probability is plotted against the averaged transversion frequency, which can be considered as a time axis. Again, each data point (plus symbols) corresponds to a single repeat family whose ‘age’ is given by the value on the x-axis. However, the rates r1tα, …, r7tα and r1tα, …, r7tα, rb1tα, …, rb6tα, estimated for each repeat family α by applying either M1 or M2, can be inserted into the corresponding rate matrices to compute transitional probabilities from the models. Comparison of these model predictions to the actual transition probabilities allows us to judge the success of the respective fitting procedure. Figure 5 also shows the same comparison using the synthetic data from the simulation presented in Figure 1.
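The empirical transition probabilities towards CpG sites can be computed along the following lines. The sketch assumes gap-free pairwise alignments of the consensus to each repeat copy and pools all dinucleotide positions of a family. The function name and the toy sequences are hypothetical.

```python
def cpg_transition_probabilities(alignments):
    """Return (P(CpG -> CpG), P(non-CpG -> CpG)) pooled over aligned pairs.

    alignments: iterable of (ancestral, modern) strings of equal length.
    ASSUMPTION: alignments contain no gaps.
    """
    cpg_total = cpg_kept = other_total = other_gained = 0
    for ancestral, modern in alignments:
        for p in range(len(ancestral) - 1):
            was_cpg = ancestral[p:p + 2] == "CG"
            is_cpg = modern[p:p + 2] == "CG"
            if was_cpg:
                cpg_total += 1
                cpg_kept += is_cpg      # CpG site conserved
            else:
                other_total += 1
                other_gained += is_cpg  # CpG site newly created
    p_keep = cpg_kept / cpg_total if cpg_total else float("nan")
    p_gain = other_gained / other_total if other_total else float("nan")
    return p_keep, p_gain

p_keep, p_gain = cpg_transition_probabilities([("ACGTCGAA", "ACGTTGCG")])
```

In the toy pair, one of the two ancestral CpG sites survives and one of the five non-CpG positions becomes a CpG. The corresponding model predictions would be obtained by summing the appropriate entries of the fitted transition matrix, which is not shown here.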

Fig. 5.


Probability of observing a CpG→CpG (left) and a non-CpG→CpG (right) state transition, versus the averaged transversion frequency representing the age of each repeat family. Upper two panels: data from the human genome; lower two panels: synthetic data from simulation. Original data (plus symbols) are computed from the alignment of the consensus sequences to modern repeat copies. Model predictions are based on the rates estimated either with method M1 (blue open square symbols) or with method M2 (filled red circles).

We observe that both the fast decay of the CpG sites and the build-up of CpG sites from non-CpG sites are well reproduced by M2, while method M1 fails to produce a satisfactory agreement for all data points beyond rtrtα=0.025. Note that this is just the point in Figure 2 beyond which the sudden change in CpG occurs. Since the statistical noise is orders of magnitude lower for the simulated data than that of real data, M1 is expected to perform better using synthetic data. The lower two panels in Figure 5 indeed show a better agreement for M1, even though a bump is to be observed—a deviation similar to that already found in Figure 1.

4 DISCUSSION AND CONCLUSION

In this work, we reconsidered the context-dependent substitution rate model proposed by Arndt and colleagues (Arndt and Hwa, 2004; Arndt et al., 2003a). These studies have concluded that the CpG-specific substitution rate r7 along the mammalian lineage is not constant in time, but underwent a drastic change about 90 Mya (corresponding to an average transversion frequency of rtrtα=0.025), or at roughly the time of the mammalian radiation. Our main finding is that this result cannot be confirmed when we use a more precise objective function and correct for the trinucleotide truncation effect. When applying our modified model, the evidence for the asserted change of the CpG-specific rate after rtrtα=0.025 vanishes; we see that the CpG-specific rate has been constant over time, is 48.3 times larger than the average transversion rate for the human genome, and 42.3 times larger than rtr for the mouse genome.

Our analysis suggests that the sudden shift of the CpG-specific rate proposed by Arndt and colleagues (Arndt and Hwa, 2004; Arndt et al., 2003a) is an artifact of their method. We have a number of arguments supporting this claim. First, the method used by Arndt et al. leads to an inconsistency which comes to light if one compares rates based on the analysis of the mouse and the human genome. A sudden CpG rate shift associated with an event in the past, such as the mammalian radiation, would imply that the shift occurs at a single point in physical time—and thus the estimation should be independent of the organism from which it is derived. We have applied the method used by Arndt et al. to estimate frequencies r7tα for both the human and the mouse genome (Figs 3, 4). This comparison reveals that the predicted CpG-rate shift occurs at a point in the past corresponding to the same transversion frequency for the mouse and the human genome. When we account for the different rates of substitution experienced by the two genomes, we realize that the time periods corresponding to a given transversion frequency are different for the two organisms, implying that the identified shift occurred at different times in each organism's history. This result is thus not compatible with the idea that the time-point of the sudden CpG-rate shift corresponds to a real event at a specific point on a physical time axis.

We also find evidence against the asserted CpG shift from a direct comparison of the results of the Arndt method against those of our method. Our method uses the objective function ℒ2 [Equation (2)] for the optimization procedure and applies it to the boundary-corrected model that accounts for truncation errors. Arndt and colleagues have taken the simple context-dependent model, which allowed them to apply the objective function ℒ1 [Equation (1)]. The crucial difference is that ℒ1 is based on a summation over the two flanking bases in the triplet state, something that has not been done in ℒ2. More specifically, an optimization based on ℒ1 searches for an optimal solution in a space of lower dimensionality [using N(i,j,k→·,j′,·)] as compared to an optimization based on ℒ2, which uses the full N(i,j,k→i′,j′,k′) to estimate the rates. Thus, while ℒ1 uses less information [in this case disregarding the information about a whole triplet of bases (i,j,k)], any method using ℒ2 exploits all available information and hence is a more general approach.

We find the Arndt method cannot cope with an unfavorable signal-to-noise ratio. Using simulation data with an exceptionally high ratio, the Arndt method has performed comparatively well in the consistency tests (see Figs 1 and 5). However, when the signal-to-noise ratio becomes low (as is the case, for example, for the CpG→CpG transitions on the older repeat families) this method fails dramatically, producing large, systematic errors (as we can see from the two upper panels in Fig. 5).

While Arndt et al. combined the objective function ℒ1 with the simple context-dependent model (in which triplets were considered in isolation), we have fitted the data to the boundary-corrected model using the objective function ℒ2. It is interesting to ask whether the objective function ℒ2 combined with the simple context-dependent model performs better. (The converse experiment, pairing ℒ1 with the boundary-corrected model, cannot be performed because of the restricted ability of ℒ1 to estimate necessary parameters.) We have checked this alternative pairing and found that the sudden shift in the CpG rate still disappears, showing a rate r7 that is again constant in time (though different from our value). This suggests that it is ℒ1, and in particular the summation over flanking bases, which is the factor responsible for the artificial shift in the CpG rate. Comparing the simplified model and the boundary-corrected model using ℒ2, we have also found that boundary effects can become extremely important. As detailed in Section 3, the relative difference between frequencies r1tα, …, r6tα and the boundary-specific parameters rb1tα, …, rb6tα can be as high as 47.3%.

A mathematical analysis can explain why problems in estimating r7tα arise if one sums over the flanking bases of the triplet. Details of this analysis are given in the Supplementary Material, where it is shown that the context-dependent model cannot be distinguished from a context-independent one if we sum over the flanking bases and the sequence has reached its equilibrium. The central finding is that, in equilibrium, the context-dependent rate matrix reduces to an effective 4 × 4 matrix that is independent of the initial state of the sequence [see Equation (S15)]. Strictly speaking, this holds only in equilibrium, but since the data are corrupted by noise (e.g. statistical errors caused by the finite size of the sample, alignment errors and errors in the reconstruction of the ancestor sequences of the repeats), differences in the single-nucleotide composition arising from different initial states are increasingly masked by the noise as the sequence approaches its equilibrium. Essentially, it is these differences that are needed to estimate the CpG-specific rate r7tα.

A further possible source of error in our analysis is related to the ancestral CpG dinucleotide frequency for highly diverged repeat families. As Figure 5 reveals, when the average transversion rate rtrtα=0.025 we expect about 93% of all CpG sites to have changed to non-CpG sites. This might hamper the identification of ancestral CpG dinucleotides, and the reconstructed consensus sequences for highly diverged repeats may contain fewer CpG sites than were present in the actual ancestor. Using simulations, we examined whether a partial loss of ancestral CpG sites in the reconstruction leads to a substantial bias in the estimation of r7. To this end, 25% of the initial CpG sites of the synthetic data were altered to either TpG or CpA (with equal probability). Applying M1 to the simulated data, we find that the method estimates r7/rtr at about 63% of its actual value. In contrast, our estimation method (M2) does not show a significant estimation bias, so it does not require a perfect reconstruction of the ancestral CpG sites. We correctly estimate the CpG-specific rate even with a significant number of reconstruction errors because our method is based on conditional transition probabilities and takes the CpG content of today's sequence into account. Furthermore, the ability to take the modern CpG content into account helps in the estimation of the CpG rate even if the data are affected by additional noise coming from the omission of some ancestral CpG sites. See the Supplementary Materials for the supporting simulation results.
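The perturbation step of this robustness test can be sketched as follows. This is a minimal illustration, not the simulation code used for the article: the function name `degrade_cpg`, the `seed` parameter and the choice of `random.Random` are assumptions. A chosen fraction of the CpG sites in a sequence is replaced by TpG or CpA with equal probability, mimicking a consensus sequence that has lost part of its ancestral CpG content.

```python
import random


def degrade_cpg(seq: str, fraction: float = 0.25, seed: int = 0) -> str:
    """Replace a fraction of the CpG sites in seq by TpG or CpA.

    Each selected site becomes TpG or CpA with probability 1/2.
    The number of altered sites is round(fraction * number_of_sites).
    """
    rng = random.Random(seed)
    # Start positions of all CpG dinucleotides (these can never overlap).
    sites = [p for p in range(len(seq) - 1) if seq[p:p + 2] == "CG"]
    chosen = rng.sample(sites, round(fraction * len(sites)))
    out = list(seq)
    for p in chosen:
        if rng.random() < 0.5:
            out[p] = "T"       # CpG -> TpG
        else:
            out[p + 1] = "A"   # CpG -> CpA
    return "".join(out)


if __name__ == "__main__":
    ancestor = "ACGT" * 8
    degraded = degrade_cpg(ancestor, fraction=0.25, seed=1)
    print(ancestor.count("CG"), "->", degraded.count("CG"))
```

Feeding such a degraded sequence to an estimator in place of the true ancestor is one way to check how sensitive the estimated r7/rtr is to reconstruction errors of this kind.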

In considering the points discussed above, we looked for a signature of CpG depletion in the consensus sequences of highly diverged repeats by correlating the CpG content of the ancestral sequence with the average transversion frequency. No clear trend became apparent. If, in contrast, the reconstructed consensus sequences had a strong systematic CpG bias, one would expect the CpG content of the consensus sequences to decay with the average transversion frequency.

Taken together, these results strongly suggest that the shift of r7tα observed in Arndt et al. (2003a) is simply an artifact produced by the method. We note that the same model and method have been used in a number of studies (e.g. Arndt and Hwa, 2004; Arndt et al., 2003a, b; Ebersberger and Meyer, 2005; Meunier and Duret, 2004; Webster et al., 2006). The results obtained in these papers should be reconsidered in the light of the results presented here.

Our finding that the purported CpG-related substitution rate shift is a computational artifact is supported by a number of previous biologically oriented studies. Arndt et al. explained their shift with a sudden change in the methylation pattern, but to date there has been no report of a biological event that would explain such a shift. In contrast, such events have been identified at an earlier point in genomic history. A distinct change of the methylation pattern is known to have taken place only at the invertebrate–vertebrate boundary, roughly 450 Mya. Evidence for this change is presented in Tweedie et al. (1997). There it is observed that the methylation pattern is relatively constant over all vertebrate genomes (showing a high degree of methylation throughout), whereas CpG sites in invertebrate genomes are predominantly non-methylated. This change of the methylation pattern can be attributed to a distinct change in the methyl–CpG binding proteins, as discussed in Hendrich and Tweedie (2003). Specifically, the MBD2/3 protein is encoded by a single gene within invertebrate genomes, whereas the two genes MBD2 and MBD3 are only present in vertebrate genomes. It is likely that MBD2/3 represents the original methyl–CpG binding protein, that MBD2 and MBD3 are direct descendants of MBD2/3, and that this diversification plays an important role in the change of the DNA methylation pattern at the vertebrate–invertebrate boundary. Further studies have shown that this change of the DNA methylation pattern at the invertebrate–vertebrate boundary is also reflected in the sequence content of vertebrates and invertebrates. Using an odds ratio measure between the CpG dinucleotide content and the product of the single nucleotide content of C and G, the lowest values were obtained for vertebrates, indicating a high degree of CpG depletion (Karlin and Mrázek, 1997). Within the analyzed group of vertebrates all derived odds ratios are comparable and significantly lower than those of the invertebrates. In Cardon et al. (1994) the same odds ratio measure was computed for several mitochondrial genomes, showing that there is no significant change at the vertebrate–invertebrate boundary. Moreover, the odds ratios for the mitochondrial genomes are quite close to those of the nuclear genomes of the invertebrates. As no methylase is known to be active in the mitochondria, we might conclude that the observed change between vertebrates and invertebrates is due to a change in the methylation pattern. In addition, based on a simple context-dependent substitution model and assuming constancy of the substitution rates, Cooper and Krawczak (1989) estimated that the massive CpG depletion started roughly 450 Mya, thus at the vertebrate–invertebrate boundary. This result has been disputed in Jabbari et al. (1997), where a slight acceleration in depletion of CpG sites was observed for mammals and birds (amniotes) when compared to fish and amphibians (anamniotes). The authors explain this result by a change of body temperature resulting in a higher deamination rate for warm-blooded vertebrates. But neither a change at the amniote–anamniote boundary (about 320 Mya) nor the massive change of the methylation pattern at the vertebrate–invertebrate boundary (about 450 Mya) would explain a change in the methylation pattern 90 Mya, as claimed by Arndt et al. to explain the CpG-rate shift. Without evidence of such an event, our prediction of a constant CpG rate over the studied time span is the biologically more plausible result.
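The odds ratio measure mentioned above can be illustrated with a short sketch. It computes the simple single-strand ratio f(CpG) / [f(C) f(G)]. The function name `cpg_odds_ratio` is hypothetical, and the cited studies may use a refined (e.g. strand-symmetrized) variant, so this should be read only as an illustration of the idea. Values well below 1 indicate CpG depletion.

```python
def cpg_odds_ratio(seq: str) -> float:
    """Observed/expected CpG ratio: f(CG) / (f(C) * f(G)) on a single strand."""
    seq = seq.upper()
    n = len(seq)
    if n < 2:
        raise ValueError("sequence must contain at least two bases")
    f_c = seq.count("C") / n
    f_g = seq.count("G") / n
    if f_c == 0 or f_g == 0:
        raise ValueError("odds ratio undefined without both C and G")
    # Dinucleotide frequency over the n - 1 overlapping positions.
    n_cg = sum(1 for p in range(n - 1) if seq[p:p + 2] == "CG")
    f_cg = n_cg / (n - 1)
    return f_cg / (f_c * f_g)


if __name__ == "__main__":
    for name, s in [("CpG-rich", "CGCGCGCG"), ("CpG-free", "CAGTCAGT")]:
        print(name, round(cpg_odds_ratio(s), 3))
```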

Hwang and Green (2004) suggest an alternative interpretation of the CpG-rate change at the time of the mammalian radiation. These authors find that CpG substitutions are relatively clock-like [in agreement with Kim et al. (2006)], while the single nucleotide substitutions are not consistent with the molecular clock hypothesis (see also Steiper et al., 2004; Yi et al., 2002). Because the single nucleotide substitutions, as opposed to the CpG substitutions, are prone to replication-dependent errors, it is conjectured that the CpG rate has remained constant whereas the rates of the other state transitions have decreased over time due to these replication-dependent errors. The combined effect of these findings would lead to a change of the CpG rate when plotted against the average transversion rate. In fact, if we compare the values of r7/rtr between human and mouse we find significantly smaller values for the mouse than for the human. This decrease is consistent with the relatively clock-like behavior of r7tα, as discussed in Hwang and Green (2004) and Kim et al. (2006), because the increase of the average transversion frequency is larger than the increase in CpG rate when passing from human to mouse. Consequently, the ratio r7/rtr is expected to be smaller for the mouse. Conversely, r7/rtr should have similar values for human and mouse when estimated from repeats that are older than the speciation time between these species. Therefore, a small upward shift should be visible for the ancient repeats of the mouse lineage. But since the noise on the data is so high, such a shift cannot be reliably examined with the proposed method. Thus, further studies are needed to completely clarify the fine structure of the time dependency of the CpG rate.
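The ratio argument in this paragraph can be made explicit with a short worked relation. The scaling factors c_7 and c_tr are notation introduced here for illustration and do not appear in the article: they denote the factors by which r7tα and rtrtα, respectively, increase when passing from the human to the mouse lineage for repeats younger than the speciation time.

```latex
\[
  (r_7 t_\alpha)_{\mathrm{mouse}} = c_7\,(r_7 t_\alpha)_{\mathrm{human}},
  \qquad
  (r_{\mathrm{tr}} t_\alpha)_{\mathrm{mouse}} = c_{\mathrm{tr}}\,(r_{\mathrm{tr}} t_\alpha)_{\mathrm{human}}
\]
\[
  \Longrightarrow\quad
  \left(\frac{r_7}{r_{\mathrm{tr}}}\right)_{\mathrm{mouse}}
  = \frac{c_7}{c_{\mathrm{tr}}}
    \left(\frac{r_7}{r_{\mathrm{tr}}}\right)_{\mathrm{human}}
  < \left(\frac{r_7}{r_{\mathrm{tr}}}\right)_{\mathrm{human}}
  \quad\text{whenever } c_{\mathrm{tr}} > c_7 > 0 .
\]
```

Under a strictly clock-like CpG rate one would have c_7 close to 1, so the ratio in the mouse would be reduced by roughly the factor 1/c_tr. For repeats older than the speciation time the two lineages share most of their history, both factors approach 1, and the ratios should converge.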

Supplementary Material

[Supplementary Data]
btn391_index.html (674B, html)

ACKNOWLEDGEMENTS

The authors would like to thank Philip Green, Ross Harrison, Webb Miller and Svitlana Tyekucheva for their help with this research, Laura Tabacca and David Karro for their help in writing this article, and Richard Burhans and Nathan Coraor for their technical support.

Funding: National Institutes of Health (5K01HG003315 to J.K.); Austrian Science Foundation (FWF) (P18762 to M.P.).

Conflict of Interest: none declared.

REFERENCES

  1. Arndt P, Hwa T. Regional and time-resolved mutation patterns of the human genome. Bioinformatics. 2004;20:1482–1485. doi: 10.1093/bioinformatics/bth105.
  2. Arndt P, Hwa T. Identification and measurement of neighbor-dependent nucleotide substitution processes. Bioinformatics. 2005;21:2322–2328. doi: 10.1093/bioinformatics/bti376.
  3. Arndt P, et al. Distinct changes of genomic biases in nucleotide substitution at the time of mammalian radiation. Mol. Biol. Evol. 2003a;20:1887–1896. doi: 10.1093/molbev/msg204.
  4. Arndt P, et al. DNA sequence evolution with neighbor-dependent mutation. J. Comp. Biol. 2003b;10:313–322. doi: 10.1089/10665270360688039.
  5. Arndt P, et al. Substantial regional variation in substitution rates in the human genome: importance of GC content, gene density, and telomere-specific effects. J. Mol. Evol. 2005;60:748–763. doi: 10.1007/s00239-004-0222-5.
  6. Blake R, et al. The influence of nearest neighbors on the rate and pattern of spontaneous point mutations. J. Mol. Evol. 1992;34:189–200. doi: 10.1007/BF00162968.
  7. Cardon L, et al. Pervasive CpG suppression in animal mitochondrial genomes. Proc. Natl Acad. Sci. USA. 1994;91:3799–3803. doi: 10.1073/pnas.91.9.3799.
  8. Cooper D, Krawczak M. Cytosine methylation and the fate of CpG dinucleotides in vertebrate genomes. Hum. Genet. 1989;83:181–188. doi: 10.1007/BF00286715.
  9. Coulondre C, et al. Molecular basis of base substitution hotspots in Escherichia coli. Nature. 1978;274:775–780. doi: 10.1038/274775a0.
  10. Davison A, Hinkley D. Bootstrap methods and their application. Cambridge: Cambridge University Press; 1997.
  11. Ebersberger I, Meyer M. A genomic region evolving toward different GC contents in humans and chimpanzees indicates a recent and regionally limited shift in the mutation pattern. Mol. Biol. Evol. 2005;22:1240–1245. doi: 10.1093/molbev/msi109.
  12. Ehrlich M, et al. DNA cytosine methylation and heat-induced deamination. Biosci. Rep. 1986;6:387–393. doi: 10.1007/BF01116426.
  13. Ewens J, Grant G. Statistical Methods in Bioinformatics: An Introduction. New York, NY: Springer Science; 2005.
  14. Fryxell K, Zuckerkandl E. Cytosine deamination plays a primary role in the evolution of mammalian isochores. Mol. Biol. Evol. 2000;17:1371–1383. doi: 10.1093/oxfordjournals.molbev.a026420.
  15. Gaffney D, Keightley P. The scale of mutational variation in the murid genome. Genome Res. 2005;15:1086–1094. doi: 10.1101/gr.3895005.
  16. Hall P, Wilson S. Two guidelines for bootstrap hypothesis testing. Biometrics. 1991;47:757–762.
  17. Hardison R, et al. Covariation in frequencies of substitution, deletion, transposition, and recombination during eutherian evolution. Genome Res. 2003;13:13–26. doi: 10.1101/gr.844103.
  18. Hasegawa M, et al. Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol. 1985;22:160–174. doi: 10.1007/BF02101694.
  19. Hendrich B, Tweedie S. The methyl-CpG binding domain and evolving role of DNA methylation in animals. Trends Genet. 2003;19:269–277. doi: 10.1016/S0168-9525(03)00080-5.
  20. Hess S, et al. Wide variations in neighbor-dependent substitution rates. J. Mol. Biol. 1994;236:1022–1033. doi: 10.1016/0022-2836(94)90009-4.
  21. Hwang D, Green P. Bayesian Markov chain Monte Carlo sequence analysis reveals varying neutral substitution patterns in mammalian evolution. Proc. Natl Acad. Sci. USA. 2004;101:13994–14001. doi: 10.1073/pnas.0404142101.
  22. Jabbari K, et al. Evolutionary changes in CpG and methylation levels in the genome of vertebrates. Gene. 1997;205:109–118. doi: 10.1016/s0378-1119(97)00475-7.
  23. Jukes T, Cantor C. Evolution of protein molecules. In: Munro H, editor. Mammalian Protein Metabolism. New York: Academic Press; 1969. pp. 121–132.
  24. Karlin S, Mrázek J. Compositional differences within and between eukaryotic genomes. Proc. Natl Acad. Sci. USA. 1997;94:10227–10232. doi: 10.1073/pnas.94.19.10227.
  25. Karro J, et al. Exponential decay of GC content detected by strand-symmetric substitution rates influences the evolution of isochore structure. Mol. Biol. Evol. 2008;25:362–374. doi: 10.1093/molbev/msm261.
  26. Kent W, et al. The human genome browser at UCSC. Genome Res. 2002;12:996–1006. doi: 10.1101/gr.229102.
  27. Kim S, et al. Heterogeneous genomic molecular clocks in primates. PLoS Genet. 2006;2:1527–1534. doi: 10.1371/journal.pgen.0020163.
  28. Lio P, Goldman N. Models of molecular evolution and phylogeny. Genome Res. 1998;8:1233–1244. doi: 10.1101/gr.8.12.1233.
  29. Lobry J, Lobry C. Evolution of DNA base composition under no-strand-bias conditions when the substitution rates are not constant. Mol. Biol. Evol. 1999;16:719–723. doi: 10.1093/oxfordjournals.molbev.a026156.
  30. Lunter G, Hein J. A nucleotide substitution model with nearest-neighbour interactions. Bioinformatics. 2004;20:216–223. doi: 10.1093/bioinformatics/bth901.
  31. Meunier J, Duret L. Recombination drives the evolution of GC-content in the human genome. Mol. Biol. Evol. 2004;21:984–990. doi: 10.1093/molbev/msh070.
  32. Meunier J, et al. Homology-dependent methylation in primate repetitive DNA. Proc. Natl Acad. Sci. USA. 2005;102:5471–5476. doi: 10.1073/pnas.0408986102.
  33. Peifer M, et al. On studentising and blocklength selection for the bootstrap on time series. Biometr. J. 2005;47:346–357. doi: 10.1002/bimj.200310112.
  34. Razin C, Riggs A. DNA methylation and gene function. Science. 1980;210:604–610. doi: 10.1126/science.6254144.
  35. Schwarz G. Estimating the dimension of a model. Ann. Stat. 1978;6:461–464.
  36. Siepel A, Haussler D. Phylogenetic estimation of context-dependent substitution rates by maximum likelihood. Mol. Biol. Evol. 2004;21:468–488. doi: 10.1093/molbev/msh039.
  37. Smit A. 1996–2004. RepeatMasker Open-3.0.
  38. Steiper M, et al. Genomic data support the hominoid slowdown and an early Oligocene estimate for the hominoid-cercopithecoid divergence. Proc. Natl Acad. Sci. USA. 2004;101:17021–17026. doi: 10.1073/pnas.0407270101.
  39. Sueoka N. Intrastrand parity rules of DNA base composition and usage biases of synonymous codons. J. Mol. Evol. 1995;40:318–325. doi: 10.1007/BF00163236.
  40. Tweedie S, et al. Methylation of genomes and genes at the invertebrate-vertebrate boundary. Mol. Cell. Biol. 1997;17:1469–1475. doi: 10.1128/mcb.17.3.1469.
  41. Tyekucheva S, et al. Human-macaque comparisons illuminate variation in neutral substitution rates. Genome Biol. 2008;9:R76. doi: 10.1186/gb-2008-9-4-r76.
  42. Waterston R, et al. Initial sequencing and comparative analysis of the mouse genome. Nature. 2002;420:520–562. doi: 10.1038/nature01262.
  43. Webster M, et al. Strong regional biases in nucleotide substitution in the chicken genome. Mol. Biol. Evol. 2006;23:1203–1216. doi: 10.1093/molbev/msk008.
  44. Wiebauer K, et al. The repair of 5-methylcytosine deamination damage. In: Jost J, Saluz H, editors. DNA Methylation: Molecular Biology and Biological Significance. Birkhäuser Verlag, Basel; 1993. pp. 510–522.
  45. Yi S, et al. Slow molecular clocks in old world monkeys, apes and humans. Mol. Biol. Evol. 2002;19:2191–2198. doi: 10.1093/oxfordjournals.molbev.a004043.

Articles from Bioinformatics are provided here courtesy of Oxford University Press
