Abstract
Six measures of evolutionary change in the human genome were studied, three derived from the aligned human and mouse genomes in conjunction with the Mouse Genome Sequencing Consortium, consisting of (1) nucleotide substitution per fourfold degenerate site in coding regions, (2) nucleotide substitution per site in relics of transposable elements active only before the human–mouse speciation, and (3) the nonaligning fraction of human DNA that is nonrepetitive or in ancestral repeats; and three derived from human genome data alone, consisting of (4) SNP density, (5) frequency of insertion of transposable elements, and (6) rate of recombination. Features 1 and 2 are measures of nucleotide substitutions at two classes of “neutral” sites, whereas 4 is a measure of recent mutations. Feature 3 is a measure dominated by deletions in mouse, whereas 5 represents insertions in human. It was found that all six vary significantly in megabase-sized regions genome-wide, and many vary together. This indicates that some regions of a genome change slowly by all processes that alter DNA, and others change faster. Regional variation in all processes is correlated with, but not completely accounted for by, GC content in human and the difference between GC content in human and mouse. [Supplemental material is available online at www.genome.org and http://www.soe.ucsc.edu/research/compbio/covariation/.]
In principle, the alignment of the human and mouse genome sequences provides the opportunity to find most functional sequences whose role is conserved in the two species. Nearly all such sequences are subject to purifying selection, and thus will change less than nonfunctional sequences, which will evolve at a faster, neutral rate. Thus, one way to find conserved functional elements of the human genome is to identify those DNA sequences that are changing significantly more slowly than the neutral rate.
This approach is complicated by two kinds of variation: variation in the level of selection on different functional sequences, which affects the extent to which they change relative to the neutral rate, and regional variation in the neutral rate itself within the genome. Variation in the level of selection is well known. Most protein-coding sequences change little in comparisons between orthologous human and mouse genomic sequences. The mean similarity is ∼85% identity, but values range from 36% to 100% identity (Makalowski et al. 1996; Makalowski and Boguski 1998b). Studies of a large number of protein families show that the nonsynonymous substitution rate varies about 300-fold (Nei 1987), presumably reflecting differences in the portion of the protein under functional constraint and the severity of that constraint. Other functional regions, such as those transcribed into RNAs that do not encode proteins and DNA sequences regulating gene expression, have been studied less, but it is reasonable to expect substantial variation in the level of selection for these as well.
Many investigators have found that the neutral substitution rate, primarily estimated from substitutions at synonymous sites, also varies regionally within a genome (Matassi et al. 1999; Williams and Hurst 2000; Chen et al. 2001; Castresana 2002b; Ebersberger et al. 2002; Lercher and Hurst 2002; Smith et al. 2002), but it is uncertain how consistent this variation is across different mammalian lineages (Williams and Hurst 2002), and there are dissenting opinions on the existence of rate variations (Kumar and Subramanian 2002). Early results showed that the synonymous substitution rate varied for different genes (Wolfe et al. 1989) and that it correlates with the nonsynonymous rate (Graur 1985). Additional studies have consistently seen wide variation in the rate of substitution per synonymous site for human and other mammalian species (primarily rodents; Makalowski et al. 1996; Casane et al. 1997; Makalowski and Boguski 1998a). Regional effects are evident from studies of correlations in the synonymous rates of neighboring genes and genes within fixed-size regions (Matassi et al. 1999; Williams and Hurst 2000; Chen et al. 2001; Castresana 2002a; Lercher and Hurst 2002; Smith et al. 2002).
The inferred regional variation in evolutionary rates has been examined directly by comparisons of long genomic DNA sequences between humans and other mammals (usually mouse). Alignments of these sequences showed that some loci have extensive matches outside the coding region (Koop and Hood 1994; Epp et al. 1995; Oeltjen et al. 1997; Ellsworth et al. 2000), whereas in others the matches are largely limited to the coding regions (Lamerdin et al. 1996; Endrizzi et al. 1999), and still others have an intermediate level of noncoding sequence matches (Margot et al. 1989; Shehee et al. 1989; Lamerdin et al. 1995; Ansari-Lari et al. 1998). Quantitative analysis showed that the fraction of noncoding, nonrepetitive genomic sequence that aligns in comparisons between mammalian orders varies over a 10-fold range at different loci (Endrizzi et al. 1999; DeSilva et al. 2002). Thus, analysis both of substitutions at apparently neutral sites in coding regions and the extent of aligning DNA in noncoding regions reveals substantial regional differences in the amount of divergence between mammalian genomic DNA sequences.
The search for functional genomic DNA sequences based on comparative analyses requires a much better understanding of this regional variation in the rate of evolution. The goal is to determine which sequences have changed significantly less than expected, given a particular underlying rate of change for the region encompassing those sequences, knowing that the underlying rates can vary. Several things are needed to accomplish this goal. Reliable, well-understood measures of divergence must be developed (Nei and Kumar 2000), and they need to be applied genome-wide to ascertain their variation along each chromosome. The extent of correlation among the measures of divergence needs to be examined; for example, one study on a small scale (relative to a whole genome) has shown that the fraction of human DNA sequences aligning with mouse is negatively correlated with the frequency of insertion and retention of interspersed repeats (Chiaromonte et al. 2001). Any covariation should be explained to the extent possible. However, the literature has conflicting reports on the ability of parameters such as GC content to explain variation in divergence (Wolfe et al. 1989; Wolfe and Sharp 1993; Bernardi 1995; Matassi et al. 1999). Once these steps have been accomplished, approaches for finding sites that are candidates for being under selection (given variation in underlying rates) can be applied (Elnitski et al. 2003; Li and Miller 2002; Roskin et al. 2002; Waterston et al. 2002).
This paper reports our initial results analyzing divergence between human and mouse genome-wide, done in conjunction with the Mouse Genome Sequencing Consortium (Waterston et al. 2002). Three measures of evolutionary change were derived from the aligned human and mouse genomes; these are nucleotide substitution per fourfold degenerate site in coding regions, nucleotide substitution per site in ancestral repeats, and the nonaligning fraction of human DNA that is nonrepetitive or in ancestral repeats. The first two are measures of nucleotide substitutions, one at a class of sites commonly used to model neutral evolution, and the other at a newly studied class of sites that may provide a superior model of neutral evolution. The third is a measure dominated by deletions in mouse (Waterston et al. 2002). Three additional measures of divergence were derived from the human genome alone: the frequency of insertion of transposable elements, the density of single nucleotide polymorphisms in human, and the meiotic recombination rate.
We show that all six of these measures of chromosomal DNA change vary regionally in their rates, and most of the rates covary. Some of these observations extend previous results obtained on smaller data sets, as discussed below. Thus large (megabase-sized) segments of mammalian genomes vary substantially in their rate of change by substitution, deletion, insertion, and recombination, and regions with more changes acquired recently (high SNP density) also have accumulated more substitutions since the human–mouse divergence. We show that variation in GC content accounts for some but not all of this variation, and has a quadratic relationship with the level of divergence. Similar results are obtained for the change in GC content between human and mouse: It can account for part of the variation, but cannot account for all the variation in divergence. The involvement of double-strand breaks during recombination and DNA repair processes is a potential mechanism to explain the variation (Lercher and Hurst 2002), although many possibilities will need to be examined in future studies.
RESULTS
Measurement of Rates of Neutral Substitution
We used a whole-genome alignment between the June 2002 human genome assembly and the mouse genome assembly as reported in Waterston et al. (2002), built by the BLASTZ alignment program (Schwartz et al. 2003). This alignment covers ∼40% of the human genome sequence, with 69.8% of the aligned bases matching. To distinguish substitutions representing neutral evolutionary drift (Kimura 1983) from those influenced by selection, it is common to examine substitutions at fourfold degenerate sites in codons, that is, sites marked “x” in the codons GCx (ALA), CCx (PRO), TCx (SER), ACx (THR), CGx (ARG), GGx (GLY), CTx (LEU), and GTx (VAL), which we call 4D sites (see Methods). We found about two million such sites in our human–mouse genome alignment, using codons defined by human gene annotations from BLAT (Kent 2002) alignments of 9562 RefSeq cDNAs that passed certain quality checks.
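As an illustration of the 4D-site definition above, the following minimal sketch lists the fourfold degenerate positions in an in-frame coding sequence. The function name `fourfold_sites` and the constant `FOURFOLD_PREFIXES` are hypothetical, and the sketch looks at a single sequence only; any further criteria applied in the Methods (for example, requirements on the aligned mouse codon) are not reproduced here.

```python
# Two-base codon prefixes whose third position is fourfold degenerate,
# i.e. the eight codon families listed in the text (GCx, CCx, TCx, ACx,
# CGx, GGx, CTx, GTx).
FOURFOLD_PREFIXES = {"GC", "CC", "TC", "AC", "CG", "GG", "CT", "GT"}


def fourfold_sites(cds):
    """Return 0-based offsets of 4D sites in an in-frame coding sequence.

    Assumes `cds` starts at codon position 1; a trailing partial codon
    is ignored.
    """
    cds = cds.upper()
    sites = []
    for start in range(0, len(cds) - 2, 3):
        if cds[start:start + 2] in FOURFOLD_PREFIXES:
            sites.append(start + 2)  # third base of the codon
    return sites


if __name__ == "__main__":
    # Codons: ATG (Met), GCT (Ala, 4D at offset 5), TGA (stop)
    print(fourfold_sites("ATGGCTTGA"))
```

Testing membership of the two-base prefix avoids enumerating all 32 codons, because every codon in these eight families encodes the same amino acid regardless of its third base.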
The overall observed percent identity in the 4D sites is 67.2%, but it varies depending on the human GC content of the surrounding 100-kb region, from 69.1% in low (<36.2%) GC regions, to 68.4% in medium (between 36.2% and 41.2%) GC regions, and 66.4% in high (>41.2%) GC regions. (These GC ranges divide the data roughly into equal thirds.) Because hypermutable CpG dinucleotides can sometimes skew estimates of the levels of conservation (Fryxell and Zuckerkandl 2000), we also recalculated the percent identities after removing all sites that are in a CpG either in human or in mouse. This increased them to 74.4% in low-GC regions, 74.1% in medium-GC regions, and 73.6% in high-GC regions. The frequencies of the 16 observed changes in 4D sites for medium-GC-content regions, not excluding CpGs, are given in Table 1A; similar tables for the other cases are given as Supplementary Material (available online at www.genome.org and http://www.soe.ucsc.edu/research/compbio/covariation/).
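The CpG filter described above can be sketched as follows. This is a simplified illustration, not the pipeline used for the reported numbers: it assumes two ungapped, equal-length aligned strings, and the helper names `in_cpg` and `percent_identity` are hypothetical.

```python
def in_cpg(seq, i):
    """True if position i is the C or the G of a CpG dinucleotide."""
    base = seq[i]
    if base == "C":
        return i + 1 < len(seq) and seq[i + 1] == "G"
    if base == "G":
        return i > 0 and seq[i - 1] == "C"
    return False


def percent_identity(human, mouse, sites, exclude_cpg=False):
    """Percent of `sites` at which the aligned human and mouse bases match.

    With exclude_cpg=True, a site is dropped if it lies in a CpG in
    either species, mirroring the filter described in the text.
    """
    kept = [
        i for i in sites
        if not (exclude_cpg and (in_cpg(human, i) or in_cpg(mouse, i)))
    ]
    if not kept:
        raise ValueError("no sites left after filtering")
    matches = sum(human[i] == mouse[i] for i in kept)
    return 100.0 * matches / len(kept)


if __name__ == "__main__":
    human, mouse = "ACGTAC", "ACATAC"
    print(percent_identity(human, mouse, range(6)))
    print(percent_identity(human, mouse, range(6), exclude_cpg=True))
```

In the toy alignment the only mismatch falls inside a human CpG, so removing CpG sites raises the identity, which is the direction of the effect reported for 4D sites.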
Table 1.
A. Observed Changes in 4D Sites
| Human \ Mouse | A | C | G | T |
| A | 0.1779 | 0.0246 | 0.0499 | 0.0192 |
| C | 0.0135 | 0.1597 | 0.0169 | 0.0348 |
| G | 0.0358 | 0.0169 | 0.1641 | 0.0134 |
| T | 0.0187 | 0.0495 | 0.0257 | 0.1793 |
B. Observed Changes in AR Sites
| Human \ Mouse | A | C | G | T |
| A | 0.2163 | 0.0198 | 0.0508 | 0.0207 |
| C | 0.0160 | 0.1184 | 0.0116 | 0.0463 |
| G | 0.0463 | 0.0116 | 0.1183 | 0.0159 |
| T | 0.0207 | 0.0509 | 0.0199 | 0.2166 |
Frequency of observed changes in (A) 4D sites and (B) ancestral repeat sites in 100-kb windows with medium human GC content (between 36.2% and 41.2% G or C). GC content was calculated using all aligned bases in the window. Frequencies are expressed as the fraction of the total observed changes.
Using the general time-reversible Markov model of base substitution, REV (Tavaré 1986; Yang 1994; Whelan et al. 2001), we used the frequencies of observed changes in Table 1A to estimate the average number of substitutions per 4D site on the combined primate and rodent lineages since their divergence from a common ancestor. For medium-GC-content regions, not excluding CpGs, we obtained 0.42 substitutions per site, which can be broken down into the 12 types of substitution shown in Table 2A. Recomputing this number for other GC contents, we obtained related estimates for the number of substitutions per site, varying between 0.41 and 0.47; excluding CpGs, we obtained much smaller estimates of 0.32–0.34 substitutions per site, as shown in the Supplementary Material. Other Markov models (Lio and Goldman 1998) that distinguish between transitions and transversions, such as K2P, HKY, and TN93, gave similar estimates, whereas simpler models, such as JC and FEL, gave slightly lower estimates, as expected (Yang 1994; data not shown). There is insufficient information in two-species data sets to effectively use more complex models that include rate variation among sites, and separate parameters for each branch.
Table 2.
A. Substitutions per Site in 4D Sites
| Base replaced \ Base arising | A | C | G | T |
| A | - | 0.0232 | 0.0623 | 0.0218 |
| C | 0.0232 | - | 0.0197 | 0.0613 |
| G | 0.0623 | 0.0197 | - | 0.0238 |
| T | 0.0218 | 0.0613 | 0.0238 | - |
B. Substitutions per Site in AR Sites
| Base replaced \ Base arising | A | C | G | T |
| A | - | 0.0223 | 0.0748 | 0.0213 |
| C | 0.0223 | - | 0.0136 | 0.0748 |
| G | 0.0748 | 0.0136 | - | 0.0224 |
| T | 0.0213 | 0.0748 | 0.0224 | - |
For each type of substitution, the expected number of substitutions of that type per site is estimated from the REV model using data in Table 1. Estimates are for the combined number of substitutions of the given type in both the primate and rodent lineages since they diverged from their common ancestor.
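To make the step from observed changes (Table 1) to estimated substitutions per site (Table 2) concrete, here is a minimal sketch of a closed-form distance for a general time-reversible model between two sequences, d = −trace(Π log(Π⁻¹F)), where F is the symmetrized matrix of observed pair frequencies and Π holds the base frequencies. This is an assumption-laden illustration: the text does not state that the REV estimates were obtained with this formula, and the function names and the power-series matrix logarithm are choices made here for a dependency-free example.

```python
# Observed pair frequencies from Table 1A
# (rows: human A, C, G, T; columns: mouse A, C, G, T).
TABLE_1A = [
    [0.1779, 0.0246, 0.0499, 0.0192],
    [0.0135, 0.1597, 0.0169, 0.0348],
    [0.0358, 0.0169, 0.1641, 0.0134],
    [0.0187, 0.0495, 0.0257, 0.1793],
]


def matmul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def rev_distance(pair_freqs, terms=200):
    """Estimate substitutions per site under a time-reversible model.

    Uses log(P) = -sum_{k>=1} (I - P)^k / k, which converges when the
    two sequences are not too diverged (eigenvalues of I - P below 1).
    """
    n = len(pair_freqs)
    total = sum(map(sum, pair_freqs))
    # Time-reversibility implies a symmetric F in expectation.
    f = [[(pair_freqs[i][j] + pair_freqs[j][i]) / (2.0 * total)
          for j in range(n)] for i in range(n)]
    pi = [sum(row) for row in f]
    # m = I - P, with P = diag(pi)^-1 F
    m = [[(1.0 if i == j else 0.0) - f[i][j] / pi[i] for j in range(n)]
         for i in range(n)]
    log_p = [[0.0] * n for _ in range(n)]
    power = [row[:] for row in m]
    for k in range(1, terms + 1):
        for i in range(n):
            for j in range(n):
                log_p[i][j] -= power[i][j] / k
        power = matmul(power, m)
    return -sum(pi[i] * log_p[i][i] for i in range(n))


if __name__ == "__main__":
    print(round(rev_distance(TABLE_1A), 3))
```

Applied to Table 1A, this estimator should land close to the 0.42 substitutions per site quoted above; small differences from a full maximum-likelihood fit are to be expected.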
With two million data points, approximately one-third of them at each GC level, the asymmetries between human and mouse in the frequencies of observed changes (especially in the transition rates at extreme GC contents) are statistically significant. They indicate some lack of time-reversibility or of stationarity, which very likely creates inaccuracies in the REV estimates; these estimates will need to be reexamined when large data sets from other mammals become available. Unfortunately, pooling of the data into such large data sets also introduces inaccuracies because of the regional variability in the substitution process. Indeed, if the REV model is applied separately to the 4D sites in every 1-Mb region of the human genome that has at least 1000 4D sites (including CpGs), and an average of the resulting regional estimates of the number of substitutions per site is taken, weighted by the number of 4D sites in each window, then the resulting genome-wide average is 0.467 substitutions per site (Waterston et al. 2002), very similar to what we observe in the highest third of the GC range when we combine all that data and do one estimate. Therefore, one must also be careful in how one breaks down the data when making genome-wide estimates of substitution rates.
Bases at 4D sites are not a perfect data source for models of neutral evolution. They can sometimes be under selection for their role in mRNA splicing and other nuclear functions. In some species, 4D sites show biased base frequencies relating to differences in tRNA abundances, also indicating possible selection effects. Bernardi and coworkers have suggested human 4D sites are under selection (Bernardi 1995, 2001), but others argue that it has not been convincingly shown that tRNA-abundance-based codon bias or other kinds of selection affect 4D-site substitution rates in mammals (Graur and Li 2000; Iida and Akashi 2000). Also, the flanking bases can have a significant impact on substitution rates, as with the hypermutable CpG sites, but flanking bases are not equally represented in 4D sites. For instance, no bases 5′ to a 4D site are ever an A. For these reasons, we suggest another data source for modeling neutral evolution that we call an ancestral repeat, or AR, site (Waterston et al. 2002).
AR sites are aligned nucleotides within copies of transposable elements that were fixed in the common ancestor of human and mouse. A set of such elements was selected based on an average divergence level in human indicative of an age predating the mammalian radiation, and the whole-genome alignments confirmed that individual copies are at orthologous sites in human and mouse (Waterston et al. 2002). Thus, copies of these elements were already present as interspersed repeats in the common ancestor of human and mouse. We chose to focus on these sequences as they are highly likely to have been under no functional constraint. In contrast, single-copy DNA not annotated as exons can contain unidentified coding regions, RNA-coding genes, and other functional sequences, and thus does not provide a good model for neutral evolution. The ancestral repeats are abundant: Half of the interspersed repeats identifiable in the human genome with RepeatMasker (Smit and Green 1999) predate the human–mouse split (22% of all human DNA). The distribution of all ancestral repeats is uniform across the human genome, with little bias toward A + T-rich or G + C-rich DNA (Lander et al. 2001). Orthologous ancestral repeats can be reliably found and aligned by extending alignments of nearby unique genomic DNA (Schwartz et al. 2003).
We identified ∼165 million aligned AR sites in our human–mouse alignment. The overall observed percent identity in these AR sites is 66.7%, and varies between 66.1% and 67.0% in the three different GC levels. These numbers increase only slightly to 66.6%–67.9% if we exclude CpGs, in contrast with the big increases observed in 4D sites. Observed substitutions in medium-GC regions, not excluding CpGs, are shown in Table 1B, and for other cases in the Supplementary Material. Estimates from the REV model are ∼0.46 substitutions per site for medium-GC content, not excluding CpGs (Table 2B), and vary from 0.44–0.48 in the other cases, including cases in which CpGs are excluded and cases in which they are not (Supplementary Material). Overall, the substitution levels in AR are roughly similar to the 4D sites when CpGs are included, but show a generally higher number of transitions, are much less affected by the removal of CpGs, and show less asymmetry between human and mouse as well. Hence AR sites provide a different, and possibly better model of neutral evolution.
If we use a range of 65–105 Mya (million years ago) as an estimate for the origins of the eutherian orders, and the above estimates of 0.44–0.48 substitutions per site from the AR sites data, then we obtain estimates of the rate of neutral substitution in the range of 2.1–3.7 × 10^−9 substitutions per site per year, averaged over both lineages, which includes most published estimates (Li et al. 1985; Kondrashov and Crow 1993; Kumar and Subramanian 2002).
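The arithmetic behind this range can be checked directly: the estimated substitutions per site accumulate along both lineages, so the per-lineage rate is the estimate divided by twice the divergence time. The function name `neutral_rate` below is hypothetical.

```python
def neutral_rate(subs_per_site, divergence_time_years):
    """Average per-lineage rate in substitutions per site per year.

    The two lineages each span the divergence time, so the total
    branch length is twice that time.
    """
    return subs_per_site / (2.0 * divergence_time_years)


if __name__ == "__main__":
    slowest = neutral_rate(0.44, 105e6)  # fewest substitutions, oldest split
    fastest = neutral_rate(0.48, 65e6)   # most substitutions, youngest split
    print(f"{slowest:.2e} to {fastest:.2e} substitutions per site per year")
```

Pairing the lowest substitution estimate with the oldest date and the highest estimate with the youngest date gives the extremes of the range.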
Inference of Rate of Deletion
Analysis of the genome-wide alignments between human and mouse indicates that the majority of the nonaligning regions that are not identifiable as insertions of lineage-specific transposons represent deletions in the other species since divergence from their common ancestor (Waterston et al. 2002; see also Ogata et al. 1996). Thus, to measure regional variation in mouse-lineage deletion rate, we assume the human DNA not occupied by primate lineage-specific repeats represents the DNA that shares a common ancestor with mouse, and we use the nonaligning fraction of this DNA (NAanc) as a rough estimate of the amount of DNA deleted from mouse. In the following sections, we discuss how this measure and the two measures of neutral substitution rate covary in the human genome.
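The definition of NAanc can be written as a short calculation. This sketch assumes per-window base counts are already available and that aligned bases are counted only outside lineage-specific repeats; the function and parameter names are hypothetical.

```python
def nonaligning_ancestral_fraction(window_bases,
                                   lineage_specific_repeat_bases,
                                   aligned_ancestral_bases):
    """Fraction of presumed-ancestral human DNA that does not align to mouse.

    Ancestral DNA is taken to be everything in the window that is not a
    primate lineage-specific repeat.
    """
    ancestral = window_bases - lineage_specific_repeat_bases
    if ancestral <= 0:
        raise ValueError("window contains no ancestral DNA")
    if not 0 <= aligned_ancestral_bases <= ancestral:
        raise ValueError("aligned bases must lie within the ancestral DNA")
    return 1.0 - aligned_ancestral_bases / ancestral


if __name__ == "__main__":
    # Hypothetical 1-Mb window: 250 kb lineage-specific repeats,
    # 300 kb of the remaining 750 kb aligns to mouse.
    print(nonaligning_ancestral_fraction(1_000_000, 250_000, 300_000))
```

Removing lineage-specific repeats from the denominator keeps recent human insertions, which could never align to mouse, from being counted as mouse deletions.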
Large-Scale Regional Variation and Covariation in Rates of Substitution
As noted above, significant variation in the level of sequence conservation between human and mouse from locus to locus has been reported in several studies of long DNA sequences of single loci (Hardison et al. 1991, 1997; Koop 1995; DeBry and Seldin 1996; Göttgens et al. 2001; Shiraishi et al. 2001; Wilson et al. 2001) and in comparative studies of gene sequences in these and other mammals (Wolfe et al. 1989; Bernardi 1993, 1995; Casane et al. 1997; Matassi et al. 1999; Williams and Hurst 2000; Lercher et al. 2001; Castresana 2002a,b), albeit with some dissenting analysis (Williams and Hurst 2000; Kumar and Subramanian 2002). With ∼700 4D sites/Mb and 50,000 AR sites/Mb genome-wide, we were able to do a much larger scale study of regional variation in rates of substitution, and found correlated fluctuations in regional substitution rates for both types of sites.
A series of 3038 5-Mb windows of human DNA was taken from the June 2002 assembly of the human genome with consecutive windows overlapped by 4 Mb. From the 4D sites in each window we estimated the quantity t4D, the expected number of substitutions per 4D site in the evolution of human and mouse from their common ancestor, and from the AR sites the analogous quantity, tAR. These estimates were made using the REV model of nucleotide evolution, using only data within the window. Windows with <800 4D sites were discarded. (Simulation experiments on 5000 replicates showed that this ensures that the standard deviations of the estimates of tAR and t4D will be <0.0338 and 0.0319, respectively.) This left 2504 windows for analysis, all of which had at least 800 4D sites and at least 4666 AR sites. The standard deviation of t4D in this data set was 0.0702, compared with a sampling standard deviation of 0.0209, and the standard deviation of tAR was 0.0187, compared with a sampling standard deviation of 0.0030, indicating that the observed regional variation cannot be explained from sample size effects. (The sampling deviations were computed using random replicates as above, chosen to have the same numbers of sites and base compositions as the actual data.) Even accounting for the smaller sampling deviation, the variation in tAR is substantially less than that in t4D.
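The windowing scheme described above (5-Mb windows, consecutive windows overlapping by 4 Mb, windows with fewer than 800 4D sites discarded) can be sketched as below. The function names are hypothetical, positions are assumed to be sorted 0-based coordinates on one chromosome, and partial windows at the chromosome end are simply dropped, which may differ from the original procedure.

```python
from bisect import bisect_left


def sliding_windows(chrom_length, size=5_000_000, step=1_000_000):
    """Half-open (start, end) windows; a 1-Mb step gives a 4-Mb overlap."""
    return [(start, start + size)
            for start in range(0, chrom_length - size + 1, step)]


def count_sites(sorted_positions, start, end):
    """Number of positions p with start <= p < end."""
    return bisect_left(sorted_positions, end) - bisect_left(sorted_positions, start)


def usable_windows(chrom_length, sorted_4d_positions, min_sites=800):
    """Windows that keep at least `min_sites` 4D sites."""
    return [(start, end)
            for start, end in sliding_windows(chrom_length)
            if count_sites(sorted_4d_positions, start, end) >= min_sites]


if __name__ == "__main__":
    # Toy chromosome: one 4D site every 5 kb over the first 6 Mb only.
    positions = list(range(0, 6_000_000, 5_000))
    print(usable_windows(12_000_000, positions))
```

Binary search on sorted positions keeps the site count per window cheap even when windows overlap heavily.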
The above analysis was repeated with a set of 510 nonoverlapping 5-Mb windows, each with at least 800 4D sites, and with a set of 1430 nonoverlapping 1-Mb windows, each with at least 400 4D sites. The results also showed variance larger than can be explained from sample size effects. Genome-wide, using the nonoverlapping windows we found t4D and tAR to be very significantly correlated (r² = 0.26 for 5-Mb windows and 0.27 for 1-Mb windows), often showing quite similar behavior along a human chromosome, as well as correlation with deletion rate NAanc (see below), as shown in Figure 1 for all human chromosomes. This suggests that some regional chromosome property is leading to a variable rate of substitution in different parts of the chromosome.
One possibility is that regional variation in GC content accounts for the covariation between t4D and tAR, so that this is entirely a function of isochore structure (Bernardi 1986, 2000; Hurst and Williams 2000; Eyre-Walker and Hurst 2001). This is explored in detail below, in a combined analysis that also includes the other measures of divergence that we examine. We factor out the effects of GC content by computing residuals of a quadratic regression of t4D and tAR on GC content, and then compute the correlation between the residuals. The resulting residuals are plotted for 5-Mb overlapping windows along all the human chromosomes in Figure 1. By factoring out GC content in this way, the genome-wide correlation between t4D and tAR is actually enhanced (r² jumps from ∼0.22 to 0.33 for overlapping 5-Mb windows, and similarly for nonoverlapping 5-Mb windows; see Figure 4B below), indicating that GC content does not fully explain this correlation, and other factors must also be at work (Waterston et al. 2002).
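The procedure of factoring out GC content can be illustrated with a small least-squares sketch: fit each divergence measure as a quadratic function of GC content, keep the residuals, and correlate the residuals. The code below is a generic illustration with hypothetical function names and synthetic numbers; it is not the software used for the analysis. GC values are centered before fitting purely for numerical stability, which does not change the residuals.

```python
def solve(a, b):
    """Solve a small linear system by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def quadratic_residuals(gc, y):
    """Residuals of the least-squares fit y = b0 + b1*gc + b2*gc**2."""
    mean_gc = sum(gc) / len(gc)
    x = [g - mean_gc for g in gc]
    sums = [sum(v ** k for v in x) for k in range(5)]
    normal = [[sums[i + j] for j in range(3)] for i in range(3)]
    rhs = [sum(yi * xi ** i for xi, yi in zip(x, y)) for i in range(3)]
    b0, b1, b2 = solve(normal, rhs)
    return [yi - (b0 + b1 * xi + b2 * xi * xi) for xi, yi in zip(x, y)]


def r_squared(u, v):
    """Squared Pearson correlation between two equal-length lists."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv * suv / (suu * svv)


if __name__ == "__main__":
    # Synthetic windows: both measures depend on GC and share a regional signal.
    gc = [0.35 + 0.01 * i for i in range(11)]
    regional = [0.01 * (-1) ** i for i in range(11)]
    t4d = [1.0 - 3.0 * g + 4.0 * g * g + s for g, s in zip(gc, regional)]
    tar = [0.8 - 2.0 * g + 3.0 * g * g + 0.5 * s for g, s in zip(gc, regional)]
    print(r_squared(quadratic_residuals(gc, t4d), quadratic_residuals(gc, tar)))
```

In the synthetic example the two measures share a regional signal unrelated to GC content, so their residuals remain correlated after the quadratic GC effect is removed, which is the pattern the text describes for t4D and tAR.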
It has been noted that rates of nonsynonymous and synonymous changes in genes are correlated (Graur 1985; Li et al. 1985; Ticher and Graur 1989; Lercher et al. 2001). Correlations between rates in coding regions and rates in introns have also been observed in some studies (Castresana 2002b) but not in others (Hughes and Yeager 1998) (see critique in Smith and Hurst 1998). Correlations between rates in coding regions and UTR regions have also been observed (Makalowski and Boguski 1998b). We find that regional variation in substitution rates in AR sites is also significantly correlated with variation in sites in and around genes.
Specifically, using the REV model as above, let tintron, tcoding, t5′UTR and t3′UTR be the estimated number of substitutions per site in aligned positions from a given window from sites in intron, coding exons, 5′-UTR and 3′-UTR regions, respectively. (AR sites are excluded from introns in this calculation.) Let dN, dS, and dN/dS be the rates of nonsynonymous substitutions per nonsynonymous site, synonymous substitutions per synonymous site, and their ratio, computed by the method of Goldman and Yang (1994), also described as the ML method of Yang and Nielsen (2000), using the PAML software package of Yang (1997). These quantities are computed from the same set of human RefSeq genes used to collect 4D sites. We computed the correlation between tAR and each of these gene-feature substitution rates, measured in a set of 510 nonoverlapping 5-Mb windows, and found significant (P < 0.001) but not always large correlation with all of them except dN and dN/dS (r² = 0.44 for tintron, 0.06 for tcoding, 0.06 for t5′UTR, 0.05 for t3′UTR, 0.003 for dN, 0.15 for dS, and 0.03 for dN/dS). Correlations with t4D were higher (r² = 0.46 for tintron, 0.43 for tcoding, 0.18 for t5′UTR, 0.34 for t3′UTR, 0.07 for dN, 0.86 for dS, and 0.03 for dN/dS), and significant with P < 0.001 for all but dN/dS. Similar results were obtained for 1-Mb windows. Figure 2A shows how tintron, tcoding, t5′UTR and t3′UTR vary with tAR and t4D along human Chromosome 22 for overlapping 5-Mb windows. As above, the correlations are also present after factoring out GC content (Fig. 2B). The experiments were repeated removing CpG sites from the calculation of all quantities except dN and dS, and the results were very similar (data not shown). These data indicate that substitutions in all sites in and around genes, with the possible exception of nonsynonymous substitutions at nonsynonymous sites, are affected by the same conditions that cause regional covariation in t4D and tAR substitution rates.
Covariation of Rates of Substitution With Rates of Deletion and Density of Lineage-Specific Transposable Elements
The fraction of human DNA aligning with mouse varies among chromosomes (Fig. 3). The portion of human DNA that is nonrepetitive or in ancestral repeats (i.e., non-lineage-specific repeats) is the DNA that is likely derived from the common ancestor to human and mouse, and hence it is the portion that could align with mouse. As discussed above, the fraction of the human genome derived from the common ancestor that does not align with mouse (NAanc) is an estimate of the amount of DNA deleted from the mouse lineage. This function also varies among chromosomes (Fig. 3). One extreme is illustrated by Chromosome 19, which has the smallest fraction aligning, the largest fraction of lineage-specific repeats, and a substantial amount inferred as deleted. In contrast, Chromosome 20 is about the same size but has a substantial fraction aligning, a roughly average fraction of repeats, and one of the smaller amounts of inferred deletion.
When measured in the same 5-Mb overlapping windows as above, the estimate of amount of DNA deleted in mouse, NAanc, varies widely across the genome, and it tends to covary with both measures of substitutions per neutral site (Fig. 1). This graph shows the variation in NAanc, t4D, and tAR after factoring out the effect of GC content, as discussed below. The pairwise correlations among the three divergence measures are positive and highly significant (p < 0.001), whether measured in nonoverlapping windows of 1 Mb (Fig. 4A) or 5 Mb (Fig. 4B). Thus the rate of nucleotide substitution at two different types of neutral sites and NAanc covary in large regions of DNA. Some part of this effect could be explained by an ascertainment bias, because ancestral DNA in the faster-evolving regions will be harder to align. However, as discussed above, it is likely that a substantial portion of the nonaligning DNA reflects deletions. To the extent that NAanc reflects deletions, these data demonstrate a correlation between neutral substitutions and deletions.
The relationship of these divergence measures with the frequency of insertion of transposable elements was then examined, using the proportion of DNA in a window composed of lineage-specific repeats (RepLS) as an estimate of the frequency of insertion and retention of transposons. This function has a strong positive correlation with NAanc, but a negative correlation with tAR and no correlation with t4D (for the original data) at all window sizes and configurations tested (Fig. 4). Thus the frequency of insertions of several families of retrotransposons covaries with the inferred deletion rate, but the relationships with substitution rates are complex. The type of correlation observed with any measure of divergence depends to some extent on the particular families and ages of repeats included in the comparison. For example, in contrast to the correlations seen for RepLS, if the analysis is confined to the density of lineage-specific LTRs (LtrLS), a significant positive correlation is observed with all three measures of divergence in most window configurations (Fig. 4). A full examination of the correlations with different families of repeats will be the subject of other studies.
The positive correlation between NAanc and RepLS was confirmed by a randomization study using methods described in Chiaromonte et al. (2001). The positions of interspersed repeats were randomized independently 100 times, while keeping the alignment constant. None of the randomized data sets showed a correlation as strong as the overall data, giving an empirical P-value < 0.01. Local correlations were also computed at the 10-kb scale, both for the original data and for the 100 randomized data sets. The distribution of local correlations is plotted in Figure 5, along with envelopes derived from the randomizations. In comparison with the “null” scenario represented by these envelopes, the histogram of the actual data shows a significant concentration on large positive values, demonstrating that covariation between deletion and insertion can be detected also at much smaller scales on individual chromosomes. Similar results (data not shown) were obtained for all chromosomes and various window sizes.
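The logic of the empirical P-value can be sketched with a simple permutation test. This is a stand-in, not the method of Chiaromonte et al. (2001): the study randomized the positions of interspersed repeats while holding the alignment fixed, whereas the sketch below just shuffles per-window values. The function names and the add-one convention for the P-value are assumptions made here; with 100 randomizations and none as extreme as the data, that convention gives 1/101, consistent with P < 0.01.

```python
import random


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def empirical_p_value(x, y, n_random=100, seed=0):
    """One-sided empirical P-value for a positive correlation.

    Counts randomized data sets whose correlation is at least as large
    as the observed one, then applies the (k + 1) / (n + 1) convention.
    """
    rng = random.Random(seed)
    observed = pearson_r(x, y)
    shuffled = list(y)
    as_strong = 0
    for _ in range(n_random):
        rng.shuffle(shuffled)
        if pearson_r(x, shuffled) >= observed:
            as_strong += 1
    return (as_strong + 1) / (n_random + 1)


if __name__ == "__main__":
    # Hypothetical per-window values with a strong positive relationship.
    na_anc = list(range(50))
    rep_ls = [3 * v + v % 5 for v in na_anc]
    print(empirical_p_value(na_anc, rep_ls))
```

The randomized data sets define the "null" distribution of correlations; the observed value is judged by how rarely that distribution reaches it.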
Covariation With the Frequency of Polymorphisms in Human
We also examined the density of single-nucleotide polymorphisms, as compiled by The SNP Consortium (Sachidanandam et al. 2001), for association with these divergence measures, and again obtained strong positive correlations (Fig. 4). An exception is NAanc, which shows no correlation with SNP density at 1-Mb windows, but does at 5-Mb windows, perhaps reflecting the greater amount of data in each window. Thus the frequency of nucleotide substitutions accumulating recently in human populations correlates with several measures of divergence between human and mouse. This can be explained by regional variation in substitution rates, both recently (SNPs) and long-term (human–mouse divergence).
Covariation With Recombination Rate in Human
Human meiotic recombination rates (Kong et al. 2002) are positively and significantly correlated with divergence at neutral sites and with the density of human polymorphisms (Fig. 4). The correlation between recombination rate and NAanc is significant with 1-Mb windows but not with 5-Mb windows (Fig. 4). We also note that the high-density genetic map of Chromosome 22 (Dawson et al. 2002) shows regions of high linkage disequilibrium (low recombination) that correspond to the regions of low divergence and lower inferred deletions. In contrast, a negative correlation is seen between recombination rate and the frequency of insertions of lineage-specific repeats, both for all families and for the lineage-specific LTRs (Fig. 4).
GC Content Correlates With Variation in Conservation, But in Opposite Ways for Low-GC and Moderate- to High-GC DNA, and Does Not Fully Explain the Variation
The physical and biological properties of genomic DNA may be strong contributors to the variation in conservation, but some previous studies have led to differing conclusions. For instance, some of the studies cited above have found regional variation in substitution rates to be significantly correlated with fluctuations in the G + C content of the aligned human bases, or with the difference between G + C content in the aligned human and mouse bases (Castresana 2002a), whereas others have not found significant correlations (see discussion in Hurst and Williams 2000). The whole-genome alignments provide an opportunity to carry out a more comprehensive analysis.
The relationship of tAR and t4D with GC content (fraction GC, or fGC) is not linear, but is better fit by a quadratic relationship (Hurst and Williams 2000; Waterston et al. 2002; see also Bernardi 2001). Indeed, substitutions, NAanc, RepLS, and LtrLS all show a quadratic relationship when plotted against GC content (Fig. 6). The divergence tends to decrease with fGC for the portion of the genome with a lower GC content, whereas the divergence tends to increase with fGC for the portion of the genome that is higher in GC content. These data can be fit to a quadratic expression, with a negative coefficient for fGC but a positive coefficient for the square of fGC. The quadratic fits for t4D, tAR, NAanc, and LtrLS on fGC have r² of 24.0%, 11.0%, 10.2%, and 18.7%, respectively, for 5-Mb nonoverlapping windows. This implies that fluctuations in GC content predict an appreciable amount of the regional variation we see in neutral substitution rates and deletions, but still leaves the majority of this variation to be explained, because there is little sampling variance in the rates estimated in these large windows. The dependence of recombination rate on fGC is also fit by a quadratic function, but the curvature is opposite to that seen for substitutions and NAanc (Fig. 6).
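For concreteness, the quadratic fit described above can be written out as follows. The symbols β0, β1, β2, and ε are notation introduced here for illustration; y stands for any one divergence measure in a window.

```latex
\begin{align}
y &= \beta_0 + \beta_1\, f_{GC} + \beta_2\, f_{GC}^{2} + \varepsilon,
   \qquad \beta_1 < 0,\quad \beta_2 > 0, \\
f_{GC}^{*} &= -\frac{\beta_1}{2\beta_2}, \\
r^{2} &= 1 - \frac{\sum_i (y_i - \hat{y}_i)^{2}}{\sum_i (y_i - \bar{y})^{2}}.
\end{align}
```

Under these sign constraints the fitted curve has a minimum at f*GC, so divergence falls with fGC below that value and rises above it. For recombination rate, the opposite curvature corresponds to β2 < 0.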
Next, we considered the change in GC content between human and mouse (dGC, expressed as the difference between human GC and mouse GC in aligning segments in the windows) as an additional predictor. Fitting expressions comprising fGC, dGC, and their squares, we obtain significant gains in explained variability for some but not all functions. Combining second-order effects of fGC and dGC, we can predict 29.9% of the variation in t4D, whereas the explained variation in tAR and NAanc increases only slightly, to 12.8% and 10.4%, respectively. Another potential predictor is the density of CpG dinucleotides. However, adding second-order effects of the CpG density to those of fGC and dGC increased the explained variation very little, to 30.8%, 13.0%, and 11.6% for t4D, tAR, and NAanc, respectively. Summary statistics for all three predictors and for all measures of divergence are given in Table 3.
Table 3.
Variable | Minimum | Maximum | Mean | Median | SD |
tAR | 0.39111 | 0.53377 | 0.46317 | 0.46389 | 0.01929 |
t4D | 0.20867 | 0.85917 | 0.44608 | 0.43759 | 0.06628 |
NAanc | 0.29468 | 0.80029 | 0.50696 | 0.49322 | 0.08460 |
SNP | 0.00008 | 0.00085 | 0.00039 | 0.00039 | 0.00010 |
Rec | 0.0280 | 4.2156 | 1.2773 | 1.1376 | 0.6954 |
LtrLS | 0.01581 | 0.20228 | 0.05671 | 0.05544 | 0.01577 |
RepLS | 0.16229 | 0.68071 | 0.28538 | 0.27807 | 0.05284 |
fGC | 0.32202 | 0.61873 | 0.40956 | 0.39808 | 0.05121 |
dGC | −0.06588 | 0.08245 | −0.01127 | −0.01346 | 0.01985 |
CpG density | 0.00019 | 0.08102 | 0.00787 | 0.00544 | 0.00798 |
Data are from the 2489 windows of 5 Mb (overlapping by 4 Mb). Windows were filtered so that they contained at least 800 4D and AR sites and well-defined values for the other attributes.
As discussed in the analysis of the mouse genome (Waterston et al. 2002), because all divergence measures are predicted to some extent by GC content, the latter constitutes a confounding variable in evaluating their covariation. The predictors dGC and CpG density may have a similar effect. To account for this, pairwise correlations were computed not just among divergence variables, but also among residuals from their quadratic regressions on fGC, dGC, and the density of CpG dinucleotides. In most cases, passing to residuals enhanced the observed correlations (Fig. 4). This effect was dramatic in some cases, such as the correlation of t4D with LtrLS. The fact that correlations are significant and often enhanced after removing the effect of fGC, dGC, and CpG density confirms that additional factors beyond GC and CpG content are needed to explain the covariation among divergence variables.
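The residual-based correlation described above can be illustrated with a short sketch. It is not the MINITAB procedure used in the study: the least-squares solver, the function names (`solve`, `quadratic_residuals`, `pearson`), and the choice to fit via normal equations are all assumptions made for a self-contained example.

```python
def solve(a, b):
    """Solve the square system a*x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def quadratic_residuals(y, predictors):
    """Residuals of y after regression on an intercept plus each predictor
    and its square (no interaction terms)."""
    cols = [[1.0] * len(y)]
    for p in predictors:
        cols.append(list(p))
        cols.append([v * v for v in p])
    xtx = [[sum(u * v for u, v in zip(ci, cj)) for cj in cols] for ci in cols]
    xty = [sum(u * v for u, v in zip(ci, y)) for ci in cols]
    coef = solve(xtx, xty)
    fitted = [sum(c * col[i] for c, col in zip(coef, cols)) for i in range(len(y))]
    return [obs - fit for obs, fit in zip(y, fitted)]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5
```

Calling `pearson(quadratic_residuals(a, [fgc, dgc, cpg]), quadratic_residuals(b, [fgc, dgc, cpg]))` gives the correlation between two measures after the second-order effects of the three predictors are removed. When two measures respond to GC content in opposite directions but share another underlying factor, the residual correlation can exceed the raw one, which is the pattern reported for several pairs in Figure 4.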
DISCUSSION
These results show that six measures of change in chromosomal DNA vary regionally in their rates, and those rates covary. Thus substitutions, deletions, insertions, and recombinations are all correlated, and changes occurring over ∼65–90 million years (Li et al. 1985; Kondrashov and Crow 1993; Archibald et al. 2001; Huchon et al. 2002) correlate with polymorphisms arising much more recently in the human populations. These results indicate that some regions of the human genome are changing slowly by all processes that alter DNA, whereas others change faster. Because of this, it is challenging to develop a single criterion for likely selection that will be effective when applied to all regions of the genome; a simple similarity cutoff will not work well globally (Hardison 2000; Pennacchio and Rubin 2001).
The regional variation in rates of divergence is partially predicted by human GC content. The relationship with GC content is complex and best fit by a quadratic function. The divergence decreases with GC content for low-GC DNA and increases with GC content for higher-GC DNA, consistent with important differences in the patterns of evolution in these two classes of genomic DNA (Bernardi 1995, 2001; Fullerton et al. 2001; Castresana 2002a).
GC content is an explanatory variable in the sense that it allows one to predict a certain portion of the variability of each divergence measure. This raised the possibility that the divergence measures vary together simply because they all vary with GC content. We show that this is not the case, because removing the components explained by GC content enhances, instead of suppressing, the covariation. Thus in some ways GC content is a confounding variable in this type of analysis.
It has been suggested that disruption of isochores in the mouse lineage has been associated with rapid divergence in some regions of the genome (Castresana 2002a). However, introducing change in GC content as an additional predictor after GC content still does not explain most of the variation in the measures of divergence examined. A similar effect is seen for CpG density. Removing the components explained by GC content, difference in GC content between human and mouse, and CpG density preserves or enhances covariation.
It has been proposed that slowly changing genomic regions are under stronger selection (Shabalina et al. 2001), but purifying selection is unlikely to explain fully the variation in divergence described here. The regions examined are very large (5 Mb), so a substantial fraction of those regions would have to be functional for direct selection to act on them. Substitutions have been measured not only at 4D sites in coding regions but also in ancestral repeats. The latter are good models for neutral DNA, and it is particularly difficult to imagine how selection could be working on defunct transposable elements. Lower levels of polymorphism in regions of lower recombination have been explained by background selection against deleterious mutations also removing linked polymorphisms (Begun and Aquadro 1992; Charlesworth 1994; Hudson and Kaplan 1995), but it is not clear that this effect, or genetic hitchhiking (Maynard Smith and Haigh 1974), can be extended to divergence between mouse and human. Rather, it appears that some property of genomic DNA, or its location, makes it a more or less active template for several aspects of DNA metabolism. Regional variation in rates of mutation would be expected to lead to variation in rates of substitution in AR, 4D, intron, UTR, and other types of sites, and to significant correlations between those substitution rates, as we observe.
The correlation of neutral substitution with recombination frequency, insertion rate, and deletion rate points to a class of potential explanations for the variation in divergence involving regional variations in the frequency of double-stranded breaks, which may be prone to faulty repair (Lercher and Hurst 2002). Other effects associated with differences in DNA repair could also be responsible for regional variation in divergence, as briefly reviewed by Matassi et al. (1999). Through these or other repair or mutation mechanisms, the proximity of a DNA sequence to segments involved in some aspects of nuclear metabolism could also affect its underlying rate of divergence. We have no evidence that proximity to particular types of transcription units can explain the fluctuations we see and, as discussed by Eyre-Walker and Hurst (2001), the relationships among GC content, neutral substitution rate, and differences in the timing of replication in S-phase (Wolfe 1991; Gu and Li 1994) are potentially quite complex and cannot be resolved with this type of data. However, future examination of intranuclear localization, including proximity to matrix attachment sites, pericentromeric heterochromatin, the nuclear membrane, and sites of chiasmata during meiosis, may show significant correlations with divergence.
Bioinformatics Resources
The source code for BLASTZ is available at http://bio.cse.psu.edu. Precomputed percent identity plots (pip) of all alignments are available at the PipDispenser (http://bio.cse.psu.edu). Entering either a location in the human genome or a RefSeq gene name will return a pip of the 1-Mb interval including the query. Aligning regions, measures of level of conservation, and nucleotide-level alignment are available from the UCSC Human Genome Browser (http://genome.ucsc.edu; Kent et al. 2002), using the Mouse Cons track.
METHODS
Generating Whole-Genome Alignments Using BLASTZ
The program BLASTZ was used to align the human and mouse genome assemblies on a 1024-node Pentium III cluster, as described by Schwartz et al. (2003) and the MGSC (Waterston et al. 2002). The alignments were processed to get single coverage of human sequences with mouse sequences using the program axtBest (Schwartz et al. 2003). The human assembly from June 2002 was aligned to the February 2002 assembly of mouse.
Measurements of Divergence, Recombination, and Polymorphisms
The function tAR was calculated as the number of substitutions per site in ancestral repeats, determined by the REV model on the observed base changes in ancestral repeats. The function t4D was calculated as the number of substitutions per site in fourfold degenerate sites that were preceded by matches in the other two positions of the codon, determined by the REV model on the observed base changes at these sites. (An alternate definition that did not require the 4D site to be preceded by two matches produced similar data with slightly higher divergence; see Supplementary Material.) When searching for 4D sites, the RefSeq (Pruitt and Maglott 2001) alignments to the human genome were used and checked to ensure that the human CDS begins with a start codon, ends with a stop codon, and has no in-frame stop codons; that human introns are GT/AG, GC/AG, or AT/AC; and that the aligned mouse sequence has no in-frame stop codons except in the last 20 codons of the human gene.
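The selection of 4D sites described above can be sketched as follows. This example only counts sites and raw third-position differences; it does not implement the REV model used to convert observed changes into substitutions per site. The function name `count_4d_sites`, its arguments, and the assumption that the two coding sequences are already aligned in frame without gaps are all introduced here for illustration. In the relaxed mode, requiring both codon prefixes to belong to a fourfold family is likewise an assumption about how the alternate definition might be applied.

```python
# First two codon bases that define the eight fourfold degenerate families
# in the standard genetic code (Leu, Val, Ser, Pro, Thr, Ala, Arg, Gly).
FOURFOLD_PREFIXES = frozenset({"CT", "GT", "TC", "CC", "AC", "GC", "CG", "GG"})


def count_4d_sites(human_cds, mouse_cds, require_matching_prefix=True):
    """Return (number of 4D sites, number of third-position differences).

    Both inputs are assumed to be gap-free, in-frame, aligned coding
    sequences of equal length (hypothetical preprocessing).
    """
    if len(human_cds) != len(mouse_cds):
        raise ValueError("sequences must be aligned to the same length")
    sites = diffs = 0
    for i in range(0, len(human_cds) - 2, 3):
        h = human_cds[i:i + 3].upper()
        m = mouse_cds[i:i + 3].upper()
        if h[:2] not in FOURFOLD_PREFIXES:
            continue
        if require_matching_prefix:
            # Strict definition: first two codon positions must match.
            if m[:2] != h[:2]:
                continue
        elif m[:2] not in FOURFOLD_PREFIXES:
            continue
        if h[2] not in "ACGT" or m[2] not in "ACGT":
            continue  # skip ambiguous bases
        sites += 1
        diffs += h[2] != m[2]
    return sites, diffs
```

Because the relaxed mode admits codons whose first two positions have already diverged, it samples sites from more rapidly changing codons, which fits the slightly higher divergence noted for the alternate definition.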
The portion of the human genome derived from the common ancestor to mouse and human (the “ancestral part of the genome”) is approximately the portion of the genome not in lineage-specific repeats. This part of the human genome was identified by analyzing the output of the program RepeatMasker (Smit and Green 1999). Lineage-specific repeats are those that are not ancestral. The fraction of this ancestral part of the genome that aligns (alnanc) was calculated as the number of aligned bases (disregarding intra-alignment gaps) in DNA that is not in lineage-specific repeats, divided by the total amount of DNA that is not in lineage-specific repeats in each window. The nonaligning portion (NAanc) is 1 − alnanc, and is our estimate of the fraction of the human genome likely deleted from mouse.
The frequency of insertions was monitored as the density of all lineage-specific repeats, again analyzing output from RepeatMasker (Smit and Green 1999). The frequency of particular families of repeats was also determined, and broken down into ancestral and lineage-specific subfamilies.
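As a small worked illustration of this definition, the calculation for one window can be sketched as below. The per-base boolean inputs and the function name `na_anc` are assumptions made for clarity; the actual analysis worked from RepeatMasker output and alignment coordinates.

```python
def na_anc(is_lineage_specific, is_aligned):
    """Nonaligning fraction of ancestral DNA in one window.

    is_lineage_specific[i]: base i lies in a lineage-specific repeat.
    is_aligned[i]: base i is aligned to mouse (gaps count as unaligned).
    Returns None when the window contains no ancestral bases.
    """
    ancestral = 0
    aligned_ancestral = 0
    for lineage_specific, aligned in zip(is_lineage_specific, is_aligned):
        if lineage_specific:
            continue  # lineage-specific repeats are excluded entirely
        ancestral += 1
        if aligned:
            aligned_ancestral += 1
    if ancestral == 0:
        return None
    aln_anc = aligned_ancestral / ancestral
    return 1 - aln_anc
```

For a toy window of 10 bases with 2 in lineage-specific repeats and 6 of the remaining 8 aligned, alnanc is 0.75 and NAanc is 0.25.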
Recombination data were from Kong et al. (2002). The markers were mapped onto the June 2002 assembly of the human genome, and recombination frequencies were determined from the genetic distances reported and the measured physical distances from the assemblies. Each base is assigned the recombination rate calculated by assuming a linear genetic distance across the immediately flanking genetic markers. The recombination rate assigned to each 1-Mb window is the average recombination rate of the bases contained within the window. These regional estimates substantially agree with those obtained by the spline method of Kong et al. (2002).
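The window-level averaging described above can be sketched as follows. The marker representation, the function name `window_recombination_rate`, and the choice of cM/Mb as the unit are assumptions for illustration; bases outside the span of the mapped markers are simply left out of the average here.

```python
def window_recombination_rate(markers, start, end):
    """Average recombination rate (cM/Mb, assumed unit) over [start, end).

    markers: list of (physical_position_bp, genetic_position_cM) tuples,
    sorted by physical position. Each base between two adjacent markers
    gets the rate implied by a linear genetic distance between them.
    Returns None if no base in the window lies between two markers.
    """
    covered = 0
    weighted = 0.0
    for (p0, g0), (p1, g1) in zip(markers, markers[1:]):
        lo, hi = max(p0, start), min(p1, end)
        if p1 == p0 or hi <= lo:
            continue  # zero-length interval or no overlap with the window
        rate = (g1 - g0) / ((p1 - p0) / 1e6)
        weighted += rate * (hi - lo)
        covered += hi - lo
    return weighted / covered if covered else None
```

Averaging per base is equivalent to weighting each marker interval's rate by the number of its bases that fall inside the window, which is what the loop accumulates.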
SNP density was computed using the tables of SNPs from the SNP Consortium (Sachidanandam et al. 2001) for SNPs derived from random reads.
Analysis of Covariation and Predictive Variables
We considered data relative to the 5-Mb windows (overlapping by 4 Mb), again filtered so as to contain at least 800 4D sites, and required to have a defined recombination measurement. Data were also computed for nonoverlapping 1-Mb and 5-Mb windows. Covariation among different measures of divergence was assessed through pairwise correlation coefficients. As discussed above, these were computed on the original variables, as well as on residuals from various regressions, in order to remove GC-related effects. In particular, for each divergence measure, we considered residuals from five second-order regressions, namely, (1) quadratic regression on GC content (comprising intercept, fGC, and fGC squared); (2) quadratic regression on difference in GC content between human and mouse (comprising intercept, dGC, and dGC squared); (3) quadratic regression on CpG density in human (comprising intercept, CpG density, and CpG density squared); (4) quadratic regression on GC content and difference in GC content between human and mouse, without interaction (comprising intercept, fGC, fGC squared, dGC, and dGC squared); and (5) quadratic regression on GC content, difference in GC content between human and mouse, and human CpG density, without interaction (comprising intercept, fGC, fGC squared, dGC, dGC squared, CpG density, and CpG density squared). Inclusion of interaction terms between the variables did not improve the correlations (data not shown). Correlation computations and regression fits were implemented using the MINITAB software package (Ryan and Joiner 2000).
Methods for Calculating Genomic Parameters
The tables at the UCSC Genome Browser (Kent et al. 2002) were used to compute most genomic parameters. Data for repeats (all classes) came from RepeatMasker (Smit and Green 1999). GC content was computed from the human sequence using aligned bases only. The change in GC content is the fraction GC for human in alignments in a window minus the fraction GC for mouse in alignments in a window.
WEB SITE REFERENCES
http://bio.cse.psu.edu; PipDispenser and source code for BLASTZ.
http://genome.ucsc.edu; UCSC Human Genome Browser.
http://www.soe.ucsc.edu/research/compbio/covariation/; frequency tables.
Acknowledgments
We thank Francis Collins, Eric Lander, Bob Waterston, and the Mouse Genome Sequencing Consortium for sharing their data and for valuable suggestions on this work. Thanks to Laurence Hurst for useful comments on this paper. R. Hardison, S. Yang, F. Chiaromonte, L. Elnitski, J. Li, M. O'Connor, D. Kolbe, S. Schwartz, and W. Miller were supported by NHGRI grant HG02238, with additional support for L. Elnitski from NHGRI grant HG02325 and for R. Hardison from NIDDK grant RO1 DK27635; S. Whelan and N. Goldman by the Wellcome Trust; K. Roskin, M. Diekhans, W.J. Kent, and T.S. Furey by NHGRI Grant 1P41HG02371; and D. Haussler by NHGRI Grant 1P41HG02371 and the Howard Hughes Medical Institute.
The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked “advertisement” in accordance with 18 USC section 1734 solely to indicate this fact.
E-MAIL rch8@psu.edu; FAX (814) 863-7024. E-MAIL haussler@cse.ucsc.edu; FAX (831) 459-4829.
Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.844103.
REFERENCES
- 1.Ansari-Lari M.A., Oeltjen, J.C., Schwartz, S., Zhang, Z., Muzny, D.M., Lu, J., Gorrell, J.H., Chinault, A.C., Belmont, J.W., Miller, W., et al. 1998. Comparative sequence analysis of a gene-rich cluster at human chromosome 12p13 and its syntenic region in mouse chromosome 6. Genome Res. 8: 29-40. [PubMed] [Google Scholar]
- 2.Archibald J.D., Averianov, A.O., and Ekdale, E.G. 2001. Late Cretaceous relatives of rabbits, rodents, and other extant eutherian mammals. Nature 414: 62-65. [DOI] [PubMed] [Google Scholar]
- 3.Begun D.J. and Aquadro, C.F. 1992. Levels of naturally occurring DNA polymorphism correlate with recombination rates in D. melanogaster. Nature 356: 519-520. [DOI] [PubMed] [Google Scholar]
- 4.Bernardi G. 1986. Compositional constraints and genome evolution. J. Mol. Evol. 24: 1-11. [DOI] [PubMed] [Google Scholar]
- 5.___, 1993. The isochore organization of the human genome and its evolutionary history—A review. Gene 135: 57-66. [DOI] [PubMed] [Google Scholar]
- 6.___, 1995. The human genome: Organization and evolutionary history. Ann. Rev. Genet. 23: 637-661. [DOI] [PubMed] [Google Scholar]
- 7.___, 2000. Isochores and the evolutionary genomics of vertebrates. Gene 241: 3-17. [DOI] [PubMed] [Google Scholar]
- 8.___, 2001. Misunderstandings about isochores. Part 1. Gene 276: 3-13. [DOI] [PubMed] [Google Scholar]
- 9.Casane D., Boissinot, S., Chang, B.H., Shimmin, L.C., and Li, W. 1997. Mutation pattern variation among regions of the primate genome. J. Mol. Evol. 45: 216-226. [DOI] [PubMed] [Google Scholar]
- 10.Castresana J. 2002a. Genes on human chromosome 19 show extreme divergence from the mouse orthologs and a high GC content. Nucleic Acids Res. 30: 1751-1756. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11.___, 2002b. Estimation of genetic distances from human and mouse introns. Genome Biol. 3: Res.: 0028.1-0028.7. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12.Charlesworth B. 1994. The effect of background selection against deleterious mutations on weakly selected, linked variants. Genet. Res. 63: 213-227. [DOI] [PubMed] [Google Scholar]
- 13.Chen F.C., Vallender, E.J., Wang, H., Tzeng, C.S., and Li, W.H. 2001. Genomic divergence between human and chimpanzee estimated from large-scale alignments of genomic sequences. J. Heredity 92: 481-489. [DOI] [PubMed] [Google Scholar]
- 14.Chiaromonte F., Yang, S., Elnitski, L., Yap, V.B., Miller, W., and Hardison, R.C. 2001. Association between divergence and interspersed repeats in mammalian noncoding genomic DNA. Proc. Natl. Acad. Sci. 98: 14503-14508. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Dawson E., Abecasis, G.R., Bumpstead, S., Chen, Y., Hunt, S., Beare, D.M., Pabial, J., Dibling, T., Tinsley, E., Kirby, S., et al. 2002. A first-generation linkage disequilibrium map of human chromosome 22. Nature 418: 544-548. [DOI] [PubMed] [Google Scholar]
- 16.DeBry R.W. and Seldin, M.F. 1996. Human/mouse homology relationships. Genomics 33: 337-351. [DOI] [PubMed] [Google Scholar]
- 17.DeSilva U., Elnitski, L., Idol, J.R., Doyle, J.L., Gan, W., Thomas, J.W., Schwartz, S., Dietrich, N.L., Beckstrom-Sternberg, S.M., McDowell, J.C., et al. 2002. Generation and comparative analysis of approximately 3.3 Mb of mouse genomic sequence orthologous to the region of human chromosome 7q11.23 implicated in Williams syndrome. Genome Res 12: 3-15. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18.Ebersberger I., Metzler, D., Schwarz, C., and Paabo, S. 2002. Genomewide comparison of DNA sequences between humans and chimpanzees. Am. J. Hum. Genet. 70: 1490-1497. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19.Ellsworth R.E., Jamison, D.C., Touchman, J.W., Chissoe, S.L., Braden Maduro, V.V., Bouffard, G.G., Dietrich, N.L., Beckstrom-Sternberg, S.M., Iyer, L.M., Weintraub, L.A., et al. 2000. Comparative genomic sequence analysis of the human and mouse cystic fibrosis transmembrane conductance regulator genes. Proc. Natl. Acad. Sci. 97: 1172-1177. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20.Elnitski, L., Hardison, R.C., Li, J., Yang, S., Kolbe, D., Eswara, P., O'Connor, M.J., Schwartz, S., Miller, W., and Chiaromonte, F. 2003. Distinguishing regulatory DNA from neutral sites. Genome Res. (this issue). [DOI] [PMC free article] [PubMed]
- 21.Endrizzi M., Huang, S., Scharf, J.M., Kelter, A.R., Wirth, B., Kunkel, L.M., Miller, W., and Dietrich, W.F. 1999. Comparative sequence analysis of the mouse and human Lgn1/SMA interval. Genomics 60: 137-151. [DOI] [PubMed] [Google Scholar]
- 22.Epp T.A., Wang, R., Sole, M.J., and Liew, C.C. 1995. Concerted evolution of mammalian cardiac myosin heavy chain genes. J. Mol. Evol. 41: 284-292. [DOI] [PubMed] [Google Scholar]
- 23.Eyre-Walker A. and Hurst, L.D. 2001. The evolution of isochores. Nat. Rev. Genet. 2: 549-555. [DOI] [PubMed] [Google Scholar]
- 24.Fryxell K. and Zuckerkand, E. 2000. Cytosine deanimation plays a primary role in the evolution of mammalian isochores. Mol. Biol. Evol. 17: 1371-1383. [DOI] [PubMed] [Google Scholar]
- 25.Fullerton S.M., Carvalho, A.B., and Clark, A.G. 2001. Local rates of recombination are positively correlated with GC content in the human genome. Mol. Biol. Evol. 18: 1139-1142. [DOI] [PubMed] [Google Scholar]
- 26.Goldman N. and Yang, Z. 1994. A codon-based model of nucleotide substitution for protein-coding DNA sequences. Mol. Biol. Evol. 11: 725-736. [DOI] [PubMed] [Google Scholar]
- 27.Göttgens B., Gilbert, J.G., Barton, L.M., Grafham, D., Rogers, J., Bentley, D.R., and Green, A.R. 2001. Long-range comparison of human and mouse SCL loci: Localized regions of sensitivity to restriction endonucleases correspond precisely with peaks of conserved noncoding sequences. Genome Res. 11: 87-97. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28.Graur D. 1985. Amino acid composition and the evolutionary rates of protein-coding genes. J. Mol. Evol. 22: 53-62. [DOI] [PubMed] [Google Scholar]
- 29.Graur D. and Li, W.-H., 2000. Fundamentals of molecular evolution. Sinauer Associates, Sunderland, MA.
- 30.Gu X. and Li, W.H. 1994. A model for the correlation of mutation rate with GC content and the origin of GC-rich isochores. J. Mol. Evol. 38: 468-475. [DOI] [PubMed] [Google Scholar]
- 31.Hardison R.C. 2000. Conserved noncoding sequences are reliable guides to regulatory elements. Trends Genet. 16: 369-372. [DOI] [PubMed] [Google Scholar]
- 32.Hardison R., Krane, D., Vandenbergh, D., Cheng, J.F., Mansberger, J., Taddie, J., Schwartz, S., Huang, X.Q., and Miller, W. 1991. Sequence and comparative analysis of the rabbit α-like globin gene cluster reveals a rapid mode of evolution in a G + C-rich region of mammalian genomes. J. Mol. Biol. 222: 233-249. [DOI] [PubMed] [Google Scholar]
- 33.Hardison R.C., Oeltjen, J., and Miller, W. 1997. Long human–mouse sequence alignments reveal novel regulatory elements: A reason to sequence the mouse genome. Genome Res. 7: 959-966. [DOI] [PubMed] [Google Scholar]
- 34.Huchon D., Madsen, O., Sibbald, M.J., Ament, K., Stanhope, M.J., Catzeflis, F., de Jong, W.W., and Douzery, E.J. 2002. Rodent phylogeny and a timescale for the evolution of Glires: Evidence from an extensive taxon sampling using three nuclear genes. Mol. Biol. Evol. 19: 1053-1065. [DOI] [PubMed] [Google Scholar]
- 35.Hudson R.R. and Kaplan, N.L. 1995. Deleterious background selection with recombination. Genetics 141: 1605-1617. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 36.Hughes A.L. and Yeager, M. 1998. Natural selection at major histocompatibility complex loci of vertebrates. Annu. Rev. Genet. 32: 415-435. [DOI] [PubMed] [Google Scholar]
- 37.Hurst L.D. and Willliams, E.J.B. 2000. Covariation of GC content and the silent site substitution rate in rodents: Implications for methodology and for the evolution of isochores. Gene 261: 107-114. [DOI] [PubMed] [Google Scholar]
- 38.Iida K. and Akashi, H. 2000. A test of translational selection at ‘silent’ sites in the human genome: Base composition comparisons in alternatively spliced genes. Gene 261: 93-105. [DOI] [PubMed] [Google Scholar]
- 39.Kent W.J. 2002. BLAT—The BLAST-like alignment tool. Genome Res. 12: 656-664. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 40.Kent W.J., Sugnet, C.W., Furey, T.S., Roskin, K.M., Pringle, T.H., Zahler, A.M., and Haussler, D. 2002. The human genome browser at UCSC. Genome Res. 12: 996-1006. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 41.Kimura M., 1983. The neutral theory of molecular evolution. Cambridge University Press, Cambridge.
- 42.Kondrashov A.S. and Crow, J.F. 1993. A molecular approach to estimating the human deleterious mutation rate. Hum. Mutat. 2: 229-234. [DOI] [PubMed] [Google Scholar]
- 43.Kong A., Gudbjartsson, D.F., Sainz, J., Jonsdottir, G.M., Gudjonsson, S.A., Richardsson, B., Sigurdardottir, S., Barnard, J., Hallbeck, B., Masson, G., Shlien, A., et al. 2002. A high-resolution recombination map of the human genome. Nat. Genet. 31: 241-247. [DOI] [PubMed] [Google Scholar]
- 44.Koop B.F. 1995. Human and rodent DNA sequence comparisons: A mosaic model of genomic evolution. Trends Genet. 11: 367-371. [DOI] [PubMed] [Google Scholar]
- 45.Koop B.F. and Hood, L. 1994. Striking sequence similarity over almost 100 kilobases of human and mouse T-cell receptor DNA. Nat. Genet. 7: 48-53. [DOI] [PubMed] [Google Scholar]
- 46.Kumar S. and Subramanian, S. 2002. Mutation rates in mammalian genomes. Proc. Natl. Acad. Sci. 99: 803-808. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 47.Lamerdin J.E., Montgomery, M.A., Stilwagen, S.A., Scheidecker, L.K., Tebbs, R.S., Brookman, K.W., Thompson, L.H., and Carrano, A.V. 1995. Genomic sequence comparison of the human and mouse XRCC1 DNA repair gene regions. Genomics 25: 547-554. [DOI] [PubMed] [Google Scholar]
- 48.Lamerdin J.E., Stilwagen, S.A., Ramirez, M.H., Stubbs, L., and Carrano, A.V. 1996. Sequence analysis of the ERCC2 gene regions in human, mouse, and hamster reveals three linked genes. Genomics 34: 399-409. [DOI] [PubMed] [Google Scholar]
- 49.Lander E.S., Linton, L.M., Birren, B., Nusbaum, C., Zody, M.C., Baldwin, J., Devon, K., Dewar, K., Doyle, M., FitzHugh, W., et al. 2001. Initial sequencing and analysis of the human genome. Nature 409: 860-921. [DOI] [PubMed] [Google Scholar]
- 50.Lercher M.J. and Hurst, L.D. 2002. Human SNP variability and mutation rate are higher in regions of high recombination. Trends Genet. 18: 337-340. [DOI] [PubMed] [Google Scholar]
- 51.Lercher M.J., Williams, E.J., and Hurst, L.D. 2001. Local similarity in evolutionary rates extends over whole chromosomes in human–rodent and mouse–rat comparisons: Implications for understanding the mechanistic basis of the male mutation bias. Mol. Biol. Evol. 18: 2032-2039. [DOI] [PubMed] [Google Scholar]
- 52.Li J. and Miller, W., 2002. Significance of interspecies matches when evolutionary rate varies. In Proceedings of RECOMB 2002., pp. 216–224. [DOI] [PubMed]
- 53.Li W.H., Wu, C.I., and Luo, C.C. 1985. A new method for estimating synonymous and nonsynonymous rates of nucleotide substitution considering the relative likelihood of nucleotide and codon changes. Mol. Biol. Evol. 2: 150-174. [DOI] [PubMed] [Google Scholar]
- 54.Lio P. and Goldman, N. 1998. Models of molecular evolution and phylogeny. Genome Res. 8: 1233-1244. [DOI] [PubMed] [Google Scholar]
- 55.Makalowski W. and Boguski, M.S. 1998a. Synonymous and nonsynonymous substitution distances are correlated in mouse and rat genes. J. Mol. Evol. 47: 119-121. [DOI] [PubMed] [Google Scholar]
- 56.___, 1998b. Evolutionary parameters of the transcribed mammalian genome: An analysis of 2,820 orthologous rodent and human sequences. Proc. Natl. Acad. Sci. 95: 9407-9412. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 57.Makalowski W., Zhang, J., and Boguski, M.S. 1996. Comparative analysis of 1196 orthologous mouse and human full-length mRNA and protein sequences. Genome Res. 6: 846-857. [DOI] [PubMed] [Google Scholar]
- 58.Margot J.B., Demers, G.W., and Hardison, R.C. 1989. Complete nucleotide sequence of the rabbit β-like globin gene cluster: Analysis of intergenic sequences and comparison with the human β-like globin gene cluster. J. Mol. Biol. 205: 15-40. [DOI] [PubMed] [Google Scholar]
- 59.Matassi G., Sharp, P.M., and Gautier, C. 1999. Chromosomal location effects on gene sequence evolution in mammals. Curr. Biol. 9: 786-791. [DOI] [PubMed] [Google Scholar]
- 60.Maynard Smith J. and Haigh, J. 1974. The hitch-hiking effect of a favourable gene. Genet. Res. 23: 23-35. [PubMed] [Google Scholar]
- 61.Nei M., 1987. Molecular evolutionary genetics. Columbia University Press, New York.
- 62.Nei M. and Kumar, S., 2000. Molecular evolution and phylogenetics. Oxford University Press, New York, NY.
- 63.Oeltjen J.C., Malley, T.M., Muzny, D.M., Miller, W., Gibbs, R.A., and Belmont, J.W. 1997. Large-scale comparative sequence analysis of the human and murine Bruton's tyrosine kinase loci reveals conserved regulatory domains. Genome Res. 7: 315-329. [DOI] [PubMed] [Google Scholar]
- 64.Ogata H., Fujibuchi, W., and Kanehisa, M. 1996. The size differences among mammalian introns are due to the accumulation of small deletions. FEBS Letts. 390: 99-103. [DOI] [PubMed] [Google Scholar]
- 65.Pennacchio L.A. and Rubin, E.M. 2001. Genomic strategies to identify mammalian regulatory sequences. Nat. Rev. Genet. 2: 100-109. [DOI] [PubMed] [Google Scholar]
- 66.Pruitt K.D. and Maglott, D.R. 2001. RefSeq and LocusLink: NCBI gene-centered resources. Nucleic Acids Res. 29: 137-140. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 67.Roskin K.M., Diekhans, M., Kent, W.J., and Haussler, D., 2002. Score functions for assessing conservation in locally aligned regions of DNA from two species. In UCSC Tech Report UCSC-CRL-02-03. University of California at Santa Cruz, CA.
- 68.Ryan B. and Joiner, B., 2000. Minitab handbook. Duxbury Press, Belmont, CA.
- 69.Sachidanandam R., Weissman, D., Schmidt, S.C., Kakol, J.M., Stein, L.D., Marth, G., Sherry, S., Mullikin, J.C., Mortimore, B.J., Willey, D.L., et al. 2001. A map of human genome sequence variation containing 1.42 million single nucleotide polymorphisms. Nature 409: 928-933. [DOI] [PubMed] [Google Scholar]
- 70.Schwartz, S., Kent, W.J., Smit, A., Zhang, Z., Baertsch, R., Hardison, R., Haussler, D., and Miller, W. 2003. Human–mouse alignments with BLASTZ. Genome Res. (this issue). [DOI] [PMC free article] [PubMed]
- 71.Shabalina S.A., Ogurtsov, A.Y., Kondrashov, V.A., and Kondrashov, A.S. 2001. Selective constraint in intergenic regions of human and mouse genomes. Trends Genet. 17: 373-376. [DOI] [PubMed] [Google Scholar]
- 72.Shehee W.R., Loeb, D.D., Adey, N.B., Burton, F.H., Casavant, N.C., Cole, P., Davies, C.J., McGraw, R.A., Schichman, S.A., Severynse, D.M., et al. 1989. Nucleotide sequence of the BALB/c mouse β-globin complex. J. Mol. Biol. 205: 41-62. [DOI] [PubMed] [Google Scholar]
- 73.Shiraishi T., Druck, T., Mimori, K., Flomenberg, J., Berk, L., Alder, H., Miller, W., Huebner, K., and Croce, C.M. 2001. Sequence conservation at human and mouse orthologous common fragile regions, FRA3B/FHIT and Fra14A2/Fhit. Proc. Natl. Acad. Sci. 98: 5722-5727. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 74.Smit, A. and Green, P. 1999. RepeatMasker at http://ftp.genome.washington.edu/RM/RepeatMasker.html.
- 75.Smith N.G. and Hurst, L.D. 1998. Sensitivity of patterns of molecular evolution to alterations in methodology: A critique of Hughes and Yeager. J. Mol. Evol. 47: 493-500.
- 76.Smith N.G.C., Webster, M., and Ellegren, H. 2002. Deterministic mutation rate variation in the human genome. Genome Res. 12: 1350-1356.
- 77.Tavaré S. 1986. Some probabilistic and statistical problems in the analysis of DNA sequences. Lectures on Mathematics in the Life Sciences 17: 57-86.
- 78.Ticher A. and Graur, D. 1989. Nucleic acid composition, codon usage, and the rate of synonymous substitution in protein-coding genes. J. Mol. Evol. 28: 286-298.
- 79.Waterston R.H., Lindblad-Toh, K., Birney, E., Rogers, J., Abril, J.F., Agarwal, P., Agarwala, R., Ainscough, R., Alexandersson, M., An, P., et al. 2002. Initial sequencing and comparative analysis of the mouse genome. Nature 420: 520-562.
- 80.Whelan S., Lio, P., and Goldman, N. 2001. Molecular phylogenetics: State-of-the-art methods for looking into the past. Trends Genet. 17: 262-272.
- 81.Williams E.J. and Hurst, L.D. 2000. The proteins of linked genes evolve at similar rates. Nature 407: 900-903.
- 82.___, 2002. Is the synonymous substitution rate in mammals gene-specific? Mol. Biol. Evol. 19: 1395-1398.
- 83.Wilson M.D., Riemer, C., Martindale, D.W., Schnupf, P., Boright, A.P., Cheung, T.L., Hardy, D.M., Schwartz, S., Scherer, S.W., Tsui, L.C., et al. 2001. Comparative analysis of the gene-dense ACHE/TFR2 region on human chromosome 7q22 with the orthologous region on mouse chromosome 5. Nucleic Acids Res. 29: 1352-1365.
- 84.Wolfe K.H. 1991. Mammalian DNA replication: Mutation biases and the mutation rate. J. Theor. Biol. 149: 441-451.
- 85.Wolfe K.H. and Sharp, P.M. 1993. Mammalian gene evolution: Nucleotide sequence divergence between mouse and rat. J. Mol. Evol. 37: 441-456.
- 86.Wolfe K.H., Sharp, P.M., and Li, W.H. 1989. Mutation rates differ among regions of the mammalian genome. Nature 337: 283-285.
- 87.Yang Z. 1994. Estimating the pattern of nucleotide substitution. J. Mol. Evol. 39: 105-111.
- 88.___, 1997. PAML: A program package for phylogenetic analysis by maximum likelihood. Comput. Appl. Biosci. 13: 555-556.
- 89.Yang Z. and Nielsen, R. 2000. Estimating synonymous and nonsynonymous substitution rates under realistic evolutionary models. Mol. Biol. Evol. 17: 32-43.