Nucleic Acids Research. 2003 Oct 15;31(20):6043–6052. doi: 10.1093/nar/gkg784

A compositional segmentation of the human mitochondrial genome is related to heterogeneities in the guanine mutation rate

David C Samuels *, Richard J Boys 1, Daniel A Henderson 2, Patrick F Chinnery 3
PMCID: PMC219467  PMID: 14530452

Abstract

We applied a hidden Markov model segmentation method to the human mitochondrial genome to identify patterns in the sequence, to compare these patterns to the gene structure of mtDNA and to see whether these patterns reveal additional characteristics important for our understanding of genome evolution, structure and function. Our analysis identified three segmentation categories based upon the sequence transition probabilities. Category 2 segments corresponded to the tRNA and rRNA genes, with a greater strand-symmetry in these segments. Category 1 and 3 segments covered the protein-coding genes and almost all of the non-coding D-loop. Compared to category 1, the mtDNA segments assigned to category 3 had much lower guanine abundance. A comparison to two independent databases of mitochondrial mutations and polymorphisms showed that the high substitution rate of guanine in human mtDNA is largest in the category 3 segments. Analysis of synonymous mutations showed the same pattern. This suggests that this heterogeneity in the mutation rate is partly independent of respiratory chain function and is a direct property of the genome sequence itself. This has important implications for our understanding of mtDNA evolution and its use as a ‘molecular clock’ to determine the rate of population and species divergence.

INTRODUCTION

The mitochondrial DNA molecule (mtDNA) is a small, circular genome present typically in multiple copies in each mitochondrion. The human mitochondrial genome is approximately 16 600 base pairs in length. Ninety-three percent of the human mtDNA molecule codes for 13 peptides, two rRNAs and 22 tRNAs, leaving only a small section of non-coding sequence which includes the D-loop and a few non-coding nucleotides. The 13 mtDNA-encoded peptides form part of the electron transfer chain of the mitochondrion, with the remaining essential proteins encoded in the nuclear DNA. Human mtDNA, and vertebrate mtDNA in general, is highly strand-asymmetric in cytosine and guanine composition. The light strand is enriched in cytosine and deficient in guanine, while the heavy strand is the reverse. Twelve of the 13 protein genes (all but the ND6 gene) are translated from the light strand.

Since the publication of the complete sequence of human mtDNA (1), it has become the focus of intense research interest. mtDNA is inherited almost exclusively down the maternal line, and has been used to study evolutionary history (2), population migration (2,3) and in forensic medicine (4). In addition, an increasing number of human diseases are attributed to mutations of mtDNA (5–7). The last two decades have seen major advances in our understanding of the molecular evolution of mtDNA and the molecular pathogenesis of mtDNA diseases, but many questions remain unanswered. For example, phylogenetic and family studies have shown different rates of mutation in different regions of the mtDNA that cannot be explained purely in terms of known functions of mtDNA or corresponding amino acid sequences (8–11). A greater understanding of the mtDNA sequence, and its susceptibility to mutation, has broad implications in all of these areas. One way of tackling this problem is to analyze the mtDNA sequence carefully using statistical techniques that can identify patterns that are not immediately apparent to the naked eye.

In this paper, we apply a hidden Markov model segmentation technique to the human mitochondrial DNA genome. This technique segments a DNA sequence using nucleotide probability patterns at each sequence location. The segmentation may depend on mononucleotide, dinucleotide or tri-nucleotide patterns. For the human mitochondrial genome, the best segmentation found was based on dinucleotide information. The resulting segmentation is very closely related to the organization of the genes along the DNA sequence, information that was not used in the segmentation method. We describe the correspondence between the DNA segmentation and the gene organization, and then give an interpretation of this relationship. Finally, we compare our segmentation results to two databases of human mtDNA mutations and polymorphisms and show that the segmentation can be explained as the result of a variation in the guanine substitution rate with sequence position.

MATERIALS AND METHODS

Sequence data accession

The human mtDNA genome used was the Revised Cambridge Reference Sequence (rCRS) (12), a revision of the sequence determined by Anderson et al. (1) which incorporates a small number of corrections due to sequencing artifacts and other technical difficulties during the original sequencing (1,12). The rCRS is the complete sequence of a single human individual of European descent (i.e. it is not a consensus sequence). Aside from two small, hypervariable sections in the D-loop, variations from the rCRS within the population are small (13) and are not likely to affect the genome segmentation significantly. We obtained the rCRS from the National Center for Biotechnology Information (NCBI) as file NC_001807[gi:13959823], and followed the original numbering scheme for the gene positions (1). All of the analyses were run on the light strand sequence. Analysis of the nucleotide and codon distributions along the mtDNA genes was done using the STADEN package. The mitochondrial gene map and the amino acid maps for the mtDNA-encoded proteins used in the analysis of the segmentation results were taken from the MitoMap web site (www.mitomap.org) (14). This information was not used in the HMM analysis and was only used in our interpretation of the segmentation results.

Data for the mutation and polymorphism analysis were taken from the MitoMap website and the Human Mitochondrial Genome Database (mtDB) website maintained by Uppsala University (www.genpat.uu.se/mtDB). The mtDB database contains mtDNA polymorphisms while the MitoMap database contains both polymorphisms and pathogenic mutations. Only the mtDNA sequence outside of the non-coding D-loop was used in the mutation analysis. To compare the number of mutations found in the different segmentation categories we have used Bayesian confidence intervals (CI) using non-informative priors. Confidence intervals that include zero indicate that the difference is not statistically significant.
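The interval calculation described above can be illustrated with a minimal sketch. It assumes independent uniform Beta(1, 1) priors on the two mutation proportions and a Monte Carlo approximation; the text does not state which non-informative prior or numerical method was used, so the function name `difference_interval`, the prior and the sampling scheme are all assumptions, and the resulting interval need not match the published one exactly. The counts are the guanine entries for categories 3 and 1 in the MitoMap rows of Table 6.

```python
import random

def difference_interval(x_a, n_a, x_b, n_b, level=0.95, draws=20000, seed=1):
    """Monte Carlo interval for p_a - p_b under assumed uniform Beta(1, 1) priors."""
    rng = random.Random(seed)
    # With a Beta(1, 1) prior, x successes in n trials give a Beta(x + 1, n - x + 1) posterior.
    diffs = sorted(
        rng.betavariate(x_a + 1, n_a - x_a + 1) - rng.betavariate(x_b + 1, n_b - x_b + 1)
        for _ in range(draws)
    )
    tail = (1.0 - level) / 2.0
    lower = diffs[int(tail * draws)]
    upper = diffs[int((1.0 - tail) * draws) - 1]
    return lower, upper

# Guanine counts (mutated, total): category 3 = 31/205, category 1 = 98/1056.
low, high = difference_interval(31, 205, 98, 1056)
print(f"95% interval for the difference: ({low:.4f}, {high:.4f})")
```

An interval whose lower end lies above zero is read, as in the text, as evidence that the two proportions differ.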

Statistical analysis

Hidden Markov models are commonly used to reveal segmented structure in DNA sequences (15,16). These models are flexible in terms of the number of homogeneous categories used, the organization of these categories along the genome and the complexity of the sequence structure within each category. The model assumes that both the hidden component, representing the segmentation, and the observed sequence within each category follow Markov chains of a specified order. Order 0 chains give segmentations based on single nucleotide probabilities, while in order 1 chains, nucleotide probabilities depend on the previous base and are therefore related to the dinucleotide properties of the sequence. Order 2 chains are based on tri-nucleotide properties and are the highest order chains considered in our analysis. Both the segmentation and the actual sequence are described by a set of transition probabilities. In this paper, we use an HMM-based method that also estimates the number of homogeneous categories and the complexity (order of Markov dependence) of the observed sequence. The method is implemented using modern computer-intensive Markov chain Monte Carlo Bayesian techniques (17). It takes as inputs prior distributions describing likely values of the model parameters and it outputs distributions for these parameters that combine this information with that in the DNA sequence, together with a probability distribution for the locations of the segments. Parameter estimates and their precision are obtained using the means and standard deviations of these posterior distributions. See Boys and Henderson (18) and Boys et al. (19) for a more detailed description of the model and implementation.
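To make the model structure concrete, the sketch below decodes the most probable category path for a sequence when the parameters are already known. This is plain Viterbi decoding, not the Bayesian MCMC method used in the paper, which also infers the parameters, the number of categories and the Markov order. The function names, the `stay` probability and the two toy emission tables are hypothetical.

```python
import math

BASES = "ACGT"

def viterbi_segment(seq, emit, stay=0.999):
    """Most probable category path for seq under a toy HMM with known parameters.

    emit[k][prev][cur] is P(cur | prev) inside category k (an order 1 chain).
    stay is an assumed probability of remaining in the same category.
    """
    n_cat = len(emit)
    switch = (1.0 - stay) / (n_cat - 1)
    log_trans = [[math.log(stay if a == b else switch) for b in range(n_cat)]
                 for a in range(n_cat)]
    # The first base has no predecessor, so every category starts with score 0.
    score = [0.0] * n_cat
    back = []
    for prev, cur in zip(seq, seq[1:]):
        new_score, pointers = [], []
        for k in range(n_cat):
            best = max(range(n_cat), key=lambda a: score[a] + log_trans[a][k])
            pointers.append(best)
            new_score.append(score[best] + log_trans[best][k]
                             + math.log(emit[k][prev][cur]))
        score = new_score
        back.append(pointers)
    # Trace the best path backwards from the final position.
    state = max(range(n_cat), key=lambda k: score[k])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

def make_emit(probs):
    """Toy table: the same next-base distribution for every previous base."""
    return {p: dict(zip(BASES, probs)) for p in BASES}

emit = [make_emit([0.25, 0.25, 0.25, 0.25]),   # hypothetical balanced category
        make_emit([0.35, 0.40, 0.02, 0.23])]   # hypothetical guanine-poor category
seq = "ACGT" * 50 + "ACTA" * 50
path = viterbi_segment(seq, emit, stay=0.99)
```

In this toy input the guanine-containing first half is assigned to category 0 and the guanine-free second half to category 1, with a single change point between them.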

In this analysis, the highest probability segmentation was achieved with an HMM of first order, based on the sequence transition matrix P(j|i) where i and j refer to the four bases. For example, the matrix element P(C|A) is the conditional probability that the sequence contains a C at a particular location given that the previous location contained an A. Each segmentation category is defined by a different matrix P(j|i). Within each category, the stationary nucleotide probabilities NP(i) are given by the solution of the implicit equation

$$NP(j) = \sum_{i} P(j \mid i)\, NP(i) \qquad (1)$$

The dinucleotide probabilities DP(i,j) can then be calculated using

$$DP(i, j) = P(j \mid i)\, NP(i) \qquad (2)$$
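As a numerical illustration of equations 1 and 2, the sketch below finds the stationary nucleotide probabilities by repeatedly applying a transition matrix and then forms the dinucleotide probabilities. The matrix `P` is hypothetical and is not one of the estimated matrices from this paper; power iteration is one convenient way to solve the implicit equation, chosen here as an assumption.

```python
BASES = "ACGT"

def stationary(P, iterations=1000):
    """Solve NP(j) = sum_i P(j|i) NP(i) by power iteration.

    P[i][j] holds P(j|i); each row must sum to 1.
    """
    np_vec = [0.25] * 4
    for _ in range(iterations):
        np_vec = [sum(np_vec[i] * P[i][j] for i in range(4)) for j in range(4)]
    return np_vec

def dinucleotide(P, np_vec):
    """DP(i, j) = P(j|i) NP(i), returned as a 4 x 4 nested list."""
    return [[P[i][j] * np_vec[i] for j in range(4)] for i in range(4)]

# Hypothetical transition matrix, rows and columns ordered (A, C, G, T).
P = [
    [0.30, 0.30, 0.15, 0.25],
    [0.30, 0.35, 0.10, 0.25],
    [0.25, 0.35, 0.15, 0.25],
    [0.25, 0.30, 0.15, 0.30],
]
NP = stationary(P)
DP = dinucleotide(P, NP)
print([round(x, 3) for x in NP])
```

The 16 entries of `DP` sum to 1, and both its row sums and its column sums reproduce `NP`, which is a quick consistency check on any estimated matrix.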

RESULTS

Segmentation probabilities

The application of the segmentation method to the human mitochondrial genome results in a most probable segmentation using three categories based on the nucleotide sequence transition probabilities (an order 1 chain). The analysis gives a very high posterior probability of greater than 90% that there are three segmentation categories. The probability distributions defining these three categories are given in Figure 1. The estimated sequence transition probabilities P(j|i) for the three categories, given by the means of the distributions in Figure 1, are

Figure 1. Sequence transition probability distributions between nucleotides for the three segmentation categories. Red: category 1. Blue: category 2. Green: category 3.

[Equations 3–5, images gkg784equ2.jpg to gkg784equ4.jpg: estimated sequence transition probability matrices P(j|i) for the three categories.]

The order of the elements in the rows and columns of these arrays is (A,C,G,T). The precision of these estimates is useful in evaluating the importance of differences in the array elements given in equations 3–5. This is given by the standard deviation of the probability distribution for each array element:

[Equations 6–8, images gkg784equ5.jpg to gkg784equ7.jpg: standard deviations of the transition probability estimates for the three categories.]

The standard deviations of the estimates can be seen in Figure 1 as they are related to the width of the probability distributions.

The stationary nucleotide probabilities NP, calculated using equation 1, for the three categories are

$$NP_1 = (0.278 \quad 0.324 \quad 0.138 \quad 0.259) \qquad (9)$$

$$NP_2 = (0.338 \quad 0.252 \quad 0.169 \quad 0.241) \qquad (10)$$

$$NP_3 = (0.338 \quad 0.360 \quad 0.071 \quad 0.231) \qquad (11)$$

where again the order of the elements is (A,C,G,T).

The dinucleotide probability matrices DP calculated from equation 2 for each category are

[Equations 12–14, images gkg784equ8.jpg to gkg784equ10.jpg: dinucleotide probability matrices DP for the three categories.]

A map of the probability of assignment of the mtDNA to these categories is given in Figure 2; see (19) for details on how these probabilities are calculated. The segmentation map of the mtDNA into the most probable categories is given in detail in Table 1.

Figure 2. Probability map of assignment of the human mtDNA genome to the three segmentation categories. Protein gene annotations are given in (a) for categories 1 and 3. tRNA and rRNA annotations are given in (b) for category 2. The arrows give the direction of translation of the gene. Color codes are the same as in Figure 1.

Table 1. Category assignment of mtDNA.

| Base pair position | Category assignment | Comparison to mtDNA genes | Boundary difference (bp) |
| --- | --- | --- | --- |
| 192–257 | 2 | Displacement loop (16024–576) | |
| 258–574 | 3 | D loop | |
| 575–2223 | 2 | tRNA F (577–647), 12S rRNA gene (648–1601), tRNA V (1602–1670) and Start 16S rRNA (1671–3229) | –2 |
| 2224–2270 | 3 | Middle of 16S rRNA | |
| 2271–3291 | 2 | End 16S rRNA and tRNA L1 (3230–3304) | |
| 3292–3989 | 1 | Start ND1 (3307–4262) | –15 |
| 3990–4259 | 3 | End ND1 | |
| 4260–4458 | 2 | tRNA I, Q and M (4263–4469) | –3 |
| 4459–5055 | 1 | Start ND2 (4470–5511) | –11 |
| 5056–5506 | 3 | End ND2 | |
| 5507–5869 | 2 | tRNA W, A, N, C and Y (5512–5891) | –5 |
| 5870–7296 | 1 | COX I (5904–7445) | –34 |
| 7297–7571 | 2 | tRNA S1 and D (7445–7585) | –148 |
| 7572–8093 | 1 | COX II (7586–8269) | –14 |
| 8094–8373 | 2 | tRNA K (8295–8364) | –201 |
| 8374–8817 | 3 | ATPase8 (8366–8572) and start of ATPase6 (8527–9207) (overlapping) | +8 |
| 8818–9932 | 1 | End of ATPase6 and COX III (9207–9990) | |
| 9933–10098 | 2 | tRNA G (9991–10058) | –58 |
| 10099–10143 | 3 | Start of ND3 (10059–10404) | +40 |
| 10144–10349 | 1 | End of ND3 | |
| 10350–10448 | 2 | tRNA R (10405–10469) | –55 |
| 10449–10737 | 1 | ND4L (10470–10766) | –21 |
| 10738–11127 | 3 | Start of ND4 (10760–12137) | –22 |
| 11128–11995 | 1 | Middle of ND4 | |
| 11996–12121 | 3 | End of ND4 | |
| 12122–12336 | 2 | tRNA H, S2 and L2 (12138–12336) | –16 |
| 12337–12463 | 3 | Start of ND5 (12337–14148) | 0 |
| 12464–13641 | 1 | Middle of ND5 | |
| 13642–14827 | 3 | End of ND5, ND6 (14149–14673), tRNA E (14674–14742) and start of Cyt b (14747–15887) | |
| 14828–15827 | 1 | Middle of Cyt b | |
| 15828–15859 | 3 | End of Cyt b | |
| 15860–16157 | 2 | tRNA T and P (15888–16023) | –28 |
| 16158–16306 | 3 | D loop | +134 |
| 16307–191 | 1 | D loop | |

Category 2 corresponds to tRNA and rRNA genes while categories 1 and 3 correspond to the protein-coding genes. mtDNA gene positions are given where they are first mentioned. Column 4 gives the difference between the segment and gene boundaries, where applicable. We only list the difference between the leading edges of the gene and segment positions since the genes are all either contiguous or separated by only a few non-coding bases so differences at the trailing edges are the same as the leading edge differences of the following segment.

DISCUSSION

We have applied an HMM segmentation analysis to the entire human mtDNA sequence. Our analysis identified three categories based upon adjacent nucleotide transition probabilities with segment boundaries that correspond well to the boundaries of known mtDNA genes (Table 1 and Fig. 2), with only minor discrepancies. The difference between the segmentation and gene boundaries is 25 bp or less for 11 out of the 19 segment boundaries that we identify with gene boundaries. The only differences in boundary position larger than 100 base pairs occurred at the edges of three small groups of tRNA genes (serine 1 and aspartic acid at 7445–7585, lysine at 8295–8364 and threonine and proline at 15888–16023).
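The boundary counts quoted above can be checked directly against column 4 of Table 1. The short sketch below does only that arithmetic; the variable names are illustrative.

```python
# Boundary differences (bp) copied from column 4 of Table 1, in table order.
differences = [-2, -15, -3, -11, -5, -34, -148, -14, -201, 8,
               -58, 40, -55, -21, -22, -16, 0, -28, 134]

within_25 = [d for d in differences if abs(d) <= 25]
over_100 = [d for d in differences if abs(d) > 100]
print(len(differences), len(within_25), len(over_100))  # prints: 19 11 3
```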

The overall pattern indicates that category 2 segments correspond to the tRNA and rRNA genes, and categories 1 and 3 correspond primarily to the protein-coding genes, with the category 3 segments appearing at the 3′ or 5′ (or both) ends of most of these genes. This gene location information was not used in the HMM segmentation analysis. Curiously, the non-coding displacement loop is not assigned to a separate category, but is instead covered primarily with categories 1 and 3, with a small section (66 bp) of category 2. Only a single tRNA gene fails to be included in category 2 (tRNA glutamic acid at location 14674–14742) and a short 47 bp segment in the middle of the 16S rRNA gene (2224–2270) is assigned to category 3 instead of category 2.

The interpretation of the results of the segmentation of a DNA sequence by a statistical algorithm is often difficult. With a segmentation based on the probability of the transition between nucleotides along the sequence, the difference in the categories is in principle determined by the differences in the 16 transition probability distributions plotted in Figure 1 and summarized in equations 3–8. The distinctions among the categories may be caused by subtle differences distributed over all 16 of these probability distributions. However, we were able to interpret much of the results of our segmentation analysis from just a consideration of the individual nucleotide frequencies. In Table 2, we give the frequencies of the four nucleotides in the total mtDNA, and in the combined segments assigned to each category. The nucleotide frequencies given in Table 2 are very close to the stationary nucleotide probabilities for each category given in equations 9–11. We initially focus our discussion of these sequence patterns on the single nucleotide frequencies, and then specifically on how the dinucleotide properties are helpful in interpreting category 2.

Table 2. Nucleotide composition and Chargaff differences in the total human mtDNA and the segments of mtDNA in each of the three categories.

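| Sequence | Length (bp) | % A | % T | % C | % G | (A–T)/(A+T) | (C–G)/(C+G) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Total mtDNA | 16568 | 30.9 | 24.7 | 31.3 | 13.1 | 0.112 | 0.410 |
| Category 1 | 8354 | 27.8 | 25.8 | 32.7 | 13.7 | 0.038 | 0.409 |
| Category 2 | 4630 | 34.0 | 24.1 | 24.8 | 17.0 | 0.171 | 0.185 |
| Category 3 | 3584 | 34.1 | 22.9 | 36.4 | 6.6 | 0.197 | 0.694 |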

These values are taken from the actual sequence, not the estimated stationary nucleotide probabilities calculated in equations 9–11. The mtDNA classified in category 2 shows very little strand asymmetry in C and G, while the mtDNA segments in category 3 have a very high CG asymmetry.

Strand asymmetry influences the dinucleotide segmentation of mtDNA

The frequency of guanine and cytosine determines the strand asymmetry of mtDNA, which is a fundamental property of the genome. Over the entire human mitochondrial genome there is a large strand asymmetry (20), but the degree varies throughout the sequence and is the most extreme in category 3 (Table 2). This can also be seen in the third column of Figure 1, where the probability distributions for guanine in category 3 (green) are significantly shifted to lower values.

Strand asymmetry can be measured by the Chargaff differences (A–T)/(A+T) and (C–G)/(C+G) (21). In Table 2, we show that the C–G Chargaff difference varies significantly between the three categories. Category 1 has a C–G Chargaff difference virtually identical to the mean value for the entire mtDNA, though it accounts for only 50% of the total genome length. The C–G Chargaff difference in category 2 is much smaller and conversely the C–G Chargaff difference in category 3 is much larger than the average values. In order to function properly, the RNA transcribed from the tRNA and rRNA genes (category 2 segments) must be able to fold into stem–loop structures and this requires a significant amount of strand symmetry over the length of these genes, and thus a low Chargaff difference.
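The Chargaff differences are simple ratios, and the sketch below recomputes them from the rounded percentages in Table 2. Because the inputs are rounded to one decimal place, the results can differ from the tabulated values in the third decimal. The function name is illustrative.

```python
def chargaff_differences(a, t, c, g):
    """Return ((A-T)/(A+T), (C-G)/(C+G)) from counts or percentages."""
    return (a - t) / (a + t), (c - g) / (c + g)

# Percentages in the order (A, T, C, G), taken from Table 2.
composition = {
    "Total mtDNA": (30.9, 24.7, 31.3, 13.1),
    "Category 1": (27.8, 25.8, 32.7, 13.7),
    "Category 2": (34.0, 24.1, 24.8, 17.0),
    "Category 3": (34.1, 22.9, 36.4, 6.6),
}
for name, (a, t, c, g) in composition.items():
    at, cg = chargaff_differences(a, t, c, g)
    print(f"{name}: AT {at:.3f}, CG {cg:.3f}")
```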

Although strand symmetry facilitates RNA folding, the formation of stem–loop structures requires palindrome strings of complementary bases. Information about the sequence order of the bases is contained in the dinucleotide and higher order statistics. Table 3 shows the differences in the probabilities, DP (equations 12–14), of pairs of complementary dinucleotides (such as AG and CT) in each segmentation category. Category 2 segments do contain the lowest values, so complementary dinucleotide pairs occur in more nearly equal numbers in these segments than in the category 1 and 3 segments. This means that palindrome complementary sequences are more likely to exist in category 2 mtDNA than in category 1 or 3 segments. To test whether the differences in the values in Table 3 are significant, we calculated the equivalent values from the probability distributions given by the segmentation algorithm. The advantage of this calculation is that we can also calculate 95% confidence intervals for the values and use these to determine whether the differences are statistically significant. The calculated mean values for the sum of the absolute differences in the complementary dinucleotide pairs, with 95% CI given in parentheses, are the following: category 1 segments, 0.31 (0.25–0.34); category 2 segments, 0.19 (0.15–0.23); and for category 3 segments, 0.51 (0.45–0.56). Since the 95% CI do not overlap, these differences are statistically significant. This difference in the dinucleotide properties of the RNA and the protein-coding genes is one possible explanation for why the best segmentation of the human mitochondrial genome was based on the sequence transition probability, a dinucleotide property. The functions of the tRNA and rRNA genes constrain their sequence structure, allowing them to be separated from the rest of the mtDNA molecule by the HMM segmentation. This highlights the value of the HMM technique in identifying logical sequence patterns in the mtDNA that are not immediately apparent to the naked eye.
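The quantity summed in Table 3 can be sketched as follows. Each dinucleotide is paired with its reverse complement, the four self-complementary dinucleotides (AT, TA, CG, GC) are skipped, and the absolute differences of the remaining six pairs are added. The function names and the uniform test distribution are illustrative, not values from the paper.

```python
from itertools import product

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def reverse_complement(dinuc):
    """Reverse complement of a dinucleotide, e.g. AG -> CT."""
    return "".join(COMPLEMENT[b] for b in reversed(dinuc))

def complementary_pairs():
    """Unordered pairs of distinct dinucleotides that are reverse complements."""
    pairs = []
    for x, y in product("ACGT", repeat=2):
        d = x + y
        rc = reverse_complement(d)
        if d < rc:  # skips self-complementary AT, TA, CG, GC and duplicate pairs
            pairs.append((d, rc))
    return pairs

def asymmetry_sum(dp):
    """Sum of |DP(d) - DP(reverse complement of d)| over the six pairs."""
    return sum(abs(dp[d] - dp[rc]) for d, rc in complementary_pairs())

# Hypothetical uniform dinucleotide distribution: perfectly symmetric.
uniform = {x + y: 1 / 16 for x, y in product("ACGT", repeat=2)}
print(complementary_pairs())
print(asymmetry_sum(uniform))
```

The six pairs produced are the same ones that head the columns of Table 3.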

Table 3. The differences in the frequencies of complementary dinucleotide pairs in the three categories, calculated by subtracting one probability from the other.

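| Category | \|AA–TT\| | \|AC–GT\| | \|AG–CT\| | \|CA–TG\| | \|CC–GG\| | \|GA–TC\| | Sum |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Category 1 | 0.009 | 0.061 | 0.051 | 0.062 | 0.074 | 0.052 | 0.309 |
| Category 2 | 0.057 | 0.037 | 0.003 | 0.036 | 0.050 | 0.004 | 0.187 |
| Category 3 | 0.066 | 0.107 | 0.065 | 0.092 | 0.133 | 0.050 | 0.513 |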

The sum of these values is lowest in the category 2 mtDNA, indicating that complementary dinucleotide pairs occur in nearly equal numbers in this category. This is consistent with the need to form stem–loop structures in the RNA synthesized from the tRNA and rRNA genes encoded primarily in these segments of the mtDNA.

Category 1 and 3 segments

Although both categories 1 and 3 contain low amounts of guanine, the probability distributions in Figure 1 and the numbers in Table 2 show that this is most marked for category 3 segments. To determine whether this difference is reflected in the corresponding protein sequence, we measured the occurrence of G in each codon position for the protein gene segments assigned to either category 1 or 3 (Table 4). This analysis included practically all the protein-coding mtDNA since only very small sections of these genes were assigned to category 2. In cases where the boundary between category assignments occurred within a codon, we shifted the boundary to the nearest codon edge. This, of course, never requires a boundary shift of more than one base pair. We did not include the ND6 gene since this is the only protein gene that is transcribed from the heavy strand and its inclusion would complicate the codon analysis unnecessarily. The small overlapping gene sections between the ATPase8 and ATPase6 genes, and the ND4L and ND4 genes, are also not included since no unique codon positions could be assigned there.
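A minimal sketch of the per-codon-position count is given below. It assumes the input is already in frame, with boundaries shifted to codon edges as described above; the function name and the toy fragment are hypothetical and are not taken from the mtDNA sequence.

```python
def guanine_by_codon_position(coding_seq):
    """Percentage of G at codon positions 1, 2 and 3 of an in-frame sequence."""
    usable = len(coding_seq) - len(coding_seq) % 3  # drop an incomplete final codon
    n_codons = usable // 3
    if n_codons == 0:
        raise ValueError("sequence shorter than one codon")
    counts = [0, 0, 0]
    for index in range(usable):
        if coding_seq[index] == "G":
            counts[index % 3] += 1
    return [100.0 * c / n_codons for c in counts]

# Toy in-frame fragment with codons GCA GGA TTG ACC.
print(guanine_by_codon_position("GCAGGATTGACC"))  # prints: [50.0, 25.0, 25.0]
```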

Table 4. Distribution of guanine over the three codon positions in the protein-coding sections of categories 1 and 3 on the light strand.

| Category | % G in codon position 1 | % G in codon position 2 | % G in codon position 3 |
| --- | --- | --- | --- |
| Category 1 | 22.9 | 12.5 | 5.0 |
| Category 3 | 7.8 | 7.2 | 3.5 |

Overlapping gene sections, where no unique codon position can be assigned, and the ND6 gene, encoded on the heavy strand, are not included. The decrease in G in category 3 occurs mainly in the first two codon positions, not the third, so the amino acid composition must differ between the two segments.

The guanine abundance is indeed very low in the third position of the codons, but there is little difference in this value between the gene segments assigned to categories 1 and 3. The difference in the guanine abundance lies instead in the first two positions of the codons resulting in an actual difference in the amino acid compositions of these segments of the proteins (Table 5). Compared to category 1 segments, the parts of the proteins encoded on category 3 gene segments of mtDNA had a lower abundance of all but one of the amino acids requiring guanine in the vertebrate mitochondrial code. Only arginine was not depressed in category 3 segments, and that amino acid occurs in low and nearly equal frequency in both category 1 and 3 mtDNA segments. The amino acids alanine, glycine and valine appeared much less often in the category 3 segments of the mtDNA-encoded proteins than in the category 1 segments.

Table 5. Abundance of the codons in the protein-coding mtDNA sections assigned to categories 1 and 3, including the two termination codons.

| Amino acid | mtDNA codon | Frequency in Category 1 (%) | Frequency in Category 3 (%) | Difference (%) |
| --- | --- | --- | --- | --- |
| Ala, A | GCN | 8.27 | 2.74 | –5.53 |
| Gly, G | GGN | 6.15 | 1.67 | –4.48 |
| Val, V | GTN | 4.34 | 1.43 | –2.91 |
| Asp, D | GAY | 2.04 | 0.83 | –1.21 |
| Phe, F | TTY | 5.96 | 5.00 | –0.96 |
| His, H | CAY | 2.92 | 2.14 | –0.78 |
| Tyr, Y | TAY | 3.54 | 2.86 | –0.68 |
| Trp, W | TGR | 2.81 | 2.14 | –0.67 |
| Glu, E | GAR | 2.08 | 1.43 | –0.65 |
| Met, M | ATR | 5.46 | 5.12 | –0.34 |
| Gln, Q | CAR | 2.54 | 2.26 | –0.28 |
| Cys, C | TGY | 0.54 | 0.48 | –0.06 |
| Ser, S | AGY | 1.35 | 1.31 | –0.04 |
| Term | AGR | 0.00 | 0.00 | 0.00 |
| Arg, R | CGN | 1.65 | 1.67 | 0.02 |
| Ser, S | TCN | 6.04 | 6.07 | 0.03 |
| Term | TAR | 0.04 | 0.36 | 0.32 |
| Ile, I | ATY | 8.30 | 9.64 | 1.34 |
| Asn, N | AAY | 3.84 | 6.43 | 2.59 |
| Lys, K | AAR | 1.77 | 4.64 | 2.87 |
| Pro, P | CCN | 5.19 | 8.10 | 2.91 |
| Leu, L | CTN, TTR | 16.69 | 20.00 | 3.31 |
| Thr, T | ACN | 8.50 | 13.69 | 5.19 |

The rows are ordered by the size of the difference in the codon abundances. All, except one, of the amino acids requiring guanine (in bold face) appear with lower abundance in category 3 than in category 1 segments. The exception, arginine, occurs at nearly identical and low amounts in both categories. In column 2, we use the notation N = A, C, G or T; Y = T or C; R = A or G.

Considering the spatial distribution of the categories, with category 3 segments occurring on one or both ends of many genes (Fig. 2a), it is reasonable to suspect that this category corresponds to some property of the encoded protein, such as hydrophobicity. However, we found no pattern in the physical and chemical properties of the amino acids with large differences in abundance between category 1 and 3 segments. We compared the differences in amino acid abundances (column 5 of Table 5) with various amino acid properties: number of Gs required in the codon, hydrophobicity [using Kyte-Doolittle (22), Bull-Breese (23), Radzicka-Wolfenden (24) and Eisenberg-McLachlan (25) scales], polarity, acid–base categorization and mass. The difference in amino acid abundance was only related to the number of Gs required in the codon, and not to the predicted physico-chemical properties of the proteins. This is illustrated by Figure 3 which shows a comparison of the location of the category 3 segments to the guanine distribution along the sequence (Fig. 3A) and to the hydrophobicity profile of three mtDNA encoded proteins, ND1, ND2 and COX I (Fig. 3B). Since the hydrophobicity profile of these proteins varies rapidly, for clarity we only show this section of the mtDNA, comprising approximately one-quarter of the total mtDNA. This section of mtDNA was chosen because it contains two genes, ND1 and ND2, with large category 3 segments and one gene, COX I, with no category 3 segments. While the relationship between the position of the category 3 segments and the guanine abundance minima is clear (Fig. 3A), there is no apparent relationship between the category 3 segments and the variations in hydrophobicity of the amino acids in the proteins. We conclude that the segmentation is determined directly by DNA sequence properties, in this case guanine abundance, and not by any physico-chemical property of the encoded proteins that could exert a selection pressure on the evolution of the genome.

Figure 3. A comparison of the positions of the category 3 segments, shaded in green, to (A) guanine distribution and (B) predicted hydrophobicity. The guanine abundance was calculated using a sliding window of length 201 bp. The predicted hydrophobicity was calculated using the Kyte-Doolittle scale averaged over a sliding window of 21 amino acids and is plotted as a function of the base pair position of the codons for each amino acid. Due to the rapid variations in predicted hydrophobicity, only three genes (ND1, ND2 and COX I) are plotted. No relationship between the predicted protein hydrophobicity and the category 3 locations is apparent in this plot, but the location of the category 3 segments clearly corresponds to sequence regions with low guanine abundance. This plot shows that the low guanine content of category 3 is true for each segment individually, not just for the average taken over all the segments.
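The sliding-window guanine abundance used in panel (A) can be sketched as below. The window is centred on each position and wraps around the origin, which is an assumption made here because the genome is circular; the caption does not say how the window was handled at the sequence ends. The function name is illustrative.

```python
def sliding_guanine(seq, window=201):
    """Fraction of G in a centred window at each position of a circular sequence."""
    n = len(seq)
    half = window // 2
    is_g = [1 if base == "G" else 0 for base in seq]
    # Count for the window centred on position 0, wrapping around the origin.
    count = sum(is_g[i % n] for i in range(-half, half + 1))
    fractions = []
    for centre in range(n):
        fractions.append(count / window)
        # Slide one step: add the next base on the right, drop the leftmost base.
        count += is_g[(centre + half + 1) % n] - is_g[(centre - half) % n]
    return fractions

# Toy circular sequence: ten G followed by ten A, scanned with a 5 bp window.
profile = sliding_guanine("G" * 10 + "A" * 10, window=5)
print(profile[:5])
```

Updating the count incrementally keeps the cost linear in the sequence length, regardless of the window size.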

The best segmentation model, with a posterior probability of greater than 90%, used three order one categories. Our interpretation of category 2, covering the rRNA and tRNA genes, requires an order one model based on dinucleotide statistics while our interpretation of categories 1 and 3 is based only on nucleotide abundances and could have been defined by order zero models. We examined the best order zero model results (data not shown) and found one category corresponding very well with our category 3, another category corresponding well to category 1, but no single order zero category corresponding to the rRNA and tRNA genes in category 2. This is in agreement with our interpretation of the meaning of the categories. However, the posterior probability of the best order zero segmentation was much lower (<50%) than the probability of the best order one segmentation, so there must be significant differences between categories 1 and 3 in their dinucleotide properties, in addition to the large and obvious difference in guanine abundance.

What determines the low frequency of guanine in category 3 segments of mtDNA?

In the following section, we will focus our discussion in terms of the light strand, though of course any changes could be due to mutations of the complementary bases on the heavy strand.

Studies of mtDNA mutation frequencies (26) have shown that over the whole human mitochondrial genome, guanine is the least stable nucleotide, while cytosine is the most stable and adenine and thymine have intermediate stability. The deficiency in guanine in the category 3 segments could therefore be due to an increased mutation rate of guanine in the category 3 mtDNA segments. Over many generations this would lead to a decreased number of guanine residues in these sequence segments. To test this hypothesis, we compared the sequence segmentation to the positions of observed point mutations listed in two independent databases, MitoMap and the mtDB from Uppsala University, to determine whether the mutation frequency of the nucleotides varied among the segmentation categories (Table 6). We included all reported point mutations (transitions and transversions) within the coding region, defined as all of the mtDNA sequence outside of the D-loop. We only used the coding region so that the large number of observed mutations in the hyper-variable regions of the D-loop would not dominate our analysis, and we did not include deletion or insertion mutations.

Table 6. Analysis of mutation frequencies recorded in the MitoMap and mtDB databases.

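| Category | A (no. mutated/total) | C (no. mutated/total) | G (no. mutated/total) | T (no. mutated/total) | A (mutation intensity) | C (mutation intensity) | G (mutation intensity) | T (mutation intensity) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MitoMap, 783 mutated bases | | | | | | | | |
| 1 | 119/2211 | 73/2594 | 98/1056 | 95/2040 | 1.06 | 0.56 | 1.83 | 0.92 |
| 2 | 75/1511 | 41/1105 | 55/756 | 55/1054 | 0.98 | 0.73 | 1.44 | 1.03 |
| 3 | 58/1063 | 36/1110 | 31/205 | 47/740 | 1.08 | 0.63 | 2.98 | 1.25 |
| All | 252/4785 | 150/4809 | 184/2017 | 197/3834 | 1.04 | 0.62 | 1.80 | 1.01 |
| mtDB (University of Uppsala), 1775 mutated bases | | | | | | | | |
| 1 | 249/2211 | 234/2594 | 163/1056 | 240/2040 | 0.98 | 0.78 | 1.34 | 1.02 |
| 2 | 141/1511 | 106/1105 | 85/756 | 127/1054 | 0.81 | 0.83 | 0.98 | 1.05 |
| 3 | 130/1063 | 130/1110 | 51/205 | 119/740 | 1.06 | 1.02 | 2.16 | 1.40 |
| All | 520/4785 | 470/4809 | 299/2017 | 486/3834 | 0.95 | 0.85 | 1.29 | 1.10 |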

Columns 2–5 record the number of reported mutations of each nucleotide in each category of human mtDNA, in the coding region. To allow us to compare the data from databases of different sizes, in the last four columns, we define a mutation intensity as the value in the first four columns divided by the normalization (total number of all mutated bases in coding region/total length of coding region). Guanine bases were mutated more often than the other nucleotides, and this enhanced mutation is largest in the category 3 segments shown in bold type. The ‘All’ rows give the values over the entire coding region for comparison. We found evidence of a difference in the guanine mutations in categories 1 and 3 for the MitoMap data, with the 95% CI of (0.0063, 0.1105) excluding zero. There is strong evidence of this difference in the mtDB data, with a 99% CI of (0.0115, 0.1773).

As of March 2003, there were 783 nucleotide positions in the coding region with observed mutations listed in the MitoMap database and 1775 mutated nucleotide positions in the mtDB database. In the mtDB database, the number of tested individuals with each listed polymorphism is given, but that information is confounded by inheritance effects and population dynamics and cannot give us a measurement of the mutation frequency at each base position. We have used the information on the number of individuals with each polymorphism to provide an upper limit in our analysis of synonymous mutations given below. Though one cannot precisely define mutation rates from these databases, we can show that the probability of mutation at a particular site depends both on the nucleotide and the sequence position.

For the sections of mtDNA in each segmentation category, we counted the number of each nucleotide with a reported mutation, and divided this by the total number of that nucleotide in that segmentation category (columns 2–5 of Table 6). A database generated from measurements from a larger population sample will detect more mutations, so in order to compare results between databases, we normalize these values by the ratio of the total number of all reported mutated sequence positions in the coding region in that database divided by the total length of the coding region. For the MitoMap database this normalization factor was 783/15445 and for the mtDB database it was 1775/15445. These normalized values are given in columns 6–9 of Table 6. We labeled this quantity ‘mutation intensity’. If the frequency of mutation is independent of sequence position and nucleotide type then this normalization would give the value 1, aside from random noise due to sampling effects. However, we observed the same clear pattern when this analysis was carried out on data from the two independent databases. Guanine bases were mutated more often than the other nucleotides, and this enhanced mutation is largest in the category 3 segments (Table 6 in bold type). Neither adenine nor cytosine showed much variation in mutation intensity over the three categories, but in both databases, thymine had a slightly elevated mutation intensity in the category 3 segments.
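The normalization described above can be checked against the mtDB rows of Table 6. The following is a minimal sketch, not the authors' code: the function and variable names are illustrative assumptions, and only the counts, the 1775 mutated positions and the coding-region length of 15445 are taken from the text and table.

```python
# Counts from the mtDB rows of Table 6: (mutated positions, total positions)
# for each nucleotide in each segmentation category.
MTDB_COUNTS = {
    1: {"A": (249, 2211), "C": (234, 2594), "G": (163, 1056), "T": (240, 2040)},
    2: {"A": (141, 1511), "C": (106, 1105), "G": (85, 756), "T": (127, 1054)},
    3: {"A": (130, 1063), "C": (130, 1110), "G": (51, 205), "T": (119, 740)},
}
MTDB_TOTAL_MUTATED = 1775  # mutated positions in the coding region (mtDB)
CODING_LENGTH = 15445      # coding-region length used in the normalization


def mutation_intensity(mutated, total, db_mutated, coding_length):
    """Fraction of mutated positions divided by the database-wide fraction."""
    return (mutated / total) / (db_mutated / coding_length)


# Hypothetical helper structure: intensity per category and nucleotide.
intensities = {
    cat: {
        nt: mutation_intensity(m, n, MTDB_TOTAL_MUTATED, CODING_LENGTH)
        for nt, (m, n) in row.items()
    }
    for cat, row in MTDB_COUNTS.items()
}

for cat, row in intensities.items():
    print(cat, {nt: round(value, 2) for nt, value in row.items()})
```

A value near 1 means that a nucleotide in that category is mutated about as often as the database-wide average; for guanine in category 3 the value is above 2.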

In the mtDB database, 92% of the guanine point mutations in the coding region of the light strand were G to A transitions. Is the difference between the segmentation categories due to changes in the guanine transition or transversion rates? To determine this, from the mtDB database, we calculated the percentage of guanine bases in each segmentation category (in the coding region) with recorded transitions or transversions to (A,C,T) respectively. For category 1 mtDNA segments, these values were (14.9, 0.4 and 0.2%). For category 2, they were (10.2, 0.4 and 1.1%). For category 3, they were (22.9, 0 and 2.9%). The increase in the guanine mutation intensity in category 3 segments compared to the category 1 and 2 segments is primarily due to an increase in the G to A transition.
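As a worked check of the last statement, the percentages quoted above can be split into transitions (G to A) and transversions (G to C or T). This is an illustrative sketch only; the names are assumptions, and the inputs are the rounded percentages given in the text.

```python
# Percentage of guanine positions (coding region, mtDB) with a recorded
# change to A, C or T, as quoted in the text for each category.
G_CHANGE_PCT = {
    1: {"A": 14.9, "C": 0.4, "T": 0.2},
    2: {"A": 10.2, "C": 0.4, "T": 1.1},
    3: {"A": 22.9, "C": 0.0, "T": 2.9},
}


def split(category):
    """Return (transition %, transversion %) for one category."""
    row = G_CHANGE_PCT[category]
    return row["A"], row["C"] + row["T"]


ts1, tv1 = split(1)
ts3, tv3 = split(3)
rise_ts = ts3 - ts1  # increase in G to A transitions, percentage points
rise_tv = tv3 - tv1  # increase in transversions, percentage points
share = rise_ts / (rise_ts + rise_tv)

print(f"G to A rise: {rise_ts:.1f} points")
print(f"transversion rise: {rise_tv:.1f} points")
print(f"share of the rise due to G to A: {share:.0%}")
```

On these rounded figures, the G to A transition accounts for roughly three quarters of the increase between categories 1 and 3.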

These observations can be explained in two ways. The most direct interpretation is that the G to A transition rate is variable along the mitochondrial genome and is highest in the category 3 segments. An alternative interpretation is that the guanine mutation rate is uniformly high along the mitochondrial genome, but there is a strong selection pressure against guanine mutations in categories 1 and 2. To distinguish between these two possibilities, we analyzed just the synonymous mutations recorded in each database, since these mutations would not be influenced by selective forces acting at the protein level. The guanine mutation intensity is higher for synonymous mutations (Table 7) than it is for all mutations (Table 6), but the pattern of higher guanine mutation intensity in category 3 than in category 1 persists. It is reasonable to interpret the synonymous mutation intensity as reflecting the true rate of mutation formation, before selection effects occur.

Table 7. Synonymous mutations recorded in the MitoMap and mtDB databases.

Category Synonymous mutations/total number of nucleotides in codon position 3 Mutation intensity
  A C G T A C G T
  MitoMap, 291 synonymous codon 3 mutations        
1 67/937 52/1134 46/130 45/399 0.88 0.56 4.35 1.39
3 31/371 15/423 16/34 19/153 1.03 0.44 5.79 1.53
  mtDB, 4952 synonymous codon 3 mutations in 646 individuals        
1 1205/(937 × 646) 769/(1134 × 646) 793/(130 × 646) 900/(399 × 646) 0.93 0.49 4.41 1.63
3 307/(371 × 646) 250/(423 × 646) 292/(34 × 646) 436/(153 × 646) 0.60 0.43 6.21 2.06

The analysis is the same as that in Table 6, but only includes data from synonymous mutations in codon position 3 in the protein coding genes, excluding the overlapping gene sections where no unique codon position can be assigned. The category 2 segments contained very little protein coding mtDNA, so they can have very few synonymous mutations and are not included in this analysis. The normalization constants used to calculate the mutation intensities were 291/3581 for the MitoMap data and 4952/(3581 × 646) for the mtDB data. The mutation intensity for guanine in category 3 segments is very high compared to the other nucleotides and to guanine in the other segmentation categories (bold type). Since synonymous mutations should not be affected by selection pressures, this indicates that the guanine mutation rate varies with sequence position and is highest in the category 3 segments.
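The two normalizations in Table 7 differ only in how mutation events are counted: once per mutated position for MitoMap, and once per carrier among the 646 sequenced individuals for mtDB. A minimal sketch of both calculations follows; the function name and the `events_per_site` parameter are illustrative assumptions, while the counts are those listed in Table 7.

```python
# Table 7 inputs for categories 1 and 3: codon-position-3 sites per
# nucleotide, and synonymous mutation counts in each database.
SITES = {1: {"A": 937, "C": 1134, "G": 130, "T": 399},
         3: {"A": 371, "C": 423, "G": 34, "T": 153}}
MITOMAP = {1: {"A": 67, "C": 52, "G": 46, "T": 45},
           3: {"A": 31, "C": 15, "G": 16, "T": 19}}
MTDB = {1: {"A": 1205, "C": 769, "G": 793, "T": 900},
        3: {"A": 307, "C": 250, "G": 292, "T": 436}}
TOTAL_SITES = 3581  # codon-position-3 sites used in the normalization
INDIVIDUALS = 646   # sequenced individuals in mtDB


def table7_intensities(counts, events_per_site=1):
    """Mutation intensity per category and nucleotide.

    events_per_site is 1 when each mutated position counts once (MitoMap)
    and the number of individuals when every carrier counts as a separate
    event (mtDB).
    """
    total = sum(sum(row.values()) for row in counts.values())
    norm = total / (TOTAL_SITES * events_per_site)
    return {
        cat: {
            nt: counts[cat][nt] / (SITES[cat][nt] * events_per_site) / norm
            for nt in counts[cat]
        }
        for cat in counts
    }


mitomap_int = table7_intensities(MITOMAP)
mtdb_int = table7_intensities(MTDB, INDIVIDUALS)
for cat in (1, 3):
    print(f"category {cat}: guanine intensity "
          f"{mitomap_int[cat]['G']:.2f} (MitoMap), "
          f"{mtdb_int[cat]['G']:.2f} (mtDB)")
```

The category totals in the sketch sum to the 291 and 4952 mutations stated in the table, which is why the normalization constants do not need to be entered separately.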

In the analysis of the synonymous mutations in the MitoMap database, the difference in the mutations for the codon 3 guanines in categories 1 and 3 is not significant, since the 95% CI of (–0.0709, 0.3036) contains zero. There are two difficulties with this analysis. First, it is difficult to detect significant differences in this data since in category 3 there are only 34 codon 3 sequence positions containing guanine. Second, approximately half of those sequence positions are listed in this mutation database so it is very likely that the same sequence positions have mutated independently multiple times, but this analysis interprets each mutated position as a single mutation event and will thus undercount these mutations. So the measurement of the mutation intensity of guanine in category 3 from the MitoMap database is a lower limit to the actual value. The mtDB database contains the number of individuals sequenced with each polymorphism. It is likely that some of these individuals have inherited these polymorphisms from common ancestors, but if we do interpret the number of individuals with a polymorphism as indicating the number of times that each mutation has occurred then this analysis gives an upper limit to the actual mutation intensity. By comparing this with the lower limit from the MitoMap analysis, we gain a better understanding of the true mutation intensity. The values, listed in Table 7, calculated from the two independent databases are not very different. In the mtDB data, we have strong evidence of the difference in the guanine mutations between categories 1 and 3, with 99.9% CI of (0.0011, 0.0066). The high mutation intensity of synonymous guanine mutations in category 3 segments indicates that the heterogeneous guanine mutation rate occurs independently of selective forces acting at the protein level, and is a direct property of the mtDNA sequence itself.
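The confidence intervals quoted for the difference in guanine mutations between categories 1 and 3 are consistent with a normal-approximation (Wald) interval for a difference of two proportions. This section does not name the method, so the sketch below is an assumption that happens to reproduce the reported bounds closely, not a statement of the authors' procedure. The function name is hypothetical, and the mtDB synonymous case treats each individual at each site as an independent trial, which mirrors the upper-limit counting described above.

```python
from statistics import NormalDist


def diff_ci(x1, n1, x3, n3, level):
    """Wald interval for p3 - p1, the difference of two proportions."""
    p1, p3 = x1 / n1, x3 / n3
    se = (p1 * (1 - p1) / n1 + p3 * (1 - p3) / n3) ** 0.5
    z = NormalDist().inv_cdf(0.5 + level / 2)  # two-sided critical value
    d = p3 - p1
    return d - z * se, d + z * se


# mtDB, all coding-region guanines (Table 6), 99% level
all_mtdb = diff_ci(163, 1056, 51, 205, 0.99)
# mtDB, synonymous codon-3 guanines (Table 7), counted per individual, 99.9% level
syn_mtdb = diff_ci(793, 130 * 646, 292, 34 * 646, 0.999)
# MitoMap, synonymous codon-3 guanines (Table 7), 95% level
syn_mitomap = diff_ci(46, 130, 16, 34, 0.95)

for label, (low, high) in [("mtDB, all", all_mtdb),
                           ("mtDB, synonymous", syn_mtdb),
                           ("MitoMap, synonymous", syn_mitomap)]:
    print(f"{label}: ({low:.4f}, {high:.4f})")
```

Under this assumption the two mtDB intervals exclude zero and the MitoMap synonymous interval contains it, in line with the conclusions stated in the text.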

It is interesting that only some of the respiratory chain polypeptide genes contain category 3 regions. None of the mitochondrial genes of complex IV (COX1, COX2 and COX3) contain any category 3 segments. In contrast, all of the mtDNA genes of complex I (ND1, ND2, ND3, ND4, ND4L, ND5 and ND6) contain category 3 segments, as do both of the complex V genes (ATPase6 and ATPase8). The single complex III gene in mtDNA (Cyt b) contains only a short section of category 3 mtDNA at the beginning of the gene, and another short and fairly low probability category 3 segment at the end of the gene (Fig. 2). This distribution of the segmentation implies that the complex I and V genes, with their high category 3 content, have experienced a greater mutation rate than the complex III and IV genes. The evidence presented here suggests that the different rates of evolution of the different respiratory chain complex genes are at least partly independent of protein structure and function, and are an innate property of the genome itself.

We have identified hitherto unrecognized patterns within the mitochondrial genome sequence, and provide evidence that there is a heterogeneous mutation rate of mtDNA. We have shown that the synonymous mutation intensity of guanine is largest in the category 3 segments and that the difference in guanine abundance between category 1 and 3 segments does not noticeably influence the predicted physico-chemical properties of the corresponding polypeptide sequences. These results suggest that these regions of human mtDNA have evolved, to some extent, independently of protein function at the level of the genome. These findings have important implications for our understanding of mtDNA genome evolution and mtDNA disease. This may provide part of the explanation for the varied phylogenetic and family-based mutation rates that have been measured in different regions of mtDNA (11), and also explain why the rate of mtDNA mutation (the so-called ‘molecular clock’) varies from study to study (8–11). It is intriguing that the category 3 segments generally occurred towards either end of the protein-encoding genes, but we currently cannot see a clear reason for this interesting observation. Further studies are required to confirm these observations and to determine the mechanism responsible for this non-uniform mutation rate. This will require extensive mtDNA sequence analysis carried out in the human population, on patients with mtDNA disease, or on single cells looking for rare somatic mutations.

ACKNOWLEDGEMENTS

D.C.S. thanks the Commonwealth of Virginia for financial support. P.F.C. and D.C.S. receive support from The Wellcome Trust. P.F.C. also receives support from Ataxia UK.

REFERENCES

1. Anderson S., Bankier,A.T., Barrell,B.G., de Bruijn,M.H., Coulson,A.R., Drouin,J., Eperon,I.C., Nierlich,D.P., Roe,B.A., Sanger,F. et al. (1981) Sequence and organization of the human mitochondrial genome. Nature, 290, 457–465.
2. Mishmar D., Ruiz-Pesini,E., Golik,P., Macaulay,V., Clark,A.G., Hosseini,S., Brandon,M., Easley,K., Chen,E., Brown,M.D. et al. (2003) Natural selection shaped regional mtDNA variation in humans. Proc. Natl Acad. Sci. USA, 100, 171–176.
3. Wallace D.C. (1994) Mitochondrial DNA sequence variation in human evolution and disease. Proc. Natl Acad. Sci. USA, 91, 8739–8746.
4. Wilson M.R., DiZinno,J.A., Polanskey,D., Replogle,J. and Budowle,B. (1995) Validation of mitochondrial DNA sequencing for forensic casework analysis. Int. J. Legal Med., 108, 68–74.
5. Wallace D.C. (1999) Mitochondrial diseases in mouse and man. Science, 283, 1482–1488.
6. DiMauro S. and Schon,E.A. (2001) Mitochondrial DNA mutations in human disease. Am. J. Med. Genet., 106, 18–26.
7. Servidei S. (2003) Mitochondrial encephalomyopathies: gene mutation. Neuromuscular Disord., 13, 109–114.
8. Howell N. and Smejkal,C.B. (2000) Persistent heteroplasmy of a mutation in the human mtDNA control region: hypermutation as an apparent consequence of simple-repeat expansion/contraction. Am. J. Hum. Genet., 66, 1589–1598.
9. Ingman M., Kaessmann,H., Paabo,S. and Gyllensten,U. (2000) Mitochondrial genome variation and the origin of modern humans. Nature, 408, 708–713.
10. Sigurgardottir S., Helgason,A., Gulcher,J.R., Stefansson,K. and Donnelly,P. (2000) The mutation rate in the human mtDNA control region. Am. J. Hum. Genet., 66, 1599–1609.
11. Heyer E., Zietkiewicz,E., Rochowski,A., Yotova,V., Puymirat,J. and Labuda,D. (2001) Phylogenetic and familial estimates of mitochondrial substitution rates: study of control region mutations in deep-rooting pedigrees. Am. J. Hum. Genet., 69, 1113–1126.
12. Andrews R.M., Kubacka,I., Chinnery,P.F., Lightowlers,R.N., Turnbull,D.M. and Howell,N. (1999) Reanalysis and revision of the Cambridge reference sequence for human mitochondrial DNA. Nature Genet., 23, 147–147.
13. Hernstadt C., Elson,J.L., Fahy,E., Preston,G., Turnbull,D.M., Anderson,C., Ghosh,S.S., Olefsky,J.M., Beal,M.F., Davis,R.E. and Howell,N. (2002) Reduced-median-network analysis of complete mitochondrial DNA coding-region sequences for the major African, Asian, and European haplogroups. Am. J. Hum. Genet., 70, 1152–1171.
14. Kogelnik A.M., Lott,M.T., Brown,M.D., Navathe,S.B. and Wallace,D.C. (1998) MITOMAP: a human mitochondrial genome database—1998 update. Nucleic Acids Res., 26, 112–115.
15. Churchill G.A. (1989) Stochastic models for heterogeneous DNA sequences. Bull. Math. Biol., 51, 79–94.
16. Nicolas P., Bize,L., Muri,F., Hoebeke,M., Rodolphe,F., Ehrlich,S.D., Prum,B. and Bessières,P. (2002) Mining Bacillus subtilis chromosome heterogeneities using hidden Markov models. Nucleic Acids Res., 30, 1418–1426.
17. Brooks S.P. (1998) Markov chain Monte Carlo method and its application. Statistician, 47, 69–100.
18. Boys R.J. and Henderson,D.A. (2003) A Bayesian approach to DNA sequence segmentation. Biometrics, in press.
19. Boys R.J., Henderson,D.A. and Wilkinson,D.J. (2000) Detecting homogeneous segments in DNA sequences by using hidden Markov models. J. R. Stat. Soc. Ser. C Appl. Stat., 49, 269–285.
20. Shioiri C. and Takahata,N. (2001) Skew of mononucleotide frequencies, relative abundance of dinucleotides, and DNA strand asymmetry. J. Mol. Evol., 53, 364–376.
21. Schattner P. (2002) Searching for RNA genes using base-composition statistics. Nucleic Acids Res., 30, 2076–2082.
22. Kyte J. and Doolittle,R.F. (1982) A simple method for displaying the hydropathic character of a protein. J. Mol. Biol., 157, 105–132.
23. Bull H. and Breese,K. (1974) Surface tension of amino acid solutions: A hydrophobicity scale of the amino acid residues. Arch. Biochem. Biophys., 161, 665–670.
24. Radzicka A. and Wolfenden,R. (1988) Comparing the polarities of the amino-acids—side-chain distribution coefficients between the vapor-phase, cyclohexane, 1-octanol, and neutral aqueous-solution. Biochemistry, 27, 1664–1670.
25. Eisenberg D. and McLachlan,A.D. (1986) Solvation energy in protein folding and binding. Nature, 319, 199–203.
26. Tanaka M. and Ozawa,T. (1994) Strand asymmetry in human mitochondrial-DNA mutations. Genomics, 22, 327–335.

Articles from Nucleic Acids Research are provided here courtesy of Oxford University Press
