Proceedings of the National Academy of Sciences of the United States of America. 2007 May 31;104(24):10080–10085. doi: 10.1073/pnas.0703737104

The selection of acceptable protein mutations

Rajkumar Sasidharan, Cyrus Chothia
PMCID: PMC1891269  PMID: 17540730

Abstract

We have determined the general constraints that govern sequence divergence in proteins that retain entirely, or very largely, the same structure and function. To do this we collected data from three different groups of orthologous sequences: those found in humans and mice, in humans and chickens, and in Escherichia coli and Salmonella enterica. In total, these organisms have 21,738 suitable pairs of orthologs, and these contain nearly 2 million mutations. The three groups differ greatly in the taxa from which they come and/or in the time that separates them from their last common ancestor. Nevertheless, the results we obtain from the three different groups are strikingly similar. For each group, the orthologous sequence pairs were assigned to six different divergence categories on the basis of their sequence identities. For categories with the same divergence, common accepted mutations have similar frequencies and rank orders in the three groups. With divergence, the width of the range of common mutations grows in the same manner in each group. We examined the distribution of mutations in protein structures. With increasing divergence, mutations increase at different rates in the buried, intermediate, and exposed regions of protein structures in a manner that explains the exponential relationship between the divergence of structure and sequence. This work implies that commonly allowed mutations are selected by a set of general constraints that are well defined and whose nature varies with divergence.

Keywords: codon frequencies, distribution of mutations in protein structure, sequence–structure divergence


The sequences of almost all proteins diverge, to a greater or lesser extent, during the course of evolution. Here we describe an investigation into the nature of the selection process that governs divergence in those proteins that retain entirely, or very largely, the same fold and function. We try to answer a number of questions for which, to our knowledge, there are few or no detailed answers at present. Does divergence of proteins in very different organisms involve the same or different acceptable mutations? How does the selection of mutations vary with divergence? What is the distribution of mutations in protein structures, and how does it vary with divergence?

Proteins adapt to mutations by changes in their structure (1–6). For homologous proteins that have mutated up to 60% of their residues, the accommodation of mutations occurs largely through small shifts in the relative position of regions that maintain the same or very similar conformations: there are relatively few insertions, deletions, or changes in local conformation. For proteins with divergences greater than this, the accommodation of mutations increasingly involves insertions, deletions, and/or changes in local conformation: for homologous pairs of proteins with sequence identities of 20%, it is common for regions that have different conformations to form half of each structure (2).

These observations mean that, to determine the nature of the mutations that are acceptable to proteins that retain the same function and fold, the best data come from pairs of orthologous proteins whose sequences have diverged by not more than ≈60% and are similar in length. Three groups of orthologous pairs of sequences are used in this work: those from humans and mice, those from humans and chickens, and those from Escherichia coli and Salmonella enterica. These groups have large differences. The first two involve proteins from higher eukaryotes, and the third involves those from prokaryotes. The time of divergence from the common ancestor of humans and chickens (310 million years) is more than three times that for humans and mice (90 million years) and for E. coli and S. enterica (≈100 million years) (7, 8).

Results and Discussion

Three Sets of Orthologous Proteins That Have Similar Structures.

We performed two rounds of all-against-all comparisons of protein sequences using BLASTP (9) for three groups of organisms: human_mouse (h_m), human_chicken (h_c), and E. coli_S. enterica (e_s). The protein sequences predicted from the genome sequences of humans (10), mice (11), and chickens (12) were obtained from Ensembl (13), and those from the genome sequences of E. coli (14) and S. enterica (15) were obtained from the National Center for Biotechnology Information (16). For each group, pairs of sequences that made reciprocal best matches are taken to be orthologous. To check that the two sequences in each pair have not been subject to large insertions or deletions we removed those pairs in which the length of the shorter sequence is <80% of that of the longer sequence. We also removed pairs where the sequence divergence is >60%.
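The selection rules described above can be sketched in a few lines. This is an illustration only, not the authors' pipeline: the best-hit dictionaries stand in for parsed BLASTP output, the sequence identifiers are invented, and the function and parameter names are hypothetical.

```python
def reciprocal_best_hits(best_a_to_b, best_b_to_a):
    """Return (a, b) pairs where a's best hit is b and b's best hit is a."""
    return [(a, b) for a, b in best_a_to_b.items() if best_b_to_a.get(b) == a]


def passes_filters(len_a, len_b, divergence_pct,
                   min_length_ratio=0.80, max_divergence_pct=60.0):
    """Keep pairs of similar length whose divergence does not exceed 60%."""
    shorter, longer = sorted((len_a, len_b))
    return (shorter / longer >= min_length_ratio
            and divergence_pct <= max_divergence_pct)


# Hypothetical best-hit tables (identifiers invented for illustration).
human_to_mouse = {"H1": "M1", "H2": "M2", "H3": "M9"}
mouse_to_human = {"M1": "H1", "M2": "H7", "M9": "H3"}
pairs = reciprocal_best_hits(human_to_mouse, mouse_to_human)
print(pairs)
```

Here H2 is dropped because its best mouse match, M2, points back to a different human sequence.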

These comparisons gave 11,160 pairs of h_m orthologs, 7,685 for the h_c group, and 2,893 for the e_s group. The total number of orthologous pairs is 21,738, and the total number of mutations found in these pairs is 1,955,309 (Table 1).

Table 1.

h_m, h_c, and e_s orthologs of similar size: The number of sequences, known structures, and mutations

No. of sequences and structures in ranges of different divergence

| Orthologs | Quantity | Total | <10% | 10–20% | 20–30% | 30–40% | 40–50% | 50–60% |
|---|---|---|---|---|---|---|---|---|
| h_m | Sequences | 11,160 | 4,852 | 3,575 | 1,725 | 706 | 234 | 68 |
| h_m | Mutations in sequences | 810,305 | 139,447 | 286,052 | 220,208 | 105,727 | 45,548 | 13,323 |
| h_m | Known structures | 275 | 131 | 80 | 39 | 17 | 7 | 1 |
| h_m | Mutations in structures | 10,122 | 1,802 | 3,505 | 3,035 | 1,055 | 636 | 89 |
| h_c | Sequences | 7,685 | 1,323 | 1,901 | 1,813 | 1,398 | 868 | 382 |
| h_c | Mutations in sequences | 1,037,022 | 35,718 | 146,384 | 234,619 | 264,745 | 229,153 | 126,403 |
| h_c | Known structures | 225 | 48 | 58 | 55 | 38 | 17 | 9 |
| h_c | Mutations in structures | 15,130 | 526 | 2,592 | 4,101 | 4,002 | 2,325 | 1,584 |
| e_s | Sequences | 2,893 | 1,541 | 935 | 244 | 84 | 46 | 43 |
| e_s | Mutations in sequences | 107,982 | 30,168 | 41,219 | 16,660 | 7,324 | 5,828 | 6,783 |
| e_s | Known structures | 495 | 313 | 146 | 27 | 4 | 2 | 3 |
| e_s | Mutations in structures | 13,733 | 5,389 | 5,683 | 1,797 | 291 | 104 | 469 |
| All | Total no. of sequences | 21,738 | 7,716 | 6,411 | 3,782 | 2,188 | 1,148 | 493 |
| All | Total no. with known structure | 995 | 492 | 284 | 121 | 59 | 26 | 13 |

For each group, the sequence divergence of each pair was computed as the percentage of residues that are not identical. Then, according to their divergence, each pair was placed in one of six categories: <10%, 10–20%, 20–30%, 30–40%, 40–50%, or 50–60%. Thus, taken together, the three groups of orthologs have in all 3 × 6 = 18 categories. The distribution of h_m and e_s orthologs in these categories is exponential: many pairs are found in the <10% category, and decreasingly few are found in the other categories. For the h_c orthologs, the distribution is that of a skewed bell curve [see supporting information (SI)]. The different shapes of the distributions arise from the differences in times that have elapsed since the three pairs of organisms diverged from their common ancestor.
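As a minimal sketch of this step, the following computes the divergence of an aligned pair and assigns it to a category. The treatment of gaps (skipped) and of boundary values (a pair at exactly 10% goes to the 10–20% category, and one at exactly 60% to the 50–60% category) is an assumption; the text does not specify either.

```python
def percent_divergence(aligned_a, aligned_b):
    """Percentage of aligned (non-gap) positions whose residues differ."""
    compared = [(x, y) for x, y in zip(aligned_a, aligned_b)
                if x != "-" and y != "-"]
    if not compared:
        raise ValueError("no aligned residues to compare")
    differing = sum(1 for x, y in compared if x != y)
    return 100.0 * differing / len(compared)


def divergence_category(divergence_pct):
    """Map a divergence of 0-60% to one of the six categories."""
    labels = ["<10%", "10–20%", "20–30%", "30–40%", "40–50%", "50–60%"]
    if not 0 <= divergence_pct <= 60:
        raise ValueError("divergence outside the 0-60% range used here")
    # Exactly 60% is placed in the last category (assumed boundary rule).
    return labels[min(int(divergence_pct // 10), 5)]


d = percent_divergence("MKVLA", "MKILA")  # one difference in five positions
print(d, divergence_category(d))
```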

General Characteristics of the Mutations in the Three Groups of Orthologs.

Here we describe the general characteristics of the mutations found in each category and determine the extent to which they are common to the three groups of orthologs.

When counting the frequency of a mutation type, we are not concerned with the direction of the mutation. Thus, a count of VI mutations in the h_m orthologs will include both those that have a valine in the human sequence and an isoleucine in the mouse sequence and vice versa. Given this definition, there are 20 × (20 − 1)/2 = 190 different mutation types.
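The direction-independent counting can be illustrated by keying each mutation on an unordered pair of residues. The sketch below is hypothetical code, not the authors'; it also confirms the count of 190 types.

```python
from collections import Counter
from itertools import combinations

AMINO_ACIDS = "KRHQEDNSTAGPCVILMFYW"  # residue order used in Table 2


def count_mutation_types(aligned_a, aligned_b):
    """Count mutations between two aligned sequences, ignoring direction."""
    counts = Counter()
    for x, y in zip(aligned_a, aligned_b):
        if x != y and x in AMINO_ACIDS and y in AMINO_ACIDS:
            counts[frozenset((x, y))] += 1  # {V, I} is the same key as {I, V}
    return counts


ALL_TYPES = [frozenset(pair) for pair in combinations(AMINO_ACIDS, 2)]
print(len(ALL_TYPES))

counts = count_mutation_types("VIA", "IVA")
print(counts[frozenset("VI")])
```

In the toy alignment, a V→I change and an I→V change are both counted under the single type VI.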

All, or almost all, of the 190 possible types of mutations are observed in each category.

As described above, the orthologs in each group are assigned to one of six divergence categories (Table 1). In Table 2 we list the frequencies of the different types of mutation that are seen in the <10% category of the h_m group. Inspection of this table shows that all 190 possible mutation types occur (although to very different extents; see below). Examination of the other 17 categories shows that in another 10 categories all types of mutations are found (see SI).

Table 2.

The 139,447 mutations in the 4,852 h_m orthologs whose sequence identities have diverged by up to 10%

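| Class | Residue | K | R | H | Q | E | D | N | S | T | A | G | P | C | V | I | L | M | F | Y | W |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| s | K | | | | | | | | | | | | | | | | | | | | |
| s | R | 5,235 | | | | | | | | | | | | | | | | | | | |
| s | H | 87 | 1,639 | | | | | | | | | | | | | | | | | | |
| s | Q | 1,173 | 2,337 | 2,057 | | | | | | | | | | | | | | | | | |
| s | E | 853 | 173 | 79 | 1,301 | | | | | | | | | | | | | | | | |
| s | D | 92 | 85 | 225 | 92 | 7,942 | | | | | | | | | | | | | | | |
| s | N | 743 | 149 | 876 | 112 | 155 | 1,588 | | | | | | | | | | | | | | |
| n | S | 237 | 708 | 245 | 208 | 195 | 372 | 6,965 | | | | | | | | | | | | | |
| n | T | 487 | 261 | 72 | 113 | 158 | 108 | 1,326 | 5,433 | | | | | | | | | | | | |
| n | A | 149 | 184 | 63 | 199 | 874 | 509 | 256 | 5,830 | 10,021 | | | | | | | | | | | |
| n | G | 142 | 1,085 | 73 | 137 | 1,218 | 1,122 | 515 | 5,141 | 424 | 3,044 | | | | | | | | | | |
| n | P | 76 | 481 | 497 | 1,593 | 127 | 52 | 78 | 5,073 | 1,888 | 2,918 | 156 | | | | | | | | | |
| n | C | 13 | 404 | 119 | 30 | 12 | 20 | 53 | 1,082 | 58 | 96 | 356 | 65 | | | | | | | | |
| b | V | 67 | 88 | 31 | 74 | 325 | 117 | 43 | 445 | 1,048 | 5,531 | 1,017 | 263 | 51 | | | | | | | |
| b | I | 77 | 68 | 31 | 39 | 51 | 22 | 156 | 390 | 1,748 | 523 | 96 | 83 | 26 | 8,995 | | | | | | |
| b | L | 109 | 479 | 354 | 743 | 104 | 44 | 63 | 1,064 | 313 | 489 | 148 | 2,128 | 100 | 2,912 | 2,229 | | | | | |
| b | M | 135 | 98 | 17 | 49 | 54 | 11 | 29 | 126 | 1,203 | 339 | 69 | 60 | 13 | 2,524 | 1,513 | 2,135 | | | | |
| b | F | 31 | 35 | 64 | 29 | 25 | 14 | 26 | 461 | 58 | 88 | 46 | 74 | 200 | 337 | 295 | 2,044 | 51 | | | |
| b | Y | 20 | 66 | 954 | 53 | 23 | 90 | 112 | 253 | 22 | 32 | 24 | 35 | 505 | 46 | 30 | 97 | 8 | 1,119 | | |
| b | W | 15 | 266 | 12 | 55 | 22 | 4 | 9 | 74 | 11 | 23 | 79 | 24 | 96 | 19 | 13 | 122 | 9 | 27 | 27 | |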

s, hydrophilic residues; n, neutral residues; b, hydrophobic residues; boldface, the 30 most frequent mutations.

The other seven are the <10% category in the h_c group and all of the six categories in the e_s group. For these we find that between two and six of the 190 possible types of mutation are absent (see SI). The absent mutations are as follows, with the number of categories in which the mutation type is not observed given in parentheses: KC (two), KF (one), HC (one), PC (one), CM (three), KW (three), DW (five), NW (one), PW (three), and IW (three). Of these 10 types of mutations, all require changes in at least two nucleotides (six of them in all three), and nine involve C or W residues, which are the two residues that occur least frequently in proteins. These mutations also involve radical changes in chemical character and/or shape. The absence of these rare mutations in the e_s categories most probably arises because the total number of mutations in these categories is much smaller than those in the h_m and h_c categories (Table 1).

The frequencies of mutations have an exponential character that is related to the extent to which the sequences have diverged.

An examination of the frequencies of different mutations in the 18 categories shows that to good approximation they fit exponential distributions with only small deviations at the top and bottom of the distribution. These distributions have R² values that range from 0.96 to 0.98. In categories that have low levels of divergence, the more common mutation types occur with a frequency that is greater than is found in categories that have high divergence. This is seen in the exponents of the distributions. For the three <10% categories, the exponents of the distribution are in the range of −0.0325 ± 0.0018. For the three 50–60% categories, the exponents are in the range of −0.0206 ± 0.0009 (see SI for additional data).
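One way such an exponent and R² could be estimated is by a least-squares line through the logarithm of frequency against rank. The text does not state the fitting procedure, so this is an assumed method, shown on synthetic data rather than on the paper's counts.

```python
import math


def fit_exponential(frequencies):
    """Fit ln(frequency) = a + b * rank by least squares.

    Returns (b, r_squared); b is the exponent of the fitted exponential.
    """
    ranked = sorted((f for f in frequencies if f > 0), reverse=True)
    xs = list(range(1, len(ranked) + 1))
    ys = [math.log(f) for f in ranked]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sxx, (sxy * sxy) / (sxx * syy)


# Synthetic frequencies for 190 mutation types with a known exponent of -0.03.
synthetic = [1000 * math.exp(-0.03 * rank) for rank in range(1, 191)]
exponent, r_squared = fit_exponential(synthetic)
print(exponent, r_squared)
```

Because the fit is made in log space, rescaling counts to relative frequencies changes only the intercept, not the exponent.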

To determine the implications of this behavior in more detail, we first ranked the mutation types in each category in descending order of their frequencies. We then plotted for each category the cumulative contribution to the total number of mutations made by the ranked mutations as we move from the most frequent to the least frequent. Fig. 1 shows the results of this calculation for the six categories in the h_c group. Plots for the h_m and e_s categories are identical in appearance (see SI). They show that for the three <10% divergence categories, the 28 or 29 most frequent mutation types form 75% of all mutations. For the three 50–60% categories, between 64 and 67 of the most frequent mutation types make up this percentage. Intermediate categories have intermediate numbers of mutation types that form this percentage (Fig. 1).
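The cumulative calculation amounts to sorting the frequencies and summing until the target proportion is reached. A minimal sketch follows, with an invented toy distribution; the function name is hypothetical.

```python
def types_needed(frequencies, target_fraction=0.75):
    """Number of top-ranked mutation types whose counts reach the target share."""
    ranked = sorted(frequencies, reverse=True)
    total = sum(ranked)
    running = 0
    for n_types, freq in enumerate(ranked, start=1):
        running += freq
        if running >= target_fraction * total:
            return n_types
    return len(ranked)


# Toy counts: the two most frequent types already account for 80% of the total.
print(types_needed([50, 30, 10, 5, 5]))
```

Applied to the 190 counts of a real category, the same function would give the number of mutation types that make up 75% of all mutations.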

Fig. 1.

The cumulative contributions made by mutations when placed in descending rank order of their frequencies. To produce the figure, for the six h_c categories, each type of mutation was placed in descending rank order according to its frequency. Then, moving down the list, the frequencies were summed, and their cumulative contributions were calculated as a percentage of all mutations. In the <10% h_c category, the 28 most frequent mutation types form 75% of all mutations; for the 50–60% category, this proportion is formed by the 67 most frequent types. The numbers of frequent mutation types that form this proportion in the intermediate h_c categories and in the h_m and e_s categories are given in the table.

Similarities and Differences in the Rank Order of Mutations.

In this section we investigate the extent to which individual mutations occur with similar frequencies in the different groups. Close similarities would imply that they are subject to similar constraints. We pay particular attention to the sets of mutation types that occur with high frequency and make the major contribution to the total number of mutations. For mutations with low frequencies, small differences in their absolute number can produce large but not very significant differences in their rank order.

We describe calculations carried out on the categories that are the least and most divergent, i.e., the <10% and the 50–60% sets. As might be expected from the data discussed in the previous section, calculations on mutations in the intermediate categories give intermediate results.

Correlation of the rank order of mutation types.

Inspection of h_m, h_c, and e_s categories that have the same divergence shows that many of the individual mutation types have a similar rank order. To obtain an overview of these similarities, we determined the correlation coefficients of the rank orders of the 190 mutation types in the different categories. For the <10% categories, the correlation coefficient of the rank orders in the h_m and h_c categories is 0.96; for h_m and e_s it is 0.88, and for h_c and e_s it is 0.91. The same calculation for the 50–60% categories shows that the correlation coefficient for h_m and h_c is 0.96, for h_m and e_s it is 0.82, and for h_c and e_s it is 0.89.
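A correlation of rank orders can be computed by ranking each list of frequencies and taking the Pearson coefficient of the ranks (the Spearman coefficient). The text does not name the coefficient used, so this choice, and the averaging of tied ranks, are assumptions; the frequencies are invented.

```python
def ranks(values):
    """Rank 1 = largest value; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def rank_correlation(freqs_a, freqs_b):
    """Correlation of the rank orders of two frequency lists."""
    return pearson(ranks(freqs_a), ranks(freqs_b))


# Invented frequencies for four mutation types in two categories.
print(rank_correlation([90, 50, 10, 5], [900, 400, 80, 7]))
```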

The rank order of the frequent types of mutation.

We next examined the rank order of the frequent mutation types in the three groups. The overall position of each mutation type in the three <10% categories and the three 50–60% categories was first determined. To do this we summed the rank order of each type of mutation in the categories that have the same divergence. The mutation types were placed in ascending order of this sum. In Fig. 2 we plot the overall rank order of the 30 most frequent mutations in the <10% and 50–60% categories against the rank order of these mutations in the individual h_m, h_c, and e_s categories.
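The rank-sum ordering can be sketched as follows. The example uses only the rank positions that the text quotes for NS and VL in the three groups; a full calculation would include all 190 types. The function name and tie-breaking rule (alphabetical) are assumptions.

```python
def overall_order(rank_tables):
    """Order mutation types by the sum of their ranks across categories.

    rank_tables: list of dicts mapping mutation type -> rank in one category.
    """
    shared = set.intersection(*(set(table) for table in rank_tables))
    rank_sum = {m: sum(table[m] for table in rank_tables) for m in shared}
    return sorted(rank_sum, key=lambda m: (rank_sum[m], m))


# Rank positions of NS and VL in the h_m, h_c, and e_s categories (from the text).
under_10 = [{"NS": 4, "VL": 13}, {"NS": 5, "VL": 12}, {"NS": 10, "VL": 13}]
from_50_to_60 = [{"NS": 18, "VL": 6}, {"NS": 13, "VL": 6}, {"NS": 14, "VL": 3}]
print(overall_order(under_10))
print(overall_order(from_50_to_60))
```

NS precedes VL in the <10% categories (rank sums 19 and 38), and the order reverses in the 50–60% categories (45 and 15).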

Fig. 2.

The overall rank position of the 30 most frequent mutation types in the <10% (Left) and the 50–60% (Right) categories. Along the x axis we list, in descending order, the mutation types that, overall, are the 30 most frequent in the <10% and 50–60% categories. The overall positions are determined by summing, for categories with the same divergence range, the rank order of each mutation type in the h_m, h_c, and e_s data sets (see text). The rank order of the 30 mutations in the individual h_m, h_c, and e_s categories is given by the numbers on the y axis.

The 30 most frequent overall mutation types in the <10% categories form some 75% of all mutations. Of these 30, there are 26 or 27 types that are also among the 30 most frequent in the individual h_m, h_c, and e_s categories (Fig. 2). This means that the same small subset of mutation types forms close to 75% of all mutations in all of the three <10% categories. Their rank positions differ, on average, by four positions in the h_m and h_c <10% categories, by nine positions in the h_m and e_s categories, and by eight positions in the h_c and e_s categories (see Fig. 2).

The 65 most frequent overall mutation types in the 50–60% categories form some 75% of all mutations. Of these, there are 59, 61, and 58 that are also among the top 65 of, respectively, the h_m, h_c, and e_s categories. Thus, as before, the same subset of mutation types forms close to 75% of all mutations in each of the three categories. The rank positions of the 65 types of mutations are roughly similar: they differ, on average, by eight positions in the h_m and h_c 50–60% categories, by 18 positions in the h_m and e_s categories, and by 13 positions in the h_c and e_s categories.

Changes in the rank order of mutations as a function of divergence.

Examination of the 30 most frequent mutations in the <10% and 50–60% categories (Fig. 2) shows that 22 mutation types are common to both sets. Some of these common mutations have systematic differences in their rank order in the two categories. For example, the mutation NS in the h_m, h_c, and e_s <10% categories is found at rank positions 4, 5, and 10, respectively, whereas in the three 50–60% categories the rank positions are 18, 13, and 14. Conversely, VL is found at positions 13, 12, and 13 in the <10% categories and at positions 6, 6, and 3 in the 50–60% categories (Fig. 2). The frequencies of the observed mutations are the number of mutations that remain after selection. Selection for the <10% categories is, of course, more rigorous than that for more divergent categories, which allow larger structural changes (2). Thus, these differences in rank order imply that, before selection, VL mutations actually occur more frequently than NS, but for highly conserved proteins VL is less acceptable than NS and has been removed by selection to a greater extent.

Eight mutation types, QH, VM, NT, IM, TI, RH, NH, and QP, are present in the top 30 of the <10% categories but not in the top 30 of the 50–60% categories. Conversely, EK, TV, LS, AL, ES, DS, KN, and EG occur in the top 30 of the 50–60% categories but not in the top 30 of the <10% categories (Fig. 2). Note that those unique to the <10% categories conserve their chemical character, shape, and/or volume to a greater extent than those unique to the 50–60% categories. The mutations TV, AL, ES, and DS require at least two nucleotide changes. Most of these are likely to be produced by a combination of frequent single mutations: for example, the combination of the frequent single-nucleotide mutations TA and AV will produce TV mutations. For the unique mutations that can be produced by single-nucleotide changes, the process behind their presence or absence in the two categories is, of course, the same as that outlined for NS and VL in the previous paragraph.
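The minimum number of nucleotide changes needed to convert one residue into another can be checked against the standard genetic code. The sketch below is an illustration, not part of the original analysis; it writes codons with T rather than U, and the helper names are hypothetical.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order (* = stop).
AA_BY_CODON_INDEX = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
                     "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AA_BY_CODON_INDEX)}


def codons_for(residue):
    """All codons that encode the given one-letter residue."""
    return [codon for codon, aa in CODON_TABLE.items() if aa == residue]


def min_nucleotide_changes(res_a, res_b):
    """Smallest number of differing positions over all codon pairs."""
    return min(sum(x != y for x, y in zip(ca, cb))
               for ca in codons_for(res_a) for cb in codons_for(res_b))


for pair in ("TV", "AL", "ES", "DS", "TA", "AV", "NS", "VL"):
    print(pair, min_nucleotide_changes(pair[0], pair[1]))
```

TV, AL, ES, and DS each need two changes, whereas TA, AV, NS, and VL can each arise from a single change.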

Variations in the mutation frequencies related to codon bias.

Most of the frequent types of mutations in the categories with the same divergence have similar rank orders, but there are a few that do not. Differences in the total number of codons for an amino acid, and in the relative frequencies of the different codons of those residues that have more than one, might be expected to affect the extent to which mutations can occur.

In different genomes, identical codons can occur with different frequencies (17). Previous work has shown that the biases in the frequencies of codons are correlated with biases in the frequencies of their tRNAs. As a result of this, the expression of a protein is more efficient if the coding regions of its genes contain a high proportion of the more frequent codons (18–20). Thus, in relation to expression, mutations between codons with similar frequency would have a close-to-neutral effect. If efficient expression is important for a protein, a mutation that substituted a rare codon for a frequent one would be deleterious to some degree, whereas the substitution of a frequent codon for a rare one could be advantageous for expression but would be uncommon because of the low number of rare codons.

Codon frequencies in humans, mice, and chickens are very similar (the correlation coefficients are 0.96–0.99). The codon frequencies in E. coli and Salmonella are similar to each other (the correlation coefficient is 0.83), but some are very different from those in the vertebrates (the correlation coefficients between the two sets of codons are between 0.44 and 0.57).

In the h_m, h_c, and e_s <10% categories, the mutation SP is found at rank positions 10, 11, and 35, respectively (Fig. 2). Proline and serine have four codons each. In Table 3 we list these codons, their frequencies in the different genomes, and an indication of the four single-nucleotide changes that give SP mutations.

Table 3.

The frequency of codons that produce SP mutations by change in a single nucleotide

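Codon frequencies (per 1,000 codons). Each row pairs a serine (S) codon with the proline (P) codon that differs from it by a single nucleotide.

| Serine (S) codon | Human | Mouse | Proline (P) codon | Human | Mouse |
|---|---|---|---|---|---|
| UCU | 15.1 | 16.1 | CCU | 17.4 | 18.4 |
| UCC | 17.7 | 18.1 | CCC | 19.9 | 18.3 |
| UCA | 12.2 | 11.6 | CCA | 16.9 | 17.1 |
| UCG | 4.5 | 4.3 | CCG | 7.0 | 6.2 |

| Serine (S) codon | E. coli | S. enterica | Proline (P) codon | E. coli | S. enterica |
|---|---|---|---|---|---|
| UCU | 8.5 | 19.5 | CCU | 7.0 | 6.3 |
| UCC | 8.6 | 15.4 | CCC | 5.5 | 3.9 |
| UCA | 7.1 | 8.6 | CCA | 8.5 | 6.3 |
| UCG | 8.9 | 6.5 | CCG | 23.3 | 11.9 |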

The total frequency of serine and proline codons in the three animal genomes is 110 per 1,000, and in the bacterial genomes it is 78, i.e., 30% smaller (see Table 3).
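The totals quoted here can be checked by summing the Table 3 entries. In this sketch the assignment of the flattened table columns to genomes (first pair of columns for serine, second pair for proline) is an assumption; chicken is not listed in the table and is therefore omitted.

```python
# Codon frequencies per 1,000 codons, read from Table 3 (column order assumed).
SER_PRO_FREQ = {
    "human":       {"S": [15.1, 17.7, 12.2, 4.5], "P": [17.4, 19.9, 16.9, 7.0]},
    "mouse":       {"S": [16.1, 18.1, 11.6, 4.3], "P": [18.4, 18.3, 17.1, 6.2]},
    "E. coli":     {"S": [8.5, 8.6, 7.1, 8.9],    "P": [7.0, 5.5, 8.5, 23.3]},
    "S. enterica": {"S": [19.5, 15.4, 8.6, 6.5],  "P": [6.3, 3.9, 6.3, 11.9]},
}

totals = {genome: round(sum(freqs["S"]) + sum(freqs["P"]), 1)
          for genome, freqs in SER_PRO_FREQ.items()}
print(totals)

# Relative reduction from the animal value (110) to the bacterial value (78).
reduction = 1 - 78 / 110
print(round(100 * reduction))
```

The sums come to about 110 per 1,000 for human and mouse and about 78 for the two bacteria, a reduction of roughly 30%.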

The individual frequencies of the four proline codons in the animals are quite different from those in the bacteria. In animals the four single-nucleotide SP mutations involve transitions between codons of similar high (three cases) or low (one case) frequencies (Table 3). In the bacteria, on the other hand, the four SP mutations involve transitions between codons that have low frequencies or between one with a high frequency and one with a low frequency. This means that in the e_s orthologs single-nucleotide SP mutations are likely to be less frequent than those in h_m and h_c orthologs.

Examination of the other mutations that have large differences in their rank positions indicates that, in several cases, differences in codon frequency are a contributing factor. In a few cases, however, rank differences cannot be linked in a simple way to codon frequencies.

Distribution of Mutations in Different Regions of Protein Structures.

In this section we discuss the distribution of mutations in protein structures and how this distribution varies with sequence divergence. To obtain data for this analysis, the sequences that form the three groups of orthologs were matched to the entries in the Protein Data Bank (21) to find which of them have had their structures determined. We were able to find at least one structure for 275 orthologous pairs of the h_m group, for 225 orthologous pairs of the h_c group, and for 495 orthologous pairs of the e_s group. The total number of structures, 995, matches 5% of the total orthologous pairs. Although this proportion is small, the constancy of the results described below indicates that they are likely to be representative of a large proportion of the orthologous pairs in the three groups.

Buried, intermediate, and exposed regions in proteins.

Examination in 1965 of the first pair of homologous proteins to have their structure determined, myoglobin and hemoglobin, clearly showed that sites buried within the protein are less susceptible to mutations than those on the surface (22). This observation has been confirmed by numerous subsequent studies. It implies that the structural property that is most directly related to mutability is the extent to which tertiary interactions restrict adaptation to change in the size, shape, and/or chemical character of residues produced by mutations. (Residues directly involved in function will, of course, be more sensitive to mutation, but the number of such residues is usually small, and, in this article, the proteins being considered have conserved functions.)

Accessible surface area (ASA) is a measure of the extent to which atomic groups or residues in proteins are accessible to the solvent or buried in the interior (23). Previous analyses of homologous proteins have shown that residues whose ASAs are <20 Å² tend to be conserved to a greater extent than those whose ASAs are larger (1, 24–26). Based on these observations we defined a residue at a site as being in one of three regions: “buried” if the ASA value is between 0 Å² and 20 Å², “intermediate” if the ASA value is between 20 Å² and 60 Å², and “exposed” for those with ASA values ≥60 Å².
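The three-region definition can be written as a simple threshold function. The text leaves the assignment of a residue at exactly 20 Å² ambiguous; the sketch below assumes it is intermediate, in line with the "<20 Å²" criterion quoted from earlier work.

```python
def asa_region(asa):
    """Classify a residue by its accessible surface area in square angstroms."""
    if asa < 0:
        raise ValueError("ASA cannot be negative")
    if asa < 20:       # assumed: exactly 20 counts as intermediate
        return "buried"
    if asa < 60:       # the text defines exposed as ASA >= 60
        return "intermediate"
    return "exposed"


for value in (3.5, 20.0, 59.9, 60.0, 142.0):
    print(value, asa_region(value))
```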

In our set of structures, the average protein has some 300 residues, of which 42% are buried, 25% are intermediate, and 33% are exposed.

The ASA of mutation sites.

On the basis of the ASA values of the residues in the structures, we assigned the sites subject to mutation to the buried, intermediate, or exposed region. At sites not buried in the interior, mutations will produce residues with different ASAs. However, the average value of the change in ASA for acceptable mutations is ≈27 Å² (26). This means that most mutations are likely to either leave the site in the same ASA category or move it to an adjacent one.

We assigned 38,985 mutation sites to one of the three ASA regions. Of these, 10,122 come from the h_m group, 15,130 come from the h_c group, and 13,733 come from the e_s group.

Distribution of mutations.

For each structure we counted, in the buried, intermediate, and exposed regions, the total number of residues and the number of mutations. Using these data we then determined the proportion of residues that is mutated in each region. This calculation was carried out for all six divergence categories in each of the three groups. To give one example: in the buried regions of structures in <10% categories, the proportion of mutated residues is 2.0% in the h_m group, 2.8% in the h_c group, and 2.5% in the e_s group. Thus, the midpoint and the range of these values are 2.4 ± 0.4%. In Table 4 we give the results of this calculation for each ASA region and each divergence category.
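The "midpoint ± range" summary used in this example can be reproduced directly from the three group values. The function name below is hypothetical.

```python
def midpoint_and_half_range(values):
    """Midpoint of the extreme values and half the spread between them."""
    low, high = min(values), max(values)
    return (low + high) / 2, (high - low) / 2


# Buried-region proportions (%) for h_m, h_c, and e_s in the <10% categories.
midpoint, half_range = midpoint_and_half_range([2.0, 2.8, 2.5])
print(round(midpoint, 1), round(half_range, 1))
```

Note that the midpoint is taken between the smallest and largest values, not as the mean of all three.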

Table 4.

Proportion (%) of residues that are mutated in different protein regions

The last six columns give the proportion of mutated residues (%) in each sequence divergence category* (%).

| ASA of the mutation sites, Å² | Proportion of residues in each region of the average protein, % | <10 | 10–20 | 20–30 | 30–40 | 40–50 | 50–60 |
|---|---|---|---|---|---|---|---|
| 0–20 | 42 | 2.4 ± 0.4 | 6.7 ± 1.0 | 12.4 ± 1.2 | 17.7 ± 1.8 | 29.3 ± 0.7 | 39.6 ± 2.8 |
| 20–60 | 25 | 5.0 ± 0.6 | 13.5 ± 0.4 | 23.7 ± 1.5 | 33.2 ± 0.9 | 44.0 ± 1.7 | 55.7 ± 1.9 |
| 60+ | 33 | 9.8 ± 1.8 | 24.3 ± 1.1 | 38.2 ± 1.2 | 49.8 ± 1.2 | 57.5 ± 0.2 | 67.8 ± 1.1 |

For the four categories <10%, 10–20%, 20–30%, and 30–40%, the values are derived from structures of h_m, h_c, and e_s orthologs. For the 40–50% category there are only sufficient structures available for h_m and h_c orthologs, and for the 50–60% category there are only sufficient structures for h_c and e_s orthologs (see Table 1).

*Note that the proportion of mutated residues increases 16.5-fold between the values for the <10% and 50–60% categories when ASA is 0–20 Å². For ASA values of 60+ Å², the proportion of mutated residues increases 7-fold.

Inspection of the data in Table 4 shows that h_m, h_c, and e_s categories with the same divergence have very similar distributions of mutations in the three ASA regions. On average, the range about the mean is ± 1.2% and there is only one entry where it is >2.0%. The number of structures available for the 40–50% and 50–60% categories is small, 26 and 13, respectively (Table 1), but the data they provide clearly follow the trend observed in categories that have many more structures. These results show that the distribution of mutation in the structures, and the way in which it changes with divergence, follows a remarkably similar pattern in all three groups.

A striking feature is how, with an increase in divergence, mutations increase at different rates in the three regions. On going from the <10% category to the 50–60% category, the proportion of residues with mutations in the buried regions increases by a factor of ≈16.5, in the intermediate regions by a factor of ≈11, and in the surface regions by a factor of ≈7 (Table 4).
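These factors follow from the midpoint values in the <10% and 50–60% columns of Table 4, as the short check below shows (variable names are illustrative).

```python
# Midpoint proportions of mutated residues (%) from Table 4:
# (value in the <10% category, value in the 50-60% category).
TABLE4_ENDPOINTS = {
    "buried (0-20)":        (2.4, 39.6),
    "intermediate (20-60)": (5.0, 55.7),
    "exposed (60+)":        (9.8, 67.8),
}

fold_increase = {region: high / low
                 for region, (low, high) in TABLE4_ENDPOINTS.items()}
for region, factor in fold_increase.items():
    print(region, round(factor, 1))
```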

This behavior arises from a combination of two factors. The first is how residues are distributed between the three regions: the average structure has some 300 residues of which 126 are buried, 75 are intermediate, and 99 are exposed. The second is that, at low divergence, the mutations in the exposed region are approximately four times more acceptable than mutations in the buried region (Table 4). This means, of course, that initially mutations accumulate rapidly in the exposed region. But, as divergence increases, new surface mutations will increasingly occur in residues that have already been mutated, whereas in the buried region, which has more residues and a smaller proportion with mutations, new mutations will mainly occur at new sites.

The divergence of structure and sequence in homologous proteins is related by the exponential relationship $\Delta = 0.40\,e^{1.87H}$, where $\Delta$ is the root mean square difference (in angstroms) in the position of main chain atoms that have the same local conformation and $H$ is the proportion of mutated residues (2). Residues at buried sites tend to have more contacts than those on the surface, and their mutation will usually have a greater effect on structure. The high relative rate with which mutations increase, with divergence, in the buried regions is the basis for this exponential relationship.
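To give a sense of scale, the relationship Δ = 0.40 exp(1.87 H) can be evaluated at a few divergences. The sketch assumes H is expressed as a fraction between 0 and 1, which is how we read "proportion" here.

```python
import math


def structural_divergence(h):
    """RMS main-chain difference (angstroms) for a fraction h of mutated residues."""
    return 0.40 * math.exp(1.87 * h)


for h in (0.0, 0.1, 0.3, 0.6):
    print(h, round(structural_divergence(h), 2))
```

Under this reading, the expected difference rises from 0.40 Å for identical sequences to roughly 1.2 Å at 60% divergence.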

Conclusion

The functions and structure of individual proteins impose different constraints on their evolution. However, the very similar overall patterns of divergence that we have seen in three very different groups of orthologs show that individual responses of most proteins are variations on a common set of selective constraints. These constraints on proteins govern the types of frequent mutations that are acceptable, their distribution in protein structures, and how they change with divergence. Their features, which have been described above in some detail, can be summarized in the following terms: (i) All types of mutations are acceptable in proteins of any divergence, but (ii) the frequencies of the acceptable mutation types follow an exponential distribution. The exponent of the distribution decreases with divergence: in proteins whose sequences have diverged by <10% some 30 types of mutations form 75% of all mutations; in proteins that have diverged by 50–60% the equivalent number is close to 65. (iii) In proteins that have diverged by up to 10%, ≈30 mutation types are conservative and usually have very similar rank orders. In proteins with 50–60% divergence, there are ≈60 common mutation types that form close to 75% of all mutations. They are less conserved in shape and/or chemical character, and the rank order of their frequencies tends to be roughly similar. This increase in the diversity of mutations comes in part from the mutation of mutations. We have discussed previously the acceptance by proteins of rare deleterious mutations (24). (iv) The distribution of mutations in protein structures varies systematically with divergence. In proteins with low divergence, mutations in the interior are under strong selection that removes all but a few conservative changes. With increasing divergence mutations in the interior become more widespread and closer in number to what is found in the intermediate and exposed regions. It is this relative increase in the proportion of buried mutations that is responsible for the exponential relationship between sequence and structural divergence.

These conclusions have significant implications for the use of mutation matrices, e.g., the Dayhoff point accepted mutation (PAM) matrices that describe the probabilities of amino acid mutations for a given period of evolution (27). These matrices are based on a model of evolution in which amino acids mutate randomly and independently of one another. The mutation profile that we observe is very similar to PAM matrices at low levels of sequence divergence. In the Dayhoff model, a matrix for divergent sequences is derived from a matrix for closely related sequences by taking the matrix to a power. This assumes that each position is equally mutable. However, as we show from our structural analysis of point mutations, different regions of a protein accept mutations at different rates with increasing divergence, and this complex behavior is not represented by the PAM matrices.
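The extrapolation step of the Dayhoff model, raising a one-step matrix to a power, can be illustrated with a toy example. The two-state matrix below uses invented probabilities and is not a PAM matrix; it only shows that the same per-site matrix is applied uniformly at every step.

```python
def mat_mul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def mat_pow(m, n):
    """Raise a square matrix to a non-negative integer power."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, m)
    return result


# Hypothetical one-step matrix for two residue states (rows sum to 1).
toy = [[0.99, 0.01],
       [0.01, 0.99]]
two_steps = mat_pow(toy, 2)
print(two_steps)
```

Because one matrix is applied to all sites, this scheme has no way to express region-dependent rates such as those in Table 4.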

Acknowledgments

We thank our colleagues and the referees for comments on the manuscript. R.S. acknowledges financial help from the Cambridge Nehru Trust (New Delhi, India) and the Medical Research Council Laboratory of Molecular Biology (Cambridge, U.K.).

Abbreviations

ASA, accessible surface area; h_c, human_chicken; h_m, human_mouse; e_s, E. coli_S. enterica.

Footnotes

The authors declare no conflict of interest.

This article contains supporting information online at www.pnas.org/cgi/content/full/0703737104/DC1.

References

1. Lesk AM, Chothia C. J Mol Biol. 1980;136:225–270. doi: 10.1016/0022-2836(80)90373-3.
2. Chothia C, Lesk AM. EMBO J. 1986;5:823–826. doi: 10.1002/j.1460-2075.1986.tb04288.x.
3. Matthews BW. Biochemistry. 1987;26:6885–6888. doi: 10.1021/bi00396a001.
4. Serrano L, Day AG, Fersht AR. J Mol Biol. 1993;233:305–312. doi: 10.1006/jmbi.1993.1508.
5. Flores TP, Orengo CA, Moss DS, Thornton JM. Protein Sci. 1993;2:1811–1826. doi: 10.1002/pro.5560021104.
6. Koehl P, Levitt M. J Mol Biol. 2002;323:551–562. doi: 10.1016/S0022-2836(02)00971-3.
7. Hedges SB. Nat Rev Genet. 2002;3:838–849. doi: 10.1038/nrg929.
8. Feng DF, Cho G, Doolittle RF. Proc Natl Acad Sci USA. 1997;94:13028–13033. doi: 10.1073/pnas.94.24.13028.
9. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. J Mol Biol. 1990;215:403–410. doi: 10.1016/S0022-2836(05)80360-2.
10. Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, Devon K, Dewar K, Doyle M, FitzHugh W, et al. Nature. 2001;409:860–921. doi: 10.1038/35057062.
11. Waterston RH, Lindblad-Toh K, Birney E, Rogers J, Abril JF, Agarwal P, Agarwala R, Ainscough R, Alexandersson M, An P, et al. Nature. 2002;420:520–562. doi: 10.1038/nature01262.
12. Hillier LW, Miller W, Birney E, Warren W, Hardison RC, Ponting CP, Bork P, Burt DW, Groenen MA, Delany ME, et al. Nature. 2004;432:695–716.
13. Hubbard T, Andrews D, Caccamo M, Cameron G, Chen Y, Clamp M, Clarke L, Coates G, Cox T, Cunningham F, et al. Nucleic Acids Res. 2005;33:D447–D453. doi: 10.1093/nar/gki138.
14. Blattner FR, Plunkett G, III, Bloch CA, Perna NT, Burland V, Riley M, Collado-Vides J, Glasner JD, Rode CK, Mayhew GF, et al. Science. 1997;277:1453–1474. doi: 10.1126/science.277.5331.1453.
15. McClelland M, Sanderson KE, Spieth J, Clifton SW, Latreille P, Courtney L, Porwollik S, Ali J, Dante M, Du F, et al. Nature. 2001;413:852–856. doi: 10.1038/35101614.
16. Maglott D, Ostell J, Pruitt KD, Tatusova T. Nucleic Acids Res. 2005;33:D54–D58. doi: 10.1093/nar/gki031.
17. Nakamura Y, Gojobori T, Ikemura T. Nucleic Acids Res. 2000;28:292. doi: 10.1093/nar/28.1.292.
18. Grantham R, Gautier C, Gouy M, Mercier R, Pave A. Nucleic Acids Res. 1980;8:r49–r62. doi: 10.1093/nar/8.1.197-c.
19. Ikemura T. J Mol Biol. 1981;151:389–409. doi: 10.1016/0022-2836(81)90003-6.
20. Akashi H. Genetics. 2003;164:1291–1303. doi: 10.1093/genetics/164.4.1291.
21. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE. Nucleic Acids Res. 2000;28:235–242. doi: 10.1093/nar/28.1.235.
22. Perutz MF, Kendrew JC, Watson HC. J Mol Biol. 1965;13:669–678.
23. Lee B, Richards FM. J Mol Biol. 1971;55:379–400. doi: 10.1016/0022-2836(71)90324-x.
24. Chothia C, Gelfand I, Kister A. J Mol Biol. 1998;278:457–479. doi: 10.1006/jmbi.1998.1653.
25. Hill EE, Morea V, Chothia C. J Mol Biol. 2002;322:205–233. doi: 10.1016/s0022-2836(02)00653-8.
26. Miller S, Janin J, Lesk AM, Chothia C. J Mol Biol. 1987;196:641–656. doi: 10.1016/0022-2836(87)90038-6.
27. Dayhoff MO, Schwartz RM, Orcutt BC. In: Atlas of Protein Sequence and Structure. Dayhoff MO, editor. Vol 5. Washington, DC: Natl Biomed Res Foundation; 1978. pp. 345–352.

Supplementary Materials

Supporting Information
pnas_0703737104_1.pdf (392.1KB, pdf)
pnas_0703737104_2.pdf (53.8KB, pdf)
pnas_0703737104_3.pdf (41.6KB, pdf)
pnas_0703737104_4.pdf (23.6KB, pdf)
pnas_0703737104_5.pdf (606.2KB, pdf)
pnas_0703737104_6.pdf (605.9KB, pdf)
pnas_0703737104_7.pdf (619.8KB, pdf)
pnas_0703737104_8.pdf (616.9KB, pdf)
pnas_0703737104_9.pdf (621KB, pdf)
pnas_0703737104_10.pdf (608.7KB, pdf)
pnas_0703737104_11.pdf (17.7KB, pdf)

Articles from Proceedings of the National Academy of Sciences of the United States of America are provided here courtesy of National Academy of Sciences
