Nucleic Acids Research. 2000 Oct 1;28(19):3801–3810. doi: 10.1093/nar/28.19.3801

Amino acid and nucleotide recurrence in aligned sequences: synonymous substitution patterns in association with global and local base compositions

Manami Nishizawa 1, Kazuhisa Nishizawa 1,a
PMCID: PMC110763  PMID: 11000273

Abstract

The tendency for repetitiveness of nucleotides in DNA sequences has been reported for a variety of organisms. We show that the tendency for repetitive use of amino acids is widespread and is observed even for segments conserved between human and Drosophila melanogaster at the level of >50% amino acid identity. This indicates that repetitiveness influences not only the weakly constrained segments but also those sequence segments conserved among phyla. Not only glutamine (Q) but also many of the 20 amino acids show a comparable level of repetitiveness. Repetitiveness in bases at codon position 3 is stronger for human than for D.melanogaster, whereas local repetitiveness in intron sequences is similar between the two organisms. While genes for immune system-specific proteins, but not ancient human genes (i.e. human homologs of Escherichia coli genes), have repetitiveness at codon bases 1 and 2, repetitiveness at codon base 3 for these groups is similar, suggesting that the human genome has at least two mechanisms generating local repetitiveness. Neither amino acid nor nucleotide repetitiveness is observed beyond the exon boundary, denying the possibility that such repetitiveness could mainly stem from natural selection on mRNA or protein sequences. Analyses of mammalian sequence alignments show that while the ‘between gene’ GC content heterogeneity, which is linked to ‘isochores’, is a principal factor associated with the bias in substitution patterns in human, ‘within gene’ heterogeneity in nucleotide composition is also associated with such bias on a more local scale. The relationship amongst the various types of repetitiveness is discussed.

INTRODUCTION

Amino acid sequences of eukaryotic proteins have a tendency for recurrent use of identical amino acids (1–5). For human proteins, within a 1–10 residue distance of a glutamine (Q) residue, 27–38% more Q tends to occur than expected by chance (5). Among human proteins, modern proteins that are unique to human (or mammals) have a higher repetitiveness (5). Such a tendency for clustering of the same amino acids has been suggested to be caused by the tendency for simplicity of DNA, which has been studied for non-coding and coding regions by various methods (6–10). Such studies have discovered concatemeric repeats of short DNA sequences, or microsatellites, which are likely to be mainly generated by replication slippage (10,11). They are highly polymorphic sequences and have drawn much interest due to their usefulness in the construction of genetic maps (see for example 12), assessment of genetic diversity within a population (see for example 13–18) and due to their importance in several genetic diseases (see for example 19,20). The most well-characterized case of the latter is expansion of the trinucleotide sequence CAG, which results in stretching of the Q residues in amino acid sequences and is believed to be the cause of many diseases including myotonic dystrophy (see for example 21–24). Other examples of triplet repeats implicated in diseases include the CGG/CCG repeat (see for example 25–27) and the GAA/TTC repeat (see for example 28).

Because the majority of studies on microsatellites have been concerned with particular loci with repetitive segments, it is not yet clear how widely such repetitiveness is found in normal genes, where sequences are under obvious functional constraints. It seems important to ask whether the tendency for repetitiveness can affect genetic variation, especially of those sequences which are conserved between evolutionarily distant organisms and thus not likely to be ‘junk’. In fact, the general tendency depicted by statistical analyses (5,29,30) could merely reflect the contribution of a small number of highly repetitive sequences, because they are more frequently found in regions which seemingly encode the weakly constrained protein segments (5; our unpublished results). This is an important issue because repetitiveness in conserved segments may imply the involvement of a mechanism which does not require a change in length. Of note, Golding and Glickman have reported several convincing examples for ‘sequence-directed mutagenesis’ using human interferon genes (31). Unlike replication slippage (6,10,32), sequence-directed mutagenesis does not always result in alterations of gene length (31). How widely is such repetitiveness present in conserved sequences? And if it is widespread, how is it maintained?

First we studied the general statistical features of repetitiveness in mammals and Drosophila melanogaster, instead of focusing on some particular examples. The general tendency for repetitiveness is present even for well-conserved sequences (having >50% amino acid identity between human and D.melanogaster). A similar degree of repetitiveness has been observed for various codons for many of the 20 amino acids. The local repetitiveness of concern here tends to be disrupted by introns. We also analyzed the local repetitiveness in DNA sequences and show that some local repetitiveness in humans cannot be explained by sequence-directed mutagenesis. In fact, even human genes encoding ancient proteins have repetitiveness in bases at the codon 3 position, while their amino acid sequences do not have a significant level of repetitiveness.

In general, nucleotide substitutions in a given species consist of two steps: first, mutations in genes; second, fixation of the mutant genes in the population (33). Analyses of synonymous substitutions in humans show that both global (between gene) and local (within gene) base compositions are factors associated with a local imbalance in substitution patterns. While the global factor, which is linked to the isochore scale GC content heterogeneity, is strong for humans but not for mouse, the local factor (on an ∼20 nt scale) can account for the very local regional imbalance in substitutions for both species.

MATERIALS AND METHODS

Systems for sequence alignment and repetitiveness analyses

By modifying the BLAST program (34), we implemented our system for repetitiveness analyses with a system for sequence compilation based on homology searches between two organisms. All the C source codes and the sequence alignments used are available from us or from our Web site (http://village.infoweb.ne.jp/~gene), respectively. We first compared protein sequences obtained from the SwissProt database (http://www.expasy.ch/sprot/) between organisms and then, for the sequences aligned with a high similarity (i.e. less than a BLAST score of e–40 for the human–D.melanogaster comparison and less than e–200 for the human–mouse and mouse–bovine comparisons), we collected the corresponding cDNA and/or genomic DNA sequences. All of the homologs were aligned again using the CLUSTALW method (35) provided as Megalign in Lasergene (Dnastar, Madison, WI). The intron sequences used in Table 1 (human and D.melanogaster introns with a total of 2.5 × 10^5 and 6.4 × 10^5 nt obtained from 573 and 542 genes, respectively) were downloaded from the NCBI web site without any artifactual selections.

Table 1. Repetitiveness in intron DNA sequences of human and D.melanogaster.

                                  Average over intervals               Permuted sequence^a
                                  <12^b    13–29    30–60    61–100    (mean ± SD)
Mononucleotide (1-tuple) (×100)
  Human                           28.645   27.307   26.973   26.702    26.620 ± 0.116
  D.melanogaster                  28.538   27.563   27.119   26.950    26.693 ± 0.066
2-tuple (×100)
  Human                            8.861    8.059    7.777    7.594     7.112 ± 0.073
  D.melanogaster                   8.865    8.057    7.688    7.532     7.119 ± 0.041
3-tuple (×100)
  Human                            2.922    2.468    2.302    2.203     1.905 ± 0.037
  D.melanogaster                   3.049    2.474    2.230    2.163     1.910 ± 0.025
4-tuple (×100)
  Human                            1.078    0.789    0.699    0.654     0.512 ± 0.020
  D.melanogaster                   1.204    0.792    0.668    0.635     0.512 ± 0.014
5-tuple (×100)
  Human                            0.444    0.279    0.225    0.196     0.139 ± 0.012
  D.melanogaster                   0.518    0.270    0.210    0.194     0.138 ± 0.007
Dinucleotide (×100)
  Human                           27.860   18.582   15.843   13.714    13.002 ± 0.886
  D.melanogaster                  27.247   20.405   16.794   15.331    12.975 ± 0.534

^a Nucleotides were permuted within each intron and then intervals (positions) +10 to +30 were analyzed.

^b For the k-tuple method, positions k to 12 were considered. For example, for the 3-tuple method, positions 3–12 were analyzed, because positions +1 and +2 overlap the query word itself.

Protocol A

From the human–D.melanogaster alignments, we obtained highly conserved peptide segments using the following procedure. First, score each segment (of five residues) by summing the value of each position scored based on the Blosum62 matrix (36). Then, starting from the segments scored above the threshold T (typically 19), scan the sequence in both directions using a window of five residues, then record the score. The scanning is stopped just before the window hits a ‘gap’ (unaligned position) or it scores lower than 0. Collect the regions over which the window was scanned at least once as ‘highly conserved peptide segments’.
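The scan in protocol A can be sketched as follows. This is a minimal illustration rather than the authors' C implementation: the function names, the '-' gap character and the `identity_score` stand-in scorer (+4 for a match, -1 for a mismatch) are assumptions made here so that the example is self-contained; a BLOSUM62 lookup can be passed as `score` in its place.

```python
from typing import Callable, List, Optional, Tuple

WINDOW = 5  # window length used in protocol A


def identity_score(a: str, b: str) -> int:
    """Stand-in scorer (assumption, not BLOSUM62): +4 for a match, -1 otherwise."""
    return 4 if a == b else -1


def window_score(s1: str, s2: str, start: int,
                 score: Callable[[str, str], int]) -> Optional[int]:
    """Sum the per-column scores of one window; None if it leaves the
    alignment or touches a gap ('-')."""
    if start < 0 or start + WINDOW > len(s1):
        return None
    total = 0
    for a, b in zip(s1[start:start + WINDOW], s2[start:start + WINDOW]):
        if a == "-" or b == "-":
            return None
        total += score(a, b)
    return total


def conserved_segments(s1: str, s2: str, threshold: int = 19,
                       score: Callable[[str, str], int] = identity_score
                       ) -> List[Tuple[int, int]]:
    """Return half-open (start, end) column ranges covered by at least one
    window reached from a seed window scoring above `threshold`."""
    n = len(s1)
    covered = [False] * n
    for seed in range(n - WINDOW + 1):
        seed_score = window_score(s1, s2, seed, score)
        if seed_score is None or seed_score <= threshold:
            continue
        for direction in (-1, 1):          # scan upstream, then downstream
            pos = seed
            while True:
                sc = window_score(s1, s2, pos, score)
                if sc is None or sc < 0:   # gap, alignment end or score < 0
                    break
                covered[pos:pos + WINDOW] = [True] * WINDOW
                pos += direction
    # Collapse the covered columns into contiguous ranges.
    segments, start = [], None
    for i, flag in enumerate(covered + [False]):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            segments.append((start, i))
            start = None
    return segments
```

With the stand-in scorer, a five-residue window must be fully identical to exceed T = 19, and the scan then extends over flanking windows as long as their score stays at 0 or above.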

Repetitiveness in human–D.melanogaster aligned sequences (Figs 1–3)

A homology search of all the fruit fly proteins in the SwissProt database versus human proteins, followed by re-alignment and trimming according to protocol A, yielded 1007 aligned segments, which have ∼25–100% amino acid identity with few gaps, between the two organisms. (The aligned segments are available on our Web site; see above). Monotonous sequences such as QQQQ… were removed using a filter (described under a default condition on the BLAST web site; http://www.ncbi.nlm.nih.gov ).

The neighbor residues of, for example, Q were analyzed as follows (see also ref. 2 and our Web site). Given a Q residue at position i, we looked at the amino acid at each nearby position i + n (n = 1, … , 25 or more if appropriate). Such data for each n value are summed over all the Q residues, so that we know the amino acid ‘composition’ at a given distance (n) from Q. For example, in a 20-residue sequence MRKRQHSAVQNQTKCYRKSA there are three Q residues (3/20 = 15%). If we look at only the +2 position (italic) from each Q, we find S, Q and K. Thus, at n = 2 of Q, Q occurs at 1/3 = 33.3%, about twice the average (33.3/15 = 2.2).

We define this index (here 2.2) as the ‘normalized frequency of Q at n = 2 of Q’. Such an analysis is performed for different n values, for all amino acids, using all the sequences in the alignments. Note that if there is no specific correlation between residues, all of the indices should be 1.0. We thus use {Fyx(i)/Fy} as the normalized frequency of Y at position i of X. In this study we were mainly concerned with Fxx(i), which is just a special case of Fyx(i), where X = Y. Thus

Fxx(i) = {number of X residues at the i position from each of the X residues} ÷ {total number of residues at the i position from each of X residues}.
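The normalized frequency defined above can be reproduced with a short sketch. The function name and argument order are chosen here for illustration only; the calculation follows the worked example with the 20-residue sequence.

```python
from typing import Iterable


def normalized_frequency(sequences: Iterable[str], x: str, y: str, n: int) -> float:
    """Fyx(n)/Fy: frequency of residue y at offset +n from each residue x,
    divided by the overall frequency of y in the sequences."""
    hits = 0        # y found at position i + n
    neighbours = 0  # positions i + n that exist
    y_count = 0     # occurrences of y overall
    total = 0       # total residues
    for seq in sequences:
        total += len(seq)
        y_count += seq.count(y)
        for i, residue in enumerate(seq):
            if residue == x and i + n < len(seq):
                neighbours += 1
                if seq[i + n] == y:
                    hits += 1
    if neighbours == 0 or y_count == 0:
        return float("nan")
    return (hits / neighbours) / (y_count / total)


# Worked example from the text: Q at +2 of Q.
value = normalized_frequency(["MRKRQHSAVQNQTKCYRKSA"], "Q", "Q", 2)
print(round(value, 1))  # (1/3) / (3/20), i.e. about 2.2
```

Setting x equal to y gives the repetitiveness index Fxx(n) after normalization; values near 1.0 indicate no correlation between residues.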

Analysis of repetitiveness in DNA sequences (Fig. 3 and Table 1)

We employed the ‘dinucleotide method’, by which two dinucleotide sequences are compared position-by-position and given a score of 6 (3 for each position) when they are identical, –2 (–1 for each) when neither position has the same base and 2 [i.e. 3 + (–1)] when only one is identical. For example, given an AG at position 0, if another AG is found at position i, AG(i) is scored as 6. If TG is found at position j, AG(j) is scored as 2, because AG and TG are the same at only one position. In a manner similar to the amino acid repetitiveness analysis, repetitiveness scores are cumulatively calculated for 16 (= 4 × 4) individual dinucleotide motifs (and for a different distance n). These profiles are usually combined into one profile (as in Fig. 3) after weighting according to the relative frequency of each motif.
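The scoring rule of the dinucleotide method can be written compactly. In the sketch below, `dinucleotide_score` follows the rule stated above, while `dinucleotide_profile` is an assumption about the bookkeeping: it simply averages the score over all query positions at each distance, which weights each motif by how often it occurs but does not reproduce whatever further normalization underlies the plotted profiles.

```python
from typing import Dict


def dinucleotide_score(query: str, target: str) -> int:
    """+3 for each identical position, -1 for each mismatch: 6, 2 or -2."""
    if len(query) != 2 or len(target) != 2:
        raise ValueError("both arguments must be dinucleotides")
    return sum(3 if a == b else -1 for a, b in zip(query, target))


def dinucleotide_profile(seq: str, max_distance: int) -> Dict[int, float]:
    """Mean score at each distance n = 1..max_distance (illustrative averaging)."""
    profile = {}
    for n in range(1, max_distance + 1):
        scores = [
            dinucleotide_score(seq[i:i + 2], seq[i + n:i + n + 2])
            for i in range(len(seq) - n - 1)
        ]
        profile[n] = sum(scores) / len(scores) if scores else float("nan")
    return profile


print(dinucleotide_score("AG", "AG"))  # 6
print(dinucleotide_score("AG", "TG"))  # 2
```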

Figure 3.

Repetitiveness in cDNA sequences. Repetitiveness was analyzed by the dinucleotide scanning method (5; see also Materials and Methods). (A) Results for entire cDNAs encoding the proteins analyzed in Figure 1. (B) Results for cDNA segments corresponding to protein segments analyzed in Figure 2. Note that the general level of non-3n profiles is different between (A) and (B), which is most likely caused by stronger constraints on the conserved segments (B) regarding the base composition at each position of the codons.

Intra-exon and inter-exon analyses of amino acid recurrence (Fig. 5)

GenBank was screened with the keywords ‘human’, ‘complete cds’ (for title search) and ‘genomic’ (for text search) and all the obtained genomic sequences (133 genes, 960 exons) were compiled. For analyses of the effect of introns on repetitiveness in the protein sequences, the presence of an exon boundary was taken into account upon data collection: amino acid recurrence was analyzed as described above but was sorted into two matrices based on whether an exon boundary exists in the interval. For example, if the above 20-residue sequence is encoded by a gene segment with one exon boundary at position H in MRKRQhSAVQNQTKCYRKSA, S7 (i.e. at n = 2 of Q5) is encoded by a different exon from that encoding Q5. In this case, distinct matrices were used for the position n = 2 of Q5 (that has S) and for position n = 2 of Q12 (that has K), because, for the latter, there is no intron between Q12 and K14.
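The sorting into intra-exon and inter-exon matrices can be illustrated as below. This is a sketch under stated assumptions: positions are 1-based as in the text, a boundary is given as the residue position at which it falls (6 for H in the example), and a pair (i, j) is counted as inter-exon when i < boundary <= j. How residues whose codons are split by an intron were treated is not stated, so that convention is hypothetical.

```python
from typing import Dict, Iterable, List


def spans_exon_boundary(i: int, j: int, boundaries: Iterable[int]) -> bool:
    """True if an exon boundary lies between residues i and j (1-based, i < j)."""
    return any(i < b <= j for b in boundaries)


def split_recurrence_counts(seq: str, boundaries: Iterable[int],
                            n: int) -> Dict[str, List[int]]:
    """Count same-residue recurrences at offset +n, kept separately for pairs
    within one exon ('intra') and pairs separated by a boundary ('inter').
    Each entry is [recurrences, pairs examined]."""
    boundaries = list(boundaries)
    counts = {"intra": [0, 0], "inter": [0, 0]}
    for i in range(1, len(seq) - n + 1):
        j = i + n
        key = "inter" if spans_exon_boundary(i, j, boundaries) else "intra"
        counts[key][1] += 1
        if seq[i - 1] == seq[j - 1]:
            counts[key][0] += 1
    return counts


# Example from the text: boundary at H6, offset n = 2.
print(spans_exon_boundary(5, 7, [6]))    # Q5 and S7 are in different exons
print(spans_exon_boundary(12, 14, [6]))  # Q12 and K14 share an exon
```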

Synonymous substitution rate and repetitiveness (Table 2 and Supplementary Material available at NAR Online)

We first collected all of the human/bovine/mouse sequence alignments which can be mutually aligned with a BLAST score (e-value) better (less) than e–200. The alignments were re-aligned using CLUSTALW. Because many alignments were partial, we trimmed the alignment to obtain segments which were longer than 50 residues and had amino acid identities >95% with no gap. The alignment data are available on our Web site (see above). The relationship between local repetitiveness and the occurrence of synonymous substitutions was analyzed as described in Supplementary Data.

Protocol B: synonymous substitutions and local nucleotide compositions

To analyze the association between the ‘within gene’ variation of the nucleotide content and the pattern of synonymous substitution, we used the following indices. (We first explain the parameters and then show the algorithm for the analyses.) Let Flocal(G│A→Gsynon), for example, denote the proportion of nucleotide G in ‘neighbor nucleotides’ in the ancestral sequences, given that an ancestral A has synonymously changed to G. (As described in Supplementary Data, we used a parsimony method and did not take into account the position where the ancestral sequence cannot be determined.) Although there are alternative ways which can seemingly be arbitrarily chosen, in this study we used those 11 neighbor nucleotides located at the positions shown by x in xxx xxx ABC Dxx xxx. Note that C is the position that we examined for occurrence of synonymous substitution. Position D, as well as A and B, were not considered due to possible direct linkage resulting from the dinucleotide motif and/or codon preference. For example, in the following alignments

… AGG GTC CTA TCG TCG CGG CCA… (human)

… AGG GTC CTA TCG TCA CGG CCA… (bovine)

… AGG GTC CTA TCG TCA CGG CCA… (mouse),

Flocal(G│A→Gsynon) for the human sequence is 27.3% (3/11 = 27.3%; see the underlined bases). Similarly, we define Flocal(G│A→A), i.e. the local G content in the segments near the A which has not changed despite the ‘chance’ for synonymous substitution to G. For example, codon 3 in the above example is CTA (encoding Leu), which can be synonymously substituted to CTG. Considering the 11 neighbor nucleotides (AGG GTC and CG TCA), Flocal(G│A→A) here becomes 4/11 = 36.4%.
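The two values in this example can be checked with a short sketch. The neighbor offsets follow the xxx xxx ABC Dxx xxx pattern, with C as the examined codon position 3. The `infer_ancestor` helper is an assumption: it uses a simple majority rule over the three aligned sequences as a stand-in for the parsimony procedure described in Supplementary Data, and marks undetermined columns with '?'. Function names and 0-based indexing are choices made here.

```python
from typing import Sequence

# Offsets of the 11 neighbors relative to the examined codon-3 position C:
# six upstream (two preceding codons) and five downstream (skipping D).
NEIGHBOUR_OFFSETS = tuple(range(-8, -2)) + tuple(range(2, 7))


def infer_ancestor(seqs: Sequence[str]) -> str:
    """Majority-rule ancestor of three aligned sequences (illustrative stand-in
    for parsimony); '?' where no base is shared by at least two sequences."""
    ancestor = []
    for column in zip(*seqs):
        best = max(set(column), key=column.count)
        ancestor.append(best if column.count(best) >= 2 else "?")
    return "".join(ancestor)


def f_local(ancestral: str, pos: int, base: str) -> float:
    """Fraction of `base` among the 11 neighbors of 0-based position `pos`."""
    if pos + NEIGHBOUR_OFFSETS[0] < 0 or pos + NEIGHBOUR_OFFSETS[-1] >= len(ancestral):
        raise ValueError("position too close to the end of the sequence")
    neighbours = [ancestral[pos + k] for k in NEIGHBOUR_OFFSETS]
    return neighbours.count(base) / len(neighbours)


human = "AGGGTCCTATCGTCGCGGCCA"
bovine = "AGGGTCCTATCGTCACGGCCA"
mouse = "AGGGTCCTATCGTCACGGCCA"
ancestor = infer_ancestor([human, bovine, mouse])

print(round(100 * f_local(ancestor, 14, "G"), 1))  # near the A→G site: 27.3
print(round(100 * f_local(ancestor, 8, "G"), 1))   # near the unchanged A: 36.4
```

Subtracting the G content of the whole gene from either value gives the corresponding ΔFlocal index.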

Because both Flocal(G│A→Gsynon) and Flocal(G│A→A) are likely to be under the influence of the GC content of the gene, in this case we subtracted ‘G% of the gene’ from each. Thus, we define, ΔFlocal(G│A→Gsynon) and ΔFlocal(G│A→A) as follows.

ΔFlocal(G│A→Gsynon) = Flocal(G│A→Gsynon) – Fgene(G)

ΔFlocal(G│A→A) = Flocal(G│A→A) – Fgene(G),

where Fgene(G) is the content (%) of G in the gene.

While these two indices deal with each of the A residues which can be synonymously substituted, in some analyses we also used the gene-level indices ΔFilocal(G│A→Gsynon) and ΔFilocal(G│A→A) (where i indexes the gene), denoting the mean values of ΔFlocal(G│A→Gsynon) and ΔFlocal(G│A→A), respectively, over gene i.

To analyze the general trend of the substitution patterns and nucleotide compositions, we also introduced indices which consider all the patterns of transition: A→G, G→A, T→C and C→T, i.e.

ΔFlocal(J│I→Jsynon) = Flocal(J│I→Jsynon) – Fgene(J)

ΔFlocal(J│I→I) = Flocal(J│I→I) – Fgene(J),

where the pair I and J denotes any of A and G, G and A, T and C and C and T. Following the above rule, we used ΔFilocal(J│I→Jsynon) [and ΔFilocal(J│I→I)] to denote the local richness of the nucleotide (in comparison with the percentage of the relevant nucleotide over the gene), given the occurrence (and absence) of synonymous substitutions.

RESULTS AND DISCUSSION

Repetitiveness in conserved sequences

A BLAST-based system (34) for sequence comparison was used to find homologous protein sequences between human and D.melanogaster. To detect any clustering between different (or the same) amino acids in the sequences obtained, we cumulatively calculated the frequency of occurrence of different amino acids in the proximity (1–5 residues downstream) of each amino acid type. The results, i.e. the average amino acid composition near individual amino acid types, are shown in Figure 1A. A tendency for amino acid repetitiveness is evident (from the high repetitiveness score along the diagonal), which is in agreement with our previous findings (5). While Figure 1A shows the frequency of amino acid Y near amino acid X [or, correctly, Fyx(i)/Fy (averaged over i = +1~5), where the average frequency of Y normalizes the profile], Fxx(i), i.e. the proportion of X at position i from X, was also calculated. Then the obtained profiles (for the 20 amino acids) were combined as described in Materials and Methods. The combined profiles [ΣX Fxx(i)] of the human and D.melanogaster proteins show that D.melanogaster proteins tend to have a higher level of amino acid repetitiveness than homologous human proteins (Fig. 1B).

Figure 1.

Amino acid occurrence near different amino acids. (A) Amino acid occurrence at positions +1,…, +5 (where + means downstream) from a given type of amino acid. The results for human proteins which have at least partial homology with a D.melanogaster protein are shown. The scores are shown by color, indicating the frequency as percent change from the frequency expected by chance. For example, occurrence of Q near E is shown in yellow, which corresponds to 10–20, indicating a 1.10- to 1.20-fold frequency of occurrence of Q near E, compared with the average occurrence of Q. (B) Amino acid repetitiveness profile in the human and D.melanogaster sequences used in (A). These profiles represent the occurrence of amino acid X (X is any of 20 amino acids) at the indicated position shown on the abscissa from amino acid X. To obtain these profiles, we combined the 20 individual profiles for 20 amino acids after weighting according to the relative frequency of each amino acid (see ref. 5 for more details). For example, a score of 1.36 at position +3 (D.melanogaster plot) means that, on average, amino acids tend to be used 36% more frequently than average at position +3.

Note, however, that the analyses in Figure 1 were based on the sequences of entire proteins, not conserved segments. To examine whether repetitiveness can be found in conserved segments, we performed the same analysis for aligned segments trimmed using protocol A as described in Materials and Methods. (These segments consist of 1007 pairs for which the mean amino acid identity score was 63.3% and the average number of gaps was only 0.88 per 1000 residues. For more details see http://village.infoweb.ne.jp/~gene ). It is still evident that in close proximity the same amino acid tends to be used again more frequently than expected by chance (Fig. 2A). The results were similar for human and D.melanogaster (not shown), due to the high sequence similarity of the trimmed alignments. Comparison between proximal and distant positions for individual amino acids (Fig. 2B) shows that, except for a few amino acids, the same type of amino acid tends to occur ∼5–20% more frequently in close proximity (1–5 positions, closed circles) than expected by chance. At an increasing distance the relative frequency of the same amino acid tends to decrease (open circles). Thus, in general, the same amino acids tend to cluster, even in the conserved segments.

Figure 2.

Local repetitiveness in the human protein segments that can be aligned with D.melanogaster sequences. (A) Amino acid occurrence at positions +1,…, +5 for a given type of amino acids. Results are presented in the same way as in Figure 1A. (B) Normalized frequency of the indicated amino acid type in the proximity (1–5 residues, filled circles) of and at positions more distant (20–30 residues, open circles) from the same amino acid. For example, in the proximity of Q residues (see the filled circle in column Q), Q occurs at a 1.18-fold frequency of the average (= 1.0). The data for C (cysteine) were 1.9 (1–5 residues) and 1.4 (20–30 residues). (C) Amino acid repetitiveness of the human segments as in (A) but classified based on the level of identity between human and D.melanogaster. (We did not obtain segments with 25–50% identity under our criteria which did not allow frequent gaps.) The results for the ancient human proteins are also shown (blue line). For each group, the profiles for 20 amino acids [as in (A)] were combined into one profile as described (5). To reduce fluctuation due to the limited number of total residues, the profiles were smoothed with a window of three neighbor positions (as described on our Web site).

For a more unambiguous analysis, we classified the conserved segments based on the levels of amino acid identity and calculated the repetitive profiles of the classes, as shown in Figure 2C. The tendency for recurrence of the same amino acids is still present for well-conserved segments (>50% identity), whereas ancient human proteins, consisting of proteins whose Escherichia coli homologs are known, show a less significant level of repetitiveness. Note, however, that the repetitiveness scores in the proximity (the +1 to +10 positions) are generally not as high in Figure 2C as in Figure 1B, supporting our previous hypothesis that repetitiveness tends to be strong where constraints are weak (5).

We also performed cumulative analyses of the recurrence of dinucleotide motifs in the cDNA sequences that encode the proteins used above (Fig. 3). Note that, due to the bias in codon usage and in amino acid occurrence, 3n (i.e. 3, 6, 9, …) intervals always tend to have a high score (37). (In Fig. 3 the points scoring >0.2 are the results for 3n intervals.) Intriguingly, for both the entire proteins (Fig. 3A) and the segments (Fig. 3B) the ‘non-3n’ repetitiveness (as represented in CAG TCA) appears higher in near than in more distant positions (see points <0.1 in Fig. 3A and B). A statistical test verifies this notion (i.e. scores at positions +2 to +20 are higher than those at +50 to +75 with P < 0.01 for both Fig. 3A and B). Simulation analyses show that this difference cannot be attributed to any feature of the encoded amino acid sequences per se (2; our unpublished data).

In the above analyses of the ‘3n intervals’ the DNA nucleotide positions (1, 2 and 3) in the codon were not discriminated. Hence, we split the 3n repetitiveness into three distinct phases, 1–2, 2–3 and 3–1, where each number indicates the nucleotide position in a codon. The results for 2–3 and 3–1, in comparison with the data for mononucleotide repetitiveness, suggest that they reflect the repetitiveness of the bases at the codon 3 position (not shown). Mononucleotide analysis shows that while both human and D.melanogaster show repetitiveness at the codon 3 position, the local repetitiveness is clearer for human (Fig. 4A). (Note that the general level of the profile is likely associated with between gene heterogeneity in GC content, because normalization by overall frequency of each nucleotide type enhanced the human profile, as shown in Fig. 4B.) Strikingly, mononucleotide repetitiveness at codon position 3 is generally insensitive to amino acid repetitiveness (Fig. 4C). For example, while ancient proteins have weak amino acid repetitiveness, repetitiveness of codon 3 bases of the human genes for such proteins is comparable with that of human genes encoding immune system-specific proteins (Fig. 4C) (for immune system proteins see ref. 5). The results for (codon positions) 1–2, when applied to these sets of genes, were largely similar to those for amino acid repetitiveness (Fig. 4D and data not shown). This finding, i.e. the discrepancy between repetitiveness in amino acid sequences and that in the codon 3 bases, suggests that in human, repetitiveness can be partly generated by a mechanism which does not require the repetitive use of a short motif consisting of a few nucleotides.

Figure 4.

Repetitiveness of codon 3 bases and codon position 1–2 bases. (A) Codon 3 bases of human and D.melanogaster genes. (B) As (A) but data normalized to the total frequency of each base prior to summing over individual base profiles. (C) Codon position 3 bases of human ancient and immune system-specific genes. (D) Codon 1–2 bases of the genes analyzed in (C).

We next considered the question of whether or not some constraints on mRNA sequences such as local codon usage contribute to the local repetitiveness shown above. Hence, we collected many human sequences for which the positions of the exon boundaries are known. The repetitiveness score for amino acids (Fig. 5) and nucleotides (not shown) is markedly lower beyond the exon boundary of the associated gene. (Note that we here examine the similarity between those exons encoding consecutive parts of the mRNA, not an exon and intron.) Thus, exon boundaries disrupt the amino acid and nucleotide repetitiveness. This finding argues that the tendency for amino acid recurrence does not result mainly from selection regarding the use of codons.

Figure 5.

Effect of exon boundaries on amino acid repetitiveness. Intra-exon analysis was performed in a manner similar to that for Figures 1B and 3, while inter-exon analysis dealt only with the cases where an exon boundary was found between the two amino acids concerned.

Local repetitiveness in DNA sequences: introns and codon 3 positions

To examine further the claim that DNA has its own tendency to generate repetitiveness and that this is the primary force generating the major part of the amino acid repetitiveness, we analyzed the local repetitiveness in intron sequences. Intron sequences of 573 human and 542 D.melanogaster genes were analyzed by several methods, namely the dinucleotide method, the mononucleotide method and the k-tuple method (Table 1). (While our dinucleotide method gives partial scores for imperfect matches, the k-tuple method gives scores only for perfect matches, as represented by ATC in ATCXXATC.) In general, D.melanogaster and human show similar degrees of local repetitiveness. (Table 1 also shows the level of repetitiveness in artificial intron sequences, generated by permuting the nucleotides randomly within each intron, as a control.)
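The k-tuple measure and its permutation control can be sketched as follows. This is an illustrative reading rather than the authors' code: the score at a given distance is taken here to be the fraction of k-mers that recur exactly at that distance downstream, and the control shuffles the nucleotides within one sequence. Function names and the seeded random generator are assumptions.

```python
import random


def ktuple_recurrence(seq: str, k: int, distance: int) -> float:
    """Fraction of k-mers that recur perfectly `distance` nt downstream.
    Distances smaller than k overlap the query word and are best avoided."""
    last = len(seq) - distance - k  # last valid 0-based query start
    if last < 0:
        return float("nan")
    matches = sum(
        seq[i:i + k] == seq[i + distance:i + distance + k]
        for i in range(last + 1)
    )
    return matches / (last + 1)


def permute(seq: str, rng: random.Random) -> str:
    """Shuffle the nucleotides of one sequence, preserving its base composition."""
    bases = list(seq)
    rng.shuffle(bases)
    return "".join(bases)


print(ktuple_recurrence("ATCGGATC", 3, 5))  # ATC recurs at +5: 1.0
control = permute("ATCGGATC", random.Random(0))
print(sorted(control) == sorted("ATCGGATC"))  # composition is preserved
```

Comparing the score of a real intron with the mean over many permuted copies separates local repetitiveness from the effect of base composition alone.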

Local base composition has some effect on substitution patterns

As described in the next section, the absolute value of the repetitiveness score is greatly influenced by the global (isochore scale) unevenness in GC content. However, it seems likely that local repetitiveness, as shown above, is independent of the isochore effect and associated with the local difference in substitution pattern. Hence, we compiled alignments of homologs among human, bovine and mouse and, based on the parsimony method, inferred substitution at codon position 3 in each lineage. We then analyzed the local frequency of nucleotide J (J is any of A, G, C or T), i.e. local J%, near the sites where an I→J synonymous transition ({I,J} is any of {A,G}, {G,A}, {T,C} or {C,T}) has occurred in human. As a control, we also analyzed the local J% near the site of I→I, i.e. positions where the I→J transition has not occurred in human despite there being a chance of it. (Only codon position 3 was analyzed by protocol B as set out in Materials and Methods.) In general, {local J% – gene J%} is greater for those segments subject to a synonymous transition than for those segments without such transitions (P < 0.001 for human and bovine and P < 0.005 for mouse; data not shown). Analyses of substitution rates for sites with different local nucleotide compositions showed consistent results. For example, when {local A% – gene A%} is >20%, synonymous G→A transitions occurred at 6.97% of eligible G residues (n = 1957), while the total rate was 5.98% (hypergeometric distribution, P < 0.001). Similarly, the A→G rate was 8.34% for A residues in a G-rich region (control 6.50%), T→C was 9.04% (control 8.53%) and C→T was 8.55% (control 7.25%) (both P < 0.001). While these data are consistent with the idea that substitutions tend to maintain local unevenness in nucleotide composition, they do not rule out the possibility that a very small number of genes predominantly contribute to the difference.

Hence, we also calculated, for each gene, the average scores of {local J% – gene J%} near I→J and I→I and then the pairs of data were analyzed (Table 2). Again, there was a difference (P < 0.01) between the results for near I→J and near I→I. A weak but similar tendency was observed for mouse. For bovine, the strength of the local factor was comparable to that of human. We thus concluded that, in general, those transitions which contribute to the local bias (i.e. within gene heterogeneity) in nucleotide composition are significantly more frequent than those which homogenize the local bias.

Table 2. Local content^a of nucleotide J near the site of a synonymous I→J transition (where I→J is any of the transitions A→G, G→A, T→C and C→T) and the site with no change (i.e. I→I).

                            {local J% – gene J%} ± SD (n)    t-test^b
Human
  Near transition (I→J)     0.721 ± 4.72 (415)               P < 0.01
  No transition (I→I)       0.040 ± 1.42 (415)
Mouse
  Near transition (I→J)     0.50 ± 3.5 (510)                 P < 0.05
  No transition (I→I)       0.17 ± 1.6 (510)
Bovine
  Near transition (I→J)     0.68 ± 4.4 (457)                 P < 0.005
  No transition (I→I)       0.01 ± 1.5 (457)

^a {local J% – gene J%} represents ΔFilocal(J│I→Jsynon) for near transition and ΔFilocal(J│I→I) for near no transition, which are as defined in Materials and Methods. The pairs of indices were determined for each gene and then the pairs were treated as independent data. Genes which had been subjected to a very small number (<5) of synonymous transitions were not used.

^b Significance of the difference between the data for transition and no transition.

Isochore and its effect on the repetitiveness and substitution

While our study was originally intended to determine local factors, our procedure was unexpectedly found to be convenient for studying an isochore scale effect. Here we summarize our findings concerning global scale GC content unevenness and its effect on substitution. [Note that this section deals with the global scale effect, not directly related to local repetitiveness. The data for this section (Tables S1–S5) are available at NAR Online as Supplementary Data.] First, human/bovine/mouse alignments show that our repetitiveness score is greatly influenced by GC unevenness within the genome and the trend in substitution pattern. For example, substitutions in the mouse lineage resulted in a marked reduction in the repetitiveness score while in the bovine lineage this decrease is small (Table S1). This is because, for mouse, substitutions tend to destroy the GC unevenness by GC homogenization (Tables S1 and S2; see also 38). Secondly, while our analyses were based on well-conserved (slowly evolving) genes, the well-known phenomenon of high substitution rates in murids is evident (Table S1; 39,40). Thirdly, in human (but not in mouse), genes with different GC contents are subject to differently biased directions of substitution, which largely maintains the human isochore structure (Tables S3–S5). Finally, despite the isochore-related difference, a general trend of a bias towards AT is observed for human (Table S3). This trend is found for GC-poor, GC-medium and GC-rich genes alike (Tables S3 and S5).

These findings indicate that stationarity is unlikely to hold for either the human or the mouse genes that we analyzed. It has been shown that in human HLA genes more GC→AT mutations have occurred, but that polymorphic alleles generated by AT→GC mutations tend to segregate (spread in the population) to a greater extent than do GC→AT mutant genes (41). Although such findings provide evidence for the presence of selection favoring GC mutation at codon position 3, it is not clear to what extent the condition of stationarity (which is required for the approach as in ref. 39) is also satisfied for non-HLA genes. The fact that there is no correlation between GC content and substitution rate (42) suggests that the general trend towards AT substitutions is not specific to the slowly evolving genes we analyzed. A careful examination of stationarity utilizing multiple species alignments may be helpful.

Although it seems difficult to draw any conclusions from our data regarding the selection/mutation debate (41), we surmise that mutation plays a greater role than selection. First, the presence of both a general trend towards GC→AT substitutions (Table S3) and a marked difference in substitution pattern among isochores with different GC contents (Tables S3 and S5) seems difficult to explain by selection theory. Secondly, if selection is the primary factor, it is difficult to explain why mouse, bovine and human have totally different general trends and intensities of the isochore-specific effect (Tables S3 and S4 and data not shown). Notably, a recent study on pseudogenes strongly argues that mutation is the major factor (43).

There is a regional difference in substitution rate within the human genome, suggesting that substitution rate is greatly affected by mutation rate, which varies among regions of the genome (42). Interestingly, the isochore structure does not overlap the regional difference in substitution rate over the human genome. The efficiency of the DNA repair system varies over the genome, potentially explaining the regional differences in substitution rate (42,44). Together with these findings, the complex nature of the substitution pattern shown in our data leads us to surmise that there are many factors influencing the mutation rate and pattern.

Conclusions and further questions

We summarize the above findings as follows. First, every type of local repetitiveness we have described in this study is different from the ‘isochore (>100 kb) scale’ GC content unevenness, because all the local repetitiveness is found on the scale of <∼20 residues (<∼60 nt). Second, with respect to local repetitiveness in DNA sequences, the human genome appears to have at least two types of repetitiveness, one of which is not found in D.melanogaster cDNAs, i.e. while introns of human and D.melanogaster have similar patterns of repetitiveness (Table 1), codon base 3 repetitiveness has a markedly different pattern (Fig. 4A and B). In support of this, human ancient genes (i.e. homologs of E.coli genes) have a significant level of repetitiveness in codon base 3 (Fig. 4C), despite the fact that the amino acid sequences and nucleotides at codon positions 1 and 2 have no significant level of repetitiveness (Fig. 4D and unpublished data). Third, nucleotide substitution analyses showed, rather unexpectedly, that the global GC content is an influential factor strongly associated with the direction of substitutions. The mouse has a different strategy for genomic evolution from the human; the substitution patterns in mouse are not as dependent on global GC% as those in human (Tables S3 and S4). Finally, the data nonetheless show that some part of the bias in substitution pattern is associated not with global but with local nucleotide content.
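To make the notion of repetitiveness on a scale of <∼20 residues concrete, the following is a minimal illustrative sketch. It is not the repetitiveness score defined in Materials and Methods. Instead it assumes a simple observed/expected ratio of identical-residue pairs at each separation distance, with the expectation taken from the overall composition of the sequence. The function name `recurrence_profile` and the parameter `max_dist` are hypothetical.

```python
from collections import Counter


def recurrence_profile(seq, max_dist=20):
    """Observed/expected ratio of identical-residue pairs at each distance.

    For each distance d (1..max_dist), the observed value is the fraction of
    positions i where seq[i] == seq[i + d]. The expected value is the sum of
    squared residue frequencies, i.e. the chance that two residues drawn
    independently from the sequence composition are identical. Ratios above
    1 indicate more recurrence than composition alone would predict.
    """
    n = len(seq)
    freqs = Counter(seq)
    expected = sum((c / n) ** 2 for c in freqs.values())
    profile = {}
    for d in range(1, min(max_dist, n - 1) + 1):
        matches = sum(seq[i] == seq[i + d] for i in range(n - d))
        profile[d] = (matches / (n - d)) / expected
    return profile
```

Applied to amino acid sequences or to the bases at a single codon position, a profile that exceeds 1 at short distances and decays towards 1 within about 20 positions would be consistent with the local scale described above. A proper analysis would also need to control for within-gene composition heterogeneity, which this sketch ignores.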

In our view, repetitiveness in amino acid sequences is largely DNA sequence directed and similar to the repetitiveness in introns in the sense of evolutionary distribution and possible underlying mechanism; both could result from strand slippage, although introns specifically prefer purine–pyrimidine dinucleotide motifs, such as CACACA (37). However, while strand slippage is generally believed to change the length, our data imply that some strand slippage does not require a change in length.

Besides such slippage-like mechanisms, we also propose that amino acid repetitiveness may be partly generated in human by a local bias in substitution pattern and thus a bias in nucleotide composition. We previously showed that locally GC-rich segments tend to encode R (Arg) (3). (On the local scale G content and C content are not coupled, so GC-rich segments are generated as an overlap of G-rich and C-rich segments.) Some amino acid repetitiveness should therefore result from nucleotide composition bias. Thus, amino acid repetitiveness is likely to be generated by at least two genomic causes. One is sequence-directed mutagenesis, which is possibly related to slippage. The other is a local bias in nucleotide content, which, as we show, is generated by a bias in nucleotide substitution.

Due to the limited data that could be used for substitution analysis, much remains unresolved. For example, we cannot determine the relative contributions of the above factors to amino acid repetitiveness. In the future, advanced statistical methods and their application to an expanded genome database, especially of mammals, may allow a more detailed examination of the relative strength of the local as opposed to the global scale correlation and an assessment of various local factors. While the substitution analyses in this study were concerned mostly with isolated substitution events, in the future we may have sufficient data for the analysis of double substitutions (such as AA→GG; see 45), which may provide more information about sequence-directed mutagenesis.

SUPPLEMENTARY MATERIAL

See Supplementary Material available at NAR Online.

[Supplementary Data]

ACKNOWLEDGEMENTS

We thank the anonymous reviewers for their insightful comments. This study was supported in part by Grants-in-Aid for Scientific Research from the Ministry of Education Science and Culture, Japan.

REFERENCES


Supplementary Materials

[Supplementary Data]
nar_28_19_3801__1.pdf (44.2KB, pdf)

Articles from Nucleic Acids Research are provided here courtesy of Oxford University Press
