Skip to main content
Genome Biology and Evolution logoLink to Genome Biology and Evolution
. 2014 Feb 25;6(3):655–665. doi: 10.1093/gbe/evu042

Increased Substitution Rates Surrounding Low-Complexity Regions within Primate Proteins

Carolyn Lenz 1, Wilfried Haerty 2, G Brian Golding 1,*
PMCID: PMC3971593  PMID: 24572016

Abstract

Previous studies have found that DNA-flanking low-complexity regions (LCRs) have an increased substitution rate. Here, the substitution rate was confirmed to increase in the vicinity of LCRs in several primate species, including humans. This effect was also found among human sequences from the 1000 Genomes Project. A strong correlation was found between average substitution rate per site and distance from the LCR, as well as the proportion of genes with gaps in the alignment at each site and distance from the LCR. Along with substitution rates, dN/dS ratios were also determined for each site, and the proportion of sites undergoing negative selection was found to have a negative relationship with distance from the LCR.

Keywords: low-complexity region, substitution, primate, mutation

Introduction

More than half of the human genome is composed of repetitive sequences (transposable elements, micro- and minisatellites) often defined as “junk” DNA, and even the 1.2% of our genome that includes all the protein coding genes is not free of these repetitive sequences. In humans, ∼14% of all proteins include at least one peptide sequence characterized by a low information content as a consequence of the repetition of a handful of amino acids (Golding 1999; Marcotte et al. 1999; Huntley and Golding 2000, 2002; Haerty and Golding 2010). The archetype of these peptide sequences is the reiteration of a single residue encoded by a trinucleotide repeat whose motif can be expanded up to 265 times in some severe cases of Huntington’s disease (Milunsky et al. 2003). Additionally, low-complexity amino acid sequences, including single amino acid repeats, are significantly enriched within specific classes of proteins such as transcription factors and proteins involved in development (Alba and Guigo 2004; Faux et al. 2005, 2007; Huntley and Clark 2007).

Like their noncoding counterparts, trinucleotide repeats within protein coding sequences, and more generally sequences encoding low-complexity regions (LCRs), are believed to evolve neutrally because of their low nucleotide conservation (Schmid and Tautz 1999; Huntley and Golding 2000; Alba and Guigo 2004; Huntley and Clark 2007) and their propensity to rapidly vary in size because of replication slippage (Ellegren 2004) and unequal recombination (Zilversmit et al. 2010). As a consequence of these mutational processes, the propensity of both noncoding and coding repeats to vary in size is directly correlated to their length homogeneity (Alba et al. 1999, 2001). Additionally, these sequences are found to be enriched in regions of proteins that are often associated with relaxed purifying selection such as alternatively spliced exons, interdomain sequences, regions far from functional domains, or at the surface of the protein in contact with the solvent (Romero et al. 2001; Haerty and Golding 2010).

In the case of noncoding microsatellites, mutation rates have been estimated to be 100–10,000 times greater than for other point mutations (Ellegren 2004). Although the vast majority of these sequences are often considered to evolve neutrally and their high mutability does not affect their host’s fitness, the evolution of many coding repetitive sequences greatly departs from neutral expectations, and evidence of selection acting on many of these sequences have been reported in a wide range of species (see Haerty and Golding 2010 for review).

Until recently, all analyses focused on describing the mutability of the repeats and their evolutionary patterns (Huntley and Golding 2002; Alba and Guigo 2004; Faux et al. 2005; Haerty and Golding 2010; Mularoni et al. 2010). However, the instability of a repeat is also likely to correlate with nucleotide variation within its flanking nucleotide sequences, either as a manifestation of a local relaxation of selection or as a direct consequence of a mutagenic effect of the repeat. For instance in Escherichia coli, instability at a trinucleotide repeat was associated with increased variation at adjacent repeats (Blackwood et al. 2010). Furthermore, multiple recent studies showed that insertion/deletions (indels) tend to be associated with an increased density of polymorphic sites within their flanking regions, suggesting that the high mutation rate and the size instability of low-complexity sequences in the genome also affects the sequences surrounding these regions (Vowles and Amos 2004; Huntley and Clark 2007; Amos et al. 2008). This observation of increased nucleotide variation near indels has been extended to microsatellite sequences (Siddle et al. 2011; McDonald et al. 2011; Amos 2013) and sequences encoding LCRs within proteins in Drosophila (Huntley and Clark 2007) and Plasmodium falciparum (Haerty and Golding 2011).

These recent findings extend the mechanisms proposed to explain the observed increased nucleotide variation near indel sequences (McDonald et al. 2011). Despite a study in Drosophila melanogaster, which ascribed the elevated density of single nucleotide polymorphism (SNPs) in the vicinity of indels to population admixture (Massouras et al. 2012), most analyses thus far have found an association between mutagenic effects with indels and repeated sequences. The increased mutation rate was first associated to heterozygous indels leading to error during DNA repair (Tian et al. 2008). In contrast McDonald et al. (2011) attributed the observed increased mutation rate to the recruitment of error-prone DNA polymerases at stalled replication forks due to repetitive sequences (but see Jovelin and Cutter 2013 for review). Finally, the observation of an increased transversion rate within the flanking regions of indels tend to favor this latter hypothesis (McDonald et al. 2011; Jovelin and Cutter 2013).

In light of the high number of protein coding sequences that contain these rapidly evolving low-complexity repeats within eukaryote genomes and their potential mutagenic effect within their surrounding sequences, one can wonder about the extent of such instability, its impact on the coding sequence, and the intensity of selective pressures, if any, exercised on these regions. Because the most deleterious mutations are expected to be ephemeral (Fu et al. 2013; Kiezun et al. 2013), in order to accurately quantify the mutational effect of the repetitive sequences and the selection acting on variation surrounding these regions, the analysis needs to focus on the comparison of closely related species or individuals within the same species. Most studies so far have been limited to a small number of genomes; however, the recent release of fully sequenced human genomes as part of the 1000 Genomes initiative (1000 Genomes Project Consortium 2010) gives us unprecedented power to assess the mutagenic effect of repetitive sequences within human populations and to quantify the strength of selection acting on these genomic regions.

Here we characterize the substitution pattern within primate proteins surrounding these LCRs. We also examine the nucleotide polymorphism in the vicinity of these repeated sequences within human populations. We show that LCRs significantly affect the evolution of their surrounding sequences, leading to both increased substitution and mutation rates within protein-coding sequences. With the availability of abundant data, we found an extremely strong relationship between the distance from the nearest LCR and the number of substitutions and indels seen in protein-coding DNA. Finally, a strong evidence of negative selection implies that these substitutions are occurring in functional, conserved sequences.

Materials and Methods

Representative Homo sapiens (human, GRCh37), Pan troglodytes (chimpanzee, CHIMP2.1.3), Gorilla gorilla (gorilla, gorGor2), Pongo pygmaeus (orangutan, PPYG2), and Macaca mulatta (rhesus macaque, MMUL_1) genomes were downloaded from Ensembl release 62 (Flicek et al. 2011). The protein sequences were searched for regions of low-complexity with SEG (Wootton and Federhen 1993), using a window size of 15 bp and a complexity threshold of 1.9. Using this method, 7,387 LCRs were identified in humans. LCRs from the four other primate species were found and compared with find orthologs based on location within the protein as well as identity between the sequences, and we identified 5,584 LCRs (in 2,425 genes) shared by all five species (table 1). Data from the 1000 Genomes Project (1000 Genomes Project Consortium 2010) were downloaded and LCRs that corresponded to the LCRs associated with the primate flanking regions were identified. The translated protein sequences containing these LCRs were aligned using Muscle (Edgar 2004).

Table 1.

Number of 1-to-1 Orthologous LCRs Shared between Each Pair of Species

Human Chimpanzee Gorilla Orangutan Macaque
Human 7,387 6,908 6,776 6,635 6,662
Chimpanzee 7,208 6,635 6,611 6,565
Gorilla 7,098 6,461 6,412
Orangutan 6,957 6,414
Macaque 6,454

The SNP calls, including the allele frequencies, for 178 individuals of African origin (1000 Genomes Project Consortium 2010), were obtained, and the distance from the nearest LCR was found for all SNPs. Using the pairwise genome alignments downloaded from the UCSC database, the nucleotides were compared with homologous sites in macaque and chimpanzee to determine the ancestral state of the SNPs using maximum parsimony. Because we expect the most recent SNPs to have low allelic frequency, we assessed the relationship between the allelic frequency of each SNP and its distance to the LCR to assess the mutagenic propensity of LCR-encoding sequences on their flanking sequences and their potential response to selection. Only SNPs from protein-coding regions were used, because the more relaxed selection outside of coding regions could alter any relationship. Although a lower SNP frequency could also indicate more deleterious alleles, unless the presence of LCR was increasing the substitution rate or causing a relaxation in selection, we do not expect a distributional bias. Distance to the nearest protein-coding LCR, including introns and intergenic regions, was calculated. The various functional categories of the protein products of the genes were downloaded from Gene Ontology (GO) (Gene Ontology Consortium 2000) to see if the genes studied tended to be associated with any particular functional category.

The protein-coding regions containing LCRs were scrutinized to determine whether they had “suitable” flanking regions. To get a clear picture of how distance from an LCR-encoding sequence affects substitution rate, we identified flanking regions that did not overlap with those of another LCR and were not interrupted by either a transcription start site or stop codon. For example, in the case of flanking regions up to 1,500 nt away from an LCR, this would mean the flanking region from an LCR could not be used if the LCR was less than 3,000 nt away from the nearest LCR or twice the length of the flanking region. The termini of exons were not taken into account, as the DNA transcript was considered as a whole. Because LCRs are often located near the termini of proteins (Huntley and Clark 2007), there were many LCRs for which only the 5′ or 3′ flanking region could be used, as the flanking region on the opposite site of the LCR was interrupted (see supplementary fig. S1, Supplementary Material online, for distribution of maximum flanking region length and see supplementary table S1 and S2, Supplementary Material online). As well, CodeML (Yang 2007) requires that sequences being compared do not contain gaps, so flanking regions could only be used if sequences from all five primate species were present. This resulted in 1,385 upstream flanking regions and 1,399 downstream flanking regions of at least 300 nt in length from 1,136 genes; however, the number of gaps in the vicinity of LCRs lowered the number of available examples surrounding the LCR. The same subset of flanking regions was used for both the primate sequences and sequences from the 1000 Genomes Project.

CodeML (Yang 2007) was used to estimate the number of substitutions for each codon of the flanking regions, using model 2 (two or more dN/dS ratios for branches) for both the primate and 1000 Genomes data. As an accurate phylogenetic tree could not be provided for data from the 1000 Genomes Project, pairwise comparisons between all individuals were used instead. The mean of these pairwise comparisons was found for each site. The average number of substitutions was computed for all usable codons adjacent to the LCR (fig. 1). Pearson’s product-moment correlation (R Development Core Team 2011) was used to test for a relationship between the average number of substitutions and the distance from the LCR. To ensure that substitutions detected were not the result of misalignments, the relationship between the number of substitutions and distance from the LCR was also calculated only for flanking regions that were likely to be accurately aligned (such as flanking regions that contained no gaps when aligned).

Fig. 1.—

Fig. 1.—

Data workflow: (1) Homologous proteins found for five primate species; (2) LCRs identified using Seg; (3) Maximum length of flanking sequences determined by protein termini and midpoints between two LCRs; (4) Flanking regions filtered for examples with homologous sequences from all five species. Because the second LCR in this example is present in only four species, its flanking regions are not used; (5) The 3′ and 5′ flanking sequences are considered separately so that if all five homologous sequences are not available, the other can still be used; (6, continuing with the 3′ sequence) The upstream and downstream flanking regions are aligned separately; gaps are represented with thin lines; (7) Alignments of individual codons are used to find the number of substitutions at each site for each flanking region with CodeML. Codons with gaps (in this case codons 1, 2, and 7–12) are not usable and are not considered when calculating the average number of substitutions per site. The number of substitutions was found for all usable sites of all genes found to contain LCRs, and the average across all usable sites was found for all positions relative to the LCR (i.e., codon 1, codon 2, etc.).

To test whether or not selection was occurring, CodeML was used to estimate dN/dS, or ω, for each codon in flanking regions. An expected distribution of dN/dS ratios was found for each number of substitutions by simulating codon sequences under neutral selection using evolver (part of the PAML package; Yang 2007). The dN/dS ratios were found for each of these sets of codons and used to determine whether estimated dN/dS ratios of the codons from flanking regions were expected under neutral selection or indicated significant evidence of positive or negative selection. Although there were a larger number of aligned codons in proximity to the LCR, this should not affect the significance of dN/dS, as the sample size used when calculating the dN/dS ratio and generating expected distributions was fixed to five. The proportion of genes that had evidence of positive selection (i.e., a significant dN/dS ratio greater than 1) or negative selection (i.e., a significant dN/dS ratio less than 1) was found for each site (Yang and Bielawski 2000). The cutoff for significance, found using the expected distribution of dN/dS values, varied depending on the number of substitutions in the aligned codon sequences. The average proportion of genes that had significant evidence of positive or negative selection was found for each aligned codon from the flanking regions, and the data were searched for a relationship between distance from the LCR and proportion of genes undergoing positive and negative selection.

As CodeML is only capable of analyzing substitutions, but not indels, any site with a gap in the alignment was ignored when calculating the average substitution rate. To study whether LCRs are associated with indels, the number of genes in which at least one sequence had a gap was found for each nucleotide. This was expressed as a proportion, i.e., as a percentage of genes that had an indel next to the LCR, 1 bp away from the LCR, 2 bp away from the LCR, etc. This data was tested for a correlation between average number of gaps per site and distance from the LCR.

To further explore the possible effects of selection, the value of Tajima’s D (Tajima 1989) was calculated for each codon in the flanking regions of the data from the 1000 Genomes Project. To determine the significance of each D value, the program ms (Hudson 2002) was used to generate a distribution of expected Tajima’s D values for neutrally evolving fragments with the same length and number of polymorphic sites as each codon. The Benjamini–Hochberg procedure (Benjamini and Hochberg 1995) was used to control for the false discovery rate, and the proportion of genes with evidence for positive or negative selection was found for each site.

Results

LCR-encoding sequences were found to have a profound effect on their flanking regions. Several characteristics of the protein-coding DNA surrounding LCRs appeared to be directly affected by their distance from the nearest LCR. A strong, statistically significant negative correlation was seen between the average number of substitutions and distance from the LCR (correlation coefficient Inline graphic, P value Inline graphic, see fig. 2A). Although misalignments due to indels may be more likely in the vicinity of LCRs, this does not appear to be a factor, because the relationship between substitution and distance to LCR was only slightly weaker (−0.7432, P value Inline graphic) when only flanking regions that had no alignment gaps were used. As well, both synonymous and nonsynonymous substitution rates had a significant relationship to distance from the LCR, although the relationship between synonymous substitution rate and distance was stronger than the relationship between nonsynonymous substitution rate and distance (–0.8748 and –0.3787, respectively, P value Inline graphic for both, see fig. 2B and C).

Fig. 2.—

Fig. 2.—

Effect of distance from LCR on average number of (A) total substitutions, (B) nonsynonymous substitutions, and (C) synonymous substitutions at each codon in five primate species. Gray points indicate N, the number of genes that could provide information and were free of gaps at each site. Negative values are upstream of the LCR.

CodeML was also used for each site to estimate the dN/dS ratio to predict the nature of selective forces acting on the LCR-flanking sequences. If dN/dS was greater than 1, there is evidence that positive selection is occurring, while negative selection is more likely if dN/dS is less than 1. Using the expected distributions of dN/dS for neutrally evolving codon alignments generated using evolver to identify only significant results and ignoring sites for which dN/dS could not be estimated (i.e., sites with gaps in the alignment or no substitutions), a strong, negative correlation was found between distance from the LCR and proportion of genes that had evidence for negative selection at each site (Pearson’s product-moment correlation, correlation coefficient Inline graphic, fig. 3). Like the findings of Huntley and Clark (2007), there was also a negative (although, in this case, nonsignificant) correlation between distance from the LCR and the proportion of genes that had evidence for positive selection at each site (Pearson’s product-moment correlation, correlation coefficient Inline graphic, fig. 4).

Fig. 3.—

Fig. 3.—

Effect of distance from LCR on proportion of flanking regions that had evidence for negative selection (Inline graphic). Gray points indicate the number of genes that had substitutions and could be used to calculate dN/dS. Negative values are upstream of the LCR.

Fig. 4.—

Fig. 4.—

Effect of distance from LCR on proportion of flanking regions that had evidence for positive selection (ω > 1). Gray points indicate the number of genes that had substitutions and could be used to calculate dN/dS. Negative values are upstream of the LCR.

Using corresponding flanking regions from the 1000 Genome Project, a relationship similar to the one seen between primate sequences was found. The average number of substitutions per site had a significant negative correlation with distance from the LCR (correlation coefficient Inline graphic, P value Inline graphic, fig. 5A). This relationship was consistently seen, even when the number of substitutions was separated into synonymous and nonsynonymous substitutions, although the correlation between distance and nonsynonymous substitutions was slightly stronger than that for synonymous substitutions (correlation coefficients of −0.452 and −0.354, respectively, both with P values Inline graphic, figs. 5B and C). Using the SNP calls from the Broad Institute, distance from the LCR was found to have a weak but positive relationship with the frequency of derived SNPs (Spearman’s Inline graphic). This substantiates the high substitution rate in the vicinity of LCRs, as SNPs that have lower frequencies (i.e., are more recent) are more likely to be found near LCRs.

Fig. 5.—

Fig. 5.—

Effect of distance from LCR on average number of (A) total substitutions, (B) nonsynonymous substitutions and (C) synonymous substitutions at each codon in humans. Gray points indicate N, the number of genes that could provide information and were free of gaps at each site. Negative values are upstream of the LCR.

As a measure of potential negative selection, we searched for negative values of Tajima’s D. These calculations were done for each individual codon, and the distance of the codon from the nearest LCR was recorded. Using only values of Tajima’s D that were found to be significant, a negative correlation was found between distance from the LCR and proportion of genes with negative values of Tajima’s D at each codon (correlation coefficient Inline graphic, fig. 6). As a negative value for Tajima’s D can also indicate a bottleneck, a high number of negative values of Tajima’s D at any distance from an LCR can be expected between human sequences; however, the presence of an LCR clearly increases the proportion of sites in accord with more negative selection. There was also a negative relationship between distance from the LCR and proportion of genes with positive values of Tajima’s D at each codon, but this correlation was extremely weak and nonsignificant (correlation coefficient = −0.010, P = 0.7545).

Fig. 6.—

Fig. 6.—

Effect of distance from the LCR on proportion of genes that had negative values for Tajima’s D at each codon. Gray points indicate the number of genes with substitutions that could be used to calculate Tajima’s D at each site. Negative values are upstream of the LCR.

The proportion of genes with a gap appearing in the aligned flanking regions was found and compared with distance from the LCR. There was a strong, negative relationship between distance from the LCR and number of gaps (correlation coefficient = −0.244, P Inline graphic, fig. 7).

Fig. 7.—

Fig. 7.—

Effect of distance from the LCR on average proportion of genes with gaps in the alignment at each nucleotide in five primate species. Gray points indicate N, the number of genes that could provide information on each site. Negative values are upstream of the LCR.

The nucleotide composition of the protein-coding sequences of these genes was also investigated. The correlations between nucleotide composition in LCRs and their flanking regions were compared with the correlations between nucleotide composition in the flanking regions and the entire gene. Although there were significant correlations for all nucleotides between LCRs and flanking regions, the relationships between the flanking regions and genes were much stronger (table 2).

Table 2.

Correlation between Nucleotide Representation in LCRs and Their Associated Flanking Regions, Compared with the Correlation between Flanking Regions and the Rest of the Protein

Nucleotide Correlation between Flanking Region and LCR Correlation between Flanking Region and Protein
A 0.5070 0.6635
C 0.4152 0.6206
G 0.3343 0.5415
T 0.2726 0.5395

Note.— All correlations were significant (Inline graphic).

A search of the Gene Ontology database (Gene Ontology Consortium 2000) revealed no major trends. The functions seen most often are extremely common and likely to be shared among many unrelated genes.

Discussion

Although the evolution of repetitive regions has been studied for many years, the study of repetitive regions within coding regions is comparatively more recent. In noncoding regions, repetitive regions are known to evolve far more rapidly than nonrepetitive sequences due to both base substitutions and indels generated by slippage (Levinson and Gutman 1987). Indeed, microsatellites are described as one of the most variable types of DNA sequence in the genome (Ellegren 2004). Microsatellites have many mutational biases and are known to lead to multiple mutational events (Drake 2007; Schrider et al. 2011) and to mutational nonindependence (Amos 2010a). Microsatellites have also been shown to increase the frequency of mutations in the surrounding nucleotide sequences (Amos 2010c, 2010b; Varela and Amos 2010).

In addition to intergenic regions, repetitive sequences are also present in protein coding regions (Golding 1999; Marcotte et al. 1999). They too evolve rapidly (Huntley and Golding 2002; Alba and Guigo 2004; Huntley and Clark 2007; Brown et al. 2010; Haerty and Golding 2010) and have unusual mutation properties including the suggestion that the rate of substitutions are elevated in the surrounding nonrepetitive coding sequence (Huntley and Clark 2007; Haerty and Golding 2011). The few attempts made to assess the evolution of the genetic variation in the surrounding sequences of repeats reported an absence of selection on these sites (Jovelin and Cutter 2013). However, despite the fact that previous studies found repeated sequences to be preferentially located within coding regions under relatively low selective constraints (Romero et al. 2001; Haerty and Golding 2010), purifying selection is still expected to act on those SNPs arising in the vicinity of these LCRs because of the protein-coding sequence context. The extensive data now available for human and other primate genomes allows a more precise evaluation of the mutational potential of LCRs and the evolution of their flanking sequences.

In coding regions surrounding LCR sequences, we identified a relationship between the presence of LCRs and increased local substitution rates for humans and related primates. Not only is the correlation statistically significant, but the average number of substitutions is higher in sequences near LCRs and decreases rapidly as distance from the LCR increases. This demonstrates that the relationship between proximity to LCRs and substitution rate seen in previous studies, E. coli, yeast (McDonald et al. 2011), Drosophila (Huntley and Clark 2007; McDonald et al. 2011), Caenorhabditis (Jovelin and Cutter 2013), and Plasmodium (Haerty and Golding 2011), is also present in primates. It is also consistent with previous findings showing an increased number of mismatches and SNPs in the vicinity of microsatellites (Vowles and Amos 2004; Siddle et al. 2011), suggesting that the repetitiveness of these sequences is key to their increased mutation rate. This relationship between LCR distance and substitution is also apparent for substitutions segregating within humans, as shown using the data from the 1000 Genomes Project. Along with the substitution rate, the number of gaps in the aligned sequences appears to be greater in the vicinity of LCRs. From these results, it is clear that the frequency of both point mutations and indels increases in DNA immediately surrounding LCRs in primates.

The specific subset of flanking regions used in this analysis (1,385 upstream and 1,399 downstream flanking regions) should accurately represent the genome-wide effect of protein-coding LCRs on adjacent DNA. The criteria by which sequences were eliminated from this analysis was the length of flanking regions, determined by the distance from LCRs to the protein termini, or, in examples where there were multiple LCRs per gene, by the distance between LCRs. No analyses have been performed to determine whether there is a relationship between the protein function and the location or number of LCRs, but no trends were found in the GO functional classification of the proteins used in this study. Although flanking regions from only a fraction of LCRs originally detected were used, the diverse array of functions suggests that this is a representative sample of LCR-containing genes.

Although the flanking regions clearly show a relationship between distance from the LCR and substitution rate, they also show asymmetry in that relationship. The substitution rate is higher on the 5′ side of the LCR-encoding sequence. Although the nature of the mutational pressure affecting DNA adjacent to LCRs is still uncertain, the skewness of the substitution rates could implicate a transcription-related mechanism. Most of the LCRs used here were expressed in germline cells (expression patterns found using the Human Protein Atlas (Uhlen et al. 2012); 1,108 upstream and 1,119 downstream flanking regions were expressed in the germline and 277 upstream and 280 downstream flanking regions were not expressed in the germline), but the relationship between distance from the LCR and substitution rate remains skewed even for cells not expressed in the germline (supplementary figs. S2 and S3, Supplementary Material online). Polak et al. (2010) found a substitution transcriptional bias in DNA, where regions just downstream of the transcription start site are more prone to substitutions that lead to strong bonds (i.e., more CG base pairs, with three hydrogen-bonds, than TA pairs); an excess of C to T substitutions over G to A was also found near the transcription start site in some genes. Although a bias in substitution rate was not found, the difference in substitutions based on proximity to the transcription start site could be related to the asymmetry in the number of substitutions seen around LCRs.

Hence, our observation of a transcriptional asymmetry in the SNP distribution broadens the mechanism proposed by McDonald et al. (2011) to explain the mutagenic effect of repeats. In their model, McDonald et al. (2011) associated the increased mutation rate within the flanking sequences of microsatellites to the recruitment of error-prone DNA-polymerases at stalled replication forks. The presence of highly repetitive sequence within protein-coding sequences is likely to also stall the RNA-polymerase (Parsons et al. 1998), leading to DNA damage in the transcribed strand followed by transcription coupled repair or nucleotide excision repair.

A possibly interesting result is the distinct dip adjacent to the LCR in both the proportion of genes with negative values of Tajima’s D and the proportion of genes with gaps. The decrease in indel mutations immediately adjacent to the LCR may indicate that DNA at the borders of LCRs is important for the correct alignment of these protein-coding regions during recombination. It is however possible that this is simply due to alignment artifacts. These alternatives have not yet been distinguished.

The LCRs themselves have variations in length that can cause inaccurate alignments; a slightly more stable region surrounding LCRs could mitigate the effects of this. However, there was also a drop in the proportion of genes with negative values of Tajima’s D in this area, suggesting that the DNA immediately adjacent to the LCRs is less conserved. This could be a result of the misidentification of the exact borders of the LCR, as these few sites might be under different selection pressures if they were part of the LCR itself.

Jovelin and Cutter (2013) applied a similar procedure for the SNPs in the vicinity of indels within up to 217 sequences surrounding miRNA genes in Caenorhabditis. In contrast to our results, no significant evidence for a skew in the distribution of allele frequency was observed in nematodes. The differences between the studies may be a consequence of the fact that we selected LCRs within protein-coding sequences and also the increased power we gained by interrogating SNPs across the whole human genome.

There are several possible causes for the high substitution rate in LCRs that have been previously studied (Levinson and Gutman 1987; Moore et al. 1999; Amos et al. 2008), but the reason why this effect should extend into flanking regions has not yet been determined. One of the major factors believed to be affecting the mutation rate in LCRs is replication slippage, which causes the LCR to expand or contract by one or more repeated units during replication. Mutations resulting in LCRs of varying lengths could affect the surrounding DNA, as misalignments could make DNA repairs in flanking regions more error prone. The increase in the number of SNPs surrounding mononucleotide repeats found by Siddle et al. (2011) supports this, as expansion and contraction mutations are more likely in repetitive sequences and both are potential causes of misalignment. Another less-explored possibility is unequal recombination. Due to the repetitive nature of LCRs, alignment during crossover events is sometimes inaccurate. This can lead to unequal DNA sequences being exchanged between two strands of DNA. These recombination events can also incorporate DNA adjacent to the LCR, which would explain the high substitution rate of LCR-flanking regions. As well, more gaps would be expected if unequal recombination were likely, and a strong correlation was seen between distance from the LCR and the proportion of genes with gaps seen in the alignment.

There is also an alternative possibility that these flanking regions were previously part of the LCR, created through the LCR’s expansion. Later, point mutations, which can increase the complexity of an LCR sometimes to the point of causing LCR death (Radó-Trilla and Albà 2012), might then change the sequence enough to obscure its origin within an LCR. The high substitution rate in regions flanking LCRs would then be a result of the LCR’s expansion and high substitution rate. As well, the natural propensity for expansion and contraction of LCRs would then still be expected to have an effect on regions on the verge of gaining complexity through point mutations, leading to the relationship between the number of gaps seen and distance from the LCR. If this were the case, however, flanking regions would be expected to have a nucleotide composition more similar to their associated LCRs than to the rest of the protein, which was not seen.

It is also possible that the increased number of substitutions surrounding LCRs is due to relaxed selection allowing more substitutions. The data from CodeML directly contradict this, however, as many sites in flanking regions had dN/dS ratios that indicated significant evidence for negative selection. The flanking regions of LCRs may be subject to more synonymous and nonsynonymous substitutions, but synonymous substitutions are more prevalent than nonsynonymous. As the probability of seeing negative selection significantly increases in proximity to the LCR, it appears that the region surrounding the LCR is under strong selection pressure favoring synonymous substitutions. This intolerance for changes to the protein sequence indicates functional importance in the regions around LCRs.

Using Tajima’s D to test what type of selection was likely to be occurring within data from the 1000 Genomes Project data, negative values were represented far more often than positive, especially in proximity to LCRs, likely indicating purifying selection. As a negative Tajima’s D can also indicate recent population expansion, some of these values could be the result of the recent evolutionary history of humans, but demographic history would not be expected to result in an apparent relationship between Tajima’s D and distance from an LCR. The relationship between Tajima’s D and distance from the LCR therefore also conforms with the relationship seen in the primate species, providing more evidence for a functional importance in flanking regions.

Conclusions

It is clear that substitution rates are greatly increased in LCRs, and this effect extends into the flanking regions beyond the LCR. There is a strong correlation between distance from an LCR and number of changes seen between sequences. This correlation exists for both synonymous and nonsynonymous changes while a lesser correlation exists for indels. Average dN/dS ratios indicate that the regions around LCRs may be tightly controlled, as negative selection appears to be far more common than positive selection.

Supplementary Material

Supplementary figures S1–S3 and tables S1–S2 are available at Genome Biology and Evolution online (http://www.gbe.oxfordjournals.org/).

Supplementary Data

Acknowledgments

The authors thank two anonymous reviewers for their comments on previous versions of the manuscript. This work was supported by a Natural Sciences and Engineering Research Council of Canada grant number 140221-10 to G.B.G.

Literature Cited

  1. 1000 Genomes Project Consortium. A map of human genome variation from population-scale sequencing. Nature. 2010;467:1061–1073. doi: 10.1038/nature09534. [DOI] [PMC free article] [PubMed] [Google Scholar]
  2. Alba M, Guigo R. Comparative analysis of amino acid repeats in rodents and humans. Genome Res. 2004;14:549–554. doi: 10.1101/gr.1925704. [DOI] [PMC free article] [PubMed] [Google Scholar]
  3. Alba M, Santibanez-Koref M, Hancock J. Conservation of polyglutamine tract size between mice and humans depends on codon interruption. Mol Biol Evol. 1999;16:1641–1644. doi: 10.1093/oxfordjournals.molbev.a026077. [DOI] [PubMed] [Google Scholar]
  4. Alba M, Santibanez-Koref M, Hancock J. The comparative genomics of polyglutamine repeats: extreme differences in the codon organization of repeat-encoding regions between mammals and Drosophila. J Mol Evol. 2001;52:249–259. doi: 10.1007/s002390010153. [DOI] [PubMed] [Google Scholar]
  5. Amos W. Even small SNP clusters are non-randomly distributed: is this evidence of mutational non-independence? Proc Biol Sci. 2010a;277:1443–1449. doi: 10.1098/rspb.2009.1757. [DOI] [PMC free article] [PubMed] [Google Scholar]
  6. Amos W. Heterozygosity and mutation rate: evidence for an interaction and its implications: the potential for meiotic gene conversions to influence both mutation rate and distribution. Bioessays. 2010b;32:82–90. doi: 10.1002/bies.200900108. [DOI] [PubMed] [Google Scholar]
  7. Amos W. Mutation biases and mutation rate variation around very short human microsatellites revealed by human-chimpanzee-orangutan genomic sequence alignments. J Mol Evol. 2010c;71:192–201. doi: 10.1007/s00239-010-9377-4. [DOI] [PubMed] [Google Scholar]
  8. Amos W. Variation in heterozygosity predicts variation in human substitution rates between populations, individuals and genomic regions. PLoS One. 2013;8:e63048. doi: 10.1371/journal.pone.0063048. [DOI] [PMC free article] [PubMed] [Google Scholar]
  9. Amos W, Flint J, Xu X. Heterozygosity increases microsatellite mutation rate, linking it to demographic history. BMC Genetics. 2008;9:72. doi: 10.1186/1471-2156-9-72. [DOI] [PMC free article] [PubMed] [Google Scholar]
  10. Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J Roy Stat Soc B. 1995;57:289–300. [Google Scholar]
  11. Blackwood J, Okely E, Zahra R, Eykelenboom J, Leach D. DNA tandem repeat instability in the Escherichia coli chromosome is stimulated by mismatch repair at an adjacent CAG.CTG trinucleotide repeat. Proc Natl Acad Sci U S A. 2010;107:22582–22586. doi: 10.1073/pnas.1012906108. [DOI] [PMC free article] [PubMed] [Google Scholar]
  12. Brown CJ, Johnson AK, Daughdrill GW. Comparing models of evolution for ordered and disordered proteins. Mol Biol Evol. 2010;27:609–621. doi: 10.1093/molbev/msp277. [DOI] [PMC free article] [PubMed] [Google Scholar]
  13. Drake J. Too many mutants with multiple mutations. Crit Rev Biochem Mol Biol. 2007;42:247–258. doi: 10.1080/10409230701495631. [DOI] [PMC free article] [PubMed] [Google Scholar]
  14. Edgar RC. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 2004;32:1792–1797. doi: 10.1093/nar/gkh340. [DOI] [PMC free article] [PubMed] [Google Scholar]
  15. Ellegren H. Microsatellites: simple sequences with complex evolution. Nat Rev Genet. 2004;5:435–445. doi: 10.1038/nrg1348. [DOI] [PubMed] [Google Scholar]
  16. Faux N, et al. Functional insights from the distribution and role of homopeptide repeat-containing proteins. Genome Res. 2005;15:537–551. doi: 10.1101/gr.3096505. [DOI] [PMC free article] [PubMed] [Google Scholar]
  17. Faux NG, et al. RCPdb: an evolutionary classification and codon usage database for repeat-containing proteins. Genome Res. 2007;17:1118–1127. doi: 10.1101/gr.6255407. [DOI] [PMC free article] [PubMed] [Google Scholar]
  18. Flicek P, et al. Ensembl 2011. Nucleic Acids Res. 2011;39:D800–D806. doi: 10.1093/nar/gkq1064. [DOI] [PMC free article] [PubMed] [Google Scholar]
  19. Fu W, et al. Analysis of 6,515 exomes reveals the recent origin of most human protein-coding variants. Nature. 2013;493:216–220. doi: 10.1038/nature11690. [DOI] [PMC free article] [PubMed] [Google Scholar]
  20. Gene Ontology Consortium. Gene ontology: tool for the unification of biology. Nat Genet. 2000;25:25–29. doi: 10.1038/75556. [DOI] [PMC free article] [PubMed] [Google Scholar]
  21. Golding GB. Simple sequence is abundant in eukaryotic proteins. Prot Sci. 1999;8:1358–1361. doi: 10.1110/ps.8.6.1358. [DOI] [PMC free article] [PubMed] [Google Scholar]
  22. Haerty W, Golding G. Low-complexity sequences and single amino acid repeats: not just “junk” peptide sequences. Genome. 2010;53:753–762. doi: 10.1139/g10-063. [DOI] [PubMed] [Google Scholar]
  23. Haerty W, Golding GB. Increased polymorphism near low-complexity sequences across the genomes of Plasmodium falciparum isolates. Genome Biol Evol. 2011;3:539–550. doi: 10.1093/gbe/evr045. [DOI] [PMC free article] [PubMed] [Google Scholar]
  24. Hudson R. Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformatics. 2002;18:337–338. doi: 10.1093/bioinformatics/18.2.337. [DOI] [PubMed] [Google Scholar]
  25. Huntley M, Golding G. Evolution of simple sequence in proteins. J Mol Evol. 2000;51:131–140. doi: 10.1007/s002390010073. [DOI] [PubMed] [Google Scholar]
  26. Huntley M, Golding G. Simple sequences are rare in the protein data bank. Proteins. 2002;48:134–140. doi: 10.1002/prot.10150. [DOI] [PubMed] [Google Scholar]
  27. Huntley MA, Clark AG. Evolutionary analysis of amino acid repeats across the genomes of 12 Drosophila species. Mol Biol Evol. 2007;24:2598–2609. doi: 10.1093/molbev/msm129. [DOI] [PubMed] [Google Scholar]
  28. Jovelin R, Cutter A. Fine-scale signatures of molecular evolution reconcile models of indel-associated mutation. Genome Biol Evol. 2013;5:978–986. doi: 10.1093/gbe/evt051. [DOI] [PMC free article] [PubMed] [Google Scholar]
  29. Kiezun A, et al. Deleterious alleles in the human genome are on average younger than neutral alleles of the same frequency. PLoS Genet. 2013;9:e1003301. doi: 10.1371/journal.pgen.1003301. [DOI] [PMC free article] [PubMed] [Google Scholar]
  30. Levinson G, Gutman G. Slipped-strand mispairing: a major mechanism for DNA sequence evolution. Mol Biol Evol. 1987;4:203–221. doi: 10.1093/oxfordjournals.molbev.a040442. [DOI] [PubMed] [Google Scholar]
  31. Marcotte E, Pellegrini M, Yeates T, Eisenberg D. A census of protein repeats. J Mol Biol. 1999;293:151–160. doi: 10.1006/jmbi.1999.3136. [DOI] [PubMed] [Google Scholar]
  32. Massouras A, et al. Genomic variation and its impact on gene expression in Drosophila melanogaster. PLoS Genet. 2012;8:e1003055. doi: 10.1371/journal.pgen.1003055. [DOI] [PMC free article] [PubMed] [Google Scholar]
  33. McDonald M, Wang W, Huang H, Leu J. Clusters of nucleotide substitutions and insertion/deletion mutations are associated with repeat sequences. PLoS Biol. 2011;9:e1000622. doi: 10.1371/journal.pbio.1000622. [DOI] [PMC free article] [PubMed] [Google Scholar]
  34. Milunsky J, Maher T, Loose B, Darras B, Ito M. XL PCR for the detection of large trinucleotide expansions in juvenile Huntington’s disease. Clin Genet. 2003;64:70–73. doi: 10.1034/j.1399-0004.2003.00108.x. [DOI] [PubMed] [Google Scholar]
  35. Moore H, Greenwell PW, Liu CP, Arnheim N, Petes TD. Triplet repeats form secondary structures that escape DNA repair in yeast. Proc Natl Acad Sci U S A. 1999;96:1504–1509. doi: 10.1073/pnas.96.4.1504. [DOI] [PMC free article] [PubMed] [Google Scholar]
  36. Mularoni L, Ledda A, Toll-Riera M, Albá MM. Natural selection drives the accumulation of amino acid tandem repeats in human proteins. Genome Res. 2010;20:745–754. doi: 10.1101/gr.101261.109. [DOI] [PMC free article] [PubMed] [Google Scholar]
  37. Parsons M, Sinden R, Izban M. Transcriptional properties of RNA polymerase II within triplet repeat-containing DNA from the human myotonic dystrophy and fragile X loci. J Biol Chem. 1998;273:26998–27008. doi: 10.1074/jbc.273.41.26998. [DOI] [PubMed] [Google Scholar]
  38. Polak P, Querfurth R, Arndt PF. The evolution of transcription-associated biases of mutations across vertebrates. BMC Evol Biol. 2010;10:187. doi: 10.1186/1471-2148-10-187. [DOI] [PMC free article] [PubMed] [Google Scholar]
  39. R Development Core Team. R: a language and environment for statistical computing. Vienna (Austria): R Foundation for Statistical Computing; 2011. [Google Scholar]
  40. Radó-Trilla N, Albà M. Dissecting the role of low-complexity regions in the evolution of vertebrate proteins. BMC Evol Biol. 2012;12:155. doi: 10.1186/1471-2148-12-155. [DOI] [PMC free article] [PubMed] [Google Scholar]
  41. Romero P, et al. Sequence complexity of disordered protein. Proteins. 2001;42:38–48. doi: 10.1002/1097-0134(20010101)42:1<38::aid-prot50>3.0.co;2-3. [DOI] [PubMed] [Google Scholar]
  42. Schmid K, Tautz D. A comparison of homologous developmental genes from Drosophila and Tribolium reveals major differences in length and trinucleotide repeat content. J Mol Evol. 1999;49:558–566. doi: 10.1007/pl00006577. [DOI] [PubMed] [Google Scholar]
  43. Schrider D, Hourmozdi J, Hahn M. Pervasive multinucleotide mutational events in eukaryotes. Curr Biol. 2011;21:1051–1054. doi: 10.1016/j.cub.2011.05.013. [DOI] [PMC free article] [PubMed] [Google Scholar]
  44. Siddle KJ, Goodship JA, Keavney B, Santibanez-Koref MF. Bases adjacent to mononucleotide repeats show an increased single nucleotide polymorphism frequency in the human genome. Bioinformatics. 2011;27:895–898. doi: 10.1093/bioinformatics/btr067. [DOI] [PubMed] [Google Scholar]
  45. Tajima F. Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics. 1989;123:585–595. doi: 10.1093/genetics/123.3.585. [DOI] [PMC free article] [PubMed] [Google Scholar]
  46. Tian D, et al. Single-nucleotide mutation rate increases close to insertions/deletions in eukaryotes. Nature. 2008;455:105–108. doi: 10.1038/nature07175. [DOI] [PubMed] [Google Scholar]
  47. Uhlen M, et al. Towards a knowledge-based human protein atlas. Nat Biotechnol. 2012;28:1248–1250. doi: 10.1038/nbt1210-1248. [DOI] [PubMed] [Google Scholar]
  48. Varela MA, Amos W. Heterogeneous distribution of SNPs in the human genome: microsatellites as predictors of nucleotide diversity and divergence. Genomics. 2010;95:151–159. doi: 10.1016/j.ygeno.2009.12.003. [DOI] [PubMed] [Google Scholar]
  49. Vowles EJ, Amos W. Evidence for widespread convergent evolution around human microsatellites. PLoS Biol. 2004;2:e199. doi: 10.1371/journal.pbio.0020199. [DOI] [PMC free article] [PubMed] [Google Scholar]
  50. Wootton JC, Federhen S. Statistics of local complexity in amino acid sequences and sequence databases. Comput Chem. 1993;17:149–163. [Google Scholar]
  51. Yang Z. PAML 4: phylogenetic analysis by maximum likelihood. Mol Biol Evol. 2007;24:1586–1591. doi: 10.1093/molbev/msm088. [DOI] [PubMed] [Google Scholar]
  52. Yang Z, Bielawski JP. Statistical methods for detecting molecular adaptation. Trends Ecol Evol. 2000;15:496–503. doi: 10.1016/S0169-5347(00)01994-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
  53. Zilversmit MM, et al. Low-complexity regions in Plasmodium falciparum: missing links in the evolution of an extreme genome. Mol Biol Evol. 2010;27:2198–2209. doi: 10.1093/molbev/msq108. [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Supplementary Data

Articles from Genome Biology and Evolution are provided here courtesy of Oxford University Press

RESOURCES