We found that insertions and deletions occur more frequently in longer internal exons and mostly encode structure-less regions, and that half of them are attributable to alterations in tandem repeats.
Abstract
Insertions and deletions (indels) are known to preferentially encode intrinsically disordered regions (IDRs), regions that by themselves do not form unique three-dimensional structures. As we previously showed that long internal exons tend to encode IDRs, we decided to analyze how indels alter internal exons and affect IDRs of the encoded proteins. Here, we analyzed eight eukaryotes to select indels commonly observed in all variants (“fixed” indels). The fixed indels in internal exons mostly encode IDRs. Residue-wise ∼50% of the indels are attributable to alterations in tandem repeats. Deletion is generally more prevalent in long internal exons and in most species the same trend is detected in insertion. Tandem repeats occur preferentially in long internal exons, indicating that their alterations partly account for the high frequency of indels in long internal exons. Also, since tandem repeats mostly encode IDRs, this finding partially explains the high incidence of IDRs in long internal exons. We propose that long internal exons had been produced in early eukaryotes mainly by repeat expansion that added IDRs to the encoded proteins.
Introduction
Only a small fraction of prokaryotic proteins (residue-wise 2.0% of archean and 4.2% of eubacterial proteins) consist of intrinsically disordered regions (IDRs) that by themselves do not form unique three-dimensional structures, whereas 33.0% of residues in eukaryotic proteins are comprised of IDRs (Ward et al, 2004). IDRs in eukaryotic proteins participate in binding to other molecules and are thereby involved in important functions such as gene transcription and signaling (Anbo et al, 2019). Two mechanisms of de novo IDR generation are expansion of repetitive DNA sequences (Tompa, 2003) and exonization of introns (Kondrashov & Koonin, 2003; Sorek, 2007; Marquez et al, 2015), but to our knowledge their proportional shares have not been quantified. It is of interest to elucidate how IDRs had been acquired in the evolutionary process from the first eukaryotic common ancestor (FECA) to the last eukaryotic common ancestor (LECA).
We previously reported the general eukaryotic tendency of long internal exons to encode IDRs and proposed that long internal exons in the LECA had resulted from short exons by addition of IDR-encoding nucleotides (Fukuchi et al, 2023). As indels play important roles in exon length alterations and mostly encode IDRs (Light et al, 2013a, 2013b; Khan et al, 2015), indel analyses may reveal mechanisms of exon length alterations and thereby explain how the preferential encoding of IDRs by long internal exons came about.
Indels in human genomes cause as much variation as small nucleotide polymorphisms (Mullaney et al, 2010) and one way to identify them in coding sequences is to find indels in genomes and select those in coding sequences (Anzai et al, 2003; Mills et al, 2006; Wetterbom et al, 2006; Fan et al, 2007; Khan et al, 2015; Lin et al, 2017). As coding sequences vary from one alternative splicing variant to another, however, different sets of coding sequences result in different indel identifications. Another method to select indels in proteins is to compare the amino acids sequences of orthologous proteins either by sequence alignments (Benner et al, 1993; Light et al, 2013a, 2013b) or by structural alignments (Pascarella & Argos, 1992; Qian & Goldstein, 2001). Since indels thereby identified are dependent on the selection of orthologous proteins, splicing variants need to be considered, and structural alignments cannot be made to stretches containing IDRs, which is a relevant issue because, as stated above, indels frequently encode IDRs (Light et al, 2013a, 2013b; Khan et al, 2015). As the increasing availability of variant sequences makes it possible to identify indels shared by all variants, we conducted sequence alignments of all available variants, selected commonly observed indels, and called them “fixed” indels.
Fixed indels can result from constitutive changes in splicing and genomic mutations. Changes in splicing that give rise to insertions can be regarded as intron to exon conversions (exonizations), whereas those resulting in deletions can be considered as exon to intron conversions (intronizations). Indels by genomic mutations have three plausible causes: DNA slippage, DNA damage followed by imperfect repair mostly by means of homologous repair pathways (Redelings et al, 2024), and transposable elements, which constitute ∼4% of human protein-coding genes (Nekrutenko & Li, 2001). DNA slippage frequently occurs in short tandem repeats that in most cases encode IDRs (Levinson & Gutman, 1987; Simon & Hancock, 2009). Proteins in all kingdoms of life have tandem repeats, but prokaryotic proteins tend to have tandem repeats less often than eukaryotic proteins do (Delucchi et al, 2020).
Since splicing of exons longer than 300 nt is inhibited unless flanked by short introns (Robberson et al, 1990; Chen & Chasin, 1994; Sterner et al, 1996), many exons that became excessively long may not be efficiently spliced. On the other hand, analyses of RNA-Seq data demonstrated that introns shorter than 70 nt are not completely spliced out (Abebrese et al, 2017). Exon definition was proposed to explain the splicing mechanism in vertebrates in which short exons separated by long (>250 bp) introns predominate (Robberson et al, 1990; Berget, 1995; De Conti et al, 2013), whereas intron definition was postulated to account for the splicing of long exons with short introns prevalent in lower eukaryotes (Lang & Spritz, 1983; Berget, 1995; De Conti et al, 2013). Recent research disclosed that the same spliceosome assembled on introns provides the mechanisms for both intron and exon definition in Saccharomyces cerevisiae (Li et al, 2019).
Here, we identified fixed indels that are uniformly present in all variants and attempted to determine their generation mechanisms and found a considerable number of indels generated by alteration in the number of tandem repeats, which frequently encode IDRs. Our analyses also revealed that longer internal exons tend to experience more indels and generally have a higher prevalence of tandem repeats, suggesting that long internal exons were generated chiefly by repeat expansions. We propose that the expansion of tandem repeats encoding IDRs was a crucial evolutionary mechanism by which the LECA had arisen from the FECA.
Results
Selection of fixed indels in internal exons
Exploiting a wealth of sequences in the Ensembl database (Martin et al, 2023), we first aligned all variants of orthologous genes in closely related sp. 1 and sp. 2 and selected sections missing in sp. 2 as indel candidates (Fig 1A and B). We then carried out BLASTN alignments (Altschul et al, 1990) of all variants of orthologous genes in spp. 1 and 2 and an outgroup to classify them into insertion and deletion candidates; those present in an outgroup variant were considered as insertion candidates because the segments were probably absent before the bifurcation of sp. 1 and sp. 2 (Fig 1C), whereas those missing in an outgroup variant were regarded as deletion candidates as the segments were likely to be present in the immediate ancestor of sp. 1 and sp. 2 (Fig 1D). To filter out candidates that correspond to splicing variants, we applied uniformity tests, i.e., we selected the insertion candidates whose segments are present in all sp. 1 variants and are unexceptionally absent in sp. 2 and outgroup variants and regarded them as fixed insertions (Fig 1C); we likewise chose the deletion candidates whose segments are invariably absent in sp. 2, but omnipresent in both sp. 1 and outgroup variants and considered them as fixed deletions (Fig 1D). Finally, only those in internal exons are selected and are called fixed indels. We present the actual steps followed to select fixed indels in internal exons in Fig S1 and the number of cases in Table S1 and Fig S2A–H (for explanations of step 5, see below).
Figure 1. Selection of fixed indels.
(A, C) The criteria for selecting fixed insertions. The rectangles represent sections of coding sequences and are arranged so that aligned segments match vertically. (B, D) The criteria for selecting fixed deletions. They are shown as in (A, C).
Figure S1. Selection steps of fixed indels.
The average numbers of cases are shown in circles.
Figure S2. Numbers of cases in fixed indels.
(A, B, C, D, E, F, G, H) The numbers of indels at each step in each species are plotted in log scale with the numbers displayed above bars.
Table S1. Number of cases at each selection step. (54.2KB, xlsx)
Insertions are probably generated either by intron to exon conversion (exonization) (Fig 2A–H) or by genomic modifications, the latter of which may be caused by expansion of tandem repeats (Fig 2I), homologous recombination (Fig 2J), or insertion of exogenous segments (Fig 2K). Inserted segments that are presumably generated by exonization but correspond to introns with lengths less than 70 nt (Fig 2B and F) were discarded (step 5 in Fig S1) because such introns may be imperfectly removed (Abebrese et al, 2017), making them unqualified as “fixed” insertions. Similarly, the three generating mechanisms of deletions are exon to intron conversion (intronization) (Fig 2L–S), contraction of tandem repeats (Fig 2T), homologous recombination (Fig 2U), and genomic deletion (Fig 2V). Deleted segments that corresponded to short (<70 nt) introns (Fig 2M and Q) were removed for the aforementioned reason (in the following, we additionally carried out analyses of the permissive sets of fixed indels to demonstrate that the removal does not affect conclusions). Insertions with corresponding introns in sp. 2 possessing no significant homology were considered as possibly generated by exonization (Fig 2A, C, and D), whereas those alignable to intron segments in sp. 2 were classified as probably generated by exonization (Fig 2E, G, and H). As introns generally have a higher mutation rate than exons do (Li, 1997), exons created by exonization may no longer possess detectable sequence homology with the introns. Likewise, we consider deletions as possibly generated by intronization if the corresponding intron segment in sp. 2 has no discernable similarity to the exons in sp. 1 (Fig 2L, N, and O), whereas sorting those as probably generated by intronization if the intron segments were alignable to the exons (Fig 2P, R, and S). We distinguish fixed insertions of entire exon(s) (Fig 2C and G) from those within exons (Fig 2A and C) and regard fixed deletions of entire exon(s) (Fig 2N and R) separately from those within exons (Fig 2L and P). The total number of indels within internal exons as well as their breakdown by indel location within exons are shown in Table S2, Fig S3A–J, and Table S3, the last of which lists numbers without removing those corresponding to short (<70 nt) introns (designated the “permissive” sets of fixed indels). Fixed indels of entire exon(s) are rare and almost all the fixed indels in internal exons are in the middle of rather than at 5′ and 3′ ends of exons.
Figure 2. Generation mechanisms of fixed indels.
(A, B, C, D, E, F, G, H, I, J, K) Presumed generating mechanisms of fixed insertions. Gray rectangles and horizontal bars stand for exons and introns, respectively, whereas parallel vertical bars signify tandem repeats. (L, M, N, O, P, Q, R, S, T, U, V) Possible generating mechanisms of fixed deletions. They are presented as in (A, B, C, D, E, F, G, H, I, J, K).
Figure S3. Most fixed indels in internal exons are located in the middle of exons.
(A) Average fractional distribution of fixed indels. The data presented are the ratios of the sum of residues in each category to the overall total. (B) Average fractional distribution of the permissive sets of fixed indels. (A) The data were calculated as in (A). (C, D, E, F, G, H, I, J) Distribution of fixed indels in each species. (A, B, C, D, E, F, G, H, I, J) The total numbers of inserted and deleted nucleotides are shown below each graph.
Table S2. Numbers of cases and residues of fixed indels classified by location with respect to internal exons. (48.9KB, xlsx)
Six actual examples of fixed indels are provided (Fig 3A–F) and two examples of indels that were removed due to their correspondence to excessively short (<70 nt) introns are shown (Fig S4A and B). We selected these solely based on easy visualization of indels. The human protein in Fig 3A is a protein enabled homolog involved in a range of processes dependent on cytoskeleton remodeling and cell polarity, the Drosophila melanogaster protein in Fig 3B is E2F transcription factor, isoform D, whereas the rat protein presented as Fig 3C is nuclear factor erythroid 2-related factor 2, which is a transcription factor that plays a key role in the response to oxidative stress. The chimpanzee protein depicted in Fig 3D is a trinucleotide repeat-containing protein that plays a role in RNA-mediated gene silencing, the human protein shown in Fig 3E is GRIP and coiled-coil domain-containing protein isoform 2, whereas the human protein in Fig 3F is SMG1 phosphatidylinositol 3-kinase-related kinase partaking in both mRNA surveillance and genotoxic stress response pathways.
Figure 3. Actual examples of fixed indels.
(A, B, C) Examples of fixed insertions. They are depicted as in Fig 2 in the scale indicated with tandem repeats represented by parallel vertical bars. The gene IDs are shown in italics, whereas the encoded proteins are drawn as orange rectangles in arbitrary scales with horizontal parallel bars representing intrinsically disordered regions and gray rectangles corresponding to the depicted genomic sections. (D, E, F) Examples of fixed deletions. They are drawn as in (A, B, C).
Figure S4. Actual indel cases discarded because of their correspondence to excessively short introns.
(A) A case involving insertion. (B) A deletion case. Legend as in Fig 3.
Generation mechanisms of fixed indels
The residue-wise distributions of generation mechanisms of fixed indels are graphically presented (Fig 4) with the indels whose generating mechanisms remain unidentified labeled “unknown.” No indel cases involving transposable elements were detected even if the expectation value of the BLASTN search was relaxed to 10 and no insertion cases whose generation mechanisms are unknown were long enough to be assessed for involvement of other exogenous elements. The corresponding figure of the permissive sets of fixed indels is presented as Fig S5. The numbers of fixed indel cases and those of residues in insertion and deletion are tabulated (Table S4), whereas the corresponding numbers of the permissive sets of indels are presented as Table S5. Although interspecies variations are considerable, on average 50.9% of inserted and 49.8% of deleted residues were attributable to alterations in tandem repeats and the generation mechanisms of a substantial number of fixed indels remain unidentified (see the Discussion section). Fixed indels generated by intronization and exonization exist but have small shares except in human insertions, chimpanzee deletions, and rat, O. latipes, and O. melastigma indels whereas those attributable to homologous recombination are universally rare.
Figure 4. Generating mechanisms of fixed indels.
The numbers of nucleotides in indels are shown at the bottom.
Source data are available for this figure.
Figure S5. Distribution of generating mechanisms of the permissive sets of fixed indels.
The number of nucleotides in each case is shown at the bottom.
Table S4. Numbers of cases and residues in fixed indels classified by generating mechanism. (47.9KB, xlsx)
Long internal exons tend to have a high incidence of indels
As we previously found that long internal exons tend to encode IDRs (Fukuchi et al, 2023), we searched for generation mechanisms of long internal exons and thus determined the dependence of indels on exon length. We first calculated the abundance of internal exons in 60 nt length bins (Fig S6A–I) and found that, whereas long internal exons are rare in the four mammals and the two fishes, the two Drosophila species showed length distributions shifted rightward, confirming a previous report (Hawkins, 1988). We then examined the frequencies of indels in each length bin of internal exons, using the length of the corresponding sp. 1 exons for the analyses of insertion in sp. 1 and deletion in sp. 2. The length of the internal exon with an insertion used was that of the exon minus the insertion length because that is the presumed exon length before insertion. In the example presented as Fig 3C in which a 21 nt insertion exists in rat exon 4 of 189 nt, we used 168 nt as the length of the primordial exon in which the insertion occurred (although 168 nt happens to be the length of the corresponding sp. 2 exon in this case, the lengths sometimes disagree as some sp. 2 exons have been altered). The indels of entire exon(s) (Fig 2C, G, N, and R), which constitute small fractions (Fig S3, Tables S2 and S3), were not included in this and subsequent analyses because our aim was to analyze incremental length changes in internal exons. Interestingly, we found higher frequencies of deletions in long internal exons without exception and the insertion frequencies of five out of eight species were positively correlated with internal exon length (Figs 5B–I and S7B–I). The three species that did not show a large positive correlation in insertion frequency were chimpanzee, Oryzias laptipes, Oryzias melatigma, and O simulans. As the smallness of samples may account for the lack of correlation, we calculated the indel frequencies in three length ranges of internal exons, namely S (1–180 nt), M (181–360 nt), and L (361–540 nt) and compared them. All but the insertion frequencies of O. laptipes and O. melatigma showed a statistically significant increase in L compared with M (Figs S8B–I and S9B–I). Thus, indel frequencies generally increase with exon length, except for the two fish species in which insertion frequencies do not appreciably vary with exon length. In addition, overall deletion frequency is higher than insertion frequency in all the species examined (statistically significant at P value < 10−3, chi-square test). That deletion generally exceeds insertion agrees with previous reports (de Jong & Rydén, 1981; Redelings et al, 2024).
Figure S6. Length distribution of all internal exons.
(A) The arithmetic average of the frequencies of the eight species in each length bin is presented as a bar. Also displayed is the total number of nucleotides. (B, C, D, E, F, G, H, I) Rectangles represent the frequencies of internal exons in the specified length ranges whereas the numbers are the total numbers of nucleotides in the corresponding species.
Figure 5. Long internal exons tend to have a high incidence of indels in each species.
(A) Normalized averages of indels. The insertion (left scale) and deletion (right scale) frequencies are shown with the total numbers of indel residues and the correlation coefficients. One, two, and three asterisks signify that the correlation coefficient is significantly different from zero at P value < 10−2, 10−3, and 10−4, respectively. The dotted lines are regression lines for insertion (yellow) and deletion (blue) frequencies. The length ranges (S, M, and L) used in Figs S8 and S9 are shown as the rectangles. (B, C, D, E, F, G, H, I) Indel frequencies in each species. (A) The total numbers of nucleotides in indels, the correlation coefficients, and statistical significance are displayed as in (A).
Source data are available for this figure.
Figure S7. Dependence of the frequency of the permissive sets of fixed indels on the length of internal exons.
(A) Normalized averages of indels. (B, C, D, E, F, G, H, I) Indel frequencies in each species. Legend as in Fig 5.
Figure S8. The frequencies of fixed indels in three length ranges of internal exons.
The overall indel frequency of each range is the ratio of the sum of indel residues and the total number of residues. The differences between indel frequencies were assessed by chi-squared test. Those with statistically significant differences at P value < 10−2, 10−3, and 10−4 are represented by one, two, and three asterisks, respectively with those in black and red respectively signifying an increase and decrease in the longer range. (A) The average indel frequencies of the eight species with the total numbers of nucleotides in indels shown beneath the graph. (B, C, D, E, F, G, H, I) The indel frequencies of the species with the total numbers of inserted and deleted residues added at the bottom.
Figure S9. The frequencies of the permissive sets of fixed indels in three length ranges of internal exons.
(A) The average indel frequencies of the eight species with the total numbers of nucleotides in indels shown beneath the graph. (B, C, D, E, F, G, H, I) The indel frequencies of the species with the total numbers of inserted and deleted residues added at the bottom. Legend as in Fig S8.
We also calculated the average indel frequency of the eight species in each internal exon length bin. The combined dependence of indel frequency on internal exon length (Figs 5A and S7A) shows that positive correlations exist although the correlation of insertion is weaker than that of deletion. The analysis in the three exon length ranges supports the observation (Figs S8A and S9A). Additionally, deletion frequency is on average higher than insertion frequency with the difference more pronounced in long internal exons: the overall ratio of deletion frequency to insertion frequency is 2.72 in range S (1–180 nt), increases to 4.07 in range M (181–360 nt), but remains at 4.07 in L (361–540 nt).
Most fixed indels encode intrinsically disordered regions, conserve reading frames, and are short
As indels reportedly occur preferentially in IDRs (Light et al, 2013a, 2013b; Khan et al, 2015), we calculated the fractions of IDRs in the fixed indels. For deleted segments in sp. 2, we used the corresponding segments in sp. 1 (Fig 1D) as a proxy. In agreement with the literature, most indels encode IDRs and the fractions are all much higher than those of all coding exons (Figs 6A, S10A and B, and S11A and B). We also examined indel lengths and verified reports (Mills et al, 2006; Wetterbom et al, 2006; Khan et al, 2015) that those of nearly all indels in coding sequences are multiples of 3 nt (Figs 6B, S12, and S13). Frame-preserving insertions except for those containing stop codons merely add residues to the encoding proteins, whereas frame-preserving deletions excluding those containing start or stop codons simply remove residues. Since most indels encode IDRs, alterations in the number of amino acids caused by fixed indels are likely to be tolerated. The length distributions (Figs 6C, S14A–I, and S15A–I) demonstrate that shorter indels predominate, in agreement with reported findings (Benner et al, 1993; Qian & Goldstein, 2001; Khan et al, 2015) and almost all fixed indels are shorter than 19 nt.
Figure 6. Most indels in internal exons encode intrinsically disordered regions (IDRs), are multiples of 3 nt in length, and are short, and amino acid compositions of indels resemble those of IDRs.
The bars and line graphs represent arithmetic averages of the eight species. (A) Fractions of IDRs. The bars represent the fractions of IDRs encoded by all internal exons, indels, tandem repeats in indels, and indels whose generation mechanisms are unknown. The total numbers of nucleotides in indels are displayed, too. (B) Phase distributions of indel lengths. The arithmetic averages of case-wise phase distributions of the eight species are graphed and the total case numbers are shown. (C) Length distributions of indels. The averages of all indels, tandem repeats in indels, and indels of unknown generation mechanisms were calculated as in (B) and presented together with the total case numbers of indels. (D) Average frequencies of amino acids and fold changes. The average frequency of each amino acid encoded by all internal exons (“All”), that of fixed indels, and that of IDRs are represented by rectangles (left scale). Each datum is the arithmetic average of the frequencies of the eight species. The line graphs (right scale) are the fold changes in frequency relative to that of all internal exons. Also shown are the numbers of amino acid residues encoded by all internal exons, indels, and in IDRs encoded by internal exons.
Source data are available for this figure.
Figure S10. Indels mostly encode intrinsically disordered regions (IDRs) in all species.
(A) The bars represent the fractions of IDRs in all internal exons, inserted segments, inserted segments generated by tandem repeats, and inserted segments whose generation mechanisms are unknown. The numbers are those of inserted residues. (B) The rectangles stand for the fractions of IDRs in all internal exons, deleted segments, deleted segments generated by tandem repeats, and deleted segments whose generation mechanisms are unknown. The numbers are those of deleted residues.
Figure S11. Fixed indels of the permissive sets mostly encode intrinsically disordered regions (IDRs).
(A) The bars represent the fractions of IDRs in all internal exons, inserted segments, inserted segments generated by tandem repeats, and inserted segments whose generation mechanisms are unknown. The numbers are those of inserted residues. (B) The rectangles stand for the fractions of IDRs in all internal exons, deleted segments, deleted segments generated by tandem repeats, and deleted segments whose generation mechanisms are unknown. The numbers are those of deleted residues.
Figure S12. Nearly all fixed indels preserve the reading frame.
The phase distributions of indels in each species are shown as in Fig 6B.
Figure S13. Almost all fixed indels of the permissive sets preserve the reading frame.
The phase distributions of indels in each species are shown as in Fig 6B.
Figure S14. Length distributions of fixed indels in each species.
(A, B, C, D, E, F, G, H) The length distributions of indel cases together with the total numbers of indel cases.
Figure S15. Length distributions of the permissive sets of fixed indels.
(A) The average length distributions of indel cases together with the total numbers of indel cases. (B, C, D, E, F, G, H, I) The length distributions of indel cases together with the total numbers of indel cases.
Amino acid compositions of indels resemble those of IDRs
As most indels encode IDRs and IDRs have a characteristic amino acid composition (Dunker et al, 2008), we calculated the amino acid compositions of indels (Figs 6D and S16). Indels have amino acid compositions similar to those of IDRs; in indels order-promoting residues are depleted, whereas disorder-promoting residues appear more frequently than the average. For the data presented, the correlation coefficient between log fold changes of insertion and those of IDR is 0.797 (statistically significant at P value < 10−4), whereas the correlation coefficient between log fold changes of deletion and those of IDR is 0.876 (statistically significant at P value < 10−5).
Figure S16. Amino acid compositions of the permissive sets of fixed indels resemble those of intrinsically disordered regions (IDRs).
The average frequency of each amino acid encoded by all internal exons (“All”), that of fixed indels, and that of IDRs are represented by rectangles (left scale). Each datum is the arithmetic average of the frequencies of the eight species. The line graphs (right scale) are the fold changes in frequency relative to that of all internal exons. Also shown are the numbers of amino acid residues encoded by all internal exons, indels, and in IDRs encoded by internal exons.
Long internal exons tend to have tandem repeats
What accounts for the higher incidences of indels in longer internal exons? We found no grounds to suppose that intronization, exonization, and homologous recombination occur more frequently in longer internal exons. Since tandem repeats account for a large fraction of indels (Fig 4) and longer internal exons generally experience more indels (Fig 5A–I), we thought it possible that tandem repeats are more prevalent in long internal exons. We thus calculated the fraction of each internal exon occupied by tandem repeats and investigated its dependence on internal exon length. The results (Fig 7A–I) verified the conjecture; the longer internal exons are, the higher the repeat frequency with all the correlation coefficients statistically significant at P value < 10−2 and the conclusion remains unchanged in the longer length range (1–840 nt instead of 1–540 nt) (Fig S17A–I).
Figure 7. Long internal exons tend to have tandem repeats as well as intrinsically disordered regions (IDRs).
(A) Arithmetic averages, regression lines, and correlation coefficients. The fractions of tandem repeats, IDRs, tandem repeats in IDRs, and others in IDRs are shown with the correlation coefficients of the fractions with exon length. Asterisks signify statistical significances as in Fig 5 legend. The dotted lines represent regression lines, whereas the sample number represents the sum of nucleotides in internal exons. (B, C, D, E, F, G, H, I) Frequencies and correlation coefficients in each species. The data are presented as in (A) without regression lines. The sample number below each species name is the total number of nucleotides in internal exons.
Source data are available for this figure.
Figure S17. Long internal exons tend to have a high fraction of repeats as well as intrinsically disordered regions.
(A) Arithmetic averages, regression lines, and correlation coefficients. (B, C, D, E, F, G, H, I) Frequencies and correlation coefficients in each species. Legend as in Fig 7.
To examine how much of the higher prevalence of IDRs in longer internal exons is explained by tandem repeats, we calculated the fractions of IDRs, tandem repeats within IDRs, and others within IDRs (i.e., IDRs that are not in tandem repeat segments) in each length bin in the eight species in the 1–540 nt (Fig 7A–I) and 1–840 nt length ranges (Fig S17A–I). Note that the previously reported dependence of the fraction of IDRs on internal exon length (Fukuchi et al, 2023) was more pronounced in the longer exon length range. Since the IDRs are divided into tandem repeats (within IDRs) and others, the sum of the fractions of the two constituents is equal to that of IDRs and, consequently, the sum of the slopes of the regression lines of the two is identical to that of IDRs. If the positive correlation of IDRs is entirely accounted for by tandem repeats, the slope of the regression line of tandem repeats within IDRs will be the same as that of IDRs, whereas the slope of others will be zero, that is, the fraction of others will be constant irrespective of intron length. The observed slope of the regression line of the fraction of tandem repeats in IDRs was 0.890 per cent/unit, where 1 unit equals 60 nt, and that of others in IDRs was 1.457, whereas that of the corresponding slope of the fraction of IDRs was 2.348 (Fig 7A). Thus, only 37.9% of the dependence of IDRs on internal exon length is contributed by tandem repeats. As tandem repeats constitute a mere 29.6% of IDRs on average, however, they make a disproportionate contribution to IDRs. On the other hand, in the longer range of exon length (Fig S17A), the contribution ratio of tandem repeats in IDRs was 34.0%, slightly lower than the figure calculated in the shorter exon length range (37.9%).
Gene ontology (GO) analyses of genes with indels
What functions do genes with indels frequently have? Using SwissProt annotations, we analyzed the frequency of GO appearance in human and mouse genes with or without indels. We tabulated GO numbers that are assigned significantly more often to both human and mouse genes with indels compared with those without indels (Table S6). The genes with insertion are enriched with many functions in the nucleus such as acetyl transferase activity and RNA uridylyltransferase activity, and those with deletions also have a high prevalence of nucleus-related functions including histone kinase activity, the Las1 complex, and RNA cap trimethylguanosine synthase activity (nuclear-related GO terms are rubricated in the table).
Involvement of indels in human diseases
Are there any indels that occur in sites related to human diseases? We consulted SwissProt annotations to examine if any of the inserted residues and the bordering residues of deletions are in known disease-related sites. We found 6 and 12 insertions and deletions, respectively, that fall on disease-related sites (Table S7). Thus, a considerable number of indels occur in sensitive locations whose mutations cause human diseases. As indels predominantly encode IDRs (Fig 6A), we determined if the disease-related indel segments encode IDRs and found that 12 out of the 18 segments are within IDRs (rubricated in Table S7).
Table S7. Human indels that fall on disease-related sites. (42.9KB, xlsx)
The frequency of indels correlates with that of tandem repeats
Since both the frequency of indels and that of tandem repeats mostly show a positive correlation with internal exon length, the two frequencies are likely to be correlated. The correlations between the normalized insertion/deletion frequency and the normalized repeat frequency are shown by scatter plots (Fig 8A). The correlation coefficient between insertion and repeat frequencies is 0.733 (statistically significant at P value < 0.05), whereas that between deletion and repeat frequencies is 0.965 (statistically significant at P value < 10−5). Similar correlations are observed in the permissive sets of indels (Fig S18). In confirmation of the expectation, the indel frequencies are significantly correlated with repeat frequency, with the deletion frequency more strongly so.
Figure 8. Indel frequencies are correlated with tandem repeat frequency and a proposed model.
(A) Correlations of repeat frequency and indel frequencies. The panel represents scatter plots of the normalized repeat frequency of each length bin and the corresponding normalized insertion (yellow dots) and deletion (blue dots) frequencies together with the correlation coefficients. Dotted lines (yellow: insertion, blue: deletion) are regression lines. The correlation coefficients and the total numbers of nucleotides in indels are also shown. (B, C, D) Proposed evolutionary steps that had led to the last eukaryotic common ancestor that have long internal exons that frequently encode intrinsically disordered regions. (E) Ongoing process.
Source data are available for this figure.
Figure S18. Indel frequencies of the permissive sets and repeat frequency are correlated.
Legend as in Fig 8A.
Discussion
We unexpectedly discovered that fixed indels occur preferentially in longer internal exons and large fractions of indels were generated by alterations in tandem repeats. We also found a high prevalence of tandem repeats in long internal exons, which at least partially accounts for the previously observed tendency of long internal exons to encode IDRs (Fukuchi et al, 2023) as tandem repeats mostly encode IDRs (Simon & Hancock, 2009). In view of the findings of this research, we propose that long internal exons had resulted from primordial short exons (Fig 8C) that had evolved from the FECA (Fig 8B) mainly by expansion of tandem repeats that added IDRs to the encoded proteins, which constitutes a step toward the emergence of the LECA (Fig 8D). This model explains how eukaryotic proteins acquired IDRs, especially in sections encoded by long internal exons, whereas only a small fraction of prokaryotic proteins consists of IDRs. This step also accounts for the increased frequency of tandem repeats in eukaryotic proteins compared with prokaryotic proteins. Moreover, we suggest that, after the LECA, long internal exons have been subject to frequent contractions and, to a lesser extent, expansions due to alterations in tandem repeats (Fig 8E). As nearly all fixed indels are nonframeshifting (Fig 6B), indels generated by tandem repeats mostly result in tandem repeats in proteins. Since protein tandem repeats play crucial roles in ligand binding and transcriptional regulation (Kobe & Kajava, 2001; Yagi & Nakamura, 2014), indels generated by tandem repeats may have important biological significance. Tandem repeats are rare in prokaryotes possibly because many IDRs do not play beneficial roles and thus are frequently not tolerated. The tolerability of tandem repeats in eukaryotes may be related to the splicing efficiency of long internal exons; tandem repeats that result in long internal exons are not tolerated in organisms that cannot splice long exons efficiently, and vice versa.
The two fish species are exceptional in that the insertion frequency does not show dependency on internal exon length (Figs 5F and G, S7F and G, S8F and G, and S9F and G). We note that the insertion frequencies in these species are much smaller than the deletion frequencies. The scarcity of inserted residues in the two species might account for the finding that is inconsistent with that of the remaining six species. Further research with other species is needed to clarify this issue.
The reported number, 262, of human indels in all coding exons (Mills et al, 2006) is the sum of human insertions, chimpanzee deletions (the two constitute apparent human insertions), human deletions, and chimpanzee insertions (the latter two account for apparent human deletions). The corresponding number in this investigation is thus the total of insertions and deletions in humans and chimpanzees in all exons and is 1,502 (permissive sets, step 5 in Table S2). The increase in number is attributable not only to revised sequence data we had the good fortune to use, but also to the methodology that identifies changed alternative splicing in addition to genomic alterations in constitutively spliced exons.
Although the frequencies of the detected fixed indels are small (less than 0.06%), they may well be an underestimate as our methods of selecting fixed indels probably capture only a fraction of actual fixed indels. This is both because transcripts with corresponding segments must be present in all three species in our identification method and because BLASTN alignments can be made only for highly conserved nucleotide sequences. Additionally, the outgroup requirement means that the capture rate of fixed indels varies from one species group from another and thus makes it inappropriate to compare indel frequencies of different groups; we for instance cannot meaningfully compare the indel frequencies of human and mouse. Moreover, the availability of more variant data may disqualify some fixed indels due to the presence of counterexamples in the newly identified variants. Since the methodology has no bias on exon length, however, the finding that higher indel frequency is observed in longer internal exons is unaffected by these shortcomings.
We consider the fractions of fixed indels generated by tandem repeats in exons represent an underestimate; slight sequence changes after repeat expansion and contraction make tandem repeats imperfect, making them undetectable by the repeat detection program. It is thus conceivable that indels generated by unknown mechanisms contain those attributable to tandem repeats, although they may well include some cases generated by inverted and mirror repeats (Burssed et al, 2022), which are undetectable by the program we used. In fact, the fractions of IDRs in indels generated by alterations in tandem repeats are comparable to those produced by unknown mechanisms (Fig 6A), their length distributions are nearly identical (Fig 6C), and amino acid compositions of the two groups are similar (Fig S19A and B). It is thus plausible that many indels whose generation mechanisms remain unidentified were really generated by alterations in tandem repeats. An underestimation of tandem repeats can explain why the tandem repeats explain only 38% (in the internal exon length range 1–540 nt, Fig 7A) or even less in the longer (1–840 nt, Fig S17A) exon length range of the dependence of IDRs on the length of internal exons; unidentified tandem repeats may account for some of the remaining dependence.
Figure S19. Amino acid compositions of all internal exons, those generated by repeats, and those whose generation mechanisms are unknown.
(A) The amino acid compositions of all internal exons, tandem repeats in inserted segments, and inserted segments generated by unknown mechanisms. (B) The amino acid compositions of all internal exons, tandem repeats in deleted segments, and deleted segments whose generation mechanisms are unknown. Legend as in Fig 6D.
Notwithstanding all the emphasis on tandem repeats, we would like to reiterate that intronization and exonization also generate deletions and insertions, respectively, and they account for considerable fractions of insertions in Homo sapiens, deletions in chimpanzee, and indels in rat and the two fishes (Fig 4). The fact that many indels were “probably” generated by intronization and exonization makes it quite likely that they really account for some indels. The wide variation in the fractions of intronization and exonization is largely explainable by disparity in location distribution of indels within internal exons; most insertions at 5′ and 3′ ends of exons and those involving entire exons were judged to be generated by exonization (87.8, 85.1, and 74.9%, respectively), and nearly all deletions at 5′ and 3′ ends of exons and those of entire exon(s) were attributable to intronization (93.4, 92.1, and 97.8%, respectively). By contrast, exonization and intronization were seldom assigned as the generation mechanisms of indels within exons (0.7% and 2.0%, respectively). The insertions in human, the deletions in chimpanzee, and the indels in rat and the two fishes all have high total proportions of those at the ends of exon and those involving entire exons (Fig S3C, D, F, G, and H). Consistently, these have comparatively high fractions of cases generated by intronization/exonization. Indel analyses of other species are needed to judge whether intronization and exonization generally generate less indels than tandem repeats do.
The number of insertion cases attributable to homologous recombination is much smaller than that generated by tandem repeats (Fig 4 and Table S4). This observation indicates that replication slippage rather than homologous recombination is the dominant mechanism of tandem repeat expansion.
The total absence of transposable elements in the selected indels was unexpected. Since we carried out BLASTN alignments of the entire variants with indels, transposable elements must have been detected in indels if the sequences are sufficiently similar. Possibly transposable elements that gave rise to indels had mutated considerably to escape detection by BLASTN. Since most exons containing transposable elements are alternatively spliced (Sorek et al, 2002; Lin et al, 2008), another plausible explanation is that we have filtered out those corresponding to transposable elements as we selected for “fixed” indels commonly shared by all splicing variants in our methodology.
What accounts for the general preponderance of deletion over insertion? We consider it possible that insertion is evolutionarily disfavored especially in long internal exons because splicing of many excessively long exons (>300 nt) is inhibited (Robberson et al, 1990; Chen & Chasin, 1994; Sterner et al, 1996). This interpretation dovetails with the observed tendencies that insertion frequency of internal exons initially rises but levels off at around 360 nt, whereas deletion frequency shows a monotonically increasing trend (Fig 5A). However, we excluded fixed insertions that create entire exons (Fig 2C and G) and fixed deletions that eliminate whole exons (Fig 2N and R) from our analyses of exon length dependence. Neither can our analysis capture cases of exon fusion and fission that do not affect coding sequences because our selection pipeline only detects cases whose coding sequences are affected. Although analysis of exon evolution including these neglected cases is expected to elucidate exon evolution dynamics in its entirety, it is beyond the scope of the current study.
The gene ontology analysis revealed that as many as half of the functions significantly enriched in the genes with indels are associated with the nucleus. Since most indels encode IDRs (Fig 6A) and IDRs are especially prevalent in transcription factors and other nuclear proteins (Anbo et al, 2019), many genes with indels probably encode proteins that function in the nucleus. What accounts for the finding that as many as 18 human indels correspond to disease-related sites? Considering the enrichment of IDRs in disease-related proteins (Anbo et al, 2019) and the fact that two-thirds of the disease-related indel segments encode IDRs, we consider it likely that the high coincidence of indels with disease-related sites reflects frequent involvement of IDRs in diseases.
The existence of a positive correlation between tandem repeat frequency and internal exon length was an intriguing finding. Since four other eukaryotes, Oryza sativa, Arabidopsis thaliana, Caenorhabditis elegans, and Schizosaccharomyces pombe were shown to have a positive correlation of the fraction of IDRs and internal exon length (Fukuchi et al, 2023), we checked if repeat frequency and internal exon length are correlated in the 1–840 nt range in these model eukaryotes, too. We found the first three model eukaryotes exhibit statistically significant (P value < 10−2) positive correlations, but S. pombe does not (Fig S20). The case of S. pombe is apparently inconsistent with the notion that long internal exons were produced mainly by the expansion of tandem repeats. However, as stated above, the fraction of tandem repeats is likely to be underestimated, and the underestimation can explain the lack of correlation in the yeast species.
Figure S20. Long internal exons tend to have a high fraction of tandem repeats in O. sativa, A. thaliana, and C. elegans, but not in S. pombe.
The fractions of tandem repeats are plotted with the total numbers of nucleotides, and the correlation coefficients and statistical significance are shown as in Fig 5.
As far as we are aware, this represents the first systematic identification of fixed indels including those generated by constitutive alterations in alternative splicing and examination of the dependence of their frequencies on internal exon length. As our observations are limited to indels in eight species only, however, the generality of our findings needs to be tested by indels in other species. Unfortunately, it is not easy to apply the current methodology to many other species because of the current selection method requires the presence of a trio of closely related species with completely sequenced genomes and an abundance of sequenced variants. For instance, the application of our methods to C. elegans, Caenorhabditis briggsae with Caenorhabditis japonica as the outgroup identified less than 70 cases of fixed insertions in C. elegans, a number judged too small for further analyses. Hopefully, widespread availability of variant and genome sequences will make extensive testing possible.
Materials and Methods
Data sets
For the selection of H. sapiens and Pan troglodytes (chimpanzee) indels, we chose Gorilla gorilla as the outgroup. As the outgroup of Mus musculus (mouse) and Rattus norvegicus (rat), we used Peromyscus maniculatus, whereas Nothobranchius furzeri (turquoise killifish) served as the outgroup of Oryza latipes (Japanese medaka) and Oryza melastigma (Indian medaka). As the outgroup of D. melanogaster and Drosophila simulans, we chose Drosophila yakuba. We downloaded all the sequence and exon data, and the lists of orthologous genes except for that between D. melanogaster and D. simulans and the three fish species from the Ensembl database (Martin et al, 2023). Since lists of orthologous genes between the two Drosophila species and the fish species were unavailable from the Ensemble database, we selected pairs of orthologous genes by mutual best hit of BLASTN alignments of the longest variants. The list of transposable elements used was in release 19 of the TREP Transposable Element Platform (Wicker et al, 2022).
Selection of fixed indels and their classification by location
The actual selection steps followed are schematically shown (Fig S1). The first two steps are the same for the selection of insertions and deletions. In step 1, we first got the DNA sequences of all transcript pairs corresponding to all the orthologous genes of sp. 1 and sp. 2. We then performed BLASTN alignments (Altschul et al, 1990) of the pairs with the default parameters except that we set the gap-opening penalty at 10 and changed the expectation value to 10−3, unless otherwise stated. We identified indel candidates in BLASTN alignments with special considerations for sections of multiple alignments; we regarded two aligned sections continuous if the last query residue of one section coincides with the first query residue of the other within 3 nt (The 3 nt allowance was made to cope with the uncertainty in alignments. The specific number was chosen by inspection of a number of alignments involving indels). We judged indel sections identical if their genomic addresses agreed within 3 nt. In the uniformity tests, we disregarded variants that do not have sections covering indels. An indel was regarded as coinciding with the 5′ or 3′ end of an exon of sp. 1 if the start or the end of the segment coincides with the start or the end of the exon within 3 nt. We judged an indel without coincidence with exon border(s) to be located inside an exon. After identifying indel candidates, we assigned the corresponding genome addresses to all the sp. 1 variants with indel candidates and made the indel candidates nonredundant based on the genome addresses. Two candidates were considered identical if either the beginning or the end genome address of one insertion candidate matches the start or the end of another within 3 nt.
In step 2, we checked all the alignments made in step 1 containing indel candidates and left only those with 0% coverage in all sp. 2 variants.
In step 3, we first collected the DNA sequences of all the variants of the sp.1 genes with indel candidates and those of the outgroup variants of the orthologous genes then ran BLASTN alignments with the same parameters as in step 1. To select insertion candidates, we then analyzed the alignments and left only the indel candidates whose coverage of the corresponding sections of the outgroup variants was 0% without exception to get insertion candidates. On the other hand, to choose deletion candidates, we selected for the candidates whose corresponding sections in the outgroup variants were universally 100%.
To select insertion candidates in step 4, we identified all the sp. 1 variants of the genes containing insertion candidates, get the sequences, and carried out BLASTN alignments with the sp. 1 variants containing insertion candidates. If some variants were found not to have the insertions, then the candidates were discarded. To filter deletion candidates, we similarly carried out BLASTN alignments with all the sp. 1 variants containing deletion candidates and selected those in which all sp. 1 variants had the deletions.
Step 5 is intended to identify and remove indels that correspond to introns that are shorter than 70 nt and is skipped in the permissive set. We first reexamined the BLASTN alignments made in step 1 and identified the residue numbers of the sp. 2 variants before and after each insertion. Then we determined if each indel candidate in sp. 1 corresponds to an intron in sp. 2. If the intron length so identified is less than 70 nt, then the indel candidate is discarded. The rest of indel candidates, including those that did not correspond to introns in sp. 2, were retained.
In step 6 only insertions or deletions in internal exons were selected.
Generation mechanisms of fixed indels
An insertion is considered to be “possibly” generated by exonization if an intron whose length is more than or equal to two-thirds of the length of the insertion exists in sp. 2 at the corresponding location but is upgraded to “probably” generated by exonization if at least two-thirds of the residues were aligned by BLASTN (Fig 2A, C–E, G, and H). Likewise, a deletion is regarded as “possibly” generated by intronization if the corresponding segment in sp. 2 has an intron of at least two-thirds of the length of the deleted segment but is reclassified as “probably” generated by intronization if the BLASTN-aligned segment is more than or equal to two-thirds of the length of the deleted segment (Fig 2L, N–P, R, and S).
We identified the generation mechanisms of the rest of the fixed indels as follows. Homologous recombination is chosen as the generation mechanism of an insertion in sp. 1 based on the following conjecture (Fig S21A and B); sequence A with insertion generated by homologous recombination in sp. 1 is probably more like sequence A′ in sp. 1 that replaced the sequence than to the orthologous sequence B in sp. 2. Based on this idea, we set two criteria for an insertion to be a product of homologous recombination. First, there must be at least one sequence A′ of a different gene in sp. 1 aligned to the sp. 2 ortholog B, and second the fraction of the identity between the sequences of the insertion-containing exon and A′ is not significantly lower than that of the exon and B sequences (chi-square test). Similarly, a deletion in sp. 2 is considered to have been generated by homologous recombination if (1) sequence B in sp. 2 with an exon containing the deletion has at least one sequence B′ mapped to a different gene in sp. 2 that is aligned to the sp. 1 ortholog A and (2) the fraction of sequence identity between the exon and B′ is not significantly lower than that between the exon and A (chi-square test).
Figure S21. Rationale for the criteria for selecting indel cases generated by homologous recombination.
(A) Rationale for the criteria for selecting insertion cases generated by homologous recombination. (B) Rationale for the criteria for selecting deletion cases generated by homologous recombination.
Tandem repeats were identified in coding sequences by Tandem Repeats Finder (Benson, 1999) with the mismatch and indel penalties and maximum period size set at 5, 3, and 2,000, respectively, whereas the other parameters were at default settings. We chose the loose parameter settings to identify as many tandem repeats as possible. If there are segments of tandem repeats in the fixed indels, the tandem repeats were regarded as responsible for generating the indels.
Assignment of IDRs and statistical analyses
We ran IUPred3 (Erdős et al, 2021) and judged amino acids with a score more than or equal to 0.502 to be in IDRs. We carried out all the tests of statistical significance by t test unless otherwise stated. For the data shown in Figs 5A, 8A, S7A, and S18, we first divided the frequency of each length bin by the average frequency of the species and arithmetically averaged the normalized values of the eight species.
Supplementary Material
Acknowledgements
This work was supported in part by Grant-in-Aid for Transformative Research Areas “Multifaceted proteins: Expanding and transformative protein world” from the Ministry of Education, Culture, Sports, Science and Technology in Japan, 20H05932.
Author Contributions
K Homma: conceptualization, data curation, software, formal analysis, supervision, validation, investigation, methodology, project administration, and writing—original draft, review, and editing.
H Anbo: data curation, software, supervision, and writing—review and editing.
M Ota: supervision, funding acquisition, validation, and writing—review and editing.
S Fukuchi: supervision, funding acquisition, validation, and writing—review and editing.
Conflict of Interest Statement
The authors declare that they have no conflict of interest.
Data Availability
All data and code necessary to reproduce these analyses are available at Figshare; the lists of data and programs, the data of each selection step, the programs for the primates, rodents, Oryzias species, and Drosophila species, and the remaining four eukaryotes have been respectively deposited at https://figshare.com/articles/journal_contribution/Data_ProgramLists/29553317, https://figshare.com/articles/journal_contribution/DataFileModified/29553287, https://figshare.com/articles/journal_contribution/HumanChimpanzeeProgramsModified/29553299, https://figshare.com/articles/journal_contribution/MouseRatProgramsModified/29553305, https://figshare.com/articles/journal_contribution/MedakaProgramsModified/29553308, https://figshare.com/articles/journal_contribution/DrosophilaProgramsModified/29553311, and https://figshare.com/articles/journal_contribution/Other4SppProgramsModified/29553314.
References
- Abebrese EL, Ali SH, Arnold ZR, Andrews VM, Armstrong K, Burns L, Crowder HR, Day RT, Jr., Hsu DG, Jarrell K, et al. (2017) Identification of human short introns. PLoS One 12: e0175393. 10.1371/journal.pone.0175393 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) Basic local alignment search tool. J Mol Biol 215: 403–410. 10.1016/S0022-2836(05)80360-2 [DOI] [PubMed] [Google Scholar]
- Anbo H, Sato M, Okoshi A, Fukuchi S (2019) Functional segments on intrinsically disordered regions in disease-related proteins. Biomolecules 9: 88. 10.3390/biom9030088 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Anzai T, Shiina T, Kimura N, Yanagiya K, Kohara S, Shigenari A, Yamagata T, Kulski JK, Naruse TK, Fujimori Y, et al. (2003) Comparative sequencing of human and chimpanzee MHC class I regions unveils insertions/deletions as the major path to genomic divergence. Proc Natl Acad Sci U S A 100: 7708–7713. 10.1073/pnas.1230533100 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Benner SA, Cohen MA, Gonnet GH (1993) Empirical and structural models for insertions and deletions in the divergent evolution of proteins. J Mol Biol 229: 1065–1082. 10.1006/jmbi.1993.1105 [DOI] [PubMed] [Google Scholar]
- Benson G (1999) Tandem repeats finder: A program to analyze DNA sequences. Nucleic Acids Res 27: 573–580. 10.1093/nar/27.2.573 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Berget SM (1995) Exon recognition in vertebrate splicing. J Biol Chem 270: 2411–2414. 10.1074/jbc.270.6.2411 [DOI] [PubMed] [Google Scholar]
- Burssed B, Zamariolli M, Bellucco FT, Melaragno MI (2022) Mechanisms of structural chromosomal rearrangement formation. Mol Cytogenet 15: 23. 10.1186/s13039-022-00600-6 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Chen IT, Chasin LA (1994) Large exon size does not limit splicing in vivo. Mol Cell Biol 14: 2140–2146. 10.1128/mcb.14.3.2140-2146.1994 [DOI] [PMC free article] [PubMed] [Google Scholar]
- De Conti L, Baralle M, Buratti E (2013) Exon and intron definition in pre-mRNA splicing. Wiley Interdiscip Rev RNA 4: 49–60. 10.1002/wrna.1140 [DOI] [PubMed] [Google Scholar]
- de Jong WW, Rydén L (1981) Causes of more frequent deletions than insertions in mutations and protein evolution. Nature 290: 157–159. 10.1038/290157a0 [DOI] [PubMed] [Google Scholar]
- Delucchi M, Schaper E, Sachenkova O, Elofsson A, Anisimova M (2020) A new census of protein tandem repeats and their relationship with intrinsic disorder. Genes 11: 407. 10.3390/genes11040407 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Dunker AK, Oldfield CJ, Meng J, Romero P, Yang JY, Chen JW, Vacic V, Obradovic Z, Uversky VN (2008) The unfoldomics decade: An update on intrinsically disordered proteins. BMC Genomics 9: S1. 10.1186/1471-2164-9-S2-S1 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Erdős G, Pajkos M, Dosztányi Z (2021) IUPred3: Prediction of protein disorder enhanced with unambiguous experimental annotation and visualization of evolutionary conservation. Nucleic Acids Res 49: W297–W303. 10.1093/nar/gkab408 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Fan Y, Wang W, Ma G, Liang L, Shi Q, Tao S (2007) Patterns of insertion and deletion in mammalian genomes. Curr Genomics 8: 370–378. 10.2174/138920207783406479 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Fukuchi S, Noguchi T, Anbo H, Homma K (2023) Exon elongation added intrinsically disordered regions to the encoded proteins and facilitated the emergence of the last eukaryotic common ancestor. Mol Biol Evol 40: msac272. 10.1093/molbev/msac272 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Hawkins JD (1988) A survey on intron and exon lengths. Nucleic Acids Res 16: 9893–9908. 10.1093/nar/16.21.9893 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Khan T, Douglas GM, Patel P, Nguyen Ba AN, Moses AM (2015) Polymorphism analysis reveals reduced negative selection and elevated rate of insertions and deletions in intrinsically disordered protein regions. Genome Biol Evol 7: 1815–1826. 10.1093/gbe/evv105 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Kobe B, Kajava AV (2001) The leucine-rich repeat as a protein recognition motif. Curr Opin Struct Biol 11: 725–732. 10.1016/s0959-440x(01)00266-4 [DOI] [PubMed] [Google Scholar]
- Kondrashov FA, Koonin EV (2003) Evolution of alternative splicing: Deletions, insertions and origin of functional parts of proteins from intron sequences. Trends Genet 19: 115–119. 10.1016/S0168-9525(02)00029-X [DOI] [PubMed] [Google Scholar]
- Lang KM, Spritz RA (1983) RNA splice site selection: Evidence for a 5′ leads to 3′ scanning model. Science 220: 1351–1355. 10.1126/science.6304877 [DOI] [PubMed] [Google Scholar]
- Levinson G, Gutman GA (1987) Slipped-strand mispairing: A major mechanism for DNA sequence evolution. Mol Biol Evol 4: 203–221. 10.1093/oxfordjournals.molbev.a040442 [DOI] [PubMed] [Google Scholar]
- Li WH (1997) Molecular Evolution. Sunderland, MA: Sinauer Associates. [Google Scholar]
- Li X, Liu S, Zhang L, Issaian A, Hill RC, Espinosa S, Shi S, Cui Y, Kappel K, Das R, et al. (2019) A unified mechanism for intron and exon definition and back-splicing. Nature 573: 375–380. 10.1038/s41586-019-1523-6 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Light S, Sagit R, Ekman D, Elofsson A (2013a) Long indels are disordered: A study of disorder and indels in homologous eukaryotic proteins. Biochim Biophys Acta 1834: 890–897. 10.1016/j.bbapap.2013.01.002 [DOI] [PubMed] [Google Scholar]
- Light S, Sagit R, Sachenkova O, Ekman D, Elofsson A (2013b) Protein expansion is primarily due to indels in intrinsically disordered regions. Mol Biol Evol 30: 2645–2653. 10.1093/molbev/mst157 [DOI] [PubMed] [Google Scholar]
- Lin L, Shen S, Tye A, Cai JJ, Jiang P, Davidson BL, Xing Y (2008) Diverse splicing patterns of exonized Alu elements in human tissues. PLoS Genet 4: e1000225. 10.1371/journal.pgen.1000225 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Lin M, Whitmire S, Chen J, Farrel A, Shi X, Guo JT (2017) Effects of short indels on protein structure and function in human genomes. Sci Rep 7: 9313. 10.1038/s41598-017-09287-x [DOI] [PMC free article] [PubMed] [Google Scholar]
- Marquez Y, Höpfler M, Ayatollahi Z, Barta A, Kalyna M (2015) Unmasking alternative splicing inside protein-coding exons defines exitrons and their role in proteome plasticity. Genome Res 25: 995–1007. 10.1101/gr.186585.114 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Martin FJ, Amode MR, Aneja A, Austine-Orimoloye O, Azov AG, Barnes I, Becker A, Bennett R, Berry A, Bhai J, et al. (2023) Ensembl. Nucleic Acids Res 51: D933–D941. 10.1093/nar/gkac958 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Mills RE, Luttig CT, Larkins CE, Beauchamp A, Tsui C, Pittard WS, Devine SE (2006) An initial map of insertion and deletion (INDEL) variation in the human genome. Genome Res 16: 1182–1190. 10.1101/gr.4565806 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Mullaney JM, Mills RE, Pittard WS, Devine SE (2010) Small insertions and deletions (INDELs) in human genomes. Hum Mol Genet 19: R131–R136. 10.1093/hmg/ddq400 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Nekrutenko A, Li WH (2001) Transposable elements are found in a large number of human protein-coding genes. Trends Genet 17: 619–621. 10.1016/s0168-9525(01)02445-3 [DOI] [PubMed] [Google Scholar]
- Pascarella S, Argos P (1992) Analysis of insertions/deletions in protein structures. J Mol Biol 224: 461–471. 10.1016/0022-2836(92)91008-d [DOI] [PubMed] [Google Scholar]
- Qian B, Goldstein RA (2001) Distribution of indel lengths. Proteins 45: 102–104. 10.1002/prot.1129 [DOI] [PubMed] [Google Scholar]
- Redelings BD, Holmes I, Lunter G, Pupko T, Anisimova M (2024) Insertions and deletions: Computational methods, evolutionary dynamics, and biological applications. Mol Biol Evol 41: msae177. 10.1093/molbev/msae177 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Robberson BL, Cote GJ, Berget SM (1990) Exon definition may facilitate splice site selection in RNAs with multiple exons. Mol Cell Biol 10: 84–94. 10.1128/mcb.10.1.84-94.1990 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Simon M, Hancock JM (2009) Tandem and cryptic amino acid repeats accumulate in disordered regions of proteins. Genome Biol 10: R59. 10.1186/gb-2009-10-6-r59 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Sorek R (2007) The birth of new exons: Mechanisms and evolutionary consequences. RNA 13: 1603–1608. 10.1261/rna.682507 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Sorek R, Ast G, Graur D (2002) Alu-containing exons are alternatively spliced. Genome Res 12: 1060–1067. 10.1101/gr.229302 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Sterner DA, Carlo T, Berget SM (1996) Architectural limits on split genes. Proc Natl Acad Sci U S A 93: 15081–15085. 10.1073/pnas.93.26.15081 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Tompa P (2003) Intrinsically unstructured proteins evolve by repeat expansion. BioEssays 25: 847–855. 10.1002/bies.10324 [DOI] [PubMed] [Google Scholar]
- Ward JJ, Sodhi JS, McGuffin LJ, Buxton BF, Jones DT (2004) Prediction and functional analysis of native disorder in proteins from the three kingdoms of life. J Mol Biol 337: 635–645. 10.1016/j.jmb.2004.02.002 [DOI] [PubMed] [Google Scholar]
- Wetterbom A, Sevov M, Cavelier L, Bergström TF (2006) Comparative genomic analysis of human and chimpanzee indicates a key role for indels in primate evolution. J Mol Evol 63: 682–690. 10.1007/s00239-006-0045-7 [DOI] [PubMed] [Google Scholar]
- Wicker T, Matthews DE, Keller B (2002) TREP: A database for triticeae repetitive elements. Trends Plant Sci 7: 561–562. 10.1016/S1360-1385(02)02372-5 [DOI] [Google Scholar]
- Yagi Y, Nakamura T, Small I (2014) The potential for manipulating RNA with pentatricopeptide repeat proteins. Plant J 78: 772–782. 10.1111/tpj.12377 [DOI] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Supplementary Materials
Table S1. Number of cases at each selection step. (54.2KB, xlsx)
Table S2. Numbers of cases and residues of fixed indels classified by location with respect to internal exons. (48.9KB, xlsx)
Source Data for Figure 4LSA-2024-03148_SdataF4.xlsx (44.6KB, xlsx)
Table S4. Numbers of cases and residues in fixed indels classified by generating mechanism. (47.9KB, xlsx)
Source Data for Figure 5.1LSA-2024-03148_SdataF5.1.xlsx (29.6KB, xlsx)
Source Data for Figure 5.2LSA-2024-03148_SdataF5.2.xlsx (57.9KB, xlsx)
Source Data for Figure 6LSA-2024-03148_SdataF6.xlsx (32.7KB, xlsx)
Source Data for Figure 7.1LSA-2024-03148_SdataF7.1.xlsx (50.4KB, xlsx)
Source Data for Figure 7.2LSA-2024-03148_SdataF7.2.xlsx (40.1KB, xlsx)
Source Data for Figure 7.3LSA-2024-03148_SdataF7.3.xlsx (53.3KB, xlsx)
Source Data for Figure 7.4LSA-2024-03148_SdataF7.4.xlsx (40.8KB, xlsx)
Table S7. Human indels that fall on disease-related sites. (42.9KB, xlsx)
Source Data for Figure 8LSA-2024-03148_SdataF8.xlsx (32.7KB, xlsx)
Data Availability Statement
All data and code necessary to reproduce these analyses are available at Figshare; the lists of data and programs, the data of each selection step, the programs for the primates, rodents, Oryzias species, and Drosophila species, and the remaining four eukaryotes have been respectively deposited at https://figshare.com/articles/journal_contribution/Data_ProgramLists/29553317, https://figshare.com/articles/journal_contribution/DataFileModified/29553287, https://figshare.com/articles/journal_contribution/HumanChimpanzeeProgramsModified/29553299, https://figshare.com/articles/journal_contribution/MouseRatProgramsModified/29553305, https://figshare.com/articles/journal_contribution/MedakaProgramsModified/29553308, https://figshare.com/articles/journal_contribution/DrosophilaProgramsModified/29553311, and https://figshare.com/articles/journal_contribution/Other4SppProgramsModified/29553314.





























