Abstract
The rapid growth in the number of sequenced genomes makes it possible to search for the appearance of entirely new introns in the human lineage. In this study, we compared the genomic sequences for 19,120 human protein-coding genes to a collection of 3,493 vertebrate genomes, mapping the patterns of intron alignments onto a phylogenetic tree. This mapping allowed us to trace many intron gain events to precise locations in the tree, corresponding to distinct points in evolutionary history. We discovered 342 intron gain events, all of them relatively recent, in 293 distinct human genes. Among these events, we explored the hypothesis that intronization was the mechanism responsible for intron gain. Intronization events were identified by locating instances where human introns correspond to exonic sequences in homologous vertebrate genes. Although intronization appears to be rare, we found three compelling cases, and for each of those, we compared the human protein sequence and structure to those encoded by homologous genes that lack the introns.
Keywords: intron gain, comparative genomics, intronization
Significance
Intron evolution has long been a subject of debate, especially regarding how new introns arise in eukaryotic genomes. While intron loss has been well-documented, recent intron gain events, particularly in humans, have been more difficult to discover. Our study leverages the unprecedented scale of genomic data now available to identify 342 recent intron gains in human genes through comparative analysis across 3,493 vertebrate genomes. These findings provide fresh insights into the mechanisms behind intron emergence, highlighting the potential role of intronization in shaping modern human genes. By pinpointing specific evolutionary points where these gains occurred, this research enhances our understanding of genomic complexity and sheds light on the evolutionary processes that continue to shape the human genome.
Introduction
The discovery that eukaryotic genes are interspersed with non-coding segments, known as introns, was a surprising revelation in molecular genetics when it was first reported in 1977 (Berget et al. 1977; Chow et al. 1977). In the decades since, researchers have continued to delve into fundamental questions about introns: their origins, the timeline of their development, and whether their occurrence follows identifiable patterns that could reveal when and how they emerged (Gilbert 1978; Jeffares et al. 2006; Koonin 2006; Coulombe-Huntington and Majewski 2007; Catania and Lynch 2008; Tarrío et al. 2008; Roy and Irimia 2009a, 2009b; Ragg 2011; Chorev and Carmel 2012; Roy and Irimia 2012; Koonin et al. 2013; Wu et al. 2013; Jo and Choi 2015; Lee and Stevens 2016; Catania 2017; Poverennaya and Roytberg 2020). We now know that introns are ubiquitous in plants and animals, with the average number of introns in human protein-coding genes currently estimated at 10.5 (Morales et al. 2022).
The early 2000s saw the first concerted efforts to capture intron gain patterns on a large scale. In 2002, Fedorov and colleagues compared the available data on orthologous animal-plant, animal-fungi, and plant-fungi gene pairs and found that 39% of fungal introns matched both animal and plant positions, which showed evidence of ancestral introns predating the fungal–plant–animal divergence (Fedorov et al. 2002). Another study compared intron structures in 1,560 human–mouse orthologs, but did not find evidence of new intron creation within either lineage (Roy et al. 2003). In contrast, a study from the same year that analyzed 684 orthologous gene sets from a diverse range of organisms, including animals, plants, fungi, and protists, identified numerous intron insertions in vertebrates and plants (Rogozin et al. 2003).
While those studies were important in advancing our knowledge of intron evolution, improvements in sequencing technology over the past two decades have greatly increased the number and variety of genomes and the accuracy of their annotations. This progress motivated us to undertake the current study, in which we conducted a large-scale comparison of intron positions between human genes and other vertebrate genes that aimed to discover and determine the timing of intron gain events in an evolutionary context. As the basis for our searches, we utilized the 19,062 proteins in the MANE dataset (Morales et al. 2022), a recently created high-quality human gene set that contains one “gold standard” splice isoform for each human protein-coding gene. We searched all of these against a comprehensive collection of 17,302,662 proteins from 3,493 vertebrate species from the RefSeq database (O’Leary et al. 2016). Our methodology involved identifying introns in every orthologous gene that aligned with a human gene, and then mapping these intron positions onto a phylogenetic tree. This strategy enabled us to pinpoint the appearance of many introns at the base of a specific clade in the tree. Our analysis uncovered 342 recently gained introns in 293 distinct proteins.
Having identified numerous newly gained introns, we then investigated whether we could determine more about their origins. Mechanisms that have been proposed for the de novo creation of introns include intron transposition (Sharp 1985), double-strand break repair (Li et al. 2009), intron transfer (Hankeln et al. 1997), and intronization (Yang and Huang 2011; Yenerall and Zhou 2012; Kini 2018; Ryll et al. 2022). A more recent study suggested that most new introns are derived from transposable elements, referred to as Introners (Gozashti et al. 2022).
Despite the evidence that a majority of intron gains can be attributed to Introners, we wanted to explore whether intronization might be an alternative mechanism of intron acquisition. Intronization, first proposed in 2008 (Catania and Lynch 2008; Irimia et al. 2008), is a process whereby part of an exon is spliced out and becomes an intron in subsequent generations (Fig. 1). Intronization events have been reported within lineages including Cryptococcus (Roy 2009; Croll and McDonald 2012), fission yeast (Zhu and Niu 2013), Drosophila (Zhan et al. 2014), plants (Zhu et al. 2009), and human retrogenes (Kang et al. 2012). To the best of our knowledge, the only reported instance of intronization across different species groups so far involves humans and chimpanzees (Kim and Hahn 2012). Beyond that comparison, no similar events have been documented more broadly within vertebrate genomes or compared extensively across vertebrate subgroups. We restricted our search for intronization events to cases in which the intron in humans corresponds to an exonic sequence in multiple other vertebrate species. As we describe below, our search identified three novel intronization events in the human genome.
Fig. 1.
The intronization process. The ancestral form of a gene (top) has two exons, labeled exon 1 and exon 2. After intronization, a segment of exon 2 is transformed into an intron, resulting in the division of exon 2 into two distinct exons, labeled 2a and 2b. The newly gained intron needs to be a multiple of 3 in length in order to avoid causing a frameshift.
Results
From the set of 19,120 MANE genes, we identified 342 recently gained introns in 293 distinct proteins. For each human gene that appeared to have gained an intron, we arranged the gene and its orthologs according to their taxonomic relationships as shown in Fig. 2, where each intron is displayed with a small red square if present and a blue square if absent. Intron gain events can be identified by finding a vertical blue line connecting multiple genes, all of which are missing the intron, that surrounds a region where the intron is present, shown in red. As the figure shows, when we find a set of adjacent species in the tree that contain an intron, and those species are surrounded by others that are missing the intron, this finding corresponds to a subtree (containing human) that has gained an intron. We estimated the approximate origin of intron gain by identifying the subtree that maximizes the number R of aligned (red) introns and minimizes the number B of nonaligned (blue) ones; i.e. the subtree that maximizes the quantity R − B.
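The R − B criterion can be illustrated with a minimal sketch. It assumes the taxonomy has already been reduced to the chain of nested clades that contain human, ordered from narrowest to broadest, and that intron presence is known per species. The function name, the data layout, and the toy species list are hypothetical and are not taken from the analysis scripts.

```python
def best_gain_clade(lineage, has_intron):
    """Pick the clade on the human lineage that maximizes R - B.

    lineage: list of (clade_name, set_of_species), narrowest clade first.
    has_intron: dict mapping species -> True if the intron aligns with the
                human intron position (red), False if it is missing (blue).
    Returns (clade_name, score); ties keep the narrowest clade.
    """
    best = None
    for name, species in lineage:
        scored = [s for s in species if s in has_intron]
        r = sum(1 for s in scored if has_intron[s])  # aligned (red) introns
        b = len(scored) - r                          # nonaligned (blue) introns
        if best is None or r - b > best[1]:
            best = (name, r - b)
    return best


# Toy data, invented for illustration only.
has_intron = {
    "human": True, "chimp": True, "mouse": True, "opossum": True,
    "platypus": False, "chicken": False, "zebrafish": False,
}
lineage = [
    ("Primates", {"human", "chimp"}),
    ("Eutheria", {"human", "chimp", "mouse"}),
    ("Theria", {"human", "chimp", "mouse", "opossum"}),
    ("Mammalia", {"human", "chimp", "mouse", "opossum", "platypus"}),
    ("Amniota", {"human", "chimp", "mouse", "opossum", "platypus", "chicken"}),
    ("Vertebrata", set(has_intron)),
]

print(best_gain_clade(lineage, has_intron))
```

In this toy example the score rises while each broader clade adds species that carry the intron and falls once species lacking it are included, so the maximum marks the clade at whose base the gain is placed.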
Fig. 2.
Phylogenetic analysis showing intron gain in gene CLIP3 (GenBank accession NP_056341.1). The left panel shows a taxonomy of species that contain orthologues of CLIP3, with each species’ introns shown as a row of red or blue squares corresponding to the human gene, which has 12 introns. The squares to the right of each gene are colored red for introns that occur in the same position as the human sequence, and blue for introns present in humans but missing from the ortholog. The right panel provides a magnified view, highlighting the suborder Theria in which the gain of intron 8 occurred. Note that the taxonomic relationship among humans, chimpanzees, and gorillas shown in the tree is directly from the NCBI taxonomy, which lists them as equidistant rather than showing chimps as closer to humans.
Figure 2 illustrates the gain of intron 8 in the gene CLIP3 (CAP-Gly domain-containing linker protein 3) in humans and other members of the suborder Theria. Supplementary fig. S1, Supplementary Material online shows the gain of intron 4 in CYLC2 (NP_001331.1), illustrating both the exon-intron structure and a multiple sequence alignment for nine orthologs, six that contain the intron (including human) and three without it. Supplementary figs. S2 to S4, Supplementary Material online show three additional intron gain events in CYP21A2, RPGR, and HELZ2. A complete list of the 342 intron gains, including gene accession numbers, genomic coordinates, and the taxonomic groups in which each intron gain event occurred, is provided in supplementary table S1, Supplementary Material online.
Among the 342 intron gain events, we identified three examples of intronization, as described in Methods. Although these represent a very small fraction of all events, they nonetheless appear to provide support for the hypothesis that intronization accounts for at least some intron gains. The three events we identified occurred in the genes CYP21A2, RPGR, and HELZ2. Each of these events features well-aligned flanking exons, highlighting conserved regions in the proteins. In addition, orthologous proteins in other species contain insertions of amino acid sequences at positions corresponding precisely to the location of the human intron, and the lengths of those insertions match the length of a translation of the human intron, as shown by sequence alignments in Fig. 3 and supplementary figs. S5 to S6, Supplementary Material online.
Fig. 3.
Evidence of intronization in intron 4 of CYP21A2 (NCBI accession NP_000491.4). Shown here is an alignment between the human protein and its orthologs in river otter (XP_032703087.1), greater spear-nosed bat (XP_045682190.1), Indian flying fox (XP_039738891.1), and Egyptian fruit bat (XP_036094280.1). The flanking exons on either side are very highly conserved. The Xs in the human protein were added in the position of intron 4, with the number of Xs corresponding to 1/3 of the length of the intron in nucleotides. As shown here, the orthologous proteins contain amino acid sequences that are similar in length to the expected sequence based on the human intron. The lower alignment demonstrates the similarity between the translated intron sequence in human (frame 2) and the original exon sequence in the Indian flying fox ortholog, which has a BLAST E-value of 5 × 10⁻¹⁰.
For each of these three proteins that had undergone intronization, we selected the orthologous proteins from the closest relative that did not contain the specific intron and compared the translated intron sequence to the translations of the corresponding exon sequences in the orthologs. For CYP21A2 (NP_000491.4), we found similarity between the translated intron and orthologs, as shown in Fig. 3. The other two cases showed no detectable sequence similarities between the human intron and the corresponding amino acid sequences in other species, possibly because the human intronic sequence had simply accumulated too many mutations.
We used ColabFold (Mirdita et al. 2022; Kim et al. 2023) to predict the structures of the three human proteins that had undergone intronization (CYP21A2, RPGR, and HELZ2) and their orthologous counterparts that retain the original exons. The protein structures for human CYP21A2 and one of its orthologs are shown in Fig. 4, and those for RPGR and HELZ2 are shown in supplementary figs. S7 to S8, Supplementary Material online. Interestingly, we observed an improvement in average pLDDT scores for each of the human protein sequences. Upon closer examination, we found that the intronized segments in the non-human proteins are predicted to form unstructured coil regions. This observation suggests that the intronization of these exons did not compromise the proteins' functional integrity. Moreover, the higher pLDDT scores reported by AlphaFold2, which measure per-residue confidence in the predicted structure and tend to be low in disordered regions, suggest that the human proteins, from which the intronized segments have been removed, might provide some evolutionary advantages, although pLDDT is not a direct measure of structural stability.
Fig. 4.
Protein structures of human CYP21A2 (NP_000491.4) (left) and its homolog in Indian flying fox (XP_039738891.1) (right), as predicted by ColabFold. Notably, the region in the flying fox protein that aligns with the intronized sequence in human, indicated by a box, is mostly an unstructured coil region, which lowers the overall pLDDT score and suggests that it is not needed for the protein to function. The human protein's average pLDDT score is 90.2, while the flying fox protein's is 85.9.
Features of Recently Gained Introns
We mapped intron gain events to divergence times and discovered that a majority of these events occurred relatively recently in evolutionary history, with a significant portion happening after the emergence of mammals (∼187.5 MYA) (Fig. 5). This observation might be a consequence of our methodology, which relies on alignment and is therefore inherently more reliable at detecting recent events. We looked at the lengths of recently gained introns and found they were similar to all intron lengths in the MANE dataset (Fig. 6), suggesting that the length of an intron does not influence the likelihood of its gain. Interestingly, our method detected a slight tendency for recently gained introns to appear near the 5′ ends of transcripts (Fig. 6c). Among the 38 genes that gained more than one intron (supplementary table S2, Supplementary Material online), we observed the same tendency. This could indicate that these positions are less stable, although further research is required to confirm this hypothesis. These tendencies are relatively slight and may be a consequence of our methods rather than the intron gain process.
Fig. 5.
Number of introns gained in humans as a function of divergence time for intron gain events found in this study. The y-axis represents the number of events detected, and the x-axis indicates the estimated time since divergence.
Fig. 6.
Histograms of intron lengths and positions for intron gain events. a) The distribution of intron lengths of all MANE human introns. b) Length distribution of recently gained introns. c) Counts of recently gained introns plotted against their relative position in the transcript.
To investigate whether intron gain might be associated with the strength of splicing signals, we used SPLAM (Chao et al. 2024) to score all splice sites in proteins from the MANE annotation, and also to score the sites of intron gain events. As shown in supplementary fig. S9, Supplementary Material online, the overall distribution of scores for intron gain splice sites is very similar to that of splice sites in the MANE genes. Additionally, we used PANTHER (Mi and Thomas 2009) to categorize proteins with recently gained introns based on their functions, but this analysis did not reveal any particularly notable trends.
Discussion
Earlier studies suggested that intron gains within a specific lineage are rare (Roy et al. 2003), but become more frequent when comparing across different evolutionary subgroups (Carmel et al. 2007). Fedorov and colleagues reported in 2003 that only about 14% of animal introns align with plant intron positions, although that study was based on far less data than is available today (Fedorov et al. 2003). Our study used more than 17 million proteins spanning over 3,400 vertebrate species to successfully identify numerous instances of intron gains in the vertebrate lineage, focusing specifically on gains in human genes. In addition to casting new light on the origins of introns, our findings may also benefit genome annotation. The most common strategy for annotating genes today is to align the known genes from other species, using both DNA-based and protein-based alignments. If an intron is not shared among species, then it might be missed when using this strategy.
While our study provides evidence supporting intronization as a mechanism for intron gain, other mechanisms, including transposition, double-strand break repair, intron transfer, and the activity of transposable elements (Gozashti et al. 2022), might also create new introns. Although the methods described here were not designed to detect introns gained through these alternative mechanisms, their contribution to intron gain in humans remains an important question for future research.
As a mechanism for intron gain, intronization appears to be relatively rare, at least in the findings described here. While indirect evidence for intronization events has been identified within several taxa, including Cryptococcus, Caenorhabditis, and Plasmodium vivax, as well as in primate and rodent retrogenes (Irimia et al. 2008; Roy 2009; Szcześniak et al. 2011; Yang and Huang 2011), we have not encountered any studies that report intronization events across different subgroups of vertebrates. Identifying intronization events can be challenging due to sequence divergence and potential changes in length resulting from mutations within introns. However, the three cases we identified are particularly compelling. In these instances, the lengths of the exon sequences in other species closely match those of the translated intronized sequences in humans. Our observations are also consistent with Ryll et al. (2022), who reported that intronization tends to occur preferentially at the 5′ end of highly expressed genes. Investigating the impact of these intron gains on gene function and gene expression across species will be a valuable area for further study.
Methods
Dataset
We used the MANE (Matched Annotation from NCBI and EMBL-EBI) dataset (v1.0) as our primary source of protein-coding gene annotations (Morales et al. 2022). The MANE database contains a single canonical transcript for each protein-coding locus in the genome, and it was developed with the intent of establishing a universal standard for clinical reporting and comparative genomics. Every transcript in MANE is also contained in the RefSeq (O'Leary et al. 2016), GENCODE (Frankish et al. 2021), and CHESS (Varabyou et al. 2023) human annotations, in which all of the exon and intron boundaries agree precisely among all three catalogs. Release 1.0 of MANE includes transcripts for 19,062 protein-coding genes (plus 58 additional transcripts in the MANE Plus Clinical set), which is over 95% of the total protein-coding gene content in RefSeq, GENCODE, and CHESS. For our database searches, we used the vertebrate subset of the NCBI Protein Reference Sequences (refseq_protein) database, which contained 17,302,662 proteins from 3,493 species (as of November 2023). An overview of our analysis process is shown in supplementary fig. S10, Supplementary Material online.
Identifying Intron Gain Events in Human Proteins
For each MANE protein, we first performed a BLASTP search (Camacho et al. 2009) against the RefSeq vertebrates database to find all orthologous proteins, keeping the single best hit for each species and requiring a BLAST E-value of at most 10⁻⁶. Next, for each human protein and its orthologs, we retrieved gene coordinate tables from NCBI via Entrez Direct (Kans 2024) and used those data to compute the intron positions. We then inserted an “X” character into every amino acid sequence at the position of each intron in that gene. If the intron occurred in the middle of a codon, we inserted “X” just to the left of the corresponding amino acid. We then used MUSCLE (Edgar 2004) to create multiple alignments for each human protein sequence and its orthologs.
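The intron-marking step can be sketched as follows. This is a minimal illustration, assuming the coding portion of each exon is available as a length in nucleotides (derived from the gene coordinate table) and that the protein sequence excludes the stop codon. The function name and input layout are hypothetical. Under the stated rule, an intron that falls between codons and an intron that splits a codon both place the marker at amino acid index `cumulative_length // 3`.

```python
def mark_introns(protein, cds_exon_lengths):
    """Insert 'X' into a protein sequence at each intron position.

    protein: amino acid string (no stop codon).
    cds_exon_lengths: coding lengths of consecutive exons, in nucleotides.
    An intron inside a codon is marked just to the left of that amino acid.
    """
    insert_at = []
    cumulative = 0
    for length in cds_exon_lengths[:-1]:  # no intron follows the last exon
        cumulative += length
        insert_at.append(cumulative // 3)

    pieces = []
    previous = 0
    for index in insert_at:
        pieces.append(protein[previous:index])
        pieces.append("X")
        previous = index
    pieces.append(protein[previous:])
    return "".join(pieces)


# Toy example: 8 residues (24 coding nucleotides) split over three exons.
# The first intron lies between codons, the second splits a codon.
print(mark_introns("MKTAYIAK", [6, 10, 8]))
```

One caveat for real data: "X" is also the conventional symbol for an unknown residue, so a sequence that already contains it would need separate handling before the markers are counted.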
In the resulting multiple alignments, we looked for intron conservation, focusing on the positions of the human introns. For each intron in each human protein sequence, we assessed how many orthologous proteins had an intron at the same position based on the aligned “X” characters. We then used the NCBI taxonomy to determine where the intron had first appeared. To be considered as a potential recently gained intron, a specific human intron had to be unaligned (i.e. missing) in at least 70% of all orthologous species. This threshold is a somewhat arbitrary choice to ensure that the intron gain was not widespread across many other vertebrate species.
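The 70% filter can be sketched on a toy alignment. This sketch assumes that "same position" means an "X" in exactly the same alignment column as the human "X"; the actual matching rule may be more tolerant. The function name, the dictionary layout, and the example sequences are hypothetical.

```python
def candidate_gains(aligned, human_id="human", threshold=0.70):
    """List human introns that are missing in at least `threshold` of orthologs.

    aligned: dict mapping sequence id -> aligned sequence (equal lengths),
             with 'X' marking intron positions.
    Returns a list of (intron_number, fraction_missing), introns numbered
    from 1 in the order they appear in the human sequence.
    """
    human = aligned[human_id]
    others = [seq for sid, seq in aligned.items() if sid != human_id]
    if not others:
        return []

    results = []
    intron_number = 0
    for column, char in enumerate(human):
        if char != "X":
            continue
        intron_number += 1
        missing = sum(1 for seq in others if seq[column] != "X")
        fraction = missing / len(others)
        if fraction >= threshold:
            results.append((intron_number, fraction))
    return results


# Toy alignment, invented for illustration only.
aligned = {
    "human": "MKXTAYXIAK",
    "sp1":   "MKXTAY-IAK",
    "sp2":   "MKXTAY-IAK",
    "sp3":   "MKXTAY-IAK",
    "sp4":   "MKXTAYXIAK",
}
print(candidate_gains(aligned))
```

Here the first human intron is shared by every ortholog and is discarded, while the second is missing in three of four orthologs and passes the threshold.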
To validate that our initial ortholog searches did not miss paralogous proteins that might contain the putative newly gained introns, we used BLASTP to search each human protein with a putative intron gain against all other proteins from species that apparently lacked the intron, collecting all paralogs from those species. We then built a multi-alignment of those paralogs, inserting an “X” at the positions of annotated introns, to determine whether they contained the newly gained intron. If any paralog contained the intron in question, then we marked that species as having the intron. Separately, we also looked at intron gain events that occurred beyond the annotated start or end of the orthologous protein, and we excluded these from further analysis. This filtering reduced the initial set of 584 candidate intron gain events to 342 high-confidence events.
To ensure that the identified intronization events were not artifacts of alternative splicing, we cross-referenced our findings with RefSeq human annotations (release 110) (O’Leary et al. 2016). We found two genes (RASSF7 and BEST1) in which alternative transcripts retained an intron that was gained in the human lineage, and neither of those gains occurred through intronization.
Analysis of Potential Recently Gained Introns
After obtaining an initial set of intron gain events, we plotted the intron counts against (1) divergence time from humans, (2) intron lengths, and (3) relative intron position (computed as intron position divided by total protein length). We also employed PANTHER to classify the proteins that have gained introns according to their molecular function, biological process, cellular component, and protein class. To identify putative intronization events, we compared the exonic sequences flanking the gained introns in the human proteins with their counterparts in orthologous proteins. Intronization might also involve the incorporation of exonic sequences into existing intronic regions (Wang et al. 2005), although we did not explore this possibility here. If intronization occurred, this comparison should reveal a string of amino acids where the exon (now an intron in humans) was previously located. To detect such patterns, we conducted separate BLASTP searches for the sequences of the two human exons flanking the intron, searching against all vertebrates except for the subgroup containing the intron (Fig. 7). We retained hits from these searches if they exhibited a gap between the two matching regions in another genome. These hits were considered potential intronization events. Subsequently, we aligned the amino acid sequences using MUSCLE and BLASTP to visualize the evidence of intronization.
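The gap test on the two flanking-exon searches can be sketched as below. It assumes each BLASTP hit has been reduced to its 1-based, inclusive start and end coordinates on the same subject protein, with the upstream exon matching before the downstream exon. The function name and the returned fields are hypothetical, and the comparison with one third of the intron length is included only as a convenience for later inspection, not as a filter described in the text.

```python
def flank_gap(upstream_hit, downstream_hit, intron_len_nt):
    """Measure the subject residues lying between two flanking-exon hits.

    upstream_hit, downstream_hit: (subject_start, subject_end), 1-based inclusive.
    intron_len_nt: length of the human intron in nucleotides.
    """
    _, up_end = upstream_hit
    down_start, _ = downstream_hit
    gap_aa = down_start - up_end - 1
    return {
        "gap_aa": gap_aa,                   # residues between the two matches
        "expected_aa": intron_len_nt // 3,  # length if the intron were translated
        "retain": gap_aa > 0,               # a gap marks a potential intronization
    }


# Toy coordinates, invented for illustration only.
print(flank_gap((100, 150), (181, 230), 90))
print(flank_gap((100, 150), (151, 200), 90))
```

A hit pair with adjacent matches gives a gap of zero and is dropped, whereas a pair separated by extra residues in the ortholog is kept for alignment and visual inspection.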
Fig. 7.
Alignment of sequences in a potential intronization event. We modified the human protein sequence (bottom) by inserting Xs at the position of the intron, as shown, and the modified sequence was aligned to all possible orthologs. The upper part of the figure shows an aligned sequence from an orthologous protein in another species. The ortholog contains additional amino acids spanning the position that is now an intron in humans.
To further investigate intronization events, we extracted the DNA sequence of the human intron, translated it in all three frames, and assessed its similarity to the orthologous proteins. Even though the human sequence is no longer constrained to encode a protein, some amino acid similarity might still be detectable. Additionally, we used ColabFold (Mirdita et al. 2022; Kim et al. 2023) to predict the structures of both the human and orthologous sequences and to compute confidence scores (pLDDT) for both. The predicted structures were then visualized in PyMOL.
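The three-frame translation of an intron can be sketched in plain Python. This is a minimal illustration, assuming the standard genetic code and forward-strand frames numbered 1 to 3; the function name is hypothetical, and codons containing ambiguous bases are reported as "?" so that they cannot be confused with the "X" intron markers.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_frames(dna):
    """Translate a DNA sequence in the three forward frames.

    Returns a dict {1: ..., 2: ..., 3: ...}; '*' is a stop codon and '?'
    marks a codon that contains a base other than A, C, G, or T.
    """
    dna = dna.upper()
    frames = {}
    for offset in range(3):
        codons = (dna[i:i + 3] for i in range(offset, len(dna) - 2, 3))
        frames[offset + 1] = "".join(CODON_TABLE.get(c, "?") for c in codons)
    return frames


# Toy sequence, invented for illustration only.
print(translate_frames("ATGGCCTAA"))
```

Each translated frame could then be compared with the orthologous exon sequence, for example with BLASTP, to look for residual similarity.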
Contributor Information
Celine Hoh, Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218, USA; Center for Computational Biology, Johns Hopkins University, Baltimore, MD 21211, USA.
Steven L Salzberg, Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218, USA; Center for Computational Biology, Johns Hopkins University, Baltimore, MD 21211, USA; Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD 21211, USA; Department of Biostatistics, Johns Hopkins University, Baltimore, MD 21205, USA.
Supplementary Material
Supplementary material is available at Genome Biology and Evolution online.
Funding
This work was supported in part by the National Institutes of Health (NIH) under grants R01-HG006677 and R35-GM130151 to SLS.
Data Availability
The genomic datasets analyzed during this study, including the human MANE dataset and vertebrate protein sequences from RefSeq, are publicly available from www.ncbi.nlm.nih.gov/refseq/MANE/ and www.ncbi.nlm.nih.gov/refseq/. Detailed information on the 342 intron gain events, including genomic coordinates and species, is provided in supplementary table S1, Supplementary Material online. Additional data and analysis scripts are on GitHub at https://github.com/celinehohzm/intron_gain, while the 293 trees created in this study can be found at this Zenodo link.
Literature Cited
- Berget SM, Moore C, Sharp PA. Spliced segments at the 5′ terminus of Adenovirus 2 late mRNA. Proc Natl Acad Sci U S A. 1977:74(8):3171–3175. 10.1073/pnas.74.8.3171. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, Madden TL. BLAST+: architecture and applications. BMC Bioinformatics. 2009:10(1):421. 10.1186/1471-2105-10-421. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Carmel L, Wolf YI, Rogozin IB, Koonin EV. Three distinct modes of intron dynamics in the evolution of eukaryotes. Genome Res. 2007:17(7):1034–1044. 10.1101/gr.6438607. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Catania F. From intronization to intron loss: how the interplay between mRNA-associated processes can shape the architecture and the expression of eukaryotic genes. Int J Biochem Cell Biol. 2017:91(Pt B):136–144. 10.1016/j.biocel.2017.06.017. [DOI] [PubMed] [Google Scholar]
- Catania F, Lynch M. Where do introns come from? PLoS Biol. 2008:6(11):e283. 10.1371/journal.pbio.0060283. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Chao KH, Mao A, Salzberg SL, Pertea M. Splam: a deep-learning-based splice site predictor that improves spliced alignments. Genome Biol. 2024:25(1):243. 10.1186/s13059-024-03379-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Chorev M, Carmel L. The function of introns. Front Genet. 2012:3:55. 10.3389/fgene.2012.00055. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Chow LT, Gelinas RE, Broker TR, Roberts RJ. An amazing sequence arrangement at the 5′ ends of adenovirus 2 messenger RNA. Cell. 1977:12(1):1–8. 10.1016/0092-8674(77)90180-5. [DOI] [PubMed] [Google Scholar]
- Coulombe-Huntington J, Majewski J. Characterization of intron loss events in mammals. Genome Res. 2007:17(1):23–32. 10.1101/gr.5703406. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Croll D, McDonald BA. Intron gains and losses in the evolution of fusarium and cryptococcus fungi. Genome Biol Evol. 2012:4(11):1148–1161. 10.1093/gbe/evs091. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Edgar RC. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 2004:32(5):1792–1797. 10.1093/nar/gkh340. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Fedorov A, Feisal Merican A, Gilbert W. Large-scale comparison of intron positions among animal, plant, and fungal genes. Proc Natl Acad Sci U S A. 2002:99(25):16128–16133. 10.1073/pnas.242624899. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Fedorov A, Roy S, Fedorova L, Gilbert W. Mystery of intron gain. Genome Res. 2003:13(10):2236–2241. 10.1101/gr.1029803. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Frankish A, Diekhans M, Jungreis I, Lagarde J, Loveland JE, Mudge JM, Sisu C, Wright JC, Armstrong J, Barnes I, et al. GENCODE 2021. Nucl Acids Res. 2021:49(D1):D916–D923. 10.1093/nar/gkaa1087. PMID: 33270111; PMCID: PMC7778937. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Gilbert W. Why genes in pieces? Nature. 1978:271(5645):501. 10.1038/271501a0. [DOI] [PubMed] [Google Scholar]
- Gozashti L, Roy SW, Thornlow B, Kramer A, Ares Jr M, Corbett-Detig R. Transposable elements drive intron gain in diverse eukaryotes. Proc Natl Acad Sci U S A. 2022:119(48):e2209766119. 10.1073/pnas.2209766119. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Hankeln T, Friedl H, Ebersberger I, Martin J, Schmidt ER. A variable intron distribution in globin genes of chironomus: evidence for recent intron gain. Gene. 1997:205(1-2):151–160. 10.1016/S0378-1119(97)00518-0. [DOI] [PubMed] [Google Scholar]
- Irimia M, Rukov JL, Penny D, Vinther J, Garcia-Fernandez J, Roy SW. Origin of introns by ‘intronization’ of exonic sequences. Trends Genet. 2008:24(8):378–381. 10.1016/j.tig.2008.05.007. [DOI] [PubMed] [Google Scholar]
- Jeffares DC, Mourier T, Penny D. The biology of intron gain and loss. Trends Genet. 2006:22(1):16–22. 10.1016/j.tig.2005.10.006. [DOI] [PubMed] [Google Scholar]
- Jo B-S, Choi SS. Introns: the functional benefits of introns in genomes. Genomics Inform. 2015:13(4):112–118. 10.5808/GI.2015.13.4.112. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Kang L-F, Zhu Z-L, Zhao Q, Chen L-Y, Zhang Z. Newly evolved introns in human retrogenes provide novel insights into their evolutionary roles. BMC Evol Biol. 2012:12(1):128. 10.1186/1471-2148-12-128. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Kans J. Entrez direct: E-utilities on the unix command line. 2013 Apr 23 [Updated 2025 Mar 25]. In: Entrez Programming Utilities Help [Internet]. Bethesda (MD): National Center for Biotechnology Information (US); 2010. Available from: https://www.ncbi.nlm.nih.gov/books/NBK179288/. [Google Scholar]
- Kim DS, Hahn Y. Human-specific protein isoforms produced by novel splice sites in the human genome after the human-chimpanzee divergence. BMC Bioinformatics. 2012:13(1):299. 10.1186/1471-2105-13-299. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Kim G, Lee S, Levy Karin E, Kim H, Moriwaki Y, Ovchinnikov S, Steinegger M, Mirdita M. Easy and accurate protein structure prediction using ColabFold. Nat Protoc. 2025;20(3):620–642. 10.1038/s41596-024-01060-5. Epub 2024 Oct 14. PMID: 39402428. [DOI] [PubMed] [Google Scholar]
- Kini RM. Accelerated evolution of toxin genes: exonization and intronization in snake venom disintegrin/metalloprotease genes. Toxicon. 2018:148:16–25. 10.1016/j.toxicon.2018.04.005.
- Koonin EV. The origin of introns and their role in eukaryogenesis: a compromise solution to the introns-early versus introns-late debate? Biol Direct. 2006:1(1):22. 10.1186/1745-6150-1-22.
- Koonin EV, Csuros M, Rogozin IB. Whence genes in pieces: reconstruction of the exon-intron gene structures of the last eukaryotic common ancestor and other ancestral eukaryotes. Wiley Interdiscip Rev RNA. 2013:4(1):93–105. 10.1002/wrna.1143.
- Lee S, Stevens SW. Spliceosomal intronogenesis. Proc Natl Acad Sci U S A. 2016:113(23):6514–6519. 10.1073/pnas.1605113113.
- Li W, Tucker AE, Sung W, Thomas WK, Lynch M. Extensive, recent intron gains in Daphnia populations. Science. 2009:326(5957):1260–1262. 10.1126/science.1179302.
- Mi H, Thomas P. PANTHER pathway: an ontology-based pathway database coupled with data analysis tools. Methods Mol Biol. 2009:563:123–140. 10.1007/978-1-60761-175-2_7.
- Mirdita M, Schütze K, Moriwaki Y, Heo L, Ovchinnikov S, Steinegger M. ColabFold: making protein folding accessible to all. Nat Methods. 2022:19(6):679–682. 10.1038/s41592-022-01488-1.
- Morales J, Pujar S, Loveland JE, Astashyn A, Bennett R, Berry A, Cox E, Davidson C, Ermolaeva O, Farrell CM, et al. A joint NCBI and EMBL-EBI transcript set for clinical genomics and research. Nature. 2022:604(7905):310–315. 10.1038/s41586-022-04558-8.
- O'Leary NA, Wright MW, Brister JR, Ciufo S, Haddad D, McVeigh R, Rajput B, Robbertse B, Smith-White B, Ako-Adjei D, et al. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res. 2016:44(D1):D733–D745. 10.1093/nar/gkv1189.
- Poverennaya IV, Roytberg MA. Spliceosomal introns: features, functions, and evolution. Biochemistry (Mosc). 2020:85(7):725–734. 10.1134/S0006297920070019.
- Ragg H. Intron creation and DNA repair. Cell Mol Life Sci. 2011:68(2):235–242. 10.1007/s00018-010-0532-2.
- Rogozin IB, Wolf YI, Sorokin AV, Mirkin BG, Koonin EV. Remarkable interkingdom conservation of intron positions and massive, lineage-specific intron loss and gain in eukaryotic evolution. Curr Biol. 2003:13(17):1512–1517. 10.1016/S0960-9822(03)00558-X.
- Roy SW. Intronization, de-intronization and intron sliding are rare in Cryptococcus. BMC Evol Biol. 2009:9(1):192. 10.1186/1471-2148-9-192.
- Roy SW, Fedorov A, Gilbert W. Large-scale comparison of intron positions in mammalian genes shows intron loss but no gain. Proc Natl Acad Sci U S A. 2003:100(12):7158–7162. 10.1073/pnas.1232297100.
- Roy SW, Irimia M. Mystery of intron gain: new data and new models. Trends Genet. 2009a:25(2):67–73. 10.1016/j.tig.2008.11.004.
- Roy SW, Irimia M. Splicing in the eukaryotic ancestor: form, function and dysfunction. Trends Ecol Evol. 2009b:24(8):447–455. 10.1016/j.tree.2009.04.005.
- Roy SW, Irimia M. Genome evolution: where do new introns come from? Curr Biol. 2012:22(13):R529–R531. 10.1016/j.cub.2012.05.017.
- Ryll J, Rothering R, Catania F. Intronization signatures in coding exons reveal the evolutionary fluidity of eukaryotic gene architecture. Microorganisms. 2022:10(10):1901. 10.3390/microorganisms10101901.
- Sharp PA. On the origin of RNA splicing and introns. Cell. 1985:42(2):397–400. 10.1016/0092-8674(85)90092-3.
- Szcześniak MW, Ciomborowska J, Nowak W, Rogozin IB, Makałowska I. Primate and rodent specific intron gains and the origin of retrogenes with splice variants. Mol Biol Evol. 2011:28(1):33–37. 10.1093/molbev/msq260.
- Tarrío R, Ayala FJ, Rodríguez-Trelles F. Alternative splicing: a missing piece in the puzzle of intron gain. Proc Natl Acad Sci U S A. 2008:105(20):7223–7228. 10.1073/pnas.0802941105.
- Varabyou A, Sommer MJ, Erdogdu B, Shinder I, Minkin I, Chao KH, Park S, Heinz J, Pockrandt C, Shumate A, et al. CHESS 3: an improved, comprehensive catalog of human genes and transcripts based on large-scale expression data, phylogenetic analysis, and protein structure. Genome Biol. 2023:24(1):249. 10.1186/s13059-023-03088-4.
- Wang W, Zheng H, Yang S, Yu H, Li J, Jiang H, Su J, Yang L, Zhang J, McDermott J, et al. Origin and evolution of new exons in rodents. Genome Res. 2005:15(9):1258–1264. 10.1101/gr.3929705.
- Wu J, Xiao J, Wang L, Zhong J, Yin H, Wu S, Zhang Z, Yu J. Systematic analysis of intron size and abundance parameters in diverse lineages. Sci China Life Sci. 2013:56(10):968–974. 10.1007/s11427-013-4540-y.
- Yang Z, Huang J. De novo origin of new genes with introns in Plasmodium vivax. FEBS Lett. 2011:585(4):641–644. 10.1016/j.febslet.2011.01.017.
- Yenerall P, Zhou L. Identifying the mechanisms of intron gain: progress and trends. Biol Direct. 2012:7(1):29. 10.1186/1745-6150-7-29.
- Zhan L, Meng Q, Chen R, Yue Y, Jin Y. Origin and evolution of a new retained intron on the vulcan gene in Drosophila melanogaster subgroup species. Genome. 2014:57(10):567–572. 10.1139/gen-2014-0132.
- Zhu T, Niu D-K. Mechanisms of intron loss and gain in the fission yeast Schizosaccharomyces. PLoS One. 2013:8(4):e61683. 10.1371/journal.pone.0061683.
- Zhu Z, Zhang Y, Long M. Extensive structural renovation of retrogenes in the evolution of the Populus genome. Plant Physiol. 2009:151(4):1943–1951. 10.1104/pp.109.142984.
Data Availability Statement
The genomic datasets analyzed during this study, including the human MANE dataset and vertebrate protein sequences from RefSeq, are publicly available from www.ncbi.nlm.nih.gov/refseq/MANE/ and www.ncbi.nlm.nih.gov/refseq/. Detailed information on the 342 intron gain events, including genomic coordinates and species, is provided in supplementary table S1, Supplementary Material online. Additional data and analysis scripts are available on GitHub at https://github.com/celinehohzm/intron_gain, and the 293 trees created in this study can be found at this Zenodo link.