Skip to main content
eLife logoLink to eLife
. 2023 Oct 3;12:e91997. doi: 10.7554/eLife.91997

Inter-species association mapping links splice site evolution to METTL16 and SNRNP27K

Matthew T Parker 1, Sebastian M Fica 2, Geoffrey J Barton 1, Gordon G Simpson 1,3,
Editors: Christian R Landry4, Christian R Landry5
PMCID: PMC10581693  PMID: 37787376

Abstract

Eukaryotic genes are interrupted by introns that are removed from transcribed RNAs by splicing. Patterns of splicing complexity differ between species, but it is unclear how these differences arise. We used inter-species association mapping with Saccharomycotina species to correlate splicing signal phenotypes with the presence or absence of splicing factors. Here, we show that variation in 5’ splice site sequence preferences correlate with the presence of the U6 snRNA N6-methyladenosine methyltransferase METTL16 and the splicing factor SNRNP27K. The greatest variation in 5’ splice site sequence occurred at the +4 position and involved a preference switch between adenosine and uridine. Loss of METTL16 and SNRNP27K orthologs, or a single SNRNP27K methionine residue, was associated with a preference for +4 U. These findings are consistent with splicing analyses of mutants defective in either METTL16 or SNRNP27K orthologs and models derived from spliceosome structures, demonstrating that inter-species association mapping is a powerful orthogonal approach to molecular studies. We identified variation between species in the occurrence of two major classes of 5’ splice sites, defined by distinct interaction potentials with U5 and U6 snRNAs, that correlates with intron number. We conclude that variation in concerted processes of 5’ splice site selection by U6 snRNA is associated with evolutionary changes in splicing signal phenotypes.

Research organism: C. elegans, Human, S. cerevisiae, S. pombe, Other

Introduction

Introns in genes are a defining characteristic of eukaryotes (Madhani, 2013). The excision of introns and the splicing together of flanking exon sequences from RNA Polymerase II-transcribed RNAs is mediated by dynamic assemblies of proteins and small nuclear RNAs (UsnRNAs) called spliceosomes (Wilkinson et al., 2020). The number of introns and the complexity of splicing patterns vary widely between species (Irimia et al., 2007; Lim et al., 2021; Sales-Lee et al., 2021). For example, human gene expression is characterised by a relatively large number of introns and spliceosomal proteins, degenerate splicing cis-elements, and complex patterns of alternative splicing choices on transcripts from the same gene (Lee and Rio, 2015; Nilsen and Graveley, 2010; Sales-Lee et al., 2021). In contrast, most genes in the experimental model Saccharomyces cerevisiae lack introns; splicing cis-elements are almost invariant and alternative splicing is rare (Neuvéglise et al., 2011; Plaschka et al., 2019; Sales-Lee et al., 2021). Alternative splicing correlates with developmental complexity (Bénitìere et al., 2023; Bush et al., 2017; Chen et al., 2014; Wright et al., 2022). Yet some developmentally simple fungal species, including Cryptococcus neoformans, have many introns, degenerate splicing signals, and a near complete repertoire of human spliceosomal protein orthologs (Csuros et al., 2011; Lim et al., 2021; Mitrovich et al., 2010; Sales-Lee et al., 2021). The simplification of splicing detected in S. cerevisiae must therefore have occurred through the evolutionary loss of some introns and splicing factors (Csuros et al., 2011; Irimia et al., 2007; Lim et al., 2021; Mitrovich et al., 2010; Rogozin et al., 2012; Schwartz et al., 2008). It is unclear what drives global splicing patterns to become either more or less complex or how these changes in complexity are achieved (Bénitìere et al., 2023; Bush et al., 2017; Wright et al., 2022).

Splicing proceeds through two sequential transesterification reactions (Wilkinson et al., 2020). The first reaction, branching, occurs via nucleophilic attack of the 5’ splice site (5’SS) by the 2’ hydroxyl of the intron branch point adenosine. The second reaction, exon ligation, occurs via nucleophilic attack of the 3’ hydroxyl of the 5’ exon on the 3’ splice site (3’SS). Spliceosomes undergo sequential conformational and compositional changes as they process introns. During initial assembly, potential 5’SSs are typically recognised by U1 snRNA, and the branch point adenosine and 3’SS are recognised cooperatively by binding of the SF1 and U2AF proteins, and U2 snRNA (Fica, 2020). 5’SSs are subsequently transferred to U5 and U6 snRNA. In humans, 5’SS transfer is decoupled from active site formation, potentially enabling plasticity in 5’SS selection (Charenton et al., 2019; Fica, 2020). U5 snRNA loop 1 binds to the exon upstream of the 5’SS. The conserved ACAGA sequence of U6 snRNA pairs with the intron sequence adjacent to the 5’SS. The central adenosine of human U6 snRNA ACAGA is methylated at the N6 position (m6A) by METTL16 (Aoyama et al., 2020; Pendleton et al., 2017; Warda et al., 2017). Cryogenic-electron microscopy (cryo-EM) structures show that human U6 snRNA m6A43 faces the +4 position of the 5’SS (which is generally A in humans) in a trans Hoogsteen sugar edge interaction (Bertram et al., 2017; Parker et al., 2022). m6A-modification of the U6 snRNA ACAGA sequence is found in diverse eukaryotes. For example, the corresponding U6 snRNA nucleotide is m6A-modified in Caenorhabditis elegans, Schizosaccharomyces pombe and Arabidopsis thaliana by METTL16 orthologs (Ishigami et al., 2021; Mendel et al., 2021; Wang et al., 2022). In contrast, cryo-EM structures of S. cerevisiae spliceosomal complexes reveal major differences with humans in these early stages of splicing (Charenton et al., 2019). In S. cerevisiae, the transfer of 5’ SSs from U1 to U5 and U6 snRNA is directly coupled to active site formation. U5 snRNA loop 1 pairs with degenerate upstream exon sequences while the ACAGA box of U6 snRNA pairs with almost invariant intron sequences. There is no METTL16 ortholog in S. cerevisiae and the ACAGA sequence is not m6A-modified (Morais et al., 2021). The central adenosine of S. cerevisiae U6 snRNA ACAGA does not face a 5’SS +4 A but instead makes a Watson-Crick base pair with the almost invariant +4 U of S. cerevisiae 5’SSs (Neuvéglise et al., 2011; Wan et al., 2019). Therefore, these early steps in 5’SS selection exemplify distinctions between species that differ not only in terms of developmental complexity but also patterns of splicing complexity.

The direct impact of U6 snRNA m6A modification on splicing in humans remains uncharacterised. However, recent analyses of mutants lacking METTL16 orthologs in S. pombe (mtl16Δ) and Arabidopsis (fio1) have revealed that U6 snRNA m6A modification has profound effects on the accuracy and efficiency of splicing (Ishigami et al., 2021; Parker et al., 2022). Global splicing analyses of S. pombe mtl16Δ and Arabidopsis fio1 mutants show that 5’SSs with +4 A are most sensitive to the loss of U6 snRNA m6A modification. In contrast, splicing became more efficient at 5’SSs with +4 U in S. pombe mtl16Δ and Arabidopsis fio1 mutants. Relatively strong interactions with U5 snRNA can compensate for weakened U6 snRNA interactions caused by the loss of U6 snRNA m6A modification (Ishigami et al., 2021; Parker et al., 2022). A negative correlation between U5 and U6 snRNA interaction potentials reveals that there are two major classes of 5’SSs (Parker et al., 2022). Since in some species, pairs of alternative 5’SSs tend to be of opposite classes, this genomic feature may influence alternative splicing.

Given the similarity of the 5’SS +4 consensus preferred in Arabidopsis fio1 mutants to that observed in S. cerevisiae, we speculated that the loss of a METTL16 ortholog in some species could contribute to an evolutionary change in 5’SS sequence preference. To address this possibility, we took a phylogenetics approach by examining how 5’SS sequence preference changed in different species. We applied inter-species association mapping as an unbiased approach to identify factors correlated with 5’SS sequence (Kiefer et al., 2019; Smith et al., 2020). This approach is conceptually related to a genome-wide association study (GWAS) in which genomic variation is correlated with phenotype. However, rather than consider variation within a species, inter-species association mapping correlates a phenotype with genotypic variation between species, usually at the level of gene presence or absence, whilst correcting for the relatedness of species (Kiefer et al., 2019; Smith et al., 2020). We surveyed the genomes of more than 200 species from the Saccharomycotina clade of fungi and found multiple independent switches in sequence preference at the 5’SS +4 position. We found that the identity of the nucleotide at the 5’SS +4 position is most strongly correlated with the presence or absence of METTL16. In addition, we detected a conserved methionine in the spliceosomal protein SNRNP27K that also correlates with 5’SS +4 sequence preference. We conclude that variation in factors that act in concert to influence U6 snRNA interactions with 5’SSs are associated with evolutionary change in splicing signal phenotypes.

Results

METTL16 is widely conserved in eukaryotes

Patterns of presence/absence variation of human spliceosomal protein orthologs in some fungal species have recently been reported (Sales-Lee et al., 2021). However, because m6A-modified U6 snRNA is a component of spliceosomes but METTL16 is not, variation in METTL16 or UsnRNA modification was not assessed in this context. To understand the variation in METTL16 across eukaryotes, we used the software tool phmmer (Eddy and Pearson, 2011) to search Uniprot databases (Bateman et al., 2023) for proteins containing similarity to the METTL16 methyltransferase domain (MTD) in 33 diverse eukaryotic species from Metazoa, Fungi, Archaeplastida, Chromista, and Excavata. Putative METTL16 orthologs were identified in species from all groups except Excavata (Figure 1A). We were unable to detect METTL16 orthologs in four species, but since these species were distributed across the kingdoms, we conclude that METTL16 is ancestral to most or all eukaryotes and that absences are the result of independent gene loss events.

Figure 1. METTL16 is widely conserved across eukaryotes.

(A) Phylogenetic tree showing the presence and absence of a METTL16 ortholog in 33 eukaryotic species. Linear protein structures with globular domains identified from Alphafold2 models are shown on the right of the tree. Domains are colored by cluster: MTD domains in blue, MTD +VCR/KA-1 domains in green, and VCR/KA-1 domains in orange. Likely loop/disordered regions with low confidence predictions are shown as grey lines. (B) Distance matrix heatmap showing the pairwise TMscore of segmented domains from the Alphafold2 predictions of 29 METTL16 orthologs. The X-ray structures of human METTL16 MTD and VCR, and TUT1 KA-1 are included as positive controls. Domains are grouped into clusters (diagonal boxes) using the same color scheme as in (A). (C) Boxplot showing TMscores of segmented domains from Alphafold2 predictions of 28 METTL16 orthologs superimposed onto experimentally determined structures of human METTL16 MTD and VCR, and TUT1 KA-1. (D) Superimposition of the VCR/KA-1 domain of Arabidopsis FIO1 predicted by Alphafold2 onto the X-ray structure of the human METTL16 VCR.

Figure 1.

Figure 1—figure supplement 1. Structural analysis of fungal METTL16 orthologs.

Figure 1—figure supplement 1.

(A) Alphafold2 predicted structure of S. pombe MTL16, which cannot be segmented into separate MTD and VCR/KA-1-like domains using the PAE method. Alphafold2 predicts that the MTD and VCR/KA-1 subdomains are sandwiched together by hydrogen bonding of Arginine 258 (orange, hydrogen bonding shown as dashed red lines) with backbone carbonyl oxygens, to create the superdomain structure. (B) Multiple sequence alignment of the 8 fungal METTL16 orthologs, of which 7 cannot be segmented into separate MTD and VCR/KA-1-like domains using the PAE method. Arg258 is highly conserved in these 7 orthologs (orange), but not in Y. lipolytica, which is segmentable into two domains.

To consider potential changes in METTL16 substrate specificity, we analysed the domain structure of the METTL16 orthologs in more detail. Structural studies indicate that human METTL16 is comprised of two globular domains: the N-terminal MTD and a C-terminal domain called the ‘vertebrate conserved region’ (VCR). Both domains contribute to METTL16 U6 snRNA m6A-modification specificity (Aoyama et al., 2020). Since primary sequence alignments fail to identify the METTL16 VCR in more distantly related species (Aoyama et al., 2020; Ju et al., 2023; Pendleton et al., 2017), we developed a revised approach to this search by incorporating protein structure predictions. Human METTL16 VCR has structural and functional equivalence to the KA-1 domain of the U6 3’ uridyltransferase TUT1 (Aoyama et al., 2020). The METTL16 VCR and TUT1 KA-1 domains bind the conserved internal stem-loop of U6 snRNA to enhance substrate specificity (Aoyama et al., 2020; Yamashita et al., 2017). Given the low sequence similarity between METTL16 VCR and TUT1 KA-1, we asked whether METTL16 orthologs contain topologically similar C-terminal domains that are poorly conserved at the sequence level. To address this question, we used Alphafold2 predictions (Varadi et al., 2022) of the 29 orthologous METTL16 proteins identified in the 33 species from across Eukaryota. We used the predicted local distance difference test (pLDDT) and predicted alignment error (PAE) values provided for each model to segment the proteins into globular domains, excluding regions predicted to be flexible or disordered (Oeffner et al., 2022). Using this method, all the METTL16 orthologs were found to have either one or two globular domains (Figure 1A). We then superimposed all pairs of domains and calculated the template modelling score (TM-score), which is a measure of the similarity of two or more protein folds (Zhang and Skolnick, 2005). As positive controls, we included structures of the human METTL16 MTD and VCR domains, and TUT1 KA-1 domain, determined by X-ray crystallography (Aoyama et al., 2020; Ruszkowska et al., 2018; Yamashita et al., 2017). All predicted METTL16 ortholog domains were classified as either MTD-like, VCR/KA-1-like, or MTD +VCR/KA-1-like (Figure 1B), with all MTD-like and VCR-like domains superimposing onto the experimentally determined human METTL16 MTD or VCR/KA-1 structures, respectively, with TM-scores that indicate structural analogy (Figure 1C). The MTD +VCR/KA-1-like class is specific to the fungal species examined here and is explained by the close packing of MTD and KA-1 domains which prevents segmentation by PAE. This results in ‘superdomains’ that have structural analogy to both the MTD and KA-1 domains of human METTL16 (Figure 1C). The VCR/KA-1-like domain was absent from METTL16 orthologs in some species but could be identified in orthologs from across the eukaryotic tree of life, including C. elegans, S. pombe, Arabidopsis, and Phytophthora infestans (Figure 1A). We used the software tool US-align (Universal Structure alignment) (Zhang et al., 2022) to perform multiple structure alignments of all 11 VCR/KA-1-like domains predicted by Alphafold2, resulting in an overall TM-score of 0.53, which indicates that all the domains are structurally analogous (Zhang and Skolnick, 2005). For example, the predicted C-terminal domain of Arabidopsis FIO1 could be superimposed onto the crystal structure of the human METTL16 VCR domain with a TM-score of 0.82, and an RMSD of 0.86 Ångstroms at 59 structurally equivalent alpha carbon atoms (Figure 1D). All five beta strands and the two larger alpha helices of the METTL16 VCR domain are predicted by Alphafold2 to be conserved in FIO1.

The MTD +VCR/KA-1-like cluster of ‘superdomains’ from fungal species, for which the MTD and VCR/KA-1-like subdomains could not be segmented, superimposed well onto each other with an overall TM-score of 0.83 for 7 orthologs (Figure 1B). This indicates that the predicted relative positions of the MTD and VCR/KA-1-like subdomains within these superdomains is consistent across orthologs. Arginine 258 of S. pombe MTL16, which is located between the MTD-like and VCR/KA-1-like subdomains, is predicted to form hydrogen bonds with backbone carbonyl oxygens and a serine sidechain that binds the two subdomains together, helping to form the superdomain (Figure 1—figure supplement 1A). In agreement with this, Arg258 is strongly conserved in the 7 orthologs which have an MTD +VCR/KA-1-like superdomain but absent from Yarrowia lipolytica METTL16 (Figure 1—figure supplement 1B), which is segmentable into two domains (Figure 1A). The position of the VCR/KA-1-like subdomain relative to the known ACAGA binding site of the MTD in these orthologs may help to improve our understanding of how the VCR/KA-1-like domain contributes to the specificity of METTL16 for U6 snRNA.

In summary, by incorporating protein structure predictions we could revise the understanding of the conserved domain structure of METTL16, detecting evidence for the VCR/KA-1 domain in species where it had previously been reported to be absent (Mendel et al., 2021). These results agree with recent structural studies of the C. elegans METTL16 ortholog METT10 which confirmed the presence of a KA-1 domain (Ju et al., 2023). Overall, our analysis suggests that the last eukaryotic common ancestor may not only have had a METTL16 ortholog but one that contained a KA-1-like domain too. Consequently, METTL16 with a domain that enhances substrate specificity for U6 snRNA is a feature of early eukaryotes.

Variation in the identity of the 5’SS +4 position evolved independently on multiple occasions

To understand evolutionary changes in splicing complexity, we focused on splicing signal phenotypes in the Saccharomycotina clade of ascomycetous yeasts. The Saccharomycotina clade was chosen for three reasons. First, simplification of spliceosome composition, intron content and cis-element splicing signals has already been established in these species (Lim et al., 2021; Neuvéglise et al., 2011; Sales-Lee et al., 2021). Second, the Saccharomycotina clade includes S. cerevisiae, which is an important model species for the study of splicing. Finally, the availability of many sequenced Saccharomycotina genomes has the potential to provide statistical power for inter-species association mapping.

We devised a new bioinformatics pipeline to identify groups of orthologous splicing factor genes (orthogroups) and splicing sequence phenotypes from genomic sequences deposited in NCBI (Figure 2—figure supplement 1). We used the software tool Funannotate (Palmer and Stajich, 2020) to annotate the genomes of 227 publicly available Sacchromycotina genomes (Supplementary file 1; Dujon et al., 2004; Riley et al., 2016; Shen et al., 2018; Shen et al., 2016). In addition, we incorporated an outgroup comprised of 13 well-annotated eukaryote genomes, including the human genome and 3 reference Saccharomycotina genomes (S. cerevisiae, Candida albicans, and Yarrowia lipolytica) (Dujon et al., 2004; Engel et al., 2022; Muzzey et al., 2013; Nurk et al., 2022). The outgroup helps root the phylogenetic tree and annotate the orthogroups predictions. Protein sequences from these annotations were clustered into orthogroups using the software tool Orthofinder (Emms and Kelly, 2019), which also generates a species tree (Emms and Kelly, 2018). The initial de novo annotations of some species contained many incorrectly annotated introns. Consequently, we used multiple sequence alignments of proteins from each orthogroup to filter introns, keeping only those introns that were conserved in the same multiple alignment column and frame in at least two species (Rogozin et al., 2003). Overall, the pipeline yielded sequence information on introns and identified orthologous groupings of protein-coding genes.

We used the conserved annotated introns to estimate the 5’SS sequence preference of each species and mapped these onto the species tree whilst also estimating ancestral 5’SS sequence preferences (Figure 2A; Paradis et al., 2004; Schwartz et al., 2008). This analysis identified limited variation in the frequency of the preferred nucleotides at the –1,+3, and +5 positions of the 5’SS (Figure 2B, Figure 2—figure supplement 2A–C). In contrast, there have been multiple independent changes of sequence preference at the 5’SS +4 position in the Saccharomycetaceae, Debaryomycetaceae (CUG-Ser1) and Pichiaceae families (Figure 2C). These preference changes almost exclusively comprise switches between +4 A and +4 U since there was much less variation in the ratio of W (A or U) to S (G or C) nucleotides (Figure 2B, Figure 2—figure supplement 2D). For example, the Saccharomycetaceae (which includes S. cerevisiae) and their sister taxa Saccharomycodaceae all have a similarly strong 5’SS preference for +4 U, suggesting that an overall preference for +4 U is ancestral to this group. However, the estimated next closest family to the Saccharomycetaceae, the Phaffomycetaceae, have an overall preference for +4 A, indicating that the invariant 5’SS +4 U phenotype of Saccharomycetaceae may have evolved since the divergence of the two families. Although the confidence of the relative arrangement of Phaffomycetaceae, Saccharomycetaceae and CUG-Ser1 clades on the species tree is low (Figure 2A), the placement of Phaffomycetaceae matches previous approaches that used genome-scale data (Shen et al., 2020; Shen et al., 2016). Furthermore, the preference for 5’SS +4 A in families basal to the CUG-Ser1 clade, such as Cephaloascaceae (e.g. Cephaloascus fragrans), corroborates the finding that the switch to +4 U seen in many genera of the Debaryomycetaceae family of the CUG-Ser1 clade (e.g. in C. albicans) is independent of Saccharomycetaceae (Figure 2C, Figure 2—figure supplement 3). This indicates that the invariant 5’SS +4 U preferences seen in S. cerevisiae and C. albicans are the result of convergent evolution, which contrasts with a previous interpretation made when fewer genomes were available for comparison (Schwartz et al., 2008). Therefore, the improved resolution that the recent availability of more genome sequences makes possible reveals that the evolutionary history of splicing signal phenotypes in fungi is more complex than previously recognised.

Figure 2. Variation in the identity of the 5’ splice site +4 position evolved independently on multiple occasions.

(A) Phylogenetic tree showing the estimated +4 position nucleotide fractions (as stacked bars) of the last common ancestors of 11 clades of Saccharomycotina, plus three outgroup species. The phylogenetic tree used is a collapsed ultrametric version of the species tree generated by Orthofinder from the proteomes of 240 species. Bifurcations are colored by confidence, calculated using the number of single-locus gene trees that support the bifurcation in the Orthofinder STAG algorithm (Emms and Kelly, 2019; Emms and Kelly, 2018) (this measure is generally more stringent than bootstrap values for trees generated using concatenated multiple sequence alignment and maximum likelihood methods). Ancestral nucleotide fractions were calculated using Sankoff Parsimony (Schwartz et al., 2008) (B) Line plots showing the standard deviation of nucleotide frequency ratio phenotypes for 240 species, across the –3 to +7 positions of the 5’SS. Left panel shows all pairwise combinations of single nucleotide frequency phenotypes (e.g. A to U ratio), right panel shows all single nucleotide versus other combinations (e.g. G to H [A, C or U]), as well as R (A or G) to Y (C or U) and S (G or C) to W (A or U) ratios. Human 5'SS consensus sequences are shown for each position as a guide. (C) Stacked histogram and traitgram showing the distribution of 5’SS +4 A to U ratio phenotypes in different Saccharomycotina species. The traitgram shows the predicted path of the 5’SS +4 A to U ratio phenotype through evolutionary time by plotting the branch length on the x-axis, and the measured or estimated phenotype of each node on the y-axis. Clades defined in (A) have been colored accordingly and the paths of 9 key species have been highlighted in bold – for example showing that the last common ancestor of S. cerevisiae and C. albicans is unlikely to have had a +4 U preference phenotype (last common ancestor of S. cerevisiae and C. albicans is also indicated by black arrow). L. starkeyi = Lipomyces starkeyi, C. fragrans = Cephaloascus fragrans, E. gossypii = Eremothecium gossypii.

Figure 2.

Figure 2—figure supplement 1. Interspecies association mapping pipeline.

Figure 2—figure supplement 1.

Workflow showing the overall method used for generating inter-species association mapping results. Blue cylinders represent input data, orange boxes represent processes, green parallelograms represent intermediate datasets, and yellow parallelograms represent output datasets.
Figure 2—figure supplement 2. Variation in 5’SS splicing signal sequence preference phenotypes across Saccharomycotina.

Figure 2—figure supplement 2.

Stacked histogram and traitgrams showing the distribution of (A) –1 G to H (A, C or U), (B) +3 A to G, (C) +5 G to H and (D) +4 W (A or U) to S (G or C) ratio phenotypes in different Saccharomycotina species. The traitgram shows the predicted path of the phenotypes through evolutionary time by plotting the branch length on the x-axis, and the measured or estimated phenotype of each node on the y-axis. Species have been colored by clade (see key) and the paths of various key species have been highlighted in bold.
Figure 2—figure supplement 3. Variation in 5’SS splicing signal sequence preference phenotypes across Saccharomycotina.

Figure 2—figure supplement 3.

Pruned Saccharomycotina tree showing the diversity of 5'SS sequence preferences. Species were selected to represent those Saccharomycotina species previously analysed by Schwartz et al., 2008, with new additions which demonstrate that changes in 5'SS +4 preference have occurred multiple times in Saccharomycotina.
Figure 2—figure supplement 4. Variation in 3’SS splicing signal sequence preference phenotypes across Saccharomycotina.

Figure 2—figure supplement 4.

(A) Line plots showing the standard deviation of nucleotide frequency ratio phenotypes for 240 species, across the –5 to +3 positions of the 3’SS. Left panel shows all pairwise combinations of single nucleotide frequency phenotypes (e.g. A to U ratio), right panel shows all single nucleotide versus other combinations (e.g. G to H [A, C or U]), as well as R (A or G) to Y (C or U) and S (G or C) to W (A or U) ratios. Human 3'SS consensus sequences are shown for each position as a guide. (B) Stacked histogram and traitgram showing the distribution of 3'SS –3 C to U ratio phenotypes in different Saccharomycotina species. The traitgram shows the predicted path of the phenotypes through evolutionary time by plotting the branch length on the x-axis, and the measured or estimated phenotype of each node on the y-axis. Species have been colored by clade (see key) and the paths of various key species have been highlighted in bold.

We explored whether there were changes in sequence preferences at the 3’SSs of the species in our dataset and detected evidence of switches in 3’SS sequence preference at the –3 position, between C and U. The variation in this phenotype was lower than that observed at the 5'SS +4 position (Figure 2—figure supplement 4A–B). Overall, we conclude that switches from 5’SS +4 A to +4 U preference have occurred independently on multiple occasions in the Saccharomycotina. Consequently, this suggests that Saccharomycotina species present variation in the 5’SS +4 phenotype suitable for inter-species association studies.

Switches in 5’SS sequence preference are not associated with compensatory changes in U6 or U1 snRNA sequence

Changes in 5’SS sequence preference could be associated with either the changed function of a spliceosomal protein or changes in UsnRNA sequences that interact with 5’SSs. For example, a 5’SS +4 A could make a Watson-Crick base pair with U6 snRNA through a compensatory nucleotide change in the central adenosine of the U6 ACAGA box (Figure 3A). Indeed, a change of the corresponding adenosine to uridine is found in U6 snRNA of the unicellular red alga Cyanidioschyzon merolae (Stark et al., 2015; Wong et al., 2023). To examine this possibility in our dataset, we predicted a high-confidence U6 snRNA gene in 229 of the 230 Saccharomycotina genomes and 9 of the 10 outgroups’ using the software tool Infernal, and a covariance model built using the secondary structural model of human U6 snRNA 5’stem, telestem and internal stem-loop. The remaining two species required manual intervention to annotate the U6 snRNA gene (see Materials and methods). The ACAGA sequence, which is targeted by METTL16, was invariant in all species (Figure 3B). Therefore, switches in 5’SS +4 A/U in the Saccharomycotina are not associated with changes in U6 snRNA ACAGA sequences.

Figure 3. Switches in 5’SS sequence preference are not associated with compensatory changes in U6 snRNA sequence.

(A) Diagram showing the interaction of the U6 snRNA ACAGA box with the approximate human 5'SS consensus sequence. The methylated position of the ACAGA box is located opposite the 5'SS +4 position. (B) Conservation of U6 snRNA positions calculated from best U6 snRNA sequences predicted from genomes for 240 species (including 230 Saccharomycotina and 10 outgroups) using Infernal and plotted onto the predicted structure using R-scape and R2R. The predicted structure is derived from the predicted human U6 snRNA structure (Aoyama et al., 2020). All the positions in the ACAGA box were 100% conserved across all 240 species.

Figure 3.

Figure 3—figure supplement 1. Switches in 5’SS sequence preference are not associated with compensatory changes in U1 snRNA sequence.

Figure 3—figure supplement 1.

(A) Diagram showing the interaction of the U1 snRNA 5'SS interacting site with the approximate 5'SS consensus sequence of humans. In humans and S. cerevisiae, the 5'SS +4 position is located opposite a pseudouridylated uridine. (B) Conservation of the 5’ end and stem 1 of the U1 snRNA positions calculated from best U1 snRNA stem 1 sequences predicted from genomes for 236 species (including 227 Saccharomycotina and 9 outgroups) using Infernal and plotted onto the predicted structure using R-scape and R2R. All the positions in the 5’SS interacting region were 100% conserved across all species.

A second possibility is that changes in 5’SS +4 preference could be compensated by changes in the sequence or modification of the interacting position of the U1 snRNA (Figure 3—figure supplement 1A). During canonical recognition of 5’SSs by U1 snRNA, the 5’SS +4 position interacts with position 5 of the U1 snRNA in the U1/5’SS helix, which is a pseudouridine (Ψ5) in both humans and S. cerevisiae. A change in nucleotide identity at this position of U1 snRNA, for example from pseudouridine to adenosine, could compensate a change in 5'SS +4 preference from adenosine to uridine. The U1 snRNA is larger and highly diverged in Saccharomycetaceae compared to metazoans, making computational prediction more challenging. However, using a covariance model of the 5’SS interacting site and stem 1 of the U1 snRNA structure, we were able to identify U1 snRNA candidates from 227 of the 230 Saccharomycotina genomes and 9 of the 10 outgroups. The ACUUACC sequence of the 5’SS interacting positions of U1 snRNA was invariant in all species (Figure 3—figure supplement 1B).

Overall, we conclude that switches in A/U sequence preference at the 5’SS +4 position in Saccharomycotina is not associated with compensatory sequence changes in the interacting positions of either the U6 or U1 snRNAs.

Inter-species association mapping links METTL16 to 5’SS +4 sequence preference

Having established variation in the sequence preference of 5’SSs, which could not be explained by compensatory changes in either U1 or U6 snRNA sequence, we next asked if inter-species association mapping could determine whether the presence or absence of a splicing factor or METTL16 ortholog correlated with the 5’SS +4 A/U phenotype.

We first reduced the number of ortholog absences in the dataset caused by gene prediction failures. Multiple sequence alignments of 153 protein orthogroups corresponding to METTL16 and 157 human splicing factors (Sales-Lee et al., 2021) were used to generate profile hidden Markov models (pHMMs). The pHMMs were then applied to six-frame translations of Saccharomycotina genomic sequences to rescue missing open reading frames (ORFs) using the software package Hmmer (Eddy and Pearson, 2011). Orthogroups were converted into a table of presence/absence variation for each orthogroup in each species, filtering for orthogroups with members present in at least five species and a minimum of two predicted independent loss events in the species tree. These genotypes were then used to perform inter-species association mapping using phylogenetic generalised least squares (PGLS) to identify significant associations with the ratio of 5’SS +4 A to +4 U in each species, whilst controlling for the relatedness of species using distances from the species tree calculated by Orthofinder.

PGLS analysis results in a multiple-testing corrected p-value and model coefficient which can be used to examine the correlation of splicing factor presence/absence to the ratio of 5’SS +4 A/U (Figure 4B). The strongest and most significant association of the 5’SS +4 A/U phenotype was with the presence or absence of METTL16 orthologs. When METTL16 was present, 5’SS +4 A was more likely to be found, and when METTL16 was absent, 5’SS +4 U was more likely to be found (Figure 4A).

Figure 4. Inter-species association mapping links METTL16 to 5’SS +4 sequence preference.

(A) Volcano plot showing the results of the PGLS analysis of the 5’SS +4 A to U ratio phenotype for orthogroups corresponding to known human splicing factors and METTL16. The magnitude and directionality of the model coefficients (x-axis) are used as a measure of effect size and direction of effect on phenotype, respectively. (B) Stacked histogram and traitgram showing the distribution of 5’SS +4 A to U ratio phenotypes for species which retain or lack a METTL16 ortholog. Ancestral loss events shown on the traitgram are estimated using Dollo parsimony. (C) Circular tree showing the full phylogenetic relationship between 240 Saccharomycotina species identified using Orthofinder. Bifurcations are colored by confidence, calculated using the number of single-locus gene trees that support the bifurcation in the Orthofinder STAG algorithm (Emms and Kelly, 2019; Emms and Kelly, 2018). Clades defined in Figure 2A have been colored accordingly. Stacked bar charts show the 5’SS +4 position nucleotide fractions of each species identified from conserved introns. Outer circles show the presence (black) or absence (grey) of an ortholog from the orthogroups representing METTL16 (inner ring) and SNRNP27K (outer ring), respectively.

Figure 4.

Figure 4—figure supplement 1. Structural analysis of METTL16 orthologs in the Saccharomycetaceae clade.

Figure 4—figure supplement 1.

(A) Superimposition of the structure of human METTL16 MTD (green) solved by X-ray crystallography (PDB: 6DU4) with the Alphafold2 predicted structures of the MTD domains of putative METTL16 orthologs from E. gossypii (pink), E. cymbalariae (orange), and Z. parabailli (blue). (B) Multiple sequence alignment showing the conservation of the catalytic NPPF motif of METTL6 in the putative orthologs from subgenera of Eremothecium and Zygosaccharomyces. Generated using Jalview (Procter et al., 2021) (C) Heatmap showing the pairwise TMscore of segmented domains from the Alphafold2 predictions of METTL16 orthologs from Eremothecium and Zygosaccharomyces species, with selected outgroups. The X-ray structures of human METTL16 MTD and VCR, and TUT1 KA-1 are included as positive controls. Domains are grouped into clusters (diagonal boxes) using the same color scheme as in Figure 1A.

METTL16 has been lost from the majority of Saccharomycetaceae and Saccharomycodaceae, and from independent lineages of the CUG-Ser1 and Pichiaceae clades, in association with a switch in 5’SS sequence preference from +4 A to +4 U (Figure 4C). The exact number of independent losses in CUG-Ser1 is uncertain due to relatively low confidence splits separating genera within the clade. Two independent subgenera of Saccharomycetaceae, within the Zygosaccharomyces and Eremothecium genera, retain a METTL16 ortholog despite having a strong preference for 5’SS +4 U. Sequence and Alphafold2 structure analysis indicated that the deduced METTL16 protein sequences were complete, incorporating an MTD domain with an intact NPPF catalytic motif (Figure 4—figure supplement 1A–B) and a VCR/KA-1-like domain topologically similar to the human METTL16 VCR (Figure 4—figure supplement 1C). Association analyses cannot establish causality nor the direction of causation in the METTL16/5’SS +4 relationship. If the METTL16 orthologs in Zygosaccharomyces and Eremothecium are functional, expressed as protein, and target U6 snRNA, it may indicate that changes in 5’SS sequence preference in the Saccharomycetaceae occurred prior to METTL16 loss. Overall, we conclude that inter-species association mapping can link METTL16 with the sequence preference of a single nucleotide position in introns.

Inter-species association mapping links SNRNP27K Methionine 141 to 5’SS sequence preference

Of the splicing factor orthogroups that we analysed, the METTL16 orthogroup was correlated most strongly to 5’SS +4 nucleotide identity in Saccharomycotina (Figure 4A). Since this finding is consistent with the outcomes of transcriptome analysis of Arabidopsis and S. pombe METTL16 ortholog mutants (Ishigami et al., 2021; Parker et al., 2022) and the cryo-EM structures of S. cerevisiae and human spliceosomes (Bertram et al., 2017; Charenton et al., 2019; Parker et al., 2022; Wan et al., 2019), we next asked if any of the other correlated orthogroups might provide new insight into the evolution of splice site recognition.

After METTL16, the next most significant association with the 5’SS +4 A/U phenotype was with the orthogroup containing the human spliceosomal protein SNRNP27K (Figure 4A). We found that the absence of a SNRNP27K ortholog was associated with an increased preference for 5’SS +4 U in the Saccharomycotina (Figure 5A). Although many species that lack SNRNP27K also lack a METTL16 ortholog (Figure 4C), we found that the association of SNRNP27K with the 5’SS +4 A/U phenotype was significant even when controlling for the presence or absence of METTL16 (PGLS P=0.011). In comparison, the next most significant orthogroup, corresponding to orthologs of U2A’, was not significant when also controlling for the presence or absence of METTL16 and SNRNP27K (PGLS P=0.35). In species that lack METTL16, the preference for 5’SS +4 U is even stronger if those species also lack SNRNP27K, demonstrating that the effect of SNRNP27K on the 5’SS +4 A/U phenotype is not wholly explained by correlation with METTL16 (Figure 5B). SNRNP27K is a mostly disordered spliceosomal protein with SR-rich regions at the N-terminus and a conserved C-terminal domain (Fetzer, 1997; Zahler et al., 2018). The C-terminal domain of SNRNP27K is found near the flexible ACAGA sequence of U6 snRNA in cryo-EM structures of human pre-B spliceosomes.

Figure 5. Inter-species association mapping links SNRNP27K Methionine 141–5’SS sequence preference.

(A) Stacked histogram and traitgram showing the distribution of 5’SS +4 A to U ratio phenotypes for species which retain or lack a SNRNP27K ortholog. Ancestral loss events shown on the traitgram are estimated using Dollo parsimony. (B) Boxplots showing the distribution of 5’SS +4 A to U ratio phenotypes for species retaining or lacking a SNRNP27K ortholog, whilst also controlling for the presence or absence of a METTL16 ortholog. (C) Stem plot showing the association of individual conserved positions in the C-terminus of SNRNP27K with the 5’SS +4 A to U ratio phenotype in 158 species retaining a SNRNP27K ortholog. The most strongly associated position is Met141, which is in extreme proximity to the ACAGA box in the pre-B complex. (D) Cryo-EM snapshot of the C-terminus of SNRNP27K in the pre-B complex, in close proximity to the region occupied by the flexible ACAGA box of the U6 snRNA. The surface of PRP8 is shown in grey. (E) Overlayed Cryo-EM snapshots showing the position of SNRNP27K in the pre-B complex, relative to the position of the ordered U6/5’SS helix in the B complex. Positioning of the components labelled with asterisks was determined relative to the position of pre-B complex PRP8 by the superimposition of the PRP8 structures from the pre-B and B complexes.

Figure 5.

Figure 5—figure supplement 1. Sequence analysis of the SNRNP27K C-terminal domain.

Figure 5—figure supplement 1.

(A) Sequence logo showing the alignment of the SNRNP27K orthologs from 158 species to the SNRNP27K C-terminal domain pHMM model. Position 141 (marked with an asterisk) is most commonly occupied by a methionine residue but is replaced in some species with a valine or isoleucine. (B) Boxplots showing the distribution of 5’SS +4 A to U ratio phenotypes for species retaining a SNRNP27K ortholog with a methionine in the pHMM alignment position corresponding to human Met141, compared to those with a different residue at this position (X141), and those that lack a SNRNP27K ortholog entirely.

We next asked if inter-species association mapping could be used to examine not just the presence/absence of SNRNP27K, but functional variation in amino acid sequence in the conserved C-terminal domain. The curated profile hidden Markov model (pHMM) of the SNRNP27K C-terminus from Pfam (Paysan-Lafosse et al., 2023) was used to generate a consensus sequence for this region of the protein. The predicted protein sequences from 158 Saccharomycotina species that retain a SNRNP27K ortholog were aligned to the pHMM using the software tool Hmmer (Eddy and Pearson, 2011) and this alignment was used to generate a table capturing deviation from the consensus sequence in each species (Figure 5—figure supplement 1A). We then used PGLS to test whether there was an association with the 5’SS +4 A/U phenotype. The model position corresponding to Methionine 141 (M141) of human SNRNP27K had the strongest association with the 5’SS +4 A/U phenotype (Figure 5C, Figure 5—figure supplement 1B). In cryo-EM structures of human pre-B spliceosome complexes, SNRNP27K M141 is located within the U-shaped loop of the C-terminal domain close to the U6 snRNA ACAGA box, which is in a flexible orientation at this stage (Charenton et al., 2019; Figure 5D). SNRNP27K is not detectable in the subsequent spliceosomal B complex stage, where the U6/5’SS helix has formed. However, by superimposing the structures of PRP8 from the pre-B and B complexes as a common reference point, it is clear that Met141 of SNRNP27K is particularly close to the space that becomes occupied by U6 snRNA m6A43 (Figure 5E). Significantly, a mutation in the corresponding methionine residue of the C. elegans SNRNP27K ortholog SNRP-27 to threonine (M141T) was identified in a screen designed to reveal factors that modulate splicing fidelity (Zahler et al., 2018). The SNRP-27 M141T mutation caused a shift in 5’SS selection away from 5’SSs that had +4 A, to alternative 5’SSs that did not (Zahler et al., 2018). Therefore, orthogonal human spliceosome cryo-EM structures and C. elegans mutant RNA-sequencing data are consistent with inter-species association mapping linking SNRNP27K to 5’SS sequence selection.

Intron number correlates with 5’SS U5 and U6 snRNA interaction potential in species with 5’SS +4A sequence preference

When we previously characterised Arabidopsis fio1 mutants defective in U6 snRNA m6A modification, we found a global switch away from the selection of 5’SSs with +4 A to 5’SSs that not only had +4 U but also a stronger interaction potential with U5 snRNA loop 1 (Parker et al., 2022). This led us to investigate the architecture of annotated 5’SSs in genome sequences with respect to U5 and U6 snRNA base-pairing and we showed that there is a negative correlation between 5’SS U5 and U6 snRNA interaction potentials. Hence, two major classes of U2-dependent 5’SSs are found in plants and metazoans (Parker et al., 2022). The existence of two major classes of 5’SS may provide a mechanism for regulatable alternative splicing. Given that many Saccharomycotina species have undergone widespread intron loss and have rare or absent alternative splicing, we asked if a negative correlation between 5’SS U5 and U6 snRNA interaction potentials is found in these species.

Position-specific scoring matrices for the 5’SS –3 to –1 and +3 to+7 positions were used to assess the interaction potential with U5 and U6 snRNA, respectively (Figure A). We used the Spearman rank method to assess the level of correlation or anti-correlation between these U5 and U6 snRNA interaction potentials - we refer to this metric as U5/6ρ (Figure 6A). The number of conserved introns identified in each species was used as an estimate of intron number. The number of conserved introns and U5/6ρ were then compared using PGLS to control for phylogenetic structure. This analysis revealed that for species with fewer introns (and therefore, likely reduced splicing complexity), there was an increase in U5/6ρ, indicating weaker anti-correlation of U5 and U6 snRNA interaction potentials (Figure 6B). For species with approximately 500 or fewer conserved introns, U5/6ρ was more likely to be positive (Figure 6B). This suggests that as splicing complexity is reduced, anti-correlated U5 and U6 snRNA interaction potentials are lost. Therefore, the two classes of 5’SS found in plants and metazoans are not found in Saccharomycotina species with reduced splicing complexity.

Figure 6. 5’SS +4 sequence preference interacts with intron number and U5/U6 interaction strength patterns.

(A) Diagram showing the calculation of the U5/6ρ metric. For each species, all conserved 5’SSs were used to generate a log2 transformed position specific scoring matrix (PSSM) representing the consensus 5’SS sequence in that species. This PSSM was then used to score how well each individual 5’SS matched the consensus at the U5 and U6 snRNA interacting positions. These scores were used as a measure of U5 and U6 snRNA interaction potential for each 5’SS. Per-5’SS U5 and U6 snRNA interaction potentials were then correlated using the Spearman rank method to give a correlation coefficient for each species. This metric is referred to as U5/6ρ. In the example given, the species has a negative U5/6ρ indicating an overall anti-correlation between the U5 and U6 snRNA interaction potentials of individual 5’SSs. (B) Scatterplot with marginal histograms showing the relationship between the number of conserved introns (scatterplot x-axis), the correlation of U5 and U6snRNA interaction potentials (U5/6ρ, scatterplot y-axis), and 5’SS +4 nucleotide preference (color). Marginal histograms show the distribution of conserved intron size (top margin) and U5/6ρ (right margin) amongst species. For scatterplot points, 5’SS +4 A to U ratio is shaded using the color map shown on the top right. For marginal histograms, 5’SS +4 preference is discretised into an overall preference for either A (A to U ratio >0.2, blue), U (A to U ratio <= –0.2, orange), or no overall preference (–0.2<A to U ratio ≤ 0.2, grey). Confidence intervals for the U5/6ρ metric were obtained by bootstrapped resampling of conserved 5’SSs for each species before performing the calculations described in (A). The lowess (locally weighted scatterplot smoothing) regression line indicates a strong negative relationship between U5/6ρ and conserved intron number in intron rich species, compared to a weak relationship for intron poor species.

Figure 6.

Figure 6—figure supplement 1. Interspecies association mapping of the U5/6ρ phenotype.

Figure 6—figure supplement 1.

Volcano plot showing the results of the PGLS analysis of the U5/6ρ phenotype for orthogroups corresponding to known human splicing factors and METTL16. The magnitude and directionality of the model coefficients (x-axis) are used as a measure of effect size and direction of effect on phenotype, respectively.

The relationship between intron number and U5/6ρ was different for intron-rich species compared with intron-poor species. U5/6ρ is strongly negatively correlated with intron number amongst species with more than approximately 300 conserved introns, but weakly correlated amongst species with fewer conserved introns (Figure 6B). Most species with a 5'SS +4 U preference were intron-poor, and the median U5/6ρ for species with an overall preference for 5’SS +4 U was positive, with relatively few species showing a negative U5/6ρ. This suggests that a switch in preference towards 5’SS +4 U may accompany, or occur after, an overall reduction in intron number, splicing complexity and/or loss of alternative splicing.

U5/6ρ was tested as a splicing phenotype with inter-species association mapping, but no significant correlations were found after multiple testing correction (Figure 6—figure supplement 1). Overall, we conclude that a preference for 5’SS +4 U is a feature of species with low intron numbers and likely low splicing complexity. In contrast, a preference for 5’SS +4 A is a feature of diverse species that exhibit a correlation between intron number and the presence of two major classes of 5’SS that have anti-correlated U5 and U6 snRNA interaction potentials.

Discussion

The last eukaryotic common ancestor likely had higher intron density and more complex spliceosomes than is observed in developmentally simple eukaryotes like the saccharomycetous yeasts (Irimia et al., 2007; Jeffares et al., 2006; Sales-Lee et al., 2021). The processes that led to splicing simplification in these clades are unclear. We used phylogenetic analysis to reveal a high level of variation at the +4 position of 5’SSs across the Saccharomycotina. Inter-species association mapping demonstrates that variation at the 5’SS +4 position is correlated with the presence or absence of the U6 snRNA N6-methyladenosine methyltransferase, METTL16: when METTL16 is present, a 5’SS +4 A is preferred, but when METTL16 is absent, a 5’SS +4 U is preferred. We also show that a C-terminal domain that increases METTL16 specificity for the U6 snRNA (Aoyama et al., 2020) is ancestral to most eukaryotes. Together, these results suggest that the primary conserved role of METTL16 is to modify the ACAGA box of the U6 snRNA.

We identified a further correlation between 5’SS +4 preference and the spliceosomal protein SNRNP27K. In this case, we could also identify an association with the methionine residue corresponding to position 141 of human SNRNP27K in the conserved C-terminal domain of SNRNP27K orthologs. This finding is consistent with global transcriptome analysis of C. elegans SNRP-27 M141T mutants, which show a switch away from using 5’SS +4 A (Zahler et al., 2018). Therefore, inter-species association mapping can link the function of individual proteins to switches in 5’SS sequence preference found between species.

How do interactions at the 5’SS +4 position affect splice site selection?

Cryo-EM structures of human B complex spliceosomes show that U6 snRNA m6A43 forms a trans Hoogsteen sugar edge interaction with 5’SS +4 A (Bertram et al., 2017; Parker et al., 2022). The U6 snRNA m6A43 may allow 5’SS +4 preferences to become more flexible because biophysical data indicates that m6A stabilises A:A base pairs whilst destabilising A:U base pairs (Kierzek and Kierzek, 2003; Roost et al., 2015). Consistently, in the absence of METTL16 and U6 snRNA m6A modification, RNA sequencing data shows a switch in preference from 5’SS +4 A to 5’SS +4 U (Ishigami et al., 2021; Parker et al., 2022). Inter-species association mapping demonstrates that these findings from mutants within species are replicated at evolutionary timescales in different Saccharomycotina species.

Cryo-EM structures of S. cerevisiae spliceosomes show that a Watson-Crick base pair is made between 5’SS +4 U and the central non-m6A-modified adenosine of the U6 snRNA ACAGA box (Wan et al., 2019). Biophysical data indicate that although an m6A:U base pair can form in a duplex, the methylamino group rotates from a syn geometry on the Watson–Crick face to a higher energy anti-conformation, positioning the methyl group in the major groove (Roost et al., 2015). As a result, m6A has a destabilising effect on A:U base pairs in short RNA helices (Kierzek and Kierzek, 2003; Roost et al., 2015). Consistent with this, the splicing efficiency of introns that have 5’SS +4 U is increased in Arabidopsis fio1 (Parker et al., 2022) and S. pombe mtl16Δ (Ishigami et al., 2021) mutants. These findings are consistent with the inter-species association mapping results which link the absence of METTL16 to a preference for 5’SS +4 U. Therefore, m6A modification of U6 snRNA appears generally incompatible with a preferred selection of 5’SS +4 U sequences.

Inter-species association mapping indicated that the correlation between SNRNP27K presence/absence and the 5’SS +4 A/U phenotype persists in species lacking METTL16, suggesting that retention of an SNRNP27K ortholog allows more tolerance of 5’SS +4 A in the absence of METTL16 and U6 snRNA m6A modification. However, relatively few species of Saccharomycotina have lost SNRNP27K but retained a METTL16 ortholog, hinting that the role of SNRNP27K might be necessary when U6 snRNA is m6A-modified. SNRNP27K is detected in pre-B spliceosomes close to the U6 snRNA ACAGA sequence, which is in a flexible configuration prior to the transfer of the 5’SS from U1 snRNA (Charenton et al., 2019). Met141 of SNRNP27K is located close to the position where the m6A43:5’SS +4 A pairing of the U6/5’SS helix forms in B complexes, although SNRNP27K has left the spliceosome by this stage. Additional cryo-EM snapshots of intermediate spliceosome states are required to determine the conformational and compositional changes that occur between the pre-B and B complexes. However, it seems likely that SNRNP27K stabilises the formation of the U6/5’SS helix product state by chaperoning the arrangement of the flexible U6 snRNA ACAGA sequence for 5’SS docking.

Why does 5’SS +4 sequence preference switch from A to U?

Our analysis suggests that METTL16 and 5’SSs with +4 A sequence preference are features of early eukaryotes. In contrast, 5’SSs with a +4 U preference appear to have evolved later and on multiple independent occasions in saccharomycetous yeasts, a group of species that have simplified developmental complexity and spend most or all of their life cycles in a unicellular form (Nagy et al., 2014). A series of changes are associated with species that have reduced multicellularity, including reductions in intron density and the evolution of more invariant 5’SS sequences (Irimia et al., 2007; Lim et al., 2021; Sales-Lee et al., 2021; Schwartz et al., 2008). We demonstrate that a switch to 5’SS +4 U is a feature of these changes in Saccharomycotina.

A key question arising from evolutionary differences in 5’SS sequence preferences is why a switch between 5’SS +4 A/U is found. We previously reported that there are two major classes of 5’SS in many eukaryotes, defined by their anti-correlated interaction potential with U5 and U6 snRNA (Kenny et al., 2022; Parker et al., 2022). This variation in 5’SS composition may provide regulatory potential in alternative splice site selection because pairs of alternative 5’SSs tend to be from the two opposing classes (Parker et al., 2022). Our analysis demonstrates that the anti-correlation of U5 and U6 snRNA interaction potentials is weaker, and even becomes positive, in species that have a reduced number of conserved introns. This may be due to the reduced importance of alternative splicing as a regulatory mechanism of gene expression in these species. Furthermore, we show that major switches in 5’SS +4 preference have occurred almost exclusively in Saccharomycotina species that lack a strong anti-correlation between U5 and U6 snRNA interaction potentials. We hypothesise that U6 snRNA m6A modification might provide plasticity to 5’SS selection, either through the modulation of U6 snRNA m6A levels directly, or through protein chaperones such as SNRNP27K. This plasticity could be harnessed either within individuals by means of regulatory networks, or in populations due to natural variation in the relevant splicing factors (Price et al., 2020; Sasaki et al., 2015). In species that have lost splicing complexity, U6 snRNA m6A modification may have become detrimental, since plasticity intrinsically means weaker signals that are more prone to cryptic or mis-splicing. Similar reasoning has previously been suggested to explain the link between reduced intron density and more invariant 5’SS sequences (Irimia et al., 2007). Switches in 5’SS +4 preference from A to U may therefore represent one of the final stages of splicing simplification.

What is the order of events directing evolutionary change in splicing signal phenotypes?

Inter-species association mapping reveals genotypes that correlate with a phenotype but cannot necessarily prove causation. Consequently, it is not possible to reconstruct the order of evolutionary changes we detect. However, different scenarios that might explain these findings can be considered. For example, sudden loss of METTL16 might cause an urgent necessity to change 5’SS sequence preferences. This possibility seems unlikely as such rapid change without widespread corresponding 5’SS changes would likely impose a high fitness cost. Alternatively, change in 5’SS sequence from +4 A to +4 U preference could occur first, driven by some other selective pressure such as widespread intron loss, until there is no longer a benefit to retaining the METTL16 gene. This scenario is consistent with the fact that we could detect the METTL16 gene in Zygosaccharomyces and Eremothecium species that have altered their 5’SS +4 preference to a U. However, biophysical data of m6A:U interactions and global splicing analyses of METTL16 ortholog mutants indicate that the co-occurrence of METTL16 and +4 U are generally incompatible (Ishigami et al., 2021; Kierzek and Kierzek, 2003; Parker et al., 2022; Roost et al., 2015). Instead, it is possible that gradual changes in the expression or catalytic efficiency of METTL16 could reduce the stoichiometry of U6 snRNA m6A-modification, permitting a gradual change in 5’SS +4 sequence preference, until complete loss of the METTL16 gene no longer imposes a major fitness cost. Future work might examine these scenarios by determining whether the METTL16 genes detected in Zygosaccharomyces and Eremothecium species are expressed and functional.

Inter-species association mapping is under-utilised

Phylogenetics has played important roles in developing a mechanistic understanding of the RNA components of the spliceosome (Guthrie, 2010). Our analysis demonstrates that inter-species association mapping can be a powerful approach to elucidate molecular interactions controlling splicing. The 5’SS +4 preference phenotype studied here appears to be an example of a relatively simple trait that is predominantly associated with the presence or absence of a small number of genes, i.e., METTL16 and SNRNP27K. Other splicing phenotypes are likely to be more complex, involving smaller contributions from numerous genes as well as associations at the level of protein domains or individual amino acids. These issues may explain why we were unable to identify any candidate orthogroups associated with the U5/6ρ phenotype. As in GWAS, the solution to understanding complex traits may be larger studies (Tam et al., 2019). However, analysing increasing numbers of genomes will require scalable tools for high-quality gene annotation and ortholog clustering (for example, using gene synteny relationships). Since splicing occurs on RNA, transcriptomic data will be valuable for measuring phenotypes such as rates of alternative splicing, as well as aiding the annotation of protein-coding genes. The increasing availability of tree-of-life-scale genomics data from ambitious sequencing projects that aim to catalogue the genomes of all eukaryotic life on Earth (Darwin Tree of Life Project Consortium, 2022; Lewin et al., 2018) will provide resources to improve the quality and quantity of data to broaden the future application of inter-species association mapping (Smith et al., 2020).

Materials and methods

METTL16 structural bioinformatics

Orthologous METTL16 sequences and Alphafold2 structural predictions were manually curated from UniProt using the EBI phmmer server (Potter et al., 2018; Bateman et al., 2023; Varadi et al., 2022). Alphafold2 structural predictions were segmented into domains using network analysis as previously described (Oeffner et al., 2022). For each pair of residues i and j with minimum predicted local distance difference test (pLDDT) values of greater than 70 and a pairwise alignment error (PAE) of less than 5, we connected the residues in a graph using an edge weight of PAEij–1. Residues were then clustered into domains of connected amino acids using the greedy modularity maximisation method implemented in networkx (Hagberg et al., 2008). Domains with fewer than 20 residues were discarded.

To assess the topological similarity of the discovered domains with human METTL16 MTD and VCR/KA-1 domains, the crystal structures of METTL16 MTD (PDB: 6B91; Ruszkowska et al., 2018), METTL16 VCR (PDB: 6M1U; Aoyama et al., 2020) and TUT1 KA-1 (PDB: 5WU5; Yamashita et al., 2017) were downloaded from the Protein Data Bank (Burley et al., 2019). Pairwise TM-scores were calculated for all pairs of experimentally determined and predicted domains using TM-align (Zhang and Skolnick, 2005). Domains with a TM-score of 0.45 or greater compared to human METTL16 MTD (6B91) or VCR (6M1U) were assigned to the MTD-like and KA-1/VCR-like clusters, respectively. Domains with TM-scores of 0.45 or greater compared to both METTL16 MTD and VCR domains were assigned to the MTD +KA-1/VCR-like cluster. Multiple structure alignments and overall cluster TM-scores were generated using US-align (Zhang et al., 2022). Figures were drawn using matplotlib, PyMol, ETE 3, and Jalview (Huerta-Cepas et al., 2016; Hunter, 2007; Procter et al., 2021; Schrödinger, 2015).

Genome annotation, orthogroup clustering and 5’SS+4 phenotype estimation

The representative genomes of 230 Saccharomycotina species and 10 outgroups were downloaded from NCBI. Where possible, the RefSeq genome was used, however for some species only GenBank versions were available. See Supplementary file 1 for the full list of genome metadata. For the 10 well-annotated outgroups and 3 reference Saccharomycotina species (S. cerevisiae, C. albicans, Y. lipolytica), the publicly available gene annotations were also downloaded in GTF format. For all other species, genomes were cleaned (by removal of short repetitive contigs), repeat masked, and reannotated using the Funannotate pipeline using default settings (with internal processing using Minimap2, Tantan, BUSCO, AUGUSTUS, GeneMark, GlimmerHMM and SNAP) (Borodovsky and Lomsadze, 2011; Frith, 2011; Korf, 2004; Li, 2018; Majoros et al., 2004; Palmer and Stajich, 2020; Seppey et al., 2019; Stanke et al., 2006). Seed species for gene finding were chosen based on known phylogenetic clades: Lipomycetaceae, Trigonopsidaceae and Dipodascaceae species were seeded using Y. lipolytica. Alloascoideaceae, Sporopachydermia, CUG-Ala, Pichiaceae and CUG-Ser1 species were seeded using C. albicans. CUG-Ser2, Phaffomycetaceae, Saccharomycodaceae and Saccharomycetaceae species were seeded using S. cerevisiae.

After gene prediction, protein sequences were clustered into orthogroups using Orthofinder version 2.5.4 (Emms and Kelly, 2019) with DIAMOND for pairwise proteome comparisons and DendroBLAST for gene tree inference (Buchfink et al., 2021; Kelly and Maini, 2013). Orthofinder also produces a species tree using the STAG method (Emms and Kelly, 2018). We found that many species initially had spurious ortholog absences due to gene prediction failure. To address this for the 158 splicing factors being examined, we used the human orthologs (Sales-Lee et al., 2021) to identify relevant orthogroups and use the sequences from these to build profile hidden Markov models (pHMM) using MAFFT and hmmer (Eddy and Pearson, 2011; Katoh and Standley, 2013). Score thresholds for each pHMM were determined by aligning the input sequences to the pHMM and selecting the 2.5% percentile (i.e. the score threshold that would recover 97.5% of the input sequences). These were then used to search six-frame translations generated from genome sequences to recover missing orthologs. Finally, the recovered sequences were re-clustered using Orthofinder to generate refined splicing factor orthogroups which were annotated using the human orthologs present within them.

We found that initial intron predictions generated by the Funannotate pipeline contained many false positive introns that negatively affected the estimation of splicing signal motifs. We reasoned that since most genuine introns are conserved across multiple species, and incorrectly annotated introns are unlikely to be in the same position in multiple species by chance, then identifying conserved introns would be a suitable method to reduce the false positive rate. To do this, we took the protein sequences for the initial orthogroups (generated from gene predictions rather than six-frame translations of splicing factors) and generated multiple sequence alignments using MAFFT. Very large orthogroups (with an average of more than 2 orthologs per species) were skipped as a time-saving heuristic. Once multiple sequence alignments were generated, we filtered for introns that were conserved in the same alignment column and frame in at least two species. This significantly improved the information content of splicing motifs at the 5’ and 3’SSs of introns, meaning that 5’SS +4 preference phenotypes could be estimated. Since the number of conserved introns per species varied by several orders of magnitude, we also used bootstrapped resampling of introns to estimate the 95% confidence intervals on the 5’SS +4 preference phenotypes.

Phylogenetic generalised least squares analysis

After a genotype (in the form of orthogroups), phenotype (5’SS +4 preference) and phylogenetic relationships (species tree) had been collected, we were able to proceed with PGLS analysis. Phylogenetic dependence was estimated from the species tree by generating a variance-covariance matrix. Variances were set as the distance from the root to species leaf, and covariances were set as the distance from the root to the last common ancestor of the two species (de Villemereuil et al., 2012). For each orthogroup, the data were binarized to give the presence/absence of an ortholog for each species. Orthogroups were filtered using Dollo parsimony to retain those predicted to have undergone at least two independent loss events (Farris, 1977; Rogozin et al., 2003). PGLS analysis was then conducted for each orthogroup using the GLS method from statsmodels, controlled with the covariance matrix (Seabold and Perktold, 2010). To account for uncertainty in the 5’SS +4 phenotype, bootstrapped 5’SS +4 A to U ratios estimated using resampling of conserved introns were used to perform 100 independent PGLS tests per orthogroup, which were then combined to give a unified model coefficient (with 95% confidence intervals) and p value using the GLS pooling method from the statsmodels implementation of MICE (Seabold and Perktold, 2010). p Values for each orthogroup were then corrected for multiple testing using the Benjamini-Hochberg method.

For the individual amino acid PGLS analyses of the SNRNP27K C-terminus, genotypes used for testing were prepared as follows: the curated pHMM of the SNRNP27K C-terminus was downloaded from Pfam (PF08648). SNRNP27K orthologs were aligned to the pHMM using hmmalign (Eddy and Pearson, 2011), discarding insertions to the model, and the highest-scoring ortholog for each species was selected. The consensus sequence of the pHMM model was collected using hmmemit (Eddy and Pearson, 2011). For each ortholog sequence, at each position in the model, a score was assigned by looking up the pairwise similarity scores of the aligned ortholog amino acid and the consensus amino acid in the BLOSUM62 matrix. Scores for each column were standardised by subtracting the mean score and dividing by the standard deviation. The resulting table was used to test each conserved position separately, using the same bootstrapped PGLS method described above.

U6 and U1 snRNA conservation analysis

A seed alignment of U6 snRNA sequences from the 10 outgroups and 3 reference Saccharomycotina genomes was generated using U6 snRNA sequences manually curated from RNACentral (Sweeney et al., 2021) and the secondary structural model of the human U6 snRNA (Aoyama et al., 2020; Montemayor et al., 2014). Sequences were aligned to the sequence and secondary structural model of human U6 snRNA using Infernal cmalign to create the seed alignment (Nawrocki and Eddy, 2013). This seed alignment was then used to create a covariance model that was used to search the genomes of the 240 species using Infernal cmsearch (Nawrocki and Eddy, 2013). We were unable to identify a U6 snRNA gene in the representative genome of Komagataella phaffii using this method. This appears to be the result of an assembly error since a U6 snRNA sequence could be manually recovered from other published K. phaffii genomes by BLAST search with the U6 snRNA sequence identified in K. pastoris. The U6 snRNA of S. pombe was also not detected by this approach because it is interrupted by a pre-mRNA-like intron (Potashkin and Frendewey, 1989). Therefore, we manually annotated the S. pombe U6 snRNA sequence using data from RNAcentral (Sweeney et al., 2021). The highest scoring hit per species was selected as the representative U6 snRNA ortholog for each species and then aligned to the human U6 snRNA structure using cmalign. Finally, R-scape and R2R were used to plot conservation and covariation relationships onto the consensus U6 snRNA structure (Rivas et al., 2017; Weinberg and Breaker, 2011).

An initial model of U1 stem 1 was created by manually truncating the covariance model of metazoan U1 snRNA from Rfam to include only the 5’ end and U1 stem 1 region. The sequences of U1 snRNA from 7 outgroups and 15 species of Saccharomycotina were collected either from RNAcentral or manual curation from genomic sequences using the U1 stem 1 covariance model and aligned using the initial covariance model to create a seed alignment. This seed alignment was then used to create a final covariance model that was used to search the genomes of all 240 species using the same procedures as described above for U6 snRNA. We were unable to detect U1 snRNA genes from Cryptococcus neoformans, Hanseniaspora uvarum, Hanseniaspora guilliermondii, or Saturnispora silvae using this approach.

Phylogenetic trees and traitgrams

Ancestral character estimations for the 5’SS +4 A to U ratio (used for traitograms) were estimated using the ACE method from the R package Ape (Paradis et al., 2004). Ancestral splicing motif nucleotide fractions (i.e. the stacked bar preferences in Figure 2A) were estimated using Sankoff parsimony using a previously described method (Schwartz et al., 2008), with nucleotide fractions discretised to multiples of 0.05. Phlyogenetic trees and traitgrams were drawn using ete3 and matplotlib (Huerta-Cepas et al., 2016; Hunter, 2007).

Analysis of cryo-electron microscopy structures

The cryo-EM structures of the human pre-B (PDB: 6QX9; Charenton et al., 2019) and B complexes (PDB: 6AHD; Zhan et al., 2018) were downloaded from the Protein Data Bank (Burley et al., 2019). The pre-B and B complex structures were superimposed with PyMol using the relative positions of the spliceosomal protein PRP8 (Schrödinger, 2015).

Analysis of U5 and U6 snRNA interaction potential correlations

U5/6ρ for each species was estimated using conserved introns (as defined in Materials and methods section ‘Genome annotation, orthogroup clustering and 5’SS +4 phenotype estimation’). For each species, conserved introns were used to create a position-specific scoring matrix describing the 5’SS consensus sequence. The –3 to –1 positions of this matrix were used to score the U5 snRNA interacting potential of each 5’SS, whilst the +3 to+7 positions were used to score U6 snRNA interacting potential. The Spearman rank correlation coefficient of these two scores was used to define U5/6ρ. Bootstrapped 95% confidence intervals were generated by resampling introns from each species with replacement. The association between U5/6ρ and conserved intron number was assessed using PGLS (as described in Materials and methods section ‘Phylogenetic generalised least squares analysis’).

Code availability

All pipelines, scripts and notebooks used to generate figures are available from GitHub at github.com/bartongroup/mettl16_phylogenetics (copy archived at Bartongroup, 2023).

Acknowledgements

We thank Alper Akay, Brendan Davies, Carey Metheringham and Korbinian Schneeberger for comments on the manuscript. This work was supported by awards from the BBSRC (BB/V010662/1 to GGS and BB/W007673/1 to GGS and GJB); SMF is a Wellcome Trust and Royal Society Sir Henry Dale Fellow (grant number 220212/Z/20/Z).

Funding Statement

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication. For the purpose of Open Access, the authors have applied a CC BY public copyright license to any Author Accepted Manuscript version arising from this submission.

Contributor Information

Gordon G Simpson, Email: g.g.simpson@dundee.ac.uk.

Christian R Landry, Université Laval, Canada.

Christian R Landry, Université Laval, Canada.

Funding Information

This paper was supported by the following grants:

  • Biotechnology and Biological Sciences Research Council BB/V010662/1 to Gordon G Simpson.

  • Biotechnology and Biological Sciences Research Council BB/W007673/1 to Geoffrey J Barton, Gordon G Simpson.

  • Wellcome Trust 220212/Z/20/Z to Sebastian M Fica.

Additional information

Competing interests

No competing interests declared.

Author contributions

Conceptualization, Software, Formal analysis, Investigation, Visualization, Methodology, Writing – original draft, Writing – review and editing.

Conceptualization, Formal analysis, Funding acquisition, Writing – review and editing.

Conceptualization, Supervision, Funding acquisition, Project administration, Writing – review and editing.

Conceptualization, Supervision, Funding acquisition, Writing – original draft, Project administration, Writing – review and editing.

Additional files

Supplementary file 1. Table containing the 240 genome assemblies used for phylogenetic association mapping.

All assemblies were downloaded from NCBI. Where available, RefSeq assemblies were preferred over GenBank.

elife-91997-supp1.xlsx (30.3KB, xlsx)
MDAR checklist

Data availability

This study uses the publicly available genome sequences of 240 different species. The details of the previously used data sets corresponding to these 240 genome sequences are reported in Supplementary file 1.

References

  1. Aoyama T, Yamashita S, Tomita K. Mechanistic insights into m6A modification of U6 snRNA by human METTL16. Nucleic Acids Research. 2020;48:5157–5168. doi: 10.1093/nar/gkaa227. [DOI] [PMC free article] [PubMed] [Google Scholar]
  2. Bartongroup Mettl16_Phylogenetics. swh:1:rev:da470022f5810bf0607854d017deca90f1889ca7Software Heritage. 2023 https://archive.softwareheritage.org/swh:1:dir:432d5d431d5244d39f0bc5fdba63317ce37a1513;origin=https://github.com/bartongroup/mettl16_phylogenetics;visit=swh:1:snp:e9389e474ee08da4c39cb6a3b5a58bacb1b38fc7;anchor=swh:1:rev:da470022f5810bf0607854d017deca90f1889ca7
  3. Bateman A, Martin MJ, Orchard S, Magrane M, Ahmad S, Alpi E, Bowler-Barnett EH, Britto R, Bye-A-Jee H, Cukura A, Denny P, Dogan T, Ebenezer T, Fan J, Garmiri P, da Costa Gonzales LJ, Hatton-Ellis E, Hussein A, Ignatchenko A, Insana G, Ishtiaq R, Joshi V, Jyothi D, Kandasaamy S, Lock A, Luciani A, Lugaric M, Luo J, Lussi Y, MacDougall A, Madeira F, Mahmoudy M, Mishra A, Moulang K, Nightingale A, Pundir S, Qi G, Raj S, Raposo P, Rice DL, Saidi R, Santos R, Speretta E, Stephenson J, Totoo P, Turner E, Tyagi N, Vasudev P, Warner K, Watkins X, Zaru R, Zellner H, Bridge AJ, Aimo L, Argoud-Puy G, Auchincloss AH, Axelsen KB, Bansal P, Baratin D, Batista Neto TM, Blatter MC, Bolleman JT, Boutet E, Breuza L, Gil BC, Casals-Casas C, Echioukh KC, Coudert E, Cuche B, de Castro E, Estreicher A, Famiglietti ML, Feuermann M, Gasteiger E, Gaudet P, Gehant S, Gerritsen V, Gos A, Gruaz N, Hulo C, Hyka-Nouspikel N, Jungo F, Kerhornou A, Le Mercier P, Lieberherr D, Masson P, Morgat A, Muthukrishnan V, Paesano S, Pedruzzi I, Pilbout S, Pourcel L, Poux S, Pozzato M, Pruess M, Redaschi N, Rivoire C, Sigrist CJA, Sonesson K, Sundaram S, Wu CH, Arighi CN, Arminski L, Chen C, Chen Y, Huang H, Laiho K, McGarvey P, Natale DA, Ross K, Vinayaka CR, Wang Q, Wang Y, Zhang J. UniProt: the Universal Protein Knowledgebase in 2023. Nucleic Acids Research. 2023;51:D523–D531. doi: 10.1093/nar/gkac1052. [DOI] [PMC free article] [PubMed] [Google Scholar]
  4. Bénitìere F, Necsulea A, Duret L. Random Genetic Drift Sets an Upper Limit on Mrna Splicing Accuracy in Metazoans. bioRxiv. 2023 doi: 10.1101/2022.12.09.519597. [DOI] [PMC free article] [PubMed]
  5. Bertram K, Agafonov DE, Dybkov O, Haselbach D, Leelaram MN, Will CL, Urlaub H, Kastner B, Lührmann R, Stark H. Cryo-EM structure of a pre-catalytic human spliceosome primed for activation. Cell. 2017;170:701–713. doi: 10.1016/j.cell.2017.07.011. [DOI] [PubMed] [Google Scholar]
  6. Borodovsky M, Lomsadze A. Eukaryotic gene prediction using geneMark.hmm‐E and GeneMark‐ES. Current Protocols in Bioinformatics. 2011;35:4. doi: 10.1002/0471250953.bi0406s35. [DOI] [PMC free article] [PubMed] [Google Scholar]
  7. Buchfink B, Reuter K, Drost HG. Sensitive protein alignments at tree-of-life scale using DIAMOND. Nature Methods. 2021;18:366–368. doi: 10.1038/s41592-021-01101-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
  8. Burley SK, Berman HM, Bhikadiya C, Bi C, Chen L, Costanzo LD, Christie C, Duarte JM, Dutta S, Feng Z, Ghosh S, Goodsell DS, Green RK, Guranovic V, Guzenko D, Hudson BP, Liang Y, Lowe R, Peisach E, Periskova I, Randle C, Rose A, Sekharan M, Shao C, Tao Y-P, Valasatava Y, Voigt M, Westbrook J, Young J, Zardecki C, Zhuravleva M, Kurisu G, Nakamura H, Kengaku Y, Cho H, Sato J, Kim JY, Ikegawa Y, Nakagawa A, Yamashita R, Kudou T, Bekker G-J, Suzuki H, Iwata T, Yokochi M, Kobayashi N, Fujiwara T, Velankar S, Kleywegt GJ, Anyango S, Armstrong DR, Berrisford JM, Conroy MJ, Dana JM, Deshpande M, Gane P, Gáborová R, Gupta D, Gutmanas A, Koča J, Mak L, Mir S, Mukhopadhyay A, Nadzirin N, Nair S, Patwardhan A, Paysan-Lafosse T, Pravda L, Salih O, Sehnal D, Varadi M, Vařeková R, Markley JL, Hoch JC, Romero PR, Baskaran K, Maziuk D, Ulrich EL, Wedell JR, Yao H, Livny M, Ioannidis YE, wwPDB consortium Protein Data Bank: the single global archive for 3D macromolecular structure data. Nucleic Acids Research. 2019;47:D520–D528. doi: 10.1093/nar/gky949. [DOI] [PMC free article] [PubMed] [Google Scholar]
  9. Bush SJ, Chen L, Tovar-Corona JM, Urrutia AO. Alternative splicing and the evolution of phenotypic novelty. Philosophical Transactions of the Royal Society of London. Series B, Biological Sciences. 2017;372:20150474. doi: 10.1098/rstb.2015.0474. [DOI] [PMC free article] [PubMed] [Google Scholar]
  10. Charenton C, Wilkinson ME, Nagai K. Mechanism of 5’ splice site transfer for human spliceosome activation. Science. 2019;364:362–367. doi: 10.1126/science.aax3289. [DOI] [PMC free article] [PubMed] [Google Scholar]
  11. Chen L, Bush SJ, Tovar-Corona JM, Castillo-Morales A, Urrutia AO. Correcting for differential transcript coverage reveals a strong relationship between alternative splicing and organism complexity. Molecular Biology and Evolution. 2014;31:1402–1413. doi: 10.1093/molbev/msu083. [DOI] [PMC free article] [PubMed] [Google Scholar]
  12. Csuros M, Rogozin IB, Koonin EV. A detailed history of intron-rich eukaryotic ancestors inferred from A global survey of 100 complete genomes. PLOS Computational Biology. 2011;7:e1002150. doi: 10.1371/journal.pcbi.1002150. [DOI] [PMC free article] [PubMed] [Google Scholar]
  13. Darwin Tree of Life Project Consortium Sequence locally, think globally: the darwin tree of life project. PNAS. 2022;119:e2115642118. doi: 10.1073/pnas.2115642118. [DOI] [PMC free article] [PubMed] [Google Scholar]
  14. de Villemereuil P, Wells JA, Edwards RD, Blomberg SP. Bayesian models for comparative analysis integrating phylogenetic uncertainty. BMC Evolutionary Biology. 2012;12:102. doi: 10.1186/1471-2148-12-102. [DOI] [PMC free article] [PubMed] [Google Scholar]
  15. Dujon B, Sherman D, Fischer G, Durrens P, Casaregola S, Lafontaine I, De Montigny J, Marck C, Neuvéglise C, Talla E, Goffard N, Frangeul L, Aigle M, Anthouard V, Babour A, Barbe V, Barnay S, Blanchin S, Beckerich J-M, Beyne E, Bleykasten C, Boisramé A, Boyer J, Cattolico L, Confanioleri F, De Daruvar A, Despons L, Fabre E, Fairhead C, Ferry-Dumazet H, Groppi A, Hantraye F, Hennequin C, Jauniaux N, Joyet P, Kachouri R, Kerrest A, Koszul R, Lemaire M, Lesur I, Ma L, Muller H, Nicaud J-M, Nikolski M, Oztas S, Ozier-Kalogeropoulos O, Pellenz S, Potier S, Richard G-F, Straub M-L, Suleau A, Swennen D, Tekaia F, Wésolowski-Louvel M, Westhof E, Wirth B, Zeniou-Meyer M, Zivanovic I, Bolotin-Fukuhara M, Thierry A, Bouchier C, Caudron B, Scarpelli C, Gaillardin C, Weissenbach J, Wincker P, Souciet J-L. Genome evolution in yeasts. Nature. 2004;430:35–44. doi: 10.1038/nature02579. [DOI] [PubMed] [Google Scholar]
  16. Eddy SR, Pearson WR. Accelerated Profile HMM Searches. PLOS Computational Biology. 2011;7:e1002195. doi: 10.1371/journal.pcbi.1002195. [DOI] [PMC free article] [PubMed] [Google Scholar]
  17. Emms DM, Kelly S. STAG: species tree inference from all genes. bioRxiv. 2018 doi: 10.1101/267914. [DOI]
  18. Emms DM, Kelly S. OrthoFinder: phylogenetic orthology inference for comparative genomics. Genome Biology. 2019;20:238. doi: 10.1186/s13059-019-1832-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
  19. Engel SR, Wong ED, Nash RS, Aleksander S, Alexander M, Douglass E, Karra K, Miyasato SR, Simison M, Skrzypek MS, Weng S, Cherry JM, Wood V. New data and collaborations at the Saccharomyces Genome Database: updated reference genome, alleles, and the Alliance of Genome Resources. Genetics. 2022;220:iyab224. doi: 10.1093/genetics/iyab224. [DOI] [PMC free article] [PubMed] [Google Scholar]
  20. Farris JS. Phylogenetic analysis under dollo’s law. Systematic Biology. 1977;26:77–88. doi: 10.1093/sysbio/26.1.77. [DOI] [Google Scholar]
  21. Fetzer SJ. Practice characteristics of the dual certificant--CPAN/CAPA. Journal of Perianesthesia Nursing. 1997;12:240–244. doi: 10.1016/s1089-9472(97)80004-4. [DOI] [PubMed] [Google Scholar]
  22. Fica SM. Cryo-EM snapshots of the human spliceosome reveal structural adaptions for splicing regulation. Current Opinion in Structural Biology. 2020;65:139–148. doi: 10.1016/j.sbi.2020.06.018. [DOI] [PubMed] [Google Scholar]
  23. Frith MC. A new repeat-masking method enables specific detection of homologous sequences. Nucleic Acids Research. 2011;39:e23. doi: 10.1093/nar/gkq1212. [DOI] [PMC free article] [PubMed] [Google Scholar]
  24. Guthrie C. From the ribosome to the spliceosome and back again. The Journal of Biological Chemistry. 2010;285:1–12. doi: 10.1074/jbc.X109.080580. [DOI] [PMC free article] [PubMed] [Google Scholar]
  25. Hagberg A, Swart P, Chult D. Exploring network structure, dynamics, and function using networkx (no.LA-UR-08-05495; LA-UR-08-5495). Los Alamos National Lab. (LANL), Los Alamos, NM (United States.2008. [Google Scholar]
  26. Huerta-Cepas J, Serra F, Bork P. ETE 3: reconstruction, analysis, and visualization of phylogenomic data. Molecular Biology and Evolution. 2016;33:1635–1638. doi: 10.1093/molbev/msw046. [DOI] [PMC free article] [PubMed] [Google Scholar]
  27. Hunter JD. Matplotlib: A 2D graphics environment. Computing in Science & Engineering. 2007;9:90–95. doi: 10.1109/MCSE.2007.55. [DOI] [Google Scholar]
  28. Irimia M, Penny D, Roy SW. Coevolution of genomic intron number and splice sites. Trends in Genetics. 2007;23:321–325. doi: 10.1016/j.tig.2007.04.001. [DOI] [PubMed] [Google Scholar]
  29. Ishigami Y, Ohira T, Isokawa Y, Suzuki Y, Suzuki T. A single m6A modification in U6 snRNA diversifies exon sequence at the 5’ splice site. Nature Communications. 2021;12:3244. doi: 10.1038/s41467-021-23457-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
  30. Jeffares DC, Mourier T, Penny D. The biology of intron gain and loss. Trends in Genetics. 2006;22:16–22. doi: 10.1016/j.tig.2005.10.006. [DOI] [PubMed] [Google Scholar]
  31. Ju J, Aoyama T, Yashiro Y, Yamashita S, Kuroyanagi H, Tomita K. Structure of the Caenorhabditis elegans m6A methyltransferase METT10 that regulates SAM homeostasis. Nucleic Acids Research. 2023;51:2434–2446. doi: 10.1093/nar/gkad081. [DOI] [PMC free article] [PubMed] [Google Scholar]
  32. Katoh K, Standley DM. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Molecular Biology and Evolution. 2013;30:772–780. doi: 10.1093/molbev/mst010. [DOI] [PMC free article] [PubMed] [Google Scholar]
  33. Kelly S, Maini PK. DendroBLAST: approximate phylogenetic trees in the absence of multiple sequence alignments. PLOS ONE. 2013;8:e58537. doi: 10.1371/journal.pone.0058537. [DOI] [PMC free article] [PubMed] [Google Scholar]
  34. Kenny CJ, McGurk MP, Burge CB. Human LUC7 Proteins Impact Splicing of Two Major Subclasses of 5’ Splice Sites. bioRxiv. 2022 doi: 10.1101/2022.12.07.519539. [DOI]
  35. Kiefer C, Willing EM, Jiao WB, Sun H, Piednoël M, Hümann U, Hartwig B, Koch MA, Schneeberger K. Interspecies association mapping links reduced CG to TG substitution rates to the loss of gene-body methylation. Nature Plants. 2019;5:846–855. doi: 10.1038/s41477-019-0486-9. [DOI] [PubMed] [Google Scholar]
  36. Kierzek E, Kierzek R. The thermodynamic stability of RNA duplexes and hairpins containing N6-alkyladenosines and 2-methylthio-N6-alkyladenosines. Nucleic Acids Research. 2003;31:4472–4480. doi: 10.1093/nar/gkg633. [DOI] [PMC free article] [PubMed] [Google Scholar]
  37. Korf I. Gene finding in novel genomes. BMC Bioinformatics. 2004;5:59. doi: 10.1186/1471-2105-5-59. [DOI] [PMC free article] [PubMed] [Google Scholar]
  38. Lee Y, Rio DC. Mechanisms and regulation of alternative pre-mRNA Splicing. Annual Review of Biochemistry. 2015;84:291–323. doi: 10.1146/annurev-biochem-060614-034316. [DOI] [PMC free article] [PubMed] [Google Scholar]
  39. Lewin HA, Robinson GE, Kress WJ, Baker WJ, Coddington J, Crandall KA, Durbin R, Edwards SV, Forest F, Gilbert MTP, Goldstein MM, Grigoriev IV, Hackett KJ, Haussler D, Jarvis ED, Johnson WE, Patrinos A, Richards S, Castilla-Rubio JC, van Sluys M-A, Soltis PS, Xu X, Yang H, Zhang G. Earth BioGenome Project: Sequencing life for the future of life. PNAS. 2018;115:4325–4333. doi: 10.1073/pnas.1720115115. [DOI] [PMC free article] [PubMed] [Google Scholar]
  40. Li H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics. 2018;34:3094–3100. doi: 10.1093/bioinformatics/bty191. [DOI] [PMC free article] [PubMed] [Google Scholar]
  41. Lim CS, Weinstein BN, Roy SW, Brown CM. Analysis of fungal genomes reveals commonalities of intron gain or loss and functions in intron-poor species. Molecular Biology and Evolution. 2021;38:4166–4186. doi: 10.1093/molbev/msab094. [DOI] [PMC free article] [PubMed] [Google Scholar]
  42. Madhani HD. The frustrated gene: origins of eukaryotic gene expression. Cell. 2013;155:744–749. doi: 10.1016/j.cell.2013.10.003. [DOI] [PMC free article] [PubMed] [Google Scholar]
  43. Majoros WH, Pertea M, Salzberg SL. TigrScan and GlimmerHMM: two open source ab initio eukaryotic gene-finders. Bioinformatics. 2004;20:2878–2879. doi: 10.1093/bioinformatics/bth315. [DOI] [PubMed] [Google Scholar]
  44. Mendel M, Delaney K, Pandey RR, Chen KM, Wenda JM, Vågbø CB, Steiner FA, Homolka D, Pillai RS. Splice site m6A methylation prevents binding of U2AF35 to inhibit RNA splicing. Cell. 2021;184:3125–3142. doi: 10.1016/j.cell.2021.03.062. [DOI] [PMC free article] [PubMed] [Google Scholar]
  45. Mitrovich QM, Tuch BB, De La Vega FM, Guthrie C, Johnson AD. Evolution of yeast noncoding RNAs reveals an alternative mechanism for widespread intron loss. Science. 2010;330:838–841. doi: 10.1126/science.1194554. [DOI] [PMC free article] [PubMed] [Google Scholar]
  46. Montemayor EJ, Curran EC, Liao HH, Andrews KL, Treba CN, Butcher SE, Brow DA. Core structure of the U6 small nuclear ribonucleoprotein at 1.7-Å resolution. Nature Structural & Molecular Biology. 2014;21:544–551. doi: 10.1038/nsmb.2832. [DOI] [PMC free article] [PubMed] [Google Scholar]
  47. Morais P, Adachi H, Yu YT. Spliceosomal snRNA Epitranscriptomics. Frontiers in Genetics. 2021;12:652129. doi: 10.3389/fgene.2021.652129. [DOI] [PMC free article] [PubMed] [Google Scholar]
  48. Muzzey D, Schwartz K, Weissman JS, Sherlock G. Assembly of a phased diploid Candida albicans genome facilitates allele-specific measurements and provides a simple model for repeat and indel structure. Genome Biology. 2013;14:R97. doi: 10.1186/gb-2013-14-9-r97. [DOI] [PMC free article] [PubMed] [Google Scholar]
  49. Nagy LG, Ohm RA, Kovács GM, Floudas D, Riley R, Gácser A, Sipiczki M, Davis JM, Doty SL, de Hoog GS, Lang BF, Spatafora JW, Martin FM, Grigoriev IV, Hibbett DS. Latent homology and convergent regulatory evolution underlies the repeated emergence of yeasts. Nature Communications. 2014;5:4471. doi: 10.1038/ncomms5471. [DOI] [PubMed] [Google Scholar]
  50. Nawrocki EP, Eddy SR. Infernal 1.1: 100-fold faster RNA homology searches. Bioinformatics. 2013;29:2933–2935. doi: 10.1093/bioinformatics/btt509. [DOI] [PMC free article] [PubMed] [Google Scholar]
  51. Neuvéglise C, Marck C, Gaillardin C. The intronome of budding yeasts. Comptes Rendus Biologies. 2011;334:662–670. doi: 10.1016/j.crvi.2011.05.015. [DOI] [PubMed] [Google Scholar]
  52. Nilsen TW, Graveley BR. Expansion of the eukaryotic proteome by alternative splicing. Nature. 2010;463:457–463. doi: 10.1038/nature08909. [DOI] [PMC free article] [PubMed] [Google Scholar]
  53. Nurk S, Koren S, Rhie A, Rautiainen M, Bzikadze AV, Mikheenko A, Vollger MR, Altemose N, Uralsky L, Gershman A, Aganezov S, Hoyt SJ, Diekhans M, Logsdon GA, Alonge M, Antonarakis SE, Borchers M, Bouffard GG, Brooks SY, Caldas GV, Chen N-C, Cheng H, Chin C-S, Chow W, de Lima LG, Dishuck PC, Durbin R, Dvorkina T, Fiddes IT, Formenti G, Fulton RS, Fungtammasan A, Garrison E, Grady PGS, Graves-Lindsay TA, Hall IM, Hansen NF, Hartley GA, Haukness M, Howe K, Hunkapiller MW, Jain C, Jain M, Jarvis ED, Kerpedjiev P, Kirsche M, Kolmogorov M, Korlach J, Kremitzki M, Li H, Maduro VV, Marschall T, McCartney AM, McDaniel J, Miller DE, Mullikin JC, Myers EW, Olson ND, Paten B, Peluso P, Pevzner PA, Porubsky D, Potapova T, Rogaev EI, Rosenfeld JA, Salzberg SL, Schneider VA, Sedlazeck FJ, Shafin K, Shew CJ, Shumate A, Sims Y, Smit AFA, Soto DC, Sović I, Storer JM, Streets A, Sullivan BA, Thibaud-Nissen F, Torrance J, Wagner J, Walenz BP, Wenger A, Wood JMD, Xiao C, Yan SM, Young AC, Zarate S, Surti U, McCoy RC, Dennis MY, Alexandrov IA, Gerton JL, O’Neill RJ, Timp W, Zook JM, Schatz MC, Eichler EE, Miga KH, Phillippy AM. The complete sequence of a human genome. Science. 2022;376:44–53. doi: 10.1126/science.abj6987. [DOI] [PMC free article] [PubMed] [Google Scholar]
  54. Oeffner RD, Croll TI, Millán C, Poon BK, Schlicksup CJ, Read RJ, Terwilliger TC. Putting AlphaFold models to work with phenix.process_predicted_model and ISOLDE. Acta Crystallographica. Section D, Structural Biology. 2022;78:1303–1314. doi: 10.1107/S2059798322010026. [DOI] [PMC free article] [PubMed] [Google Scholar]
  55. Palmer JM, Stajich J. Funannotate V1.8.1: Eukaryotic genome annotation. Zenodo. 2020 doi: 10.5281/zenodo.4054262. [DOI]
  56. Paradis E, Claude J, Strimmer K. APE: Analyses of Phylogenetics and Evolution in R language. Bioinformatics. 2004;20:289–290. doi: 10.1093/bioinformatics/btg412. [DOI] [PubMed] [Google Scholar]
  57. Parker MT, Soanes BK, Kusakina J, Larrieu A, Knop K, Joy N, Breidenbach F, Sherwood AV, Barton GJ, Fica SM, Davies BH, Simpson GG. m6A modification of U6 snRNA modulates usage of two major classes of pre-mRNA 5’ splice site. eLife. 2022;11:e78808. doi: 10.7554/eLife.78808. [DOI] [PMC free article] [PubMed] [Google Scholar]
  58. Paysan-Lafosse T, Blum M, Chuguransky S, Grego T, Pinto BL, Salazar GA, Bileschi ML, Bork P, Bridge A, Colwell L, Gough J, Haft DH, Letunić I, Marchler-Bauer A, Mi H, Natale DA, Orengo CA, Pandurangan AP, Rivoire C, Sigrist CJA, Sillitoe I, Thanki N, Thomas PD, Tosatto SCE, Wu CH, Bateman A. InterPro in 2022. Nucleic Acids Research. 2023;51:D418–D427. doi: 10.1093/nar/gkac993. [DOI] [PMC free article] [PubMed] [Google Scholar]
  59. Pendleton KE, Chen B, Liu K, Hunter OV, Xie Y, Tu BP, Conrad NK. The U6 snRNA m6A Methyltransferase METTL16 Regulates SAM Synthetase Intron Retention. Cell. 2017;169:824–835. doi: 10.1016/j.cell.2017.05.003. [DOI] [PMC free article] [PubMed] [Google Scholar]
  60. Plaschka C, Newman AJ, Nagai K. Structural Basis of Nuclear pre-mRNA Splicing: Lessons from Yeast. Cold Spring Harbor Perspectives in Biology. 2019;11:a032391. doi: 10.1101/cshperspect.a032391. [DOI] [PMC free article] [PubMed] [Google Scholar]
  61. Potashkin J, Frendewey D. Splicing of the U6 RNA precursor is impaired in fission yeast pre-mRNA splicing mutants. Nucleic Acids Research. 1989;17:7821–7831. doi: 10.1093/nar/17.19.7821. [DOI] [PMC free article] [PubMed] [Google Scholar]
  62. Potter SC, Luciani A, Eddy SR, Park Y, Lopez R, Finn RD. HMMER web server: 2018 update. Nucleic Acids Research. 2018;46:W200–W204. doi: 10.1093/nar/gky448. [DOI] [PMC free article] [PubMed] [Google Scholar]
  63. Price N, Lopez L, Platts AE, Lasky JR. In the presence of population structure: From genomics to candidate genes underlying local adaptation. Ecology and Evolution. 2020;10:1889–1904. doi: 10.1002/ece3.6002. [DOI] [PMC free article] [PubMed] [Google Scholar]
  64. Procter JB, Carstairs GM, Soares B, Mourão K, Ofoegbu TC, Barton D, Lui L, Menard A, Sherstnev N, Roldan-Martinez D, Duce S, Martin DMA, Barton GJ. Alignment of biological sequences with Jalview. Methods in Molecular Biology. 2021;2231:203–224. doi: 10.1007/978-1-0716-1036-7_13. [DOI] [PMC free article] [PubMed] [Google Scholar]
  65. Riley R, Haridas S, Wolfe KH, Lopes MR, Hittinger CT, Göker M, Salamov AA, Wisecaver JH, Long TM, Calvey CH, Aerts AL, Barry KW, Choi C, Clum A, Coughlan AY, Deshpande S, Douglass AP, Hanson SJ, Klenk H-P, LaButti KM, Lapidus A, Lindquist EA, Lipzen AM, Meier-Kolthoff JP, Ohm RA, Otillar RP, Pangilinan JL, Peng Y, Rokas A, Rosa CA, Scheuner C, Sibirny AA, Slot JC, Stielow JB, Sun H, Kurtzman CP, Blackwell M, Grigoriev IV, Jeffries TW. Comparative genomics of biotechnologically important yeasts. PNAS. 2016;113:9882–9887. doi: 10.1073/pnas.1603941113. [DOI] [PMC free article] [PubMed] [Google Scholar]
  66. Rivas E, Clements J, Eddy SR. A statistical test for conserved RNA structure shows lack of evidence for structure in lncRNAs. Nature Methods. 2017;14:45–48. doi: 10.1038/nmeth.4066. [DOI] [PMC free article] [PubMed] [Google Scholar]
  67. Rogozin IB, Wolf YI, Sorokin AV, Mirkin BG, Koonin EV. Remarkable interkingdom conservation of intron positions and massive, lineage-specific intron loss and gain in eukaryotic evolution. Current Biology. 2003;13:1512–1517. doi: 10.1016/s0960-9822(03)00558-x. [DOI] [PubMed] [Google Scholar]
  68. Rogozin IB, Carmel L, Csuros M, Koonin EV. Origin and evolution of spliceosomal introns. Biology Direct. 2012;7:11. doi: 10.1186/1745-6150-7-11. [DOI] [PMC free article] [PubMed] [Google Scholar]
  69. Roost C, Lynch SR, Batista PJ, Qu K, Chang HY, Kool ET. Structure and Thermodynamics of N6-Methyladenosine in RNA: A Spring-Loaded Base Modification. Journal of the American Chemical Society. 2015;137:2107–2115. doi: 10.1021/ja513080v. [DOI] [PMC free article] [PubMed] [Google Scholar]
  70. Ruszkowska A, Ruszkowski M, Dauter Z, Brown JA. Structural insights into the RNA methyltransferase domain of METTL16. Scientific Reports. 2018;8:5311. doi: 10.1038/s41598-018-23608-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
  71. Sales-Lee J, Perry DS, Bowser BA, Diedrich JK, Rao B, Beusch I, Yates JR, III, Roy SW, Madhani HD. Coupling of spliceosome complexity to intron diversity. Current Biology. 2021;31:4898–4910. doi: 10.1016/j.cub.2021.09.004. [DOI] [PMC free article] [PubMed] [Google Scholar]
  72. Sasaki E, Zhang P, Atwell S, Meng D, Nordborg M, Gibson G. “Missing” G x E Variation Controls Flowering Time in Arabidopsis thaliana. PLOS Genetics. 2015;11:e1005597. doi: 10.1371/journal.pgen.1005597. [DOI] [PMC free article] [PubMed] [Google Scholar]
  73. Schrödinger LLC. The Pymol molecular Graphics system. Version 1.8 2015
  74. Schwartz SH, Silva J, Burstein D, Pupko T, Eyras E, Ast G. Large-scale comparative analysis of splicing signals and their corresponding splicing factors in eukaryotes. Genome Research. 2008;18:88–103. doi: 10.1101/gr.6818908. [DOI] [PMC free article] [PubMed] [Google Scholar]
  75. Seabold S, Perktold J. Statsmodels: Econometric and statistical modeling with pythonProceedings of the 9th Python in Science Conference. Presented at the Python in Science Conference. SciPy; 2010. [DOI] [Google Scholar]
  76. Seppey M, Manni M, Zdobnov EM. BUSCO: assessing genome assembly and annotation completeness. Methods in Molecular Biology. 2019;1962:227–245. doi: 10.1007/978-1-4939-9173-0_14. [DOI] [PubMed] [Google Scholar]
  77. Shen XX, Zhou X, Kominek J, Kurtzman CP, Hittinger CT, Rokas A. Reconstructing the backbone of the saccharomycotina yeast phylogeny using genome-scale data. G3: Genes, Genomes, Genetics. 2016;6:3927–3939. doi: 10.1534/g3.116.034744. [DOI] [PMC free article] [PubMed] [Google Scholar]
  78. Shen XX, Opulente DA, Kominek J, Zhou X, Steenwyk JL, Buh KV, Haase MAB, Wisecaver JH, Wang M, Doering DT, Boudouris JT, Schneider RM, Langdon QK, Ohkuma M, Endoh R, Takashima M, Manabe RI, Čadež N, Libkind D, Rosa CA, DeVirgilio J, Hulfachor AB, Groenewald M, Kurtzman CP, Hittinger CT, Rokas A. Tempo and mode of genome evolution in the budding yeast subphylum. Cell. 2018;175:1533–1545. doi: 10.1016/j.cell.2018.10.023. [DOI] [PMC free article] [PubMed] [Google Scholar]
  79. Shen XX, Steenwyk JL, LaBella AL, Opulente DA, Zhou X, Kominek J, Li Y, Groenewald M, Hittinger CT, Rokas A. Genome-scale phylogeny and contrasting modes of genome evolution in the fungal phylum Ascomycota. Science Advances. 2020;6:eabd0079. doi: 10.1126/sciadv.abd0079. [DOI] [PMC free article] [PubMed] [Google Scholar]
  80. Smith SD, Pennell MW, Dunn CW, Edwards SV. Phylogenetics is the new genetics (for most of biodiversity) Trends in Ecology & Evolution. 2020;35:415–425. doi: 10.1016/j.tree.2020.01.005. [DOI] [PubMed] [Google Scholar]
  81. Stanke M, Schöffmann O, Morgenstern B, Waack S. Gene prediction in eukaryotes with a generalized hidden Markov model that uses hints from external sources. BMC Bioinformatics. 2006;7:62. doi: 10.1186/1471-2105-7-62. [DOI] [PMC free article] [PubMed] [Google Scholar]
  82. Stark MR, Dunn EA, Dunn WSC, Grisdale CJ, Daniele AR, Halstead MRG, Fast NM, Rader SD. Dramatically reduced spliceosome in cyanidioschyzon merolae. PNAS. 2015;112:E1191–E1200. doi: 10.1073/pnas.1416879112. [DOI] [PMC free article] [PubMed] [Google Scholar]
  83. Sweeney BA, Petrov AI, Ribas CE, Finn RD, Bateman A, Szymanski M, Karlowski WM, Seemann SE, Gorodkin J, Cannone JJ, Gutell RR, Kay S, Marygold S, dos Santos G, Frankish A, Mudge JM, Barshir R, Fishilevich S, Chan PP, Lowe TM, Seal R, Bruford E, Panni S, Porras P, Karagkouni D, Hatzigeorgiou AG, Ma L, Zhang Z, Volders P-J, Mestdagh P, Griffiths-Jones S, Fromm B, Peterson KJ, Kalvari I, Nawrocki EP, Petrov AS, Weng S, Bouchard-Bourelle P, Scott M, Lui LM, Hoksza D, Lovering RC, Kramarz B, Mani P, Ramachandran S, Weinberg Z, RNAcentral Consortium RNAcentral 2021: secondary structure integration, improved sequence search and new member databases. Nucleic Acids Research. 2021;49:D212–D220. doi: 10.1093/nar/gkaa921. [DOI] [PMC free article] [PubMed] [Google Scholar]
  84. Tam V, Patel N, Turcotte M, Bossé Y, Paré G, Meyre D. Benefits and limitations of genome-wide association studies. Nature Reviews. Genetics. 2019;20:467–484. doi: 10.1038/s41576-019-0127-1. [DOI] [PubMed] [Google Scholar]
  85. Varadi M, Anyango S, Deshpande M, Nair S, Natassia C, Yordanova G, Yuan D, Stroe O, Wood G, Laydon A, Žídek A, Green T, Tunyasuvunakool K, Petersen S, Jumper J, Clancy E, Green R, Vora A, Lutfi M, Figurnov M, Cowie A, Hobbs N, Kohli P, Kleywegt G, Birney E, Hassabis D, Velankar S. AlphaFold Protein Structure Database: massively expanding the structural coverage of protein-sequence space with high-accuracy models. Nucleic Acids Research. 2022;50:D439–D444. doi: 10.1093/nar/gkab1061. [DOI] [PMC free article] [PubMed] [Google Scholar]
  86. Wan R, Bai R, Yan C, Lei J, Shi Y. Structures of the catalytically activated yeast spliceosome reveal the mechanism of branching. Cell. 2019;177:339–351. doi: 10.1016/j.cell.2019.02.006. [DOI] [PubMed] [Google Scholar]
  87. Wang C, Yang J, Song P, Zhang W, Lu Q, Yu Q, Jia G. FIONA1 is an RNA N6-methyladenosine methyltransferase affecting Arabidopsis photomorphogenesis and flowering. Genome Biology. 2022;23:40. doi: 10.1186/s13059-022-02612-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
  88. Warda AS, Kretschmer J, Hackert P, Lenz C, Urlaub H, Höbartner C, Sloan KE, Bohnsack MT. Human METTL16 is a N6-methyladenosine (m6A) methyltransferase that targets pre-mRNAs and various non-coding RNAs. EMBO Reports. 2017;18:2004–2014. doi: 10.15252/embr.201744940. [DOI] [PMC free article] [PubMed] [Google Scholar]
  89. Weinberg Z, Breaker RR. R2R--software to speed the depiction of aesthetic consensus RNA secondary structures. BMC Bioinformatics. 2011;12:3. doi: 10.1186/1471-2105-12-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
  90. Wilkinson ME, Charenton C, Nagai K. RNA Splicing by the Spliceosome. Annual Review of Biochemistry. 2020;89:359–388. doi: 10.1146/annurev-biochem-091719-064225. [DOI] [PubMed] [Google Scholar]
  91. Wong DK, Grisdale CJ, Slat VA, Rader SD, Fast NM. The evolution of pre-mRNA splicing and its machinery revealed by reduced extremophilic red algae. The Journal of Eukaryotic Microbiology. 2023;70:e12927. doi: 10.1111/jeu.12927. [DOI] [PubMed] [Google Scholar]
  92. Wright CJ, Smith CWJ, Jiggins CD. Alternative splicing as a source of phenotypic diversity. Nature Reviews. Genetics. 2022;23:697–710. doi: 10.1038/s41576-022-00514-4. [DOI] [PubMed] [Google Scholar]
  93. Yamashita S, Takagi Y, Nagaike T, Tomita K. Crystal structures of U6 snRNA-specific terminal uridylyltransferase. Nature Communications. 2017;8:15788. doi: 10.1038/ncomms15788. [DOI] [PMC free article] [PubMed] [Google Scholar]
  94. Zahler AM, Rogel LE, Glover ML, Yitiz S, Ragle JM, Katzman S. SNRP-27, the C. elegans homolog of the tri-snRNP 27K protein, has a role in 5’ splice site positioning in the spliceosome. RNA. 2018;24:1314–1325. doi: 10.1261/rna.066878.118. [DOI] [PMC free article] [PubMed] [Google Scholar]
  95. Zhan X, Yan C, Zhang X, Lei J, Shi Y. Structures of the human pre-catalytic spliceosome and its precursor spliceosome. Cell Research. 2018;28:1129–1140. doi: 10.1038/s41422-018-0094-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
  96. Zhang Y, Skolnick J. TM-align: a protein structure alignment algorithm based on the TM-score. Nucleic Acids Research. 2005;33:2302–2309. doi: 10.1093/nar/gki524. [DOI] [PMC free article] [PubMed] [Google Scholar]
  97. Zhang C, Shine M, Pyle AM, Zhang Y. US-align: universal structure alignments of proteins, nucleic acids, and macromolecular complexes. Nature Methods. 2022;19:1109–1115. doi: 10.1038/s41592-022-01585-1. [DOI] [PubMed] [Google Scholar]

Editor's evaluation

Christian R Landry 1

The manuscript addresses the ways in which different organisms have evolved pre-messenger RNA systems that are either more or less complex, a question that underlies the evolution of complex organisms and the genome adaptation of simple organisms to their specific environments. This important manuscript provides the underlying molecular mechanisms of how 5' splice site sequence preference may have evolved, with solid structural modeling data in support.

Decision letter

Editor: Christian R Landry1

In the interests of transparency, eLife publishes the most substantive revision requests and the accompanying author responses.

[Editors' note: this paper was reviewed by Review Commons.]

eLife. 2023 Oct 3;12:e91997. doi: 10.7554/eLife.91997.sa2

Author response


General Statements

We are grateful to the reviewers for the constructive comments designed to improve our manuscript. All three reviewers recognise the broad impact of our study. We have responded to each of the comments they make as detailed below and revised our manuscript accordingly.

  • We have reworded statements in the Abstract and Discussion about the evolutionary relationship between METTL16 and 5’SS sequence preference, to more carefully reflect what we can currently conclude from the data analysed in our study. We have also updated the Discussion to include a new section that addresses the evolutionary scenarios that could explain these data.

  • We have edited the Summary and Introduction sections to give a more balanced view of the link between splicing complexity and developmental complexity.

  • We have added 5’SS sequence motifs to the x-axis of figure 2B to make the plot clearer.

  • We have created a pruned tree showing the 5’SS motifs of a selection of Saccharomycotina species, which further demonstrates that the changes in 5’SS+4 position preferences seen in S. cerevisiae and C. albicans are likely to be a result of convergent evolution. We have added this tree as Figure 2 —figure supplement 3.

  • We have altered the way in which we describe and visualise the relationship between conserved intron number and U5/6rho in Figure 6B, to remove the implication that changes in +4 preference, rather than changes in intron number, explain this trend.

Reviewer #1 (Evidence, reproducibility and clarity (Required)):

Summary

The manuscript by Parker et al. addresses the important question of how different organisms have evolved pre-messenger RNA systems that are either more or less complex. This question underlies the evolution of complex organisms and the genome adaptation of simple organisms to their specific environments, so is an important question to answer. This manuscript now provides the underlying molecular mechanisms of how 5' splice site sequence preference may have evolved which is both an interesting and exciting advance for the field.

We thank the reviewer for these kind comments.

Major comments

This manuscript builds on the previous work from this group where they identified the role of adenosine N6 methylation (m6A) of the U6 small nuclear RNA (snRNA) of the spliceosome by METTL16 as being important for 5' splice site selection. This work led to the speculation that loss of a METTL16 ortholog, or potentially other splicing factors, in some species could contribute to an evolutionary change in 5' splice site sequence preference. Here the authors now use the power of phylogenetics, interspecies association mapping and the available spliceosome structures to provide convincing conclusions that 5' splice site sequence preferences in the extensive number of organisms examined correlate with the presence of the U6 snRNA methyltransferase METTL16 and the splicing factor SNRNP27K.

An analysis of METTL16 conservation was first carried out by comparing the METTL16 methyltransferase domain (MTD) in 29 diverse eukaryotic species. All the METTL16 orthologs were found to have either one or two globular domains. Three domain types were identified and compared in detail. What was not clear from this analysis was the functional significance of orthologs having either one or two domains.

We identified several species, including Drosophila melanogaster, whose METTL16 orthologs do not contain a VCR domain. However, in this study we do not draw specific conclusions about the functional significance of orthologs having different domain topologies.

In addition, while this analysis provides important new information on the domain structure of METTL16 orthologs, especially where these domains had not been identified previously, the link between this section of the results and the following sections is not that apparent.

We agree that there is a significant difference in approach between the first section of the Results and the following sections. However, we are keen to keep this part of the manuscript because it provides an orthogonal line of evidence suggesting that the ancestral role of METTL16 in eukaryotes is specifically the methylation of U6 snRNA.

Next novel bioinformatics pipelines were developed to compare both introns and orthologous groupings of protein coding genes between 227 Sacchromycotina genomes as well as 13 well-annotated eukaryote genomes. First, the 5' splice site sequence preference was compared and clearly indicates that the +4 position has the greatest variation in preferences within the Sacchromycotina. The ability to now compare a large number of genomes has provided novel information on the evolution of the 5' splice site sequence and the conclusion that there is more complexity to the 5' splice site in fungi that previously recognized. While it is apparent why only the 5' splice site signal was investigated here, with its relationship to the U6 snRNA and METTL16, it seems a shame the other splice site sequences were not analyzed using this novel pipeline. In any case, the complexity of the 5' splice site +4 position now allows, for the first time, interesting interspecies association studies.

We have now included the variance plots for 3’SS motifs (analogous to the 5’SS variance plots shown in Figure 2B) as Figure 2 supplementary figure 4A, and a traitgram for 3’SS -3C to U ratio as Figure 2 supplementary figure 4B. We have included a short section of text in the Results section to describe these additional findings.

With the 5' splice site +4 variation identified, the next step was to determine the underlying molecular mechanisms that dictate the evolution of the various sequence preferences. Some obvious players here are the U1 and U6 snRNAs which directly interact with the 5' splice site during splicing. However, no association was found between these snRNAs and the 5' splice site +4 sequence.

The powerful interspecies association mapping was then used to determine whether the presence or absence of METTL16 ortholog or a splicing factor correlated with the 5' splice site +4 sequence variation. Interestingly, a clear association was found between METTL16 and the 5' splice site +4 position; METTL16 presence was associated with +4A at the 5' splice site and METTL16 absence was associated with +4U at the 5' splice site. This is an exciting and significant finding.

We thank the reviewer for these comments on the importance of this study.

Interestingly, the next most significant association with the 5' splice site +4 position was with SNRNP27K. This result makes sense as in the cryo-EM structure of the pre-B spliceosome complex the C-terminal domain of SNRNP27K is found near the region of the U6 snRNA that will interact with the 5' splices site. Absence of SNRNP27K was associated with an increased preference for +4U at the 5' splice site. Now the real power of the interspecies association mapping was demonstrated by investigating whether any association could be determined specifically within the C-terminus of SNRNP27K. Significantly, the methionine 141 position in SNRNP27K was found to be associated with the +4 position of the 5' splice site. This finding fits nicely with previous studies where mutation of M141 caused a shift in 5' splice site selection away from +4A 5' splice sites, to 5' splice sites without +4A. What is not clear is whether M141 is conserved or invariant between all the species that were compared?

M141 is not completely conserved across the species that were compared for the SNRNP27K C-terminus analysis. We did not test positions with very strong sequence conservation, because without variation in both the genotype and phenotype it is not possible to test for an association. We have rephrased the relevant Results and Methods sections to make this point clearer. In addition, we have incorporated a sequence logo to illustrate the degree of conservation of each position in the SNRNP27K C-terminal domain as Figure 5 —figure supplement 1A. Finally, we have included an additional box-plot to illustrate the finding that species which have lost SNRNP27K or have only lost the Methionine equivalent to human SNRNP27K position 141, show a similar preference for +4U at 5’ SSs. This is now included as Figure 5 —figure supplement 1B.

Overall, this result reveals the power of the interspecies association approach and provides interesting and exciting information on the molecular determinants of 5' splice site evolution.

We are grateful to the reviewer for these comments.

The final analysis was to investigate the interaction potentials of the U5 and U6 snRNAs with the 5' splice site in the Sacchromycotina genomes and try to relate this to species with fewer introns and less alternative splicing. Species with low intron numbers and low splicing complexity were revealed to have weaker U5 and U6 anti-correlation potentials and favor +4U at the 5' splice site. On the other hand, species with high intron number and presumably higher splicing complexity featured anti-correlated U5 and U6 snRNA interaction potentials and favored +4A 5' splice sites. This extensive analysis provides novel information on the interactions and splice site properties of species with simple and complex splicing. Again, I see why there is emphasis on the 5' splice site here but a similar analysis with the U2 snRNA and the branch site could also be informative.

We absolutely agree that inter-species association mapping could be applied to other splicing signal phenotypes including 3’ splice sites and intron branchpoints. Accordingly, we raise this subject in the final section of the Discussion. However, branchpoint sequences are challenging to predict with genomic data. Because preliminary analyses suggest independent variation in these other splicing signal phenotypes, we feel a separate focused study is required to properly explain (and substantiate) even the analytical approaches involved. We hope the reviewer would agree that incorporating U2 snRNA and branchpoint variation analyses into this manuscript as well, could detract from the clarity of the conceptual advances that we make here.

Minor comments

Should the Title include SNRNP27K?

We have included SNRNP27K into the revised title.

Should the title specify that it is the evolution of only the 5’ splice site sequence preference being studied here?

Because apostrophes in titles can compromise some scholarly online search engines (https://insights.uksg.org/ar5cles/10.1629/uksg.534), we would prefer not to include 5’ in the title.

Include information on intron number and 5’ splice site interaction potential of U5 and U6 snRNA in the Summary?

We thank the reviewer for this suggestion. We have updated the Summary to include our findings on U5 and U6 interaction potential in species with reduced intron number.

Figure 1C is not referred to in the text?

We apologise for this oversight. We have added references to figure 1C in the appropriate Results section.

Page 8, line 5 – better to say “splicing signal phenotypes”.

We have amended this statement on Page 8 and at other places in the text where related phrasing was made.

What are the other points on Figure 3B? What is the next point below SNRNP27K? Is it U2A’?

The other points on Figure 3B represent Orthofinder orthogroups which contain human orthologs that are known components of the spliceosome. The list of spliceosomal components was taken from Sales-Lee et al. 2021. The third most significant point is indeed the orthogroup containing the human ortholog of U2A’. As we state in the text, however, the correlation of U2A’ with the 5’SS+4 A to U ratio phenotype is no longer significant once METTL16 presence/absence is controlled for, indicating that the correlation of U2A’ with the +4A phenotype is likely explained by similarity in the patterns of gene loss of U2A’ and METTL16.

The second paragraph of the Discussion is vague and lacks a reference. “we could also identify an association with a methionine residue in the conserved C-terminal domain of SNRNP27K orthologs.” There are a few methionines in the C-terminus, which one? Please reference the statement “transcriptome analysis of C. elegans SNRP-27 M141T mutants…”

We apologise for the lower quality of writing in this section of the Discussion. We have updated the text, made the statements about the SNRNP27K C-terminus less ambiguous, and added the relevant citations as appropriate.

Reviewer #1 (Significance (Required)):

Overall, this is a well written and clearly presented study that provides some key molecular information on the splicing factors involved in the evolution of 5’ splice sites and shows the power of interspecies association studies. Some important conceptual principles have now been defined for the field going forward.

With thank the reviewer for this kind comment on the importance of this work.

The question remains as to whether METTL16 and SNRNP27K are the sole determinants of 5’ splice site preference evolution at +4?

We cannot say for certain that METTL16 and/or SNRNP27K determine the 5’SS +4 phenotype – only that they are correlated with it. In our response to reviewer 3, and in a new Discussion section, we have detailed some of the scenarios that could explain these correlations. We also cannot rule out whether there are changes in the presence/absence (or domain/sequence-level changes) of other, untested proteins that correlate with the 5’SS +4 phenotype and we allude to this in the final section of the Discussion.

One splicing factor that immediately comes to mind is Prp8 where there is extensive evidence for involvement in splice site selection and is clearly in the right location throughout splicing to be involved. This question should at least be discussed but Prp8 would also be a very interesting candidate for the interspecies association mapping.

Prp8 is a core component of spliceosomes and is conserved throughout the Saccharomycotina. For this reason, we were unable to associate splicing phenotypes with Prp8 presence or absence variation at the level of orthogroups. However, we revisited this question posed by the reviewer. Our experience with inter-species association mapping, so far, indicates it works well with orthogroup presence/absence or when straightforward amino acid substitutions can be detected in conserved and hence alignable protein sequence domains. We analysed the conserved U6 snRNA-interacting region of the Prp8 linker domain, which maps close to the 5’ splice site in cryo-EM models, using the profile HMM PF10596 available from Pfam. We found that the majority of this domain was extremely highly conserved with variation in only a few species and positions. The strongest correlation with the +4A to U ratio phenotype was at position 58, which is conserved as a Glycine in all but 8 species (6 Dipodascaceae, 2 CUG-Ser1), that also tend to have a stronger preference for +4A. However, examination of the species contributing to this result (and to similar results at other positions) indicated that in the 6 Dipodascaceae species, this change is part of a larger deletion or replacement that makes the whole linker region align poorly to the model. Hence, the G58 position itself may not be specifically important for the +4 phenotype. Although the wholesale loss or replacement of the U6 snRNA-interacting region in these species is potentially interesting, these larger scale structural changes in a small number of species are difficult to interpret. Therefore, to maintain the focus of the manuscript and the clear links to METTL16 and SNRNP27K that have orthogonal support, we have decided not to add these results to the manuscript but present them here.

Author response image 1. Analysis of the U6-interacting region of the PRPF8 linker domain using pHMMPGLS analysis.

Author response image 1.

(A) Stem plot showing the association of individual conserved positions in the U6 snRNA interacting domain of PRPF8 with the 5’SS+4 A to U ratio phenotype in 239 species retaining a detectable PRPF8 ortholog. (B) Boxplot showing the distribution of 5’SS+4 A to U ratio phenotypes for species retaining a PRPF8 ortholog with a Glycine at pHMM alignment position 58, compared to those with a different or unalignable residue at this position (X58). Six of the eight species without a Glycine at this position are from the clade Dipodascaceae and have PRPF8 orthologs that align poorly to the pHMM, suggesting that this position itself may not be specifically important for the +4 phenotype.

Also, as mentioned previously, only the 5’ splice site was investigated here and the manuscript could become a more substantial piece of work if the other splice sites were included in some way.

We agree that it will be exciting to apply this approach to other splicing signal phenotypes and in other phylogenetic clades with emerging tree-of-life-scale genomics data. We have included variation in 3’ splice sites in the revised manuscript. As the first of its kind, this study should pioneer a wider use of this approach, by us and others, to understand the mechanisms and functions of molecular interactions not only in splicing but in other areas of biology too.

The obvious audience here are those directly in the splicing field but the overall principles are relevant for evolutionary biologists and those studying organismal complexity.

We thank the reviewer for recognising the broad importance of this work.

My expertise is in yeast and human splicing mechanisms. I do not have the expertise to critically evaluate the bioinformatic pipelines but they were clearly explained and presented.

Reviewer #2 (Evidence, reproducibility and clarity (Required)):

In their manuscript, Parker et al. investigate the evolutionary patterns of splice site preference, focusing on the A/U ratio at position A+4 on the 5´ splice site. Building upon prior studies in S. pombe and A. thaliana, the authors establish a strong correlation between this preference and the co-evolution of the METTL16 U6 snRNA methyltransferase. Furthermore, through inter-species association mapping, they identify the involvement of the splicing factor SNRNP27K in altered A/U ratios and highlight the significance of the residue Met-141 in SNRNP27K for this function. Overall, the paper effectively presents impactful new findings on the evolution of METTL16, U6 snRNA, and splicing.

We thank the reviewer for these kind comments on the importance of our study.

The computational analyses employed in this study are situated outside our field of expertise, preventing us from offering a comprehensive evaluation of the methodology’s appropriateness and rigor. Nonetheless, the identification of METTL16 through the authors’ methods, which aligns with previous research in S. pombe and A. thaliana, lends support to the validity of their approach. Notably, the close proximity between SNRNP27K and the methylated A43 residue in U6 snRNA within the spliceosome, particularly near Met-141, is an impressive finding. Previous studies have shown that a mutation at position M141T affects splicing at +4A introns, thus providing robust validation for their methods.

We thank the reviewer for these kind comments on our work.

The data presented in this study furnish crucial insights into the role of METTL16, U6 snRNA methylation, and splice site recognition. The authors expand upon recent observations that the “vertebrate conserved region” exists in non-vertebrates, despite the absence of primary sequence homology. These results will serve as a valuable guide for future molecular investigations into U6 snRNA methylation and its mechanisms in splicing. Furthermore, the implications of this paper extend to human evolution, as the plasticity in splicing is an essential factor in the evolution of developmental complexity.

We thank the reviewer for these kind comments.

Minor suggestions for improvement:

1. Given the significance of the interaction between U6 snRNA and the intron for understanding the data, it would be beneficial to include a figure illustrating the RNA-RNA base-pairing interactions between U6 snRNA and the 5´ splice site. This addition is particularly important if the paper is intended for publication in a journal with a general readership.

We thank the reviewer for this excellent suggestion. We have included this as Figure 3A.

2. Similarly, the section on U1 snRNA would be more comprehensible with the inclusion of U1 RNA-RNA intron diagrams and improved descriptions of both the figures and the assay. Despite being negative data in the supplement, clarifying this section is essential. As currently written, it is challenging to follow.

We agree that this section is difficult to follow. We have updated the text to improve the readability and included a figure of U1 snRNA:5’SS basepairing as Figure 3 —figure supplement 1A.

3. Whenever possible, consider increasing the figure and font sizes to enhance readability for readers.

We agree that some of the more complex figures can be difficult to read when embedded into a Word document/pdf. We hope that providing high-resolution figures for reading online will mitigate this.

4. In the text, there is no reference to Figure 1C.

We apologise for this oversight. We have resolved this issue with the appropriate references in the Results text.

5. In Figure 5B, the y-axis in the top panel is 8labelled “species,” but the legend only mentions U5/6p as the y-axis. Please revise the legend to include the appropriate information.

We apologise for the confusion caused by our poorly written legend for this plot. We have updated the legend so that the text clearly refers to either the scatter plot or the marginal histograms.

Reviewer #2 (Significance (Required)):

The data presented in this study furnish crucial insights into the role of METTL16, U6 snRNA methylation, and splice site recognition. The authors expand upon recent observations that the “vertebrate conserved region” exists in non-vertebrates, despite the absence of primary sequence homology. These results will serve as a valuable guide for future molecular investigations into U6 snRNA methylation and its mechanisms in splicing. Furthermore, the implications of this paper extend to human evolution, as the plasticity in splicing is an essential factor in the evolution of developmental complexity.

We are grateful to the reviewer for these kind comments on the importance of this work.

Reviewer #3 (Evidence, reproducibility and clarity (Required)):

In this manuscript, Parker et al. present a nice exploration of the evolutionary and mechanistic relationships between 5ʹ splice site consensus sequences, intron numbers and METTL16/SNRNP27K. By performing inter-species association mapping in Saccharomycotina species, they found that a T in position +4 is strongly associated with the absence of METTL16 (and/or in some cases SNRNP27K or mutations in it). They also provide solid structural modelling data in support of this association.

In general, I think this is a very nice manuscript. I only have a few comments, which could be addressed by rewording specific parts and/or improving the current figures.

We are grateful to the reviewer for the kind comments on this work.

1. As the authors acknowledge, a key issue that cannot be fully resolved in this study is causality between the different events investigated. Overall, the authors are careful about this, but there are some exceptions that should be corrected. Probably the most important is in the abstract, where they write: “We conclude that variation in concerted processes of 5’ splice site selection by U6 snRNA is crucial to evolutionary change in splicing complexity”. I suggest they write something more open (and correct), such as: “We conclude that variation in concerted processes of 5’ splice site selection by U6 snRNA is associated with evolutionary changes in splicing complexity”. Similarly, other plausible scenarios should be discussed in the corresponding Discussion section.

We agree with the reviewer that it is not possible to infer the causal relationship between METTL16 absence and 5’SS+4 preference change from the current data. We, therefore, apologise for failing to be more careful in the Summary and Introduction. We have reworded these statements to better reflect what we can currently say about the evolutionary relationship between METTL16 and 5’SS sequence preference.

The correlation between METTL16 absence and 5'SS+4 sequence preference change could most likely be explained by one of several scenarios: (a) sudden loss of METTL16 causes a rapid necessity to change 5'SS sequence preferences. This is unlikely as such rapid change without widespread corresponding 5'SS changes would likely impose a high fitness cost. (b) Changes in 5'SS sequence preference occur first, driven by some other selective pressure, until there is no longer a benefit to retaining the METTL16 gene. (c) Gradual changes in the expression or catalytic efficiency of METTL16 reduce the stoichiometry of U6 snRNA m6A modification, which permits gradual change in 5'SS+4 sequence preference until complete loss of the METTL16 no longer imposes a major fitness cost. As we suggest in the Discussion, future work could examine this question by determining whether the METTL16 orthologs found in Zygosaccharomyces and Eremothecium species, which have altered their 5'SS+4 preference to a U, are expressed and functional. We have updated the Discussion to include a new section that addresses these scenarios.

2. I do not agree with the statement that "The extent of alternative splicing is the best genomic predictor of developmental complexity". To start with, there are many ways to quantify "extent of alternative splicing" and there are also different types of alternative splicing that might have different prevalence and biological impact. Then, this claim is usually related with exon skipping, which is tightly linked with intron length, and that is likely a better prediction of complexity (yet clearly not causative). My concern is: to what extent has this claim been formally and properly assessed by comparing splicing prevalence with other genomic features, such as intergenic region length, intron length, or average distance between enhancer-promoter interactions (arguably the most relevant predictor, in light of many other studies)? Moreover, I found it a bit misleading to frame the work presented in this study as directly related with developmental (or even splicing) complexity. The work is very interesting on its own, and I doubt their findings on +4 position preference in Saccharomycotina has anything to do with developmental complexity (as the Abstract and Introduction seem to imply).

On reflection, we agree with the reviewer. Some of our framing of the text isn’t balanced with other studies on the scaling of alternative splicing with developmental complexity. We have edited the Summary and Introduction sections accordingly and cited other references that broaden the consideration of this subject. We are grateful to the reviewer for this suggestion because the changes we make improve the focus of the manuscript since our findings relate more to splicing simplification than to an understanding of increased developmental complexity.

2. I found Figure 2 and its associated supplementary figure very difficult to follow. I suggest the authors try to improve it and make it clearer. Also, other trees summarizing the results might be helpful.

We apologise for the complexity of these figures. We opted to show phylogenetic trees with phenotypes plotted on the y axis, rather than simply trait histograms or box-plots, because the underlying structure of the tree is important for demonstrating that multiple independent changes in the 5’SS phenotype have occurred in the Saccharomycotina. We have tried to improve the comprehensibility of the figures in the following ways: (a) We have added 5’SS sequence motifs to the x-axis of figure 2B to make what the plot represents clearer, (b) as suggested by the reviewer, we have created a pruned tree showing the 5’SS motifs of a selection of Saccharomycotina species, which demonstrates that the changes in 5’SS+4 position preferences seen in S. cerevisiae and C. albicans are likely to be a result of convergent evolution. We have added this tree as Figure 2—figure supplement 3.

3. I also found the Results section corresponding to Figure 5B a bit confusing. I would argue (as I think the authors do) that there are two main patterns here: below 500 introns, there is no association, while above 500 introns there is an increasingly negative association (correlation). I think it would help to more explicitly distinguishing these two patterns. Then, for the intron-poor species: is the correlation (or lack of) for species with a T or an A in position +4 different?

We do indeed think that there are two patterns here, as indicated by the reviewer. In the previous version of the manuscript, we separated species into those having an overall preference for A at the +4 position, and those having +4U. By showing regression lines for these two classes, rather than for the general relationship between intron number and U5/6rho, we somewhat imply that the switch in +4 base preference might be causing the loss of correlation between U5/6rho and intron number. However, since essentially all species with a 5'SS +4U preference are intron poor, it seems more likely that these trends are the result of a loss of the negative correlation between intron number and U5/6rho in intron poor species, as suggested by the reviewer. To address this issue, we have replaced the regression lines on Figure 6B with a single loess (locally estimated scatterplot smoothing) regression line for all species and updated the text to make it clearer that we think loss of U5/6rho and +4A preference are separate traits of intron poor species. Although this is not exactly what the reviewer requested, we hope that it satisfies their issue with the analysis.

Reviewer #3 (Significance (Required)):

This is a very interesting study that sheds light on an intriguing evolutionary pattern: the change in consensus sequence at position +4 of the 5' splice site. This topic is relevant since it is closely associated with intron loss and splicing efficiency and evolution.

We thank the reviewer for the kind and constructive comments on this study.

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    Supplementary file 1. Table containing the 240 genome assemblies used for phylogenetic association mapping.

    All assemblies were downloaded from NCBI. Where available, RefSeq assemblies were preferred over GenBank.

    elife-91997-supp1.xlsx (30.3KB, xlsx)
    MDAR checklist

    Data Availability Statement

    This study uses the publicly available genome sequences of 240 different species. The details of the previously used data sets corresponding to these 240 genome sequences are reported in Supplementary file 1.


    Articles from eLife are provided here courtesy of eLife Sciences Publications, Ltd

    RESOURCES