eLife. 2024 Mar 12;13:RP93629. doi: 10.7554/eLife.93629

Random genetic drift sets an upper limit on mRNA splicing accuracy in metazoans

Florian Bénitière 1, Anamaria Necsulea 1, Laurent Duret 1
Editors: Vincent Castric2, George H Perry3
PMCID: PMC10932544  PMID: 38470242

Abstract

Most eukaryotic genes undergo alternative splicing (AS), but the overall functional significance of this process remains a controversial issue. It has been noticed that the complexity of organisms (assayed by the number of distinct cell types) correlates positively with their genome-wide AS rate. This has been interpreted as evidence that AS plays an important role in adaptive evolution by increasing the functional repertoires of genomes. However, this observation also fits with a totally opposite interpretation: given that ‘complex’ organisms tend to have small effective population sizes (Ne), they are expected to be more affected by genetic drift, and hence more prone to accumulate deleterious mutations that decrease splicing accuracy. Thus, according to this ‘drift barrier’ theory, the elevated AS rate in complex organisms might simply result from a higher splicing error rate. To test this hypothesis, we analyzed 3496 transcriptome sequencing samples to quantify AS in 53 metazoan species spanning a wide range of Ne values. Our results show a negative correlation between Ne proxies and the genome-wide AS rates among species, consistent with the drift barrier hypothesis. This pattern is dominated by low abundance isoforms, which represent the vast majority of the splice variant repertoire. We show that these low abundance isoforms are depleted in functional AS events, and most likely correspond to errors. Conversely, the AS rate of abundant isoforms, which are relatively enriched in functional AS events, tends to be lower in more complex species. All these observations are consistent with the hypothesis that variation in AS rates across metazoans reflects the limits set by drift on the capacity of selection to prevent gene expression errors.

Research organism: Other

Introduction

Eukaryotic protein-coding genes are interrupted by introns, which have to be excised from the primary transcript to produce functional mRNAs that can be translated into proteins. The removal of introns from primary transcripts can lead to the production of diverse mRNAs, via the differential use of splice sites. This process of alternative splicing (AS) is widespread in eukaryotes (Chen et al., 2014), but its ’raison d’être’ (adaptive or not) remains elusive. Numerous studies have shown that some AS events are functional, that is that they play a beneficial role for the fitness of organisms, either by allowing the production of distinct protein isoforms (Graveley, 2001) or by regulating gene expression post-transcriptionally (McGlincy and Smith, 2008; Hamid and Makeyev, 2014). However, other AS events are undoubtedly not functional. Like any biological machinery, the spliceosome occasionally makes errors, leading to the production of aberrant mRNAs, which represent a waste of resources and are therefore deleterious for the fitness of the organisms (Hsu and Hertel, 2009; Gout et al., 2013). The splicing error rate at a given intron is expected to depend both on the efficiency of the spliceosome and on the intrinsic quality of its splice signals. The information required in cis for the removal of each intron resides in 20–40 nucleotide sites, located within the intron or its flanking exons (Lynch, 2006). Besides the two splice sites that are essential for the splicing reaction (almost always GT for the donor and AG for the acceptor), all other signals tolerate some sequence flexibility. 
Population genetics principles state that the ability of selection to promote beneficial mutations or eliminate deleterious mutations depends on the intensity of selection (s) relative to the power of random genetic drift (defined by the effective population size, Ne): if the selection coefficient is sufficiently weak relative to drift (|Nes| ≪ 1), alleles behave as if they are effectively neutral. Thus, random drift sets an upper limit on the capacity of selection to prevent the fixation of alleles that are sub-optimal (Kimura et al., 1963; Ohta, 1973). This so-called ‘drift barrier’ (Lynch, 2007) is expected to affect the efficiency of all cellular processes, including splicing. Hence, species with low Ne should be more prone to make splicing errors than species with high Ne.
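
To make the |Nes| ≪ 1 criterion concrete, the following Python sketch evaluates Kimura's diffusion approximation for the fixation probability of a new mutation in a diploid population. This formula is standard population genetics rather than something taken from this study. The sketch assumes that census and effective sizes are equal, and the function name and numerical values are purely illustrative.

```python
import math


def fixation_probability(ne: float, s: float) -> float:
    """Kimura's diffusion approximation for a new mutation.

    Assumes a diploid population whose census size equals Ne, so the
    initial allele frequency is 1 / (2 * Ne). This is an illustrative
    simplification, not a statement about any real species.
    """
    if s == 0:
        return 1.0 / (2.0 * ne)  # neutral expectation
    return (1.0 - math.exp(-2.0 * s)) / (1.0 - math.exp(-4.0 * ne * s))


# A slightly deleterious mutation (hypothetical s) in populations of different sizes.
s = -1e-5
for ne in (1e3, 1e4, 1e5, 1e6):
    relative = fixation_probability(ne, s) * 2.0 * ne  # ratio to the neutral value
    print(f"Ne={ne:.0e}  |Ne*s|={abs(ne * s):.2f}  relative fixation probability={relative:.3g}")
```

When |Nes| is well below 1 the mutation fixes almost as often as a neutral one, whereas for |Nes| well above 1 fixation becomes vanishingly rare.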

The extent to which AS events correspond to functional isoforms or to errors is a contentious issue (Bhuiyan et al., 2018; Tress et al., 2017b; Blencowe, 2017; Tress et al., 2017a). In humans, the set of transcripts produced by a given gene generally consists of one major transcript (the ‘major isoform’), which encodes a functional protein, and of multiple minor isoforms (splice variants), present in relatively low abundance, and whose coding sequence is frequently interrupted by premature termination codons (PTCs) (Tress et al., 2017a; Gonzàlez-Porta et al., 2013). Ultimately, less than 1% of human splice variants lead to the production of a detectable amount of protein (Abascal et al., 2015). Furthermore, comparison with closely related species showed that AS patterns evolve very rapidly (Barbosa-Morais et al., 2012; Merkin et al., 2012) and that alternative splice sites present little evidence of selective constraints (Pickrell et al., 2010). All these observations are consistent with the hypothesis that a vast majority of splice variants observed in human transcriptomes simply correspond to erroneous transcripts (Pickrell et al., 2010). However, some authors argue that a large fraction of AS events might in fact contribute to regulating gene expression. Indeed, PTC-containing splice variants are recognized and degraded by the nonsense-mediated decay (NMD) machinery. Thus, AS can be coupled with NMD to modulate gene expression at the post-transcriptional level (McGlincy and Smith, 2008; Hamid and Makeyev, 2014). This AS-NMD regulatory process does not involve the production of proteins and does not necessarily imply strong evolutionary constraints on splice sites. Thus, based on these observations, it is difficult to firmly refute selectionist or non-adaptive models.

The analysis of transcriptomes from various eukaryotic species showed substantial variation in AS rates across lineages, with the highest rate in primates (Barbosa-Morais et al., 2012; Chen et al., 2014; Mazin et al., 2021). Interestingly, the genome-wide average AS level was found to correlate positively with the complexity of organisms (approximated by the number of cell types; Chen et al., 2014). This correlation was considered as evidence that AS contributed to the evolution of complex organisms by increasing the functional repertoire of their genomes (Chen et al., 2014). This pattern is often presented as an argument supporting the importance of AS in adaptation (Verta and Jacobs, 2022; Singh and Ahi, 2022; Wright et al., 2022). However, this correlation is also compatible with a totally opposite hypothesis. Indeed, eukaryotic species with the highest level of complexity correspond to multi-cellular organisms with relatively large body size, which tend to have small effective population sizes (Ne) (Lynch and Conery, 2003; Figuet et al., 2016). Thus, the higher AS rate observed in ‘complex’ organisms might simply reflect an increased rate of splicing errors, resulting from the effect of the drift barrier on the quality of splice signals (Bush et al., 2017).

To assess this hypothesis and evaluate the impact of genetic drift on alternative splicing patterns, we quantified AS rates in 53 metazoan species, covering a wide range of Ne values, and for which high-depth transcriptome sequencing data were available. We show that the genome-wide average AS rate correlates negatively with Ne, in agreement with the drift barrier hypothesis. This pattern is mainly driven by low abundance isoforms, which represent the vast majority of splice variants and most likely correspond to errors. Conversely, the AS rate of abundant splice variants, which are enriched in functional AS events, shows the opposite trend. These results support the hypothesis that the drift barrier sets an upper limit on the capacity of selection to minimize splicing errors.

Results

Genomic and transcriptomic data collection

To analyze variation in AS rates across metazoans, we examined a collection of 69 species for which transcriptome sequencing (RNA-seq) data, genome assemblies, and gene annotations were available in public databases. We focused on vertebrates and insects, the two metazoan clades that were the best represented in public databases when we initiated this project. To be able to compare average AS rates across species, we needed to control for several possible sources of biases. First, given that AS rates vary across genes (Saudemont et al., 2017), we had to analyze a common set of orthologous genes. For this purpose, we extracted from the BUSCO database (Seppey et al., 2019) a reference set of single-copy orthologous genes shared across metazoans (N=978 genes), and searched for their homologues in each species in our dataset. We retained for further analyses those species for which at least 80% of the BUSCO metazoan gene set could be identified (N=67 species; see Materials and methods). Second, we had to ensure that RNA-seq read coverage was sufficiently high in each species to detect splicing variants. Indeed, to be able to detect AS at a given intron, it is necessary to analyze a minimal number of sequencing reads encompassing this intron (we used a threshold of N=10 reads). To assess the impact of sequencing depth on AS detection, we conducted a pilot analysis with two species (Homo sapiens and Drosophila melanogaster) for which hundreds of RNA-seq samples are available. This analysis (detailed in Figure 2—figure supplement 1) revealed that AS rate estimates are very noisy when sequencing depth is limited, but that they converge when sequencing depth is high enough. We therefore kept for further analysis those species for which the median read coverage across exonic regions of BUSCO genes was above 200 (Figure 2—figure supplement 1).
Our final dataset thus consisted of 53 species (15 vertebrates and 38 insects; Figure 1A), and of 3496 RNA-seq samples (66 per species on average). In these species, the number of analyzable annotated introns (i.e. encompassed by at least 10 reads) among BUSCO genes ranges from 2032 to 10,981 (which represents 88.6% to 99.6% of their annotated introns; Figure 1—source data 1). It should be noted that analyzed samples originate from diverse sources; however, they are very homogeneous in terms of sequencing technology (99% of RNA-seq samples sequenced with Illumina platforms; refer to Data10-supp.tab in the Zenodo data repository).
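
The two inclusion criteria described above (BUSCO completeness and sequencing depth) can be sketched as a simple filter. The Python snippet below is only an illustration: the `SpeciesSummary` record, its field names, and the example values are hypothetical and do not come from the study's data files.

```python
from dataclasses import dataclass

MIN_BUSCO_FRACTION = 0.80    # at least 80% of the BUSCO metazoan genes identified
MIN_MEDIAN_COVERAGE = 200.0  # median exonic read coverage must be above this value


@dataclass
class SpeciesSummary:
    name: str                # hypothetical label, not a real species record
    busco_fraction: float    # fraction of the 978 BUSCO genes found in the genome
    median_coverage: float   # median read coverage across BUSCO gene exons


def passes_filters(sp: SpeciesSummary) -> bool:
    """Apply the two inclusion criteria described in the text."""
    return (sp.busco_fraction >= MIN_BUSCO_FRACTION
            and sp.median_coverage > MIN_MEDIAN_COVERAGE)


# Illustrative values only.
candidates = [
    SpeciesSummary("species_A", 0.95, 850.0),
    SpeciesSummary("species_B", 0.72, 900.0),  # fails the BUSCO criterion
    SpeciesSummary("species_C", 0.91, 120.0),  # fails the coverage criterion
]
retained = [sp.name for sp in candidates if passes_filters(sp)]
print(retained)
```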

Figure 1. Species phylogeny and Ne proxies.

(A) Phylogenetic tree of the 53 studied species (15 vertebrates and 38 insects). (B) Relationship between body length (cm, log scale) and longevity (days, log scale) of the organism. Each dot represents one species (colored by clade, as in the species tree in panel A). (C) Relationship between longevity (days, log scale) and the dN/dS ratio on terminal branches of the phylogenetic tree (Materials and methods). (B,C) PGLS stands for Phylogenetic Generalized Least Squares regression, which takes into account phylogenetic inertia (Materials and methods).

Figure 1—source data 1. Summary of the main features of the samples analyzed in this study.
Figure 1—source data 2. Longevity and body length across the 53 metazoans studied.

Proxies for the effective population size (Ne)

Effective population sizes (Ne) can in principle be inferred from levels of genetic polymorphism. However, population genetics data are lacking for most of the species in our dataset. We therefore used two life history traits that were previously proposed as proxies of Ne in metazoans (Waples, 2016; Romiguier and Weyna, 2020; Figuet et al., 2016): body length and longevity (Materials and methods; Figure 1—source data 2). An additional proxy for Ne can be obtained by studying the intensity of purifying selection acting on protein sequences, through the dN/dS ratio (Kryazhimskiy and Plotkin, 2008). To evaluate this ratio, we aligned 922 BUSCO genes, reconstructed the phylogenetic tree of the 53 species (Figure 1A) and computed the dN/dS ratio along each terminal branch (Materials and methods).

We note that these three proxies provide ‘inverse’ estimates of Ne, meaning that species with high longevity, large body length and/or elevated dN/dS values tend to have low Ne values. As expected, these different proxies of Ne are positively correlated with each other (p < 1 × 10⁻³, Figure 1B and C). We note however that these correlations are not very strong. It thus seems likely that none of these proxies provides a perfect estimate of Ne. To take phylogenetic inertia into account, all cross-species correlations presented here were computed using Phylogenetic Generalized Least Squares (PGLS) regression (Freckleton et al., 2002).

Alternative splicing rates are negatively correlated with Ne proxies

To quantify AS rates, we mapped RNA-seq data of each species on the corresponding reference genome assembly. We detected sequencing reads indicative of a splicing event (hereafter termed ‘spliced reads’), and inferred the corresponding intron boundaries. We were thus able to validate the coordinates of annotated introns and to detect new introns, not present in the annotations. For each intron detected in RNA-seq data, we counted the number of spliced reads matching with its two boundaries (Ns) or sharing only one of its boundaries (Na), as well as the number of unspliced reads covering its boundaries (Nu) (Figure 2A). We then computed the relative abundance of this spliced isoform compared to other transcripts with alternative splice boundaries ($\mathrm{RAS} = \frac{N_s}{N_s + N_a}$) or compared to unspliced transcripts ($\mathrm{RANS} = \frac{N_s}{N_s + N_u/2}$).
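
As a minimal sketch of these two ratios, the Python function below computes RAS and RANS from the three read counts, with Nu halved in the RANS denominator as in the formula above. The function name and the example counts are illustrative assumptions. Returning `None` when a denominator is zero is a choice made here for the sketch, not something specified in the text.

```python
from typing import Optional, Tuple


def relative_abundances(ns: int, na: int, nu: int) -> Tuple[Optional[float], Optional[float]]:
    """Return (RAS, RANS) for one intron.

    ns: spliced reads matching both intron boundaries
    na: spliced reads sharing only one boundary (alternative variants)
    nu: unspliced reads covering the intron boundaries
    """
    ras = ns / (ns + na) if (ns + na) > 0 else None
    rans = ns / (ns + nu / 2) if (ns + nu / 2) > 0 else None
    return ras, rans


# Hypothetical counts for a single intron.
ras, rans = relative_abundances(ns=90, na=10, nu=20)
print(f"RAS={ras:.2f}  RANS={rans:.2f}")
```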

Figure 2. Distinguishing major and minor-isoform introns and measuring the rate of alternative splicing.

(A) Definition of the variables used to compute the relative abundance of a spliced isoform compared to other transcripts with alternative splice boundaries (RAS) or compared to unspliced transcripts (RANS): Ns: number of spliced reads corresponding to the precise excision of the focal intron; Na: number of reads corresponding to alternative splice variants relative to this intron (i.e. sharing only one of the two intron boundaries); Nu: number of unspliced reads, co-linear with the genomic sequence. (B,C) Histograms representing the distribution of RAS and RANS values (divided into 5% bins), for protein-coding gene introns. Each line represents one species. Two representative species are colored: Drosophila melanogaster (red), Homo sapiens (brown). (D) Description of the variables used to compute the AS rate of a given major-isoform intron, and the ’minor-isoform intron relative abundance’ (MIRA) of each of its splice variants (SVs): NM: number of spliced reads corresponding to the excision of the major-isoform intron; Nim: number of spliced reads corresponding to the excision of a minor-isoform intron (i); Nm: total number of spliced reads corresponding to the excision of minor-isoform introns. (E) Definitions of the main variables used in this study.

Figure 2—figure supplement 1. Transcriptome sequencing depth affects intron detection power and AS rate estimates.

To assess the impact of sequencing depth on AS detection, we conducted a pilot analysis with two species (A,C: Homo sapiens and B,D: Drosophila melanogaster) for which hundreds of RNA-seq samples are available (refer to Data10-supp.tab in the Zenodo data repository). We randomly drew 1–20 RNA-seq samples and, for each draw, we computed the median read coverage across BUSCO gene exons (to get a measure of transcriptome sequencing depth that is comparable across species). We also computed for each draw the average AS rate and the fraction of introns supported by at least 10 RNA-seq reads, out of all introns annotated for BUSCO genes (Materials and methods). We repeated this procedure 30 times. As expected, the fraction of BUSCO introns that are supported by at least 10 reads (i.e. Ns + Na ≥ 10) increases with sequencing depth (A,B). More importantly, we observed that when sequencing depth is limited, the mean AS rate of BUSCO introns is very variable across draws (C,D). However, AS rate estimates converge when sequencing depth exceeds 200. We therefore kept for further analysis those species for which the median read coverage across exonic regions of BUSCO genes was above this threshold.
Figure 2—figure supplement 2. The power to detect AS events is positively correlated with transcriptome sequencing depth.

Relationship between the proportion of major-isoform introns that have at least one read corresponding to splice variants (i.e. Na > 0; see Figure 2), and the median per-base read coverage computed on BUSCO gene exons, across metazoans. Each dot represents one species, colored by taxonomic clade.
Figure 2—figure supplement 3. Description of the bioinformatic analyses pipeline.

First, we retrieved genomic sequences and annotations from the NCBI Genomes database. We aligned RNA-seq reads with HISAT2 on the corresponding reference genomes, to analyze various variables (see Figure 2), to compute the AS rate, and to estimate gene expression using Cufflinks. To compute dN/dS, we first identified BUSCO genes with BUSCOv3 and aligned their coding sequences (CDS) using PRANK (codon model). We reconstructed a phylogenetic tree using RAxML-NG with 461 multiple alignments. Using bio++, we estimated dN/dS along the phylogenetic tree on concatenated alignments.

To limit measurement noise, we only considered introns for which both RAS and RANS could be computed based on at least 10 reads (Materials and methods). In all species, both RAS and RANS metrics show clearly bimodal distributions (Figure 2B and C): the first peak (mode < 5%) corresponds to ‘minor-isoform introns’, whose splicing occurs only in a minority of transcripts of a given gene, whereas the second one (mode > 95%) corresponds to the introns of major isoforms. It has been previously shown that in humans, for most genes, one single transcript largely dominates over other isoforms (Tress et al., 2017a; Gonzàlez-Porta et al., 2013). Our observations indicate that this pattern is generalized across metazoans. For the rest of our analyses, we computed the rate of alternative splicing with respect to introns of the major isoform. We will hereafter use the term ‘splice variant’ (SV) to refer to those splicing events that are detected in a minority of transcripts (i.e. with RAS ≤ 0.5 or RANS ≤ 0.5; see Figure 2E for a definition of the main variables used in this study).
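
The classification rule described above can be sketched as follows. This Python snippet assumes that the 10-read requirement applies to the denominators of RAS and RANS (Ns + Na and Ns + Nu/2); that reading, the function name, and the example counts are assumptions made for illustration.

```python
from typing import Optional

MIN_READS = 10  # minimum read support, assumed to apply to each denominator


def classify_intron(ns: int, na: int, nu: int) -> Optional[str]:
    """Label an intron as 'major' or 'minor', or None if read support is too low.

    An intron is treated as a splice variant ('minor') when RAS <= 0.5
    or RANS <= 0.5, and as a major-isoform intron otherwise.
    """
    ras_denominator = ns + na
    rans_denominator = ns + nu / 2
    if ras_denominator < MIN_READS or rans_denominator < MIN_READS:
        return None  # too few reads to compute both metrics reliably
    ras = ns / ras_denominator
    rans = ns / rans_denominator
    return "major" if (ras > 0.5 and rans > 0.5) else "minor"


# Hypothetical read counts (ns, na, nu).
for counts in [(95, 3, 4), (2, 60, 30), (3, 2, 4)]:
    print(counts, classify_intron(*counts))
```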

We focused our analyses on major-isoform introns interrupting protein-coding regions (i.e. we excluded introns located within UTRs, Materials and methods). In vertebrates, each BUSCO gene contains on average 8.4 major-isoform introns (Figure 1—source data 1). The intron density is more variable among insect clades, ranging from 2.8 major-isoform introns per BUSCO gene in Diptera to 6.1 in Blattodea. As expected, most major-isoform introns have GT/AG splice sites (99.1% on average across species), and only a small fraction have boundaries that do not match the canonical splice sites of U2-type introns (0.8% GC/AG and 0.1% AT/AC). The fraction of non-canonical splice sites is slightly higher among minor-isoform introns (2.8% GC/AG and 0.3% AT/AC). This might reflect a higher prevalence of U12-type introns but might also be caused by the presence of some false positives in the set of minor-isoform introns. In any case, the difference in splice signal usage between minor and major-isoform introns is small, which indicates that the vast majority of detected minor-isoform introns correspond to bona fide splicing events.
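
Tallying splice-site categories of this kind amounts to reading the first two and last two bases of each intron. The Python sketch below assumes intron sequences are given 5′ to 3′ on the transcribed strand; the function name and the toy sequences are hypothetical.

```python
from collections import Counter

# Dinucleotide pairs (donor, acceptor) mentioned in the text.
KNOWN_CLASSES = {
    ("GT", "AG"): "GT/AG",
    ("GC", "AG"): "GC/AG",
    ("AT", "AC"): "AT/AC",
}


def splice_site_class(intron_seq: str) -> str:
    """Classify an intron by its terminal dinucleotides."""
    seq = intron_seq.upper()
    if len(seq) < 4:
        raise ValueError("intron sequence too short to hold both splice sites")
    return KNOWN_CLASSES.get((seq[:2], seq[-2:]), "other")


# Toy sequences for illustration only.
toy_introns = ["GTAAGTTTTCAG", "GCAAGTCCCTAG", "ATATCCTTTTAC", "GTAAGTCCCTAG"]
counts = Counter(splice_site_class(seq) for seq in toy_introns)
print(dict(counts))
```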

The proportion of major-isoform introns for which AS has been detected (i.e. with Na > 0) ranges from 16.8% to 95.7% depending on the species (Figure 1—source data 1). This metric is however not very meaningful because it directly reflects differences in sequencing depth across species (the higher the sequencing effort, the higher the probability to detect a rare SV, Figure 2—figure supplement 2). To allow a comparison across taxa, we computed the AS rate of introns, normalized by sequencing depth ($\mathrm{AS} = \frac{N_m}{N_M + N_m}$, Materials and methods; Figure 2D). The average AS rate for BUSCO genes varies by a factor of 5 among species, from 0.8% in Drosophila grimshawi (Diptera) to 3.8% in Megachile rotundata (Hymenoptera) (3.4% in humans). Interestingly, the average AS rates of BUSCO gene introns are significantly correlated with the three proxies of Ne: species longevity (Figure 3A), body length and the dN/dS ratio (Figure 3—figure supplement 1A and B). These correlations are positive, which implies that AS rates tend to increase when Ne decreases. It is noteworthy that despite the fact that these proxies are not strongly correlated with each other (Figure 1B and C), they all show similar relationships with AS rates. It should be stressed that these correlations were estimated using the PGLS method to account for phylogenetic inertia (and they remain significant when analyzing insects and vertebrates separately, Figure 3—figure supplement 2). Thus, these observations are consistent with the hypothesis that Ne has an impact on the evolution of AS rate.
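
A minimal sketch of the per-intron AS rate and its average across introns is given below. It assumes the species-level value is a plain mean of per-intron rates; the function names and read counts are illustrative, not taken from the study.

```python
from statistics import mean
from typing import Iterable, Tuple


def as_rate(n_major: int, n_minor: int) -> float:
    """AS rate of one major-isoform intron: Nm / (NM + Nm)."""
    total = n_major + n_minor
    if total == 0:
        raise ValueError("no spliced reads for this intron")
    return n_minor / total


def average_as_rate(introns: Iterable[Tuple[int, int]]) -> float:
    """Mean of per-intron AS rates, given (NM, Nm) pairs."""
    return mean(as_rate(n_major, n_minor) for n_major, n_minor in introns)


# Hypothetical (NM, Nm) read counts for three major-isoform introns.
toy_introns = [(990, 10), (480, 20), (200, 0)]
print(f"average AS rate = {average_as_rate(toy_introns):.4f}")
```

Because the rate is a ratio of read counts at the same intron, it does not grow simply because more reads were sequenced, unlike the proportion of introns with Na > 0.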

Figure 3. The rate of alternative splicing correlates with life history traits across metazoans.

(A) Relationship between the per intron average AS rate of an organism and its longevity (days, log scale). (B) Variation in average AS rate across seven organs (brain, cerebellum, heart, liver, kidney, testis, and ovary) among seven vertebrate species (RNA-seq data from Cardoso-Moreira et al., 2019). AS rates are computed on major-isoform introns from BUSCO genes (Materials and methods).

Figure 3—figure supplement 1. Relationship between AS rates and other Ne proxies.

(A, B) Correlation between the average AS rate per intron and the body length of each species (cm, log scale) (A) or the dN/dS ratio on terminal branches of the phylogenetic tree (B). (C, D, E, F) Relationship between the average AS rate per intron and the body length (cm, log scale) (C, E) or the dN/dS ratio (D, F). (C, D) Low-AS major-isoform introns (i.e. major-isoform introns that do not have any abundant SV). (E, F) High-AS major-isoform introns (i.e. major-isoform introns having at least one abundant SV). Only BUSCO genes were used in the analysis.
Figure 3—figure supplement 2. The rate of alternative splicing correlates with life history traits in both vertebrates and insects.

Correlation between the average AS rate per intron and longevity of each species (days, log scale) (A, D), body length (cm, log scale) (B, E), or the dN/dS ratio on terminal branches of the phylogenetic tree (C, F), in vertebrates (A, B, C) and insects (D, E, F). Only the BUSCO genes were included in the analysis.
Figure 3—figure supplement 3. The variation in AS rates between species is not explained by organ differences.

Variation in average AS rate across seven organs (brain, cerebellum, heart, liver, kidney, testis, and ovary) among seven vertebrate species (RNA-seq data from Cardoso-Moreira et al., 2019) and across three organs (ovary, testis, and head) for one insect (Dendroctonus ponderosae, Coleoptera). AS rates were computed for the major-isoform introns from BUSCO genes (Materials and methods).
Figure 3—figure supplement 4. The per-gene AS rate is negatively correlated with Ne.

Relationship between per-gene average AS rates and Ne proxies. We use as inverse Ne proxies the longevity (days, log scale) (A, D) or the body length (cm, log scale) (B, E) or the dN/dS ratio (C, F). The analysis was done on BUSCO genes (A, B, C) and on all protein-coding genes (D, E, F).

One limitation of our analyses is that we used heterogeneous sources of transcriptomic data. To obtain enough sequencing depth, we combined for each species many RNA-seq samples, irrespective of their origin (whole body, or specific tissues or organs, in adults or embryos, etc.). It is known that genome-wide average AS rates vary according to tissues or developmental stages (Barbosa-Morais et al., 2012; Mazin et al., 2021), and according to environmental conditions (John et al., 2021). To explore how this might have affected our results, we repeated our analyses using a recently published dataset that aimed to compare transcriptomes across seven organs, sampled at several developmental stages in seven species (six mammals, one bird; Cardoso-Moreira et al., 2019). In agreement with previous reports (Mazin et al., 2021), our analysis of BUSCO genes revealed substantial differences in AS rates among organs, with consistent patterns of variation across species. For instance, in all species, testes and brain tissues show higher AS rates than liver and kidney (Figure 3B). However, the variation in AS rate among organs in each species is limited compared to differences between species. Specifically, in an ANOVA analysis performed on the average AS rate across BUSCO gene introns, with the species and the organ of origin as explanatory variables, the species factor explained 89% of the total variance, while the organ factor explained only 9%. Among insects, we found only one species (Dendroctonus ponderosae) for which RNA-seq samples were available from multiple tissues. Here again, the variance in AS rate among tissues was limited compared to inter-species variability (Figure 3—figure supplement 3). Thus, despite the variability that can be introduced by the heterogeneity of RNA-seq samples, the relationship between AS rate and longevity remains detectable among these seven species (Figure 3B).
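
The variance partition reported above can be illustrated with a small two-factor decomposition. The Python sketch below assumes a balanced design with one average AS rate per species and organ combination and no interaction term. The exact ANOVA specification used in the study is not given here, so this is only one plausible reading, and the table values are invented for illustration.

```python
from typing import Dict, Tuple


def variance_partition(table: Dict[str, Dict[str, float]]) -> Tuple[float, float]:
    """Fractions of the total sum of squares explained by species and by organ.

    `table[species][organ]` holds one average AS rate per combination
    (balanced design assumed).
    """
    species = list(table)
    organs = list(next(iter(table.values())))
    values = [table[s][o] for s in species for o in organs]
    grand = sum(values) / len(values)
    ss_total = sum((v - grand) ** 2 for v in values)
    ss_species = len(organs) * sum(
        (sum(table[s][o] for o in organs) / len(organs) - grand) ** 2 for s in species
    )
    ss_organ = len(species) * sum(
        (sum(table[s][o] for s in species) / len(species) - grand) ** 2 for o in organs
    )
    return ss_species / ss_total, ss_organ / ss_total


# Invented AS rates (%) for three hypothetical species and two organs.
toy = {
    "species_A": {"liver": 1.0, "brain": 1.2},
    "species_B": {"liver": 2.0, "brain": 2.2},
    "species_C": {"liver": 3.0, "brain": 3.2},
}
frac_species, frac_organ = variance_partition(toy)
print(f"species: {frac_species:.1%}  organ: {frac_organ:.1%}")
```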

Functional vs. non-functional alternative splicing

The negative correlation observed between Ne and alternative splicing rates is consistent with the hypothesis that differences in AS rates across species are driven by variation in the rate of splicing errors (drift barrier model). This does not exclude however that functional splicing variants might also contribute to AS rate variation across species. To evaluate this point, we selected a subset of SVs that are enriched in functional AS events. To do this, we reasoned that selective pressure against the waste of resources should maintain splicing errors at a low rate (as low as permitted by the drift barrier), whereas functional SVs are expected to represent a sizeable fraction of the transcripts expressed by a given gene, at least in some specific conditions (cell type, developmental stage…). Thus, functional SVs are expected to be enriched among abundant SVs compared to rare SVs.

To assess this prediction, we analyzed the proportion of SVs that preserve the reading frame according to their abundance relative to the major isoform. For this, we focused on minor-isoform introns that share a boundary with one major-isoform intron and that have their other boundary at less than 30 bp from the major splice site (either in the flanking exon or within the major-isoform intron). We determined whether the distance between the minor-isoform intron boundary and the major-isoform intron boundary was a multiple of 3. We computed the abundance of each minor isoform, relative to the corresponding major isoform, with the following formula for the minor-intron relative abundance (MIRA): $\mathrm{MIRA}_i = \frac{N^i_m}{N_M + N_m}$ (see Figure 2D).
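
The two quantities defined in this paragraph can be sketched in a few lines of Python. The coordinate convention (integer genomic positions of the two non-shared boundaries) and the function names are assumptions made for illustration.

```python
from typing import Optional


def mira(n_minor_i: int, n_major: int, n_minor_total: int) -> float:
    """Relative abundance of minor-isoform intron i: Nim / (NM + Nm)."""
    return n_minor_i / (n_major + n_minor_total)


def is_frame_preserving(major_boundary: int, minor_boundary: int,
                        max_distance: int = 30) -> Optional[bool]:
    """Check whether a shifted splice site keeps the reading frame.

    Returns None when the variant falls outside the analyzed window
    (identical positions, or a shift of max_distance bp or more).
    """
    distance = abs(major_boundary - minor_boundary)
    if distance == 0 or distance >= max_distance:
        return None
    return distance % 3 == 0


# Hypothetical counts and coordinates.
print(mira(n_minor_i=5, n_major=90, n_minor_total=10))  # 5 / 100
print(is_frame_preserving(1000, 1006))  # 6 bp shift, in frame
print(is_frame_preserving(1000, 1004))  # 4 bp shift, frameshift
```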

We divided minor-isoform introns into 5% bins according to their MIRA and computed for each bin the proportion of minor-isoform introns that maintain the reading frame of the major isoform (Figure 4A). In all species, we observe that this proportion varies according to the abundance of splice variants, with two distinct regimes (Figure 4A). First, for MIRA values above 5%, the proportion of frame-preserving variants correlates positively with MIRA, reaching up to 60–70% for the most abundant isoforms. Second, for MIRA values below 1%, the proportion of frame-preserving variants does not covary with MIRA, and fluctuates around 30–40%, close to the random expectation (33%). The excess of frame-preserving variants among the most abundant isoforms implies that a substantial fraction of them is under constraint to encode functional protein isoforms. This fraction varies from 0% for MIRA values below 1%, to 50% for isoforms with the highest MIRA values. It should be noted that these estimates correspond to a lower bound, since it is possible that some frame-shifting splice variants are functional. Nevertheless, these observations clearly indicate that the subset of SVs with MIRA values >5% (hereafter referred to as ‘abundant SVs’) is strongly enriched in functional isoforms relative to other SVs (MIRA ≤ 5%, hereafter referred to as ‘rare SVs’). Of note, the subset of rare SVs represents the vast majority of the SV repertoire (from 62.4% to 96.9% depending on the species; Figure 1—source data 1). Thus, the positive correlation between AS rate and longevity reported above (Figure 3A) is mainly driven by the set of introns with a low AS rate (Figure 4C). Interestingly, introns with high AS rate (enriched in functional SVs) show an opposite trend (Figure 4D), and they display a lower proportion of frame-preserving SVs in vertebrates than in dipterans (Figure 4B). This is the opposite of what would have been expected if functional SVs were more prevalent in complex organisms.
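
The 0% to 50% range quoted above is consistent with a simple mixture argument. If a fraction f of the variants in a bin must preserve the frame and the remainder preserve it at the random rate of 1/3, the observed proportion is p = f + (1 − f)/3, so f = (3p − 1)/2. Whether the estimates in the text were obtained exactly this way is an assumption. The Python sketch below bins variants by MIRA and applies this formula to invented data.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

BIN_WIDTH = 0.05  # 5% MIRA bins


def frame_preserving_by_bin(variants: Iterable[Tuple[float, bool]]) -> Dict[int, float]:
    """Proportion of frame-preserving variants per MIRA bin.

    `variants` yields (MIRA, frame_preserving) pairs; bin k covers
    MIRA values from k * 5% up to (k + 1) * 5%.
    """
    totals = defaultdict(int)
    preserved = defaultdict(int)
    n_bins = round(1 / BIN_WIDTH)
    for value, keeps_frame in variants:
        k = min(int(value * n_bins), n_bins - 1)
        totals[k] += 1
        preserved[k] += int(keeps_frame)
    return {k: preserved[k] / totals[k] for k in totals}


def constrained_fraction(p: float) -> float:
    """Lower-bound fraction of variants constrained to keep the frame."""
    return max(0.0, (3.0 * p - 1.0) / 2.0)


# Invented variants: three rare ones and three abundant ones.
toy = [(0.01, True), (0.02, False), (0.03, False),
       (0.12, True), (0.13, True), (0.14, False)]
proportions = frame_preserving_by_bin(toy)
for k in sorted(proportions):
    p = proportions[k]
    print(f"bin {k}: frame-preserving={p:.2f}  constrained fraction={constrained_fraction(p):.2f}")
```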

Figure 4. Variation in AS rate across metazoans: distinguishing abundant splice variants (enriched in functional variants) from rare splice variants.

(A) Frame-preserving isoforms are strongly enriched among abundant splice variants (SVs). For each species, SVs were classified into 20 equal-size bins according to their abundance relative to the major isoform (MIRA, see Materials and Methods), and the proportion of frame-preserving SVs was computed for each bin. Each line represents one species. Three representative species are colored: red: Drosophila melanogaster, brown: Homo sapiens, yellow: Apis mellifera. We used a threshold MIRA value of 5% to define ‘abundant’ vs. ‘rare’ SVs. (B) Proportion of frame-preserving SVs among abundant SVs across metazoans. Each dot represents one species. All annotated protein-coding genes are used in the analysis. (C,D) Relationship between the average per intron AS rate of an organism and its longevity (days, log scale). Only BUSCO genes are used in the analysis. (C) Low-AS major-isoform introns (i.e. major-isoform introns that do not have any abundant SV), (D) High-AS major-isoform introns (i.e. major-isoform introns having at least one abundant SV).

Figure 4—figure supplement 1. Relationship between AS rates and Ne proxies, for all major-isoform introns, low-AS major-isoform introns (i.e. major-isoform introns that do not have any abundant splice variant) and high-AS major-isoform introns (i.e. major-isoform introns having at least one abundant splice variant).

Relationship between the average AS rate of all major-isoform introns (A,B,C) or low-AS major-isoform introns (D,E,F) or high-AS major-isoform introns (G,H,I) and longevity (days, log scale) (A,D,G) or body length (cm, log scale) (B,E,H) or the dN/dS ratio (C,F,I).
Figure 4—figure supplement 2. Relationship between the proportion of frame-preserving SVs and Ne proxies.

(A,B) Relationship between the proportion of frame-preserving SVs among abundant SVs, and the body length (cm, log scale) of the organism (A) or the dN/dS ratio (B). Each dot represents one species. All protein-coding genes were used in the analysis.

Investigating selective pressures on minor splice sites

A complementary approach to assess the functionality of AS events consists in investigating signatures of selective constraints on splice sites. For this, we used polymorphism data from Drosophila melanogaster and Homo sapiens to measure single-nucleotide polymorphism (SNP) density at major and minor splice sites, considering separately rare and abundant SVs. We focused on the first two and last two bases of each intron (consensus sequences GT, AG), which represent the most constrained sites within splice signals. We studied minor-isoform introns that share one splice site with a major-isoform intron and we measured SNP density at the corresponding major and minor splice sites. To account for constraints acting on coding regions, we considered separately minor splice sites that were located in an exon or in an intron of the major isoform. As negative controls, we selected AG or GT dinucleotides that were unlikely to correspond to alternative splice sites (Figure 5, Materials and methods). Furthermore, for Homo sapiens we controlled for the presence of hypermutable CpG dinucleotides (Tomso and Bell, 2003; Figure 5—figure supplement 1, Materials and methods).

Figure 5. Variation in selective constraints on alternative splice signals from rare and abundant SVs.

For each minor-isoform intron sharing one boundary with a major-isoform intron, we measured the SNP density at its minor splice site (red), and at the corresponding major splice site (green). We distinguished minor splice sites that are located in an exon or in an intron of the major isoform. As a control (blue), we selected AG or GT dinucleotides that are unlikely to correspond to alternative splice sites, namely: AG dinucleotides located toward the end of the upstream exon or the beginning of the intron (unlikely to correspond to a genuine acceptor site), and GT dinucleotides located toward the beginning of the downstream exon or the end of the intron (unlikely to correspond to a donor site). To increase the sample size, we analyzed data from all annotated protein-coding genes (and not only the BUSCO gene set). The number of sites studied is shown at the top of each bar. Error bars represent the 95% confidence interval of the proportion of polymorphic sites (proportion test). (A,B) SNP density in Drosophila melanogaster (polymorphism data from 205 inbred lines derived from natural populations, N=3,963,397 SNPs; Huang et al., 2014; Mackay et al., 2012). (C,D) SNP density in Homo sapiens (polymorphism data from 2504 individuals, N=80,868,061 SNPs; Auton et al., 2015). We excluded dinucleotides affected by CpG hypermutability (Materials and methods, see Figure 5—figure supplement 1 for CpG sites). (A,C) Abundant SVs (MIRA > 5%). (B,D) Rare SVs (MIRA ≤ 5%).

Figure 5—figure supplement 1. SNP density in human splice signals, for dinucleotides affected by CpG hypermutability.

Density of SNPs on splice signals for major-isoform introns and for SVs that have their minor splice site within the adjacent exon or in the major-isoform intron. The number of introns studied is shown at the top of each bar. (A,B) SNP data from the human 1000 Genomes project (Auton et al., 2015). We included only dinucleotides affected by CpG hypermutability (Materials and methods). Error bars represent the 95% confidence interval of the proportion of polymorphic sites (proportion test). (A): Abundant SVs (MIRA > 5%). (B): Rare SVs (MIRA ≤ 5%). green: major splice sites; red: minor splice sites; blue: control dinucleotides.

For both species, the lowest SNP density is observed at major splice signals, which reflects the strong selective constraints on these sites (Figure 5). In Drosophila melanogaster, there is also a strong signature of selection on minor splice signals of abundant SVs: both in introns and in exons, the SNP density at minor splice signals of abundant SVs is much lower than in corresponding controls (from –37% to –74%, Figure 5A) and than in minor splice signals of rare SVs (from –38% to –71%, Figure 5B). This observation confirms that abundant SVs are strongly enriched in functional variants compared to rare SVs. In Homo sapiens, patterns of SNP density showed little evidence of selective constraints on minor splice sites, irrespective of the abundance of SVs (Figure 5C and D): minor acceptor splice sites (AG) located within the major-isoform intron show a weak but significant SNP deficit relative to corresponding control sites (p-value < 1 × 10⁻⁵), but other categories of minor splice sites do not show any sign of selective constraints. The fact that the signature of selection on minor splice signals is much weaker in humans compared to Drosophila is indicative of a lower prevalence of functional variants, even among abundant SVs. This observation is therefore in total contradiction with the adaptive hypothesis (more functional alternative splicing in complex organisms).
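The comparisons above rest on the proportion of polymorphic sites in each category and its 95% confidence interval. The following minimal sketch shows one way to compute such an interval, assuming a Wilson score interval without continuity correction (the legend only states "proportion test", so the exact procedure is an assumption). The function names and the site counts are hypothetical.

```python
import math

def snp_density_ci(n_polymorphic, n_sites, z=1.96):
    """Proportion of polymorphic sites with a Wilson score interval.

    Assumption: Wilson interval without continuity correction; the
    article only mentions a 'proportion test'.
    """
    if n_sites <= 0:
        raise ValueError("n_sites must be positive")
    p = n_polymorphic / n_sites
    denom = 1 + z**2 / n_sites
    centre = (p + z**2 / (2 * n_sites)) / denom
    half = z * math.sqrt(p * (1 - p) / n_sites + z**2 / (4 * n_sites**2)) / denom
    return p, max(0.0, centre - half), min(1.0, centre + half)

def relative_difference(p_focal, p_control):
    """Relative SNP deficit (negative) or excess (positive) versus control."""
    return (p_focal - p_control) / p_control

# Hypothetical counts, for illustration only (not data from the study).
p_minor, low_minor, high_minor = snp_density_ci(20, 1000)
p_ctrl, low_ctrl, high_ctrl = snp_density_ci(50, 1000)
deficit = relative_difference(p_minor, p_ctrl)
```

With these made-up counts, the minor splice sites would show a 60% SNP deficit relative to controls, which is the kind of quantity reported as a percentage in the text.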

The splicing rate of rare SVs is negatively correlated with gene expression levels

The above analyses are consistent with the hypothesis that the vast majority of rare SVs correspond to erroneous transcripts, and that changes in Ne contribute to variation in AS rate across taxa by shifting the selection-mutation-drift balance. If true, then this model predicts that the erroneous AS rate should also vary among genes, according to their expression level. Indeed, it has been shown that the selective pressure on splicing accuracy is stronger on highly expressed genes (Saudemont et al., 2017). This reflects the fact that for a given splicing error rate, the waste of resources (both in terms of metabolic cost and of futile mobilization of cellular machineries) increases with gene expression level (Saudemont et al., 2017; Xiong et al., 2017). Thus, the selection-mutation-drift balance should lead to a negative correlation between gene expression level and the rate of splicing errors. To test this prediction, we focused on low-AS major-isoform introns, i.e. introns that are unlikely to have functional SVs. For each species, we considered all major-isoform introns with a sufficient sequencing depth to have a precise measure of their AS rate (Ns + Na ≥ 100). The selected subset represents 38.1% to 86.7% of major-isoform introns of each species (median = 70.9%). Introns were then divided into 20 bins of equal size, according to the expression level of the corresponding genes. For each species, we computed the Pearson correlation between the average AS rate and the average expression level across bins. We observed a negative correlation between AS rates and gene expression levels in 52 out of the 53 species (significant with p < 0.05, in 48/53 species; Figure 6A; two representative examples are shown in Figure 6C and D). This pattern indicates that in almost all metazoan species, genes with a higher expression level have a lower AS rate, consistent with the hypothesis that the rate of splicing errors is shaped by the selection-mutation-drift balance.
It should be noted that this negative correlation between AS rate and gene expression level is not expected for functional SVs (there is a priori no reason why the AS rate of functional SVs should be higher in weakly expressed genes than in highly expressed genes). Interestingly, when we performed this analysis on all introns (including those with abundant SVs, which are enriched in functional variants), then most species (31/53) still showed a negative correlation between AS rate and gene expression level (Figure 6B), but some species, such as Drosophila melanogaster, showed the opposite pattern (Figure 6—figure supplement 1). This probably reflects the fact that, in those species, functional AS events make a significant contribution to the genome-wide average AS rate.
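The binning procedure described above can be sketched as follows: introns are sorted by the expression level of their gene, split into 20 equal-size bins, and the Pearson correlation is computed between the per-bin averages. This is only an illustrative sketch: the function names are hypothetical, the toy data are simulated, and the scale on which expression is measured (raw or log-transformed) is left to the caller.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def binned_correlation(expression, as_rate, n_bins=20):
    """Correlate average AS rate with average expression across bins."""
    pairs = sorted(zip(expression, as_rate))  # sort introns by expression
    n = len(pairs)
    if n < n_bins:
        raise ValueError("need at least one intron per bin")
    mean_expr, mean_as = [], []
    for b in range(n_bins):
        chunk = pairs[b * n // n_bins:(b + 1) * n // n_bins]
        mean_expr.append(sum(e for e, _ in chunk) / len(chunk))
        mean_as.append(sum(a for _, a in chunk) / len(chunk))
    return pearson(mean_expr, mean_as)

# Simulated toy data (hypothetical): AS rate declines with expression.
rng = random.Random(0)
expr = [rng.uniform(0, 4) for _ in range(2000)]  # e.g. log10 expression
rates = [max(0.0, 0.02 - 0.003 * e + rng.gauss(0, 0.005)) for e in expr]
r = binned_correlation(expr, rates)
```

Averaging within bins removes most of the per-intron sampling noise, which is why a weak per-intron trend can yield a strong correlation across bins.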

Figure 6. Relationship between AS rate and gene expression level.

For each species, we selected major-isoform introns with a sufficient sequencing depth to have a precise measure of their AS rate (Ns + Na ≥ 100). We divided major-isoform introns into 5% bins according to their gene expression level and computed the correlation between the average AS rate and median expression level across the 20 bins. To increase sample size, these analyses were based on all annotated protein-coding genes (and not only the BUSCO gene set). (A) Distribution of Pearson correlation coefficients (R) between the AS rate and expression level observed in the 53 metazoans. The vertical dashed lines indicate the thresholds under and above which correlations are significant (i.e. p-value < 0.05). (B) Distribution of Pearson correlation coefficients computed on the subsets of low-AS major-isoform introns (i.e. after excluding major-isoform introns with abundant SVs). (C,D) Two representative species illustrating the negative relation between the average AS rate of low-AS major-isoform introns and the expression level of their gene. Error bars represent the standard error of the mean. (C) N=127,599 low-AS major-isoform introns from Homo sapiens, (D) N=31,357 low-AS major-isoform introns from Drosophila melanogaster.

Figure 6—figure supplement 1. Correlations between gene expression levels and AS rates differ among species.

(A,B) Relationship between the average AS rate of major-isoform introns (with Ns + Na ≥ 100, see Figure 2) and the expression levels of the corresponding genes (FPKM, log scale). We divided major-isoform introns into 5% bins according to the expression level of the corresponding genes and computed for each bin the average AS rate and the median expression level. Error bars represent the standard error of the mean. (A) Homo sapiens, (B) Drosophila melanogaster. This analysis was performed on all protein-coding genes (blue) and BUSCO genes (light blue). Pearson correlation presented here was computed on protein-coding genes.

Discussion

To investigate the factors that drive variation in AS rates across species, we analyzed publicly available RNA-seq data across a large set of 53 species, from diverse metazoan clades, covering a wide range of Ne values. To facilitate comparisons across species, we sought to limit the impact of the among-gene variance in AS rates. For this, we primarily based our analyses on a common set of nearly 1000 orthologous protein-coding genes (BUSCO gene set). We focused our study on introns located within protein-coding regions, because introns from UTRs or lncRNAs are expected to be subject to different functional constraints. We measured AS rates on introns corresponding to a major isoform. When sequencing depth is limited, the set of introns for which AS can be quantified is biased toward the most highly expressed genes. To avoid this bias, we restricted our study to species for which the median sequencing depth of BUSCO exons was above 200. With this setting, on average 96.9% of BUSCO annotated introns could be analyzed in each species (Figure 1—source data 1).

We observed a fivefold variation in the average AS rate of BUSCO introns across species, from 0.8% in Drosophila grimshawi (Diptera) to 3.8% in Megachile rotundata (Hymenoptera) (Figure 3A). In agreement with previous work, we observed that AS rates tend to be high in vertebrates (average = 2.3%), and notably in primates (average = 3.1%) (Barbosa-Morais et al., 2012; Chen et al., 2014; Mazin et al., 2021). This observation was previously interpreted as evidence that AS played an important role in the diversification of the functional repertoire necessary for the development of more complex organisms (Chen et al., 2014). However, this pattern is also compatible with the hypothesis that variation in AS rates across species results from differences in splicing error rates, which are expected to be higher in species with low Ne (Bush et al., 2017). Indeed, consistent with this drift barrier hypothesis, we observed significant correlations between AS rates and proxies of Ne (Figure 3B, Figure 3—figure supplement 1A and B).

In their original study, Chen et al., 2014 investigated the hypothesis that variation in AS rates across taxa might be driven by variation in Ne. For this, they focused on 12 species, for which they had measured levels of polymorphism at silent sites (π). They found that the correlation between AS rate and the number of cell types (proxy for organismal complexity) remained significant after controlling for π. They therefore concluded that the association between the cellular diversity and alternative splicing was not a by-product of reduced effective population sizes among more complex species. This conclusion was however based on a very small sample of species. More importantly, it assumed that π could be taken as a proxy for Ne. At mutation-drift equilibrium, π is expected to be proportional to Ne × u (where u is the mutation rate per bp per generation). Thus, if u is constant across taxa, π can be used to estimate variation in Ne. However, the dataset analyzed by Chen et al., 2014 included very diverse eukaryotic species, with mutation rates ranging from 1.7 × 10⁻¹⁰ mutations per bp per generation in budding yeast, to 1.1 × 10⁻⁸ mutations per bp per generation in humans (Lynch et al., 2016). Hence, at this evolutionary scale, variation in Ne cannot be directly inferred from π without accounting for variation in u. Moreover, the drift barrier hypothesis states that the AS rate of a species should reflect the genome-wide burden of slightly deleterious substitutions, which is expected to depend on the intensity of drift over long evolutionary times (i.e. long-term Ne). Conversely, π reflects Ne over a short period of time (of the order of Ne generations), and can be strongly affected by recent population bottlenecks (too recent to have substantially impacted the genome-wide deleterious substitution load).
The drift barrier hypothesis therefore predicts that the splicing error rate should correlate more strongly with proxies of long-term Ne (such as dN/dS, life history traits, or organismal complexity) than with π. The fact that AS rates remained significantly correlated to cellular diversity after controlling for π (Chen et al., 2014) is therefore not a conclusive argument against the drift barrier hypothesis.
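A short numerical illustration of why π cannot stand in for Ne when mutation rates differ: under the proportionality between π and Ne × u, two species with the same π can have very different Ne. The mutation rates below are those quoted above for budding yeast and humans; the π value and the function name are hypothetical.

```python
# Mutation rates per bp per generation, as quoted in the text.
U_YEAST = 1.7e-10
U_HUMAN = 1.1e-8

def relative_ne(pi, u):
    """Quantity proportional to Ne, assuming pi is proportional to Ne * u."""
    return pi / u

# Fold difference in mutation rate between the two species.
fold_u = U_HUMAN / U_YEAST

# Hypothetical example: identical silent-site diversity in both species.
pi = 0.001
ne_ratio = relative_ne(pi, U_YEAST) / relative_ne(pi, U_HUMAN)
```

Here `fold_u` is roughly 65, so equal values of π would imply Ne estimates that differ by the same factor once u is taken into account.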

To contrast the two models (drift barrier vs diversification of the functional repertoire in complex organisms), we sought to distinguish functional splice isoforms from erroneous splicing events. Based on the assumption that splicing errors should occur at a low frequency, we split major-isoform introns into two categories, those with abundant SVs (MIRA > 5%), and those without (MIRA ≤ 5%). Rare SVs represent the vast majority of the repertoire of splicing isoforms detected in a given transcriptome (from 62.4% to 96.9% according to the species; Figure 1—source data 1). Two lines of evidence indicate that the small subset of abundant isoforms is strongly enriched in functional transcripts relative to other SVs. First, we observed that in all species, the proportion of SVs that preserve the reading frame is much higher among abundant SVs than among rare SVs (Figure 4A). Second, the analysis of polymorphism data in Drosophila indicates that the average level of purifying selection on alternative splice sites is much stronger for abundant than rare SVs (Figure 5A and B).

If variation in AS rate across species had been driven by a higher prevalence of functional SVs in more complex organisms, one would have expected the proportion of frame-preserving SVs to be higher in vertebrates than in insects, in particular for the set of introns with high AS rate (i.e. enriched in functional SVs). On the contrary, the highest proportion of frame-preserving SVs is observed in dipterans (Figure 4B). In fact, the overall higher AS rate of vertebrates (Figure 3A) is driven by the set of introns with a low AS rate (Figure 4C), that is the set of introns in which the prevalence of functional SVs is the lowest. Conversely, among the set of introns with high AS rate, vertebrates have lower AS rates than insects (Figure 4D).

These observations are difficult to reconcile with the hypothesis that the higher AS rate in vertebrates results from a higher rate of functional AS. Conversely, these observations fit very well with a model where variation in AS rate across species is entirely driven by variation in the efficacy of selection against splicing errors. To illustrate this model, let us consider three hypothetical species with different Ne, in which a small fraction of major-isoform introns (say 5%) is subject to functional alternative splicing. Let us consider that the distribution of AS rates of functional splicing variants is the same for all species (i.e. independent of Ne), with a mean of 25% (and a standard deviation of 5%). In addition, we assume that all major-isoform introns are potentially affected by splicing errors, with a mean error rate ranging from 0.2% in species of high Ne to 1.2% in species of low Ne, owing to the drift barrier effect (these parameters were set to match approximately the AS rates observed in empirical data for rare SVs). The distributions of AS rate given by this model are presented in Figure 7A: rare SVs (MIRA ≤ 5%) essentially correspond to splicing errors, while abundant SVs (MIRA > 5%) correspond to a mixture of functional and spurious variants, whose relative proportion depends on Ne (Figure 7B). This simple model makes predictions that match with our observations: we noted a positive correlation between AS rate and longevity (i.e. a negative correlation with Ne) for the set of low-AS major-isoform introns (Figure 4C), but an opposite trend for high-AS major-isoform introns (Figure 4D), as predicted by the model (Figure 7D and E). Given that high-AS major-isoform introns represent only a small fraction of major-isoform introns, this model predicts that, overall, AS rates correlate negatively with Ne (Figure 7), as observed in empirical data (Figure 3A, Figure 3—figure supplement 1).
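The verbal model above can be explored with a small simulation. This is a minimal sketch under stated assumptions, not the authors' implementation: functional AS rates are drawn from a normal distribution (mean 25%, standard deviation 5%), error rates are gamma-distributed with a shape parameter of 1 (an arbitrary choice), the error rate and the functional rate are simply added for functional introns, and the intermediate mean error rate of 0.6% for the "medium Ne" species is hypothetical. Only the 5% fraction of functional introns, the 25% and 5% moments, the 0.2% to 1.2% range of mean error rates, and the 5% threshold come from the text.

```python
import random
from statistics import fmean

def simulate_species(mean_error, n_introns=50_000, frac_functional=0.05,
                     func_mean=0.25, func_sd=0.05, gamma_shape=1.0,
                     threshold=0.05, seed=1):
    """Simulate per-intron AS rates as a mixture of errors and functional AS."""
    rng = random.Random(seed)
    low, high, n_high_functional = [], [], 0
    for _ in range(n_introns):
        # Every intron carries a gamma-distributed splicing error rate
        # (shape is an assumption; the mean is mean_error).
        rate = rng.gammavariate(gamma_shape, mean_error / gamma_shape)
        functional = rng.random() < frac_functional
        if functional:
            # Assumption: functional AS adds a normally distributed rate.
            rate += max(0.0, rng.gauss(func_mean, func_sd))
        rate = min(rate, 1.0)
        if rate > threshold:
            high.append(rate)
            n_high_functional += functional
        else:
            low.append(rate)
    return {
        "all": fmean(low + high),
        "low_as": fmean(low),
        "high_as": fmean(high),
        "functional_share_high": n_high_functional / len(high),
    }

# Mean error rates: endpoints from the text, middle value hypothetical.
species = {"high Ne": 0.002, "medium Ne": 0.006, "low Ne": 0.012}
results = {name: simulate_species(m) for name, m in species.items()}
```

Under these assumptions, the simulated low-Ne species has a higher average AS rate overall and among low-AS introns, but a lower average AS rate among high-AS introns, because spurious variants just above the 5% threshold dilute the functional ones.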

Figure 7. Impact of the drift-barrier on the genome-wide AS rate: model predictions.

To illustrate the impact of the drift barrier, we sketched a simple model, with three hypothetical species of different Ne. In this model, the repertoire of SVs consists of a mixture of functional variants and splicing errors. We assumed that in all species, only a small fraction of major-isoform introns (5%) produce functional SVs, but that these variants have a relatively high AS rate (average = 25%, standard deviation = 5%; see Materials and methods for details on model settings). Splicing error rates were assumed to be gamma-distributed, with a low mean value. Owing to the drift barrier effect, the mean error rate was set to vary from 0.2% in species of high Ne to 1.2% in species of low Ne (these parameters were chosen to match approximately the AS rates observed in empirical data for rare SVs). (A) Genome-wide distribution of AS rates in each species (high Ne, medium Ne and low Ne). Each distribution corresponds to a mixture of functional SVs (green) and splicing errors (red). (B) Zoom on the y-axis to better visualize the contribution of functional SVs to the whole distribution: rare SVs (AS ≤ 5%) essentially correspond to splicing errors, while abundant SVs (AS > 5%) correspond to a mixture of functional and spurious variants, whose relative proportion depends on Ne. The following panels show how these different distributions, induced by differences in Ne, impact genome-wide AS patterns. (C) Relationship between the average AS rate per major-isoform intron and Ne. (D) Fraction of frame-preserving splice variants among introns with high AS rates vs Ne. Relationship between the average AS rate per intron and Ne, for ‘low-AS’ major-isoform introns (MIRA ≤ 5%) (E), and for ‘high-AS’ major-isoform introns (MIRA > 5%) (F).

It should be noted that the BUSCO dataset corresponds to genes that are strongly conserved across species, often highly expressed, and hence might not be representative of the entire genome. Notably, AS rates are on average lower in the BUSCO gene set than in other genes, even after accounting for their expression level (Figure 6—figure supplement 1). However, results remained qualitatively unchanged when we repeated our analyses on the whole set of annotated protein-coding genes for each species: correlations between AS rates and Ne proxies are slightly weaker than on the BUSCO subset, but remain significant (Figure 4—figure supplement 1).

The model also predicts that the proportion of functional SVs among high-AS major-isoform introns should vary with Ne (Figure 7C). To assess this point, we measured in each species the enrichment in reading frame-preserving events among abundant SVs compared to rare SVs. As predicted, this estimate of the prevalence of functional SVs tends to decrease with decreasing Ne proxies (e.g. Figure 4B, where Ne is approximated by longevity). However, these correlations are weak, marginally significant after accounting for phylogenetic inertia with only two of the three Ne proxies, and not robust to multiple testing issues (Figure 4—figure supplement 2). Thus, Ne does not appear to be a strong predictor of the prevalence of functional SVs among high-AS major-isoform introns.

According to the drift-barrier model, the level of splicing errors is expected to decrease with increasing selective pressure. In all above analyses, we considered AS rates measured per intron, and not per gene. Yet, the trait under selection is the per-gene error rate, which depends not only on the error rate per intron, but also on the number of introns per gene. Given that intron density varies widely across clades (from 2.8 introns per gene in Diptera to 8.4 introns per gene in vertebrates), the correlations reported above between AS rates and Ne may undervalue the predictive power of the drift-barrier model. The RNA-seq datasets that we analyzed consist of short-read sequences, which do not allow a direct quantification of the per-gene AS rate. We therefore indirectly estimated the per-gene AS rate in each species, based on the per-intron AS rate and on the number of introns per gene (Materials and methods). Interestingly, as predicted by the drift-barrier model, Ne proxies correlate more strongly with this estimate of the per-gene AS rate than with the per-intron AS rate (Figure 3—figure supplement 4).
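To see how intron density can amplify differences in per-gene AS rate, consider the following sketch. It assumes that introns of a gene are spliced independently and share the same per-intron AS rate, so that the probability that a transcript carries at least one alternative splicing event is 1 − (1 − AS)^n. This formula is an illustrative assumption and is not necessarily the estimator described in Materials and methods. The intron densities are those given above; the per-intron rate of 2% is hypothetical.

```python
def per_gene_as_rate(per_intron_rate, introns_per_gene):
    """Probability of at least one AS event in a transcript.

    Assumption: independent introns with identical per-intron AS rate.
    """
    if not 0.0 <= per_intron_rate <= 1.0:
        raise ValueError("per_intron_rate must be between 0 and 1")
    return 1.0 - (1.0 - per_intron_rate) ** introns_per_gene

# Same hypothetical per-intron rate, different intron densities.
rate_diptera = per_gene_as_rate(0.02, 2.8)       # 2.8 introns per gene
rate_vertebrates = per_gene_as_rate(0.02, 8.4)   # 8.4 introns per gene
```

Even with an identical per-intron rate, the per-gene rate in this toy example is close to three times higher with 8.4 introns per gene than with 2.8.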

One other important prediction of the drift barrier model is that splicing error rate should vary not only across species according to Ne, but also among genes, according to their expression level. Indeed, for a given splicing error rate, the waste of resources (and hence the fitness cost) is expected to increase with the level of transcription. Thus, the selective pressure for optimal splice signals is expected to be higher, and hence the error rate to be lower, in highly expressed genes. Consistent with that prediction, we observed a negative correlation between gene expression level and AS rate in low-AS major-isoform introns in all but one species (Figure 6C).

It should be noted that our analyses suffer from several important limitations. First, the proxies that we considered for Ne are quite noisy (Figure 1). Second, to maximize the number of species in our analyses, we had to use very heterogeneous sources of RNA (whole-body, specific tissues, or organs, at different life stages, in different sexes, different environmental conditions, etc.). Third, we used short-read sequencing data, which allow the quantification of AS rates for individual introns, but do not provide a direct measure of AS rates per gene. Hopefully progress of long-read sequencing technologies will soon allow the comparative analysis of AS rates on full-length transcripts (e.g. see Leung et al., 2021). But presently, publicly available long-read transcriptomic data are restricted to a narrow set of model organisms, and their sequencing depth is still too limited to quantify rare splicing events. The fact that we detected significant correlations between AS rate and the three Ne proxies, despite these uncontrolled sources of variability, suggests that we underestimate the effect of Ne on AS rates.

Thus, overall, all observations fit qualitatively well with the predictions of the drift barrier model, according to which most of the variation in AS rate across species reflects differences in splicing error rates. Of course, this model is not in contradiction with the well-established fact that some AS events play an essential role in various processes. Different criteria can be used to distinguish functional SVs from spurious splicing events. Notably, AS events that are strongly tissue-specific or developmentally dynamic tend to be more conserved across species, which indicates that a substantial fraction of them are evolutionarily constrained, and hence functional (Mudge et al., 2011; Barbosa-Morais et al., 2012; Merkin et al., 2012; Reyes et al., 2013). The abundance of an SV is also an important predictor of its functionality. In particular, we observed that in all species, the proportion of frame-preserving events is much higher among abundant SVs than among rare SVs (Figure 4A). We note however that the threshold that we used to define abundant SVs is somewhat arbitrary. In fact, according to our model, this class of SVs corresponds to a mixture of functional and spurious events, whose relative proportion is expected to depend on Ne (Figure 7C). Thus, in low-Ne species, even the subset of abundant SVs includes a substantial fraction of errors. This probably explains why, contrary to Drosophila, we do not detect any signature of purifying selection on alternative splice signals in humans, even for abundant SVs (Figure 5).

In conclusion, all observations fit with the hypothesis that random genetic drift sets an upper limit on the capacity of selection to prevent splicing errors. It should be noted that this limit on the optimization of genetic systems is expected to affect not only splicing, but all aspects of gene expression. Notably, there is a growing body of evidence that the complexity of transcripts produced by eukaryotic genes (resulting from alternative transcription initiation, polyadenylation, splicing or back-splicing, RNA editing) often does not correspond to fine-tuned adaptations but simply to the accumulation of errors (Pickrell et al., 2010; Saudemont et al., 2017; Xu et al., 2019; Xu and Zhang, 2018; Liu and Zhang, 2018b; Liu and Zhang, 2018a; Xu and Zhang, 2014; Xu and Zhang, 2020; Gout et al., 2013; Zhang and Xu, 2022). It should be noted however that the relationship between the genome-wide error rate and Ne is not expected to be monotonic. Indeed, models predict that in species with very high Ne, selection on each individual gene should favor genotypes that are robust to errors of the gene expression machinery, which in turn, reduces the constraints on the global level of gene expression errors (Rajon and Masel, 2011; Xiong et al., 2017). Thus, paradoxically, species with very large Ne are expected to have gene expression machineries that are more error-prone than species with very small Ne (Rajon and Masel, 2011). This argument was developed by Xiong et al., 2017 to account for the fact that transcription error rates had been found to be about 10 times higher in bacteria than in eukaryotes (Traverse and Ochman, 2016; Gout et al., 2013). More recent work indicates that bacterial transcription error rates had been largely overestimated, presumably owing to RNA damage during the preparation of sequencing libraries (Li and Lynch, 2020).
Given these uncertainties in the measures of transcription error rates, it seems for now difficult to interpret the differences reported across species. But in any case, it is important to note that it is in principle possible that the drift barrier affects differently the different steps of the gene expression process. It would therefore be important to investigate to which extent each step of gene expression responds (or not) to variation in Ne. As illustrated here by the relationship observed between alternative splicing and Ne, it appears essential to consider the contribution of non-adaptive evolutionary processes when trying to understand the origin of eukaryotic gene expression complexity.

Materials and methods

Genomic and transcriptomic data collection

To analyze AS rate variation across metazoans, three types of information are required: transcriptome sequencing (RNA-seq) datasets, genome assemblies, and gene annotations. To obtain this data, we first queried the Short Read Archive database (Leinonen et al., 2011) to extract publicly available RNA-seq datasets. We also queried the NCBI Genomes database (Agarwala et al., 2018) to retrieve genomic sequences and annotations. When this project was initiated, the vast majority of metazoans represented in this database corresponded to vertebrates or insects. We therefore decided to focus our analyses on these two clades (N=69 species).

Identification of orthologous gene families

To be able to compare average AS rates across species, given that AS rates vary among genes (Saudemont et al., 2017), it is necessary to analyze a common set of orthologous genes. We searched for homologues of the BUSCOv3 (Benchmarking Universal Single Copy Orthologs; Seppey et al., 2019) metazoan gene subset (N=978 genes) in each of the 69 genomes. To do this, we used the software BUSCO v.3.1.0 to associate BUSCO genes to annotated protein sequences. For each species, BUSCO genes were removed from the analysis if they were associated with more than one annotated gene or with an annotated gene that was associated with more than one BUSCO gene.
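The filtering rule described above amounts to keeping only one-to-one associations between BUSCO genes and annotated genes. A minimal sketch, assuming the BUSCO output has already been reduced to (BUSCO identifier, annotated gene identifier) pairs; the function name and the identifiers are hypothetical.

```python
from collections import Counter

def one_to_one_busco(hits):
    """Keep BUSCO genes matched to exactly one annotated gene that is
    itself matched to exactly one BUSCO gene.

    hits: iterable of (busco_id, gene_id) pairs.
    """
    pairs = set(hits)  # drop duplicate associations
    genes_per_busco = Counter(busco for busco, _ in pairs)
    buscos_per_gene = Counter(gene for _, gene in pairs)
    return {
        busco: gene
        for busco, gene in pairs
        if genes_per_busco[busco] == 1 and buscos_per_gene[gene] == 1
    }

# Hypothetical associations: B2 hits two genes, g4 is hit by two BUSCOs.
hits = [("B1", "g1"), ("B2", "g2"), ("B2", "g3"), ("B3", "g4"), ("B4", "g4")]
kept = one_to_one_busco(hits)
```

In this toy example only the association between `B1` and `g1` is retained.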

RNA-seq data processing and intron identification

We aligned the RNA-seq reads on the corresponding reference genomes with HISAT2 v.2.1.0 (Kim et al., 2019). We built the genome indexes using annotated intron and exon coordinates in addition to genome sequences, to improve splice junction detection sensitivity. The maximum allowed intron length was fixed to 2,000,000 bp. We then extracted intron coordinates from HISAT2 alignments using an in-house Perl script that scanned for CIGAR strings containing N, which indicate regions that are skipped from the reference sequence. For intron detection and quantification we used only uniquely mapping reads that had a maximum mismatch ratio of 0.02. We required a minimum anchor length (that is, the number of bases that align on each flanking exon) of 8 bp for intron detection, and of 5 bp for intron quantification. We kept only those predicted introns that had GT-AG, GC-AG or AT-AC splice signals, and we predicted the strand of the introns based on the splice signal.
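The extraction of intron coordinates from CIGAR strings can be illustrated as follows. This Python sketch is not the in-house Perl script used in the study; it only shows the principle of walking along the CIGAR operations, recording each N gap, and applying a minimum anchor length. The function name is hypothetical, and the anchor is assumed here to be the number of aligned bases (M, = or X operations) in the alignment block on each side of the gap.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def introns_from_cigar(pos, cigar, min_anchor=8):
    """Return introns as 1-based inclusive (start, end) reference coordinates.

    pos: 1-based leftmost mapping position of the read (SAM POS field).
    Assumption: the anchor of an intron is the number of aligned bases in
    the adjacent alignment block on each side of the N gap.
    """
    ref = pos
    introns = []   # coordinates of each N gap
    anchors = [0]  # aligned bases in each block between gaps
    for length, op in ((int(n), o) for n, o in CIGAR_OP.findall(cigar)):
        if op == "N":
            introns.append((ref, ref + length - 1))
            anchors.append(0)
        elif op in "M=X":
            anchors[-1] += length
        if op in "MDN=X":  # operations that consume the reference
            ref += length
    return [
        iv for i, iv in enumerate(introns)
        if anchors[i] >= min_anchor and anchors[i + 1] >= min_anchor
    ]

detected = introns_from_cigar(100, "50M1000N50M")              # one intron
too_short = introns_from_cigar(100, "5M200N95M")               # anchor < 8 bp
quantified = introns_from_cigar(100, "5M200N95M", min_anchor=5)
```

Using a stricter anchor for detection (8 bp) than for quantification (5 bp) means that a short overhang can contribute to the read count of an intron without being sufficient, on its own, to call that intron.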

We assigned an intron to a gene if at least one of the intron boundaries fell within 1 bp of the annotated exon coordinates of the gene, combined across all annotated isoforms. We excluded introns that could not be unambiguously assigned to a single gene. We distinguish annotated introns (which appear as such in the reference genome annotations) and un-annotated introns, which were detected with RNA-seq data and assigned to previously annotated genes.

We further restricted our analyses to introns located within protein-coding regions. To do this, for each protein-coding gene, we extracted the start codons and the stop codons for all annotated isoforms. We then identified the minimum start codon and the maximum stop codon positions and we excluded introns that were upstream or downstream of these extreme coordinates.
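The restriction to coding regions can be sketched as a simple interval filter. The sketch below assumes that codon positions and intron boundaries are given as genomic coordinates, takes the extreme coordinates over all isoforms regardless of strand, and keeps only introns lying entirely between them; how introns overlapping an extreme coordinate were handled in the study is not stated, so this choice is an assumption. Function names and coordinates are hypothetical.

```python
def coding_span(isoforms):
    """Extreme coding coordinates over all annotated isoforms of a gene.

    isoforms: list of (start_codon_pos, stop_codon_pos) genomic positions.
    Taking min and max over both values handles either strand.
    """
    coords = [c for isoform in isoforms for c in isoform]
    return min(coords), max(coords)

def introns_within_coding_span(introns, isoforms):
    """Keep introns located entirely between the extreme coordinates."""
    low, high = coding_span(isoforms)
    return [(s, e) for s, e in introns if s >= low and e <= high]

# Hypothetical gene with two isoforms and four candidate introns.
isoforms = [(1000, 5000), (1200, 5200)]
introns = [(500, 900), (1500, 2000), (5100, 5150), (5300, 5400)]
kept_introns = introns_within_coding_span(introns, isoforms)
```

In this example the first and last introns fall outside the 1000 to 5200 span and are discarded.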

The alignment process, which is the most time-consuming step in the pipeline (see Figure 2—figure supplement 3), can take up to 1 week when using 16 cores per RNA-seq for larger genomes, such as mammals. Additionally, the processed compressed files generated during this process can exceed 7 terabytes in size.

Alternative splicing rate definition

For each intron, we denoted by $N_s$ the number of reads corresponding to the precise excision of this intron (spliced reads), and by $N_a$ the number of alternatively spliced reads (i.e. splice variants sharing only one of the two intron boundaries). Finally, we denoted by $N_u$ the number of unspliced reads, co-linear with the genomic sequence, which overlap by at least 10 bp on each side of an exon-intron boundary. These definitions are illustrated in Figure 2. We then defined the relative abundance of the focal intron compared to introns with one alternative splice boundary ($\mathrm{RAS} = \frac{N_s}{N_s + N_a}$), as well as relative to unspliced reads ($\mathrm{RANS} = \frac{N_s}{N_s + \frac{N_u}{2}}$).

To compute these ratios we required a minimal number of 10 reads in the denominator. We thus calculated the RAS only if $(N_s + N_a) \geq 10$ and the RANS only if $(N_s + \frac{N_u}{2}) \geq 10$ (we divided $N_u$ by 2 because retention is quantified at two sites, which increases the detection power by a factor of 2). If the criteria were not met, the values were labeled as not available (NA). We computed these ratios using reads from all available RNA-seq samples, unless otherwise specified (e.g. in sub-sampling analyses). Based on these ratios, we defined three categories of introns: major-isoform introns, defined as those introns that have RANS > 0.5 and RAS > 0.5; minor-isoform introns, defined as those introns that have RANS ≤ 0.5 or RAS ≤ 0.5; unclassified introns, which do not satisfy the above conditions.
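The two ratios and the resulting classification can be sketched in a few lines of Python. The function names are hypothetical, `None` stands in for NA, and the handling of introns for which only one of the two ratios is available is an assumption of this sketch (here, a single available ratio at or below 0.5 is enough to call the intron minor).

```python
MIN_READS = 10  # minimal number of reads required in the denominator


def ras(ns, na):
    """Relative abundance vs. alternatively spliced reads; None stands for NA."""
    denom = ns + na
    return ns / denom if denom >= MIN_READS else None


def rans(ns, nu):
    """Relative abundance vs. unspliced reads (Nu is halved); None stands for NA."""
    denom = ns + nu / 2
    return ns / denom if denom >= MIN_READS else None


def classify(ns, na, nu):
    """Return 'major', 'minor' or 'unclassified' for one intron."""
    r_as, r_ans = ras(ns, na), rans(ns, nu)
    if r_as is not None and r_ans is not None and r_as > 0.5 and r_ans > 0.5:
        return "major"
    if (r_as is not None and r_as <= 0.5) or (r_ans is not None and r_ans <= 0.5):
        return "minor"
    return "unclassified"
```

For example, an intron with 90 spliced reads, 10 alternatively spliced reads and 20 unspliced reads has RAS = 90/100 and RANS = 90/(90 + 10), and is therefore classified as a major-isoform intron.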

We determined the alternative splicing (AS) rate of major-isoform introns using the following formula: $\mathrm{AS} = \frac{N_m}{N_M + N_m}$, where $N_M$ is the number of spliced reads corresponding to the excision of the major-isoform intron and $N_m$ is the total number of spliced reads corresponding to the excision of minor-isoform introns sharing a boundary with a major-isoform intron (see Figure 2).

For minor-isoform introns sharing a boundary with a major-isoform intron, we computed the relative abundance of the minor-isoform intron (i) with respect to the corresponding major-isoform intron, with the following formula: minor intron relative abundance $\mathrm{MIRA}_i = \frac{N_m^i}{N_M + N_m}$, where $N_m^i$ is the number of spliced reads corresponding to the excision of a minor-isoform intron (i) (see Figure 2).
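The AS rate and the MIRA values of one major-isoform intron can be computed together, as in the sketch below (function names are hypothetical). By construction, the MIRA values of all minor-isoform introns sharing a boundary with a given major-isoform intron sum to its AS rate.

```python
def as_rate(n_major, minor_counts):
    """AS rate of a major-isoform intron: Nm / (NM + Nm)."""
    n_minor = sum(minor_counts)
    return n_minor / (n_major + n_minor)


def mira(n_major, minor_counts):
    """Relative abundance of each minor-isoform intron i: Nm_i / (NM + Nm)."""
    total = n_major + sum(minor_counts)
    return [n / total for n in minor_counts]
```

With 950 reads supporting the major-isoform intron and two minor-isoform introns supported by 30 and 20 reads, the AS rate is 50/1000 and the MIRA values are 30/1000 and 20/1000.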

We defined the per-gene AS rate as the probability to observe at least one alternative splicing event across all the major-isoform introns of a gene. To estimate the per-gene AS rate of a given gene, we assumed that the AS rate is uniform across its major-isoform introns, and that AS events occur independently at each intron. We calculated the AS rate for each gene as the number of spliced reads corresponding to the excision of minor-isoform introns, divided by the number of spliced reads corresponding to minor and major-isoform introns ($\frac{N_m}{N_M + N_m}$). The probability for a given gene to produce no splice variant across all its major-isoform introns is thus $p_0 = \left(1 - \frac{N_m}{N_M + N_m}\right)^{N_i}$, where $N_i$ is the number of major-isoform introns of the gene. The per-gene AS rate ($AS_g$), that is the probability to have at least one AS event, is therefore the complement of $p_0$: $AS_g = 1 - p_0$.
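A short numerical sketch of the per-gene AS rate follows; the function and argument names are hypothetical.

```python
def per_gene_as_rate(n_major, n_minor, n_introns):
    """Probability of at least one AS event across the major-isoform introns of a gene.

    n_major, n_minor: spliced reads supporting major- and minor-isoform introns,
    pooled over the gene. n_introns: number of major-isoform introns (Ni).
    """
    rate = n_minor / (n_major + n_minor)  # AS rate, assumed uniform across introns
    p0 = (1 - rate) ** n_introns          # probability of no AS event at any intron
    return 1 - p0
```

For a gene with 10 major-isoform introns and an AS rate of 1% (990 major and 10 minor reads), the per-gene AS rate is 1 - 0.99^10, that is about 9.6%.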

Identification of reading frame-preserving splice variants

To determine the proportion of open-reading frame-preserving splice variants, we first identified minor-isoform introns that had their minor splice site within a maximum distance of 30 bp from the major splice site (either in the flanking exon or within the major-isoform intron). We chose this length threshold because it is shorter than the size of the smallest introns in metazoans, so as to avoid the possibility of having a skipped exon between the minor and the major splice site (which could induce some ambiguities in the assessment of the reading frame). Among these introns, we considered as frame-preserving variants those introns for which the distance between the minor-isoform intron boundary and the major-isoform intron boundary was a multiple of 3.
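The frame-preservation test reduces to a distance check, sketched below. The function name and the use of `None` for variants outside the 30 bp window are conventions chosen for this example.

```python
def is_frame_preserving(major_site, minor_site, max_dist=30):
    """Return True/False for minor splice sites within max_dist bp of the major site.

    Returns None when the minor site is farther than max_dist (or identical to
    the major site), i.e. when the variant is not considered in this analysis.
    """
    dist = abs(minor_site - major_site)
    if dist == 0 or dist > max_dist:
        return None
    return dist % 3 == 0
```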

Gene expression level

Gene expression levels were calculated with Cufflinks v2.2.1 (Roberts et al., 2011) based on the read alignments obtained with HISAT2, for each RNA-seq sample individually. We estimated FPKM levels (Fragments Per Kilobase of exon per Million mapped reads) for each gene.

The overall expression level of a gene was computed as the average FPKM across samples, weighted by the sequencing depth of each sample. The sequencing depth of a sample was defined as the median per-base read coverage across BUSCO genes.
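The depth-weighted average can be sketched as follows. The function names and the input layout (one FPKM value and one list of per-base coverages per sample) are assumptions made for illustration.

```python
from statistics import median


def sequencing_depth(per_base_coverage):
    """Median per-base read coverage across BUSCO genes for one sample."""
    return median(per_base_coverage)


def weighted_mean_fpkm(fpkm_per_sample, depth_per_sample):
    """Average FPKM of a gene across samples, weighted by sequencing depth."""
    total_depth = sum(depth_per_sample)
    weighted = sum(f * d for f, d in zip(fpkm_per_sample, depth_per_sample))
    return weighted / total_depth
```

With FPKM values of 10 and 20 in two samples of depth 1 and 3, the weighted average is (10 × 1 + 20 × 3) / 4 = 17.5.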

Phylogenetic tree reconstruction

For each of the 978 BUSCO gene families we collected the longest corresponding proteins identified in each species. We removed proteins for which the amino acid sequence provided with the annotations did not perfectly correspond to the translation of the corresponding coding sequences. We then aligned the resulting sets of protein-coding sequences for each BUSCO gene, using the codon alignment option in PRANK v.170427 (Löytynoja and Goldman, 2008). We translated the codon alignments into protein alignments using the R package seqinr (Charif and Lobry, 2007).

To infer the phylogenetic tree rapidly, we sub-sampled the resulting multiple alignments (N=461), selecting alignments with the highest number of species (ranging from 49 to 53 species per alignment). We then concatenated these alignments and kept sites that were aligned in at least 30 species. We used RAxML-NG v.0.9.0 (Kozlov et al., 2019) to infer the species phylogeny with a final alignment of 53 taxa and 165,648 sites (amino acids). RAxML-NG was set to use one model per gene with a fixed empirical substitution matrix (LG), empirical amino acid frequencies from the alignment (F) and 8 discrete GAMMA categories (G8), specified in a partition file with one line per multiple alignment. The analysis generated 10 starting trees, 5 starting from a random topology and 5 starting from a tree generated by the parsimony-based randomized stepwise addition algorithm. The best-scoring topology was kept as the final ML tree and 10 bootstrap replicates were generated.
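The site-filtering step applied to the concatenated alignment can be sketched as below. The sketch assumes that an alignment is stored as a dictionary of equal-length strings and that a site counts as "aligned" in a species when it is not the gap character `-`; both are assumptions of this example, as is the function name.

```python
def filter_sites(alignment, min_species=30, gap="-"):
    """Keep alignment columns that are non-gap in at least min_species sequences.

    alignment: dict mapping species name to aligned sequence (equal lengths).
    """
    seqs = list(alignment.values())
    length = len(seqs[0])
    keep = [
        j for j in range(length)
        if sum(seq[j] != gap for seq in seqs) >= min_species
    ]
    return {sp: "".join(seq[j] for j in keep) for sp, seq in alignment.items()}
```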

dN/dS computation

We estimated dN/dS ratios for the BUSCO gene families that were present in at least 45 species (N=922 genes), using the codon alignments obtained with PRANK (see above). We divided the 922 sequence alignments into 18 groups, based on their average GC3 content across species, and concatenated the alignments within each group. We thus obtained concatenated alignments that were 209 kb long on average. We used the Bio++ v.3.0.0 libraries (Guéguen et al., 2013; Dutheil and Boussau, 2008; Bolívar et al., 2019) to estimate the dN/dS on terminal branches of the phylogenetic tree, for each concatenated alignment. We attributed the dN/dS of each terminal branch to the corresponding species.

In a first step, we used a homogeneous codon model implemented in bppml to infer the most likely branch lengths, codon frequencies at the root, and substitution model parameters. We used the YN98 (F3X4) substitution model (Yang and Nielsen, 1998), which allows for different nucleotide content dynamics across codon positions. In a second step, we used the MapNH substitution mapping method (Guéguen and Duret, 2018) to count synonymous and non-synonymous substitutions (Dutheil et al., 2012). We defined dN as the total number of non-synonymous substitutions divided by the total number of non-synonymous opportunities, both summed across concatenated alignments, for each branch of the phylogenetic tree. Likewise, we defined dS as the total number of synonymous substitutions divided by the total number of synonymous opportunities, both summed across concatenated alignments. The per-species dN/dS corresponds to the ratio between dN and dS, on the terminal branches of the phylogenetic tree.
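The pooling of counts across concatenated alignments can be illustrated with the sketch below, for a single terminal branch. The tuple layout and the function name are hypothetical; in the actual analysis the counts come from substitution mapping, which is not reproduced here.

```python
def branch_dn_ds(counts):
    """dN/dS of one branch from counts pooled across concatenated alignments.

    counts: list of tuples (nonsyn_subst, nonsyn_opportunities,
                            syn_subst, syn_opportunities), one per alignment.
    """
    dn = sum(c[0] for c in counts) / sum(c[1] for c in counts)
    ds = sum(c[2] for c in counts) / sum(c[3] for c in counts)
    return dn / ds
```

Note that summing substitutions and opportunities before taking the ratio differs from averaging per-alignment dN/dS values: alignments contribute in proportion to their number of opportunities.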

Life history traits

We used various life history traits to approximate the effective population size of each species. For vertebrate species we considered the reported maximum lifespan (i.e. from birth to death) and body length. For insects we took the maximum lifespan and body length of the imago. For eusocial insects and the eusocial mammal Heterocephalus glaber, the selected values correspond to the queens. The sources from which the lifespan and the body length information were taken are listed in data/Data9-supp.pdf in the Zenodo repository (see Data and code availability).

Analyses of sequence polymorphism

We analyzed the distribution of single-nucleotide polymorphisms (SNPs) around splice sites in Drosophila melanogaster and Homo sapiens.

For D. melanogaster, we used polymorphism data from the Drosophila Genetic Reference Panel (DGRP; Mackay et al., 2012; Huang et al., 2014), from which we extracted 3,963,397 SNPs that were identified from comparisons across 205 inbred lines. We converted the SNP coordinates from the dm3 genome assembly to the dm6 assembly with the liftOver utility (Hinrichs et al., 2006) of the UCSC genome browser, using a whole genome alignment between the two assemblies downloaded here.

For H. sapiens, we used polymorphism data from the 1000 Genomes project, phase 3 release (Auton et al., 2015). This dataset included 80,868,061 SNPs that were genotyped in 2,504 individuals.

For each minor-isoform intron sharing one boundary with a major-isoform intron, we computed the number of SNPs that occur at their respective splice sites: at their shared boundary, at the boundary specific to the major-isoform intron, and at the boundary specific to the minor-isoform intron.

We focused our study on minor-isoform introns that have their specific boundary falling in the exons adjacent to the major-isoform intron or in the major-isoform intron. As a control, for each minor-isoform intron, we searched for one GT and one AG dinucleotide in the interval between 20 and 60 bp with respect to the major splice site, in the neighboring exon and in the major-isoform intron, and computed the number of SNPs that occur at these sites. We searched for control AG dinucleotides in the vicinity of the donor splice site of the major-isoform intron and for GT dinucleotides in the vicinity of its acceptor splice site, to avoid studying sites that might correspond to unidentified minor splice sites. For H. sapiens, we further divided the splice sites and the control dinucleotides into two groups, depending on whether they were subject to CpG hypermutability or not.
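The search for control dinucleotides can be sketched as a window scan around the major splice site. This example makes several assumptions that are not specified in the text: positions are 0-based indices into a sequence string, the distance is measured from the first base of the dinucleotide to the splice site position, and all candidates in the window are returned (the rule used to pick one of them is not shown).

```python
def control_sites(seq, splice_pos, motif, min_dist=20, max_dist=60):
    """Return start positions of `motif` located min_dist..max_dist bp from splice_pos.

    Both sides of the splice site are scanned (neighboring exon and intron).
    """
    hits = []
    for i in range(len(seq) - len(motif) + 1):
        if seq[i:i + len(motif)] == motif and min_dist <= abs(i - splice_pos) <= max_dist:
            hits.append(i)
    return hits
```

Under the design described above, `motif` would be "AG" near a donor site and "GT" near an acceptor site.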

Impact of the drift-barrier on genome-wide AS rates: sketched model

To illustrate the impact of the drift barrier, we sketched a simple model, with three hypothetical species of different Ne (low, medium, and high Ne). In each species, the repertoire of SVs consists of two categories: functional variants and spurious variants (which result from errors of the splicing machinery). The rate of splicing error was assumed to be low and to depend on Ne, owing to the drift barrier effect. We considered that in all species, only a small fraction of major-isoform introns (5%) produce functional SVs, but that these variants have a relatively high AS rate. The AS rates of functional SVs were modeled by a normal distribution, with a mean of 25% and a standard deviation of 5% (same parameters for the three species). We modeled the distribution of error rates by a gamma distribution, with shape parameter = 1, and with mean values of 0.2%, 0.6% and 1.2% respectively in species of high, medium or low Ne (these parameters were set to match approximately the AS rates observed in empirical data for rare SVs). We then combined the two distributions (functional SVs and splicing errors) to compute the genome-wide average AS rates in each species. We also computed the average AS rate on the subsets of low-AS or high-AS major-isoform introns (i.e. with AS rates respectively below or above the threshold AS rate of 5%). Finally, we computed the proportion of frame-preserving SVs among high-AS major-isoform introns, assuming that two thirds of splicing errors induce frameshifts and that all functional SVs preserve the reading frame.
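The sketched model can be reproduced approximately with a small simulation. The version below is an assumption-laden illustration, not the authors' script: it treats the repertoire as a mixture in which 5% of major-isoform introns draw their AS rate from the normal distribution (functional SVs) and the remaining 95% from the gamma distribution (errors), it clips negative normal draws to zero, and it uses the expected fraction of frame-preserving errors (one third) rather than drawing it at random.

```python
import random


def simulate(mean_error_rate, n_introns=100_000, seed=0):
    """Simulate AS rates for one hypothetical species.

    mean_error_rate: mean of the gamma-distributed splicing error rate
    (shape = 1, so the scale parameter equals the mean).
    """
    rng = random.Random(seed)
    rates, functional = [], []
    for _ in range(n_introns):
        if rng.random() < 0.05:                       # intron with a functional SV
            rates.append(max(0.0, rng.gauss(0.25, 0.05)))
            functional.append(True)
        else:                                         # intron with splicing errors only
            rates.append(rng.gammavariate(1.0, mean_error_rate))
            functional.append(False)
    low = [r for r in rates if r < 0.05]
    high_flags = [f for r, f in zip(rates, functional) if r >= 0.05]
    # Functional SVs all preserve the frame; one third of errors do.
    frame = sum(1.0 if f else 1 / 3 for f in high_flags) / len(high_flags)
    return {
        "mean_as": sum(rates) / n_introns,
        "mean_low_as": sum(low) / len(low),
        "frame_preserving_high_as": frame,
    }


results = {name: simulate(m) for name, m in
           [("high_Ne", 0.002), ("medium_Ne", 0.006), ("low_Ne", 0.012)]}
```

Under these assumptions, the expected genome-wide mean AS rate is 0.05 × 0.25 + 0.95 × (mean error rate), that is about 1.4%, 1.8% and 2.4% for the high, medium and low Ne species respectively.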

Data and code availability

All processed data that we generated and used in this study, as well as the scripts that we used to analyze the data and to generate the figures, are available on Zenodo DOI: https://doi.org/10.5281/zenodo.7415114.

In particular, the sources of transcriptomic data, genome assemblies and annotations are reported in the Zenodo archive in data/Data1-supp.tab. The archive includes several directories, including figure, which contains the necessary materials to produce the figures of the manuscript. Rmarkdown scripts located in the table_supp directory were used to generate supplementary tables, which are also saved in the same directory. The processed data used to generate figures and conduct analyses are stored in the data directory in tab-separated text format.

Acknowledgements

We thank Loïc Guille for his contribution to an initial pilot study, Tristan Lefébure for insightful discussions and Laurent Guéguen for his help on dN/dS analyses. Computational analyses were performed using the computing facilities of the CC LBBE/PRABI and the Core Cluster of the Institut Français de Bioinformatique (IFB) (ANR-11-INBS-0013). We thank five anonymous reviewers for their thorough and constructive comments, which were very helpful to improve our manuscript. A preprint version of this article has been peer-reviewed and recommended by PCIEvolBiol (https://doi.org/10.24072/pci.evolbiol.100642). This work was funded by the French National Research Agency (ANR-20-CE02-0008-01 “NeGA” and ANR-17-CE12-0019-01 “LncEvoSys”).

Funding Statement

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Contributor Information

Laurent Duret, Email: Laurent.Duret@univ-lyon1.fr.

Vincent Castric, University of Lille, France.

George H Perry, Pennsylvania State University, United States.

Funding Information

This paper was supported by the following grants:

  • Agence Nationale de la Recherche ANR-20-CE02-0008-01 to Florian Bénitière, Anamaria Necsulea, Laurent Duret.

  • Agence Nationale de la Recherche ANR-17-CE12-0019-01 to Anamaria Necsulea, Laurent Duret.

Additional information

Competing interests

No competing interests declared.

Author contributions

Resources, Data curation, Software, Formal analysis, Investigation, Methodology, Writing - original draft, Writing - review and editing.

Conceptualization, Resources, Data curation, Software, Formal analysis, Supervision, Funding acquisition, Investigation, Methodology, Writing - original draft, Project administration, Writing - review and editing.

Conceptualization, Formal analysis, Supervision, Funding acquisition, Investigation, Methodology, Writing - original draft, Project administration, Writing - review and editing.

Additional files

MDAR checklist

Data availability

All processed data that we generated and used in this study, as well as the scripts that we used to analyze the data and to generate the figures, are available on Zenodo DOI: https://doi.org/10.5281/zenodo.7415114.

The following dataset was generated:

Bénitière F, Necsulea A, Duret L. 2024. 2023-Random genetic drift sets an upper limit on mRNA splicing accuracy in metazoans. Zenodo.

References

  1. Abascal F, Ezkurdia I, Rodriguez-Rivas J, Rodriguez JM, del Pozo A, Vázquez J, Valencia A, Tress ML. Alternatively spliced homologous exons have ancient origins and are highly expressed at the protein level. PLOS Computational Biology. 2015;11:e1004325. doi: 10.1371/journal.pcbi.1004325.
  2. Agarwala R, Barrett T, Beck J, Benson DA, Bollin C, Bolton E, Bourexis D, Brister JR, Bryant SH, Canese K, Cavanaugh M, Charowhas C, Clark K, Dondoshansky I, Feolo M, Fitzpatrick L, Funk K, Geer LY, Gorelenkov V, Graeff A, Hlavina W, Holmes B, Johnson M, Kattman B, Khotomlianski V, Kimchi A, Kimelman M, Kimura M, Kitts P, Klimke W, Kotliarov A, Krasnov S, Kuznetsov A, Landrum MJ, Landsman D, Lathrop S, Lee JM, Leubsdorf C, Lu Z, Madden TL, Marchler-Bauer A, Malheiro A, Meric P, Karsch-Mizrachi I, Mnev A, Murphy T, Orris R, Ostell J, O’Sullivan C, Palanigobu V, Panchenko AR, Phan L, Pierov B, Pruitt KD, Rodarmer K, Sayers EW, Schneider V, Schoch CL, Schuler GD, Sherry ST, Siyan K, Soboleva A, Soussov V, Starchenko G, Tatusova TA, Thibaud-Nissen F, Todorov K, Trawick BW, Vakatov D, Ward M, Yaschenko E, Zasypkin A, Zbicz K, NCBI Resource Coordinators. Database resources of the national center for biotechnology information. Nucleic Acids Research. 2018;46:D8–D13. doi: 10.1093/nar/gkx1095.
  3. Auton A, Brooks LD, Durbin RM, Garrison EP, Kang HM, Korbel JO, Marchini JL, McCarthy S, McVean GA, Abecasis GR, 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature. 2015;526:68–74. doi: 10.1038/nature15393.
  4. Barbosa-Morais NL, Irimia M, Pan Q, Xiong HY, Gueroussov S, Lee LJ, Slobodeniuc V, Kutter C, Watt S, Colak R, Kim T, Misquitta-Ali CM, Wilson MD, Kim PM, Odom DT, Frey BJ, Blencowe BJ. The evolutionary landscape of alternative splicing in vertebrate species. Science. 2012;338:1587–1593. doi: 10.1126/science.1230612.
  5. Bhuiyan SA, Ly S, Phan M, Huntington B, Hogan E, Liu CC, Liu J, Pavlidis P. Systematic evaluation of isoform function in literature reports of alternative splicing. BMC Genomics. 2018;19:637. doi: 10.1186/s12864-018-5013-2.
  6. Blencowe BJ. The relationship between alternative splicing and proteomic complexity. Trends in Biochemical Sciences. 2017;42:407–408. doi: 10.1016/j.tibs.2017.04.001.
  7. Bolívar P, Guéguen L, Duret L, Ellegren H, Mugal CF. GC-biased gene conversion conceals the prediction of the nearly neutral theory in avian genomes. Genome Biology. 2019;20:5. doi: 10.1186/s13059-018-1613-z.
  8. Bush SJ, Chen L, Tovar-Corona JM, Urrutia AO. Alternative splicing and the evolution of phenotypic novelty. Philosophical Transactions of the Royal Society of London. Series B, Biological Sciences. 2017;372:20150474. doi: 10.1098/rstb.2015.0474.
  9. Cardoso-Moreira M, Halbert J, Valloton D, Velten B, Chen C, Shao Y, Liechti A, Ascenção K, Rummel C, Ovchinnikova S, Mazin PV, Xenarios I, Harshman K, Mort M, Cooper DN, Sandi C, Soares MJ, Ferreira PG, Afonso S, Carneiro M, Turner JMA, VandeBerg JL, Fallahshahroudi A, Jensen P, Behr R, Lisgo S, Lindsay S, Khaitovich P, Huber W, Baker J, Anders S, Zhang YE, Kaessmann H. Gene expression across mammalian organ development. Nature. 2019;571:505–509. doi: 10.1038/s41586-019-1338-5.
  10. Charif D, Lobry JR. In: Structural Approaches to Sequence Evolution: Molecules, Networks, Populations, Biological and Medical Physics, Biomedical Engineering. Bastolla U, Porto M, Roman HE, Vendruscolo M, editors. Springer; 2007. Seqinr 1.0-2: A contributed package to the R project for statistical computing devoted to biological sequences retrieval and analysis; pp. 207–232.
  11. Chen L, Bush SJ, Tovar-Corona JM, Castillo-Morales A, Urrutia AO. Correcting for differential transcript coverage reveals a strong relationship between alternative splicing and organism complexity. Molecular Biology and Evolution. 2014;31:1402–1413. doi: 10.1093/molbev/msu083.
  12. Dutheil J, Boussau B. Non-homogeneous models of sequence evolution in the Bio++ suite of libraries and programs. BMC Evolutionary Biology. 2008;8:255. doi: 10.1186/1471-2148-8-255.
  13. Dutheil JY, Galtier N, Romiguier J, Douzery EJP, Ranwez V, Boussau B. Efficient selection of branch-specific models of sequence evolution. Molecular Biology and Evolution. 2012;29:1861–1874. doi: 10.1093/molbev/mss059.
  14. Figuet E, Nabholz B, Bonneau M, Mas Carrio E, Nadachowska-Brzyska K, Ellegren H, Galtier N. Life history traits, protein evolution, and the nearly neutral theory in amniotes. Molecular Biology and Evolution. 2016;33:1517–1527. doi: 10.1093/molbev/msw033.
  15. Freckleton RP, Harvey PH, Pagel M. Phylogenetic analysis and comparative data: A test and review of evidence. The American Naturalist. 2002;160:712–726. doi: 10.1086/343873.
  16. Gonzàlez-Porta M, Frankish A, Rung J, Harrow J, Brazma A. Transcriptome analysis of human tissues and cell lines reveals one dominant transcript per gene. Genome Biology. 2013;14:R70. doi: 10.1186/gb-2013-14-7-r70.
  17. Gout JF, Thomas WK, Smith Z, Okamoto K, Lynch M. Large-scale detection of in vivo transcription errors. PNAS. 2013;110:18584–18589. doi: 10.1073/pnas.1309843110.
  18. Graveley BR. Alternative splicing: increasing diversity in the proteomic world. Trends in Genetics. 2001;17:100–107. doi: 10.1016/s0168-9525(00)02176-4.
  19. Guéguen L, Gaillard S, Boussau B, Gouy M, Groussin M, Rochette NC, Bigot T, Fournier D, Pouyet F, Cahais V, Bernard A, Scornavacca C, Nabholz B, Haudry A, Dachary L, Galtier N, Belkhir K, Dutheil JY. Bio++: efficient extensible libraries and tools for computational molecular evolution. Molecular Biology and Evolution. 2013;30:1745–1750. doi: 10.1093/molbev/mst097.
  20. Guéguen L, Duret L. Unbiased estimate of synonymous and nonsynonymous substitution rates with nonstationary base composition. Molecular Biology and Evolution. 2018;35:734–742. doi: 10.1093/molbev/msx308.
  21. Hamid FM, Makeyev EV. Emerging functions of alternative splicing coupled with nonsense-mediated decay. Biochemical Society Transactions. 2014;42:1168–1173. doi: 10.1042/BST20140066.
  22. Hinrichs AS, Karolchik D, Baertsch R, Barber GP, Bejerano G, Clawson H, Diekhans M, Furey TS, Harte RA, Hsu F, Hillman-Jackson J, Kuhn RM, Pedersen JS, Pohl A, Raney BJ, Rosenbloom KR, Siepel A, Smith KE, Sugnet CW, Sultan-Qurraie A, Thomas DJ, Trumbower H, Weber RJ, Weirauch M, Zweig AS, Haussler D, Kent WJ. The UCSC Genome Browser Database: update 2006. Nucleic Acids Research. 2006;34:D590–D598. doi: 10.1093/nar/gkj144.
  23. Hsu SN, Hertel KJ. Spliceosomes walk the line: splicing errors and their impact on cellular function. RNA Biology. 2009;6:526–530. doi: 10.4161/rna.6.5.9860.
  24. Huang W, Massouras A, Inoue Y, Peiffer J, Ràmia M, Tarone AM, Turlapati L, Zichner T, Zhu D, Lyman RF, Magwire MM, Blankenburg K, Carbone MA, Chang K, Ellis LL, Fernandez S, Han Y, Highnam G, Hjelmen CE, Jack JR, Javaid M, Jayaseelan J, Kalra D, Lee S, Lewis L, Munidasa M, Ongeri F, Patel S, Perales L, Perez A, Pu L, Rollmann SM, Ruth R, Saada N, Warner C, Williams A, Wu Y-Q, Yamamoto A, Zhang Y, Zhu Y, Anholt RRH, Korbel JO, Mittelman D, Muzny DM, Gibbs RA, Barbadilla A, Johnston JS, Stone EA, Richards S, Deplancke B, Mackay TFC. Natural variation in genome architecture among 205 Drosophila melanogaster Genetic Reference Panel lines. Genome Research. 2014;24:1193–1208. doi: 10.1101/gr.171546.113.
  25. John S, Olas JJ, Mueller-Roeber B. Regulation of alternative splicing in response to temperature variation in plants. Journal of Experimental Botany. 2021;72:6150–6163. doi: 10.1093/jxb/erab232.
  26. Kim D, Paggi JM, Park C, Bennett C, Salzberg SL. Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. Nature Biotechnology. 2019;37:907–915. doi: 10.1038/s41587-019-0201-4.
  27. Kimura M, Maruyama T, Crow JF. The mutation load in small populations. Genetics. 1963;48:1303–1312. doi: 10.1093/genetics/48.10.1303.
  28. Kozlov AM, Darriba D, Flouri T, Morel B, Stamatakis A. RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference. Bioinformatics. 2019;35:4453–4455. doi: 10.1093/bioinformatics/btz305.
  29. Kryazhimskiy S, Plotkin JB. The population genetics of dN/dS. PLOS Genetics. 2008;4:e1000304. doi: 10.1371/journal.pgen.1000304.
  30. Leinonen R, Sugawara H, Shumway M, International Nucleotide Sequence Database Collaboration. The sequence read archive. Nucleic Acids Research. 2011;39:D19–D21. doi: 10.1093/nar/gkq1019.
  31. Leung SK, Jeffries AR, Castanho I, Jordan BT, Moore K, Davies JP, Dempster EL, Bray NJ, O’Neill P, Tseng E, Ahmed Z, Collier DA, Jeffery ED, Prabhakar S, Schalkwyk L, Jops C, Gandal MJ, Sheynkman GM, Hannon E, Mill J. Full-length transcript sequencing of human and mouse cerebral cortex identifies widespread isoform diversity and alternative splicing. Cell Reports. 2021;37:110022. doi: 10.1016/j.celrep.2021.110022.
  32. Li W, Lynch M. Universally high transcript error rates in bacteria. eLife. 2020;9:e54898. doi: 10.7554/eLife.54898.
  33. Liu Z, Zhang J. Human C-to-U Coding RNA Editing Is Largely Nonadaptive. Molecular Biology and Evolution. 2018a;35:963–969. doi: 10.1093/molbev/msy011.
  34. Liu Z, Zhang J. Most m6A RNA Modifications in Protein-Coding Regions Are Evolutionarily Unconserved and Likely Nonfunctional. Molecular Biology and Evolution. 2018b;35:666–675. doi: 10.1093/molbev/msx320.
  35. Löytynoja A, Goldman N. Phylogeny-aware gap placement prevents errors in sequence alignment and evolutionary analysis. Science. 2008;320:1632–1635. doi: 10.1126/science.1158395.
  36. Lynch M, Conery JS. The origins of genome complexity. Science. 2003;302:1401–1404. doi: 10.1126/science.1089370.
  37. Lynch M. The origins of eukaryotic gene structure. Molecular Biology and Evolution. 2006;23:450–468. doi: 10.1093/molbev/msj050.
  38. Lynch M. The frailty of adaptive hypotheses for the origins of organismal complexity. PNAS. 2007;104:8597–8604. doi: 10.1073/pnas.0702207104.
  39. Lynch M, Ackerman MS, Gout JF, Long H, Sung W, Thomas WK, Foster PL. Genetic drift, selection and the evolution of the mutation rate. Nature Reviews. Genetics. 2016;17:704–714. doi: 10.1038/nrg.2016.104.
  40. Mackay TFC, Richards S, Stone EA, Barbadilla A, Ayroles JF, Zhu D, Casillas S, Han Y, Magwire MM, Cridland JM, Richardson MF, Anholt RRH, Barrón M, Bess C, Blankenburg KP, Carbone MA, Castellano D, Chaboub L, Duncan L, Harris Z, Javaid M, Jayaseelan JC, Jhangiani SN, Jordan KW, Lara F, Lawrence F, Lee SL, Librado P, Linheiro RS, Lyman RF, Mackey AJ, Munidasa M, Muzny DM, Nazareth L, Newsham I, Perales L, Pu L-L, Qu C, Ràmia M, Reid JG, Rollmann SM, Rozas J, Saada N, Turlapati L, Worley KC, Wu Y-Q, Yamamoto A, Zhu Y, Bergman CM, Thornton KR, Mittelman D, Gibbs RA. The Drosophila melanogaster genetic reference panel. Nature. 2012;482:173–178. doi: 10.1038/nature10811.
  41. Mazin PV, Khaitovich P, Cardoso-Moreira M, Kaessmann H. Alternative splicing during mammalian organ development. Nature Genetics. 2021;53:925–934. doi: 10.1038/s41588-021-00851-w.
  42. McGlincy NJ, Smith CWJ. Alternative splicing resulting in nonsense-mediated mRNA decay: what is the meaning of nonsense? Trends in Biochemical Sciences. 2008;33:385–393. doi: 10.1016/j.tibs.2008.06.001.
  43. Merkin J, Russell C, Chen P, Burge CB. Evolutionary dynamics of gene and isoform regulation in Mammalian tissues. Science. 2012;338:1593–1599. doi: 10.1126/science.1228186.
  44. Mudge JM, Frankish A, Fernandez-Banet J, Alioto T, Derrien T, Howald C, Reymond A, Guigó R, Hubbard T, Harrow J. The origins, evolution, and functional potential of alternative splicing in vertebrates. Molecular Biology and Evolution. 2011;28:2949–2959. doi: 10.1093/molbev/msr127.
  45. Ohta T. Slightly deleterious mutant substitutions in evolution. Nature. 1973;246:96–98. doi: 10.1038/246096a0.
  46. Pickrell JK, Pai AA, Gilad Y, Pritchard JK. Noisy Splicing Drives mRNA Isoform Diversity in Human Cells. PLOS Genetics. 2010;6:e1001236. doi: 10.1371/journal.pgen.1001236.
  47. Rajon E, Masel J. Evolution of molecular error rates and the consequences for evolvability. PNAS. 2011;108:1082–1087. doi: 10.1073/pnas.1012918108.
  48. Reyes A, Anders S, Weatheritt RJ, Gibson TJ, Steinmetz LM, Huber W. Drift and conservation of differential exon usage across tissues in primate species. PNAS. 2013;110:15377–15382. doi: 10.1073/pnas.1307202110.
  49. Roberts A, Pimentel H, Trapnell C, Pachter L. Identification of novel transcripts in annotated genomes using RNA-Seq. Bioinformatics. 2011;27:2325–2329. doi: 10.1093/bioinformatics/btr355.
  50. Romiguier J, Weyna A. Relaxation of Purifying Selection Suggests Low Effective Population Size in Eusocial Hymenoptera and Solitary Pollinating Bees. bioRxiv. 2020. doi: 10.1101/2020.04.14.038893.
  51. Saudemont B, Popa A, Parmley JL, Rocher V, Blugeon C, Necsulea A, Meyer E, Duret L. The fitness cost of mis-splicing is the main determinant of alternative splicing patterns. Genome Biology. 2017;18:208. doi: 10.1186/s13059-017-1344-6.
  52. Seppey M, Manni M, Zdobnov EM. BUSCO: Assessing Genome Assembly and Annotation Completeness. Methods in Molecular Biology. 2019;1962:227–245. doi: 10.1007/978-1-4939-9173-0_14.
  53. Singh P, Ahi EP. The importance of alternative splicing in adaptive evolution. Molecular Ecology. 2022;31:1928–1938. doi: 10.1111/mec.16377.
  54. Tomso DJ, Bell DA. Sequence context at human single nucleotide polymorphisms: overrepresentation of CpG dinucleotide at polymorphic sites and suppression of variation in CpG islands. Journal of Molecular Biology. 2003;327:303–308. doi: 10.1016/s0022-2836(03)00120-7.
  55. Traverse CC, Ochman H. Conserved rates and patterns of transcription errors across bacterial growth states and lifestyles. PNAS. 2016;113:3311–3316. doi: 10.1073/pnas.1525329113.
  56. Tress ML, Abascal F, Valencia A. Alternative splicing may not be the key to proteome complexity. Trends in Biochemical Sciences. 2017a;42:98–110. doi: 10.1016/j.tibs.2016.08.008.
  57. Tress ML, Abascal F, Valencia A. Most alternative isoforms are not functionally important. Trends in Biochemical Sciences. 2017b;42:408–410. doi: 10.1016/j.tibs.2017.04.002.
  58. Verta JP, Jacobs A. The role of alternative splicing in adaptation and evolution. Trends in Ecology & Evolution. 2022;37:299–308. doi: 10.1016/j.tree.2021.11.010.
  59. Waples RS. Life-history traits and effective population size in species with overlapping generations revisited: the importance of adult mortality. Heredity. 2016;117:241–250. doi: 10.1038/hdy.2016.29.
  60. Wright CJ, Smith CWJ, Jiggins CD. Alternative splicing as a source of phenotypic diversity. Nature Reviews Genetics. 2022;23:697–710. doi: 10.1038/s41576-022-00514-4.
  61. Xiong K, McEntee JP, Porfirio DJ, Masel J. Drift barriers to quality control when genes are expressed at different levels. Genetics. 2017;205:397–407. doi: 10.1534/genetics.116.192567.
  62. Xu G, Zhang J. Human coding RNA editing is generally nonadaptive. PNAS. 2014;111:3769–3774. doi: 10.1073/pnas.1321745111.
  63. Xu C, Zhang J. Alternative polyadenylation of mammalian transcripts is generally deleterious, not adaptive. Cell Systems. 2018;6:734–742. doi: 10.1016/j.cels.2018.05.007.
  64. Xu C, Park J-K, Zhang J. Evidence that alternative transcriptional initiation is largely nonadaptive. PLOS Biology. 2019;17:e3000197. doi: 10.1371/journal.pbio.3000197.
  65. Xu C, Zhang J. A different perspective on alternative cleavage and polyadenylation. Nature Reviews. Genetics. 2020;21:63. doi: 10.1038/s41576-019-0198-z.
  66. Yang Z, Nielsen R. Synonymous and nonsynonymous rate variation in nuclear genes of mammals. Journal of Molecular Evolution. 1998;46:409–418. doi: 10.1007/pl00006320.
  67. Zhang J, Xu C. Gene product diversity: adaptive or not? Trends in Genetics. 2022;38:1112–1122. doi: 10.1016/j.tig.2022.05.002.

eLife assessment

Vincent Castric 1

This fundamental study evaluates the evolutionary significance of variations in the accuracy of the intron-splicing process across vertebrates and insects. Using a powerful combination of comparative and population genomics approaches, the authors present convincing evidence that higher rates of alternative splicing tend to be observed in species with lower effective population size, a key prediction of the drift-barrier hypothesis. The analysis is carefully conducted and has broad implications beyond the studied species. As such, it will strongly appeal to anyone interested in the evolution of genome architecture and the optimisation of genetic systems.

Reviewer #1 (Public Review):

Anonymous

Summary:

Functionally important alternative isoforms are gold nuggets found in a swamp of errors produced by the splicing machinery.

The architecture of eukaryotic genomes, when compared with prokaryotes, is characterised by a preponderance of introns. These elements, which are still present within transcripts, are rapidly removed during the splicing of messenger RNA (mRNA), thus not contributing to the final protein. The extreme rarity of introns in prokaryotes, and the elimination of these introns from mRNAs before translation into protein, raises questions about the function of introns in genomes. One explanation comes from functional biology: introns are thought to be involved in post-transcriptional regulation and in the production of translational variants. The latter function is possible when the positions of the edges of the spliced intron vary. While some light has been shed on specific examples of the functional role of alternative splicing, to what extent are they representative of all introns in metazoans?

In this study, the hypothesis of a functional role for alternative splicing, and therefore to a certain extent for introns, is evaluated against another explanation coming from evolutionary biology: isoforms are above all errors of imprecision by the molecular machinery at work during splicing. This hypothesis is based on a principle established by Motoo Kimura, which has become central to population genetics, explaining that the evolutionary trajectory of a mutation with a given effect is intimately linked to the effective population size (Ne) where this mutation emerges. Thus, the probability of fixation of a weakly deleterious mutation increases when Ne decreases, and the probability of fixation of a weakly advantageous mutation increases when Ne increases. The genomes of populations with low Ne are therefore expected to accumulate more weakly deleterious mutations and fewer weakly advantageous mutations than populations with high Ne. In this framework, if splicing errors have only small effects on the fitness of individuals, then natural selection cannot increase the precision of the splicing machinery, allowing tolerance for the production of alternative isoforms.
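As a rough illustration of the principle invoked above, the following sketch evaluates Kimura's diffusion approximation for the fixation probability of a newly arisen mutation and expresses it relative to a neutral mutation. It assumes a diploid population whose census size equals Ne and genic selection with coefficient s; the function names and the values of s are illustrative choices and are not taken from the study.

```python
import math


def fixation_probability(ne, s):
    """Kimura's diffusion approximation for a new mutation.

    Assumes a diploid population with census size equal to ne, so the
    mutation starts at frequency 1 / (2 * ne), and genic selection with
    coefficient s (s < 0: deleterious, s > 0: advantageous).
    """
    p0 = 1.0 / (2.0 * ne)
    if s == 0:
        # Neutral case: fixation probability equals the initial frequency.
        return p0
    # expm1(x) = exp(x) - 1, which stays accurate when x is close to zero.
    return math.expm1(-4.0 * ne * s * p0) / math.expm1(-4.0 * ne * s)


def relative_to_neutral(ne, s):
    """Fixation probability divided by the neutral expectation 1 / (2 * ne)."""
    return fixation_probability(ne, s) * 2.0 * ne


# Illustrative population sizes and selection coefficients (hypothetical values).
for ne in (1e3, 1e4, 1e5):
    print(ne, relative_to_neutral(ne, -1e-4), relative_to_neutral(ne, 1e-4))
```

Under these assumptions, the relative fixation probability of a mutation with s = -1e-4 falls as Ne grows, whereas that of a mutation with s = +1e-4 rises, which is the asymmetry the drift-barrier argument relies on.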

In the past, the debate opposed one-off observations of effectively functional isoforms on the one hand, to global genomic quantities describing patterns without the possibility of interpreting them in detail. The authors here propose an elegant quantitative approach in line with the expected continuous variation in the effectiveness of selection, both between species and within genomes. The result describing the inter-specific pattern on a large scale confirms what was already known (there is a negative relationship between effective size and average alternative splicing rate). The essential novelty of this study lies in (1) the quantification, for each intron studied, of the relative abundance of each isoform, and (2) the analysis of a relationship between this abundance and the evolutionary constraints acting on these isoforms.

What is striking is the light shed on the general very low abundance of alternative isoforms. Depending on the species, 60% to 96% of cases of alternatively spliced introns lead to an isoform whose abundance is less than 5% of the total variants for a given intron.

In addition to the fact that 60%-96% of the total isoforms are more than 20 times less abundant than their majority form, this large proportion of alternative isoforms exhibit coding-phase shift at rates similar to what would be expected by chance, i.e. for a third of them, which reinforces the idea that there is no particular constraint on these isoforms.

The remaining 4%-40% of isoforms see their coding-phase shift rate decrease as their relative abundance increases. This result represents a major step forward in our understanding of alternative splicing and makes it possible to establish a quantitative model directly linking the relative abundance of an isoform with a putative functional role concerning only those isoforms produced in abundance. Only the (rare) isoforms which are abundantly produced are thought to be involved in a biological function.

Within the same genome, the authors show that only highly expressed genes, i.e. those that tend to be more constrained on average, are also the genes with the lowest alternative splicing rates on average.

The comparison between species in this study reveals that the smaller the effective size of a species, the more its genome produces isoforms that are low in abundance and low in constraint. Conversely, species with a large effective size relatively reduce rare isoforms, and increase stress on abundant isoforms.

To sum up:

• the higher the effective size of a species, the fewer introns are spliced.

• highly expressed genes are spliced less.

• when splicing occurs, it is mainly to produce low-abundance isoforms.

• low-abundance isoforms are also less constrained.

Taken together, these results reinforce a quantitative view of the evolution of alternative splicing as being mainly the product of imprecision in the splicing machinery, generating a great deal of molecular noise. Then, out of all this noise, a few functional gold nuggets can sometimes emerge. From the point of view of the reviewer, the evolutionary dynamics of genomes are depressing. The small effective population sizes are responsible for the accumulation of multiple slightly deleterious introns. Admittedly, metazoan genomes try to get rid of these introns during RNA maturation, but this mechanism is itself rendered imprecise by population sizes.

Strengths:

• The authors simultaneously study the effects of effective population size, isoform abundance, and gene expression levels on the evolutionary constraints acting on isoforms. Within this framework, they clearly show that an isoform becomes functionally important only under certain rare conditions.

• The authors rule out an effect putatively linked to variations in expression between different organs which could have biased comparisons between different species.

Weaknesses:

• While the longevity of organisms as a measure of effective size seems to work overall, it may not be relevant for discriminating within a clade. For example, within Hymenoptera, we might expect them to have the same overall longevity, but that effective size would be influenced more by the degree of sociality: solitary bees/ants/wasps versus eusocial. I am therefore certain that the relationship shown in Figure 4D is currently not significant because the measure of effective size is not relevant for Hymenoptera. The article would have been even more convincing by contrasting the rates of alternative splicing between solitary versus social hymenopterans.

• When functionalist biologists emphasise the role of the complexity of living things, I'm not sure they're thinking of the comparison between "Drosophila" and "Homo sapiens", but rather of a broader evolutionary scale. Which gives the impression of an exaggeration of the debate in the introduction.

Reviewer #2 (Public Review):

Anonymous

Summary:

Two hypotheses could explain the observation that genes of more complex organisms tend to undergo more alternative splicing. On one hand, alternative splicing could be adaptive since it provides the functional diversity required for complexity. On the other hand, increased rates of alternative splicing could result through nonadaptive processes since more complex organisms tend to have smaller effective population sizes and are thus more prone to deleterious mutations resulting in more spurious splicing events (drift-barrier hypothesis). To evaluate the latter, Bénitière et al. analyzed transcriptome sequencing data across 53 metazoan species. They show that proxies for effective population size and alternative splicing rates are negatively correlated. Furthermore, the authors find that rare, nonfunctional (and likely erroneous) isoforms occur more frequently in more complex species. Additionally, they show evidence that the strength of selection on splice sites increases with increasing effective population size and that the abundance of rare splice variants decreases with increased gene expression. All of these findings are consistent with the drift-barrier hypothesis.

This study conducts a comprehensive set of separate analyses that all converge on the same overall result and the manuscript is well organized. Furthermore, this study is useful in that it provides a modified null hypothesis that can be used for future tests of adaptive explanations for variation in alternative splicing.

Strengths:

The major strength of this study lies in its complementary approach combining comparative and population genomics. Comparing evolutionary trends across phylogenetic diversity is a powerful way to test hypotheses about the origins of genome complexity. This approach alone reveals several convincing lines of evidence in support of the drift-barrier hypothesis. However, the authors also provide evidence from a population genetics perspective (using resequencing data for humans and fruit flies), making results even more convincing.

The authors are forward about the study's limitations and explain them in detail. They elaborate on possible confounding factors as well as the issues with data quality (e.g. proxies for Ne, inadequacies of short reads, heterogeneity in RNA-sequencing data).

Weaknesses:

The authors primarily consider insects and mammals in their study. This only represents a small fraction of metazoan diversity. Sampling from a greater diversity of metazoan lineages would make these results and their relevance to broader metazoans substantially more convincing. Although the authors are careful about their tone, it is challenging to reconcile these results with trends across greater metazoans when the underlying dataset exhibits ascertainment bias and represents samples from only a few phylogenetic groups. Relatedly, some trends (such as Figure 1B-C) seem to be driven primarily by non-insect species, raising the question of whether some results may be primarily explained by specific phylogenetic groups (although the authors do correct for phylogeny in their statistics). How might results look if insects and mammals (or vertebrates) are considered independently?

Throughout the manuscript, the authors refer to infrequently spliced (mode <5%) introns as "minor introns" and frequently spliced (mode >95%) as "major introns". This is extremely confusing since "minor introns" typically represent introns spliced by the U12 spliceosome, whereas "major introns" are those spliced by the U2 spliceosome. Furthermore, it remains unclear whether the study only considers major introns or both major and minor introns. Minor introns typically have AT-AC splice sites whereas major introns usually have GT/GC-AG splice sites, although in rare cases the U2 can recognize AT-AC (see Wu and Krainer 1997 for example). The authors also note that some introns show noncanonical AT-AC splice sites while these are actually canonical splice sites for minor introns.

eLife. 2024 Mar 12;13:RP93629. doi: 10.7554/eLife.93629.3.sa3

Author Response

Florian Bénitière 1, Anamaria Necsulea 2, Laurent Duret 3

The following is the authors’ response to the original reviews.

eLife assessment

This fundamental study evaluates the evolutionary significance of variations in the accuracy of the intron-splicing process across vertebrates and insects. Using a powerful combination of comparative and population genomics approaches, the authors present convincing evidence that species with lower effective population size tend to exhibit higher rates of alternative splicing, a key prediction of the drift-barrier hypothesis. The analysis is carefully conducted and all observations fit with this hypothesis, but focusing on a greater diversity of metazoan lineages would make these results even more broadly relevant. This study will strongly appeal to anyone interested in the evolution of genome architecture and the optimisation of genetic systems.

Public Reviews:

Reviewer #1 Public Review:

Summary:

Functionally important alternative isoforms are gold nuggets found in a swamp of errors produced by the splicing machinery.

The architecture of eukaryotic genomes, when compared with prokaryotes, is characterised by a preponderance of introns. These elements, which are still present within transcripts, are rapidly removed during the splicing of messenger RNA (mRNA), thus not contributing to the final protein. The extreme rarity of introns in prokaryotes, and the elimination of these introns from mRNAs before translation into protein, raises questions about the function of introns in genomes. One explanation comes from functional biology: introns are thought to be involved in post-transcriptional regulation and in the production of translational variants. The latter function is possible when the positions of the edges of the spliced intron vary. While some light has been shed on specific examples of the functional role of alternative splicing, to what extent are they representative of all introns in metazoans?

In this study, the hypothesis of a functional role for alternative splicing, and therefore to a certain extent for introns, is evaluated against another explanation coming from evolutionary biology: isoforms are above all errors of imprecision by the molecular machinery at work during splicing. This hypothesis is based on a principle established by Motoo Kimura, which has become central to population genetics, explaining that the evolutionary trajectory of a mutation with a given effect is intimately linked to the effective population size (Ne) where this mutation emerges. Thus, the probability of fixation of a weakly deleterious mutation increases when Ne decreases, and the probability of fixation of a weakly advantageous mutation increases when Ne increases. The genomes of populations with low Ne are therefore expected to accumulate more weakly deleterious mutations and fewer weakly advantageous mutations than populations with high Ne. In this framework, if splicing errors have only small effects on the fitness of individuals, then natural selection cannot increase the precision of the splicing machinery, allowing tolerance for the production of alternative isoforms.

In the past, the debate opposed one-off observations of effectively functional isoforms on the one hand, to global genomic quantities describing patterns without the possibility of interpreting them in detail. The authors here propose an elegant quantitative approach in line with the expected continuous variation in the effectiveness of selection, both between species and within genomes. The result describing the inter-specific pattern on a large scale confirms what was already known (there is a negative relationship between effective size and average alternative splicing rate). The essential novelty of this study lies in (1) the quantification, for each intron studied, of the relative abundance of each isoform, and (2) the analysis of a relationship between this abundance and the evolutionary constraints acting on these isoforms.

What is striking is the light shed on the general very low abundance of alternative isoforms. Depending on the species, 60% to 96% of cases of alternatively spliced introns lead to an isoform whose abundance is less than 5% of the total variants for a given intron.

In addition to the fact that 60%-96% of the total isoforms are more than 20 times less abundant than their majority form, this large proportion of alternative isoforms exhibit coding-phase shift at rates similar to what would be expected by chance, i.e. for a third of them, which reinforces the idea that there is no particular constraint on these isoforms.

The remaining 4%-40% of isoforms see their coding-phase shift rate decrease as their relative abundance increases. This result represents a major step forward in our understanding of alternative splicing and makes it possible to establish a quantitative model directly linking the relative abundance of an isoform with a putative functional role concerning only those isoforms produced in abundance. Only the (rare) isoforms which are abundantly produced are thought to be involved in a biological function.

Within the same genome, the authors show that only highly expressed genes, i.e. those that tend to be more constrained on average, are also the genes with the lowest alternative splicing rates on average.

The comparison between species in this study reveals that the smaller the effective size of a species, the more its genome produces isoforms that are low in abundance and low in constraint. Conversely, species with a large effective size relatively reduce rare isoforms, and increase stress on abundant isoforms.

To sum up:

  • the higher the effective size of a species, the fewer introns are spliced.

  • highly expressed genes are spliced less.

  • when splicing occurs, it is mainly to produce low-abundance isoforms.

  • low-abundance isoforms are also less constrained.

Taken together, these results reinforce a quantitative view of the evolution of alternative splicing as being mainly the product of imprecision in the splicing machinery, generating a great deal of molecular noise. Then, out of all this noise, a few functional gold nuggets can sometimes emerge. From the point of view of the reviewer, the evolutionary dynamics of genomes are depressing. The small effective population sizes are responsible for the accumulation of multiple slightly deleterious introns. Admittedly, metazoan genomes try to get rid of these introns during RNA maturation, but this mechanism is itself rendered imprecise by population sizes.

Strengths:

  • The authors simultaneously study the effects of effective population size, isoform abundance, and gene expression levels on the evolutionary constraints acting on isoforms. Within this framework, they clearly show that an isoform becomes functionally important only under certain rare conditions.

  • The authors rule out an effect putatively linked to variations in expression between different organs which could have biased comparisons between different species.

Weaknesses:

  • While the longevity of organisms as a measure of effective size seems to work overall, it may not be relevant for discriminating within a clade. For example, within Hymenoptera, we might expect them to have the same overall longevity, but that effective size would be influenced more by the degree of sociality: solitary bees/ants/wasps versus eusocial. I am therefore certain that the relationship shown in Figure 4D is currently not significant because the measure of effective size is not relevant for Hymenoptera. The article would have been even more convincing by contrasting the rates of alternative splicing between solitary versus social hymenopterans.

As suggested by the reviewer, we investigated the degree of sociality for the 18 hymenopterans included in our study. We observed that the average dN/dS of the 12 eusocial species (4 bees, 6 ants, 2 wasps) is significantly higher than that of the 6 solitary species (p=2.1×10⁻³; Fig. R1A), consistent with a lower effective population size in eusocial species compared to solitary ones.

However, the AS rate does not differ significantly between these two groups, either for the full set of major-isoform introns (Fig. R1B) or for the subsets of low-AS or high-AS major-isoform introns (Fig. R1C,D). Given the limited sample size (12 eusocial species, 6 solitary species), it is possible that some uncontrolled variables affecting the AS rate hide the impact of Ne.

Author response image 1. Comparison of solitary (N=6) and eusocial hymenopterans (N=12).

(A) dN/dS ratio. (B) AS rate (all major-isoform introns). (C) AS rate (low-AS major-isoform introns). (D) AS rate (high-AS major-isoform introns). The means of the two groups were compared with a Wilcoxon test.
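For readers who want to reproduce this kind of two-group comparison, here is a minimal sketch of an exact two-sided Wilcoxon rank-sum test in plain Python. It assumes that the "Wilcoxon test" of the caption is the rank-sum (Mann-Whitney) variant for independent groups, which is not stated explicitly; the function names and the dN/dS values are hypothetical placeholders, not the data behind the figure.

```python
from itertools import combinations


def midranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks


def rank_sum_exact_p(group_a, group_b):
    """Exact two-sided p-value of the Wilcoxon rank-sum test.

    Enumerates every way of assigning len(group_a) of the pooled ranks to
    group A, which is feasible for small samples such as 6 versus 12.
    """
    pooled = list(group_a) + list(group_b)
    ranks = midranks(pooled)
    n_a = len(group_a)
    observed = sum(ranks[:n_a])
    expected = n_a * (len(pooled) + 1) / 2  # mean rank sum under the null
    observed_deviation = abs(observed - expected)
    total = 0
    as_extreme = 0
    for subset in combinations(ranks, n_a):
        total += 1
        if abs(sum(subset) - expected) >= observed_deviation - 1e-9:
            as_extreme += 1
    return as_extreme / total


# Hypothetical dN/dS values, for illustration only.
solitary = [0.10, 0.11, 0.12, 0.13, 0.14, 0.15]
eusocial = [0.16, 0.17, 0.18, 0.19, 0.20, 0.21,
            0.22, 0.23, 0.24, 0.25, 0.26, 0.27]
print(rank_sum_exact_p(solitary, eusocial))
```

The exact enumeration avoids the normal approximation, which can be unreliable for groups this small. With larger samples, a library routine would be the more practical choice.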

  • When functionalist biologists emphasise the role of the complexity of living things, I'm not sure they're thinking of the comparison between "Drosophila" and "Homo sapiens", but rather of a broader evolutionary scale. Which gives the impression of an exaggeration of the debate in the introduction.

We disagree with the referee: the entire debate about the paradox of the lack of relationship between gene number and organismal complexity arose from the comparative analysis of gene repertoires across metazoans. This debate started in the early 2000s, when the sequencing of the human genome revealed that it contains only ~20,000 protein-coding genes (far fewer than the ~100,000 genes expected at that time). This came as a big surprise because it showed that the gene repertoire of mammals is not larger than that of invertebrates such as Caenorhabditis elegans (19,000 genes) or Drosophila melanogaster (14,000 genes). We cite below several articles that illustrate how this paradox has been perceived by the scientific community:

Graveley BR 2001 Alternative splicing: increasing diversity in the proteomic world. Trends in Genetics 17: 100–107. https://doi.org/10.1016/S0168-9525(00)02176-4

“How can the genome of Drosophila melanogaster contain fewer genes than the undoubtedly simpler organism Caenorhabditis elegans?”

Ewing B and Green P 2000 Analysis of expressed sequence tags indicates 35,000 human genes. Nature Genetics 25: 232–234. https://doi.org/10.1038/76115

“the invertebrates Caenorhabditis elegans and Drosophila melanogaster having 19,000 and 13,600 genes, respectively. Here we estimate the number of human genes […] approximately 35,000 genes, substantially lower than most previous estimates. Evolution of the increased physiological complexity of vertebrates may therefore have depended more on the combinatorial diversification of regulatory networks or alternative splicing than on a substantial increase in gene number.”

Kim E, Magen A and Ast G 2007 Different levels of alternative splicing among eukaryotes. Nucleic Acids Research 35: 125–131. https://doi.org/10.1093/nar/gkl924

“we reveal that the percentage of genes and exons undergoing alternative splicing is higher in vertebrates compared with invertebrates. […] The difference in the level of alternative splicing suggests that alternative splicing may contribute greatly to the mammal higher level of phenotypic complexity,”

Nilsen TW and Graveley BR 2010 Expansion of the eukaryotic proteome by alternative splicing. Nature 463: 457–463. https://doi.org/10.1038/nature08909

“It is noteworthy that Caenorhabditis elegans, D. melanogaster and mammals have about 20,000 (ref. 68), 14,000 (ref. 69) and 20,000 (ref. 70) genes, respectively, but mammals are clearly much more complex than nematodes or flies.”

Reviewer #2 (Public Review):

Summary:

Two hypotheses could explain the observation that genes of more complex organisms tend to undergo more alternative splicing. On one hand, alternative splicing could be adaptive since it provides the functional diversity required for complexity. On the other hand, increased rates of alternative splicing could result through nonadaptive processes since more complex organisms tend to have smaller effective population sizes and are thus more prone to deleterious mutations resulting in more spurious splicing events (drift-barrier hypothesis). To evaluate the latter, Bénitière et al. analyzed transcriptome sequencing data across 53 metazoan species. They show that proxies for effective population size and alternative splicing rates are negatively correlated. Furthermore, the authors find that rare, nonfunctional (and likely erroneous) isoforms occur more frequently in more complex species. Additionally, they show evidence that the strength of selection on splice sites increases with increasing effective population size and that the abundance of rare splice variants decreases with increased gene expression. All of these findings are consistent with the drift-barrier hypothesis.

This study conducts a comprehensive set of separate analyses that all converge on the same overall result and the manuscript is well organized. Furthermore, this study is useful in that it provides a modified null hypothesis that can be used for future tests of adaptive explanations for variation in alternative splicing.

Strengths:

The major strength of this study lies in its complementary approach combining comparative and population genomics. Comparing evolutionary trends across phylogenetic diversity is a powerful way to test hypotheses about the origins of genome complexity. This approach alone reveals several convincing lines of evidence in support of the drift-barrier hypothesis. However, the authors also provide evidence from a population genetics perspective (using resequencing data for humans and fruit flies), making results even more convincing.

The authors are forward about the study's limitations and explain them in detail. They elaborate on possible confounding factors as well as the issues with data quality (e.g. proxies for Ne, inadequacies of short reads, heterogeneity in RNA-sequencing data).

Weaknesses:

The authors primarily consider insects and mammals in their study. This only represents a small fraction of metazoan diversity. Sampling from a greater diversity of metazoan lineages would make these results and their relevance to broader metazoans substantially more convincing. Although the authors are careful about their tone, it is challenging to reconcile these results with trends across greater metazoans when the underlying dataset exhibits ascertainment bias and represents samples from only a few phylogenetic groups. Relatedly, some trends (such as Figure 1B-C) seem to be driven primarily by non-insect species, raising the question of whether some results may be primarily explained by specific phylogenetic groups (although the authors do correct for phylogeny in their statistics). How might results look if insects and mammals (or vertebrates) are considered independently?

Following the referee’s suggestion, we investigated the relationship between AS rate and proxies of Ne, separately for insects and vertebrates (Supplementary Fig. 11). We observed that the relationship was consistent in vertebrates and insects: linear regressions show a positive correlation, significant (p<0.05) in all cases, except for body length in vertebrates. We added a sentence (line 166) to mention this point.

Note that these analyses rely on smaller sample sizes, so they have less power to detect a signal. We therefore prefer to present the combined analyses, using PGLS to account for phylogenetic inertia.

Throughout the manuscript, the authors refer to infrequently spliced (mode <5%) introns as "minor introns" and frequently spliced (mode >95%) as "major introns". This is extremely confusing since "minor introns" typically represent introns spliced by the U12 spliceosome, whereas "major introns" are those spliced by the U2 spliceosome.

To avoid any confusion, we modified the terminology: we now refer to infrequently spliced introns as "minor-isoform introns" and frequently spliced introns as "major-isoform introns" (see lines 135-137). The entire manuscript (including the figures) has been modified accordingly.

Furthermore, it remains unclear whether the study only considers major introns or both major and minor introns. Minor introns typically have AT-AC splice sites whereas major introns usually have GT/GC-AG splice sites, although in rare cases the U2 can recognize AT-AC (see Wu and Krainer 1997 for example).

We modified the text (line 148-150) to clearly state that we studied all introns, both U2-type and U12-type.

The authors also note that some introns show noncanonical AT-AC splice sites while these are actually canonical splice sites for minor introns.

This is corrected (line 148).

Recommendations for the authors:

Reviewer #2 (Recommendations For The Authors):

Figures 1, 3, and 4: I suggest that authors add regression lines.

We added the regression lines with the “pgls” function from the R package “caper” (in Fig. 1, 3 and 4, and also in all other figures where we present correlations).
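To make the idea behind such regression lines concrete, here is a minimal sketch of a generalized least squares (GLS) fit in plain Python, in which the covariance among species is derived from shared branch lengths. This is not the "caper" implementation: it assumes a pure Brownian-motion covariance with no estimated branch-length transformation, and the four-species tree, the function names, and the trait values are all hypothetical.

```python
def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [list(row) + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        # Move the row with the largest entry in this column to the pivot position.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def gls_line(x, y, cov):
    """Return (intercept, slope) of the GLS regression of y on x.

    cov is the expected covariance among observations; for a phylogeny under
    Brownian motion, cov[i][j] is the branch length shared by species i and j.
    """
    ones = [1.0] * len(x)
    # Apply the inverse covariance to each column by solving cov @ z = column.
    vinv_ones = solve(cov, ones)
    vinv_x = solve(cov, x)
    vinv_y = solve(cov, y)

    def dot(a, b):
        return sum(p * q for p, q in zip(a, b))

    # Normal equations (X' V^-1 X) beta = X' V^-1 y, with X = [1, x].
    lhs = [[dot(ones, vinv_ones), dot(ones, vinv_x)],
           [dot(x, vinv_ones), dot(x, vinv_x)]]
    rhs = [dot(ones, vinv_y), dot(x, vinv_y)]
    intercept, slope = solve(lhs, rhs)
    return intercept, slope


# Hypothetical tree ((A:1,B:1):1,(C:1,D:1):1): sister species share one unit
# of branch length, and each tip lies two units from the root.
phylo_cov = [[2.0, 1.0, 0.0, 0.0],
             [1.0, 2.0, 0.0, 0.0],
             [0.0, 0.0, 2.0, 1.0],
             [0.0, 0.0, 1.0, 2.0]]
trait_x = [1.0, 2.0, 3.0, 4.0]   # placeholder predictor values
trait_y = [1.2, 1.9, 3.4, 3.8]   # placeholder response values
print(gls_line(trait_x, trait_y, phylo_cov))
```

When the covariance matrix is the identity, the fit reduces to ordinary least squares; the off-diagonal terms are what down-weight closely related species, which is the sense in which the approach accounts for phylogenetic inertia.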

Figure 2: As previously mentioned, the terms "minor introns" and "major introns" are extremely confusing. I strongly suggest the authors use different naming conventions.

We changed the terminology:

minor introns -> minor-isoform introns

major introns -> major-isoform introns

Figure 5: Intron-exon boundaries and splice site annotations are shown at the bottom of B, C, and D but not A. I suggest removing the annotation beneath B for consistency and since A+C and B+D are aligned on the x-axis.

Corrected, it was a mistake.

Figure 7: The yellow dotted line is very challenging to see in A.

Corrected, the line has been widened.

Associated Data

    Data Citations

    1. Bénitière F, Necsulea A, Duret L. 2024. 2023-Random genetic drift sets an upper limit on mRNA splicing accuracy in metazoans. Zenodo.

    Supplementary Materials

    Figure 1—source data 1. Summary of the main features of the samples analyzed in this study.
    Figure 1—source data 2. Longevity and body length across the 53 metazoans studied.
    MDAR checklist

    Data Availability Statement

    All processed data that we generated and used in this study, as well as the scripts that we used to analyze the data and to generate the figures, are available on Zenodo DOI: https://doi.org/10.5281/zenodo.7415114.

    In particular, the sources of transcriptomic data, genome assemblies and annotations are reported in the Zenodo archive in data/Data1-supp.tab. The archive includes several directories, including figure, which contains the necessary materials to produce the figures of the manuscript. Rmarkdown scripts located in the table_supp directory were used to generate supplementary tables, which are also saved in the same directory. The processed data used to generate figures and conduct analyses are stored in the data directory in tab-separated text format.

    The following dataset was generated:

    Bénitière F, Necsulea A, Duret L. 2024. 2023-Random genetic drift sets an upper limit on mRNA splicing accuracy in metazoans. Zenodo.


    Articles from eLife are provided here courtesy of eLife Sciences Publications, Ltd
