Nucleic Acids Research. 2025 Mar 22;53(6):gkaf184. doi: 10.1093/nar/gkaf184

Conservation assessment of human splice site annotation based on a 470-genome alignment

Ilia Minkin 1,2, Steven L Salzberg 3,4,5,6
PMCID: PMC11928937  PMID: 40119728

Abstract

Despite many improvements over the years, the annotation of the human genome remains imperfect. The use of evolutionarily conserved sequences provides a strategy for selecting a high-confidence subset of the annotation. Using the latest whole-genome alignment, we found that splice sites from protein-coding genes in the high-quality MANE annotation are consistently conserved across >350 species. We also studied splice sites from the RefSeq, GENCODE, and CHESS databases not present in MANE. In addition, we analyzed the completeness of the alignment with respect to the human genome annotations and described a method that would allow us to fix up to 60% of the missing alignments of the protein-coding exons. We trained a logistic regression classifier to distinguish between the conservation exhibited by sites from MANE versus sites chosen randomly from neutrally evolving sequences. We found that splice sites classified by our model as well-supported have lower single nucleotide polymorphism rates and better transcriptomic evidence. We then computed a subset of transcripts using only “well-supported” splice sites or ones from MANE. This subset is enriched in high-confidence transcripts of the major gene catalogs that appear to be under purifying selection and are more likely to be correct and functionally relevant.

Graphical Abstract


Introduction

The annotation of the human genome is a fundamental resource for a broad range of biomedical research and clinical applications. However, more than two decades after the initial publication of the genome itself, the scientific community has not reached a point where a consensus genome annotation is available [1]. One consequence is that the leading gene annotation databases for the human reference genome often disagree even on basic statistics such as the number of protein-coding genes [2]. This is due to a variety of reasons, including the imperfect technologies used to assemble RNA transcripts and the noise inherent in the transcription process itself [3–5].

One of the challenging aspects of constructing a genome annotation is correctly determining the positions of introns inside the genes. The existence of introns and the mechanism of alternative splicing, first proposed by Gilbert [6], are critical for the functioning of cells. At the same time, the evolutionary origin of introns has been the subject of a scientific debate for decades [7–10].

A recent effort to address the challenge of the discrepancy between different human annotations resulted in the creation of a limited, high-quality gene annotation database called MANE [11]. This annotation was intended to include a single representative transcript for each protein-coding gene that has identical exon and intron structures in both RefSeq and GENCODE, two of the leading human annotation databases. The transcripts in MANE are chosen based on criteria that include expression levels and evolutionary conservation, which is a strong predictor of biological function. A similar project called APPRIS [12] provides a single transcript for every protein-coding gene based on human genetics data, protein evidence, and cross-species conservation; APPRIS contains annotations for the human as well as a few other reference species. These approaches yielded a subset of the human transcriptome under strong purifying selection. These and other studies of evolutionary consistency of the human genome annotation [13] were mostly focused on the sequences of the protein-coding exons rather than splice site motifs.

In this study, we address the question of the conservation of splice sites in major gene catalogs, both across multiple species and at the population level. First, we analyzed the completeness of the alignment containing 470 mammalian species recently published by the UCSC Genome Browser team [14] with respect to the annotation of the human exons; we restricted this alignment to 405 species because the genome sequences of the remaining species were not available for download. As we observed alignments of many exons to be missing, we developed a method to fix the missing alignments, recovering up to 60% of the missing exon/genome pairs. Second, we observed that the canonical dinucleotides GT/AG that flank introns are very highly conserved in protein-coding genes in MANE, with most of them being intact in >350 species. We then investigated the patterns of conservation among splice sites that are not in MANE but that are present in one or more of the leading gene catalogs RefSeq, GENCODE, and CHESS. We found that while many of those splice sites closely follow the pattern of conservation found in MANE, others resemble randomly generated sites from neutrally evolving sequences.

To compare the properties of these two groups of splice sites, we developed a logistic regression model that classifies splice sites as either well-supported or less-supported. The model relies on a comparison of conservation patterns of splice sites from MANE to neutrally evolving sequences. As we detail below, we found that sites predicted as well-supported by our classifier have lower rates of single nucleotide polymorphisms (SNPs) in the human population, are enriched in clinically relevant polymorphisms, and have better transcriptomic support. We then obtained a subset of transcripts from each major gene catalog for which all splice sites were either classified as well-supported by our model or included in a transcript from MANE. These transcripts appear to be under strong purifying selection and are more likely to be functional and clinically relevant.

Materials and methods

Realignment of missing exon/genome pairs

Before investigating the conservation of the splice sites, we performed a procedure to fix the gaps in the alignment that might affect the results. First, we found human exons and particular genomes such that the exon is not aligned anywhere in that target genome. We then tried to realign these exons using the synteny information. The intuition is that if a human exon is not aligned to another genome, but down- and upstream exons are mapped to the same sequence in that target genome, then we can try to place the missing exon in between two of its neighbors in the target genome. Below, we give a more detailed description of the method.

We are given a collection of genomes G = {g1, …, gm}, where each genome is a string $g_i = b_{i,1} b_{i,2} \ldots b_{i,|g_i|}$ over the nucleotides of the DNA alphabet {A, C, G, T}, where |gi| is the length of the i-th genome. The genome g1 is called the reference, and any non-reference genome gt, t > 1, is called a target genome. For the reference genome, we are given an exon annotation represented as a set of segments E = {(x1, y1), …, (x|E|, y|E|)}, 1 ≤ xi < yi ≤ |g1|.

To find the corresponding sequence of each exon of the reference in another species, we use a whole-genome alignment of m species. Formally, we define an alignment function w(k, gt) that maps each position k of the reference genome to its homologous position in the target genome gt included in the alignment if such position exists; otherwise, w(k, gt) = −1.

We say that an exon e = (x, y) ∈ E is unaligned in target genome gt if w(k, gt) = −1 for all x ≤ k ≤ y; otherwise, we call the exon aligned. We define the set of all aligned positions of exon e as A(e) = {w(k, gt) | x ≤ k ≤ y, w(k, gt) ≠ −1}. We call an exon e syntenic in genome gt if there are two other exons ea = (xa, ya) and eb = (xb, yb), ya < x < y < xb, that are aligned in gt.
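The definitions above can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that the alignment of the reference to one target genome is stored as a dictionary from reference positions to target positions; the function names are hypothetical and are not taken from the authors' software.

```python
from typing import Dict, List, Tuple

Exon = Tuple[int, int]      # (x, y): inclusive reference coordinates
Alignment = Dict[int, int]  # reference position -> target position (one target genome)


def w(k: int, aln: Alignment) -> int:
    """Alignment function: homologous target position of k, or -1 if none."""
    return aln.get(k, -1)


def aligned_positions(exon: Exon, aln: Alignment) -> List[int]:
    """A(e): target positions to which at least one base of the exon is aligned."""
    x, y = exon
    return [w(k, aln) for k in range(x, y + 1) if w(k, aln) != -1]


def is_unaligned(exon: Exon, aln: Alignment) -> bool:
    """An exon is unaligned if none of its positions is aligned."""
    return not aligned_positions(exon, aln)


def is_syntenic(exon: Exon, exons: List[Exon], aln: Alignment) -> bool:
    """An exon is syntenic if aligned exons exist both upstream and downstream."""
    x, y = exon
    upstream = any(ya < x and not is_unaligned((xa, ya), aln) for xa, ya in exons)
    downstream = any(xb > y and not is_unaligned((xb, yb), aln) for xb, yb in exons)
    return upstream and downstream
```

In this toy representation, an exon with no entry in the dictionary but with mapped neighbors on both sides is exactly the unaligned, syntenic case that the realignment step targets.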

We use the fact that unaligned but syntenic exons have neighboring exons mapped to the target genome to get a hint of where these exons could align. Let e be such an unaligned exon. Then, the target segment u is defined as u = (max(A(ea)) + 1, min(A(eb)) − 1). We use the edlib library [15] to find the best alignment of e to the range u in the target genome gt, which gives the alignment function we(k, gt) for the specific exon e. After trying to realign all such exons e, we merge the resulting alignments we(k, gt) with the original alignment w(k, gt); we define the resulting function as w′(k, gt). Figure 1A illustrates the above definitions.
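A rough sketch of this step is given below. The study uses edlib for the search; the pure-Python routine here is only a stand-in that performs the same kind of infix (semi-global) edit-distance search, and it assumes that both neighboring exons map to the forward strand so that the target segment is non-empty. All function names are illustrative.

```python
from typing import List, Tuple


def target_segment(a_ea: List[int], a_eb: List[int]) -> Tuple[int, int]:
    """u = (max(A(e_a)) + 1, min(A(e_b)) - 1), inclusive target coordinates."""
    return max(a_ea) + 1, min(a_eb) - 1


def infix_align(query: str, target: str) -> Tuple[int, int]:
    """Best infix match of query inside target (Sellers' dynamic programming).

    Returns (edit distance, end index j) such that the best match ends just
    before target[j]. Flanking target bases on either side are free.
    """
    prev = [0] * (len(target) + 1)  # row 0: a match may start anywhere in target
    for i, q in enumerate(query, start=1):
        cur = [i] + [0] * len(target)
        for j, t in enumerate(target, start=1):
            cur[j] = min(
                prev[j - 1] + (q != t),  # match or mismatch
                prev[j] + 1,             # query base left unmatched
                cur[j - 1] + 1,          # extra base in target
            )
        prev = cur
    best_end = min(range(len(target) + 1), key=lambda j: prev[j])
    return prev[best_end], best_end
```

The exon sequence plays the role of `query` and the substring of the target genome covered by u plays the role of `target`; a low edit distance suggests a plausible placement of the exon between its neighbors.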

Figure 1.


An example demonstrating the definitions from the “Materials and methods” section. Panel (A) shows an alignment of exons from the reference genome g1 (human) to a target genome g2 (mouse) using a whole-genome alignment. Boxes in the annotation represent positions of exons ea, e, and eb from the annotation of the human genome, and the arrowed lines are introns; splice sites are not shown for visual clarity. In this example, exons ea and eb are aligned, with vertical dashes indicating the alignment between nucleotides of the different genomes. The first positions of the corresponding exons xa and xb are aligned to their counterparts in the target genomes, w(xa, g2) and w(xb, g2). The exon e is unaligned since all its positions are missing in the alignment, as indicated by question marks, w(x, g2) = w(x + 1, g2) = w(x + 2, g2) = −1. At the same time, the exon e is syntenic, since its neighboring exons are aligned, and we can reasonably hypothesize that e can be realigned to the segment between the alignments of ea and eb in the mouse genome, i.e. w(xa, g2) < we(x, g2) < w(xb, g2). Panel (B) shows an example of splice site annotation: d1 denotes the position o(d1) of the first of the canonical dinucleotides of the donor splice site, and a1 denotes the position of the first of the canonical dinucleotides o(a1) of the acceptor site. The donor site d1 of the human reference genome g1 has both its canonical dinucleotides intact in the target mouse genome, g2. However, this is not true for the acceptor site a1 mutated in mouse. In this example, the values of the conservation function for these two splice sites are C(d1, 0, 2) = C(d1, 1, 2) = 1, C(a1, 0, 2) = 0, and C(a1, 1, 2) = 1.

We apply several filtering steps along the process. To reduce the computational load, we only consider target segments u with a length less than a predefined threshold, which we set to 100 000 in our experiments. We also used the following criteria to filter out potentially spurious alignments. Let Et ⊆ E be a subset of exons aligned in the genome gt. For an exon e = (x, y) ∈ Et, we define its alignment score $r(e) = |\{k : x \le k \le y,\ w(k, g_t) \ne -1\}| / (y - x + 1)$, i.e. the fraction of the positions of the exon e that are aligned in gt. Let Rt = {r(e) | e ∈ Et} be the set of the scores of all aligned exons with respect to the genome gt in the original alignment. We only accept a realignment we(k, gt) of exon e = (x, y) if its alignment score passes a threshold computed from μ and σ, the arithmetic mean and standard deviation of the scores in Rt.
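The filtering logic can be sketched as follows. The segment-length limit of 100 000 comes from the text, but the exact form of the score threshold is not given here, so the rule `mu - n_sigma * sigma` and the parameter `n_sigma` are assumptions made only for illustration.

```python
from statistics import mean, pstdev
from typing import Dict, List, Tuple

MAX_SEGMENT_LENGTH = 100_000  # length limit for target segments stated in the text


def segment_is_short_enough(u: Tuple[int, int]) -> bool:
    """Keep only target segments shorter than the predefined threshold."""
    start, end = u
    return end - start + 1 < MAX_SEGMENT_LENGTH


def alignment_score(exon: Tuple[int, int], aln: Dict[int, int]) -> float:
    """r(e): fraction of the exon's positions that are aligned in the target."""
    x, y = exon
    aligned = sum(1 for k in range(x, y + 1) if aln.get(k, -1) != -1)
    return aligned / (y - x + 1)


def accept_realignment(score: float, reference_scores: List[float],
                       n_sigma: float = 1.0) -> bool:
    """ASSUMPTION: accept if score >= mu - n_sigma * sigma over R_t.

    The real inequality may differ; n_sigma is a hypothetical parameter.
    """
    mu = mean(reference_scores)
    sigma = pstdev(reference_scores)
    return score >= mu - n_sigma * sigma
```

Whatever the exact threshold, the intent is the same: a realigned exon should be covered about as well as exons that were aligned in the original alignment of the same genome.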

Splice site classification

Our method for classifying splice sites is based on a logistic regression model designed to predict the probability of a splice site having a MANE-like conservation pattern (well-supported) or a conservation pattern similar to a neutrally evolving sequence (less-supported). One of the primary features used by the regression model is the number of species in which the canonical dinucleotides are conserved, computed from a large multiple genome alignment. In addition, it takes into account an array of positions surrounding a splice site, as they appear to have similar conservation properties. Having such a classification method in addition to the number of species in which the canonical dinucleotides are conserved is necessary because the number itself is not informative without a baseline representing neutrally evolving sequences to compare against. The training data include randomly chosen sites from intronic sequences as negative examples and the whole MANE dataset as positive examples. Below, we give the necessary initial definitions and describe the model.

We are given a splice site annotation for the reference genome, represented as two sets, donor sites D = {d1, …, d|D|} and acceptor sites A = {a1, …, a|A|}. The origin of the site is the position of the first of the canonical dinucleotides, which we designate as o(s), where s is either a donor or an acceptor site. Thus, for most donor sites, $b_{1,o(d)} = \mathrm{G}$ and $b_{1,o(d)+1} = \mathrm{T}$, and for most acceptor sites, $b_{1,o(a)} = \mathrm{A}$ and $b_{1,o(a)+1} = \mathrm{G}$.

To find the corresponding sequence of each splice site of the reference in another species, we use a whole-genome alignment of m species represented as the alignment function w′(k, gt) that maps a position k of the reference genome to a target genome gt. We also define the conservation function C(s, ℓ, t) as follows: it takes the value of 1 if the nucleotide at shift ℓ from the origin of splice site s matches its homologous nucleotide in the genome gt and 0 otherwise: $C(s, \ell, t) = I[b_{1,\,o(s)+\ell} = b_{t,\,w'(o(s)+\ell,\, g_t)}]$. Figure 1B shows an example of mapping a splice site sequence using the whole-genome alignment and of computing the alignment and conservation functions.
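As a small illustration, the conservation function can be written as below. The sketch assumes 0-based positions, genomes stored as strings keyed by their index (1 for the reference), and one position dictionary per target genome; it also assumes that an unaligned position counts as not conserved. The names are hypothetical.

```python
from typing import Dict


def conservation(genomes: Dict[int, str],
                 alignments: Dict[int, Dict[int, int]],
                 origin: int, shift: int, t: int) -> int:
    """Return 1 if the reference base at origin + shift equals its homolog in genome t."""
    ref_pos = origin + shift
    tgt_pos = alignments[t].get(ref_pos, -1)
    if tgt_pos == -1:
        return 0  # assumption: unaligned positions are treated as not conserved
    return int(genomes[1][ref_pos] == genomes[t][tgt_pos])


# Toy example mirroring Figure 1B: the donor GT is intact in the target,
# while the first base of the acceptor AG is mutated.
genomes = {1: "AAGTCCCAGTT", 2: "AAGTCCCTGTT"}
alignments = {2: {k: k for k in range(len(genomes[1]))}}
donor_origin, acceptor_origin = 2, 7
```

With these toy sequences, the donor site is conserved at both canonical positions, whereas the acceptor site is conserved only at its second position.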

Our model uses two types of variables to classify a splice site as well-supported or less-supported: (i) the number of species in which the canonical dinucleotides are jointly conserved and (ii) the number of species in which each nucleotide within 30 positions down- and upstream of the canonical dinucleotides is conserved, with one variable per position. Thus, the log odds of an acceptor site ar being well-supported, with probability p, are defined as

$$\ln\frac{p}{1-p} = \alpha_0 + \alpha_1 \sum_{t=2}^{m} C(a_r, 0, t)\, C(a_r, 1, t) + \sum_{\ell} \alpha_\ell \sum_{t=2}^{m} C(a_r, \ell, t),$$

where α0 is the intercept term, α1 corresponds to the conservation of the canonical dinucleotides, αℓ is the coefficient corresponding to the conservation of the position with shift ℓ from the splice site, and the sum over ℓ runs over the shifts of the non-canonical positions around the site. The log odds of a donor splice site being well-supported are defined analogously. In addition, we evaluated a model that takes into account only the conservation of the canonical dinucleotides, in order to assess the contribution of the rest of the positions in the splicing motif. For this model, we define the log odds with respect to the probability p0 of an acceptor splice site ar being well-supported as

$$\ln\frac{p_0}{1-p_0} = \alpha_0 + \alpha_1 \sum_{t=2}^{m} C(a_r, 0, t)\, C(a_r, 1, t).$$
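The two kinds of model variables and the resulting log odds can be sketched as follows. This is an illustration under stated assumptions: the conservation function is passed in as a callable for a single splice site, the set of shifts is supplied by the caller, and the coefficient values used in any example are made up rather than fitted.

```python
import math
from typing import Callable, Dict, Iterable, Tuple


def species_counts(C: Callable[[int, int], int],
                   shifts: Iterable[int],
                   targets: Iterable[int]) -> Tuple[int, Dict[int, int]]:
    """Count species with jointly conserved dinucleotides and per-position conservation.

    C(shift, t) is the conservation indicator of one splice site in genome t.
    """
    targets = list(targets)
    joint = sum(C(0, t) * C(1, t) for t in targets)
    per_position = {l: sum(C(l, t) for t in targets) for l in shifts}
    return joint, per_position


def log_odds(joint: int, per_position: Dict[int, int],
             alpha0: float, alpha1: float, alpha: Dict[int, float]) -> float:
    """Intercept + dinucleotide term + one term per surrounding position."""
    return alpha0 + alpha1 * joint + sum(alpha[l] * n for l, n in per_position.items())


def probability(z: float) -> float:
    """Logistic function turning log odds into a probability."""
    return 1.0 / (1.0 + math.exp(-z))
```

Passing an empty `per_position` dictionary gives the reduced model that uses the canonical dinucleotides only.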

Data and software acquisition

We analyzed the following gene catalogs: GENCODE [16] version 45, RefSeq [17] release 110, CHESS 3 [18] v.3.1.0, and MANE [11] Select v1.3. We note that GENCODE, RefSeq, and CHESS 3 all contain every gene and transcript in MANE, which was created by GENCODE and RefSeq scientists with the goal of providing a single high-confidence transcript for every human protein-coding gene. To take into account this confounding factor and observe the differences between annotations more clearly, we removed the MANE splice sites from each of the other catalogs and created reduced versions that we designate as GENCODE*, RefSeq*, and CHESS 3*, respectively. This procedure only affected protein-coding genes, because MANE does not currently contain noncoding genes or other types of annotation.

We only included protein-coding and long noncoding RNA (lncRNA) genes in our analyses; however, the way these catalogs define gene types slightly differs. For GENCODE and MANE, we used the attribute “transcript_type” of a transcript to infer its type; for CHESS 3, we used the attribute “gene_type” of a transcript for this purpose. For RefSeq, we consider a transcript to be protein coding if its corresponding gene was assigned “protein_coding” to its “gene_biotype” attribute, and the transcript was assigned “mRNA” to its “transcript_biotype” attribute. For lncRNAs from RefSeq, we consider a transcript to be lncRNA if its corresponding gene was assigned “lncRNA” to its “gene_biotype” attribute, and the transcript was assigned “lnc_RNA” to its “transcript_biotype” attribute. We note that for protein-coding genes we considered all introns from the mentioned annotations, which include the ones located in untranslated regions.

In addition, we created a false gene annotation, intended to capture a baseline of neutrally evolving sequences; we refer to this dataset as “Random.” This annotation consists of 180 000 randomly generated transcripts located within introns of genes of MANE outside of splicing motifs. Each transcript consists of two short exons separated by an intron, yielding 180 000 distinct donor and acceptor sites.

We used a 470-species alignment available at the UCSC Genome Browser website [19] generated using MultiZ whole-genome aligner [20]. However, we had to restrict this alignment to 405 species: in order to implement our exon realignment procedure, we had to download the sequences of the genomes themselves, and 65 of these genomes were unavailable for download; the full list of 405 genomes is available in online documentation. In addition, we excluded any genes located on the alternative sequences of GRCh38 [21] if such sequences were not included in the original alignment published by UCSC; we also excluded single exon transcripts. We also excluded chromosome Y from our analysis due to the reasons listed in the “Results” section.

For fitting the coefficients of the regression equations above, we used the logistic regression module from scikit-learn [22]. To implement the alignment function w(k, gt), we utilized the AlignIO module from the Biopython library [23] version 1.79.

Results

Assessing completeness of the alignment

Any conclusions about the conservation of splice sites drawn from the alignment analysis depend on its completeness. To assess it, we calculated the following statistics that reflect how many exons from human gene catalogs are mapped to the other genomes. Let E be the set of protein-coding and lncRNA exons from the three gene catalogs under consideration and G be the genomes used in the whole-genome alignment.

We define a variable W(e, gt), e ∈ E, gt ∈ G, that indicates whether exon e is aligned in the target genome gt as follows. We assign W(e, gt) = 1 if the exon e is aligned in the genome gt, and W(e, gt) = 0 otherwise. To summarize the alignment of the exons, we calculate the sums sm = ∑(1 − W(e, gt)) and sa = ∑W(e, gt), showing how many exon/genome pairs are missing and present in the alignment, respectively.
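These two sums can be computed with a short loop. The sketch assumes the same illustrative representation as before, one dictionary of reference-to-target positions per target genome; it is not the authors' implementation.

```python
from typing import Dict, List, Tuple


def completeness(exons: List[Tuple[int, int]],
                 alignments: Dict[str, Dict[int, int]]) -> Tuple[int, int]:
    """Return (s_a, s_m): numbers of aligned and missing exon/genome pairs."""
    s_a = 0
    s_m = 0
    for x, y in exons:
        for aln in alignments.values():
            # W(e, g_t) = 1 if at least one position of the exon is aligned
            aligned = any(aln.get(k, -1) != -1 for k in range(x, y + 1))
            s_a += int(aligned)
            s_m += int(not aligned)
    return s_a, s_m
```

By construction, s_a + s_m equals the number of exons multiplied by the number of target genomes.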

Table 1, columns 2–5 present the statistics sa and sm broken down by chromosome and gene type. For autosomal chromosomes, we observe that up to 13% of exon/genome pairs are missing for protein-coding genes, and up to 50% of such pairs are missing for lncRNAs. Sex chromosome Y is an obvious outlier since >35% of protein-coding and 72% of lncRNA exon/genome pairs are missing in the alignment.

Table 1.

Completeness of the alignment with respect to the annotation and results of the realignment procedure

| Chromosome | Aligned, coding | Aligned, lncRNA | Missing, coding | Missing, lncRNA | Recovered, coding | Recovered, lncRNA |
|---|---|---|---|---|---|---|
| Chr1 | 13 524 670 | 4 943 971 | 1 173 995 | 1 848 284 | 558 122 | 1 016 754 |
| Chr2 | 10 200 614 | 5 196 973 | 764 761 | 1 798 592 | 426 570 | 1 057 443 |
| Chr3 | 8 517 408 | 3 747 144 | 569 172 | 1 263 921 | 331 051 | 743 381 |
| Chr4 | 5 524 742 | 2 887 223 | 504 088 | 1 325 587 | 274 431 | 754 267 |
| Chr5 | 6 143 565 | 3 287 189 | 404 475 | 1 260 556 | 230 366 | 684 250 |
| Chr6 | 6 669 798 | 3 437 254 | 544 467 | 1 290 311 | 295 057 | 756 804 |
| Chr7 | 6 478 768 | 2 808 452 | 589 292 | 1 288 933 | 312 680 | 666 229 |
| Chr8 | 4 886 473 | 2 744 951 | 478 967 | 1 302 619 | 259 518 | 684 356 |
| Chr9 | 5 271 324 | 2 320 005 | 404 346 | 993 705 | 215 712 | 435 116 |
| Chr10 | 5 399 044 | 2 491 811 | 456 446 | 989 569 | 257 332 | 561 440 |
| Chr11 | 8 459 650 | 2 620 330 | 653 255 | 844 445 | 336 269 | 501 208 |
| Chr12 | 8 110 205 | 2 751 895 | 583 525 | 1 148 660 | 332 270 | 663 752 |
| Chr13 | 2 334 841 | 1 663 448 | 217 469 | 794 497 | 131 064 | 418 293 |
| Chr14 | 4 676 912 | 2 022 926 | 304 993 | 688 144 | 187 024 | 392 903 |
| Chr15 | 4 948 431 | 2 327 369 | 341 679 | 834 871 | 185 590 | 425 251 |
| Chr16 | 6 284 584 | 2 227 608 | 602 846 | 824 067 | 320 788 | 454 321 |
| Chr17 | 8 710 715 | 2 248 362 | 685 690 | 868 518 | 341 519 | 441 792 |
| Chr18 | 2 366 477 | 1 501 463 | 191 098 | 606 967 | 112 751 | 366 562 |
| Chr19 | 8 336 039 | 1 275 528 | 1 333 336 | 1 222 917 | 620 313 | 483 427 |
| Chr20 | 3 371 434 | 1 546 869 | 259 391 | 636 486 | 150 521 | 351 570 |
| Chr21 | 1 299 495 | 1 000 286 | 165 795 | 561 799 | 93 259 | 256 060 |
| Chr22 | 2 908 246 | 1 043 948 | 309 884 | 516 517 | 160 708 | 228 741 |
| ChrX | 4 774 991 | 1 068 688 | 597 739 | 648 917 | 200 967 | 191 003 |
| ChrY | 236 212 | 130 554 | 147 728 | 350 991 | 20 636 | 32 383 |

The number of aligned protein-coding and lncRNA exons/genomes in the original alignment of 470 mammals restricted to 405 species (columns 2 and 3, defined as sa in the main text), missing in the alignment (columns 4 and 5, defined as sm in the main text), and recovered using our synteny-based realignment procedure (columns 6 and 7, defined as sr in the main text). We computed these numbers for the union of the set of exons of all gene annotations under consideration: GENCODE, RefSeq, CHESS 3, and MANE; we present these statistics for each gene type and chromosome separately.

Since we observed that a substantial number of exon/genome pairs were not aligned, we developed a strategy to recover them using synteny information. The description of this step can be found in the “Realignment of missing exon/genome pairs” section. We define the number of recovered pairs as sr = ∑Wr(e, gt), summed over all e, gt such that W(e, gt) = 0, where Wr(e, gt) = 1 if at least one position of the exon e is aligned somewhere in the genome gt by the extended alignment function described in the “Realignment of missing exon/genome pairs” subsection, and Wr(e, gt) = 0 otherwise. Table 1, columns 6 and 7 show the number of exon/genome pairs recovered by our method: for autosomal chromosomes, we recovered up to 60% of the missing exon/genome pairs for both protein-coding genes and lncRNAs. In contrast, we recovered 33% and 29% of such pairs (protein-coding and lncRNA, respectively) for chromosome X and only 9%–13% of the exon/genome pairs for chromosome Y. This can be explained by the fact that many genomes lack an assembled Y chromosome and by the challenging structure of this chromosome, which is rich in families of repeated sequences. Hence, we decided to exclude chromosome Y from our analysis. In addition to realigning the exons from the real datasets, we realigned the exons from the “Random” dataset to have a realistic baseline. Supplementary Table S1 shows these numbers, which are similar to those of the real gene annotations: we were able to recover 58%–66% of exon/genome pairs.
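The recovery percentages follow directly from Table 1 as the ratio of recovered to missing exon/genome pairs. The snippet below reproduces this arithmetic for a few rows copied from the table; the dictionary layout and function name are only illustrative.

```python
# (missing, recovered) exon/genome pairs copied from Table 1
TABLE1 = {
    ("Chr13", "coding"): (217_469, 131_064),
    ("Chr18", "lncRNA"): (606_967, 366_562),
    ("ChrX", "coding"): (597_739, 200_967),
    ("ChrX", "lncRNA"): (648_917, 191_003),
}


def recovered_fraction(missing: int, recovered: int) -> float:
    """Fraction of missing exon/genome pairs recovered by realignment (s_r / s_m)."""
    return recovered / missing


fractions = {key: recovered_fraction(*values) for key, values in TABLE1.items()}
```

For these rows the fractions are roughly 0.60 for the two autosomal examples and about 0.34 and 0.29 for chromosome X.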

Exploratory data analysis of splice site conservation

First, we evaluated the evolutionary conservation of splice sites from four different human genome annotation databases: GENCODE(*), RefSeq(*), CHESS 3(*), and MANE Select. Table 2 shows the numbers of donor and acceptor sites in each dataset, as well as the total number of transcripts. For every donor and acceptor splice site in the databases, we computed how many species preserve the consensus dinucleotides (GT and AG) that appear at the beginning and end of most introns. Figure 2 shows the pattern of conservation across species for each of these sets of donor and acceptor sites. First, we note that splice sites from protein-coding genes in MANE yield a plot that is clearly distinct from the other gene catalogs: most of the sites from MANE are conserved in >350 species. Second, protein-coding splice sites from the other datasets (after removing the MANE splice sites) seem to fall into two distinct categories: (i) MANE-like and (ii) neutral-like conservation. We also noted a small peak for splice sites from protein-coding genes of RefSeq* at around 330 species: most of these splice sites come from the alternative sequences to the hg38 reference genome that are absent in the other annotations. In contrast, lncRNAs from all datasets have very similar distributions that closely follow the conservation pattern of random sites. Both donor and acceptor splice sites show similar patterns of conservation. We note that randomly generated sites along with lncRNAs and some sites from coding genes exhibit several peaks in conservation in <50 species.

Table 2.

Summary statistics of splice site conservation analysis

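| Gene type | Dataset | Donor sites (all) | Acceptor sites (all) | Donor sites (“well-supported”) | Acceptor sites (“well-supported”) | All transcripts | “Well-supported” transcripts |
|---|---|---|---|---|---|---|---|
| Protein-coding | MANE | 182 596 | 182 557 | – | – | 18 204 | – |
| Protein-coding | GENCODE* | 33 991 | 26 863 | 10 616 | 10 013 | 69 233 | 34 923 |
| Protein-coding | RefSeq* | 62 860 | 54 160 | 27 537 | 25 986 | 116 762 | 46 006 |
| Protein-coding | CHESS 3* | 50 474 | 45 459 | 23 849 | 22 905 | 85 364 | 41 196 |
| lncRNA | MANE | – | – | – | – | – | – |
| lncRNA | GENCODE | 55 807 | 57 356 | 7940 | 9740 | 53 353 | 1240 |
| lncRNA | RefSeq | 48 508 | 48 616 | 4148 | 5729 | 30 503 | 304 |
| lncRNA | CHESS 3 | 47 991 | 48 402 | 4221 | 5685 | 35 575 | 418 |
| Synthetic data | Random | 180 000 | 180 000 | – | – | 180 000 | – |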

The second and third columns represent the total number of donor and acceptor sites in each dataset, and the fourth and fifth columns show the number of donor and acceptor splice sites classified as “well-supported” by our model. The last two columns indicate the total number of transcripts in each dataset and the number of transcripts that have all splice sites either from the MANE dataset or classified as “well-supported.” We only considered transcripts with at least one intron present. Dashes indicate that transcripts and splice sites of a certain type were not available in a dataset.

Figure 2.


Distribution of the number of human splice sites with canonical dinucleotides (GT for donor and AG for acceptor sites) conserved in 405 mammals, computed for donor (A) and acceptor (B) sites of protein-coding genes, and donor (C) and acceptor (D) sites of lncRNAs. Each point shows the number of splice sites (y-axis) conserved in a given number of genomes (x-axis). Numbers are normalized by the total number of sites in the corresponding dataset in each category. The figure shows this statistic for annotations from GENCODE, RefSeq, CHESS 3, and MANE, as well as artificial splice sites (“Random”) generated from internal sequences of introns that are assumed to evolve neutrally. For protein-coding genes, we created subsets of GENCODE, RefSeq, and CHESS 3 from which we removed MANE annotations because each of these datasets is a superset of MANE; the resulting datasets are designated as GENCODE*, RefSeq*, and CHESS 3*, respectively.

We also determined the species in which the splice sites conserved in <50 species are most commonly conserved; these are listed in Supplementary Table S2. These species are mostly primates, which suggests that the conservation of these sites is merely a result of a relatively recent common ancestor with humans. These splice sites may be clade-specific, or they might represent erroneous annotations. We also calculated the same statistic for splice sites that have noncanonical dinucleotides on the introns’ flanks. Most of these splice sites are either U2-type sites flanked by GC–AG [24] or U12-type minor form introns [25, 26] flanked by the dinucleotides AT–AC. Supplementary Fig. S1 shows these numbers; they follow the same pattern as splice sites with the canonical dinucleotides.

Given the striking pattern of conservation of the canonical dinucleotides of splice sites from MANE, we investigated the conservation of different positions around splice sites. Supplementary Fig. S2 shows the pattern of conservation of bases as a function of their distance from the GT/AG splice site. As expected, the canonical dinucleotides (GT for donor sites and AG for acceptor sites) are the most conserved. On the other hand, upstream positions for donor sites and downstream positions for acceptor sites show similar patterns of conservation. However, downstream positions for donor sites and upstream ones for acceptor sites are much less conserved, which is expected because these positions are intronic.

We further explored the question of how well splice sites are conserved at the human population level. Specifically, we calculated the fraction of splice sites having an SNP at a certain position, similar to the cross-species conservation of different positions shown in Supplementary Fig. S2. To determine the presence of SNPs in the human population, we used the gnomAD database version 4.0.0 [27], focusing on loci that have at least one homozygous sample since a homozygous SNP at a splice site is very likely to cause incorrect splicing. Figure 3 shows these fractions, which we call “SNP rates,” calculated for each of the different gene catalogs. As expected, for protein-coding genes and their donor and acceptor splice sites, MANE has a much lower fraction of SNP sites at the canonical dinucleotides compared to random GT/AG positions, 0.2% versus 1.2%. On the other hand, splice sites in GENCODE* have only slightly lower SNP rates than neutrally evolving sequences; the rates of CHESS 3* and RefSeq* are closer to MANE, but still somewhat higher. For lncRNAs from all of the catalogs, we observed that the SNP rates are relatively close to those of neutrally evolving sequences.
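The “SNP rate” at a given shift is simply the fraction of sites in a dataset that carry an SNP at that position. A minimal sketch is shown below; it assumes all sites lie on one chromosome and on the forward strand, and that the set of SNP positions has already been filtered to loci with at least one homozygous sample. The function name and inputs are hypothetical.

```python
from typing import Dict, Iterable, List, Set


def snp_rates(site_origins: List[int],
              snp_positions: Set[int],
              shifts: Iterable[int]) -> Dict[int, float]:
    """For each shift, the fraction of splice sites with an SNP at origin + shift."""
    n = len(site_origins)
    return {
        shift: sum((origin + shift) in snp_positions for origin in site_origins) / n
        for shift in shifts
    }
```

Computing this dictionary for shifts around the canonical dinucleotides, separately for each catalog, yields the kind of per-position profile shown in Figure 3.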

Figure 3.


Rate of SNPs at positions near splice sites. Each point represents the proportion of splice sites from a certain dataset that have an SNP from the gnomAD dataset at a position either down- or upstream of the “canonical” dinucleotides. For example, for donor splice sites 0 is usually “G,” +1 is “T” (shown under the corresponding ticks on the horizontal axis), and −1 is the first nucleotide upstream of the splice site. We only considered SNPs that have at least one homozygous sample. Panels (A) and (B) show donor and acceptor sites of protein-coding genes, while panels (C) and (D) show values for donor and acceptor sites of lncRNAs. For lncRNAs, we included MANE sites from protein-coding genes as a baseline for splice sites under strong selection as MANE does not contain lncRNAs yet.

Our analysis suggests that some protein-coding splice sites and many more lncRNA splice sites include a subset of sites that are not under strong purifying selection. Otherwise, their average SNP rates should have been more similar to what we observed in splice sites from MANE. These sites might potentially represent nonfunctional and/or erroneous annotations.

Classifying splice sites based on their conservation

Above we showed that splice sites from the major gene catalogs exhibit two clearly distinct patterns of conservation: MANE-like and random-like. For brevity, we refer to the former as “well-supported” and the latter as “less-supported.” We next decided to classify splice sites based on their conservation across species, and to compare their properties to see whether less-supported sites might be misannotated. To do so, we trained a binary classifier based on logistic regression that uses the number of species in which a certain position around a splice site is conserved; we trained models for donor and acceptor sites separately. We used the randomly generated sites as negative examples and the whole MANE database as positive ones, with 20% of the data set aside for testing; the “Materials and methods” section contains a detailed description of the model.

Supplementary Fig. S3 shows the receiver operating characteristic (ROC) curve illustrating the trade-off between true positive and false positive rates for these models on the test data. We evaluated two types of models: one using only the conservation of the canonical dinucleotides GT/AG themselves, and the other one also using the conservation of the other positions in the splicing motif (see the “Materials and methods” section for more details). Both models show high accuracy on the test data, with an area under the ROC curve of 0.974–0.979 for donor and acceptor sites. However, the full model has a slightly lower false positive rate for a given classification threshold; hence, we chose it for further analyses. For classification, we used a threshold of 0.5 on the probability predicted by the regression model to label sites as well-supported or less-supported. Given this threshold, the full models for donor and acceptor sites have F-scores of 0.933 and 0.949, respectively (see Supplementary Table S3).
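To make the thresholding and the F-score concrete, the following sketch labels sites from predicted probabilities and computes the F-score against known labels. Whether the cut-off is inclusive (probability ≥ 0.5) is an assumption here, and the function is a generic illustration rather than the evaluation code used in the study.

```python
from typing import List


def f_score(probabilities: List[float], labels: List[int],
            threshold: float = 0.5) -> float:
    """F-score of the 'well-supported' class at the given probability threshold."""
    predicted = [int(p >= threshold) for p in probabilities]  # assumed inclusive cut-off
    tp = sum(1 for y_hat, y in zip(predicted, labels) if y_hat == 1 and y == 1)
    fp = sum(1 for y_hat, y in zip(predicted, labels) if y_hat == 1 and y == 0)
    fn = sum(1 for y_hat, y in zip(predicted, labels) if y_hat == 0 and y == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Here label 1 stands for a MANE site (positive example) and label 0 for a randomly generated site (negative example).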

In addition, we compared the probability output by the regression model to PhastCons [28] scores indicating whether a particular position in the genome is under negative selection. To do so, we used the PhastCons scores available at the UCSC Genome Browser that were computed using the same 470-species alignment. For each site, we took the minimum of the two PhastCons scores at the positions of its canonical dinucleotides (usually GT/AG) and computed the Pearson correlation between this quantity and the probability predicted by the regression model; this resulted in a Pearson correlation of 0.8 across all sites from the datasets under consideration for which PhastCons scores were available.
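The comparison can be sketched in two small functions: one that reduces a site to the minimum PhastCons score over its two canonical positions, and one that computes the Pearson correlation. The per-position score dictionary is an assumed input format chosen for illustration.

```python
import math
from typing import Dict, List


def site_phastcons(phastcons: Dict[int, float], origin: int) -> float:
    """Minimum of the PhastCons scores at the two canonical dinucleotide positions."""
    return min(phastcons[origin], phastcons[origin + 1])


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)
```

In use, `xs` would hold the per-site minimum PhastCons scores and `ys` the probabilities predicted by the regression model for the same sites.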

We then applied the model to each dataset under consideration to label sites as well-supported or less-supported. Table 2 (columns 4 and 5) contains the number of donor and acceptor splice sites in each of the annotation databases classified by the model as well-supported. For protein-coding genes, we observed that in GENCODE* only 31% of donor and 38% of acceptor splice sites were well-supported according to the model, while for RefSeq* and CHESS 3*, the proportion was higher, at 44%–47% for donor sites and 48%–50% for acceptor sites, suggesting that RefSeq and CHESS 3 have somewhat more reliable annotations of protein-coding transcripts. For lncRNAs, no more than 17% of splice sites were classified as well-supported across all datasets. We observed similar results for noncanonical splice sites; these numbers are presented in Supplementary Table S4. Supplementary Fig. S4 shows the relationship between the probability of a donor (acceptor) splice site being classified as “well-supported” and the number of genomes in which the canonical dinucleotides of that splice site are conserved; most sites that have their dinucleotides conserved in <200 species are classified as “less-supported.”

Columns 6 and 7 of Table 2 also show the total number of transcripts in each annotation and the number of “well-supported transcripts,” in which each splice site is either shared with a transcript from MANE or classified as “well-supported” by our model; we only considered transcripts with at least one intron. For protein-coding genes, 50%, 39%, and 48% of transcripts in GENCODE*, RefSeq*, and CHESS 3*, respectively, have fully well-supported splice sites. The number of well-supported lncRNAs is much lower in each dataset: 1%–1.1% for RefSeq and CHESS 3 and 2% for GENCODE. Supplementary Fig. S5 shows the number of well-supported transcripts in each gene type and dataset split by the number of introns in a transcript. We also broke down the number of well-supported and less-supported splice sites of protein-coding genes by whether they are located inside a MANE exon. As Supplementary Fig. S2 shows, positions within the exon are conserved similarly to the canonical dinucleotides, so alternative splice sites located within exons could be mistakenly labeled as well-supported, inflating the number of such sites. Supplementary Table S5 shows these numbers: there are nearly 10 times more splice sites outside of MANE exons overall, and the ones located inside exons are slightly more likely to be classified as “well-supported.” For all three datasets, around half of the splice sites inside MANE exons are well-supported. On the other hand, this statistic varies between datasets for the sites outside of such exons: 47%–48% of such sites are well-supported in CHESS 3*, 42%–46% in RefSeq*, and 25%–27% in GENCODE*.
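The rule for calling a transcript well-supported can be written down directly. The following is a minimal sketch with invented site identifiers and model calls; the data layout and function name are assumptions, not the structure of the published code.

```python
def is_well_supported_transcript(splice_sites, mane_sites, model_calls):
    """Return True if every splice site is shared with MANE or called
    well-supported by the model, False otherwise, and None for transcripts
    without introns (these are not considered).

    splice_sites: list of site identifiers used by the transcript
    mane_sites:   set of site identifiers present in MANE transcripts
    model_calls:  dict mapping site identifier -> True if well-supported
    """
    if not splice_sites:
        return None
    return all(s in mane_sites or model_calls.get(s, False) for s in splice_sites)

# Invented example: sites are (chromosome, coordinate, type) tuples.
mane_sites = {("chr7", 100, "donor"), ("chr7", 200, "acceptor")}
model_calls = {("chr7", 150, "donor"): True, ("chr7", 300, "acceptor"): False}
transcripts = {
    "tx_a": [("chr7", 100, "donor"), ("chr7", 200, "acceptor")],
    "tx_b": [("chr7", 150, "donor"), ("chr7", 200, "acceptor")],
    "tx_c": [("chr7", 100, "donor"), ("chr7", 300, "acceptor")],
    "tx_single_exon": [],
}

calls = {
    tid: is_well_supported_transcript(sites, mane_sites, model_calls)
    for tid, sites in transcripts.items()
}
considered = [c for c in calls.values() if c is not None]
fraction_well_supported = sum(considered) / len(considered)
```

A single less-supported site is enough to exclude a transcript, which is why the fraction of well-supported transcripts tends to drop as the number of introns grows.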

We further compared SNP rates in the human population for well-supported and less-supported splice sites, again using the gnomAD human variation database and focusing on sites where at least one individual had a homozygous SNP. Figure 4 shows these rates for different datasets. For protein-coding genes (Fig. 4A and B), we observed that SNP rates for the canonical dinucleotides (positions 0 and +1) were two to six times lower for the well-supported subset (as predicted by our classifier) than for its less-supported counterpart. We also note that the curves corresponding to less-supported sites are closer to those of the Random (neutrally evolving) sites, while well-supported sites in all three databases have SNP rates similar to MANE. However, for lncRNAs (Fig. 4C and D), the separation is less clear: although the less-supported sites have an SNP rate pattern close to that of the Random ones, the well-supported sites have SNP rates at the canonical dinucleotides that are only 1.5–2 times lower, and these rates are also much higher than the rates of protein-coding sites from MANE.

Figure 4.

Rate of homozygous SNPs at positions near splice sites. Each point represents a proportion of splice sites from a certain dataset that have an SNP at a position either down- or upstream of the “canonical” dinucleotides. For example, for donor splice sites 0 is usually “G,” +1 is “T” (shown under the corresponding ticks on the horizontal axis), and −1 is the first nucleotide upstream of the splice site. We only considered SNPs from the gnomAD database that have at least one homozygous sample. Panels (A) and (B) show donor and acceptor sites of protein-coding genes, while panels (C) and (D) show values for donor and acceptor sites of lncRNAs. Solid lines represent subsets classified as “well-supported” by our model, while dashed ones correspond to “less-supported” splice sites. The dotted line in panels (C) and (D) represents the rate of SNPs for splice sites from protein-coding genes of MANE for comparison.
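The quantity plotted in each panel can be sketched as follows. For simplicity, this illustration assumes all sites lie on the forward strand of a single chromosome and that SNPs are given as a set of coordinates; both are simplifying assumptions, and the coordinates are invented.

```python
def snp_rate_by_offset(site_positions, snp_positions, offsets=range(-3, 5)):
    """For each offset relative to position 0 of a splice site, return the
    proportion of sites that have an SNP at that offset.

    site_positions: coordinates of position 0 (e.g. the "G" of a donor GT)
    snp_positions:  set of coordinates with at least one homozygous SNP
    """
    rates = {}
    for offset in offsets:
        hits = sum(1 for pos in site_positions if pos + offset in snp_positions)
        rates[offset] = hits / len(site_positions)
    return rates

# Invented coordinates for four donor sites and four SNPs.
donor_sites = [100, 200, 300, 400]
homozygous_snps = {99, 199, 302, 401}
rates = snp_rate_by_offset(donor_sites, homozygous_snps)
```

For sites on the reverse strand, the sign of the offset would need to be flipped so that negative offsets still point upstream of the splice site.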

Apart from calculating the SNP rates, we compared the frequencies of homozygous SNPs overlapping the canonical dinucleotides of the splice sites classified by the model as either well-supported or less-supported. Supplementary Fig. S6 shows these frequencies for different datasets. For donor and acceptor sites from protein-coding genes, the median frequencies of homozygous SNPs in well-supported sites are two to three times lower than those in less-supported ones. In addition, their interquartile range is three to six times smaller. This is also true for donor sites of lncRNAs from GENCODE and CHESS 3, as well as for acceptor sites of lncRNAs from GENCODE and RefSeq. The frequency distributions of well-supported and less-supported sites are somewhat closer for donor sites of lncRNAs from RefSeq and for acceptor sites of lncRNAs from CHESS 3.

We also examined how many splice sites have SNPs overlapping their canonical dinucleotides associated with diseases. Table 3 shows the number and the fraction of splice sites relative to their total number in the respective dataset that have at least one SNP from the ClinVar database [29] classified as “pathogenic” or “likely pathogenic.” Well-supported splice sites of protein-coding genes have a two to three times higher ratio of potentially pathogenic SNPs than less-supported ones across all datasets. Well-supported sites of lncRNAs have a consistently low number of pathogenic SNPs, while less-supported ones from lncRNAs possess even fewer such variants. However, we note that noncoding regions are under-ascertained in clinical variant databases, as reported previously [30].

Table 3.

Number of splice sites in each conservation category having an SNP classified as “pathogenic” and “likely pathogenic” from the ClinVar database overlapping its canonical dinucleotides

| Dataset | “Well-supported” donor | “Well-supported” acceptor | “Less-supported” donor | “Less-supported” acceptor |
|---|---|---|---|---|
| Protein-coding | | | | |
| GENCODE* | 87 (0.82%) | 80 (0.80%) | 53 (0.23%) | 36 (0.21%) |
| RefSeq* | 70 (0.25%) | 62 (0.24%) | 41 (0.12%) | 35 (0.12%) |
| CHESS 3* | 71 (0.30%) | 64 (0.28%) | 42 (0.16%) | 33 (0.15%) |
| lncRNA | | | | |
| GENCODE | 16 (0.20%) | 10 (0.10%) | 7 (0.01%) | 3 (0.01%) |
| RefSeq | 9 (0.22%) | 6 (0.10%) | 4 (0.01%) | 4 (0.01%) |
| CHESS 3 | 12 (0.28%) | 10 (0.18%) | 4 (0.01%) | 4 (0.01%) |

Sites are classified as “well-supported” or “less-supported” by our model as per the “Classifying splice sites based on their conservation” section. Numbers in parentheses show the percentage of the sites relative to the total number of sites in that category.

We also examined the connection between multiple species conservation and gene expression using RNA-seq data. Our model classifies each acceptor and donor splice site as either well-supported or less-supported. Each intron, defined by a pair of a donor and an acceptor splice site, therefore belongs to one of four categories: (i) neither site is well-supported; (ii) only the donor site is well-supported; (iii) only the acceptor site is well-supported; and (iv) both sites are well-supported. We employed data from the GTEx project [31] that were assembled using StringTie2 [32] and postprocessed by TieBrush [33] to obtain the junction coverage. These data were used to generate the CHESS 3 gene catalog; further technical details on the pipeline can be found in [18]. For each intron, we calculated the number of reads supporting the particular donor–acceptor junction, which reflects the expression of isoforms using this intron. To integrate the data, we calculated the maximum coverage for each intron across all tissues. Figure 5 shows distributions of read coverage between introns of different conservation categories, for both protein-coding and lncRNA genes. As the figure shows, introns in which both sites are well-supported have a median maximum coverage that is two to three times higher than that of introns with at least one less-supported site; this holds for both protein-coding and lncRNA genes. The per-tissue coverage distributions in Supplementary Figs S7 and S8 show the same pattern.
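The intron categorization and the coverage summary described above can be sketched as follows. The input layout (one dictionary per intron with per-tissue read counts), the category labels, and the numbers are assumptions for illustration only.

```python
from statistics import median

CATEGORIES = {
    (False, False): "neither",
    (True, False): "donor only",
    (False, True): "acceptor only",
    (True, True): "both",
}

def intron_category(donor_ok, acceptor_ok):
    """Map the two per-site classifier calls to one of the four intron categories."""
    return CATEGORIES[(donor_ok, acceptor_ok)]

def median_max_coverage(introns):
    """Group introns by category and return the median, per category, of the
    maximum junction coverage observed across tissues."""
    by_category = {}
    for intron in introns:
        category = intron_category(intron["donor_ok"], intron["acceptor_ok"])
        max_cov = max(intron["coverage"].values())
        by_category.setdefault(category, []).append(max_cov)
    return {category: median(values) for category, values in by_category.items()}

# Invented junction read counts for two tissues.
introns = [
    {"donor_ok": True, "acceptor_ok": True, "coverage": {"liver": 120, "brain": 300}},
    {"donor_ok": True, "acceptor_ok": True, "coverage": {"liver": 80, "brain": 100}},
    {"donor_ok": True, "acceptor_ok": False, "coverage": {"liver": 5, "brain": 40}},
    {"donor_ok": False, "acceptor_ok": False, "coverage": {"liver": 2, "brain": 0}},
]
summary = median_max_coverage(introns)
```

Using the maximum across tissues keeps tissue-specific introns from being penalized for low expression elsewhere.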

Figure 5.

Box plots showing the maximum number of reads supporting exon junctions of certain types across all tissues from GTEx data. Panel (A) shows data for protein-coding genes (GENCODE*, RefSeq*, and CHESS 3* datasets), while panel (B) represents lncRNAs (GENCODE, RefSeq, and CHESS 3). Each box plot shows the median (solid horizontal line), the interquartile range (solid top and bottom borders of the box), and the most extreme values within 1.5 times the interquartile range of the box edges (whiskers); outliers are not shown.

In addition, we explored how many isoforms of the same gene use well-supported and less-supported splice sites. In other words, for each splice site we computed the number of isoforms that use that particular site. Supplementary Fig. S9 shows the distribution of these values in each gene catalog, for both donor and acceptor sites in protein-coding genes and lncRNAs. For protein-coding genes, well-supported splice sites are more likely to be shared between multiple isoforms. However, we did not observe a similar pattern in lncRNAs, except for donor sites from the GENCODE annotation that also showed a notable difference between well-supported and less-supported sites. We note that lncRNA genes have fewer isoforms overall, which might explain some of the disparity between protein-coding genes and lncRNAs.
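Counting how many isoforms use each splice site is a simple tally. A minimal sketch for the isoforms of one gene, with invented transcript and site identifiers:

```python
from collections import Counter

def isoforms_per_site(transcripts):
    """Return a Counter mapping each splice site to the number of isoforms using it.

    transcripts: dict mapping transcript ID -> iterable of splice site identifiers
    """
    counts = Counter()
    for sites in transcripts.values():
        counts.update(set(sites))  # count each site at most once per isoform
    return counts

# Invented isoforms of a single gene; sites are (coordinate, type) tuples.
gene_isoforms = {
    "iso1": [(100, "donor"), (200, "acceptor")],
    "iso2": [(100, "donor"), (250, "acceptor")],
    "iso3": [(120, "donor"), (200, "acceptor")],
}
usage = isoforms_per_site(gene_isoforms)
```

The resulting counts can then be split by the classifier's call for each site to compare the two distributions.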

Case study: spotting potentially suspicious isoforms

To demonstrate the utility of our model, we manually inspected several isoforms that contain less-supported splice sites to see whether they appear to be nonfunctional. In the following, we provide two examples, one for protein-coding genes and one for lncRNAs.

In particular, we looked at the heat shock protein family B (small) member 1, or the HSPB1 gene. Its MANE transcript with GENCODE ID ENST00000248553.7 contains two exons and produces a protein that is 205 amino acids long. One of the alternative transcripts from the GENCODE database, with the ID ENST00000674547.1, differs from the MANE isoform by one donor site, which was marked as “less-supported” by our model. This alternative site introduces a premature stop codon, yielding a protein that is only 143 amino acids long, about 30% shorter than the one encoded by the MANE isoform; Supplementary Fig. S10A shows these two isoforms. In addition, the transcriptomic support for the splice junction containing the alternative site is also poor: tens of reads as opposed to millions or hundreds of thousands for the MANE isoform, depending on the tissue (see Supplementary Table S6, columns 2 and 3, for the exact numbers). Given these data, we hypothesize that ENST00000674547.1 is either a technical artifact or a result of spurious transcription.

We also looked at CHASERR, a highly conserved lncRNA located near the gene encoding chromodomain helicase DNA binding protein 2, a protein associated with a neurological disease [34]. Recently, CHASERR itself was shown to be critical for viability [35] and its deletion was associated with a neurological disorder [36]. One of its isoforms in GENCODE, with ID ENST00000557682.6, contains four introns, and all splice sites of those introns are “well-supported” according to our model. Transcripts with the same intron structure are also present in the RefSeq annotation. However, a GENCODE isoform with ID ENST00000653163.1 differs from the above-mentioned transcript by the location of the first donor site, which is located much further upstream: at position 92 819 999 of chromosome 15, as opposed to 92 883 187. The alternative donor site is “less-supported” according to our model and has only two reads covering that junction in the GTEx dataset. On the other hand, the corresponding intron of ENST00000557682.6, which shares the same acceptor site but has a well-supported donor site, is covered by thousands of reads across multiple tissues (see Supplementary Table S6, columns 4 and 5). Given that the isoform with a “less-supported” donor site occupies a locus 4.7 times longer and the longer intron has poor transcriptomic support, we believe that the longer transcript ENST00000653163.1 could be nonfunctional. Supplementary Fig. S10B illustrates the structure of these two isoforms.

Discussion

In this study, we found that the canonical dinucleotides from both donor and acceptor splice sites of the consensus MANE dataset exhibit a striking pattern of conservation: nearly all of them are conserved in >350 mammalian species. In contrast, splice sites from the leading gene catalogs—GENCODE, RefSeq, and CHESS 3—that are not shared with MANE exhibit two different patterns of conservation. The first pattern resembles MANE, where the splice sites are conserved in >350 species, while the second one resembles neutrally evolving sequences, at both the micro- and macroevolutionary levels. To compare the properties of these two groups of splice sites, we trained a logistic regression model using the MANE dataset as the source of positive examples and using randomly chosen dinucleotide sites from within introns to represent (albeit imperfectly) neutrally evolving sequences. We then applied this model to the rest of the GENCODE, RefSeq, and CHESS gene catalogs excluding MANE to classify splice sites as either well-supported or less-supported.

We found that 30%–50% of splice sites from coding genes and <17% of splice sites from lncRNAs can be classified as well-supported. Splice sites classified as less-supported had SNP rates in the human population that were consistent with neutrally evolving sequences, while well-supported ones had patterns of SNP rates and frequencies similar to MANE. In addition, we observed that introns where both splice sites are well-supported have better transcriptomic support. We also found that less-supported splice sites are less likely to be shared by different isoforms of the same gene. We calculated the number of transcripts whose splice sites were either classified as well-supported by our model or shared with a transcript from MANE. For protein-coding genes, 41%–56% of transcripts belong to this category, while only 0.5%–2% of lncRNA transcripts have all their sites classified as well-supported. These transcripts can be used as a high-confidence subset of the gene catalogs we studied.

Our findings are consistent with previous studies of splice site evolution. For example, it was observed before that genes that are highly conserved have higher expression levels [37, 38] and conserved exons are more likely to be included in multiple transcripts [39]. Other studies [40, 41] found that splice sites that are not conserved in other species are more likely to carry disruptive SNPs in their motifs, and transcripts enriched in such variants have lower expression levels [42]; it was also observed that splice sites of lncRNAs are less conserved than ones of mRNAs [43].

Some previous studies also found a lack of conservation of some splice sites using whole-genome alignments [44, 45], but they assumed that these patterns arose due to alignment errors. Using the large gnomAD collection of human variation, we were able to examine SNP rates in the human population and show that splice sites that are less-supported also have higher SNP rates and frequencies. This finding suggests that the majority of splice sites lacking conservation across species are simply not under selection in the human population as opposed to being poorly aligned.

This study has several limitations. First, unlike some previous studies of evolutionary dynamics of alternative splicing [46–48], we rely purely on the conservation of DNA sequences without taking into account whether a conserved splice site in a nonhuman genome is actually functional (information that is usually not known). Unfortunately, the incomplete status of many other genome annotations prevented us from incorporating them into our analysis. Second, we realize that the training data we used for our model could introduce biases. For example, the MANE dataset was constructed by choosing one “best” isoform per protein-coding gene, and conservation was one of the criteria. This could potentially contribute to the stronger conservation signal we observed in that data. In addition, randomly chosen dinucleotide sequences from the interior of introns might not be the ideal choice for neutrally evolving sequences. We also note that MANE might not be an appropriate baseline for comparison for noncoding genes. However, the subset of experimentally verified lncRNAs is small, which makes it challenging to create a training set based on these genes. At the same time, in our study at least a subset of lncRNAs showed levels of conservation (at their splice sites) comparable to protein-coding genes. We believe that our model can still be useful for such genes; e.g. if most splice sites of an lncRNA transcript are well-supported, then the less-supported ones should be treated with caution.

We hope this study will help improve human genome annotation by demonstrating the utility of using large-scale evolutionary conservation for functional annotation of splicing. According to our analysis, highly conserved splice sites from MANE alone constitute at least 75% of all the splice sites in protein-coding isoforms in all genome annotations, and together with their well-supported counterparts from the complementary subsets account for 80%–90% of all splice sites (Table 2). This finding is in concordance with previous studies showing that MANE and APPRIS transcripts represent the most biologically and clinically relevant isoforms [49, 50].

Hence, we believe that splice site conservation should be an important factor in constructing a genome annotation. However, only a few methods currently use this information directly, either for annotation or for splice site prediction [51]. A common data structure used in RNA assembly called a splice graph was generalized to integrate sequence homology information between species [52], but it was used for finding clusters of orthologous exons and is yet to be employed for RNA-seq assembly. As higher-quality genomes along with their alignments become available, conservation-based methods have the potential to be a powerful aid in constructing functional annotations. However, despite the recent advances in the field of alignment [53–55], our analysis shows that even the most complete whole-genome alignments to date miss many alignments of human exons, and further progress in this area is needed to improve the completeness of the alignments.

We also highlighted a subset of splice sites and corresponding isoforms in the leading human annotation catalogs that appear to be under strong selection. This subset can be used as a high-confidence representation of the annotation. At the same time, less-supported splice sites might require further scrutiny since splicing could be inherently error-prone [56–60]. This hypothesis is backed up by the fact that less-supported splice sites have higher SNP rates and frequencies in the human population, consistent with randomly selected sites, which suggests that they are not under as strong negative selection. In addition, a recent study using proteomic analysis found that only one in six alternative isoforms was predicted to be functional [61]. However, pinpointing exactly which splice sites are errors would require further study incorporating extra data. Here, we have focused on the human genome because it has the highest-quality annotation, but in the future, we hope to extend our analysis to the annotations of other species.

Acknowledgements

We would like to thank Mihaela Pertea, Aleksey Zimin, Ales Varabyou, Beril Erdogdu, and Kuan-Hao Chao for useful discussions and suggestions.

Author contributions: I.M. and S.L.S. conceptualized the project. I.M. performed the investigation and formal analysis, and wrote the original draft. Both authors edited and revised all versions of the manuscript.

Contributor Information

Ilia Minkin, Department of Biomedical Engineering, Johns Hopkins University, 3400 N. Charles Street, Baltimore, MD 21218, United States; Center for Computational Biology, Johns Hopkins University, 3100 Wyman Park Drive, Baltimore, MD 21211, United States.

Steven L Salzberg, Department of Biomedical Engineering, Johns Hopkins University, 3400 N. Charles Street, Baltimore, MD 21218, United States; Center for Computational Biology, Johns Hopkins University, 3100 Wyman Park Drive, Baltimore, MD 21211, United States; Department of Computer Science, Johns Hopkins University, 3400 N. Charles Street, Baltimore, MD 21218, United States; Department of Biostatistics, Johns Hopkins University, 615 N. Wolfe Street, Baltimore, MD 21205, United States.

Supplementary data

Supplementary data is available at NAR online.

Conflict of interest

None declared.

Funding

This work was supported in part by the National Institutes of Health [R01-HG006677, R35-GM0130151]. Funding to pay the Open Access publication charges for this article was provided by National Institutes of Health.

Data availability

The data and the code of the model are available at GitHub: https://github.com/iminkin/splice-sites-conservation. This repository, including all the code and the resulting data, was archived at Zenodo and is available with the following DOI: 10.5281/zenodo.14893716.

References

  • 1. Amaral P, Carbonell-Sala S, De La Vega FM et al. The status of the human gene catalogue. Nature. 2023; 622:41–7. doi:10.1038/s41586-023-06490-x.
  • 2. Pertea M, Shumate A, Pertea G et al. CHESS: a new human gene catalog curated from thousands of large-scale RNA sequencing experiments reveals extensive transcriptional noise. Genome Biol. 2018; 19:208. doi:10.1186/s13059-018-1590-2.
  • 3. Raj A, Peskin CS, Tranchina D et al. Stochastic mRNA synthesis in mammalian cells. PLoS Biol. 2006; 4:e309. doi:10.1371/journal.pbio.0040309.
  • 4. Struhl K. Transcriptional noise and the fidelity of initiation by RNA polymerase II. Nat Struct Mol Biol. 2007; 14:103–5. doi:10.1038/nsmb0207-103.
  • 5. Cavallaro M, Walsh MD, Jones M et al. 3′–5′ crosstalk contributes to transcriptional bursting. Genome Biol. 2021; 22:56. doi:10.1186/s13059-020-02227-5.
  • 6. Gilbert W. Why genes in pieces? Nature. 1978; 271:501. doi:10.1038/271501a0.
  • 7. Doolittle WF. Genes in pieces: were they ever together? Nature. 1978; 272:581–2. doi:10.1038/272581a0.
  • 8. Lynch M, Richardson AO. The evolution of spliceosomal introns. Curr Opin Genet Dev. 2002; 12:701–10. doi:10.1016/S0959-437X(02)00360-X.
  • 9. Rogozin IB, Carmel L, Csuros M et al. Origin and evolution of spliceosomal introns. Biol Direct. 2012; 7:11. doi:10.1186/1745-6150-7-11.
  • 10. Irimia M, Roy SW. Origin of spliceosomal introns and alternative splicing. Cold Spring Harb Perspect Biol. 2014; 6:a016071. doi:10.1101/cshperspect.a016071.
  • 11. Morales J, Pujar S, Loveland JE et al. A joint NCBI and EMBL-EBI transcript set for clinical genomics and research. Nature. 2022; 604:310–15. doi:10.1038/s41586-022-04558-8.
  • 12. Rodriguez JM, Pozo F, Cerdán-Vélez D et al. APPRIS: selecting functionally important isoforms. Nucleic Acids Res. 2021; 50:D54–9. doi:10.1093/nar/gkab1058.
  • 13. Fong JH, Murphy TD, Pruitt KD. Comparison of RefSeq protein-coding regions in human and vertebrate genomes. BMC Genomics. 2013; 14:654. doi:10.1186/1471-2164-14-654.
  • 14. Raney BJ, Barber GP, Benet-Pagès A et al. The UCSC Genome Browser database: 2024 update. Nucleic Acids Res. 2024; 52:D1082–8. doi:10.1093/nar/gkad987.
  • 15. Šošić M, Šikić M. Edlib: a C/C++ library for fast, exact sequence alignment using edit distance. Bioinformatics. 2017; 33:1394–5. doi:10.1093/bioinformatics/btw753.
  • 16. Frankish A, Diekhans M, Jungreis I et al. GENCODE 2021. Nucleic Acids Res. 2021; 49:D916–23. doi:10.1093/nar/gkaa1087.
  • 17. O’Leary NA, Wright MW, Brister JR et al. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res. 2016; 44:D733–45. doi:10.1093/nar/gkv1189.
  • 18. Varabyou A, Sommer MJ, Erdogdu B et al. CHESS 3: an improved, comprehensive catalog of human genes and transcripts based on large-scale expression data, phylogenetic analysis, and protein structure. Genome Biol. 2023; 24:249. doi:10.1186/s13059-023-03088-4.
  • 19. Kent WJ, Sugnet CW, Furey TS et al. The human genome browser at UCSC. Genome Res. 2002; 12:996–1006. doi:10.1101/gr.229102.
  • 20. Blanchette M, Kent WJ, Riemer C et al. Aligning multiple genomic sequences with the threaded blockset aligner. Genome Res. 2004; 14:708–15. doi:10.1101/gr.1933104.
  • 21. Schneider VA, Graves-Lindsay T, Howe K et al. Evaluation of GRCh38 and de novo haploid genome assemblies demonstrates the enduring quality of the reference assembly. Genome Res. 2017; 27:849–64. doi:10.1101/gr.213611.116.
  • 22. Pedregosa F, Varoquaux G, Gramfort A et al. Scikit-learn: machine learning in Python. J Mach Learn Res. 2011; 12:2825–30.
  • 23. Cock PJ, Antao T, Chang JT et al. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics. 2009; 25:1422–3. doi:10.1093/bioinformatics/btp163.
  • 24. Thanaraj TA, Clark F. Human GC–AG alternative intron isoforms with weak donor sites show enhanced consensus at acceptor exon positions. Nucleic Acids Res. 2001; 29:2581–93. doi:10.1093/nar/29.12.2581.
  • 25. Hall SL, Padgett RA. Conserved sequences in a class of rare eukaryotic nuclear introns with non-consensus splice sites. J Mol Biol. 1994; 239:357–65. doi:10.1006/jmbi.1994.1377.
  • 26. Hall SL, Padgett RA. Requirement of U12 snRNA for in vivo splicing of a minor class of eukaryotic nuclear pre-mRNA introns. Science. 1996; 271:1716–8. doi:10.1126/science.271.5256.1716.
  • 27. Chen S, Francioli LC, Goodrich JK et al. A genomic mutational constraint map using variation in 76,156 human genomes. Nature. 2024; 625:92–100.
  • 28. Siepel A, Bejerano G, Pedersen JS et al. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res. 2005; 15:1034–50. doi:10.1101/gr.3715005.
  • 29. Landrum MJ, Lee JM, Riley GR et al. ClinVar: public archive of relationships among sequence variation and human phenotype. Nucleic Acids Res. 2014; 42:D980–5. doi:10.1093/nar/gkt1113.
  • 30. Ellingford JM, Ahn JW, Bagnall RD et al. Recommendations for clinical interpretation of variants found in non-coding regions of the genome. Genome Med. 2022; 14:73. doi:10.1186/s13073-022-01073-3.
  • 31. GTEx Consortium, Ardlie KG, Deluca DS et al. The Genotype-Tissue Expression (GTEx) pilot analysis: multitissue gene regulation in humans. Science. 2015; 348:648–60. doi:10.1126/science.1262110.
  • 32. Kovaka S, Zimin AV, Pertea GM et al. Transcriptome assembly from long-read RNA-seq alignments with StringTie2. Genome Biol. 2019; 20:278. doi:10.1186/s13059-019-1910-1.
  • 33. Varabyou A, Pertea G, Pockrandt C et al. TieBrush: an efficient method for aggregating and summarizing mapped reads across large datasets. Bioinformatics. 2021; 37:3650–1. doi:10.1093/bioinformatics/btab342.
  • 34. Lamar KMJ, Carvill GL. Chromatin remodeling proteins in epilepsy: lessons from CHD2-associated epilepsy. Front Mol Neurosci. 2018; 11:208. doi:10.3389/fnmol.2018.00208.
  • 35. Rom A, Melamed L, Gil N et al. Regulation of CHD2 expression by the Chaserr long noncoding RNA gene is essential for viability. Nat Commun. 2019; 10:5092. doi:10.1038/s41467-019-13075-8.
  • 36. Ganesh VS, Riquin K, Chatron N et al. Neurodevelopmental disorder caused by deletion of CHASERR, a lncRNA gene. New Engl J Med. 2024; 391:1511–8. doi:10.1056/NEJMoa2400718.
  • 37. Green P, Lipman D, Hillier L et al. Ancient conserved regions in new gene sequences and the protein databases. Science. 1993; 259:1711–6. doi:10.1126/science.8456298.
  • 38. Blencowe BJ. Alternative splicing: new insights from global analyses. Cell. 2006; 126:37–47. doi:10.1016/j.cell.2006.06.023.
  • 39. Modrek B, Lee CJ. Alternative splicing in the human, mouse and rat genomes is associated with an increased frequency of exon creation and/or loss. Nat Genet. 2003; 34:177–80. doi:10.1038/ng1159.
  • 40. Kurmangaliyev YZ, Sutormin RA, Naumenko SA et al. Functional implications of splicing polymorphisms in the human genome. Hum Mol Genet. 2013; 22:3449–59. doi:10.1093/hmg/ddt200.
  • 41. Shimada MK, Hayakawa Y, Takeda JI et al. A comprehensive survey of human polymorphisms at conserved splice dinucleotides and its evolutionary relationship with alternative splicing. BMC Evol Biol. 2010; 10:122. doi:10.1186/1471-2148-10-122.
  • 42. Denisov SV, Bazykin GA, Sutormin R et al. Weak negative and positive selection and the drift load at splice sites. Genome Biol Evol. 2014; 6:1437–47. doi:10.1093/gbe/evu100.
  • 43. Washietl S, Kellis M, Garber M. Evolutionary dynamics and tissue specificity of human long noncoding RNAs in six mammals. Genome Res. 2014; 24:616–28. doi:10.1101/gr.165035.113.
  • 44. Sharma V, Elghafari A, Hiller M. Coding exon-structure aware realigner (CESAR) utilizes genome alignments for accurate comparative gene annotation. Nucleic Acids Res. 2016; 44:e103. doi:10.1093/nar/gkw210.
  • 45. Sharma V, Schwede P, Hiller M. CESAR 2.0 substantially improves speed and accuracy of comparative gene annotation. Bioinformatics. 2017; 33:3985–87. doi:10.1093/bioinformatics/btx527.
  • 46. Barbosa-Morais NL, Irimia M, Pan Q et al. The evolutionary landscape of alternative splicing in vertebrate species. Science. 2012; 338:1587–93. doi:10.1126/science.1230612.
  • 47. Merkin J, Russell C, Chen P et al. Evolutionary dynamics of gene and isoform regulation in mammalian tissues. Science. 2012; 338:1593–99. doi:10.1126/science.1228186.
  • 48. Franz A, Weber AI, Preußner M et al. Branch point strength controls species-specific CAMK2B alternative splicing and regulates LTP. Life Sci Alliance. 2023; 6:e202201826. doi:10.26508/lsa.202201826.
  • 49. Pozo F, Rodriguez JM, Martínez Gómez L et al. APPRIS principal isoforms and MANE select transcripts define reference splice variants. Bioinformatics. 2022; 38:ii89–94. doi:10.1093/bioinformatics/btac473.
  • 50. Pozo F, Rodriguez JM, Vázquez J et al. Clinical variant interpretation and biologically relevant reference transcripts. NPJ Genom Med. 2022; 7:59. doi:10.1038/s41525-022-00329-6.
  • 51. Rose D, Hiller M, Schutt K et al. Computational discovery of human coding and non-coding transcripts with conserved splice sites. Bioinformatics. 2011; 27:1894–900. doi:10.1093/bioinformatics/btr314.
  • 52. Zea DJ, Laskina S, Baudin A et al. Assessing conservation of alternative splicing with evolutionary splicing graphs. Genome Res. 2021; 31:1462–73. doi:10.1101/gr.274696.120.
  • 53. Minkin  I, Medvedev  P  Scalable multiple whole-genome alignment and locally collinear block construction with SibeliaZ. Nat Commun. 2020; 11:6327. 10.1038/s41467-020-19777-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 54. Armstrong  J, Hickey  G, Diekhans  M  et al.  Progressive Cactus is a multiple-genome aligner for the thousand-genome era. Nature. 2020; 587:246–51. 10.1038/s41586-020-2871-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 55. Kille  B, Balaji  A, Sedlazeck  FJ  et al.  Multiple genome alignment in the telomere-to-telomere assembly era. Genome Biol. 2022; 23:182. 10.1186/s13059-022-02735-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 56. Hurst  LD  Evolutionary genomics and the reach of selection. J Biol. 2009; 8:12. 10.1186/jbiol113. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 57. Hsu  SN, Hertel  KJ  Spliceosomes walk the line: splicing errors and their impact on cellular function. RNA Biol. 2009; 6:526–30. 10.4161/rna.6.5.9860. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 58. Pickrell  JK, Pai  AA, Gilad  Y  et al.  Noisy splicing drives mRNA isoform diversity in human cells. PLoS Genet. 2010; 6:e1001236. 10.1371/journal.pgen.1001236. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 59. Gout  JF, Thomas  WK, Smith  Z  et al.  Large-scale detection of in vivo transcription errors. Proc Natl Acad Sci USA. 2013; 110:18584–9. 10.1073/pnas.1309843110. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 60. Bénitière  F, Necsulea  A, Duret  L  Random genetic drift sets an upper limit on mRNA splicing accuracy in metazoans. eLife. 2024; 13:RP93629. 10.7554/eLife.93629. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 61. Pozo  F, Martinez-Gomez  L, Walsh  TA  et al.  Assessing the functional relevance of splice isoforms. NAR Genom Bioinform. 2021; 3:lqab044. 10.1093/nargab/lqab044. [DOI] [PMC free article] [PubMed] [Google Scholar]


Supplementary Materials

gkaf184_Supplemental_File

Data Availability Statement

The data and the code of the model are available on GitHub at https://github.com/iminkin/splice-sites-conservation. This repository, including all the code and the resulting data, is archived at Zenodo under the DOI 10.5281/zenodo.14893716.


Articles from Nucleic Acids Research are provided here courtesy of Oxford University Press
