Abstract
RNA sequencing (RNA-seq) is widely adopted for transcriptome analysis but has inherent biases that hinder the comprehensive detection and quantification of alternative splicing. To address this, we present an efficient targeted RNA-seq method that greatly enriches for splicing-informative junction-spanning reads. Local splicing variation sequencing (LSV-seq) utilizes multiplexed reverse transcription from highly scalable pools of primers anchored near splicing events of interest. Primers are designed using Optimal Prime, a novel machine learning algorithm trained on the performance of thousands of primer sequences. In experimental benchmarks, LSV-seq achieves high on-target capture rates and concordance with RNA-seq, while requiring significantly lower sequencing depth. Leveraging deep learning splicing code predictions, we used LSV-seq to target events with low coverage in GTEx RNA-seq data and newly discover hundreds of tissue-specific splicing events. Our results demonstrate the ability of LSV-seq to quantify splicing of events of interest at high throughput and with exceptional sensitivity.
Graphical Abstract
Introduction
Alternative splicing (AS) varies the exonic and intronic segments included in mature messenger RNA (mRNA) to substantially increase the diversity of transcript isoforms. Patterns of AS are highly tissue-specific and frequently altered in disease, as revealed by large-scale studies such as the Genotype-Tissue Expression (GTEx) project (1) and The Cancer Genome Atlas (2). Mapping the complex landscape of AS has been a major ongoing challenge, driven by key advances in technology.
Short-read RNA sequencing (RNA-seq) is currently the most popular approach for transcriptome-wide profiling of AS changes across tissues or conditions. Detection of AS relies on sequencing reads spanning splice junctions, as these provide direct evidence of a particular splicing outcome. However, the majority of RNA-seq reads do not span splice junctions and are thus less informative for AS analysis. To compensate for this bias, robust splicing quantification using RNA-seq typically requires deep sequencing at higher cost compared with gene expression analysis in order to obtain sufficient coverage of splice junctions. Even at near-saturating coverage, RNA-seq still fails to recover certain splicing events that are impeded by issues such as low abundance or secondary structure (3–5).
Targeted RNA-seq methods provide an efficient way to focus on specific transcripts or RNA regions that cannot be easily studied with standard RNA-seq. A variety of such approaches has been developed, each with different strategies for enriching RNAs of interest. For example, CaptureSeq and TEQUILA-Seq use oligonucleotide probes tiling exonic regions to capture targeted RNAs for sequencing (6,7). RASL-Seq and TempO-seq use pairs of detector oligos that anneal adjacent to each other on target RNAs and when ligated together, serve as readouts of target RNA abundance (8,9). Multiplexed polymerase chain reaction (PCR) is widely used to selectively amplify a set of target complementary DNAs (cDNAs) for either short- or long-read sequencing (10). As a promising alternative strategy, multiplexed primer extension sequencing (MPE-seq) is a targeted RNA-seq method for detection of splicing in yeast (11). MPE-seq performs targeting at the reverse transcription (RT) step, using pools of primers annealing downstream of splice junctions of interest. However, because the complexity of splicing in higher eukaryotes is orders of magnitude greater than in yeast, the use of multiplexed RT for studies of human splicing is significantly more challenging.
Here, we describe local splicing variation sequencing (LSV-seq), a targeted sequencing method we developed to address the limitations of RNA-seq and better capture AS events of interest. Building on prior targeted methods, we designed LSV-seq to offer a unique set of advantages. LSV-seq minimizes the number of required primers per targeted splicing event, enables the discovery and quantification of rare junctions and precisely discriminates quantitative AS differences in challenging targets. As in MPE-seq, LSV-seq uses customized primer pools to perform highly multiplexed RT adjacent to exon junctions, thereby directly enriching for junction-spanning reads. However, LSV-seq involves several key advances to convert the original multiplexed primer targeting schema into a robust, generalizable methodology. Primers designed with available tools performed inadequately, which led us to create Optimal Prime (OP), a novel machine learning-based primer design algorithm vastly increasing the per-primer targeting efficiency. We also established a webtool based on OP for other researchers to design LSV-seq primers. Separately, we created a novel, optimized library preparation protocol and updated our MAJIQ splicing algorithm (12) for use with LSV-seq sequencing data. To showcase its final capabilities, we benchmarked LSV-seq directly against conventional RNA-seq and leveraged deep learning splicing code predictions to reassess splicing events with low coverage in human GTEx RNA-seq data. Importantly, LSV-seq recovered hundreds of previously unquantified tissue-specific AS variations that were missed due to poor coverage in RNA-seq. We demonstrate that LSV-seq offers an accurate, sensitive and cost-effective method for the study of AS with the ability to target thousands of AS events.
Materials and methods
Cell culture
The clonal Jurkat T-cell line (JSL1) was cultured and stimulated as previously described (13). Cells were maintained in RPMI 1640 media supplemented with 5% fetal bovine serum, 100 U/ml penicillin and 100 µg/ml streptomycin. For stimulation, cells were seeded at 4 × 10⁵ cells/ml and treated with 20 ng/ml of the phorbol ester phorbol myristate acetate (PMA) (#524400, MilliporeSigma). A separate culture of unstimulated cells was seeded at 2.5 × 10⁵ cells/ml. After 48 h, successful stimulation was confirmed by staining cells for the activation marker CD69 (#310905, BioLegend) followed by flow cytometric analysis (data not shown). Cells were then collected and flash-frozen on liquid nitrogen in aliquots of 5 × 10⁶ cells. To generate three total biological replicates, PMA stimulation was done on independent days.
RNA isolation and processing
Total RNA was purified from cell pellets with the Maxwell RSC 48 instrument using the Maxwell RSC simplyRNA Cells Kit (#AS1340, Promega). Poly-A selection was subsequently performed using the NEBNext Poly(A) mRNA Magnetic Isolation Module [#E7490L, New England Biolabs (NEB)]. LSV-seq was performed as described below. RNA-seq library preparation and paired-end 150-bp sequencing was performed by Novogene. Briefly, mRNA was purified from total RNA using poly-T oligo-attached magnetic beads. After fragmentation, first strand cDNA was synthesized using random hexamer primers. Then second strand cDNA was synthesized using dUTP. The directional library was ready after end repair, A-tailing, adapter ligation, size selection, USER enzyme digestion, PCR amplification and purification. After quality control, libraries were pooled and sequenced on the Illumina platform to a minimum depth of 100 million reads per library. Total RNA from human adult normal tissues (heart right atrium, liver and brain cerebellum) was obtained from a commercial vendor (BioChain Institute, Inc.). For each tissue, three technical replicates of LSV-seq were performed as described below.
Preparation of MPE-seq libraries
MPE-seq libraries were prepared as previously described with minimal modification (11,14). The 50- and 381-primer pools used for MPE-seq were designed while attempting to minimize predicted off-target alignments and basic considerations such as GC content and melting temperature, without the use of the OP primer selection pipeline. On- and off-target read percentages were calculated as described below for analysis of LSV-seq data.
Preparation of LSV-seq libraries
A detailed step-by-step protocol is also provided in the supplementary material.
First-strand synthesis reaction
Primer pools were designed according to our primer selection pipeline and ordered as a single oligo pool of hundreds to thousands of primers (oPools, Integrated DNA Technologies); see Supplementary Table S5 for primer sequences. RNA was mixed with 2.5 µl of the primer pool (diluted to 400 pM/oligo) in a total volume of 20 µl, also containing a final concentration of 2 mM of each dNTP (#N0447S, NEB) and 1× Maxima buffer (#EP0752, Thermo Fisher Scientific). To denature the RNA and allow the primers to hybridize to the target RNA with high specificity, the reaction was incubated on a thermal cycler in a touchdown cycle from 85°C to 60°C, decreasing 1°C per min for 25 min total, followed by a hold at 60°C. While maintaining the primer–RNA mixture at 60°C, 20 µl of RT mixture was directly added, consisting of 1 µl of 200 U/µl Maxima H Minus Reverse Transcriptase (#EP0752, Thermo Fisher Scientific), 1 µl of 5 U/µl ThermaStop-RT reagent (Ecologenix, LLC), 1 µl of 100 mM RNaseOUT Recombinant Ribonuclease Inhibitor (#10777019, Invitrogen) and 1× Maxima buffer. The RT reaction was incubated at 60°C for 80 min, followed by heat inactivation at 85°C for 5 min. The reaction was then cleaned up with 40 µl of RNAClean XP beads (#A63987, Beckman Coulter) and resuspended in 40 µl of water.
Second-strand synthesis reaction
To the 40 µl purified first-strand reaction, we added 40 µl of second-strand synthesis reaction mixture containing a final concentration of 0.2 mM of each dNTP, 0.55 µl of 10 U/µl E. coli DNA Ligase (#M0205L, NEB), 2.08 µl of 10 U/µl E. coli DNA Polymerase I (#18010025, Invitrogen), 0.55 µl of 5 U/µl E. coli RNase H (#M0297L, NEB) and 1× NEBNext Second Strand Synthesis Reaction Buffer (#B6117S, NEB). The reaction was incubated for 2 h at 16°C, cleaned up with 80 µl of RNAClean XP beads and resuspended in 10 µl of water.
In vitro transcription reaction
To the 10 µl purified second-strand reaction, we added 10 µl of in vitro transcription (IVT) mixture using the HiScribe T7 High Yield RNA Synthesis Kit (#E2040S, NEB), containing 1.5 µl of each NTP, 1.5 µl of T7 RNA Polymerase Mix, 1.5 µl of 10× T7 Reaction Buffer and 1 µl of RNaseOUT. The reaction was incubated at 37°C overnight (13–16 h total).
Fragmentation reaction
After the overnight incubation, 4 µl of ExoSAP-IT reagent (#78201.1.ML, Applied Biosystems) was added and the mixture was incubated at 37°C for 15 min. Next, 2.67 µl of 10× RNA Fragmentation Buffer from the NEBNext Magnesium RNA Fragmentation Module (#E6150S, NEB) was added, resulting in a total volume of 26.67 µl. The reaction was incubated in a preheated thermal cycler at 94°C for 2 min 50 s, followed by immediate transfer to ice and addition of 2.67 µl of 10× RNA Fragmentation Stop Solution. The reaction was then cleaned up with 41.1 µl of RNAClean XP beads (1.4× ratio), and resuspended in 5.5 µl of water.
Second RT reaction
To the 5.5 µl purified fragmentation reaction, we added 0.5 µl of 10 mM dNTP mix and 0.5 µl of 50 µM second RT primer for a total reaction volume of 6.5 µl. The reaction was incubated at 65°C for 5 min, then placed on ice for at least 1 min. Next, we added 4 µl of RT reaction mix consisting of 2 µl of 5× first-strand buffer, 1 µl of 100 mM DTT, 0.5 µl of RNaseOUT and 0.5 µl of 200 U/µl SuperScript II Reverse Transcriptase (#18064014, Invitrogen). The reaction was then incubated at 25°C for 10 min and 42°C for 1 h.
Final PCR amplification
Using 5 µl of the second RT reaction, we prepared a 50 µl total volume PCR reaction containing 0.5 µM each of Nextera i5 and i7 indexed adapter primers, 1× Terra PCR Direct Buffer and 1 µl of 1.25 U/µl Terra PCR Direct Polymerase Mix (#639270, Takara Bio). To clean up the reaction, 45 µl of RNAClean XP beads (0.9× ratio) was added, and after washes, libraries were eluted with 50 µl of water. A second round of bead purification was performed with the same bead ratio and the final purified library was eluted with 50 µl of water.
Library quality control and sequencing
Final LSV-seq libraries were assessed using the Agilent High Sensitivity DNA Kit (#5067-4626, Agilent) and quantified with the NEBNext Library Quant Kit for Illumina (#E7630L, NEB). Libraries were then pooled and sequenced in 150-cycle single-end format on an Illumina NextSeq 550.
Identification of targetable splicing events in Jurkat T-cell dataset
Selection of targeted splicing events
We first performed MAJIQ analysis on our previously published RNA-seq data in unstimulated and stimulated Jurkat T-cells (15), resulting in the initial discovery set of over 48 000 putative target local splicing variations (LSVs). We then defined different categories of LSVs by running the MAJIQ deltaPSI algorithm using a combination of MAJIQ-defined filters on measured change between groups and confidence level. This resulted in five total categories of targetable LSVs, namely (i) high-confidence large-change (confidence > 0.95 and change > 0.2), (ii) high-confidence small-change (confidence > 0.95 and change < 0.02), (iii) low-confidence large-change (confidence < 0.7 and change > 0.2), (iv) low-confidence small-change (confidence < 0.7 and change < 0.02) and (v) all other LSVs not included in these categories. We selected roughly equivalent numbers (∼300–500 per group) of LSVs from each of these five categories. This formed the set of LSVs targeted in the pool of 2002 primers we designed prior to development of the OP models (designated as KY007 in Supplementary Table S1).
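The five-way partition described above can be sketched as a simple threshold rule. This is only an illustration: the function name and its two inputs are hypothetical stand-ins for the MAJIQ deltaPSI confidence level and measured change, and taking the absolute value of the change is an assumption not stated in the text.

```python
def categorize_lsv(confidence, change):
    """Assign an LSV to one of the five categories using the stated thresholds.

    `confidence` and `change` are hypothetical stand-ins for the MAJIQ deltaPSI
    confidence level and the measured change between groups. Using the absolute
    value of the change is an assumption.
    """
    magnitude = abs(change)
    high_conf = confidence > 0.95
    low_conf = confidence < 0.7
    large = magnitude > 0.2
    small = magnitude < 0.02

    if high_conf and large:
        return "high-confidence large-change"
    if high_conf and small:
        return "high-confidence small-change"
    if low_conf and large:
        return "low-confidence large-change"
    if low_conf and small:
        return "low-confidence small-change"
    # Everything between the thresholds falls into the fifth category.
    return "other"
```

An LSV with intermediate confidence (for example 0.8) or an intermediate change (for example 0.1) lands in the fifth category, which is why that group is needed to cover the full discovery set.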
For the key benchmarking experiments, to create the list of loci targeted in the corresponding primer pool (designated as KY008 in Supplementary Table S1), we first randomly selected a set of 948 target LSVs from the Jurkat T-cell-directed primer pool described above. We chose an additional 953 primers targeting randomly selected loci within the larger discovery set, for a total of 1901 primers. For these experiments, we avoided targeting splicing events within the same gene more than once, so that we could sample a more diverse array of splicing events.
Prioritization of targeted splicing events in GTEx dataset
Quantifying the proportion of targetable splice junctions
In our exploratory analysis to estimate the proportion of all splice junctions we could target with LSV-seq, we ran the MAJIQ HET algorithm with parameter {--min-experiments 0.1} on GTEx samples across all three tissues, with an additional parameter allowing us to output either only target LSVs {--target-only} or source LSVs {--source-only}. The superset of all splice junctions across all three tissues was then taken as the union of splice junctions within the target-LSV-only and source-LSV-only MAJIQ builds. Targetable splice junctions were defined as splice junctions that were contained within the target-LSV-only build.
Defining low-coverage, non-changing splicing events
To identify low-coverage splicing events that we could enrich with LSV-seq, we first ran the MAJIQ HET algorithm with parameters: {--min-experiments 0.1} on GTEx samples across all three tissues, with the read coverage output enabled. This allowed us to identify target LSVs that were changing (at least one junction confidently changing) and non-changing (no junctions confidently changing) across tissues, as well as to segregate high-coverage (> 50 mean reads) and low-coverage (< 25 mean reads) LSVs. The VOILA modulize command was also run, which allowed us to extract the subset of LSVs located within cassette exons for the splicing code pipeline.
Prioritizing splicing events with CLIP-seq data
Based on CLIP-seq data in the K562 and HepG2 cell lines from the Encyclopedia of DNA Elements (ENCODE) (38,39), we first identified putative ‘tissue-specific’ RNA-binding proteins (RBPs). We filtered for the subset of available RBPs with at least five transcripts per million (TPM) mean expression in at least one of the three GTEx tissues of interest (brain cerebellum, heart atrial appendage and liver) and exhibiting at least a 2-fold change in expression in at least one pairwise tissue comparison. We then identified the top 50 RBPs ranked by their maximal log2 pairwise expression change across tissues, resulting in an initial shortlist of tissue-specific RBPs.
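As an illustration of the expression filter and ranking just described, the following minimal sketch takes a table of mean TPM values per RBP and tissue. The function name, the input layout and the small pseudocount (added to avoid division by zero) are assumptions made for the example and are not taken from the original pipeline.

```python
import math
from itertools import combinations


def shortlist_rbps(tpm_by_rbp, min_tpm=5.0, min_fold=2.0, top_n=50, pseudocount=1e-3):
    """Return RBPs passing the expression filters, ranked by maximal log2 change.

    tpm_by_rbp: {rbp_name: {tissue_name: mean TPM}} with at least two tissues.
    The pseudocount is an assumption; it slightly shrinks fold changes.
    """
    ranked = []
    for rbp, tpm in tpm_by_rbp.items():
        # Require at least `min_tpm` mean expression in at least one tissue.
        if max(tpm.values()) < min_tpm:
            continue
        # Maximal absolute log2 fold change over all pairwise tissue comparisons.
        max_log2 = max(
            abs(math.log2((tpm[a] + pseudocount) / (tpm[b] + pseudocount)))
            for a, b in combinations(tpm, 2)
        )
        if max_log2 < math.log2(min_fold):
            continue
        ranked.append((max_log2, rbp))
    # Largest change first; ties broken alphabetically for reproducibility.
    ranked.sort(key=lambda item: (-item[0], item[1]))
    return [rbp for _, rbp in ranked[:top_n]]


# Hypothetical expression values for illustration only.
example = {
    "RBP_A": {"cerebellum": 40.0, "heart": 5.0, "liver": 6.0},
    "RBP_B": {"cerebellum": 3.0, "heart": 1.0, "liver": 0.5},
    "RBP_C": {"cerebellum": 10.0, "heart": 11.0, "liver": 12.0},
    "RBP_D": {"cerebellum": 8.0, "heart": 2.0, "liver": 7.0},
}
shortlist = shortlist_rbps(example)
```

In this toy input, RBP_B fails the expression floor and RBP_C fails the fold-change requirement, so only RBP_A and RBP_D remain.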
Next, we discovered which of these top tissue-specific RBPs specifically regulate tissue-specific target LSVs across our three tissues of interest. For every high-coverage LSV we identified from our MAJIQ HET analysis, we defined cis regulatory regions that might be bound by RBPs. This was defined as a combination of regions including every detected junction from 300 bp downstream to 50 bp upstream, and the 5′ boundary of the target exon itself from 50 bp downstream to 300 bp upstream. We intersected the called CLIP-seq peaks for each tissue-specific RBP with the cis sequence regions of tissue-specific LSVs. Tissue-specific RBPs that putatively regulated tissue-specific splicing were defined as RBPs that bound changing LSVs more frequently than non-changing LSVs, as evaluated by binomial tests for statistical significance.
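One way to frame the binomial test mentioned above is sketched below. The exact null hypothesis is not specified in the text, so this example assumes a one-sided test in which the fraction of non-changing LSVs bound by an RBP serves as the null binding probability; all function names are hypothetical.

```python
from math import comb


def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), computed exactly."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def rbp_enrichment_pvalue(bound_changing, n_changing, bound_nonchanging, n_nonchanging):
    """One-sided p-value that an RBP binds changing LSVs more often than expected.

    Assumption: the binding rate among non-changing LSVs is used as the null
    probability for the changing LSVs.
    """
    p_null = bound_nonchanging / n_nonchanging
    return binomial_upper_tail(bound_changing, n_changing, p_null)
```

For example, an RBP bound to 30 of 100 changing LSVs against a background of 100 of 1000 non-changing LSVs yields a very small p-value, whereas 10 of 100 against the same background does not.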
Having defined tissue-specific splicing regulatory RBPs based on the above procedure, we were then able to define a candidate set of events to interrogate with LSV-seq. Specifically, this set consisted of the low-coverage LSVs for which we could not call confident changes. The final candidate set was further required to overlap with CLIP-seq peaks for at least 2, 4 or 6 such RBPs, depending on the desired stringency of our predictions.
Prioritizing splicing events with splicing code model
We prioritized splicing events using our multi-transformer based splicing code model (TrASPr) (16). In brief, given an input cassette splicing event sequence and two different tissue labels, TrASPr outputs the predicted change in PSI across the tissues. We used our model to generate predictions for changes in mean PSI values for cassette exons that were detected in the MAJIQ HET analysis but lacked sufficient coverage, as described above. By tuning the threshold of predicted change between any pair of tissues to either 0.1, 0.15 or 0.2 PSI, we were able to vary the stringency of this pipeline.
LSV-seq primer design
Obtaining candidate primer sequences from RNA-seq
To retrieve regions adjacent to known target LSVs (3′ splice sites), we ran the MAJIQ build and heterogen commands on the desired set of RNA-seq bam files, followed by the VOILA modulize command with the parameters {--keep-constitutive --decomplexify-psi-threshold 0 --show-all --output-mpe}. From the output file (‘mpe_primerable_regions.tsv’), we used the columns for ‘Reference Exon Constant Region’, ‘Constitutive Regions’, ‘LSV ID’, ‘strand’ and ‘chromosome’ to create a BED file matching the LSV names to their genomic coordinates to be extracted. Prior to the next step, we obtained a list of repeats found in the hg38 genome (Repeatmasker, http://www.repeatmasker.org) and reformatted it as a BED file. We then used the bedtools (17) subtract command to remove regions corresponding to the RepeatMasker defined repeat regions.
Generation of candidate primers and feature extraction
Candidate primers were generated by running the primerGen script with default flags, which outputs primers within set ranges of GC content (10–90%), melting temperature calculated based on RNA–DNA hybridization (18) (50–85°C) and length (15–40 bases), and lacking prohibited sequences (AAAAA, TTTTT, CCCCC and GGGGG). To obtain the features specific to the full model, BLAST (19) was run against the human transcriptome with parameters {-gapopen 2 -gapextend 2 -reward 1 -penalty 2} and the NUPACK nucleic acid modeling package (20) was run assuming RT reaction chemical and temperature conditions. For both the lite and full models, the featureExtract script was run to extract sequence-based features and append the alignment-based features from the previous step and experiment-specific features from externally provided files. The sequence-based features include encodings of positional nucleotide motifs: one-hot encodings of positional mononucleotides (e.g. presence of G at position 1 or A at position 2), one-hot encodings of positional dinucleotides (e.g. presence of TA at position 3 or presence of GG at position 5) and cumulative proportion of mononucleotides up to the ith position (e.g. proportion of As within the first five positions). The resulting data matrix, consisting of rows of candidate primers and columns of features, was saved as a Python pickle file.
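The sequence-level filters and positional features described above can be illustrated with a short sketch. It is not the primerGen or featureExtract code: the function names, the feature naming scheme and the choice to count positions from the start of the sequence string are assumptions, and the melting temperature filter is omitted because it depends on nearest-neighbor parameters not reproduced here.

```python
PROHIBITED = ("AAAAA", "TTTTT", "CCCCC", "GGGGG")


def passes_sequence_filters(primer, gc_range=(0.10, 0.90), length_range=(15, 40)):
    """Apply the length, GC content and homopolymer filters (Tm filter omitted)."""
    primer = primer.upper()
    if not length_range[0] <= len(primer) <= length_range[1]:
        return False
    gc = (primer.count("G") + primer.count("C")) / len(primer)
    if not gc_range[0] <= gc <= gc_range[1]:
        return False
    return not any(run in primer for run in PROHIBITED)


def sequence_features(primer, n_positions=5):
    """Positional mono-/dinucleotide one-hot encodings and cumulative proportions.

    Positions are 1-based and counted from the start of the string (assumption).
    Assumes len(primer) >= n_positions, which holds for primers of 15-40 bases.
    """
    primer = primer.upper()
    features = {}
    for i in range(n_positions):
        prefix = primer[: i + 1]
        for base in "ACGT":
            features[f"mono_{base}_pos{i + 1}"] = int(primer[i] == base)
            features[f"cumprop_{base}_upto{i + 1}"] = prefix.count(base) / len(prefix)
    for i in range(n_positions - 1):
        for first in "ACGT":
            for second in "ACGT":
                name = f"di_{first}{second}_pos{i + 1}"
                features[name] = int(primer[i : i + 2] == first + second)
    return features


# Hypothetical primer sequence for illustration only.
features = sequence_features("GATTACAGATTACAGA")
```

Each candidate primer then contributes one row of such features to the data matrix, alongside the alignment-based and experiment-specific columns.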
Prediction of primer performance
The modelPredict script was used to predict the performance of primers based on the saved data matrix of features. For the specified model version, either lite or full, predictions are output from the saved yield/amplification model, denoted as ‘a’, and the specificity model, denoted as ‘s’. These predictions are combined into a single score using the following formula: (λ^a)*s, i.e. λ raised to the power a, multiplied by s. The λ parameter controls the tradeoff between amplification and specificity, where a higher λ increases the weight of the amplification prediction relative to the specificity prediction. In practice, λ was fixed at 1.2 in all the work presented here, as this value appeared to give a good tradeoff in the empirical results. With λ set, the primer with the maximal combined prediction score for each targeted region was output as the best predicted primer.
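The selection step can be sketched as follows. This example reads the combined score as λ raised to the power of the amplification prediction, multiplied by the specificity prediction; that reading of the formula is an assumption, as are the function names and the tuple layout of the predictions.

```python
def combined_score(a, s, lam=1.2):
    """Combined score, read here as (lam ** a) * s (an assumed interpretation)."""
    return (lam**a) * s


def best_primer_per_region(predictions, lam=1.2):
    """Pick the primer with the maximal combined score for each targeted region.

    predictions: iterable of (region, primer, a, s) tuples, where `a` is the
    predicted amplification and `s` the predicted specificity.
    """
    best = {}
    for region, primer, a, s in predictions:
        score = combined_score(a, s, lam)
        if region not in best or score > best[region][1]:
            best[region] = (primer, score)
    return {region: primer for region, (primer, _) in best.items()}


# Hypothetical predictions for illustration only.
predictions = [
    ("region1", "primer1", -2.0, 0.90),
    ("region1", "primer2", -1.0, 0.85),
    ("region2", "primer3", -3.0, 0.50),
]
chosen = best_primer_per_region(predictions, lam=1.2)
```

Under this reading, λ = 1 reduces the score to the specificity prediction alone, and values above 1 progressively favor primers with higher predicted amplification, as in the toy example where primer2 overtakes primer1.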
Analysis of RNA-seq and LSV-seq data
Visualization and quantification of splicing events
See Supplementary Tables S3 and S4 for PSI and deltaPSI quantifications. For all analyzed LSV-seq and RNA-seq datasets, reads were first trimmed with BBDuk of BBMap (sourceforge.net/projects/bbmap/) using the parameters {ref=adapters ktrim=r k=23 mink=11 hdist=1 tpe tbo qtrim=r trimq=15 qin=auto minlength=30}.
For RNA-seq analysis, we aligned reads to the hg38 genome using STAR (21) with the parameters {--alignSJoverhangMin 8 --alignEndsType Local --outFilterMultimapNmax 1}. We then ran MAJIQ build across all samples in a given analysis with the parameters {--minreads 2 --min-denovo 2 --minpos 1 --target-lsvs}. For the Jurkat T-cell libraries, we ran MAJIQ deltaPSI or PSI with the parameters {--minpos 1 --minreads 2}, followed by visualization in the VOILA web browser viewer. For calculation of differential gene expression, summed TPMs for each gene were derived from transcript TPMs based on Kallisto (22) quantifications and the mean was taken across all samples per condition. Subsequently, the log fold change between the mean gene TPMs per condition was calculated.
For LSV-seq analysis, the umi-tools extract command (23) was used to first extract the 10-nucleotide unique molecular identifier (UMI) from each read using the parameters {--bc-pattern=NNNNNNNNNN}, followed by STAR alignment with the same parameters described for RNA-seq. Then, umicollapse (24) with parameter {--two-pass} was used to mark UMI read duplicate groups. Because the initial 5′ end primer sequence is also expected to be shared between identical reads, we appended the first 13 nucleotides beyond the UMI to the marked UMI sequences. Final deduplicated reads were selected using custom scripts that first filtered for reads that were at least 30 nucleotides in length, then selected one of the maximal length reads at random for each marked UMI group (consisting of both the marked UMI group itself and the downstream primer barcode). This resulted in the deduplicated BAM files we used in splicing visualization and downstream quantification.
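The grouping logic behind the deduplication can be illustrated with a simplified sketch. Unlike the actual pipeline, which marks duplicate groups with umicollapse after alignment, this example groups raw reads by exact match of the UMI plus the next 13 nucleotides and ignores mapping position; the function names and the point at which the 30-nucleotide length filter is applied are assumptions.

```python
import random
from collections import defaultdict

UMI_LEN = 10          # 10-nucleotide UMI at the 5' end of each read
PRIMER_TAG_LEN = 13   # first 13 nucleotides beyond the UMI
MIN_READ_LEN = 30     # minimum read length retained


def deduplicate(reads, rng=None):
    """Keep one maximal-length read per (UMI + primer tag) group.

    reads: iterable of (read_id, raw_sequence) with the UMI at the 5' end.
    Simplification: exact-match grouping on the raw sequence only.
    """
    rng = rng or random.Random(0)
    groups = defaultdict(list)
    for read_id, raw in reads:
        umi, insert = raw[:UMI_LEN], raw[UMI_LEN:]
        if len(insert) < MIN_READ_LEN:
            continue
        groups[umi + insert[:PRIMER_TAG_LEN]].append((read_id, insert))

    kept = []
    for members in groups.values():
        longest = max(len(seq) for _, seq in members)
        candidates = [m for m in members if len(m[1]) == longest]
        # One of the maximal-length reads is chosen at random.
        kept.append(rng.choice(candidates))
    return kept


# Hypothetical reads for illustration only.
reads = [
    ("r1", "A" * 10 + "C" * 13 + "G" * 20),
    ("r2", "A" * 10 + "C" * 13 + "G" * 30),
    ("r3", "T" * 10 + "C" * 13 + "G" * 20),
    ("r4", "A" * 10 + "C" * 13 + "G" * 5),
]
kept_ids = sorted(read_id for read_id, _ in deduplicate(reads))
```

Here r1 and r2 share a UMI and primer tag, so only the longer r2 is kept; r3 carries a different UMI and r4 is discarded as too short.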
To visualize splicing events, we used an adaptation of the MAJIQ deltaPSI algorithm (12), running MAJIQ build, and MAJIQ PSI or deltaPSI with the same parameters described for RNA-seq. We visualized the resulting splicing events in the VOILA web browser viewer. For RNA-seq, the classical MAJIQ algorithm integrates a Bayesian approach in order to remove potential read stacks that might correspond to PCR duplicates. However, because UMI deduplication is performed upstream of MAJIQ in LSV-seq, and because the expected read distribution for LSV-seq greatly differs from that expected for RNA-seq, we used a distinct version of the MAJIQ algorithm which takes the raw splice ratio per junction with the Bayesian modeling disabled. To quantify splicing, we ran the build command with the same corresponding parameters {--minreads 2 --min-denovo 2 --minpos 1} and output calculated splice ratios with the PSI-coverage using parameters {--target-lsvs --stack-pvalue-threshold 0 --minbins 1 --minreads 2}. To quantify changes in expression, we took the total read count per LSV output from MAJIQ, normalized by total number of reads spanning targeted splicing events, as a proxy for overall gene abundance and calculated the log fold change between LSV level read counts across conditions of interest.
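The two quantities described above, the raw splice ratio per junction and the normalized LSV-level log fold change, can be written out as a minimal sketch. This is not MAJIQ code; the function names, the base-2 logarithm and the optional pseudocount are assumptions made for the example.

```python
import math


def raw_splice_ratios(junction_reads):
    """Raw splice ratio per junction: junction reads / total reads in the LSV."""
    total = sum(junction_reads.values())
    if total == 0:
        return {junction: float("nan") for junction in junction_reads}
    return {junction: count / total for junction, count in junction_reads.items()}


def lsv_log2_fold_change(reads_a, targeted_total_a, reads_b, targeted_total_b,
                         pseudocount=0.0):
    """log2 ratio (condition B over A) of LSV read counts, each normalized by the
    total number of reads spanning targeted splicing events in its library.

    The log base and the pseudocount option are assumptions.
    """
    norm_a = (reads_a + pseudocount) / targeted_total_a
    norm_b = (reads_b + pseudocount) / targeted_total_b
    return math.log2(norm_b / norm_a)


# Hypothetical read counts for illustration only.
ratios = raw_splice_ratios({"junction1": 30, "junction2": 10})
change = lsv_log2_fold_change(100, 1000, 400, 2000)
```

In this toy case the LSV accounts for 10% of targeted reads in condition A and 20% in condition B, giving a log2 fold change of 1.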
When downsampling analyses were required for both RNA-seq and LSV-seq, we subsampled the trimmed read files to varying read depths prior to alignment and other downstream analyses. To account for the fact that RNA-seq was paired-end data, while LSV-seq was single-end, we filtered for RNA-seq reads with the ‘READ1’ SAM flag set.
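Selecting reads with the ‘READ1’ flag amounts to testing bit 0x40 of the SAM FLAG field. The sketch below shows this on plain-text SAM lines; the function names are hypothetical, and in practice the same selection is commonly made with a tool such as samtools (for example with its `-f 64` filter).

```python
READ1_FLAG = 0x40  # SAM FLAG bit: first segment in the template


def is_read1(sam_line):
    """Return True if the alignment line has the READ1 bit set."""
    fields = sam_line.rstrip("\n").split("\t")
    return bool(int(fields[1]) & READ1_FLAG)


def keep_read1(sam_lines):
    """Yield header lines unchanged and only READ1 alignment lines."""
    for line in sam_lines:
        if line.startswith("@") or is_read1(line):
            yield line


# Hypothetical SAM records (truncated to the first few fields for brevity).
example_sam = [
    "@HD\tVN:1.6",
    "readA\t99\tchr1\t100",   # flag 99 includes 0x40 (READ1)
    "readA\t147\tchr1\t300",  # flag 147 includes 0x80 (READ2)
]
filtered = list(keep_read1(example_sam))
```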
Analysis of on- and off-target reads per primer
To identify reads mapping on- and off-target, we created a BED file containing all of the on-target primer ‘extended’ regions, defined as the region between the 3′ end of the primer, excluding the length of the primer itself, and the known 3′ splice site. We used this BED file to perform the bedtools intersect command on both RNA-seq and LSV-seq mapped deduplicated BAM files, retrieving the number of reads per on-target region. For each on-target region, fold enrichment over RNA-seq for each LSV-seq library was calculated by dividing the number of library-size normalized reads in LSV-seq by the average number of library-size normalized reads in RNA-seq replicates. To calculate on- and off-target percentage, we used the bedtools intersect command to calculate the number of individual reads overlapping targeted regions, and divided by the total number of unique reads in the library.
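The enrichment and on-target calculations above reduce to simple ratios, sketched here with hypothetical function names. Read counts per region are assumed to have been obtained beforehand (for example from bedtools intersect output), and regions with no RNA-seq reads would need separate handling that is not shown.

```python
def on_target_percentage(on_target_reads, total_unique_reads):
    """Percentage of unique reads in a library that overlap targeted regions."""
    return 100.0 * on_target_reads / total_unique_reads


def fold_enrichment(lsv_region_reads, lsv_library_size,
                    rnaseq_region_reads, rnaseq_library_sizes):
    """Library-size normalized LSV-seq reads in a region divided by the mean
    library-size normalized reads across RNA-seq replicates."""
    lsv_norm = lsv_region_reads / lsv_library_size
    rnaseq_norm = [
        reads / size for reads, size in zip(rnaseq_region_reads, rnaseq_library_sizes)
    ]
    rnaseq_mean = sum(rnaseq_norm) / len(rnaseq_norm)
    return lsv_norm / rnaseq_mean


# Hypothetical counts: 500 reads in a 1 M read LSV-seq library versus 10 and 30
# reads in two 100 M read RNA-seq replicates.
enrichment = fold_enrichment(500, 1_000_000, [10, 30], [100_000_000, 100_000_000])
```

With these illustrative numbers the region is enriched roughly 2500-fold relative to RNA-seq.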
Training of OP prediction models
Processing of output variables
To map each read to the primer from which it most likely originated, we created the ‘blast_process’ script. For each library, we first converted the BAM files mapped by STAR into SAM and FASTA files. The FASTA file was used to generate a BLAST database, which was aligned against the sequences of all the primers in the pool with parameters {-gapopen 2 -gapextend 2 -reward 2 -penalty 3 -task blastn-short}. For each sequenced library read, the primer that aligned within 50 bp of the 5′ read end with the highest alignment score was inferred to be the primer of origin. We removed unextended primer reads that were no more than 5 nucleotides longer than the primer itself.
While the original STAR alignment for each read was highly efficient at identifying the correct genomic positions to map, it often gave inaccurate information about insertions, deletions or mutations near the 5′ read ends. To generate a more accurate representation of how the 5′ read ends mapped to their origin primers, we extracted the sequence corresponding to the mapped genomic coordinates of the read end. We then used the pairwise2.align.globalms method from the biopython package to align this sequence to the read’s origin primer sequence with highly permissive parameters {match=2, mismatch=−1, open=−0.5, extend=−0.2}. The top scoring alignment was taken as the most likely alignment of the actual genomic location of each read to its origin primer. The longest continuous sequence uninterrupted by insertions, deletions or mutations starting from the 3′ end of the primer, divided by the length of the primer itself is what we termed the ‘fraction of primer binding’ (FPB).
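Given a gapped pairwise alignment, the FPB can be computed by walking back from the 3′ end of the primer until the first mismatch or gap. The sketch below assumes the alignment is supplied as two equal-length strings with `-` for gaps, written in the primer's 5′→3′ orientation and ending at the primer's 3′ end; the function name and this input format are assumptions, and the alignment step itself is not reproduced.

```python
def fraction_of_primer_binding(primer_aln, target_aln):
    """Longest uninterrupted match run from the primer 3' end / primer length.

    primer_aln, target_aln: equal-length gapped alignment strings ('-' = gap),
    oriented so that the last column is the 3' end of the primer (assumption).
    """
    if len(primer_aln) != len(target_aln):
        raise ValueError("aligned strings must have equal length")
    primer_len = len(primer_aln.replace("-", ""))
    run = 0
    for p, t in zip(reversed(primer_aln), reversed(target_aln)):
        # Stop at the first insertion, deletion or mismatch.
        if p == "-" or t == "-" or p != t:
            break
        run += 1
    return run / primer_len
```

For example, a 10-nucleotide primer with a single mismatch at its fifth base from the 5′ end has five matched bases at its 3′ end, giving an FPB of 0.5.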
Separately, we inferred whether a read was on- or off-target based on if the read overlapped with the extended primer region, corresponding to the region between the targeted 3′ splice site and the 5′ end of the primer, which includes the length of the primer itself. This allowed us to derive a specificity metric which we termed ‘on-target fraction’ (OTF), corresponding to the number of on-target reads divided by the total number of reads for each primer. We also created a separate yield/amplification metric termed ‘log total amplification’, corresponding to the logarithm of the total number of reads for each primer, first divided by the pool size and then divided by the library size. To aggregate multiple measurements of the same primer pool and tissue condition, we took the average of the specificity and amplification metrics. To account for redundant primers, especially when the same primer pool was used in a different tissue condition, we collapsed the feature vectors for these primers that were identical by sequence into a single mean vector.
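The per-primer output variables can be sketched as follows. The function names are hypothetical, the natural logarithm is an assumption (the base is not stated), and the pseudocount used to keep primers with zero reads finite is likewise an assumption rather than part of the described procedure.

```python
import math


def on_target_fraction(on_target_reads, total_reads):
    """OTF: on-target reads divided by all reads assigned to the primer."""
    return on_target_reads / total_reads if total_reads else float("nan")


def log_total_amplification(total_reads, pool_size, library_size, pseudocount=1.0):
    """Log of the primer's total reads, divided by pool size and then by library
    size, following the order given in the text. Log base and pseudocount are
    assumptions."""
    return math.log((total_reads + pseudocount) / pool_size / library_size)


def mean_across_replicates(values):
    """Average a metric over replicate measurements of the same pool and condition."""
    return sum(values) / len(values)
```

A primer with 80 on-target reads out of 100 has an OTF of 0.8, and replicate values of a metric for the same pool and condition are simply averaged before model training.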
Regression model for primer selection
Using the feature matrix we extracted during primer generation, and the output variables we processed, we trained CatBoost gradient boosting decision tree regression models, after iterating through different models and formulations of output variables. For the final specificity output variable, defined as the mean of FPB and OTF, we filtered for data points that had at least 12 total detected reads, as poorly amplified primers were subject to high variance in measurement. In contrast, we did not filter the final amplification output variable, defined as the log total amplification, because including poorly amplified primers improved model performance. For each model, we performed 5-fold cross-validation twice and interpreted models using the SHapley Additive exPlanation (SHAP) TreeExplainer package (25).
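The evaluation scheme, 5-fold cross-validation performed twice, can be illustrated with a standard-library sketch that only generates the index splits. The function name, the seeding and the fold assignment are assumptions, and model fitting itself is deliberately left out.

```python
import random


def repeated_kfold_indices(n_samples, n_splits=5, n_repeats=2, seed=0):
    """Yield (repeat, fold, train_indices, test_indices) for repeated k-fold CV.

    Each repeat reshuffles the samples, so the two rounds use different folds.
    """
    rng = random.Random(seed)
    for repeat in range(n_repeats):
        order = list(range(n_samples))
        rng.shuffle(order)
        # Deal the shuffled indices into n_splits near-equal folds.
        folds = [order[i::n_splits] for i in range(n_splits)]
        for k in range(n_splits):
            test = sorted(folds[k])
            train = sorted(
                index for j, fold in enumerate(folds) if j != k for index in fold
            )
            yield repeat, k, train, test


splits = list(repeated_kfold_indices(23))
```

Reusing the same generated splits across model families is one way to keep comparisons between models on an equal footing.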
We also created two deep learning model architectures for predicting primer specificity and amplification. The transformer model, based on mRNAbert that was pretrained on mRNA transcripts with 6-mer sequences (26), was fine-tuned for primer specificity and amplification regression predictions. We also created a convolutional neural network (CNN) model, which utilized several convolution layers followed by a long short-term memory layer. Non-sequence-based features, such as expression, bitscore and Tm, were concatenated before the second-to-last layer for both models. Both models were trained on the same primer data and underwent 5-fold cross-validation twice on identical splits as for the boosted decision tree models.
Results
Overview of LSV-seq
In order to better capture AS variations of interest, we developed LSV-seq (Figure 1A). While standard RNA-seq aims to capture all transcripts in an unbiased manner using random N-mer or oligo-dT RT primers, LSV-seq instead captures specific RNA regions of interest using complex pools of targeted RT primers. Importantly, LSV-seq is highly scalable and the number of RNA targets can range from hundreds to thousands. For detection of AS, the designed primers bind to specific target regions adjacent to selected splice junctions of interest (Figure 1A). The primer design and targeting of splice junctions takes advantage of the LSV formulation and detection in MAJIQ (12) (Figure 1B). LSVs are defined as all of the splice junctions entering (i.e. single target) or exiting (i.e. single source) a specific reference exon, thus allowing splicing studies to capture variations with any number of junctions. Under this framework, LSV-seq can capture target LSVs in MAJIQ’s splice graphs, consisting of all of the splice junctions entering the 3′ node of an AS event. Importantly, the target LSV formulation allows the detection and quantification of unannotated and complex splicing variations involving >2 splice junctions using a single primer. While junctions can be targeted individually, our analysis indicates we are able to capture 79% of splice junctions across GTEx tissues using the target LSV formulation (Figure 1B).
Figure 1.
Overview of LSV-seq for targeted detection of alternative splicing. (A) LSV-seq enriches for junction spanning reads by performing highly multiplexed RT with primers anchored directly adjacent to target LSVs. In contrast, only a minor fraction of conventional RNA-seq data is informative for splicing quantification, as most reads do not span splice junctions. (B) LSV formulation for splicing events used by the MAJIQ algorithm. Pie chart depicts the percentage of all splice junctions that can be captured by target LSVs, out of the superset of junctions in all source and target LSVs. LSV-seq primers can capture both classical binary splicing events as well as more complex events consisting of annotated and novel junctions. (C) Overview of LSV-seq protocol steps. LSV-seq targeting primers are first annealed to the target RNA using a touchdown protocol to maximize specificity. After first and second strand cDNA synthesis, linear amplification occurs via IVT. The amplified RNA (aRNA) then undergoes another RT step to append the second adapter. The resulting cDNA is then PCR-amplified and sequenced.
We initially attempted to directly transfer over the previously published MPE-seq protocol (11,14) from yeast to human cells. However, pilot experiments across two different primer pools resulted in libraries with a low mean percentage of on-target reads ranging from 1.24% to 2.02%, despite substantial amounts of input RNA (50 µg) (Supplementary Figure S1). We hypothesized that these results were due to the massively increased complexity of the human transcriptome compared with yeast and that further development of the method was needed. Thus, we undertook a series of iterative experimental and computational optimizations focused on improving the specificity and overall yield of the assay, which led to the LSV-seq method presented here.
In our optimized LSV-seq protocol, the targeting primer consists of a constant 5′ sequence that includes a T7 promoter sequence followed by an adapter sequence for PCR amplification and a 10 nucleotide UMI (Figure 1C). This invariant region is then followed by the target-specific priming sequence. The LSV-seq protocol then proceeds as follows (Figure 1C): first, the LSV-seq primer pool is gradually annealed to input RNA using a touchdown step and first-strand cDNA synthesis occurs at an elevated temperature of 60°C to maximize the specificity of the RT reaction. Next, second strand synthesis is performed and the double-stranded DNA then serves as a template for linear amplification by IVT. As the targeted RT performed in LSV-seq significantly reduces the amount of starting cDNA compared with conventional RNA-seq, we adopted the IVT step used in single-cell protocols like CEL-seq2 (27) and found that it is critical for increasing the amount of material available for downstream steps. The IVT amplified RNA (aRNA) is then fragmented and proceeds through an additional RT step to append a second adapter sequence. The cDNA is finally amplified by PCR and the resulting library is ready to be sequenced. This final experimental protocol was coupled with optimization of primer design, which we describe next.
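The primer layout described above can be sketched as a simple concatenation. In this minimal Python sketch, the T7 promoter is the commonly used consensus sequence (whether the protocol uses exactly this variant is an assumption), while the adapter sequence, the function names and the example target are hypothetical placeholders, not the sequences used in LSV-seq.

```python
# Commonly used T7 promoter consensus; assumed here, not taken from the protocol.
T7_PROMOTER = "TAATACGACTCACTATAGGG"
# Hypothetical placeholder adapter, NOT the adapter used in the protocol.
ADAPTER = "ACACGACGCTCTTCCGATCT"
# 10-nucleotide UMI (as stated in the text), written as degenerate N bases.
UMI = "N" * 10


def reverse_complement(seq):
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq))


def build_rt_primer(target_region):
    """Return 5'-T7 promoter, adapter, UMI, priming sequence-3' for one target.

    target_region is the sense-strand sequence (DNA alphabet) that the primer
    should anneal to, so the priming sequence is its reverse complement.
    """
    return T7_PROMOTER + ADAPTER + UMI + reverse_complement(target_region)


# Hypothetical 17-nt target region, for illustration only.
primer = build_rt_primer("GGATCCAAGTTTACCGA")
```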
OP machine learning models predict high performance primers
We originally attempted to design LSV-seq primers using multiple existing probe design tools (28–30), but were not able to achieve satisfactory performance across multiple iterations. To address this, we systematically analyzed trends that might reasonably correlate with primer performance. We first compiled an exhaustive dataset of ∼15 000 distinct data points from previously sequenced iterations of LSV-seq libraries. This dataset spanned five different primer pools reflecting a logarithmic regime of potential pool sizes, and five unique cell line or tissue conditions (Supplementary Table S1). We then defined optimal primer performance in terms of a combined specificity metric and a yield metric, based on rational models of primer binding (Figure 2A). The first component of the specificity metric is the Fraction of Primer Binding (FPB). FPB is computed by inferring the primer of origin from the 5′ end of the read, allowing us to identify reads with nonspecific partial primer binding during the RT reaction. The second component of the specificity metric is On-Target Fraction (OTF), which is computed by calculating the ratio of reads mapping to the intended on-target loci compared with undesired off-target loci. OTF and FPB are averaged to create a single combined specificity metric used in downstream analyses. We also defined a yield metric, the log total amplification, as the logarithmic sum of the on- and off-target reads combined. The poor observed correlation (R = 0.23) between the specificity and yield metrics prompted us to develop independent explanatory models for each (Figure 2B). Validating our initial challenges with using previous probe design tools, commonly used heuristics such as melting temperature, primer length and off-target alignment scores failed to identify any clear explanatory factors (Figure 2C). 
Based on calculated R2 scores, the strongest correlated individual metrics we examined corresponded to a maximal explained variance in performance of only 7.3% for the specificity metric (GC content, R = −0.2707), and 22.4% for the yield metric (melting temperature, R = −0.4729). Moreover, many of the individual factors we examined are highly intercorrelated (for instance GC content and melting temperature), implying that basic regression models would not markedly improve performance.
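The two performance metrics can be sketched in Python as follows. The exact read-classification rules, the logarithm base and the pseudocount are not given in this section, so the choices below (reads split into full versus partial primer matches, log10, pseudocount of 1) are assumptions for illustration. The last lines reproduce the explained-variance figures quoted above from the reported Pearson coefficients.

```python
import math


def specificity_metric(full_primer_reads, partial_primer_reads,
                       on_target_reads, off_target_reads):
    """Mean of FPB and OTF for one primer, or None if the primer has no reads."""
    primer_total = full_primer_reads + partial_primer_reads
    mapped_total = on_target_reads + off_target_reads
    if primer_total == 0 or mapped_total == 0:
        return None
    fpb = full_primer_reads / primer_total   # Fraction of Primer Binding (assumed form)
    otf = on_target_reads / mapped_total     # On-Target Fraction
    return (fpb + otf) / 2


def log_total_amplification(on_target_reads, off_target_reads, pseudocount=1):
    """Yield metric: log of on- plus off-target reads (base 10 and pseudocount assumed)."""
    return math.log10(on_target_reads + off_target_reads + pseudocount)


# Explained variance is the square of the Pearson correlation.
gc_explained = round((-0.2707) ** 2 * 100, 1)   # specificity vs. GC content
tm_explained = round((-0.4729) ** 2 * 100, 1)   # yield vs. melting temperature
```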
Figure 2.
Primer selection pipeline based on OP models. (A) Overview of specificity and yield metrics derived to infer individual primer performance. (B) Scatter plot depicting the correlation between the specificity metric (mean of FPB and OTF) and the yield metric (Normalized Log Total Amplification). Pearson correlation coefficient is given. (C) Scatter plots depicting the correlation between various individual features thought to be important for determining primer performance and the specificity metric (mean of FPB and OTF) or the yield metric (Normalized Log Total Amplification). Pearson correlation coefficients are given and definitions of individual features are provided in Supplementary Table S2. (D) Overview of processing steps in final primer selection pipeline. Pipeline allows use of either the ‘full’ or ‘lite’ specificity and yield models. (E) Overview of encoded sequence-based features. Individual LSV-seq primer is shown as it would bind to the RNA during RT, along with the proximal region extended by the reverse transcriptase. (F) Cross-validated performance of trained regression models. For each indicated model type, the calculated R2 score and Pearson correlation are given based on held-out 5-fold cross-validation splits conducted independently twice.
Since no simple combination of features was strongly predictive, we hypothesized that a more complex, dedicated machine learning model could significantly improve primer design (Figure 2D). In order to build this model, which we named Optimal Prime (OP), we expanded the set of explanatory features from the initial 6 to over 1000. We implemented our own primer design and feature extraction pipeline, dually inspired by the OligoMiner pipeline for RNA FISH probe design (30) and prior feature extraction methods used for CRISPR-Cas9 guide prediction (31). A ∼50–100 bp target region is converted into all possible candidate primers by filtering for substrings satisfying relatively relaxed constraints for GC content, length and melting temperature. The candidate primers are aligned with the BLAST alignment algorithm (19) and on- and off-target alignments are passed into the NUPACK nucleic acid binding prediction algorithm (20), as in previous work (30,32), to create alignment-based features. An additional optional category of features includes those that are specific to the experimental context, such as gene expression levels for the targeted tissue or number of primers targeted to the same locus. Importantly, we also extracted hundreds of key sequence-based features for each candidate primer based on one-hot encodings of different positional nucleotide motifs (Figure 2E). A full set of model features and their corresponding descriptions are given in Supplementary Table S2. The final set of features is passed into either ‘full’ or ‘lite’ specificity and yield models. The ‘full’ model supplements this set of features with experiment-specific features to maximize prediction performance for LSV-seq. In contrast, the ‘lite’ model is experiment-independent and therefore more flexible, potentially generalizing to other applications beyond LSV-seq. 
Finally, the separate predictions from the specificity and yield models are combined into a single score, allowing all candidate primers per locus to be ranked by predicted performance.
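The candidate generation and ranking steps can be sketched as below. This is a minimal illustration, not the OP pipeline: the length, GC and melting temperature bounds are invented, the Wallace rule stands in for whatever melting temperature calculation the pipeline uses, and the equal-weight combination of the two model predictions is an assumption because the combination rule is not stated here.

```python
def gc_fraction(seq):
    return (seq.count("G") + seq.count("C")) / len(seq)


def wallace_tm(seq):
    """Rough melting temperature (Wallace rule): 2*(A+T) + 4*(G+C)."""
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2 * at + 4 * gc


def candidate_primers(region, min_len=18, max_len=22,
                      gc_range=(0.3, 0.7), tm_range=(50, 70)):
    """Enumerate substrings of the target region that pass relaxed filters.

    Substrings are returned in target orientation; the primer that is ordered
    would be the reverse complement of each substring.
    """
    candidates = []
    for length in range(min_len, max_len + 1):
        for start in range(len(region) - length + 1):
            sub = region[start:start + length]
            if (gc_range[0] <= gc_fraction(sub) <= gc_range[1]
                    and tm_range[0] <= wallace_tm(sub) <= tm_range[1]):
                candidates.append((start, sub))
    return candidates


def rank_candidates(scored, specificity_weight=0.5):
    """Sort (primer, specificity_pred, yield_pred) tuples by a combined score."""
    def combined(item):
        _, spec, yld = item
        return specificity_weight * spec + (1 - specificity_weight) * yld
    return sorted(scored, key=combined, reverse=True)


# Hypothetical 60-nt target region and hypothetical model predictions.
candidates = candidate_primers("ACGT" * 15)
ranked = rank_candidates([("primer_a", 0.9, 0.1), ("primer_b", 0.6, 0.6)])
```

Keeping the filters relaxed at this stage leaves the decision to the learned models, which is the design choice the text describes.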
During model development, we tuned the design of our models through a combination of model selection, dataset filtering to remove noisy low-coverage data points and different formulations of the predicted variables (Supplementary Figure S2). Interestingly, the boosted decision tree model exceeded the performance of the two deep learning-based architectures we benchmarked, including convolutional neural networks (CNNs) and transformer models (Supplementary Figure S2A and B). Based on the mean performance across two independent 5-fold cross-validation procedures, we were able to generate highly accurate OP models based on the boosted decision trees for both specificity (full model Pearson’s R = 0.742 and R2 score = 0.549; lite model Pearson’s R = 0.676 and R2 score = 0.457) and yield (full model Pearson’s R = 0.822 and R2 score = 0.674; lite model Pearson’s R = 0.792 and R2 score = 0.627) (Figure 2F). For the specificity model, the OP model represents a relative performance increase of 7.5-fold, out of a maximum possible of 13.7-fold, compared with the original maximum individual factor performance of 7.3% explained variance.
Validation and feature analysis of OP models
While many prior primer design algorithms are either challenging to experimentally assess or are limited to relatively low-throughput validations, we leveraged the high-throughput scale of LSV-seq to directly validate the performance of our OP model at the bench across almost 1000 new primers. We compared the performance of primers designed only considering classic heuristics and predicted on- and off-target alignments, versus primers newly re-designed based on our OP specificity model (Figure 3A). Although these two sets of primers were designed to target the same 948 target LSVs, with one primer per target locus, we vastly improved the median specificity metric score from 0.55 in the non-OP design to 0.94 after incorporating our OP model (out of a maximum possible score of 1.0). We also greatly reduced the rate of primer dropout (primers with no detectable reads), from 8.9% (84/948 primers) to 0.3% (3/948 primers).
Figure 3.
Experimental validation and interpretation of OP models. (A) Cumulative distribution function for the performance of primers designed with (orange) or without (blue) the OP model, as measured by the specificity metric (mean of FPB and OTF), for the same set of target regions (n = 948). Number of primers with no observed value is noted by the arrowheads. (B) Top features for full OP models. Features are ranked by mean absolute SHAP value magnitude. (C and D) Informative interactions between top features for full models. For each feature indicated, its top most influential interactor by SHAP value magnitude is colored. (E) Bar graph depicting the relative enrichment of specific nucleotides at the first 25 positions in both directions from the 3′ end of the primer, for either the best or lowest performance primer groups, as evaluated by the specificity metric (mean of FPB and OTF). For each position and nucleotide combination, a binomial test was conducted to determine the likelihood of the observed proportion in the top quintile of primers, compared with the proportion in the bottom quintile of primers.
Highlighting the explainability of the boosted decision tree model architecture, we interpreted the top features that contributed the most to primer performance via SHAP interaction values (25). Although the same set of features is used in the specificity and yield models, their relative importance differs greatly (Figure 3B). For the specificity model, we discovered a number of features that represent the adenosine content within the 7–9 nucleotides at the 3′ primer end. Conceptually, the 3′ end represents the nucleotides most important for formation of the RT initiation complex prior to elongation (33). In contrast, for the yield model, the top model features were melting temperature and length, followed by a distinct set of sequence-based features. We further investigated more complex interactions between interpretable sets of top features per model. For the specificity model, a representative adenosine-rich 3′ end feature differentially modifies the impact on predicted performance for the length feature, in adenosine-poor primers (≤1 adenosine in 3′ end) compared with adenosine-rich primers (≥2 adenosines in 3′ end) (Figure 3C). For the yield model, our analysis of the melting temperature and length-related features reveals that as primers grow longer, the relative effect of calculated melting temperature is somewhat decreased, represented in a shallower curve (Figure 3D). Additionally, the point of inflection in determining model penalty for high melting temperature occurs close to the RT reaction temperature of 60°C, suggesting a link to real-world reaction conditions. Independently of our models, we also performed a binomial test for the relative enrichment of specific mononucleotide motifs in either the top-performing (top quintile) or bottom-performing (bottom quintile) primers for the specificity metric, as has been done in other nucleic acid prediction contexts (31) (Figure 3E). 
The most statistically significant differences are found in the increased preference for adenosines between nucleotides -1 and -7 from the 3′ end for top performing primers, which is consistent with their usage as top features in the OP specificity model. The guanine and thymine contents of the proximal extended region are also statistically significant determinants, and are likewise present within the top model features.
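The positional enrichment test can be sketched with an exact binomial test in plain Python. The two-sided form used here (summing the probabilities of all outcomes no more likely than the observed one) is an assumption, since the text does not state the sidedness; the primer sequences are invented.

```python
from math import comb


def binom_pmf(k, n, p):
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


def binom_two_sided_p(k, n, p):
    """Exact two-sided binomial P-value for k successes in n trials."""
    observed = binom_pmf(k, n, p)
    tail = [binom_pmf(i, n, p) for i in range(n + 1)]
    # Small relative tolerance so ties with the observed outcome are included.
    return min(1.0, sum(q for q in tail if q <= observed * (1 + 1e-7)))


def position_enrichment(top_primers, bottom_primers, position, nucleotide):
    """Test whether `nucleotide` at `position` (-1 = 3' terminal base) is as
    frequent in the top group as expected from the bottom group's proportion."""
    k = sum(seq[position] == nucleotide for seq in top_primers)
    expected = (sum(seq[position] == nucleotide for seq in bottom_primers)
                / len(bottom_primers))
    return k, binom_two_sided_p(k, len(top_primers), expected)


# Hypothetical primer groups (sequences written 5' to 3').
top = ["GCTTGAA", "CCGTAAA", "TTGCAGA", "GGCTAAA"]
bottom = ["GCTTGAC", "CCGTAAG", "TTGCAGA", "GGCTAAC"]
count, p_value = position_enrichment(top, bottom, -1, "A")
```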
In order to increase the accessibility of our OP models to others for the design of LSV-seq primers, we implemented our primer selection pipeline as a webtool (https://tools.biociphers.org/lsv-seq). We preliminarily investigated the effects of using the full versus lite models for the specific task of creating a relative ranking of primers, and noted an overall strong concordance in the rankings between the model types, although some specific points experience high discordance (Supplementary Figure S3B and C). Thus, we decided to allow for the selection of two distinct run modes, both of which bypass the most computationally expensive steps and greatly reduce the runtime on a live web platform. ‘Lite Mode’, which implements our lite OP models, allows a more flexible array of inputs, based on either a BED6 file specifying mouse or human chromosome coordinates, a FASTA file or a selection from a list of LSVs. ‘Full Mode’ implements our more accurate full OP models. To achieve this, we ran MAJIQ across 54 GTEx tissues, allowing us to comprehensively catalog all >190 000 target LSVs we could detect across all human tissues. We then exhaustively generated >16 million total candidate primers from these identified target LSVs, and precomputed the BLAST alignment and NUPACK binding prediction steps, which took ∼7 days on our cluster. To run the webtool in ‘Full Mode’, the user selects from a list of LSVs and either specifies preloaded expression values for a given tissue of interest, or supplies their own list. For both run modes, the final output is a list of the top 10 primer sequences for each region and their corresponding prediction scores.
Benchmarking and validation of LSV-seq
To benchmark LSV-seq, we used Jurkat T-cells, which we have previously shown to undergo reproducible splicing changes upon stimulation with PMA (34). We first ran exploratory analysis with MAJIQ on previously published RNA-seq data from our lab in this T-cell context (15), resulting in an initial discovery set of over 48 000 putative target LSVs. We then performed a stratified random selection procedure to ensure we covered the different categories of splicing variations we expected to capture with LSV-seq (see ‘Materials and methods’ section for more details). Having established high capture rates for our optimized primers, we collected stimulated and unstimulated Jurkat cells in matched biological triplicate and then used the same RNA to generate either standard RNA-seq libraries (sequenced to a minimum depth of 100 million reads) or LSV-seq libraries with a pool of 1991 targeting primers designed using our OP pipeline (sequenced to a depth of 10 million reads). LSV-seq libraries were highly specific, with >95% of reads corresponding to targeted regions, while the same regions were covered by only ∼1–2% of reads in the RNA-seq dataset (Figure 4A). Overall, we achieved a median ∼19-fold and mean ∼230-fold enrichment (Figure 4B).
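A per-event fold enrichment of this kind can be sketched as follows. Normalizing by library size and requiring at least five reads in both assays are assumptions made for this illustration, and all read counts are invented.

```python
from statistics import mean, median


def fold_enrichment(lsv_reads, lsv_library_size, rna_reads, rna_library_size):
    """Ratio of library-size-normalized coverage, LSV-seq over RNA-seq."""
    return (lsv_reads * rna_library_size) / (rna_reads * lsv_library_size)


def summarize_enrichment(events, lsv_library_size, rna_library_size, min_reads=5):
    """Median and mean fold enrichment over events with enough reads in both assays."""
    folds = [
        fold_enrichment(lsv, lsv_library_size, rna, rna_library_size)
        for lsv, rna in events
        if lsv >= min_reads and rna >= min_reads
    ]
    return median(folds), mean(folds)


# Hypothetical (LSV-seq reads, RNA-seq reads) pairs per targeted event.
events = [(50, 50), (100, 50), (3000, 50), (400, 2)]
med, avg = summarize_enrichment(events, 10_000_000, 100_000_000)
```

A few strongly enriched events pull the mean well above the median, which is the kind of skew that separates a median of ∼19 from a mean of ∼230.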
Figure 4.
LSV-seq recapitulates gold standard RNA-seq measurements and enriches low-coverage RNA-seq measurements. (A) Percent of LSV-seq reads mapping to targeted splicing events of interest (n = 1991) in Jurkat T-cells. Also shown are the percent of reads mapping to the same splicing events in RNA-seq data. (B) Raindrop plot depicting the mean LSV-seq fold enrichment over RNA-seq for unstimulated (n = 3) and stimulated (n = 3) Jurkat T-cell biological replicates per splicing event. The box plot markings represent the 0th to 100th percentiles in increments of 25, with the median marked at the 50th percentile, and the mean denoted as an orange dot. A small proportion of targeted events that did not reach at least five detected reads in either LSV-seq or RNA-seq were excluded from analysis, as indicated above the plots. (C) Scatter plot comparing the PSI values for the same splice junctions in either saturated LSV-seq or RNA-seq datasets (for the unstimulated Jurkat T-cell condition), across only splicing events with at least 30 mean reads of coverage in both (n = 885). Each splicing event is reduced to one splice junction selected at random. Pearson correlation coefficient is given. (D) Scatter plot comparing the PSI values for the same splice junctions in equally downsampled LSV-seq or RNA-seq datasets (for the unstimulated Jurkat T-cell condition), across all splicing events. Only junctions observed in both LSV-seq and RNA-seq are considered when calculating the PSI value. Each splicing event is reduced to one splice junction selected at random. Points that have high disagreement are defined as those having over 0.1 PSI discordance (consisting of points lying outside the dashed lines). The bar plots quantify the coverage categories of all events overall and in events with >0.1 PSI discordance. Pearson correlation coefficient is given. 
(E) Scatter plot comparing the stimulated–unstimulated deltaPSI values for the same splice junctions in full-depth LSV-seq or RNA-seq datasets, across splicing events with at least 30 reads of coverage in both. All junctions, whether or not they are observed in both, are considered in calculating the deltaPSI values. Dashed lines indicate significant deltaPSI values over 0.2. Pearson’s correlation was calculated by shrinking all points within 0.2 deltaPSI (within the inner box) to 0. (F) Cumulative distribution functions plotting the mean difference in number of junctions detected in equally downsampled LSV-seq and RNA-seq. Positive values indicate more junctions detected in LSV-seq, while negative values indicate more junctions detected in RNA-seq.
One key characteristic of the LSV-seq method presented here is its tight integration with MAJIQ. As discussed earlier, this integration is reflected in the type of AS events LSV-seq inherently captures at the experimental level, which correspond to target LSVs by MAJIQ’s formulation. Practically, during post-sequencing data processing, this also requires quantification using the MAJIQ PSI/deltaPSI algorithms and visualization using the VOILA package, which were all originally designed for standard RNA-seq data. To accomplish this, we created a new independent Python package and made several compatibility updates within the base MAJIQ algorithm, allowing us to seamlessly analyze splicing events for LSV-seq. To fairly compare the resulting RNA-seq and LSV-seq quantifications, we performed analyses of both full-depth datasets and data downsampled to equal read depths (Supplementary Figure S4A). First examining only the 885 splicing events with high coverage in both full-depth LSV-seq and RNA-seq (defined as having at least 30 reads per event), we noted near-identical PSI values (R = 0.984), demonstrating that LSV-seq provides quantifications that are highly consistent with those from high-coverage RNA-seq (Figure 4C). To further assess the reproducibility between LSV-seq and RNA-seq, we produced Bland–Altman plots, which are commonly used to compare the agreement between two different assays, by plotting the mean of the assays (mean of LSV-seq and RNA-seq PSI values) against the magnitude of disagreement (difference between LSV-seq and RNA-seq PSI values) (Supplementary Figure S5A) (35). The limits of agreement between LSV-seq and RNA-seq span the interval from −0.15 to 0.14, suggesting strong concordance. By comparison, the limits of agreement are −0.08 to 0.08 for LSV-seq and RNA-seq samples internally (Supplementary Figure S5B and C).
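For reference, Bland–Altman limits of agreement are conventionally computed as the mean difference plus or minus 1.96 standard deviations of the differences. The sketch below assumes that convention applies here; the PSI values are invented.

```python
from statistics import mean, stdev


def bland_altman(psi_a, psi_b):
    """Return per-event means, per-event differences and the limits of agreement."""
    diffs = [a - b for a, b in zip(psi_a, psi_b)]
    means = [(a + b) / 2 for a, b in zip(psi_a, psi_b)]
    bias = mean(diffs)                 # systematic offset between the two assays
    spread = 1.96 * stdev(diffs)       # 1.96 sample standard deviations
    return means, diffs, (bias - spread, bias + spread)


# Hypothetical PSI values for the same junctions in the two assays.
lsv_seq_psi = [0.10, 0.50, 0.90]
rna_seq_psi = [0.20, 0.50, 0.80]
means, diffs, limits = bland_altman(lsv_seq_psi, rna_seq_psi)
```

Under this convention, limits of −0.15 to 0.14 would correspond to a small bias of about −0.005 between the assays.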
Next, when we relaxed our read threshold filter and instead looked at the 911 splicing events across equally downsampled LSV-seq and RNA-seq with at least one read in both, the correlation dropped considerably (R = 0.875) (Figure 4D). We also classified each splicing event as having low coverage in either LSV-seq, RNA-seq or both. A majority of all splicing events were shown to have <30 reads in only RNA-seq (57.1%), while a much smaller proportion had <30 reads in only LSV-seq (1.0%). Looking specifically at the subset of events with PSI values which differed by >0.1 between LSV-seq and RNA-seq, the proportion of events with low coverage in both technologies increased (53.5%), while the other categories stayed relatively constant. Collectively, these results showcase the much higher rate of splicing events with sufficient coverage in LSV-seq compared with RNA-seq at similar sequencing depths. In addition to calculating PSI values across replicates, we also examined our ability to accurately profile the difference in mean PSI values, or deltaPSI, between the stimulated and unstimulated condition groups, using the same 30 read minimum threshold for analyzed events as in the previous analysis (Figure 4E). Notably, since this analysis requires quantifying events in two conditions, we included junctions reported as low coverage in either assay technology. Even with this addition, we still observed a strong correlation between deltaPSI values from RNA-seq and LSV-seq (R = 0.768). Reassuringly, the deltaPSI values measured in LSV-seq between matched pairs have significantly lower variance across read coverage bins compared with RNA-seq (Supplementary Figure S4B and C).
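As a point of reference for the quantities compared here, PSI and deltaPSI can be written as simple read-count ratios. This is a naive point-estimate sketch with invented counts and junction names; it is not the MAJIQ algorithm, which models these quantities statistically.

```python
from statistics import mean


def junction_psi(junction_reads):
    """Naive PSI: reads for each junction divided by all reads in the LSV."""
    total = sum(junction_reads.values())
    if total == 0:
        return {}
    return {junction: n / total for junction, n in junction_reads.items()}


def delta_psi(stimulated, unstimulated, junction):
    """Difference in mean PSI between condition groups (replicates must have reads)."""
    stim = mean(junction_psi(rep)[junction] for rep in stimulated)
    unstim = mean(junction_psi(rep)[junction] for rep in unstimulated)
    return stim - unstim


def is_discordant(psi_lsv_seq, psi_rna_seq, threshold=0.1):
    """Flag junctions whose PSI differs by more than 0.1 between assays."""
    return abs(psi_lsv_seq - psi_rna_seq) > threshold


# Hypothetical junction read counts per replicate.
stim_reps = [{"inc": 80, "skip": 20}, {"inc": 60, "skip": 40}]
unstim_reps = [{"inc": 30, "skip": 70}, {"inc": 50, "skip": 50}]
dpsi = delta_psi(stim_reps, unstim_reps, "inc")
```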
As another key benchmark, we assessed the ability of LSV-seq to consistently capture splice junctions in each quantified splicing event. For equally downsampled LSV-seq and RNA-seq datasets, we compared the difference in the number of splice junctions detected. For ∼70% of splicing events within a biological condition, LSV-seq detects, on average, at least one extra splice junction per LSV compared with RNA-seq (Figure 4F). Moreover, the vast majority of these splice junctions are likely true biological occurrences, as 8406/8521 (98.7%) of them can be detected in the full-depth RNA-seq dataset (Supplementary Figure S4D).
Finally, we assessed the ability of LSV-seq to approximate changes in expression. Although LSV-seq was primarily optimized to quantify PSI and deltaPSI values, the log2 fold changes in per-primer LSV-seq coverage can potentially be helpful as an orthogonal metric to discriminate between tissue conditions. We compared the mean LSV-seq log2 fold changes in single-primer coverage to the mean of RNA-seq log2 fold changes in whole-gene summed TPM values. While we do not necessarily expect LSV-seq coverage of a single targeted ∼150-base region to directly correspond to RNA-seq whole-gene expression, we observed an unexpectedly strong correlation between LSV-seq and RNA-seq in measured coverage differences (R = 0.922) (Supplementary Figure S4E), suggesting LSV-seq could be used to simultaneously track changes in target coverage. Moreover, the strength of the LSV-seq/RNA-seq correlation is similar to the strength of correlation across true biological replicate measurements (ranging from R = 0.967 to R = 0.971).
LSV-seq recovers previously uncharacterized tissue-specific splicing events in GTEx
Next, we reasoned that the enhanced sensitivity of LSV-seq could recover previously unquantifiable, low-coverage splicing events. We first set out to assess how many of the AS events detected in existing large datasets such as GTEx may suffer from limited quantifiability. In representative tissue-wide GTEx RNA-seq datasets that we analyzed using our MAJIQ algorithm (36), we found that on average only ∼10% of reads span splice junctions, preventing consistent capture of less abundant isoforms (37) (Figure 5A). Because of the sparse coverage of splice junctions, up to ∼44% of the AS events we detected in the GTEx data had ≤25 mean reads of coverage and could not be reliably quantified in a significant fraction of samples (where ‘reliable’ is defined as the LSV having at least 10 reads) (Figure 5B and Supplementary Figure S6A). Moreover, the large majority of these difficult-to-quantify AS events reside in well-expressed genes (with at least five TPM) (Figure 5C), hinting at their potential biological relevance.
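The two coverage criteria used in this analysis can be sketched as follows, with the thresholds taken from the text (at least 10 reads for a sample to count as reliably quantified, 25 or fewer mean reads for an event to count as low coverage). The function names and read counts are invented for illustration.

```python
def quantifiable_fraction(reads_per_sample, min_reads=10):
    """Fraction of samples in which the LSV reaches the reliability threshold."""
    return sum(r >= min_reads for r in reads_per_sample) / len(reads_per_sample)


def is_low_coverage(reads_per_sample, max_mean_reads=25):
    """True if mean coverage across samples is at or below the cutoff."""
    return sum(reads_per_sample) / len(reads_per_sample) <= max_mean_reads


# Hypothetical reads for one LSV across five samples of a tissue.
lsv_reads = [3, 12, 8, 40, 0]
fraction = quantifiable_fraction(lsv_reads)
low = is_low_coverage(lsv_reads)
```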
Figure 5.
LSV-seq captures splicing events that are unquantifiable in GTEx RNA-seq datasets. (A) Pie charts illustrating low capture rate of splice junction reads in GTEx RNA-seq datasets across three tissues. (B) For the GTEx liver dataset, histogram in orange illustrating distribution of read coverage for all detectable LSVs, overlaid with box plot in blue illustrating decrease in quantifiability rate at low read coverage. Red box highlights the ∼44% of AS events with 25 or fewer mean reads of coverage that cannot be reliably quantified. (C) For the GTEx liver dataset, histogram in light blue illustrating the distribution of gene expression by TPM for low-coverage LSVs, overlaid with the cumulative distribution function in orange for all expressed genes. (D) Overview of the pipeline used to prioritize low-coverage splicing events which are predicted to be tissue-specific across three different GTEx tissues. Created with Biorender.com. (E) Upset plot depicting the relative overlap of the initial list of candidate events identified by each pipeline. (F) Raindrop plot of the mean fold enrichment of LSV-seq over RNA-seq for the mean splicing event coverage over three technical replicates, compared with the mean coverage over the entire GTEx dataset for each tissue. For each technology, read coverage is normalized by the mean library size. (G) Cumulative distribution functions plotting the difference in the number of mean junctions per splicing event detected across full-depth LSV-seq replicates, compared with the full-depth GTEx RNA-seq dataset, per tissue. Positive values indicate more junctions detected in LSV-seq, while negative values indicate more junctions detected in RNA-seq. (H) Upset plot categorizing changing events captured by LSV-seq. For each changing LSV, we noted which specific pairwise tissue comparison(s) they were changing between. (I) Summary of LSV-seq validation results for the splicing event prioritization pipeline. 
Different subcategories of events, based on their original prioritization pipeline and prediction confidence, are shown with their relative enrichment for confidently changing events (>0.15 deltaPSI in at least one pairwise comparison), out of total confidently changing and non-changing (<0.05 deltaPSI in all pairwise comparisons) events.
After establishing that the GTEx RNA-seq data did indeed have a large number of splicing ‘blind spots’, we devised a strategy to prioritize which of these events to target with LSV-seq (Figure 5D and Supplementary Figure S6). Specifically, we aimed to recover tissue-specific splicing events that change in at least one of three pairwise tissue comparisons between liver, heart atrial appendage and brain cerebellum. Based on our initial estimates, targeting splicing events at random would identify tissue-specific events at an unacceptably low rate of 4.7%. To mitigate this, we created a prioritization pipeline to nominate putative tissue-specific splicing events that we could then experimentally validate with LSV-seq. We first identified all unquantifiable low-coverage LSVs (≤25 mean reads of coverage) across the three selected tissues. We then classified each event as putatively differentially spliced between tissues based on two separate prediction pipelines. The first uses ENCODE CLIP-seq data that capture the binding sites of RBPs (38,39), while the second uses an in-house splicing code deep learning model (16). Surprisingly, these two pipelines predicted largely nonoverlapping sets of putative tissue-specific splicing events (Figure 5E).
We then used our OP primer design pipeline to create a pool of 1514 primers capturing 1400 unique targets and performed LSV-seq on RNA from human liver, heart atrial appendage and brain cerebellum. LSV-seq greatly boosted the coverage of targeted splicing events, by a median of 424- to 956-fold depending on the tissue, far exceeding the original GTEx RNA-seq coverage level (Figure 5F). Due to the corresponding increase in consistency of junction-level coverage, we also noted a large gain in the number of detected junctions per event (Figure 5G). When we assessed the deltaPSI values we captured between each pairwise tissue comparison, we found we were able to call 292 unique pairwise differences (deltaPSI > 0.15), corresponding to 171 unique LSVs (Figure 5H), with the most differences found between brain cerebellum and either of the other two tissues. Our overall rate of return was 21.5% for tissue-specific events, over 4 times the 4.7% rate that would be expected with random selection (Figure 5I). Our splicing code pipeline performed especially well, with 35.9% of candidate events being validated as truly changing between tissues. In contrast, the RBP-based pipeline nominated twice as many candidate events yet returned a similar number of truly changing events (15.5% true positive rate), indicating that the false positive rate of the splicing code pipeline is much lower.
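The event classification behind these rates can be sketched with the thresholds given in the Figure 5I caption: an event is confidently changing if deltaPSI exceeds 0.15 in at least one pairwise comparison, and non-changing if deltaPSI is below 0.05 in all of them. The sketch below uses point PSI values and shorthand tissue labels, both assumptions for illustration; it ignores the confidence estimates that a MAJIQ-based analysis would include.

```python
from itertools import combinations

TISSUES = ("liver", "heart", "cerebellum")  # shorthand labels, assumed for this sketch


def classify_event(psi_by_tissue, changing=0.15, non_changing=0.05):
    """Label one event and list the tissue pairs in which it changes."""
    deltas = {
        (a, b): abs(psi_by_tissue[a] - psi_by_tissue[b])
        for a, b in combinations(TISSUES, 2)
    }
    hits = [pair for pair, d in deltas.items() if d > changing]
    if hits:
        return "changing", hits
    if all(d < non_changing for d in deltas.values()):
        return "non-changing", []
    return "ambiguous", []


def validation_rate(labels):
    """Changing events out of confidently changing plus non-changing events."""
    n_changing = labels.count("changing")
    n_static = labels.count("non-changing")
    return n_changing / (n_changing + n_static)


# Hypothetical PSI values for one junction across the three tissues.
label, pairs = classify_event({"liver": 0.10, "heart": 0.12, "cerebellum": 0.60})
```

Because one event can change in more than one pairwise comparison, the number of pairwise differences (292) can exceed the number of unique changing LSVs (171).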
To validate our original tissue-specific prediction pipeline, we assessed if the post hoc tissue-specific splicing change frequencies we empirically observed in LSV-seq reflected our a priori knowledge of different subcategories we expected to be informative. If our RBP binding and splicing code pipelines performed correctly, then we expect that as we increase the stringency of the threshold for calling putative changing events, the rate of false positives should decrease, at the cost of total true positives detected. Indeed, for both pipelines, we observe exactly this, with fewer false positives called as we increased the prediction threshold stringency in each pipeline, reaching a maximum validation rate of 44.4% in the RBP binding pipeline and 42.5% in the splicing code pipeline (Figure 5I). For each tissue-specific splicing regulatory RBP we identified in our RBP binding pipeline, we also calculated the expected discovery rate based on the prevalence of the RBP’s binding sites within tissue-specific splicing events in high-coverage RNA-seq. We then explicitly tested the concordance between each RBP’s expected discovery rate from RNA-seq and its true observed discovery rate in LSV-seq, and observed a strong positive correlation (R = 0.75) (Supplementary Figure S6E).
Next, we further examined the 171 tissue-specific splicing events that we newly recovered using LSV-seq and that could not previously be captured owing to the limited coverage of the GTEx RNA-seq data. Within this set, we found many examples of splicing that were highly specific to each of the three tissues assayed (Supplementary Figure S7), often involving splice junctions that were detected in only one tissue type. Among the splicing events we found to be unique to the brain cerebellum, we achieved robust splicing detection of an ENAH microexon (Supplementary Figure S7A), which was previously identified in a study of neuronal microexons (40). Our LSV-seq results for the ENAH microexon (PSI = 0.39) were consistent with this independent microexon study (cerebellum PSI = 0.34), supporting the accuracy of LSV-seq in quantification of targeted events.
Another brain-specific AS event involved the RAB3GAP1 gene. Rare genetic variants within RAB3GAP1 have previously been reported to cause the autosomal recessive disorders Warburg micro syndrome and Martsolf syndrome, both of which are associated with a collection of profound neurological deficits, including visual impairment, brain abnormalities and hypotonia (41). One such pathogenic mutation induces a frameshift very early within the N-terminal domain, likely almost fully abrogating the transcript, leading to the suggestion that a previously detected alternatively spliced isoform lacking the first 50 N-terminal amino acids could partially rescue RAB3GAP1 function. Using LSV-seq, we measured the relative presence of this alternatively spliced isoform (blue junction) compared with the dominant isoform (red junction) across tissues, whereas previous GTEx RNA-seq coverage was insufficient for detection (Figure 6A). We found that the alternatively spliced isoform is expressed specifically in the neural tissue we tested, brain cerebellum, and not in the other two tissues, heart atrial appendage and liver (Figure 6B). Interestingly, this change in AS correlates with the presence of binding sites for the RBPs TIA1 and KHSRP (Figure 6A), which both have significantly higher expression in the brain cerebellum compared with the other tissues (Figure 6C). However, a direct functional role for these RBPs in regulation of RAB3GAP1 splicing remains to be experimentally validated.
Figure 6.
Tissue-specific splicing events recovered by LSV-seq. (A) RAB3GAP1 gene track with splice junctions shown as red or blue arcs. Also shown below are ENCODE CLIP IDR binding peaks for the RBPs KHSRP and TIA1. (B) RAB3GAP1 LSV splicing and PSI values across tissues as measured by LSV-seq. (C) Expression of KHSRP and TIA1 RBPs across all samples in GTEx for each of the three tissues. (D) LGALS9 LSV splicing and PSI values across tissues as measured by LSV-seq. Also shown are the gene structure of LGALS9 with annotation of the N/C-terminal CRDs and alternatively spliced linker region, and the VOILA splice graph of the LSV (corresponding to boxed area). (E) Complex LSV splicing and PSI values for PACSIN3 across tissues as measured by LSV-seq. Also shown is the VOILA splice graph with the dominant splice junctions for each tissue represented by thicker arcs. Numbers in PSI graph indicate PSI value for that specific splice junction.
In addition to events that were unique to a single tissue, we also recovered tissue-specific splicing that was quantitatively different across all three tissues, such as in LGALS9 (Figure 6D). LGALS9, or galectin-9, plays an important role in immunomodulation through binding to the TIM-3 receptor on immune cells (42). As this interaction can suppress immune responses, the galectin-9/TIM-3 axis is being actively investigated as a potential target for immunotherapy. Galectin-9 is expressed as several different isoforms that differ in the length of the linker region connecting its two carbohydrate recognition domains (CRDs). Differences in linker length affect the rotational freedom of the CRDs and multivalency of galectin-9, which can then impact its interactions with other proteins (43). Using LSV-seq, we detected differential splicing in this linker region, with heart tissue favoring galectin-9 isoforms containing a longer linker and brain tissue favoring isoforms with a shorter linker (Figure 6D). Finally, LSV-seq also captured complex splicing in many cases, highlighting its unique ability to detect multiple known and de novo junctions at each targeted LSV using only a single primer. For example, we detected particularly complex splicing in the 5′ region of PACSIN3, with many different splice junctions being used across all three tissues. Interestingly, LSV-seq also revealed that each tissue exhibited dominant splicing of a different junction, suggesting tissue-specific regulation of the PACSIN3 5′UTR (Figure 6E). Altogether, our results illustrate the ability of LSV-seq to capture quantifications for biologically relevant splicing events that may escape detection with conventional RNA-seq.
Discussion
Although short-read RNA-seq continues to be the standard approach for splicing analysis, it is inefficient at capturing the splice junctions needed for accurate quantification. Indeed, our analysis of the GTEx RNA-seq dataset revealed inadequate detection of a sizable fraction of biologically important splicing variation across tissues. To address this limitation, we developed LSV-seq, a method for targeted detection and quantification of up to thousands of splicing events of interest. In comparison with conventional RNA-seq, we demonstrate that LSV-seq provides significantly improved coverage at targeted events and accurate splicing quantification even with markedly reduced sequencing depths. We also find that LSV-seq can recover splicing information at events with poor coverage in GTEx RNA-seq data to reveal novel forms of tissue-specific splicing. Altogether, LSV-seq offers an efficient and versatile method for the study of AS in humans and other organisms with comparably complex transcriptomes.
LSV-seq is inspired by a previous method, MPE-seq, which was originally developed for detection of splicing in yeast (11). Like LSV-seq, MPE-seq enriches for RNA regions of interest by performing multiplexed RT with pools of target-specific primers. However, our initial attempts to apply MPE-seq to human RNA resulted in a majority of off-target reads (Supplementary Figure S1). Additionally, we note that a previous study performed primer extension based on MPE-seq in human cells, although only in a low-throughput format for a single intron (44). We reasoned that our observations could be due to the much greater complexity of the human transcriptome [∼250 000 annotated transcripts (45)] as compared with yeast [∼7000 annotated transcripts (46)]. As a result, we focused most of our efforts on improving the specificity of the RT reaction through both experimental and computational optimizations. For example, we incorporated a linear amplification step, used in some single-cell RNA-seq protocols, allowing us to reduce the amount of input RNA needed by at least 10-fold. However, as the 5 µg of input RNA we recommend may still exceed what is available, especially in the context of valuable clinical samples, further experiments will be needed to determine the compatibility of LSV-seq with lower input conditions. We also hope in the future to establish LSV-seq for use in accurate detection and quantification of intron retention. This would require careful consideration of approaches for tabulating the key intron-spanning and exon–intron junction reads, as well as systematically benchmarking performance across introns of different lengths.
When compared with other targeted sequencing methods such as CaptureSeq (6), RASL-Seq (8) and TempO-seq (9), one key advantage of LSV-seq is its ability to detect all junctions adjacent to the targeted region using only a single primer. Thus, with far fewer primers, LSV-seq can capture both simple and complex splicing events, including those with novel, unannotated junctions. Although each LSV-seq primer is limited to detecting splicing variation at target LSVs, we found that this is sufficient to capture ∼80% of splice junctions (Figure 1B). Currently, the primary limitation of LSV-seq is its inability to capture the remaining 20% of splice junctions which have variations only detectable by quantifying source LSVs. Based on further focused analysis, we found that these splice junctions almost entirely represent alternative last exon events. We anticipate that variation at these remaining junctions found in source LSVs could be captured by using combinations of primers anchored near each downstream junction, although this remains to be tested. More recently, long-read technologies have emerged with the ability to sequence full-length transcripts. Despite this advantage, current long-read RNA-seq studies are generally better equipped to tackle novel isoform detection, while accurate quantification remains an active area of development. Future work combining LSV-seq primer pools with long-read sequencing could facilitate sensitive and accurate splicing quantifications, analogous to what we have shown for short-read RNA-seq.
We credit the OP machine learning models for the strong performance of LSV-seq. Early in the development of our method, we discovered a lack of tools suitable for the design of RT-specific primers and noticed that most related protocols repurposed pipelines originally created for other multiplexed assays such as microarray or fluorescence in situ hybridization. Indeed, our first attempts at multiplexed RT using these off-the-shelf tools resulted in primers with significant off-target behavior or failure to prime desired targets. We also reviewed other tools from the literature tailored for designing splice junction-spanning primers and found these were also not well-suited for LSV-seq, either because they lacked adequate experimental validation or focused on generating pairs of PCR primers rather than target-specific RT primers (47–50). Thus, we sought to use a data-driven approach to first discover optimal RT primer properties and then train machine learning models to predict primer performance. For this, we defined two optimal priming measures, namely target specificity and yield, and found that none of the classical primer features, such as Tm or GC content, correlated well enough on their own with these measures to be used independently. However, by combining features together using the OP algorithm, we were able to substantially improve both the yield and specificity of the designed primers.
Interestingly, models based on boosted decision trees outperformed both CNNs and transformers, suggesting that our handcrafted features provide near optimal design for our current dataset. However, we anticipate the deep learning models could eventually overtake the performance of our boosted decision tree models when given more training data from additional LSV-seq experiments. As an additional consideration for further improvement, unwanted binding of primers with themselves and each other could intuitively hinder primer performance. We anticipate future integration of predicted cross-hybridization scores as an additional predictive feature in our existing OP models and primer design pipelines. Notably, cross-hybridization score implementation would likely require development of a novel iterative design algorithm to remove primers with the greatest predicted hybridization interactions, as in prior work (51).
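To make the idea of iteratively removing primers with the greatest predicted hybridization interactions concrete, the sketch below applies a simple greedy pruning loop. It is not the SADDLE algorithm (51) and not part of the current OP pipeline; the interaction score is a crude, hypothetical proxy (counts of shared reverse-complementary k-mers) standing in for a thermodynamic prediction, and the names `pair_score` and `prune_pool` are illustrative assumptions.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def pair_score(a, b, k=4):
    """Crude interaction proxy: number of k-mers in `a` whose reverse
    complement occurs in `b`. A real design would use a thermodynamic model."""
    kmers_b = {b[i:i + k] for i in range(len(b) - k + 1)}
    return sum(
        1 for i in range(len(a) - k + 1) if revcomp(a[i:i + k]) in kmers_b
    )


def prune_pool(primers, max_total=0, k=4):
    """Greedily drop the primer with the highest summed interaction score
    (self-interaction included) until every primer scores <= max_total.

    primers: dict mapping primer name -> sequence.
    Returns (remaining_pool, removed_names).
    """
    pool = dict(primers)
    removed = []
    while len(pool) > 1:
        totals = {
            name: sum(pair_score(seq, other, k) for other in pool.values())
            for name, seq in pool.items()
        }
        worst = max(totals, key=totals.get)
        if totals[worst] <= max_total:
            break
        removed.append(worst)
        del pool[worst]
    return pool, removed


# Toy pool (hypothetical sequences): p1 and p2 are fully complementary.
toy_pool = {"p1": "AAAACCCC", "p2": "GGGGTTTT", "p3": "ACACACAC"}
kept, dropped = prune_pool(toy_pool)
print("kept:", sorted(kept), "dropped:", dropped)
```

A greedy loop of this kind recomputes all pairwise scores after each removal, which is quadratic in pool size per iteration; for pools of thousands of primers, a cached score matrix or a stochastic search such as that used in prior work (51) would likely be preferable.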
One important observation regarding the primer optimization is that the OP lite model, which ignored off-target transcriptomic alignments and relative gene expression, still offered good performance compared with the OP full model. This lite model thus offers greater flexibility when condition-specific measurements are not available and for applications beyond LSV-seq. To enable the research community to take advantage of our OP algorithm for primer design, we have made it accessible via a webtool at URL: http://tools.biociphers.org/lsv-seq. By precomputing computationally expensive values or by running an alignment-free version, the webtool offers lightning-fast retrieval of optimal primers for targets of interest.
Our work also demonstrates the strengths and limitations of current approaches for the prediction of human tissue-specific splicing events. Here, one approach used was the incorporation of known RBP data (52,53) while the other was based on sequence context and tissue identity (53–55). The first approach was represented by a pipeline for analysis of ENCODE CLIP data, while the second was based on a tissue-specific splicing code model (without CLIP information). Our results using these two approaches reveal that there is little concordance in the predictions provided by our RBP binding and splicing code pipelines. One possible explanation for this result is that the splicing code pipeline focuses on binary cassette event prediction, while the RBP binding pipeline allows for prediction across a wider range of splicing event types. The lack of overlap may also indicate that our current splicing code model does not yet fully reflect the underlying biology of RBPs and their corresponding binding site sequence motifs. Although there is still room for improvement, the splicing code pipeline had a far lower false-positive rate for discovery of tissue-specific splicing events compared with the RBP binding pipeline. Thus, our analysis validates the concurrent use of both pipelines in order to maximize the number of tissue-specific changing events we can recover. Future work can also further explore combined usage of splicing code modeling with data from RBP binding assays, as was done previously (52,53). Regardless of the exact model or pipeline, LSV-seq provides an efficient method to validate and further refine such computational splicing predictions.
From our prioritized selection of targets that had limited coverage in GTEx RNA-seq data, we recovered 171 events with tissue-specific splicing using LSV-seq. Across the three tissues we assayed (brain cerebellum, heart atrial appendage and liver), we discovered several instances of splicing that were highly specific to each tissue type, as well as splicing that was less specific but quantifiably different between tissues. While we highlighted examples in RAB3GAP1, LGALS9 and PACSIN3, the majority of the splicing variation we detected is uncharacterized, and our results nominate splicing events of interest for further functional analysis. Thus, LSV-seq provides the ability to focus on unexplored areas to recover splicing information and generate novel hypotheses about isoform regulation and function. Our findings also motivate further work to improve the mapping of AS across human tissues.
We anticipate that LSV-seq could be directly transferred in its current form to a number of additional applications. For instance, it has the potential to recover extremely rare splice junctions which are especially challenging to detect even with very deep RNA-seq. Targeted sequencing was recently used to recover transient splicing intermediates at putative recursive splicing sites in human cells, but was only performed for a single intron (44). In addition to its sensitivity, LSV-seq is much more cost-effective than RNA-seq and could make it more feasible to study splicing across large numbers of samples or conditions. For instance, LSV-seq could be used in conjunction with high-throughput drug or genetic perturbation to quantify effects on thousands of AS events across thousands of different conditions. Such high-dimensional ‘many-by-many’ experiments could provide valuable insight into the complex network underlying regulation of AS. While larger LSV-seq primer pools do add an appreciable upfront reagent cost, each pool can be used for thousands of samples, and as experiments scale up in size, this primer cost is significantly offset by the 5- to 10-fold reduction in sequencing depth that is needed for LSV-seq. Although our study used at most ∼2000 primers, we expect LSV-seq to scale well to larger numbers of targets. However, the actual limits of LSV-seq remain unknown and we predict that increasing the number of targeted events will likely lead to more off-target priming and potential primer–primer interactions as well. Our primer selection pipeline could also be broadly useful for various other contexts which require region-specific RT primers. For instance, due to the stringent requirement for >5000 reads per nucleotide within analyzed transcripts, SHAPE-seq relies on numerous RT primers tiled across specific genes of interest (56,57). For SHAPE-seq and other related methods that analyze RNA structure, our primer selection pipeline could directly improve assay throughput and performance.
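The cost argument above, in which an upfront primer pool cost is offset by reduced sequencing depth, can be expressed as a break-even calculation. The sketch below assumes that sequencing cost scales linearly with depth and ignores any difference in library preparation cost; the function name `break_even_samples` and all cost values are hypothetical placeholders in arbitrary units, not figures from this study. Only the 5- to 10-fold depth reduction is taken from the text.

```python
import math


def break_even_samples(pool_cost, rnaseq_cost_per_sample, depth_reduction_fold):
    """Smallest number of samples at which the one-time primer pool cost is
    recovered by per-sample sequencing savings.

    Assumes per-sample sequencing cost is proportional to depth, so the
    targeted assay costs rnaseq_cost_per_sample / depth_reduction_fold.
    """
    targeted_cost = rnaseq_cost_per_sample / depth_reduction_fold
    saving_per_sample = rnaseq_cost_per_sample - targeted_cost
    if saving_per_sample <= 0:
        raise ValueError("depth_reduction_fold must be greater than 1")
    return math.ceil(pool_cost / saving_per_sample)


# Hypothetical costs in arbitrary units, evaluated at both ends of the
# 5- to 10-fold depth reduction range.
for fold in (5, 10):
    n = break_even_samples(
        pool_cost=3000, rnaseq_cost_per_sample=100, depth_reduction_fold=fold
    )
    print(f"{fold}-fold reduction: break-even at {n} samples")
```

Under these placeholder costs the break-even point falls at a few dozen samples, so the pool cost matters less as experiments grow, although the actual figure depends entirely on real reagent and sequencing prices.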
In summary, we have developed LSV-seq as a sensitive and cost-effective method for the detection and quantification of AS. We envision that LSV-seq will also enable studies in additional areas of RNA biology, helping to better capture other challenging yet important features of the transcriptome.
Acknowledgements
We thank the members of the Choi and Barash laboratories for helpful discussions. We want to specifically thank Farica Zhuang for assisting in development of the OP transformer model and Joseph Aicher for implementation of the specific variant of the MAJIQv3 algorithm we used for LSV-seq analysis. We also thank Drs Hagen Tilgner, Sydney Shaffer and Brian Gregory for valuable feedback. Portions of the graphical abstract and Figure 5D were created with BioRender.com.
Contributor Information
Kevin Yang, Department of Genetics, University of Pennsylvania, Philadelphia, PA 19104, USA; Department of Pathology & Laboratory Medicine, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA 19104, USA; Division of Cancer Pathobiology, Children’s Hospital of Philadelphia, Philadelphia, PA 19104, USA.
Nathaniel Islas, Department of Computer and Information Science, University of Pennsylvania, Philadelphia, PA 19104, USA.
San Jewell, Department of Genetics, University of Pennsylvania, Philadelphia, PA 19104, USA.
Di Wu, Department of Computer and Information Science, University of Pennsylvania, Philadelphia, PA 19104, USA.
Anupama Jha, Department of Genome Sciences, University of Washington, Seattle, WA 98195, USA.
Caleb M Radens, Department of Genetics, University of Pennsylvania, Philadelphia, PA 19104, USA.
Jeffrey A Pleiss, Department of Molecular Biology and Genetics, Cornell University, Ithaca, NY 14853, USA.
Kristen W Lynch, Department of Biochemistry and Biophysics, University of Pennsylvania, Philadelphia, PA 19104, USA.
Yoseph Barash, Department of Genetics, University of Pennsylvania, Philadelphia, PA 19104, USA; Department of Computer and Information Science, University of Pennsylvania, Philadelphia, PA 19104, USA.
Peter S Choi, Department of Pathology & Laboratory Medicine, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA 19104, USA; Division of Cancer Pathobiology, Children’s Hospital of Philadelphia, Philadelphia, PA 19104, USA.
Data availability
Raw and processed data from RNA-seq and LSV-seq experiments were deposited to GEO under accession number GSE246294. Processed data and code to reproduce all figures were deposited to a Zenodo repository at URL: https://doi.org/10.5281/zenodo.13999558. Auxiliary files required for certain analyses are deposited at the following Zenodo DOIs: https://doi.org/10.5281/zenodo.8323103 for the primary LSV-seq method and https://doi.org/10.5281/zenodo.8190734 for the OP model.
Code availability
Code for the analyses and processing pipelines is published in two different code repositories. Code for the primary LSV-seq method is available at https://bitbucket.org/biociphers/lsv_seq_method/. Code for the Optimal Prime model is available for academic/non-commercial use at https://majiq.biociphers.org/optimalprime/app_download/. Licensing information for commercial use of the model can be found at https://majiq.biociphers.org/optimalprime/commercial.php. The webtool implementation of the Optimal Prime model is published and freely available for use at https://tools.biociphers.org/lsv-seq/.
Supplementary data
Supplementary Data are available at NAR Online.
Funding
U.S. National Library of Medicine (NLM) [R01-LM-013437 to Y.B.]; National Institute of General Medical Sciences (NIGMS) [GM128096 to Y.B.; DP2GM146251 to P.S.C.]; National Cancer Institute (NCI) [R00CA208028 to P.S.C.]. Funding for open access charge: NIH [DP2GM146251].
Conflict of interest statement. None declared.
References
- 1. GTEx Consortium. The GTEx Consortium atlas of genetic regulatory effects across human tissues. Science. 2020; 369:1318–1330.
- 2. Kahles A., Lehmann K.-V., Toussaint N.C., Hüser M., Stark S.G., Sachsenberg T., Stegle O., Kohlbacher O., Sander C., Caesar-Johnson S.J. et al. Comprehensive analysis of alternative splicing across tumors from 8,705 patients. Cancer Cell. 2018; 34:211–224.
- 3. Verwilt J., Mestdagh P., Vandesompele J. Artifacts and biases of the reverse transcription reaction in RNA sequencing. RNA. 2023; 29:889–897.
- 4. Davies P., Jones M., Liu J., Hebenstreit D. Anti-bias training for (sc)RNA-seq: experimental and computational approaches to improve precision. Brief. Bioinform. 2021; 22:bbab148.
- 5. Zheng W., Chung L.M., Zhao H. Bias detection and correction in RNA-sequencing data. BMC Bioinformatics. 2011; 12:290.
- 6. Mercer T.R., Gerhardt D.J., Dinger M.E., Crawford J., Trapnell C., Jeddeloh J.A., Mattick J.S., Rinn J.L. Targeted RNA sequencing reveals the deep complexity of the human transcriptome. Nat. Biotechnol. 2012; 30:99–104.
- 7. Wang F., Xu Y., Wang R., Zhang B., Smith N., Notaro A., Gaerlan S., Kutschera E., Kadash-Edmondson K.E., Xing Y. et al. TEQUILA-seq: a versatile and low-cost method for targeted long-read RNA sequencing. Nat. Commun. 2023; 14:4760.
- 8. Li H., Qiu J., Fu X.-D. RASL-seq for massively parallel and quantitative analysis of gene expression. Curr. Protoc. Mol. Biol. 2012; 98:4.13.1–4.13.9.
- 9. Yeakley J.M., Shepard P.J., Goyena D.E., VanSteenhouse H.C., McComb J.D., Seligmann B.E. A trichostatin A expression signature identified by TempO-Seq targeted whole transcriptome profiling. PLoS One. 2017; 12:e0178302.
- 10. Zheng Z., Liebers M., Zhelyazkova B., Cao Y., Panditi D., Lynch K.D., Chen J., Robinson H.E., Shim H.S., Chmielecki J. et al. Anchored multiplex PCR for targeted next-generation sequencing. Nat. Med. 2014; 20:1479–1484.
- 11. Xu H., Fair B.J., Dwyer Z.W., Gildea M., Pleiss J.A. Detection of splice isoforms and rare intermediates using multiplexed primer extension sequencing. Nat. Methods. 2019; 16:55–58.
- 12. Vaquero-Garcia J., Barrera A., Gazzara M.R., Gonzalez-Vallinas J., Lahens N.F., Hogenesch J.B., Lynch K.W., Barash Y. A new view of transcriptome complexity and regulation through the lens of local splicing variations. eLife. 2016; 5:e11752.
- 13. Lynch K.W., Weiss A. A model system for activation-induced alternative splicing of CD45 pre-mRNA in T cells implicates protein kinase C and Ras. Mol. Cell. Biol. 2000; 20:70–80.
- 14. Gildea M.A., Dwyer Z.W., Pleiss J.A. Multiplexed primer extension sequencing: a targeted RNA-seq method that enables high-precision quantitation of mRNA splicing isoforms and rare pre-mRNA splicing intermediates. Methods. 2020; 176:34–45.
- 15. Gazzara M.R., Mallory M.J., Roytenberg R., Lindberg J.P., Jha A., Lynch K.W., Barash Y. Ancient antagonism between CELF and RBFOX families tunes mRNA splicing outcomes. Genome Res. 2017; 27:1360–1370.
- 16. Wu D., Jha A., Jewell S., Maus N., Gardner J.R., Barash Y. Generative modeling for RNA splicing code predictions and design. 2023; NeurIPS 2023. OpenReview https://openreview.net/forum?id=UZTpkfw0aC.
- 17. Quinlan A.R., Hall I.M. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010; 26:841–842.
- 18. Cock P.J.A., Antao T., Chang J.T., Chapman B.A., Cox C.J., Dalke A., Friedberg I., Hamelryck T., Kauff F., Wilczynski B. et al. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics. 2009; 25:1422–1423.
- 19. Altschul S.F., Gish W., Miller W., Myers E.W., Lipman D.J. Basic local alignment search tool. J. Mol. Biol. 1990; 215:403–410.
- 20. Zadeh J.N., Steenberg C.D., Bois J.S., Wolfe B.R., Pierce M.B., Khan A.R., Dirks R.M., Pierce N.A. NUPACK: analysis and design of nucleic acid systems. J. Comput. Chem. 2011; 32:170–173.
- 21. Dobin A., Davis C.A., Schlesinger F., Drenkow J., Zaleski C., Jha S., Batut P., Chaisson M., Gingeras T.R. STAR: ultrafast universal RNA-seq aligner. Bioinformatics. 2013; 29:15–21.
- 22. Bray N.L., Pimentel H., Melsted P., Pachter L. Near-optimal probabilistic RNA-seq quantification. Nat. Biotechnol. 2016; 34:525–527.
- 23. Smith T., Heger A., Sudbery I. UMI-tools: modeling sequencing errors in Unique Molecular Identifiers to improve quantification accuracy. Genome Res. 2017; 27:491–499.
- 24. Liu D. Algorithms for efficiently collapsing reads with Unique Molecular Identifiers. PeerJ. 2019; 7:e8275.
- 25. Lundberg S.M., Erion G., Chen H., DeGrave A., Prutkin J.M., Nair B., Katz R., Himmelfarb J., Bansal N., Lee S.-I. From local explanations to global understanding with explainable AI for trees. Nat. Mach. Intell. 2020; 2:56–67.
- 26. Zhuang F., Gutman D., Islas N., Guzman B.B., Jimenez A., Jewell S., Hand N.J., Nathanson K., Dominguez D., Barash Y. G4mer: an RNA language model for transcriptome-wide identification of G-quadruplexes and disease variants from population-scale genetic data. 2024; bioRxiv doi: 10.1101/2024.10.01.616124, 03 October 2024, preprint: not peer reviewed.
- 27. Hashimshony T., Senderovich N., Avital G., Klochendler A., de Leeuw Y., Anavy L., Gennert D., Li S., Livak K.J., Rozenblatt-Rosen O. et al. CEL-Seq2: sensitive highly-multiplexed single-cell RNA-seq. Genome Biol. 2016; 17:77.
- 28. Untergasser A., Cutcutache I., Koressaar T., Ye J., Faircloth B.C., Remm M., Rozen S.G. Primer3—new capabilities and interfaces. Nucleic Acids Res. 2012; 40:e115.
- 29. Rouillard J., Zuker M., Gulari E. OligoArray 2.0: design of oligonucleotide probes for DNA microarrays using a thermodynamic approach. Nucleic Acids Res. 2003; 31:3057–3062.
- 30. Beliveau B.J., Kishi J.Y., Nir G., Sasaki H.M., Saka S.K., Nguyen S.C., Wu C.-T., Yin P. OligoMiner provides a rapid, flexible environment for the design of genome-scale oligonucleotide in situ hybridization probes. Proc. Natl Acad. Sci. U.S.A. 2018; 115:E2183–E2192.
- 31. Doench J.G., Fusi N., Sullender M., Hegde M., Vaimberg E.W., Donovan K.F., Smith I., Tothova Z., Wilen C., Orchard R. et al. Optimized sgRNA design to maximize activity and minimize off-target effects of CRISPR-Cas9. Nat. Biotechnol. 2016; 34:184–191.
- 32. Zhang J.X., Yordanov B., Gaunt A., Wang M.X., Dai P., Chen Y.-J., Zhang K., Fang J.Z., Dalchau N., Li J. et al. A deep learning model for predicting next-generation sequencing depth from DNA sequence. Nat. Commun. 2021; 12:4387.
- 33. Das K., Martinez S.E., DeStefano J.J., Arnold E. Structure of HIV-1 RT/dsRNA initiation complex prior to nucleotide incorporation. Proc. Natl Acad. Sci. U.S.A. 2019; 116:7308–7313.
- 34. Mallory M.J., Allon S.J., Qiu J., Gazzara M.R., Tapescu I., Martinez N.M., Fu X.-D., Lynch K.W. Induced transcription and stability of CELF2 mRNA drives widespread alternative splicing during T-cell signaling. Proc. Natl Acad. Sci. U.S.A. 2015; 112:E2139–E2148.
- 35. Bland J.M., Altman D. Statistical methods for assessing agreement between two methods of clinical measurement. Lancet. 1986; 327:307–310.
- 36. Vaquero-Garcia J., Aicher J.K., Jewell S., Gazzara M.R., Radens C.M., Jha A., Norton S.S., Lahens N.F., Grant G.R., Barash Y. RNA splicing analysis using heterogeneous and large RNA-seq datasets. Nat. Commun. 2023; 14:1230.
- 37. Dwyer Z.W., Pleiss J.A. The problem of selection bias in studies of pre-mRNA splicing. Nat. Commun. 2023; 14:1966.
- 38. Dunham I., Kundaje A., Aldred S.F., Collins P.J., Davis C.A., Doyle F., Epstein C.B., Frietze S., Harrow J., Kaul R. et al. An integrated encyclopedia of DNA elements in the human genome. Nature. 2012; 489:57–74.
- 39. Luo Y., Hitz B.C., Gabdank I., Hilton J.A., Kagda M.S., Lam B., Myers Z., Sud P., Jou J., Lin K. et al. New developments on the encyclopedia of DNA elements (ENCODE) data portal. Nucleic Acids Res. 2020; 48:D882–D889.
- 40. Irimia M., Weatheritt R.J., Ellis J.D., Parikshak N.N., Gonatopoulos-Pournatzis T., Babor M., Quesnel-Vallières M., Tapial J., Raj B., O’Hanlon D. et al. A highly conserved program of neuronal microexons is misregulated in autistic brains. Cell. 2014; 159:1511–1523.
- 41. Handley M.T., Morris-Rosendahl D.J., Brown S., Macdonald F., Hardy C., Bem D., Carpanini S.M., Borck G., Martorell L., Izzi C. et al. Mutation spectrum in RAB3GAP1, RAB3GAP2, and RAB18 and genotype–phenotype correlations in Warburg micro syndrome and Martsolf syndrome. Hum. Mutat. 2013; 34:686–696.
- 42. Wolf Y., Anderson A.C., Kuchroo V.K. TIM3 comes of age as an inhibitory receptor. Nat. Rev. Immunol. 2020; 20:173–185.
- 43. Heusschen R., Griffioen A.W., Thijssen V.L. Galectin-9 in tumor biology: a jack of multiple trades. Biochim. Biophys. Acta. 2013; 1836:177–185.
- 44. Wan Y., Anastasakis D.G., Rodriguez J., Palangat M., Gudla P., Zaki G., Tandon M., Pegoraro G., Chow C.C., Hafner M. et al. Dynamic imaging of nascent RNA reveals general principles of transcription dynamics and stochastic splice site selection. Cell. 2021; 184:2878–2895.
- 45. Frankish A., Diekhans M., Jungreis I., Lagarde J., Loveland J.E., Mudge J.M., Sisu C., Wright J.C., Armstrong J., Barnes I. et al. GENCODE 2021. Nucleic Acids Res. 2021; 49:D916–D923.
- 46. Cherry J.M., Hong E.L., Amundsen C., Balakrishnan R., Binkley G., Chan E.T., Christie K.R., Costanzo M.C., Dwight S.S., Engel S.R. et al. Saccharomyces Genome Database: the genomics resource of budding yeast. Nucleic Acids Res. 2012; 40:D700–D705.
- 47. Monfort-Lanzas P., Rusu E.C., Parrakova L., Karg C.A., Kernbichler D.-E., Rieder D., Lackner P., Hackl H., Gostner J.M. ExonSurfer: a web-tool to design primers at exon–exon junctions. BMC Genomics. 2024; 25:594.
- 48. Govindkumar B., Kavyashree B., Patel K., Sasidharan K., Siva Arumugam T., Thomas L., Praveena B.K.G., Raksha H.N., Menon R., Acharya K.K. Ex-Ex primer: an experimentally validated tool for designing oligonucleotides spanning spliced nucleic acid regions from multiple species. J. Biotechnol. 2022; 343:1–6.
- 49. Jeon H., Bae J., Hwang S.-H., Whang K.-Y., Lee H.-S., Kim H., Kim M.-S. MRPrimerW2: an enhanced tool for rapid design of valid high-quality primers with multiple search modes for qPCR experiments. Nucleic Acids Res. 2019; 47:W614–W622.
- 50. You F.M., Wanjugi H., Huo N., Lazo G.R., Luo M.-C., Anderson O.D., Dvorak J., Gu Y.Q. RJPrimers: unique transposable element insertion junction discovery and PCR primer design for marker development. Nucleic Acids Res. 2010; 38:W313–W320.
- 51. Xie N.G., Wang M.X., Song P., Mao S., Wang Y., Yang Y., Luo J., Ren S., Zhang D.Y. Designing highly multiplex PCR primer sets with Simulated Annealing Design using Dimer Likelihood Estimation (SADDLE). Nat. Commun. 2022; 13:1881.
- 52. Zhang Z., Pan Z., Ying Y., Xie Z., Adhikari S., Phillips J., Carstens R.P., Black D.L., Wu Y., Xing Y. Deep-learning augmented RNA-seq analysis of transcript splicing. Nat. Methods. 2019; 16:307–310.
- 53. Jha A., Gazzara M.R., Barash Y.. Integrative deep models for alternative splicing. Bioinformatics. 2017; 33:i274–i282. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 54. Zeng T., Li Y.I.. Predicting RNA splicing from DNA sequence using Pangolin. Genome Biol. 2022; 23:103. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 55. Cheng J., Çelik M.H., Kundaje A., Gagneur J.. MTSplice predicts effects of genetic variants on tissue-specific splicing. Genome Biol. 2021; 22:94. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 56. Busan S., Weidmann C.A., Sengupta A., Weeks K.M.. Guidelines for SHAPE reagent choice and detection strategy for RNA structure probing studies. Biochemistry. 2019; 58:2655–2664. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 57. Siegfried N.A., Busan S., Rice G.M., Nelson J.A.E., Weeks K.M.. RNA motif discovery by SHAPE and mutational profiling (SHAPE-MaP). Nat. Methods. 2014; 11:959–965. [DOI] [PMC free article] [PubMed] [Google Scholar]
Associated Data
Data Availability Statement
Raw and processed data from the RNA-seq and LSV-seq experiments have been deposited in GEO under accession number GSE246294. Processed data and code to reproduce all figures have been deposited in Zenodo at https://doi.org/10.5281/zenodo.13999558. Auxiliary files required for certain analyses are available under the following Zenodo DOIs: https://doi.org/10.5281/zenodo.8323103 for the primary LSV-seq method and https://doi.org/10.5281/zenodo.8190734 for the OP model.