Abstract
Motivation
Percent Spliced-In (PSI) values are commonly used to report alternative pre-mRNA splicing (AS) changes. Previous PSI-detection tools were limited to specific AS events and were evaluated with in silico RNA-seq data. We developed PSI-Sigma, which uses a new PSI index, and we employed actual (non-simulated) RNA-seq data from spliced synthetic genes (RNA Sequins) to benchmark its performance (i.e. precision, recall, false positive rate and correlation) in comparison with three leading tools (rMATS, SUPPA2 and Whippet).
Results
PSI-Sigma outperformed these tools, especially in the case of AS events with multiple alternative exons and intron-retention events. We also briefly evaluated its performance in long-read RNA-seq analysis, by sequencing a mixture of human RNAs and RNA Sequins with nanopore long-read sequencers.
Availability and implementation
PSI-Sigma is available at https://github.com/wososa/PSI-Sigma.
1 Introduction
Alternative pre-mRNA splicing (AS) generates diverse isoforms, some of which have cancer-biomarker potential or correspond to switches of cell identity (Lin et al., 2018). Moreover, modulating AS with antisense oligonucleotides is a proven therapeutic approach (Rinaldi and Wood, 2018). Thus, there is a need for sensitive and accurate assessment of AS patterns and isoform expression levels. In the case of AS analysis from short-read RNA-seq data, three major approaches have been used to calculate Percent Spliced-In (PSI) values. A two-isoform-based approach, such as MISO (Katz et al., 2010) and rMATS (Shen et al., 2014), is suitable to analyze simple AS events in which two splicing isoforms share one alternative exon, but it is not designed to detect multiple-exon-skipping (MES) or more complex splicing events. An isoform-resolution-based approach, such as SUPPA2 (Trincado et al., 2018), is limited to a set of predefined isoforms, and does not detect AS events involving more than one alternative exon. A splice-graph-based approach, such as JUM (Wang and Rio, 2018), MAJIQ (Vaquero-Garcia et al., 2016) and Whippet (Sterne-Weiler et al., 2018), can detect AS genes with more than one alternative exon, but it relies on the first alternative exon, and each alternative exon is assigned only one ΔPSI value. Moreover, these approaches were initially evaluated with in silico RNA-seq data and limited RT-PCR validation experiments. In real RNA-seq experiments, RNAs are processed through a series of steps (polyA-selection, fragmentation, reverse transcription and polymerase chain reaction) before being sequenced by short- or long-read sequencing technologies. Although in silico RNA-seq analysis is a useful proof-of-concept tool, a more objective approach is to benchmark splicing-detection tools using real RNA-seq data in which the ground truth is known and generated by a third party, such that both biological and technical variables can be taken into account.
In this study, we use real RNA-seq data from mixtures of spliced synthetic RNAs [RNA Sequins (Hardwick et al., 2016)] to benchmark the performance of PSI-Sigma and the leading tools (i.e. rMATS, SUPPA2 and Whippet) representing the three major splicing-detection approaches. RNA Sequins were designed to mimic the human transcriptome in various aspects (transcript length, GC content, exon size, intron size, etc.) and consist of 164 polyA-tailed RNA molecules, present in two mixtures (Mix A and Mix B) with different RNA concentrations. In addition, we generated long-read RNA-seq data from human cell-line RNA spiked with RNA Sequins to evaluate the performance of PSI-Sigma.
2 Materials and methods
2.1 The PSIΣ equation
Our method is based on a modified PSI (PSIΣ) equation, which we described earlier (Lin et al., 2018).
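As a hedged reconstruction rather than a quotation of the original equation, the index can be written as a ratio of summed splice-junction read counts. The roles assigned to the symbols are assumptions inferred from the description in this section: S_i are the read counts of the a types of splice junctions that support inclusion of the target exon in a given isoform, and S_j are the read counts of all b types of splice junctions in the region between the constitutive exons C1 and C2.

```latex
\mathrm{PSI}_{\Sigma} \;=\; \frac{\sum_{i=1}^{a} S_i}{\sum_{j=1}^{b} S_j} \times 100\%
```

Under this reading, an exon shared by two isoforms receives two values because each isoform contributes a different numerator over the same denominator.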
Briefly, for exon-skipping (or cassette exon) events, PSI-Sigma considers splice-junction reads (Si and Sj) of all isoforms in the region between two constitutive exons (C1 and C2). This enables PSI-Sigma to report PSI values of individual alternative exons. When an alternative exon is used by two different isoforms, PSI-Sigma reports two PSI values, based on the splice-junction reads (a and b) supporting each isoform. Also, for alternative splice-site events, PSI-Sigma considers the splice-junction reads of all the isoforms that share the same constitutive exon. When an alternative initial exon also shares the constitutive exon, PSI-Sigma takes its splice-junction reads into account, so an alternative transcriptional start site can be detected. Finally, for intron-retention (IR) events, PSI-Sigma counts the number of intronic reads crossing the first, 25th, 50th, 75th and 99th percentile positions of an intron in an IR event. Based on the median value of the five numbers, PSI-Sigma estimates the abundance of the IR isoform.
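The IR read-counting step described above can be sketched as follows. This is a minimal illustration, not the PSI-Sigma implementation: the function names are hypothetical, reads are modeled as single contiguous intervals, and the integer rounding used to place the percentile positions is an assumption. How the median is then converted into an abundance estimate for the IR isoform is not specified here, so the sketch stops at the median.

```python
from statistics import median

# Percentile positions along the intron, as listed in the text.
PERCENTILES = (1, 25, 50, 75, 99)


def percentile_positions(intron_start, intron_end):
    """Positions at the 1st, 25th, 50th, 75th and 99th percentiles of an intron.

    Integer (floor) rounding is an assumption made for this sketch.
    """
    length = intron_end - intron_start
    return [intron_start + (length * p) // 100 for p in PERCENTILES]


def reads_crossing(position, read_intervals):
    """Count reads whose aligned interval [start, end] covers the position."""
    return sum(1 for start, end in read_intervals if start <= position <= end)


def ir_read_support(intron_start, intron_end, read_intervals):
    """Median of the read counts at the five percentile positions."""
    counts = [
        reads_crossing(pos, read_intervals)
        for pos in percentile_positions(intron_start, intron_end)
    ]
    return median(counts)


# Hypothetical intron and intronic reads, for illustration only.
example_reads = [(900, 1600), (1200, 2100), (1400, 1800)]
example_support = ir_read_support(1000, 2000, example_reads)
```

Taking the median of five positions, rather than the mean over the whole intron, makes the estimate less sensitive to a local pile-up of reads in one part of the intron.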
2.2 Short-read and long-read RNA-seq data and analysis
The short-read RNA-seq data of RNA Sequins were downloaded from the NCBI Sequence Read Archive (SRR3743147 and SRR3743148). The long-read RNA-seq data from U87 cells plus RNA Sequins were generated on a GridION, and the library preparation was done according to the PCR cDNA-seq protocol from Oxford Nanopore. Short-read RNA-seq data of U87 cells were downloaded from the NCBI Sequence Read Archive (SRR2968851). Long-read RNA-seq data were aligned by using GMAP (version 2017-11-15, -d GRCh38 -f samse --min-trimmed-coverage=0.5 --no-chimeras -B 5 -t 6), whereas short-read RNA-seq data were aligned by STAR (version 2.5.3a, --outFilterIntronMotifs RemoveNoncanonical --twopassMode Basic). Gene-expression profiles of short-read RNA-seq data were generated with RSEM (version 1.3.0, --calc-pme --calc-ci --ci-memory 4000 --seed 123 --no-bam-output --paired-end), whereas the gene-expression profiles of long-read RNA-seq data were generated with our in-house script. Our script counts a long read as a supporting read for a gene by matching the segments of the long read to the exons of the gene in the alignment file. If the long read has two or more candidate genes, the script assigns the long read to the gene for which its alignment segments have the most matching exons. Exon peaks were visualized with the Integrative Genomics Viewer (Thorvaldsdottir et al., 2013).
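The read-to-gene assignment rule can be illustrated with a short sketch. It is not the in-house script itself: the function names are hypothetical, a segment is assumed to "match" an exon when the two intervals overlap, and ties are resolved in favor of the first gene encountered, none of which is specified in the text.

```python
def overlaps(a, b):
    """True if two closed intervals (start, end) share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]


def matching_exons(segments, exons):
    """Number of exons overlapped by at least one aligned segment of the read."""
    return sum(1 for exon in exons if any(overlaps(seg, exon) for seg in segments))


def assign_read(segments, genes):
    """Assign a long read to the candidate gene with the most matching exons.

    segments: list of (start, end) aligned blocks of one long read.
    genes:    dict mapping gene name -> list of (start, end) exons.
    Returns None when no exon of any gene is matched.
    """
    best_gene, best_count = None, 0
    for gene, exons in genes.items():
        count = matching_exons(segments, exons)
        if count > best_count:  # assumption: the first gene wins a tie
            best_gene, best_count = gene, count
    return best_gene


# Hypothetical overlapping genes and one three-block long read.
example_genes = {
    "GENE_A": [(100, 200), (300, 400), (500, 600)],
    "GENE_B": [(350, 450)],
}
example_read = [(110, 190), (310, 390), (510, 590)]
assigned = assign_read(example_read, example_genes)
```

In this example the read overlaps three exons of GENE_A and one exon of GENE_B, so it is counted as a supporting read for GENE_A only.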
2.3 The output format of PSI-Sigma
PSI-Sigma reports: (i) event coordinate, (ii) gene symbol, (iii) exon coordinate, (iv) event type, (v) ΔPSI, (vi) P-value, (vii) adjusted P-value and (viii) PSI values for each sample/replicate. An alternative exon can have multiple records if it is an overlapping exon in multiple isoforms. PSI-Sigma has been released and will be updated at https://github.com/wososa/PSI-Sigma.
3 Results
3.1 Performance of PSI-Sigma and other tools based on the short-read RNA-seq data of RNA Sequins
The artificial in silico genome of RNA Sequins (‘Sequins genome’) contains 78 spliced synthetic genes. According to the concentration annotations, 34 of the 78 synthetic genes have isoform inclusion ratios that change between Mix A and B; 28 of these genes exhibit AS changes, whereas the other 6 genes exhibit non-AS events (e.g. alternative first or last exons). Using the same short-read RNA-seq data (21.62 and 20.71 million 125-bp paired-end reads for Mix A and B, respectively), PSI-Sigma, Whippet, rMATS and SUPPA2 reported AS events (ΔPSI ≠ 0) in 53, 70, 32 and 29 genes, respectively. The four methods reported 70 AS genes in total, and 22 overlapping genes were reported by all four methods. A total of 12 out of the 22 overlapping genes are known to have AS events based on concentration annotations (Fig. 1A). Based on the ΔPSI ≠ 0 criterion, the recall (the fraction of annotated AS genes that a tool reported) of PSI-Sigma, Whippet, rMATS and SUPPA2 was 1, 1, 0.64 and 0.50, respectively. In other words, rMATS and SUPPA2 missed up to 50% of the AS genes. The precision (the fraction of genes reported by a tool that are annotated AS genes) of PSI-Sigma, Whippet, rMATS and SUPPA2 was 0.53, 0.40, 0.56 and 0.52, respectively.
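The gene-level recall and precision figures above follow from simple set arithmetic, sketched below. The function name and the gene identifiers are hypothetical placeholders. Only the counts come from the text: 28 annotated AS genes, 53 genes reported by PSI-Sigma and 70 by Whippet, with a recall of 1 implying that all 28 annotated genes are among those reported.

```python
def precision_recall(reported, truth):
    """Gene-level precision and recall for a set of reported AS genes."""
    true_pos = len(reported & truth)
    precision = true_pos / len(reported) if reported else 0.0
    recall = true_pos / len(truth) if truth else 0.0
    return precision, recall


# Placeholder gene names; only the set sizes are taken from the text.
truth = {f"as_gene_{i}" for i in range(28)}
psi_sigma = truth | {f"other_gene_{i}" for i in range(25)}  # 53 genes in total
whippet = truth | {f"other_gene_{i}" for i in range(42)}    # 70 genes in total

psi_sigma_precision, psi_sigma_recall = precision_recall(psi_sigma, truth)
whippet_precision, whippet_recall = precision_recall(whippet, truth)
```

With these counts, 28/53 rounds to 0.53 and 28/70 equals 0.40, in agreement with the reported precision of PSI-Sigma and Whippet.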
Fig. 1.

Performance of PSI-Sigma in short-read RNA-seq data of RNA Sequins. (A) The Venn diagram shows the overlap of AS genes reported by the concentration annotation of RNA Sequins (black), Whippet (blue), PSI-Sigma (red), rMATS (green) and SUPPA2 (yellow), based on the ΔPSI ≠ 0 criterion. (B) Using 100 absolute ΔPSI cutoffs (0∼99%), the precision and recall of PSI-Sigma (red), rMATS (green), SUPPA2 (yellow) and Whippet (blue) are shown. (C) The recall (when ΔPSI ≠ 0) of the four tools is shown in the bar chart specifically for AS events with SES, MES or IR events. The recall is based on the number of AS genes reported by a given tool. (D) The FPRs of the four tools were calculated, using 100 absolute ΔPSI cutoffs (0∼99%). (E) The scatter plot shows the comparison of ΔPSI values between the ground truth (X-axis) and the four tools (Y-axis)
The ΔPSI value is often used as a cutoff to select AS events with significant changes. Using 100 ΔPSI cutoffs (0–99%), we found that PSI-Sigma outperformed Whippet, rMATS and SUPPA2 in terms of precision and recall (Fig. 1B). At all ΔPSI cutoffs, Whippet had lower precision and recall than PSI-Sigma. Both rMATS and SUPPA2 had significantly lower recall, because they are not designed to detect MES events and they ignore some of the IR events (Fig. 1C). For example, 9 of the 28 AS genes have MES events. rMATS reported single-exon-skipping (SES) or mutually-exclusive splicing events in four of the nine genes, but missed the exons actually involved in the MES events (e.g. R2_60). Similarly, SUPPA2 could not report the actual alternative exons involved in MES events. In other words, the actual recall for MES events was 0 for both rMATS and SUPPA2. Whippet identified all of the genes with MES events, but did not recognize two of the alternative exons that overlap with other alternative exons: the alternative exon (chrIS: 10388367-10389391) of R2_59 and the alternative exon (chrIS: 5301021-5301254) of R2_38 were not reported in Whippet’s analysis. Finally, six genes with IR events have known concentration changes in their isoforms. However, both rMATS (tested in both version 3.2.5 and 4.0.2) and SUPPA2 failed to recognize IR events in three genes (R2_28, R1_62 and R1_41) from the gene-annotation file (.gtf) of RNA Sequins. Accordingly, these events were not analyzed. For the remaining three genes, rMATS reported ΔPSI values for the IR events, whereas SUPPA2 did not. For the eight IR events in RNA Sequins, PSI-Sigma achieved the highest correlation with the ground truth (r = 0.9465).
Furthermore, we used the ΔPSI cutoffs to estimate false positive rates (FPRs). The FPRs of rMATS and PSI-Sigma decreased as the ΔPSI cutoff increased (Fig. 1D), whereas SUPPA2 and Whippet had non-zero FPRs at most of the ΔPSI cutoffs. On average, Whippet had the highest FPRs, whereas rMATS had the lowest. When the ΔPSI cutoff was 10%, the FPRs of PSI-Sigma, Whippet, rMATS and SUPPA2 were 0.12, 0.28, 0.08 and 0.18, respectively. When the ΔPSI cutoff was 30%, the FPRs of PSI-Sigma, Whippet, rMATS and SUPPA2 were 0, 0.04, 0 and 0.10, respectively. The 0.04 FPR of Whippet is due to the fact that Whippet reported AS events in two genes (R2_55 and R2_68) that do not exhibit typical AS events. For example, the SES event of R2_55 does not have constitutive exons that are shared by the R2_55_2 and R2_55_3 isoforms. rMATS and SUPPA2 likewise do not consider this SES event. The 0.10 FPR of SUPPA2 is due to the fact that SUPPA2 reported ΔPSI values in five genes (R2_46, R1_22, R2_73, R2_45 and R2_57) whose isoform concentrations do not change. Overall, the FPRs of PSI-Sigma are significantly different from those of Whippet (P = 0.0002) and SUPPA2 (P = 3.09 × 10^−8), whereas the FPRs of PSI-Sigma and rMATS are not significantly different (P = 0.0721). In short, PSI-Sigma and rMATS are significantly better than Whippet and SUPPA2 in terms of FPR.
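The FPR calculation across cutoffs can be sketched as below. Several points are assumptions rather than statements from the text: the negatives are taken to be the 50 genes without annotated AS changes (78 − 28), a gene counts as a false positive when its largest absolute ΔPSI exceeds the cutoff (so that a cutoff of 0 reproduces the ΔPSI ≠ 0 criterion), and all names and ΔPSI values are hypothetical. Under the first assumption, the reported values at the 30% cutoff are consistent with 2/50 = 0.04 for Whippet and 5/50 = 0.10 for SUPPA2.

```python
def false_positive_rate(max_abs_dpsi, negatives, cutoff):
    """Fraction of negative genes reported with |ΔPSI| above the cutoff (percent).

    max_abs_dpsi: dict mapping gene -> largest absolute ΔPSI reported by a tool;
                  genes that the tool did not report are treated as 0.
    negatives:    set of genes with no annotated AS change.
    """
    if not negatives:
        return 0.0
    false_pos = sum(1 for gene in negatives if max_abs_dpsi.get(gene, 0.0) > cutoff)
    return false_pos / len(negatives)


def fpr_curve(max_abs_dpsi, negatives, cutoffs=range(100)):
    """FPR at each of the 100 absolute ΔPSI cutoffs (0 to 99%)."""
    return [false_positive_rate(max_abs_dpsi, negatives, c) for c in cutoffs]


# Hypothetical tool output: three negative genes reported, two above 30%.
negatives = {f"neg_gene_{i}" for i in range(50)}
reported = {"neg_gene_0": 45.0, "neg_gene_1": 38.0, "neg_gene_2": 12.0}
curve = fpr_curve(reported, negatives)
```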
Finally, we correlated the ΔPSI values from the ground truth and the four tools (Fig. 1E). Overall, PSI-Sigma reported the most accurate ΔPSI values (r = 0.9441, P = 3.93 × 10^−30) and SUPPA2 reported the least accurate ΔPSI values (r = 0.5700, P = 1.63 × 10^−6). Note that the ΔPSI values were set to zero if the tool did not report ΔPSI values for the AS event. For the AS events whose ΔPSI value is reported by rMATS, we found that PSI-Sigma has a slightly higher correlation with the ground truth (r = 0.9457) than rMATS (r = 0.9321). For these single-exon AS events, the ΔPSI values of PSI-Sigma and rMATS correlated strongly (r = 0.9793). There was also a strong correlation for the 130 AS events identified by rMATS in a recent analysis of mouse hepatocytes (Bhate et al., 2015). PSI-Sigma could identify all the reported events and gave ΔPSI values that were highly correlated with RT-PCR validations (r = 0.8741). Both lines of evidence suggest that PSI-Sigma and rMATS perform similarly in the case of AS events involving one alternative exon.
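The zero-filling rule mentioned above affects the correlation directly, as the following sketch shows. It assumes that r denotes the Pearson correlation coefficient, and the event identifiers and ΔPSI values are hypothetical.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def correlate_with_truth(truth, reported):
    """Correlate ground-truth and reported ΔPSI values over all annotated events.

    Events missing from the tool output are set to zero, as described in the text.
    """
    events = sorted(truth)
    xs = [truth[event] for event in events]
    ys = [reported.get(event, 0.0) for event in events]
    return pearson_r(xs, ys)


# Hypothetical ΔPSI values (percent) for three annotated events.
truth_dpsi = {"event_1": 10.0, "event_2": -20.0, "event_3": 30.0}
complete_tool = {"event_1": 10.0, "event_2": -20.0, "event_3": 30.0}
partial_tool = {"event_1": 10.0, "event_2": -20.0}  # event_3 not reported

r_complete = correlate_with_truth(truth_dpsi, complete_tool)
r_partial = correlate_with_truth(truth_dpsi, partial_tool)
```

A tool that reports accurate values but omits events is therefore penalized, which is why the comparison restricted to events reported by rMATS is given separately.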
In summary, PSI-Sigma provides a more comprehensive and accurate analysis of AS events. For example, the R2_60 gene of RNA Sequins has both SES and MES events, but only PSI-Sigma and Whippet can detect the MES event (Fig. 2A). Moreover, PSI-Sigma reports more accurate ΔPSI values for IR events. For example, PSI-Sigma, Whippet and rMATS identified the IR event of R1_72 in their databases, but the ΔPSI value of PSI-Sigma was closer to the ground truth (Fig. 2B).
Fig. 2.

PSI-Sigma detects the MES and IR events in the Sequins transcriptome, whereas other tools may not. (A and B) The exon peaks and isoform structures of the R2_60 and R1_72 synthetic genes are shown. The exon peaks are based on the RNA-seq data of RNA Sequins. Red, green, yellow and blue balls represent PSI-Sigma, rMATS, SUPPA2 and Whippet, respectively. The pink bubbles highlight the regions that can be detected by a group of splicing-detection tools. Alternative exons are highlighted in pink. The IR region is also highlighted in pink in (B)
3.2 MES and IR events in the human transcriptome
MES events are commonly seen in the human transcriptome. In the latest version of the genome annotation (Ensembl 94), PSI-Sigma identified MES events in 7029 genes, 6339 (90.18%) of which are protein-coding genes; 6059 protein-coding genes have both SES and MES events and 280 protein-coding genes have only MES events. For example, integrin subunit alpha V (ITGAV) only has MES events, so the AS event can only be detected by PSI-Sigma and Whippet (Fig. 3A). Telomerase reverse transcriptase (TERT) has both SES and MES events; the MES event skips two exons and introduces a premature termination codon. PSI-Sigma and Whippet detect both the SES and MES events, whereas rMATS can only detect the SES event (Fig. 3B). Furthermore, in our recent study, we identified and validated a switch of AFMID isoforms in hepatocellular carcinoma (HCC) by using PSI-Sigma and RT-PCR experiments (Lin et al., 2018). PSI-Sigma allowed us to precisely report the PSI values of individual alternative exons of AFMID (Fig. 3C). In particular, exon 6 of AFMID had two opposite ΔPSI values, depending on whether it is present in the AFMIDFL or AFMIDe6 isoform (Fig. 3C). Whippet cannot distinguish the two ΔPSI values of exon 6, because it assumes that one exon can only have one PSI value. Likewise, single-exon-based approaches, such as rMATS and SUPPA2, cannot correctly estimate the ΔPSI changes of individual exons in AFMID. We ran rMATS and Whippet to analyze the LIHC dataset (50 adjacent normal livers and 369 HCC patient samples) from The Cancer Genome Atlas, and we confirmed that rMATS did not report significant PSI changes for exon 6 of AFMID, and Whippet could not report PSI values for this exon.
Fig. 3.

PSI-Sigma detects the MES and IR events in the human transcriptome, whereas other tools may not. (A and B) The isoform structures of human ITGAV and TERT genes are shown here. The black bars and lines represent exons and introns, respectively. The red, blue and green balls represent PSI-Sigma, Whippet and rMATS, respectively. Alternative exons are highlighted in pink. The pink bubbles highlight the regions that can be detected by one or a group of splicing-detection tools. (C) The exon peaks and isoform structures of the human AFMID gene are shown. The first two panels show the exon peaks based on RNA-seq data from a human liver and a human HCC patient sample from The Cancer Genome Atlas LIHC dataset. Alternative exons of AFMIDFL1 and AFMIDFL2 are colored in pink, and exon 6 of AFMIDe6 is colored in blue to demonstrate that the increase in exon 6 can only be detected by PSI-Sigma. (D) The isoform structures of the human SRSF2 gene are shown. The pink region highlights the IR and SES regions
In addition, we found that rMATS failed to identify the IR events in three Sequins genes. To extend this analysis of IR to the human transcriptome, we ran rMATS and PSI-Sigma for the latest gene annotation (Ensembl 94). PSI-Sigma identified 9175 genes with IR events, whereas rMATS (v4.0.2) identified IR events in 3574 genes. A total of 3515 (98.35%) out of the 3574 genes identified by rMATS were also identified by PSI-Sigma. PSI-Sigma identified an additional 5660 genes with IR events. These 5660 (61.69%) genes mostly have their IR events near either the first or last exon. For example, translocation-associated membrane protein 1 (TRAM1) in the translation pathway has an IR event involving the first intron. Mitogen-activated protein kinase 11 (MAPK11) in the MAP kinase pathway has an IR event in the 3'-UTR region. And serine- and arginine-rich splicing factor 2 (SRSF2) has a previously characterized IR event in the last intron, which destabilizes the mRNAs through autoregulation (Sureau et al., 2001). This IR event was only detected by PSI-Sigma, whereas the SES event of SRSF2 was detected by both PSI-Sigma and rMATS (Fig. 3D). Serine- and arginine-rich splicing factor 1 (SRSF1) also has an IR event involving the last intron (Sun et al., 2010), which was not detected by rMATS. PABPN1 has IR events in both 5'- and 3'-UTRs. The IR event in the 3'-UTR is known to be promoted by PABPN1 protein through autoregulation (Bergeron et al., 2015). Finally, some IR events missed by rMATS did not involve the first or last intron. For instance, heat shock protein family A member 8 (HSPA8) has IR events for introns 6 and 7, and enhancer of zeste 2 polycomb repressive complex 2 subunit (EZH2) has IR events for introns 10 and 11. These IR events are not in the database of rMATS (v3.2.5 and v4.0.2), and therefore are not analyzed and reported by this software.
3.3 Performance of PSI-Sigma based on the long-read RNA-seq data of RNA Sequins
To benchmark PSI-Sigma for long-read RNA-seq analysis, we generated nanopore long-read RNA-seq data of RNA Sequins. We isolated human RNA from U87 glioblastoma cells, spiked it with RNA Sequins mixtures and carried out long-read RNA-seq on a GridION (Oxford Nanopore). After base-calling, we obtained 2.20 million reads for Run 1 (U87 RNA plus Mix A) and 2.51 million reads for Run 2 (U87 RNA plus Mix B). We used GMAP (Wu and Watanabe, 2005) to align the long reads onto both the human and Sequins genomes. The majority of the long reads could be aligned exclusively to either the human or the Sequins genome (Fig. 4A). Only 48 reads (0.0022%) from Run 1 and no reads from Run 2 aligned to both genomes. The error rates (based on the edit distance in the alignments) for Run 1 and 2 were 12.53 and 12.18%, respectively. Thus, the error rate of long-read RNA-seq is higher than that of short-read RNA-seq (<0.2%), but the long reads from the RNA Sequins were still clearly distinguishable from the human long reads. The human long reads from the two runs exhibited very similar read-length distributions, indicating highly similar RNA integrity.
Fig. 4.

Performance of PSI-Sigma in long-read RNA-seq data of RNA Sequins. (A) The pie charts show the proportions of long reads aligned to either the human or Sequins genome for Run 1 and 2, respectively. The Sequins genome represents the artificial in silico genome of RNA Sequins. (B) The scatter plot shows the correlation between gene-expression profiles of Mix B in short-read (y-axis) and long-read (x-axis) RNA-seq data. The trend line is colored in red. Each blue dot represents one synthetic gene of the RNA Sequins. (C) The scatter plots show the correlation of ΔPSI values between long-read RNA-seq (analyzed by PSI-Sigma) and concentration annotations
Based on the alignments of long reads in the Sequins genome, we calculated the expression levels of 78 synthetic genes, quantified as transcripts per million (TPM). Each long read represents one transcript, so one TPM equals one long read per million long reads. We used log2(TPM + 1) values to compare the gene-expression profile of Mix B between our long-read data (U87 spiked-in) and previous short-read data from a human lymphoblastoid cell line (Hardwick et al., 2016) (GM12878 spiked-in). The Sequins gene-expression profiles were highly correlated (rho = 0.9627) (Fig. 4B), indicating that the spike-in experiments are highly reproducible. The expression profiles of Mix A and B were also highly correlated with their concentration annotations (rho = 0.9565 for Mix A and rho = 0.9525 for Mix B).
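The long-read TPM definition reduces to a one-line scaling, sketched here with hypothetical function names. The worked values use the 0.76 million long reads of Run 1 discussed in Section 3.4, where 10 and 30 supporting long reads correspond to log2(TPM + 1) values of about 3.8 and 5.3, respectively.

```python
from math import log2


def long_read_tpm(read_count, total_reads):
    """TPM for long reads: supporting reads per million long reads."""
    return read_count * 1_000_000 / total_reads


def log2_tpm(read_count, total_reads):
    """log2(TPM + 1), the scale used to compare expression profiles."""
    return log2(long_read_tpm(read_count, total_reads) + 1)


TOTAL_LONG_READS = 760_000  # 0.76 million long reads

value_10_reads = round(log2_tpm(10, TOTAL_LONG_READS), 1)
value_30_reads = round(log2_tpm(30, TOTAL_LONG_READS), 1)
```

No length normalization is needed in this definition, because each long read is taken to represent one transcript.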
Because other tools were designed for short-read RNA-seq and are currently not compatible with long-read RNA-seq, we only used PSI-Sigma to analyze the long-read alignment files of Mix A and B. The ΔPSI values reported by PSI-Sigma were highly correlated with concentration annotations (r = 0.9245) (Fig. 4C). The number of events detected is smaller than in the short-read analysis, due to the low sequencing depth. Nevertheless, the IR events (two introns) of R1_41 had ΔPSI values that were only ∼2% different from the expected ΔPSI values based on concentration annotations.
3.4 Long-read RNA-seq covers the transcriptome with full-length long reads for AS analysis
Sequencing depth determines the number of AS genes that can be detected by PSI-Sigma, so we sought to determine the number of genes whose expression can be detected with sufficient sequencing depth in short-read and long-read RNA-seq data. First, in the 0.76 million long reads from Run 1, 11 420 protein-coding genes were covered by ≥1 long read, 6499 by ≥10 long reads and 3685 by ≥30 long reads (Fig. 5A). In the 43 million 100-bp paired-end reads from U87 cells, 11 980 protein-coding genes were covered by ≥1 TPM, and 7772 and 4037 protein-coding genes had log2(TPM + 1) values ≥3.8 and ≥5.3, respectively (Fig. 5B). In other words, ∼94% of the protein-coding genes detected by 43.4 million short reads have at least one long read in the long-read RNA-seq data with only 0.76 million long reads. Likewise, ∼75% of the protein-coding genes with ≥3.8 log2(TPM + 1) value are likely to be covered by ≥10 long reads out of ∼1 million long reads. Notably, long reads were more frequently aligned to pseudogenes than short reads (Fig. 5A), probably due to the higher error rate of long reads. The gene-expression profiles of short-read and long-read RNA-seq data from U87 cells were highly correlated (r = 0.9045) among protein-coding genes, lincRNAs and antisense transcripts (Fig. 5C). For example, housekeeping genes gave very similar log2(TPM + 1) values: GAPDH gave 13.07 and 13.34 in short-read and long-read profiles, respectively; and HPRT1 gave 6.77 and 7.38 in short-read and long-read profiles, respectively.
Fig. 5.

A comparison of the numbers of genes detected by short-read and long-read RNA-seq. (A) The stacked-bar chart shows the number of genes detected in 0.76 million long reads, based on three expression-level criteria: (i) log2(TPM+1) ≥1 (≥1 long reads); (ii) log2(TPM+1) ≥3.8 (≥10 reads) and (iii) log2(TPM+1) ≥5.3 (≥30 long reads). Genes were further divided into four categories: protein-coding (blue), lncRNA (green), antisense (yellow) and others (purple). (B) The stacked-bar chart shows the number of genes detected for different numbers of short or long reads, based on the same expression-level criteria as in (A). (C) The scatter plots show the correlations between the gene-expression profiles in short-read and long-read RNA-seq data. Only protein-coding genes, lncRNAs and antisense transcripts are plotted. Mitochondrial RNAs were excluded from these plots. (D) The top panel shows the exon structure of C17orf53 in blue, and the middle panel shows the alignments of long reads (red) in the gene region of C17orf53. Each horizontal line represents one long read. Three of the long reads (first, third and fourth) do not match any of the exons in C17orf53, so they were excluded from the analysis. The heat map (bottom panel) is based on the alignments of 10 supporting long reads of C17orf53, and shows the distribution of sequencing depth in each of the 10 bins in C17orf53. Each circle under each bin represents the percentage of long reads covering the bin. The color spectrum of the circles is on the right. (E) The heat maps show the percentages of protein-coding genes with ≥10 or 3 long reads covering the 10 bins. For instance, the sixth bin from the left represents the protein-coding genes whose gene region is ≥50% covered by ≥10 long reads. The color spectrum of the circles is on the right
It is remarkable that ∼1 million long reads captured >11 000 protein-coding genes, but not all of the long reads cover entire gene regions and can be used for PSI analysis. For example, from Run 1, we found 10 long reads whose alignments belong to C17orf53 (Fig. 5D). Seven of the long reads covered the entire gene region of C17orf53, and the other three long reads (at the bottom) did not. Thus, the sequencing depth for the exons at the 5'-end can be lower than for the exons at the 3'-end. The depth difference can be due to: (i) the gene having alternative first exons (e.g. the last two long reads at the bottom in Fig. 5D); (ii) sequences at the 5'-end or 3'-end of the long read being left unaligned through soft-clipping; (iii) incomplete reverse transcription or (iv) low RNA integrity. Soft-clipping is a way for an aligner to report that a series of bases at the beginning or the end of a short- or long-read sequence is not part of the alignment record. In other words, the clipped bases are masked and do not have matched bases in the alignment record. For example, the third-last long read in Fig. 5D has 234 bases masked by soft-clipping at the 5'-end, so the long read may have spanned the first exon (240 bases in size).
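In SAM-format alignments, soft-clipped bases are recorded as S operations at the ends of the CIGAR string, so their number can be read directly from each record. The sketch below uses a hypothetical helper and hypothetical CIGAR strings (the first one mimics a read with 234 clipped bases). Note that "leading" refers to the start of the read as stored in the alignment record, which corresponds to the 5'-end of the transcript only for reads aligned in the same orientation as the gene.

```python
import re

# One CIGAR operation: a length followed by an operation code (SAM specification).
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def soft_clipped_bases(cigar):
    """Return (leading, trailing) soft-clipped base counts of a CIGAR string."""
    ops = [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]
    # Hard clips (H) can only sit outside soft clips, so drop them first.
    core = [item for item in ops if item[1] != "H"]
    leading = core[0][0] if core and core[0][1] == "S" else 0
    trailing = core[-1][0] if len(core) > 1 and core[-1][1] == "S" else 0
    return leading, trailing


# Hypothetical spliced long-read alignment with 234 clipped bases at its start.
example_clips = soft_clipped_bases("234S500M1200N300M")
```

Comparing the number of clipped bases with the size of the adjacent exon, as done for the 240-base first exon above, indicates whether the clipped portion could have covered that exon.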
To rigorously estimate the distribution of sequencing depths, we divided each entire gene region into 10 bins, and calculated the number of protein-coding genes with ≥10 supporting long reads in each bin. We found that 64.19 and 53.79% of the protein-coding genes were fully covered by at least 3 and 10 long reads in Run 1, respectively (Fig. 5E). About 78.69 and 69.22% of the protein-coding genes were almost fully covered (>70%) by at least 3 and 10 long reads in Run 1, respectively. About 78.47% of the protein-coding genes had >50% of their gene region covered by at least 10 long reads in Run 1. In Run 2, we found similar percentages (Fig. 5E). Overall, in Run 1, the gene regions of 3495 protein-coding genes were fully covered by at least 10 long reads; 4498 protein-coding genes had their gene regions almost entirely covered (>70%) by at least 10 long reads.
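The binning procedure can be sketched for a single gene as follows. This is an illustration under stated assumptions, not the code used for Fig. 5E: coordinates are half-open, a long read is represented by the span from its first to its last aligned base, a read is considered to cover a bin when the two intervals overlap, and all names and coordinates are hypothetical. The example mirrors the C17orf53 case, with seven reads spanning the gene region and three shorter reads.

```python
def bin_edges(start, end, n_bins=10):
    """Split the half-open gene region [start, end) into n_bins near-equal bins."""
    length = end - start
    return [
        (start + length * i // n_bins, start + length * (i + 1) // n_bins)
        for i in range(n_bins)
    ]


def reads_per_bin(start, end, read_spans, n_bins=10):
    """Number of long reads overlapping each bin of the gene region."""
    counts = []
    for bin_start, bin_end in bin_edges(start, end, n_bins):
        counts.append(
            sum(1 for r_start, r_end in read_spans
                if r_start < bin_end and bin_start < r_end)
        )
    return counts


def covered_fraction(counts, min_reads=10):
    """Fraction of bins supported by at least min_reads long reads."""
    return sum(1 for c in counts if c >= min_reads) / len(counts)


# Hypothetical gene region of 1000 bases with ten supporting long reads.
full_length_reads = [(0, 1000)] * 7
partial_reads = [(300, 1000)] * 3
bin_counts = reads_per_bin(0, 1000, full_length_reads + partial_reads)
```

In this example the gene is fully covered at the ≥3-read threshold, but only 70% of its region is covered at the ≥10-read threshold, because the three shorter reads do not reach the first three bins.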
4 Discussion
The PSI index is used to profile the inclusion of alternative exons in RNA-seq analysis. However, an alternative exon can be included in multiple splice isoforms. To allow high-resolution determination of PSI values in more complex AS events, we developed PSI-Sigma. We compared the performance of PSI-Sigma with three leading splicing-detection tools by using real RNA-seq data, and identified the limitations of the three major approaches, especially for detection of MES and IR events. rMATS and SUPPA2 assume only one alternative exon in an AS event, so they do not record MES events in their databases. The rMATS database pipeline could be updated by combining the alternative exons in a MES event into one large hypothetical exon, but the PSI calculation would have to avoid counting the intronic reads between alternative exons as exonic reads. SUPPA2 could use the new PSI index used by PSI-Sigma to include the MES events in its analysis. PSI-Sigma takes all isoforms into account when calculating PSI values, so an alternative exon can have multiple PSI values, depending on the exon-junction reads linking it with the constitutive exon (Fig. 3C). Whippet and SUPPA2 are both limited by the assumption that one alternative exon can only have one PSI value. This makes PSI-Sigma more flexible than other approaches at detecting complex MES events.
We illustrated the limitations of various tools by using both Sequins (e.g. R2_60 and R2_59) and human (e.g. AFMID and TERT) genes. We found that rMATS and SUPPA2 missed some IR events in the Sequins transcriptome, so the IR events were not analyzed and reported. Further experiments confirmed that rMATS cannot recognize a large fraction of the IR events in the human genome annotation. We highlighted several known IR events (e.g. SRSF2 and PABPN1). Both rMATS and SUPPA2 need expanded databases to include the missing IR events in the human transcriptome. We showed that RNA Sequins are useful to benchmark the comprehensiveness of splicing-detection tools, as these limitations were not previously addressed.
In terms of accuracy of ΔPSI values, we showed that PSI-Sigma achieved the highest correlation with the ground truth (concentration annotation of RNA Sequins). For single-exon AS events, PSI-Sigma achieved a slightly higher correlation than rMATS. Moreover, 130 RT-PCR validations also indicated that PSI-Sigma reports accurate ΔPSI values for single-exon AS events. For IR events, we showed that PSI-Sigma achieved the highest correlation of the methods tested, and illustrated its advantage using an IR event of R1_72 (Fig. 2B). In terms of FPR, we showed that rMATS had the lowest FPR and Whippet had the highest FPR when the absolute ΔPSI cutoff was set to 10% (a common criterion). The fact that Whippet has a higher FPR at low ΔPSI cutoffs is partly due to it not requiring exon-skipping events to have constitutive exons and reporting non-AS events as SES events. Even if we include the six non-AS genes (e.g. R2_55) as true positives (n = 34), Whippet still has a higher FPR than the other tools when the absolute ΔPSI cutoff is low. For example, at 5 and 10% cutoffs, the FPRs of Whippet are 0.32 and 0.23, respectively, when the six non-AS genes are included. Under the same conditions, the FPRs of PSI-Sigma are 0.22 and 0.14, respectively.
This suggests that Whippet functions best for AS events that are expected to have high ΔPSI values. For smaller AS changes, Whippet might detect substantially more false positive events than other tools. Besides, when the absolute ΔPSI cutoff was 30%, SUPPA2 reported AS events in five genes that do not have changes in their isoform concentrations. SUPPA2 was the only tool that reported these AS events at 30% or greater cutoffs. This suggests that SUPPA2 might report more false positive events, even when the absolute ΔPSI cutoff is high.
In terms of speed, Whippet is considerably faster than the other tools. Using only one CPU thread, Whippet (v0.11) processed the .fastq files in a 3 versus 3 comparison (∼70 million paired-end reads per file) in just ∼5 h, whereas PSI-Sigma and rMATS (v3.2.5) took ∼10 and 17 h to process pre-aligned .bam files, respectively. When multiple CPU threads are allowed, rMATS (v4.0.2) can process the pre-aligned .bam files in 30 min (eight CPU threads). Generating the pre-aligned .bam files for the 3 versus 3 comparison takes STAR (Dobin et al., 2013) (2-pass mode) ∼30 min per sample (eight CPU threads). If the alignments can be done in parallel by using multiple computing servers, rMATS (v4.0.2) will be slightly faster than Whippet, because Whippet needs ∼50 min for each RNA-seq sample. If only one computing server or one laptop is used, Whippet will be faster than PSI-Sigma and rMATS (v4.0.2). PSI-Sigma’s pipeline currently does not support multiple CPU threads. In addition, PSI-Sigma and rMATS both generate adjusted P-values in the 3 versus 3 comparisons, whereas Whippet does not generate P-values. PSI-Sigma and rMATS are suitable for statistical comparisons of multiple samples, whereas Whippet is more suitable for quick screening in a 1 versus 1 comparison.
Long-read RNA-seq is an emerging technology, but the error rate (12–13%) of nanopore long reads raises concerns about its reliability. A recent study applied long-read RNA-seq to small transcriptomes, such as the yeast transcriptome and Lexogen’s Spike-in RNA Variant Control Mixes (Garalde et al., 2018), but its performance with larger and more complex transcriptomes, such as the human transcriptome and RNA Sequins, was not reported. Another study used nanopore 2D reads (Byrne et al., 2017), but 2D long-read RNA-seq is no longer available. Our analysis shows that long-read RNA-seq is quantitatively reliable in gene-expression analysis. We showed that long-read RNA-seq reported the expected levels of synthetic RNAs in our spike-in experiments, and yielded gene-expression profiles that were highly correlated with the profiles of short-read RNA-seq data from natural human RNAs. Approximately 1 million long reads included reads that mapped to >11 000 protein-coding genes. Moreover, our analysis showed that the long reads of synthetic RNAs and natural human RNAs are clearly distinguishable, because only a negligible number of long reads were ambiguously aligned to both the synthetic and human genomes. This suggests that a splice-aware aligner, such as GMAP (Wu and Watanabe, 2005), can tolerate the error rate of nanopore long reads. Overall, long-read RNA-seq analysis is comparable to short-read RNA-seq analysis in assessing gene-expression levels.
In addition, a long read from long-read RNA-seq often contains information about multiple exons, which is advantageous for quantifying overlapping genes (e.g. antisense and read-through transcripts) that are ambiguous in short-read RNA-seq analysis. For example, SYNJ2BP-COX16 is a read-through gene spanning SYNJ2BP and COX16. Short-read RNA-seq analysis gave similar TPM values for these three genes in U87 cells, but our long-read RNA-seq analysis was able to report that SYNJ2BP-COX16 transcripts were not present (or were below the level of detection), because only the full-length transcripts of SYNJ2BP and COX16 were present in the alignments. This example illustrates that long-read RNA-seq provides a less ambiguous expression profile in regions where two or more genes overlap. Also, the protein-coding genes whose expression level in U87 cells was zero, according to the long-read RNA-seq analysis, had an expression level [log2(TPM + 1)] < 7.5 in short-read RNA-seq analysis. This observation indicates that long-read RNA-seq did not miss protein-coding genes with moderate or high expression levels that were supposed to be captured for the given sequencing depth.
Interestingly, four protein-coding genes in U87 cells were preferentially detected by long-read RNA-seq. For example, pituitary tumor-transforming 2 (PTTG2) had a log2(TPM + 1) value of 6.47 (with 67 long reads), but it had zero or only one short read in the short-read RNA-seq data of U87 cells [NCBI-SRA accession numbers: SRP066879 and SRP117568 (Manukyan et al., 2018)]. In fact, PTTG2 has no detectable expression in 53 tissue types in the Genotype-Tissue Expression database (GTEx Consortium, 2013) based on short-read RNA-seq, but a previous study showed that cDNAs from this proto-oncogene made from pituitary RNA are detectable by a PCR-ELISA assay (Prezant et al., 1999). This observation suggests that short-read RNA-seq is limited in detecting expression of genes such as PTTG2. Long-read RNA-seq provides a more complete and precise transcriptome profile, because it can distinguish transcripts from overlapping genes, and may detect slightly more protein-coding genes. Most importantly, a single long read is equivalent to ∼10–100 short reads, depending on the transcript length, so far fewer reads are required for PSI analysis with long-read RNA-seq.
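The expression values quoted above are on a log2(TPM + 1) scale. The following Python sketch shows the transform and its inverse, so that a value such as 6.47 or 7.5 can be converted back to TPM. The helper names are hypothetical, and the sketch makes no assumption about how TPM itself was computed in this study.

```python
import math


def log2_tpm_plus_one(tpm: float) -> float:
    """Transform a TPM value to the log2(TPM + 1) scale."""
    if tpm < 0:
        raise ValueError("TPM cannot be negative")
    return math.log2(tpm + 1.0)


def tpm_from_log2(value: float) -> float:
    """Invert log2(TPM + 1) to recover TPM."""
    return 2.0 ** value - 1.0


# A gene with zero TPM maps to 0 on this scale, which is why the
# pseudocount of 1 is added before taking the logarithm.
print(log2_tpm_plus_one(0.0))

# Back-convert the two values quoted in the text to TPM.
print(round(tpm_from_log2(6.47), 1))  # PTTG2 in the long-read data
print(round(tpm_from_log2(7.5), 1))   # threshold quoted for the short-read data
```

Because the scale is logarithmic, a difference of one unit corresponds to roughly a twofold difference in TPM for genes that are well above the pseudocount.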
We showed that 0.76 million long reads covered the entire or nearly entire gene regions of 3500–4500 protein-coding genes with ≥10 long reads. We further showed that a MinION flow cell generated ∼2–3 million base-called reads in total (Fig. 2A). Thus, one sequencing run using a MinION flow cell is likely to provide enough long reads for AS analysis of most protein-coding genes. Improvements in sequence base-calling may also boost the number of protein-coding genes fully covered by long reads, because sequences that used to be masked by soft-clipping can be aligned to additional bases at the 5'- or 3'-end of the gene region.
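The ≥10-read criterion above amounts to a per-gene tally of qualifying long reads. A minimal Python sketch of such a tally follows. It assumes that each long read has already been assigned to a gene and already filtered to reads that cover the entire or nearly entire gene region; that upstream step is not shown, and the function name and gene labels are hypothetical.

```python
from collections import Counter
from typing import Iterable, Set


def genes_with_min_reads(read_gene_assignments: Iterable[str],
                         min_reads: int = 10) -> Set[str]:
    """Return genes supported by at least `min_reads` long reads.

    read_gene_assignments: one gene label per qualifying long read
    (assumption: reads were pre-filtered for near-full gene coverage).
    """
    counts = Counter(read_gene_assignments)
    return {gene for gene, n in counts.items() if n >= min_reads}


# Toy input (illustrative only): 12, 9 and 10 reads for three genes.
assignments = ["GENE_A"] * 12 + ["GENE_B"] * 9 + ["GENE_C"] * 10
covered = genes_with_min_reads(assignments)
print(sorted(covered))
```

Here GENE_B falls just below the threshold and is excluded, which mirrors how the number of genes passing the filter depends on sequencing depth.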
In summary, using real RNA-seq data of RNA Sequins, we addressed key limitations of current leading splicing-detection tools and benchmarked their performance with different ΔPSI cutoffs. PSI-Sigma provides a more comprehensive and accurate AS analysis. We also demonstrated the use of PSI-Sigma in long-read RNA-seq analysis, and pointed out the potential of nanopore long-read RNA-seq. The name PSI-Sigma reflects the notion that all AS isoforms contribute to the PSI calculation.
Acknowledgements
We thank S. Goodwin for generating the long-read RNA-seq data and J. Scharner for providing RNA from U87 cells. We also thank T. Mercer for providing RNA Sequins.
Funding
This work was supported by the National Cancer Institute [grant number CA13106]. We acknowledge assistance from the CSHL Shared Resources, funded in part by a National Cancer Institute Cancer Center Support Grant [grant number 5P30CA045508].
Conflict of Interest: none declared.
References
- Bergeron D. et al. (2015) Regulated intron retention and nuclear pre-mRNA decay contribute to PABPN1 autoregulation. Mol. Cell. Biol., 35, 2503–2517.
- Bhate A. et al. (2015) ESRP2 controls an adult splicing programme in hepatocytes to support postnatal liver maturation. Nat. Commun., 6, 8768.
- Byrne A. et al. (2017) Nanopore long-read RNAseq reveals widespread transcriptional variation among the surface receptors of individual B cells. Nat. Commun., 8, 16027.
- Dobin A. et al. (2013) STAR: ultrafast universal RNA-seq aligner. Bioinformatics, 29, 15–21.
- Garalde D.R. et al. (2018) Highly parallel direct RNA sequencing on an array of nanopores. Nat. Methods, 15, 201–206.
- GTEx Consortium (2013) The Genotype-Tissue Expression (GTEx) project. Nat. Genet., 45, 580–585.
- Hardwick S.A. et al. (2016) Spliced synthetic genes as internal controls in RNA sequencing experiments. Nat. Methods, 13, 792–798.
- Katz Y. et al. (2010) Analysis and design of RNA sequencing experiments for identifying isoform regulation. Nat. Methods, 7, 1009–1015.
- Lin K.T. et al. (2018) A human-specific switch of alternatively spliced AFMID isoforms contributes to TP53 mutations and tumor recurrence in hepatocellular carcinoma. Genome Res., 28, 275–284.
- Manukyan A. et al. (2018) Analysis of transcriptional activity by the Myt1 and Myt1l transcription factors. J. Cell. Biochem., 119, 4644–4655.
- Prezant T.R. et al. (1999) An intronless homolog of human proto-oncogene hPTTG is expressed in pituitary tumors: evidence for hPTTG family. J. Clin. Endocrinol. Metab., 84, 1149–1152.
- Rinaldi C., Wood M.J.A. (2018) Antisense oligonucleotides: the next frontier for treatment of neurological disorders. Nat. Rev. Neurol., 14, 9–21.
- Shen S. et al. (2014) rMATS: robust and flexible detection of differential alternative splicing from replicate RNA-Seq data. Proc. Natl. Acad. Sci. USA, 111, E5593–E5601.
- Sterne-Weiler T. et al. (2018) Efficient and accurate quantitative profiling of alternative splicing patterns of any complexity on a laptop. Mol. Cell, 72, 187–200.
- Sun S. et al. (2010) SF2/ASF autoregulation involves multiple layers of post-transcriptional and translational control. Nat. Struct. Mol. Biol., 17, 306–312.
- Sureau A. et al. (2001) SC35 autoregulates its expression by promoting splicing events that destabilize its mRNAs. EMBO J., 20, 1785–1796.
- Thorvaldsdottir H. et al. (2013) Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration. Brief. Bioinform., 14, 178–192.
- Trincado J.L. et al. (2018) SUPPA2: fast, accurate, and uncertainty-aware differential splicing analysis across multiple conditions. Genome Biol., 19, 40.
- Vaquero-Garcia J. et al. (2016) A new view of transcriptome complexity and regulation through the lens of local splicing variations. Elife, 5, e11752.
- Wang Q., Rio D.C. (2018) JUM is a computational method for comprehensive annotation-free analysis of alternative pre-mRNA splicing patterns. Proc. Natl. Acad. Sci. USA, 115, E8181–E8190.
- Wu T.D., Watanabe C.K. (2005) GMAP: a genomic mapping and alignment program for mRNA and EST sequences. Bioinformatics, 21, 1859–1875.
