Bioinformatics. 2018 Apr 24;34(18):3220–3222. doi: 10.1093/bioinformatics/bty317

PathwaySplice: an R package for unbiased pathway analysis of alternative splicing in RNA-Seq data

Aimin Yan 1, Yuguang Ban 1, Zhen Gao 1, Xi Chen 1,2, Lily Wang 1,2,3
Editor: Janet Kelso
PMCID: PMC6137985  PMID: 29688305

Abstract

Summary

Pathway analysis of alternative splicing is biased unless it accounts for the different number of exons or junctions associated with each gene, because genes with a higher number of exons or junctions are more likely to be included in the ‘significant’ gene list. We present PathwaySplice, an R package that (i) performs pathway analysis that explicitly adjusts for the number of exons or junctions associated with each gene; (ii) visualizes selection bias due to the different number of exons or junctions for each gene and formally tests for the presence of bias using logistic regression; (iii) supports gene sets based on Gene Ontology terms, as well as more broadly defined gene sets (e.g. MSigDB) or user-defined gene sets; (iv) identifies the significant genes driving pathway significance and (v) organizes significant pathways with an enrichment map, where pathways with a large number of overlapping genes are grouped together in a network graph.

Availability and implementation

https://bioconductor.org/packages/release/bioc/html/PathwaySplice.html

Supplementary information

Supplementary data are available at Bioinformatics online.

1 Introduction

RNA-Seq has become a popular technology for transcriptome profiling. An important advantage of RNA-Seq is that it provides information on the fine structures of transcripts (isoforms) from alternative splicing, which can lead to polymorphisms in protein structures and functions.

Many tools are available for splicing analysis of individual genes using RNA-Seq data (Anders et al., 2012; Hartley and Mullikin, 2016; Shen et al., 2014). For pathway analysis of differential splicing, a typical workflow is to first identify significant gene features associated with splicing (e.g. ‘counting bins’, which are non-overlapping segments of the exons in DEXSeq, or non-overlapping segments of the exons or junctions in JunctionSeq), then select the genes associated with these significant gene features, and finally perform pathway analysis based on the list of significant genes.

However, a major challenge in pathway analysis is that each gene has a different number of exons or junctions. Because a gene is classified as significant if any of its exons or junctions is significant, genes with more exons or junctions have a higher chance of appearing on the significant gene list, resulting in selection bias.

This selection bias is often ignored or adjusted improperly [e.g. some authors include gene lengths as the bias factor when using goseq (Young et al., 2010)], partly because of a lack of available software. Here, we present PathwaySplice, a dedicated R package for pathway analysis of alternative splicing in RNA-Seq data.

2 Implementation

To illustrate the functionality of PathwaySplice, we re-analyzed a publicly available dataset (GSE63569) from a recent paper that studied the transcriptome of CD34+ cells from myelodysplastic syndrome (MDS) patients with SF3B1 mutations (Dolatshad et al., 2015). The example dataset featureBasedData (Supplementary Material, Section S2) includes P-values for annotated exon counting bins obtained from the JunctionSeq R package, for a random sample of 5000 genes.

2.1 Input

PathwaySplice takes result tables from splicing analysis software, which typically include multiple rows for each gene, corresponding to the multiple gene features associated with that gene. The function makeGeneTable then creates a data frame with one row for each gene. In the resulting genewise.table, the variables geneWisePvalue and numFeature correspond to the smallest P-value among the P-values of all the features of a gene and the number of features for the gene, respectively.

The variable fdr corresponds to the multiple-comparison-adjusted geneWisePvalue, and is calculated by the function p.adjust with method = 'fdr'. We can then use the significance threshold of P-value < 0.05 to classify genes into significant and non-significant genes. Genes can also be classified based on fdr < 0.05, by specifying the stat = 'fdr' option in the function makeGeneTable.
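The collapse from feature-level rows to one row per gene can be sketched as follows. This is a minimal illustration in Python, not the package's R code: the function names make_gene_table and bh_adjust and the toy gene identifiers are assumptions made for this example. The adjustment shown is the Benjamini-Hochberg procedure, which is what p.adjust computes for method = 'fdr'.

```python
from collections import defaultdict


def bh_adjust(p_values):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    m = len(p_values)
    # Walk from the largest P-value to the smallest, enforcing monotonicity.
    order = sorted(range(m), key=lambda i: p_values[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for steps_from_top, i in enumerate(order):
        rank = m - steps_from_top  # 1-based rank in ascending order
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def make_gene_table(feature_rows):
    """Collapse (gene_id, feature_p_value) rows into one record per gene."""
    by_gene = defaultdict(list)
    for gene_id, p_value in feature_rows:
        by_gene[gene_id].append(p_value)
    table = [
        {
            "gene": gene_id,
            "geneWisePvalue": min(p_list),  # smallest feature P-value
            "numFeature": len(p_list),      # number of features for the gene
        }
        for gene_id, p_list in sorted(by_gene.items())
    ]
    q_values = bh_adjust([row["geneWisePvalue"] for row in table])
    for row, q in zip(table, q_values):
        row["fdr"] = q
    return table


# Toy input: three hypothetical genes with 3, 1 and 2 features.
rows = [("g1", 0.20), ("g1", 0.01), ("g1", 0.70),
        ("g2", 0.04), ("g3", 0.50), ("g3", 0.30)]
gene_table = make_gene_table(rows)
```

Note that a gene with many features gets many chances at a small minimum P-value, which is exactly the source of the selection bias examined in the next section.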

2.2 Selection bias diagnostics

To formally test for selection bias, we fit a logistic regression model $\log\{p_i/(1-p_i)\} = \alpha + \beta X_i$, where $X_i$ is the number of gene features for gene $i$ and $p_i = \Pr(\text{gene } i \text{ is significant})$. Statistical significance of $X$ would then indicate the presence of selection bias. A boxplot that compares the distributions of the number of gene features for significant genes and non-significant genes is produced, and the statistical evidence for this difference is demonstrated with a P-value shown on the boxplot.

2.3 Assess pathway significance

The runPathwaySplice function implements the statistical methodology of the goseq R package (Young et al., 2010) to assess pathway significance in splicing analysis, adjusting for the number of gene features associated with each gene. It takes gene.based.table as input and then calculates a weight for each gene using a spline regression model; the weight is the probability of the gene being significant given the number of gene features (e.g. exons) associated with it. Next, random subsets of genes are sampled from the experiment. Each subset contains the same number of genes as the observed number of significant genes. Within each subset, genes are selected with probabilities given by the weights calculated above, and the number of selected genes belonging to a GO category is then counted. Many random subsets of genes are generated to produce the null distribution for estimating the P-value of a GO category. Both the random sampling approach and an approximation based on the Wallenius distribution are implemented. To account for multiple comparisons when testing many GO categories, PathwaySplice computes the FDR using the method of Benjamini and Hochberg (Benjamini and Hochberg, 1995).
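The random sampling approach described above can be sketched as follows. This is a simplified Python illustration, not the goseq or PathwaySplice code: the weights are passed in directly rather than estimated by a spline fit, the function names are hypothetical, and the add-one correction in the P-value is a choice made for this example. Drawing genes one at a time without replacement, with probability proportional to the remaining weights, is the urn scheme that the Wallenius noncentral hypergeometric distribution describes.

```python
import random


def weighted_sample_without_replacement(genes, weights, k, rng):
    """Draw k distinct genes; each draw is proportional to the remaining weights."""
    pool = list(genes)
    pool_weights = [weights[g] for g in pool]
    chosen = []
    for _ in range(k):
        i = rng.choices(range(len(pool)), weights=pool_weights, k=1)[0]
        chosen.append(pool.pop(i))
        pool_weights.pop(i)
    return chosen


def resampling_p_value(genes, weights, significant, category,
                       n_draws=2000, seed=1):
    """Estimate the over-representation P-value of one category by resampling."""
    rng = random.Random(seed)
    observed = len(set(significant) & set(category))
    category = set(category)
    hits = 0
    for _ in range(n_draws):
        draw = weighted_sample_without_replacement(
            genes, weights, len(significant), rng)
        # Count null draws with at least as many category genes as observed.
        if sum(g in category for g in draw) >= observed:
            hits += 1
    # Add-one correction (an assumption here) keeps the estimate above zero.
    return (hits + 1) / (n_draws + 1)


# Toy example: 30 hypothetical genes whose weight grows with feature count.
genes = [f"g{i}" for i in range(30)]
weights = {g: 0.05 + 0.03 * i for i, g in enumerate(genes)}
significant = ["g25", "g26", "g27", "g28", "g29"]
category = ["g26", "g27", "g28", "g3", "g4"]
p_adjusted = resampling_p_value(genes, weights, significant, category)
```

Setting every weight to the same value reduces the procedure to ordinary unweighted resampling, which is a convenient way to see how much the adjustment changes a category's P-value.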

In addition to the list of significant gene sets, users are often interested in the particular significant genes that drive pathway significance. The function runPathwaySplice returns the list of significant genes in a tibble data frame (Wickham and Grolemund, 2016) that can be written to a .csv file.

Beyond Gene Ontology terms, runPathwaySplice can also test enrichment of user-defined gene sets such as those from the Reactome, KEGG and MSigDB (Subramanian et al., 2005) databases. These gene sets need to be specified in .gmt format; they are then converted to a list by the gmtGene2Cat function and passed to runPathwaySplice through the gene2Cat argument.
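To make the .gmt conversion concrete, the sketch below parses GMT text into a gene-to-categories mapping. It is written in Python for illustration and is not the gmtGene2Cat function itself; the function name gmt_to_gene2cat and the two toy gene sets are assumptions. Each line of a GMT file is tab-separated: the gene set name, a description, then the member genes.

```python
def gmt_to_gene2cat(gmt_text):
    """Parse GMT-formatted text into {gene: [gene set names]}."""
    gene2cat = {}
    for line in gmt_text.splitlines():
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        set_name, members = fields[0], fields[2:]
        for gene in members:
            if gene:  # ignore empty fields from trailing tabs
                gene2cat.setdefault(gene, []).append(set_name)
    return gene2cat


# Two hypothetical gene sets sharing one gene.
example_gmt = (
    "SET_A\tfirst toy set\tSF3B1\tU2AF1\tSRSF2\n"
    "SET_B\tsecond toy set\tSRSF2\tTP53\n"
)
gene2cat = gmt_to_gene2cat(example_gmt)
```

Inverting the file this way, from sets of genes to the sets each gene belongs to, makes it cheap to count how many genes in a sampled subset fall into a given category.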

To evaluate the effect of selection bias adjustment in pathway analysis, the function compareResults graphically compares, using a boxplot, the distributions of the number of gene features associated with genes in significant gene sets before and after bias adjustment. A Venn diagram is also produced to visualize the overlap of genes in significant pathways before and after adjustment.

2.4 Network visualization of significant pathways

In pathway analysis, the most significant pathways often include redundant sets of genes. To help visualize the relationships between significant pathways, the function enrichmentMap organizes the most significant pathways in a network (Merico et al., 2010), which clusters pathways that share a large number of genes. More specifically, the overlap between two pathways (A and B) is measured by the Jaccard coefficient (JC): $\mathrm{JC} = |A \cap B| / |A \cup B|$, where $|A \cap B|$ indicates the number of genes within both A and B, and $|A \cup B|$ indicates the number of genes within A or B.
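The Jaccard coefficient and the resulting network edges can be sketched in a few lines. The Python code below is an illustration only, not the enrichmentMap function; the helper names, the min_jc threshold parameter and the toy pathways are assumptions introduced for this example.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard coefficient: shared genes divided by genes in either pathway."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def enrichment_map_edges(pathways, min_jc=0.0):
    """Return (pathway_1, pathway_2, JC) for every pair with JC above min_jc."""
    edges = []
    for (name_a, genes_a), (name_b, genes_b) in combinations(
            sorted(pathways.items()), 2):
        jc = jaccard(genes_a, genes_b)
        if jc > min_jc:
            edges.append((name_a, name_b, jc))
    return edges


# Three hypothetical pathways; P1 and P2 share two of four distinct genes.
pathways = {
    "P1": {"A", "B", "C"},
    "P2": {"B", "C", "D"},
    "P3": {"X", "Y"},
}
edges = enrichment_map_edges(pathways)
```

Here only P1 and P2 are connected, since P3 shares no genes with either; in a drawn network the JC value would set the thickness of that single edge.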

In the enrichment map (Fig. 1), the size of a node indicates the number of significant genes within the pathway. The color of a node indicates pathway significance, where smaller P-values correspond to a darker red color. The thickness of an edge corresponds to the JC between the two pathways it connects. A network file (.GML format) is also produced in the output folder for users who want to further manipulate the networks in the Cytoscape software (Shannon et al., 2003).

Fig. 1.


The enrichment map for several of the most significant gene sets after adjustment. Please note that this result is for illustration purposes only, because the analysis was conducted on the example dataset with a random sample of 5000 genes.

Figure 1 shows an example network of the most significant pathways for the MDS dataset after adjusting for the number of gene features.

3 Discussion and conclusion

One caveat in current approaches for pathway analysis of alternative splicing is that the inferences are carried out at the gene level, without explicitly considering specific transcripts or specific alternative splicing events. Indeed, the inferred transcript structures, their estimated expression and/or their functions do not play any role in pathway analysis. In addition, because a gene is classified as significant when at least one exon P-value is significant, the potential effects of multiple association signals for the gene from multiple significant exons may be missed. These issues point to important directions for future methodology development.

In summary, we developed PathwaySplice, an R package with comprehensive functionality and enhanced analysis output for pathway analysis of gene splicing. We expect PathwaySplice to be a useful analysis tool to facilitate unbiased pathway analysis of alternative splicing in RNA-Seq data. PathwaySplice is available at Bioconductor.

Funding

This work was supported partially by NIH/NCI R01 CA158472, NIH/NCI R01 CA200987 and NIH/NCI U24 CA210954.

Conflict of Interest: none declared.

Supplementary Material

Supplementary Data

References

  1. Anders S. et al. (2012) Detecting differential usage of exons from RNA-seq data. Genome Res., 22, 2008–2017.
  2. Benjamini Y., Hochberg Y. (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. Royal Stat. Soc. B, 57, 289–300.
  3. Dolatshad H. et al. (2015) Disruption of SF3B1 results in deregulated expression and splicing of key genes and pathways in myelodysplastic syndrome hematopoietic stem and progenitor cells. Leukemia, 29, 1798.
  4. Hartley S.W., Mullikin J.C. (2016) Detection and visualization of differential splicing in RNA-Seq data with JunctionSeq. Nucleic Acids Res., 44, e127.
  5. Merico D. et al. (2010) Enrichment map: a network-based method for gene-set enrichment visualization and interpretation. PLoS One, 5, e13984.
  6. Shannon P. et al. (2003) Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res., 13, 2498–2504.
  7. Shen S. et al. (2014) rMATS: robust and flexible detection of differential alternative splicing from replicate RNA-Seq data. Proc. Natl. Acad. Sci. USA, 111, E5593–E5601.
  8. Subramanian A. et al. (2005) Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl. Acad. Sci. USA, 102, 15545–15550.
  9. Wickham H., Grolemund G. (2016) R for Data Science. O’Reilly Media, Sebastopol, CA.
  10. Young M.D. et al. (2010) Gene ontology analysis for RNA-seq: accounting for selection bias. Genome Biol., 11, R14.


Articles from Bioinformatics are provided here courtesy of Oxford University Press
