Abstract
Summary
The increasing size and complexity of omics datasets can make effective visualization and interpretation very challenging. Differential expression datasets, such as those from RNA-sequencing and proteomics, can contain many thousands of features, and even after filtering for significance there can be hundreds or thousands of features to consider. A common tool for visualizing this type of data is the volcano plot, where each feature is plotted as the log2 transformed fold change along the x-axis and the negative log10 of a P-value along the y-axis. These plots provide a useful visualization of the largest and most significant changes, but the majority of the features are often crowded and overlapping near the “cone” of the volcano. To provide a biologically informative way to simplify the visualization and interpretation of these types of datasets, we developed the Pathway Volcano tool. This R-Shiny based software uses the Reactome API to select specific pathways and then filters the volcano plots to show only the data associated with those pathways. In this manner, many of the significant features in the crowded section of the volcano plot can be revealed to support the impact of specified pathways. The tool provides a range of interactive features to interrogate the data, along with the ability to download PNG files and tables of pathway-associated data.
Availability and implementation
Pathway Volcano is a freely available R Shiny package. The program was developed in R version 4.3.3 using RStudio version 2024.09.1 Build 394. Running the app locally requires the packages ggplot2, plotly, shiny (https://shiny.posit.co/), dplyr (https://dplyr.tidyverse.org), and ReactomeContentService4R (https://bioconductor.org/packages/ReactomeContentService4R/). The full code and documentation, including example datasets, are available at https://github.com/thoconne/PathwayVolcano and have been deposited in Zenodo under DOI 10.5281/zenodo.15425246.
1 Introduction
The application of omics technologies in biomedical sciences is becoming increasingly prevalent. Among the most widely used approaches are RNA-sequencing and proteomics. In both cases, the differential expression of either genes or proteins is measured to understand the biological differences between groups or conditions (Aebersold and Mann 2016, Hasin et al. 2017, Stark et al. 2019). With continuing advances in the analytical technologies, the datasets resulting from these studies are becoming increasingly large (Marx 2013). Many tools are available for the statistical analysis of these datasets, which typically yield lists of differentially expressed features including fold changes and an assessment of significance. These tools include DESeq2 (Love et al. 2014) and MAGenTA (McCoy et al. 2017) for genomics data, along with MaxQuant/Perseus (Tyanova et al. 2016) and MSstats (Choi et al. 2014) for proteomics.
A common visualization approach used in the interpretation of differential expression data is the volcano plot. This type of plot typically presents the log2 transformed fold change along the x-axis and the negative log10 of an adjusted P-value along the y-axis. Typical datasets from RNA sequencing analyses can yield well in excess of 10 000 transcripts. After significance testing, datasets are reduced but can still retain hundreds or even thousands of significant features. Thus, volcano plots, even when presenting only the significantly altered transcripts, are extremely crowded, making the interpretation of the data very challenging. Often only the features that are extremely significant receive any attention while the majority of significant features go uninterpreted.
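The coordinate transform described above can be illustrated with a short Python sketch. This is not code from the Pathway Volcano tool (which is written in R); the function name `volcano_point` and the feature values are hypothetical and chosen only to show how a fold change and an adjusted P-value map onto the two axes.

```python
import math


def volcano_point(log2_fold_change, padj):
    """Return the (x, y) plot position for one feature.

    x is the log2 fold change as supplied; y is -log10 of the adjusted P-value.
    A padj of 0 has no finite -log10, so it is rejected here.
    """
    if padj <= 0:
        raise ValueError("padj must be greater than 0")
    return log2_fold_change, -math.log10(padj)


# Hypothetical features, for illustration only.
features = [
    ("GeneA", 2.0, 0.001),
    ("GeneB", -0.4, 0.04),
    ("GeneC", 0.1, 0.5),
]

for name, lfc, padj in features:
    x, y = volcano_point(lfc, padj)
    print(f"{name}: x={x:+.2f}, y={y:.2f}")
```

Features with small adjusted P-values rise toward the top of the plot, and features with large positive or negative fold changes move toward the right or left edges. Features with modest changes cluster near the base, which is the crowding discussed above.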
A number of specialized tools have been developed to generate high-quality volcano plots for differential expression data. Mullan et al. developed an R Shiny application called ggVolcanoR which allows for highly customizable volcano plots for use with RNA sequencing and proteomics data (Mullan et al. 2021). Different sets of analytes can be colored and labeled by virtue of significance and directionality of change. The EnhancedVolcano tool is an R package available on Bioconductor which provides a similar set of volcano plot customizations but currently requires experience with R programming in order to generate these plots (https://github.com/kevinblighe/EnhancedVolcano). Even with the ability to highlight analytes with varying magnitudes of fold change and significance, a challenge remains that many statistically significant features found near the very crowded “cone” of the volcano are being ignored.
To overcome this challenge and simplify the data visualization, a different approach to filtering and highlighting the data is required. As the ultimate goal of many omics investigations is the identification of perturbations to specific biological pathways, we developed a method to filter and visualize the data based on biological pathways. This is accomplished by matching the experimental data against the lists of genes and proteins in the Reactome Database using the Reactome API (Jassal et al. 2020). The resulting volcano plots are filtered to contain only the entities in the experimental dataset that are found in a specifically queried pathway.
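The matching step can be sketched as a simple set-membership filter. The Python below is a minimal illustration under stated assumptions, not the tool's R implementation. The helper `filter_by_pathway` is hypothetical, the gene list is typed in by hand instead of being retrieved through the Reactome API, and the expression values are invented.

```python
def filter_by_pathway(rows, pathway_genes):
    """Keep only the rows whose gene symbol belongs to the pathway gene list."""
    # Compare symbols case-insensitively so "GCK" and "Gck" are treated as equal.
    wanted = {gene.lower() for gene in pathway_genes}
    return [row for row in rows if row["GeneSymbol"].lower() in wanted]


# Illustrative values only; not taken from Reactome or from a real experiment.
dataset = [
    {"GeneSymbol": "Gck", "log2FoldChange": 1.2, "padj": 0.003},
    {"GeneSymbol": "Actb", "log2FoldChange": 0.1, "padj": 0.80},
    {"GeneSymbol": "Pck1", "log2FoldChange": -0.9, "padj": 0.01},
    {"GeneSymbol": "Tubb5", "log2FoldChange": 0.3, "padj": 0.40},
]
pathway_genes = ["GCK", "PCK1", "G6PC"]  # assumed gene list for one pathway

subset = filter_by_pathway(dataset, pathway_genes)
print([row["GeneSymbol"] for row in subset])
```

Only entities present in both the experimental dataset and the pathway list survive the filter, so pathway members that were not measured simply do not appear.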
Deciding which pathways to query can be guided in several ways. Existing hypotheses from other experiments or biological intuition can provide a starting point. Clues from the most significantly altered entities can also guide the queries. Analyzing the experimental dataset using pathway analyses such as overrepresentation and gene set enrichment analyses (GSEA) can provide lists of potentially interesting pathways to query (Khatri et al. 2012). Both of these approaches are available in Reactome with pathways ranked by a P-value for significance (Fabregat et al. 2017).
2 Results
The Pathway Volcano tool was written in the R programming language using the R-Shiny framework along with the Plotly package for interactivity (Sievert 2020). The Reactome database API is called using the ReactomeContentService4R package, available from Bioconductor. The program can be loaded and launched from RStudio without any additional programming and can be hosted on routine server hardware.
2.1 Data upload
Figure 1A shows the Data Input Panel of the tool, which allows the user to upload a CSV file with differential expression data. The file can have any number of columns, but there are three required columns labeled (i) GeneSymbol, (ii) log2FoldChange, and (iii) padj. For simplicity, the gene symbols are coerced to the mammalian (non-human) format with a capital first letter and subsequent lower case letters. When using proteomics data, the associated gene name can be used, and the data upload is identical. At the top is the line “Click here for detailed instructions” which provides a reminder of the process. Below the Upload Data File box is the option to Use Example Dataset. Details on the analysis of this dataset are described in the Example Analysis link on the README page.
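The input requirements can be illustrated with a small Python sketch that checks for the three required columns and coerces the gene symbols. This is an illustration only and does not reproduce the tool's R code. The function name `load_differential_expression`, the error message, and the extra `baseMean` column in the usage example are assumptions.

```python
import csv
import io

REQUIRED_COLUMNS = ("GeneSymbol", "log2FoldChange", "padj")


def load_differential_expression(handle):
    """Read a CSV of differential expression results from an open text handle."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [column for column in REQUIRED_COLUMNS if column not in header]
    if missing:
        raise ValueError("missing required column(s): " + ", ".join(missing))

    rows = []
    for record in reader:
        row = dict(record)  # any extra columns are carried through unchanged
        # Capital first letter, remaining letters lower case (e.g. GCK -> Gck).
        row["GeneSymbol"] = record["GeneSymbol"].strip().capitalize()
        row["log2FoldChange"] = float(record["log2FoldChange"])
        row["padj"] = float(record["padj"])
        rows.append(row)
    return rows


# Usage with an in-memory file; the values are invented for illustration.
example = io.StringIO("GeneSymbol,log2FoldChange,padj,baseMean\nGCK,1.5,0.01,200\n")
print(load_differential_expression(example))
```

Rows with empty or non-numeric values would raise an error in this sketch; how the tool itself handles such rows is not described here.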
Figure 1.
Interface to the Pathway Volcano Tool and example output. (A) Data upload and query panel. In this panel, the user can upload the experimental data or load the example data, query the Reactome Database for specific pathways, select a specific pathway, and adjust the volcano plot features. (B) Volcano plot with the term ALL used in the Enter Reactome Pathway ID box. This shows all of the experimental dataset. (C) Table of pathways generated after putting the term “glucose” in the Enter Pathway Query Term box. (D) Volcano plot generated after selecting the Glucose Metabolism pathway and entering the Reactome ID R-HSA-70326 into the Enter Reactome Pathway ID box. (E) Table of genes associated with the Glucose Metabolism pathway that are shown in (D).
2.2 Selecting and filtering by pathway
Upon launching the program, the Reactome API is called and a list of all pathways is generated, which in the current version of Reactome (version 92, released March 20, 2025) contains 2769 pathways. The Enter Pathway Query Term box allows the user to enter a term, which then returns a list of all of the Reactome pathways with that term in the name. To view the entire experimental dataset, the term ALL is entered in the Enter Reactome Pathway ID box, resulting in all of the genes being shown as in Fig. 1B. When a query term such as “glucose” is input, a table of Reactome pathways containing this term is generated and includes the Reactome ID and the full name of the pathway as in Fig. 1C. Next, the user can copy and paste the Reactome ID for the selected pathway into the Enter Reactome Pathway ID box in Fig. 1A. Using the Reactome ID for the Glucose Metabolism pathway, R-HSA-70326, the volcano plot shown in Fig. 1D is generated, which contains only the genes in the experimental dataset that are involved in the Glucose Metabolism pathway. Below the volcano plot is a table of all of the genes in the experimental dataset which are part of this pathway, along with the log2 fold change and adjusted P-value, as shown in Fig. 1E.
It should be noted that there is a hierarchy in the pathway definitions in the Reactome Database. This is evidenced by the set of pathways returned using the “glucose” query. The top pathway, Glucose metabolism, is a high-level pathway whereas the other three are more specific. Such a hierarchy is found with many other queries as well. For example, the query “metabolism” yields 91 pathways, including the high-level pathway simply titled Metabolism (R-HSA-1430728). Using this pathway can provide a useful filter while maintaining a broad view of metabolism when the working hypothesis includes suspected perturbations to metabolism.
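The name-based query can be sketched as a substring search over the pathway list. The Python below is a hedged illustration and not the tool's R code. The function `search_pathways` and the dictionary keys `id` and `name` are hypothetical, and the matching is assumed to be case-insensitive. The list is a three-entry stand-in for the full Reactome pathway list: the two Reactome IDs are those quoted in the text, and the third entry is an explicit placeholder.

```python
# Stand-in for the pathway list fetched at launch; the last entry is a placeholder.
PATHWAYS = [
    {"id": "R-HSA-70326", "name": "Glucose metabolism"},
    {"id": "R-HSA-1430728", "name": "Metabolism"},
    {"id": "PLACEHOLDER-ID", "name": "Placeholder pathway"},  # hypothetical entry
]


def search_pathways(pathways, term):
    """Return the pathways whose name contains the query term."""
    needle = term.strip().lower()
    if not needle:
        return []
    return [p for p in pathways if needle in p["name"].lower()]


for hit in search_pathways(PATHWAYS, "metabolism"):
    print(hit["id"], hit["name"])
```

Because the match is on the name alone, a broad term such as “metabolism” returns both the high-level pathway and its more specific descendants, which is how the hierarchy surfaces in the query table.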
2.3 Interactive visualization
The plotly package provides several different options to interact with the data. Hovering in the upper right corner of the volcano plot reveals the options to zoom, pan, box select, lasso select, zoom in, zoom out, autoscale and reset axes. The box and lasso select options focus on specific regions of the plot by graying out the genes outside of the selected regions. An option to download a png file of the plot is also provided. Slider bars in the left panel in Fig. 1A allow interactive changes of the fold change and P-value thresholds. Note that for simplicity, the P-value threshold is adjusted as the raw value and then translated into the -log10 value on the plot. Gene IDs can be toggled on or off at the bottom and the font size and offset of the labels adjusted by slider bars to optimize the clarity of the plots. A button to download the gene table shown in Fig. 1E is at the bottom of the panel.
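The relationship between the slider values and the plot can be illustrated in Python. This is a sketch under stated assumptions, not the tool's R code. The function names, the default thresholds, and the class labels are hypothetical. The fold change threshold is assumed to be expressed in log2 units and applied symmetrically, and the comparison operators are a choice made for this example.

```python
import math


def threshold_line(p_threshold):
    """Translate a raw P-value threshold into its height on the -log10 axis."""
    return -math.log10(p_threshold)


def classify(log2_fc, padj, fc_threshold=1.0, p_threshold=0.05):
    """Label one feature relative to the two thresholds (assumed rules)."""
    if padj < p_threshold and log2_fc >= fc_threshold:
        return "up"
    if padj < p_threshold and log2_fc <= -fc_threshold:
        return "down"
    return "not significant"


print(round(threshold_line(0.05), 3))  # height of the horizontal threshold line
print(classify(2.0, 0.001))
print(classify(0.2, 0.001))
```

Working with the raw P-value in the control and converting it only for display means a user can set a familiar value such as 0.05 without computing its -log10 equivalent.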
3 Conclusion
The Pathway Volcano tool provides a unique way to simplify the analysis and interpretation of differential expression data by focusing on biological pathways and not just filtering by fold change or statistical significance. This strategy of data reduction helps uncover significant changes that would otherwise be obscured in the crowded region of data near the top of the “cone” of the volcano. The interactivity and the ability to export graphical images and data tables should make this a very helpful tool for researchers working with differential expression data.
Acknowledgements
Thanks to Bharat Gudipudi for helpful code review and to Dr Jeffery Baumes and Dr Roni Choudhury for helpful discussions.
Author contributions
Thomas O'Connell (Conceptualization [lead], Investigation [lead], Methodology [lead], Software [lead], Visualization [lead], Writing—original draft [lead], Writing—review & editing [lead])
Conflict of interest: None declared.
Funding
This work was supported in part by NIH/National Cancer Institute SBIR Program (Project Number K004857-00-S01) in collaboration with Kitware Inc.
References
- Aebersold R, Mann M. Mass-spectrometric exploration of proteome structure and function. Nature 2016;537:347–55.
- Choi M, Chang CY, Clough T et al. MSstats: an R package for statistical analysis of quantitative mass spectrometry-based proteomic experiments. Bioinformatics 2014;30:2524–6.
- Fabregat A, Sidiropoulos K, Viteri G et al. Reactome pathway analysis: a high-performance in-memory approach. BMC Bioinformatics 2017;18:142.
- Hasin Y, Seldin M, Lusis A. Multi-omics approaches to disease. Genome Biol 2017;18:83.
- Jassal B, Matthews L, Viteri G et al. The reactome pathway knowledgebase. Nucleic Acids Res 2020;48:D498–503.
- Khatri P, Sirota M, Butte AJ. Ten years of pathway analysis: current approaches and outstanding challenges. PLoS Comput Biol 2012;8:e1002375.
- Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol 2014;15:550.
- Marx V. Biology: the big challenges of big data. Nature 2013;498:255–60.
- McCoy KM, Antonio ML, van Opijnen T. MAGenTA: a Galaxy implemented tool for complete Tn-Seq analysis and data visualization. Bioinformatics 2017;33:2781–3.
- Mullan KA, Bramberger LM, Munday PR et al. ggVolcanoR: a Shiny app for customizable visualization of differential expression datasets. Comput Struct Biotechnol J 2021;19:5735–40.
- Sievert C. Interactive Web-Based Data Visualization with R, Plotly, and Shiny. Boca Raton, FL, USA: Chapman and Hall/CRC, 2020.
- Stark R, Grzelak M, Hadfield J. RNA sequencing: the teenage years. Nat Rev Genet 2019;20:631–56.
- Tyanova S, Temu T, Sinitcyn P et al. The Perseus computational platform for comprehensive analysis of (prote)omics data. Nat Methods 2016;13:731–40.
- Wickham H. ggplot2: Elegant Graphics for Data Analysis. New York: Springer-Verlag, 2016.

