Skip to main content
Bioinformatics logoLink to Bioinformatics
. 2023 May 27;39(6):btad325. doi: 10.1093/bioinformatics/btad325

IndepthPathway: an integrated tool for in-depth pathway enrichment analysis based on single-cell sequencing data

Sanghoon Lee 1,2,3, Letian Deng 4,5, Yue Wang 6,7, Kai Wang 8, Maureen A Sartor 9, Xiao-Song Wang 10,11,12,
Editor: Peter Robinson
PMCID: PMC10275909  PMID: 37243667

Abstract

Motivation

Single-cell sequencing enables exploring the pathways and processes of cells, and cell populations. However, there is a paucity of pathway enrichment methods designed to tolerate the high noise and low gene coverage of this technology. When gene expression data are noisy and signals are sparse, testing pathway enrichment based on the genes expression may not yield statistically significant results, which is particularly problematic when detecting the pathways enriched in less abundant cells that are vulnerable to disturbances.

Results

In this project, we developed a Weighted Concept Signature Enrichment Analysis specialized for pathway enrichment analysis from single-cell transcriptomics (scRNA-seq). Weighted Concept Signature Enrichment Analysis took a broader approach for assessing the functional relations of pathway gene sets to differentially expressed genes, and leverage the cumulative signature of molecular concepts characteristic of the highly differentially expressed genes, which we termed as the universal concept signature, to tolerate the high noise and low coverage of this technology. We then incorporated Weighted Concept Signature Enrichment Analysis into an R package called “IndepthPathway” for biologists to broadly leverage this method for pathway analysis based on bulk and single-cell sequencing data. Through simulating technical variability and dropouts in gene expression characteristic of scRNA-seq as well as benchmarking on a real dataset of matched single-cell and bulk RNAseq data, we demonstrate that IndepthPathway presents outstanding stability and depth in pathway enrichment results under stochasticity of the data, thus will substantially improve the scientific rigor of the pathway analysis for single-cell sequencing data.

Availability and implementation

The IndepthPathway R package is available through: https://github.com/wangxlab/IndepthPathway.

1 Introduction

The identification of causal pathways underlying physiological processes or disease initiation, progression, or therapeutic resistance based on proteomic, genomic, and transcriptomic datasets are major challenges of genomics research, particularly for single-cell sequencing (SCS)-based studies. Pathway enrichment (PE) methods can be classified chronologically into three generations (García-Campos et al. 2015): (i) over-representation analysis (ORA) (Yu et al. 2012), such as Fisher’s exact test, (ii) functional class scoring, such as Gene Set Enrichment Analysis (GSEA) (Subramanian et al. 2005), and (iii) network-topology based, such as network enrichment analysis (NEA) (Jeggari et al. 2018). Based on their dependency on gene weights [i.e. levels of differential expressions (DEs)], PE methods can be classified into unweighted or weighted methods. The unweighted methods (i.e. ORA or NEA) are commonly used for analysis of genetic traits extracted from genomic analysis. For RNA-seq analysis, these methods can be used to identify the pathways within the user-defined DE genes based on thresholding. The weighted methods (i.e. GSEA) are commonly used for analysis of continuous expression data. Incorporating the degrees of differential gene expressions allows for detection of different magnitudes of pathway alterations.

It is notable that PE methods have distinct functionality from pathway activity inference, pathway and gene set over-dispersion analysis (PAGODA), regulon activity inference, or functional gene set inference. PE methods aim to test the enrichment of a wide array of functional pathways in differentially expressed genes thus are commonly used for pathway discovery. Whereas pathway activity inference tools assess the activity of handful pathways in each sample or single cell. Popular pathway activity inference tools, such as PROGENy, require consensus signatures obtained from publicly available perturbation experiments, which further limit the scope of the pathways to be assessed (Schubert et al. 2018). PAGODA is used to cluster cells into transcriptional subpopulations (Fan et al. 2016). Regulon inference (i.e. AUCel) is designed for testing the activity of the large footprint regulons for gene regulatory network reconstruction (Aibar et al. 2017), thus are not dedicated for pathways that have much smaller footprints. Functional gene set inference depends on PE methods for testing gene set enrichments based on matrix factorizations (MFs), which we will discuss further in Section 3.

While there are a handful pathway activity inference methods for SCS data, to date there is a paucity of PE methods tailored for SCS data despite its popularity for pathway discovery in biomedical research. Compared to bulk-sequencing, SCS enables exploring pathways and processes of individual cells. One of the most important analyses to achieve this is to identify the pathways that exhibit significant differences among different cell states or cell types. Due to low genome coverage and high amplification bias, bioinformatics tools developed for bulk-sequencing may not work well with SCS data (Ning et al. 2014; Poirion et al. 2016). The conspicuous challenges of SCS data analysis include dissecting single-cell expression variability, identifying rare cell populations, and understanding spatio-temporal transitions (Stegle et al. 2015; Yuan et al. 2017). In the case of PE analysis, the problem is less self-evident due to lack of golden standard and explicit modeling methods to assess the impact of SCS technical variability on the pathway interpretations. Further, it is even more challenging to detect the pathways enriched in less abundant cells, which is more vulnerable to disturbances.

In our previous study, we sought to improve the reproducibility of PE analysis based on experimentally defined gene sets through leveraging the vast variety of knowledge-based gene sets including gene ontologies, pathways, interactions, and domains etc., to help inform complex functional relations, which we collectively termed as “molecular concepts.” Based on this idea, we developed a concept signature enrichment analysis (CSEA) for deep assessing the functional relations between pathway gene sets and a target gene list using a compendia of molecular concepts (Chi et al. 2020). This method is grounded on the framework of shared concept signatures between gene sets, thus overcomes the limitations of the current algorithms relying on assessing gene set overlap. Here “a target gene list” is defined as a gene list of interest cataloged by comparative genomic analysis, and a “pathway gene set” is defined as a set of genes known to function in a certain pathway (Fig. 1A). The concept signatures of a target gene list are collectively termed as a Universal Concept Signature (uniConSig). CSEA deeply interprets the function of a target gene list via computing their overrepresentations in a wide array of molecular concepts, which were then used as weights to compute a cumulative genome-wide uniConSig score that represent the functional relevance of human genes underlying this target gene list (Fig. 1D). Then the UniConSig-sorted genome will be used for testing PE through weighted Kolmogorov–Smirnov (K-S) (WKS) tests (Fig. 1E). If a pathway gene set is functionally like the target gene list, the uniConSig scores will be high for that pathway genes, which will result in a high normalized enrichment score (NES) of that pathway. CSEA represents a new class of PE method that is distinct from all earlier generations with much improved depth in assessing gene set functional relations. A major advantage of CSEA over the approach based on interactome network topology is that CSEA is grounded on the framework of the vast knowledge databases, and thus can comprehensively assess most, if not all, functional aspects when computing the functional relations (Chi et al. 2020). In this study, we further developed a weighted CSEA (WCSEA) method for PE analysis based on a weighted gene list as in the function of GSEA. CSEA accept a list of target genes (i.e. up, or down-regulated genes), whereas WCSEA accept a weighted target gene list (i.e. levels of DEs) as input data (Supplementary Fig. S1). Thus, CSEA interprets pathways in a gene set, whereas WCSEA interprets pathways in a weighted gene list.

Figure 1.

Figure 1.

Schematic of the unweighted and WCSEA methods for PE analysis based on bulk or scRNA-seq data. (A) For CSEA analysis, a target gene list will be generated from genomic datasets (i.e. differentially expressed or mutated genes). CSEA achieves deep functional assessment of the target gene list by performing similarity tests, which uses Ochiai index, to quantitate the enrichment of each molecular concept associated with a given Gene x in the target gene list. The resulting Ochiai index for each concept will be used as the concept weight (ωi) for calculating uniConSig scores. (B) For WCSEA analysis, the weighted gene list can be created based on DE statistics comparing different cell groups, or from the amplitude matrix of a LDC identified by MF methods. The uniConSig analysis achieves deep functional assessment of the weighted gene list (based on DE statistics) by performing WKS tests to quantitate the enrichment of each molecular concept associated with a given Gene x in the highly weighted genes. The resulting NES score for each concept is used as weight for calculating the uniConSig score of Gene x, which will be normalized to the range of [0,1]. (C) These weights generated by similarity tests or WKS tests will be penalized by the degree of overlaps between the Gene x concepts, computed using their Ochiai similarity index. We then introduced a penalization factor (ε) calculated for each molecular concept i based on the cumulated similarities of concept i with other molecular concepts associated with Genex. (D) The uniConSig scores will be computed using the indicated formula to reflect the functional relevance of human genes underlying the target gene list (for CSEA analysis) or highly weighted genes (for WCSEA analysis). (E) The enrichment of pathways in the pathway datasets (i.e. MSigDB C2CP canonical pathways, or hallmark) in this uniConSig-sorted gene list will then be assessed using WKS tests. The resulting NESs can be used as a quantitative measure of the functional associations between these pathways underlying the target gene list (CSEA) or weighted gene list (WCSEA).

We speculate that CSEA and WCSEA could be particularly suited for addressing the characteristics of SCS data. The rationale is that CSEA and WCSEA may tolerate the high noise and stochasticity of SCS data via leveraging the stable concept signatures of the target gene list to improve the reproducibility and scientific rigor of PE results. While the expressions of individual genes may be noisy, coordinated up-regulation of genes within a gene set could provide a more stable signature. A similar principle has been leveraged by PAGODA to cluster cells into transcriptional subpopulations (Fan et al. 2016). We then further develop a purpose-built PE package called “IndepthPathway” for deep PE analysis from bulk and SCS data that took a broader approach for assessing gene set relations and leverage the universal concept signature of the target gene list to tolerate the high noise and low gene coverage of this technology. Our results show that IndepthPathway achieved better reproducibility under technical noise simulations of SCS data as well as in real matched bulk and single-cell RNAseq data compared to other popular PE methods. Empowered by its strength in deep functional interpretation, IndepthPathway will have broad applications on inferring the pathway alterations from bulk and SCS data.

2 Materials and methods

2.1 Compiling molecular concept and pathway datasets

The uniConSig score calculation and deep functional interpretation for a target gene list rely on the precompiled human molecular concept databases. To generate a comprehensive knowledge base, we compiled 45 522 molecular concepts from the Molecular Signatures Database (MSigDB) C2CP pathways, or hallmark (v.7.0, https://www.gsea-msigdb.org/gsea/msigdb) (Zhao et al. 2017), NCBI EntrezGene interactome (Brown et al. 2015) database, the Pathway commons database (Rodchenkov et al. 2020) (http://www. pathwaycommons.org), and conserved domain database (Brown et al. 2015) (Fig. 1C).

2.2 WCSEA for pathway interpretation

To generate target gene lists, users can perform DE analysis of single-cell transcriptomic data based on their preferred methods, such as Limma or Single Cell Differential Expression (SCDE) (Kharchenko et al. 2014). The identified up- or down-regulated gene lists were used as the target gene list for CSEA analysis (Fig. 1A). For WCSEA analysis, we used the signed statistical significance of the expression difference for the target genes as weights for uniConSig score calculation for genes and for PE analysis (Fig. 1B).

Target gene weights=signed statistical significance such as qvalues. (1)

The uniConSig scores assess the functional associations of human genes underlying the experimental gene list by calculating the similarities (Ochiai index) between the experimental gene list and the molecular concepts associated with each Genex in the genome (Fig. 1A). In WCSEA analysis, the calculation of uniConSig scores from a weighted gene list is based on the enrichment of each molecular concepts associated with Gene x in the weighted gene list calculated using WKS test. The resulting NES for each concept is used as a concept weight ωi for calculating the genome-wide uniConSig score (Fig. 1B).

concept weight ωi=NES=ESmean(ESrandom). (2)

To reduce the inflation effects of redundant concepts on the uniConSig scores, we penalized the scores based on the similarity matrix of molecular concepts associated with each Genex in the genome (Fig. 1C). We generated the similarity matrix by computing the Ochiai index between molecular concepts.

Ochiai index,Oij=A BA × B . (3)

We then introduced a penalization factor (ε) calculated for each molecular concept i associated with Genex based on the cumulated similarities of concept i with other molecular concepts associated with Genex (Fig. 1C). The penalization factor εi is given by

penalization factor εi=j=1nOij. (4)

Compared to the Jaccard Index used in our previous study (Chi et al. 2020), the Ochiai index can better assess the similarity between a small molecular concept and a large molecular concept. To demonstrate this, we identified 300 large and small concept pairs in which each large concept contains all genes in the paired small concept, and calculated similarity indexes by both methods (Jaccard index or Ochiai index) (Supplementary Fig. S2). The Ochiai index produced much higher similarity scores for these 100% overlapping concepting pairs than Jaccard Index.

The Effective Weight (EW) is then calculated as (Fig. 1D):

EWi=ωiεi. (5)

The sum of the reciprocals of εi (harmonious sum) is then used to calculate the effective concept number (ECN) based on different degrees of concept overlaps:

ECN=i=1n1εi . (6)

As discussed in our previous study (Chi et al. 2020), the EW and ECN algorithms have the same effect as removing duplicated concepts then calculate the concept weights and concept numbers. Including ε in the calculation of EW and ECN avoids using an arbitrary cutoff for removing duplicated concepts. Finally, the uniConSig score of the given Genex is calculated as following (Fig. 1D):

uniConSig|Genex=i=1nEWiECN=i=1nωiεii=1n1εi. (7)

Here, the square root transformation compresses the effect of high concept numbers to accentuate final scores on the weights. The cumulative genome-wide uniConSig scores reflect the functional relevance of the human genes underlying the highly weighted target genes, which can be interpreted as the adjusted average of normalized ωi. The NESs of pathway gene sets were then calculated by the weighted step-up of the random walk K-S test using uniConSig-sorted human gene list to assess the functional associations of these pathways with the original weighted gene list. As in GSEA, users can interpret the most important pathways based on their ranks.

2.3 Disambiguation of the crosstalk effects between similar pathways

Many pathways have varying degrees of overlap each other due to their shared genes. To reveal independent pathway modules, here, we implemented disambiguation steps to correct the crosstalk effects between the enriched pathways, as previously described (Donato et al. 2013). The enriched pathways should be independently enriched even after deleting the genes overlapped with the other similar pathway. Users can choose Bonferroni or Benjamini–Hochberg method for multiple testing correction of the P-values. In our study, the significance of the disambiguation process was determined by adjusted P-value <.01 based on “Bonferroni” method. We then used CSEA to calculate the functional associations of the unique pathways following the disambiguation step. Here, the CSEA method can assess the functional associations between two pathways, in which CSEA consider one pathway as a target gene list, and a second pathway as a pathway gene list, and implements the same PE analysis to assess their functional associations. If the gene lists of two pathways share many molecular concepts, the second pathway will have a high enrichment score, which can represent high functional association between the two pathways. The associations of the top pathways were visualized using either association heatmap or the pathway network. We used the corrplot R package (ver. 0.92) for the association heatmap and the igraph R package (ver. 1.2.4.2) for pathway network.

2.4 PE analysis based on three scRNAseq datasets

We applied CSEA, WCSEA, GSEA, and other PE tools to analyze three public single-cell RNA-seq (scRNA-seq) datasets. The first dataset is GSE68981 that compared the active and quiescent populations of hematopoietic stem cells (HSCs, n = 112) (Yang et al. 2017). The second scRNAseq dataset is GSE75748 that profiled four lineage-specific progenitor cells derived from human embryonic stem cells (hESCs) (Chu et al. 2016). More important, this dataset provide the matched bulk RNA-seq for hESCs (n = 4) and derived endothelial cells (ECs, n = 3), which can be used for directly comparing the PE results based on real matched bulk and single-cell RNAseq data. We thus performed PE analysis comparing hESCs with ECs using both bulk and scRNAseq data (212 hESCs versus 105 ES cells). The third scRNA-seq dataset is GSE132172 that contains scRNA-seq data for human neural stem cells (n = 59) and glioblastoma (GBM) cancer stem cells (n = 75) (Zhao et al. 2019). PE analyses were performed to compare the aforementioned cell types using multiple PE tools.

2.5 Simulations of multiple faceted variability of scRNA-seq and dropouts in gene expression

To benchmark the performance of PE tools, we simulated the multi-faceted variability of scRNA-seq data using Synthetic model of multiple variability factors for Simulation (SymSim) (Zhang et al. 2019). We simulated technical variations by noise intrinsic to the process of transcription, extrinsic indication of different cell states, and low sensitivity in measurement. In the noise simulations, we use two different sets of parameters; PS1, alpha_mean =0.05; nPCR1 = 16 and PS2, alpha_mean =0.04; nPCR1 = 18. ‘alopha_mean’ is the mean of rate of subsampling of transcripts during capture step to give transcript bias. ‘nPCR1’ is the number of PCR cycles in pre-amplification step to give the extend of amplification bias. The example code to simulate scRNA-seq data is included in our IndepthPathway R package. In addition, we simulated dropouts in gene expressions by randomly deleting 20% or 50% of the expressed genes. We then compared the performance of CSEA, WCSEA, and other PE discovery tools, such as ORA (Yu et al. 2012), NEA (Jeggari et al. 2018), GSEA (Subramanian et al. 2005), UniPath (Chawla et al. 2021), and iDEA (Ma et al. 2020), based on the variations in pathway ranks under simulated technical variations. ORA is implemented based on R package clusterProfiler (ver. 4.6.0), and NEA is implemented based on NEArender R package (ver. 1.5). UniPath was implemented using GitHub R package (ver, 0.1.0, https://reggenlab.github.io/UniPathWeb/). iDEA was implemented using GitHub R package (https://xzhoulab.github.io/iDEA/installation/).

2.6 Visualizing enriched pathways from IndepthPathway analysis

Here, we provide five modules for visualization of enriched pathways. The first option generates a heatmap revealing the functional association between the top up- or down-regulated pathways (Supplementary Fig. S3). The second option generates a network revealing the functional associations of the top up- or down-regulated pathways. The third option generates an enrichment plot for one selected pathway as in GSEA (Supplementary Fig. S4A). The fourth option generates a heatmap to represent differentially expressed genes from a specific pathway gene set as in GSEA (Supplementary Fig. S4B). The fifth option generates an interactome network that visualize the DE levels of genes in a selected pathway and their interactions with a target gene of interest based on human reference binary protein interactome (Luck et al. 2020) and Entrez gene reference interactome database (Maglott et al. 2011) (Supplementary Fig. S5).

3 Results

3.1 Development of a WCSEA method to enhance its utility and reproducibility in detecting pathway alterations from single-cell transcriptomics data

Next, we developed a WCSEA method to enhance its functionality and reproducibility in analysis of single-cell transcriptomics. We propose WCSEA to achieve deep functional assessment of the weighted gene list (based on DE statistics) by performing WKS tests to quantitate the enrichment of each molecular concept associated with a given Genex in the highly weighted genes. The resulting NES score for each concept is used as weight for calculating the uniConSig score of Genex (Fig. 1B). These weights are then penalized by the degree of overlaps between the Genex concepts, computed based on their Ochiai’s similarity indexes (Fig. 1C). The resulting uniConSig scores reflect the functional relevance of human genes underlying the highly weighted genes in the list (i.e. highly overexpressed genes) (Fig. 1D). Then, PE analysis will be performed on this genome-wide weighted gene list based on uniConSig scores (Fig. 1E). WCSEA leverages the stable concept signatures of highly weighted genes to create a buffer zone for tolerating the high noise and stochasticity of scRNA-seq data.

3.2 Development of IndepthPathway package for biologists to broadly leverage this technology to enhance pathway discovery from experimentally defined gene sets and SCS data

Next, we implemented CSEA and WCSEA in a R package called “IndepthPathway” to widely disseminate this technology. This purpose-built tool allows users to provide a weighted or unweighted target gene list defined by comparative analysis of bulk or SCS data for WCSEA or CSEA analysis, respectively (Supplementary Fig. S1). Users can perform DE analysis between the comparing sample groups or cell groups using the bulk or SCS DE methods, such as DESeq2 (Love et al. 2014), Limma (Ritchie et al. 2015), and SCDE (Kharchenko et al. 2014), and the resulting statistical values of user’s choice can be used to generate a weighted gene list for WCSEA analysis. This will allow user to implement their preferred DE methods to identify differentially expressed genes, and we have implemented the Limma and SCDE methods in our package. Over- or under-expressed pathways should be analyzed separately. Users can also implement unsupervised MFs techniques to reveal the low-dimensional components (LDCs) for functional pathway inference (Stein-O'Brien et al. 2018). MF is not a PE method, but instead, it relies on PE analysis to explore the pathways enriched in an LDC using its amplitude matrix, which is known as functional pathway inference. Users can select popular MF methods, such as UMAP (Becht et al. 2019) or t-SNE (Linderman et al. 2019), and perform WCSEA analysis for selected LDC to identify the key pathways underlying that component.

For deep functional interpretations, we have precompiled the human concept datasets from the MSigDB (Subramanian et al. 2005), NCBI EntrezGene interactome database (Brown et al. 2015), and conserved domain database (Brown et al. 2015). Users can use our precompiled human concept datasets or provide their own concept dataset. We expect inclusion of the more available concepts will substantially “thicken” the “buffering zone” via increasing the number of signature concepts, thus will further improve the stability of WCSEA under technical noise. For PE analysis, the pathway gene sets for curated canonical pathways and hallmark pathways have been precompiled from MSigDB (Subramanian et al. 2005) and provided to users. Users may select the precompiled pathway gene sets or provide their own pathway gene sets in the Gene Matrix Transposed file format (*.gmt). In addition, we have integrated downstream analysis modules to improve its usefulness. Following PE analysis, we have implemented a module for users to disambiguate and identify independent pathway modules via correcting the crosstalk effects between pathways, as previously reported (Donato et al. 2013). The PE results include the pathways ranked by NES scores, together with the P-values and FDR q-values. We also implemented a module to allow users to generate a pathway network showing the functional relations between significant pathways, which are computed based on the CSEA method, and a module to visualize the DE and interactome of core enriched genes in a selected pathway related to a select target gene of interest, which is a powerful tool to explore the mechanisms.

3.3 IndepthPathway provided more in-depth view of complex pathway alternations from scRNAseq data

Next, we applied the IndepthPathway tool to examine the pathways enriched in active versus quiescent populations of HSCs based on a publishes SCS dataset (Yang et al. 2017). Differential gene expressions between active and quiescent HSC are compared and the resulting statistical values are used as weights for WCSEA analysis. Following WCSEA analysis, the top 30 up- and down-regulated pathways were disambiguated as explained in Section 2 and plotted into a network space based on the functional associations of these pathways computed through WCSEA analysis (Fig. 2A). Our results show that the WCSEA methods can comprehensively detect the pathways known to characterize active HSC: (i) up-regulation of E2F targets, DNA replication, and G2M checkpoint pathways, indicating increased cell cycle activities (Kwon et al. 2017; Yang et al. 2017); (ii) up-regulation of MYC targets known to control HSC self-renewal and differentiation (Wilson et al. 2004); (iii) up-regulation of spliceosome, which is consistent with the increased total RNA amount in active HSCs (Hüttmann et al. 2001); (iv) down-regulation of pathway required for the quiescent HSCs to respond to interferon α/β stimulation (Robb 2007; Essers et al. 2009); and (v) up-regulation of various DNA repair pathways. It is reported that DNA damages accumulated during quiescent state will be repaired upon entry into active state (Beerman et al. 2014). In contrast, the commonly used GSEA (Subramanian et al. 2005) appears to enrich the most significant pathways but lack the depth to interpret complex pathway alternations (Fig. 2B). Furthermore, interrogation of interactome network revealed the direct role of MYC in regulating the replicative complex in addition to its function in transcriptional regulation of cell cycle (Supplementary Fig. S5), consistent with previous report (Dominguez-Sola et al. 2007).

Figure 2.

Figure 2.

Pathway characteristics of active HSCs revealed by WCSEA or GSEA based on a SCS dataset. Pathway association networks showing the top up- and down-regulated pathways in active HSCs compared to quiescent HSCs detected by WCSEA (A) or GSEA (B). Top 30 up- and down-regulated pathways based on WCSEA or GSEA analysis are disambiguated, and the resulting non-ambiguous pathways are plotted in the network. The node size correlates with the levels of enrichments of the respective pathways, and the edge thickness correlates with the levels of functional associations between the pathways.

We then compared IndepthPathway with other PE methods in detecting significantly enriched pathways in active versus quiescent populations of HSC. We included representative PE tools of each generations including the GSEA, NEA, ORA, among which GSEA accepts weighted gene list based on DGE as in WCSEA, whereas NEA and ORA accept up- or down-regulated gene sets as in CSEA. In addition, we also included two pathway tools specialized for scRNAseq data, a PE tool called iDEA that carries out integrative DE and GSEA (Ma et al. 2020), and a popular pathway activity inference tool called UniPath (Chawla et al. 2021). As discussed above, pathway activity inference tools produce activity scores for each pathway in each single cell and additional statistical analysis will be required for PE analysis to compare the pathway activities between the different cell groups, and we have implemented in our analysis. We then constructed kernel density of P-values calculated for enriched pathways detected by different PE tools to view the distribution of P-values within .05 significance range (Fig. 3A). The left skewed density of the P-values for the detected pathways indicates that WCSEA identified more significantly enriched pathways than other PE methods. We also counted the number of significant pathways with increasing FDR q-value cutoffs for different PE methods (Fig. 3B). CSEA and WCSEA identified more enriched pathways with low false positive rates compared to other PE methods.

Figure 3.

Figure 3.

The performance of CSEA, WCSEA, and other PE tools on detecting pathways characteristic of active HSC based on scRNA-seq dataset and the reproducibility under simulated technical variability and dropouts in gene expression. (A) Kernal density of P-values from CSEA, WCSEA, and other PE tools. The density plot display the significantly enriched pathways detected by each PE tool within P-value .05. (B) The number of identified enriched pathways by each PE tools based on increasing FDR q-value cutoffs from 1 × 10−05 up to 0.05. (C) Noise simulations are performed using SymSim with two parameter sets PS1 and PS2 representing different noise levels (see Section 2). The dropouts are simulated via randomly deleting 20% or 50% expressed genes. Five permutations are done for each simulation. Read counts for 177 HSC cells (77 quiescent HSCs and 100 active HSCs) are used as true counts before simulations. SCDE is used for DE analysis, and the results are used for PE analysis. Variations in pathway ranks for top 30 up- or down-regulated pathways in quiescent HSCs are used for benchmarking with negative variations indicating reduced significance of the pathways.

3.4 IndepthPathway yielded reproducible pathway results under simulations of multiple faceted variability of scRNA-seq and dropouts in gene expression

Next, we sought to examine the performance of IndepthPathway under the simulated technical variability of scRNA-seq data. This can be achieved through “SymSim” that explicitly models the multiple faceted variability of scRNA-seq experiments (Zhang et al. 2019). In addition, we simulated dropouts in gene expressions by randomly deleting 20% or 50% of expressed genes and assessed the changes in pathway ranks from the original results. We then tested the performance of IndepthPathway on pathway discovery from scRNA-seq with or without simulated technical variabilities (Fig. 3C). Pathways are usually interpreted based on their enrichment ranks, thus the rank variations for the top 30 up- and down-regulated pathways identified by these methods compared to the original PE results are used for benchmarking.

As expected, the simulations resulted in high variations in the pathway ranks from GSEA, NEA, ORA, iDEA, and UniPath (Fig. 3C). This suggests that PE tools developed for bulk RNA-seq or even for scRNA-seq may yield inconsistent results under technical variability of scRNA-seq. When gene expression data are noisy and signals are sparse, testing PE based on the genes measured may not yield statistically significant results. Tolerating dropouts in gene expression will require deep interpreting the functions of the differently expressed genes in the comparing pathways, instead of simply mapping the gene names in the pathways. Indeed, our result showed that CSEA and WCSEA produced overall lowest variations in pathway ranks with highest density around zero compared to other PE methods under simulations of both technical noises and gene expression dropouts. Together, this result supports the advantage of IndepthPathway on reproducibility under high technical noise and dropouts in gene expression.

3.5 IndepthPathway results show outstanding reproducibility in matched bulk and single-cell RNAseq data, offering in-depth view of pathway alterations in two additional scRNAseq datasets

Next, we sought to show the variability of PE results using the real dataset of matched bulk RNAseq and scRNAseq on the same cell line models. Through extensive literature investigation, we identified a dataset, GSE75748, with matched bulk RNAseq and scRNA-seq of hESCs and derived endothelial cells (EC). We thus performed PE analysis for EC versus hESC using different PE tools. In terms of kernel density of P-values for enriched pathways and significant pathways under different FDR q-value cutoffs, CSEA and WCSEA showed outstanding performance compared to other PE methods (Fig. 4A and B). We then examined variations of pathway ranks between PE results based on the bulk RNA-seq and scRNA-seq datasets, which is a most important test of variability of PE tools (Fig. 4C). Our results show that CSEA and WCSA outperformed all other PE tools, we tested, including GSEA, NEA, ORA, iDEA, and uniPath, showing much less variations in PE results. Likewise, in the simulations of noise or dropouts, CSEA and WCSEA showed much lower variations in pathway ranks than other PE tools (Supplementary Fig. S6).

Figure 4.

Figure 4.

The performance and reproducibility of CSEA, WCSEA, and other PE tools on detecting pathways characteristic of endothelial cells compared to embryonic stem cells based on matched bulk and scRNA-seq dataset. (A) Kernal density plots of P-values from CSEA, WCSEA, and other PE tools. (B) The number of identified enriched pathways by each PE tools based on increasing FDR q-value from 1 × 10−5 up to 0.05. (C) The reproducibility of CSEA, WCSEA, and other PE tools in real matched bulk and SCS datasets comparing embryonic stem cells and derived endothelial cells. Variations of the pathway ranks between bulk RNA-seq and scRNA-seq datasets are shown in the figure with negative variations indicating reduced significance of the pathways. The matched bulk and single-cell RNAseq data are from GSE75748.

Furthermore, WCSEA offered more in-depth and meaningful view of the pathway alterations during endothelial differentiation than GSEA, particularly for up-regulated pathways (Fig. 5). For example, WCSEA was able to comprehensively detect key pathways in endothelial differentiation and function, such as angiogenesis, PDGFRB (Rolny et al. 2006), and VEGF pathways (Li et al. 2017), integrin (Toya et al. 2015), FAK, and RhoGTPases pathways (Shen et al. 2013), IL6 (Fan et al. 2008), EPHA2, and inflammatory pathways (Funk et al. 2012). These results suggest that the high noise and low gene coverage of scRNAseq data indeed can result in high variations in PE analysis and support the utility of IndepthPathway to overcome this issue.

Figure 5.

Figure 5.

Pathways characteristics of endothelial cells differentiated from embryonic stem cells revealed by WCSEA or GSEA. Pathway association networks showing the top up- and down-regulated pathways in endothelial cells compared to H1 hESCs detected by WCSEA (A) or GSEA (B). Top 30 up- and down-regulated pathways based on WCSEA or GSEA analysis are disambiguated, and the resulting non-ambiguous pathways are plotted in the network. The node size correlates with the levels of enrichments of the respective pathways, and the edge thickness correlates with the levels of functional associations between the pathways.

Moreover, we also analyzed a third dataset, GSE132172 that compared neural stem cell with GBM stem cells. To test if CSEA/WCSEA can reveal more significant pathways from the scRNA-seq data, we also compared the density of P-values and the number of significant pathways at different FDR cutoffs. We show CSEA and WCSEA revealed more significant pathways than all other tools (Supplementary Fig. S7A and B), providing in-depth view of the enriched pathways (Supplementary Fig. S7C and D).

4 Discussion

Bioinformatics tools for bulk-sequencing may not work well with SCS data due to the low genome coverage and high amplification bias. In case of PE analysis, the problem is less self-evident and suitability of existing PE methods on SCS data is hard to assess due to the lack of golden standard. When gene expression data are noisy and signals are sparse, testing PE based on the genes measured may not yield statistically significant results. Indeed, our analyses revealed that popular PE methods yielded highly inconsistent results under the simulated multiple faceted variability of scRNA-seq, and when directly comparing real matched scRNAseq and bulk RNAseq data. Here, we introduce a new WCSEA analysis for deep assessing the pathways enriched in a weighted gene list, such as differentially expressions through exploiting comprehensive sets of molecular concepts (i.e. gene ontologies, pathways, interactions, and domains, etc.) to compute the concept signatures shared by the highly weighted genes, then testing the PEs over all genes sorted by their functional similarity to the highly weighted genes based on their associations with these concept signatures. WCSEA takes a broader approach for assessing gene set relations and leverages universal concept signature to tolerate the stochasticity of the data. The utilization of the universal concept signature for PE analysis will be particularly useful for scRNA-seq data as technical noise and dropouts in gene expressions are less likely to affect the universal concept signature. This creates a “buffer zone” to tolerate the technical noise of SCS data and improves reproducibility through leveraging the stable concept signatures of the highly weighted genes for PE analysis.

Through analysis of three single-cell transcriptomics datasets, we demonstrate the outstanding performance and reproducibility of CSEA and WCSEA on detecting complex pathway alterations under high noise and low coverage of scRNAseq data. Under simulated technical variability of scRNA-seq, CSEA and WCSEA yielded lowest deviations in PE results compared to other commonly used PE methods. More importantly, we demonstrate the outstanding stability of IndepthPathway results in a real dataset of matched scRNAseq data and bulk RNAseq data for the same comparing cell line models. Whereas existing tools, such as GSEA, NEA, ORA, iDEA, and UniPath, showed much larger deviations. This is consistent with the report by Holland et al. (2020) that the performance of all PE tools they tested dropped significantly with decreasing gene coverage. They suggest that applying PE tools on perturbation-based pathway gene sets, which they termed as footprint gene sets, are more effective. Such approach, however, will leave away most function-based pathway gene sets and limit the analysis to perturbation-based gene sets. WCSEA can endure the stochasticity of SCS data and detect the enrichment of functional-based pathways with high reproducibility, thus substantially improving the scientific rigor of the pathway analysis for single-cell groups. Furthermore, we provided five modules for visualization of the PE results. In particular, the pathway association heatmap and network allows for visualizing the functional associations of top enriched pathways to reveal the pathway functional clusters (Fig. 2A). The visualization of differential gene expression in the context of interactome network allow for exploration of functional links between the gene of interest and the selected top pathways. This function is particularly helpful for genetic perturbation studies to examine the interactions between the perturbated gene of interest and the up- or down-regulated pathways (Supplementary Fig. S5). The development of the IndepthPathway tool will promote the application of SCS technology to explore the cellular pathway mechanisms at more precise resolution.

Merging a wide array of concept datasets carries inherent concept redundancy, as different gene set sources follow different nomenclatures, and different categories of concepts possess different levels of inherent redundancy. Our uniConSig algorithm possesses a unique ability to compute the redundancy within molecular concept data and sieve only the unique parts of signature concepts for further calculation via an innovative mathematical scheme to penalize the partially overlapping concepts (Fig. 1B). This allows for comprehensively assessing the function relations through merging a wide array of molecular concept datasets. The new design of WCSEA algorithm will empower PE analysis based on a weighted gene list to provide more power to detect different magnitudes of pathway alterations reflected by the levels of DEs and better tolerance of the technical noise of scRNA-seq. WCSEA and GSEA have similar functionality by intaking a weighted gene list and producing PE results, under different designs. Based on the strength of WCSEA in inferring the functional enrichments of gene sets, and the strength of GSEA in direct measuring the enrichments of gene sets, we recommend the use of WCSEA for interpretation of functional-based pathways and the use GSEA for exploring the enrichment of perturbation-based gene expression signature.

Purpose-designed PE methods, such as IndepthPathway, that can better address the characteristics of scRNA-seq data will be of utmost importance to ensure the accuracy and reproducibility of the PE results. The integration of the up and downstream analysis modules, such as differential gene expression, pathway disambiguation, functional associations of pathways, and heatmap and network visualization of the top enriched pathways, will greatly enhance the functionalities of IndepthPathway.

Supplementary Material

btad325_Supplementary_Data

Acknowledgements

We thank Fangping Mu for his kind assistance with the research computing and Yuehua Zhu and Megan E. Yates for testing the IndepthPathway package.

Contributor Information

Sanghoon Lee, UPMC Hillman Cancer Center, University of Pittsburgh, Pittsburgh, PA 15232, United States; Department of Pathology, University of Pittsburgh, Pittsburgh, PA 15232, United States; Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA 15206, United States.

Letian Deng, UPMC Hillman Cancer Center, University of Pittsburgh, Pittsburgh, PA 15232, United States; Department of Pathology, University of Pittsburgh, Pittsburgh, PA 15232, United States.

Yue Wang, UPMC Hillman Cancer Center, University of Pittsburgh, Pittsburgh, PA 15232, United States; Department of Pathology, University of Pittsburgh, Pittsburgh, PA 15232, United States.

Kai Wang, Department of Computational Medicine & Bioinformatics, University of Michigan, Ann Arbor, MI 48109, United States.

Maureen A Sartor, Department of Computational Medicine & Bioinformatics, University of Michigan, Ann Arbor, MI 48109, United States.

Xiao-Song Wang, UPMC Hillman Cancer Center, University of Pittsburgh, Pittsburgh, PA 15232, United States; Department of Pathology, University of Pittsburgh, Pittsburgh, PA 15232, United States; Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA 15206, United States.

Supplementary data

Supplementary data are available at Bioinformatics online.

Conflict of interest

None declared.

Funding

This study was supported by National Institutes of Health grant [1R01CA181368, 1R01CA183976, 1R21CA237964, P30 CA047904 CCSG Dev Funds, P50 CA097190 Head and Neck SPORE DRP to X.-S.W.]; CDMRP grant [W81XWH-22-1-1036]; and PA breast cancer coalition (to X.-S.W.). This research was also supported in part by the University of Pittsburgh Center for Research Computing through the resources provided. This work also used the Extreme Science and Engineering Discovery Environment (XSEDE) supported by National Science Foundation [grant number OCI-1053575]. Specifically, it used the Bridges system, which is supported by National Science Foundation [award number ACI-1445606], at the Pittsburgh Supercomputing Center (PSC).

Data availability

The IndepthPathway R package and the related data are available through: https://github.com/wangxlab/IndepthPathway.

References

  1. Aibar S, González-Blas CB, Moerman T. et al. SCENIC: single-cell regulatory network inference and clustering. Nat Methods 2017;14:1083–6. [DOI] [PMC free article] [PubMed] [Google Scholar]
  2. Becht E, McInnes L, Healy J. et al. Dimensionality reduction for visualizing single-cell data using UMAP. Nat Biotechnol 2019;37:38–44. [DOI] [PubMed] [Google Scholar]
  3. Beerman I, Seita J, Inlay MA. et al. Quiescent hematopoietic stem cells accumulate DNA damage during aging that is repaired upon entry into cell cycle. Cell Stem Cell 2014;15:37–50. [DOI] [PMC free article] [PubMed] [Google Scholar]
  4. Brown GR, Hem V, Katz KS. et al. Gene: a gene-centered information resource at NCBI. Nucleic Acids Res 2015;43:D36–42. [DOI] [PMC free article] [PubMed] [Google Scholar]
  5. Chawla S, Samydurai S, Kong SL. et al. UniPath: a uniform approach for pathway and gene-set based analysis of heterogeneity in single-cell epigenome and transcriptome profiles. Nucleic Acids Res 2021;49:e13. [DOI] [PMC free article] [PubMed] [Google Scholar]
  6. Chi X, Sartor MA, Lee S. et al. Universal concept signature analysis: genome-wide quantification of new biological and pathological functions of genes and pathways. Brief Bioinform 2020;21:1717–32. [DOI] [PMC free article] [PubMed] [Google Scholar]
  7. Chu L-F, Leng N, Zhang J. et al. Single-cell RNA-seq reveals novel regulators of human embryonic stem cell differentiation to definitive endoderm. Genome Biol 2016;17:20. [DOI] [PMC free article] [PubMed] [Google Scholar]
  8. Dominguez-Sola D, Ying CY, Grandori C. et al. Non-transcriptional control of DNA replication by c-Myc. Nature 2007;448:445–51. [DOI] [PubMed] [Google Scholar]
  9. Donato M, Xu Z, Tomoiaga A. et al. Analysis and correction of crosstalk effects in pathway analysis. Genome Res 2013;23:1885–93. [DOI] [PMC free article] [PubMed] [Google Scholar]
  10. Essers MAG, Offner S, Blanco-Bose WE. et al. IFNalpha activates dormant haematopoietic stem cells in vivo. Nature 2009;458:904–8. [DOI] [PubMed] [Google Scholar]
  11. Fan J, Salathia N, Liu R. et al. Characterizing transcriptional heterogeneity through pathway and gene set overdispersion analysis. Nat Methods 2016;13:241–4. [DOI] [PMC free article] [PubMed] [Google Scholar]
  12. Fan Y, Ye J, Shen F. et al. Interleukin-6 stimulates circulating blood-derived endothelial progenitor cell angiogenesis in vitro. J Cereb Blood Flow Metab 2008;28:90–8. [DOI] [PMC free article] [PubMed] [Google Scholar]
  13. Funk SD, Yurdagul A, Albert P. et al. EphA2 activation promotes the endothelial cell inflammatory response: a potential role in atherosclerosis. Arterioscler Thromb Vasc Biol 2012;32:686–95. [DOI] [PMC free article] [PubMed] [Google Scholar]
  14. García-Campos MA, Espinal-Enríquez J, Hernández-Lemus E.. Pathway analysis: state of the art. Front Physiol 2015;6:383. [DOI] [PMC free article] [PubMed] [Google Scholar]
  15. Holland CH, Tanevski J, Perales-Patón J. et al. Robustness and applicability of transcription factor and pathway analysis tools on single-cell RNA-seq data. Genome Biol 2020;21:36. [DOI] [PMC free article] [PubMed] [Google Scholar]
  16. Hüttmann A, Liu SL, Boyd AW. et al. Functional heterogeneity within rhodamine123(lo) Hoechst33342(lo/sp) primitive hemopoietic stem cells revealed by pyronin Y. Exp Hematol 2001;29:1109–16. [DOI] [PubMed] [Google Scholar]
  17. Jeggari A, Alekseenko Z, Petrov I. et al. EviNet: a web platform for network enrichment analysis with flexible definition of gene sets. Nucleic Acids Res 2018;46:W163–70. [DOI] [PMC free article] [PubMed] [Google Scholar]
  18. Kharchenko PV, Silberstein L, Scadden DT.. Bayesian approach to single-cell differential expression analysis. Nat Methods 2014;11:740–2. [DOI] [PMC free article] [PubMed] [Google Scholar]
  19. Kwon JS, Everetts NJ, Wang X. et al. Controlling depth of cellular quiescence by an Rb-E2F network switch. Cell Rep 2017;20:3223–35. [DOI] [PMC free article] [PubMed] [Google Scholar]
  20. Li L, Liu H, Xu C. et al. VEGF promotes endothelial progenitor cell differentiation and vascular repair through connexin 43. Stem Cell Res Ther 2017;8:237. [DOI] [PMC free article] [PubMed] [Google Scholar]
  21. Linderman GC, Rachh M, Hoskins JG. et al. Fast interpolation-based t-SNE for improved visualization of single-cell RNA-seq data. Nat Methods 2019;16:243–5. [DOI] [PMC free article] [PubMed] [Google Scholar]
  22. Love, M.I., Huber, W. and Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol 2014;15(12):1–21. [DOI] [PMC free article] [PubMed] [Google Scholar]
  23. Luck K, Kim D-K, Lambourne L. et al. A reference map of the human binary protein interactome. Nature 2020;580:402–8. [DOI] [PMC free article] [PubMed] [Google Scholar]
  24. Ma Y, Sun S, Shang X. et al. Integrative differential expression and gene set enrichment analysis using summary statistics for scRNA-seq studies. Nat Commun 2020;11:1585. [DOI] [PMC free article] [PubMed] [Google Scholar]
  25. Maglott D, Ostell J, Pruitt KD. et al. Entrez Gene: gene-centered information at NCBI. Nucleic Acids Res 2011;39:D52–7. [DOI] [PMC free article] [PubMed] [Google Scholar]
  26. Ning L, Liu G, Li G. et al. Current challenges in the bioinformatics of single cell genomics. Front Oncol 2014;4:7. [DOI] [PMC free article] [PubMed] [Google Scholar]
  27. Poirion OB, Zhu X, Ching T. et al. Single-cell transcriptomics bioinformatics and computational challenges. Front Genet 2016;7:163. [DOI] [PMC free article] [PubMed] [Google Scholar]
  28. Ritchie, M.E., et al. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research 2015;43(7):e47–e47. [DOI] [PMC free article] [PubMed] [Google Scholar]
  29. Robb L. Cytokine receptors and hematopoietic differentiation. Oncogene 2007;26:6715–23. [DOI] [PubMed] [Google Scholar]
  30. Rodchenkov I, Babur O, Luna A. et al. Pathway commons 2019 update: integration, analysis and exploration of pathway data. Nucleic Acids Res 2020;48:D489–97. [DOI] [PMC free article] [PubMed] [Google Scholar]
  31. Rolny C, Nilsson I, Magnusson P. et al. Platelet-derived growth factor receptor-beta promotes early endothelial cell differentiation. Blood 2006;108:1877–86. [DOI] [PubMed] [Google Scholar]
  32. Schubert M, Klinger B, Klünemann M. et al. Perturbation-response genes reveal signaling footprints in cancer gene expression. Nat Commun 2018;9:20. [DOI] [PMC free article] [PubMed] [Google Scholar]
  33. Shen Y, Ma Y, Gao M. et al. Integrins-FAK-Rho GTPases pathway in endothelial cells sense and response to surface wettability of plasma nanocoatings. ACS Appl Mater Interfaces 2013;5:5112–21. [DOI] [PubMed] [Google Scholar]
  34. Stegle O, Teichmann SA, Marioni JC.. Computational and analytical challenges in single-cell transcriptomics. Nat Rev Genet 2015;16:133–45. [DOI] [PubMed] [Google Scholar]
  35. Stein-O'Brien GL et al. Enter the matrix: factorization uncovers knowledge from omics. Trends Genet 2018;34:790–805. [DOI] [PMC free article] [PubMed] [Google Scholar]
  36. Subramanian A, Tamayo P, Mootha VK. et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci USA 2005;102:15545–50. [DOI] [PMC free article] [PubMed] [Google Scholar]
  37. Toya SP, Wary KK, Mittal M. et al. Integrin alpha6beta1 expressed in ESCs instructs the differentiation to endothelial cells. Stem Cells 2015;33:1719–29. [DOI] [PMC free article] [PubMed] [Google Scholar]
  38. Wilson A, Murphy MJ, Oskarsson T. et al. c-Myc controls the balance between hematopoietic stem cell self-renewal and differentiation. Genes Dev 2004;18:2747–63. [DOI] [PMC free article] [PubMed] [Google Scholar]
  39. Yang J, Tanaka Y, Seay M. et al. Single cell transcriptomics reveals unanticipated features of early hematopoietic precursors. Nucleic Acids Res 2017;45:1281–96. [DOI] [PMC free article] [PubMed] [Google Scholar]
  40. Yu G, Wang L-G, Han Y. et al. clusterProfiler: an R package for comparing biological themes among gene clusters. Omics 2012;16:284–7. [DOI] [PMC free article] [PubMed] [Google Scholar]
  41. Yuan G-C, Cai L, Elowitz M. et al. Challenges and emerging directions in single-cell analysis. Genome Biol 2017;18:84. [DOI] [PMC free article] [PubMed] [Google Scholar]
  42. Zhang X, Xu C, Yosef N.. Simulating multiple faceted variability in single cell RNA sequencing. Nat Commun 2019;10:2611. [DOI] [PMC free article] [PubMed] [Google Scholar]
  43. Zhao D, Lu X, Wang G. et al. Synthetic essentiality of chromatin remodelling factor CHD1 in PTEN-deficient cancer. Nature 2017;542:484–8. [DOI] [PMC free article] [PubMed] [Google Scholar]
  44. Zhao Y, Carter R, Natarajan S. et al. Single-cell RNA sequencing reveals the impact of chromosomal instability on glioblastoma cancer stem cells. BMC Med Genomics 2019;12:1–16. [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

btad325_Supplementary_Data

Data Availability Statement

The IndepthPathway R package and the related data are available through: https://github.com/wangxlab/IndepthPathway.


Articles from Bioinformatics are provided here courtesy of Oxford University Press

RESOURCES