Signature-scoring methods developed for bulk samples are not adequate for cancer single-cell RNA sequencing data

Nighat Noureen; Zhenqing Ye; Yidong Chen; Xiaojing Wang; Siyuan Zheng

doi:10.7554/eLife.71994

. 2022 Feb 25;11:e71994. doi: 10.7554/eLife.71994

Signature-scoring methods developed for bulk samples are not adequate for cancer single-cell RNA sequencing data

Nighat Noureen ^1,², Zhenqing Ye ^1,², Yidong Chen ^1,², Xiaojing Wang ^1,², Siyuan Zheng ^1,^2,^✉

Editors: C Daniela Robles-Espinoza³, Naama Barkai⁴

PMCID: PMC8916770 PMID: 35212622

Abstract

Quantifying the activity of gene expression signatures is common in analyses of single-cell RNA sequencing data. Methods originally developed for bulk samples are often used for this purpose without accounting for contextual differences between bulk and single-cell data. More broadly, few attempts have been made to benchmark these methods. Here, we benchmark five such methods, including single sample gene set enrichment analysis (ssGSEA), Gene Set Variation Analysis (GSVA), AUCell, Single Cell Signature Explorer (SCSE), and a new method we developed, Jointly Assessing Signature Mean and Inferring Enrichment (JASMINE). Using cancer as an example, we show cancer cells consistently express more genes than normal cells. This imbalance leads to bias in performance by bulk-sample-based ssGSEA in gold standard tests and down sampling experiments. In contrast, single-cell-based methods are less susceptible. Our results suggest caution should be exercised when using bulk-sample-based methods in single-cell data analyses, and cellular contexts should be taken into consideration when designing benchmarking strategies.

Research organism: Human

Introduction

Single-cell RNA sequencing (scRNA-seq) is a powerful technology to study the ecosystems of normal and disease tissues. Gene expression signatures can be used to interrogate single cells for cell identities and other cellular properties (Noureen et al., 2021a) using signature-scoring methods. A key consideration for these methods is how to account for the variability of expression within the signature, a characteristic often associated with the overall expression variability of the interrogated sample. Despite widespread use and their benchmarking on certain applications (Holland et al., 2020; Zhang et al., 2020), limitations of these methods, including some initially developed for bulk samples, are incompletely understood in scRNA-seq analysis.

Compared with bulk-sample RNAseq, scRNA-seq data have high dropout rates (Hicks et al., 2018). Dropout rates are the opposite of gene counts, that is, the number of genes detected in a cell, a feature that is associated with cell differentiation status (Gulati et al., 2020). Variability in gene counts may cause imbalanced representation of non-expressed genes in gene signatures, resulting in scoring bias. Thus, we reason that in contexts where cells differ in differentiation status, failing to account for this variability may lead to misleading signature scores.

Cancer cells exhibit stem-cell like properties Malta et al., 2018; thus, cancer may represent such a context. To test this, we benchmark five signature-scoring methods in cancer single-cell datasets. The five methods included bulk-sample-based methods single sample gene set enrichment analysis (ssGSEA) and Gene Set Variation Analysis (GSVA) (implemented in the R GSVA package Hänzelmann et al., 2013) and three single-cell-based methods, AUCell (Aibar et al., 2017), Single-Cell Signature Explorer (SCSE) (Pont et al., 2019), and Jointly Assessing Signature Mean and Inferring Enrichment (JASMINE), a new method we developed. We show that cancer cells consistently express more genes than normal cells, and this imbalance significantly biases results from ssGSEA and GSVA but largely spares single-cell-based methods. Our results caution against the use of bulk-sample-based methods in scRNA-seq analyses.

Results

Cancer cells demonstrate higher gene counts than normal cells

Several recent studies showed that gene counts of single cells are associated with cell differentiation status and cell identity (Gulati et al., 2020; Qiu, 2020). Specifically, cells higher in the differentiation hierarchy express more genes. To examine if cancer and normal cells demonstrate similar imbalances, we collected 10 scRNA-seq datasets across different cancer types, technological platforms and sequencing coverage (Supplementary file 1). We also obtained cell type annotations from the original studies. We found that the average number of detected genes was significantly higher in tumor cells than in normal cells across all datasets (Figure 1A, p < 2.2e-16). This imbalance persisted when we separated normal cells into different cell populations (Figure 1—figure supplement 1). However, cancer-associated cells, including cancer-associated fibroblasts (CAF), tumor-associated macrophages (TAM), and tumor-related endothelial cells (TEC) had higher gene counts than other normal cell types, sometimes even comparable to malignant cells (Figure 1—figure supplement 1).

(A) The number of detected genes in tumor and normal cell populations in 10 single cell cancer RNAseq datasets. The height of each bar represents average, and whiskers represent standard deviation. In all cases, the difference is statistically significant (student t test, p < 2.2e-16). (B) Percentage of up and down regulated gene signatures in cancer cells relative to normal cells based on Cohen’s d. Dot size corresponds to the percentage of all signatures tested (n = 7503). (C) Spearman correlation coefficients of Cohen’s d with signature sizes across the datasets and methods. Asterisk (*) in each cell indicates p-value < 0.01. Color of the heatmap represents correlation coefficient. (D) Scores of a cell cycle gene set (GO:0007049) calculated using four methods along with MKI67 expression, gene counts, and cell cycle phases predicted by Seurat in Tumor and normal cell populations of HNSC dataset (GSE103322). The red box highlights non-cycling tumor cells that exhibit higher scores than non-cycling normal cells.

Figure 1—source data 1. Source data for Figure 1.

elife-71994-fig1-data1.xlsx^{(1.5MB, xlsx)}

Figure 1—figure supplement 1. — (A) The number of detected genes in tumor and normal cell populations in 10 single cell cancer RNAseq datasets. The height of each bar represents average, and whiskers represent standard deviation. In all cases, the difference is statistically significant (student t test, p < 2.2e-16). (B) Percentage of up and down regulated gene signatures in cancer cells relative to normal cells based on Cohen’s d. Dot size corresponds to the percentage of all signatures tested (n = 7503). (C) Spearman correlation coefficients of Cohen’s d with signature sizes across the datasets and methods. Asterisk (*) in each cell indicates p-value < 0.01. Color of the heatmap represents correlation coefficient. (D) Scores of a cell cycle gene set (GO:0007049) calculated using four methods along with MKI67 expression, gene counts, and cell cycle phases predicted by Seurat in Tumor and normal cell populations of HNSC dataset (GSE103322). The red box highlights non-cycling tumor cells that exhibit higher scores than non-cycling normal cells.

Figure 1—source data 1. Source data for Figure 1.

elife-71994-fig1-data1.xlsx^{(1.5MB, xlsx)}

Bias of ssGSEA signature scores in cancer datasets

The reliability of signature-scoring methods is measured by their ability to identify the quantitative differences between two groups of samples at the gene-signature level. Here, we compare signature scores between tumor and normal cells. One issue is that scores from these methods differ in range and variance, thus making direct comparison challenging. To overcome this challenge, we used Cohen’s d (Cohen, 1988), a measurement that normalizes mean differences with standard deviation, thus generating unitless contrast that can be compared across methods. For simplicity, we defined Cohen’s d greater than one as upregulated, and less than –1 as down regulated. These criteria were used throughout this work.

We assembled 7503 gene sets from MSigDB that each has at least 20 genes (The term ‘gene set’ is used interchangeably with ‘gene signature’; Methods). On average, single-cell methods reported similar proportions of up- and down-regulated gene sets (9% vs 12% of all input gene sets) (Figure 1B). In contrast, ssGSEA and GSVA reported 25 and 23% of the input gene sets as upregulated but only 6 and 5% as down regulated. In six out of the 10 datasets, ssGSEA and GSVA estimated more than twice as many upregulated gene sets than down gene sets. This pattern remained when we divided normal cells into different populations (Figure 1—figure supplement 2). Notably, in cell types with high gene counts such as TEC, TAM, and CAF, we did not observe significantly more upregulated gene sets by ssGSEA and GSVA (Figure 1—figure supplement 2E,F). These results collectively suggest ssGSEA and GSVA are sensitive to variability of gene counts common to scRNA-seq data.

We also found that Cohen’s d from ssGSEA and GSVA showed consistent positive correlation with gene set sizes (Figure 1C). Single-cell methods did not demonstrate this pattern.

We observed that GSVA was slower than other methods. For instance, running the head and neck dataset with 200 threads on a 2.5 GHz Xeon processor took more than 3 days to complete. Because GSVA outputs were highly consistent with those of ssGSEA (Figure 1—figure supplement 3), and its running speed would not likely scale up to single cell datasets, we dropped GSVA from subsequent analyses.

We next use an example to illustrate how gene counts may affect signature scores and data interpretation. We scored the cell cycle gene signature from Gene Ontology (GO:0007049) in the head and neck cancer (HNSC) dataset (Puram et al., 2017). For comparison, we used Seurat (Butler et al., 2018) to identify cycling cells at G2M and S phases and also included the expression of KI67, a proliferation marker expressed by cycling cells. We observed all four methods reported a higher score in cycling cells (Figure 1D). However, ssGSEA scores were significantly higher in non-cycling cancer cells than in normal cells although neither are cycling (Student’s t test, p < 2.2e-16). In contrast, the single-cell-based methods did not show this difference.

Sensitivity and specificity

We next generated gold standard up- and down-regulated gene sets at various sizes to benchmark detection sensitivity. To simulate real-world scenario, we added noises to each gene set so that higher noise levels attenuate more of their signals (Method).

We found gene set size had negligible impacts on detection rates (Figure 2—figure supplement 1). Noise levels, on the other hand, had a much bigger impact. For up gene sets, all methods were able to detect 50% of the gene sets on average even at the 80% noise level (Figure 2A). However, for down gene sets, ssGSEA performed worse than other methods (Figure 2B). Without noise, it only detected about 85% of the gene sets. At 80% noise level, it detected around 30% of the gene sets, compared to 70–80% by single-cell methods.

To benchmark detection specificity, we generated null datasets by randomly choosing 100 cancer cells from each dataset. We down sampled each cell to 50% coverage, resulting in lower gene counts. These cells were then normalized to the same total coverage. Because the down-sampled cells were the same cells with the original, no up or down gene sets were expected between the two groups.

We found AUCell and JASMINE outperformed SCSE and ssGSEA in specificity (Figure 2C). On average, AUCell detected 7.5 and 1.2% of the input gene sets as up and down; JASMINE detected 4 and 1%. In four datasets, SCSE overestimated down regulated gene sets (average 25.4%). ssGSEA grossly overestimated both up- and down regulated gene sets in all datasets (average 11 and 46%, respectively).

Next, we estimated how changes in gene count affected score stability. For each gene signature, we calculated coefficient of variance (CV) (Figure 2—figure supplement 2). We observed that single-cell methods were robust to our down sampling, with little changes in CV. However, in three datasets, ssGSEA had a higher CV in down-sampled cells than in original cells.

Comparing with consensus and computational efficiency

We next compared the three single-cell-based methods against consensus—gene sets that were identified as up- or down regulated by at least two methods. JASMINE aligned better with the consensus in most datasets (Figure 2D and Figure 2—figure supplement 3). AUCell overall had lower accuracies against the consensus, particularly for down regulated gene sets, likely because it is designed to score marker signatures (Figure 2D). When breaking down to sensitivity and specificity, AUCell showed higher false positive rates (Figure 2—figure supplement 4A). The three tools showed comparable sensitivity. Across the 10 datasets, SCSE and JASMINE showed better correlation (Figure 2—figure supplement 4B,C), thus explaining their relatively better alignment with the consensus.

Finally, we tested computing efficiency of the four methods. We randomly generated gene sets of various sizes and analyzed each gene set in a dataset consisting of 2000 cells. Figure 2—figure supplement 5 shows the time and memory use averaged over 50 iterations for one gene set under the same hardware configuration (2.20 GHz CPU, 32 GB memory). Overall, SCSE was the most efficient computationally. JASMINE was the second fastest algorithm but required a memory similar to that of ssGSEA. AUCell was slightly faster than ssGSEA but required the most memory. Their performances were robust across signature sizes.

Dropouts affect ssGSEA scores

To further show the impact of dropouts on ssGSEA scores, we re-ran the down sampling experiment by setting down sampling rates to 20, 40, 60, and 80%. A down sampling rate of 80% indicates that the total sequencing depth of the down sampled cell is at 80% of the original cell. A lower down sampling rate creates more dropouts. We observed that as the down sampling rate shifted from low to high, the pattern of deregulated gene sets by ssGSEA also shifted from more up gene sets towards more down gene sets (Figure 3A and Figure 3—figure supplement 1). This observation strongly associates dropouts with ssGSEA scores. In Figure 3—figure supplement 2, we provide an example using real data to show how ssGSEA scores change when limited to cells with comparable gene counts.

(A) Percentages of up and down regulated gene signatures in original cells relative to down sampled cells for four levels of down sampling (20, 40, 60, and 80%) based on Cohen’s d. Dot size corresponds to the percentage of all signatures tested (n = 7503) in Head and Neck (Puram et al., 2017). (B) Effect of dropouts on ssGSEA scoring using a dummy expression matrix. The black line denotes the cell without any dropouts, and the blue line denotes the same cell with a 60% dropout rate. Note that for the gene signature, the first 99 genes are fixed. The x axis reflects the position of the last signature gene. When the gene is at rank <4000. The two cells give identical scores. However, after entering dropout zone, the scores start to deviate.

Figure 3—source data 1. Source data for Figure 3.

elife-71994-fig3-data1.xlsx^{(10.5KB, xlsx)}

Figure 3—figure supplement 1. — (A) Percentages of up and down regulated gene signatures in original cells relative to down sampled cells for four levels of down sampling (20, 40, 60, and 80%) based on Cohen’s d. Dot size corresponds to the percentage of all signatures tested (n = 7503) in Head and Neck (Puram et al., 2017). (B) Effect of dropouts on ssGSEA scoring using a dummy expression matrix. The black line denotes the cell without any dropouts, and the blue line denotes the same cell with a 60% dropout rate. Note that for the gene signature, the first 99 genes are fixed. The x axis reflects the position of the last signature gene. When the gene is at rank <4000. The two cells give identical scores. However, after entering dropout zone, the scores start to deviate.

Figure 3—source data 1. Source data for Figure 3.

elife-71994-fig3-data1.xlsx^{(10.5KB, xlsx)}

To understand how dropouts may affect ssGSEA scoring, we simulated a dummy expression profile of a cell with 10,000 genes. We then set different dropout rates to the cell ranging from no dropout to 80% dropout. Genes were ranked from high to low expression to mimic ssGSEA calculation and the gene rankings were identical at each dropout rate. We then assembled a gene signature of 100 genes. The first 99 genes were drawn from the top 1000 genes, and the last gene was added to the signature from the 1001st gene onward to the bottom of the list. This way, the signature consists of 99 highly expressed genes and a floating gene. As the floating gene moves toward the dropout zone, we can observe how the signature score changes, especially around the dropout zone. The codes to reproduce these data can be found at our Github repository (Methods). Figure 3B shows the result at the 60% dropout rate. Before entering the dropout zone, the signature score was identical between the no-dropout cell and the 60%-dropout cell. However, when the floating gene was between ranks 4000 and 7000, the signature was scored higher in the dropout cell; when moving further toward ranks 7000–10,000, the signature was scored lower in the dropout cell. Thus, the scores teetered around the tie rank of dropout genes, that is, 7000. This simple experiment demonstrates that though tedious, dropouts are sufficient to change ssGSEA scores in unintended ways.

Discussion

In this study, we benchmarked five signature-scoring methods, including two that were developed for bulk sample RNAseq analysis (ssGSEA and GSVA), and three that were developed for scRNA-seq analysis (AUCell, SCSE, and JASMINE). Because ssGSEA and GSVA generate highly correlated scores and GSVA is slower, we focused on ssGSEA and the other three methods. Unlike methods that test for signature enrichment between sample groups, these signature-scoring methods quantify expression activity of a gene set in a sample independent of other samples. Outputs from these methods can be conveniently tested for association with complex sample characteristics. We show that single-cell-based methods are more robust at identifying bona fide up- and down regulated gene signatures from scRNAseq data. In contrast, bulk-sample-based ssGSEA and GSVA are more susceptible to dropouts. These disparities stem from their distinct mechanism to score signatures. AUCell uses Area Under the Curve (AUC) to test the enrichment of a gene set among top expressed genes in a cell. SCSE measures a signature using normalized total expression of the signature genes. AUCell and SCSE intuitively counter dropout effects by using only expressed genes. JASMINE considers the enrichment of signature genes in expressed genes to counter dropout effects, and meanwhile, evaluates the average expression level of the expressed signature genes. Bulk-sample-based methods were not designed to deal with excessive dropouts, which may mislead analysis especially if cell biology underpins different dropout rates. We further show that among single-cell methods, JASMINE and SCSE showed better concordance, and JASMINE outperforms SCSE in down sampling tests.

Dropouts were initially thought to result from low amounts of input RNA and transcriptional stochasticity. However, more studies showed that dropouts are associated with cell states and identities (Gulati et al., 2020; Qiu, 2020). Thus, in cellular contexts where cells may have systematic differences in dropout rates, any tool that fails to account for dropouts may generate misleading even erroneous data. In conclusion, our results caution against using bulk-sample-based signature-scoring methods to score single cells. Our study also suggests that it is important to consider cellular contexts when benchmarking methods for scRNA-seq data analysis, particularly when bulk-sample-based methods are involved.

Materials and methods

Single cell data sets and signature scoring

We used single cell data sets from 10 published studies (Bi et al., 2021; Chung et al., 2017; Darmanis et al., 2017; He et al., 2021; Lee et al., 2020; Ma et al., 2019; Neftel et al., 2019; Puram et al., 2017; Tirosh et al., 2016; Venteicher et al., 2017) for the evaluation of number of expressed genes in tumor versus normal cells to identify significant heterogenous patterns among the two phenotypes. Annotations of cell identity were also downloaded from each publication. We filtered all the data sets by removing non-expressed genes and then applied regularized negative binomial regression implemented in Seurat for normalization. We used C2 (n = 6226), C3 (n = 3556) and Hallmarks (n = 50) modules from MSigDB (Subramanian et al., 2005) v.7.2, to calculate the ratio of signature genes across all data sets and further signature scoring. We tested five tools for signature score calculations, including SCSE (Pont et al., 2019), AUCell (Aibar et al., 2017), ssGSEA, GSVA (Hänzelmann et al., 2013), and JASMINE. GSVA was included in tumor-normal comparisons but was dropped in gold standard tests and down sampling experiments due to slow running speed and highly correlated outputs with ssGSEA. We used GSVA and ssGSEA methods implemented in the GSVA Bioconductor (Hänzelmann et al., 2013) and AUCell method from AUCell Bioconductor packages (Aibar et al., 2017) with default parameters. We implemented SCSE in the R environment (v4.0) according to the equation reported in their paper (Pont et al., 2019). The output scores were used as is in tumor/normal cell comparisons and simulation analyses.

Jointly Assessing Signature Mean and Inferring Enrichment

For each signature, JASMINE calculates the approximate mean using gene ranks among expressed genes and the enrichment of the signature in expressed genes. The two are then scaled to 0–1 and averaged to result in the final JASMINE score.

Assume R_g represents the rank of an expressed signature gene g among genes with expression value >0 in a cell, then a mean rank vector V_mean is calculated as follows:

V_{m e a n} = \sum_{i = 1}^{m} R_{g, i} / (m \times N)

Where m represents the total number of expressed signature genes and N represents the total number of expressed genes in the cell. Because the mean is based on ranks, it is robust to scale normalization methods. It assesses the expression level of the signature only using genes detected in one cell, thus minimizing the effect of dropouts.

To assess signature enrichment in the expressed genes, we calculate either the Odds Ratio (OR) using four variables: a = signature genes expressed, b = signature genes not expressed, c = non-signature genes expressed, and d = non-signature genes not expressed. The signature enrichment using OR is calculated as follows:

O R = (a * d) / (b * c)

In practice, c is unlikely zero. For smaller signatures, b can be occasionally 0. In that case, we replace it with 1. In our testing, we observed that 99% of the gene sets did not encounter this issue in any of the cells we analyzed. Despite this rare incidence, we provide Likelihood Ratio (LR) as an alternative to OR in the JASMINE function. LR is calculated as follows:

L R = a * (c + d) / (c * (a + b))

OR and LR generate highly similar results according to our tests. OR (or LR) assesses the distribution of signature genes against the dropout background. Finally, V_mean and OR are linearly scaled to [0–1] and are averaged to generate JASMINE scores.

Effect size for gene sets scores

We first filtered the 9835 gene sets to 7503 by requiring a minimum size of 20 genes. We scored these gene sets (C2, C3, and hallmarks) among tumor and normal phenotypes using Cohen’s d (Cohen, 1988). We utilized an R package ‘effectsize’ (Ben-Shachar et al., 2020) to calculate this metric. We used Cohen’s d ≥ one as a threshold for positive or up cases and d ≤ –1 as negative or down cases with respect to tumor versus normal cells.

Gene signature simulation

We first identified differentially expressed genes for each dataset using MAST (Finak et al., 2015), a method that explicitly accounts for dropout rates. We then randomly drew N genes from upregulated genes to generate an up gene set of size N, and similarly for down gene sets. Because in practice up and down regulated gene sets contain genes that have no expression change, or even changes at opposite directions, we added noises to the simulated gene sets. For instance, when setting noise level to 20%, an up gene set of size N would have 20% of its genes drawn from the remainders other than the upregulated genes. Following this procedure, we generated gene sets at the sizes of 50, 100, 150, 200, and 300 genes. For each size, we set the noise levels to 0, 20, 40, 60, and 80%. For each noise-size combination, we randomly generated 200 gene sets. In total, 5000 gene sets were generated per data set for each direction.

Down sampling

We subset 100 tumor cells per data set for down sampling. The R package ‘scuttle’ (McCarthy et al., 2017) was used to down sample each cell to 50% total coverage. The down sampled data sets were then scaled back to ensure equal total coverage before comparison. Down sampling was also performed at various levels (20, 40, 60, and 80%) to examine its impact upon scoring of different methods especially ssGSEA.

Dropouts impact on ssGSEA

We generated a toy example to demonstrate the impact of dropouts on ssGSEA scores. We used a signature comprised of n + 1 genes. We randomly selected 99 genes from top 1000 genes of the dummy expression matrix. These 99 genes are fixed. Then starting from 1001 onward, we added one gene to the signature each time, starting from 1001st gene index throughout the remaining list. This allowed us to investigate how signature score changes as the additional gene gradually moves from high expression to dropouts. We simulated the scenario with 0% (original scores), and 60% dropout rates. This simulation clearly reflects how the dropout rates significantly affect the scoring by ssGSEA.

Data availability

Single cell data sets used in this study including their downloading sources were listed in Supplementary file 1. Gene sets were downloaded from MSigDB v.7.2 (http://www.gsea-msigdb.org/gsea/msigdb/index.jsp) for C2, C3 and Hallmark modules. JASMINE and ssGSEA testing codes are available on Github (https://github.com/NNoureen/JASMINE, copy archived at swh:1:rev:ba00996ad165ff471c6fada83e6cf76af50acdfa; Noureen, 2021b).

Acknowledgements

This work was supported by CPRIT (RR170055 to ZS). NN was supported by a CPRIT postdoctoral fellowship award (RP170345).

Funding Statement

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Contributor Information

Siyuan Zheng, Email: zhengs3@uthscsa.edu.

C Daniela Robles-Espinoza, International Laboratory for Human Genome Research, Mexico.

Naama Barkai, Weizmann Institute of Science, Israel.

Funding Information

This paper was supported by the following grants:

Cancer Prevention and Research Institute of Texas RR170055 to Siyuan Zheng.
Cancer Prevention and Research Institute of Texas RP170345 to Nighat Noureen.

Additional information

Competing interests

No competing interests declared.

Author contributions

Data curation, Formal analysis, Funding acquisition, Investigation, Methodology, Software, Visualization, Writing - original draft.

Investigation, Visualization.

Investigation, Resources.

Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Software, Writing - original draft.

Conceptualization, Data curation, Formal analysis, Funding acquisition, Investigation, Methodology, Project administration, Resources, Software, Supervision, Visualization, Writing - original draft.

Additional files

Supplementary file 1. Single cell RNAseq datasets.

The table lists all the ten datasets used in this study. Cell annotations were downloaded from the original publications.

elife-71994-supp1.xlsx^{(16.2KB, xlsx)}

Transparent reporting form

elife-71994-transrepform1.docx^{(246.7KB, docx)}

Data availability

The current manuscript is a computational study, so no data have been generated for this manuscript. Single cell data sets used in this study including their downloading sources were listed in Supplementary file 1. Gene sets were downloaded from MSigDB v.7.2. JASMINE source code is available on Github (https://github.com/NNoureen/JASMINE, copy archived at swh:1:rev:ba00996ad165ff471c6fada83e6cf76af50acdfa). Source Data contain the numerical data used to generate the figures.

The following dataset was generated:

Noureen N. 2021. Signature-scoring methods developed for bulk samples are not adequate for cancer single-cell RNA sequencing data. GitHub. GitHub

The following previously published datasets were used:

Tirosh I, Izar B, Prakadan SM. 2016. Single-cell expression of metastatic melanoma. NCBI Gene Expression Omnibus. GSE72056

Puram SV, Tirosh I, Parikh AS, Patel AP. 2017. Single-cell expression of head and neck cancer. NCBI Gene Expression Omnibus. GSE103322

Bi K, Mx He, Bakouny Z, Kanodia A. 2021. Single-cell expression of advanced renal cell carcinoma. Broad Single Cell portal. SCP1288

Lee HO, Hong Y, Etlioglu HE, Cho YB. 2020. Single-cell expression of colorectal cancer. NCBI Gene Expression Omnibus. GSE144735

Ma L, Hernandez MO, Zhao Y, Mehta M. 2019. Single-cell expression of liver cancer. NCBI Gene Expression Omnibus. GSE125449

Neftel C, Laffy J, Filbin MG, Hara T. 2019. Single-cell expression of IDH wildtype glioblastoma. Broad Single Cell portal. SCP393

Venteicher AS, Tirosh I, Hebert C, Yizhak K. 2017. Single-cell expression of IDH mutant astrocytoma. Broad Single Cell portal. SCP50

Darmanis S, Sloan SA, Croote D, Mignardi M. 2017. Single-cell expression of glioblastoma. NCBI Gene Expression Omnibus. GSE84465

Chung W, Eum HH, Lee HO, Lee KM. 2017. Single-cell expression of breast cancer. NCBI Gene Expression Omnibus. GSE75688

Mx He, Cuoco MS, Crowdis J, Bosma-Moody A, Zhang Z, Bi K, Kanodia A. 2021. Single-cell expression of prostate cancer. Broad Single Cell portal. SCP1244

References

Aibar S, González-Blas CB, Moerman T, Huynh-Thu VA, Imrichova H, Hulselmans G, Rambow F, Marine J-C, Geurts P, Aerts J, van den Oord J, Atak ZK, Wouters J, Aerts S. SCENIC: single-cell regulatory network inference and clustering. Nature Methods. 2017;14:1083–1086. doi: 10.1038/nmeth.4463. [DOI] [PMC free article] [PubMed] [Google Scholar]
Ben-Shachar M, Lüdecke D, Makowski D. effectsize: Estimation of Effect Size Indices and Standardized Parameters. Journal of Open Source Software. 2020;5:2815. doi: 10.21105/joss.02815. [DOI] [Google Scholar]
Bi K, He MX, Bakouny Z, Kanodia A, Napolitano S, Wu J, Grimaldi G, Braun DA, Cuoco MS, Mayorga A, DelloStritto L, Bouchard G, Steinharter J, Tewari AK, Vokes NI, Shannon E, Sun M, Park J, Chang SL, McGregor BA, Haq R, Denize T, Signoretti S, Guerriero JL, Vigneau S, Rozenblatt-Rosen O, Rotem A, Regev A, Choueiri TK, Van Allen EM. Tumor and immune reprogramming during immunotherapy in advanced renal cell carcinoma. Cancer Cell. 2021;39:649–661. doi: 10.1016/j.ccell.2021.02.015. [DOI] [PMC free article] [PubMed] [Google Scholar]
Butler A, Hoffman P, Smibert P, Papalexi E, Satija R. Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nature Biotechnology. 2018;36:411–420. doi: 10.1038/nbt.4096. [DOI] [PMC free article] [PubMed] [Google Scholar]
Chung W, Eum HH, Lee HO, Lee KM, Lee HB, Kim KT, Ryu HS, Kim S, Lee JE, Park YH, Kan Z, Han W, Park WY. Single-cell RNA-seq enables comprehensive tumour and immune cell profiling in primary breast cancer. Nature Communications. 2017;8:15081. doi: 10.1038/ncomms15081. [DOI] [PMC free article] [PubMed] [Google Scholar]
Cohen J. Statistical Power Analysis for the Behavioral Sciences. Routledge; 1988. [Google Scholar]
Darmanis S, Sloan SA, Croote D, Mignardi M, Chernikova S, Samghababi P, Zhang Y, Neff N, Kowarsky M, Caneda C, Li G, Chang SD, Connolly ID, Li Y, Barres BA, Gephart MH, Quake SR. Single-Cell RNA-Seq Analysis of Infiltrating Neoplastic Cells at the Migrating Front of Human Glioblastoma. Cell Reports. 2017;21:1399–1410. doi: 10.1016/j.celrep.2017.10.030. [DOI] [PMC free article] [PubMed] [Google Scholar]
Finak G, McDavid A, Yajima M, Deng J, Gersuk V, Shalek AK, Slichter CK, Miller HW, McElrath MJ, Prlic M, Linsley PS, Gottardo R. MAST: a flexible statistical framework for assessing transcriptional changes and characterizing heterogeneity in single-cell RNA sequencing data. Genome Biology. 2015;16:278. doi: 10.1186/s13059-015-0844-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
Gulati GS, Sikandar SS, Wesche DJ, Manjunath A, Bharadwaj A, Berger MJ, Ilagan F, Kuo AH, Hsieh RW, Cai S, Zabala M, Scheeren FA, Lobo NA, Qian D, Yu FB, Dirbas FM, Clarke MF, Newman AM. Single-cell transcriptional diversity is a hallmark of developmental potential. Science (New York, N.Y.) 2020;367:405–411. doi: 10.1126/science.aax0249. [DOI] [PMC free article] [PubMed] [Google Scholar]
Hänzelmann S, Castelo R, Guinney J. GSVA: gene set variation analysis for microarray and RNA-seq data. BMC Bioinformatics. 2013;14:7. doi: 10.1186/1471-2105-14-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
He MX, Cuoco MS, Crowdis J, Bosma-Moody A, Zhang Z, Bi K, Kanodia A, Su MJ, Ku SY, Garcia MM, Sweet AR, Rodman C, DelloStritto L, Silver R, Steinharter J, Shah P, Izar B, Walk NC, Burke KP, Bakouny Z, Tewari AK, Liu D, Camp SY, Vokes NI, Salari K, Park J, Vigneau S, Fong L, Russo JW, Yuan X, Balk SP, Beltran H, Rozenblatt-Rosen O, Regev A, Rotem A, Taplin ME, Van Allen EM. Transcriptional mediators of treatment resistance in lethal prostate cancer. Nature Medicine. 2021;27:426–433. doi: 10.1038/s41591-021-01244-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
Hicks SC, Townes FW, Teng M, Irizarry RA. Missing data and technical variability in single-cell RNA-sequencing experiments. Biostatistics (Oxford, England) 2018;19:562–578. doi: 10.1093/biostatistics/kxx053. [DOI] [PMC free article] [PubMed] [Google Scholar]
Holland CH, Tanevski J, Perales-Patón J, Gleixner J, Kumar MP, Mereu E, Joughin BA, Stegle O, Lauffenburger DA, Heyn H, Szalai B, Saez-Rodriguez J. Robustness and applicability of transcription factor and pathway analysis tools on single-cell RNA-seq data. Genome Biology. 2020;21:36. doi: 10.1186/s13059-020-1949-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
Lee H-O, Hong Y, Etlioglu HE, Cho YB, Pomella V, Van den Bosch B, Vanhecke J, Verbandt S, Hong H, Min J-W, Kim N, Eum HH, Qian J, Boeckx B, Lambrechts D, Tsantoulis P, De Hertogh G, Chung W, Lee T, An M, Shin H-T, Joung J-G, Jung M-H, Ko G, Wirapati P, Kim SH, Kim HC, Yun SH, Tan IBH, Ranjan B, Lee WY, Kim T-Y, Choi JK, Kim Y-J, Prabhakar S, Tejpar S, Park W-Y. Lineage-dependent gene expression programs influence the immune landscape of colorectal cancer. Nature Genetics. 2020;52:594–603. doi: 10.1038/s41588-020-0636-z. [DOI] [PubMed] [Google Scholar]
Ma L, Hernandez MO, Zhao Y, Mehta M, Tran B, Kelly M, Rae Z, Hernandez JM, Davis JL, Martin SP, Kleiner DE, Hewitt SM, Ylaya K, Wood BJ, Greten TF, Wang XW. Tumor Cell Biodiversity Drives Microenvironmental Reprogramming in Liver Cancer. Cancer Cell. 2019;36:418–430. doi: 10.1016/j.ccell.2019.08.007. [DOI] [PMC free article] [PubMed] [Google Scholar]
Malta TM, Sokolov A, Gentles AJ, Burzykowski T, Poisson L, Weinstein JN, Kamińska B, Huelsken J, Omberg L, Gevaert O, Colaprico A, Czerwińska P, Mazurek S, Mishra L, Heyn H, Krasnitz A, Godwin AK, Lazar AJ, Cancer Genome Atlas Research Network. Stuart JM, Hoadley KA, Laird PW, Noushmehr H, Wiznerowicz M. Machine Learning Identifies Stemness Features Associated with Oncogenic Dedifferentiation. Cell. 2018;173:338–354. doi: 10.1016/j.cell.2018.03.034. [DOI] [PMC free article] [PubMed] [Google Scholar]
McCarthy DJ, Campbell KR, Lun ATL, Wills QF. Scater: pre-processing, quality control, normalization and visualization of single-cell RNA-seq data in R. Bioinformatics (Oxford, England) 2017;33:1179–1186. doi: 10.1093/bioinformatics/btw777. [DOI] [PMC free article] [PubMed] [Google Scholar]
Neftel C, Laffy J, Filbin MG, Hara T, Shore ME, Rahme GJ, Richman AR, Silverbush D, Shaw ML, Hebert CM, Dewitt J, Gritsch S, Perez EM, Gonzalez Castro LN, Lan X, Druck N, Rodman C, Dionne D, Kaplan A, Bertalan MS, Small J, Pelton K, Becker S, Bonal D, Nguyen QD, Servis RL, Fung JM, Mylvaganam R, Mayr L, Gojo J, Haberler C, Geyeregger R, Czech T, Slavc I, Nahed BV, Curry WT, Carter BS, Wakimoto H, Brastianos PK, Batchelor TT, Stemmer-Rachamimov A, Martinez-Lage M, Frosch MP, Stamenkovic I, Riggi N, Rheinbay E, Monje M, Rozenblatt-Rosen O, Cahill DP, Patel AP, Hunter T, Verma IM, Ligon KL, Louis DN, Regev A, Bernstein BE, Tirosh I, Suvà ML. An Integrative Model of Cellular States, Plasticity, and Genetics for Glioblastoma. Cell. 2019;178:835–849. doi: 10.1016/j.cell.2019.06.024. [DOI] [PMC free article] [PubMed] [Google Scholar]
Noureen N, Wu S, Lv Y, Yang J, Alfred Yung WK, Gelfond J, Wang X, Koul D, Ludlow A, Zheng S. Integrated analysis of telomerase enzymatic activity unravels an association with cancer stemness and proliferation. Nature Communications. 2021a;12:139. doi: 10.1038/s41467-020-20474-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
Noureen N. JASMINE. swh:1:rev:ba00996ad165ff471c6fada83e6cf76af50acdfaGitHub. 2021b https://archive.softwareheritage.org/swh:1:dir:f5b092ae4a9b8e9a7314131251bc01391abd8e1f;origin=https://github.com/NNoureen/JASMINE;visit=swh:1:snp:68b3eef22d209a4b966ce1c768b5eded68dae4ff;anchor=swh:1:rev:ba00996ad165ff471c6fada83e6cf76af50acdfa
Pont F, Tosolini M, Fournié JJ. Single-Cell Signature Explorer for comprehensive visualization of single cell signatures across scRNA-seq datasets. Nucleic Acids Research. 2019;47:e133. doi: 10.1093/nar/gkz601. [DOI] [PMC free article] [PubMed] [Google Scholar]
Puram SV, Tirosh I, Parikh AS, Patel AP, Yizhak K, Gillespie S, Rodman C, Luo CL, Mroz EA, Emerick KS, Deschler DG, Varvares MA, Mylvaganam R, Rozenblatt-Rosen O, Rocco JW, Faquin WC, Lin DT, Regev A, Bernstein BE. Single-Cell Transcriptomic Analysis of Primary and Metastatic Tumor Ecosystems in Head and Neck Cancer. Cell. 2017;171:1611–1624. doi: 10.1016/j.cell.2017.10.044. [DOI] [PMC free article] [PubMed] [Google Scholar]
Qiu P. Embracing the dropouts in single-cell RNA-seq analysis. Nature Communications. 2020;11:1169. doi: 10.1038/s41467-020-14976-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, Mesirov JP. Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. PNAS. 2005;102:15545–15550. doi: 10.1073/pnas.0506580102. [DOI] [PMC free article] [PubMed] [Google Scholar]
Tirosh I, Izar B, Prakadan SM, Wadsworth MH, Treacy D, Trombetta JJ, Rotem A, Rodman C, Lian C, Murphy G, Fallahi-Sichani M, Dutton-Regester K, Lin JR, Cohen O, Shah P, Lu D, Genshaft AS, Hughes TK, Ziegler CGK, Kazer SW, Gaillard A, Kolb KE, Villani AC, Johannessen CM, Andreev AY, Van Allen EM, Bertagnolli M, Sorger PK, Sullivan RJ, Flaherty KT, Frederick DT, Jané-Valbuena J, Yoon CH, Rozenblatt-Rosen O, Shalek AK, Regev A, Garraway LA. Dissecting the multicellular ecosystem of metastatic melanoma by single-cell RNA-seq. Science (New York, N.Y.) 2016;352:189–196. doi: 10.1126/science.aad0501. [DOI] [PMC free article] [PubMed] [Google Scholar]
Venteicher AS, Tirosh I, Hebert C, Yizhak K, Neftel C, Filbin MG, Hovestadt V, Escalante LE, Shaw ML, Rodman C, Gillespie SM, Dionne D, Luo CC, Ravichandran H, Mylvaganam R, Mount C, Onozato ML, Nahed BV, Wakimoto H, Curry WT, Iafrate AJ, Rivera MN, Frosch MP, Golub TR, Brastianos PK, Getz G, Patel AP, Monje M, Cahill DP, Rozenblatt-Rosen O, Louis DN, Bernstein BE, Regev A, Suvà ML. Decoupling genetics, lineages, and microenvironment in IDH-mutant gliomas by single-cell RNA-seq. Science (New York, N.Y.) 2017;355:eaai8478. doi: 10.1126/science.aai8478. [DOI] [PMC free article] [PubMed] [Google Scholar]
Zhang Y, Ma Y, Huang Y, Zhang Y, Jiang Q, Zhou M, Su J. Benchmarking algorithms for pathway activity transformation of single-cell RNA-seq data. Computational and Structural Biotechnology Journal. 2020;18:2953–2961. doi: 10.1016/j.csbj.2020.10.007. [DOI] [PMC free article] [PubMed] [Google Scholar]

eLife. doi: 10.7554/eLife.71994.sa0

Editor's evaluation

C Daniela Robles-Espinoza ¹

In this revised manuscript, Noureen and collaborators benchmark five methods that are used in single-cell RNA sequencing data analyses, including single sample gene set enrichment analysis (ssGSEA), Gene Set Variation Analysis (GSVA), AUCell, Single Cell Signature Explorer (SCSE), and a newly developed method, Jointly Assessing Signature Mean and Inferring Enrichment (JASMINE). The authors test these in distinct cancer datasets and conclude that caution should be exercised when using bulk sample-based methods in single-cell data analyses, and cellular contexts should be taken into consideration. As a rapidly developing field in need of method benchmarking, we believe that this paper will be of great interest to the bioinformatics and single-cell data analysis communities.

eLife. doi: 10.7554/eLife.71994.sa1

Decision letter

Editor: C Daniela Robles-Espinoza¹

Reviewed by: Leng Han²

Our editorial process produces two outputs: (i) public reviews designed to be posted alongside the preprint for the benefit of readers; (ii) feedback on the manuscript for the authors, including requests for revisions, shown below. We also include an acceptance summary that explains what the editors found interesting or important about the work.

Decision letter after peer review:

Thank you for submitting your article "Signature-scoring methods developed for bulk samples are not adequate for cancer single-cell RNA sequencing data" for consideration by eLife. Your article has been reviewed by 3 peer reviewers, and the evaluation has been overseen by a Reviewing Editor and Patricia Wittkopp as the Senior Editor. The following individual involved in review of your submission has agreed to reveal their identity: Leng Han (Reviewer #3).

The reviewers have discussed their reviews with one another, and the Reviewing Editor has drafted this to help you prepare a revised submission.

In this manuscript, Noureen and collaborators benchmark four methods that are used in single-cell RNA sequencing data analyses, including single sample gene set enrichment analysis (ssGSEA), AUCell, Single Cell Signature Explorer (SCSE), and a newly developed method, Jointly Assessing Signature Mean and Inferring Enrichment (JASMINE). The authors test these in distinct cancer datasets and conclude that caution should be exercised when using bulk sample-based methods in single-cell data analyses, and cellular contexts should be taken into consideration. Although interesting and informative, reviewers have raised a number of essential considerations that need to be addressed before publication.

Essential revisions:

Major Points

1. A reviewer comments, "The main analysis performed by the authors was identifying gene sets that significantly distinguish between cancer cells and the rest of the cells in the tumor. This a problematic comparison, since the cancer cells are epithelial cells (in most of the studies used) and the "normal" cells are stromal cells, mostly immune. Those are not comparable "normal" cells, and therefore it is expected that all immune-related pathways will be significant. The authors find much more down-regulated gene sets in ssGSEA compared to the other methods, but why are they wrong? If they are all immune related, I would actually conclude that ssGSEA is better than the other methods." They continue, "The comparison of cancer cells to normal microenvironment cells is meaningless. If looking at cancer, a relevant comparison would be cancer cells and normal adjacent cells. But cancer in general is not the best, as there would be way too many significant signatures, and its hard to understand what makes sense and what not". – Can the authors address this crucial observation please?

2. Can the methodology by which each method identifies signatures be explained? This would help identify why these methods perform differentially. Specifically, reviewers strongly suggest understanding what in ssGSEA makes it output different results. Gene dropouts could be the root cause, but this is something that needs proof. Empirically, reviewers suggest that this can be studied by comparing different scRNA-seq methods with different gene dropouts rates – compare 10X to SMART-seq. Do the authors have an idea of how this affects results?

3. The new method the authors propose (JASMINE) is not thoroughly explained and includes formulations which, in a reviewer's evaluation, lead to unintended mathematical behavior and hamper interpretation. Can the explanation on this method be expanded, please? Specifically, reviewers have raised the following points:

a. The four variables (a,b,c,d) considered when calculating the odds ratio make sense. However, the authors claim that "For smaller signatures b can be occasionally 0. In that case we replace it with 1." In a reviewer's words, "this is not a coherent and principled approach. I recommend the authors reconsider the mathematical formulation of this quantification to avoid the necessity of this unprincipled workaround."

b. A reviewer commented that "It is not clear why the average is the best way to combine the V_mean and OR to obtain the JASMINE score". Can this be specified, please?

[Editors’ note: further revisions were suggested prior to acceptance, as described below.]

Thank you for submitting your article "Signature-scoring methods developed for bulk samples are not adequate for cancer single-cell RNA sequencing data" for consideration by eLife. Your article has been reviewed by 3 peer reviewers, and the evaluation has been overseen by a Reviewing Editor and Naama Barkai as the Senior Editor. The following individual involved in review of your submission has agreed to reveal their identity: Leng Han (Reviewer #3).

The reviewers have discussed their reviews with one another, and the Reviewing Editor has drafted this to help you prepare a revised submission.

In this revised manuscript, Noureen and collaborators benchmark five methods that are used in single-cell RNA sequencing data analyses, including single sample gene set enrichment analysis (ssGSEA), Gene Set Variation Analysis (GSVA), AUCell, Single Cell Signature Explorer (SCSE), and a newly developed method, Jointly Assessing Signature Mean and Inferring Enrichment (JASMINE). The authors test these in distinct cancer datasets and conclude that caution should be exercised when using bulk sample-based methods in single-cell data analyses, and cellular contexts should be taken into consideration. The majority of the reviewers have found the answers to their initial points to be thorough and satisfactory, and so we would only ask the authors to address the points of one new reviewer, who replaced a previous reviewer for this round of revisions.

Essential revisions:

1. It is unclear how calibrated any of these methods are from the analysis presented in this manuscript. To help clarity, please address:

a) Can the authors please detail, in the Methods section, which thresholds are used on each program?

b) Is any test statistic is derived from the JASMINE method?

c) How is the final statistic is derived? That is, what is the range that the statistic is scaled by? In particular, in the eyes of a reviewer, the math for V_mean is unclear. They write, "As written, it is computed only over a fixed index g? The range on the sum is a bit confusing as well. Shouldn't it be over all the cells? If I were to guess, V_mean appears to be V_mean = \sum_{g=1}^{m}\sum_{c=1}^{total cells} R_{g,c}/(m*N). Is this correct?". Please clarify these points in the Methods section.

2. This manuscript infers that the main advantage of methods specific for single-cell data is to account for dropouts. However, in the view of a reviewer, the work presented here does not explicitly compare the differences between the bulk and single-cell methods, and they note that there might be factors that influence the results besides dropouts in real data. They argue that it is inconclusive to attribute poor performance of bulk methods to dropouts, especially when there is only one bulk method included in the analysis. For example, it is unclear how the authors ran ssGSEA, as there are some missing details in the Methods section. In particular, it is common practice to 'pre-filter' RNA-seq data before running any analysis on it. Did the authors do this for the real data analysis? Please add an introduction to each method, and add the details missing so readers can recapitulate the test conditions.

3. From Figure 1D, the authors reveal that the non-cycling tumor cells exhibit higher scores than non-cycling normal cells by ssGSEA. However, it is also noticeable that ssGSEA scores have a much higher variance than the other three methods, and results from SCSE and AUCell scores do not show a significant difference between cycling and non-cycling cells. Are there any explanations and/or intuition for such a result?

4. In the "Dropouts affect ssGSEA scores" section (Line 167), the "dummy expression matrix of 1000 genes and nine columns" is not clearly described. Is it for each cell? Is the ranking of genes the same for each dropout rate? Besides, Figure 3B is not fully explained: Could the authors please explain why the black line has a curvy feature towards the end of x axis while the blue line is linear? How about other dropout rates? There should be nine lines in the figure according to the matrix.

eLife. 2022 Feb 25;11:e71994. doi: 10.7554/eLife.71994.sa2

Author response

Essential revisions:

1. A reviewer comments, "The main analysis performed by the authors was identifying gene sets that significantly distinguish between cancer cells and the rest of the cells in the tumor. This a problematic comparison, since the cancer cells are epithelial cells (in most of the studies used) and the "normal" cells are stromal cells, mostly immune. Those are not comparable "normal" cells, and therefore it is expected that all immune-related pathways will be significant. The authors find much more down-regulated gene sets in ssGSEA compared to the other methods, but why are they wrong? If they are all immune related, I would actually conclude that ssGSEA is better than the other methods." They continue, "The comparison of cancer cells to normal microenvironment cells is meaningless. If looking at cancer, a relevant comparison would be cancer cells and normal adjacent cells. But cancer in general is not the best, as there would be way too many significant signatures, and its hard to understand what makes sense and what not". – Can the authors address this crucial observation please?

We agree with the reviewer that lineage is an important factor when comparing tumor and normal cells. Lineage specific signatures are expected to differ, but expectations are not reliable for benchmarking purposes because tumor cells may influence the expression of immune and stromal cells, and cancers of different lineages do not necessarily show the same differences with normal cells. A more objective approach is to compare outputs of each method with ground truth. This is why we designed two objective experiments to benchmark the reliability of the tested methods. In the first experiment, we generated gold standard up/down signatures and tested how well each method rediscovers them (sensitivity test, Figure 2a-b in initial manuscript). Note that generating these gold standard signatures does not rely on expectations, but rather on their constituent genes that are sampled from the pool of differentially expressed genes. In the second experiment, we generated pseudo-cells by down sampling. These cells are identical to the original cells in the distribution of expressed genes; thus, no signature should be identified. This experiment was to test how well each method eludes false positives (specificity test, Figure 2c in the initial manuscript). These experiments were designed precisely because of the challenges to establish functional gold standard, as the reviewer pointed it out.

We gently correct the reviewer that we found more up-regulated, not down-regulated gene sets in ssGSEA. We also respectfully remind the reviewer that in our manuscript, we never concluded that ssGSEA scores were wrong in the tumor vs normal comparisons. Instead, we concluded that “… ssGSEA is sensitive to variability of gene counts common to scRNAseq data” (lines 94-95 on page 4, first submitted version). We respectfully argue that the observation of many more deregulated signatures called by ssGSEA preludes the more systematic benchmark analyses we laid out later in the manuscript, because it substantiates the hypothesis that signature scores can be biased if a method does not factor in dropouts in single cell data.

To give the reviewer and readers an intuitive example of how ssGSEA scores may mislead single cell interpretation, we calculated the score of the cell cycle signature (Gene Ontology, GO:0007049). As the reviewer is aware, cycling cells often need to be regressed out before clustering in single cell analysis. In this example, we used the head and neck data (Puram et al., 2017), and normal/tumor cell annotations were downloaded from the publication. We divided cells into cycling (S phase, G2M phase) and non-cycling cells (G1) using Seurat. Very few normal cells were determined as cycling, and thus, were removed from the figure. We additionally added the expression of KI67 to further confirm cell cycle status. KI67 is an established proliferation marker and is expressed by cycling cells but not quiescent cells. As shown in the figure, KI67 expression agrees well with Seurat scores. Notably, in tumor cells highlighted in the red box, ssGSEA scores were significantly higher than those in normal cells, suggesting the tumor cells may be cycling. However, neither these tumor cells nor the normal cells are cycling according to other evidence. Thus, at the minimum, ssGSEA scores complicate the interpretation of cell cycle status in this example. None of the single-cell methods showed the same bias.

This example was included in our preprint but was omitted in our initial submission. In the revision, we have added it back to support our conclusion. We again apologize for the omission, and hope the reviewer finds that this example makes biological sense.

2. Can the methodology by which each method identifies signatures be explained? This would help identify why these methods perform differentially. Specifically, reviewers strongly suggest understanding what in ssGSEA makes it output different results. Gene dropouts could be the root cause, but this is something that needs proof. Empirically, reviewers suggest that this can be studied by comparing different scRNA-seq methods with different gene dropouts rates – compare 10X to SMART-seq. Do the authors have an idea of how this affects results?

We thank the reviewers for suggesting new analyses to better understand the impact of dropouts on signature scores. Comparing SMART-seq with 10x does reproduce the differences in dropout rates. Among the 10 datasets we used, five were from SMART-seq, four from 10X and one from C1 fludigm. However, one cell cannot be sequenced twice with two scRNAseq platforms. Therefore, it is unfeasible to compare data of the same cells generated by the two platforms. Alternatively, we can compare the same cell population sequenced by the two. In a recent study (PMID: 33662621), Wang et al., sequenced the same samples of CD45^- cells with both SMART-seq and 10x platforms. They showed that the RNA composition detected by the two platforms differ. Specifically, the 10x platform detected more ribosomal genes and lncRNAs, whereas SMART-seq detected more mitochondrial genes. The differences in RNA composition are critically important for RNA sequencing data analysis (see Robinson et al., Genome Bio, 2010. PMID: 20196867; and Zhao et al., RNA, 2020. PMID: 32284352), and thus, gene expression values are not directly comparable between the two platforms. For the same reason, gene signatures scores are not directly comparable either. Another confounding factor is cell-to-cell variation such as caused by transcriptional bursting.

Down sampling is an established method in single cell RNAseq analysis to recapitulate the effects of differences in sequencing depth. The method increases technical dropout rates and creates a realistic expression profile at a prespecified depth (Luecken M and Theis F, Mol Syst Biol, 2019. PMID: 31217225). Using this method, we can generate the expression profile of the same cell at a lower depth (conversely, higher dropout) without confounding effects from other factors. As we show in Figure 2C, at the 50% down sampling rate, ssGSEA identifies many deregulated signatures between the original cells and down sampled cells across all ten datasets. These concerning results indicate dropouts likely affect ssGSEA much more than other methods.

To further prove this, we set down sampling levels to 20%, 40%, 60% and 80%. Here, 80% down sampling means after the procedure, the cell is at 80% sequencing depth of the original cell. As down sampling level decreases, more dropouts are created as the procedure primarily affects genes with low counts. New Figure 3A shows how the number of up and down signatures identified by ssGSEA shifts along with down sampling levels. Because the dropout rate is the only parameter we manipulated in this analysis, the results clearly indicate that it impacts ssGSEA scores. In contrast, single-cell methods do not show the same correlation. We replicated the results in other datasets (see Figure 3—figure supplement 1 in the revised manuscript).

An empirical approach to this question is to take advantage of cell-to-cell variation and examine how scoring patterns change as cells with similar dropout rates are compared. In new Figure 3-figure supplement 2, the top panel shows score correlation of a signature between JASMINE and the other three methods using all cells from the head and neck data. Normal cells are scored higher by single cell methods but not ssGSEA, despite that JASMINE and ssGSEA scores are nicely correlated in both normal and tumor cells. In the bottom panel, we limited tumor and normal cells to those with 4,000-5,000 expressed genes. With this filtering, tumor cells still have higher gene counts than normal cells on average (tumor, 4585; normal, 4407), but the difference is much smaller than using the entire dataset. Nevertheless, using these cells led to a much more consistent pattern between ssGSEA and single-cell methods, further supporting our conclusion.

We agree with the reviewers that more insights are needed, particularly related to ssGSEA. The method scores a signature by summing up differences in the empirical cumulative density of genes inside and outside the signature. To do so, ssGSEA orders and ranks genes from high to low expression and iterate through every gene. For bulk expression data, the ranks are continuous; thus, signature scores gradually change depending on the distribution of the signature genes on the rank list. Single cell data often encompass thousands of dropouts. Because dropout genes have equal expression values, i.e., zero, their ranks are tied, thus disrupting the presumption of continuous data distribution. This predicates that if a signature has no gene in the dropout zone, the signature score is not affected by dropout rates. Otherwise, the score will be affected. We think this is an important issue, because using 10x, it is common to detect only 2,000-4,000 genes; even smart-seq usually only detect ~6,000 genes. It is thus highly likely that genes from gene signatures fall into the dropout zone. Single cell methods dodge this issue for various reasons. AUCell and SCSE do not use dropout genes in scoring, whereas JASMINE explicitly considers dropout rate in the calculation.

We next designed a computational approach to test this idea. We first generated a dummy expression vector with 10,000 genes. The genes were ranked from high to low based on expression, the same as ssGSEA. This expression vector mimics bulk expression data. Then we set the bottom of the vector to zeros to create dropouts at various rates. Next, we generated a highly expressed gene signature of 100 genes. This 100-gene signature consisted of 99 genes that were randomly sampled from the top 1000 genes, and the last gene (the floating signature gene) was iterated from the 1001^st gene onwards to the bottom of the gene list. This allowed us to observe how the signature score changes as the floating signature gene moved towards the dropout zone. We have provided the source code of this process in our Github repository for the reviewers and other researchers to reproduce our results and test different scenarios using the dummy expression data (https://github.com/NNoureen/JASMINE).

In new Figure 3B, the black line shows the signature score at no dropout. The blue line shows the signature score when the floating gene enters the dropout zone at 60% dropout rate. At this rate, the dropout zone is between the gene rank 4,000 and 10,000. Before the floating signature gene enters the dropout zone (at gene rank 4,000), the signature score is identical between the original cell and down sampled cell. However, between gene rank 4,000 and 7,000, the signature is scored higher in the dropout cell than the control cell, and the pattern is reversed after gene rank 7,000. Note that gene rank 7,000 is the tie rank for all dropout genes. Clearly, the scores deviate from the intended behavior in the dropouts.

In the revision, we have added new Figure 3A, new Figure 3—figure supplement 2 and new Figure 3B to the manuscript. We have also discussed the reasons why these methods perform differently.

3. The new method the authors propose (JASMINE) is not thoroughly explained and includes formulations which, in a reviewer's evaluation, lead to unintended mathematical behavior and hamper interpretation. Can the explanation on this method be expanded, please? Specifically, reviewers have raised the following points:

a. The four variables (a,b,c,d) considered when calculating the odds ratio make sense. However, the authors claim that "For smaller signatures b can be occasionally 0. In that case we replace it with 1." In a reviewer's words, "this is not a coherent and principled approach. I recommend the authors reconsider the mathematical formulation of this quantification to avoid the necessity of this unprincipled workaround."

b. A reviewer commented that "It is not clear why the average is the best way to combine the V_mean and OR to obtain the JASMINE score". Can this be specified, please?

a. When calculating odds ratio, the denominator is b*c, where b is the number of signature genes that are not detected, and c is the number of nonsignature genes that are detected (see Author response table 1 for reviewers’ convenience).

Author response table 1. A breakdown of signature genes in a single cell expression profile.

The four numbers are used in signature enrichment calculation by JASMINE.

	Expressed	Not-expressed
Signature gene	a	b
Non-signature gene	c	d

Open in a new tab

b. In practice, c is unlikely zero. b would be zero if all genes in the signature are expressed. In this scenario, the signature is highly enriched in the expressed genes, and thus, should be given a high OR score. Imagine for a signature consisting of 50 genes, if all 50 genes are expressed in a cell, the signature is apparently highly enriched with expressed genes. Even if we assume one of the signature genes is not expressed, the enrichment pattern will be largely the same. We think adding a small number is a mathematical necessity without significantly changing the biological interpretation. Our approach is not so much different from other methods such as Haldane correction which would add a small number, such as 0.5, to each variable in the OR calculation.

We agree with the reviewer that this workaround may deviate scores from the formula for some signatures. To clarify how many signatures may be affected, the following Author response table 2 summarizes the number of signatures that have b=0 in none of the cells, 1 cell, 2 cells, or ≥3 cells. The number in the parenthesis indicates percentage. The total number of cells analyzed in each dataset is also given in the column name. On average, 99% of the signatures we analyzed did not encounter this issue in any of the cells in the ten datasets. Given that each dataset has hundreds to thousands of cells, this issue is extremely rare.

Author response table 2. Counts and frequencies of signatures with b=0 in 0, 1, 2 and ≥3 cells across datasets.

The number of cells analyzed in each dataset is given in the parenthesis at the column name.

cell # (b=0)	Astrocytoma(n=6176)	BRCA(n=515)	IDHwtGBM(n=5742)	HNSC(n=5560)	GBM(n=3589)
0	7425 (98.96)	7438 (99.13)	7395 (98.56)	7194 (95.88)	7326 (99.84)
1	33 (0.44)	25 (0.33)	33 (0.44)	161 (2.15)	7 (0.1)
2	18 (0.24)	10 (0.13)	17 (0.23)	46 (0.61)	2 (0.03)
≥3	27 (0.36)	30 (0.4)	58 (0.6)	102 (1.36)	3 (0.04)
cell # (b=0)	LIHC (n=5023)	CRC (n=3336)	ccRCC (n=3933)	Prostate (n=2170)	Melanoma (n=4645)
0	7482 (99.72)	7496 (99.91)	7489 (99.81)	7493 (99.87)	7253 (96.67)
1	8 (0.11)	2 (0.03)	7 (0.09)	7 (0.09)	79 (1.05)
2	7 (0.09)	3 (0.04)	1 (0.01)	2 (0.03)	43 (0.57)
≥3	6 (0.08)	2 (0.03)	6 (0.08)	1 (0.01)	128 (1.71)

Open in a new tab

Inspired by the reviewer’s comments, we have provided likelihood ratio as an additional approach to odds ratio in the revision. The first likelihood calculates the risk of a signature gene being detected as a/(a+b), and the second likelihood calculates the risk of non-signature genes being detected as c/(c+d). The second likelihood essentially provides a background, and approximately equals the ratio between the numbers of expressed genes and non-expressed genes. The likelihood ratio is calculated as the ratio of the two likelihoods, i.e., LR=a*(c+d)/(c*(a+b)). This will effectively avoid the 0 denominator issue. In our testing, the likelihood ratio and odds ratio gave highly consistent results: effect size correlation was higher than 0.95 in four of the five datasets tested. The new function has been implemented in the JASMINE code.

In the revision, we have added more details to the method description, and clarified the extent to which b=0 may affect signature scores. The JASMINE code has also been updated on Github. We thank the reviewer for prompting us to improve the JASMINE function.

b. We cannot say the average is the best way to combine the two. As insight evolves, there might be better and more sensible approaches to signature scoring. During our analysis, we tested multiple ways, including arithmetic mean, geometric mean, and some other non-linear methods. After the tests, we felt equal contribution from the two score components makes best sense, largely because we do not have a good reasoning to overweigh one option over another.

[Editors’ note: further revisions were suggested prior to acceptance, as described below.]

Essential revisions:

1. It is unclear how calibrated any of these methods are from the analysis presented in this manuscript. To help clarity, please address:

a) Can the authors please detail, in the Methods section, which thresholds are used on each program?

When we calculated scores with each method, no thresholds were used. However, when we compared signature scores between tumor and normal cells, we used Cohen’s d > 1 as criterion for up regulation and < -1 for down regulation. These thresholds have been clarified in the main text and Methods.

b) Is any test statistic is derived from the JASMINE method?

JASMINE does not derive a test statistic, similar to other methods benchmarked in the study. These methods calculate a score for an input signature in a cell/sample. These scores can be compared across cells/samples for differences in signature expression. Compared with methods such as regular GSEA, these single sample methods do not rely on class labels, and thus, offer better flexibility for downstream analyses. In our revised manuscript, we have provided further context in Discussion.

c) How is the final statistic is derived? That is, what is the range that the statistic is scaled by? In particular, in the eyes of a reviewer, the math for V_mean is unclear. They write, "As written, it is computed only over a fixed index g? The range on the sum is a bit confusing as well. Shouldn't it be over all the cells? If I were to guess, V_mean appears to be V_mean = \sum_{g=1}^{m}\sum_{c=1}^{total cells} R_{g,c}/(m*N). Is this correct?". Please clarify these points in the Methods section.

We thank the reviewer for meticulously checking the formula, and yes, it is indeed a bit confusing. To make it clearer, we have revised it as follows:

Assume R_g represents the rank of an expressed signature gene g among genes with expression value > 0 in a cell, then a mean rank vector V_mean is calculated as follows:

V m e a n = \sum_{g = 1}^{m} R_{g, c} / (m * N)

The final score is defined as the arthemetic mean of the 2 signature components: signature mean defined by Vmean, and signature enrichment defined by odds ratio or likelihood ratio. Each component has a range of 0-1.

2. This manuscript infers that the main advantage of methods specific for single-cell data is to account for dropouts. However, in the view of a reviewer, the work presented here does not explicitly compare the differences between the bulk and single-cell methods, and they note that there might be factors that influence the results besides dropouts in real data. They argue that it is inconclusive to attribute poor performance of bulk methods to dropouts, especially when there is only one bulk method included in the analysis. For example, it is unclear how the authors ran ssGSEA, as there are some missing details in the Methods section. In particular, it is common practice to 'pre-filter' RNA-seq data before running any analysis on it. Did the authors do this for the real data analysis? Please add an introduction to each method, and add the details missing so readers can recapitulate the test conditions.

We appreciate the reviewer’s concern. We did pre-filter the single-cell RNAseq datasets by removing genes that were not detected in any cells. As for the running of these methods is concerned, we have run ssGSEA and GSVA using the R biocondutor package GSVA. We used the default parameters to run these methods. Similarly we ran AUCell using their package with default parameters. SCSE did not provide R codes, only a web application. For fair comparison, we implemented it in R according to the formula reported in their publication. These details have been added to the methods section. We have also included some discussions on each method in the revised manuscript.

3. From Figure 1D, the authors reveal that the non-cycling tumor cells exhibit higher scores than non-cycling normal cells by ssGSEA. However, it is also noticeable that ssGSEA scores have a much higher variance than the other three methods, and results from SCSE and AUCell scores do not show a significant difference between cycling and non-cycling cells. Are there any explanations and/or intuition for such a result?

We thank the reviewer for raising this great question, which prompted us to include gene counts in the figure. Visually, it is clear that ssGSEA scores are associated with gene counts, which we think further strengthens our conclusion. We have replaced the original Figure 1D with the new one in our revised manuscript.

4. In the "Dropouts affect ssGSEA scores" section (Line 167), the "dummy expression matrix of 1000 genes and nine columns" is not clearly described. Is it for each cell? Is the ranking of genes the same for each dropout rate? Besides, Figure 3B is not fully explained: Could the authors please explain why the black line has a curvy feature towards the end of x axis while the blue line is linear? How about other dropout rates? There should be nine lines in the figure according to the matrix.

The dummy expression matrix of 10,000 genes and nine columns reflect a scenario where the first column representing a cell that has no dropouts; from the 2^nd column to 9^th column is the same cell but with different drop out rates ranging from 10 % to 80 % with the step of 10%. The ranking of genes is identical in each column. We agree with the reviewer that stating 9 columns in the text may lead to confusion because we did not show 9 lines in the figure. The reason we only showed 60% dropout (blue line) is that we think having one line is sufficient to show the deviation in the dropout zone, whereas having multiple lines would make the figure harder to read, and importantly, unnecessary given that they are essentially repeating the same message. The codes for this similatuion were published in our Github (https://github.com/NNoureen/JASMINE). We respectfully encourage the reviewer to check them.

As for the curly tail of the ssGSEA scores in the cell without dropouts, that is how ssGSEA penalizes the genes at the bottom of the gene list as shown in our floating gene simulation. Thus, it appears ssGSEA penalizes the gene less for the signature score if the gene appears at the bottom of the list. In our experiment, we only changed the position of the last signature gene while keeping all other settings the same. When the ranking of the bottom genes are identical due to dropouts, the penalization became linear, showing the score as a straight declining line. We think these are compelling data to show how dropouts can affect ssGSEA scoring.

In the revision, we have edited the text to mitigate confusion. We thank the reviewer for pointing out this.

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Data Citations

Noureen N. 2021. Signature-scoring methods developed for bulk samples are not adequate for cancer single-cell RNA sequencing data. GitHub. GitHub [DOI] [PMC free article] [PubMed]
Tirosh I, Izar B, Prakadan SM. 2016. Single-cell expression of metastatic melanoma. NCBI Gene Expression Omnibus. GSE72056
Puram SV, Tirosh I, Parikh AS, Patel AP. 2017. Single-cell expression of head and neck cancer. NCBI Gene Expression Omnibus. GSE103322
Bi K, Mx He, Bakouny Z, Kanodia A. 2021. Single-cell expression of advanced renal cell carcinoma. Broad Single Cell portal. SCP1288
Lee HO, Hong Y, Etlioglu HE, Cho YB. 2020. Single-cell expression of colorectal cancer. NCBI Gene Expression Omnibus. GSE144735
Ma L, Hernandez MO, Zhao Y, Mehta M. 2019. Single-cell expression of liver cancer. NCBI Gene Expression Omnibus. GSE125449
Neftel C, Laffy J, Filbin MG, Hara T. 2019. Single-cell expression of IDH wildtype glioblastoma. Broad Single Cell portal. SCP393
Venteicher AS, Tirosh I, Hebert C, Yizhak K. 2017. Single-cell expression of IDH mutant astrocytoma. Broad Single Cell portal. SCP50
Darmanis S, Sloan SA, Croote D, Mignardi M. 2017. Single-cell expression of glioblastoma. NCBI Gene Expression Omnibus. GSE84465
Chung W, Eum HH, Lee HO, Lee KM. 2017. Single-cell expression of breast cancer. NCBI Gene Expression Omnibus. GSE75688
Mx He, Cuoco MS, Crowdis J, Bosma-Moody A, Zhang Z, Bi K, Kanodia A. 2021. Single-cell expression of prostate cancer. Broad Single Cell portal. SCP1244

Supplementary Materials

Figure 1—source data 1. Source data for Figure 1.

elife-71994-fig1-data1.xlsx^{(1.5MB, xlsx)}

Figure 2—source data 1. Source data for Figure 2.

elife-71994-fig2-data1.xlsx^{(17.5KB, xlsx)}

Figure 3—source data 1. Source data for Figure 3.

elife-71994-fig3-data1.xlsx^{(10.5KB, xlsx)}

Supplementary file 1. Single cell RNAseq datasets.

The table lists all the ten datasets used in this study. Cell annotations were downloaded from the original publications.

elife-71994-supp1.xlsx^{(16.2KB, xlsx)}

Transparent reporting form

elife-71994-transrepform1.docx^{(246.7KB, docx)}

Data Availability Statement