TWO-SIGMA-G: a new competitive gene set testing framework for scRNA-seq data accounting for inter-gene and cell–cell correlation

Eric Van Buren; Ming Hu; Liang Cheng; John Wrobel; Kirk Wilhelmsen; Lishan Su; Yun Li; Di Wu

doi:10.1093/bib/bbac084

2022 Mar 24;23(3):bbac084. doi: 10.1093/bib/bbac084

PMCID: PMC9271221 PMID: 35325048

Abstract

We propose TWO-SIGMA-G, a competitive gene set test for scRNA-seq data. TWO-SIGMA-G uses a mixed-effects regression model based on our previously published TWO-SIGMA to test for differential expression at the gene-level. This regression-based model provides flexibility and rigor at the gene-level in (1) handling complex experimental designs, (2) accounting for the correlation between biological replicates and (3) accommodating the distribution of scRNA-seq data to improve statistical inference. Moreover, TWO-SIGMA-G uses a novel approach to adjust for inter-gene-correlation (IGC) at the set-level to control the set-level false positive rate. Simulations demonstrate that TWO-SIGMA-G preserves type-I error and increases power in the presence of IGC compared with other methods. Application to two datasets identified HIV-associated interferon pathways in xenograft mice and pathways associated with Alzheimer’s disease progression in humans.

Introduction

Single-cell RNA sequencing (scRNA-seq) data have been used to understand the heterogeneity of cell type landscapes and to understand the variation of gene expression at the single-cell resolution across biological processes or treatments. Many of the data analysis methods originally designed for bulk RNA-seq data, including differential expression (DE) based gene set tests, have been reapplied to scRNA-seq data. These DE-based gene set tests are used to test whether a pre-constructed [1] set of genes is significantly differentially expressed between/among sample groups. In the past decade and a half, such gene set tests have been used to contextualize gene-level DE analyses and identify both important pathways and real biological mechanisms [2–5]. These gene set tests also improve statistical power and reduce spurious associations as compared with gene-level DE testing [6, 7]. This further increases reproducibility across multiple experiments, which is often lower than desired due to biological and technical variability across different transcription profiling platforms [6, 7]. Such DE-based gene set tests therefore constitute a routine step of performing DE based analyses, in both bulk RNA-seq and scRNA-seq datasets [8, 9].

There are at least two different types of DE-based analyses for scRNA-seq data. The first type of DE-based analysis focuses on the difference between sample groups to identify genes or gene sets significantly associated with sample groups. Another type of DE analysis is typically conducted by finding biomarker genes for a cell type, that is, to compare between or among cell types. In this paper we focus on the first type of analysis; however, the method we propose can perform gene set testing for both types of questions. Conducting gene-level DE analysis is the first step in DE-based gene set testing. One method for gene-level DE in scRNA-seq data is our previously published TWO-SIGMA [10], which has several useful features including directly modeling zero-inflation and accounting for possible correlation among cells from the same sample.

It is essential to discriminate between the two types of null hypotheses, competitive and self-contained, used in gene set testing. Competitive gene set tests evaluate significance by comparing the evidence of DE of a gene set to the evidence in a reference set of genes [6, 11–13]. In contrast, self-contained tests are commonly used to compare the similarity of gene expression patterns in a gene set across different data sets [14, 15], not relative to other genes. Competitive tests use a battery of gene sets, for example from the Molecular Signatures Database [16, 17], to rank sets and analyze which are the most significantly associated with a given phenotype. The results of competitive tests are easier to interpret, and competitive tests are more common in the literature today [18, 19]. Previous studies have shown that competitive tests which ignore the possibility of inter-gene correlation (IGC) in the test set and assume independence often suffer from inflated type-I error [14, 18, 20]. This is because genes within a given gene set tend to have a positive IGC, even under the null hypothesis. Ignoring this IGC underestimates the variance of set-level summary statistics and can dramatically inflate type-I error through inducing a typically positive correlation in the marginal gene-level statistics [6, 18, 19, 21]. Therefore, it is critical that any competitive gene set test adequately account for IGC to provide statistically rigorous set-level P-values.

Available DE-based gene set tests originally developed for microarray and bulk RNA-seq data include GSEA [16, 22] and related extensions sigPathway and fGSEA [23], CAMERA [18] and PAGE [11]. We are aware of two existing methods explicitly created for competitive gene set testing using scRNA-seq data: iDEA [24] and an extension of MAST [20]. iDEA jointly conducts gene-level DE testing using zingeR [25] and uses a Bayesian approach to produce set-level P-values. iDEA does not adjust for IGC, however, and may not detect the scenario in which the same proportion of genes are significant in the test and reference set but the magnitude of the association differs. MAST fits a log-normal hurdle model at the gene-level and uses a Z-test with a computationally intensive bootstrapping procedure that was not studied in great detail to produce set-level P-values. Many other scRNA-seq-related gene set tests with different goals have been developed, including BAGSE [26], UniPath [27], PAGODA [28] and SCENIC-AUcell [29]. BAGSE is not a competitive test and has a similar hypothesis to GSEA [16], consisting of a hybrid of the self-contained and competitive null hypotheses [12, 14, 30]; PAGODA looks for coordinated variation and is not DE based; UniPath uses single-cell ATAC-seq (scATAC-seq) data in addition to scRNA-seq data; and AUcell in the SCENIC workflow is based on regulon activity.

Overall, there is a need for methodological advancements to tailor gene set testing frameworks to scRNA-seq data, and a need to evaluate the ability of methods designed for bulk data to provide statistically valid results when applied to scRNA-seq data. This paper develops TWO-SIGMA-G, a set-level framework for DE testing in scRNA-seq data using the competitive null hypothesis. To test for DE at the gene-level, TWO-SIGMA-G uses our recently developed TWO-SIGMA [10], providing a flexible mixed-effects zero-inflated negative binomial regression model for good fit to the data at the gene-level. The use of a regression-based framework allows complex experiments including multiple covariates to be analyzed and provides many choices for gene-level statistics depending on the priority of the analysis. To avoid the inflated type-I error often caused by commonly existing positive IGC in a biological gene regulation pathway, IGC is estimated using an innovative residual-based approach and explicitly adjusted for at the set-level. We demonstrate TWO-SIGMA-G outperforms existing competitive gene set testing methods using extensive simulation scenarios. Application of TWO-SIGMA-G to an HIV-related humanized mouse scRNA-seq dataset and an Alzheimer’s Disease human brain scRNA-seq dataset reveals exciting biological findings. TWO-SIGMA-G is available on CRAN and on GitHub at https://github.com/edvanburen/twosigma.

Materials and Methods

Estimation of IGC

Before specifying our new gene set testing method, we first propose a novel strategy to estimate IGC between pairs of genes from their respective gene-level DE regression models. Cell-level covariates such as the cellular detection rate (CDR), which measures the percentage of genes expressed in a cell, have been previously demonstrated to be highly influential to observed expression levels [20]. Subject-specific covariates, such as disease status or ethnicity, can further create an additional correlation structure in the raw data. Therefore, using the raw data to estimate IGC can overestimate the correlation that remains between gene-level statistics, which come from regression models that directly adjust for these other covariates. Thus, the use of residuals to estimate IGC can better represent the remaining correlation of the gene-level statistics under the null.

We estimate the IGC of a given gene set using gene-level residuals from the TWO-SIGMA [10] model (discussed more below and in Supplementary Section S1) as follows. Define the ($n_i \times 1$) vector of residuals for gene $j$ from individual $i$ as $r_{ij}$, where $n_i$ is the number of cells from individual $i$. Then, by individual, construct the ($n_i \times m_1$) matrix $R_i$ consisting of the residuals for all $m_1$ test set genes. Given these residual matrices, we compute the pairwise ($m_1 \times m_1$) Pearson correlation matrix $C_i$, which contains $\binom{m_1}{2}$ unique non-diagonal elements. These elements give the pairwise correlations between the residuals of two different genes in the test set. We average these values to produce one average pairwise correlation $\hat{\rho}_i$ per individual. Finally, we estimate the overall correlation with the average of these values such that $\hat{\rho} = \frac{1}{N}\sum_{i=1}^{N}\hat{\rho}_i$, where $N$ is the number of individuals. We choose to use averages here because, as discussed in the CAMERA paper [18], only the average correlation in the residual space matters when testing the competitive null hypothesis. Thus, even if positive and negative correlations are averaged, statistical inference will remain valid. If $\hat{\rho}$ is negative, we set it to zero, which provides conservative inference in the procedure described in the next section.
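The averaging steps in this paragraph can be illustrated with a minimal Python sketch. It assumes the gene-level residuals have already been extracted and grouped by individual; the function names `pearson` and `estimate_igc` and the nested-list input layout are hypothetical and are not part of the twosigma package.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences.
    Assumes neither sequence is constant (non-zero variance)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def estimate_igc(residuals_by_individual):
    """Residual-based IGC estimate (illustrative layout, an assumption).

    residuals_by_individual: one entry per individual; each entry is a list
    of residual vectors, one per test-set gene, each with one value per cell
    from that individual.
    """
    per_individual = []
    for gene_residuals in residuals_by_individual:
        m = len(gene_residuals)
        # All unique off-diagonal pairs: m choose 2 correlations.
        pair_corrs = [
            pearson(gene_residuals[j], gene_residuals[k])
            for j in range(m)
            for k in range(j + 1, m)
        ]
        per_individual.append(sum(pair_corrs) / len(pair_corrs))
    # Average over individuals, then truncate negative estimates at zero.
    rho_hat = sum(per_individual) / len(per_individual)
    return max(rho_hat, 0.0)
```

Averaging within individual before averaging across individuals gives each individual equal weight regardless of how many cells it contributes, which is one way to read the motivation given in the text.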

Our IGC procedure therefore builds off of the advantages of a residual-based approach in removing the correlation from sample-level and cell-level covariates. We add the use of individual-level calculations to help mitigate the impacts of the large individual heterogeneity often seen in scRNA-seq datasets. In simulations, we found that this IGC estimate preserves type-I error in a conservative manner while still producing improved power in a variety of realistic scenarios. The estimate of the IGC is virtually free computationally in that the model is not refit via permutation or bootstrapping.

Figure 1 shows the average set-level IGC estimates from TWO-SIGMA-G for each of the two comparisons in the Alzheimer’s dataset described more in the real data analysis section. A non-zero correlation exists in the residual space for both comparisons, with half of the sets having a correlation larger than 0.02. For each comparison, over 98% of the estimated pairwise correlations are positive. Ignoring this remaining correlation makes inflated type-I error a distinct possibility (Supplementary Table S1). As mentioned above, estimated correlations that are negative are set to zero when computing the TWO-SIGMA-G P-value, leading to conservative inference as discussed in the methods section.

Figure 1. Set-level IGC estimates from TWO-SIGMA-G’s residual-based approach for the two Alzheimer’s dataset comparisons (see the real data analysis section). Sets plotted are taken from the c2 collection of the Molecular Signatures Database. Most sets demonstrate a substantially positive correlation after regressing out sample-level and cell-level covariates.

TWO-SIGMA-G for Set-Level Testing

We extend our TWO-SIGMA method [10] (Supplementary Section S1) to competitive gene set testing via TWO-SIGMA-Geneset (TWO-SIGMA-G), an overview of which is shown in Figure 2. Briefly, TWO-SIGMA uses a zero-inflated negative binomial regression model to test for DE at the gene level in scRNA-seq data. The model is flexible and can be customized in several different ways. First, the zero-inflation component can be removed from the model entirely (as was done in the real data analysis section), leaving a standard negative binomial regression model. Second, the model can additionally include random effect terms to account for cell–cell correlation within the same sample and limit type-I error inflation in gene-level DE inference. Finally, many gene-level statistics measuring the evidence of DE can be used for set-level testing. We discuss uses for other, more complex, gene-level statistics based on custom contrasts of regression parameters in the real data analysis section.

Figure 2. Overview of TWO-SIGMA-G. TWO-SIGMA-G enables cell-type-specific DE-based gene set testing using single-cell RNA-sequencing data. For clarity, we illustrate TWO-SIGMA-G on a binary comparison with two cell types, but we refer readers to the methods and real data analysis sections for examples of possible analyses involving more than two cell types and/or more than two sample groups simultaneously. In TWO-SIGMA-G, our previously published TWO-SIGMA method is used to produce gene-level statistics using the likelihood ratio test or the Z test. IGC is estimated in a novel way using the gene-level residuals, and set-level inference is conducted using a modified Wilcoxon rank-sum test which accounts for IGC to preserve type-I error. TWO-SIGMA-G is applicable both when analyzing DE between cell types for phenotype-associated gene sets and when analyzing DE between phenotypes for phenotype-associated gene sets.

TWO-SIGMA-G employs the Wilcoxon rank-sum test to compare the statistics of genes in the test set to the statistics of genes in the reference set, and therefore uses the sum of the ranks in the test set as the set-level summary statistic. In using the ranks, TWO-SIGMA-G provides robustness against the influence of very large gene-level statistics. Traditionally, the Wilcoxon rank-sum test assumes that observations within a group are independent. However, as mentioned, IGC is expected given the construction of gene sets as harmonious biological pathways, and can inflate type-I error if ignored [18]. To create a gene set testing method designed for single-cell data, we utilize a modified version of the rank-sum test. This modification allows for correlated gene-level statistics in the test set [21], similar to the approach of CAMERA for bulk RNA-seq [18]. We assume a pairwise correlation $\rho$ between gene-level statistics in the test set of size $m_1$ and no correlation in the reference set of size $m_2$. With these assumptions, the variance of the two-group Wilcoxon rank-sum statistic $W$ is

$$\mathrm{Var}(W) = \frac{m_1 m_2}{2\pi}\left[\arcsin(1) + (m_2 - 1)\arcsin\left(\tfrac{1}{2}\right) + (m_1 - 1)(m_2 - 1)\arcsin\left(\tfrac{\rho}{2}\right) + (m_1 - 1)\arcsin\left(\tfrac{\rho + 1}{2}\right)\right].$$

As $\rho$ increases, this variance term increases as well. Therefore, ignoring a positive $\rho$ leads to an underestimated variance and inflated type-I error as a result. When $\rho = 0$, the expression reduces to the usual rank-sum variance $m_1 m_2 (m_1 + m_2 + 1)/12$. As discussed in the previous section, we estimate $\rho$ using a residual-based approach. Using this modified variance formula, and the known mean of rank-sum statistics under the null, set-level P-values are computed analytically using a standard normal approximation [18]. The reference set used in TWO-SIGMA-G can be chosen in one of two ways: either using a random sample of other genes of size $m_2$ or as the collection of all genes not in the test set under consideration.
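As a rough illustration of the set-level computation, the sketch below evaluates a correlation-adjusted rank-sum variance and a normal-approximation P-value. It assumes the variance takes the form given in the CAMERA paper [18] for a test set with common pairwise correlation and an uncorrelated reference set. The function names, the two-sided P-value and the absence of a continuity correction are assumptions made here, not details confirmed by the text.

```python
import math

def ranksum_variance(m1, m2, rho):
    """Variance of the rank-sum statistic when the m1 test-set statistics
    share pairwise correlation rho and the m2 reference-set statistics are
    independent (form assumed from the CAMERA paper)."""
    return (m1 * m2 / (2 * math.pi)) * (
        math.asin(1.0)
        + (m2 - 1) * math.asin(0.5)
        + (m1 - 1) * (m2 - 1) * math.asin(rho / 2)
        + (m1 - 1) * math.asin((rho + 1) / 2)
    )

def set_level_p_value(test_rank_sum, m1, m2, rho):
    """Two-sided P-value from a standard normal approximation.
    test_rank_sum is the sum of ranks of the test-set genes among all
    m1 + m2 genes; no continuity correction is applied (an assumption)."""
    null_mean = m1 * (m1 + m2 + 1) / 2
    sd = math.sqrt(ranksum_variance(m1, m2, rho))
    z = (test_rank_sum - null_mean) / sd
    return math.erfc(abs(z) / math.sqrt(2))
```

With `rho = 0` the variance equals the classical value `m1 * m2 * (m1 + m2 + 1) / 12`, so the adjustment only changes inference when the estimated IGC is positive. For a fixed rank sum, a larger `rho` gives a larger variance and therefore a larger P-value.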

In addition to producing set-level significance, TWO-SIGMA-G also identifies the directionality of sets as upregulated or downregulated. Whether or not a zero-inflation component is included in gene-level models, directionality is produced by averaging gene-level log fold-change estimates in the test set to produce a set-level effect size and taking the sign of the result. These effect sizes are demonstrated further in the real data analysis section.

As compared with other methods, TWO-SIGMA-G has several key advantages in applicability and interpretability. First, it is explicitly tailored to scRNA-seq data at the gene-level in that it can flexibly and optionally account for zero-inflation, overdispersion and within-subject random effect terms to account for within-subject cell–cell correlation. Second, the use of a regression modeling framework at the gene-level enables the analysis of complex designs including multiple confounding covariates, as will be demonstrated further in the real data analysis section. Third, estimating IGC using residuals after regressing out sample-level and cell-level covariates provides estimates of IGC that more closely reflect the remaining correlation of the gene-level statistics. The standard output of TWO-SIGMA-G includes gene-level DE summary statistics and associated P-values, set-level P-values (nominal and FDR-corrected) and IGC estimates, and set-level effect sizes to characterize pathways as ‘upregulated’ or ‘downregulated’. When multiple cell types are included as a contrast, all information except IGC estimates is cell-type-specific, allowing users to investigate cellular heterogeneity.

Simulation Studies

We utilize a custom simulation procedure to simulate correlated gene sets (see Supplementary Section S2 for full details) and test for set-level DE of a binary covariate representing a treatment effect. First, independent genes were simulated from the zero-inflated negative binomial distribution, optionally including (1) within-sample random effect terms to create a within-sample correlation structure and (2) additional confounding covariates to create additional cell–cell correlation. For each independent gene, we then simulated 29 correlated genes by adding random noise from the negative binomial distribution to create correlated gene sets of size 30. Under the alternative, the magnitude of the added noise is increased to preserve signal and maintain the correlation structure. We simulated 1000 independent genes without gene-level random effect terms and 300 independent genes with gene-level random effect terms for each of six settings (Supplementary Table S2) which vary the magnitude of added noise and the presence of additional covariates in the gene-level simulation. These six settings are intended to represent the diversity seen in real data sets to paint an accurate picture of testing properties over a wide range of gene sets and correlation structures. Simulations assumed 100 cells from each of 100 samples, and constructed uncorrelated reference sets randomly. To increase variability in the design matrix and thus the simulated expression data, we repeat this simulation procedure 10 times using the same distributional assumptions, but a different random seed. This simulation procedure not only allows us to capture the impact of excess zeros, but is also designed to vary the percentage of DE genes in the test and reference sets to mimic real data, which typically has DE genes in both the test set and the reference set. For example, the scenario ‘T50, R20’ is a scenario under the competitive alternative hypothesis in which 50% of genes in the test set are DE and 20% of the genes in the reference set are DE. Genes that are DE have the same effect size in all cases with the exception of the ‘mixed DE’ scenarios which create gene sets with two different gene-level DE effect sizes.
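The core idea of building a correlated gene set from one independent gene can be sketched as follows. This is a simplified illustration only: all parameter values are hypothetical, there is no treatment effect, random effect or covariate, and the actual procedure is the one described in Supplementary Section S2.

```python
import math
import random

def draw_nb(rng, size, prob):
    """Negative binomial draw (failures before `size` successes), built as a
    sum of geometric draws; `size` must be a positive integer here."""
    total = 0
    for _ in range(size):
        u = 1.0 - rng.random()  # uniform on (0, 1], avoids log(0)
        total += int(math.log(u) / math.log(1.0 - prob))
    return total

def draw_zinb(rng, size, prob, zero_prob):
    """Zero-inflated negative binomial: a structural zero with probability
    zero_prob, otherwise a negative binomial count."""
    if rng.random() < zero_prob:
        return 0
    return draw_nb(rng, size, prob)

def simulate_correlated_set(rng, n_cells, set_size=30, noise_prob=0.8):
    """One independent gene plus (set_size - 1) genes correlated with it.
    Each correlated gene is the base gene plus independent NB noise, so a
    smaller noise_prob (more noise) gives a weaker correlation."""
    base = [draw_zinb(rng, 2, 0.3, 0.2) for _ in range(n_cells)]
    genes = [base]
    for _ in range(set_size - 1):
        genes.append([count + draw_nb(rng, 1, noise_prob) for count in base])
    return genes

# Example: one correlated set of 30 genes measured in 100 cells.
example_set = simulate_correlated_set(random.Random(1), n_cells=100)
```

Because every correlated gene shares the base gene's counts, all pairs in the set are positively correlated, and the noise magnitude controls how strong that correlation is.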

We compared TWO-SIGMA-G with six popular methods for gene set testing (Supplementary Section S2.1). First, TWO-SIGMA-G was compared with three methods for gene set testing which utilize the full expression data: GSEA [16], CAMERA [18] and MAST [20]. These simulations were calibrated to produce gene-level statistics which summarize evidence from both the mean and zero-inflation components after adjustment for confounding covariates. Second, TWO-SIGMA-G was compared with three other highly relevant methods which rely on gene-level summary statistics: iDEA [24], PAGE [11] and fGSEA [23]. Simulations were conducted separately because the marginal gene-level effect sizes produced from informative scenarios evaluating the first three methods were modest, which led to difficulties obtaining reliable P-values from the second set of methods. Thus, to provide a fair and interesting comparison to the summary-statistic-based methods, we simulated correlated genes using the same framework, but increasing the signal in the mean component (Supplementary section S2.3). R version 3.6.3 was used to conduct simulations. The likelihood ratio test was used to calculate gene-level statistics in TWO-SIGMA-G, and default options were used for all methods unless otherwise specified. For fair comparison, log fold-change values from TWO-SIGMA were used as input for iDEA, PAGE and fGSEA.

We primarily use boxplots to summarize simulation results. Each boxplot aggregates six different settings (Supplementary Table S2) which vary both the magnitude of the average IGC (where applicable) in the test set and the nature of the correlation structure via the introduction of other individual-level covariates (Figures 3, 4 and 5 and Supplementary Figures S3 to S11).

Figure 3. Type-I error performance for CAMERA, GSEA, MAST and TWO-SIGMA-G using a reference set size of 30 genes. GSEA, CAMERA and MAST were chosen for comparison here because all utilize the raw expression data, and the latter two use a regression modeling framework that can explicitly analyze complex experimental designs similar to the data analyzed in the real data analysis section. See Figure 4 for additional comparisons to summary-statistic-based gene set testing methods fGSEA, iDEA and PAGE. Each subfigure varies the existence of IGC between genes in the test set and the presence of gene-level random effect terms in the gene-level model (CAMERA and GSEA never include gene-level random effect terms). Within each subfigure, both unadjusted and adjusted set-level P-values are plotted, where available. See the Methods section of the main text and Supplementary Section S2 for more details regarding the simulation procedure. Significance was determined using a P-value threshold of 0.05, and 30 genes were included in the test and reference sets.

Figure 4. Type-I error **(A)** and power **(B)** performance of fGSEA, iDEA, PAGE and TWO-SIGMA-G for various genes simulated with IGC using a reference set size of 30 genes. fGSEA, iDEA and PAGE are compared together because all use gene-level summary statistics instead of raw expression data. Scenarios along the x-axis vary the percentage of genes that are differentially expressed (with the same effect size) in the test and reference sets. Because of misleading model performance in cases with ‘R0’, these scenarios were excluded from the summary-statistic-based simulation studies. Significance was determined using a P-value threshold of 0.05, and 30 genes were included in the test and reference sets. See the Methods section of the main text and Supplementary Section S2 for more details regarding the simulation procedure.

Figure 5. Set-level power of CAMERA, GSEA, MAST and TWO-SIGMA-G using a reference set size of 30 genes. GSEA, CAMERA and MAST were chosen for comparison here because all utilize the raw expression data, and the latter two use a regression modeling framework that can explicitly analyze complex experimental designs similar to the data analyzed in the real data analysis section. See Figure 4 for additional comparisons to summary-statistic-based gene set testing methods fGSEA, iDEA and PAGE. Each subfigure varies the existence of IGC between genes in the test set and the presence of gene-level random effect terms in the gene-level model (CAMERA and GSEA never include gene-level random effect terms). Scenarios along the x-axis of each subfigure vary the percentage of differentially expressed genes (with the same effect size) in the test and reference sets. For example, ‘T80,R50’ corresponds to the configuration under the alternative hypothesis in which 80% of test set genes are DE and 50% of reference set genes are DE. Significance is determined using a P-value threshold of 0.05, and 30 genes were included in the test and reference sets (Supplementary Figure S7 presents results using a reference set of 100 genes). Note that GSEA did not output P-values for the ‘R0’ scenarios. See the Methods section of the main text and Supplementary Section S2 for more details regarding the simulation procedure.

Two scRNA-seq data sets

We performed gene set testing on two different datasets, one experimental and one observational, to illustrate the usefulness of TWO-SIGMA-G. Gene sets were taken from the Molecular Signatures Database (MSigDB) [16, 17] version 7, c2 collection, accessed via the msigdf R package (https://github.com/ToledoEM/msigdf). All gene sets with at least two genes present after filtering were analyzed in each of the two datasets described below. We used Fisher’s method to combine P-values over all cell types and create a consensus ranking of pathways (see, for example, Figure 6 (A)). Both datasets were from 10X UMI-based single-cell RNA-seq platforms. As with other UMI-based scRNA-seq count data, we found that these data were not consistent with zero-inflation [31], and thus we fit the TWO-SIGMA model without the zero-inflation component at the gene-level.
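For readers unfamiliar with Fisher’s method, the following minimal sketch shows how cell-type-specific P-values for one pathway could be combined. The function name `fisher_combine` is hypothetical; the calculation is the standard one, which treats the P-values as independent and requires them to be strictly positive.

```python
import math

def fisher_combine(p_values):
    """Combine k P-values with Fisher's method.

    The statistic X = -2 * sum(log p) is chi-square with 2k degrees of
    freedom under the null. For even degrees of freedom the upper tail has
    the closed form exp(-X/2) * sum_{i=0}^{k-1} (X/2)^i / i!.
    """
    k = len(p_values)
    half_stat = -sum(math.log(p) for p in p_values)  # equals X / 2
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return math.exp(-half_stat) * total

# Example: one pathway tested in three cell types (made-up P-values).
combined = fisher_combine([0.01, 0.20, 0.03])
```

Sorting pathways by the combined value gives a single ranking across cell types, which is the role the consensus ranking plays in the text.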

HIV Dataset

Our first dataset consists of 11 630 single cells collected from humanized mice [32]. We kept only genes with a proportion of zeros no higher than the mean percentage, leaving the most relevant and highly expressed 3549 genes. A total of nine cell types are present in the data: 4249 natural killer (NK) cells, 2085 erythroid cells, 1421 innate lymphoid cells (ILCs), 1205 B cells, 1088 myeloid dendritic cells (mDC), 821 plasmacytoid dendritic cells (pDC), 555 progenitor cells, 126 macrophage cells and 80 mast cells. The raw read counts are then treated as the outcome of interest. Because the primary interest is in comparisons between HIV and mock cells within cell-type, we categorize cells into one of 2 × 9 = 18 mutually exclusive groups. An analysis of covariance (ANCOVA) model additionally adjusting for CDR, a known surrogate for batch effects [20], was fit as a way to test for cell-type-specific differences in expression levels comparing HIV to mock. TWO-SIGMA-G is ideal for this analysis because gene-level statistics can come from a test of such an arbitrary contrast matrix. These gene-level statistics are, for each cell-type, Wald Z-statistics contrasting the mean values in observed expression between the two groups within a cell-type. After filtering to keep sets with at least two genes present in our data, a total of 4772 sets from the MSigDB c2 collection were analyzed.
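A Wald Z-statistic for a contrast of regression coefficients can be computed as in the sketch below. It assumes a cell-means style parameterization in which each coefficient is a group mean on the model scale, so that an HIV versus mock comparison within one cell type is a contrast with +1 and -1 entries. The function name `wald_z` and the toy numbers are hypothetical, and this is not the contrast interface of the twosigma package.

```python
import math

def wald_z(beta, cov, contrast):
    """Wald Z-statistic for the linear contrast c'beta.

    beta:     estimated coefficients (list of floats)
    cov:      estimated covariance matrix of beta (list of lists)
    contrast: contrast weights c, same length as beta
    """
    estimate = sum(c * b for c, b in zip(contrast, beta))
    size = len(contrast)
    variance = sum(
        contrast[i] * cov[i][j] * contrast[j]
        for i in range(size)
        for j in range(size)
    )
    return estimate / math.sqrt(variance)

# Toy example with two groups from one cell type: [mock, HIV].
z_stat = wald_z(
    beta=[1.0, 3.0],
    cov=[[0.5, 0.0], [0.0, 0.5]],
    contrast=[-1.0, 1.0],
)
```

With 18 groups, the contrast for a given cell type would have non-zero weights only at that cell type's HIV and mock positions, giving one Z-statistic per gene and cell type.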

Alzheimer’s Dataset

Our second dataset consists of 70 634 single cells from human donors [33]. A total of 48 individual donors are present, categorized into three pathology groups: 24 individuals are control patients free of a diagnosis of AD, 12 were diagnosed with early-stage AD and 12 were diagnosed with late-stage AD. Cells from the six most common cell-types were analyzed: 34 976 excitatory neurons (Ex), 18 235 oligodendrocytes (Oli), 9196 inhibitory neurons (In), 3392 astrocytes (Ast), 2627 oligodendrocyte progenitor cells (Opc) and 1920 microglia cells. We did not remove cells beyond what was done in the original manuscript because extensive quality control was performed on the dataset we used. We chose to filter the original 17 926 genes to the 6048 most highly expressed genes by removing genes unexpressed in at least 90% of cells. The raw read counts are once again treated as the outcome of interest in a model without a zero-inflation component. The existence of the pathology groups allows us to explore cell-type-specific variability in gene expression as AD progresses into early and late stages of disease severity. Our gene set analysis was conducted similarly to above: a one-component ANCOVA model was fit including cell-type and AD status jointly, with age at death, sex and the CDR used as additional covariates. In total, 5074 sets with at least two genes present from the MSigDB c2 collection were analyzed.

Results

Overview of TWO-SIGMA-G and Summary of Simulation Design

TWO-SIGMA-G is a gene set testing method based on DE specifically designed for scRNA-seq data. Because proper gene-level DE testing is key for a gene set test, TWO-SIGMA-G employs a mixed-effects zero-inflated negative binomial regression model we previously developed [10] to test for DE at the gene-level. This model can simultaneously capture the distribution of scRNA-seq datasets, account for within-sample correlation via random effect terms (included for either all genes or for no genes), analyze complex experiments by controlling for additional covariates and provide options for gene-level test statistics. The ranks of genes based on the gene-level DE test statistics are then used in the Wilcoxon-based gene set testing procedure which adjusts for gene–gene correlation in the test set to control type-I error. As we will show below, TWO-SIGMA-G can improve power over other existing gene set testing methods that use either the gene expression matrix or gene-level summary statistics as input.

We also performed extensive simulations to compare TWO-SIGMA-G with six other methods designed for gene set testing, the full details of which are given in Supplementary Section S2.1. Our real-data inspired simulation procedure is designed to reflect the zero-inflation, overdispersion, cell–cell correlation and gene–gene correlation seen in counts using parameters inspired by real data and using six different settings (Supplementary Table S2). We simulated genes using a zero-inflated negative binomial distribution, optionally including random effect terms to simulate the presence of within-sample correlation. We then simulated correlated gene sets of size 30 using a custom approach detailed in Supplementary Section S2.3. The procedure is designed to study the impact of the following on set-level type-I error and power: the magnitude and complexity of within-sample (cell–cell) correlation, the magnitude and complexity of gene–gene correlation, the amount and strength of gene-level DE, the presence of additional confounding covariates in the gene-level model and the size of the reference set of genes.

TWO-SIGMA-G preserves type-I error in the presence of IGC

In simulations, we compared the performance of TWO-SIGMA-G with six other gene set tests under the null hypothesis that there are no significant gene sets among sample groups. These methods are either competitive tests or have some flavor of competitive tests (GSEA and fGSEA).

First, we compared TWO-SIGMA-G with three methods (GSEA, CAMERA and MAST) which use the full scRNA-seq data matrix as input and have been widely used with scRNA-seq data. In the comparisons, we particularly investigated how IGC, gene-level random effect (RE) terms and both IGC and RE together in simulated data affect the false positive rate (type-I error) in these methods. All four methods provide strong type-I error control when no genes are DE and genes are simulated independently without RE terms (Figure 3 (A), also see Supplementary Figures S1 and S2 for results using smaller significance thresholds after IGC adjustment). In contrast, type-I error is consistently inflated when IGC is present and ignored (Figure 3 (B)–(D), unadjusted). This is particularly true for GSEA, which does not adjust for IGC or account for other confounding covariates. While CAMERA and MAST still have some level of type-I error inflation after IGC adjustment (from the expected 0.05 to a maximum of 0.1 or 0.15), TWO-SIGMA-G perfectly preserves type-I error at the 5% level in the presence of IGC (Figure 3 (B)–(D), adjusted). The procedure used in TWO-SIGMA-G to estimate and adjust for IGC is well-calibrated and produces valid set-level inference.

We also considered scenarios in which RE terms exist in the simulated data. Figure 3 (C) and (D) show that the type-I error from TWO-SIGMA-G is preserved or approximately preserved when gene-level random effect terms are truly present and either correctly included (present) or incorrectly excluded (incorrectly absent) from the fitted gene-level model. For CAMERA, GSEA and MAST, however, type-I error tends to be inflated on average and the variance in the type-I error across the six settings tends to increase in the presence of gene-level RE terms. For all methods, however, this type-I error inflation is much lower in magnitude than can exist at the gene-level [10]. This highlights an advantage of competitive gene set testing: because it makes a relative comparison to a reference set of genes, it is partially robust to the consequences of a systematic, gene-level misspecification. The real data analysis section further shows substantial agreement in set-level inference from TWO-SIGMA-G regardless of random effect inclusion in the gene-level model.

We additionally found that the null distributions of all three methods are nearly identical with a larger reference set size (Supplementary Figure S3). However, in the interest of being conservative, we will evaluate performance using results in which the test and reference sets are of equal size.

We also evaluated type-I error at various set-level null hypotheses, in which an equal but non-zero percentage of genes in the test and reference sets are DE (with the same gene-level effect size, see Supplementary Figure S4). For example, the scenario in which 20% of genes are DE in both the test and reference sets is one set-level competitive null hypothesis. Generally, all methods except GSEA become more conservative as the proportion of DE genes increases.

Second, we compared TWO-SIGMA-G with three other methods that use gene-level summary statistics: iDEA, fGSEA and PAGE. To ensure a fair comparison among these methods and TWO-SIGMA-G, we used simulations with larger gene-level DE effect sizes to obtain gene-level summary statistics. iDEA, fGSEA and PAGE produced reliable P-values only when scenarios involving a completely null reference set (i.e. ‘R0’) were excluded. Therefore, we are only able to evaluate type-I error at other set-level null hypotheses, in which an equal but non-zero percentage of genes in the test and reference sets are DE. For example, the scenario in which 20% of genes are DE in both the test and reference sets is one set-level competitive null hypothesis. Figure 4 (A) shows that all four methods approximately control type-I error when evaluated at two set-level null scenarios (‘T20,R20’ and ‘T50,R50’). See Supplementary Figures S5 and S6 for comparisons with larger reference set sizes and in the presence of gene-level random effect terms.

TWO-SIGMA-G improves power over alternative approaches

Figure 5 shows the power of CAMERA, GSEA, MAST and TWO-SIGMA-G using simulated data, and demonstrates that TWO-SIGMA-G is consistently the most powerful method. Different configurations are presented, involving differing proportions of DE genes (with the same effect size) in the test and reference sets. For example, ‘T100,R50’ corresponds to the configuration in which 100% of genes in the test set are DE and 50% of genes in the reference set are DE. Scenarios that include DE and non-DE genes in both the test and reference sets are the most informative to study because it is unlikely in real data to have a completely null reference set and/or a completely alternative test set. Results from Figure 5 suggest that power depends primarily on the difference in the proportion of DE genes between the test and reference sets and less on the precise composition of the two sets. For example, the ‘T80,R50’ and ‘T50,R20’ configurations both have 30% more DE genes in the test set than in the reference set, and have similar power profiles for all methods within all four subfigures of Figure 5. We found that using a reference set size of 100 tends to improve power for all methods and particularly for TWO-SIGMA-G (Supplementary Figure S7). This power increase does not seem to be a consequence of an increase in type-I error (Supplementary Figure S3). This provides some evidence in favor of using a larger reference set in lieu of a balanced reference set.
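The configuration labels can be read mechanically, which makes the point about the test-versus-reference gap easy to check. The short Python sketch below parses a label such as ‘T80,R50’ into its two percentages and reports their difference; a difference of zero corresponds to a set-level competitive null, and a positive difference to an alternative. The helper names `parse_config` and `de_gap` are hypothetical and are not part of the twosigma package.

```python
import re


def parse_config(label):
    """Split a label like 'T80,R50' into (test_pct, reference_pct)."""
    match = re.fullmatch(r"T(\d+),\s*R(\d+)", label.strip())
    if match is None:
        raise ValueError(f"unrecognized configuration label: {label!r}")
    return int(match.group(1)), int(match.group(2))


def de_gap(label):
    """Percentage-point excess of DE genes in the test set over the reference set."""
    test_pct, ref_pct = parse_config(label)
    return test_pct - ref_pct


for label in ["T20,R20", "T50,R50", "T50,R20", "T80,R50", "T100,R50", "T80,R20"]:
    gap = de_gap(label)
    kind = "competitive null" if gap == 0 else "alternative"
    print(f"{label}: gap = {gap} points ({kind})")
```

Under this reading, ‘T80,R50’ and ‘T50,R20’ share the same 30-point gap, which matches the observation that their power profiles are similar.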

We further compared the power of the different testing methods when gene-level random effect terms are truly non-zero and either correctly included (present) or incorrectly excluded (incorrectly absent) from the fitted gene-level model (Figure 5 (C) and (D)). In either case, power is only slightly reduced versus the cases without gene-level RE terms seen in Figure 5 (A) and (B). Thus, if set-level inference is the primary interest, the increased computational cost from gene-level RE terms may not be necessary for valid and powerful inference. However, if gene-level inference is also of interest, any set-level power loss may be acceptable in order to prevent the massive gene-level type-I error inflation that has been shown to occur when RE terms are mistakenly absent [10].

When the magnitude of gene-level DE is varied, such that half of genes have twice the effect size of the other half, we found that set-level power is improved (Supplementary Figure S8). The relative positions of each configuration remain as in Figure 5, however, suggesting that power results in Figure 5 apply to alternative DE breakdowns. For example, whether or not genes in the test set have varying DE magnitudes, the ‘T80,R20’ scenarios have improved power over the ‘T100,R50’ scenarios. The relative rankings of the three compared methods also remain the same when the magnitude of gene-level DE is varied.

As above, we also used simulations with larger gene-level DE effect sizes to compare the power of TWO-SIGMA-G with iDEA, fGSEA and PAGE. Broadly, all of the results discussed above also apply to the comparisons of the summary-statistic-based methods. Figure 4 (B) shows power for scenarios that do not involve a completely null reference set (i.e. without ‘R0’). TWO-SIGMA-G is the most powerful method across a variety of set-level alternative scenarios as compared with summary-statistic-based methods; however, the magnitude of the difference depends on the precise configuration of genes in the test and reference sets. When using a reference set of size 100 (Supplementary Figure S9), power is uniformly improved for all methods, but the relative ranks of the methods remain the same. In scenarios where gene-level RE terms exist (Supplementary Figure S10), all methods tend to have reduced power but remain in a similar relative ranking. One exception is seen in Supplementary Figure S10 (A), which shows that fGSEA is more powerful than TWO-SIGMA-G in the ‘T50,R20’ scenario. Comparing Supplementary Figure S10 (A) and (B) to (C) and (D), respectively, shows that use of a larger reference set tends to increase power for all methods. Finally, we also compared the use of multiple randomly generated reference sets to the use of a single randomly generated reference set, and found that both type-I error and power are nearly identical in both cases (Supplementary Figure S11).

Analysis of HIV data reveals biologically expected findings

We analyzed an experimental dataset of 11 630 single cells collected from four humanized donor mice, two of which were infected with HIV and two of which were given a mock treatment [32] (full details described in the Methods section). A total of 3549 genes and 4772 gene sets from the Molecular Signatures Database (MSigDB) [16, 17] were analyzed. The number of differentially expressed genes and sets in all cell types comparing HIV to a mock treatment is shown in Table 1. An increase in the number of significantly DE genes does not always correspond to more significantly DE gene sets. For example, erythroid cells have the second largest number of DE genes, but rank eighth in terms of the number of DE gene sets. This observation is expected in a competitive gene set test like TWO-SIGMA-G, because competitive tests focus on the relative signal of gene sets as compared with a background reference set of genes. The lack of a direct relationship between the number of DE genes and gene sets is also reflected when analyzing the overlap in significance between genes and gene sets among the four most prevalent cell types as seen in Figure 6 (B) and (C). Instead of being a drawback, this highlights the value of using gene set testing to provide extra set-level information beyond that from individual-gene-level DE tests.

Table 1.

The table shows the number of differentially expressed genes (using TWO-SIGMA) and gene sets (using TWO-SIGMA-G) after FDR-adjustment for the HIV dataset. Gene-level P-values were adjusted using the Benjamini–Hochberg method, and significance was determined by comparing these adjusted P-values with the 0.05 significance threshold. At the set-level, FDR-adjusted P-values were compared with the 0.2 threshold to mimic an exploratory analysis.

Cell Type	Genes Up	Genes Down	Sets Up	Sets Down
NK (N = 4249)	489	531	51	51
Erythroid (N = 2085)	413	523	3	46
ILC (N = 1421)	235	198	57	13
B (N = 1205)	168	273	38	16
mDC (N = 1088)	350	346	95	14
pDC (N = 821)	214	194	63	2
Progenitor (N = 555)	93	127	35	9
Macrophages (N = 126)	41	40	23	4
Mast (N = 80)	16	8	0	0
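The counts in Table 1 come from thresholding FDR-adjusted P-values. As a minimal sketch of the gene-level step, the Python function below implements the standard Benjamini–Hochberg adjustment and counts how many adjusted P-values fall below 0.05. The input P-values are made up for illustration. The same function could be applied at the set level with the 0.2 threshold, under the assumption (not stated explicitly in the caption) that the set-level FDR adjustment is also Benjamini–Hochberg.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Made-up gene-level P-values for illustration only.
gene_pvalues = [0.001, 0.008, 0.039, 0.041, 0.20, 0.74]
gene_adjusted = benjamini_hochberg(gene_pvalues)
n_significant = sum(1 for p in gene_adjusted if p < 0.05)
print([round(p, 4) for p in gene_adjusted])
print(f"genes significant at FDR 0.05: {n_significant}")
```

Splitting the significant genes or sets by the sign of the estimated log fold-change would then give the Up and Down columns.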


Figure 6. Results from analysis of the HIV dataset using TWO-SIGMA-G. **(A)** Cell-type-specific variation in average set-level log fold-change (left) and significance (right). Sets plotted are among the top 10 in significance for at least one cell type. Sets in bold are significant at the 5% level over all cell types after FDR-adjustment of the Fisher’s method P-value, and the rank of the Fisher’s P-value among all sets is provided next to the set name. **(B)** Overlap between FDR-adjusted DE genes (5% significance level) among the four most prevalent cell types. **(C)** Overlap between FDR-adjusted DE gene sets among the four most prevalent cell types.

Within-cell-type (cell-type-specific) TWO-SIGMA-G results comparing HIV to the mock treatment are presented in Figure 6 (A) as cell-type-specific average log fold changes (FC) and corresponding P-values. In most cases, the main cell types share significance in common pathways associated with HIV infection, while in other cases there are HIV-associated pathways that are significant only in a small number of cell types. Among the gene sets that rank in the 10 most significant sets in at least one of the nine cell types (Figure 6 (A)), sets related to virus introduction and interferon release are expected to be consistently upregulated and highly significant at both the set-level (as seen in a representative gene set in Supplementary Figure S12) and the gene-level [34]. The significance of these sets is found both when combining P-values into a consensus FDR-adjusted P-value using Fisher’s method and within cell types other than erythroid cells, albeit with differing strength of significance. Given the known functionality of erythroid cells as oxygen carriers in contrast to the immunological function of the other cell types, this result is expected. It also demonstrates that TWO-SIGMA-G can recover expected biological findings using cell-type-specific analyses and quantify differing strengths of association even among sets that may not exhibit large cell-type-specific heterogeneity. Figure 6 (B)–(C) show the overlap in significant DE genes and gene sets, respectively, after FDR-adjustment for multiple testing among the four most prevalent cell types. For example, Figure 6 (C) shows that there are 48 gene sets that are significant only in NK cells after FDR adjustment. These Venn diagrams show that our analysis reveals a large degree of cell-type-specific heterogeneity at the gene level and the set level. Additional discussion and comparisons between TWO-SIGMA-G, MAST and CAMERA can be seen in Supplementary Figures S13 to S15 in Supplementary Section S4.
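Fisher’s method combines k P-values through the statistic X = −2 Σ log p, which follows a chi-square distribution with 2k degrees of freedom when the P-values are independent and null. The Python sketch below computes the combined P-value for one gene set across cell types, using the closed-form chi-square survival function available for even degrees of freedom so that no external library is needed. The function name and the example P-values are hypothetical, and independence across cell types is an assumption of the method rather than something verified here.

```python
import math


def fisher_combined_pvalue(pvalues):
    """Combine independent P-values with Fisher's method.

    Uses the closed form of the chi-square survival function for
    even degrees of freedom 2k:
        P(X > x) = exp(-x / 2) * sum_{i=0}^{k-1} (x / 2)^i / i!
    """
    k = len(pvalues)
    if k == 0:
        raise ValueError("at least one P-value is required")
    if any(p <= 0.0 or p > 1.0 for p in pvalues):
        raise ValueError("P-values must lie in (0, 1]")
    half_stat = -sum(math.log(p) for p in pvalues)  # equals X / 2
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return math.exp(-half_stat) * total


# Made-up cell-type-specific P-values for a single gene set.
set_pvalues = [0.004, 0.03, 0.20, 0.01, 0.65]
print(f"combined P-value: {fisher_combined_pvalue(set_pvalues):.3g}")
```

The combined P-values for all gene sets would then be FDR-adjusted and ranked, as described for the bolded sets in Figure 6 (A).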

Analysis of Alzheimer’s data reveals cell-type-specific heterogeneity in set-level expression

We chose to use an observational scRNA-seq dataset to demonstrate how TWO-SIGMA-G can handle a more complex application. Specifically, we use the scRNA-seq data of [33] (see the Methods section for more details) to analyze changes in gene expression as Alzheimer’s disease (AD) progresses. The data provide gene expression across three distinct pathology groups: control (AD free), early-stage AD progression and late-stage AD progression. We focus on two relevant comparisons: late- versus early-stage AD, and early-stage AD versus control. A total of 6048 genes and 5074 gene sets from the Molecular Signatures Database (MSigDB) [16, 17] were analyzed.

We used TWO-SIGMA-G to compare early-stage AD patients to control in each of the cell types and provide cell-type-specific results (Figure 7). Previous studies have suggested that disruptions in mitochondrial functioning, particularly in cellular respiration as caused by oxidative damage, are among the earliest events in Alzheimer’s disease [35]. As Figure 7 (A) demonstrates, we replicate this finding with particularly robust downregulation seen in pathways related to cellular respiration, such as ‘KEGG_OXIDATIVE_PHOSPHORYLATION’ and ‘MOOTHA_VOXPHOS’ (see Supplementary Figures S16 and S17 for more detailed gene-level results for these sets). The neuronal cell types demonstrate highly consistent statistical significance in these pathways. The Venn diagrams in Figure 7 (B) and (C) show that, after FDR correction, the majority of the overlap in significant genes and gene sets comes from the two neuronal cell types. Pathway changes in the early stages of Alzheimer’s disease therefore seem to manifest themselves as changes in the functioning of these two neuronal cell types. This result has been demonstrated previously [36]. Interestingly, in the comparison between late- and early-stage AD, this trend is actually reversed (Figure 8 (A), more discussion and figures in the Supplement).

Figure 7. Results from analysis of the Alzheimer’s dataset comparing early-stage AD to control using TWO-SIGMA-G. **(A)** Cell-type-specific variation in average set-level log fold-change (left) and significance (right). Gene sets plotted are among the top 10 in significance for at least one cell type. Sets in bold are significant at the 5% level over all cell types after FDR-adjustment of the Fisher’s method P-value, and the rank of the Fisher’s P-value among all sets is provided next to the set name. **(B)** Overlap between FDR-adjusted DE genes (5% significance level) among the four most prevalent cell types. **(C)** Overlap between FDR-adjusted DE gene sets among the four most prevalent cell types.

Figure 8. Results from analysis of the Alzheimer’s dataset comparing late- to early-stage AD using TWO-SIGMA-G. **(A)** Cell-type-specific variation in average set-level log fold-change (left) and significance (right). Gene sets plotted are among the top 10 in significance for at least one cell type. Sets in bold are significant at the 5% level over all cell types after FDR-adjustment of the Fisher’s method P-value, and the rank of the Fisher’s P-value among all sets is provided next to the set name. **(B)** Overlap between FDR-adjusted DE genes (5% significance level) among the four most prevalent cell types. **(C)** Overlap between FDR-adjusted DE gene sets among the four most prevalent cell types.

The numbers of differentially expressed gene sets and genes for both comparisons in each of the cell types are summarized in Table 2. As mentioned above, most of the significant gene sets are downregulated when comparing early-stage AD to control, but upregulated when comparing late- to early-stage AD. The number of significant genes and gene sets increases dramatically in the late- versus early-stage AD comparison over the early-stage AD to control comparison. The fact that so many genes in the top three rows of Table 2 are significantly DE reveals one reason that competitive set-level inference is useful in some scRNA-seq analyses. The large sample sizes of cells mean that very modest differences in gene-level expression can be statistically significant, and thus that gene-level P-values must be interpreted cautiously. Competitive set-level analyses can contextualize gene-level differences and rank pathways to help highlight meaningful changes in important biological processes.

Table 2.

The table shows the number of differentially expressed genes (using TWO-SIGMA) and gene sets (using TWO-SIGMA-G) after FDR-adjustment for both comparisons of the Alzheimer’s dataset. Gene-level P-values were adjusted using the Benjamini–Hochberg method, and significance was determined by comparing these adjusted P-values to the 0.05 significance threshold. At the set-level, FDR-adjusted P-values were compared with the 0.2 threshold to mimic an exploratory analysis.

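Cell Type	Early-stage AD versus Control: Genes Up	Early-stage AD versus Control: Genes Down	Early-stage AD versus Control: Sets Up	Early-stage AD versus Control: Sets Down	Late- versus early-stage AD: Genes Up	Late- versus early-stage AD: Genes Down	Late- versus early-stage AD: Sets Up	Late- versus early-stage AD: Sets Down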
Excitatory neuron (N = 29018, 17878)	1055	1893	18	223	1619	2645	467	17
Oligodendrocyte (N = 14806, 9035)	339	619	0	0	1443	74	15	0
Inhibitory neuron (N = 7621, 4371)	58	1781	0	18	1483	761	208	12
Astrocyte (N = 2840, 1830)	61	311	0	0	265	44	0	0
Oligodendrocyte progenitor (N = 2207, 1290)	13	64	0	0	256	3	0	0
Microglia (N = 1491, 955)	27	20	0	0	14	0	0	0


Our analysis also reveals other cell-type-specific heterogeneity. For example, microglia cells tend to have a unique set-level effect size profile, as demonstrated by the hierarchical clustering in the left heatmap of Figure 7. This uniqueness also extends to significance. In comparing early-stage patients with control, microglia cells exhibit stronger significance in pathways involved in immune response, such as ‘RADAEVA_RESPONSE_TO_IFNA1_DN’ or ‘BROWNE_INTERFERON_RESPONSIVE_GENES’, while showing less or no significance in the previously mentioned pathways related to cellular respiration. Given the role of microglia cells in immune response [37], these results are not surprising. More generally, however, TWO-SIGMA-G can help researchers investigate cell-type-specific heterogeneity when cellular functions are not well understood. The ability to test complex gene-level hypotheses as contrasts of regression parameters increases the diversity of cell-type-specific hypotheses that can be explored. Additional discussion, including comparisons between TWO-SIGMA-G, MAST and CAMERA, can be seen in Supplementary Figures S18 to S31 within Supplementary Sections S5 to S9.

Discussion

We propose TWO-SIGMA-G, a novel method designed for competitive gene set testing using scRNA-seq data. At the gene-level, we employ our previously developed TWO-SIGMA method to test for DE. TWO-SIGMA is a flexible regression modeling framework that (1) allows for overdispersed counts, (2) can include additional covariates, allowing the analysis of complex experimental designs, (3) can include sample-specific random effect terms and (4) optionally uses a zero-inflation component. Because the zero-inflation component is optional, users can make modeling decisions based on their own views regarding the cause and interpretation of excess zeros in scRNA-seq data. For example, our real data analyses exclude the zero-inflation component based on work suggesting this is reasonable when analyzing UMI-based data [31]. At the set-level, we adjust for IGC, which has been demonstrated to inflate type-I error if mistakenly ignored. Using gene-level residuals to estimate IGC, we produce set-level P-values that preserve type-I error and improve power over alternative approaches.

The ability of TWO-SIGMA-G to include random effect terms at the gene-level distinguishes it from many methods for gene set analysis. Such RE terms can improve inference at the gene-level substantially for some genes [10]. However, if only set-level inference is of interest, our simulations suggest that statistical inference remains valid when gene-level RE terms are excluded, which reduces the computational burden. When gene-level inference is also of interest, it is likely desirable to include random effect terms in the regression modeling framework. Nonetheless, our results suggest that set-level inference in real data analyses is likely not influenced greatly by the presence or absence of gene-level RE terms (Supplementary Figure S32).

TWO-SIGMA-G is implemented in the twosigma R package (https://github.com/edvanburen/twosigma), which is computationally efficient and allows for parallelization. To benchmark computational performance, we ran a modified version of our HIV data analysis, testing for a treatment effect of HIV pooled over all cell types. This modification to a one degree of freedom hypothesis allows us to test identical hypotheses in TWO-SIGMA-G, MAST and CAMERA to provide a fairer comparison of computation. Using three computing cores on a MacBook Pro laptop, the methods had the following respective runtimes: 33.2 min for TWO-SIGMA-G, 33.5 min for MAST (25 bootstrap replications) and 5 s for CAMERA. TWO-SIGMA-G therefore has a nearly identical, slightly shorter, runtime compared with MAST, while also offering the other advantages for gene set testing in scRNA-seq data described throughout this paper. We also found that the computation time of TWO-SIGMA-G increases approximately linearly in the number of genes and number of cells, and remains steady for a fixed number of cells coming from varying numbers of individuals until the number of individuals approaches 100 000 (Supplementary Figure S33).

Our competitive testing method is not without limitations, however. First, as with other approaches for gene set testing using scRNA-seq data, power may be low for (1) sparse datasets, (2) gene sets that consist of very lowly expressed genes and (3) gene sets that are most active in rare cell types. Second, unlike in bulk RNA-seq data, many genes are often uncaptured or fail to survive filtering in scRNA-seq data. Although this limitation is general to all gene set tests using scRNA-seq data, we must nonetheless assume that a gene set can be represented by the genes that exist in the dataset. In both the HIV and Alzheimer’s datasets, we typically have around 40% representation regardless of set size after gene filtering (Supplementary Figure S34). Given the biologically meaningful and interpretable results we presented, we feel that the absence of these genes does not threaten the ability of scRNA-seq gene set analyses to contribute new biological insights. Third, the computation time of TWO-SIGMA-G can become quite large for genome-wide analyses of the hundreds of thousands of cells that are becoming more and more common in scRNA-seq datasets. Because the majority of computation time comes from gene-level model fitting, and gene-level models are fit independently of one another, the parallelization supported in our twosigma R package can largely mitigate these computational requirements.

Key Points

Gene set testing is recognized as a key step in differential expression analyses of both bulk and single-cell RNA-sequencing data. Although there are at least two classes of gene set tests, competitive gene set tests have been more popular in the past decade because they allow for ranking of biological pathways in a statistically rigorous manner.
Inter-gene correlation between genes in the same gene set is very common, and it is essential to account for this correlation to ensure control of the false positive rate.
scRNA-seq data can exhibit overdispersed and zero-inflated counts and within-sample correlation, particularly in the large-sample observational datasets being rapidly generated. One ideal way to account for these features is with a mixed-effects regression modeling framework.
To address these issues, we developed a new competitive gene set testing framework designed for scRNA-seq data, TWO-SIGMA-G. In comparison to methods for both bulk and scRNA-seq gene set analysis, including GSEA, MAST, CAMERA, iDEA, fGSEA and PAGE, we found superior performance in both simulations and in two real datasets, one experimental and one observational.


Acknowledgements

None at this time.

Funding

This work was supported by the National Institutes of Health (R01GM105785 and U54HD079124 to Y.L., R01HL129132 to Y.L. and E.V.B., UM1HG011585 to M.H., R03DE028983 to D.W.) and the University of North Carolina Computational Medicine Program Award 2020 (to L.S. and D.W.).

Author Biographies

Eric Van Buren is a Postdoctoral Research Fellow in the Department of Biostatistics at Harvard University.

Ming Hu is an Assistant Staff in the Department of Quantitative Health Sciences, Lerner Research Institute at the Cleveland Clinic Foundation.

Liang Cheng is a Professor in the Medical Research Institute at Wuhan University.

John Wrobel is a Research Associate in the Department of Biochemistry and Biophysics at the University of North Carolina at Chapel Hill.

Kirk Wilhelmsen is a Professor of Neurology at the West Virginia University Rockefeller Neuroscience Institute, and an Adjunct Professor of Genetics at the University of North Carolina at Chapel Hill.

Lishan Su is a Professor of Pharmacology, Microbiology, and Immunology at the University of Maryland School of Medicine.

Yun Li is a Professor of Genetics, Biostatistics, and Computer Science at the University of North Carolina at Chapel Hill.

Di Wu is an Associate Professor in the Department of Biostatistics and Oral Craniofacial Health Science at the University of North Carolina at Chapel Hill.

Data and method availability

Both datasets analyzed in this manuscript are publicly available. The HIV dataset is available at the Gene Expression Omnibus under accession GSE148796. The Alzheimer’s dataset is available upon completion of a data usage agreement at the Rush Alzheimer’s Disease Center (RADC) Research Resource Sharing Hub (https://www.radc.rush.edu/docs/omics.htm) under ‘snRNA-seq PFC.’ TWO-SIGMA-G is implemented in the function twosigmag in the twosigma R package, which is freely available on GitHub at https://github.com/edvanburen/twosigma and on CRAN at https://cran.r-project.org/web/packages/twosigma/. The code used for the simulation studies presented is available on GitHub at https://github.com/edvanburen/TWO-SIGMA-G-Paper-Code.

References

1. Barry WT, Nobel AB, Wright FA. Significance analysis of functional categories in gene expression studies: a structured permutation approach. Bioinformatics 2005;21(9):1943–9. [DOI] [PubMed] [Google Scholar]
2. Hombrink P, Helbig C, Backer RA, et al. Programs for the persistence, vigilance and control of human cd8+ lung-resident memory t cells. Nat Immunol 2016;17(12):1467–78. [DOI] [PubMed] [Google Scholar]
3. Lim E, Vaillant F, Di W, et al. Aberrant luminal progenitors as the candidate target population for basal tumor development in brca1 mutation carriers. Nat Med 2009;15(8):907–13. [DOI] [PubMed] [Google Scholar]
4. Pinto D, Pagnamenta AT, Klei L, et al. Functional impact of global rare copy number variation in autism spectrum disorders. Nature 2010;466(7304):368–72. [DOI] [PMC free article] [PubMed] [Google Scholar]
5. Gupta PK, Godec J, Wolski D, et al. Cd39 expression identifies terminally exhausted cd8+ t cells. PLoS Pathog 2015;11(10):1–21. [DOI] [PMC free article] [PubMed] [Google Scholar]
6. Efron B, Tibshirani R. On testing the significance of sets of genes. Ann Appl Stat 2007;1(1):107–29. [Google Scholar]
7. Gaynor SM, Sun R, Lin X, et al. Identification of differentially expressed gene sets using the generalized Berk Jones statistic. Bioinformatics 2019;35(22):4568–76. [DOI] [PMC free article] [PubMed] [Google Scholar]
8. Reimand J, Isserlin R, Voisin V, et al. Pathway enrichment analysis and visualization of omics data using g:profiler, gsea, cytoscape and enrichmentmap. Nat Protoc 2019;14(2):482–517. [DOI] [PMC free article] [PubMed] [Google Scholar]
9. Dulken BW, Leeman DS, Boutet SC, et al. Single-cell transcriptomic analysis defines heterogeneity and transcriptional dynamics in the adult neural stem cell lineage. Cell Rep 2017;18(3):777–90. [DOI] [PMC free article] [PubMed] [Google Scholar]
10. Van Buren E, Ming H, Weng C, et al. Two-sigma: a novel two-component single cell model-based association method for single-cell rna-seq data. Genet Epidemiol 2020;45(2):142–53. [DOI] [PMC free article] [PubMed] [Google Scholar]
11. Kim S-Y, Volsky DJ. Page: parametric analysis of gene set enrichment. BMC Bioinformatics 2005;6(1):144. [DOI] [PMC free article] [PubMed] [Google Scholar]
12. Lu T, Greenberg SA, Kong SW, et al. Discovering statistically significant pathways in expression profiling studies. Proc Natl Acad Sci 2005;102(38):13544–9. [DOI] [PMC free article] [PubMed] [Google Scholar]
13. Oron AP, Jiang Z, Gentleman R. Gene set enrichment analysis using linear models and diagnostics. Bioinformatics 2008;24(22):2586–91. [DOI] [PMC free article] [PubMed] [Google Scholar]
14. Goeman JJ, Buhlmann P. Analyzing gene expression data in terms of gene sets: methodological issues. Bioinformatics 2007;23(8):980–7. [DOI] [PubMed] [Google Scholar]
15. Di W, Lim E, Vaillant F, et al. ROAST: rotation gene set tests for complex microarray experiments. Bioinformatics 2010;26(17):2176–82. [DOI] [PMC free article] [PubMed] [Google Scholar]
16. Subramanian A, Tamayo P, Mootha VK, et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci 2005;102(43):15545–50. [DOI] [PMC free article] [PubMed] [Google Scholar]
17. Liberzon A, Birger C, Thorvaldsdottir H, et al. The molecular signatures database hallmark gene set collection. Cell Systems 2015;1(6):417–25. [DOI] [PMC free article] [PubMed] [Google Scholar]
18. Di W, Smyth GK. Camera: a competitive gene set test accounting for inter-gene correlation. Nucleic Acids Res 2012;40(17):e133–3. [DOI] [PMC free article] [PubMed] [Google Scholar]
19. Gatti DM, Barry WT, Nobel AB, et al. Heading down the wrong pathway: on the influence of correlation within gene sets. BMC Genomics 2010;11(1):574. [DOI] [PMC free article] [PubMed] [Google Scholar]
20. Finak G, McDavid A, Yajima M, et al. Mast: a flexible statistical framework for assessing transcriptional changes and characterizing heterogeneity in single-cell rna sequencing data. Genome Biol 2015;16(1):278. [DOI] [PMC free article] [PubMed] [Google Scholar]
21. Barry WT, Nobel AB, Wright FA. A statistical framework for testing functional categories in microarray data. Ann Appl Stat 2008;2(1):286–315. [Google Scholar]
22. Mootha VK, Lindgren CM, Eriksson K-F, et al. Pgc-1a responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nat Genet 2003;34(3):267–73. [DOI] [PubMed] [Google Scholar]
23. Korotkevich G, Sukhov V, Budin N, et al. Fast gene set enrichment analysis. In: bioRxiv, 2021.
24. Ma Y, Sun S, Shang X, et al. Integrative differential expression and gene set enrichment analysis using summary statistics for scrna-seq studies. Nat Commun 2020;11(1):1585. [DOI] [PMC free article] [PubMed] [Google Scholar]
25. Berge Van den K, Soneson C, Love ML, et al. zinger: unlocking rna-seq tools for zero-inflation and single cell applications. In: bioRxiv, 2017. [DOI] [PMC free article] [PubMed]
26. Hukku A, Quick C, Luca F, et al. BAGSE: a Bayesian hierarchical model approach for gene set enrichment analysis. Bioinformatics 2019;36(6):1689–95.
27. Chawla S, Samydurai S, Kong SL, et al. UniPath: a uniform approach for pathway and gene-set based analysis of heterogeneity in single-cell epigenome and transcriptome profiles. Nucleic Acids Res 2020;49(3):e13.
28. Fan J, Salathia N, Liu R, et al. Characterizing transcriptional heterogeneity through pathway and gene set overdispersion analysis. Nat Methods 2016;13(3):241–4.
29. Aibar S, González-Blas CB, Moerman T, et al. SCENIC: single-cell regulatory network inference and clustering. Nat Methods 2017;14(11):1083–6.
30. Damian D, Gorfine M. Statistical concerns about the GSEA procedure. Nat Genet 2004;36(7):663.
31. Svensson V. Droplet scRNA-seq is not zero-inflated. Nat Biotechnol 2020;38(2):147–50.
32. Cheng L, Yu H, Wrobel JA, et al. Identification of pathogenic TRAIL-expressing innate immune cells during HIV-1 infection in humanized mice by scRNA-seq. JCI Insight 2020;5(11):6.
33. Mathys H, Davila-Velderrain J, Peng Z, et al. Single-cell transcriptomic analysis of Alzheimer’s disease. Nature 2019;570(7761):332–7.
34. Soper A, Kimura I, Nagaoka S, et al. Type I interferon responses by HIV-1 infection: association with disease progression and control. Front Immunol 2018;8:1823.
35. Nunomura A, Perry G, Aliev G, et al. Oxidative damage is the earliest event in Alzheimer disease. J Neuropathol Exp Neurol 2001;60(8):759–67.
36. Varela EV, Etter G, Williams S. Excitatory-inhibitory imbalance in Alzheimer’s disease and therapeutic significance. Neurobiol Dis 2019;127:605–15.
37. Yang I, Han SJ, Kaur G, et al. The role of microglia in central nervous system immunity and glioma immunology. J Clin Neurosci 2010;17(1):6–10.

Supplementary Materials

supp_bbac084 (PDF, 1.8 MB)

TSG_BiB_Responses_Second_Revision_bbac084 (PDF, 98.8 KB)

