Abstract
In this paper, we develop TWO-SIGMA, a TWO-component SInGle cell Model-based Association method for differential expression (DE) analyses in single-cell RNA-seq (scRNA-seq) data. The first component models the probability of “drop-out” with a mixed-effects logistic regression model and the second component models the (conditional) mean expression with a mixed-effects negative binomial regression model. TWO-SIGMA is extremely flexible in that it: (i) does not require a log-transformation of the outcome, (ii) allows for overdispersed and zero-inflated counts, (iii) accommodates a correlation structure between cells from the same individual via random effect terms, (iv) can analyze unbalanced designs (in which the number of cells does not need to be identical for all samples), (v) can control for additional sample-level and cell-level covariates including batch effects, (vi) provides interpretable effect size estimates, and (vii) enables general tests of DE beyond two-group comparisons. To our knowledge, TWO-SIGMA is the only method for analyzing scRNA-seq data that can simultaneously accomplish each of these features. Simulations studies show that TWO-SIGMA outperforms alternative regression-based approaches in both type-I error control and power enhancement when the data contains even moderate within-sample correlation. A real data analysis using pancreas islet single-cells exhibits the flexibility of TWO-SIGMA and demonstrates that incorrectly failing to include random effect terms can have dramatic impacts on scientific conclusions.
Keywords: Single-cell RNA sequencing, zero-inflated model, differential expression, random effects model
1. Introduction
Advancements in single-cell sequencing technologies have created many exciting opportunities to researchers yet have also posed many challenges relating to data analysis. Expression profiles can now be analyzed at the single-cell level, providing new insights into the cellular heterogeneity of gene expression. Three characteristic features of single-cell transcriptome sequencing data include excess zero counts, overdispersion of observed counts, and a large number of cells that are sequenced from a relatively small number of samples [29]. Technological limitations including low capture rate and amplification failure lead to “drop-out” events, in which the data may capture only a fraction of the transcriptome of a given cell and mistakenly generate zero measurements for expressed genes. The presence of these zeros creates a dataset with an excess of zeros (often called “zero-inflation”) beyond those that occur due to biological factors; these excess zeros often necessitate special modelling approaches such as a two-component model [5, 9]. Overdispersion, in which the variance in expression exceeds the mean expression, is commonly observed in count-based quantitative sequencing due to large variance in expression and within gene variability over time or across samples. A within-sample correlation is also present because multiple cells are sequenced from the same biological sample. These challenges motivated us to develop the new statistical method TWO-SIGMA. It is designed for association analyses where the primary interest is in performing statistical inference on covariate(s) of interest, such as a treatment effect. TWO-SIGMA is not limited to a two-group comparison and can test for more general kinds of differential expression (DE) while simultaneously controlling for multiple sample-level and cell-level covariates and accounting for within-sample correlation. As we will discuss, accounting for this within-sample dependency is a necessity to preserve type-I error when cells from the sample are even moderately correlated.
Most existing methods for DE analysis in scRNA-seq data are designed for a two-group comparison. As a result, benchmarking papers typically limit themselves to two-group comparisons [28, 23]. Three of the most popular methods for two-group comparisons are SCDE, scDD, and DESingle. SCDE and scDD are both innovative Bayesian methods, with the former utilizing a two-component negative binomial mixture method and the latter using a Dirichlet mixture process [10, 11]. Although both methods show strong performance, only the latter can adjust for confounding covariates, and this adjustment is indirect through a residualized analysis. DESingle employs a zero-inflated negative binomial (ZINB) distribution to analyze DE in scRNA-seq data while accounting for excess zeros and overdispersion [18]. Like SCDE and scDD, however, DESingle does not employ a regression modeling framework to control for other covariates or account for within-sample correlation.
MAST was introduced as a hurdle model for the analysis of scRNA-seq data [5], and is considered to be one of the preferred methods for performing DE analysis in scRNA-seq data [15]. Like TWO-SIGMA, but unlike the methods described above, MAST can test for DE in cases beyond a two-group comparison. There are several important differences, however, between TWO-SIGMA and MAST. First, we fit a zero-inflated model on the observed counts while MAST fits a hurdle model (described in more detail in the next section) on the log scale. The ability to avoid log-transforming the data is desirable given recent evidence which suggests that log transformation can distort many scRNA-seq datasets by producing false variability [25, 16]. Further recommendations for DE analysis in scRNA-seq data state that the observed counts should be modeled directly while accounting for batch effects as covariates rather than through normalization [15]. Second, TWO-SIGMA allows the covariates in each of the two components to differ. We will discuss reasons that this flexibility can be appreciated by researchers later. Third, and most importantly, although the ability to include random effects in either component of MAST is mentioned by its authors, they do not prioritize their inclusion for scRNA-seq data and do not evaluate the impact of random effects on the model’s performance. We will revisit the comparison with MAST in the methods section.
Several unsupervised methods, such as ZINB-WaVE [22] and ZIFA [20] have also been proposed for scRNA-seq analysis. Both ZINB-WaVE and ZIFA are primarily designed for settings in which dimension reduction, not association analysis, is the primary goal. One interesting use of ZINB-WaVE is to construct observation-level weights that can be incorporated into the popular bulk RNA-seq pipelines found in the DESeq2 [14] or edgeR [17] Bioconductor packages [26, 27]. These pipelines do not allow for random effects or model excess zeros separately and can involve some transformation of the data in processing or analysis.
The two-component zero-inflated model without random effects has a long history in the analysis of count and microbiome data [12, 3, 7], however its application in scRNA-seq data is limited. Furthermore, a zero-inflated negative binomial mixed effects model has previously been proposed for modelling zero-inflated count data [19]. The focus of that work was on typical repeated measures applications where the number of repeated measures per individual is small, in contrast to genomic applications which tend to include more repeated measures than samples. Because attention is split between zero-inflated and a similar approach called a hurdle model, and between the Poisson and negative binomial distributions, details regarding the performance and robustness of the zero-inflated negative binomial mixed-effects model are not discussed in as much detail as we can here.
The rest of the article proceeds as follows: first, we specify the TWO-SIGMA model, discuss implications of its parameterization, and provide details on parameter estimation. Next, we describe both traditional methods and a new ad hoc method to decide whether random effects should be included in our zero-inflated negative binomial model. Then we show simulation results and an application to a dataset of pancreatic islet single-cells, respectively. Finally, we conclude with a discussion.
2. Materials and Methods
2.1. Zero-inflated negative binomial distribution
For a given gene, let i index the samples sequenced and j index the ni single cells from sample i. Consider the following parameterization of the negative binomial probability mass function (p.m.f.) at a non-negative integer yij corresponding to the observed read count:
With this parameterization, E(Yij) = μij and , such that ϕ is the overdispersion parameter (ϕ > 0). This parameterization is appealing for interpretability because as , the density above approaches the Poisson density with mean μ. Thus, the Poisson and negative binomial distributions are asymptotically nested (and nearly identical for large values of ϕ). One can compare the fit of the Poisson model to the fit of the negative binomial model by testing versus using the likelihood ratio test (p-values come from a 50:50 mixture of and a point mass at zero). As seen in table 2 in the real data analysis below, the p-value from such a test will often suggest a significant deviation from the Poisson model in scRNA-seq data.
Table 2:
Hypothesis | Alpha Cells | Beta Cells |
---|---|---|
No ZIVC | No ZIVC | |
Overall Disease Status | 0.161 | 0.111 |
Overall R.E. Test | 0.738 | 0.724 |
NB vs. Poisson | 0.627 | 0.555 |
To accommodate the excess zeros often observed in scRNA-seq data, we employ the zero-inflated negative binomial distribution (ZINB). This distribution mixes a point mass at zero (from which observations are considered “drop-out”) with the negative binomial distribution. Let pij and μij be the probability of drop-out and the mean read count conditional on not being dropped-out for cell j from sample i, respectively. The p.m.f. of one observation Yij under the ZINB distribution is given by:
(1) |
The ZINB distribution thus assumes that there are two sources of zeros in the data: the first source is the process that governs drop-out and the second source is from the negative binomial process for genes that are not dropped-out. This differs from hurdle models used for gene expression data [5] which use a left-truncated or continuous distribution for the positive expression component—meaning zeros can only be generated from the drop-out process. Thus, the hurdle model does not allow zero expression measurements due to biological variation.
Interpretations from the zero-inflated model are quite natural for single-cell gene expression data because it is reasonable to believe that some observed zeros are “structural zeros” with bona fide zero expression due to stochastic biological factors (e.g. transcriptional bursting, cell cycle) and not due to technical drop-out [29]. Although semantic, the distinction regarding the source of zeros affects the interpretation of model coefficients and is important because these models are often misinterpreted by researchers [21, 24].
2.2. TWO-SIGMA
We can now provide the full TWO-SIGMA specification:
(2) |
The model is fit for each gene individually, so all parameters are gene-specific. α and β are fixed effect coefficient vectors and the corresponding vectors of covariates zij and xij can be different. ai and bi are sample-specific intercepts (discussed more in the next section). Prediction of sample-specific intercepts and estimation of the variance components and allow us to investigate heterogeneity among individuals, and tests of whether the variance components equal zero allow us to separately (or jointly) evaluate the need for random effects. Separate variance components are estimated because the different link functions in the two components correspond to linear predictors with different scales. Including the random effects terms also helps control for any within-sample correlation, providing more accurate estimates and standard errors of fixed effect parameters.
As part of our twosigma R package, we employ the glmmTMB package chi-square[2] to fit the model specified in equation (2). This package is well-suited to fit generalized linear mixed models (GLMMs) because the user can easily specify an arbitrarily complex model composed of fixed and random effects. More details regarding computational considerations can be found in section 5 of the supplement. To summarize, TWO-SIGMA controls for additional covariates in both components, incorporates random effects to accommodate within-sample dependency, can analyze unbalanced data, and allows for zero-inflated and overdispersed counts. The regression modelling framework provides interpretable effect size estimates and can examine any DE hypothesis that can be expressed as a contrast of regression coefficients. The implementation of the model strikes a balance between computational accuracy and efficiency, even as the number of random effects (number of samples in the context of the scRNA-seq data) or the number of single cells per sample increases.
2.3. Evaluating the need for random effects
One primary methodological contribution of TWO-SIGMA for scRNA-seq data analysis is the inclusion of random effect terms in each of the two components, which is a well-established technique to account for within-sample correlation. Ignoring random effects in TWO-SIGMA is equivalent to assuming that cells from the same sample/individual are independent. This assumption can lead to underestimated standard errors and thus inflated type-I errors. An example is given in table 3 in the real data analysis section.
Table 3:
Effect | Estimate | Std. Error | z value | p-value |
---|---|---|---|---|
Intercept | 0.521 | 0.207 | 2.515 | 0.012 |
T2D | −0.349 | 0.292 | −1.197 | 0.231 |
age | −0.284 | 0.256 | −1.109 | 0.267 |
CDR | 0.394 | 0.011 | 36.284 | <.001 |
σ b | 0.490 | |||
Intercept | 0.833 | 0.021 | 40.094 | <.001 |
T2D | −0.605 | 0.032 | −19.090 | <.001 |
age | −0.057 | 0.017 | −3.324 | <.001 |
CDR | 0.390 | 0.015 | 26.611 | <.001 |
Evaluating the need for a random effect term involves a hypothesis test of whether the corresponding variance component(s) equal zero. For example, consider testing whether random effects are needed in either component of TWO-SIGMA with H0 : σa = σb = 0 versus Ha : σa > 0 or σb > 0 via the Likelihood Ratio Test (LRT). The test statistic has an asymptotic chi-square distribution under the null, and can be expressed as: , where and are the log-likelihoods of the model under the null and alternative hypotheses, respectively. The null hypothesis is on the boundary of the parameter space, and the two variance components in equation (2) are estimated using a non-separable marginal likelihood function and are therefore not independent. Thus, to our knowledge, the true degrees of freedom for this LRT chi-square statistic have not been derived analytically. A conservative choice, however, would be to use 2 degrees of freedom for conducting the test, as discussed elsewhere in the context of one component generalized linear mixed models [30]. If testing only one component, one can use the LRT with p-values coming from a 50:50 mixture of a chi-square distribution with 1 degree of freedom and a point mass at zero [30].
Other less desirable post-fitting options to compare models with and without random effects include information criteria like AIC and BIC or Wald tests of the variance components [6]. Critically, all options discussed require fitting the “full” model including the random effect terms. The scRNA-seq application is distinct from typical repeated measures analyses in that the number of repeated measures (cells) typically far exceeds the number of samples. Such designs can entail more extensive computational time for each gene over scenarios involving a smaller number of repeated measures from a modest number of individuals. These computational burdens are especially relevant given that scRNA-seq data typically include thousands of genes, each of which is fit separately using TWO-SIGMA. It would therefore be useful to identify the genes that are most likely to need the random effect terms without having to fit the full model to each gene.
2.3.1. ad hoc approach
First, define the Pearson’s residuals from a zero-inflated negative binomial regression model without random effects as follows, where yij is the observed count for cell j from sample i:
We employ the following ad hoc approach to determine whether random effects are needed in equation (2): using a one-way ANOVA, we regress the Pearson’s residuals on an indicator of sample and take the p-value from the overall ANOVA F test. Use of the Pearson’s residuals provides a convenient way to analyze the residual variation unexplained by either component of the model, and the ANOVA p-value serves as a rudimentary measure of whether this residual variation tends to differ across samples. If this p-value is significant, there is evidence that residuals are not exchangeable across samples, and thus that sample-specific random effect terms may exist. The full TWO-SIGMA specification including random effects in both components will then be fit to more formally evaluate the need for random effect terms. In contrast, when the residuals show no tendency of differing across samples, we do not have evidence to believe that they are structured/clustered within samples and thus will not fit the full model with random effects. Because the Pearson residual combines both components, and seems sensitive to presence of random effect terms in either component (Supplementary Figure S2) the ad hoc procedure is not designed to inform which component suggests a presence of random effects. Rather, formal testing can be used for this purpose as described above. See supplementary section 2 for more details, including results that show that the ad hoc procedure demonstrates good performance in a variety of simulation scenarios. Results from applying this proposed method to a real dataset of pancreatic islet cells are given in the data analysis section. In simulations, computation runtime for TWO-SIGMA was the longest for models attempting to fit random effects when variance components were truly zero (see supplementary tables 1–4). Therefore, as discussed more in section 3 of the supplementary file, the ad hoc method can dramatically reduce overall computation time over many genes in addition to increasing model parsimony where most appropriate.
2.4. Simulation studies
To evaluate the performance of TWO-SIGMA, we simulated data in a variety of scenarios. In each scenario, we simulated 10,000 independent genes using the R function simulate_zero_inflated_nb_random_effect_data in our twosigma R package. For each gene, a drop-out probability is simulated based on fixed effect parameter values, simulated covariates, and the possible presence of random effect terms. This probability is used to simulate the drop-out status of a given gene using the Bernoulli distribution. If the gene is not a drop-out, the mean expression value is then simulated from the negative binomial distribution with mean varying according to simulated covariates, parameter values, and the possible presence of random effect terms. Simulated covariates mirrored our pancreatic real data analysis and included a binary disease status, age and the cellular detection rate (CDR, see the real data analysis section and [5] for more details including discussions of its usefulness as a surrogate for batch effects and other technical artifacts). True parameter values α and β for these covariates are specified in supplementary sections S3 and S4 and were designed to mimic realistic parameter values as observed in the real data analysis of section 3.4. Type-I error and power were evaluated using the likelihood ratio test on the joint null hypothesis that the disease status indicator is not associated with expression through either drop-out probability or the conditional mean (H0 : α1 = β1 = 0, α1 and β1 are set to 0 under the null and are non-zero under the alternative hypothesis). We used 0.05 as the nominal significance rate to evaluate type-I error and power, but provide results for more stringent significance thresholds, such as the Bonferroni threshold of 1/10000, in supplementary figures S5–S6. We consider two different ways of simulating data: one in which the number of samples far exceeds the number of cells per sample, as is typical in most repeated measures contexts, and the other in which the number of cells far exceeds the number of samples, as is the case in scRNA-seq data. Although many methods exist for DE analysis in scRNA-seq data, as described above, we chose to focus our comparison to MAST because, like TWO-SIGMA, it uses a regression modeling framework that is suitable for designs beyond a two-group comparison and can simultaneously control for multiple cell-level and subject-specific covariates. We also compare to a ZINB model without random effects to highlight the impact random effect terms can have on model performance. See supplementary sections S3 and S4 for additional simulation details, results, and discussion.
3. Results
3.1. Type-I error control
Table 1 shows results from simulations in which the true values of the overdispersion parameter ϕ and the variance components σa, σb vary. Type-I error is well-controlled for TWO-SIGMA in the scenarios involving more individuals than cells. When the number of cells increases, type-I errors from TWO-SIGMA are slightly inflated over the nominal rate of 5%, but consistently remain superior to the results from the ZINB model or MAST in the presence of ignored non-zero variance components. For example, the last row of table 1 shows that, when ϕ = 1 and σa = σb = 0.5, type-I error for TWO-SIGMA increases from 0.05 to 0.053 to 0.074 as the number of individuals decreases from 1000 to 100 to 25. In contrast, the ZINB model and MAST have inflated type-I errors in every scenario that increase to nearly 1 as the number of individuals decreases. This is not surprising because both of the latter methods cannot account for any within-sample dependency structure among the single cells from the same sample. Ignoring the dependency introduced by even a moderate random effect size can thus have a drastic impact on the type-I error. When true variance components are zero, both TWO-SIGMA and the ZINB model preserve type-I error while MAST consistently has higher type-I error, as seen in the first three rows of table 1. Coverage of confidence intervals for α, β, and ϕ always approaches the nominal level (Supplementary tables S1–S4). The reason for the slightly inflated type-I error for TWO-SIGMA observed in the scenario with 25 individuals is worth mentioning briefly. The smaller number of individuals (25) provides less information to estimate the sample-specific variance components σa and σb and few unique values of the simulated binary disease status indicator. The slightly lower coverage for variance components in the last 6 sets of supplementary table 4 is one illustration of the (relative) difficulty in getting precise variance component estimates. TWO-SIGMA outperforms MAST or the ZINB model in preserving type-I error and estimating parameters under a variety of sample size breakdowns and with a variety of true parameter values.
Table 1:
Case 1 in Supplement: | Case 3 in Supplement: | Case 4 in Supplement: | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
Sim Params | 50 Cells per 1000 Ind. | 500 Cells per 100 Ind. | 2000 Cells per 25 Ind. | |||||||||
ϕ | σ a | σ b | T-S | ZINB | MAST | T-S | ZINB | MAST | T-S | ZINB | MAST | |
No
R.E. |
10 | 0 | 0 | 0.049 | 0.051 | 0.089 | 0.042 | 0.050 | 0.090 | 0.041 | 0.052 | 0.090 |
2 | 0 | 0 | 0.048 | 0.051 | 0.080 | 0.038 | 0.044 | 0.079 | 0.041 | 0.052 | 0.086 | |
1 | 0 | 0 | 0.048 | 0.052 | 0.081 | 0.044 | 0.051 | 0.087 | 0.042 | 0.051 | 0.090 | |
Small
R.E. |
10 | 0.1 | 0.1 | 0.051 | 0.132 | 0.144 | 0.056 | 0.534 | 0.313 | 0.077 | 0.795 | 0.487 |
2 | 0.1 | 0.1 | 0.051 | 0.078 | 0.089 | 0.057 | 0.323 | 0.176 | 0.072 | 0.643 | 0.361 | |
1 | 0.1 | 0.1 | 0.049 | 0.066 | 0.095 | 0.053 | 0.224 | 0.174 | 0.075 | 0.548 | 0.361 | |
Large
R.E. |
10 | 0.5 | 0.5 | 0.051 | 0.621 | 0.290 | 0.055 | 0.941 | 0.716 | 0.076 | 0.984 | 0.875 |
2 | 0.5 | 0.5 | 0.053 | 0.505 | 0.275 | 0.056 | 0.909 | 0.685 | 0.076 | 0.974 | 0.857 | |
1 | 0.5 | 0.5 | 0.050 | 0.404 | 0.247 | 0.053 | 0.873 | 0.649 | 0.074 | 0.964 | 0.827 |
3.2. TWO-SIGMA retains high power under a variety of scenarios
Because the ZINB model and MAST both have heavily inflated type-I errors in many cases, using raw (or “apparent”) power does not provide a fair comparison for these two methods. For each method and each simulation setting under the null, we therefore calculate the empirical significance threshold, defined as the test statistic value at the quantile associated with 1 minus the significance level. A percentage of statistics equal to the nominal significance level will then be larger than this threshold. For various alternative hypotheses, we calculate “true” power for MAST and the ZINB model by using the empirical significance threshold from the corresponding setting under the null as the rejection threshold instead of a usual theoretical threshold (e.g. 5.9915 from at the .05 level). Figure 1 plots raw power for TWO-SIGMA and true power for MAST in the ZINB model in the following four scenarios: effect in both components, in either the same or opposite directions, and effects in only one of the two components. In the first three scenarios, MAST consistently has the lowest power, while TWO-SIGMA and the ZINB model have very similar power in the first two scenarios, beginning at around 20% and increasing to nearly 100%. The ZINB model has higher power than TWO-SIGMA in the third scenario but the lowest power in the fourth scenario. In simulation, computing the empirical significance thresholds and true power is straightforward and computationally included. In real data settings, however, computationally intensive resampling approaches are needed for reliable estimates of the empirical significance thresholds. Because TWO-SIGMA preserves type-I error, we can rely on raw power and can therefore bypass the need for any resampling approach for valid inference. This is a key advantage and shows that TWO-SIGMA is more robust and flexible than the ZINB model while both preserving the type-I error and having high power without any additional computation. When the effect is only in the zero-inflation component, power is lower for all methods than in the first three scenarios. Such effects present only in the zero-inflation component are known to be more difficult to detect, as seen in [3]. For full power results, including more detailed comparisons to the ZINB model with additional discussion, see section 4, figures S7–S9 and tables S5–S12 of the supplement.
3.3. Sensitivity of the ad hoc method for random effect screening in simulated data
We evaluated the ad hoc screening procedure used to select genes possibly in need of random effect terms using the simulated data. We found it to be effective as a screening procedure: for the data used in Table 1, most genes with non-zero variance components had p-values less than 0.05 and were flagged as in need of estimating variance components using TWO-SIGMA. When the variance components are zero, however, the p-values from the ad hoc method are close to uniformly distributed, indicating that the procedure is not too liberal under the null. See section 2 and figures S2, S3, and S4 of the supplement for more details and some example results.
3.4. Pancreas real data analysis
For illustrative purposes we applied TWO-SIGMA to a dataset of pancreatic islet cells isolated from nine individuals (see section 1 of the supplement and [4] for full details on the data processing and generation steps). To focus on the most informative cells and genes, we applied rather aggressive filtering of the data to keep the top 2,000 genes by number of transcripts observed and only keep cells with more than 1000 transcripts across these genes. After merging across all nine individuals, we were left with 1,290 genes and 10,269 single cells of which we used only the 7,774 for which cell type information was available based on the expression of signature genes. Here we focus our attention on alpha and beta cells, which compose the majority (55% and 34%, respectively) of the cells in our dataset. Type-II diabetes (T2D) status is of primary interest, and age is included as an additional subject-level covariate given its potential to confound the relationship between T2D status and gene expression. The cellular detection rate (CDR) is defined in [5] as the percentage of genes expressed over some background level of expression (often chosen to be zero). The CDR therefore has a biological interpretation as a cellular scaling factor and is a surrogate for both technical and biological variation. This confirms the conclusions of [8] that the CDR can explain a substantial proportion of observed expression variability and should be included in any association analysis of scRNA-seq data. As such, we include CDR in all analyses performed and stratified by cell type. For more details about the pancreas data processing, see section 1 and figure S1 of the supplementary file.
Figure 2 plots the relationship of mean versus variance for the 1,290 genes we used in our analysis. It shows that the Poisson and zero-inflated Poisson models cannot adequately account for the overdispersion observed in many genes. In contrast, TWO-SIGMA can accommodate these mean-variance pairs in a quadratic relationship via the overdispersion parameter ϕ. Because we have only nine individuals, we chose to focus on analyses excluding the zero-inflation random effect terms ai to improve convergence and overall model fit. Some genes still showed convergence issues–partly indicative of a misspecified or overparameterized model and partly due to the small number of cells and samples in the dataset. As a general guideline, we recommend that users with concerns or limited computational resources begin including random effects in the mean component, and scale upwards to include random effects in the zero-inflation component if performance is satisfactory.
Table 2 shows the proportions of genes showing statistically significant results at the .05 level for three types of hypothesis tests: the joint test of significance for the binary disease indicator H0 : α1 = β1 = 0, the test of the mean model variance component H0 : σb = 0, and the test for the presence of overdispersion . For example, when fitting the TWO-SIGMA model without the zero-inflation variance component to alpha cells, 73.8% of genes had statistically significant variance components in the mean model. Most genes showed the need for a random effect term or the negative binomial distribution (or both).
3.4.1. Impact of ignoring within-sample correlation
Models for genes that mistakenly exclude the bi random effect term often show highly significant results for covariates; this significance can disappear when including the random intercept term—possibly indicative of a false positive due to failing to account for within-sample correlation. For example, gene RPS29 demonstrates this pattern in alpha cells. Table 3 shows that failing to include random effects—and thereby assuming independence of all single cells—can lead to vastly underestimated standard errors. T2D status and age change from highly significant to insignificant when including a random intercept term. The standard error for the coefficient of T2D increases by a factor of 9 from 0.032 to 0.292, and the magnitude of the point estimate is halved from −0.605 to −0.349. Individual covariates such as T2D can thus exhibit dramatically increased type-I error when random effects are incorrectly ignored. In contrast, the coefficient and associated standard error for the cellular detection rate (CDR) are nearly identical in the two models. This result is expected given that CDR is a cell-level covariate and shows that including sample-specific random effects leads to very minor changes in the estimation of any covariates that are not sample-specific. Our emphasis in this section is not to draw conclusions about any association between RPS29 and T2D, but rather to illustrate that ignoring random effects has the potential to alter scientific conclusions.
3.4.2. Cell-type specific genes often show a need for random effect inclusion
We matched 234 and 120 genes in our data that were identified in previous studies as cell-type specific in alpha or beta cells, respectively. ([13], supplementary table 10). After stratifying the data by cell type and removing genes with more than 90% or less than 10% zeros, we fit TWO-SIGMA (excluding ai as mentioned previously) to the remaining 222 alpha cell-specific and 111 beta cell-specific genes to alpha cells and beta cells, respectively. Of these, 93 alpha cell-specific genes and 85 beta cell-specific genes had statistically significant variance components σb. This suggests that non-negligible between-sample variation—not attributable to cell-type—is present for these cell-type specific genes. As discussed in [13], cell-type specific expression profiles are often of primary interest to study (dis)function at the cellular level and reveal novel approaches to treat and manage diseases such as T2D. Thus, it is critical to have reliable inference for these genes. As seen in the previous section, incorrectly excluding random effect terms can provide very misleading results and can thereby misdirect attempts to understand disease etiology at a cellular level.
We also used alpha cells to test the overall effect of T2D using both TWO-SIGMA to MAST. Table 4 shows that MAST rejects in many more instances than TWO-SIGMA. Of the 273 genes that were rejected with MAST but not with TWO-SIGMA, 234 have statistically significant variance components in TWO-SIGMA. This further illustrates the possibility that fixed effect coefficients can be mistakenly deemed significant in the presence of within-sample correlation.
Table 4:
TWO-SIGMA | MAST | |
---|---|---|
No Reject | Reject | |
No Reject | 1013 | 273 |
Reject | 1 | 3 |
3.4.3. The ad hoc method successfully separates genes that need random effects
Finally, we used all 1,290 genes from the islet dataset to demonstrate the usefulness of the ad hoc method to determine the need for the random effects terms bi. Figure 3 shows that likelihood ratio statistics from formal testing of bi are consistently larger for genes selected by the procedure than those not selected. This pattern suggests that the ad hoc procedure described earlier can effectively identify genes that will exhibit non-zero variance components in real data.
4. Discussion
We have developed TWO-SIGMA, a two-component zero-inflated negative binomial model with random effects for association analysis of scRNA-seq data. The model builds on the well-established literature in both zero-inflated models and generalized linear mixed models. It keeps the data on the original scale while simultaneously allowing for zero-inflation, overdispersion, and random effects to account for within-sample correlation. As compared to existing methods, its flexibility is demonstrated both in the use of random effect terms and the ability to test any hypothesis of DE that can be expressed as a contrast of regression coefficients while controlling for multiple sample-level and individual-level covariates. To our knowledge, TWO-SIGMA is the only method for scRNA-seq DE analysis that accounts for within-sample dependency to protect type-I error given the presence of many cells sequenced from the same sample.
Including random effects explicitly controls for within-sample correlation, and can improve mean parameter and standard error estimates. Given that many scRNA-seq studies have few samples, it would be reasonable to consider controlling for sample as a fixed effect rather than a random effect. However, there are two reasons to prefer incorporating a random effect to a fixed effect approach. First, we are interested in estimating a variance component that can apply to all samples in the population. Second, the random intercepts explicitly control for the within-sample correlation, rather than only providing adjusted parameter estimates for included covariates. An alternative approach to accounting for such sample-level repeated measures would be to fit a marginal model with the generalized estimating equations (GEE) approach instead of a mixed effects model [1]. We chose not to do so for two main reasons: first, we retain the flexibility for sample-level prediction. Second, given that many scRNA-seq experiments are conducted over a small number of samples, it is likely that the empirical (sandwich) covariance estimate would underestimate the true standard errors [1].
Incorrectly excluding random effects and assuming independence of cells can lead to underestimated standard errors of fixed effects and can therefore increase the type-I error of hypothesis tests relating to fixed effects parameters. See table 3 for an example. If the random effect terms do not contribute to the model fit, as judged by a statistical test or practical significance, they can be removed easily within the general framework of TWO-SIGMA. Random intercepts can also be useful even when they are not of direct interest: they often capture the effects of omitted sample-specific covariates, and can limit the bias of fixed effect coefficients caused by misspecification. For example, if cell-type information is missing, and varies between individuals, a random intercept term can limit the resulting bias and p-value inflation observed in fixed effects parameters. Our ad hoc method proves to be a useful tool to both select genes that could benefit from including random effect terms and reduce overall computation time by suggesting genes that do not need to be fit including random effect terms.
Because we expect a priori that zero-inflation will occur in scRNA-seq, it is beneficial to include a component dedicated to it. The zero-inflation component in TWO-SIGMA is flexible in that it allows for a different set of covariates from the mean model or no covariates at all. For example, one might be interested in using zero-inflation only to improve mean parameter estimation. In this scenario, a constant probability of drop-out could be assumed via an intercept-only regression model. This would prevent coefficient estimates in the mean model from being overly shrunk towards zero, as would occur if drop-out was not accounted for, but would also limit the total number of parameters estimated and maximize model parsimony. Even if the data are not truly generated from a zero-inflated process, or if drop-out is viewed as a “nuisance,” using the two component model in equation (2) can be a convenient choice to improve model fit and fixed effect parameter estimation. See section 6 of the supplement and supplementary figure S10 for more discussion regarding the zero-inflation component.
Finally, our experience suggests that variance component estimates are often much smaller in the zero-inflation component than in the mean component. Therefore, as we did in the real data analysis, it might be a pragmatic choice to exclude random effects from the zero-inflation component of TWO-SIGMA. A key strength of TWO-SIGMA is the flexibility to easily customize the model within the general framework either a priori or via iterative removal based on statistical hypothesis tests of features such as random effects, overdispersion, or the drop-out component.
Supplementary Material
Funding:
This work was supported by the National Institute of Health [R01GM105785 and U54HD079124 to Yun Li, R01HL129132 to Yun Li and E.V.B., U54DK107977 to M.H., R01CA215347 and R03DE028983 to D.W., R01HG009658 to F.J., R01DK113185 to Yan Li].
Footnotes
Appendices: See Supplementary pdf file.
Data Availability: The pancreas Drop-seq data analyzed in this article are available at GEO: GSE101207.
Conflict of Interest Statement. None declared.
References
- [1].Agresti A (2013). Categorical Data Analysis, Third Edition. John Wiley & Sons, Inc., Hoboken, NJ. [Google Scholar]
- [2].Brooks ME, Kristensen K, van Benthem KJ, Magnusson A, Berg CW, Nielsen A, Skaug HJ, Mchler M, and Bolker BM (2017). glmmTMB Balances Speed and Flexibility Among Packages for Zero-inflated Generalized Linear Mixed Modeling. The R Journal, 9(2):378–400. [Google Scholar]
- [3].Chen EZ and Li H (2016). A two-part mixed-effects model for analyzing longitudinal microbiome compositional data. Bioinformatics, 32(17):2611–2617. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [4].Fang Z, Weng C, Li H, Tao R, Mai W, Liu X, Lu L, Lai S, Duan Q, Alvarez C, Arvan P, Wynshaw-Boris A, Li Y, Pei Y, Jin F, and Li Y (2019). Single-cell heterogeneity analysis and crispr screen identify key beta cell-specific disease genes. Cell Reports, 26(11):3132–3144.e7. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [5].Finak G, McDavid A, Yajima M, Deng J, Gersuk V, Shalek AK, Slichter CK, Miller HW, McElrath MJ, Prlic M, Linsley PS, and Gottardo R (2015). Mast: a flexible statistical framework for assessing transcriptional changes and characterizing heterogeneity in single-cell rna sequencing data. Genome Biology, 16(1):278. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [6].Fitzmaurice GM et al. (2003). Applied Longitudinal Analysis, Second Edition. John Wiley & Sons, Inc., Hoboken, NJ. [Google Scholar]
- [7].Hall DB (2000). Zero-inflated poisson and binomial regression with random effects: A case study. Biometrics, 56. [DOI] [PubMed] [Google Scholar]
- [8].Hicks SC, Teng M, and Irizarry RA (2015). On the widespread and critical impact of systematic bias and batch effects in single-cell rna-seq data. bioRxiv. [Google Scholar]
- [9].Hicks SC, Townes FW, Teng M, and Irizarry RA (2017). Missing data and technical variability in single-cell RNA-sequencing experiments. Biostatistics, 19(4):562–578. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [10].Kharchenko PV, Silberstein L, and Scadden DT (2014). Bayesian approach to single-cell differential expression analysis. Nature Methods, 11(7):740–742. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [11].Korthauer KD, Chu L-F, Newton MA, Li Y, Thomson J, Stewart R, and Kendziorski C (2016). A statistical approach for identifying differential distributions in single-cell rna-seq experiments. Genome Biology, 17(1):222. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [12].Lambert D (1992). Zero-inflated poisson regression, with an application to defects in manufacturing. Tecnometrics, 34. [Google Scholar]
- [13].Lawlor N, George J, Bolisetty M, Kursawe R, Sun L, Sivakamasundari V, Kycia I, Robson P, and Stitzel ML (2017). Single-cell transcriptomes identify human islet cell signatures and reveal cell-typespecific expression changes in type 2 diabetes. Genome Research, 27(2):208–222. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [14].Love MI, Huber W, and Anders S (2014). Moderated estimation of fold change and dispersion for rna-seq data with deseq2. Genome Biology, 15(12):550. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [15].Luecken MD and Theis FJ (2019). Current best practices in single-cell rna-seq analysis: a tutorial. Molecular Systems Biology, 15(6):e8746. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [16].Lun A (2018). Overcoming systematic errors caused by log-transformation of normalized single-cell rna sequencing data. bioRxiv. [Google Scholar]
- [17].McCarthy DJ, Chen Y, and Smyth GK (2012). Differential expression analysis of multifactor RNA-Seq experiments with respect to biological variation. Nucleic Acids Research, 40(10):4288–4297. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [18].Miao Z, Deng K, Wang X, and Zhang X (2018). DEsingle for detecting three types of differential expression in single-cell RNA-seq data. Bioinformatics, 34(18):3223–3224. [DOI] [PubMed] [Google Scholar]
- [19].Min Y and Agresti A (2005). Random effect models for repeated measures of zero-inflated count data. Statistical Modelling, 5:1–19. [Google Scholar]
- [20].Pierson E and Yau C (2015). Zifa: Dimensionality reduction for zero-inflated single-cell gene expression analysis. Genome Biology, 16(1):241. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [21].Preisser JS, Stamm JW, Long DL, and Kincade ME (2012). Review and recommendations for zero-inflated count regression modeling of dental caries indices in epidemiological studies. Caries Research, 46(4):413–423. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [22].Risso D, Perraudeau F, Gribkova S, Dudoit S, and Vert J-P (2018). A general and flexible method for signal extraction from single-cell rna-seq data. Nature Communications, 9(1):284. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [23].Soneson C and Robinson MD (2018). Bias, robustness and scalability in single-cell differential expression analysis. Nature Methods, 15(4):255–261. [DOI] [PubMed] [Google Scholar]
- [24].Todem D, Kim K, and Hsu W-W (2016). Marginal mean models for zero-inflated count data. Biometrics, 72(3):986–994. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [25].Townes FW, Hicks SC, Aryee MJ, and Irizarry RA (2019). Feature selection and dimension reduction for single cell rna-seq based on a multinomial model. bioRxiv. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [26].Van den Berge K, Perraudeau F, Soneson C, Love MI, Risso D, Vert J-P, Robinson MD, Dudoit S, and Clement L (2018). Observation weights unlock bulk rna-seq tools for zero inflation and single-cell applications. Genome Biology, 19(1):24. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [27].Van den Berge K, Soneson C, Love MI, Robinson MD, and Clement L (2017). zinger: unlocking rna-seq tools for zero-inflation and single cell applications. bioRxiv. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [28].Wang T, Li B, Nelson CE, and Nabavi S (2019). Comparative analysis of differential gene expression analysis tools for single-cell rna sequencing data. BMC Bioinformatics, 20(1):40. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [29].Wills QF, Livak KJ, Tipping AJ, Enver T, Goldson AJ, Sexton DW, and Holmes C (2013). Single-cell gene expression analysis reveals genetic associations masked in whole-tissue experiments. Nature Biotechnology, 31(8):748–752. [DOI] [PubMed] [Google Scholar]
- [30].Zhang D and Lin X (2008). Variance Component Testing in Generalized Linear Mixed Models for Longitudinal/Clustered Data and other Related Topics, pages 19–36. Springer; New York, New York, NY. [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.