Briefings in Bioinformatics
. 2023 Jul 28;24(5):bbad272. doi: 10.1093/bib/bbad272

A comprehensive assessment of hurdle and zero-inflated models for single cell RNA-sequencing analysis

Tao Cui 1, Tingting Wang 2
PMCID: PMC10516395  PMID: 37507115

Abstract

Single cell RNA-sequencing (scRNA-seq) technology has significantly advanced the understanding of transcriptomic signatures. Although various statistical models have been used to describe the distribution of gene expression across cells, a comprehensive assessment of the different models is missing. Moreover, the growing number of features associated with scRNA-seq datasets creates new challenges for analytical accuracy and computing speed. Here, we developed a Python-based package (TensorZINB) to solve the zero-inflated negative binomial (ZINB) model using the TensorFlow deep learning framework. We used a sequential initialization method to solve the numerical stability issues associated with hurdle and zero-inflated models. A recursive feature selection protocol was used to optimize feature selections for data processing and downstream differentially expressed gene (DEG) analysis. We proposed a class of hybrid models combining nested models to further improve the model’s performance. Additionally, we developed a new method to convert a continuous distribution to its equivalent discrete form, so that statistical models can be fairly compared. Finally, we showed that the proposed TensorFlow algorithm (TensorZINB) was numerically stable and that its computing speed and performance were superior to those of existing ZINB solvers. Moreover, we implemented seven hurdle and zero-inflated statistical models in Python and systematically assessed their performance using a real scRNA-seq dataset. We demonstrated that the ZINB model achieved the lowest Akaike information criterion compared with other models tested. Taken together, TensorZINB was accurate, efficient and scalable for the implementation of ZINB and for large-scale scRNA-seq data analysis with DEG identification.

Keywords: zero-inflated and hurdle models, zero-inflated negative binomial, scRNA-seq, feature selection, DEG, tensorflow

INTRODUCTION

The maturation of single cell RNA-sequencing (scRNA-seq) technology has provided unique opportunities to explore transcriptomic features and biological processes in health and disease at the single-cell level [1–4]. Although various statistical models and solver packages have been developed for scRNA-seq analysis, there is no consensus on model selection and package usage [5–7]. Moreover, increasing numbers of features, such as experimental batch, specimen identity, gender and age, are generated from single-cell studies [8]. Selection, engineering and incorporation of features pose enormous challenges for scRNA-seq analysis and downstream differentially expressed gene (DEG) identification [9, 10]. The zero-inflated negative binomial (ZINB) model has been proposed to capture the excessive zero reads that are typically observed in scRNA-seq data [7, 11]. However, due to the numerical instability of solving complex models such as ZINB, existing packages may fail to converge, which makes them unsuitable for likelihood ratio test (LRT)-based DEG analysis [12]. Following Occam’s razor, or the law of parsimony, ‘entities should not be multiplied beyond necessity.’ We sought to (1) thoroughly compare the performance of various models used in scRNA-seq analysis, and (2) understand to what extent scRNA-seq analysis benefits from increased model complexity. We use the Akaike information criterion (AIC) as the measure of ‘necessity’ in Occam’s razor via comprehensive comparisons of different statistical models and packages for scRNA-seq analysis. Additionally, we provide guidelines for the selection of models, packages and features, as well as for experimental design.

In order to improve the accuracy and speed of solving the ZINB model, we first developed a Python package named TensorZINB using the open-source TensorFlow deep learning framework [13]. Although existing packages can be used to solve the ZINB model [7, 14–16], all the packages are numerically unstable and are not scalable for large-scale data analysis. We propose a sequential initialization algorithm to ensure the likelihood is monotonic as more features are added. With these improvements, TensorZINB is numerically stable and can be generalized to analyze a large number of genes in parallel when running on CPU and GPU. Furthermore, feature selection is a relatively underexplored area in previous single-cell studies. We propose a recursive feature selection protocol and show that feature selections can dramatically impact the AIC performance. Notably, TensorZINB not only supports separate feature sets on the zero-inflated and NB components in the ZINB model, respectively, but can also incorporate common features across all genes rather than using specific features for individual genes. We systematically compare TensorZINB to existing ZINB solvers on a real scRNA-seq dataset [8] and demonstrate that TensorZINB achieves a higher likelihood and faster computing speed.

In order to provide a comprehensive evaluation of different statistical models used in scRNA-seq analysis, we perform systematic comparisons of seven hurdle and zero-inflated models (Table 1). Different from existing comparisons, which typically call existing packages, we implement all seven models from scratch in Python so that they can be fairly compared using the same platform. To properly compare all the models, we develop a method that can generally convert any continuous model into an equivalent discrete model so that the likelihood and AIC can be computed for the discrete model. With this conversion, discrete models (such as ZINB) and continuous models (such as MAST [5]) can be compared on the same ground. For nested models, we propose a new class of hybrid models to combine NB and ZINB (NB is a special case of ZINB) to further improve performance.

Table 1.

Models compared for scRNA-seq analysis

Base model               Hurdle                           Zero-inflated
P (Poisson)              PH (Poisson Hurdle)              ZIP (Zero-inflated Poisson)
NB (Negative Binomial)   NBH (Negative Binomial Hurdle)   ZINB (Zero-inflated Negative Binomial)
Normal                   MAST                             —

We compare the AIC and computing speed for all seven different models and a hybrid (NB + ZINB) model on a published scRNA-seq dataset with rich features [8]. We find that the ZINB model, when solved with TensorZINB, achieves the lowest AIC compared to other non-hybrid models. When we compare the ZINB model with its nested models using LRT, we find that the ZINB model obtains the best performance in both likelihood and computing speed. Finally, we perform DEG analysis using all seven models and the hybrid model and compare the identified DEGs among the models.

METHODS

Models

Let $y_{cg}$ be the observed count from cell $c$ for gene $g$, where $y_{cg}$ is a non-negative integer and there are $N$ cells and $G$ genes. A statistical model $f$ for $y_{cg}$ is the probability mass function (PMF) of observing $y_{cg}$ given model parameter $\theta$, i.e.

$$f(y_{cg}; \theta) = \Pr(Y = y_{cg} \mid \theta). \tag{1}$$

For simplicity, we drop the subscripts throughout this paper if it does not cause ambiguity.

Hurdle and zero-inflated models

scRNA-seq data typically have excessive zeros [11, 17]. Standard count models assume that the zeros and the non-zeros come from the same data-generating process, which may not reflect the underlying mechanisms for excessive zeros. Zero counts could be true zero expression of genes or could be technical failures, which are generated by two distinct processes [17]. Both hurdle and zero-inflated models are described by two distributions: a Bernoulli distribution with parameter $\pi$ and a count distribution on non-negative integers with PMF $f(y)$, e.g. Poisson, negative binomial, etc.

For hurdle models, the zeros and the non-zeros are generated from two different distributions. The Bernoulli distribution with parameter $\pi$ uniquely determines whether a count is zero or positive. If the realization is 1, the hurdle is crossed, and the conditional distribution of the positives is governed by a truncated-at-zero count data model induced from the vanilla model. The hurdle model can be expressed as

$$P(Y = y) = \begin{cases} 1 - \pi, & y = 0, \\ \pi \, \dfrac{f(y)}{1 - f(0)}, & y > 0. \end{cases} \tag{2}$$

Let $z_c = 1$ if $y_c = 0$ and $z_c = 0$ if $y_c > 0$. The log likelihood (LL) function for the hurdle model can be written as

$$\ell_H = \underbrace{\sum_{c} \big[ z_c \log(1 - \pi_c) + (1 - z_c) \log \pi_c \big]}_{\ell_B} + \underbrace{\sum_{c} (1 - z_c) \big[ \log f(y_c) - \log(1 - f(0)) \big]}_{\ell_C}. \tag{3}$$

From (3), the Bernoulli part $\ell_B$ and the count part $\ell_C$ can be estimated separately. This reduces the complexity of the analysis, especially for datasets with many features, which has made the hurdle model broadly used in scRNA-seq analysis [5]. However, due to the presence of the truncation term $\log(1 - f(0))$ in the count part $\ell_C$, the log likelihood $\ell_C$ is typically not convex in the distribution parameters. Intricate distributions, such as NB, typically incur numerical instability issues. When $\pi = 1 - f(0)$, the hurdle model (3) reduces to the vanilla model $f$.
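To make the separability in (3) concrete, the hurdle NB log likelihood can be sketched in Python. This is an illustrative sketch using SciPy's `nbinom` rather than any package discussed here; the counts and the parameter values (`pi`, `mu`, `alpha`) are hypothetical.

```python
import numpy as np
from scipy.stats import nbinom

def hurdle_nb_loglik(y, pi, mu, alpha):
    """Log likelihood of a negative-binomial hurdle model, as in eq. (3).

    pi is P(y > 0) from the Bernoulli part; positive counts follow a
    NB(mu, alpha) truncated at zero, with alpha = 1/size.
    """
    r = 1.0 / alpha                    # NB "size" parameter
    p = r / (r + mu)                   # SciPy's success probability
    z = (y == 0)
    ll_bern = np.sum(np.where(z, np.log(1 - pi), np.log(pi)))
    logf = nbinom.logpmf(y[~z], r, p)  # untruncated NB log-PMF
    log1mf0 = np.log1p(-nbinom.pmf(0, r, p))
    ll_count = np.sum(logf - log1mf0)  # truncated-at-zero adjustment
    return ll_bern + ll_count

y = np.array([0, 0, 1, 3, 7, 0, 2])
ll = hurdle_nb_loglik(y, pi=4/7, mu=2.5, alpha=0.5)
```

Because $\pi$ enters only the Bernoulli sum, the maximizing $\pi$ is simply the observed fraction of positive counts, independent of the count parameters.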

For zero-inflated models, the counts are modeled as a mixture of the Bernoulli distribution and the count distribution, i.e.

$$P(Y = y) = \begin{cases} \pi + (1 - \pi) f(0), & y = 0, \\ (1 - \pi) f(y), & y > 0. \end{cases} \tag{4}$$

The log likelihood function for the zero-inflated model when observing counts $y_c$ can be written as

$$\ell_{ZI} = \sum_{c} z_c \log\big( \pi_c + (1 - \pi_c) f(0) \big) + \sum_{c} (1 - z_c) \big[ \log(1 - \pi_c) + \log f(y_c) \big]. \tag{5}$$

Unlike the hurdle model in (3), we cannot move $\pi_c$ outside the logarithm in the zero term of (5). Therefore, the two parts of zero-inflated models must be estimated jointly, which makes them more challenging to solve.
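The joint structure of (5) can likewise be sketched with SciPy (with hypothetical counts and parameter values); note how the zero-inflation probability sits inside the logarithm of the zero term, so it cannot be separated from the NB parameters:

```python
import numpy as np
from scipy.stats import nbinom

def zinb_loglik(y, pi, mu, alpha):
    """Log likelihood of a zero-inflated NB model, as in eq. (5).

    Zeros mix two sources: excess zeros (probability pi) and NB zeros.
    The mixture sits inside the log, so pi and (mu, alpha) must be
    estimated jointly.
    """
    r = 1.0 / alpha
    p = r / (r + mu)
    f0 = nbinom.pmf(0, r, p)
    z = (y == 0)
    ll_zero = np.log(pi + (1 - pi) * f0) * z.sum()
    ll_pos = np.sum(np.log(1 - pi) + nbinom.logpmf(y[~z], r, p))
    return ll_zero + ll_pos

y = np.array([0, 0, 1, 3, 7, 0, 2])
ll = zinb_loglik(y, pi=0.2, mu=2.5, alpha=0.5)
```

Setting `pi=0` recovers the plain NB log likelihood, which is the nesting relation exploited later for initialization and the hybrid model.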

One important distinction between the hurdle and zero-inflated models is that zero counts can have two different sources in the zero-inflated model, while they can only be generated from the Bernoulli part in the hurdle model. This difference potentially leads to a likelihood difference between these two models in single-cell data analysis, which may have important biological meanings.

Model parameterization

We consider model parameterization of the count distribution $f$ in (2) and (4) and of $\pi$ in the Bernoulli distribution. For the count distribution, we consider the Poisson distribution and the negative binomial distribution, which are characterized by the mean and/or the dispersion parameter. In many single-cell studies, counts are normalized so that each cell’s total count is 10 000, i.e.

$$\hat y_{cg} = \frac{10^4}{s_c} \, y_{cg}, \tag{6}$$

where $s_c$ is a cell-level normalization factor. Assuming that after normalization $\hat y_{cg}$ has the same mean $\mu_g$ across all cells $c$, we have

$$\log \mathbb{E}[y_{cg}] = \log \mu_g + \log s_c - \log 10^4. \tag{7}$$

So if we consider the intercept and $\log s_c$ as two features, existing normalization is equivalent to modeling $\log \mathbb{E}[y_{cg}]$ as a linear combination of features. Motivated by this observation, we consider a linear regression model on the log of the distribution mean. Together with a logistic model on $\pi$, we consider the following parameterization for a given gene $g$:

$$\log \boldsymbol{\mu}_g = X w_g + X_s u, \qquad \operatorname{logit}(\boldsymbol{\pi}_g) = Z v_g + Z_s r, \tag{8}$$

where $\boldsymbol{\mu}_g$ and $\boldsymbol{\pi}_g$ are vectors over the $N$ cells, $\log$ and $\operatorname{logit}$ are applied element-wise, $X$ and $Z$ are features, $w_g$ and $v_g$ are coefficients for each gene $g$, $X_s$ and $Z_s$ are common features shared across all genes, and $u$ and $r$ are the corresponding common coefficients without dependence on the gene $g$. Features can be moved from $X$ to $X_s$ (or from $Z$ to $Z_s$) to reduce overfitting and improve the AIC metric, where common features are leveraged as a form of regularization.

The dispersion parameter in ZINB is chosen as

$$\alpha_g = e^{\theta_g}, \tag{9}$$

so that $\alpha_g$ is always positive and we do not need to constrain the optimization variable $\theta_g$. We do not enforce a linear regression model on $\theta_g$, even though TensorZINB can readily solve this case as well.

With (8), we do not need to preprocess the observed counts $y_{cg}$ using cell-specific normalization factors as in (6) but can use $y_{cg}$ directly for analysis; as shown in [11], preprocessing the counts may lead to mutual information loss. By adding the log library size $\log s_c$ as a feature in $X$ and/or $X_s$, normalization is performed implicitly. The existing normalization is equivalent to fixing the coefficient of this feature to 1.
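The equivalence between explicit normalization and a log-library-size feature with coefficient 1 can be checked numerically. In the sketch below the library sizes `s` and the common normalized mean `mu_norm` are hypothetical values:

```python
import numpy as np

s = np.array([5_000.0, 20_000.0, 10_000.0, 8_000.0])  # library size per cell
mu_norm = 4.0  # assumed common mean of the normalized counts y * 1e4 / s

# Modeling route of eq. (7): log E[y_c] = log(mu_norm) + 1.0 * log(s_c) - log(1e4),
# i.e. the log library size enters as a feature with coefficient fixed at 1.
log_mean = np.log(mu_norm) + 1.0 * np.log(s) - np.log(1e4)

# Equivalent to scaling the common mean by each cell's library size, eq. (6):
assert np.allclose(np.exp(log_mean), mu_norm * s / 1e4)
```

Freeing that coefficient instead of pinning it to 1 is what lets the model learn a data-driven normalization.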

Finally, for the model with parameterization (8) and (9), given the features $X$, $X_s$, $Z$ and $Z_s$, we solve the coefficients for a subset of genes $\mathcal{G}$ by maximizing the sum of log likelihoods

$$\max_{\{w_g, v_g, \theta_g\}_{g \in \mathcal{G}},\, u,\, r} \; \sum_{g \in \mathcal{G}} \sum_{c=1}^{N} \log f\big( y_{cg};\, w_g, v_g, \theta_g, u, r \big), \tag{10}$$

where $f$ is the PMF of the model.

DEG analysis

We use the LRT to identify DEGs. Let $\ell_g(X)$ denote the maximized log likelihood in (10) for gene $g$ given features $X$ and their corresponding coefficients, where for notational simplicity we collect all features in (10) into $X$. The LRT statistic for the null hypothesis that gene $g$ is not differentially expressed with respect to the condition features $X_d$ is

$$\Lambda_g = 2 \big[ \ell_g([X, X_d]) - \ell_g(X) \big], \tag{11}$$

where $\ell_g([X, X_d])$ is maximized over the coefficients of the combined features $[X, X_d]$. Assuming $X_d$ contains $k$ features, the test statistic $\Lambda_g$ is asymptotically chi-squared distributed with $k$ degrees of freedom as the number of cells $N$ approaches $\infty$, according to Wilks’ theorem [12]. Finally, the P-values from the LRT can be adjusted for multiplicity using the Benjamini–Hochberg method [18].
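As a sketch of this testing step (not the paper's code), the chi-squared P-values and the Benjamini–Hochberg adjustment can be computed as follows; the log likelihood values are hypothetical:

```python
import numpy as np
from scipy.stats import chi2

def lrt_pvalues(ll_full, ll_reduced, df):
    """P-values of 2*(LL_full - LL_reduced) under a chi-squared null."""
    stat = 2.0 * (np.asarray(ll_full) - np.asarray(ll_reduced))
    return chi2.sf(stat, df)

def bh_adjust(p):
    """Benjamini-Hochberg adjusted p-values (step-up procedure)."""
    p = np.asarray(p, dtype=float)
    m = p.size
    order = np.argsort(p)
    ranked = p[order] * m / np.arange(1, m + 1)
    # enforce monotonicity from the largest rank downward
    adj = np.minimum.accumulate(ranked[::-1])[::-1]
    out = np.empty(m)
    out[order] = np.clip(adj, 0.0, 1.0)
    return out

# two hypothetical genes, one feature in X_d (df = 1)
pvals = lrt_pvalues([-100.0, -50.0], [-103.5, -50.2], df=1)
padj = bh_adjust(pvals)
```

This is where a negative LL difference would be fatal: `chi2.sf` of a negative statistic is meaningless, which is why monotonicity of the log likelihood matters.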

Hurdle normal distribution model

In single-cell studies, it is common to transform integer counts into continuous variables for downstream analysis. For the popular log transformation, we have

$$v = \log(\lambda y + 1), \tag{12}$$

where $y$ is the observed count, $v$ is the transformed variable and $\lambda$ is a scaling factor. Instead of modeling $y$, many studies model $v$ as a continuous random variable, where a probability density function (PDF) on $v$ is assigned. The hurdle model on $v$ is

$$p(v) = \begin{cases} 1 - \pi, & v = 0, \\ \pi \, g(v), & v > 0, \end{cases} \tag{13}$$

where $g$ is the PDF. MAST [5] assumes a normal distribution on $v$ whose mean is a linear model of the features. The model in MAST writes the likelihood of discrete non-negative integer counts using the PDF $g(v)$ of a continuous distribution rather than a PMF. Moreover, unlike other count-based hurdle models, truncation at zero (the $1 - f(0)$ term in (2)) is not applied in MAST (13). The likelihoods derived from a PDF and from a PMF have different meanings. As a PDF, not a PMF, is used in MAST, the likelihood calculated by MAST cannot be directly compared with those of PMF-based count models.

Conversion of MAST to a discrete model

We sought to determine the PMF equivalent of the PDF model for fair comparisons. Let the PDF of $v$ be $g(v)$. We approximate the probability of observing the integer count $y$ under this PDF as

$$\Pr(Y = y) \approx \int_{v(y)}^{v(y+1)} g(t)\, dt \approx g\big( v(y) \big) \big[ v(y+1) - v(y) \big]. \tag{14}$$

For the log transformation in (12) we have

$$v(y+1) - v(y) = \log \frac{\lambda (y + 1) + 1}{\lambda y + 1}. \tag{15}$$

We thus have the PMF of MAST as

$$f_M(y) = \begin{cases} 1 - \pi, & y = 0, \\ \pi \, g\big( v(y) \big) \big[ v(y+1) - v(y) \big], & y \ge 1. \end{cases} \tag{16}$$

As in (3), the log likelihood of MAST is

$$\ell_M = \underbrace{\sum_{c} \big[ z_c \log(1 - \pi_c) + (1 - z_c) \big( \log \pi_c + \log g(v_c) \big) \big]}_{\ell_A} + \underbrace{\sum_{c} (1 - z_c) \log \big[ v(y_c + 1) - v(y_c) \big]}_{\ell_B}, \tag{17}$$

where $z_c$ is as defined in (3), part $\ell_A$ is the log likelihood using the PDF, and part $\ell_B$ is the adjustment that makes it comparable to discrete models.
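The discretization idea can be illustrated numerically. The sketch below assumes the transform $v = \log(\lambda y + 1)$ with a hypothetical scaling factor and a hypothetical normal PDF $g$, and compares the density-times-interval-width approximation of (14) with the exact probability mass on the same interval:

```python
import numpy as np
from scipy.stats import norm

lam = 0.5                      # assumed scaling factor in v = log(lam * y + 1)
g = norm(loc=1.0, scale=0.6)   # assumed normal PDF on the transformed scale

def v(y):
    return np.log(lam * y + 1.0)

def pmf_from_pdf(y):
    """Approximate P(Y = y) by PDF value times interval width, eq. (14)."""
    return g.pdf(v(y)) * (v(y + 1) - v(y))

def pmf_exact_interval(y):
    """Exact PDF mass on [v(y), v(y+1)], to check the approximation."""
    return g.cdf(v(y + 1)) - g.cdf(v(y))

y = np.arange(1, 200)
approx = pmf_from_pdf(y)
exact = pmf_exact_interval(y)
```

The interval widths $v(y+1) - v(y)$ shrink with $y$, which is exactly the data-dependent term collected in part $\ell_B$ of (17).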

An important implication of (17) is that MAST only maximizes part $\ell_A$ without considering part $\ell_B$. As $\ell_B$ in (17) also depends on the model, maximizing only $\ell_A$ as in MAST does not truly maximize the log likelihood in (17). It is critical to note that LRT-based DEG identification requires maximizing the log likelihood $\ell_M$ in order to compute the likelihood ratio. Therefore, Wilks’ theorem [12] cannot be applied to MAST for DEG identification, as the log likelihood difference may not be chi-squared distributed. This indicates that the DEGs identified by MAST may not be supported by existing statistical theory.

Comparisons of models

In this study, we thoroughly compare the seven models in Table 1. We consider Poisson and negative binomial distributions in both hurdle and zero-inflated models, while the normal distribution is applied to the hurdle model only (MAST). We implement all seven models in Python from scratch so that all models can be compared fairly and the results are not confounded by implementation differences. Except for MAST, we solve the other six count models using Stan. Moreover, we propose TensorZINB, a TensorFlow-based algorithm, for solving ZINB (Figure 1, see Section New algorithm to solve ZINB using TensorFlow: TensorZINB).

Figure 1. The architecture of the TensorFlow-based algorithm for solving ZINB (TensorZINB). Each of the four linear terms $X w_g$, $X_s u$, $Z v_g$ and $Z_s r$ in (8) is implemented as a fully connected Dense layer with a linear activation function. The dispersion $\theta_g$ is also computed using a Dense layer with an all-one input.

Zero-inflated negative binomial model

Overview

The negative binomial distribution is

$$f_{NB}(y; \mu, \alpha) = \frac{\Gamma(y + \alpha^{-1})}{\Gamma(y + 1)\, \Gamma(\alpha^{-1})} \left( \frac{1}{1 + \alpha \mu} \right)^{\alpha^{-1}} \left( \frac{\alpha \mu}{1 + \alpha \mu} \right)^{y}, \tag{18}$$

where $\mu$ is the mean and $\alpha$ is the dispersion parameter. The PMF for ZINB is

$$f_{ZINB}(y) = \pi z + (1 - \pi) f_{NB}(y; \mu, \alpha), \tag{19}$$

where $z = 1$ if $y = 0$ and $z = 0$ if $y > 0$.

Comparisons of existing ZINB solving packages

VGAM [14], statsmodels [15], ZINB-WaVE [7] and Stan [16] can solve the ZINB model. However, they have at least one of the following issues:

  • (i) Convergence: the log likelihood of ZINB is difficult to optimize. Besides converging to a local optimum that is distant from the global optimum, it is common to encounter numerical issues on real single-cell data.

  • (ii) Monotonicity of the log likelihood: for the LRT, we need to compute the increase in log likelihood after adding the conditional features. Existing algorithms cannot guarantee log likelihood monotonicity, which can produce a negative LL difference, in which case the LRT cannot be applied.

  • (iii) Computing speed: computation is slow, especially for large-scale datasets with many cells and genes, which makes most packages not scalable.

New algorithm to solve ZINB using TensorFlow: TensorZINB

In order to address the challenges discussed above, we develop a Python package, TensorZINB, to solve the ZINB model using the TensorFlow deep learning framework. We create a customized loss function in TensorFlow to maximize the ZINB log likelihood. To overcome the numerical stability issues in computing the Gamma function and the powers in (18), we transform all terms in (18) into the log domain and use numerically stable TensorFlow functions. For a given $y$, we can write the log likelihood from (19) as

$$\ell = z \log\big( \pi + (1 - \pi) f_{NB}(0) \big) + (1 - z) \big[ \log(1 - \pi) + \log f_{NB}(y) \big], \tag{20}$$

with

$$\log f_{NB}(y) = \operatorname{lgamma}(y + \alpha^{-1}) - \operatorname{lgamma}(y + 1) - \operatorname{lgamma}(\alpha^{-1}) - \alpha^{-1} \log(1 + \alpha \mu) + y \log(\alpha \mu) - y \log(1 + \alpha \mu),$$

where $\operatorname{lgamma}$ is the log Gamma function. We can further improve stability by rewriting (20) as

$$\ell = z \operatorname{logsumexp}\big( \log \pi,\; \log(1 - \pi) + \log f_{NB}(0) \big) + (1 - z) \big[ \log(1 - \pi) + \log f_{NB}(y) \big], \tag{21}$$

where

$$\log \pi = -\operatorname{softplus}(-\eta), \qquad \log(1 - \pi) = -\operatorname{softplus}(\eta), \qquad \log(1 + \alpha \mu) = \operatorname{softplus}(\theta + \log \mu), \tag{22}$$

$\eta = \operatorname{logit}(\pi)$ and $\theta = \log \alpha$. All computations in (21) are performed in the log domain using the numerically stable functions lgamma, logsumexp and softplus. Each of the four linear terms $X w_g$, $X_s u$, $Z v_g$ and $Z_s r$ in (8) is implemented as a fully connected Dense layer with a linear activation function. The dispersion $\theta_g$ is also computed using a Dense layer with an all-one input (Figure 1). TensorZINB solves the model for a batch of genes rather than gene by gene by leveraging GPU computing, which further increases the computing speed. We use RMSProp for optimization with an initial learning rate of 0.02. The learning rate is multiplied by 0.8 if the loss does not improve for 10 epochs, until a minimum learning rate of 0.002 is reached. Training stops when the loss change remains less than 0.05 for 50 epochs or when the maximum of 3000 epochs is reached.
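A NumPy/SciPy sketch of the log-domain computation in (20)–(22) is given below as a stand-in for the TensorFlow implementation; `gammaln` plays the role of lgamma, and softplus is built from `np.logaddexp`. The parameter values are hypothetical.

```python
import numpy as np
from scipy.special import gammaln, logsumexp

def softplus(x):
    # log(1 + exp(x)), computed stably
    return np.logaddexp(0.0, x)

def zinb_logpmf_stable(y, eta, log_mu, theta):
    """ZINB log-PMF in the log domain, in the spirit of eqs. (20)-(22).

    eta = logit(pi), log_mu = log mean, theta = log dispersion (eq. 9).
    Uses only lgamma/logsumexp/softplus, so large counts do not overflow.
    """
    inv_a = np.exp(-theta)                     # 1 / alpha
    log_pi = -softplus(-eta)                   # log sigmoid(eta)
    log_1mpi = -softplus(eta)                  # log(1 - sigmoid(eta))
    log_1pam = softplus(theta + log_mu)        # log(1 + alpha * mu)
    # log NB PMF, term by term from eq. (18)
    log_nb = (gammaln(y + inv_a) - gammaln(y + 1.0) - gammaln(inv_a)
              - inv_a * log_1pam + y * (theta + log_mu) - y * log_1pam)
    log_nb0 = -inv_a * log_1pam                # log f_NB(0)
    zero_case = logsumexp([log_pi, log_1mpi + log_nb0])
    pos_case = log_1mpi + log_nb
    return np.where(y == 0, zero_case, pos_case)

ll = zinb_logpmf_stable(np.array([0, 1, 5, 500]), eta=-1.0,
                        log_mu=np.log(3.0), theta=np.log(0.8))
```

Exponentiating the result recovers the ZINB PMF of (19), but no intermediate quantity leaves the log domain, so counts as large as 500 remain finite.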

Sequential initialization

It is common for an algorithm to converge to a local optimum, which may lead to a negative LL difference in the LRT. We propose a sequential initialization method to solve this issue. We note that one model is nested in another if the former can be obtained by constraining the parameters of the latter. For instance, Poisson is a nested model of NB (obtained as the dispersion $\alpha \to 0$) and NB is a nested model of ZINB (obtained by setting $\pi = 0$). As the number of model parameters increases from Poisson to NB to ZINB, we solve the models sequentially in that order, so that the log likelihood increases incrementally.

For a sequence of nested models $M_1 \subset M_2 \subset \cdots$, assuming $M_k$ has been solved with optimal parameters $\theta_k^*$, we can initialize the solver of $M_{k+1}$ with $(\theta_k^*, \theta_\Delta)$, where $\theta_\Delta$ consists of the additional parameters set to values such that $M_{k+1}$ reduces to $M_k$. In some cases, such as NB and ZINB, we can even estimate a better $\theta_\Delta$ than the reduced-model values. We consider the model chains described below. For notational simplicity, we drop the gene subscript $g$ and ignore the common features in (8).

(i) Poisson $\to$ NB $\to$ ZINB:

Poisson: the model only depends on the mean $\mu$. We compute the sample mean of the counts, $\bar y$, and set the initial parameter in (8) as $w = (0, \ldots, 0, \log \bar y)$, where all entries are initialized to zero except the last one, corresponding to the all-one (intercept) feature. Let $w_P^*$ be the optimal parameter for the Poisson model.

NB: let $\hat{\boldsymbol{\mu}} = e^{X w_P^*}$. Following [19], we estimate the initial dispersion $\hat\alpha$ by running the auxiliary OLS regression (without a constant)

$$\frac{(y_c - \hat\mu_c)^2 - \hat\mu_c}{\hat\mu_c} = \alpha \, \hat\mu_c + \epsilon_c, \tag{23}$$

where $\hat\mu_c$ is the $c$-th entry of $\hat{\boldsymbol{\mu}}$. The NB model is then solved with initial values $(w_P^*, \log \hat\alpha)$. Let $w_{NB}^*$ and $\alpha_{NB}^*$ be the optimal parameters for the NB model.

ZINB: let $\hat{\boldsymbol{\mu}} = e^{X w_{NB}^*}$ and $\hat p^{+}$ be the observed fraction of positive counts. We can compute the probability of observing a non-zero count under NB as

$$p^{+}_{NB} = 1 - \frac{1}{N} \sum_{c} \big( 1 + \alpha_{NB}^* \hat\mu_c \big)^{-1/\alpha_{NB}^*}. \tag{24}$$

We then have the probability of observing a non-zero count under ZINB as

$$p^{+}_{ZINB} = (1 - \pi)\, p^{+}_{NB}. \tag{25}$$

Let

$$\hat\eta = \operatorname{logit}\Big( \max\big( 1 - \hat p^{+} / p^{+}_{NB},\; \epsilon \big) \Big), \tag{26}$$

where $\epsilon$ is a small constant and $\operatorname{logit}$ is the inverse of the logistic function. Finally, we set the initial parameter for $\operatorname{logit}(\boldsymbol{\pi})$ in (8) as $v = (\hat\eta, 0, \ldots, 0)$, where all entries are initialized to zero except the first one, corresponding to the all-one (intercept) feature.

We can also initialize ZINB using Poisson directly, as Poisson is a nested model of ZINB, in which case the dispersion is again initialized using (23). Using the Poisson initialization avoids an additional model fit of NB and reduces computing time when the NB fit is not otherwise needed.
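The moment-based initial values along the Poisson→NB→ZINB chain can be sketched as follows. This is an illustrative sketch: `init_chain` and its arguments are hypothetical names, and `mu_hat` stands in for the fitted means $e^{Xw^*}$ that would come from the previously solved nested model.

```python
import numpy as np
from scipy.special import logit

def init_chain(y, mu_hat, eps=1e-3):
    """Moment-based initial values along the Poisson -> NB -> ZINB chain.

    mu_hat stands in for the fitted means of the previously solved
    nested model; eps is the small constant of eq. (26).
    """
    # Poisson: intercept-only start at the log sample mean
    w0 = np.log(y.mean())

    # NB: auxiliary OLS regression through the origin, eq. (23)
    lhs = ((y - mu_hat) ** 2 - mu_hat) / mu_hat
    alpha0 = max(np.sum(lhs * mu_hat) / np.sum(mu_hat ** 2), eps)

    # ZINB: match the observed fraction of positives, eqs. (24)-(26)
    p_pos_nb = 1.0 - np.mean((1.0 + alpha0 * mu_hat) ** (-1.0 / alpha0))
    p_pos_obs = np.mean(y > 0)
    pi0 = max(1.0 - p_pos_obs / p_pos_nb, eps)
    eta0 = logit(pi0)   # initial intercept for the logistic part of eq. (8)
    return w0, alpha0, eta0

# NB data with mean 3 and dispersion alpha = 0.5 (no zero inflation)
rng = np.random.default_rng(0)
y = rng.negative_binomial(2.0, 2.0 / 5.0, size=5000)
w0, alpha0, eta0 = init_chain(y, mu_hat=np.full(y.size, 3.0))
```

On data without zero inflation, the observed and NB-implied fractions of positives nearly coincide, so the initial zero-inflation probability collapses toward $\epsilon$, i.e. toward the nested NB model.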

(ii) LRT: we need to find the log likelihood difference between maximizing a model $M$ over the full feature space and maximizing $M$ over the reduced feature space. Let $X_r$ contain only the features in the reduced feature space, and $X_f = [X_r, X_d]$ those in the full feature space, where $X_d$ is the additional feature vector. We first solve $M$ for $X_r$ and let the optimal coefficients be $w_r^*$. As we use a linear model in (8), we initialize the full model with $(w_r^*, w_d)$, where we set $w_d = 0$. The log likelihood difference is then guaranteed to be non-negative, provided that the solver does not decrease the log likelihood from its initial value on the full model.

Hybrid model for DEG identification

For two models $M_1$ and $M_2$, where $M_1$ is nested in $M_2$, we can choose the model with the lower AIC for each gene, rather than picking a single model by AIC across all genes and applying it uniformly. Taking NB and ZINB as an example, for gene $g$ we compare the AICs of the two models fitted on the full feature space. If $\mathrm{AIC}_{NB,g} \le \mathrm{AIC}_{ZINB,g}$, we use NB for the DEG analysis of gene $g$; otherwise, we use ZINB. We denote this hybrid model ‘NB+ZINB’; it reduces the potential overfitting of ZINB.
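The per-gene choice can be sketched in a few lines; `hybrid_pick`, the parameter counts and the log likelihood values below are hypothetical:

```python
import numpy as np

def hybrid_pick(ll_nb, k_nb, ll_zinb, k_zinb):
    """Per-gene model choice for an NB+ZINB hybrid.

    AIC = 2k - 2LL; the nested NB wins ties, damping ZINB overfitting.
    Returns "NB" or "ZINB" for each gene.
    """
    aic_nb = 2 * k_nb - 2 * np.asarray(ll_nb)
    aic_zinb = 2 * k_zinb - 2 * np.asarray(ll_zinb)
    return np.where(aic_nb <= aic_zinb, "NB", "ZINB")

# two hypothetical genes with 10 NB and 12 ZINB parameters
choice = hybrid_pick([-250.0, -300.0], 10, [-249.5, -290.0], 12)
```

For the first gene the 0.5-unit likelihood gain of ZINB does not pay for its two extra parameters, so NB is kept; for the second gene it does.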

Feature engineering and selection

Feature engineering is the process of formulating the most appropriate features given the data, model and task [20], and it is crucial for optimizing model performance in machine learning. However, formulating the ‘most appropriate features’ is subjective, which may introduce bias into the analysis. Feature engineering is largely unexplored in single-cell research, with only a few studies implicitly touching upon the concept [5]. In our study, we not only use the existing features in single-cell datasets but also generate several derived features, such as the total UMIs in a cell (UMI), the log of total UMIs (UMI_log) and the log of ngene (ngene_log), where the rationale for UMI_log follows from (7). We can further enrich the feature set with polynomials of existing features and/or by combining features through operations such as multiplication.

Notably, using excessive features, especially many derived ones, may lead to overfitting. Thus, we sought a feature selection process, based on the AIC metric, that retains only the features that are necessary. Complex models such as ZINB are costly to fit, and fitting all feature combinations is prohibitively time-consuming, so we propose a recursive feature selection method to reduce the complexity of the selection process.

For the top-down approach, we start with the feature set containing all features and eliminate one feature recursively at each step until the feature set is empty. Specifically, at the $k$-th step, let $S_k$ denote the current set of features; when $k = 0$, $S_0$ contains all features. We iterate through each feature $x \in S_k$, train the model using the features in $S_k \setminus \{x\}$ (i.e. removing $x$ from $S_k$) and compute the resulting AIC, denoted $\mathrm{AIC}_k(x)$. The feature that minimizes the AIC is chosen, i.e. $x_k^* = \arg\min_{x} \mathrm{AIC}_k(x)$. We record this AIC as $\mathrm{AIC}_k^*$ and the corresponding feature set as $S_k \setminus \{x_k^*\}$. We then set $S_{k+1} = S_k \setminus \{x_k^*\}$ and start the next iteration with $S_{k+1}$. This process repeats until the feature set is empty. Finally, the selected feature set is the recorded set that achieves the lowest $\mathrm{AIC}_k^*$ over all steps $k$. Let $D$ be the number of features. The complexity of this top-down approach is $O(D^2)$ model fits, compared with $O(2^D)$ for an exhaustive search.

Similarly, we can adopt a bottom-up approach where we start with an empty feature set and add one feature recursively at each step until the feature set contains all features.
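The top-down variant can be sketched as follows, with a toy AIC function standing in for an actual model fit (`top_down_select` and `toy_aic` are hypothetical names):

```python
def top_down_select(features, aic_of):
    """Top-down recursive feature elimination on AIC.

    aic_of(subset) fits the model on a tuple of features and returns its
    AIC; here it is a stand-in for a full model fit. O(D^2) fits for D
    features versus O(2^D) for exhaustive search.
    """
    current = list(features)
    best_set, best_aic = tuple(current), aic_of(tuple(current))
    while current:
        # drop the feature whose removal yields the lowest AIC
        trials = {x: aic_of(tuple(f for f in current if f != x)) for x in current}
        drop = min(trials, key=trials.get)
        current.remove(drop)
        if trials[drop] < best_aic:
            best_aic, best_set = trials[drop], tuple(current)
    return best_set, best_aic

# Toy AIC surface: features "a" and "b" help, "c" only adds a penalty.
useful = {"a": 6.0, "b": 4.0}
def toy_aic(subset):
    fit = sum(useful.get(f, 0.0) for f in subset)
    return 100.0 - 2 * fit + 2 * len(subset)

best_set, best_aic = top_down_select(["a", "b", "c"], toy_aic)
```

On this toy surface the procedure drops the uninformative feature "c" first and then stops improving, returning the {a, b} subset.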

RESULTS

scRNA-seq data used for evaluation

For comparisons between TensorZINB and existing ZINB solvers, and between the different statistical models, we use a scRNA-seq dataset from the prefrontal cortex and anterior cingulate cortex of 15 autism patients and 16 controls with rich features [8]. There are in total 10 original features in the published dataset: Region, Sex, Age, Capbatch, Seqbatch, PMI, RIN, Ribo pct, Mito pct and Diagnosis. Feature selection and engineering are performed using the methods in Section Feature engineering and selection.

As statsmodels is slow, and the Poisson hurdle, zero-inflated Poisson and NB hurdle models are not scalable and cannot be run on the full dataset, we generate a smaller dataset with 340 genes for the initial tests. We select the 20 genes from each cell type (17 cell types, 340 genes in total) with the lowest P-values from a Wilcoxon rank sum test. Default parameters are used for ZINB-WaVE, statsmodels and Stan. As ZINB-WaVE does not support separate features for the logistic and NB parts, we use the same feature set for both parts, i.e. UMIs_log, ngene, ngene_log, Sex, Age, Capbatch, PMI, RIN, Ribo_pct and Mito_pct. All model training and testing are performed on a computer with an Intel Xeon CPU E5-2686 v4 @ 2.30 GHz with 62 GB of RAM and an NVIDIA Tesla K80 GPU with 17 GB of memory.

Validation of the feature selection method

First, we find that, in this scRNA-seq dataset, Capbatch uniquely determines Seqbatch, and the Capbatch values CB3, CB4, CB8 and CB9 uniquely determine the brain region; we therefore remove these redundant features. Diagnosis is used specifically for DEG analysis, so we include it only for DEG identification. For feature engineering, we generate three new features, the total UMIs in a cell (UMI), the log of total UMIs (UMI_log) and the log of ngene (ngene_log), and adopt ngene from MAST. We consider feature selection over 11 features on each of the logistic and NB parts of ZINB, for a total of 22 features to choose from.

We first validate the proposed top-down feature selection protocol for ZINB using the 340-gene dataset described in Section scRNA-seq data used for evaluation. We compare the sum of the log likelihoods of all genes for different numbers of features (Figure 2). For a given number of features, we use the feature set with the highest likelihood. The log likelihood increases monotonically with the number of features, and the likelihood gain begins to diminish beyond 13 features (Figure 2A). Similarly, we compare the AIC across different numbers of features, selecting for each number the feature set with the lowest AIC. We find that the AIC is minimized with 19 features (Figure 2B). The selected features include UMIs, UMIs_log, ngene, ngene_log, Sex, Age, Capbatch, PMI, RIN, Ribo_pct and Mito_pct for the NB part, and UMIs, UMIs_log, ngene_log, Sex, Age, Capbatch, Ribo_pct and Mito_pct for the logistic part.

Figure 2. Validation of the proposed feature selection protocol. A. Log likelihood increases with the number of features; the best log likelihood for a given number of features is plotted. B. AIC versus the number of features; AIC is smallest when 19 features are used, and the best AIC for a given number of features is plotted. C. The relationship between log likelihood and the number of features; all log likelihoods in the 256 testing cases are shown. D. The relationship between AIC and the number of features; all AICs in the 256 testing cases are shown.

Finally, we examine how the log likelihood and AIC change with different combinations of features (not only with the best combinations). We plot the log likelihood and AIC from all 254 testing cases that have been run (Figure 2C and D). Interestingly, we find that for a given number of features, some features have greater impacts on both the likelihood and AIC, suggesting that these features may affect the transcriptome more significantly than others. These results demonstrate that the proposed feature selection protocol improves data fitting and provides valuable information about the underlying biological processes.

Validation of TensorZINB

We generate a simulation dataset for the validation of TensorZINB, where the parameters estimated from TensorZINB are compared with the known model parameters. In the ZINB model (19), we choose

log μ_i = β0 + β1 x_i,    logit π_i = γ0 + γ1 x_i    (27)

where x is a synthetic feature vector with entries generated from the uniform distribution between 0 and 1. We assign arbitrary numbers to (27) to generate a dataset with 20 000 samples and subsequently use it to validate the convergence of TensorZINB. From the log likelihood, we find that TensorZINB converges quickly to the maximum likelihood estimate (MLE; Figure 3A) after 250 iterations, which is in the neighborhood of the true values (denoted as dots in Figure 3B and C), demonstrating that TensorZINB solves the ZINB model correctly.
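A simulation of this kind can be sketched as follows. The parameter values and the Gamma-Poisson construction of the NB part are illustrative, not the values actually assigned in (27):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 20_000

# Synthetic feature drawn uniformly from [0, 1), as in the paper's setup.
x = rng.uniform(size=n)

# Illustrative (not the paper's) true parameters.
beta0, beta1 = 1.0, 0.8      # NB mean: log(mu) = beta0 + beta1 * x
gamma0, gamma1 = -1.5, 1.0   # zero inflation: logit(pi) = gamma0 + gamma1 * x
alpha = 0.5                  # NB dispersion: Var = mu + alpha * mu^2

mu = np.exp(beta0 + beta1 * x)
pi = 1.0 / (1.0 + np.exp(-(gamma0 + gamma1 * x)))

# NB counts via the Gamma-Poisson mixture: lambda ~ Gamma(1/alpha, alpha*mu).
lam = rng.gamma(shape=1.0 / alpha, scale=alpha * mu)
counts = rng.poisson(lam)

# Zero-inflate: with probability pi, replace the count by a structural zero.
zero = rng.uniform(size=n) < pi
counts[zero] = 0
```

Fitting a ZINB solver to `counts` and `x` and comparing the estimates with the true parameters reproduces the kind of convergence check shown in Figure 3A-C.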

Figure 3.

Validation of TensorZINB using a simulated dataset. A. Convergence of the log likelihood over iterations using a simulation dataset. The maximum likelihood estimate (MLE) is shown in the plot (purple box). B. Convergence of the regression coefficients over iterations using a simulation dataset. True values (green dots) and MLEs (purple boxes) are shown. C. Convergence of the dispersion parameter over iterations using a simulation dataset. The true value (green dot) and MLE (purple box) are shown. D. Histogram of the zero-inflation probability π_i using the fitted parameters. Gene ENSG00000198840 in cluster L2/3 in a published scRNA-seq dataset is used. The mean value of π_i is 0.14. E. Convergence of the log likelihood over iterations using a real scRNA-seq dataset. F. Convergence of the model parameters over iterations using a real scRNA-seq dataset.

Next, we apply TensorZINB to real scRNA-seq data [8]. As an example, we consider gene ENSG00000198840 in cluster L2/3 (Figure 3D-F). The features are chosen based on the feature selection in the previous subsection. Figure 3D shows the histogram of the zero-inflation probability π_i from (19) using the fitted parameters. With TensorFlow's default random initialization, the log likelihood converges after 750 iterations (Figure 3E). We find the mean of π_i to be 0.14, which indicates that, with a mean probability of 0.14, we observe zero counts from the Bernoulli part of ZINB (Figure 3D). Note that the histogram shows diverse π_i across cells, indicating that some cells are more susceptible to zero counts from the Bernoulli part, possibly due to technical failures. Rather than imposing hand-selected rules to filter 'bad' cells, cell status may be inferred from TensorZINB through the zero-inflation probability π_i.
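The per-cell zero-inflation probability can be recovered from the fitted logistic part as the sigmoid of its linear predictor. The design matrix and coefficients below are hypothetical, only the shape of the computation is meant:

```python
import numpy as np

def zero_inflation_prob(Z, gamma):
    """Per-cell structural-zero probability pi_i = sigmoid(z_i . gamma)
    from the logistic (Bernoulli) part of a fitted zero-inflated model."""
    return 1.0 / (1.0 + np.exp(-(Z @ gamma)))

# Hypothetical fitted coefficients: intercept + two cell-level features.
gamma = np.array([-2.0, 1.5, 0.5])
rng = np.random.default_rng(1)
Z = np.column_stack([np.ones(1000), rng.uniform(size=(1000, 2))])

pi = zero_inflation_prob(Z, gamma)
# Cells with unusually high pi_i can be flagged for inspection rather
# than removed by hand-selected filtering rules.
suspect = np.flatnonzero(pi > 0.4)
```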

Comparison of TensorZINB to existing ZINB solvers

Next, we compare the performance of different ZINB solvers and evaluate the quality of the solution returned by each algorithm. In our experiments, VGAM fails to solve for most of the genes, so we exclude it from this comparison. Let LL_{g,a} denote the log likelihood of gene g returned by algorithm a, using the 340 genes from the scRNA-seq dataset as in Section scRNA-seq data used for evaluation. For each gene and cell type, we take the maximum likelihood over all four tested algorithms (TensorZINB, statsmodels, Stan and ZINB-WaVE) and denote it as LL*_g. We compare the following ratio for each algorithm

r_a = (Σ_g LL*_g - Σ_g LL_{g,a}) / |Σ_g LL*_g|    (28)

The higher the quantity in (28), the worse the solution. We compare the total log likelihood loss of the four algorithms as in (28) and find that TensorZINB is within 0.1% of the highest likelihood, outperforming the other ZINB solvers (Figure 4A).
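A minimal sketch of the comparison in (28), with hypothetical per-gene log likelihoods (rows are genes, columns are algorithms):

```python
import numpy as np

# Hypothetical per-gene log likelihoods for 3 genes x 4 algorithms.
ll = np.array([[-120.0, -121.5, -120.2, -124.0],
               [-310.0, -309.8, -310.5, -312.0],
               [ -55.3,  -55.3,  -56.0,  -58.1]])

ll_max = ll.max(axis=1)  # best log likelihood per gene over all algorithms
# Ratio as in (28): total likelihood loss of each algorithm relative to the best.
loss = (ll_max.sum() - ll.sum(axis=0)) / abs(ll_max.sum())
```

The algorithm with the smallest `loss` is the one closest to the per-gene maxima, which is how Figure 4A ranks the solvers.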

Figure 4.

Comparisons of TensorZINB and other ZINB-solving packages. A. The ratio of the total log likelihood of each algorithm to the best total log likelihood. TensorZINB, statsmodels, Stan and ZINB-WaVE are compared. B. The percentage of cases where the likelihood difference is negative when Diagnosis is added as a feature in the DEG analysis. C. The mean computing time of running LRT for each gene in DEG identification with different algorithms. D. Comparisons of DEGs identified by TensorZINB, Stan and ZINB-WaVE using a real scRNA-seq dataset with 340 genes. E. Comparisons of DEGs identified by TensorZINB, Stan and statsmodels using a dataset with 340 genes.

Next, to test whether each algorithm is feasible for LRT in DEG analysis, we compare the likelihood without Diagnosis to that with Diagnosis as a feature and compute the percentage of cases where the log likelihood difference is negative (Figure 4B). We find that the log likelihood difference is always positive in TensorZINB, while other packages have a higher percentage of negative cases, suggesting that TensorZINB is the most suitable for LRT-based DEG analysis. We examine DEGs identified from the dataset with 340 genes using TensorZINB, ZINB-WaVE and Stan, and find that the majority of DEGs detected are common among different algorithms (Figure 4D and 4E), possibly because strict criteria are used to select the 340 genes for testing.

Finally, we compare the mean computing time of running LRT for each gene using different algorithms on CPU; TensorZINB has the fastest computing speed (Figure 4C). Taken together, TensorZINB achieves a higher likelihood, maintains the monotonicity of the likelihood in LRT, and is computationally efficient. It is scalable and robust and can be used for DEG analysis on scRNA-seq datasets. In the remainder of the study, we use TensorZINB to solve ZINB for comparisons with other statistical models.

Comparisons of different models in scRNA-seq analysis

Next, we comprehensively compare the performance of seven models (Table 1) plus the hybrid NB+ZINB model as in Section Hybrid model for DEG identification. We use the same dataset with 340 genes as in Section 3.1 scRNA-seq data used for evaluation and select features as in Section 2.9 Feature engineering and selection. For MAST, we use (17) to compute the likelihood so that all models can be compared, and the empirical Bayes method to regularize variance is not used. We compute the ratio of the difference between the best likelihood across all models and the likelihood of each model to the best likelihood as defined in (28) (Figure 5A). Similarly, we compute the ratio of the difference between the AIC of each algorithm and the best AIC across all models to the best AIC (Figure 5B), i.e.

Figure 5.

Comparisons of different statistical models. A. The ratio of the total log likelihood of each model to the best total log likelihood. Poisson, Poisson hurdle, zero-inflated Poisson, MAST, NB, NB hurdle, ZINB and NB+ZINB are compared using a dataset with 340 genes. ZINB is solved by the proposed TensorZINB method. B. The ratio of the total AIC of each model to the best total AIC using a dataset with 340 genes. C. AIC difference between other models and ZINB (AIC of other models - AIC of ZINB) using a dataset with 340 genes. D. The fitting of the count distribution of gene ENSG00000183117 in cluster L2/3 by different models. E. The fitting of the count distribution of gene ENSG00000198840 in cluster L2/3 by different models. F. Comparisons of DEGs identified by Poisson, Poisson hurdle, zero-inflated Poisson, NB, NB hurdle and ZINB using a dataset with 340 genes. G. Comparisons of DEGs identified by MAST, NB, ZINB and NB+ZINB using a real large-scale scRNA-seq dataset.

r_m = (AIC_m - AIC*) / AIC*,  where AIC* is the lowest AIC across all models    (29)

Smaller values indicate a higher likelihood and a lower AIC. We find that the ZINB model attains the highest likelihood and the lowest AIC, suggesting that it achieves the best performance among the models tested (Figure 5A-C).
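The AIC comparison in (29) can be sketched with hypothetical totals; `aic` uses the standard definition AIC = 2k - 2*LL:

```python
def aic(ll_total, n_params):
    """Akaike information criterion: AIC = 2k - 2*LL."""
    return 2 * n_params - 2 * ll_total

# Hypothetical total log likelihoods and parameter counts for three models.
models = {"NB": (-480.0, 12), "ZINB": (-470.0, 20), "Poisson": (-560.0, 11)}
aics = {m: aic(ll, k) for m, (ll, k) in models.items()}

best = min(aics.values())
# Ratio as in (29): relative AIC excess of each model over the best model.
ratio = {m: (a - best) / abs(best) for m, a in aics.items()}
```

Note how ZINB wins here despite its larger parameter count: its likelihood gain more than pays the 2-per-parameter AIC penalty.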

To evaluate goodness of fit, we apply LRT to full and nested models and use AIC for non-nested models. The Vuong test [21] is not used on non-nested models, as it has been reported to be unsuitable for zero-inflated non-nested models [22]. In Table 2, we list P-values from LRT between nested models using the sum of log likelihoods across all genes, where the column model is a nested model of the row model. AIC differences between the column model and the row model (column model AIC - row model AIC) are listed in Table 3. In summary, all the results demonstrate that the performance of the eight models can be ranked as NB+ZINB > ZINB > NB > NB hurdle > MAST > zero-inflated Poisson > Poisson hurdle > Poisson. Interestingly, the vanilla NB model performs better than the NB hurdle model, which indicates that the NB hurdle model solved with Stan may converge to local optima. Also note that the vanilla NB is not a nested model of the NB hurdle model with a logit function on the hurdle part; hence, there is no guarantee that NB hurdle attains a higher log likelihood (LL) than vanilla NB.
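The LRT between a nested and a full model follows Wilks' theorem: 2*(LL_full - LL_nested) is asymptotically chi-square with df equal to the number of extra parameters. The sketch below uses a closed-form chi-square survival function valid for even df to stay dependency-free (in practice one would call a library routine such as scipy.stats.chi2.sf); the likelihoods and df are hypothetical:

```python
import math

def chi2_sf_even(x, df):
    """Survival function of a chi-square with even df (closed form):
    P(X > x) = exp(-x/2) * sum_{i < df/2} (x/2)^i / i!"""
    m = df // 2
    return math.exp(-x / 2) * sum((x / 2) ** i / math.factorial(i)
                                  for i in range(m))

def lrt_pvalue(ll_nested, ll_full, df):
    """Likelihood ratio test (Wilks): 2*(LL_full - LL_nested) ~ chi2(df)."""
    stat = 2.0 * (ll_full - ll_nested)
    return chi2_sf_even(stat, df)

# Example: a full model (e.g. ZINB) vs a nested one (e.g. NB), assuming
# 8 extra parameters in the full model's logistic part.
p = lrt_pvalue(ll_nested=-480.0, ll_full=-470.0, df=8)
```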

Table 2.

P-values of LRT between nested models

Model     P      NB     PH     NBH    ZIP
NB        0.0    -      -      -      -
PH        -      -      -      -      -
NBH       -      -      0.0    -      -
ZIP       0.0    -      -      -      -
ZINB      0.0    0.0    -      -      0.0

A dash indicates that the column model is not nested in the row model.

Table 3.

The difference of mean AIC between models

Model P PH ZIP MAST NB NBH ZINB NB+ZINB
P 0.00 −415.97 −611.50 −10884.96 −11243.58 −11084.49 −11391.47 −11393.96
PH 415.97 0.00 −195.53 −10469.00 −10827.61 −10668.52 −10975.51 −10977.99
ZIP 611.50 195.53 0.00 −10273.47 −10632.08 −10472.99 −10779.98 −10782.47
MAST 10884.96 10469.00 10273.47 0.00 −358.61 −199.53 −506.51 −509.00
NB 11243.58 10827.61 10632.08 358.61 0.00 159.09 −147.90 −150.39
NBH 11084.49 10668.52 10472.99 199.53 −159.09 0.00 −306.99 −309.47
ZINB 11391.47 10975.51 10779.98 506.51 147.90 306.99 0.00 −2.49
NB+ZINB 11393.96 10977.99 10782.47 509.00 150.39 309.47 2.49 0.00

Next, to further investigate the performance of each model, we examine the fitting of individual genes by each model. To visualize goodness of fit, we compare all models without any features except the intercept, so that the PMF does not depend on features and can be displayed. Two genes, ENSG00000183117 and ENSG00000198840, in cluster L2/3 from the dataset in [8] are used. We compare the histogram of observed single-cell counts with the PMF of each model (Figure 5D and E). Gene ENSG00000183117 has no zero counts, so hurdle and zero-inflated models are not shown. The transformed PMF (16) is shown for MAST. We find that NB fits the experimental data accurately, while MAST shows a shift relative to the distribution of real counts, likely due to the use of the normal distribution and log transformation. Poisson does not fit the data well, possibly due to over-dispersion in single-cell data.
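An intercept-only NB fit of this kind can be sketched via the method of moments (the paper fits by maximum likelihood; moments are used here only to keep the example short), comparing the fitted PMF with the empirical count frequencies; the toy counts are illustrative:

```python
import math
from collections import Counter

def nb_pmf(k, mu, alpha):
    """NB PMF with mean mu and dispersion alpha (Var = mu + alpha*mu^2),
    parameterized via r = 1/alpha, p = r/(r + mu)."""
    r = 1.0 / alpha
    p = r / (r + mu)
    return math.exp(math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
                    + r * math.log(p) + k * math.log(1 - p))

# Toy single-cell counts for one gene (hypothetical).
counts = [0, 1, 1, 2, 3, 0, 5, 2, 1, 0, 4, 2, 1, 1, 3, 0, 2, 6, 1, 2]
n = len(counts)
mu = sum(counts) / n
var = sum((c - mu) ** 2 for c in counts) / n
alpha = max((var - mu) / mu ** 2, 1e-6)  # over-dispersion estimate

# Empirical frequencies vs fitted PMF, as overlaid in Figure 5D/E.
freq = {k: v / n for k, v in Counter(counts).items()}
fit = {k: nb_pmf(k, mu, alpha) for k in range(max(counts) + 1)}
```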

For gene ENSG00000198840, ZINB reduces to NB without using any features, so it is not included in the comparison. Notably, when features are used, ZINB indeed has a non-zero probability on the Bernoulli part, which indicates the zero inflation may come from individual cells. Without any features, NB and NB hurdle fit the data better than other models (Figure 5E). The observations on the fittings of individual genes are consistent with model ranking using likelihood and AIC.

DEG analysis

Lastly, we evaluate DEG identification by different models using LRT. We compute the LRT statistic in (11) using Diagnosis as the testing condition for the eight models (Table 1). We do not apply any additional filtering using non-model-based criteria, such as fold change, to demonstrate that the DEG differences come solely from using different models. We first use the same 340-gene dataset as in Section 3.1 and select features as in Section 2.9. A P-value threshold of 0.01 is used to determine whether a gene is a DEG. We compare all the DEGs identified by each model (Figure 5F and Supplementary Figure 1A) and find that DEGs identified by different models from the 340-gene dataset are very similar, probably due to the stringent criteria used for gene selection.

Then, we apply the eight models to all genes in the full scRNA-seq dataset instead of only the 340 genes. Since Poisson hurdle, zero-inflated Poisson and NB hurdle are not scalable and cannot be used on the full dataset, we compare the DEGs identified by Poisson, MAST, NB, ZINB and NB+ZINB. P-values are adjusted using the Benjamini-Hochberg procedure [18], and an adjusted P-value threshold of 0.01 is used to determine the DEGs. We compute the total numbers of DEGs (combined DEGs from all 17 cell clusters) identified by different models and find that the ZINB model identifies the highest number of DEGs (Table 4). We then compare the identities of DEGs from different models on all genes in the scRNA-seq dataset (Figure 5G and Supplementary Figure 1B). The results show that DEGs found by different models only partially overlap. The observation that DEGs found by MAST differ from those of NB and ZINB is possibly because MAST does not maximize the likelihood function, and hence LRT does not apply to MAST, as shown in Section 2.4 Conversion of MAST to a discrete model. Some DEGs are detected by ZINB but not by NB, which may be because the probabilities of observing zero counts differ between autism patients and controls, and excessive zero counts cannot be modeled well by NB.
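The Benjamini-Hochberg step-up adjustment [18] can be sketched as follows; the P-values are hypothetical:

```python
def benjamini_hochberg(pvals):
    """BH adjusted p-values: for sorted p_(1) <= ... <= p_(m),
    p_adj(rank) = min over j >= rank of m*p_(j)/j, capped at 1;
    results are returned in the original input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adj = [0.0] * m
    running = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running = min(running, m * pvals[i] / rank)
        adj[i] = running
    return adj

pv = [0.001, 0.008, 0.039, 0.041, 0.20]
adj = benjamini_hochberg(pv)
degs = [i for i, p in enumerate(adj) if p < 0.05]
```

In practice a library routine (e.g. statsmodels' `multipletests` with `method="fdr_bh"`) does the same computation.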

Table 4.

The number of DEGs across all 17 cell types identified by different models using a real large-scale scRNA-seq dataset

Model Number of DEGs
Poisson 13104
MAST 10751
NB 10765
ZINB 15827
NB+ZINB 15390

Finally, we compare the mean computing time of running LRT for each gene using different models on CPU (Figure 5H). The computing time is the total time to compute the two likelihoods in (11). We find that zero-inflated Poisson and NB hurdle, both implemented in Stan, are much slower than the other models. MAST is fast because both the linear and logistic regressions are simple to solve. However, since LRT may not apply to MAST, users need to be cautious when using MAST for DEG identification. Although the fast computing time of NB makes it a good option for exploratory analysis, ZINB has the lowest AIC and fast computing time, and can perform DEG identification reliably, especially for large-scale datasets with excessive zero counts.

DISCUSSION AND CONCLUSIONS

With the rapid growth of single-cell techniques, it is becoming challenging to perform data analysis and DEG identification accurately and efficiently, especially with massive dataset sizes and increasing numbers of features. In this study, we propose a Python-based algorithm, TensorZINB, to solve the ZINB model, which can run on both CPU and GPU. TensorZINB obtains performance superior to that of other ZINB solvers.

We develop a protocol for feature engineering and selection using a recursive feature elimination-based method on the AIC metric, which is used as a measurement of ‘necessity’ in Occam’s razor. The feature selection process can also provide valuable information about the connection between the biological meaning of features and transcriptomic regulation at the single-cell level. Although certain redundant features are difficult to avoid due to experimental limitations, the efficiency of using single-cell data can be improved with good design of experimental plans. One example of optimizing experimental design is to randomly shuffle samples into different sequencing experiments using the Fisher–Yates shuffle [23], to ensure that none of the feature columns in the design matrix can be written as a linear combination of others.
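A Fisher-Yates shuffle [23] for assigning samples to sequencing batches can be sketched as follows; the sample names and batch count are illustrative:

```python
import random

def fisher_yates(items, seed=None):
    """Fisher-Yates shuffle: draws uniformly over all permutations."""
    rng = random.Random(seed)
    a = list(items)
    for i in range(len(a) - 1, 0, -1):
        j = rng.randint(0, i)       # pick a position in a[0..i]
        a[i], a[j] = a[j], a[i]     # swap it into place
    return a

# Randomly assign 12 samples to 3 sequencing batches of 4, so that batch
# membership is not confounded with sample-level covariates.
samples = [f"S{i:02d}" for i in range(12)]
shuffled = fisher_yates(samples, seed=7)
batches = [shuffled[k::3] for k in range(3)]
```

Randomized assignment of this kind helps keep the batch column of the design matrix from being a linear combination of other feature columns.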

We propose a method to convert any continuous distribution to a discrete distribution so that the likelihood can be computed and evaluated for different models. Notably, MAST only maximizes the continuous likelihood function but does not necessarily maximize the discrete likelihood function, suggesting DEGs identified from such transformations need to be used with caution.
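The conversion idea can be sketched as binning the continuous CDF at integer boundaries, p(k) = F(k+1) - F(k), with the remaining tail mass folded into the last bin. The log-normal distribution and its parameters below are illustrative, not the exact transformation (16)/(17) applied to MAST:

```python
import math

def lognorm_cdf(x, mu, sigma):
    """CDF of a log-normal: Phi((ln x - mu)/sigma); 0 for x <= 0."""
    if x <= 0:
        return 0.0
    return 0.5 * (1.0 + math.erf((math.log(x) - mu) / (sigma * math.sqrt(2))))

def discretize(cdf, kmax):
    """Discrete PMF from a continuous CDF: p(k) = F(k+1) - F(k) for k < kmax,
    with all remaining mass beyond kmax folded into the last bin."""
    pmf = [cdf(k + 1) - cdf(k) for k in range(kmax)]
    pmf.append(1.0 - cdf(kmax))
    return pmf

pmf = discretize(lambda x: lognorm_cdf(x, mu=1.0, sigma=0.8), kmax=50)
```

The resulting PMF sums to one by construction, so its discrete likelihood can be compared directly against count models such as NB and ZINB.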

To gain insights into model selection, we thoroughly compare and discuss the performance of eight different models on real scRNA-seq datasets with rich features. We use AIC and LRT for model evaluation and find that ZINB is the best performer. Moreover, the hybrid model, combining NB and ZINB, achieves an even lower AIC than ZINB. We comprehensively compare and rank these eight models and provide possible explanations for their performance, which can serve as a guideline for model selection in the practice of single-cell data analysis. We also apply these eight models to DEG identification and find that different models may lead to different DEG results. This finding is particularly important for collaborative disease cohort studies, in which experiments and analyses are usually performed in a number of laboratories using different models. Misuse of DEGs identified from different models may lead to misinterpretations of experimental results.

Researchers can leverage multiple models in their analysis to obtain DEGs for their downstream functional studies: (1) Instead of using a single model for DEG identification, we can use different models and select the best model for each gene using AIC as a selection metric; (2) We can use DEGs that are commonly discovered by all models (intersections), which may be a robust way to identify DEGs with high confidence, as these DEGs need to pass multiple model checks; (3) We can also aggregate DEGs identified from different models if the goal is to identify genes that are potentially important for certain biological processes. In this case, false positives may occur, and DEGs need to be validated with caution.
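The three strategies can be sketched with simple set operations over hypothetical per-model DEG calls:

```python
# Hypothetical DEG calls from three models.
deg = {
    "NB":   {"GENE1", "GENE2", "GENE5"},
    "ZINB": {"GENE1", "GENE2", "GENE3", "GENE5"},
    "MAST": {"GENE2", "GENE4", "GENE5"},
}

# Strategy 2: high-confidence DEGs found by every model.
consensus = set.intersection(*deg.values())

# Strategy 3: inclusive union; false positives possible, validate with care.
union = set.union(*deg.values())

# Strategy 1: per-gene model selection by (hypothetical) AIC, then take the
# selected model's DEG call for that gene.
aic = {"GENE1": {"NB": 980.0, "ZINB": 975.0, "MAST": 990.0}}
best_model = {g: min(m, key=m.get) for g, m in aic.items()}
per_gene = {g: g in deg[best_model[g]] for g in aic}
```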

In conclusion, we develop TensorZINB, a TensorFlow-based algorithm to solve ZINB, and propose several new methods for sequential initialization, feature selection and model conversion. We comprehensively compare and discuss the performance of hurdle, zero-inflated and vanilla models in scRNA-seq analysis. The ZINB model outperforms the other models tested and can be used for accurate, reliable and fast single-cell analysis.

ABBREVIATIONS

  • scRNA-seq single cell RNA-sequencing

  • DEG differentially expressed gene

  • AIC Akaike information criterion

  • LL log likelihood

  • LRT likelihood ratio test

  • PMF probability mass function

  • PDF probability density function

  • OLS ordinary least-squares regression

  • MLE maximum likelihood estimation

  • P Poisson

  • PH Poisson hurdle

  • ZIP zero-inflated Poisson

  • NB negative binomial

  • NBH negative binomial hurdle

  • ZINB zero-inflated negative binomial

Key Points

  • A Python package (TensorZINB) is developed using TensorFlow to solve the complex ZINB model for large-scale scRNA-seq analysis, and its performance is superior to that of existing ZINB solvers.

  • Feature engineering and selection algorithms are proposed to obtain optimized features achieving the lowest AIC.

  • Model selection protocol shows that the ZINB model, when solved with the proposed TensorZINB method, achieves a lower AIC compared to other statistical models for scRNA-seq analysis.

ACKNOWLEDGMENTS

Work in the laboratory of T.T.W. was supported by National Institute of Neurological Disorders and Stroke (NINDS grants R01 NS117372, R21 NS121284), Simons Foundation Autism Research Initiative (SFARI Bridge to Independence Award 551354) and Brain and Behavior Research Foundation (Young Investigator Award 27792).

AUTHORS’ CONTRIBUTION

T.C. and T.T.W. envisioned and designed the project. T.C. implemented the project and conducted the analysis. T.C. and T.T.W. wrote the manuscript.

DATA AVAILABILITY STATEMENT

Python 3.7.12 is used in this study. statsmodels is available at https://github.com/statsmodels/statsmodels. Stan is available at https://github.com/stan-dev/pystan. ZINB-WaVE is available at https://github.com/drisso/zinbwave. MAST is available at https://github.com/RGLab/MAST. A published autism scRNA-seq dataset is used in this study: https://autism.cells.ucsc.edu/. The proposed TensorZINB algorithm with detailed instructions is available at: https://github.com/wanglab-georgetown/tensorzinb. The Count Models Analysis and Compare package, which supports seven count models and the Stan, statsmodels and TensorFlow methods, is available at: https://github.com/wanglab-georgetown/countmodels.

Supplementary Material

Suppl1_bbad272

Author Biographies

Tao Cui, Ph.D., is a research specialist in the Department of Pharmacology and Physiology, Georgetown University, USA. Tao received his Ph.D. in Electrical Engineering from California Institute of Technology. His research interests include mathematical modeling, machine learning, high dimensional analysis, signal processing and genomics.

Tingting Wang, Ph.D., is an assistant professor in the Department of Pharmacology and Physiology, Georgetown University, USA. She received her Ph.D. in Neurobiology from Duke University. Her research interests include synaptic transmission, synaptic plasticity, neural circuitry, bioinformatics and genomics.

Contributor Information

Tao Cui, Department of Pharmacology and Physiology, Georgetown University Medical Center, SE407 Med/Dent, 3900 Reservoir Road NW, Washington, DC, USA.

Tingting Wang, Department of Pharmacology and Physiology, Georgetown University Medical Center, SE407 Med/Dent, 3900 Reservoir Road NW, Washington, DC, USA.

REFERENCES

  • 1. Potter SS. Single-cell RNA sequencing for the study of development, physiology and disease. Nat Rev Nephrol 2018;14(8):479–92.
  • 2. Gawad C, Koh W, Quake SR. Single-cell genome sequencing: current state of the science. Nat Rev Genet 2016;17(3):175–88.
  • 3. Usoskin D, Furlan A, Islam S, et al. Unbiased classification of sensory neuron types by large-scale single-cell RNA sequencing. Nat Neurosci 2015;18(1):145–53.
  • 4. Villani AC, Satija R, Reynolds G, et al. Single-cell RNA-seq reveals new types of human blood dendritic cells, monocytes, and progenitors. Science 2017;356(6335):283–95.
  • 5. Finak G, McDavid A, Yajima M, et al. MAST: a flexible statistical framework for assessing transcriptional changes and characterizing heterogeneity in single-cell RNA sequencing data. Genome Biol 2015;16:278.
  • 6. Chen W, Li Y, Easton J, et al. UMI-count modeling and differential expression analysis for single-cell RNA sequencing. Genome Biol 2018;19(1):70.
  • 7. Risso D, Perraudeau F, Gribkova S, et al. A general and flexible method for signal extraction from single-cell RNA-seq data. Nat Commun 2018;9(1):284.
  • 8. Velmeshev D, Schirmer L, Jung D, et al. Single-cell genomics identifies cell type-specific molecular changes in autism. Science 2019;364(6441):685–9.
  • 9. Yang L, Liu J, Lu Q, et al. SAIC: an iterative clustering approach for analysis of single cell RNA-seq data. BMC Genomics 2017;18(Suppl 6):689.
  • 10. Hastie T, Tibshirani R, Friedman JH. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd ed. New York: Springer, 2016.
  • 11. Cui T, Wang T. JOINT for large-scale single-cell RNA-sequencing analysis via soft-clustering and parallel computing. BMC Genomics 2021;22(1):47.
  • 12. Wilks SS. The large-sample distribution of the likelihood ratio for testing composite hypotheses. Ann Math Stat 1938;9:60–2.
  • 13. Abadi M, Agarwal A, Barham P, et al. TensorFlow: a system for large-scale machine learning. In: Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation (OSDI'16), 2016, pp. 265–83.
  • 14. Yee TW. The VGAM package for categorical data analysis. J Stat Softw 2010;32(10):1–34.
  • 15. Seabold S, Perktold J. Statsmodels: econometric and statistical modeling with Python. In: Proceedings of the 9th Python in Science Conference (SciPy 2010), 2010, pp. 92–6.
  • 16. Carpenter B, Gelman A, Hoffman MD, et al. Stan: a probabilistic programming language. J Stat Softw 2017;76:1.
  • 17. Squair JW, Gautier M, Kathe C, et al. Confronting false discoveries in single-cell differential expression. Nat Commun 2021;12(1):5692.
  • 18. Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc B Methodol 1995;57(1):289–300.
  • 19. Cameron AC, Trivedi PK. Regression Analysis of Count Data. Cambridge: Cambridge University Press, 1998.
  • 20. Zheng A, Casari A. Feature Engineering for Machine Learning: Principles and Techniques for Data Scientists. Sebastopol, CA: O'Reilly Media, 2018.
  • 21. Vuong QH. Likelihood ratio tests for model selection and non-nested hypotheses. Econometrica 1989;57(2):307–33.
  • 22. Wilson P. The misuse of the Vuong test for non-nested models to test for zero-inflation. Econ Lett 2015;127(2):51–3.
  • 23. Fisher RA, Yates F. Statistical Tables for Biological, Agricultural and Medical Research. London: Oliver & Boyd, 1938.



Articles from Briefings in Bioinformatics are provided here courtesy of Oxford University Press
