Abstract
Single cell RNA-sequencing (scRNA-seq) technology has significantly advanced the understanding of transcriptomic signatures. Although various statistical models have been used to describe the distribution of gene expression across cells, a comprehensive assessment of the different models is missing. Moreover, the growing number of features associated with scRNA-seq datasets creates new challenges for analytical accuracy and computing speed. Here, we developed a Python-based package (TensorZINB) to solve the zero-inflated negative binomial (ZINB) model using the TensorFlow deep learning framework. We used a sequential initialization method to solve the numerical stability issues associated with hurdle and zero-inflated models. A recursive feature selection protocol was used to optimize feature selections for data processing and downstream differentially expressed gene (DEG) analysis. We proposed a class of hybrid models combining nested models to further improve the model’s performance. Additionally, we developed a new method to convert a continuous distribution to its equivalent discrete form, so that statistical models can be fairly compared. Finally, we showed that the proposed TensorFlow algorithm (TensorZINB) was numerically stable and that its computing speed and performance were superior to those of existing ZINB solvers. Moreover, we implemented seven hurdle and zero-inflated statistical models in Python and systematically assessed their performance using a real scRNA-seq dataset. We demonstrated that the ZINB model achieved the lowest Akaike information criterion compared with other models tested. Taken together, TensorZINB was accurate, efficient and scalable for the implementation of ZINB and for large-scale scRNA-seq data analysis with DEG identification.
Keywords: zero-inflated and hurdle models, zero-inflated negative binomial, scRNA-seq, feature selection, DEG, tensorflow
INTRODUCTION
The maturation of single cell RNA-sequencing (scRNA-seq) technology has provided unique opportunities to explore the transcriptomic features and biological processes in health and disease at the single-cell level [1–4]. Although various statistical models and solver packages have been developed for scRNA-seq analysis, there is no consensus on model selection and package usage [5–7]. Moreover, increasing numbers of features, such as experimental batch, specimen identity, gender and age, are generated from single-cell studies [8]. Selection, engineering and incorporation of features pose enormous challenges for scRNA-seq analysis and downstream DEG identification [9, 10]. Zero-inflated negative binomial (ZINB) has been proposed to model the excessive zero reads that are typically observed in scRNA-seq data [7, 11]. However, due to the numerical instability of solving complex models, such as ZINB, existing packages may fail to converge, which makes them unsuitable for likelihood ratio test (LRT)-based DEG analysis [12]. From Occam’s razor or the law of parsimony, ‘entities should not be multiplied beyond necessity.’ We sought to (1) thoroughly compare the performance of various models used in scRNA-seq analysis, and (2) understand to what extent scRNA-seq analysis can benefit from the increase in model complexity. We use the Akaike information criterion (AIC) as a measure to determine the ‘necessity’ in Occam’s razor via comprehensive comparisons of different statistical models and packages for scRNA-seq analysis. Additionally, we provide guidelines for the selection of models, packages, and features and the experimental design.
In order to improve the accuracy and speed of solving the ZINB model, we first developed a Python package named TensorZINB using the open-source TensorFlow deep learning framework [13]. Although existing packages can be used to solve the ZINB model [7, 14–16], all the packages are numerically unstable and are not scalable for large-scale data analysis. We propose a sequential initialization algorithm to ensure the likelihood is monotonic as more features are added. With these improvements, TensorZINB is numerically stable and can be generalized to analyze a large number of genes in parallel when running on CPU and GPU. Furthermore, feature selection is a relatively underexplored area in previous single-cell studies. We propose a recursive feature selection protocol and show that feature selections can dramatically impact the AIC performance. Notably, TensorZINB not only supports separate feature sets on the zero-inflated and NB components in the ZINB model, respectively, but can also incorporate common features across all genes rather than using specific features for individual genes. We systematically compare TensorZINB to existing ZINB solvers on a real scRNA-seq dataset [8] and demonstrate that TensorZINB achieves a higher likelihood and faster computing speed.
In order to provide a comprehensive evaluation of different statistical models used in scRNA-seq analysis, we perform systematic comparisons of seven hurdle and zero-inflated models (Table 1). Different from existing comparisons, which typically call existing packages, we implement all seven models from scratch in Python so that they can be fairly compared using the same platform. To properly compare all the models, we develop a method that can generally convert any continuous model into an equivalent discrete model so that the likelihood and AIC can be computed for the discrete model. With this conversion, discrete models (such as ZINB) and continuous models (such as MAST [5]) can be compared on the same ground. For nested models, we propose a new class of hybrid models to combine NB and ZINB (NB is a special case of ZINB) to further improve performance.
Table 1.
Models compared for scRNA-seq analysis
| Base model | Hurdle | Zero-inflated |
|---|---|---|
| P (Poisson) | PH (Poisson Hurdle) | ZIP (Zero-inflated Poisson) |
| NB (Negative Binomial) | NBH (Negative Binomial Hurdle) | ZINB (Zero-inflated Negative Binomial) |
| Normal | MAST | |
We compare the AIC and computing speed for all seven different models and a hybrid (NB + ZINB) model on a published scRNA-seq dataset with rich features [8]. We find that the ZINB model, when solved with TensorZINB, achieves the lowest AIC compared to other non-hybrid models. When we compare the ZINB model with its nested models using LRT, we find that the ZINB model obtains the best performance in both likelihood and computing speed. Finally, we perform DEG analysis using all seven models and the hybrid model and compare the identified DEGs among the models.
METHODS
Models
Let $y_{ij}$ be the observed count from cell $i$ for gene $j$, where $y_{ij}$ is a non-negative integer and there are $N$ cells and $G$ genes. A statistical model for $y_{ij}$ is the probability mass function (PMF) of observing $y_{ij}$ given the model parameters $\theta$, i.e.

$$f(y_{ij} \mid \theta) = \Pr(Y_{ij} = y_{ij} \mid \theta). \tag{1}$$

For simplicity, we drop the subscripts throughout this paper when this does not cause ambiguity.
Hurdle and zero-inflated models
scRNA-seq data typically have excessive zeros [11, 17]. Standard count models assume that the zeros and the non-zeros come from the same data-generating process, which may not reflect the underlying mechanisms for excessive zeros. Zero counts could be true zero expression of genes or could be technical failures, which are generated by two distinct processes [17]. Both hurdle and zero-inflated models are described by two distributions: a Bernoulli distribution with parameter $\pi$ and a count distribution on the non-negative integers with PMF $f(y)$, e.g. Poisson, negative binomial, etc.

For hurdle models, the zeros and the non-zeros are generated from two different distributions separately. The Bernoulli distribution uniquely determines whether a count is zero or positive. If the realization is 1, the hurdle is crossed, and the conditional distribution of the positives is governed by a truncated-at-zero count data model induced from the vanilla model. The hurdle model can be expressed as

$$P(y) = \begin{cases} \pi, & y = 0,\\ (1-\pi)\,\dfrac{f(y)}{1-f(0)}, & y > 0. \end{cases} \tag{2}$$

Let $d_i = 1$ if $y_i = 0$ and $d_i = 0$ if $y_i > 0$. The log likelihood (LL) function for the hurdle model can be written as

$$\mathrm{LL} = \underbrace{\sum_i \big[d_i \log \pi + (1-d_i)\log(1-\pi)\big]}_{\text{Bernoulli part}} + \underbrace{\sum_i (1-d_i)\big[\log f(y_i) - \log(1-f(0))\big]}_{\text{count part}}. \tag{3}$$

From (3), the Bernoulli part and the count part can be estimated separately. This reduces the complexity of the analysis, especially for datasets with many features, which makes the hurdle model broadly used in scRNA-seq analysis [5]. However, due to the presence of $\log(1-f(0))$ in the count part, the log likelihood function in (3) is typically not convex in the distribution parameters. Intricate distributions, such as NB, typically incur numerical instability issues. When $\pi = f(0)$, the hurdle model (3) reduces to the vanilla model $f(y)$.

For zero-inflated models, the counts are modeled as a mixture of the Bernoulli distribution and the count distribution, i.e.

$$P(y) = \begin{cases} \pi + (1-\pi)f(0), & y = 0,\\ (1-\pi)f(y), & y > 0. \end{cases} \tag{4}$$

The log likelihood function for the zero-inflated model when observing counts $y_i$ can be written as

$$\mathrm{LL} = \sum_i d_i \log\big[\pi + (1-\pi)f(0)\big] + \sum_i (1-d_i)\big[\log(1-\pi) + \log f(y_i)\big]. \tag{5}$$

Unlike the hurdle model in (3), we cannot move $\pi$ out of the $\log$ in the zero term $\log[\pi + (1-\pi)f(0)]$ of (5). Therefore, the two parts in zero-inflated models must be estimated jointly, which makes them more challenging to solve.

One important distinction between the hurdle and zero-inflated models is that zero counts can have two different sources in the zero-inflated model, while they can only be generated from the Bernoulli part in the hurdle model. This difference potentially leads to a likelihood difference between these two models in single-cell data analysis, which may have important biological meanings.
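To make the separability contrast concrete, the following minimal numpy sketch evaluates (3) and (5) for a Poisson count PMF; the parameters pi and mu and the toy counts are illustrative assumptions, not values from the paper.

```python
import numpy as np
from scipy.stats import poisson

# Minimal sketch of the hurdle LL (3) and zero-inflated LL (5) with a
# Poisson count PMF f; pi and mu are illustrative parameters.
def hurdle_ll(y, pi, mu):
    d = (y == 0)
    bernoulli = np.sum(np.where(d, np.log(pi), np.log1p(-pi)))
    # truncated-at-zero count part; the two parts separate cleanly
    count = np.sum((~d) * (poisson.logpmf(y, mu) - np.log1p(-poisson.pmf(0, mu))))
    return bernoulli + count

def zi_ll(y, pi, mu):
    d = (y == 0)
    zero = np.log(pi + (1 - pi) * poisson.pmf(0, mu))   # pi stays inside the log
    pos = np.log1p(-pi) + poisson.logpmf(y, mu)
    return np.sum(np.where(d, zero, pos))               # joint estimation needed

y = np.array([0, 0, 1, 3, 0, 7, 2])
print(hurdle_ll(y, pi=0.4, mu=2.0), zi_ll(y, pi=0.3, mu=2.0))
```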
Model parameterization
We consider model parameterization of the count distribution $f(y)$ in (2) and (4) and of the parameter $\pi$ in the Bernoulli distribution. For the count distribution, we consider the Poisson distribution and the negative binomial distribution, which are characterized by the mean and/or the dispersion parameter. In many single-cell studies, counts are normalized so that each cell's total count is 10 000, i.e.

$$\tilde{y}_{ij} = \frac{10^4}{s_i}\,y_{ij}, \tag{6}$$

where $s_i = \sum_j y_{ij}$ is a cell-level normalization factor. Assuming that after normalization $\tilde{y}_{ij}$ has the same mean $\mu_j$ across all cells $i$, we have

$$\log \mathbb{E}[y_{ij}] = \log \mu_j + \log s_i - \log 10^4. \tag{7}$$

So if we consider the intercept and $\log s_i$ as two features, existing normalization is equivalent to modeling the log of the mean as a linear combination of features. Motivated by this observation, we consider a linear regression model on the log of the distribution mean. Together with a logistic model on $\pi$, we consider the following parameterization for a given gene $j$:

$$\boldsymbol{\mu}_j = \exp\!\big(X_\mu \mathbf{w}_{\mu,j} + X_{\mu,c}\,\mathbf{w}_{\mu,c}\big), \qquad \boldsymbol{\pi}_j = \sigma\!\big(X_\pi \mathbf{w}_{\pi,j} + X_{\pi,c}\,\mathbf{w}_{\pi,c}\big), \tag{8}$$

where $\sigma(x) = 1/(1+e^{-x})$ is the logistic function, $\exp$ and $\sigma$ are applied element-wise, $X_\mu$ and $X_\pi$ are features, $\mathbf{w}_{\mu,j}$ and $\mathbf{w}_{\pi,j}$ are coefficients for each gene $j$, $X_{\mu,c}$ and $X_{\pi,c}$ are common features shared across all genes, and $\mathbf{w}_{\mu,c}$ and $\mathbf{w}_{\pi,c}$ are the corresponding common coefficients without dependence on the gene $j$. Features can be moved from the gene-specific matrices to the common matrices to reduce overfitting and improve the AIC metric, where common features are leveraged as a form of regularization.

The dispersion parameter in ZINB is chosen as

$$\alpha_j = e^{w_{\alpha,j}}, \tag{9}$$

so that $\alpha_j$ is always positive and we do not need to constrain $w_{\alpha,j}$ to be positive during optimization. We do not enforce a linear regression model on $\alpha_j$, even though TensorZINB can readily solve this case as well.

With (8), we do not need to preprocess the observed counts $y_{ij}$ using normalization factors as in (6) but rather use $y_{ij}$ directly for analysis; as shown in [11], preprocessing may lead to mutual information loss. By adding the log library size $\log s_i$ as a feature in $X_\mu$ and/or $X_\pi$, normalization is performed implicitly. The existing normalization is equivalent to fixing the coefficient of this feature to 1.

Finally, for the model with parameterization (8) and (9), given the features, we solve the coefficients for a subset of genes $\mathcal{G}$ by maximizing the sum of the log likelihood

$$\max_{\mathbf{w}}\ \sum_{j \in \mathcal{G}} \sum_{i=1}^{N} \log f\big(y_{ij} \mid \mathbf{w}\big), \tag{10}$$

where $f$ is the PMF of the model.
DEG analysis
We use LRT to identify DEGs. Let $\mathrm{LL}(X, \mathbf{w})$ denote the log likelihood function in (10) given features $X$ and their corresponding coefficients $\mathbf{w}$, where we put all features in (10) into $X$ for notational simplicity. The LRT statistic for the null hypothesis that gene $j$ is not differentially expressed for the given conditions in features $X_d$ is

$$\lambda_j = 2\big[\mathrm{LL}([X, X_d], \mathbf{w}') - \mathrm{LL}(X, \mathbf{w})\big], \tag{11}$$

where $\mathbf{w}'$ contains the coefficients for the combined feature $[X, X_d]$. Assuming $X_d$ contains $k$ features, the test statistic $\lambda_j$ will be asymptotically chi-squared distributed with $k$ degrees of freedom as the number of cells $N$ approaches $\infty$, according to Wilks' theorem [12]. Finally, the P-values from LRT can be adjusted for multiplicity using the Benjamini and Hochberg method [18].
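As an illustration of the test, the sketch below computes LRT P-values per (11) and their Benjamini-Hochberg adjustment; the arrays ll_full and ll_reduced and the value of k are hypothetical inputs standing in for fitted models.

```python
import numpy as np
from scipy.stats import chi2
from statsmodels.stats.multitest import multipletests

ll_reduced = np.array([-1520.3, -980.7, -2210.1])  # illustrative per-gene LLs
ll_full = np.array([-1511.8, -980.2, -2195.4])     # with the k extra features
k = 1                                              # number of features in X_d

lam = 2.0 * (ll_full - ll_reduced)                 # LRT statistic, eq. (11)
pvals = chi2.sf(lam, df=k)                         # Wilks: chi-squared, k d.o.f.
_, padj, _, _ = multipletests(pvals, method="fdr_bh")  # Benjamini-Hochberg
print(padj < 0.01)                                 # DEG calls at adjusted P = 0.01
```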
Hurdle normal distribution model
In single-cell studies, it is common to transform integer counts into continuous variables for downstream analysis. For the popular log transformation, we have

$$z = \log(c\,y + 1), \tag{12}$$

where $y$ is the observed count, $z$ is the transformed variable and $c$ is a scaling factor. Instead of modeling $y$, many studies model $z$ as a continuous random variable, where a probability density function (PDF) $g(z)$ is assigned to the positive values. The hurdle model on $z$ is

$$p(z) = \begin{cases} \pi, & z = 0,\\ (1-\pi)\,g(z), & z > 0. \end{cases} \tag{13}$$

MAST [5] assumes a normal distribution on the positive $z$, with a mean given by a linear model of the features. The MAST model thus replaces the PMF on discrete non-negative integer counts with the PDF of a continuous distribution. Therefore, unlike the other count-based hurdle models, the truncation at zero (the $1-f(0)$ term in (2)) is not applied in MAST (13). The likelihoods derived from a PDF and a PMF have different meanings. As a PDF, not a PMF, is used in MAST, the likelihood calculated from MAST cannot be directly compared with those from PMF-based count models.
Conversion of MAST to a discrete model
We sought to determine the PMF equivalent of the PDF model for fair comparisons. Let the PDF of $z$ be $g(z)$, and write $z(y) = \log(c\,y + 1)$ for the transformed value of count $y$. We approximate the probability of observing the integer count $y$ from this PDF as

$$\Pr(Y = y) \approx \int_{z(y)}^{z(y+1)} g(t)\,dt. \tag{14}$$

For $y > 0$ we have

$$\Pr(Y = y) \approx g(z(y))\,r(y), \qquad r(y) = \frac{1}{g(z(y))}\int_{z(y)}^{z(y+1)} g(t)\,dt. \tag{15}$$

We have the PMF of MAST as

$$P_{\mathrm{MAST}}(y) = \begin{cases} \pi, & y = 0,\\ (1-\pi)\,g(z(y))\,r(y), & y > 0. \end{cases} \tag{16}$$

As in (3), the log likelihood of MAST is

$$\mathrm{LL} = \underbrace{\sum_i \big[d_i \log \pi + (1-d_i)\big(\log(1-\pi) + \log g(z_i)\big)\big]}_{A} + \underbrace{\sum_i (1-d_i)\log r(y_i)}_{B}, \tag{17}$$

where $d_i$ is as defined in (3), part $A$ is the log likelihood using the PDF, and part $B$ is the adjustment that makes it comparable to discrete models.

An important implication of (17) is that MAST only maximizes part $A$ without considering part $B$. As $B$ in (17) also depends on the model parameters through $g$, maximizing only $A$ as in MAST does not truly maximize the log likelihood in (17). It is critical to note that LRT-based DEG identification requires maximizing the log likelihood in order to compute the likelihood ratio. Therefore, Wilks' theorem [12] cannot be applied in MAST for DEG identification, where the log likelihood difference may not be chi-squared distributed. This indicates that the DEGs identified by MAST may not be guaranteed by existing statistical theory.
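The following sketch illustrates one way to carry out this conversion for a normal PDF on $z = \log(cy+1)$, computing part $A$ and the interval-probability adjustment $B$ of (17) with scipy. The parameters are illustrative, not fitted MAST estimates, and the adjustment follows the interval form of (14)-(15) assumed above.

```python
import numpy as np
from scipy.stats import norm

def z_of(y, c=1.0):
    return np.log(c * y + 1.0)

def discrete_equiv_ll(y, pi, mu_z, sigma_z, c=1.0):
    d = (y == 0)
    z = z_of(y, c)
    # part A: Bernoulli part plus the PDF log likelihood on the positives
    part_a = np.sum(np.where(d, np.log(pi),
                             np.log1p(-pi) + norm.logpdf(z, mu_z, sigma_z)))
    # part B: log of r(y) in (15), the interval probability over the PDF value
    interval = norm.cdf(z_of(y + 1, c), mu_z, sigma_z) - norm.cdf(z, mu_z, sigma_z)
    log_r = np.log(interval) - norm.logpdf(z, mu_z, sigma_z)
    part_b = np.sum((~d) * log_r)
    return part_a + part_b      # discrete-equivalent log likelihood, eq. (17)

y = np.array([0, 1, 2, 5, 0, 3])
print(discrete_equiv_ll(y, pi=0.3, mu_z=1.0, sigma_z=0.6))
```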
Comparisons of models
In this study, we thoroughly compare the seven models in Table 1. We consider Poisson and negative binomial in both hurdle and zero-inflated form, while the normal distribution is applied to the hurdle model only (MAST). We implement all seven models in Python from scratch so that all models can be compared fairly and the results are not confounded by implementation differences. Except for MAST, we solve the other six count models using Stan. Moreover, we propose TensorZINB, a TensorFlow-based algorithm, for solving ZINB (Figure 1; see Section New algorithm to solve ZINB using TensorFlow: TensorZINB).
Figure 1.

The architecture of the TensorFlow-based algorithm for solving ZINB (TensorZINB). Each of the weight vectors $\mathbf{w}_{\mu,j}$, $\mathbf{w}_{\mu,c}$, $\mathbf{w}_{\pi,j}$ and $\mathbf{w}_{\pi,c}$ in (8) is implemented as a fully connected Dense layer with a linear activation function. The dispersion weight $w_{\alpha,j}$ in (9) is also computed using a Dense layer with a constant input.
Zero-inflated negative binomial model
Overview
The negative binomial distribution is

$$f_{\mathrm{NB}}(y;\mu,\alpha) = \frac{\Gamma(y + \alpha^{-1})}{\Gamma(\alpha^{-1})\,\Gamma(y+1)}\left(\frac{1}{1+\alpha\mu}\right)^{\alpha^{-1}}\left(\frac{\alpha\mu}{1+\alpha\mu}\right)^{y}, \tag{18}$$

where $\mu$ is the mean and $\alpha$ is the dispersion parameter. The PMF for ZINB is

$$P(y) = \pi\,d + (1-\pi)\,f_{\mathrm{NB}}(y;\mu,\alpha), \tag{19}$$

where $d = 1$ if $y = 0$ and $d = 0$ if $y > 0$.
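For reference, the ZINB PMF in (19) can be evaluated with scipy's NB distribution, which uses the $(n, p)$ parameterization with $n = 1/\alpha$ and $p = 1/(1+\alpha\mu)$; the sketch below is an illustration, not the TensorZINB implementation.

```python
import numpy as np
from scipy.stats import nbinom

def zinb_pmf(y, pi, mu, alpha):
    # scipy's nbinom(n, p) has mean n*(1-p)/p, so n = 1/alpha, p = 1/(1+alpha*mu)
    n = 1.0 / alpha
    p = 1.0 / (1.0 + alpha * mu)
    return pi * (y == 0) + (1.0 - pi) * nbinom.pmf(y, n, p)   # eq. (19)

y = np.arange(6)
print(zinb_pmf(y, pi=0.2, mu=3.0, alpha=0.5))  # sums to ~1 over all y >= 0
```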
Comparisons of existing ZINB solving packages
VGAM [14], statsmodels [15], ZINB-WaVE [7] and Stan [16] can solve the ZINB model. However, they have at least one of the following issues:
(i) Convergence: the log likelihood of ZINB is difficult to optimize. Besides converging to local optima that are far from the global solution, these packages commonly incur numerical issues on real single-cell data.
(ii) Monotonicity of log likelihood: for LRT, we need to compute the increase in log likelihood after adding conditional features. Existing algorithms cannot guarantee log likelihood monotonicity, which leads to a negative LL difference. LRT fails to apply in this case.
(iii) Computing speed: the computing speed is slow, especially for large-scale datasets with many cells and genes, which makes most packages not scalable.
New algorithm to solve ZINB using TensorFlow: TensorZINB
In order to address the challenges discussed above, we develop a Python package, TensorZINB, to solve the ZINB model using the TensorFlow deep learning framework. We create a customized loss function in TensorFlow to maximize the ZINB log likelihood. To overcome the numerical stability issues in computing the Gamma function and the powers in (18), we transform all terms in (18) into the log domain and use numerically stable TensorFlow functions. For a given gene, we can write the log likelihood from (19) as

$$\mathrm{LL} = \sum_{i:\,y_i=0} \log\big[\pi_i + (1-\pi_i)(1+\alpha\mu_i)^{-\alpha^{-1}}\big] + \sum_{i:\,y_i>0}\Big[\log(1-\pi_i) + \mathrm{lgamma}(y_i+\alpha^{-1}) - \mathrm{lgamma}(\alpha^{-1}) - \mathrm{lgamma}(y_i+1) + y_i\log(\alpha\mu_i) - (y_i+\alpha^{-1})\log(1+\alpha\mu_i)\Big], \tag{20}$$

where $\mathrm{lgamma}$ is the log Gamma function. We can further improve stability by rewriting (20) as

$$\mathrm{LL} = \sum_{i:\,y_i=0} \mathrm{logsumexp}\big(\log\pi_i,\ \log(1-\pi_i) + \ell_i(0)\big) + \sum_{i:\,y_i>0}\big[\log(1-\pi_i) + \ell_i(y_i)\big], \tag{21}$$

where

$$\ell_i(y) = \mathrm{lgamma}(y+\alpha^{-1}) - \mathrm{lgamma}(\alpha^{-1}) - \mathrm{lgamma}(y+1) + y\log(\alpha\mu_i) - (y+\alpha^{-1})\,\mathrm{softplus}(\log(\alpha\mu_i)) \tag{22}$$

and, with $\eta_i$ denoting the logit of $\pi_i$ in (8), $\log\pi_i = -\mathrm{softplus}(-\eta_i)$ and $\log(1-\pi_i) = -\mathrm{softplus}(\eta_i)$. All computations in (21) are performed in the log domain using the numerically stable functions lgamma, logsumexp and softplus. Each of the weight vectors $\mathbf{w}_{\mu,j}$, $\mathbf{w}_{\mu,c}$, $\mathbf{w}_{\pi,j}$ and $\mathbf{w}_{\pi,c}$ in (8) is implemented as a fully connected Dense layer with a linear activation function. The dispersion weight $w_{\alpha,j}$ in (9) is also computed using a Dense layer with a constant input (Figure 1). TensorZINB solves the model for a batch of genes rather than each gene individually by leveraging GPU computing, which further increases the computing speed. We use RMSProp for optimization with an initial learning rate of 0.02. The learning rate is multiplied by 0.8 if the loss does not improve for 10 epochs, until a minimum learning rate of 0.002 is reached. Training stops when the loss change remains less than 0.05 for 50 epochs or when the maximum of 3000 epochs is reached.
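A condensed sketch of this loss is shown below. It assumes the logits $\eta$, the log means and $w_\alpha$ are already produced (TensorZINB obtains them from Dense layers as in Figure 1), and it handles a single gene rather than a batch.

```python
import tensorflow as tf

def zinb_nll(y, eta, log_mu, w_alpha):
    y = tf.cast(y, tf.float32)
    inv_alpha = tf.exp(-w_alpha)                 # 1/alpha with alpha = exp(w_alpha)
    log_alpha_mu = w_alpha + log_mu              # log(alpha * mu)
    # eq. (22): NB log PMF; softplus(log(alpha*mu)) = log(1 + alpha*mu)
    ell = (tf.math.lgamma(y + inv_alpha) - tf.math.lgamma(inv_alpha)
           - tf.math.lgamma(y + 1.0) + y * log_alpha_mu
           - (y + inv_alpha) * tf.math.softplus(log_alpha_mu))
    log_pi = -tf.math.softplus(-eta)             # log(pi) from the logit eta
    log_one_minus_pi = -tf.math.softplus(eta)    # log(1 - pi)
    ell0 = -inv_alpha * tf.math.softplus(log_alpha_mu)   # ell evaluated at y = 0
    # eq. (21): logsumexp for zero counts, plain sum for positives
    ll_zero = tf.reduce_logsumexp(
        tf.stack([log_pi, log_one_minus_pi + ell0]), axis=0)
    ll_pos = log_one_minus_pi + ell
    ll = tf.where(tf.equal(y, 0.0), ll_zero, ll_pos)
    return -tf.reduce_sum(ll)                    # loss = negative log likelihood

y = tf.constant([0.0, 0.0, 1.0, 4.0, 2.0])
eta = tf.fill([5], -1.0); log_mu = tf.fill([5], 0.7); w_alpha = tf.constant(-0.5)
print(zinb_nll(y, eta, log_mu, w_alpha).numpy())
```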
Sequential initialization
It is common for an algorithm to converge to local optima, which may lead to a negative LL difference in LRT. We propose a sequential initialization method to solve this issue. We note that one model is nested in another if the former can be obtained by constraining the parameters of the latter. For instance, Poisson is a nested model of NB (when the dispersion $\alpha \to 0$) and NB is a nested model of ZINB (by setting $\pi = 0$). As the number of model parameters increases from Poisson to NB and to ZINB, we solve the models sequentially from Poisson to NB and to ZINB, where the log likelihood increases incrementally.
For a sequence of nested models $M_1 \subset M_2 \subset \cdots$, assuming $M_k$ is solved with optimal parameter $\mathbf{w}_k^*$, we can initialize the solver of $M_{k+1}$ with $(\mathbf{w}_k^*, \mathbf{w}_0)$, where $\mathbf{w}_0$ consists of the parameters set to values such that $M_{k+1}$ reduces to $M_k$. In some cases, such as NB and ZINB, we can even estimate a better $\mathbf{w}_0$ than the reduced-model values. We consider the model chains described below; a sketch of the NB and ZINB warm starts follows this list. For notational simplicity, we drop the gene subscript $j$ and ignore the common features in (8).

(i) Poisson $\rightarrow$ NB $\rightarrow$ ZINB:

Poisson: the model only depends on the mean $\boldsymbol{\mu}$. We compute the sample mean of the counts $\bar{y}$ and set the initial parameter in (8) as $\mathbf{w}_\mu = (0, \ldots, 0, \log\bar{y})$, where all entries are initialized to zero except the last one, corresponding to an all-one feature or intercept. Let $\mathbf{w}_\mu^*$ be the optimal parameter for the Poisson model.

NB: let $\hat{\mu}_i = \exp(\mathbf{x}_{\mu,i}^\top \mathbf{w}_\mu^*)$, where $\mathbf{x}_{\mu,i}$ is the $i$-th row of $X_\mu$. Following [19], we estimate the initial dispersion $\hat{\alpha}$ by running the auxiliary OLS regression

$$\frac{(y_i - \hat{\mu}_i)^2 - \hat{\mu}_i}{\hat{\mu}_i} = \alpha\,\hat{\mu}_i + \varepsilon_i. \tag{23}$$

The NB model is then solved with initial values $(\mathbf{w}_\mu^*, \log\hat{\alpha})$. Let $\mathbf{w}_\mu^{**}$ and $\alpha^{**}$ be the optimal parameters for the NB model.

ZINB: let $\hat{\mu}_i^{**} = \exp(\mathbf{x}_{\mu,i}^\top \mathbf{w}_\mu^{**})$ and let $\hat{p}$ be the observed fraction of positive counts. We can compute the probability of observing non-zero counts from NB as

$$p_{\mathrm{NB}} = \frac{1}{N}\sum_{i=1}^{N}\big[1 - f_{\mathrm{NB}}(0;\hat{\mu}_i^{**}, \alpha^{**})\big]. \tag{24}$$

We then have the probability of observing non-zero counts from ZINB as

$$p_{\mathrm{ZINB}} = (1-\pi)\,p_{\mathrm{NB}}. \tag{25}$$

Equating $p_{\mathrm{ZINB}}$ with $\hat{p}$, let

$$w_{\pi,0} = \sigma^{-1}\!\left(\min\!\left(\max\!\left(1 - \frac{\hat{p}}{p_{\mathrm{NB}}},\,\epsilon\right),\,1-\epsilon\right)\right), \tag{26}$$

where $\epsilon$ is a small constant and $\sigma^{-1}$ is the inverse of the logistic function (the logit). Finally, we set the initial parameter for the zero-inflation part in (8) as $\mathbf{w}_\pi = (w_{\pi,0}, 0, \ldots, 0)$, where all entries are initialized to zero except the first one, corresponding to an all-one feature or intercept.

We can also initialize ZINB from Poisson directly, as Poisson is a nested model of ZINB, where $\hat{\alpha}$ is again initialized using (23). The Poisson initialization avoids an additional model fitting of NB and reduces computing time when the NB fit is not needed.

(ii) LRT: we need to find the log likelihood difference between maximizing the model over the full feature space and maximizing it over the reduced feature space. Let $\mathbf{w}_r$ be a vector including only the features in the reduced feature space, and $(\mathbf{w}_r, \mathbf{w}_d)$ be a vector in the full feature space, where $\mathbf{w}_d$ corresponds to the additional features in the full feature space. We first solve the reduced model for $\mathbf{w}_r$ and let the optimal value be $\mathbf{w}_r^*$. As we use a linear model in (8), we initialize the full model with $(\mathbf{w}_r^*, \mathbf{w}_d)$, where we set $\mathbf{w}_d = \mathbf{0}$. The log likelihood difference is then guaranteed to be non-negative, provided that the solver only ever increases the log likelihood from the initial parameter of the full model.
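The warm starts in (23)-(26) can be sketched in a few lines of numpy, as below; the helper names and the toy counts are illustrative, and the OLS slope of (23) is computed in closed form.

```python
import numpy as np

def init_alpha(y, mu_hat):
    # regress ((y - mu)^2 - mu) / mu on mu without an intercept, eq. (23)
    t = ((y - mu_hat) ** 2 - mu_hat) / mu_hat
    alpha_hat = np.sum(t * mu_hat) / np.sum(mu_hat ** 2)  # closed-form OLS slope
    return max(alpha_hat, 1e-6)                           # keep alpha positive

def init_pi_intercept(y, mu_hat, alpha_hat, eps=1e-4):
    # f_NB(0) = (1 + alpha*mu)^(-1/alpha), so p_NB follows eq. (24)
    p_nb = np.mean(1.0 - (1.0 + alpha_hat * mu_hat) ** (-1.0 / alpha_hat))
    p_obs = np.mean(y > 0)                                # observed positive fraction
    pi0 = np.clip(1.0 - p_obs / p_nb, eps, 1.0 - eps)     # from eq. (25)
    return np.log(pi0 / (1.0 - pi0))                      # logit, eq. (26)

y = np.array([0, 0, 0, 1, 2, 0, 5, 3, 0, 1])
mu_hat = np.full(y.shape, y.mean())                       # Poisson warm start
a0 = init_alpha(y, mu_hat)
print(a0, init_pi_intercept(y, mu_hat, a0))
```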
Hybrid model for DEG identification
For two models $M_1$ and $M_2$, where $M_1$ is a nested model of $M_2$, we can choose the model with the lower AIC to perform DEG analysis for each gene, rather than choosing the better model based on AIC across all genes and then using the same model for every gene. Taking NB and ZINB as an example, for gene $j$ we compare the AICs of the two models under the full feature space. If $\mathrm{AIC}_{\mathrm{NB},j} \leq \mathrm{AIC}_{\mathrm{ZINB},j}$, we use NB for the DEG analysis of gene $j$; otherwise, we use ZINB. We denote this hybrid model as 'NB+ZINB'; it reduces the potential overfitting of ZINB.
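A minimal sketch of the per-gene choice, with hypothetical AIC values:

```python
# aic_nb and aic_zinb map gene ids to full-model AICs (illustrative numbers)
aic_nb = {"ENSG01": 5210.4, "ENSG02": 4801.9}
aic_zinb = {"ENSG01": 5216.1, "ENSG02": 4745.3}

chosen = {g: ("NB" if aic_nb[g] <= aic_zinb[g] else "ZINB") for g in aic_nb}
print(chosen)  # {'ENSG01': 'NB', 'ENSG02': 'ZINB'}
```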
Feature engineering and selection
Feature engineering is the process of formulating the most appropriate features given the data, model and tasks [20], which is crucial for optimizing the model performance in machine learning. However, formulating the ‘most appropriate features’ is subjective, which may introduce bias to the analysis. Feature engineering is largely unexplored in single-cell research, with only a few studies implicitly touching upon this concept [5]. In our study, we not only use existing features in the single-cell datasets but also generate several derived features, such as total UMIs in a cell (UMI), Log of total UMIs (UMI_log), and Log of ngene (ngene_log), where the rationale behind UMI_log is from (7). We can further improve our feature set by utilizing polynomials on existing features and/or by combining them through operations such as multiplication.
Notably, using excessive features, especially with many derived features, may lead to overfitting. Thus, we sought to determine a feature selection process to only include features that are necessary. We consider feature selection based on the AIC metric. Complex models, such as ZINB, are complicated to solve, and it is time-consuming to fit the model with all feature combinations. To reduce the complexity of the feature selection process, we propose a recursive feature selection method.
For the top-down approach, we start with a feature set containing all features and eliminate one feature recursively at each step until the feature set is empty. Specifically, at the $k$-th step, let $S_k$ denote the current set of features. When $k = 0$, $S_0$ contains all features. We iterate through each feature $s$ in $S_k$ and train the model using the features in $S_k \setminus \{s\}$, i.e. removing $s$ from $S_k$. We compute the AIC, denoted $a_s$, after training the model. The feature whose removal minimizes the AIC is chosen, i.e. $s_k = \arg\min_{s \in S_k} a_s$. We record this AIC as $A_k = a_{s_k}$ and the corresponding feature set as $F_k = S_k \setminus \{s_k\}$. We remove $s_k$ from $S_k$, set $S_{k+1} = F_k$, and start the next iteration using $S_{k+1}$. This process repeats until the feature set is empty. Finally, the selected feature set is the $F_{k^*}$ that achieves the lowest AIC among all steps, i.e. $k^* = \arg\min_k A_k$. Let $n_f$ be the number of features. The complexity of this top-down approach is $O(n_f^2)$ model fits, compared with the $O(2^{n_f})$ fits of the full (exhaustive) algorithm.
Similarly, we can adopt a bottom-up approach where we start with an empty feature set and add one feature recursively at each step until the feature set contains all features.
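A compact sketch of the top-down elimination is given below; fit_aic is a hypothetical stand-in for training the model on a feature set and returning its total AIC.

```python
def fit_aic(features):
    # toy stand-in: pretend three features are useful and the rest are noise
    useful = {"UMI_log", "ngene_log", "Capbatch"}
    return 1000.0 - 50.0 * len(useful & set(features)) + 3.0 * len(features)

def top_down_select(all_features):
    current = set(all_features)
    best_set, best_aic = set(current), fit_aic(current)
    while current:
        # try removing each feature; keep the removal with the lowest AIC
        trials = {s: fit_aic(current - {s}) for s in current}
        s_min = min(trials, key=trials.get)
        current = current - {s_min}                 # S_{k+1} = F_k
        if trials[s_min] < best_aic:
            best_aic, best_set = trials[s_min], set(current)
    return best_set, best_aic                       # F_{k*} and its AIC

features = ["UMI", "UMI_log", "ngene", "ngene_log", "Sex", "Age", "Capbatch"]
print(top_down_select(features))
```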
RESULTS
scRNA-seq data used for evaluation
For comparisons between TensorZINB and other existing ZINB solvers, and comparisons between different statistical models, we use a scRNA-seq dataset from the prefrontal cortex and anterior cingulate cortex of 15 autism patients and 16 controls with rich features [8]. There are in total 10 original features available in the published dataset: Region, Sex, Age, Capbatch, Seqbatch, PMI, RIN, Ribo_pct, Mito_pct and Diagnosis. Feature selection and engineering are performed using the methods in Section 2.9 Feature engineering and selection.
As statsmodels is slow, and the Poisson hurdle, zero-inflated Poisson and NB hurdle models are not scalable and cannot be used on the full dataset, we generate a smaller dataset with 340 genes for the initial tests. We select the 20 genes from each cell type with the lowest P-values after running a Wilcoxon rank-sum test (17 cell types and 340 genes in total). Default parameters are used in ZINB-WaVE, statsmodels and Stan. As ZINB-WaVE does not support separate features for the logistic and NB parts, we use the same feature set on both parts, i.e. UMIs_log, ngene, ngene_log, Sex, Age, Capbatch, PMI, RIN, Ribo_pct, Mito_pct. All model training and testing are performed on a computer with an Intel Xeon E5-2686 v4 CPU @ 2.30GHz with 62GB of RAM and an NVIDIA Tesla K80 GPU with 17GB of memory.
Validation of the feature selection method
First, we find that, in the scRNA-seq dataset used for evaluation, Capbatch uniquely determines Seqbatch, and Capbatch CB3, CB4, CB8, CB9 uniquely determine the Brain region. Therefore, we remove these redundant features. Diagnosis is a feature specifically used for DEG analysis, so we do not include it except for DEG identification. For feature engineering, we generate three new features, total UMIs in a cell (UMI), Log of total UMIs (UMI_log) and Log of ngene (ngene_log) and adopt ngene from MAST. We consider feature selection of 11 features on both the logistic and NB parts in ZINB. Taken together, we have a total of 22 features to choose from.
We first validate the proposed top-down feature selection protocol in ZINB using the dataset with 340 genes described in Section scRNA-seq data used for evaluation. We compare the sum of the log likelihood over all genes for different numbers of features (Figure 2). For a given number of features, we use the feature set with the highest likelihood. The log likelihood is a monotonically increasing function of the number of features. Beyond 13 features, the likelihood gain begins to diminish (Figure 2A). Similarly, we compare the AIC for different numbers of features. For a given number of features, we select the feature set with the lowest AIC. We find that the AIC is minimized with 19 features (Figure 2B). The selected features include UMIs, UMIs_log, ngene, ngene_log, Sex, Age, Capbatch, PMI, RIN, Ribo_pct, Mito_pct for the NB part, and UMIs, UMIs_log, ngene_log, Sex, Age, Capbatch, Ribo_pct, Mito_pct for the logistic part.
Figure 2.

Validation of the proposed feature selection protocol. A. Log likelihood increases with the number of features. The best log likelihood for a given number of features is used in the plot. B. AIC changes with the number of features. AIC reaches its smallest value when 19 features are used. The best AIC for a given number of features is used in the plot. C. The relationship between log likelihood and the number of features. All log likelihoods in the 256 testing cases are shown. D. The relationship between AIC and the number of features. All AICs in the 256 testing cases are shown.
Finally, we examine how the log likelihood and AIC change with different combinations of features (not only with the best combinations). We plot the log likelihood and AIC from all 254 testing cases that have been run (Figure 2C and D). Interestingly, we find that for a given number of features, some features have greater impacts on both the likelihood and AIC, suggesting that these features may affect the transcriptome more significantly than others. These results demonstrate that the proposed feature selection protocol improves data fitting and provides valuable information about the underlying biological processes.
Validation of TensorZINB
We generate a simulation dataset for the validation of TensorZINB, where the parameters estimated by TensorZINB are compared with the known model parameters. In the ZINB model (19), we choose

$$\mu_i = \exp(a_1 x_i + a_0), \qquad \pi_i = \sigma(b_1 x_i + b_0), \qquad \alpha = e^{w_\alpha}, \tag{27}$$

where $x_i$ is a synthetic feature with entries generated from the uniform distribution between 0 and 1. We assign arbitrary values to the parameters in (27) to generate a dataset with 20 000 samples and subsequently use it to validate the convergence of TensorZINB. From the log likelihood, we find that TensorZINB quickly converges to the maximum likelihood estimate (MLE; Figure 3A) after 250 iterations, which is in the neighborhood of the true values (denoted as a dot in Figure 3B and C), demonstrating that TensorZINB solves the ZINB model correctly.
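A sketch of how such a dataset can be generated in numpy under (27) is shown below; the coefficient values are arbitrary illustrative choices, and the NB draws use the Gamma-Poisson mixture representation.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 20_000
x = rng.uniform(0.0, 1.0, n)                     # synthetic feature, eq. (27)
mu = np.exp(1.2 * x + 0.5)                       # a1 = 1.2, a0 = 0.5 (arbitrary)
pi = 1.0 / (1.0 + np.exp(-(0.8 * x - 1.0)))      # b1 = 0.8, b0 = -1.0 (arbitrary)
alpha = np.exp(-0.7)                             # w_alpha = -0.7 (arbitrary)

# NB(mu, alpha) via its Gamma-Poisson mixture, then inflate zeros per (19)
lam = rng.gamma(shape=1.0 / alpha, scale=alpha * mu)
y = rng.poisson(lam)
y[rng.uniform(size=n) < pi] = 0                  # Bernoulli zero component
print(y[:10], (y == 0).mean())
```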
Figure 3.

Validation of TensorZINB using a simulated dataset. A. Convergence of the log likelihood over iterations on the simulated dataset. The maximum likelihood estimate (MLE) is shown in the plot (purple box). B. Convergence of the coefficients $a_0$, $a_1$, $b_0$ and $b_1$ in (27) over iterations on the simulated dataset. True values (green dots) and MLEs (purple boxes) are shown. C. Convergence of the dispersion $\alpha$ over iterations on the simulated dataset. The true value (green dot) and MLE (purple box) are shown. D. Histogram of the fitted zero-inflation probability $\pi_i$ for gene ENSG00000198840 in cluster L2/3 in a published scRNA-seq dataset; the mean value of $\pi_i$ is 0.14. E. Convergence of the log likelihood over iterations on the real scRNA-seq dataset. F. Convergence of the model parameters over iterations on the real scRNA-seq dataset.
Next, we apply TensorZINB to real scRNA-seq data [8]. As an example, we consider gene ENSG00000198840 in cluster L2/3 (Figure 3D–F). The features are chosen based on the feature selection in the previous subsection. Figure 3D shows the histogram of the fitted zero-inflation probabilities $\pi_i$ from (19). With TensorFlow's default random initialization, the log likelihood converges after 750 iterations (Figure 3E). We find the mean of $\pi_i$ to be 0.14, indicating that, with a mean probability of 0.14, zero counts are generated by the Bernoulli part of ZINB (Figure 3D). Note that the histogram shows diverse $\pi_i$ across cells, which indicates that some cells are more susceptible to zero counts from the Bernoulli part, possibly due to technical failures. Rather than imposing hand-selected rules to filter 'bad' cells, cell status may be inferred from TensorZINB through the zero-inflation probability $\pi_i$.
Comparison of TensorZINB to existing ZINB solvers
Next, we compare the performance of different ZINB solvers and evaluate the quality of the solution returned by each algorithm. In our experiments, VGAM fails to solve for most of the genes, so we do not compare it further in this study. Let the log likelihood of gene $j$ returned by algorithm $A$ be denoted $\mathrm{LL}_{A,j}$, using the 340 genes from the scRNA-seq dataset described in Section scRNA-seq data used for evaluation. We take the maximum likelihood over all four tested algorithms (TensorZINB, statsmodels, Stan and ZINB-WaVE) for each gene and cell type and denote it $\mathrm{LL}^*_j$. We compare the following ratio for each algorithm:

$$r_A = \frac{\sum_j \big(\mathrm{LL}^*_j - \mathrm{LL}_{A,j}\big)}{\big|\sum_j \mathrm{LL}^*_j\big|}. \tag{28}$$

The higher the quantity in (28), the worse the solution. We compare the total log likelihood loss of the four algorithms as in (28). We find that TensorZINB is within 0.1% of the highest likelihood and obtains performance superior to the other ZINB solvers (Figure 4A).
Figure 4.

Comparisons of TensorZINB and other ZINB solving packages. A. The ratio of the total log likelihood of each algorithm to the best total log likelihood. TensorZINB, statsmodels, Stan and ZINB-WaVE are compared. B. The percentage of cases where the likelihood difference is negative when Diagnosis is added as a feature in the DEG analysis. C. The mean computing time of running LRT for each gene in DEG identification with different algorithms. D. Comparisons of DEGs identified by TensorZINB, Stan and ZINB-WaVE using a real scRNA-seq dataset with 340 genes. E. Comparisons of DEGs identified by TensorZINB, Stan and statsmodels using the dataset with 340 genes.
Next, to test whether each algorithm is feasible for LRT in DEG analysis, we compare the likelihood without Diagnosis to that with Diagnosis as a feature and compute the percentage of cases where the log likelihood difference is negative (Figure 4B). We find that the log likelihood difference is always positive in TensorZINB, while other packages have a higher percentage of negative cases, suggesting that TensorZINB is the most suitable for LRT-based DEG analysis. We examine DEGs identified from the dataset with 340 genes using TensorZINB, ZINB-WaVE and Stan, and find that the majority of DEGs detected are common among different algorithms (Figure 4D and 4E), possibly because strict criteria are used to select the 340 genes for testing.
Finally, we compare the mean computing time of LRT per gene for the different algorithms on CPU; TensorZINB on CPU has the fastest computing speed (Figure 4C). Taken together, TensorZINB achieves a higher likelihood, maintains the monotonicity of the likelihood in LRT, and is computationally efficient. It is scalable and robust and can be used for DEG analysis on scRNA-seq datasets. In the remainder of the study, we use TensorZINB to solve ZINB for comparisons with other statistical models.
Comparisons of different models in scRNA-seq analysis
Next, we comprehensively compare the performance of the seven models (Table 1) plus the hybrid NB+ZINB model described in Section Hybrid model for DEG identification. We use the same dataset with 340 genes as in Section 3.1 scRNA-seq data used for evaluation and select features as in Section 2.9 Feature engineering and selection. For MAST, we use (17) to compute the likelihood so that all models can be compared, and the empirical Bayes method to regularize the variance is not used. We compute the ratio of the difference between the best likelihood across all models and the likelihood of each model to the best likelihood, as defined in (28) (Figure 5A). Similarly, we compute the ratio of the difference between the AIC of each model and the best AIC across all models to the best AIC (Figure 5B), i.e.

$$r_M = \frac{\sum_j \big(\mathrm{AIC}_{M,j} - \mathrm{AIC}^*_j\big)}{\sum_j \mathrm{AIC}^*_j}, \tag{29}$$

where $\mathrm{AIC}^*_j$ is the lowest AIC across all models for gene $j$. Smaller values indicate a higher likelihood and a lower AIC. We find that the ZINB model obtains the highest likelihood and the lowest AIC, suggesting that it achieves the best performance among the models tested (Figure 5A–C).

Figure 5.

Comparisons of different statistical models. A. The ratio of the total log likelihood of each model to the best total log likelihood. Poisson, Poisson hurdle, zero-inflated Poisson, MAST, NB, NB hurdle, ZINB and NB+ZINB are compared using the dataset with 340 genes. ZINB is solved by the proposed TensorZINB method. B. The ratio of the total AIC of each model to the best total AIC using the dataset with 340 genes. C. AIC difference between the other models and ZINB (AIC of other model - AIC of ZINB) using the dataset with 340 genes. D. The fitting of the count distribution of gene ENSG00000183117 in cluster L2/3 by different models. E. The fitting of the count distribution of gene ENSG00000198840 in cluster L2/3 by different models. F. Comparisons of DEGs identified by Poisson, Poisson hurdle, zero-inflated Poisson, NB, NB hurdle and ZINB using the dataset with 340 genes. G. Comparisons of DEGs identified by MAST, NB, ZINB and NB+ZINB using a real large-scale scRNA-seq dataset. H. The mean computing time of running LRT for each gene using different models on CPU.
To evaluate the goodness of fit, we apply LRT to full and nested models and use AIC for non-nested models. The Vuong test [21] is not used on non-nested models, as it has been reported to be unsuitable for zero-inflated non-nested models [22]. In Table 2, we list the P-values from LRT between nested models, computed using the sum of the log likelihood across all genes, whenever the column model is a nested model of the row model. The AIC differences between the column model and the row model (column model AIC - row model AIC) are listed in Table 3. In summary, all the results demonstrate that the performance of the eight models can be ranked as NB+ZINB > ZINB > NB > NB hurdle > MAST > zero-inflated Poisson > Poisson hurdle > Poisson. Interestingly, the vanilla NB model performs better than the NB hurdle model, which indicates that the NB hurdle model solved with Stan may converge to local optima. Also note that the vanilla NB is not a nested model of the NB hurdle model with a logit function on the hurdle part; hence, there is no guarantee that the NB hurdle attains a higher LL than the vanilla NB.
Table 2.
P-values of LRT between nested models. A value is shown only when the column model is nested in the row model; empty cells indicate non-nested pairs.
| Model | P | NB | PH | NBH | ZIP |
|---|---|---|---|---|---|
| NB | 0.0 | | | | |
| PH | | | | | |
| NBH | | | 0.0 | | |
| ZIP | 0.0 | | | | |
| ZINB | 0.0 | 0.0 | | | 0.0 |
Table 3.
The difference in mean AIC between models (column model AIC - row model AIC)
| Model | P | PH | ZIP | MAST | NB | NBH | ZINB | NB+ZINB |
|---|---|---|---|---|---|---|---|---|
| P | 0.00 | −415.97 | −611.50 | −10884.96 | −11243.58 | −11084.49 | −11391.47 | −11393.96 |
| PH | 415.97 | 0.00 | −195.53 | −10469.00 | −10827.61 | −10668.52 | −10975.51 | −10977.99 |
| ZIP | 611.50 | 195.53 | 0.00 | −10273.47 | −10632.08 | −10472.99 | −10779.98 | −10782.47 |
| MAST | 10884.96 | 10469.00 | 10273.47 | 0.00 | −358.61 | −199.53 | −506.51 | −509.00 |
| NB | 11243.58 | 10827.61 | 10632.08 | 358.61 | 0.00 | 159.09 | −147.90 | −150.39 |
| NBH | 11084.49 | 10668.52 | 10472.99 | 199.53 | −159.09 | 0.00 | −306.99 | −309.47 |
| ZINB | 11391.47 | 10975.51 | 10779.98 | 506.51 | 147.90 | 306.99 | 0.00 | −2.49 |
| NB+ZINB | 11393.96 | 10977.99 | 10782.47 | 509.00 | 150.39 | 309.47 | 2.49 | 0.00 |
Next, in order to further investigate the performance of each model, we examine the fitting of individual genes by each model. To visualize the goodness of fit, we compare all models without any features except the intercept, so that the PMF does not depend on features and can be displayed. Two genes, ENSG00000183117 and ENSG00000198840, in cluster L2/3 from the dataset in [8] are used. We compare the histogram of observed single-cell counts with the PMF of each model (Figure 5D and E). For gene ENSG00000183117, the observed probability of zero counts is zero, so the hurdle and zero-inflated models are not shown. The transformed PMF (16) is shown for MAST. We find that NB fits the experimental data accurately, while MAST shows a shift relative to the distribution of real counts, which is likely due to the use of the normal distribution and the log transformation. Poisson does not fit the data well, possibly due to over-dispersion in single-cell data.
For gene ENSG00000198840, ZINB reduces to NB without using any features, so it is not included in the comparison. Notably, when features are used, ZINB indeed has a non-zero probability on the Bernoulli part, which indicates the zero inflation may come from individual cells. Without any features, NB and NB hurdle fit the data better than other models (Figure 5E). The observations on the fittings of individual genes are consistent with model ranking using likelihood and AIC.
DEG analysis
Lastly, we evaluate DEG identification by the different models using LRT. We compute the LRT statistic $\lambda_j$ in (11) using Diagnosis as the testing condition for the eight models (the seven models in Table 1 plus NB+ZINB). We do not apply any additional filtering using non-model-based criteria, such as fold change, to ensure that DEG differences arise solely from the choice of model. We first use the same 340-gene dataset as in Section 3.1 and select features as in Section 2.9. A P-value threshold of 0.01 is used to determine whether a gene is a DEG. We compare all the DEGs identified by each model (Figure 5F and Supplementary Figure 1A). We find that the DEGs identified by different models from the 340-gene dataset are very similar, probably due to the stringent criteria used for gene selection.
Then, we extend the analysis to all genes in the full scRNA-seq dataset instead of using only 340 genes. Since the Poisson hurdle, zero-inflated Poisson and NB hurdle models are not scalable and cannot be used on the full dataset, we compare the DEGs identified by Poisson, MAST, NB, ZINB and NB+ZINB. P-values are adjusted using the Benjamini-Hochberg protocol [18], and an adjusted P-value of 0.01 is chosen to determine the DEGs. We compute the total numbers of DEGs (combined DEGs from all 17 cell clusters) identified by the different models and find that the ZINB model identifies the highest number of DEGs (Table 4). Then, we compare the identities of the DEGs from the different models on all genes in the scRNA-seq dataset (Figure 5G and Supplementary Figure 1B). The results show that DEGs found by different models only partially overlap with each other. The observation that DEGs found by MAST differ from those found by NB and ZINB is possibly because MAST does not maximize the likelihood function, and hence LRT does not apply to MAST, as shown in Section 2.4 Conversion of MAST to a discrete model. Some DEGs are detected by ZINB but not by NB, which may be due to the fact that the probabilities of observing zero counts differ between autism patients and controls, and excessive zero counts cannot be modeled well by NB.
Table 4.
The number of DEGs across all 17 cell types identified by different models using a real large-scale scRNA-seq dataset
| Model | Number of DEGs |
|---|---|
| Poisson | 13104 |
| MAST | 10751 |
| NB | 10765 |
| ZINB | 15827 |
| NB+ZINB | 15390 |
Finally, we compare the mean computing time of running LRT for each gene using different models on CPU (Figure 5H). The computing time is the total time to compute the two likelihoods in (11). We find that zero-inflated Poisson and NB hurdle, both implemented in Stan, are much slower than the other models. MAST is fast, as both the linear and logistic regressions are simple to solve. However, since LRT may not apply to MAST, users need to be cautious when using MAST for DEG identification. Although the fast computing time of NB makes it a good option for exploratory analysis, ZINB has the lowest AIC and a fast computing time, and it can perform DEG identification reliably, especially for large-scale datasets with excessive zero counts.
DISCUSSION AND CONCLUSIONS
With the rapid growth of single-cell techniques, it has become challenging to perform data analysis and DEG identification accurately and efficiently, especially with massive dataset sizes and increasing numbers of features. In this study, we propose a Python-based algorithm, TensorZINB, to solve the ZINB model, which can run on both CPU and GPU. TensorZINB obtains performance superior to that of other ZINB solvers.
We develop a protocol for feature engineering and selection using a recursive feature elimination-based method on the AIC metric, which is used as a measurement of ‘necessity’ in Occam’s razor. The feature selection process can also provide valuable information about the connection between the biological meaning of features and transcriptomic regulation at the single-cell level. Although certain redundant features are difficult to avoid due to experimental limitations, the efficiency of using single-cell data can be improved with good design of experimental plans. One example of optimizing experimental design is to randomly shuffle samples into different sequencing experiments using the Fisher–Yates shuffle [23], to ensure that none of the feature columns in the design matrix can be written as a linear combination of others.
We propose a method to convert any continuous distribution to a discrete distribution so that the likelihood can be computed and evaluated for different models. Notably, MAST only maximizes the continuous likelihood function but does not necessarily maximize the discrete likelihood function, suggesting DEGs identified from such transformations need to be used with caution.
In order to gain insights into model selection, we thoroughly compare and discuss the performance of eight different models on real scRNA-seq datasets with rich features. We use AIC and LRT for model evaluation and find that ZINB is the best performer. Moreover, the hybrid model, which combines NB and ZINB, achieves an even lower AIC than ZINB. We comprehensively compare and rank these eight models and provide possible explanations for their performance, which can serve as a guideline for model selection in the practice of single-cell data analysis. We also apply these eight models to DEG identification and find that different models may lead to different DEG results. This finding is particularly important for collaborative disease cohort studies, in which experiments and analyses are usually performed in a number of laboratories using different models. Misuse of DEGs identified from different models may lead to misinterpretations of experimental results.
Researchers can leverage multiple models in their analysis to obtain DEGs for their downstream functional studies: (1) Instead of using a single model for DEG identification, we can use different models and select the best model for each gene using AIC as a selection metric; (2) We can use DEGs that are commonly discovered by all models (intersections), which may be a robust way to identify DEGs with high confidence, as these DEGs need to pass multiple model checks; (3) We can also aggregate DEGs identified from different models if the goal is to identify genes that are potentially important for certain biological processes. In this case, false positives may occur, and DEGs need to be validated with caution.
In conclusion, we develop TensorZINB, a TensorFlow-based algorithm to solve ZINB, and propose several new methods for sequential initialization, feature selection and model conversion. We comprehensively compare and discuss the performance of hurdle, zero-inflated, and vanilla models in scRNA-seq analysis. The ZINB model is the best performer over other models and can be used for accurate, reliable and fast single-cell analysis.
ABBREVIATIONS
scRNA-seq single cell RNA-sequencing
DEG differentially expressed gene
AIC Akaike information criterion
LL log likelihood
LRT likelihood ratio test
PMF probability mass function
PDF probability density function
OLS ordinary least-squares regression
MLE maximum likelihood estimation
P Poisson
PH Poisson hurdle
ZIP zero-inflated Poisson
NB negative binomial
NBH negative binomial hurdle
ZINB zero-inflated negative binomial
Key Points
A Python package (TensorZINB) is developed using TensorFlow to solve the complex ZINB model for large-scale scRNA-seq analysis, and its performance is superior to that of existing ZINB solvers.
Feature engineering and selection algorithms are proposed to obtain optimized features achieving the lowest AIC.
Model selection protocol shows that the ZINB model, when solved with the proposed TensorZINB method, achieves a lower AIC compared to other statistical models for scRNA-seq analysis.
ACKNOWLEDGMENTS
Work in the laboratory of T.T.W. was supported by National Institute of Neurological Disorders and Stroke (NINDS grants R01 NS117372, R21 NS121284), Simons Foundation Autism Research Initiative (SFARI Bridge to Independence Award 551354) and Brain and Behavior Research Foundation (Young Investigator Award 27792).
AUTHORS’ CONTRIBUTION
T.C. and T.T.W. envisioned and designed the project. T.C. implemented the project and conducted the analysis. T.C. and T.T.W. wrote the manuscript.
DATA AVAILABILITY STATEMENT
Python 3.7.12 is used in this study. statsmodels is available at https://github.com/statsmodels/statsmodels. Stan is available at https://github.com/stan-dev/pystan. ZINB-WaVE is available at https://github.com/drisso/zinbwave. MAST is available at https://github.com/RGLab/MAST. A published autism scRNA-seq dataset is used in this study: https://autism.cells.ucsc.edu/. The proposed TensorZINB algorithm with detailed instructions is available at: https://github.com/wanglab-georgetown/tensorzinb. The Count Models Analysis and Compare package, which supports the seven count models with Stan, statsmodels and TensorFlow solvers, is available at: https://github.com/wanglab-georgetown/countmodels.
Author Biographies
Tao Cui, Ph.D., is a research specialist in the Department of Pharmacology and Physiology, Georgetown University, USA. Tao received his Ph.D. in Electrical Engineering from California Institute of Technology. His research interests include mathematical modeling, machine learning, high dimensional analysis, signal processing and genomics.
Tingting Wang, Ph.D., is an assistant professor in the Department of Pharmacology and Physiology, Georgetown University, USA. She received her Ph.D. in Neurobiology from Duke University. Her research interests include synaptic transmission, synaptic plasticity, neural circuitry, bioinformatics and genomics.
Contributor Information
Tao Cui, Department of Pharmacology and Physiology Georgetown University Medical Center SE407 Med/Dent 3900 Reservoir Road, N.W. Washington D.C., USA.
Tingting Wang, Department of Pharmacology and Physiology Georgetown University Medical Center SE407 Med/Dent 3900 Reservoir Road, N.W. Washington D.C., USA.
REFERENCES
1. Potter SS. Single-cell RNA sequencing for the study of development, physiology and disease. Nat Rev Nephrol 2018;14(8):479–92.
2. Gawad C, Koh W, Quake SR. Single-cell genome sequencing: current state of the science. Nat Rev Genet 2016;17(3):175–88.
3. Usoskin D, Furlan A, Islam S, et al. Unbiased classification of sensory neuron types by large-scale single-cell RNA sequencing. Nat Neurosci 2015;18(1):145–53.
4. Villani AC, Satija R, Reynolds G, et al. Single-cell RNA-seq reveals new types of human blood dendritic cells, monocytes, and progenitors. Science 2017;356(6335):283–95.
5. Finak G, McDavid A, Yajima M, et al. MAST: a flexible statistical framework for assessing transcriptional changes and characterizing heterogeneity in single-cell RNA sequencing data. Genome Biol 2015;16:278.
6. Chen W, Li Y, Easton J, et al. UMI-count modeling and differential expression analysis for single-cell RNA sequencing. Genome Biol 2018;19(1):70.
7. Risso D, Perraudeau F, Gribkova S, et al. A general and flexible method for signal extraction from single-cell RNA-seq data. Nat Commun 2018;9(1):284.
8. Velmeshev D, Schirmer L, Jung D, et al. Single-cell genomics identifies cell type-specific molecular changes in autism. Science 2019;364(6441):685–9.
9. Yang L, Liu J, Lu Q, et al. SAIC: an iterative clustering approach for analysis of single cell RNA-seq data. BMC Genomics 2017;18(Suppl 6):689.
10. Hastie T, Tibshirani R, Friedman JH. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd ed. New York: Springer, 2016.
11. Cui T, Wang T. JOINT for large-scale single-cell RNA-sequencing analysis via soft-clustering and parallel computing. BMC Genomics 2021;22(1):47.
12. Wilks SS. The large-sample distribution of the likelihood ratio for testing composite hypotheses. Ann Math Stat 1938;9:60–2.
13. Abadi M, Agarwal A, Barham P, et al. TensorFlow: a system for large-scale machine learning. In: Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation (OSDI'16), 2016, 265–83.
14. Yee TW. The VGAM package for categorical data analysis. J Stat Softw 2010;32(10):1–34.
15. Seabold S, Perktold J. Statsmodels: econometric and statistical modeling with Python. In: Proceedings of the 9th Python in Science Conference (SciPy 2010), 2010, 92–6.
16. Carpenter B, Gelman A, Hoffman MD, et al. Stan: a probabilistic programming language. J Stat Softw 2017;76:1.
17. Squair JW, Gautier M, Kathe C, et al. Confronting false discoveries in single-cell differential expression. Nat Commun 2021;12(1):5692.
18. Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc B Methodol 1995;57(1):289–300.
19. Cameron AC, Trivedi PK. Regression Analysis of Count Data. Cambridge: Cambridge University Press, 1998.
20. Zheng A, Casari A. Feature Engineering for Machine Learning: Principles and Techniques for Data Scientists. Sebastopol, CA: O'Reilly Media, 2018.
21. Vuong QH. Likelihood ratio tests for model selection and non-nested hypotheses. Econometrica 1989;57(2):307–33.
22. Wilson P. The misuse of the Vuong test for non-nested models to test for zero-inflation. Econ Lett 2015;127(2):51–3.
23. Fisher RA, Yates F. Statistical Tables for Biological, Agricultural and Medical Research. London: Oliver & Boyd, 1938.