Abstract
Background
Single-cell RNA sequencing (scRNA-seq) provides extensive opportunities to explore cellular heterogeneity but is often limited by substantial technical noise and variability. The prevalence of zero counts, arising from both biological variation and technical dropout events, poses significant challenges for downstream analyses. Existing imputation methods face inherent trade-offs: statistical approaches maintain interpretability but exhibit limited capacity for capturing complex, non-linear gene expression relationships, whereas deep learning methods demonstrate superior flexibility but are prone to overfitting and lack mechanistic interpretability, particularly in settings with limited sample sizes.
Methods
We present ZILLNB (Zero-Inflated Latent factors Learning-based Negative Binomial), a novel computational framework that integrates zero-inflated negative binomial (ZINB) regression with deep generative modeling. ZILLNB employs an ensemble architecture combining Information Variational Autoencoder (InfoVAE) and Generative Adversarial Network (GAN) to learn latent representations at cellular and gene levels. These latent factors serve as dynamic covariates within a ZINB regression framework, with parameters iteratively optimized through an Expectation-Maximization algorithm. This approach enables systematic decomposition of technical variability from intrinsic biological heterogeneity.
Results
Comparative evaluations across multiple scRNA-seq datasets demonstrate ZILLNB’s superior performance. In cell type classification tasks using mouse cortex and human PBMC datasets, ZILLNB achieved the highest Adjusted Rand index (ARI) and Adjusted Mutual Information (AMI) among tested methods, with improvements ranging from 0.05 to 0.2 over VIPER, scImpute, DCA, DeepImpute, SAVER, scMultiGAN and ALRA. For differential expression analysis validated against matched bulk RNA-seq data, ZILLNB demonstrated improvements ranging from 0.05 to 0.3 for area under the Receiver Operating Characteristic curve (AUC-ROC) and the Precision-Recall curve (AUC-PR) compared to standard and other imputation methods, with consistently lower false discovery rates. Application to idiopathic pulmonary fibrosis (IPF) datasets revealed distinct fibroblast subpopulations undergoing fibroblast-to-myofibroblast transition, validated through marker gene expression and pathway enrichment analyses.
Conclusion
ZILLNB provides a principled framework for addressing technical artifacts in scRNA-seq data while preserving biological variation. The integration of statistical modeling with deep learning enables robust performance across common single-cell analysis tasks, including cell type identification, differential expression analysis, and rare cell population discovery.
Supplementary Information
The online version contains supplementary material available at 10.1186/s12859-025-06296-w.
Keywords: Single-cell RNA sequencing, Deep learning, Zero-inflated negative binomial distribution, Denoising, Differentially expressed gene identification
Introduction
Single-cell RNA sequencing (scRNA-seq) technology has significantly advanced our ability to explore diverse cell populations, facilitating the discovery of novel cell types and states [1], elucidating relationships between previously characterized and newly identified cellular phenotypes [2, 3], and improving our understanding of transcriptomic variation in health and disease contexts [4, 5]. However, despite these advantages, scRNA-seq data frequently contain substantial technical noise and variability arising from multiple sources [6–8]. Specifically, heterogeneity can be categorized into cell-specific, gene-specific, and experiment-specific variations. Cell-specific measurement errors, notably those related to variability in library sizes, pose a significant challenge. These variations primarily stem from differences in sequencing depth and can occur even among cells considered biologically similar [9]. Another prevalent issue is the high frequency of zero counts in scRNA-seq data, often resulting from technical dropout events with distinct patterns across different cell types [10]. Gene-specific errors add further complexity, resulting from interactions between genes or between genes and environmental factors that remain inadequately captured or modeled [11, 12]. Experiment-specific variability, on the other hand, often originates from biases inherent in experimental procedures, such as PCR amplification bias.
Addressing these diverse sources of variability is essential to ensure reliable downstream analyses. Unlike bulk RNA sequencing data, scRNA-seq datasets require specialized computational methods to manage their unique characteristics, particularly the abundance of zeros and associated artifacts. Existing methods generally fall into two main categories: statistical modeling approaches and deep learning-based methods. Statistical approaches, including scImpute [13], VIPER [14], SAVER [15], and ALRA [16], utilize probabilistic frameworks explicitly designed to accommodate the zero-inflation and count distributions inherent to scRNA-seq data. Deep learning-based techniques, such as DCA [17], DeepImpute [18] and, more recently, scMultiGAN [19], leverage neural network architectures, particularly autoencoders and generative adversarial networks (GANs), to capture complex nonlinear relationships among genes. Although these deep learning models provide significant flexibility and scalability, they can suffer from interpretability issues and susceptibility to overfitting, especially when sample sizes are limited.
To leverage the advantages of both statistical modeling and deep learning approaches, we introduce the zero-inflated latent factors learning-based negative binomial (ZILLNB) model. ZILLNB integrates zero-inflated negative binomial (ZINB) regression with deep latent factor models, providing a unified approach for simultaneously addressing various sources of technical variability (Fig. 1). By explicitly modeling latent structures at both cell and gene levels, ZILLNB accurately recovers gene expression signals while preserving biologically meaningful variation. Additionally, ZILLNB is designed for broad applicability, effectively handling datasets with or without explicit covariates. We validate ZILLNB’s effectiveness across multiple analytical tasks, including cell type identification, classification of cell subsets, and differential expression analysis. The combined statistical and deep learning approach of ZILLNB thus provides a robust, interpretable, and adaptable solution for analyzing scRNA-seq data.
Fig. 1.
Workflow of the ZILLNB Denoising Method. ZILLNB processes a count matrix using an InfoVAE-GAN model to generate latent factors for cells and genes, which are used to fit a ZINB model. The output is a dense, denoised matrix. The schematic diagram at the bottom shows the decomposition of the link function of the mean value in matrix form, in terms of the regression parameters, the intercept terms, vectors of ones, the latent factors, and an optional covariate matrix. Dimensions represent the respective sizes of the matrices
Methods
The ZILLNB model comprises three main steps (Fig. 1). First, an ensemble-based deep generative framework combining InfoVAE and GAN extracts latent features from both cellular and gene-level perspectives. Second, the derived latent factors are utilized to fit a ZINB model, refining both latent representations and regression coefficients through iterative optimization via the Expectation-Maximization (EM) algorithm. This iterative refinement distinctly decomposes variability in the raw data into cell-specific sampling noise and intrinsic gene-specific heterogeneity. Finally, adjusted mean parameters generate a denoised and complete expression matrix.
Latent factors learning
We use the ensemble InfoVAE-GAN model for manifold learning and for identifying potentially missing cell and gene groupings in scRNA-seq datasets [20, 21]. To mitigate the overfitting common in traditional VAEs, we integrate InfoVAE [22, 23] with GAN [24] and replace the Kullback–Leibler (KL) divergence with the Maximum Mean Discrepancy (MMD) as the regularizer for both the cell-wise structure V and the gene-wise structure U, while constraining the latent space with normal priors:
| 1 |
In Eq. (1), the MMD compares two probability densities using a Gaussian kernel: one set of latent variables follows the prior distribution in the latent space, and x denotes latent samples derived from observed data y. The InfoVAE-GAN architecture includes three interconnected neural networks: (1) an encoder that maps a sample y (either a cell or gene vector from the expression matrix Y) to the latent space; (2) a decoder that reconstructs input samples from their latent codes; and (3) a discriminator that distinguishes real data from generated samples by assigning each sample a probability of being real. Two adaptive weighting parameters balance the three terms of the training objective: the reconstruction loss, the prior alignment term (the MMD loss), and the generative accuracy term. Additional network details and hyperparameter selection are provided in Supplementary Sect. 1.1.
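As an illustration of the MMD regularizer, the following minimal sketch estimates the squared MMD between a batch of latent samples and draws from a standard normal prior using a Gaussian kernel. It is not the ZILLNB implementation: the function names, the fixed kernel bandwidth, and the use of the biased (V-statistic) estimator are assumptions made for this example.

```python
import numpy as np

def gaussian_kernel(a, b, bandwidth=1.0):
    """Gaussian (RBF) kernel matrix between the rows of a and the rows of b."""
    sq_dists = (
        np.sum(a ** 2, axis=1)[:, None]
        + np.sum(b ** 2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    sq_dists = np.maximum(sq_dists, 0.0)  # guard against tiny negative round-off
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

def mmd2(latent, prior, bandwidth=1.0):
    """Biased (V-statistic) estimate of the squared MMD between two samples."""
    k_ll = gaussian_kernel(latent, latent, bandwidth).mean()
    k_pp = gaussian_kernel(prior, prior, bandwidth).mean()
    k_lp = gaussian_kernel(latent, prior, bandwidth).mean()
    return k_ll + k_pp - 2.0 * k_lp

rng = np.random.default_rng(0)
prior = rng.standard_normal((500, 2))          # draws from the N(0, I) prior
matched = rng.standard_normal((500, 2))        # latent codes that match the prior
shifted = rng.standard_normal((500, 2)) + 2.0  # latent codes that drift away from it

mmd_matched = mmd2(matched, prior)
mmd_shifted = mmd2(shifted, prior)
print(f"matched: {mmd_matched:.4f}, shifted: {mmd_shifted:.4f}")
```

The estimate stays near zero when the latent codes already follow the prior and grows when they drift away, which is the behaviour a prior-alignment penalty needs.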
ZINB fitting
Consider a single-cell RNA sequencing dataset represented by an expression matrix Y with dimensions M (genes) by N (cells). The element of Y in row i and column j is the observed expression count for gene i in cell j, modeled using a ZINB distribution. We introduce latent binary variables indicating whether a zero is generated by a dropout event:
| 2 |
Here, the dropout probability is gene-specific, and the negative binomial component is parameterized by a mean and a dispersion. The indicator function equals 1 if its argument is true and 0 otherwise. By expressing the mean parameter through a log-link function, we incorporate latent cell- and gene-specific structures
| 3 |
where U and V represent latent factor matrices associated with genes and cells, respectively. Two intercept terms account for cell- and gene-specific effects, while the regression parameters are encapsulated in two coefficient matrices, one paired with each latent factor matrix. To improve model fit, we iteratively update the latent factor matrix U (obtained from the deep generative model) together with the regression parameters via the EM algorithm, while holding the cell-wise factor V fixed because it already captures sufficient biological structure (Figs. 4 and S2). To prevent overfitting, we include regularization terms on U and on the two intercepts in the objective function. Detailed derivations and computations for the latent-variable updates and the auxiliary function are provided in Supplementary Sect. 1.2. In practice, the model converges after a few iterations (see Fig. S1).
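To make the role of the latent dropout indicator concrete, the sketch below evaluates a ZINB probability mass function and the E-step posterior probability that an observed count arose from a dropout event. It assumes the common mean/dispersion parameterization of the negative binomial (variance = mu + mu^2/theta); the names `mu`, `theta`, and `pi` are placeholders for this example, and the exact parameterization used by ZILLNB is given in its Supplementary Sect. 1.2.

```python
import math

def nb_log_pmf(y, mu, theta):
    """Log pmf of a negative binomial with mean mu and dispersion (size) theta."""
    return (
        math.lgamma(y + theta) - math.lgamma(theta) - math.lgamma(y + 1)
        + theta * math.log(theta / (theta + mu))
        + y * math.log(mu / (theta + mu))
    )

def zinb_pmf(y, mu, theta, pi):
    """Mixture: a point mass at zero (weight pi) plus a negative binomial (weight 1 - pi)."""
    point_mass = 1.0 if y == 0 else 0.0
    return pi * point_mass + (1.0 - pi) * math.exp(nb_log_pmf(y, mu, theta))

def dropout_posterior(y, mu, theta, pi):
    """E-step: posterior probability that the latent dropout indicator equals 1."""
    if y > 0:
        return 0.0  # a positive count cannot come from the point mass at zero
    nb_zero = math.exp(nb_log_pmf(0, mu, theta))
    return pi / (pi + (1.0 - pi) * nb_zero)

# Illustrative values: mean 5, dispersion 2, dropout probability 0.3.
posterior_at_zero = dropout_posterior(0, mu=5.0, theta=2.0, pi=0.3)
total_mass = sum(zinb_pmf(y, 5.0, 2.0, 0.3) for y in range(500))
print(posterior_at_zero, total_mass)
```

In an EM loop, these posterior probabilities would weight the zero observations when the negative binomial parameters are updated in the M-step.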
Fig. 4.
Latent factor learning preserves cell-wise and gene-wise information in the mouse brain dataset [40]. A, B UMAP projections of cell latent factors V and gene latent factors U, annotated by cell types and Seurat-predicted clusters. C Comparison of the learned gene latent factor matrix with a standard Gaussian matrix using Wishart ensemble. The upper panel shows empirical spectral distributions versus the theoretical Marchenko-Pastur law (red curve); the lower panel illustrates the level-spacing distributions alongside the Wigner Surmise law (blue curve). D Heatmap of the top 20 marker genes per cell type based on the gene latent factor matrix, with genes as columns and latent variables as rows. E Heatmap illustrating the overlap between the top 20 marker genes and Seurat-predicted gene clusters. F AUCell scores for the predicted gene clusters across various cell types, reflecting activity levels within the original dataset. G UMAP projection of the mESC cell cycle dataset [43] based on the ZILLNB-derived matrix, annotated by predicted and true cell types. H Clustering performance assessment using Purity, F score, ARI, and AMI metrics of imputed matrix (“ZILLNBUbeta”), compared with COMSE [44], M3Drop [45], Scran [46], Scran cyclone [47], default Seurat method and Seurat cell-cycle scoring [48]
Furthermore, external covariates can be included by extending equation (3) with an additional covariate term, in which W represents external covariate data with its own regression coefficients and S represents the number of covariates. During optimization, since V remains fixed, we can concatenate V and W into a combined matrix, and similarly merge the two corresponding coefficient matrices, without altering the algorithm.
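The concatenation argument can be checked numerically. The sketch below assumes an additive log-link of the form "gene intercept + cell intercept + (coefficients on V) + (coefficients on U) + (coefficients on W)"; this form, the coefficient names `alpha`, `beta`, `gamma`, the intercept names, and the matrix sizes are all assumptions for this example, since the text fixes only the names U, V, W, M, N, and S.

```python
import numpy as np

rng = np.random.default_rng(1)
M, N = 6, 5        # genes, cells
K, L, S = 3, 2, 2  # hypothetical latent dimensions and number of covariates

V = rng.standard_normal((K, N))      # cell latent factors (held fixed)
U = rng.standard_normal((L, M))      # gene latent factors
W = rng.standard_normal((S, N))      # external covariates
alpha = rng.standard_normal((K, M))  # coefficients paired with V (hypothetical name)
beta = rng.standard_normal((L, N))   # coefficients paired with U (hypothetical name)
gamma = rng.standard_normal((S, M))  # coefficients paired with W (hypothetical name)
gene_intercept = rng.standard_normal(M)
cell_intercept = rng.standard_normal(N)

def log_mean(cell_side, cell_coef):
    """Assumed log-link: intercepts + cell-side term + gene-side term, shape (M, N)."""
    return (
        gene_intercept[:, None]
        + cell_intercept[None, :]
        + cell_coef.T @ cell_side
        + U.T @ beta
    )

# Covariates handled as a separate additive term ...
log_mu_separate = log_mean(V, alpha) + gamma.T @ W

# ... or folded into the fixed cell-side matrix by stacking rows.
V_combined = np.vstack([V, W])
alpha_combined = np.vstack([alpha, gamma])
log_mu_combined = log_mean(V_combined, alpha_combined)

print(np.allclose(log_mu_separate, log_mu_combined))
```

Because the two forms give the same linear predictor, an optimizer written for a fixed cell-side matrix needs no changes when covariates are present.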
Data imputation
Building on latent factors learning and ZINB fitting steps described above, we introduce the concept of adjusted mean as follows:
| 4 |
Here, the cell-side inputs are the jth columns of the fixed cell latent factor matrix V and the covariate matrix W, respectively. The regression coefficients paired with V and W enter through their ith columns, the coefficients paired with the gene latent factors through their jth column, and the gene latent factor matrix U through its ith column; all of these are estimated by the EM algorithm. By intentionally omitting the two intercept terms, we normalize the original mean parameter. This normalization allows direct comparability of expression levels across samples while preserving intrinsic biological variability, and is consistent with established denoising practices used in bulk RNA-seq analyses [9]. Of the outputs defined in Eq. (4), Output 1 is preferred when covariates primarily encode nuisance variation (e.g., batch effects): omitting the covariate term facilitates integration across datasets while preserving the biological signal of interest within a unified framework. Unless otherwise stated, we use Output 1 by default in all experiments.
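A minimal sketch of the adjusted mean follows, under the same assumed additive log-link as above: the intercepts are dropped, and the covariate term is either omitted (the behaviour described for Output 1) or retained. The function name, coefficient names, and matrix sizes are hypothetical.

```python
import numpy as np

def adjusted_mean(V, U, alpha, beta, W=None, gamma=None):
    """Exponentiated linear predictor without intercepts; covariate term is optional."""
    log_mu = alpha.T @ V + U.T @ beta  # shape (M, N); intercepts deliberately left out
    if W is not None and gamma is not None:
        log_mu = log_mu + gamma.T @ W  # keep the covariate contribution
    return np.exp(log_mu)

rng = np.random.default_rng(2)
M, N, K, L, S = 4, 3, 2, 2, 1  # hypothetical sizes
V = rng.standard_normal((K, N))
U = rng.standard_normal((L, M))
W = rng.standard_normal((S, N))
alpha = rng.standard_normal((K, M))
beta = rng.standard_normal((L, N))
gamma = rng.standard_normal((S, M))

output_1 = adjusted_mean(V, U, alpha, beta)                  # covariate term omitted
with_covariates = adjusted_mean(V, U, alpha, beta, W, gamma)  # covariate term kept
print(output_1.shape)
```

The two results differ only by the multiplicative factor exp(gamma.T @ W), so choosing between them amounts to deciding whether covariate-driven variation should remain in the denoised matrix.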
Dimension reduction and clustering
To effectively visualize high-dimensional raw and imputed data (as shown in Figs. 2, 3, 4, 5 and 6), we adopted the standard workflow from Seurat [25] (https://satijalab.org/seurat/articles/pbmc3k_tutorial). Raw data were processed following the default procedures recommended by Seurat. For the imputed data, we initially applied column-wise log-normalization, followed by principal component analysis to identify principal components used for dimensionality reduction via Uniform Manifold Approximation and Projection (UMAP). In cases involving large-scale datasets, we additionally employed t-Distributed Stochastic Neighbor Embedding (t-SNE) to enhance visualization quality. Cluster assignments and validations were conducted using hierarchical clustering with average linkage. Notably, clustering analyses performed on the estimated latent matrix followed the identical approach used for the raw data.
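The preprocessing applied to the imputed matrix can be sketched as follows. This is not the Seurat code path: it assumes that "column-wise log-normalization" means scaling each cell (column) to a common total and applying log1p, as Seurat's LogNormalize does with a scale factor of 10,000, and it computes principal components with a plain SVD. The function names and toy data are hypothetical.

```python
import numpy as np

def log_normalize_columns(matrix, scale=1e4):
    """Scale each column (cell) to a common total, then apply log(1 + x)."""
    totals = matrix.sum(axis=0, keepdims=True)  # assumes every cell has a positive total
    return np.log1p(matrix / totals * scale)

def pca_scores(matrix, n_components):
    """Principal component scores for cells, from a genes-by-cells matrix."""
    cells_by_genes = matrix.T
    centered = cells_by_genes - cells_by_genes.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_components] * s[:n_components]

rng = np.random.default_rng(3)
imputed = rng.gamma(shape=2.0, scale=1.5, size=(50, 30))  # toy dense matrix: 50 genes, 30 cells
normalized = log_normalize_columns(imputed)
scores = pca_scores(normalized, n_components=5)
print(scores.shape)
```

The resulting score matrix (cells by components) is the kind of input that UMAP, t-SNE, or hierarchical clustering would take in the next step.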
Fig. 2.
ZILLNB improves cell classification accuracy and compactness in mouse brain cortex and subsampled human PBMC scRNA-seq datasets. A, B, D, E UMAP plots of the mouse brain and subsampled human PBMC scRNA-seq data using raw and default ZILLNB-imputed data, with points labeled by true cell types. C, F Four external clustering validations (Purity, F score, ARI, and AMI) demonstrate the performance and robustness of ZILLNB compared to other widely used imputation methods, evaluated on the mouse brain dataset and human PBMC dataset
Fig. 3.
ZILLNB improves resolution and clarity in cell classification within the MCA scRNA-seq dataset. A–E tSNE plots displaying cell-type annotations of the mouse cell atlas dataset, before and after denoising using various imputation methods: A raw data, B ZILLNB, C DeepImpute, D DCA, and E ALRA. F Four external clustering validations (Purity, F score, ARI, and AMI) confirm ZILLNB’s superior performance compared to other imputation methods
Fig. 5.
ZILLNB robustly identifies DEGs in bootstrap experiments using breast cancer cell line dataset [33] with tumor cell line BT474 versus non-tumor cell lines Jurkat (A–E) and Thp1 (F–J). A, F FDP, TPR, ACC, and MCC. B, G ROC curves; D, I PR curves with a fixed tDEG proportion of 20%. C, H AUC-ROC and E, J AUC-PR across varying tDEG proportions (0–50%). Dashed lines indicate the case with 3000 tDEGs (top 20%). Methods compared include Seurat log-normalized counts (“logNorm”), default ZILLNB denoised matrix (“ZILLNB”), imputed matrices of DCA and ALRA (“DCA”,“ALRA”). “t/Wilcox” denotes a two-sample t-test or Wilcoxon rank-sum test for DEGs selection
Fig. 6.
Identification of fibroblast subsets undergoing FMT. A UMAP plots of re-clustered fibroblasts from 3 patients with IPF, 3 patients with chronic obstructive pulmonary disease (COPD), and 3 healthy controls, displaying expression levels of myofibroblast markers, AUCell scores for myofibroblast-associated gene sets, and Seurat-predicted labels derived from ZILLNB-denoised data. B GO analysis of the top 200 DEGs in cluster 3 identified by ZILLNB compared to the Seurat pipeline. The bar plot illustrates the number of DEGs associated with each GO term across Biological Process (BP), Cellular Component (CC), and Molecular Function (MF). C Cumulative curve showing the top 200 DEGs in cluster 3 (ZILLNB) and cluster 4 (Seurat) that overlap with myofibroblast-related gene sets
To assess clustering accuracy, we calculated four metrics against true cell labels provided in each original dataset: Purity, F score, adjusted Rand index (ARI) [26], and adjusted mutual information (AMI) [27]. We included ARI and AMI among our evaluation metrics because they offer complementary perspectives [28] on clustering validation: ARI favors equal-sized clusters while AMI is better suited for evaluating imbalanced clusters. The combination of these two metrics enables a more thorough and balanced evaluation of clustering results [27]. Purity and F score range from 0 to 1, whereas ARI and AMI are adjusted for chance, so they are at most 1, are close to 0 for random assignments, and can be slightly negative; for all four metrics, values closer to 1 indicate superior clustering performance. Detailed definitions and formulas for these metrics are available in Supplementary Section 1.4.
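For reference, Purity and ARI can be computed directly from the contingency counts between true and predicted labels. The sketch below uses the standard textbook definitions and only the Python standard library; the function names are chosen for this example and are not taken from the ZILLNB code.

```python
from collections import Counter
from math import comb

def purity(true_labels, pred_labels):
    """Fraction of cells that belong to the majority true class of their predicted cluster."""
    pairs = Counter(zip(pred_labels, true_labels))
    best = {}
    for (pred, _), count in pairs.items():
        best[pred] = max(best.get(pred, 0), count)
    return sum(best.values()) / len(true_labels)

def adjusted_rand_index(true_labels, pred_labels):
    """Rand index corrected for chance agreement (Hubert and Arabie form)."""
    n = len(true_labels)
    sum_cells = sum(comb(c, 2) for c in Counter(zip(true_labels, pred_labels)).values())
    sum_true = sum(comb(c, 2) for c in Counter(true_labels).values())
    sum_pred = sum(comb(c, 2) for c in Counter(pred_labels).values())
    expected = sum_true * sum_pred / comb(n, 2)
    max_index = 0.5 * (sum_true + sum_pred)
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both partitions are trivial
    return (sum_cells - expected) / (max_index - expected)

truth = [0, 0, 1, 1]
print(adjusted_rand_index(truth, [0, 0, 1, 1]))  # perfect agreement
print(adjusted_rand_index(truth, [0, 0, 1, 2]))  # one cluster split in two
print(adjusted_rand_index(truth, [0, 1, 0, 1]))  # worse than chance, negative
print(purity(truth, [0, 0, 1, 2]))
```

The third call illustrates why ARI is not confined to the unit interval: a partition that systematically disagrees with the truth scores below zero.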
Random matrix comparison
To quantitatively evaluate the learned latent factor matrix against random Gaussian noise, we employ spectral methods derived from random matrix theory [29]. Consider a matrix X whose entries are independently drawn from the standard Gaussian distribution N(0, 1). We examine its sample covariance matrix through three eigenvalue-based metrics: (1) the empirical spectral density, which asymptotically follows the Marchenko-Pastur distribution; (2) the empirical normalized level-spacing density, approaching the Wigner Surmise distribution; and (3) the largest eigenvalue (spectral radius), which converges to the Tracy-Widom distribution. Detailed derivations and asymptotic distributions for these statistics are provided in Supplementary Section 1.5. For an accurate comparison, we standardized the learned latent factor matrix by centering and scaling each column to zero mean and unit variance. Subsequently, we derived its corresponding covariance matrix and calculated the aforementioned spectral statistics. The goodness-of-fit between empirical and theoretical distributions was assessed using Kolmogorov-Smirnov and Anderson-Darling tests. The largest eigenvalue comparison utilized the RMTstat package in R [30], which directly computes p-values based on the Tracy-Widom distribution.
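The spectral comparison can be illustrated with a small simulation. The sketch below standardizes the columns of a matrix, computes the eigenvalues of its sample covariance, and compares them with the Marchenko-Pastur support for unit variance. The matrix sizes and the rank-one "structured" example are invented for illustration, and the formal Kolmogorov-Smirnov, Anderson-Darling, and Tracy-Widom tests used in the paper are not reproduced here.

```python
import numpy as np

def mp_density(x, c):
    """Marchenko-Pastur density for unit variance and aspect ratio c in (0, 1)."""
    lower, upper = (1 - np.sqrt(c)) ** 2, (1 + np.sqrt(c)) ** 2
    inside = np.maximum((upper - x) * (x - lower), 0.0)
    return np.sqrt(inside) / (2 * np.pi * c * x)

def covariance_spectrum(matrix):
    """Eigenvalues of the sample covariance after column-wise standardization."""
    standardized = (matrix - matrix.mean(axis=0)) / matrix.std(axis=0)
    cov = standardized.T @ standardized / matrix.shape[0]
    return np.linalg.eigvalsh(cov)

rng = np.random.default_rng(0)
n_rows, n_cols = 2000, 200  # hypothetical sizes
c = n_cols / n_rows
upper_edge = (1 + np.sqrt(c)) ** 2

noise = rng.standard_normal((n_rows, n_cols))
shared = rng.standard_normal((n_rows, 1))  # one factor shared by every column
structured = noise + shared

noise_eigs = covariance_spectrum(noise)
structured_eigs = covariance_spectrum(structured)
print(f"MP upper edge: {upper_edge:.3f}")
print(f"largest eigenvalue, pure noise: {noise_eigs.max():.3f}")
print(f"largest eigenvalue, structured: {structured_eigs.max():.3f}")
```

For pure noise the largest eigenvalue sits close to the upper edge of the Marchenko-Pastur support, whereas a shared factor pushes it far outside; this kind of departure is what the spectral comparison is designed to detect.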
Gene sets evaluation via enrichment analysis and AUCell scores
We assessed gene set relevance through enrichment analysis and gene set activity measurements using the AUCell package [31] in R. Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analyses were conducted using the clusterProfiler package [32], with the top 10 pathways visualized (Fig. S4). To further evaluate biological significance, we calculated AUCell scores to quantify gene cluster activities, subsequently averaging these scores by cell type within each cluster. Results were presented using heatmaps (Fig. 4F).
Differentially expressed genes (DEGs) selection and simulation design
To evaluate ZILLNB in a downstream analysis task (DEGs selection), we constructed simulated experiments using the mixed scRNA-seq (Mixture 1 breast cell line dataset [33]) with bulk RNA-seq on pure cell samples used as a reference. From the original scRNA-seq data, we formed disease–control subsets by pairing one tumor line (BT474 or T47D) with one non-tumor line (Jurkat or Thp1) as the control, and generated 100 bootstrap replicates per pairing.
For DEGs calling, we performed two-sample tests (t-test or Wilcoxon rank-sum) on either the imputed expression matrices (ZILLNB, DCA, ALRA) or on log-normalized counts from Seurat. p-values were adjusted by the Benjamini-Hochberg procedure [34], and DEGs were declared by simultaneous thresholds on the adjusted p-value and the log2 fold change (logFC). For convenience, we summarize these criteria into a per-gene ranking score f (see Supplementary Section 1.3) and select predicted DEGs (pDEGs) by thresholding f on scRNA-seq data. For bulk RNA-seq, we define true DEGs (tDEGs) as the intersection of calls made independently by DESeq2, edgeR, and limma [35], yielding a stable reference set.
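The adjustment and thresholding step can be sketched as follows. The Benjamini-Hochberg function implements the standard step-up adjustment; the cut-offs of 0.05 on the adjusted p-value and 1 on the absolute log2 fold change, as well as the toy values, are placeholders for this example and are not the thresholds used in the paper (those are summarized by the ranking score f in Supplementary Section 1.3).

```python
import numpy as np

def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    p = np.asarray(p_values, dtype=float)
    n = p.size
    order = np.argsort(p)
    ranked = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, working down from the largest p-value.
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.minimum(ranked, 1.0)
    return adjusted

p_values = np.array([0.01, 0.04, 0.03, 0.005])
log_fc = np.array([1.8, 0.2, -1.4, 2.5])  # toy log2 fold changes

adjusted = benjamini_hochberg(p_values)
# Placeholder thresholds: adjusted p < 0.05 and |log2 fold change| > 1.
is_deg = (adjusted < 0.05) & (np.abs(log_fc) > 1.0)
print(adjusted, is_deg)
```

Requiring both conditions at once keeps genes with a small but statistically detectable shift from being called differentially expressed.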
After generating the tDEG and pDEG sets, we evaluated performance using standard metrics, namely false discovery proportion (FDP), true positive rate (TPR), accuracy (ACC), and the Matthews correlation coefficient (MCC), as detailed in Supplementary Section 1.4. We also constructed ROC and precision–recall (PR) curves by varying the number of genes selected as pDEGs and computed their areas under the curve (AUC-ROC and AUC-PR), using tDEGs as the reference set. In addition to the classical metrics, we highlight MCC [36, 37] and AUC-PR [38, 39] because they are less sensitive to class imbalance and thus provide a more informative assessment in our simulations. In practical terms, we considered the top 20% of bulk RNA-seq genes (approximately 3,000 genes) as the tDEGs for sub-figures A, B, D, F, G, and I in Figs. 5 and S7, replicating realistic scenarios. Robustness was further evaluated by systematically varying the tDEG proportion from 0 to 0.5 and measuring the corresponding AUC values, Pearson correlation, Wasserstein-2 distance, and cross-entropy, as described in Supplementary Section 1.4.
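The four set-based metrics follow directly from the confusion counts between predicted and true DEGs. The sketch below uses the standard definitions; the function name and gene identifiers are made up for this example, and both gene sets are assumed to be subsets of the tested gene universe.

```python
import math

def deg_metrics(predicted, truth, all_genes):
    """FDP, TPR, ACC, and MCC for a predicted DEG set against a reference set."""
    predicted, truth, all_genes = set(predicted), set(truth), set(all_genes)
    tp = len(predicted & truth)
    fp = len(predicted - truth)
    fn = len(truth - predicted)
    tn = len(all_genes) - tp - fp - fn
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "FDP": fp / max(tp + fp, 1),  # defined as 0 when nothing is selected
        "TPR": tp / max(tp + fn, 1),
        "ACC": (tp + tn) / len(all_genes),
        "MCC": (tp * tn - fp * fn) / denom if denom > 0 else 0.0,
    }

genes = [f"g{i}" for i in range(10)]  # toy gene universe
t_degs = {"g0", "g1", "g2", "g3"}     # reference calls from bulk RNA-seq
p_degs = {"g0", "g1", "g2", "g4"}     # calls from scRNA-seq
metrics = deg_metrics(p_degs, t_degs, genes)
print(metrics)
```

Accuracy is inflated by the large number of true negatives when most genes are not differentially expressed, which is why MCC is the more informative summary under class imbalance.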
Results
Enhanced cell classification after denoising
To evaluate the effectiveness of the ZILLNB framework, we compared its performance against several widely utilized imputation methods, including VIPER [14], scImpute [13], SAVER [15], ALRA [16], DCA [17], DeepImpute [18] and scMultiGAN [19]. Specifically, we assessed the accuracy of each method in recovering missing transcript counts and preserving biologically relevant signals critical for accurate cell-type classification.
Our evaluations involved two scRNA-seq datasets with verified cell-type annotations: (1) Mouse Brain Dataset: consisting of 3005 single cells obtained from the cerebral cortex of 33 male and 34 female mice (GSE60361) [40]; and (2) Human PBMC Dataset: comprising eight distinct immune cell types from purified human peripheral blood mononuclear cells (PBMCs), subsampled to 500 cells per cell type [41]. We compared predicted labels derived from each imputation method against true cell-type labels, employing four clustering performance metrics: Purity, F score, ARI, and AMI (see the Methods and the Supplementary Section 1.4).
The ZILLNB framework consistently outperformed standard imputation methods in clustering and cell type classification across all datasets and metrics, demonstrating improved Purity, F score, ARI, and AMI (Fig. 2C, F). Notably, ZILLNB demonstrated a significant improvement in ARI and AMI, highlighting its robustness in accurately identifying cell-type groupings. Methods such as VIPER and scImpute did not outperform the raw data, likely due to challenges in distinguishing subtle subgroup structures within highly similar cell populations, such as the T cells in the PBMC dataset (Fig. 2F).
To further illustrate the scalability of ZILLNB on large-scale scRNA-seq datasets, we employed the mouse cell atlas (MCA) dataset [42], which includes samples derived from more than 50 mouse tissues and cultures. For consistency with our main benchmark and to keep the MCA-scale evaluation focused [16], we compared against the baselines highlighted in Fig. 2C and F, namely DeepImpute, DCA, and ALRA. Compared with these three methods, ZILLNB yielded a t-SNE representation with superior resolution of cell subtypes (Fig. 3). Specifically, gonadal-related cell subtypes (including luteal cells, cumulus cells, granulosa cells, Leydig cells, and Sertoli cells) exhibited clearer and more distinct boundaries after ZILLNB denoising, as indicated by the red dashed circles in Fig. 3B. Moreover, cluster evaluation metrics further confirmed the improved performance of ZILLNB (Fig. 3F).
These clustering results indicate that ZILLNB preserves latent biological characteristics while capturing meaningful underlying signals.
Latent factors preserve cell- and gene-wise information
To assess the biological and statistical significance of the latent factor matrices U and V produced by ZILLNB, we evaluated their ability to retain meaningful information from multiple perspectives. The matrix V effectively captured the underlying structure of cell populations, as evidenced by Fig. 4A. Additionally, we verified the capability of matrix V to accurately preserve biological signals despite significant batch effects present in the raw data. By integrating batch information encoded as a one-hot matrix into the V matrix training process using pancreatic scRNA-seq datasets generated from diverse sequencing technologies [49], V effectively mitigated sequencing biases and preserved more biological structure than Seurat-CCA [25] and Harmony [50] (Fig. S2B).
For the gene-embedding latent factor matrix U, estimated from the EM algorithm, we validated its capacity to capture biologically relevant gene structures in the mouse brain dataset [40], using both statistical and biological analyses. We compared U against a random Gaussian matrix X of equivalent dimensions, sampled from N(0, 1), the prior distribution used during training. Spectral density analysis, normalized level-spacing distributions, and largest eigenvalue assessments revealed pronounced distinctions between U and the random matrix X (Fig. 4C). Goodness-of-fit tests (Kolmogorov-Smirnov and Anderson-Darling) and the largest eigenvalue test strongly indicated that U encodes structured latent signals rather than random noise.
The biological relevance of U was further supported by clustering and enrichment analyses. Using Seurat’s clustering pipeline, we identified eight gene clusters and examined their biological significance (Fig. 4D). The heatmap depicting the top 20 marker genes per cell type confirmed that U retained critical marker gene information, with clearly identifiable clusters for microglia, oligodendrocytes, and pyramidal cells. Additional analyses indicated strong enrichment of marker genes within specific gene clusters (Figs. 4E, S3). GO and KEGG enrichment analyses validated cell-type-specific functions, highlighting immune-related processes in microglia and differentiation-related functions in oligodendrocytes (Fig. S4). AUCell analysis further confirmed high activity of gene clusters 6, 7, and 8 within microglia, oligodendrocytes, and neurons, respectively (Fig. 4F). External validation on two additional neural scRNA-seq datasets [51, 52] shows that the predicted clusters align with the annotated cell types (Fig. S5).
We further validated ZILLNB on a homogeneous mouse embryonic stem cell (mESC) dataset [43], in which subtle gene-level variation delineates the G1, S, and G2/M cell-cycle phases. The ZILLNB matrix labelled “ZILLNBUbeta” in Fig. 4H clearly differentiated these cell cycle phases (Figs. 4G and S6), demonstrating superior clustering performance compared to methods such as COMSE [44], Scran Cyclone [47], and Seurat cell-cycle scoring [48] (Fig. 4H) and even the imputed matrix (Fig. S6). This suggests that the gene latent factor captures high-dimensional intrinsic gene structure aligned with cell-cycle states. In addition, combining the two imputed components, which yields the default ZILLNB output, further improves clustering accuracy (Fig. S6C).
ZILLNB enhances DEGs selection
To illustrate the utility of ZILLNB in downstream analyses, we evaluated its performance in detecting DEGs, a critical task in scRNA-seq data analysis. Since ground-truth DEG labels are typically unavailable for scRNA-seq datasets, we employed complementary scRNA-seq and bulk RNA-seq datasets. The latter provided benchmark DEG labels for robust validation. Performance stability was rigorously assessed through 100 bootstrap iterations for each experimental dataset (see Methods Section 2.7 and Supplementary Section 1.3 for details).
We analyzed a breast cancer cell line dataset containing both scRNA-seq and bulk RNA-seq data for each tumor and non-tumor cell line [33]. Semi-synthetic datasets were created by pairing one tumor (BT474 or T47D) with one non-tumor cell line (Jurkat or Thp1) to simulate disease and control groups. DEGs selection was performed using established criteria involving adjusted p-value and logFC. tDEGs were defined by intersecting results obtained independently from DESeq2, edgeR, and limma analyses on bulk RNA-seq data [9, 53, 54].
Using the top 20% (3000 genes) of bulk RNA-seq data as benchmark tDEGs, ZILLNB consistently outperformed, across multiple evaluation metrics, the log-normalization baseline and two strong imputation baselines, DCA and ALRA, which were selected based on their performance in Figs. 2 and 3. ZILLNB achieved significantly reduced FDP and enhanced TPR, demonstrating gains in both precision and recall (Figs. 5A, F and S7). Furthermore, ZILLNB achieved stronger ROC/PR performance, with higher AUCs, than the log-normalization baseline and, in most datasets, than the two imputation baselines (DCA and ALRA) (Fig. 5B, D, G and I). When the proportion of tDEGs was systematically varied from 0% to 50% (7,500 genes), ZILLNB maintained consistent and substantial improvements over the baseline methods across all scenarios (Figs. 5C, E, H, J, S7). Additional quantitative comparisons using Pearson correlation, Wasserstein-2 distance, and cross-entropy further supported ZILLNB’s superior performance in capturing genuine biological signals from scRNA-seq data (Fig. S8). These findings were consistently replicated across other cell-line comparisons (T47D-Jurkat and T47D-Thp1), underscoring the robustness and reliability of ZILLNB (Figs. S7 and S8).
Overall, ZILLNB demonstrated robust advantages over traditional log-normalization methods, achieving improved FDP control, higher TPR, and enhanced preservation of intrinsic data structure. Its consistently strong performance across diverse scenarios and perturbations highlights ZILLNB’s effectiveness and reliability for DEG identification in scRNA-seq analyses.
ZILLNB facilitates identification of fibroblast subsets
We next demonstrate ZILLNB’s capability in identifying specific fibroblast subsets undergoing fibroblast-to-myofibroblast transition (FMT), a crucial event in idiopathic pulmonary fibrosis (IPF). Using the default ZILLNB-denoised scRNA-seq data (from Eq. 4, Output 1), we identified a distinct cluster exhibiting robust myofibroblast-associated characteristics and validated its biological significance through marker gene expression, pathway enrichment, and gene set activity analyses.
IPF, the most common type of pulmonary fibrosis, is characterized by chronic progressive lung fibrosis [55]. The FMT drives myofibroblast accumulation, and inhibiting this transition can significantly mitigate or halt disease progression [56, 57]. Given the critical role of myofibroblasts in IPF pathogenesis, accurately identifying fibroblast subsets involved in FMT is essential for advancing therapeutic strategies.
To this end, we analyzed publicly available scRNA-seq datasets of IPF, focusing specifically on fibroblast populations [58]. Using ZILLNB, we identified a clearly defined fibroblast subset (cluster 3) characterized by elevated expression of myofibroblast markers, including MYO1E, H19, COL3A1, and COL1A1 (Fig. 6A). We further evaluated these cells by generating a curated myofibroblast-related gene list from the GeneCards database [59], applying a relevance score threshold of 1, resulting in a set of 517 genes. AUCell scoring analysis confirmed enriched expression of these myofibroblast-associated genes within cluster 3 (ZILLNB) (Fig. 6A).
In contrast, a parallel analysis using the raw data and the standard Seurat pipeline identified another cluster (cluster 4) with similar marker gene expression. However, this cluster exhibited less coherent signals, and AUCell scoring failed to distinctly identify a myofibroblast-enriched subset (Fig. S9A). These observations indicate improved delineation of the myofibroblast-enriched subset relative to the raw-data Seurat pipeline (Fig. S9B).
To further establish biological validity, we conducted GO enrichment analysis of the top 200 DEGs from cluster 3 identified by ZILLNB. Results revealed significant enrichment in wound healing processes, a characteristic functional signature of myofibroblasts (Fig. 6B). Additionally, cumulative curve analyses showed that ZILLNB-derived DEGs encompassed a higher proportion of myofibroblast-related genes compared with those identified via the raw Seurat analysis, further highlighting the biological significance of the identified subset (Fig. 6C).
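One plausible way to build such a cumulative curve is to walk down a ranked DEG list and count how many reference genes have been encountered at each rank. The Python sketch below follows that reading; the function name and the toy gene labels are hypothetical, and the exact construction used for Fig. 6C may differ.

```python
def cumulative_hits(ranked_degs, reference_genes):
    """Number of reference genes found among the top-k DEGs, for k = 1..len(ranked_degs)."""
    reference = set(reference_genes)
    curve, hits = [], 0
    for gene in ranked_degs:
        hits += gene in reference  # True counts as 1
        curve.append(hits)
    return curve


# Toy example (hypothetical labels): "m*" are reference genes, "x*" are not.
reference = {"m1", "m2", "m3", "m4"}
ranked_a = ["m1", "m2", "x1", "m3"]  # DEG ranking from pipeline A
ranked_b = ["x1", "m1", "x2", "x3"]  # DEG ranking from pipeline B

curve_a = cumulative_hits(ranked_a, reference)
curve_b = cumulative_hits(ranked_b, reference)
print(curve_a, curve_b)
```

Dividing each entry by its rank k gives the proportion of reference genes among the top-k DEGs, which allows two pipelines to be compared at matched list lengths.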
Thus, these results affirm that ZILLNB provides clearer and more biologically accurate identification of fibroblast subsets undergoing FMT. By effectively denoising scRNA-seq data, ZILLNB emerges as a robust analytical tool with broad applicability for investigating complex biological phenomena such as IPF and similar multifactorial diseases.
Discussion
ZILLNB combines learned latent representations with ZINB regression, casting denoising as a regression problem with learned covariates rather than purely unsupervised modeling. By dynamically capturing complex multivariate structures at both gene and cell levels, ZILLNB merges the strengths of deep learning’s powerful representational capabilities with the interpretability and robustness inherent to ZINB models.
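To make the notion of ZINB regression with learned covariates concrete, the sketch below evaluates a generic ZINB log-probability whose mean is linked to latent factors through a log link. It uses a standard textbook parameterization (mean `mu`, size `theta`, zero-inflation probability `pi`) and is not a reproduction of Eq. 4 or of the ZILLNB implementation; all function and argument names are hypothetical.

```python
import math


def zinb_logpmf(y, mu, theta, pi):
    """Log-probability of count y under a ZINB with mean mu > 0, size theta > 0, 0 <= pi < 1."""
    # Negative binomial log-pmf parameterized by mean and size (inverse dispersion).
    log_nb = (
        math.lgamma(y + theta) - math.lgamma(theta) - math.lgamma(y + 1)
        + theta * math.log(theta / (theta + mu))
        + y * math.log(mu / (theta + mu))
    )
    if y == 0:
        # A zero is either a structural (inflated) zero or a sampling zero from the NB part.
        return math.log(pi + (1.0 - pi) * math.exp(log_nb))
    return math.log(1.0 - pi) + log_nb


def mean_from_factors(intercept, coefs, factors):
    """Log-link mean: mu = exp(intercept + sum_k coefs[k] * factors[k])."""
    return math.exp(intercept + sum(c * f for c, f in zip(coefs, factors)))


# Toy usage: two hypothetical latent factors for one gene-cell pair.
mu = mean_from_factors(0.2, [0.5, -0.3], [1.0, 2.0])
log_p = zinb_logpmf(3, mu, theta=1.5, pi=0.3)
print(mu, log_p)
```

In a regression setting, the coefficients would be estimated by maximizing the sum of such log-probabilities over all gene-cell pairs, for example with an EM scheme that treats the structural-zero indicator as a latent variable.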
Previous ZINB-based methods have been extensively utilized for scRNA-seq and microbiome analyses. However, these approaches typically depend heavily on user-defined covariates or limit covariate dimensionality, restricting the complexity of captured group structures [60–62]. In contrast, ZILLNB employs neural network architectures to autonomously learn intricate gene-wise and cell-wise structures without predetermined covariate constraints. We assessed the biological contribution of each model component. The cell-wise latent factor V captures major cell-type structure (Figs. 4A; S2), whereas the gene-wise factor preserves intrinsic functional modules associated with specific cell types (Figs. 4D–F; S3-S5), enabling accurate subpopulation classification. Combining these two components yields better performance than using either alone (Fig. S6C). Consistent improvements in clustering and DEG analyses further validate the effectiveness of ZILLNB. For most applications, the default output (Output 1 in Eq. 4) is sufficient. Despite integrating deep learning with statistical modeling, ZILLNB maintains reasonable memory consumption and computational feasibility on standard multi-core systems (Fig. S10). Practical speed–memory trade-offs include highly variable gene selection, cell downsampling, and reducing the latent dimensions K and L, all while preserving biological interpretability. In summary, compared with stand-alone statistical methods or deep learning models, ZILLNB’s integrated design enhances representational capacity and extends the applicability of classical statistical models to settings where key biological signals are only indirectly observed.
Although initially designed to address heterogeneity in scRNA-seq data, ZILLNB demonstrates considerable adaptability across a broad spectrum of analytical applications. For instance, it can readily integrate user-specified covariates, making it highly effective in tasks such as batch-effect correction and cohort bias adjustment. Additionally, integration strategies based on domain-adversarial learning and variational approximations provide promising avenues to improve cross-dataset alignment in future work [63]. Users can also define particular gene and cell groups to specifically target variance sources or amplify relevant expression signals. As demonstrated in Fig. 5, defining tumor and normal cell groups significantly enhances robust DEG identification. Similarly, ZILLNB’s flexibility enables the modeling of diverse structures, including cell types, experimental conditions, and biological perturbations, to enhance downstream analyses such as precise cell subset characterization (Fig. 6).
ZILLNB represents a powerful analytical framework for interpreting complex scRNA-seq data, offering extensive potential applications. Future extensions and adaptations could further broaden its utility. Promising directions include applying ZILLNB to pseudo-time trajectory analyses [2], enabling detailed exploration of dynamic cellular processes across developmental timelines by integrating latent-space similarity measures. Additionally, extending ZINB with variance-decomposition-based approaches for gene clustering and regulatory network inference may yield deeper insights into gene function and regulatory interactions [64, 65]. Furthermore, adapting ZILLNB to bulk RNA-seq or scATAC-seq data would facilitate critical tasks such as data denoising and differential feature identification, underscoring its versatility in analyzing diverse genomic data types.
Conclusions
ZILLNB is a powerful method that leverages deep learning methods and statistical models to retain biological heterogeneity while reducing technical heterogeneity. ZILLNB automatically learns cell-wise and gene-wise latent factors and uses them as covariates for ZINB model fitting. Compared to other common methods for denoising, ZILLNB shows superior performance in cell-type clustering, cell subtype identification, and differentially expressed gene identification. Additionally, it can flexibly incorporate known covariates to enhance robustness and facilitate downstream analysis. Thus, ZILLNB provides a new perspective for modeling heterogeneity in scRNA-seq data by combining deep-learning-derived embeddings with a statistical framework, and can identify the variation introduced by different sources. We expect ZILLNB to play a crucial role in modeling the heterogeneity of scRNA-seq data in future research.
Additional file
Supplementary file 1 The Supplementary file contains: (i) a detailed description of the InfoVAE-GAN model; (ii) derivations of the EM algorithms; (iii) differentially expressed gene (DEG) selection; (iv) evaluation metrics; (v) random-matrix statistics; (vi) guidance on using ZILLNB output; and (vii) additional figures supporting the data analyses. All code and a tutorial are available on GitHub at https://github.com/tianyingw/ZILLNB.
Acknowledgements
We would like to thank Dr. Xun Lan from Tsinghua University for generously providing computational resources. We also thank the editors and four anonymous reviewers for their insightful comments and helpful suggestions.
Author contributions
T.W., Q.L., and Y.Y. conceptualized and designed the model and computational framework. Q.L. and Y.Y. developed the algorithm, implemented the model, and conducted performance evaluations on real datasets. T.W. led the manuscript preparation. All authors provided substantial, critical feedback, which shaped the research, analysis, and manuscript preparation. T.W. supervised the overall project. All authors reviewed and approved the final manuscript.
Funding
Not applicable.
Data availability
All datasets utilized in this analysis are publicly available. The mouse brain scRNA-seq dataset comprising 3005 cells from the mouse cortex and hippocampus can be accessed through the Gene Expression Omnibus (GEO) SuperSeries GSE60361 [40]. The other two mouse neural scRNA-seq datasets can be accessed through GSE90806 [51] and GSE71585 [52]. The human PBMC dataset was subsampled to 4000 cells from an original 68k PBMC dataset available in the Short Read Archive under accession SRP073767, encompassing proportionally subsampled cells from eight distinct immune cell types [41]. The expression matrix for the MCA dataset is available at https://doi.org/10.6084/m9.figshare.5435866.v8. The human pancreatic scRNA-seq datasets generated using various sequencing methods [49] can be accessed under the following accession numbers: GSE81076, GSE85241, GSE86469, GSE84133, GSE81608, and E-MTAB-5061 (ArrayExpress) or at https://doi.org/10.6084/m9.figshare.12420968. The mESC scRNA-seq datasets, generated using full-length and UMI-based techniques, are available under accession numbers E-MTAB-2805 [43] and GSE54695 [66], respectively. The bulk RNA-seq dataset for mESCs is accessible under GSE78140 [67]. The breast cancer cell line dataset, comprising both bulk RNA-seq and scRNA-seq data, is publicly accessible under accession GSE220608 [33]. The IPF scRNA-seq dataset is accessible under GSE136831 [58]. The source code for ZILLNB, implemented in R and Python, is available on GitHub (https://github.com/tianyingw/ZILLNB).
Declarations
Conflict of interest
The authors declare no conflict of interest.
Footnotes
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
These authors contributed equally to this work.
References
- 1. Baysoy A, Bai Z, Satija R, Fan R. The technological landscape and applications of single-cell multi-omics. Nat Rev Mol Cell Biol. 2023;24(10):695–713.
- 2. Trapnell C. Defining cell types and states with single-cell genomics. Genome Res. 2015;25:1491–8.
- 3. Macosko EZ, Basu A, Satija R, Nemesh J, Shekhar K, Goldman M, et al. Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets. Cell. 2015;161:1202–14.
- 4. Steven Potter S. Single-cell RNA sequencing for the study of development, physiology and disease. Nat Rev Nephrol. 2018;14:479–92.
- 5. Jagadeesh KA, Dey KK, Montoro DT, Mohan R, Gazal S, Engreitz JM, et al. Identifying disease-critical cell types and cellular processes by integrating single-cell RNA-sequencing and human genetics. Nat Genet. 2022;54:1479–92.
- 6. Lähnemann D, Köster J, Szczurek E, McCarthy DJ, Hicks SC, Robinson MD, et al. Eleven grand challenges in single-cell data science. Genome Biol. 2020;21:31.
- 7. Phipson B, Zappia L, Oshlack A. Gene length and detection bias in single cell RNA sequencing protocols. F1000Research. 2017;6:595.
- 8. Risso D, Schwartz K, Sherlock G, Dudoit S. GC-content normalization for RNA-SEQ data. BMC Bioinf. 2011;12:480.
- 9. Robinson MD, McCarthy DJ, Smyth GK. Edger: a bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010;26(1):139–40.
- 10. Qiu P. Embracing the dropouts in single-cell RNA-SEQ analysis. Nat Commun. 2020;11:1169.
- 11. Choudhary S, Satija R. Comparison and evaluation of statistical error models for SCRNA-SEQ. Genome Biol. 2022;23:27.
- 12. Hafemeister C, Satija R. Normalization and variance stabilization of single-cell RNA-SEQ data using regularized negative binomial regression. Genome Biol. 2019;20:296.
- 13. Li WV, Li JJ. An accurate and robust imputation method scimpute for single-cell RNA-SEQ data. Nat Commun. 2018;9:997.
- 14. Chen M, Zhou X. Viper: variability-preserving imputation for accurate gene expression recovery in single-cell RNA sequencing studies. Genome Biol. 2018;19:196.
- 15. Huang M, Wang J, Torre E, Dueck H, Shaffer S, Bonasio R, et al. Saver: gene expression recovery for single-cell RNA sequencing. Nat Methods. 2018;15:539–42.
- 16. Linderman GC, Zhao J, Roulis M, Bielecki P, Flavell RA, Nadler B, et al. Zero-preserving imputation of single-cell RNA-SEQ data. Nat Commun. 2022;13(1):192.
- 17. Eraslan G, Simon LM, Mircea M, Mueller NS, Theis FJ. Single-cell RNA-SEQ denoising using a deep count autoencoder. Nat Commun. 2019;10:390.
- 18. Arisdakessian C, Poirion O, Yunits B, Zhu X, Garmire LX. Deepimpute: an accurate, fast, and scalable deep neural network method to impute single-cell RNA-SEQ data. Genome Biol. 2019;20:211.
- 19. Wang T, Zhao H, Xu Y, Wang Y, Shang X, Peng J, et al. scmultigan: cell-specific imputation for single-cell transcriptomes with multiple deep generative adversarial networks. Brief Bioinform. 2023;24(6):bbad384.
- 20. Papalexi E, Satija R. Single-cell RNA sequencing to explore immune cell heterogeneity. Nat Rev Immunol. 2018;18:35–45.
- 21. Faure AJ, Schmiedel JM, Lehner B. Systematic analysis of the determinants of gene expression noise in embryonic stem cells. Cell Syst. 2017;5:471-484.e4.
- 22. Chen X, Kingma DP, Salimans T, Duan Y, Dhariwal P, Schulman J, Sutskever I, Abbeel P. Variational lossy autoencoder. arXiv:1611.02731, 2016.
- 23. Zhao S, Song J, Ermon S. Infovae: Information maximizing variational autoencoders. arXiv:1706.02262, 2017.
- 24. Larsen ABL, Sønderby SK, Larochelle H, Winther O. Autoencoding beyond pixels using a learned similarity metric. arXiv:1512.09300, 2015.
- 25. Butler A, Hoffman P, Smibert P, Papalexi E, Satija R. Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nat Biotechnol. 2018;36:411–20.
- 26. Warrens MJ, Van Der Hoef H. Understanding the adjusted rand index and other partition comparison indices based on counting object pairs. J Classif. 2022;39(3):487–509.
- 27. Romano S, Vinh NX, Bailey J, Verspoor K. Adjusting for chance clustering comparison measures. J Mach Learn Res. 2016;17(134):1–32.
- 28. Miller C, Portlock T, Nyaga DM, O’Sullivan JM. A review of model evaluation metrics for machine learning in genetics and genomics. Front Bioinf. 2024;4:1457619.
- 29. Aparicio L, Bordyuh M, Blumberg AJ, Rabadan R. A random matrix theory approach to denoise single-cell data. Patterns. 2020;1(3):100035.
- 30. Johnstone IM, Ma Z, Perry PO, Shahram M. RMTstat: Distributions, statistics and tests derived from random matrix theory, 2022. R package version 0.3.1.
- 31. Aibar S, Gonzalez-Blas CB, Moerman T, Huynh-Thu VA, Imrichova H, Hulselmans G, et al. Scenic single-cell regulatory network inference and clustering. Nat Methods. 2017;14:1083–6.
- 32. Wu T, Hu E, Xu S, Chen M, Guo P, Dai Z, et al. clusterProfiler 4.0: a universal enrichment tool for interpreting omics data. Innovation. 2021;2(3):100141.
- 33. Cobos FA, Panah MJN, Epps J, Long X, Man T-K, Chiu H-S, et al. Effective methods for bulk RNA-SEQ deconvolution using SCNRNA-SEQ transcriptomes. Genome Biol. 2023;24(1):177.
- 34. Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Ser B Stat Methodol. 1995;57:289–300.
- 35. Li Y, Ge X, Peng F, Li W, Li JJ. Exaggerated false positives by popular differential expression methods when analyzing human population samples. Genome Biol. 2022;23(1):79.
- 36. Chicco D, Jurman G. The advantages of the Matthews correlation coefficient (mcc) over f1 score and accuracy in binary classification evaluation. BMC Genom. 2020;21(1):6.
- 37. Chicco D, Jurman G. The matthews correlation coefficient (mcc) should replace the roc auc as the standard metric for assessing binary classification. BioData Mining. 2023;16(1):4.
- 38. Ozenne B, Subtil F, Maucort-Boulch D. The precision-recall curve overcame the optimism of the receiver operating characteristic curve in rare diseases. J Clin Epidemiol. 2015;68(8):855–9.
- 39. Sofaer HR, Hoeting JA, Jarnevich CS. The area under the precision-recall curve as a performance metric for rare binary events. Methods Ecol Evol. 2019;10(4):565–77.
- 40. Zeisel A, Muñoz-Manchado AB, Codeluppi S, Lönnerberg P, La Manno G, Juréus A, et al. Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-SEQ. Science. 2015;347:1138–42.
- 41. Zheng GXY, Terry JM, Belgrader P, Ryvkin P, Bent ZW, Wilson R, et al. Massively parallel digital transcriptional profiling of single cells. Nat Commun. 2017;8:14049.
- 42. Han X, Wang R, Zhou Y, Fei L, Sun H, Lai S, et al. Mapping the mouse cell atlas by microwell-seq. Cell. 2018;172(5):1091–107.
- 43. Buettner F, Natarajan KN, Paolo Casale F, Proserpio V, Scialdone A, Theis FJ, et al. Computational analysis of cell-to-cell heterogeneity in single-cell RNA-sequencing data reveals hidden subpopulations of cells. Nat Biotechnol. 2015;33:155–60.
- 44. Luo Q, Chen Y, Lan X. Comse: analysis of single-cell RNA-SEQ data using community detection-based feature selection. BMC Biol. 2024;22:167.
- 45. Andrews TS, Hemberg M. M3drop: Dropout-based feature selection for scrnaseq. Bioinformatics. 2019;35:2865–7.
- 46. Lun ATL, Bach K, Marioni JC. Pooling across cells to normalize single-cell RNA sequencing data with many zero counts. Genome Biol. 2016;17:75.
- 47. Scialdone A, Natarajan KN, Saraiva LR, Proserpio V, Teichmann SA, Stegle O, et al. Computational assignment of cell-cycle stage from single-cell transcriptome data. Methods. 2015;85:54–61.
- 48. Tirosh I, Izar B, Prakadan SM, Wadsworth MH, Treacy D, Trombetta JJ, et al. Dissecting the multicellular ecosystem of metastatic melanoma by single-cell RNA-SEQ. Science. 2016;352:189–96.
- 49. Luecken MD, Büttner M, Chaichoompu K, Danese A, Interlandi M, Mueller MF, et al. Benchmarking atlas-level data integration in single-cell genomics. Nat Methods. 2022;19:41–50.
- 50. Korsunsky I, Millard N, Fan J, Slowikowski K, Zhang F, Wei K, et al. Fast, sensitive and accurate integration of single-cell data with harmony. Nat Methods. 2019;16:1289–96.
- 51. Campbell JN, Macosko EZ, Fenselau H, Pers TH, Lyubetskaya A, Tenen D, et al. A molecular census of arcuate hypothalamus and median eminence cell types. Nat Neurosci. 2017;20(3):484–96.
- 52. Tasic B, Menon V, Nguyen TN, Kim TK, Jarsky T, Yao Z, et al. Adult mouse cortical cell taxonomy revealed by single cell transcriptomics. Nat Neurosci. 2016;19(2):335–46.
- 53. Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-SEQ data with deseq2. Genome Biol. 2014;15(12):1–21.
- 54. Ritchie ME, Phipson B, Wu DI, Hu Y, Law CW, Shi W, et al. Limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 2015;43(7):e47.
- 55. Nalysnyk L, Cid-Ruzafa J, Rotella P, Esser D. Incidence and prevalence of idiopathic pulmonary fibrosis: review of the literature. Eur Respir Rev. 2012;21:355–61.
- 56. Sontake V, Kasam RK, Sinner D, Korfhagen TR, Reddy GB, White ES, et al. Wilms’ tumor 1 drives fibroproliferation and myofibroblast transformation in severe fibrotic lung disease. JCI insight. 2018;3:e121252.
- 57. Bollong MJ, Yang B, Vergani N, Beyer BA, Chin EN, Zambaldo C, et al. Small molecule-mediated inhibition of myofibroblast transdifferentiation for the treatment of fibrosis. Proc Natl Acad Sci USA. 2017;114:4679–84.
- 58. Adams TS, Schupp JC, Poli S, Ayaub EA, Neumark N, Ahangari F, et al. Single-cell RNA-SEQ reveals ectopic and aberrant lung-resident cell populations in idiopathic pulmonary fibrosis. Sci Adv. 2020;6(28):eaba1983.
- 59. Stelzer G, Rosen N, Plaschkes I, Zimmerman S, Twik M, Fishilevich S, et al. The genecards suite: from gene data mining to disease genome sequence analyses. Curr Protoc Bioinformatics. 2016;2016:1.30.1-1.30.33.
- 60. Tian T, Wan J, Song Q, Wei Z. Clustering single-cell RNA-SEQ data with a model-based deep learning approach. Nat Mach Intell. 2019;1:191–8.
- 61. Li Y, Mingcong W, Ma S, Mengyun W. Zinbmm: a general mixture model for simultaneous clustering and gene selection using single-cell transcriptomic data. Genome Biol. 2023;24:208.
- 62. Zeng Y, Li J, Wei C, Zhao H, Wang T. Mbdenoise: microbiome data denoising using zero-inflated probabilistic principal components analysis. Genome Biol. 2022;23(1):94.
- 63. Jialu H, Zhong Y, Shang X. A versatile and scalable single-cell data integration algorithm based on domain-adversarial and variational approximation. Brief Bioinform. 2021;23(1):bbab400.
- 64. Shu H, Zhou J, Lian Q, Li H, Zhao D, Zeng J, et al. Modeling gene regulatory networks using neural network architectures. Nat Comput Sci. 2021;1:491–501.
- 65. Lian B, Zhang H, Wang T, Wang Y, Shang X, Aziz NA, et al. Inference of gene coexpression networks from single-cell transcriptome data based on variance decomposition analysis. Brief Bioinform. 2025;26(4):bbaf309.
- 66. Grün D, Kester L, Van Oudenaarden A. Validation of noise models for single-cell transcriptomics. Nat Methods. 2014;11:637–40.
- 67. Guo F, Li L, Li J, Xinglong W, Boqiang H, Zhu P, et al. Single-cell multi-omics sequencing of mouse early embryos and embryonic stem cells. Cell Res. 2017;27:967–88.