Abstract
Single-cell RNA sequencing (scRNA-seq) facilitates the study of cell type heterogeneity and the construction of cell atlases. However, due to its limitations, many genes may be detected as having zero expression, i.e. dropout events, leading to bias in downstream analyses and hindering the identification and characterization of cell types and cell functions. Although many imputation methods have been developed, their performance is generally lower than expected across different kinds and dimensions of data and application scenarios. Therefore, developing an accurate and robust single-cell gene expression data imputation method is still essential. To maintain the original cell–cell and gene–gene correlations and leverage bulk RNA sequencing (bulk RNA-seq) data information, we propose scINRB, a single-cell gene expression imputation method with network regularization and bulk RNA-seq data. scINRB adopts network-regularized non-negative matrix factorization to ensure that the imputed data maintain the cell–cell and gene–gene similarities and also approach the gene average expression calculated from bulk RNA-seq data. To evaluate its performance, we test scINRB on simulated and experimental datasets and compare it with other commonly used imputation methods. The results show that scINRB recovers gene expression accurately even in the case of high dropout rates and dimensions, preserves cell–cell and gene–gene similarities and improves various downstream analyses including visualization, clustering and trajectory inference.
Keywords: single-cell expression data, imputation, bulk RNA-seq data, network regularization, non-negative matrix factorization
INTRODUCTION
Compared with bulk RNA sequencing (bulk RNA-seq) technology, which measures the averaged gene expression level of various cells in a heterogeneous sample, single-cell RNA sequencing (scRNA-seq) technology measures gene expression at the level of individual cells [1, 2], revealing transcriptomic heterogeneity and dynamic changes among different cells and even uncovering novel cell types. Although scRNA-seq has become increasingly popular in transcriptome research, scRNA-seq data are often highly sparse and noisy due to dropout events [3]. The term dropout event [4–6] is used to describe the phenomenon in scRNA-seq experiments where the expression of certain genes cannot be detected in some cells due to biological or technical reasons, resulting in a substantial number of zero values in the data. The reasons include the low amounts of mRNA in individual cells, the stochasticity of mRNA expression, transcriptional bursting, low sequencing depth per cell, improper sample handling, inefficient mRNA capture and so on [7, 8]. Dropout events can introduce biases in downstream analyses such as clustering, differential expression and pseudotime analyses, thereby affecting the identification and characterization of cell types and functions. Therefore, imputing missing expression values is an important step in scRNA-seq data analysis.
In recent years, there has been extensive research on the imputation of single-cell gene expression through various algorithms based on different principles. These methods can generally be classified into three categories [9]: model-based, smoothing-based and reconstruction-based imputation methods. Model-based imputation methods employ probabilistic distributions to fit the single-cell gene expression and recover the expression data [10–16]. Smoothing-based imputation methods infer missing data leveraging similar data points [17–19]. Reconstruction-based imputation methods aim to define the latent representation of cells, including low-rank matrix-based methods [20–25] and deep learning–based methods [26–31]. A benchmarking study on scRNA-seq imputation methods [32] found there were significant differences in imputation performances across different evaluation aspects; hence, the choice of imputation method needs to be considered for different evaluation requirements and application scenarios. Another review [33] also indicated that no single imputation method could maintain optimal performance in all cases and highlighted issues such as scalability, robustness and unavailability in certain situations that need to be addressed in future research. While existing single-cell gene expression imputation methods have alleviated the adverse effects of dropout events to some extent, the imputation performance is lower than expected across different kinds of data and application scenarios. Therefore, there is still a need to develop more accurate and robust imputation methods to recover missing values in single-cell transcriptomic data with different dimensions and dropout rates and advance various downstream data analyses.
Compared with most existing imputation methods that solely rely on scRNA-seq data for missing value recovery, incorporating bulk transcriptome sequencing data can provide a more accurate estimation of gene expression distribution. Additionally, instead of directly using the expression of similar samples to impute, we utilize network regularization to preserve similarity information between cells and also between genes. Therefore, we propose a single-cell gene expression imputation method with network regularization and bulk RNA-seq data (scINRB). It ensures that the imputed data maintain the original cell–cell similarity and gene–gene similarity and approximate the gene average expression calculated from bulk RNA-seq data. To evaluate the performance of scINRB, we test it on simulated datasets with five different data sizes and five different dropout rates, as well as six experimental datasets, and also compare it with 10 other commonly used single-cell gene expression imputation methods. It is demonstrated that scINRB performs outstandingly across different kinds and dimensions of data, varying degrees of dropout rates and various application scenarios.
MATERIALS AND METHODS
scINRB algorithm
scINRB was developed based on non-negative matrix factorization, and the optimization problem is:

$$\min_{W \ge 0,\; H \ge 0} \; \|Y - WH\|_F^2 + \lambda_1 \mathrm{Tr}(W^{T} L_g W) + \lambda_2 \mathrm{Tr}(H L_c H^{T}) + \lambda_3 \|WHa - b\|_2^2 \qquad (1)$$

where $Y \in \mathbb{R}^{g \times c}$ is an input single-cell gene expression matrix with g denoting the number of genes and c denoting the number of cells, and $W \in \mathbb{R}^{g \times k}$ and $H \in \mathbb{R}^{k \times c}$ are lower-dimensional representations of genes and cells, where k denotes the number of factors. The first term denotes the non-negative matrix factorization on Y to obtain lower-dimensional representations of genes and cells, making $WH$ approach Y. $L_g$ and $L_c$ are the symmetric normalized Laplacian matrices of gene–gene correlation and cell–cell correlation, and Tr(·) indicates the trace of the matrix. $L_g = D_g^{-1/2}(D_g - S_g)D_g^{-1/2}$, where $S_g$ is a gene–gene similarity matrix whose entry $(S_g)_{ij}$ is computed from $d_{ij}$, the Euclidean distance between genes i and j, and $D_g$ is the degree matrix, a diagonal matrix whose entry $(D_g)_{ii}$ is the sum of absolute values of the i-th row of $S_g$. Similarly, $L_c = D_c^{-1/2}(D_c - S_c)D_c^{-1/2}$, where $S_c$ is a cell–cell similarity matrix and $D_c$ is the degree matrix. The second and third terms make the lower-dimensional representations keep the knowledge of gene–gene correlations and cell–cell correlations via network regularization. $a \in \mathbb{R}^{c \times 1}$ and all entries of a are $1/c$. $b \in \mathbb{R}^{g \times 1}$ represents the average expression of each gene calculated from bulk RNA-seq data. The last term makes the average gene expression calculated from $WH$ approach that from bulk RNA-seq data. $\lambda_1$, $\lambda_2$ and $\lambda_3$ are regularization parameters. After obtaining W and H, we calculate $WH$ and set its negative entries to zeroes to get a matrix named $\tilde{Y}$. For the zero entries of Y, we use the corresponding entries of $\tilde{Y}$ as replacements to obtain the imputed gene expression data $\hat{Y}$.
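As a minimal sketch of the four terms described above, the following NumPy code evaluates the objective and performs the final replacement step. It assumes that H is stored as a k × c matrix (so the reconstruction is `W @ H`), that the data-fit term is a squared Frobenius norm and that the bulk term is a squared Euclidean norm. The function names are illustrative and are not taken from the scINRB implementation.

```python
import numpy as np

def normalized_laplacian(S):
    """Symmetric normalized Laplacian D^{-1/2} (D - S) D^{-1/2} of a similarity matrix S."""
    d = np.abs(S).sum(axis=1)              # degree: row sums of absolute values
    inv_sqrt = np.zeros_like(d, dtype=float)
    nonzero = d > 0
    inv_sqrt[nonzero] = 1.0 / np.sqrt(d[nonzero])
    return inv_sqrt[:, None] * (np.diag(d) - S) * inv_sqrt[None, :]

def objective(Y, W, H, L_g, L_c, b, lam1, lam2, lam3):
    """Data fit + gene-network term + cell-network term + bulk-average term."""
    c = Y.shape[1]
    a = np.full(c, 1.0 / c)                # averaging vector over cells
    fit = np.linalg.norm(Y - W @ H, "fro") ** 2
    gene_reg = np.trace(W.T @ L_g @ W)
    cell_reg = np.trace(H @ L_c @ H.T)
    bulk = np.sum((W @ H @ a - b) ** 2)
    return fit + lam1 * gene_reg + lam2 * cell_reg + lam3 * bulk

def impute(Y, W, H):
    """Clip the reconstruction at zero and use it only where Y is zero."""
    Y_tilde = np.maximum(W @ H, 0.0)
    return np.where(Y == 0, Y_tilde, Y)
```

Only the zero entries of Y are replaced, so every observed non-zero value is carried over unchanged into the imputed matrix.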
Implementation of scINRB algorithm
To implement the scINRB algorithm, we employed the gradient descent method [34] (Supplementary Materials). Using the Lagrange multiplier method, we transformed the constrained original problem into an unconstrained problem. We introduced two multiplier matrices, one with the same dimensions as W and the other with the same dimensions as H. We then derived the gradients of the unconstrained objective function with respect to each matrix (W, H and the two multiplier matrices) through partial differentiation. Subsequently, the gradients were used in iterations to optimize the objective function. The iterative formulas are as follows:
[Equations (2)–(5): iterative update formulas for W, H and the two multiplier matrices]
The left side of each equation represents the values at the (t + 1)-th iteration, while the right side uses the values at the t-th iteration. The learning rate of the gradient descent method was set to 1e-7 by default.
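To illustrate what one gradient-based iteration can look like, the sketch below substitutes projected gradient descent (a plain gradient step followed by clipping at zero) for the Lagrange-multiplier updates, which are not reproduced here. It assumes an objective of the form ‖Y − WH‖² + λ1·Tr(WᵀL_gW) + λ2·Tr(HL_cHᵀ) + λ3·‖WHa − b‖², with H of size k × c, symmetric Laplacians and a averaging over cells. All names are illustrative.

```python
import numpy as np

def gradients(Y, W, H, L_g, L_c, b, lam1, lam2, lam3):
    """Gradients of the assumed objective with respect to W and H."""
    c = Y.shape[1]
    a = np.full(c, 1.0 / c)                # averaging vector over cells
    R = W @ H - Y                          # reconstruction residual
    r = W @ H @ a - b                      # residual of per-gene averages
    grad_W = 2 * R @ H.T + 2 * lam1 * L_g @ W + 2 * lam3 * np.outer(r, H @ a)
    grad_H = 2 * W.T @ R + 2 * lam2 * H @ L_c + 2 * lam3 * np.outer(W.T @ r, a)
    return grad_W, grad_H

def projected_step(Y, W, H, L_g, L_c, b, lam1, lam2, lam3, lr=1e-7):
    """One gradient step on W and H, clipped to keep both non-negative."""
    grad_W, grad_H = gradients(Y, W, H, L_g, L_c, b, lam1, lam2, lam3)
    W_new = np.maximum(W - lr * grad_W, 0.0)
    H_new = np.maximum(H - lr * grad_H, 0.0)
    return W_new, H_new
```

The default `lr=1e-7` mirrors the default learning rate stated above; the clipping is a design choice of this sketch and not necessarily how the original implementation enforces non-negativity.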
We performed the iterations until convergence. The iterations stop when the number of iterations reaches 2000 or when the ratio of the difference between the current imputed matrix and the previous one to the current one is less than a threshold $\varepsilon$ (which is set to 1e-05 by default), i.e.

$$\frac{\|\hat{Y}^{(t)} - \hat{Y}^{(t-1)}\|}{\|\hat{Y}^{(t)}\|} < \varepsilon \qquad (6)$$

where $\hat{Y}^{(t)}$ and $\hat{Y}^{(t-1)}$ are the imputed matrices at the t-th and the (t-1)-th iterations.
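The stopping rule can be sketched as a small helper. The choice of the Frobenius norm to measure the "difference" is an assumption of this sketch, and the function name is illustrative.

```python
import numpy as np

def has_converged(Y_curr, Y_prev, iteration, tol=1e-5, max_iter=2000):
    """Stop at the iteration cap or when the relative change falls below tol."""
    if iteration >= max_iter:
        return True
    denom = np.linalg.norm(Y_curr)         # Frobenius norm for 2-D arrays
    change = np.linalg.norm(Y_curr - Y_prev)
    if denom == 0:
        return change == 0                 # avoid dividing by zero
    return change / denom < tol
```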
Determination of parameters
To determine the number of factors k and the regularization parameters $\lambda_1$, $\lambda_2$ and $\lambda_3$, we used five-fold cross-validation, adapting the approach of Elyanow et al. [23] (Supplementary Materials). We divided the entries of the input single-cell gene expression matrix into five folds at random, each of which contains 20% of the entries. Then, we ran scINRB for a range of k (10, 50, 100, 300 or 500) with all regularization parameters set to zero, masking out one fold of entries. Next, we calculated the root mean squared error (RMSE) between the masked entries of Y and the corresponding entries of the reconstructed matrix. The procedure was repeated for each fold, and the five RMSE values were averaged. We selected the value of k that resulted in the lowest average RMSE. After determining the value of k, we used grid search with five-fold cross-validation to select the regularization parameter combination that resulted in the lowest average RMSE. $\lambda_1$ and $\lambda_2$ can be 0.001, 0.1 or 10, and $\lambda_3$ can be 1, 10, 50 or 100.
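The entry-wise cross-validation above can be sketched as follows. Here `fit_predict` is a hypothetical callable standing in for a full scINRB run with a given k and regularization setting, and hiding the held-out entries by setting them to zero is an assumption of this sketch.

```python
import numpy as np

def make_folds(shape, n_folds=5, seed=0):
    """Assign every matrix entry to one of n_folds folds of (near-)equal size."""
    rng = np.random.default_rng(seed)
    labels = np.arange(shape[0] * shape[1]) % n_folds
    rng.shuffle(labels)
    return labels.reshape(shape)

def masked_rmse(Y, Y_hat, mask):
    """RMSE restricted to the held-out entries."""
    diff = Y[mask] - Y_hat[mask]
    return float(np.sqrt(np.mean(diff ** 2)))

def cross_validate(Y, fit_predict, n_folds=5, seed=0):
    """Average held-out RMSE over all folds for one parameter setting."""
    folds = make_folds(Y.shape, n_folds, seed)
    scores = []
    for f in range(n_folds):
        mask = folds == f
        Y_train = np.where(mask, 0.0, Y)   # hide the held-out entries
        Y_hat = fit_predict(Y_train, mask)
        scores.append(masked_rmse(Y, Y_hat, mask))
    return float(np.mean(scores))
```

Selecting k, and then the regularization parameters, amounts to calling `cross_validate` once per candidate setting and keeping the setting with the lowest returned value.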
Simulated single-cell gene expression data
We used Splatter [35] to generate simulated single-cell gene expression datasets of different sizes and dropout rates, each of which consists of three cell types. The dataset sizes (genes × cells) are 800 × 1000, 1000 × 1000, 1000 × 800, 1000 × 5000 and 5000 × 1000. We used the function splatSimulateGroups to generate each simulated dataset. Three clusters were embedded, and the size of each was controlled by the parameter ‘group.prob’, which was set as 0.2, 0.35 and 0.45 for the three clusters. The parameter controlling the probability that a gene is differentially expressed in each group was set to 0.045. The location and scale parameters of the log-normal distribution from which the multiplication factors are randomly generated were set to 0.1 and 0.4, respectively. The parameter ‘dropout.mid’ was used to control the dropout rates in the simulated data to obtain datasets with dropout rates of 20%, 30%, 40%, 50% and 60%. Note that the dropout rate here is the rate of non-zero entries turning to zeroes. After obtaining true and dropout data matrices using Splatter, both in the form of counts per million (CPM), we used the mean values of genes in the true gene expression data to represent the corresponding bulk RNA-seq data. The dropout data and bulk data are the inputs of scINRB.
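The two derived inputs can be illustrated with a short NumPy sketch. It uses uniform random masking of non-zero entries as a simplification and does not reproduce Splatter's own dropout model. The function names are illustrative.

```python
import numpy as np

def apply_dropout(Y_true, rate, seed=0):
    """Set a given fraction of the non-zero entries of Y_true to zero."""
    rng = np.random.default_rng(seed)
    Y = Y_true.copy()
    nonzero = np.flatnonzero(Y_true)       # flat indices of non-zero entries
    n_drop = int(round(rate * nonzero.size))
    dropped = rng.choice(nonzero, size=n_drop, replace=False)
    Y.flat[dropped] = 0
    return Y

def bulk_proxy(Y_true):
    """Per-gene mean of the true matrix (genes in rows), used as the bulk profile."""
    return Y_true.mean(axis=1)
```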
Experimental single-cell gene expression data
We also obtained experimental single-cell gene expression data and corresponding bulk RNA-seq data below, preprocessed using the same quality control standards, from a benchmark study [32] for testing the impact of imputation methods on downstream analyses, including visualization, clustering and trajectory inference. More details about the experimental datasets can be seen in Supplementary Materials.
For visualization and clustering analysis, in addition to the simulated datasets, we also analyzed four experimental datasets. Among them, three are benchmarking scRNA-seq datasets from CellBench [36], generated using the CEL-seq2 protocol (denoted as sc_celseq2), the Drop-seq Dolomite protocol (denoted as sc_dropseq) and the 10x Genomics Chromium protocol (denoted as sc_10x). These datasets contain three human lung adenocarcinoma cell lines, HCC827, H1975 and H2228. For these three datasets, bulk RNA-seq samples from GSE86337 [37] (each of the three cell lines has two replicates) were used to calculate the average expression level of each gene across all samples. We also downloaded the preprocessed ENCODE [38] cell line dataset and corresponding bulk RNA-seq data from the benchmarking study [32]. The cell line dataset was sequenced with the SMARTer full-length method using the Fluidigm C1 protocol (hence denoted as ENCODE_fluidigm_5cl). It includes A549, GM12878, H1-hESC, IMR90 and K562 cell lines, each corresponding to a distinct cell type.
For trajectory inference, we analyzed two experimental datasets, a benchmarking RNA mixture dataset from CellBench [36] generated using Sort-seq protocol (denoted as RNAmix_sortseq) and a bone marrow cell dataset of tissue samples from Human Cell Atlas [39] measured using 10x Genomics (denoted as HCA_10x_tissue). Dataset RNAmix_sortseq involves three human lung adenocarcinoma cell lines, HCC827, H1975 and H2228. Dataset HCA_10x_tissue consists of B cell, CD4 T cell, CD8 T cell, common lymphoid progenitor (CLP), common myeloid progenitor (CMP), erythroid, granulocyte-macrophage progenitor (GMP), hematopoietic stem cell (HSC), lymphoid-primed multipotent progenitor (LMPP), megakaryocyte-erythroid progenitor (MEP), monocyte, multipotent blood progenitor (MPP) and natural killer (NK) cells. For RNAmix_sortseq and HCA_10x_tissue, bulk RNA-seq samples from GSE86337 [37] and GSE74246 [40] were used, respectively, to calculate the average expression level of each gene across all samples.
Benchmarking
We compared scINRB with 10 commonly used single-cell gene expression imputation methods, including SAVER [12], scImpute [14], VIPER [16], DrImpute [17], MAGIC [18], DCA [27], DeepImpute [28], SAUCIE [29], CMF-Impute [21] and SCRABBLE [25]. Among them, SAVER, scImpute and VIPER are model-based imputation methods; DrImpute and MAGIC are smoothing-based imputation methods; DCA, DeepImpute and SAUCIE are deep learning–based methods; CMF-Impute and SCRABBLE are low-rank matrix-based methods. DCA and DeepImpute require an input of counts, and SCRABBLE is a method leveraging the information of bulk RNA-seq data. To run the imputation methods, we met the input requirement of each method and skipped the gene and cell filtering steps within each method's pipeline to ensure that the output imputed matrices have the same dimensions.
RESULTS
Overall description
To keep the original cell–cell and gene–gene correlations and leverage the gene expression average calculated from bulk RNA-seq data in the imputation of single-cell gene expression, we propose a method, scINRB, which utilizes network regularization and bulk RNA-seq data. The flowchart of the scINRB algorithm can be seen in Figure 1A. The input single-cell gene expression matrix Y is decomposed into low-dimensional representations of genes (W) and cells (H), using network regularization to make W and H preserve the cell–cell and gene–gene correlations. Additionally, the average gene expression levels calculated from bulk RNA-seq data are incorporated to constrain the product of W and H. After obtaining W and H, we set the negative entries of this product to zeroes and then use the corresponding entries of the resulting matrix to fill in the zero entries of Y. The imputed gene expression matrix is then used for downstream analyses.
Figure 1.
(A) Flowchart of scINRB algorithm. (B) Testing and evaluation process of scINRB.
To assess the performance of scINRB compared with existing imputation methods, we tested different kinds of datasets, including simulated and experimental single-cell expression data (Figure 1B). For the simulated data, we used Splatter [35] to generate datasets of five different sizes. For each data size, a certain proportion (20%, 30%, 40%, 50% or 60%) of non-zero entries in true data were randomly masked, i.e. set to zeroes, to generate dropout data. Note that, for a given data size, the true data are the same across the different dropout rates. To assess the ability of imputation methods to recover gene expression, we computed the root mean square error (RMSE) and Pearson correlation coefficient (PCC) between the imputed gene expression data and the true expression data. To assess whether the imputation methods preserve the true gene–gene and cell–cell data structures, we calculated the PCC between the gene–gene similarity and cell–cell similarity matrices obtained from the true and imputed single-cell gene expression matrices. Additionally, based on both simulated and experimental data (including sc_celseq2, sc_dropseq, sc_10x and ENCODE_fluidigm_5cl), we checked whether imputation can improve visualization and clustering performance. Using experimental data (RNAmix_sortseq and HCA_10x_tissue), we examined the effectiveness of imputation algorithms in trajectory inference.
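The evaluation metrics above can be sketched in NumPy as follows. Using Pearson correlation between cells (or genes) as the similarity matrix is an assumption of this sketch, since the similarity measure used for evaluation is not specified here. The function names are illustrative.

```python
import numpy as np

def rmse(A, B):
    """Root mean square error over all entries."""
    return float(np.sqrt(np.mean((A - B) ** 2)))

def pcc(A, B):
    """Pearson correlation between the flattened entries of two matrices."""
    return float(np.corrcoef(A.ravel(), B.ravel())[0, 1])

def similarity_preservation(Y_true, Y_imputed, axis="cell"):
    """PCC between similarity matrices built from true and imputed data.

    Rows are genes and columns are cells; similarity is taken to be the
    Pearson correlation between cells (axis="cell") or genes (axis="gene").
    """
    if axis == "cell":
        S_true, S_imp = np.corrcoef(Y_true.T), np.corrcoef(Y_imputed.T)
    else:
        S_true, S_imp = np.corrcoef(Y_true), np.corrcoef(Y_imputed)
    return pcc(S_true, S_imp)
```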
scINRB recovers gene expression accurately
Ideally, an imputation method should accurately recover gene expression and make the imputed gene expression values closer to the true data, without introducing significant errors. Based on simulated scRNA-seq datasets, using true data as the gold standard, the masked data with dropouts was supplied as input for the imputation methods to assess their imputation ability.
For the simulated data containing 1000 genes, 1000 cells and a medium dropout rate (i.e. 40%), Figure 2A shows the heatmap of the imputed matrix by scINRB and the other 10 imputation algorithms along with the true and dropout data. The heatmaps of the imputation results for other simulated data can be found in Supplementary Figures 1–5. It can be observed that the data imputed by scINRB exhibits smoother and more continuous expression patterns, closely resembling the simulated gold-standard data (i.e. true data). This indicates that scINRB can better reconstruct missing data that may exist in experimental settings. In comparison, other methods such as SAVER, scImpute, VIPER and CMF-Impute do not perform well in capturing the true data pattern. Furthermore, compared with most other methods, scINRB demonstrates consistent and stable results across different data sizes and different dropout rates (Supplementary Figures 1–5).
Figure 2.
(A) Heatmaps of the imputation results of simulated data (1000 genes × 1000 cells with 40% dropout rate) along with true and dropout data. (B) PCC and (C) RMSE between imputed gene expression data and true data for all simulated data.
To quantify the performance of different imputation methods in recovering single-cell gene expression levels, we calculated the PCC between the imputed gene expression data and the true expression data. Figure 2B presents the PCC results for each simulated dataset. We found that most imputation methods are effective in recovering single-cell gene expression, among which scINRB, DrImpute and SCRABBLE outperform others across most simulated datasets. Furthermore, an increasing dropout rate leads to varying degrees of decrease in the performance of all imputation methods in recovering gene expression. Specifically, for datasets with high dropout rates, the imputation performance of SAVER, VIPER, SAUCIE and CMF-Impute is poor, possibly because a large amount of missing data makes it difficult to infer the true gene expression from non-missing values. However, scINRB maintains good performance even under high dropout rates, indicating its robustness.
In addition, we used RMSE as an evaluation metric to assess the difference between the imputed gene expression and the true expression data (as shown in Figure 2C). Most imputation methods can reduce the differences between the imputed and true data compared with the input dropout data. Among them, scINRB performs better than others, with a small difference between the imputed and true data. Additionally, DrImpute, DeepImpute and SCRABBLE also show prominent performances. These methods exhibit stability across different dropout rates, meaning that the imputation is not significantly affected by changes in dropout rates. However, most other imputation methods show a gradual decline in performance as the dropout rate increases. Methods like SAVER, VIPER, SAUCIE and CMF-Impute not only underperform in capturing the true data pattern but also are greatly influenced by dropout rates.
It can be seen that utilizing the non-negative matrix factorization and the information from bulk RNA-seq data is beneficial to the recovery of single-cell gene expression data. This is because low-rank approximation can reduce noise in highly sparse single-cell data and extract the key factors to describe genes and cells, and leveraging the gene expression average can constrain the range of imputed values.
scINRB maintains gene–gene and cell–cell correlations
One of the main applications of scRNA-seq technology is to reveal gene–gene relationships and cell–cell relationships within complex tissues. Therefore, imputation methods should preserve the interrelationships between cells and genes.
To assess the ability of imputation methods to preserve cell–cell correlations, we calculated the PCC between the cell–cell similarity matrices obtained from the true and the imputed single-cell gene expressions (Figure 3A). The cell–cell correlation matrix derived from the data imputed by scINRB exhibits a higher correlation with the true cell–cell correlation matrix than those of the other 10 imputation methods, indicating that scINRB better preserves the true cell–cell relationships. Furthermore, the performance of imputation methods is somewhat influenced by the presence of missing values, with a higher dropout rate leading to a greater impact, as observed in methods such as SAVER, scImpute and MAGIC. In contrast, scINRB, DrImpute and SCRABBLE are less affected by the proportion of missing values and demonstrate more stable performance.
Figure 3.
(A) PCC between the cell–cell similarity matrices obtained from the true and the imputed single-cell gene expressions. (B) PCC between the gene–gene similarity matrices obtained from the true and the imputed single-cell gene expressions.
Similarly, to evaluate whether imputation methods effectively preserve gene–gene correlations, we also calculated the PCC between the true gene–gene similarity matrix and the gene–gene similarity matrix derived from the imputed matrix. Figure 3B shows that the gene–gene correlation matrix obtained from the data imputed by scINRB exhibits a higher correlation with the true gene–gene correlation matrix than those of the other 10 imputation methods, indicating that scINRB has a stronger capability to recover the true gene–gene correlation structure. A higher proportion of missing values leads to poorer performance of imputation methods. We observed that, for the simulated dataset with 5000 genes and 1000 cells, as the dropout rate increases, the preservation of gene–gene correlations by the various imputation methods deteriorates. This effect is less pronounced in the other four datasets, which may be because the true data of the simulated dataset with 5000 genes and 1000 cells already contain 32% zero values, more than the other four datasets (ranging from 10% to 20%), and when the dropout rate reaches 60%, the rate of zeroes in this dataset reaches 92%. This also implies that a large amount of missing data affects the accuracy of imputation results, and most imputation methods have difficulty in compensating for the impact of a high dropout rate.
These results indicate that introducing the regularization terms for the cell–cell and gene–gene networks helps preserve the correlations between cells and between genes during single-cell data imputation, hence facilitating the subsequent cell-level and gene-level analyses.
scINRB improves visualization effect
Visualization is an important analytical process that allows us to observe sample distribution more intuitively, facilitating the identification of different cell types and their expression patterns. To evaluate the effectiveness of scINRB and other imputation methods in visualization analysis, we first performed t-SNE [41] on the five sets of imputed simulated data to reduce the high-dimensional data to a two-dimensional space for visualization and checked whether different groups of cells can still be differentiated after imputation. For the simulated data containing 1000 genes, 1000 cells and a medium dropout rate (i.e. 40%), Figure 4A shows the t-SNE visualization results of the imputed data along with the simulated true and dropout data. The visualization results for the other simulated data can be found in Supplementary Figures 6–10. We can observe that the larger the dropout rate, the more difficult it becomes to distinguish different cell clusters. However, across different dropout rates, scINRB can improve the separability of clusters relative to dropout data, closely resembling the simulated ground truth (i.e. true data). On the other hand, many other imputation methods perform poorly, exhibiting clear biases and confusion in cell type distributions. For instance, SAVER, SAUCIE, MAGIC and CMF-Impute show mixed clusters when the dropout rate reaches 50%.
Figure 4.
(A) t-SNE visualization of imputed data for the simulated data (1000 genes × 1000 cells with a dropout rate of 40%). Each color denotes a group of cells. (B) Heatmap of imputed experimental data sc_10x. (C) t-SNE visualization of imputed data for sc_10x.
In addition to the simulated data, we also evaluated the performance of all methods using four experimental datasets (sc_10x, sc_celseq2, sc_dropseq and ENCODE_fluidigm_5cl). The heatmaps of the data imputed by all methods are shown in Figure 4B and Supplementary Figure 11, and the t-SNE visualizations of the imputed data are shown in Figure 4C and Supplementary Figure 12. For datasets sc_celseq2 and sc_dropseq, compared with several methods, scINRB maintains cell distributions to a certain degree without introducing additional noise, while SAVER, VIPER, MAGIC and SAUCIE perform better. For sc_10x and ENCODE_fluidigm_5cl, compared with other methods, different cell clusters are more clearly separated and each cluster is more compact after imputation by scINRB, particularly for sc_10x (Figure 4C).
scINRB enhances cell type identification
Cell type identification is an essential task in scRNA-seq data analysis. To assess whether imputation methods can improve the performance of downstream clustering analysis, we ran K-means [42] clustering 100 times on both the input single-cell gene expression data and the data imputed by the various methods, to reduce the effect of randomness in K-means. Then we calculated the mean values of two clustering metrics, ARI (adjusted Rand index) [43] and NMI (normalized mutual information) [44]. The value of K, the predetermined number of clusters, was specified as the true number of cell groups.
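For reference, the adjusted Rand index can be computed from the contingency table of true and predicted labels. The sketch below is a generic implementation of the standard formula, not code from the study; it expects at least two cells. Averaging it over repeated K-means runs gives the kind of summary value described above.

```python
from math import comb
import numpy as np

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index from the contingency table of two labelings."""
    labels_true = np.asarray(labels_true)
    labels_pred = np.asarray(labels_pred)
    _, ti = np.unique(labels_true, return_inverse=True)
    _, pi = np.unique(labels_pred, return_inverse=True)
    table = np.zeros((ti.max() + 1, pi.max() + 1), dtype=int)
    np.add.at(table, (ti, pi), 1)          # count co-occurrences of label pairs

    sum_cells = sum(comb(int(n), 2) for n in table.ravel())
    sum_rows = sum(comb(int(n), 2) for n in table.sum(axis=1))
    sum_cols = sum(comb(int(n), 2) for n in table.sum(axis=0))
    total_pairs = comb(labels_true.size, 2)

    expected = sum_rows * sum_cols / total_pairs
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:              # degenerate case, e.g. a single cluster
        return 1.0
    return (sum_cells - expected) / (max_index - expected)
```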
The averages of ARI and NMI values on five sets of simulated data with different sizes (each set consisting of five dropout rates: 20%, 30%, 40%, 50% and 60%) are shown in Figure 5A. It can be observed that MAGIC generally underperformed in clustering analysis. DCA shows decent performance only in the data with a low dropout rate (20%, 30% and 40%) for 1000 genes and 5000 cells, but does not improve the clustering performance for other simulated datasets. Most of the other imputation methods contribute to enhancing the performance of clustering analysis. The boxplots shown in Figure 5B provide a comprehensive overview of the clustering results across all five simulated datasets. Among all methods, scINRB exhibits the best clustering performance, followed by SCRABBLE and DrImpute. Compared with SCRABBLE, scINRB performs better in the two datasets with gene numbers larger than sample numbers, i.e. 5000 × 1000 and 1000 × 800. DrImpute's performance is not as good as that of scINRB in cases of high dropout rates.
Figure 5.
(A) ARI and NMI means obtained from 100 runs of K-means clustering on simulated datasets before and after imputation. (B) Boxplot of ARI and NMI values in Figure 5A. (C) ARI and NMI means obtained from 100 runs of K-means clustering on four experimental datasets before and after imputation.
Moreover, we also tested on four experimental datasets, namely, sc_10x, sc_celseq2, sc_dropseq and ENCODE_fluidigm_5cl (shown in Figure 5C). We observed that certain methods, such as SCRABBLE and DeepImpute, perform well on simulated data but not on experimental data. DCA shows good clustering performance on sc_10x dataset but performs poorly on the other three experimental datasets. However, scINRB consistently demonstrates excellent imputation performance across all four experimental datasets.
scINRB facilitates trajectory inference
To evaluate whether the imputation algorithms can improve downstream trajectory inference, we ran Monocle 2 [45] on both the input single-cell gene expression data and the data imputed by the imputation methods. Monocle utilizes DDR-Tree (Discriminative DRTree) [46] for dimensionality reduction and tree construction. We analyzed two experimental datasets, RNAmix_sortseq and HCA_10x_tissue (Supplementary Materials). For the RNAmix_sortseq dataset, the cluster with the most H2228 cells was selected as the root state for the inferred trajectory. For HCA_10x_tissue, each cell was assigned a differentiation level and the cluster with the smallest average differentiation level was set as the root state for the inferred trajectory [32]. To evaluate the trajectory inference performance, we calculated two metrics: overlap and Kendall rank correlation coefficient. Overlap denotes the proportion of cells in the inferred trajectory that correctly overlap with the cells in the true trajectory [32, 36]. The Kendall rank correlation coefficient denotes the degree of consistency between the orderings of cells in the inferred trajectory and in the true trajectory [47].
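The Kendall rank correlation between an inferred ordering and a reference ordering can be illustrated with a direct pairwise count. This sketch implements the tau-a variant, in which tied pairs count as neither concordant nor discordant; which variant the benchmark uses is not stated here, so this is an assumption. It expects at least two cells.

```python
def kendall_tau(x, y):
    """Kendall tau-a between two equally long sequences of ranks or pseudotimes."""
    n = len(x)
    concordant = 0
    discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            sign = (x[i] - x[j]) * (y[i] - y[j])
            if sign > 0:
                concordant += 1            # pair ordered the same way in both
            elif sign < 0:
                discordant += 1            # pair ordered in opposite ways
    return (concordant - discordant) / (n * (n - 1) / 2)
```

A value of 1 means the inferred pseudotime orders every pair of cells as the reference does, and a value of -1 means the order is fully reversed.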
Figure 6A shows the performance metrics of the inferred trajectories. For RNAmix_sortseq, several methods perform well, such as scINRB, DrImpute and MAGIC. For HCA_10x_tissue, scINRB, VIPER and SAUCIE show better performance. Compared with RNAmix_sortseq, the task of imputation and trajectory inference on HCA_10x_tissue seems more difficult; relative to the overlap metric, the Kendall rank correlation coefficient is lower overall, while scINRB still achieves outstanding results. To examine the inferred pseudotime on RNAmix_sortseq, Figure 6B shows the results of trajectory inference based on dropout data and on data imputed by different methods. Figure 6C depicts the inferred trajectory on HCA_10x_tissue, for which SCRABBLE could not perform imputation successfully. Imputation by MAGIC and scINRB allows the trajectory to be inferred more accurately and clearly. DeepImpute exhibits mixed differentiation stages in the differentiation trajectory of the HCA_10x_tissue dataset, and CMF-Impute has poor results on both datasets. The results demonstrate that scINRB enhances trajectory inference performance relative to dropout data and is competitive among all imputation methods. On both tested datasets, scINRB consistently ranks near the top and produces an accurate and clear trajectory that shows the differentiation order.
Figure 6.
(A) Trajectory inference metrics, overlap and Kendall rank correlation coefficient, calculated based on two experimental datasets before and after imputation. The white area indicates that the trajectory failed to be inferred by Monocle 2. (B) The inferred trajectory based on imputed RNAmix_sortseq data where color denotes pseudotime. (C) The inferred trajectory based on imputed HCA_10x_tissue data where different colors denote different cell types. HCA_10x_tissue could not be imputed by SCRABBLE.
scINRB is robust to gene and cell numbers
To examine how the numbers of genes and cells influence the performance of imputation algorithms, we used simulated data, for which the ground truth is available. Starting from simulated data with a large number of genes or cells (i.e. 5000 genes or 5000 cells) and a dropout rate of 40% (a medium dropout rate, used as an example), we generated sub-data matrices with several different dimensions and repeated the tests described above. Specifically, from the data with 1000 genes and 5000 cells, we generated datasets with the same genes but with 100, 500, 1000, 1500 and 2000 cells by stratified sampling according to the cell groups. Similarly, from the data with 5000 genes and 1000 cells, we generated datasets with the same cells but with 100, 500, 1000, 1500 and 2000 genes, selecting genes in order from the top differentially expressed genes to the non-differentially expressed genes in each cell group. We then evaluated the imputation performance of all methods on these generated datasets and on the original high-dimensional data, and also examined the time cost and memory usage of each method.
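The stratified sampling of cells can be sketched as follows. This is only an illustration under stated assumptions, not the script used to build the datasets: the function name `stratified_sample`, the proportional-allocation rule, the fixed seed and the group proportions in the example are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(cell_groups, n_target, seed=0):
    """Pick about n_target cell indices, keeping each group's share of the total.

    cell_groups: list of group labels, one per cell (one per matrix column).
    Returns a sorted list of selected cell indices.
    """
    n_total = len(cell_groups)
    if not 0 < n_target <= n_total:
        raise ValueError("n_target must be between 1 and the number of cells")

    rng = random.Random(seed)
    by_group = defaultdict(list)
    for idx, group in enumerate(cell_groups):
        by_group[group].append(idx)

    selected = []
    for group in sorted(by_group):
        members = by_group[group]
        # Proportional allocation, rounded; keep at least one cell per group.
        k = max(1, round(n_target * len(members) / n_total))
        k = min(k, len(members))
        selected.extend(rng.sample(members, k))
    return sorted(selected)


# Hypothetical example: 5000 cells in three groups with proportions 50/30/20.
labels = ["A"] * 2500 + ["B"] * 1500 + ["C"] * 1000
subset = stratified_sample(labels, 500)
```

Because each group's allocation is rounded separately, the number of selected cells can differ slightly from `n_target` when the group proportions do not divide evenly. The returned indices would then be used to select the corresponding columns of the expression matrix.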
As the number of genes or cells increases, scINRB still recovers gene expression accurately, maintains cell–cell and gene–gene correlations and facilitates cell clustering; it is both more stable and better on the performance indicators than most compared methods (Supplementary Figure 13). In addition, the running time of VIPER and SCRABBLE increases significantly with data dimension, and DCA and DeepImpute require much memory, whereas scINRB usually achieves good performance without requiring much time or memory (Supplementary Figure 13).
DISCUSSION
Dropout events are commonly observed in single-cell gene expression data and can negatively affect downstream analysis results. Therefore, it is necessary to impute dropout values in scRNA-seq experiments to ensure data integrity, avoid sample bias and improve the accuracy of data analyses. In this study, we proposed a single-cell gene expression imputation method called scINRB, which incorporates information from bulk RNA-seq data and models the correlation networks between cells and genes.
We first tested scINRB and compared it with 10 commonly used imputation methods on simulated scRNA-seq datasets. We evaluated whether the imputation methods accurately recover gene expression values and maintain the true gene–gene and cell–cell similarities, and how dropout rates affect gene expression imputation. scINRB, SCRABBLE and DrImpute perform exceptionally well in accurately recovering gene expression values. As for preserving the true cell–cell similarity, scINRB and SCRABBLE are excellent. scINRB and DrImpute exhibit superior performance in maintaining the true gene–gene similarity. Furthermore, we tested scINRB and the other 10 imputation methods on both simulated and experimental datasets to assess how much they improve downstream analyses. For t-SNE visualization, some methods perform well on simulated data but not on experimental data, and some are not robust to increasing dropout rates, whereas scINRB improves visualization on both simulated and experimental datasets, even at high dropout rates. For cell clustering, scINRB and SCRABBLE perform well on simulated data, while scINRB and SAUCIE excel on experimental data. Additionally, scINRB and SCRABBLE are only minimally perturbed by changes in dropout rate, making them preferable choices. As for trajectory inference, different methods stand out depending on the dataset and the evaluation metric, while scINRB consistently ranks at the top and helps to infer accurate and clear differentiation trajectories. In addition, as the data dimension increases, the running time of VIPER and SCRABBLE increases significantly, and DCA and DeepImpute require much memory. In contrast, scINRB usually does not require much time or memory to obtain stable and good performance.
Compared with other methods, scINRB possesses several advantages. Firstly, scINRB utilizes non-negative matrix factorization to find lower-dimensional representations of genes and cells. This can reduce noise in the high-dimensional, sparse single-cell gene expression matrix and extract the important factors that describe genes and cells. Secondly, scINRB maintains cell–cell and gene–gene relationships through network regularization, instead of relying solely on distance measurements to quantify the relationships. Several methods rely on cell–cell distance to impute single-cell gene expression data, but this distance-based measurement is only an approximation and cannot directly reflect the cell–cell correlations. The incorporation of cell–cell and gene–gene networks facilitates the subsequent cell-level and gene-level data analyses. Thirdly, scINRB introduces external references into the imputation model, leveraging information from bulk RNA-seq data to reduce biases during the imputation process and improve model accuracy. Lastly, under appropriate parameters, scINRB demonstrates excellent imputation performance across simulated and experimental data, different data scales and varying dropout rates, which indicates its robustness. In practice, users can apply the cross-validation function we provide to choose appropriate parameters for scINRB. scINRB is highly competitive with the existing methods and is well suited to single-cell gene expression imputation.
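The first three advantages can be summarized in a generic network-regularized factorization objective. The form below is an illustrative sketch of this family of models and is not necessarily the exact scINRB objective; all symbols are assumptions introduced here and need not match the notation in the Methods. Here X is the m × n (genes × cells) expression matrix, W (m × k) and H (k × n) are the non-negative gene and cell factors, L_g and L_c are the graph Laplacians of the gene–gene and cell–cell networks, b is the vector of gene average expression from bulk RNA-seq data, 1_n is the all-ones vector of length n and λ1, λ2, λ3 are tuning parameters.

```latex
\[
\min_{W \ge 0,\; H \ge 0}\;
\lVert X - WH \rVert_F^2
+ \lambda_1 \operatorname{tr}\!\left(W^{\top} L_g W\right)
+ \lambda_2 \operatorname{tr}\!\left(H L_c H^{\top}\right)
+ \lambda_3 \left\lVert \tfrac{1}{n}\, WH\,\mathbf{1}_n - b \right\rVert_2^2
\]
```

The first term is the low-rank reconstruction error, the two trace terms penalize factors that differ between neighbouring genes or neighbouring cells in the networks, and the last term pulls the per-gene mean of the imputed matrix towards the bulk average.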
For future improvements and developments of scINRB, several aspects can be considered. Currently, scINRB does not differentiate between technical zeroes and biological zeroes during the imputation process. A probability distribution could be incorporated to distinguish the two, so that only the technical zeroes are imputed. Additionally, integration with data from other modalities, such as spatial location and single-cell methylation data, can be explored to introduce more prior knowledge and enhance the imputation capability and accuracy of the algorithm.
Key Points
To maintain cell–cell and gene–gene correlation networks and leverage bulk RNA-seq data in single-cell gene expression imputation, we propose scINRB based on network regularization and bulk RNA-seq data.
Across simulated datasets with different sizes and dropout rates, scINRB recovers gene expression accurately and preserves cell–cell and gene–gene similarities.
Across different experimental datasets, scINRB is competitive with the existing methods and improves the performance of data visualization, cell clustering and trajectory inference.
scINRB is robust and stable across simulated and experimental data, different data dimensions, varying dropout rates and various application scenarios.
Supplementary Material
Author Biographies
Yue Kang and Hongyu Zhang are postgraduates in the Department of Automation at Xiamen University.
Jinting Guan is an associate professor in the Department of Automation at Xiamen University.
Contributor Information
Yue Kang, Department of Automation, Xiamen University, Xiamen, Fujian, China.
Hongyu Zhang, Department of Automation, Xiamen University, Xiamen, Fujian, China.
Jinting Guan, Department of Automation, Xiamen University, Xiamen, Fujian, China; National Institute for Data Science in Health and Medicine, Xiamen University, Xiamen, Fujian, China.
FUNDING
This work was supported by the National Science and Technology Major Project (No. 2021ZD0112600), the National Natural Science Foundation of China (No. 61803320) and the Natural Science Foundation of Fujian Province of China (No. 2022J05012).
DATA AVAILABILITY
scINRB is available at: https://github.com/JGuan-lab/scINRB. The simulated datasets and the analyzed experimental single-cell datasets can be accessed at: https://zenodo.org/record/8224512. The experimental datasets were originally downloaded from a benchmarking study of Hou et al. at: https://doi.org/10.5281/zenodo.3701939.
References
- 1. Islam S, Zeisel A, Joost S, et al. Quantitative single-cell RNA-seq with unique molecular identifiers. Nat Methods 2014;11(2):163–6.
- 2. Svensson V, Natarajan KN, Ly LH, et al. Power analysis of single-cell RNA-sequencing experiments. Nat Methods 2017;14(4):381–7.
- 3. Kim JK, Kolodziejczyk AA, Ilicic T, et al. Characterizing noise structure in single-cell RNA-seq distinguishes genuine from technical stochastic allelic expression. Nat Commun 2015;6:8687.
- 4. Mazutis L, Araghi AF, Miller OJ, et al. Droplet-based microfluidic systems for high-throughput single DNA molecule isothermal amplification and analysis. Anal Chem 2009;81(12):4813–21.
- 5. Hicks SC, Townes FW, Teng M, Irizarry RA. Missing data and technical variability in single-cell RNA-sequencing experiments. Biostatistics 2018;19(4):562–78.
- 6. Marinov GK, Williams BA, McCue K, et al. From single-cell to cell-pool transcriptomes: stochasticity in gene expression and RNA splicing. Genome Res 2014;24(3):496–510.
- 7. Qiu P. Embracing the dropouts in single-cell RNA-seq analysis. Nat Commun 2020;11(1):1169.
- 8. Larsson AJM, Johnsson P, Hagemann-Jensen M, et al. Genomic encoding of transcriptional burst kinetics. Nature 2019;565(7738):251–4.
- 9. Lähnemann D, Köster J, Szczurek E, et al. Eleven grand challenges in single-cell data science. Genome Biol 2020;21(1):31.
- 10. Tang W, Bertaux F, Thomas P, et al. bayNorm: Bayesian gene expression recovery, imputation and normalization for single-cell RNA-sequencing data. Bioinformatics 2020;36(4):1174–81.
- 11. Prabhakaran S, Azizi E, Carr A, et al. Dirichlet process mixture model for correcting technical variation in single-cell gene expression data. Proceedings of the 33rd International Conference on Machine Learning, PMLR: Proceedings of Machine Learning Research, New York, NY, USA, 2016;48:1070–9.
- 12. Huang M, Wang J, Torre E, et al. SAVER: gene expression recovery for single-cell RNA sequencing. Nat Methods 2018;15(7):539–42.
- 13. Wang J, Agarwal D, Huang M, et al. Data denoising with transfer learning in single-cell transcriptomics. Nat Methods 2019;16(9):875–8.
- 14. Li WV, Li JJ. An accurate and robust imputation method scImpute for single-cell RNA-seq data. Nat Commun 2018;9(1):997.
- 15. Miao Z, Li J, Zhang X. scRecover: discriminating true and false zeros in single-cell RNA-seq data for imputation. bioRxiv 2019;665323.
- 16. Chen M, Zhou X. VIPER: variability-preserving imputation for accurate gene expression recovery in single-cell RNA sequencing studies. Genome Biol 2018;19(1):196.
- 17. Gong W, Kwak IY, Pota P, et al. DrImpute: imputing dropout events in single cell RNA sequencing data. BMC Bioinformatics 2018;19(1):220.
- 18. van Dijk D, Sharma R, Nainys J, et al. Recovering gene interactions from single-cell data using data diffusion. Cell 2018;174(3):716–729.e27.
- 19. Wagner F, Yan Y, Yanai I. K-nearest neighbor smoothing for high-throughput single-cell RNA-Seq data. bioRxiv 2017;217737.
- 20. Linderman GC, Zhao J, Roulis M, et al. Zero-preserving imputation of single-cell RNA-seq data. Nat Commun 2022;13(1):192.
- 21. Xu J, Cai L, Liao B, et al. CMF-impute: an accurate imputation tool for single-cell RNA-seq data. Bioinformatics 2020;36(10):3139–47.
- 22. Mongia A, Sengupta D, Majumdar A. McImpute: matrix completion based imputation for single cell RNA-seq data. Front Genet 2019;10:9.
- 23. Elyanow R, Dumitrascu B, Engelhardt BE, Raphael BJ. netNMF-sc: leveraging gene-gene interactions for imputation and dimensionality reduction in single-cell expression analysis. Genome Res 2020;30(2):195–204.
- 24. Zhang L, Zhang S. PBLR: an accurate single cell RNA-seq data imputation tool considering cell heterogeneity and prior expression level of dropouts. bioRxiv 2018;379883.
- 25. Peng T, Zhu Q, Yin P, Tan K. SCRABBLE: single-cell RNA-seq imputation constrained by bulk RNA-seq data. Genome Biol 2019;20(1):88.
- 26. Talwar D, Mongia A, Sengupta D, Majumdar A. AutoImpute: autoencoder based imputation of single-cell RNA-seq data. Sci Rep 2018;8(1):16329.
- 27. Eraslan G, Simon LM, Mircea M, et al. Single-cell RNA-seq denoising using a deep count autoencoder. Nat Commun 2019;10(1):390.
- 28. Arisdakessian C, Poirion O, Yunits B, et al. DeepImpute: an accurate, fast, and scalable deep neural network method to impute single-cell RNA-seq data. Genome Biol 2019;20(1):211.
- 29. Amodio M, van Dijk D, Srinivasan K, et al. Exploring single-cell data with deep multitasking neural networks. Nat Methods 2019;16(11):1139–45.
- 30. Deng Y, Bao F, Dai Q, et al. Scalable analysis of cell-type composition from single-cell transcriptomics using deep recurrent learning. Nat Methods 2019;16(4):311–4.
- 31. Lopez R, Regier J, Cole MB, et al. Deep generative modeling for single-cell transcriptomics. Nat Methods 2018;15(12):1053–8.
- 32. Hou W, Ji Z, Ji H, Hicks SC. A systematic evaluation of single-cell RNA-sequencing imputation methods. Genome Biol 2020;21(1):218.
- 33. Zhang L, Zhang S. Comparison of computational methods for imputing single-cell RNA-sequencing data. IEEE/ACM Trans Comput Biol Bioinform 2020;17(2):376–89.
- 34. Ruder S. An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747, 2016.
- 35. Zappia L, Phipson B, Oshlack A. Splatter: simulation of single-cell RNA sequencing data. Genome Biol 2017;18(1):174.
- 36. Tian L, Dong X, Freytag S, et al. Benchmarking single cell RNA-sequencing analysis pipelines using mixture control experiments. Nat Methods 2019;16(6):479–87.
- 37. Holik AZ, Law CW, Liu R, et al. RNA-seq mixology: designing realistic control experiments to compare protocols and analysis methods. Nucleic Acids Res 2017;45(5):e30.
- 38. The ENCODE (ENCyclopedia of DNA elements) project. Science 2004;306(5696):636–40.
- 39. Regev A, Teichmann SA, Lander ES, et al. The Human Cell Atlas. eLife 2017;6:e27041.
- 40. Corces MR, Buenrostro JD, Wu B, et al. Lineage-specific and single-cell chromatin accessibility charts human hematopoiesis and leukemia evolution. Nat Genet 2016;48(10):1193–203.
- 41. van der Maaten L, Hinton G. Visualizing data using t-SNE. J Mach Learn Res 2008;9:2579–605.
- 42. MacQueen J. Some methods for classification and analysis of multivariate observations. Proceedings of the fifth Berkeley symposium on mathematical statistics and probability. University of California Press, Berkeley, California, 1967, 281–97.
- 43. Hubert L, Arabie P. Comparing partitions. J Classif 1985;2(1):193–218.
- 44. Estevez PA, Tesmer M, Perez CA, Zurada JM. Normalized mutual information feature selection. IEEE Trans Neural Netw 2009;20(2):189–201.
- 45. Qiu X, Hill A, Packer J, et al. Single-cell mRNA quantification and differential analysis with census. Nat Methods 2017;14(3):309–15.
- 46. Mao Q, Wang L, Goodison S, et al. Dimensionality reduction via graph structure learning. Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Association for Computing Machinery: Sydney, NSW, Australia, 2015, 765–74.
- 47. Sun S, Zhu J, Ma Y, Zhou X. Accuracy, robustness and scalability of dimensionality reduction methods for single-cell RNA-seq analysis. Genome Biol 2019;20(1):269.