scRNMF: An imputation method for single-cell RNA-seq data by robust and non-negative matrix factorization

Yuqing Qian; Quan Zou; Mengyuan Zhao; Yi Liu; Fei Guo; Yijie Ding

doi:10.1371/journal.pcbi.1012339

2024 Aug 8;20(8):e1012339. doi: 10.1371/journal.pcbi.1012339

Editor: Qing Nie⁵

PMCID: PMC11338450 PMID: 39116191

Abstract

Single-cell RNA sequencing (scRNA-seq) has emerged as a powerful tool in genomics research, enabling the analysis of gene expression at the individual cell level. However, scRNA-seq data often suffer from a high rate of dropouts, where certain genes fail to be detected in specific cells due to technical limitations. This missing data can introduce biases and hinder downstream analysis. To overcome this challenge, the development of effective imputation methods has become crucial in the field of scRNA-seq data analysis. Here, we propose an imputation method based on robust and non-negative matrix factorization (scRNMF). Unlike other matrix factorization algorithms, scRNMF integrates two loss functions: L₂ loss and C-loss. The L₂ loss function is highly sensitive to outliers, which can introduce substantial errors. We therefore utilize the C-loss function when dealing with zero values in the raw data. The primary advantage of the C-loss function is that it imposes a smaller penalty on larger errors, which results in more robust factorization when handling outliers. Various datasets of different sizes and zero rates are used to evaluate the performance of scRNMF against other state-of-the-art methods. Our method demonstrates its power and stability as a tool for imputation of scRNA-seq data.

Author summary

It is still difficult to analyze scRNA-seq data because a significant portion of expressed genes are recorded as zeros. Gene expression levels can be restored through the imputation of scRNA-seq data, facilitating downstream analysis. To overcome this challenge, we propose an imputation method based on robust and non-negative matrix factorization (scRNMF). Unlike other matrix factorization algorithms, scRNMF integrates two loss functions: L₂ loss and C-loss. Using several simulated and real datasets, we perform a comprehensive evaluation of scRNMF against existing methods. scRNMF can enhance various aspects of downstream analysis, including gene expression data recovery, cell clustering analysis, gene differential expression analysis, and cellular trajectory reconstruction. The results of our study demonstrate that scRNMF is a powerful tool that can improve the accuracy of single-cell data analysis.

Introduction

Single-cell RNA sequencing is a powerful technique that allows researchers to analyze gene expression at the single-cell level. However, scRNA-seq data is prone to dropout events, which refer to the failure of detecting true gene expression due to technical limitations during the experimental process. Dropout events occur when the expression level of a gene falls below the detection limit, resulting in zero or low read counts. Dropout events are particularly common in lowly expressed genes, making their identification and handling challenging. Understanding dropout events is crucial for the accurate interpretation of scRNA-seq data.

To address dropout events, several computational methods have been developed. These methods aim to impute the missing gene expression values by leveraging the information from other genes or cells. In general, we can categorize the previous imputation methods into three classes.

The first category of methods involves probabilistic models. These models identify the zeros as dropout values and predict the missing value. scImpute [1] estimates dropout rates using a Gamma-Normal mixture model. For the expression of genes with high dropout probabilities, scImpute constructs a separate non-negative least squares regression model to impute. Huang et al. [2] developed an expression recovery tool called SAVER which uses a Poisson-Gamma model to pool expression values across genes within each cell. Further, SAVER-X, proposed by Wang et al. [3], couples an autoencoder (AE) with a Bayesian model to extract transferable gene-gene relations across data sets.

The second category of methods aims to restore the expression value from the raw data or prior knowledge (such as cell-cell and gene-gene interaction network). For example, MAGIC [4] shares information between similar cells through data diffusion. To capture gene and cell similarities, scTSSR [5] simultaneously learns two non-negative sparse self-representation matrices. Genes and cells that show similarity are bilinearly combined to impute dropout values. Further, scTSSR2 [6] combines matrix decomposition with scTSSR, leading to fast two-side sparse self-representation to impute dropout events in scRNA-seq data. Using a multi-objective optimization model, scMOO [7] infers the combination of weights and the latent representation of three types of structures (horizontal, vertical, and low-rank) from the data.

Another category of methods is based on deep learning (DL) theory. The hidden distribution of gene expression can be captured using a DL-based approach [8–10]. AutoImpute [11] uses an autoencoder (AE) to learn the distribution of input data, imputing the missing values with minimal impact on gene expression levels. The missing values are imputed in DCA [12] by a deep count AE whose output layer is treated as a Zero-inflated Negative Binomial (ZINB) regression. However, these DL-based approaches rely on strong distribution assumptions, which severely limit their effectiveness and utility. Thus, Li et al. [13] developed AutoClass, an algorithm that combines an AE with a classifier without assuming any particular data distribution, allowing it to effectively remove noise and artifacts from scRNA-seq datasets. Wang et al. [14] developed the single-cell Graph Neural Network (scGNN), which uses graph neural networks to learn cell-cell relationships. scIGANs, proposed by Xu et al. [15], uses generated cells to balance the performance between major and rare cell populations. The scGCL method, proposed by Xiong et al. [16], employs a ZINB AE and graph contrastive learning for estimating dropout rates.

MF [17–21] and AE [8, 11–13, 16, 22]-based imputation methods first identify a latent space representation of cells or genes. They then reconstruct the observed expression matrix from the estimated latent spaces, resulting in a matrix that is no longer sparse. While MF is a shallow model capturing the linear relationship between cells and genes, AE is a DL model that can capture non-linear relationships and restore complex structures not exhibited in the raw data. According to comparison results on the single-cell Imputation Methods Comparison platform (scIMC [23]), it is evident that DL-based approaches hold significant potential for imputation. However, AEs often require substantial training data to extract meaningful representations. Insufficient data can lead to overfitting, hindering the model’s ability to generalize accurately to new data.

Research [24, 25] has shown that leveraging bulk RNA-seq data can significantly improve the quality of imputed data. SCRABBLE [26] enhances clustering quality and cell type identification by imputing dropout events from bulk RNA-seq data. It employs matrix regularization rather than relying on cell-cell distance, transforming the mathematical model into a convex optimization problem. DURIAN [27] reduces its error rate by using celltype-specific gene expression patterns in the bulk expression data. During the imputation task, DURIAN enhances the usefulness of bulk data as well as single-cell data by sharing information iteratively. Bubble [28] is an AE-based model that uses matched bulk RNA-seq data to identify and impute scRNA-seq data. With Bubble, the alignment between aggregated imputed data and bulk RNA-seq data improves, resulting in more accurate gene expression level estimation. For imputation, these methods rely on extensive sets of RNA-seq data. However, when bulk RNA-seq data is unavailable, or if there is limited congruence between scRNA-seq and bulk RNA-seq data sets, these methods might become less effective or even unsuitable.

The observed scRNA-seq data can be modeled as a gene count matrix. Matrix factorization (MF), which approximates a data matrix as the product of two or more low-dimensional factor matrices, is a popular approach for scRNA-seq data analysis [18, 29]. For example, McImpute [17] uses nuclear norm-based MF to recover the full gene expression from partial information. CMF-Impute [20], a collaborative MF method, exploits the information of cell similarity and gene similarity. ALRA [18] adopts low-rank MF to reconstruct scRNA-seq data and preserves true biological zeros by setting all values below a threshold to zero. The above MF-based imputation methods utilize L₂ loss in optimization, which performs well under the assumption of zero-mean Gaussian noise [30]. Previous studies [31, 32] show that the L₂ loss is sensitive to outliers. In scRNA-seq data, the raw counts contain false (dropout) zeros, which act as outliers. In this case, the L₂ loss may not properly represent the error statistics and the performance of MF algorithms may degrade.

A large number of imputation methods have been proposed, and most of them achieve good performance in particular scenarios. However, benchmark studies [23, 33, 34] find that there is still a lack of imputation methods that perform well across all scenarios. For example, Hou et al. [33] performed a systematic evaluation of 18 scRNA-seq imputation methods and found that the majority of the methods did not improve performance in downstream analyses compared to no imputation, in particular for clustering and trajectory analysis. In Cheng et al.’s study [34], no imputation method performed consistently well across all datasets, and some methods even had a negative effect on cell clustering. Furthermore, scIMC [23] provides a comprehensive comparison framework for imputation methods. Its comparison covers four aspects: gene expression recovery, cell clustering, gene differential expression and cellular trajectory reconstruction.

To alleviate the above concerns, we intend to develop a robust imputation method that can handle dropout zeros effectively and improve performance in all scenarios. Here, we extend the typical MF approach and adapt it to solve imputation tasks in scRNA-seq data. To this end, we develop an MF with a non-negative constraint and C-loss function, termed scRNMF. The contributions of our work are as follows:

(1) Unlike known MF methods [7, 20, 35–37], scRNMF integrates two loss functions: L₂ loss and C-loss. Observed zero values in scRNA-seq data do not necessarily indicate true gene expression; rather, they signify unobserved values. When we incorporate the C-loss function into our model, only a minor penalty is imposed on these zeros, as illustrated in Fig 1(b). In contrast, the L₂ loss function assigns a significantly larger penalty for the same error. This distinction leads scRNMF to provide more reliable factorization, particularly in handling outliers.
(2) We develop an iterative algorithm that uses half-quadratic minimization to solve the non-quadratic and non-convex objective function encountered in scRNMF. The objective function is minimized until the convergence of the algorithm is reached. This approach provides an efficient and accurate solution to scRNMF.
(3) Using several simulated and real datasets, we perform a comprehensive evaluation of scRNMF against existing methods. scRNMF can enhance various aspects of downstream analysis, including gene expression data recovery, cell clustering analysis, gene differential expression analysis, and cellular trajectory reconstruction. The results of our study demonstrate that scRNMF is a powerful tool that can improve the accuracy of single-cell data analysis.

Materials and methods

The overview of our framework is illustrated in Fig 1. The workflow includes three steps: data pre-processing, missing value imputation and downstream analysis. The goal of our study is missing value imputation, namely recovering the true data from the raw matrix. In this section, we introduce the data pre-processing and imputation method in detail.

Data pre-processing

Let X be the raw count matrix. We index genes and cells using g and c, respectively. The preprocessing of scRNA-seq data consists of three steps. The first step is data filtering, which removes low-abundance genes and cells, followed by log normalization to account for differences in sequencing depth, and then top gene selection to reduce the dimensionality of the matrix. The log normalization function is defined as:

\begin{matrix} N (X_{g c}) = ln (Median (X) \frac{X_{g c}}{\sum_{i = 1}^{C} X_{g i}} + 1) \end{matrix}

(1)

where Median(X) is the median of the total cell expression counts. We refer to AutoClass [13], selecting 2000 highly variable genes (HVGs) for imputation.
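The normalization in Eq 1 can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: it assumes X is stored as genes × cells and reads the denominator as the total count of cell c, which is how the surrounding text describes the depth correction. The helper `top_variable_genes` is a hypothetical stand-in that ranks genes by plain variance; the HVG procedure used by AutoClass may differ.

```python
import numpy as np

def log_normalize(X):
    """Depth-normalize a genes x cells count matrix, then apply ln(x + 1)."""
    X = np.asarray(X, dtype=float)
    cell_totals = X.sum(axis=0)                 # total counts per cell (column sums)
    scale = np.median(cell_totals)              # Median(X) in Eq 1
    safe_totals = np.where(cell_totals > 0, cell_totals, 1.0)  # guard empty cells
    return np.log1p(scale * X / safe_totals)

def top_variable_genes(N, n_top=2000):
    """Row indices of the n_top highest-variance genes (simplified HVG stand-in)."""
    variances = N.var(axis=1)
    n_top = min(n_top, N.shape[0])
    return np.argsort(variances)[::-1][:n_top]

# Toy example: 3 genes x 3 cells
X = np.array([[0, 2, 4],
              [1, 0, 4],
              [3, 2, 0]])
N = log_normalize(X)
hvg_idx = top_variable_genes(N, n_top=2)
```

Zero counts stay at zero after this transform, so the zero pattern that the model later treats separately is preserved.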

Model

The missing value of a cell gene expression count is modelled as a linear combination of gene and cell activation patterns over latent representations. An imputed value is obtained by multiplying each of k cell factor components by its corresponding gene factor component, then summing the results. Thus, the missing gene expression count of a gene g for a given cell c can be expressed as a combination of k components as follows:

\begin{matrix} {\hat{X}}_{g c} = \sum_{i = 1}^{k} W_{g i} H_{i c} \end{matrix}

(2)

The loss function originally used in MF to measure the quality of approximation is the L₂ loss. It has proven effective in many tasks owing to its mathematical properties. For scRNA-seq data imputation, however, it is not the best choice. The observed zero values do not reflect real gene expression and can be seen as outliers, and the L₂ loss makes MF sensitive to outliers. To address this issue, we propose to replace the quadratic form of the residuals with the correntropy-induced loss (C-loss) function to achieve robust factorization. The C-loss function is defined by:

\begin{matrix} l_{C} (y, x) = 1 - \exp \left( - \frac{{(y - x)}^{2}}{2 σ^{2}} \right) \end{matrix}

(3)

where σ is the window width. S1 Fig shows the L₂ loss and C-loss functions under different widths σ. From S1 Fig, we can see that C-loss is a bounded, smooth and non-convex loss, and that it behaves like L₂ loss for small errors. S2 Fig shows that, when we use C-loss on zero counts, only a small penalty is imposed on the error. That is to say, C-loss is robust to outliers. Since the observed expression values of highly expressed genes are usually accurate [20], we use L₂ loss to measure the error on the non-zero counts.
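The contrast between the two losses can be checked numerically. The snippet below is an illustrative sketch of Eq 3 next to the squared error; the function names and the test errors are chosen here for demonstration only.

```python
import numpy as np

def l2_loss(y, x):
    """Squared error, unbounded as the residual grows."""
    return (np.asarray(y, dtype=float) - np.asarray(x, dtype=float)) ** 2

def c_loss(y, x, sigma=1.0):
    """Correntropy-induced loss from Eq 3, bounded above by 1."""
    err2 = (np.asarray(y, dtype=float) - np.asarray(x, dtype=float)) ** 2
    return 1.0 - np.exp(-err2 / (2.0 * sigma ** 2))

errors = np.array([0.0, 0.1, 1.0, 10.0])
l2_vals = l2_loss(errors, 0.0)            # grows quadratically: 0, 0.01, 1, 100
c_vals = c_loss(errors, 0.0, sigma=1.0)   # saturates near 1 for large errors
```

For small residuals the C-loss is approximately the squared error scaled by 1/(2σ²), while a residual of 10 costs 100 under L₂ but at most 1 under C-loss, which is the bounded behavior the text relies on.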

Thus, we propose the following scRNMF to learn the data latent representation:

\begin{matrix} \begin{matrix} \underset{W, H}{\arg\min} \frac{1}{2} \sum_{\{(g, c) | X_{g c} \neq 0\}} {(X_{g c} - \sum_{i = 1}^{k} W_{g i} H_{i c})}^{2} \\ + \frac{1}{2} \sum_{\{(g, c) | X_{g c} = 0\}} l_{C} (X_{g c}, \sum_{i = 1}^{k} W_{g i} H_{i c}) \\ + \frac{α}{2} ‖ K_{G} - W W^{T} ‖_{F}^{2} + \frac{β}{2} ‖ K_{C} - H^{T} H ‖_{F}^{2} \\ + \frac{λ}{2} \sum_{\{(g, c) | X_{g c} \in \{0\}\}} {(\sum_{i = 1}^{k} W_{g i} H_{i c})}^{2} \\ \text{subject to} : W \geq 0, H \geq 0 . \end{matrix} \end{matrix}

(4)

In cases where the raw data contain a significant amount of redundant information, the cell and gene factors cannot effectively represent cells and genes. To avoid this, the third and fourth terms are introduced. They keep the learned W and H consistent with the gene-gene graph K_G and the cell-cell graph K_C, respectively. In our study, we take the average of the cosine similarity and the correlation coefficient similarity [38] to build these graphs.
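One way to build such graphs is sketched below, assuming each graph is the plain average of the row-wise cosine similarity and the Pearson correlation of the normalized matrix. The function names are hypothetical, and any further processing the authors apply (for example thresholding or clipping negative entries) is not described in the text and is not reproduced here.

```python
import numpy as np

def cosine_similarity(A):
    """Pairwise cosine similarity between the rows of A."""
    norms = np.linalg.norm(A, axis=1, keepdims=True)
    norms[norms == 0] = 1.0                  # all-zero rows get similarity 0
    U = A / norms
    return U @ U.T

def correlation_similarity(A):
    """Pairwise Pearson correlation between rows: cosine of mean-centered rows."""
    centered = A - A.mean(axis=1, keepdims=True)
    return cosine_similarity(centered)

def similarity_graph(A):
    """Average of the cosine and correlation similarities between rows of A."""
    A = np.asarray(A, dtype=float)
    return 0.5 * (cosine_similarity(A) + correlation_similarity(A))

# Rows are genes, columns are cells
N = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0],
              [3.0, 2.0, 1.0]])
K_G = similarity_graph(N)      # gene-gene graph (rows of N)
K_C = similarity_graph(N.T)    # cell-cell graph (columns of N)
```

Note that the averaged similarity can be negative (genes 0 and 2 above are anti-correlated), which matters if the graph is later used inside multiplicative updates that expect non-negative inputs.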

The fifth term is a regularization term that constrains the fitted values at the observed zeros. The parameter λ controls how much weight the zeros receive during learning.

We also introduce a non-negative constraint on the factors, because gene expression counts are never negative. This constraint helps us obtain more accurate results from our model. Moreover, it prevents the model from generating unrealistic values. Lastly, it makes the results easier to interpret, as they are constrained to non-negative values.
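To make the five terms of Eq 4 concrete, the following sketch evaluates the objective for given factors. It is an illustration under stated assumptions rather than the reference implementation: the function name and argument order are chosen here, X is genes × cells, and the fifth term is taken over the zero entries of X.

```python
import numpy as np

def scrnmf_objective(X, W, H, K_G, K_C, sigma, alpha, beta, lam):
    """Value of the Eq 4 objective for factors W (genes x k) and H (k x cells)."""
    X = np.asarray(X, dtype=float)
    X_hat = W @ H                                   # Eq 2 for every (g, c)
    zero = X == 0
    resid2 = (X - X_hat) ** 2
    l2_term = 0.5 * resid2[~zero].sum()             # L2 loss on non-zero counts
    c_term = 0.5 * (1.0 - np.exp(-resid2[zero] / (2.0 * sigma ** 2))).sum()
    gene_term = 0.5 * alpha * np.linalg.norm(K_G - W @ W.T, "fro") ** 2
    cell_term = 0.5 * beta * np.linalg.norm(K_C - H.T @ H, "fro") ** 2
    zero_term = 0.5 * lam * (X_hat[zero] ** 2).sum()  # fit constraint on zeros
    return l2_term + c_term + gene_term + cell_term + zero_term

# Toy check with k = 1, where every reconstructed entry equals 1
X = np.array([[1.0, 0.0],
              [0.0, 2.0]])
W = np.ones((2, 1))
H = np.ones((1, 2))
value = scrnmf_objective(X, W, H, np.eye(2), np.eye(2),
                         sigma=1.0, alpha=0.1, beta=0.2, lam=0.3)
```

In this toy case the terms are 0.5 (L₂), 1 − e^(−1/2) (C-loss), 0.1 and 0.2 (graph terms) and 0.3 (zero regularizer), so each contribution can be verified by hand.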

Optimization

The objective function defined in Eq 4 is non-convex, which poses challenges because it cannot be minimized directly. The half-quadratic optimization algorithm is employed to address this difficulty and optimize Eq 4. By doing so, the scRNMF problem is reduced to solving a Weighted NMF problem. The optimization procedure consists of two key stages.

The first stage necessitates introducing an additional auxiliary variable. Relying on the principle of the conjugate function [39] and half-quadratic theory [40], the objective function 4 simplifies to Eq 5.

\begin{matrix} \underset{L}{arg min} \sum_{{(g, c) | X_{g c} = 0}} (- L_{g c} \frac{{(X_{g c} - (W H)_{g c})}^{2}}{2 σ^{2}} + g (L_{g c})) \end{matrix}

(5)

where g(⋅) is the conjugate function. According to half-quadratic theory [40], the closed-form solution of Eq 5 is

\begin{matrix} L_{g c} = - \exp \left( - \frac{{(X_{g c} - (W H)_{g c})}^{2}}{2 σ^{2}} \right), \quad (g, c) \in \{ (g, c) | X_{g c} = 0 \} \end{matrix}

(6)
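The step from Eq 5 to Eq 6 can be worked through explicitly. The derivation below is a sketch that assumes g is the convex conjugate of e^{-z}; the text only states that g(⋅) is a conjugate function, so the explicit form of g and the shorthand z_{gc} are introduced here for illustration.

```latex
\begin{aligned}
z_{gc} &= \frac{(X_{gc} - (WH)_{gc})^{2}}{2\sigma^{2}}, \\
g(L) &= \sup_{z}\,\bigl(Lz - e^{-z}\bigr) = L - L\ln(-L), \qquad L < 0, \\
e^{-z_{gc}} &= \sup_{L<0}\,\bigl(L z_{gc} - g(L)\bigr), \\
1 - e^{-z_{gc}} &= \min_{L<0}\,\bigl(1 - L z_{gc} + g(L)\bigr), \\
\frac{\partial}{\partial L}\bigl(-L z_{gc} + L - L\ln(-L)\bigr) &= -z_{gc} - \ln(-L) = 0
\;\Longrightarrow\; L_{gc} = -e^{-z_{gc}} .
\end{aligned}
```

With L fixed at this value, each zero entry contributes a quadratic term with weight −L_gc = e^{−z_gc}, which lies in (0, 1], so zeros with large residuals are down-weighted rather than dominating the fit.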

The second stage involves determining W and H after L has been fixed.

\begin{matrix} \begin{matrix} \underset{W, H}{arg min} \frac{1}{2} ‖ M ⊙ (X - W H) ‖_{F}^{2} \\ + \frac{α}{2} ‖ K_{G} - W W^{T} ‖_{F}^{2} + \frac{β}{2} ‖ K_{C} - H^{T} H ‖_{F}^{2} \\ + \frac{λ}{2} ‖ P ⊙ (W H) ‖_{F}^{2} \\ subject to : W \geq 0, H \geq 0 . \end{matrix} \end{matrix}

(7)

where ⊙ indicates element-wise matrix multiplication, P is an indicator matrix and M is a weight matrix, that is, P_gc = 1 if X_gc = 0 and P_gc = 0 otherwise (matching the index set of the fifth term in Eq 4); M_gc = −L_gc if X_gc = 0 and M_gc = 1 otherwise.

Eq 7 is a Weighted NMF problem. To minimize Eq 7, we use an efficient multiplicative learning algorithm [41]. The multiplicative update rules are given by:

\begin{matrix} \begin{matrix} W \leftarrow W ⊙ \frac{(M ⊙ X) H^{T} + α K_{G} W}{(M ⊙ (W H)) H^{T} + λ (P ⊙ (W H)) H^{T} + α W W^{T} W} \\ H \leftarrow H ⊙ \frac{W^{T} (M ⊙ X) + β H K_{C}}{W^{T} (M ⊙ (W H)) + λ W^{T} (P ⊙ (W H)) + β H H^{T} H} \end{matrix} \end{matrix}

(8)

The complete algorithm follows a half-quadratic iterative process that alternates between the two stages. The first stage updates L using Eq 6, while the second stage updates W and H according to Eq 8. This iterative process minimizes the non-convex function until convergence. The details of the optimization procedure, the proof of convergence and the pseudo-code for implementation are given in S1 Text. We also plot the objective function value on all datasets. The results (S3 Fig) illustrate that scRNMF converges well empirically.
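The alternation between Eq 6 and Eq 8 can be sketched as follows. This is a simplified illustration, not the pseudo-code from S1 Text. It makes several assumptions: a single multiplicative step per outer iteration, random positive initialization, a small `eps` in the denominators, P marking the zero entries (as in the fifth term of Eq 4), weights M applied linearly as written in Eq 8, and non-negative similarity graphs built with a hypothetical `cosine_rows` helper.

```python
import numpy as np

def cosine_rows(A):
    """Cosine similarity between rows; non-negative when A is non-negative."""
    norms = np.linalg.norm(A, axis=1, keepdims=True)
    norms[norms == 0] = 1.0                      # all-zero rows get similarity 0
    U = A / norms
    return U @ U.T

def scrnmf_sketch(X, K_G, K_C, k=2, sigma=1.0, alpha=0.1, beta=0.1, lam=0.1,
                  n_iter=100, eps=1e-10, seed=0):
    """Alternate the Eq 6 weight update with one Eq 8 multiplicative step."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    n_genes, n_cells = X.shape
    W = rng.random((n_genes, k)) + eps           # random positive start (assumption)
    H = rng.random((k, n_cells)) + eps
    zero = X == 0
    P = zero.astype(float)                       # indicator of zero entries
    for _ in range(n_iter):
        WH = W @ H
        # Stage 1 (Eq 6): M = -L on zero entries, 1 on non-zero entries.
        M = np.ones_like(X)
        M[zero] = np.exp(-(X[zero] - WH[zero]) ** 2 / (2.0 * sigma ** 2))
        # Stage 2 (Eq 8): multiplicative updates keep W and H non-negative.
        W *= ((M * X) @ H.T + alpha * K_G @ W) / (
            (M * WH) @ H.T + lam * (P * WH) @ H.T + alpha * W @ W.T @ W + eps)
        WH = W @ H
        H *= (W.T @ (M * X) + beta * H @ K_C) / (
            W.T @ (M * WH) + lam * W.T @ (P * WH) + beta * H @ H.T @ H + eps)
    return W, H

# Toy data: Poisson counts with extra zeros injected
rng = np.random.default_rng(1)
X = rng.poisson(2.0, size=(6, 5)).astype(float)
X[rng.random(X.shape) < 0.3] = 0.0
W, H = scrnmf_sketch(X, cosine_rows(X), cosine_rows(X.T), k=2)
X_imputed = W @ H
```

Because every factor in the update ratios is non-negative, W and H stay non-negative without an explicit projection step, which is the usual motivation for multiplicative rules in NMF.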

Parameters selection

There are five parameters in our method: k, σ, α, β and λ. k represents the latent dimension of the cell and gene factors and is chosen from the set {2,10,20,30,40,50}. The window width σ is selected from the range 10⁻⁵ to 10³ in multiplicative steps of 10. The regularization parameters α, β and λ are selected from the range 10⁻⁵ to 10, also in multiplicative steps of 10. We randomly masked 10% of the non-zero counts in the expression matrix and used the RMSE between the imputed values and the masked counts as an evaluation metric for reconstruction error. A low reconstruction error means the model is accurately learning the data and can be used for predictions [12]. We randomly sampled one thousand hyperparameter configurations from the search space. The configuration with the lowest reconstruction error is then selected as the best model. The optimal parameters are listed in S1 Table.
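The masking protocol and the random search over the grid can be sketched as below. The helper names (`mask_nonzero`, `masked_rmse`, `sample_config`) are hypothetical, and the sketch assumes that masked entries are set to zero before imputation; the text does not state how masked counts are represented.

```python
import numpy as np

def mask_nonzero(X, frac=0.1, seed=0):
    """Set a random fraction of the non-zero entries to zero; return their indices."""
    rng = np.random.default_rng(seed)
    rows, cols = np.nonzero(X)
    n_mask = max(1, int(round(frac * rows.size)))
    pick = rng.choice(rows.size, size=n_mask, replace=False)
    X_masked = np.asarray(X, dtype=float).copy()
    X_masked[rows[pick], cols[pick]] = 0.0
    return X_masked, (rows[pick], cols[pick])

def masked_rmse(X_true, X_imputed, held_out):
    """RMSE restricted to the held-out (masked) entries."""
    diff = X_true[held_out] - X_imputed[held_out]
    return float(np.sqrt(np.mean(diff ** 2)))

def sample_config(rng):
    """Draw one configuration from the search space described in the text."""
    return {
        "k": int(rng.choice([2, 10, 20, 30, 40, 50])),
        "sigma": 10.0 ** int(rng.integers(-5, 4)),   # 1e-5, ..., 1e3
        "alpha": 10.0 ** int(rng.integers(-5, 2)),   # 1e-5, ..., 10
        "beta": 10.0 ** int(rng.integers(-5, 2)),
        "lam": 10.0 ** int(rng.integers(-5, 2)),
    }

X = np.arange(20, dtype=float).reshape(4, 5)     # 19 non-zero entries
X_masked, held_out = mask_nonzero(X, frac=0.1)
baseline_error = masked_rmse(X, X_masked, held_out)  # error if nothing is imputed
cfg = sample_config(np.random.default_rng(0))
```

In a full search, each sampled configuration would be used to impute `X_masked`, and the configuration with the smallest `masked_rmse` would be kept.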

Results

Multiple datasets and downstream analyses were used to evaluate and compare scRNMF to other imputation methods. The details of the datasets and the compared imputation methods are listed in S2 and S3 Tables.

Gene expression data recovery

We evaluate expression value recovery on simulated scRNA-seq data. We generate a true counts matrix (a matrix without dropouts) and eight additional raw matrices (Simulated 1–8) using Splatter [42] with 500 cells and 1000 genes in four cell groups.

The first evaluation method is data visualization. Following [13], we use PCA [43] and UMAP [44] for dimension reduction and data visualization. Fig 2a illustrates the results on the Simulated 1 dataset, which has a 78% zero rate. True counts result in four subpopulations with clear borders, whereas raw counts results are affected by dropout noise. After imputation by scRNMF, subpopulations with clear borders were recovered. We also visualize the results of other imputation methods on the Simulated 1–6 datasets (S4–S9 Figs). On these simulated datasets, we can observe that our method and DCA performed better than other methods, distinguishing four clusters regardless of high or low zero expression rates.

Fig 2 — (a) UMAP plots for true data, raw data and imputed data by scRNMF on the Simulated 1 dataset. (b) RMSE and PCC between normalized true counts and imputed values on eight simulated datasets.

We also evaluate the performance of different imputation methods by using two metrics, Root mean square error (RMSE) and Pearson correlation coefficient (PCC), to quantify their ability to recover true gene expression. Fig 2b shows that as the zero rate increases, the RMSE of all imputation methods increases, while their PCC decreases. A lower RMSE and higher PCC were achieved with scRNMF and DCA, compared with the other methods. CMF-Impute, scGCL and scImpute have the poorest performance, and could not be successfully applied to simulated datasets.
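The two recovery metrics can be computed as follows. This is a minimal sketch that assumes both metrics are taken over all entries of the flattened matrices; whether the paper restricts them to a subset of entries is not stated here.

```python
import numpy as np

def rmse(true_vals, imputed_vals):
    """Root mean square error over all matrix entries."""
    diff = np.asarray(true_vals, dtype=float) - np.asarray(imputed_vals, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

def pcc(true_vals, imputed_vals):
    """Pearson correlation coefficient between the flattened matrices."""
    a = np.asarray(true_vals, dtype=float).ravel()
    b = np.asarray(imputed_vals, dtype=float).ravel()
    return float(np.corrcoef(a, b)[0, 1])

true_counts = np.array([[1.0, 2.0], [3.0, 4.0]])
imputed = true_counts + 1.0        # constant offset: perfect correlation, RMSE of 1
err = rmse(true_counts, imputed)
corr = pcc(true_counts, imputed)
```

The toy case shows why both metrics are reported together: a systematic offset leaves PCC at 1 while RMSE still registers the error.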

DCA and scRNMF appear to outperform other methods by a significant margin according to visualization and quantitative comparison. In addition, both of them offer a wide range of hyperparameters for tuning the model. AutoClass adds a classifier to the bottleneck layer of the regular autoencoder. Because the default hyperparameters of AutoClass are robust, they work well for the majority of scRNA-Seq datasets. On high zero expression rate simulated data (78%, 71%, 63%), AutoClass does not perform well. This indicates that AutoClass may not be suitable for datasets with a high zero expression rate. Therefore, further optimization of hyperparameters is needed for datasets with a high zero expression rate.

Cell clustering analysis

In order to identify cell types from scRNA-seq data, clustering is commonly used. A total of five real datasets were used to evaluate the performance of scRNMF for clustering analysis. The sizes of these datasets ranged from 182 (Buettner) to 8592 (Lake), and the zero rate varied from 38% (Buettner) to 96% (Usoskin).

We implement the cell clustering experiment following AutoClass [13]. Before imputation, we select 2000 HVGs. After imputation, we perform cell clustering using K-means [45] on the imputed and raw datasets. Two metrics were used to evaluate clustering results: Adjusted Rand index (ARI [46]) and Normalized mutual information (NMI). ARI measures the similarity between the clusters and the true classes, while NMI evaluates the mutual information between the two partitions. For both metrics, a higher score indicates better clustering.
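For readers who want to see how the two scores are computed, here is a from-scratch sketch based on the contingency table of true and predicted labels. It is illustrative only: the NMI below normalizes by the arithmetic mean of the two entropies, which is one common convention and an assumption here, since the normalization used in the paper is not stated.

```python
import numpy as np
from math import comb

def contingency(labels_true, labels_pred):
    """Count matrix: rows index true classes, columns index predicted clusters."""
    _, t = np.unique(np.asarray(labels_true), return_inverse=True)
    _, p = np.unique(np.asarray(labels_pred), return_inverse=True)
    table = np.zeros((t.max() + 1, p.max() + 1), dtype=np.int64)
    np.add.at(table, (t, p), 1)
    return table

def adjusted_rand_index(labels_true, labels_pred):
    table = contingency(labels_true, labels_pred)
    n = int(table.sum())
    sum_cells = sum(comb(int(v), 2) for v in table.ravel())
    sum_rows = sum(comb(int(v), 2) for v in table.sum(axis=1))
    sum_cols = sum(comb(int(v), 2) for v in table.sum(axis=0))
    expected = sum_rows * sum_cols / comb(n, 2)   # index expected by chance
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:                     # degenerate single-cluster case
        return 1.0
    return (sum_cells - expected) / (max_index - expected)

def normalized_mutual_info(labels_true, labels_pred):
    pij = contingency(labels_true, labels_pred).astype(float)
    pij /= pij.sum()
    pi, pj = pij.sum(axis=1), pij.sum(axis=0)
    nz = pij > 0
    mi = (pij[nz] * np.log(pij[nz] / np.outer(pi, pj)[nz])).sum()
    h_true = -(pi[pi > 0] * np.log(pi[pi > 0])).sum()
    h_pred = -(pj[pj > 0] * np.log(pj[pj > 0])).sum()
    denom = 0.5 * (h_true + h_pred)               # arithmetic-mean normalization
    return 1.0 if denom == 0 else float(mi / denom)

ari_perfect = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])   # relabeled but identical
ari_random = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])    # independent partitions
nmi_perfect = normalized_mutual_info([0, 0, 1, 1], [1, 1, 0, 0])
nmi_random = normalized_mutual_info([0, 0, 1, 1], [0, 1, 0, 1])
```

Both scores are invariant to how cluster labels are named, which is why the relabeled partition still scores 1.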

As indicated in Fig 3, compared with other competing methods, scRNMF achieves the most reliable clustering results on real datasets. Only scRNMF improves both metrics over the raw data on all datasets.

CMF-Impute also achieves competitive performance, with ARI reaching 85.8%, 89.3% and 90.6% on the Usoskin, Diaphragm and Muscle datasets, respectively. Both scRNMF and CMF-Impute use MF models to impute the raw dataset. The critical difference between the two methods lies in the loss function and the factor constraint. Specifically, CMF-Impute only leverages the L₂ loss function to measure the quality of approximation and places no constraint on the factors. In contrast, scRNMF uses L₂ loss and C-loss to fit the non-zero and zero values in the raw counts, respectively, and imposes a non-negative constraint on the factors. scRNMF performs better than CMF-Impute in clustering analysis. A likely reason is that the raw data are noisy and gene expression counts are non-negative, while CMF-Impute lacks both robustness and a non-negative constraint.

Gene differential expression analysis

Further downstream analysis of scRNA-seq data involves identifying differentially expressed (DE) genes. Through this approach, we can identify genes that are specifically associated with the observed phenotypes and gain insight into the molecular mechanisms involved in the process.

We implement the DE analysis experiment following scMOO [7]. As bulk RNA-seq data is minimally affected by dropout events, we regarded its differential expression analysis results as our gold standard. We utilized the Cell Type dataset, which incorporates both scRNA-seq and bulk RNA-seq data with seven cell types (H1, DEC, EC, H9, HFF, NPC, and TB). Our primary focus was on detecting DE genes between six pairs of cell subpopulations containing H1. We run edgeR [47] on the raw and imputed data. The top 200, 400, 600, 800 and 1000 genes ranked by adjusted P value are used as reference sets, whereas the results on imputed data are used as predictions. Different imputation methods are evaluated based on their Area under the receiver operating characteristic curve (AUC) and Accuracy (ACC). Fig 4 and S10–S14 Figs show the results.

When the top 200 genes are selected, all results show that only scRNMF outperforms the raw dataset in terms of ACC and AUC. Other imputation methods gradually outperform the raw dataset as the number of top genes selected increases. MAGIC and CMF-Impute also perform well in most cases. We note that these methods all rely on gene similarity. This suggests that using gene similarity for imputation is an effective strategy for better performance, especially in DE analysis.

Besides, we visualize the raw and imputed data with volcano plots, as illustrated in Fig 4a. We can see that the data imputed by scRNMF detect the top expressed genes more accurately than the raw data. Research by Chu et al. [48] shows that DEG cells are enriched in genes such as CST1, PAX6, NODAL and IFI16. Our method yields higher -log(P) values for these genes than the raw data.

Cellular trajectory reconstruction

Dynamic patterns such as cell cycle progression can be explored over a time course by reconstructing cellular trajectories from scRNA-seq data. Dropout events can lead to incorrect estimates of gene expression levels, which can in turn result in incorrect estimates of cell cycle progression. Therefore, it is critical to take into account the potential for dropouts when reconstructing cellular trajectories in order to ensure that the results are accurate and reliable. In this study, we visualized cellular trajectories reconstructed by Monocle2 [49] (Fig 5(a)). The pseudo-temporal ordering score (POS) and Kendall’s rank correlation score (KOR) [50] are used to measure the correlation between the real time labels and the pseudo-time labels (Fig 5(b)). Visualizations of cellular trajectory reconstruction from other imputed data on the Time-course and Deng datasets are shown in S15 and S16 Figs, respectively.

Fig 5 — (a) Visualization of cellular trajectories reconstruction from raw and imputed data. (b) POS and KOR scores are used to measure the correlation between the real time labels and the pseudo-time labels.

On Deng and Time-course datasets, scRNMF achieves the highest correspondence between imputed data and true cell order (Deng: POS = 94.9%, KOR = 81.1%; Time-course: POS = 93.2%, KOR = 77.3%). Moreover, AutoClass, DCA, CMF-Impute and scVI do not outperform Raw on all datasets. The results suggest that scRNMF and MAGIC are most appropriate for exploring the cellular trajectory in scRNA-seq data.

Assessing imputation-induced false signals

The observed zero values in scRNA-seq data are not always dropout zeros. There exist true zero events, representing low-level gene expression in a specific cell type [1, 18, 23]. The false positive signals are caused by imputing these true zero events, also known as “over-imputation”.

To evaluate whether different imputation methods lead to false positive signals, we conduct an experiment on simulated scRNA-seq data (because simulated data have a true counts matrix to serve as labels). The true zero rate of all simulated datasets is 25.9%. Following the experiments in [18, 51], we first define the threshold for binarizing imputed counts. Specifically, the threshold is set to a percentile of the imputed counts. Counts above this threshold are considered non-zero, and values below or equal to this threshold are considered zero. An ideal imputation method should accurately impute the data, in particular preserving true zero events while completing the dropout ones. Therefore, two metrics were used to evaluate the false positive signals: False Positive Rate (FPR) and F-score. A higher FPR indicates that the imputation method tends to introduce more false positive signals. The F-score provides a comprehensive evaluation of the method’s performance. A high F-score indicates that the method achieves a good balance between accurately imputing missing values and not over-imputing true zeros.
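The thresholding and the two metrics can be sketched as follows. The sketch assumes that the positive class is "expressed" (true count above zero), so a false positive is a true zero that the imputation pushes above the threshold; the function name is hypothetical.

```python
import numpy as np

def false_signal_metrics(true_counts, imputed, percentile):
    """FPR and F-score after binarizing imputed values at a percentile threshold."""
    true_counts = np.asarray(true_counts, dtype=float)
    imputed = np.asarray(imputed, dtype=float)
    threshold = np.percentile(imputed, percentile)
    pred_pos = imputed > threshold            # values <= threshold count as zero
    true_pos = true_counts > 0
    tp = np.sum(pred_pos & true_pos)
    fp = np.sum(pred_pos & ~true_pos)         # true zeros imputed as expressed
    fn = np.sum(~pred_pos & true_pos)
    tn = np.sum(~pred_pos & ~true_pos)
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f_score = 2.0 * precision * recall / denom if denom else 0.0
    return float(fpr), float(f_score)

true_counts = np.array([[0.0, 0.0, 3.0, 5.0]])
imputed = np.array([[0.0, 0.4, 2.0, 4.0]])    # second entry is an over-imputed zero
fpr_0, f_0 = false_signal_metrics(true_counts, imputed, percentile=0)
fpr_50, f_50 = false_signal_metrics(true_counts, imputed, percentile=50)
```

At the 0th percentile the over-imputed zero counts as a false positive (FPR 0.5, F-score 0.8); raising the threshold to the median removes it (FPR 0, F-score 1), which mirrors how the percentile sweep trades false signals against recovered dropouts.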

We run the false positive signal experiment on the Simulated 1–6 datasets, and the results are shown in Fig 6 and S17–S21 Figs. From these figures, we can see that scRNMF, MAGIC and DCA achieve the best F-score at the 30th percentile on the Simulated 1–3 datasets and at the 40th percentile on the Simulated 4–6 datasets. At the same time, scRNMF, MAGIC and DCA have the lowest FPR compared to other methods when achieving the best F-score. This indicates their effectiveness in accurately identifying dropout zeros while maintaining a low rate of false positive signals. Another observation is that ALRA, scGCL and scImpute exhibit lower FPR than other methods at the 0th percentile on all simulated datasets. This can be explained by the distinct approaches these methods use for imputation. ALRA computes a low-rank approximation of the observed matrix and then restores true zeros through an entry thresholding process. In contrast, scImpute focuses on the identification and subsequent imputation of dropout zeros only. The reconstruction loss utilized by scGCL is the negative log-likelihood of ZINB. As a result, the imputation process of scGCL may lean towards preserving zeros to minimize reconstruction loss.

Evaluation of parameter k selection

The idea behind MF-based imputation methods is to assume that the count matrix can be decomposed into a product of two low-rank matrices, where the matrices are the latent representations of cells and genes, respectively. Therefore, the k value is the dimensionality of the latent features of the cells and genes. Inappropriate k values can lead to sub-optimal performance [52].

To investigate the impact of varying k values on the low-rank based imputation methods (including scRNMF, ALRA, CMF-Impute and McImpute), we design simulations using Splatter to generate simulated data. We generate true cell group labels and a raw matrix (Simulated 9) with 2000 cells, 500 genes and a 90% zero rate in 20 cell groups. According to the Louvain algorithm [53, 54], there are 10 gene modules in the Simulated 9 dataset. Cell clustering analysis (the details are the same as in section “Cell clustering analysis”) is performed on the imputed data, and NMI and ARI are used as evaluation metrics. This scenario mimics the biological context where a small number of gene modules produce a variety of cell types through their combinatorial effects. The k value ranges from 2 to 1000. This setting helps us understand the effect of increasing k due to a growing number of cell type-dependent dropout patterns, even when the true number of genes remains much lower than k. The results are shown in Fig 7.

From Fig 7, we can observe that scRNMF achieves the highest ARI (100%) and NMI (100%) when k is 20, 30, 40 or 50. As k increases further, ARI and NMI decrease to varying degrees, but when k reaches 800 and 1000, both NMI and ARI again come close to 100%. This indicates that scRNMF is robust over a wide range of large k values. McImpute imposes nuclear norm minimization (NNM) on the imputed count matrix to ensure that it is low-rank. Therefore, McImpute avoids the explicit specification of k and appears as a horizontal line in Fig 7. ALRA and CMF-Impute achieve optimal performance when k is 20; as k increases, their imputation performance drops significantly. ALRA is based on randomized SVD [55], so its k cannot be set larger than the smaller of the number of genes and the number of cells. Another issue worth noting is that ALRA and CMF-Impute chose k values of 5 and 40, respectively, which are clearly not optimal according to the experimental results. ALRA estimates k from the empirical distribution of eigenvalues [56]. CMF-Impute sets k to the number of cells divided by 50. Our method selects k by minimizing the reconstruction error, which tends to be accurate but is time-consuming because the model must be run multiple times with different parameters.
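The cost of selecting k by reconstruction error comes from refitting the model once per candidate value. The sketch below illustrates such a search with a masked NMF and a held-out set of entries. Both the held-out evaluation and the plain L₂ objective are assumptions made for this example; the selection procedure of scRNMF may differ in its details.

```python
import numpy as np

def masked_nmf(X, M, k, n_iter=300, eps=1e-10, seed=0):
    """NMF fitted only on entries where the mask M equals 1."""
    rng = np.random.default_rng(seed)
    W = rng.random((X.shape[0], k)) + eps
    H = rng.random((k, X.shape[1])) + eps
    MX = M * X
    for _ in range(n_iter):
        # Weighted multiplicative updates: masked-out entries are ignored.
        H *= (W.T @ MX) / (W.T @ (M * (W @ H)) + eps)
        W *= (MX @ H.T) / ((M * (W @ H)) @ H.T + eps)
    return W, H

def select_k(X, candidates, holdout=0.1, seed=0):
    """Return the candidate k with the lowest RMSE on held-out entries."""
    rng = np.random.default_rng(seed)
    train = rng.random(X.shape) >= holdout   # True = used for fitting
    test = ~train
    errors = {}
    for k in candidates:                     # one full model fit per k
        W, H = masked_nmf(X, train.astype(float), k, seed=seed)
        resid = (X - W @ H)[test]
        errors[k] = float(np.sqrt(np.mean(resid ** 2)))
    best = min(errors, key=errors.get)
    return best, errors

rng = np.random.default_rng(2)
X = rng.random((80, 4)) @ rng.random((4, 50))   # toy rank-4 matrix
candidates = [2, 4, 8, 16]
best_k, errors = select_k(X, candidates)
```

The running time grows linearly with the number of candidates, which is the trade-off noted above.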

Computational time

The computational cost of an imputation method is represented by its running time. The results are depicted in Fig 8, and correspond to computations conducted on a Core i7-10750H CPU with 16GB RAM and an RTX 3060 GPU. Specifically, the GSM4505405 dataset containing 110828 cells and 22966 genes, downloaded from GEO (accession number GSM4505405), was used.

To understand how the cost of different methods changes with the number of cells while keeping the number of genes fixed, we preprocessed the dataset by selecting 2000 HVGs. The dataset was split into six samples with 2000, 4000, 6000, 8000, 10000, and 20000 cells obtained through random sampling. The results are shown in Fig 8(a).

Next, we hold the number of cells constant to observe how the cost of each method changes with an increasing number of genes. We selected the top 2000 cells expressing the most genes and filtered out the remaining cells. Then, six datasets featuring 2000, 4000, 6000, 8000, 10000, and 20000 genes were acquired through random sampling. Detailed results are shown in Fig 8(b).
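A benchmarking loop of this kind can be sketched as follows. The helper `time_imputation` and the stand-in method `toy_impute` (a truncated SVD reconstruction) are hypothetical and serve only to show the subsampling and timing pattern; the sizes are scaled down from those used in the experiments.

```python
import time
import numpy as np

def time_imputation(impute, X, sizes, axis=0, seed=0):
    """Time `impute` on random subsamples of X along `axis`.

    axis=0 subsamples cells (rows); axis=1 subsamples genes (columns).
    Returns a dict mapping subsample size to elapsed seconds.
    """
    rng = np.random.default_rng(seed)
    timings = {}
    for n in sizes:
        idx = rng.choice(X.shape[axis], size=n, replace=False)
        sub = np.take(X, idx, axis=axis)
        start = time.perf_counter()
        impute(sub)
        timings[n] = time.perf_counter() - start
    return timings

def toy_impute(X, k=10):
    """Stand-in method: rank-k reconstruction by truncated SVD."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k]

# Toy count matrix with 400 cells and 200 genes.
X = np.random.default_rng(0).poisson(1.0, size=(400, 200)).astype(float)
cell_timings = time_imputation(toy_impute, X, sizes=[100, 200, 400], axis=0)
gene_timings = time_imputation(toy_impute, X, sizes=[50, 100, 200], axis=1)
```

Fixing one dimension while varying the other separates the effect of the number of cells from the effect of the number of genes on running time.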

The findings show that scRNMF and the other two MF methods (CMF-Impute and McImpute) exhibit similar running times, and these times appear to scale with the larger of the number of cells and the number of genes in our experiments. This can be explained by examining the time complexity of these methods. Assume a raw count matrix X of size M × N and a known rank k ≪ min(M, N); let I denote the number of iterations needed for the algorithm to converge and T the maximum of M and N. For scRNMF, CMF-Impute and McImpute, the time complexity of the training phase is dominated by the computation of matrix products, which gives a time complexity of O(IT²k). Moreover, DL-based methods demonstrate a significant speed advantage over MF-based methods, particularly on large datasets. This can be attributed to batch gradient descent during the training phase.
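The bound can be seen from a short count of operations. The derivation below is a sketch that assumes each iteration is dominated by dense products between X and two factor matrices, written here as W (M × k) and H (k × N); these factor names are illustrative and the exact set of products differs between methods.

```latex
% Assumption: each iteration is dominated by dense products such as
% WH, XH^T and W^T X, with W of size M x k and H of size k x N.
\begin{align*}
\mathrm{cost}(WH) &= O(MNk), \qquad
\mathrm{cost}(XH^{\top}) = O(MNk), \qquad
\mathrm{cost}(W^{\top}X) = O(MNk), \\
\text{cost per iteration} &= O(MNk) \le O(T^{2}k), \qquad T = \max(M, N), \\
\text{total cost} &= I \cdot O(T^{2}k) = O(IT^{2}k).
\end{align*}
```

Because MN ≤ T², the bound is loose when one dimension is much smaller than the other, but it explains why running time tracks the larger of the two dimensions.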

Our approach is grounded in MF principles and it might take longer to process extensive datasets. This is an inherent characteristic of matrix operations that we cannot entirely avert. However, this speed trade-off should not overshadow our method’s performance benefits.

Discussion

It is still difficult to analyze scRNA-seq data because a significant portion of expressed genes have zeros. Gene expression levels can be restored through imputation of scRNA-seq data, facilitating downstream analysis.

In this study, we present a new imputation method, called scRNMF. To measure the quality of approximation, existing MF methods use the L₂ loss function, which is highly sensitive to outliers. When we incorporate the C-loss function into our model, only a minor penalty is imposed on errors at zero entries. In contrast, the L₂ loss function assigns a significantly larger penalty for the same error. Hence, scRNMF both approximates the original matrix well and improves robustness against dropout events. Because the objective function of scRNMF is non-quadratic and non-convex, we develop an iterative algorithm that relies on half-quadratic minimization.
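The difference between the two penalties can be illustrated numerically. The sketch below uses a common unnormalized form of the C-loss, 1 − exp(−e²/(2σ²)), with width σ; the exact scaling used in scRNMF may differ, so this form and the function names are assumptions. The weight function shown is the usual auxiliary weight that arises when such a loss is minimized by half-quadratic iterations.

```python
import numpy as np

def l2_loss(e):
    """Squared error: grows without bound as |e| increases."""
    return e ** 2

def c_loss(e, sigma=1.0):
    """Correntropy-induced loss (assumed unnormalized form), bounded by 1."""
    return 1.0 - np.exp(-e ** 2 / (2.0 * sigma ** 2))

def hq_weight(e, sigma=1.0):
    """Half-quadratic auxiliary weight: large errors are down-weighted."""
    return np.exp(-e ** 2 / (2.0 * sigma ** 2))

errors = np.array([0.1, 1.0, 5.0, 20.0])
l2_vals = l2_loss(errors)     # 0.01, 1, 25, 400
c_vals = c_loss(errors)       # saturates near 1 for large errors
weights = hq_weight(errors)   # decays towards 0 for large errors
```

For an error of 20, the L₂ penalty is 400 while the C-loss stays at about 1, so a single outlying entry cannot dominate the objective.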

To validate the performance of scRNMF, we compare ten imputation methods (S3 Table) on a total of fourteen datasets (S2 Table). For recovering gene expression, we visualize the imputed matrices (S4–S9 Figs) and calculate RMSE and PCC (Fig 2) to evaluate performance. Compared to the other approaches, scRNMF and DCA perform better. In terms of cell clustering, we ran K-means on nine labelled real datasets and evaluated the results by ARI and NMI scores. As shown in Fig 3, only scRNMF led to improved cell clustering. To evaluate gene DE analysis, edgeR was run on the Cell Type dataset and the matched bulk data to calculate ACC and AUC (Fig 4 and S10–S14 Figs). A Monocle 2 analysis was carried out on the Time-course and Deng datasets (S15 and S16 Figs) to investigate the reconstruction of cellular trajectories. scRNMF yields significant improvements over the raw data.

Supporting information

S1 Text. The details of optimization procedure.

Because the objective function of scRNMF is non-convex, we propose an effective optimization algorithm to solve it. We also give the convergence analysis of the optimization algorithm.

(PDF)

pcbi.1012339.s001.pdf (141.5KB, pdf)

S1 Fig. C-loss with different widths.

C-loss is a bounded, smooth and non-convex loss.

(PDF)

pcbi.1012339.s002.pdf (14.6KB, pdf)

S2 Fig. C-loss is robust for zero-count.

Observed zero values may not reflect real gene expression and can be seen as outliers. When we use C-loss on zero counts, a small punishment (small loss value) is imposed on the error. In contrast, L₂ loss imposes a larger punishment for the same error. That is to say, C-loss is more robust than L₂ loss. Since the observed non-zero counts (highly expressed genes) are usually accurate, we use L₂ loss to measure their error.

(PDF)

pcbi.1012339.s003.pdf (23.8KB, pdf)

S3 Fig. Convergence curves of the objective function values.

We verify the convergence of scRNMF experimentally by plotting the objective function value on all datasets. The results show that scRNMF converges well in practice.

(PDF)

pcbi.1012339.s004.pdf (34.1KB, pdf)

S4 Fig. PCA+UMAP plots for raw and imputed data on Simulated 1 dataset with zero expression rate of 78%.

(PDF)

pcbi.1012339.s005.pdf (139KB, pdf)

S5 Fig. PCA+UMAP plots for raw and imputed data on Simulated 2 dataset with zero expression rate of 71%.

(PDF)

pcbi.1012339.s006.pdf (138.9KB, pdf)

S6 Fig. PCA+UMAP plots for raw and imputed data on Simulated 3 dataset with zero expression rate of 63%.

(PDF)

pcbi.1012339.s007.pdf (138.9KB, pdf)

S7 Fig. PCA+UMAP plots for raw and imputed data on Simulated 4 dataset with zero expression rate of 55%.

(PDF)

pcbi.1012339.s008.pdf (138.4KB, pdf)

S8 Fig. PCA+UMAP plots for raw and imputed data on Simulated 5 dataset with zero expression rate of 48%.

(PDF)

pcbi.1012339.s009.pdf (138.6KB, pdf)

S9 Fig. PCA+UMAP plots for raw and imputed data on Simulated 6 dataset with zero expression rate of 42%.

(PDF)

pcbi.1012339.s010.pdf (138.5KB, pdf)

S10 Fig. Evaluation of imputation methods through differential expression analysis on H1-EC dataset.

ACC (A) and AUC (B) scores when the reference set consists of the top 200, 400, 600, 800 and 1000 genes sorted by adjusted P values from the bulk data.

(PDF)

pcbi.1012339.s011.pdf (19.9KB, pdf)

S11 Fig. Evaluation of imputation methods through differential expression analysis on H1-H9 dataset.

ACC (A) and AUC (B) scores when the reference set consists of the top 200, 400, 600, 800 and 1000 genes sorted by adjusted P values from the bulk data.

(PDF)

pcbi.1012339.s012.pdf (20.4KB, pdf)

S12 Fig. Evaluation of imputation methods through differential expression analysis on H1-HFF dataset.

ACC (A) and AUC (B) scores when the reference set consists of the top 200, 400, 600, 800 and 1000 genes sorted by adjusted P values from the bulk data.

(PDF)

pcbi.1012339.s013.pdf (19.9KB, pdf)

S13 Fig. Evaluation of imputation methods through differential expression analysis on H1-NPC dataset.

ACC (A) and AUC (B) scores when the reference set consists of the top 200, 400, 600, 800 and 1000 genes sorted by adjusted P values from the bulk data.

(PDF)

pcbi.1012339.s014.pdf (20.1KB, pdf)

S14 Fig. Evaluation of imputation methods through differential expression analysis on H1-TB dataset.

ACC (A) and AUC (B) scores when the reference set consists of the top 200, 400, 600, 800 and 1000 genes sorted by adjusted P values from the bulk data.

(PDF)

pcbi.1012339.s015.pdf (20.1KB, pdf)

S15 Fig. Evaluation of imputation methods through pseudo-time analysis by Monocle 2 on Time-course dataset.

(PDF)

pcbi.1012339.s016.pdf (188.3KB, pdf)

S16 Fig. Evaluation of imputation methods through pseudo-time analysis by Monocle 2 on Deng dataset.

(PDF)

pcbi.1012339.s017.pdf (115.4KB, pdf)

S17 Fig. Comparison of imputation methods in reducing false positive signals on Simulated 2 dataset.

(PDF)

pcbi.1012339.s018.pdf (30.3KB, pdf)

S18 Fig. Comparison of imputation methods in reducing false positive signals on Simulated 3 dataset.

(PDF)

pcbi.1012339.s019.pdf (30.3KB, pdf)

S19 Fig. Comparison of imputation methods in reducing false positive signals on Simulated 4 dataset.

(PDF)

pcbi.1012339.s020.pdf (30.3KB, pdf)

S20 Fig. Comparison of imputation methods in reducing false positive signals on Simulated 5 dataset.

(PDF)

pcbi.1012339.s021.pdf (30.3KB, pdf)

S21 Fig. Comparison of imputation methods in reducing false positive signals on Simulated 6 dataset.

(PDF)

pcbi.1012339.s022.pdf (30.2KB, pdf)

S1 Table. The parameters of all datasets.

(PDF)

pcbi.1012339.s023.pdf (44.3KB, pdf)

S2 Table. The details of the scRNA-seq datasets.

(PDF)

pcbi.1012339.s024.pdf (50.5KB, pdf)

S3 Table. The details of the competing imputation methods.

(PDF)

pcbi.1012339.s025.pdf (51.5KB, pdf)

Data Availability

The code and data of scRNMF are available on figshare: https://figshare.com/articles/software/scRNMF/23725986.

Funding Statement

This work is supported in part by the National Natural Science Foundation of China (No. 62131004 to Q.Z.; No. 62172076 to Y.D.; No. 62250028 to Q.Z.; No. U22A2038 to Y.D.), the Zhejiang Provincial Natural Science Foundation of China (No. LY23F020003 to Y.D.), the Municipal Government of Quzhou (No. 2023D038 to Y.D.), and the Project from Key Laboratory of Computational Science and Application of Hainan Province (No. JSKX202201 to Y.D.). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

1. Li WV, Li JJ. An accurate and robust imputation method scImpute for single-cell RNA-seq data. Nature communications. 2018;9(1):997. doi: 10.1038/s41467-018-03405-7
2. Huang M, Wang J, Torre E, Dueck H, Shaffer S, Bonasio R, et al. SAVER: gene expression recovery for single-cell RNA sequencing. Nature methods. 2018;15(7):539–542. doi: 10.1038/s41592-018-0033-z
3. Wang J, Agarwal D, Huang M, Hu G, Zhou Z, Ye C, et al. Data denoising with transfer learning in single-cell transcriptomics. Nature methods. 2019;16(9):875–878. doi: 10.1038/s41592-019-0537-1
4. Van Dijk D, Sharma R, Nainys J, Yim K, Kathail P, Carr AJ, et al. Recovering gene interactions from single-cell data using data diffusion. Cell. 2018;174(3):716–729. doi: 10.1016/j.cell.2018.05.061
5. Jin K, Ou-Yang L, Zhao XM, Yan H, Zhang XF. scTSSR: gene expression recovery for single-cell RNA sequencing using two-side sparse self-representation. Bioinformatics. 2020;36(10):3131–3138. doi: 10.1093/bioinformatics/btaa108
6. Li B, Jin K, Ou-Yang L, Yan H, Zhang XF. scTSSR2: imputing dropout events for single-cell RNA sequencing using fast two-side self-representation. IEEE/ACM Transactions on Computational Biology and Bioinformatics. 2022;20(2):1445–1456. doi: 10.1109/TCBB.2022.3170587
7. Jin K, Li B, Yan H, Zhang XF. Imputing dropouts for single-cell RNA sequencing based on multi-objective optimization. Bioinformatics. 2022;38(12):3222–3230. doi: 10.1093/bioinformatics/btac300
8. Lopez R, Regier J, Cole MB, Jordan MI, Yosef N. Deep generative modeling for single-cell transcriptomics. Nature methods. 2018;15(12):1053–1058. doi: 10.1038/s41592-018-0229-2
9. Arisdakessian C, Poirion O, Yunits B, Zhu X, Garmire LX. DeepImpute: an accurate, fast, and scalable deep neural network method to impute single-cell RNA-seq data. Genome biology. 2019;20(1):1–14. doi: 10.1186/s13059-019-1837-6
10. Xu Y, Wang Y, Luo J, Zhao W, Zhou X. Deep learning of the splicing (epi) genetic code reveals a novel candidate mechanism linking histone modifications to ESC fate decision. Nucleic acids research. 2017;45(21):12100–12112. doi: 10.1093/nar/gkx870
11. Talwar D, Mongia A, Sengupta D, Majumdar A. AutoImpute: Autoencoder based imputation of single-cell RNA-seq data. Scientific reports. 2018;8(1):16329. doi: 10.1038/s41598-018-34688-x
12. Eraslan G, Simon LM, Mircea M, Mueller NS, Theis FJ. Single-cell RNA-seq denoising using a deep count autoencoder. Nature communications. 2019;10(1):390. doi: 10.1038/s41467-018-07931-2
13. Li H, Brouwer CR, Luo W. A universal deep neural network for in-depth cleaning of single-cell RNA-Seq data. Nature Communications. 2022;13(1):1901. doi: 10.1038/s41467-022-29576-y
14. Wang J, Ma A, Chang Y, Gong J, Jiang Y, Qi R, et al. scGNN is a novel graph neural network framework for single-cell RNA-Seq analyses. Nature communications. 2021;12(1):1882. doi: 10.1038/s41467-021-22197-x
15. Xu Y, Zhang Z, You L, Liu J, Fan Z, Zhou X. scIGANs: single-cell RNA-seq imputation using generative adversarial networks. Nucleic acids research. 2020;48(15):e85–e85. doi: 10.1093/nar/gkaa506
16. Xiong Z, Luo J, Shi W, Liu Y, Xu Z, Wang B. scGCL: an imputation method for scRNA-seq data based on graph contrastive learning. Bioinformatics. 2023;39(3):btad098. doi: 10.1093/bioinformatics/btad098
17. Mongia A, Sengupta D, Majumdar A. McImpute: matrix completion based imputation for single cell RNA-seq data. Frontiers in genetics. 2019;10:9. doi: 10.3389/fgene.2019.00009
18. Linderman GC, Zhao J, Roulis M, Bielecki P, Flavell RA, Nadler B, et al. Zero-preserving imputation of single-cell RNA-seq data. Nature communications. 2022;13(1):192. doi: 10.1038/s41467-021-27729-z
19. Elyanow R, Dumitrascu B, Engelhardt BE, Raphael BJ. netNMF-sc: leveraging gene–gene interactions for imputation and dimensionality reduction in single-cell expression analysis. Genome research. 2020;30(2):195–204. doi: 10.1101/gr.251603.119
20. Xu J, Cai L, Liao B, Zhu W, Yang J. CMF-Impute: an accurate imputation tool for single-cell RNA-seq data. Bioinformatics. 2020;36(10):3139–3147. doi: 10.1093/bioinformatics/btaa109
21. Zhang L, Zhang S. Imputing single-cell RNA-seq data by considering cell heterogeneity and prior expression of dropouts. Journal of molecular cell biology. 2021;13(1):29–40. doi: 10.1093/jmcb/mjaa052
22. Amodio M, Van Dijk D, Srinivasan K, Chen WS, Mohsen H, Moon KR, et al. Exploring single-cell data with deep multitasking neural networks. Nature methods. 2019;16(11):1139–1145. doi: 10.1038/s41592-019-0576-7
23. Dai C, Jiang Y, Yin C, Su R, Zeng X, Zou Q, et al. scIMC: a platform for benchmarking comparison and visualization analysis of scRNA-seq data imputation methods. Nucleic Acids Research. 2022;50(9):4877–4899. doi: 10.1093/nar/gkac317
24. Shu H, Ding F, Zhou J, Xue Y, Zhao D, Zeng J, et al. Boosting single-cell gene regulatory network reconstruction via bulk-cell transcriptomic data. Briefings in Bioinformatics. 2022;23(5):bbac389. doi: 10.1093/bib/bbac389
25. Xu Y, Zhou X. Applications of single-cell sequencing for multiomics. Computational Systems Biology: Methods and Protocols. 2018; p. 327–374. doi: 10.1007/978-1-4939-7717-8_19
26. Peng T, Zhu Q, Yin P, Tan K. SCRABBLE: single-cell RNA-seq imputation constrained by bulk RNA-seq data. Genome biology. 2019;20(1):1–12. doi: 10.1186/s13059-019-1681-8
27. Karikomi M, Zhou P, Nie Q. DURIAN: an integrative deconvolution and imputation method for robust signaling analysis of single-cell transcriptomics data. Briefings in bioinformatics. 2022;23(4):bbac223. doi: 10.1093/bib/bbac223
28. Chen S, Yan X, Zheng R, Li M. Bubble: a fast single-cell RNA-seq imputation using an autoencoder constrained by bulk RNA-seq data. Briefings in bioinformatics. 2023;24(1):bbac580. doi: 10.1093/bib/bbac580
29. Kriebel AR, Welch JD. UINMF performs mosaic integration of single-cell multi-omic datasets using nonnegative matrix factorization. Nature communications. 2022;13(1):780. doi: 10.1038/s41467-022-28431-4
30. Kong D, Ding C, Huang H. Robust nonnegative matrix factorization using l21-norm. In: Proceedings of the 20th ACM international conference on Information and knowledge management; 2011. p. 673–682.
31. Du L, Li X, Shen YD. Robust nonnegative matrix factorization via half-quadratic minimization. In: 2012 IEEE 12th International Conference on Data Mining. IEEE; 2012. p. 201–210.
32. He Y, Wang F, Li Y, Qin J, Chen B. Robust matrix completion via maximum correntropy criterion and half-quadratic optimization. IEEE Transactions on Signal Processing. 2019;68:181–195. doi: 10.1109/TSP.2019.2952057
33. Hou W, Ji Z, Ji H, Hicks SC. A systematic evaluation of single-cell RNA-sequencing imputation methods. Genome biology. 2020;21:1–30. doi: 10.1186/s13059-020-02132-x
34. Cheng Y, Ma X, Yuan L, Sun Z, Wang P. Evaluating imputation methods for single-cell RNA-seq data. BMC bioinformatics. 2023;24(1):302. doi: 10.1186/s12859-023-05417-7
35. Guo X, Zhou W, Yu Y, Ding Y, Tang J, Guo F. A novel triple matrix factorization method for detecting drug-side effect association based on kernel target alignment. BioMed Research International. 2020;2020. doi: 10.1155/2020/4675395
36. Wang H, Tang J, Ding Y, Guo F. Exploring associations of non-coding RNAs in human diseases via three-matrix factorization with hypergraph-regular terms on center kernel alignment. Briefings in Bioinformatics. 2021;22(5):bbaa409. doi: 10.1093/bib/bbaa409
37. Ding Y, Tang J, Guo F, Zou Q. Identification of drug–target interactions via multiple kernel-based triple collaborative matrix factorization. Briefings in Bioinformatics. 2022;23(2):bbab582. doi: 10.1093/bib/bbab582
38. Wang Y, Zhai Y, Ding Y, Zou Q. SBSM-Pro: Support Bio-sequence Machine for Proteins. arXiv preprint arXiv:2308.10275. 2023.
39. Boyd SP, Vandenberghe L. Convex optimization. Cambridge university press; 2004.
40. Nikolova M, Chan RH. The equivalence of half-quadratic minimization and the gradient linearization iteration. IEEE Transactions on Image Processing. 2007;16(6):1623–1627. doi: 10.1109/TIP.2007.896622
41. Lee D, Seung HS. Algorithms for non-negative matrix factorization. Advances in neural information processing systems. 2000;13.
42. Zappia L, Phipson B, Oshlack A. Splatter: simulation of single-cell RNA sequencing data. Genome biology. 2017;18(1):174. doi: 10.1186/s13059-017-1305-0
43. Bro R, Smilde AK. Principal component analysis. Analytical methods. 2014;6(9):2812–2831. doi: 10.1039/C3AY41907J
44. Becht E, McInnes L, Healy J, Dutertre CA, Kwok IW, Ng LG, et al. Dimensionality reduction for visualizing single-cell data using UMAP. Nature biotechnology. 2019;37(1):38–44. doi: 10.1038/nbt.4314
45. Hartigan JA, Wong MA. Algorithm AS 136: A k-means clustering algorithm. Journal of the royal statistical society series c (applied statistics). 1979;28(1):100–108.
46. Hubert L, Arabie P. Comparing partitions. Journal of classification. 1985;2:193–218. doi: 10.1007/BF01908075
47. Robinson MD, McCarthy DJ, Smyth GK. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. bioinformatics. 2010;26(1):139–140. doi: 10.1093/bioinformatics/btp616
48. Chu LF, Leng N, Zhang J, Hou Z, Mamott D, Vereide DT, et al. Single-cell RNA-seq reveals novel regulators of human embryonic stem cell differentiation to definitive endoderm. Genome biology. 2016;17:1–20. doi: 10.1186/s13059-016-1033-x
49. Qiu X, Hill A, Packer J, Lin D, Ma YA, Trapnell C. Single-cell mRNA quantification and differential analysis with Census. Nature methods. 2017;14(3):309–315. doi: 10.1038/nmeth.4150
50. Ji Z, Ji H. TSCAN: Pseudo-time reconstruction and evaluation in single-cell RNA-seq analysis. Nucleic acids research. 2016;44(13):e117–e117. doi: 10.1093/nar/gkw430
51. Andrews TS, Hemberg M. False signals induced by single-cell imputation. F1000Research. 2018;7. doi: 10.12688/f1000research.16613.2
52. Koren Y, Bell R, Volinsky C. Matrix factorization techniques for recommender systems. Computer. 2009;42(8):30–37. doi: 10.1109/MC.2009.263
53. Wolf FA, Angerer P, Theis FJ. SCANPY: large-scale single-cell gene expression data analysis. Genome biology. 2018;19:1–5. doi: 10.1186/s13059-017-1382-0
54. Seth S, Mallik S, Bhadra T, Zhao Z. Dimensionality reduction and louvain agglomerative hierarchical clustering for cluster-specified frequent biomarker discovery in single-cell sequencing data. Frontiers in Genetics. 2022;13:828479. doi: 10.3389/fgene.2022.828479
55. Erichson NB, Voronin S, Brunton SL, Kutz JN. Randomized matrix decompositions using R. arXiv preprint arXiv:1608.02148. 2016.
56. Onatski A. Determining the number of factors from empirical distribution of eigenvalues. The Review of Economics and Statistics. 2010;92(4):1004–1016. doi: 10.1162/REST_a_00043

[pcbi.1012339.ref001] 1. Li WV, Li JJ. An accurate and robust imputation method scImpute for single-cell RNA-seq data. Nature communications. 2018;9(1):997. doi: 10.1038/s41467-018-03405-7 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1012339.ref002] 2. Huang M, Wang J, Torre E, Dueck H, Shaffer S, Bonasio R, et al. SAVER: gene expression recovery for single-cell RNA sequencing. Nature methods. 2018;15(7):539–542. doi: 10.1038/s41592-018-0033-z [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1012339.ref003] 3. Wang J, Agarwal D, Huang M, Hu G, Zhou Z, Ye C, et al. Data denoising with transfer learning in single-cell transcriptomics. Nature methods. 2019;16(9):875–878. doi: 10.1038/s41592-019-0537-1 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1012339.ref004] 4. Van Dijk D, Sharma R, Nainys J, Yim K, Kathail P, Carr AJ, et al. Recovering gene interactions from single-cell data using data diffusion. Cell. 2018;174(3):716–729. doi: 10.1016/j.cell.2018.05.061 [DOI] [PMC free article] [PubMed] [Google Scholar]

5. Jin K, Ou-Yang L, Zhao XM, Yan H, Zhang XF. scTSSR: gene expression recovery for single-cell RNA sequencing using two-side sparse self-representation. Bioinformatics. 2020;36(10):3131–3138. doi: 10.1093/bioinformatics/btaa108

6. Li B, Jin K, Ou-Yang L, Yan H, Zhang XF. scTSSR2: imputing dropout events for single-cell RNA sequencing using fast two-side self-representation. IEEE/ACM Transactions on Computational Biology and Bioinformatics. 2022;20(2):1445–1456. doi: 10.1109/TCBB.2022.3170587

7. Jin K, Li B, Yan H, Zhang XF. Imputing dropouts for single-cell RNA sequencing based on multi-objective optimization. Bioinformatics. 2022;38(12):3222–3230. doi: 10.1093/bioinformatics/btac300

8. Lopez R, Regier J, Cole MB, Jordan MI, Yosef N. Deep generative modeling for single-cell transcriptomics. Nature methods. 2018;15(12):1053–1058. doi: 10.1038/s41592-018-0229-2

9. Arisdakessian C, Poirion O, Yunits B, Zhu X, Garmire LX. DeepImpute: an accurate, fast, and scalable deep neural network method to impute single-cell RNA-seq data. Genome biology. 2019;20(1):1–14. doi: 10.1186/s13059-019-1837-6

10. Xu Y, Wang Y, Luo J, Zhao W, Zhou X. Deep learning of the splicing (epi) genetic code reveals a novel candidate mechanism linking histone modifications to ESC fate decision. Nucleic acids research. 2017;45(21):12100–12112. doi: 10.1093/nar/gkx870

11. Talwar D, Mongia A, Sengupta D, Majumdar A. AutoImpute: Autoencoder based imputation of single-cell RNA-seq data. Scientific reports. 2018;8(1):16329. doi: 10.1038/s41598-018-34688-x

12. Eraslan G, Simon LM, Mircea M, Mueller NS, Theis FJ. Single-cell RNA-seq denoising using a deep count autoencoder. Nature communications. 2019;10(1):390. doi: 10.1038/s41467-018-07931-2

13. Li H, Brouwer CR, Luo W. A universal deep neural network for in-depth cleaning of single-cell RNA-Seq data. Nature Communications. 2022;13(1):1901. doi: 10.1038/s41467-022-29576-y

14. Wang J, Ma A, Chang Y, Gong J, Jiang Y, Qi R, et al. scGNN is a novel graph neural network framework for single-cell RNA-Seq analyses. Nature communications. 2021;12(1):1882. doi: 10.1038/s41467-021-22197-x

15. Xu Y, Zhang Z, You L, Liu J, Fan Z, Zhou X. scIGANs: single-cell RNA-seq imputation using generative adversarial networks. Nucleic acids research. 2020;48(15):e85–e85. doi: 10.1093/nar/gkaa506

16. Xiong Z, Luo J, Shi W, Liu Y, Xu Z, Wang B. scGCL: an imputation method for scRNA-seq data based on graph contrastive learning. Bioinformatics. 2023;39(3):btad098. doi: 10.1093/bioinformatics/btad098

17. Mongia A, Sengupta D, Majumdar A. McImpute: matrix completion based imputation for single cell RNA-seq data. Frontiers in genetics. 2019;10:9. doi: 10.3389/fgene.2019.00009

18. Linderman GC, Zhao J, Roulis M, Bielecki P, Flavell RA, Nadler B, et al. Zero-preserving imputation of single-cell RNA-seq data. Nature communications. 2022;13(1):192. doi: 10.1038/s41467-021-27729-z

19. Elyanow R, Dumitrascu B, Engelhardt BE, Raphael BJ. netNMF-sc: leveraging gene–gene interactions for imputation and dimensionality reduction in single-cell expression analysis. Genome research. 2020;30(2):195–204. doi: 10.1101/gr.251603.119

20. Xu J, Cai L, Liao B, Zhu W, Yang J. CMF-Impute: an accurate imputation tool for single-cell RNA-seq data. Bioinformatics. 2020;36(10):3139–3147. doi: 10.1093/bioinformatics/btaa109

21. Zhang L, Zhang S. Imputing single-cell RNA-seq data by considering cell heterogeneity and prior expression of dropouts. Journal of molecular cell biology. 2021;13(1):29–40. doi: 10.1093/jmcb/mjaa052

22. Amodio M, Van Dijk D, Srinivasan K, Chen WS, Mohsen H, Moon KR, et al. Exploring single-cell data with deep multitasking neural networks. Nature methods. 2019;16(11):1139–1145. doi: 10.1038/s41592-019-0576-7

23. Dai C, Jiang Y, Yin C, Su R, Zeng X, Zou Q, et al. scIMC: a platform for benchmarking comparison and visualization analysis of scRNA-seq data imputation methods. Nucleic Acids Research. 2022;50(9):4877–4899. doi: 10.1093/nar/gkac317

24. Shu H, Ding F, Zhou J, Xue Y, Zhao D, Zeng J, et al. Boosting single-cell gene regulatory network reconstruction via bulk-cell transcriptomic data. Briefings in Bioinformatics. 2022;23(5):bbac389. doi: 10.1093/bib/bbac389

25. Xu Y, Zhou X. Applications of single-cell sequencing for multiomics. Computational Systems Biology: Methods and Protocols. 2018; p. 327–374. doi: 10.1007/978-1-4939-7717-8_19

26. Peng T, Zhu Q, Yin P, Tan K. SCRABBLE: single-cell RNA-seq imputation constrained by bulk RNA-seq data. Genome biology. 2019;20(1):1–12. doi: 10.1186/s13059-019-1681-8

27. Karikomi M, Zhou P, Nie Q. DURIAN: an integrative deconvolution and imputation method for robust signaling analysis of single-cell transcriptomics data. Briefings in bioinformatics. 2022;23(4):bbac223. doi: 10.1093/bib/bbac223

28. Chen S, Yan X, Zheng R, Li M. Bubble: a fast single-cell RNA-seq imputation using an autoencoder constrained by bulk RNA-seq data. Briefings in bioinformatics. 2023;24(1):bbac580. doi: 10.1093/bib/bbac580

29. Kriebel AR, Welch JD. UINMF performs mosaic integration of single-cell multi-omic datasets using nonnegative matrix factorization. Nature communications. 2022;13(1):780. doi: 10.1038/s41467-022-28431-4

30. Kong D, Ding C, Huang H. Robust nonnegative matrix factorization using l21-norm. In: Proceedings of the 20th ACM international conference on Information and knowledge management; 2011. p. 673–682.

31. Du L, Li X, Shen YD. Robust nonnegative matrix factorization via half-quadratic minimization. In: 2012 IEEE 12th International Conference on Data Mining. IEEE; 2012. p. 201–210.

32. He Y, Wang F, Li Y, Qin J, Chen B. Robust matrix completion via maximum correntropy criterion and half-quadratic optimization. IEEE Transactions on Signal Processing. 2019;68:181–195. doi: 10.1109/TSP.2019.2952057

33. Hou W, Ji Z, Ji H, Hicks SC. A systematic evaluation of single-cell RNA-sequencing imputation methods. Genome biology. 2020;21:1–30. doi: 10.1186/s13059-020-02132-x

34. Cheng Y, Ma X, Yuan L, Sun Z, Wang P. Evaluating imputation methods for single-cell RNA-seq data. BMC bioinformatics. 2023;24(1):302. doi: 10.1186/s12859-023-05417-7

35. Guo X, Zhou W, Yu Y, Ding Y, Tang J, Guo F. A novel triple matrix factorization method for detecting drug-side effect association based on kernel target alignment. BioMed Research International. 2020;2020. doi: 10.1155/2020/4675395

36. Wang H, Tang J, Ding Y, Guo F. Exploring associations of non-coding RNAs in human diseases via three-matrix factorization with hypergraph-regular terms on center kernel alignment. Briefings in Bioinformatics. 2021;22(5):bbaa409. doi: 10.1093/bib/bbaa409

37. Ding Y, Tang J, Guo F, Zou Q. Identification of drug–target interactions via multiple kernel-based triple collaborative matrix factorization. Briefings in Bioinformatics. 2022;23(2):bbab582. doi: 10.1093/bib/bbab582

38. Wang Y, Zhai Y, Ding Y, Zou Q. SBSM-Pro: Support Bio-sequence Machine for Proteins. arXiv preprint arXiv:2308.10275. 2023.

39. Boyd SP, Vandenberghe L. Convex optimization. Cambridge university press; 2004.

40. Nikolova M, Chan RH. The equivalence of half-quadratic minimization and the gradient linearization iteration. IEEE Transactions on Image Processing. 2007;16(6):1623–1627. doi: 10.1109/TIP.2007.896622

41. Lee D, Seung HS. Algorithms for non-negative matrix factorization. Advances in neural information processing systems. 2000;13.

42. Zappia L, Phipson B, Oshlack A. Splatter: simulation of single-cell RNA sequencing data. Genome biology. 2017;18(1):174. doi: 10.1186/s13059-017-1305-0

43. Bro R, Smilde AK. Principal component analysis. Analytical methods. 2014;6(9):2812–2831. doi: 10.1039/C3AY41907J

44. Becht E, McInnes L, Healy J, Dutertre CA, Kwok IW, Ng LG, et al. Dimensionality reduction for visualizing single-cell data using UMAP. Nature biotechnology. 2019;37(1):38–44. doi: 10.1038/nbt.4314

45. Hartigan JA, Wong MA. Algorithm AS 136: A k-means clustering algorithm. Journal of the royal statistical society series c (applied statistics). 1979;28(1):100–108.

46. Hubert L, Arabie P. Comparing partitions. Journal of classification. 1985;2:193–218. doi: 10.1007/BF01908075

47. Robinson MD, McCarthy DJ, Smyth GK. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010;26(1):139–140. doi: 10.1093/bioinformatics/btp616

48. Chu LF, Leng N, Zhang J, Hou Z, Mamott D, Vereide DT, et al. Single-cell RNA-seq reveals novel regulators of human embryonic stem cell differentiation to definitive endoderm. Genome biology. 2016;17:1–20. doi: 10.1186/s13059-016-1033-x

49. Qiu X, Hill A, Packer J, Lin D, Ma YA, Trapnell C. Single-cell mRNA quantification and differential analysis with Census. Nature methods. 2017;14(3):309–315. doi: 10.1038/nmeth.4150

50. Ji Z, Ji H. TSCAN: Pseudo-time reconstruction and evaluation in single-cell RNA-seq analysis. Nucleic acids research. 2016;44(13):e117–e117. doi: 10.1093/nar/gkw430

51. Andrews TS, Hemberg M. False signals induced by single-cell imputation. F1000Research. 2018;7. doi: 10.12688/f1000research.16613.2

52. Koren Y, Bell R, Volinsky C. Matrix factorization techniques for recommender systems. Computer. 2009;42(8):30–37. doi: 10.1109/MC.2009.263

53. Wolf FA, Angerer P, Theis FJ. SCANPY: large-scale single-cell gene expression data analysis. Genome biology. 2018;19:1–5. doi: 10.1186/s13059-017-1382-0

54. Seth S, Mallik S, Bhadra T, Zhao Z. Dimensionality reduction and louvain agglomerative hierarchical clustering for cluster-specified frequent biomarker discovery in single-cell sequencing data. Frontiers in Genetics. 2022;13:828479. doi: 10.3389/fgene.2022.828479

55. Erichson NB, Voronin S, Brunton SL, Kutz JN. Randomized matrix decompositions using R. arXiv preprint arXiv:1608.02148. 2016.

56. Onatski A. Determining the number of factors from empirical distribution of eigenvalues. The Review of Economics and Statistics. 2010;92(4):1004–1016. doi: 10.1162/REST_a_00043


