ABSTRACT
The high resolution of single-cell transcriptome sequencing allows researchers to observe cellular gene expression profiles at the single-cell level, offering numerous possibilities for subsequent biomedical investigation. However, the high proportion of missing values in gene-cell expression matrices, an unavoidable technical consequence of insufficient RNA input, severely hampers the accuracy of downstream analysis. To address this problem, it is essential to develop a more rapid and stable imputation method with greater accuracy, which should not only recover the missing data but also effectively facilitate the subsequent analysis of biological mechanisms. The existing imputation methods all have drawbacks and limitations: some require a pre-assumed data distribution, some cannot distinguish between technical and biological zeros, and some have poor computational performance. In this paper, we present a novel imputation software, FRMC, for single-cell RNA-Seq data, which introduces a fast and accurate singular value thresholding approximation method. The experiments demonstrate that FRMC can not only precisely distinguish ‘true zeros’ from dropout events and correctly impute missing values attributed to technical noise, but also effectively enhance intracellular and intergenic connections and achieve accurate clustering of cells in biological applications. In summary, FRMC can be a powerful tool for analysing single-cell data because it ensures biological significance, accuracy, and rapidity simultaneously. FRMC is implemented in Python and is freely accessible to non-commercial users on GitHub: https://github.com/HUST-DataMan/FRMC.
KEYWORDS: Imputation, scRNA-seq, dropout event, low-rank matrix optimization, singular value thresholding iteration, sparsity
1. Introduction
Single-cell RNA sequencing (scRNA-seq) has become one of the most widely used technologies in biomedical investigation over the past few years. The traditional bulk RNA sequencing method works with RNA isolated from millions of cells, so the final gene expression value should be considered as the average over all input cells and is therefore more suitable for revealing a global view of gene expression [1,2]. However, bulk RNA sequencing cannot accurately quantify RNA content, and it suffers from measurement bias when samples contain multiple heterogeneous cell populations (e.g. at early embryonic developmental stages where the number of cells is relatively limited and multiple lineages exist) [3]. In contrast, scRNA-seq overcomes this disadvantage of traditional bulk RNA sequencing and enables analysis of transcriptome data at the single-cell level, thereby obtaining information on gene sequences, transcripts, and epigenetics of a single cell [4–6].
With the development of high-throughput sequencing technology and the reduction of sequencing costs, the number of studies using scRNA-seq technology has exploded, revealing many important biological discoveries [7–9]. A growing number of studies have demonstrated the irreplaceable application of identifying different cell subtypes of seemingly similar cells, including discriminating cancer heterogeneity [10,11], discovering and identifying novel cell populations, and elucidating changes of transcriptional dynamics in cellular developmental trajectories [12–18].
Despite its many advantages, the scRNA-seq technique is still plagued by relatively high technical noise. One general problem is insufficient RNA input. For example, some low-abundance transcripts are often lost in the reverse transcription step because of their low quantity [19]. In addition, the expression counts observed are usually only a small random sample (typically 5–15%) of the transcriptome of each cell because of the low capture and sequencing efficiency [2,20,21]. These technical factors cause high sparsity in the gene-cell expression matrix of scRNA-seq datasets, also known as ‘zero inflation’, i.e. a large percentage (even exceeding 90%) of zero or very low values in the expression matrix. Some of these zeros are ‘true zeros’, defined as genes that were never expressed in the first place, and some are ‘false zeros’, also known as dropout events, which refer to situations where transcripts are physically present but not amplified and detected, and which are in essence a special type of missing value [19].
Commonly used scRNA-seq statistical tools currently handle dropout events in different ways [3]. Excluding gene data containing dropout events directly from the analysis would inevitably lead to loss of information and sacrifice the opportunity to make numerous biologically valuable findings [22]. Some approaches directly use dimension-reduction or clustering algorithms such as PCA to aggregate genes into ‘meta-genes’ or to collapse thousands of cells into a small number of clusters, which addresses the sparsity of scRNA-seq data to some extent but simultaneously loses resolution at the single-cell or single-gene level [23]. Therefore, imputation of dropout events is an indispensable step in processing scRNA-seq datasets. From a biological point of view, a reasonable imputation strategy should have the following discriminatory property: it should keep the ‘true zero’ counts unaffected, while correctly recovering the expression counts of the genes in which a dropout event occurred [22].
In response to the above challenges, there are two main types of imputation. The first type is based on predetermined statistical models, and the limitation of such algorithms is their requirement to make appropriate statistical distribution assumptions on the data to be processed in advance. Examples include scImpute based on a gamma-normal mixed statistical model [24], VIPER based on a non-negative sparse regression model [25], 2DImpute based on the assumption of interrelationships between genes and cells [26], etc. The other type is built on a mathematical optimization model for low-rank matrix recovery without the need to make a priori statistical distribution assumptions, such as McImpute [22] and ALRA [27].
To tackle the drawbacks and limitations of existing imputation methods, we propose in this paper a novel, fast and robust imputation software based on matrix completion, called FRMC, for scRNA-seq data. FRMC achieves superior accuracy and speed simultaneously by introducing a singular value thresholding approximation method and combining it with a mathematical optimization model for low-rank matrix completion and a prejudgment algorithm for distinguishing dropout events. The experiments we performed demonstrate that FRMC can not only precisely distinguish ‘true zeros’ from dropout events and correctly impute missing values attributed to dropout events, but also effectively enhance intracellular and intergenic connections and achieve accurate clustering of cells in biological applications.
2. Methods
2.1. Algorithms and implementation
The FRMC method imputes scRNA-seq data by taking a two-step approach: (1) first evaluate the ‘similar cell set’ and the proportional index of gene expression on this set on the basis of Jaccard similarity coefficients to further predict the biological ‘true zero’ and dropout events; (2) transform the imputation problem of expression matrix into a low-rank matrix optimization problem, then minimize the augmented Lagrangian function of this constrained minimization problem through a singular value thresholding iteration to recover the complete gene expression matrix.
2.1.1. Prejudgment method
Before the imputation algorithm starts, we prejudge whether each zero in the data expression matrix is a dropout that requires imputation, or a ‘true zero’ that does not require any subsequent changes and mark them.
We first identify, for each cell, a set of ‘similar cells’, defined as cells that have, as far as possible, the same genes expressed (with positive values) and not expressed (with zero values). To quantify the similarity between cell $i$ and cell $j$, we use the Jaccard similarity coefficient based on a binary expression value (0 for non-expression and 1 for gene expression), defined as
$$J(i,j) = \frac{|G_i \cap G_j|}{|G_i \cup G_j|}$$
Here, $G_i$ and $G_j$ stand for the sets of genes expressed in cell $i$ and cell $j$, respectively. A group of ‘similar cells’ is defined as the collection of all cells that have a Jaccard similarity coefficient greater than 0.5 with cell $i$. If the number of cells in this group is less than 10, the zeros of cell $i$ are left unchanged because of insufficient confidence. In each group of similar cells, we calculate the proportion of cells in which each gene is expressed (>0) as an indicator. If this indicator exceeds a critical threshold of 20% for a gene, we prejudge that the zero expression value of that gene is a dropout event; otherwise it is a true zero.
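The prejudgment rule described above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the FRMC source code: the function name `prejudge_dropouts` and its parameter names are hypothetical, the thresholds (0.5, 10, 20%) are taken from the text, and excluding a cell from its own group of similar cells is an assumption.

```python
import numpy as np

def prejudge_dropouts(expr, sim_threshold=0.5, min_group=10, expr_fraction=0.2):
    """Mark zeros that are prejudged to be dropout events.

    expr: (cells x genes) non-negative expression matrix.
    Returns a boolean matrix, True where a zero is prejudged as a dropout.
    """
    expr = np.asarray(expr, dtype=float)
    binary = (expr > 0).astype(float)            # 1 = expressed, 0 = not expressed
    inter = binary @ binary.T                    # size of the intersection for every cell pair
    n_expr = binary.sum(axis=1)
    union = n_expr[:, None] + n_expr[None, :] - inter
    jaccard = np.divide(inter, union, out=np.zeros_like(inter), where=union > 0)

    dropout = np.zeros(expr.shape, dtype=bool)
    for i in range(expr.shape[0]):
        similar = jaccard[i] > sim_threshold
        similar[i] = False                       # assumption: a cell is not its own neighbour
        if similar.sum() < min_group:
            continue                             # too few similar cells: keep zeros unchanged
        fraction = binary[similar].mean(axis=0)  # share of similar cells expressing each gene
        dropout[i] = (expr[i] == 0) & (fraction > expr_fraction)
    return dropout
```

Zeros that are not marked by the returned matrix would be treated as ‘true zeros’ and kept fixed in the subsequent imputation step.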
2.1.2. Imputation algorithm
The imputation of an scRNA-seq expression matrix uses the partially observed gene expression matrix (where rows represent individual cell IDs and columns represent genes) to recover the missing unknown values (i.e. zero values). Assume that the observed scRNA-seq expression matrix $Y$ is a sampled version of the complete expression matrix $X$. The mathematical representation is: $Y = P_\Omega(X)$. Here $P_\Omega$ is the orthogonal projection, which acts as a binary mask matrix containing only 0 and 1 of the same dimension as $X$, where 0 indicates that no gene expression is observed at the corresponding position in $X$ and 1 indicates that gene expression is observed at the corresponding position in $X$. Its mathematical expression is $[P_\Omega(X)]_{ij} = X_{ij}$ if $(i,j) \in \Omega$ and $[P_\Omega(X)]_{ij} = 0$ otherwise, where $\Omega$ is the set of indices of the known values in the data matrix. In summary, the imputation problem can be stated as follows: given the gene expression observations $Y$ and the binary mask $P_\Omega$, one recovers the original complete gene expression matrix $X$.
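As a small illustration of this observation model, the projection onto the known entries can be written as an element-wise mask. The names below (`project_onto_observed`, `omega_mask`, `X_complete`, `Y_observed`) are hypothetical and the matrices are toy values chosen only for the example.

```python
import numpy as np

def project_onto_observed(X, omega_mask):
    """Keep the entries whose index is in the known set, set all others to zero."""
    return np.where(omega_mask, X, 0.0)

# Toy complete matrix (2 cells x 3 genes) and a mask of the known positions.
X_complete = np.array([[5.0, 0.0, 2.0],
                       [1.0, 3.0, 0.0]])
omega_mask = np.array([[True, True, False],
                       [False, True, True]])

# The observed matrix is the sampled version of the complete matrix.
Y_observed = project_onto_observed(X_complete, omega_mask)
```

Applying the projection twice gives the same result as applying it once, which is the property that makes it a projection.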
It has been proven that a matrix $X$ of rank $r$ can be recovered from its observed entries $Y = P_\Omega(X)$ by solving the following nuclear norm minimization problem (NNM) [28,29]:
$$\min_{X} \|X\|_* \quad \text{s.t.} \quad P_\Omega(X) = Y \tag{1}$$
where $\|X\|_*$ denotes the nuclear norm of the matrix, i.e. the sum of all singular values of the matrix. This motivates us to extend the model to the imputation of dropout events in the actual scRNA-seq data expression matrix. To solve the NNM problem easily by using the existing optimization theory and Lagrangian methods, FRMC introduces an error matrix $E$. The NNM problem is further transformed into the following optimization problem with equality constraints:
$$\min_{X,E} \|X\|_* \quad \text{s.t.} \quad X + E = Y, \quad P_\Omega(E) = 0 \tag{2}$$
According to the constrained optimization theory [30], by introducing a Lagrangian multiplier matrix $\Lambda$ to remove the equality constraint, the optimal solutions of (2) can be approximated by unconstrained minima of the following augmented Lagrangian function:
$$L(X, E, \Lambda, \mu) = \|X\|_* + \langle \Lambda, Y - X - E \rangle + \frac{\mu}{2}\|Y - X - E\|_F^2 \tag{3}$$
where $\mu$ is a positive penalty parameter and $\|\cdot\|_F$ denotes the Frobenius norm of the matrix, i.e. $\|A\|_F = \sqrt{\sum_{i,j} a_{ij}^2}$ for matrix $A = (a_{ij})$. The details of the augmented Lagrangian method for solving the optimization problem with equality constraints are given in Section 4.2 of [31].
From Theorem 2.1 and its proof in Cai’s work [32], it can be seen that
$$U \mathcal{S}_\tau(\Sigma) V^T = \arg\min_{X} \left\{ \tau\|X\|_* + \frac{1}{2}\|X - W\|_F^2 \right\}$$
namely, the optimal solution of $\min_X \tau\|X\|_* + \frac{1}{2}\|X - W\|_F^2$ about $X$ is $U \mathcal{S}_\tau(\Sigma) V^T$, where $(U, \Sigma, V) = \mathrm{svd}(W)$ and $\mathcal{S}_\tau(\Sigma) = \mathrm{diag}(\max(\sigma_i - \tau, 0))$ for $\Sigma = \mathrm{diag}(\sigma_i)$. The notation svd denotes the singular value decomposition of a matrix. To solve problem (2), the FRMC algorithm in this paper iteratively updates $X$ and $E$ by minimizing the augmented Lagrangian function (3) with respect to $X$ and $E$ respectively, with the Lagrangian multiplier matrix $\Lambda$ fixed. The specific algorithmic steps are as follows:
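Input: scRNA-seq gene expression matrix $Y$ obtained by the prejudgment method.
Step 0 Initialization: choose $E_0$, $\Lambda_0$ and $\mu_0 > 0$, and set $k = 0$.
Step 1 If either of the convergence conditions holds, STOP.
Step 2 Solve $X_{k+1} = \arg\min_{X} L(X, E_k, \Lambda_k, \mu_k)$:
$$(U, \Sigma, V) = \mathrm{svd}\left(Y - E_k + \Lambda_k / \mu_k\right) \tag{4}$$
$$X_{k+1} = U \mathcal{S}_{1/\mu_k}(\Sigma) V^T \tag{5}$$
where $\mathcal{S}_\tau(\Sigma) = \mathrm{diag}(\max(\sigma_i - \tau, 0))$.
Step 3 Solve $E_{k+1} = \arg\min_{P_\Omega(E) = 0} L(X_{k+1}, E, \Lambda_k, \mu_k)$:
$$E_{k+1} = P_{\bar\Omega}\left(Y - X_{k+1} + \Lambda_k / \mu_k\right) \tag{6}$$
where $\bar\Omega$ is the set of indices of the unknown values in $Y$.
Step 4 Update Lagrangian multiplier matrix $\Lambda$:
$$\Lambda_{k+1} = \Lambda_k + \mu_k \left(Y - X_{k+1} - E_{k+1}\right) \tag{7}$$
Step 5 Update $\mu$: update the penalty parameter from $\mu_k$ to $\mu_{k+1}$.
Step 6 Set $k = k + 1$. Return to Step 1.
Output: Imputed gene expression matrix $X_k$.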
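The iteration above can be sketched with NumPy as follows. This is a generic augmented Lagrangian loop with exact singular value soft-thresholding, given only to make the update order concrete; it is not the FRMC implementation and does not reproduce its fast approximation of the thresholding step. The function names, the initial penalty `1 / ||Y||_2`, the geometric growth factor `rho`, and the stopping rule are all assumptions chosen for the example, and `Y` is assumed to be non-zero.

```python
import numpy as np

def svt(W, tau):
    """Singular value soft-thresholding: shrink every singular value of W by tau."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def alm_complete(Y, observed, mu=None, rho=1.5, tol=1e-6, max_iter=200):
    """Low-rank completion of Y by an augmented Lagrangian iteration.

    Y: matrix with zeros at the unknown positions.
    observed: boolean matrix, True where the entry of Y is known.
    """
    Y = np.asarray(Y, dtype=float)
    if mu is None:
        mu = 1.0 / np.linalg.norm(Y, 2)      # assumed initial penalty
    E = np.zeros_like(Y)                     # error matrix, supported on unknown entries
    Lam = np.zeros_like(Y)                   # Lagrangian multiplier matrix
    X = np.zeros_like(Y)
    norm_Y = np.linalg.norm(Y)               # Frobenius norm
    for _ in range(max_iter):
        # Minimize the augmented Lagrangian over X (thresholding with 1/mu).
        X = svt(Y - E + Lam / mu, 1.0 / mu)
        # Minimize over E, which must vanish on the known entries.
        E = np.where(observed, 0.0, Y - X + Lam / mu)
        residual = Y - X - E
        Lam = Lam + mu * residual            # multiplier update
        mu = rho * mu                        # assumed geometric penalty growth
        if np.linalg.norm(residual) <= tol * norm_Y:
            break
    return X
```

In this sketch the known entries of `Y` are reproduced by the returned matrix up to the tolerance, while the unknown entries are filled in by the low-rank estimate.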
2.2. Performance evaluation
2.2.1. Datasets
This paper uses scRNA-seq datasets from five different studies to conduct the evaluation experiments, all of which are available for download from public websites.
(1) Usoskin dataset [33]: This dataset contains mouse neuronal data, obtained by RNA-Seq of 799 dissociated single cells dissected from mouse lumbar dorsal root ganglia (DRG). After principal component analysis (PCA) of the expression of all cells and genes, 622 cells were classified as neuronal, 109 as non-neuronal and 68 as unknown cell types. We considered the four neuronal clusters of mouse lumbar DRG: neurofilament-containing (NF), non-peptidergic nociceptors (NP), peptidergic nociceptors (PEP) and tyrosine hydroxylase-containing (TH). RPM-normalized expression data are available from the NCBI Gene Expression Omnibus (GEO) under accession GSE59739.
(2) Jurkat-293T dataset [34]: This dataset contains the expression profiles of Jurkat and 293T cells, which were mixed in equal proportions (50%:50%) in vitro. All approximately 3,380 cells were annotated according to the expression of cell type-specific markers. Cells expressing the gene CD3D were assigned to Jurkat and cells expressing the gene XIST were assigned to 293T. The dataset is available from the 10x Genomics website: https://support.10xgenomics.com/single-cell-gene-expression/datasets/1.1.0/jurkat:293t_50:50.
(3) InDrop dataset [35]: This single-cell RNA-seq dataset contains immune cells from eight patients with primary breast cancers and matched normal tissues. Two cell sets were selected for this study, InDrop-BC6 consisting of 3498 cells from one primary breast cancer of patient BC6, and InDrop-BC4 consisting of 13,266 cells from the blood of the patient BC4. This dataset can be downloaded from NCBI GEO with the ID number GSE114725.
(4) Smart-Seq2 dataset [36]: This dataset contains single-cell RNA sequencing profiles from 33 human melanoma tumours, including 17 newly collected patient tumours and 16 previously reported patient tumours. Data containing 2987 cells from the new cohort were selected for this study. This data is accessible from NCBI GEO GSE115978.
(5) NeuronCell dataset: This dataset contains more than 100,000 neuronal cells derived from the temporal cortices of nine epileptic and ten non-epileptic subjects. The data are available on GitHub: https://github.com/khodosevichlab/Epilepsy19.
2.2.2. Data preprocessing
Data preprocessing is divided into three steps. First, low-abundance genes are removed before imputation and other downstream analyses [20,37]: we filter the data expression matrix for genes with extremely low expression, so that the retained genes are expressed (at least three reads) in at least 20 cells. Next, the filtered data expression matrix is normalized by library size: for each cell, the expression of each gene is divided by the total expression of all genes and then multiplied by 10,000. Finally, the normalized single-cell data expression matrix is transformed to obtain a new expression matrix, which is used as the input to FRMC to perform imputation.
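The first two preprocessing steps can be sketched as follows. The function name `filter_and_normalize` and its parameters are hypothetical; the thresholds (three reads, 20 cells, scale factor 10,000) come from the text. The final transformation is not specified in the text, so it is deliberately left out of this sketch.

```python
import numpy as np

def filter_and_normalize(counts, min_reads=3, min_cells=20, scale=10_000):
    """Filter low-abundance genes and normalize each cell by its library size.

    counts: (cells x genes) raw count matrix.
    Returns the normalized matrix and the boolean mask of retained genes.
    """
    counts = np.asarray(counts, dtype=float)
    # Keep genes with at least `min_reads` reads in at least `min_cells` cells.
    keep = (counts >= min_reads).sum(axis=0) >= min_cells
    filtered = counts[:, keep]
    # Divide each cell by its total expression, then rescale.
    totals = filtered.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0                # avoid division by zero for empty cells
    return filtered / totals * scale, keep
```

After this step every non-empty cell has the same total expression, so differences in sequencing depth between cells no longer dominate the matrix passed to imputation.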
2.2.3. Dimensions and metrics
To comprehensively evaluate the performance of FRMC, we evaluated and compared the performance and accuracy of FRMC with the existing methods McImpute, ScImpute, 2Dimpute, and ALRA. All these baseline imputation methods take a gene expression matrix as input and output an imputed gene expression matrix with the same dimensions, and the implementation of the baseline algorithms is listed in Supplementary Table S1. Five dimensions were evaluated: imputation accuracy, identification of ‘true zero’ and dropout events, enhancement of intergenic linkage, improvement of intracellular linkage, and computation time.
3. Results
3.1. Accuracy of imputation
To investigate the accuracy of imputation, we simulated a certain percentage of missing values using the real single-cell dataset mentioned above to compare the five imputation methods FRMC, McImpute, ScImpute, 2Dimpute, and ALRA in terms of error and correlation.
For each gene in the original data matrix of the InDrop dataset, the set of cells with non-zero expression is selected; cells are then randomly selected from this set according to a Bernoulli distribution at different masking ratios (2%, 5%, 10%, 20%), and the gene expression of these cells is masked to zero to simulate dropout, forming a new dataset for each ratio. Here, the masking ratio refers to the percentage of non-zero elements in the data matrix that are masked to missing values. The five methods are then used to impute these four datasets. The Normalized Root Mean Square Error (NRMSE) [38] and the Pearson correlation coefficient (PCC), both computed between the real original data matrix and the imputed data matrix, are used to compare the error level and the correlation level.
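The masking experiment and the two metrics can be sketched as below. The helper names are hypothetical. In particular, the NRMSE shown here uses one common definition (Frobenius norm of the difference divided by the Frobenius norm of the original matrix), which is an assumption and may differ from the exact formula used in the evaluation.

```python
import numpy as np

def mask_nonzeros(expr, ratio, rng):
    """Set each non-zero entry to zero independently with probability `ratio`."""
    expr = np.asarray(expr, dtype=float)
    hide = (expr > 0) & (rng.random(expr.shape) < ratio)   # Bernoulli draw per entry
    masked = expr.copy()
    masked[hide] = 0.0
    return masked, hide

def nrmse(original, imputed):
    """Relative Frobenius error (assumed definition of NRMSE)."""
    original = np.asarray(original, dtype=float)
    imputed = np.asarray(imputed, dtype=float)
    return np.linalg.norm(original - imputed) / np.linalg.norm(original)

def pcc(original, imputed):
    """Pearson correlation between the flattened original and imputed matrices."""
    return np.corrcoef(np.ravel(original), np.ravel(imputed))[0, 1]
```

A typical use would be to call `mask_nonzeros` once per masking ratio with a seeded `np.random.default_rng`, impute each masked matrix, and then compare the result with the unmasked matrix using `nrmse` and `pcc`.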
Across the different masking ratios in the InDrop-BC6 dataset, FRMC has the lowest error level, with an NRMSE as low as 0.522, followed by the McImpute method, while the other methods show larger errors; in particular, the ALRA method has the largest error of 0.998 (Fig. 1(a)). In addition, FRMC has a high Pearson correlation coefficient between the imputed data matrix and the true original data matrix, with a minimum of 0.838, followed by the McImpute method, while the other methods exhibit lower correlation levels, especially the ALRA method with a correlation coefficient of only 0.6 (Fig. 1(b)). Considering that the large number of zero values in these matrices may bias the final PCC scores, the PCC scores were also calculated after removing the positions that are zero in both the actual matrix and the imputed matrix, and this validation was added for both the Jurkat-293T and the Usoskin datasets (Supplementary Table S2); the results are consistent with those for the InDrop-BC6 dataset. These results suggest that the values recovered by FRMC for the missing entries of the data expression matrix are closer to the true original values.
Figure 1.
Accuracy comparison of FRMC with four other imputation methods on the InDrop-BC6 dataset. Error levels and correlation levels between the original values and the imputed values of the expression matrices with different artificial random masking ratios (2%, 5%, 10%, and 20%) were measured using: (a) Normalized root-mean-square error (NRMSE) and (b) Pearson correlation coefficients
3.2. Identification of ‘true zero’ and dropout events
The performance of FRMC and four other algorithms in identifying ‘true zero’ and dropout events was evaluated using the InDrop-BC6 dataset to compare the expression changes of specific genes before and after imputation.
The gene PTPRC (Protein Tyrosine Phosphatase Receptor Type C), also known as leukocyte common antigen (LCA) or CD45, is widely present on the surface of leukocytes. The cells in the InDrop-BC6 dataset were purified by fluorescence-activated cell sorting (FACS), and PTPRC is the fluorescence-tagged gene used for sorting, which is necessarily expressed, so it can be assumed that all zero values of this gene in the measured expression data are dropout events. FRMC and 2Dimpute accurately predicted the missing values of this gene as dropout events and imputed them. In contrast, after processing by ScImpute and ALRA, the gene expression values still contained a large number of missing values that had not been imputed, indicating that these two methods did not accurately distinguish between ‘true zero’ and dropout events (Fig. 2(a)).
Figure 2.
Violin plots of the expression distributions before and after imputation by FRMC and four other algorithms for different cell type marker genes on the InDrop-BC6 dataset. (a) PTPRC: universally expressed on all leukocytes; (b) CD68: macrophage-specific; (c) CD4: T cell-specific; (d) CD14: monocyte-specific; (e) ITGAX: dendritic cell-specific; (f) GAPDH: a house-keeping gene universally expressed on all human cells
Another gene, CD68, is macrophage-specific, and the InDrop-BC6 dataset involves multiple types of leukocytes including macrophages, so a large number of the zero values of CD68 in the expression matrix are ‘true zeros’. The four imputation methods FRMC, 2Dimpute, ScImpute and ALRA correctly predicted the missing expression values of the CD68 gene as ‘true zeros’ and left them without imputation, while McImpute identified all zeros of the CD68 gene as dropout events and imputed them without exception, which is incorrect and deviates from the actual biological significance (Fig. 2(b)). In addition, we also validated this result on CD4 (a T cell-specific gene), CD14 (a monocyte-specific gene), ITGAX (a dendritic cell-specific gene) and GAPDH (a house-keeping gene universally expressed on all human cells) (Fig. 2).
Taken together, these comparisons show that FRMC and 2Dimpute have the best ability to discriminate between ‘true zero’ and dropout events and impute only the missing values that need to be imputed, which is consistent with the actual biological knowledge of the cells and genes in the dataset.
3.3. Enhancement of intergenic linkage
The heavy presence of missing values in scRNA-seq datasets weakens the true gene-to-gene associations and hinders the subsequent identification of marker genes in cell subpopulations. To assess whether the algorithm is able to strengthen and recover associations between genes, we used biologically significant co-expressed genes (FTL and CD68 in the InDrop-BC6 dataset) for a comparative analysis [35].
In the original dataset, the gene CD68 showed a comparatively weak correlation with the gene FTL, with a PCC of 0.72. After imputation by the FRMC, McImpute, and ALRA algorithms, the linear correlation between the co-expressed genes FTL and CD68 was enhanced, with Pearson correlation coefficients reaching 0.8349, 0.8465, and 0.8954, respectively (Fig. 3). After imputation by 2Dimpute, the correlation coefficient of this pair of co-expressed genes increased slightly to 0.8019. Worst of all, the expression data lost their original co-expression after imputation by the ScImpute method, showing only a weak correlation. This confirms that FRMC can enhance the association between genes by imputation.
Figure 3.
Scatter plots of co-expressed genes FTL and CD68 before and after imputation by FRMC and four other algorithms on the InDrop-BC6 dataset. The straight (black) line represents the regression line fitted by a linear model
3.4. Improvement of intracellular linkage
Both PCC within cells before and after imputation and cell clustering maps were analysed as a way to assess whether the imputation algorithm can enhance cell-to-cell correlations.
We used a Jurkat-293T dataset containing two in vitro equal proportions of mixed cell types (Jurkat cell type and 293T cell type). In the unimputed expression data, the mean Pearson coefficients between cells in cell types 293T and Jurkat were 0.74 and 0.71 with standard deviations of 0.056 and 0.054, respectively (Fig. 4). After FRMC imputation, the mean PCC between cells in both cell types 293T and Jurkat increased to 0.82 and 0.80, respectively, and the standard deviations decreased to 0.039 and 0.036, respectively (Fig. 4), illustrating that FRMC enhanced the intracellular correlations (i.e. intracellular connections) between the two cell types Jurkat and 293T. The other four imputation algorithms also enhanced intracellular correlations to some extent (Fig. 4).
Figure 4.
Intra-cellular correlation within the ‘Jurkat’ and ‘293T’ cell types before and after imputation by FRMC and four other algorithms on the Jurkat-293T dataset. Error bars represent 1 SD
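The within-type correlation summary reported above can be sketched as follows. The function name `within_type_pcc` and the label format are hypothetical, and using the population standard deviation (NumPy's default) is an assumption; the text does not state which estimator was used.

```python
import numpy as np

def within_type_pcc(expr, labels, cell_type):
    """Mean and SD of pairwise Pearson correlations among cells of one type.

    expr: (cells x genes) expression matrix; labels: one cell type label per cell.
    """
    cells = np.asarray(expr, dtype=float)[np.asarray(labels) == cell_type]
    corr = np.corrcoef(cells)                         # rows (cells) are the variables
    pairs = corr[np.triu_indices_from(corr, k=1)]     # each cell pair counted once
    return pairs.mean(), pairs.std()
```

Running this once on the unimputed matrix and once on the imputed matrix, for each of the two cell types, yields the kind of before-and-after comparison summarized in Fig. 4.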
To further investigate how accurately the FRMC algorithm enhances intracellular linkage, we randomly selected 50 cells from the Jurkat-293T dataset to demonstrate the clustering of cells before and after imputation. In the following, a cell name with the ‘_1’ identifier stands for cell type ‘293T’, and a cell name with the ‘_2’ identifier stands for cell type ‘Jurkat’. The 50 cells were randomly sampled from the original gene expression matrix, the same 50 cells were extracted from the FRMC-treated gene expression matrix, and the two resulting datasets were then analysed by cell clustering using the R package pheatmap. Incorrect clusterings occurred in the clustering graph of the data before imputation: cells CL2505_1 and CL2019_2 formed their own separate clustering branches. Presumably, this is due to the large number of missing values in the single-cell dataset, which diminishes intracellular correlation. After imputation by the FRMC algorithm, however, both cells CL2505_1 and CL2019_2 were correctly classified into their true cell types, and the clustering relationships between the two types of cells were more clearly defined (Fig. 5). Thus, after imputation by FRMC, the single-cell gene expression matrix more accurately reflects the true intracellular connections and facilitates downstream single-cell analysis.
Figure 5.
Cell clustering map based on gene expression before and after the imputation of FRMC on the dataset Jurkat-293T. Yellow cell names identified with ‘_1’ represent cell type ‘293T’, and blue cell names identified with ‘_2’ represent cell type ‘Jurkat’
3.5. Runtime
We evaluated the computational efficiency of the algorithms on six single-cell datasets, including four datasets with fewer than 10,000 cells (Jurkat-293T, InDrop-BC6, Usoskin, and Smart-Seq2) and two large-scale single-cell datasets, InDrop-BC4 with about 13k cells and NeuronCell with more than 100k cells. These datasets have different characteristics in terms of the number of cells and genes and the percentage of missing values; the InDrop-BC4 dataset has the highest percentage of missing values at 94.22%, with 9218 genes. On the five datasets other than NeuronCell, we ran the five algorithms with the number of parallel CPU cores set for 2Dimpute and ScImpute (2Dimpute with cores set to 22, i.e. using 22 CPU cores in parallel, and ScImpute with cores set to 20) and with the number of CPU cores left unset for the other algorithms. On the NeuronCell dataset, the ScImpute and 2DImpute methods require more memory as the number of cores increases, and the machine cluster used for testing has 128 GB of memory. In this environment, ScImpute and 2DImpute failed to run because of insufficient memory when ncores was set higher than 1, so ncores could only be set to 1.
Among the datasets with similar cell counts (Smart-Seq2, InDrop-BC6, and Jurkat-293T), FRMC has the longest runtime on the Smart-Seq2 dataset, which has the highest zero ratio and the largest number of genes of the three (Table 1), reflecting the greater difficulty of imputing a single-cell dataset with a high number of genes and a high percentage of missing values. Across the datasets in Table 1, ALRA is the fastest method and can finish in as little as 0.8 minutes; on the Usoskin dataset, FRMC is second only to ALRA and completes in 2.5 minutes, which is close to the runtime of McImpute; the most time-consuming algorithm is 2Dimpute, which takes up to 6021 minutes on Smart-Seq2. Furthermore, FRMC can also impute successfully on large-scale single-cell datasets. On the InDrop-BC4 dataset with 13,266 cells, FRMC is second only to ALRA in speed; on the NeuronCell dataset with 100,000 cells, FRMC is one of only three methods that run successfully within 7 days.
Table 1.
The runtime of FRMC and other four algorithms on six different datasets
Datasets | Cells × Genes | Zero ratio | FRMC runtime (min) | McImpute runtime (min) | ScImpute runtime (min) | 2DImpute runtime (min) | ALRA runtime (min) |
---|---|---|---|---|---|---|---|
Smart-Seq2 | 2987 × 16,727 | 72.40% | 64.5 | 64.2 | 58.9 | 6021 | 2.5 |
InDrop-BC6 | 3498 × 2423 | 70.03% | 21 | 19.5 | 34 | 204 | 0.8 |
Jurkat-293T | 3388 × 3702 | 36.00% | 10.1 | 28 | 72.8 | 448 | 1 |
Usoskin | 622 × 10,554 | 64.40% | 2.5 | 3 | 13 | 504 | 0.8 |
InDrop-BC4 | 13,266 × 9218 | 94.22% | 203.7 | 838 | 224 | 4675 | 4.5 |
NeuronCell | 100,000 × 8746 | 78.16% | 9703 | 1766 | NA* | NA* | 29 |
*runtime longer than 7 days (10,080 minutes).
4. Discussions
The high-resolution feature of single-cell transcriptome sequencing technology allows researchers to observe cellular gene expression profiles at the single-cell level, offering numerous possibilities for subsequent studies, but also poses a serious problem of missing data, with expression matrices of high missing values instead hindering downstream analysis. To address this problem, it is essential to develop a more rapid and stable data imputation method with greater accuracy, which should not only be able to recover the missing data, but also effectively facilitate the subsequent biological mechanism analysis.
Although several relatively mature imputation algorithms exist for processing scRNA-seq data, each of them has its own bias and drawbacks. As mentioned earlier, a priori-type algorithms based on statistical models need to make appropriate statistical distribution assumptions about the data to be processed, yet these algorithms do not share a common distribution model: they use, for example, a gamma-normal mixed distribution (ScImpute), a Poisson-gamma mixed distribution (SAVER) or a zero-inflated Poisson distribution (VIPER). Evidently, there is no established distribution model that can accurately describe all scRNA-seq features. We therefore sought a more general algorithm that can solve this problem without such bias. In fact, imputation is a technique that is widely used in various fields, such as Netflix’s movie recommendation system, speech recovery, and image reconstruction [39–42]. The cardinal goal of all these algorithms is to recover an unknown low-rank matrix from a very limited amount of information. Whether the imputation problem of scRNA-seq data can be transformed into a low-rank matrix recovery problem is therefore the key to determining whether the imputation techniques from these other fields can be applied to scRNA-seq data processing.
Numerous studies have confirmed that gene expression does not behave in isolation, but is interdependent as part of a complex network structure [43–45]. Different genes may be highly correlated with each other due to shared regulatory mechanisms, functional interdependence, or the same tissue origin [46,47]. It has also been suggested that it is reasonable and feasible to assume that the expression values of genes lie in a linear subspace of low dimensionality [22]. Therefore, a low-rank matrix model is feasible for imputing single-cell transcriptome data matrices. There are various alternative approaches to solving this low-rank matrix model. One is the singular value thresholding iterative method for solving NNM [32], which mainly performs singular value shrinkage iterations, namely a soft-thresholding operation on the singular values of the iterate at each step until a stopping criterion is reached. It is efficient at addressing the NNM problem, but it requires a very large number of iterations to converge, which limits its applicability. Another is the FPCA method (Fixed Point Continuation with Approximate SVD) [48], which incorporates fixed point continuation into a Bregman iteration, but has limitations in solving large-scale NNM problems. In addition, the accelerated proximal gradient approach [49] is an accelerated proximal gradient singular value thresholding algorithm that is relatively efficient and robust in solving large-scale NNM problems; however, it has only a sub-linear convergence rate in theory. Based on this, FRMC takes a more efficient route by using the existing optimization theory and the Lagrangian method and then solving the transformed problem through a singular value thresholding iteration.
To address the shortcomings and limitations of existing imputation methods for single-cell transcriptomics datasets, we have developed a new, fast imputation software, FRMC, that transforms the single-cell gene expression matrix imputation problem into a low-rank matrix optimization problem. Its core innovation is a fast and accurate singular value thresholding approximation method combined with a mathematical optimization model for low-rank matrix recovery and a ‘true zero’ prediction algorithm. It is worth mentioning that some algorithms integrate the imputation step into other data processing steps, performing imputation simultaneously with processes such as normalization, batch effect correction, and clustering [50,51]. In contrast, FRMC is a separate step dedicated to processing dropout events and can be flexibly integrated into a single-cell data analysis pipeline, allowing researchers to take the characteristics of their own data into account and adopt appropriate correction strategies. FRMC is implemented in Python and is freely accessible to non-commercial users on GitHub: https://github.com/HUST-DataMan/FRMC.
The results revealed that FRMC can not only accurately distinguish ‘true zeros’ from dropout events and correctly impute the missing values generated by dropout events, but can also effectively enhance intracellular and intergenic connections and achieve accurate clustering of cells. In addition, it runs relatively fast. In conclusion, FRMC is an optimized scRNA-seq imputation software that combines accuracy and speed when compared with the existing 2DImpute, scImpute, McImpute and ALRA imputation algorithms.
Finally, FRMC has its limitations. It can currently only be used to impute missing values in single-cell transcriptomic data and is not suitable for other types of datasets. In the future, we will extend the FRMC method to more application scenarios, such as the imputation of GWAS genotype data [52] and of missing values in metagenomics data [53]. Another area for improvement is the use of parallelization to accelerate the algorithm.
Acknowledgments
The authors are very grateful to all friends who gave them suggestions on the data analysis and writing. The authors also gratefully acknowledge all anonymous reviewers for their careful reading of the paper and their helpful comments.
Author contributions
H W, X W and M C designed the study, summarized the findings, interpreted the data and drafted the manuscript.
X W implemented the algorithms and the computational analyses.
R X produced the visualizations.
K Z supervised the project.
Disclosure statement
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Supplementary material
Supplemental data for this article can be accessed here.
References
- [1] Shapiro E, Biezuner T, Linnarsson S. Single-cell sequencing-based technologies will revolutionize whole-organism science. Nat Rev Genet. 2013;14(9):618–630.
- [2] Stegle O, Teichmann SA, Marioni JC. Computational and analytical challenges in single-cell transcriptomics. Nat Rev Genet. 2015;16(3):133–145.
- [3] Gong W, Kwak IY, Pota P, et al. DrImpute: imputing dropout events in single cell RNA sequencing data. BMC Bioinformatics. 2018;19(1):220.
- [4] Tang F, Barbacioru C, Wang Y, et al. mRNA-Seq whole-transcriptome analysis of a single cell. Nat Methods. 2009;6(5):377–382.
- [5] Islam S, Kjallquist U, Moliner A, et al. Characterization of the single-cell transcriptional landscape by highly multiplex RNA-seq. Genome Res. 2011;21(7):1160–1167.
- [6] Wagner A, Regev A, Yosef N. Revealing the vectors of cellular identity with single-cell genomics. Nat Biotechnol. 2016;34(11):1145–1160.
- [7] Rozenblatt-Rosen O, Stubbington MJT, Regev A, et al. The human cell atlas: from vision to reality. Nature. 2017;550(7677):451–453.
- [8] Tirosh I, Suva ML. Deciphering human tumor biology by single-cell expression profiling. Annual Review of Cancer Biology. 2019;3:151–166.
- [9] Zhong S, Zhang S, Fan X, et al. A single-cell RNA-seq survey of the developmental landscape of the human prefrontal cortex. Nature. 2018;555(7697):524–528.
- [10] Patel AP, Tirosh I, Trombetta JJ, et al. Single-cell RNA-seq highlights intratumoral heterogeneity in primary glioblastoma. Science. 2014;344(6190):1396–1401.
- [11] Tirosh I, Izar B, Prakadan SM, et al. Dissecting the multicellular ecosystem of metastatic melanoma by single-cell RNA-seq. Science. 2016;352(6282):189–196.
- [12] Deng Q, Ramsköld D, Reinius B, et al. Single-cell RNA-seq reveals dynamic, random monoallelic gene expression in mammalian cells. Science. 2014;343(6167):193–196.
- [13] Guo G, Huss M, Tong GQ, et al. Resolution of cell fate decisions revealed by single-cell gene expression analysis from zygote to blastocyst. Dev Cell. 2010;18(4):675–685.
- [14] Pollen AA, Nowakowski TJ, Shuga J, et al. Low-coverage single-cell mRNA sequencing reveals cellular heterogeneity and activated signaling pathways in developing cerebral cortex. Nat Biotechnol. 2014;32(10):1053–1058.
- [15] Gong W, Rasmussen TL, Singh BN, et al. Dpath software reveals hierarchical haemato-endothelial lineages of Etv2 progenitors based on single-cell transcriptome analysis. Nat Commun. 2017;8(1):14362.
- [16] Tang F, Barbacioru C, Bao S, et al. Tracing the derivation of embryonic stem cells from the inner cell mass by single-cell RNA-Seq analysis. Cell Stem Cell. 2010;6(5):468–478.
- [17] Yan L, Yang M, Guo H, et al. Single-cell RNA-Seq profiling of human preimplantation embryos and embryonic stem cells. Nat Struct Mol Biol. 2013;20(9):1131–1139.
- [18] Biase FH, Cao X, Zhong S. Cell fate inclination within 2-cell and 4-cell mouse embryos revealed by single-cell RNA sequencing. Genome Res. 2014;24(11):1787–1796.
- [19] Kharchenko PV, Silberstein L, Scadden DT. Bayesian approach to single-cell differential expression analysis. Nat Methods. 2014;11(7):740–742.
- [20] Huang M, Wang J, Torre E, et al. SAVER: gene expression recovery for single-cell RNA sequencing. Nat Methods. 2018;15(7):539–542.
- [21] Grün D, Kester L, van Oudenaarden A. Validation of noise models for single-cell transcriptomics. Nat Methods. 2014;11(6):637–640.
- [22] Mongia A, Sengupta D, Majumdar A. McImpute: matrix completion based imputation for single cell RNA-seq data. Front Genet. 2019;10:9.
- [23] van Dijk D, Sharma R, Nainys J, et al. Recovering gene interactions from single-cell data using data diffusion. Cell. 2018;174(3):716–729.e727.
- [24] Li WV, Li JJ. An accurate and robust imputation method scImpute for single-cell RNA-seq data. Nat Commun. 2018;9(1):997.
- [25] Chen M, Zhou X. VIPER: variability-preserving imputation for accurate gene expression recovery in single-cell RNA sequencing studies. Genome Biol. 2018;19(1):196.
- [26] Zhu K, Anastassiou D. 2DImpute: imputation in single-cell RNA-seq data from correlations in two dimensions. Bioinformatics. 2020;36(11):3588–3589.
- [27] Linderman GC, Zhao J, Kluger Y. Zero-preserving imputation of scRNA-seq data using low-rank approximation. bioRxiv. 2018;397588. DOI: 10.1101/397588.
- [28] Candès EJ, Recht B. Exact matrix completion via convex optimization. Foundations of Computational Mathematics. 2009;9(6):717.
- [29] Candès EJ, Tao T. The power of convex relaxation: near-optimal matrix completion. IEEE Transactions on Information Theory. 2010;56(5):2053–2080.
- [30] Bertsekas D. Constrained optimization and Lagrange multiplier methods. Academic Press, New York, USA, 1982.
- [31] Bertsekas D. Nonlinear programming (2nd edn). Athena Scientific, Belmont, Massachusetts, 1999.
- [32] Cai JF, Candès EJ, Shen Z. A singular value thresholding algorithm for matrix completion. SIAM J Optim. 2010;20(4):1956–1982.
- [33] Usoskin D, Furlan A, Islam S, et al. Unbiased classification of sensory neuron types by large-scale single-cell RNA sequencing. Nat Neurosci. 2015;18(1):145–153.
- [34] Zheng GX, Terry JM, Belgrader P, et al. Massively parallel digital transcriptional profiling of single cells. Nat Commun. 2017;8(1):14049.
- [35] Azizi E, Carr AJ, Plitas G, et al. Single-cell map of diverse immune phenotypes in the breast tumor microenvironment. Cell. 2018;174(5):1293–1308.e36.
- [36] Jerby-Arnon L, Shah P, Cuoco MS, et al. A cancer cell program promotes T cell exclusion and resistance to checkpoint blockade. Cell. 2018;175(4):984–997.e24.
- [37] Butler A, Hoffman P, Smibert P, et al. Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nat Biotechnol. 2018;36(5):411–420.
- [38] Poli AA, Cirillo MC. On the use of the normalized mean square error in evaluating dispersion model performance. Atmospheric Environment. Part A. General Topics. 1993;27(15):2427–2434.
- [39] Bennett J, Lanning S. The Netflix prize. In: Proceedings of KDD Cup and Workshop 2007, California, USA, 2007.
- [40] Dass SC, Nair VN. Edge detection, spatial smoothing, and image reconstruction with partially observed multivariate data. J Am Stat Assoc. 2003;98(461):77–89.
- [41] Faubel F, McDonough J, Klakow D. Bounded conditional mean imputation with Gaussian mixture models: A reconstruction approach to partly occluded features. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 2009, Taiwan, China. 3869–3872.
- [42] Rulloni V, Bustos O, Flesia A. Large gap imputation in remote sensed imagery of the environment. Computational Statistics & Data Analysis. 2012;56(8):2388–2403.
- [43] Silver M, Chen P, Li R, et al. Pathways-driven sparse regression identifies pathways and genes associated with high-density lipoprotein cholesterol in two Asian cohorts. PLoS Genet. 2013;9(11):e1003939.
- [44] Gill R, Datta S, Datta S. A statistical framework for differential network analysis from microarray data. BMC Bioinformatics. 2010;11(1):95.
- [45] Xiong M, Feghali-Bostwick CA, Arnett FC, et al. A systems biology approach to genetic studies of complex diseases. FEBS Lett. 2005;579(24):5325–5332.
- [46] Ye G, Tang M, Cai JF, et al. Low-rank regularization for learning gene expression programs. PLoS One. 2013;8(12):e82146.
- [47] Kapur A, Marwah K, Alterovitz G. Gene expression prediction using low-rank matrix completion. BMC Bioinformatics. 2016;17(1):243.
- [48] Ma S, Goldfarb D, Chen L. Fixed point and Bregman iterative methods for matrix rank minimization. Math Program. 2011;128(1–2):321–353.
- [49] Toh KC, Yun S. An accelerated proximal gradient algorithm for nuclear norm regularized linear least squares problems. Pacific Journal of Optimization. 2010;6:615–640.
- [50] Prabhakaran S, Azizi E, Carr A, et al. Dirichlet process mixture model for correcting technical variation in single-cell gene expression data. JMLR Workshop and Conference Proceedings. 2016;48:1070–1079.
- [51] Tang W, Bertaux F, Thomas P, et al. bayNorm: Bayesian gene expression recovery, imputation and normalization for single-cell RNA-sequencing data. Bioinformatics. 2020;36:1174–1181.
- [52] Chi EC, Zhou H, Chen GK, et al. Genotype imputation via matrix completion. Genome Res. 2013;23(3):509–518.
- [53] Jiang R, Li W, Li J. mbImpute: an accurate and robust imputation method for microbiome data. Genome Biol. 2021;22:192.