Abstract
Background
Single-cell RNA sequencing (scRNA-seq) technology enables an in-depth understanding of cellular transcriptome heterogeneity and dynamics. However, a key challenge in scRNA-seq analysis is the occurrence of dropout events, wherein certain expressed transcripts remain undetected. Dropouts seriously affect the accuracy and reliability of downstream analysis. Therefore, there is an urgent need for an effective imputation method that can accurately impute the missing values and mitigate their adverse effects on scRNA-seq analysis.
Methods
We propose a bidirectional autoencoder-based model (BiAEImpute) for dropout imputation in scRNA-seq datasets. The model employs row-wise and column-wise autoencoders to learn cellular and genetic features, respectively, during the training phase. These learned features are then integrated to impute missing values, which improves the robustness and accuracy of the imputation process.
Results
Evaluations conducted on four real scRNA-seq datasets consistently indicate that BiAEImpute exhibits superior performance compared to existing imputation methods. BiAEImpute adeptly restores missing values, facilitates the clustering of cell subpopulations, refines the identification of marker genes, and aids the inference of cell developmental trajectory.
Conclusion
BiAEImpute proves to be efficacious and resilient in the imputation of missing data in scRNA-seq, contributing to enhanced accuracy in downstream analyses. The source code of BiAEImpute is available at https://github.com/LiuXinyuan6/BiAEImpute.
Supplementary Information
The online version contains supplementary material available at 10.1186/s12864-025-11988-x.
Keywords: Single-Cell RNA-seq data, Imputation, Dropout events, Bidirectional autoencoder
Introduction
Single-cell RNA sequencing (scRNA-seq) is a revolutionary technique that characterizes the transcriptome of individual cells on a genome-wide scale [1]. This technique enables the exploration of cellular heterogeneity and dynamics, providing cell type-specific gene expression signals and progression trajectories that are unattainable through traditional bulk RNA-seq approaches [2–4]. Currently, scRNA-seq has emerged as a potent tool for biomedical research and finds extensive applications in fields such as cancer biology, immunotherapy, and pharmaceutical development [5–7]. However, a significant challenge associated with scRNA-seq data analysis is the occurrence of dropout events, wherein certain expressed transcripts are erroneously recorded as zero due to low RNA capture efficiencies [8, 9]. The presence of missing values in gene expression data poses a considerable obstacle for downstream analysis of scRNA-seq datasets [10, 11]. Therefore, the development of effective methods for imputing these missing values is particularly important.
In recent years, a variety of imputation methods have emerged, which fall into four main groups. The first category employs smoothing or data diffusion to recover missing values. For example, MAGIC [12] utilizes Markov matrices constructed from cellular affinities to diffuse data, propagating information across similar cells to impute missing values. However, determining the diffusion time accurately poses a challenge for MAGIC, potentially leading to an overestimation of missing values. DrImpute [13] utilizes clustering methods for imputation, averaging gene expression values of similar cells identified through clustering, and obtaining final estimates by averaging multiple clustered estimates. However, the necessity of iterative clustering and averaging renders the DrImpute imputation process time-consuming and resource-intensive.
The second category applies probabilistic models to directly impute the missing values in the data. For example, scImpute [14] employs a Gamma-Normal mixture model to estimate the dropout probability of each gene within each cell, selecting similar cells based on genes with low dropout probabilities for imputation. bayNorm [15] employs a binomial model of the mRNA capture mechanism and infers priors via an empirical Bayes approach, utilizing cross-cellular expression data to estimate gene expression levels. However, these methods mostly rely on assumptions derived from cell-to-cell relationships, potentially limiting performance in scenarios involving fewer cell types.
The third category employs matrix decomposition techniques for imputation. ALRA [16] utilizes Singular Value Decomposition (SVD) to impute zeros in the expression matrix, leveraging the non-negative nature of the expression matrix and its intrinsic correlation structure. Nevertheless, ALRA typically captures only linear relationships within the original expression matrix.
The fourth category employs deep learning models for imputation. For example, deepImpute [17] employs a divide-and-conquer strategy to train sub-neural networks, utilizing dropout layers and loss functions to identify patterns in the data and impute missing values. However, deepImpute primarily focuses on learning gene similarities, potentially inadequately capturing biological distinctions at the cellular level. GE-Impute [18] utilizes Euclidean distances to construct the original cell-to-cell similarity network and generates embedding matrices for each cell by simulating fixed-length random walks, subsequently reconstructing the similarity network to estimate missing data for each cell by averaging expression values of neighboring cells. Nonetheless, GE-Impute’s reliance on average expression values of neighboring cells may not fully capture all biological variations. To provide a clear comparison of these methods, we summarized their main characteristics (whether they utilize cell-wise modeling, gene-wise modeling, or explicitly focus on zero values) in Supplementary Table 1.
The autoencoder is an unsupervised neural network architecture [19] that is widely used for feature dimensionality reduction [20] and data imputation [21]. In this study, we propose an imputation model tailored for scRNA-seq data, called BiAEImpute, built on a bidirectional autoencoder architecture. BiAEImpute captures both cellular and genetic features of the input data during the training phase, enabling accurate prediction of missing data. Notably, BiAEImpute targets only zero values for imputation while endeavoring to retain non-zero values wherever possible. This differs from approaches such as MAGIC, which estimate the expression levels of all genes within a cell, even those unaffected by dropout events.
Our BiAEImpute mitigates the potential introduction of additional bias intrinsic to such methods. Moreover, BiAEImpute systematically considers both potential cell-to-cell and gene-to-gene associations. This comprehensive strategy aims to preserve biological variation among cells and genes, thereby significantly enhancing downstream analytical results. In summary, our model offers the following contributions:
Devises a bidirectional autoencoder-based imputation method for precise estimation of missing values in scRNA-seq data, effectively capturing cellular and genetic features.
Concentrates on estimating zero values rather than the expression values of all genes within each cell, thus mitigating potential additional bias in the dataset.
Systematically evaluates the performance of BiAEImpute in the real datasets.
Table 1.
Summary of benchmark methods
| Model | Parameters | Implementation language |
|---|---|---|
| MAGIC | Default settings | Python |
| DrImpute | Default settings | R |
| scImpute | drop_thre = 0.5, Kcluster = cell clusters in each dataset, ncores = 1 | R |
| bayNorm | mean_version = TRUE | R |
| ALRA | Default settings | R |
| deepImpute | Default settings | Python |
| GE-Impute | Default settings | Python |
The source URLs for the benchmark methods are:
1. MAGIC: https://github.com/KrishnaswamyLab/magic
2. DrImpute: https://github.com/gongx030/DrImpute
3. scImpute: https://github.com/Vivianstats/scImpute/tree/master
4. bayNorm: https://github.com/WT215/bayNorm
5. ALRA: https://github.com/KlugerLab/ALRA
6. deepImpute: https://github.com/lanagarmire/DeepImpute
7. GE-Impute: https://github.com/wxbCaterpillar/GE-Impute
Materials and methods
Datasets and preprocessing
To evaluate the performance of BiAEImpute, we selected four publicly available real datasets, each derived from distinct tissue sources: Zeisel [22], Romanov [23], Usoskin [24], and Klein [25]. The Zeisel (GSE60361) dataset contains 19,972 genes and 3005 cells, representing seven cell types from cerebral cortex tissue (interneurons, pyramidal SS, pyramidal CA1, oligodendrocytes, microglia, endothelial-mural, astrocytes-ependymal). The Romanov (GSE74672) dataset consists of 24,341 genes and 2881 cells across seven cell types from hypothalamus tissue (oligodendrocytes, astrocytes, ependymal, microglia, vsm, endothelial, neurons). The Usoskin (GSE59739) dataset has 25,334 genes and 622 cells, representing four cell types from lumbar dorsal root ganglion tissue (NF, NP, PEP, TH). The Klein (GSE65525) dataset is a longitudinal dataset containing 24,175 genes and 2717 cells from embryonic stem cells, sampled at four time points (day 0, day 2, day 4, day 7). To mitigate technical biases introduced during sequencing, we applied max-min normalization preprocessing on the raw datasets. The normalization procedure was conducted as follows:
$$\tilde{x}_{ij} = \frac{x_{ij} - \min(x_{i})}{\max(x_{i}) - \min(x_{i})} \tag{1}$$
where $X$ represents the original gene expression matrix comprising $m$ rows (individual cells) and $n$ columns (individual genes), $x_{ij}$ denotes the original expression value of gene $j$ in cell $i$ within the matrix $X$, $\min(x_{i})$ denotes the minimum gene expression value of cell $i$, $\max(x_{i})$ denotes the maximum gene expression value of cell $i$, and $\tilde{X} = (\tilde{x}_{ij})$ denotes the normalized gene expression matrix.
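The per-cell normalization described above can be illustrated with a short sketch. This is not the authors' code: the function name `min_max_normalize_rows` and the handling of constant rows are assumptions made for illustration, and plain Python lists stand in for the expression matrix.

```python
def min_max_normalize_rows(matrix):
    """Scale each row (cell) of a cells-by-genes matrix to the range [0, 1]."""
    normalized = []
    for row in matrix:
        lo, hi = min(row), max(row)
        span = hi - lo
        if span == 0:
            # Assumption: a constant row is mapped to zeros to avoid division by zero.
            normalized.append([0.0 for _ in row])
        else:
            normalized.append([(value - lo) / span for value in row])
    return normalized


counts = [
    [0, 2, 4],   # cell 1
    [1, 1, 1],   # cell 2 (constant row)
    [0, 0, 10],  # cell 3
]
scaled = min_max_normalize_rows(counts)
```

Because the minimum of a sparse cell is usually zero, zeros in the raw matrix typically remain zeros after scaling, so dropout candidates are still identifiable.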
Model architecture
The bidirectional autoencoder consists of a row-wise autoencoder and a column-wise autoencoder, and incorporates three penalty functions. The row-wise autoencoder is responsible for learning relationships between cells, while the column-wise autoencoder focuses on learning relationships between genes. Each autoencoder has its own penalty function for learning and tuning. In the final layer of the model, the two autoencoders are interconnected, and an additional penalty function optimizes the consistency of their outputs at the intersection points. This architecture facilitates efficient sharing and transfer of information. Ultimately, the predicted value for the missing data is computed by averaging the outputs from the two autoencoders. The imputed data can then be used in various downstream analyses. The structure of BiAEImpute is presented in Fig. 1.
Fig. 1.
Overall workflow of BiAEImpute. The dropout-events data are generated by emulating dropout events on the real expression matrix. At the final layer of the autoencoders, these data are interlinked to minimize disparities in gene expression values at their intersection points. Ultimately, the average of the predicted value from both row-wise and column-wise autoencoders is used as the final prediction
Training process
The training process of the model can be roughly segmented into the following stages, each aimed at achieving an optimal imputation of missing data. During the initial feature compression stage, the row-wise and column-wise autoencoders are trained on the row and column vectors extracted from the matrix $\tilde{X}$, respectively. In the subsequent data reconstruction stage, these autoencoders utilize column nesting and row nesting, in conjunction with nonlinear transformations, to generate two reconstructed matrices, denoted as $\hat{X}^{r}$ and $\hat{X}^{c}$. These reconstructed matrices embody the learned column-column and row-row relationships, facilitating the imputation of missing data. Subsequently, the reconstructed matrices $\hat{X}^{r}$ and $\hat{X}^{c}$ are combined to draw on both row-row and column-column relationships: the final imputed matrix is obtained by averaging $\hat{X}^{r}$ and $\hat{X}^{c}$. In the model optimization stage, three loss functions are defined to refine the model’s performance.
Feature compression
Individual cell rows are fed into the row-wise autoencoder, while individual gene columns are input into the column-wise autoencoder. Subsequently, the row-wise encoder and column-wise encoder map the inputs onto their respective latent spaces:
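$$h^{r}_{i} = \sigma\left(W^{r}_{2}\,\sigma\left(W^{r}_{1}\,\tilde{x}_{i:}\right)\right), \qquad h^{c}_{j} = \sigma\left(W^{c}_{2}\,\sigma\left(W^{c}_{1}\,\tilde{x}_{:j}\right)\right) \tag{2}$$
where $\tilde{x}_{i:}$ represents the normalized expression vector of cell $i$, $\tilde{x}_{:j}$ represents the normalized expression vector of gene $j$, $W^{r}_{1}$ and $W^{r}_{2}$ denote the weight parameters of the two fully-connected layers of the row-wise encoder, $W^{c}_{1}$ and $W^{c}_{2}$ denote the weight parameters of the two fully-connected layers of the column-wise encoder, $\sigma$ denotes the nonlinear activation function, and $h^{r}_{i}$ and $h^{c}_{j}$ represent the outputs of the row-wise encoder and column-wise encoder, respectively.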
Data reconstruction
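The row-wise decoder and column-wise decoder respectively map the data ($h^{r}_{i}$ and $h^{c}_{j}$) from their unique latent spaces back to their input spaces:
$$\hat{x}^{r}_{i:} = \sigma\left(W^{r}_{4}\,\sigma\left(W^{r}_{3}\,h^{r}_{i}\right)\right), \qquad \hat{x}^{c}_{:j} = \sigma\left(W^{c}_{4}\,\sigma\left(W^{c}_{3}\,h^{c}_{j}\right)\right) \tag{3}$$
where $W^{r}_{3}$ and $W^{r}_{4}$ denote the weight parameters of the two fully connected layers of the row-wise decoder, $W^{c}_{3}$ and $W^{c}_{4}$ are the weight parameters of the two fully connected layers of the column-wise decoder, $\hat{x}^{r}_{i:}$ represents the reconstructed expression vector of cell $i$, and $\hat{x}^{c}_{:j}$ denotes the reconstructed expression vector of gene $j$.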
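The compression and reconstruction steps can be sketched as a forward pass through two small autoencoders, one applied to rows (cells) and one to columns (genes). This is a minimal illustration under stated assumptions, not the published implementation: the class `TinyAutoencoder`, the sigmoid activation, the absence of bias terms, the layer sizes, and the random weights are all hypothetical, and no training is performed.

```python
import math
import random


def sigmoid(vector):
    """Element-wise logistic activation (an assumed choice of nonlinearity)."""
    return [1.0 / (1.0 + math.exp(-v)) for v in vector]


def dense(weights, vector):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def random_weights(n_out, n_in, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]


class TinyAutoencoder:
    """Two-layer encoder and two-layer decoder, bias terms omitted."""

    def __init__(self, n_input, n_hidden, n_latent, rng):
        self.w1 = random_weights(n_hidden, n_input, rng)
        self.w2 = random_weights(n_latent, n_hidden, rng)
        self.w3 = random_weights(n_hidden, n_latent, rng)
        self.w4 = random_weights(n_input, n_hidden, rng)

    def encode(self, x):
        return sigmoid(dense(self.w2, sigmoid(dense(self.w1, x))))

    def decode(self, h):
        return sigmoid(dense(self.w4, sigmoid(dense(self.w3, h))))


rng = random.Random(0)
X = [[0.0, 0.5, 1.0, 0.2],
     [1.0, 0.0, 0.3, 0.0],
     [0.4, 0.4, 0.0, 1.0]]  # 3 cells x 4 genes, already scaled to [0, 1]
n_cells, n_genes = len(X), len(X[0])

row_ae = TinyAutoencoder(n_input=n_genes, n_hidden=3, n_latent=2, rng=rng)
col_ae = TinyAutoencoder(n_input=n_cells, n_hidden=3, n_latent=2, rng=rng)

# Row-wise reconstruction: each cell (row) is encoded and decoded.
X_row = [row_ae.decode(row_ae.encode(row)) for row in X]

# Column-wise reconstruction: each gene (column) is encoded and decoded,
# then the result is transposed back to cells x genes.
columns = [list(col) for col in zip(*X)]
X_col = [list(r) for r in zip(*[col_ae.decode(col_ae.encode(c)) for c in columns])]
```

Both reconstructions have the same shape as the input, so every entry receives one estimate from the cell view and one from the gene view.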
Model optimization
We define three distinct loss functions tailored to the specific tasks of the row-wise and column-wise autoencoders. The first loss function, denoted as $L_{r}$, is specifically designed for the row-wise autoencoder, while the second loss function, denoted as $L_{c}$, corresponds to the column-wise autoencoder. Both $L_{r}$ and $L_{c}$ are dedicated to estimating the real gene expression values within the dataset. Moreover, a third loss function, denoted as $L_{rc}$, coordinates the operations of the two autoencoders. As both autoencoders operate on the same input matrix, $L_{rc}$ is introduced to keep the output matrices of the row-wise and column-wise autoencoders consistent, so that their values agree at the intersecting points. The model parameters are optimized by minimizing these loss functions ($L_{r}$, $L_{c}$, and $L_{rc}$). This iterative optimization refines the autoencoders so that they accurately capture and reconstruct the latent patterns underlying the input data.
$$L_{r} = \sum_{i \in I} \sum_{j=1}^{n} M_{ij}\left(\tilde{x}_{ij} - \hat{x}^{r}_{ij}\right)^{2}, \qquad L_{c} = \sum_{j \in J} \sum_{i=1}^{m} M_{ij}\left(\tilde{x}_{ij} - \hat{x}^{c}_{ij}\right)^{2} \tag{4}$$
$$L_{rc} = \sum_{i \in I} \sum_{j \in J} \left(\hat{x}^{r}_{ij} - \hat{x}^{c}_{ij}\right)^{2} \tag{5}$$
where $M$ represents the mask matrix corresponding to $\tilde{X}$, and $\hat{x}^{r}_{ij}$ and $\hat{x}^{c}_{ij}$ represent the reconstructed expression values of gene $j$ in cell $i$ by the row-wise and column-wise autoencoders, respectively. The index sets $I$ and $J$ denote the mini-batch of cells and genes, respectively, which correspond to the cell-delineated and gene-delineated mini-batch data illustrated in Fig. 1.
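The three loss terms can be illustrated numerically. The sketch below assumes a masked sum of squared errors for the two reconstruction losses and a plain sum of squared differences for the consistency loss. The exact weighting and normalization used by the authors are not recoverable from the text, and the helper names `masked_sse` and `consistency_sse` are hypothetical.

```python
def masked_sse(target, prediction, mask):
    """Sum of squared errors over entries where mask == 1."""
    total = 0.0
    for t_row, p_row, m_row in zip(target, prediction, mask):
        for t, p, m in zip(t_row, p_row, m_row):
            total += m * (t - p) ** 2
    return total


def consistency_sse(pred_row, pred_col, cell_idx, gene_idx):
    """Squared disagreement between the two reconstructions on a mini-batch block."""
    return sum((pred_row[i][j] - pred_col[i][j]) ** 2
               for i in cell_idx for j in gene_idx)


X = [[1.0, 0.0], [0.5, 2.0]]      # toy input (2 cells x 2 genes)
M = [[1, 0], [1, 1]]              # 1 = entry contributes to the reconstruction loss
X_row = [[0.5, 0.3], [0.5, 1.0]]  # toy output of the row-wise autoencoder
X_col = [[1.0, 0.1], [0.0, 2.0]]  # toy output of the column-wise autoencoder

loss_r = masked_sse(X, X_row, M)                        # 0.25 + 1.0
loss_c = masked_sse(X, X_col, M)                        # 0.25
loss_rc = consistency_sse(X_row, X_col, [0, 1], [0, 1])
total_loss = loss_r + loss_c + loss_rc  # assumption: unweighted sum of the three terms
```

The consistency term is computed on all entries of the mini-batch block, including masked ones, because it compares the two predictions with each other rather than with the input.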
Imputation process
Following the completion of model training, the matrix $\tilde{X}$, containing missing events, is reintroduced into the model to obtain the imputed matrix $\hat{X}$. It is imperative to emphasize that our imputation targets are those zeros in the masked gene expression matrix $\tilde{X}$, as non-zero entries are presumed to represent authentic gene expression values. The final imputed value $\hat{x}_{ij}$ is determined through the following equation:
$$\hat{x}_{ij} = \begin{cases} \tilde{x}_{ij}, & \tilde{x}_{ij} \neq 0 \\ \dfrac{\hat{x}^{r}_{ij} + \hat{x}^{c}_{ij}}{2}, & \tilde{x}_{ij} = 0 \end{cases} \tag{6}$$
where $\hat{x}_{ij}$ represents the final imputed value of gene $j$ in cell $i$.
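The rule that only zeros are imputed, using the average of the two autoencoder outputs, can be sketched as follows. The function name `combine_imputation` and the toy matrices are illustrative assumptions rather than code from the BiAEImpute repository.

```python
def combine_imputation(observed, pred_row, pred_col):
    """Keep non-zero observations; fill zeros with the mean of the two predictions."""
    imputed = []
    for o_row, r_row, c_row in zip(observed, pred_row, pred_col):
        imputed.append([
            o if o != 0 else (r + c) / 2.0
            for o, r, c in zip(o_row, r_row, c_row)
        ])
    return imputed


observed = [[0.0, 0.8], [0.3, 0.0]]    # zeros are the imputation targets
pred_row = [[0.25, 0.7], [0.1, 0.5]]   # toy row-wise reconstruction
pred_col = [[0.75, 0.9], [0.5, 0.25]]  # toy column-wise reconstruction

final = combine_imputation(observed, pred_row, pred_col)
```

Here the observed values 0.8 and 0.3 pass through unchanged even though both autoencoders predicted something different, which is the property that distinguishes this scheme from methods that re-estimate every entry.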
Experimental details
Benchmark methods
We selected seven well-established models known for their outstanding performance to serve as benchmarks in our experiments. These models encompass various categories, including smooth similarity-based imputation models such as MAGIC [12] and DrImpute [13], probabilistic-based imputation models such as scImpute [14] and bayNorm [15], low-rank matrix-based imputation models like ALRA [16], and deep learning-based imputation models such as deepImpute [17] and GE-Impute [18]. For each benchmark method, the parameters were configured following their respective guided tutorials. This information is summarized in Table 1.
Experimental settings
In our experimental setup, we implement the framework with PyTorch and adopt the backpropagation algorithm and gradient descent to minimize the loss function. We use the Adam optimizer for parameter tuning, owing to its efficiency and effectiveness with sparse data. The parameters of our BiAEImpute model are configured as outlined in Table 2, following the settings reported in prior studies [26].
Table 2.
BiAEImpute model configuration parameters
| Parameter | Hidden layers | Neurons per layer | Learning rate | Epochs | GPU |
|---|---|---|---|---|---|
| Value | 2 | 128 | | 500 | NVIDIA 4060 |
In addition, the gene expression values from the four real datasets (Zeisel, Romanov, Usoskin, and Klein) were treated as ground-truth data. To simulate the dropout events commonly observed in scRNA-seq data, we introduced synthetic missing data, referred to as non-imputed data, under the MCAR (missing completely at random) mechanism, in which 20%, 40%, and 60% of the non-zero expression values were randomly masked from the ground-truth data. For the simulated datasets, we generated scRNA-seq data without inherent dropout using the Splatter R package [27]. We then introduced three distinct types of missingness: MCAR, MAR (missing at random), and MNAR (missing not at random). The detailed implementation of the three missingness mechanisms is as follows:
For MCAR, we first identified all non-zero entries in the gene expression matrix and then randomly selected a proportion of them according to a predefined masking ratio. The selected values were then replaced with zeros using the dropout function defined in datasets.py to mimic dropout events independent of gene expression or cell group characteristics. For MAR, we utilized Splatter’s ‘group’-based dropout simulation. The core idea is to make missingness dependent on group membership, with certain cell groups having higher dropout rates. This implementation is found in the second section of the ‘simulate_data.R’ script titled ‘MAR’. We defined five cell groups with different proportions and configured the dropout parameters (dropout.mid) to vary across these groups. For MNAR, we adopted Splatter’s ‘experiment’-wide dropout setting, in which the dropout probability is a function of the gene expression level itself (low-expression genes are more likely to be dropped), independent of group membership. This is implemented in the third section of the ‘simulate_data.R’ script titled ‘MNAR’. All groups share the same dropout parameters.
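The MCAR masking step can be sketched in a few lines. This is not the dropout function from the repository's datasets.py; `mask_mcar`, its `seed` argument, and the rounding of the number of masked entries are assumptions made for illustration.

```python
import random


def mask_mcar(matrix, ratio, seed=0):
    """Set a random fraction of the non-zero entries to zero.

    Returns the masked matrix and the set of (row, column) positions that were masked.
    """
    rng = random.Random(seed)
    nonzero = [(i, j)
               for i, row in enumerate(matrix)
               for j, value in enumerate(row)
               if value != 0]
    n_mask = round(ratio * len(nonzero))
    chosen = set(rng.sample(nonzero, n_mask))
    masked = [[0 if (i, j) in chosen else value for j, value in enumerate(row)]
              for i, row in enumerate(matrix)]
    return masked, chosen


truth = [[3, 0, 1, 2],
         [0, 5, 4, 0],
         [7, 1, 0, 2],
         [0, 0, 6, 9]]  # 10 non-zero entries

masked, positions = mask_mcar(truth, ratio=0.4)
```

Keeping the masked positions makes it possible to score an imputation method only on entries whose true values are known.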
Subsequently, we applied our proposed BiAEImpute along with other benchmark methods to estimate the missing values. Following the imputation process, we conducted a series of experiments and downstream analyses to assess the performance of our BiAEImpute imputation method compared to these benchmark methods. Specifically, we evaluated the efficacy of each method in accurately recovering the missing values and preserving the underlying biological information encoded in the scRNA-seq datasets.
Evaluation methods and metrics
To evaluate the imputation performance of our BiAEImpute and other benchmark methods, we conducted the following experiments, including quantitative metrics, clustering analysis, marker gene identification analysis, and trajectory analysis.
Quantitative metrics
The disparity between the imputed data and the ground-truth data was assessed through three quantitative metrics: Pearson Correlation Coefficient (PCC), Coefficient of Determination (R²), and Root Mean Square Error (RMSE). PCC measures the linear correlation between the two matrices, with a higher value indicating closer agreement between them. R² quantifies the extent to which a model can explain the variability within the data. A higher value of R² signifies that the model captures a larger portion of the data’s variance, thereby indicating a stronger concordance between the imputed data and the ground-truth data. RMSE, on the other hand, quantifies the discrepancy between individual elements of the two matrices. A smaller RMSE value denotes a reduced disparity between the datasets. The mathematical formulations for each of these three metrics are given in Supplementary Data 1.
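As a concrete reference, the three metrics can be computed on flattened matrices as sketched below. The formulas used here are the standard textbook definitions and are assumed to match those in Supplementary Data 1. The function names are hypothetical.

```python
import math


def flatten(matrix):
    return [value for row in matrix for value in row]


def rmse(truth, pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(truth, pred)) / len(truth))


def pearson(truth, pred):
    mean_t = sum(truth) / len(truth)
    mean_p = sum(pred) / len(pred)
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(truth, pred))
    var_t = sum((t - mean_t) ** 2 for t in truth)
    var_p = sum((p - mean_p) ** 2 for p in pred)
    return cov / math.sqrt(var_t * var_p)


def r_squared(truth, pred):
    mean_t = sum(truth) / len(truth)
    ss_res = sum((t - p) ** 2 for t, p in zip(truth, pred))
    ss_tot = sum((t - mean_t) ** 2 for t in truth)
    return 1.0 - ss_res / ss_tot


ground_truth = [[1.0, 2.0], [3.0, 4.0]]
imputed = [[1.0, 2.0], [3.0, 5.0]]

t, p = flatten(ground_truth), flatten(imputed)
rmse_value = rmse(t, p)     # one entry is off by 1, so sqrt(1 / 4)
r2_value = r_squared(t, p)  # 1 - 1 / 5
pcc_value = pearson(t, p)
```

Note that PCC is insensitive to a constant offset or rescaling of the imputed values, whereas R² and RMSE penalize both, which is why the three metrics are reported together.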
Clustering analysis
To evaluate whether BiAEImpute outperforms other benchmark methods in cluster analysis, the Seurat tool [28] was employed for clustering, and three clustering metrics were selected to measure the agreement between the clustering results derived from the imputed data and the ground-truth cell types. The chosen metrics are Adjusted Rand Index (ARI), Normalized Mutual Information (NMI), and Purity. ARI quantifies the consistency between the clustering results and the real categories, with a higher value indicating stronger consistency between the two. NMI quantifies the amount of information shared between the clustering results and the real categories. A higher NMI value means a greater degree of shared information. Purity measures the cohesiveness of each cluster within the clustering results, whereby a higher purity implies that each cluster is dominated by a single real category. The mathematical formulations for these three metrics can be found in Supplementary Data 2.
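Of the three clustering metrics, Purity is the simplest to state in code. The sketch below uses the common definition (the fraction of cells that belong to the majority true type of their cluster), which is assumed to match Supplementary Data 2. The labels are toy values.

```python
from collections import Counter, defaultdict


def purity(true_labels, cluster_labels):
    """Fraction of items that share the majority true label of their cluster."""
    members = defaultdict(list)
    for true_label, cluster in zip(true_labels, cluster_labels):
        members[cluster].append(true_label)
    majority_total = sum(Counter(labels).most_common(1)[0][1]
                         for labels in members.values())
    return majority_total / len(true_labels)


cell_types = ["a", "a", "a", "b", "b", "c"]
clusters = [0, 0, 1, 1, 1, 2]

score = purity(cell_types, clusters)  # majority counts 2 + 2 + 1 out of 6 cells
```

Purity rises trivially as the number of clusters grows, which is one reason it is reported alongside ARI and NMI rather than on its own.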
Marker gene identification analysis
To evaluate the capability of BiAEImpute in identifying marker genes, the FindMarkers function from Seurat tool was utilized to identify marker genes for each cell type within the Zeisel dataset. The marker genes identified within the ground-truth data served as the gold standard for comparison, whereby the difference between the imputed data and the dropout data was measured relative to this standard.
Trajectory analysis
To evaluate the capability of BiAEImpute in inferring cell developmental trajectories, the infer_trajectory function in the SCORPIUS tool [29] was applied to the data imputed by each method. Subsequently, the draw_trajectory_plot function from SCORPIUS was used to plot the inferred trajectories. To quantitatively evaluate the quality of the reconstructed trajectories, Kendall’s Rank Correlation Score (KRCS) was computed between the true time labels and the predicted pseudo-times. KRCS ranges from −1 to 1, and a value closer to 1 indicates a greater similarity between the inferred cell ordering and the true temporal sequence. The mathematical formulation of KRCS is elaborated in Supplementary Data 3.
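A rank correlation between time labels and pseudo-times can be sketched as below. Because many cells share the same collection day, this sketch uses the tie-corrected tau-b variant of Kendall's coefficient. Whether the KRCS in Supplementary Data 3 uses this variant is an assumption, and the function name and toy data are hypothetical.

```python
import math
from itertools import combinations


def kendall_tau_b(x, y):
    """Kendall rank correlation with the tau-b correction for ties."""
    concordant = discordant = tied_x = tied_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0:
            tied_x += 1
        if dy == 0:
            tied_y += 1
        if dx * dy > 0:
            concordant += 1
        elif dx * dy < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    denominator = math.sqrt((n_pairs - tied_x) * (n_pairs - tied_y))
    return (concordant - discordant) / denominator


days = [0, 0, 2, 2, 4, 7]                     # true collection time of six cells
pseudotime = [0.1, 0.2, 0.3, 0.5, 0.4, 0.9]   # toy inferred ordering

tau = kendall_tau_b(days, pseudotime)  # 12 concordant pairs, 1 discordant pair
```

The pairwise loop is quadratic in the number of cells, which is acceptable for a sketch but would be replaced by a library routine for datasets with thousands of cells.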
Results
BiAEImpute accurately recovers the gene expression levels
To evaluate BiAEImpute’s performance in recovering missing values in scRNA-seq data, multiple sets of experiments were conducted and compared with seven benchmark methods. Dropout events were simulated by randomly masking 20%, 40%, and 60% of non-zero expression values in the Zeisel, Romanov, Usoskin and Klein datasets. Subsequently, three evaluation metrics (PCC, R², and RMSE) were employed to measure the accuracy of the estimated values, as depicted in Fig. 2 (Supplementary Tables 2–5).
Fig. 2.
Quantitative comparison on accurate recovery. The accuracy of recovering missing values was evaluated using three metrics: PCC (higher values are better), R² (higher values are better), and RMSE (lower values are better)
As shown in Fig. 2, BiAEImpute performs well across all missing rates, which can be attributed to its use of both cellular and genetic correlations during imputation. Although DrImpute slightly outperforms BiAEImpute on the Romanov dataset at a missing rate of 20%, potentially due to its use of multiple imputation, it lags notably behind BiAEImpute when the missing rate rises to 40% and 60%, making it less suitable for datasets with high missing rates. Furthermore, the results obtained from BiAEImpute, DrImpute, GE-Impute, and deepImpute are notably better than those from MAGIC, scImpute, ALRA, and bayNorm. For the three deep learning-based methods in the former group (BiAEImpute, GE-Impute, and deepImpute), this disparity may be attributed to their robust feature extraction capabilities. In conclusion, the values estimated by BiAEImpute closely resemble the real values, and the method is suitable for datasets with varying dropout rates.
To more comprehensively evaluate the model’s imputation capability under different missing data mechanisms, we further employed the Splatter simulation tool to generate three synthetic datasets corresponding to missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR) scenarios, as summarized in Supplementary Table 6. For each missingness type, we imposed dropout rates of 20%, 40%, and 60%, and evaluated performance using PCC, R², and RMSE. The results (Supplementary Tables 7–9) show that BiAEImpute consistently outperforms all baseline methods across all missing mechanisms and dropout rates, achieving the highest PCC and R², and the lowest RMSE. These findings underscore the model’s strong robustness, adaptability, and generalizability across diverse and challenging missing data scenarios.
BiAEImpute facilitates cell subpopulation identification and visualization
Clustering serves as a fundamental application in the downstream analysis of scRNA-seq data, enabling the identification and annotation of different cell types pivotal for understanding tissue development and disease progression. This section evaluates whether BiAEImpute improves unsupervised clustering results. For a comprehensive evaluation, we performed clustering on the Zeisel, Romanov, and Usoskin datasets with a simulated missing rate of 40%. Three widely used metrics, namely ARI, NMI, and Purity, were employed to assess clustering performance, as illustrated in Supplementary Fig. 1 (Supplementary Table 14). BiAEImpute consistently outperforms the other benchmark methods in terms of ARI, NMI, and Purity. Furthermore, the clustering results of the Zeisel dataset estimated by each imputation method were visualized using Uniform Manifold Approximation and Projection (UMAP) in Fig. 3, with each cell color-coded according to the ground-truth cell type.
Fig. 3.
Visualization of clustering results based on the Zeisel dataset. In each subfigure, the upper subplot depicts the clustering results, contrasting with the color-coded representation of real cell types in the lower subplot
Figure 3(a) demonstrates that clustering results of non-imputed data fail to adequately separate S1 pyramidal and oligodendrocyte cell populations, while also erroneously grouping S1 pyramidal with CA1 pyramidal cell populations. This underscores the significant impact of missing values on the clustering outcomes of raw data. Conversely, Fig. 3 (b) illustrates that the imputed data by BiAEImpute results in clear separation of seven distinct cell clusters, each precisely assigned to its respective cell type. Additionally, Supplementary Fig. 2 displays the top 10 marker genes for each cell type after imputation by BiAEImpute, revealing notable expressions of Gm11549 and Tbr1 genes in S1 Pyramidal cells compared to others. Spink8 and Fibcd1 genes exhibit high expression in CA1 Pyramidal cells, while F13a1 gene is notably expressed in Microglia cells, consistent with the findings in literature [22]. However, as shown in Fig. 3(c)−(i), clustering results are suboptimal with most methods misclassifying a subset of cells into incorrect cell types. For instance, MAGIC incorrectly classified CA1 pyramidal cells as oligodendrocytes, while ALRA misclassified some S1 pyramidal cells as CA1 pyramidal cells. Overall, BiAEImpute significantly improves the accuracy of clustering analysis in scRNA-seq data with dropout events. Through its imputation capabilities, BiAEImpute contributes to improved delineation of cell populations and more precise identification of marker genes associated with distinct cell types.
BiAEImpute enhances marker genes identification
In scRNA-seq data analysis, identifying marker genes for each cell subpopulation is essential for understanding cell type specificity and the underlying biological mechanisms of diseases. This knowledge can provide critical insights into disease states, cell development, and function. However, the occurrence of dropout events in scRNA-seq data may result in the loss of authentic cellular signaling, posing challenges in distinguishing between different cell types or states. To demonstrate the efficacy of BiAEImpute in marker gene identification, we analyzed marker genes expression for each cell subpopulation using clustering results from the Zeisel dataset, presenting the findings through violin plots in Fig. 4.
Fig. 4.
Visualization of marker genes expression for each cell subpopulation in the Zeisel dataset
As shown in Fig. 4, in both the ground-truth data and the BiAEImpute-imputed data, seven distinct cell clusters were identified and annotated. However, in the non-imputed data, only six cell clusters were observed because the microglia and oligodendrocyte cell types could not be accurately distinguished and were merged into a single cluster due to missing values. Canonical marker genes in each cell type exhibit high expression levels in the ground-truth data. However, these expression levels were markedly reduced in the non-imputed data due to the presence of dropout events. This reduction was particularly evident in CA1 pyramidal, S1 pyramidal, endothelial-mural, and interneuron cells. In contrast, following imputation with BiAEImpute, we observed a substantial increase in the expression levels of marker genes across different cell types, bringing them to levels comparable with those observed in the ground-truth data. In summary, BiAEImpute markedly improved the accuracy of marker gene identification, thereby enhancing the precision in characterizing cell subpopulations.
BiAEImpute aids cell trajectory inference
Trajectory inference constitutes a fundamental task in the downstream analysis of scRNA-seq data, playing a pivotal role in revealing the dynamic cellular development process. However, the presence of missing events can affect the accurate inference of cell pseudo-temporal trajectories. To assess whether BiAEImpute enhances the ability to infer cellular developmental trajectories, we conducted experiments using the Klein dataset, comprising cells at four stages (days 0, 2, 4, and 7). Detailed results are presented in Fig. 5 (Supplementary Table 15).
Fig. 5.
Visualization of cell developmental trajectory inference. Different colors denote distinct cellular stages, while lines depict developmental trajectories
As shown in Fig. 5, the non-imputed data failed to yield an accurate developmental trajectory. In contrast, the BiAEImpute-imputed data accurately recovered the pseudo-temporal ordering of cells. This indicates that dropout events have a significant impact on trajectory inference and that BiAEImpute imputes missing values accurately enough to improve the inference. Trajectories derived from the data imputed by the other benchmark methods, except scImpute, also aligned closely with the actual temporal stages of development, but were not as good as that of BiAEImpute, which achieved the highest Kendall’s rank correlation score (KRCS = 0.867) among all methods.
Runtime and memory consumption comparison
A robust and efficient imputation method should not only accurately impute missing values but also exhibit reasonable computational efficiency in terms of both runtime and memory usage. Fig. 6 (Supplementary Tables 16–17) presents a comparative analysis of the runtime and memory consumption of BiAEImpute (including the training and imputation phases of the autoencoders) against seven other imputation methods across four scRNA-seq datasets.
Fig. 6.
Comparison of Runtime and Memory Consumption Across Imputation Methods
As illustrated in Fig. 6 (Supplementary Tables 16–17), BiAEImpute offers a favorable balance between computational efficiency and imputation accuracy. While its runtime is slightly longer than that of MAGIC and GE-Impute, BiAEImpute consistently achieves higher imputation accuracy. In terms of memory consumption, BiAEImpute uses significantly less memory than methods such as DrImpute, ALRA, and DeepImpute, while remaining competitive with lighter methods like MAGIC and bayNorm. These results suggest that BiAEImpute strikes an effective trade-off between accuracy, runtime, and memory efficiency, making it well-suited for real-world applications.
In addition, we assessed the scalability of BiAEImpute on large-scale scRNA-seq datasets. As shown in Supplementary Fig. 3, both runtime and GPU memory usage scale approximately linearly with the number of cells, indicating that BiAEImpute is computationally efficient and potentially suitable for even million-cell datasets, given appropriate hardware support.
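A scaling check of this kind can be sketched as follows. The example times a placeholder workload at several cell counts and estimates the slope of log(cost) against log(number of cells); a slope near 1 is consistent with linear scaling. All names here (`profile_call`, `scaling_exponent`, `placeholder_imputer`) are hypothetical, the workload is not BiAEImpute, and `tracemalloc` records host-side Python allocations rather than GPU memory, so this only illustrates the procedure.

```python
import time
import tracemalloc

import numpy as np


def profile_call(func, *args):
    """Return (result, seconds, peak_bytes) for one call of func(*args)."""
    tracemalloc.start()
    start = time.perf_counter()
    result = func(*args)
    seconds = time.perf_counter() - start
    _current, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, seconds, peak_bytes


def scaling_exponent(sizes, costs):
    """Slope of log(cost) versus log(size); near 1 suggests linear scaling."""
    slope, _intercept = np.polyfit(np.log(sizes), np.log(costs), deg=1)
    return float(slope)


def placeholder_imputer(matrix):
    """Stand-in workload (NOT BiAEImpute): fill zeros with per-gene means."""
    gene_means = matrix.mean(axis=0, keepdims=True)
    return np.where(matrix == 0, gene_means, matrix)


rng = np.random.default_rng(0)
n_genes = 200                            # illustrative size only
cell_counts = [1000, 2000, 4000, 8000]   # illustrative sizes only
runtimes, peaks = [], []
for n_cells in cell_counts:
    counts = rng.poisson(0.5, size=(n_cells, n_genes)).astype(float)
    _, seconds, peak_bytes = profile_call(placeholder_imputer, counts)
    runtimes.append(seconds)
    peaks.append(peak_bytes)

print("memory scaling exponent:", scaling_exponent(cell_counts, peaks))
print("runtime scaling exponent:", scaling_exponent(cell_counts, runtimes))
```

Wall-clock timings are noisy at small sizes, so in practice each size would be repeated several times before the slope is interpreted.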
Ablation study
To verify the effectiveness and synergy of the proposed bidirectional autoencoder architecture, we conducted a series of ablation studies under the same experimental settings. Two model variants were introduced, each removing one component:
NRA (No Row-wise Autoencoder): This variant includes only the column-wise autoencoder, with the row-wise autoencoder removed.
NCA (No Column-wise Autoencoder): This variant includes only the row-wise autoencoder, with the column-wise autoencoder removed.
Experiments were conducted on four datasets with three different missing rates. The results (Supplementary Tables 10–13) consistently show that BiAEImpute outperforms both ablated variants across all evaluation metrics. This highlights the overall effectiveness of the proposed bidirectional architecture.
Furthermore, the ablation study reveals the importance of each component. Comparing NCA with BiAEImpute shows that the column-wise autoencoder, which captures gene-level dependencies, plays a crucial role in imputation accuracy. Likewise, comparing NRA with BiAEImpute shows that the row-wise autoencoder, which learns cell-level relationships, also contributes significantly to performance.
These findings indicate that both the row-wise and column-wise autoencoders offer complementary benefits. Their synergistic integration in BiAEImpute is key to its superior performance, validating the design of the bidirectional autoencoder.
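The structure of such an ablation comparison can be sketched as below. The example assumes that dropout is simulated by zeroing a random fraction of nonzero entries and that metrics are computed on those masked entries only; rows are taken to be cells and columns genes. The three imputers are simple mean-based stand-ins, not the autoencoders of BiAEImpute, and the missing rates and matrix size are illustrative rather than those used in the study.

```python
import numpy as np


def mask_nonzero(matrix, missing_rate, rng):
    """Zero out a random fraction of nonzero entries; return (masked, mask)."""
    rows, cols = np.nonzero(matrix)
    n_drop = int(round(missing_rate * rows.size))
    picked = rng.choice(rows.size, size=n_drop, replace=False)
    mask = np.zeros(matrix.shape, dtype=bool)
    mask[rows[picked], cols[picked]] = True
    masked = matrix.copy()
    masked[mask] = 0.0
    return masked, mask


def rmse(truth, imputed, mask):
    """Root mean square error on the masked entries."""
    diff = truth[mask] - imputed[mask]
    return float(np.sqrt(np.mean(diff ** 2)))


def pcc(truth, imputed, mask):
    """Pearson correlation coefficient on the masked entries."""
    return float(np.corrcoef(truth[mask], imputed[mask])[0, 1])


# Stand-in imputers (assumptions, NOT the BiAEImpute autoencoders).
def fill_by_cell(m):
    """Cell-level stand-in: replace zeros with the mean of the same row."""
    return np.where(m == 0, m.mean(axis=1, keepdims=True), m)


def fill_by_gene(m):
    """Gene-level stand-in: replace zeros with the mean of the same column."""
    return np.where(m == 0, m.mean(axis=0, keepdims=True), m)


def fill_by_both(m):
    """Average of the cell-level and gene-level estimates."""
    return 0.5 * (fill_by_cell(m) + fill_by_gene(m))


variants = {
    "NRA (gene-level only)": fill_by_gene,
    "NCA (cell-level only)": fill_by_cell,
    "both directions": fill_by_both,
}

rng = np.random.default_rng(0)
truth = rng.poisson(3.0, size=(60, 40)).astype(float)  # synthetic counts
for rate in (0.1, 0.3, 0.5):  # illustrative missing rates
    masked, mask = mask_nonzero(truth, rate, rng)
    for name, impute in variants.items():
        imputed = impute(masked)
        print(f"rate={rate} {name}: RMSE={rmse(truth, imputed, mask):.3f} "
              f"PCC={pcc(truth, imputed, mask):.3f}")
```

Evaluating every variant on the same masked matrix keeps the comparison paired, so differences between variants are not confounded by differences in which entries were hidden.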
Limitation
While BiAEImpute offers promising results, there remains room for improvement. Currently, all zero values are treated as potential missing data, which may not be optimal. A more refined strategy would be probabilistic imputation, in which values are imputed according to their likelihood of being truly missing. However, existing methods for estimating these probabilities often suffer from low accuracy. Developing robust and accurate methods for distinguishing true missing values from true biological zeros therefore remains a crucial direction for future research. Furthermore, in this study we used an averaging approach to integrate the imputed values obtained from the two perspectives (gene-gene and cell-cell similarities). While this approach performed robustly in our analyses, weighted fusion strategies explored in future work may achieve higher imputation accuracy. Moreover, inspired by Zhou et al. [30], it is essential to account for longitudinal observations, as they may substantially affect imputation performance. Future methodological advances should prioritize this consideration to improve reliability and effectiveness.
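The difference between the averaging approach and a weighted fusion can be sketched as follows. This is a minimal illustration under the assumption that only zero entries are replaced and observed nonzero values are kept; the function `fuse_imputations` and its `row_weight` parameter are hypothetical and do not appear in the BiAEImpute source code. Setting `row_weight=0.5` reproduces plain averaging.

```python
import numpy as np


def fuse_imputations(observed, row_estimate, col_estimate, row_weight=0.5):
    """Blend two imputed matrices and fill only the zero entries of `observed`.

    row_weight is the weight of the cell-level (row-wise) estimate; the
    gene-level (column-wise) estimate receives 1 - row_weight.
    """
    if not 0.0 <= row_weight <= 1.0:
        raise ValueError("row_weight must lie in [0, 1]")
    observed = np.asarray(observed, dtype=float)
    row_estimate = np.asarray(row_estimate, dtype=float)
    col_estimate = np.asarray(col_estimate, dtype=float)
    blended = row_weight * row_estimate + (1.0 - row_weight) * col_estimate
    # Every zero is treated as potentially missing; nonzero values are kept.
    return np.where(observed == 0, blended, observed)


observed = np.array([[0.0, 5.0], [3.0, 0.0]])
row_est = np.array([[2.0, 9.0], [9.0, 4.0]])   # toy cell-level estimates
col_est = np.array([[4.0, 9.0], [9.0, 8.0]])   # toy gene-level estimates

print(fuse_imputations(observed, row_est, col_est))                  # averaging
print(fuse_imputations(observed, row_est, col_est, row_weight=0.8))  # weighted
```

In a weighted scheme, `row_weight` could be chosen on held-out masked entries, although how best to select it is left open here.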
Conclusion
In the field of scRNA-seq data, dropout events present a prevalent challenge that significantly impacts the accuracy and reliability of downstream analysis. To mitigate this challenge, we propose BiAEImpute, a novel imputation method that leverages bidirectional autoencoders to learn latent representations of both cellular and genetic features. Our comprehensive evaluation demonstrates that BiAEImpute outperforms state-of-the-art methods in accurately recovering missing gene expression values. By effectively capturing complex biological patterns, BiAEImpute facilitates the identification and visualization of cell subpopulations, marker genes, and cellular trajectories.
Supplementary Information
Acknowledgements
The authors thank the anonymous reviewers for suggestions that helped improve the paper substantially.
Abbreviations
- scRNA-seq
Single-cell RNA sequencing
- SVD
Singular Value Decomposition
- PCC
Pearson Correlation Coefficients
- R²
Coefficient of Determination
- RMSE
Root Mean Square Error
- ARI
Adjusted Rand Index
- NMI
Normalized Mutual Information
- KRCS
Kendall’s Rank Correlation Score
- UMAP
Uniform Manifold Approximation and Projection
- GEO
Gene Expression Omnibus
Authors’ contributions
Conceptualization, Y.Z.; Data curation, X.L.; Formal analysis, X.L.; Funding acquisition, Y.Z.; Methodology, X.L.; Software, X.L.; Validation, Y.Z.; Writing—original draft, Y.Z.; Writing—review and editing, Y.Z. and X.L.; All authors reviewed the manuscript.
Funding
This research was funded by the National Natural Science Foundation of China (grant number 62166014) and the Natural Science Foundation of Guangxi Zhuang Autonomous Region (grant number 2025GXNSFAA069627).
Data availability
The scRNA-seq data used in this manuscript are all publicly available. The Zeisel, Romanov, Usoskin, and Klein data are available at the Gene Expression Omnibus (GEO) under accession codes GSE60361, GSE74672, GSE59739, and GSE65525, respectively. The source code of BiAEImpute is available at https://github.com/LiuXinyuan6/BiAEImpute.
Declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1. Stark R, Grzelak M, Hadfield J. RNA sequencing: the teenage years. Nat Rev Genet. 2019;20:631–56.
- 2. Papalexi E, Satija R. Single-cell RNA sequencing to explore immune cell heterogeneity. Nat Rev Immunol. 2018;18:35–45.
- 3. Grün D, Kester L, van Oudenaarden A. Validation of noise models for single-cell transcriptomics. Nat Methods. 2014;11:637–40.
- 4. Wilson NK, Kent DG, Buettner F, et al. Combined single-cell functional and gene expression analysis resolves heterogeneity within stem cell populations. Cell Stem Cell. 2015;16:712–24.
- 5. Potter SS. Single-cell RNA sequencing for the study of development, physiology and disease. Nat Rev Nephrol. 2018;14:479–92.
- 6. Tang F, Barbacioru C, Wang Y, et al. mRNA-Seq whole-transcriptome analysis of a single cell. Nat Methods. 2009;6:377–82.
- 7. Patel AP, Tirosh I, Trombetta JJ, et al. Single-cell RNA-seq highlights intratumoral heterogeneity in primary glioblastoma. Science. 2014;344:1396–401.
- 8. Islam S, Zeisel A, Joost S, et al. Quantitative single-cell RNA-seq with unique molecular identifiers. Nat Methods. 2014;11:163–6.
- 9. Hicks SC, Townes FW, Teng M, et al. Missing data and technical variability in single-cell RNA-sequencing experiments. Biostatistics. 2018;19:562–78.
- 10. Durruthy-Durruthy R, Gottlieb A, Hartman BH, et al. Reconstruction of the mouse otocyst and early neuroblast lineage at single-cell resolution. Cell. 2014;157:964–78.
- 11. Pollen AA, Nowakowski TJ, Shuga J, et al. Low-coverage single-cell mRNA sequencing reveals cellular heterogeneity and activated signaling pathways in developing cerebral cortex. Nat Biotechnol. 2014;32:1053–8.
- 12. Van Dijk D, Sharma R, Nainys J, et al. Recovering gene interactions from single-cell data using data diffusion. Cell. 2018;174:716–729.e27.
- 13. Gong W, Kwak I-Y, Pota P, et al. DrImpute: imputing dropout events in single cell RNA sequencing data. BMC Bioinformatics. 2018;19:220.
- 14. Li WV, Li JJ. An accurate and robust imputation method scImpute for single-cell RNA-seq data. Nat Commun. 2018;9:997.
- 15. Tang W, Bertaux F, Thomas P, et al. bayNorm: Bayesian gene expression recovery, imputation and normalization for single-cell RNA-sequencing data. Bioinformatics. 2020;36:1174–81.
- 16. Linderman GC, Zhao J, Roulis M, et al. Zero-preserving imputation of single-cell RNA-seq data. Nat Commun. 2022;13:192.
- 17. Arisdakessian C, Poirion O, Yunits B, et al. DeepImpute: an accurate, fast, and scalable deep neural network method to impute single-cell RNA-seq data. Genome Biol. 2019;20:211.
- 18. Wu X, Zhou Y. GE-Impute: graph embedding-based imputation for single-cell RNA-seq data. Brief Bioinform. 2022;23:bbac313.
- 19. Hinton GE, Zemel R. Autoencoders, minimum description length and Helmholtz free energy. Proc 6th Int Conf Neural Inform Proces Systems. 1993;1993:3–10.
- 20. Wang Y, Yao H, Zhao S. Auto-encoder based dimensionality reduction. Neurocomputing. 2016;184:232–42.
- 21. Beaulieu-Jones BK, Moore JH, PRO-AACT Consortium. Missing data imputation in the electronic health record using deeply learned autoencoders. Pacific Symposium on Biocomputing 2017. 2017;207–18.
- 22. Zeisel A, Muñoz-Manchado AB, Codeluppi S, et al. Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq. Science. 2015;347:1138–42.
- 23. Romanov RA, Zeisel A, Bakker J, et al. Molecular interrogation of hypothalamic organization reveals distinct dopamine neuronal subtypes. Nat Neurosci. 2017;20:176–88.
- 24. Usoskin D, Furlan A, Islam S, et al. Unbiased classification of sensory neuron types by large-scale single-cell RNA sequencing. Nat Neurosci. 2015;18:145–53.
- 25. Klein AM, Mazutis L, Akartuna I, et al. Droplet barcoding for single-cell transcriptomics applied to embryonic stem cells. Cell. 2015;161:1187–201.
- 26. Cheng Y, Ma X. scGAC: a graph attentional architecture for clustering single-cell RNA-seq data. Bioinformatics. 2022;38:2187–93.
- 27. Zappia L, Phipson B, Oshlack A. Splatter: simulation of single-cell RNA sequencing data. Genome Biol. 2017;18:174.
- 28. Hao Y, Hao S, Andersen-Nissen E, et al. Integrated analysis of multimodal single-cell data. Cell. 2021;184:3573–3587.e29.
- 29. Cannoodt R, Saelens W, Sichien D, et al. SCORPIUS improves trajectory inference and identifies novel modules in dendritic cell development. bioRxiv. 2016:079509.
- 30. Zhou F, Lu X, Ren J, et al. Sparse group variable selection for gene–environment interactions in the longitudinal study. Genet Epidemiol. 2022;46:317–40.