Abstract
The growing volume of single-cell RNA sequencing (scRNA-seq) data enables researchers to explore cellular heterogeneity and gene expression profiles, offering a high-resolution view of the transcriptome at the single-cell level. However, dropout events, which are common in scRNA-seq data, remain a challenge for downstream analysis. Although a number of methods have been developed to recover single-cell expression profiles, their performance may be hindered because they do not fully explore the inherent relations between genes. To address this issue, we propose scDTL, a deep transfer learning-based approach for scRNA-seq data imputation that harnesses bulk RNA-sequencing information. We first employ a denoising autoencoder trained on bulk RNA-seq data as the initial imputation model, and then leverage a domain adaptation framework that transfers the knowledge learned by the bulk imputation model to the scRNA-seq learning task. In addition, scDTL runs a 1D U-Net denoising model in parallel to provide gene representations of varying granularity, capturing both coarse and fine features of the scRNA-seq data. Finally, we utilize a cross-channel attention mechanism to fuse the features learned from the transferred bulk imputation model and the U-Net model. In the evaluation, we conduct extensive experiments showing that scDTL can outperform other state-of-the-art methods in quantitative comparisons and downstream analyses.
Keywords: transfer learning, gene imputation, single-cell RNA-sequencing, bulk RNA sequencing
Introduction
Single-cell RNA sequencing (scRNA-seq) technology has significantly advanced the study of cellular heterogeneity and enabled high-resolution gene expression profiling at the single-cell level. However, scRNA-seq data typically suffer from a high ratio of 'dropouts' due to technical limits or improper sequencing. As a result, genes are not adequately observed, which produces artificial zeros in the expression matrix. Such dropout events are a major obstacle to downstream analyses such as cell-type clustering [1], trajectory analysis, and differentially expressed gene analysis. Therefore, it is desirable to develop an effective method that addresses this critical issue to facilitate scRNA-seq data interpretation.
Recently, a variety of imputation techniques have been proposed to infer and reconstruct the missing values in scRNA-seq data. MAGIC [2] is one of the earliest solutions for recovering incomplete single-cell gene expression. It estimates the missing values in the expression matrix by sharing information across similar cells through data diffusion. scImpute [3] fits a separate regression model for each cell to directly estimate true gene expression levels. To measure the uncertainty of estimated values, SAVER [4] uses a Bayesian model with a Poisson LASSO regression and takes advantage of other genes as predictors. DrImpute [5] and KNN smoothing [6] impute scRNA-seq data by averaging or smoothing the expression values over similar cells identified via cell clustering (e.g. KNN). Since the clustering conditions in different datasets are usually unknown, the results rely heavily on similar-cell information that cannot be guaranteed.
Beyond these conventional computational techniques, deep learning-based imputation approaches have been applied effectively. For example, DeepImpute and scScope [7, 8] use an encoder as the feature extractor that aggregates the cell representation, and reconstruct the expression profile of scRNA-seq data from the latent space with a decoder. Moreover, researchers have extended the basic FNN architecture to graph-based models that learn representations of the graph topology and build cell-to-cell similarity links. For example, GraphSCI [9] estimates missing expression values from scRNA-seq data by combining graph convolution and autoencoder neural networks. scGNN [10] introduces a multi-modal framework that adopts dense and graph convolution networks to reveal heterogeneous gene expression patterns. Contrastive learning has also emerged as a promising technique for scRNA-seq analysis [11, 12]. scGCL [13] leverages graph contrastive learning to capture the topological information of the cell graph. It trains a ZINB autoencoder to reconstruct the single-cell expression by randomly dropping out nodes and edges of the cell relationship graph. In addition, CL-Impute [11] uses contrastive learning and a self-attention network to capture latent biological cell relationships. However, most of the aforementioned methods consider only single-cell RNA-seq information, based on a reconstruction loss or a contrastive (similarity) loss. The inherent relationships between genes are still not comprehensively explored [14], which limits imputation performance: dropout events are still not adequately recovered.
On the other hand, bulk RNA sequencing measures the average expression across hundreds to millions of cells, providing a more comprehensive view of gene expression patterns and correlations. scGGAN [14] builds a gene relation network using single-cell and bulk genomics information to facilitate scRNA-seq data imputation. It can yield more accurate results by considering the inherent relations between genes. Peng et al. [15] proposed SCRABBLE, a matrix regularization-based framework that uses bulk cell data as a constraint for recovering dropout values in scRNA-seq. SCRABBLE requires matched bulk data collected from the same cells or tissue, and its matrix regularization also requires a consistent cell population for scRNA-seq imputation. Furthermore, Chen et al. [16] developed scDEAL, which leverages a model transferred from bulk cell data to facilitate single-cell drug response prediction. These methods demonstrate that valuable bulk-level knowledge can be transferred to single-cell-level tasks.
Inspired by the above approaches, we introduce scDTL, a deep transfer learning (DTL)-based approach that recovers dropout events in scRNA-seq data by leveraging bulk cell information. First, we employ a denoising autoencoder (DAE) that learns bulk cell representations, and then apply a domain adaptation technique to align the learned representations with single-cell data embeddings, thereby ensuring that the imputation model can be effectively applied to scRNA-seq analysis tasks. In addition, we explore scRNA-seq gene representations of different granularity by running a 1D U-Net denoising model in parallel. The U-Net model enables the network to propagate contextual information to higher-resolution layers, capturing both coarse and fine features. This capability enhances the representation of heterogeneous gene expression in scRNA-seq data. Finally, we integrate the scRNA-seq features derived from the adapted bulk imputation model with those from the U-Net architecture by employing a cross-channel attention mechanism. The channel-wise attention module is designed to capture the heterogeneous correlations between the two output streams, selectively emphasizing the most informative spatial features to enrich the representation of scRNA-seq data. Both the reconstruction loss and the cell clustering loss of gene expression are considered in each training epoch. Notably, scDTL can handle single-cell and bulk data from different tissues, and thus does not require a consistent cell population.
In the evaluation, we conduct extensive experiments on eight scRNA-seq datasets and use nine well-known state-of-the-art imputation methods as baselines. The results demonstrate that the scDTL-based approach can outperform other solutions in quantitative comparisons, including Pearson correlation coefficients (PCCs), root mean square error (RMSE), and L1-distance, as well as in downstream analyses, including cell clustering [17] and pseudotime analysis [18, 19]. We also provide an ablation study to examine the effect of the hyper-parameters and modules in the scDTL model.
Materials and methods
We introduce scDTL, a DTL-based framework that addresses the single-cell RNA-seq imputation problem by jointly considering large-scale bulk RNA-seq information. The overall architecture of scDTL is illustrated in Fig. 1. scDTL involves one pretraining phase, in which two denoising autoencoders (DAEs) with induced noise are employed to separately extract low-dimensional features from bulk and single-cell input data, and three key processing steps: (i) we first train an initial imputation model via supervised learning using large-scale bulk RNA-seq data; (ii) to ensure that the feature spaces of the SC encoder and the Bulk encoder exhibit similar distributions, the DTL framework is trained by considering the maximum mean discrepancy between the low-dimensional feature spaces of single-cell and bulk data, and a reconstruction loss for scRNA-seq data is added so that the SC encoder remains distinct from the Bulk encoder; and (iii) we finally utilize a 1D U-Net module to impute the dropouts of a given single-cell RNA-seq expression matrix and apply a cross-channel attention mechanism, CBAM (Convolutional Block Attention Module) [20], which captures the heterogeneous correlations between bulk RNA-seq data and scRNA-seq data from a global perspective.
Figure 1.
The workflow of scDTL includes DAE pretraining and (i) bulk RNA-seq imputation training, (ii) RNA-seq imputation DTL training, and (iii) single-cell RNA-seq imputation training, each using the loss terms defined in the corresponding subsection.
Data preparation and preprocessing
In this study, eight single-cell RNA-seq datasets of tumor cells [21–25] and one CCLE bulk RNA-seq dataset [26] were adopted to examine the ability of scDTL to impute dropout events. Detailed information on the datasets is summarized in Table 1. As the table shows, the number of genes ranges from 12 274 to 57 241, the number of cells ranges from 274 to 24 427, the number of cell clusters (ground-truth labels) varies from 2 to 10, and the zero-value proportion $P_{zero}$ lies in $[0.546, 0.937]$. This indicates that single-cell RNA-seq loses a considerable amount of gene information.
Table 1.
Summary of the scRNA-seq datasets
| Dataset | Cell | Gene | Cell type | $P_{zero}$ | Platform |
|---|---|---|---|---|---|
| GSE112274 | 507 | 13 801 | 5 | 0.546 | Drop-seq |
| GSE117872 | 1302 | 18 120 | 10 | 0.555 | 10x |
| GSE134836 | 24 427 | 14 467 | 2 | 0.867 | CEL-seq2 |
| GSE134838 | 8112 | 12 274 | 2 | 0.937 | inDrop |
| GSE134839 | 2903 | 12 811 | 6 | 0.891 | Smart-seq2 |
| GSE134841 | 5662 | 12 942 | 5 | 0.915 | Drop-seq |
| GSE81861 | 274 | 57 241 | 4 | 0.785 | Fluidigm |
| GSM3618014 | 5001 | 32 895 | 6 | 0.886 | 10x |
To reduce technical variance and control sample quality in each scRNA-seq dataset, data preprocessing was performed. Because of the high dropout rate of scRNA-seq expression data [10, 16], we first filter out genes with non-zero expression in fewer than 1% of cells, and cells with non-zero expression in fewer than 1% of genes. In addition, the gene matrices $X$ of bulk and single-cell data were normalized by dividing each count by the total UMI count of its cell (refer to the Python package SCANPY [27]) and multiplying by a scale factor $s$:

$$X^{norm}_{ij} = \frac{X_{ij}}{\sum_{k} X_{ik}} \times s \tag{1}$$

where $i$ indexes cells and $j$ indexes genes. Then we perform a log transformation (log1p) on the normalized counts. The addition of 1 avoids taking the logarithm of zero:

$$X^{log}_{ij} = \log\left(1 + X^{norm}_{ij}\right) \tag{2}$$

The values in $X^{log}$ are then scaled to the range from 0 to 1 using Min-Max scaling:

$$X^{scaled}_{ij} = \frac{X^{log}_{ij} - \min\left(X^{log}\right)}{\max\left(X^{log}\right) - \min\left(X^{log}\right)} \tag{3}$$

We select the top 4000 highly variable genes to form the gene expression matrices $X^{b}$ (bulk) and $X^{sc}$ (single-cell) [28]. We split both the bulk RNA-seq data and the single-cell RNA-seq data into training, validation, and testing sets with proportions of 64%, 16%, and 20%, respectively.
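The preprocessing steps above can be sketched in plain NumPy. This is a minimal illustration rather than the pipeline used in the study: the scale factor of 10 000, the global (matrix-wide) min-max scaling, the order of the two filters, and the use of per-gene variance as a stand-in for highly-variable-gene selection are all assumptions, and the function names `preprocess` and `split_indices` are hypothetical.

```python
import numpy as np

def preprocess(counts, scale=1e4, min_frac=0.01, n_top=4000):
    """Filter, normalize, log1p, min-max scale, and select variable genes.

    counts: (cells, genes) array of non-negative counts.
    scale and the variance-based gene ranking are assumptions of this sketch.
    """
    X = np.asarray(counts, dtype=float)
    # Keep genes that are non-zero in at least min_frac of the cells.
    gene_keep = (X > 0).mean(axis=0) >= min_frac
    X = X[:, gene_keep]
    # Keep cells that are non-zero in at least min_frac of the remaining genes.
    cell_keep = (X > 0).mean(axis=1) >= min_frac
    X = X[cell_keep]
    # Eq. (1): divide by the per-cell total count, multiply by the scale factor.
    totals = np.maximum(X.sum(axis=1, keepdims=True), 1e-12)
    X = X / totals * scale
    # Eq. (2): log(1 + x).
    X = np.log1p(X)
    # Eq. (3): min-max scaling to [0, 1] (computed over the whole matrix here).
    lo, hi = X.min(), X.max()
    X = (X - lo) / (hi - lo) if hi > lo else np.zeros_like(X)
    # Rank genes by variance as a simple proxy for highly variable genes.
    n_top = min(n_top, X.shape[1])
    top = np.sort(np.argsort(X.var(axis=0))[::-1][:n_top])
    gene_idx = np.flatnonzero(gene_keep)[top]  # indices into the input columns
    return X[:, top], cell_keep, gene_idx

def split_indices(n, rng, train=0.64, val=0.16):
    """Shuffle n sample indices into train / validation / test subsets."""
    perm = rng.permutation(n)
    n_train = int(round(train * n))
    n_val = int(round(val * n))
    return perm[:n_train], perm[n_train:n_train + n_val], perm[n_train + n_val:]
```

In practice a dedicated highly-variable-gene routine, such as the one shipped with SCANPY, would replace the variance ranking used here.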
We acknowledge that the data used in this study is publicly available and has been obtained in accordance with ethical guidelines and regulations. We have ensured that the use of this data complies with all relevant laws and regulations regarding data protection, privacy, and confidentiality. We have also taken measures to ensure that any potential risks to individuals or communities associated with the use of this data have been minimized.
Bulk RNA-seq imputation training
First, we build a bulk RNA-seq imputation model using a DAE. The DAE is applied to learn the low-dimensional representation of the bulk expression matrix $X^{b}$, which is obtained as described in the section Data preparation and preprocessing. The DAE takes the input $X^{b}$ and intentionally corrupts it by adding noise in the form of random masking (replacing non-zero RNA expression values with zero). Then an encoder-decoder model is trained to reconstruct the original bulk RNA-seq expression from the corrupted one. The training workflow is composed of three parts:
- Encoder: the goal is for the autoencoder to learn a representation that is robust enough to capture the underlying structure of the data and remove the introduced noise during the reconstruction process. In our design, the encoder $E_{b}$ learns to capture relevant gene features of bulk RNA-seq, and maps the randomly masked input (simulated dropout operations) into a compressed hidden representation.
- Decoder/Imputer: the latent representation is then fed into a decoder neural network $D_{b}$. The decoder's task is to reconstruct the input from the compressed representation and impute the bulk RNA-seq dropout values.
- Reconstructed output: the final output $\hat{X}^{b}$ of the DAE is the attempt to reconstruct the original, uncorrupted input $X^{b}$ without the noise (dropout values). The DAE is optimized with a reconstruction loss (mean squared error, MSE) that minimizes the difference between the input $X^{b}$ and the reconstructed output $\hat{X}^{b}$:

$$\mathcal{L}_{bulk} = \frac{1}{n_b} \sum_{i=1}^{n_b} \left\| X^{b}_{i} - \hat{X}^{b}_{i} \right\|_{2}^{2} \tag{4}$$

where $n_b$ is the number of bulk samples.
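The corruption step and the reconstruction loss described above can be illustrated with a small NumPy sketch. The single-layer ReLU encoder, the sigmoid decoder, the masking rate, and all function names are assumptions made for illustration; the layer sizes and activations of the actual model are not specified in this section, and the weights here are untrained.

```python
import numpy as np

def mask_nonzero(X, rate, rng):
    """Corrupt X by zeroing each entry with probability `rate`.

    Entries that are already zero stay zero, so only non-zero
    expression values are affected (simulated dropout).
    """
    keep = rng.random(X.shape) >= rate
    return np.where(keep, X, 0.0)

def dae_forward(X_noisy, W_enc, b_enc, W_dec, b_dec):
    """One encoder layer followed by one decoder layer (hypothetical sizes)."""
    z = np.maximum(0.0, X_noisy @ W_enc + b_enc)   # ReLU latent code
    logits = z @ W_dec + b_dec
    X_hat = 1.0 / (1.0 + np.exp(-logits))          # sigmoid keeps outputs in (0, 1)
    return z, X_hat

def mse(X, X_hat):
    """Mean squared error between the clean input and the reconstruction."""
    return float(np.mean((X - X_hat) ** 2))
```

Note that the loss compares the reconstruction with the clean matrix, not with the corrupted one, which is what makes the autoencoder a denoising model.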
Single-cell RNA-seq feature extraction
To pretrain an encoder that extracts low-dimensional features from the single-cell RNA-seq data, we use a DAE model similar to the one in the previous section, Bulk RNA-seq imputation training. We add noise $\epsilon$ to the single-cell RNA expression $X^{sc}$ based on a binomial distribution to generate the corrupted input $\tilde{X}^{sc}$.

$$\hat{X}^{sc} = D_{s}\left(E_{s}\left(\tilde{X}^{sc}\right)\right) \tag{5}$$

where $E_{s}$ and $D_{s}$ are the encoder and decoder for the noisy single-cell RNA expression input $\tilde{X}^{sc}$, and the reconstructed result $\hat{X}^{sc}$ is the output of the decoder $D_{s}$. In the testing phase, we also feed input data with simulated dropout events to $E_{s}$ to examine the imputation performance.
RNA-seq imputation DTL training
By leveraging the knowledge derived from the section Bulk RNA-seq Imputation Training, we are able to enhance single-cell RNA-seq imputation by training the model that learns general features and representations from the source bulk RNA-seq data. More specifically, we apply a domain adaptation architecture inspired by Domain Adversarial Neural Network [29]. The proposed architecture promotes the emergence of features that are both discriminative and invariant to the change of domains (from bulk RNA-seq to single-cell RNA-seq).
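As the training progresses, we update both encoders $E_{b}$ and $E_{s}$ to balance the distributions of the features extracted from the bulk and single-cell domains by introducing the MMD (Maximum Mean Discrepancy) loss. The MMD loss measures the similarity between the outputs of $E_{b}$ and $E_{s}$. It is often defined as the squared difference between the empirical mean embeddings of the source and target domains in a reproducing kernel Hilbert space (RKHS) and can be expressed as follows:

$$\mathcal{L}_{MMD} = K_{bb} + K_{ss} - 2K_{bs} \tag{6}$$

$$K_{bb} = \frac{1}{n_b^{2}} \sum_{i=1}^{n_b} \sum_{j=1}^{n_b} k\left(E_{b}(x^{b}_{i}), E_{b}(x^{b}_{j})\right) \tag{7}$$

$$K_{ss} = \frac{1}{n_s^{2}} \sum_{i=1}^{n_s} \sum_{j=1}^{n_s} k\left(E_{s}(x^{s}_{i}), E_{s}(x^{s}_{j})\right) \tag{8}$$

$$K_{bs} = \frac{1}{n_b n_s} \sum_{i=1}^{n_b} \sum_{j=1}^{n_s} k\left(E_{b}(x^{b}_{i}), E_{s}(x^{s}_{j})\right) \tag{9}$$

where $K_{bb}$ is the kernel function applied to samples from the source domain, $n_b$ is the number of cells and $m_b$ is the number of genes in the bulk cell dataset (so that $x^{b}_{i} \in \mathbb{R}^{m_b}$), $n_s$ is the number of cells in the single-cell dataset, $m$ is the number of top highly variable genes (so that $x^{s}_{j} \in \mathbb{R}^{m}$), $K_{ss}$ is the kernel function applied to samples from the target domain, and $K_{bs}$ is the corresponding cross-domain term.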
In addition, we consider the bulk imputation loss $\mathcal{L}_{bulk}$ (derived from the imputer $D_{b}$) together with the MMD loss $\mathcal{L}_{MMD}$ throughout the training process, as defined below:

$$\mathcal{L}_{DTL} = \lambda_{1} \mathcal{L}_{bulk} + \lambda_{2} \mathcal{L}_{MMD} \tag{10}$$

where $\lambda_{1}$ and $\lambda_{2}$ are the weights of $\mathcal{L}_{bulk}$ and $\mathcal{L}_{MMD}$.
By jointly optimizing $E_{b}$, $E_{s}$, and $D_{b}$, the proposed domain adaptation architecture is able to minimize the distribution discrepancy between the source and target domains in the RKHS, effectively aligning bulk and single-cell feature distributions.
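To make the alignment objective concrete, the following NumPy sketch computes the standard biased empirical estimate of the squared MMD between two sets of latent features and combines it with a bulk reconstruction loss as a weighted sum. The Gaussian (RBF) kernel, its bandwidth `gamma`, and the function names are assumptions of this sketch; the kernel actually used by scDTL is not stated in this section.

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    """Gaussian kernel matrix k(a, b) = exp(-gamma * ||a - b||^2)."""
    sq = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def mmd2(Z_bulk, Z_sc, gamma=1.0):
    """Biased estimate of squared MMD between bulk and single-cell features."""
    k_bb = rbf_kernel(Z_bulk, Z_bulk, gamma).mean()   # source-source term
    k_ss = rbf_kernel(Z_sc, Z_sc, gamma).mean()       # target-target term
    k_bs = rbf_kernel(Z_bulk, Z_sc, gamma).mean()     # cross-domain term
    return float(k_bb + k_ss - 2.0 * k_bs)

def dtl_loss(bulk_loss, mmd_loss, w_bulk=1.0, w_mmd=1.0):
    """Weighted sum of the bulk imputation loss and the MMD loss."""
    return w_bulk * bulk_loss + w_mmd * mmd_loss
```

The estimate is close to zero when the two feature sets follow the same distribution and grows as they drift apart, which is why minimizing it pulls the two encoders toward a shared latent space.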
Single-cell RNA-seq imputation training
Using the well-trained encoder $E_{s}$, we can now compress any single-cell RNA-seq data into a compact, low-dimensional latent variable $z^{s}$. The transferred bulk imputer $D_{b}$ then takes the latent variable $z^{s}$ as input, while a 1D U-Net denoising model is applied in parallel to extract single-cell RNA-seq features at multiple scales. Given the two outputs $F_{bulk}$ and $F_{unet}$, we leverage a cross-channel attention module to sequentially infer attention maps along two separate channels and dimensions (both the bulk imputer and the single-cell U-Net).
1D U-net architecture
Unlike the FCN-based decoder that complements the typical contracting network with successive layers, the U-Net features M levels of downsampling and upsampling blocks to increase the resolution of the output. In our design, each block in the contracting path follows the typical structure of a 1D convolutional network. It consists of the repeated application of one conv layer with stride 2 (unpadded) and one conv layer with stride 2 (padding = 1) followed by a Sigmoid function. Moreover, we double the number of channels at each downsampling step (16–32–64–128), while each upsampling step halves the number of channels with a conv kernel and concatenates the result with the correspondingly cropped feature map from the contracting path. At the final step, one conv layer is used to map each 16-component feature vector to the desired number of gene expression values (4000 in our case). There are in total 23 conv layers in the proposed U-Net structure.
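As a rough aid to following the contracting path, the sketch below tracks how the channel count and the sequence length change across strided 1D convolutions, using the usual output-length formula. The kernel size of 3 is an assumption (the kernel sizes are not recoverable from the text), as is the simplification of one strided convolution per level; the function names are hypothetical.

```python
def conv1d_out_len(length, kernel, stride=1, padding=0):
    """Output length of a 1D convolution: floor((L + 2p - k) / s) + 1."""
    return (length + 2 * padding - kernel) // stride + 1

def contracting_path(length, channels=(16, 32, 64, 128),
                     kernel=3, stride=2, padding=1):
    """Return (channels, length) after each downsampling level.

    Assumes one strided convolution per level; kernel size is illustrative.
    """
    shapes = []
    for c in channels:
        length = conv1d_out_len(length, kernel, stride, padding)
        shapes.append((c, length))
    return shapes
```

Under these assumptions, an input of 4000 genes is halved at each level while the channel count doubles, which is the coarse-to-fine trade-off the skip connections later undo in the expanding path.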
Cross-channel attention module
As shown in Fig. 2, given the concatenation $F$ of the bulk imputer output $F_{bulk}$ and the U-Net output $F_{unet}$ as input, the cross-channel attention module computes the channel and spatial weight matrices sequentially, selecting 'what' and 'where' is meaningful in gene expression analysis. The overall process is summarized as follows:
Figure 2.
The architecture of cross-channel attention module.

$$F' = M_{c}(F) \otimes F \tag{11}$$

$$F'' = M_{s}(F') \otimes F' \tag{12}$$

where $M_{c}$ is the 1D channel attention matrix, $M_{s}$ is the 1D spatial attention matrix, and $\otimes$ denotes element-wise multiplication. $F''$ is the final refined output.
To produce the channel attention matrix $M_{c}$ of the feature map $F$, we squeeze the spatial dimension of the feature map by applying both average-pooling $\mathrm{AvgPool}(F)$ and max-pooling $\mathrm{MaxPool}(F)$. It has been shown empirically that exploiting both average-pooled and max-pooled features greatly improves the representation power of networks compared with using either one alone [20].
The pooled features are passed through a shared network $\mathrm{MLP}$, which is usually a multi-layer perceptron with one hidden layer:

$$M_{c}(F) = \sigma\left(\mathrm{MLP}\left(\mathrm{AvgPool}(F)\right) + \mathrm{MLP}\left(\mathrm{MaxPool}(F)\right)\right) \tag{13}$$

where $\sigma$ represents the Sigmoid function.
To emphasize the inter-spatial features of gene expression, the spatial attention refines the feature map by applying average-pooling and max-pooling operations along the channel dimension and concatenating the results into an efficient feature descriptor. The spatial attention map is then generated by applying a convolution layer to this two-channel descriptor, as shown below:

$$M_{s}(F) = \sigma\left(f\left(\left[\mathrm{AvgPool}(F); \mathrm{MaxPool}(F)\right]\right)\right) \tag{14}$$

where $f$ is the convolutional function with kernel size $k$ and $\sigma$ is the Sigmoid function. For a detailed explanation and discussion of the channel-wise attention mechanism, we refer to the paper [20].
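The two attention steps can be sketched for a 1D feature map of shape (channels, length) in NumPy. This follows the general CBAM recipe from [20] rather than the exact scDTL implementation: the ReLU hidden layer, the zero padding, the odd kernel width, and the weight shapes are assumptions, and the weights `W1`, `W2`, and `kernel` are hypothetical placeholders for learned parameters.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(F, W1, W2):
    """Channel weights from average- and max-pooled descriptors.

    F: (C, L) feature map. W1: (H, C) and W2: (C, H) form the shared MLP.
    """
    avg = F.mean(axis=1)                     # squeeze the spatial axis -> (C,)
    mx = F.max(axis=1)
    mlp = lambda v: W2 @ np.maximum(0.0, W1 @ v)
    return sigmoid(mlp(avg) + mlp(mx))       # (C,)

def spatial_attention(F, kernel):
    """Spatial weights from a convolution over the [avg; max] descriptor.

    kernel: (2, k) with odd k, applied with zero padding so length is kept.
    """
    desc = np.stack([F.mean(axis=0), F.max(axis=0)])   # pool over channels -> (2, L)
    k = kernel.shape[1]
    pad = k // 2
    padded = np.pad(desc, ((0, 0), (pad, pad)))
    out = np.array([np.sum(padded[:, i:i + k] * kernel) for i in range(F.shape[1])])
    return sigmoid(out)                      # (L,)

def cbam_1d(F, W1, W2, kernel):
    """Apply channel attention, then spatial attention, by element-wise scaling."""
    F1 = channel_attention(F, W1, W2)[:, None] * F
    F2 = spatial_attention(F1, kernel)[None, :] * F1
    return F2
```

With two input channels (the bulk imputer stream and the U-Net stream), the channel weights decide how much each stream contributes, while the spatial weights decide which gene positions are emphasized.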
Finally, we integrate the cell-clustering loss, scaled by a weighting coefficient, with the gene reconstruction loss to retain single-cell heterogeneity during training. The well-trained encoder $E_{s}$ and imputer $D_{b}$, together with the U-Net module, take the single-cell RNA-seq data $X^{sc}$ as input and output the imputed gene expression $\hat{X}^{sc}$ by replacing the zero values in each cell.
![]() |
(15) |
Results
Experimental settings and baselines
In the experiment, we regarded the gene expression from eight single-cell datasets as ground-truth. We simulate the dropout events in selected single-cell testing sets by randomly masking 10%, 20%, and 40% of non-zero gene values (replacing the value with zero). We then performed the imputation task on both synthetic dropout data and original data with zero values using scDTL and the baseline methods to recover the missing values. Next, we quantitatively evaluated the performance of each method on the synthetic dropout data by using benchmarking metrics that measure the similarity and error distance between the recovered data and the ground-truth data. We also provided downstream analysis tasks to evaluate the performance of each method on the original data with zero values. Downstream tasks include cell clustering [17], PAGA (Partition-based Graph Abstraction) [18], and pseudotime analysis [19].
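The masking protocol described above can be sketched as follows. The function name `simulate_dropout` and the choice to return the boolean mask alongside the corrupted matrix are assumptions of this sketch; returning the mask simply makes it easy to score imputation on exactly the entries that were hidden.

```python
import numpy as np

def simulate_dropout(X, rate, rng):
    """Set a fixed fraction of the non-zero entries of X to zero.

    Returns the corrupted matrix and a boolean mask marking the hidden entries.
    """
    X = np.asarray(X, dtype=float)
    rows, cols = np.nonzero(X)
    n_mask = int(round(rate * rows.size))
    pick = rng.choice(rows.size, size=n_mask, replace=False)
    mask = np.zeros(X.shape, dtype=bool)
    mask[rows[pick], cols[pick]] = True
    X_drop = X.copy()
    X_drop[mask] = 0.0
    return X_drop, mask
```

Calling it with rates of 0.1, 0.2, and 0.4 reproduces the three masking levels, and the original values at `mask` serve as the ground truth for the quantitative metrics.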
For comparison, we selected nine state-of-the-art methods [30] for single-cell RNA-seq data imputation as baselines: MAGIC [2], CMF-Impute [31], GE-Impute [32], CL-Impute [11], scGCL [13], SCRABBLE [15], SAVER-X [33], scScope [34], and scVI [35]. All parameters and configurations of the baseline methods follow the settings suggested in the original papers or the default parameters of the shared code.
scDTL greatly improved the performance of missing data recovery
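To assess the performance of scDTL in imputing missing values in scRNA-seq data, we simulated dropout events by randomly masking 10%, 20%, and 40% of non-zero expression values in the eight datasets. We then adopted three quantitative metrics, PCC, RMSE, and L1-distance, to evaluate the accuracy of the imputed values. The three metrics are defined as follows:

$$\mathrm{PCC} = \frac{\sum_{i=1}^{N} \left(X_{i} - \mu_{X}\right)\left(\hat{X}_{i} - \mu_{\hat{X}}\right)}{N \, \sigma_{X} \, \sigma_{\hat{X}}} \tag{16}$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left(X_{i} - \hat{X}_{i}\right)^{2}} \tag{17}$$

$$\mathrm{L1} = \frac{1}{N} \sum_{i=1}^{N} \left| X_{i} - \hat{X}_{i} \right| \tag{18}$$

where $\mu_{X}$ and $\sigma_{X}^{2}$ represent the mean and variance of the ground-truth expression $X$, $\mu_{\hat{X}}$ and $\sigma_{\hat{X}}^{2}$ represent the mean and variance of the imputed expression $\hat{X}$, and $N$ is the total number of single-cell samples.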
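A compact NumPy sketch of the three metrics is given below. It flattens both inputs and treats every entry as one observation; whether the study averages over cells, genes, or only the masked entries is not stated here, so that choice, the mean-based form of the L1-distance, and the function names are assumptions.

```python
import numpy as np

def pcc(x, y):
    """Pearson correlation coefficient between two arrays (flattened)."""
    x = np.ravel(x).astype(float)
    y = np.ravel(y).astype(float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

def rmse(x, y):
    """Root mean square error."""
    d = np.ravel(x).astype(float) - np.ravel(y).astype(float)
    return float(np.sqrt(np.mean(d ** 2)))

def l1_distance(x, y):
    """Mean absolute difference."""
    d = np.ravel(x).astype(float) - np.ravel(y).astype(float)
    return float(np.mean(np.abs(d)))
```

For instance, comparing `x = [1, 2, 3, 4]` with `y = 2 * x + 1` gives a PCC of 1 even though the RMSE and L1-distance are large, which illustrates why a scale-independent metric such as PCC should be read together with the two error metrics.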
As shown in Fig. 3 and Supplementary Tables 1 and 4, scDTL outperformed the other nine methods in nearly all cases at 10% and 20% dropout rates, with higher PCC values and lower RMSE and L1-distance values. Notably, at a 40% dropout rate scDTL recovered the missing values better than any other method in all datasets. This indicates that our model excels at precisely recovering missing data, especially at high dropout rates. This setting is also the more practical one, because the actual zero-value proportions of scRNA-seq data are typically above 40%, as illustrated in Table 1.
Figure 3.
Quantitative metrics of scDTL compared with other imputation methods in recovering missing values in scRNA-seq data at 40% dropout rates
Compared to the baselines, the distinctive advantage of scDTL lies in its ability to fully leverage bulk cell information through a DTL framework, while simultaneously utilizing the CBAM mechanism and the U-Net module to capture heterogeneous correlations between bulk RNA-seq and scRNA-seq data from a global perspective. For example, although SCRABBLE [15] also leverages bulk cell information, it only considers the differences between bulk RNA-seq data and scRNA-seq data aggregated across cells. This solution is constrained and may lose critical information from both bulk and single cells. The limitation is evident in the imputation results, where SCRABBLE does not outperform other methods that rely solely on single-cell data. This indicates that merely using bulk data is insufficient: a framework with strong feature extraction capabilities is necessary to learn effective representations.
Other imputation methods that rely on calculating cell-to-cell similarity in scRNA-seq data, such as GE-Impute [32] and scGCL [13], also face significant challenges. They use graph embedding and contrastive learning to capture the topological relationships among cells, but they focus exclusively on single-cell RNA-seq data and depend on a reconstruction or contrastive (similarity) loss. Consequently, their performance relies heavily on the availability of similar-cell information, which is not always assured. This limitation is particularly critical when 'real' similar neighbors are difficult to identify because of a high dropout rate in scRNA-seq data (e.g. over 40%).
An interesting case is that both scVI [35] and scScope [34] achieve relatively good PCCs. scVI aggregates information across similar cells and genes to approximate the distributions underlying observed expression values, while scScope uses a recurrent network layer to iteratively impute zero-valued entries in scRNA-seq data. Their good PCCs arise largely because PCC is scale-independent: it measures the strength and direction of the linear relationship between two variables, regardless of the units of measurement. When RMSE, L1-distance, and PCC are considered together, their performance declines. Moreover, although scVI achieved the second-best performance among the nine SOTA methods, it was less stable and underperformed in clustering tasks, primarily because it does not utilize bulk information. In contrast, scDTL learns latent representations and recovers missing values in scRNA-seq data effectively, even in the presence of incomplete gene expression, producing an imputed matrix that closely resembles the actual data.
scDTL significantly improved the performance of cell clustering
Cell clustering is a crucial task in scRNA-seq downstream analysis, impacting cell type annotation. To assess scDTL’s clustering performance, we compared it with nine baseline methods across eight datasets. The clustering was conducted through the SCANPY toolkit [27], and we computed the Adjusted Rand Index (ARI) and Fowlkes–Mallows Index (FMI) to measure the correlation between the clustering results and the annotated cell populations in the datasets [36].
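Both indices can be computed by counting pairs of cells that are grouped together in the predicted clustering, in the reference annotation, or in both. The NumPy sketch below is a minimal illustration of the standard definitions, with hypothetical function names; a maintained implementation such as the one in scikit-learn would normally be preferred.

```python
import numpy as np

def _pair_counts(labels_true, labels_pred):
    """Count co-clustered pairs via the contingency table of the two labelings."""
    _, t = np.unique(np.ravel(labels_true), return_inverse=True)
    _, p = np.unique(np.ravel(labels_pred), return_inverse=True)
    cont = np.zeros((t.max() + 1, p.max() + 1), dtype=np.int64)
    np.add.at(cont, (t, p), 1)
    comb2 = lambda v: v * (v - 1) // 2            # number of pairs among v items
    same_both = int(comb2(cont).sum())            # paired in both labelings
    same_true = int(comb2(cont.sum(axis=1)).sum())
    same_pred = int(comb2(cont.sum(axis=0)).sum())
    total = len(t) * (len(t) - 1) // 2
    return same_both, same_true, same_pred, total

def adjusted_rand_index(labels_true, labels_pred):
    """Rand index corrected for chance agreement."""
    both, st, sp, total = _pair_counts(labels_true, labels_pred)
    expected = st * sp / total
    max_index = 0.5 * (st + sp)
    if max_index == expected:
        return 1.0
    return float((both - expected) / (max_index - expected))

def fowlkes_mallows_index(labels_true, labels_pred):
    """Geometric mean of pairwise precision and recall."""
    both, st, sp, _ = _pair_counts(labels_true, labels_pred)
    if st == 0 or sp == 0:
        return 0.0
    return float(both / np.sqrt(st * sp))
```

Both scores are invariant to how cluster labels are named, so a clustering that recovers the annotated populations under different label ids still scores 1.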
As shown in Fig. 4, scDTL achieved a greater improvement in clustering accuracy than all baseline methods on the GSE112274, GSE117872, GSE134839, and GSE81861 datasets, as indicated by higher ARI and FMI values. On datasets GSE134836, GSE134838, GSE134841, and GSM3618014, scDTL's performance was comparable to that of some of the SOTA methods. We attribute this to the following reasons. First, all four of these datasets contain a relatively high percentage of zero values ($P_{zero}$ in Table 1) compared to the other datasets, which poses a significant challenge for single-cell gene imputation by any SOTA method. Second, datasets GSE134836 and GSE134838 contain only two cell types, so the lower heterogeneity of the samples allows some SOTA methods to reach clustering results comparable to those of scDTL. Third, although dataset GSM3618014 contains five cell types, the differences among these cell types are pronounced even before imputation, leaving less room for improvement by any imputation method. These data suggest that when the percentage of zero values is extremely high or when the differences among cell types are apparent, scDTL might achieve clustering results comparable to those of some other SOTA methods. They also suggest that scDTL's performance is robust with respect to the percentage of zero values and the number of cell types. In most cases, it improved clustering accuracy more than the other methods and can therefore be applied in a broader range of scenarios.
Figure 4.

Clustering analysis. ARI and FMI values of scDTL clustering results compared with other imputation methods.
The clustering results were visualized using the uniform manifold approximation and projection (UMAP) method (Fig. 5 and Supplementary Figs 1–6). As shown in Fig. 5, for the non-imputed data in the GSE134839 and GSE134841 datasets, the clustering results were completely inconsistent with the true labels. In contrast, in the scDTL imputation plot for the GSE134839 dataset, different clusters were well separated and consistent with the true labels. Similarly, in the GSE134841 dataset, although unsupervised clustering revealed six clusters compared to the five true label clusters, the alignment between the clustering results and the true labels was still greatly improved compared to the analysis with non-imputed data. By comparison, the unsupervised clustering results remained inconsistent with the true labels after imputation with other methods. These data demonstrate that the clustering results of scDTL are highly consistent with the ground-truth labels. Both the GSE134839 and GSE134841 datasets contain cells treated with a receptor tyrosine kinase inhibitor at various time points. An accurate separation of these cell populations would greatly facilitate the understanding of transcriptomic changes at different drug treatment stages.
Figure 5.

Clustering analysis. UMAP plots of clustering results in GSE134839 and GSE134841 datasets before and after imputation.
scDTL improved the performance of trajectory analysis
Trajectory analysis of scRNA-seq data is an important downstream analysis that reveals cellular developmental patterns. Dropout events can impair trajectory reconstruction, and imputation methods can mitigate this issue. To assess scDTL's effectiveness in improving trajectory analysis, we reconstructed the cell trajectory from the data imputed by each method. We employed the scanpy.tl.paga and scanpy.tl.dpt tools to perform trajectory inference for the GSE112274 and GSE134838 datasets, which feature cancer cells treated with a tyrosine kinase inhibitor or vemurafenib, respectively, in a time-course manner. We first calculated the Pseudo-temporal Ordering Score (POS) [37] to quantify the accuracy of cell pseudo-time ordering compared to the ground-truth cell ordering. POS [38] is defined as the sum of scores that characterize how well the order of the $i$th and $j$th cells in the ordered path matches their expected order based on the external information:
![]() |
(19) |
where $t_{i}$ and $t_{j}$ are the true times of the $i$th and $j$th cells in the cell pseudo-time ordering, respectively.
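One common way to instantiate such a score is sketched below in NumPy. It is an assumption of this sketch, not necessarily the exact scoring function used in the study, that each pair of cells contributes +1 when the cell placed earlier in the pseudo-time path also has the earlier true time, -1 when the order is reversed, and 0 when both cells share the same true time; the optional normalization by the number of informative pairs and the function name are also hypothetical.

```python
import numpy as np

def pseudo_temporal_ordering_score(pseudotime, true_time, normalize=True):
    """Sum pairwise agreement between pseudo-time order and true-time order."""
    order = np.argsort(pseudotime, kind="stable")      # cells along the inferred path
    t = np.asarray(true_time, dtype=float)[order]
    # sign[i, j] = +1 if t_j > t_i, -1 if t_j < t_i, 0 if equal.
    sign = np.sign(t[None, :] - t[:, None])
    upper = np.triu(sign, k=1)                         # keep pairs with i before j
    score = upper.sum()
    if normalize:
        n_pairs = np.count_nonzero(upper)              # pairs with different true times
        return float(score / n_pairs) if n_pairs else 0.0
    return float(score)
```

With normalization the score lies between -1 (fully reversed ordering) and 1 (fully consistent ordering). The pairwise matrix needs memory quadratic in the number of cells, so large datasets would call for a chunked or rank-based computation.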
Quantitative assessment using the POS index shows that scDTL achieves the highest consistency between cell pseudo-time and true time in both datasets compared to the other methods (Fig. 6a), indicating improved accuracy in inferred cell pseudo-time ordering. Furthermore, as shown in Fig. 6b, the trajectories of scDTL-imputed data align better with the true-labeled time points than those of the non-imputed data. These findings highlight that scDTL enhances the precision of inferred cell pseudo-time ordering. An improved pseudo-time analysis of these datasets could help reveal cell state changes along drug treatment and uncover new mechanisms underlying drug resistance that could be overlooked without data imputation.
Figure 6.
Trajectory analysis. (a) POS of the indicated datasets imputed by different methods. (b) Visualization of trajectory inference. In true label, G3 and G5 refer to two independent TKI-resistant cell clones. Early and late refer to cells collected at 30 days and 120 days after the treatment.
Ablation study
To evaluate the impact of specific settings in the proposed DTL framework, we assess imputation performance by testing various module combinations within scDTL, i.e. with or without CBAM, the U-Net, and the Bulk Cell Imputer. Supplementary Tables 5 and 6 report the quantitative and clustering metrics for scRNA-seq imputation at a 40% dropout rate on the eight datasets. As the eight selected datasets show, the actual average dropout rates of scRNA-seq are typically above 40%, making this setting the more practical one for real-world testing. The results demonstrate that the full scDTL framework performs exceptionally well. If the Bulk Cell Imputer is not used, imputation accuracy is significantly affected. For example, on dataset GSE112274, PCC changes from 0.96 to 0.84, RMSE from 0.5 to 1.24, L1 distance from 0.15 to 0.19, ARI from 0.5 to 0.31, and FMI from 0.63 to 0.5. We also evaluated scDTL without the U-Net, which enhances the representation of heterogeneous gene expression in scRNA-seq data. Although its imputation performance surpasses that of the model without bulk cell information, it remains inferior to the fully integrated framework. Moreover, we evaluated the impact of replacing CBAM, which dynamically assigns weights to the scRNA and bulk cell features, with fixed weights of 0.5 and 0.5. The results showed a decline in imputation performance. For more detailed results and other factors, please refer to the data in Supplementary Tables 4 and 5. Therefore, the combination of bulk cell information, the U-Net module, and the CBAM module is essential to the effectiveness of the scDTL method.
Discussion
scRNA-seq has greatly facilitated comprehensive studies of gene expression profiles at the single-cell level. In cancer research, scRNA-seq has been applied to investigate cellular heterogeneity in the tumor microenvironment and in tumor cells themselves. However, dropout events have become a major challenge that diminishes the accuracy of scRNA-seq analysis. Consequently, it is imperative to develop effective imputation methods to recover the incomplete gene expression profiles. In this article, we used scRNA-seq data and bulk RNA-seq data from tumor cells to examine the performance of a novel gene-imputation framework, scDTL, as a proof of concept. Compared with other imputation methods that only take advantage of scRNA-seq data, scDTL substantially improved the imputation results even in datasets with high dropout rates. Although several other imputation methods also leverage bulk RNA-seq data, scDTL does not require matched bulk and single-cell data from the same sample, which are often unavailable in many experimental settings. The evaluation results suggest that the proposed method can be further applied to different cell types or tissues, thereby enabling a broader range of research scenarios in cancer and other diseases.
scRNA-seq matrices typically have a high ratio of ‘dropouts’ (they can contain up to 90% zero values), while
of genes in bulk RNA-seq of various tissues are not expressed. Unlike imputation methods that consider only scRNA-seq data and exploit the similarity among cells and genes, we aim to explore inherent relationships between genes by leveraging bulk cell data. Specifically, scDTL applies a domain adaptation technique that pre-trains the imputation model for scRNA-seq data using a large volume of bulk RNA-seq data. We also employ a parallel operation with a cross-attention mechanism to balance the harmonization of scRNA-seq data with bulk data against the preservation of the original single-cell information.
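To make the contrast between fixed and data-dependent fusion of two feature branches concrete, the sketch below combines two feature vectors either with fixed weights or with softmax weights derived from pooled descriptors. This is a deliberately simplified stand-in, not the attention module used in scDTL: it has no learned parameters, and the function names and the average-plus-max pooling score are assumptions chosen for illustration.

```python
import math


def fixed_fusion(bulk_feat, unet_feat, weight=0.5):
    """Fuse two branches with a fixed weight (0.5/0.5 by default)."""
    if len(bulk_feat) != len(unet_feat):
        raise ValueError("feature vectors must have the same length")
    return [weight * b + (1.0 - weight) * u for b, u in zip(bulk_feat, unet_feat)]


def attention_weights(branches):
    """Softmax over one pooled score per branch.

    Each branch is summarized by average pooling plus max pooling
    (a hypothetical, parameter-free descriptor), then normalized.
    """
    scores = [sum(feat) / len(feat) + max(feat) for feat in branches]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_fusion(bulk_feat, unet_feat):
    """Fuse two branches with data-dependent weights that sum to 1."""
    if len(bulk_feat) != len(unet_feat):
        raise ValueError("feature vectors must have the same length")
    w_bulk, w_unet = attention_weights([bulk_feat, unet_feat])
    fused = [w_bulk * b + w_unet * u for b, u in zip(bulk_feat, unet_feat)]
    return fused, (w_bulk, w_unet)


# Hypothetical features for one cell from the two parallel branches.
bulk_branch = [1.0, 2.0, 3.0]
unet_branch = [0.5, 0.5, 0.5]

print("fixed 0.5/0.5:", fixed_fusion(bulk_branch, unet_branch))
fused, weights = attention_fusion(bulk_branch, unet_branch)
print("attention weights (bulk, U-Net):", weights)
print("attention-fused:", fused)
```

With fixed weights every cell receives the same mixture of the two branches, whereas data-dependent weights let the mixture vary per input, which is the property the attention-based fusion is intended to provide.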
In the evaluation, we examine the imputation performance of scDTL and other state-of-the-art baseline methods through extensive experiments, including quantitative metrics, clustering analysis, PAGA, and pseudotime analysis. The results demonstrate that scDTL outperforms other solutions in most cases across eight scRNA-seq datasets. In the future, we would like to further enhance scDTL by incorporating unsupervised learning methods such as contrastive learning, which allow the model to explore cell representations in a self-supervised manner and thus to capture single-cell gene expression features without labeled data.
Key Points
scDTL is a deep transfer learning based approach for scRNA-seq data imputation by harnessing the bulk RNA-sequencing information.
We compared scDTL with nine other imputation methods using eight datasets and scDTL achieved the best performance in missing data recovery.
scDTL greatly improved the performance of cell clustering and trajectory analysis and outperformed other methods in most cases.
Supplementary Material
Contributor Information
Liuyang Zhao, College of Computer Science and Software Engineering, Shenzhen University, Guangdong 518057, China.
Landu Jiang, College of Future Technology, HKUST(GZ), Guangdong 510641, China.
Yufeng Xie, Shenzhen Hospital of Guangzhou University of Chinese Medicine (Futian), Guangdong 518034, China.
JianHao Huang, Shenzhen Hospital of Guangzhou University of Chinese Medicine (Futian), Guangdong 518034, China.
Haoran Xie, Department of Computing and Decision Sciences, Lingnan University, Hong Kong Special Administrative Region 999077, China.
Jun Tian, Department of Biochemistry, School of Medicine, Southern University of Science and Technology, Guangdong 518055, China; Key University Laboratory of Metabolism and Health of Guangdong, Southern University of Science and Technology, Shenzhen 518055, China.
Dian Zhang, College of Computer Science and Software Engineering, Shenzhen University, Guangdong 518057, China.
Author contributions
L.Z., L.J., J.T., and D.Z. participated in the design and execution of the study. Y.X., J.H., and H.X. performed the data curation and analysis. L.Z., L.J., and J.T. wrote the original draft, with all authors contributing to writing and providing feedback. L.J., J.T., and D.Z. supervised all aspects of the research.
Conflict of interest: None declared.
Funding
This work is supported by Stable Support Project of Shenzhen (Project No. 20231122145548001), JCYJ20220531091407016, Futian Healthcare Research Project (No. FTWS055, FTWS069), Shenzhen Hospital (Futian) of Guangzhou University of Chinese Medicine Research Project (No. GZYSY2024010), Guangdong Province Key Laboratory of Popular High Performance Computers 2017B030314073, and Guangdong Provincial Department of Education Youth Talent Project (No. 2024KQNCX052).
Availability and Implementation
The code and data of scDTL are available at: https://github.com/bleedingseraphY/scDTL.git.
References
- 1. Lee J, Kim S, Hyun D. et al. Deep single-cell RNA-seq data clustering with graph prototypical contrastive learning. Bioinformatics 2023;39:btad342. 10.1093/bioinformatics/btad342.
- 2. Van Dijk D, Sharma R, Nainys J. et al. Recovering gene interactions from single-cell data using data diffusion. Cell 2018;174:716–729.e27. 10.1016/j.cell.2018.05.061.
- 3. Li WV, Li JJ. An accurate and robust imputation method scimpute for single-cell RNA-seq data. Nat Commun 2018;9:997. 10.1038/s41467-018-03405-7.
- 4. Wen Z-H, Langsam JL, Zhang L. et al. A bayesian factorization method to recover single-cell RNA sequencing data. Cell Rep Methods 2022;2:100133. 10.1016/j.crmeth.2021.100133.
- 5. Gong W, Kwak I-Y, Pota P. et al. Drimpute: Imputing dropout events in single cell RNA sequencing data. BMC Bioinform 2018;19:1–10.
- 6. Wagner F, Yan Y, Yanai I. K-nearest neighbor smoothing for high-throughput single-cell RNA-seq data. BioRxiv 2018;217737. 10.1101/217737.
- 7. Arisdakessian C, Poirion O, Yunits B. et al. DeepImpute: an accurate, fast, and scalable deep neural network method to impute single-cell RNA-seq data. Genome Biol 2019;20:1–14.
- 8. Deng Y, Bao F, Dai Q. et al. Scalable analysis of cell-type composition from single-cell transcriptomics using deep recurrent learning. Nat Methods 2019;16:311–4. 10.1038/s41592-019-0353-7.
- 9. Rao J, Zhou X, Yutong L. et al. Imputing single-cell RNA-seq data by combining graph convolution and autoencoder neural networks. Iscience 2021;24:102393. 10.1016/j.isci.2021.102393.
- 10. Wang J, Ma A, Chang Y. et al. scGNN is a novel graph neural network framework for single-cell RNA-seq analyses. Nat Commun 2021;12:1882. 10.1038/s41467-021-22197-x.
- 11. Shi Y, Wan J, Zhang X. et al. CL-Impute: a contrastive learning-based imputation for dropout single-cell RNA-seq data. Comput Biol Med 2023;164:107263. 10.1016/j.compbiomed.2023.107263.
- 12. Wang J, Xia J, Wang H. et al. scDCCA: deep contrastive clustering for single-cell RNA-seq data based on auto-encoder network. Brief Bioinform 2023;24:bbac625. 10.1093/bib/bbac625.
- 13. Xiong Z, Luo J, Shi W. et al. scGCL: an imputation method for scRNA-seq data based on graph contrastive learning. Bioinformatics 2023;39:btad098. 10.1093/bioinformatics/btad098.
- 14. Huang Z, Wang J, Xudong L. et al. scGGAN: single-cell RNA-seq imputation by graph-based generative adversarial network. Brief Bioinform 2023;24:bbad040. 10.1093/bib/bbad040.
- 15. Peng T, Zhu Q, Yin P. et al. Scrabble: single-cell RNA-seq imputation constrained by bulk RNA-seq data. Genome Biol 2019;20:1–12.
- 16. Chen J, Wang X, Ma A. et al. Deep transfer learning of cancer drug responses by integrating bulk and single-cell RNA-seq data. Nat Commun 2022;13:6494. 10.1038/s41467-022-34277-7.
- 17. Traag VA, Waltman L, Van Eck NJ. From Louvain to Leiden: guaranteeing well-connected communities. Sci Rep 2019;9:5233. 10.1038/s41598-019-41695-z.
- 18. Alexander Wolf F, Hamey FK, Plass M. et al. PAGA: graph abstraction reconciles clustering with trajectory inference through a topology preserving map of single cells. Genome Biol 2019;20:1–9.
- 19. Haghverdi L, Maren Büttner F, Wolf A. et al. Diffusion pseudotime robustly reconstructs lineage branching. Nat Methods 2016;13:845–8. 10.1038/nmeth.3971.
- 20. Woo S, Park J, Lee J-Y. et al. CBAM: Convolutional Block Attention Module. In: Proceedings of the European conference on computer vision (ECCV), Munich, Germany, ECCV, 2018, pp. 3–19.
- 21. Sharma A, Cao EY, Kumar V. et al. Longitudinal single-cell RNA sequencing of patient-derived primary cells reveals drug-induced infidelity in stem cell hierarchy. Nat Commun 2018;9:4931. 10.1038/s41467-018-07261-3.
- 22. Kong SL, Li H, Tai JA. et al. Concurrent single-cell RNA and targeted DNA sequencing on an automated platform for comeasurement of genomic and transcriptomic signatures. Clin Chem 2019;65:272–81. 10.1373/clinchem.2018.295717.
- 23. Aissa AF, Islam ABMMK, Ariss MM. et al. Single-cell transcriptional changes associated with drug tolerance and response to combination therapies in cancer. Nat Commun 2021;12:1628. 10.1038/s41467-021-21884-z.
- 24. Li H, Courtois ET, Sengupta D. et al. Reference component analysis of single-cell transcriptomes elucidates cellular heterogeneity in human colorectal tumors. Nat Genet 2017;49:708–18. 10.1038/ng.3818.
- 25. Tian L, Dong X, Freytag S. et al. Benchmarking single cell RNA-sequencing analysis pipelines using mixture control experiments. Nat Methods 2019;16:479–87. 10.1038/s41592-019-0425-8.
- 26. Barretina J, Caponigro G, Stransky N. et al. The cancer cell line encyclopedia enables predictive modelling of anticancer drug sensitivity. Nature 2012;483:603–7. 10.1038/nature11003.
- 27. Wolf FA, Angerer P, Theis FJ. Scanpy: Large-scale single-cell gene expression data analysis. Genome Biol 2018;19:1–5.
- 28. Sturm G, Szabo T, Fotakis G. et al. Scirpy: a scanpy extension for analyzing single-cell t-cell receptor-sequencing data. Bioinformatics 2020;36:4817–8. 10.1093/bioinformatics/btaa611.
- 29. Ganin Y, Ustinova E, Ajakan H. et al. Domain-adversarial training of neural networks. J Mach Learn Res 2016;17:1–35.
- 30. Hou W, Ji Z, Ji H. et al. A systematic evaluation of single-cell RNA-sequencing imputation methods. Genome Biol 2020;21:1–30. 10.1186/s13059-020-02132-x.
- 31. Junlin X, Cai L, Liao B. et al. CMF-Impute: an accurate imputation tool for single-cell RNA-seq data. Bioinformatics 2020;36:3139–47.
- 32. Xiaobin W, Zhou Y. Ge-impute: Graph embedding-based imputation for single-cell RNA-seq data. Brief Bioinform 2022;23:bbac313.
- 33. Wang J, Agarwal D, Huang M. et al. Data denoising with transfer learning in single-cell transcriptomics. Nat Methods 2019;16:875–8. 10.1038/s41592-019-0537-1.
- 34. Deng Y, Bao F, Dai Q. et al. Scalable analysis of cell-type composition from single-cell transcriptomics using deep recurrent learning. Nat Methods 2019;16:311–4. 10.1038/s41592-019-0353-7.
- 35. Lopez R, Regier J, Cole MB. et al. Deep generative modeling for single-cell transcriptomics. Nat Methods 2018;15:1053–8. 10.1038/s41592-018-0229-2.
- 36. Hubert L, Arabie P. Comparing partitions. J Classif 1985;2:193–218. 10.1007/BF01908075.
- 37. Ji Z, Ji H. TSCAN: pseudo-time reconstruction and evaluation in single-cell RNA-seq analysis. Nucleic Acids Res 2016;44:e117–7. 10.1093/nar/gkw430.
- 38. Ji Z, Ji H. TSCAN: pseudo-time reconstruction and evaluation in single-cell RNA-seq analysis. Nucleic Acids Res 2016;44:e117–7. 10.1093/nar/gkw430.