Abstract
Single-cell RNA sequencing (scRNA-seq) has emerged as a powerful tool to gain biological insights at the cellular level. However, due to technical limitations of the existing sequencing technologies, low gene expression values are often omitted, leading to inaccurate gene counts. Existing methods, including advanced deep learning techniques, struggle to reliably impute gene expressions because they lack mechanisms that explicitly incorporate the underlying biological knowledge of the system. In reality, it has long been recognized that gene–gene interactions may serve as reflective indicators of underlying biological processes, presenting discriminative signatures of the cells. A genomic data analysis framework that leverages these gene–gene interactions is thus highly desirable, as it could allow more reliable identification of distinctive patterns through extraction and integration of the intricate biological characteristics of the data. Here we tackle the problem in two steps to exploit the gene–gene interactions of the system. We first reposition the genes into a 2D grid such that their spatial configuration reflects their interactive relationships. To alleviate the need for labeled ground truth gene expression datasets, a self-supervised 2D convolutional neural network is then employed to extract the contextual features of the interactions from the spatially configured genes and impute the omitted values. Extensive experiments with both simulated and experimental scRNA-seq datasets demonstrate the superior performance of the proposed strategy over existing imputation methods.
Keywords: deep learning, gene–gene interactions, scRNA-seq data imputation, genomic data analysis
INTRODUCTION
The success of transcriptomic studies, such as differential expression analysis [1–3], cell subpopulation identification [4–6], cell trajectory reconstruction [7–9] and alternative splicing detection [10, 11], depends critically on the accuracy of the gene expression counts. In practice, gene expression data often suffer from low transcript capture efficiency and technical noise, leading to inaccurate gene counts. Thus, recovery of the expression values of the genes using computational techniques is critical for the downstream applications.
Broadly, gene expression recovery techniques can be divided into four main categories: (i) low-rank matrix-based approaches, (ii) probabilistic-based models, (iii) nearest neighbor-based techniques and (iv) deep learning methods. Methods from the first category rely on the use of low-rank matrix-based techniques. For example, scRMD [12] utilizes the Alternating Direction Method of Multipliers to obtain the low-rank matrix of the original gene expression matrix. As another example, McImpute [13] employs a matrix completion approach for data imputation, considering both gene–gene and cell–cell relationships. SAVER [14], SAVER-X [15], BayNorm [16] and scImpute [17] are the prominent techniques of the second group. SAVER [14] and BayNorm [16] are two Bayesian approaches that use the gene–gene relationships to estimate the expression levels of the omitted genes. scImpute [17] identifies and imputes the dropout gene expression values by applying a known statistical model. The third group uses the information from the neighboring genes to interpolate the omitted gene expressions. MAGIC [18] explores information across similar cells and utilizes a Markov matrix to denoise the gene expression data while imputing the dropouts. DrImpute [19] is another method of this group, which integrates information from similar cells and adopts expression averaging for data recovery. AutoImpute [20], dca [21], DeepImpute [22], scVI [23] and scScope [24] are the representative methods of the fourth category. These methods use different network architectures such as autoencoder (AutoImpute [20] and dca [21]), multilayer perceptron (DeepImpute [22]) and recurrent neural network (scScope [24]) to extract information from similar cells and genes for data recovery.
These existing methods suffer from limitations in accuracy, computational efficiency or both, which restricts their practical applications. For the first group (e.g. scRMD, McImpute), the low-rank assumption may not always hold for real-world single-cell RNA sequencing (scRNA-seq) data. Similarly, the second group (e.g. SAVER) assumes a specific distribution of gene expression data that may not be sufficiently accurate in practical scenarios. Methods of the third type, such as MAGIC, rely on expression averaging for imputation, which may remove genuine variability in gene expression. Deep learning techniques have limited receptive field sizes and may lead to suboptimal results because of their difficulty in capturing long-range relationships between cells and genes.
It is well known that gene–gene interaction patterns are unique to biological systems and present discriminative signatures of the cells involved. Here we leverage this interactive information and establish a transform-and-conquer expression recovery (TCER) strategy to tackle the gene imputation problem. First, we transform the gene expression data into an image format, referred to as the GenoMap, based on the interactions among the genes. In this step, the gene–gene interactions of the system are encoded by placing the genes such that strongly interacting genes are close to each other. This transformation maps the gene–gene interactions into a spatially configured format and enables a deep neural network (DNN) to exploit the interactions more effectively. The method thus mitigates the limited receptive field of conventional CNN-based deep learning approaches and facilitates full exploitation of the gene–gene relationships. In Figure 1, we present visualizations of some GenoMaps from the mouse intestinal epithelium dataset. GenoMaps in each row are randomly selected from the same cell type group. It is worth noting that similar patterns exist among GenoMaps within the same cell type.
Figure 1.
Visualization of GenoMaps of different cell types for the mouse intestinal epithelium dataset. Each row shows GenoMaps from eight different cells of the same cell type.
To extract deep interaction information from a GenoMap for the recovery of missing expression values, a novel encoder-decoder architecture, referred to as expression recovery network (ER-Net), is designed. In ER-Net, we include three cascaded Deformable Fusion Attention (DFA) modules between an encoder and a decoder. Each DFA module includes a deformable convolution layer so that its kernel shape can be adaptively adjusted according to the input feature maps. The deformable kernels enable the network to explore gene–gene relationships more flexibly and enhance the capability for the network to discover the underlying patterns in GenoMaps. Additionally, the network uses a dual-attention (channel-wise and pixel-wise attention) mechanism to adaptively assign higher values to more important channels and positions for high-performance expression recovery.
Extensive experiments on the simulated scRNA-seq data and six real-world scRNA-seq datasets are performed. We demonstrate that the proposed method substantially outperforms the existing ones in terms of imputation accuracy. We show that the recovered data using the proposed technique also yields the best outcomes of cell clustering and trajectory analyses.
RESULTS
TCER enables reliable gene imputations
We use the Splatter simulator [25] to simulate reference scRNA-seq data based on a gamma-Poisson distribution for 10,000 cells, each with 2,400 genes. We set the number of cell groups to 5, with a probability of 0.2 for each group. To imitate the experimental scRNA-seq data acquisition process, we sample the reference data following SAVER [14] to generate an observation dataset with a 1% efficiency loss. We compare the performance of TCER with two existing gene imputation methods, scVI [23] and MAGIC [18]. In Figure 2, we show t-SNE [26] and UMAP [27] visualizations of the reference data (first column), observation data (second column), and results from TCER (third column), scVI (fourth column) and MAGIC (fifth column). Our method shows better-separated clusters than the observation data and the imputed data from scVI and MAGIC; scVI performs relatively better than MAGIC. We calculate clustering accuracy and cluster quality indices to quantitatively evaluate the clusters resulting from the different methods. As shown in Figure 2, TCER greatly outperforms the other methods. Note that the appearance of the UMAP clustering may vary with different initializations. To address this, we repeated the UMAP embedding and the corresponding quantitative evaluation 500 times; the mean and standard deviation of these results are also presented in Figure 2.
Figure 2.
Analyses of simulation dataset. (A) t-SNE and UMAP visualizations of reference (first column), observed data (second column) and imputed data by TCER (third column), scVI (fourth column) and MAGIC (fifth column). (B) The clustering accuracy and cluster quality indices for UMAP visualizations of the reference and observed data, and imputed data using different methods.
TCER also outperforms other methods in real-world settings. To demonstrate this, we conduct experiments using four experimental scRNA-seq datasets: (1) mouse intestinal epithelium [28], (2) engineered 3D neural tissues [29], (3) mammalian brains [30] and (4) cellular taxonomy [31]. Details of the datasets are described in the Datasets section (see Section 5). For each dataset, we first select cells and genes with high expression levels and use them as the reference data, following the procedure described in SAVER [14]. We then sample the reference data using a negative binomial model [14] to generate observation datasets at different efficiency losses. In Figure 3, we show the GenoMaps of the reference and observation datasets, together with the GenoMaps recovered by TCER. The recovered GenoMaps correlate strongly with the reference ones.
Figure 3.
GenoMaps of the reference, observed and recovered data from different datasets sampled at different efficiencies. Each row displays samples from the same dataset at different efficiency losses (0.75, 0.5 and 0.4% sampling efficiency). Datasets include (A) mouse intestinal epithelium, (B) 3D neural tissue data, (C) mammalian brain and (D) cellular taxonomy.
We present the UMAP visualizations of the original data and the data imputed by different methods (TCER, dca, scScope and scImpute) for all four datasets in Figure 4. UMAP visualizations for additional imputation methods are included in Section 1.1 of the Supplementary Materials. Compared with the results from the other methods, the UMAP visualizations of our imputation results correlate better with the reference data (first column) and show superior clustering quality. Data distortions are observed in the results from the other methods in all four datasets.
Figure 4.
UMAP visualizations of the reference (first column) and observed data (second column), and the imputed results from TCER, dca, scScope and scImpute. Here, (A) mouse intestinal epithelium is sampled at 0.75% efficiency, (B) 3D neural tissue data are sampled at 0.5% efficiency, (C) mammalian brain is sampled at 0.4% efficiency and (D) cellular taxonomy data are sampled at 0.4% efficiency.
The Pearson correlation results from the different methods are shown in Figure 5. TCER yields the best performance for all datasets, with more than a 6% improvement in Pearson coefficient over that obtained directly from the observed datasets. It is worth noting that the other six methods (MAGIC, scVI, scScope, scImpute, dca and SAVER) yield Pearson coefficients lower than that of the original data without any imputation. For example, at an efficiency loss of 0.4% for the mammalian brains dataset, the Pearson coefficients resulting from the observed data and from the TCER-, dca-, scScope- and scImpute-imputed data are 0.7653, 0.7974, 0.1820, −0.2013 and 0.6208, respectively.
Figure 5.
Box plots of the Pearson coefficient between the reference and imputed data from different methods (MAGIC, scVI, scScope, scImpute, dca, SAVER and TCER) for mouse intestinal epithelium (first row), 3D neural tissue data (second row), mammalian brain (third row) and cellular taxonomy (fourth row). Labels 1–3 on the x-axis indicate sampling efficiencies of 0.4, 0.5 and 0.75%, respectively.
TCER improves the accuracy of cell trajectory analysis
Cell trajectory analysis is widely used to understand the cellular differentiation process and plays an important role in the study of embryo and organ development. Here, we use two experimental datasets to demonstrate that TCER can restore the missing expressions of trajectory data and facilitate the analysis of cell trajectories. The datasets are from (1) human embryonic stem cells differentiated to embryoid bodies (EBs) [32] and (2) early zebrafish development [33]. Details of each dataset can be found in the Datasets section (see Section 5). We employ the well-established PHATE [34] to embed the high-dimensional data for trajectory visualization and analysis. The PHATE results for data imputed by different approaches are shown in Figure 6. Our imputed results (third column) contain continuous cell trajectory patterns that closely resemble the reference data (first column), whereas the observation data (second column) and the imputed results from dca (fourth column), scScope (fifth column) and scImpute (sixth column) show distortions in the trajectories. PHATE results for additional methods can be found in Section 1.1 of the Supplementary Materials. In Figure 7, we show the Pearson coefficients with respect to the reference dataset for all cases. Again, TCER achieves the best Pearson coefficients for both datasets.
Figure 6.
PHATE visualizations of reference and observed data, and imputed results from TCER, dca, scScope and scImpute for (A) EB differentiation data sampled at 0.75% efficiency and (B) zebrafish development data sampled at 0.5% efficiency. The colorbar for (A) indicates the sample collection periods: 1 (0–3 days), 2 (6–9 days), 3 (12–15 days), 4 (18–21 days) and 5 (24–27 days); the colorbar for (B) shows hours post fertilization (hpf).
Figure 7.
Box plots of the Pearson coefficient between the reference and imputed data from different methods (MAGIC, scVI, scScope, scImpute, dca, SAVER and TCER) for EB differentiation data (first row) and zebrafish development data (second row). Labels 1–3 on the x-axis indicate sampling efficiencies of 0.4, 0.5 and 0.75%, respectively.
DISCUSSION
Current scRNA-seq technologies suffer from a number of drawbacks, such as low capture efficiency, high dropout rates and measurement noise, so preprocessing of the data is required for downstream applications. In this work, an effective TCER framework is proposed that incorporates the gene–gene interactions of the system for accurate gene imputation. As revealed by the Pearson correlation analysis (Figs 5 and 7), the existing techniques can hardly improve the data quality, a limitation that has recently been pointed out by Hou et al. [35]. TCER, on the other hand, greatly improves the Pearson correlation in all cases. Moreover, the proposed technique also substantially improves clustering and trajectory analyses (Figs 3, 4 and 6).
The TCER technique consists of two important steps. The GenoMap construction is critical for the network to recover the gene expression values. To illustrate this, we created a 2D map by randomly placing the genes on its grid and processed the map with the same DNN used in TCER for three different datasets. The Pearson correlations resulting from this random map + DNN are plotted in Supplementary Section 2. The results are much inferior to those of GenoMap + DNN.
The construction of a GenoMap is relatively insensitive to noise and omitted gene expression values. This is because the GenoMap construction depends on the distribution of expression values rather than on any particular gene expression value. As a result, the TCER imputation results are robust even when some expression values are missing or noisy. The introduction of the GenoMap helps the self-supervised DNN learn the deep relationships among genes and cells for better imputation.
The deep learning architecture in TCER adopts deformable convolutional operations for the extraction of high-quality features [36]. As opposed to traditional convolution with a fixed kernel shape, deformable convolution adaptively expands its receptive field by adjusting its kernel shape according to the input feature maps. The network can thus efficiently extract both low-level (local) and high-level (global) gene–gene interaction features. A skip connection between the downsampling and upsampling layers is introduced to preserve multi-scale information. Finally, the dual attention (channel- and pixel-wise attention) in our network adaptively assigns higher weights to important channels and positions, which allows the network to emphasize important genes and differentially explore the gene–gene relationships for better data recovery.
In conclusion, we have proposed a novel TCER framework for gene expression recovery. The technique demonstrates unprecedented accuracy and reliability in gene imputation. Fundamentally different from the existing methods, TCER exploits the underlying gene–gene interaction information of a biological system via a transform-and-conquer strategy. TCER is self-supervised, and its potential applications extend well beyond genomic data processing. The same strategy should be applicable to other missing-data problems in various biomedical and biocomputational applications.
METHODS
Overview
As shown in Figure 8, the proposed TCER consists of two steps: (1) GenoMap Construction and (2) GenoMap Recovery.
Figure 8.
Pipeline of the proposed TCER. 1D gene expression data is first converted into an image format where the gene–gene interactions are reflected naturally in the spatial configuration of GenoMap. A dropout simulation strategy is then applied to simulate the dropout events where non-zero values are randomly masked. Last, the proposed ER-Net is employed to impute the masked GenoMap.
Given gene sequence data $X \in \mathbb{R}^{M \times N}$, where $M$ is the number of cells and $N$ is the number of genes in each cell, we first reconstruct the sequence data as 2D spatial data, termed a GenoMap, to represent the gene–gene relationships. Specifically, the generated GenoMaps are denoted as $G \in \mathbb{R}^{M \times m \times n}$, where $m$ and $n$ are the height and width of each map and $M$ is the number of GenoMaps. We then design an end-to-end convolutional neural network, named ER-Net, for recovering gene expression. Due to the lack of ground-truth annotation, we introduce a novel self-supervised training strategy that simulates the dropout events of real-world scRNA-seq data as self-supervision for learning the recovery of GenoMaps.
GenoMap Construction
The intuition behind GenoMap construction is to obtain a 2D spatial configuration of the genes in which their interactive relationships are properly expressed; genes with stronger interactions should have smaller Euclidean distances in the corresponding GenoMap. Specifically, to reposition the gene sequence data with $N$ genes per cell into a 2D GenoMap with a grid of $m \times n$, where $m \times n \geq N$, a pairwise interaction strength matrix $T \in \mathbb{R}^{N \times N}$ is first calculated by maximizing the entropy of the whole sequence data. A projection matrix $P$ is then obtained to reconstruct the gene data on the 2D grid while maximally preserving the pairwise interactions. Following [37], the pairwise interaction strength matrix is calculated as

$$T = \left| \Sigma^{-1} \right|, \qquad (1)$$

$$\Sigma_{jk} = \frac{1}{M} \sum_{i=1}^{M} \left( x_{ij} - \bar{x}_{j} \right) \left( x_{ik} - \bar{x}_{k} \right), \qquad (2)$$

where $\Sigma$ is the covariance matrix of the gene expression data, $x_{ij}$ indicates the $j$th gene expression of the $i$th cell, $\bar{x}_{j}$ is the mean expression of the $j$th gene across cells and $|\cdot|$ denotes the element-wise absolute value. The Gromov–Wasserstein discrepancy between the pairwise interaction strength matrix $T$ and the distance matrix $D$ of the 2D grid space ($m \times n$) is then minimized to obtain the optimal projection matrix $P$. Specifically, consider two grid points $i$ and $j$ with coordinates $(r_{i}, c_{i})$ and $(r_{j}, c_{j})$, respectively. The Euclidean distance $d_{ij}$ between them is defined as

$$d_{ij} = \sqrt{(r_{i} - r_{j})^{2} + (c_{i} - c_{j})^{2}}, \qquad (3)$$

where, for row-major indexing of the $m \times n$ grid,

$$r_{i} = \lceil i / n \rceil, \qquad (4)$$

$$c_{i} = i - (r_{i} - 1)\, n. \qquad (5)$$

The Gromov–Wasserstein discrepancy between the matrices $T$ and $D = (d_{ij})$ is defined as [38]

$$GW(T, D, p, q) = \min_{P \in \mathcal{C}_{p,q}} \sum_{i,k=1}^{N} \sum_{j,l=1}^{mn} L\!\left(T_{ik}, D_{jl}\right) P_{ij} P_{kl} \qquad (6)$$

and

$$\mathcal{C}_{p,q} = \left\{ P \in \mathbb{R}_{+}^{N \times mn} : P \mathbf{1}_{mn} = p,\; P^{\top} \mathbf{1}_{N} = q \right\}, \qquad (7)$$

where $p$ and $q$ are vectors that contain the relative importance of genes and of locations in the GenoMap, respectively, and the loss $L$ is the Kullback–Leibler divergence such that

$$L(a, b) = KL(a \,\|\, b) = a \log \frac{a}{b} - a + b. \qquad (8)$$

After the projection matrix is obtained, the restructured data can be written as

$$Y = X P, \qquad (9)$$

where $Y \in \mathbb{R}^{M \times mn}$ and $XP$ represents the matrix multiplication. $Y$ is then reshaped into the image format $G \in \mathbb{R}^{M \times m \times n}$, i.e. the restructured GenoMap. More detailed explanations can be found in ref. [39].
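To make this construction concrete, the following is a minimal Python sketch using the POT (Python Optimal Transport) library; the paper's own implementation is in MATLAB (see ref. [39]), so the covariance regularization, the uniform importance vectors and the use of the squared loss (rather than the KL loss of Equation (8)) are simplifying assumptions made here for numerical robustness.

```python
import numpy as np
import ot  # POT: Python Optimal Transport

def build_genomaps(X, m, n):
    """X: (M cells x N genes) expression matrix; returns (M, m, n) GenoMaps (sketch)."""
    M, N = X.shape
    assert m * n >= N, "grid must have at least as many locations as genes"
    # (1)-(2): interaction strength from the maximum-entropy (Gaussian) model,
    # taken here as the magnitude of the regularized precision-matrix entries.
    cov = np.cov(X, rowvar=False) + 1e-6 * np.eye(N)
    T = np.abs(np.linalg.inv(cov))                       # N x N interaction matrix
    # (3)-(5): Euclidean distance matrix of the m x n grid locations.
    rows, cols = np.meshgrid(np.arange(m), np.arange(n), indexing="ij")
    coords = np.stack([rows.ravel(), cols.ravel()], axis=1).astype(float)
    D = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    # (6)-(8): Gromov-Wasserstein coupling between genes and grid locations,
    # with uniform importance vectors p and q (squared loss used for robustness).
    p, q = ot.unif(N), ot.unif(m * n)
    P = ot.gromov.gromov_wasserstein(T, D, p, q, loss_fun="square_loss")
    # (9): project expression onto the grid and reshape each cell into an image.
    Y = X @ P                                            # (M x mn)
    return Y.reshape(M, m, n)
```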
ER-Net
As shown in Figure 9, our proposed ER-Net follows a standard encoder-decoder structure [40] in which we insert several cascaded DFA modules in the bottleneck to further learn global and local gene–gene interactions. In addition, we apply skip connections with channel- and spatial-wise attention to flexibly preserve information from shallow layers. More details of ER-Net are given below.
Figure 9.
Network structure of ER-Net. (A) ER-Net follows an encoder-decoder structure and employs three cascaded DFA modules with deformable convolution to extract both local and global features of gene–gene interactions. (B) Detailed structure of the proposed DFA module.
Encoder-Decoder structure
In ER-Net, the encoder first produces feature maps at different resolutions by consecutively downsampling the 2D image features. Three cascaded DFA modules (see the DFA module section) are then applied to explore the low-resolution feature maps with dynamic, deformable receptive fields. Next, the decoder restores the 2D GenoMap at the original resolution by combining the upsampled features with the skip features from the encoder at different resolutions. Specifically, in the encoder, we use one convolutional layer with a stride of 1 and two convolutional layers with a stride of 2 for downsampling; in the decoder, we use two transposed convolutional layers with a stride of 2 and one convolutional layer with a stride of 1 for upsampling. The downsampling and upsampling ratios are therefore both 4.
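As a rough illustration of this layout, the PyTorch sketch below wires a stride-1 convolution and two stride-2 convolutions for downsampling against the mirrored transposed convolutions for upsampling; the channel widths, kernel sizes and activations are assumptions, and the DFA bottleneck and attentive skip connections described next are only indicated by a placeholder.

```python
import torch.nn as nn

class EncoderDecoder(nn.Module):
    """Sketch of the ER-Net backbone: down/upsampling ratio of 4 with a pluggable bottleneck."""
    def __init__(self, in_ch=1, base=32, bottleneck=None):
        super().__init__()
        self.down = nn.Sequential(
            nn.Conv2d(in_ch, base, 3, stride=1, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(base, base * 2, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(base * 2, base * 4, 3, stride=2, padding=1), nn.ReLU(inplace=True))
        # In ER-Net this slot would hold the three cascaded DFA modules.
        self.bottleneck = bottleneck if bottleneck is not None else nn.Identity()
        self.up = nn.Sequential(
            nn.ConvTranspose2d(base * 4, base * 2, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(base * 2, base, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(base, in_ch, 3, stride=1, padding=1))

    def forward(self, x):
        return self.up(self.bottleneck(self.down(x)))
```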
DFA module
Following the idea of the fusion attention module [41] and deformable convolution [36], we design a DFA module to better exploit the feature representation for recovering 2D GenoMaps. As shown in Figure 9(B), each DFA module consists of two convolutional layers, one ReLU layer [42], one deformable convolutional layer, channel attention and pixel attention. Different from the fusion attention module in FFA-Net [41], which only includes standard convolution with a fixed grid, here we employ a deformable convolutional layer [36] that allows the kernel shape to deform. Such adaptive kernels can expand the receptive field flexibly, since the deformed sampling grid amounts to a combination of various transformations, e.g. scaling, rotation, etc. [36]. Specifically, for a standard convolution operation, the kernel is usually a fixed $k \times k$ square, where $k$ is an odd integer. Denote the output feature map of the standard convolution as $F$. Then, in the $l$th layer, $F^{l}$ at position $(x, y)$ can be described as

$$F^{l}(x, y) = \sum_{c} \sum_{(i, j)} w^{l}(c, i, j) \cdot F^{l-1}(c,\, x + i,\, y + j), \qquad (10)$$

where $\cdot$ denotes multiplication, $c$ enumerates all the channels of the input feature map $F^{l-1}$, $w^{l}$ denotes the kernel weights and $(i, j)$ indicates the offset of the sampled grid at position $(x, y)$, with $i, j \in \{-\frac{k-1}{2}, \dots, \frac{k-1}{2}\}$. For the deformable convolution operation, denote the output feature map as $\tilde{F}$. Then, in the $l$th layer, $\tilde{F}^{l}$ at position $(x, y)$ can be described as

$$\tilde{F}^{l}(x, y) = \sum_{c} \sum_{(i, j)} w^{l}(c, i, j) \cdot F^{l-1}(c,\, x + i + \Delta i,\, y + j + \Delta j), \qquad (11)$$

where $(\Delta i, \Delta j)$ indicates the learned offsets to the fixed sampling grid of the standard convolution. These offsets are learned through standard convolution layers with $F^{l-1}$ as the input and are optimized together with the kernels during training. Therefore, by applying deformable convolution, features at any location inside the feature map (e.g. any position $(x + i + \Delta i,\, y + j + \Delta j)$) can be considered, instead of only the neighboring features, thus improving the representation capability of the network. Furthermore, we employ a channel attention module that adaptively assigns different weights to distinct channels and a pixel attention module that adaptively assigns different weights to pixels at different locations. Such a dual-attention mechanism automatically pays more attention to important features at different positions in the feature maps, which helps the network capture more useful information. Residual learning is also included in the DFA module to prevent degradation of the information [43].
Skip connection with attention
Inspired by U-Net [44], we adopt skip connections between feature maps from the upsampling layers and the corresponding downsampling layers at different resolutions, allowing low-level information to flow through the whole network, which benefits GenoMap restoration. Following CBAM [45], we also apply channel- and spatial-wise attention mechanisms to adaptively assign higher weights to important channels and positions, fully exploiting features at different gene expression levels.
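The sketch below illustrates one possible CBAM-style gate applied to an encoder skip feature before it is merged with the corresponding decoder feature; the kernel size and reduction ratio are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentiveSkip(nn.Module):
    """Sketch of a CBAM-style channel + spatial attention gate for a skip connection."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.channel_mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1))
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, skip):
        # Channel attention from average- and max-pooled descriptors.
        avg = self.channel_mlp(F.adaptive_avg_pool2d(skip, 1))
        mx = self.channel_mlp(F.adaptive_max_pool2d(skip, 1))
        skip = skip * torch.sigmoid(avg + mx)
        # Spatial attention from channel-wise average and max maps.
        s = torch.cat([skip.mean(dim=1, keepdim=True),
                       skip.max(dim=1, keepdim=True).values], dim=1)
        return skip * torch.sigmoid(self.spatial(s))
```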
Self-supervised training
Due to the lack of true gene expression data, we propose a self-supervised training strategy that utilizes simulated dropout data as self-supervision for training the recovery network ER-Net. An $\ell_1$ loss in both the spatial and frequency domains is used during training for accurate restoration.
Dropout simulation for self-supervision
Inspired by the fact that real-world scRNA-seq data usually suffer from high dropout rates [46], we propose a self-supervised strategy that uses dropout simulation to imitate dropout events during training. In essence, random masking is applied to generate the input data for self-supervision. Such a strategy not only removes the need for ground-truth annotation but also helps ER-Net learn the common representation shared among GenoMaps. Specifically, given a batch of GenoMaps $G_B \in \mathbb{R}^{B \times m \times n}$ with $B$ as the batch size, a dropout ratio is randomly selected for each GenoMap within $[\gamma_{low}, \gamma_{high}]$ at each training step, where $\gamma_{low}$ and $\gamma_{high}$ are fractions indicating the lower and upper bounds, respectively. Suppose that at training step $t$ the randomly selected dropout ratio for GenoMap $G_b$ is denoted as $\gamma_t$. The pixel indices of the non-zero values inside $G_b$ are denoted as $\Omega = \{\omega_1, \omega_2, \dots, \omega_K\}$, where $K$ is the number of non-zero values in $G_b$. A subset $\Omega_t \subset \Omega$ is then randomly selected based on $\gamma_t$, where the number of pixel indices in $\Omega_t$ is $\lfloor \gamma_t K \rfloor$. A placeholder $\epsilon$ is then used to replace the selected non-zero values, indicating the masked positions:

$$\tilde{G}_b^{\,t}(\omega) = \epsilon, \quad \forall\, \omega \in \Omega_t, \qquad (12)$$

where $\epsilon$ is a fraction number. Besides, since scRNA-seq data are always sparse and contain many true zeros, to help the network learn to distinguish true zeros from dropout events, we also use $\epsilon$ to replace all zero values in $G_b$:

$$\tilde{G}_b^{\,t}(\omega) = \epsilon, \quad \forall\, \omega \in \Omega_0, \qquad (13)$$

where $\Omega_0$ indicates all the pixel indices with zero values in $G_b$. Denoting the corresponding dropout GenoMap at training step $t$ as $\tilde{G}_b^{\,t}$, the random dropout simulation process can be summarized as

$$\tilde{G}_b^{\,t} = \mathcal{D}_{\gamma_t, \epsilon}\!\left(G_b\right), \qquad (14)$$

where $\mathcal{D}_{\gamma_t, \epsilon}(\cdot)$ denotes the random masking operation defined by Equations (12) and (13). By applying this random dropout strategy, ER-Net learns shared patterns among all the GenoMaps, which contributes to better data recovery.
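A minimal PyTorch sketch of this dropout simulation is shown below; the bounds gamma_low and gamma_high are illustrative placeholders (their values are not reproduced here), while the placeholder value eps = 0.999 follows the Implementation details section.

```python
import torch

def simulate_dropout(G, gamma_low=0.1, gamma_high=0.5, eps=0.999):
    """Randomly mask non-zero GenoMap entries and flag true zeros with the placeholder eps
    (Equations 12-14). G: (B, m, n) batch of GenoMaps."""
    G_masked = G.clone()
    for b in range(G.shape[0]):
        gamma = torch.empty(1).uniform_(gamma_low, gamma_high).item()  # per-map dropout ratio
        nz = torch.nonzero(G[b], as_tuple=False)                       # indices of non-zero values
        k = int(gamma * nz.shape[0])
        if k > 0:
            sel = nz[torch.randperm(nz.shape[0])[:k]]                  # random subset Omega_t
            G_masked[b][sel[:, 0], sel[:, 1]] = eps                    # Eq. (12): mask simulated dropouts
        G_masked[b][G[b] == 0] = eps                                   # Eq. (13): flag true zeros
    return G_masked
```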
Overall training loss
We take both spatial and frequency domain information into consideration in the overall objective function. An $\ell_1$ loss in the spatial domain is applied to ensure accurate recovery of each pixel in the GenoMap, while a frequency loss is employed to better restore the high-frequency information. Since the goal of the proposed ER-Net is to recover $G_b$ from the input $\tilde{G}_b^{\,t}$, the optimization objective for one GenoMap can be described as

$$\mathcal{L}_b^{\,t} = \left\| \hat{G}_b^{\,t} - G_b \right\|_1 + \lambda \left\| \mathcal{F}\!\left(\hat{G}_b^{\,t}\right) - \mathcal{F}\!\left(G_b\right) \right\|_1, \qquad (15)$$

where $f_{\theta}$ indicates ER-Net with $\theta$ as its parameters, $\hat{G}_b^{\,t} = f_{\theta}(\tilde{G}_b^{\,t})$ is the recovered GenoMap, $\lambda$ is the weight of the frequency loss and $\mathcal{F}(\cdot)$ indicates the fast Fourier transform. For the overall loss at training step $t$, we have

$$\mathcal{L}^{t} = \frac{1}{B} \sum_{b=1}^{B} \mathcal{L}_b^{\,t}. \qquad (16)$$
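A compact PyTorch sketch of this objective, mirroring Equation (15) as reconstructed above and using torch.fft.fft2 for the Fourier transform, is shown below; the weight lam is an illustrative value, not the one used in the paper.

```python
import torch

def tcer_loss(recovered, target, lam=0.1):
    """Sketch of Eq. (15): l1 term in the spatial domain plus a weighted l1 term in the frequency domain."""
    spatial = torch.mean(torch.abs(recovered - target))
    freq = torch.mean(torch.abs(torch.fft.fft2(recovered) - torch.fft.fft2(target)))
    return spatial + lam * freq
```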
Generation of observation datasets
Following SAVER [14], we first select high-quality genes and cells based on their expression levels and refer to them as the true expression $X^{c}$. The observed gene sequence dataset $X^{obs}$ is then constructed by applying a Poisson–gamma mixture to $X^{c}$, which is also known as the negative binomial model. Specifically, an efficiency factor $\tau_{i}$ is sampled from a gamma distribution to simulate the uncertainty of the gene values, and a Poisson distribution is then placed on $\tau_{i} X^{c}_{ij}$ to obtain the simulated observation $X^{obs}_{ij}$:

$$\tau_{i} \sim \mathrm{Gamma}(\alpha, \beta), \qquad X^{obs}_{ij} \sim \mathrm{Poisson}\!\left(\tau_{i}\, X^{c}_{ij}\right), \qquad (17)$$

where $\alpha$ and $\beta$ are the shape and rate parameters, respectively, which control the mean and variance. $\tau_{i}$ can be considered as the efficiency loss from the true expression data $X^{c}$.
Implementation details
GenoMap construction is implemented in MATLAB, while the imputation experiments are implemented in PyTorch on one RTX 2080 GPU. For the imputation network, we adopt the Adam optimizer for network parameter optimization, setting $\beta_1$ to 0.9 and $\beta_2$ to 0.999. The batch size is set to 64, the learning rate is halved every 40 epochs and ER-Net is trained for 100 epochs. The placeholder $\epsilon$ is set to 0.999.

Moreover, to mimic the variation in efficiency across cells, $\tau_{i}$ is sampled following SAVER [14]: $\tau_{i} \sim \mathrm{Gamma}(15, 2000)$ for 0.75% efficiency, $\tau_{i} \sim \mathrm{Gamma}(10, 2000)$ for 0.5% efficiency and $\tau_{i} \sim \mathrm{Gamma}(8, 2000)$ for 0.4% efficiency.
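For illustration, a minimal NumPy sketch of the gamma-Poisson sampling of Equation (17) is given below; the function name and default seed are ours, and NumPy parameterizes the gamma distribution by scale, i.e. the reciprocal of the rate used above (e.g. shape 15 and rate 2000 give a mean efficiency of 0.75%).

```python
import numpy as np

def generate_observation(X_true, shape, rate, rng=None):
    """Sample observed counts from the true expression via a per-cell gamma-Poisson mixture (Eq. 17)."""
    if rng is None:
        rng = np.random.default_rng(0)
    M, _ = X_true.shape
    tau = rng.gamma(shape, 1.0 / rate, size=M)      # per-cell efficiency, e.g. Gamma(15, rate=2000)
    return rng.poisson(tau[:, None] * X_true)        # observed counts, same shape as X_true
```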
DATASETS
We evaluate our proposed method on six different real-world datasets. For each dataset, we generate three observation datasets with different efficiency losses. The Pearson coefficient is adopted to evaluate performance; it is calculated for each gene across all cells and is analyzed for the reference data (true expression), the observed data (input) and the imputed data (output). Details of the six real-world datasets are given below.
Mouse intestinal epithelium
Intestinal epithelial cells absorb nutrients, respond to microbes, function as a barrier and help to coordinate immune responses. The original dataset includes 53,193 individual epithelial cells from the small intestine and organoids of mice [28], which enabled the identification and characterization of previously unknown intestinal epithelial cell subtypes and their genetic characteristics. In our experiments, we select 4,072 cells and 1,776 genes in total as our dataset.
Engineered 3D neural tissues
Human engineered neural tissues are very helpful for understanding neurological diseases. The human neural cells in this dataset [29] were generated by co-culturing neuronal cells induced from human embryonic stem cells (hESCs) and astrocytic cells differentiated from hESCs at a 1:1 ratio in a 3D composite hydrogel. In our study, we choose a subset of 2,364 cells and 2,735 genes from the original dataset.
Mammalian brains
A highly scalable single-nucleus RNA-seq (sNucDrop-seq) approach [30] was developed for massively parallel scRNA-seq without enzymatic dissociation or nucleus sorting, and this dataset was acquired with it. The technology accurately resolves cellular diversity in a low-cost, high-throughput manner and provides unbiased isolation of intact single cells from complex tissues such as adult mammalian brains. In the original dataset [30], 18,194 nuclei were isolated from cortical tissues of adult mice. Through extensive evaluations, the authors showed that sNucDrop-seq not only reveals neuronal and non-neuronal subtype composition with high accuracy but can also analyze transient transcriptional states driven by neuronal activity at single-cell resolution. We select 10,360 cells and 2,344 genes from the original dataset in our experiments.
Cellular taxonomy of the mouse bone marrow stroma
Stroma plays an important role in the development, homeostasis and repair of organs. This dataset defines a cellular taxonomy of the mouse bone marrow stroma and its perturbation by malignancy using scRNA-seq [31]. Seventeen stromal subsets expressing distinct hematopoietic regulatory genes were identified, spanning new fibroblastic and osteoblastic subpopulations and including distinct osteoblast differentiation trajectories. Emerging acute myeloid leukemia impaired mesenchymal osteogenic differentiation and reduced the regulatory molecules necessary for normal hematopoiesis. This taxonomy of the stromal compartment provides a comprehensive bone marrow cell census and experimental support for cancer cell crosstalk with specific stromal elements that impairs normal tissue function and thereby enables the emerging cancer. In our study, we use a subset of 12,162 cells with 2,422 genes for each cell as our dataset.
Embryoid bodies
EB differentiation recapitulates key aspects of early embryogenesis and has been successfully used as the first step in differentiation protocols for certain types of neurons, astrocytes and oligodendrocytes, hematopoietic, endothelial and muscle cells, hepatocytes, pancreatic cells and germ cells. Approximately 31,000 cells were measured in this dataset, evenly distributed over a 27-day differentiation time course [32]. Samples were collected at 3-day intervals and pooled for measurement on the 10x Chromium platform [32, 34]. In our experiments, we selected 9,754 cells with 2,282 genes for each cell as our dataset.
Zebrafish embryogenesis
To reveal the transcriptional trajectories during the development of zebrafish embryos, this dataset [33] was profiled using the scRNA-seq technology Drop-seq. It includes 38,731 cells from 694 embryos across 12 closely spaced stages of early zebrafish development. Data were acquired from the high blastula stage (3.3 h post-fertilization, shortly after transcription starts from the zygotic genome) to the six-somite stage (12 h post-fertilization, just after gastrulation). Whereas cells are largely pluripotent at the high blastula stage, most have differentiated into specific cell types by the six-somite stage. We use 20,014 cells and 2,341 genes as our dataset in our study.
Key Points
Low gene expression values in single-cell RNA sequencing (scRNA-seq) data are frequently discarded due to the technical constraints of existing sequencing technologies. Such omission usually results in unreliable gene counts.
Gene expression data are transformed into a two-dimensional image format, termed the GenoMap. In this spatial arrangement, gene–gene interactions are explicitly represented, thereby aiding the extraction of deep relationships among genes.
A self-supervised 2D convolutional neural network is deployed for gene imputation. Through comprehensive experiments on multiple real-world scRNA-seq datasets, the proposed method has demonstrated superior performance in gene imputation compared to existing approaches.
Supplementary Material
Author Biographies
Qingyue Wei is a PhD student affiliated with Lei Xing’s lab at Stanford University. She is interested in high-dimensional medical data analysis to improve diagnostic accuracy and enhance our understanding of disease mechanisms.
Md Tauhidul Islam is a Physical Science Research Scientist at Professor Lei Xing’s lab at Stanford University. His research interests include bioinformatics, single cell data analysis, deep learning and medical image analysis.
Yuyin Zhou is an Assistant Professor of Computer Science and Engineering at University of California, Santa Cruz. Her laboratory has a broad interest in computer vision and machine learning, including developing efficient deep representation learning with minimal supervision and securing model performance under (adversarial) distribution shifts.
Lei Xing is the Jacob Haimson & Sarah S. Donaldson Professor of Medical Physics and Director of Medical Physics Division of Radiation Oncology Department at Stanford University. His laboratory focuses on clinical and scientific innovation in radiation oncology and on translating recent technical advancements in physical and biological sciences and engineering into clinical practice to improve patient care.
Contributor Information
Qingyue Wei, Institute for Computational and Mathematical Engineering, Stanford University, Stanford, 94305 CA, USA.
Md Tauhidul Islam, Department of Radiation Oncology, Stanford University, Stanford, 94305 CA, USA.
Yuyin Zhou, Department of Computer Science and Engineering, University of California, Santa Cruz, Santa Cruz, 95064 CA, USA.
Lei Xing, Department of Radiation Oncology, Stanford University, Stanford, 94305 CA, USA.
FUNDING
This work was partially supported by NIH (1R01 CA223667 and R01CA227713) and a Faculty Research Award from Google Inc.
AUTHOR CONTRIBUTIONS STATEMENT
L.X. conceived the experiment(s), Q.W. and M.T.I. conducted the experiment(s), and Q.W., M.T.I. and Y.Z. analyzed the results. All authors reviewed the manuscript.
DATA AND CODE AVAILABILITY
The datasets generated during the current study, TCER imputation results along with the corresponding checkpoint and the implementation codes are all publicly available at https://github.com/aijinrjinr/TCER.
References
1. Qiu X, Hill A, Packer J, et al. Single-cell mRNA quantification and differential analysis with census. Nat Methods 2017;14(3):309–15.
2. Vu TN, Wills QF, Kalari KR, et al. Beta-Poisson model for single-cell RNA-seq data analyses. Bioinformatics 2016;32(14):2128–35.
3. Miao Z, Deng K, Wang X, Zhang X. DEsingle for detecting three types of differential expression in single-cell RNA-seq data. Bioinformatics 2018;34(18):3223–4.
4. Kiselev VY, Kirschner K, Schaub MT, et al. SC3: consensus clustering of single-cell RNA-seq data. Nat Methods 2017;14(5):483–6.
5. Xu C, Su Z. Identification of cell types from single-cell transcriptomes using a novel clustering method. Bioinformatics 2015;31(12):1974–80.
6. Lin P, Troup M, Ho JW. CIDR: ultrafast and accurate clustering through imputation for single-cell RNA-seq data. Genome Biol 2017;18(1):1–11.
7. Trapnell C, Cacchiarelli D, Grimsby J, et al. The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells. Nat Biotechnol 2014;32(4):381–6.
8. Setty M, Tadmor MD, Reich-Zeliger S, et al. Wishbone identifies bifurcating developmental trajectories from single-cell data. Nat Biotechnol 2016;34(6):637–45.
9. Street K, Risso D, Fletcher RB, et al. Slingshot: cell lineage and pseudotime inference for single-cell transcriptomics. BMC Genom 2018;19(1):1–16.
10. Welch JD, Hu Y, Prins JF. Robust detection of alternative splicing in a population of single cells. Nucleic Acids Res 2016;44(8):e73.
11. Huang Y, Sanguinetti G. BRIE: transcriptome-wide splicing quantification in single cells. Genome Biol 2017;18(1):1–11.
12. Chen C, Wu C, Wu L, et al. scRMD: imputation for single cell RNA-seq data via robust matrix decomposition. Bioinformatics 2020;36(10):3156–61.
13. Mongia A, Sengupta D, Majumdar A. McImpute: matrix completion based imputation for single cell RNA-seq data. Front Genet 2019;10:9.
14. Huang M, Wang J, Torre E, et al. SAVER: gene expression recovery for single-cell RNA sequencing. Nat Methods 2018;15(7):539–42.
15. Wang J, Agarwal D, Huang M, et al. Data denoising with transfer learning in single-cell transcriptomics. Nat Methods 2019;16(9):875–8.
16. Tang W, Bertaux F, Thomas P, et al. bayNorm: Bayesian gene expression recovery, imputation and normalization for single-cell RNA-sequencing data. Bioinformatics 2020;36(4):1174–81.
17. Li WV, Li JJ. An accurate and robust imputation method scImpute for single-cell RNA-seq data. Nat Commun 2018;9(1):1–9.
18. Van Dijk D, Sharma R, Nainys J, et al. Recovering gene interactions from single-cell data using data diffusion. Cell 2018;174(3):716–729.e27.
19. Gong W, Kwak IY, Pota P, et al. DrImpute: imputing dropout events in single cell RNA sequencing data. BMC Bioinform 2018;19(1):1–10.
20. Talwar D, Mongia A, Sengupta D, Majumdar A. AutoImpute: autoencoder based imputation of single-cell RNA-seq data. Sci Rep 2018;8(1):1–11.
21. Eraslan G, Simon LM, Mircea M, et al. Single-cell RNA-seq denoising using a deep count autoencoder. Nat Commun 2019;10(1):390.
22. Arisdakessian C, Poirion O, Yunits B, et al. DeepImpute: an accurate, fast, and scalable deep neural network method to impute single-cell RNA-seq data. Genome Biol 2019;20(1):1–14.
23. Lopez R, Regier J, Cole MB, et al. Deep generative modeling for single-cell transcriptomics. Nat Methods 2018;15(12):1053–8.
24. Deng Y, Bao F, Dai Q, et al. Scalable analysis of cell-type composition from single-cell transcriptomics using deep recurrent learning. Nat Methods 2019;16(4):311–4.
25. Zappia L, Phipson B, Oshlack A. Splatter: simulation of single-cell RNA sequencing data. Genome Biol 2017;18(1):1–15.
26. Van der Maaten L, Hinton G. Visualizing data using t-SNE. J Mach Learn Res 2008;9(11).
27. McInnes L, Healy J, Melville J. UMAP: uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426. 2018.
28. Haber AL, Biton M, Rogel N, et al. A single-cell survey of the small intestinal epithelium. Nature 2017;551(7680):333–9.
29. Tekin H, Simmons S, Cummings B, et al. Effects of 3D culturing conditions on the transcriptomic profile of stem-cell-derived neurons. Nat Biomed Eng 2018;2(7):540–54.
30. Hu P, Fabyanic E, Kwon DY, et al. Dissecting cell-type composition and activity-dependent transcriptional state in mammalian brains by massively parallel single-nucleus RNA-seq. Mol Cell 2017;68(5):1006–1015.e7.
31. Baryawno N, Przybylski D, Kowalczyk MS, et al. A cellular taxonomy of the bone marrow stroma in homeostasis and leukemia. Cell 2019;177(7):1915–1932.e16.
32. Martin GR, Evans MJ. Differentiation of clonal lines of teratocarcinoma cells: formation of embryoid bodies in vitro. Proc Natl Acad Sci 1975;72(4):1441–5.
33. Farrell JA, Wang Y, Riesenfeld SJ, et al. Single-cell reconstruction of developmental trajectories during zebrafish embryogenesis. Science 2018;360.
34. Moon KR, van Dijk D, Wang Z, et al. Visualizing structure and transitions in high-dimensional biological data. Nat Biotechnol 2019;37(12):1482–92.
35. Hou W, Ji Z, Ji H, Hicks SC. A systematic evaluation of single-cell RNA-sequencing imputation methods. Genome Biol 2020;21(1):1–30.
36. Dai J, Qi H, Xiong Y, et al. Deformable convolutional networks. In: Proceedings of the IEEE International Conference on Computer Vision. Venice, Italy: IEEE, 2017, p. 764–73.
37. Stein RR, Marks DS, Sander C. Inferring pairwise interactions from biological data using maximum-entropy probability models. PLoS Comput Biol 2015;11(7):e1004182.
38. Peyré G, Cuturi M, Solomon J. Gromov-Wasserstein averaging of kernel and distance matrices. In: International Conference on Machine Learning. New York City, USA: PMLR, 2016, p. 2664–72.
39. Islam MT, Xing L. Cartography of genomic interactions enables deep analysis of single-cell expression data. Nat Commun 2023;14(1):679.
40. Cho K, Van Merriënboer B, Gulcehre C, et al. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078. 2014.
41. Qin X, Wang Z, Bai Y, et al. FFA-Net: feature fusion attention network for single image dehazing. In: Proceedings of the AAAI Conference on Artificial Intelligence. New York City, USA: AAAI, vol. 34, 2020, p. 11908–15.
42. Agarap AF. Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375. 2018.
43. He K, Zhang X, Ren S, et al. Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV, USA: IEEE, 2016, p. 770–8.
44. Ronneberger O, Fischer P, Brox T. U-Net: convolutional networks for biomedical image segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. Munich, Germany: Springer, 2015, p. 234–41.
45. Woo S, Park J, Lee JY, et al. CBAM: convolutional block attention module. In: Proceedings of the European Conference on Computer Vision (ECCV). Munich, Germany: Springer, 2018, p. 3–19.
46. Haque A, Engel J, Teichmann SA, Lönnberg T. A practical guide to single-cell RNA-sequencing for biomedical research and clinical applications. Genome Med 2017;9(1):1–12.