Briefings in Bioinformatics. 2024 Feb 12;25(2):bbae031. doi: 10.1093/bib/bbae031

Self-supervised deep learning of gene–gene interactions for improved gene expression recovery

Qingyue Wei 1,#, Md Tauhidul Islam 2,#, Yuyin Zhou 3, Lei Xing 4
PMCID: PMC10939378  PMID: 38349062

Abstract

Single-cell RNA sequencing (scRNA-seq) has emerged as a powerful tool to gain biological insights at the cellular level. However, due to technical limitations of the existing sequencing technologies, low gene expression values are often omitted, leading to inaccurate gene counts. Existing methods, including advanced deep learning techniques, struggle to reliably impute gene expression values because they lack mechanisms that explicitly consider the underlying biological knowledge of the system. It has long been recognized that gene–gene interactions may serve as reflective indicators of underlying biological processes, presenting discriminative signatures of the cells. A genomic data analysis framework capable of leveraging the underlying gene–gene interactions is thus highly desirable and could allow for more reliable identification of distinctive patterns in genomic data through extraction and integration of their intricate biological characteristics. Here we tackle the problem in two steps to exploit the gene–gene interactions of the system. We first reposition the genes into a 2D grid such that their spatial configuration reflects their interactive relationships. To alleviate the need for labeled ground truth gene expression datasets, a self-supervised 2D convolutional neural network is then employed to extract the contextual features of the interactions from the spatially configured genes and impute the omitted values. Extensive experiments with both simulated and experimental scRNA-seq datasets are carried out to demonstrate the superior performance of the proposed strategy against existing imputation methods.

Keywords: deep learning, gene–gene interactions, scRNA-seq data imputation, genomic data analysis

INTRODUCTION

The success of transcriptomic studies, such as differential expression analysis [1–3], cell subpopulation identification [4–6], cell trajectory reconstruction [7–9] and alternative splicing detection [10, 11], depends critically on the accuracy of the gene expression counts. In practice, gene expression data often suffer from low transcript capture efficiency and technical noise, leading to inaccurate gene counts. Thus, recovery of the expression values of the genes using computational techniques is critical for the downstream applications.

Broadly, gene expression recovery techniques can be divided into four main categories: (i) low-rank matrix-based approaches, (ii) probabilistic-based models, (iii) nearest neighbor-based techniques and (iv) deep learning methods. Methods from the first category rely on the use of low-rank matrix-based techniques. For example, scRMD [12] utilizes the Alternating Direction Method of Multipliers to obtain the low-rank matrix of the original gene expression matrix. As another example, McImpute [13] employs a matrix completion approach for data imputation, considering both gene–gene and cell–cell relationships. SAVER [14], SAVER-X [15], BayNorm [16] and scImpute [17] are the prominent techniques of the second group. SAVER [14] and BayNorm [16] are two Bayesian approaches that use the gene–gene relationships to estimate the expression levels of the omitted genes. scImpute [17] identifies and imputes the dropout gene expression values by applying a known statistical model. The third group uses the information from the neighboring genes to interpolate the omitted gene expressions. MAGIC [18] explores information across similar cells and utilizes a Markov matrix to denoise the gene expression data while imputing the dropouts. DrImpute [19] is another method of this group, which integrates information from similar cells and adopts expression averaging for data recovery. AutoImpute [20], dca [21], DeepImpute [22], scVI [23] and scScope [24] are the representative methods of the fourth category. These methods use different network architectures such as autoencoder (AutoImpute [20] and dca [21]), multilayer perceptron (DeepImpute [22]) and recurrent neural network (scScope [24]) to extract information from similar cells and genes for data recovery.

These existing methods suffer from limited accuracy, limited computational efficiency, or both, which restricts their practical applications. For the first group (e.g. scRMD, McImpute), the low-rank assumption may not always hold for real-world scRNA-seq data. Similarly, the second group (e.g. SAVER) assumes a specific distribution of gene expression data that may not be sufficiently accurate in practical scenarios. The third group (e.g. MAGIC) relies on expression averaging for imputation, which may remove genuine variability in gene expression. Deep learning techniques have a limited receptive field size and may yield suboptimal results because they have difficulty capturing long-range relationships between cells and genes.

It is well known that gene–gene interaction patterns are unique to biological systems and present discriminative signatures of the cells involved. Here we leverage this interaction information and establish a transform-and-conquer expression recovery (TCER) strategy to tackle the gene imputation problem. First, we transform the gene expression data into an image format, referred to as the GenoMap, based on the interactions among the genes. In this step, the genes are placed such that strongly interacting genes lie close to each other. This transformation maps the gene–gene interactions into a spatially configured format and enables a deep neural network (DNN) to exploit the interactions more effectively. The method thus mitigates the limited receptive field of conventional CNN-based deep learning approaches and facilitates full exploitation of the gene–gene relationships. In Figure 1, we present visualizations of some GenoMaps from the mouse intestinal epithelium dataset. GenoMaps in each row are randomly selected from the same cell type group. GenoMaps within the same cell type exhibit similar patterns.

Figure 1.

Visualization of GenoMaps of different cell types for the mouse intestinal epithelium dataset. Each row shows GenoMaps from eight different cells of the same cell type.

To extract deep interaction information from a GenoMap for the recovery of missing expression values, a novel encoder-decoder architecture, referred to as expression recovery network (ER-Net), is designed. In ER-Net, we include three cascaded Deformable Fusion Attention (DFA) modules between an encoder and a decoder. Each DFA module includes a deformable convolution layer so that its kernel shape can be adaptively adjusted according to the input feature maps. The deformable kernels enable the network to explore gene–gene relationships more flexibly and enhance the capability for the network to discover the underlying patterns in GenoMaps. Additionally, the network uses a dual-attention (channel-wise and pixel-wise attention) mechanism to adaptively assign higher values to more important channels and positions for high-performance expression recovery.

Extensive experiments on simulated scRNA-seq data and six real-world scRNA-seq datasets are performed. We demonstrate that the proposed method substantially outperforms the existing ones in terms of imputation accuracy. We also show that data recovered using the proposed technique yield the best outcomes in cell clustering and trajectory analyses.

RESULTS

TCER enables reliable gene imputations

We use the Splatter simulator [25] to simulate reference scRNA-seq data based on a gamma-Poisson distribution for 10,000 cells, each with 2,400 genes. We set the number of cell groups to 5 with a probability of 0.2 for each group. To imitate the experimental scRNA-seq data acquisition process, we sample the reference data following SAVER [14] to generate an observation dataset with a 1% efficiency loss. We compare the performance of TCER with two existing gene imputation methods, scVI [23] and MAGIC [18]. In Figure 2, we show t-SNE [26] and UMAP [27] visualizations of the reference data (first column), observation data (second column), and results from TCER (third column), scVI (fourth column) and MAGIC (fifth column). Our method yields better-separated clusters than both the observation data and the data imputed by scVI and MAGIC. scVI performs relatively better than MAGIC. We calculate the clustering accuracy and quality indices to quantitatively evaluate the clusters resulting from the different methods. As shown in Figure 2, TCER greatly outperforms the other methods. Note that the appearance of the UMAP clustering may vary with different initializations. To address this, we conducted the UMAP embedding and the corresponding quantitative evaluation 500 times. The mean and standard deviation of these results are also presented in Figure 2.

Figure 2.

Analyses of simulation dataset. (A) t-SNE and UMAP visualizations of reference (first column), observed data (second column) and imputed data by TCER (third column), scVI (fourth column) and MAGIC (fifth column). (B) The clustering accuracy and cluster quality indices for UMAP visualizations of the reference and observed data, and imputed data using different methods.

TCER also outperforms other methods in real-world settings. To demonstrate this, we conduct experiments using four experimental scRNA-seq datasets: (1) mouse intestinal epithelium [28], (2) engineered 3D neural tissues [29], (3) mammalian brains [30] and (4) cellular taxonomy [31]. Details of the datasets are described in the Datasets section (see Section 5). For each dataset, we first select cells and genes with high expression levels and use them as the reference data, following the procedure described in SAVER [14]. We then sample the reference data using a negative binomial model [14] to generate observation datasets at different efficiency losses. In Figure 3, we show the GenoMaps of the reference and observation datasets, and the GenoMaps recovered by TCER. The recovered GenoMaps correlate strongly with the reference ones.

Figure 3.

GenoMaps of the reference, observed and recovered data from different datasets sampled at different efficiencies. Each row displays samples from the same dataset with different efficiency loss. Datasets include (A) mouse intestinal epithelium, (B) 3D neural tissue data, (C) mammalian brain, (D) cellular taxonomy. And Inline graphic represents 0.75% efficiency, Inline graphic indicates 0.5% efficiency, while Inline graphic means 0.4% efficiency.

We present the UMAP visualizations of the original data and the data imputed by different methods (TCER, dca, scScope and scImpute) for all four datasets in Figure 4. UMAP visualizations of additional imputation methods are included in Section 1.1 of the Supplementary Materials. Compared with the results from other methods, the UMAP visualizations of our imputation results correlate better with the reference data (the first column) and show superior clustering quality. Data distortions are observed in the results from the other methods in all four datasets.

Figure 4.

UMAP visualizations of the reference (first column) and observed data (second column), and the imputed results from TCER, dca, scScope and scImpute. Here, (A) mouse intestinal epithelium data are sampled at 0.75% efficiency, (B) 3D neural tissue data are sampled at 0.5% efficiency, (C) mammalian brain data are sampled at 0.4% efficiency and (D) cellular taxonomy data are sampled at 0.4% efficiency.

The Pearson correlation results from different methods are shown in Figure 5. TCER yields the best performance for all datasets, with more than 6% improvement in Pearson coefficients compared with those obtained directly from the observed datasets. Notably, the other six methods (MAGIC, scVI, scScope, scImpute, dca and SAVER) lead to Pearson coefficients lower than those of the original data without any imputation. For example, for an efficiency loss of 0.4% on the mammalian brain dataset, the Pearson coefficients resulting from the observed data and from the TCER-, dca-, scScope- and scImpute-imputed data are 0.7653, 0.7974, 0.1820, −0.2013 and 0.6208, respectively.

Figure 5.

Box plots of the Pearson coefficient between the reference and imputed data from different methods (MAGIC, scVI, scScope, scImpute, dca, SAVER and TCER) for mouse intestinal epithelium (first row), 3D neural tissue data (second row), mammalian brain (third row) and cellular taxonomy (fourth row). Labels 1–3 on the x-axis indicate sampling efficiencies of 0.4, 0.5 and 0.75%, respectively.

TCER improves the accuracy of cell trajectory analysis

Cell trajectory analysis is widely used to understand the cellular differentiation process and plays an important role in the study of embryo and organ development. Here, we use two experimental datasets to demonstrate that TCER can restore the missing expression values of trajectory data and facilitate the analysis of cell trajectories. The datasets are from (1) human embryonic stem cells differentiated to embryoid bodies (EBs) [32] and (2) early zebrafish development [33]. Details of each dataset can be found in the Datasets section (see Section 5). We employ the well-established PHATE [34] to embed the high-dimensional data for trajectory visualization and analysis. The PHATE results for imputed data from different approaches are shown in Figure 6. Our imputed results (third column) contain continuous cell trajectory patterns that closely resemble the reference data (first column), whereas the observation data (second column) and the imputed results from dca (fourth column), scScope (fifth column) and scImpute (sixth column) show distortions in the trajectories. PHATE results for more methods can be found in Section 1.1 of the Supplementary Materials. In Figure 7, we show the Pearson coefficients with respect to the reference dataset for all the cases. Again, the TCER method achieves the best Pearson coefficients for both datasets.

Figure 6.

PHATE visualizations of reference and observed data, and imputed results from TCER, dca, scScope and scImpute for (A) EB differentiation data sampled at 0.75% efficiency and (B) zebrafish development data sampled at 0.5% efficiency. The colorbar for (A) indicates 1-(0–3 days), 2-(6–9 days), 3-(12–15 days), 4-(18–21 days) and 5-(24–27 days) and (B) shows the hpf (hours post fertilization).

Figure 7.

Box plots of the Pearson coefficient between the reference and imputed data from different methods (MAGIC, scVI, scScope, scImpute, dca, SAVER and TCER) for EB differentiation data (first row) and zebrafish development data (second row). Labels 1–3 on the x-axis indicate sampling efficiencies of 0.4, 0.5 and 0.75%, respectively.

DISCUSSION

Current scRNA-seq technologies suffer from a number of drawbacks, such as low capture efficiency, high dropout and measurement noise, so preprocessing of the data is required for downstream applications. In this work, an effective TCER framework is proposed that incorporates the gene–gene interactions of the system for accurate gene imputation. As revealed by the Pearson correlation analysis (Figs 5 and 7), the existing techniques can hardly improve the data quality, as recently pointed out by Hou et al. [35]. TCER, on the other hand, greatly improves the Pearson correlation in all cases. Moreover, the proposed technique also improves the clustering and trajectory analysis substantially (Figs 3, 4 and 6).

The TCER technique consists of two important steps. The GenoMap construction is critical in TCER for the network to recover the gene expression values. To illustrate this, we created a 2D map by randomly placing the genes on its grid and processed this map with the same DNN used in TCER for three different datasets. The Pearson correlations resulting from this random map + DNN are plotted in Supplementary Section 2. The results are much inferior to those of GenoMap + DNN.

The construction of a GenoMap is less susceptible to noise and omitted gene expression values. This is attributed to the fact that the GenoMap construction depends on the distribution of expression values, instead of a particular gene expression value(s). Because of this, the TCER imputation results are robust even if some expression values are missing or noisy. The introduction of GenoMap helps the self-supervised DNN to learn the deep relationship among the genes and cells for better imputation.

The deep learning architecture in TCER adopts deformable convolutional operations for extraction of high-quality features [36]. As opposed to traditional convolution with a fixed kernel shape, deformable convolution adaptively expands its receptive field by adjusting its kernel shape according to the input feature maps. Thus, the network can efficiently extract both low-level (local) and high-level (global) gene–gene interaction features. A skip connection between the downsampling and upsampling layers is introduced to preserve the multi-scale information. Finally, the dual-attention mechanism (channel- and pixel-wise attention) in our network adaptively assigns higher weights to important channels and positions, which allows the network to put emphasis on the important genes and differentially explore the gene–gene relationships for better data recovery.

In conclusion, we have proposed a novel TCER framework for gene expression recovery. The technique demonstrates unprecedented accuracy and reliability in gene imputation. Fundamentally different from the existing methods, TCER exploits the underlying gene–gene interaction information of a biological system via a transform-and-conquer strategy. TCER is self-supervised and its potential applications go much beyond genomic data processing. The same strategy should be applicable to other missing data problems in various biomedical and biocomputational applications.

METHODS

Overview

As shown in Figure 8, the proposed TCER consists of two steps: (1) GenoMap Construction and (2) GenoMap Recovery.

Figure 8.

Pipeline of the proposed TCER. 1D gene expression data is first converted into an image format where the gene–gene interactions are reflected naturally in the spatial configuration of GenoMap. A dropout simulation strategy is then applied to simulate the dropout events where non-zero values are randomly masked. Last, the proposed ER-Net is employed to impute the masked GenoMap.

Given a gene sequence data Inline graphic, where Inline graphic is the number of cells while Inline graphic is the number of genes in each cell, we first reconstruct the sequence data as 2D spatial data, termed as GenoMap, to represent the gene–gene relationship. Specifically, the generated GenoMap is denoted as Inline graphic, where Inline graphic is the width and height, and Inline graphic is the number of GenoMaps. Then, we design an end-to-end convolutional neural network named ER-Net for recovering gene expression. Due to the lack of ground-truth annotation, we introduce a novel self-supervised training strategy which simulates the dropout event of scRNA-seq data in the real world as self-supervision for learning the recovery of GenoMaps.

GenoMap Construction

The intuition behind GenoMap construction is to obtain a 2D spatial configuration of the genes in which their interactive relationships are properly expressed: genes with stronger interactions should have smaller Euclidean distances in the corresponding GenoMap. To reposition the gene sequence data with Inline graphic genes for each cell into a 2D GenoMap with a grid of Inline graphic, where Inline graphic, a pairwise interaction strength matrix Inline graphic is first calculated to maximize the entropy of the whole sequence data. Then, a projection matrix Inline graphic is obtained to map the gene data onto the 2D grid while maximally preserving the pairwise interactions. The pairwise interaction strength matrix is calculated as follows [37]:

graphic file with name DmEquation1.gif (1)
graphic file with name DmEquation2.gif (2)

where Inline graphic is the covariance matrix and Inline graphic indicates the Inline graphicth gene expression of the Inline graphicth cell. Then Gromov-Wasserstein discrepancy between the pair interaction strength matrix Inline graphic and the distance matrix Inline graphic of the 2D grid space (Inline graphic) is utilized based on its minimization to obtain the optimal projection matrix Inline graphic. Specifically, assume two points i, j, with coordinates Inline graphic and Inline graphic, respectively. The Euclidean distance Inline graphic between them is defined as

graphic file with name DmEquation3.gif (3)

where

graphic file with name DmEquation4.gif (4)
graphic file with name DmEquation5.gif (5)

And the Gromov–Wasserstein discrepancy between matrices Inline graphic and Inline graphic is defined as [38]

graphic file with name DmEquation6.gif (6)

and

graphic file with name DmEquation7.gif (7)

where Inline graphic are vectors that contain relative importance of genes and locations in GenoMap. And Inline graphic indicates the Kullback–Leibler divergence such that

graphic file with name DmEquation8.gif (8)

After the projection matrix is obtained, the restructured data could be written as

graphic file with name DmEquation9.gif (9)

where Inline graphic and Inline graphic represents the matrix multiplication. Inline graphic is then reshaped into the image format Inline graphic, i.e. the restructured GenoMap. And more detailed explanations could be found in ref. [39].
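The grid side of this construction can be illustrated with a short Python sketch. It builds the pairwise Euclidean distance matrix of a 2D grid and then places one cell's expression vector on that grid. The row-major pixel indexing and the `assignment` array (a stand-in for the gene-to-pixel placement encoded by the optimized projection matrix) are assumptions made for illustration; the Gromov–Wasserstein optimization itself is not shown.

```python
import numpy as np

def grid_distance_matrix(height, width):
    """Pairwise Euclidean distances between all pixels of a height x width grid."""
    rows, cols = np.divmod(np.arange(height * width), width)  # row-major indexing (assumed)
    coords = np.stack([rows, cols], axis=1).astype(float)     # (H*W, 2)
    diff = coords[:, None, :] - coords[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))                  # (H*W, H*W)

def to_genomap(expression, assignment, height, width):
    """Place one cell's gene vector on the grid.

    assignment[g] is a hypothetical flat pixel index chosen for gene g;
    pixels that receive no gene stay at zero.
    """
    image = np.zeros(height * width)
    image[assignment] = expression
    return image.reshape(height, width)

D = grid_distance_matrix(3, 3)
genes = np.array([5.0, 0.0, 2.0, 1.0])   # toy expression values for 4 genes
assignment = np.array([4, 0, 5, 8])      # hypothetical gene-to-pixel placement
genomap = to_genomap(genes, assignment, 3, 3)
```

In this toy case, gene 0 lands on the central pixel and gene 3 on the bottom-right corner, and `D` is the fixed target geometry against which the interaction strengths would be matched.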

ER-Net

As shown in Figure 9, our proposed ER-Net follows a standard encoder-decoder structure [40] where we plug in several cascaded DFA modules in the bottleneck to further learn global and local gene–gene interactions. In addition, we apply skip connections with channel- and spatial-wise attention to flexibly preserve information from shallow layers. Below we will give more details of ER-Net.

Figure 9.

Network structure of ER-Net. (A) ER-Net follows an encoder-decoder structure and employs three cascaded DFA modules with deformable convolution to extract both local and global features of gene–gene interactions. (B) Detailed structure of the proposed DFA module.

Encoder-Decoder structure

In ER-Net, the encoder first produces feature maps at different resolutions by consecutively downsampling the 2D image features. Then three cascaded DFA modules (see details in the DFA module section) are applied to explore the low-resolution feature maps with dynamic deformable receptive fields. Next, the decoder restores the 2D GenoMap at the original resolution by combining the upsampled features with the skip features from the encoder at different resolutions. Specifically, in the encoder, we use one convolutional layer with a stride of 1 and two convolutional layers with a stride of 2 for downsampling. In the decoder, we use two transposed convolutional layers with a stride of 2 and one convolutional layer with a stride of 1 for upsampling. The down/upsampling ratios are both Inline graphic.
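To make the resolution changes concrete, the following Python sketch traces the feature-map side length through the layer sequence described above using the usual output-size formulas for standard and transposed convolutions. The kernel size of 3, padding of 1, output padding of 1 and the 40-pixel input side are assumptions chosen for illustration; they are not stated in the text.

```python
def conv_out(size, kernel, stride, padding):
    """Spatial output size of a standard convolution."""
    return (size + 2 * padding - kernel) // stride + 1

def deconv_out(size, kernel, stride, padding, output_padding=0):
    """Spatial output size of a transposed convolution."""
    return (size - 1) * stride - 2 * padding + kernel + output_padding

size = 40                       # hypothetical GenoMap side length
encoder = [size]
for stride in (1, 2, 2):        # one stride-1 layer, then two stride-2 layers
    encoder.append(conv_out(encoder[-1], kernel=3, stride=stride, padding=1))

decoder = [encoder[-1]]
for _ in range(2):              # two stride-2 transposed convolutions
    decoder.append(deconv_out(decoder[-1], kernel=3, stride=2, padding=1,
                              output_padding=1))
decoder.append(conv_out(decoder[-1], kernel=3, stride=1, padding=1))

print(encoder)  # [40, 40, 20, 10]
print(decoder)  # [10, 20, 40, 40]
```

Under these assumed settings, the two stride-2 layers shrink each side by a factor of four before the DFA modules, and the decoder mirrors the encoder so that skip features of matching size can be combined.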

DFA module

Following the idea of fusion attention module [41] and deformable convolution [36], we design a DFA module to better exploit the feature representation to recover 2D GenoMaps. As shown in Figure 9(B), each DFA module consists of two convolutional layers, one ReLU layer [42], one deformable convolutional layer, Channel Attention and Pixel Attention. Different from the Fusion Attention module in FFA-Net [41] that only includes the standard convolution with a fixed grid, here we employ the deformable convolutional layer [36] to enable the deformation of the kernel shape. Such adaptive kernels can then expand the receptive field flexibly since the deformed grid sampling are combinations of various transformations, e.g. scaling, rotation, etc. [36]. Specifically, for a standard convolution operation, the kernel shape is usually a fixed Inline graphic square. Assume Inline graphic is an odd integer. Denote the output feature map of the standard convolution as Inline graphic. Then, in the Inline graphicth layer, Inline graphic at position Inline graphic could be described as

graphic file with name DmEquation10.gif (10)

where Inline graphic denotes the multiplication, Inline graphic enumerates all the channels of the input feature map Inline graphic and Inline graphic indicate the offset of the sampled grid at position Inline graphic where Inline graphic. For the deformable convolution operation, the output feature map of the deformable convolution is denoted as Inline graphic. Then, in the Inline graphicth layer, Inline graphic at position Inline graphic can be described as

graphic file with name DmEquation11.gif (11)

where Inline graphic indicate the offsets to the fixed sampled grid in the standard convolution. These offsets are learned through standard convolution layers using Inline graphic as the input and optimized together with kernels during the training process. Therefore, by applying deformable convolution, features at any location inside this feature map could be considered altogether (e.g. any position Inline graphic) instead of only the neighboring features, thus improving the representation capability of the network. Furthermore, we employ the channel attention module which aims to adaptively give different weights to distinct channels and the pixel attention module to adaptively give various weights to pixels at different locations. Such a dual-attention mechanism can automatically pay more attention to important features at different positions in the feature maps, which helps the network to capture more useful information. Residual learning is also included in the DFA module to prevent the degradation of the information [43].
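The difference between the two operations can be sketched in a few lines of NumPy. The example below is a single-channel simplification (the network sums over input channels) in which the offsets are supplied directly rather than predicted by a convolution layer; fractional sampling positions are resolved by bilinear interpolation, as in the deformable convolution of [36]. The function names, the zero padding outside the map and the toy inputs are assumptions for illustration only.

```python
import numpy as np

def bilinear(img, y, x):
    """Bilinearly interpolate img at a fractional (y, x); zero outside the image."""
    h, w = img.shape
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    value = 0.0
    for yy, xx in ((y0, x0), (y0, x0 + 1), (y0 + 1, x0), (y0 + 1, x0 + 1)):
        if 0 <= yy < h and 0 <= xx < w:
            weight = (1 - abs(y - yy)) * (1 - abs(x - xx))
            value += weight * img[yy, xx]
    return value

def deformable_conv2d(img, kernel, offsets):
    """Single-channel deformable convolution with zero padding.

    offsets[i, j, t] holds the (dy, dx) shift for kernel tap t at output
    position (i, j); all-zero offsets reduce to a standard convolution.
    """
    h, w = img.shape
    r = kernel.shape[0] // 2
    taps = [(dy, dx) for dy in range(-r, r + 1) for dx in range(-r, r + 1)]
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            for t, (dy, dx) in enumerate(taps):
                oy, ox = offsets[i, j, t]
                sample = bilinear(img, i + dy + oy, j + dx + ox)
                out[i, j] += kernel[dy + r, dx + r] * sample
    return out

img = np.arange(25, dtype=float).reshape(5, 5)   # toy feature map
kernel = np.full((3, 3), 1.0 / 9.0)              # simple averaging kernel
no_offsets = np.zeros((5, 5, 9, 2))
standard = deformable_conv2d(img, kernel, no_offsets)

shifted_offsets = no_offsets.copy()
shifted_offsets[..., 1] = 0.5                    # every tap moves half a pixel right
shifted = deformable_conv2d(img, kernel, shifted_offsets)
```

With zero offsets the result is the ordinary 3 x 3 average, while the shifted offsets move the sampled neighborhood without changing the kernel weights. In a trained network the offsets vary per position and per tap, which is what lets the sampling grid follow the structure of the GenoMap.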

Skip connection with attention

Inspired by U-Net [44], we adopt skip connections between the feature maps of the upsampling layers and those of the corresponding downsampling layers at different resolutions, allowing low-level information to flow through the whole network, which benefits the GenoMap restoration. Following CBAM [45], we also apply channel- and spatial-wise attention mechanisms to adaptively assign higher weights to important channels and positions, so that features at different gene expression levels are fully exploited.
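The gating idea behind channel- and spatial-wise attention can be sketched as follows. This is a deliberately simplified illustration, not the CBAM formulation itself: channel weights come from a small bottleneck applied to globally averaged features, and spatial weights come from a scalar combination of the channel-wise mean and max maps (CBAM instead uses both average and max pooling for channels and a convolution for the spatial map). All weights and shapes below are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(features, w1, w2):
    """Reweight the channels of a (C, H, W) feature map using pooled statistics."""
    pooled = features.mean(axis=(1, 2))            # (C,) global average pooling
    hidden = np.maximum(w1 @ pooled, 0.0)          # bottleneck layer with ReLU
    weights = sigmoid(w2 @ hidden)                 # (C,) gates in (0, 1)
    return features * weights[:, None, None]

def spatial_attention(features, w_avg, w_max):
    """Reweight positions using channel-wise mean and max maps."""
    avg_map = features.mean(axis=0)                # (H, W)
    max_map = features.max(axis=0)                 # (H, W)
    weights = sigmoid(w_avg * avg_map + w_max * max_map)
    return features * weights[None, :, :]

rng = np.random.default_rng(0)
skip = rng.random((8, 10, 10))                     # hypothetical skip feature map
w1 = rng.normal(size=(2, 8))                       # hypothetical bottleneck weights
w2 = rng.normal(size=(8, 2))
gated = spatial_attention(channel_attention(skip, w1, w2), w_avg=1.0, w_max=1.0)
```

Both stages only rescale the skip features by factors between 0 and 1, so the shape is preserved and the decoder receives the same map with informative channels and positions emphasized.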

Self-supervised training

Because true gene expression data are not available, we propose a self-supervised training strategy that uses simulated dropout data as self-supervision for training the recovery network ER-Net. We use an Inline graphic loss in both the spatial and frequency domains during training for accurate restoration.

Dropout simulation for self-supervision

Inspired by the fact that real-world scRNA-seq data usually suffer from high dropout rates [46], we propose a self-supervised strategy that uses Dropout Simulation to imitate dropout events during training. In essence, it applies random masking to generate the input data for self-supervision. Such a strategy not only reduces the need for ground-truth annotation but also helps ER-Net to learn the common representation among the GenoMaps. Specifically, given a batch of GenoMaps Inline graphic with Inline graphic as the batch size, for each GenoMap, a dropout ratio Inline graphic is randomly selected among Inline graphic at each training step, where Inline graphic are both fractions indicating the lower and upper bounds, respectively. Suppose that at training step Inline graphic, the randomly selected dropout ratio for Inline graphic is denoted as Inline graphic. The pixel indices of non-zero values inside Inline graphic are denoted as Inline graphic, where Inline graphic is the number of non-zero values in Inline graphic and Inline graphic. Then, a subset Inline graphic is randomly selected from Inline graphic based on Inline graphic, where the number of pixel indices in Inline graphic is Inline graphic. A placeholder Inline graphic is then used to replace the selected non-zero values, marking the masked positions such that

graphic file with name DmEquation12.gif (12)

where Inline graphic is a fractional number. In addition, since scRNA-seq data are sparse and contain many true zeros, we also use Inline graphic to replace all zero values in Inline graphic, which helps the network learn to distinguish true zeros from dropout events, such that

graphic file with name DmEquation13.gif (13)

where Inline graphic indicates all the pixel indexes with zero values in Inline graphic. And the corresponding dropout GenoMap at training step Inline graphic is denoted as Inline graphic and the corresponding random dropout simulation process is then described as

graphic file with name DmEquation14.gif (14)

By applying this random dropout strategy, ER-Net could learn some shared patterns among all the GenoMaps that then contribute to better data recovery.
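The masking procedure can be sketched in Python as below. Several details are assumptions because the corresponding symbols are not legible in the text: the dropout ratio is drawn uniformly between the two bounds, the number of masked pixels is rounded down, and the same placeholder value is used for simulated dropouts and for zero entries. The bounds in the usage example are hypothetical, and the sketch only builds the network input.

```python
import numpy as np

def simulate_dropout(genomap, ratio_bounds, placeholder, rng):
    """Return a masked copy of one GenoMap and the boolean mask of hidden pixels."""
    low, high = ratio_bounds
    ratio = rng.uniform(low, high)                 # dropout ratio for this step (assumed uniform)
    flat = genomap.ravel().copy()
    nonzero_idx = np.flatnonzero(flat)             # indices of observed (non-zero) values
    n_mask = int(ratio * nonzero_idx.size)         # assumption: round down
    masked_idx = rng.choice(nonzero_idx, size=n_mask, replace=False)
    mask = np.zeros(flat.size, dtype=bool)
    mask[masked_idx] = True
    flat[flat == 0] = placeholder                  # zero entries receive the placeholder
    flat[masked_idx] = placeholder                 # simulated dropouts receive it as well
    return flat.reshape(genomap.shape), mask.reshape(genomap.shape)

rng = np.random.default_rng(0)
genomap = np.array([[0.0, 0.2, 0.5],
                    [0.7, 0.0, 0.1],
                    [0.3, 0.4, 0.0]])
masked, mask = simulate_dropout(genomap, ratio_bounds=(0.1, 0.5),
                                placeholder=0.999, rng=rng)
```

Because a new ratio and a new subset are drawn at every training step, the network sees a different corruption of the same GenoMap each time, while the unmasked GenoMap serves as the recovery target.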

Overall training loss

We take both spatial and frequency domain information into consideration for the overall objective function. Inline graphic loss in the spatial domain is applied to ensure the accurate recovery for each pixel in GenoMap while the frequency loss is employed to better restore the high frequency information. Since the goal of the proposed ER-Net is to recover Inline graphic based on the input Inline graphic, the optimization objective could then be described as

graphic file with name DmEquation15.gif (15)

where Inline graphic indicates the ER-Net with Inline graphic as its parameters, Inline graphic, Inline graphic is the weight for the frequency loss and Inline graphic indicates fast Fourier transform function. And for the overall loss at training step Inline graphic, we have

graphic file with name DmEquation16.gif (16)
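A minimal NumPy sketch of a loss with a spatial term and a frequency term is given below. The order of the norm is not legible in the text above, so it is exposed as a parameter `p` with a default of 1 as an assumption; averaging over pixels (rather than summing) and the example weight of 0.1 are likewise assumptions.

```python
import numpy as np

def recovery_loss(pred, target, freq_weight, p=1):
    """Spatial-domain plus frequency-domain reconstruction loss for one GenoMap."""
    spatial = np.mean(np.abs(pred - target) ** p)
    # Compare the 2D Fourier spectra; the difference is complex, so take its modulus.
    freq_diff = np.fft.fft2(pred) - np.fft.fft2(target)
    frequency = np.mean(np.abs(freq_diff) ** p)
    return spatial + freq_weight * frequency

target = np.ones((4, 4))
pred = np.zeros((4, 4))
loss = recovery_loss(pred, target, freq_weight=0.1)   # hypothetical weight
```

The spatial term penalizes per-pixel errors directly, while the frequency term penalizes mismatched spectral content, which is the role the text assigns to it for restoring high-frequency detail.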

Generation of observation datasets

Following SAVER [14], we first select high-quality genes and cells based on their expression levels and refer to them as the true expression Inline graphic. The observed gene sequence dataset Inline graphic is then constructed by applying Poisson–gamma mixture on Inline graphic, which is also known as the negative binomial model. Specifically, Inline graphic is sampled from a gamma distribution simulating the uncertainty of the gene values and a Poisson distribution is then placed on Inline graphic to get the simulated observation Inline graphic, which could be described as

graphic file with name DmEquation17.gif (17)

where Inline graphic are the shape and rate parameters respectively that control the mean and variance. And Inline graphic could be considered as the efficiency loss from the true expression data Inline graphic.

Implementation details

GenoMap construction is implemented in MATLAB, while the imputation experiments are implemented in PyTorch on one RTX2080 GPU. For the imputation network, we adopt the Adam optimizer, setting the learning rate as Inline graphic, Inline graphic as 0.9, Inline graphic as 0.999 and Inline graphic as Inline graphic. The batch size is set to 64 and the learning rate is halved every 40 epochs. We train ER-Net for 100 epochs. The placeholder Inline graphic is set to 0.999.
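The step decay described above (halving every 40 epochs over 100 epochs) corresponds to the following schedule. The initial learning rate did not survive extraction, so `base_lr` is a hypothetical placeholder value.

```python
def learning_rate(epoch, base_lr, decay_every=40, factor=0.5):
    """Step decay: multiply the rate by `factor` once every `decay_every` epochs."""
    return base_lr * factor ** (epoch // decay_every)

base_lr = 1e-3   # hypothetical; the value in the source is not legible
schedule = [learning_rate(epoch, base_lr) for epoch in range(100)]
```

Over 100 epochs this gives three plateaus: the base rate for epochs 0 to 39, half of it for epochs 40 to 79 and a quarter of it for epochs 80 to 99.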

Moreover, to mimic variation in efficiency across cells, Inline graphic is sampled following SAVER [14]:

  • 0.75% efficiency Inline graphic Gamma(15, 2000)

  • 0.5% efficiency Inline graphic Gamma(10, 2000)

  • 0.4% efficiency Inline graphic Gamma(8, 2000)
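The three settings above can be simulated with a short NumPy sketch. It assumes, in the spirit of SAVER-style down-sampling, that one efficiency value is drawn per cell and that the observed count is Poisson with mean equal to the efficiency times the true expression. The function name `simulate_observation` and the synthetic `true_expr` matrix are hypothetical and stand in for the real reference data.

```python
import numpy as np

# (shape, rate) pairs listed above; the mean of Gamma(shape, rate) is shape / rate.
EFFICIENCY_SETTINGS = {
    "0.75%": (15.0, 2000.0),
    "0.5%": (10.0, 2000.0),
    "0.4%": (8.0, 2000.0),
}


def simulate_observation(true_expr, shape, rate, rng):
    """Down-sample a cells x genes matrix with a Poisson-gamma mixture.

    Assumption: one efficiency value per cell, shared by all of its genes.
    """
    true_expr = np.asarray(true_expr, dtype=float)
    n_cells = true_expr.shape[0]
    # NumPy parameterises the gamma by shape and scale, so scale = 1 / rate.
    efficiency = rng.gamma(shape, 1.0 / rate, size=(n_cells, 1))
    # Poisson counts whose mean is the efficiency-scaled true expression.
    observed = rng.poisson(efficiency * true_expr)
    return observed, efficiency.ravel()


rng = np.random.default_rng(0)
# Hypothetical "true" expression: 500 cells x 50 genes with large counts.
true_expr = rng.integers(0, 5000, size=(500, 50))

for label, (shape, rate) in EFFICIENCY_SETTINGS.items():
    observed, efficiency = simulate_observation(true_expr, shape, rate, rng)
    print(label, "nominal mean:", shape / rate,
          "sampled mean:", round(float(efficiency.mean()), 5),
          "zero fraction:", round(float((observed == 0).mean()), 3))
```

The nominal means are 15/2000 = 0.0075, 10/2000 = 0.005 and 8/2000 = 0.004, matching the stated 0.75%, 0.5% and 0.4% efficiencies.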

DATASETS

We evaluate the proposed method on six real-world datasets. For each dataset, we generate three observation datasets with different efficiency losses. Performance is evaluated with the Pearson correlation coefficient, which is calculated for each gene across all cells and analyzed on the reference data (true expression), the observed data (input) and the imputed data (output). Details of the six datasets are given below.
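The per-gene evaluation can be sketched as follows, assuming matrices laid out as cells (rows) by genes (columns). The function name `per_gene_pearson` and the synthetic `reference`, `observed` and `imputed` arrays are hypothetical; the imputed matrix in particular is just a noisy copy of the reference standing in for a model output.

```python
import numpy as np


def per_gene_pearson(reference, estimate):
    """Pearson correlation of each gene (column) across cells (rows).

    Genes that are constant in either matrix get NaN, because the
    correlation is undefined when a standard deviation is zero.
    """
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    ref_c = reference - reference.mean(axis=0)
    est_c = estimate - estimate.mean(axis=0)
    numerator = (ref_c * est_c).sum(axis=0)
    denominator = np.sqrt((ref_c ** 2).sum(axis=0) * (est_c ** 2).sum(axis=0))
    result = np.full(reference.shape[1], np.nan)
    # Divide only where the denominator is positive; other entries stay NaN.
    np.divide(numerator, denominator, out=result, where=denominator > 0)
    return result


rng = np.random.default_rng(0)
reference = rng.gamma(2.0, 50.0, size=(200, 5))   # hypothetical true expression
observed = rng.poisson(0.005 * reference)          # heavily down-sampled input
imputed = reference + rng.normal(0.0, 10.0, size=reference.shape)  # stand-in output

print("observed:", np.round(per_gene_pearson(reference, observed), 3))
print("imputed: ", np.round(per_gene_pearson(reference, imputed), 3))
```

Comparing the two printed vectors gene by gene mirrors the comparison in the text: the correlation of the input with the reference gives the baseline, and the correlation of the output with the reference shows how much signal the imputation recovers.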

Mouse intestinal epithelium

Intestinal epithelial cells absorb nutrients, respond to microbes, function as a barrier and help to coordinate immune responses. The original dataset includes 53,193 individual epithelial cells from the small intestine and organoids of mice [28], which enabled the identification and characterization of previously unknown intestinal epithelial cell subtypes and their genetic characteristics. In our experiments, we select 4,072 cells and 1,776 genes in total as our dataset.

Engineered 3D neural tissues

Engineered human neural tissues are very helpful in understanding neurological diseases. In this dataset [29], neuronal cells induced from human embryonic stem cells (hESCs) were co-cultured at a 1:1 ratio with astrocytic cells differentiated from hESCs in a 3D composite hydrogel. In our study, we choose a subset of 2,364 cells and 2,735 genes from the original dataset.

Mammalian brains

A highly scalable single-nucleus RNA-seq approach (sNucDrop-seq) [30] was developed for massively parallel scRNA-seq without enzymatic dissociation or nucleus sorting, and this dataset was acquired with it. The technology can accurately resolve cellular diversity in a low-cost, high-throughput manner and provides unbiased isolation of intact single cells from complex tissues such as adult mammalian brains. In the original dataset [30], 18,194 nuclei were isolated from cortical tissues of adult mice. Through extensive evaluations, the authors show that sNucDrop-seq not only reveals neuronal and non-neuronal subtype composition with high accuracy but can also capture transient transcriptional states driven by neuronal activity at single-cell resolution. We select 10,360 cells and 2,344 genes from the original dataset for our experiments.

Cellular taxonomy of the mouse bone marrow stroma

Stroma plays an important role in the development, homeostasis and repair of organs. This dataset defines the cellular taxonomy of the mouse bone marrow stroma and its perturbation by malignancy using scRNA-seq [31]. Seventeen stromal subsets were identified expressing distinct hematopoietic regulatory genes spanning new fibroblastic and osteoblastic subpopulations including distinct osteoblast differentiation trajectories. Emerging acute myeloid leukemia impaired mesenchymal osteogenic differentiation and reduced regulatory molecules necessary for normal hematopoiesis. This taxonomy of the stromal compartment provides a comprehensive bone marrow cell census and experimental support for cancer cell crosstalk with specific stromal elements that impair normal tissue function and thus lead to new cancers. In our study, we use a subset of 12,162 cells with 2,422 genes for each cell as our dataset.

Embryoid bodies

Embryoid body (EB) differentiation outlines key aspects of early embryogenesis and has been successfully used as the first step in differentiation protocols for certain types of neurons, astrocytes and oligodendrocytes, hematopoietic, endothelial and muscle cells, hepatocytes and pancreatic cells, and germ cells. Approximately 31,000 cells were measured in this dataset, evenly distributed over a 27-day differentiation time course [32]. Samples were collected at 3-day intervals and pooled for measurement on the 10x Chromium platform [32, 34]. In our experiments, we select 9,754 cells with 2,282 genes for each cell as our dataset.

Zebrafish embryogenesis

To reveal the transcriptional trajectories during the development of zebrafish embryos, this dataset [33] was profiled using the scRNA-seq technology Drop-seq. It includes 38,731 cells from 694 embryos across 12 closely spaced stages of early zebrafish development. Data were acquired from the high blastula stage (3.3 h postfertilization, the moment after transcription starts from the zygotic genome) to the six-somite stage (12 h postfertilization, just after gastrulation). Cells are pluripotent at the high blastula stage, whereas most of them have differentiated into specific cell types by the six-somite stage. We use 20,014 cells and 2,341 genes as our dataset in our study.

Key Points

  • Low gene expression values in single-cell RNA sequencing (scRNA-seq) data are frequently discarded due to the technical constraints of existing sequencing technologies. Such omission usually results in unreliable gene counts.

  • Gene expression data are transformed into a two-dimensional image format, termed GenoMap. In this spatial arrangement, gene–gene interactions are explicitly represented, thereby aiding in the extraction of deep relationships among genes.

  • A self-supervised 2D convolutional neural network is deployed for gene imputation. Through comprehensive experiments on multiple real-world scRNA-seq datasets, the proposed method has demonstrated superior performance in gene imputation compared to existing approaches.

Supplementary Material

GenoMap_Supp_Briefings_in_Bioinformatics_revision_bbae031

Author Biographies

Qingyue Wei is a PhD student affiliated with Lei Xing’s lab at Stanford University. She is interested in high-dimensional medical data analysis to improve diagnostic accuracy and enhance our understanding of disease mechanisms.

Md Tauhidul Islam is a Physical Science Research Scientist at Professor Lei Xing’s lab at Stanford University. His research interests include bioinformatics, single cell data analysis, deep learning and medical image analysis.

Yuyin Zhou is an Assistant Professor of Computer Science and Engineering at University of California, Santa Cruz. Her laboratory has a broad interest in computer vision and machine learning, including developing efficient deep representation learning with minimal supervision and securing model performance under (adversarial) distribution shifts.

Lei Xing is the Jacob Haimson & Sarah S. Donaldson Professor of Medical Physics and Director of Medical Physics Division of Radiation Oncology Department at Stanford University. His laboratory focuses on clinical and scientific innovation in radiation oncology and on translating recent technical advancements in physical and biological sciences and engineering into clinical practice to improve patient care.

Contributor Information

Qingyue Wei, Institute for Computational and Mathematical Engineering, Stanford University, Stanford, 94305 CA, USA.

Md Tauhidul Islam, Department of Radiation Oncology, Stanford University, Stanford, 94305 CA, USA.

Yuyin Zhou, Department of Computer Science and Engineering, University of California, Santa Cruz, Santa Cruz, 95064 CA, USA.

Lei Xing, Department of Radiation Oncology, Stanford University, Stanford, 94305 CA, USA.

FUNDING

This work was partially supported by NIH (1R01 CA223667 and R01CA227713) and a Faculty Research Award from Google Inc.

AUTHOR CONTRIBUTIONS STATEMENT

L.X. conceived the experiment(s), Q.W. and M.T.I. conducted the experiment(s), and Q.W., M.T.I. and Y.Z. analyzed the results. All authors reviewed the manuscript.

DATA AND CODE AVAILABILITY

The datasets generated during the current study, the TCER imputation results with the corresponding checkpoint, and the implementation code are all publicly available at https://github.com/aijinrjinr/TCER.

References

  • 1. Qiu X, Hill A, Packer J, et al. Single-cell mRNA quantification and differential analysis with census. Nat Methods 2017;14(3):309–15.
  • 2. Vu TN, Wills QF, Kalari KR, et al. Beta-Poisson model for single-cell RNA-seq data analyses. Bioinformatics 2016;32(14):2128–35.
  • 3. Miao Z, Deng K, Wang X, Zhang X. DEsingle for detecting three types of differential expression in single-cell RNA-seq data. Bioinformatics 2018;34(18):3223–4.
  • 4. Kiselev VY, Kirschner K, Schaub MT, et al. SC3: consensus clustering of single-cell RNA-seq data. Nat Methods 2017;14(5):483–6.
  • 5. Xu C, Su Z. Identification of cell types from single-cell transcriptomes using a novel clustering method. Bioinformatics 2015;31(12):1974–80.
  • 6. Lin P, Troup M, Ho JW. CIDR: ultrafast and accurate clustering through imputation for single-cell RNA-seq data. Genome Biol 2017;18(1):1–11.
  • 7. Trapnell C, Cacchiarelli D, Grimsby J, et al. The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells. Nat Biotechnol 2014;32(4):381–6.
  • 8. Setty M, Tadmor MD, Reich-Zeliger S, et al. Wishbone identifies bifurcating developmental trajectories from single-cell data. Nat Biotechnol 2016;34(6):637–45.
  • 9. Street K, Risso D, Fletcher RB, et al. Slingshot: cell lineage and pseudotime inference for single-cell transcriptomics. BMC Genom 2018;19(1):1–16.
  • 10. Welch JD, Hu Y, Prins JF. Robust detection of alternative splicing in a population of single cells. Nucleic Acids Res 2016;44(8):e73–3.
  • 11. Huang Y, Sanguinetti G. BRIE: transcriptome-wide splicing quantification in single cells. Genome Biol 2017;18(1):1–11.
  • 12. Chen C, Wu C, Wu L, et al. scRMD: imputation for single cell RNA-seq data via robust matrix decomposition. Bioinformatics 2020;36(10):3156–61.
  • 13. Mongia A, Sengupta D, Majumdar A. McImpute: matrix completion based imputation for single cell RNA-seq data. Front Genet 2019;10:9.
  • 14. Huang M, Wang J, Torre E, et al. SAVER: gene expression recovery for single-cell RNA sequencing. Nat Methods 2018;15(7):539–42.
  • 15. Wang J, Agarwal D, Huang M, et al. Data denoising with transfer learning in single-cell transcriptomics. Nat Methods 2019;16(9):875–8.
  • 16. Tang W, Bertaux F, Thomas P, et al. bayNorm: Bayesian gene expression recovery, imputation and normalization for single-cell RNA-sequencing data. Bioinformatics 2020;36(4):1174–81.
  • 17. Li WV, Li JJ. An accurate and robust imputation method scImpute for single-cell RNA-seq data. Nat Commun 2018;9(1):1–9.
  • 18. Van Dijk D, Sharma R, Nainys J, et al. Recovering gene interactions from single-cell data using data diffusion. Cell 2018;174(3):716–729.e27.
  • 19. Gong W, Kwak IY, Pota P, et al. DrImpute: imputing dropout events in single cell RNA sequencing data. BMC Bioinform 2018;19(1):1–10.
  • 20. Talwar D, Mongia A, Sengupta D, Majumdar A. AutoImpute: autoencoder based imputation of single-cell RNA-seq data. Sci Rep 2018;8(1):1–11.
  • 21. Eraslan G, Simon LM, Mircea M, et al. Single-cell RNA-seq denoising using a deep count autoencoder. Nat Commun 2019;10(1):390.
  • 22. Arisdakessian C, Poirion O, Yunits B, et al. DeepImpute: an accurate, fast, and scalable deep neural network method to impute single-cell RNA-seq data. Genome Biol 2019;20(1):1–14.
  • 23. Lopez R, Regier J, Cole MB, et al. Deep generative modeling for single-cell transcriptomics. Nat Methods 2018;15(12):1053–8.
  • 24. Deng Y, Bao F, Dai Q, et al. Scalable analysis of cell-type composition from single-cell transcriptomics using deep recurrent learning. Nat Methods 2019;16(4):311–4.
  • 25. Zappia L, Phipson B, Oshlack A. Splatter: simulation of single-cell RNA sequencing data. Genome Biol 2017;18(1):1–15.
  • 26. Van der Maaten L, Hinton G. Visualizing data using t-SNE. J Mach Learn Res 2008;9(11).
  • 27. McInnes L, Healy J, Melville J. UMAP: uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426. 2018.
  • 28. Haber AL, Biton M, Rogel N, et al. A single-cell survey of the small intestinal epithelium. Nature 2017;551(7680):333–9.
  • 29. Tekin H, Simmons S, Cummings B, et al. Effects of 3D culturing conditions on the transcriptomic profile of stem-cell-derived neurons. Nature Biomed Eng 2018;2(7):540–54.
  • 30. Hu P, Fabyanic E, Kwon DY, et al. Dissecting cell-type composition and activity-dependent transcriptional state in mammalian brains by massively parallel single-nucleus RNA-seq. Mol Cell 2017;68(5):1006–1015.e7.
  • 31. Baryawno N, Przybylski D, Kowalczyk MS, et al. A cellular taxonomy of the bone marrow stroma in homeostasis and leukemia. Cell 2019;177(7):1915–1932.e16.
  • 32. Martin GR, Evans MJ. Differentiation of clonal lines of teratocarcinoma cells: formation of embryoid bodies in vitro. Proc Natl Acad Sci 1975;72(4):1441–5.
  • 33. Farrell JA, Wang Y, Riesenfeld SJ, et al. Single-cell reconstruction of developmental trajectories during zebrafish embryogenesis. Science 2018;360.
  • 34. Moon KR, van Dijk D, Wang Z, et al. Visualizing structure and transitions in high-dimensional biological data. Nat Biotechnol 2019;37(12):1482–92.
  • 35. Hou W, Ji Z, Ji H, Hicks SC. A systematic evaluation of single-cell RNA-sequencing imputation methods. Genome Biol 2020;21(1):1–30.
  • 36. Dai J, Qi H, Xiong Y, et al. Deformable convolutional networks. In: Proceedings of the IEEE International Conference on Computer Vision. Venice, Italy: IEEE, 2017, p. 764–73.
  • 37. Stein RR, Marks DS, Sander C. Inferring pairwise interactions from biological data using maximum-entropy probability models. PLoS Comput Biol 2015;11(7):e1004182.
  • 38. Peyré G, Cuturi M, Solomon J. Gromov-Wasserstein averaging of kernel and distance matrices. In: International Conference on Machine Learning. New York City, USA: PMLR, 2016, p. 2664–72.
  • 39. Islam MT, Xing L. Cartography of genomic interactions enables deep analysis of single-cell expression data. Nat Commun 2023;14(1):679.
  • 40. Cho K, Van Merriënboer B, Gulcehre C, et al. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078. 2014.
  • 41. Qin X, Wang Z, Bai Y, et al. FFA-Net: feature fusion attention network for single image dehazing. In: Proceedings of the AAAI Conference on Artificial Intelligence. New York City, USA: AAAI, vol. 34, 2020, p. 11908–15.
  • 42. Agarap AF. Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375. 2018.
  • 43. He K, Zhang X, Ren S, et al. Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV, USA: IEEE, 2016, p. 770–8.
  • 44. Ronneberger O, Fischer P, Brox T. U-net: Convolutional networks for biomedical image segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. Munich, Germany: Springer, 2015, p. 234–41.
  • 45. Woo S, Park J, Lee JY, et al. CBAM: Convolutional Block Attention Module. In: Proceedings of the European Conference on Computer Vision (ECCV). Munich, Germany: Springer, 2018, p. 3–19.
  • 46. Haque A, Engel J, Teichmann SA, Lönnberg T. A practical guide to single-cell RNA-sequencing for biomedical research and clinical applications. Genome Med 2017;9(1):1–12.


Articles from Briefings in Bioinformatics are provided here courtesy of Oxford University Press
