Author manuscript; available in PMC: 2022 Aug 1.
Published in final edited form as: Methods. 2020 Sep 22;192:112–119. doi: 10.1016/j.ymeth.2020.09.010

Model-Based Autoencoders for Imputing Discrete single-cell RNA-seq Data

Tian Tian 1, Martin Renqiang Min 2, Zhi Wei 1,*
PMCID: PMC8592282  NIHMSID: NIHMS1632741  PMID: 32971193

Abstract

Deep neural networks have been widely applied to missing data imputation. However, most existing studies have focused on imputing continuous data, while discrete data imputation remains under-explored. Discrete data are common in the real world, especially in bioinformatics, genetics, and biochemistry. In particular, large amounts of recent genomic data are discrete count data generated by single-cell RNA sequencing (scRNA-seq) technology. Most scRNA-seq studies produce a discrete matrix with prevailing ‘false’ zero count observations (missing values). To make downstream analyses more effective, imputation, which recovers the missing values, is often conducted as the first step in pre-processing scRNA-seq data. In this paper, we propose a novel Zero-Inflated Negative Binomial (ZINB) model-based autoencoder for imputing discrete scRNA-seq data. The novelties of our method are twofold. First, in addition to optimizing the ZINB likelihood, we propose to explicitly model the dropout events that cause missing values by using the Gumbel-Softmax distribution. Second, the zero-inflated reconstruction is further optimized with respect to the raw count matrix. Extensive experiments on simulated datasets demonstrate that the zero-inflated reconstruction significantly improves imputation accuracy. Real data experiments show that the proposed imputation can improve the separation of different cell types and the accuracy of differential expression analysis.

Keywords: Deep learning, scRNA-seq, Imputation

1. Introduction

A cell is a fundamental unit in biology. A recent revolutionary biotechnology, single-cell RNA sequencing (scRNA-seq), has made it possible to profile all gene expression activity (the transcriptome) at the single-cell level. This technology is so powerful that it has helped researchers better understand complex biological questions in many applications [25, 14]. As a result, the past few years have witnessed a surging number of studies based on scRNA-seq [19, 13, 33]. Despite the advances in measuring technologies, the analysis of scRNA-seq data remains a statistical and computational challenge. scRNA-seq generates discrete count data in the form of n × p matrices, representing the read counts mapped to the p genes across n cells. The read count values represent the relative expression levels of each gene in a cell: a higher count value means a higher relative expression level. To compare gene expression levels across samples/cells, the read counts need to be normalized for library size and other biases. Software for quality control, alignment, and read count summarization and normalization has been developed for scRNA-seq data [3]. In this study, we assume that quality control, mapping, pooling, and summarization of the raw reads have already been performed. Our study starts with the read count matrices.

Because of the low initial amount of RNA obtained from each single cell, the discrete count matrix output by scRNA-seq exhibits much higher levels of noise and many zero values. A zero value may indicate a low or zero expression level of a gene. However, during library preparation, some mRNA molecules might be lost due to the tiny initial amounts [10]. The sequencing step can also generate “false” zero values, caused by the relatively shallow sequencing depth per cell [2]. This is the so-called dropout event in scRNA-seq, which masks the true expression values of a gene with false zeros. It is common for more than 50% of the entries of the measured data matrix to be zero [12]. Given the discrete count matrix, we do not know which zeros are true and which are false. To address this issue and make downstream analysis more effective, quite a few methods have recently been proposed to impute the missing values (false zeros) caused by dropout in scRNA-seq data, including scImpute [15], SAVER [8], MAGIC [28], scVI [16] and DCA [6], as summarized below.

MAGIC [28] employs a Markov affinity-based graph for imputing scRNA-seq data. It uses a diffusion operator to learn an underlying manifold, which is represented by a nearest-neighbor graph of cells. Edges connect the most similar cells based on gene expression. During the learning process, missing values are restored. scImpute [15] first uses a Gamma-Normal mixture model to identify which zeros are caused by dropout. Genes affected by dropout are captured using a Gamma model, while the others are characterized using a normal model. Then, for genes affected by dropout, scImpute estimates their expression values based on the expression values of the same gene in other similar cells. Cell similarities are determined based on genes that are not affected by dropout. Both MAGIC and scImpute rely on pooling the data for each gene across similar cells. In contrast, SAVER [8] takes advantage of gene-to-gene relationships to recover the true expression level of each gene in each cell. SAVER is essentially a Bayesian generalized linear regression approach. It characterizes the count of each gene in each cell using a negative binomial model.

The above approaches rely on sophisticated modelling. Their performance may deteriorate when model assumptions are violated. In addition, they either exploit cell similarity or take advantage of gene-to-gene relationships, but not both. Deep learning approaches have the potential to overcome these limitations. Of note is DCA [6], which employs a deep autoencoder for imputation. An autoencoder is a kind of deep neural network (DNN) widely used to learn efficient feature representations in an unsupervised manner [7]. An autoencoder is capable of capturing complex features with nonlinear dependencies, which makes it an ideal candidate for accounting for sophisticated gene-gene interactions in a cell. Owing to the dimension compression ability of the autoencoder, DCA shares and pools information across both features (genes) and samples (cells) to improve imputation. scVI is a variational autoencoder-based method [16].

Unlike a regular autoencoder, DCA and scVI replace the conventional mean square error (MSE) loss with a zero-inflated negative binomial (ZINB) model-based loss function to better characterize scRNA-seq count data. Specifically, the final output of DCA is three layers representing the mean, dispersion, and dropout probability parameters of a ZINB model, respectively. The autoencoder is then learnt by maximizing the likelihood of the three layers of parameters in generating the original input count matrix. The ZINB model has been commonly used, and has proved effective, for modeling genomic count data with excessive zero values [22, 24, 4, 27]. The successful applications of the ZINB model, including DCA, suggest that it is effective at characterizing discrete, over-dispersed and highly sparse count data.

Although this model-based deep autoencoder approach has demonstrated superior performance, it is over-permissive in defining the ZINB model space: for every data point, it allows three parameters, namely mean, dispersion and dropout probability. This high degree of freedom can lead to an unidentifiable model and make results unstable. For example, for a zero entry, the ZINB likelihood can be maximized either by setting the mean to zero or by using a large dropout probability (close to 1). Some regularization is therefore desirable. One idea is to regularize the mean estimator to be closer to the data by adding a reconstruction loss. Ideally, the reconstruction loss is calculated from the estimated mean inflated with zeros caused by dropout events (as confirmed in our ablation experiments shown later). However, it is challenging for a ZINB model-based autoencoder to generate a reconstruction with missing values. Furthermore, it is hard to train stochastic networks with categorical variables (e.g., the dropout events here) because the backpropagation algorithm cannot be applied to non-differentiable layers.

Recently, Gumbel-Softmax, a continuous distribution on the simplex that can approximate categorical data, has been introduced. Its parameter gradients can be easily computed via the reparameterization trick [9]. Gumbel-Softmax has been applied to dimension reduction analysis of scRNA-seq datasets [30]. As mentioned above, we want to add a reconstruction loss, but the conventional mean square error (MSE) between raw counts and reconstructed numerical means may not be appropriate for it. Therefore, we propose to use a Gumbel-Softmax layer to accommodate zero-inflated count data and so add a new zero-inflated (ZI) reconstruction loss. The Gumbel-Softmax can mimic the real process that generates excessive zeros while introducing randomness. It also helps to overcome the difficulty of training stochastic networks with categorical variables. With the aid of Gumbel-Softmax, we obtain a new form of the ZINB model-based autoencoder, ZINBAE, which integrates the ZINB model-based loss with the zero-inflated (ZI) reconstruction loss. The network architecture of ZINBAE is shown in Fig. 1.

Figure 1:

Network architecture of ZINBAE. Size is marked under each layer.

2. Material and methods

2.1. Read count data pre-processing as the input

We followed the same pre-processing steps as DCA [6]. Raw scRNA-seq read count data are pre-processed with the Python package SCANPY [31]. First, genes with no count in any cell are filtered out. Second, the size factors are calculated and read counts are normalized by library size, so that total counts are the same across cells. Formally, if we denote the library size (number of total read counts) of cell i as l_i, then the size factor of cell i is l_i/median(l). The last step is log transformation and scaling of the read counts, so that the values have zero mean and unit variance. The pre-processed read count matrix ($\tilde{X}^{count}$) is treated as the input for the ZINB model-based autoencoder.
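The pre-processing steps above can be sketched in a few lines. This is a minimal standard-library illustration, not the SCANPY pipeline the authors used; the function name `preprocess` and the toy matrix are hypothetical, and the sketch assumes every cell has at least one read and scales with the population standard deviation.

```python
import math
import statistics


def preprocess(counts):
    """Sketch of the described steps on a cells x genes list of raw counts.

    Assumes every cell has a non-zero library size.
    Returns (scaled matrix, size factors, indices of kept genes).
    """
    n_genes = len(counts[0])
    # 1. Filter out genes with no count in any cell.
    keep = [j for j in range(n_genes) if any(row[j] > 0 for row in counts)]
    counts = [[row[j] for j in keep] for row in counts]
    # 2. Size factor of cell i = l_i / median(l), with l_i the library size.
    lib_sizes = [sum(row) for row in counts]
    median_lib = statistics.median(lib_sizes)
    size_factors = [l / median_lib for l in lib_sizes]
    # 3. Normalize by size factor, then log-transform with log(1 + x).
    logged = [[math.log1p(x / sf) for x in row]
              for row, sf in zip(counts, size_factors)]
    # 4. Scale each gene to zero mean and unit variance.
    scaled_cols = []
    for j in range(len(keep)):
        col = [row[j] for row in logged]
        mu = statistics.fmean(col)
        sd = statistics.pstdev(col) or 1.0  # guard against constant genes
        scaled_cols.append([(v - mu) / sd for v in col])
    scaled = [list(row) for row in zip(*scaled_cols)]
    return scaled, size_factors, keep


raw = [[1, 0, 3], [2, 0, 2], [4, 0, 4]]
scaled, size_factors, kept = preprocess(raw)
```

In this toy matrix the second gene is dropped because it is zero in every cell, and the third cell, whose library is twice the median, gets a size factor of 2.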

2.2. ZINB model-based autoencoder

The autoencoder is a type of artificial neural network that tries to reconstruct its inputs [7]. The denoising autoencoder is an autoencoder that receives corrupted data points as input and is trained to predict the original uncorrupted data points as its output [29]. We use the denoising autoencoder to learn a robust feature representation and reconstruction. Following [4, 6, 27], we employ a zero-inflated negative binomial (ZINB) model to characterize highly sparse and over-dispersed scRNA-seq data. Specifically, the proposed autoencoder produces three layers in parallel as its final output, representing the mean, dispersion and dropout probability parameters of a ZINB model, respectively (Fig. 1). We then train the ZINB model-based autoencoder (ZINBAE) by minimizing a loss function that integrates the likelihood of the three layers of parameters in generating the original input count matrix. The ZINB mixture model consists of two components: (1) a point mass at zero, which represents the probability of dropout events in the data, and (2) a negative binomial, which models the count distribution. Formally, the ZINB model is parameterized with three parameters: mean ($\mu_{ij}$), dispersion ($\theta_{ij}$) and dropout ($\pi_{ij}$):

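$$
\mathrm{NB}(X_{ij}^{count} \mid \mu_{ij}, \theta_{ij}) = \frac{\Gamma(X_{ij}^{count} + \theta_{ij})}{X_{ij}^{count}!\,\Gamma(\theta_{ij})} \left(\frac{\theta_{ij}}{\theta_{ij} + \mu_{ij}}\right)^{\theta_{ij}} \left(\frac{\mu_{ij}}{\theta_{ij} + \mu_{ij}}\right)^{X_{ij}^{count}}
$$

$$
\mathrm{ZINB}(X_{ij}^{count} \mid \pi_{ij}, \mu_{ij}, \theta_{ij}) = \pi_{ij}\,\delta_0(X_{ij}^{count}) + (1 - \pi_{ij})\,\mathrm{NB}(X_{ij}^{count} \mid \mu_{ij}, \theta_{ij})
$$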

where $X_{ij}^{count}$ is the entry of the raw count matrix for cell i and gene j.
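As a numerical check of the two densities, the following sketch evaluates them for a single entry. It uses only the standard library; the function names `nb_log_pmf` and `zinb_pmf` and the parameter values are illustrative choices, not taken from the authors' code, and the sketch assumes μ > 0 and θ > 0.

```python
import math


def nb_log_pmf(x, mu, theta):
    """Log of the negative binomial pmf with mean mu and dispersion theta."""
    return (math.lgamma(x + theta) - math.lgamma(x + 1) - math.lgamma(theta)
            + theta * math.log(theta / (theta + mu))
            + x * math.log(mu / (theta + mu)))


def zinb_pmf(x, pi, mu, theta):
    """ZINB pmf: point mass at zero with weight pi, NB with weight 1 - pi."""
    point_mass = pi if x == 0 else 0.0
    return point_mass + (1.0 - pi) * math.exp(nb_log_pmf(x, mu, theta))


# Illustrative parameters: 30% dropout, mean 5, dispersion 2.
pi, mu, theta = 0.3, 5.0, 2.0
total_mass = sum(zinb_pmf(x, pi, mu, theta) for x in range(500))
mixture_mean = sum(x * zinb_pmf(x, pi, mu, theta) for x in range(500))
```

The probabilities sum to one, and the mean of the mixture is (1 − π)μ rather than μ, which is why the raw counts and the NB mean are not directly comparable without accounting for dropout.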

We denote the output matrix of encoder as E, the output matrix of decoder as D and the output matrix of bottleneck layer as B. Formally, we have

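$$
\begin{aligned}
E &= \mathrm{ReLU}\big((\tilde{X}^{count} + e)\,W_E\big), & B &= \mathrm{ReLU}(E\,W_B), \\
D &= \mathrm{ReLU}(B\,W_D), & M &= \mathrm{diag}(l_i) \times \exp(D\,W_\mu), \\
\Theta &= \exp(D\,W_\theta), & \Pi &= \mathrm{sigmoid}(D\,W_\pi)
\end{aligned}
$$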

where M, Θ and Π represent the matrix forms of the estimated mean, dispersion and dropout probability. ReLU is the rectifier activation [21]. The input $\tilde{X}^{count}$ and the size factors $l_i$ are calculated in the pre-processing step (Section 2.1) and are included as inputs to the model. Denoising is implemented by adding Gaussian noise e to the input $\tilde{X}^{count}$. The activation function chosen for the mean and dispersion is the exponential because these parameters are non-negative, while the activation function for the dropout probability π is the sigmoid because it lies in the interval 0–1. The loss function of the ZINB model-based autoencoder is the negative log of the ZINB likelihood:

$$
L_{ZINB} = \sum_{i=1}^{n} \sum_{j=1}^{p} -\log\big(\mathrm{ZINB}(X_{ij}^{count} \mid \pi_{ij}, \mu_{ij}, \theta_{ij})\big)
$$

where n and p represent the number of cells and genes.
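To make the identifiability concern concrete, consider the contribution of a single zero entry to this loss. The short derivation below is a worked illustration that follows directly from the ZINB definition; it is not taken from the original text.

```latex
% Likelihood of a zero entry under the ZINB model
\mathrm{ZINB}(0 \mid \pi_{ij}, \mu_{ij}, \theta_{ij})
  = \pi_{ij} + (1 - \pi_{ij}) \left( \frac{\theta_{ij}}{\theta_{ij} + \mu_{ij}} \right)^{\theta_{ij}}

% Two different limits both drive the per-entry loss to its minimum of 0:
\pi_{ij} \to 1 \ (\text{any } \mu_{ij})
  \;\Rightarrow\; \mathrm{ZINB}(0 \mid \cdot) \to 1, \qquad
\mu_{ij} \to 0 \ (\text{any } \pi_{ij})
  \;\Rightarrow\; \mathrm{ZINB}(0 \mid \cdot) \to 1
```

A zero entry therefore cannot, on its own, distinguish "true zero expression" (small mean) from "dropout" (large dropout probability), which motivates an additional regularizer on the mean.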

2.3. Gumbel-Softmax layer

The ZINB model parameters have a high degree of freedom, which can lead to an unidentifiable model and make results unstable. We propose to regularize the estimated mean to be close to the data by imposing a reconstruction loss. It is challenging for the ZINB model-based autoencoder to generate a reconstruction with missing values. Furthermore, it is hard to train stochastic networks with dropout events. To overcome these issues, we propose to add a Gumbel-Softmax layer after the ZINB model-based decoder network to explicitly model the dropout events. The Gumbel-Softmax distribution is a differentiable approximation of a categorical distribution that can be incorporated into back-propagation [9]. Let π be the dropout probability; a sample s from the Gumbel-Softmax distribution is obtained by

$$
s = \frac{\exp\big((\log \pi + g_0)/\tau\big)}{\exp\big((\log \pi + g_0)/\tau\big) + \exp\big((\log(1 - \pi) + g_1)/\tau\big)}
$$

where $g_0$ and $g_1$ are sampled from a Gumbel(0, 1) distribution: we first sample u ~ Uniform(0, 1) and then calculate g = −log(−log u). When τ → 0, samples from the Gumbel-Softmax distribution approach samples from a Bernoulli distribution. In practice, too small a value of τ makes the gradient vanish. To overcome this, we use a temperature annealing schedule: we update τ every 100 iterations by τ = τ0 exp(−anneal rate × iteration number), where τ0 is 1, the anneal rate is 0.0003 and the minimum value of τ is 0.5.
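The sampling step and the temperature schedule can be sketched as follows. This is a scalar, standard-library illustration rather than the Keras layer used in the paper; the function names are hypothetical, and the way the "every 100 iterations" update is implemented (holding τ constant between updates) is one reading of the text.

```python
import math
import random


def sample_gumbel(rng):
    """Draw g = -log(-log(u)) with u ~ Uniform(0, 1)."""
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep the logs finite
    return -math.log(-math.log(u))


def gumbel_softmax_dropout(pi, tau, rng):
    """Relaxed sample s for dropout probability pi in (0, 1), as in the equation above."""
    a = (math.log(pi) + sample_gumbel(rng)) / tau
    b = (math.log(1.0 - pi) + sample_gumbel(rng)) / tau
    m = max(a, b)  # subtract the max for numerical stability
    ea, eb = math.exp(a - m), math.exp(b - m)
    return ea / (ea + eb)


def temperature(iteration, tau0=1.0, anneal_rate=0.0003, tau_min=0.5,
                update_every=100):
    """Annealed temperature, refreshed every `update_every` iterations (assumed reading)."""
    step = (iteration // update_every) * update_every
    return max(tau_min, tau0 * math.exp(-anneal_rate * step))


rng = random.Random(0)
samples = [gumbel_softmax_dropout(0.8, 0.1, rng) for _ in range(20000)]
```

At a low temperature the samples are nearly binary and their average is close to π, so s behaves like a soft dropout indicator and 1 − s like a soft "keep" indicator. With the stated settings the schedule reaches its floor of 0.5 after roughly 2,300 iterations.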

2.4. ZINB model-based autoencoder with zero-inflated reconstruction

Based on the probability π, Gumbel-Softmax can explicitly model dropout events. It can generate the zero-inflated reconstruction (X′) from the matrix form of the Gumbel-Softmax sample, S, and the mean estimated in the ZINB model, M:

$$
X' = S \odot M
$$

where ⊙ is element-wise multiplication. To improve the imputation accuracy of the ZINB model-based autoencoder, we combine two losses:

$$
L = L_{ZINB} + \gamma\,\mathrm{MSE}(X^{count}, X')
$$

where MSE stands for the mean square error between log($X^{count}$ + 1) and log(X′ + 1). Here, $X^{count}$ is the raw count matrix and X′ is the reconstruction of the count matrix, so we apply a log transformation to smooth the MSE loss. The hyper-parameter γ controls the relative weights of the two parts. After optimizing the combined loss, the estimated mean M can be considered as the imputed/denoised counts. We name the model the ZINB model-based autoencoder with ZI reconstruction (ZINBAE for short).
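The combined loss can be illustrated on a toy example. The sketch below is a plain-Python illustration with hypothetical function names; it treats S as a soft "keep" mask (near 0 where a dropout is sampled, near 1 otherwise), which is an assumption made so that X′ is zero-inflated and so that S has expectation 1 − Π, as stated for the ablation model in Section 2.6.

```python
import math


def zi_reconstruction(keep_mask, mean):
    """Element-wise product X' = S * M of a keep mask S and estimated means M."""
    return [[s * m for s, m in zip(s_row, m_row)]
            for s_row, m_row in zip(keep_mask, mean)]


def log_mse(x_count, x_recon):
    """Mean square error between log(X + 1) and log(X' + 1)."""
    total, n = 0.0, 0
    for x_row, r_row in zip(x_count, x_recon):
        for x, r in zip(x_row, r_row):
            total += (math.log1p(x) - math.log1p(r)) ** 2
            n += 1
    return total / n


def combined_loss(zinb_nll, x_count, keep_mask, mean, gamma=0.01):
    """L = L_ZINB + gamma * MSE(X_count, X'), with gamma = 0.01 as in Section 2.5."""
    return zinb_nll + gamma * log_mse(x_count, zi_reconstruction(keep_mask, mean))


# Toy cell with two genes: the first observed as 0 (dropout), the second as 4.
x_count = [[0.0, 4.0]]
mean = [[7.0, 4.0]]          # the model believes the first gene is expressed
keep_mask = [[0.0, 1.0]]     # a dropout was sampled for the first gene
loss_zi = log_mse(x_count, zi_reconstruction(keep_mask, mean))
loss_naive = log_mse(x_count, mean)
```

Here the zero-inflated reconstruction matches the observed counts exactly, so the model is not penalized for imputing a non-zero mean at the dropout entry, whereas a naive MSE against M would push that mean back toward zero.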

2.5. Implementation

The model is implemented in Python 3 using Keras with a TensorFlow [1] backend. The random Gaussian noise is implemented by the Keras "GaussianNoise" layer with stddev = 2.5. The model is first pretrained with the ZINB loss for 300 epochs; then the combined loss is optimized. The optimizer is the AMSGrad variant of Adam [11, 23], with an initial learning rate lr = 0.001, β1 = 0.9 and β2 = 0.999. The hidden layer size of the encoder and decoder is 64 and the bottleneck layer size is 32. We set γ = 0.01; this value was selected by searching from 1e-3 to 1 on various simulated data and then fixed for all datasets. The source code is available on GitHub: https://github.com/ttgump/ZINBAE.

2.6. Ablation methods

To illustrate the contribution of the proposed ZI reconstruction, we designed two ablation models: ZINBAE(-ZI) and ZINBAE(-ZI/+MSE). ZINBAE(-ZI) is ZINBAE with the ZI reconstruction removed. Alternatively, we may impose the reconstruction loss directly on the estimated mean without sampling dropout events, namely, the MSE between the raw count matrix, log($X^{count}$ + 1), and the estimated mean, log(M + 1). A reconstruction loss based on this naive MSE ignores dropout events and may degrade imputation accuracy, although it requires no Gumbel-Softmax layer and makes optimization less challenging. We denote this ablation model as ZINBAE(-ZI/+MSE). We also designed a further type of reconstruction, MSE(X, (1 − Π) ⊙ M), which replaces the Gumbel-Softmax sample with its expectation, noting that E(Gumbel(Π)) = (1 − Π). We denote this model as ZINBAE(-ZI/+MSE(1-pi)*mu), or ZINBAE0 for short.

2.7. Data simulation

Simulation can generate datasets with both raw counts (with dropout events) and true counts (without dropout events), which is very useful for evaluating imputation accuracy. Following previous studies, we generated all simulated datasets using the R package Splatter [32]. In all settings, we simulated 1500 cells with 2500 genes, allocated to three cell groups, and generated 20 datasets for each setting. To simulate various dropout rates, we set the parameter dropout.shape = −1 (the default setting in Splatter) and varied dropout.mid from 3 to 6 (3, 4, 5, 6). The corresponding dropout rates are 66±0.6%, 80±0.4%, 89±0.2% and 95±0.1%, mimicking different real scenarios. The resulting dropout rates differed slightly under the same setting due to randomness. It is common for real scRNA-seq datasets to have more than 50% zero entries [12]. We therefore generated datasets with high dropout rates to evaluate the imputation accuracy of the proposed method. We also generated datasets with various proportions of high-expression outliers. The parameter that controls outliers is out.prob, which sets the proportion of genes to be amplified by the high-expression outlier factors. We varied out.prob from 0.1 to 0.4 (0.1, 0.2, 0.3, 0.4) and set dropout.mid = 5. All other parameters were left at their defaults. For each setting, we repeated the experiments 20 times.
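To give some intuition for why larger dropout.mid values yield higher dropout rates, the sketch below assumes that Splatter's dropout probability is a logistic function of the log mean expression, with dropout.mid as the midpoint and dropout.shape as the slope. That functional form is our understanding of Splatter's dropout model and should be checked against the package documentation; the Python function name is hypothetical and this is not Splatter code.

```python
import math


def dropout_probability(mean_expr, dropout_mid, dropout_shape=-1.0):
    """Assumed logistic dropout model: p = 1 / (1 + exp(-shape * (log(mean) - mid)))."""
    x = math.log(mean_expr)
    return 1.0 / (1.0 + math.exp(-dropout_shape * (x - dropout_mid)))


# For a fixed mean expression, a larger midpoint gives a higher dropout probability.
probs_by_mid = [dropout_probability(20.0, mid) for mid in (3, 4, 5, 6)]
```

Under this assumption and with a negative shape, lowly expressed genes are dropped more often than highly expressed ones, and a gene whose log mean equals dropout.mid is dropped half of the time.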

2.8. Real scRNA-seq datasets

The 10X PBMC dataset was provided by the 10X scRNA-seq platform [33], which profiled the transcriptome of a healthy donor’s peripheral blood mononuclear cells (PBMCs). The dataset was downloaded from https://support.10xgenomics.com/single-cell-gene-expression/datasets/2.1.0/pbmc4k, and cell types had been already defined. The 10X PBMC dataset contains 4271 cells of 16,449 genes, across 8 different cell types. The proportion of zero entries in this dataset is 92.2%.

Chu et al. [5] profiled both single cells and bulk samples from human embryonic stem cells (H1) and definitive endoderm cells (DEC). The dataset was obtained from the GEO database with the accession code GSE75748. The numbers of H1 and DEC cells are 212 and 138, respectively. The bulk data comprise 4 H1 samples and 2 DEC samples. The proportion of zero entries in this dataset is 44.6%.

2.9. Baseline methods for comparisons

Current state-of-the-art methods for imputing/denoising scRNA-seq data are used for comparison: scImpute [15], SAVER [8], MAGIC [28], scVI [16] and DCA [6]. scImpute was downloaded from https://github.com/Vivianstats/scImpute. The parameter kCluster in scImpute was set to the true number of clusters. SAVER (https://github.com/mohuangx/SAVER) was run using default parameters, which are 300 for the maximum number of genes used in the prediction, 50 for the number of lambda to calculate in 5-fold cross-validation, and 10 for doParallel. MAGIC (https://github.com/pkathail/magic, Python version) was run with the default parameters: 20 principal components, 6 for the parameter t (the power of the Markov affinity matrix), 30 nearest neighbors, 10 for the autotune parameter and the 99th percentile for scaling. scVI was downloaded from https://github.com/YosefLab/scVI and run with default settings. DCA (https://github.com/theislab/dca) was run using default parameters with –ridge 0.

3. Results

3.1. Simulation experiments

Simulation results

To evaluate the imputation ability of the proposed ZINB model-based autoencoder with ZI reconstruction, we designed the following simulation settings approximating different biological scenarios. Specifically, we applied the R package Splatter [32] to simulate datasets with (1) various dropout rates (varying the parameter dropout.mid from 3 to 6) and (2) various outlier proportions (varying the parameter out.prob from 0.1 to 0.4). For the simulated datasets, we know the true counts (the count matrix before dropout). We define lists c_impute and c_true whose elements correspond to the imputed and ground truth values, respectively, for all dropout entries in the data matrix. The imputation error is then calculated as the relative deviation from the ground truth values:

$$
\text{Imputation error} = \mathrm{Average}\left(\frac{|c_{impute} - c_{true}|}{c_{true}}\right)
$$

A larger imputation error means that the imputed values are farther from the true values.
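A minimal sketch of this metric follows. It assumes the relative deviation is taken in absolute value and that a "dropout entry" is one whose observed count is zero while its simulated true count is positive; both are interpretations of the text, and the function name is hypothetical.

```python
def imputation_error(raw, true, imputed):
    """Average relative deviation |c_impute - c_true| / c_true over dropout entries."""
    errors = []
    for raw_row, true_row, imp_row in zip(raw, true, imputed):
        for r, t, c in zip(raw_row, true_row, imp_row):
            if r == 0 and t > 0:  # assumed definition of a dropout entry
                errors.append(abs(c - t) / t)
    if not errors:
        raise ValueError("no dropout entries found")
    return sum(errors) / len(errors)


raw = [[0, 3], [0, 0]]
true = [[4, 3], [0, 2]]
imputed = [[2, 3], [1, 3]]
error = imputation_error(raw, true, imputed)  # two dropout entries, each off by 50%
```

Entries that are zero in both the raw and the true matrix are skipped, since they are true zeros rather than dropouts and the ratio would be undefined.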

Imputation results under various settings are summarized in Fig. 2. As dropout rates increase, all methods show increasing imputation errors except the proposed ZINBAE model. This result is reasonable since more dropout events make imputation more difficult, but the proposed ZINBAE model is robust against the increase in dropout rates. In particular, compared to DCA and scVI, ZINBAE shows a more significant improvement on datasets with larger dropout rates, suggesting that ZINBAE is useful for very sparse scRNA-seq datasets. scVI is based on a variational autoencoder, and its non-deterministic nature can generate vague reconstructions. Next, imputation performance on datasets with different high-expression outlier proportions is reported. We observe the same result: the proposed ZINBAE method significantly outperformed all baseline methods (P < 0.01, one-sided t-test).

Figure 2:

Imputation errors of competing methods on simulated datasets. Left panel: results for various dropout rates (the spread along the x-axis reflects different random seeds in the simulations). Right panel: results for various outlier proportions.

Ablation studies

We found that the two versions (ZINBAE and ZINBAE(-ZI/+MSE(1-pi)*mu)) yield very comparable results (Fig. 2). This result indicates that Gumbel-Softmax models dropout events well, while also introducing some randomness that helps optimization.

The ZINBAE model outperforms or equals the ablation models under all settings (Fig. 3). The improvement brought by ZI confirms that regularization via the ZI reconstruction loss helps. We can also see that adding a naive MSE without capturing dropout events makes performance worse.

Figure 3:

Imputation errors of ablation models on simulated datasets. Left panel: results for various dropout rates (the spread along the x-axis reflects different random seeds in the simulations). Right panel: results for various outlier proportions.

3.2. Real data experiments

For real scRNA-seq data, we do not know the ground truth counts (counts without dropouts). For evaluation, following previous studies [6, 15], we test whether the imputation method can improve downstream analyses, namely, cell type identification and differential expression analysis. We focus on comparing ZINBAE with DCA in the real data experiments. DCA is selected because our simulation study and recently published results [6] indicate that DCA performs best among the competing methods other than ZINBAE.

ZINBAE model improves separating different types of cells

Complex scRNA-seq datasets, such as those generated from a whole tissue, are composed of heterogeneous types of cells. Cell types are often defined by differential gene expression patterns. Dropout events obscure gene expression profiles dramatically and make cell type recognition challenging. Effective imputation methods help to recover obscured gene expression data and improve the separation of different cell types. We imputed the 10X PBMC dataset [33] with DCA and ZINBAE. We expect that cell clusters will separate better when missing values (dropout events) are imputed more accurately, so better separation may suggest better imputation performance. Our data are high dimensional and cannot be visualized directly. Therefore, we employ UMAP [20] (Fig. 4) to reduce the data to a 2-D space before and after imputation. Briefly, UMAP constructs a graph representation of the original high-dimensional data and then optimizes a low-dimensional graph to be as structurally similar to it as possible. UMAP is similar to t-SNE [18] but offers notably increased speed and better preservation of the data’s global structure. We observed that different cell types are mixed together when the raw count matrix is used as input, but separate well when the imputed data matrices are used.

Figure 4:

UMAP representations of raw and imputed count matrices by different methods for 10X PBMC dataset. Colors are arbitrary and different colors represent different cell types.

Visualization provides a perceptual but not quantitative evaluation of separation and clustering. To quantify separation performance, we use a simple metric, the 1-nearest neighbor error (1-NNE). Specifically, we randomly split all cells (samples) 9:1 into training and testing sets. A 1-nearest neighbor classifier is trained on the training data and the prediction error is reported on the testing data. We repeated the experiment 100 times. The mean and standard error of the 1-NNE using different data matrices as input are summarized in Table 1. We can see that imputation improves the separation of different cell types, and that the ZINBAE model is significantly better than DCA (P < 0.01, one-sided t-test). Again, we observed that the performances of ZINBAE and ZINBAE(-ZI/+MSE(1-pi)*mu) are very similar.

Table 1:

1-nearest neighbor error rates (1-NNEs) over the 10X PBMC dataset.

| Method | 1-NNE |
| --- | --- |
| Raw count matrix | 38.71 ± 0.21% |
| DCA | 13.04 ± 0.15% |
| ZINBAE | 11.76 ± 0.14% |
| ZINBAE(-ZI/+MSE(1-pi)*mu) | 11.74 ± 0.145% |
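The 1-NNE procedure described above can be sketched as follows. This is a plain-Python illustration with Euclidean distance and hypothetical function names; the distance measure and the input representation used by the authors are not stated in the text, so both are assumptions here.

```python
import random


def one_nn_error(features, labels, test_fraction=0.1, rng=None):
    """Error of a 1-nearest-neighbor classifier on one random 9:1 split."""
    rng = rng or random.Random()
    indices = list(range(len(features)))
    rng.shuffle(indices)
    n_test = max(1, round(len(indices) * test_fraction))
    test, train = indices[:n_test], indices[n_test:]
    errors = 0
    for t in test:
        # Nearest training cell by squared Euclidean distance (assumed metric).
        nearest = min(train, key=lambda i: sum(
            (a - b) ** 2 for a, b in zip(features[i], features[t])))
        errors += labels[nearest] != labels[t]
    return errors / n_test


def mean_one_nn_error(features, labels, repeats=100, seed=0):
    """Average 1-NNE over repeated random splits, as in the described protocol."""
    rng = random.Random(seed)
    runs = [one_nn_error(features, labels, rng=rng) for _ in range(repeats)]
    return sum(runs) / len(runs)


# Toy data: two well-separated groups of "cells" in a 2-D embedding.
features = [[0.01 * k, 0.0] for k in range(20)] + [[10 + 0.01 * k, 0.0] for k in range(20)]
labels = ["A"] * 20 + ["B"] * 20
toy_error = mean_one_nn_error(features, labels)
```

On well-separated groups the error is zero; the more the cell types overlap in the input space, the more often a test cell's nearest neighbor carries a different label.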

ZINBAE model increases correspondence between single-cell and bulk differential expression analysis

Bulk RNA-seq measures gene expression levels of a population of cells. Bulk RNA-seq does not suffer from dropout events, although it cannot measure genes at the single-cell level. Following [15, 6], we evaluated imputation efficiency based on the concordance of differential expression analysis results between bulk RNA-seq data and scRNA-seq data (after imputation). To this end, we used the dataset of H1 and DEC cells with both bulk RNA-seq and scRNA-seq data [5] (see Section 2.8). We first imputed the scRNA-seq data with DCA and ZINBAE, then compared the differential expression analysis results with those from the bulk data. Specifically, following [15, 6], we focused on LEFTY1 and GATA3, two key regulatory genes in DEC cells. Fig. 5 (a) and (b) present the distributions of these two genes in bulk RNA-seq, raw scRNA-seq, scRNA-seq imputed using DCA, and scRNA-seq imputed using ZINBAE in the four columns, respectively. LEFTY1 and GATA3 are known to be highly expressed in DEC cells but not in H1 cells. We see that their median (relative) expression levels in scRNA-seq are shifted higher after imputation, which is more consistent with the expression levels in the bulk data. We observed that ZINBAE shifted these two genes higher in DEC cells than DCA did.

Figure 5:

Correspondence between scRNA-seq and bulk RNA-seq: (a) counts (raw and imputed) of marker gene LEFTY1, (b) counts (raw and imputed) of marker gene GATA3, (c) gene correlations, (d) Biologically meaningful GO terms identified using ZINBAE but missed by DCA.

Next, following [6], we systematically compared the robustness of the two methods using a bootstrapping approach. Briefly, the top 5000 highly dispersed genes were selected based on the bulk RNA-seq data; twenty cells were randomly selected from each of the H1 and DEC populations, followed by differential expression (DE) analysis using DESeq2 [17]. We repeated the experiments 50 times. Pearson correlations between the log fold changes estimated from scRNA-seq and bulk RNA-seq data are summarized in Fig. 5 (c). We observed that both imputation methods improved the correlation with bulk RNA-seq data compared with raw scRNA-seq data. Correlation coefficients resulting from ZINBAE were significantly higher than those from DCA and ZINBAE(-ZI/+MSE(1-pi)*mu) (P < 0.01, one-sided t-test). The enhanced performance highlights the increased agreement between the ZINBAE-imputed data and the purified bulk data.

Finally, we conducted pathway (gene set) enrichment analysis (GSEA) to determine whether the DE genes identified using ZINBAE are biologically meaningful. We used all 350 cells for DE analysis (DEC vs H1) using DESeq2 after imputation. We then performed pre-ranked GSEA using the GSEA software [26] (all default settings) based on the results of DESeq2. At a false discovery rate (FDR) level of 0.01, we obtained 243 gene ontology (GO) terms significantly enriched using the bulk RNA-seq data. For the same number of top significant GO terms, the scRNA-seq data imputed with ZINBAE reproduced 145 of the GO terms found with the bulk RNA-seq data, while the data imputed with DCA reproduced only 120. The pathway analysis of scRNA-seq using ZINBAE-imputed data thus achieved significantly higher concordance with bulk RNA-seq (P < 0.01, one-sided Fisher exact test). Interestingly, we found that several significant pathways that were detected using ZINBAE-imputed data but not DCA-imputed data were highly relevant to the functions of DEC cells (Fig. 5 (d)).

4. Discussion

The major challenge in the analysis of scRNA-seq data is pervasive dropout events. Recent research has indicated that imputation can significantly improve downstream analysis [15, 8, 28]. Imputation has even become a standard step in some scRNA-seq workbenches [34]. In the zero-inflated negative binomial model, the mean is the parameter of interest for imputation, while the dispersion and dropout rate are nuisance parameters, so we focus on optimizing the mean estimator. The original count can be considered as a method of moments estimator (MME) of the mean of the mixture distribution (point mass at 0 and NB). The MME is easy to compute, always works, and is consistent. The original mean estimator of DCA can be considered as a maximum likelihood estimator (MLE) of the mean of the NB. The MLE has many good properties, e.g., it is consistent, unbiased, and efficient when the sample size n is large. But the MLE can be highly biased when the sample size n is small, and numerical optimization can be sensitive to starting values when there is no closed-form solution. This is exactly the case for scRNA-seq data (n=1, without closed-form solution). The Gumbel-Softmax transformation allows us to connect the MME of the mean of the mixture distribution to the MLE of the mean of the NB. The reconstruction loss effectively pulls the MLE towards the MME. Such a hybrid estimator may provide a good compromise between the MLE and the MME.

Using both a simulation study and real data applications, we show that the proposed ZINB model-based autoencoder can significantly enhance imputation for the analysis of discrete scRNA-seq data and achieve state-of-the-art performance. We attribute the superiority of ZINBAE to the regularization via a novel zero-inflated reconstruction loss. The regularization essentially provides a compromise between the maximum likelihood estimate and the moment estimate (data or sample mean). Other types of regularization may exist. For example, based on domain knowledge, we may assume that genes from the same condition or cell type share the same mean. We will explore other types of regularization in our future work.

References

  • [1]. Abadi M, Barham P, Chen J, Chen Z, Davis A, Dean J, Devin M, Ghemawat S, Irving G, Isard M, et al. 2016. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 265–283.
  • [2]. Angerer P, Simon L, Tritschler S, Wolf FA, Fischer D, and Theis FJ 2017. Single cells make big data: New challenges and opportunities in transcriptomics. Current Opinion in Systems Biology, 4:85–91.
  • [3]. Butler A, Hoffman P, Smibert P, Papalexi E, and Satija R 2018. Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nature Biotechnology, 36(5):411.
  • [4]. Chen J, King E, Deek R, Wei Z, Yu Y, Grill D, and Ballman K 2018. An omnibus test for differential distribution analysis of microbiome sequencing data. Bioinformatics, 34(4):643–651.
  • [5]. Chu L-F, Leng N, Zhang J, Hou Z, Mamott D, Vereide DT, Choi J, Kendziorski C, Stewart R, and Thomson JA 2016. Single-cell RNA-seq reveals novel regulators of human embryonic stem cell differentiation to definitive endoderm. Genome Biology, 17(1):173.
  • [6]. Eraslan G, Simon LM, Mircea M, Mueller NS, and Theis FJ 2019. Single-cell RNA-seq denoising using a deep count autoencoder. Nature Communications, 10(1):390.
  • [7]. Hinton GE and Salakhutdinov RR 2006. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507.
  • [8]. Huang M, Wang J, Torre E, Dueck H, Shaffer S, Bonasio R, Murray JI, Raj A, Li M, and Zhang NR 2018. SAVER: gene expression recovery for single-cell RNA sequencing. Nature Methods, 15(7):539.
  • [9]. Jang E, Gu S, and Poole B 2016. Categorical reparameterization with Gumbel-Softmax. arXiv preprint arXiv:1611.01144.
  • [10]. Kharchenko PV, Silberstein L, and Scadden DT 2014. Bayesian approach to single-cell differential expression analysis. Nature Methods, 11(7):740.
  • [11]. Kingma DP and Ba J 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • [12]. Kiselev VY, Andrews TS, and Hemberg M 2019. Challenges in unsupervised clustering of single-cell RNA-seq data. Nature Reviews Genetics, p. 1.
  • [13]. Klein AM, Mazutis L, Akartuna I, Tallapragada N, Veres A, Li V, Peshkin L, Weitz DA, and Kirschner MW 2015. Droplet barcoding for single-cell transcriptomics applied to embryonic stem cells. Cell, 161(5):1187–1201.
  • [14]. Kolodziejczyk AA, Kim JK, Svensson V, Marioni JC, and Teichmann SA 2015. The technology and biology of single-cell RNA sequencing. Molecular Cell, 58(4):610–620.
  • [15]. Li WV and Li JJ 2018. An accurate and robust imputation method scImpute for single-cell RNA-seq data. Nature Communications, 9(1):997.
  • [16]. Lopez R, Regier J, Cole MB, Jordan MI, and Yosef N 2018. Deep generative modeling for single-cell transcriptomics. Nature Methods, 15(12):1053.
  • [17]. Love MI, Huber W, and Anders S 2014. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology, 15(12):550.
  • [18]. Maaten L. v. d. and Hinton G 2008. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(2008):2579–2605.
  • [19]. Macosko EZ, Basu A, Satija R, Nemesh J, Shekhar K, Goldman M, Tirosh I, Bialas AR, Kamitaki N, Martersteck EM, et al. 2015. Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets. Cell, 161(5):1202–1214.
  • [20]. McInnes L, Healy J, and Melville J 2018. UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426.
  • [21]. Nair V and Hinton GE 2010. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML 2010), pp. 807–814.
  • [22]. Pierson E and Yau C 2015. ZIFA: Dimensionality reduction for zero-inflated single-cell gene expression analysis. Genome Biology, 16(1):241.
  • [23]. Reddi SJ, Kale S, and Kumar S 2019. On the convergence of Adam and beyond. arXiv preprint arXiv:1904.09237.
  • [24]. Risso D, Perraudeau F, Gribkova S, Dudoit S, and Vert J-P 2018. A general and flexible method for signal extraction from single-cell RNA-seq data. Nature Communications, 9(1):284.
  • [25]. Shapiro E, Biezuner T, and Linnarsson S 2013. Single-cell sequencing-based technologies will revolutionize whole-organism science. Nature Reviews Genetics, 14(9):618.
  • [26]. Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, et al. 2005. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences, 102(43):15545–15550.
  • [27]. Tian T, Wan J, Song Q, and Wei Z 2019. Clustering single-cell RNA-seq data with a model-based deep learning approach. Nature Machine Intelligence, 1(4):191.
  • [28]. Van Dijk D, Sharma R, Nainys J, Yim K, Kathail P, Carr AJ, Burdziak C, Moon KR, Chaffer CL, Pattabiraman D, et al. 2018. Recovering gene interactions from single-cell data using data diffusion. Cell, 174(3):716–729.
  • [29]. Vincent P, Larochelle H, Bengio Y, and Manzagol P-A 2008. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning, pp. 1096–1103. ACM.
  • [30]. Wang D and Gu J 2018. VASC: Dimension reduction and visualization of single-cell RNA-seq data by deep variational autoencoder. Genomics, Proteomics & Bioinformatics, 16(5):320–331.
  • [31]. Wolf FA, Angerer P, and Theis FJ 2018. Scanpy: large-scale single-cell gene expression data analysis. Genome Biology, 19(1):15.
  • [32]. Zappia L, Phipson B, and Oshlack A 2017. Splatter: simulation of single-cell RNA sequencing data. Genome Biology, 18(1):174.
  • [33]. Zheng GX, Terry JM, Belgrader P, Ryvkin P, Bent ZW, Wilson R, Ziraldo SB, Wheeler TD, McDermott GP, Zhu J, et al. 2017. Massively parallel digital transcriptional profiling of single cells. Nature Communications, 8:14049.
  • [34]. Zhu X, Wolfgruber TK, Tasato A, Arisdakessian C, Garmire DG, and Garmire LX 2016. Granatum: a graphical single-cell RNA-seq analysis pipeline for genomics scientists. Genome Medicine, 9(1):108.
