Author manuscript; available in PMC: 2021 Jul 15.
Published in final edited form as: ACM BCB. 2020 Sep;2020:41. doi: 10.1145/3388440.3412462

Correlation Imputation in Single cell RNA-seq using Auxiliary Information and Ensemble Learning

Luqin Gan 1, Giuseppe Vinci 2, Genevera I Allen 3
PMCID: PMC8281968  NIHMSID: NIHMS1715319  PMID: 34278382

Abstract

Single cell RNA sequencing is a powerful technique that measures the gene expression of individual cells in a high throughput fashion. However, because of sequencing inefficiency, the data is unreliable owing to dropout events, technical artifacts where genes erroneously appear to have zero expression. Many data imputation methods have been proposed to alleviate this issue. Yet, effective imputation can be difficult and biased because the data is sparse and high-dimensional, resulting in major distortions in downstream analyses. In this paper, we propose a completely novel approach that imputes the gene-by-gene correlations rather than the data itself. We call this method SCENA: Single cell RNA-seq Correlation completion by ENsemble learning and Auxiliary information. The SCENA gene-by-gene correlation matrix estimate is obtained by model stacking of multiple imputed correlation matrices based on known auxiliary information about gene connections. In an extensive simulation study based on real scRNA-seq data, we demonstrate that SCENA not only accurately imputes gene correlations but also outperforms existing imputation approaches in downstream analyses such as dimension reduction, cell clustering, and graphical model estimation.

Keywords: Single Cell RNA-seq, Imputation, Correlation Completion, Ensemble Learning, Auxiliary Information, Clustering, Dimension Reduction, Graphical modeling

1. Introduction

In genomics, researchers are interested in discovering the relationships between genes, monitoring changes of gene expression, and understanding the influence of genes on the organism. Bulk RNA sequencing (bulk RNA-seq) is a sequencing technology that lets us analyze gene expression from samples that contain a large number of cells by revealing the presence and quantity of RNA. With bulk RNA-seq data, significant results on gene-to-gene connection and gene-to-disease relationship can be obtained by machine learning methods, including dimension reduction, clustering models, and graphical models.

However, bulk RNA-seq only measures average gene expression levels across all cells in the sample, and it cannot detect gene expression differences across different types of cells. Single cell RNA sequencing (scRNA-seq) solves this problem by measuring the gene expression of individual cells, allowing us to clarify the critical differences among cells from the same organism. This genomic technology has helped discover rare cells in different tissues by gene expression patterns and therefore is an important and powerful tool for transcriptome analysis [19].

Yet, the data quality of scRNA-seq is poorer than that of bulk RNA-seq, especially because of the presence of dropouts, technical artifacts where genes erroneously appear to have zero expression due to sequencing inefficiency. The loss of information is significant in scRNA-seq data, and can lead to major problems in downstream analyses.

Numerous imputation methods have been developed to fill in the dropout values in the scRNA-seq data. The SAVER model [14] predicts gene expressions under the assumption that the measured gene expressions follow Poisson-Gamma distributions, where the latent Gamma random variables are the true gene expressions. Based on the similarity among cells’ gene expressions, both drImpute [9] and PRIME [16] impute the dropouts of a cell by using the gene expressions of the cells belonging to the same cluster. The scRMD methodology [1] infers the gene expressions of cells by robust matrix decomposition, where the dropouts are encoded in a sparse matrix, and the matrix of true gene expressions is low rank. Other approaches include [30] and [37].

Yet, effective imputation of the missing values in scRNA-seq data can be difficult and biased because the data is sparse and high-dimensional: the number of genes is typically over 20,000 and the number of cells is usually only a few hundred. In fact, researchers are interested in gene-to-gene connections and interactions, clustering of cells, or principal component analysis of the scRNA-seq data, and all these analyses require a well estimated correlation matrix of the gene expressions. Unfortunately, the presence of dropouts generates several challenges. For instance, if all zeros are assumed to be true values, the sample correlation matrix is corrupted. On the other hand, assuming that all zeros are dropouts, i.e. missing values, the Pearson correlation of two genes can be computed empirically only if there are enough pairwise-complete observations. That is, the presence of dropouts can cause missingness in the sample covariance matrix. For this last problem there are plenty of covariance matrix completion methodologies that may be used [26, 21, 10, 17], but these, just like the data imputation methods [14, 9, 16, 1], each perform well only under their own assumptions about the data.

However, in this challenging situation auxiliary information could be very helpful. Indeed, improvements in estimation performance due to the use of auxiliary information have been observed in various contexts [5, 11, 25, 8, 22, 31, 27, 23]. For instance, Hecker et al. [11] discuss improvements in network inference allowed by the incorporation of genome sequence and protein-DNA interaction data. Moreover, Lin et al. [25] study age-related macular degeneration by incorporating prior knowledge from previous linkage and association studies. Furthermore, Gao et al. [8] and Li et al. [22] use information about Gene Ontology annotation to improve network estimation. Finally, Novianti et al. [27] use gene pathway databases and genomic annotations to improve prediction accuracy, and Liang et al. [23] use auxiliary information about gene length and test statistics from microarray studies for the analysis of differential expression of genes.

In this paper we propose a novel approach, SCENA (Single cell RNA-seq Correlation completion by ENsemble learning and Auxiliary information), which estimates the gene-by-gene correlation matrix by incorporating auxiliary information about the underlying biological connections and other data sources together with the dropout-corrupted scRNA-seq data of interest. The auxiliary information we use includes the gene pathway database from the Kyoto Encyclopedia of Genes and Genomes (KEGG) [18], gene interaction networks from the Biological General Repository for Interaction Datasets (BioGRID) [28], protein-protein interaction networks from STRING [29], bulk cell RNA-seq data of 39 different tissues from the Encyclopedia of DNA Elements (ENCODE) [3], and other scRNA-seq data collected from cells of the same type of organism as the scRNA-seq dataset of interest. To implement SCENA, we first convert all these auxiliary data sources into a collection of correlation matrices, in addition to the correlation matrices recovered from scRNA-seq data via various imputation approaches and matrix completion strategies. Then, we ensemble all obtained correlation matrices into a final gene-by-gene correlation matrix estimate by model stacking.

We show that SCENA outperforms other existing methods in terms of correlation matrix completion, dimension reduction, clustering, and graphical modeling with an extensive simulation study (Section 3). Finally, we apply the methods to the analysis of a massive scRNA-seq data set of embryonic stem cells [2] (Section 4).

2. Method: SCENA

Let Xs be an N × M scRNA-seq data matrix of gene expressions of N genes measured over M cells. Because of technological limitations of the sequencing process, the data matrix Xs typically contains numerous false zeros called “dropouts”, where the transcript was not detected during sequencing. Thus, Xs must be seen as a corrupted version of some underlying data matrix X that is free from dropouts. To obtain estimates of the gene-by-gene correlations, traditional approaches typically first subject the data matrix Xs to some imputation process I, which identifies the dropouts and predicts their values. The resulting imputed data matrix I(Xs) is then used to compute correlation estimates. As an alternative, we may directly complete or repair the correlation estimates obtained from Xs by standard matrix completion approaches, or approximate them by using various sources of auxiliary information.

Thus, several possible useful estimates of the single-cell gene-by-gene correlations are available, and a combination of them may let us obtain an ultimate reliable estimate of the gene-by-gene correlation matrix. Our proposed approach, SCENA, builds upon this strikingly simple but powerful idea of optimally combining multiple correlation matrix estimates. SCENA estimates the gene expression correlation matrix of scRNA-seq data by combining multiple genetic correlation matrices derived from various sources of information. In Section 2.1 we describe the derivation of several correlation matrix estimates, and in Section 2.2 we combine them via model stacking. In the rest of the paper, gene expressions are transformed according to x ↦ log2(1 + x) before computing correlations.

2.1. Single correlation matrix estimates

SCENA combines the following four groups of single correlation matrix estimates.

  1. Blind correlation estimate (“all zeros are true”). This is the sample correlation matrix Σ^s of the scRNA-seq data matrix Xs assuming all zeros are real zeros.

  2. Imputation (“some zeros are dropouts and we try to correct them”). The imputation methods [14, 9, 16, 1] let us impute the scRNA-seq data, and thereby obtain correlation matrices.

  3. Correlation matrices based on auxiliary data. Auxiliary data is any kind of data beyond the scRNA-seq data of interest. We consider the following three kinds of auxiliary data, which can be used to compute correlation matrices.
    1. Bulk RNA-seq data. Given a matrix of bulk RNA-seq data, we calculate its sample correlation matrix. In this paper we use auxiliary bulk RNA-seq data of [3].
    2. Other scRNA-seq data. It is possible that another scRNA-seq data set $X_s^*$ has fewer dropouts for some of the genes in the main data matrix $X_s$, so the lost information of those genes might be found in $X_s^*$. The sample correlation matrix of this additional data matrix $X_s^*$ is computed. In this paper we use auxiliary scRNA-seq data from [35, 20].
    3. Biological networks. The KEGG pathway database [18] provides information about n = 6860 genes and c = 239 gene pathways, which can be summarized in an n × c matrix $K = [K_{ij}]$, where $K_{ij} = 1$ if gene i is in pathway j and $K_{ij} = 0$ otherwise. We compute the n × n sample correlation matrix of K. From the BioGRID network [28] we extract an adjacency matrix $A \in \{0, 1\}^{N \times N}$ of gene connections, and obtain the correlation matrix $\mathrm{diag}(L^{-1})^{-1/2} \, L^{-1} \, \mathrm{diag}(L^{-1})^{-1/2}$, where L is the Laplacian matrix $L = D - A$, and D is a diagonal matrix with $D_{ii} = \sum_{j=1}^{N} A_{ji}$, i.e. the degree of gene i. Finally, in the STRING network [29], we construct a correlation matrix by treating gene-by-gene combined connection scores as correlations.
  4. Skeptical correlation estimates (“all zeros are dropouts”). Assuming all zeros are dropouts, we obtain the matrix $X_s^{NA}$, which corresponds to $X_s$ with all zeros replaced by missing values (NAs). From this matrix, it is possible to compute the pairwise complete sample correlation matrix $\hat{\Sigma}_O$, which contains missing entries for all those gene pairs with no jointly nonzero measured scRNA-seq expressions. We obtain completed versions of $\hat{\Sigma}_O$ as follows.
    1. Matrix completion. We use [26] to produce a complete correlation matrix.
    2. Convex combinations with matrices in (3). Given a correlation matrix derived from auxiliary data, say $\hat{\Sigma}_{aux}$, we can obtain a completed version of the pairwise complete scRNA-seq sample correlation matrix $\hat{\Sigma}_O$ as $\dot{\Sigma} = \alpha \circ \hat{\Sigma}_O^* + (1 - \alpha) \circ \hat{\Sigma}_{aux}$, where $\hat{\Sigma}_O^*$ is a version of $\hat{\Sigma}_O$ with all NAs replaced by zeros, $\alpha = [\alpha_{ij}]$ is an N × N weight matrix, and $\circ$ denotes the entrywise product. We consider two types of weights:
      1. Simple replacement
        $$\alpha_{ij} = \begin{cases} 1, & \text{if } \hat{\Sigma}_{O,ij} \text{ is not NA} \\ 0, & \text{otherwise} \end{cases} \qquad (1)$$
      2. Signal-to-noise ratio
        $$\alpha_{ij} = \frac{1}{M} \sum_{k=1}^{M} I(X_{s,ik} \neq 0) \, I(X_{s,jk} \neq 0), \qquad (2)$$
        which is the proportion of cells where genes i and j have jointly nonzero read counts.
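
The skeptical estimate and its completion by convex combination can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the `min_pairs` threshold, and the toy matrices are assumptions made here for illustration, and the products with the weight matrix are taken entrywise.

```python
import numpy as np

def pairwise_complete_corr(x, min_pairs=3):
    """Pearson correlation using, for each gene pair, only the cells where both genes are nonzero.

    x: genes-by-cells matrix (already log-transformed); zeros are treated as missing.
    Pairs with fewer than `min_pairs` jointly nonzero cells, or zero variance, stay NaN.
    """
    n = x.shape[0]
    observed = x != 0
    out = np.full((n, n), np.nan)
    for i in range(n):
        for j in range(i, n):
            both = observed[i] & observed[j]
            if both.sum() < min_pairs:
                continue
            a, b = x[i, both], x[j, both]
            if a.std() == 0 or b.std() == 0:
                continue
            out[i, j] = out[j, i] = np.corrcoef(a, b)[0, 1]
    return out

def replacement_weights(sigma_o):
    """Weights of Equation (1): 1 where the pairwise-complete entry exists, 0 otherwise."""
    return (~np.isnan(sigma_o)).astype(float)

def joint_nonzero_weights(x):
    """Weights of Equation (2): proportion of cells where genes i and j are both nonzero."""
    observed = (x != 0).astype(float)
    return observed @ observed.T / x.shape[1]

def combine(sigma_o, sigma_aux, alpha):
    """Entrywise combination alpha * Sigma_O^* + (1 - alpha) * Sigma_aux."""
    sigma_o_star = np.nan_to_num(sigma_o, nan=0.0)  # NAs replaced by zeros
    return alpha * sigma_o_star + (1.0 - alpha) * sigma_aux

# Toy example: 3 genes, 5 cells; the identity stands in for an auxiliary correlation matrix.
x = np.array([[1.0, 2.0, 0.0, 3.0, 4.0],
              [2.0, 1.0, 0.0, 4.0, 3.0],
              [0.0, 0.0, 5.0, 0.0, 0.0]])
sigma_aux = np.eye(3)
sigma_o = pairwise_complete_corr(x)
completed = combine(sigma_o, sigma_aux, replacement_weights(sigma_o))
```

In this toy example gene 3 is never observed jointly with the other genes, so all of its entries in the completed matrix come from the auxiliary matrix.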

2.2. Model stacking

Let $\dot{\Sigma}_1, \ldots, \dot{\Sigma}_p$ be the single correlation matrix estimates derived in Section 2.1. We aim to obtain a final correlation matrix estimate $\tilde{\Sigma}$ by model stacking in the form

$$\tilde{\Sigma} = F^{-1}\left( \sum_{q=1}^{p} \beta_q F(\dot{\Sigma}_q) \right), \qquad (3)$$

where $F: \mathbb{R}^{N \times N} \to \mathbb{R}^{N \times N}$ is an invertible mapping, and $\beta_1, \ldots, \beta_p \in \mathbb{R}$. The simplest choice of F is the identity mapping $F(A) = A$, $\forall A \in \mathbb{R}^{N \times N}$, which however does not guarantee $\tilde{\Sigma}$ to be a positive semi-definite correlation matrix, even if all $\dot{\Sigma}_1, \ldots, \dot{\Sigma}_p$ are positive semi-definite correlation matrices, unless we impose appropriate constraints on $\beta_1, \ldots, \beta_p$. For instance, a sufficient condition is $\beta_q > 0$, $\forall q$, with $\sum_q \beta_q = 1$, which specifies a convex linear combination. Another possible mapping is $F(A) = A^{-1}$, which requires $\dot{\Sigma}_q \succ 0$, $\forall q$. In any case, if $\tilde{\Sigma}$ is not a positive semi-definite correlation matrix, we replace it with the nearest correlation matrix [13] as per $\tilde{\Sigma} := \arg\min_{\Psi \in C} \| \tilde{\Sigma} - \Psi \|_F$, where C is the set of positive semi-definite correlation matrices.
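
As an illustration of the final repair step, the sketch below projects a symmetric matrix onto the set of positive semi-definite matrices with unit diagonal by alternating projections with Dykstra's correction. It is written in the spirit of the nearest correlation matrix problem of [13], but it is only a minimal sketch: the function names, the stopping rule, and the default tolerances are assumptions made here, not part of the paper.

```python
import numpy as np

def project_psd(a):
    """Project a symmetric matrix onto the positive semi-definite cone."""
    vals, vecs = np.linalg.eigh((a + a.T) / 2.0)
    return (vecs * np.clip(vals, 0.0, None)) @ vecs.T

def nearest_correlation(a, max_iter=500, tol=1e-8):
    """Approximate the nearest correlation matrix in Frobenius norm.

    Alternates between the PSD cone and the set of unit-diagonal matrices,
    carrying Dykstra's correction for the PSD projection.
    """
    y = (a + a.T) / 2.0
    correction = np.zeros_like(y)
    for _ in range(max_iter):
        r = y - correction
        x = project_psd(r)
        correction = x - r
        y_new = x.copy()
        np.fill_diagonal(y_new, 1.0)  # projection onto unit-diagonal matrices
        step = np.linalg.norm(y_new - y, "fro")
        y = y_new
        if step <= tol * max(1.0, np.linalg.norm(y, "fro")):
            break
    return y

# An indefinite "correlation-like" matrix, e.g. the result of an unconstrained stacking.
sigma_tilde = np.array([[1.0, 0.9, -0.9],
                        [0.9, 1.0, 0.9],
                        [-0.9, 0.9, 1.0]])
repaired = nearest_correlation(sigma_tilde)
```

The returned matrix has unit diagonal exactly and is positive semi-definite up to the chosen tolerance, so very small negative eigenvalues may remain.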

There are many possible ways to specify Equation (3). We consider the following ones.

2.2.1. Simple average.

The simple average is obtained by setting F = identity mapping and $\beta_q = 1/p$, for all q, yielding the convex linear combination

$$\tilde{\Sigma} = \frac{1}{p} \sum_{q=1}^{p} \dot{\Sigma}_q \qquad (4)$$

Since the weights are prespecified, this approach requires no additional tuning or validation steps. Also, if $\dot{\Sigma}_1, \ldots, \dot{\Sigma}_p$ are all positive semi-definite correlation matrices, so is $\tilde{\Sigma}$. We will denote this solution by SCENAaverage.
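
A minimal NumPy sketch of Equation (3) with F equal to the identity, which reduces to Equation (4) when no weights are supplied. The function name `stack_identity` and the toy matrices are illustrative assumptions.

```python
import numpy as np

def stack_identity(matrices, weights=None):
    """Weighted sum of correlation matrices; uniform weights 1/p give the simple average."""
    mats = np.stack(matrices)               # shape (p, N, N)
    p = mats.shape[0]
    if weights is None:
        w = np.full(p, 1.0 / p)
    else:
        w = np.asarray(weights, dtype=float)
    return np.tensordot(w, mats, axes=1)    # sum_q w_q * mats[q]

m1 = np.array([[1.0, 0.2], [0.2, 1.0]])
m2 = np.array([[1.0, 0.6], [0.6, 1.0]])
sigma_average = stack_identity([m1, m2])
```

Because the uniform weights are positive and sum to one, the average of positive semi-definite correlation matrices keeps a unit diagonal and remains positive semi-definite.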

2.2.2. Regression.

We assume a linear relationship between the true underlying correlation and the single correlation estimates,

$$f(\Sigma_{ij}) = \sum_{q=1}^{p} \beta_q f(\dot{\Sigma}_{q,ij}) + \epsilon_{ij}, \qquad (5)$$

where $f: (-1, 1) \to \mathbb{R}$ is an invertible function, e.g. the Fisher transformation $f(x) = \frac{1}{2} \log((1 + x)/(1 - x))$, and $\epsilon_{ij}$ is an error component. The vector of coefficients $\beta = (\beta_1, \ldots, \beta_p)^T$ is then estimated by solving the penalized optimization problem

$$\hat{\beta} = \arg\min_{\beta} \sum_{i<j} \left( f(\Sigma_{ij}) - \sum_{q=1}^{p} \beta_q f(\dot{\Sigma}_{q,ij}) \right)^2 + \lambda P(\beta) \qquad (6)$$

where P is a penalty and λ ≥ 0 is selected via cross-validation. Cross-validation lets us reduce the risk of overfitting, and is implemented by creating multiple held-out data subsets that are iteratively removed from training and used instead to validate prediction accuracy. Setting $P(\beta) = \sum_{q=1}^{p} \beta_q^2$ produces the ridge estimator, and the resulting final correlation matrix $\tilde{\Sigma}$ will be denoted by SCENAridge.

Of course, we do not know $\Sigma_{ij}$ in Equation (5), but we can identify a small subset of genes and cells which we may assume to contain very few dropouts and could give us reliable estimates of $\Sigma_{ij}$. Thus, to fit the regression model in Equation (5), we first extract a reference data matrix Y′ from the scRNA-seq data matrix Xs (reference data selection algorithm, Appendix A), then compute the sample correlation matrix $\hat{\Sigma}_{Y'}$ of Y′, and finally extract the off-diagonal entries, which will be used as the response vector of the regression. Then, we obtain multiple perturbed versions of Y′ by creating artificial dropouts (Poisson-Gamma downsampling algorithm, Appendix A). The off-diagonal entries of the matrices $\dot{\Sigma}_q$, for q = 1, …, p, based on the perturbed data, are used as predictors. The stacking regression validation algorithm in Appendix A summarizes the full procedure.
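
The regression step can be sketched as follows, assuming the Fisher transformation for f, the ridge penalty, no intercept (as in Equation (5)), and a value of λ supplied by the caller rather than chosen by cross-validation. All function names and the toy matrices are hypothetical, and the closed-form ridge solution is used for brevity; this is not the authors' code.

```python
import numpy as np

def fisher(r, eps=1e-6):
    """Fisher transformation f(x) = 0.5 * log((1 + x) / (1 - x)), clipped away from -1 and 1."""
    return np.arctanh(np.clip(r, -1.0 + eps, 1.0 - eps))

def upper_offdiag(m):
    """Entries with i < j as a flat vector."""
    return m[np.triu_indices(m.shape[0], k=1)]

def fit_ridge_weights(reference_corr, estimates, lam):
    """Minimise sum_{i<j} (f(y) - f(W) beta)^2 + lam * ||beta||^2 in closed form."""
    y = fisher(upper_offdiag(reference_corr))
    w = np.column_stack([fisher(upper_offdiag(e)) for e in estimates])
    p = w.shape[1]
    return np.linalg.solve(w.T @ w + lam * np.eye(p), w.T @ y)

def stack_fisher(estimates, beta):
    """Equation (3) with F the entrywise Fisher transformation, mapped back with tanh."""
    z = sum(b * fisher(e) for b, e in zip(beta, estimates))
    out = np.tanh(z)
    np.fill_diagonal(out, 1.0)
    return out

# Toy check: when one candidate equals the reference and lam = 0, it receives weight 1.
reference = np.array([[1.0, 0.5, 0.2],
                      [0.5, 1.0, -0.3],
                      [0.2, -0.3, 1.0]])
other = np.array([[1.0, 0.1, 0.4],
                  [0.1, 1.0, 0.0],
                  [0.4, 0.0, 1.0]])
beta_hat = fit_ridge_weights(reference, [reference, other], lam=0.0)
sigma_ridge = stack_fisher([reference, other], beta_hat)
```

In the full procedure the response and predictors would be stacked over several downsampled copies of the reference data, and the stacked matrix would still need the nearest-correlation repair if it is not positive semi-definite.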

3. Simulations

In this section we present an extensive simulation study showing that SCENA is superior to other methods in terms of correlation matrix completion, dimensionality reduction, clustering, and conditional dependence graphical modeling. In Section 3.1 we describe how we generate realistic artificial scRNA-seq data based on real data sets. Specifically, given a real scRNA-seq data set, we first extract a reference data set, a subset of data where all zeros can be safely assumed to be true values and not dropouts. Then, we generate downsampled data by creating dropouts in the reference scRNA-seq data according to the Poisson-Gamma downsampling algorithm in Appendix A. Finally, in Section 3.2 we assess the performance of SCENA and other existing methods at recovering the correlation structure of the reference data based on the corrupted downsampled data and other available auxiliary data.

3.1. Generating scRNA-seq data

3.1.1. Original data sets

We use three human scRNA-seq data sets in this simulation study:

  1. chu: human embryonic stem cells [2].

  2. chu_time: human definitive endoderm cells (time-series sequencing) [2].

  3. darmanis: human brain cells [4].

The number of cells and number of cell types are reported in Table 1 (genes with zero expression in all cells are removed).

Table 1:

Human scRNA-seq data sets used in simulations.

| | chu | chu_time | darmanis |
|---|---|---|---|
| Tissue | embryonic stem cells | definitive endoderm cell | brain cell |
| # cell types | 7 | 6 | 5 |
| # cells | 1,018 | 758 | 366 |
| # genes | 21,413 | 18,294 | 17,738 |
| % zeros | 47.43% | 51.15% | 80.06% |
| # reference cells | 951 | 689 | 332 |
| # reference genes | 2,522 | 2,160 | 2,306 |
| Citation | [2] | [2] | [4] |
| GEO accession code | GSE75748 | GSE75748 | GSE67835 |

3.1.2. Reference data sets

For each of the three data sets, we first match the genes with those available in the auxiliary data (Section 2.1), then apply the reference data selection algorithm (Appendix A) to perform quality control by filtering out low quality genes and cells, and finally extract the reference data. All values in the reference data are treated as true gene expressions, i.e. the reference data is free from dropouts. The dimensions of the three resulting reference data sets are reported in Table 1.

3.1.3. Downsampled data

For each of the three reference data sets, we apply the Poisson-Gamma downsampling algorithm (Appendix A) to generate downsampled versions of the reference data. We set s = 10, and r = 3000, 1000, 1000 for the chu, chu_time and darmanis data, respectively, so that the expected percentage of zeros in the downsampled data is similar to the percentage of zeros in the original scRNA-seq data (Table 1).

3.2. Models comparison

In this section we assess the performance of SCENA and other existing methods at recovering the correlation structure of the reference data based on the corrupted downsampled data and other available auxiliary data. We show that SCENAaverage and SCENAridge (Section 2.2) outperform SAVER, drImpute, scRMD, and PRIME in terms of correlation matrix completion, dimension reduction, clustering, and graphical modeling. The results shown are averaged across multiple downsampled data sets.

3.2.1. Correlation completion

We measure the similarity between the correlation matrix estimators considered and the reference correlation matrix $\hat{\Sigma}_{ref}$ in terms of mean squared error (MSE) and average correlation matrix distance (CMD) [12]. The baseline is set to be the MSE and CMD between $\hat{\Sigma}_{ref}$ and the sample correlation of the downsampled data (blind estimate $\hat{\Sigma}_s$, Section 2.1), treating all 0s as true gene expressions. Table 2 shows that SCENAridge has the lowest MSE and CMD among all methods, including the baseline, in the chu_time and darmanis data, and the lowest CMD in the chu data. SCENAaverage also has a lower MSE than SAVER and PRIME in all three data sets.
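
The two accuracy measures can be written compactly as below. This sketch assumes that the MSE is averaged over all matrix entries and that the CMD of [12] is the commonly used form 1 − tr(AB)/(‖A‖_F ‖B‖_F); both are assumptions, since the exact conventions are not spelled out in the text.

```python
import numpy as np

def corr_mse(est, ref):
    """Mean squared error over all entries of the two matrices (assumed convention)."""
    return float(np.mean((est - ref) ** 2))

def corr_matrix_distance(est, ref):
    """Correlation matrix distance, assumed to be 1 - tr(A B) / (||A||_F ||B||_F)."""
    num = np.trace(est @ ref)
    den = np.linalg.norm(est, "fro") * np.linalg.norm(ref, "fro")
    return float(1.0 - num / den)

ref = np.array([[1.0, 0.5], [0.5, 1.0]])
est = np.array([[1.0, 0.3], [0.3, 1.0]])
print(corr_mse(est, ref), corr_matrix_distance(est, ref))
```

Under this definition the distance is 0 when the two matrices are equal up to a positive scaling factor and grows towards 1 as their structures diverge.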

Table 2:

Correlation completion accuracy. MSE and Correlation Matrix Distance (CMD) between the reference correlation and the estimated correlation derived from various methods. SCENAridge has the lowest MSE and CMD among imputation methods in chu_time and darmanis data, and is lower than the baseline (“Downsample”), which is the sample correlation matrix of the downsampled data. All other imputation approaches perform worse than the baseline.

| Method | MSE (chu) | MSE (chu_time) | MSE (darmanis) | CMD (chu) | CMD (chu_time) | CMD (darmanis) |
|---|---|---|---|---|---|---|
| Downsample | 0.01008 | 0.00796 | 0.00804 | 0.21217 | 0.15855 | 0.22897 |
| SAVER | 0.01949 | 0.01850 | 0.01476 | 0.44604 | 0.40991 | 0.52235 |
| drImpute | 0.01181 | 0.00796 | 0.01371 | 0.23595 | 0.16050 | 0.33769 |
| scRMD | 0.01125 | 0.00868 | 0.00879 | 0.23598 | 0.17621 | 0.26170 |
| PRIME | 0.03707 | 0.03508 | 0.02661 | 0.47846 | 0.42137 | 0.33218 |
| SCENA_average | 0.01285 | 0.01119 | 0.01313 | 0.27283 | 0.21596 | 0.42946 |
| SCENA_ridge | 0.01212 | 0.00708 | 0.00513 | 0.20497 | 0.12942 | 0.12474 |

3.2.2. Dimension reduction

For each correlation matrix estimate from imputed data and SCENA, we compute the matrix of eigenvectors V, and obtain the principal component scores $U = Z^T V$, where Z is a standardized version of the log transformed scRNA-seq data Xs. In Figure 1, we compare the scatterplots of the top two PC scores of the cells against each other. The plots are colored by cell type labels, and PCs are derived from sample correlations of reference data, SAVER imputed data, and correlation estimates of SCENAridge and SCENAaverage. Both SCENAridge and SCENAaverage recover the reference data structure better than data imputation, and yield scatterplots with a clear separation among different types of cells indicated by the cell type labels.
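
The computation of the scores can be sketched in a few lines of NumPy. The function name `pc_scores`, the choice of standardizing each gene across cells with the sample standard deviation, and the random toy data are assumptions made for illustration.

```python
import numpy as np

def pc_scores(corr, x_log, n_components=2):
    """Principal component scores U = Z^T V for a genes-by-cells matrix.

    corr: N x N gene-by-gene correlation estimate.
    x_log: N x M log-transformed expression matrix.
    Returns an M x n_components matrix of cell scores.
    """
    mean = x_log.mean(axis=1, keepdims=True)
    sd = x_log.std(axis=1, ddof=1, keepdims=True)
    sd[sd == 0] = 1.0                       # leave constant genes at zero after centering
    z = (x_log - mean) / sd
    vals, vecs = np.linalg.eigh(corr)
    order = np.argsort(vals)[::-1]          # largest eigenvalues first
    v = vecs[:, order[:n_components]]
    return z.T @ v

rng = np.random.default_rng(0)
x_log = rng.normal(size=(5, 20))            # 5 genes, 20 cells
scores = pc_scores(np.corrcoef(x_log), x_log)
```

When `corr` is the sample correlation of the same data, the score columns are mutually uncorrelated; with an externally estimated correlation matrix, such as a stacked estimate, this need not hold exactly.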

Figure 1:


Dimension reduction accuracy. Scatterplots of the top two PC scores of the cells colored by cell type. Both SCENAridge and SCENAaverage appear to recover the reference data structure better than SAVER, yielding scatterplots with a clear separation among different types of cells.

3.2.3. Clustering

We perform hierarchical cell clustering (Ward’s minimum variance method with Manhattan distance; [7]) based on the standardized principal components of the downsampled scRNA-seq data obtained from the different approaches considered (quantities U computed in Section 3.2.2). For each method, we use the top PCs with proportion of variance explained within the range (90%, 99%), and set the number of clusters equal to the number of true cell labels in the scRNA-seq data. To assess clustering performance, we measure the similarity between cluster assignments and true cell labels by calculating the adjusted Rand index (ARI). This metric is at most 1, is close to 0 (and can be slightly negative) for chance-level agreement, and takes larger values for stronger similarity. In Figure 2 we can see that SCENAaverage yields the best clustering performance over all other methods in all three data sets. Interestingly, in the chu data, SCENAaverage has even better performance than the clustering obtained from reference data, in accordance with the fact that SCENA exploits auxiliary information besides the scRNA-seq data.
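
For reference, the adjusted Rand index can be computed from the contingency table of the two labelings using only the standard library. The function name and the toy labels below are illustrative; the formula is the standard pair-counting definition.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same items."""
    n = len(labels_true)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0
    # Pairs of items that are together in both partitions, in the true one, and in the predicted one.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    together_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    together_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = together_true * together_pred / total_pairs
    maximum = (together_true + together_pred) / 2
    if maximum == expected:
        return 1.0
    return (together_both - expected) / (maximum - expected)

same_partition = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
crossed_partition = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

Relabeling the clusters does not change the index, and a labeling that cuts across the true groups yields a value below zero.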

Figure 2:


Clustering performance. Adjusted rand index (higher is better) of cell type grouping via hierarchical clustering after dimension reduction via PCA explaining various proportions of variance. SCENAaverage yields the best clustering performance over all other methods in all data sets, and even better than the clustering obtained from the reference data in the chu data.

3.2.4. Conditional dependence graphs

Conditional dependence graph estimation in the case where several pairs of variables are never observed jointly is a major statistical problem that has gained strong interest recently; a thorough theoretical investigation of the so-called graph quilting problem can be found in [32]. This problem is closely related to ours, where an extremely large number of gene pairs have no reliable empirical correlation estimates. Here, we investigate the graph recovery performance based on the various correlation matrix estimates via simulations. Specifically, we plug the correlation matrix estimates into the graphical lasso [36] to obtain gene-by-gene conditional dependence graphs via sparse precision matrix estimation. For simplicity, we compute graphs for only the top 50 most variable genes among cell types, identified by applying ANOVA to gene expression of reference data adjusted by cells’ library sizes. To evaluate the graph recovery performance of a method, we compute the F1-score with respect to the graph estimated from reference data. In Figure 3A we plot F1-score versus number of graph edges for all methods and data sets. SCENAridge recovers the reference graph better than the other methods in the chu and darmanis data, and it produces an F1-score similar to that of SAVER in the chu_time data. For illustration, in Figure 3B we also display conditional dependence graphs relative to the chu data with 50 edges.
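
The evaluation step can be sketched as follows, taking as given the sparse precision matrices produced by a graphical lasso fit. The helper names, the threshold `tol`, and the toy matrices are assumptions made for illustration; edges are read off the nonzero off-diagonal entries.

```python
import numpy as np

def graph_from_precision(theta, tol=1e-8):
    """Adjacency matrix with an edge wherever an off-diagonal precision entry is nonzero."""
    adj = np.abs(theta) > tol
    np.fill_diagonal(adj, False)
    return adj

def edge_f1(adj_est, adj_ref):
    """F1-score of the estimated edge set with respect to the reference edge set."""
    iu = np.triu_indices(adj_ref.shape[0], k=1)   # count each undirected edge once
    est, ref = adj_est[iu], adj_ref[iu]
    true_pos = np.sum(est & ref)
    if true_pos == 0:
        return 0.0
    precision = true_pos / est.sum()
    recall = true_pos / ref.sum()
    return float(2 * precision * recall / (precision + recall))

theta_ref = np.array([[2.0, -0.5, 0.0],
                      [-0.5, 2.0, 0.4],
                      [0.0, 0.4, 2.0]])
theta_est = np.array([[2.0, -0.3, 0.2],
                      [-0.3, 2.0, 0.0],
                      [0.2, 0.0, 2.0]])
score = edge_f1(graph_from_precision(theta_est), graph_from_precision(theta_ref))
```

Here the estimated graph recovers one of the two reference edges and adds one spurious edge, so precision and recall are both one half.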

Figure 3:


Genetic graph recovery. A: F1 score (higher is better) quantifying the performance of methods at recovering the reference conditional dependence graphs of 50 most variable genes for various numbers of edges. SCENAridge exhibits strong performance for all data sets, while other methods’ performance dramatically changes across different data sets. B: Conditional dependence graphs of chu data, setting the number of edges to 50.

4. Application to stem cell data

We now apply the methods to the analysis of the chu data set (Table 1) containing the gene expression of 6,038 genes (the largest gene set that matched available auxiliary information) measured in 1,018 human embryonic stem cells. In Figure 4A we plot the first two principal components based on SCENAaverage and SCENAridge, while in Figure 4B we compare the cell clustering performance of SCENA with other methods in terms of ARI. The hierarchical clustering based on SCENAaverage performs the best at recovering true cell type labels, in accordance with simulation results (Section 3.2.3). Finally, in Figure 4C we display the conditional dependence graph (graphical lasso; [36]) of the 30 most variable genes among cell types (ANOVA criterion as in Section 3.2.4) based on SCENAridge, with 163 edges selected via the Extended Bayesian Information Criterion (EBIC, [6]). The protein coding gene DNMT3B is the hub node with the largest number of connections (20 edges). This result is reasonable because DNMT3B is a catalytically active DNA methyltransferase [24], and is specifically expressed in totipotent embryonic stem cells [34]. Moreover, DNMT3B is one of the pluripotency markers with a high level of expression in the cell type H1, as demonstrated by [2]. In addition, genes IFI16 [15] and HAND1 [33] are marker genes in cell type EC and cell type TB, respectively, and correspondingly have relatively large numbers of connections.

Figure 4:


Real data application results for the chu data. A. Dimension reduction: scatterplots of the top two PC scores of the cells colored by cell type. B. Clustering: adjusted rand index of cell type grouping via hierarchical clustering after dimension reduction using PCA explaining various proportions of variance. SCENAaverage yields the best clustering performance over all other methods. C: Conditional dependence graph (graphical lasso; 163 edges selected via EBIC) based on SCENAridge correlation estimate. Gene DNMT3B is the hub node with the largest number of connections (20 edges). This result is supported by the scientific literature as DNMT3B is one of the pluripotency markers with high level of expression in the cell type H1 [2]. Also, genes IFI16 and HAND1 are marker genes in cell type EC and cell type TB, respectively, and correspondingly have relatively large numbers of connections.

5. Discussion

We have proposed and studied SCENA, a novel methodology for gene-by-gene correlation matrix estimation from dropout-corrupted single cell RNA-seq data. SCENA builds upon the strikingly simple but powerful idea of optimally combining multiple gene-by-gene correlation matrices derived from various sources of information, besides the scRNA-seq data of interest. This combination is implemented efficiently via model stacking techniques.

We have demonstrated that SCENA can provide superior estimation performance compared to traditional data imputation methods. In our analyses, SCENAridge remarkably recovered the information underlying the corrupted scRNA-seq data in terms of correlation completion, dimension reduction, and graphical modeling, while the hierarchical clustering based on SCENAaverage yielded cell groupings that best reflected true cell type heterogeneity in terms of adjusted Rand index. Indeed, although both variants combine the same single correlation matrices via model stacking, the weighting coefficients of SCENAridge are calibrated for the optimal recovery of the true correlation structure of the corrupted gene expression data, while SCENAaverage simply assigns uniform weights, presumably upweighting auxiliary biological network structures that are more informative about cell characteristics. SCENAaverage is computationally cheaper than SCENAridge, because the estimation of the weighting coefficients of SCENAridge involves multiple additional imputation and optimization steps that are computationally expensive. For instance, in the application presented in Section 4, the model stacking step took about 40 minutes for SCENAridge, but only about 30 seconds for SCENAaverage, on a laptop with 16GB of RAM (2133 MHz) and a dual-core processor (3.1 GHz). Given all these considerations, we recommend using SCENAaverage for the analysis of massive scRNA-seq data sets.

While we have demonstrated our approach using specific auxiliary sources, SCENA is general and conducive to many different types of correlation imputation approaches and additional sources of auxiliary information on genetic interactions. Additionally, our approach can be further optimized using different machine learning approaches to model stacking and ensemble learning. Overall, we expect SCENA to become an important instrument for downstream analyses of massive scRNA-seq data that powerfully incorporates known auxiliary information on genetic interactions.

A. Algorithms

[Reference data selection]

Input: N × M data matrix X; parameter vector a.

  1. Filter out cells with library size greater than a1-th percentile.

  2. Remove genes with mean expression less than a2-th percentile.

  3. Remove genes whose number of non-zero cells is less than the a3-th percentile.

  4. Keep cells with library size greater than the a4-th percentile.

  5. Keep genes with non-zero proportion greater than a5-th percentile.

Output: N′ × M′ reference data matrix Y. We use default values a1 = 95, a2 = 25, a3 = 15, a4 = 5, a5 = 50.
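
One possible reading of the five filtering steps is sketched below. The steps leave some details open, so this sketch makes explicit assumptions: percentiles are recomputed on the data remaining after each step, ties at a threshold are resolved as written in the comments, and `select_reference` is a hypothetical name.

```python
import numpy as np

def select_reference(x, a=(95, 25, 15, 5, 50)):
    """Return (reference matrix, kept gene indices, kept cell indices) for a genes-by-cells matrix."""
    x = np.asarray(x, dtype=float)
    genes, cells = np.arange(x.shape[0]), np.arange(x.shape[1])
    a1, a2, a3, a4, a5 = a

    # Step 1: filter out cells with library size above the a1-th percentile.
    lib = x.sum(axis=0)
    cells = cells[lib <= np.percentile(lib, a1)]

    # Step 2: remove genes with mean expression below the a2-th percentile.
    mean_expr = x[np.ix_(genes, cells)].mean(axis=1)
    genes = genes[mean_expr >= np.percentile(mean_expr, a2)]

    # Step 3: remove genes whose number of non-zero cells is below the a3-th percentile.
    nonzero = (x[np.ix_(genes, cells)] != 0).sum(axis=1)
    genes = genes[nonzero >= np.percentile(nonzero, a3)]

    # Step 4: keep cells with library size above the a4-th percentile.
    lib = x[np.ix_(genes, cells)].sum(axis=0)
    cells = cells[lib > np.percentile(lib, a4)]

    # Step 5: keep genes with non-zero proportion above the a5-th percentile.
    prop = (x[np.ix_(genes, cells)] != 0).mean(axis=1)
    genes = genes[prop > np.percentile(prop, a5)]

    return x[np.ix_(genes, cells)], genes, cells

# Toy counts: 30 genes with different expression rates, 40 cells.
rng = np.random.default_rng(1)
rates = rng.uniform(0.2, 2.0, size=(30, 1))
x_demo = rng.poisson(rates, size=(30, 40))
reference, kept_genes, kept_cells = select_reference(x_demo)
```

With the strict inequality in step 5, data in which most genes are nonzero in every cell could leave very few genes, so the tie handling may need adjusting for real data.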

[Poisson-Gamma downsampling]

Input: N × M data matrix X; parameters s, r > 0.

  1. Draw $Z_1, \ldots, Z_M \sim$ i.i.d. $\Gamma(s, r)$.

  2. Draw $\tilde{X}_{ij} \sim \mathrm{Poisson}(X_{ij} Z_j)$, for i = 1, …, N and j = 1, …, M.

Output: N × M downsampled data matrix $\tilde{X}$.
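
The two steps translate directly into NumPy. The sketch assumes that Γ(s, r) is parameterized by shape s and rate r, so that each cell's scaling factor has mean s/r; NumPy's gamma sampler takes a scale argument, hence `scale = 1 / r`. The function name and the toy counts are illustrative.

```python
import numpy as np

def poisson_gamma_downsample(x, s, r, rng=None):
    """Downsample a genes-by-cells count matrix with one Gamma factor per cell."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=float)
    # Step 1: one factor Z_j per cell (column), assuming shape s and rate r.
    z = rng.gamma(shape=s, scale=1.0 / r, size=x.shape[1])
    # Step 2: Poisson counts with mean X_ij * Z_j; true zeros stay zero.
    return rng.poisson(x * z[np.newaxis, :])

counts = np.array([[0, 50, 200],
                   [1000, 0, 30]])
downsampled = poisson_gamma_downsample(counts, s=10, r=100, rng=np.random.default_rng(0))
```

Entries that are zero in the input have Poisson mean zero and therefore remain zero, while small positive counts are likely to become artificial dropouts.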

[Stacking regression validation]

Input: N ×M scRNA-seq data matrix X; reference parameter vector a; downsampling parameters s, r > 0; collection of N × N single correlation matrices (Section 2.1); number of downsampling repeats B; transform function f.

  1. Obtain reference N′ × M′ data matrix $X'$ via the reference data selection algorithm with parameter a.

  2. Construct response vector $y \in \mathbb{R}^{N'(N'-1)/2}$ by extracting off-diagonal entries from $\hat{\Sigma}_{X'} = \mathrm{cor}(X')$.

  3. For b = 1, …, B:
    1. Generate downsampled N′ × M′ data matrix $\tilde{X}'_b$ from $X'$ via the Poisson-Gamma downsampling algorithm with parameters s, r.
    2. Obtain all N′ × N′ single correlation estimates $\dot{\Sigma}_1^{(b)}, \ldots, \dot{\Sigma}_p^{(b)}$ based on $\tilde{X}'_b$ and the auxiliary correlation matrices.
    3. Construct predictors matrix $W^{(b)} \in \mathbb{R}^{N'(N'-1)/2 \times p}$ by extracting off-diagonal entries from each of $\dot{\Sigma}_1^{(b)}, \ldots, \dot{\Sigma}_p^{(b)}$.
  4. Compute $\hat{\beta}$ by regressing f(y) on f(W) via Equation (6), where $y = (y^T, y^T, \ldots, y^T)^T$ and $W = (W^{(1)T}, \ldots, W^{(B)T})^T$.

Output: N × N correlation matrix $\tilde{\Sigma}$ via Equation (3) with vector of coefficients $\hat{\beta}$.

Acknowledgments

L.G., G.V., and G.A. are supported by NIH 1R01GM140468, NSF DMS-1554821 and NSF NeuroNex-1707400. G.V. is additionally supported by a Rice Academy Postdoctoral Fellowship and the Dan L. Duncan Foundation.

Contributor Information

Luqin Gan, Rice University.

Giuseppe Vinci, University of Notre Dame.

Genevera I. Allen, Rice University.

