Accounting for technical noise in Bayesian graphical models of single-cell RNA-sequencing data

Jihwan Oh; Changgee Chang; Qi Long

Biostatistics. 2021 Sep 14;24(1):161–176. doi: 10.1093/biostatistics/kxab011

PMCID: PMC9748577 PMID: 34520533

Summary

Single-cell RNA-sequencing (scRNAseq) data contain a high level of noise, especially in the form of zero-inflation, that is, the presence of an excessively large number of zeros. This is largely due to dropout events and amplification biases that occur in the preparation stage of single-cell experiments. Recent scRNAseq experiments have been augmented with unique molecular identifiers (UMI) and External RNA Control Consortium (ERCC) molecules which can be used to account for zero-inflation. However, most of the current methods on graphical models are developed under the assumption of the multivariate Gaussian distribution or its variants, and thus they are not able to adequately account for an excessively large number of zeros in scRNAseq data. In this article, we propose a single-cell latent graphical model (scLGM)—a Bayesian hierarchical model for estimating the conditional dependency network among genes using scRNAseq data. Taking advantage of UMI and ERCC data, scLGM explicitly models the two sources of zero-inflation. Our simulation study and real data analysis demonstrate that the proposed approach outperforms several existing methods.

Keywords: Bayesian hierarchical model, Graphical models, Single-cell RNA-sequencing, Zero-inflation

1. Introduction

Advances in high-throughput sequencing technology have paved the way for utilizing RNA-sequencing data in biomedical research. Especially during the last decade, various statistical methods have been developed to analyze high-dimensional data from the bulk RNA-sequencing (bRNAseq) experiments. For example, Chun and others (2015) proposed a method to retrieve gene networks from bRNAseq data. However, the bRNAseq technology ignores heterogeneity among individual cells and is not appropriate for analyzing the data with cellular diversity, because the expression levels are summed over different types of input cells in the tissue of interest.

More recently, researchers have started to use the single-cell RNA-sequencing (scRNAseq) technology (Tang and others, 2009), which was named the “Method of the Year 2013” by Nature Methods (Editorial, 2014). Unlike traditional bRNAseq, each observation from an scRNAseq experiment consists of the gene expression levels of an individual cell. This fundamental difference gives scientists a better view of cell-to-cell heterogeneity, supporting tasks such as subpopulation identification (Buettner and others, 2015) and the study of heterogeneous cell responses (Harari and others, 2005) and of stochasticity in gene expression (Elowitz and others, 2002).

On the other hand, the scRNAseq technology brought us its own problems. Even though scRNAseq data are often structurally indistinguishable from bRNAseq data, the scarcity in starting materials in scRNAseq experiments often results in technical noise (Jia and others, 2017)—highly frequent dropout events and severe amplification biases. The dropout event refers to the situation where a transcript expressed in the cell is lost during the library preparation (Gong and others, 2018), and the amplification bias happens when the end product of the amplification does not faithfully recapitulate the amount of starting DNA (Islam and others, 2014).

In this article, we focus on the problem of inferring the gene conditional dependency network from scRNAseq data by properly addressing the aforementioned special characteristics. Note that most of the available methods are not suitable for scRNAseq data. For example, many existing methods on graphical models (Meinshausen and Bühlmann, 2006; Yuan and Lin, 2007; Banerjee and others, 2008; Peng and others, 2009; Fan and others, 2009; Lam and Fan, 2009; Cai and others, 2011; Li and others, 2012) including graphical LASSO (GLASSO) (Friedman and others, 2008) adopted the multivariate Gaussian assumption, as it simplifies the identification of conditional independence between a pair of random variables into the problem of identifying the zero entry in the precision matrix. On these foundations, many researchers went one step further towards more general settings (Fukumizu and others, 2007; Liu and others, 2009, 2011, 2012; Harris and Drton, 2013; Voorman and others, 2013; Li and others, 2014; Székely and Rizzo, 2014; Wang and others, 2015). However, these methods are not capable of taking into account the scRNAseq related issues.

Recently, several new methods for the analysis of scRNAseq data have been introduced. Oh and others (2018) proposed a machine learning method, in which the zero inflations are regarded as outliers. They proposed a method in which two robust statistical methods—support vector regression and Hilbert–Schmidt information criterion—are applied to calculate the partial correlation coefficients among the genes. On the other hand, McDavid and others (2019) incorporated the zero inflations into their multivariate Hurdle model using a finite mixture of singular Gaussian distributions. Their model permits inference on statistical independence in zero-inflated, semicontinuous data to learn undirected Markov graphical models. Yet, their model does not take into account where and how the technical noises occur in scRNAseq data.

To overcome such limitations, we propose single-cell latent graphical model (scLGM)—a new method that estimates the conditional dependency network among the gene expression levels from scRNAseq data. Unlike Oh and others (2018), our model explicitly incorporates the source of technical noise by introducing two types of parameters—cell-specific parameters and gene-specific parameters—in a unified Bayesian hierarchical structure, so that both dropout events and amplification bias can be well explained while estimating the conditional dependency network. Specifically, the cell-specific parameters account for the special characteristics of scRNAseq data while the gene-specific parameters explicate the conditional independence structure between true but unobservable underlying gene expression levels which are assumed to follow the multivariate Gaussian distribution.

The introduction of cell-specific parameters is motivated by two recent technologies: unique molecular identifiers (UMIs) and External RNA Controls Consortium (ERCC) spike-in molecules. First, UMIs enable us to accurately identify true polymerase chain reaction (PCR) duplicates in high-throughput sequencing experiments (König and others, 2010; Kivioja and others, 2012; Islam and others, 2014; Smith and others, 2017). Because UMIs distinguish identical copies arising from distinct molecules by attaching a random barcode to each individual fragment during the library preparation step, they establish a one-to-one mapping between the set of unique UMI barcodes and the set of unique fragments that have been sequenced. Second, each observed cell is augmented with ERCC spike-in molecules, which are designed to be added to an RNA analysis experiment following a sample isolation step. Adding these external RNA controls enables researchers to track cell-to-cell variability in scRNAseq experiments (Jiang and others, 2011; Stegle and others, 2015; Bacher and Kendziorski, 2016).

In this article, we propose a new method that infers the conditional dependency network from scRNAseq data. To account for the technical noise in single-cell data, we adopt an approach similar to that of Jia and others (2017), who use the aforementioned technologies to analyze differentially expressed genes in single-cell data. Because the cell-specific parameters properly take into account the technical noise that occurs in the reverse transcription step and the preamplification step of current scRNAseq protocols (Hicks and others, 2018), our model is able to latently classify each observed zero as either a true zero in the cell or a false zero attributed to technical noise, and it incorporates a specific mechanism of zero-inflation. In this sense, scLGM is more interpretable than the approach of McDavid and others (2019).

The two sets of parameters are estimated separately: the cell-specific parameters are estimated with Gibbs sampling, and the gene-specific parameters are estimated with the variational expectation–maximization (EM) algorithm. Simulation studies show that scLGM outperforms other methods in terms of edge selection. An application to a mouse scRNAseq data set is presented, and the results are compared to the Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway database (Kanehisa and others, 2016).

The remainder of this article is organized as follows. We describe scLGM in Section 2. The computational details are provided in Section 2.1, which comprises two separate algorithms: one for the estimation of the cell-specific parameters and the other for the estimation of the gene-specific parameters. Section 2.2 then discusses an alternative approach. The results of the simulation studies are summarized in Section 3, and the real data analysis follows in Section 4. We conclude in Section 5.

2. Modeling of scRNAseq data

In this section, we describe our model, which can deal with the technical noise in scRNAseq data. The model is indexed by cell and by gene. For each gene in each cell, one random variable represents the unobservable true gene expression level and another represents the observed UMI count. Auxiliary random variables indicate whether a given gene has been captured via UMI counting during the library preparation step of a given cell. The statistical variations of our observations are modeled in the following way.

(1)
For each cell, the random vector of unobservable true counts of the genes in that cell jointly follows a multivariate log-normal distribution, parameterized by a mean vector and an inverse covariance matrix with one dimension per gene. Equivalently, the log-transformed true expression levels follow a multivariate Gaussian distribution, which simplifies manipulations of the model.
(2)
A binary random variable records, for each gene in each cell, whether the gene has been captured well in the library of that cell or a dropout event has happened. We assume that the probability of a dropout event depends on the unobservable true expression level of the corresponding gene via the probit model (Albert and Chib, 1993), that is, through the cumulative distribution function of the standard normal random variable. This reflects the fact that the chance of a gene being captured in the library increases as the true expression level of the gene increases. To facilitate computations, we employ additional auxiliary random variables.
(3)
The conditional distribution of the observed count, given the true expression level and the capture indicator, combines a Dirac measure concentrated at zero with a count distribution governed by two cell-specific parameters: one denotes the capture efficiency of reverse transcription, and the other reflects the amplification rate. The link function in this distribution reflects the fact that genes are amplified exponentially, and it is through these parameters that the model represents the amplification bias.

We collect the four kinds of cell-level quantities introduced above into vectors, which we call the cell-specific parameters, as opposed to the gene-specific parameters (the mean vector and the inverse covariance matrix).
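The three modeling steps can be illustrated with a minimal simulation sketch for a single gene in a single cell. It is not the authors' implementation. The names `alpha`, `beta`, `kappa`, and `tau` are hypothetical labels for the four cell-specific parameters, and the linear predictors (intercept plus slope times the latent log-expression level) are an assumption of this sketch. The Poisson count distribution follows the simulation description in Section 3.

```python
import math
import random
from statistics import NormalDist


def sample_poisson(rate, rng):
    """Knuth's multiplication method; adequate for the small rates used here."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def simulate_count(x, alpha, beta, kappa, tau, rng):
    """Draw one observed UMI count given a latent log-expression level x.

    ASSUMPTION: both the capture probability and the Poisson rate use a
    linear predictor in x; alpha, beta, kappa, tau are hypothetical names.
    Returns (count, captured).
    """
    capture_prob = NormalDist().cdf(kappa + tau * x)  # probit capture model
    captured = rng.random() < capture_prob
    if not captured:
        return 0, False  # dropout: point mass at zero
    rate = math.exp(alpha + beta * x)  # exponential (log) link
    return sample_poisson(rate, rng), True
```

Calling `simulate_count` repeatedly with a fixed `random.Random` instance produces zeros of two kinds: zeros with `captured == False` (dropout) and zeros with `captured == True` (a captured gene whose Poisson draw happened to be zero).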

2.1. Computation

Collect the latent variables and the observed counts into matrices with one entry per cell and gene. Our goal is to find the nonzero entries of the inverse covariance matrix, as they represent the conditional dependencies among the true but unobservable gene expression levels. The likelihood function of our model is built from the probability density functions (PDFs) or probability mass functions of the distributions described above. A graphical summary is provided in Figure 1. The following two subsections describe how our method first estimates the cell-specific parameters and then uses them to estimate the gene-specific parameters.

Fig. 1. — **Modeling Scheme.** In our model, the unobservable true count of mRNA affects both the probability of it being captured in the library and the observed counts. Rectangles represent the model parameters, uncolored circles the latent variables, and colored circles the observed variables.

2.1.1. Estimation of cell-specific parameters

Consider the vector of log-scaled counts of the predetermined ERCC spike-in molecules, together with the observed UMI count of each spike-in in each cell. We use separate notation for the spike-ins to distinguish these fake RNAs from real RNAs. As mentioned in Section 1, we know the number of ERCC spike-in molecules in each cell. The expression levels of those fake genes serve as a control group against the true but unobservable gene expression levels. We connect these known spike-in counts and the corresponding observed UMI counts to estimate the cell-specific parameters for each cell in the following way.

First, for each cell, we estimate one pair of cell-specific parameters by regressing the log-transformed nonzero UMI counts of the spike-ins on their known log-scaled counts. This procedure is based on the form that the conditional expectation of the nonzero UMI counts takes under the model.

As described in Jia and others (2017), the data are not missing at random. However, they also showed in simulation studies that the biases of these estimators are hardly noticeable, indicating that the biases are under control with this estimation procedure.
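The regression step for one cell can be sketched as an ordinary least-squares fit that simply discards zero counts. The function name `fit_log_linear` is hypothetical, and treating the fitted intercept and slope as the estimated pair of cell-specific parameters is an assumption of this sketch.

```python
import math


def fit_log_linear(spike_log_counts, umi_counts):
    """Least-squares fit of log(UMI count) on known log spike-in amounts.

    Only nonzero UMI counts are used. Returns (intercept, slope).
    ASSUMPTION: intercept and slope stand in for the pair of cell-specific
    parameters estimated in this step.
    """
    pairs = [(x, math.log(y))
             for x, y in zip(spike_log_counts, umi_counts) if y > 0]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two nonzero spike-in counts")
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    if sxx == 0:
        raise ValueError("spike-in amounts must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope
```

The fit is run once per cell, using that cell's spike-in UMI counts against the same vector of known log-scaled spike-in amounts.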

Next, we use Gibbs sampling to estimate the remaining pair of cell-specific parameters for each cell. Specifically, we place Gaussian priors on these two parameters and work with the joint density of the cell-specific parameters, the latent variables, and the observed counts, which is built from normal PDFs. This joint distribution yields Gaussian conditional distributions for all parameters. Their exact forms for Gibbs sampling can be found in the Supplementary material available at Biostatistics online. We estimate the two parameters by the averages of the corresponding Markov chain Monte Carlo (MCMC) samples.
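The flavor of this step can be conveyed by a simplified Gibbs sampler for a probit model with Gaussian priors, using the latent-Gaussian augmentation of Albert and Chib (1993). It is a sketch under strong assumptions and not the sampler given in the Supplementary material: the capture indicators `z` are treated as fully observed, the priors are independent zero-mean Gaussians, and the names `kappa`, `tau`, and `gibbs_probit` are hypothetical.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def sample_std_normal_below(bound, rng):
    """Draw from a standard normal truncated to (-inf, bound] by inverse CDF."""
    u = STD_NORMAL.cdf(bound) * rng.random()
    u = min(max(u, 1e-300), 1.0 - 2.0 ** -53)  # keep u strictly inside (0, 1)
    return STD_NORMAL.inv_cdf(u)


def gibbs_probit(x, z, prior_sd=10.0, n_iter=800, burn_in=300, seed=0):
    """Posterior means of (kappa, tau) in P(z = 1 | x) = Phi(kappa + tau * x)."""
    rng = random.Random(seed)
    kappa, tau = 0.0, 0.0
    prior_prec = 1.0 / prior_sd ** 2
    sum_xx = sum(xi * xi for xi in x)
    kappa_draws, tau_draws = [], []
    for it in range(n_iter):
        # 1) Latent Gaussian variables, truncated to agree with each indicator.
        w = []
        for xi, zi in zip(x, z):
            m = kappa + tau * xi
            if zi == 1:
                w.append(m - sample_std_normal_below(m, rng))   # w > 0
            else:
                w.append(m + sample_std_normal_below(-m, rng))  # w <= 0
        # 2) kappa | tau, w is Gaussian.
        var_k = 1.0 / (len(x) + prior_prec)
        mean_k = var_k * sum(wi - tau * xi for wi, xi in zip(w, x))
        kappa = rng.gauss(mean_k, var_k ** 0.5)
        # 3) tau | kappa, w is Gaussian.
        var_t = 1.0 / (sum_xx + prior_prec)
        mean_t = var_t * sum(xi * (wi - kappa) for wi, xi in zip(w, x))
        tau = rng.gauss(mean_t, var_t ** 0.5)
        if it >= burn_in:
            kappa_draws.append(kappa)
            tau_draws.append(tau)
    return sum(kappa_draws) / len(kappa_draws), sum(tau_draws) / len(tau_draws)
```

As in the text, the point estimates are the averages of the retained MCMC draws. Every conditional distribution sampled here is Gaussian or truncated Gaussian, which is what makes the sampler straightforward.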

2.1.2. Estimation of gene-specific parameters

Recall that our ultimate goal is to estimate the precision matrix of the unobservable true gene expression levels using only the observed UMI counts, while the other latent variables remain unknown. In this subsection, we propose an iterative algorithm for estimating this precision matrix.

We use an exponential distribution as the prior for the diagonal elements of the precision matrix and a Laplace distribution as the prior for its off-diagonal elements to impose sparsity. The location parameters are given noninformative flat priors. One hyperparameter of the prior controls the magnitude of the diagonal elements of the precision matrix, and another controls the sparsity of its off-diagonal elements. The Laplace priors force the off-diagonal elements to shrink towards zero (Tibshirani, 1996; Park and Casella, 2008).

We propose the maximum a posteriori (MAP) estimator for the gene-specific parameters, defined in (2.1) as the maximizer of their posterior marginal distribution, in which the cell-specific parameters are fixed at the estimates from the previous subsection. This posterior marginal distribution is, however, analytically intractable. Therefore, we augment the model with auxiliary random variables following Pólya-gamma distributions (Polson and others, 2013), as described in the Supplementary material available at Biostatistics online, and use the variational EM approach (Tzikas and others, 2008; Blei and others, 2017) to find the MAP estimate. Specifically, we maximize the evidence lower bound (ELBO) in (2.2), which is a lower bound of the evidence.

The ELBO involves a variational distribution that approximates the posterior distribution of the latent variables. Its components include the PDF of the Pólya-gamma distribution (Polson and others, 2013).

The variational parameters are chosen by maximizing the ELBO given the current gene-specific parameters, and the gene-specific parameters are then updated with the variational distribution held fixed. The update of the precision matrix is the optimization problem (2.3), whose input matrix is defined in (2.4).

Note that (2.3) can be seen as a graphical lasso problem in which the matrix in (2.4) plays the role of the sample covariance matrix, and it can be solved by the GLASSO algorithm (Friedman and others, 2008).
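For orientation, the graphical lasso in its standard form (Friedman and others, 2008) can be written as follows. The symbols are generic and are an assumption of this sketch, not necessarily the notation of (2.3): $S$ stands for whichever matrix plays the role of the sample covariance matrix, $\Omega$ for the precision matrix, and $\lambda \ge 0$ for the sparsity tuning parameter. The exact penalty in (2.3), in particular its treatment of the diagonal elements, may differ.

```latex
\hat{\Omega}
  = \arg\max_{\Omega \succ 0}
    \left\{ \log\det\Omega - \operatorname{tr}(S\Omega)
            - \lambda \lVert \Omega \rVert_{1} \right\},
\qquad
\lVert \Omega \rVert_{1} = \sum_{j,k} \lvert \Omega_{jk} \rvert .
```

The first two terms are the Gaussian log-likelihood up to constants, and the $\ell_1$ penalty is what sets entries of $\hat{\Omega}$ exactly to zero, so that missing edges can be read directly from the estimate.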

2.2. Alternative approach

The algorithm proposed above iterates two stages until convergence: (i) approximating the posterior marginal distribution of the gene-specific parameters by approximating the conditional distribution of the latent variables with a variational distribution, and then (ii) estimating the gene-specific parameters using the approximated marginal distribution. Note that these two steps can be viewed in a different way. The first step can be seen as the imputation of the sample covariance matrix of the underlying true gene expression levels, and the second step can be seen as finding the sparse precision matrix based on the imputed sample covariance matrix via GLASSO.

From this perspective, we introduce an alternative approach by replacing the second part of the algorithm, GLASSO, with another inverse covariance matrix estimation method, the constrained ℓ1-minimization for inverse matrix estimation (CLIME) of Cai and others (2011). In particular, we solve the CLIME optimization problem with the imputed sample covariance matrix from the first step, as described in (2.4), as its input. This new algorithm does not find the MAP estimator from the posterior density of our model, but it offers more scalable computation.
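For reference, the CLIME problem in its standard form (Cai and others, 2011) is shown below. The symbols are generic and are an assumption of this sketch: $S$ denotes the input covariance matrix (here, the imputed one), $I$ the identity matrix, and $\lambda \ge 0$ a tuning parameter. Both norms are elementwise.

```latex
\min_{\Omega} \; \lVert \Omega \rVert_{1}
\quad \text{subject to} \quad
\lVert S\Omega - I \rVert_{\infty} \le \lambda,
\qquad
\lVert A \rVert_{1} = \sum_{j,k} \lvert A_{jk} \rvert, \quad
\lVert A \rVert_{\infty} = \max_{j,k} \lvert A_{jk} \rvert .
```

This problem separates into one linear program per column of $\Omega$, and the resulting solution is then symmetrized. The column-wise separation is the main source of CLIME's computational convenience.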

3. Simulation studies

In this section, we compare the scLGM method against two other methods: GLASSO and HurdleNormal. A total of nine simulation scenarios are considered by combining three different graph structures with three different ratios of the number of genes to the number of cells. We fix the sample size and consider three different numbers of genes. For the other two methods, we applied a logarithmic transformation to the UMI counts after adding a very small number to accommodate zero counts.

The three graph structures are illustrated by three corresponding graphs in Figure 2. Specifically, the subfigure on the left represents our first graph structure, in which the vertices are divided into equally sized subsets. Each subset forms a hub structure in which one vertex is linked to all of the other nine vertices while those nine are not linked among themselves. The second graph structure is a random sparse graph whose edge density is kept at approximately the same level for all three numbers of genes. Lastly, we consider moralized directed acyclic graphs (MDAGs). We generated a directed acyclic graph (DAG) for each scenario by selecting directed edges from among the possible combinations. The graphs are then moralized by adding undirected edges among parent nodes that share the same child node and then converting all the edges to undirected ones.
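The hub construction and the moralization step described above can be sketched in a few lines. The function names are hypothetical, vertices are labeled by integers, and the default group size of ten follows from the description of one hub vertex linked to nine others.

```python
from itertools import combinations


def hub_edges(num_vertices, group_size=10):
    """Undirected edges of disjoint hubs: the first vertex of each group
    is linked to every other vertex of that group."""
    edges = set()
    for start in range(0, num_vertices, group_size):
        members = range(start, min(start + group_size, num_vertices))
        hub = members[0]
        for v in members[1:]:
            edges.add(frozenset((hub, v)))
    return edges


def moralize(num_nodes, directed_edges):
    """Undirected edge set of the moral graph of a DAG.

    directed_edges is an iterable of (parent, child) pairs.
    """
    parents = {v: set() for v in range(num_nodes)}
    undirected = set()
    for parent, child in directed_edges:
        parents[child].add(parent)
        undirected.add(frozenset((parent, child)))  # drop the direction
    for child_parents in parents.values():
        for a, b in combinations(sorted(child_parents), 2):
            undirected.add(frozenset((a, b)))  # link parents sharing a child
    return undirected
```

For example, the DAG with edges 0 → 2 and 1 → 2 moralizes to the triangle on vertices 0, 1, and 2, because the two parents of vertex 2 become linked.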

Fig. 2. — Shapes of the three simulated graphs for one of the considered numbers of genes.

On these three graph structures, we used existing algorithms to simulate the underlying true unobservable expression levels of each simulation scenario. For the hub-structured graphs and the random sparse graphs, we used the “huge” package (Zhao and others, 2012) to explicitly generate the inverse covariance matrix and then simulated the log-valued gene expression levels from the multivariate Gaussian distribution with the corresponding mean vector and covariance matrix. In the moralized DAG cases, we used the “spacejam” package (Voorman and others, 2013) to generate DAGs and to simulate data from multivariate normal distributions. Note that, in all scenarios, each element of the mean vector is generated from a normal distribution, and this choice determines the proportion of nonzero entries in the observed counts.

Given the sampled true log-expression levels from the multivariate normal distribution with the gene-specific parameters, we generate the capture indicators from the Bernoulli distribution and the observed counts from the Poisson distribution, as described in the modeling framework. In these simulated library preparation and amplification processes, the four cell-specific parameters for each of the 100 observations are generated from multivariate Gaussian distributions whose mean vectors and covariance matrices have been estimated from the cell class “CA1Pyr2” of the real data analyzed in Section 4.

The simulation results are summarized in Figure 3 and Table 1. First, Figure 3 consists of nine subfigures containing receiver operating characteristic (ROC) curves for the selection of edges. They consistently show that scLGM outperforms the other methods in terms of the area under the ROC curve. In the first two graph structures in particular, scLGM with the GLASSO M-step is the best performing method with respect to the Matthews correlation coefficient (MCC). For the MDAG structures, scLGM with the CLIME M-step performs best, while the performance of scLGM with the GLASSO M-step is comparable. The GLASSO method also works very well on the MDAG graphs, while HurdleNormal underperforms in all cases with respect to the area under the ROC curve.

Fig. 3. — Simulation results: each subfigure represents a different simulation scenario. Each column corresponds to a number of simulated genes (left, middle, and right), while each row corresponds to a graph structure (hub graph at the top; random graph in the middle; moralized DAG at the bottom). Each bullet on the ROC curves represents the graph selected with respect to the corresponding criterion.

Table 1.

Simulation results: we compare scLGM with Glasso M-step, scLGM with Clime M-step, HurdleNormal, and Glasso. Number of detected edges, false positive rates, true positive rates, and Matthews correlation coefficients are reported. Standard errors are in the parentheses.

Graph	Method	Edge	FPR	TPR	Matthews
Hub	scLGM (Glasso)	86.16 (26.159)	0.042 (0.019)	0.824 (0.114)	0.589 (0.057)
Hub	scLGM (Clime)	81.43 (21.627)	0.038 (0.017)	0.805 (0.076)	0.590 (0.064)
Hub	HurdleNormal	1.94 (1.420)	0.001 (0.001)	0.007 (0.012)	0.024 (0.053)
Hub	Glasso	9.18 (3.812)	0.003 (0.002)	0.114 (0.048)	0.240 (0.077)

Hub	scLGM (Glasso)	131.89 (22.943)	0.014 (0.003)	0.731 (0.086)	0.597 (0.039)
Hub	scLGM (Clime)	148.00 (45.931)	0.017 (0.008)	0.738 (0.074)	0.581 (0.057)
Hub	HurdleNormal	10.58 (3.888)	0.002 (0.001)	0.006 (0.007)	0.012 (0.021)
Hub	Glasso	10.48 (7.711)	0.003 (0.001)	0.066 (0.032)	0.138 (0.056)

Hub	scLGM (Glasso)	190.67 (68.192)	0.002 (0.001)	0.442 (0.071)	0.529 (0.036)
Hub	scLGM (Clime)	449.78 (15.752)	0.006 (0.000)	0.625 (0.030)	0.481 (0.022)
Hub	HurdleNormal	6380.96 (181.306)	0.141 (0.004)	0.284 (0.024)	0.032 (0.005)
Hub	Glasso	60.41 (15.078)	0.001 (0.000)	0.103 (0.025)	0.216 (0.036)

Random	scLGM (Glasso)	61.11 (26.922)	0.021 (0.015)	0.434 (0.126)	0.494 (0.060)
Random	scLGM (Clime)	76.32 (22.362)	0.027 (0.013)	0.523 (0.107)	0.530 (0.059)
Random	HurdleNormal	1.62 (1.324)	0.001 (0.001)	0.002 (0.005)	0.006 (0.029)
Random	Glasso	6.01 (3.401)	0.003 (0.002)	0.027 (0.019)	0.087 (0.060)

Random	scLGM (Glasso)	76.59 (35.016)	0.007 (0.004)	0.287 (0.108)	0.398 (0.066)
Random	scLGM (Clime)	89.96 (22.118)	0.008 (0.003)	0.344 (0.053)	0.443 (0.038)
Random	HurdleNormal	9.68 (2.624)	0.002 (0.000)	0.006 (0.006)	0.014 (0.022)
Random	Glasso	9.15 (4.293)	0.001 (0.001)	0.024 (0.013)	0.095 (0.043)

Random	scLGM (Glasso)	83.63 (18.427)	0.001 (0.000)	0.100 (0.021)	0.226 (0.030)
Random	scLGM (Clime)	101.63 (36.487)	0.001 (0.001)	0.124 (0.025)	0.255 (0.026)
Random	HurdleNormal	6550.88 (362.441)	0.145 (0.008)	0.219 (0.022)	0.020 (0.005)
Random	Glasso	32.54 (13.204)	0.001 (0.000)	0.008 (0.005)	0.027 (0.018)

MDAG	scLGM (Glasso)	171.92 (30.066)	0.106 (0.025)	0.625 (0.028)	0.376 (0.041)
MDAG	scLGM (Clime)	121.45 (21.698)	0.063 (0.017)	0.616 (0.040)	0.465 (0.038)
MDAG	HurdleNormal	4.98 (1.803)	0.002 (0.001)	0.038 (0.014)	0.145 (0.046)
MDAG	Glasso	35.96 (5.059)	0.006 (0.002)	0.360 (0.040)	0.522 (0.034)

MDAG	scLGM (Glasso)	310.64 (51.376)	0.045 (0.010)	0.604 (0.021)	0.408 (0.036)
MDAG	scLGM (Clime)	207.59 (19.071)	0.024 (0.004)	0.604 (0.020)	0.509 (0.024)
MDAG	HurdleNormal	11.11 (3.632)	0.002 (0.001)	0.015 (0.009)	0.050 (0.034)
MDAG	Glasso	64.52 (7.589)	0.002 (0.001)	0.338 (0.030)	0.519 (0.024)

MDAG	scLGM (Glasso)	760.24 (134.529)	0.011 (0.003)	0.517 (0.022)	0.420 (0.034)
MDAG	scLGM (Clime)	514.15 (15.322)	0.006 (0.000)	0.505 (0.011)	0.497 (0.011)
MDAG	HurdleNormal	6600.76 (338.319)	0.145 (0.007)	0.305 (0.022)	0.048 (0.005)
MDAG	Glasso	194.21 (18.630)	0.001 (0.000)	0.308 (0.019)	0.495 (0.012)

In addition, we place bullets on the ROC curves indicating the selected graphs. Table 1 summarizes their quality. Overall, the scLGM variants tend to achieve higher TPRs while keeping FPRs moderate. While our methods show better Matthews correlation coefficients (MCC) for the hub graphs and the random graphs, GLASSO gives slightly better MCC on the moralized DAG graphs. In all scenarios, our methods tend to detect more edges than GLASSO. HurdleNormal has lower MCC in general and, compared to the other methods, shows very inconsistent numbers of detected edges between the low-ratio and high-ratio situations.
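The edge-selection summaries reported in Table 1 follow the usual confusion-matrix definitions, which can be sketched as below. The function name and the representation of edges as pairs of gene indices are assumptions of this sketch. Every unordered pair of genes counts as one potential edge.

```python
import math


def edge_selection_metrics(true_edges, selected_edges, num_genes):
    """FPR, TPR, and Matthews correlation coefficient for a selected graph."""
    total_pairs = num_genes * (num_genes - 1) // 2
    truth = {frozenset(e) for e in true_edges}
    chosen = {frozenset(e) for e in selected_edges}
    tp = len(truth & chosen)
    fp = len(chosen - truth)
    fn = len(truth - chosen)
    tn = total_pairs - tp - fp - fn
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"edges": len(chosen), "fpr": fpr, "tpr": tpr, "mcc": mcc}
```

Because true edges are rare among all possible gene pairs, the MCC is informative here: it penalizes both a method that selects almost nothing and a method that selects thousands of spurious edges.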

4. Real data analysis

We analyze the data set from Zeisel and others (2015), in which large-scale scRNAseq data were collected from mouse brain single cells located in the primary somatosensory cortex and the hippocampal CA1 region. In Zeisel and others (2015), the authors clustered the cells into 9 level-1 classes and then further decomposed them into 47 level-2 subclasses (Table S1 of the Supplementary material available at Biostatistics online). We focused on the biggest level-1 class, “pyramidal CA1,” and its biggest level-2 subclass, “CA1Pyr2.” We selected a subset of the genes: one part consists of the most highly expressed genes, while the rest are randomly chosen from the remaining genes whose nonzero UMI ratio exceeds a minimum threshold. The analysis is repeated with five different sets of randomly selected genes. This resulted in 12.3% zeros on average over the five repeated analyses for the “CA1Pyr2” subclass and 13.8% zeros for the “pyramidal CA1” class.

We compare the edges estimated by scLGM and other methods to the edges retrieved from the KEGG pathway database (Kanehisa and others, 2016). Edges among the selected genes were collected from mouse-specific KEGG pathways for each of the classes “pyramidal CA1” and “CA1Pyr2.” The other methods considered are GLASSO and HurdleNormal, as in Section 3.

The results are summarized in Table 2 and in Figure 4. Table 2 contains the number of detected edges, their overlaps with our gold standards, and the additionally discovered edges that do not exist in the KEGG database. As in the simulations, our methods detected the most edges overlapping with the gold standards. The extra discoveries of all methods beyond the KEGG edges suggest that there may be many unknown relationships between the genes of interest. Figure 4 contains the number of overlapping selected edges across all methods and the number of overlapping edges that are also included in KEGG. We can see that our methods have the largest overlaps with the two competitors, while GLASSO and HurdleNormal have few or no overlaps with each other. The edges discovered by GLASSO are nearly a subset of those discovered by scLGM with the GLASSO M-step.

Table 2.

Real data analysis result: we compare scLGM with Glasso M-step, scLGM with Clime M-step, HurdleNormal, and Glasso on the real mouse single-cell data. Among all possible pairs of the selected genes, the edges verified by the KEGG database, averaged over the repeated analyses for each cluster, are set to be our gold standard for comparison. Each of the other possible pairs of genes is either not connected or connected but not yet known. The number of detected edges for each group is reported.

Method	CA1Pyr2 Edge	CA1Pyr2 Overlap	Pyramidal CA1 Edge	Pyramidal CA1 Overlap
scLGM (Glasso)	3807.0	17.0	4756.4	18.4
scLGM (Clime)	3933.2	18.0	7201.4	26.4
HurdleNormal	352.0	8.0	295.2	6.0
Glasso	4222.0	2.0	4709.2	5.6

Fig. 4. — Comparison of detected edges among methods: each row represents the overlaps from two classes (“pyramidal CA1” on top and “CA1Pyr2” on bottom). The left column represents total number of overlaps among all detected edges, while the right column represents the number of overlaps among ones existing in the KEGG database.

To further investigate the performance of scLGM on the real data, we conducted another simulation study based on the results of our real data analysis. Using the cell-specific and gene-specific parameters estimated with GLASSO in the real data analysis, we generated synthetic data sets with the same sample size. We conducted the same analysis as in Section 3, and the results are shown in Figure 5 and Table 3. Our two methods still outperform the other methods in terms of the area under the ROC curve and achieve the best TPRs with moderate FPRs, resulting in the best MCC.

Fig. 5. — ROC curves from the simulation results based on real data: scLGM outperforms the other methods in terms of the area under the ROC curve, while the compared methods suggest only a limited number of edges.

Table 3.

Simulation results based on real data analysis: we compare scLGM with Glasso M-step, scLGM with Clime M-step, HurdleNormal, and Glasso on the simulated data using the parameters estimated from the real mouse single-cell data analysis. Numbers of detected edges, false positive rates, true positive rates, and Matthews correlation coefficients are reported. Standard errors are in parentheses.

Method	Edge	FPR	TPR	Matthews
scLGM (Glasso)	1838.13 (63.372)	0.062 (0.003)	0.254 (0.007)	0.243 (0.007)
scLGM (Clime)	1494.30 (272.314)	0.041 (0.011)	0.256 (0.031)	0.300 (0.009)
HurdleNormal	485.73 (27.850)	0.019 (0.001)	0.054 (0.003)	0.084 (0.006)
Glasso	314.55 (37.404)	0.010 (0.001)	0.049 (0.006)	0.114 (0.009)

5. Discussion

In this article, we proposed a novel method for graphical modeling using high-dimensional scRNAseq data. Unlike existing methods, which do not account for the unique features of scRNAseq data, the proposed approach uses the state-of-the-art technique of Jia and others (2017) to properly explain the two sources of excessive zeros that are present in scRNAseq data, and it enables estimation of the conditional dependency structure of the unobserved underlying true gene expression values. The simulation results demonstrate the superiority of scLGM, and the real data application shows that our method can be useful in practice.

One drawback of scLGM is its scalability. Because the algorithms are iterative, our method runs several times slower than the other methods considered in this paper. In our data analysis for the “CA1Pyr2” cell class, it took 1 min per tuning parameter on average. We expect that our method will scale up to a few thousand genes, depending on the sample size and the sparsity of edges. Developing a more scalable method certainly deserves more attention.

Because scLGM enables estimation of a gene dependency network, it can be combined with other methods for analyzing scRNAseq data in which the incorporation of graphical information is valuable. For example, many supervised and unsupervised learning methods that use scRNAseq data as predictors can be improved by incorporating graph knowledge among the predictors, via sequential or joint estimation.

Acknowledgments

Conflict of Interest: None declared.

6. Software

Software in the form of R code is available on https://github.com/jihwan05/scLGM/.

Supplementary Material

Supplementary material is available online at http://biostatistics.oxfordjournals.org.

Funding

This work was supported by National Institutes of Health (NIH) grants (P30CA016520 and RF1AG063481). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

References

Albert, J. H. and Chib, S. (1993). Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association 88, 669–679.
Bacher, R. and Kendziorski, C. (2016). Design and computational analysis of single-cell RNA-sequencing experiments. Genome Biology 17, 63.
Banerjee, O., El Ghaoui, L. and d’Aspremont, A. (2008). Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. The Journal of Machine Learning Research 9, 485–516.
Blei, D. M., Kucukelbir, A. and McAuliffe, J. D. (2017). Variational inference: a review for statisticians. Journal of the American Statistical Association 112, 859–877.
Buettner, F., Natarajan, K. N., Casale, F. P., Proserpio, V., Scialdone, A., Theis, F. J., Teichmann, S. A., Marioni, J. C. and Stegle, O. (2015). Computational analysis of cell-to-cell heterogeneity in single-cell RNA-sequencing data reveals hidden subpopulations of cells. Nature Biotechnology 33, 155.
Cai, T., Liu, W. and Luo, X. (2011). A constrained minimization approach to sparse precision matrix estimation. Journal of the American Statistical Association 106, 594–607.
Chun, H., Zhang, X. and Zhao, H. (2015). Gene regulation network inference with joint sparse Gaussian graphical models. Journal of Computational and Graphical Statistics 24, 954–974.
Editorial. (2014). Method of the year 2013. Nature Methods 11.
Elowitz, M. B., Levine, A. J., Siggia, E. D. and Swain, P. S. (2002). Stochastic gene expression in a single cell. Science 297, 1183–1186.
Fan, J., Feng, Y. and Wu, Y. (2009). Network exploration via the adaptive LASSO and SCAD penalties. The Annals of Applied Statistics 3, 521.
Friedman, J., Hastie, T. and Tibshirani, R. (2008). Sparse inverse covariance estimation with the graphical lasso. Biostatistics 9, 432–441.
Fukumizu, K., Gretton, A., Sun, X. and Schölkopf, B. (2007). Kernel measures of conditional dependence. In: Twenty-First Annual Conference on Neural Information Processing Systems (NIPS 2007). Curran, Volume 20. pp. 489–496.
Gong, W., Kwak, I.-Y., Pota, P., Koyano-Nakagawa, N. and Garry, D. J. (2018). DrImpute: imputing dropout events in single cell RNA sequencing data. BMC Bioinformatics 19, 220.
Harari, A., Vallelian, F., Meylan, P. R. and Pantaleo, G. (2005). Functional heterogeneity of memory CD4 T cell responses in different conditions of antigen exposure and persistence. The Journal of Immunology 174, 1037–1045.
Harris, N. and Drton, M. (2013). PC algorithm for nonparanormal graphical models. The Journal of Machine Learning Research 14, 3365–3383.
Hicks, S. C., Townes, F. W., Teng, M. and Irizarry, R. A. (2018). Missing data and technical variability in single-cell RNA-sequencing experiments. Biostatistics 19, 562–578.
Islam, S., Zeisel, A., Joost, S., La Manno, G., Zajac, P., Kasper, M., Lönnerberg, P. and Linnarsson, S. (2014). Quantitative single-cell RNA-seq with unique molecular identifiers. Nature Methods 11, 163.
Jia, C., Hu, Y., Kelly, D., Kim, J., Li, M. and Zhang, N. R. (2017). Accounting for technical noise in differential expression analysis of single-cell RNA sequencing data. Nucleic Acids Research 45, 10978–10988.
Jiang, L., Schlesinger, F., Davis, C. A., Zhang, Y., Li, R., Salit, M., Gingeras, T. R. and Oliver, B. (2011). Synthetic spike-in standards for RNA-seq experiments. Genome Research 21, 1543–1551.
Kanehisa, M., Furumichi, M., Tanabe, M., Sato, Y. and Morishima, K. (2016). KEGG: new perspectives on genomes, pathways, diseases and drugs. Nucleic Acids Research 45, D353–D361.
Kivioja, T., Vähärautio, A., Karlsson, K., Bonke, M., Enge, M., Linnarsson, S. and Taipale, J. (2012). Counting absolute numbers of molecules using unique molecular identifiers. Nature Methods 9, 72.
König, J., Zarnack, K., Rot, G., Curk, T., Kayikci, M., Zupan, B., Turner, D. J., Luscombe, N. M. and Ule, J. (2010). iCLIP reveals the function of hnRNP particles in splicing at individual nucleotide resolution. Nature Structural & Molecular Biology 17, 909.
Lam, C. and Fan, J. (2009). Sparsistency and rates of convergence in large covariance matrix estimation. Annals of Statistics 37, 4254.
Li, B., Chun, H. and Zhao, H. (2012). Sparse estimation of conditional graphical models with application to gene networks. Journal of the American Statistical Association 107, 152–167.
Li, B., Chun, H. and Zhao, H. (2014). On an additive semi-graphoid model for statistical networks with application to pathway analysis. Journal of the American Statistical Association 109, 1188–1204.
Liu, H., Han, F., Yuan, M., Lafferty, J., Wasserman, L. and others. (2012). High-dimensional semiparametric Gaussian copula graphical models. The Annals of Statistics 40, 2293–2326.
Liu, H., Lafferty, J. and Wasserman, L. (2009). The nonparanormal: semiparametric estimation of high dimensional undirected graphs. The Journal of Machine Learning Research 10, 2295–2328.
Liu, H., Xu, M., Gu, H., Gupta, A., Lafferty, J. and Wasserman, L. (2011). Forest density estimation. The Journal of Machine Learning Research 12, 907–951.
McDavid, A., Gottardo, R., Simon, N. and Drton, M. (2019). Graphical models for zero-inflated single cell gene expression. The Annals of Applied Statistics 13, 848–873.
Meinshausen, N. and Bühlmann, P. (2006). High-dimensional graphs and variable selection with the lasso. The Annals of Statistics 34, 1436–1462.
Oh, J., Zheng, F., Doerge, R. W. and Chun, H. (2018). Kernel partial correlation: a novel approach to capturing conditional independence in graphical models for noisy data. Journal of Applied Statistics 45, 2677–2696.
Park, T. and Casella, G. (2008). The Bayesian lasso. Journal of the American Statistical Association 103, 681–686.
Peng, J., Wang, P., Zhou, N. and Zhu, J. (2009). Partial correlation estimation by joint sparse regression models. Journal of the American Statistical Association 104, 735–746.
Polson, N. G., Scott, J. G. and Windle, J. (2013). Bayesian inference for logistic models using Pólya–Gamma latent variables. Journal of the American Statistical Association 108, 1339–1349.
Smith, T. S., Heger, A. and Sudbery, I. (2017). UMI-tools: modelling sequencing errors in unique molecular identifiers to improve quantification accuracy. Genome Research 27, 491–499.
Stegle, O., Teichmann, S. A. and Marioni, J. C. (2015). Computational and analytical challenges in single-cell transcriptomics. Nature Reviews Genetics 16, 133.
Székely, G. J. and Rizzo, M. L. (2014). Partial distance correlation with methods for dissimilarities. The Annals of Statistics 42, 2382–2412.
Tang, F., Barbacioru, C., Wang, Y., Nordman, E., Lee, C., Xu, N., Wang, X., Bodeau, J., Tuch, B. B., Siddiqui, A. and others. (2009). mRNA-Seq whole-transcriptome analysis of a single cell. Nature Methods 6, 377.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B (Methodological) 58, 267–288.
Tzikas, D. G., Likas, A. C. and Galatsanos, N. P. (2008). The variational approximation for Bayesian inference. IEEE Signal Processing Magazine 25, 131–146.
Voorman, A., Shojaie, A. and Witten, D. (2013). Graph estimation with joint additive models. Biometrika 101, 85–101.
Wang, X., Pan, W., Hu, W., Tian, Y. and Zhang, H. (2015). Conditional distance correlation. Journal of the American Statistical Association 110, 1726–1734.
Yuan, M. and Lin, Y. (2007). Model selection and estimation in the Gaussian graphical model. Biometrika 94, 19–35.
Zeisel, A., Muñoz-Manchado, A. B., Codeluppi, S., Lönnerberg, P., La Manno, G., Juréus, A., Marques, S., Munguba, H., He, L., Betsholtz, C. and others. (2015). Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq. Science 347, 1138–1142.
Zhao, T., Liu, H., Roeder, K., Lafferty, J. and Wasserman, L. (2012). The huge package for high-dimensional undirected graph estimation in R. Journal of Machine Learning Research 13, 1059–1062.
