Abstract
Bulk gene expression experiments relied on aggregations of thousands of cells to measure the average expression in an organism. Advances in microfluidic and droplet sequencing now permit expression profiling in single cells. This study of cell-to-cell variation reveals that individual cells lack detectable expression of transcripts that appear abundant on a population level, giving rise to zero-inflated expression patterns. To infer gene co-regulatory networks from such data, we propose a multivariate Hurdle model. It is comprised of a mixture of singular Gaussian distributions. We employ neighborhood selection with the pseudo-likelihood and a group lasso penalty to select and fit undirected graphical models that capture conditional independences between genes. The proposed method is more sensitive than existing approaches in simulations, even under departures from our Hurdle model. The method is applied to data for T follicular helper cells, and a high-dimensional profile of mouse dendritic cells. It infers network structure not revealed by other methods; or in bulk data sets. An R implementation is available at https://github.com/amcdavid/HurdleNormal.
1. Introduction.
Graphical models have been used to synthesize high-throughput gene expression experiments into understandable, canonical forms [Dobra et al., 2004, Markowetz and Spang, 2007]. Although inferring causal relationships between genes is perhaps the ultimate goal of such analysis, causal models may be difficult to estimate with observational data, and experimental manipulation of specific genes has remained costly, and largely inimitable to high-throughput biology. Many analyses have thus focused on undirected graphical models (also known as Markov random fields) that capture the conditional independences present between gene expression levels. The graph determining such a model describes each gene’s statistical predictors: each gene is optimally predicted using only its neighbors in the graph. With gene expression studies serving as key motivation, a host of different approaches have been developed for structure learning and parameter estimation in undirected graphical models [Drton and Maathuis, 2017].
Characterization of the conditional independences between genes answers a variety of scientific questions. It can help falsify models of gene regulation, since statistical dependence is expected, given causal dependence. In immunology, polyfunctional immune cells, which simultaneously and non-independently express multiple cytokines, are useful predictors of vaccine response [Precopio et al., 2007]. Simultaneous expression or co-expression of cellular surface markers potentially define new cellular phenotypes [Lin et al., 2015], so expanding the “dictionary” of co-expression allows phenotypic refinements. Graphical models allow one to study such co-expression at the level of direct interactions.
1.1. Single cell gene expression.
Established technology determines gene expression levels by assaying bulk aggregates of cells assayed through microarrays or RNA sequencing. Although graphical modeling of the resulting data has seen profitable applications, see e.g. Li et al. [2015], there is an inheritant limitation to what can be inferred from expression levels that are averages across hundreds or thousands of individual cells, as we discuss in Section 2. In contrast, recent microfluidic and molecular barcoding advances have enabled the measurement of the minute quantities of mRNA present in single cells. This new technology provides a unique resolution of gene co-expression and has the potential to facilitate more interpretable conclusions from multivariate data analysis and, in particular, graphical modeling.
At the same time, single cell expression experiments bring about new statistical challenges. Indeed, a distinctive feature of single cell gene expression data—across methods and platforms—is the bimodality of expression values [Finak et al., 2015, Marinov et al., 2014, Shalek et al., 2014]. Genes can be ‘on’, in which case a positive expression measure is recorded, or they can be ‘off ‘, in which case the recorded expression is zero or negligible. Although the cause of this zero-inflation remains unresolved, its properties are of intrinsic interest [Kim and Marioni, 2013]. It has been argued that the zero-inflation represents censoring of expression below a substantial limit of detection, yet comparison of in silico signal summation from many single cells, to the signal measured in biological sums of cells suggest that the limit of detection is negible [McDavid et al., 2013]. Moreover, the empirical distribution of the log-transformed counts appears rather different than would be expected from censoring: the distribution of the log-transformed, positive values is generally symmetric. Yet the presence of bimodality in technically replicated experiments (“Pool/split” experiments) implicates the involvement of technical factors [Marinov et al., 2014].
Zero-inflation is seen, in particular, in a single cell gene expression experiment we analyze in Section 6. The experiment concerns T follicular helper (Tfh) cells, which are a class of CD4+ lymphocytes. B-cells that secrete antibodies require Tfh cell co-stimulation to become active [Ma et al., 2012]. Tfh cells are defined, and identified both through their location in the B-cell germinal centers, as well as their production of high levels of the proteins CXCR5, PD1 and BcL-6. In the experiment we consider, Tfh cells were identified from CD4+CXCR5+PD1+ cells from lymph node biopsy. Figure 1 shows the pairwise expression distribution of four Tfh marker genes (P < 10−20 compared to non-Tfh lymph node T-cells, which are not shown). Although the expression of these genes could help discriminant Tfh from non-Tfh cells, the strength of linear relationships within Tfh cells (upper panels) varies. To identify coexpressing subsets of cells or to clarify the conditional relationship between genes, estimating the multivariate dependence structure of expression within Tfh cells is necessary. Figure 1 illustrates the issue of zero-inflation. The data are clearly poorly modeled by the linear regression models whose fit is shown in the lower panels of the figure.
1.2. Modeling zero-inflation.
In order to accommodate the distributional features observed in single cell gene expression, we propose a joint probability density function f(y) of the form
(1) |
for the dominating measure obtained by adding a Dirac mass at zero to the Lebesgue measure. The vector comprises the expression levels of m genes in a single cell, and the vector vy ∈ {0,1}m is defined through element wise indicators of non-zero, so for i = 1,…,m. In the specification from (1), both binary and continuous versions of gene expression are sufficient statistics, and interactions thereof are parametrized, with G, H and K being matrices of interaction parameters. Zeros in these interaction matrices indicate conditional independences (and, thus, absence of edges in a graph for a graphical model). Specifically, the ith and jth coordinate are conditionally independent if and only if all interaction matrices have their (i, j) and (j, i) entries zero [Lauritzen, 1996, Theorem 3.9].
As we discuss in more detail in Section 3, the model given by (1), which we refer to as the Hurdle model, can be shown to be equivalent to a finite mixture model of singular Gaussian distributions. In light of the observed symmetry in the positive single cell expression levels, linking the modeling of zero-inflation with Gaussian parameters for nonzero observations is both natural and convenient. This said, it is an interesting topic for future work to develop more refined models of the continuous expression arising when genes are ‘on’.
We will base statistical inference in the Hurdle model on so-called neighborhood selection, where the neighborhood of each gene is inferred via penalized regression methods [Meinshausen and Bühlmann, 2006]. Neighborhood selection is a state-of-the-art method for estimation and inference in potentially high-dimensional graphical models; see the review in Section 3.4 of Drton and Maathuis [2017]. The main challenge in our setting is determining how to calibrate signal in the binary versus the continuous part. We solve this problem using an anisometric group-lasso penalty (Section 4).
1.3. Outline.
The remainder of the paper is structured as follows. Section 2 discusses the parameter targeted in single cell gene expression experiments, and why it is not accessible from traditional bulk experiments. Section 3 develops the parametric Hurdle model for single cell gene expression, as specified in (1), and discusses conditional independence in this setting. Section 4 gives a detailed account of estimation of graphical models using neighborhood selection via penalized regression. Section 5 provides a simulation study that demonstrates the benefits of our approach. In Section 6, we analyze the aforementioned experiment on Tfh cells. Since the data set contains selected gene profiles that were available for both single-and several-cell aggregates, we are able to highlight the refined inferences that can be obtained from single cell data. In Section 7, we analyze data on mouse dendritic cells, which are of far higher dimensionality than the Tfh cell data. Our analyses show in particular that modeling the zero-inflation may uncover distinct networks compared to existing approaches. We conclude with a discussion in Section 8, where we highlight interesting problems for future research, in particular, in graphical modeling. A supplement Mc-David et al. [2018] contains expanded derivations and details on simulation scenarios.
2. Single cell versus bulk expression experiments.
Protocols for bulk gene expression experiments, such as for Illumina TrueSeq, call for 100 nanograms of total mRNA, hence require hundreds to thousands of cells. On the one hand, this biological “summation” over many of cells is expected to yield sharper inference on the mean expression level of each gene. However, it can also be expected to distort any conditional (in-)dependences present between genes.
Let Y1, …, Yn be iid random vectors taking values in , with Yi representing the copy numbers of m transcripts present in the ith single cell. Now suppose the n cells are aggregated, and the total expression is measured using a linear quantification that reports values proportional to the input counts of mRNA. The expression observed in this bulk experiment is then
with the constant of proportionality typically a semi-empirical normalization factor, such as TPM (Transcripts Per Kilobase Million) or FPKM (Fragments Per Kilobase Million). Although most bulk experiments are designed to test for differences in mean expression due to experimental treatments and lack extensive replication within a condition, stochastic profiling [Janes et al., 2010] experiments have provided iid replicates of Z suitable for estimating higher order moments. However, when the distribution of Yi obeys some conditional independence relationships, in general the distribution of Z does not obey these same relationships.
For example, take m = 3 and suppose that the Yi are iid samples from a tri-variate distribution supported on {0,1}3. Let [Y1, Y2, Y3] be a random vector following this distribution, and let pijk = P (Y1 = i, Y2 = j, Y3 = k) be the joint probabilities. Then Y1 and Y3 are conditionally independent given Y2 (in symbols, Y1 ⫫ Y3 | Y2) if and only if the two matrices (pi0k)ik and (pi1k)ik have rank 1 [Drton et al., 2009, Prop. 3.1.4]. Yet even summing over only n = 2 cells, the random vector Z = Y1 + Y2 [Z1, Z2, Z3] taking values in {0, 1, 2}3generally does not have Z1 ⫫ Z3 | Z2.
When the Yi are multivariate Normal, the conditional independence structure is preserved under convolution. Unfortunately for non-Gaussian distributions this does not generally hold. As noted in our introduction, single cell gene expression is generally bimodal and zero-inflated, so not plausibly described by a multivariate Normal distribution. Therefore, even though for large enough n the distribution of the bulk experiment Z might approach multivariate (log-)normality, the networks estimated from graphical modeling of bulk data will not reflect conditional independences that hold among expression levels in single cells.
3. Hurdle models.
Univariate Hurdle models arise from modification of a density through excision of points in the support and assignment of positive masses to these points. Targeting zero-inflation, our excision point is the origin. Let vy = I{y≠0 be the indicator function for a non-zero value of the observation y. Then the Hurdle model derived from a Normal distribution with mean ξ and precision τ2 has density
(2) |
with respect to the measure λ0 that is the sum of the Lebesgue measure and a Dirac mass at zero. Here, P (Vy = 1) = p ε (0, 1) is a mixing weight representing the chance of observing a non-zero value. Varying p, ξ and τ2, one obtains an exponential family with su cient statistic y,− y2/2, and vy, and associated natural parameters h = ξτ2, k = τ2, and
3.1. Multivariate Hurdle models.
A plausible model for the joint distribution of a random vector Y = [Y1, …, Ym] representing single cell gene expression puts positive mass on every one of the 2m coordinate subspaces (recall Figure 1), including the origin when all genes are ‘off’ and the entire space when all genes are ‘on’. Assigning positive mass to the coordinate subspaces generalizes the univariate construction from (2). As it is easiest to construct this model conditionally, we introduce the vector that indicates the non-zero coordinates of Y. Throughout, our notation suppresses the dependence of V on Y. We emphasize that specification of the distribution of the multivariate Bernoulli random vector V simply amounts to specification of a 2m probability table.
For any vector v = [v1, …, vm] ε{0, 1}m, define the subspace where we set . So, is coordinate subspace corresponding to the non-zero entries of v. Similarly, define PD(v) to be the cone of m × m symmetric matrices that have non-zero entries only in rows and columns indexed by i with vi = 1, and for which the submatrix given by these rows and columns is positive definite. Now suppose that the conditional distribution of Y given V is multivariate Normal and, specifically,
(3) |
with mean vector and covariance matrix Σ(v) ∈ PD(v) The normal distribution in (3) is singular (see Section 2 in supplement McDavid et al. [2018] for details) and supported on the subspace .
In the applications we have in mind the dimension m will be large enough so that it is infeasible to accurately estimate a general 2m probability table for the distribution of V, and a collection of 2m mean vectors and covariance matrices for the conditional distribution of Y. We thus proceed to formulate a more parsimonious pairwise interaction model. While of far lower dimension, the pairwise model allows one to capture interesting conditional (in-)dependences.
First, we assume V to follow an Ising model with joint probabilities
(4) |
where G is a symmetric interaction matrix in . Second, we assume that the conditional normal distribution of Y given V = v has log-density
(5) |
with respect to Lebesgue measure restricted to the subspace . In (5), H and K are two m × m interaction matrices that do not vary with v, and C′(H, K) is a normalization constant. The matrix K is symmetric and positive definite, but H may be arbitrary from . Putting the two pieces from (4) and (5) together, the joint density of Y with respect to the product measure simplifies to
(6) |
We recognize an exponential family with three interaction matrices G, H and K as natural parameters and the three statistics vvT, vyT, and yyT sufficient.
Let be the m × m diagonal matrix with (i, i) entry equal to Vi. Then for any vector the product is the vector that has the ith coordinate replaced by zero for all indices i with Yi = Vi = 0. Similarly, multiplying from left and right to a matrix zeros out all but the principal submatrix determined by this set of indices. Using this notation, the pairwise Hurdle model from (6) corresponds to the particular choice of
(7) |
for the mean vectors and covariance matrices in the conditional specification from (5). In (7), A− denotes the Moore-Penrose pseudoinverse of a matrix A. From the perspective of (7), the pairwise Hurdle model is a mixture of 2m singular Gaussian distributions whose mean vectors and covariance matrices are derived from one precision matrix K and an interaction matrix H.
The notation we used in the conditional specification of the multivariate Hurdle model follows Lauritzen [1996], who describes conditional Gaussian (CG) models with inhomogeneous, non-singular precision K(v) that can depend on the discrete set of covariates in arbitrary, positive-definite fashion. These models have been considered more recently by Lee and Hastie [2013] and Cheng et al. [2013]. Our formulation differs from the traditional CG models by involving singular distributions with means and covariance matrices that exhibit structured inhomogeneity.
3.2. Conditional distributions identify interaction parameters.
The normalizing constant C in equation (6) is a difficult to compute sum of 2m terms. This is expected as already the distributions in the Ising model from (4)have an intractable normalization constant for moderately large m. Fortunately, the univariate full conditional distributions obtained from (6) have tractable normalizing constants and identify the parameters from a given row/column of the interaction matrices G = (gab), H = (hab), and K = (kab).
Fix a coordinate b, and define its complement A = {1, …, m} \ {b}. Consider now the density f(y) from (6) as a function of only yb, i.e., yA =[yi : i ԑ A] is fixed, and write f[b| A] for the conditional density of yb given yA. Then noting that viyi = yi and we have
(8) |
where C[b |A] does not depend on yb and
(9) |
(10) |
The conditional density f[b| A] is thus a univariate Hurdle density as specified in (2) with natural parameters g[b| A], h[b| A], and k[b| A].
The three natural parameters are obtained from linear predictors that depend on a design matrix constructed from yA and vA. For example, we may write
for Xa = [va, ya]. The linear predictor for h[b| A] can be written analogously. We note that if the data include additional nuisance covariates W0 that describe each experimental unit then these can be included by augmenting the linear predictor to
(11) |
with gb0 being the parameters capturing the effects of the covariates. From this perspective, the conditional distribution in (8) defines a vector generalized linear model, parametrized by three natural parameters g[b| A], h[b |A] and k[b| A], the first two of which are modeled as a linear function of the expression of other genes.
3.3. Conditional independence graphs.
The dependence structure of the random vector Y = [Y1, …, Ym] may be summarized in its conditional independence graph. This is an undirected graph with vertex set and an edge set Ԑ that is determined by the conditional independences Y. More precisely, the edges in Ԑ are those two-element sets for which Ya and Yb are conditionally dependent given the remaining variables, i.e., . In our case, Y has a density f as in (6). The dominating measure is a product measure, and f is positive and continuous. Hence, the Hammersley-Clifford theorem assures that the conditional independence graph of Y has an edge {a, b} if and only if the four possible ab interactions are zero, so
(12) |
see Lauritzen [1996, Chapter 3]. This fact is also evident from the form of the conditional distributions detailed in (8), (9), and (10). It motivates the neighborhood selection procedure developed in the next section.
4. Neighborhood estimation via penalized regression.
In the single cell experiments to which we envision applying this method, the number of cell replicates, n, is larger than the sample sizes seen in typical bulk mRNA experiments. However, it is still often the case that the number of genes m is larger than the number of cell replicates. We are thus in a setting that benefits from application of methods from ‘high-dimensional statistics’; though emerging technologies are increasing available sample sizes.
4.1. Related work.
Under scenarios in which n, m ➝ ∞while satisfying that n > Cdϕ(log m)ψ where C,ϕ and ψ are constants that depend on the model and d is the maximum vertex degree of the conditional independence graph, penalized regression has been shown to consistently identify the graph of multivariate Normal models [Meinshausen and Bühlmann, 2006], of Ising (auto-logistic) graph of multivariate Normal mosels [Meinshausen and Bühlmann, 2006], of Ising (auto-logistic) models [Ravikumar et al., 2010] and of exponential family graphical models [Yang et al., 2014, Chen et al., 2015]. While this paper was in preparation, Tansey et al. [2015] further extended this line of work to general vector space graphical models that include the multivariate Hurdle model as a special case. However, the standard (isometric) group-lasso they propose for estimation of the conditional independence graph does not account for heterogeneity in the scaling of predictors in the conditional distributions. The anisometric group-lasso we propose in the following section yields drastic improvements in finite samples.
4.2. Anisometric penalty.
Throughout this section, we fix an index b and consider the conditional distribution Yb given the other variables in YA for A = {1, …, m} \ {b}. For any a ε A, define the parameter vector θa = [gba, hba, hab, kba], By (12), Yb ⫫Ya|YA\{a} if and only if θa = 0.
(13) |
be the group lasso penalty for tuning parameter λ ≥ 0. Maximization of the penalized conditional log-likelihood function
can lead to a solution that is sparse in parameter blocks, that is, some of the subvectors θa are zero. The penalty is equivalent to placing a sequence of independent, multivariate Laplace priors on blocks of and reporting the MAP [Eltoft et al., 2006].
Viewed as a prior, the standard group-lasso penalty from (13) implicitly assumes that each variable in each block has a similar effect size. This may be reasonable if the variables in each block are measured in comparable units, but is problematic otherwise. For example, if covariate X1 is measured in meters, while covariate X2 in centimeters, then the distribution of effect sizes for X2 would be 100-times more dispersed than the distribution of effect sizes for X1. In penalized GLMs, this is typically enforced “at run time” by ensuring covariates are on comparable scales, or Z-scoring each column of the design matrix if no intrinsic scale exists.
In our setting of a vector regression, terms from linear predictor g[b |A] and linear predictor h [b| A] end up together in blocks, and these coefficients are not necessarily comparable, as one specifies log-odds of E(Vb| VA = 0) while the other specifies conditional expectations of E(Yb| YA). Re-scaling does not resolve this, since the same design matrix Xa = [Va, Ya] is used in each linear predictor, and in any case, re-scaling generally alters the solution [Simon and Tibshirani, 2012]. Instead, we propose replacing the isometric norm in the sum in (13) so that the penalty is
(14) |
Here, H ≡ diag (Haa) is a block-diagonal, positive-definite matrix that allows terms from the linear predictors to have different scales of penalty. It also accounts for correlation between components of θa, since columns of the design are correlated due to both va and ya appearing as predictors.
If prior information existed, the matrix H could be chosen accordingly, with interpretation as a multivariate Laplace prior. Absent prior information, setting H equal to the Fisher information under a null model θa = 0 for all a results in variable selection approximately equal to conducting score tests, with exact equivalence holding under a null hypothesis of θa = 0 for all a; see Proposition 1 in McDavid et al. [2018].
4.3. Computation.
In Algorithm 1, we outline the proposed neighborhood selection, allowing for possible nuisance covariates W. The nuisance covariates W might just be an intercept column, but generally could be any cell-level covariate deemed relevant. The smooth and concave function in line 7 can be maximized using any Newton-like algorithm (e.g., BFGS). The objective in line 10 is a sum of a concave, smooth function and a structured concave function and can be efficiently solved using proximal gradient ascent [Parikh and Boyd, 2014]. In particular, one may exploit the fact that although the proximal operator
is not available in the familiar form of a soft-thresholding operator as in the isometric group-lasso, the proximal operator of the anisometric group-lasso can be efficiently found via a line search after one-time pre-calculation of the singular value decomposition of Haa [Foygel and Drton, 2010]. Throughout the inner-loop, warm starts are exploited for as varies. λ Active set heuristics using the strong rules of Tibshirani et al. [2012] yield computational gains for sparse solutions with large m. The algorithm yields, for each node, a sequence of neighborhoods over a sequence of tuning parameters Λ. These neighborhoods need not be consistent, in the sense that for some element of it could be that b ε Ne(a) but a ∉ Ne(b). We resolve that by adopting an “or” rule. In the accompanying software1, the algorithm is written in a combination of R and C++. Timings for the proposed method and competitors (described further in Section 5) are shown in Figure 3.
5. Simulations.
We consider a series of simulations under several sets of underlying i) graph topologies, ii) parametric models, iii) sample sizes and iv) number of vertices. We summarize the considered setups here and defer details (including the choice of graph topology) to the supplementary material in Section 3. The number of observations n varies from 100 to 12500. In the chain graph topology, the number of vertices varies from m = 16 to m = 128, while in the e. coli graph topology, m = 500. The parametric models include the pairwise hurdle model (6), the hurdle model under contamination by t8 noise, a logistic/Ising model and a Gaussian/logistic censoring model specified in Supplementary Table 1. The pairwise hurdle model is said to be complete if for each edge present in the graph, all of the corresponding entries in each of the three interaction matrices are non-zero. The pairwise hurdle model is said to be G-minimal when H and K are diagonal matrices and only G contains non-zero off-diagonal entries. In this case, the G-minimal model is equivalent to a logistic/Ising model.
5.1. Methods compared and default tunings.
Six methods were examined to test graph structure inference, and are described in Supplemental Table 1. The Hurdle models are fit using the accompanying software HurdleNormal version 0.98.2, while the Logistic, Gaussian and NPN models are fit using the R package glmnet version 2.0–5 (via the autoGLM function in HurdleNormal). The Aracne method is fit using package netbenchmark version 1.6.0. For methods 1–5, neighborhoods are stitched together using an “or” rule, i.e., vertices a and b are adjacent if either b ∈ ne (a) or a ∈ ne(b).
In Figure 4 various fixed tunings are shown. In the oracle tuning, the graph with maximum sensitivity subject to FDR < 10% is shown. This tuning is not available in practice, but shows the maximum achievable performance of each method. With the BIC tuning, we employ the Bayesian Information Criteria on the pseudo-likelihood
where θbλ, is the penalized solution at penalty λ for vertex b, ∥θb,λ∥0 is the number of non-zero entries, and is the (unpenalized) maximum pseudo-likelihood estimate for the non-zero entries. The BIC solution is the one that minimizes BICλ. This tuning is available for methods 1–5. In the case of the the Aracne method the BIC is unavailable as no likelihood is defined.
5.2. Results.
30 simulation replicates sufficed to bound the simulation-induced Monte Carlo standard error of the mean < 5 × 10−3 for FDR and < .02 for the sensitivity.
The simulations show that mis-specified estimation procedures perform poorly when model (6) is the data generating distribution. When an FDR-controlling oracle is available, the anisometric Hurdle model can dominate other methods in edge-sensitivity (Figure 4A–B). However, when the Hurdle model is over-parameterized as in the G-sparse scenarios, the minimal Logistic model is superior, though the anisometric penalty partially ameliorates this gap. In very simple chain-graph scenarios, it is neigh-impossible to recover a network using 10-cell data. The e. coli network provides a counter example where 10-cell data nearly equals the performance available from single cell data. This may be due to the hub-and-spoke nature of thee. coli network, so the effect of marginalization by convolution tends to only add more connections between the hub and its neighborhood. The e. coli data and chain-graphs suggest that collecting single cell data, and estimating graph structure with a method that accommodates zero inflation can accurately discover a wide variety of network topologies.
More seriously, ignoring zero-inflation confounds use of information criteria to tune network size (panel C). On the other hand, the Hurdle model is robust to a variety of model departures, including contamination with t8-distributed errors (labelled with “t”), and data generation under a Gaussian-Logistic censoring model. When the full solution path is examined (Figure 5), a practioner who reported only the top few edges would often suffer from a large number of false positives when using methods not designed for zero-inflated data. For example, with n = 100 in the e. coli network, all methods, aside from the Hurdle have FDR exceeding 20%. The simulations also suggest that perfect recovery of gene networks is impractical at realistic sample sizes, even with a correctly specified model, motivating a form of meta-analysis on estimated graphs, discussed further in Section 7.2.
6. T follicular helper cells.
Our simulations show that depending on the data generating scenario, the Hurdle method may substantially outperform, or at least mimic the performance of other candidate methods. We next sought to see if methods would tend towards consensus in biologically-derived single cell and 10-cell data, or if it were possible that the Hurdle method might offer unique insights. We considered co-expression networks in Tfh cells measured in eight healthy donors. 65 genes were selected for profiling via qPCR on the basis of their role in Tfh signaling and differentiation, generally with sparse expression across single cells (overall probability of expression 27%). 465 single cell, and 187 10-cell replicates were taken.
Figure 6 shows networks of approximately 24 edges estimated using Hurdle, Gaussian (with centered data, see Section 1 McDavid et al. [2018] and Logistic, and Gaussian model using 10-cell aggregates. The size of the network is a compromise between stability selected [Shah and Samworth, 2013] sizes of each procedure, which varies from 11 edges (Hurdle) to 32 edges (Gaussian).
Normalized Hamming distances between the four methods, the Aracne method and the Gaussian model fit on the “raw”, uncentered data are reported in Table 1. The Hurdle and Gaussian models are most similar, while the logistic and Gaussian 10-cell network are quite distinct. The Gaussian(raw) model on untransformed data is similar to the logistic model, as distance of non-zero expression values from the origin is large compared to the variation among the non-zero values.
Table 1.
Gaussian (10) | Gaussian | Gaussian(raw) | Hurdle | logistic | |
---|---|---|---|---|---|
Aracne | 1.00 | 0.92 | 0.92 | 1.00 | 1.00 |
Gaussian(10) | 1.00 | 1.00 | 1.00 | 1.00 | |
Gaussian | 0.92 | 0.65 | 1.00 | ||
Gaussian(raw) | 1.00 | 0.39 | |||
Hurdle | 1.00 |
In the Hurdle network, the transcription factors NFATC1 (Nuclear factor of activated T-cells) and BCL6, and the signaling molecule CD154 and chemokine receptor CCR3 are hubs. NFATC1 has been found to promote transcription of cytokines IL21 [Hermann-Kleiter and Baier, 2010] and signaling molecule CD154 [Pham et al., 2005], while BCL6 serves as a transcriptional repressor, and is one of the canonical markers constitutively expressed in Tfh cells. CTLA4 which has been described to inhibit inflammation, interacts negatively with inflammatory activator JAK3. The disconnected component of CCR3-CCR4-BTLA-SELL-TNFSF4 may hint at plasticity between Tfh cells and the related T-cell lineages Th1 and Th2. CCR3 and CCR4 are canonical markers of Th2 cells, while TNFSF4 (coding for OX40L) promotes Th2 development de Jong et al. [2002]. Thus co-expression of these genes may suggest cells transitioning between Tfh and Th1 or Th2 states.
In the Gaussian network, though NFATC1, BCL6 and CD154 remain highly connected, CD27 now has highest degree and serves as a hub to receptors CXCR4, IL2Rb, IL2Rg, as well as ITGB2, NFATC1 and FYN. CD3e, the backbone responsible for transducing the T-cell receptor signal is connected with co-receptor CD4, CD154, IL2Rg, Fyn and ANP32B. The negative interactions between BTLA and CTLA4 are absent.
The logistic network consists primarily of negative interactions. The strongly negative BCL6–BLIMP1 edge is consistent with previously described antagonism between these genes [Johnston et al., 2009]. Interestingly, this edge is absent in the other networks.
7. Mouse dendritic cells.
Shalek et al. [2014] exposed bone marrow-derived dendritic cells, from mus musculus, to lipopolysaccharide (LPS). LPS is a toxic compound secreted and structurally utilized by gram-negative bacteria and induces a cascade of changes in a cell’s expression profile through several pathways. Cells were sampled after 0, 1, 2, 4, and 6 hours post-exposure. We estimated transcription networks using 4431 transcripts expressed in at least 20% of 65 cells sampled 2 hours after LPS exposure, at which interval transcription is expected to be undergoing a variety of dynamic changes. Rather than attempting to perform model selection on this limited sample size, we consider highly sparse (< .01% sparsity) networks of 700 edges, chosen to provide tractable visualization and illustration of the method. The BIC tunings (discussed subsequently) are decidedly larger.
7.1. Selected networks.
In a Gaussian model, the network is star-shaped, with Mx1, Ccl17, Tax1bp3 and Ccl3 as hubs all with degrees ≥15, though none are directly inter-connected (Figure 7). In all, 2.5% of non-isolated vertices contribute 50% of the edges in the network. With the exception of Tax1bp3, these hub genes are all immune-signaling related.
In the Hurdle model (Figure 8), the graph is more chain-like, with maximum degree 12: 7% of nodes provide 50% of the edges. The strongest hub, Mgl2 (also known as Cd301b), has been recently described to be involved in uptake and presentation of glycosylated antigens, such as LPS, by dendritic cells [Denda-Nagai et al., 2010]. A sub-connected set of genes coding for MHC-II antigen presentation (H2ab1, H2eb1, H2aa) is the densest subcomponent, and interconnected to Mgl2 as well as Fabp5. Increased expression of Fabp5 has been shown to increase expression of cytokines Il7 and Il18, hence is also involved in immune cell stimulation [Adachi et al., 2012]. Many of the neighbors of Mgl2,H2ab1, H2eb1, H2aa and Fabp5 are neighbors of the hub genes in the Gaussian graph, whereas Mx1, Ccl17 and Ccl3 are sparsely connected in the Hurdle network. Tax1bp3 is absent.
Using BIC, both the Gaussian and Logistic models yield networks with more than 25,000 edges, while the Hurdle selects a network of roughly 12,000 edges. The additional flexibility available in the Hurdle for modeling inter-node relationships may permit sparser graphs to describe the conditional dependence relationships. We also observe that the Hurdle synthesizes signal from both Gaussian and Logistic networks. For sufficiently rich network sizes, the Gaussian and Hurdle and Logistic and Hurdle networks share 21% and 1% of possible edges, respectively, compared to only .08% of possible edges between the Gaussian and Logistic networks (binomial test p < 10−6).
7.2. Graphical geneset edge enrichment.
We consider how well the 700 edge networks recapitulate known relationships between genes using previously described functional annotations. The Gene Ontology Consortium [2015] provides a database of categories to which genes may be annotated if experimentally or computationally they are involved in a biological process. We note that networks may exhibit intraconnection within GO categories, and that some pairs of categories may exhibit preferential interconnection.
Each pair (i, j) of GO categories—including self-pairs—induces a coloring of vertices, coloring the vertices belonging to category i color ci and category j color cj. Vertices that do not belong to either i or j remain uncolored. Iterating through the 39872/2 pairs of categories, we test for edge enrichment between colors. Suppose in the inferred graph of 700 edges, nij edges connect ci-colored vertices to cj vertices. If the colored vertices were completely connected with ni vertices of color ci and nj vertices of color cj, then there would be mij = ni × nj edges among them (with the obvious adjustment made for self-edges when i = j). We now define an enrichment statistic as the hypergeometric tail probability
which is the probability of drawing nij colored balls, given 700 draws from urn containing 4431 × 4430/2 balls of which mij are colored.
This results in nearly 16 million enrichment statistics on the pairs of categories, which follow a complicated dependence structure under the series of null hypotheses that the observed edges being connected independent of coloring. The top 200 (smallest in magnitude) enrichment statistics t(k), k < 200 are compared to their distribution P (t*) under a Erdos-Renyi random graph model, yielding a Monte Carlo p-value for each order statistic. A pair of colors (i, j) with rank rij < 200 is declared significant if P (tij < t* (rij)) <.05 and P (t(r) < t* (r)) < .05 for all r < rij, that is, it is significant at 5% and all smaller order statistics are also significant.
7.2.1. Hurdle graphs tend to include intra-category enrichment.
In the Gaussian model, more than 100 pairs of categories (colors) are significantly enriched at an FDR of less than 10%, however in these pairs, only 6 correspond to intra-category enrichment (Figure 10). These are: response to salt stress, potassium channel regulator activity, extracellular exosome and three genesets containing genes with significant time-course differential expression in the original experiment. In the Hurdle model (Figure 11), 13 of 57 significantly enriched pairs form intra-connections, including defense response to Gram-negative bacteria, and cell-cell adhesion and several modules involving extracellular secretion via the Golgi apparatus. Also of particular note, genes annotated to the activation of innate immune response are directly connected to RNA PolII transcription factors, as well as “detection of lipopolysaccharide”–“endoplasmic reticulum–Golgi intermediate compartment.” Both of the modules are absent from the Gaussian network. This suggests that the more appropriate Hurdle model manages to identify transcription factor-induced expression changes in these regulated genes, a direct method by which one gene would induce expression changes in another.
No significant enrichment was found in the logistic model.
8. Discussion.
Graphical models estimated from single cell data are distinct from networks estimated from bulk data, or even repeated stochastic samples. In simulations, the Hurdle model with anisometric penalty has much greater sensitivity compared to available methods, while in the two data sets here, it yields substantially different network estimates compared to Gaussian and Logistic models on these zero-inflated data. When enrichment of gene ontology categories is considered between vertices in transcriptome-wide data, the enrichment uncovered with the Hurdle model is consistent with identifying direct effects of transcription factors on genes undergoing dynamic regulation due to LPS exposure.
In our work, we have utilized methods for sparse neighborhood selection. However, the zero-inflated parametric model explored here is not limited to this framework, and could serve as a basis for many network inference techniques, including mutual information-based techniques, or to parametrize families of directed networks.
Although measuring transcriptome-wide data allows conditional estimation of direct effects between genes, non-mRNA factors may also greatly affect gene expression. In this sense, important variables have still been marginalized over, and in the case of the Tfh data, indeed, most of the transcriptome has been marginalized over. Extensions that adapt graphical model selection to clustering and/or factor analytic models would likely be useful and allow greater biological insight with these data sets.
Supplementary Material
Acknowledgments.
RG was funded by a grant from the Bill and Melinda Gates foundation, the Vaccine and Immunology Statistical Center (VISC), OPP1151646. AM and RG acknowledge funding through grant R01 EB008400 from the National Institute of Biomedical Imaging and Bioengineering, US National Institutes of Health. MD was partially supported by grant DMS 1561814 from the US National Science Foundation.
AM thanks Daniel Lu for comments on the networks in reported section 6.
Footnotes
SUPPLEMENTARY MATERIAL
Supplement: Derivations and Methods (doi: 10.1214/00-AOASXXXXSUPP; .pdf). Supplemental derivations and methods for simulation and data preprocessing.
Available https://github.com/amcdavid/HurdleNormal
References.
- Adachi Yasuhiro, Hiramatsu Sumie, Tokuda Nobuko, Sharifi Kazem, Ebrahimi Majid, Islam Ariful, Kagawa Yoshiteru, Vaidyan Linda Koshy, Sawada Tomoo, Hamano Kimikazu, and Owada Yuji. Fatty acid-binding protein 4 (FABP4) and FABP5 modulate cytokine production in the mouse thymic epithelial cells. Histochemistry and Cell Biology, 138(3):397–406, 2012. [DOI] [PubMed] [Google Scholar]
- Chen Shizhe, Witten Daniela M., and Shojaie Ali. Selection and estimation for mixed graphical models. Biometrika, 102(1):47–64, November 2015. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Cheng Jie, Levina Elizaveta, and Zhu Ji. High-dimensional mixed graphical models. arXiv preprint arXiv:1304.2810, April 2013. [Google Scholar]
- de Jong Esther C, Vieira Pedro L, Kalinski Pawel, Schuitemaker Joost H N, Tanaka Yuetsu, Wierenga Eddy a, Yazdanbakhsh Maria, and Kapsenberg Martien L. Microbial compounds selectively induce Th1 cell-promoting or Th2 cell-promoting dendritic cells in vitro with diverse th cell-polarizing signals. Journal of Immunology, 168(4): 1704–1709, 2002. [DOI] [PubMed] [Google Scholar]
- Denda-Nagai Kaori, Aida Satoshi, Saba Kengo, Suzuki Kiwamu, Moriyama Saya, Oo-puthinan Sarawut, Tsuiji Makoto, Morikawa Akiko, Kumamoto Yosuke, Sugiura Daisuke, Kudo Akihiko, Akimoto Yoshihiro, Kawakami Hayato, Bovin Nicolai V., and Irimura Tatsuro. Distribution and function of macrophage galactose-type C-type lectin 2 (MGL2/CD301b): Efficient uptake and presentation of glycosylated antigens by dendritic cells. Journal of Biological Chemistry, 285(25):19193–19204, 2010. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Dobra Adrian, Hans Chris, Jones Beatrix, Nevins Joseph R., Yao Guang, and West Mike. Sparse graphical models for exploring gene expression data. Journal of Multivariate Analysis, 90:196–212, 2004. [Google Scholar]
- Drton Mathias and Maathuis Marloes. Structure learning in graphical modeling. AnnualReview of Statistics and Its Application, 4:365–393, 2017. [Google Scholar]
- Drton Mathias, Sturmfels Bernd, and Sullivant Seth. Lectures on algebraic statistics, volume 39 of Oberwolfach Seminars. Birkhäuser Verlag, Basel, 2009. [Google Scholar]
- Eltoft Torbjørn, Kim Taesu, and Lee Te Won. On the multivariate Laplace distribution. IEEE Signal Processing Letters, 13(5):300–303, 2006. [Google Scholar]
- Finak Greg, McDavid Andrew, Yajima Masanao, Deng Jingyuan, Gersuk Vivian, Shalek Alex K., Slichter Chloe K., Miller Hannah W., McElrath M. Juliana, Prlic Martin, Linsley Peter S., and Gottardo Raphael. MAST: a flexible statistical framework for assessing transcriptional changes and characterizing heterogeneity in single-cell RNA sequencing data. Genome Biology, 16(1):278, December 2015. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Foygel Rina and Drton Mathias. Exact block-wise optimization in group lasso and sparse group lasso for linear regression. Arxiv preprint arXiv:1010.3320, pages 1–19, 2010. [Google Scholar]
- Hermann-Kleiter Natascha and Baier Gottfried. NFAT pulls the strings during CD4+ T helper cell effector functions, 2010. [DOI] [PubMed]
- Janes Kevin a, Wang Chun-Chao, Holmberg Karin J, Cabral Kristin, and Brugge Joan S. Identifying single-cell molecular programs by stochastic profiling. Nature Methods, 7(4):311–317, 2010. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Johnston Robert J, Poholek Amanda C, DiToro Daniel, Yusuf Isharat, Eto Danelle, Barnett Burton, Dent Alexander L, Craft Joe, and Crotty Shane. Bcl6 and Blimp-1 are reciprocal and antagonistic regulators of T follicular helper cell differentiation. Science, 325, 2009. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Kim Jong Kyoung and Marioni John C. Inferring the kinetics of stochastic gene expression from single-cell RNA-sequencing data. Genome Biology, 14(1), 2013. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Lauritzen Steffen. Graphical Models. Oxford University Press, Oxford, 1st edition, 1996. [Google Scholar]
- Lee Jason D and Hastie Trevor J. Structure Learning of Mixed Graphical Models In AISTATS 16, volume 31, pages 388–396, Scottsdale, AZ, USA, 2013. [Google Scholar]
- Li Yupeng, Pearl Stephanie A., and Jackson Scott A.. Gene networks in plant biology: Approaches in reconstruction and analysis. Trends in Plant Science, 20(10):664–675, 2015. [DOI] [PubMed] [Google Scholar]
- Lin Lin, Finak Greg, Ushey Kevin, Seshadri Chetan, Hawn Thomas R, Frahm Nicole, Scriba Thomas J, Mahomed Hassan, Hanekom Willem, Bart Pierre-Alexandre, Pantaleo Giuseppe, Tomaras Georgia D, Rerks-Ngarm Supachai, Kaewkungwal Jaranit, Nitayaphan Sorachai, Pitisuttithum Punnee, Michael Nelson L, Kim Jerome H, Robb Merlin L, O’Connell Robert J, Karasavvas Nicos, Gilbert Peter, De Rosa Stephen C, McElrath M Juliana, and Gottardo Raphael. COMPASS identifies T-cell subsets correlated with clinical outcomes. Nature Biotechnology, 33(6):610–616, 2015. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Ma Cindy S., Deenick Elissa K., Batten Marcel, and Tangye Stuart G.. The origins, function, and regulation of T follicular helper cells. Journal of Experimental Medicine, 209(7):1241–1253, 2012. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Marinov Georgi K, Williams Brian A, McCue Ken, Schroth Gary P, Gertz Jason, Myers Richard M, and Wold Barbara J. From single-cell to cell-pool transcriptomes: stochasticity in gene expression and RNA splicing. Genome Research, 24(3):496–510, March 2014. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Markowetz Florian and Spang Rainer. Inferring cellular networks: a review. BMC Bioinformatics, 8(6), 2007. [DOI] [PMC free article] [PubMed] [Google Scholar]
- McDavid Andrew, Finak Greg, Chattopadyay Pratip K, Dominguez Maria, Lamoreaux Laurie, Ma Steven S, Roederer Mario, and Gottardo Raphael. Data exploration, quality control and testing in single-cell qPCR-based gene expression experiments. Bioinformatics, 29(4):461–467, 2013. [DOI] [PMC free article] [PubMed] [Google Scholar]
- McDavid Andrew, Gottardo Raphael, Simon Noah, and Drton Mathias. Supplement to“graphical models for zero-inflated single cell gene expression”. 2018. [DOI] [PMC free article] [PubMed]
- Meinshausen Nicolai and Peter Bühlmann. High-dimensional graphs and variable selection with the Lasso. Annals of Statistics, 34(3):1436–1462, 2006. [Google Scholar]
- Parikh Neal and Boyd Stephen. Proximal algorithms. Foundations and Trends in Optimization, 1(3):123–231, 2014. [Google Scholar]
- Pham Lan V., Tamayo Archito T., Yoshimura Linda C., Lin-Lee Yen Chiu, and Ford Richard J.. Constitutive NF-kappaB and NFAT activation in aggressive B-cell lymphomas synergistically activates the CD154 gene and maintains lymphoma cell survival. Blood, 106(12):3940–3947, 2005. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Precopio Melissa L., Betts Michael R., Parrino Janie, Price David A., Gostick Emma, Ambrozak David R., Asher Tedi E., Douek Daniel C., Harari Alexandre, Pantaleo Giuseppe, Bailer Robert, Graham Barney S., Roederer Mario, and Koup Richard A.. Immunization with vaccinia virus induces polyfunctional and phenotypically distinctive CD8(+) T cell responses. Journal of Experimental Medicine, 204(6):1405–1416, 2007. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Ravikumar Pradeep, Wainwright Martin J., and Laerty John D.. High-dimensional Ising model selection using -regularized logistic regression. The Annals of Statistics, 38(3): 1287–1319, June 2010. [Google Scholar]
- Shah Rajen D. and Samworth Richard J.. Variable selection with error control: another look at stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 75(1):55–80, January 2013. [Google Scholar]
- Shalek Alex K., Satija Rahul, Shuga Joe, Trombetta John J., Gennert Dave, Lu Diana, Chen Peilin, Gertner Rona S., Gaublomme Jellert T., Yosef Nir, Schwartz Schraga, Fowler Brian, Weaver Suzanne, Wang Jing, Wang Xiaohui, Ding Ruihua, Raychowdhury Raktima, Friedman Nir, Hacohen Nir, Park Hongkun, May Andrew P., and Regev Aviv. Single-cell RNA-seq reveals dynamic paracrine control of cellular variation. Nature, 510(7505):263–269, 2014. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Simon Noah and Tibshirani Robert. Standardization and the group lasso penalty. Statistica Sinica, 22(3):983–1001, 2012. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Tansey Wesley, Madrid Padilla Oscar Hernan, Suggala Arun Sai, and Ravikumar Pradeep. Vector-space Markov random fields via exponential families. Proceedings of The 32nd International Conference on Machine Learning, 37:684–692, 2015. [Google Scholar]
- The Gene Ontology Consortium. Gene Ontology Consortium: going forward. Nucleic Acids Research, 43(D1):D1049–D1056, 2015. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Tibshirani Robert, Bien Jacob, Friedman Jerome, Hastie Trevor, Simon Noah, Taylor Jonathan, and Tibshirani Ryan J.. Strong rules for discarding predictors in lasso-type problems. Journal of the Royal Statistical Society. Series B: Statistical Methodology, 74(2):245–266, 2012. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Yang Eunho, Baker Y, Ravikumar P, Allen G, and Liu Z. Mixed graphical models via exponential families In AISTATS 17, volume 33, Reykjavik, Iceland, 2014. [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.