A COMPOSITIONAL MODEL TO ASSESS EXPRESSION CHANGES FROM SINGLE-CELL RNA-SEQ DATA

Xiuyu MA; Keegan Korthauer; Christina Kendziorski; Michael A Newton

doi:10.1214/20-aoas1423

. Author manuscript; available in PMC: 2023 Jun 16.

Published in final edited form as: Ann Appl Stat. 2021 Jul 12;15(2):880–901. doi: 10.1214/20-aoas1423

A COMPOSITIONAL MODEL TO ASSESS EXPRESSION CHANGES FROM SINGLE-CELL RNA-SEQ DATA

Xiuyu MA ^1,^*, Keegan Korthauer ², Christina Kendziorski ³, Michael A Newton ^1,^†

PMCID: PMC10275512 NIHMSID: NIHMS1901161 PMID: 37332668

Abstract

On the problem of scoring genes for evidence of changes in the distribution of single-cell expression, we introduce an empirical Bayesian mixture approach and evaluate its operating characteristics in a range of numerical experiments. The proposed approach leverages cell-subtype structure revealed in cluster analysis in order to boost gene-level information on expression changes. Cell clustering informs gene-level analysis through a specially-constructed prior distribution over pairs of multinomial probability vectors; this prior meshes with available model-based tools that score patterns of differential expression over multiple subtypes. We derive an explicit formula for the posterior probability that a gene has the same distribution in two cellular conditions, allowing for a gene-specific mixture over subtypes in each condition. Advantage is gained by the compositional structure of the model not only in which a host of gene-specific mixture components are allowed but also in which the mixing proportions are constrained at the whole cell level. This structure leads to a novel form of information sharing through which the cell-clustering results support gene-level scoring of differential distribution. The result, according to our numerical experiments, is improved sensitivity compared to several standard approaches for detecting distributional expression changes.

Key words and phrases. Local false discovery rate, mixture model, empirical Bayes, clustering, double Dirichlet mixture

1. Introduction.

The ability to measure genome-wide gene expression at single-cell resolution has accelerated the pace of biological discovery. Overcoming data analysis challenges caused by the scale and unique variation properties of single-cell data will surely fuel further advances in immunology (Papalexi and Satija (2017)), developmental biology (Marioni and Arendt (2017)), cancer (Navin (2015)) and other areas (Nawy (2013)). Computational tools and statistical methodologies created for data of lower resolution (e.g., bulk RNA-seq) or lower dimension (e.g., flow cytometry) guide our response to the data-science demands of new measurement platforms, but they remain inadequate for efficient knowledge discovery in this rapidly advancing domain (Bacher and Kendziorski (2016)).

An important feature of single-cell studies that could be leveraged better statistically is the fact that cells populate distinct, identifiable subtypes determined by lineage history, epigenetic state, the activity of various transcriptional programs or other distinguishing factors. Extensive research on clustering cells has produced tools for identifying subtypes, including SC3 (Kiselev et al. (2017)), CIDR (Lin, Troup and Ho (2017)) and ZIFA (Pierson and Yau (2015)). We hypothesize that such subtype information may be usefully utilized in other inference procedures in order to improve their operating characteristics.

Assessing the magnitude and statistical significance of changes in gene expression associated with changes in cellular condition has been a central statistical problem in genomics. New tools specific to the single-cell RNA-seq data structure, including MAST (Finak et al. (2015)), scDD (Korthauer et al. (2016)), and D3E (Delmans and Hemberg (2016)) have been deployed to address this problem. These tools respond to scRNA-seq characteristics, such as high prevalence of zero counts and gene-level multimodality, but they do not fully exploit cellular-subtype information and, therefore, may be underpowered in some settings. The proposed method, which uses negative binomial mixtures to measure changes in a gene’s marginal sampling distribution, acquires sensitivity to a variety of distributional effects by how it integrates gene-level data with subtype information. From input data the associated R package scDDboost¹ prioritizes genes with a local false-discovery rate against the null hypothesis of no condition effect on the marginal sampling distribution. The complement of this rate is an empirical Bayesian posterior probability of differential distribution (DD). By incorporating transciptomic information on cell subtypes, scDDboost leverages useful and previously untapped information on each gene’s expression sampling distribution.

Through the compositional model underlying scDDboost, subtypes inferred by clustering inform the analysis of gene-level expression. The proposed methodology merges two lines of computation after cell clustering: one concerns patterns of differential expression among the cellular subtypes, and here we take advantage of the powerful EBseq method for detecting patterns in negative-binomially-distributed expression data (Leng et al. (2013)). The second concerns the counts of cells in various subtypes; for this we propose a double Dirichlet mixture distribution to model the pair of multinomial probability vectors for subtype counts in two experimental conditions. Further elements are developed on the selection of the number of subtypes and on accounting for uncertainty in the cluster output, in order to provide an end-to-end solution to the differential distribution problem. We note that modularity in the necessary elements provides some methodological advantages. For example, improvements in clustering may be used in place of the default clustering without altering the form of downstream analysis. Also, by avoiding Markov chain Monte Carlo, scDDboost computations are relatively inexpensive for a Bayesian procedure.

To set the context by way of example, Figure 1 highlights the ability of scDDboost to sense subtype composition changes and thus detect subtle gene expression changes between conditions. The three panels on the left compare expression from 91 human stem cells known to be in the G1 phase of the cell cycle, as well as from 76 such cells known to be in the G2/M phase (Leng et al. (2015)) in three genes (BIRC5, HMMR and CKAP2), which we happen to know from prior studies have differential activity between G1 and G2/M (Li and Altieri (1999), Sohr and Engeland (2008), Dominguez et al. (2016)). Several standard statistical tools applied to the data behind Figure 1 do not find the observed differences in any of these genes to be statistically significant when controlling the false discovery rate (FDR) at 5%, but scDDboost does include these genes on its 5% FDR list. Considering prior studies, these subtle distributional changes are probably not false discoveries. The right panel in Figure 1 shows these three among many other genes also known to be involved in cell-cycle regulation but not identified by standard tools as altered between G1 and G2/M at the 5% FDR level. The color panel provides insight into why scDDboost has identified these genes. For this data set, six cellular subtypes were identified in the first step of scDDboost (colors red, blue, green and orange are visible). These subtypes have changed in their proportions between G1 and G2/M; there is a lower proportion of red cells and a greater proportion of orange cells in G2/M, for example. These proportion shifts, which are inferred from genome-wide data, stabilize gene-specific statistics that measure changes between conditions in the mixture distribution of expression and, thereby, increase power. We note that scDDboost agrees with other statistical tools on very strong differential-distribution signals (not shown), but it has the potential to increase power for subtle signals owing to its unique approach to leveraging cell subtype information.

Fig. 1. — Genes involved in cell cycle that are identified by scDDboost, but not standard approaches, as differentially distributed between cell-cycle phases G1 and G2/M in human embryonic stem cells (GSE64016). Density estimates on the left show expression data (log2 scale) of three genes identified by scDDboost at 5% FDR but not similarly identified by MAST, scDD or DESeq2. Prior studies have shown that the expression of BIRC5, CKAP2 and HMMR is dependent on the phase of cell cycle, suggesting that these subtle shifts are not false positives. Heatmap (right) shows these three genes among 137 other cell-cycle genes (GO:0007049) identified exclusively by scDDboost, with expression from low (blue) to high (red). Cells (columns) are clustered by their genome-wide expression profiles into distinct cellular subtypes, as indicated by the color panel.

Numerical experiments on both synthetic and published scRNA-seq data bear out the incidental finding in Figure 1 that scDDboost has sensitivity for detecting subtle distribution changes. In these experiments we take advantage of splatter for generating synthetic data (Zappia, Phipson and Oshlack (2017)) as well as the compendium of scRNA-seq data available through conquer (Soneson and Robinson (2018)). Additional numerical experiments show a relationship between scDDboost findings and more mechanistic attempts to parameterize transcriptional activation (Delmans and Hemberg (2016)). Finally, we establish first-order asymptotic results for the methodology.

On manuscript organization we present the modeling and methodology elements in Section 2, numerical experiments in Section 3, asymptotic analysis in Section 4 and a discussion in Section 5. We relegate some details to an Appendix and many others to the Supplementary Material (Ma et al. (2021)).

2. Modeling.

2.1. Data structure, sampling model and parameters.

In modeling scRNA-seq data we imagine that each cell $c$ falls into one of $K > 1$ classes, which we think of as subtypes or subpopulations of cells. For notation, $z_{c} = k$ means that cell $c$ is of subtype $k$ , with the vector $z = (z_{c})$ recording the states of all sampled cells. Knowledge of this class structure prior to measurement is not required, as it will be inferred as necessary from available genomic data. We expect that cells arise from multiple experimental conditions, such as by treatment-control status or some other factors measured at the cell level, but we present our development for the special case of two conditions. Notationally, $y = (y_{c})$ records the experimental condition, say $y_{c} = 1$ or $y_{c} = 2$ . Let’s say condition $j$ measures $n_{j} = \sum_{c} 1 [y_{c} = j]$ cells, and, in total, we have $n = n_{1} + n_{2}$ cells in the analysis. The examples in Section 3 involve hundreds to thousands of cells. Further, let

t_{k}^{j} = t_{k}^{j} (y, z) = \sum_{c} 1 [y_{c} = j, z_{c} = k]

(1)

denote the number of cells of subtype $k$ in condition $j$ , and $X_{g, c}$ denote the normalized expression of gene $g$ in cell $c$ . This is one entry in a typically large genes-by-cells data matrix $X$ . Thus, the data structure entails an expression matrix $X$ , a treatment label vector $y$ and a vector $z$ of latent subtype labels.

We treat subtype counts in the two conditions, $t^{1} = (t_{1}^{1}, t_{2}^{1}, \dots, t_{K}^{1})$ and $t^{2} = (t_{1}^{2}, t_{2}^{2}, \dots, t_{K}^{2}$ ), as independent multinomial vectors, reflecting the experimental design. Explicitly,

t^{1} ∣ y ~ {Multinomial}_{K} (n_{1}, ϕ) and t^{2} ∣ y ~ {Multinomial}_{K} (n_{2}, ψ)

(2)

for probability vectors $ϕ = (ϕ_{1}, ϕ_{2}, \dots, ϕ_{K})$ and $ψ = (ψ_{1}, ψ_{2}, \dots, ψ_{K})$ that characterize the populations of cells from which the $n$ observed cells are sampled. This follows from the more basic sampling model: $P (z_{c} = k ∣ y_{c} = 1) = ϕ_{k}$ and $P (z_{c} = k ∣ y_{c} = 2) = ψ_{k}$ .

Our working hypothesis, referred to as the compositional model, is that any differences in the distribution of expression $X_{g, c}$ between $y_{c} = 1$ and $y_{c} = 2$ (i.e., any condition effects) are attributable to differences between the conditions in the underlying composition of cell types, that is, owing to $ϕ \neq ψ$ . We suppose that cells of any given subtype $k$ will present data according to a distribution reflecting technical and biological variation specific to that class of cells, regardless of the condition $y_{c}$ of the cell. Some care is needed in this, as an overly broad cell subtype (e.g., epithelial cells) could have further subtypes that show differential response to some treatment, for example, and so cellular condition (treatment) would then affect the distribution of expression data within the subtype which is contrary to our working hypothesis. Were that the case, we could have refined the subtype definition to allow a greater number of population classes $K$ in order to mitigate the problem of within-subtype heterogeneity. A risk in this approach is that $K$ could approach $n$ , as if every cell were its own subtype. We find, however, that data sets often encountered do not display this theoretical phenomenon when using a broad class of within-subtype expression distributions. Subtypes are considered such that cellular condition affects their composition but not the sampling distribution of expression within a subtype.

Within the compositional model, let $f_{g, k}$ denote the sampling distribution of expression measurement $X_{g, c}$ assuming that cell $c$ is from subtype $k$ . Then, for the two cellular conditions and at some expression level $x$ , the marginal distributions over subtypes are finite mixtures,

f_{g}^{1} (x) = \sum_{k = 1}^{K} ϕ_{k} f_{g, k} (x) and f_{g}^{2} (x) = \sum_{k = 1}^{K} ψ_{k} f_{g, k} (x) .

In other words, $X_{g, c} ∣ [y_{c} = j] ~ f_{g}^{j}$ and $X_{g, c} ∣ [z_{c} = k, y_{c} = j] ~ f_{g, k}$ .

We say that gene $g$ is differentially distributed, denoted ${DD}_{g}$ and indicated by $f_{g}^{1} \neq f_{g}^{2}$ , if $f_{g}^{1} (x) \neq f_{g}^{2} (x)$ for some $x$ , and, otherwise, it is equivalently distributed ( ${ED}_{g}$ ). Motivated by findings from bulk RNA-seq data analysis, we further set each $f_{g, k}$ to have a a negative-binomial form with mean $μ_{g, k}$ and shape parameter $σ_{g}$ , as in (Leng et al. (2013), Anders and Huber (2010), Love, Huber and Anders (2014) and Chen et al. (2018)). This choice is effective in our numerical experiments though it is not critical to the modeling formulation. The use of mixtures per gene has proven useful in related model-based approaches (e.g., Finak et al. (2015), McDavid et al. (2014), Huang et al. (2018)).

We seek methodology to prioritize genes for evidence of ${DD}_{g}$ . Interestingly, even if we have evidence for condition effects on the subtype frequencies, it does not follow that a given gene will have $f_{g}^{1} \neq f_{g}^{2}$ . That depends on whether or not the subtypes show the right pattern of differential expression at $g$ , using the standard terminology from bulk RNA-seq. For example, if two subtypes have different frequencies between the two conditions $(ϕ_{1} \neq ψ_{1}$ and $ϕ_{2} \neq ψ_{2})$ but the same aggregate frequency $(ϕ_{1} + ϕ_{2} = ψ_{1} + ψ_{2})$ and also, if components are equivalent, $f_{g, 1} = f_{g, 2}$ , then, other things being equal, marginals $f_{g}^{1} = f_{g}^{2}$ , even though $ϕ \neq ψ$ . Details confirming such equality are exemplified further in Supplementary Material Section 2.1, Ma et al. (2021). The fact is so central that we emphasize:

Key Issue.

A gene that does not distinguish two subtypes will also not distinguish the cellular conditions if those subtypes appear in the same aggregate frequency in the two conditions, regardless of changes in the individual subtype frequencies.

We formalize this issue in order that our methodology has the necessary functionality. To do so, first consider the parameter space $Θ = {θ = (ϕ, ψ, μ, σ)}$ , where $ϕ = (ϕ_{1}, ϕ_{2}, \dots, ϕ_{K})$ and $ψ = (ψ_{1}, ψ_{2}, \dots, ψ_{K})$ are as before, where $μ = {μ_{g, k}}$ holds all the subtype-and-gene-specific expected values and where $σ = {σ_{g}}$ holds all the gene-specific negative-binomial shape parameters. Critical to our construction are special subsets of $Θ$ corresponding to partitions of the $K$ cell subtypes. A single partition, $π$ , is a set of mutually exclusive and exhaustive blocks, $b$ , where each block is a subset of ${1, 2, \dots, K}$ , and we write $π = {b}$ . Of course, the set $Π$ containing all partitions $π$ of ${1, 2, \dots, K}$ has cardinality that grows rapidly with $K$ . We carry along an example involving $K = 7$ cell types and one three-block partition taken from the set of 877 possible partitions of ${1, 2, \dots, 7}$ (Figure 2).

Fig. 2. — Proportions of $K = 7$ cellular subtypes in two different conditions. Aggregated proportions of subtypes 3 and 4, subtypes 2, 5 and 6 and subtypes 1 and 7 remain same across conditions, while individual subtype frequencies change. Depending on the changes in average expression among subtypes, these frequency changes may or may not induce changes between two conditions in the marginal distribution of some gene’s expression.

For any partition $π = {b}$ , consider aggregate subtype frequencies

Φ_{b} = \sum_{k \in b} ϕ_{k} {and Ψ}_{b} = \sum_{k \in b} ψ_{k}

and extend the notation, allowing vectors $Φ_{π} = {Φ_{b} : b \in π}$ , and similarly for $Ψ_{π}$ . Recall the partial ordering of partitions based on refinement, and note that as long as $π$ is not the most refined partition (every cell type is in its own block), then the mapping from $(ϕ, ψ)$ to $(Φ_{π}, Ψ_{π})$ is many-to-one. Further, define sets

A_{π} = {θ \in Θ : Φ_{b} = Ψ_{b} \forall b \in π}

(3)

and

M_{g, π} = {θ \in Θ : μ_{g, k} = μ_{g, k^{'}} \Leftrightarrow k, k^{'} \in b, b \in π} .

(4)

Under $A_{π}$ there are constraints on cell subtype frequencies; under $M_{g, π}$ there is equivalence in the gene-level distribution of expression between certain subtypes. These sets are precisely the structures needed to address differential distribution ${DD}_{g}$ (and its complement, equivalent distribution, ${ED}_{g}$ ) at a given gene $g$ .

Theorem 1. Let $C_{g, π} = A_{π} \cap M_{g, π}$ . For partitions $π_{1} \neq π_{2}, C_{g, π_{1}} \cap C_{g, π_{2}} = \emptyset$ . Further, at any gene $g$ , equivalent distribution is

{ED}_{g} = \underset{π \in Π}{\cup} C_{g, π} .

With additional probability structure on the parameter space, we immediately obtain from Theorem 1 a formula for local false discovery rates,

1 - P ({DD}_{g} ∣ X, y) = P ({ED}_{g} ∣ X, y) = \sum_{π \in Π} P (A_{π} \cap M_{g, π} ∣ X, y) .

(5)

Local false discovery rates are important empirical Bayesian statistics in large-scale testing (Efron (2007); Muralidharan (2010); Newton et al. (2004)). For example, the conditional false discovery rate of a list of genes is the arithmetic mean of the associated local false discovery rates. The partition representation guides the construction of a prior distribution (Section 2.3) and a model-based method (Section 2.2) for scoring differential distribution. Setting the stage, Figure 3 shows the dependency structure of the proposed compositional model and the partition-reliant prior specification.

Fig. 3. — Directed acyclic graph structure of the compositional model and partition-reliant prior. The plate on the right side indicates i.i.d. copies over cells c, conditionally on mixing proportions and mixing components. Observed data are indicated in rectangles/squares, and unobserved variables are in circles/ovals.

Key to computing the gene-specific local false discovery rate $P ({ED}_{g} ∣ X, y)$ is evaluating probabilities $P (A_{π} \cap M_{g, π} ∣ X, y)$ . The dependence structure (Figure 3) implies a useful reduction of this quantity, at least conditionally upon subtype labels $z = (z_{c})$ . For each subtype partition $π$ and gene $g$ ,

Theorem 2. $P (A_{π} \cap M_{g, π} ∣ X, y, z) = P (A_{π} ∣ y, z) P (M_{g, π} ∣ X, z)$ .

In what follows we develop the modeling and computational elements necessary to efficiently evaluate inference summaries (5) taking advantage of Theorems 1 and 2. Roughly, the methodological idea is that subtype labels $z$ have relatively low uncertainty and may be estimated from genome-wide clustering of cells in the absence of condition information $y$ . (up to an arbitrary label permutation). We handle the modest uncertainty in $z$ through a computationally efficient randomized clustering scheme. Theorem 2 indicates that our computational task then separates into two parts given $z$ . On one hand, cell subtype frequcies combine with condition labels to give $P (A_{π} ∣ y, z)$ . Then, gene-level data locally drive the posterior probabilities $P (M_{g, π} ∣ X, z)$ that measure differential expression between subtypes. Essentially, the model provides a specific form of information sharing between genes that leverages the compositional structure of single-cell data in order to sharpen our assessments of between-condition expression changes.

2.2. Method structure and clustering.

To infer subtypes, we leverage the extensive research on how to cluster cells using scRNA-seq data: for example, SC3 (Kiselev et al. (2017)), CIDR (Lin, Troup and Ho (2017)) and ZIFA (Pierson and Yau (2015)). We propose distance-based clustering on the full set of profiles in a way that is blind to the condition label vector $y$ in order to have as many cells as possible to inform the subtype structure. We investigated several clustering schemes in numerical experiments and allow flexibility in this choice within the scDDboost software. Associating clusters with subtype labels ${\hat{z}}_{c}$ estimates the actual subtypes $z_{c}$ and prepares us to use Theorems 1 and 2 in order to compute separate posterior probabilities $P (A_{π} ∣ y, \hat{z})$ and $P (M_{g, π} ∣ X, \hat{z})$ that are necessary for scoring differential distribution. The first probability concerns patterns of cell counts over subtypes in the two conditions and has a convenient closed form within the double-Dirichlet model (Section 2.3). The second probability concerns patterns of changes in expected expression levels among subtypes, and this is also conveniently computed for negative-binomial counts using EBSeq (Leng et al. (2013)). Algorithm 1 summarizes how these elements combine to get the posterior probability of differential distribution per gene, conditional on an estimate of the subtype labels.

We invoke $K$ -medoids (Kaufman and Rousseeuw (1987)) as the default clustering method in scDDboost, and customize the cell-cell distance by integrating two measures. The first assembles gene-level information by cluster-based-similarity partitioning (Strehl and Ghosh (2003)). Separately at each gene, modal clustering (Dahl (2009) and in the Supplementary Material Section 2.2, Ma et al. (2021)) partitions the cells, and then we define dissimilarity between cells as the Manhattan distance between gene-specific partition labels. A second measure defines dissimilarity by one minus the Pearson correlation between cells, which is computationally inexpensive, less sensitive to outliers than Euclidean distance and effective at detecting cellular clusters in scRNA-seq (Kim et al. (2019)). The default clustering in scDDboost combines these two measures by weighted average, with $w_{C} = \frac{σ_{P}}{σ_{C} + σ_{P}}$ and $w_{P} = 1 - w_{C}$ , where $w_{C}, σ_{C}, w_{P}, σ_{P}$ are the weights and standard deviations of cluster-based distance and Pearson-correlation distance, respectively. The software allows other distances, such as provided by SC3, which we use in some numerical experiments; in any case, the final distance matrix is denoted $D = (d_{i, j})$ .

Any clustering method entails classification errors, and so ${\hat{z}}_{c} \neq z_{c}$ for some cells. To mitigate the effects of this uncertainty, scDDboost averages output probabilities from scDDboost-core over randomized clusterings ${\hat{z}}^{*}$ . These are not uniformly random but rather are generated by applying $K$ -medoids to a randomized distance matrix $D^{*} = (d_{i, j} / w_{i, j})$ , where $w_{i, j}$ are nonnegative weights $w_{i, j} = (e_{i} + e_{j})$ , where $(e_{i})$ are independent and identically Gamma distributed deviates with shape $\hat{a} / 2$ and rate $\hat{a}$ and where $\hat{a}$ is estimated from $D$ . (Thus, $w_{i, j}$ is $Gamma (\hat{a}, \hat{a})$ and has unit mean.) The distribution of clusterings induced by this simple computational scheme approximates a Bayesian posterior analysis, as we argue in the Appendix where we also present pseudo-code for the resulting scDDboost Algorithm 2. Averaging over results from randomized clusterings gives additional stability to the posterior probability statistics (Supplementary Figure S10).

Computations become more intensive the larger is the number $K$ of cell subtypes. Version 1.0 of scDDboost is restricted to $K \leq 9$ ; we consider further computational strategies in Section 5. Inferentially, taking $K$ to be too large may inflate the false positive rate (Supplementary Figure S11). The approach taken in ScDDboost is to set $K$ using the validity score (Ray and Turi (2000)) which measures changes in within-cluster sum of squares as we increase $K$ . Our implementation, in Supplementary Material Section 2.2, Ma et al. (2021), shows good operating characteristics in simulation.

2.3. $P (A_{π} ∣ y, z)$ .

We introduce the double Dirichlet mixture (DDM), which is the partition-reliant prior $p (ϕ, ψ)$ indicated in Figure 3, in order to derive an explicit formula for $P (A_{π} ∣ y, z)$ . We lose no generality here by defining $A_{π} = {(ϕ, ψ) : Φ_{b} = Ψ_{b} \forall b \in π}$ rather than as a subset of the full parameter space, as in (3). Each $A_{π}$ is closed and convex subset of the product space holding all possible pairs of length- $K$ probability vectors.

We propose a spike-slab-style mixture prior with the following form:

p (ϕ, ψ) = \sum_{π \in Π} ω_{π} p_{π} (ϕ, ψ) .

(6)

Each mixture component $p_{π} (ϕ, ψ)$ has support $A_{π}$ ; the mixing proportions $ω_{π}$ are positive constants summing to one. To specify component $p_{π}$ , notice that on $A_{π}$ there is a 1–1 correspondence between pairs $(ϕ, ψ)$ and parameter states,

{({\tilde{ϕ}}_{b}, {\tilde{ψ}}_{b}, Φ_{b}), \forall b \in π},

(7)

where

{\tilde{ϕ}}_{b} = \frac{ϕ_{b}}{Φ_{b}}, {\tilde{ψ}}_{b} = \frac{ψ_{b}}{Ψ_{b}} {and Φ}_{b} = \sum_{k \in b} ϕ_{k} = \sum_{k \in b} ψ_{k} = Ψ_{b}

For example, ${\tilde{ϕ}}_{b}$ is a vector of conditional probabilities for each subtype, given that a cell from the first condition is one of the subtypes in $b$ .

We introduce hyperparameters $α_{k}^{1}, α_{k}^{2} > 0$ for each subtype $k$ and set $β_{b} = \sum_{k \in b} (α_{k}^{1} + α_{k}^{2})$ for any possible block $b$ . Extending notation, let $α_{b}^{j}$ be the vector of $α_{k}^{j}$ for $k \in b, β_{π}$ be the vector of $β_{b}$ for $b \in π, ϕ_{b}$ and $ψ_{b}$ be vectors of $ϕ_{k}$ and $ψ_{k}$ , respectively, for $k \in b$ , and $Φ_{π}$ and $Ψ_{π}$ be the vectors of $Φ_{b}$ and $Ψ_{b}$ for $b \in π$ . The proposed double-Dirichlet component $p_{π}$ is determined in the transformed scale by assuming $Ψ_{π} = Φ_{π}$ and further,

\begin{array}{l} Φ_{π} ~ {Dirichet}_{N (π)} [β_{π}], \\ {\tilde{ϕ}}_{b} ~ {Dirichlet}_{N (b)} [α_{b}^{1}] & \forall b \in π, \\ {\tilde{ψ}}_{b} ~ {Dirichlet}_{N (b)} [α_{b}^{2}] & \forall b \in π, \end{array}

(8)

where $N (π)$ is the number of blocks in $π$ and $N (b)$ is the number of subtypes in $b$ and where all random vectors in (8) are mutually independent. Mixing over $π$ as in (6), we write $(ϕ, ψ) ~ DDM [ω = (ω_{π}), α^{1} = (α_{k}^{1}), α^{2} = (α_{k}^{2})]$ .

We record some properties of the component distributions $p_{π}$ .

Property 1. In $p_{π} (ϕ, ψ), ψ$ and $ϕ$ are dependent, unless $π$ is the null partition in which all subtypes constitute a single block.

Property 2. With $k \in b$ , marginal means are

E_{π} (ϕ_{k}) = \frac{α_{k}^{1}}{\sum_{k^{'} \in b} α_{k^{'}}^{1}} \frac{β_{b}}{\sum_{b^{'} \in π} β_{b^{'}}} and E_{π} (ψ_{k}) = \frac{α_{k}^{2}}{\sum_{k^{'} \in b} α_{k^{'}}^{2}} \frac{β_{b}}{\sum_{b^{'} \in π} β_{b^{'}}} .

Recall from (1) the vectors $t^{1}$ and $t^{2}$ holding counts of cells in each subtype in each condition, computed from $y$ and $z$ . Relative to a block $b \in π$ , let $t_{b}^{j} = \sum_{k \in b} t_{k}^{j}$ , for cell conditions $j = 1, 2$ , and, let $t_{π}^{j}$ be the vector of these counts over $b \in π$ . The following properties refer to marginal distributions in which $(ϕ, ψ)$ have been integrated out of the joint distribution involving (2) and the component $p_{π}$ .

Property 3. $t^{1}$ and $t^{2}$ are conditionally independent, given $y, t_{π}^{1}$ and $t_{π}^{2}$ .

Property 4. For $j = 1, 2$ ,

p_{π} (t^{j} ∣ t_{π}^{j}, y) = \prod_{b \in π} {[\frac{Γ (t_{b}^{j} + 1)}{\prod_{k \in b} Γ (t_{k}^{j} + 1)}] [\frac{Γ (\sum_{k \in b} α_{k}^{j})}{\prod_{k \in b} Γ (α_{k}^{j})}] [\frac{\prod_{k \in b} Γ (α_{k}^{j} + t_{k}^{j})}{Γ (t_{b}^{j} + \sum_{k \in b} α_{k}^{j}))}]} .

Property 5.

p_{π} (t_{π}^{1}, t_{π}^{2} ∣ y) = [\frac{Γ (n_{1} + 1) Γ (n_{2} + 1)}{\prod_{b \in π} Γ (t_{b}^{1} + 1) Γ (t_{b}^{2} + 1)}] [\frac{Γ (\sum_{b \in π} β_{b})}{\prod_{b \in π} Γ (β_{b})}] [\frac{\prod_{b \in π} Γ (β_{b} + t_{b}^{1} + t_{b}^{2})}{Γ (n_{1} + n_{2} + \sum_{b \in π} β_{b})}] .

Let’s look at some special cases to dissect this result.

Case 1. If $π$ has a single block equal to the entire set of cell types ${1, 2, \dots, K}$ , then $t_{b}^{j} = n_{j}$ for both $j = 1, 2$ , and Property 5 reduces, correctly, to $p_{π} (t_{π}^{1}, t_{π}^{2} ∣ y) = 1$ . Further,

p_{π} (t^{j} ∣ t_{π}^{j}, y) = [\frac{Γ (n_{j} + 1)}{Γ (n_{j} + \sum_{k = 1}^{K} α_{k}^{j})}] [\frac{Γ (\sum_{k = 1}^{K} α_{k}^{j})}{\prod_{k = 1}^{K} Γ (α_{k}^{j})}] [\prod_{k = 1}^{K} \frac{Γ (α_{k}^{j} + t_{k}^{j})}{Γ (t_{k}^{j} + 1)}]

which is the well-known Dirichlet-multinomial predictive distribution for counts $t^{j}$ (Wagner and Taudes (1986)). For example, taking $α_{k}^{j} = 1$ for all types $k$ , we get the uniform distribution

p_{π} (t^{j} ∣ t_{π}^{j}, y) = \frac{Γ (n_{j} + 1) Γ (K)}{Γ (n_{j} + K)} .

Case 2. At the opposite extreme, $π$ has one block $b$ for each class $k$ , so $ϕ = ψ$ . Then, $p_{π} (t^{j} ∣ t_{π}^{j}, y) = 1$ , and, further, writing $b = k$ ,

p_{π} (t_{π}^{1}, t_{π}^{2} ∣ y) = [\frac{Γ (n_{1} + 1) Γ (n_{2} + 1)}{\prod_{k = 1}^{K} Γ (t_{k}^{1} + 1) Γ (t_{k}^{2} + 1)}] [\frac{Γ (\sum_{k = 1}^{K} β_{k})}{\prod_{k = 1}^{K} Γ (β_{k})}] [\frac{\prod_{k = 1}^{K} Γ (β_{k} + t_{k}^{1} + t_{k}^{2})}{Γ (n_{1} + n_{2} + β_{k})}],

which corresponds to Dirichlet-multinomial predictive distribution for counts $t^{1} + t^{2}$ since $t^{1}$ and $t^{2}$ are identical distributed given $(ϕ, ψ)$ in this case. These properties are useful in establishing

Theorem 3. DDM is conjugate to multinomial sampling of $t^{1}$ and $t^{2}$ ,

(ϕ, ψ) ∣ y, z ~ DDM [ω^{post} = (ω_{π}^{post}), α^{1} + t^{1}, α^{2} + t^{2}]

where

ω_{π}^{post} \propto p_{π} (t^{1} ∣ t_{π}^{1}, y) p_{π} (t^{2} ∣ t_{π}^{2}, y) p_{π} (t_{π}^{1}, t_{π}^{2} ∣ y) ω_{π}

(9)

The target probability $P (A_{π} ∣ y, z)$ is an integral of the posterior distribution in Theorem 3. To evaluate it, we need to contend with the fact that sets ${A_{π} : π \in Π}$ are not disjoint. Relevant overlaps have to do with partition refinement. Recall that a partition $π^{r}$ is a refinement of a partition $π^{c}$ if, for any $b \in π^{c}$ , there exists $s \subset π^{r}$ such that $\cup_{b^{'} \in s} b^{'} = b$ . We say $π^{c}$ coarsens $π^{r}$ when $π^{r}$ refines $π^{c}$ . Any partition both refines and coarsens itself as a trivial case. Generally, refinements increase the number of blocks. If subtype frequency vectors $(ϕ, ψ)$ satisfy the constraints in $A_{π^{r}}$ , then they also satisfy the constraints of any $π^{c}$ that coarsens $π^{r}$ ; that is, $A_{π^{r}} \subset A_{π^{c}}$ . Refinements reduce the dimension of allowable parameter states. For the double-Dirichlet component distributions $P_{π}$ , we find

Property 6. For two partitions $\tilde{π}$ and $π, P_{\tilde{π}} (A_{π} ∣ y, z) = 1 [\tilde{π}$ refines $π]$ .

This supports the main finding of this section,

P (A_{π} ∣ y, z) = \sum_{\tilde{π} \in Π} ω_{\tilde{π}}^{post} 1 [\tilde{π} refines π]

(10)

2.4. $P (M_{g, π} ∣ X, z)$ .

We leverage well-established modeling techniques for transcript analysis, including (Leng et al. (2013), Newton and Kendziorski (2003) and Jensen et al. (2009)) which characterize equivalent or differential expression in terms of shared or independently drawn mean effects. Let $X_{g, b}$ denote the subvector of expression values at gene $g$ over cells $c$ with $z_{c} = k$ for which subtype $k$ is part of block $b$ of partition $π$ . Conditioning on subtype labels $z = (z_{c})$ , we assume that under $M_{g, π}$ :

Between blocks: subvectors ${X_{g, b} : b \in π}$ are mutually independent;
Within blocks: for cells mapping to block $b$ , observations $X_{g, c}$ are i.i.d;
Mean effects: for each block $b$ , there is a univariate mean, $μ_{g, b}$ , shared by cells mapping to that block; a priori these means are i.i.d. between blocks.

These assumptions imply a useful factorization marginally to latent means,

P (X_{g} ∣ M_{g, π}, z) = \prod_{b \in π} f (X_{g, b}),

(11)

where $f$ is a customized density kernel. In our case we use EBseq from (Leng et al. (2013)): the sampling distribution of $X_{g, c}$ is negative binomial, and $f$ becomes a particular compound multivariate negative binomial formed from integrating uncertainty in the block-specific means (see Supplementary Material Section 2.2, Ma et al. (2021)). Through its gene-level mixing model, EBseq also gives estimates of ${P (M_{g, π} ∣ z)}$ : the proportions of genes governed by any of the different patterns $π$ of equivalent/differential expression among subtypes. With these estimates and (11) we compute by Bayes’s rule,

P (M_{g, π} ∣ X, z) \propto P (M_{g, π} ∣ z) \prod_{b \in π} f (X_{g, b}) .

The proportionality is resolved by calculating over all partitions $π$ .

3. Numerical experiments.

3.1. Synthetic data.

We used splatter (v. 1.2.0) to generate synthetic scRNA-seq data for which the DD status of genes is known (Zappia, Phipson and Oshlack (2017)), thereby allowing us to measure operating characteristics of scDDboost in a controlled setting. Splatter is a generative system for simulating realistic single-cell RNA-seq data. It accounts for biological and technical sources of variation and is calibrated from a number of published data sets. Our hypothetical two-condition comparison involved $n = 400$ cells and 17,421 genes and mixing over various numbers $K$ of distinct subtypes. To reflect common variation patterns, we adopted default settings of the primary parameters in splatter and focused our experiments on four settings of splatter’s location and scale parameters $(θ, γ)$ which encode distributional shifts between subtypes. We entertained 12 scenarios encoding four distributional shift settings for each of three different values for the number $K$ of subtypes, with composition parameters $ϕ$ and $ψ$ selected to account for various mixing possibilities. Ten replicate data sets were simulated on each scenario. These 12 scenarios, encoded by $K / θ / γ$ , span states with rather strong signals, like 3/ −0.1/1 to quite weak signals, like 15/0.1/0.4. Supplementary Figures S6 and S7 provide a view of the global separation between the subtypes and the degree of difficulty of the inference task. We note that the mechanistic sampling model induced by splatter is distinct from the descriptive model underlying scDDboost. We choose it to reflect anticipated technical and biological sources of variation. Further details are in Supplementary Material Section 3.1, Ma et al. (2021).

Figures 4 and 5 summarize the true positive rate and false discovery rate of scDDboost compared to three other methodologies: MAST (v. 1.4.0), ScDD (v. 1.2.0) and DESeq2 (v. 1.18.1). scDDboost exhibits very good operating characteristics in this study, as it controls the FDR in all cases while also delivering a relatively high rate of true positives in all cases. The beneficial sampling properties are not limited to the 5% FDR threshold, as indicated by receiver operator characteristic curves (Supplementary Figure S9).

Fig. 4. — True positive rate (vertical) of four DD detection methods in 12 synthetic-data settings (horizontal). Settings are labeled for $K / θ / γ$ and ranked by scDDboost values. Each method is targeting a 5% false discovery rate (FDR). The plot shows average rates over replicate simulated data in each setting.

Fig. 5. — False discovery rate (vertical) of methods in the synthetic-data settings (horizontal, same order) from Figure 4. FDR is approximated from the 12 replicate data sets in each scenario using the known generative hypothesis states.

3.2. Empirical study.

We applied scDDboost to a collection of previously published data sets that are recorded at conquer (Soneson and Robinson (2018)). Though not knowing the truly DD genes, we can examine how scDDboost output compares to output from several standard methods. We selected 12 data sets from conquer representing different species and experimental settings and involving hundreds to thousands of cells. Appendix Table A1 provides details. Figure 6 compares methods in terms of the size of the reported list of DD genes at the 5% FDR target level. We see a consistently high yield of scDDboost among the evaluated methods. For reference, one of these data sets (GSE64016) happens to be the data behind Figure 1, where we know from other information that some of the uniquely identified genes are likely not to be false positives.

Fig. 6. — Proportion of DD genes at 5% FDR threshold with respect to total number of genes identified by each method. Data sets are ordered by scDDboost list size.

To check that the increased discovery rate of scDDboost is not associated with an increased rate of false calls, we applied it to a series of random splits of single-condition data sets (Appendix Table A2). Figure 7 confirms a very low call rate in cases where no changes in distribution are expected.

Fig. 7. — False positive counts at 5% FDR threshold by several methods on five random splits of nine single-condition data sets from Appendix Table A2.

We conjecture that scDDboost gains power through its novel approach to borrowing strength across genes, that is, that the genomic data are providing information about cell subtypes and mixing proportions, leaving gene-level data to guide gene-specific mixture components. One way to drill into this idea is to consider how many genes have similar expression characteristics to a given gene. By virtue of the EBseq analysis inside scDDboost, we may assign each gene to a set of related genes that all have the same highest-probability pattern of equality/inequality of means across the subtypes. Say ${\hat{π}}_{g} = {argmax}_{π} P (M_{g, π} ∣ \hat{z}, X)$ . In Figure 8 we show that, compared to DD genes commonly identified by multiple methods (blue), the set sizes for genes uniquely identified by scDDboost (red) tend to be larger. Essentially, the proposed methodology boosts weak DD evidence when a gene’s pattern of differential expression among cell subtypes matches a large number of other genes.

Fig. 8. — Genes are grouped by their pattern of differential expression across subtypes as inferred by the EBseq computation within scDDboost for three example datasets. Cumulative distribution functions of the log-scale size statistic for all genes identified by scDDboost are plotted; red is the subset uniquely identified by scDDboost; blue are those also identified by the comparison methods (MAST, scDD, or DESeq2). Sets of similarly-patterned genes tend to be larger (horizontal axis, log size) for genes uniquely identified by scDDboost (red) compared to other DD genes (blue), at 5% FDR.

3.3. Bursting.

Transcriptional bursting is a fundamental property of genes, wherein transcription is either negligible or attains a certain probability of activation (Raj and van Oudenaarden (2008)). D3E (Delmans and Hemberg (2016)) is a computationally intensive method for DE gene analysis rooted in modeling the bursting process. It considers transcripts as in the stationary distribution from an experimentally validated stochastic process of single-cell gene expression (Peccoud and Ycart (1995)). Three mechanistic parameters (rate of promoter activation, rate of promoter inactivation and the conditional rate of transcription given an active promoter) characterize the model which allows distributional changes between conditions without changes in mean expression level. For genes identified as DD by scDDboost in dataset GSE71585, either uniquely or in common with comparison methods MAST, scDD and DESeq2, Figure 9 shows changes of these bursting parameters. Interestingly, genes uniquely identified by scDDboost are associated with more significant changes between estimated bursting parameters than genes that all methods identify. This finding and similar findings on other data sets (not shown) provide some evidence that scDDboost is able to detect biologically meaningful changes in the expression distribution.

Fig. 9. — Absolute values of log fold changes of bursting parameters tend to be larger for 1758 genes uniquely identified by scDDboost (red) compared to 2983 genes (blue) that are identified at 5% FDR by scDDboost and other methods: MAST, scDD, and DESeq2.

3.4. Time complexity.

Run time complexity of scDDboost is dominated by the cost of clustering cells and of running EBSeq to measure differences between subtypes. Recall the notation that $n$ for number of cells, $G$ for number of genes and $K$ for number of subtypes. Our distance-based clustering of $n$ cells measuring $G$ genes requires on the order of $G \times n^{2}$ operations (see Supplementary Material Section 2.2, Ma et al. (2021)). Further, EBSeq uses summed counts within each subtype for each gene to compute its density kernel, and there are $Bell (K)$ differential patterns to compute, where Bell counts the partitions of $K$ . We impose the computational limit $K \leq 9$ in scDDboost (v. 1.0). In a typical case involving 20,000 genes and 200 cells, using 50 of randomized distances, scDDboost is relatively efficient for $K \leq 6$ requiring less than 15 CPU minutes on, for example, a quad-core 2.2 GHz Intel Core i7 with 16 Gb of RAM. The same data might require 20 to 40 CPU hours when $K = 9$ . In Section 5 we mention some opportunities to improve this speed.

3.5. Diagnostics.

As implemented, scDDboost uses a particular distance matrix to inform subtypes and computes probabilities in a model for which expression is a mixture of constant-shape negative binomials. To check the effect of these assumptions, we consider a variety of diagnostic calculations using the data sets presented in Sections 3.1 and 3.2. We first point out that model misspecification may have a limited impact on Type-I error rates, as evidenced by the permutation study (Figure 7) and also the synthetic-data study (Figure 5), which does not encode the same modeling assumptions as scDDboost.

To check the within-subtype negative binomial (NB) assumption, we deployed a bootstrap goodness-of-fit test in three data sets (Yin and Ma (2013)). Fewer than 1.5% of genes show evidence against a within subtype NB assumption at a 5% FDR. Further, for most of these non-NB genes the inference drawn by scDDboost is the same as that drawn by various other methods (Supplementary Table S3) and where there are differences the scDDboost call is plausible (Supplementary Figure S12). Among the genes identified by scDDboost at 5% FDR in the stem-cell example (recall Figure 1), just six of them fail the NB test, and two of these are uniquely called by scDDboost; further, one of the two genes is cell-cycle related.

The constant-shape assumption is less well supported empirically, according to a likelihood-ratio test that we developed (Supplementary Figure S13). More than 15% of genes show evidence against constant shape in the examples considered, though inference on differential distribution is only mildly affected (Supplementary Table S4, Figure S14). We have similar findings in the splatter-generated synthetic data. Even with evident model misspecification, Figure 10 shows that the marginal fit by scDDboost is reasonably accurate in the stem-cell example.

Fig. 10. — Empirical CDF for observed data (green) compared to expression levels simulated from the fitted NB mixture (pink) for data set GSE64016. The top six panels are genes randomly selected from those genes being identified as DD by `scDDboost` and not violating constant shape assumption; the last three genes are randomly sampled from genes that fail the test of constant NB shape parameter. Each thin pink curve is from one of the randomized distances; the thicker pink curve represents pointwise averaging over 10 randomized distance matrices. Note, horizontal scales differ among the panels, and cells are pooled from the two conditions.

Choice of distance function.

We propose distance-based clustering to convey subtype information (Section 2.2), and one may ask how sensitive are findings to the choice of distance function. The randomization scheme mitigates variation associated with the specific partition of cells, but the choice of distance function does affect the computed probabilities. To check this, we repeated the numerical experiments in Section 3 using the SC3 distance in place of the default. Importantly, Type I error rates remain in control, as evidenced by the simulation study and the negative-control permutatation study (Supplementary Figure S15). There can be substantial differences in yeild at a given FDR control level (Supplementary Table S5).

4. Asymptotics of the double Dirichlet mixture.

Summary statistics $P (A_{π} ∣ y, z)$ , from Section 2.3, are amenable to a first-order asymptotic analysis that provides further insight into DDM model behavior. The fact that support sets $A_{π}$ for component distributions $p_{π} (ϕ, ψ)$ are not disjoint becomes an important issue. Consider distinct partitions $π_{1}$ and $π_{2}$ of subtypes ${1, 2, \dots, K}$ , and recall that $N (π)$ counts the number of blocks in partition $π$ . In case $π_{2}$ refines $π_{1}$ , then $N (π_{1}) < N (π_{2})$ , and we also know that $A_{π_{2}} \subset A_{π_{1}}$ , since refinement imposes additional constraints on the pair $(ϕ, ψ)$ of probability vectors. If the data-generating state $(ϕ, ψ) \in A_{π_{2}}$ , one might ask how posterior probability mass tends to be allocated among the other mixture components whose support sets also contain this state. The question is addressed by the following.

Theorem 4. Let $π_{1}$ and $π_{2}$ denote two partitions for which $N (π_{1}) < N (π_{2})$ and $A_{π_{1}} \cap$ $A_{π_{2}}$ is nonempty. Let $(ϕ, ψ) \in A_{π_{1}} \cap A_{π_{2}}$ denote the data generating state for subtype labels $z_{1}, z_{2}, \dots, z_{n}$ given i.i.d. Bernoulli condition labels $y_{1}, y_{2}, \dots, y_{n}$ , and recall the posterior mixing proportions $ω_{π}^{post}$ from equation (9) with hyperparameters $α_{i}^{j} \geq 1$ for $i = 1, \dots, K$ , $j = 1, 2$ . Then

\frac{ω_{π_{1}}^{post}}{ω_{π_{2}}^{post}} \to_{a.s.} 0 as n \to \infty .

Essentially, mixing mass is transferred to components associated with the most refined partition consistent with a given parameter state. To be precise, let $H (ϕ, ψ) = {π : (ϕ, ψ) \in A_{π}}$ record all the partitions associated with one state. Typically, there is a most refined partition, $π^{*} = π^{*} (ϕ, ψ)$ such that

A_{π^{*}} = \underset{π \in H (ϕ, ψ)}{\cap} A_{π} .

(12)

This always happens when $K \leq 3$ . In Supplementary Material Section 4, Ma et al. (2021), we characterize the exceptional set of states where (12) does not hold. Notably, if (12) does hold for state $(ϕ, ψ)$ , then, for any $π \in H (ϕ, ψ)$ , using Theorem 4 and (10), we have

P (A_{π} ∣ y_{1}, \dots, y_{n}; z_{1}, \dots, z_{n}) \to_{a.s.} 1 as n \to \infty .

This provides conditions under which we expect good performance for large numbers of cells.

5. Concluding remarks.

We have presented scDDboost, a tool for detecting differentially distributed genes from scRNA-seq data, where transcripts are modeled as a mixture of cellular subtypes. The methodology links established model-based techniques with novel empirical Bayesian modeling and computational elements to provide a powerful detection method showing comparatively good operating characteristics in simulation, empirical and asymptotic studies.

In the software and numerical experiments we made specific choices, such as to use mixtures of negative binomial components per gene, and to use K-medoids clustering on particular cell-cell distances. These choices have evident advantages, but the model structure and theory developed in Section 2 carry through for other cases. Future experiments could study other formulations within the same schema; for example there may be cell-cell distances that better capture the intrinsic dimensionality of expression programs, including, perhaps distances based on diffusions (Haghverdi, Buettner and Theis (2015)) or the longest-leg path distance (Little, Maggioni and Murphy (2017)). Future experiments could also further assess operating characteristics when the number of cells is very large and the number of reads is relatively small, as may arise with unique molecular identifiers (Chen et al. (2018)). Further, assuming a compositional structure to drive model-based computations may not be restrictive, since it allows great flexibility in the form of each gene/condition-specific expression distribution (as coded, they are finite mixtures of negative binomials).

EBSeq currently presents a computational bottleneck for scDDboost, since it searches all partitions of $K$ and encodes a hyperparameter estimation algorithm that scales poorly with $K$ . Several approximations present themselves that may redress the problem, since, in the mixture model context, only patterns $π$ corresponding to relatively probable expression-change patterns over subtypes have a big impact on the final posterior inference Ma, Kendziorski and Newton (2020). Even after resolving this bottleneck, there are advantages to having $K$ small compared to $n$ . Numerical experiments show increased false discoveries when $K$ is overestimated. But accurate estimation with large $K$ would not be expected to provide much improved power, since that depends on accurate estimation of subtypes and their frequencies which relies on $K$ being relatively small compared to $n$ .

Supplementary Material

Supplemental Material

NIHMS1901161-supplement-Supplemental_Material.zip^{(6.2MB, zip)}

Acknowledgments.

Dr Newton is also in the Department of Biostatistics and Medical Informatics at University of Wisconsin-Madison.

We thank the Associate Editor and referees for very helpful comments.

Funding.

This research was supported in part by U.S. National Institutes of Health Grants P50 DE026787, P30CA14520-45, R01 GM102756, U54AI117924, and U.S. National Science Foundation Grant 1740707.

APPENDIX

Proof of Theorem 1. If $θ \in \cup_{π \in Π} [A_{π} \cap M_{g, π}]$ , then there exists a partition $π$ for which $θ \in A_{π}$ and $θ \in M_{g, π}$ . By construction

f_{g}^{1} (x) = \sum_{k = 1}^{K} ϕ_{k} f_{g, k} (x) = \sum_{b \in π} \sum_{k \in b} ϕ_{k} f_{g, k} (x) = \sum_{b \in π} Φ_{b} f_{g, k^{*} (b)} (x),

where $k^{*} (b)$ indexes any component in $b$ , since all components in that block have the same component distribution owing to constraint $M_{g, π}$ . Continuing, using the constraint $θ \in A_{π}$ ,

f_{g}^{1} (x) = \sum_{b \in π} Ψ_{b} f_{g, k^{*} (b)} (x) = f_{g}^{2} (x) \forall x

That is, $θ \in {ED}_{g}$ .

If $θ \in {ED}_{g}$ , then $f_{g}^{1} (x) = f_{g}^{2} (x)$ for all $x$ . Noting that both are mixtures over the same set of components ${f_{g, k}}$ , let ${h_{g, l} : l = 1, 2, \dots, L}$ be the set of distinct components over this set, and so

f_{g}^{1} (x) = \sum_{k = 1}^{k} ϕ_{k} f_{g, k} (x) = \sum_{l = 1}^{L} c_{g, l} (ϕ) h_{g, l} (x) = \sum_{l = 1}^{L} c_{g, l} (ψ) h_{g, l} (x) = f_{g}^{2} (x),

where

c_{g, l} (ϕ) = \sum_{k = 1}^{K} ϕ_{k} 1 [f_{g, k} = h_{g, l}], c_{g, l} (ψ) = \sum_{k = 1}^{K} ψ_{k} 1 [f_{g, k} = h_{g, l}] .

(13)

Finite mixtures of distinct negative binomial components are identifiable (Proposition 5 from Yakowitz and Spragins (1968)), and so the equality of $f_{g}^{1}$ and $f_{g}^{2}$ implies $c_{g, l} (ϕ) = c_{g, l} (ψ)$ for all $l = 1, 2, \dots, L$ . Identifying the partition blocks $b_{l} = {k : f_{g, k} = h_{g, l}}$ and the partition $\tilde{π} = {b_{l}}$ , we find $θ \in A_{\tilde{π}} \cap M_{g, \tilde{π}}$ . The accumulated probabilities in (13) correspond to $Φ_{\tilde{π}}$ and $Ψ_{\tilde{π}}$ which are equal on $A_{\tilde{π}}$ .

Randomizing distances for approximate posterior inference.

One way to frame the subtype problem is to suppose that subtype labels $z = (z_{i})$ satisfy $z = f (Δ)$ , where $Δ = (δ_{i, j})$ is a $n \times n$ matrix holding true, unobservable distances, such as $δ_{i, j}$ between cells $i$ and $j$ and that $f$ is some assignment function, like the one induced by the $K$ -medoids algorithm. Then, posterior uncertainty in $z$ would follow directly from posterior uncertainty in $Δ$ . On one hand, we could proceed via formal Bayesian analysis, say under a simple conjugate prior in which $1 / δ_{i, j} ~ Gamma (a_{0}, d_{0})$ , for hyperparameters $a_{0}$ and $d_{0}$ and in which the observed distance $d_{i, j} ∣ δ_{i, j} ~ Gamma (a_{1}, a_{1} / δ_{i, j})$ . This would assure that $δ_{i, j}$ is the expectation of $d_{i, j}$ , with shape parameter $a_{1}$ affecting variation of measured distances about their expected values. Not accounting for any constraints imposed by both $D$ and $Δ$ being distance matrices, we would have the posterior distribution $1 / δ_{i, j} ∣ D ~ Gamma (a_{0} + a_{1}, d_{0} + a_{1} d_{i, j})$ . For any threshold $c > 0$ , we would find

P (δ_{i, j} \leq c ∣ D) = P (U \geq \frac{d_{0} + a_{1} d_{i, j}}{c (a_{0} + a_{1})})

(14)

where $U ~ Gamma (a_{0} + a_{1}, a_{0} + a_{1})$ .

Alternatively, we could form randomized distances $d_{i, j}^{*} = d_{i, j} / w_{i, j}$ , where $w_{i, j}$ is the analyst-supplied random weight distributed as $Gamma (\hat{a}, \hat{a})$ , as in Section 2.2. Notice that

P (d_{i, j}^{*} \leq c ∣ D) = P (w_{i, j} > d_{i, j} / c ∣ D)

which is also an upper tail probability for a unit-mean Gamma deviate with shape and rate equal to $\hat{a}$ . Comparing to (14), by setting $\hat{a}$ to equal $a_{0} + a_{1}$ , and if $a_{0}$ and $d_{0}$ are relatively small, we find

P (d_{i, j}^{*} \leq c ∣ D) \approx P (δ_{i, j} \leq c ∣ D) .

In other words, the randomized distance procedure is providing approximate posterior draws of the underlying distance matrix. In spite of limitations of this procedure for full Bayesian inference, it provides an elementary scheme to account for uncertainty in subtype allocations. Numerical experiments in the Supplementary Material make comparisons to a full, Dirichlet-process-based, posterior analysis.

Pseudo-code.

graphic file with name nihms-1901161-f0002.jpg

Empirical datasets.

Appendix Table A1.

Data sets used for the empirical study of scDDboost

	Conditions	# cells	Organism	Ref
GSE94383	0 min unstim vs 75 min stim	186, 145	human	Lane et al. (2017)
GSE48968-GPL13112	BMDC (2h LPS stimulation) vs 6h LPS	96, 96	mouse	Shalek et al. (2014)
GSE52529	T0 vs T72	69, 74	human	Trapnell et al. (2014)
GSE74596	NKT1 vs NTK2	46, 68	mouse	Engel et al. (2016)
EMTAB2805	G1 vs G2M	96, 96	mouse	Buettner et al. (2015)
GSE71585-GPL13112	Gad2tdTpositive vs Cux2tdTnegative	80, 140	mouse	Tasic et al. (2016)
GSE64016	G1 vs G2	91,76	human	Leng et al. (2015)
GSE79102	patient1 vs patient2	51, 89	human	Kiselev et al. (2017)
GSE45719	16-cell stage blastomere vs mid blastocyst cell	50, 60	mouse	Deng et al. (2014)
GSE63818	Primordial Germ Cells, developmental stage: 7 week gestation vs Somatic Cells, developmental stage: 7 week gestation	40, 26	mouse	Guo et al. (2015)
GSE75748	DEC vs EC	64, 64	human	Chu et al. (2016)
GSE84465	neoplastic cells vs non-neoplastic cells	1000, 1000	human	Darmanis et al. (2017)

Open in a new tab

Appendix Table A2.

Single-condition data sets used in the random-splitting experiment

Data set	Condition	# cells
GSE63818null	7 week gestation	40
GSE75748null	DEC	64
GSE94383null	T0	186
GSE48968-GPL13112null	BMDC (2h LPS stimulation)	96
GSE74596null	NKT1	46
EMTAB2805null	G1	96
GSE71585-GPL13112null	Gad2tdTpositive	80
GSE64016null	G1	91
GSE79102null	patient1	51

Open in a new tab

Footnotes

Supplementary material (DOI: 10.1214/20-AOAS1423SUPPA; .zip). Additional mathematical and computational results.

R package (DOI: 10.1214/20-AOAS1423SUPPB; .zip). scDDboost version 2.0.

Code and data (DOI: 10.1214/20-AOAS1423SUPPC; .zip). Source code required for figures and tables.

http://github.com/wiscstatman/scDDboost/

REFERENCES

Anders S and Huber W (2010). Differential expression analysis for sequence count data. Genome Biol 11 R106. 10.1186/gb-2010-11-10-r106 [DOI] [PMC free article] [PubMed] [Google Scholar]
Bacher R and Kendziorski C (2016). Design and computational analysis of single-cell RNA-sequencing experiments. Genome Biol 17 63. 10.1186/s13059-016-0927-y [DOI] [PMC free article] [PubMed] [Google Scholar]
Buettner F, Natarajan KN, Casale FP, Proserpio V, Scialdone A, Theis FJ, Teichmann SA, Marioni JC and Stegle O (2015). Computational analysis of cell-to-cell heterogeneity in single-cell RNA-sequencing data reveals hidden subpopulations of cells. Nat. Biotechnol 33155. [DOI] [PubMed] [Google Scholar]
Chen W, Li Y, Easton J, Finkelstein D, Wu G and Chen X (2018). UMI-count modeling and differential expression analysis for single-cell RNA sequencing. Genome Biol 19 70. 10.1186/s13059-018-1438-9 [DOI] [PMC free article] [PubMed] [Google Scholar]
Chu L-F, Leng N, Zhang J, Hou Z, Mamott D, Vereide DT, Choi J, Kendziorski C, Stewart R et al. (2016). Single-cell RNA-seq reveals novel regulators of human embryonic stem cell differentiation to definitive endoderm. Genome Biol 17 173. 10.1186/s13059-016-1033-x [DOI] [PMC free article] [PubMed] [Google Scholar]
Dahl DB (2009). Modal clustering in a class of product partition models. Bayesian Anal 4 243–264. MR2507363 10.1214/09-BA409 [DOI] [Google Scholar]
Darmanis S, Sloan SA, Croote D, Mignardi M, Chernikova S, Samghababi P, Zhang Y, Neff N, Kowarsky M et al. (2017). Single-cell RNA-seq analysis of infiltrating neoplastic cells at the migrating front of human glioblastoma. Cell Rep 21 1399–1410. 10.1016/j.celrep.2017.10.030 [DOI] [PMC free article] [PubMed] [Google Scholar]
Delmans M and Hemberg M (2016). Discrete distributional differential expression (D3E)-A tool for gene expression analysis of single-cell RNA-seq data. BMC Bioinform 17 110. 10.1186/s12859-016-0944-6 [DOI] [PMC free article] [PubMed] [Google Scholar]
Deng Q, Ramsköld D, Reinius B and Sandberg R (2014). Single-cell RNA-seq reveals dynamic, random monoallelic gene expression in mammalian cells. Science 343 193–196. 10.1126/science.1245316 [DOI] [PubMed] [Google Scholar]
Dominguez D, Tsai Y-H, Gomez N, Jha DK, Davis I and Wang Z (2016). A high-resolution transcriptome map of cell cycle reveals novel connections between periodic genes and cancer. Cell Res 26 946. [DOI] [PMC free article] [PubMed] [Google Scholar]
Efron B (2007). Size, power and false discovery rates. Ann. Statist 35 1351–1377. MR2351089 10.1214/009053606000001460 [DOI] [Google Scholar]
Engel I, Seumois G, Chavez L, Samaniego-Castruita D, White B, Chawla A, Mock D, Vijayanand P and Kronenberg M (2016). Innate-like functions of natural killer T cell subsets result from highly divergent gene programs. Nat. Immunol 17 728. [DOI] [PMC free article] [PubMed] [Google Scholar]
Finak G, McDavid A, Yajima M, Deng J, Gersuk V, Shalek AK, Slichter CK, Miller HW, McElrath MJ et al. (2015). MAST: A flexible statistical framework for assessing transcriptional changes and characterizing heterogeneity in single-cell RNA sequencing data. Genome Biol 16 278. 10.1186/s13059-015-0844-5 [DOI] [PMC free article] [PubMed] [Google Scholar]
Guo F, Yan L, Guo H, Li L, Hu B, Zhao Y, Yong J, Hu Y, Wang X et al. (2015). The transcriptome and DNA methylome landscapes of human primordial germ cells. Cell 161 1437–1452. 10.1016/j.cell.2015.05.015 [DOI] [PubMed] [Google Scholar]
Haghverdi L, Buettner F and Theis FJ (2015). Diffusion maps for high-dimensional single-cell analysis of differentiation data. Bioinformatics 31 2989–2998. 10.1093/bioinformatics/btv325 [DOI] [PubMed] [Google Scholar]
Huang M, Wang J, Torre E, Dueck H, Shaffer S, Bonasio R, Murray JI, Raj A, Li M et al. (2018). SAVER: Gene expression recovery for single-cell RNA sequencing. Nat. Methods 15 539–542. 10.1038/s41592-018-0033-z [DOI] [PMC free article] [PubMed] [Google Scholar]
Jensen ST, Erkan I, Arnardottir ES and Small DS (2009). Bayesian testing of many hypotheses × many genes: A study of sleep apnea. Ann. Appl. Stat 3 1080–1101. MR2750387 10.1214/09-AOAS241 [DOI] [Google Scholar]
Kaufman L and Rousseeuw P (1987). Clustering by Means of Medoids North-Holland, Amsterdam. [Google Scholar]
Kendziorski CM, Newton MA, Lan H and Gould MN (2003). On parametric empirical Bayes methods for comparing multiple groups using replicated gene expression profiles. Stat. Med 22 3899–3914. 10.1002/sim.1548 [DOI] [PubMed] [Google Scholar]
Kim T, Chen IR, Lin Y, Wang AY-Y, Yang JYH and Yang P (2019). Impact of similarity metrics on single-cell RNA-seq data clustering. Brief. Bioinform 20 2316–2326. 10.1093/bib/bby076 [DOI] [PubMed] [Google Scholar]
Kiselev VY, Kirschner K, Schaub MT, Andrews T, Yiu A, Chandra T, Natarajan KN, Reik W, Barahona M et al. (2017). SC3: Consensus clustering of single-cell RNA-seq data. Nat. Methods 14 483. [DOI] [PMC free article] [PubMed] [Google Scholar]
Korthauer KD, Chu L-F, Newton MA, Li Y, Thomson J, Stewart R and Kendziorski C (2016). A statistical approach for identifying differential distributions in single-cell RNA-seq experiments. Genome Biol 17 222. 10.1186/s13059-016-1077-y [DOI] [PMC free article] [PubMed] [Google Scholar]
Lane K, Van Valen D, DeFelice MM, Macklin DN, Kudo T, Jaimovich A, Carr A, Meyer T, Pe’er D et al. (2017). Measuring signaling and RNA-seq in the same cell links gene expression to dynamic patterns of NF- κ B activation. Cell Syst 4 458–469.e5. 10.1016/j.cels.2017.03.010 [DOI] [PMC free article] [PubMed] [Google Scholar]
Leng N, Dawson JA, Thomson JA, Ruotti V, Rissman AI, Smits BMG, Haag JD, Gould MN, Stewart RM et al. (2013). EBSeq: An empirical Bayes hierarchical model for inference in RNA-seq experiments. Bioinformatics 29 1035–1043. 10.1093/bioinformatics/btt087 [DOI] [PMC free article] [PubMed] [Google Scholar]
Leng N, Chu L-F, Barry C, Li Y, Choi J, Li X, Jiang P, Stewart RM, Thomson JA et al. (2015). Oscope identifies oscillatory genes in unsynchronized single-cell RNA-seq experiments. Nat. Methods 12 947. [DOI] [PMC free article] [PubMed] [Google Scholar]
Li F and Altieri DC (1999). The cancer antiapoptosis mouse Survivin gene. Cancer Res 59 3143. [PubMed] [Google Scholar]
Lin P, Troup M and Ho JWK (2017). CIDR: Ultrafast and accurate clustering through imputation for single-cell RNA-seq data. Genome Biol 18 59. 10.1186/s13059-017-1188-0 [DOI] [PMC free article] [PubMed] [Google Scholar]
Little AF, Maggioni M and Murphy JM (2017). Path-based spectral clustering: Guarantees, robustness to outliers, and fast algorithms
Love MI, Huber W and Anders S (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol 15 550. 10.1186/s13059-014-0550-8 [DOI] [PMC free article] [PubMed] [Google Scholar]
Ma X, Kendziorski C and Newton MA (2020). EBSeq: Improving mixing computations for multigroup differential expression analysis. BioRxiv 10.1101/2020.06.19.162180 [DOI] [Google Scholar]
Ma X, Korthauer K, Kendziorski C and Newton MA (2021). Supplement to “A compositional model to assess expression changes from single-cell RNA-seq data.” 10.1214/20-AOAS1423SUPPA, , [DOI] [PMC free article] [PubMed] [Google Scholar]
Marioni JC and Arendt D (2017). How single-cell genomics is changing evolutionary and developmental biology. Annu. Rev. Cell Dev. Biol 33 537–553. 10.1146/annurev-cellbio-100616-060818 [DOI] [PubMed] [Google Scholar]
McDavid A, Dennis L, Danaher P, Finak G, Krouse M, Wang A, Webster P, Beechem J and Gottardo R (2014). Modeling bi-modality improves characterization of cell cycle on gene expression in single cells. PLoS Comput. Biol 10 e1003696. [DOI] [PMC free article] [PubMed] [Google Scholar]
Muralidharan O (2010). An empirical Bayes mixture method for effect size and false discovery rate estimation. Ann. Appl. Stat 4 422–438. MR2758178 10.1214/09-AOAS276 [DOI] [Google Scholar]
Navin NE (2015). The first five years of single-cell cancer genomics and beyond. Genome Res 25 1499–1507. 10.1101/gr.191098.115 [DOI] [PMC free article] [PubMed] [Google Scholar]
Nawy T (2013). Single-cell sequencing. Nat. Methods 11 18. [DOI] [PubMed] [Google Scholar]
Newton MA, Noueiry A, Sarkar D and Ahlquist P (2004). Detecting differential gene expression with a semiparametric hierarchical mixture method. Biostatistics 5 155–176. 10.1093/biostatistics/5.2.155 [DOI] [PubMed] [Google Scholar]
Papalexi E and Satija R (2017). Single-cell RNA sequencing to explore immune cell heterogeneity. Nat. Rev., Immunol 18 35. [DOI] [PubMed] [Google Scholar]
Peccoud J and Ycart B (1995). Markovian modeling of gene-product synthesis. Theor. Popul. Biol 48 222–234. 10.1006/tpbi.1995.1027 [DOI] [Google Scholar]
Pierson E and Yau C (2015). ZIFA: Dimensionality reduction for zero-inflated single-cell gene expression analysis. Genome Biol 16 241. 10.1186/s13059-015-0805-z [DOI] [PMC free article] [PubMed] [Google Scholar]
Raj A and Van Oudenaarden A (2008). Nature, nurture, or chance: Stochastic gene expression and its consequences. Cell 135 216–226. 10.1016/j.cell.2008.09.050 [DOI] [PMC free article] [PubMed] [Google Scholar]
Ray S and Turi RH (2000). Determination of number of clusters in K-Means clustering and application in colour image segmentation
Shalek AK, Satija R, Shuga J, Trombetta JJ, Gennert D, Lu D, Chen P, Gertner RS, Gaublomme JT et al. (2014). Single-cell RNA-seq reveals dynamic paracrine control of cellular variation. Nature 510 363. [DOI] [PMC free article] [PubMed] [Google Scholar]
Sohr S and Engeland K (2008). RHAMM is differentially expressed in the cell cycle and downregulated by the tumor suppressor p53. Cell Cycle 7 3448–3460. 10.4161/cc.7.21.7014 [DOI] [PubMed] [Google Scholar]
Soneson C and Robinson MD (2018). Bias, robustness and scalability in single-cell differential expression analysis. Nat. Methods 15 255–261. 10.1038/nmeth.4612 [DOI] [PubMed] [Google Scholar]
Strehl A and Ghosh J (2003). Cluster ensembles—A knowledge reuse framework for combining multiple partitions. J. Mach. Learn. Res 3 583–617. MR1991087 10.1162/153244303321897735 [DOI] [Google Scholar]
Tasic B, Menon V, Nguyen TN, Kim TK, Jarsky T, Yao Z, Levi B, Gray LT, Sorensen SA et al. (2016). Adult mouse cortical cell taxonomy revealed by single cell transcriptomics. Nat. Neurosci 19 335. [DOI] [PMC free article] [PubMed] [Google Scholar]
Trapnell C, Cacchiarelli D, Grimsby J, Pokharel P, Li S, Morse M, Lennon NJ, Livak KJ, Mikkelsen TS et al. (2014). The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells. Nat. Biotechnol 32 381–386. 10.1038/nbt.2859 [DOI] [PMC free article] [PubMed] [Google Scholar]
Wagner U and Taudes A (1986). A multivariate Polya model of brand choice and purchase incidence. Mark. Sci 5 219–244. 10.1287/mksc.5.3.219 [DOI] [Google Scholar]
Yakowitz SJ and Spragins JD (1968). On the identifiability of finite mixtures. Ann. Math. Stat 39 209–214. MR0224204 10.1214/aoms/1177698520 [DOI] [Google Scholar]
Yin G and Ma Y (2013). Pearson-type goodness-of-fit test with bootstrap maximum likelihood estimation. Electron. J. Stat 7 412–427. MR3020427 10.1214/13-EJS773 [DOI] [PMC free article] [PubMed] [Google Scholar]
Zappia L, Phipson B and Oshlack A (2017). Splatter: Simulation of single-cell RNA sequencing data. Genome Biol 18 174. 10.1186/s13059-017-1305-0 [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Supplemental Material

NIHMS1901161-supplement-Supplemental_Material.zip^{(6.2MB, zip)}

[R1] Anders S and Huber W (2010). Differential expression analysis for sequence count data. Genome Biol 11 R106. 10.1186/gb-2010-11-10-r106 [DOI] [PMC free article] [PubMed] [Google Scholar]

[R2] Bacher R and Kendziorski C (2016). Design and computational analysis of single-cell RNA-sequencing experiments. Genome Biol 17 63. 10.1186/s13059-016-0927-y [DOI] [PMC free article] [PubMed] [Google Scholar]

[R3] Buettner F, Natarajan KN, Casale FP, Proserpio V, Scialdone A, Theis FJ, Teichmann SA, Marioni JC and Stegle O (2015). Computational analysis of cell-to-cell heterogeneity in single-cell RNA-sequencing data reveals hidden subpopulations of cells. Nat. Biotechnol 33155. [DOI] [PubMed] [Google Scholar]

[R4] Chen W, Li Y, Easton J, Finkelstein D, Wu G and Chen X (2018). UMI-count modeling and differential expression analysis for single-cell RNA sequencing. Genome Biol 19 70. 10.1186/s13059-018-1438-9 [DOI] [PMC free article] [PubMed] [Google Scholar]

[R5] Chu L-F, Leng N, Zhang J, Hou Z, Mamott D, Vereide DT, Choi J, Kendziorski C, Stewart R et al. (2016). Single-cell RNA-seq reveals novel regulators of human embryonic stem cell differentiation to definitive endoderm. Genome Biol 17 173. 10.1186/s13059-016-1033-x [DOI] [PMC free article] [PubMed] [Google Scholar]

[R6] Dahl DB (2009). Modal clustering in a class of product partition models. Bayesian Anal 4 243–264. MR2507363 10.1214/09-BA409 [DOI] [Google Scholar]

[R7] Darmanis S, Sloan SA, Croote D, Mignardi M, Chernikova S, Samghababi P, Zhang Y, Neff N, Kowarsky M et al. (2017). Single-cell RNA-seq analysis of infiltrating neoplastic cells at the migrating front of human glioblastoma. Cell Rep 21 1399–1410. 10.1016/j.celrep.2017.10.030 [DOI] [PMC free article] [PubMed] [Google Scholar]

[R8] Delmans M and Hemberg M (2016). Discrete distributional differential expression (D3E)-A tool for gene expression analysis of single-cell RNA-seq data. BMC Bioinform 17 110. 10.1186/s12859-016-0944-6 [DOI] [PMC free article] [PubMed] [Google Scholar]

[R9] Deng Q, Ramsköld D, Reinius B and Sandberg R (2014). Single-cell RNA-seq reveals dynamic, random monoallelic gene expression in mammalian cells. Science 343 193–196. 10.1126/science.1245316 [DOI] [PubMed] [Google Scholar]

[R10] Dominguez D, Tsai Y-H, Gomez N, Jha DK, Davis I and Wang Z (2016). A high-resolution transcriptome map of cell cycle reveals novel connections between periodic genes and cancer. Cell Res 26 946. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R11] Efron B (2007). Size, power and false discovery rates. Ann. Statist 35 1351–1377. MR2351089 10.1214/009053606000001460 [DOI] [Google Scholar]

[R12] Engel I, Seumois G, Chavez L, Samaniego-Castruita D, White B, Chawla A, Mock D, Vijayanand P and Kronenberg M (2016). Innate-like functions of natural killer T cell subsets result from highly divergent gene programs. Nat. Immunol 17 728. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R13] Finak G, McDavid A, Yajima M, Deng J, Gersuk V, Shalek AK, Slichter CK, Miller HW, McElrath MJ et al. (2015). MAST: A flexible statistical framework for assessing transcriptional changes and characterizing heterogeneity in single-cell RNA sequencing data. Genome Biol 16 278. 10.1186/s13059-015-0844-5 [DOI] [PMC free article] [PubMed] [Google Scholar]

[R14] Guo F, Yan L, Guo H, Li L, Hu B, Zhao Y, Yong J, Hu Y, Wang X et al. (2015). The transcriptome and DNA methylome landscapes of human primordial germ cells. Cell 161 1437–1452. 10.1016/j.cell.2015.05.015 [DOI] [PubMed] [Google Scholar]

[R15] Haghverdi L, Buettner F and Theis FJ (2015). Diffusion maps for high-dimensional single-cell analysis of differentiation data. Bioinformatics 31 2989–2998. 10.1093/bioinformatics/btv325 [DOI] [PubMed] [Google Scholar]

[R16] Huang M, Wang J, Torre E, Dueck H, Shaffer S, Bonasio R, Murray JI, Raj A, Li M et al. (2018). SAVER: Gene expression recovery for single-cell RNA sequencing. Nat. Methods 15 539–542. 10.1038/s41592-018-0033-z [DOI] [PMC free article] [PubMed] [Google Scholar]

[R17] Jensen ST, Erkan I, Arnardottir ES and Small DS (2009). Bayesian testing of many hypotheses × many genes: A study of sleep apnea. Ann. Appl. Stat 3 1080–1101. MR2750387 10.1214/09-AOAS241 [DOI] [Google Scholar]

[R18] Kaufman L and Rousseeuw P (1987). Clustering by Means of Medoids North-Holland, Amsterdam. [Google Scholar]

[R19] Kendziorski CM, Newton MA, Lan H and Gould MN (2003). On parametric empirical Bayes methods for comparing multiple groups using replicated gene expression profiles. Stat. Med 22 3899–3914. 10.1002/sim.1548 [DOI] [PubMed] [Google Scholar]

[R20] Kim T, Chen IR, Lin Y, Wang AY-Y, Yang JYH and Yang P (2019). Impact of similarity metrics on single-cell RNA-seq data clustering. Brief. Bioinform 20 2316–2326. 10.1093/bib/bby076 [DOI] [PubMed] [Google Scholar]

[R21] Kiselev VY, Kirschner K, Schaub MT, Andrews T, Yiu A, Chandra T, Natarajan KN, Reik W, Barahona M et al. (2017). SC3: Consensus clustering of single-cell RNA-seq data. Nat. Methods 14 483. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R22] Korthauer KD, Chu L-F, Newton MA, Li Y, Thomson J, Stewart R and Kendziorski C (2016). A statistical approach for identifying differential distributions in single-cell RNA-seq experiments. Genome Biol 17 222. 10.1186/s13059-016-1077-y [DOI] [PMC free article] [PubMed] [Google Scholar]

[R23] Lane K, Van Valen D, DeFelice MM, Macklin DN, Kudo T, Jaimovich A, Carr A, Meyer T, Pe’er D et al. (2017). Measuring signaling and RNA-seq in the same cell links gene expression to dynamic patterns of NF- κ B activation. Cell Syst 4 458–469.e5. 10.1016/j.cels.2017.03.010 [DOI] [PMC free article] [PubMed] [Google Scholar]

[R24] Leng N, Dawson JA, Thomson JA, Ruotti V, Rissman AI, Smits BMG, Haag JD, Gould MN, Stewart RM et al. (2013). EBSeq: An empirical Bayes hierarchical model for inference in RNA-seq experiments. Bioinformatics 29 1035–1043. 10.1093/bioinformatics/btt087 [DOI] [PMC free article] [PubMed] [Google Scholar]

[R25] Leng N, Chu L-F, Barry C, Li Y, Choi J, Li X, Jiang P, Stewart RM, Thomson JA et al. (2015). Oscope identifies oscillatory genes in unsynchronized single-cell RNA-seq experiments. Nat. Methods 12 947. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R26] Li F and Altieri DC (1999). The cancer antiapoptosis mouse Survivin gene. Cancer Res 59 3143. [PubMed] [Google Scholar]

[R27] Lin P, Troup M and Ho JWK (2017). CIDR: Ultrafast and accurate clustering through imputation for single-cell RNA-seq data. Genome Biol 18 59. 10.1186/s13059-017-1188-0 [DOI] [PMC free article] [PubMed] [Google Scholar]

[R28] Little AF, Maggioni M and Murphy JM (2017). Path-based spectral clustering: Guarantees, robustness to outliers, and fast algorithms

[R29] Love MI, Huber W and Anders S (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol 15 550. 10.1186/s13059-014-0550-8 [DOI] [PMC free article] [PubMed] [Google Scholar]

[R30] Ma X, Kendziorski C and Newton MA (2020). EBSeq: Improving mixing computations for multigroup differential expression analysis. BioRxiv 10.1101/2020.06.19.162180 [DOI] [Google Scholar]

[R31] Ma X, Korthauer K, Kendziorski C and Newton MA (2021). Supplement to “A compositional model to assess expression changes from single-cell RNA-seq data.” 10.1214/20-AOAS1423SUPPA, , [DOI] [PMC free article] [PubMed] [Google Scholar]

[R32] Marioni JC and Arendt D (2017). How single-cell genomics is changing evolutionary and developmental biology. Annu. Rev. Cell Dev. Biol 33 537–553. 10.1146/annurev-cellbio-100616-060818 [DOI] [PubMed] [Google Scholar]

[R33] McDavid A, Dennis L, Danaher P, Finak G, Krouse M, Wang A, Webster P, Beechem J and Gottardo R (2014). Modeling bi-modality improves characterization of cell cycle on gene expression in single cells. PLoS Comput. Biol 10 e1003696. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R34] Muralidharan O (2010). An empirical Bayes mixture method for effect size and false discovery rate estimation. Ann. Appl. Stat 4 422–438. MR2758178 10.1214/09-AOAS276 [DOI] [Google Scholar]

[R35] Navin NE (2015). The first five years of single-cell cancer genomics and beyond. Genome Res 25 1499–1507. 10.1101/gr.191098.115 [DOI] [PMC free article] [PubMed] [Google Scholar]

[R36] Nawy T (2013). Single-cell sequencing. Nat. Methods 11 18. [DOI] [PubMed] [Google Scholar]

[R37] Newton MA, Noueiry A, Sarkar D and Ahlquist P (2004). Detecting differential gene expression with a semiparametric hierarchical mixture method. Biostatistics 5 155–176. 10.1093/biostatistics/5.2.155 [DOI] [PubMed] [Google Scholar]

[R38] Papalexi E and Satija R (2017). Single-cell RNA sequencing to explore immune cell heterogeneity. Nat. Rev., Immunol 18 35. [DOI] [PubMed] [Google Scholar]

[R39] Peccoud J and Ycart B (1995). Markovian modeling of gene-product synthesis. Theor. Popul. Biol 48 222–234. 10.1006/tpbi.1995.1027 [DOI] [Google Scholar]

[R40] Pierson E and Yau C (2015). ZIFA: Dimensionality reduction for zero-inflated single-cell gene expression analysis. Genome Biol 16 241. 10.1186/s13059-015-0805-z [DOI] [PMC free article] [PubMed] [Google Scholar]

[R41] Raj A and Van Oudenaarden A (2008). Nature, nurture, or chance: Stochastic gene expression and its consequences. Cell 135 216–226. 10.1016/j.cell.2008.09.050 [DOI] [PMC free article] [PubMed] [Google Scholar]

[R42] Ray S and Turi RH (2000). Determination of number of clusters in K-Means clustering and application in colour image segmentation

[R43] Shalek AK, Satija R, Shuga J, Trombetta JJ, Gennert D, Lu D, Chen P, Gertner RS, Gaublomme JT et al. (2014). Single-cell RNA-seq reveals dynamic paracrine control of cellular variation. Nature 510 363. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R44] Sohr S and Engeland K (2008). RHAMM is differentially expressed in the cell cycle and downregulated by the tumor suppressor p53. Cell Cycle 7 3448–3460. 10.4161/cc.7.21.7014 [DOI] [PubMed] [Google Scholar]

[R45] Soneson C and Robinson MD (2018). Bias, robustness and scalability in single-cell differential expression analysis. Nat. Methods 15 255–261. 10.1038/nmeth.4612 [DOI] [PubMed] [Google Scholar]

[R46] Strehl A and Ghosh J (2003). Cluster ensembles—A knowledge reuse framework for combining multiple partitions. J. Mach. Learn. Res 3 583–617. MR1991087 10.1162/153244303321897735 [DOI] [Google Scholar]

[R47] Tasic B, Menon V, Nguyen TN, Kim TK, Jarsky T, Yao Z, Levi B, Gray LT, Sorensen SA et al. (2016). Adult mouse cortical cell taxonomy revealed by single cell transcriptomics. Nat. Neurosci 19 335. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R48] Trapnell C, Cacchiarelli D, Grimsby J, Pokharel P, Li S, Morse M, Lennon NJ, Livak KJ, Mikkelsen TS et al. (2014). The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells. Nat. Biotechnol 32 381–386. 10.1038/nbt.2859 [DOI] [PMC free article] [PubMed] [Google Scholar]

[R49] Wagner U and Taudes A (1986). A multivariate Polya model of brand choice and purchase incidence. Mark. Sci 5 219–244. 10.1287/mksc.5.3.219 [DOI] [Google Scholar]

[R50] Yakowitz SJ and Spragins JD (1968). On the identifiability of finite mixtures. Ann. Math. Stat 39 209–214. MR0224204 10.1214/aoms/1177698520 [DOI] [Google Scholar]

[R51] Yin G and Ma Y (2013). Pearson-type goodness-of-fit test with bootstrap maximum likelihood estimation. Electron. J. Stat 7 412–427. MR3020427 10.1214/13-EJS773 [DOI] [PMC free article] [PubMed] [Google Scholar]

[R52] Zappia L, Phipson B and Oshlack A (2017). Splatter: Simulation of single-cell RNA sequencing data. Genome Biol 18 174. 10.1186/s13059-017-1305-0 [DOI] [PMC free article] [PubMed] [Google Scholar]

PERMALINK

A COMPOSITIONAL MODEL TO ASSESS EXPRESSION CHANGES FROM SINGLE-CELL RNA-SEQ DATA

Xiuyu MA

Keegan Korthauer

Christina Kendziorski

Michael A Newton

Abstract

1. Introduction.

Fig. 1.

2. Modeling.

2.1. Data structure, sampling model and parameters.

Key Issue.

Fig. 2.

Fig. 3.

2.2. Method structure and clustering.

2.3. P(Aπ∣y,z).

2.4. P(Mg,π∣X,z).

3. Numerical experiments.

3.1. Synthetic data.

Fig. 4.

Fig. 5.

3.2. Empirical study.

Fig. 6.

Fig. 7.

Fig. 8.

3.3. Bursting.

Fig. 9.

3.4. Time complexity.

3.5. Diagnostics.

Fig. 10.

Choice of distance function.

4. Asymptotics of the double Dirichlet mixture.

5. Concluding remarks.

Supplementary Material

Acknowledgments.

Funding.

APPENDIX

Randomizing distances for approximate posterior inference.

Pseudo-code.

Empirical datasets.

Appendix Table A1.

Appendix Table A2.

Footnotes

REFERENCES

Associated Data

Supplementary Materials

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases

2.3. $P (A_{π} ∣ y, z)$ .

2.4. $P (M_{g, π} ∣ X, z)$ .