Abstract
Background
Single-cell sequencing (sc-Seq) experiments are producing increasingly large data sets. However, large data sets do not necessarily contain large amounts of information.
Results
Here, we formally quantify the information obtained from a sc-Seq experiment and show that it corresponds to an intuitive notion of gene expression heterogeneity. We demonstrate a natural relation between our notion of heterogeneity and that of cell type, decomposing heterogeneity into that component attributable to differential expression between cell types (inter-cluster heterogeneity) and that remaining (intra-cluster heterogeneity). We test our definition of heterogeneity as the objective function of a clustering algorithm, and show that it is a useful descriptor for gene expression patterns associated with different cell types.
Conclusions
Thus, our definition of gene heterogeneity leads to a biologically meaningful notion of cell type, as groups of cells that are statistically equivalent with respect to their patterns of gene expression. Our measure of heterogeneity, and its decomposition into inter- and intra-cluster, is non-parametric, intrinsic, unbiased, and requires no additional assumptions about expression patterns. Based on this theory, we develop an efficient method for the automatic unsupervised clustering of cells from sc-Seq data, and provide an R package implementation.
Supplementary Information
The online version contains supplementary material available at 10.1186/s12859-023-05424-8.
Background
Advances in single-cell sequencing (sc-Seq) technologies have enabled us to profile thousands of cells in a single experiment [43]. In combination with advances in unsupervised analysis methods, particularly specialised clustering algorithms and dimensionality reduction techniques, these technologies have allowed us to dissect cellular identities in unprecedented detail and discover novel, functionally important, cell types [48]. The goal of most sc-Seq studies (except those focused on methodology development) is to extract biological information, often concerning the mix of cell types present in the tissue sample, from the data obtained. Yet, data is not the same as information; and large, complex, data sets do not necessarily convey useful or usable information. Notably, current single-cell profiling technologies typically produce noisy data for numerous technical reasons, including low capture rate, sparsity due to shallow sequencing, and batch effects [17, 19]. Consequently, the relationship between biological information and sc-Seq data is complex and incompletely understood. There is, therefore, a need for formal, quantitative methods to evaluate this relationship.
To address this challenge, we propose an information-theoretic framework that quantifies the amount of information contained in a sc-Seq data set, and leads to a natural definition of gene expression heterogeneity. Our measure of gene expression heterogeneity decomposes into that which is explained by a given grouping of cells—a proposed clustering into cell types, for example—and that which remains unexplained.
Our framework differs from other approaches to heterogeneity decomposition (e.g. [14, 27]) by formally quantifying the information, in terms of gene expression heterogeneity, gained from a sc-Seq experiment concerning the mix of cell types present. The resulting measure of heterogeneity, and its decomposition into inter- and intra-cluster heterogeneity, is non-parametric, intrinsic, unbiased, and requires no a priori assumptions about gene expression patterns.
Our approach is mathematically precise, biologically intuitive and computationally simple to implement, enabling a practitioner to quickly assess the information content of a sc-Seq data set, in terms of gene expression heterogeneity, identify highly informative genes, and determine the extent to which observed patterns of gene expression variability are explained by the presence of a mixture of cell types in a cellular population. Furthermore, we provide an efficient unsupervised clustering algorithm of cells from sc-Seq data based on heterogeneity and an implementation as an R package.
Results
High-throughput single-cell analysis methods, such as single-cell sequencing, typically view cells as the objects of study and seek to compare cell identities with each other [4, 12, 14, 20, 24, 25, 35, 47]. However, this cell-centric view is less well suited to quantifying gene expression heterogeneity, which is concerned with patterns of variation that arise from the mixing of cell types within a population, and may vary from gene to gene. For instance, variance, the standard measure of heterogeneity in the cell-centric view, may not be informative in multi-modal distributions [39].
In the context of sc-Seq, heterogeneity results from differential expression between cell types. In the simplest case, where one cell type expresses a gene highly and another lowly, such differential expression introduces bimodality in the cell-centric expression distribution. In a bimodal distribution, the population mean is no longer characteristic of either of the two subpopulations, making variance, as a measure of the spread about the mean, also misleading (see Smith and MacArthur [39]).
To meet these challenges, we introduce a novel gene-centric probabilistic view that seeks to more formally specify what is meant by gene expression homogeneity and heterogeneity. We show that this gene-centric view is better suited to quantifying gene expression heterogeneity in the context of a multimodal population, and formalises the biological intuition of expression homogeneity and heterogeneity.
Quantifying gene expression heterogeneity
Consider the expression pattern of a single gene g of interest in a population of N distinct cells. Assume that in total M transcripts of g are identified in the cell population (i.e. across all N cells profiled). Note that M represents the observed transcript count, which may differ from the true count due to technical artefacts. Now consider the stochastic process of assigning the M identified transcripts of gene g to the N cells profiled. Intuitively, the population is homogeneous with respect to expression of g if all the cells are statistically the same with respect to its expression. Mathematically, this means that the M transcripts of g will be assigned to the N cells independently and equiprobably—i.e. each transcript will be assigned to each cell with probability 1/N. Conversely, if the population is heterogeneous with respect to expression of g (that is, it consists of a mix of cell types, each expressing the gene differently), then transcripts will not be assigned uniformly, but rather will be assigned preferentially to distinct subsets of cells. Heterogeneity in experimentally observed patterns of expression can, therefore, be assessed in terms of deviation from this hypothetical homogeneous null model.
The Kullback–Leibler divergence (KLD, also known as the relative entropy) is a measure of the information encoded in the difference between an observed and null distribution [22]. In the sc-Seq context, the KLD of an experimentally observed gene expression distribution from the homogeneous null model, described above, measures the amount of information that is lost by assuming that the gene is homogeneously expressed in the sequenced cell population. We will denote this quantity I(g) and refer to it as the heterogeneity of g (see Fig. 1 for a schematic). Formally,
1 |
where is the fraction of transcripts of gene g expressed by cell i, for each , and H(g) is the entropy of the expression of g in the population (see “Methods” Section for full details). For technical reasons, explained in the “Methods” Section, we will use natural logarithms in all calculations; I(g) therefore has units of nats.
Intuitively, if the cell population is unstructured with respect to the expression of g (i.e. if the cells are approximately interchangeable with respect to the expression of g) then the assumption of homogeneity is correct and . Conversely, the theoretical maximum for I(g) is , which is achieved when and all transcripts of the gene are assigned to the same cell (see Fig. 1). Note that: (1) we do not need to know, or model, the particular expression distribution of g in the population, so no a priori assumptions about expression patterns are required to calculate I(g); (2) I(g) is agnostic concerning missing readings or counts so long as they are distributed uniformly at random; (3) since it quantifies the deviation from the homogeneous null model, I(g) measures the information obtained from the experiment concerning the expression of g.
In practice, I(g) is associated with cellular diversity: the more distinct cell sub-populations present in a sample, and the more those sub-populations differ from one another with respect to their expression of g, the larger I(g) will be. Thus, I(g) is a parsimonious measure of expression heterogeneity that makes minimal assumptions concerning expression patterns and, therefore, imposes minimal technical requirements on data collection methodology or quality. As such, it can be used as the basis for numerous aspects of the sc-Seq pipeline, including feature selection and unsupervised clustering.
To validate these uses, we considered a series of single-cell RNA-sequencing bench-marking data sets: Svensson, a technical control [42]; Tian, a mixture of three cancerous cell lines [46]; Zheng, a set of FACS separated peripheral blood mononuclear cells; Stumpf, a sampling of cells from mouse bone marrow [41]; and the Tabula Muris, a mouse cell atlas with cells from twelve organs [44]. Fig. 2a–e shows plots of I(g) for each gene profiled in these experiments. These plots confirm the intuition that the more distinct types present in a sample, and the more those cell types differ from one another with respect to their gene expression patterns, the greater the population heterogeneity will be, as measured by I(g).
These results indicate that heterogeneity (as defined here) can be used for feature selection. For example, heterogeneously expressed genes can be identified by 1) ordering them in descending ordering by I(g), and selecting the top n (where n is user-determined) or 2) finding those genes for which I(g) greater than some user-defined threshold (note that , with indicating the gene g is expressed in approximately half of cells).
As a measure for use in feature selection, I(g) identifies gene sets largely distinct from those based on Highly Variable Gene selection (the mean overlap of selected genes—for listed data sets excluding Svensson—was 0.36, with the number of genes selected determined by scran, based on an false discovery rate threshold of 0.05). For instance, I(g) identifies Hbb-bs, a marker of erythrocyte maturation, as informative in the Stumpf et al. [41] data set (high I(g) value), whereas scran does not, instead attributing the observed variance to technical sources (see Additional file 1: Fig. S1)[27]. Critically, information-theoretic heterogeneity is more mathematically precise and computationally simpler to implement than variance-based approaches, making no distributional assumptions (e.g. use of the negative binomial model) and having no free parameters to fit. Furthermore, it may be easily modified to account for the presence of multiple homogeneous cell subpopulations, and thereby act as a measure of cluster quality.
Inter-cluster heterogeneity
Let S be a discrete clustering of a population of cells—i.e. an assignment of the cells to a finite set of C non-intersecting sub-populations (also known as clusters). In the “Methods” Section, we show that I(g) can be decomposed into two parts: one part that quantifies the extent to which each of the sub-populations defined by S deviate from homogeneity, which we call the intra-cluster heterogeneity and write as ; and one part that quantifies how differently the gene is expressed, in aggregate, between sub-populations, which we call the inter-cluster heterogeneity and write as . Namely, we show that
2 |
for any proposed clustering S. Thus, the information obtained from an experiment concerning the expression of a gene g can be explicitly related to both local (within cluster) and global (between cluster) patterns of variation, for any proposed clustering. Full mathematical details of this decomposition, including formulae for and , are provided in the “Methods” Section, and an illustration is given in Fig. 3. We also provide an R package for its calculation (see “Methods” Section).
This decomposition can be used to assess cluster quality. For a proposed cluster to meaningfully represent a cell type, it must be associated with differential expression of some subset of marker genes. Because measures the extent to which the expression of gene g deviates from the homogeneous null model it is a simple measure of the extent to which g is differentially expressed between the clusters defined by S. We therefore expect that biologically meaningful clustering based on a marker gene g will result in a high value for and (by necessity) a low value for .
This expectation is confirmed for the Tian, Zheng and Stumpf data sets for which we possess high-confidence a priori cell type annotations, derived either experimentally (for the Tian and Zheng data) or from expert annotation of a computational clustering (for the Stumpf data). In the set of genes most likely to be differentially expressed between clusters in these data sets (taken to be the top 500 genes by I(g) in each case) the majority are associated with substantially higher inter-cluster heterogeneity than expected by chance (i.e., in randomly permuted clusters). Indeed, in Tian 97.0%, in Zheng 94.4% and in Stumpf 99.6% of the tested genes had significantly higher inter-cluster heterogeneity than expected (one-sided exact test with permutations, , false discovery rate corrected for 500 trials; see Fig. 2f–h for illustration) [2, 8].
These results indicate that by simple information-theoretic reasoning, we are able to quantify the amount of information in the expression pattern of a single gene that is explained by a given clustering. However, a key strength of high-throughput single-cell profiling methods is that they allow the simultaneous profiling of thousands of genes in large cell populations.
Our information-theoretic reasoning may be extended to a high-throughput single-cell profiling experiment by assuming that each gene is an independent source of information and making use of the fact that information from independent sources is additive [37]. Thus, we can determine the total information explained by a given clustering S by evaluating the sum over all genes profiled. The total information explained by S is a simple, easily computed measure for cluster evaluation that favours grouping of cells into homogeneous (with respect to gene expression) sub-populations and is maximised at if and only if the proposed clustering divides the population into indivisible perfectly homogeneous sub-populations and thereby accounts for all the heterogeneity contained in the sc-Seq data set. If so, then S is the maximum entropy partition of the cell population into distinct classes, and may therefore be considered as the most parsimonious way of assigning cell identities.
To illustrate the association between total information explained, , and cluster quality, we again considered the established annotations of the Tian, Zheng and Stumpf data sets. As expected, in all cases the observed value of significantly exceeds that of all random label permutations (see Fig. 2i indicating that is strongly associated with cluster quality and that these benchmark cell annotations are associated with homogeneous cell sub-populations. Notably, this association holds true independently of the method used to annotate cell identities (annotations were derived from genotypic information for the Tian data, surface protein expression for the Zheng data, and unsupervised clustering for the Stumpf data), indicating that provides a methodology agnostic, parameter-free, means to quantify cluster quality.
Unsupervised clustering
Building on this empirical association, we investigate the clustering of cells that maximises (and necessarily minimises ) as a reasonable approximation to (homogeneous) cell type. That is, this is the cell clustering that explains the most of the gene expression heterogeneity as inter-cluster variability.
We implement -maximisation as the objective function of an unsupervised clustering method, which we call scEC (single-cell Entropic Clustering). We show that our method is comparable to the state-of-the-art Louvain method, and in doing so, validate the underlying definition of heterogeneity and homogeneous cell type.
Namely, we compare the clustering returned by scEC and the established annotations for the Tian, Zheng and Stumpf data sets. The resulting scEC clusterings agreed strongly with the established cell annotations, achieving adjusted Rand Indices (ARI) of 0.99 for the Tian annotations, 0.87 for the Zheng annotations, and 0.69 for the Stumpf annotations (ARIis a measure of similarity of a proposed clustering to a known clustering, Rand [33]). To benchmark these results we also repeated the analysis using the Louvain method, a leading single-cell clustering algorithm, which achieved ARIs of 0.99, 0.99 and 0.35 for the Tian, Zheng, and Stumpf data sets respectively [3, 9, 26, 40].
The scEC objective function, , is naturally built by summation of gene-level measurements (see “Methods” Section), so can be used to directly identify the key drivers of a given clustering without relying on post-hoc differential gene expression testing. To demonstrate this feature, we compared the inter-cluster heterogeneity values (i.e., the values for each of the genes profiled) for clusters identified by scEC and by established annotations for the Tian & Stumpf data sets (see Fig. 4).
We found that the values of associated with scEC clusters are strongly positively correlated with those associated with established annotations for both the Zheng and Stumpf data sets (Pearson’s correlation coefficient of 0.94 for Zheng and 0.99 for Stumpf; see Fig. 4). Moreover, the key marker genes in each data set were found to have significantly different expression levels across clusters, as measured by (one-sided permutation test on clustering labels, permutations, , false discovery rate corrected; marker genes for Zheng: CD14, CD4, CD8 and NCAM1, Zheng et al. [52]; and Stumpf: Cd34, Kitl, Spi1, Gata1 and Pax5, Stumpf et al. [41]).
This concordance indicates that scEC is generally able to recapitulate known cell clusters in an unbiased and parameter free way. However, despite this concordance, there were differences between scEC clusters and established annotations. To investigate these differences further we selected those genes for which the value of differed substantially () between scEC clusters and established annotations in each data set for further investigation.
In the Zheng data, we found that those genes with expression heterogeneity better explained by the scEC clustering were enriched for cell-cycle related genes (gene ontology enrichment, , false discovery rate corrected). Conversely, those genes with expression heterogeneity better explained by the established annotation were enriched for genes involved in the immune response (, false discovery rate corrected) [1, 2, 10].
In the Stumpf data set, we found that those genes with expression heterogeneity better explained by the scEC clustering were enriched for genes involved in erythroblast differentiation and homeostasis ( and respectively, false discovery rate corrected). Conversely, those genes with expression heterogeneity better explained by the established annotation were enriched for genes involved in the immune response (, false discovery rate corrected). In this example, scEC generally identifies additional erythrocyte sub-populations while merging the cell sub-types of the neutrophil lineage (see Fig. 2j for illustration and Additional file 1: Table S1 for contingency table of cluster assignments).
Collectively, these differences reflect the preference of scEC for cellular annotations driven by biological processes involving larger tranches of genes. Depending on circumstance, this preference may be a benefit or a drawback. It is generally beneficial because it ensures that scEC annotations are robust to outliers (i.e. anomalous expression patterns of individual genes cannot easily distort scEC cluster assignments). Conversely, in circumstances in which key cellular functions are determined by a small number of genes, scEC may fail to disambiguate important cell populations from similar cell types. However, given the simplicity of the scEC methodology this issue can this issue can be relatively easily addressed, by adjusting the prior, as follows. In the standard formulation of scEC each gene contributes equally to the objective function . This corresponds to assuming a uniform prior. However, alternatives may be easily considered that take into account different sources of prior knowledge. In general, each gene can contribute in a weighted way to the overall inter-cluster heterogeneity. In this case the objective becomes , where w(g) is the weight of gene g. These weights can encode known biology—for example, by up-weighting known drivers of cellular identities, down-weighting genes associated with housekeeping processes and/or the cell cycle, or by weighting genes in inverse proportion to the size (i.e. the number of associated genes) of their leading ontology term.
While incorporating such priors into scEC may be beneficial, defining them is a challenge and potentially introduces a source of bias into the resulting clustering and so should be approached with care. An alternative approach—which encodes prior information in a data-led, rather than algorithmic way—is to use a reference data set, such as a cell atlas, to benchmark cell annotations (these atlases can serve as reference transcriptomes, in analogy to reference genomes).
Doing so leads to a semi-supervised version of scEC, wherein we cluster a novel test data set based on the known clustering of a given reference data set. We detail the mathematics of semi-supervised clustering in the “Methods” Section, but, in short, the extension to the semi-supervised setting requires no additional mathematical machinery. Such mathematical simplicity affirms information theory as providing a mathematically precise and biological intuitive framework for cellular clustering.
As an example, we cluster the Tian data set based on a distinct data set containing the same three cell lines (plus two additional cell lines) [46]. The semi-supervised clustering strongly agreed with the established cell annotation, achieving an ARI of 0.90. Notably, the ARI was reduced compared with the unsupervised version due to the assignment of of cells to cell types present only in the reference data set, a disadvantage of semi-supervised (and supervised) methods.
Benchmarking and imputation
We benchmark scEC and state-of-the-art methods against a series of data sets consisting of cells sampled from three or five distinct cancerous cell lines and sequenced using various technologies (CEL-Seq2, Sort-seq, 10x, Drop-seq) [15, 28, 31, 46]. We compare each method’s clustering against ground truth for each data set (by adjusted rand index), finding that scEC performs subpar to the best of existing methods [13, 21, 27, 34, 40]. Investigating the gene-wise contributions via reveals that scEC-derived clusterings diverge from the ground truth due to genes expressed in small numbers of cells, e.g., SPINK6 (difference in from ground truth of 0.23 nats; expressed in of cells in the five-cell line, 10X data set; see Fig. 4c).
To protect against low cell-count genes, we adopt a data-diffusion step, imputing the expression value of each cell as a weighted average of itself and a small subset of cells with similar expression patterns (i.e., diffusion of observed expression values over a shared nearest-neighbours graph derived from the top 500 genes by I(g), see “Methods” Section) [49]. The smoothing layer shares information between genes, lessening scEC’s sensitivity to the expression pattern of individual genes; this raises the performance of scEC clustering to be on par with the state-of-the-art (see Fig. 5a).
We further benchmark scEC (with and without the imputation step) as a feature selection tool, evaluating the ability of I(g) to a priori identify differentially expressed genes, assuming differential expression to be the ground truth of feature selection (differentially expressed genes are identified per data set via Wilcoxon Rank Sum testing, selecting those genes with an FDR-adjusted in at least one cluster; Stuart et al. [40]). We find that, without smoothing, scEC is comparable to the state-of-the-art; with smoothing, scEC notably improves, substantially bettering the state-of-the-art in selecting differentially expressed genes from the benchmark data sets (see Fig. 5b). Notably, relatively few genes remain heterogeneous after the imputation step, so the imputed I(g) identifies a much smaller number of genes than other methods, where an arbitrary cut-off in selected gene number is required.
Conclusion
Traditional unsupervised clustering methods for single cell data are based on the, often implicit, assumption that cell types correspond to regions of high probability density in the joint gene expression distribution [11]. Although useful, this is not a specifically biological assumption (similar assumptions form the basis of numerous clustering algorithms in other different disciplines) and does not have a clear biological basis. Indeed, this assumption does not naturally accord with biological intuition that distinct cell populations should be ‘homogeneous’ and cells of the same type should be functionally interchangeable. Here, we have taken a different approach that seeks to encode biological intuition about population homogeneity in a mathematical formulation, drawing on tools from information theory.
We have proposed a formal measure of gene expression heterogeneity and shown that this measure captures biologically relevant heterogeneity arising from differential expression between established cell types. We formalised the additive decomposition of heterogeneity with respect to a given grouping of cells (in “Methods” Section), and tested the association between high inter-cluster heterogeneity and grouping by cell type. Finally, we used this measure as a basis of a biologically-motivated clustering procedure, which we call single-cell entropic clustering (scEC), which identifies homogeneous cell types. The scEC method is free of tunable hyper-parameters and its performance is comparable to a state-of-the-art clustering algorithm on a series of benchmark sc-Seq data sets, suggesting that this underlying biological basis is justified.
While our method represents a mathematically rigorous definition of heterogeneity, the stochasticity inherent in sc-Seq data does limit our resolution. For instance, nearly every gene is associated with some non-zero level of heterogeneity and, over thousands of genes, even low levels of stochastically-induced heterogeneity will accumulate (a version of the so-called ‘curse of dimensionality’). This limits our ability to directly interpret the genome-wide heterogeneity (e.g. the total heterogeneity of the technical control Svensson exceeds that of Tian, both absolute and per gene, although the trend reverses when excluding those genes with fewer than 100 total transcripts).
Aside from these technical limitations, our approach demonstrates the benefit of a biologically grounded, mathematically formal approach to understanding sc-Seq data. As these data sets grow in complexity and size, there will be a greater need for biological interpretability and mathematical rigour in analysis. We suggest information theory as the appropriate mathematical language for describing gene expression and its heterogeneity. Indeed, we anticipate that information theory will become an established part of the quantitative biologist’s toolkit.
Methods
Data collection
Count matrices for each data set were downloaded from their respective repositories (see Availability of Data and Materials). For the Zheng and Tabula Muris data sets, the count matrices of the various cell types/tissues were concatenated using the Matrix (v1.3-2) package in R (v4.0.3) [32].
Data pre-processing
For each data set, those genes with less than 100 total transcripts in total (i.e. across all cells) were excluded from further analysis. For the Zheng data set, FACS identities were taken from the repository metadata, with the CD4+ Helper T-cells, CD4+/CD25+ Regulatory T-cells, CD4+/CD45RA+/CD25- Naïve T cells, CD4+/CD45RO+ Memory T-cells, CD8+ Cytotoxic T cells and CD8+/CD45RA+ Naïve Cytotoxic T-cells merged into the single identity of T-cells.
Normalisation
Since sc-Seq data is noisy, rather than working with the experimentally observed proportions of transcripts assigned to each cell, we instead adopt a Bayesian approach to estimate the cellular frequencies using the James–Stein estimator, as explained below, and use this estimator in all subsequent calculations.
We write for the number of transcripts of gene g associated with cell i in a population of N cells. Let be the total number of transcripts of gene g, and the fraction of transcripts of gene g expressed by cell i, for each . Thus, and we can think of as the probability that a randomly selected transcript of gene g is associated with cell i. To take under-sampling into account, we adjust the observed probabilities as follows (let us drop the superscript g for simplicity). We define as the discrete random variable on the set with probabilities
3 |
where is the shrinkage intensity given by
4 |
which is the James–Stein estimator, with the uniform distribution as the shrinkage target [16]. Shrinkage approaches deal with under-sampling and so are able to correct for the substantial sparsity observed in single-cell RNA-sequencing data. The shrinkage probabilities given in Eq. 3 are a compromise between the observed probabilities , which are unbiased but with a high variance, and the target uniform probabilities 1/N, which are biased but with low variance [6, 16].
Feature selection
The Shannon entropy of X is, by definition,
5 |
By convention, we assume that . For the logarithm, we take base e for ease in differentiation (see later) rather than the usual base 2 (recall that for all ), and thus H(X), and I(g) below, are measured in nats.
The Shannon entropy is a measure of the uncertainty in the outcomes of the random variable X. It has a minimum value of zero, when for some i (e.g. if the gene g is expressed in only one cell of the population) and has a maximum value of , when for all i (e.g. if the gene is uniformly expressed in the cell population). The entropy may therefore be considered as a measure of the homogeneity of expression of the gene g in the cell population profiled (note that the association of high entropy with homogeneity is counter to the usual intuition: it occurs because we are concerned with the entropy of the generative process of assigning transcripts to cells, rather than transcript count distribution in the cell population, as is more usual. A high entropy transcript assignment process gives rise to a sharp transcript distribution, hence its association with homogeneity). By contrast, the quantity also ranges between zero and , yet is minimised when the gene is homogeneously expressed and so is a simple measure of expression heterogeneity, which we will denote I(g). We can rewrite this as
6 |
The Kullback–Leibler divergence (KLD), or relative entropy, is a measure of the information encoded in the difference between two probability distributions [22]. The relative entropy of a discrete probability distribution from a discrete probability distribution is, by definition,
7 |
with the provision that implies , and the convention that . From this definition and Eq. 6, it is clear that our measure of heterogeneity is simply the relative entropy of the observed expression distribution from the uniform distribution U. Thus,
8 |
where U denotes to the uniform distribution on the set with probabilities for all .
Cluster-level heterogeneity
A crucial property of the relative entropy is that it is additively decomposable with respect to arbitrary groupings [38]. Informally, this means that if we have a clustering of the cells into disjoint groups, then I(g) can be reconstructed from inter-cluster and intra-cluster heterogeneities. Next, we formalise this additive decomposition and give a self-contained derivation (cf. Theil [45]).
Consider a clustering that unambiguously assigns each cell in the sample into one of C non-intersecting sub-populations of sizes . Note that , the total number of cells. Let be the fraction of transcripts associated with cells in sub-population , adjusted by the shrinkage estimator. That is,
9 |
with defined by Eq. 3. This gives another discrete random variable Y with probability distribution on the set . For each , we can also assess the heterogeneity of the sub-population by considering the random variable with probability distribution on the set . Note that
10 |
so is a random variable on the set (cluster) .
We may rewrite I(g) in terms of Y and , as follows:
11 |
12 |
13 |
14 |
15 |
16 |
17 |
Expression A may be rewritten as:
18 |
19 |
20 |
21 |
This is the relative entropy of Y from the uniform distribution in which for . Since is the proportion of transcripts assigned to the cluster , it measures the deviation from the assumption that the clusters are homogeneous in their expression of g (i.e. each cluster expresses g at the same level). Since it is a measure of the extent to which the population deviates from homogeneity between clusters, we will term this contribution the inter-cluster heterogeneity of g with respect to the clustering S, denoted . Informally, it is a measure of the extent to which the gene g is differentially expressed between clusters.
Expression B may be rewritten as:
22 |
23 |
24 |
25 |
26 |
This is the weighted sum of the relative entropies of the empirical distributions (i.e. the observed gene expression distribution in group ) from the uniform distribution on (in which for each ). It is the deviation from the assumption that the population consists of a mixture of homogeneous sub-populations according to the clustering S (where the expectation is taken with respect to the probability measure provided by Y). Since it is a measure of the expected extent to which the proposed sub-populations deviate from homogeneity within clusters, we will term this contribution the intra-cluster heterogeneity of g with respect to S, denoted .
Taken together, these results show that I(g) can be decomposed into two well-defined parts that encode local and global properties of the expression distribution of g with respect to a given clustering :
27 |
The relative entropy is always non-negative and hence so are and for any S. Thus, both quantities range from zero to I(g). If S places one cell in each cluster (i.e. ) then for all k and thus . Conversely, if S places all cells in one cluster (i.e. ) then and thus . In this case, the population heterogeneity is equivalent to the intra-cluster heterogeneity of the trivial clustering.
Unsupervised clustering
The above measures of heterogeneity can be easily extended from one gene to many. In this case, the homogeneous null model is obtained by assuming that each gene is expressed homogeneously and independently. Because entropy from independent sources is additive [7], the total heterogeneity of a single-cell RNA-sequencing data set under a given clustering is given by the sum:
28 |
Here and represent the total inter-cluster and intra-cluster heterogeneity of the data with respect to a clustering .
Naturally, we want to identify a clustering (e.g. an assignment of cells to types) that produces maximally homogeneous sub-groups and thereby explains most of the expression heterogeneity for most genes in terms of inter-cluster heterogeneity. Mathematically, this means we want to find a clustering S that maximises or, equivalently, minimises . A brute force approach is not feasible, as the number of partitions of a set with n elements into two or more subsets grows exponentially with n.
To approach the problem we therefore converted the problem from one of assigning discrete identities (i.e., a hard clustering problem) to one of assigning continuous identities (i.e., a fuzzy clustering problem). The advantage of this approach is that it defines the problem in terms of a continuous, differentiable objective function which can be approached with more efficient optimisation routines.
Fuzzy clustering
We adopt a fuzzy conception of clustering in which cells are assigned to C (possibly) overlapping, fuzzy clusters, . Each cell i has C corresponding membership functions for , which assess the probability that cell i belongs to cluster . Thus,
29 |
Our information-theoretic framework can be adapted to this fuzzy setting. Population heterogeneity, I(g), is independent of the chosen clustering, so it remains unaffected by the adoption of fuzzy clusters, i.e. the total gene expression heterogeneity is unaffected by the choice of discrete or fuzzy cluster memberships. For the calculations of inter-cluster and intra-cluster heterogeneities, we extend the discrete random variables and to the fuzzy setting, where now measures the expression distribution of the gene g across the C fuzzy clusters, and measures the expression distribution of the gene g within fuzzy cluster , as follows.
We define the discrete random variable on the set of fuzzy clusters with probabilities given by
30 |
We also define, for each , a discrete random variable on the set of cells, , with probabilities given by,
31 |
We can rewrite I(g) (which is independent of the clustering) in terms of and (which depend on the clustering) as
32 |
where,
33 |
and
34 |
Note that Eq. 33 is the entropy of the random variable , however is not quite the entropy of (there is a missing factor inside the logarithm).
From Eq. 32, we can obtain a second decomposition,
35 |
where
36 |
37 |
38 |
Here H(g) and h(g) represent the inter-cluster heterogeneity and intra-cluster heterogeneity of gene g with respect to the fuzzy clustering S. Equation 36 is also the KLD of the distribution of from the uniform distribution given by for (note that ). Similarly, Eq. 37 is the KLD of the distribution of from the uniform distribution given by for .
Either or can serve as our objective function, f, for optimisation: optimising either quantity simultaneously identifies the clustering with the greatest heterogeneity between clusters (i.e. the greatest differential gene expression between clusters) and the least heterogeneity within clusters (i.e. in which the cells within each cluster are closest to being interchangeable with respect to their patterns of gene expression).
Choosing inter-cluster heterogeneity , we solve the optimisation problem
39 |
where
40 |
using the limited-memory, box constrained BFGS (L-BFGS-B) optimisation algorithm from the Python3 (v3.8.2) package SciPy (v1.5.3) [5, 50, 51, 53].
The L-BFGS-B algorithm is an efficient non-linear local optimisation method developed for solving large, dense problems such as this [53] but its speed and the quality of the solution produced depends on the availability of an explicit formulation of the gradient of the objective function. Therefore, we derive the required gradient by partially differentiating the total inter-cluster heterogeneity, , with respect to fuzzy cluster membership, i.e. the membership functions ().
Beginning by differentiating individual elements of f with respect to , from Eqs. 30 and 38, we have
41 |
42 |
Using the product, chain and quotient rules, we can differentiate the g-summand, call it , of the objective function f:
43 |
44 |
45 |
46 |
47 |
48 |
49 |
The membership functions are not independent, since for all i, by Eq. 29. To incorporate this constraint, and the constraint for all i, k, we introduce variables given by
50 |
and make the objective function f a function of the . By the chain rule,
51 |
Determining , let us write ; then, Eq. 50 can be written as . By the quotient rule,
52 |
Numerical optimisation
The objective function , as formulated in Eq. 40, was optimised with respect to cluster memberships, using the L-BFGS-B optimisation algorithm from the Python3 (v3.8.2) package SciPy (v1.5.3) [5, 50, 51], with random initial starting points (initial cluster memberships were defined by random sampling of a uniform distribution centred on zero). The gradient function, as formulated in Eq. 51, was supplied to the optimisation routine.
Because initial conditions are randomly generated, and the L-BFGS-B algorithm only identifies local optima, the specific clustering found by the algorithm may depend on the choice of initial vector [5, 53]. To make the implementation robust a multi-start process is adopted, in which the optimisation is repeated multiple times with different initial vectors, and the clustering S with the greatest inter-cluster heterogeneity is chosen.
Permutation testing
Annotations for the Tian, Zheng and Stumpf data were taken from the respective repositories’ metadata. Random clusterings were generated by randomly permuting the established annotations 10,000 times each. Comparison of true and shuffled annotations was formulated as an exact one-side hypothesis test, generating exact p-values [8]. We control the false discovery rate arising from multiple testing for each data set using the R function p.adjust [2].
Clustering comparison
We compare our clustering results to two standard clustering methods: Seurat (v3) clustering (Louvain community detection), and UMAP projection, both with default parameters, with the exception of the resolution, as described in Stuart et al. [40] [3, 29]. The Adjusted Rand Index (ARI) was calculated for each data set using the associated function in the R package mclust (v5.4.7) [33, 36].
Semi-supervised classification
Consider a pair of count matrices, , where one count matrix, which we term the reference, , has a known cluster structure (and therefore each cell has a unique, discrete cell type classification). The goal of reference mapping is to classify cells in the test matrix, , based on cellular classifications of the reference matrix.
We normalise each count matrix separately as described in Normalisation, using the James–Stein-type estimator so that [16]. We assume that the same set of genes are profiled in and and derive the normalised combined count matrix, , via the weighted concatenation:
53 |
where is the number of cells in the reference data set, is the number of cells in the unclassified data set, and N is the total number of cells across both data sets.
As in the unsupervised setting, we cluster the combined data set, , by maximising of the combined data set with respect to a fuzzy clustering S, see Eq. 56. However, unlike in the unsupervised setting, a subset of cellular cluster memberships are known a priori. Let be the discrete clustering of the reference data set, with cluster sizes , where . As the cluster identities of cells originating from the reference data set are fixed, each cluster in the reference data set constitutes a subset of a cluster in the mixed data set, , where is the clustering of the mixed data set. Note that number of clusters in the reference and in the combined data set is the same, i.e. C.
Based on the co-mixing of known and unknown cellular identities, and of the combined data set are given by
54 |
55 |
Based on these formulations of and , the definition of follows as before,
56 |
Similarly, the derivative of follows as before, except that the membership function, , is only defined with respect to cells of the test data set, & ,
57 |
Numerical optimisation follows as in the unsupervised setting, using the L-BFGS-B algorithm, except that the known cellular identities of the reference mean that only a single run of the algorithm is required, i.e. no multiple starts.
Imputation
We rely on a cell-by-cell adjacency matrix for imputation, encoding which cells have similar gene expression profiles. Specifically, we use the shared-nearest-neighbours matrix produced by Seurat, using default settings [40].
We begin with the expression matrix X, keeping only the top G genes by I(g) (default is 500). Then, briefly, the Seurat method follows the following steps: data is scaled and normalised using a variance stabilising transform; the transformed data then undergoes principal components, and a cell-cell Euclidean distance matrix is calculated on the first 10 principal components. Next, a k-nearest-neighbours graph is constructed, where k is by default 20; a shared nearest-neighbours graph is constructed by taking the Jaccard Index of the overlap in neighbourhoods of each pair of cells [18].
From this adjacency matrix, A, we compute the imputed data via a one-hop smoothing operator:
58 |
59 |
where Q is the row-normalised stochastic matrix, i.e. , is the g-column of X, and are the imputed values for gene g [30]. The imputed expression value of gene g in cell i is a weighted mean of observed expression values in cell i and its neighbourhood in A. This is an example of a data diffusion imputation, first applied to Sc-seq data in Van Dijk et al. [49].
The vector can then be used in place of in any of the above described work. In the main text, we select the top 500 genes by I(g) based on , before clustering on . Accordingly, use of the imputed expression matrix requires two rounds of feature selection: first in the construction of the adjacency matrix, A, then for the use of in clustering.
For the benchmarking of the imputation, we recapitulated the set of data sets and methods tested in Tian et al. [46]. However, we were unable to rerun RCA [23].
Supplementary Information
Acknowledgements
We thank P. Stumpf, J. Egan and B. Kitching-Morley for helpful discussions.
Abbreviations
- ARI
Adjusted rand index
- KLD
Kullback–Leibler divergence
- scEC
Single-cell entropic clustering
- scSeq
Single-cell sequencing
Author contributions
Conceptualization, MJC, RSG, BDM; Software, MJC; Investigation, MJC, JF, RJSG and BDM; Writing—original draft, MJC, RSG and BDM; Writing—review and editing, MJC, JF, RSG and BDM; Visualization, MJC; Supervision, RSG and BDM. All authors have seen and approved the manuscript.
Funding
RSG and BDM were supported by The Alan Turing Institute under the EPSRC Grant EP/N510129/1. RSG was supported by the Alan Turing Institute—Roche strategic partnership under the project ‘Structured missingness in heterogeneous data’.
Availability of data and materials
The Svensson data was downloaded as file “svensson_chromium_control.h5a” from https://data.caltech.edu/records/1264 The Tian data sets were downloaded from https://github.com/LuyiTian/sc_mixology The Zheng data was downloaded as set of files “Gene / cell matrix (filtered)” in section “Single Cell 3’ Paper: Zheng et al. 2017” from https://support.10xgenomics.com/single-cell-gene-expression/datasets The Stumpf data was downloaded as file “RData.zip” from http://dx.doi.org/10.17632/csvm3kpkxd.1 The Tabula Muris data was downloaded as set of files “Single-cell RNA-seq data from microfluidic emulsion (v2)” from https://tabula-muris.ds.czbiohub.org
Code availability
An R package for the implementation of the described methods is available on github https://github.com/mjcasy/scEC
Declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Footnotes
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Contributor Information
Rubén J. Sánchez-García, Email: R.Sanchez-Garcia@soton.ac.uk
Ben D. MacArthur, Email: bdm@soton.ac.uk
References
- 1.Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, et al. Gene ontology: tool for the unification of biology. Nate Genet. 2000;25(1):25–29. doi: 10.1038/75556. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2.Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc: Ser B (Methodol) 1995;57(1):289–300. [Google Scholar]
- 3.Blondel VD, Guillaume J-L, Lambiotte R, Lefebvre E. Fast unfolding of communities in large networks. J Stat Mech: Theory Exp. 2008;2008(10):P10008. doi: 10.1088/1742-5468/2008/10/P10008. [DOI] [Google Scholar]
- 4.Brennecke P, Anders S, Kim JK, Kołodziejczyk AA, Zhang X, Proserpio V, Baying B, Benes V, Teichmann SA, Marioni JC, et al. Accounting for technical noise in single-cell rna-seq experiments. Nat Meth. 2013;10(11):1093–1095. doi: 10.1038/nmeth.2645. [DOI] [PubMed] [Google Scholar]
- 5.Byrd RH, Lu P, Nocedal J, Zhu C. A limited memory algorithm for bound constrained optimization. SIAM J Sci Comput. 1995;16(5):1190–1208. doi: 10.1137/0916069. [DOI] [Google Scholar]
- 6.Chan TE, Stumpf MP, Babtie AC. Gene regulatory network inference from single-cell data using multivariate information measures. Cell Syst. 2017;5(3):251–267. doi: 10.1016/j.cels.2017.08.014. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Cover TM, Thomas JA. Elements of information theory. Wiley; 2012. [Google Scholar]
- 8.Fisher R. Statistical methods for research workers. Gyan Books; 2017. [Google Scholar]
- 9.Freytag S, Tian L, Lönnstedt I, Ng M, Bahlo M. Comparison of clustering tools in r for medium-sized 10x genomics single-cell RNA-sequencing data. F1000Research. 2018;7:1297. doi: 10.12688/f1000research.15809.1. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.Consortium Gene Ontology. The gene ontology resource: enriching a gold mine. Nucl Acids Res. 2021;49(D1):D325–D334. doi: 10.1093/nar/gkaa1113. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11.Greulich P, Smith R, MacArthur BD. The physics of cell fate. In: Levine H, Jolly MK, Kulkarni P, Nanjundiah V, editors. Phenotypic switching. NewYork: Academic Press; 2020. pp. 189–206. [Google Scholar]
- 12.Grün D, Kester L. Van, Oudenaarden A. Validation of noise models for single-cell transcriptomics. Nat Meth. 2014;11(6):637–640. doi: 10.1038/nmeth.2930. [DOI] [PubMed] [Google Scholar]
- 13.Grün D, Lyubimova A, Kester L, Wiebrands K, Basak O, Sasaki N, Clevers H. Van, Oudenaarden A. Single-cell messenger RNA sequencing reveals rare intestinal cell types. Nature. 2015;525(7568):251–255. doi: 10.1038/nature14966. [DOI] [PubMed] [Google Scholar]
- 14.Hafemeister C, Satija R. Normalization and variance stabilization of single-cell RNA-seq data using regularized negative binomial regression. Genome Biol. 2019;20(1):1–15. doi: 10.1186/s13059-019-1874-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Hashimshony T, Senderovich N, Avital G, Klochendler A, De Leeuw Y, Anavy L, Gennert D, Li S, Livak KJ, Rozenblatt-Rosen O, et al. Cel-seq2: sensitive highly-multiplexed single-cell RNA-seq. Genome Biol. 2016;17:1–7. doi: 10.1186/s13059-016-0938-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16.Hausser J, Strimmer K. Entropy inference and the James–Stein estimator, with application to nonlinear gene association networks. J Mach Learn Res. 2009;10(7):1469–1484. [Google Scholar]
- 17.Hicks SC, Townes FW, Teng M, Irizarry RA. Missing data and technical variability in single-cell RNA-sequencing experiments. Biostatistics. 2018;19(4):562–578. doi: 10.1093/biostatistics/kxx053. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18.Jaccard P. The distribution of the flora in the alpine zone. 1. NewPhytol. 1912;11(2):37–50. [Google Scholar]
- 19.Kharchenko PV, Silberstein L, Scadden DT. Bayesian approach to single-cell differential expression analysis. Nat Meth. 2014;11(7):740–742. doi: 10.1038/nmeth.2967. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20.Kiselev VY, Andrews TS, Hemberg M. Challenges in unsupervised clustering of single-cell RNA-seq data. Nat Rev Genet. 2019;20(5):273–282. doi: 10.1038/s41576-018-0088-9. [DOI] [PubMed] [Google Scholar]
- 21.Kiselev VY, Kirschner K, Schaub MT, Andrews T, Yiu A, Chandra T, Natarajan KN, Reik W, Barahona M, Green AR, et al. Sc3: consensus clustering of single-cell RNA-seq data. Nat Meth. 2017;14(5):483–486. doi: 10.1038/nmeth.4236. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22.Kullback S, Leibler RA. On information and sufficiency. Ann Math Stat. 1951;22(1):79–86. doi: 10.1214/aoms/1177729694. [DOI] [Google Scholar]
- 23.Li H, Courtois ET, Sengupta D, Tan Y, Chen KH, Goh JJL, Kong SL, Chua C, Hon LK, Tan WS, et al. Reference component analysis of single-cell transcriptomes elucidates cellular heterogeneity in human colorectal tumors. Nat Genet. 2017;49(5):708–718. doi: 10.1038/ng.3818. [DOI] [PubMed] [Google Scholar]
- 24.Liu B, Li C, Li Z, Wang D, Ren X, Zhang Z. An entropy-based metric for assessing the purity of single cell populations. Nat Commun. 2020;11(1):1–13. doi: 10.1038/s41467-020-16904-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 25.Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with deseq2. Genome Biol. 2014;15(12):1–21. doi: 10.1186/s13059-014-0550-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26.Luecken MD, Theis FJ. Current best practices in single-cell RNA-seq analysis: a tutorial. Mol Syst Biol. 2019;15(6):e8746. doi: 10.15252/msb.20188746. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 27.Lun AT, McCarthy DJ, Marioni JC. A step-by-step workflow for low-level analysis of single-cell RNA-seq data with bioconductor. F1000Research. 2016;5:2122. doi: 10.12688/f1000research.9501.1. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28.Macosko EZ, Basu A, Satija R, Nemesh J, Shekhar K, Goldman M, Tirosh I, Bialas AR, Kamitaki N, Martersteck EM, et al. Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets. Cell. 2015;161(5):1202–1214. doi: 10.1016/j.cell.2015.05.002. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 29.McInnes L, Healy J, Melville J. Umap: Uniform manifold approximation and projection for dimension reduction, arXiv preprint arXiv:1802.03426. 2018
- 30.Ortega A. Introduction to graph signal processing. Cambridge University Press; 2022. [Google Scholar]
- 31.Peterman N, Levine E. Sort-seq under the hood: implications of design choices on large-scale characterization of sequence-function relations. BMC Genom. 2016;17:1–17. doi: 10.1186/s12864-016-2533-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 32.R Core Team R: a language and environment for statistical computing, R foundation for statistical computing, Vienna. https://www.R-project.org/, 2020
- 33.Rand WM. Objective criteria for the evaluation of clustering methods. J Am Stat Assoc. 1971;66(336):846–850. doi: 10.1080/01621459.1971.10482356. [DOI] [Google Scholar]
- 34.Risso D, Purvis L, Fletcher RB, Das D, Ngai J, Dudoit S, Purdom E. Clusterexperiment and rsec: a bioconductor package and framework for clustering of single-cell and other large gene expression datasets. PLoS Comput Biol. 2018;14(9):e1006378. doi: 10.1371/journal.pcbi.1006378. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 35.Robinson MD, McCarthy DJ, Smyth GK. edger: a bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010;26(1):139–140. doi: 10.1093/bioinformatics/btp616. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 36.Scrucca L, Fop M, Murphy TB, Raftery AE. mclust 5: clustering, classification and density estimation using gaussian finite mixture models. R J. 2016;8(1):289. doi: 10.32614/RJ-2016-021. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 37.Shannon CE. A mathematical theory of communication. Bell Syst Tech J. 1948;27(3):379–423. doi: 10.1002/j.1538-7305.1948.tb01338.x. [DOI] [Google Scholar]
- 38.Shorrocks AF. The class of additively decomposable inequality measures. Econom: J Econom Soc. 1980;48(3):613–625. doi: 10.2307/1913126. [DOI] [Google Scholar]
- 39.Smith RC, MacArthur BD. Information-theoretic approaches to understanding stem cell variability. Curr Stem Cell Rep. 2017;3(3):225–231. doi: 10.1007/s40778-017-0093-5. [DOI] [Google Scholar]
- 40.Stuart T, Butler A, Hoffman P, Hafemeister C, Papalexi E, Mauck WM, III, Hao Y, Stoeckius M, Smibert P, Satija R. Comprehensive integration of single-cell data. Cell. 2019;177(7):1888–1902. doi: 10.1016/j.cell.2019.05.031. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 41.Stumpf PS, Du X, Imanishi H, Kunisaki Y, Semba Y, Noble T, Smith RC, Rose-Zerili M, West JJ, Oreffo RO, et al. Transfer learning efficiently maps bone marrow cell types from mouse to human using single-cell RNA sequencing. Commun Biol. 2020;3(1):1–11. doi: 10.1038/s42003-020-01463-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 42.Svensson V, Natarajan KN, Ly L-H, Miragaia RJ, Labalette C, Macaulay IC, Cvejic A, Teichmann SA. Power analysis of single-cell RNA-sequencing experiments. Nat Meth. 2017;14(4):381. doi: 10.1038/nmeth.4220. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 43.Svensson V, Vento-Tormo R, Teichmann SA. Exponential scaling of single-cell RNA-seq in the past decade. Nat Protoc. 2018;13(4):599–604. doi: 10.1038/nprot.2017.149. [DOI] [PubMed] [Google Scholar]
- 44.Consortium Tabula Muris. Single-cell transcriptomics of 20 mouse organs creates a tabula muris. Nature. 2018;562(7727):367. doi: 10.1038/s41586-018-0590-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 45.Theil H. Economics and Information, theory studies in mathematical and managerial economics. North-Holland Publishing Company; 1967. [Google Scholar]
- 46.Tian L, Dong X, Freytag S, Lê Cao K-A, Su S, JalalAbadi A, Amann-Zalcenstein D, Weber TS, Seidi A, Jabbari JS, et al. Benchmarking single cell RNA-sequencing analysis pipelines using mixture control experiments. Nat Meth. 2019;16(6):479–487. doi: 10.1038/s41592-019-0425-8. [DOI] [PubMed] [Google Scholar]
- 47.Townes FW, Hicks SC, Aryee MJ, Irizarry RA. Feature selection and dimension reduction for single-cell RNA-seq based on a multinomial model. Genome Biol. 2019;20(1):1–16. doi: 10.1186/s13059-019-1861-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 48.Trapnell C. Defining cell types and states with single-cell genomics. Genome Res. 2015;25(10):1491–1498. doi: 10.1101/gr.190595.115. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 49.Van Dijk D, Sharma R, Nainys J, Yim K, Kathail P, Carr AJ, Burdziak C, Moon KR, Chaffer CL, Pattabiraman D, et al. Recovering gene interactions from single-cell data using data diffusion. Cell. 2018;174(3):716–729. doi: 10.1016/j.cell.2018.05.061. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 50.Van Rossum G, Drake FL. Python 3 reference manual. Scotts Valley: CreateSpace; 2009. [Google Scholar]
- 51.Virtanen P, Gommers R, Oliphant TE, Haberland M, Reddy T, Cournapeau D, Burovski E, Peterson P, Weckesser W, Bright J, et al. Scipy 1.0: fundamental algorithms for scientific computing in python. Nat Meth. 2020;17(3):261–272. doi: 10.1038/s41592-019-0686-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 52.Zheng GX, Terry JM, Belgrader P, Ryvkin P, Bent ZW, Wilson R, Ziraldo SB, Wheeler TD, McDermott GP, Zhu J, et al. Massively parallel digital transcriptional profiling of single cells. Nat Commun. 2017;8(1):1–12. doi: 10.1038/ncomms14049. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 53.Zhu C, Byrd RH, Lu P, Nocedal J. Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound-constrained optimization. ACM Trans Math Softw (TOMS) 1997;23(4):550–560. doi: 10.1145/279232.279236. [DOI] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Supplementary Materials
Data Availability Statement
The Svensson data was downloaded as file “svensson_chromium_control.h5a” from https://data.caltech.edu/records/1264 The Tian data sets were downloaded from https://github.com/LuyiTian/sc_mixology The Zheng data was downloaded as set of files “Gene / cell matrix (filtered)” in section “Single Cell 3’ Paper: Zheng et al. 2017” from https://support.10xgenomics.com/single-cell-gene-expression/datasets The Stumpf data was downloaded as file “RData.zip” from http://dx.doi.org/10.17632/csvm3kpkxd.1 The Tabula Muris data was downloaded as set of files “Single-cell RNA-seq data from microfluidic emulsion (v2)” from https://tabula-muris.ds.czbiohub.org
An R package for the implementation of the described methods is available on github https://github.com/mjcasy/scEC