Skip to main content
BMC Bioinformatics logoLink to BMC Bioinformatics
. 2023 Aug 12;24:311. doi: 10.1186/s12859-023-05424-8

An information-theoretic approach to single cell sequencing analysis

Michael J Casey 1,2, Jörg Fliege 1, Rubén J Sánchez-García 1,2,3,, Ben D MacArthur 1,2,3,4,
PMCID: PMC10422744  PMID: 37573291

Abstract

Background

Single-cell sequencing (sc-Seq) experiments are producing increasingly large data sets. However, large data sets do not necessarily contain large amounts of information.

Results

Here, we formally quantify the information obtained from a sc-Seq experiment and show that it corresponds to an intuitive notion of gene expression heterogeneity. We demonstrate a natural relation between our notion of heterogeneity and that of cell type, decomposing heterogeneity into that component attributable to differential expression between cell types (inter-cluster heterogeneity) and that remaining (intra-cluster heterogeneity). We test our definition of heterogeneity as the objective function of a clustering algorithm, and show that it is a useful descriptor for gene expression patterns associated with different cell types.

Conclusions

Thus, our definition of gene heterogeneity leads to a biologically meaningful notion of cell type, as groups of cells that are statistically equivalent with respect to their patterns of gene expression. Our measure of heterogeneity, and its decomposition into inter- and intra-cluster, is non-parametric, intrinsic, unbiased, and requires no additional assumptions about expression patterns. Based on this theory, we develop an efficient method for the automatic unsupervised clustering of cells from sc-Seq data, and provide an R package implementation.

Supplementary Information

The online version contains supplementary material available at 10.1186/s12859-023-05424-8.

Background

Advances in single-cell sequencing (sc-Seq) technologies have enabled us to profile thousands of cells in a single experiment [43]. In combination with advances in unsupervised analysis methods, particularly specialised clustering algorithms and dimensionality reduction techniques, these technologies have allowed us to dissect cellular identities in unprecedented detail and discover novel, functionally important, cell types [48]. The goal of most sc-Seq studies (except those focused on methodology development) is to extract biological information, often concerning the mix of cell types present in the tissue sample, from the data obtained. Yet, data is not the same as information; and large, complex, data sets do not necessarily convey useful or usable information. Notably, current single-cell profiling technologies typically produce noisy data for numerous technical reasons, including low capture rate, sparsity due to shallow sequencing, and batch effects [17, 19]. Consequently, the relationship between biological information and sc-Seq data is complex and incompletely understood. There is, therefore, a need for formal, quantitative methods to evaluate this relationship.

To address this challenge, we propose an information-theoretic framework that quantifies the amount of information contained in a sc-Seq data set, and leads to a natural definition of gene expression heterogeneity. Our measure of gene expression heterogeneity decomposes into that which is explained by a given grouping of cells—a proposed clustering into cell types, for example—and that which remains unexplained.

Our framework differs from other approaches to heterogeneity decomposition (e.g. [14, 27]) by formally quantifying the information, in terms of gene expression heterogeneity, gained from a sc-Seq experiment concerning the mix of cell types present. The resulting measure of heterogeneity, and its decomposition into inter- and intra-cluster heterogeneity, is non-parametric, intrinsic, unbiased, and requires no a priori assumptions about gene expression patterns.

Our approach is mathematically precise, biologically intuitive and computationally simple to implement, enabling a practitioner to quickly assess the information content of a sc-Seq data set, in terms of gene expression heterogeneity, identify highly informative genes, and determine the extent to which observed patterns of gene expression variability are explained by the presence of a mixture of cell types in a cellular population. Furthermore, we provide an efficient unsupervised clustering algorithm of cells from sc-Seq data based on heterogeneity and an implementation as an R package.

Results

High-throughput single-cell analysis methods, such as single-cell sequencing, typically view cells as the objects of study and seek to compare cell identities with each other [4, 12, 14, 20, 24, 25, 35, 47]. However, this cell-centric view is less well suited to quantifying gene expression heterogeneity, which is concerned with patterns of variation that arise from the mixing of cell types within a population, and may vary from gene to gene. For instance, variance, the standard measure of heterogeneity in the cell-centric view, may not be informative in multi-modal distributions [39].

In the context of sc-Seq, heterogeneity results from differential expression between cell types. In the simplest case, where one cell type expresses a gene highly and another lowly, such differential expression introduces bimodality in the cell-centric expression distribution. In a bimodal distribution, the population mean is no longer characteristic of either of the two subpopulations, making variance, as a measure of the spread about the mean, also misleading (see Smith and MacArthur [39]).

To meet these challenges, we introduce a novel gene-centric probabilistic view that seeks to more formally specify what is meant by gene expression homogeneity and heterogeneity. We show that this gene-centric view is better suited to quantifying gene expression heterogeneity in the context of a multimodal population, and formalises the biological intuition of expression homogeneity and heterogeneity.

Quantifying gene expression heterogeneity

Consider the expression pattern of a single gene g of interest in a population of N distinct cells. Assume that in total M transcripts of g are identified in the cell population (i.e. across all N cells profiled). Note that M represents the observed transcript count, which may differ from the true count due to technical artefacts. Now consider the stochastic process of assigning the M identified transcripts of gene g to the N cells profiled. Intuitively, the population is homogeneous with respect to expression of g if all the cells are statistically the same with respect to its expression. Mathematically, this means that the M transcripts of g will be assigned to the N cells independently and equiprobably—i.e. each transcript will be assigned to each cell with probability 1/N. Conversely, if the population is heterogeneous with respect to expression of g (that is, it consists of a mix of cell types, each expressing the gene differently), then transcripts will not be assigned uniformly, but rather will be assigned preferentially to distinct subsets of cells. Heterogeneity in experimentally observed patterns of expression can, therefore, be assessed in terms of deviation from this hypothetical homogeneous null model.

The Kullback–Leibler divergence (KLD, also known as the relative entropy) is a measure of the information encoded in the difference between an observed and null distribution [22]. In the sc-Seq context, the KLD of an experimentally observed gene expression distribution from the homogeneous null model, described above, measures the amount of information that is lost by assuming that the gene is homogeneously expressed in the sequenced cell population. We will denote this quantity I(g) and refer to it as the heterogeneity of g (see Fig. 1 for a schematic). Formally,

I(g)=i=1NxiglogNxig=log(N)-H(g), 1

where xig is the fraction of transcripts of gene g expressed by cell i, for each 1iN, and H(g) is the entropy of the expression of g in the population (see “Methods” Section for full details). For technical reasons, explained in the “Methods” Section, we will use natural logarithms in all calculations; I(g) therefore has units of nats.

Fig. 1.

Fig. 1

An information-theoretic view of sc-Seq data. Transcripts, or more generally counts, of a given gene (shown here as horizontal bars) are assigned to cells after sequencing. If the cell population is homogeneous with respect to the expression of g, then the heterogeneity I(g) will be zero (top left population, I(g)=0). In practice, the transcript assignment process is stochastic, and so there will always be some deviation from this ideal (bottom left population, I(g) small). (Note that the technical effects of this stochasticity on the information obtained may be reduced by using a shrinkage estimator to determine the distribution of transcripts (see “Methods” Section)). If the population is heterogeneous, then transcripts may be preferentially expressed in a subset of cells and the information obtained from the experiment, as measured by I(g) will be larger (top right population, I(g) large), reaching a maximum at log(N), where N is the number of cells sequenced, when only one cell expresses the gene (bottom right population, I(g)=ln(5)1.61 largest).Note that the population heterogeneity I(g) is independent of any decomposition of the cell populations into subpopulations (shown here as yellow and purple cells, for illustration). However, given any grouping of the cells into subpopulations, I(g) can be formally decomposed as the sum of the heterogeneity explained by within and in-between subpopulations (see “Results” Section and Fig. 3). This decomposition, but not the overall value of I(g), does depend on the chosen assignment of cells to subpopulations

Intuitively, if the cell population is unstructured with respect to the expression of g (i.e. if the cells are approximately interchangeable with respect to the expression of g) then the assumption of homogeneity is correct and I(g)0. Conversely, the theoretical maximum for I(g) is log(N), which is achieved when H(g)=0 and all transcripts of the gene are assigned to the same cell (see Fig. 1). Note that: (1) we do not need to know, or model, the particular expression distribution of g in the population, so no a priori assumptions about expression patterns are required to calculate I(g); (2) I(g) is agnostic concerning missing readings or counts so long as they are distributed uniformly at random; (3) since it quantifies the deviation from the homogeneous null model, I(g) measures the information obtained from the experiment concerning the expression of g.

In practice, I(g) is associated with cellular diversity: the more distinct cell sub-populations present in a sample, and the more those sub-populations differ from one another with respect to their expression of g, the larger I(g) will be. Thus, I(g) is a parsimonious measure of expression heterogeneity that makes minimal assumptions concerning expression patterns and, therefore, imposes minimal technical requirements on data collection methodology or quality. As such, it can be used as the basis for numerous aspects of the sc-Seq pipeline, including feature selection and unsupervised clustering.

To validate these uses, we considered a series of single-cell RNA-sequencing bench-marking data sets: Svensson, a technical control [42]; Tian, a mixture of three cancerous cell lines [46]; Zheng, a set of FACS separated peripheral blood mononuclear cells; Stumpf, a sampling of cells from mouse bone marrow [41]; and the Tabula Muris, a mouse cell atlas with cells from twelve organs [44]. Fig. 2a–e shows plots of I(g) for each gene profiled in these experiments. These plots confirm the intuition that the more distinct types present in a sample, and the more those cell types differ from one another with respect to their gene expression patterns, the greater the population heterogeneity will be, as measured by I(g).

Fig. 2.

Fig. 2

Information-theoretic single-cell analysis. Recall that I(g) measures the heterogeneity of a cellular population with respect to the expression of g: I(g)=0 when transcripts are expressed uniformly and increases as transcripts are expressed preferentially in a subset of cells, reaching a maximum I(g)=log(N), where N is the number of cells sequenced, when only one cell expresses the gene. ae Plots of expression heterogeneity, I(g) (normalised by the theoretical maximum, log(N)) against log mean expression for the bench-marking sc-Seq data sets described in the main text. In each panel, each point represents a gene profiled. The number of genes associated with large values of I(g) increases with the number of cell types present in the population profiled, showing I(g) as a valid measure of cell type diversity. Panel a shows data from a technical control [42] (number of cell types, C=1), b a mixture of three cancerous cell lines [46] (C=3), c FACS sorted immune cells [52] (C=4), d a sample of mouse bone marrow [41] (C=14), and e a multi-organ mouse cell atlas [44] (C=56). fh Biologically meaningful cell annotations are associated with high inter-cluster heterogeneity. Established cell annotations for the f Tian, g Zheng and h Stumpf data are associated with higher inter-cluster heterogeneity than expected by chance (i.e., in randomly permuted clusters; significance is assessed using a one-sided exact test with 104 permutations; y axes show log10(p+1)). In all panels the red line shows p<0.05, false discovery rate corrected for 500 trials [2, 8]. Genes below this threshold are significantly different gene expression patterns across the set of identified cell types. i Summary statistics for the total inter-cluster heterogeneity HS=gHS(g) based on established empirical and randomly permuted cell annotations (104 random permutations in each case). These statistics show the strong association of high HS with biologically meaningful groupings of cells. j A Uniform Manifold Approximation and Projection (UMAP) [29] plot of the top 500 genes by I(g) for the Stumpf data set; each point is a cell, coloured by its scEC cluster. This shows that I(g) is able to capture the continuous variation of developing cell types

These results indicate that heterogeneity (as defined here) can be used for feature selection. For example, heterogeneously expressed genes can be identified by 1) ordering them in descending ordering by I(g), and selecting the top n (where n is user-determined) or 2) finding those genes for which I(g) greater than some user-defined threshold (note that 0I(g)logN, with I(g)0.7 indicating the gene g is expressed in approximately half of cells).

As a measure for use in feature selection, I(g) identifies gene sets largely distinct from those based on Highly Variable Gene selection (the mean overlap of selected genes—for listed data sets excluding Svensson—was 0.36, with the number of genes selected determined by scran, based on an false discovery rate threshold of 0.05). For instance, I(g) identifies Hbb-bs, a marker of erythrocyte maturation, as informative in the Stumpf et al. [41] data set (high I(g) value), whereas scran does not, instead attributing the observed variance to technical sources (see Additional file 1: Fig. S1)[27]. Critically, information-theoretic heterogeneity is more mathematically precise and computationally simpler to implement than variance-based approaches, making no distributional assumptions (e.g. use of the negative binomial model) and having no free parameters to fit. Furthermore, it may be easily modified to account for the presence of multiple homogeneous cell subpopulations, and thereby act as a measure of cluster quality.

Inter-cluster heterogeneity

Let S be a discrete clustering of a population of cells—i.e. an assignment of the cells to a finite set of C non-intersecting sub-populations (also known as clusters). In the “Methods” Section, we show that I(g) can be decomposed into two parts: one part that quantifies the extent to which each of the sub-populations defined by S deviate from homogeneity, which we call the intra-cluster heterogeneity and write as hS(g); and one part that quantifies how differently the gene is expressed, in aggregate, between sub-populations, which we call the inter-cluster heterogeneity and write as HS(g). Namely, we show that

I(g)=HS(g)+hS(g), 2

for any proposed clustering S. Thus, the information obtained from an experiment concerning the expression of a gene g can be explicitly related to both local (within cluster) and global (between cluster) patterns of variation, for any proposed clustering. Full mathematical details of this decomposition, including formulae for HS(g) and hS(g), are provided in the “Methods” Section, and an illustration is given in Fig. 3. We also provide an R package for its calculation (see “Methods” Section).

Fig. 3.

Fig. 3

Heterogeneity is additively decomposable. The heterogeneity of a population of cells (5 cells in this illustration) with respect to the expression of a gene g, I(g), can be decomposed into inter- and intra-cluster heterogeneities for any proposed clustering, S (here, two subpopulations, or clusters, of 3 yellow and 2 purple cells). The inter-cluster heterogeneity HS(g) is determined by independently aggregating all transcripts (shown as horizontal lines) associated with each sub-population in S and then taking the KLD of the resulting distribution from the uniform distribution of the transcripts over C clusters. It measures the extent to which transcripts are uniformly assigned to clusters. The intra-cluster heterogeneity hS(g) is determined by taking the weighted sum (with respect to the number of transcripts on each subpopulation) of the heterogeneities of each of the constituent subpopulations, considered independently. It represents the average heterogeneity of the proposed clusters, accounting for disparities in number of transcripts assigned. In this toy example, the overall population heterogeneity of gene g, I(g)=0.55, decomposes as the sum of the inter-cluster heterogeneity HS(g)=0.33, plus the intra-cluster heterogeneity hS(g)=0.22. The latter is obtained as the weighted sum (with respect to the number of transcripts in each cluster, here 2/10=0.2 and 8/10=0.8) of the heterogeneities on each subpopulation. Further details and formulae are provided in the “Methods” Section

This decomposition can be used to assess cluster quality. For a proposed cluster to meaningfully represent a cell type, it must be associated with differential expression of some subset of marker genes. Because HS(g) measures the extent to which the expression of gene g deviates from the homogeneous null model it is a simple measure of the extent to which g is differentially expressed between the clusters defined by S. We therefore expect that biologically meaningful clustering based on a marker gene g will result in a high value for HS(g) and (by necessity) a low value for hS(g).

This expectation is confirmed for the Tian, Zheng and Stumpf data sets for which we possess high-confidence a priori cell type annotations, derived either experimentally (for the Tian and Zheng data) or from expert annotation of a computational clustering (for the Stumpf data). In the set of genes most likely to be differentially expressed between clusters in these data sets (taken to be the top 500 genes by I(g) in each case) the majority are associated with substantially higher inter-cluster heterogeneity than expected by chance (i.e., in randomly permuted clusters). Indeed, in Tian 97.0%, in Zheng 94.4% and in Stumpf 99.6% of the tested genes had significantly higher inter-cluster heterogeneity than expected (one-sided exact test with 104 permutations, p<0.05, false discovery rate corrected for 500 trials; see Fig. 2f–h for illustration) [2, 8].

These results indicate that by simple information-theoretic reasoning, we are able to quantify the amount of information in the expression pattern of a single gene that is explained by a given clustering. However, a key strength of high-throughput single-cell profiling methods is that they allow the simultaneous profiling of thousands of genes in large cell populations.

Our information-theoretic reasoning may be extended to a high-throughput single-cell profiling experiment by assuming that each gene is an independent source of information and making use of the fact that information from independent sources is additive [37]. Thus, we can determine the total information explained by a given clustering S by evaluating the sum HS=gHS(g) over all genes profiled. The total information explained by S is a simple, easily computed measure for cluster evaluation that favours grouping of cells into homogeneous (with respect to gene expression) sub-populations and is maximised at gI(g) if and only if the proposed clustering divides the population into indivisible perfectly homogeneous sub-populations and thereby accounts for all the heterogeneity contained in the sc-Seq data set. If so, then S is the maximum entropy partition of the cell population into distinct classes, and may therefore be considered as the most parsimonious way of assigning cell identities.

To illustrate the association between total information explained, HS, and cluster quality, we again considered the established annotations of the Tian, Zheng and Stumpf data sets. As expected, in all cases the observed value of HS significantly exceeds that of all random label permutations (see Fig. 2i indicating that HS is strongly associated with cluster quality and that these benchmark cell annotations are associated with homogeneous cell sub-populations. Notably, this association holds true independently of the method used to annotate cell identities (annotations were derived from genotypic information for the Tian data, surface protein expression for the Zheng data, and unsupervised clustering for the Stumpf data), indicating that HS provides a methodology agnostic, parameter-free, means to quantify cluster quality.

Unsupervised clustering

Building on this empirical association, we investigate the clustering of cells that maximises HS (and necessarily minimises hS) as a reasonable approximation to (homogeneous) cell type. That is, this is the cell clustering that explains the most of the gene expression heterogeneity as inter-cluster variability.

We implement HS-maximisation as the objective function of an unsupervised clustering method, which we call scEC (single-cell Entropic Clustering). We show that our method is comparable to the state-of-the-art Louvain method, and in doing so, validate the underlying definition of heterogeneity and homogeneous cell type.

Namely, we compare the clustering returned by scEC and the established annotations for the Tian, Zheng and Stumpf data sets. The resulting scEC clusterings agreed strongly with the established cell annotations, achieving adjusted Rand Indices (ARI) of 0.99 for the Tian annotations, 0.87 for the Zheng annotations, and 0.69 for the Stumpf annotations (ARIis a measure of similarity of a proposed clustering to a known clustering, Rand [33]). To benchmark these results we also repeated the analysis using the Louvain method, a leading single-cell clustering algorithm, which achieved ARIs of 0.99, 0.99 and 0.35 for the Tian, Zheng, and Stumpf data sets respectively [3, 9, 26, 40].

The scEC objective function, HS, is naturally built by summation of gene-level measurements (see “Methods” Section), so can be used to directly identify the key drivers of a given clustering without relying on post-hoc differential gene expression testing. To demonstrate this feature, we compared the inter-cluster heterogeneity values (i.e., the values HS(g) for each of the genes profiled) for clusters identified by scEC and by established annotations for the Tian & Stumpf data sets (see Fig. 4).

Fig. 4.

Fig. 4

Comparison of inter-cluster heterogeneity of scEC-generated clusters versus established annotations. Plots of HS(g) based on an scEC-generated clustering (x-axis) and established annotations (y-axis) for the a Zheng and b Stumpf, and c an alternative data set from Tian [41, 46, 52]. In all panels each point represents a gene profiled and the red line indicates HscEC(g)=HKnown(g). For the genes below the red lines, the scEC clustering is better than the prior annotation at explaining the gene expression heterogeneity as inter-cluster variability, and vice versa for the genes above the red line

We found that the values of HS(g) associated with scEC clusters are strongly positively correlated with those associated with established annotations for both the Zheng and Stumpf data sets (Pearson’s correlation coefficient of 0.94 for Zheng and 0.99 for Stumpf; see Fig. 4). Moreover, the key marker genes in each data set were found to have significantly different expression levels across clusters, as measured by HS(g) (one-sided permutation test on clustering labels, 104 permutations, p<0.05, false discovery rate corrected; marker genes for Zheng: CD14, CD4, CD8 and NCAM1, Zheng et al. [52]; and Stumpf: Cd34, Kitl, Spi1, Gata1 and Pax5, Stumpf et al. [41]).

This concordance indicates that scEC is generally able to recapitulate known cell clusters in an unbiased and parameter free way. However, despite this concordance, there were differences between scEC clusters and established annotations. To investigate these differences further we selected those genes for which the value of HS(g) differed substantially (ΔHS(g)=|HscEC(g)-HKnown(g)|>0.1nats) between scEC clusters and established annotations in each data set for further investigation.

In the Zheng data, we found that those genes with expression heterogeneity better explained by the scEC clustering were enriched for cell-cycle related genes (gene ontology enrichment, p=1.33×10-8, false discovery rate corrected). Conversely, those genes with expression heterogeneity better explained by the established annotation were enriched for genes involved in the immune response (p=1.49×10-3, false discovery rate corrected) [1, 2, 10].

In the Stumpf data set, we found that those genes with expression heterogeneity better explained by the scEC clustering were enriched for genes involved in erythroblast differentiation and homeostasis (p=1.80×10-2 and p-value=6.17×10-4 respectively, false discovery rate corrected). Conversely, those genes with expression heterogeneity better explained by the established annotation were enriched for genes involved in the immune response (p=2.57×10-18, false discovery rate corrected). In this example, scEC generally identifies additional erythrocyte sub-populations while merging the cell sub-types of the neutrophil lineage (see Fig. 2j for illustration and Additional file 1: Table S1 for contingency table of cluster assignments).

Collectively, these differences reflect the preference of scEC for cellular annotations driven by biological processes involving larger tranches of genes. Depending on circumstance, this preference may be a benefit or a drawback. It is generally beneficial because it ensures that scEC annotations are robust to outliers (i.e. anomalous expression patterns of individual genes cannot easily distort scEC cluster assignments). Conversely, in circumstances in which key cellular functions are determined by a small number of genes, scEC may fail to disambiguate important cell populations from similar cell types. However, given the simplicity of the scEC methodology this issue can this issue can be relatively easily addressed, by adjusting the prior, as follows. In the standard formulation of scEC each gene contributes equally to the objective function HS=gHS(g). This corresponds to assuming a uniform prior. However, alternatives may be easily considered that take into account different sources of prior knowledge. In general, each gene can contribute in a weighted way to the overall inter-cluster heterogeneity. In this case the objective becomes HS=gw(g)HS(g), where w(g) is the weight of gene g. These weights can encode known biology—for example, by up-weighting known drivers of cellular identities, down-weighting genes associated with housekeeping processes and/or the cell cycle, or by weighting genes in inverse proportion to the size (i.e. the number of associated genes) of their leading ontology term.

While incorporating such priors into scEC may be beneficial, defining them is a challenge and potentially introduces a source of bias into the resulting clustering and so should be approached with care. An alternative approach—which encodes prior information in a data-led, rather than algorithmic way—is to use a reference data set, such as a cell atlas, to benchmark cell annotations (these atlases can serve as reference transcriptomes, in analogy to reference genomes).

Doing so leads to a semi-supervised version of scEC, wherein we cluster a novel test data set based on the known clustering of a given reference data set. We detail the mathematics of semi-supervised clustering in the “Methods” Section, but, in short, the extension to the semi-supervised setting requires no additional mathematical machinery. Such mathematical simplicity affirms information theory as providing a mathematically precise and biological intuitive framework for cellular clustering.

As an example, we cluster the Tian data set based on a distinct data set containing the same three cell lines (plus two additional cell lines) [46]. The semi-supervised clustering strongly agreed with the established cell annotation, achieving an ARI of 0.90. Notably, the ARI was reduced compared with the unsupervised version due to the assignment of 6% of cells to cell types present only in the reference data set, a disadvantage of semi-supervised (and supervised) methods.

Benchmarking and imputation

We benchmark scEC and state-of-the-art methods against a series of data sets consisting of cells sampled from three or five distinct cancerous cell lines and sequenced using various technologies (CEL-Seq2, Sort-seq, 10x, Drop-seq) [15, 28, 31, 46]. We compare each method’s clustering against ground truth for each data set (by adjusted rand index), finding that scEC performs subpar to the best of existing methods [13, 21, 27, 34, 40]. Investigating the gene-wise contributions via HS(g) reveals that scEC-derived clusterings diverge from the ground truth due to genes expressed in small numbers of cells, e.g., SPINK6 (difference in HscEC from ground truth of 0.23 nats; expressed in 1.6% of cells in the five-cell line, 10X data set; see Fig. 4c).

To protect against low cell-count genes, we adopt a data-diffusion step, imputing the expression value of each cell as a weighted average of itself and a small subset of cells with similar expression patterns (i.e., diffusion of observed expression values over a shared nearest-neighbours graph derived from the top 500 genes by I(g), see “Methods” Section) [49]. The smoothing layer shares information between genes, lessening scEC’s sensitivity to the expression pattern of individual genes; this raises the performance of scEC clustering to be on par with the state-of-the-art (see Fig. 5a).

Fig. 5.

Fig. 5

Benchmarking of scEC performance in unsupervised clustering and feature selection. a Adjusted Rand Index of clusterings produced by specified methods against known ground truth for seven data sets, each consisting of three or five cancerous lines sequenced on different platforms. With an additional imputation step, scEC performs on par with other methods. b The percent of the top N genes by different feature selection metrics that are differentially expressed. Data set is Sc-seq from three cancerous cell lines sequenced by Drop-seq (with 2005 differential expressed genes identified from non-parametric testing for each cell line versus the remaining; Wilcox test, fasle discovery rate corrected p-value <0.05). The greater ability of scEC-impute to a priori select differentially expressed genes is repeated across each benchmark data set, see Additional file 1: Fig. S2. Note that the imputation step in scEC-impute assigns many genes a heterogeneity I(g) of zero, resulting in a low cut-off on total number of selectable genes

We further benchmark scEC (with and without the imputation step) as a feature selection tool, evaluating the ability of I(g) to a priori identify differentially expressed genes, assuming differential expression to be the ground truth of feature selection (differentially expressed genes are identified per data set via Wilcoxon Rank Sum testing, selecting those genes with an FDR-adjusted p-value<0.05 in at least one cluster; Stuart et al. [40]). We find that, without smoothing, scEC is comparable to the state-of-the-art; with smoothing, scEC notably improves, substantially bettering the state-of-the-art in selecting differentially expressed genes from the benchmark data sets (see Fig. 5b). Notably, relatively few genes remain heterogeneous after the imputation step, so the imputed I(g) identifies a much smaller number of genes than other methods, where an arbitrary cut-off in selected gene number is required.

Conclusion

Traditional unsupervised clustering methods for single cell data are based on the, often implicit, assumption that cell types correspond to regions of high probability density in the joint gene expression distribution [11]. Although useful, this is not a specifically biological assumption (similar assumptions form the basis of numerous clustering algorithms in other different disciplines) and does not have a clear biological basis. Indeed, this assumption does not naturally accord with biological intuition that distinct cell populations should be ‘homogeneous’ and cells of the same type should be functionally interchangeable. Here, we have taken a different approach that seeks to encode biological intuition about population homogeneity in a mathematical formulation, drawing on tools from information theory.

We have proposed a formal measure of gene expression heterogeneity and shown that this measure captures biologically relevant heterogeneity arising from differential expression between established cell types. We formalised the additive decomposition of heterogeneity with respect to a given grouping of cells (in “Methods” Section), and tested the association between high inter-cluster heterogeneity and grouping by cell type. Finally, we used this measure as a basis of a biologically-motivated clustering procedure, which we call single-cell entropic clustering (scEC), which identifies homogeneous cell types. The scEC method is free of tunable hyper-parameters and its performance is comparable to a state-of-the-art clustering algorithm on a series of benchmark sc-Seq data sets, suggesting that this underlying biological basis is justified.

While our method represents a mathematically rigorous definition of heterogeneity, the stochasticity inherent in sc-Seq data does limit our resolution. For instance, nearly every gene is associated with some non-zero level of heterogeneity and, over thousands of genes, even low levels of stochastically-induced heterogeneity will accumulate (a version of the so-called ‘curse of dimensionality’). This limits our ability to directly interpret the genome-wide heterogeneity (e.g. the total heterogeneity of the technical control Svensson exceeds that of Tian, both absolute and per gene, although the trend reverses when excluding those genes with fewer than 100 total transcripts).

Aside from these technical limitations, our approach demonstrates the benefit of a biologically grounded, mathematically formal approach to understanding sc-Seq data. As these data sets grow in complexity and size, there will be a greater need for biological interpretability and mathematical rigour in analysis. We suggest information theory as the appropriate mathematical language for describing gene expression and its heterogeneity. Indeed, we anticipate that information theory will become an established part of the quantitative biologist’s toolkit.

Methods

Data collection

Count matrices for each data set were downloaded from their respective repositories (see Availability of Data and Materials). For the Zheng and Tabula Muris data sets, the count matrices of the various cell types/tissues were concatenated using the Matrix (v1.3-2) package in R (v4.0.3) [32].

Data pre-processing

For each data set, those genes with less than 100 total transcripts in total (i.e. across all cells) were excluded from further analysis. For the Zheng data set, FACS identities were taken from the repository metadata, with the CD4+ Helper T-cells, CD4+/CD25+ Regulatory T-cells, CD4+/CD45RA+/CD25- Naïve T cells, CD4+/CD45RO+ Memory T-cells, CD8+ Cytotoxic T cells and CD8+/CD45RA+ Naïve Cytotoxic T-cells merged into the single identity of T-cells.

Normalisation

Since sc-Seq data is noisy, rather than working with the experimentally observed proportions of transcripts assigned to each cell, we instead adopt a Bayesian approach to estimate the cellular frequencies using the James–Stein estimator, as explained below, and use this estimator in all subsequent calculations.

We write mig for the number of transcripts of gene g associated with cell i in a population of N cells. Let i=1Nmig=Mg be the total number of transcripts of gene g, and pig=mig/Mg the fraction of transcripts of gene g expressed by cell i, for each 1iN. Thus, i=1Npig=1 and we can think of pig as the probability that a randomly selected transcript of gene g is associated with cell i. To take under-sampling into account, we adjust the observed probabilities pi=pig as follows (let us drop the superscript g for simplicity). We define X=Xg as the discrete random variable on the set {1,2,N} with probabilities

xi=λ1N+(1-λ)pi, 3

where λ is the shrinkage intensity given by

λ=1-i=1Npi2(M-1)i=1N(1N-pi)2, 4

which is the James–Stein estimator, with the uniform distribution as the shrinkage target [16]. Shrinkage approaches deal with under-sampling and so are able to correct for the substantial sparsity observed in single-cell RNA-sequencing data. The shrinkage probabilities xi given in Eq. 3 are a compromise between the observed probabilities pi, which are unbiased but with a high variance, and the target uniform probabilities 1/N, which are biased but with low variance [6, 16].

Feature selection

The Shannon entropy of X is, by definition,

H(X)=-i=1Nxilogxi. 5

By convention, we assume that 0·log0=0. For the logarithm, we take base e for ease in differentiation (see later) rather than the usual base 2 (recall that loge(x)=log2(e)log2(x) for all x>0), and thus H(X), and I(g) below, are measured in nats.

The Shannon entropy is a measure of the uncertainty in the outcomes of the random variable X. It has a minimum value of zero, when xi=1 for some i (e.g. if the gene g is expressed in only one cell of the population) and has a maximum value of log(N), when xi=1/N for all i (e.g. if the gene is uniformly expressed in the cell population). The entropy may therefore be considered as a measure of the homogeneity of expression of the gene g in the cell population profiled (note that the association of high entropy with homogeneity is counter to the usual intuition: it occurs because we are concerned with the entropy of the generative process of assigning transcripts to cells, rather than transcript count distribution in the cell population, as is more usual. A high entropy transcript assignment process gives rise to a sharp transcript distribution, hence its association with homogeneity). By contrast, the quantity log(N)-H(X) also ranges between zero and log(N), yet is minimised when the gene is homogeneously expressed and so is a simple measure of expression heterogeneity, which we will denote I(g). We can rewrite this as

I(g)=log(N)-H(X)=i=1Nxilog(N)+i=1Nxilogxi=i=1NxilogNxi=i=1Nxilogxi1/N. 6

The Kullback–Leibler divergence (KLD), or relative entropy, is a measure of the information encoded in the difference between two probability distributions [22]. The relative entropy of a discrete probability distribution p1,,pN from a discrete probability distribution q1,,qN is, by definition,

D(P||Q)=i=1Npilogpiqi, 7

with the provision that qi=0 implies pi=0, and the convention that 0·log(00)=0. From this definition and Eq. 6, it is clear that our measure of heterogeneity is simply the relative entropy of the observed expression distribution from the uniform distribution U. Thus,

I(g)=log(N)-H(X)=D(X||U), 8

where U denotes to the uniform distribution on the set {1,2,N} with probabilities ui=1/N for all 1iN.

Cluster-level heterogeneity

A crucial property of the relative entropy is that it is additively decomposable with respect to arbitrary groupings [38]. Informally, this means that if we have a clustering of the cells into disjoint groups, then I(g) can be reconstructed from inter-cluster and intra-cluster heterogeneities. Next, we formalise this additive decomposition and give a self-contained derivation (cf. Theil [45]).

Consider a clustering S={S1,,SC} that unambiguously assigns each cell in the sample into one of C non-intersecting sub-populations S1,,SC of sizes N1,,NC. Note that k=1CNk=N, the total number of cells. Let yk be the fraction of transcripts associated with cells in sub-population Sk, adjusted by the shrinkage estimator. That is,

yk=iSkxi, 9

with xi defined by Eq. 3. This gives another discrete random variable Y with probability distribution y1,,yC on the set {1,2,C}. For each k=1,,C, we can also assess the heterogeneity of the sub-population Sk by considering the random variable Zk with probability distribution zi=xi/yk on the set iSk. Note that

iXkzi=iXkxiyk=ykiXkxi=1 10

so Zk is a random variable on the set (cluster) Sk.

We may rewrite I(g) in terms of Y and Zk, as follows:

I(g)=log(N)-i=1Nxilog1xi, 11
=log(N)-k=1CiSkxilog1xi 12
=log(N)-k=1CykiSkxiyklog1xi/yk+log1yk, 13
=log(N)-k=1CykiSkxiyklog1xi/ykH(Zk)-k=1CykiSkxiyklog1yk, 14
=log(N)-k=1CykH(Zk)-k=1Clog1ykiSkxiykH(Y), 15
=log(N)-k=1CykH(Zk)-H(Y), 16
=log(N)-H(Y)-k=1CyklogNkA+k=1CyklogNk-k=1MykH(Zk)B. 17

Expression A may be rewritten as:

A=log(N)-H(Y)-k=1CyklogNk, 18
=k=1Cyklog(N)-k=1Cyklog1yk-k=1CyklogNk, 19
=k=1CyklogykNk/N, 20
=D(Y||Ugroup). 21

This is the relative entropy of Y from the uniform distribution Ugroup in which pk=Nk/N for k=1,,C. Since yk is the proportion of transcripts assigned to the cluster Sk, it measures the deviation from the assumption that the clusters are homogeneous in their expression of g (i.e. each cluster expresses g at the same level). Since it is a measure of the extent to which the population deviates from homogeneity between clusters, we will term this contribution the inter-cluster heterogeneity of g with respect to the clustering S, denoted HS. Informally, it is a measure of the extent to which the gene g is differentially expressed between clusters.

Expression B may be rewritten as:

B=k=1CyklogNk-k=1CykH(Zk), 22
=k=1CyklogNk-H(Zk), 23
=k=1CykiSkxiyklogNkxiyk, 24
=k=1CykiSkxiyklogxi/yk1/Nk, 25
=k=1CykD(Zk||Uk). 26

This is the weighted sum of the relative entropies of the empirical distributions Zk (i.e. the observed gene expression distribution in group Sk) from the uniform distribution Uk on Sk (in which pi=1/Nk for each iSk). It is the deviation from the assumption that the population consists of a mixture of homogeneous sub-populations according to the clustering S (where the expectation is taken with respect to the probability measure provided by Y). Since it is a measure of the expected extent to which the proposed sub-populations deviate from homogeneity within clusters, we will term this contribution the intra-cluster heterogeneity of g with respect to S, denoted hS(g).

Taken together, these results show that I(g) can be decomposed into two well-defined parts that encode local and global properties of the expression distribution of g with respect to a given clustering S={S1,,SC}:

I(g)=A+B=HS(g)+hS(g). 27

The relative entropy is always non-negative and hence so are HS(g) and hS(g) for any S. Thus, both quantities range from zero to I(g). If S places one cell in each cluster (i.e. C=N) then Zk=Uk for all k and thus hS(g)=0. Conversely, if S places all cells in one cluster (i.e. C=1) then Y=US and thus HS(g)=0. In this case, the population heterogeneity is equivalent to the intra-cluster heterogeneity of the trivial clustering.

Unsupervised clustering

The above measures of heterogeneity can be easily extended from one gene to many. In this case, the homogeneous null model is obtained by assuming that each gene is expressed homogeneously and independently. Because entropy from independent sources is additive [7], the total heterogeneity of a single-cell RNA-sequencing data set under a given clustering is given by the sum:

I=gI(g)=gHS(g)+ghS(g)=HS+hS. 28

Here HS=gHS(g) and hS=ghS(g) represent the total inter-cluster and intra-cluster heterogeneity of the data with respect to a clustering S={S1,,SC}.

Naturally, we want to identify a clustering (e.g.  an assignment of cells to types) that produces maximally homogeneous sub-groups and thereby explains most of the expression heterogeneity for most genes in terms of inter-cluster heterogeneity. Mathematically, this means we want to find a clustering S that maximises HS or, equivalently, minimises hS. A brute force approach is not feasible, as the number of partitions of a set with n elements into two or more subsets grows exponentially with n.

To approach the problem we therefore converted the problem from one of assigning discrete identities (i.e., a hard clustering problem) to one of assigning continuous identities (i.e., a fuzzy clustering problem). The advantage of this approach is that it defines the problem in terms of a continuous, differentiable objective function which can be approached with more efficient optimisation routines.

Fuzzy clustering

We adopt a fuzzy conception of clustering in which cells are assigned to C (possibly) overlapping, fuzzy clusters, S1,,SC. Each cell i has C corresponding membership functions μik[0,1] for k=1,2,,C, which assess the probability that cell i belongs to cluster Sk. Thus,

k=1Cμik=1withμik0, 29

Our information-theoretic framework can be adapted to this fuzzy setting. Population heterogeneity, I(g), is independent of the chosen clustering, so it remains unaffected by the adoption of fuzzy clusters, i.e. the total gene expression heterogeneity is unaffected by the choice of discrete or fuzzy cluster memberships. For the calculations of inter-cluster and intra-cluster heterogeneities, we extend the discrete random variables Yg and Zkg to the fuzzy setting, where Yg now measures the expression distribution of the gene g across the C fuzzy clusters, and Zkg measures the expression distribution of the gene g within fuzzy cluster Sk, as follows.

We define the discrete random variable Yg on the set of fuzzy clusters S={S1,,SC} with probabilities ykg given by

ykg=i=1Nμikxig. 30

We also define, for each SkS, a discrete random variable Zkg on the set of cells, i=1,,N, with probabilities zikg given by,

zikg=μikxigykg. 31

We can rewrite I(g) (which is independent of the clustering) in terms of Yg and Zkg (which depend on the clustering) as

I(g)=log(N)-H(Yg)-k=1CykgH(Zkg), 32

where,

H(Yg)=k=1Cykglog1ykg 33

and

H(Zkg)=i=1Nμikxigykglog1xig/ykg. 34

Note that Eq. 33 is the entropy of the random variable Yg, however H(Zkg) is not quite the entropy of Zkg (there is a missing 1/μik factor inside the logarithm).

From Eq. 32, we can obtain a second decomposition,

I(g)=H(g)+h(g) 35

where

HS(g)=k=1CykglogykgNk/N, 36
hS(g)=k=1Cykgi=1Nμikxigykglogxig/ykg1/Nk, 37
Nk=i=1Nμik. 38

Here H(g) and h(g) represent the inter-cluster heterogeneity and intra-cluster heterogeneity of gene g with respect to the fuzzy clustering S. Equation 36 is also the KLD of the distribution of Yg from the uniform distribution Ugroup given by uk=Nk/N for k=1,2,,C (note that k=1Cuk=1). Similarly, Eq. 37 is the KLD of the distribution of Zg from the uniform distribution Uk given by pi=μik/Nk for i=1,2,,N.

Either HS=gHS(g) or hS=ghS(g) can serve as our objective function, f, for optimisation: optimising either quantity simultaneously identifies the clustering with the greatest heterogeneity between clusters (i.e. the greatest differential gene expression between clusters) and the least heterogeneity within clusters (i.e. in which the cells within each cluster are closest to being interchangeable with respect to their patterns of gene expression).

Choosing inter-cluster heterogeneity HS, we solve the optimisation problem

maxSHS, 39

where

HS=g=1GHS(g)=g=1Gk=1CykglogykgNk/N, 40

using the limited-memory, box constrained BFGS (L-BFGS-B) optimisation algorithm from the Python3 (v3.8.2) package SciPy (v1.5.3) [5, 50, 51, 53].

The L-BFGS-B algorithm is an efficient non-linear local optimisation method developed for solving large, dense problems such as this [53] but its speed and the quality of the solution produced depends on the availability of an explicit formulation of the gradient of the objective function. Therefore, we derive the required gradient by partially differentiating the total inter-cluster heterogeneity, HS, with respect to fuzzy cluster membership, i.e. the N·C membership functions μrs (1rN,1sC).

Beginning by differentiating individual elements of f with respect to μrs, from Eqs. 30 and 38, we have

μrsykg=xrgk=s0ks 41
μrsNk=1k=s0ks 42

Using the product, chain and quotient rules, we can differentiate the g-summand, call it fg=HS(g), of the objective function f:

fgμrs=μrsk=1CykglogykgNk/N 43
=k=1CμrsykglogykgNk/N+k=1CykgμrslogykgNk/N 44
=xrglogysgNs/N+k=1CykgNk/NykgμrsykgNk/N 45
=xrglogysgNs/N+k=1CNkNNμrsykgNk 46
=xrglogysgNs/N+k=1CNk1Nk2μrsykgNk-ykgμrsNk 47
=xrglogysgNs/N+1NsxrgNs-ysg 48
=xrglogysgNs+log(N)+1-ysgNs. 49

The membership functions μrs are not independent, since k=1Cμik=1 for all i, by Eq. 29. To incorporate this constraint, and the constraint μik0 for all ik, we introduce variables wik given by

μik=ewikl=1Cewil, 50

and make the objective function f a function of the wik. By the chain rule,

fwij=r=1Ns=1Cfμrsμrswij. 51

Determining μrswij, let us write Mi=l=1Cewil; then, Eq. 50 can be written as μik=ewik/Mi. By the quotient rule,

μrswij=wijewrsMr=ewijMi-e2wijMi2ifr=iands=j,-ewisewijMi2ifr=iandsj,0ifri. 52

Numerical optimisation

The objective function HS, as formulated in Eq. 40, was optimised with respect to cluster memberships, μik using the L-BFGS-B optimisation algorithm from the Python3 (v3.8.2) package SciPy (v1.5.3) [5, 50, 51], with random initial starting points (initial cluster memberships were defined by random sampling of a uniform distribution centred on zero). The gradient function, as formulated in Eq. 51, was supplied to the optimisation routine.

Because initial conditions are randomly generated, and the L-BFGS-B algorithm only identifies local optima, the specific clustering found by the algorithm may depend on the choice of initial vector [5, 53]. To make the implementation robust a multi-start process is adopted, in which the optimisation is repeated multiple times with different initial vectors, and the clustering S with the greatest inter-cluster heterogeneity HS is chosen.

Permutation testing

Annotations for the Tian, Zheng and Stumpf data were taken from the respective repositories’ metadata. Random clusterings were generated by randomly permuting the established annotations 10,000 times each. Comparison of true and shuffled annotations was formulated as an exact one-side hypothesis test, generating exact p-values [8]. We control the false discovery rate arising from multiple testing for each data set using the R function p.adjust [2].

Clustering comparison

We compare our clustering results to two standard clustering methods: Seurat (v3) clustering (Louvain community detection), and UMAP projection, both with default parameters, with the exception of the resolution, as described in Stuart et al. [40] [3, 29]. The Adjusted Rand Index (ARI) was calculated for each data set using the associated function in the R package mclust (v5.4.7) [33, 36].

Semi-supervised classification

Consider a pair of count matrices, T={Tref,Ttest}, where one count matrix, which we term the reference, Tref, has a known cluster structure (and therefore each cell has a unique, discrete cell type classification). The goal of reference mapping is to classify cells in the test matrix, Ttest, based on cellular classifications of the reference matrix.

We normalise each count matrix separately as described in Normalisation, using the James–Stein-type estimator so that iTkxig=1 [16]. We assume that the same set of genes are profiled in Tref and Ttest and derive the normalised combined count matrix, Xmix, via the weighted concatenation:

Xmix=NrefNXref,NtestNXtest, 53

where Nref is the number of cells in the reference data set, Ntest is the number of cells in the unclassified data set, and N is the total number of cells across both data sets.

As in the unsupervised setting, we cluster the combined data set, Xmix, by maximising HS of the combined data set with respect to a fuzzy clustering S, see Eq. 56. However, unlike in the unsupervised setting, a subset of cellular cluster memberships are known a priori. Let R=R1,,RC be the discrete clustering of the reference data set, with cluster sizes n1,,nC, where k=1Cnk=Nref. As the cluster identities of cells originating from the reference data set are fixed, each cluster in the reference data set constitutes a subset of a cluster in the mixed data set, RkSk, where S=S1,,SC is the clustering of the mixed data set. Note that number of clusters in the reference and in the combined data set is the same, i.e. C.

Based on the co-mixing of known and unknown cellular identities, ykg and Nk of the combined data set Xmix are given by

ykg=i=1Ntestμikxig+iRkxig,and 54
Nk=i=1Ntestμik+nk. 55

Based on these formulations of ykg and Nk, the definition of HS follows as before,

HS=g=1GHS(g)=g=1Gk=1CykglogykgNk/N. 56

Similarly, the derivative of HS follows as before, except that the membership function, μrs, is only defined with respect to cells of the test data set, 1rNtest & 1iNtest,

fwij=r=1Ntests=1Cfμrsμrswij=r=1Ntests=1Cg=1GxrglogysgNs+log(N)+1-ysgNsμrswij. 57

Numerical optimisation follows as in the unsupervised setting, using the L-BFGS-B algorithm, except that the known cellular identities of the reference mean that only a single run of the algorithm is required, i.e. no multiple starts.

Imputation

We rely on a cell-by-cell adjacency matrix for imputation, encoding which cells have similar gene expression profiles. Specifically, we use the shared-nearest-neighbours matrix produced by Seurat, using default settings [40].

We begin with the expression matrix X, keeping only the top G genes by I(g) (default is 500). Then, briefly, the Seurat method follows the following steps: data is scaled and normalised using a variance stabilising transform; the transformed data then undergoes principal components, and a cell-cell Euclidean distance matrix is calculated on the first 10 principal components. Next, a k-nearest-neighbours graph is constructed, where k is by default 20; a shared nearest-neighbours graph is constructed by taking the Jaccard Index of the overlap in neighbourhoods of each pair of cells [18].

From this adjacency matrix, A, we compute the imputed data via a one-hop smoothing operator:

Qij=AijjAij 58
X~ig=jQijXjg, 59

where Q is the row-normalised stochastic matrix, i.e. jQij=1i1,,N, Xg=(Xjg) is the g-column of X, and X~g=(X~ig) are the imputed values for gene g [30]. The imputed expression value of gene g in cell i is a weighted mean of observed expression values in cell i and its neighbourhood in A. This is an example of a data diffusion imputation, first applied to Sc-seq data in Van Dijk et al. [49].

The vector X~g can then be used in place of Xg in any of the above described work. In the main text, we select the top 500 genes by I(g) based on X~g, before clustering on X~g. Accordingly, use of the imputed expression matrix requires two rounds of feature selection: first in the construction of the adjacency matrix, A, then for the use of X~g in clustering.

For the benchmarking of the imputation, we recapitulated the set of data sets and methods tested in Tian et al. [46]. However, we were unable to rerun RCA [23].

Supplementary Information

12859_2023_5424_MOESM1_ESM.pdf (1.2MB, pdf)

Additional file 1. Two additional figures and one additional table: Expression of Hbb-bs in the Stumpf data set (Supplementary Figure 1); Feature selection benchmarking of scEC and scEC-impute (Supplementary Figure 2); and Contingency table between cell annotations provided in the Stumpf data set and scEC labels.

Acknowledgements

We thank P. Stumpf, J. Egan and B. Kitching-Morley for helpful discussions.

Abbreviations

ARI

Adjusted rand index

KLD

Kullback–Leibler divergence

scEC

Single-cell entropic clustering

scSeq

Single-cell sequencing

Author contributions

Conceptualization, MJC, RSG, BDM; Software, MJC; Investigation, MJC, JF, RJSG and BDM; Writing—original draft, MJC, RSG and BDM; Writing—review and editing, MJC, JF, RSG and BDM; Visualization, MJC; Supervision, RSG and BDM. All authors have seen and approved the manuscript.

Funding

RSG and BDM were supported by The Alan Turing Institute under the EPSRC Grant EP/N510129/1. RSG was supported by the Alan Turing Institute—Roche strategic partnership under the project ‘Structured missingness in heterogeneous data’.

Availability of data and materials

The Svensson data was downloaded as file “svensson_chromium_control.h5a” from https://data.caltech.edu/records/1264 The Tian data sets were downloaded from https://github.com/LuyiTian/sc_mixology The Zheng data was downloaded as set of files “Gene / cell matrix (filtered)” in section “Single Cell 3’ Paper: Zheng et al. 2017” from https://support.10xgenomics.com/single-cell-gene-expression/datasets The Stumpf data was downloaded as file “RData.zip” from http://dx.doi.org/10.17632/csvm3kpkxd.1 The Tabula Muris data was downloaded as set of files “Single-cell RNA-seq data from microfluidic emulsion (v2)” from https://tabula-muris.ds.czbiohub.org

Code availability

An R package for the implementation of the described methods is available on github https://github.com/mjcasy/scEC

Declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Footnotes

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Contributor Information

Rubén J. Sánchez-García, Email: R.Sanchez-Garcia@soton.ac.uk

Ben D. MacArthur, Email: bdm@soton.ac.uk

References

  • 1.Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, et al. Gene ontology: tool for the unification of biology. Nate Genet. 2000;25(1):25–29. doi: 10.1038/75556. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 2.Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc: Ser B (Methodol) 1995;57(1):289–300. [Google Scholar]
  • 3.Blondel VD, Guillaume J-L, Lambiotte R, Lefebvre E. Fast unfolding of communities in large networks. J Stat Mech: Theory Exp. 2008;2008(10):P10008. doi: 10.1088/1742-5468/2008/10/P10008. [DOI] [Google Scholar]
  • 4.Brennecke P, Anders S, Kim JK, Kołodziejczyk AA, Zhang X, Proserpio V, Baying B, Benes V, Teichmann SA, Marioni JC, et al. Accounting for technical noise in single-cell rna-seq experiments. Nat Meth. 2013;10(11):1093–1095. doi: 10.1038/nmeth.2645. [DOI] [PubMed] [Google Scholar]
  • 5.Byrd RH, Lu P, Nocedal J, Zhu C. A limited memory algorithm for bound constrained optimization. SIAM J Sci Comput. 1995;16(5):1190–1208. doi: 10.1137/0916069. [DOI] [Google Scholar]
  • 6.Chan TE, Stumpf MP, Babtie AC. Gene regulatory network inference from single-cell data using multivariate information measures. Cell Syst. 2017;5(3):251–267. doi: 10.1016/j.cels.2017.08.014. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7.Cover TM, Thomas JA. Elements of information theory. Wiley; 2012. [Google Scholar]
  • 8.Fisher R. Statistical methods for research workers. Gyan Books; 2017. [Google Scholar]
  • 9.Freytag S, Tian L, Lönnstedt I, Ng M, Bahlo M. Comparison of clustering tools in r for medium-sized 10x genomics single-cell RNA-sequencing data. F1000Research. 2018;7:1297. doi: 10.12688/f1000research.15809.1. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10.Consortium Gene Ontology. The gene ontology resource: enriching a gold mine. Nucl Acids Res. 2021;49(D1):D325–D334. doi: 10.1093/nar/gkaa1113. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11.Greulich P, Smith R, MacArthur BD. The physics of cell fate. In: Levine H, Jolly MK, Kulkarni P, Nanjundiah V, editors. Phenotypic switching. NewYork: Academic Press; 2020. pp. 189–206. [Google Scholar]
  • 12.Grün D, Kester L. Van, Oudenaarden A. Validation of noise models for single-cell transcriptomics. Nat Meth. 2014;11(6):637–640. doi: 10.1038/nmeth.2930. [DOI] [PubMed] [Google Scholar]
  • 13.Grün D, Lyubimova A, Kester L, Wiebrands K, Basak O, Sasaki N, Clevers H. Van, Oudenaarden A. Single-cell messenger RNA sequencing reveals rare intestinal cell types. Nature. 2015;525(7568):251–255. doi: 10.1038/nature14966. [DOI] [PubMed] [Google Scholar]
  • 14.Hafemeister C, Satija R. Normalization and variance stabilization of single-cell RNA-seq data using regularized negative binomial regression. Genome Biol. 2019;20(1):1–15. doi: 10.1186/s13059-019-1874-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15.Hashimshony T, Senderovich N, Avital G, Klochendler A, De Leeuw Y, Anavy L, Gennert D, Li S, Livak KJ, Rozenblatt-Rosen O, et al. Cel-seq2: sensitive highly-multiplexed single-cell RNA-seq. Genome Biol. 2016;17:1–7. doi: 10.1186/s13059-016-0938-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16.Hausser J, Strimmer K. Entropy inference and the James–Stein estimator, with application to nonlinear gene association networks. J Mach Learn Res. 2009;10(7):1469–1484. [Google Scholar]
  • 17.Hicks SC, Townes FW, Teng M, Irizarry RA. Missing data and technical variability in single-cell RNA-sequencing experiments. Biostatistics. 2018;19(4):562–578. doi: 10.1093/biostatistics/kxx053. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18.Jaccard P. The distribution of the flora in the alpine zone. 1. NewPhytol. 1912;11(2):37–50. [Google Scholar]
  • 19.Kharchenko PV, Silberstein L, Scadden DT. Bayesian approach to single-cell differential expression analysis. Nat Meth. 2014;11(7):740–742. doi: 10.1038/nmeth.2967. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 20.Kiselev VY, Andrews TS, Hemberg M. Challenges in unsupervised clustering of single-cell RNA-seq data. Nat Rev Genet. 2019;20(5):273–282. doi: 10.1038/s41576-018-0088-9. [DOI] [PubMed] [Google Scholar]
  • 21.Kiselev VY, Kirschner K, Schaub MT, Andrews T, Yiu A, Chandra T, Natarajan KN, Reik W, Barahona M, Green AR, et al. Sc3: consensus clustering of single-cell RNA-seq data. Nat Meth. 2017;14(5):483–486. doi: 10.1038/nmeth.4236. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22.Kullback S, Leibler RA. On information and sufficiency. Ann Math Stat. 1951;22(1):79–86. doi: 10.1214/aoms/1177729694. [DOI] [Google Scholar]
  • 23.Li H, Courtois ET, Sengupta D, Tan Y, Chen KH, Goh JJL, Kong SL, Chua C, Hon LK, Tan WS, et al. Reference component analysis of single-cell transcriptomes elucidates cellular heterogeneity in human colorectal tumors. Nat Genet. 2017;49(5):708–718. doi: 10.1038/ng.3818. [DOI] [PubMed] [Google Scholar]
  • 24.Liu B, Li C, Li Z, Wang D, Ren X, Zhang Z. An entropy-based metric for assessing the purity of single cell populations. Nat Commun. 2020;11(1):1–13. doi: 10.1038/s41467-020-16904-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25.Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with deseq2. Genome Biol. 2014;15(12):1–21. doi: 10.1186/s13059-014-0550-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26.Luecken MD, Theis FJ. Current best practices in single-cell RNA-seq analysis: a tutorial. Mol Syst Biol. 2019;15(6):e8746. doi: 10.15252/msb.20188746. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27.Lun AT, McCarthy DJ, Marioni JC. A step-by-step workflow for low-level analysis of single-cell RNA-seq data with bioconductor. F1000Research. 2016;5:2122. doi: 10.12688/f1000research.9501.1. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 28.Macosko EZ, Basu A, Satija R, Nemesh J, Shekhar K, Goldman M, Tirosh I, Bialas AR, Kamitaki N, Martersteck EM, et al. Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets. Cell. 2015;161(5):1202–1214. doi: 10.1016/j.cell.2015.05.002. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29.McInnes L, Healy J, Melville J. Umap: Uniform manifold approximation and projection for dimension reduction, arXiv preprint arXiv:1802.03426. 2018
  • 30.Ortega A. Introduction to graph signal processing. Cambridge University Press; 2022. [Google Scholar]
  • 31.Peterman N, Levine E. Sort-seq under the hood: implications of design choices on large-scale characterization of sequence-function relations. BMC Genom. 2016;17:1–17. doi: 10.1186/s12864-016-2533-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 32.R Core Team R: a language and environment for statistical computing, R foundation for statistical computing, Vienna. https://www.R-project.org/, 2020
  • 33.Rand WM. Objective criteria for the evaluation of clustering methods. J Am Stat Assoc. 1971;66(336):846–850. doi: 10.1080/01621459.1971.10482356. [DOI] [Google Scholar]
  • 34.Risso D, Purvis L, Fletcher RB, Das D, Ngai J, Dudoit S, Purdom E. Clusterexperiment and rsec: a bioconductor package and framework for clustering of single-cell and other large gene expression datasets. PLoS Comput Biol. 2018;14(9):e1006378. doi: 10.1371/journal.pcbi.1006378. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 35.Robinson MD, McCarthy DJ, Smyth GK. edger: a bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010;26(1):139–140. doi: 10.1093/bioinformatics/btp616. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 36.Scrucca L, Fop M, Murphy TB, Raftery AE. mclust 5: clustering, classification and density estimation using gaussian finite mixture models. R J. 2016;8(1):289. doi: 10.32614/RJ-2016-021. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 37.Shannon CE. A mathematical theory of communication. Bell Syst Tech J. 1948;27(3):379–423. doi: 10.1002/j.1538-7305.1948.tb01338.x. [DOI] [Google Scholar]
  • 38.Shorrocks AF. The class of additively decomposable inequality measures. Econom: J Econom Soc. 1980;48(3):613–625. doi: 10.2307/1913126. [DOI] [Google Scholar]
  • 39.Smith RC, MacArthur BD. Information-theoretic approaches to understanding stem cell variability. Curr Stem Cell Rep. 2017;3(3):225–231. doi: 10.1007/s40778-017-0093-5. [DOI] [Google Scholar]
  • 40.Stuart T, Butler A, Hoffman P, Hafemeister C, Papalexi E, Mauck WM, III, Hao Y, Stoeckius M, Smibert P, Satija R. Comprehensive integration of single-cell data. Cell. 2019;177(7):1888–1902. doi: 10.1016/j.cell.2019.05.031. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 41.Stumpf PS, Du X, Imanishi H, Kunisaki Y, Semba Y, Noble T, Smith RC, Rose-Zerili M, West JJ, Oreffo RO, et al. Transfer learning efficiently maps bone marrow cell types from mouse to human using single-cell RNA sequencing. Commun Biol. 2020;3(1):1–11. doi: 10.1038/s42003-020-01463-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 42.Svensson V, Natarajan KN, Ly L-H, Miragaia RJ, Labalette C, Macaulay IC, Cvejic A, Teichmann SA. Power analysis of single-cell RNA-sequencing experiments. Nat Meth. 2017;14(4):381. doi: 10.1038/nmeth.4220. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 43.Svensson V, Vento-Tormo R, Teichmann SA. Exponential scaling of single-cell RNA-seq in the past decade. Nat Protoc. 2018;13(4):599–604. doi: 10.1038/nprot.2017.149. [DOI] [PubMed] [Google Scholar]
  • 44.Consortium Tabula Muris. Single-cell transcriptomics of 20 mouse organs creates a tabula muris. Nature. 2018;562(7727):367. doi: 10.1038/s41586-018-0590-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 45.Theil H. Economics and Information, theory studies in mathematical and managerial economics. North-Holland Publishing Company; 1967. [Google Scholar]
  • 46.Tian L, Dong X, Freytag S, Lê Cao K-A, Su S, JalalAbadi A, Amann-Zalcenstein D, Weber TS, Seidi A, Jabbari JS, et al. Benchmarking single cell RNA-sequencing analysis pipelines using mixture control experiments. Nat Meth. 2019;16(6):479–487. doi: 10.1038/s41592-019-0425-8. [DOI] [PubMed] [Google Scholar]
  • 47.Townes FW, Hicks SC, Aryee MJ, Irizarry RA. Feature selection and dimension reduction for single-cell RNA-seq based on a multinomial model. Genome Biol. 2019;20(1):1–16. doi: 10.1186/s13059-019-1861-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 48.Trapnell C. Defining cell types and states with single-cell genomics. Genome Res. 2015;25(10):1491–1498. doi: 10.1101/gr.190595.115. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 49.Van Dijk D, Sharma R, Nainys J, Yim K, Kathail P, Carr AJ, Burdziak C, Moon KR, Chaffer CL, Pattabiraman D, et al. Recovering gene interactions from single-cell data using data diffusion. Cell. 2018;174(3):716–729. doi: 10.1016/j.cell.2018.05.061. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 50.Van Rossum G, Drake FL. Python 3 reference manual. Scotts Valley: CreateSpace; 2009. [Google Scholar]
  • 51.Virtanen P, Gommers R, Oliphant TE, Haberland M, Reddy T, Cournapeau D, Burovski E, Peterson P, Weckesser W, Bright J, et al. Scipy 1.0: fundamental algorithms for scientific computing in python. Nat Meth. 2020;17(3):261–272. doi: 10.1038/s41592-019-0686-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 52.Zheng GX, Terry JM, Belgrader P, Ryvkin P, Bent ZW, Wilson R, Ziraldo SB, Wheeler TD, McDermott GP, Zhu J, et al. Massively parallel digital transcriptional profiling of single cells. Nat Commun. 2017;8(1):1–12. doi: 10.1038/ncomms14049. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 53.Zhu C, Byrd RH, Lu P, Nocedal J. Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound-constrained optimization. ACM Trans Math Softw (TOMS) 1997;23(4):550–560. doi: 10.1145/279232.279236. [DOI] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

12859_2023_5424_MOESM1_ESM.pdf (1.2MB, pdf)

Additional file 1. Two additional figures and one additional table: Expression of Hbb-bs in the Stumpf data set (Supplementary Figure 1); Feature selection benchmarking of scEC and scEC-impute (Supplementary Figure 2); and Contingency table between cell annotations provided in the Stumpf data set and scEC labels.

Data Availability Statement

The Svensson data was downloaded as file “svensson_chromium_control.h5a” from https://data.caltech.edu/records/1264 The Tian data sets were downloaded from https://github.com/LuyiTian/sc_mixology The Zheng data was downloaded as set of files “Gene / cell matrix (filtered)” in section “Single Cell 3’ Paper: Zheng et al. 2017” from https://support.10xgenomics.com/single-cell-gene-expression/datasets The Stumpf data was downloaded as file “RData.zip” from http://dx.doi.org/10.17632/csvm3kpkxd.1 The Tabula Muris data was downloaded as set of files “Single-cell RNA-seq data from microfluidic emulsion (v2)” from https://tabula-muris.ds.czbiohub.org

An R package for the implementation of the described methods is available on github https://github.com/mjcasy/scEC


Articles from BMC Bioinformatics are provided here courtesy of BMC

RESOURCES