Cytocipher determines significantly different populations of cells in single-cell RNA-seq data

Brad Balderson; Michael Piper; Stefan Thor; Mikael Bodén

doi:10.1093/bioinformatics/btad435

. 2023 Jul 14;39(7):btad435. doi: 10.1093/bioinformatics/btad435

Cytocipher determines significantly different populations of cells in single-cell RNA-seq data

Brad Balderson ^1,^✉, Michael Piper ², Stefan Thor ³, Mikael Bodén ⁴

Editor: Anthony Mathelier

PMCID: PMC10368802 PMID: 37449901

Abstract

Motivation

Identification of cell types using single-cell RNA-seq is revolutionizing the study of multicellular organisms. However, typical single-cell RNA-seq analysis often involves post hoc manual curation to ensure clusters are transcriptionally distinct, which is time-consuming, error-prone, and irreproducible.

Results

To overcome these obstacles, we developed Cytocipher, a bioinformatics method and scverse compatible software package that statistically determines significant clusters. Application of Cytocipher to normal tissue, development, disease, and large-scale atlas data reveals the broad applicability and power of Cytocipher to generate biological insights in numerous contexts. This included the identification of cell types not previously described in the datasets analysed, such as CD8+ T cell subtypes in human peripheral blood mononuclear cells; cell lineage intermediate states during mouse pancreas development; and subpopulations of luminal epithelial cells over-represented in prostate cancer. Cytocipher also scales to large datasets with high-test performance, as shown by application to the Tabula Sapiens Atlas representing >480 000 cells. Cytocipher is a novel and generalizable method that statistically determines transcriptionally distinct and programmatically reproducible clusters from single-cell data.

Availability and implementation

The software version used for this manuscript has been deposited on Zenodo (https://doi.org/10.5281/zenodo.8089546), and is also available via github (https://github.com/BradBalderson/Cytocipher).

1 Introduction

Single-cell RNA-seq (ScRNA-seq) has transformed the study of complex tissues, allowing data-driven characterization of cell types through the bioinformatics analysis of genome-wide gene expression measurements across thousands to millions of cells (Alexander Wolf et al. 2018). The current standard bioinformatics analysis pipeline consists of normalization of read counts, gene/feature selection, dimensionality reduction, nearest-neighbour graph construction, and finally clustering and visualization based upon the nearest-neighbour graph (Alexander Wolf et al. 2018).

The most widely used approach for clustering is the Leiden algorithm (Traag et al. 2019), which considers the nearest-neighbour graph as the primary representation of the single-cell data (Traag et al. 2019), and thereby does not consider the original gene expression measurements. To determine if clusters of cells represent transcriptionally distinct populations post hoc differential expression analysis is performed. Based on visual assessment of the resulting differentially expressed (DE) genes, Leiden clustering is adjusted or clusters manually altered if automated clustering cannot align with observed transcriptionally distinct cells.

The manual merging process, in addition to being irreproducible and non-quantifiable, can be impractical. For example, (i) when cell populations do not have any one single marker gene that differentiates one cluster from others but may be unique due to a combination of genes, which is difficult to visualize, or (ii) when the presence of tens to hundreds of cell types make the manual merging process time-consuming and prone to human error.

Ensuring that scRNA-seq clusters represent transcriptionally distinct populations of cells is a currently unaddressed issue in the unsupervised analysis of scRNA-seq data. To address this fundamental problem, we present Cytocipher, a ‘scverse’ compatible package that integrates within the Python Scanpy (Alexander Wolf et al. 2018) ecosystem for single-cell analysis. Cytocipher implements two novel algorithms that address the aforementioned problems; (i) ‘code-scoring’—a novel per-cell enrichment scoring method that is more sensitive at scoring unique combinations of genes within individual cells when compared with current methods, and (ii) ‘cluster-merge’—a statistical algorithm that begins with over-clustered single-cell data and iteratively tests clusters for significant transcriptional difference, merging clusters that are not mutually significantly different. The resulting output is significantly transcriptionally different single-cell populations, the marker genes for each population (Box 1), and per-cell enrichment scores for marker gene sets (Box 1). Inline graphic

To probe the versatility of Cytocipher, we applied it to synthetic data and a range of biological samples—healthy adult tissue, development tissue, diseased tissue, and multi-tissue atlas scale data. Each application resulted in the identification of biological insights not previously described, underscoring the power of Cytocipher as a universal method for improving single-cell population detection on previously analysed or newly generated scRNA-seq data. Cytocipher also scales to large datasets with high-test performance, as shown by application to the Tabula Sapiens Atlas (Tabula Sapiens Consortium 2022) representing >480 000 cells.

The ongoing Atlas scale single-cell data generation (Tabula Sapiens Consortium 2022) is producing increasingly complex and heterogeneous single-cell datasets, underscoring the importance of a statistically rigorous test to ensure cell cluster determination represents transcriptional distinction. Cytocipher represents the first analysis method that frames transcriptionally distinct single-cell population detection as a quantifiable statistical test, thereby representing an important advance in the analysis of scRNA-seq data.

2 Materials and methods

2.1 Cytocipher code-scoring

‘Cytocipher code-scoring’ requires log-normalized single-cell gene expression and tentative cluster labels as input (Fig. 1A1). Tentative cluster labels are determined from clustering, such as the Leiden method utilized in the Scanpy package. If input of marker genes is not provided these are automatically determined (Box 1 and Fig. 1A2). By default, marker genes per cluster are gene sets of up to five differentially upregulated genes (Benjamini–Hochberg adjusted P-value <.05) determined by ranking genes using a one-versus-rest mode of comparison with Welch’s t-statistic. With marker genes per cluster (G, Fig. 1A2), negative gene sets per cluster are determined (Fig. 1A3). Given the set of clusters (C), we define the negative gene sets (N_i) for a given cluster i with marker gene set G_i as the set of gene set differences between G_i and each other marker gene set G_j for which the intersection of G_i and G_j is greater than or equal to a minimum intersection count [Equation (1)].

Figure 1. — Overview of *Cytocipher* for cluster-specific ‘code-scoring’ and significant cluster analysis with ‘cluster-merge’. (A) Schematic illustration of the ‘Cytocipher code-scoring’ method. Briefly, (1) Gene expression and cluster annotation are provided as input; (2) determination of marker genes is performed; (3) genes, which are positive indicators versus negative indicators of cluster membership are determined through comparison of cluster marker genes; (4) for a given cluster, cells co-expressing the negative gene set/s score 0, while the remaining cells are scored for positive gene set co-expression; and (5) repetition of 1–4 for each cluster yields a diagnostic heatmap scoring each cell for membership of each cluster. (B) Schematic illustration of ‘Cytocipher cluster-merge’, which performs a test of significance difference between cluster pairs, merging those which are mutually non-significantly different. Briefly, (1) per-cell enrichment scores are determine using the process in (A); (2) where a large number of clusters are present, MNNs can determine cluster pairs for comparison; (3) cluster scores are compared with a statistical test to determine significant versus non-significant clusters, with non-significant cluster pairs being merged to create new cluster labels; (4) enrichment scores are redetermined based on the new cluster labels as an output diagnostic, and the process can be repeated with the new cluster labels until convergence

N_{i} = {G_{j} - G_{i} : (0 \leq j < | C |) \land (i \neq j) \land (| G_{j} \cap G_{i} | \geq f (G_{i}))} .

(1)

The minimum intersection size function, f [Equation (2)], acts as a winsorization for small gene set sizes, such that if the marker gene set for cluster i (G_i) has a size of 1 or 2 genes, all genes must intersect with the marker gene set of cluster j (G_j) for the marker gene set difference to be added to the negative gene set (N_i). If a marker gene set has two or more genes, all genes except for one must overlap with the marker gene set of another cluster j (G_j) for the marker set difference to be added to the negative gene set (N_i). Full overlap between clusters i and j marker genes results in no set difference between them, and therefore negative gene set subtraction with respect to cluster j is not performed for cluster i.

f (S) = {\begin{matrix} | S |, & if | S | \leq 2. \\ | S | - 1, & if | S | > 2. \end{matrix}

(2)

Letting Ω represent the cell by cluster code-score matrix of size $T \times | C |$ , we define the code-score for cluster i and cell κ ( $Ω_{κ, i}$ ) as in Equation (3) (Fig. 1A4 and 5).

Figure 5. — *Cytocipher* identifies intermediate cell states in mouse pancreas development. (A) UMAP of mouse E15.5 pancreas cells, with cells annotated by broad cell types; ductal, Ngn3 low EP cells, Ngn3 high EP cells, pre-endocrine, epsilon, beta, alpha, and delta. (B) UMAP with cells over-clustered at Leiden resolution 3.5, producing 30 clusters. (C) Heatmap of *Cytocipher* code-scores per cluster, as shown for clusters in panel (B). (D) UMAP of 16 significant clusters determined from ‘Cytocipher cluster-merge’ applied to the clusters depicted in panel (B). Novel intermediate states outlined. (E) Top marker genes determined for the clusters depicted in (D), with cells belonging to the relevant clusters of the marker genes outlined (see text for interpretation). (F) Heatmap of *Cytocipher* code-scores per cell (row) and cluster (column), for merged clusters depicted in panel (D)

Ω_{κ, i} = {\begin{matrix} \log 2 (\frac{H (Θ_{κ}, G_{i})}{P (Θ_{κ}, G_{i})}), & if H (Θ_{κ}, G_{i}) \geq f (G_{i}) \land M (Θ_{κ}, N_{i}) = 0. \\ 0, & otherwise . \end{matrix}

(3)

Θ represents a cell by gene expression matrix of size T × F, such that $Θ_{κ, g}$ represents the expression of gene g in cell κ. $H (Θ_{κ}, G_{i})$ is the number of genes in G_i expressed in cell κ [Equation (4)], $M (Θ_{κ}, N_{i})$ is the number of negative gene sets co-expressed in cell κ [Equation (5)], and $P (Θ_{κ}, G_{i})$ is the probability of jointly expressing genes in G_i at the observed expression level in cell κ across all cells [Equation (6)].

H (Θ_{κ}, S) = | {g : Θ_{κ, g} > 0} | w here g ϵ S,

(4)

M (Θ_{κ}, N_{i}) = | {N_{i j} : H (Θ_{κ}, N_{i j}) > f (N_{i j})} | w here N_{i j} ϵ N_{i},

(5)

P (Θ_{κ}, G_{i}) = \prod_{g ϵ G_{i}} \frac{\sum_{v = 0}^{T} 1_{Θ_{v, g} \geq Θ_{κ, g}}}{T} .

(6)

An additional enrichment scoring method, ‘coexpr-scoring’ [ $Ψ_{κ, i}$ , Equation (7)], is defined without consideration of negative gene sets (N_i) to assess the effect of negative gene sets on cluster scoring.

Ψ_{κ, i} = {\begin{matrix} \log 2 (\frac{H (Θ_{κ}, G_{i})}{P (Θ_{κ}, G_{i})}), & if H (Θ_{κ, G_{i}}) \geq f (G_{i}) . \\ 0, & otherwise . \end{matrix}

(7)

Intuitively, the greater the number of genes co-expressed and the greater the expression level, the higher the total code-score and coexpr-score; with ‘code-scoring’ ensuring non-intersecting marker genes are not expressed in similar clusters.

2.2 Cytocipher cluster-merge

‘Cluster-merge’ requires tentative cluster labels across cells (z) and Ω (defined section above). Ω can also be substituted for other enrichment score matrices, such as $Ψ$ (Fig. 1B1). Initial clusters should be over-clustered (see Box 1). Practically, Leiden clustering at a resolution between 2 and 4 produces over-clusters, which can subsequently be tested with Cytocipher. Uniform Manifold Approximation (UMAP) and the Cytocipher enrichment heatmap can be utilized to assess over-clustering. In the UMAP, over-clusters will display as sub-clusters of larger populations with no apparent separation. In the enrichment heatmap (e.g. Fig. 1B1), cross-scoring is indicated by scores along each column split amongst multiple rows (indicating cells in separate clusters having similar scores).

For a given pair of clusters i and j, two independent Student’s t-tests on enrichment scores determine if they are significantly different (Fig. 1B3). Let the notation $Ω_{j \sim i}$ indicate enrichment scores for cluster i across all cells that belong to cluster j. Given the function t below represents Student’s t-value calculation, and the function p below indicates the probability of observing a given t-value using a two-sided test; we define a significantly different pair of clusters (s_ij) as in Equation (10). p_i is the P-value that clusters i and j cells have the same cluster i enrichment scores, while p_j is the P-value the cluster pair cells have the same cluster j enrichment scores (P-value adjustment with Benjamini–Hochberg) (Fig. 1B3). Clusters are significantly different (s_ij) if either the p_i or p_j are below a significance threshold (p_cut). If the cluster pairs are mutually non-significantly different ( $s_{i j} = False$ ), cells from the cluster pairs are merged to create a new set of cluster labels (z_new) (Fig. 1B3).

p_{i} = p (t (Ω_{i \sim i}, Ω_{j \sim i})),

(8)

p_{j} = p (t (Ω_{i \sim j}, Ω_{j \sim j})),

(9)

s_{i j} = (p_{i} < p_{cut}) \lor (p_{j} < p_{cut}) .

(10)

By default, p_cut is taken as 0.01 and each cluster is compared against all others. In the case of many clusters mutual nearest neighbours (MNNs) can be compared to reduce run-time (Fig. 1B2). Cluster MNNs are cluster pairs with overlapping-nearest neighbours using the Euclidean distance metric on a $| C | \times | C |$ matrix Ξ; where Ξ_ij is the mean cluster j enrichment score in the cells of cluster i.

Marker genes, enrichment scoring, and cluster merging can be performed iteratively until convergence ( $z = z_{new}$ ) (Fig. 1B4). In the initial data tested, convergence occurred after a single iteration, prompting this as the default setting.

2.2.1 Enrichment score summarization

High false-positive rates and inflated power due to treating each cell as an independent observation has been previously observed in the context of single-cell differential expression analysis, and is addressed by ‘pseudobulking’ (Lun and Marioni 2017, Squair et al. 2021). We systematically tested different methods for summarizing enrichment scores $Ω_{i \sim i}$ and $Ω_{j \sim i}$ (if scoring for cluster i) prior to significance testing with the goal of maintaining test performance while reducing inflated P-values and correlation of significance with the total number of cells in a cluster pair. Three different methods were compared using the 3000 human peripheral blood mononuclear cells (PBMCs) scRNA-seq data; determine k evenly spaced quantiles (K-Quantiles), taking the mean of enrichment scores falling into k evenly spaced quantile bins (K-Bins), and clustering enrichment scores into k groups with K-Means clustering and averaging the enrichment scores in each group (K-Means). The PBMCs were first over-clustered (see Section 2 below) prior to applying each summarization method with k = 15.

The original PBMC annotations were used as a ground-truth, such that cluster pairs falling within a cell type were considered non-significantly different and pairs from different cell types were considered significantly different. Test performance was measured by the area under the curve (AUC) of the receiver operating characteristic (ROC) (AUROC). Cell abundance bias was measured using Spearman’s correlation with −log10(P-value) and log2(cluster pair cell count). P-value cutoffs to discriminate significant versus non-significant cluster pairs using Cytocipher were from 0 to 1 in increments of $2^{- c}$ for each integer c in range $0 \leq c \leq 100$ . Parameter sensitivity analysis was then performed varying k and measuring the cell abundance bias and test performance for each summarization method. The above tests prompted usage of the K-Quantiles method with k = 15 as the default method for enrichment score summarization prior to cluster pair significance testing with Cytocipher.

2.3 Simulation study

Splatter (Zappia et al. 2017) v1.20 was used to simulate initial scRNA-seq read counts followed by post hoc alteration of counts to simulate groups that express a unique combination of genes, rather than a single gene marker. Simulation parameters were estimated from the PBMC gene cell count matrix with splatEstimate. Ten thousand genes and 3000 cells were then simulated from the learnt parameters. Eight groups of random cells were assigned a unique random set of 3 to 7 genes drawn from 15 genes to become DE.

DE was simulated by multiplying gene counts within a cluster by factor $φ$ drawn from Gaussian distribution $N (μ_{φ} = 6, σ_{φ} = 1)$ . Zero count cells were set to the median non-zero count value prior. Zero count cells set to the median was determined by sampling proportion τ from Gaussian $N (μ_{τ} = 0.85, σ_{τ} = 0.05)$ . Default preprocessing and normalization with Scanpy (Alexander Wolf et al. 2018) v1.8.2 was subsequently performed on the simulated counts.

Enrichment scoring was performed for each method with default parameters. For ‘Scanpy-scoring’, we refer to the sc.tl.score_genes, a re-implementation of the scoring approach provided by the Seurat package (Satija et al. 2015).

Cells with positive scores for a given group were classified as belonging to the group, and cells with 0 or less scores for a given method as outside of the group. Enrichment score-based annotations were quantitatively benchmarked against the original labels for each group using accuracy, precision, recall, and F1 score calculated with ‘Sklearn’ (Pedregosa et al. 2011). ROC curves were generated equivalently to the description in Section 2.2.7.

2.4 E18.5 mouse hypothalamus study

Single-cell count matrices, cell annotations, and UMAP coordinates were obtained from the original study (Yaghmaeian Salmani et al. 2022). Gene expression was normalized using log1p counts normalized to total cell counts with the median read counts observed across all cells in Scanpy. Enrichment methods were applied using default settings on the neuropeptidergic-dopaminergic neuron subtypes defined in the original study with the annotated marker genes. Quantitative benchmarking of enrichment methods was performed equivalently to the simulated data above.

2.5 Human 3K PBMC study

The human 3K PBMC single-cell count matrices were downloaded and preprocessed as per the Scanpy PBMC 3K tutorial (Alexander Wolf et al. 2018), except for total count and cell mitochondrial percentage regression. Leiden clustering was performed at resolution 0.7 producing the same eight clusters in the vignette. For the ‘Cytocipher cluster-merge’ analysis, the PBMC scRNA-seq was first over-clustered using Leiden resolution 2.0 resulting in 19 clusters. ‘Cytocipher code-scoring’ was performed with default settings. ‘Cytocipher cluster-merge’ was also performed with default settings.

2.6 Mouse E15.5 pancreas development study

The mouse E15.5 pancreas development scRNA-seq spliced and unspliced read count matrices and cell annotations were downloaded as per the ‘Scvelo’ (Bergen et al. 2020) v0.2.4 RNA Velocity Basics tutorial. Genes with minimum shared counts of intronic versus exonic reads of four were filtered prior to log1p and total library size normalization. The velocity UMAP embedding was then reproduced from the original ‘Scvelo’ publication using the same parameters and commands detailed in the ‘Scvelo’ tutorial aforementioned. Cells were over-clustered using a Leiden resolution of 3.5. ‘Cytocipher cluster-merge’ was performed with default parameters and cluster pairs were considered significantly different at an adjusted P-value cutoff of .045.

2.7 Prostate cancer study

The ‘Anndata’ h5ad file containing the normalized gene expression, cell annotations, and UMAP was obtained from the original study (Tuong et al. 2021). The cell–cell neighbourhood graph was constructed using 30 nearest neighbours and over-clustered with a Leiden resolution of 4.0 to create 47 tentative clusters. Marker genes per tentative cluster were determined as a maximum of four genes and minimum of one gene ranked by log-FC using a one-versus-rest mode of comparison for each cluster for which Welch’s t-statistic was above 14. ‘Cytocipher cluster-merge’ was otherwise applied with default values, with an adjusted P-value cutoff of .0325 used to determine significantly different cluster pairs.

Differentially abundant (DA) subpopulations of cells between cancer and normal cells were determined using the ‘Milopy’ implementation of the ‘Milo’ method (Dann et al. 2022). Parameters used were the same as ‘Milopy’ tutorial on mouse gastrulation scRNA-seq. Exceptions were the usage of group as the linear model inputted to milo. DA_nhoods function in order to call DA cell populations between cancer and normal samples in the prostate cancer data.

2.8 Tabula sapiens study

The ‘Anndata’ h5ad file containing all preprocessed data were obtained from the original study (Tabula Sapiens Consortium 2022). In all applications mentioned, default parameters were used with 15 CPUs on the highly variable genes.

To generate ground-truth labels of significant versus non-significant cluster pairs, all cell types with more than 20 cells were split into three random subgroups. Cell types with <20 cells were not split. Cluster pairs were labelled as significantly different in this ground-truth if they were sampled from different cell types and non-significantly different if sampled from the same cell type. ‘Cytocipher cluster-merge’ was initially applied on the artificial subgroups with an adjusted P-value cutoff of .5 to examine the topmost over-merged clusters. ROC curves were determined using the ground-truth cluster pair labels compared against significant versus non-significant pairs determined at the respective P-value cutoff outputted by ‘Cytocipher cluster-merge’. P-value cutoffs used were the same as detailed in Section 2.2.1. Memory usage was measured as the change in peak random-access memory (RAM). Each artificial subgroup was then downsampled to a maximum of 15 cells resulting in 7385 cells.

‘Cytocipher cluster-merge’ was compared with Single-cell Significance of Hierarchical Clustering (‘Sc-SHC’) on a small subset of 500 cells representing 12 randomly sampled cell types from the 177 total cell types in the downsampled 7385 cell data. Both methods were run with default parameters on the highly variable genes. ‘Sc-SHC’ was run on the ambient RNA corrected ‘DecontX’ (Yang et al. 2020) read counts.

3 Results

3.1 Cytocipher code-scoring

‘Cytocipher code-scoring’ aims to specifically score cells for gene set activity based on the combination of the genes expressed. This contrasts with existing methods, which utilize an additive score whereby positive scores are possible even if a single gene in the query set is expressed (Aibar et al. 2017, Alexander Wolf et al. 2018, Dries et al. 2021). As input, ‘code-scoring’ uses log-normalized gene expression across all cells and cluster labels for each cell (Fig. 1A1). If not also inputted by the user, cluster marker genes are determined (Fig. 1A2 and Box 1, see Section 2). To reduce cross-scoring between clusters (Box 1) with overlapping marker genes, negative sets of genes are determined as the non-overlapping set of genes between a given cluster and all other clusters for which gene set overlap occurs (Fig. 1A3, see Section 2). The consideration of negative gene sets is a novel algorithmic detail of ‘Cytocipher code-scoring’ which, as shown below, increases the precision of scoring cells for cluster membership. This enables visualization of precise cell cluster membership, even in the case of complex combinations of marker gene expression between cell populations.

With the unique combination of marker genes and negative genes demarcating each cluster, each cell is scored for co-expression of the marker genes and for not expressing the negative gene sets (Fig. 1A4). To achieve this, cells are filtered, such that cells co-expressing any of the negative gene sets are removed and the remaining cells expressing some combination of the marker genes are kept for quantitative scoring (Fig. 1A4). The quantitative scoring for gene set co-expression is the log of the inverse joint probability of expressing each gene in the marker gene set greater than or equal to the level of expression observed in the given cell (Fig. 1A4; see Section 2 for full details). The result is a score for each cell quantifying the correspondence between the cell and cluster marker genes (Fig. 1A5).

3.2 Cytocipher cluster-merge

‘Cytocipher cluster-merge’ tests whether clusters are significantly different from one another, using the cluster code-scores for each cell and cluster labels as input (Fig. 1B1). In addition to the ‘code-scoring’ approach described above to quantify cell and cluster marker enrichment, we also implemented ‘coexpr-scoring’ and ‘Giotto Parametric Analysis of Gene Set Enrichment (Giotto PAGE)’ as alternatives, with the former representing the same scoring approach as ‘code-scoring’ without the negative gene set cell filtering, and the latter a method based on normalizing per-gene fold-changes (Dries et al. 2021). To avoid pair-wise comparison of clusters for scalability, each cluster can be compared against its MNNs, or for datasets with fewer clusters pair-wise comparison is performed (Fig. 1B2; see Section 2).

With the set of clusters to be compared, a bi-directional statistical test is performed on the cluster enrichment scores to identify non-significantly different clusters (Fig. 1B3, see Section 2). For a given pair of clusters, clusters i and j, we perform two-sided Welch’s t-tests; one comparing cluster i scores between cells in cluster i and cells in cluster j, the other comparing cluster j scores between cells in cluster i and cells in cluster j. If either of these tests results in significant P-values, the clusters are significantly different (Fig. 1B3; see Section 2). If both tests give non-significant P-values, the clusters are mutually non-significantly different, and are therefore merged to create one cluster (Fig. 1B3; see Section 2).

Finally, we re-perform ‘code-scoring’ with the new cluster definitions, and repeat the merging process until no further cluster merging occurs. Thus, clusters and cluster marker gene sets are used to score cells, and the scores are then used to update the clusters and their marker genes (Fig. 1B4).

3.3 Cytocipher negative gene set subtraction outperforms existing methods for binary classification task on simulated data

To test ‘Cytocipher code-scoring’ for specifically scoring cells expressing a unique combination of genes, we simulated scRNA-seq data where clusters are defined by unique gene combinations, rather than single marker genes (Fig. 2A and B; see Section 2 for simulation details). As an example, Cluster 7 in our simulation expresses the unique gene combination 3094, 5076, 6481, 702, and 7274, but none of these individual genes are specifically expressed only in Cluster 7. We subsequently scored each cell for the cluster marker genes using six different approaches: ‘Cytocipher code-scoring’, ‘Cytocipher coexpr-scoring’ (‘code-scoring’ without negative set subtraction), ‘Giotto PAGE’ (Dries et al. 2021), and ‘Scanpy-score’ (Alexander Wolf et al. 2018) (Fig. 2C). Qualitatively, in every cluster ‘code-scoring’ specifically highlighted the cells belonging to the relevant clusters, even in the difficult cases where no specific marker gene demarcated the cluster. The clearest examples of this are Clusters 4 and 7. In the former case, all other methods scored cells in Clusters 0 and 4 when attempting to score for Cluster 4 (see cross-scoring, Box 1). In the latter case cross-scoring was observed with Cluster 2 when scoring for Cluster 7.

Figure 2. — ‘Cytocipher code-scoring’ outperforms existing methods for enrichment of cell population marker gene combinations in simulated data. (A) UMAP of splatter simulated scRNA-seq data with eight groups, each of which has a unique combination of gene expression, with single marker genes demarcating a cluster from all other cells only present in a few cases. (B) UMAPs display the log-counts-per-million for each simulated differential gene between groups, illustrating gene combinatorics for cluster definition. (C) Results from performing per-cell-gene enrichment using the marker genes for each group illustrated in (B). Each sub-panel focuses on a particular group of cells, with the enrichment scores for ‘Cytocipher code-scoring’, ‘Cytocipher coexpr-scoring’, ‘Giotto PAGE’ enrichment, and ‘Scanpy-scoring’ shown alongside the highlighted cluster; clearly indicating high specificity with minimal background for the ‘code-scoring’ approach. (D) Enrichment score prediction metrics for predicting each group based on the enrichment scores for each method. To illustrate the effect of the negative gene set subtraction utilized by ‘Cytocipher code-scoring’, we also show performances for this approach applied to ‘Giotto PAGE’ scores [Giotto (−)] and ‘Scanpy-scoring’ [Scanpy (−)]. Positive scores for each enrichment method were used to indicate cluster membership. Accuracy, F1 score, precision, and recall measures are shown as violin plots, with each point in the violin representing the measure for a given group of cells, and separate violins indicating the enrichment method used for scoring. Tables above each violin plot summarize the average score for each method (μ). (E) ROC curve, for testing differences between artificial sub-clusters of each simulated group using ‘Cytocipher cluster-merge’ and each scoring method

We next sought to quantitatively test how specifically methods could score cluster membership, and tested performance on a binary classification task to achieve this. In the binary classification case, cells with a positive score for a given method are predicted to belong to the given cluster, and all other cells are considered outside of the cluster. Repeating this for each cluster, we calculated the accuracy, F1 score, precision, and recall of each enrichment method (Fig. 2D). Additionally, we also benchmarked ‘Giotto-PAGE’ and ‘Scanpy-score’ with negative gene set subtraction [Giotto (−) and Scanpy (−)], to indicate if our negative gene set subtraction approach could improve scoring performance for other methods. Based on these metrics, methods from greatest to lowest performance were Giotto (−), ‘code-scoring’, Scanpy (−), ‘Giotto PAGE’, ‘Scanpy-scoring’, and then ‘coexpr-scoring’. The separation between scoring methods was mostly driven by our negative gene set subtraction approach; ‘code-scoring’, Giotto (−), and Scanpy (−) clearly outperformed the alternatives based on all metrics, except for recall, where negative gene set subtraction reduced performance. This suggests that the negative gene set subtraction can increase false negatives at the single-cell level. However, when considering that the F1 score represents the harmonic mean of both the precision and recall (Pedregosa et al. 2011), the higher F1 of the negative gene set subtraction approach suggests improvement to overall binary classification performance regardless of the scoring method.

‘Cytocipher cluster-merge’ was then tested utilizing different cluster membership scoring approaches on the simulated data using ROC curves (Fig. 2E). Artificial clusters were created by generating three random subsets of cells within each of the simulated groups to generate ground-truth significant and non-significant cluster pairs. AUROC was used to quantify test performance, with an AUROC of 1 indicating perfect ability to discriminate significant versus non-significant cluster pairs and 0.5 as random classification. In this case, regardless of scoring method, we found ‘Cytocipher cluster-merge’ to yield high performance for determining significant versus non-significant cluster pairs, with AUROC exceeding 0.99 for all scoring approaches (Fig. 2E).

Overall, Cytocipher performed well in the simulated test case, prompting further benchmarking on real data where transcriptionally distinct cells are defined by complex combinatorial gene co-expression.

3.4 ‘Cytocipher code-scoring’ performs comparably to existing methods for binary classification task of E18.5 hypothalamus neuronal subtypes

To assess the utility of ‘Cytocipher code-scoring’ on real data with complex combinatorial gene co-expression, the enrichment scoring methods were applied to our previously published mouse E18.5 hypothalamus scRNA-seq data (Yaghmaeian Salmani et al. 2022). The hypothalamus represents an important test case, as it contains 79 different neuronal subtypes delineated by a small number of marker genes expressed in various unique combinations (Fig. 3A). Cluster annotations of the top marker genes from the original publication were used here for gene enrichment analysis (Fig. 3A and B).

Figure 3. — ‘Cytocipher code-scoring’ performs comparably to existing methods for enrichment of cell population marker gene combinations in hypothalamus neuronal subtypes. (A) UMAP E18.5 hypothalamus scRNA-seq depicting 79 neuronal subtypes clusters. (B) Heatmap displaying per-cell enrichment scores for cluster membership depicted in (A); each row is a cell, and each column is a neuronal subtype cluster. Cells and clusters are ordered such that perfect correspondence of cells to score for their respective cluster lies on the diagonal of the heatmap. Hence, scoring outside of the diagonal indicates ‘cross-scoring’, where cells also score for gene expression outside of their cluster membership. Lack of scoring along the diagonal indicates cell gene expression does not match cluster membership. (C) Enrichment score prediction metrics for predicting each neuronal subtype cluster based on the enrichment scores for each method. Positive scores for each enrichment method were used to indicate cluster membership. Accuracy, F1 score, precision, and recall measures are shown as violin plots, with each point in the violin representing the measure for a given neuronal subtype, and separate violins indicating the enrichment method used for scoring. Tables above each violin plot summarize the average score for each method (μ). (D) ROC curve, for testing differences between artificial sub-clusters of each neuronal subtype using ‘Cytocipher cluster-merge’ and each scoring method

Visualizing the per-cell enrichment scores for each cluster (Fig. 3B), ‘Cytocipher code-scoring’ and ‘coexpr-scoring’ qualitatively appeared the most specific to cells of each cluster, with minimal cross-cluster scoring evident. However, some clusters for the ‘code-scoring’ appeared to have few cells with positive scores for their respective cluster, while the ‘coexpr-scoring’ did have positive scores. This indicates cases where a given cluster co-expressed genes belonging to the negative set but is not annotated as such. In our previous paper (Yaghmaeian Salmani et al. 2022), we manually annotated this data based on the top marker genes following the standard Scanpy pipeline (Alexander Wolf et al. 2018), hence highlighting the importance of the ‘code-scoring’ approach for detecting co-expression in complex cases where manual annotation is prone to error without utilization of a method which can visualize unique combinatorial gene co-expression.

‘Giotto PAGE’ enrichment showed considerable cross-clustering scoring, although there is clearly higher enrichment for cells belonging to a cluster than outside of the cluster (Fig. 3B). ‘Scanpy-scoring’ on the other hand performed poorly, with considerable cross-cluster scoring and few apparent cases of specificity for scoring cells for their respective clusters.

To quantitatively benchmark the enrichment scoring methods for scoring cell cluster membership, we again tested using the binary classification task, as in the simulated data example above (Fig. 3C). While the ‘Cytocipher code-scoring’ and ‘coexpr-scoring’ give a higher accuracy even compared with Scanpy (−) and Giotto (−), F1 was comparable to other methods, with slight improvements in precision. Recall was clearly better for ‘Giotto PAGE’ and Scanpy than the Cytocipher methods.

‘Cytocipher cluster-merge’ test was again tested using ROC curves using the artificial subsampling strategy to generate over-clusters (see previous section). Scanpy-scoring was the only approach that did not exceed an AUROC of 0.9, while ‘coexpr-score’ had the highest AUROC (0.9715), followed by ‘Giotto PAGE’ (0.9605), ‘code-score’ (0.9565), and Giotto (−) (0.9485) (Fig. 3D).

Overall, all scoring methods performed worse on the binary classification task compared with the simulated data, but the Cytocipher methods had the most favourable performance. Despite this indication of less specific scoring of cluster membership in challenging real data, ‘Cytocipher cluster-merge’ still showed good performance with AUROC exceeding 0.9 (with the exception of Scanpy-scoring) (Fig. 3D).

Importantly, ‘Cytocipher code-scoring’ can also be used as a diagnostic tool with which to visualize imperfections in the manual annotations of gene co-expression, and the clear cases of few scores for a given cell type indicate labelling of unique gene co-expression where another cluster must also co-express this combination. This is evidenced by contrasting ‘code-scoring’ with ‘coexpr-scoring’, where the additional 0 scores in the former indicate cells that expressed genes in the negative gene sets, indicating non-distinct gene expression with at least one other cluster. Hence, in this case, the reduced recall observed is a useful diagnostic with which to assess the uniqueness of cluster marker gene set co-expression at the individual cell level.

3.5 ‘Cytocipher cluster-merge’ reveals novel heterogeneity in 3K human PBMCs

To test the capability of ‘code-scoring’ to identify over-clustering in scRNA-seq data and subsequently correct this using ‘cluster-merge’, we utilized the 3K human PBMC data, a common test dataset for single-cell methods due to its well characterized cell types. As shown in the Scanpy tutorial (Alexander Wolf et al. 2018), Leiden clustering produced eight clusters of cells, which could be labelled without any over-clustering or manual merging based upon marker genes to identify B-cells, CD14+ monocytes, Dendritic cells, FCGR3A monocytes, megakaryocytes, CD4+ T cells, CD8+ T cells, and Natural Killer (NK) cells (Fig. 4A).

Figure 4. — Cross-cluster comparisons of code-scores can be used to merge over-clustered single-cell data, revealing novel heterogeneity in human PBMCs. (A) UMAP of Human 3K PBMC scRNA-seq, clustered at Leiden resolution 0.7 to create eight clusters. Cells are annotated by cell type; B-cells, CD4+ T cells, CD8+ T cells, NK cells, Dendritic cells, CD14+ Monocytes, FCGR3A+ Monocytes, and Megakaryocytes. (B) Over-clustering of the PBMC data at Leiden resolution 2.0, producing 19 clusters. (C) Heatmap depicting *Cytocipher* code-scores, where each row is a cell and each column is a cluster. Cells and clusters are ordered such that scores along the diagonal indicate scores of cells for their respective cluster. Cross-scoring of cells for different clusters is indicated with boxes; corresponding to the over-clustering introduced in (B). (D) Violin plots show example cluster significance tests. Non-significant and significant cluster pairs are indicated with crosses and ticks, respectively. The y-axes are the code-score, and the four violins within each plot indicate combinations of cells belonging to each cluster scoring for their own cluster and the cluster being compared against. Clusters are significantly different when a significant P-value is observed for either set of cluster scores when comparing cells between clusters. (E) UMAP depicts the PBMCs after merging non-significantly different clusters. New distinct populations are outlined. Small boxes of UMAPs depict the log-cpm expression of genes in the new distinct subpopulations (GZMK and GZMH). (F) The same as (C), except for the new clusters after merging. (G) UMAPs of the data clustered at increasing Leiden resolutions from left to right. Cases of over- and under-clustering are outlined and labelled. The new distinct clusters appear at resolution 1.5 (as outlined and lablled), at the same point where over-clustering is still clearly evident

We purposely over-clustered this data, setting a Leiden resolution of 2 to identify 18 clusters, which primarily over-clustered the CD4+ T cells (Fig. 4B). ‘Cytocipher code-scoring’, using automatically identified marker genes clearly indicated this over-clustering, with all of the CD4+ T cell sub-clusters clearly all cross-scoring with one another (Fig. 4C). Importantly, using the previous PBMC cell type annotations as the ground-truth (Fig. 4A) to determine non-significantly different clusters in the over-clustered cells (Fig. 4A and B), we tested a range of possible implementations for particular design choices for cluster pair significance testing within Cytocipher, measuring test performance and bias in test significance (Supplementary Fig. S1; see Section 2). The best implementation significantly de-correlated test significance with cell pair abundance (Spearman’s ρ = 0.083) and had high-test performance (AUROC = 0.9803), thus prompting this implementation as the default setting for ‘Cytocipher cluster-merge’ (Supplementary Fig. S1; see Section 2).

We subsequently applied ‘Cytocipher cluster-merge’ with the default settings on this over-clustered data, visualizing the distributions of the code-scores using the combination of cluster profiles and cells in each cluster to show significant versus non-significant clusters (Fig. 4D). This approach correctly identified no significant difference between Clusters 3 and 9, nor Clusters 1 and 5 (Fig. 4D); the former are over-clusters of the CD14+ monocytes, while the latter are over-clusters of the CD4+ T cells (Fig. 4A and B). Clusters 0 and 9, representing B-cells and dendritic cells, were correctly identified as different (Fig. 4A, B, and D). Interestingly, all clusters were correctly merged after one iteration to the ground-truth cell types, except Clusters 2 and 8; sub-clusters of the CD8+ T cells (Fig. 4D). Note that after the merge operation Cluster 2 was relabelled to Cluster 5 (Fig. 4E). Inspection of the marker genes for these clusters revealed specific expression of GZMK in merged Cluster 5, and GZMH in merged Cluster 8 (Fig. 4E). GZMK+ CD8+ T cells were recently identified as an important CD8+ T cell subtype in ageing mice, where specific expansion of these cell types is associated with inflammation, thought to be caused by increased secretion of the protein of GZMK (Mogilenko et al. 2021). GZMH, on the other hand, has been used as a marker to identify activated cytotoxic effector CD8+ T cells in contrast with transitional CD8+ T cells (Li et al. 2019).

‘Cytocipher cluster-merge’ did not detect additional subpopulations of CD4+ T cells, despite evidence of detectable gene expression differences between naive and memory CD4+ T cells in larger PBMC scRNA-seq datasets (Hao et al. 2021). We therefore tested if increasing the number of marker genes used by ‘Cytocipher’ or analysing the CD4+ T cells in isolation would distinguish additional CD4+ T cell populations (Supplementary Fig. S2). Increasing the marker genes to 10 revealed an additional CD4+ T cell population expressing ACTG1, while analysing CD4+ T cells in isolation demarcated naive and memory T cells (Supplementary Fig. S2). Across all cell types, CD4+ memory marker S100A4 (Weatherly et al. 2015) shows high expression in monocytes and dendritic cells as well as a subset of CD4+ T cells (Supplementary Fig. S2C and D). Hence, S100A4 was not distinguished as a marker gene for T cell over-clusters due to broad expression in other cell types, and was not therefore used to distinguish CD4+ naive versus memory T cells in the ‘Cytocipher cluster-merge’ analysis performed using all cell types. This illustrates that additional cell populations may be distinguished by Cytocipher when analysing subsets of cells, due to differences in marker gene determination when using the one-versus-rest approach of differential gene expression.

‘Cytocipher cluster-merge’ was able to detect additional heterogeneity in CD8+ T cells, not previously described in this dataset (Fig. 4E and F). We tested whether this division of CD8+ T cells was detectable at a Leiden resolution without also over-clustering the data (Fig. 4G). At Leiden resolution 0.5, we identified under-clustering of the CD14+ monocytes and dendritic cells. Resolution 1 corrected this under-clustering but resulted in over-clustering of the CD4+ T cells. Resolution 1.25 mis-clustered the dendritic and CD14+ monocytes, with the over-clustering of the CD4+ T cells retained. We first observed the new distinct subdivision of CD8+ T cells at resolution 1.5, at the same time there is over-clustering of the CD4+ T cells and CD14+ monocytes (Fig. 4G).

While it was not possible to exhaustively test every possible resolution, we could not find a resolution where the CD8+ T cell subpopulations were identifiable using Leiden without also over-clustering. Hence, while ‘Cytocipher cluster-merge’ automatically identified these important subpopulations, Leiden analysis could not detect these subpopulations without over-clustering other cell types.

3.6 ‘Cytocipher cluster-merge’ identifies intermediate states in pancreas development

In the case of scRNA-seq analysis of developmental data, intermediate cell states can exist as a continuum of cells between progenitor cells and mature cell types (Trapnell et al. 2014). The continuity of developmental scRNA-seq makes determination of discrete populations with standard clustering ambiguous, since clear separation of cell populations is not evident. Given that Cytocipher considers gene expression combinations for scoring cluster membership, we hypothesized that Cytocipher would be sensitive to transcriptionally distinct intermediate states, potentially allowing for identification of fine-grained branch-points that represent lineage decisions towards terminal cell fates.

To test this hypothesis, we utilized scRNA-seq data from mouse E15.5 pancreas development (Bastidas-Ponce et al. 2019), since this has been extensively studied with the tool ‘Scvelo’ (Bergen et al. 2020), an analysis method for predicting the direction of cell differentiation. We first reproduced ‘Scvelo’ predictions of cell differentiation in the pancreas data (Bergen et al. 2020) (Fig. 5A). Over-clustering the pancreas data to 30 clusters, and subsequently applying ‘Cytocipher code-scoring’ showed considerable cross-scoring between clusters (Fig. 5B). Importantly, the ‘code-scoring’ approach was still highly binary in the case of continuous developmental clusters; as evidenced by the lack of low-level background cross-scoring, allowing for clear identification of over-clusters (Fig. 5B).

Applying ‘cluster-merge’ resulted in 16 distinctly different clusters, including several intermediate cell states not identified with the original annotations (Fig. 5A, C, and F). We then examined the marker genes automatically determined by ‘Cytocipher’ after cluster merging, which revealed that the intermediate states identified expressed several important regulators of pancreas development (Fig. 5E). For instance, Cluster 6 clearly grouped an intermediate state transition from early endocrine progenitor (EP) cells, primarily expressing Anxa2, to Btbd17 expressing EP cells; a known marker of Ephi EP cells (Scavuzzo et al. 2018) (Fig. 5D and E). Clusters 11 and 14 clearly highlighted alternate lineage paths, with the former expressing Isl1 and the latter Pax4, both important developmental transcription factors controlling pancreas development (Ahlgren et al. 1997, Napolitano et al. 2015) (Fig. 5D). Importantly, Pax4 has been previously shown as a lineage determinant for pancreatic beta cell fate (Napolitano et al. 2015), and, in-line with ‘Scvelo’ differentiation predictions, Cluster 14 appeared to represent an early decision towards this cell type not identified in the original annotations (Fig. 5A). Cytocipher cluster 9 appeared to branch off Cluster 11 (Fig. 5D), with Cluster 9 cells expressing Hhex, a homeodomain transcription factor required for pancreatic delta cell determination (Zhang et al. 2014) (Fig. 5E).

Cytocipher was able to identify potentially important intermediate states in developmental scRNA-seq data, evident when examining the outputted marker genes for the determined intermediate lineage states, which included several well-known transcription factors required for determination of the cell fates highlighted (Fig. 5D and E).

3.7 Cytocipher detects over-represented subpopulations in prostate cancer

Comparison between normal and tumour scRNA-seq samples can reveal over- and under- represented cell types in cancer, indicating important cell types that may contribute to cancer progression and therapeutic resistance. We hypothesized that application of Cytocipher within a cancer context would identify cancer relevant cell subpopulations and the key genes defining these populations. To test this hypothesis, we compared Cytocipher significant clusters in the context of prostate cancer to significantly DA cell populations independently determined with ‘Milo’ (Dann et al. 2022) (Fig. 6); a novel DA analysis method that considers sample information and the cell–cell neighbourhood graph to determine populations of cells DA between two or more conditions (Dann et al. 2022).

Figure 6. — *Cytocipher* detects over-represented subpopulations in prostate cancer. (A) UMAP of scRNA-seq from normal and tumour tissue from the human prostate, consisting of 15 492 cells. Data as provided by Tuong *et al.* (2021). (B) *Cytocipher* significant clusters after merging 47 clusters produced from Leiden clustering at resolution 4.0 (depicted as inner UMAP). (C) Heatmap depicting *Cytocipher* code-scores, where each row is a cell and each column is a cluster. Cells and clusters are ordered such that scores along the diagonal indicate scores of cells for their respective cluster. Code-scores for the 29 significant clusters in panel (B) are shown. (D) Inner volcano plot depicts −log10(p-adjusted) on the y-axis and log-fold change on the x-axis testing for differential abundance of cells (DA) on the cell–cell neighbourhood graph using ‘Milo’. The outer UMAP depicts non-significant cells in the background, and significantly DA cellular neighbourhoods are highlighted. (E) UMAP highlights the significant clusters detected by *Cytocipher* that were independently determined as over-represented in prostate cancer by ‘Milo’ DA analysis. (F) Violin plots with −log10(p-adjusted) on the y-axis and *Cytocipher* significant clusters on the x-axis. A horizontal line indicates the p-adjusted cutoff of 0.05, cells above this line are considered significantly DA between tumour and cancer samples. (G) Equivalent to (F), except for the original prostate cell types depicted in (A). (H) *Cytocipher* code-scores for the prostate cancer over-represented clusters detected by *Cytocipher*, from left-to-right code-scores depicted are specific to Clusters 25, 6, 0, and 9. The marker gene set for each respective cluster is shown within the UMAP plots. (I) Equivalent to (H), except depicting Clusters 17, 16, 12, 2, and 8. These clusters also score for Cluster 9 as depicted in (H), but are significantly different due to additional gene co-expression

The prostate cancer scRNA-seq dataset included integrated healthy and cancerous prostate tissue representing 15 492 cells from 10 patients and 24 samples (Tuong et al. 2021) (Fig. 6A). Twelve cell types are annotated by the original authors, with prostate-specific cell types including basal (BE), hillock (HE), club cells (CE), and two luminal epithelial cell types (LE-KLK3 and LE-KLK4) (Tuong et al. 2021) (Fig. 6A).

We over-clustered the prostate cancer scRNA-seq data using Leiden clustering at resolution 4.0 to produce 47 clusters, from which 29 significantly different clusters were determined using ‘Cytocipher cluster-merge’ (Fig. 6B). Examination of the Cytocipher code-scores per significant cluster indicated highly specific marker gene co-expression differentiated each cluster (Fig. 6C). Independent application of ‘Milo’ revealed four populations of over-represented cells in the prostate cancer samples, all of which tightly corresponded to significant clusters identified by Cytocipher but not by the original annotations (Fig. 6D–G). In line with the original publication and the prostate cancer literature (Chen et al. 2021, Tuong et al. 2021), three DA populations were within the luminal epithelial cells (Fig. 6A and E). The remaining population represented a T cell subpopulation (Fig. 6A and E).

Highly specific Cytocipher code-scores highlighted each of the DA luminal epithelial cells (Fig. 6H). Cytocipher marker genes used for ‘code-scoring’ for each of the DA luminal epithelial cells were also highly relevant for prostate cancer. For example, Cluster 25 included STMN1 as a marker gene, a known over-expressed oncoprotein in prostate cancer that controls cell proliferation (Chakravarthi et al. 2018) (Fig. 6H). Cluster 6 included PCA3, HPGD, and FABP5; the former produces a protein, which is a key prostate cancer biomarker in urine tests (Farha and Salami 2022), while the latter two genes have been implicated as metabolic regulators of prostate cancer (Senga et al. 2018, Wang et al. 2020). The third significant luminal epithelial subpopulation (Cluster 0) had marker genes, which included ACPP, the gene encoding the prostate cancer biomarker PAP (Xu et al. 2019).

The ‘Milo’ DA T cell subpopulation, which was also detected by Cytocipher (Cluster 9), did not express highly specific code-scores (Fig. 6H). Cluster 9 was however distinct from other Cytocipher clusters by not scoring for marker genes of other T cell subsets (Fig. 6I). This emphasizes the importance of the bi-directional test utilized by ‘Cytocipher cluster-merge’, which requires subpopulations to be mutually non-significantly different for the cluster pair to be merged (Fig. 1B3).

Overall, these findings demonstrated that Cytocipher can produce cancer relevant subpopulations defined by biologically relevant genes, such as cancer biomarkers and regulators.

3.8 Cytocipher analysis of >480 000 cells in tabula sapiens atlas reveals scalability and high-test performance

Atlas scale scRNA-seq data can contain hundreds of cell types and exceed hundreds of thousands of cells. Analysing such data requires highly scalable and accurate methods that are also time and memory efficient. We tested the scalability and performance of Cytocipher by analysing the Tabula Sapiens Atlas (Tabula Sapiens Consortium 2022), which consists of 483 152 cells across 24 tissues and 177 cell types (Tabula Sapiens Consortium 2022) (Fig. 7A). The cell type annotations were manually annotated by a team of experts for each tissue type (Tabula Sapiens Consortium 2022), thereby presenting an extremely high-quality ground-truth of cell types to test Cytocipher. First, we applied ‘Cytocipher code-scoring’ on the 177 cell types to confirm annotation quality (Fig. 7B). This revealed high correspondence between cell type annotations and cell type code-scores; confirming the 177 manual annotations were transcriptionally distinct (Fig. 7B).

Figure 7. — *Cytocipher* scales to >480 000 cells with high-test performance. (A) Tabula Sapiens UMAP depicting 483 152 cells sampled from 24 tissues by the Tabula Sapiens Consortium (2022). (B) Heatmap depicting *Cytocipher* code-scores for the 177 cell types annotated within the Tabula Sapiens dataset. Each row is a cell and each column is a cluster. Cells and clusters are ordered such that scores along the diagonal indicate scores of cells for their respective cluster. (C) Sankey diagram, indicating the top four over-merged clusters by *Cytocipher* when testing with artificial random sub-groups of the 177 cell types. The left side of the diagram indicates the sub-grouped cell types, and the right side indicates the sub-grouped cell types merged by *Cytocipher*. (D) ROC curve depicting the true-positive rate on the y-axis and the false-positive rate on the x-axis at different P-value cutoffs using ‘Cytocipher cluster-merge’ applied to artificial random subgroups of the 177 cell types using either ‘code-scoring’, ‘coexpr-scoring’, ‘Scanpy-scoring’, or ‘Scanpy-scoring’ with negative gene set subtraction [Scanpy (−)]. AUC for each scoring method is indicated in the legend. (E) Bar charts indicate time and memory usage of *Cytocipher* when analysing the 483 152 cells across artificial cell type sub-groups. ‘Giotto PAGE’ could not be performed on the full dataset due to memory limitations. (F and G) Equivalent to (D) and (E), except downsampling each of the cell type subgroups to a maximum of 15 cells to reduce the dataset size to 7385 cells, enabling ‘Giotto PAGE’ to be run for comparison and examine the effect of fewer cells on test performance. (H) Small comparison between *Cytocipher* and ‘Sc-SHC’, using the 500 cells and 12 random cell types subsetted from the 177 cell types. Ticks on the right hand side of the Sankey diagrams indicate artificial over-clusters were correctly merged, while crosses indicate incorrect merging. (I) Bar plot indicating run-time for the different methods, with methods depicted on the y-axis and run-time on the x-axis

Treating the high-quality Tabula Sapiens annotations as a ground-truth, we then randomly sampled within each cell type to artificially split cell types into three artificial clusters (see Section 2). This gave ground-truth over-clusters consisting of 503 cell populations with which to test method performance, whereby a pair of artificial clusters occurring within a cell type was considered non-significantly different and over-clusters from different cell types were considered significantly different.

Application of ‘Cytocipher cluster-merge’ to these artificial over-clusters revealed some over-merging, with the top four most over-merged clusters consisting of artificial clusters both within- and between-cell type annotations (Fig. 7C). However, closer examination of these over-merged cases revealed shared biology, such that fine-grained cell types had been merged. For example, smooth muscle cells from different parts of the body had been merged, such as vascular and bronchial smooth muscle (Fig. 7C). Different T cell subtypes were also merged to a single cluster. CD4+ and CD8+ T cells were detected however, and their subtypes merged into independent CD4+ and CD8+ clusters (Fig. 7C). Thus, while Cytocipher did not perform perfectly per the ground-truth, related function amongst the cell types merged was observed.

We then performed a systematic test of Cytocipher performance, examining how different P-value cutoffs affected the true-positive and false-positive rates for classifying significant versus non-significantly different cluster pairs using the aforementioned ground-truth (Fig. 7D). ROC curves were generated from applying ‘Cytocipher cluster-merge’ with the different scoring approaches (Fig. 7D). ‘Cytocipher cluster-merge’ with both ‘code-scoring’ and ‘coexpr-scoring’ achieved almost perfect AUROC values; 0.9974 for the former and 0.9976 for the latter (Fig. 7D). Scoring with Scanpy and ‘Scanpy (−)’ however had reduced performance (AUROC<0.9). Time and memory requirements for this task were also measured. ‘cluster-merge’ with ‘code-scoring’ took 8.5 h to compute, while ‘coexpr-scoring’ was 15 h. Scanpy-scoring had the fastest run-time of 3.8 h. Memory requirements were 12.2 GB of RAM for ‘code-scoring’, 1.19 GB for ‘coexpr-scoring’, and 11.43 GB for Scanpy. We were not able to run ‘Giotto PAGE’ because it exceeded our available RAM capacity.

Due to the high memory requirements for ‘Giotto PAGE’, we then tested the effect of subsetting the data. Downsampling to a maximum of 15 cells for each one of the 503 over-clusters resulted in a total of 7385 cells. Performing the same benchmarking as with the full dataset, and including ‘Giotto PAGE’ due to the lower RAM requirements with the reduced dataset, revealed greatly reduced time and memory performance with negligible effects on test performance (Fig. 7F and G). ‘Cytocipher cluster-merge’ applied with ‘code-scoring’, ‘coexpr-scoring’, ‘Giotto PAGE’, and Giotto (−) all produced high AUROC values; 0.9930, 0.9937, 0.9945, and 0.9940, respectively (Fig. 7F). Scanpy-scoring, even with negative gene set subtraction [Scanpy (−)], had reduced performance for cluster merging compared to the other scoring methods (AUROC<0.9). Run-time was reduced from hours (full dataset) to <15 min in each case (Fig. 7G). Memory usage was reduced from GB to MB, with Giotto having a 10-fold higher memory requirement of 200 MB, while ‘code-scoring’ and ‘coexpr-scoring’ peaked at 23.7 and 23.39 MB, respectively. Notably, Scanpy-scoring had a slower run-time and higher memory requirement on the smaller dataset compared with the other scoring methods, which was the opposite observation for the full dataset (Fig. 7E and G). This indicates a higher baseline resource requirement for Scanpy-scoring, but better scalability. Overall, ‘Cytocipher cluster-merge’ performance was maintained despite downsampling the total number of cells, suggesting this is an effective strategy to perform cluster significance analysis with Cytocipher in a timely manner on large datasets. Therefore Cytocipher scales to large atlas data with >480 000 cells and hundreds of cell types with high performance.

While Cytocipher represents the first method for cluster significance analysis in scRNA-seq data, during the preparation of this manuscript another distinctly different method, ‘Sc-SHC’, was publicly released (Grabski et al. 2023). ‘Sc-SHC’ differs from Cytocipher in the usage of raw count data, Poisson count modelling, Silhouette scoring, and utilization of a dendrogram to test branch-points for significantly different groups of clusters rather than cluster pairs (Grabski et al. 2023). Further contrasting with Cytocipher, ‘Sc-SHC’ does not include the output of marker gene sets, which significantly differentiate each cluster. We compared run-time and performance between Cytocipher and ‘Sc-SHC’ in a small test case, where 12 cell types were randomly sampled from the 177 cell types in the downsampled Tabula Sapiens data, resulting in 500 cells (Fig. 7H and I). Application of ‘Cytocipher cluster-merge’ with ‘code-scoring’, ‘coexpr-scoring’, and ‘Giotto PAGE’ produced the same result; correctly merging all the artificial over-clusters to the original cell types (Fig. 7H). Giotto (−), Scanpy-scoring, and Scanpy (−) had near-perfect performance, but failed to correctly merge a small proportion of the over-clusters for specific cell types (Fig. 7H). ‘Sc-SHC’ correctly merged lacrimal gland functional unit cell over-clusters, but incorrectly merged all other cell types (Fig. 7H). ‘Cytocipher cluster-merge’ was also significantly faster, taking <3.5 s on the 500 cells and 34 over-clusters, while ‘Sc-SHC’ had a 178.80 s compute time (Fig. 7I). Overall, this test suggested Cytocipher cluster significance analysis is more accurate and time efficient than the concurrently developed method ‘Sc-SHC’.

4 Discussion

ScRNA-seq is having a transformative effect upon our understanding of cellular diversity in multicellular organisms, and several technological advances are resulting in an extensive build-up of data. However, the bioinformatics analysis of scRNA-seq data remains challenging and involves problematic post hoc manual curation. Cytocipher represents the first method that performs statistical analysis of tentative single-cell clusters to ensure that cell groups represent significantly transcriptionally distinct populations. In contrast to standard scRNA-seq clustering, which operates on the nearest-neighbour graph, Cytocipher refers back to the original gene expression measurements, and performs per-cell enrichment scoring for cluster marker genes and a bi-directional statistical test to infer significantly different clusters. In this study, we demonstrate that this statistical approach ensures the final single-cell clusters align with transcriptionally distinct populations of cells allowing for improved insights in various biological contexts.

When analysing the human PBMC data, we did not find any Leiden resolution that could accurately uncover all of the transcriptionally distinct cell populations without also over-clustering. However, Cytocipher was able to identify the transcriptionally distinct populations. Importantly, several methods have been developed concerned with varying cluster hyper-parameters to identify an optimal parameter set (Risso et al. 2018, Zappia and Oshlack 2018, Senabouth et al. 2019, Blumenberg and Ruggles 2020, Virshup et al. 2020, Liu et al. 2021, Patterson-Cross et al. 2021, Shahsavari et al. 2022). While the criteria to define ‘optimal’ clustering is tool dependent (Risso et al. 2018, Zappia and Oshlack 2018, Senabouth et al. 2019, Blumenberg and Ruggles 2020, Virshup et al. 2020, Liu et al. 2021, Patterson-Cross et al. 2021, Shahsavari et al. 2022), the tool Clustree (Zappia and Oshlack 2018) e.g. defines this as the highest resolution clustering where further increasing the clustering resolution results in arbitrary rearrangements of cells to increase the number of clusters. All of these approaches pre-suppose that there exists a set of hyper-parameters where the clustering method utilized can identify the underlying transcriptionally distinct cell populations, despite this not being a part of the objective function utilized by current clustering methods. Our results suggest that future methods should instead evaluate single-cell clustering by determining if tentative clusters are supported by cells within the clusters displaying distinctly different gene co-expression. This approach improves interpretability and is more in-line with the goal of scRNA-seq analysis for cell type discovery.

Cytocipher considers cluster significance analysis at the level of cluster pairs. An alternative approach, as utilized by the concurrently developed ‘Sc-SHC’ (Grabski et al. 2023), is to use a dendrogram to group clusters and test the branch-points to compare groups of clusters. When comparing Cytocipher and ‘Sc-SHC’, we observed the former correctly grouped cell types while the latter incorrectly merged several distinctly different cell types. The improved performance of Cytocipher could be due to an averaging effect within ‘Sc-SHC’ when comparing cluster groups. For instance, when ‘Sc-SHC’ tests a particular branch point representing groups of clusters, on average most of the clusters may not be significantly different from one another resulting in a call of non-significant difference for all grouped clusters. However, when Cytocipher performs comparisons at the level of cluster pairs, subsets of clusters are identified as distinctly different. To confirm this, systematic testing of every methodological difference between the two methods would be required, which is outside of the scope of the current study. Based on our current results, we suggest the choice of cluster pairs as the unit of testing for future methods of cluster significance analysis to avoid potential averaging effects.

One limitation of Cytocipher is its dependency on the cell–cell neighbourhood graph indirectly by the input of Leiden clusters. Because the graph is constructed based upon linearly reduced dimensions from highly variable genes, the clusters tested by Cytocipher will define transcriptionally distinct cell populations that are biased towards the highly variable genes used to construct the original neighbourhood graph. It therefore remains an interesting and open question whether alteration of the gene sets utilized to define the initial clusters has an effect on which cells are significantly different from one another; particular in the case of neurons, where several different biological features are used to define neuronal subtypes in different contexts (Phillips et al. 2019).

A promising alternative approach to scoring cells for cluster membership based on marker genes, which may overcome the aforementioned limitation of Cytocipher, could be utilization of matrix factorization approaches, such as consensus non-negative matrix factorization (cNMF) (Kotliar et al. 2019). This would enable learning cell type-specific gene expression profiles without prior selection of highly variable genes, dimensionality reduction, and initial clustering. Approaches, such as cNMF, are similarly limited to current clustering approaches by requiring the input of the number of factors however, and are optimized with stochastic methods that alter results between runs. Therefore, future work could extend cluster significance analysis to determine cell populations based upon unsupervised latent factor extraction methods, to circumvent the preprocessing steps typical of scRNA-seq clustering and improve latent factor reproducibility.

5 Conclusion

Cytocipher represents an important advance in the analysis of scRNA-seq data and will enable the reproducible and statistically significant identification of single-cell populations that have distinct gene co-expression. In the future, we anticipate similar implementations of cluster significance analysis could be utilized to analyse other complex and heterogeneous biological data, such as spatial RNA-seq and scATAC-seq.

Supplementary Material

btad435_Supplementary_Data

Click here for additional data file.^{(823.3KB, pdf)}

Acknowledgements

We thank Gabriel Foley, Samuel Davis, Sanjana Tule, Ariane Mora, and Samuel Granjeaud for critically reading this manuscript and providing valuable discussion/feedback.

Contributor Information

Brad Balderson, School of Chemistry and Molecular Biosciences, University of Queensland, Brisbane, QLD 4072, Australia.

Michael Piper, School of Biomedical Sciences, University of Queensland, Brisbane, QLD 4072, Australia.

Stefan Thor, School of Biomedical Sciences, University of Queensland, Brisbane, QLD 4072, Australia.

Mikael Bodén, School of Chemistry and Molecular Biosciences, University of Queensland, Brisbane, QLD 4072, Australia.

Author contributions

B.B. designed methods, performed analysis, made figures, and wrote manuscript. M.P., S.T., and M.B. edited the manuscript and informed method design.

Supplementary data

Supplementary data are available at Bioinformatics online.

Conflict of interest

None declared.

Funding

This work was supported by the Australian Government Research Training Program (RTP) scholarship and RTP tuition fee offset (to B.B.); and Australian Research Council [grant number DP220100985 to S.T., M.P., and M.B.].

Data availability

Automatic download of data and reproducible code for the analyses performed are available at https://github.com/BradBalderson/Cytocipher manuscript. Cytocipher is available at https://github.com/BradBalderson/Cytocipher. The software version used for this manuscript has been deposited on Zenodo at https://doi.org/10.5281/zenodo.8089546. The github tutorial includes full code to reproduce the mouse pancreas analysis. All data analysed are publicly available. The E18.5 hypothalamus neuronal subtype data is available in the Gene Expression Omnibus (GEO) database at accession GSE154995 with the cell annotations and UMAP coordinates available in the supplemental data of the original paper (Yaghmaeian Salmani et al. 2022). The E15.5 mouse pancreas data is available in GEO at accession GSE132188 (Bastidas-Ponce et al. 2019). The processed data was downloaded using Scvelo with the scv.datasets.pancreas() function. The human PBMC 3K data are hosted by 10X Genomics, and were downloaded via wget https://cf.10xgenomics.com/samples/cell-exp/1.1.0/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz. The prostate cancer data was downloaded from http://www.prostatecellatlas.org/ with wget https://cellgeni.cog.sanger.ac.uk/prostatecellatlas/prostate_portal_300921.h5ad. The Tabula Sapiens data is available from figshare https://figshare.com/projects/Tabula_Sapiens/100973 and was downloaded via wget https://figshare.com/ndownloader/files/34702114.

References

Ahlgren U, Pfaff SL, Jessell TM. et al. Independent requirement for ISL1 in formation of pancreatic mesenchyme and islet cells. Nature 1997;385:257–60. 10.1038/385257a0. [DOI] [PubMed] [Google Scholar]
Aibar S, González-Blas CB, Moerman T. et al. SCENIC: single-cell regulatory network inference and clustering. Nat Methods 2017;14:1083–6. 10.1038/nmeth.4463. [DOI] [PMC free article] [PubMed] [Google Scholar]
Alexander Wolf F, Angerer P, Theis FJ.. SCANPY: large-scale single-cell gene expression data analysis. Genome Biol 2018;19:15. 10.1186/s13059-017-1382-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
Bastidas-Ponce A, Tritschler S, Dony L. et al. Comprehensive single cell mRNA profiling reveals a detailed roadmap for pancreatic endocrinogenesis. Development 2019;146:dev173849. 10.1242/dev.173849. [DOI] [PubMed] [Google Scholar]
Bergen V, Lange M, Peidli S. et al. Generalizing RNA velocity to transient cell states through dynamical modeling. Nat Biotechnol 2020;38:1408–14. 10.1038/s41587-020-0591-3. [DOI] [PubMed] [Google Scholar]
Blumenberg L, Ruggles KV.. Hypercluster: a flexible tool for parallelized unsupervised clustering optimization. BMC Bioinformatics 2020;21:428. 10.1186/s12859-020-03774-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
Chakravarthi BVSK, Chandrashekar DS, Agarwal S. et al. miR-34a regulates expression of the stathmin-1 oncoprotein and prostate cancer progression. Mol Cancer Res 2018;16:1125–37. 10.1158/1541-7786.MCR-17-0230. [DOI] [PMC free article] [PubMed] [Google Scholar]
Chen S, Zhu G, Yang Y. et al. Single-cell analysis reveals transcriptomic remodellings in distinct cell types that contribute to human prostate cancer progression. Nat Cell Biol 2021;23:87–98. 10.1038/s41556-020-00613-6. [DOI] [PubMed] [Google Scholar]
Dann E, Henderson NC, Teichmann SA. et al. Differential abundance testing on single-cell data using k-nearest neighbor graphs. Nat Biotechnol 2022;40:245–53. 10.1038/s41587-021-01033-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
Dries R, Zhu Q, Dong R. et al. Giotto: a toolbox for integrative analysis and visualization of spatial expression data. Genome Biol 2021;22:78. 10.1186/s13059-021-02286-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
Farha MW, Salami SS.. Biomarkers for prostate cancer detection and risk stratification. Ther Adv Urol 2022;14:17562872221103988. 10.1177/17562872221103988. [DOI] [PMC free article] [PubMed] [Google Scholar]
Grabski IN, Street K, Irizarry RA.. Significance analysis for clustering with single-cell RNA-sequencing data. Nature Methods2023. doi: 10.1038/s41592-023-01933-9. [DOI] [PMC free article] [PubMed]
Hao Y, Hao S, Andersen-Nissen E. et al. Integrated analysis of multimodal single-cell data. Cell 2021;184:3573–87.e29. 10.1016/j.cell.2021.04.048. [DOI] [PMC free article] [PubMed] [Google Scholar]
Kotliar D, Veres A, Aurel Nagy M. et al. Identifying gene expression programs of cell-type identity and cellular activity with single-cell RNA-Seq. Elife 2019;8:e43803. 10.7554/eLife.43803. [DOI] [PMC free article] [PubMed] [Google Scholar]
Li H, van der Leun AM, Yofe I. et al. Dysfunctional CD8 T cells form a proliferative, dynamically regulated compartment within human melanoma. Cell 2019;176:775–89.e18. 10.1016/j.cell.2018.11.043. [DOI] [PMC free article] [PubMed] [Google Scholar]
Liu S, Thennavan A, Garay JP. et al. MultiK: an automated tool to determine optimal cluster numbers in single-cell RNA sequencing data. Genome Biol 2021;22:232. 10.1186/s13059-021-02445-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
Lun ATL, Marioni JC.. Overcoming confounding plate effects in differential expression analyses of single-cell RNA-seq data. Biostatistics 2017;18:451–64. 10.1093/biostatistics/kxw055. [DOI] [PMC free article] [PubMed] [Google Scholar]
Mogilenko DA, Shpynov O, Sairam Andhey P. et al. Comprehensive profiling of an aging immune system reveals clonal GZMK+ CD8+ T cells as conserved hallmark of inflammaging. Immunity 2021;54:99–115.e12. 10.1016/j.immuni.2020.11.005. [DOI] [PubMed] [Google Scholar]
Napolitano T, Avolio F, Courtney M. et al. Pax4 acts as a key player in pancreas development and plasticity. Semin Cell Dev Biol 2015;44:107–14. 10.1016/j.semcdb.2015.08.013. [DOI] [PubMed] [Google Scholar]
Patterson-Cross RB, Levine AJ, Menon V.. Selecting single cell clustering parameter values using subsampling-based robustness metrics. BMC Bioinformatics 2021;22:39. 10.1186/s12859-021-03957-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
Pedregosa F, Varoquaux G, Gramfort A. et al. Scikit-learn: machine learning in Python. J Mach Learn Res 2011;12(85):2825−2830. [Google Scholar]
Phillips JW, Schulmann A, Hara E. et al. A repeated molecular architecture across thalamic pathways. Nat Neurosci 2019;22:1925–35. 10.1038/s41593-019-0483-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
Risso D, Purvis L, Fletcher RB. et al. clusterExperiment and RSEC: a bioconductor package and framework for clustering of single-cell and other large gene expression datasets. PLoS Comput Biol 2018;14:e1006378. 10.1371/journal.pcbi.1006378. [DOI] [PMC free article] [PubMed] [Google Scholar]
Satija R, Farrell JA, Gennert D. et al. Spatial reconstruction of single-cell gene expression data. Nat Biotechnol 2015;33:495–502. 10.1038/nbt.3192. [DOI] [PMC free article] [PubMed] [Google Scholar]
Scavuzzo MA, Hill MC, Chmielowiec J. et al. Endocrine lineage biases arise in temporally distinct endocrine progenitors during pancreatic morphogenesis. Nat Commun 2018;9:3356. 10.1038/s41467-018-05740-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
Senabouth A, Lukowski SW, Hernandez JA. et al. ascend: R package for analysis of single-cell RNA-seq data. Gigascience 2019;8:giz087. 10.1093/gigascience/giz087. [DOI] [PMC free article] [PubMed] [Google Scholar]
Senga S, Kawaguchi K, Kobayashi N. et al. A novel fatty acid-binding protein 5-estrogen-related receptor signaling pathway promotes cell growth and energy metabolism in prostate cancer cells. Oncotarget 2018;9:31753–70. 10.18632/oncotarget.25878. Publisher: Impact Journals. [DOI] [PMC free article] [PubMed] [Google Scholar]
Shahsavari A, Munteanu A, Mohorianu I.. ClustAssess: tools for assessing the robustness of single-cell clustering. bioRxiv 2022. doi: 10.1101/2022.01.31.478592. [DOI]
Squair JW, Gautier M, Kathe C. et al. Confronting false discoveries in single-cell differential expression. Nat Commun 2021;12:5692. 10.1038/s41467-021-25960-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
Tabula Sapiens Consortium. The Tabula Sapiens: a multiple-organ, single-cell transcriptomic atlas of humans. Science 2022;376:eabl4896. 10.1126/science.abl4896. [DOI] [PMC free article] [PubMed] [Google Scholar]
Traag VA, Waltman L, van Eck NJ.. From Louvain to Leiden: guaranteeing well-connected communities. Sci Rep 2019;9:5233. 10.1038/s41598-019-41695-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
Trapnell C, Cacchiarelli D, Grimsby J. et al. Pseudo-temporal ordering of individual cells reveals dynamics and regulators of cell fate decisions. Nat Biotechnol 2014;32:381–6. 10.1038/nbt.2859. [DOI] [PMC free article] [PubMed] [Google Scholar]
Tuong ZK, Loudon KW, Berry B. et al. Resolving the immune landscape of human prostate at a single-cell level in health and cancer. Cell Rep 2021;37:110132. 10.1016/j.celrep.2021.110132. [DOI] [PMC free article] [PubMed] [Google Scholar]
Virshup I, Choi J, Cao K-AL. et al. constclust: consistent clusters for scRNA-seq. bioRxiv 2020. doi: 10.1101/2020.12.08.417105. [DOI]
Wang Y, Ma S, Ruzzo WL.. Spatial modeling of prostate cancer metabolic gene expression reveals extensive heterogeneity and selective vulnerabilities. Sci Rep 2020;10:3490. 10.1038/s41598-020-60384-w. [DOI] [PMC free article] [PubMed] [Google Scholar]
Weatherly K, Bettonville M, Torres D. et al. Functional profile of S100A4-deficient T cells. Immun Inflamm Dis 2015;3:431–44. 10.1002/iid3.85. [DOI] [PMC free article] [PubMed] [Google Scholar]
Xu H, Wang F, Li H. et al. Prostatic acid phosphatase (PAP) predicts prostate cancer progress in a population-based study: the renewal of PAP? Dis Markers 2019;2019:7090545. 10.1155/2019/7090545. [DOI] [PMC free article] [PubMed] [Google Scholar]
Yaghmaeian Salmani B, Balderson B, Bauer S. et al. Selective requirement for polycomb repressor complex 2 in the generation of specific hypothalamic neuronal subtypes. Development 2022;149:dev200076. 10.1242/dev.200076. [DOI] [PMC free article] [PubMed] [Google Scholar]
Yang S, Corbett SE, Koga Y. et al. Decontamination of ambient RNA in single-cell RNA-seq with DecontX. Genome Biol 2020;21:57. 10.1186/s13059-020-1950-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
Zappia L, Oshlack A.. Clustering trees: a visualization for evaluating clusterings at multiple resolutions. Gigascience 2018;7:giy083. 10.1093/gigascience/giy083. [DOI] [PMC free article] [PubMed] [Google Scholar]
Zappia L, Phipson B, Oshlack A.. Splatter: simulation of single-cell RNA sequencing data. Genome Biol 2017;18:174. 10.1186/s13059-017-1305-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
Zhang J, McKenna LB, Bogue CW. et al. The diabetes gene Hhex maintains δ-cell differentiation and islet function. Genes Dev 2014;28:829–34. 10.1101/gad.235499.113. [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

btad435_Supplementary_Data

Click here for additional data file.^{(823.3KB, pdf)}

Data Availability Statement

[btad435-B1] Ahlgren U, Pfaff SL, Jessell TM. et al. Independent requirement for ISL1 in formation of pancreatic mesenchyme and islet cells. Nature 1997;385:257–60. 10.1038/385257a0. [DOI] [PubMed] [Google Scholar]

[btad435-B2] Aibar S, González-Blas CB, Moerman T. et al. SCENIC: single-cell regulatory network inference and clustering. Nat Methods 2017;14:1083–6. 10.1038/nmeth.4463. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btad435-B3] Alexander Wolf F, Angerer P, Theis FJ.. SCANPY: large-scale single-cell gene expression data analysis. Genome Biol 2018;19:15. 10.1186/s13059-017-1382-0. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btad435-B4] Bastidas-Ponce A, Tritschler S, Dony L. et al. Comprehensive single cell mRNA profiling reveals a detailed roadmap for pancreatic endocrinogenesis. Development 2019;146:dev173849. 10.1242/dev.173849. [DOI] [PubMed] [Google Scholar]

[btad435-B5] Bergen V, Lange M, Peidli S. et al. Generalizing RNA velocity to transient cell states through dynamical modeling. Nat Biotechnol 2020;38:1408–14. 10.1038/s41587-020-0591-3. [DOI] [PubMed] [Google Scholar]

[btad435-B6] Blumenberg L, Ruggles KV.. Hypercluster: a flexible tool for parallelized unsupervised clustering optimization. BMC Bioinformatics 2020;21:428. 10.1186/s12859-020-03774-1. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btad435-B7] Chakravarthi BVSK, Chandrashekar DS, Agarwal S. et al. miR-34a regulates expression of the stathmin-1 oncoprotein and prostate cancer progression. Mol Cancer Res 2018;16:1125–37. 10.1158/1541-7786.MCR-17-0230. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btad435-B8] Chen S, Zhu G, Yang Y. et al. Single-cell analysis reveals transcriptomic remodellings in distinct cell types that contribute to human prostate cancer progression. Nat Cell Biol 2021;23:87–98. 10.1038/s41556-020-00613-6. [DOI] [PubMed] [Google Scholar]

[btad435-B9] Dann E, Henderson NC, Teichmann SA. et al. Differential abundance testing on single-cell data using k-nearest neighbor graphs. Nat Biotechnol 2022;40:245–53. 10.1038/s41587-021-01033-z. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btad435-B10] Dries R, Zhu Q, Dong R. et al. Giotto: a toolbox for integrative analysis and visualization of spatial expression data. Genome Biol 2021;22:78. 10.1186/s13059-021-02286-2. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btad435-B11] Farha MW, Salami SS.. Biomarkers for prostate cancer detection and risk stratification. Ther Adv Urol 2022;14:17562872221103988. 10.1177/17562872221103988. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btad435-B12] Grabski IN, Street K, Irizarry RA.. Significance analysis for clustering with single-cell RNA-sequencing data. Nature Methods2023. doi: 10.1038/s41592-023-01933-9. [DOI] [PMC free article] [PubMed]

[btad435-B13] Hao Y, Hao S, Andersen-Nissen E. et al. Integrated analysis of multimodal single-cell data. Cell 2021;184:3573–87.e29. 10.1016/j.cell.2021.04.048. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btad435-B14] Kotliar D, Veres A, Aurel Nagy M. et al. Identifying gene expression programs of cell-type identity and cellular activity with single-cell RNA-Seq. Elife 2019;8:e43803. 10.7554/eLife.43803. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btad435-B15] Li H, van der Leun AM, Yofe I. et al. Dysfunctional CD8 T cells form a proliferative, dynamically regulated compartment within human melanoma. Cell 2019;176:775–89.e18. 10.1016/j.cell.2018.11.043. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btad435-B16] Liu S, Thennavan A, Garay JP. et al. MultiK: an automated tool to determine optimal cluster numbers in single-cell RNA sequencing data. Genome Biol 2021;22:232. 10.1186/s13059-021-02445-5. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btad435-B17] Lun ATL, Marioni JC.. Overcoming confounding plate effects in differential expression analyses of single-cell RNA-seq data. Biostatistics 2017;18:451–64. 10.1093/biostatistics/kxw055. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btad435-B18] Mogilenko DA, Shpynov O, Sairam Andhey P. et al. Comprehensive profiling of an aging immune system reveals clonal GZMK+ CD8+ T cells as conserved hallmark of inflammaging. Immunity 2021;54:99–115.e12. 10.1016/j.immuni.2020.11.005. [DOI] [PubMed] [Google Scholar]

[btad435-B19] Napolitano T, Avolio F, Courtney M. et al. Pax4 acts as a key player in pancreas development and plasticity. Semin Cell Dev Biol 2015;44:107–14. 10.1016/j.semcdb.2015.08.013. [DOI] [PubMed] [Google Scholar]

[btad435-B20] Patterson-Cross RB, Levine AJ, Menon V.. Selecting single cell clustering parameter values using subsampling-based robustness metrics. BMC Bioinformatics 2021;22:39. 10.1186/s12859-021-03957-4. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btad435-B21] Pedregosa F, Varoquaux G, Gramfort A. et al. Scikit-learn: machine learning in Python. J Mach Learn Res 2011;12(85):2825−2830. [Google Scholar]

[btad435-B22] Phillips JW, Schulmann A, Hara E. et al. A repeated molecular architecture across thalamic pathways. Nat Neurosci 2019;22:1925–35. 10.1038/s41593-019-0483-3. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btad435-B23] Risso D, Purvis L, Fletcher RB. et al. clusterExperiment and RSEC: a bioconductor package and framework for clustering of single-cell and other large gene expression datasets. PLoS Comput Biol 2018;14:e1006378. 10.1371/journal.pcbi.1006378. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btad435-B24] Satija R, Farrell JA, Gennert D. et al. Spatial reconstruction of single-cell gene expression data. Nat Biotechnol 2015;33:495–502. 10.1038/nbt.3192. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btad435-B25] Scavuzzo MA, Hill MC, Chmielowiec J. et al. Endocrine lineage biases arise in temporally distinct endocrine progenitors during pancreatic morphogenesis. Nat Commun 2018;9:3356. 10.1038/s41467-018-05740-1. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btad435-B26] Senabouth A, Lukowski SW, Hernandez JA. et al. ascend: R package for analysis of single-cell RNA-seq data. Gigascience 2019;8:giz087. 10.1093/gigascience/giz087. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btad435-B27] Senga S, Kawaguchi K, Kobayashi N. et al. A novel fatty acid-binding protein 5-estrogen-related receptor signaling pathway promotes cell growth and energy metabolism in prostate cancer cells. Oncotarget 2018;9:31753–70. 10.18632/oncotarget.25878. Publisher: Impact Journals. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btad435-B28] Shahsavari A, Munteanu A, Mohorianu I.. ClustAssess: tools for assessing the robustness of single-cell clustering. bioRxiv 2022. doi: 10.1101/2022.01.31.478592. [DOI]

[btad435-B29] Squair JW, Gautier M, Kathe C. et al. Confronting false discoveries in single-cell differential expression. Nat Commun 2021;12:5692. 10.1038/s41467-021-25960-2. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btad435-B30] Tabula Sapiens Consortium. The Tabula Sapiens: a multiple-organ, single-cell transcriptomic atlas of humans. Science 2022;376:eabl4896. 10.1126/science.abl4896. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btad435-B31] Traag VA, Waltman L, van Eck NJ.. From Louvain to Leiden: guaranteeing well-connected communities. Sci Rep 2019;9:5233. 10.1038/s41598-019-41695-z. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btad435-B32] Trapnell C, Cacchiarelli D, Grimsby J. et al. Pseudo-temporal ordering of individual cells reveals dynamics and regulators of cell fate decisions. Nat Biotechnol 2014;32:381–6. 10.1038/nbt.2859. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btad435-B33] Tuong ZK, Loudon KW, Berry B. et al. Resolving the immune landscape of human prostate at a single-cell level in health and cancer. Cell Rep 2021;37:110132. 10.1016/j.celrep.2021.110132. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btad435-B34] Virshup I, Choi J, Cao K-AL. et al. constclust: consistent clusters for scRNA-seq. bioRxiv 2020. doi: 10.1101/2020.12.08.417105. [DOI]

[btad435-B35] Wang Y, Ma S, Ruzzo WL.. Spatial modeling of prostate cancer metabolic gene expression reveals extensive heterogeneity and selective vulnerabilities. Sci Rep 2020;10:3490. 10.1038/s41598-020-60384-w. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btad435-B36] Weatherly K, Bettonville M, Torres D. et al. Functional profile of S100A4-deficient T cells. Immun Inflamm Dis 2015;3:431–44. 10.1002/iid3.85. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btad435-B37] Xu H, Wang F, Li H. et al. Prostatic acid phosphatase (PAP) predicts prostate cancer progress in a population-based study: the renewal of PAP? Dis Markers 2019;2019:7090545. 10.1155/2019/7090545. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btad435-B38] Yaghmaeian Salmani B, Balderson B, Bauer S. et al. Selective requirement for polycomb repressor complex 2 in the generation of specific hypothalamic neuronal subtypes. Development 2022;149:dev200076. 10.1242/dev.200076. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btad435-B39] Yang S, Corbett SE, Koga Y. et al. Decontamination of ambient RNA in single-cell RNA-seq with DecontX. Genome Biol 2020;21:57. 10.1186/s13059-020-1950-6. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btad435-B40] Zappia L, Oshlack A.. Clustering trees: a visualization for evaluating clusterings at multiple resolutions. Gigascience 2018;7:giy083. 10.1093/gigascience/giy083. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btad435-B41] Zappia L, Phipson B, Oshlack A.. Splatter: simulation of single-cell RNA sequencing data. Genome Biol 2017;18:174. 10.1186/s13059-017-1305-0. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btad435-B42] Zhang J, McKenna LB, Bogue CW. et al. The diabetes gene Hhex maintains δ-cell differentiation and islet function. Genes Dev 2014;28:829–34. 10.1101/gad.235499.113. [DOI] [PMC free article] [PubMed] [Google Scholar]

PERMALINK

Cytocipher determines significantly different populations of cells in single-cell RNA-seq data

Brad Balderson

Michael Piper

Stefan Thor

Mikael Bodén

Roles

Abstract

Motivation

Results

Availability and implementation

1 Introduction

2 Materials and methods

2.1 Cytocipher code-scoring

Figure 1.

Figure 5.

2.2 Cytocipher cluster-merge

2.2.1 Enrichment score summarization

2.3 Simulation study

2.4 E18.5 mouse hypothalamus study

2.5 Human 3K PBMC study

2.6 Mouse E15.5 pancreas development study

2.7 Prostate cancer study

2.8 Tabula sapiens study

3 Results

3.1 Cytocipher code-scoring

3.2 Cytocipher cluster-merge

3.3 Cytocipher negative gene set subtraction outperforms existing methods for binary classification task on simulated data

Figure 2.

3.4 ‘Cytocipher code-scoring’ performs comparably to existing methods for binary classification task of E18.5 hypothalamus neuronal subtypes

Figure 3.

3.5 ‘Cytocipher cluster-merge’ reveals novel heterogeneity in 3K human PBMCs

Figure 4.

3.6 ‘Cytocipher cluster-merge’ identifies intermediate states in pancreas development

3.7 Cytocipher detects over-represented subpopulations in prostate cancer

Figure 6.

3.8 Cytocipher analysis of >480 000 cells in tabula sapiens atlas reveals scalability and high-test performance

Figure 7.

4 Discussion

5 Conclusion

Supplementary Material

Acknowledgements

Contributor Information

Author contributions

Supplementary data

Conflict of interest

Funding

Data availability

References

Associated Data

Supplementary Materials

Data Availability Statement

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases