Briefings in Bioinformatics. 2023 May 5;24(3):bbad157. doi: 10.1093/bib/bbad157

CosTaL: an accurate and scalable graph-based clustering algorithm for high-dimensional single-cell data analysis

Yijia Li 1, Jonathan Nguyen 2, David C Anastasiu 3, Edgar A Arriaga 4,5
PMCID: PMC10199777  PMID: 37150778

Abstract

With the aim of analyzing large multidimensional single-cell datasets, we describe a method for Cosine-based Tanimoto similarity-refined graph for community detection using Leiden’s algorithm (CosTaL). As a graph-based clustering method, CosTaL transforms the cells with high-dimensional features into a weighted k-nearest-neighbor (kNN) graph. The cells are represented by the vertices of the graph, while an edge between two vertices in the graph represents the close relatedness between the two cells. Specifically, CosTaL builds an exact kNN graph using cosine similarity and uses the Tanimoto coefficient as the refining strategy to re-weight the edges in order to improve the effectiveness of clustering. We demonstrate that CosTaL generally achieves equivalent or higher effectiveness scores on seven benchmark cytometry datasets and six single-cell RNA-sequencing datasets using six different evaluation metrics, compared with other state-of-the-art graph-based clustering methods, including PhenoGraph, Scanpy and PARC. As indicated by the combined evaluation metrics, CosTaL has high efficiency with small datasets and acceptable scalability for large datasets, which is beneficial for large-scale analysis.

Keywords: Clustering, Mass Cytometry, Flow Cytometry, Single-cell RNA sequencing, k nearest neighbors, Graph-based clustering

INTRODUCTION

Cells can be classified into different types according to their intrinsic heterogeneity in proteins, nucleotides and other metabolites. The classification of cells facilitates the understanding of relationships between cell identities and their functions in disease, aging and other biological models. In the past several years, multiparametric single-cell profiling methods, such as Mass Cytometry (MC) and single-cell RNA sequencing (scRNA-seq), have significantly improved the characterization of single cells, allowing researchers to depict precisely different types of cells from a complex population [1, 2].

Nowadays, cytometry methods such as MC allow measuring up to more than 40 parameters per cell, while the scRNA-seq method generally yields more than 20 000 features during a single measurement [3, 4]. The manual gating method can be used to identify cell types based on the signature of the features using histograms and bivariate plots. However, gating graphically is relatively subjective and time-consuming. For sequencing data, due to the large number of parameters contained in the data, exhaustive manual gating approaches are not feasible. Thus, researchers often resort to automated unsupervised clustering methods in order to characterize cell types [5, 6]. A growing trend of generating larger scale single-cell data is expected as a result of the rapid development and widespread application of single-cell techniques in recent years. This makes it increasingly important for clustering algorithms to allow users to process large-sized multiparametric datasets accurately and promptly.

In order to deal with large datasets with both high dimensionality and a large number of cells, a straightforward strategy is down-sampling, which selects only a portion of the total cells for analysis. This has been done by algorithms like SPADE by Qiu et al. [7]. However, down-sampling has the potential to overlook rare populations. On the other hand, if no down-sampling is conducted, scalability becomes an issue for analyzing large datasets. From this perspective, clustering algorithms with high complexity, like SNN-Clip and SC3, are unsuitable for large datasets as they require increased computational workloads and execution time [8]. Additionally, clustering algorithms should be extensible so that emerging analysis pipelines or analysis protocols for new single-cell techniques can use them directly. Many algorithms developed specifically for one single-cell platform are difficult to extend to studies employing different methods. X-shift, for example, is designed for cytometry datasets and cannot be applied to scRNA-seq data [7, 9]. Other algorithms, like BackSPIN [10], GRACE [11] and SHARP [12], are exclusively suited for scRNA-seq datasets, which limits their applicability to other types of data.

Among all the existing algorithms that are both scalable and extensible, FlowSOM, PhenoGraph, Seurat and Scanpy are the most commonly used methods [13]. FlowSOM, although extremely fast, requires a parameter $k$ as the desired cluster number. In most cases, $k$ should be determined by the user, as studies have demonstrated that automatic estimation within the algorithm is not always helpful [5]. Also, FlowSOM generates too many small clusters and this over-partitioning makes it hard to interpret the clustering results, especially without prior knowledge of the populations [5, 14]. On the other hand, PhenoGraph, Seurat and Scanpy are algorithms that share the strategy of transforming the high-dimensional cell data into a graph to represent the similarity relationships between the cells. Similarly, PARC is another efficient graph-based clustering algorithm that specializes in detecting rare populations [15].

With the aim of developing a more efficient, scalable and extensible clustering algorithm for large single-cell datasets, we formulate CosTaL, a graph-based single-cell clustering framework. In comparison with other single-cell clustering methods, CosTaL is effective and competitive when tested on a total of 13 benchmark datasets. When clustering scRNA-seq datasets using CosTaL, one should note that no normalization, scaling or Principal Components Analysis (PCA) transformations are required, which significantly reduces the effort involved in parameter selection as well as increases the overall clustering efficiency.

METHODS

Description of the CosTaL algorithm

Current methods for profiling single cells capture many features, such as those represented by different fluorescence channels in flow cytometry (FC). Each one of those features with unique signal intensities corresponds to one dimension in a high-dimensional Euclidean space. Similar types of cells should share a similar pattern of distribution for most of the features, resulting in a dense region in the high-dimensional space which can be used to identify the cell clusters.

Across all the existing unsupervised clustering algorithms, Jaccard–Louvain methods, as exemplified in PhenoGraph, PARC, Seurat and Scanpy, are among the most effective and most widely used strategies [6, 13, 16]. These methods are effective because building a k-nearest neighbor (kNN) graph is an efficient means of extracting high-dimensional distributions of the cells and preserving their relatedness within dense areas [6, 13, 17–22]. In a kNN graph, cells are converted into nodes and each node is connected by edges to its $k$ most similar neighbors. The task of clustering then is converted into detecting communities, defined as subsets of nodes where the connections among the nodes are denser within the community than connections with the rest of the graph [23]. Given a computed kNN graph, Jaccard–Louvain methods employ a second step of Jaccard similarity refinement to optimize the local structure of the kNN graph, which ultimately leads to the successful detection of communities. The Jaccard similarity is defined as

$$J(A, B) = \frac{|N_A \cap N_B|}{|N_A \cup N_B|} \tag{1}$$

where $N_A$ and $N_B$ are the sets of neighbors of two cells $A$ and $B$. When computing neighborhood similarities, Jaccard similarity binarizes the edge weights of the originally computed kNN graph, turning them into 1s. Therefore, the analysis only considers the presence or absence of shared neighbors, instead of focusing on the true weight of the features the cell has with its neighbors. The cells are considered more similar if they share a large portion of common neighbors, reflected by Jaccard similarity. Levine et al. [24] demonstrate that Jaccard similarity-based refinement facilitates the separation between the clusters and potentially identifies outliers from the major population on the kNN graph. Clustering algorithms making use of this strategy are collectively marked as kNN-Jaccard–Louvain methods. We also include Scanpy, considered an extension of PhenoGraph and Seurat [25, 26], which uses connectivity instead of Jaccard similarity to refine the kNN graph, to broaden the scope of Jaccard–Louvain methods. Like Scanpy, DUBStepR [27], Pagoda2 [28], SCHNEL [29] and CellRanger [30] are also kNN-Louvain methods that do not include the Jaccard refinement step [31]. To simplify comparisons, here we selected Scanpy as a representative clustering procedure that does not include a Jaccard refinement step.
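
As a rough illustration of the Jaccard refinement described above, the following stdlib-only sketch re-weights the edges of a toy kNN graph by the overlap of the two endpoint neighborhoods. The toy adjacency `knn` and the function names are hypothetical, and actual implementations differ in details such as whether a cell is counted among its own neighbors.

```python
def jaccard(neighbors_a, neighbors_b):
    """Jaccard similarity of two neighbor sets: |intersection| / |union|."""
    a, b = set(neighbors_a), set(neighbors_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def jaccard_refine(knn):
    """Re-weight every directed kNN edge (i -> j) by the Jaccard similarity
    of the two neighborhoods. The original edge weights are not used."""
    return {(i, j): jaccard(knn[i], knn[j]) for i in knn for j in knn[i]}


# Toy kNN lists (k = 2) for five cells; values are illustrative only.
knn = {
    0: [1, 2],
    1: [0, 2],
    2: [0, 1],
    3: [2, 4],  # cell 3 links to cell 2 but shares no neighbor with it
    4: [3, 2],
}
weights = jaccard_refine(knn)
```

In this toy graph the edge from cell 3 to cell 2 receives a weight of zero because the two neighborhoods do not overlap, which is how the refinement can expose spurious links.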

Essentially, CosTaL adapts and extends the Jaccard–Louvain strategy. A three-step process is used by CosTaL to identify cell populations from single-cell data. First, an exact kNN graph based on cosine similarity is constructed using the L2-norm k-Nearest Neighbor Graph Construction (L2knng) algorithm [32]. L2knng is specifically designed to handle sparse high-dimensional data, such as single-cell profiling data, with a considerable speed advantage. The second step then involves refining the edge weights of the kNN graph using the Tanimoto coefficient, which enhances the local structure of the graph. Unlike Jaccard–Louvain methods, CosTaL uses the lengths of the original cell vectors to efficiently adjust the kNN graph weights to account for both angular and spatial separation of the points in the Euclidean space. Lastly, on the basis of the refined kNN graph, the Leiden algorithm is employed to identify communities (clusters) within the kNN graph [33]. Even though CosTaL and other graph-based clustering algorithms all take similar steps and use the Leiden algorithm, how the graph is generated and refined is different. The comparisons among the algorithms are listed in Table 1.

Table 1.

KNN-Based Clustering Algorithms

| Algorithm | Step 1. Construct kNN graph | Step 2. Refine | Step 3. Detect community |
|---|---|---|---|
| CosTaL | L2knng algorithm | Tanimoto coefficient and connectivity [25] | Leiden algorithm |
| PhenoGraph [24] | kd-tree from scikit-learn or brute force mapping | Jaccard similarity | Louvain/Leiden algorithm |
| Scanpy [26] | PyNNDescent algorithm [34] | Connectivity [25] | Leiden algorithm |
| PARC [15] | HNSW algorithm [35] | Jaccard similarity with threshold cutoffs | Leiden algorithm |

Feature extraction

To ensure that the expression features of the cells are properly presented and compatible with the clustering algorithms, the output of the multiparametric single-cell measurements should be preprocessed prior to clustering. This process creates vector representations for the cells which are then used to construct the kNN graph.

For cytometry datasets, the features of cells denote the intensities of fluorophores (for FC) or event dual counts (for MC), which represent the amount of target proteins. The data are typically Inline graphic transformed (Inline graphic) for MC and Inline graphic for FC) to keep the readings in a linear scale [36].

For scRNA-seq datasets, the features are Unique Molecular Identifier (UMI) counts, representing the absolute number of detected RNAs. In order to process scRNA-seq datasets, the most widely used preprocessing approach is the method proposed by the R package Seurat as shown in Fig. 2, including steps of ‘highly variable gene selection’, ‘total count normalization’, ‘log1p transformation’, ‘scaling’ and ‘PCA’ [16, 37, 38].

Figure 2.

Procedures for clustering scRNA-seq data by other graph-based methods (left) and CosTaL (right). Upstream quality control (not shown) is completed to eliminate cells with insufficient information. The resulting UMI counts serve as input, and the top 2000 highly variable genes are selected using the Seurat v3/v4 method. Starting from the unnormalized UMI matrix, the canonical preprocessing used by the other clustering algorithms involves total count normalization, log1p transformation ($\log(1 + x)$, where $x$ is every value in the data matrix of cells), scaling to zero mean and unit variance and PCA transformation to reduce dimensionality. In comparison, CosTaL only requires the log1p transformation. All the preprocessing steps for scRNA-seq data are executed in Python using the Scanpy package.

Following feature extraction, the output from cytometry and scRNA-seq differ noticeably. According to the statistics of the selected datasets presented in Table 2, cytometry datasets have fewer features than scRNA-seq datasets. In addition, the sparsity levels of cytometry datasets are generally lower than those of scRNA-seq datasets. Datasets consisting of a large number of cells (GSE110823, 1M_neurons in the table) in this category achieve sparsity of over 91%.

Table 2.

Benchmark Dataset Statistics

Mass/Flow cytometry datasets

| Dataset | Ref. | No. labeled cells | No. total cells | No. features | No. non-zeros | Sparsity |
|---|---|---|---|---|---|---|
| Levine_13 | [24] | 81 747 | 167 044 | 13 | 1 689 738 | 22.188% |
| Levine_32 | [24] | 104 184 | 265 627 | 32 | 6 205 110 | 26.999% |
| Samusik_01 | [9] | 53 173 | 86 864 | 39 | 1 949 437 | 42.455% |
| Samusik_all | [9] | 514 386 | 841 644 | 39 | 18 578 390 | 43.400% |
| Giordani_WT1 | [39] | 585 113 | 599 100 | 26 | 10 824 531 | 30.508% |
| Mosmann_rare | [40] | 109 | 396 460 | 15 | 5 808 969 | 2.319% |
| Nilsson_rare | [41] | 358 | 44 140 | 14 | 616 569 | 0.225% |

scRNA-seq datasets

| Dataset | Ref. | No. original features | No. cells | No. features | No. non-zeros | Sparsity |
|---|---|---|---|---|---|---|
| E-MTAB-3321 | [42] | 41 480 | 124 | 2000 | 144 499 | 41.734% |
| GSE81861 | [43] | 57 241 | 561 | 2000 | 214 688 | 80.866% |
| GSE74672 | [44] | 24 341 | 2881 | 2000 | 821 520 | 85.742% |
| GSE84133 | [45] | 20 125 | 3605 | 2000 | 792 390 | 89.010% |
| GSE110823 | [46] | 26 894 | 156 049 | 2000 | 18 463 051 | 94.084% |
| 1M_neurons | [47] | 27 998 | 1 306 127 | 2000 | 215 428 065 | 91.753% |
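
The sparsity column in Table 2 is consistent with the fraction of zero entries in the cells × features matrix, that is, 1 minus the number of non-zeros divided by the product of cells and features. The short sketch below recomputes a few rows under that assumption; the function and variable names are illustrative and the counts are copied from the table (total cells for the cytometry datasets).

```python
def sparsity_percent(n_cells, n_features, n_nonzeros):
    """Percentage of zero entries in an n_cells x n_features matrix."""
    return 100.0 * (1.0 - n_nonzeros / (n_cells * n_features))


# (cells, features, non-zeros) copied from Table 2
table2_rows = {
    "Levine_13": (167_044, 13, 1_689_738),
    "Nilsson_rare": (44_140, 14, 616_569),
    "GSE110823": (156_049, 2000, 18_463_051),
    "1M_neurons": (1_306_127, 2000, 215_428_065),
}

sparsity = {
    name: round(sparsity_percent(*row), 3) for name, row in table2_rows.items()
}
for name, value in sparsity.items():
    print(f"{name}: {value}%")
```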

Constructing the exact kNN graph using p-L2knng

Even though constructing the kNN graph to represent the similarity structure of the original datasets avoids the pitfalls of directly computing the densities in a high-dimensional space, the task is still very computationally expensive, requiring $O(n^2)$ similarity comparisons, where $n$ is the number of given cells. Many mapping algorithms resort to finding kNNs approximately, using one of the tree-based, hashing-based, quantization-based or graph-based approaches [48]. Nevertheless, approximate methods cannot guarantee finding the exact nearest neighbors that an exhaustive search would find, thereby undermining their reproducibility. Additionally, candidate selection and comparisons used in the approximate searching methods can be computationally expensive [49]. Only a few exact search methods, such as ball-trees and KD-trees, are capable of ensuring search effectiveness while avoiding the computational difficulties associated with brute-force pairwise proximity comparisons. Still, to speed up the search they typically rely on computationally intensive pruning estimates that do not scale well to large datasets [50, 51].

In order to tackle the scalability requirements of large-scale dataset analysis, while at the same time improving clustering effectiveness, we utilized the parallel version of the L2Knng algorithm (p-L2knng) to build an exact, rather than approximate, kNN graph [32, 52]. The L2knng method efficiently uses L2-norms of feature subsets as an effective pruning strategy that avoids the computation of the majority of the pairwise cosine similarities required to build the kNN graph from a sparse feature matrix. Additional details on the p-L2knng algorithm are included in Supplementary Note 1. In summary, due to multiple pruning steps inherited from L2knng and parallelization, p-L2knng significantly speeds up the kNN construction step of our framework, while at the same time constructing an exact, rather than an approximate, nearest neighbor graph.

It is worth noting that p-L2knng is designed for constructing kNN graphs using cosine similarity on non-negative input data rather than the most widely used Euclidean distance. However, as shown in earlier works, even though cosine similarity and Euclidean distance are both impacted by high dimensionality, cosine similarity typically still performs better when dealing with high-dimensional sparse data than Euclidean distance [53, 54].
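
For reference, the exact cosine kNN graph that this step produces can be written down as a brute-force search. The sketch below is not the p-L2knng algorithm and contains none of its pruning or parallelization; it is a quadratic-time baseline with hypothetical function names and toy data, meant only to show what "exact kNN under cosine similarity" means.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-negative feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0


def exact_cosine_knn(cells, k):
    """Return {i: [(j, similarity), ...]} holding the k most similar other
    cells for every cell, found by comparing all pairs."""
    graph = {}
    for i, u in enumerate(cells):
        sims = [(j, cosine(u, v)) for j, v in enumerate(cells) if j != i]
        # Highest similarity first; ties are broken by the smaller index.
        sims.sort(key=lambda item: (-item[1], item[0]))
        graph[i] = sims[:k]
    return graph


# Toy data: five cells with three features each (illustrative values).
cells = [
    [1.0, 0.0, 0.0],
    [2.0, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 3.0, 0.2],
    [0.0, 0.0, 1.0],
]
knn_graph = exact_cosine_knn(cells, k=2)
```

Cells 0 and 1 point in nearly the same direction and become each other's nearest neighbor even though their amplitudes differ by a factor of two, which is the amplitude blindness that the Tanimoto refinement in the next step is meant to address.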

Refining the kNN graph using Tanimoto coefficient

Tanimoto coefficient, also known as Tanimoto similarity or extended Jaccard similarity/coefficient, has been broadly used in the field of cheminformatics [55], text analysis [54] and thesaurus extraction [56]. Tanimoto coefficient can be calculated as

$$T(A, B) = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\|^2 + \|\mathbf{B}\|^2 - \mathbf{A} \cdot \mathbf{B}} \tag{2}$$

$$T(A, B) = \frac{\sum_{i=1}^{n} A_i B_i}{\sum_{i=1}^{n} A_i^2 + \sum_{i=1}^{n} B_i^2 - \sum_{i=1}^{n} A_i B_i} \tag{3}$$

where $\mathbf{A}$ and $\mathbf{B}$ are the vector representations of two cells A and B, constructed as described in Section Methods: Feature extraction, and $A_i$ and $B_i$ are the values of the $i$th component in those vector representations. Equation 3 is the typical form used for calculating the Tanimoto coefficient between two continuous n-dimensional variables. With CosTaL, the kNN graph’s edge weights are updated from cosine similarity to the Tanimoto coefficient to incorporate the amplitude signal (as shown in Fig. 1). Although Tanimoto coefficients provide more information about similarity relationships, there can still be discrepancies in the weights assigned to edges between cells, particularly in cases where one cell may be among the kNNs of another cell but not vice versa. To resolve these discrepancies, CosTaL provides users with a choice between the arithmetic mean method used by PhenoGraph and the probabilistic approach for computing connectivities using UMAP’s method, which is also embedded in Scanpy [25, 26]. By default, CosTaL is set to use the connectivity method.
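
A minimal sketch of the Tanimoto coefficient for continuous vectors, in its standard form (dot product divided by the sum of squared norms minus the dot product), is given below. The function names and toy vectors are illustrative and not taken from the CosTaL code base.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine_similarity(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


def tanimoto(u, v):
    """Tanimoto coefficient: u.v / (|u|^2 + |v|^2 - u.v)."""
    uv = dot(u, v)
    denominator = dot(u, u) + dot(v, v) - uv
    return uv / denominator if denominator > 0 else 0.0


a = [1.0, 2.0, 0.0]
b = [2.0, 4.0, 0.0]  # same direction as a, twice the amplitude
c = [1.0, 2.0, 0.1]  # almost identical to a

cos_ab = cosine_similarity(a, b)  # blind to the amplitude difference
tan_ab = tanimoto(a, b)           # penalizes the amplitude difference
tan_ac = tanimoto(a, c)
```

Here the cosine similarity cannot distinguish `a` from `b`, whereas the Tanimoto coefficient ranks the nearly identical cell `c` as more similar to `a` than the rescaled cell `b`.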

Figure 1.

Illustrations comparing the Tanimoto coefficient with cosine similarity and Euclidean distance as a measure of similarity. (A), (B), (C): Advantages of Tanimoto coefficient over cosine similarity. (A) Points B and C are two cells that are similar to a third cell represented by Point A. B and C are identified as A’s neighbors. (B) When mapping the similarity relationships with cosine similarity, the angular similarities between A&B and A&C are captured, while the dissimilarity between A&B is not reflected. (C) When refining the edge weights with the Tanimoto coefficient, the similarity relationship is retained as no edges are removed. Moreover, it is possible to determine the amplitude difference between A and B using Tanimoto coefficients. (D), (E), (F): Contour maps of a single point A’s proximity measurements in a 2D space. (D) Contour map of point A’s cosine similarities in a 2D space. (E) Contour map of point A’s neighbors mapped by Euclidean distance in a 2D space. (F) Contour map of point A’s Tanimoto coefficient in a 2D space. Tanimoto coefficient is able to simultaneously capture both the angular aspect of Cosine similarity and the amplitude aspect of Euclidean distance.

Theoretical advantages of Tanimoto coefficient

As described above, the Tanimoto coefficient uses similarity relationships among nodes, which are potentially overlooked during the initial kNN mapping using cosine similarity or ignored in Euclidean distance systems (Fig. 1). To further demonstrate its advantages, Supplementary Note 2 (Supplementary Fig. 1) illustrates how the Tanimoto coefficient is an effective measure for discriminating outliers that cosine similarity would otherwise miss, despite being similar in this respect to the Jaccard similarity. In addition, Supplementary Note 3 describes that the Tanimoto refinement in CosTaL generally maintains or improves effectiveness scores when processing scRNA-seq benchmark datasets, relative to CosTaL ‘without’ the Tanimoto refinement.

In addition, the Tanimoto coefficient has three advantages for graph refinement. First, as illustrated in Fig. 1, the Tanimoto coefficient is a refinement that includes the actual proximity measurements computed in the first step and extends them to also account for amplitude differences. In contrast, the refinement method based on Jaccard similarity captures the neighborhood similarities but discards the initial weights, which represent the proximity between the cells. Second, the Tanimoto coefficient requires a smaller $k$ value compared with the Jaccard similarity. The Jaccard similarity treats all neighbors equally and refines the local structure of a kNN graph based on the number of shared neighbors, so the spurious links of outliers cannot be identified unless a higher $k$ value is used (see Supplementary Section 2 and Supplementary Fig. 1). CosTaL, on the other hand, is unaffected by this effect and requires a smaller $k$ value, making it more efficient. Our experimental evaluation also confirmed this (see Section Parameter influences). Third, the efficiency of computing the Tanimoto coefficients is improved by reusing the cosine similarities reported by the p-L2knng algorithm, along with pre-computed L2-norms for each cell. CosTaL computes the Tanimoto coefficient as

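$$T(A, B) = \frac{\cos(\mathbf{A}, \mathbf{B}) \, \|\mathbf{A}\| \, \|\mathbf{B}\|}{\|\mathbf{A}\|^2 + \|\mathbf{B}\|^2 - \cos(\mathbf{A}, \mathbf{B}) \, \|\mathbf{A}\| \, \|\mathbf{B}\|} \tag{4}$$

where $\|\mathbf{A}\|$ and $\|\mathbf{B}\|$ are the L2-norms of cells A and B and $\cos(\mathbf{A}, \mathbf{B})$ is the cosine similarity between their vector representations.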
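
Because the dot product of two vectors equals their cosine similarity multiplied by both L2-norms, the Tanimoto coefficient can be obtained from quantities that are already available after the kNN step, without revisiting the feature vectors. The sketch below illustrates this reuse under that identity; the data layout (`knn_cosine` as a dictionary of neighbor lists) and all names are assumptions made for the example, not the CosTaL implementation.

```python
import math


def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))


def cosine(u, v):
    return sum(a * b for a, b in zip(u, v)) / (l2_norm(u) * l2_norm(v))


def tanimoto_direct(u, v):
    """Reference value computed from the full feature vectors."""
    uv = sum(a * b for a, b in zip(u, v))
    return uv / (sum(a * a for a in u) + sum(b * b for b in v) - uv)


def tanimoto_from_cosine(cos_ab, norm_a, norm_b):
    """Tanimoto coefficient from a cosine similarity and two L2-norms."""
    shared = cos_ab * norm_a * norm_b  # equals the dot product of A and B
    return shared / (norm_a ** 2 + norm_b ** 2 - shared)


def reweight_knn(knn_cosine, norms):
    """Replace each cosine edge weight with the Tanimoto coefficient."""
    return {
        i: [(j, tanimoto_from_cosine(c, norms[i], norms[j])) for j, c in nbrs]
        for i, nbrs in knn_cosine.items()
    }


cells = [[1.0, 2.0, 0.0], [2.0, 4.0, 0.0], [1.0, 2.0, 0.1]]
norms = [l2_norm(cell) for cell in cells]  # computed once per cell
knn_cosine = {
    0: [(1, cosine(cells[0], cells[1])), (2, cosine(cells[0], cells[2]))],
    1: [(0, cosine(cells[1], cells[0])), (2, cosine(cells[1], cells[2]))],
}
knn_tanimoto = reweight_knn(knn_cosine, norms)
```

Each edge is re-weighted with a handful of scalar operations, so the cost of the refinement grows with the number of edges rather than with the number of features.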

Community detection on the kNN graph using the Leiden algorithm

Once the network preserving the similarity information of the cell populations has been established, the cluster of cells, represented as communities on the kNN graph, can be mapped using the Leiden algorithm [33].

In the graph, a community is defined as a collection of nodes that are more densely connected with nodes in the group than with those outside the group, which can be reflected by the modularity of the graph. A detailed discussion of the computation of modularity can be found in the original article by Newman and Girvan [57]. In terms of clustering, a well-clustered graph has a high modularity, which can be interpreted as an overall high degree of intra-group connections and a low degree of inter-group connections. Thus, optimizing the modularity of a graph is equivalent to clustering in such a way that the cell populations become well separated. For similarity-based kNN graphs, modularity optimization finds cell clusters with high edge weights (similarities) within clusters and low weights between clusters.
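
To make the notion of modularity concrete, the following sketch evaluates the standard Newman and Girvan modularity of a weighted undirected graph for a given partition. It only scores a partition and does not optimize it, so it is not a substitute for the Leiden algorithm; the edge-dictionary format and names are assumptions made for this example.

```python
def modularity(edges, community):
    """Modularity Q of a partition of a weighted undirected graph.

    edges: {(i, j): weight}, each undirected edge listed once, no self-loops.
    community: {node: community label}.
    """
    two_m = 2.0 * sum(edges.values())

    # Weighted degree (strength) of every node.
    strength = {}
    for (i, j), w in edges.items():
        strength[i] = strength.get(i, 0.0) + w
        strength[j] = strength.get(j, 0.0) + w

    # Observed intra-community weight; each edge counts in both directions.
    internal = sum(
        2.0 * w for (i, j), w in edges.items() if community[i] == community[j]
    )

    # Expected intra-community weight under the configuration null model.
    totals = {}
    for node, s in strength.items():
        label = community[node]
        totals[label] = totals.get(label, 0.0) + s
    expected = sum(t * t for t in totals.values()) / two_m

    return (internal - expected) / two_m


# Two triangles joined by a single bridge edge, all weights equal to 1.
edges = {
    (0, 1): 1.0, (0, 2): 1.0, (1, 2): 1.0,
    (3, 4): 1.0, (3, 5): 1.0, (4, 5): 1.0,
    (2, 3): 1.0,
}
q_two_groups = modularity(edges, {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"})
q_one_group = modularity(edges, {node: "a" for node in range(6)})
```

Splitting the graph at the bridge yields a clearly positive modularity, while placing all nodes in one community yields zero, matching the intuition of dense intra-group and sparse inter-group connections.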

Similar to most other graph-based clustering algorithms, CosTaL employs the Leiden algorithm to optimize modularity. The Leiden algorithm is a sophisticated optimization method that uses the local move approach to merge nodes starting from singletons while optimizing graph modularity [33]. As opposed to its predecessor, the Louvain algorithm, the Leiden algorithm ensures that the clusters are well connected for the purpose of locating more high-quality clusters [58].

The advantages of modularity optimization are high effectiveness and efficiency, while a disadvantage is the resolution limit. The resolution limit of a graph with a total edge number of Inline graphic is Inline graphic, meaning that small communities with edges less than Inline graphic may not be found through modularity optimization [59]. In light of this, some key links that represent the basic structure of the subpopulation within the kNN graph may not be well reflected in the changes in global modularity value during optimization, therefore being recognized as internal links for a greater, merged community. Since CosTaL requires smaller Inline graphic values, the total edge number in the graph is reduced, alleviating any potential issues with the resolution limit.

Among all other algorithms, only PARC considers the resolution limit problem, and it uses aggressive truncation strategies to remove edges with weights below a threshold. Due to truncation, the network is divided into several subgraphs with a very limited number of connections. Although this is useful for detecting rare populations, it makes PARC relatively inaccurate at the global level.

Benchmark datasets and preprocessing procedure

We validated the performance of CosTaL against PhenoGraph, Scanpy and PARC in both cytometry and scRNA-seq datasets. All the data matrices were formatted as cells × features for input. We picked seven cytometry and six scRNA-seq benchmark datasets (Table 3) that have been used intensively in many previous comparison studies for clustering algorithms [5, 36]. Based on the credibility of cell labels, we categorized datasets into three tiers. The first-tier datasets are those with labels from manual gating results (Levine_13, Levine_32, Samusik_01 and Samusik_all) or from known cell sources (E-MTAB-3321 and GSE81861). The second-tier datasets (Giordani_WT1, GSE74672 and GSE84133) are labeled with clustering algorithms with post-clustering inspections. The third tier consists of datasets used for particular purposes. The Nilsson_rare and Mosmann_rare datasets in this tier were used for assessing the problem of detecting rare populations. GSE110823 and 1M_neurons are also in this tier, primarily for evaluating the scalability of the methods.

Table 3.

Benchmark Datasets

Mass/Flow cytometry datasets

| Dataset | Ref. | Source | Pop. | Platform | Credibility |
|---|---|---|---|---|---|
| Levine_13 | [24] | Human, bone marrow | 24 | CyTOF | Manual gating |
| Levine_32 | [24] | Human, bone marrow | 14 | CyTOF | Manual gating |
| Samusik_01 | [9] | Mouse, bone marrow | 24 | CyTOF | Manual gating |
| Samusik_all | [9] | Mouse, bone marrow | 24 | CyTOF | Manual gating |
| Giordani_WT1 | [39] | Mouse, skeletal muscle | 8 | CyTOF | X-Shift clustering and merge with inspections |
| Mosmann_rare | [40] | Human, peripheral blood | 1 | Flow cytometry | Manual gating |
| Nilsson_rare | [41] | Human, bone marrow | 1 | Flow cytometry | Manual gating |

scRNA-seq datasets

| Dataset | Ref. | Source | Pop. | Platform | Credibility |
|---|---|---|---|---|---|
| E-MTAB-3321 | [42] | Mouse, Embryo | 5 | Smart-Seq2 | Sourced from 5 developmental stages |
| GSE81861 | [43] | Human, Colorectal tumor | 7 | SMARTer | Sourced from 7 cell lines |
| GSE74672 | [44] | Mouse, Hypothalamus | 7 | Fluidigm C1 | BackSPIN biclustering and merge with inspections |
| GSE84133 | [45] | Human, Pancreas | 14 | inDrop | Iterative hierarchical clustering and merge with inspections |
| GSE110823 | [46] | Mouse, Brain & Spinal cord | 73 | SPLiT-seq | Jaccard–Louvain iterations and merge with inspections |
| 1M_neurons | [47] | Mouse, Brain | 60 | 10X Genomics Chromium | Jaccard–Louvain and merge with hierarchical clustering |

Preprocessing of cytometry datasets

Typically, cytometry datasets have some negative readings due to randomization and event calculation. Negative values are zeroed for CosTaL. As PhenoGraph, Scanpy and PARC accept negative values, no changes were made to the data preprocessing steps of these methods. MC datasets, including Levine_13, Levine_32, Samusik_01, Samusik_all and Giordani_WT1, were all Inline graphic transformed. FC datasets Mosmann_rare and Nilsson_rare were Inline graphic transformed [36].

Preprocessing of scRNA-seq datasets

For the selection of the highly variable genes, the top 2000 genes were selected for each scRNA-seq dataset as features for clustering. While selecting highly variable genes can be helpful in addressing dropouts in scRNA-seq datasets, CosTaL is not optimized for this issue [60]. If dropouts are of concern, prior to the selection of the highly variable genes, we recommend (1) examining the counts according to their distributions (e.g. Poisson distribution, negative binomial distribution or zero-inflated negative binomial distribution), and (2) using correction strategies like imputation and hubness reduction [61, 62].

As required by p-L2knng, only non-negative feature values could be used as input. Since PCA transformation typically generates negative values in the Principal Components (PCs), the steps of ‘total count normalization’, ‘scaling’ and ‘PCA’ were removed in the preprocessing procedure of CosTaL as they are not needed. The illustration of the Seurat-fashioned workflow used by CosTaL versus the other algorithms is presented in Fig. 2. With CosTaL, the user not only saves more time by streamlining the preprocessing process, but also avoids having to determine how many PC dimensions should be used during the PCA transformation, which could be subjective.
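
The non-negativity constraint can be illustrated on a toy count matrix. The stdlib-only sketch below assumes that the single transformation applied before CosTaL is log1p, and contrasts it with zero-mean, unit-variance scaling; the matrix and function names are hypothetical, and the preprocessing in the study itself is carried out with the Scanpy package.

```python
import math


def log1p_transform(counts):
    """Apply log(1 + x) to every entry; zeros stay zero."""
    return [[math.log1p(v) for v in row] for row in counts]


def scale_features(matrix):
    """Scale every feature (column) to zero mean and unit variance."""
    n_rows = len(matrix)
    scaled_columns = []
    for column in zip(*matrix):
        mean = sum(column) / n_rows
        std = math.sqrt(sum((v - mean) ** 2 for v in column) / n_rows)
        scaled_columns.append(
            [(v - mean) / std if std > 0 else 0.0 for v in column]
        )
    return [list(row) for row in zip(*scaled_columns)]


# Toy UMI counts: three cells (rows) by four genes (columns).
counts = [
    [0, 3, 0, 1],
    [5, 0, 0, 2],
    [0, 0, 7, 0],
]

logged = log1p_transform(counts)   # non-negative, sparsity preserved
scaled = scale_features(logged)    # dense, contains negative values

n_zeros_before = sum(v == 0 for row in counts for v in row)
n_zeros_logged = sum(v == 0 for row in logged for v in row)
has_negative_after_scaling = any(v < 0 for row in scaled for v in row)
```

The log1p output keeps every zero of the input and stays non-negative, so it remains a valid sparse input for a cosine-based kNN search, whereas scaling turns the zeros into negative values.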

Software and Execution environment

The Scanpy package (v.1.8.2) used for preprocessing and clustering can be found at https://github.com/theislab/scanpy.git.

The PhenoGraph algorithm (v.1.5.7) can be found at https://github.com/dpeerlab/PhenoGraph.git.

The PARC algorithm (v.0.33) can be found at https://github.com/ShobiStassen/PARC.git.

The Leiden algorithm (v.0.8.2) used for community detection can be found at https://github.com/vtraag/leidenalg.git. The parameter partition method is set to Inline graphic, the resolution is set to 0.8 and the seed is set to its default value, which is random.

The p-L2Knng algorithm (v.0.2.0) can be found at http://davidanastasiu.net/software/pl2knng/. A new version of p-L2Knng with Python bindings is available as part of the SNNLib library at https://github.com/davidanastasiu/snnlib.

The CosTaL algorithm (v0.1.0) can be found at https://github.com/li000678/CosTaL.

Each clustering analysis on the benchmark datasets was executed on a stand-alone AMD ROME computing node with 64 cores and 256GB of RAM (https://www.msi.umn.edu/mangi). Clustering algorithms were executed using each of the benchmark datasets for Inline graphic and all experiments were repeated 10 times.

Evaluation methods

The performance of clustering algorithms is measured in three aspects: number of clusters identified, time consumed and effectiveness scores. While cluster number and time are straightforward, the best metric for assessing clustering algorithms is not yet settled [63]. Clustering algorithms are generally evaluated using external validation methods when ground truth is available. The external validations can be divided into three different categories: (1) pair counting, (2) set overlap and (3) information theory [64, 65]. In our analysis, we took into account two popular metrics from each of the three categories in order to be comprehensive. In brief, we selected the Adjusted Rand Index (ARI) and Fowlkes–Mallows Index (FMI) measures for pair-counting-based evaluations. For overlap ratio-based evaluations, we used F-measure (F1 score). Depending on how the F1 score is harmonized as a whole for all the detected clusters, FlowCAPI F1 scores (FF1 scores) and Hungarian algorithm-based F1 scores (HF1), which have been previously described in [5, 9, 36], were adopted as our evaluation performance metrics. For information theory-based evaluations, we selected the Normalized Mutual Information (NMI) and Adjusted Mutual Information (AMI) scores. These categories of evaluations are complementary, and detailed descriptions of the calculations and properties for each effectiveness score are provided in Supplementary Note 4. All the scores are between 0 and 1, and 1 indicates a perfect match between the reference and the prediction [66].
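
As an illustration of the pair-counting category, the sketch below computes ARI and FMI from the contingency counts of two labelings using their standard definitions. It is a stdlib-only example with hypothetical function names and toy labels, not the evaluation code used in the study.

```python
from collections import Counter
from math import comb, sqrt


def pair_counts(labels_true, labels_pred):
    """Count pairs of cells grouped together in both labelings, in the
    reference only, in the prediction only, and all pairs."""
    contingency = Counter(zip(labels_true, labels_pred))
    together_both = sum(comb(c, 2) for c in contingency.values())
    together_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    together_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    return together_both, together_true, together_pred, comb(len(labels_true), 2)


def adjusted_rand_index(labels_true, labels_pred):
    both, in_true, in_pred, total = pair_counts(labels_true, labels_pred)
    expected = in_true * in_pred / total
    maximum = 0.5 * (in_true + in_pred)
    if maximum == expected:  # degenerate case, e.g. a single cluster in both
        return 1.0
    return (both - expected) / (maximum - expected)


def fowlkes_mallows_index(labels_true, labels_pred):
    both, in_true, in_pred, _ = pair_counts(labels_true, labels_pred)
    return both / sqrt(in_true * in_pred) if in_true and in_pred else 0.0


reference = [0, 0, 0, 1, 1, 1]
prediction = [0, 0, 1, 1, 2, 2]  # splits and mixes the reference groups
relabeled = [5, 5, 5, 9, 9, 9]   # same grouping as reference, new names

ari = adjusted_rand_index(reference, prediction)
fmi = fowlkes_mallows_index(reference, prediction)
ari_perfect = adjusted_rand_index(reference, relabeled)
```

Both scores depend only on which cells are grouped together, so renaming the clusters (as in `relabeled`) leaves a perfect score unchanged.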

RESULTS

Performance comparisons on cytometry datasets using default k values

We first evaluated the performance of CosTaL ($k$ = 10) against PhenoGraph ($k$ = 30), Scanpy ($k$ = 15) and PARC ($k$ = 30). The results over cytometry datasets are shown in Fig. 3.

Figure 3.

Performance of CosTaL compared with PhenoGraph, Scanpy and PARC on cytometry datasets. (A) Number of identified clusters and the time consumption. The vertical red line represents the number of cell types of the reference labels. Mosmann_rare and Nilsson_rare only focus on a single rare population and the reference numbers are thus not shown. (B) The effectiveness scores of clustering algorithms for each dataset. AMI, ARI, FF1, HF1, FMI and V-measure were used for all cytometry datasets except Mosmann_rare and Nilsson_rare, for which only the F1 score was used to measure effectiveness.

As the results show, CosTaL was the fastest among all the methods in most cases. Additionally, CosTaL generated fewer clusters than PhenoGraph, Scanpy and PARC. The cluster number appears to affect the effectiveness scores for global-scale clustering. All six evaluation metrics show that CosTaL outperformed the other methods on the Levine_32 and Giordani_WT1 datasets, where CosTaL generated the cluster numbers closest to the references. For the Levine_13, Samusik_01 and Samusik_all datasets, CosTaL identified the fewest clusters, in each case fewer than the reference number. In these cases, CosTaL had lower HF1 scores than the other methods, because HF1 requires one-to-one matches between reference populations and clusters. Yet CosTaL did not perform worse than the baselines on the other performance scores.
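The sensitivity of one-to-one F1 scoring to under-partitioning can be seen in a toy example. The sketch below is a simplified illustration rather than the FF1 and HF1 implementations used here (those are defined in [5, 9, 36] and Supplementary Note 4): the size-weighted and unweighted averaging choices are assumptions, the function names are hypothetical, and a brute-force search over assignments stands in for the Hungarian algorithm, so it is only suitable for a handful of clusters.

```python
from collections import Counter
from itertools import permutations


def f1_table(reference, predicted):
    """F1 score for every (reference population, cluster) pair."""
    joint = Counter(zip(reference, predicted))
    ref_sizes, clu_sizes = Counter(reference), Counter(predicted)
    table = {}
    for r in ref_sizes:
        for c in clu_sizes:
            overlap = joint.get((r, c), 0)
            # 2*TP / (|population| + |cluster|) is the harmonic mean
            # of precision and recall.
            table[r, c] = 2 * overlap / (ref_sizes[r] + clu_sizes[c])
    return table, ref_sizes, clu_sizes


def max_match_f1(reference, predicted):
    """Each population takes its best cluster (clusters may be reused);
    assumed here to be averaged with population-size weights."""
    table, ref_sizes, clu_sizes = f1_table(reference, predicted)
    n = len(reference)
    return sum(size / n * max(table[r, c] for c in clu_sizes)
               for r, size in ref_sizes.items())


def one_to_one_f1(reference, predicted):
    """Best one-to-one assignment, averaged over populations.
    Brute force replaces the Hungarian algorithm; tiny inputs only."""
    table, ref_sizes, clu_sizes = f1_table(reference, predicted)
    refs = list(ref_sizes)
    # Pad with None so that populations may remain unmatched (F1 = 0).
    clusters = list(clu_sizes) + [None] * max(0, len(refs) - len(clu_sizes))
    best = 0.0
    for chosen in permutations(clusters, len(refs)):
        total = sum(table[r, c] for r, c in zip(refs, chosen) if c is not None)
        best = max(best, total)
    return best / len(refs)


reference = [0, 0, 0, 1, 1, 1]
merged = [0, 0, 0, 0, 0, 0]  # two populations merged into one cluster
print(max_match_f1(reference, merged))
print(one_to_one_f1(reference, merged))
```

When two populations are merged into a single cluster, only one of them can claim that cluster under one-to-one matching and the other scores zero, so the one-to-one score drops well below the max-match score.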

For the problem of identifying rare populations, CosTaL was outperformed on the Mosmann_rare dataset, and none of the algorithms performed well on the Nilsson_rare dataset. This can be attributed to the fact that CosTaL identifies fewer and larger clusters, which lowers the precision term of the F1 score. To enable the detection of rare populations, CosTaL offers the option of performing iterative clustering on the identified sub-populations. In addition to the results obtained from the basic CosTaL clustering, a further level of clustering can be carried out on each cluster, as detailed in Supplementary Note 5. Consequently, a more detailed partitioning of the population can be achieved, which leads to a higher F1 score. As shown in Supplementary Fig. 3, iterative two-level CosTaL clustering outperforms PARC, even though PARC is designed specifically for detecting rare populations.
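The idea of iterative clustering can be sketched generically as follows. This is a schematic under stated assumptions, not the procedure of Supplementary Note 5: `iterative_clustering` and `split_at_mean` are hypothetical names, and `cluster_fn` stands for any function that takes a list of cells and returns one label per cell (a real analysis would call the clustering algorithm there).

```python
def iterative_clustering(points, cluster_fn, levels=2):
    """Cluster all points, then re-cluster each cluster, for `levels` rounds.
    Returns flat integer labels, one per point."""
    labels = [()] * len(points)  # hierarchical label: one entry per level
    for _ in range(levels):
        groups = {}
        for index, label in enumerate(labels):
            groups.setdefault(label, []).append(index)
        for label, members in groups.items():
            sub_labels = cluster_fn([points[i] for i in members])
            for i, sub in zip(members, sub_labels):
                labels[i] = label + (sub,)
    # Map hierarchical tuples such as (1, 0) to consecutive integers.
    flat = {label: n for n, label in enumerate(sorted(set(labels)))}
    return [flat[label] for label in labels]


def split_at_mean(values):
    """Toy stand-in for a clustering call: below/above the mean."""
    mean = sum(values) / len(values)
    return [int(v > mean) for v in values]


print(iterative_clustering([0, 1, 10, 11], split_at_mean, levels=1))
print(iterative_clustering([0, 1, 10, 11], split_at_mean, levels=2))
```

Each additional level can only split existing clusters, never merge them, which is why the second pass yields a finer partition in which small populations have a chance to separate from larger ones.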

Overall, we conclude that CosTaL has the advantage of efficiency while maintaining comparable effectiveness for global-scale clustering, making it more scalable for large-scale datasets. For identifying rare populations, iterative CosTaL can also provide the highest level of effectiveness.

Performance comparisons on scRNA-seq datasets using default k value

Because CosTaL is incompatible with the PCA transformation, which other algorithms generally use for dimensionality reduction on scRNA-seq datasets, we compared CosTaL with the other algorithms on both PCA-transformed and non-PCA-transformed data, with each algorithm using its default k value. The results are shown in Fig. 4.

Figure 4.

Performance of CosTaL compared with PhenoGraph, Scanpy and PARC on scRNA-seq datasets. (A) Number of identified clusters and the time consumption. The vertical red line represents the number of cell types of the reference labels. (B) The effectiveness scores of clustering algorithms for each dataset. Both PCA-transformed (top 50 PCs were used) and non-PCA-transformed data were used as the input for PhenoGraph, Scanpy and PARC. Scanpy failed in processing the non-PCA-transformed 1M_neurons dataset.

The numbers of clusters identified by CosTaL were similar to those of the other methods, except on the 1M_neurons dataset.

In terms of efficiency, which we measure as execution time, CosTaL outperformed PhenoGraph, Scanpy and PARC using the non-PCA-transformed data (shown as ‘PhenoGraph’, ‘Scanpy’ and ‘PARC’ in Fig. 4), but it was somewhat slower than Scanpy or PARC when they use PCA-transformed data (shown as ‘Scanpy_PCA’ and ‘PARC_PCA’ in Fig. 4). However, as we will show later, when considering both the preprocessing (including PCA) and clustering steps as a whole, CosTaL was the most efficient in most cases.

In terms of effectiveness, CosTaL performed well consistently across all datasets with nearly the highest AMI, FF1, FMI and V-measure. The ARI scores for CosTaL were highest for all datasets, except for the E-MTAB-3321 and 1M_neurons datasets. Similarly, the HF1 scores for CosTaL were highest for all datasets except for GSE84133 and 1M_neurons. These findings are not surprising, as these scores are dominated by the number of clusters identified, which was lowest when GSE84133 and 1M_neurons were processed with CosTaL. On the other hand, CosTaL was the most efficient method for non-PCA-transformed data, as shown in Fig. 4(A). In addition, according to AMI scores, which are more robust to chance clustering [67], CosTaL is adequate for datasets like 1M_neurons, while outperforming other algorithms for each of the datasets. Lastly, in cases where clusters are under-partitioned, additional clustering iterations in CosTaL can be performed to improve overall clustering performance as evaluated by metrics such as ARI, without compromising AMI scores.
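For readers less familiar with the information-theoretic scores, the following minimal sketch computes mutual information and an arithmetic-mean-normalized NMI from two label lists. It is illustrative only and is not the evaluation code used in this study; the choice of arithmetic-mean normalization is an assumption (other normalizations exist), and the function names are hypothetical. AMI goes one step further by subtracting the mutual information expected by chance [67], which is omitted here for brevity.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy (in nats) of a labeling."""
    n = len(labels)
    return -sum(c / n * log(c / n) for c in Counter(labels).values())


def mutual_information(reference, predicted):
    """Mutual information (in nats) between two labelings of the same cells."""
    n = len(reference)
    ref_counts, pred_counts = Counter(reference), Counter(predicted)
    mi = 0.0
    for (r, p), c in Counter(zip(reference, predicted)).items():
        mi += c / n * log(c * n / (ref_counts[r] * pred_counts[p]))
    return mi


def normalized_mutual_information(reference, predicted):
    """MI divided by the arithmetic mean of the two entropies (assumed choice)."""
    denom = (entropy(reference) + entropy(predicted)) / 2
    if denom == 0:
        return 1.0  # both labelings place every cell in a single group
    return mutual_information(reference, predicted) / denom


print(normalized_mutual_information([0, 0, 1, 1], [1, 1, 0, 0]))  # same partition, renamed
print(normalized_mutual_information([0, 0, 1, 1], [0, 1, 0, 1]))  # statistically independent
```

With this arithmetic-mean normalization, NMI coincides with the V-measure when homogeneity and completeness are weighted equally.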

It is also worth noting that clustering results based on PCA-transformed data were not always better than those based on non-PCA-transformed data within the baseline algorithms. As an example, PCA-transformed data always outperformed non-transformed ones on the E-MTAB-3321 dataset, while non-transformed data always outperformed transformed data on GSE74672 and GSE84133 using all data metrics. As a means of eliminating any bias resulting from the selection of PCs, we used two mid-sized scRNA-seq datasets, GSE74672 and GSE84133, and evaluated the clustering performance of PhenoGraph, Scanpy and PARC with a range of PCs from 10 to 1990. The results are shown in Supplementary Fig. 4. The results indicate that PCA did not always improve clustering effectiveness, but instead helped shorten clustering time by reducing the dimension of the features used to map the kNN graph in analyzing scRNA-seq data.

We also measured the overall execution time, including both the preprocessing and clustering stages; the results are shown in Supplementary Fig. 5. The results indicate that, in most cases, CosTaL is the most efficient clustering method, with only two exceptions out of six where it was outperformed by Scanpy or PARC with the PCA transformation. The current architecture of the p-L2knng algorithm, which operates as a stand-alone executable and writes files to the hard disk, results in a significant performance bottleneck. For the 1M_neurons dataset, these processes account for, on average, more than 14% of the total clustering time. To address this issue, we are currently developing a new library that will directly integrate with the p-L2knng algorithm via Python C++ bindings. It is also worth noting that Scanpy was unable to process the 1M_neurons dataset in our computing environment, which raises some questions about its scalability compared with other tools. Given these findings, CosTaL is a highly appropriate choice for achieving both high efficiency and high effectiveness scores when clustering scRNA-seq datasets, including large-scale ones such as GSE110823 and 1M_neurons.

Parameter influences

There are primarily two parameters that might affect the results: the similarity metric used for mapping kNN and the number of nearest neighbors (k) to be mapped.

The first parameter that needs attention is k. When users have little prior knowledge, it is preferable for different k values to generate relatively stable results without over- or under-partitioning. In this regard, we conducted a comparative analysis using k ranging from 5 to 50 for each dataset. According to the results (Fig. 5(A)), CosTaL was capable of reaching a stable state at k around 10 or above, but PhenoGraph and Scanpy were generally stable at k around 25 or above. CosTaL, therefore, required a smaller k value, which further reduced the amount of time required for clustering. When k is small, the over-partitioned clusters identified by PhenoGraph and Scanpy may not be meaningful, as evidenced by the lower effectiveness scores as well as related studies [5, 36].

Figure 5.

(A) An analysis of the effect of k on the clustering results of PhenoGraph, Scanpy, PARC and CosTaL, using the Levine_32 dataset as an example. The effectiveness scores are shown as bars, and the number of identified clusters is marked by the line plot. The number of neighbors k lies in the range 5–50. The number of cell populations in the benchmark datasets is marked with a red line as a reference. (B) An analysis of the effect of the similarity metric on the clustering results of PhenoGraph, Scanpy, PARC and CosTaL, using the Levine_32 dataset as an example. Both Euclidean distance and cosine similarity were used as the proximity measures by PhenoGraph, Scanpy and PARC, while CosTaL only used cosine similarity. The values of k lie in the range 5–50.

While CosTaL only supports cosine similarity, other methods may be able to use both Euclidean distance and cosine similarity, the two metrics most widely used for clustering. From Fig. 5(B), we observed that the choice between cosine similarity and Euclidean distance as the proximity function could impact the results, even when the same algorithm is used. However, the differences in results were not as significant as the inter-algorithm differences observed among PhenoGraph, Scanpy and PARC. This suggests that, in determining the effectiveness of a clustering result versus a reference, the choice of clustering algorithm is more likely to influence effectiveness than the choice of similarity metric.
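A toy example illustrates why the proximity function can change the kNN graph in the first place. The two-marker vectors and names below are invented for illustration and do not come from any dataset in this study: cosine similarity compares the direction of the expression profiles, whereas Euclidean distance is also sensitive to overall intensity.

```python
from math import dist, sqrt


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))


def nearest(query, candidates, by):
    """Name of the candidate closest to `query` under the chosen measure."""
    if by == "cosine":
        return max(candidates,
                   key=lambda name: cosine_similarity(query, candidates[name]))
    if by == "euclidean":
        return min(candidates, key=lambda name: dist(query, candidates[name]))
    raise ValueError(f"unknown measure: {by}")


query = (1.0, 0.0)
candidates = {
    "same_profile_brighter": (10.0, 0.0),  # same marker ratio, 10x intensity
    "similar_intensity": (1.0, 1.0),       # close in magnitude, different ratio
}
print(nearest(query, candidates, by="cosine"))
print(nearest(query, candidates, by="euclidean"))
```

The two measures pick different nearest neighbors for the same query cell, so the resulting kNN graphs, and potentially the communities detected on them, can differ.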

DISCUSSION

The current cell detection technologies are trending toward collecting information from more cells with a greater number of feature parameters. According to citation frequencies, graph-based unsupervised methods are most preferred due to their scalability, relative speed and effectiveness [5, 13, 14, 17, 36]. With the aim of improving graph-based clustering methods and making them suitable for large datasets, this report introduces CosTaL as a strategy for clustering single-cell datasets. In comparison with other algorithms, CosTaL is among the top tier in terms of speed and effectiveness with superior scalability.

Generally, CosTaL is very efficient and produces slightly under-partitioned populations on both cytometry and scRNA-seq data. To measure effectiveness, we selected six well-established evaluation metrics for comparison purposes. According to the results, the CosTaL clustering algorithm tied with or outperformed the other algorithms in most cases.

Three factors contribute to CosTaL’s efficiency. The first factor is the utilization of the p-L2knng algorithm, which makes the kNN mapping stage extremely fast. The second factor is the use of smaller k values in the kNN graph generation step, which is enabled by the Tanimoto coefficient refinement. The final factor is the shortcut of using the pre-computed cosine similarities provided by p-L2knng when computing Tanimoto coefficient values, which further enhances the efficiency of the refinement process.
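To make the refinement step more concrete, the sketch below re-weights the edges of a small kNN graph with the Tanimoto coefficient, T(u, v) = u·v / (‖u‖² + ‖v‖² − u·v). This is only a sketch under a loud assumption: we take the vectors u and v to be the two cells' kNN lists weighted by their pre-computed cosine similarities, which this section does not spell out. The function names and the dict-of-dicts graph layout are hypothetical and do not reflect the p-L2knng output format.

```python
def tanimoto(u, v):
    """Tanimoto coefficient of two sparse vectors stored as {index: value}."""
    dot = sum(value * v[key] for key, value in u.items() if key in v)
    norm_u = sum(value * value for value in u.values())
    norm_v = sum(value * value for value in v.values())
    denom = norm_u + norm_v - dot
    return dot / denom if denom else 0.0


def reweight_edges(knn):
    """Replace each edge weight by the Tanimoto coefficient of the two
    endpoint rows (ASSUMPTION: rows hold cosine-weighted neighbor lists)."""
    return {
        i: {j: tanimoto(row, knn.get(j, {})) for j in row}
        for i, row in knn.items()
    }


# Toy graph: cell -> {neighbor: pre-computed cosine similarity}
knn = {
    0: {1: 0.9, 2: 0.8},
    1: {0: 0.9, 2: 0.7},
    2: {0: 0.8, 1: 0.7},
}
refined = reweight_edges(knn)
print(refined[0][1])
```

Under this assumption, the refinement only reads similarity values that already exist in the kNN graph, so no further pass over the original high-dimensional features is needed.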

In practice, clustering algorithms have more than a few parameters to be tuned. CosTaL simplifies the parameter tuning process, as only the number of nearest neighbors is of concern, which makes it more practical for processing large datasets. We found that CosTaL could generate optimally partitioned clusters when k has a value of 10 or above. As k increases, fewer clusters are detected, although the change in resolution is not drastic. The method allows the resolution to be controlled through the parameter k without extra trial-and-error effort, since the results remain relatively stable.

Specifically, for scRNA-seq data, CosTaL does not need a PCA transformation to reduce the dimensionality before clustering, further simplifying the utilization of CosTaL. Because of the efficiency of p-L2knng, CosTaL is able to map kNNs on the original feature space with 2000 highly variable genes in a short period of time, outperforming the other algorithms that require PCA transformation. Based on the clustering results, we conclude that PCA would better serve as a dimensional reduction method to ease the computational load rather than as an approach to improving the effectiveness of the clustering when dealing with scRNA-seq data. PCA is traditionally considered an effective method to reduce noise inside the sequencing data, especially for bulk analyses. However, we found that PCA does not always improve the clustering effectiveness in single-cell scenarios. This may be because the majority of the noise is eliminated during the quality control and gene selection phases, thereby not impacting the subsequent mapping of heterogeneity. As a consequence, the remaining highly variable genes would be able to accurately determine the similarities among the cells without the need for PCA transformation.

Nowadays, there are many novel single-cell technologies that can monitor phenotypic and functional markers based on either cytometry or single-cell sequencing platforms. Although we only examined the performance of CosTaL against other algorithms on cytometry and scRNA-seq datasets, where the parameters are associated with proteins or transcripts, CosTaL could readily be extended to the analysis of other mono-modal single-cell techniques like imaging MC [68] and scATAC-seq [69].

Key Points

  • CosTaL is an accurate and scalable graph-based clustering algorithm designed for analyzing single-cell data, like cytometry and scRNA-seq results.

  • CosTaL uses the p-L2knng algorithm for constructing an initial kNN graph and uses the Tanimoto coefficient to refine the graph.

  • For scRNA-seq datasets, CosTaL does not require a PCA transformation step and still provides better scalability over large-scale datasets without compromising the effectiveness of the clustering.

Supplementary Material

CosTaL_sup_bbad157

Contributor Information

Yijia Li, Department of Biochemistry, Molecular Biology, and Biophysics, University of Minnesota, 420 Washington Ave. S.E., Minneapolis, 55455, Minnesota, USA.

Jonathan Nguyen, Department of Computer Science and Engineering, Santa Clara University, 500 El Camino Real, Santa Clara, 95053, California, USA.

David C Anastasiu, Department of Computer Science and Engineering, Santa Clara University, 500 El Camino Real, Santa Clara, 95053, California, USA.

Edgar A Arriaga, Department of Biochemistry, Molecular Biology, and Biophysics, University of Minnesota, 420 Washington Ave. S.E., Minneapolis, 55455, Minnesota, USA; Department of Chemistry, University of Minnesota, Smith Hall, 139 Smith Hall, Pleasant St SE, Minneapolis, 55455, Minnesota, USA.

FUNDING

National Institutes of Health [R01-AG020866], the National Science Foundation [IIS-2002321], and the University of Minnesota [GIA University of Minnesota]. Access to research and computing facilities was provided by the Minnesota Supercomputing Institute at the University of Minnesota (https://www.msi.umn.edu).

AUTHOR CONTRIBUTIONS STATEMENT

Y.L. formulated the algorithm and conceived the experiments. Y.L., J.N. and D.A. developed the algorithms. Y.L., D.A. and E.A.A. wrote and reviewed the manuscript.

DATA AVAILABILITY

The Levine_13, Levine_32, Samusik_01, Samusik_all, Mosmann_rare and Nilsson_rare datasets are available from Flow Repository (repository FR-FCM-ZZPH) published by [36]. The Giordani_WT1 dataset is sourced from [39]. The E-MTAB-3321 dataset is available from BioStudies under the ID E-MTAB-3321. The GSE81861, GSE74672, GSE84133 and GSE110823 datasets are available from the Gene Expression Omnibus (GEO) repository under their respective accession numbers. The 1M_neurons dataset is available through 10X Genomics (see reference [47]). Original data files can be obtained through the references listed in Table 2.

References

  • 1. Regev A, Teichmann SA, Lander ES, et al. Science forum: the human cell atlas. Elife 2017;6:e27041.
  • 2. Bendall SC, Nolan GP, Roederer M, Chattopadhyay PK. A deep profiler’s guide to cytometry. Trends Immunol 2012;33(7):323–32.
  • 3. Spitzer MH, Nolan GP. Mass cytometry: single cells, many features. Cell 2016;165(4):780–91.
  • 4. Ziegenhain C, Vieth B, Parekh S, et al. Comparative analysis of single-cell rna sequencing methods. Mol Cell 2017;65(4):631–43.
  • 5. Liu X, Song W, Wong BY, et al. A comparison framework and guideline of clustering methods for mass cytometry data. Genome Biol 2019;20(1):1–18.
  • 6. Duò A, Robinson MD, Soneson C. A systematic performance evaluation of clustering methods for single-cell rna-seq data. F1000Research 2018;7.
  • 7. Qiu P, Simonds EF, Bendall SC, et al. Extracting a cellular hierarchy from high-dimensional cytometry data with spade. Nat Biotechnol 2011;29(10):886–91.
  • 8. Tian T, Zhang J, Lin X, et al. Model-based deep embedding for constrained clustering analysis of single cell rna-seq data. Nat Commun 2021;12(1):1–12.
  • 9. Samusik N, Good Z, Spitzer MH, et al. Automated mapping of phenotype space with single-cell data. Nat Methods 2016;13(6):493–6.
  • 10. Zeisel A, Muñoz-Manchado AB, Codeluppi S, et al. Cell types in the mouse cortex and hippocampus revealed by single-cell rna-seq. Science 2015;347(6226):1138–42.
  • 11. Guan J, Li R-Y, Wang J. Grace: a graph-based cluster ensemble approach for single-cell rna-seq data clustering. IEEE Access 2020;8:166730–41.
  • 12. Wan S, Kim J, Won KJ. Sharp: hyperfast and accurate processing of single-cell rna-seq data via ensemble random projection. Genome Res 2020;30(2):205–13.
  • 13. Liu P, Liu S, Fang Y, et al. Recent advances in computer-assisted algorithms for cell subtype identification of cytometry data. Front Cell Dev Biol 2020;8:234.
  • 14. Kiselev VY, Andrews TS, Hemberg M. Challenges in unsupervised clustering of single-cell rna-seq data. Nat Rev Genet 2019;20(5):273–82.
  • 15. Stassen SV, Siu DMD, Lee KCM, et al. Parc: ultrafast and accurate clustering of phenotypic data of millions of single cells. Bioinformatics 2020;36(9):2778–86.
  • 16. Feng C, Liu S, Zhang H, et al. Dimension reduction and clustering models for single-cell rna sequencing data: a comparative study. Int J Mol Sci 2020;21(6):2181.
  • 17. Cheung M, Campbell JJ, Whitby L, et al. Current trends in flow cytometry automated data analysis software. Cytometry A 2021.
  • 18. Tsuyuzaki K, Sato H, Sato K, Nikaido I. Benchmarking principal component analysis for large-scale single-cell rna-sequencing. Genome Biol 2020;21(1):1–17.
  • 19. Krzak M, Raykov Y, Boukouvalas A, et al. Benchmark and parameter sensitivity analysis of single-cell rna sequencing clustering methods. Front Genet 2019;10:1253.
  • 20. Peng L, Tian X, Tian G, et al. Single-cell rna-seq clustering: datasets, models, and algorithms. RNA Biol 2020;17(6):765–83.
  • 21. Li R, Guan J, Zhou S. Single-cell rna-seq data clustering: a survey with performance comparison study. J Bioinform Comput Biol 2020;18(04):2040005.
  • 22. Kim T, Chen IR, Lin Y, et al. Impact of similarity metrics on single-cell rna-seq data clustering. Brief Bioinform 2019;20(6):2316–26.
  • 23. Radicchi F, Castellano C, Cecconi F, et al. Defining and identifying communities in networks. Proc Natl Acad Sci 2004;101(9):2658–63.
  • 24. Levine JH, Simonds EF, Bendall SC, et al. Data-driven phenotypic dissection of aml reveals progenitor-like cells that correlate with prognosis. Cell 2015;162(1):184–97.
  • 25. McInnes L, Healy J, Saul N, Großberger L. Umap: uniform manifold approximation and projection. J. Open Source Softw 2018;3(29):861.
  • 26. Alexander Wolf F, Angerer P, Theis FJ. Scanpy: large-scale single-cell gene expression data analysis. Genome Biol 2018;19(1):1–5.
  • 27. Ranjan B, Sun W, Park J, et al. Dubstepr is a scalable correlation-based feature selection method for accurately clustering single-cell data. Nat Commun 2021;12(1):5849.
  • 28. Nikolas Barkas V, Petukhov PK, Biederstedt E. pagoda2: single cell analysis and differential expression. R package version 2021;102.
  • 29. Abdelaal T, Raadt de P, Lelieveldt BPF, et al. Schnel: scalable clustering of high dimensional single-cell data. Bioinformatics 2020;36(Supplement_2):i849–56.
  • 30. Zheng GXY, Terry JM, Belgrader P, et al. Massively parallel digital transcriptional profiling of single cells. Nat Commun 2017;8(1):14049.
  • 31. Yan W, Zhang K. Tools for the analysis of high-dimensional single-cell rna sequencing data. Nat Rev Nephrol 2020;16(7):408–21.
  • 32. Anastasiu DC, Karypis G. L2knng: fast exact k-nearest neighbor graph construction with l2-norm pruning. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, pp. 791–800, Melbourne, Australia, 2015.
  • 33. Traag VA, Waltman L, Van Eck NJ. From louvain to Leiden: guaranteeing well-connected communities. Sci Rep 2019;9(1):1–12.
  • 34. Dong W, Moses C, Li K. Efficient k-nearest neighbor graph construction for generic similarity measures. In Proceedings of the 20th international conference on World wide web, 577–86, 2011.
  • 35. Malkov YA, Yashunin DA. Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. IEEE Trans Pattern Anal Mach Intell 2018;42(4):824–36.
  • 36. Weber LM, Robinson MD. Comparison of clustering methods for high-dimensional single-cell flow and mass cytometry data. Cytometry A 2016;89(12):1084–96.
  • 37. Stuart T, Butler A, Hoffman P, et al. Comprehensive integration of single-cell data. Cell 2019;177(7):1888–902.
  • 38. Hao Y, Hao S, Andersen-Nissen E, et al. Integrated analysis of multimodal single-cell data. Cell 2021;184(13):3573–87.
  • 39. Giordani L, He GJ, Negroni E, et al. High-dimensional single-cell cartography reveals novel skeletal muscle-resident cell populations. Mol Cell 2019;74(3):609–21.
  • 40. Mosmann TR, Naim I, Rebhahn J, et al. Swift-scalable clustering for automated identification of rare cell populations in large, high-dimensional flow cytometry datasets, part 2: biological evaluation. Cytometry A 2014;85(5):422–33.
  • 41. Nilsson AR, Bryder D, Pronk CJH. Frequency determination of rare populations by flow cytometry: a hematopoietic stem cell perspective. Cytometry A 2013;83(8):721–7.
  • 42. Goolam M, Scialdone A, Graham SJL, et al. Heterogeneity in oct4 and sox2 targets biases cell fate in 4-cell mouse embryos. Cell 2016;165(1):61–74.
  • 43. Li H, Courtois ET, Sengupta D, et al. Reference component analysis of single-cell transcriptomes elucidates cellular heterogeneity in human colorectal tumors. Nat Genet 2017;49(5):708–18.
  • 44. Romanov RA, Zeisel A, Bakker J, et al. Molecular interrogation of hypothalamic organization reveals distinct dopamine neuronal subtypes. Nat Neurosci 2017;20(2):176–88.
  • 45. Baron M, Veres A, Wolock SL, et al. A single-cell transcriptomic map of the human and mouse pancreas reveals inter-and intra-cell population structure. Cell Syst 2016;3(4):346–60.
  • 46. Rosenberg AB, Roco CM, Muscat RA, et al. Single-cell profiling of the developing mouse brain and spinal cord with split-pool barcoding. Science 2018;360(6385):176–82.
  • 47. 10x genomics Inc. 1.3 million brain cells from e18 mice. https://support.10xgenomics.com/single-cell-gene-expression/datasets/1.3.0/1M_neurons, 2017.
  • 48. Lin P-C, Zhao W-L. Graph based nearest neighbor search: promises and failures. arXiv preprint arXiv:1904.02077 2019.
  • 49. Fu C, Xiang C, Wang C, Cai D. Fast approximate nearest neighbor search with the navigating spreading-out graph. arXiv preprint arXiv:1707.00143 2017.
  • 50. De Berg M, Van Kreveld M, Overmars M, Schwarzkopf O. Computational geometry. In: Computational geometry. Springer, 1997, 1–17.
  • 51. Liu T, Moore AW, Gray A. Efficient exact k-nn and nonparametric classification in high dimensions. In Proceedings of the 16th International Conference on Neural Information Processing Systems, pp. 265–72, Whistler, British Columbia, Canada, 2003.
  • 52. Anastasiu DC, Karypis G. Fast parallel cosine k-nearest neighbor graph construction. In 2016 6th Workshop on Irregular Applications: Architecture and Algorithms (IA3), IA3 2016, pp. 50–3. Utah, USA, IEEE, 2016.
  • 53. Strehl A, Ghosh J, Mooney R. Impact of similarity measures on web-page clustering. In Workshop on artificial intelligence for web search (AAAI 2000), Austin, Texas, USA, 2000;58:64.
  • 54. Huang A. Similarity measures for text document clustering. In Proceedings of the sixth New Zealand computer science research student conference (NZCSRSC2008), Christchurch, New Zealand, 2008;4:9–56.
  • 55. Bajusz D, Rácz A, Héberger K. Why is tanimoto index an appropriate choice for fingerprint-based similarity calculations? J Chem 2015;7(1):1–13.
  • 56. Curran JR, Moens M. Improvements in automatic thesaurus extraction. In Proceedings of the ACL-02 workshop on Unsupervised lexical acquisition, Philadelphia, Pennsylvania, USA, 2002, 59–66.
  • 57. Newman MEJ, Girvan M. Finding and evaluating community structure in networks. Phys Rev E 2004;69(2):026113.
  • 58. Blondel VD, Guillaume J-L, Lambiotte R, Lefebvre E. Fast unfolding of communities in large networks. J Stat Mech Theory Exp 2008;2008(10):P10008.
  • 59. Fortunato S, Barthelemy M. Resolution limit in community detection. Proc Natl Acad Sci 2007;104(1):36–41.
  • 60. Qiu P. Embracing the dropouts in single-cell rna-seq analysis. Nat Commun 2020;11(1):1–9.
  • 61. Amblard E, Bac J, Chervov A, et al. Hubness reduction improves clustering and trajectory inference in single-cell transcriptomic data. Bioinformatics 2022;38(4):1045–51.
  • 62. Kim TH, Zhou X, Chen M. Demystifying “drop-outs” in single-cell umi data. Genome Biol 2020;21(1):196.
  • 63. Arinik N, Labatut V, Figueiredo R. Characterizing and comparing external measures for the assessment of cluster analysis and community detection. IEEE Access 2021;9:20255–76.
  • 64. Hennig C, Meila M, Murtagh F, Rocci R. Handbook of cluster analysis. CRC Press, 2015.
  • 65. Wagner S, Wagner D. Comparing clusterings: an overview. Fakultät für Informatik Karlsruhe: Universität Karlsruhe, 2007.
  • 66. Guyeux C, Chrétien S, Tayeh GB, et al. Introducing and comparing recent clustering methods for massive data management in the internet of things. J Sens Actuator Netw 2019;8(4):56.
  • 67. Vinh NX, Epps J, Bailey J. Information theoretic measures for clusterings comparison: is a correction for chance necessary? In Proceedings of the 26th annual international conference on machine learning, Montreal, Quebec, Canada, 2009, 1073–80.
  • 68. Chevrier S, Crowell HL, Zanotelli VRT, et al. Compensation of signal spillover in suspension and imaging mass cytometry. Cell Syst 2018;6(5):612–20.
  • 69. Fang R, Preissl S, Li Y, et al. Comprehensive analysis of single cell atac-seq data with snapatac. Nat Commun 2021;12(1):1–15.


Articles from Briefings in Bioinformatics are provided here courtesy of Oxford University Press
