Abstract
Background
Differences in data distribution, feature dimensions, and quality between different single-cell modalities pose challenges for clustering. Although clustering algorithms have been developed for single-cell transcriptomic or proteomic data, their performance across different omics data types and integration scenarios remains poorly investigated, which limits the selection of methods and future method development.
Results
In this study, we conduct a systematic and comparative benchmark analysis of 28 computational algorithms on 10 paired transcriptomic and proteomic datasets, evaluating their performance across various metrics in terms of clustering, peak memory, and running time. We also discuss the impact of highly variable genes (HVGs) and cell type granularity on clustering performance. Additionally, the robustness of these clustering methods on the two omics types is evaluated using 30 simulated datasets. Furthermore, to explore the benefits of integrating omics information for clustering tasks, we integrate single-cell transcriptomic and proteomic data using 7 state-of-the-art integration methods and assess the performance of existing single-omics clustering schemes on the integrated features.
Conclusions
Our findings reveal modality-specific strengths and limitations, highlight the complementary nature of existing methods, and provide actionable insights to guide the selection of appropriate clustering approaches for specific scenarios. Overall, for top performance across the two omics, consider scAIDE, scDCC, and FlowSOM, with FlowSOM also offering excellent robustness. For users prioritizing memory efficiency, scDCC and scDeepCluster are recommended, while TSCAN, SHARP, and MarkovHC are recommended for users who prioritize time efficiency, and community detection-based methods offer a balance between the two.
Supplementary Information
The online version contains supplementary material available at 10.1186/s13059-025-03719-y.
Keywords: Single-cell clustering, Transcriptomics, Proteomics, Feature integration
Background
Single-cell omics sequencing technology has revolutionized our ability to profile gene or protein expression in individual cells across large cell populations, thereby enabling more precise cell type classification, deeper insights into developmental and differentiation processes, and a better understanding of cellular heterogeneity in disease onset and progression[1–3]. Among the various single-cell omics modalities, single-cell proteomics is particularly valuable because it quantifies protein abundance, providing intuitive and crucial phenotypic information not attainable from transcriptomic analysis alone[4–7]. Antibody-based single-cell proteomics, in particular, is a mature and robust approach that leverages the specific binding of antibodies to target proteins to precisely quantify protein expression, thereby revealing cellular heterogeneity and functional diversity[8]. Despite these advantages, single-cell proteomic data often exhibit markedly different data distributions and feature dimensionalities compared to transcriptomic data, posing non-trivial challenges for applying clustering techniques uniformly across the two omics modalities[9, 10].
Clustering is a fundamental step in single-cell data analysis for delineating cellular heterogeneity[11, 12]. Significant progress has been made in clustering methods for single-cell transcriptomic data, from classical machine learning-based and community detection-based algorithms to modern deep learning approaches[13–15]. However, relatively few studies have focused on developing clustering methods specifically tailored for single-cell proteomic data. In principle, both transcriptomic and proteomic single-cell datasets can be represented as high-dimensional feature matrices, suggesting that many clustering algorithms could be applied across both modalities. Nevertheless, most existing benchmarking efforts in this area are either outdated[16] or focused predominantly on a single modality[17], and none provides a comprehensive cross-modal evaluation of clustering performance.
Recent technological advancements have enabled the simultaneous measurement of multiple modalities in single cells[18, 19]. Techniques such as CITE-seq[8], ECCITE-seq[20], and Abseq[21] employ oligonucleotide-labeled antibodies to simultaneously quantify mRNA and surface protein levels in individual cells, generating paired transcriptomic and proteomic datasets. Such paired data describes both the transcriptome landscape and the proteome landscape for the same cellular microenvironment and biological mechanisms, providing an ideal foundation for benchmarking clustering methods across different modalities.
In this study, we first present a comprehensive evaluation of single-cell clustering algorithms in both transcriptomic and proteomic contexts. Specifically, we benchmarked 28 clustering algorithms across 10 paired single-cell transcriptomic and proteomic datasets (Fig. 1), assessing their particular performance for each modality according to Adjusted Rand Index (ARI)[22], Normalized Mutual Information (NMI)[23], Clustering Accuracy (CA)[24], Purity[25], Peak Memory, and Running Time. This comprehensive benchmarking offered a practical user guide for selecting clustering algorithms suitable for single-cell transcriptomics and proteomics.
Fig. 1.
Pipeline and data. a The pipeline of the benchmark study. b The datasets used in this study. The number of cells, cell types, and features for paired transcriptomic and proteomic data are provided. Detailed information on the data sources is presented in Methods
Secondly, we evaluated the robustness of these methods and examined the additional factors that may affect their performance. Specifically, we investigated the impact of highly variable genes (HVGs) and cell type granularity on clustering performance. By utilizing 30 simulated datasets, we then assessed how varying noise levels and dataset sizes influence clustering outcomes. Finally, we employed 7 feature integration methods to fuse paired single-cell transcriptomic and proteomic data, thereby extending existing single-omics clustering algorithms to multi-omics scenarios. An advanced performance evaluation was conducted on the integrated feature space, further estimating the effect of multimodal integration on clustering, and providing nuanced guidance on choosing feature integration methods and clustering algorithms for given multi-omics scenarios.
Overall, this study delivers a profound contribution to single-cell clustering analysis, providing a detailed evaluation of transcriptomic and proteomic data, as well as their integrated features, thus promoting the extension of existing single-omics clustering algorithms to other single-omics or multi-omics applications. The fundamental rationale for using different datasets in specific experiments is detailed in Additional file 1: Note S1. We anticipate that our findings will spark further interest and guide the refinement of existing methods, as well as the development of new methods for increasingly complex single-cell studies.
Results
Algorithms and datasets
To evaluate the performance of various single-cell clustering algorithms in their respective applicable omics as well as extended omics (Fig. 1a), the selection of paired omics data is crucial. SPDB[26] represents the largest single-cell proteomic database, providing access to the most extensive and up-to-date collection of datasets. We downloaded 9 datasets from SPDB and 1 dataset from Seurat v3[27], the latter of which includes cell type labels at different levels of granularity. In total, we obtained 10 real datasets across 5 tissue types, encompassing over 50 cell types and more than 300,000 cells (Fig. 1b and Additional file 2: Tables S1-11), each containing paired single-cell mRNA expression and surface protein expression data. These datasets were obtained using multi-omics technologies such as CITE-seq, ECCITE-seq, and Abseq. These paired multi-omics datasets, obtained by measuring gene or protein expression within the same set of cells, reflect identical biological conditions across the two omics. This consistency facilitates a comparable analysis of clustering algorithms, improving the reliability and comparability of the resulting analyses.
In benchmark research, the diversity of computational methods is essential. This study considers a total of 28 clustering algorithms (detailed in Methods), including 15 classical machine learning-based methods, 6 community detection-based methods, and 7 deep learning-based methods. Most of these methods were developed after 2020, representing recent advancements in research, while also incorporating some of the most classic single-cell clustering methods from earlier studies. The classical machine learning-based methods include SC3[28], FFC[29], CDC[30], CIDR[31], Celda[32], SIMLR[33], scLCA[34], scSHC[35], DR-SC[36], TSCAN[37], SHARP[38], FlowSOM[39], Spectrum[40], MarkovHC[41], and DEPECHER[42]. The community detection-based methods are PARC[43], Leiden[44], Louvain[45], SCHNEL[46], Monocle3[47], and PhenoGraph[48]. The deep learning-based methods are DESC[49], scDCC[50], scGNN[51], scAIDE[52], CarDEC[53], scziDesk[54], and scDeepCluster[55]. Among these methods, 18 designs utilized only transcriptomic data, 8 designs incorporated both transcriptomic and proteomic data, and 2 designs were based solely on proteomic data. For each method, we performed clustering on both types of omics data, thereby extending the applicability of these methods across omics. Comprehensive benchmarking of these clustering methods based on different technologies on paired omics data will provide effective user guidance for researchers in the field.
To overcome the limitations of single-cell single-omics, bioinformaticians have proposed and developed various feature integration methods that combine data from different omics[56]. In this study, we utilized 7 feature integration methods, all developed after 2020, representing the current state-of-the-art solutions. These methods are moETM[57], sciPENN[58], scMDC[59], totalVI[60], JTSNE[61], JUMAP[61], and MOFA+[62]. Using these methods, we integrated paired transcriptomic and proteomic data, and further extended single-omics clustering methods to the multi-omics level.
In this benchmark study, the diversity of clustering algorithms ensures comprehensiveness, guiding biologists in selecting the best methods and aiding developers in improving state-of-the-art techniques. Among existing single-cell clustering algorithms, most are developed for single omics, particularly in the field of single-cell transcriptomics. However, clustering algorithms specifically designed for single-cell proteomic data are scarce, limiting the options available to researchers in this field. This benchmark study extends the applicability of all 28 methods, providing researchers with guidelines for applying existing methods to their studies, especially in single-cell proteomics. Furthermore, benchmarking studies that integrate paired omics features as input for clustering algorithms will offer researchers richer references, aiding them in choosing the optimal solutions.
Benchmarking analysis on paired transcriptomic and proteomic data
The development of clustering methods for single-cell data often involves different emphases on various omics. Some approaches are specifically designed for transcriptomic data, while others are tailored for proteomic data, with certain methods being compatible with both. In Fig. 2, we summarize the types of omics data utilized during the development of each method. Additionally, we evaluated the performance of these methods using 10 paired transcriptomic and proteomic datasets.
Fig. 2.
Overall performance. The method information includes the method name, omics type (a black diamond star indicates that the corresponding omics data was utilized during the development of the clustering method, RNA icon represents transcriptomics, protein icon represents proteomics, while a gray diamond star indicates that it was not), platform (Python, R, or both platforms are supported). Rank includes the comprehensive rankings in transcriptomics and proteomics separately, where 24 methods were ranked on 10 real datasets, and due to the limited scalability of the 4 methods, the rankings were conducted individually on downsampled datasets. Accuracy includes ARI, NMI, CA, and Purity. Scalability includes Peak Memory and Running Time. The methods are categorized by color schemes: blue represents community detection-based methods, red represents deep learning-based methods, and brown represents classical machine learning-based methods. A total of 10 real datasets were employed, each containing paired transcriptomic and proteomic features. The performance of each clustering method was evaluated separately for the two omics data. Due to the inability of SIMLR, SC3, CIDR, and Spectrum to handle large-scale datasets, each dataset was downsampled to 10,000 cells during the experiments. The comprehensive ranking and specific downsampling strategies are detailed in Methods
In this study, ARI and NMI serve as the primary metrics for quantifying clustering performance. ARI quantifies clustering quality by comparing predicted and ground truth labels, with values from −1 to 1. NMI measures the mutual information between clustering and ground truth, normalized to [0, 1]. In both cases, values closer to 1 indicate better clustering performance.
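As a minimal sketch of how these two metrics can be computed from a pair of label vectors, the following pure-Python functions implement the standard contingency-table formulas. The function names are illustrative, and normalizing NMI by the arithmetic mean of the two entropies is an assumption: several normalizations are in common use, and this section does not state which one was applied.

```python
from collections import Counter
from math import comb, log


def adjusted_rand_index(true_labels, pred_labels):
    """ARI from pair counts; 1 for identical partitions, about 0 for random ones."""
    n = len(true_labels)
    joint = Counter(zip(true_labels, pred_labels))
    sum_joint = sum(comb(c, 2) for c in joint.values())
    sum_true = sum(comb(c, 2) for c in Counter(true_labels).values())
    sum_pred = sum(comb(c, 2) for c in Counter(pred_labels).values())
    expected = sum_true * sum_pred / comb(n, 2)
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:  # degenerate case, e.g. a single cluster on both sides
        return 1.0
    return (sum_joint - expected) / (max_index - expected)


def entropy(labels):
    """Shannon entropy (natural log) of a label vector."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def normalized_mutual_information(true_labels, pred_labels):
    """NMI normalized by the arithmetic mean of the entropies (an assumed choice)."""
    n = len(true_labels)
    joint = Counter(zip(true_labels, pred_labels))
    true_counts = Counter(true_labels)
    pred_counts = Counter(pred_labels)
    mutual_info = sum(
        (c / n) * log(c * n / (true_counts[t] * pred_counts[p]))
        for (t, p), c in joint.items()
    )
    denominator = (entropy(true_labels) + entropy(pred_labels)) / 2
    return mutual_info / denominator if denominator > 0 else 1.0


# Same partition with permuted cluster names: both metrics ignore label names.
same_partition = ([0, 0, 1, 1, 2, 2], [1, 1, 0, 0, 2, 2])
# A clustering that splits every true type evenly: ARI becomes negative.
crossed_partition = ([0, 0, 1, 1], [0, 1, 0, 1])
```

Both functions depend only on how cells are grouped, not on the cluster names, which is why no label matching step is needed before scoring.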
We ranked the performance of 24 clustering methods based on a strategy detailed in Methods. As shown in Fig. 2 and Additional file 2: Tables S12-15, the top three methods for transcriptomic data are scDCC, scAIDE, and FlowSOM. Interestingly, these same methods also perform best for proteomic data, though in a slightly different order: scAIDE ranks first, followed by scDCC and FlowSOM. This consistency suggests that these three methods exhibit strong performance and generalization across different omics.
In transcriptomics, CarDEC and PARC ranked 4th and 5th, respectively, but their rankings dropped significantly in proteomics, with CarDEC falling to 16th and PARC to 18th. In contrast, Celda ranked 15th for transcriptomics but performed notably better for proteomics, ranking 7th. These discrepancies may be partly explained by the inherent differences in the distributional properties of single-cell transcriptomic and proteomic data. Specifically, the design of these methods likely involves assumptions about data distribution that are more applicable to transcriptomic data than to proteomic data, or vice versa.
Due to the limited scalability of SIMLR, SC3, CIDR, and Spectrum, we downsampled the 10 real datasets in our experiments, following the strategies described in Methods. On the downsampled datasets, SIMLR consistently achieved the highest comprehensive rank among the four methods across both types of omics data. However, we encountered certain limitations with SIMLR during our experiments, including system crashes when analyzing Data1, which prevented us from obtaining results for that dataset.
This study involves 28 clustering methods, some of which support specifying the number of clusters, while others automatically identify the optimal number of clusters through alternative techniques (such as Louvain, Leiden, and methods developed based on them). For those methods that allow the specification of the number of clusters, we assigned the true number of cell types in the data as this parameter. Considering that the actual number of clusters may be unknown in real-world scenarios, we conducted further experiments on Data2, which contains 10 cell types. We assigned values of 6, 8, 10, 12, and 14 to the clustering parameter for relevant algorithms to assess their robustness regarding this parameter. As shown in Additional file 1: Fig. S1, under the ARI metric, scAIDE, FlowSOM, and SHARP demonstrated insensitivity to the clustering parameter in transcriptomic data, whereas in proteomic data, FlowSOM, Celda, and scDeepCluster exhibited good robustness to this parameter. The robustness of these methods in CA was similar to their performance in ARI, while under NMI and Purity metrics, we observed that most methods exhibited considerable stability. Overall, combining the results from the previous experiments, FlowSOM stands out as a method with excellent performance and strong robustness regarding the clustering parameter.
To validate the ability of clustering methods to identify rare cell types, we followed the experimental design of Xie et al.[52] and constructed a series of simulation experiments based on Data5, using the F1-score as the evaluation metric. The detailed procedure is described in Methods. The t-SNE visualizations of these datasets, based on both single-cell transcriptomic and proteomic features, are shown in Additional file 1: Fig. S2.
The experimental results are presented in Additional file 1: Fig. S3. scAIDE demonstrated outstanding rare cell type detection performance under all tested conditions of omics features and rare cell proportions. On transcriptomic data, scAIDE consistently ranked among the top two performers in terms of F1-score, achieving no less than 0.978 across all rare cell proportions. On proteomic data, scAIDE performed even more impressively, achieving an F1-score of 1.000 under multiple settings. Even under the most challenging condition (0.5% rare cell proportion), it achieved a high score of 0.978, demonstrating remarkable sensitivity and precision in identifying extremely low-abundance cell populations. Apart from scAIDE, SHARP and MarkovHC also performed well in most settings. Spectrum and SCHNEL exhibited strong performance on proteomic data, but failed to effectively detect rare cells on transcriptomic data. Further analysis revealed that both the type of omics data and the proportion of rare cells significantly affect rare cell detection performance. Overall, proteomic data yielded better performance than transcriptomic data, particularly under extremely low rare cell proportions (e.g., 0.5% and 1.0%). This may be attributed to the fact that proteomic features more directly reflect cellular functional states under specific conditions, thus aiding in the identification of immune cell subtypes. Additionally, when the rare cell proportion was only 0.5%, the detection ability of many clustering algorithms substantially declined, and some even failed completely, indicating that most current methods still face limitations in detecting extremely rare populations. However, as the proportion of rare cells increased, the F1-scores of many methods improved significantly, showing a clear trend of performance recovery.
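To make the F1-based evaluation concrete, the sketch below scores how well a clustering recovers one rare cell type. It assumes that the rare population is matched to whichever predicted cluster gives the highest F1-score; this matching rule and the function name are illustrative assumptions, and the procedure actually used in the study is the one described in Methods.

```python
from collections import Counter


def rare_cell_f1(true_labels, pred_labels, rare_type):
    """Best F1-score over predicted clusters for recovering `rare_type`."""
    n_rare = sum(1 for t in true_labels if t == rare_type)
    cluster_sizes = Counter(pred_labels)
    # Number of rare cells that fall into each predicted cluster.
    overlap = Counter(p for t, p in zip(true_labels, pred_labels) if t == rare_type)
    best = 0.0
    for cluster, true_positives in overlap.items():
        precision = true_positives / cluster_sizes[cluster]
        recall = true_positives / n_rare
        best = max(best, 2 * precision * recall / (precision + recall))
    return best


# Toy data: 1000 cells with 5 rare cells (0.5%); the clustering isolates 4 of them.
toy_true = ["common"] * 995 + ["rare"] * 5
toy_pred = [0] * 996 + [1] * 4
toy_f1 = rare_cell_f1(toy_true, toy_pred, "rare")
```

In this toy case the best cluster has a precision of 1 and a recall of 0.8, so missing a single rare cell already lowers the F1-score to about 0.89, which shows how sensitive the metric is at a 0.5% proportion.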
Beyond clustering performance, we also offer recommendations for users with specific concerns regarding memory usage or time efficiency. For users prioritizing memory efficiency over time consumption, deep learning-based methods like scDCC and scDeepCluster are recommended. Nonetheless, it is important to note that while these methods may exhibit lower peak memory usage in some cases, this is often due to the efficient offloading of data to Graphics Processing Unit (GPU) memory rather than a reduction in overall memory demand. As a result, the apparent memory efficiency of these methods may be situation-dependent and influenced by hardware configurations, particularly the availability and capacity of GPU resources. On the other hand, for users who prioritize time efficiency over memory consumption, some classical machine learning-based methods like TSCAN, SHARP, and MarkovHC are suitable due to their relatively fast execution times. However, it is important to note that not all classical machine learning-based methods are computationally efficient, as some can be significantly slower, such as SC3, CIDR, and Spectrum. For those who need to balance both time and memory constraints, community detection-based methods like Louvain, Leiden, and their derivatives are recommended, as they exhibit low memory usage while maintaining competitive computational speeds. The performance (ARI, NMI, CA, Purity, Peak Memory, and Running Time) of all 28 methods on 10 real datasets is presented in Additional file 1: Figs. S4-S13.
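For readers who want to collect comparable figures for their own runs, the following sketch wraps a function call and reports wall-clock time and peak traced memory using the Python standard library. It is not the measurement tooling used in the study; `profile_call` and `toy_workload` are hypothetical names. Note that `tracemalloc` only sees allocations made through Python's memory allocator, so GPU memory and some native-library allocations are not captured, which is the caveat raised above for deep learning-based methods.

```python
import time
import tracemalloc


def profile_call(func, *args, **kwargs):
    """Run func and return (result, elapsed_seconds, peak_mebibytes)."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = func(*args, **kwargs)
    finally:
        elapsed = time.perf_counter() - start
        _, peak_bytes = tracemalloc.get_traced_memory()
        tracemalloc.stop()
    return result, elapsed, peak_bytes / 2**20


def toy_workload(n_items):
    """Stand-in for a clustering call; allocates a list of n_items entries."""
    return [0] * n_items


result, seconds, peak_mib = profile_call(toy_workload, 1_000_000)
```

For R-based methods or GPU workloads, an external process-level monitor would be needed instead of this in-process approach.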
Performance comparison and tendency between the two omics
To provide a more intuitive comparison of the tendency of each method across the two omics, we visualized the performance (measured by ARI and NMI) of each method using pie charts for each dataset in Fig. 3. These pie charts depict the relative preference of each method for transcriptomic or proteomic data. The proportions in the pie charts represent this tendency, while the intensity of the colors indicates the magnitude of the performance values, with darker shades reflecting higher values and lighter shades representing lower values.
Fig. 3.
Comparative evaluation on paired transcriptomics and proteomics. The method information includes the method name, omics type (a black diamond star indicates that the corresponding omics data was utilized during the development of the clustering method, RNA icon represents transcriptomics, protein icon represents proteomics, while a gray diamond star indicates that it was not), platform (Python, R, or both platforms are supported). The two metrics are visualized using pie charts, which respectively illustrate the ARI and NMI of each method across the paired omics features in the 10 real datasets. Each pie chart consists of two segments representing the ARI or NMI for transcriptomics and proteomics for a given method on a specific dataset. Red and blue represent the ARI for transcriptomics and proteomics, respectively, while yellow and green represent the NMI for the two omics. The proportions in the pie chart represent the proportion of values (either ARI or NMI) obtained by each method on transcriptomic and proteomic data. The intensity of the colors reflects the magnitude of these values, with darker shades indicating higher values and lighter shades indicating lower values. The score quantifies each method’s tendency towards either of the two omics, with positive values indicating a preference for transcriptomics and negative values indicating a preference for proteomics, as detailed in Methods
Several key insights can be drawn from the results in the tendency score of Fig. 3, with the score calculation detailed in Methods. In terms of ARI and NMI, nearly all methods exhibit a clear preference for transcriptomic data, with only a few methods showing a slight advantage towards proteomic data. Similar conclusions were obtained when using CA and Purity (Additional file 1: Fig. S14). Among the top-performing methods identified earlier, scDCC, scAIDE, and FlowSOM, both scDCC and scAIDE tend to favor transcriptomic data, while FlowSOM, despite its negative score, shows no significant tendency toward any particular omics type due to the small magnitude of its absolute value.
The overall performance of these clustering methods on the 10 real datasets tends to favor transcriptomic data or shows no clear omics preference, with almost no methods displaying a strong tendency toward proteomic data. We attribute this phenomenon not only to the design of the clustering algorithms but also to the intrinsic characteristics of the datasets. In paired transcriptomic and proteomic data, transcriptomic features typically have tens of thousands of dimensions, whereas proteomic features only contain dozens to hundreds. This difference in dimensionality allows transcriptomic data to capture more comprehensive information, leading to improved performance in characterizing cell types and conducting clustering analyses.
Impact of gene selection and hierarchical granularity of cell types on clustering
The results of the previous experiment indicate that, in many instances, transcriptomic features, due to their greater quantity, exhibit superior performance in the characterization of cell types compared to proteomic features. However, high-dimensional transcriptomic data also presents challenges, such as redundancy and noise. In this section, we first examine the impact of whether HVGs are used. During the development of various clustering methods, some use all genes as input, while others focus on HVGs. To explore the impact of these two input strategies on clustering performance, we conducted experiments using Data2. Figure 4a provides a visualization of this dataset, along with the top three marker genes for each cell type, which were obtained using the FindAllMarkers function in Seurat[63].
Fig. 4.
The impact of whether HVGs are used and T cell subtypes identification. a Visualization of Data2, generated by applying t-SNE on transcriptomic features. Additionally, the top 3 marker genes for each cell type are visualized. b Performance comparison (ARI, NMI, CA, and Purity) of various clustering methods when using all genes versus using HVGs as input. In the radar chart, blue represents the performance using all genes as input, while red represents the performance using HVGs as input. c Visualization of the Data5 based on two groups of cell type labels with different levels of granularity, generated by applying t-SNE on transcriptomic features and proteomic features, respectively. d Performance comparison (ARI, NMI) of various clustering algorithms on Data5 and a subset containing only T cell subtypes, using transcriptomic and proteomic features as inputs. Green represents the performance using transcriptomic features as input, while blue represents the performance using proteomic features as input
In our experiments, clustering algorithms were provided with two types of input: all genes and the top 2000 HVGs. Figure 4b presents a radar chart comparing the performance of various clustering methods under both input conditions. Using ARI as an example, the vast majority of clustering methods demonstrated superior performance when using the top 2000 HVGs. However, DESC, DEPECHER, and scGNN performed better when all genes were used as input. Meanwhile, SCHNEL, PhenoGraph, CarDEC, and SIMLR were insensitive to the type of input. When focusing on NMI, only scGNN exhibited better performance with all genes as input, while for all other methods, the top 2000 HVGs yielded better results.
In addition to the settings using all genes and 2000 HVGs as input features, we examined different conditions by selecting 500, 1000, 3000, and 5000 HVGs to systematically compare their clustering performance with that obtained using all genes. The corresponding experimental results are presented in Additional file 1: Fig. S15. In terms of the ARI, most clustering methods demonstrated superior performance when using HVGs compared to all genes. However, certain methods such as DESC, DEPECHER, and scGNN consistently achieved better results when all genes were used. Furthermore, for methods including SIMLR, scziDesk, PhenoGraph, CarDEC, PARC, SCHNEL, and DR-SC, the number of HVGs influenced clustering performance in varying ways: some methods performed better with a specific number of HVGs, whereas others yielded better results when all genes were used. Regarding the NMI metric, only scGNN consistently performed best when using all genes, while most other methods showed improved clustering performance when HVGs were used as input.
In summary, these results highlight that although transcriptomic data contain rich biological information, their high dimensionality and noise pose challenges for clustering analysis. Feature engineering in transcriptomics can provide significant improvements in clustering performance. Our further analysis indicates that while using up to 5000 HVGs still yields better results for most methods compared to all genes, the performance improvement is less substantial than that observed with 3000 or fewer HVGs. Hence, it is advisable in practical applications to prioritize selecting no more than 3000 HVGs to maintain high performance while minimizing the impact of redundant information.
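The idea of restricting the input to the most variable genes can be sketched as follows. This is a deliberately simplified ranking by dispersion (variance divided by mean) on a small dense matrix; it is not the HVG procedure used in the study or the one implemented in Seurat or Scanpy, and the function name and toy gene names are hypothetical.

```python
from statistics import mean, pvariance


def top_variable_genes(matrix, gene_names, n_top):
    """Rank genes by dispersion (variance / mean) and keep the n_top highest.

    matrix is a list of cells, each a list of expression values per gene.
    """
    scored = []
    for j, gene in enumerate(gene_names):
        column = [row[j] for row in matrix]
        mu = mean(column)
        dispersion = pvariance(column, mu) / mu if mu > 0 else 0.0
        scored.append((dispersion, gene))
    scored.sort(key=lambda item: item[0], reverse=True)
    return [gene for _, gene in scored[:n_top]]


# Toy matrix: 4 cells x 3 genes (a constant gene, a bimodal gene, a mildly varying gene).
toy_matrix = [
    [5, 0, 1],
    [5, 9, 1],
    [5, 0, 2],
    [5, 9, 2],
]
toy_genes = ["G_flat", "G_bimodal", "G_mild"]
selected = top_variable_genes(toy_matrix, toy_genes, n_top=2)
```

In practice, `n_top` would be set to a value such as 2000 or 3000, in line with the recommendation above, and the ranking would be computed on normalized counts.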
Compared to transcriptomic features, proteomics directly reflects the functional state of cells under specific conditions. For instance, the functional subtypes of T cells, such as effector T cells and memory T cells, are distinguished by specific cell surface markers like CD4 and CD8[64]. Therefore, we hypothesize that in certain tasks, such as identifying immune cell subtypes, proteomic features may outperform transcriptomic features. To validate this hypothesis, we conducted experiments on Data6 and Data7, which encompass diverse immune cell subtypes. As illustrated in Additional file 1: Fig. S16a, in the task of identifying immune cell subtypes, clustering methods performed comparably or even better on proteomic features than on transcriptomic features.
Furthermore, we conducted experiments using Data5, which contains two groups of cell type labels with different levels of granularity. As shown in Fig. 4c, it is apparent that both transcriptomic and proteomic features can clearly distinguish T cells when using coarse-grained cell type labels. However, when using fine-grained cell type labels for coloring, we observed that the T cell subtypes are mixed together in the t-SNE plot based on transcriptomic features, whereas the t-SNE plot based on proteomic features is able to effectively distinguish these subtypes.
Next, we conducted further experiments for validation. Specifically, we constructed a subset of Data5, which includes all T cell subtypes from the original dataset (CD4 Naive, CD4 Memory, CD8 Naive, CD8 Effector_1, CD8 Effector_2, CD8 Memory_1, CD8 Memory_2, gdT, Treg, and MAIT). Subsequently, we performed experiments on both Data5 and its subset using transcriptomic and proteomic features as inputs, respectively. Figure 4d shows the comparison of performance (ARI and NMI) between the two omics across the two datasets. It can be observed that on Data5, the performance of most clustering methods using transcriptomic features as input is significantly better than that using proteomic features. However, in the subset for identifying T cell subtypes, nearly all methods exhibit better performance with proteomic features. The only exception is PARC, where the ARI performance on proteomic features is slightly lower than that on transcriptomic features, but NMI is substantially higher. Analogous conclusions were reached through CA and Purity (Additional file 1: Fig. S16b).
The experimental results suggest that, for the task of identifying immune cell subtypes, proteomic features exhibit superior performance compared to transcriptomic features. These findings indicate that proteomic features hold great potential for tasks such as subtype clustering, rare cell type detection, and disease subtype identification. Based on the content of this section and the previous one, we conclude that transcriptomic and proteomic features each have their own strengths. We believe that combining both omics could integrate the unique characteristics and advantages of each, potentially leading to even better performance.
Impact of data scales and noise levels on the performance of clustering
Stability of clustering methods is another important evaluation criterion beyond accuracy metrics. In this study, experiments of varying scales were conducted using Data8, and 25 simulated datasets of different sizes were generated by applying subsampling at different levels to this dataset. Experiments under varying noise levels were based on Data1, with five different levels of noise added. The detailed strategies for subsampling and noise addition are described in Methods.
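The scale experiment design (several dataset sizes, each with repeated random subsamples) can be sketched as below. The sketch assumes uniform sampling without replacement and five sizes spaced evenly between 2000 and 10,000 cells; both are assumptions made for illustration, as are the function name and the total cell count, and the actual subsampling strategy is the one given in Methods.

```python
import random


def make_subsamples(n_cells, sizes, repeats, seed=0):
    """Return {size: [sorted cell-index lists]} with `repeats` draws per size."""
    rng = random.Random(seed)  # fixed seed so the subsamples are reproducible
    subsamples = {}
    for size in sizes:
        subsamples[size] = [
            sorted(rng.sample(range(n_cells), size)) for _ in range(repeats)
        ]
    return subsamples


# Hypothetical parent dataset of 12,000 cells; sizes and repeats are assumed values.
assumed_sizes = [2000, 4000, 6000, 8000, 10000]
subsets = make_subsamples(n_cells=12000, sizes=assumed_sizes, repeats=5)
n_datasets = sum(len(draws) for draws in subsets.values())
```

Because the same cell indices can be applied to both the transcriptomic and the proteomic matrix, each subsample remains a paired dataset.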
As shown in Fig. 5a, each scale comprises 5 datasets, and for datasets of the same scale, clustering methods exhibit greater stability on proteomic features compared to transcriptomic features. This is especially clear with scLCA, where the lower dimensionality and reduced sparsity of proteomic data likely allow Singular Value Decomposition (SVD) to produce higher-quality Latent Cellular (LC) states with less noise, leading to more consistent cosine distance calculations and stable clustering results. By contrast, the high dimensionality and sparsity of transcriptomic data can introduce noise that may affect LC state quality, often resulting in less stable clustering. These observations reveal that, due to the lower dimensionality and reduced sparsity of proteomic features compared to transcriptomic features, clustering performance can be more stable to a certain extent. Then, we examined the stability of various clustering methods across datasets of different sizes. From Fig. 5a, it is apparent that as the number of cells increases, the performance of FFC decreases significantly in terms of ARI, NMI, and CA, while Purity shows an upward trend. This could stem from the fire temperature parameter in FFC being somewhat sensitive to local density differences, leading to a higher number of clusters than the actual cell types. These smaller, more homogeneous clusters may increase Purity, but they can result in the over-segmentation of true cell types, leading to decreases in ARI, NMI, and CA. SHARP, DR-SC, CarDEC, and SIMLR consistently rank in the top 50% across both omics, although their performance fluctuates with varying cell numbers. We also observed that community detection-based clustering methods demonstrate a trend of stability under these experimental conditions, but their overall performance remains low, with the majority ranked in the bottom 50%. 
This observation suggests that future improvements in this category of methods could focus on enhancing graph structure construction and feature representation in order to achieve both high stability and high accuracy in clustering performance.
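The opposite movement of Purity and ARI under over-segmentation can be reproduced on a toy example. The sketch below uses the standard contingency-table definitions of both metrics; the labels are invented for illustration and are not taken from the benchmark.

```python
from collections import Counter
from math import comb

def purity(true, pred):
    """Fraction of cells that belong to the majority true type of their cluster."""
    clusters = {}
    for t, p in zip(true, pred):
        clusters.setdefault(p, Counter())[t] += 1
    return sum(max(c.values()) for c in clusters.values()) / len(true)

def adjusted_rand_index(true, pred):
    """ARI computed from pair counts (undefined if both partitions are trivial)."""
    sum_ij = sum(comb(n, 2) for n in Counter(zip(true, pred)).values())
    sum_a = sum(comb(n, 2) for n in Counter(true).values())
    sum_b = sum(comb(n, 2) for n in Counter(pred).values())
    expected = sum_a * sum_b / comb(len(true), 2)
    max_index = (sum_a + sum_b) / 2
    return (sum_ij - expected) / (max_index - expected)

true = [0] * 6 + [1] * 6
# Two clusters, each with one misplaced cell.
coarse = [0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1]
# Six small clusters, each containing a single cell type.
fine = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5]

for name, pred in [("coarse", coarse), ("fine", fine)]:
    print(name, round(purity(true, pred), 3), round(adjusted_rand_index(true, pred), 3))
```

The over-segmented labelling reaches perfect Purity because every small cluster is homogeneous, yet its ARI is lower than that of the coarse labelling because many same-type cell pairs are split across clusters.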
Fig. 5.
Evaluation on datasets of varying scales and noise levels. a Performance of clustering methods on datasets of varying scales, with community detection-based methods represented in shades of blue, deep learning-based methods in shades of red, and classical machine learning-based methods in shades of brown. The dataset sizes range from 2000 to 10,000 cells, with each size containing 5 datasets. Each boxplot represents the clustering performance (ARI, NMI, CA, and Purity) of the method across 5 datasets for a given scale, with the median value of each metric at each scale connected by a line. Clustering methods are ranked in descending order based on the median ARI across the 25 datasets. b Performance of clustering methods on datasets with distinct noise levels, with community detection-based methods represented in shades of blue, deep learning-based methods in shades of red, and classical machine learning-based methods in shades of brown. Clustering methods are ranked in descending order based on the median ARI across the 5 datasets
Additional file 1: Fig. S17 illustrates the Peak memory and Running time of clustering methods across datasets of different scales. It is evident that an increase in the number of cells typically results in heightened peak memory usage and running time across nearly all methods; however, the extent of this impact varies among different categories of techniques. Our experiments indicated that community detection-based methods exhibit relatively low peak memory usage and computational time, showing minimal sensitivity to increases in cell counts. In contrast, deep learning-based methods, while demonstrating lower peak memory consumption, often encounter substantial time overhead during processing. Additionally, classical machine learning-based methods exhibit the most significant fluctuations in both memory demands and computational time when cell numbers rise, typically necessitating greater overall resource allocation.
Figure 5b presents the stability of various methods across 5 datasets with different levels of noise. We observed that CDC, CarDEC, Monocle3, and scziDesk exhibited remarkable stability across both transcriptomic and proteomic features. On the other hand, methods like Celda, DEPECHE, and SC3 maintained stable performance when analyzing transcriptomic features under added noise but showed slight fluctuations in their proteomic results. Similarly, scAIDE and DR-SC were minimally affected by noise when handling proteomic data, but their performance exhibited some variability with transcriptomic features. From the perspective of robustness to data noise, we suggest that in real-world applications, particular attention should be paid to denoising strategies during the data preprocessing stage to minimize the adverse impact of noise on clustering performance. Subsequently, it is advisable to select clustering algorithms that exhibit strong robustness to noise. This multi-step optimization process has the potential to enhance the stability of clustering results.
Considering the two dimensions of stability assessment discussed above, we evaluated the stability of the best-performing methods identified earlier: scDCC, scAIDE, and FlowSOM. We found that although these three methods did not maintain perfectly stable outputs across datasets of varying sizes and noise levels, the fluctuations in their performance remained within an acceptable range. This highlights the robustness of these methods, as their performance was not significantly impacted by these variables.
Performance of clustering algorithms on features integrated by different methods
By employing appropriate data fusion strategies, the overall robustness of the analysis can be enhanced, leading to improved accuracy and interpretability of biological insights[65, 66]. As shown in Fig. 6a, we employed 7 advanced integration methods: moETM[57], sciPENN[58], scMDC[59], totalVI[60], JTSNE[61], JUMAP[61], and MOFA+[62]. We then used the integrated multi-omics features as input for single-omics clustering methods to perform clustering tasks. This approach not only allows us to consider both transcriptomic and proteomic data in clustering but also extends existing single-omics clustering algorithms to multi-omics features.
Fig. 6.
Evaluation on integrated features. a The paired transcriptomic and proteomic data were integrated using 7 integration methods, and the resulting features were used as input for various clustering algorithms. b Visualization of 3 datasets involved in the experiment, generated by applying t-SNE to transcriptomic features, proteomic features, and the integrated features, respectively. c The performance of each clustering method is shown across transcriptomic features, proteomic features, and integrated features. Each point shape corresponds to a feature type, with the y-axis representing the metric (ARI or NMI) and the x-axis representing the clustering methods
In this experiment, we selected three datasets: Data1, Data4, and Data8, with their visualizations shown in Fig. 6b, generated by applying t-SNE to transcriptomic features, proteomic features, and the integrated features, respectively. Each dataset has distinct characteristics. On Data1, the performance of various clustering methods is comparable between transcriptomic and proteomic data, while Data4 and Data8 tend to favor transcriptomic and proteomic data, respectively. Additionally, these datasets differ in sample sizes: Data1 contains 4426 cells, Data4 has 14,793 cells, and Data8 includes 49,147 cells. These characteristics highlight the diversity and representativeness of the selected datasets, ensuring the broad applicability and reliability of our experiments.
As shown in Fig. 6c, on Data1, the features integrated by totalVI achieved higher performance across many methods compared to both single-omics data. Features integrated by moETM, sciPENN, and scMDC also improved the performance of some clustering methods, though less frequently. For Data4, where single-omics data favor transcriptomics, the performance of transcriptomic features generally surpassed that of integrated multi-omics features. However, for a few clustering algorithms, the performance of the features integrated by totalVI, moETM, sciPENN, and scMDC exceeded that of single-omics features. For Data8, we found that in most cases, the clustering performance with integrated features exceeded that of either single-omics data type, with moETM and scMDC performing the best. The performance of the clustering methods on integrated features in terms of CA and Purity is presented in Additional file 1: Fig. S18. As the experiments presented in this section do not encompass all 28 clustering methods, the selection criteria and exclusion rationale for clustering methods are detailed in Additional file 1: Note S2.
Overall, our experimental results suggest that clustering methods designed for single-omics data can exhibit superior performance when using integrated features as input, compared to using data from only one omics source. This is especially true when the proteomic data in a paired dataset outperform the transcriptomic data, in which case integration yields a synergistic effect that exceeds either modality alone. We believe that different omics provide complementary biological information, and the integrated data reconstruct the feature space, thereby enhancing biological signals and more comprehensively reflecting cellular states.
Discussion
In this study, we performed a comprehensive benchmarking analysis of 28 clustering methods using 10 paired single-cell transcriptomic and proteomic datasets across 5 tissues, encompassing over 50 cell types and more than 300,000 cells. These data were generated by multi-omics technologies such as CITE-seq, ECCITE-seq, and Abseq, in which the proteomic measurements focus on cell surface proteins. The reason for using such data is that, compared to other types of proteomic data, cell surface proteins are supported by more extensive biological background knowledge and immunological research, making them particularly suitable for analysis of immune cells[67, 68]. Notably, cell surface proteins play critical roles in immune processes, and single-cell surface proteomics technologies are more mature and have been widely applied in immunological studies[3].
We evaluated these methods based on metrics such as ARI, NMI, CA, and Purity, as well as Running Time and Peak Memory, providing a comprehensive assessment framework for clustering in single-cell transcriptomics and proteomics. This framework can guide researchers in selecting the most suitable clustering tools. Our results indicate that single-cell transcriptomics generally outperforms single-cell proteomics in many cases, as transcriptomic sequencing captures the expression of tens of thousands of genes, allowing clustering algorithms to leverage a broader range of features to identify cell types. In contrast, antibody-based proteomic sequencing is limited to detecting a smaller number of proteins (typically dozens to hundreds).
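Because cluster labels are arbitrary, an accuracy-style metric has to align clusters with cell types before counting matches. The sketch below assumes that CA denotes accuracy under the best one-to-one mapping between clusters and cell types, which is a common definition but is not confirmed in this section; the brute-force search over permutations is an illustrative simplification, and the Hungarian algorithm is the usual choice for realistic numbers of clusters.

```python
from itertools import permutations

def clustering_accuracy(true, pred):
    """Accuracy under the best one-to-one mapping of cluster labels to cell types."""
    types = sorted(set(true))
    clusters = sorted(set(pred))
    # Pad with None so that surplus clusters can be mapped to no cell type.
    types += [None] * (max(len(types), len(clusters)) - len(types))
    best = 0
    for perm in permutations(types, len(clusters)):
        mapping = dict(zip(clusters, perm))
        hits = sum(mapping[p] == t for t, p in zip(true, pred))
        best = max(best, hits)
    return best / len(true)

# Toy labels: cluster 1 corresponds to type 0 and cluster 0 to type 1.
true_types = [0, 0, 0, 1, 1, 1]
clusters = [1, 1, 0, 0, 0, 0]
ca = clustering_accuracy(true_types, clusters)
print(round(ca, 3))
```

Under this definition a surplus cluster can never contribute matches, so over-segmentation lowers CA even when each cluster is homogeneous.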
The success of scDCC can be attributed to its integration of soft constraints into the modeling process, which allows for flexible application across various experimental settings. These constraints encode prior knowledge, facilitating the learning of better latent representations and improving clustering performance. Even a small number of constraints can significantly enhance clustering outcomes, underscoring the importance of incorporating prior knowledge to boost clustering efficacy. Similarly, scAIDE employs a Random Projection Hashing-based k-means (RPH-kmeans) algorithm, which is particularly effective in identifying small or rare cell types as well as common cell types. FlowSOM, on the other hand, utilizes Self-Organizing Map (SOM) for data analysis and visualization. SOM effectively maps high-dimensional data into a low-dimensional space while preserving the topological structure of the data. This mapping approach aids in clearly visualizing the clustering structure in a lower-dimensional space, thereby enhancing clustering performance.
Conversely, methods such as FFC and scLCA perform poorly in both omics. FFC is a non-parametric method that makes fewer assumptions about the data, which can be advantageous when dealing with data that does not conform to known distributions. However, this flexibility also means that FFC may fail to fully exploit specific structural information within the data, leading to suboptimal performance on certain datasets. scLCA relies on the assumption of structural consistency between the LC space and the principal component space, positing that intra-cluster similarity is higher than inter-cluster similarity in both spaces. In highly complex datasets this assumption may not hold, and scLCA may be adversely affected by the resulting structural inconsistencies, leading to a reduction in clustering performance.
In addition, we observed that proteomic technologies, due to their ability to directly detect surface markers associated with cellular functions, exhibit greater stability and specificity in protein expression compared to mRNA expression, with lower noise levels. These advantages make proteomics more effective in identifying immune cell subtypes.
We believe that proteomic data can complement the limitations of transcriptomic data in single-cell clustering analysis. Therefore, we employed seven feature integration strategies, combining paired single-cell transcriptomic and proteomic features. The low-dimensional representations of the integrated features were then input into various clustering methods, extending traditional single-omics clustering approaches to multi-omics data. Benchmarking results demonstrated that different omics data provide unique and complementary biological information. By integrating multi-omics data, the feature space was optimized, enabling better capture of biological signals and more accurately reflecting the true cellular state.
This study has some limitations. It focuses primarily on clustering methods applied to paired single-cell transcriptomic and proteomic data and therefore does not cover other cutting-edge multi-omics sequencing technologies. With advances in multi-omics sequencing, it is possible to simultaneously measure chromatin accessibility (scATAC-seq) and transcriptomics (scRNA-seq) within the same cell, as demonstrated by methods such as SHARE-seq[69]. Such data could provide additional complementary insights. Furthermore, due to the high cost of multi-omics sequencing technologies, integrating unpaired multi-omics features to extend single-omics clustering methods to multi-omics features may represent a promising approach[70–72].
Conclusions
In summary, our benchmarking study provides a valuable guide for researchers in the field, particularly in selecting appropriate clustering methods originally developed for transcriptomic data to be applied to single-cell proteomic clustering tasks. This addresses the current scarcity of clustering algorithms specifically designed for single-cell proteomic data. Based on the results reported in our study, users can choose suitable data integration methods for exploratory single-cell multi-omics clustering analysis. Additionally, we anticipate that this work will serve as a reference for method developers, highlighting the importance of leveraging complementary biological information from different omics to enhance the performance of newly developed methods in single-cell clustering tasks.
Methods
Settings of benchmarked methods
In this study, various single-cell clustering methods developed by researchers worldwide were employed, each based on distinct techniques. These methods require different data input formats and parameter settings. To ensure the correct operation of the algorithms and to maintain fairness in benchmarking, the following principles were adhered to:
For transcriptomics, the top 2000 HVGs were provided as input to methods capable of accepting preprocessed data. HVGs were selected using Seurat[63] for methods developed in R and Scanpy[73] for those developed in Python. For methods with inherent data processing steps, we used the raw count data as input. Regarding proteomics, data or counts were provided according to the method’s specific requirements. Due to the limited number of protein features, all features were included in the input rather than being filtered based on variability. This structured approach ensures consistency and fairness in evaluating the various clustering methods.
In single-cell transcriptomic data (RNA), “counts” refers to the raw transcript counts for each gene per cell, typically the direct output from sequencing platforms and unprocessed. “Data” refers to the matrix after normalization (e.g., using LogNormalize), which is then used for downstream analyses such as selection of HVGs, dimensionality reduction, and clustering. In single-cell proteomic data (ADT), “counts” represent the number of antibody-derived DNA tags detected per cell, which indirectly reflect the abundance of the corresponding cell surface proteins. These values are typically normalized using the Centered Log Ratio (CLR) transformation to produce “data” suitable for downstream analysis, reducing technical noise and improving comparability across cells. Thus, “counts” denotes the raw, unprocessed expression data, while “data” refers to the normalized form.
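The two normalizations can be written out for a single cell as a minimal sketch. It assumes a scale factor of 10,000 for LogNormalize and a textbook CLR with a pseudocount of 1 applied across the features of one cell; toolkit implementations may differ in the pseudocount handling and in the axis along which CLR is applied, and the example counts are hypothetical.

```python
from math import log1p

def log_normalize(counts, scale_factor=10_000):
    """LogNormalize for one cell: log1p(count / cell total * scale_factor)."""
    total = sum(counts)  # assumes the cell has at least one count
    return [log1p(c / total * scale_factor) for c in counts]

def clr(counts):
    """Centered log ratio with a pseudocount of 1: log1p(x) minus its mean."""
    logs = [log1p(c) for c in counts]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]

rna_counts = [0, 3, 1, 6]   # hypothetical transcript counts for one cell
adt_counts = [120, 4, 35]   # hypothetical antibody tag counts for one cell

rna_data = log_normalize(rna_counts)
adt_data = clr(adt_counts)
print([round(v, 3) for v in rna_data])
print([round(v, 3) for v in adt_data])
```

In this formulation the CLR values of a cell sum to zero, which is what makes tag abundances comparable between cells with different total ADT counts.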
A critical parameter affecting clustering results is the setting of the number of clusters. For methods where the expected number of clusters can be directly specified, we set this parameter to match the actual number of cell types in the dataset.
In the following sections, we explain the fundamental principles of the different methods.
Methods based on classical machine learning
sc-SHC
sc-SHC [35] extends significance of hierarchical clustering to scRNA-seq data by incorporating hypothesis testing into the method. It models cell populations using a parametric distribution, enabling automated identification of distinct clusters. The framework corrects for multiple sequential tests, controlling the family-wise error rate and providing interpretable clustering uncertainty. It also extends to the setting with batch labels.
FFC
FFC [29] is inspired by self-organized criticality in forest fire dynamics. By modeling label propagation analogous to fire spread, FFC clusters data based on a single “fire temperature” hyperparameter. It quantifies label uncertainty via point-wise posterior exclusion probabilities and label entropy. FFC supports online clustering, allowing efficient updates in dynamic, high-throughput data environments.
Celda
Celda [32] is a Bayesian hierarchical model for scRNA-seq data, designed to co-cluster genes into transcriptional modules and cells into subpopulations. It quantifies the probabilistic contribution of genes, modules, and cell populations across samples. Built on hierarchical Dirichlet multinomial distributions, Celda effectively handles sparse, non-negative integer count data without prior normalization.
DR-SC
DR-SC [36] jointly performs dimension reduction and (spatial) clustering using a hierarchical model. The first layer extracts low-dimensional embeddings from gene expression, while the second layer maps these embeddings and spatial coordinates to cluster labels. By unifying these steps, DR-SC enhances the estimation of cell-type-relevant embeddings and improves clustering performance for scRNA-seq and spatial transcriptomic data across platforms.
CDC
CDC [30] identifies boundary points by assessing the directional distribution of K-nearest neighbors and differentiates them from internal points based on local direction centrality. Boundary points form cages that enclose internal points, preventing cross-cluster connections and effectively separating weakly connected clusters. The method is independent of point density, which allows it to preserve the completeness of sparse clusters.
MarkovHC
MarkovHC [41] is a topological clustering algorithm designed for hierarchical and interpretable analysis of single-cell omics data. It calculates cell similarity using shared nearest neighbors and constructs a cellular network. A Markov transition matrix and pseudo-energy matrix are derived, guiding the hierarchical structure formation based on attractors, basins, and critical points on each level.
SHARP
SHARP [38] leverages ensemble Random Projection (RP) to efficiently handle large-scale scRNA-seq data. RP projects high-dimensional data into a lower-dimensional subspace using a random matrix, preserving inter-cell distances and robustness to missing values. SHARP significantly reduces clustering computational cost while maintaining performance, particularly for large datasets, by minimizing distance distortions during dimension reduction.
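The core random projection step can be sketched in a few lines. This illustration assumes a dense Gaussian random matrix scaled by the square root of the target dimension and a single projection; the specific random matrix and the ensemble aggregation used by SHARP are not reproduced here, and the input values are hypothetical.

```python
import math
import random

def random_projection(rows, target_dim, seed=0):
    """Project each row onto target_dim random directions."""
    rng = random.Random(seed)
    source_dim = len(rows[0])
    # Entries are N(0, 1) / sqrt(target_dim), so squared distances are
    # preserved in expectation.
    matrix = [
        [rng.gauss(0, 1) / math.sqrt(target_dim) for _ in range(target_dim)]
        for _ in range(source_dim)
    ]
    return [
        [sum(x * matrix[i][j] for i, x in enumerate(row)) for j in range(target_dim)]
        for row in rows
    ]

# Hypothetical expression values: 4 cells x 6 genes, reduced to 3 dimensions.
cells = [
    [1.0, 0.0, 2.0, 0.0, 0.0, 3.0],
    [0.9, 0.1, 2.1, 0.0, 0.0, 2.8],
    [0.0, 4.0, 0.0, 1.0, 2.0, 0.0],
    [0.0, 3.8, 0.2, 1.1, 2.1, 0.0],
]
projected = random_projection(cells, target_dim=3)
print(len(projected), len(projected[0]))
```

Because the projection is a single matrix product with no data-dependent fitting, its cost grows linearly with the number of cells, which is the property that makes the approach attractive for large datasets.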
Spectrum
Spectrum [40] is a spectral clustering method designed for complex omic data, employing a self-tuning density-aware kernel that enhances similarity between points sharing common nearest neighbors. It integrates graph data via tensor product and diffusion to reduce noise and reveal underlying structures. Spectrum introduces a novel method for determining the optimal number of clusters (K) through eigenvector distribution analysis, capable of automatically identifying K for both Gaussian and non-Gaussian structures.
DEPECHE
DEPECHE [42] builds on penalized k-means clustering by incorporating an L1-norm penalization, driving cluster centers towards the origin. This parameter-free method optimizes clustering resolution, enabling the identification of biologically relevant clusters and the specific variables defining them, which is crucial for comprehending complex and noisy single-cell data.
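The effect of an L1 penalty on a cluster center can be shown with the soft-thresholding operator, which is the standard proximal update for such a penalty. This is a sketch of the general technique under that assumption, not a reproduction of the DEPECHE update rule; the center and penalty values are hypothetical.

```python
import math

def soft_threshold(center, penalty):
    """Shrink each coordinate towards zero by `penalty`, clipping at zero."""
    return [math.copysign(max(abs(v) - penalty, 0.0), v) for v in center]

# Hypothetical cluster mean over three markers.
cluster_mean = [3.0, -0.5, 1.2]
shrunk = soft_threshold(cluster_mean, penalty=1.0)
print(shrunk)
```

Coordinates whose magnitude is below the penalty become exactly zero, so the markers that remain non-zero are the ones that define the cluster, and a center that shrinks entirely to the origin no longer describes a distinct cluster.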
scLCA
scLCA [34] employs a dual-space decomposition strategy for single-cell gene expression analysis, focusing on LC space and principal components space. Cell–cell similarity is measured in both spaces, and biologically informative cellular states guide the selection of final clustering solutions among top-ranked candidate models.
CIDR
CIDR [31] effectively addresses dropout effects in scRNA-seq data by employing a novel implicit imputation strategy. First, dropout candidates are identified, and the relationship between dropout rate and gene expression is estimated. This allows for the calculation of dissimilarities between the imputed gene expression profiles of single cells. Principal coordinate analysis is then performed using this dissimilarity matrix, followed by clustering based on the leading principal coordinates.
SIMLR
SIMLR [33] is a computational framework designed to analyze single-cell RNA-seq data by learning a similarity measure for dimension reduction, clustering, and visualization. It optimizes multiple kernels to construct a block-diagonal similarity matrix, reflecting separable subpopulations. For large datasets, SIMLR approximates similarity using k-nearest-neighbor graphs and sparse eigen-decomposition.
SC3
SC3 [28] calculates distances between cells using Euclidean, Pearson, and Spearman metrics, then transforms the distance matrices using PCA or graph Laplacian eigenvectors. K-means clustering is applied to the first d eigenvectors, and a consensus matrix is generated by averaging binary similarity matrices from individual clustering. The final clusters are obtained by hierarchical clustering of the consensus matrix.
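The consensus step described above can be sketched as follows. Each individual clustering is turned into a binary co-membership matrix, and the matrices are averaged; the three label vectors are hypothetical stand-ins for the k-means results obtained from different distance and transformation combinations.

```python
def consensus_matrix(labelings):
    """Average the binary co-membership matrices of several clusterings."""
    n = len(labelings[0])
    counts = [[0] * n for _ in range(n)]
    for labels in labelings:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[c / len(labelings) for c in row] for row in counts]

# Three hypothetical clusterings of the same four cells.
labelings = [
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
]
consensus = consensus_matrix(labelings)
for row in consensus:
    print([round(v, 2) for v in row])
```

Entries close to 1 mark cell pairs that are grouped together regardless of the distance metric or transformation, and hierarchical clustering of this matrix yields the final labels.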
TSCAN
TSCAN [37] is a computational tool for reconstructing pseudo-time in single-cell RNA-seq analysis and can also be used for clustering. It employs a cluster-based minimum spanning tree (MST) approach, where cells are grouped into clusters using the mclust package in R. An MST is then constructed to connect cluster centers, allowing for the ordering of cells along a developmental trajectory.
FlowSOM
FlowSOM [39] primarily relies on SOM for initial dimensionality reduction of high-dimensional single-cell data, mapping similar cells to neighboring nodes in a topological structure. A minimal spanning tree is subsequently constructed to reveal node similarity and topology, followed by hierarchical clustering to aggregate nodes into final cell clusters.
Methods based on community detection
PARC
PARC [43] is a graph-based clustering method for single-cell data that integrates hierarchical graph construction, data-driven pruning, and community detection. It uses hierarchical navigable small world for fast k-NN graph construction, prunes edges based on local and global edge-weight distributions, and applies the Leiden algorithm for robust community detection. The pruning step enhances speed and accuracy, particularly for detecting rare cell populations.
SCHNEL
SCHNEL [46] integrates Hierarchical Stochastic Neighbor Embedding (HSNE) with Louvain community detection for scalable clustering of high-dimensional data. By clustering a reduced data subset while leveraging HSNE’s hierarchical structure, SCHNEL preserves manifold properties and efficiently propagates cluster labels back to the full dataset. This approach enables accurate, scalable clustering across millions of cells.
Monocle3
Monocle3 [47] is primarily a trajectory analysis tool and can also be used for clustering. It begins with dimensionality reduction of single-cell transcriptomic data using techniques such as UMAP or PCA, projecting the high-dimensional data into a lower-dimensional space. A k-nearest neighbor graph is constructed based on this low-dimensional representation, followed by clustering via community detection algorithms such as Leiden or Louvain.
Leiden
Leiden [44] is a community detection method that iteratively refines partitions by improving the Louvain algorithm’s modularity optimization. It operates in three phases: partitioning nodes, refining memberships, and aggregating communities. Unlike Louvain, it prevents disconnected subcommunities and ensures well-connected clusters by refining node memberships based on local moving and splitting steps.
PhenoGraph
PhenoGraph [48] clusters single-cell data by first constructing a k-nearest neighbors graph based on Euclidean distances. The graph’s edges are weighted according to shared neighbors between cells. The Louvain algorithm then detects communities by optimizing modularity, effectively partitioning the graph into distinct cell subpopulations. This approach is robust, accurate, and scalable for high-dimensional single-cell analysis.
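The graph construction step can be sketched on a toy example. The sketch assumes that shared-neighbor weighting is the Jaccard index between the neighbor sets of two cells and uses a brute-force neighbor search; the points are hypothetical one-dimensional coordinates, and the community detection step is omitted.

```python
import math

def knn(points, k):
    """Brute-force k-nearest neighbors by Euclidean distance."""
    neighbors = []
    for i, p in enumerate(points):
        ranked = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        neighbors.append({j for _, j in ranked[:k]})
    return neighbors

def jaccard_edges(points, k):
    """Weight each k-NN edge by the Jaccard index of the two neighbor sets."""
    nbrs = knn(points, k)
    edges = {}
    for i, ni in enumerate(nbrs):
        for j in ni:
            weight = len(ni & nbrs[j]) / len(ni | nbrs[j])
            if weight > 0:
                edges[(min(i, j), max(i, j))] = weight
    return edges

# Two well-separated groups of three cells each.
points = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,), (12.0,)]
edges = jaccard_edges(points, k=2)
for edge, weight in sorted(edges.items()):
    print(edge, round(weight, 3))
```

No edge connects the two groups because cells from different groups share no neighbors, which is the structure that modularity optimization then partitions into communities.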
Louvain
Louvain [45] iteratively optimizes modularity by grouping cells into communities, aiming to maximize intra-cluster connections while minimizing inter-cluster links. Its hierarchical approach allows detection of cell populations at different resolutions, making it crucial for identifying cell subtypes in high-dimensional single-cell datasets.
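The quantity being optimized can be computed directly for a given partition. The sketch below evaluates the standard modularity of an unweighted, undirected graph, Q = sum over communities of (internal edges / m) minus (total degree / 2m) squared; the search procedure itself is not shown, and the graph is a hypothetical toy example.

```python
from collections import Counter

def modularity(edges, community):
    """Modularity of a partition of an unweighted, undirected graph."""
    m = len(edges)
    internal = Counter()  # edges with both endpoints in the community
    degree = Counter()    # summed degree of the community's nodes
    for u, v in edges:
        degree[community[u]] += 1
        degree[community[v]] += 1
        if community[u] == community[v]:
            internal[community[u]] += 1
    return sum(internal[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)

# Two triangles joined by a single bridge edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
split = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
merged = {node: 0 for node in range(6)}

q_split = modularity(edges, split)
q_merged = modularity(edges, merged)
print(round(q_split, 4), round(q_merged, 4))
```

Separating the two triangles gives a modularity of 5/14, whereas placing all nodes in one community gives 0, so a modularity-maximizing method prefers the split.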
Methods based on deep learning
CarDEC
CarDEC [53] begins with data preprocessing and pretraining an autoencoder on HVGs using a mean squared error loss. The pretrained weights are then transferred to the main model, which considers HVGs and LVGs separately. CarDEC optimizes a combined loss of reconstruction and clustering, enabling structure preservation, improved clustering, and enhanced gene expression quality through denoising and batch effect correction.
scDCC
scDCC [50] integrates domain knowledge into the clustering process by embedding pairwise constraints as soft conditions in the loss function of a Zero-Inflated Negative Binomial (ZINB) based deep autoencoder. The algorithm first denoises and reduces dimensionality using a denoising ZINB autoencoder, then optimizes clustering and constraint losses in the latent space. By jointly learning latent representations and clustering, scDCC enhances clustering performance and biological interpretability, facilitating cell type identification and downstream analysis.
scGNN
scGNN [51] models heterogeneous cell–cell relationships by integrating gene expression and transcriptional regulation through a multi-modal autoencoder and GNN. It employs a hypothesis-free approach to infer relationships without predefined assumptions, incorporates cell-type-specific regulatory signals using a left-truncated mixture Gaussian model, and dynamically prunes cell graphs for noise-tolerant embedding. The learned graph embeddings are used to regularize autoencoder training for gene expression recovery.
DESC
DESC [49] involves pretraining a stacked autoencoder to learn low-dimensional representations of gene expression data. The encoder is then integrated into an iterative clustering network. Louvain’s algorithm initializes cluster centers by optimizing modularity for community detection. The learned feature space facilitates obtaining centroids, forming the basis for subsequent iterative clustering and refinement.
scziDesk
scziDesk [54] integrates deep count autoencoder network with denoised scRNA-seq data characterized by negative binomial or ZINB models. It replaces mean squared error with negative log-likelihood for capturing probabilistic structures and employs a weighted soft k-means clustering in latent space. Incorporating t-SNE and deep embedded clustering strategies, it unifies data modeling, dimensionality reduction, and clustering through iterative updates.
scAIDE
scAIDE[52] integrates an autoencoder-imputation network with a distance-preserved embedding for effective single-cell data representation. It then employs the RPH-kmeans algorithm for clustering, enabling the identification of both common and rare cell types. The approach also includes automated detection of the number of clusters, enhancing its adaptability and robustness.
scDeepCluster
scDeepCluster [55] integrates a ZINB model with clustering loss, optimizing clustering while performing dimensionality reduction. It employs a ZINB-based autoencoder for nonlinear mapping of scRNA-seq data into latent space, where clustering is conducted via Kullback–Leibler (KL) divergence. Additionally, denoising techniques enhance feature robustness, enabling more accurate clustering and representation.
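A KL-divergence clustering loss in the latent space can be sketched as follows. The sketch assumes the deep embedded clustering formulation, in which soft assignments come from a Student's t kernel and are compared with a sharpened target distribution; whether scDeepCluster uses exactly these expressions is not stated in this section, and the embeddings and centers are hypothetical.

```python
import math

def soft_assignments(embeddings, centers):
    """q_ij: Student's t kernel similarity of cell i to center j, normalized per cell."""
    q = []
    for z in embeddings:
        sims = [1.0 / (1.0 + math.dist(z, c) ** 2) for c in centers]
        total = sum(sims)
        q.append([s / total for s in sims])
    return q

def target_distribution(q):
    """p_ij proportional to q_ij^2 / f_j, where f_j is the soft size of cluster j."""
    sizes = [sum(col) for col in zip(*q)]
    p = []
    for row in q:
        weights = [v * v / f for v, f in zip(row, sizes)]
        total = sum(weights)
        p.append([w / total for w in weights])
    return p

def kl_clustering_loss(p, q):
    """KL(P || Q) summed over all cells and clusters."""
    return sum(pi * math.log(pi / qi) for pr, qr in zip(p, q) for pi, qi in zip(pr, qr))

# Hypothetical 2-D latent embeddings and two cluster centers.
embeddings = [(0.0, 0.0), (0.5, 0.0), (4.0, 0.0), (4.5, 0.0)]
centers = [(0.0, 0.0), (4.0, 0.0)]

q = soft_assignments(embeddings, centers)
p = target_distribution(q)
loss = kl_clustering_loss(p, q)
print(round(loss, 4))
```

Minimizing this loss with respect to the encoder and the centers pulls each cell towards the cluster it is already most confident about, which is how clustering and dimensionality reduction are optimized jointly.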
Multi-omics data integration methods
scMDC
scMDC[59] is a multimodal deep learning model for clustering single-cell data. It uses a multimodal autoencoder with one encoder for concatenated multimodal data and two decoders for modality-specific reconstruction. The model optimizes ZINB loss for reconstruction, KL loss for feature learning, and deep k-means for clustering. Additionally, scMDC corrects batch effects using a conditional autoencoder framework, achieving superior performance in multimodal data integration and analysis.
totalVI
totalVI [60] is a deep generative model that jointly learns probabilistic representations of RNA and protein measurements in CITE-seq data, addressing modality-specific noise, technical biases, and batch effects. It separates protein signal into background and foreground components for correction and enables tasks such as joint dimensionality reduction, dataset integration, differential expression testing, and more. totalVI optimizes these representations using a variational autoencoder framework, facilitating various downstream analyses.
moETM
moETM[57] integrates high-dimensional single-cell multimodal data by using a product-of-experts encoder and multiple linear decoders. It infers latent topics and learns shared embeddings to accurately reconstruct multi-omics data from low-dimensional topic spaces. moETM enables cell clustering, cross-omics imputation, and identification of cell-type signatures indicative of phenotypic traits.
sciPENN
sciPENN [58] is a versatile deep learning framework designed for CITE-seq and scRNA-seq data integration. It supports multiple functions including protein expression prediction for scRNA-seq, protein expression imputation for CITE-seq, quantification of prediction and imputation uncertainty, and cell type label transfer from CITE-seq to scRNA-seq. Its architecture features an input block followed by feed-forward blocks and an RNN cell to maintain and update the hidden state, culminating in dense layers for protein predictions, prediction bounds, and cell type probabilities.
JTSNE and JUMAP
JVis [61] extends t-SNE and UMAP to obtain JTSNE and JUMAP, respectively, for joint visualization of multimodal single-cell omics data by learning a unified embedding that preserves similarities across all modalities. It optimizes the relative contribution of each modality, highlighting discriminative features while suppressing noise. This approach captures relationships between modalities that conventional methods miss, facilitating more comprehensive cellular identity representations.
MOFA+
MOFA+ [62] is a statistical framework for integrating single-cell multi-modal data through low-dimensional representation using variational inference. Key features include GPU-accelerated stochastic variational inference for scalability to millions of cells, and the use of sparsity priors with hierarchical variance regularization for principled analysis across multiple data modalities and sample groups, allowing flexible sparsity constraints and joint modeling of variation.
Real datasets
The 10 real datasets used in our experiment encompass sequencing technologies such as CITE-seq, ECCITE-seq, and Abseq. After preprocessing, the number of cells ranges from 4426 to 73,580, with RNA features varying from 461 to 18,335 and ADT features from 19 to 265. Such paired transcriptomic and proteomic data provide an ideal foundation for benchmarking clustering methods across different modalities. In the subsequent sections, we provide a brief description of each dataset, with characteristics based on the preprocessed data.
Data1
CITE-seq is a technology that integrates single-cell RNA sequencing with the quantification of surface proteins through antibody labeling. By using antibody-oligonucleotide conjugates for protein labeling, it allows for simultaneous analysis of the transcriptome and surface protein characteristics of cells. This dataset, derived from Homo sapiens bone marrow AML cells, was obtained from ref.[74]. After preprocessing, it contains 4426 cells, representing 7 distinct cell types, with 15,142 RNA features and 19 ADT features.
Data2
This dataset, similarly generated using CITE-seq, includes paired transcriptomic and proteomic data. It was derived from Mus musculus brain cells and was obtained from ref. [75]. After preprocessing, the dataset contains 13,052 cells, representing 10 different cell types, with 15,962 RNA features and 34 ADT features.
Data3 and Data4
Both datasets were generated using antibody-based CITE-seq and were obtained from ref. [76]. The key difference between them lies in the source: Data3 was derived from myeloid cells in Mus musculus glioblastoma tissue, while Data4 was derived from myeloid cells in Homo sapiens glioblastoma tissue. After preprocessing, Data3 contains 13,635 cells, representing 16 distinct cell types, with 14,551 RNA features and 173 ADT features. Data4 contains 14,793 cells, representing 14 distinct cell types, with 18,156 RNA features and 265 ADT features.
Data5
This dataset was sequenced using CITE-seq, providing paired transcriptomic and proteomic data. It was sourced from mononuclear cells in Homo sapiens bone marrow and was obtained from ref. [27]. After preprocessing, it contains 30,672 cells, representing 5 distinct cell types, with 16,313 RNA features and 25 ADT features.
Data6 and Data7
Abseq focuses on protein expression at the single-cell level by efficiently labeling and detecting cell surface proteins using DNA-barcoded antibodies. Both datasets were generated using antibody-based Abseq and were obtained from ref. [77]. The main distinction between them is that Data6 was derived from mononuclear BM cells from the blood and bone marrow of a young healthy donor, while Data7 was derived from mononuclear BM cells from the bone marrow of both young and old healthy donors. After preprocessing, Data6 contains 15,502 cells, representing 38 distinct cell types, with 461 RNA features and 196 ADT features. Data7 contains 49,057 cells, representing 43 distinct cell types, with 461 RNA features and 104 ADT features.
Data8
ECCITE-seq is an advanced version of CITE-seq, incorporating compatibility with CRISPR screens. It allows the simultaneous acquisition of RNA, protein, CRISPR target, and multiplexed antigen data, offering a more comprehensive analysis of cellular states. This dataset was derived from immune cells in Homo sapiens blood and was obtained from ref. [78]. After preprocessing, it contains 49,147 cells, representing 8 distinct cell types, with 18,231 RNA features and 54 ADT features.
Data9 and Data10
Both datasets were generated using antibody-based CITE-seq and were obtained from ref. [79]. The distinction between the two lies in their source: Data9 was derived from liver cells in Mus musculus, while Data10 was derived from liver cells in Homo sapiens. After preprocessing, Data9 contains 55,455 cells, representing 15 distinct cell types, with 17,017 RNA features and 115 ADT features. Data10 contains 73,580 cells, representing 16 distinct cell types, with 18,355 RNA features and 202 ADT features.
Downsampling strategy for real datasets
Methods such as SIMLR, SC3, CIDR, and Spectrum encounter challenges when handling large-scale datasets. Therefore, each real dataset was downsampled to a size of 10,000, and these clustering methods were applied to the downsampled datasets. A key aspect of downsampling is to ensure that the proportion of different cell types in the downsampled dataset is consistent with that in the original dataset. To achieve this, we first calculated the proportion of each type of cell relative to the total number of cells in each dataset, and subsequently downsampled each real dataset to a size of 10,000 based on these proportions.
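The proportional downsampling described above can be sketched as follows. This is a minimal illustration rather than the implementation used in the benchmark: the function name `stratified_downsample`, the largest-remainder rounding of the per-type quotas, and the seed handling are assumptions introduced here.

```python
import numpy as np


def stratified_downsample(labels, target_size=10_000, seed=0):
    """Return sorted indices of a subsample that keeps cell-type proportions.

    Hypothetical helper: per-type quotas are the exact proportional shares,
    rounded with the largest-remainder rule so that they sum to target_size.
    """
    labels = np.asarray(labels)
    n_cells = labels.shape[0]
    if n_cells <= target_size:
        return np.arange(n_cells)  # nothing to downsample

    rng = np.random.default_rng(seed)
    cell_types, counts = np.unique(labels, return_counts=True)

    # Exact (fractional) quota of each cell type in the downsampled set.
    exact = counts / n_cells * target_size
    quotas = np.floor(exact).astype(int)

    # Hand the leftover slots to the types with the largest remainders.
    leftover = int(target_size - quotas.sum())
    order = np.argsort(-(exact - quotas))
    quotas[order[:leftover]] += 1

    # Sample each cell type without replacement according to its quota.
    picked = [
        rng.choice(np.flatnonzero(labels == ct), size=int(q), replace=False)
        for ct, q in zip(cell_types, quotas)
    ]
    return np.sort(np.concatenate(picked))
```

The returned indices can be used to subset the RNA and ADT matrices of a paired dataset identically, so that both modalities keep the same cells.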
Simulated data
Simulating rare cell types
We selected all 4500 CD4 Naive T cells in Data5 as the major cells, and simulated rare cells by randomly sampling Naive B cells at proportions of 2.5%, 2.0%, 1.5%, 1.0%, and 0.5% of the total cell count, resulting in five simulated datasets. To ensure fair and generalizable evaluation across clustering methods, we adopted the strategy used by Xu et al. [80], where clusters containing less than 5% of the total cells are considered rare, while the remaining clusters are treated as major cells. If the smallest cluster exceeds 5% of the total, we designate the smallest cluster as rare and the rest as major. Based on this binary classification, we used the F1-score to evaluate rare cell detection performance.
Subsampling with various sizes
We performed subsampling on Data8 to evaluate the impact of different dataset sizes on the performance of each method. Specifically, for transcriptomic or proteomic data within the dataset, we first calculated the number of cells for each cell type and their proportions relative to the total number of cells. Then, we determined the number of samples for each cell type based on these proportions, ensuring that the proportion of each cell type in the subsampled datasets matched that of the original dataset. Using this sampling approach, we generated 5 datasets of varying sizes (2000; 4000; 6000; 8000; and 10,000 cells), with each dataset size sampled 5 times using different random seeds, resulting in a total of 25 subsampled datasets. By generating datasets of varying sizes, we can assess the performance of clustering methods as the dataset size changes. This facilitates a comprehensive and systematic evaluation of the algorithms’ performance and stability.
Adding various levels of noise
We evaluated the impact of varying degrees of noise on the performance of each method using Data1. Specifically, normally distributed noise with a specified standard deviation was added to the non-zero elements of the feature matrices derived from transcriptomic or proteomic data in the dataset. Noise addition was performed five times, with standard deviations of 0.02, 0.04, 0.06, 0.08, and 0.1, respectively. These varying noise levels simulate different extents of technical errors and biological variability, aiding in the assessment of clustering methods under different noise conditions. This approach allows for a comprehensive and in-depth evaluation of the stability and robustness of clustering methods.
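A minimal sketch of this noise-injection step is given below. It assumes zero-mean Gaussian noise, a dense cells-by-features matrix, and no clipping of values that become negative; the function name `add_gaussian_noise` and the toy matrix are hypothetical and serve only to illustrate that zero entries are left untouched.

```python
import numpy as np


def add_gaussian_noise(matrix, sd, seed=0):
    """Add N(0, sd^2) noise to the non-zero entries only; zeros stay zero."""
    rng = np.random.default_rng(seed)
    noisy = np.asarray(matrix, dtype=float).copy()
    mask = noisy != 0
    noisy[mask] += rng.normal(loc=0.0, scale=sd, size=int(mask.sum()))
    return noisy


# Toy cells-by-features matrix standing in for a preprocessed feature matrix.
toy = np.array([[0.0, 1.2, 0.0],
                [3.4, 0.0, 0.5]])

# One noisy copy per standard deviation listed in the text.
noise_levels = (0.02, 0.04, 0.06, 0.08, 0.1)
noisy_versions = {sd: add_gaussian_noise(toy, sd) for sd in noise_levels}
```

Restricting the noise to non-zero entries preserves the sparsity pattern of the matrix, so the perturbation affects measured values without creating new detections.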
Benchmark metrics
In our study, we used the following metrics to assess each method.
ARI
ARI [22] is a measure used to evaluate the similarity between two data clustering assignments by adjusting for the chance grouping of elements. Unlike the Rand Index [81], which measures the agreement or similarity between cluster assignments, the ARI accounts for the expected similarity of all pairwise comparisons between clustering assignments obtained randomly. The ARI value ranges from −1 to 1, where a value closer to 1 indicates a higher degree of agreement between the clustering result and the ground truth. The formula for ARI is given by:
$$\mathrm{ARI} = \frac{\sum_{ij}\binom{n_{ij}}{2} - \left[\sum_{i}\binom{a_i}{2}\sum_{j}\binom{b_j}{2}\right]\Big/\binom{n}{2}}{\frac{1}{2}\left[\sum_{i}\binom{a_i}{2} + \sum_{j}\binom{b_j}{2}\right] - \left[\sum_{i}\binom{a_i}{2}\sum_{j}\binom{b_j}{2}\right]\Big/\binom{n}{2}} \tag{1}$$
where $n_{ij}$ represents the number of samples that are in both true class $i$ and cluster $j$, $a_i$ denotes the total number of samples in true class $i$, $b_j$ denotes the total number of samples in cluster $j$, and $n$ denotes the total number of samples.
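The quantities defined above map directly onto a few lines of code. The following sketch uses only the Python standard library and the standard Hubert and Arabie form of the index; the function name and the handling of the degenerate case are choices made for this illustration.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(true_labels, pred_labels):
    """ARI from contingency counts n_ij, class sizes a_i and cluster sizes b_j."""
    n = len(true_labels)
    n_ij = Counter(zip(true_labels, pred_labels))
    a_i = Counter(true_labels)
    b_j = Counter(pred_labels)

    # Numbers of sample pairs that fall together in each table cell, row, column.
    sum_ij = sum(comb(c, 2) for c in n_ij.values())
    sum_a = sum(comb(c, 2) for c in a_i.values())
    sum_b = sum(comb(c, 2) for c in b_j.values())

    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # fewer than two samples: treated as perfect agreement

    expected = sum_a * sum_b / total_pairs
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both partitions are trivial
    return (sum_ij - expected) / (max_index - expected)
```

Because the index depends only on co-membership of sample pairs, relabeling the clusters (for example swapping cluster 0 and cluster 1) leaves the value unchanged.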
NMI
NMI [23] is an information-theoretic metric used to measure the shared information between two clustering assignments. It is normalized to ensure the score lies between 0 and 1, where 1 indicates perfect agreement and 0 indicates no mutual information. NMI is symmetric and effective for comparing clustering assignments with different numbers of clusters. The formula for NMI is:
| 2 |
where $I(U, V)$ denotes the mutual information between clustering assignments $U$ and $V$, while $H(U)$ and $H(V)$ represent the entropies of $U$ and $V$, respectively. Here, $U = \{u_1, u_2, \ldots, u_n\}$ and $V = \{v_1, v_2, \ldots, v_n\}$ are two clustering assignments corresponding to the same set of $n$ samples, where $u_i$ and $v_i$ denote the labels assigned to the $i$th sample by two clustering algorithms or by one clustering algorithm and the ground truth.
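As an illustration of how $I(U, V)$, $H(U)$, and $H(V)$ combine, the sketch below computes NMI with the standard library only. Implementations differ in how the mutual information is normalized; the three options exposed through the hypothetical `normalizer` argument (geometric mean of the entropies, as in Strehl and Ghosh [23], arithmetic mean, and maximum) are assumptions of this example, and all of them yield values between 0 and 1.

```python
from collections import Counter
from math import log, sqrt


def entropy(labels):
    """Shannon entropy (natural log) of a label assignment."""
    n = len(labels)
    return -sum(c / n * log(c / n) for c in Counter(labels).values())


def mutual_information(u, v):
    """Mutual information between two assignments of the same n samples."""
    n = len(u)
    joint, count_u, count_v = Counter(zip(u, v)), Counter(u), Counter(v)
    return sum(
        c / n * log(c * n / (count_u[a] * count_v[b]))
        for (a, b), c in joint.items()
    )


def nmi(u, v, normalizer="geometric"):
    """Normalized mutual information with a selectable normalizer."""
    h_u, h_v = entropy(u), entropy(v)
    if h_u == 0 and h_v == 0:
        return 1.0  # both assignments are a single cluster
    denominators = {
        "geometric": sqrt(h_u * h_v),
        "arithmetic": (h_u + h_v) / 2,
        "max": max(h_u, h_v),
    }
    denom = denominators[normalizer]
    if denom == 0:
        return 0.0  # one assignment is trivial, so no information is shared
    return mutual_information(u, v) / denom
```

The three normalizers agree when the two entropies are equal; otherwise the maximum gives the smallest score and the geometric mean the largest.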
CA
CA [24] measures the extent to which the clustering labels match the true labels by finding the best one-to-one correspondence between clusters and labels. The CA value ranges from 0 to 1, where 1 indicates perfect matching between clustering results and true labels, and 0 indicates no matching. CA is defined as:
$$\mathrm{CA} = \max_{m} \frac{\sum_{i=1}^{n} \mathbf{1}\{l_i = m(u_i)\}}{n} \tag{3}$$
where $n$ denotes the number of data points, and $m$ iterates over all possible one-to-one correspondences between cluster assignments and true labels. Given a data point $i$, let $l_i$ be the ground truth label and $u_i$ be the assignment of the clustering algorithm. The optimal correspondence can be efficiently determined using the Hungarian algorithm [82].
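A compact way to obtain the optimal correspondence is to build the cluster-by-class overlap matrix and solve the assignment problem on it. The sketch below relies on `scipy.optimize.linear_sum_assignment`, which solves the same linear assignment problem as the Hungarian algorithm; the function name `clustering_accuracy` and the square padding of the overlap matrix are choices made for this illustration.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def clustering_accuracy(true_labels, pred_labels):
    """Fraction of points matched under the best one-to-one cluster/label map."""
    # Encode arbitrary labels as consecutive integers 0..k-1.
    true_ids = np.unique(true_labels, return_inverse=True)[1]
    pred_ids = np.unique(pred_labels, return_inverse=True)[1]
    size = int(max(true_ids.max(), pred_ids.max())) + 1

    # overlap[p, t] = number of points in predicted cluster p and true class t.
    overlap = np.zeros((size, size), dtype=np.int64)
    np.add.at(overlap, (pred_ids, true_ids), 1)

    # Choose the one-to-one mapping that maximizes the total overlap.
    rows, cols = linear_sum_assignment(overlap, maximize=True)
    return overlap[rows, cols].sum() / len(true_ids)
```

Padding to a square matrix lets the same code handle cases where the number of predicted clusters differs from the number of true classes; unmatched clusters simply contribute zero overlap.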
Purity
Purity [25] is a simple and intuitive metric for evaluating the quality of a clustering. It measures the extent to which each cluster contains elements from a single ground truth class. Purity ranges from 0 to 1, with higher values indicating better clustering quality. The formula for Purity is:
$$\mathrm{Purity} = \frac{1}{n}\sum_{k}\max_{j}\left|c_k \cap t_j\right| \tag{4}$$
where $c_k$ is the set of elements in cluster $k$, $t_j$ is the set of elements in ground truth class $j$, and $n$ is the total number of elements.
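In code, Purity amounts to counting, for every cluster, the members of its most frequent ground truth class. The following standard-library sketch is one possible formulation; the function name is chosen for this example.

```python
from collections import Counter, defaultdict


def purity(true_labels, pred_labels):
    """Share of elements that belong to the majority true class of their cluster."""
    class_counts = defaultdict(Counter)  # cluster -> counts of true classes
    for true_class, cluster in zip(true_labels, pred_labels):
        class_counts[cluster][true_class] += 1

    majority_total = sum(max(counts.values()) for counts in class_counts.values())
    return majority_total / len(true_labels)
```

Note that Purity does not penalize over-clustering: assigning every element to its own cluster yields a value of 1, which is one reason to report it alongside ARI, NMI, and CA.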
F1-score
F1-score measures the balance between precision and recall in classification tasks by calculating their harmonic mean. The F1-score value ranges from 0 to 1, where 1 indicates perfect precision and recall, and 0 indicates the worst performance. F1-score is defined as:
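$$\mathrm{F1} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{5}$$
where Precision and Recall are defined as:
$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{6}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{7}$$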
In the above equations, TP, FP, and FN represent the numbers of true positives, false positives, and false negatives, respectively. Specifically, TP refers to the number of rare cells correctly identified as rare; FP refers to the number of major cells incorrectly identified as rare; and FN refers to the number of rare cells incorrectly identified as major.
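The rare-cell evaluation can be sketched by combining the 5% rule from the rare-cell simulation with the definitions of TP, FP, and FN. The helper names `rare_mask` and `f1_rare` are hypothetical, and applying the rule to the predicted clusters while taking the true rare/major status as given is an assumption of this example.

```python
from collections import Counter


def rare_mask(cluster_labels, threshold=0.05):
    """Mark cells in clusters below the threshold share as rare.

    If no cluster falls below the threshold, the smallest cluster is
    designated as rare, following the rule described for the simulation.
    """
    n = len(cluster_labels)
    sizes = Counter(cluster_labels)
    rare = {c for c, size in sizes.items() if size / n < threshold}
    if not rare:
        rare = {min(sizes, key=sizes.get)}
    return [c in rare for c in cluster_labels]


def f1_rare(true_is_rare, pred_is_rare):
    """F1-score of rare-cell detection from two boolean sequences."""
    pairs = list(zip(true_is_rare, pred_is_rare))
    tp = sum(t and p for t, p in pairs)          # rare cells found as rare
    fp = sum((not t) and p for t, p in pairs)    # major cells called rare
    fn = sum(t and (not p) for t, p in pairs)    # rare cells called major
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

For instance, with 3 true rare cells among 100 and a predicted rare cluster of 4 cells that contains all 3, precision is 0.75, recall is 1, and the F1-score is 6/7.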
Computational resources
The computing platform consisted of two Intel(R) Xeon(R) Gold 6326 central processing units (CPUs), each operating at 2.90 GHz with 24 MB of L3 cache and 16 CPU cores, 1024 GB of memory, and an NVIDIA Tesla A100 GPU (80 GB of memory).
Calculation of the overall score
In Fig. 2, we evaluated each method in terms of Accuracy (measured by ARI, NMI, CA, and Purity) and Scalability (measured by Peak Memory and Running Time). Consistent with previous studies, we defined the overall score for each component by aggregating the metrics it encompasses. Specifically, the overall score for Accuracy is computed as the average of ARI, NMI, CA, and Purity. For Scalability, we first normalized both Peak Memory and Running Time to values within the range of 0–1. Subsequently, the overall score for Scalability was calculated as the average of the normalized Peak Memory and Running Time. The formulas for the overall scores are as follows:
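$$\mathrm{Score}_{\mathrm{Accuracy}} = \frac{\mathrm{ARI} + \mathrm{NMI} + \mathrm{CA} + \mathrm{Purity}}{4} \tag{8}$$
$$\mathrm{Score}_{\mathrm{Scalability}} = \frac{\mathrm{Norm}(\text{Peak Memory}) + \mathrm{Norm}(\text{Running Time})}{2} \tag{9}$$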
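The two aggregate scores can be illustrated with a short sketch. The text specifies only that Peak Memory and Running Time are normalized to the range 0–1; the min–max normalization across methods used here is an assumption, as are the function names. The normalized values are averaged as they are, without inverting their direction.

```python
def min_max(values):
    """Rescale a list of numbers to [0, 1] (assumed normalization)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def accuracy_score(ari, nmi, ca, purity):
    """Overall Accuracy: plain average of the four clustering metrics."""
    return (ari + nmi + ca + purity) / 4


def scalability_scores(peak_memory, running_time):
    """Overall Scalability per method from raw memory and time measurements."""
    norm_memory = min_max(peak_memory)
    norm_time = min_max(running_time)
    return [(m + t) / 2 for m, t in zip(norm_memory, norm_time)]
```

Under this sketch each list holds one entry per method, so the normalization expresses every method's resource use relative to the lightest and heaviest methods in the comparison.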
Calculation of the comprehensive ranking
In Additional file 2: Tables S12-15, we calculated the ranking of each method on each dataset based on ARI and NMI, respectively, as well as a comprehensive ranking. Specifically, the ranking for each individual metric was determined by sorting the values in descending order. We define ARIi and NMIi as the vectors containing the ARI and NMI scores of all methods on the ith dataset, respectively. The formulas for this ranking calculation are as follows:
| 10 |
| 11 |
To more accurately evaluate clustering methods across multiple metrics, we have introduced a new ranking strategy that combines the harmonic mean of ranks with a penalty for rank disparity. This approach is designed to capture both excellence in any single metric and consistency across metrics. Given a method’s ranks in ARI and NMI, we first compute the harmonic mean:
| 12 |
To further account for imbalance between metrics, we include a rank disparity penalty:
| 13 |
Here, λ is a hyperparameter controlling the penalty’s weight. This term penalizes methods with high variance between metrics, indicating instability or inconsistency. Considering the appropriate penalty for rank disparity, we set λ = 0.5 to balance the consistency and performance of the methods. The final composite score for each method on each dataset is:
| 14 |
In this framework, we compute the score for each method on each of the N datasets and average them, where lower scores indicate better performance, reflecting both stronger individual metric ranks and higher consistency between them. The comprehensive rank of clustering methods on N datasets is calculated as follows:
| 15 |
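The ranking strategy can be sketched as follows, under several clearly stated assumptions: the rank disparity penalty is taken to be λ times the absolute difference between the two ranks, the composite score is taken to be the sum of the harmonic mean and the penalty, and tied values share the smallest rank. None of these details is fixed by the description above, and all function names are hypothetical.

```python
def descending_ranks(scores):
    """Rank 1 = highest score; ties share the smallest rank (assumption)."""
    ordered = sorted(scores, reverse=True)
    return [ordered.index(s) + 1 for s in scores]


def composite_score(rank_ari, rank_nmi, lam=0.5):
    """Harmonic mean of the two ranks plus an assumed disparity penalty."""
    harmonic = 2 * rank_ari * rank_nmi / (rank_ari + rank_nmi)
    penalty = lam * abs(rank_ari - rank_nmi)  # assumed form of the penalty
    return harmonic + penalty                 # assumed additive combination


def comprehensive_ranking(ari_by_dataset, nmi_by_dataset, lam=0.5):
    """Final rank per method (1 = best) from per-dataset ARI and NMI lists."""
    n_datasets = len(ari_by_dataset)
    n_methods = len(ari_by_dataset[0])
    totals = [0.0] * n_methods
    for ari, nmi in zip(ari_by_dataset, nmi_by_dataset):
        r_ari, r_nmi = descending_ranks(ari), descending_ranks(nmi)
        for m in range(n_methods):
            totals[m] += composite_score(r_ari[m], r_nmi[m], lam)

    # Average over datasets; a lower mean score means a better final rank.
    means = [t / n_datasets for t in totals]
    ordered = sorted(means)
    return [ordered.index(s) + 1 for s in means]
```

With λ = 0.5, a method ranked 2nd by ARI and 3rd by NMI receives a harmonic mean of 2.4 plus a penalty of 0.5, whereas a method ranked 1st on both receives a score of 1 with no penalty.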
Calculation of the tendency score
When analyzing paired single-cell transcriptomic and proteomic data, we designed a tendency score to quantify the tendency of clustering methods on these two types of data, based on metrics such as ARI and NMI. Suppose there are T datasets, of which M datasets exhibit better performance in transcriptomics (i.e., the transcriptomic score is higher), and N datasets exhibit better performance in proteomics, where M + N = T.
For a specific clustering method, we first calculated the proportion of the transcriptomic score (e.g., ARI or NMI) in the sum of the transcriptomic and proteomic scores for each dataset. We then summed these proportions across all datasets, computed the average, and multiplied it by M/T. Similarly, we calculated the proportion of the proteomic score for each dataset, computed the average, and multiplied it by N/T. Finally, the tendency score is defined as the difference between these two values. A positive score indicates that the clustering method favors transcriptomic data, while a negative score indicates that it favors proteomic data. To interpret the tendency score, we consider both its sign and magnitude. Specifically, scores with an absolute value less than 0.01 are considered to indicate no clear or meaningful preference for either modality.
The tendency score formula for ARI is as follows:
| 16 |
Similarly, the tendency score formulas for NMI, CA, and Purity are as follows:
| 17 |
| 18 |
| 19 |
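The verbal definition of the tendency score can be turned into a short sketch. The version below assumes that the average is taken over all T datasets, that the sum of the two scores is positive for every dataset, and that ties are counted on the proteomic side so that M + N = T; the function names are hypothetical.

```python
def tendency_score(rna_scores, adt_scores):
    """Tendency of one method from per-dataset scores (e.g., ARI) on both omics."""
    total = len(rna_scores)
    pairs = list(zip(rna_scores, adt_scores))

    # Share of each modality in the summed score of every dataset.
    prop_rna = [r / (r + a) for r, a in pairs]
    prop_adt = [a / (r + a) for r, a in pairs]

    m = sum(r > a for r, a in pairs)  # datasets where transcriptomics wins
    n = total - m                     # remaining datasets (ties included)

    rna_term = (m / total) * (sum(prop_rna) / total)
    adt_term = (n / total) * (sum(prop_adt) / total)
    return rna_term - adt_term


def interpret(score, tolerance=0.01):
    """Map a tendency score to a verbal preference using the 0.01 cutoff."""
    if abs(score) < tolerance:
        return "no clear preference"
    return "transcriptomic" if score > 0 else "proteomic"
```

As a usage example, a method with ARI values of 0.8 and 0.6 on transcriptomic data and 0.2 and 0.4 on the paired proteomic data obtains a score of 0.7 under these assumptions, which `interpret` reports as a transcriptomic preference.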
Scientific computing and plotting
We used a Python (version 3.11.4) environment to conduct this study. Important packages include numpy [83] (version 1.24.3), pandas [84] (version 1.5.3), scipy [85] (version 1.10.1), scikit-learn [86] (version 1.3.0), matplotlib [87] (version 3.7.1), and seaborn [88] (version 0.12.2).
Supplementary Information
Additional file 1: Notes S1-S2 and Figs. S1-S18.
Acknowledgements
We would like to thank the members of Guohua Wang’s lab for their helpful discussions.
Peer review information
Claudia Feng was the primary editor of this article and managed its editorial process and peer review in collaboration with the rest of the editorial team. The peer-review history is available in the online version of this article.
Authors’ contributions
Y.-H.Y. and F.W. conceived the problem and designed the study. G.W. and F.W. supervised the work. Y.-H.Y. conducted methods collection and experimental analysis. Y.-H.Y. and F.W. wrote the manuscript. G.W. and W.L. revised the manuscript. Q.L. and S.Z. performed bioinformatics analysis. M.Z. revised and polished the figures. Z.J. and D.-J.Y. provided valuable suggestions and assistance on experimental setup and code implementation. All authors reviewed and approved the manuscript.
Funding
This work was supported by the National Science Fund for Distinguished Young Scholars [62225109], the National Natural Science Foundation of China [32400546, 62450112], and the Fundamental Research Funds for the Central Universities [2572024AW30].
Data availability
All data used in this study are publicly available, and their usage has been comprehensively discussed in the Methods. Data1 was described in ref. [74] and available at (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE143363) [89]. Data2 was described in ref. [75] and available at (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE148127) [90]. Data3 and Data4 were described in ref. [76] and available at (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE163120) [91]. Data5 was described in ref. [27] and available at (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE128639) [92]. Data6 and Data7 were described in ref. [77] and available at (https://figshare.com/projects/Single-cell_proteo-genomic_reference_maps_of_the_human_hematopoietic_system/94469) [93]. Data8 was described in ref. [78] and available at (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE164378) [94]. Data9 and Data10 were described in ref. [79] and available at (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE192742) [95]. For more information, please refer to the Methods and Additional file 2: Table S1. The source code and implementation details of our benchmarking framework, implemented in Python and R, are publicly accessible on GitHub (https://github.com/yinyh-1997/CBTP) [96] and Zenodo (10.5281/zenodo.15766048) [97] under the MIT License.
Declarations
Ethics approval and consent to participate
No ethical approval was required for this study. All utilized public datasets were generated by other organizations that obtained ethical approval.
Consent for publication
Not applicable.
Competing interests
The authors declare no competing interests.
Footnotes
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Yu-Hang Yin and Fang Wang contributed equally to this work.
Contributor Information
Fang Wang, Email: yafangwang163@163.com.
Guohua Wang, Email: ghwang@nefu.edu.cn.
References
- 1. Paik DT, Cho S, Tian L, Chang HY, Wu JC. Single-cell RNA sequencing in cardiovascular development, disease and medicine. Nat Rev Cardiol. 2020;17(8):457–73.
- 2. Lei Y, Tang R, Xu J, Wang W, Zhang B, Liu J, et al. Applications of single-cell sequencing in cancer research: progress and perspectives. J Hematol Oncol. 2021;14(1):91.
- 3. Labib M, Kelley SO. Single-cell analysis targeting the proteome. Nat Rev Chem. 2020;4(3):143–58.
- 4. Perkel JM. Single-cell proteomics takes centre stage. Nature. 2021;597(7877):580–2.
- 5. Vistain LF, Tay S. Single-cell proteomics. Trends Biochem Sci. 2021;46(8):661–72.
- 6. Redit C, Cha S, Ai N. Single-cell proteomics: challenges and prospects. Nat Methods. 2023;20(3):317–8.
- 7. Slavov N. Unpicking the proteome in single cells. Science. 2020;367(6477):512–3.
- 8. Stoeckius M, Hafemeister C, Stephenson W, Houck-Loomis B, Chattopadhyay PK, Swerdlow H, et al. Simultaneous epitope and transcriptome measurement in single cells. Nat Methods. 2017;14(9):865–8.
- 9. Marguerat S, Schmidt A, Codlin S, Chen W, Aebersold R, Bähler J. Quantitative analysis of fission yeast transcriptomes and proteomes in proliferating and quiescent cells. Cell. 2012;151(3):671–83.
- 10. Kelly RT. Single-cell proteomics: progress and prospects. Mol Cell Proteomics. 2020;19(11):1739–48.
- 11. Andrews TS, Kiselev VY, McCarthy D, Hemberg M. Tutorial: guidelines for the computational analysis of single-cell RNA sequencing data. Nat Protoc. 2021;16(1):1–9.
- 12. Kharchenko PV. The triumphs and limitations of computational methods for scRNA-seq. Nat Methods. 2021;18(7):723–32.
- 13. Kiselev VY, Andrews TS, Hemberg M. Challenges in unsupervised clustering of single-cell RNA-seq data. Nat Rev Genet. 2019;20(5):273–82.
- 14. Petegrosso R, Li Z, Kuang R. Machine learning and statistical methods for clustering single-cell RNA-sequencing data. Brief Bioinform. 2020;21(4):1209–23.
- 15. Qi R, Ma A, Ma Q, Zou Q. Clustering and classification methods for single-cell RNA-sequencing data. Brief Bioinform. 2020;21(4):1196–208.
- 16. Krzak M, Raykov Y, Boukouvalas A, Cutillo L, Angelini C. Benchmark and parameter sensitivity analysis of single-cell RNA sequencing clustering methods. Front Genet. 2019;10:1253.
- 17. Yu L, Cao Y, Yang JY, Yang P. Benchmarking clustering algorithms on estimating the number of cell types from single-cell RNA-sequencing data. Genome Biol. 2022;23(1):49.
- 18. Zhu C, Preissl S, Ren B. Single-cell multimodal omics: the power of many. Nat Methods. 2020;17(1):11–4.
- 19. Lee J, Hyeon DY, Hwang D. Single-cell multiomics: technologies and data analysis methods. Exp Mol Med. 2020;52(9):1428–42.
- 20. Mimitou EP, Cheng A, Montalbano A, Hao S, Stoeckius M, Legut M, et al. Multiplexed detection of proteins, transcriptomes, clonotypes and CRISPR perturbations in single cells. Nat Methods. 2019;16(5):409–12.
- 21. Shahi P, Kim SC, Haliburton JR, Gartner ZJ, Abate AR. Abseq: ultrahigh-throughput single cell protein profiling with droplet microfluidic barcoding. Sci Rep. 2017;7(1):44447.
- 22. Hubert L, Arabie P. Comparing partitions. J Classif. 1985;2(1):193–218.
- 23. Strehl A, Ghosh J. Cluster ensembles - a knowledge reuse framework for combining multiple partitions. J Mach Learn Res. 2002;3(Dec):583–617.
- 24. Xie J, Girshick R, Farhadi A. Unsupervised deep embedding for clustering analysis. Proceedings of the 33rd International Conference on Machine Learning; 2016: PMLR.
- 25. Amigó E, Gonzalo J, Artiles J, Verdejo F. A comparison of extrinsic clustering evaluation metrics based on formal constraints. Inf Retrieval. 2009;12:461–86.
- 26. Wang F, Liu C, Li J, Yang F, Song J, Zang T, et al. SPDB: a comprehensive resource and knowledgebase for proteomic data at the single-cell resolution. Nucleic Acids Res. 2024;52(D1):D562–71.
- 27. Stuart T, Butler A, Hoffman P, Hafemeister C, Papalexi E, Mauck WM, et al. Comprehensive integration of single-cell data. Cell. 2019;177(7):1888–902.e21.
- 28. Kiselev VY, Kirschner K, Schaub MT, Andrews T, Yiu A, Chandra T, et al. SC3: consensus clustering of single-cell RNA-seq data. Nat Methods. 2017;14(5):483–6.
- 29. Chen Z, Goldwasser J, Tuckman P, Liu J, Zhang J, Gerstein M. Forest fire clustering for single-cell sequencing combines iterative label propagation with parallelized monte carlo simulations. Nat Commun. 2022;13(1):3538.
- 30. Peng D, Gui Z, Wang D, Ma Y, Huang Z, Zhou Y, et al. Clustering by measuring local direction centrality for data with heterogeneous density and weak connectivity. Nat Commun. 2022;13(1):5455.
- 31. Lin P, Troup M, Ho JW. CIDR: ultrafast and accurate clustering through imputation for single-cell RNA-seq data. Genome Biol. 2017;18:1–11.
- 32. Wang Z, Yang S, Koga Y, Corbett SE, Shea CV, Johnson WE, et al. Celda: a Bayesian model to perform co-clustering of genes into modules and cells into subpopulations using single-cell RNA-seq data. NAR Genom Bioinform. 2022;4(3):lqac066.
- 33. Wang B, Zhu J, Pierson E, Ramazzotti D, Batzoglou S. Visualization and analysis of single-cell RNA-seq data by kernel-based similarity learning. Nat Methods. 2017;14(4):414–6.
- 34. Cheng C, Easton J, Rosencrance C, Li Y, Ju B, Williams J, et al. Latent cellular analysis robustly reveals subtle diversity in large-scale single-cell RNA-seq data. Nucleic Acids Res. 2019;47(22):e143-e.
- 35. Grabski IN, Street K, Irizarry RA. Significance analysis for clustering with single-cell RNA-sequencing data. Nat Methods. 2023;20(8):1196–202.
- 36. Liu W, Liao X, Yang Y, Lin H, Yeong J, Zhou X, et al. Joint dimension reduction and clustering analysis of single-cell RNA-seq and spatial transcriptomics data. Nucleic Acids Res. 2022;50(12):e72-e.
- 37. Ji Z, Ji H. TSCAN: Pseudo-time reconstruction and evaluation in single-cell RNA-seq analysis. Nucleic Acids Res. 2016;44(13):e117-e.
- 38. Wan S, Kim J, Won KJ. SHARP: hyperfast and accurate processing of single-cell RNA-seq data via ensemble random projection. Genome Res. 2020;30(2):205–13.
- 39. Van Gassen S, Callebaut B, Van Helden MJ, Lambrecht BN, Demeester P, Dhaene T, et al. FlowSOM: using self-organizing maps for visualization and interpretation of cytometry data. Cytometry A. 2015;87(7):636–45.
- 40. John CR, Watson D, Barnes MR, Pitzalis C, Lewis MJ. Spectrum: fast density-aware spectral clustering for single and multi-omic data. Bioinformatics. 2020;36(4):1159–66.
- 41. Wang Z, Zhong Y, Ye Z, Zeng L, Chen Y, Shi M, et al. MarkovHC: markov hierarchical clustering for the topological structure of high-dimensional single-cell omics data with transition pathway and critical point detection. Nucleic Acids Res. 2022;50(1):46–56.
- 42. Theorell A, Bryceson YT, Theorell J. Determination of essential phenotypic elements of clusters in high-dimensional entities - DEPECHE. PLoS One. 2019;14(3):e0203247.
- 43. Stassen SV, Siu DM, Lee KC, Ho JW, So HK, Tsia KK. PARC: ultrafast and accurate clustering of phenotypic data of millions of single cells. Bioinformatics. 2020;36(9):2778–86.
- 44. Traag VA, Waltman L, Van Eck NJ. From Louvain to Leiden: guaranteeing well-connected communities. Sci Rep. 2019;9(1):1–12.
- 45. Blondel VD, Guillaume J-L, Lambiotte R, Lefebvre E. Fast unfolding of communities in large networks. J Stat Mech: Theory Exp. 2008;2008(10):P10008.
- 46. Abdelaal T, de Raadt P, Lelieveldt BP, Reinders MJ, Mahfouz A. SCHNEL: scalable clustering of high dimensional single-cell data. Bioinformatics. 2020;36(Supplement_2):i849-i56.
- 47. Cao J, Spielmann M, Qiu X, Huang X, Ibrahim DM, Hill AJ, et al. The single-cell transcriptional landscape of mammalian organogenesis. Nature. 2019;566(7745):496–502.
- 48. Levine JH, Simonds EF, Bendall SC, Davis KL, El-ad DA, Tadmor MD, et al. Data-driven phenotypic dissection of AML reveals progenitor-like cells that correlate with prognosis. Cell. 2015;162(1):184–97.
- 49. Li X, Wang K, Lyu Y, Pan H, Zhang J, Stambolian D, et al. Deep learning enables accurate clustering with batch effect removal in single-cell RNA-seq analysis. Nat Commun. 2020;11(1):2338.
- 50. Tian T, Zhang J, Lin X, Wei Z, Hakonarson H. Model-based deep embedding for constrained clustering analysis of single cell RNA-seq data. Nat Commun. 2021;12(1):1873.
- 51. Wang J, Ma A, Chang Y, Gong J, Jiang Y, Qi R, et al. scGNN is a novel graph neural network framework for single-cell RNA-Seq analyses. Nat Commun. 2021;12(1):1882.
- 52. Xie K, Huang Y, Zeng F, Liu Z, Chen T. scAIDE: clustering of large-scale single-cell RNA-seq data reveals putative and rare cell types. NAR Genom Bioinform. 2020;2(4):lqaa082.
- 53. Lakkis J, Wang D, Zhang Y, Hu G, Wang K, Pan H, et al. A joint deep learning model enables simultaneous batch effect correction, denoising, and clustering in single-cell transcriptomics. Genome Res. 2021;31(10):1753–66.
- 54. Chen L, Wang W, Zhai Y, Deng M. Deep soft K-means clustering with self-training for single-cell RNA sequence data. NAR Genom Bioinform. 2020;2(2):lqaa039.
- 55. Tian T, Wan J, Song Q, Wei Z. Clustering single-cell RNA-seq data with a model-based deep learning approach. Nat Mach Intell. 2019;1(4):191–8.
- 56. Hu Y, Wan S, Luo Y, Li Y, Wu T, Deng W, et al. Benchmarking algorithms for single-cell multi-omics prediction and integration. Nat Methods. 2024;21(11):2182–94.
- 57. Zhou M, Zhang H, Bai Z, Mann-Krzisnik D, Wang F, Li Y. Single-cell multi-omics topic embedding reveals cell-type-specific and COVID-19 severity-related immune signatures. Cell Rep Methods. 2023. 10.1016/j.crmeth.2023.100563.
- 58. Lakkis J, Schroeder A, Su K, Lee MY, Bashore AC, Reilly MP, et al. A multi-use deep learning method for CITE-seq and single-cell RNA-seq data integration with cell surface protein prediction and imputation. Nat Mach Intell. 2022;4(11):940–52.
- 59. Lin X, Tian T, Wei Z, Hakonarson H. Clustering of single-cell multi-omics data with a multimodal deep learning method. Nat Commun. 2022;13(1):7705.
- 60. Gayoso A, Steier Z, Lopez R, Regier J, Nazor KL, Streets A, et al. Joint probabilistic modeling of single-cell multi-omic data with totalVI. Nat Methods. 2021;18(3):272–82.
- 61.Do VH, Canzar S. A generalization of t-SNE and UMAP to single-cell multimodal omics. Genome Biol. 2021;22(1):130. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 62.Argelaguet R, Arnol D, Bredikhin D, Deloro Y, Velten B, Marioni JC, et al. Mofa+: a statistical framework for comprehensive integration of multi-modal single-cell data. Genome Biol. 2020;21:1–17. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 63.Hao Y, Stuart T, Kowalski MH, Choudhary S, Hoffman P, Hartman A, et al. Dictionary learning for integrative, multimodal and scalable single-cell analysis. Nat Biotechnol. 2024;42(2):293–304. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 64.Veillette A, Bookman MA, Horak EM, Bolen JB. The CD4 and CD8 T cell surface antigens are associated with the internal membrane tyrosine-protein kinase p56lck. Cell. 1988;55(2):301–8. [DOI] [PubMed] [Google Scholar]
- 65.Stuart T, Satija R. Integrative single-cell analysis. Nat Rev Genet. 2019;20(5):257–72. [DOI] [PubMed] [Google Scholar]
- 66.Argelaguet R, Cuomo AS, Stegle O, Marioni JC. Computational principles and challenges in single-cell data integration. Nat Biotechnol. 2021;39(10):1202–15. [DOI] [PubMed] [Google Scholar]
- 67.Davis DM. Intercellular transfer of cell-surface proteins is common and can affect many stages of an immune response. Nat Rev Immunol. 2007;7(3):238–43. [DOI] [PubMed] [Google Scholar]
- 68.Shilts J, Severin Y, Galaway F, Müller-Sienerth N, Chong Z-S, Pritchard S, et al. A physical wiring diagram for the human immune system. Nature. 2022;608(7922):397–404. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 69.Ma S, Zhang B, LaFave LM, Earl AS, Chiang Z, Hu Y, et al. Chromatin potential identified by shared single-cell profiling of RNA and chromatin. Cell. 2020;183(4):1103–16. e20. [DOI] [PMC free article] [PubMed]
- 70.Efremova M, Teichmann SA. Computational methods for single-cell omics across modalities. Nat Methods. 2020;17(1):14–7. [DOI] [PubMed] [Google Scholar]
- 71.Yuan Q, Duren Z. Integration of single-cell multi-omics data by regression analysis on unpaired observations. Genome Biol. 2022;23(1):160. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 72.Lee MY, Kaestner KH, Li M. Benchmarking algorithms for joint integration of unpaired and paired single-cell RNA-seq and ATAC-seq data. Genome Biol. 2023;24(1):244.
- 73.Wolf FA, Angerer P, Theis FJ. SCANPY: large-scale single-cell gene expression data analysis. Genome Biol. 2018;19:1–5.
- 74.Pei S, Pollyea DA, Gustafson A, Stevens BM, Minhajuddin M, Fu R, et al. Monocytic subclones confer resistance to venetoclax-based therapy in patients with acute myeloid leukemia. Cancer Discov. 2020;10(4):536–51.
- 75.Golomb SM, Guldner IH, Zhao A, Wang Q, Palakurthi B, Aleksandrovic EA, et al. Multi-modal single-cell analysis reveals brain immune landscape plasticity during aging and gut microbiota dysbiosis. Cell Rep. 2020. 10.1016/j.celrep.2020.108438.
- 76.Pombo Antunes AR, Scheyltjens I, Lodi F, Messiaen J, Antoranz A, Duerinck J, et al. Single-cell profiling of myeloid cells in glioblastoma across species and disease stage reveals macrophage competition and specialization. Nat Neurosci. 2021;24(4):595–610.
- 77.Triana S, Vonficht D, Jopp-Saile L, Raffel S, Lutz R, Leonce D, et al. Single-cell proteo-genomic reference maps of the hematopoietic system enable the purification and massive profiling of precisely defined cell states. Nat Immunol. 2021;22(12):1577–89.
- 78.Hao Y, Hao S, Andersen-Nissen E, Mauck WM, Zheng S, Butler A, et al. Integrated analysis of multimodal single-cell data. Cell. 2021;184(13):3573–87.e29.
- 79.Guilliams M, Bonnardel J, Haest B, Vanderborght B, Wagner C, Remmerie A, et al. Spatial proteogenomics reveals distinct and evolutionarily conserved hepatic macrophage niches. Cell. 2022;185(2):379–96.e38.
- 80.Xu Y, Wang S, Feng Q, Xia J, Li Y, Li H-D, et al. scCAD: Cluster decomposition-based anomaly detection for rare cell identification in single-cell expression data. Nat Commun. 2024;15(1):7561.
- 81.Rand WM. Objective criteria for the evaluation of clustering methods. J Am Stat Assoc. 1971;66(336):846–50.
- 82.Kuhn HW. The Hungarian method for the assignment problem. Nav Res Logist Q. 1955;2(1–2):83–97.
- 83.Harris CR, Millman KJ, Van Der Walt SJ, Gommers R, Virtanen P, Cournapeau D, et al. Array programming with NumPy. Nature. 2020;585(7825):357–62.
- 84.McKinney W. Data structures for statistical computing in Python. In: Proceedings of the 9th Python in Science Conference; 2010.
- 85.Virtanen P, Gommers R, Oliphant TE, Haberland M, Reddy T, Cournapeau D, et al. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat Methods. 2020;17(3):261–72.
- 86.Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: Machine learning in Python. J Mach Learn Res. 2011;12:2825–30.
- 87.Hunter JD. Matplotlib: a 2D graphics environment. Comput Sci Eng. 2007;9(3):90–5.
- 88.Waskom ML. Seaborn: statistical data visualization. J Open Source Softw. 2021;6(60):3021.
- 89.Pei S, Gillen AE, Jordan CT. CITE-seq (cellular indexing of transcriptomes and epitopes by sequencing) of paired diagnosis and relapse specimens from acute myeloid leukemia (AML) patients treated with venetoclax plus azacitidine. Gene Expression Omnibus. 2020. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE143363.
- 90.Zhang S, Golomb SM. Multi-modal single cell analysis reveals gut microbiota reshapes brain immune landscape in aged mouse model. Gene Expression Omnibus. 2020. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE148127.
- 91.Pombo Antunes AR, Scheyltjens I, Lodi F, Messiaen J, Antoranz A, Duerinck J, et al. Single-cell profiling of myeloid cells in glioblastoma across species and disease stage reveals macrophage competition and specialization. Gene Expression Omnibus. 2020. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE163120.
- 92.Butler A, Stuart T. Comprehensive integration of single-cell data. Gene Expression Omnibus. 2019. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE128639.
- 93.Velten L, Triana S, Vonficht D. Single-cell proteo-genomic reference maps of the human hematopoietic system. Figshare. 2021. https://figshare.com/projects/Single-cell_proteo-genomic_reference_maps_of_the_human_hematopoietic_system/94469.
- 94.Hao Y. Integrated analysis of multimodal single-cell data. Gene Expression Omnibus. 2021. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE164378.
- 95.Spatial proteogenomics reveals distinct and evolutionarily-conserved hepatic macrophage niches. Gene Expression Omnibus. 2022. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE192742.
- 96.Yin Y-H, Wang F, Li W, Liu Q, Zhou S, Zhou M, et al. Code for CBTP: Comparative benchmarking of single-cell clustering algorithms for transcriptomic and proteomic data. GitHub. 2025. https://github.com/yinyh-1997/CBTP.
- 97.Yin Y-H, Wang F, Li W, Liu Q, Zhou S, Zhou M, et al. Code for CBTP: Comparative benchmarking of single-cell clustering algorithms for transcriptomic and proteomic data. Zenodo. 2025. 10.5281/zenodo.15766048.
Associated Data
Supplementary Materials
Additional file 1: Notes S1-S2 and Figs. S1-S18.
Data Availability Statement
All data used in this study are publicly available, and their usage is described in detail in the Methods. Data1 was described in ref. [74] and is available at https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE143363 [89]. Data2 was described in ref. [75] and is available at https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE148127 [90]. Data3 and Data4 were described in ref. [76] and are available at https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE163120 [91]. Data5 was described in ref. [27] and is available at https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE128639 [92]. Data6 and Data7 were described in ref. [77] and are available at https://figshare.com/projects/Single-cell_proteo-genomic_reference_maps_of_the_human_hematopoietic_system/94469 [93]. Data8 was described in ref. [78] and is available at https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE164378 [94]. Data9 and Data10 were described in ref. [79] and are available at https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE192742 [95]. For more information, please refer to the Methods and Additional file 2: Table S1. The source code and implementation details of our benchmarking framework, implemented in Python and R, are publicly accessible on GitHub (https://github.com/yinyh-1997/CBTP) [96] and Zenodo (10.5281/zenodo.15766048) [97] under the MIT License.






