Abstract
Background
Single-cell foundation models (scFMs) have emerged as powerful tools for integrating heterogeneous datasets and exploring biological systems. Despite high expectations, their ability to extract unique biological insights beyond standard methods and their advantages over traditional approaches in specific tasks remain unclear.
Results
Here, we present a comprehensive benchmark study of six scFMs against well-established baselines under realistic conditions, encompassing two gene-level and four cell-level tasks. Pre-clinical batch integration and cell type annotation are evaluated across five datasets with diverse biological conditions, while clinically relevant tasks, such as cancer cell identification and drug sensitivity prediction, are assessed across seven cancer types and four drugs. Model performance is evaluated using 12 metrics spanning unsupervised, supervised, and knowledge-based approaches, including scGraph-OntoRWR, a novel metric designed to uncover intrinsic knowledge encoded by scFMs. We provide holistic rankings from dataset-specific to general performance to guide model selection. Our findings reveal that scFMs are robust and versatile tools for diverse applications while simpler machine learning models are more adept at efficiently adapting to specific datasets, particularly under resource constraints. Notably, no single scFM consistently outperforms others across all tasks, emphasizing the need for tailored model selection based on factors such as dataset size, task complexity, biological interpretability, and computational resources.
Conclusions
This benchmark introduces novel evaluation perspectives, identifies the strengths and limitations of current scFMs, and paves the way for their effective application in biological and clinical research, including cell atlas construction, tumor microenvironment studies, and treatment decision-making.
Supplementary Information
The online version contains supplementary material available at 10.1186/s13059-025-03781-6.
Background
Single-cell RNA sequencing (scRNA-seq), which provides a granular view of the transcriptome at single-cell resolution, has broadened our understanding of biological processes and revolutionized the research paradigm in biology and drug development [1]. With advances in high-throughput sequencing technology, the amount of single-cell transcriptomic data has grown exponentially [2, 3], providing an abundant corpus for training machine learning (ML) models. However, transcriptomic data are highly sparse and high-dimensional, with a low signal-to-noise ratio [4, 5], which complicates downstream analysis. Traditional ML approaches struggle to effectively harness knowledge from such data to build general-purpose models. New computational strategies are therefore urgently needed to cope with this complexity and extract more valuable information from heterogeneous transcriptomic data across platforms, tissues, patients, and even species. Inspired by the remarkable progress of foundation models in natural language processing (NLP), foundation models for single-cell omics have emerged as a promising avenue [1–4, 6–10]. By leveraging massive and diverse data in a self-supervised manner, foundation models hold the promise of learning universal biological knowledge during pretraining, endowing them with emergent abilities for zero-shot learning and efficient adaptation to various downstream tasks.
To promote future advancements in this rapidly evolving field, several benchmarking efforts have explored the potential of single-cell foundation models (scFMs) in single-cell data analysis. For instance, Boiarsky et al. [11] fine-tuned scBERT for cell type annotation, Kedzierska et al. [12] assessed the utility of Geneformer and scGPT in batch integration, and several benchmarking studies have specifically examined cellular perturbation prediction under covariate transfer, combinatorial prediction, or distribution shift [5, 13, 14]. Liu et al. [15] proposed a technology-oriented benchmarking study, covering 10 scFMs over 8 downstream tasks, to evaluate the effect of initial settings, hyperparameters, and loss components on model performance. These studies revealed that pretrained foundation models fail to outperform simpler baseline models in certain scenarios [5, 11–13], which raises questions about the effectiveness of the “pre-train then fine-tune” paradigm and the representation learning capacity of scFMs.
Despite the valuable contributions of previous benchmarking studies, the best practice for constructing and applying scFMs remains unclear, owing to the intricate relationship between single-cell sequencing data and the underlying biology. Currently, three critical issues in practical applications require further attention:
Assessing the biological relevance of scFMs: How can we effectively evaluate the ability of scFMs to capture meaningful biological insights? This involves selecting benchmark datasets that are biologically representative, designing evaluation metrics that align with prior biological knowledge, and developing protocols that reflect real-world biological applications.
Choosing between complex foundation models and simpler alternatives: What factors should guide the decision to use a complex foundation model versus a simpler machine learning model? Key considerations include dataset size, task complexity, the need for biological interpretability, and available computational resources.
Model generalization and task-specific selection: Are there any foundation models that consistently outperform others across diverse application scenarios? If not, how can we systematically select the most appropriate model for a specific task and dataset? This requires a deeper understanding of the strengths and limitations of existing scFMs, as well as the development of guidelines to match models to specific tasks and data characteristics.
To address these open questions, we propose a benchmarking framework that performs a deep introspection of zero-shot scFM embeddings using biologically meaningful metrics and clinically relevant tasks, and provides a practical solution for model selection in real-world applications. We evaluate six scFMs with different pretraining settings, representing the current state of the art, alongside baseline strategies including highly variable gene (HVG) selection, the anchor-based Seurat [16], the clustering-based Harmony [17], and the generative model scVI [18], to ascertain the gains from large-scale pretraining. To evaluate the usefulness and transferability of the learned representations, we adopt a zero-shot protocol across two gene-level tasks and four cell-level tasks, leveraging large and diverse benchmarking datasets with high-quality labels. To further mitigate the risk of data leakage and rigorously validate our conclusions, we introduce an independent and unbiased dataset: the Asian Immune Diversity Atlas (AIDA) v2 [19] from CellxGene [20]. Our benchmark is application- and biology-oriented, focusing on challenging scenarios neglected by previous benchmarking efforts, such as novel cell types, cross-tissue homogeneity, and intra-tumor heterogeneity. In particular, we propose cell ontology-informed metrics that bring a fresh perspective to model evaluation. The scGraph-OntoRWR metric measures the consistency of cell type relationships captured by scFMs with prior biological knowledge. Additionally, the Lowest Common Ancestor Distance (LCAD) metric, which measures the ontological proximity between misclassified cell types, is introduced to assess the severity of errors in cell type annotation. Our experimental results demonstrate that pretrained zero-shot scFM embeddings indeed capture biological insights into the relational structure of genes and cells, which benefits downstream tasks.
In addition, we quantitatively estimate how model performance correlates with cell-property landscape roughness [21] in the pretrained latent space, verifying that the performance improvement arises from a smoother landscape, which reduces the difficulty of training task-specific models. Our findings suggest that scFMs can serve as a plug-and-play module to push the boundaries of various downstream tasks. Lastly, we systematically review the experimental results and provide model rankings via a non-dominated sorting algorithm that aggregates multiple evaluation metrics. The task-specific and overall rankings provide general guidance for model selection. For more specific needs, the roughness index (ROGI) [21] can serve as a proxy to recommend an appropriate model in a dataset-dependent manner. This approach not only simplifies the evaluation of candidate models, but also provides valuable insights into the differences between scFMs on a given downstream task. Overall, our study addresses existing research gaps in the field of scFMs and facilitates meaningful biological interpretation of results by introducing more challenging benchmarking tasks and novel evaluation perspectives.
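The non-dominated sorting step above can be sketched in a few lines. The sketch below is illustrative only, assuming every metric is oriented so that higher is better; the model names and scores are hypothetical, not results from this study.

```python
# Illustrative sketch: rank models by non-dominated (Pareto) sorting
# over multiple evaluation metrics, all assumed higher-is-better.
# Model names and scores are hypothetical examples.

def dominates(a, b):
    """a dominates b if a >= b on every metric and > b on at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def non_dominated_sort(scores):
    """Return a list of Pareto fronts (sorted lists of model names), best first."""
    remaining = dict(scores)
    fronts = []
    while remaining:
        # A model is in the current front if no other remaining model dominates it.
        front = [m for m, s in remaining.items()
                 if not any(dominates(t, s) for n, t in remaining.items() if n != m)]
        fronts.append(sorted(front))
        for m in front:
            del remaining[m]
    return fronts

scores = {                      # (metric1, metric2), hypothetical values
    "model_a": (0.90, 0.80),
    "model_b": (0.85, 0.85),    # trades off against model_a: same front
    "model_c": (0.80, 0.70),    # dominated by model_a: second front
}
print(non_dominated_sort(scores))
```

The first front contains all mutually non-dominated models, which is how a single ranking can aggregate metrics without choosing weights between them.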
Results
Overview of the benchmarking framework
In this benchmarking study, we evaluate the zero-shot gene embeddings and cell embeddings learned from large-scale pretraining. The benchmarking pipeline, covering feature extraction, downstream tasks, selected models, datasets, and evaluation metrics, is depicted in Fig. 1. Compared with sequence modeling in the NLP community, scRNA-seq data has unique features:
Gene tokens have an additional feature representing their expression levels.
Genes can interact dynamically and are not ordered in a sequential manner like words in a sentence [22].
Fig. 1.
Benchmarking pipeline. a Extraction of gene embeddings and cell embeddings. The scRNA-seq data can be represented as an n × m matrix, where n is the number of cells and m is the number of genes. The input of the scFMs contains the gene identity information and the expression values. The output of the scFMs is the contextualized cell embeddings in the latent space. b Gene-level evaluation tasks, including predicting tissue specificity and Gene Ontology (GO) terms. c Cell-level evaluation tasks, comprising batch integration, cell type annotation, cancer cell identification, and drug sensitivity prediction. d The benchmarking datasets are selected at different scales of size and diversity. In total, six scFMs based on the Transformer architecture are comprehensively evaluated in our study. The evaluation metrics consist of traditional metrics and cell ontology-informed metrics
Numerous competing approaches have been proposed to tweak the Transformer architecture for better encoding scRNA-seq data, but a consensus on the best practice has yet to be established. To assess the current progress of modeling scRNA-seq data using foundation models, we consider six prominent and widely used scFMs (namely, Geneformer [7], scGPT [2], UCE [8], scFoundation [4], LangCell [9], and scCello [10]) across various pretraining settings as part of our benchmarking workflow. The input layers of these scFMs can be divided into three components: gene embeddings (analogous to word embeddings), value embeddings, and positional embeddings (Table 1 and Additional file 1: Note S1). To gain deeper insights into what the scFMs learn from pretraining, we evaluate them on two gene-level tasks, four cell-level tasks, and an attention-based interpretability analysis that captures different semantic information.
Table 1.
Overview of the scFMs for benchmarking
| Model name | Omics modalities | Model parameters | Pretraining dataset | # Input genes | #Output dim | Value embedding | Gene symbol embedding | Positional embedding | Architecture | Pretraining tasks |
|---|---|---|---|---|---|---|---|---|---|---|
| Geneformer [7] | scRNA-seq | 40 M | 30 M | 2048 ranked genes | 256 (6L); 512 (12L) | Ordering | Lookup Table (512d) | ✓ | Encoder | MGM with CE loss (gene ID prediction) |
| scGPT [2] | scRNA-seq, scATAC-seq, CITE-seq, spatial transcriptomics | 50 M | 33 M | 1200 HVGs | 512 | Value binning | Lookup Table (512d) | × | Encoder with attention mask | Iterative MGM with MSE loss (gene-prompt + cell-prompt), generative pretraining |
| UCE [8] | scRNA-Seq | 650 M | 36 M | 1024 non-unique genes sampled (with replacement) by expression and ordered by genomic positions | 1280 | / | ESM-2 [104] based protein embedding (5120d) | ✓ | Encoder | Modified MGM: binary CE loss for predicting whether a gene is expressed or not |
| scFoundation [4] | scRNA-Seq | 100 M | 50 M | 19,264 human protein-encoding genes and common mitochondrial genes | 3072 | Value projection | Lookup Table (768d) | × | Asymmetric encoder-decoder | Read-depth-aware MGM with MSE loss |
| LangCell [9] | scRNA-Seq | 40 M | 27.5 M scRNA-text pairs (use cell type labels) | 2048 ranked genes | 256 | Ordering | Lookup Table (512d)a | ✓ | Cell encoder + Text encoderb | MGM with CE loss + intra-modal CL + inter-modal CL |
| scCello [10] | scRNA-Seq | 10 M | 22 M (use cell type labels) | 2048 ranked genes | 200 | Ordering | Lookup Table (256d) | ✓ | Encoder | MGM with CE loss + cell-type coherence loss + ontology alignment loss |
HVGs highly variable genes, MGM masked gene modeling, CE cross-entropy, MSE mean squared error, CL contrastive learning
aThe gene embeddings are initialized from Geneformer
bThe cell encoder is initialized from Geneformer, while the text encoder is initialized from PubMedBERT [65]
Gene-level tasks
Learning meaningful gene embeddings and capturing the underlying relationships between genes and their functional information is essential for understanding biological systems [22–24]. Ideally, functionally similar genes should be embedded in close proximity in the latent space, analogous to word embeddings in large language models (LLMs) [25, 26]. scFMs automatically learn a gene embedding matrix from diverse cellular contexts, which has proven useful for perturbation effect prediction [5, 27]. To systematically unveil the inductive insights exploited by scFMs, we compare the gene embeddings obtained from scFMs with Functional Representation of Gene Signatures (FRoGS), a recent approach that learns gene embeddings via random walks on a hypergraph, with genes as nodes and Gene Ontology (GO) terms or regulated gene sets as hyperedges [23, 28]. In this study, we extract gene embeddings from the input layers of scFMs and use them to predict known biological relationships, including tissue specificity and GO terms (see details in Additional file 1: Note S2 and Additional file 2: Fig. S1).
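As a minimal illustration of this kind of gene-level probe, the sketch below predicts a functional label for a held-out gene from its nearest neighbors in embedding space. The embeddings, labels, and the k-NN probe itself are toy stand-ins, not our actual protocol (see Additional file 1: Note S2 for that).

```python
import numpy as np

# Toy sketch: if functionally similar genes lie close together in the
# embedding space, a held-out gene's functional label (e.g., a GO term)
# should be recoverable from its nearest neighbors. All data below are
# synthetic stand-ins for real gene embeddings.

def knn_predict(emb, labels, query, k=3):
    """Majority-vote label among the k nearest genes by cosine similarity."""
    e = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    sims = e @ q
    top = np.argsort(sims)[::-1][:k]
    votes = [labels[i] for i in top]
    return max(set(votes), key=votes.count)

rng = np.random.default_rng(0)
# Two toy functional clusters in a 16-dimensional embedding space.
cluster_a = rng.normal(loc=1.0, scale=0.1, size=(5, 16))
cluster_b = rng.normal(loc=-1.0, scale=0.1, size=(5, 16))
emb = np.vstack([cluster_a, cluster_b])
labels = ["GO:A"] * 5 + ["GO:B"] * 5

query = rng.normal(loc=1.0, scale=0.1, size=16)  # drawn from cluster A
print(knn_predict(emb, labels, query))
```

A well-structured embedding space yields the correct label for the held-out gene; the same probe applied to random embeddings would perform at chance.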
Cell-level tasks
Our framework explores the effectiveness of zero-shot scFM cell embeddings in dataset integration (Diving deep into the cell representation space via biologically informed metrics) and cell type annotation (Benchmarking analysis on cell type annotation), two core steps in scRNA-seq data analysis, especially for the construction of a comprehensive cell atlas [29]. Reliable, consistent, and reproducible single-cell research requires a unified cell embedding space that removes batch effects while preserving biological variation [30]. Here, we employ five high-quality datasets with manual annotations to analyze the representation space. These datasets vary in size and diversity and contain multiple sources of batch effects, such as inter-patient, inter-platform, and inter-tissue variations, which present unique challenges for data integration (Table 2). To better capture the conservation of biological structures and offer a more holistic assessment of model performance, we introduce cell ontology-informed metrics into the evaluation pipeline, bringing a biologically grounded perspective overlooked by traditional metrics. In addition to intra-dataset validation, we also conduct cross-dataset validation and novel cell type identification to simulate cell type annotation in practical applications.
Table 2.
Overview of benchmarking datasets for batch integration and cell type annotation
| Dataset name | Description | # Cells | # Cell types | # Batches | Label column | Batch column | Download link | Publication (DOI) |
|---|---|---|---|---|---|---|---|---|
| Pancreas (human) | Cells from human pancreas created by combining data spanning 6 studies | 16 K | 14 | 6 | celltype | batch (tech) | 10.6084/m9.figshare.12420968 | 10.1038/s41592-021-01336-8 |
| Immune (human) | Cells from 7 human PBMC samples and 3 human Bone Marrow samples | 33 K | 16 | 10 | final_annotation | batch | 10.6084/m9.figshare.12420968 | 10.1038/s41592-021-01336-8 |
| Tabula Sapiens | Cells from 24 different tissues across 15 human donors | 483 K | 177 | 24 | cell_ontology_class | tissue_in_publication | https://cellxgene.cziscience.com/collections/e5f58829-1a66-40b5-a624-9046778e74f5 | 10.1126/science.abl4896 |
| HLCA (core) | Cells from healthy lung tissue across 107 individuals | 585 K | 50 | 14 | cell_type | dataset | https://cellxgene.cziscience.com/collections/6f6d381a-7701-4781-935c-db10d30de293 | 10.1038/s41591-023-02327-2 |
| AIDA v2 (New data) | Circulating immune cells from 121 healthy Asian donor samples | 201 K | 32 | 121 | cell_type | donor_id | https://cellxgene.cziscience.com/collections/ced320a1-29f3-47c1-a735-513c7084d508 | 10.1016/j.cell.2025.02.017 |
Further, we dive deep into the applications of zero-shot cell embeddings in clinical scenarios. A deeper evaluation of cellular behavior in disease states benefits clinical research and drug discovery [31]. For instance, the study of tumor microenvironments is critical for deciphering the mechanisms of cancer progression and designing cancer therapies [32]. To evaluate awareness of cell states and cellular heterogeneity in tumor microenvironments, we apply scFMs to cancer cell identification (Benchmarking analysis on cancer cell identification) [33] and drug sensitivity prediction (Benchmarking analysis on drug sensitivity prediction) [4, 32], tasks that have been neglected in previous benchmark studies [11, 12, 15]. In these tasks, we employ task-specific downstream models whose input features are replaced by pretrained cell embeddings, enabling us to leverage the strengths of both approaches and assess the intrinsic quality of the zero-shot cell embeddings. The performance evaluation focuses on cross-tissue generalization and the transferability from bulk-level data to single-cell data. The criteria for dataset selection are the availability of high-quality ground truth labels and the coverage of diverse conditions. Additionally, we analyze the cell-property landscape using the ROGI approach, which offers an in-depth interpretation of how pretrained cell embeddings contribute to the performance enhancement.
Interpretability analysis
The Transformer architecture used by scFMs can provide informative attention scores and contextualized gene embeddings for gene regulatory network (GRN) inference [2, 3], which is crucial for dissecting cell identity [34]. As a proof of concept, we perform an exploratory analysis using pretrained scFMs to assess their potential for attention-based GRN analysis, inferring the target genes of a transcription factor from the change in attention weights before and after perturbation. The results on the Adamson dataset [35] confirm that Transformer-based scFMs capture biologically meaningful gene–gene interactions through their attention patterns (see details in Additional file 1: Note S3, Additional file 2: Fig. S2, and Additional file 3: Table S1).
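The attention-shift idea can be illustrated with a toy sketch: given gene-by-gene attention matrices extracted before and after perturbing a transcription factor (TF), candidate targets are ranked by the absolute change in attention involving the TF. The matrices, gene names, and scoring rule below are illustrative simplifications, not our exact procedure (see Additional file 1: Note S3).

```python
import numpy as np

# Toy sketch of attention-based target inference: rank candidate target
# genes of a perturbed TF by how much the attention weights involving
# the TF change. The attention matrices here are synthetic; real scFMs
# would supply them from their Transformer layers.

def rank_targets_by_attention_shift(att_before, att_after, tf_idx):
    """Score each gene by |delta attention| to/from the TF; return ranked indices."""
    delta = np.abs(att_after - att_before)
    # Combine attention the TF pays to each gene and receives from it.
    scores = delta[tf_idx, :] + delta[:, tf_idx]
    scores[tf_idx] = -np.inf               # exclude the TF itself
    return np.argsort(scores)[::-1]

genes = ["TF", "target1", "target2", "bystander"]
att_before = np.full((4, 4), 0.25)         # uniform attention pre-perturbation
att_after = att_before.copy()
att_after[0, 1] = 0.60                     # TF -> target1 attention jumps
att_after[2, 0] = 0.45                     # target2 -> TF attention jumps

ranking = rank_targets_by_attention_shift(att_before, att_after, tf_idx=0)
print([genes[i] for i in ranking[:2]])     # top-ranked candidate targets
```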
Diving deep into the cell representation space via biologically informed metrics
Combining sequencing datasets from different sources to create a self-consistent single-cell atlas requires a batch integration approach that removes complex batch effects while preserving biological signals. Previous benchmarking studies follow the principle that cells of the same class should be grouped together in a batch-invariant space [36]. In contrast, we focus on assessing how well the pretrained cell embedding space reflects the inherent biological structure among cell types.
To explore the boundaries of the zero-shot capabilities of scFMs in data integration, we conduct experiments on five integration tasks: the Pancreas dataset [36], the Immune (human) dataset [36], the Human Lung Cell Atlas (HLCA core set) [37], the Tabula Sapiens dataset [38], and the Asian Immune Diversity Atlas (AIDA) v2 dataset [19] (Table 2). These datasets vary in scale and complexity, comprising 16,000 ~ 585,000 cells from single or multiple tissues. Batch effects in the single-tissue datasets arise primarily from sequencing platforms and inter-individual variation, whereas the multi-tissue datasets introduce additional inter-tissue variation; the most challenging, Tabula Sapiens, encompasses 24 different tissues from 15 human donors. Notably, the AIDA v2 dataset, released in April 2025, incorporates approximately 201,000 immune cells from healthy donors in Singapore, Thailand, and India, ensuring that none of the scFMs had exposure to these data during pretraining.
The Uniform Manifold Approximation and Projection (UMAP) [43] visualization of cell embeddings in each integrated dataset is shown in Additional file 2: Fig. S3 ~ S12. Following the experimental settings in scGPT, the evaluation of batch integration is based on three biological conservation metrics (NMIcell, ASWcell, ARIcell) and two batch correction metrics (ASWbatch, graph connectivity) from scIB [36] (Methods). Consistent with prior studies [12, 15], scFMs do not surpass the simpler baseline models (HVG and scVI) under the scIB metrics (Fig. 2a, Additional file 2: Fig. S13 ~ S14). Among the scFMs, scGPT achieves the highest batch correction score, likely owing to its binned value encoding, which mitigates differences in sequencing depth and scale. scCello attains the highest bio-conservation score, in line with its ontology-informed pretraining, which preserves meaningful biological structure.
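For intuition, the bio-conservation side of this evaluation can be sketched with scikit-learn on a synthetic embedding: cluster the integrated embedding, then compare the clusters against ground-truth cell type labels. The embedding below is synthetic; the actual benchmark uses the scIB implementations of the full metric suite.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (adjusted_rand_score,
                             normalized_mutual_info_score,
                             silhouette_score)

# Minimal illustration of scIB-style bio-conservation scoring on a
# synthetic "integrated embedding" with two well-separated cell types.
rng = np.random.default_rng(0)
emb = np.vstack([rng.normal(0, 0.3, (50, 8)),    # cell type 0
                 rng.normal(3, 0.3, (50, 8))])   # cell type 1
cell_types = np.array([0] * 50 + [1] * 50)

# Cluster the embedding, then compare clusters with cell type labels.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(emb)
nmi = normalized_mutual_info_score(cell_types, clusters)   # cf. NMIcell
ari = adjusted_rand_score(cell_types, clusters)            # cf. ARIcell
asw = silhouette_score(emb, cell_types)                    # basis of ASWcell
print(round(nmi, 2), round(ari, 2), round(asw, 2))
```

On this cleanly separable toy embedding all three scores approach their maxima; real integrations trade these bio-conservation scores off against the batch correction metrics.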
Fig. 2.
Benchmarking results on batch integration. a The average model performance assessed by scIB metrics. Error bars indicate the s.d. of five benchmarking datasets. b The average model performance assessed by scGraph metrics. Error bars indicate the s.d. of five benchmarking datasets. c Violin plots of the class-specific Pearson correlation coefficient (PCC) with the OntoRWR-based reference graph. The dashed line indicates the median HVG-based score. d UMAP visualization of the pretrained cell embeddings from six scFMs on the Tabula Sapiens dataset, colored by shortest path distance to target plasma cell based on the cell ontology graph. The red circle highlights the plasma cell cluster. The r score shown in each subplot denotes the scGraph-OntoRWR score for the plasma cell
However, while the well-established scIB metrics provide valuable insights into batch integration performance, they fail to capture the relational structure among cell types, which is essential for biological interpretation and instrumental to fine-tuning task performance [39]. For instance, a simple Islander model trained on the cell type annotations achieves perfect scIB scores at the cost of distorting the biological structure [40]. In this case, if we use more coarse-grained cell type labels to evaluate the integrated dataset, the overall score will decrease drastically. To capture the conservation of biological structures in the embedding space, Wang et al. [40] propose a scGraph metric to evaluate the consistency between the neighborhood affinities derived from the original data (raw counts or principal component analysis (PCA) loadings) and those from the integrated embeddings. Based on the original scGraph metric (scGraph-PCA), we observe a notable change in the model rankings, with all scFMs exceeding scVI and Harmony (Fig. 2b). This phenomenon highlights the necessity of designing metrics that could better reflect the preservation of biological structures.
As suggested by Hrovatin et al. [29], the evaluation of integration quality should combine prior knowledge-based metrics with unsupervised metrics. The current reference graph is dataset-dependent and biased towards the original data structure [40]. In addition, it often contains many null values and is asymmetric, since cell type distributions vary across batches. Notably, by zooming in on a representative region of the UMAP plots of the Tabula Sapiens dataset (Additional file 2: Fig. S15), we find that HVG-based integration places breast and cardiac fibroblasts too far apart, which is biologically implausible; nevertheless, its performance appears overestimated when evaluated by the scGraph-PCA metric. To address these deficiencies, we introduce a new metric, scGraph-OntoRWR, which compares the learned cell embedding space with the expert-defined Cell Ontology [41, 42]. The key idea is to combine the text descriptions of cell types with the ontology graph structure into a dataset-independent reference graph for robust and consistent benchmarking. Specifically, we construct a weighted cell ontology graph based on the text similarity between connected cell types using the LangCell text encoder, then perform random walk with restart (RWR) to obtain the equilibrium distribution for pairwise distance computation (Methods).
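The RWR step can be sketched on a toy weighted graph. The three-node graph, uniform edge weights, and restart probability below are illustrative only; the actual construction (LangCell-weighted edges over the full Cell Ontology) is specified in Methods.

```python
import numpy as np

# Sketch of the random-walk-with-restart (RWR) step behind the
# scGraph-OntoRWR reference graph: from a weighted ontology adjacency
# matrix, compute each cell type's equilibrium visiting distribution,
# then derive pairwise distances between those distributions.

def rwr(adj, seed, restart=0.5, tol=1e-10, max_iter=1000):
    """Equilibrium distribution of a random walk restarting at `seed`."""
    n = adj.shape[0]
    trans = adj / adj.sum(axis=0, keepdims=True)    # column-normalized
    e = np.zeros(n); e[seed] = 1.0
    p = e.copy()
    for _ in range(max_iter):
        p_new = restart * e + (1 - restart) * trans @ p
        if np.abs(p_new - p).sum() < tol:
            break
        p = p_new
    return p

# Toy ontology graph: a path of three cell types, 0 - 1 - 2.
adj = np.array([[0., 1., 0.],
                [1., 0., 1.],
                [0., 1., 0.]])
profiles = np.stack([rwr(adj, s) for s in range(3)])
dist = np.linalg.norm(profiles[:, None, :] - profiles[None, :, :], axis=-1)
# Adjacent types (0, 1) end up closer than the two endpoints (0, 2).
print(dist[0, 1] < dist[0, 2])
```

Because each equilibrium distribution concentrates mass on the seed's graph neighborhood, distances between distributions reflect ontological proximity rather than any single dataset's structure.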
First, we perform a case study to illustrate the necessity of a dataset-independent metric. As shown in Additional file 2: Fig. S16a, the OntoRWR distances increase monotonically with the shortest-path distance in the directed acyclic graph (DAG distance), whereas the PCA distance introduces a conflict: the liver dendritic cell, the most distant cell type in the cell ontology graph, shows the shortest PCA distance. Further, we generate UMAP [43] plots colored by the DAG distances to the plasma cell to visualize the organization of cell embeddings learned by scVI and the scFMs (Fig. 2d and Additional file 2: Fig. S16b). The Pearson correlation coefficient (PCC) measures how well the Euclidean distances of other cell types from the target cell (plasma cell) in the embedding space align with the OntoRWR distances. This analysis demonstrates that, except for Geneformer, scFMs exhibit strong advantages over the baselines in preserving cell type relationships in the latent space. Specifically, the leading scFM (scCello) exceeds scVI by 0.187 (+38.8%) in terms of PCC.
Subsequently, we utilize the scGraph-OntoRWR metric to evaluate the integrated datasets (Fig. 2b and Additional file 2: Fig. S17). Intriguingly, HVG performs extremely well under scGraph-PCA, but ranks moderately under scGraph-OntoRWR. In addition, the performance gap between scFMs becomes more pronounced when assessed by scGraph-OntoRWR compared to scGraph-PCA. On average, the UCE model achieves the best results among the scFMs, followed by scFoundation and LangCell. Taking each cell type as an anchor, we calculate and visualize the distribution of cell type-specific scGraph-OntoRWR scores (Fig. 2c), highlighting the superiority of LangCell and scCello on the unbiased AIDA v2 dataset. These observations indicate that our proposed metric meaningfully complements the original scGraph metric by providing a biology-aware perspective.
Overall, the integrated cell representation space learned by scFMs better preserves biologically relevant relationships than simple baseline models, a strength overlooked in previous studies. Our findings underscore the importance of selecting appropriate benchmarking metrics and considering the relational structures between cell types captured by the pretrained models.
Benchmarking analysis on cell type annotation
Reliable cell type annotation plays a significant role in building integrated single-cell atlases. The development of a unified annotation model generalizing well across tissues is a holy grail of the field, which still has a long way to go [44]. Currently, limited attention has been paid to cross-tissue cell annotation, and the cell-type coverage of the benchmarking datasets is insufficient [11, 15]. Thus, it remains unclear how scFMs perform on larger and more diverse scRNA-seq datasets across a wide range of tissues. Moreover, there has been little progress in validating the performance of scFMs on novel cell type discovery.
To provide more comprehensive evaluations, we consider three different cell type annotation scenarios, including intra-dataset validation, cross-dataset validation, and identification of novel cell types. Considering the superior performance and generalizability of OnClass [45], we adopt the annotation scheme that projects the cell ontology terms and single-cell transcriptomes into the same low-dimensional space. Here, we use cell embeddings provided by integration approaches and scFMs instead of the original gene expression profiles as the input of OnClass.
Accuracy and F1-score are the most common metrics in multi-class classification, estimating the overall correctness of predictions and the balance between precision and recall, respectively. Notably, cell type annotation can be performed at different resolutions according to the hierarchical structure of the cell ontology. Following the protocol of OnClass [45], we retain the most fine-grained, mutually exclusive cell types to train the classifier. After data processing, the multi-organ Tabula Sapiens dataset contains 120 cell types, whereas the single-organ HLCA dataset and the AIDA v2 dataset comprise 37 and 22 cell types, respectively. In this context, we propose Accuracy (non-leaf), an accuracy score customized for cells whose coarse-grained annotations are excluded from training (Methods). In addition, current metrics do not capture the severity of errors (SOE). To fill this gap, we employ the LCAD metric, which measures how closely related two cell types are within the ontology (Methods). This metric allows us to estimate the model’s awareness of cell type hierarchies beyond simple classification accuracy. Overall, the combination of standard metrics and ontology-based metrics provides a more comprehensive evaluation of model performance, reflecting nuances in model behavior.
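The two ontology-informed metrics can be sketched on a toy ontology as follows. The parent map, hop-based distances, and helper names are illustrative simplifications of the formulation given in Methods.

```python
# Toy sketch of the two ontology-informed annotation metrics:
# LCAD (distance through the lowest common ancestor) and a non-leaf
# correctness check. `parents` maps each term to its parents in a
# miniature cell ontology DAG; all terms are illustrative.

parents = {
    "cell": [],
    "leukocyte": ["cell"],
    "T cell": ["leukocyte"],
    "B cell": ["leukocyte"],
    "CD4 T cell": ["T cell"],
}

def ancestors_with_depth(term):
    """All ancestors of `term` (including itself) with hop distance (BFS)."""
    out, frontier = {term: 0}, [term]
    while frontier:
        nxt = []
        for t in frontier:
            for p in parents[t]:
                if p not in out:
                    out[p] = out[t] + 1
                    nxt.append(p)
        frontier = nxt
    return out

def lcad(pred, true):
    """Distance through the lowest common ancestor: d(pred,LCA) + d(true,LCA)."""
    ap, at = ancestors_with_depth(pred), ancestors_with_depth(true)
    common = set(ap) & set(at)
    return min(ap[c] + at[c] for c in common)

def correct_non_leaf(pred, true_coarse):
    """A fine-grained prediction counts as correct for a coarse-grained
    label if the label is an ancestor of (or equal to) the prediction."""
    return true_coarse in ancestors_with_depth(pred)

print(lcad("CD4 T cell", "T cell"))   # mild error: parent-child pair
print(lcad("CD4 T cell", "B cell"))   # severer error: joined via "leukocyte"
print(correct_non_leaf("CD4 T cell", "leukocyte"))
```

Under this toy scoring, confusing a CD4 T cell with its parent "T cell" costs far less than confusing it with a B cell, which is exactly the severity-of-error signal that plain accuracy misses.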
Intra-dataset validation
In this section, we perform intra-dataset validation with 20% of the data for training, simulating the annotation process for a large-scale cell atlas [45]. On standard classification metrics (Accuracy@1 and macro-F1), scFoundation and UCE rank as top performers (Fig. 3a). One-vs-rest logistic regression (LR) and scVI serve as strong baselines, outperforming the other scFMs. While LangCell underperforms scVI on the HLCA and AIDA v2 datasets, it surpasses scVI on the more complex Tabula Sapiens dataset. Intriguingly, Geneformer achieves higher accuracy yet lower macro-F1 scores than scGPT on the HLCA and Tabula Sapiens datasets, indicating that scGPT predicts rare cell types better. Detailed macro-F1 scores grouped by the number of training cells per class (Additional file 2: Fig. S18) further confirm that scGPT outperforms Geneformer on cell types with limited training data. A possible explanation is that Geneformer’s use of highly expressed genes (HEGs) and ordering-based encoding better captures dominant co-expression patterns in abundant cell types, whereas scGPT leverages HVGs and regression-based pretraining, which emphasize biological heterogeneity.
Fig. 3.
Benchmarking results on cell type annotation. a The in-distribution model performance on three benchmarking datasets, including HLCA (37 cell types), Tabula Sapiens (120 cell types), and AIDA v2 (22 cell types). b The model performance on the cross-dataset validation. The test data contains 14 cell types overlapped between the two benchmarking datasets. The red dashed line indicates the median score across all models. c The tissue-specific accuracy in the setting of transferring from HLCA to Tabula Sapiens dataset. d The model complementary analysis in the setting of transferring from Tabula Sapiens to HLCA dataset. The ensemble performance and absolute improvement over the best component are colored by red and blue, respectively. e The out-of-distribution model performance on the AIDA v2 dataset measured by Accuracy (FPR = 0.1). All experiments are independently run 5 times with different random seeds. The results reported in a,c,d,e are average scores
The ontology-relevant metrics, namely Accuracy (non-leaf) and LCAD, provide new perspectives and discoveries not captured by standard metrics (Fig. 3a). Notably, scCello emerges as the best-performing model, consistent with its pretraining objective derived from the cell ontology graph, which leads to cell type relationship-aware embeddings. As measured by Accuracy@1(non-leaf), models that perform best under standard metrics (e.g., UCE and scFoundation) may experience greater degradation when applied to annotate cells with coarse-grained labels. Moreover, the accuracy score does not necessarily correlate with the SOE, as reflected by LCAD.
Our findings suggest that while some models fit the training data well, they may struggle to preserve the cell type hierarchy. More importantly, these results verify the significance of evaluating model performance from diverse perspectives, allowing users to jointly consider multiple metrics tailored to their specific needs.
Cross-dataset validation
The cross-dataset validation, which transfers the knowledge from a known reference dataset to unknown query datasets, is further applied to evaluate the cell type annotation ability in the presence of inter-study variations. Here, we implement the cross-dataset experiments between the single-organ atlas (HLCA) and the multi-organ atlas (Tabula Sapiens).
When applying scVI to a new query dataset, the pretrained reference model must be loaded and fine-tuned on the query dataset, a procedure termed scVI surgery. The effectiveness of scVI surgery largely depends on how well the HVGs selected for training the reference model align with the genes in the expression matrix of the query dataset. Here, the proportion of matched genes is 99.95% when transferring from HLCA to Tabula Sapiens and 71.35% when transferring from Tabula Sapiens to HLCA. Consequently, the scVI-based annotation model falls short of all scFMs when transferred from Tabula Sapiens to HLCA but performs competitively in the reverse direction (Fig. 3b). This phenomenon reveals that classical approaches like scVI have limited transferability and reusability, since they rely on dataset-specific integration and fixed HVGs. We also observe that scCello robustly achieves the best results in the cross-dataset scenarios, suggesting that ontology-aware metrics are better indicators of a model’s generalization ability. Notably, although the accuracy scores are all above 0.6, the macro-F1 scores are much lower, implying that the models struggle to handle distributional shifts in the proportions of known cell types and perform poorly on rare subpopulations under such imbalance (Additional file 2: Fig. S19). Moreover, we analyze the tissue-specific model performance when transferring from HLCA to Tabula Sapiens (Fig. 3c). In summary, scCello, UCE, scVI, scFoundation, scGPT, and Geneformer achieve the best performance in 9, 6, 3, 2, 1, and 1 of the 22 tissues, respectively. The scGPT model lags behind mainly due to its unsatisfactory performance on the Blood and Spleen tissues, which account for a large proportion of the whole test set (Additional file 2: Fig. S20).
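The matched-gene proportion that governs the success of scVI surgery is straightforward to check before committing to reference mapping. A minimal sketch with toy gene names follows (the gene lists are illustrative placeholders; in practice they would be the reference model's HVG list and the query dataset's `var_names`, and the surgery itself is carried out with scvi-tools' reference-mapping utilities):

```python
# Proportion of reference-model HVGs found in a query dataset's gene set.
# `reference_hvgs` and `query_genes` are toy placeholders for the HVG list
# used to train the reference scVI model and the query matrix's gene names.
reference_hvgs = ["CD3D", "CD8A", "MS4A1", "NKG7", "LYZ"]
query_genes = {"CD3D", "CD8A", "NKG7", "LYZ", "GNLY", "FCGR3A"}

matched = [g for g in reference_hvgs if g in query_genes]
proportion_matched = len(matched) / len(reference_hvgs)
print(f"{proportion_matched:.2%} of reference HVGs found in query")  # 80.00%
```

A low proportion (as in the Tabula Sapiens to HLCA direction) warns that many reference HVGs will be zero-filled in the query, degrading the transferred model.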
The above results show that scFMs excel in different scenarios, highlighting the potential of model ensembling, a strategy that has been successfully applied in the domain of protein language models [46]. As a proof-of-concept, we combine the six scFMs into 15 pairs, aggregate their output logits by summation, and apply the SoftMax function to obtain ensemble predictions. In the case of transferring from a general atlas to a tissue-specific atlas, all combinations exhibit advantages over their single constituents, except for scCello, which shows no improvement in terms of Accuracy@1 (Fig. 3d). In particular, the combination of Geneformer and scGPT improves the accuracy from 67.4% to 75.6%.
Interestingly, the best pairwise ensemble does not outperform the best individual model in terms of accuracy, but achieves higher macro-F1 on both datasets (Fig. 3d and Additional file 2: Fig. S21), suggesting that complementarity primarily benefits the prediction of rare cell types. We also perform full ensemble via logit aggregation and majority voting. The results (Additional file 3: Table S2-S3) show that logit aggregation outperforms majority voting in Accuracy@1, while majority voting achieves higher macro-F1. However, both full-ensemble strategies perform worse than the best-performing pairwise ensembles. We attribute this to the fact that full ensembles weigh all models equally, allowing weaker or less complementary models to introduce noise and dilute the contribution of stronger, more complementary ones. In contrast, pairwise ensembles enable the selective combination of strong and complementary models, resulting in a more synergistic effect and better overall performance. This observation underscores the importance of carefully selecting complementary models when designing ensemble strategies in practice.
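The two aggregation schemes compared above can be sketched in a few lines. The logits here are random placeholders (in practice each array would hold one scFM annotation head's outputs for the same cells), but the aggregation logic is the one described: sum the logits and apply SoftMax, or take a majority vote over per-model predictions:

```python
import numpy as np
from scipy.special import softmax

rng = np.random.default_rng(0)

# Per-model output logits for the same cells (cells x classes); placeholders
# for the logits produced by each scFM's annotation head.
logits_a = rng.normal(size=(5, 3))
logits_b = rng.normal(size=(5, 3))

# Pairwise ensemble: sum the logits, then apply SoftMax for probabilities.
pair_probs = softmax(logits_a + logits_b, axis=1)
pair_pred = pair_probs.argmax(axis=1)

# Full ensemble over k models: logit aggregation vs. majority voting.
all_logits = np.stack([logits_a, logits_b, rng.normal(size=(5, 3))])
agg_pred = softmax(all_logits.sum(axis=0), axis=1).argmax(axis=1)
votes = all_logits.argmax(axis=2)                      # (models, cells)
vote_pred = np.array([np.bincount(v).argmax() for v in votes.T])
```

Because summation weights every model equally, a weak model shifts the aggregated logits for all cells, which is consistent with the observation that full ensembles dilute strong, complementary pairs.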
Identification of novel cell types
A critical step in automated cell type annotation is transferring the labels from a well-annotated reference dataset to newly generated datasets [29, 47]. The identification of novel cell types is practically important since the cell types that exist in reference and query datasets often partially overlap.
Novel cell types never seen during training can be defined as out-of-distribution (OOD) data that should be rejected rather than predicted as any seen cell type. Therefore, the classifier should be aware of the data distribution and output low confidence scores for those unknown cells. To assess OOD detection performance at inference time, we select the SoftMax-based confidence score and the energy-based confidence score to distinguish in-distribution samples (positive samples) from OOD samples (negative samples) based on the output logits (Methods). Across unseen ratios ranging from 10 to 90%, the UCE model performs best in both AUROC and AUPRC (Additional file 2: Fig. S22). In particular, as the unseen ratio increases, UCE’s advantage over other models in terms of AUPRC becomes more pronounced, suggesting that it excels at identifying the small proportion of positive samples.
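The two confidence scores can be sketched as follows, on synthetic logits (in-distribution logits are made sharper here purely to mimic a confident classifier; real logits would come from the trained annotation head, and the Methods section defines the exact scoring). The SoftMax score is the maximum predicted probability; the energy-based score (with temperature T = 1) is the log-sum-exp of the logits, with higher values indicating in-distribution:

```python
import numpy as np
from scipy.special import logsumexp, softmax
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)

# Synthetic logits for in-distribution (known type) and OOD (novel) cells;
# ID logits are sharper to mimic a classifier that is confident on known cells.
id_logits = rng.normal(size=(200, 10)) * 4.0
ood_logits = rng.normal(size=(100, 10))
logits = np.vstack([id_logits, ood_logits])
is_id = np.r_[np.ones(200), np.zeros(100)]  # positives = in-distribution

# SoftMax-based confidence: maximum predicted probability.
conf_softmax = softmax(logits, axis=1).max(axis=1)

# Energy-based confidence: -E(x) = T * logsumexp(logits / T), with T = 1.
conf_energy = logsumexp(logits, axis=1)

auroc_softmax = roc_auc_score(is_id, conf_softmax)
auroc_energy = roc_auc_score(is_id, conf_energy)
auprc_energy = average_precision_score(is_id, conf_energy)
```

As the unseen ratio grows, positives become scarcer, which is why AUPRC (sensitive to the positive-class base rate) separates models more sharply than AUROC.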
Following refs. [48] and [49], we calculate the classification accuracy for known cell types under a given false positive rate (Accuracy@FPR, see Methods), set to 5%, 10%, and 20%. This uncertainty-aware accuracy score estimates the model’s ability to correctly classify known cells with a high level of confidence. On all three benchmarking datasets, UCE consistently achieves the best performance (Fig. 3e and Additional file 2: Fig. S23), indicating its superiority in simultaneously discerning novel cells from known cells and correctly annotating the latter. Although the linear LR baseline achieves competitive performance in classifying known cell types, it generalizes poorly to OOD cell populations. The superiority of scFMs in this scenario could be attributed to their large-scale pretraining and multi-head attention mechanism, which enable the models to capture subtle differences between cells [1].
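One plausible implementation of Accuracy@FPR is sketched below (the authoritative definition is in the Methods; details such as how rejected known cells are counted may differ). The confidence threshold is chosen so that the target fraction of OOD cells is wrongly accepted, and known cells that are rejected at that threshold count as errors. The confidence values here are synthetic placeholders:

```python
import numpy as np

def accuracy_at_fpr(conf_id, correct_id, conf_ood, fpr=0.1):
    """Accuracy on known cells at a confidence threshold admitting a
    fraction `fpr` of OOD cells (a sketch of Accuracy@FPR; the
    benchmarked definition may differ in detail)."""
    # Threshold: the (1 - fpr) quantile of OOD confidences, so that a
    # fraction `fpr` of OOD cells are (wrongly) accepted as known.
    tau = np.quantile(conf_ood, 1.0 - fpr)
    accepted = conf_id >= tau
    # Rejected known cells count as errors.
    return np.mean(accepted & correct_id)

rng = np.random.default_rng(0)
conf_id = rng.uniform(0.5, 1.0, size=500)   # known cells: higher confidence
correct = rng.random(500) < 0.9             # 90% classified correctly
conf_ood = rng.uniform(0.0, 0.7, size=200)  # novel cells: lower confidence

acc_fpr10 = accuracy_at_fpr(conf_id, correct, conf_ood, fpr=0.1)
```

The score jointly rewards calibrated rejection of novel cells and correct classification of known cells, which is why LR can do well on plain accuracy yet fall behind here.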
In the future, we can further integrate the confidence score into the training stage, targeting better OOD detection performance [50]. Another promising direction is zero-shot classification for unknown cells, leveraging the cell ontology graph structure [45, 51] or contrastive cell-text pretraining [9, 52].
Benchmarking analysis on cancer cell identification
The identification of cancer cells is a crucial step in studying tumor microenvironments [33]. More broadly, the ability to distinguish cellular disease states exercised by this task can transfer to the identification of other disease-specific cell subpopulations [53]. In our study, we collect a cross-tissue dataset covering 7 cancer types from the Tumor Immune Single-cell Hub 2 (TISCH2) [54] database, which provides diverse clinical datasets with high-quality cancer cell annotations. To evaluate the models’ generalization ability across multiple tissues, we conduct leave-one-tissue-out cross-validation.
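Leave-one-tissue-out cross-validation can be sketched with scikit-learn's `LeaveOneGroupOut` splitter, which trains on all tissues but one and tests on the held-out tissue. The embeddings, malignancy labels, and tissue assignments below are synthetic stand-ins for the benchmark's cell embeddings and TISCH2 annotations:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import LeaveOneGroupOut

rng = np.random.default_rng(0)

# Toy stand-ins: cell embeddings, malignant/normal labels, tissue of origin.
X = rng.normal(size=(600, 32))
y = rng.integers(0, 2, size=600)
X[:, 0] += 2.0 * y                        # malignant cells shifted on axis 0
tissues = rng.choice(["Blood", "Bone", "Brain", "Eye"], size=600)

# Each fold holds out one tissue entirely, testing cross-tissue transfer.
logo = LeaveOneGroupOut()
aurocs = {}
for train_idx, test_idx in logo.split(X, y, groups=tissues):
    clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    scores = clf.predict_proba(X[test_idx])[:, 1]
    aurocs[tissues[test_idx][0]] = roc_auc_score(y[test_idx], scores)
```

Grouping by tissue (rather than random splitting) prevents cells from the test tissue leaking into training, which is the point of the leave-one-tissue-out design.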
The baseline model implemented in Cancer-Finder [33] generates its gene list from the intersection of all training and validation datasets and selects the top 5000 highly variable genes as input features. For scVI, we concatenate 10 datasets into a combined dataset and train a joint model for feature extraction. Both the baseline approach and scVI face challenges when applied to a new query dataset because they require gene set alignment. Selecting overlapping genes across multiple datasets is prone to losing important biological signals due to the high variance of gene coverage in scRNA-seq data. Moreover, genes important for new datasets cannot be added later. In contrast, scFMs offer a remarkable advantage by allowing direct extraction of cell embeddings from each dataset individually, without the need for gene filtering. This preserves the unique biological information of each dataset, making scFMs more robust and adaptable to diverse datasets without the risk of dropping biologically meaningful genes.
The tissue-specific performance (Fig. 4a) shows that the baseline model using gene expressions (Raw) performs well on the Brain and Eye tissues but fails to achieve satisfactory results on the Blood and Bone tissues. The Blood dataset contains three different cancers, and the patient samples in the Bone dataset have received targeted therapy, which may introduce additional variation that makes prediction more challenging. The UCE model consistently outperforms the baseline, indicating that it offers a cell representation space with cross-tissue homogeneity. UCE benefits from its massive scale, a diverse pretraining corpus that includes diseased states, and a binary expression objective, likely resulting in a more abstract and robust embedding space for distinguishing normal from malignant cells. The scGPT and scVI models achieve competitive results across the 4 tissues; both perform HVG selection, highlighting the important role of HVGs in cancer cell identification. Moreover, scGPT’s strength likely originates from its regression-based pretraining on over 33 million normal cells, which establishes a precise quantitative baseline of a “healthy” transcriptome. On average, scGPT (AUROC = 0.824, AUPRC = 0.867) is slightly better than scVI (AUROC = 0.811, AUPRC = 0.860), and the UCE model (AUROC = 0.823, AUPRC = 0.840) exceeds scVI in terms of AUROC but performs worse in terms of AUPRC.
Fig. 4.
Benchmarking results on cancer cell identification. a The tissue-specific model performance in terms of AUROC and AUPRC. b The impact of the batch key selection for the scVI model. c The tissue-specific roughness index (ROGI) calculated on a stratified random sample of 1000 normal and 1000 malignant cells per tissue, with the number of cells drawn from each cluster proportional to its size. The red dashed lines indicate the ROGI for the baseline (Raw). d The average model performance assessed by AUROC and AUPRC. Error bars indicate the s.d. of 4 tissues. e Correlation between average model performance and ROGI. The red line is a linear regression of the x-axis and y-axis. The score reported is the Pearson correlation coefficient
Compared to scVI, a significant advantage of scFMs is that they remove batch effects implicitly and provide a universal cell representation space. Therefore, task-specific models trained on cell embeddings from scFMs can be easily and robustly applied to any customized data. Additionally, the batch key for training scVI has multiple candidate choices, ranging from fine-grained to coarse-grained, including patients, datasets, diseases, and tissues. We conduct exploratory experiments and find that model performance is sensitive to the choice of batch key, which places higher demands on data processing (Fig. 4b). Notably, the scVI results reported in Fig. 4a are based on the best-performing variant using “tissue” as the batch key.
To provide an interpretability analysis of the experimental results, we employ ROGI (Methods) to evaluate the roughness of the cell property landscape provided by different models. We perform stratified random sampling of 1000 normal and 1000 malignant cells per tissue, drawing cells from each cluster in proportion to its size. The datapoints from two tissues are then combined to calculate the pairwise ROGI. The tissue-specific ROGI shown in Fig. 4c is averaged across the three combinations containing the target tissue. The ROGI values of all models except Geneformer are lower than the baseline, consistent with the better classification results reported in Fig. 4a. Furthermore, when performance is aggregated across all 4 tissues (Fig. 4d), the PCC value (− 0.83) demonstrates a strong negative linear correlation between model performance (in terms of AUROC) and ROGI (Fig. 4e). These results indicate that the performance gains of scFMs and scVI are attributable to a smoother landscape with enhanced modellability for ML algorithms, whereas the raw counts data contain more noise and pose tougher optimization challenges. This also suggests that ROGI can serve as a cost-effective means of selecting the optimal model without additional model training.
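The cluster-proportional stratified sampling step can be sketched as follows; the cell table and cluster assignments are synthetic placeholders (real inputs would carry the malignancy annotations and clustering from the benchmark), and the ROGI computation itself follows the Methods:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Toy cell table: malignancy label and cluster assignment per cell.
cells = pd.DataFrame({
    "label": rng.choice(["normal", "malignant"], size=5000),
    "cluster": rng.integers(0, 8, size=5000),
})

def proportional_sample(df, n_total):
    """Draw about `n_total` cells, taking from each cluster a count
    proportional to that cluster's share of the cells."""
    fractions = df["cluster"].value_counts(normalize=True)
    parts = []
    for cluster, frac in fractions.items():
        sub = df[df["cluster"] == cluster]
        n = max(1, round(frac * n_total))
        parts.append(sub.sample(n=min(n, len(sub)), random_state=0))
    return pd.concat(parts)

# 1000 normal and 1000 malignant cells, each stratified by cluster size.
sample = pd.concat([
    proportional_sample(cells[cells["label"] == lab], 1000)
    for lab in ("normal", "malignant")
])
```

Sampling proportionally to cluster size keeps the subsample's cell state composition representative while capping the cost of the pairwise ROGI calculation.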
Benchmarking analysis on drug sensitivity prediction
Apart from distinguishing normal and malignant cells, the cellular heterogeneity within the malignant clusters is also valuable for discovering cancer mechanisms and developing therapeutic strategies. The sensitivity of tumor cells to different drug treatments has been widely studied using ML algorithms [55, 56]. Particularly, uncovering the varied drug response at the single-cell resolution is valuable for precision medicine [32].
Considering that there are far more bulk RNA-seq data with drug efficacy information than single-cell data, a line of research has attempted to transfer the knowledge learned from bulk RNA-seq data (source domain) to improve the prediction of single-cell drug response (target domain) [32, 57, 58]. The gene expression profiles in bulk RNA-seq data give an average measurement over cell mixtures, which masks cellular heterogeneity. Therefore, the transfer learning process needs to align the bulk and single-cell domains. For instance, SCAD [32] utilizes a domain discriminator to help the shared feature extractor learn domain-invariant representations, so that the domain-communal drug response predictor can generalize from bulk data to single-cell data.
In this study, we follow the experimental settings of the original scFoundation study, which trains the baseline SCAD model using all genes shared by the source and target domains, and substitutes the raw counts data with zero-shot scFM embeddings to train embedding-based SCAD models. The drug-specific (Fig. 5a) and overall (Fig. 5b) model performance shows that all scFMs outperform the baseline model, with scFoundation (AUROC = 0.755, AUPRC = 0.753) and scGPT (AUROC = 0.737, AUPRC = 0.732) as the top performers. The advantages of scFoundation may be attributed to its read-depth-aware pretraining and specially designed feature extraction strategy, which enable it to generalize across datasets with varying sequencing depths, including bulk RNA-seq. Owing to its additional pretraining on cell-text pairs containing cancer cells and textual information about cell function and pathway annotations, the LangCell model (AUROC = 0.667, AUPRC = 0.686) outperforms Geneformer (AUROC = 0.601, AUPRC = 0.643) and achieves the best result on the drug Etoposide. scCello is primarily pretrained on normal, annotated cells, which may limit its ability to model the complex heterogeneity and unique cell states within the tumor microenvironment. Notably, the gene expression scale of bulk RNA-seq data is much higher than that of scRNA-seq data, which can adversely affect the gene sampling probabilities defined in UCE. Conversely, scGPT adopts a cell-specific value binning strategy that is more robust to different scales of gene expression profiles.
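The embedding-based transfer setup can be illustrated with a deliberately simplified sketch: a classifier is fitted on labeled bulk samples and evaluated on single cells, assuming both live in a shared zero-shot embedding space. All arrays are synthetic placeholders, and SCAD's adversarial domain discriminator is omitted here; only the source-to-target evaluation pattern is shown:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Shared embedding space: bulk samples (source, with drug response labels)
# and single cells (target). A mild mean shift mimics the domain gap.
bulk_X = rng.normal(size=(300, 16))
bulk_y = rng.integers(0, 2, size=300)      # sensitive vs. resistant
bulk_X[:, 0] += 1.5 * bulk_y               # response signal on one axis
sc_X = rng.normal(size=(400, 16)) + 0.3    # target-domain shift
sc_y = rng.integers(0, 2, size=400)
sc_X[:, 0] += 1.5 * sc_y

# Train on bulk embeddings, evaluate zero-shot on single-cell embeddings.
clf = LogisticRegression(max_iter=1000).fit(bulk_X, bulk_y)
auroc = roc_auc_score(sc_y, clf.predict_proba(sc_X)[:, 1])
```

The better the scFM embedding space aligns the two domains, the less residual shift a downstream predictor must overcome, which is what the adversarial component of SCAD targets explicitly.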
Fig. 5.
Benchmarking results on drug sensitivity prediction. a Scatterplots of drug-specific model performance in terms of AUROC and AUPRC. b Boxplots of model performance over 4 drugs. c The drug-specific roughness index (ROGI) calculated on the target scRNA-seq dataset. The red dashed lines indicate the ROGI for the baseline (Raw). d The average model performance assessed by AUROC and AUPRC. Error bars indicate the s.d. of 4 drugs. e Correlation between average model performance and ROGI. The red line is a linear regression of the x-axis and y-axis. The score reported is the Pearson correlation coefficient
To assess the intrinsic quality of cell embeddings provided by scFMs, we calculate ROGI for each drug using all its target-domain training samples. At the drug-specific level, the best-performing models in Fig. 5a rank 1st, 1st, 2nd, and 3rd in terms of ROGI (from lowest to highest) for Sorafenib, NVP-TAE684, Etoposide, and PLX4720_451Lu, respectively (Fig. 5c). Averaging the drug-specific performance (Fig. 5d), we observe that the ROGI values are strongly negatively correlated with AUROC (PCC = − 0.99, Fig. 5e), highlighting that cell embeddings from scFMs provide a good starting point for training the task-specific module and that ROGI values can be used to guide model selection on a specific downstream dataset.
Overall performance
To provide an overview of the benchmarking results, we utilize a non-dominated sorting algorithm to rank the evaluated models from specific scenarios to general performance (Fig. 6). This approach offers a balanced evaluation of the selected metrics without the need to manually choose aggregation weights (Methods), ensuring that no single metric is unfairly prioritized. This flexibility is crucial for real-world applications, where the choice of optimal model varies with the user’s requirements, dataset characteristics, and resource constraints. For instance, users who prioritize batch correction over biological conservation, or vice versa, can make more informed decisions by selecting models from the Pareto front based on their preferred metric. The overall evaluation partially addresses the three critical issues raised in the “Background” section.
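A minimal non-dominated sort can be sketched as follows (the scores are hypothetical; the benchmark's exact sorting procedure is given in the Methods). A model belongs to the first Pareto front if no other model is at least as good on every metric and strictly better on at least one:

```python
import numpy as np

def non_dominated_sort(scores):
    """Assign each model to a Pareto front; `scores` is (models x metrics),
    higher is better on every metric. Front 0 holds the models that no
    other model dominates."""
    n = len(scores)
    fronts = np.full(n, -1)
    remaining = set(range(n))
    front = 0
    while remaining:
        current = []
        for i in remaining:
            dominated = any(
                np.all(scores[j] >= scores[i]) and np.any(scores[j] > scores[i])
                for j in remaining if j != i
            )
            if not dominated:
                current.append(i)
        for i in current:          # peel off the current front and repeat
            fronts[i] = front
            remaining.discard(i)
        front += 1
    return fronts

# Three metrics for four hypothetical models.
scores = np.array([
    [0.9, 0.8, 0.7],   # strong everywhere -> front 0
    [0.5, 0.9, 0.6],   # trades metric 1 for metric 2 -> also front 0
    [0.4, 0.4, 0.4],   # dominated by the first model -> a later front
    [0.9, 0.8, 0.7],   # ties the first model -> front 0
])
fronts = non_dominated_sort(scores)
```

Because models on the same front are mutually non-dominated, the sort ranks without aggregation weights: a model that trades one metric for another is not penalized for the trade-off itself.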
Fig. 6.
Overall performance ranked by non-dominated sorting algorithm. a Summary of the overall and task-specific rankings. The task-specific rankings are based on the sum of rankings over all datasets and scenarios. The overall rankings are based on the sum of rankings over all tasks. b Model rankings on the cell clustering (batch integration) task based on scIB (AvgBio and AvgBatch) and scGraph metrics (scGraph-PCA and scGraph-OntoRWR). c Model rankings on the cell type annotation task. The intra-dataset model rankings consist of Seen (Basic), Seen (Onto), and Unseen scenarios. The inter-dataset model rankings correspond to the Seen (Transfer) scenario. Seen (Basic) and Seen (Transfer) are based on Accuracy@1 and macro-F1, Seen (Onto) is based on Accuracy@1(non-leaf) and LCAD, and Unseen is based on Accuracy (FPR = 0.05), Accuracy (FPR = 0.1), and Accuracy (FPR = 0.2). d Model rankings on the cancer cell identification task based on AUROC and AUPRC. e Model rankings on the drug sensitivity prediction task based on AUROC and AUPRC
How can we assess the ability of scFMs to capture meaningful biological insights?
We observe that the ranking of models varies with specific metrics, datasets, tasks, and validation scenarios. Therefore, a benchmark study requires more complex and comprehensive experiments with a broad range of metrics to thoroughly characterize the generalization capabilities of scFMs. Our proposed scGraph-OntoRWR, which measures integration quality from a biologically aware perspective, unveils the intrinsic biological prior knowledge encoded by scFMs. We calculate the Spearman correlation coefficients between the model rankings produced by scIB, scGraph-PCA, and scGraph-OntoRWR and the rankings from other downstream tasks (Additional file 2: Fig. S24a). The results show that (a) the three metrics produce rankings with low inter-correlation, confirming that they provide orthogonal perspectives, and (b) the model ranking from scGraph-OntoRWR exhibits the highest average correlation with rankings from other downstream tasks, suggesting it is a more reliable indicator of a model’s practical utility. In particular, drug sensitivity prediction correlates most strongly with cancer cell identification, which is reasonable since both focus on modeling the tumor microenvironment. This also motivates further efforts to evaluate model performance across a broad range of tasks and perspectives. Moreover, the ontology-aware classification metrics (non-leaf accuracy and LCAD) not only reveal nuanced model capabilities but also correlate more strongly with cross-dataset performance than standard metrics, providing a better indicator of generalization ability (Additional file 2: Fig. S24b).
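The rank-agreement comparison reduces to Spearman correlations between ranking vectors. A minimal sketch follows; the rankings below are hypothetical placeholders (1 = best over six models), chosen only to illustrate the comparison, not taken from the benchmark:

```python
from scipy.stats import spearmanr

# Hypothetical rankings of six models (1 = best) under two integration
# metrics and one downstream task; real rankings come from the benchmark.
rank_scib    = [1, 2, 3, 4, 5, 6]
rank_ontorwr = [2, 1, 4, 3, 6, 5]
rank_task    = [2, 1, 4, 3, 5, 6]

# Spearman rho: rank correlation between each metric and the task ranking.
rho_scib, _ = spearmanr(rank_scib, rank_task)
rho_onto, _ = spearmanr(rank_ontorwr, rank_task)
```

A metric whose ranking tracks downstream-task rankings more closely (higher rho) is the more useful proxy when downstream evaluation is too expensive to run directly.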
What factors should guide the decision to use a complex foundation model versus a simpler machine learning model?
Our evaluations reveal advantages and limitations of both traditional ML models and scFMs. Yang et al. [59] emphasize that the pretraining of large-scale foundation models aims to encode biological prior knowledge and promote generalization in single-cell data analysis. It is plausible that traditional ML approaches can achieve optimal performance on simpler downstream tasks, while scFMs show better reusability and generalizability across a wider range of scenarios. Therefore, model selection is highly dependent on the use case, including dataset size, task complexity, the need for biological interpretability, and available computational resources. scFMs excel at providing a universal cell embedding space for large-scale datasets, while simpler ML models specialize in efficient adaptation to the dataset at hand under limited computational resources. If users prefer removing strong batch effects and only require a domain-specific model to perform well on the current dataset, traditional approaches are sufficient. However, if users aim to preserve biological structures, discern nuanced cell states, or require a general-purpose model that can be easily transferred to new datasets, scFMs are recommended. Furthermore, Transformer-based scFMs can provide contextual gene embeddings, cell embeddings, attention weights, gene expression predictions, and even de novo generation of scRNA-seq data, significantly extending the scope of their applications.
Are there any foundation models that demonstrate consistent superior performance across diverse application scenarios?
Our experiments indicate that no single scFM consistently outperforms others on all downstream tasks. Overall, scFoundation and UCE are the top-performing models in our benchmarking study, having the largest pretraining data and model size, respectively. Beyond the scaling law, we systematically analyze several critical design choices that influence model performance (Additional file 3: Table S4). These design trade-offs are key to explaining why certain models excel at specific tasks. For instance, UCE excels in data integration and annotation, while scFoundation performs better at cancer cell identification and drug sensitivity prediction. A possible explanation is that scFoundation utilizes absolute value projection and regression-based pretraining, making it more sensitive to cell heterogeneity within the tumor microenvironment. scCello performs particularly well on the bio-conservation score and ontology-aware classification metrics, attributable to its pretraining with cell-type coherence loss and ontology alignment loss. scGPT also ranks among the top models on batch integration, cancer cell identification, and drug sensitivity prediction. A core strength of scGPT is its accessibility, as it requires far fewer hardware resources and much less inference time than UCE and scFoundation. Furthermore, users can select an appropriate model for a specific dataset by consulting the ROGI values before any model training. For instance, in sensitivity prediction for Etoposide treatment, the cell embeddings with the lowest ROGI values come from scFoundation, LangCell, and Geneformer; in this case, users can also select LangCell or Geneformer to obtain reasonable results. Given the highly similar architectures and encoding strategies of Geneformer, LangCell, and scCello, our benchmark demonstrates that integrating biological knowledge and multi-modal pretraining effectively enhances model performance.
In summary, our findings highlight that the state-of-the-art scFMs offer more informative cellular representations, outperform traditional approaches, and hold significant potential for tackling practical challenges in the single-cell research field. Owing to the pretraining on diverse cellular states, they also pave the way for providing a universal cell embedding space and developing more versatile models applicable to a variety of downstream tasks.
Discussion
Reliable benchmarks, built upon well-designed experiments and robust evaluation metrics, are essential for steering emerging fields in the right direction. In this study, we present one of the most comprehensive benchmarks to date for foundation models in single-cell transcriptomics, addressing several critical challenges in the field. To achieve this, we began by rigorously analyzing existing benchmarks, which predominantly stressed the limited potential of scFMs. Through this analysis, we observed that many of the earlier benchmarks failed to emphasize downstream tasks that traditional methods struggle with and overlooked realistic yet challenging scenarios that could thoroughly test a model’s capabilities. In addition to curating appropriate benchmark tasks, it is vital to adopt diverse metrics—spanning unsupervised, supervised, and knowledge-based approaches—to mitigate biases inherent in individual measures. Guided by these observations, our benchmark introduces four key contributions:
Broad evaluation of single-cell foundation models (scFMs): We evaluate the utility of zero-shot scFM embeddings across diverse downstream tasks by freezing the scFMs and training task-specific modules only. This approach facilitates a direct evaluation of the quality of pretraining and enables the parallel comparison of six scFMs on every task, even for models not explicitly designed for those tasks.
Comprehensive benchmarking tasks: Beyond standard single-cell data analysis tasks, we focus on complex, clinically relevant scenarios, including identifying novel cell types, distinguishing disease cell states, and analyzing cellular heterogeneity in response to drug treatments.
Mitigation of data leakage: Previous benchmarks often risk inadvertent data leakage since the evaluation cells may appear in the pretraining datasets of certain scFMs. To address this, we introduce a new benchmarking dataset, AIDA v2, released after all the benchmarked scFMs, thereby ensuring no overlap with their pretraining data. Additionally, we design tasks like drug sensitivity prediction that reduce this risk by ensuring the drug modality is excluded from pretraining.
Biologically-aware evaluation metrics: Recognizing variability in results based on selected metrics, we incorporate cell ontology-informed measures alongside traditional metrics to provide a holistic assessment of model performance with some innovative perspectives.
Indeed, our benchmark reveals novel insights, such as the ability of well-pretrained scFMs to provide zero-shot embeddings capturing multiple dimensions of biologically meaningful attributes. At the gene level, scFMs effectively learn gene functions and expression specificity across diverse cellular contexts. According to our proposed cell ontology-informed metrics, certain scFMs align strongly with prior biological knowledge, correlating with superior downstream task performance. The ROGI approach provides a deeper understanding of the differences in cell representation spaces learned by scFMs, considering specific application scenarios. Our analysis with ROGI confirms that pretrained foundation models offer smoother embedding-to-objective landscapes, enhancing their effectiveness for downstream machine learning applications. Compared to the contemporary benchmarking study from Liu et al. [15], we systematically explore the utility of scFMs in practical biological research settings and address fundamentally different yet complementary questions, thus making distinct and significant contributions to the field (Additional file 1: Note S4).
These findings underscore the potential of scFMs as powerful tools for constructing contextualized, general-purpose, zero-shot representations of individual cells, and advancing single-cell research. However, scFMs do not consistently outperform simpler baseline models, and no single model excels across all scenarios. Model selection should therefore be informed by dataset characteristics and specific application needs. For instance, UCE and scFoundation are top performers in most cases, while scGPT is suitable for resource-limited settings. For specialized applications, ROGI can guide model recommendations based on dataset-dependent factors.
Despite their promise, current approaches face several limitations. Traditional methods like scVI, while efficient for dataset-specific training with batch labels, often suffer from overcorrection, distorting intrinsic biological structures. scVI also has limited reusability and is sensitive to gene selection, making it less flexible for new datasets. In contrast, scFMs provide an out-of-the-box solution, requiring only standardized scRNA-seq data without gene filtering or manual integration. These models integrate seamlessly into downstream workflows, eliminating the need for dataset-specific training.
However, scFMs require substantial computational resources, limiting their accessibility. Furthermore, when data distributions deviate significantly from pretraining datasets, zero-shot scFM embeddings may underperform, and fine-tuning in resource-constrained settings remains challenging. Lightweight models like scVI, trained directly on target datasets, may offer more practical solutions in such cases. Another limitation of current scFMs is the reliance on simplistic pretraining tasks, which may fail to capture biologically meaningful representations. The success of scCello suggests promising directions for biologically informed pretraining. Furthermore, the use of predefined gene vocabularies during pretraining limits the cross-species generalization of scFMs, and protein-based representations, as adopted by UCE, cannot adequately capture non-coding RNAs (ncRNAs).
Finally, our findings also suggest opportunities for enhancing the utility and generalizability of scFMs. Strategies such as leveraging multiple scFMs through ensemble methods could improve prediction accuracy. Integration of multi-omics data represents another promising avenue; among the evaluated models, only scGPT supports additional modalities like scATAC-seq and spatial transcriptomics. Developing unified cell embeddings for both bulk and single-cell data could enable applications such as identifying phenotype-associated subpopulations [60], predicting perturbation responses [61], and improving drug sensitivity predictions [32]. Future work in the field should also aim to develop more biologically informed pretraining tasks to improve model representations, create unified embeddings for both bulk and single-cell data, and address the significant computational challenges to improve the accessibility and generalizability of these powerful models.
Conclusions
In summary, our work establishes a comprehensive benchmark for single-cell foundation models, systematically evaluates the zero-shot capabilities of six scFMs across diverse and clinically relevant tasks, addresses the critical issue of data leakage with an unbiased dataset, and proposes new cell ontology-informed metrics for a more holistic assessment. Moreover, our analyses illuminate the strengths and weaknesses of current scFMs and provide a practical framework for model selection based on specific application needs. Our work bridges the gap between theoretical advancements and practical applications of scFMs, offering novel and valuable insights for both benchmarking and method development in the single-cell community.
Methods
Benchmarking models
Baselines
HVGs
In this study, we standardized HVG selection to 2000 highly variable genes with a predefined batch key using the Scanpy [62] Python package. Crucially, HVG selection is also part of the data processing protocols of scVI, Harmony, and scGPT.
scVI
scVI [18] is a variational autoencoder (VAE)-based generative model, which is widely used in single-cell transcriptomics data analysis. We adopted 2 layers with a latent space dimension of 30 for model training, and the other parameters remained the default settings in scvi-tools (version 0.16.4) [63].
Harmony
The Harmony [17] algorithm initializes all datasets in PCA space along with the batch variable and alternately iterates over two complementary steps until convergence. Harmony was applied using its Python package harmonypy (version 0.0.9) on the top 2000 HVGs after PCA (n = 50) transformation.
Seurat v5
Seurat [16] provides anchor-based integration workflows that identify shared canonical correlation vectors between datasets and project them into a common low-dimensional space to align cell populations across batches. We used the R package reticulate (version 1.34.0) to create a Seurat object from an AnnData object and performed canonical correlation analysis (CCA) integration following the tutorial at https://satijalab.org/seurat/articles/seurat5_integration via its official R package Seurat (version 5.0.0).
Logistic regression
Following the settings in OnClass [45], Logistic regression (LR) was implemented as a one-vs-rest classifier with L2 regularization, using the top 2000 HVGs as input. The LR implementation was accelerated with the cuML library.
Single-cell foundation models
Geneformer
Geneformer [7] is a scFM pretrained on the non-disease Genecorpus-30M dataset. Geneformer adopts an ordering-based value embedding approach, which normalizes each gene's expression value across the entire Genecorpus-30M dataset and ranks the genes in descending order, taking the 2048 top-ranked genes as model input. Two checkpoints, with 6 and 12 layers, have been released; we utilized the larger version for evaluation.
scGPT
scGPT [2] is a scFM capable of processing multiple single-cell omics data. The scGPT model takes 1200 HVGs as input and proposes a cell-dependent value binning technique to encode gene expression levels. The authors release a series of scGPT models with varied pretraining data; the model used for benchmarking here is scGPT-human, pretrained on 33 M normal human cells. Notably, the scGPT model employs condition tokens to represent meta-information, such as modality, batch, and perturbation conditions. The authors also explore the design of task-specific fine-tuning objectives for better adaptation to downstream tasks.
Universal cell embedding (UCE)
UCE [8] is a cross-species scFM that takes advantage of a pretrained protein language model (ESM-2 [104]) to encode gene tokens, allowing it to handle any set of genes never seen during pretraining. Furthermore, the gene tokens are sampled with a probability proportional to the expression values and sorted by their chromosomal locations. UCE is the largest scFM evaluated in this study, containing 33 Transformer layers with 650 M parameters in total.
scFoundation
scFoundation [4] utilizes the asymmetric encoder-decoder architecture xTrimoGene [64] to deal with non-zero and zero expression values of 19,264 protein-coding genes and employs a value projection approach to convert continuous expression values to learnable value embeddings without hard discretization. The scFoundation model is pretrained on 50 M single cells based on the read-depth-aware pretraining task with additional tokens of target total counts (T) and source total counts (S), which aims to mitigate the influence of high variance in sequencing read depths. We used the concatenated 3072-dimensional cell embeddings for benchmarking analysis, which consists of 4 parts: (1) mean-pooling of gene embeddings; (2) max-pooling of gene embeddings; (3) the context embedding of the indicator S; (4) the context embedding of the indicator T.
LangCell
LangCell [9] is the first scFM that employs the language-cell pretraining framework, which consists of a cell encoder and a text encoder initialized from Geneformer and PubMedBERT [65], respectively. LangCell adopts the same model architecture and gene embedding strategy as Geneformer with an additional [CLS] token. The authors construct a cell-text dataset, scLibrary, consisting of 27.5 M paired data, and the pretraining tasks are expanded with intra-modality and inter-modality contrastive learning objectives.
scCello
scCello [10] is a cell ontology-guided scFM pretrained on 22 M scRNA-seq data with cell type labels mapped to the cell ontology graph. It also adopts the ordering-based gene value encoding strategy and introduces cell-type coherence loss and ontology alignment loss to learn cell-type-specific and ontology-aware cell representations.
Dataset collection and processing
Datasets for gene-level tasks
As shown in Additional file 3: Table S5, we collected two datasets with GO term annotations and tissue-specific labels released by Chen et al. [23] to implement the gene function prediction experiments. Only those genes that exist in the vocabulary of all scFMs were preserved, and each dataset was split into training and test sets with a ratio of 8:2, followed by dividing 20% of the training set as the validation set. The experiments were repeated 5 times with different random seeds.
Datasets for cell-level tasks
Batch integration and cell type annotation
We collected 5 datasets with varying data size and diversity from scIB [36] and the CellxGene database [20] (Table 2). We performed standard data processing based on the raw counts saved in the “counts” layer of datasets from scIB and “adata.raw.X” of datasets from the CellxGene database. The AIDA v2 dataset was downloaded via the cellxgene_census API based on the 121 newly released donor IDs (Additional file 3: Table S6).
For the cell type annotation, we excluded non-leaf nodes to avoid conflicts between the cell types to be predicted. For instance, if the dataset contains both lymphocytes and plasma cells, we will exclude those cells annotated as “lymphocyte.” These excluded cells with coarse-grained annotations can be used as an external test set to compare model performance based on whether the predicted label is a subcategory of the ground truth label. In addition, the cell types with fewer than 10 datapoints were also removed. Finally, there remain 37 and 120 cell types in the HLCA and Tabula Sapiens datasets, respectively, among which 14 cell types are shared by the two datasets. For the AIDA v2 dataset, 22 of 32 cell types are leaf nodes in the cell ontology graph.
Following Wang et al. [45], we performed in-distribution (seen cell types) and out-of-distribution (unseen cell types) data splits for benchmarking. For in-distribution data splitting, we utilized the “StratifiedShuffleSplit” function in sklearn [66] to split data into training and test sets with a ratio of 2:8. Afterwards, the training set was further divided into 90% training data and 10% validation data. The in-distribution data splits were first used for intra-dataset validation. Subsequently, the model trained on intra-dataset validation was directly applied to the cross-dataset validation, with all cells belonging to the 14 overlapping cell types as test data. For OOD data splitting, a specified proportion of cell types was exclusively used as the test set, and the cells from the remaining cell types were split into training and test sets with a ratio of 8:2. Concretely, the proportion of unseen cell types was set to 10%, 30%, 50%, 70%, and 90%. The OOD data splits were used for the identification of novel cell types.
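The in-distribution protocol above can be sketched with sklearn's StratifiedShuffleSplit, which the text names explicitly; the helper `in_distribution_split` and its argument names are our own illustrative choices, not the study's code:

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

def in_distribution_split(y, seed=0):
    """Stratified 2:8 train/test split, then a 90/10 train/validation
    split of the training portion, mirroring the protocol described above."""
    y = np.asarray(y)
    idx = np.arange(len(y)).reshape(-1, 1)
    # Outer split: 20% train, 80% test, stratified by cell type label.
    outer = StratifiedShuffleSplit(n_splits=1, train_size=0.2, random_state=seed)
    train_idx, test_idx = next(outer.split(idx, y))
    # Inner split: 90% of the training set for training, 10% for validation.
    inner = StratifiedShuffleSplit(n_splits=1, train_size=0.9, random_state=seed)
    tr, val = next(inner.split(train_idx.reshape(-1, 1), y[train_idx]))
    return train_idx[tr], train_idx[val], test_idx
```

Repeating the call with different `seed` values reproduces the multi-seed evaluation setup.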
Cancer cell identification
We constructed a benchmarking dataset for cancer cell identification, encompassing 10 datasets covering 7 cancer types from blood, brain, bone, and eye tissues, based on the TISCH2 [54] database (Additional file 3: Table S7). Each dataset is sourced from primary tumors of patients and contains both normal cells and malignant cells. The gene expression profiles downloaded from the database have already been scaled to 1e4 in total counts and then log-transformed. It is worth noting that the UCE model requires raw counts to calculate the gene sampling probabilities; thus, the cell embeddings extracted for the TISCH dataset may be affected since the raw counts are unavailable. For the Geneformer and LangCell models, we modified the original code to reverse the log1p transformation and then applied the division of gene normalization factors. The scFoundation model can take preprocessed scRNA-seq data as input by setting the parameter “pre_normalized = T”. As for the scGPT model, the normalization and transformation process have no influence on the value binning result.
The cancer cell annotations in the TISCH2 database are of high quality since the authors have combined three approaches to separate the cells into normal clusters and malignant clusters: (1) take the cell-type annotations provided by the original studies; (2) check the expression distribution of malignant cell markers; (3) predict cell malignancy based on the copy number alterations inferred by InferCNV [67].
Drug sensitivity prediction
In the original study of scFoundation, the authors conducted drug sensitivity prediction focusing on 4 drugs with bulk-level (source domain) and single-cell-level (target domain) sensitivity labels from SCAD, namely Sorafenib, NVP-TAE684, PLX4720, and Etoposide. We followed the data processing workflow and experimental settings provided in the repository of scFoundation. Detailed information on the processed datasets is reported in Additional file 3: Table S8. Both the source and target domain data are split into five folds for cross-validation. We used the normalized gene expression data as input to extract zero-shot cell embeddings from scFMs. For scFoundation, the source total counts (S) and target total counts (T) for bulk RNA-seq are both set to the sum of gene expression values, while T is set to 10,000 for scRNA-seq data.
Dataset processing and feature extraction
The dataset processing was implemented with the Scanpy [62] package. Following the default settings in the UCE repository, quality control filtered out cells expressing fewer than 25 genes and genes detected in fewer than 10 cells. Afterwards, we performed total count normalization (target_sum = 1e4) and log1p transformation on the raw count data.
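For illustration, the normalization step above can be written as a numpy-only sketch mirroring what `sc.pp.normalize_total` followed by `sc.pp.log1p` compute on dense data; `normalize_log1p` is our hypothetical helper, not part of Scanpy:

```python
import numpy as np

def normalize_log1p(counts, target_sum=1e4):
    """Total-count normalization to `target_sum` per cell, then log1p.
    Mirrors sc.pp.normalize_total(target_sum=1e4) + sc.pp.log1p on a
    dense cells-by-genes count matrix."""
    counts = np.asarray(counts, dtype=float)
    lib_size = counts.sum(axis=1, keepdims=True)  # per-cell library size
    return np.log1p(counts / lib_size * target_sum)
```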
For gene-level tasks, we directly used the ESM-2 protein embeddings for UCE and the pretrained gene token embeddings in the input layer of other scFMs, which are independent of their Transformer module. The gene symbols were unified by the HUGO Gene Nomenclature Committee (HGNC) [68].
For cell-level tasks, we extracted zero-shot contextualized cell embeddings from scFMs without any additional training. We introduced a “data_is_raw” hyperparameter to distinguish whether the gene expression matrix has already been processed. The scVI model was trained for each individual dataset with the given batch key. For the cross-dataset evaluation, we adopted the scVI surgery pipeline, which involves training a scVI model on a reference dataset (source dataset), fine-tuning it with a query dataset (target dataset), and then using the fine-tuned model to obtain a unified latent representation of both datasets for downstream analysis.
Training details for downstream tasks
For our benchmarking experiments, we employ a zero-shot evaluation protocol, where the pretrained scFMs are used as frozen feature extractors to generate static cell embeddings. These embeddings then serve as the input features for separate, state-of-the-art downstream models tailored to specific tasks. This approach allows us to synergize the powerful representations from foundation models with the optimized architectures of established task-specific tools. Crucially, by keeping the downstream model and training protocol constant while only varying the input embeddings, this framework ensures that any observed performance differences are directly attributable to the intrinsic quality of the representations provided by each scFM. This enables a fair and direct comparison of their representational power. The task-specific models we employed are listed below:
OnClass
OnClass [45] is a cell type annotation approach that embeds cells and cell types into a common low-dimensional space. Leveraging the cell type relationships defined in the cell ontology graph, OnClass is capable of predicting novel cell types not present in the training data. For our implementation, we followed the original study's experimental setting, which uses a bilinear neural network to model the interaction between cell embeddings and cell type embeddings. Specifically, the model first requires a fixed cell type embedding matrix $Y \in \mathbb{R}^{n_t \times d}$, which represents the entire cell ontology structure with $n_t$ ontology terms. This matrix is pre-computed by embedding the cell ontology graph into a $d$-dimensional space using methods like clusDCA [69] and singular value decomposition (SVD) [70], and is subsequently concatenated with a diagonal matrix that serves as a bias term. To align with this semantic space, the $m$-dimensional input cell features (i.e., the zero-shot embeddings from scFMs) are passed through a trainable two-layer neural network with weight matrices $W_1$ and $W_2$ and ReLU activation functions. The final prediction score for a cell belonging to a specific cell type is calculated via the dot product of this transformed cell embedding and $Y$. During training, the model is optimized using a cross-entropy loss, where only the weights of the transformation network ($W_1$ and $W_2$) are updated. For hyperparameter tuning, we performed a grid search for the optimal learning rate and weight decay under the intra-dataset validation scenario (Additional file 2: Fig. S25). All experiments were repeated five times with different random seeds for data splitting and model initialization to ensure the robustness of our results.
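The scoring described above (transform the cell embedding, then take dot products with every ontology term embedding) can be sketched in a few lines of numpy. This forward pass omits training and the diagonal bias-term concatenation; the function name and shapes are illustrative assumptions, not OnClass's actual API:

```python
import numpy as np

def onclass_scores(X, W1, W2, Y):
    """Sketch of the bilinear scoring step.
    X:  (n_cells, m)    zero-shot cell embeddings from a scFM
    W1: (m, h), W2: (h, d)  trainable transformation network
    Y:  (n_terms, d)    fixed cell ontology term embeddings
    Returns a (n_cells, n_terms) matrix of prediction scores."""
    hidden = np.maximum(X @ W1, 0.0)  # first layer with ReLU activation
    z = hidden @ W2                   # project cells into the ontology space
    return z @ Y.T                    # dot product with each term embedding

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
W1, W2 = rng.normal(size=(8, 16)), rng.normal(size=(16, 4))
Y = rng.normal(size=(10, 4))
scores = onclass_scores(X, W1, W2, Y)  # one score per (cell, ontology term)
```

Because scores are computed against every ontology term, cells can be assigned to terms absent from the training labels, which is what enables novel cell type prediction.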
Cancer-Finder
Cancer-Finder [33] is a scalable framework for cancer cell annotation across multiple tissues (domains). The model architecture consists of a 3-layer feature extractor and a binary classification layer, which are shared across all domains. The model is trained using a risk extrapolation objective to enhance its generalization capabilities. This technique minimizes a composite loss function, defined as a weighted sum of the average risk (the mean cross-entropy loss across all domains) and the variance risk (the variance of these losses). To rigorously assess the generalization ability of the input embeddings, we implemented a leave-one-tissue-out cross-validation scheme. Following the original study, we employed a balanced sampling strategy to achieve a 1:1 ratio of normal and abnormal cells in each tissue for training, while preserving all cells for evaluation. We utilized the Optuna [71] package to perform 20 rounds of optimization over four hyperparameters, maximizing the average AUROC across 4 tissues (Additional file 3: Table S9). Based on the optimal hyperparameters, the final model was then trained for up to 200 epochs with an early-stopping criterion to prevent overfitting.
SCAD
SCAD [32] is a transfer learning framework that utilizes bulk RNA-seq data (source domain) to improve single-cell (target domain) drug sensitivity prediction. The SCAD model consists of three key components: a shared feature extractor, a domain discriminator, and a drug response predictor. The overall training objective is a weighted sum of a binary cross-entropy loss for drug response prediction (on the source domain) and an adversarial loss from the domain discriminator. In our study, we employed SCAD as a downstream model to benchmark the quality of different cellular representations. Following the protocol of the original study, we implemented a weighted random sampler during the training process. For each drug-specific model, we performed extensive hyperparameter optimization using the Optuna [71] package, searching a 7-parameter space over 50 iterations to maximize performance on the validation set (Additional file 3: Tables S10–S11). The final results reported in our manuscript are the average model performance across the five test folds of the target single-cell data.
Cell ontology-informed metrics
The extension of scGraph
Recently, a scGraph metric has been introduced to overcome the limitation of the scIB metrics [40], which neglect the inherent biological structures between various cell types. To ensure the applicability of the method, the authors created the batch-specific affinity graphs based on PCA loadings (or raw counts) and then aggregated them into a consensus reference graph. This approach eliminates the need to annotate the dataset using standard cell ontology terms or to include all cell types within a single batch. However, the reference graph is often incomplete, asymmetric, and can be very sensitive to the given dataset and the batch key used, which may lead to inconsistency with expert knowledge (Additional file 2: Fig. S16).
For comprehensive benchmarking in our study, we prefer to design a more robust reference graph in line with our prior knowledge in cell biology. To this end, we proposed the scGraph-OntoRWR metric, which uses the expert-defined cell ontology graph to construct the reference graph for comparison. Specifically, the scGraph-OntoRWR score is calculated as follows:
Step 1: Constructing the embedding-derived graph $G_{emb}$. Given $N$ cells with embeddings of shape $N \times d$, and their corresponding cell type labels belonging to $C$ categories, for each cell type we remove the most extreme 20% of cells (to mitigate outliers) and compute the mean vector of the remaining cells as the cell type centroid. We calculate the pairwise distances between all cell type centroids to construct the embedding-derived cell type graph $G_{emb}$.
Step 2: Constructing the ontology-based reference graph $G_{ref}$. The Cell Ontology 2016 [42] provides detailed definitions and hierarchical relationships for over 2700 cell types. We construct the cell ontology graph with text-similarity-weighted edges connecting cell types sharing the “is_a” relationship. We first apply a Random Walk with Restart (RWR) algorithm on the ontology graph to obtain a smooth similarity matrix capturing hierarchical and semantic relationships between all ontology terms (Additional file 1: Note S5). Then, we extract the subgraph corresponding to the cell types present in the dataset, yielding the reference graph $G_{ref}$.
Step 3: Computing the scGraph-OntoRWR score. The scGraph-OntoRWR score is computed as described in Eq. (1):
$$\mathrm{scGraph\text{-}OntoRWR} = \frac{1}{C} \sum_{i=1}^{C} \mathrm{PCC}\!\left(G_{emb}^{(i)}, G_{ref}^{(i)}\right) \tag{1}$$

where $G_{emb}^{(i)}$ and $G_{ref}^{(i)}$ denote the $i$-th row of the embedding-derived graph and the ontology-based reference graph, respectively. The PCC between $G_{emb}^{(i)}$ and $G_{ref}^{(i)}$ is calculated for each cell type and then averaged to obtain the final score.
We conducted experiments on the HLCA and Tabula Sapiens datasets since they already have cell ontology annotations. For those datasets without standard annotations, we can map the free annotations (cell type annotations using free text) to standard cell ontology terms based on the matching algorithm proposed by Wang et al. [45].
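The three steps above can be sketched in numpy. The reference graph is assumed to be precomputed (Step 2); trimming by distance to the per-type mean is one plausible reading of "removing the most extreme 20% of cells", and `scgraph_ontorwr` is a hypothetical helper, not the study's implementation:

```python
import numpy as np

def scgraph_ontorwr(emb, labels, ref):
    """Sketch of Eq. (1): trimmed cell-type centroids -> pairwise-distance
    graph -> row-wise Pearson correlation against a precomputed reference
    graph `ref` of shape (C, C)."""
    types = np.unique(labels)
    centroids = []
    for t in types:
        X = emb[labels == t]
        mu = X.mean(axis=0)
        d = np.linalg.norm(X - mu, axis=1)
        X = X[d <= np.quantile(d, 0.8)]   # drop the 20% most extreme cells
        centroids.append(X.mean(axis=0))
    C = np.asarray(centroids)
    # Embedding-derived graph: pairwise centroid distances.
    G = np.linalg.norm(C[:, None, :] - C[None, :, :], axis=-1)
    # Per-cell-type PCC between matching rows, averaged (Eq. 1).
    pccs = [np.corrcoef(G[i], ref[i])[0, 1] for i in range(len(types))]
    return float(np.mean(pccs))
```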
Lowest Common Ancestor Distance (LCAD)
LCAD is a metric designed to evaluate the severity of error (SOE), which measures the hierarchical distance between ground-truth labels and predicted labels in a predefined class hierarchy [72]. Considering that the intrinsic biological structures of cell types have already been defined as the cell ontology graph, the LCAD metric can be naturally applied to estimate the SOE in cell type annotation. Recently, it has been verified that the LCAD on in-distribution data has a strong linear correlation with OOD performance on both visual models and visual-language models [72], suggesting that LCAD is a promising indicator of OOD generalization performance. We apply the LCAD metric in the context of single-cell annotation, a setting where it has not been previously explored, to offer a complementary view of model evaluation and shed light on generalization performance under distribution shift. The calculation of LCAD is illustrated in Additional file 2: Fig. S26.
Standard benchmarking metrics
For the batch integration benchmark, we selected the scIB metrics used in the original study of scGPT, which can be grouped into two categories: (1) effect of batch correction: ASWbatch and Graph connectivity; (2) effect of biological variance conservation: NMIcell, ASWcell, and ARIcell. The AvgBatch and AvgBio scores were calculated by averaging the metrics within each category, and the overall score is a weighted mean of the two. We used the functions provided in the scib [36] Python package to calculate the above metrics. The values of these metrics range from 0 to 1, with higher values indicating better alignment between the predicted clustering results and the ground-truth domain labels, such as batch labels and cell type labels. Gene function prediction and cell type annotation are multiclass classification tasks, which were evaluated with top-1 accuracy (Accuracy@1) and macro-F1. For OOD scenarios, we evaluated the model's ability to accurately annotate cell types under a given false positive rate (FPR), where a false positive refers to classifying unknown cells as known cells, termed Accuracy@FPR. For binary classification tasks, including novel cell identification, cancer cell annotation, and drug sensitivity prediction, the evaluations were performed with the area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPRC).
Normalized Mutual Information (NMI)
NMI is used to measure the similarity between the predicted clustering result (P) and the ground-truth clustering result (T), which can be formulated as:
$$\mathrm{NMI}(P, T) = \frac{2\,\mathrm{MI}(P, T)}{H(P) + H(T)} \tag{2}$$

where $\mathrm{MI}(P, T)$ is the mutual information between the two clusterings, and $H(P)$ and $H(T)$ indicate their entropy. The NMI score serves as a target function to be maximized for the selection of the optimal resolution to perform optimized Louvain clustering.
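In practice this is `sklearn.metrics.normalized_mutual_info_score`; as an illustration, a numpy-only sketch assuming the arithmetic-mean normalization (sklearn's default) computes it from the contingency table — `nmi` is a hypothetical helper:

```python
import numpy as np

def nmi(p, t):
    """NMI with arithmetic-mean normalization: 2*MI(P,T) / (H(P)+H(T))."""
    p, t = np.asarray(p), np.asarray(t)
    n = len(p)
    pu, pi = np.unique(p, return_inverse=True)
    tu, ti = np.unique(t, return_inverse=True)
    cont = np.zeros((len(pu), len(tu)))
    np.add.at(cont, (pi, ti), 1)          # contingency table of co-occurrences
    pij = cont / n                         # joint distribution
    pm, tm = pij.sum(1, keepdims=True), pij.sum(0, keepdims=True)
    nz = pij > 0
    mi = (pij[nz] * np.log(pij[nz] / (pm @ tm)[nz])).sum()
    ent = lambda q: -(q[q > 0] * np.log(q[q > 0])).sum()
    return 2 * mi / (ent(pm.ravel()) + ent(tm.ravel()))
```

NMI is invariant to label permutations, so relabeled but identical clusterings score 1.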
Adjusted Rand Index (ARI)
The ARI metric also quantifies the similarity between two clustering results, which is built upon the rand index (RI) and adjusted by the expected score of random clustering ($\mathbb{E}[\mathrm{RI}]$). ARI is calculated as:

$$\mathrm{RI} = \frac{TP + TN}{TP + TN + FP + FN} \tag{3}$$

$$\mathrm{ARI} = \frac{\mathrm{RI} - \mathbb{E}[\mathrm{RI}]}{\max(\mathrm{RI}) - \mathbb{E}[\mathrm{RI}]} \tag{4}$$

where $TP$, $TN$, $FP$, $FN$ indicate true positive, true negative, false positive, and false negative, respectively. Here, the definition of positive and negative refers to a pair of cells that exist in the same or different clusters.
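In practice this is `sklearn.metrics.adjusted_rand_score`; for illustration, the standard pair-counting form of the adjustment can be sketched as follows (`ari` is a hypothetical helper):

```python
import numpy as np
from math import comb

def ari(p, t):
    """Pair-counting ARI: rand index adjusted by its expected value
    under random clustering with the same marginals."""
    p, t = np.asarray(p), np.asarray(t)
    pu, pi = np.unique(p, return_inverse=True)
    tu, ti = np.unique(t, return_inverse=True)
    cont = np.zeros((len(pu), len(tu)), dtype=int)
    np.add.at(cont, (pi, ti), 1)
    sum_ij = sum(comb(int(nij), 2) for nij in cont.ravel())  # pairs together in both
    sum_a = sum(comb(int(a), 2) for a in cont.sum(1))        # pairs together in P
    sum_b = sum(comb(int(b), 2) for b in cont.sum(0))        # pairs together in T
    expected = sum_a * sum_b / comb(len(p), 2)
    max_index = 0.5 * (sum_a + sum_b)
    return (sum_ij - expected) / (max_index - expected)
```

Unlike RI, ARI is 0 in expectation for random clusterings and can be negative for clusterings worse than chance.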
Average silhouette width (ASW)
ASW is the average of the silhouette width (SW) across all cells. The SW of cell $i$ is defined as the difference between the average nearest-cluster distance ($b(i)$) and the average intra-cluster distance ($a(i)$), divided by the larger of the two:

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}} \tag{5}$$

ASWbatch and ASWcell are calculated by leveraging the batch labels and cell type labels to assign each cell to a cluster. An ideal batch integration should mix different batches while separating cell types well. To ensure that a higher value indicates better performance, the ASW scores are normalized by:

$$\mathrm{ASW}_{cell} = \frac{\mathrm{ASW}_{cell} + 1}{2} \tag{6}$$

$$\mathrm{ASW}_{batch} = 1 - \left|\mathrm{ASW}_{batch}\right| \tag{7}$$
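In practice `sklearn.metrics.silhouette_score` computes the raw ASW; the brute-force O(n²) sketch below makes Eq. (5) and the cell-label normalization of Eq. (6) explicit (`asw` is a hypothetical helper and assumes every cluster has at least two cells):

```python
import numpy as np

def asw(X, labels):
    """Mean silhouette width (Eq. 5), min-max scaled to [0, 1]
    via (ASW + 1) / 2 as in Eq. (6)."""
    X, labels = np.asarray(X, float), np.asarray(labels)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    sws = []
    for i in range(len(X)):
        same = labels == labels[i]
        a = D[i, same & (np.arange(len(X)) != i)].mean()  # intra-cluster distance
        b = min(D[i, labels == c].mean()                  # nearest other cluster
                for c in np.unique(labels) if c != labels[i])
        sws.append((b - a) / max(a, b))
    return (np.mean(sws) + 1) / 2
```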
k-nearest-neighbor (kNN) graph connectivity
The kNN graph connectivity evaluates the connectivity of the cells within each cell type in the kNN graph, which is computed as:
$$\mathrm{GC} = \frac{1}{|C|} \sum_{c \in C} \frac{|\mathrm{LCC}(G_c)|}{N_c} \tag{8}$$

where $|C|$ refers to the total number of cell types, $N_c$ is the number of cells belonging to cell type $c$, and $|\mathrm{LCC}(G_c)|$ is the largest number of cells of cell type $c$ connected in the kNN graph.
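Eq. (8) can be sketched with scipy's connected-components routine on the per-type kNN subgraphs; `graph_connectivity` is an illustrative helper and may differ in detail from the scib implementation:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def graph_connectivity(adj, labels):
    """Eq. (8): for each cell type, the fraction of its cells that lie in
    the largest connected component of its kNN subgraph, averaged over types.
    `adj` is a dense symmetric adjacency matrix of the kNN graph."""
    labels = np.asarray(labels)
    scores = []
    for c in np.unique(labels):
        mask = labels == c
        sub = csr_matrix(adj[np.ix_(mask, mask)])   # kNN subgraph of type c
        _, comp = connected_components(sub, directed=False)
        scores.append(np.bincount(comp).max() / mask.sum())
    return float(np.mean(scores))
```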
Aggregation of scIB metrics
Following ref. [2], the aggregated scores are defined as follows:
$$\mathrm{AvgBatch} = \frac{\mathrm{ASW}_{batch} + \mathrm{GraphConn}}{2} \tag{9}$$

$$\mathrm{AvgBio} = \frac{\mathrm{NMI}_{cell} + \mathrm{ARI}_{cell} + \mathrm{ASW}_{cell}}{3} \tag{10}$$

$$\mathrm{Overall} = 0.6 \times \mathrm{AvgBio} + 0.4 \times \mathrm{AvgBatch} \tag{11}$$
Accuracy@1
Accuracy@1 assesses the percentage of correctly annotated datapoints considering the top-1 predicted label. In this study, we also define a relaxed accuracy score for cell type annotation, termed as Accuracy (non-leaf), where the prediction is regarded as correct if the predicted label equals or is a subclass of the ground-truth label according to the cell ontology graph. The Accuracy@FPR metric further considers the classification uncertainty. In detail, we define a confidence score based on the model’s output logits to classify cells into “known” and “unknown”. The threshold is selected by the given FPR, which indicates the percentage of unknown cells mistakenly classified as known. Then, all the cells with a confidence score above the threshold are predicted as known, which are used to compute the consistency between predicted labels and ground-truth annotations. Finally, the accuracy score is calculated as the number of cells correctly classified with a confidence score higher than the threshold, divided by the total number of known cells.
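The Accuracy@FPR procedure described above can be sketched in numpy. Selecting the threshold as the $(1-\mathrm{FPR})$ quantile of the unknown cells' confidence scores is one concrete reading of "selected by the given FPR"; `accuracy_at_fpr` is a hypothetical helper:

```python
import numpy as np

def accuracy_at_fpr(conf, pred, truth, known_mask, fpr=0.05):
    """Accuracy@FPR: pick the confidence threshold so that a fraction
    `fpr` of unknown cells is mistakenly called known, then count known
    cells that are both above the threshold and correctly predicted,
    divided by the total number of known cells."""
    conf, pred, truth = map(np.asarray, (conf, pred, truth))
    known_mask = np.asarray(known_mask)
    thr = np.quantile(conf[~known_mask], 1 - fpr)  # FPR-calibrated threshold
    hits = (conf > thr) & (pred == truth) & known_mask
    return hits.sum() / known_mask.sum()
```

Note that known cells rejected by the threshold count against accuracy even if their predicted label is correct.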
Macro-F1
The macro-F1 metric provides an aggregated estimation of the F1 score across all classes by treating each class equally, which is more suitable for imbalanced datasets. The F1 score is calculated as:
$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{12}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{13}$$

$$\mathrm{F1} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{14}$$
Confidence score estimation
The task of identifying novel cell types can be formulated as an OOD detection problem. To distinguish in-distribution and OOD samples, we employed two confidence scores directly based on the output logits without additional training. The cells with a confidence score lower than the manually selected threshold value were classified as unknown.
SoftMax-based confidence score
The SoftMax-based confidence score is defined as the maximum predicted class probability from the SoftMax distribution [73]:
$$S_{\mathrm{SoftMax}}(x) = \max_{i} \frac{e^{f_i(x)/T}}{\sum_{j=1}^{K} e^{f_j(x)/T}} \tag{15}$$

where $K$ refers to the output dimension of the classifier, $f_i(x)$ is the logit for the class $i$, and $T$ is the temperature parameter.
Energy-based confidence score
A significant problem of the SoftMax-based confidence score is that it assigns arbitrarily high values to OOD samples [50]. To mitigate this problem, Liu et al. [50] proposed an energy function formulated as below:
$$E(x) = -T \log \sum_{i=1}^{K} e^{f_i(x)/T} \tag{16}$$
Then, the confidence score is defined as the negative energy:
$$S_{\mathrm{Energy}}(x) = -E(x) \tag{17}$$
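The negative energy of Eqs. (16)-(17) is simply a temperature-scaled log-sum-exp of the logits; a numerically stable numpy sketch (the helper name `energy_score` is ours):

```python
import numpy as np

def energy_score(logits, T=1.0):
    """Negative free energy -E(x) = T * logsumexp(f(x) / T), computed with
    the max-shift trick so large logits do not overflow exp()."""
    z = np.asarray(logits, float) / T
    m = z.max()
    return T * (m + np.log(np.exp(z - m).sum()))
```

Unlike the SoftMax score, which is bounded in (0, 1], the energy score grows with the magnitude of the dominant logit, making it harder for OOD samples to receive arbitrarily high confidence.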
Roughness index
To interpret the experimental results and provide guidance for model selection, we extended the roughness index (ROGI) [21], proposed for quantifying the roughness of molecular property landscapes, to the single-cell domain. ROGI can accommodate various types of representations and tasks, which allows us to compute landscape roughness on pretrained single-cell representations and perform an in-depth analysis of the relationship between zero-shot scFM embeddings and model performance on downstream tasks.
To compute ROGI, the embedding space is first clustered using varying distance thresholds ($t$) ranging from 0 to 1. For each threshold, the weighted standard deviation ($\sigma_t$) of the property labels of the cluster prototypes is calculated. As the distance threshold increases, $\sigma_t$ decreases monotonically, ultimately reaching zero when all embeddings merge into a single cluster. If similar molecules share similar labels, $\sigma_t$ decreases more slowly, reflecting a smoother property landscape. Formally, ROGI is calculated as below:

$$\mathrm{ROGI} = 2 \int_{0}^{1} \left(\sigma_0 - \sigma_t\right) \mathrm{d}t \tag{18}$$

where $\sigma_0$ denotes the weighted standard deviation at threshold $t = 0$, i.e., the standard deviation of the full set of labels.
A schematic and detailed description of the ROGI calculation is provided in Additional file 1: Note S6 and Additional file 2: Fig. S27.
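A numerical sketch of Eq. (18) follows. Approximating the threshold clustering with scipy complete-linkage clustering over max-normalized distances is our illustrative assumption, not necessarily the exact scheme of ref. [21]; `rogi` is a hypothetical helper:

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

def rogi(X, y, n_thresholds=50):
    """Sketch of Eq. (18): cluster at increasing normalized distance
    thresholds, track the cluster-size-weighted std of prototype labels,
    and integrate sigma_0 - sigma_t over t in [0, 1]."""
    y = np.asarray(y, float)
    d = pdist(np.asarray(X, float))
    Z = linkage(d / d.max(), method="complete")   # merge heights lie in [0, 1]
    ts = np.linspace(0.0, 1.0, n_thresholds)
    sigmas = []
    for t in ts:
        lab = fcluster(Z, t=t, criterion="distance")
        cl = np.unique(lab)
        means = np.array([y[lab == c].mean() for c in cl])   # prototype labels
        w = np.array([(lab == c).sum() for c in cl]) / len(y)
        mu = (w * means).sum()
        sigmas.append(np.sqrt((w * (means - mu) ** 2).sum()))
    sigmas = np.array(sigmas)
    f = sigmas[0] - sigmas          # sigma_0 is the full-label std (t = 0)
    # Trapezoidal integration of f over ts, times 2 (Eq. 18).
    return float(2.0 * (0.5 * (f[:-1] + f[1:]) * np.diff(ts)).sum())
```

By the law of total variance, $\sigma_t \le \sigma_0$, so the integrand is non-negative and a perfectly flat landscape (constant labels) yields ROGI = 0.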
Non-dominated sorting algorithm
Inspired by Stable Diffusion 3 [74], we employed the non-dominated sorting algorithm to provide a Pareto optimal ranking that combines multiple evaluation metrics. For the batch integration task using scIB metrics, we iteratively identified the variants that are Pareto optimal based on the AvgBio and AvgBatch scores, assigned them the current iteration index, removed them from the pool, and repeated the process until all variants were ranked. The metrics used to rank models for each task are listed in Additional file 3: Table S12. To obtain the task-level ranking, we summed the model rankings over all datasets and scenarios within each task, and sorted the aggregate scores in ascending order to get the final ranking (the lower ranking score is better). Similarly, the overall ranking was an aggregation of task-level rankings.
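The iterative front extraction described above can be sketched as follows, assuming all metrics are oriented so that higher is better (true for AvgBio and AvgBatch); `pareto_ranks` is an illustrative helper:

```python
import numpy as np

def pareto_ranks(scores):
    """Iterative non-dominated sorting. A point is dominated if some other
    remaining point is >= in every metric and > in at least one. Front 1
    gets rank 1, is removed, and the process repeats (lower rank is better)."""
    S = np.asarray(scores, float)
    ranks = np.zeros(len(S), dtype=int)
    remaining = np.arange(len(S))
    front = 1
    while remaining.size:
        R = S[remaining]
        dominated = np.array([
            any((R[j] >= R[i]).all() and (R[j] > R[i]).any()
                for j in range(len(R)) if j != i)
            for i in range(len(R))
        ])
        ranks[remaining[~dominated]] = front   # current Pareto front
        remaining = remaining[dominated]        # keep only dominated points
        front += 1
    return ranks
```

Summing such ranks across datasets and scenarios, as described above, then yields the task-level and overall orderings.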
Supplementary Information
Additional file 1: Supplementary Notes. Note S1-S6
Additional file 2: Supplementary Figures. Fig. S1-S27
Additional file 3: Supplementary Tables. Table S1-S13
Acknowledgements
We acknowledge the anonymous reviewers who provided valuable input that enhanced the clarity and scientific quality of the manuscript.
Code availability
The source code and implementation details of our benchmarking framework, implemented in Python, are publicly accessible on GitHub (https://github.com/wujialu/scFM-Bench/) [102] and Zenodo (10.5281/zenodo.17062098) [103] under the MIT License. We also provide tutorials for each downstream task (Additional file 3: Table S13).
Peer review information
Zhana Duren and Claudia Feng were the primary editors of this article and managed its editorial process and peer review in collaboration with the rest of the editorial team. The peer-review history is available in the online version of this article.
Authors’ contributions
JLW, QY, C-YH and TJH conceived the idea of the project and designed research. JLW, YLW and RLH implemented the algorithms and performed the benchmarking studies. MZY and TYW contributed to data analysis. JLW and QY wrote the first draft of the manuscript. YHZ and JKW suggested improvement and corrected the manuscript. C-YH and TJH supervised the project and provided overall guidance. All authors read and approved the final version of the manuscript.
Funding
This work has been supported in part by the “Pioneer” and “Leading Goose” R&D Program of Zhejiang (2025C01117), and the National Natural Science Foundation of China (22303081).
Data availability
All datasets used in this manuscript are publicly available. The gene function prediction datasets are available from https://github.com/chenhcs/FRoGS, as described in [23]. The Pancreas dataset and Immune (human) dataset were downloaded from Figshare, as described in [36]. The source data are available at Gene Expression Omnibus (GEO, https://www.ncbi.nlm.nih.gov/geo/): (1) Pancreas: GSE81076, GSE85241, GSE86469, GSE84133, GSE81608; (2) Immune: GSE120221, GSE107727, GSE115189, GSE128066, GSE94820. The Human Lung Cell Atlas (HLCA core set), the Tabula Sapiens dataset, and the Asian Immune Diversity Atlas (AIDA) v2 were downloaded from CELLxGENE (cellxgene.cziscience.com). The datasets used for cancer cell identification were downloaded from the TISCH2 curation (http://tisch.comp-genomics.org/), containing 10 datasets collected from GEO: (1) Blood: GSE142213, GSE132509, GSE116256; (2) Brain: GSE131928_10X, GSE138794, GSE139448, GSE141982, GSE119926; (3) Bone: GSE117156; (4) Eye: GSE139829. The drug sensitivity prediction datasets were downloaded from Figshare, as described in [32]: the bulk-level data was obtained from Genomics of Drug Sensitivity in Cancer (GDSC), with processed data available at https://ibm.ent.box.com/v/paccmann-pytoda-data/folder/91702227426; the single-cell data was obtained from GEO (GSE149215, GSE108383) and the Broad Institute's single-cell portal (SCP542). The Cell Ontology data is downloaded from the OBO Foundry (https://obofoundry.org/ontology/cl.html) [75–105].
Declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Contributor Information
Chang-Yu Hsieh, Email: kimhsieh@zju.edu.cn.
Tingjun Hou, Email: tingjunhou@zju.edu.cn.
References
- 1. Yang F, Wang W, Wang F, Fang Y, Tang D, Huang J, et al. scBERT as a large-scale pretrained deep language model for cell type annotation of single-cell RNA-seq data. Nat Mach Intell. 2022;4:852–66.
- 2. Cui H, Wang C, Maan H, Pang K, Luo F, Duan N, et al. scGPT: toward building a foundation model for single-cell multi-omics using generative AI. Nat Methods. 2024;21:1470–80. https://www.nature.com/articles/s41592-024-02201-0
- 3. Szałata A, Hrovatin K, Becker S, Tejada-Lapuerta A, Cui H, Wang B, et al. Transformers in single-cell omics: a review and new perspectives. Nat Methods. 2024;21:1430–43.
- 4. Hao M, Gong J, Zeng X, Liu C, Guo Y, Cheng X, et al. Large-scale foundation model on single-cell transcriptomics. Nat Methods. 2024;21:1481–91.
- 5. Ahlmann-Eltze C, Huber W, Anders S. Deep-learning-based gene perturbation effect prediction does not yet outperform simple linear baselines. Nat Methods. 2025;22:1657–61.
- 6. Ma Q, Jiang Y, Cheng H, Xu D. Harnessing the deep learning power of foundation models in single-cell omics. Nat Rev Mol Cell Biol. 2024;25:593–4.
- 7. Theodoris CV, Xiao L, Chopra A, Chaffin MD, Al Sayed ZR, Hill MC, et al. Transfer learning enables predictions in network biology. Nature. 2023;618:616–24.
- 8. Rosen Y, Roohani Y, Agarwal A, Samotorčan L, Tabula Sapiens Consortium, Quake SR, et al. Universal cell embeddings: a foundation model for cell biology. bioRxiv. 2023. https://biorxiv.org/lookup/10.1101/2023.11.28.568918
- 9. Zhao S, Zhang J, Wu Y, Luo Y, Nie Z. LangCell: language-cell pre-training for cell identity understanding. In: Proc 41st Int Conf Mach Learn. PMLR; 2024. p. 61159–85. https://proceedings.mlr.press/v235/zhao24u.html
- 10. Yuan X, Zhan Z, Zhang Z, Zhou M, Zhao J, Han B, et al. Cell ontology guided transcriptome foundation model. In: Adv Neural Inf Process Syst. 2024. p. 6323–66. https://neurips.cc/virtual/2024/105930
- 11. Boiarsky R, Singh NM, Buendia A, Amini AP, Getz G, Sontag D. Deeper evaluation of a single-cell foundation model. Nat Mach Intell. 2024. 10.1038/s42256-024-00949-w
- 12. Kedzierska KZ, Crawford L, Amini AP, Lu AX. Zero-shot evaluation reveals limitations of single-cell foundation models. Genome Biol. 2025;26:101.
- 13. Wenteler A, Occhetta M, Branson N, Curean V, Huebner M, Dee W, et al. PertEval-scFM: benchmarking single-cell foundation models for perturbation effect prediction. In: Proc 42nd Int Conf Mach Learn. 2025. https://icml.cc/virtual/2025/poster/43799
- 14. Wu Y, Wershof E, Schmon SM, Nassar M, Osiński B, Eksi R, et al. PerturBench: benchmarking machine learning models for cellular perturbation analysis. In: NeurIPS Workshop on AI for New Drug Modalities. 2024. https://neurips.cc/virtual/2024/102911
- 15. Liu T, Li K, Wang Y, Li H, Zhao H. Evaluating the utilities of foundation models in single-cell data analysis. bioRxiv. 2023. 10.1101/2023.09.08.555192
- 16. Stuart T, Butler A, Hoffman P, Hafemeister C, Papalexi E, Mauck WM, et al. Comprehensive integration of single-cell data. Cell. 2019;177:1888–902.
- 17. Korsunsky I, Millard N, Fan J, Slowikowski K, Zhang F, Wei K, et al. Fast, sensitive and accurate integration of single-cell data with Harmony. Nat Methods. 2019;16:1289–96.
- 18. Lopez R, Regier J, Cole MB, Jordan MI, Yosef N. Deep generative modeling for single-cell transcriptomics. Nat Methods. 2018;15:1053–8.
- 19. Kock KH, Tan LM, Han KY, Ando Y, Jevapatarakul D, Chatterjee A, et al. Asian diversity in human immune cells. Cell. 2025;188:2288–2306.e24.
- 20. CZI Cell Science Program, Abdulla S, Aevermann B, Assis P, Badajoz S, Bell SM, et al. CZ CELLxGENE Discover: a single-cell data platform for scalable exploration, analysis and modeling of aggregated data. Nucleic Acids Res. 2025;53:D886–900.
- 21. Aldeghi M, Graff DE, Frey N, Morrone JA, Pyzer-Knapp EO, Jordan KE, et al. Roughness of molecular property landscapes and its impact on modellability. J Chem Inf Model. 2022;62:4660–71.
- 22. Kwon JJ, Pan J, Gonzalez G, Hahn WC, Zitnik M. On knowing a gene: a distributional hypothesis of gene function. Cell Syst. 2024;15:488–96.
- 23. Chen H, King FJ, Zhou B, Wang Y, Canedy CJ, Hayashi J, et al. Drug target prediction through deep learning functional representation of gene signatures. Nat Commun. 2024;15:1853.
- 24. Pesaranghader A, Matwin S, Sokolova M, Grenier J-C, Beiko RG, Hussin J. deepSimDEF: deep neural embeddings of gene products and gene ontology terms for functional analysis of genes. Bioinformatics. 2022;38:3051–61.
- 25. Park S, Bak J, Oh A. Rotated word vector representations and their interpretability. In: Proc 2017 Conf Empir Methods Nat Lang Process. 2017. p. 401–11. http://aclweb.org/anthology/D17-1041
- 26. Han C, Xu J, Li M, Fung Y, Sun C, Jiang N, et al. Word embeddings are steers for language models. In: Proc 62nd Annu Meet Assoc Comput Linguist (Vol 1: Long Papers). 2024. p. 16410–30. https://aclanthology.org/2024.acl-long.864/
- 27. Gao Y, Wei Z, Dong K, Chen K, Yang J, Chuai G, et al. Toward subtask-decomposition-based learning and benchmarking for predicting genetic perturbation outcomes and beyond. Nat Comput Sci. 2024;4:773–85.
- 28. Lachmann A, Torre D, Keenan AB, Jagodnik KM, Lee HJ, Wang L, et al. Massive mining of publicly available RNA-seq data from human and mouse. Nat Commun. 2018;9:1366.
- 29. Hrovatin K, Sikkema L, Shitov VA, Heimberg G, Shulman M, Oliver AJ, et al. Considerations for building and using integrated single-cell atlases. Nat Methods. 2025;22:41–57.
- 30. Chen J, Xu H, Tao W, Chen Z, Zhao Y, Han J-DJ. Transformer for one stop interpretable cell type annotation. Nat Commun. 2023;14:223. 10.1038/s41467-023-35923-4
- 31. Jovic D, Liang X, Zeng H, Lin L, Xu F, Luo Y. Single-cell RNA sequencing technologies and applications: a brief overview. Clin Transl Med. 2022;12:e694.
- 32. Zheng Z, Chen J, Chen X, Huang L, Xie W, Lin Q, et al. Enabling single-cell drug response annotations from bulk RNA-Seq using SCAD. Adv Sci. 2023;10:2204113.
- 33. Zhong Z, Hou J, Yao Z, Dong L, Liu F, Yue J, et al. Domain generalization enables general cancer cell annotation in single-cell and spatial transcriptomics. Nat Commun. 2024;15:1929.
- 34. Badia-i-Mompel P, Wessels L, Müller-Dott S, Trimbour R, Ramirez Flores RO, Argelaguet R, et al. Gene regulatory network inference in the era of single-cell multi-omics. Nat Rev Genet. 2023;24:739–54.
- 35. Adamson B, Norman TM, Jost M, Cho MY, Nuñez JK, Chen Y, et al. A multiplexed single-cell CRISPR screening platform enables systematic dissection of the unfolded protein response. Cell. 2016;167:1867–1882.e21.
- 36. Luecken MD, Büttner M, Chaichoompu K, Danese A, Interlandi M, Mueller MF, et al. Benchmarking atlas-level data integration in single-cell genomics. Nat Methods. 2022;19:41–50.
- 37. Sikkema L, Ramírez-Suástegui C, Strobl DC, Gillett TE, Zappia L, Madissoon E, et al. An integrated cell atlas of the lung in health and disease. Nat Med. 2023;29:1563–77.
- 38. The Tabula Sapiens Consortium, Jones RC, Karkanias J, Krasnow MA, Pisco AO, Quake SR, et al. The Tabula Sapiens: a multiple-organ, single-cell transcriptomic atlas of humans. Science. 2022;376:eabl4896.
- 39. McDermott MBA, Yap B, Szolovits P, Zitnik M. Structure-inducing pre-training. Nat Mach Intell. 2023. 10.1038/s42256-023-00647-z
- 40. Wang H, Leskovec J, Regev A. Limitations of cell embedding metrics assessed using drifting islands. Nat Biotechnol. 2025. 10.1038/s41587-025-02702-z
- 41. Smith B, Ashburner M, Rosse C, Bard J, Bug W, Ceusters W, et al. The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration. Nat Biotechnol. 2007;25:1251–5.
- 42. Diehl AD, Meehan TF, Bradford YM, Brush MH, Dahdul WM, Dougall DS, et al. The Cell Ontology 2016: enhanced content, modularization, and ontology interoperability. J Biomed Semant. 2016;7:44.
- 43. Becht E, McInnes L, Healy J, Dutertre C-A, Kwok IWH, Ng LG, et al. Dimensionality reduction for visualizing single-cell data using UMAP. Nat Biotechnol. 2019;37:38–44.
- 44. Fischer F, Fischer DS, Mukhin R, Isaev A, Biederstedt E, Villani A-C, et al. scTab: scaling cross-tissue single-cell annotation models. Nat Commun. 2024;15:6611.
- 45. Wang S, Pisco AO, McGeever A, Brbic M, Zitnik M, Darmanis S, et al. Leveraging the Cell Ontology to classify unseen cell types. Nat Commun. 2021;12:5556.
- 46. Hie BL, Shanker VR, Xu D, Bruun TUJ, Weidenbacher PA, Tang S, et al. Efficient evolution of human antibodies from general protein language models. Nat Biotechnol. 2024;42:275–83.
- 47. Lotfollahi M, Naghipourfar M, Luecken MD, Khajavi M, Büttner M, Wagenstetter M, et al. Mapping single-cell data to reference atlases by transfer learning. Nat Biotechnol. 2022;40:121–30.
- 48. Li C, Liu B, Kang B, Liu Z, Liu Y, Chen C, et al. SciBet as a portable and fast single cell type identifier. Nat Commun. 2020;11:1818.
- 49. Zeng Y, Wei Z, Pan Z, Lu Y, Yang Y. A robust and scalable graph neural network for accurate single-cell classification. Brief Bioinform. 2022;23:bbab570.
- 50. Liu W, Wang X, Owens J, Li Y. Energy-based out-of-distribution detection. In: Adv Neural Inf Process Syst. 2020. p. 21464–75. https://proceedings.neurips.cc/paper/2020/hash/f5496252609c43eb8a3d147ab9b9c006-Abstract.html
- 51. Bernstein MN, Ma Z, Gleicher M, Dewey CN. CellO: comprehensive and hierarchical cell type classification of human cells with the Cell Ontology. iScience. 2021;24. https://www.cell.com/iscience/abstract/S2589-0042(20)31110-X
- 52. Radford A, Kim JW, Hallacy C, Ramesh A, Goh G, Agarwal S, et al. Learning transferable visual models from natural language supervision. In: Proc Int Conf Mach Learn. PMLR; 2021. p. 8748–63. https://proceedings.mlr.press/v139/radford21a
- 53. Dann E, Cujba A-M, Oliver AJ, Meyer KB, Teichmann SA, Marioni JC. Precise identification of cell states altered in disease using healthy single-cell references. Nat Genet. 2023;55:1998–2008.
- 54. Han Y, Wang Y, Dong X, Sun D, Liu Z, Yue J, et al. TISCH2: expanded datasets and new tools for single-cell transcriptome analyses of the tumor microenvironment. Nucleic Acids Res. 2023;51:D1425–31.
- 55. Sharifi-Noghabi H, Harjandi PA, Zolotareva O, Collins CC, Ester M. Out-of-distribution generalization from labelled and unlabelled gene expression data for drug response prediction. Nat Mach Intell. 2021;3:962–72.
- 56. Zhu Y, Ouyang Z, Chen W, Feng R, Chen DZ, Cao J, et al. TGSA: protein–protein association-based twin graph neural networks for drug response prediction with similarity augmentation. Bioinformatics. 2022;38:461–8.
- 57. Chen J, Wang X, Ma A, Wang Q-E, Liu B, Li L, et al. Deep transfer learning of cancer drug responses by integrating bulk and single-cell RNA-seq data. Nat Commun. 2022;13:6494.
- 58. Duan W, Liu H. Predicting single-cell drug sensitivity by adaptive weighted feature for adversarial multi-source domain adaptation. arXiv. 2024. http://arxiv.org/abs/2403.05260
- 59. Yang F, Wang F, Huang L, Liu L, Huang J, Yao J. Reply to: Deeper evaluation of a single-cell foundation model. Nat Mach Intell. 2024;6:1447–50.
- 60. Sun D, Guan X, Moran AE, Wu L-Y, Qian DZ, Schedin P, et al. Identifying phenotype-associated subpopulations by integrating bulk and single-cell sequencing data. Nat Biotechnol. 2022;40:527–38.
- 61. Hetzel L, Boehm S, Kilbertus N, Günnemann S, Theis F, et al. Predicting cellular responses to novel drug perturbations at a single-cell resolution. Adv Neural Inf Process Syst. 2022;35:26711–22.
- 62. Wolf FA, Angerer P, Theis FJ. SCANPY: large-scale single-cell gene expression data analysis. Genome Biol. 2018;19:15.
- 63. Gayoso A, Lopez R, Xing G, Boyeau P, Valiollah Pour Amiri V, Hong J, et al. A Python library for probabilistic analysis of single-cell omics data. Nat Biotechnol. 2022;40:163–6.
- 64. Gong J, Hao M, Cheng X, Zeng X, Liu C, Ma J, et al. xTrimoGene: an efficient and scalable representation learner for single-cell RNA-seq data. Adv Neural Inf Process Syst. 2023;36:69391–403.
- 65. Gu Y, Tinn R, Cheng H, Lucas M, Usuyama N, Liu X, et al. Domain-specific language model pretraining for biomedical natural language processing. ACM Trans Comput Healthc. 2021;3:1–23.
- 66. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: machine learning in Python. J Mach Learn Res. 2011;12:2825–30.
- 67. Patel AP, Tirosh I, Trombetta JJ, Shalek AK, Gillespie SM, Wakimoto H, et al. Single-cell RNA-seq highlights intratumoral heterogeneity in primary glioblastoma. Science. 2014;344:1396–401.
- 68. Eyre TA. The HUGO Gene Nomenclature Database, 2006 updates. Nucleic Acids Res. 2006;34:D319–21.
- 69. Cho H, Berger B, Peng J. Compact integration of multi-network topology for functional analysis of genes. Cell Syst. 2016;3:540–548.e5.
- 70. Halko N, Martinsson P-G, Tropp JA. Finding structure with randomness: probabilistic algorithms for constructing approximate matrix decompositions. SIAM Rev. 2011;53:217–88.
- 71. Akiba T, Sano S, Yanase T, Ohta T, Koyama M. Optuna: a next-generation hyperparameter optimization framework. In: Proc 25th ACM SIGKDD Int Conf Knowl Discov Data Min. 2019. p. 2623–31. https://dl.acm.org/10.1145/3292500.3330701
- 72. Shi J, Gare GR, Tian J, Chai S, Lin Z, Vasudevan AB, et al. LCA-on-the-line: benchmarking out-of-distribution generalization with class taxonomies. In: Proc Int Conf Mach Learn. PMLR; 2024. p. 44887–908. https://proceedings.mlr.press/v235/shi24c.html
- 73. Hendrycks D, Gimpel K. A baseline for detecting misclassified and out-of-distribution examples in neural networks. In: Int Conf Learn Represent. 2017. https://openreview.net/forum?id=Hkg4TI9xl
- 74. Esser P, Kulal S, Blattmann A, Entezari R, Müller J, Saini H, et al. Scaling rectified flow transformers for high-resolution image synthesis. In: Proc 41st Int Conf Mach Learn. PMLR; 2024. p. 12606–33. https://proceedings.mlr.press/v235/esser24a.html
- 75. Luecken M, Buttner M, Danese A, Interlandi M, Müller M, Strobl D, et al. Benchmarking atlas-level data integration in single-cell genomics - integration task datasets. figshare. 2020. https://figshare.com/articles/dataset/Benchmarking_atlas-level_data_integration_in_single-cell_genomics_-_integration_task_datasets_Immune_and_pancreas_/12420968/7
- 76. Barrett T, Wilhite SE, Ledoux P, Evangelista C, Kim IF, Tomashevsky M, et al. NCBI GEO: archive for functional genomics data sets—update. Nucleic Acids Res. 2012;41:D991–5.
- 77. Muraro MJ, Dharmadhikari G, Grün D, de Koning E, van Oudenaarden A. A single-cell transcriptome atlas of the human pancreas. Gene Expression Omnibus. 2016. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE81076
- 78. Muraro MJ, Dharmadhikari G, Grün D, de Koning E, van Oudenaarden A. A single-cell transcriptome atlas of the human pancreas [CEL-seq2]. Gene Expression Omnibus. 2016. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE85241
- 79. Lawlor N, George J, Bolisetty M, Kursawe R, Sun L, Sivakamasundari V, et al. Single cell transcriptomics defines human islet cell signatures and reveals cell-type-specific expression changes in type 2 diabetes [single cell]. Gene Expression Omnibus. 2016. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE86469
- 80. Veres A, Baron M. A single-cell transcriptomic map of the human and mouse pancreas reveals inter- and intra-cell population structure. Gene Expression Omnibus. 2016. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE84133
- 81. Oetjen KA, Gui G, Hourigan CS. Human Bone Marrow Assessment by Single Cell RNA Sequencing, Mass Cytometry and Flow Cytometry [scRNA]. Gene Expression Omnibus. 2018. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE120221
- 82. Hannah R. A single cell hematopoietic landscape resolves 8 lineage trajectories and defects in Kit mutant mice. Gene Expression Omnibus. 2018. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE107727
- 83. Freytag S, Lonnstedt I, Ng M, Bahlo M. Single cell profiling of peripheral blood mononuclear cells from healthy human donor. Gene Expression Omnibus. 2018. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE115189
- 84. Chen K, Chen W, Sun Z. A Bayesian mixture model for clustering droplet-based single cell transcriptomic data from population studies. Gene Expression Omnibus. 2019. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE128066
- 85. Villani A. Single-cell RNA-seq reveals new types of human blood dendritic cells, monocytes and progenitors. Gene Expression Omnibus. 2017. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE94820
- 86. Di Genua C, Thongjuea S, Nerlov C. Combined mutation of C/EBPa and GATA-2 induce bi-lineage acute erythroid leukemia through transformation of a neomorphic neutrophil-erythroid progenitor [10x RNA-seq]. Gene Expression Omnibus. 2020. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE142213
- 87. Caron M, St-Onge P, Sontag T, Wang YC, Richer C, Ragoussis I, et al. Single-cell analysis of childhood leukemia reveals a link between developmental states and ribosomal protein expression as a source of intra-individual heterogeneity. Gene Expression Omnibus. 2020. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE132509
- 88. van Galen P, Hovestadt V, Bernstein BE. Single-cell RNA-seq reveals AML hierarchies relevant to disease progression and immunity. Gene Expression Omnibus. 2020. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE116256
- 89. Laffy J, Tirosh I. Single cell RNA-seq analysis of adult and paediatric IDH-wildtype Glioblastomas. Gene Expression Omnibus. 2019. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE131928
- 90. Diaz A, Wang L, Babikir H, Muller S, Yagnik G, Shamardani K, et al. A single cell atlas of human glioma. Gene Expression Omnibus. 2019. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE138794
- 91. Wang R, Sharma R, Laughney AM, Masilionis I, Pe’er D, Tabar V. Characterization of tumor initiating cells from human glioblastoma multiforms by scRNA seq. Gene Expression Omnibus. 2016. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE139448
- 92. Wang L, Catalan FL, Babikir H, Shamardani K, Diaz A. Ensemble learning for classifying single-cell data and projection across reference atlases. Bioinformatics. 2020. 10.1093/bioinformatics/btaa137
- 93. Hovestadt V, Smith KS, Bihannic L, Filbin MG, Bernstein BE, Suva ML, et al. Single cell RNA-seq analysis of medulloblastoma. Gene Expression Omnibus. 2019. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE119926
- 94. Ledergor G, Gui G, Papaemmanuil E, Amit I. Single cell dissection of plasma cell heterogeneity in symptomatic and asymptomatic myeloma. Gene Expression Omnibus. 2018. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE117156
- 95. Durante MA, Rodriguez DA, Kurtenbach S. Single-cell analysis reveals new evolutionary complexity in uveal melanoma. Nat Commun. 2020. 10.1038/s41467-019-14256-1
- 96. Hao M. scFoundation: Large Scale Foundation Model on Single-cell Transcriptomics - processed datasets. figshare. 2024. https://figshare.com/articles/dataset/scFoundation_Large_Scale_Foundation_Model_on_Single-cell_Transcriptomics_-_processed_datasets/24049200/3
- 97. Yang W, Soares J, Greninger P, Edelman EJ, Lightfoot H, Forbes S, et al. Genomics of Drug Sensitivity in Cancer (GDSC): a resource for therapeutic biomarker discovery in cancer cells. Nucleic Acids Res. 2012;41:D955–61.
- 98. Manica M, Oskooei A, Born J, Subramanian V, Sáez-Rodríguez J, Rodríguez Martínez M. Toward explainable anticancer compound sensitivity prediction via multimodal attention-based convolutional encoders. Mol Pharm. 2019;16:4797–806.
- 99. Benevolenskaya EV, Aissa AF, Islam AB. Profiling non-small cell lung carcinoma cell line PC9 treated with etoposide, erlotinib and its combination with crizotinib. Gene Expression Omnibus. 2020. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE149215
- 100. Hammell MG, Ho Y. SAKE (Single-cell RNA-Seq Analysis and Klustering Evaluation) Identifies Markers of Resistance to Targeted BRAF Inhibitors in Melanoma Cell Populations [Fluidiam scRNA-Seq]. Gene Expression Omnibus. 2018. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE108383
- 101. Kinker GS, Greenwald AC, Tal R, Orlova Z, Cuoco MS, McFarland JM, et al. Pan-cancer cell line heterogeneity. Broad Institute single-cell portal. 2020. https://singlecell.broadinstitute.org/single_cell/study/SCP542/
- 102. Wu J, Wang Y. Code for scFM-Bench: biology-driven insights into the power of single-cell foundation models. GitHub. 2025. https://github.com/wujialu/scFM-Bench/
- 103. Wu J, Wang Y. Code for scFM-Bench: biology-driven insights into the power of single-cell foundation models. Zenodo. 2025. 10.5281/zenodo.17062098
- 104. Lin Z, Akin H, Rao R, Hie B, Zhu Z, Lu W, et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science. 2023;379:1123–30.
- 105. Xin Y, Kim J, Okamoto H, et al. RNA sequencing of single human islet cells reveals type 2 diabetes genes. Cell Metab. 2016;24:608–15.
Supplementary Materials
Additional file 1: Supplementary Notes. Notes S1–S6.
Additional file 2: Supplementary Figures. Figs. S1–S27.
Additional file 3: Supplementary Tables. Tables S1–S13.






