Abstract
Cell-type annotation remains a major challenge in single-cell and spatial omics analysis. Most existing methods rely on single-cell RNA sequencing (scRNA-seq) references or predefined marker sets. However, the scarcity of high-quality scRNA-seq references and marker sets makes relying on a single approach prone to bias and limits usability. Furthermore, available methods for cell-type annotation in single-cell ATAC-sequencing (scATAC-seq) and spatial transcriptomics datasets perform poorly. Here, we present ScInfeR, a graph-based cell-type annotation method that combines information from both scRNA-seq references and marker sets. By integrating these two data sources, ScInfeR can accurately annotate a broad range of cell types. It employs a hierarchical framework inspired by message-passing layers in graph neural networks to accurately identify cell subtypes. ScInfeR is highly versatile, supporting cell annotation across scRNA-seq, scATAC-seq, and spatial omics datasets. For scATAC-seq, it effectively utilizes chromatin accessibility data, while for spatial transcriptomics, it incorporates spatial coordinate information. Additionally, ScInfeR supports weighted positive and negative markers, allowing users to define marker importance in cell-type classification. Our extensive benchmarking across multiple atlas-scale scRNA-seq, scATAC-seq, and spatial datasets, evaluating 10 existing tools in over 100 cell-type prediction tasks, demonstrated ScInfeR’s superior performance. Notably, it is robust to the batch effects arising in these datasets. To facilitate seamless annotation, we developed ScInfeRDB, an interactive database containing manually curated scRNA-seq references and marker sets for 329 cell types, covering 2497 gene markers in 28 tissue types from human and plant. ScInfeR is available as an R package, with both the tool and database publicly accessible at https://www.swainasish.in/scinfer.
Keywords: cell type annotation, scRNA-seq, scATAC-seq, spatial transcriptomics
Introduction
Advancements in sequencing technologies provide unparalleled opportunities to study cellular heterogeneity, gene regulation, and the spatial tissue architecture of complex biological systems [1–3]. Single-cell RNA sequencing (scRNA-seq), single-cell ATAC sequencing (scATAC-seq), and spatial transcriptomics are among the most powerful techniques developed to characterize the transcriptomic, epigenomic, and spatial landscape of cells, respectively [1–4]. Briefly, scRNA-seq allows us to analyse the expression pattern of genes at the individual cell level [5], scATAC-seq provides insights into chromatin accessibility of the genome at single-cell resolution [3], and spatial transcriptomics retains spatial context while analysing the molecular features of tissue samples [4]. Accurate identification of cell types is crucial for downstream analysis of single-cell and spatial datasets. Cell-type annotation can be performed manually or through automated methods. The manual cell-type annotation process is labour-intensive, expert-dependent, and not scalable to large datasets [6]. In contrast, automated cell-type annotation methods are scalable to large datasets and are less susceptible to human error [7–9]. Automated cell-type annotation methods can be broadly classified into two categories: marker-based and reference-based [10]. Marker-based methods extract cell-type-specific markers from literature studies or cell-type marker databases such as PanglaoDB [11], the ACT database [12], and the CellMarker database [13], and then classify cells based on the expression levels of these markers [14]. Traditionally, these cell-type markers have been identified by isolating specific cell types using cell sorting and microscopic techniques [15, 16]. Following isolation, the molecular characteristics of these cells were analysed to identify the cell-type-specific markers [7]. Conversely, reference-based methods rely on an scRNA-seq reference instead of marker sets. These methods transfer cell annotations to the target dataset by correlating its gene expression profiles with those of the scRNA-seq reference dataset [9, 17–20].
Recently, several cell-type annotation methods have been developed for scRNA-seq, scATAC-seq, and spatial omics datasets. Most of these rely on either marker-based or reference-based approaches. Marker-based methods, including SCINA [8], ScType [7], Garnett [21], and scSorter [22], as well as reference-based tools, such as SingleR [17] and Seurat [19], have been developed to annotate cell types in scRNA-seq datasets. Methods such as AtacAnnoR [23] and CellCano [24] were explicitly designed for annotation of scATAC-seq datasets. Similarly, spatially aware cell annotation tools such as SPANN [25] and TACCO [26] have been developed for spatial omics datasets. Most marker-based methods assume that marker gene sets exhibit higher expression in the corresponding cell type. Among the marker-based tools, SCINA fits a Gaussian mixture model under this assumption [8]. ScType utilizes positive and negative marker sets to categorize the user-defined clusters. The negative marker set penalizes a cluster with higher expression of negative marker genes [7]. Furthermore, Garnett uses a generalized linear machine learning approach to identify cell types and their associated subtypes in a hierarchical manner [21]. scSorter combines information from user-defined marker genes and highly variable genes to annotate scRNA-seq datasets [22]. The main drawback of the leading marker-based tools (e.g. SCINA, ScType) is their dependence on the quality of cell-type-specific marker sets and their lack of support for subtype identification [7, 8]. For example, the markers of T-cell subtypes reported in the CellMarker [13] and ScTypeDB [7] databases overlap heavily, which can lead to incorrect cell type classification. Biologically, several cell types comprise multiple subtypes. Because the markers of these subtypes overlap, accurately identifying them at the cluster level is highly challenging. Methods like ScType [7] and scSorter [22], which perform cell annotation at the cluster level, often struggle to distinguish closely related subtypes, leading to reduced accuracy. Among existing methods, only Garnett supports subtype classification. However, Garnett’s performance depends heavily on the quality of the training data, and inadequate training data can lead to poor classification outcomes. This highlights the need for a method that enables hierarchical subtype classification with greater accuracy and robustness.
Among the reference-based methods, Seurat uses canonical correlation, and SingleR uses Spearman correlation to identify cell types using a well-annotated scRNA-seq reference dataset [17, 19]. Among the methods designed for annotating scATAC-seq datasets, CellCano uses a combination of multi-layer perceptron and knowledge distillation algorithms to identify cell types using a scATAC-seq reference dataset [24]. As fewer scATAC-seq reference datasets are available than scRNA-seq datasets, AtacAnnoR uses a combine-and-discard strategy to annotate scATAC-seq datasets using an scRNA-seq dataset as a reference [23]. Among the spatial transcriptomics cell annotation methods, SPANN uses a coupled variational autoencoder and TACCO uses an optimal transport model to annotate spatial spots/cells using scRNA-seq as a reference [25, 26]. However, good-quality reference scRNA-seq datasets covering a wide range of cell types are rare [27]. Consequently, if a cell type in the target dataset is not included in the reference dataset, predictions can be inaccurate. Both approaches to cell-type prediction have their strengths and weaknesses. Integrating scRNA-seq reference data with marker sets is therefore a more comprehensive strategy: it can improve the usability and robustness of cell-type annotation and allows a broad range of cell types to be annotated from both sources. To the best of our knowledge, no hybrid approach currently exists for cell typing in single-cell and spatial technologies.
To address these challenges, we propose a hybrid cell-type annotation toolkit named ScInfeR (Single Cell-type Inference toolkit using R). ScInfeR can annotate cells in scRNA-seq, scATAC-seq, and spatial omics datasets using user-defined marker sets, scRNA-seq references, or both. This combined strategy leverages the complementary strengths of marker- and reference-based annotation approaches to identify novel or missing cell types. By integrating cell types from both sources, this dual-layered framework enables the annotation of a broader range of cell types, improving the robustness and adaptability of cell-type identification across diverse datasets and species, which is lacking in existing tools. ScInfeR implements a two-round annotation strategy for cell-type assignment. First, our tool annotates the cell clusters by correlating the cluster-specific markers with the cell-type-specific markers in the cell–cell similarity graph. These cell-type-specific marker genes can be user-defined, extracted by ScInfeR from scRNA-seq reference data, or a combination of both. When an scRNA-seq dataset is used as a reference, ScInfeR extracts cell-type markers by considering both the global and local specificity of markers. In the second round, ScInfeR annotates the subtypes and clusters containing multiple cell types in a hierarchical manner. In this step, it uses a framework adapted from the message-passing layer in graph neural networks to annotate each cell individually. Additionally, our method supports weighted and negative markers, where users can specify the importance of the markers in the cell type classification task. Our comprehensive benchmarking analysis across various scRNA-seq, scATAC-seq, and spatial omics datasets demonstrated that ScInfeR outperformed existing tools in both accuracy and sensitivity. Additionally, the tool is capable of accurately annotating datasets that exhibit substantial batch effects. We also built an open-access and interactive hierarchical cell marker database, ScInfeRDB (https://www.swainasish.in/scinfer), comprising 28 tissue types, 329 cell types, and 2497 gene markers. This cell-marker database can be integrated into our toolkit for seamless cell-type annotation.
Materials and methods
Overview of datasets
We used 24 scRNA-seq, two scATAC-seq, and three spatial omics datasets for the performance assessment of ScInfeR. All scRNA-seq datasets were pre-processed using the Seurat package [19]. scATAC-seq datasets were processed using the Signac [28] and ArchR [29] packages. Spatial datasets were jointly processed using the Seurat and Scanpy packages [30]. Detailed information about all the datasets is provided in Supplementary Table 1. The datasets are summarized below:
Tabula Sapiens atlas scRNA-seq dataset
This scRNA-seq atlas comprises scRNA-seq datasets of multiple human tissues [31]. We retrieved the human lung, pancreas, and liver scRNA-seq datasets for the downstream analysis. The cell-type annotations provided in the Tabula Sapiens atlas were used as ground truth for benchmarking with other tools.
scATAC-seq datasets
Two peripheral blood mononuclear cell (PBMC) scATAC-seq datasets were retrieved from the NCBI-GEO with accession IDs GSE129785 [32] and GSE123578 [33]. Both datasets include a range of cell types, such as B cells, monocytes, natural killer cells, CD8 T-cells, and CD4 T-cells. These datasets were obtained using cell sorting, so the annotations from cell sorting are considered ground truth for benchmarking.
Spatial transcriptomics datasets
Three spatial transcriptomics datasets were used for the performance assessment of ScInfeR. The first spatial transcriptomic dataset was retrieved from the mouse cortical region. This dataset was sequenced using the STARmap technology [34]. The second spatial dataset was retrieved from a developmental mouse embryo. This dataset was profiled using the SeqFISH technique. We also retrieved the corresponding scRNA-seq data at the same developmental time point [35]. The last spatial dataset was obtained from the human prefrontal cortex region, which was sequenced using the 10X Visium technique [36].
scRNA-seq datasets with batch-effects
A total of 14 scRNA-seq datasets were retrieved from multiple studies of pancreas and PBMC tissues. The batch-wise concatenated matrix was obtained from De Donno et al. [37]. The integrated pancreas dataset comprised 16 382 cells and 13 cell types. This dataset was created by integrating eight batches of scRNA-seq datasets. The integrated PBMC dataset comprised 33 506 cells and 16 cell types. This dataset was created by integrating five batches of scRNA-seq datasets.
Plant scRNA-seq datasets
Two scRNA-seq datasets from Arabidopsis thaliana root and Oryza sativa leaf were used to assess the performance of ScInfeR. The A. thaliana root scRNA-seq dataset contained 28 183 cells across 12 cell types, while the O. sativa leaf scRNA-seq dataset comprised 24 264 cells across four cell types [38, 39].
Cell type-specific marker identification from scRNA-seq references
For reference-based cell type annotation, ScInfeR requires a reference scRNA-seq matrix ($X_R$) with cell type annotations and a target scRNA-seq matrix ($X_T$). Traditionally, cell type markers are extracted from $X_R$ using the one-versus-all approach (as in the Seurat and Scanpy packages), where gene expression is compared at the global scale only. This method does not take into account the local specificity of the markers. As a result, genes that show specificity only at the local level are dominated by markers with high specificity at the global level. For example, the specificity of T-cell subtype markers, including CD4 T-cell and CD8 T-cell markers, is dominated by general T-cell markers. To overcome this issue, our tool estimates the cell type specificity of every gene on a global scale ($AUC_{global}$) and also on a local scale ($AUC_{local}$). The area under the ROC curve ($AUC$) score ranges between 0 and 1 and refers to the specificity of a gene to a cell type. A higher score indicates better specificity to the cell type. First, the tool calculates the $AUC_{global}$ of all genes in $X_R$ using the traditional one-versus-all approach. Next, for $AUC_{local}$ estimation, the tool identifies the $k$ most highly correlated cell types for each cell type within the cell–cell similarity adjacency matrix ($A$). $A$ is derived from the UMAP or PCA projections of the reference scRNA-seq data. For multiple scRNA-seq references, Harmony projections are recommended [40]. Details about the construction of the $A$ matrix are discussed in the next steps. Next, $AUC_{local}$ is estimated for all genes by comparing the query cell type with its $k$ highly correlated cell types. Finally, a combined score $AUC_{combined}$ is estimated for all genes by

$$AUC_{combined} = w_g \cdot AUC_{global} + w_l \cdot AUC_{local} \tag{1}$$

Here, $w_g$ represents the weight assigned to $AUC_{global}$, and $w_l$ represents the weight assigned to $AUC_{local}$. By default, both $w_g$ and $w_l$ are set at 0.5 so that equal weightage is given to both the global and local specificity of the gene. In the final step, the top $N$ genes with the highest $AUC_{combined}$ are selected for each cell type for further downstream analysis. High-quality marker selection is based on two parameters: the number of top genes ($N$) and the $AUC_{combined}$ score. By default, $N$ is set to 15 and the $AUC_{combined}$ threshold to 0.75. When fewer than $N$ genes fulfil the threshold condition ($AUC_{combined} > 0.75$), only the genes with a score >0.75 are selected. Users can adjust the $AUC_{combined}$ threshold to refine the marker selection: increasing it (e.g. to 0.85) selects more specific markers, while decreasing it includes a broader range of genes. Additionally, the number of top genes ($N$) can be modified to create either a smaller, more precise marker set or a larger, more comprehensive one, depending on the tissue type and gene expression overlap across cell types.
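The marker-scoring procedure described above can be illustrated with a minimal Python sketch. It is not the ScInfeR implementation (which is an R package): the function names, the pairwise AUC estimator, and the toy expression values are assumptions made for illustration, while the default weights (0.5 each), the number of top genes (15), and the threshold (0.75) follow the text.

```python
def auc(pos, neg):
    """AUC of one gene: probability that a cell from `pos` has higher
    expression than a cell from `neg` (ties count as 0.5)."""
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def combined_auc(expr, labels, target, neighbours, w_global=0.5, w_local=0.5):
    """Weighted mix of global (one-versus-all) and local AUC.

    `neighbours` is the set of cell types most similar to `target`;
    how that set is chosen is outside this sketch.
    """
    pos = [x for x, lab in zip(expr, labels) if lab == target]
    rest = [x for x, lab in zip(expr, labels) if lab != target]
    near = [x for x, lab in zip(expr, labels) if lab in neighbours]
    global_auc = auc(pos, rest)
    # ASSUMPTION: fall back to the global score when no neighbour cells exist.
    local_auc = auc(pos, near) if near else global_auc
    return w_global * global_auc + w_local * local_auc


def select_markers(scores, top_n=15, threshold=0.75):
    """Keep at most `top_n` genes whose combined score exceeds `threshold`."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [gene for gene, score in ranked if score > threshold][:top_n]


# Toy data: two CD4 T cells, two CD8 T cells, two B cells.
labels = ["CD4", "CD4", "CD8", "CD8", "B", "B"]
pan_t_gene = [5, 6, 5, 6, 0, 0]   # high in all T cells
cd4_gene = [4, 5, 0, 1, 0, 0]     # high in CD4 T cells only

scores = {
    "pan_t_gene": combined_auc(pan_t_gene, labels, "CD4", {"CD8"}),
    "cd4_gene": combined_auc(cd4_gene, labels, "CD4", {"CD8"}),
}
markers = select_markers(scores)
```

In this toy case the pan-T-cell gene separates CD4 T cells from B cells but not from CD8 T cells, so its local score pulls the combined score below the threshold, whereas the CD4-specific gene is retained.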
ScInfeR framework for the cell-type and subtype identification for the scRNA-seq dataset
ScInfeR infers cell type annotations of the target scRNA-seq dataset ($X_T$) in three steps. First, the cell–cell similarity adjacency matrix ($A$) is constructed from the UMAP or PCA projections. In the second step, cluster-level cell annotation is performed by combining information from $A$ with cell type-specific markers. In the third step, clusters containing multiple cell types and subtypes are annotated at the individual cell level. Details of all three steps are described below:
Step1: cell–cell similarity adjacency matrix ($A$) construction
To construct the cell–cell similarity adjacency matrix, the $k$ nearest neighbours of each cell are identified from the UMAP or PCA projections using a nearest-neighbour search module. If the dataset has multiple batches, Harmony [40] projections can be used for the adjacency matrix construction. The module efficiently identifies neighbours by segmenting the search space into smaller areas using hyperplanes and then identifying the cells most likely to be nearest neighbours. Next, $A$ is constructed using the following condition:

$$A_{ij} = \begin{cases} 1, & \text{if cell } j \text{ is among the } k \text{ nearest neighbours of cell } i \\ 0, & \text{otherwise} \end{cases} \tag{2}$$

Here, if cell $j$ is among the nearest neighbours of cell $i$, $A_{ij} = 1$, else 0.
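As a rough illustration of the adjacency construction, the sketch below builds a binary k-nearest-neighbour matrix from low-dimensional coordinates in plain Python. It substitutes an exact brute-force distance search for the hyperplane-based search described above, which is an assumption made to keep the example self-contained; the function name and the toy coordinates are likewise hypothetical.

```python
import math


def knn_adjacency(points, k):
    """Binary adjacency matrix: adj[i][j] = 1 if j is one of the k nearest
    neighbours of i (Euclidean distance), else 0.

    NOTE: exact brute-force search, used here only for illustration.
    """
    n = len(points)
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        # Distances from cell i to every other cell, nearest first.
        dists = sorted(
            (math.dist(points[i], points[j]), j) for j in range(n) if j != i
        )
        for _, j in dists[:k]:
            adj[i][j] = 1
    return adj


# Toy 2-D embedding (e.g. UMAP coordinates) with two well-separated pairs.
embedding = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (6.0, 0.0)]
A = knn_adjacency(embedding, k=1)
```

The same function applies unchanged to spatial x and y coordinates, since only pairwise distances are used. Note that the resulting matrix is not necessarily symmetric: cell j can be a neighbour of cell i without the reverse being true.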
Step2: cluster-level cell annotation
In this step, for each cluster in the target dataset ($X_T$), cluster-specific gene markers ($M_c$) are identified by calculating the combined AUC score ($AUC_{combined}$) as mentioned above. These cluster-specific markers ($M_c$) exhibit very high specificity to their respective clusters. Similarly, cell type-specific markers ($M_t$), either user-defined or calculated from the reference scRNA-seq matrix ($X_R$), have higher expression in the particular cell types. If a high correlation is observed between the expression patterns of $M_c$ and $M_t$ within a cluster, it is likely that the cluster belongs to that particular cell type. Based on this assumption, the cosine similarity between $M_c$ and $M_t$ is calculated for each cluster as follows:

$$\cos(g_c, g_t) = \frac{g_c \cdot g_t}{\lVert g_c \rVert \, \lVert g_t \rVert} \tag{3}$$

Here, $\cos(g_c, g_t)$ represents the cosine similarity of the cluster-specific marker $g_c$ to the cell type marker $g_t$. Next, the cell type specificity of the cluster $c$ for each cell type is determined by
![]() |
(4) |
Here, $S_{c,t}$ represents the weighted cell type specificity of the cluster $c$ to the cell type $t$. $w_{g_t}$ represents the weight assigned to the cell type marker $g_t$ by the user. $w_{g_t}$ ranges from −3 to 3, where +3 denotes a highly specific positive marker and −3 denotes a highly specific negative marker. $s_{g_t,c}$ represents the specificity of the cell type marker $g_t$ to the cluster $c$. In this way, for cluster $c$, the cell type specificity for all cell types is determined. Next, the cell type with the highest $S_{c,t}$ is assigned to cluster $c$. If the difference between the highest and the second-highest $S_{c,t}$ is <0.05, the cell type assignment is ambiguous. In such cases, the cluster is considered unresolved and is passed to step 3, where annotation is performed at the individual cell level. This finer resolution step helps accurately assign cell types by leveraging the distinct transcriptional profiles of individual cells.
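The cluster-level decision rule can be sketched in Python as follows. This is an illustrative reading rather than the ScInfeR code: the way a marker's specificity to the cluster is derived from the cosine similarities (here, the mean similarity to the cluster-specific markers) and the normalization by the sum of absolute weights are assumptions, as are all function names. The weight range (−3 to 3) and the 0.05 ambiguity margin come from the text.

```python
import math


def cosine(u, v):
    """Cosine similarity between two expression vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cluster_scores(cluster_marker_expr, type_markers):
    """Weighted specificity of one cluster to each candidate cell type.

    cluster_marker_expr: expression vectors (over the cluster's cells) of the
        cluster-specific markers.
    type_markers: {cell_type: [(expression_vector, weight), ...]} where the
        user weight lies in [-3, 3]; negative weights mark negative markers.
    """
    scores = {}
    for cell_type, markers in type_markers.items():
        total, weight_sum = 0.0, 0.0
        for vec, weight in markers:
            # ASSUMPTION: specificity = mean cosine similarity to the
            # cluster-specific markers.
            spec = sum(cosine(vec, c) for c in cluster_marker_expr)
            spec /= len(cluster_marker_expr)
            total += weight * spec
            weight_sum += abs(weight)
        # ASSUMPTION: normalize by total absolute weight.
        scores[cell_type] = total / weight_sum if weight_sum else 0.0
    return scores


def assign(scores, margin=0.05):
    """Return the best cell type, or None if the top two scores are closer
    than `margin` (cluster deferred to cell-level annotation)."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    if len(ranked) > 1 and ranked[0][1] - ranked[1][1] < margin:
        return None
    return ranked[0][0]


cluster = [[1.0, 2.0, 3.0]]                      # one cluster marker, 3 cells
candidates = {
    "T cell": [([1.0, 2.0, 3.0], 1)],            # co-varies with the cluster marker
    "B cell": [([3.0, 0.0, 0.0], 1)],            # does not
}
label = assign(cluster_scores(cluster, candidates))
```

Returning `None` for near-ties mirrors the rule that ambiguous clusters are not forced into a label at this stage.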
Step3: subtype identification
Clusters containing multiple cell types and user-defined subtypes are annotated in this step. The weighted mean expression of each cell’s neighbourhood cells is calculated using the adjacency matrix $A$. Next, each cell’s own expression is merged with the weighted mean expression of its neighbours. In this way, a combined expression is created that captures both the cell’s own expression and that of its neighbours. This approach is adapted from the message-passing layer framework in graph neural networks, where a node incorporates its own information and information from its neighbourhood nodes. The weighted mean expression is calculated as follows:
![]() |
(5) |
Here, $E_{wm}$ represents the weighted mean expression of the cell, $E_{nb}$ is the mean expression of the neighbourhood cells, and $T$ represents the number of iterations over which the message-passing layer integrates the mean expression from its neighbours. $w$ represents the weight assigned to the neighbours’ mean expression, which is initialized as 1 and reduced at each iteration. In this way, higher weight is given to nearer cells than to farther cells in the projection space. Next, the combined expression of each cell is calculated as follows:

$$E_{combined} = \alpha \cdot E_{own} + \beta \cdot E_{wm} \tag{6}$$

Here, $\alpha$ denotes the weight assigned to the cell’s own expression ($E_{own}$), and $\beta$ denotes the weight assigned to the weighted mean expression of the cell’s neighbours ($E_{wm}$). By default, $\alpha$ and $\beta$ are assigned a value of 0.5, giving equal weight to both terms. Next, the expression of the cell type-specific markers is indexed from $E_{combined}$, and the mean expression of these cell type markers is calculated for each cell type across all cells. Subsequently, cells are individually annotated based on the cell type that has the highest combined mean expression.
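The neighbourhood-smoothing idea in this step can be sketched in Python as below. This is one plausible reading, not the published formula: the per-iteration weight decrement (`decay`), the normalization by the sum of weights, and the treatment of each iteration as one additional hop of neighbour averaging are all assumptions, and the function names are hypothetical. The default of 0.5 for both mixing weights follows the text.

```python
def neighbour_mean(values, adj):
    """Mean of each cell's neighbours' values (0.0 for isolated cells)."""
    out = []
    for row in adj:
        idx = [j for j, a in enumerate(row) if a]
        out.append(sum(values[j] for j in idx) / len(idx) if idx else 0.0)
    return out


def weighted_neighbour_expression(values, adj, iterations=2, decay=0.5):
    """Weighted mean of neighbourhood expression over `iterations` >= 1 hops.

    ASSUMPTION: the weight starts at 1 and drops by `decay` per iteration,
    so nearer neighbourhoods contribute more than farther ones.
    """
    weight, total_weight = 1.0, 0.0
    acc = [0.0] * len(values)
    hop = values
    for _ in range(iterations):
        hop = neighbour_mean(hop, adj)            # pass messages one hop further
        acc = [a + weight * h for a, h in zip(acc, hop)]
        total_weight += weight
        weight = max(weight - decay, 0.0)
    return [a / total_weight for a in acc]


def combined_expression(values, adj, alpha=0.5, beta=0.5, **kwargs):
    """Mix each cell's own expression with its smoothed neighbourhood."""
    smoothed = weighted_neighbour_expression(values, adj, **kwargs)
    return [alpha * v + beta * s for v, s in zip(values, smoothed)]


def annotate(per_type_scores):
    """Label each cell with the cell type whose score is highest for it."""
    n_cells = len(next(iter(per_type_scores.values())))
    return [
        max(per_type_scores, key=lambda t: per_type_scores[t][i])
        for i in range(n_cells)
    ]


# Three cells on a chain: 0 - 1 - 2 (per-cell mean marker expression).
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
cd4_score = combined_expression([0.0, 2.0, 4.0], adj, iterations=1)
cd8_score = combined_expression([3.0, 1.0, 0.0], adj, iterations=1)
labels = annotate({"CD4 T-cell": cd4_score, "CD8 T-cell": cd8_score})
```

Here each input list stands for the mean expression of one cell type's marker set per cell; smoothing over the graph dampens dropout-driven noise in single cells before the per-cell maximum is taken.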
Cell-type identification framework for scATAC-seq dataset
Most cell-type annotation methods for scATAC-seq datasets rely on the gene activity score only, which aggregates the counts of all fragments within the gene promoter region and gene body [28, 29]. However, using only the gene activity score for cell type annotation can overlook the fragments that reside outside the gene region. To overcome this challenge, ScInfeR utilizes both the gene activity score and chromatin accessibility score to predict cell-type annotations. For cell type guidance, users can provide either a marker set, scRNA-seq reference, or scATAC-seq reference. When using scATAC-seq dataset as the reference, ScInfeR calculates cell type-specific markers by combining the gene activity scores with the UMAP projections of chromatin accessibility data.
Our tool requires the gene activity score and cluster information, along with UMAP or PCA projections generated from the chromatin accessibility scores of the target scATAC-seq dataset. First, the cell–cell similarity adjacency matrix ($A$) is constructed from the UMAP or PCA projections of the target scATAC-seq dataset. Subsequently, clusters are annotated by calculating the weighted cell type specificity using cluster-specific gene markers derived from the gene activity score and cell type-specific markers, as described in the previous steps. Further, subtypes and clusters containing multiple cell types are annotated by calculating the combined expression using $A$ and the cell type- and subtype-specific markers.
Cell-type identification framework for spatial transcriptomics dataset
For cell type annotation in spatial omics datasets, our tool calculates the adjacency matrix ($A$) from the $x$ and $y$ coordinates of the spatial data. Next, spatial domains or clusters are obtained using our in-house tool, SpatialPrompt [41]. Briefly, SpatialPrompt calculates spatial domains by integrating the gene expression data and spatial coordinates from spatial omics datasets. In this step, users can also provide spatial domains generated by other tools to ScInfeR. Finally, by combining the spatial adjacency matrix, spatial domain information, and gene expression data, ScInfeR predicts the cell types in the spatial data using the steps mentioned above.
Performance comparison of ScInfeR with other state-of-the-art tools
A total of 10 cell-type annotation tools, used across various scRNA-seq, scATAC-seq, and spatial technologies, were considered for the performance assessment. For scRNA-seq benchmarking, the marker-based methods scSorter [22], SCINA [8], ScType [7], and Garnett [21] were considered. Among the reference-based tools, Seurat [19] and SingleR [17] were selected as these methods have shown promising results in recent benchmark studies [42]. For the scATAC-seq benchmark, AtacAnnoR, CellCano, and scRNA-seq reference-based methods were considered. For spatial transcriptomics, recent spatially aware cell type annotation tools, such as SPANN [25] and TACCO [26], along with scRNA-seq marker- and reference-based tools, were included in the analysis. A brief overview of all the tools is given in Table 1. For plant scRNA-seq datasets, scSorter and SCINA were considered for benchmarking, while ScType and Garnett were excluded as they are explicitly designed for human and mouse scRNA-seq data.
Table 1.
Comparison of cell type annotation tools: overview of popular cell annotation tools, highlighting their target data types, underlying algorithms, and support for hybrid-based annotation, subtype identification, weighted marker usage, and negative marker support
| Tool name | Target data type | Algorithm overview | Hybrid annotation support | Subtype support | Weighted marker support | Negative marker support |
|---|---|---|---|---|---|---|
| ScInfeR | scRNA-seq, scATAC-seq, spatial-omics | Graph-based | Yes | Yes | Yes | Yes |
| ScType [7] | scRNA-seq | Correlation-based | No | No | No | Yes |
| SCINA [8] | scRNA-seq | Gaussian mixture model | No | No | No | No |
| scSorter [22] | scRNA-seq | Correlation-based | No | No | Yes | No |
| Garnett [21] | scRNA-seq | Linear-model | No | Yes | No | No |
| SingleR [17] | scRNA-seq | Spearman correlation | No | No | No | No |
| Seurat [19] | scRNA-seq | Canonical correlation | No | No | No | No |
| CellCano [24] | scATAC-seq | Knowledge distillation algorithms | No | No | No | No |
| AtacAnnoR [23] | scATAC-seq | Combine and discard strategy | No | No | No | No |
| TACCO [26] | Spatial omics | Optimal transport model | No | No | No | No |
| SPANN [25] | Spatial omics | Coupled-variational autoencoder | No | No | No | No |
For unbiased scRNA-seq reference selection, all references were retrieved from the DISCO database [43]. For the marker-based methods, cell type markers were collected from their respective studies. If the marker sets were not provided by the study, the top 10 markers for each cell type were obtained using the FindMarkers function in the Seurat package. This approach has also been used by several benchmark studies [42]. Details about the target and reference scRNA-seq, scATAC-seq, and spatial omics datasets are provided in Supplementary Table 2. For all the methods, the instructions provided in their official repositories were followed with default parameters. Two quantitative assessment metrics, the micro F1 score and the adjusted Rand index (ARI), were calculated between the ground truth ($Y$) and the predicted annotation ($\hat{Y}$) as

$$F1_{micro}(Y, \hat{Y}) = \frac{2 \cdot Precision_{micro} \cdot Recall_{micro}}{Precision_{micro} + Recall_{micro}} \tag{7}$$

$$ARI(Y, \hat{Y}) = \frac{RI - E[RI]}{\max(RI) - E[RI]} \tag{8}$$

Here, RI stands for the Rand Index, which measures the similarity between the predicted annotations and the ground truth. All analyses were performed on a system with an Intel Xeon processor with 48 cores, 128 GB of RAM, and 4 GB of graphics memory.
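For readers who want to reproduce the two evaluation metrics, the following Python sketch computes them from plain label lists using the standard textbook definitions (micro-averaged F1 from pooled true/false positives and negatives, and the pair-counting form of ARI). The function names and the toy labels are illustrative, and the benchmark itself may have used a library implementation instead.

```python
from collections import Counter
from math import comb


def micro_f1(truth, pred):
    """Micro-averaged F1: pool TP, FP and FN over all labels."""
    tp = fp = fn = 0
    for label in set(truth) | set(pred):
        for t, p in zip(truth, pred):
            tp += (t == label and p == label)
            fp += (t != label and p == label)
            fn += (t == label and p != label)
    return 2 * tp / (2 * tp + fp + fn)


def adjusted_rand_index(truth, pred):
    """ARI from the contingency table of two labelings (needs >= 2 cells)."""
    n = len(truth)
    sum_cells = sum(comb(c, 2) for c in Counter(zip(truth, pred)).values())
    sum_rows = sum(comb(c, 2) for c in Counter(truth).values())
    sum_cols = sum(comb(c, 2) for c in Counter(pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)   # chance-level agreement
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:                     # degenerate: trivial partitions
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


truth = ["T", "T", "B", "B"]
pred = ["T", "T", "B", "T"]
f1 = micro_f1(truth, pred)
ari = adjusted_rand_index(truth, pred)
```

When every cell receives exactly one label, micro F1 coincides with overall accuracy, whereas ARI depends only on how cells are grouped and is unaffected by renaming the predicted labels.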
Hierarchical cell type marker and scRNA-seq reference database construction
A total of 2497 cell type markers from 28 tissue types were collected from the PanglaoDB, CellMarker, DISCO, ScType, and PCMDB databases [7, 11, 13, 43, 44]. Markers validated by multiple users in PanglaoDB and the CellMarker database were assigned weights >1. For plant cell type markers, only non-duplicated, experimentally validated markers from PCMDB were considered for database construction. Additionally, scRNA-seq reference datasets were collected from the DISCO and Azimuth databases [19, 43]. Cell type names were then standardized across both the marker and scRNA-seq reference datasets for seamless integration with the ScInfeR R package. Such standardization is missing in several databases, which causes confusion in the cell type annotation process. Compiling all the resources, we created an interactive web server using R Shiny, allowing users to visualize and download the marker sets and scRNA-seq references through both the web server and the R package [45].
Results
Overview of ScInfeR framework
ScInfeR employs a graph-based framework to annotate cell types in scRNA-seq, scATAC-seq, and spatial transcriptomics datasets. It enables cell-type annotation using a cell marker set, an scRNA-seq reference, or a combination of both (Fig. 1). From an scRNA-seq reference ($X_R$), ScInfeR uses both local and global neighbourhoods of cell types to extract the cell type-specific markers from $X_R$ (Fig. 1b). In this way, ScInfeR gives users flexibility in the choice of marker set and reference scRNA-seq data. Additionally, for seamless cell annotation, we built an open-access and interactive database of cell type markers and scRNA-seq references, ScInfeRDB. This comprehensive repository encompasses scRNA-seq references and cell-type markers.
Figure 1.
Overview of the ScInfeR workflow: (a) the ScInfeR framework takes an scRNA-seq, single-cell ATAC-sequencing (scATAC-seq), or spatial omics expression matrix as input for cell type inference; (b) a cell marker set or scRNA-seq reference matrix is used as secondary input for cell type guidance. Both of these can be retrieved from our ScInfeRDB database. When an scRNA-seq reference is supplied, ScInfeR derives the cell marker set from it; (c) ScInfeR annotates cells in three steps: building a similarity matrix, assigning cluster-level labels based on marker correlations, and refining annotations at the single-cell level using neighbourhood-weighted expression.
ScInfeR annotates cells through a three-step process. In step 1, the tool constructs a cell–cell similarity adjacency matrix ($A$) using either the gene expression data of scRNA-seq, chromatin accessibility scores of scATAC-seq, or spatial coordinates of spatial transcriptomics data. $A$ represents the underlying cellular network as a graph, where nodes are cells and edges reflect their similarity with respect to gene expression, chromatin accessibility, or position. In step 2, the tool annotates each cluster or domain by correlating the expression patterns of cluster markers with those of the cell type markers. In step 3, leveraging a framework adapted from the message-passing layer of graph neural networks (details in the Materials and methods section), the tool computes the weighted mean expression profile of each cell’s neighbouring cells. By integrating the intrinsic gene expression of each cell with the transcriptomic signals from its neighbours, ScInfeR accurately assigns each cell to its corresponding cell type (Fig. 1c).
Systematic benchmarking of ScInfeR on scRNA-seq Tabula Sapiens atlas
The performance of ScInfeR and six existing tools was initially assessed using the scRNA-seq Tabula Sapiens atlas. This atlas comprises scRNA-seq datasets from multiple human tissues with ground-truth annotations. The scRNA-seq datasets for the lungs, liver, and pancreas were retrieved and processed as described in the Materials and methods section. The lungs scRNA-seq dataset (TS-lungs) contains 12 cell types, with two cell types having five subtypes. Similarly, the liver scRNA-seq dataset (TS-liver) includes 13 cell types, and the pancreas scRNA-seq dataset (TS-pancreas) contains 15 cell types. This benchmark analysis enables us to investigate the performance of ScInfeR and other scRNA-seq-based tools across various tissue types.
On the TS-lungs dataset, among the marker-based methods, ScInfeR surpassed the other methods with an F1 score of 0.94 (Fig. 2a). ScType and SCINA were the next best-performing tools, with F1 scores of 0.93 and 0.87, respectively (Fig. 2b). In this specific benchmark, ScType’s performance is comparable with that of ScInfeR, but it did not perform well on other datasets (Fig. 3). scSorter and Garnett performed poorly, with F1 scores of 0.70 and 0.48, respectively, on this dataset (Fig. 2b). Similarly, among the reference-based methods, ScInfeR had a lower false positive rate, achieving an F1 score of 0.94 compared with SingleR and Seurat, both of which had F1 scores of 0.89 (Fig. 2c). To assess subtype identification, five subtypes were considered: CD4 T-cell, CD8 T-cell, EC-capillary, EC-microvascular, and EC-vein (Fig. 2d). Only ScInfeR and Garnett were included in the subtype analysis, as these tools are specifically designed to identify subtypes within cell types. In this analysis, ScInfeR achieved an F1 score of 0.74, outperforming Garnett, which had an F1 score of 0.24 (Fig. 2d). When predicting the EC-capillary subtype, ScInfeR classified some cells as aerocytes (labelled as other in Fig. 2d). We observed high expression of aerocyte markers in those cells (Supplementary Figure 1). It is possible that the authors of the atlas did not consider the aerocyte cell type in their annotation process. In terms of scalability, SingleR, Garnett, and scSorter required >2500 s to predict the cell types of the TS-lungs dataset. In contrast, ScType and ScInfeR required <60 s for the cell typing task (Fig. 2e). This indicates that ScInfeR is well suited to large scRNA-seq datasets with millions of cells.
Figure 2.
Quantitative assessment of ScInfeR on Tabula Sapiens lungs scRNA-seq dataset: (a) Comparison of ScInfeR predicted cell types with the ground truth annotations, visualized on the UMAP projection of the lungs scRNA-seq dataset. ScInfeR (M) and ScInfeR (R) represent the marker-based and reference-based performances, respectively. The F1 score (0.94) was calculated by comparing the tool’s predicted annotations with the ground truth annotations; (b,c) bar plot showing the F1 and Adjusted Rand Index (ARI) scores of marker-based and reference-based tools on the same dataset; (d) performance of tools (ScInfeR and Garnett) allowing subtype identification, considering only T cells and endothelial cells, as they have subtypes; (e) run time comparison of all tools on the same dataset.
Figure 3.
Quantitative assessment of ScInfeR on Tabula Sapiens liver and pancreas scRNA-seq datasets: (a) Comparison of ScInfeR predicted cell types and ground truth annotations, visualized on the UMAP projection of the liver scRNA-seq dataset. ScInfeR (M) and ScInfeR (R) represent the marker-based and reference-based performances, respectively. The F1 scores (0.88 and 0.67) were calculated by comparing the tool’s predicted annotations with the ground truth annotations; (b,c) bar plots showing the F1 and Adjusted Rand Index (ARI) scores of marker-based and reference-based tools on the liver scRNA-seq dataset; (d) comparison of ScInfeR predicted cell types and ground truth annotations, visualized on the UMAP projection of the pancreas scRNA-seq dataset; (e,f) bar plots showing the F1 and ARI scores of marker-based and reference-based tools on the pancreas scRNA-seq dataset.
On the TS-liver and TS-pancreas datasets, ScInfeR outperformed the other marker-based tools, with F1 scores of 0.88 and 0.93, respectively (Fig. 3a, d). In both datasets, SCINA and scSorter were the next best-performing methods. The F1 and ARI scores of all the tools are shown in Fig. 3b, e. In the TS-liver dataset, the other reference-based tools performed poorly, with F1 scores of 0.48 for both SingleR and Seurat. A strong batch effect between the reference and target datasets might explain this poor performance. Despite this challenge, ScInfeR outperformed the other tools, achieving an F1 score of 0.67 (Fig. 3c). Similarly, in the TS-pancreas dataset, ScInfeR identified all cell types with an F1 score of 0.84, surpassing the reference-based methods SingleR and Seurat, which had F1 scores of 0.73 and 0.72, respectively (Fig. 3f).
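The F1 and ARI scores reported above compare predicted labels with ground-truth labels. As a minimal sketch of how such scores can be computed, the Python snippet below implements a macro-averaged F1 score and the Adjusted Rand Index from two label lists. The averaging scheme (an unweighted mean over ground-truth classes) and the function names are assumptions made for illustration; the benchmark scripts may compute the scores differently.

```python
from collections import Counter
from math import comb


def macro_f1(truth, pred):
    """Unweighted mean of per-class F1 over the classes present in the ground truth.

    Macro averaging is an assumption made for this sketch.
    """
    scores = []
    for c in sorted(set(truth)):
        tp = sum(t == c and p == c for t, p in zip(truth, pred))
        fp = sum(t != c and p == c for t, p in zip(truth, pred))
        fn = sum(t == c and p != c for t, p in zip(truth, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def adjusted_rand_index(truth, pred):
    """Adjusted Rand Index computed from the contingency table of two labelings."""
    total_pairs = comb(len(truth), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two cells: the labelings agree trivially
    sum_cells = sum(comb(v, 2) for v in Counter(zip(truth, pred)).values())
    sum_rows = sum(comb(v, 2) for v in Counter(truth).values())
    sum_cols = sum(comb(v, 2) for v in Counter(pred).values())
    expected = sum_rows * sum_cols / total_pairs
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # both labelings put every cell in a single group
    return (sum_cells - expected) / (max_index - expected)


# Toy example with hypothetical labels: one T cell is mislabelled as a B cell.
truth = ["T", "T", "T", "B", "B", "NK"]
pred = ["T", "T", "B", "B", "B", "NK"]
print(round(macro_f1(truth, pred), 3), round(adjusted_rand_index(truth, pred), 3))
```

F1 rewards correct per-class labels, whereas ARI only measures agreement between the two groupings and is insensitive to label names, which is why the two scores can diverge on the same prediction.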
Application of ScInfeR to scATAC-seq data
The application of ScInfeR to scATAC-seq datasets was assessed using the PBMC dataset (GSE129785) [32]. This benchmark was conducted in two stages. First, tools that support scATAC-seq as a reference (e.g. CellCano) were evaluated. Next, methods that use scRNA-seq as a reference were tested. For the scATAC-seq reference, the PBMC scATAC-seq dataset (GSE123578) [33] was used. For the scRNA-seq reference, the PBMC scRNA-seq dataset from the DISCO database was used. The second benchmark was essential because scATAC-seq reference datasets are scarce compared with scRNA-seq references. Using scRNA-seq as a reference enhances a tool’s versatility and broadens its applicability. Furthermore, tools such as SingleR and Seurat do not consider the chromatin accessibility information from scATAC-seq, so only gene activity scores were used for their benchmarking. In the first benchmark, ScInfeR achieved the highest F1 score of 0.97, while CellCano also performed well with an F1 score of 0.95 (Fig. 4a). Although CellCano detects cell types well, its inability to use scRNA-seq references limits its usability. In the second benchmark, ScInfeR again performed well, with an F1 score of 0.95 compared with 0.88 for AtacAnnoR (Fig. 4c). When using scRNA-seq as a reference for scATAC-seq cell type annotation, only ScInfeR and AtacAnnoR incorporate chromatin accessibility information from scATAC-seq datasets. The lack of this feature in other tools results in poorer performance (Fig. 4c, d). Furthermore, in this benchmark, none of the tools except ScInfeR could clearly separate CD4 and CD8 T cells when detecting T-cell subtypes. The F1 scores and ROC scores obtained in both benchmarks are shown in Fig. 4b, d.
Figure 4.
Performance assessment of ScInfeR on scATAC-seq datasets: (a) UMAP plot of cell type inference predicted by reference-based tools that use scATAC-seq data as reference. The scores represent the F1 score obtained by comparing ground truth with the predicted cell types. (b) Bar plot of F1 and ROC scores obtained by comparing the ground truth annotations with the tool’s predicted annotations that use scATAC-seq data as a reference. (c) Cell type inference predicted by reference-based tools that use scRNA-seq data as a reference. (d) Bar plot of F1 and ROC scores obtained by comparing the ground truth annotations with the tool’s predicted annotations that use scRNA-seq data as a reference.
Spatially informed cell type annotation in spatial transcriptomics
Three spatial omics datasets were obtained from various studies, sequenced using STARmap, SeqFISH, and Visium spatial technologies [25, 34, 36]. These spatial datasets were derived from the mouse cortex, mouse embryo, and human brain. The first two spatial datasets (i.e. STARmap and SeqFISH) were benchmarked with reference-based tools. The third dataset (i.e. Visium-DLPFC) was benchmarked using marker-based tools. This marker-based benchmarking was crucial, as marker-based methodologies are less commonly utilized compared with reference-based tools in spatial transcriptomics, yet they offer distinct advantages in certain analytical scenarios. These scenarios include the availability of non-human scRNA-seq references and batch effects between the scRNA-seq and spatial datasets.
On the STARmap spatial dataset, ScInfeR outperformed other tools by a large margin, achieving an F1 score of 0.73 (Fig. 5a). SingleR and TACCO were the next best-performing methods, with F1 scores of 0.44 and 0.38, respectively. In this dataset, ScInfeR was the only method that accurately captured the spatial cell boundaries (Fig. 5a). Although TACCO and SPANN also use spatial information for cell type prediction, they did not perform well on this dataset. SPANN performed the worst, predicting most of the cells as the Smc cell type (Fig. 5a).
Figure 5.
Performance assessment of ScInfeR on spatial transcriptomics datasets: (a) spatial distribution of major cell types predicted by reference-based tools in the STARmap cortex spatial dataset. The scores represent the F1 score obtained by comparing the ground truth with the predicted annotations. The x and y axes represent the coordinates of the spatial data; (b) spatial distribution of major cell types predicted by reference-based tools in the SeqFISH embryo dataset; (c) spatial distribution of major cell types predicted by marker-based tools in the human dorsolateral prefrontal cortex dataset.
Similarly, on the SeqFISH embryo dataset, ScInfeR performed well with an F1 score of 0.80 (Fig. 5b). In this dataset as well, spatially informed cell type annotation tools did not perform well compared with scRNA-seq-based tools. Seurat and SingleR were the next best-performing tools, achieving F1 scores of 0.75 and 0.74, respectively (Fig. 5b). In the third spatial dataset (Visium-DLPFC), the marker-based tools SCINA, scSorter, and ScType were evaluated. These tools, originally developed for scRNA-seq cell type annotation, do not use the spatial information and performed poorly, with SCINA achieving an F1 score of 0.38 and scSorter 0.35. Notably, ScType predicted all cells as Layer3, with an F1 score of 0.25 and an ROC score of 0 (Fig. 5c). In contrast, ScInfeR integrates spatial gene expression and coordinates, leading to the highest F1 score of 0.77 among the evaluated methods on the Visium-DLPFC dataset (Fig. 5c).
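To illustrate why spatial coordinates can help annotation, the Python sketch below builds a k-nearest-neighbour graph from cell coordinates and performs one round of neighbour averaging on per-cell label scores, in the spirit of a single message-passing step. This is not ScInfeR’s actual algorithm: the function names, the choice of k, the mixing weight `alpha`, and the toy scores are all hypothetical and chosen only for illustration.

```python
from math import dist


def knn_indices(coords, k):
    """For each cell, return the indices of its k nearest spatial neighbours (self excluded)."""
    neighbours = []
    for i, ci in enumerate(coords):
        others = (j for j in range(len(coords)) if j != i)
        neighbours.append(sorted(others, key=lambda j: dist(ci, coords[j]))[:k])
    return neighbours


def smooth_scores(scores, neighbours, alpha=0.5):
    """Mix each cell's own label scores with the mean scores of its spatial neighbours.

    alpha is a hypothetical mixing weight: 0 ignores space, 1 uses neighbours only.
    """
    smoothed = []
    for i, own in enumerate(scores):
        nbrs = neighbours[i]
        mixed = {}
        for label, value in own.items():
            nbr_mean = sum(scores[j][label] for j in nbrs) / len(nbrs) if nbrs else value
            mixed[label] = (1 - alpha) * value + alpha * nbr_mean
        smoothed.append(mixed)
    return smoothed


def assign(scores):
    """Pick the highest-scoring label for every cell."""
    return [max(s, key=s.get) for s in scores]


# Toy data: two spatially separated groups; cell 2 has a noisy expression-only score.
coords = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
scores = [
    {"Layer1": 0.90, "Layer2": 0.10},
    {"Layer1": 0.80, "Layer2": 0.20},
    {"Layer1": 0.45, "Layer2": 0.55},
    {"Layer1": 0.10, "Layer2": 0.90},
    {"Layer1": 0.20, "Layer2": 0.80},
    {"Layer1": 0.15, "Layer2": 0.85},
]
print(assign(scores))
print(assign(smooth_scores(scores, knn_indices(coords, k=2))))
```

In this toy case the expression-only scores place cell 2 in Layer2, while the smoothed scores move it to Layer1, the label of its spatial neighbours. Tools that ignore coordinates cannot make this kind of correction.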
ScInfeR performance on plant scRNA-seq datasets
Two scRNA-seq datasets from Arabidopsis thaliana root and Oryza sativa leaf were used to assess the performance of ScInfeR. This benchmark aimed to evaluate the method’s effectiveness across diverse plant species. The root scRNA-seq dataset contained 12 cell types, while the leaf dataset had four cell types. Due to the scarcity of high-quality scRNA-seq references, only the marker-based tools were used for this benchmark. On the A. thaliana root scRNA-seq dataset, ScInfeR outperformed other tools with an F1 score of 0.84 (Supplementary Figure 5), compared with scSorter and SCINA, which achieved F1 scores of 0.80 and 0.67, respectively. Similarly, on the O. sativa scRNA-seq dataset, ScInfeR achieved an impressive F1 score of 0.86 (Supplementary Figure 6), outperforming scSorter and SCINA, which scored 0.75 and 0.77, respectively. These results demonstrate ScInfeR’s superior accuracy in cell-type annotation across plant species. The consistent performance highlights its robustness and reliability for cross-species applications.
ScInfeR performance on scRNA-seq datasets with batch effects
Batch effects are a major challenge in the analysis of scRNA-seq datasets. If batch effects are neglected during cell type annotation, the resulting technical variation can confound cell type prediction [46]. Our tool implements several steps to mitigate the batch effects present in single-cell and spatial datasets and annotate cells accurately. The performance of ScInfeR was assessed using 14 scRNA-seq datasets from the pancreas and PBMC tissue types. The batch effect among the scRNA-seq datasets was removed, and cell type annotation was performed as described in the Methods section. The UMAP projections of the pancreas and PBMC scRNA-seq datasets after batch effect removal are shown in Fig. 6a, c. On the integrated pancreas scRNA-seq dataset, ScInfeR captured nearly all cell type distributions, achieving an F1 score of 0.95 (Fig. 6b). Cell type annotation on the integrated PBMC dataset was challenging because of the high similarity among cell types and the overlap of their markers or features. Despite these challenges, ScInfeR annotated the major cell types with an F1 score of 0.80 (Fig. 6d). Notably, the tool was able to distinguish the boundaries of CD4+ T cells, CD8+ T cells, NKT cells, and NK cells (Fig. 6d) despite the high similarity of their expression patterns.
Figure 6.
Performance assessment of ScInfeR in scRNA-seq datasets with substantial batch effects: (a) UMAP plot of the integrated pancreas dataset after batch effect correction; the legend represents the techniques used to sequence the cells; (b) UMAP plot of ground truth annotations and ScInfeR predicted annotations on the integrated pancreas scRNA-seq dataset. The scores represent the F1 score obtained by comparing the ground truth with the predicted annotations. (c) UMAP plot of the integrated PBMC dataset after batch effect correction; the legend represents the study from which each dataset was retrieved; (d) UMAP plot of ground truth annotations and ScInfeR predicted annotations on the integrated PBMC scRNA-seq dataset.
Discussion
Accurate cell type identification is a crucial prerequisite for the downstream analysis of scRNA-seq, scATAC-seq, and spatial omics datasets. Currently, most cell-type annotation tools annotate cells using either a marker-based or a reference-based approach. Marker-based tools require a cell type marker set, while reference-based tools require a reference scRNA-seq expression matrix for cell type annotation in single-cell or spatial datasets. Both approaches face several challenges, including cell-type marker selection, marker specificity, and scRNA-seq reference selection. Further, few tools are available to annotate scATAC-seq and spatial omics datasets. Combining both approaches can enhance annotation accuracy by leveraging their complementary strengths. To overcome these limitations, we propose ScInfeR, a cell-type annotation toolkit that can annotate scRNA-seq, scATAC-seq, and spatial omics datasets using a marker-based approach, a reference-based approach, or a combination of both.
ScInfeR implements several improvements compared with the existing tools. It can annotate a wide range of datasets, including scRNA-seq, scATAC-seq, and spatial omics datasets generated from multiple technologies (e.g. STARmap, Visium). Additionally, the tool supports weighted markers, allowing users to define the importance of markers in cell type assignment. Another key feature is subtype identification, which most other tools lack. In our benchmarking analysis, ScInfeR outperformed existing tools on lung, pancreas, and liver scRNA-seq datasets from the Tabula Sapiens atlas. For subtype assignments, ScInfeR captured the hierarchical cell assignment more clearly than Garnett, the only other benchmarked tool supporting subtype assignments. Similarly, in scATAC-seq cell type assignment, ScInfeR shows promising results compared with the tools specifically designed for scATAC-seq datasets. Furthermore, in spatial omics datasets, ScInfeR precisely annotated cells/spots by considering both spatial information and the gene expression matrix.
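The idea of weighted positive and negative markers can be sketched in a few lines of Python. In the illustration below, positive markers add to a cell type’s score in proportion to their weight and expression, while negative markers subtract from it. The scoring formula, the function names, the weights, and the expression values are hypothetical and are not taken from the ScInfeR implementation.

```python
def marker_score(expression, positive, negative):
    """Weighted marker score: positive markers add evidence, negative markers subtract it.

    expression maps gene -> expression value for one cell; positive and negative
    map gene -> user-defined weight. The normalization by total weight is an
    assumption made for this sketch.
    """
    total_weight = sum(positive.values()) + sum(negative.values())
    if total_weight == 0:
        return 0.0
    pos = sum(w * expression.get(gene, 0.0) for gene, w in positive.items())
    neg = sum(w * expression.get(gene, 0.0) for gene, w in negative.items())
    return (pos - neg) / total_weight


def annotate(expression, marker_sets):
    """Return the best-scoring cell type and the score of every candidate type."""
    scores = {
        cell_type: marker_score(expression, m["positive"], m["negative"])
        for cell_type, m in marker_sets.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical marker set: the subtype-defining gene gets a higher weight than CD3D.
marker_sets = {
    "CD4 T cell": {"positive": {"CD3D": 1.0, "CD4": 2.0}, "negative": {"CD8A": 1.0}},
    "CD8 T cell": {"positive": {"CD3D": 1.0, "CD8A": 2.0}, "negative": {"CD4": 1.0}},
}
cell = {"CD3D": 2.0, "CD4": 1.5, "CD8A": 0.1}  # toy expression values
label, scores = annotate(cell, marker_sets)
print(label, scores)
```

Giving the subtype-defining gene a larger weight than a marker shared by both subtypes lets the shared marker support the broad lineage without masking the difference between closely related subtypes.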
For seamless cell type annotation, we constructed ScInfeRDB, a hierarchical cell marker database. The database comprises 28 tissue types, 329 cell types, and 2497 gene markers. It also includes scRNA-seq references collected from multiple tissue types. Cell type names are standardized across both the marker and scRNA-seq reference datasets to ensure seamless integration with the ScInfeR R package. This feature is lacking in several databases, leading to confusion during the cell type annotation process.
Some considerations are advised for the optimal performance and accuracy of ScInfeR. The scRNA-seq reference and marker set should encompass a broad spectrum of cell types, ideally from the same species. Harmony projections are recommended for adjacency matrix construction when using multiple scRNA-seq references or scRNA-seq datasets with batch effects. Future development of ScInfeR is planned to incorporate more single-cell technologies, including single-cell DNA methylation and single-cell proteomics datasets.
In conclusion, ScInfeR is a cell type annotation toolkit that offers a wide range of modules for cell type assignment in scRNA-seq, scATAC-seq, and spatial transcriptomics. ScInfeR demonstrates superior performance across various tissue types and sequencing technologies compared with state-of-the-art tools. The ScInfeR toolkit, combined with ScInfeRDB, provides a more comprehensive and accurate solution for the labour-intensive cell annotation process.
Key Points
A comprehensive framework for annotating cell types across scRNA-seq, scATAC-seq, and spatial omics datasets.
A hybrid cell-type annotation strategy that enables annotation using a marker set, scRNA-seq reference, or both.
Our method enables hierarchical subtype identification, a feature missing in most other methods.
ScInfeR supports weighted markers, allowing users to define marker importance in cell-type classification.
We developed ScInfeRDB, an interactive and open-access hierarchical cell marker database encompassing 28 tissue types, 329 cell types, and 2497 gene markers for human and plant (https://www.swainasish.in/scinfer).
Supplementary Material
Contributor Information
Asish Kumar Swain, Department of Bioscience and Bioengineering, Indian Institute of Technology (IIT), N.H. 62, Nagaur Road, Karwar, Jodhpur 342030, Rajasthan, India.
Rajveer Singh Shekhawat, Department of Bioscience and Bioengineering, Indian Institute of Technology (IIT), N.H. 62, Nagaur Road, Karwar, Jodhpur 342030, Rajasthan, India.
Pankaj Yadav, Department of Bioscience and Bioengineering, Indian Institute of Technology (IIT), N.H. 62, Nagaur Road, Karwar, Jodhpur 342030, Rajasthan, India; School of Artificial Intelligence and Data Science, Indian Institute of Technology (IIT), N.H. 62, Nagaur Road, Karwar, Jodhpur 342030, Rajasthan, India.
Author contributions
Asish Kumar Swain (Primary data analyses, Data curation, Software development, Writing original draft), Rajveer Singh Shekhawat (Data curation, Writing original draft), and Pankaj Yadav (Conceptualization, Supervision, Writing original draft); all authors contributed by comments and approved the final manuscript.
Conflict of interest
The authors declare that they have no conflict of interest.
Funding
This work was partly supported by GenomeIndia grant from the Department of Biotechnology, Ministry of Science and Technology, Government of India (project number: BT/GenomeIndia/2018) and the Ministry of Education, Government of India.
Data availability
All the datasets used in this study are publicly available. Tabula Sapiens atlas scRNA-seq datasets were obtained from the Figshare repository (https://figshare.com/articles/dataset/Tabula_Sapiens_release_1_0/14267219). The scATAC-seq datasets were obtained from NCBI GEO under accession IDs GSE129785 and GSE123578. The STARmap cortex spatial dataset was retrieved from http://starmapresources.org/data. The spatial human DLPFC dataset was retrieved from the LIBD database (https://research.libd.org/spatialLIBD/). The spatial mouse embryo dataset was obtained from https://crukci.shinyapps.io/SpatialMouseAtlas/. The PBMC, pancreas, lungs, and liver reference datasets were obtained from the DISCO database (https://www.immunesinglecell.org/). All plant scRNA-seq datasets were retrieved from scPlantDB (https://biobigdata.nju.edu.cn/scplantdb/home).
The ScInfeR R package and the scripts used for the benchmarking analysis can be found on GitHub at https://github.com/swainasish/ScInfeR. Comprehensive documentation and tutorials for our tool and database are available at https://www.swainasish.in/scinfer.
References
- 1. Aldridge S, Teichmann SA. Single cell transcriptomics comes of age. Nat Commun 2020;11:4307. 10.1038/s41467-020-18158-5
- 2. Li X, Wang CY. From bulk, single-cell to spatial RNA sequencing. Int J Oral Sci 2021;13:36. 10.1038/s41368-021-00146-0
- 3. Chen H, Lareau C, Andreani T. et al. Assessment of computational methods for the analysis of single-cell ATAC-seq data. Genome Biol 2019;20:1–25. 10.1186/s13059-019-1854-5
- 4. Rao A, Barkley D, França GS. et al. Exploring tissue architecture using spatial transcriptomics. Nature 2021;596:211–20. 10.1038/s41586-021-03634-9
- 5. Saliba AE, Westermann AJ, Gorski SA. et al. Single-cell RNA-seq: advances and future challenges. Nucleic Acids Res 2014;42:8845–60. 10.1093/nar/gku555
- 6. Yuan L, Sun S, Jiang Y. et al. scRGCL: a cell type annotation method for single-cell RNA-seq data using residual graph convolutional neural network with contrastive learning. Brief Bioinform 2025;26:bbae662.
- 7. Ianevski A, Giri AK, Aittokallio T. Fully-automated and ultra-fast cell-type identification using specific marker combinations from single-cell transcriptomic data. Nat Commun 2022;13:1246. 10.1038/s41467-022-28803-w
- 8. Zhang Z, Luo D, Zhong X. et al. SCINA: a semi-supervised subtyping algorithm of single cells and bulk samples. Genes 2019;10:531. 10.3390/genes10070531
- 9. Yang F, Wang W, Wang F. et al. scBERT as a large-scale pretrained deep language model for cell type annotation of single-cell RNA-seq data. Nat Mach Intell 2022;4:852–66. 10.1038/s42256-022-00534-z
- 10. Huang Q, Liu Y, Du Y. et al. Evaluation of cell type annotation R packages on single-cell RNA-seq data. Genomics Proteomics Bioinformatics 2021;19:267–81. 10.1016/j.gpb.2020.07.004
- 11. Franzén O, Gan LM, Björkegren JL. PanglaoDB: a web server for exploration of mouse and human single-cell RNA sequencing data. Database 2019;2019:baz046. 10.1093/database/baz046
- 12. Quan F, Liang X, Cheng M. et al. Annotation of cell types (ACT): a convenient web server for cell type annotation. Genome Med 2023;15:91. 10.1186/s13073-023-01249-5
- 13. Hu C, Li T, Xu Y. et al. CellMarker 2.0: an updated database of manually curated cell markers in human/mouse and web tools based on scRNA-seq data. Nucleic Acids Res 2023;51:D870–6. 10.1093/nar/gkac947
- 14. Liu H, Li H, Sharma A. et al. scAnno: a deconvolution strategy-based automatic cell type annotation tool for single-cell RNA-sequencing data sets. Brief Bioinform 2023;24:bbad179.
- 15. Pruszak J, Sonntag KC, Aung MH. et al. Markers and methods for cell sorting of human embryonic stem cell-derived neural cell populations. Stem Cells 2007;25:2257–68. 10.1634/stemcells.2006-0744
- 16. Kelly OG, Chan MY, Martinson LA. et al. Cell-surface markers for the isolation of pancreatic cell types derived from human embryonic stem cells. Nat Biotechnol 2011;29:750–6. 10.1038/nbt.1931
- 17. Aran D, Looney AP, Liu L. et al. Reference-based analysis of lung single-cell sequencing reveals a transitional profibrotic macrophage. Nat Immunol 2019;20:163–72. 10.1038/s41590-018-0276-y
- 18. Hou R, Denisenko E, Forrest AR. scMatch: a single-cell gene expression profile annotation tool using reference datasets. Bioinformatics 2019;35:4688–95. 10.1093/bioinformatics/btz292
- 19. Satija R, Farrell JA, Gennert D. et al. Spatial reconstruction of single-cell gene expression data. Nat Biotechnol 2015;33:495–502. 10.1038/nbt.3192
- 20. Zhai Y, Chen L, Deng M. scBOL: a universal cell type identification framework for single-cell and spatial transcriptomics data. Brief Bioinform 2024;25:bbae188.
- 21. Pliner HA, Shendure J, Trapnell C. Supervised classification enables rapid annotation of cell atlases. Nat Methods 2019;16:983–6. 10.1038/s41592-019-0535-3
- 22. Guo H, Li J. scSorter: assigning cells to known cell types according to marker genes. Genome Biol 2021;22:69. 10.1186/s13059-021-02281-7
- 23. Tian L, Xie Y, Xie Z. et al. AtacAnnoR: a reference-based annotation tool for single cell ATAC-seq data. Brief Bioinform 2023;24:bbad268. 10.1093/bib/bbad268
- 24. Ma W, Lu J, Wu H. Cellcano: supervised cell type identification for single cell ATAC-seq data. Nat Commun 2023;14:1864. 10.1038/s41467-023-37439-3
- 25. Yuan M, Wan H, Wang Z. et al. SPANN: annotating single-cell resolution spatial transcriptome data with scRNA-seq data. Brief Bioinform 2024;25:bbad533.
- 26. Mages S, Moriel N, Avraham-Davidi I. et al. TACCO unifies annotation transfer and decomposition of cell identities for single-cell and spatial omics. Nat Biotechnol 2023;41:1465–73. 10.1038/s41587-023-01657-3
- 27. Cao ZJ, Wei L, Lu S. et al. Searching large-scale scRNA-seq databases via unbiased cell embedding with cell BLAST. Nat Commun 2020;11:3458. 10.1038/s41467-020-17281-7
- 28. Stuart T, Srivastava A, Madad S. et al. Single-cell chromatin state analysis with Signac. Nat Methods 2021;18:1333–41. 10.1038/s41592-021-01282-5
- 29. Granja JM, Corces MR, Pierce SE. et al. ArchR is a scalable software package for integrative single-cell chromatin accessibility analysis. Nat Genet 2021;53:403–11. 10.1038/s41588-021-00790-6
- 30. Wolf FA, Angerer P, Theis FJ. SCANPY: large-scale single-cell gene expression data analysis. Genome Biol 2018;19:1–5. 10.1186/s13059-017-1382-0
- 31. The Tabula Sapiens Consortium, Jones RC, Karkanias J. et al. The Tabula Sapiens: a multiple-organ, single-cell transcriptomic atlas of humans. Science 2022;376:eabl4896.
- 32. Satpathy AT, Granja JM, Yost KE. et al. Massively parallel single-cell chromatin landscapes of human immune cell development and intratumoral T cell exhaustion. Nat Biotechnol 2019;37:925–36. 10.1038/s41587-019-0206-z
- 33. Lareau CA, Duarte FM, Chew JG. et al. Droplet-based combinatorial indexing for massive-scale single-cell chromatin accessibility. Nat Biotechnol 2019;37:916–24. 10.1038/s41587-019-0147-6
- 34. Wang X, Allen WE, Wright MA. et al. Three-dimensional intact-tissue sequencing of single-cell transcriptional states. Science 2018;361:eaat5691.
- 35. Lohoff T, Ghazanfar S, Missarova A. et al. Integration of spatial and single-cell transcriptomic data elucidates mouse organogenesis. Nat Biotechnol 2022;40:74–85. 10.1038/s41587-021-01006-2
- 36. Maynard KR, Collado-Torres L, Weber LM. et al. Transcriptome-scale spatial gene expression in the human dorsolateral prefrontal cortex. Nat Neurosci 2021;24:425–36. 10.1038/s41593-020-00787-0
- 37. De Donno C, Hediyeh-Zadeh S, Moinfar AA. et al. Population-level integration of single-cell datasets enables multi-scale analysis across samples. Nat Methods 2023;20:1683–92. 10.1038/s41592-023-02035-2
- 38. Shulse C, Cole B, Ciobanu D. et al. High-throughput single-cell transcriptome profiling of plant cell types. Cell Rep 2019;27:2241–7. 10.1016/j.celrep.2019.04.054
- 39. He Z, Luo Y, Zhou X. et al. scPlantDB: a comprehensive database for exploring cell types and markers of plant cell atlases. Nucleic Acids Res 2024;52:D1629–38. 10.1093/nar/gkad706
- 40. Korsunsky I, Millard N, Fan J. et al. Fast, sensitive and accurate integration of single-cell data with harmony. Nat Methods 2019;16:1289–96. 10.1038/s41592-019-0619-0
- 41. Swain AK, Pandit V, Sharma J. et al. SpatialPrompt: spatially aware scalable and accurate tool for spot deconvolution and domain identification in spatial transcriptomics. Commun Biol 2024;7:639. 10.1038/s42003-024-06349-5
- 42. Sun X, Lin X, Li Z. et al. A comprehensive comparison of supervised and unsupervised methods for cell type identification in single-cell RNA-seq. Brief Bioinform 2022;23:bbab567. 10.1093/bib/bbab567
- 43. Li M, Zhang X, Ang KS. et al. DISCO: a database of deeply integrated human single-cell omics data. Nucleic Acids Res 2022;50:D596–602. 10.1093/nar/gkab1020
- 44. Jin J, Lu P, Xu Y. et al. PCMDB: a curated and comprehensive resource of plant cell markers. Nucleic Acids Res 2022;50:D1448–55. 10.1093/nar/gkab949
- 45. Chang W, Cheng J, Allaire J, Sievert C, Schloerke B, Xie Y, et al. Shiny: Web Application Framework for R; 2024. R Package Version 1.8.1.1. Available from: https://CRAN.R-project.org/package=shiny (6 February 2025, date last accessed).
- 46. Li X, Wang K, Lyu Y. et al. Deep learning enables accurate clustering with batch effect removal in single-cell RNA-seq analysis. Nat Commun 2020;11:2338. 10.1038/s41467-020-15851-3