Briefings in Bioinformatics. 2025 Jun 5;26(3):bbaf253. doi: 10.1093/bib/bbaf253

ScInfeR: an efficient method for annotating cell types and sub-types in single-cell RNA-seq, ATAC-seq, and spatial omics

Asish Kumar Swain, Rajveer Singh Shekhawat, Pankaj Yadav
PMCID: PMC12140018  PMID: 40471991

Abstract

Cell-type annotation remains a major challenge in single-cell and spatial omics analysis. Most existing methods rely on either single-cell RNA sequencing (scRNA-seq) references or predefined marker sets. However, high-quality scRNA-seq references and marker sets are scarce, so relying on a single approach is prone to bias and limits usability. Furthermore, available methods for cell-type annotation in single-cell ATAC-sequencing (scATAC-seq) and spatial transcriptomics datasets perform poorly. Here, we present ScInfeR, a graph-based cell-type annotation method that combines information from both scRNA-seq references and marker sets. By integrating these two data sources, ScInfeR can accurately annotate a broad range of cell types. It employs a hierarchical framework inspired by message-passing layers in graph neural networks to accurately identify cell subtypes. ScInfeR is highly versatile, supporting cell annotation across scRNA-seq, scATAC-seq, and spatial omics datasets. For scATAC-seq, it utilizes chromatin accessibility data, while for spatial transcriptomics, it incorporates spatial coordinate information. Additionally, ScInfeR supports weighted positive and negative markers, allowing users to define marker importance in cell-type classification. Our extensive benchmarking across multiple atlas-scale scRNA-seq, scATAC-seq, and spatial datasets, evaluating 10 existing tools in over 100 cell-type prediction tasks, demonstrated ScInfeR’s superior performance. Notably, it is robust to batch effects arising in these datasets. To facilitate seamless annotation, we developed ScInfeRDB, an interactive database containing manually curated scRNA-seq references and marker sets for 329 cell types, covering 2497 gene markers in 28 tissue types from human and plant. ScInfeR is available as an R package, with both the tool and database publicly accessible at https://www.swainasish.in/scinfer.

Keywords: cell type annotation, scRNA-seq, scATAC-seq, spatial transcriptomics

Introduction

Advances in sequencing technologies provide unparalleled opportunities to study cellular heterogeneity, gene regulation, and the spatial tissue architecture of complex biological systems [1–3]. Single-cell RNA sequencing (scRNA-seq), single-cell ATAC sequencing (scATAC-seq), and spatial transcriptomics are among the most powerful techniques developed to characterize the transcriptomic, epigenomic, and spatial landscape of cells, respectively [1–4]. Briefly, scRNA-seq allows the expression pattern of genes to be analysed at the level of individual cells [5], scATAC-seq provides insights into chromatin accessibility of the genome at single-cell resolution [3], and spatial transcriptomics retains spatial context while analysing the molecular features of tissue samples [4]. Accurate identification of cell types is crucial for downstream analysis of single-cell and spatial datasets. Cell-type annotation can be performed manually or through automated methods. Manual cell-type annotation is labour-intensive, expert-dependent, and not scalable to large datasets [6]. In contrast, automated cell-type annotation methods scale to large datasets and are less susceptible to human error [7–9]. Automated methods can be broadly classified into two categories: marker-based and reference-based [10]. Marker-based methods extract cell-type-specific markers from the literature or from cell-type marker databases such as PanglaoDB [11], the ACT database [12], and the CellMarker database [13], and then classify cells based on the expression levels of these markers [14]. Traditionally, these markers have been identified by isolating specific cell types using cell sorting and microscopic techniques [15, 16] and then analysing the molecular characteristics of the isolated cells [7]. Conversely, reference-based methods rely on an scRNA-seq reference instead of marker sets. These methods transfer cell annotations to the target dataset by correlating its gene expression profiles with those of the scRNA-seq reference dataset [9, 17–20].

Recently, several cell-type annotation methods have been developed for scRNA-seq, scATAC-seq, and spatial omics datasets. Most of these rely on either marker-based or reference-based approaches. Marker-based methods, including SCINA [8], ScType [7], Garnett [21], and scSorter [22], as well as reference-based tools, such as SingleR [17] and Seurat [19], have been developed to annotate cell types in scRNA-seq datasets. Methods such as AtacAnnoR [23] and CellCano [24] were explicitly designed for annotation of scATAC-seq datasets. Similarly, spatially aware cell annotation tools such as SPANN [25] and TACCO [26] have been developed for spatial omics datasets. Most marker-based methods assume that marker gene sets exhibit higher expression in the corresponding cell type. SCINA builds on this assumption using a Gaussian mixture model [8]. ScType uses positive and negative marker sets to categorize user-defined clusters; the negative marker set penalizes a cluster with high expression of negative marker genes [7]. Garnett uses a generalized linear machine learning approach to identify cell types and their associated subtypes in a hierarchical manner [21]. scSorter combines information from user-defined marker genes and highly variable genes to annotate scRNA-seq datasets [22]. The main drawback of widely used marker-based tools (e.g. SCINA, ScType) is their dependence on the quality of cell-type-specific marker sets and their lack of support for subtype identification [7, 8]. For example, the T-cell subtype markers reported in the CellMarker [13] and ScTypeDB [7] databases overlap heavily, which can lead to incorrect cell type classification. Biologically, several cell types comprise multiple subtypes. Because markers overlap among these subtypes, accurately identifying them at the cluster level is highly challenging. Methods like ScType [7] and scSorter [22], which perform cell annotation at the cluster level, often struggle to distinguish closely related subtypes, leading to reduced accuracy. Among existing methods, only Garnett supports subtype classification. However, Garnett’s performance depends heavily on the quality of the training data, and inadequate training data can lead to poor classification outcomes. This highlights the need for a method that enables hierarchical subtype classification with greater accuracy and robustness.

Among the reference-based methods, Seurat uses canonical correlation and SingleR uses Spearman correlation to identify cell types from a well-annotated scRNA-seq reference dataset [17, 19]. Among the methods designed for annotating scATAC-seq datasets, CellCano uses a combination of a multi-layer perceptron and knowledge distillation to identify cell types using a scATAC-seq reference dataset [24]. As fewer scATAC-seq reference datasets are available than scRNA-seq datasets, AtacAnnoR uses a combine-and-discard strategy to annotate scATAC-seq datasets using an scRNA-seq dataset as a reference [23]. Among spatial transcriptomics annotation methods, SPANN uses a coupled variational autoencoder and TACCO uses an optimal transport model to annotate spatial spots/cells using scRNA-seq as a reference [25, 26]. However, good-quality reference scRNA-seq datasets comprising a wide range of cell types are rare [27]. Consequently, if a cell type in the target dataset is not included in the reference dataset, predictions can be inaccurate. Both marker-based and reference-based approaches therefore have strengths and weaknesses. Integrating scRNA-seq reference data with marker sets offers a more comprehensive strategy that can accurately annotate a broad range of cell types from both sources, improving the usability and robustness of cell-type annotation. To the best of our knowledge, no hybrid approach currently exists for cell typing in single-cell and spatial technologies.

To address these challenges, we propose a hybrid cell-type annotation toolkit named ScInfeR (Single Cell-type Inference toolkit using R). ScInfeR can annotate cells in scRNA-seq, scATAC-seq, and spatial omics datasets using user-defined marker sets, scRNA-seq references, or both. This combined strategy leverages the complementary strengths of marker- and reference-based annotation to identify novel or missing cell types. By integrating cell types from both sources, this dual-layered framework enables the annotation of a broader range of cell types, improving the robustness and adaptability of cell-type identification across diverse datasets and species, which is lacking in existing tools. ScInfeR assigns cell types in two rounds of annotation. First, the tool annotates the cell clusters by correlating the cluster-specific markers with the cell-type-specific markers in the cell–cell similarity graph. These cell-type-specific marker genes can be user-defined, extracted by ScInfeR from scRNA-seq reference data, or a combination of both. When an scRNA-seq reference is used, ScInfeR extracts cell-type markers by considering both the global and local specificity of markers. In the second round, ScInfeR annotates subtypes and clusters containing multiple cell types in a hierarchical manner. In this step, it uses a framework adapted from the message-passing layer in graph neural networks to annotate each cell individually. Additionally, our method supports weighted and negative markers, so users can specify the importance of each marker in the cell type classification task. Our comprehensive benchmarking analysis across various scRNA-seq, scATAC-seq, and spatial omics datasets demonstrated that ScInfeR outperformed existing tools in both accuracy and sensitivity. The tool can also accurately annotate datasets that exhibit substantial batch effects. We also built an open-access, interactive hierarchical cell marker database, ScInfeRDB (https://www.swainasish.in/scinfer), comprising 28 tissue types, 329 cell types, and 2497 gene markers. This cell-marker database can be integrated into our toolkit for seamless cell-type annotation.

Materials and methods

Overview of datasets

We used 24 scRNA-seq, two scATAC-seq, and three spatial omics datasets to assess the performance of ScInfeR. All scRNA-seq datasets were pre-processed using the Seurat package [19]. scATAC-seq datasets were processed using the Signac [28] and ArchR [29] packages. Spatial datasets were jointly processed using the Seurat and Scanpy packages [30]. Detailed information about all datasets is provided in Supplementary Table 1, and a brief summary of each is given below:

Tabula Sapiens atlas scRNA-seq dataset

This scRNA-seq atlas comprises scRNA-seq datasets of multiple human tissues [31]. We retrieved the human lung, pancreas, and liver scRNA-seq datasets for the downstream analysis. The cell-type annotations provided in the Tabula Sapiens atlas were used as ground truth for benchmarking with other tools.

scATAC-seq datasets

Two peripheral blood mononuclear cell (PBMC) scATAC-seq datasets were retrieved from the NCBI-GEO with accession IDs GSE129785 [32] and GSE123578 [33]. Both datasets include a range of cell types, such as B cells, monocytes, natural killer cells, CD8 T-cells, and CD4 T-cells. These datasets were obtained using cell sorting, so the annotations from cell sorting are considered ground truth for benchmarking.

Spatial transcriptomics datasets

Three spatial transcriptomics datasets were used for the performance assessment of ScInfeR. The first spatial transcriptomic dataset was retrieved from the mouse cortical region. This dataset was sequenced using the STARmap technology [34]. The second spatial dataset was retrieved from a developmental mouse embryo. This dataset was profiled using the SeqFISH technique. We also retrieved the corresponding scRNA-seq data at the same developmental time point [35]. The last spatial dataset was obtained from the human prefrontal cortex region, which was sequenced using the 10X Visium technique [36].

scRNA-seq datasets with batch-effects

A total of 14 scRNA-seq datasets were retrieved from multiple studies of pancreas and PBMC tissues. The batch-wise concatenated matrix was obtained from De Donno et al. [37]. The integrated pancreas dataset comprised 16 382 cells and 13 cell types and was created by integrating eight batches of scRNA-seq datasets. The integrated PBMC dataset comprised 33 506 cells and 16 cell types and was created by integrating five batches of scRNA-seq datasets.

Plants scRNA-seq datasets

Two scRNA-seq datasets from Arabidopsis thaliana root and Oryza sativa leaf were used to assess the performance of ScInfeR. The A. thaliana root scRNA-seq dataset contained 28 183 cells across 12 cell types, while the O. sativa leaf scRNA-seq dataset comprised 24 264 cells across four cell types [38, 39].

Cell type-specific marker identification from scRNA-seq references

For reference-based cell type annotation, ScInfeR requires a reference scRNA-seq matrix ($X_{ref}$) with cell type annotations and a target scRNA-seq matrix ($X_{tar}$). Traditionally, cell type markers are extracted from $X_{ref}$ using the one-versus-all approach (as in the Seurat and Scanpy packages), where gene expression is compared at the global scale only. This approach does not take into account the local specificity of the markers. As a result, genes that are specific only at the local level are dominated by markers with high specificity at the global level. For example, markers of T-cell subtypes, such as CD4 and CD8 T cells, are dominated by general T-cell markers. To overcome this issue, our tool estimates the cell type specificity of every gene on a global scale ($AUC_{global}$) and on a local scale ($AUC_{local}$). The area under the ROC curve ($AUC$) score ranges between 0 and 1 and reflects the specificity of a gene to a cell type; a higher score indicates better specificity. First, the tool calculates the $AUC_{global}$ of all genes in $X_{ref}$ using the traditional one-versus-all approach. Next, for $AUC_{local}$ estimation, the tool identifies the $k_{ct}$ most highly correlated cell types for each cell type within the cell–cell similarity adjacency matrix ($A$). $A$ is derived from the UMAP or PCA projections of the reference scRNA-seq data. For multiple scRNA-seq references, Harmony projections are recommended [40]. Details of the construction of $A$ are discussed in the next steps. The $AUC_{local}$ is then estimated for all genes by comparing the query cell type with its $k_{ct}$ most highly correlated cell types. Finally, a combined score $AUC_{combined}$ is estimated for all genes by

$$AUC_{combined} = w_{global} \cdot AUC_{global} + w_{local} \cdot AUC_{local} \quad (1)$$

Here, $w_{global}$ represents the weight assigned to the $AUC_{global}$, and $w_{local}$ represents the weight assigned to the $AUC_{local}$. By default, both $w_{global}$ and $w_{local}$ are set to 0.5 so that equal weight is given to the global and local specificity of the gene. In the final step, the top $n$ genes with the highest $AUC_{combined}$ are selected for each cell type for downstream analysis. High-quality marker selection is based on two parameters: the number of top genes ($n$) and the $AUC_{combined}$ score. By default, $n$ is set to 15 and the $AUC_{combined}$ threshold to 0.75. When fewer than $n$ genes fulfil the threshold condition ($AUC_{combined} > 0.75$), only the genes with a score >0.75 are selected. Users can adjust the $AUC_{combined}$ threshold to refine the marker selection: increasing it (e.g. to 0.85) selects more specific markers, while decreasing it includes a broader range of genes. Additionally, the number of top genes ($n$) can be modified to create either a smaller, more precise marker set or a larger, more comprehensive one, depending on the tissue type and gene expression overlap across cell types.
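The scoring and selection rule described above can be illustrated with a minimal Python sketch. It is not the ScInfeR R implementation: the function names (`auc`, `combined_auc`, `select_markers`), the pairwise AUC estimator, and the toy expression values are assumptions introduced here for illustration only. The default weights (0.5 each), the top-gene count (15), and the 0.75 threshold follow the text.

```python
from typing import Dict, List, Sequence


def auc(pos: Sequence[float], neg: Sequence[float]) -> float:
    """Pairwise AUC: fraction of (pos, neg) pairs where pos > neg; ties count 0.5."""
    if not pos or not neg:
        raise ValueError("both groups must be non-empty")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def combined_auc(expr: Sequence[float], labels: Sequence[str], cell_type: str,
                 neighbour_types: Sequence[str],
                 w_global: float = 0.5, w_local: float = 0.5) -> float:
    """Weighted sum of global (one-versus-all) and local (versus similar types) AUC."""
    pos = [e for e, l in zip(expr, labels) if l == cell_type]
    neg_global = [e for e, l in zip(expr, labels) if l != cell_type]
    neg_local = [e for e, l in zip(expr, labels) if l in neighbour_types]
    return w_global * auc(pos, neg_global) + w_local * auc(pos, neg_local)


def select_markers(scores: Dict[str, float], n_top: int = 15,
                   threshold: float = 0.75) -> List[str]:
    """Keep at most n_top genes whose combined score exceeds the threshold."""
    passing = sorted((g for g in scores if scores[g] > threshold),
                     key=lambda g: scores[g], reverse=True)
    return passing[:n_top]


# Toy reference: two T-cell subtypes and B cells (hypothetical values).
labels = ["CD4 T", "CD4 T", "CD8 T", "CD8 T", "B", "B"]
general_t_marker = [5.0, 5.0, 5.0, 5.0, 0.0, 0.0]   # high in all T cells
cd4_marker = [3.0, 3.0, 0.0, 0.0, 0.0, 0.0]         # high in CD4 T cells only

scores = {
    "general_T": combined_auc(general_t_marker, labels, "CD4 T", ["CD8 T"]),
    "CD4_specific": combined_auc(cd4_marker, labels, "CD4 T", ["CD8 T"]),
}
print(scores)
print(select_markers(scores))
```

In this toy case the general T-cell gene scores well globally but only 0.5 locally against CD8 T cells, so its combined score falls below the threshold, whereas the subtype-specific gene is retained.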

ScInfeR framework for the cell-type and subtype identification for the scRNA-seq dataset

ScInfeR infers cell type annotations of the target scRNA-seq dataset ($X_{tar}$) in three steps. First, the cell–cell similarity adjacency matrix ($A$) is constructed from the UMAP or PCA projections. In the second step, cluster-level cell annotation is performed by combining information from $A$ with cell type-specific markers. In the third step, clusters containing multiple cell types and subtypes are annotated at the individual cell level. Details of all three steps are described below:

Step1: Cell–cell similarity adjacency matrix ($A$) construction

To construct the cell–cell similarity adjacency matrix, the $k$ nearest neighbours of each cell are identified in the UMAP or PCA projections using a nearest-neighbour search module. If the dataset has multiple batches, Harmony [40] projections can be used for the adjacency matrix construction. The module efficiently identifies neighbours by segmenting the search space into smaller areas using hyperplanes and then identifying the cells most likely to be nearest neighbours. Next, $A$ is constructed using the following condition:

$$A_{ij} = \begin{cases} 1, & \text{if cell } j \text{ is among the } k \text{ nearest neighbours of cell } i \\ 0, & \text{otherwise} \end{cases} \quad (2)$$

Here, $A_{ij} = 1$ if cell $j$ is among the $k$ nearest neighbours of cell $i$, and 0 otherwise.
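As a rough illustration of the adjacency rule above, the following Python sketch builds a binary k-nearest-neighbour matrix from low-dimensional coordinates. It uses an exact brute-force Euclidean search rather than the hyperplane-based search described in the text, and the function name `knn_adjacency` and the toy coordinates are assumptions made for this example.

```python
import math
from typing import List, Sequence


def knn_adjacency(coords: Sequence[Sequence[float]], k: int) -> List[List[int]]:
    """Return A with A[i][j] = 1 if cell j is among the k nearest neighbours of cell i."""
    n = len(coords)
    if not 0 < k < n:
        raise ValueError("k must satisfy 0 < k < number of cells")
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        # Exact search: distances from cell i to every other cell, closest first.
        dists = sorted((math.dist(coords[i], coords[j]), j)
                       for j in range(n) if j != i)
        for _, j in dists[:k]:
            adj[i][j] = 1
    return adj


# Four cells on a line in a hypothetical 2D embedding: two tight pairs.
embedding = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
A = knn_adjacency(embedding, k=1)
for row in A:
    print(row)
```

Note that a matrix built this way is not necessarily symmetric: cell j can be a nearest neighbour of cell i without the reverse being true.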

Step2: cluster-level cell annotation

In this step, for each cluster in $X_{tar}$, cluster-specific gene markers ($M_c$) are identified by calculating the $AUC$ as described above. These cluster-specific markers ($M_c$) exhibit very high specificity to their respective clusters. Similarly, cell type-specific markers ($M_t$), either user-defined or calculated from $X_{ref}$, have higher expression in the corresponding cell types. If a high correlation is observed between the expression patterns of $M_c$ and $M_t$ within a cluster, the cluster likely belongs to that cell type. Based on this assumption, the cosine similarity between $M_c$ and $M_t$ is calculated for each cluster as follows:

$$\cos(g_c, g_t) = \frac{\mathbf{e}_{g_c} \cdot \mathbf{e}_{g_t}}{\lVert \mathbf{e}_{g_c} \rVert \, \lVert \mathbf{e}_{g_t} \rVert} \quad (3)$$

Here, $\cos(g_c, g_t)$ represents the cosine similarity of the cluster-specific marker $g_c \in M_c$ to the cell type marker $g_t \in M_t$, and $\mathbf{e}_{g}$ denotes the expression vector of gene $g$. Next, the cell type specificity of the cluster $c$ for each cell type is determined by

graphic file with name DmEquation4.gif (4)

Here, $S_{c,t}$ represents the weighted cell type specificity of the cluster $c$ to the cell type $t$. $w_{g_t}$ represents the weight assigned to the cell type marker $g_t$ by the user. $w_{g_t}$ ranges from −3 to 3, where +3 denotes a highly specific positive marker and −3 denotes a highly specific negative marker. $s_{g_t,c}$ represents the specificity of the cell type marker $g_t$ to the cluster $c$. In this way, the cell type specificity of cluster $c$ is determined for all cell types, and the cell type with the highest $S_{c,t}$ is assigned to cluster $c$. If the difference between the highest and the second-highest $S_{c,t}$ is <0.05, the cell type assignment is ambiguous. In such cases, the cluster is considered unresolved and is passed to step 3, where annotation is performed at the individual cell level. This finer resolution step helps accurately assign cell types by leveraging the distinct transcriptional profiles of individual cells.
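The cluster-level decision rule can be sketched in Python as follows. This is an illustration under stated assumptions, not the published formula: the aggregation in `cell_type_score` (a weighted sum of marker specificities divided by the sum of absolute weights) is a guess at one reasonable form, and all function names and toy values are hypothetical. Only the weight range (−3 to 3) and the 0.05 ambiguity margin come from the text.

```python
import math
from typing import Dict, Optional, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two expression vectors (0.0 if either is all zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def cell_type_score(specificity: Dict[str, float], weights: Dict[str, float]) -> float:
    """ASSUMED aggregation: weighted marker specificities, normalised by |weights|.

    Negative weights (negative markers) lower the score when the marker is
    specific to the cluster.
    """
    norm = sum(abs(w) for w in weights.values())
    if norm == 0.0:
        return 0.0
    return sum(w * specificity.get(g, 0.0) for g, w in weights.items()) / norm


def annotate_cluster(scores: Dict[str, float], margin: float = 0.05) -> Optional[str]:
    """Return the top-scoring cell type, or None when the top two are within the margin."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    if len(ranked) > 1 and ranked[0][1] - ranked[1][1] < margin:
        return None  # unresolved: defer to cell-level annotation (step 3)
    return ranked[0][0]


# Hypothetical per-marker specificities for one cluster.
t_cell = cell_type_score({"CD3E": 0.9, "MS4A1": 0.2}, {"CD3E": 3, "MS4A1": -1})
b_cell = cell_type_score({"CD3E": 0.9, "MS4A1": 0.2}, {"MS4A1": 3, "CD3E": -1})
print(annotate_cluster({"T cell": t_cell, "B cell": b_cell}))
print(annotate_cluster({"CD4 T": 0.71, "CD8 T": 0.69}))
```

Returning `None` for the second cluster mirrors the text: closely scoring subtypes are left unresolved at the cluster level rather than forced into a single label.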

Step3: subtype identification

Clusters containing multiple cell types and user-defined subtypes are annotated in this step. The weighted mean expression of each cell’s neighbouring cells is calculated using $A$. Next, each cell’s own expression is merged with the weighted mean expression of its neighbours, giving a combined expression that captures both the cell’s own expression and that of its neighbours. This approach is adapted from the message-passing layer framework in graph neural networks, where a node incorporates its own information and information from its neighbouring nodes. The weighted mean expression is calculated as follows:

graphic file with name DmEquation5.gif (5)

Here, $\bar{E}_{w}$ represents the weighted mean expression of the cell, $\bar{E}_{n}$ is the mean expression of the neighbourhood cells, and $T$ represents the number of iterations over which the message-passing layer integrates the mean expression from its neighbours. $w$ represents the weight assigned to the neighbours’ mean expression, which is initialized as 1 and reduced by $\delta$ at each iteration. In this way, higher weight is given to nearer cells than to farther cells in the projection space. Next, the combined expression of each cell is calculated as follows:

$$E_{combined} = \alpha \cdot E_{own} + \beta \cdot \bar{E}_{w} \quad (6)$$

Here, $\alpha$ denotes the weight assigned to the cell’s own expression ($E_{own}$), and $\beta$ denotes the weight assigned to the weighted mean expression of the cell’s neighbours ($\bar{E}_{w}$). By default, $\alpha$ and $\beta$ are assigned a value of 0.5, giving equal weight to both terms. Next, the expression of cell type-specific markers is indexed from $E_{combined}$, and the mean expression of these cell type markers is calculated for each cell type across all cells. Subsequently, cells are individually annotated based on the cell type that has the highest combined mean expression.
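The neighbourhood smoothing and mixing steps can be sketched in Python for a single gene. Several details are assumptions made for illustration, since the exact form of the weighted mean is not recoverable here: each iteration averages the previous iteration's values over a cell's neighbours (so later iterations reach more distant cells), the per-iteration weight starts at 1 and decreases by `decay`, and the accumulated signal is divided by the sum of the weights used. The function names are hypothetical; the defaults of 0.5 for the own and neighbour weights follow the text.

```python
from typing import List, Sequence


def neighbour_mean(values: Sequence[float], adj: Sequence[Sequence[int]]) -> List[float]:
    """Mean value over each cell's neighbours, as marked by 1s in its adjacency row."""
    out = []
    for row in adj:
        nbrs = [values[j] for j, a in enumerate(row) if a]
        out.append(sum(nbrs) / len(nbrs) if nbrs else 0.0)
    return out


def combined_expression(expr: Sequence[float], adj: Sequence[Sequence[int]],
                        n_iter: int = 2, decay: float = 0.5,
                        alpha: float = 0.5, beta: float = 0.5) -> List[float]:
    """Mix each cell's own expression with a decayed, iterated neighbour average."""
    current = list(expr)
    weight, total_weight = 1.0, 0.0
    acc = [0.0] * len(expr)
    for _ in range(n_iter):
        if weight <= 0.0:
            break
        current = neighbour_mean(current, adj)   # one message-passing hop
        acc = [a + weight * c for a, c in zip(acc, current)]
        total_weight += weight
        weight -= decay                          # farther hops count less
    weighted_mean = [a / total_weight for a in acc] if total_weight else acc
    return [alpha * x + beta * m for x, m in zip(expr, weighted_mean)]


# Three mutually connected cells; only the first expresses the marker.
adjacency = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
marker_expr = [4.0, 0.0, 0.0]
print(combined_expression(marker_expr, adjacency, n_iter=1))
```

With one iteration, the expressing cell is pulled down towards its silent neighbours while the neighbours pick up part of its signal, which is the smoothing behaviour the text describes. A cell would then be labelled with the cell type whose markers have the highest mean combined expression.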

Cell-type identification framework for scATAC-seq dataset

Most cell-type annotation methods for scATAC-seq datasets rely on the gene activity score only, which aggregates the counts of all fragments within the gene promoter region and gene body [28, 29]. However, using only the gene activity score for cell type annotation can overlook the fragments that reside outside the gene region. To overcome this challenge, ScInfeR utilizes both the gene activity score and chromatin accessibility score to predict cell-type annotations. For cell type guidance, users can provide either a marker set, scRNA-seq reference, or scATAC-seq reference. When using scATAC-seq dataset as the reference, ScInfeR calculates cell type-specific markers by combining the gene activity scores with the UMAP projections of chromatin accessibility data.

Our tool requires the gene activity score and cluster information, along with UMAP or PCA projections generated from the chromatin accessibility scores of the target scATAC-seq dataset. First, the cell–cell similarity adjacency matrix ($A$) is constructed from the UMAP or PCA projections of the target scATAC-seq dataset. Subsequently, clusters are annotated by calculating the weighted cell type specificity ($S_{c,t}$) from cluster-specific gene markers, derived from the gene activity score, and cell type-specific markers, as described in the previous steps. Finally, subtypes and clusters containing multiple cell types are annotated by calculating the combined expression ($E_{combined}$) using $A$ and the cell-type- and subtype-specific markers.

Cell-type identification framework for spatial transcriptomics dataset

For cell type annotation in spatial omics datasets, our tool calculates the adjacency matrix $A$ from the $x$ and $y$ coordinates of the spatial data. Next, spatial domains or clusters are obtained using our in-house tool, SpatialPrompt [41]. Briefly, SpatialPrompt calculates spatial domains by integrating the gene expression data and spatial coordinates from spatial omics datasets. In this step, users can also provide spatial domains generated by other tools to ScInfeR. Finally, by combining the spatial adjacency matrix, spatial domain information, and gene expression data, ScInfeR predicts the cell types in the spatial data using the steps described above.

Performance assessment of ScInfeR to other state-of-the-art tools

A total of 10 cell-type annotation tools, used across various scRNA-seq, scATAC-seq, and spatial technologies, were considered for the performance assessment. For scRNA-seq benchmarking, the marker-based methods scSorter [22], SCINA [8], ScType [7], and Garnett [21] were considered. Among the reference-based tools, Seurat [19] and SingleR [17] were selected, as these methods have shown promising results in recent benchmark studies [42]. For the scATAC-seq benchmark, AtacAnnoR, CellCano, and scRNA-seq reference-based methods were considered. For spatial transcriptomics, recent spatially aware cell type annotation tools, such as SPANN [25] and TACCO [26], along with scRNA-seq marker- and reference-based tools, were included in the analysis. A brief overview of all the tools is given in Table 1. For plant scRNA-seq datasets, scSorter and SCINA were considered for benchmarking, while ScType and Garnett were excluded because they are explicitly designed for human and mouse scRNA-seq data.

Table 1.

Comparison of cell type annotation tools: overview of popular cell annotation tools, highlighting their target data types, underlying algorithms, and support for hybrid-based annotation, subtype identification, weighted marker usage, and negative marker support

Tool name | Target data type | Algorithm overview | Hybrid annotation support | Subtype support | Weighted marker support | Negative marker support
ScInfeR | scRNA-seq, scATAC-seq, spatial omics | Graph-based | Yes | Yes | Yes | Yes
ScType [7] | scRNA-seq | Correlation-based | No | No | No | Yes
SCINA [8] | scRNA-seq | Gaussian mixture model | No | No | No | No
scSorter [22] | scRNA-seq | Correlation-based | No | No | Yes | No
Garnett [21] | scRNA-seq | Linear model | No | Yes | No | No
SingleR [17] | scRNA-seq | Spearman correlation | No | No | No | No
Seurat [19] | scRNA-seq | Canonical correlation | No | No | No | No
CellCano [24] | scATAC-seq | Knowledge distillation | No | No | No | No
AtacAnnoR [23] | scATAC-seq | Combine-and-discard strategy | No | No | No | No
TACCO [26] | Spatial omics | Optimal transport model | No | No | No | No
SPANN [25] | Spatial omics | Coupled variational autoencoder | No | No | No | No

For unbiased scRNA-seq reference selection, all references were retrieved from the DISCO database [43]. For the marker-based methods, cell type markers were collected from their respective studies. If a study did not provide marker sets, the top 10 markers for each cell type were obtained using the FindMarkers function in the Seurat package, an approach also used by several benchmark studies [42]. Details of the target and reference scRNA-seq, scATAC-seq, and spatial omics datasets are provided in Supplementary Table 2. For all methods, the instructions in their official repositories were followed with default parameters. Two quantitative assessment metrics, the micro F1 score and the adjusted Rand index (ARI), were calculated between the ground truth ($Y$) and the predicted annotation ($\hat{Y}$) as

$$F1_{micro}(Y, \hat{Y}) = \frac{2 \cdot P_{micro} \cdot R_{micro}}{P_{micro} + R_{micro}} \quad (7)$$

$$ARI(Y, \hat{Y}) = \frac{RI - E[RI]}{\max(RI) - E[RI]} \quad (8)$$

Here, $P_{micro}$ and $R_{micro}$ are the micro-averaged precision and recall, and RI stands for the Rand Index, which measures the similarity between the predicted annotations and the ground truth. All analyses were performed on a system with an Intel Xeon processor with 48 cores, 128 GB of RAM, and 4 GB of graphics memory.
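For readers who want to reproduce the two metrics, the following Python sketch computes them from label lists using only the standard library. The function names are chosen here for illustration. The ARI is written in its pair-counting form, which is equivalent to the chance-corrected Rand Index described above; for single-label predictions the micro F1 score equals plain accuracy.

```python
from collections import Counter
from math import comb
from typing import Hashable, Sequence


def micro_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Micro-averaged F1 over all classes for single-label predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("label lists must have the same length")
    tp = sum(t == p for t, p in zip(y_true, y_pred))
    errors = len(y_true) - tp      # each error is one FP and one FN
    denom = 2 * tp + 2 * errors
    return 2 * tp / denom if denom else 0.0


def adjusted_rand_index(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """ARI = (index - expected index) / (max index - expected index)."""
    if len(y_true) != len(y_pred):
        raise ValueError("label lists must have the same length")
    pairs = comb(len(y_true), 2)
    # Number of cell pairs grouped together in both, in truth, and in prediction.
    same_both = sum(comb(c, 2) for c in Counter(zip(y_true, y_pred)).values())
    same_true = sum(comb(c, 2) for c in Counter(y_true).values())
    same_pred = sum(comb(c, 2) for c in Counter(y_pred).values())
    expected = same_true * same_pred / pairs if pairs else 0.0
    maximum = (same_true + same_pred) / 2
    if maximum == expected:        # degenerate case, e.g. a single cluster in both
        return 1.0
    return (same_both - expected) / (maximum - expected)


truth = ["B", "B", "T", "T"]
print(micro_f1(truth, ["B", "T", "T", "T"]))             # 0.75
print(adjusted_rand_index(truth, ["x", "x", "y", "y"]))  # 1.0
```

The second call shows that ARI depends only on how cells are grouped, not on the label names, whereas micro F1 requires predicted labels to match the ground-truth vocabulary.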

Hierarchical cell type marker and scRNA-seq reference database construction

A total of 2497 cell type markers from 28 tissue types were collected from the PanglaoDB, CellMarker, DISCO, ScType, and PCMDB databases [7, 11, 13, 43, 44]. Markers validated by multiple users in PanglaoDB and CellMarker were assigned weights >1. For plant cell type markers, only non-duplicated, experimentally validated markers from PCMDB were considered for database construction. Additionally, scRNA-seq reference datasets were collected from the DISCO and Azimuth databases [19, 43]. Cell type names were then standardized across the marker and scRNA-seq reference datasets for seamless integration with the ScInfeR R package. Such standardization is missing in several databases, which causes confusion during cell type annotation. Finally, an interactive web server was built using R Shiny, and users can visualize and download the marker sets and scRNA-seq references through either the web server or the R package [45].

Results

Overview of ScInfeR framework

ScInfeR employs a graph-based framework to annotate cell types in scRNA-seq, scATAC-seq, and spatial transcriptomics datasets. It enables cell-type annotation using a cell marker set, an scRNA-seq reference, or a combination of both (Fig. 1). From an scRNA-seq reference ($X_{ref}$), ScInfeR uses both local and global neighbourhoods of cell types to extract the cell type-specific markers (Fig. 1b). In this way, ScInfeR gives users flexibility in choosing marker sets and reference scRNA-seq data. Additionally, for seamless cell annotation, we built an open-access, interactive database of cell type markers and scRNA-seq references, ScInfeRDB. This comprehensive repository encompasses scRNA-seq references and cell-type markers.

Figure 1.


Overview of the ScInfeR workflow: (a) the ScInfeR framework takes an scRNA-seq, single-cell ATAC-sequencing (scATAC-seq), or spatial omics expression matrix as input for cell type inference; (b) a cell marker set or scRNA-seq reference matrix is used as secondary input for cell type guidance. Both can be retrieved from our ScInfeRDB database. When an scRNA-seq reference is supplied, ScInfeR calculates the cell marker set from it; (c) ScInfeR annotates cells in three steps: building a similarity matrix, assigning cluster-level labels based on marker correlations, and refining annotations at the single-cell level using neighbourhood-weighted expression.

ScInfeR annotates cells through a three-step process. In step 1, the tool constructs a cell–cell similarity adjacency matrix ($A$) using the gene expression data of scRNA-seq, the chromatin accessibility scores of scATAC-seq, or the spatial coordinates of spatial transcriptomics data. $A$ represents the underlying cellular network as a graph, where nodes are cells and edges reflect their similarity with respect to gene expression, chromatin accessibility, or position. In step 2, the tool annotates each cluster or domain by correlating the expression patterns of cluster markers with those of the cell type markers. In step 3, leveraging a framework adapted from the message-passing layer of graph neural networks (see Materials and methods), the tool computes the weighted mean expression profile of each cell’s neighbouring cells. By integrating the intrinsic gene expression of each cell with the transcriptomic signals from its neighbours, ScInfeR accurately assigns each cell to its corresponding cell type (Fig. 1c).

Systematic benchmarking of ScInfeR on the scRNA-seq Tabula Sapiens atlas

The performance of ScInfeR and six existing tools was initially assessed using the scRNA-seq Tabula Sapiens atlas. This atlas comprises scRNA-seq datasets from multiple human tissues with ground-truth annotations. The scRNA-seq datasets for the lungs, liver, and pancreas were retrieved and processed as described in the Methods section. The lungs scRNA-seq dataset (TS-lungs) contains 12 cell types, two of which (T cells and endothelial cells) are divided into five subtypes in total. Similarly, the liver scRNA-seq dataset (TS-liver) includes 13 cell types, and the pancreas scRNA-seq dataset (TS-pancreas) contains 15 cell types. This benchmark enables us to investigate the performance of ScInfeR and other scRNA-seq-based tools across various tissue types.

On the TS-lungs dataset, ScInfeR outperformed the other marker-based methods with an F1 score of 0.94 (Fig. 2a). ScType and SCINA were the next best-performing tools, with F1 scores of 0.93 and 0.87, respectively (Fig. 2b). In this specific benchmark, ScType’s performance was comparable with that of ScInfeR, but it did not perform well on other datasets (Fig. 3). scSorter and Garnett performed poorly, with F1 scores of 0.70 and 0.48, respectively (Fig. 2b). Similarly, among the reference-based methods, ScInfeR had a lower false positive rate, achieving an F1 score of 0.94 compared with SingleR and Seurat, both of which had F1 scores of 0.89 (Fig. 2c). To assess subtype identification, five subtypes were considered: CD4 T-cell, CD8 T-cell, EC-capillary, EC-microvascular, and EC-vein (Fig. 2d). Only ScInfeR and Garnett were included in the subtype analysis, as these tools are specifically designed to identify subtypes within cell types. In the subtype analysis, ScInfeR achieved an F1 score of 0.74, outperforming Garnett, which had an F1 score of 0.24 (Fig. 2d). When predicting the EC-capillary subtype, ScInfeR classified some cells as aerocytes (labelled as other in Fig. 2d). We observed high expression of aerocyte markers in those cells (Supplementary Figure 1). It is possible that the atlas authors did not consider the aerocyte cell type in their annotation process. In terms of scalability, SingleR, Garnett, and scSorter required >2500 s to predict the cell types of the TS-lungs dataset. In contrast, ScType and ScInfeR required <60 s for the same task (Fig. 2e). This suggests that ScInfeR is well suited to large scRNA-seq datasets, including those with millions of cells.
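For readers who wish to reproduce this kind of comparison, the two reported metrics can be computed from a vector of ground-truth labels and a vector of predicted labels. The sketch below is a plain Python illustration; the averaging scheme used for the reported F1 scores is not stated in this section, so the macro (unweighted per-class) average shown here is an assumption, and the function names are hypothetical.

```python
from collections import Counter
from math import comb

def macro_f1(truth, pred):
    """Unweighted mean of per-class F1 over classes present in the ground truth.

    Assumption: macro averaging; the paper may use a different scheme.
    """
    scores = []
    for label in sorted(set(truth)):
        tp = sum(t == label and p == label for t, p in zip(truth, pred))
        fp = sum(t != label and p == label for t, p in zip(truth, pred))
        fn = sum(t == label and p != label for t, p in zip(truth, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

def adjusted_rand_index(truth, pred):
    """Adjusted Rand Index computed from the label contingency table."""
    sum_cells = sum(comb(c, 2) for c in Counter(zip(truth, pred)).values())
    sum_rows = sum(comb(c, 2) for c in Counter(truth).values())
    sum_cols = sum(comb(c, 2) for c in Counter(pred).values())
    expected = sum_rows * sum_cols / comb(len(truth), 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # degenerate case: both labelings are trivial
    return (sum_cells - expected) / (max_index - expected)

# Toy labels: one B cell is mislabelled as NK.
truth = ["T", "T", "B", "B", "NK", "NK"]
pred = ["T", "T", "B", "NK", "NK", "NK"]
f1 = macro_f1(truth, pred)
ari = adjusted_rand_index(truth, pred)
```

The F1 score penalises per-class errors, whereas the ARI measures agreement between the two partitions of cells irrespective of label names, which is why the two can rank tools differently.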

Figure 2.

Quantitative assessment of ScInfeR on Tabula Sapiens lungs scRNA-seq dataset: (a) Comparison of ScInfeR predicted cell types with the ground truth annotations, visualized on the UMAP projection of the lungs scRNA-seq dataset. ScInfeR (M) and ScInfeR (R) represent the marker-based and reference-based performances, respectively. The F1 score (0.94) was calculated by comparing the tool’s predicted annotations with the ground truth annotations; (b,c) bar plot showing the F1 and Adjusted Rand Index (ARI) scores of marker-based and reference-based tools on the same dataset; (d) performance of tools (ScInfeR and Garnett) allowing subtype identification, considering only T cells and endothelial cells, as they have subtypes; (e) run time comparison of all tools on the same dataset.

Figure 3.

Quantitative assessment of ScInfeR on Tabula Sapiens liver and pancreas scRNA-seq datasets: (a) Comparison of ScInfeR predicted cell types and ground truth annotations, visualized on the UMAP projection of the liver scRNA-seq dataset. ScInfeR (M) and ScInfeR (R) represent the marker-based and reference-based performances, respectively. The F1 scores (0.88 and 0.67) were calculated by comparing the tool’s predicted annotations with the ground truth annotations; (b,c) bar plots showing the F1 and Adjusted Rand Index (ARI) scores of marker-based and reference-based tools on the liver scRNA-seq dataset; (d) comparison of ScInfeR predicted cell types and ground truth annotations, visualized on the UMAP projection of the pancreas scRNA-seq dataset; (e,f) bar plots showing the F1 and ARI scores of marker-based and reference-based tools on the pancreas scRNA-seq dataset.

On the TS-liver and TS-pancreas datasets, ScInfeR outperformed the other marker-based tools, with F1 scores of 0.88 and 0.93, respectively (Fig. 3a, d). In both datasets, SCINA and scSorter were the next best-performing methods. The F1 and ARI scores of all the tools are shown in Fig. 3b, e. In the TS-liver dataset, the other reference-based tools performed poorly, with F1 scores of 0.48 for both SingleR and Seurat. The high batch effect between the reference and target datasets might be the reason for this poor performance. Despite this challenge, ScInfeR outperformed the other tools, achieving an F1 score of 0.67 (Fig. 3c). Similarly, in the TS-pancreas dataset, ScInfeR identified all cell types with an F1 score of 0.84, surpassing the other reference-based methods, SingleR and Seurat, which had F1 scores of 0.73 and 0.72, respectively (Fig. 3f).

Application of ScInfeR on scATAC-seq technology

The application of ScInfeR to scATAC-seq datasets was assessed using the PBMC dataset (GSE129785) [32]. This benchmark was conducted in two stages. First, tools that support scATAC-seq as a reference (e.g. CellCano) were evaluated. Next, the methods that use scRNA-seq as a reference were tested. For the scATAC-seq reference, the PBMC scATAC-seq dataset (GSE123578) [33] was used. For the scRNA-seq reference, the PBMC scRNA-seq dataset from the DISCO database was used. The second benchmark was essential because scATAC-seq reference datasets are scarce compared with scRNA-seq references. Using scRNA-seq as a reference enhances the tool’s versatility and broadens its applicability. Furthermore, tools such as SingleR and Seurat do not consider the chromatin accessibility information from scATAC-seq, so only gene activity scores were used for their benchmarking. In the first benchmark, ScInfeR demonstrated superior performance with an F1 score of 0.97, while CellCano also performed well with an F1 score of 0.95 (Fig. 4a). Although CellCano performed well in detecting cell types, its inability to use scRNA-seq references limits its usability. In the second benchmark, ScInfeR again performed well, with an F1 score of 0.95 compared with 0.88 for AtacAnnoR (Fig. 4c). When using scRNA-seq as a reference for scATAC-seq cell type annotation, only ScInfeR and AtacAnnoR incorporate chromatin accessibility information from scATAC-seq datasets. The lack of this feature in other tools results in poorer performance (Fig. 4c, d). Furthermore, in this benchmark, when detecting T-cell subtypes, none of the tools except ScInfeR could distinctly identify the boundaries between CD4 and CD8 T cells. The F1 scores and ROC scores obtained in both benchmarks are shown in Fig. 4b, d.

Figure 4.

Performance assessment of ScInfeR on scATAC-seq datasets: (a) UMAP plot of cell type inference predicted by reference-based tools that use scATAC-seq data as reference. The scores represent the F1 score obtained by comparing ground truth with the predicted cell types. (b) Bar plot of F1 and ROC scores obtained by comparing the ground truth annotations with the tool’s predicted annotations that use scATAC-seq data as a reference. (c) Cell type inference predicted by reference-based tools that use scRNA-seq data as a reference. (d) Bar plot of F1 and ROC scores obtained by comparing the ground truth annotations with the tool’s predicted annotations that use scRNA-seq data as a reference.

Spatially informed cell type annotation in spatial transcriptomics

Three spatial omics datasets, generated using the STARmap, SeqFISH, and Visium spatial technologies, were obtained from published studies [25, 34, 36]. These spatial datasets were derived from the mouse cortex, mouse embryo, and human brain. The first two spatial datasets (i.e. STARmap and SeqFISH) were benchmarked with reference-based tools. The third dataset (i.e. Visium-DLPFC) was benchmarked using marker-based tools. This marker-based benchmarking was crucial because marker-based methods, although less commonly used than reference-based tools in spatial transcriptomics, offer distinct advantages in certain analytical scenarios. These scenarios include the availability of non-human scRNA-seq references and batch effects between the scRNA-seq and spatial datasets.

On the STARmap spatial dataset, ScInfeR outperformed other tools by a large margin, achieving an F1 score of 0.73 (Fig. 5a). SingleR and TACCO were the next best-performing methods, with F1 scores of 0.44 and 0.38, respectively. In this dataset, ScInfeR was the only method able to accurately capture the spatial cell boundaries (Fig. 5a). Although TACCO and SPANN use spatial information for cell type prediction in spatial omics, they did not perform well on this dataset. SPANN performed the worst, predicting most of the cells as the Smc cell type (Fig. 5a).

Figure 5.

Performance assessment of ScInfeR on spatial transcriptomics datasets: (a) spatial distribution of major cell types predicted by reference-based tools in the STARmap cortex spatial dataset. The scores represent the F1 score obtained by comparing the ground truth with the predicted annotations. The x and y axes represent the coordinates of the spatial data; (b) spatial distribution of major cell types predicted by reference-based tools in the SeqFISH embryo dataset; (c) spatial distribution of major cell types predicted by marker-based tools in the human dorsolateral prefrontal cortex dataset.

Similarly, on the SeqFISH embryo dataset, ScInfeR performed well with an F1 score of 0.80 (Fig. 5b). On this dataset too, spatially informed cell type annotation tools did not perform well compared with scRNA-seq-based tools. Seurat and SingleR were the next best-performing tools, achieving F1 scores of 0.75 and 0.74, respectively (Fig. 5b). In the third spatial dataset (Visium-cortex), the marker-based tools SCINA, scSorter, and ScType were evaluated. These tools, originally developed for scRNA-seq cell type annotation, do not effectively use the spatial information and performed poorly, with SCINA achieving an F1 score of 0.38 and scSorter 0.35. ScType notably predicted all cells as Layer3, with an F1 score of 0.25 and a ROC score of 0 (Fig. 5c). In contrast, ScInfeR integrates spatial gene expression and coordinates, leading to the highest F1 score (0.77) among the evaluated methods on the Visium-cortex dataset (Fig. 5c).
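To illustrate how spatial coordinates can enter the annotation, the following Python sketch builds a neighbourhood weight matrix from spot coordinates and uses it to compute a neighbourhood expression profile for each spot. The Gaussian kernel, the hard `radius` cut-off, and the function name `spatial_weights` are assumptions chosen for illustration; they are not taken from the ScInfeR source.

```python
import numpy as np

def spatial_weights(coords, radius):
    """Gaussian-kernel weights between spots closer than `radius`.

    Rows are normalised to sum to 1; spots without any neighbour inside the
    radius keep an all-zero row. Kernel and cut-off are hypothetical choices.
    """
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    w = np.exp(-(dist ** 2) / (2.0 * radius ** 2))
    w[dist > radius] = 0.0
    np.fill_diagonal(w, 0.0)  # exclude the spot itself
    row_sums = w.sum(axis=1, keepdims=True)
    return np.divide(w, row_sums, out=np.zeros_like(w), where=row_sums > 0)

# Toy data: three adjacent spots and one isolated spot, 2 genes each.
coords = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [10.0, 10.0]])
expr = np.array([[2.0, 0.0], [4.0, 0.0], [6.0, 0.0], [0.0, 8.0]])
w = spatial_weights(coords, radius=1.5)
neighbourhood_profile = w @ expr  # weighted mean expression of spatial neighbours
```

Because the weights depend only on position, spots within the same tissue domain reinforce one another, which is one plausible reason why spatially informed annotation can recover layer boundaries that expression-only tools miss.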

ScInfeR performance on plant scRNA-seq datasets

Two scRNA-seq datasets from Arabidopsis thaliana root and Oryza sativa leaf were used to assess the performance of ScInfeR. This benchmark aimed to evaluate the method’s effectiveness across diverse plant species. The root scRNA-seq dataset contained 12 cell types, while the leaf dataset had four cell types. Due to the scarcity of high-quality scRNA-seq references, only the marker-based tools were used for this benchmark. On the A. thaliana root scRNA-seq dataset, ScInfeR outperformed other tools with an F1 score of 0.84 (Supplementary Figure 5), compared with scSorter and SCINA, which achieved F1 scores of 0.80 and 0.67, respectively. Similarly, on the O. sativa scRNA-seq dataset, ScInfeR achieved an impressive F1 score of 0.86 (Supplementary Figure 6), outperforming scSorter and SCINA, which scored 0.75 and 0.77, respectively. These results demonstrate ScInfeR’s superior accuracy in cell-type annotation across plant species. The consistent performance highlights its robustness and reliability for cross-species applications.

ScInfeR performance on scRNA-seq datasets with batch effects

Batch effects are a major challenge in the analysis of scRNA-seq datasets. If batch effects are neglected during cell type annotation, the resulting technical variation can affect cell type prediction performance [46]. Our tool implements several steps to accurately annotate cells by mitigating the batch effects present in single-cell and spatial datasets. ScInfeR was assessed using 14 scRNA-seq datasets from the pancreas and PBMC tissue types. The batch effect among the scRNA-seq datasets was removed, and cell type annotation was performed as described in the Methods section. The UMAP projections of the pancreas and PBMC scRNA-seq datasets after batch effect removal are shown in Fig. 6a, c. On the integrated pancreas scRNA-seq dataset, ScInfeR captured nearly all cell type distributions, achieving an F1 score of 0.95 (Fig. 6b). Cell type annotation on the integrated PBMC dataset was challenging because of the high similarity between cell types and the overlap of markers or features. Despite these challenges, ScInfeR annotated the major cell types with an F1 score of 0.80 (Fig. 6d). Notably, the tool was able to distinguish the cell type boundaries of CD4+ T cells, CD8+ T cells, NKT cells, and NK cells (Fig. 6d) despite the high similarity of their expression patterns.

Figure 6.

Performance assessment of ScInfeR in scRNA-seq datasets with substantial batch effects: (a) UMAP plot of the integrated pancreas dataset after batch effect correction; the legend represents the techniques used to sequence the cells; (b) UMAP plot of ground truth annotations and ScInfeR predicted annotations on the integrated pancreas scRNA-seq dataset. The scores represent the F1 score obtained by comparing the ground truth with the predicted annotations. (c) UMAP plot of the integrated PBMC dataset after batch effect correction; the legend indicates the study from which each dataset was retrieved; (d) UMAP plot of ground truth annotations and ScInfeR predicted annotations on the integrated PBMC scRNA-seq dataset.

Discussion

Accurate cell type identification is a crucial prerequisite for the downstream analysis of scRNA-seq, scATAC-seq, and spatial omics datasets. Currently, most cell-type annotation tools annotate cells using either a marker-based or a reference-based approach. Marker-based tools require a cell type marker set, while reference-based tools require a reference scRNA-seq expression matrix for cell type annotation in single-cell or spatial datasets. Both approaches face several challenges, including cell-type marker selection, marker specificity, and scRNA-seq reference selection. Further, few tools are available to annotate scATAC-seq and spatial omics datasets. Combining both approaches can enhance annotation accuracy by leveraging their respective strengths. To overcome these limitations, we propose ScInfeR, a cell-type annotation toolkit that can annotate scRNA-seq, scATAC-seq, and spatial omics datasets using a marker-based approach, a reference-based approach, or a combination of both.

ScInfeR implements several improvements compared with the existing tools. It can annotate a wide range of datasets, including scRNA-seq, scATAC-seq, and spatial omics generated from multiple technologies (e.g. STARmap, Visium). Additionally, the tool supports weighted markers, allowing users to define the importance of markers in cell type assignment. Another key feature is subtype identification, which most other tools lack. In our benchmarking analysis, ScInfeR outperformed existing tools on lung, pancreas, and liver scRNA-seq datasets from the Tabula Sapiens atlas. For subtype assignments, ScInfeR captures the hierarchical cell assignment more clearly than Garnett, the only other tool supporting subtype assignments. Similarly, in scATAC-seq cell type assignment, ScInfeR shows promising results compared with the tools specifically designed for scATAC-seq datasets. Furthermore, in spatial omics datasets, ScInfeR precisely annotated cells/spots by considering both spatial information and the gene expression matrix.
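The role of weighted positive and negative markers in cluster-level annotation can be sketched as follows. This Python example is a simplified stand-in: ScInfeR correlates cluster marker patterns with cell type markers, whereas the sketch scores each cluster by a weighted sum of gene-wise z-scores. The marker weights, the cluster means, and the function names are hypothetical values chosen for illustration.

```python
import numpy as np

genes = ["CD3E", "CD4", "CD8A", "MS4A1"]

# Hypothetical weights: positive values mark genes expected to be expressed,
# negative values mark genes expected to be absent in that cell type.
markers = {
    "CD4 T cell": {"CD3E": 1.0, "CD4": 2.0, "CD8A": -1.0},
    "CD8 T cell": {"CD3E": 1.0, "CD8A": 2.0, "CD4": -1.0},
    "B cell": {"MS4A1": 2.0, "CD3E": -1.0},
}

def marker_matrix(markers, genes):
    """Dense (cell types x genes) matrix of marker weights."""
    index = {gene: i for i, gene in enumerate(genes)}
    mat = np.zeros((len(markers), len(genes)))
    for row, weights in enumerate(markers.values()):
        for gene, weight in weights.items():
            mat[row, index[gene]] = weight
    return mat

def annotate_clusters(cluster_means, markers, genes):
    """Assign each cluster the cell type with the highest weighted marker score."""
    # z-score every gene across clusters so that genes are comparable
    centred = cluster_means - cluster_means.mean(axis=0)
    z = centred / (cluster_means.std(axis=0) + 1e-9)
    scores = z @ marker_matrix(markers, genes).T
    names = list(markers)
    return [names[i] for i in scores.argmax(axis=1)]

# Toy mean expression per cluster (rows) for the four genes above.
cluster_means = np.array([
    [3.0, 4.0, 0.1, 0.0],
    [3.2, 0.2, 4.5, 0.1],
    [0.1, 0.0, 0.0, 5.0],
])
labels = annotate_clusters(cluster_means, markers, genes)
```

Negative weights are what separate the two T-cell subtypes here: both clusters express CD3E, so the decision rests on CD4 and CD8A contributing with opposite signs.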

For seamless cell type annotation, we constructed ScInfeRDB, a hierarchical cell marker database. The database comprises 28 tissue types, 329 cell types, and 2497 gene markers. It also includes scRNA-seq references collected from multiple tissue types. Cell type names are standardized across both the marker and scRNA-seq reference datasets to ensure seamless integration with the ScInfeR R package. This feature is lacking in several databases, leading to confusion during the cell type annotation process.

Some considerations are advised for the optimal performance and accuracy of ScInfeR. The scRNA-seq reference and marker set should encompass a broad spectrum of cell types, ideally from the same species. Harmony projections are recommended for constructing the adjacency matrix when using multiple scRNA-seq references or scRNA-seq datasets with batch effects. Future development of ScInfeR is planned to incorporate more single-cell technologies, including single-cell DNA methylation and single-cell proteomics datasets.

In conclusion, ScInfeR, a cell type annotation toolkit, offers a wide range of modules for cell type assignment in scRNA-seq, scATAC-seq, and spatial transcriptomics. ScInfeR demonstrates superior performance across various tissue types and sequencing technologies compared with state-of-the-art tools. The ScInfeR toolkit, combined with ScInfeRDB, provides a more comprehensive and accurate solution for the labour-intensive cell annotation process.

Key Points

  • A comprehensive framework for annotating cell types across scRNA-seq, scATAC-seq, and spatial omics datasets.

  • A hybrid cell-type annotation strategy that enables annotation using a marker set, scRNA-seq reference, or both.

  • Our method enables hierarchical subtype identification, a feature missing in most other methods.

  • ScInfeR supports weighted markers, allowing users to define marker importance in cell-type classification.

  • We developed ScInfeRDB, an interactive and open-access hierarchical cell marker database encompassing 28 tissue types, 329 cell types, and 2497 gene markers for human and plant (https://www.swainasish.in/scinfer).

Supplementary Material

Supplementary_Data_ScInfeR_bbaf253

Contributor Information

Asish Kumar Swain, Department of Bioscience and Bioengineering, Indian Institute of Technology (IIT), N.H. 62, Nagaur Road, Karwar, Jodhpur 342030, Rajasthan, India.

Rajveer Singh Shekhawat, Department of Bioscience and Bioengineering, Indian Institute of Technology (IIT), N.H. 62, Nagaur Road, Karwar, Jodhpur 342030, Rajasthan, India.

Pankaj Yadav, Department of Bioscience and Bioengineering, Indian Institute of Technology (IIT), N.H. 62, Nagaur Road, Karwar, Jodhpur 342030, Rajasthan, India; School of Artificial Intelligence and Data Science, Indian Institute of Technology (IIT), N.H. 62, Nagaur Road, Karwar, Jodhpur 342030, Rajasthan, India.

Author contributions

Asish Kumar Swain (Primary data analyses, Data curation, Software development, Writing original draft), Rajveer Singh Shekhawat (Data curation, Writing original draft), and Pankaj Yadav (Conceptualization, Supervision, Writing original draft); all authors contributed by comments and approved the final manuscript.

Conflict of interest

The authors declare that they have no conflict of interest.

Funding

This work was partly supported by GenomeIndia grant from the Department of Biotechnology, Ministry of Science and Technology, Government of India (project number: BT/GenomeIndia/2018) and the Ministry of Education, Government of India.

Data availability

All the datasets used in this study are publicly available. Tabula Sapiens atlas scRNA-seq datasets were obtained from the Figshare repository (https://figshare.com/articles/dataset/Tabula_Sapiens_release_1_0/14267219). The scATAC-seq datasets were obtained from NCBI GEO under accession IDs GSE129785 and GSE123578. The STARmap cortex spatial dataset was retrieved from http://starmapresources.org/data. The spatial human DLPFC dataset was retrieved from the LIBD database (https://research.libd.org/spatialLIBD/). The spatial mouse embryo dataset was obtained from https://crukci.shinyapps.io/SpatialMouseAtlas/. The PBMC, pancreas, lungs, and liver reference datasets were obtained from the DISCO database (https://www.immunesinglecell.org/). All plant scRNA-seq datasets were retrieved from scPlantDB (https://biobigdata.nju.edu.cn/scplantdb/home).

The ScInfeR R package and the scripts used for the benchmarking analysis can be found on GitHub at https://github.com/swainasish/ScInfeR. Comprehensive documentation and tutorials for our tool and database are available at https://www.swainasish.in/scinfer.

References

  • 1. Aldridge S, Teichmann SA. Single cell transcriptomics comes of age. Nat Commun 2020;11:4307. doi: 10.1038/s41467-020-18158-5
  • 2. Li X, Wang CY. From bulk, single-cell to spatial RNA sequencing. Int J Oral Sci 2021;13:36. doi: 10.1038/s41368-021-00146-0
  • 3. Chen H, Lareau C, Andreani T. et al. Assessment of computational methods for the analysis of single-cell ATAC-seq data. Genome Biol 2019;20:1–25. doi: 10.1186/s13059-019-1854-5
  • 4. Rao A, Barkley D, França GS. et al. Exploring tissue architecture using spatial transcriptomics. Nature 2021;596:211–20. doi: 10.1038/s41586-021-03634-9
  • 5. Saliba AE, Westermann AJ, Gorski SA. et al. Single-cell RNA-seq: advances and future challenges. Nucleic Acids Res 2014;42:8845–60. doi: 10.1093/nar/gku555
  • 6. Yuan L, Sun S, Jiang Y. et al. scRGCL: a cell type annotation method for single-cell RNA-seq data using residual graph convolutional neural network with contrastive learning. Brief Bioinform 2025;26:bbae662.
  • 7. Ianevski A, Giri AK, Aittokallio T. Fully-automated and ultra-fast cell-type identification using specific marker combinations from single-cell transcriptomic data. Nat Commun 2022;13:1246. doi: 10.1038/s41467-022-28803-w
  • 8. Zhang Z, Luo D, Zhong X. et al. SCINA: a semi-supervised subtyping algorithm of single cells and bulk samples. Genes 2019;10:531. doi: 10.3390/genes10070531
  • 9. Yang F, Wang W, Wang F. et al. scBERT as a large-scale pretrained deep language model for cell type annotation of single-cell RNA-seq data. Nat Mach Intell 2022;4:852–66. doi: 10.1038/s42256-022-00534-z
  • 10. Huang Q, Liu Y, Du Y. et al. Evaluation of cell type annotation R packages on single-cell RNA-seq data. Genomics Proteomics Bioinformatics 2021;19:267–81. doi: 10.1016/j.gpb.2020.07.004
  • 11. Franzén O, Gan LM, Björkegren JL. PanglaoDB: a web server for exploration of mouse and human single-cell RNA sequencing data. Database 2019;2019:baz046. doi: 10.1093/database/baz046
  • 12. Quan F, Liang X, Cheng M. et al. Annotation of cell types (ACT): a convenient web server for cell type annotation. Genome Med 2023;15:91. doi: 10.1186/s13073-023-01249-5
  • 13. Hu C, Li T, Xu Y. et al. CellMarker 2.0: an updated database of manually curated cell markers in human/mouse and web tools based on scRNA-seq data. Nucleic Acids Res 2023;51:D870–6. doi: 10.1093/nar/gkac947
  • 14. Liu H, Li H, Sharma A. et al. scAnno: a deconvolution strategy-based automatic cell type annotation tool for single-cell RNA-sequencing data sets. Brief Bioinform 2023;24:bbad179.
  • 15. Pruszak J, Sonntag KC, Aung MH. et al. Markers and methods for cell sorting of human embryonic stem cell-derived neural cell populations. Stem Cells 2007;25:2257–68. doi: 10.1634/stemcells.2006-0744
  • 16. Kelly OG, Chan MY, Martinson LA. et al. Cell-surface markers for the isolation of pancreatic cell types derived from human embryonic stem cells. Nat Biotechnol 2011;29:750–6. doi: 10.1038/nbt.1931
  • 17. Aran D, Looney AP, Liu L. et al. Reference-based analysis of lung single-cell sequencing reveals a transitional profibrotic macrophage. Nat Immunol 2019;20:163–72. doi: 10.1038/s41590-018-0276-y
  • 18. Hou R, Denisenko E, Forrest AR. scMatch: a single-cell gene expression profile annotation tool using reference datasets. Bioinformatics 2019;35:4688–95. doi: 10.1093/bioinformatics/btz292
  • 19. Satija R, Farrell JA, Gennert D. et al. Spatial reconstruction of single-cell gene expression data. Nat Biotechnol 2015;33:495–502. doi: 10.1038/nbt.3192
  • 20. Zhai Y, Chen L, Deng M. scBOL: a universal cell type identification framework for single-cell and spatial transcriptomics data. Brief Bioinform 2024;25:bbae188.
  • 21. Pliner HA, Shendure J, Trapnell C. Supervised classification enables rapid annotation of cell atlases. Nat Methods 2019;16:983–6. doi: 10.1038/s41592-019-0535-3
  • 22. Guo H, Li J. scSorter: assigning cells to known cell types according to marker genes. Genome Biol 2021;22:69. doi: 10.1186/s13059-021-02281-7
  • 23. Tian L, Xie Y, Xie Z. et al. AtacAnnoR: a reference-based annotation tool for single cell ATAC-seq data. Brief Bioinform 2023;24:bbad268. doi: 10.1093/bib/bbad268
  • 24. Ma W, Lu J, Wu H. Cellcano: supervised cell type identification for single cell ATAC-seq data. Nat Commun 2023;14:1864. doi: 10.1038/s41467-023-37439-3
  • 25. Yuan M, Wan H, Wang Z. et al. SPANN: annotating single-cell resolution spatial transcriptome data with scRNA-seq data. Brief Bioinform 2024;25:bbad533.
  • 26. Mages S, Moriel N, Avraham-Davidi I. et al. TACCO unifies annotation transfer and decomposition of cell identities for single-cell and spatial omics. Nat Biotechnol 2023;41:1465–73. doi: 10.1038/s41587-023-01657-3
  • 27. Cao ZJ, Wei L, Lu S. et al. Searching large-scale scRNA-seq databases via unbiased cell embedding with cell BLAST. Nat Commun 2020;11:3458. doi: 10.1038/s41467-020-17281-7
  • 28. Stuart T, Srivastava A, Madad S. et al. Single-cell chromatin state analysis with Signac. Nat Methods 2021;18:1333–41. doi: 10.1038/s41592-021-01282-5
  • 29. Granja JM, Corces MR, Pierce SE. et al. ArchR is a scalable software package for integrative single-cell chromatin accessibility analysis. Nat Genet 2021;53:403–11. doi: 10.1038/s41588-021-00790-6
  • 30. Wolf FA, Angerer P, Theis FJ. SCANPY: large-scale single-cell gene expression data analysis. Genome Biol 2018;19:1–5. doi: 10.1186/s13059-017-1382-0
  • 31. The Tabula Sapiens Consortium, Jones RC, Karkanias J. et al. The Tabula Sapiens: a multiple-organ, single-cell transcriptomic atlas of humans. Science 2022;376:eabl4896.
  • 32. Satpathy AT, Granja JM, Yost KE. et al. Massively parallel single-cell chromatin landscapes of human immune cell development and intratumoral T cell exhaustion. Nat Biotechnol 2019;37:925–36. doi: 10.1038/s41587-019-0206-z
  • 33. Lareau CA, Duarte FM, Chew JG. et al. Droplet-based combinatorial indexing for massive-scale single-cell chromatin accessibility. Nat Biotechnol 2019;37:916–24. doi: 10.1038/s41587-019-0147-6
  • 34. Wang X, Allen WE, Wright MA. et al. Three-dimensional intact-tissue sequencing of single-cell transcriptional states. Science 2018;361:eaat5691.
  • 35. Lohoff T, Ghazanfar S, Missarova A. et al. Integration of spatial and single-cell transcriptomic data elucidates mouse organogenesis. Nat Biotechnol 2022;40:74–85. doi: 10.1038/s41587-021-01006-2
  • 36. Maynard KR, Collado-Torres L, Weber LM. et al. Transcriptome-scale spatial gene expression in the human dorsolateral prefrontal cortex. Nat Neurosci 2021;24:425–36. doi: 10.1038/s41593-020-00787-0
  • 37. De Donno C, Hediyeh-Zadeh S, Moinfar AA. et al. Population-level integration of single-cell datasets enables multi-scale analysis across samples. Nat Methods 2023;20:1683–92. doi: 10.1038/s41592-023-02035-2
  • 38. Shulse C, Cole B, Ciobanu D. et al. High-throughput single-cell transcriptome profiling of plant cell types. Cell Rep 2019;27:2241–7. doi: 10.1016/j.celrep.2019.04.054
  • 39. He Z, Luo Y, Zhou X. et al. scPlantDB: a comprehensive database for exploring cell types and markers of plant cell atlases. Nucleic Acids Res 2024;52:D1629–38. doi: 10.1093/nar/gkad706
  • 40. Korsunsky I, Millard N, Fan J. et al. Fast, sensitive and accurate integration of single-cell data with harmony. Nat Methods 2019;16:1289–96. doi: 10.1038/s41592-019-0619-0
  • 41. Swain AK, Pandit V, Sharma J. et al. SpatialPrompt: spatially aware scalable and accurate tool for spot deconvolution and domain identification in spatial transcriptomics. Commun Biol 2024;7:639. doi: 10.1038/s42003-024-06349-5
  • 42. Sun X, Lin X, Li Z. et al. A comprehensive comparison of supervised and unsupervised methods for cell type identification in single-cell RNA-seq. Brief Bioinform 2022;23:bbab567. doi: 10.1093/bib/bbab567
  • 43. Li M, Zhang X, Ang KS. et al. DISCO: a database of deeply integrated human single-cell omics data. Nucleic Acids Res 2022;50:D596–602. doi: 10.1093/nar/gkab1020
  • 44. Jin J, Lu P, Xu Y. et al. PCMDB: a curated and comprehensive resource of plant cell markers. Nucleic Acids Res 2022;50:D1448–55. doi: 10.1093/nar/gkab949
  • 45. Chang W, Cheng J, Allaire J, Sievert C, Schloerke B, Xie Y. et al. Shiny: Web Application Framework for R; 2024. R Package Version 1.8.1.1. Available from: https://CRAN.R-project.org/package=shiny (6 February 2025, date last accessed).
  • 46. Li X, Wang K, Lyu Y. et al. Deep learning enables accurate clustering with batch effect removal in single-cell RNA-seq analysis. Nat Commun 2020;11:2338. doi: 10.1038/s41467-020-15851-3



Articles from Briefings in Bioinformatics are provided here courtesy of Oxford University Press
