Abstract
In this paper, we develop a heterogeneous graph neural network, STAMapper, to transfer cell-type labels from single-cell RNA-sequencing (scRNA-seq) data to single-cell spatial transcriptomics (scST) data. We collect 81 scST datasets consisting of 344 slices and 16 paired scRNA-seq datasets from eight technologies and five tissues to validate the efficiency of STAMapper. In terms of accuracy, STAMapper achieves the best performance on 75 of the 81 datasets compared with competing methods. STAMapper demonstrates enhanced performance over manual annotations, particularly at the boundaries of cell clusters, enables the detection of unknown cell types in scST data, and provides precise cell-subtype annotations.
Supplementary Information
The online version contains supplementary material available at 10.1186/s13059-025-03773-6.
Background
Single-cell RNA sequencing (scRNA-seq) technologies allow us to study whole-transcriptome changes at the single-cell level to shape the diversity of cell types and their dynamic changes [1]. However, the spatial position information of single cells tends to be lost due to the dissociation during the sequencing process, which prevents us from understanding the relationship between gene expression and the tissue architecture and hinders us from deciphering the complex interactions between cells and their microenvironments under spatial context [2]. In recent years, spatial transcriptomics has rapidly evolved, enabling high-resolution gene expression mapping within tissue architecture. Alongside technological advances, a growing number of computational methods have been developed to analyze spatial transcriptomics data, focusing on various tasks such as spatial domain identification (e.g., IRIS [3], STAGATE [4]), detection of spatially variable genes (e.g., PROST [5], STANCE [6], STAMarker [7]), and spatial cell–cell communication (e.g., COMMOT [8], DeepTalk [9]). More recently, the emergence of single-cell spatial transcriptomics (scST) technologies such as MERFISH [10, 11], seqFISH [12], seqFISH+ [13], osmFISH [14], STARmap [15], STARmap PLUS [16], NanoString [17], and Slide-tags [18] has enabled the profiling of gene expression with its spatial context at single-cell resolution.
The essential problem in scRNA-seq and scST data analysis is cell-type annotation (or cell typing) [19, 20]. The standard workflow [21] for cell-type annotation in scRNA-seq data is normalization, gene selection (usually the top 2000 highly variable genes), dimensionality reduction, clustering, and assigning a cell type to each cluster according to the expression of known marker genes. When dealing with scST data, the above workflow may fail since the sequencing quality of scST technologies is far lower than that of the mature scRNA-seq technologies. Specifically, these spatially resolved technologies typically focus on a pre-defined set of marker genes or genes relevant to biological processes (usually far fewer than 2000, Additional file 1: Table S1). In the case of Slide-tags, a whole-transcriptome single-nucleus spatial technology, approximately 75% of nuclei are lost during sequencing [18]. These factors may lead to clustering instability and blurred cluster boundaries, resulting in inaccurate cell-type annotation. Moreover, as some of the markers for rare cell types may be absent in ST data, the related cells may be annotated incorrectly or overlooked. For the MERFISH hypothalamic data [22], annotation was performed by clustering on all genes, refining neuronal subtypes through secondary clustering and doublet removal, and aligning clusters to scRNA-seq using correlation, classification, and anatomical constraints. This multi-step process is both time-consuming and complex. Therefore, accurate and fast annotation of single-cell ST data remains challenging.
With more scRNA-seq data available, reference-based scST annotation methods have been proposed to transfer cell-type labels to query datasets by leveraging the well-annotated reference dataset [23–25]. For example, scANVI employs a variational autoencoder architecture to learn a latent space of cellular states for both the scRNA-seq and scST data and utilizes the mean of the variational distribution associated with each cell to perform annotation [23]. RCTD utilizes a regression framework to model cell-type profiles in reference and account for platform effects, facilitating cell-type identification in spatial data [24]. Tangram maps scRNA-seq profiles onto ST data by maximizing the cosine similarity of the predicted and the observed expression matrix [25]. These methods can predict cell-type labels in datasets generated by MERFISH (scANVI) [26], Slide-seq (RCTD) [24], and STARmap (Tangram) [25]. However, these existing methods may fail to reveal fuzzy boundaries in scST annotations, and due to the lack of incorporating gene modeling, they cannot identify gene modules either shared by scRNA-seq and scST data or unique to each of them. Furthermore, to the best of our knowledge, there has not yet been a substantial number of prepared real datasets to evaluate the accuracy and robustness of the cell-type annotation methods across different ST technologies and tissue origins.
To this end, we develop STAMapper to accurately annotate cells from scST data by a heterogeneous graph neural network [27] with a graph attention classifier. Also, we collected 81 paired scRNA-seq and scST datasets with manual annotations and manually aligned them to evaluate the annotation accuracy. Extensive tests and comparisons with existing methods demonstrated the superiority of STAMapper in various biological applications, i.e., cell-type mapping of scST data, reannotation of blurred cell types, unknown cell-type detection, subtype annotation, and gene module extraction. Additionally, the collected data can serve as a benchmark for scST annotation, and the annotation results of STAMapper can serve as a reference.
Results
Overview of STAMapper
STAMapper takes a well-annotated scRNA-seq dataset and a scST dataset as input, where the two data matrices are normalized. STAMapper first constructs a heterogeneous graph, where the cells and genes are modeled as two distinct types of nodes and connected with edges based on whether the genes are expressed in the cells. Two cells from each dataset are connected if they exhibit similar gene expression patterns. Each node is connected to itself to indicate it utilizes the information from the previous step when updating its embedding (Fig. 1, Methods).
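The graph construction described above can be illustrated with a minimal sketch. The function name `build_hetero_edges`, the cosine-similarity k-nearest-neighbour rule for cell-cell edges, and the parameter `k` are assumptions made for illustration; the actual construction used by STAMapper is specified in Methods.

```python
import numpy as np

def build_hetero_edges(ref_expr, query_expr, k=2):
    """Toy edge lists for a cell-gene heterogeneous graph (illustrative only).

    ref_expr, query_expr: (cells x shared genes) normalized expression
    matrices for the scRNA-seq reference and the scST query.
    Cells are indexed with reference cells first, then query cells.
    """
    ref_expr = np.asarray(ref_expr, dtype=float)
    query_expr = np.asarray(query_expr, dtype=float)
    expr = np.vstack([ref_expr, query_expr])
    n_ref = ref_expr.shape[0]
    n_cells, n_genes = expr.shape

    # Cell-gene edges: a cell is linked to every gene it expresses.
    cell_idx, gene_idx = np.nonzero(expr > 0)
    cell_gene = list(zip(cell_idx.tolist(), gene_idx.tolist()))

    # Cell-cell edges across datasets: each query cell is linked to its k most
    # similar reference cells (cosine similarity is an assumed similarity rule).
    norms = np.linalg.norm(expr, axis=1, keepdims=True)
    unit = expr / np.where(norms == 0, 1.0, norms)
    sim = unit[n_ref:] @ unit[:n_ref].T
    nearest = np.argsort(-sim, axis=1)[:, :k]
    cell_cell = [(n_ref + q, int(r))
                 for q in range(sim.shape[0]) for r in nearest[q]]

    # Self-loops let each node reuse its own embedding from the previous step.
    cell_self = [(i, i) for i in range(n_cells)]
    gene_self = [(j, j) for j in range(n_genes)]
    return {"cell_gene": cell_gene, "cell_cell": cell_cell,
            "cell_self": cell_self, "gene_self": gene_self}
```

In this sketch the four edge lists correspond to the relation types named in the text: cell-gene expression edges, cross-dataset cell-cell similarity edges, and self-loops for both node types.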
Fig. 1.
Illustration of STAMapper and its applications. a STAMapper can annotate scST data obtained from mainstream technologies such as image-based and seq-based by leveraging well-annotated sc/snRNA-seq data sequenced from microfluidics-based or droplet-based technologies. b STAMapper models genes and cells as two types of heterogeneous nodes and connects sc/snRNA-seq and scST data by their expression on the shared genes. c STAMapper takes the expression and the heterogeneous relationships of nodes as input. STAMapper then learns embeddings for cells and genes based on the information propagation mechanism on the heterogeneous graph network to fit cell labels from scRNA-seq data by using a graph attention classifier, ultimately utilizing the learned weights of information propagation on the graph to transfer cell labels on spatial data. d The output of STAMapper can be applied for annotation on large-scale scST data, reannotation on scST data, unknown cell-type detection, and gene module extraction
For each cell node, the initial input is the corresponding normalized gene expression vector. The gene nodes obtain their initial embedding by aggregating the input from the connected cell nodes (Methods). STAMapper updates the latent embedding of each cell or gene node based on the message-passing mechanism with information from its neighbors. It utilizes the embedding of gene nodes as the input of a graph attention classifier to estimate the probability of the cell-type identity, wherein each cell assigns varying attention weights to its connected genes. STAMapper uses a modified cross-entropy loss [28] (Methods) to quantify the discrepancy between the predicted and original cell-type labels for cells in the scRNA-seq dataset. Finally, through backpropagation, STAMapper updates the weights of parameters for different edges until the model converges. STAMapper determines gene modules based on the learned embeddings of gene nodes with the Leiden clustering algorithm [29] and applies the outputs of the graph attention classifier to assign cell-type labels to cells in the scST dataset.
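To make the role of the attention classifier concrete, the following is a simplified forward pass, not the STAMapper layer itself. It assumes single-head scaled dot-product attention between a cell embedding and the embeddings of its connected genes, followed by a linear read-out; the names `attention_classify` and `w_out` are hypothetical, and the exact formulation is given in Methods.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def attention_classify(cell_emb, gene_emb, adjacency, w_out):
    """Cell-type probabilities from gene embeddings via per-cell attention.

    cell_emb:  (n_cells, d) cell embeddings
    gene_emb:  (n_genes, d) gene embeddings
    adjacency: (n_cells, n_genes) True where the cell is connected to the gene;
               every cell is assumed to have at least one connected gene
    w_out:     (d, n_types) read-out weights (hypothetical parameter)
    """
    cell_emb = np.asarray(cell_emb, dtype=float)
    gene_emb = np.asarray(gene_emb, dtype=float)
    adjacency = np.asarray(adjacency, dtype=bool)
    w_out = np.asarray(w_out, dtype=float)

    # Attention scores between each cell and each gene (assumed dot-product form).
    scores = cell_emb @ gene_emb.T / np.sqrt(cell_emb.shape[1])
    # Unconnected genes receive zero attention weight after the softmax.
    scores = np.where(adjacency, scores, -np.inf)
    attn = softmax(scores, axis=1)

    # Each cell aggregates the embeddings of its connected genes.
    pooled = attn @ gene_emb
    probs = softmax(pooled @ w_out, axis=1)
    return probs, attn
```

In training, the probabilities of reference cells would be compared with their labels through the loss described above, and the same forward pass would then be applied to scST cells; the returned `attn` matrix is the per-cell attention over genes.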
STAMapper enables accurate cell-type mapping for scST data
We collected 81 single-cell ST datasets comprising 344 slices and 16 paired scRNA-seq datasets from identical tissues. These scST datasets originate from eight single-cell ST technologies, i.e., MERFISH [10], NanoString [17], STARmap [30], STARmap PLUS [16], Slide-tags [18], osmFISH [14], seqFISH [12], seqFISH+ [13], and five different tissues, i.e., brain, embryo, retina, kidney, liver (Fig. 2a). All datasets come with manual annotations provided by the authors, and the cell-type labels in paired scRNA-seq and spatial datasets were aligned manually.
Fig. 2.
Benchmarking cell annotation performance of STAMapper. a Overview of all datasets used for evaluating the performance of STAMapper. We collected 81 single-cell spatial transcriptomics datasets comprising a total of 344 slices, where each dataset is matched with corresponding single-cell transcriptomics data (or scRNA-seq data). b Performance comparison of STAMapper and scANVI, RCTD, Tangram regarding cell annotation accuracy on 81 pairs of scRNA-seq and single-cell spatial transcriptomics datasets. P values were calculated by paired t test. c Performance comparison of the classification accuracies of STAMapper and three other methods on different down-sampling rates (1.0, 0.8, 0.6, 0.4, 0.2) for read counts, where the down-sampling rate of 1.0 means the raw data. The upper panel depicts spatial transcriptomics datasets with more than 200 genes for sequencing (47 datasets), while the lower panel corresponds to fewer than 200 genes (34 datasets)
We quantitatively evaluated the cell-type annotation performance of STAMapper and the competing methods, i.e., scANVI [23], RCTD [24], and Tangram [25], in terms of accuracy, macro F1 score, and weighted F1 score (Methods). STAMapper demonstrated significantly higher accuracy in annotating cells from scST datasets compared with scANVI (p = 2.2e-14), RCTD (p = 1.3e-27), and Tangram (p = 1.3e-36) (Fig. 2b). STAMapper also achieved the best overall performance with the macro F1 score, compared with scANVI (p = 5.8e-16), RCTD (p = 7.8e-29), and Tangram (p = 1.5e-40) for the imbalanced cell-type distributions (Additional file 1: Fig. S1a). Also, STAMapper showed a significant advantage over all other methods in weighted F1 score (Additional file 1: Fig. S1b). scANVI ranked the second best across the three metrics. These results suggest that STAMapper exhibits the best annotation capability and proficiently identifies the rare cell types, which is crucial in cell-type annotation.
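For reference, the three metrics can be computed as in the sketch below. It assumes that the set of classes is taken from the manual (true) labels; whether classes that appear only in the predictions are also counted is a convention set in Methods, so this is an illustration rather than the exact evaluation code. The function name `annotation_scores` is hypothetical.

```python
from collections import Counter

def annotation_scores(true_labels, pred_labels):
    """Accuracy, macro F1, and weighted F1 over the classes in true_labels."""
    n = len(true_labels)
    pairs = list(zip(true_labels, pred_labels))
    support = Counter(true_labels)
    classes = sorted(support)

    accuracy = sum(t == p for t, p in pairs) / n

    f1 = {}
    for c in classes:
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        denom = 2 * tp + fp + fn
        f1[c] = 2 * tp / denom if denom else 0.0

    # Macro F1 weights every cell type equally, so rare types count as much
    # as abundant ones; weighted F1 weights each type by its number of cells.
    macro_f1 = sum(f1.values()) / len(classes)
    weighted_f1 = sum(f1[c] * support[c] for c in classes) / n
    return {"accuracy": accuracy, "macro_f1": macro_f1, "weighted_f1": weighted_f1}
```

Because macro F1 averages over cell types without regard to their size, a method that misses rare cell types is penalized more under macro F1 than under accuracy or weighted F1.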
We evaluated the performance under poor sequencing quality by applying four different down-sampling rates to the scST data. STAMapper consistently demonstrated the highest accuracy, macro F1 score, and weighted F1 score (Fig. 2c, Additional file 1: Fig. S1c, d). This trend is particularly distinct in scST datasets with fewer than 200 genes, where at a down-sampling rate of 0.2, STAMapper exhibited a much higher accuracy than the second-ranked method, scANVI (median 51.6% vs. 34.4%) (Fig. 2c). For scST datasets with more than 200 genes, across all down-sampling rates (0.2, 0.4, 0.6, and 0.8), STAMapper still achieved the highest annotation accuracy, macro F1 score, and weighted F1 score, although the performance margin was smaller (Fig. 2c, Additional file 1: Fig. S1c, d). For datasets containing more than 200 genes, RCTD outperformed scANVI on the raw data (25 of 34 datasets) and showed comparable performance on the down-sampled data (Fig. 2c). scANVI tended to outperform RCTD on scST datasets with fewer than 200 genes, both on the raw data (41 of 47 datasets) and on the down-sampled data, regarding accuracy and weighted F1 score. In terms of macro F1 score, scANVI performed comparably to RCTD on datasets containing more than 200 genes and consistently outperformed RCTD on scST datasets with fewer than 200 genes (Additional file 1: Fig. S1c, d). Additionally, to account for the sensitivity of deep learning methods to hyperparameters, we tested the accuracy of scANVI under various hyperparameters (Additional file 1: Fig. S1e). We also compared STAMapper with CellTrek [31] and SeuratV4 [32], where STAMapper still achieved the best performance with significantly higher accuracy, macro F1, and weighted F1 scores (Additional file 1: Fig. S1f).
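One common way to simulate lower sequencing depth is binomial thinning of the count matrix, in which every read is kept independently with probability equal to the down-sampling rate. The sketch below uses this approach as an assumption; the procedure actually used in the benchmark is described in Methods, and the function name `downsample_counts` is hypothetical.

```python
import numpy as np

def downsample_counts(counts, rate, seed=0):
    """Keep each read independently with probability `rate` (binomial thinning).

    counts: (cells x genes) matrix of non-negative integer read counts
    rate:   down-sampling rate in [0, 1]; 1.0 returns the raw counts
    """
    counts = np.asarray(counts, dtype=np.int64)
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    rng = np.random.default_rng(seed)
    return rng.binomial(counts, rate)
```

With this scheme the expected total count per cell scales linearly with the rate, so a rate of 0.2 retains about one fifth of the reads on average.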
STAMapper facilitates precise cell-type mapping within the retinal laminar structure
We applied STAMapper to ten MERFISH datasets derived from the mouse retina [33], a highly organized tissue, on which we assessed the accuracy of annotations from the perspective of cell spatial positions beyond checking the expression of markers. Here, we selected five scRNA-seq datasets of the mouse retina collected from postnatal 0 h to 60 days (P60) as the reference data [34] to test the robustness of STAMapper. As expected, STAMapper consistently outperformed the other three methods on every spatial dataset according to accuracy, macro F1 score, and weighted F1 score (Fig. 3a, Additional file 1: Fig. S2a–c). Additionally, STAMapper exhibited the lowest variance in accuracy and weighted F1 score, demonstrating its robustness to changes in the reference data. Due to the poor performance of Tangram, we did not include it in subsequent analyses.
Fig. 3.
Application of STAMapper to MERFISH retina datasets. a Performance comparison of STAMapper and scANVI, RCTD, Tangram, where each box represents the method’s performance on the 50 paired datasets (five scRNA-seq datasets and ten single-cell spatial transcriptomics datasets). b UMAP plots of mouse retinal dataset (VZG105a_WT3) cells colored by the manual annotation and the prediction of STAMapper, scANVI, and RCTD using the mouse_LD_60 scRNA-seq dataset as the reference. AC amacrine cells, EC endothelial cells, MG Müller glia, PC pericytes, RET reticulocytes, HC retinal horizontal cells, BC bipolar cells, Cones cone cells, RGC retinal ganglion cells, RPE retinal pigment epithelium, Rods rod cells. c The heatmap of the marker expression for major cell types on the scRNA-seq dataset grouped by manual annotation and on the corresponding spatial transcriptomics dataset annotated by STAMapper, scANVI, and RCTD, respectively. d A schematic illustration of the distribution of cell types within the retina. e Spatial organization of a slice from spatial transcriptomics dataset corresponding to (b), where cells are colored by the annotation by STAMapper, scANVI, and RCTD, respectively
scRNA-seq often allows for the identification of rare cell types since it can measure the expression of most genes. However, some cell types identified in single-cell data, including endothelial cells (EC), pericytes (PC), retinal pigment epithelium (RPE), and reticulocytes (RET), do not appear in the manual annotations of spatial data, possibly due to the limited number of genes sequenced (Additional file 1: Table S1) or inadequacies in clustering methods. In particular, we used the WT3 sample as the spatial data, which contained only 368 genes and did not capture all the markers [34] of the cell types in the scRNA-seq data. In this case, STAMapper, scANVI, and RCTD were able to annotate some of these cell types. STAMapper annotated a group of rods (black dotted line in Fig. 3b), distant from the major cluster, as RPE, whereas scANVI and RCTD annotated them as PC. Additionally, scANVI and RCTD identified a cluster of RETs (purple dotted line in Fig. 3b and Additional file 1: Fig. S2d). To determine whether the annotated cell types indeed exist in the scST data, we first selected markers [34] present in the original spatial data and differentially expressed genes for the newly annotated cell types, and further validated them with the scRNA-seq data (Fig. 3c). The expression of these genes confirmed the accurate annotation of STAMapper. The PCs annotated by scANVI and RCTD did not express Nr2f2, and the RETs annotated by scANVI also showed inconsistent expression of Cd24a; both genes show highly differential expression in the corresponding cell types in the scRNA-seq data (Fig. 3c). This indicated that there could be inaccuracies in their annotation.
We further examined their spatial locations to validate the annotation of newly discovered cell types. Here, scANVI and RCTD identified the cells in the outermost layer of the retina as ECs and PCs (black dashed ellipse in Fig. 3e), which does not align well with their real anatomical location [35, 36] (the inner layer of the retina) (Fig. 3d and Additional file 1: Fig. S2f–g), while STAMapper annotated this group of cells as RPE cells, which expressed the corresponding markers in the appropriate anatomical positions [37] (Fig. 3c–e and Additional file 1: Fig. S2e). Unlike STAMapper, scANVI and RCTD mistakenly annotated PCs among the rods (black dashed ellipse in Fig. 3e), although PCs should be located on the retina’s inner side; this is consistent with our observations on the Uniform Manifold Approximation and Projection (UMAP) (Fig. 3b and Additional file 1: Fig. S2f). In addition, scANVI and RCTD wrongly identified many RET cells between rods (purple dashed ellipse in Fig. 3e), which should be located at the innermost part of the retina (Fig. 3d and Additional file 1: Fig. S2d). Similar situations can be observed across the other six slices (Additional file 1: Fig. S2h–m). Overall, STAMapper uncovered the cell types that clustering alone fails to identify in spatial retinal data by making full use of the detailed annotations from scRNA-seq data. More importantly, the annotation of STAMapper aligned well with the retinal architecture, spanning from the outer retinal cells to the inner supporting cells.
STAMapper corrects the blurred cell-type annotations caused by clustering-induced boundaries
We further applied STAMapper to the MERFISH hypothalamic data (ID = 15) collected from mouse brain [22] with the reference scRNA-seq data from the same study to test whether it can help correct the cell-type annotations at the boundaries detected by clustering algorithms. STAMapper achieved the highest annotation accuracy (86.3%) compared with scANVI (72.6%) and RCTD (67.3%). Specifically, scANVI mistakenly identified many macrophages mixed with inhibitory neurons and some fibroblasts at the boundary of OD mature cells. RCTD mistakenly annotated some ependymal cells mixed with inhibitory neurons and excitatory cells. In contrast, the annotation of STAMapper avoided the mixing of cell types between two non-transitional cells and showed smoothness at the boundaries of the clusters (Fig. 4a and Additional file 1: Fig. S3a). Taking a close look at a slice of this sample, we discovered that STAMapper accurately recovered the “arrow-like” excitatory neurons in the middle of the slice, whereas RCTD mistakenly annotated them as a mixture of excitatory neurons and ependymal cells. STAMapper correctly identified the cellular microenvironment dominated by inhibitory neurons, whereas scANVI and RCTD failed to do so (Fig. 4b and Additional file 1: Fig. S3a). For the remaining three slices, STAMapper maintained the highest degree of consistency with manual annotations (Additional file 1: Fig. S3b–d). The Sankey plot of the entire sample demonstrated that scANVI and RCTD incorrectly assigned many inhibitory neurons to other cell types such as astrocytes, macrophages, and excitatory neurons (Fig. 4c). scANVI predicted as macrophages a considerable number of cells that were manually annotated as inhibitory neurons. However, these cells did express the inhibitory neuron marker Gad1, indicating annotation errors by scANVI (Additional file 1: Fig. S3e).
Fig. 4.
Application of STAMapper to MERFISH hypothalamic dataset. a UMAP plots of mouse hypothalamic dataset colored by the manual annotation and the prediction of STAMapper, scANVI, and RCTD, respectively. b Spatial organization of a slice from mouse hypothalamic dataset corresponding to (a), cells are colored by the manual annotation and the prediction of STAMapper, scANVI, and RCTD, respectively. c Sankey plot showing the accuracy of the cell-type annotation by STAMapper, scANVI, and RCTD, respectively. The left side of the Sankey plot represents manual annotations, while the right side shows the predicted results. The height of each linkage line reflects the number of cells. d Heatmap plot of marker expression for major cell types presented in manual annotation with mismatched labels between manual annotation and STAMapper. e Expression levels of Sema4d (a marker of OD Newly formed) across different cell types (annotated by STAMapper). f The predicted probability of STAMapper for cells from spatial data (left panel) and unknown cells from spatial data (right panel), the red dash line indicates x = 0.738 in both panels. g Cell-type level distance from spatial data to single-cell data on cell embeddings learned by STAMapper. Bold indicates unknown cells were predicted as this specific cell type, and red denotes cell types present in single-cell data but not annotated in spatial data by manual annotation. h UMAP plots of the co-embedding of scRNA-seq and spatial data learned by STAMapper. Cells are colored by manual annotation, STAMapper prediction without unknown detection, and STAMapper prediction with unknown detection. The percentages in parentheses represent the predicted accuracy
Although STAMapper exhibited the highest annotation accuracy, some cells showed inconsistencies with manual annotations. These cells tended to be located at the boundaries of cell clusters, e.g., around microglia (Fig. 4a). We examined the cell types already present in the manual annotations. Some cell types did not express the corresponding markers from the original study [22]. Yet the annotation of STAMapper aligned precisely with the expression of marker genes (Fig. 4d). Also, STAMapper identified OD newly formed cells, which were not previously recognized in manual annotations, as verified by the high expression of their marker gene Sema4d [38] (Fig. 4e).
We further extended STAMapper to detect unknown cell types in spatial data. Here, we removed astrocytes from the scRNA-seq dataset and aimed to identify them as unknown cells in the spatial data. We defined the potentially unknown cells by the following two rules: (i) a predicted probability lower than the median value of all cells from spatial data (global prediction confidence), and (ii) a cell-type level distance from scST data to scRNA-seq data higher than a user-defined threshold (local distributional discrepancy) (Fig. 4f, g, Methods). In addition to tuning the threshold, we offered the option to specify low-confidence cell types, which provides flexibility and can be tuned based on dataset characteristics, e.g., by inspecting the expression of marker genes in cell types that display a large distance. In this scenario, STAMapper without unknown cell detection initially achieved an accuracy of 76.4%, annotating astrocytes as ependymal cells, a reasonable choice since ependymal cells are another type of glial cell. Leveraging the mechanism of unknown detection, STAMapper successfully annotated astrocytes as unknown cells and increased the accuracy to 83.8% (Fig. 4h). We also performed another test, where we removed OD immature cells from the scRNA-seq dataset, and STAMapper again correctly annotated these cells as unknown (Additional file 1: Fig. S3f–h). We also removed endothelial cells (Additional file 1: Fig. S3i–k), inhibitory and excitatory neurons (Additional file 1: Fig. S3l–n), and OD mature cells and astrocytes (Additional file 1: Fig. S3o–q) from the scRNA-seq dataset, and STAMapper continued to annotate these cells as unknown and largely improved the accuracy of annotation. Therefore, STAMapper could rectify cell types at classification boundaries that are challenging to identify with clustering algorithms and detect unknown cell types in spatial data.
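The two rules can be combined as in the following sketch. It assumes that the cell-type level distances have already been computed from the learned embeddings (see Methods) and are supplied as a dictionary; the function name `flag_unknown`, its arguments, and the toy values in the usage note are hypothetical.

```python
import numpy as np

def flag_unknown(pred_types, pred_probs, type_distance, distance_threshold):
    """Relabel cells as "Unknown" when both detection rules hold.

    pred_types:         predicted cell type per spatial cell
    pred_probs:         predicted probability of that type per spatial cell
    type_distance:      dict mapping cell type -> distance from scST to scRNA-seq
                        (assumed to be precomputed)
    distance_threshold: user-defined threshold on the cell-type level distance
    """
    probs = np.asarray(pred_probs, dtype=float)

    # Rule (i): probability below the median over all spatial cells.
    low_confidence = probs < np.median(probs)

    # Rule (ii): the predicted cell type lies far from its reference counterpart.
    far_types = {t for t, d in type_distance.items() if d > distance_threshold}

    return ["Unknown" if low and t in far_types else t
            for t, low in zip(pred_types, low_confidence)]
```

For example, with predictions `["Ependymal", "Ependymal", "Inhibitory", "Excitatory"]`, probabilities `[0.4, 0.9, 0.95, 0.5]`, and a distance above the threshold only for the ependymal type, only the first cell is relabeled: it is the one cell that is both low-confidence and assigned to a distant cell type.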
STAMapper achieves precise annotations aiding in deciphering the tumor microenvironment
We applied STAMapper to the human hepatocellular carcinoma spatial data sequenced by NanoString technology [17], and the scRNA-seq data originated from the same type of tumor lesions as a reference [39]. Compared to scANVI and RCTD, the annotation of STAMapper could reveal distinct boundaries between different cell types and achieve the highest consistency with the annotations provided by the authors (Fig. 5a, b). The expression of markers aligned closely with the corresponding major cell types [40], i.e., PROX1 (Malignant), CD163 (Macro), NKG7 (NK), CD3D (T cell), CD8A (CD8 T), IL7R (CD4 T), PECAM1 (Endothelial), COL1A1 (Fibroblast), JCHAIN (Mature B) (Fig. 5a, c). We selected two regions of interest (ROIs) in the tissue section (Additional file 1: Fig. S4a). ROI 1 was mainly malignant cells marked with highly expressed PROX1. scANVI annotated more DCs in this region, while RCTD annotated some cholangiocytes. However, the markers corresponding to these two cell types are barely expressed in this region (Additional file 1: Fig. S4b, c). ROI 2 appeared as a region mixed with immune cells and malignant cells, where numerous cells are distributed in a circular formation in the middle of this region (Fig. 5d). STAMapper annotated it as a microenvironment where macrophages enveloped malignant cells, T cells located in the outer side of this structure, and the exterior layer consisted of mature B cells. RCTD annotated the innermost layer with cholangiocytes surrounded by malignant cells, whereas scANVI annotated the innermost layer as mainly cholangiocytes with very few malignant cells. The expression of PROX1 validated the presence of malignant cells in the innermost layer, and its expression is consistent with the annotations of STAMapper (Fig. 5d, e). Meanwhile, cells in this region barely expressed KRT7, a marker of cholangiocyte [41], indicating that the annotations by RCTD and scANVI were incorrect (Fig. 5e). 
Interestingly, all three methods agreed that there exists a reticular structure of macrophages in this area (Additional file 1: Fig. S4d), and in fact, they surrounded the malignant cells, forming a unique microenvironment (Fig. 5d).
Fig. 5.
Application of STAMapper to NanoString HCC dataset. a UMAP plots of the human HCC dataset colored by STAMapper, scANVI, and RCTD, respectively. Macro macrophage, NK natural killer, DC dendritic cell. b Accuracy of STAMapper, scANVI, RCTD, and Tangram on human HCC dataset. c UMAP plot of the normalized marker expression corresponding to major cell types. d Spatial organization of ROI 1, cells are colored by the annotation of STAMapper, RCTD, and scANVI, respectively. e The normalized marker expression on ROI 1. f Density plot for the distribution of macro cells, annotated by STAMapper. g Physical distance of immune cells to malignant cells, with cells being annotated by STAMapper. h Boxplot for the scores of selected pathways (stemness, proliferation, and MHC-I) on malignant cells near macro and other malignant cells. i Heatmap displaying genes with the highest normalized attention weights, categorized by each cell type. We aggregate the normalized attention weights from that gene to all cells (Methods) belonging to the cell type and compute their average as the cell type’s normalized attention weights. These scores reflect the gene’s overall contribution to the annotation of that cell type. j Cosine similarity of the gene embedding pairs learned by STAMapper, where gene pairs were TFs collected from hTFtarget. RELA-EPCAM and STAT3-SMAD3 were validated to exist in HCC malignant cells in the literature
We found that macrophages were widely present in the sections, especially at the edges of the malignant cells, and they were the immune cells closest to the malignant cells (Fig. 5f, g). Macrophages are considered to play a crucial role in tumor immune evasion and also serve as pro-inflammatory mediators [42]. To investigate the role of this unique malignant-macrophage microenvironment, we classified the malignant cells into two groups, i.e., malignant cells near macrophages and others. The malignant cells near macrophages showed higher stemness [43] and MHC-I scores [44] but lower proliferation [40], indicating that, for these cells, the increase in stemness was attributable neither to proliferation nor to a reduced likelihood of being recognized by T cells. It is therefore likely due to the influence of macrophages (Fig. 5h).
STAMapper could identify the most critical genes for each cell type and most of them are marker genes or differentially expressed genes, e.g., PECAM1 for Endothelial [45], HLA-A for Mature B [46], NKG7 for NK [47], C1QB for Macro [48], etc., which enhanced the model’s interpretability (Fig. 5i, Methods). Additionally, STAMapper provided gene embeddings that enhanced the similarity of functionally related gene pairs collected from hTFtarget [49]. We found that two TF-target pairs, validated in the literature [50, 51], exhibited cosine similarities close to 1 (Fig. 5j, Methods). Collectively, STAMapper could provide precise annotations with interpretability, helping to discover and explain the complex tumor microenvironment.
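The per-cell-type gene scores can be obtained by averaging the normalized attention weights over the cells of each type. The sketch below illustrates this aggregation on a toy matrix; the function name `celltype_attention`, the argument layout, and the numbers in the usage note are made up for illustration, and the normalization itself is described in Methods.

```python
import numpy as np

def celltype_attention(attn, cell_types, gene_names, top_n=1):
    """Average normalized attention weights per cell type and rank genes.

    attn:       (cells x genes) normalized attention weights
    cell_types: cell-type label per cell (one per row of attn)
    gene_names: gene name per column of attn
    Returns a dict: cell type -> list of (gene, mean weight), highest first.
    """
    attn = np.asarray(attn, dtype=float)
    cell_types = np.asarray(cell_types)
    ranked = {}
    for cell_type in sorted(set(cell_types.tolist())):
        # Mean attention from every gene to the cells of this type.
        mean_weight = attn[cell_types == cell_type].mean(axis=0)
        order = np.argsort(-mean_weight)[:top_n]
        ranked[cell_type] = [(gene_names[i], float(mean_weight[i])) for i in order]
    return ranked
```

As a toy example, with two NK cells whose weights on NKG7 are 0.7 and 0.5, the NK score of NKG7 is their mean, 0.6, and NKG7 ranks first for NK if no other gene has a higher mean weight.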
STAMapper reveals the cellular subpopulation localization within the layered structure of the human cortex
We applied STAMapper to the human prefrontal cortex (PFC) sequenced by Slide-tags, a whole-transcriptome single-nucleus spatial technology [18], and the single-nucleus RNA dataset profiled for PFC samples as a reference [52]. STAMapper successfully aligned identical cell types across two datasets, including glial cells and neurons (Fig. 6a, b), and provided precise annotations for the scST data (Fig. 6c). The vascular cells (VCs, from the scRNA-seq data) and endothelial cells (ECs, from the scST data) were aligned well. That is reasonable since EC is a subtype of VC, and they both expressed ITIH5, a marker of ECs [18] (Additional file 1: Fig. S5a). Utilizing the cell embeddings provided by STAMapper, we discovered a distinctive subset of oligodendrocytes (oligo) in the single-cell data that expressed GPR17 (Fig. 6b, d). This gene acts as an intrinsic timer of oligodendrocyte differentiation and myelination [53], suggesting these cells may be in a state of differentiation or development (labeled as Oligo_GPR17 for further analysis).
Fig. 6.
Application of STAMapper to Slide-tags human prefrontal cortex dataset. a, b UMAP plot for the co-embedding of scRNA-seq and spatial dataset learned by STAMapper. Cells are colored based on the prediction of STAMapper (a) and the origin of the datasets (b), respectively. Oligo oligodendrocytes, OPC oligodendrocyte progenitor cells. c The predicted cell-type probabilities for each cell (each column) in the spatial data. A maximum of 50 cells was subsampled from each type for visualization. d UMAP plots showing the co-embedding of the scRNA-seq and spatial dataset learned by STAMapper, cells are colored by the normalized expression levels of GPR17. e Boxplots of the cosine similarity between gene embedding pairs grouped by the number of shared pathways. f UMAP plot for the distribution of gene embedding. Genes are colored by clusters identified through the Leiden algorithm. g Abstracted graph of the heterogeneous cell-gene graph, where nodes represent cell types (pink) or gene modules (blue). Node size reflects the number of cells in a cell type or genes in a module. Edge width varies with the average expression levels of cell types linked to gene modules, determined by STAMapper. h UMAP plots showing the co-embedding of scRNA-seq and spatial dataset learned by STAMapper, cells are colored by the normalized expression levels of Module 12. i Enrichment analysis of gene module 12 related to Oligo_GPR17 cells. j Spatial organization of cells from the spatial dataset. Cells are clustered by STAGATE with resolution = 0.05. k The normalized expression of SYT4 (marker gene of grey matter) and LRP2 (marker gene of white matter). l–n Spatial organization and UMAP plot of astrocyte (l), excitatory (m), and inhibitory (n) subtypes predicted by STAMapper from the spatial dataset; subtypes with more than 20 cells are shown. The UMAP coordinates are calculated from the expression of the spatial data
To further understand whether the gene embeddings provided by STAMapper captured biological meanings for each pair of input genes, we explored relationships between the number of shared pathways (Reactome [54] and cell type signatures from MSigDB [55]) and the gene cosine similarities (Methods). The cosine similarities were relatively low (median = 0.12) for gene pairs not present together in any biological pathway. We observed a clear gap in the cosine similarities for gene pairs that shared a pathway (median = 0.45). The more pathways a gene pair shared, the higher its cosine similarity was (Fig. 6e). The cosine similarity achieved a median value of 0.77 for gene pairs occurring in at least ten pathways. Such a trend also existed in the raw data and was enhanced by STAMapper (Additional file 1: Fig. S5b). We next extracted 14 gene modules from these informative gene embeddings (Fig. 6f, Methods). They shared similar transcriptional patterns in the common cell types identified in both scRNA-seq and spatial data (Fig. 6g). Module 12 tended to be expressed in the oligodendrocyte progenitor cells (OPCs) and Oligo_GPR17 but not in oligos, and was enriched in pathways related to differentiation and development, i.e., glial cell-derived neurotrophic factor receptor signaling pathway, glial cell migration, gliogenesis, and regulation of Wnt signaling pathway [56] (Fig. 6g–i, Methods). These results further indicated that Oligo_GPR17 represents a cell type distinct from traditional oligos, potentially arising from the differentiation of OPCs.
The layered structure is a main characteristic of the cortex [57]. We next performed subtype annotations on the scST data to explore their association with the cortex structure (Additional file 1: Fig. S5c, f). To reveal the hierarchical structure of the cortex, we initially employed the clustering method STAGATE [58] at a low resolution. STAGATE divided the section into white matter (WM) and gray matter (GM), verified by corresponding markers [59] (Fig. 6j, k, Methods). The astrocytes expressing GFAP were associated with WM in mice [60], and SLC1A2 tended to be expressed in the GM area [61]. STAMapper correctly annotated Astro_GFAP and Astro_SLC1A2 to the corresponding regions (Fig. 6l). The oligos were preferentially located in WM [62] (Additional file 1: Fig. S5d). In contrast, the excitatory and inhibitory neurons were located in GM, consistent with a recent study [63] (Fig. 6m, n). Additionally, the subtypes of excitatory neurons exhibited layer-specific localization from the GM border to the GM/WM junction (from L2 to L6), consistent with the result of STAGATE (resolution = 0.3) (Fig. 6m, Additional file 1: Fig. S5e). These results illustrated that STAMapper provided gene embeddings with biological meanings, facilitating the identification of shared or unique gene modules across datasets. Also, STAMapper accurately annotated structurally related cell subtypes, aiding in understanding their positional context within tissues.
Discussion
Precise annotation of single cells from the scST data is essential for understanding the complex interactions between cellular functions and their physical locations within tissues and organs. Here, we developed an accurate and user-friendly cell-type annotation method, STAMapper, which can be seamlessly integrated with the standard workflow of the package Scanpy [64].
The primary reason for the success of STAMapper lies in its utilization of both reference and query cells within a unified heterogeneous graph, where all cells share the same node type and connect to a common set of gene nodes. These gene nodes act as shared anchors that bridge reference and query data, enabling the model to capture consistent expression patterns across datasets. This encourages the alignment of both reference and query cells in a shared latent space. Moreover, STAMapper learns the edge weights that encode the relationship between cell embeddings and cell types by training on scRNA-seq data, and these weights are shared with the scST dataset. Consequently, once the scRNA-seq data has established a robust mapping between expression profiles and cell types, this mapping can be directly leveraged to generate accurate annotations for the scST data. Also, it adopts a graph attention classifier to ensure that each cell pays more attention to genes that are more biologically related during the classification. However, STAMapper does not incorporate the spatial information of the cells in the current version. Although it is straightforward to model the spatial information within STAMapper by adding edges between spatially adjacent cells, we found that this led to a statistically significant accuracy loss of about 0.4% (Additional file 1: Fig. S6a). This could be because many spatially adjacent cells are not of the same cell type. In particular, across all datasets, a median of only about 2 of a cell's 5 spatial neighbors belong to the same cell type. As a result, modeling spatial adjacency alone, without considering transcriptomic similarity, can introduce noise and negatively impact model accuracy. Therefore, how to reasonably utilize spatial information to enhance the accuracy of annotations remains a challenging problem.
Benchmarking is important for assessing the accuracy and robustness of a method. However, different computational methods are evaluated on different datasets, which complicates comparing them fairly. In this study, we gathered 81 scST datasets from five distinct tissues and eight different technologies with corresponding scRNA-seq datasets as references. We carefully examined each dataset using manual validation based on canonical marker genes. These standardized datasets are expected to serve as a benchmark for testing methods proposed to annotate scST data. Additionally, STAMapper outperformed competing methods on these datasets. We also evaluated the performance of different annotation methods on a colon adenocarcinoma (COAD) dataset profiled by Xenium, a popular subcellular platform [65]. STAMapper still achieved the highest accuracy (93.91%) compared with scANVI (66.33%), RCTD (92.23%), and Tangram (37.89%) (Additional file 1: Fig. S6h).
In this study, we concentrated on cell-type annotation for scST datasets. Given a well-annotated scRNA-seq dataset as a reference, STAMapper can accurately annotate cell types in scST data and is robust to different spatial sequencing technologies and diverse tissues. Furthermore, while some deep learning methods may be sensitive to hyperparameter settings, tests on variations in hidden units and hidden layers revealed that STAMapper is robust to such changes (Additional file 1: Fig. S6b), further highlighting its potential for broad applicability. We also conducted additional ablations of heterogeneous edge types or processes that can be removed from STAMapper. Specifically, we (i) removed the “cell_similar_to_cell” edge type, (ii) disabled the filter on the “cell_similar_to_cell” edges that connect different cell types, and (iii) removed the “gene_self_loop” edges. These modifications led to a 0.7–3.0% drop in annotation accuracy (Additional file 1: Fig. S6c), demonstrating that all of these edges and steps contribute to accuracy. For STAMapper, the runtime increases linearly with the number of cells. When processing the largest dataset (460 K cells), STAMapper completed the task in about 5.3 h, using only 11.29 GB of GPU memory and 38.99 GB of system memory. This demonstrates that STAMapper can be efficiently applied to large-scale datasets (Additional file 1: Fig. S6d). On average, STAMapper demonstrated the fastest runtime, scANVI was slightly slower, and RCTD was much slower (Additional file 1: Fig. S6e). In terms of memory consumption, RCTD used the least memory, followed closely by STAMapper, whereas scANVI consumed substantially more memory than the other two methods (Additional file 2: Table S3). Overall, STAMapper offers both fast runtime and relatively low memory usage, indicating strong potential for scaling to large datasets.
Further, to assess how cell numbers affect the analysis, we defined a metric, cell ratio between scRNA-seq reference and spatial data (CRSS), as the number of cells in the reference dataset divided by the number of cells in the spatial dataset. We observed a negative trend between CRSS and STAMapper accuracy; however, it did not reach statistical significance (Additional file 1: Fig. S6f). One potential limitation of STAMapper is that it utilizes only one scRNA-seq dataset as a reference, which could lead to the omission of certain cell types if the sequenced cells in the reference are not comprehensive. Future work may consider using multiple references to annotate spatial data to improve the annotation.
Selecting spatially variable genes has become a popular topic in recent years. However, the existing methods, e.g., SPARK-X [66], STAMarker [7], and spatialDE [67], are designed for spatial transcriptomics technologies with lower resolution, i.e., 10× spatial transcriptomics [68]. In scST data, cell type can be employed to identify genes exhibiting spatial variation within the same cell population. STAMapper could help by providing an attention score that reflects the importance of each gene to a cell. If such a score changes drastically along spatial positions within a specific cell type, the gene could potentially be identified as a spatially variable gene relating to this cell type. Also, STAMapper does not include spatial information. A recently proposed method, CAESAR [69], integrates histology images and spatial location information into a low-dimensional space to characterize the gene-cell relationship (Additional file 1: Supplementary Note 1). We expect STAMapper to be extended in this direction.
Methods
Data description
We collected 81 scST datasets sequenced by different technologies, including MERFISH, seqFISH, seqFISH+, osmFISH, STARmap, STARmap PLUS, NanoString, and Slide-tags (Fig. 1, Additional file 1: Table S1). For the NanoString HCC data, the authors provided the annotation by an unpublished method, InSituType, and we used this annotation as ground truth. For the other 80 scST datasets and all the scRNA-seq datasets, the authors provided manual annotations that served as ground truth. We also collected 16 scRNA-seq datasets sequenced by different technologies, including 10× Chromium, Droplet-microfluidic, and STRT/C1 (Fig. 1, Additional file 1: Table S1). The cell types across scRNA-seq and scST datasets were unified manually through corresponding literature and Cell Ontology [70]. The unification process between the scRNA-seq and scST datasets can be found in section “Data availability” below.
Detection of differentially expressed genes
Differentially expressed genes (DEGs) were identified using a t test, ranked by adjusted p value, and filtered with an adjusted p value threshold of < 0.05.
Data preprocessing
In all datasets, we first normalized the library size for each cell and then logarithmized the expression data with a pseudo-count. For scRNA-seq datasets, we then selected the top 2000 highly variable genes (HVGs) and calculated the top 50 differentially expressed genes (DEGs) for each cell type as input. For the scST datasets with more than 2000 genes, we used the same strategy for selecting HVGs. For the selection of DEGs, we first pre-clustered the spatial data by the Leiden algorithm [29] (resolution = 0.4) and then calculated the top 50 DEGs for each cell cluster. For the scST datasets with fewer than 2000 genes, we used the expression of all the genes as input. Finally, we scaled each gene to unit variance and zero mean value. All the preprocessing steps were implemented using the built-in functions in the package Scanpy [64].
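The normalization steps above can be illustrated with a minimal pure-Python sketch. It is not the pipeline used in the study, which relies on Scanpy's built-in functions; the helper name `preprocess` and the target library size of 1e4 are assumptions made for illustration, and HVG/DEG selection is omitted.

```python
import math


def preprocess(counts, target_sum=1e4):
    """Library-size normalize, log1p-transform, then scale each gene.

    counts: list of per-cell lists of raw counts (cells x genes).
    target_sum: assumed per-cell total after normalization (illustrative value).
    """
    # Steps 1 and 2: rescale each cell to the same library size, then log(1 + x).
    logged = []
    for cell in counts:
        total = sum(cell)
        factor = target_sum / total if total > 0 else 0.0
        logged.append([math.log1p(c * factor) for c in cell])

    # Step 3: scale each gene to zero mean and unit (sample) variance across cells.
    n_cells, n_genes = len(logged), len(logged[0])
    scaled = [[0.0] * n_genes for _ in range(n_cells)]
    for j in range(n_genes):
        col = [logged[i][j] for i in range(n_cells)]
        mean = sum(col) / n_cells
        var = sum((v - mean) ** 2 for v in col) / max(n_cells - 1, 1)
        std = math.sqrt(var) if var > 0 else 1.0  # leave constant genes at zero
        for i in range(n_cells):
            scaled[i][j] = (col[i] - mean) / std
    return scaled
```

Scaling is applied last, after gene selection in the actual workflow, so that every retained gene contributes on a comparable scale.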
Construction of the heterogeneous graph
STAMapper uses a heterogeneous graph to model the two datasets with genes and cells as two types of nodes (Fig. 1a). We use a combination of HVGs and DEGs to determine the set of gene nodes, aiming to balance the representation of overall expression variability with genes that are highly informative for distinguishing cell types. Specifically, we used the intersection of selected genes (DEGs and HVGs) between scRNA-seq and scST datasets as the gene nodes. We have five types of heterogeneous edges. For each cell node and gene node, we have an edge connected to itself, named “cell_self_loop” and “gene_self_loop,” respectively. They help utilize information from the previous step in the training process. We also have “cell_similar_to_cell” edges connecting similarly expressed cells with the k nearest neighbor strategy (based on their expression vector, k = 5 by default) within each dataset. For a scRNA-seq dataset, we filter out the edges connecting two different types of cells. Additionally, we have heterogeneous edges named “gene_expressed_by_cell” and “cell_express_gene” in opposite directions to indicate that a gene is expressed by a cell and that a cell expresses a gene.
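As a rough illustration of the five edge types, the sketch below builds them for a single dataset as lists of index pairs. The function names, the brute-force Euclidean k-nearest-neighbor search, and the rule that a cell-gene edge exists whenever expression is non-zero are assumptions made for this example, not details confirmed by the text.

```python
import math


def knn_edges(expr, k=5, labels=None):
    """Directed 'cell_similar_to_cell' edges (i, j) within one dataset.

    expr: list of per-cell expression vectors.
    labels: optional cell-type labels (reference data only); when given,
            edges joining cells of different types are filtered out.
    """
    edges = []
    for i, xi in enumerate(expr):
        # Brute-force Euclidean distances to all other cells (assumed metric).
        dists = sorted(
            (math.dist(xi, xj), j) for j, xj in enumerate(expr) if j != i
        )
        for _, j in dists[:k]:
            if labels is None or labels[i] == labels[j]:
                edges.append((i, j))
    return edges


def build_hetero_edges(expr, k=5, labels=None):
    """Assemble the five edge types for one dataset."""
    n_cells, n_genes = len(expr), len(expr[0])
    # Assumption: a cell is linked to a gene when its expression is non-zero.
    expressed = [
        (i, g) for i in range(n_cells) for g in range(n_genes) if expr[i][g] > 0
    ]
    return {
        "cell_self_loop": [(i, i) for i in range(n_cells)],
        "gene_self_loop": [(g, g) for g in range(n_genes)],
        "cell_similar_to_cell": knn_edges(expr, k, labels),
        "cell_express_gene": expressed,  # cell -> gene
        "gene_expressed_by_cell": [(g, i) for i, g in expressed],  # gene -> cell
    }
```

For the scST query, `labels` would be left as `None`, so no neighbor edges are filtered.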
Architecture of STAMapper
STAMapper consists of two parts, i.e., a heterogeneous graph encoder and a heterogeneous graph attention classifier (Fig. 1a). STAMapper updates parameters on heterogeneous graphs according to the message-passing mechanism, where the same edge type shares the same parameters [27]. We aim to learn these edge parameters.
Encoder
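The encoder is based on the architecture of the heterogeneous graph, where we take the expression of the union of DEGs from different clusters as the input for each cell node. Suppose we have $N_1$ cells in the scRNA-seq dataset and $N_2$ cells in the scST dataset. Here, we denote $x_i = (x_{i1}, \ldots, x_{im})^{\top}$ as the input for cell node $c_i$, where $x_{ij}$ denotes the normalized expression for gene $j$ in cell $i$. For the cell node $c_i$, the initial embedding is calculated as follows:
$$h_{c_i}^{(0)} = \mathrm{LN}\left(\sigma\left(W_{cc}^{(0)} x_i + b_c^{(0)}\right)\right),$$
where $\mathrm{LN}$ denotes the layer normalization [71], $\sigma$ is the nonlinear activation function, $W_{cc}^{(0)}$ denotes the learnable parameters for the edge type “cell_self_loop” in the initial layer, and $b_c^{(0)}$ denotes the learnable bias for cell nodes. We used the intersection of selected genes (DEGs and HVGs) between scRNA-seq and scST datasets as the gene nodes. Suppose we have $M$ gene nodes; the initial embedding for gene node $g_j$ is as follows:
$$h_{g_j}^{(0)} = \mathrm{LN}\left(\sigma\left(\frac{1}{|\mathcal{N}_{g_j}|}\sum_{c_i \in \mathcal{N}_{g_j}} W_{cg}^{(0)} x_i + b_g^{(0)}\right)\right),$$
where $\mathcal{N}_{g_j}$ is a set containing the cell nodes which are neighbors of gene node $g_j$, and $|\mathcal{N}_{g_j}|$ denotes the number of cells in this set. $W_{cg}^{(0)}$ denotes the learnable parameters for edge type “cell_express_gene” for the initial layer, and $b_g^{(0)}$ denotes the learnable bias for gene nodes.
For the $l$th hidden layer, the embedding for cell node $c_i$ is:
$$h_{c_i}^{(l)} = \mathrm{LN}\left(\sigma\left(\sum_{r \in \mathcal{R}_{c_i}} \frac{1}{|\mathcal{N}_{c_i}^{r}|}\sum_{k \in \mathcal{N}_{c_i}^{r}} W_r^{(l)} h_k^{(l-1)}\right)\right),$$
where $\mathcal{R}_{c_i}$ denotes a set of edge types connected to cell node $c_i$, $\mathcal{N}_{c_i}^{r}$ denotes a set containing the cell/gene nodes that are neighbors of cell node $c_i$ according to edge type $r$, $W_r^{(l)}$ denotes the learnable parameters for edge type $r$, and $h_k^{(l-1)}$ denotes the embedding for the $k$th cell/gene node (adaptive to the edge type $r$) in the previous layer. The embedding for gene node $g_j$ in the $l$th hidden layer is:
$$h_{g_j}^{(l)} = \mathrm{LN}\left(\sigma\left(\sum_{r \in \mathcal{R}_{g_j}} \frac{1}{|\mathcal{N}_{g_j}^{r}|}\sum_{k \in \mathcal{N}_{g_j}^{r}} W_r^{(l)} h_k^{(l-1)}\right)\right),$$
where $\mathcal{R}_{g_j}$ denotes a set of edge types connecting to gene node $g_j$, $\mathcal{N}_{g_j}^{r}$ denotes a set containing the cell nodes that are neighbors of gene node $g_j$ according to edge type $r$, $W_r^{(l)}$ denotes the learnable parameters for edge type $r$, and $h_k^{(l-1)}$ denotes the embedding for the $k$th cell/gene node (adaptive to the edge type $r$) in the previous layer.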
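The per-layer update described above can be sketched for a single node as follows. This is a simplified illustration under stated assumptions: messages are averaged within each edge type and summed across edge types, the activation is a LeakyReLU with negative slope 0.05 (the value given under “Training process”), and layer normalization and bias terms are left out for brevity. The helper names are hypothetical.

```python
def matvec(weight, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]


def leaky_relu(vec, slope=0.05):
    """Element-wise LeakyReLU."""
    return [a if a >= 0 else slope * a for a in vec]


def update_node(neighbors_by_etype, weights_by_etype):
    """One hidden-layer update for a single node.

    neighbors_by_etype: {edge_type: [neighbor embedding, ...]} from the
                        previous layer.
    weights_by_etype:   {edge_type: weight matrix}, shared by all edges of
                        that type.
    """
    out = None
    for etype, neigh in neighbors_by_etype.items():
        if not neigh:
            continue
        # Transform every neighbor with the edge-type-specific weights ...
        msgs = [matvec(weights_by_etype[etype], h) for h in neigh]
        # ... average within the edge type ...
        mean = [sum(col) / len(msgs) for col in zip(*msgs)]
        # ... and sum the per-type averages.
        out = mean if out is None else [a + b for a, b in zip(out, mean)]
    if out is None:
        raise ValueError("node has no neighbors under any edge type")
    return leaky_relu(out)
```

Because the weights are indexed by edge type rather than by node, reference and query cells pass through exactly the same transformation.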
Classifier
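To further utilize the information from genes associated with cell classification, we employed an attention mechanism [27] in the heterogeneous graph classifier. Specifically, the attention weight of the classifier from gene node $g_j$ to cell node $c_i$ is:
$$e_{ij} = \sigma\left(a^{\top}\left[h_{c_i} \,\Vert\, h_{g_j}\right]\right),$$
where $a$ is a learnable weight vector, $\sigma$ is the nonlinear activation function, and $[h_{c_i} \Vert h_{g_j}]$ denotes concatenation of $h_{c_i}$ and $h_{g_j}$, the encoder embeddings of cell node $c_i$ and gene node $g_j$. The weight is further normalized as follows:
$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{g_k \in \mathcal{N}_{c_i}} \exp(e_{ik})},$$
where $\mathcal{N}_{c_i}$ is a set containing the gene nodes that are neighbors of cell node $c_i$, determined by edge “gene_expressed_by_cell.” The normalized attention weight $\alpha_{ij}$ reflects the importance of gene $j$ in determining the classification of cell $i$. Given a gene and a specific cell type, we aggregate the normalized attention weights from that gene to all cells belonging to the cell type and compute their average as the cell type’s normalized attention weight. This value represents the heatmap intensity shown in Fig. 5i, reflecting the overall importance of the gene for that particular cell type determined by STAMapper. The output logits of the classifier for cell node $c_i$, denoted $z_i = (z_{i1}, \ldots, z_{iK})$, are then obtained by aggregating the embeddings of its neighboring gene nodes weighted by $\alpha_{ij}$.
Then we apply the softmax function over the logits coordinate-wise:
$$p_{ik} = \frac{\exp(z_{ik})}{\sum_{k'=1}^{K} \exp(z_{ik'})},$$
where $i = 1, \ldots, N$ and $k = 1, \ldots, K$. Here, $N$ denotes the total number of cells, $K$ denotes the number of cell types, and $p_{ik}$ denotes the predicted probability of cell type $k$ for the $i$th cell.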
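The attention normalization and the per-cell-type averaging described above can be sketched as follows. The function names are hypothetical, and one detail is an assumption: a cell with no edge to the gene is counted with weight 0 when averaging over the cells of a type.

```python
import math
from collections import defaultdict


def normalize_attention(raw_scores):
    """Softmax over the gene neighbors of one cell.

    raw_scores: {gene: unnormalized attention score} for a single cell.
    """
    m = max(raw_scores.values())  # subtract the max for numerical stability
    exp = {g: math.exp(s - m) for g, s in raw_scores.items()}
    z = sum(exp.values())
    return {g: e / z for g, e in exp.items()}


def celltype_attention(attn_per_cell, cell_types, gene):
    """Average normalized attention from `gene` to the cells of each type.

    attn_per_cell: one {gene: normalized weight} dict per cell.
    cell_types:    cell-type label of each cell, in the same order.
    """
    sums, counts = defaultdict(float), defaultdict(int)
    for attn, ctype in zip(attn_per_cell, cell_types):
        sums[ctype] += attn.get(gene, 0.0)  # assumption: no edge counts as 0
        counts[ctype] += 1
    return {t: sums[t] / counts[t] for t in counts}
```

Repeating `celltype_attention` over a set of genes yields a gene-by-cell-type matrix of the kind that can be displayed as a heatmap.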
Loss function
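We modified the cross-entropy loss as the classification loss. Suppose $y_{ik} \in \{0, 1\}$ indicates whether the manual annotation of the $i$th of the $N_1$ cells from the scRNA-seq dataset is cell type $k$ ($k = 1, \ldots, K$), and $p_{ik}$ is the corresponding predicted probability. We first use a weighted cross-entropy loss as follows:
$$\mathcal{L}_{w} = -\frac{1}{N_1}\sum_{i=1}^{N_1}\sum_{k=1}^{K} w_k\, y_{ik} \log p_{ik} + \lambda \lVert \Theta \rVert_2^2,$$
where $w_k$ denotes the weight for cell type $k$, computed from $n_k$, the number of cells in the $k$th cell type, $\lVert \Theta \rVert_2^2$ denotes the sum of the squares of all model parameters penalizing the complexity of the model to prevent overfitting, and $\lambda$ is set as 0.01 by default. We applied the label smoothing technique to reduce the model overconfidence [28]. The overall training loss is:
$$\mathcal{L} = (1 - \epsilon)\,\mathcal{L}_{w} + \epsilon\,\mathcal{L}_{u}, \qquad \mathcal{L}_{u} = -\frac{1}{N_1 K}\sum_{i=1}^{N_1}\sum_{k=1}^{K} \log p_{ik},$$
where $\mathcal{L}_{u}$ is the cross-entropy against a uniform distribution over cell types, and the default value of $\epsilon$ is 0.1.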
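As a hedged illustration, the sketch below combines a class-weighted cross-entropy with label smoothing in its standard form, i.e., a mix of the hard-label loss and the cross-entropy against a uniform target. Whether this matches the authors' exact formulation is an assumption; the class weights are passed in rather than derived from cell counts, and the L2 penalty is omitted (in practice it is often applied as optimizer weight decay). The function name is hypothetical.

```python
import math


def smoothed_weighted_ce(probs, labels, class_weights, eps=0.1):
    """Class-weighted cross-entropy with standard label smoothing.

    probs:         per-cell lists of predicted probabilities (length K each).
    labels:        integer cell-type index of each reference cell.
    class_weights: one weight per cell type (assumed to be given).
    eps:           label-smoothing strength (0.1 by default, as in the text).
    """
    n, k = len(probs), len(class_weights)
    total = 0.0
    for p, y in zip(probs, labels):
        hard = -class_weights[y] * math.log(p[y])      # weighted hard-label term
        uniform = -sum(math.log(pk) for pk in p) / k   # uniform-target term
        total += (1.0 - eps) * hard + eps * uniform
    return total / n
```

With `eps=0` the function reduces to the plain weighted cross-entropy; larger values penalize predictions that put nearly all probability on one type.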
Training process
In all experiments, we set the encoder of STAMapper as a two-layer heterogeneous graph neural network with 512 hidden units for all types of edges and nodes across all hidden layers. We apply the Adam optimizer [72] with a learning rate of 5e-3 to optimize all the parameters. We adopt LeakyReLU [73] as the activation function with a negative slope set as 0.05. We set the number of iterations as 1000 by default and use the same checkpoint selection strategy as a recently published method [74].
Unknown cells detection
We detected unknown cells based on two criteria: (i) cells in the scST data with predicted probability less than the median value; and (ii) cell types with distance from the scST data to scRNA-seq data larger than a user-defined threshold (default 4). The distance was calculated based on the cell embeddings provided by STAMapper, after reducing to 50 dimensions using PCA. Specifically, the distance for a given cell type from the scST data to scRNA-seq data is defined as the average distance of each cell in the scST data to its five nearest neighbors in the scRNA-seq data.
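The two criteria can be sketched as below. Several points are assumptions made for this example: a cell is flagged only when both criteria hold, the median is taken over the maximum predicted probabilities of all scST cells, cells are grouped by predicted type, and neighbors are searched among all reference cells. The PCA step is omitted, so the inputs are taken to be already reduced embeddings, and the function names are hypothetical.

```python
import math
from statistics import mean, median


def type_distance(query_emb, ref_emb, n_neighbors=5):
    """Average, over query cells, of the mean distance to the nearest reference cells."""
    per_cell = []
    for q in query_emb:
        nearest = sorted(math.dist(q, r) for r in ref_emb)[:n_neighbors]
        per_cell.append(mean(nearest))
    return mean(per_cell)


def flag_unknown(query_emb, pred_type, pred_prob, ref_emb, threshold=4.0):
    """Return one boolean per scST cell: True if the cell is flagged as unknown."""
    prob_cut = median(pred_prob)  # criterion (i): below-median confidence
    far_types = set()
    for t in set(pred_type):      # criterion (ii): type lies far from the reference
        cells = [e for e, p in zip(query_emb, pred_type) if p == t]
        if type_distance(cells, ref_emb) > threshold:
            far_types.add(t)
    return [
        prob < prob_cut and t in far_types
        for prob, t in zip(pred_prob, pred_type)
    ]
```

The distance threshold is user-defined, so it would typically be tuned against the scale of the embedding space.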
Gene module extraction and enrichment analysis
We performed the Leiden community detection algorithm [29] on the gene embeddings from the last layer of STAMapper and defined the detected clusters as gene modules. We performed the GO enrichment analysis by the R package clusterProfiler [75] with the biological process ontologies and a p value cutoff of 0.01 to identify significantly enriched terms.
Evaluation
To assess the performance of cell-type annotation, we employed accuracy, macro F1 score, and weighted F1 score as evaluation metrics. Accuracy is defined as the proportion of correctly predicted cells among all cells. To define the macro F1 score and weighted F1 score, we first define precision and recall for each cell type $k$:
$$\mathrm{precision}_k = \frac{TP_k}{TP_k + FP_k}, \qquad \mathrm{recall}_k = \frac{TP_k}{TP_k + FN_k},$$
respectively, where $TP_k$, $FP_k$, and $FN_k$ represent the numbers of true positives, false positives, and false negatives for cell type $k$, respectively. The F1 score is the harmonic mean of precision and recall, $F1_k = 2\,\mathrm{precision}_k\,\mathrm{recall}_k / (\mathrm{precision}_k + \mathrm{recall}_k)$, while the macro F1 is defined as the average of the class-wise F1 scores,
$$\text{Macro F1 score} = \frac{1}{K}\sum_{k=1}^{K} F1_k,$$
where $F1_k$ represents the F1 score for cell type $k$, and $K$ denotes the number of all cell types. The weighted F1 considers the proportion of each cell type,
$$\text{Weighted F1 score} = \sum_{k=1}^{K} q_k\, F1_k,$$
where $q_k$ represents the proportion of cells of type $k$ in all cells.
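The three metrics can be computed with a few lines of plain Python, shown here as a minimal sketch. The function name is hypothetical, and one convention is an assumption: the set of cell types is taken from the ground-truth labels, so a type that appears only among the predictions is not averaged in.

```python
from collections import Counter


def annotation_metrics(y_true, y_pred):
    """Return (accuracy, macro F1, weighted F1) for predicted cell-type labels."""
    n = len(y_true)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n
    support = Counter(y_true)  # number of cells per ground-truth type
    f1 = {}
    for k in support:
        tp = sum(t == k and p == k for t, p in zip(y_true, y_pred))
        fp = sum(t != k and p == k for t, p in zip(y_true, y_pred))
        fn = sum(t == k and p != k for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1[k] = (
            2 * precision * recall / (precision + recall)
            if precision + recall
            else 0.0
        )
    macro_f1 = sum(f1.values()) / len(f1)
    weighted_f1 = sum(support[k] / n * f1[k] for k in f1)
    return accuracy, macro_f1, weighted_f1
```

Macro F1 treats rare and abundant cell types equally, whereas weighted F1 is dominated by the abundant ones, which is why the two are reported side by side.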
Benchmarking cell-type annotation
We performed scANVI [23] with the Python package scVI, referring to the section “Integration and label transfer with Tabula Muris” from its documentation (https://docs.scvi-tools.org/en/stable/tutorials/notebooks/scrna/tabula_muris.html). We performed RCTD using the R package spacexr [24] with the default workflow (https://raw.githack.com/dmcable/spacexr/master/vignettes/spatial-transcriptomics.html). We performed Tangram [25] with its Python package in cluster mode (https://tangram-sc.readthedocs.io/en/latest/tutorial_sq_link.html). We performed SeuratV4 [32] with the R package Seurat, referring to the section “Mapping and annotating query datasets” from its documentation (https://satijalab.org/seurat/articles/integration_mapping). We performed CellTrek [31] with the R package CellTrek, referring to its tutorial (https://github.com/navinlabcode/CellTrek). We set “intp = F” for the function celltrek() with all other parameters as default. After projecting the scRNA-seq cells onto the scST coordinates, we inspected every cell in the scST dataset. If one or more scRNA-seq cells were mapped to a given spatial location, we assigned that location the cell type corresponding to the most prevalent mapped cell population. For locations without any mapped scRNA-seq cells, we identified the five nearest neighboring scRNA-seq cells and assigned the cell type by majority vote among those neighbors. We performed CAESAR [69] with the R package CAESAR.Suite, referring to its tutorial (https://xiaozhangryy.github.io/CAESAR.Suite/articles/STMOB.html). For scST datasets with fewer than 2000 genes, we set the overlap.max of the function marker.select() as 4 to obtain better annotation results. Additionally, for some datasets, the algorithm reported that the default radius.upper value of 400 in the CAESAR.coembedding() function was too small to identify spatial neighbors, so we increased it to 800.
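The rule used to turn the CellTrek projection into per-cell labels can be sketched as follows. This is an illustration only and does not call CellTrek itself; the function name, the input layout (a dictionary from scST cell index to the scRNA-seq cells mapped to it), and the Euclidean distance are assumptions, and ties are broken by first occurrence.

```python
import math
from collections import Counter


def assign_types(spatial_xy, mapped, sc_xy, sc_types, n_neighbors=5):
    """Assign a cell type to every scST cell from projected scRNA-seq cells.

    spatial_xy: coordinates of the scST cells.
    mapped:     {scST cell index: [indices of scRNA-seq cells mapped to it]}.
    sc_xy:      projected coordinates of the scRNA-seq cells.
    sc_types:   cell type of each scRNA-seq cell.
    """
    assigned = []
    for i, xy in enumerate(spatial_xy):
        hits = mapped.get(i, [])
        if not hits:
            # No mapped cell: fall back to the nearest projected scRNA-seq cells.
            order = sorted(range(len(sc_xy)), key=lambda j: math.dist(xy, sc_xy[j]))
            hits = order[:n_neighbors]
        # Majority vote over the mapped (or neighboring) scRNA-seq cells.
        votes = Counter(sc_types[j] for j in hits)
        assigned.append(votes.most_common(1)[0][0])
    return assigned
```

The fallback guarantees that every scST cell receives a label, which keeps the accuracy comparison with the other methods on the same set of cells.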
Spatial clustering with STAGATE
We performed STAGATE [58] on the Slide-tags cortex data, following its standard workflow for data preprocessing, with rad_cutoff set to 200 to construct the spatial network (https://stagate.readthedocs.io/en/latest/T3_Slide-seqV2.html).
Visualization of the UMAP plot
The UMAP embeddings presented in our study were computed and visualized using the Scanpy package. Specifically, the neighborhood graph was generated with scanpy.pp.neighbors() with n_neighbors = 10, n_pcs = 40, based on the PCA-reduced representation of the data. We then used scanpy.tl.umap() to embed the neighborhood graph using UMAP and visualized the results with scanpy.pl.umap(). All other parameters were set to their default values.
Supplementary Information
Additional file 1: Figures S1-S6, Tables S1-S2. Supplementary figures that complement the analyses in the main text, and tables detailing the collected datasets.
Additional file 2: Table S3. Running time and peak memory cost of STAMapper and competing methods.
Acknowledgements
Not applicable.
Authors’ contributions
S.Z. conceived and supervised the project. Q.S. designed and implemented the STAMapper algorithm. Q.S. and K.D. validated the study. Q.S., S.Q.Z., and S.Z. wrote the manuscript. All authors read and approved the final manuscript.
Peer review information
Claudia Feng was the primary editor of this article and managed its editorial process and peer review in collaboration with the rest of the editorial team. The peer-review history is available in the online version of this article.
Funding
This work has been supported by the National Key Research and Development Program of China (no. 2021YFC2701601 to S.Q.Z.), the Science and Technology Commission of Shanghai Municipality (no. 23JC1401000 to S.Q.Z.), the National Natural Science Foundation of China (nos. 32341013, 12326614 to S.Z., 12471350 to S.Q.Z.), the R&D project of Pazhou Lab (Huangpu) (no. 2023K0602 to S.Z.), and the CAS Project for Young Scientists in Basic Research (no. YSBR-034 to S.Z.).
Data availability
All datasets analyzed in this study are publicly available. The raw datasets are available from the following studies:
Dataset 1–4 (mouse prefrontal cortex): STARmap [15, 76], https://github.com/weallen/STARmap/tree/master; 10x Chromium [15, 77], adult samples from GSE124952 in the GEO database.
Dataset 5–9 (mouse visual cortex): STARmap [15, 76], https://github.com/weallen/STARmap/tree/master; Smart-seq [78, 79], https://portal.brain-map.org/atlases-and-data/rnaseq/mouse-v1-and-alm-smart-seq.
Dataset 10 (mouse visual cortex): seqFISH + [13, 80], https://github.com/CaiGroup/seqFISH-PLUS; Smart-seq [78, 79], https://portal.brain-map.org/atlases-and-data/rnaseq/mouse-v1-and-alm-smart-seq.
Dataset 11 (mouse somatosensory cortex): osmFISH [79, 81], https://github.com/drieslab/spatial-datasets/tree/master/data/2018_osmFISH_SScortex/raw_data; STRT/C1 [82, 83], GSE60361 in the GEO database.
Dataset 12–23 (mouse primary motor cortex): MERFISH [11, 84], https://knowledge.brain-map.org/data/L3GYGFMDJCG0GUEE3QG/; 10x Chromium [85, 86], https://data.nemoarchive.org/biccn/lab/zeng/transcriptome/scell/10x_v3/mouse/processed/analysis/10X_cells_v3_AIBS/.
Dataset 24–33 (mouse retina): MERFISH [33, 87], https://zenodo.org/records/8144355; 10x Chromium [34, 88], GSE135406 in the GEO database.
Dataset 34 (mouse kidney): MERFISH [26, 89], https://figshare.com/projects/MERFISH_mouse_comparison_study/134213; 10x Chromium [26, 89], https://figshare.com/articles/dataset/SingleCellData_raw_/19310675.
Dataset 35 (mouse liver): MERFISH [26, 89], https://figshare.com/projects/MERFISH_mouse_comparison_study/134213; 10x Chromium [26, 89], https://figshare.com/articles/dataset/SingleCellData_raw_/19310675.
Dataset 36–71 (mouse hypothalamic preoptic region): MERFISH [22, 90], https://datadryad.org/stash/dataset/10.5061/dryad.8t8s248; Droplet-microfluidic [22, 91], GSE113576 in the GEO database.
Dataset 72–77 (mouse gastrulation): osmFISH, seqFISH [81, 92], https://crukci.shinyapps.io/mousegastrulation2018/; 10x Chromium [81, 93], https://bioconductor.org/packages/devel/data/experiment/vignettes/MouseGastrulationData/inst/doc/MouseGastrulationData.html.
Dataset 78–79 (mouse prefrontal cortex): STARmap PLUS [16, 94], https://zenodo.org/records/7332091; 10x Chromium [15, 77], GSE124952 in the GEO database.
Dataset 80 (human prefrontal cortex): Slide-tags [18, 95], https://singlecell.broadinstitute.org/single_cell/study/SCP2167/slide-tags-snrna-seq-on-human-prefrontal-cortex#study-download; 10x Chromium [52, 96], GSE168408 in the GEO database.
Dataset 81 (human liver cancer): NanoString [17, 97], https://nanostring.com/resources/liver-cancer-raw-data-files-cosmx-smi-human-liver-ffpe-dataset/; 10x Chromium [39, 97], GSE149614 in the GEO database.
Code availability
STAMapper is implemented in Python. It is available on GitHub under an open-source MIT license (https://github.com/zhanglabtools/STAMapper) [98] and has been deposited in Zenodo with a permanent DOI (https://zenodo.org/records/17007394) [99].
Declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Contributor Information
Shuqin Zhang, Email: zhangs@fudan.edu.cn.
Shihua Zhang, Email: zsh@amss.ac.cn.
References
- 1.Jovic D, Liang X, Zeng H, Lin L, Xu F, Luo Y. Single-cell RNA sequencing technologies and applications: a brief overview. Clin Transl Med. 2022;12(3):e694. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2.Choe K, Pak U, Pang Y, Hao W, Yang X. Advances and challenges in spatial transcriptomics for developmental biology. Biomolecules. 2023;13(1):156. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.Ma Y, Zhou X. Accurate and efficient integrative reference-informed spatial domain detection for spatial transcriptomics. Nat Methods. 2024;21(7):1231–44. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Dong K, Zhang S. Deciphering spatial domains from spatially resolved transcriptomics with an adaptive graph attention auto-encoder. Nat Commun. 2022;13(1):1739. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Liang Y, Shi G, Cai R, Yuan Y, Xie Z, Yu L, et al. PROST: quantitative identification of spatially variable genes and domain detection in spatial transcriptomics. Nat Commun. 2024;15(1):600. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Su H, Wu Y, Chen B, Cui Y. STANCE: a unified statistical model to detect cell-type-specific spatially variable genes in spatial transcriptomics. Nat Commun. 2025;16(1):1793. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7. Zhang C, Dong K, Aihara K, Chen L, Zhang S. STAMarker: determining spatial domain-specific variable genes with saliency maps in deep learning. Nucleic Acids Res. 2023;51(20):e103-e. [DOI] [PMC free article] [PubMed]
- 8.Cang Z, Zhao Y, Almet AA, Stabell A, Ramos R, Plikus MV, et al. Screening cell–cell communication in spatial transcriptomics via collective optimal transport. Nat Methods. 2023;20(2):218–28. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Yang W, Wang P, Xu S, Wang T, Luo M, Cai Y, et al. Deciphering cell–cell communication at single-cell resolution for spatial transcriptomics with subgraph-based graph attention network. Nat Commun. 2024;15(1):7101. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.Xia C, Fan J, Emanuel G, Hao J, Zhuang X. Spatial transcriptome profiling by MERFISH reveals subcellular RNA compartmentalization and cell cycle-dependent gene expression. Proc Natl Acad Sci U S A. 2019;116(39):19490–9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11.Zhang M, Eichhorn SW, Zingg B, Yao Z, Cotter K, Zeng H, et al. Spatially resolved cell atlas of the mouse primary motor cortex by MERFISH. Nature. 2021;598(7879):137–43. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12. Shah S, Lubeck E, Zhou W, Cai L. seqFISH accurately detects transcripts in single cells and reveals robust spatial organization in the hippocampus. Neuron. 2017;94(4):752–8 e1. [DOI] [PubMed]
- 13.Eng CL, Lawson M, Zhu Q, Dries R, Koulena N, Takei Y, et al. Transcriptome-scale super-resolved imaging in tissues by RNA seqFISH+. Nature. 2019;568(7751):235–9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Codeluppi S, Borm LE, Zeisel A, La Manno G, van Lunteren JA, Svensson CI, et al. Spatial organization of the somatosensory cortex revealed by osmFISH. Nat Methods. 2018;15(11):932–5. [DOI] [PubMed] [Google Scholar]
- 15. Bhattacherjee A, Djekidel MN, Chen R, Chen W, Tuesta LM, Zhang Y. Cell type-specific transcriptional programs in mouse prefrontal cortex during adolescence and addiction. Nat Commun. 2019;10(1):4169.
- 16. Zeng H, Huang J, Zhou H, Meilandt WJ, Dejanovic B, Zhou Y, et al. Integrative in situ mapping of single-cell transcriptional states and tissue histopathology in a mouse model of Alzheimer’s disease. Nat Neurosci. 2023;26(3):430–46.
- 17. Hernandez S, Lazcano R, Serrano A, Powell S, Kostousov L, Mehta J, et al. Challenges and opportunities for immunoprofiling using a spatial high-plex technology: the NanoString GeoMx® digital spatial profiler. Front Oncol. 2022;12:890410.
- 18. Russell AJC, Weir JA, Nadaf NM, Shabet M, Kumar V, Kambhampati S, et al. Slide-tags enables single-nucleus barcoding for multimodal spatial genomics. Nature. 2024;625(7993):101–9.
- 19. Fan Z, Luo Y, Lu H, Wang T, Feng Y, Zhao W, et al. SPASCER: spatial transcriptomics annotation at single-cell resolution. Nucleic Acids Res. 2023;51(D1):D1138–49.
- 20. Kiselev VY, Andrews TS, Hemberg M. Challenges in unsupervised clustering of single-cell RNA-seq data. Nat Rev Genet. 2019;20(5):273–82.
- 21. Luecken MD, Theis FJ. Current best practices in single-cell RNA-seq analysis: a tutorial. Mol Syst Biol. 2019;15(6):e8746.
- 22. Moffitt JR, Bambah-Mukku D, Eichhorn SW, Vaughn E, Shekhar K, Perez JD, et al. Molecular, spatial, and functional single-cell profiling of the hypothalamic preoptic region. Science. 2018;362(6416):eaau5324.
- 23. Xu C, Lopez R, Mehlman E, Regier J, Jordan MI, Yosef N. Probabilistic harmonization and annotation of single-cell transcriptomics data with deep generative models. Mol Syst Biol. 2021;17(1):e9620.
- 24. Cable DM, Murray E, Zou LS, Goeva A, Macosko EZ, Chen F, et al. Robust decomposition of cell type mixtures in spatial transcriptomics. Nat Biotechnol. 2022;40(4):517–26.
- 25. Biancalani T, Scalia G, Buffoni L, Avasthi R, Lu Z, Sanger A, et al. Deep learning and alignment of spatially resolved single-cell transcriptomes with Tangram. Nat Methods. 2021;18(11):1352–62.
- 26. Liu J, Tran V, Vemuri VNP, Byrne A, Borja M, Kim YJ, et al. Concordance of MERFISH spatial transcriptomics with bulk and single-cell RNA sequencing. Life Sci Alliance. 2023;6(1):e202201701.
- 27. Wang X, Ji H, Shi C, Wang B, Ye Y, Cui P, Yu PS. Heterogeneous graph attention network. The World Wide Web Conference; 2019.
- 28. Szegedy C, Vanhoucke V, Ioffe S, Shlens J, Wojna Z. Rethinking the inception architecture for computer vision. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2016.
- 29. Traag VA, Waltman L, van Eck NJ. From Louvain to Leiden: guaranteeing well-connected communities. Sci Rep. 2019;9(1):5233.
- 30. Wang X, Allen WE, Wright MA, Sylwestrak EL, Samusik N, Vesuna S, et al. Three-dimensional intact-tissue sequencing of single-cell transcriptional states. Science. 2018;361(6400):eaat5691.
- 31. Wei R, He S, Bai S, Sei E, Hu M, Thompson A, et al. Spatial charting of single-cell transcriptomes in tissues. Nat Biotechnol. 2022;40(8):1190–9.
- 32. Hao Y, Hao S, Andersen-Nissen E, Mauck WM, Zheng S, Butler A, et al. Integrated analysis of multimodal single-cell data. Cell. 2021;184(13):3573–87.e29.
- 33. Choi J, Li J, Ferdous S, Liang Q, Moffitt JR, Chen R. Spatial organization of the mouse retina at single cell resolution by MERFISH. Nat Commun. 2023;14(1):4929.
- 34. Hoang T, Wang J, Boyd P, Wang F, Santiago C, Jiang L, et al. Gene regulatory networks controlling vertebrate retinal regeneration. Science. 2020;370(6519):eabb8598.
- 35. Hahn J, Monavarfeshani A, Qiao M, Kao AH, Kolsch Y, Kumar A, et al. Evolution of neuronal cell classes and types in the vertebrate retina. Nature. 2023;624(7991):415–24.
- 36. Santos G, Prazeres P, Mintz A, Birbrair A. Role of pericytes in the retina. Eye. 2018;32(3):483–6.
- 37. Yang S, Zhou J, Li D. Functions and diseases of the retinal pigment epithelium. Front Pharmacol. 2021;12:727870.
- 38. Zeisel A, Hochgerner H, Lonnerberg P, Johnsson A, Memic F, van der Zwan J, et al. Molecular architecture of the mouse nervous system. Cell. 2018;174(4):999–1014.e22.
- 39. Lu Y, Yang A, Quan C, Pan Y, Zhang H, Li Y, et al. A single-cell atlas of the multicellular ecosystem of primary and metastatic hepatocellular carcinoma. Nat Commun. 2022;13(1):4594.
- 40. Sun Y, Wu L, Zhong Y, Zhou K, Hou Y, Wang Z, et al. Single-cell landscape of the ecosystem in early-relapse hepatocellular carcinoma. Cell. 2021;184(2):404–21.e16.
- 41. Zhao D, Chen S, Cai J, Guo Y, Song Z, Che J, et al. Derivation and characterization of hepatic progenitor cells from human embryonic stem cells. PLoS ONE. 2009;4(7):e6468.
- 42. Kumari N, Choi SH. Tumor-associated macrophages in cancer: recent advancements in cancer nanoimmunotherapies. J Exp Clin Cancer Res. 2022;41(1):68.
- 43. Miranda A, Hamilton PT, Zhang AW, Pattnaik S, Becht E, Mezheyeuski A, et al. Cancer stemness, intratumoral heterogeneity, and immune response across cancers. Proc Natl Acad Sci U S A. 2019;116(18):9020–9.
- 44. Stelzer G, Rosen N, Plaschkes I, Zimmerman S, Twik M, Fishilevich S, et al. The GeneCards suite: from gene data mining to disease genome sequence analyses. Curr Protoc Bioinformatics. 2016;54:1.30.1–1.30.33.
- 45. Muller AM, Hermanns MI, Skrzynski C, Nesslinger M, Muller KM, Kirkpatrick CJ. Expression of the endothelial markers PECAM-1, vWf, and CD34 in vivo and in vitro. Exp Mol Pathol. 2002;72(3):221–9.
- 46. Choo SY. The HLA system: genetics, immunology, clinical testing, and clinical implications. Yonsei Med J. 2007;48(1):11–23.
- 47. Yang C, Siebert JR, Burns R, Gerbec ZJ, Bonacci B, Rymaszewski A, et al. Heterogeneity of human bone marrow and blood natural killer cells defined by single-cell transcriptome. Nat Commun. 2019;10(1):3931.
- 48. Revel M, Sautes-Fridman C, Fridman WH, Roumenina LT. C1q+ macrophages: passengers or drivers of cancer progression. Trends Cancer. 2022;8(7):517–26.
- 49. Zhang Q, Liu W, Zhang HM, Xie GY, Miao YR, Xia M, et al. hTFtarget: a comprehensive database for regulations of human transcription factors and their targets. Genomics Proteomics Bioinformatics. 2020;18(2):120–8.
- 50. Fan H, Zhang H, Pascuzzi PE, Andrisani O. Hepatitis B virus X protein induces EpCAM expression via active DNA demethylation directed by RelA in complex with EZH2 and TET2. Oncogene. 2016;35(6):715–26.
- 51. Jiang XL, Deng B, Deng SH, Cai M, Ding WJ, Tan ZB, et al. Dihydrotanshinone I inhibits the growth of hepatoma cells by direct inhibition of Src. Phytomedicine. 2022;95:153705.
- 52. Herring CA, Simmons RK, Freytag S, Poppe D, Moffet JJD, Pflueger J, et al. Human prefrontal cortex gene regulatory dynamics from gestation to adulthood at single-cell resolution. Cell. 2022;185(23):4428–47.e28.
- 53. Chen Y, Wu H, Wang S, Koito H, Li J, Ye F, et al. The oligodendrocyte-specific G protein-coupled receptor GPR17 is a cell-intrinsic timer of myelination. Nat Neurosci. 2009;12(11):1398–406.
- 54. Gillespie M, Jassal B, Stephan R, Milacic M, Rothfels K, Senff-Ribeiro A, et al. The reactome pathway knowledgebase 2022. Nucleic Acids Res. 2022;50(D1):D687–92.
- 55. Liberzon A, Birger C, Thorvaldsdottir H, Ghandi M, Mesirov JP, Tamayo P. The molecular signatures database (MSigDB) hallmark gene set collection. Cell Syst. 2015;1(6):417–25.
- 56. Soomro SH, Jie J, Fu H. Oligodendrocytes development and Wnt signaling pathway. Int J Hum Anat. 2018;1(3):17–35.
- 57. Miyashita Y. Operating principles of the cerebral cortex as a six-layered network in primates: beyond the classic canonical circuit model. Proc Jpn Acad Ser B Phys Biol Sci. 2022;98(3):93–111.
- 58. Cao L, Yang C, Hu L, Jiang W, Ren Y, Xia T, et al. Deciphering spatial domains from spatially resolved transcriptomics with Siamese graph autoencoder. GigaScience. 2024;13(1).
- 59. Mills JD, Kavanagh T, Kim WS, Chen BJ, Kawahara Y, Halliday GM, et al. Unique transcriptome patterns of the white and grey matter corroborate structural and functional heterogeneity in the human frontal lobe. PLoS ONE. 2013;8(10):e78480.
- 60. Brown TG, Thayer MN, VanTreeck JG, Zarate N, Hart DW, Heilbronner S, et al. Striatal spatial heterogeneity, clustering, and white matter association of GFAP(+) astrocytes in a mouse model of Huntington’s disease. Front Cell Neurosci. 2023;17:1094503.
- 61. Blacker CJ, Millischer V, Webb LM, Ho AMC, Schalling M, Frye MA, et al. EAAT2 as a research target in bipolar disorder and unipolar depression: a systematic review. Mol Neuropsychiatry. 2020;5(Suppl 1):44–59.
- 62. Hofmann K, Rodriguez-Rodriguez R, Gaebler A, Casals N, Scheller A, Kuerschner L. Astrocytes and oligodendrocytes in grey and white matter regions of the brain metabolize fatty acids. Sci Rep. 2017;7(1):10779.
- 63. Kubo KI. Increased densities of white matter neurons as a cross-disease feature of neuropsychiatric disorders. Psychiatry Clin Neurosci. 2020;74(3):166–75.
- 64. Wolf FA, Angerer P, Theis FJ. Scanpy: large-scale single-cell gene expression data analysis. Genome Biol. 2018;19(1):15.
- 65. Ren P, Zhang R, Wang Y, Zhang P, Luo C, Wang S, et al. Systematic benchmarking of high-throughput subcellular spatial transcriptomics platforms. bioRxiv. 2024:2024.12.23.630033.
- 66. Zhu J, Sun S, Zhou X. SPARK-X: non-parametric modeling enables scalable and robust detection of spatial expression patterns for large spatial transcriptomic studies. Genome Biol. 2021;22(1):184.
- 67. Svensson V, Teichmann SA, Stegle O. SpatialDE: identification of spatially variable genes. Nat Methods. 2018;15(5):343–6.
- 68. Williams CG, Lee HJ, Asatsuma T, Vento-Tormo R, Haque A. An introduction to spatial transcriptomics for biomedical research. Genome Med. 2022;14(1):68.
- 69. Liu J, Liu W, Chai X, Zhang X, Lin Z. CAESAR: a cross-technology and cross-resolution framework for spatial omics annotation. 2024.
- 70. Bard J, Rhee SY, Ashburner M. An ontology for cell types. Genome Biol. 2005;6(2):R21.
- 71. Ba JL, Kiros JR, Hinton GE. Layer normalization. arXiv preprint arXiv:1607.06450. 2016.
- 72. Zhang Z. Improved Adam optimizer for deep neural networks. 2018 IEEE/ACM 26th International Symposium on Quality of Service (IWQoS); 2018: IEEE.
- 73. Zhang X, Zou Y, Shi W. Dilated convolution neural network with LeakyReLU for environmental sound classification. 2017 22nd International Conference on Digital Signal Processing (DSP); 2017: IEEE.
- 74. Liu X, Shen Q, Zhang S. Cross-species cell-type assignment from single-cell RNA-seq data by a heterogeneous graph neural network. Genome Res. 2023;33(1):96–111.
- 75. Yu G, Wang LG, Han Y, He QY. clusterProfiler: an R package for comparing biological themes among gene clusters. OMICS. 2012;16(5):284–7.
- 76. Zhao L. wang2018three_STATmap. Datasets. Figshare. 10.6084/m9.figshare.19786456.v1. 2022.
- 77. Bhattacherjee A, Djekidel MN, Chen R, Chen W, Tuesta LM, Zhang Y. Cell type-specific transcriptional programs in mouse prefrontal cortex during adolescence and addiction. Datasets. Gene Expression Omnibus. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE124952. 2019.
- 78. Hodge RD, Bakken TE, Miller JA, Smith KA, Barkan ER, Graybuck LT, et al. Conserved cell types with divergent features in human versus mouse cortex. Nature. 2019;573(7772):61–8.
- 79. Allen Institute for Brain Science. Mouse V1 and ALM SMART-seq. Datasets. Allen Brain Map. https://portal.brain-map.org/atlases-and-data/rnaseq/mouse-v1-and-alm-smart-seq. 2018.
- 80. Eng C-HL, Cai L. NIH3T3_point_locations for RNA seqFISH+ experiments. Datasets. Zenodo. https://zenodo.org/records/2669683. 2019.
- 81. Pijuan-Sala B, Griffiths JA, Guibentif C, Hiscock TW, Jawaid W, Calero-Nieto FJ, et al. A single-cell molecular map of mouse gastrulation and early organogenesis. Nature. 2019;566(7745):490–5.
- 82. Zeisel A, Muñoz-Manchado AB, Codeluppi S, Lönnerberg P, La Manno G, Juréus A, et al. Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq. Science. 2015;347(6226):1138–42.
- 83. Zeisel A, Muñoz-Manchado AB, Codeluppi S, Lönnerberg P, La Manno G, Juréus A, et al. Single-cell RNA-seq of mouse cerebral cortex. Datasets. Gene Expression Omnibus. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE60361. 2015.
- 84. Zhuang X. A molecularly defined and spatially resolved cell atlas of the mouse primary motor cortex. Datasets. Brain Knowledge Platform. https://knowledge.brain-map.org/data/L3GYGFMDJCG0GUEE3QG. 2021.
- 85. Yao Z, Liu H, Xie F, Fischer S, Adkins RS, Aldridge AI, et al. A transcriptomic and epigenomic cell atlas of the mouse primary motor cortex. Nature. 2021;598(7879):103–10.
- 86. Allen Institute for Brain Science. Mouse cortical 10X v3 single-cell transcriptome dataset. Datasets. NEMO Archive. https://data.nemoarchive.org/biccn/lab/zeng/transcriptome/scell/10x_v3/mouse/processed/analysis/10X_cells_v3_AIBS/. 2020.
- 87. Choi J, Li J, Ferdous S, Liang Q, Moffitt JR, Chen R. Spatial organization of the mouse retina at single cell resolution by MERFISH. Datasets. Zenodo. 10.5281/zenodo.8144355. 2023.
- 88. Hoang T, Wang J, Boyd P, Wang F, Hyde DR, Qian J, Blackshaw S. Comparative transcriptomic and epigenomic analysis identifies key regulators of injury response and neurogenic competence in retinal glia. Datasets. Gene Expression Omnibus. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE135406. 2020.
- 89. Pisco A. MERFISH mouse comparison study. Datasets. Figshare. https://figshare.com/projects/MERFISH_mouse_comparison_study/134213. 2022.
- 90. Moffitt JR, Bambah-Mukku D, Eichhorn SW, Vaughn E, Shekhar K, Perez JD, Rubinstein ND, Hao J, Regev A, Dulac C, Zhuang X. Molecular, spatial and functional single-cell profiling of the hypothalamic preoptic region. Datasets. Dryad. 10.5061/dryad.8t8s248. 2018.
- 91. Zhuang X, Dulac C. Molecular, spatial, and functional single-cell profiling of the hypothalamic preoptic region. Datasets. Gene Expression Omnibus. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE113576. 2018.
- 92. Marioni Lab. A single-cell molecular map of mouse gastrulation and early organogenesis. Datasets. Marioni Lab Shiny Apps. https://crukci.shinyapps.io/mousegastrulation2018/. 2018.
- 93. Griffiths J, Lun A. Overview of the MouseGastrulationData datasets. Datasets. Bioconductor. https://bioconductor.org/packages/devel/data/experiment/vignettes/MouseGastrulationData/inst/doc/MouseGastrulationData.html. 2022.
- 94. Zeng H, Huang J, Zhou H, Meilandt WJ, Dejanovic B, Zhou Y, Bohlen CJ, Lee S-H, Ren J, Liu A, Tang Z, Sheng H, Liu J, Sheng M, Wang X. Integrative in situ mapping of single-cell transcriptional states and tissue histopathology in an Alzheimer disease model. Datasets. Zenodo. https://zenodo.org/records/7332091. 2022.
- 95. Russell AJ, Weir JA, Nadaf NM, Shabet M, Kumar V, Kambhampati S, et al. Slide-tags snRNA-seq on human prefrontal cortex. Datasets. Single Cell Portal. https://singlecell.broadinstitute.org/single_cell/study/SCP2167/slide-tags-snrna-seq-on-human-prefrontal-cortex. 2023.
- 96. Herring CA, Simmons RK, Freytag S, Poppe D, Moffet JJ, Pflueger J, et al. Human prefrontal cortex gene regulatory dynamics from gestation to adulthood at single-cell resolution. Datasets. Gene Expression Omnibus. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE168408. 2022.
- 97. NanoString Technologies. CosMx SMI Human Liver RNA FFPE Dataset. Datasets. NanoString. https://nanostring.com/products/cosmx-spatial-molecular-imager/ffpe-dataset/human-liver-rna-ffpe-dataset/. 2022.
- 98. Shen Q, Dong K, Zhang SQ, Zhang S. High-precision cell-type mapping and annotation of single-cell spatial transcriptomics with STAMapper. GitHub. https://github.com/zhanglabtools/STAMapper. 2025.
- 99. Shen Q, Dong K, Zhang SQ, Zhang S. High-precision cell-type mapping and annotation of single-cell spatial transcriptomics with STAMapper. Zenodo. https://zenodo.org/records/17007394. 2025.
Associated Data
Supplementary Materials
Additional file 1: Figures S1-S6, Tables S1-S2. Supplementary figures that complement the analyses in the main text, and tables detailing the collected datasets.
Additional file 2: Table S3. Running time and peak memory cost of STAMapper and competing methods.
Data Availability Statement
All datasets analyzed in this study are publicly available. The raw datasets are available from the following studies:
Dataset 1–4 (mouse prefrontal cortex): STARmap [15, 76], https://github.com/weallen/STARmap/tree/master; 10x Chromium [15, 77], adult samples from GSE124952 in the GEO database.
Dataset 5–9 (mouse visual cortex): STARmap [15, 76], https://github.com/weallen/STARmap/tree/master; Smart-seq [78, 79], https://portal.brain-map.org/atlases-and-data/rnaseq/mouse-v1-and-alm-smart-seq.
Dataset 10 (mouse visual cortex): seqFISH + [13, 80], https://github.com/CaiGroup/seqFISH-PLUS; Smart-seq [78, 79], https://portal.brain-map.org/atlases-and-data/rnaseq/mouse-v1-and-alm-smart-seq.
Dataset 11 (mouse somatosensory cortex): osmFISH [79, 81], https://github.com/drieslab/spatial-datasets/tree/master/data/2018_osmFISH_SScortex/raw_data; STRT/C1 [82, 83], GSE60361 in the GEO database.
Dataset 12–23 (mouse primary motor cortex): MERFISH [11, 84], https://knowledge.brain-map.org/data/L3GYGFMDJCG0GUEE3QG/; 10x Chromium [85, 86], https://data.nemoarchive.org/biccn/lab/zeng/transcriptome/scell/10x_v3/mouse/processed/analysis/10X_cells_v3_AIBS/.
Dataset 24–33 (mouse retina): MERFISH [33, 87], https://zenodo.org/records/8144355; 10x Chromium [34, 88], GSE135406 in the GEO database.
Dataset 34 (mouse kidney): MERFISH [26, 89], https://figshare.com/projects/MERFISH_mouse_comparison_study/134213; 10x Chromium [26, 89], https://figshare.com/articles/dataset/SingleCellData_raw_/19310675.
Dataset 35 (mouse liver): MERFISH [26, 89], https://figshare.com/projects/MERFISH_mouse_comparison_study/134213; 10x Chromium [26, 89], https://figshare.com/articles/dataset/SingleCellData_raw_/19310675.
Dataset 36–71 (mouse hypothalamic preoptic region): MERFISH [22, 90], https://datadryad.org/stash/dataset/10.5061/dryad.8t8s248; Droplet-microfluidic [22, 91], GSE113576 in the GEO database.
Dataset 72–77 (mouse gastrulation): osmFISH, seqFISH [81, 92], https://crukci.shinyapps.io/mousegastrulation2018/; 10x Chromium [81, 93], https://bioconductor.org/packages/devel/data/experiment/vignettes/MouseGastrulationData/inst/doc/MouseGastrulationData.html.
Dataset 78–79 (mouse prefrontal cortex): STARmap PLUS [16, 94], https://zenodo.org/records/7332091; 10x Chromium [15, 77], GSE124952 in the GEO database.
Dataset 80 (human prefrontal cortex): Slide-tags [18, 95], https://singlecell.broadinstitute.org/single_cell/study/SCP2167/slide-tags-snrna-seq-on-human-prefrontal-cortex#study-download; 10x Chromium [52, 96], GSE168408 in the GEO database.
Dataset 81 (human liver cancer): NanoString [17, 97], https://nanostring.com/resources/liver-cancer-raw-data-files-cosmx-smi-human-liver-ffpe-dataset/; 10x Chromium [39, 97], GSE149614 in the GEO database.
STAMapper is implemented in Python. It is available on GitHub under an open-source MIT license (https://github.com/zhanglabtools/STAMapper) [98] and has been deposited in Zenodo with a permanent DOI (https://zenodo.org/records/17007394) [99].
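As a reading aid, the dataset numbering in this statement can be tallied programmatically. The following is a minimal sketch, not part of STAMapper: the table is transcribed by hand from the list above, the names `DATASETS`, `count_datasets`, and `geo_url` are hypothetical, and the GEO link pattern is assumed from the accession URLs given in the reference list. It checks that the index ranges are contiguous and sum to the 81 scST datasets reported in the abstract, and it builds a landing-page link for each GEO accession.

```python
# Hand-transcribed from the Data Availability Statement (illustrative only).
# Fields: first index, last index, tissue, scST technology as listed,
# GEO accession of the paired scRNA-seq data (None if hosted elsewhere).
DATASETS = [
    (1, 4, "mouse prefrontal cortex", "STARmap", "GSE124952"),
    (5, 9, "mouse visual cortex", "STARmap", None),
    (10, 10, "mouse visual cortex", "seqFISH+", None),
    (11, 11, "mouse somatosensory cortex", "osmFISH", "GSE60361"),
    (12, 23, "mouse primary motor cortex", "MERFISH", None),
    (24, 33, "mouse retina", "MERFISH", "GSE135406"),
    (34, 34, "mouse kidney", "MERFISH", None),
    (35, 35, "mouse liver", "MERFISH", None),
    (36, 71, "mouse hypothalamic preoptic region", "MERFISH", "GSE113576"),
    (72, 77, "mouse gastrulation", "osmFISH, seqFISH", None),
    (78, 79, "mouse prefrontal cortex", "STARmap PLUS", "GSE124952"),
    (80, 80, "human prefrontal cortex", "Slide-tags", "GSE168408"),
    (81, 81, "human liver cancer", "NanoString", "GSE149614"),
]

# Assumed link pattern, taken from the GEO URLs in the reference list.
GEO_BASE = "https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc="


def geo_url(accession):
    """Return the GEO landing-page URL for an accession such as GSE124952."""
    return GEO_BASE + accession


def count_datasets(entries):
    """Check that index ranges are contiguous from 1 and return the total."""
    expected_first = 1
    for entry in entries:
        first, last = entry[0], entry[1]
        if first != expected_first or last < first:
            raise ValueError(f"gap or overlap at dataset {first}")
        expected_first = last + 1
    return expected_first - 1


total = count_datasets(DATASETS)
geo_accessions = sorted({entry[4] for entry in DATASETS if entry[4]})

print(f"{total} scST datasets listed")
for accession in geo_accessions:
    print(accession, geo_url(accession))
```

Datasets whose paired scRNA-seq data are hosted outside GEO (for example on Figshare or the NEMO Archive) carry `None` in the last field and are skipped when the links are built.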






