Abstract
Background
Spatial transcriptomics preserves spatial context of tissues while capturing gene expression. As the technology advances, researchers are increasingly generating data from multiple tissue sections, creating a growing demand for multi-slice integration methods. These methods aim to generate spatially aware embeddings that jointly capture spatial and transcriptomic information, preserving biological signals while mitigating technical artifacts such as batch effects. However, the reliability of these methods varies, and the growing diversity of technologies makes integration even more challenging. This underscores the need for a comprehensive benchmark to evaluate their performance, which is still lacking.
Results
To systematically evaluate the performance of multi-slice integration methods, we propose a comprehensive benchmarking framework covering four key tasks that form an upstream-to-downstream pipeline: multi-slice integration, spatial clustering, spatial alignment, and slice representation. For each task, we perform detailed analyses of the methods and provide actionable recommendations. Our results reveal substantial data-dependent variation in performance across tasks. We further investigate the relationships between upstream and downstream tasks, showing that downstream performance often depends on upstream quality.
Conclusions
Our study provides a comprehensive benchmark of 12 multi-slice integration methods across four key tasks using 19 diverse datasets. Our results reveal that method performance is highly dependent on application context, dataset size, and technology. We also identified strong interdependencies between upstream and downstream tasks, highlighting the importance of robust early-stage analysis.
Supplementary Information
The online version contains supplementary material available at 10.1186/s13059-025-03796-z.
Keywords: Spatial transcriptomics, Spatial multi-slice integration, Systematic benchmark
Background
Spatial transcriptomics technologies provide unprecedented insights into tissue organization by preserving spatial context while measuring gene expression [1], offering a new perspective for studying embryonic development, the nervous system, and the tumor microenvironment [2–5]. As the technologies advance, researchers increasingly generate data from multiple tissue sections, enabling more systematic and comprehensive analyses [6–8]. This growing demand has driven the development of computational methods designed to integrate spatially complex, multi-slice datasets [9, 10].
Current multi-slice integration methods typically focus on two main aspects. First, they integrate spatial and expression data to generate spatially aware embeddings for each cell or spot. Second, they aim to correct batch effects across slices, facilitating comprehensive analysis of multi-slice data. Based on their underlying strategies, existing integration methods can be broadly classified into three categories. (1) Deep learning-based methods: These methods primarily use variational autoencoders (VAEs) or graph neural networks (GNNs) to integrate spatial and expression data. They incorporate batch information as a covariate during model training to remove batch effects [9, 11, 12]. (2) Statistical methods: These methods consider factors such as the cellular microenvironment or abundance data to associate cells or spots with their surrounding tissues, naturally mitigating batch effects. Some methods also use batch processing tools like Harmony for further batch effect removal [7, 10, 13–16]. (3) Hybrid methods: These methods combine elements of both deep learning and statistical approaches. Initially, a deep learning framework is used to generate integrated representations of individual cells or spots, which are then further refined by incorporating information about the spatial context [17–19].
In spatial transcriptomics analysis, a sequential workflow of interconnected tasks is essential for comprehensive tissue characterization. Multi-slice integration methods serve as the foundational first step, producing spatial embeddings that capture the combined information across tissue sections [20, 21]. Spatial clustering then operates on these spatial embeddings, applying unsupervised techniques to identify distinct spatial domains within the tissue [17]. These spatial domains support downstream applications such as slice representation, which characterizes each slice based on spatial domain composition and facilitates connections with metadata (e.g., clinical annotations) for deeper biological insights [15, 22, 23].
In addition, spatial alignment, which aligns multiple tissue slices to a common coordinate system for three-dimensional reconstruction, is another critical analysis step in spatial transcriptomics. Existing spatial alignment methods can be broadly categorized into two types: integration-based and non-integration-based methods. Integration-based alignment methods rely on spatial domains or embeddings derived from the integration process to correct spatial coordinates between adjacent slices [19, 24, 25]. For instance, STAligner uses user-specified landmark domains and computes mutual nearest neighbors (MNNs) based on integrated embeddings [19]. It then aligns slices by leveraging MNN information within the landmark domains. SPACEL follows a similar strategy but identifies MNNs using spatial coordinates instead of embeddings, incorporating information from all domains for alignment [24]. In contrast, non-integration-based alignment methods directly utilize spatial coordinates and/or gene expression data to align slices [26–29]. For example, PASTE uses both gene expression profiles and spatial coordinates from adjacent slices and applies fused Gromov-Wasserstein optimal transport to estimate the pairing probabilities between cells or spots, which then inform spatial coordinate correction [26]. STalign treats tissue slices as images and aligns them using Large Deformation Diffeomorphic Metric Mapping (LDDMM) based on image varifold representations [28]. Thus, for integration-based methods, spatial alignment is inherently a downstream application of multi-slice integration and spatial clustering, whereas non-integration-based methods treat alignment as an independent or even upstream step that may precede or guide integration.
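The mutual nearest neighbor (MNN) idea described above can be illustrated with a minimal Python sketch. It finds pairs of spots from two slices that are each other's nearest neighbors in a shared feature space. The function name `mutual_nearest_neighbors`, the parameter `k`, and the toy arrays are illustrative assumptions and do not reproduce the STAligner or SPACEL implementations; passing integrated embeddings mimics the embedding-based variant, while passing spatial coordinates mimics the coordinate-based variant.

```python
import numpy as np

def mutual_nearest_neighbors(feat_a, feat_b, k=1):
    """Return (i, j) pairs where spot i of slice A and spot j of slice B
    are within each other's k nearest neighbors (hypothetical helper)."""
    # Pairwise Euclidean distances between all spots, shape (n_a, n_b)
    dist = np.linalg.norm(feat_a[:, None, :] - feat_b[None, :, :], axis=2)
    # Indices of the k nearest spots in B for each spot in A, and vice versa
    a_to_b = np.argsort(dist, axis=1)[:, :k]
    b_to_a = np.argsort(dist.T, axis=1)[:, :k]
    pairs = []
    for i in range(feat_a.shape[0]):
        for j in a_to_b[i]:
            # Keep the pair only if the relationship is mutual
            if i in b_to_a[j]:
                pairs.append((i, int(j)))
    return pairs

# Toy features for two slices (made-up values, 2 dimensions per spot)
feat_a = np.array([[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]])
feat_b = np.array([[0.1, 0.0], [5.0, 5.2], [0.3, 0.0]])
pairs = mutual_nearest_neighbors(feat_a, feat_b, k=1)
print(pairs)
```

The dense distance matrix is adequate for a toy example; for slices with many thousands of spots, a tree-based or approximate neighbor search would be the more practical choice.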
Therefore, this hierarchical workflow, from integration (generating spatial embeddings) to clustering (identifying spatial domains) to alignment and representation (utilizing these domains), highlights the inherent complexity of spatial analysis, which contributes to the significant uncertainty in multi-slice integration methods. This challenge is further compounded by the expanding diversity of spatial transcriptomics technologies, each with distinct technical and biological biases that complicate integration [30, 31]. To date, several studies have aimed to establish benchmarks for spatial transcriptomics analysis methods. For example, Yuan et al. proposed a comprehensive evaluation framework for spatial clustering methods across various datasets; however, their work focused solely on single-slice analysis [32]. Hu et al. conducted a preliminary evaluation of multi-slice integration methods, comparing only four methods across two multi-slice datasets [33]. Notably, their findings indicated that multi-slice integration improves spatial clustering performance through joint embedding, compared to traditional single-slice approaches. This highlights the importance of a more systematic assessment. Consequently, a comprehensive benchmarking framework that rigorously evaluates multi-slice integration methods across diverse data types and downstream applications remains urgently needed.
To this end, we present a comprehensive benchmarking framework that systematically evaluates 12 state-of-the-art multi-slice integration methods across 19 datasets from seven sources representing various spatial technologies. Our evaluation framework covers both the core multi-slice integration performance and three critical downstream applications: spatial clustering, spatial alignment, and slice representation. Finally, by analyzing the relationships between upstream and downstream tasks, we show that downstream performance is correlated with upstream quality.
Results
Overview of the overall framework
Traditional benchmarking frameworks for single-cell genomics primarily focus on embedding quality and clustering [34]. However, spatial transcriptomics demands a more nuanced approach due to its spatial complexity and spatial-related downstream applications [32, 35]. To rigorously assess the performance of multi-slice integration methods, we proposed a comprehensive evaluation framework that includes multi-slice integration, spatial clustering, spatial alignment and slice representation (Fig. 1a). We also evaluated the scalability of each method (Additional file 1: Note S1). Finally, we provided an overall analysis of each method’s performance across different tasks and data types, and further examined the correlations between upstream and downstream analysis (Fig. 1a).
Fig. 1.
Pipeline and data. a The pipeline of the overall study. b The datasets used in this study are presented, including information on the number of slices, the total number of cells/spots, and the number of cells/spots and genes in each slice. The length of the bars representing cells/spots is proportional to the mean values, with the numbers next to the bars indicating the mean. Error bars represent the standard error across slices within each dataset. The length of the bars for genes is proportional to the number of genes, except for the 10X Visium dataset, where the large gene count is taken into account
We evaluated 12 advanced methods in this study (see Methods), including four deep learning-based methods, five statistical methods and three hybrid methods. The majority of these methods were published after 2023, reflecting the latest advancements in the field. The deep learning-based methods are GraphST, GraphST-PASTE [9], SPIRAL [11], and STAIG [12]. The statistical methods are Banksy [10], CN [7], MENDER [13], PRECAST [14], and SpaDo [15]. The hybrid methods are CellCharter [17], NicheCompass [18], and STAligner [19].
Given the variability in resolution and data distribution across different technologies, the performance of a given method can vary depending on the dataset. To account for this, 19 spatial transcriptomics datasets were collected from seven sources, representing multiple technologies like 10X Visium, BaristaSeq, MERFISH, and STARMap (Fig. 1b, see Methods) [3, 36–40]. Each dataset contains at least two slices from the same or similar tissue, accompanied by domain annotations.
Finally, our benchmarking revealed that no single method consistently outperforms others across all datasets and tasks. Meanwhile, we found that the performance of spatial clustering is strongly influenced by the quality of upstream integration, and that integration-based spatial alignment is closely correlated with spatial clustering. The complete benchmarking workflow is available on GitHub (https://github.com/bm2-lab/iSTBench) [41], enabling users to reproduce the results presented in this study or apply the framework to their own dataset.
Benchmarking multi-slice integration
Multi-slice integration is the essential first step in spatial transcriptomics analysis [20], so we evaluated this task first. We used 18 spatial transcriptomics datasets from six distinct sources (see Methods), excluding the TNBC dataset, as its cancer-derived samples exhibit high heterogeneity and are not considered standard for evaluating multi-slice integration [6]. This dataset was instead used for evaluating slice representation. Drawing on previous studies evaluating single-cell transcriptome integration methods [34], we employed bASW, iLISI and GC to evaluate the effectiveness of the methods in removing batch effects. Additionally, dASW, dLISI, and ILL were used to evaluate the capacity of the methods to conserve biological variance throughout the integration process (see Methods). Unlike in single-cell transcriptome analysis, we used slice labels as batch labels and domain labels as biological labels to calculate these metrics, adapting them to spatial transcriptomics datasets. Since CN is not a direct method for data integration, it was excluded from this evaluation.
We first focused on the performance of the various methods on the 10X Visium dataset [36]. All methods were capable of removing batch effects and conserving biological variance to some degree; however, performance varied (Fig. 2a, b, Additional file 2: Figs. S1–S2). The combination of bASW, iLISI, and GC indicated that GraphST-PASTE (mean bASW 0.940, mean iLISI 0.713, mean GC 0.527) was the most effective at removing batch effects, though it struggled to conserve biological variance (Fig. 2c). In contrast, MENDER (mean dASW 0.559, mean dLISI 0.988, mean ILL 0.568); STAIG (mean dASW 0.595, mean dLISI 0.963, mean ILL 0.606); and SpaDo (mean dASW 0.556, mean dLISI 0.985, mean ILL 0.575) excelled at preserving biological variance, although SpaDo was less effective in removing batch effects. Other methods, including STAligner, CellCharter, and SPIRAL, showed moderate overall performance, while the remaining methods performed poorly on both aspects. Considering both batch effect removal and biological signal conservation, STAIG and MENDER emerged as the most suitable methods for integrating 10X Visium data.
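To make the labeling scheme concrete, the following Python sketch computes silhouette-based scores on a toy embedding, with slice labels acting as batch labels and domain labels acting as biological labels. The rescaling used here (domain score mapped to [0, 1]; batch score computed as 1 minus the absolute silhouette within each domain) follows a convention common in single-cell integration benchmarks and is an assumption; the exact definitions of dASW and bASW in Methods may differ. The function names and the simulated data are hypothetical.

```python
import numpy as np
from sklearn.metrics import silhouette_samples, silhouette_score

def domain_asw(emb, domain):
    # Silhouette over domain labels, rescaled from [-1, 1] to [0, 1];
    # higher values mean domains are better separated (assumed convention)
    return float((silhouette_score(emb, domain) + 1) / 2)

def batch_asw(emb, slice_labels, domain):
    # Within each domain, a silhouette near 0 over slice labels means the
    # slices are well mixed, so 1 - |s| is averaged (assumed convention)
    scores = []
    for d in np.unique(domain):
        mask = domain == d
        if np.unique(slice_labels[mask]).size < 2:
            continue  # a domain seen in only one slice carries no batch signal
        s = silhouette_samples(emb[mask], slice_labels[mask])
        scores.append(np.mean(1 - np.abs(s)))
    return float(np.mean(scores))

# Simulated embedding: two well-separated domains, two slices mixed in each
rng = np.random.default_rng(0)
emb = np.vstack([rng.normal(0, 1, (60, 2)), rng.normal(10, 1, (60, 2))])
domain = np.repeat(["layer_a", "layer_b"], 60)
slice_labels = np.tile(["slice_1", "slice_2"], 60)

dasw = domain_asw(emb, domain)
basw = batch_asw(emb, slice_labels, domain)
print(f"dASW-like score: {dasw:.3f}, bASW-like score: {basw:.3f}")
```

In this toy setting both scores should be high, because the domains are separated and the slices are interleaved within each domain; an over-corrected embedding would instead raise the batch score while lowering the domain score.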
Fig. 2.
Benchmarking results for multi-slice integration across multiple slices. a, b Visualization of the multi-slice integration results for each method on the DLPFC S3 dataset. The plots display UMAP embeddings based on the results from each method, with cells colored according to their slice (a) and domain (b) labels. c The boxplots of dASW, dLISI, ILL, bASW, iLISI, and GC for each method on all 3 datasets of 10X Visium dataset. Center line: median; box limits: upper and lower quartiles; whiskers: max or min value no further than 1.5 × interquartile range; arrow: metric dominant direction. d Overall performance of each method on each dataset, with metrics displayed based on their rank across all methods. Methods that failed to run for a particular dataset were assigned a rank of 0. The overall score represents the average rank of the six metrics for each method. The indicator represents the relative value of the corresponding metric for each method after min–max normalization across all methods
A similar analysis was performed on the MERFISH Brain dataset generated using the MERFISH technique [3] (Additional file 2: Figs. S3–S8). The combination of bASW, iLISI, and GC indicated that GraphST-PASTE (mean bASW 0.974, mean iLISI 0.741, mean GC 0.806); MENDER (mean bASW 0.968, mean iLISI 0.694, mean GC 0.768); and CellCharter (mean bASW 0.965, mean iLISI 0.643, mean GC 0.902) were most effective at removing batch effects (Additional file 2: Fig. S9). In terms of preserving biological variance, NicheCompass (mean dASW 0.595, mean dLISI 0.987, mean ILL 0.563); SpaDo (mean dASW 0.623, mean dLISI 0.967, mean ILL 0.605); and CellCharter (mean dASW 0.555, mean dLISI 0.990, mean ILL 0.554) performed well. Similar trends were observed in the MERFISH Preoptic and the Large-scale datasets, which were also generated using the MERFISH platform (Additional file 2: Figs. S2, S9). However, due to memory and runtime limitations, some methods could not be applied to the Large-scale dataset. Overall, CellCharter consistently demonstrated strong performance across MERFISH-based data, making it especially suitable for integration tasks involving MERFISH technology.
In conclusion, our analysis of the performance of all methods across 18 datasets revealed that CellCharter consistently produces robust results without showing a preference for any specific sequencing technology in terms of batch effect removal or biological variance conservation (Fig. 2d, Additional file 2: Figs. S1–S8). Although GraphST-PASTE excelled at removing batch effects, it encountered challenges in preserving biological variance. As previously noted, the other methods each have unique attributes that contributed to their performance (Fig. 2d). Beyond compatibility between algorithms and sequencing platforms, these differences can also be explained by algorithmic design. For example, GraphST and GraphST-PASTE employ graph-based embeddings and explicitly create cross-slice edges by linking spots between slices during training. This design achieves strong global integration and effectively reduces batch effects, but it may also lead to over-correction, diminishing biological diversity. By contrast, MENDER and SpaDo emphasize biological context by modeling neighborhood relationships, thereby better preserving genuine biological variance, though leaving some batch effects unresolved. Thus, each method’s integration outcome reflects its inherent strategy, whether it prioritizes strong integration across slices or the preservation of tissue-specific biological signals.
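The rank-based overall score described in the caption of Fig. 2 can be sketched as follows. This is a minimal illustration under stated assumptions: the best method receives the highest rank (so that a failed method, assigned rank 0, is always worst), ties share the average rank, and the function name `overall_rank_score`, the `higher_is_better` flags, and the metric values are hypothetical rather than taken from the study.

```python
import numpy as np
import pandas as pd

def overall_rank_score(metrics, higher_is_better):
    """Rank methods per metric and average the ranks (hypothetical helper).

    metrics: DataFrame with methods as rows and metrics as columns;
             NaN marks a method that failed to run.
    higher_is_better: dict mapping each metric name to True or False.
    """
    ranks = pd.DataFrame(index=metrics.index)
    for name in metrics.columns:
        # Flip the sign of lower-is-better metrics so larger is always better
        values = metrics[name] if higher_is_better[name] else -metrics[name]
        # Worst method gets rank 1, best gets the highest rank; failures get 0
        ranks[name] = values.rank(method="average").fillna(0)
    ranks["overall"] = ranks.mean(axis=1)
    return ranks

# Made-up values for three methods; method_c failed to run on this dataset
metrics = pd.DataFrame(
    {"score_up": [0.9, 0.5, np.nan], "score_down": [0.1, 0.2, np.nan]},
    index=["method_a", "method_b", "method_c"],
)
ranks = overall_rank_score(metrics, {"score_up": True, "score_down": False})
print(ranks)
```

Averaging ranks rather than raw values keeps metrics with different scales comparable, at the cost of discarding how large the gaps between methods are.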
Benchmarking spatial clustering
Spatial clustering is evaluated next, as it is the immediate analytical step following multi-slice integration and forms the basis for a series of downstream applications [17]. To evaluate spatial clustering accuracy, we used two standard clustering metrics, ARI and NMI, and employed CHAOS [42] and PAS [42] to assess spatial continuity based on domain consistency and distinct boundaries (see Methods).
To illustrate the results, we visualized the domain labels identified by each method on the DLPFC S3 dataset. As shown in Fig. 3a, most methods produced hierarchical structures largely consistent with the ground truth. According to the ARI results, the three most accurate methods for spatial clustering in the 10X Visium dataset were Banksy (mean ARI 0.518), STAligner (mean ARI 0.492), and SPIRAL (mean ARI 0.465). These methods also demonstrated strong consistency, with similar trends reflected in the NMI results (Fig. 3b). For spatial continuity, measured by CHAOS and PAS, SpaDo and STAligner consistently produced low scores, indicating high spatial continuity in their clustered domains (Fig. 3b). In contrast, Banksy and SPIRAL exhibited higher CHAOS and PAS values, suggesting more scattered domain spots, despite their accurate hierarchical clustering (Fig. 3a). Considering both spatial clustering accuracy and continuity, STAligner emerges as the most balanced and recommended method for 10X Visium data. However, in many practical applications, clustering accuracy is prioritized over spatial continuity. In such cases, Banksy and SPIRAL are also good choices.
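For readers less familiar with ARI and NMI, the short Python sketch below shows how both can be computed with scikit-learn from annotated and predicted domain labels. The label vectors are made up for illustration. Both metrics are invariant to how clusters are named, so a prediction that recovers the annotated partition under different labels still scores 1, whereas merging two annotated domains lowers the score.

```python
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

# Hypothetical annotated domains for six spots
truth = ["L1", "L1", "L2", "L2", "L3", "L3"]

# Same partition with different cluster names
pred_same = [2, 2, 0, 0, 1, 1]
# Domains L1 and L2 merged into a single cluster
pred_merged = [0, 0, 0, 0, 1, 1]

ari_same = adjusted_rand_score(truth, pred_same)
nmi_same = normalized_mutual_info_score(truth, pred_same)
ari_merged = adjusted_rand_score(truth, pred_merged)
nmi_merged = normalized_mutual_info_score(truth, pred_merged)

print(f"identical partition: ARI={ari_same:.3f}, NMI={nmi_same:.3f}")
print(f"merged domains:      ARI={ari_merged:.3f}, NMI={nmi_merged:.3f}")
```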
Fig. 3.
Benchmarking results for spatial clustering across multiple slices. a Visualization of spatial clustering results for each method on the DLPFC S3 dataset. The plots display spatial coordinates with cells colored according to domain labels, with the ground truth domain on the left and the results from each method on the right. b The boxplots of ARI, NMI, CHAOS, and PAS for each method on all 3 datasets of 10X Visium dataset. Center line: median; box limits: upper and lower quartiles; whiskers: max or min value no further than 1.5 × interquartile range; arrow: metric dominant direction. c Overall performance of each method on each dataset, with metrics displayed based on their rank across all methods. Methods that failed to run for a particular dataset were assigned a rank of 0. The overall score represents the average rank of the four metrics for each method. The indicator represents the relative value of the corresponding metric for each method after min-max normalization across all methods
Subsequently, we analyzed results using the MERFISH Brain dataset. Visualization of domain identification showed that most methods were able to capture distinct domain structures (Additional file 2: Figs. S12–S17). Based on ARI scores, the top-performing methods for spatial clustering were Banksy (mean ARI 0.601), NicheCompass (mean ARI 0.568), and MENDER (mean ARI 0.565). Meanwhile, NMI results suggested that STAligner (mean NMI 0.610) also has a performance edge for this dataset (Additional file 2: Fig. S12). For spatial continuity, as assessed by CHAOS and PAS, GraphST (mean CHAOS 0.0188, mean PAS 0.0326); SpaDo (mean CHAOS 0.0193, mean PAS 0.0577); and GraphST-PASTE (mean CHAOS 0.0195, mean PAS 0.0637) produced the most spatially coherent domains (Additional file 2: Fig. S12). Similar patterns were observed in the MERFISH Preoptic and the Large-scale datasets, which were also generated using the MERFISH platform (Additional file 2: Figs. S11, S17). Compared to the 10X Visium results, Banksy and STAligner maintained strong performance in domain clustering, while SPIRAL performed poorly on the MERFISH data.
To further explore potential method preferences for specific sequencing technologies in spatial clustering tasks, we analyzed the performance of all 12 methods across 18 datasets, excluding the TNBC dataset (Fig. 3c, Additional file 2: Figs. S10–S17). Banksy consistently achieved strong spatial clustering accuracy across all 18 datasets, highlighting its robust performance. STAligner also performed well in spatial clustering on most datasets, except the STARMap dataset. However, because of memory and runtime limitations, it could not process larger datasets such as the Large-scale dataset. Although CellCharter was not the top performer, its performance was relatively stable and at a moderate level on the vast majority of datasets, except the BaristaSeq dataset. MENDER, NicheCompass, SpaDo, and STAIG also showed superior performance on certain datasets, but their results were less consistent overall (Fig. 3c).
For domain continuity, SpaDo and MENDER exhibited consistent performance across most data, while Banksy, CellCharter, and STAligner, despite their accuracy in domain identification, showed lower continuity (Fig. 3c). These differences likely reflect the underlying characteristics of each method’s algorithm. SpaDo and MENDER, which rely on cell abundance within the microenvironment, tend to produce well-defined and spatially coherent domains. In contrast, Banksy, CellCharter, and STAligner utilize high-dimensional features—such as gene expression profiles or learned embeddings—which can lead to fragmented domain boundaries or scattered domain assignments, ultimately reducing spatial continuity.
Benchmarking spatial alignment
Alignment between adjacent slices is essential for accurately reconstructing the 3D structure of tissues and analyzing the spatial distribution of different domains or cells [43]. Current spatial alignment methods can be divided into two types: integration-based and non-integration-based methods. To comprehensively evaluate performance across these categories, we selected PASTE and STalign as representatives of non-integration-based methods, and STAligner and SPACEL as representatives of integration-based methods. For integration-based alignment, we applied the results of different multi-slice integration methods as inputs to the STAligner and SPACEL pipelines. This design allows us to assess how both the integration strategy and the alignment scheme contribute to final spatial alignment performance (see Methods). We used 7 datasets to evaluate this task, including three from the 10X Visium dataset (DLPFC S1, DLPFC S2, and DLPFC S3), one from the MERFISH Brain dataset (MERFISH Brain S3), as well as the MERFISH Preoptic, BaristaSeq, and STARMap datasets. These datasets adequately highlight the performance differences across methods. To quantify alignment accuracy, we employed two metrics: Accuracy, which evaluates layer-wise alignment consistency, and Ratio, which measures the proportion of correctly matched spots across slices (see Methods). Since GraphST-PASTE has a built-in coordinate correction module based on PASTE, it was excluded from this evaluation to ensure fairness.
Using the MERFISH Preoptic dataset as an example, we observed that both PASTE and SPACEL-based methods achieved strong alignment performance. Notably, SPACEL-based methods showed consistently stable performance, regardless of which integration method was used as input (Fig. 4a, b). In contrast, STalign did not perform well on this dataset and, in some cases, failed to produce any meaningful alignment. This limitation likely stems from its image-based approach, which relies on distinct tissue shape features. Such features are less apparent in the MERFISH Preoptic dataset, as adjacent slices do not overlap in their original spatial positions. STAligner-based methods also showed high variability in performance, likely due to their reliance on landmark domain alignment. Although we standardized the choice of landmark domain across all integration methods, inconsistencies in domain identification significantly influenced the alignment outcomes (Fig. 4a, b). These findings were further validated across other datasets. For instance, although the performance of SPACEL-GraphST declined in DLPFC S1 and DLPFC S3, both PASTE and SPACEL-based methods generally demonstrated superior and stable alignment across most datasets. By contrast, STalign only performed well when slices already had partial overlap and failed to align effectively when no overlap was present, which is consistent with its tutorial recommendation that manual pre-adjustment should be performed before alignment (Fig. 4c, Additional file 2: Figs. S18–S20). In comparison, STAligner-based methods performed well on select datasets but remained highly sensitive to the integration method and landmark domain selection, resulting in overall instability (Additional file 2: Fig. S21).
Fig. 4.
Benchmarking results for spatial alignment. a Visualization of spatial alignment results for each method on the MERFISH Preoptic dataset. The stacking plots display spatial coordinates with cells colored according to slice labels. The original spatial coordinates are shown on the left and the corrected spatial coordinates from each method are shown on the right. b The boxplots of Accuracy and Ratio for each method across 5 slices within MERFISH Preoptic data. Center line: median; box limits: upper and lower quartiles; whiskers: max or min value no further than 1.5 × interquartile range. c Overall performance of each method on each dataset, with metrics displayed based on their rank across all methods. Methods that failed to run for a particular dataset were assigned a rank of 0. The overall score represents the average rank of the two metrics for each method. The indicator represents the relative value of the corresponding metric for each method after min-max normalization across all methods
In summary, PASTE is recommended for spatial alignment when the data has not undergone integration or lacks reliable domain information. SPACEL is a better choice when multi-slice integration has been performed or when domain annotations are available. While PASTE and SPACEL-based methods perform comparably in many scenarios, SPACEL-based approaches consistently outperform PASTE as the number of slices increases, for instance, in the MERFISH Preoptic dataset (Additional file 2: Fig. S22).
Benchmarking slice representation
Slice representation, another key application of spatial data integration, utilizes the abundance of identified spatial domains in each slice as representation. This approach, grounded in biology and clinically relevant, is a standard method for slice representation, widely used to stratify patient groups and visualize data in a low-dimensional space [7, 22, 23]. To assess the performance of slice representation, we used hierarchical clustering (hclust) and K-means to group samples based on domain abundance identified by different methods (see Methods). Since the true number of domains in new datasets cannot be determined a priori, we also evaluated clustering performance with varying numbers of domains for each method. The triple-negative breast cancer (TNBC) dataset [6], which includes 40 slices from different TNBC samples, was used as the gold standard (Fig. 5a, Additional file 2: Fig. S23). These slices are categorized into three groups: cold (low immune infiltrate), mixed (high mixture of tumor and immune cells), and compartmentalized (regions predominantly containing either immune or tumor cells) (Fig. 5b). We used two common multi-class evaluation metrics, ARI and NMI, to assess the clustering performance of the methods (see Methods). Banksy, CellCharter, CN, MENDER, NicheCompass, and PRECAST were included in the evaluation of this task, while other methods were excluded due to limitations in memory usage and runtime.
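The slice representation procedure described above can be sketched in Python as follows. Each slice is represented by the fraction of its cells assigned to each domain, and the slices are then grouped with K-means and with hierarchical clustering. This is a toy illustration under stated assumptions: the cell table is made up, SciPy's Ward linkage stands in for hclust, and the clustering settings are not those used in the study (see Methods for the actual configuration).

```python
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

# Hypothetical input: (cells in domain_1, cells in domain_2) for each slice
counts = {"s1": (8, 2), "s2": (9, 1), "s3": (7, 3),
          "s4": (2, 8), "s5": (1, 9), "s6": (3, 7)}
rows = []
for slice_id, (n1, n2) in counts.items():
    rows += [(slice_id, "domain_1")] * n1 + [(slice_id, "domain_2")] * n2
cells = pd.DataFrame(rows, columns=["slice", "domain"])

# Slice representation: per-slice domain abundance (each row sums to 1)
abundance = pd.crosstab(cells["slice"], cells["domain"], normalize="index")

# Made-up sample groups, ordered like abundance.index (s1 ... s6)
true_group = ["g1", "g1", "g1", "g2", "g2", "g2"]

kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(
    abundance.values
)
hclust_labels = fcluster(
    linkage(abundance.values, method="ward"), t=2, criterion="maxclust"
)

ari_kmeans = adjusted_rand_score(true_group, kmeans_labels)
ari_hclust = adjusted_rand_score(true_group, hclust_labels)
print(abundance)
print(f"ARI K-means: {ari_kmeans:.2f}, ARI hierarchical: {ari_hclust:.2f}")
```

Because the representation depends only on domain proportions, its quality is bounded by how biologically meaningful the identified domains are, which is why the number of domains is varied in the evaluation.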
Fig. 5.
Benchmarking results for slice representation. a The dataset information. b Representative patients from different classes are shown. The plots display spatial coordinates with cells colored according to the ground truth domain labels. c, d The heatmap of ARI and NMI for each method across different settings of the number of identified domains based on hclust (c) and K-means (d) clustering. Rows represent the number of domains, and columns represent different methods. The mean value for each method is calculated by averaging its performance across all domain counts, while the CV is calculated as the ratio of the standard error to the mean value across all domain counts. e, f Bar plots showing ARI (e) and NMI (f) for hierarchical clustering (hclust) and K-means based on domains identified by different integration methods. Bar heights indicate the mean ARI and NMI across various domain count settings for each method, with error bars representing the standard error. p-values (two-sided t-test) reflect the statistical significance of differences between hclust and K-means: ns (not significant), *p ≤ 0.05, **p ≤ 0.01, ***p ≤ 0.001, ****p ≤ 0.0001. g Stacked bar plots illustrating the domain composition across samples, with domains identified by MENDER under the five-domain setting (left), and the cell type composition within each identified domain (right). h Stacked bar plots illustrating the domain composition across samples, with domains identified by PRECAST under the five-domain setting (left), and the cell type composition within each identified domain (right)
The results indicated that clustering accuracy varied substantially with the number of domains, showing a non-linear relationship between domain count and performance, regardless of whether hclust or K-means was used (Fig. 5c, d). To evaluate the stability of each method with respect to the number of identified domains, we introduced the coefficient of variation (CV) (see Methods). By comparing the mean and CV of ARI and NMI across various domain counts, we found that domains identified by CN or MENDER, when combined with K-means clustering, achieved a good balance between accuracy and stability. In contrast, domains identified by PRECAST failed to distinguish between patient types regardless of the clustering method used (Fig. 5c, d). Further comparison of hclust and K-means revealed that K-means consistently outperformed hclust, particularly when applied to integration methods like CN or MENDER that are well-suited for slice representation (Fig. 5e, f).
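As a small worked example of the stability measure, the sketch below computes a CV over the scores obtained at different domain counts. It follows the wording of the Fig. 5 caption, which describes the CV as the ratio of the standard error to the mean; the use of the sample standard deviation (ddof=1) when deriving the standard error is an assumption, the definition in Methods takes precedence, and the ARI values are made up.

```python
import numpy as np

def coefficient_of_variation(scores):
    """Standard error of the scores divided by their mean (assumed form)."""
    scores = np.asarray(scores, dtype=float)
    # Standard error of the mean from the sample standard deviation
    sem = scores.std(ddof=1) / np.sqrt(scores.size)
    return float(sem / scores.mean())

# Hypothetical ARI values for one method at three domain-count settings
ari_by_domain_count = [0.2, 0.4, 0.6]
cv = coefficient_of_variation(ari_by_domain_count)
print(f"CV = {cv:.3f}")
```

A lower CV indicates that a method's slice-level clustering accuracy changes less as the number of identified domains is varied.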
Given prior findings that domain count strongly affects slice representation accuracy, we further investigated whether this effect stems from the biological relevance of the domains themselves. Focusing on the case where five domains were identified by MENDER and K-means yielded the best clustering results, we observed distinct differences in domain abundance across sample types. For example, domain 1, composed primarily of immune cells, was nearly absent in Cold samples and only modestly represented in Mixed samples. In contrast, it was abundant in Compartmentalized samples. Conversely, domain 2, composed mainly of tumor cells, dominated Cold samples but was less prevalent in Compartmentalized ones (Fig. 5g). This domain composition aligned well with the known biological characteristics of these sample types. However, when four domains were identified by MENDER, K-means clustering performance declined. Although domain 1 and domain 2 still represented tumor and immune cell-enriched regions, respectively, the coarse granularity limited the ability to distinguish sample types, especially Mixed and Compartmentalized samples (Additional file 2: Fig. S24).
Similarly, when five domains were identified by PRECAST, the domain compositions showed no significant differences between sample types, particularly between Mixed and Compartmentalized samples. Specifically, domain 2 occupied a high proportion in all samples and was composed of a mix of immune and tumor cells, suggesting a lack of meaningful biological separation (Fig. 5h). These findings highlight that biologically relevant domain identification is essential for effective slice representation and sample clustering based on domain abundance.
Overall, our analysis indicates that slice representation based on domains identified by CN or MENDER offers distinct advantages. K-means clustering outperforms hierarchical clustering in most scenarios, particularly when paired with biologically meaningful domain structures. The biological interpretability of identified domains, such as their cell type composition, plays a crucial role in determining the utility of domain-based slice representation. Therefore, assessing domain biological relevance is a critical prerequisite for reliable slice level analyses.
Overall performance analysis of each method across all datasets
Through our comprehensive evaluation pipeline, we extensively benchmarked state-of-the-art spatial integration methods across various data types generated by different technologies. For each task, we identified the dominant methods that users could consider. Notably, while we found robust and consistent methods in spatial alignment and slice representation, performance in multi-slice integration and spatial clustering varied more significantly. Only a few methods maintained stable performance across all datasets, with most showing sensitivity to the underlying technology. To better understand these patterns, we conducted a detailed analysis of method performance across tasks and datasets.
We first analyzed the performance of each method across different datasets for multi-slice integration and spatial clustering. To quantify this, we calculated the Spearman correlation coefficient between every pair of datasets, representing each dataset by the rank-based overall scores of all methods (see Methods). In the multi-slice integration task, datasets generated by the same technology exhibited high correlation, while datasets from different technologies showed little to no correlation (Fig. 6a). For example, 10X Visium-derived datasets, including DLPFC S1, DLPFC S2, and DLPFC S3, displayed high internal consistency despite being from different samples. Similarly, MERFISH-derived datasets, including MERFISH Preoptic, MERFISH Brain, and the Large-scale datasets, also displayed high internal consistency despite being from different sources and studies. These findings highlight how strongly integration method performance is influenced by sequencing technology, underscoring the importance of selecting technology-specific integration methods. A similar pattern emerged in the spatial clustering task (Additional file 2: Fig. S24). In addition to technology, sample origin and tissue architecture also affected method performance. For example, although DLPFC S1, S2, and S3 are all derived from the dorsolateral prefrontal cortex, DLPFC S2 differs in structural composition, resulting in performance divergence between DLPFC S2 and the other two.
Fig. 6.
Overall performance analysis of each method across all datasets. a Correlation between datasets in method performance on multi-slice integration. The heatmap shows the Spearman correlation of overall performance across datasets. Each element (i, j) represents the correlation between dataset i and dataset j. b–d The scatter plots show the rank-based overall scores for multi-slice integration and spatial clustering (b), spatial clustering and spatial alignment (c), and multi-slice integration and spatial alignment (d) across all methods and datasets. Each point represents a method’s performance on a specific dataset, with its position determined by the scores for the two tasks. Point colors indicate the corresponding methods. e The scatter plot shows the ARI for spatial clustering and slice representation on the TNBC dataset across all methods, with point colors indicating the corresponding methods. f The scatter plot shows the cell type-based ARI for spatial clustering and the ARI for slice representation on the TNBC dataset across all methods, with point colors indicating the corresponding methods.
Next, we explored the relationship between upstream and downstream task performance by calculating the Spearman correlation coefficient of the rank-based overall scores for these tasks across all datasets. The results revealed some degree of correlation between multi-slice integration and spatial clustering, though methods like Banksy and MENDER showed weaker associations (Fig. 6b, Additional file 2: Fig. S24). This suggests that, for most methods, strong integration performance is a prerequisite for accurate spatial clustering. It also implies that, when applying spatial clustering to a new dataset, the best-performing integration method can serve as a useful guide for method selection. We then examined the link between upstream results and integration-based spatial alignment, using SPACEL-based alignment outcomes for analysis. The results revealed a strong correlation between spatial clustering and alignment, but a weak correlation between integration and alignment (Fig. 6c, d, Additional file 2: Fig. S25). This indicates that accurate domain identification, rather than integration alone, is essential for integration-based spatial alignment.
Finally, we analyzed the correlation between spatial clustering and slice representation. In the slice representation evaluation, we used the TNBC dataset and assessed spatial clustering performance under various domain settings using ARI and NMI. We then computed the Spearman correlation coefficient between spatial clustering and slice representation using these indicators. The results revealed a significant correlation between spatial clustering and slice representation, reinforcing the notion that high quality upstream results are essential for reliable downstream applications (Fig. 6e, Additional file 2: Fig. S25). Furthermore, in evaluating slice representation, we found that the biological interpretability of identified domains plays a pivotal role in determining representation quality. To quantify this, we assessed the correspondence between domains and cell types using ARI and NMI and examined their correlation with slice representation performance across methods. Notably, this correlation was stronger than that observed with spatial clustering metrics, underscoring that the biological relevance of domains has a great impact on slice representation (Fig. 6f, Additional file 2: Fig. S25).
In summary, our analysis revealed that the performance of current methods in multi-slice integration and spatial clustering is often dependent on the sequencing technology used. Furthermore, the quality of upstream results substantially impacts downstream outcomes, although this dependency may vary across methods.
Discussion
In this study, we benchmarked 12 methods for spatial transcriptomics integration using 19 datasets from various spatial transcriptomics sequencing technologies. We proposed a comprehensive benchmarking framework that evaluates key aspects, including multi-slice integration, spatial clustering, spatial alignment, slice representation, and scalability of methods. This framework is designed to guide researchers in selecting the most suitable method based on their specific objectives, data types, and dataset sizes. Additionally, we systematically analyzed performance variability across different data types and examined the relationships between upstream and downstream tasks.
Our benchmarking results indicate that no single method performs optimally across all scenarios and data types. The choice of method should depend on the specific application, data types, and dataset size. For multi-slice integration, CellCharter, MENDER, and STAIG demonstrate strong performance in handling batch effects and preserving biological information, with CellCharter consistently outperforming other methods across datasets. However, MENDER and STAIG are particularly suited for specific types of data. In spatial clustering, Banksy, CellCharter, and STAligner perform well, though they sometimes produce scattered spots across domains, affecting continuity. MENDER, NicheCompass, SpaDo, and STAIG also achieve strong performance on specific datasets. For spatial alignment, PASTE, representing non-integration-based methods, and SPACEL, representing integration-based methods, are recommended, as both demonstrate stable and robust alignment performance. For slice representation, domains identified by CN and MENDER are recommended for representing slices, with K-means generally outperforming hierarchical clustering for sample clustering based on domain abundance. Additionally, running time and memory usage varied significantly between methods. GraphST, SpaDo, and STAIG, for example, are sensitive to the number of cells, limiting their application to large datasets. Conversely, Banksy, CN, CellCharter, PRECAST, and MENDER have shorter running times and lower memory requirements, making them more efficient for large-scale datasets.
By analyzing the overall performance of all methods across different tasks, we found that the effectiveness of multi-slice integration and spatial clustering is influenced by the type of data. Correlation analysis between upstream and downstream tasks revealed strong associations between multi-slice integration and spatial clustering, spatial clustering and integration-based spatial alignment, as well as spatial clustering and slice representation. These findings highlight the complexity of multi-slice analysis and emphasize the need for rigor and accuracy at each step to ensure reliable results in subsequent downstream applications.
In addition to the results presented above, our evaluation of spatial alignment and slice representation revealed a key limitation shared by most current methods: the widespread neglect of domain–domain relationships, including both physical proximity and semantic similarity. For example, in spatial alignment using SPACEL, coordinate correction assumes that mutually nearest neighbors (MNNs) across adjacent slices should belong to the same domain. However, in biological tissues, adjacent slices often differ in domain composition, meaning that even MNNs may belong to different domains. Incorporating domain–domain relationships into the alignment process, such as introducing weighted penalties for MNN pairs based on specific domain associations, could improve alignment accuracy substantially. Similarly, existing slice representation approaches typically rely only on domain abundance. Yet analysis of the TNBC dataset revealed that certain Mixed and Compartmentalized samples had similar domain compositions but markedly different spatial distributions. This underscores the potential value of incorporating physical relationships between domains into the slice representation framework, which could enhance its ability to distinguish biologically meaningful sample subtypes.
Together, these findings highlight the importance of domain–domain relationships as an untapped source of information that could strengthen downstream analyses. In future work, we aim to develop dedicated modules that leverage this information to improve the overall performance and interpretability of spatial transcriptomics analysis pipelines.
Despite these contributions, our study has some limitations. First, due to the rapid development of spatial transcriptomics, several new integration methods and sequencing technologies have emerged, but some of these were not included in our benchmark cohort due to time constraints. Second, this study primarily evaluates methods for integrating slices from similar tissues with comparable spatial domain structures. The ability of methods to integrate slices from different tissue types or species, which is crucial for studying tissue specificity, development, and inter-species differences, was not assessed. Third, our focus was on vertical integration of multiple slices, and we did not evaluate horizontal integration methods for slices from the same tissue. Future studies will address these limitations by including more diverse datasets, from different sequencing technologies, tissue types, species, and integration methods. This will enable a more detailed evaluation of methods across various contexts.
Conclusions
To summarize, we provided a systematic evaluation of 12 spatial multi-slice integration methods across a range of multi-slice analysis tasks, including multi-slice integration, spatial clustering, spatial alignment, and slice representation. Using 19 datasets generated from various spatial transcriptomics platforms, we comprehensively assessed each method’s performance through multiple quantitative metrics, revealing their relative strengths and limitations. Our analysis highlights that method performance is highly dependent on specific applications, dataset size, and sequencing technology. Additionally, we identified strong interdependencies between upstream and downstream tasks, underscoring the importance of robust integration and clustering for reliable downstream analysis.
Methods
Datasets and preprocessing
In this benchmark study, we utilized 19 datasets from various spatial transcriptomics sequencing technologies to assess the performance of multi-slice integration methods. These datasets are categorized by sequencing technology as follows: 10X Visium dataset (including 3 datasets) [36]; MERFISH dataset (comprising MERFISH Preoptic [38], MERFISH Brain [3], and the Large-scale [39] datasets); STARMap dataset [40]; BaristaSeq dataset [37]; and the TNBC dataset (based on MIBI-TOF sequencing) [6] (Additional file 1: Note S2). During data preprocessing, each dataset was filtered to remove spots without a ground truth domain label. The datasets were then stored in h5ad and RDS formats to facilitate the evaluation of different methods. Instead of normalizing each dataset individually, we applied normalization within each method to account for the different data processing schemes used by the various methods (Additional file 1: Note S3).
Benchmarking methods
This study evaluates 12 distinct methods for multi-slice integration. Based on their underlying principles, these methods are categorized into three groups: deep learning-based methods (GraphST, GraphST-PASTE, SPIRAL, and STAIG); statistical methods (Banksy, CN, MENDER, PRECAST, and SpaDo); and hybrid methods (CellCharter, NicheCompass, and STAligner). The majority of these methods were published after 2023, reflecting the latest advancements in the field.
The parameters used for executing these methods follow the settings recommended by the original authors. In cases where no specific hyperparameter settings were provided, we explored the recommended hyperparameter space for each method and selected the optimal configuration for the model (Additional file 1: Note S3). Since not all methods were evaluated across all four tasks, the tasks in which each method was evaluated are summarized in Additional file 1: Table S1.
Additionally, the parameters affecting the number of domains identified by the model were set based on the true number of domains in the test data. These parameters fall into two categories: one explicitly defines the number of domains, where the specified number should match the actual number of domains in the data. The other category influences the number of domains by adjusting the resolution setting. In this case, we included a resolution search module to find the optimal resolution, ensuring the detected domains closely match the true data (Additional file 1: Note S4).
Benchmarking metrics
In this study, we employed several metrics to assess the effectiveness of different methods across various tasks. These include bASW, dASW, iLISI, dLISI, ILL, GC, ARI, NMI, CHAOS, PAS, Accuracy, Ratio, and a rank-based overall score.
bASW and dASW
The average silhouette width (ASW) measures the degree of separation between classes by evaluating both the proximity of an object to its own class and its distance from other classes. The ASW value ranges from −1 to 1, with 1 indicating optimal class separation and −1 indicating class overlap. In this study, we used slices and domains within the same dataset as category labels. The integration embeddings of cells or spots, obtained from different methods, were used as input to calculate the corresponding ASWs: bASW (employing slices as the category label to assess batch effect removal) and dASW (employing domains as the category label to assess biological variance retention). The bASW value ranges from 0 to 1, where 1 indicates perfect mixing of cells from different slices, and 0 indicates complete separation. The dASW value also ranges from 0 to 1, where 1 indicates distinct separation of domains and 0 indicates complete overlap. Both metrics were calculated using the Python package scib-metrics (version 0.5.5).
iLISI and dLISI
The local inverse Simpson’s index (LISI) quantifies the local diversity of categorical labels within the neighborhood of each cell, offering insights into both batch mixing and biological separation. LISI is computed from integrated k-nearest neighbor (kNN) graphs, using the inverse Simpson’s index to assess how many neighboring cells can be drawn before encountering a repeated label. In this study, we used two label types, slices and domains, resulting in two LISI variants: iLISI (using slice labels to assess batch mixing) and dLISI (using domain labels to assess biological variance). The iLISI value ranges from 0 to 1, where 1 indicates perfect mixing of cells from different slices, and 0 indicates complete separation. The dLISI value also ranges from 0 to 1, where 1 indicates distinct separation of domains and 0 indicates complete overlap. Both metrics were calculated using the Python package scib-metrics (version 0.5.5).
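The core of LISI, the inverse Simpson's index over the labels in a cell's neighborhood, can be illustrated with a minimal sketch. This is not the scib-metrics implementation: it uses unweighted k nearest neighbors rather than a weighted neighborhood, and it returns raw values (between 1 and the number of distinct labels) without the rescaling to the 0 to 1 range reported in this study. The function names `inverse_simpson` and `simple_lisi` are hypothetical.

```python
import math
from collections import Counter


def inverse_simpson(labels):
    """Inverse Simpson's index: 1 / sum(p_c^2) over the label proportions p_c."""
    n = len(labels)
    counts = Counter(labels)
    return 1.0 / sum((c / n) ** 2 for c in counts.values())


def simple_lisi(embedding, labels, k=3):
    """Unweighted per-cell LISI over the k nearest neighbors (the cell itself excluded).

    Simplifying assumption: every neighbor counts equally.
    """
    scores = []
    for i, x in enumerate(embedding):
        # Sort all other cells by Euclidean distance in the embedding space.
        dists = sorted((math.dist(x, y), j) for j, y in enumerate(embedding) if j != i)
        neighbor_labels = [labels[j] for _, j in dists[:k]]
        scores.append(inverse_simpson(neighbor_labels))
    return scores
```

With slice labels, a higher raw value means that a cell's neighborhood mixes more slices; with domain labels, a value near 1 means the neighborhood is dominated by a single domain.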
ILL
The isolated label (ILL) score assesses how well integration methods preserve rare or underrepresented domain identities, referred to as isolated labels, that appear in only a few slices within a dataset. These isolated labels pose a particular challenge for integration, as they are prone to being mixed with more common labels. In our study, ILL was computed using an ASW-based approach. Specifically, we calculated the average silhouette width (ASW) between isolated and non-isolated domain labels using the integrated embedding. A higher ILL score reflects better preservation of rare domain identities. This metric was implemented using the scib-metrics package (version 0.5.5).
GC
The graph connectivity (GC) score evaluates how well cells belonging to the same domain remain connected in the integrated k-nearest neighbor (kNN) graph. The GC score quantifies the proportion of cells within each subgraph that are part of the largest connected component. The final GC score is the average across all domain labels. Values range from 0 to 1, where 1 indicates perfect connectivity, and values closer to 0 indicate fragmented or poorly integrated domain structure. This metric was calculated using the scib-metrics package (version 0.5.5).
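The GC computation described above can be sketched in plain Python as follows. This is an illustrative sketch, not the scib-metrics code: the kNN graph is built by brute force and symmetrized, and the names `knn_graph`, `largest_component_size`, and `graph_connectivity` are hypothetical.

```python
import math
from collections import deque


def knn_graph(points, k):
    """Symmetric kNN adjacency sets built from Euclidean distances."""
    adj = {i: set() for i in range(len(points))}
    for i, x in enumerate(points):
        dists = sorted((math.dist(x, y), j) for j, y in enumerate(points) if j != i)
        for _, j in dists[:k]:
            adj[i].add(j)
            adj[j].add(i)
    return adj


def largest_component_size(nodes, adj):
    """Size of the largest connected component of the subgraph induced by `nodes`."""
    nodes = set(nodes)
    seen, best = set(), 0
    for start in nodes:
        if start in seen:
            continue
        queue, size = deque([start]), 0
        seen.add(start)
        while queue:  # breadth-first search restricted to `nodes`
            u = queue.popleft()
            size += 1
            for v in adj[u]:
                if v in nodes and v not in seen:
                    seen.add(v)
                    queue.append(v)
        best = max(best, size)
    return best


def graph_connectivity(points, labels, k=2):
    """Mean over domain labels of |largest connected component| / |cells with that label|."""
    adj = knn_graph(points, k)
    scores = []
    for label in set(labels):
        members = [i for i, l in enumerate(labels) if l == label]
        scores.append(largest_component_size(members, adj) / len(members))
    return sum(scores) / len(scores)
```

A domain whose cells are split into disconnected pieces of the kNN graph lowers the average, which is what the score is meant to penalize.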
ARI
The adjusted Rand index (ARI) evaluates clustering consistency by comparing cluster labels to true labels while adjusting for chance agreement. The ARI ranges from −1 to 1, with 1 indicating perfect agreement with the true labels, 0 indicating random clustering, and negative values indicating lower accuracy than random clustering. We used ARI to assess spatial clustering by comparing the domain labels identified by the methods with the ground truth. ARI was also used to evaluate the quality of slice representation. This metric was calculated using the R package aricode (version 1.0.3).
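For readers who want to see what the ARI computes, the following is a minimal pure-Python sketch based on the standard contingency-table (pair-counting) definition. It is given for illustration only and is not the aricode implementation used in this study; the function name `adjusted_rand_index` is hypothetical.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """ARI from pair counts in the contingency table of two labelings."""
    n = len(labels_true)
    cell_counts = Counter(zip(labels_true, labels_pred))
    sum_cells = sum(comb(c, 2) for c in cell_counts.values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)  # agreement expected by chance
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:  # degenerate case, e.g. a single cluster on both sides
        return 1.0
    return (sum_cells - expected) / (max_index - expected)
```

The score is invariant to how clusters are named, so a labeling that matches the ground truth up to a permutation of label names still scores 1.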
NMI
The normalized mutual information (NMI) measures the shared information between clustering results and true labels. NMI ranges from 0 to 1, with higher values indicating that the clustering labels share more information with the true labels and thus a better clustering result. In this study, NMI was used to assess each method’s spatial clustering ability by comparing the domain labels identified by different methods with the ground truth domain labels. It was also used to evaluate slice clustering quality. The NMI was also calculated using the R package aricode (version 1.0.3).
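A minimal sketch of the NMI computation is given below for illustration. Several normalizations of mutual information are in common use; this sketch divides by the larger of the two label entropies, which is an assumption and may not match the variant applied in this study. The function names are hypothetical.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy (natural log) of a labeling."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def normalized_mutual_information(labels_true, labels_pred):
    """Mutual information divided by max(H(true), H(pred)) (assumed normalization)."""
    n = len(labels_true)
    joint = Counter(zip(labels_true, labels_pred))
    true_counts, pred_counts = Counter(labels_true), Counter(labels_pred)
    mi = sum(
        (c / n) * log((c / n) / ((true_counts[t] / n) * (pred_counts[p] / n)))
        for (t, p), c in joint.items()
    )
    normalizer = max(entropy(labels_true), entropy(labels_pred))
    # If both labelings consist of a single cluster, treat them as identical.
    return mi / normalizer if normalizer > 0 else 1.0
```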
CHAOS
CHAOS is a metric for evaluating spatial continuity. It builds a 1-nearest neighbor graph of cells or spots based on their spatial coordinates and sums the intra-class nearest-neighbor distances to assess spatial continuity. Lower values indicate greater continuity. In this study, CHAOS was calculated for each slice by setting the domain labels identified by each method as category labels, and overall performance was evaluated by considering the CHAOS values across all slices in a dataset. The formula for calculating CHAOS of a slice is:
$$\mathrm{CHAOS} = \frac{1}{N}\sum_{k=1}^{K}\sum_{i=1}^{n_k} d_{ki}$$
where $N$ is the total number of cells or spots in the slice, $K$ is the number of unique spatial domains, $n_k$ is the number of cells or spots in domain $k$, and $d_{ki}$ is the distance of cell or spot $i$ in domain $k$ to its nearest neighbor in that domain.
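The CHAOS computation described above can be sketched as follows. This is a simplified illustration rather than the code used in the benchmark: it works on raw coordinates (any standardization of coordinates is omitted), finds each cell's nearest neighbor within its own domain by brute force, and the function name `chaos` is hypothetical.

```python
import math


def chaos(coords, domains):
    """Sum of within-domain nearest-neighbor distances divided by the number of cells."""
    total = 0.0
    for i, (x, d) in enumerate(zip(coords, domains)):
        same_domain = [
            math.dist(x, y)
            for j, (y, e) in enumerate(zip(coords, domains))
            if j != i and e == d
        ]
        if same_domain:  # a domain containing a single cell contributes nothing
            total += min(same_domain)
    return total / len(coords)
```

In a toy example with four cells on a line, contiguous domains (A, A, B, B) give a lower value than interleaved domains (A, B, A, B), because same-domain neighbors are closer together when domains are spatially continuous.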
PAS
PAS measures spatial homogeneity by assessing whether neighboring cells belong to the same domain. For each cell or spot, the k nearest neighbors are identified based on spatial coordinates, and it is then determined whether more than half of these neighbors belong to a different domain. PAS values range from 0 to 1, with lower values indicating greater spatial homogeneity. In this study, PAS was calculated for each slice by setting the domain labels identified by each method as category labels with a k value of 10, and the method’s performance was evaluated by considering PAS values across all slices in a dataset. The formula for calculating PAS of a slice is:
$$\mathrm{PAS} = \frac{1}{N}\sum_{i=1}^{N} I_i$$
where $N$ is the total number of cells or spots in a slice, and $I_i$ is 1 if at least six of the ten nearest neighbors of cell or spot $i$ belong to a different domain than $i$, and 0 otherwise.
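A minimal sketch of the PAS computation is shown below, assuming a brute-force neighbor search and the "more than half of the k neighbors" criterion (at least six neighbors when k is 10). The function name `pas` is hypothetical and the code is illustrative only.

```python
import math


def pas(coords, domains, k=10):
    """Fraction of cells whose k nearest spatial neighbors mostly lie in other domains."""
    flagged = 0
    for i, x in enumerate(coords):
        dists = sorted((math.dist(x, y), j) for j, y in enumerate(coords) if j != i)
        neighbors = [j for _, j in dists[:k]]
        differing = sum(domains[j] != domains[i] for j in neighbors)
        if differing > k / 2:  # with k = 10 this means at least six neighbors
            flagged += 1
    return flagged / len(coords)
```

A single mislabeled cell surrounded by cells of another domain is flagged, whereas cells at a clean boundary between two domains usually are not.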
Accuracy
Accuracy measures the alignment consistency of paired cells or spots between adjacent slices, assuming that paired cells or spots between adjacent slices are likely to belong to the same domain. Specifically, for two adjacent slices, denoted as $S_a$ and $S_b$, we first identify, for each cell in $S_a$, its paired cell in $S_b$, defined as the nearest neighbor based on the corrected spatial coordinates. Accuracy is then calculated as the proportion of these paired cells that share the same domain label. The same process is repeated in the opposite direction, identifying paired cells in $S_a$ for each cell in $S_b$ and computing the corresponding proportion. The final accuracy score for the slice pair is reported as the average of the two directional accuracies. The accuracy value ranges from 0 to 1, with higher values indicating better alignment between slices. For data with multiple slices, the accuracy is calculated for each pair of adjacent slices. The formula for calculating accuracy for $S_a$ and $S_b$ is:
$$\mathrm{Accuracy}(S_a, S_b) = \frac{1}{2}\left(\frac{1}{N_a}\sum_{i \in S_a} \mathbb{I}\left[l_i = l_{\mathrm{NN}_b(i)}\right] + \frac{1}{N_b}\sum_{j \in S_b} \mathbb{I}\left[l_j = l_{\mathrm{NN}_a(j)}\right]\right)$$
where $N_a$ and $N_b$ are the numbers of cells or spots in $S_a$ and $S_b$, and $i$ and $j$ represent individual cells or spots in $S_a$ and $S_b$. $l_i$ denotes the domain label assigned to cell or spot $i$. $\mathrm{NN}_b(i)$ represents the nearest cell in $S_b$, based on the corrected spatial coordinates, corresponding to cell or spot $i$, and $\mathrm{NN}_a(j)$ is defined analogously for cell or spot $j$. The indicator function $\mathbb{I}[\cdot]$ returns 1 if the condition inside is true, and 0 otherwise.
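The two-directional accuracy described above can be sketched as follows. The sketch assumes that the coordinates passed in have already been corrected by an alignment method, uses a brute-force nearest-neighbor search, and uses hypothetical function names.

```python
import math


def directional_accuracy(coords_a, labels_a, coords_b, labels_b):
    """Share of cells in slice A whose nearest cell in slice B has the same domain label."""
    hits = 0
    for x, label in zip(coords_a, labels_a):
        nearest = min(range(len(coords_b)), key=lambda j: math.dist(x, coords_b[j]))
        hits += labels_b[nearest] == label
    return hits / len(coords_a)


def alignment_accuracy(coords_a, labels_a, coords_b, labels_b):
    """Average of the A-to-B and B-to-A directional accuracies."""
    forward = directional_accuracy(coords_a, labels_a, coords_b, labels_b)
    backward = directional_accuracy(coords_b, labels_b, coords_a, labels_a)
    return 0.5 * (forward + backward)
```

Averaging both directions keeps the score symmetric in the two slices, which matters when adjacent slices contain different numbers of cells.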
Ratio
Ratio assesses the performance of spatial alignment based on the assumption that the ideal ratio of paired cells between adjacent slices should be close to 1:1. As for Accuracy, the paired cells are identified based on the corrected spatial coordinates, and the proportion of paired cells among all cells is then calculated. The ratio metric ranges from 0 to infinity, with lower values indicating better alignment between slices. The formula for calculating the ratio is:
where is the number of paired cells or spots between and , and and are the number of cells or spots in and .
Rank-based overall score
To comprehensively evaluate the performance of each method across different tasks (multi-slice integration, spatial clustering, and spatial alignment), we calculated a rank-based overall score. This score is derived by ranking the methods according to each metric and averaging the rank values. For example, in the evaluation of multi-slice integration, dASW, dLISI, ILL, bASW, iLISI, and GC were used to assess each method. First, for each dataset, we converted the metrics into rank values (from worst to best) by ranking the methods based on their performance. The rank-based overall score for each method on a given dataset was then defined as the mean of the rank values across the six metrics. A similar approach was used to calculate the rank-based overall score for other tasks. For spatial clustering, given that accurate domain identification is a prerequisite for a meaningful hierarchical domain structure, we assigned different weights when calculating the rank-based overall score: ARI and NMI were weighted more heavily than CHAOS and PAS, with a ratio of 4:4:1:1.
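The rank-then-average procedure, including the 4:4:1:1 weighting for spatial clustering, can be sketched as follows. This is a sketch under stated assumptions: ties share the average rank, metrics for which lower is better (such as CHAOS and PAS) are ranked in reverse so that rank 1 is always the worst method, and the function names and any example values are hypothetical.

```python
def rank_values(scores, higher_is_better=True):
    """Rank methods from worst (1) to best (n); tied methods share the average rank."""
    ordered = sorted(scores, key=scores.get, reverse=not higher_is_better)
    ranks, i = {}, 0
    while i < len(ordered):
        j = i
        while j + 1 < len(ordered) and scores[ordered[j + 1]] == scores[ordered[i]]:
            j += 1
        for method in ordered[i:j + 1]:
            ranks[method] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def overall_score(metric_table, weights=None, lower_is_better=()):
    """Weighted mean of per-metric ranks for each method on one dataset.

    metric_table maps metric name -> {method name: metric value}.
    """
    weights = weights or {metric: 1.0 for metric in metric_table}
    total_weight = sum(weights[metric] for metric in metric_table)
    methods = next(iter(metric_table.values())).keys()
    result = {method: 0.0 for method in methods}
    for metric, scores in metric_table.items():
        ranks = rank_values(scores, higher_is_better=metric not in lower_is_better)
        for method, rank in ranks.items():
            result[method] += weights[metric] * rank / total_weight
    return result
```

For spatial clustering, the call would pass `weights={"ARI": 4, "NMI": 4, "CHAOS": 1, "PAS": 1}` and `lower_is_better=("CHAOS", "PAS")`; for tasks without weighting, the default gives a plain mean of ranks.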
In the overall performance analysis, we used the rank-based overall scores of each method to calculate the Spearman correlation coefficient across all datasets. Specifically, each dataset was represented by a vector of the rank-based overall scores of the methods, and the correlations between these vectors were then calculated. Additionally, during the benchmarking process, we computed the mean overall score across all datasets generated by the same technology, which served as the technology-level overall score, providing a clear summary of each method’s performance on each task.
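The dataset-by-dataset correlation described above can be sketched as follows: each dataset is a vector of per-method overall scores, and the Spearman coefficient is the Pearson correlation of the ranked vectors. The sketch restricts each comparison to methods scored on every dataset, which is an assumption; all function names are hypothetical.

```python
def average_ranks(values):
    """Ranks starting at 1, with ties assigned the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks, i = [0.0] * len(values), 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for idx in order[i:j + 1]:
            ranks[idx] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation as the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def dataset_correlation(scores_by_dataset):
    """Pairwise Spearman correlation between datasets over their shared methods."""
    names = list(scores_by_dataset)
    shared = sorted(set.intersection(*(set(s) for s in scores_by_dataset.values())))
    vectors = {d: [scores_by_dataset[d][m] for m in shared] for d in names}
    return {(a, b): spearman(vectors[a], vectors[b]) for a in names for b in names}
```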
CV
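In the slice representation task, to evaluate the stability of each method with respect to the number of identified domains, we calculated the coefficient of variation (CV) for ARI and NMI. For a given method, the CV of ARI is calculated as follows:
$$\mathrm{CV}_{\mathrm{ARI}} = \frac{\sigma_{\mathrm{ARI}}}{\mu_{\mathrm{ARI}}}, \qquad \mu_{\mathrm{ARI}} = \frac{1}{n}\sum_{i=1}^{n}\mathrm{ARI}_i$$
where $n$ is the number of experiments conducted to test the influence of domain count, $\mathrm{ARI}_i$ is the ARI value for the $i$-th experiment, and $\sigma_{\mathrm{ARI}}$ is the standard deviation of the $\mathrm{ARI}_i$ values. The CV of NMI is calculated using the same formula, substituting ARI with NMI.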
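As a small worked illustration, the CV of a series of ARI (or NMI) values obtained at different domain counts can be computed as below. The sketch uses the population standard deviation; whether the population or the sample standard deviation applies here is an assumption, and `statistics.stdev` can be substituted for the latter. The function name is hypothetical.

```python
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """Standard deviation divided by the mean (population standard deviation assumed)."""
    return pstdev(values) / mean(values)
```

A lower CV means that a method's slice clustering accuracy changes less as the number of identified domains is varied.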
Spatial alignment methods
In this study, we selected PASTE and STalign as representatives of non-integration-based spatial alignment methods. Both were run following their original tutorials. Specifically, PASTE performs pairwise alignment between slices, whereas STalign requires one slice to be designated as a reference. In our analyses, we used the last slice in each series as the reference. It is worth noting that the original STalign workflow involves a user-provided pre-rotation step to ensure approximate overlap between slices. However, this step introduces subjectivity, especially in multi-slice alignment where it is difficult to accurately estimate angular differences between slices and the reference. Moreover, other software tools do not require such pre-adjustments. To ensure fairness across methods, we deliberately omitted this step when applying STalign in our benchmarking.
To evaluate the performance of different methods in integration-based spatial alignment, we used the STAligner and SPACEL frameworks to correct coordinates between slices based on the integration embeddings and identified domains from each method. For STAligner, the integration embeddings and identified domains generated by each method were provided to STAligner. The mutual nearest neighbors (MNNs) were then calculated based on these embeddings, and spatial coordinate correction was performed between multiple slices using the specified domain. While the landmark domain identified by each method may differ, all domains correspond to the same ground truth, allowing for a fair comparison of the methods’ ability to perform coordinate correction (Additional file 1: Note S5). Similarly, for SPACEL, the domains identified by each method and the original spatial coordinates were provided as input, since SPACEL identifies MNNs based on spatial coordinates and leverages domain information to perform spatial coordinate correction.
Slice representation
To evaluate the performance of domains identified by different methods for slice representation, each slice is characterized by its domain composition, with the assumption that similar slices should have similar compositions of spatial domains. Then, slices are clustered based on these characteristics. The effectiveness of each method for multiple slice clustering is measured by the degree of agreement between the resulting clusters and the true labels.
Initially, we apply each method to perform multi-slice integration and identify domains. For each slice $s$, its domain composition is defined as follows:
$$C_s = \left(\frac{n_{s,1}}{N_s}, \frac{n_{s,2}}{N_s}, \ldots, \frac{n_{s,K}}{N_s}\right)$$
where $N_s$ represents the number of cells or spots in $s$, $K$ represents the number of domains in $s$, and $n_{s,k}$ refers to the number of cells or spots belonging to domain $k$ in $s$. If a domain is absent from a slice, its composition is set to zero.
Subsequently, we apply two clustering methods, hierarchical clustering (hclust) and K-means, to group slices based on the domain abundance-based slice representation. For hclust, the Euclidean distance between slices based on their domain composition is calculated using the R function dist(). Clustering is performed using the hclust() function, and the cutree() function is used to output the appropriate number of clusters based on the number of slices. For K-means, clustering is conducted using the R function kmeans() with the specified number of clusters.
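The domain abundance-based slice representation and the Euclidean distances that feed the clustering step can be sketched as follows. The sketch is written in Python for illustration, whereas the clustering in this study used the R functions named above; it assumes a shared, ordered list of domain labels across slices, and the function names are hypothetical.

```python
import math
from collections import Counter


def domain_composition(slice_domains, all_domains):
    """Fraction of a slice's cells in each domain; domains absent from the slice get 0."""
    counts = Counter(slice_domains)
    n = len(slice_domains)
    return [counts.get(domain, 0) / n for domain in all_domains]


def euclidean_distance_matrix(vectors):
    """Pairwise Euclidean distances between slice composition vectors."""
    return [[math.dist(u, v) for v in vectors] for u in vectors]
```

The resulting distance matrix is the kind of input that hierarchical clustering consumes, while the composition vectors themselves can be passed directly to K-means.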
Supplementary Information
Additional file 1: Supplementary Notes (Notes S1–S5 and Table S1).
Additional file 2: Supplementary Figures (Figs. S1–S26).
Acknowledgements
Q.L. acknowledges the support by National Natural Science Foundation of China (Grant No. T2425019, 32341008), the National Key Research and Development Program of China (Grant No. 2025YFC3409300), Shanghai Pilot Program for Basic Research, Shanghai Science and Technology Innovation Action Plan-Key Specialization in Computational Biology, Shanghai Shuguang Scholars Project, Shanghai Excellent Academic Leader Project, Shanghai Municipal Science and Technology Major Project (Grant No. 2021SHZDZX0100) and Fundamental Research Funds for the Central Universities; Z.Y. acknowledges the support by National Natural Science Foundation of China (62303119, 32470706), Shanghai Science and Technology Development Funds (23YF1403000), Chenguang Program of Shanghai Education Development Foundation and Shanghai Municipal Education Commission (22CGA02), Shanghai Science and Technology Commission Program (23JS1410100), and National Key R&D Program of China (2023YFF1204800). Y.G. acknowledges the support by National Natural Science Foundation of China (Grant No. 324B2013).
Peer review information
Claudia Feng was the primary editor of this article and managed its editorial process and peer review in collaboration with the rest of the editorial team. The peer-review history is available in the online version of this article.
Code availability
Python and R were used for data preprocessing, format conversion, and evaluation metric calculation. The relevant code is available at “https://github.com/bm2-lab/iSTBench” [41]. R scripts for result analysis and visualization can be accessed through “https://github.com/bm2-lab/iSTBench/tree/main/Analysis.” Our benchmarking workflow is provided as a reproducible pipeline at “https://github.com/bm2-lab/iSTBench/tree/main/Benchmark.” We also uploaded the relevant code to Zenodo which is available at https://zenodo.org/records/17089390 [44].
Authors’ contributions
Q.L. and ZY.Y. conceived and designed the study. ZY.Y., Q.L., KJ.D., and YC.G. designed the pipeline and collected the methods and datasets. KJ.D. performed the benchmarking analysis. KJ.D., Q.Z., Y.C., CY.H., SL.L., ZK.W., C.T., XJ.C., FLZ.M., XH.C., SG.W., X.J., JY.Y., C.Z. and GH.C. analyzed the results. ZY.Y. and KJ.D. generated the figures. ZY.Y., KJ.D., and YC.G. designed and implemented the new algorithms. Q.L., ZY.Y., KJ.D., and YC.G. wrote the manuscript and designed the figures. All authors approved the manuscript.
Data availability
The 10X Visium dataset [36] was downloaded from http://research.libd.org/spatialLIBD/ with manual annotations [45]. The MERFISH data comprise three distinct datasets, referred to as MERFISH Preoptic, MERFISH Brain, and Large-scale. MERFISH Preoptic was originally published by Moffitt et al. [38], and five of its slices were annotated with region labels by Li et al. [16]. We downloaded the annotated data from the SDMBench website (http://sdmbench.drai.cn), corresponding to data IDs 25–29 [46]. MERFISH Brain [3] can be accessed from https://cellxgene.cziscience.com/collections/31937775-0602-4e52-a799-b6acdd2bac2e with manual annotations [47]. The Large-scale dataset [39] was obtained from https://doi.brainimagelibrary.org/doi/10.35077/g.21 with manual annotations [48]. The BaristaSeq dataset [37], annotated and published as part of the SpaceTx project, was also downloaded via the SDMBench website, corresponding to data IDs 22–24 [49]. The STARMap datasets, originally published by Wang et al. [40] and annotated by Li et al. [16], were similarly obtained from SDMBench, corresponding to data IDs 31–33 [50]. The TNBC dataset [6] was obtained from https://mibi-share.ionpath.com, including both manual annotations and clinical metadata [51]. All of these datasets have also been uploaded to Zenodo (https://doi.org/10.5281/zenodo.14906156) [52].
Declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Kejing Dong and Yicheng Gao contributed equally to this work.
Contributor Information
Zhiyuan Yuan, Email: zhiyuan@fudan.edu.cn.
Qi Liu, Email: qiliu@tongji.edu.cn.
References
- 1. Moses L, Pachter L. Museum of spatial transcriptomics. Nat Methods. 2022. 10.1038/s41592-022-01409-2.
- 2. Zhang B, et al. A human embryonic limb cell atlas resolved in space and time. Nature. 2024;635:668–78. 10.1038/s41586-023-06806-x.
- 3. Allen WE, Blosser TR, Sullivan ZA, Dulac C, Zhuang X. Molecular and spatial signatures of mouse brain aging at single-cell resolution. Cell. 2023;186:194–208.e118. 10.1016/j.cell.2022.12.010.
- 4. Khaliq AM, et al. Spatial transcriptomic analysis of primary and metastatic pancreatic cancers highlights tumor microenvironmental heterogeneity. Nat Genet. 2024;56:2455–65. 10.1038/s41588-024-01914-4.
- 5. Longo SK, Guo MG, Ji AL, Khavari PA. Integrating single-cell and spatial transcriptomics to elucidate intercellular tissue dynamics. Nat Rev Genet. 2021;22:627–44. 10.1038/s41576-021-00370-8.
- 6. Keren L, et al. A structured tumor-immune microenvironment in triple negative breast cancer revealed by multiplexed ion beam imaging. Cell. 2018;174:1373–1387.e1319. 10.1016/j.cell.2018.08.039.
- 7. Schurch CM, et al. Coordinated cellular neighborhoods orchestrate antitumoral immunity at the colorectal cancer invasive front. Cell. 2020;183:838. 10.1016/j.cell.2020.10.021.
- 8. Shiao SL, et al. Single-cell and spatial profiling identify three response trajectories to pembrolizumab and radiation therapy in triple negative breast cancer. Cancer Cell. 2024;42:70–84.e78. 10.1016/j.ccell.2023.12.012.
- 9. Long Y, et al. Spatially informed clustering, integration, and deconvolution of spatial transcriptomics with GraphST. Nat Commun. 2023;14:1155. 10.1038/s41467-023-36796-3.
- 10. Singhal V, et al. Banksy unifies cell typing and tissue domain segmentation for scalable spatial omics data analysis. Nat Genet. 2024. 10.1038/s41588-024-01664-3.
- 11. Guo T, et al. Spiral: integrating and aligning spatially resolved transcriptomics data across different experiments, conditions, and technologies. Genome Biol. 2023;24:241. 10.1186/s13059-023-03078-6.
- 12. Yang Y, et al. STAIG: spatial transcriptomics analysis via image-aided graph contrastive learning for domain exploration and alignment-free integration. Nat Commun. 2025;16:1067. 10.1038/s41467-025-56276-0.
- 13. Yuan Z. MENDER: fast and scalable tissue structure identification in spatial omics data. Nat Commun. 2024;15:207. 10.1038/s41467-023-44367-9.
- 14. Liu W, et al. Probabilistic embedding, clustering, and alignment for integrating spatial transcriptomics data with PRECAST. Nat Commun. 2023;14:296. 10.1038/s41467-023-35947-w.
- 15. Duan B, Chen S, Cheng X, Liu Q. Multi-slice spatial transcriptome domain analysis with SpaDo. Genome Biol. 2024;25:73. 10.1186/s13059-024-03213-x.
- 16. Li Z, Zhou X. BASS: multi-scale and multi-sample analysis enables accurate cell type clustering and spatial domain detection in spatial transcriptomic studies. Genome Biol. 2022;23:168. 10.1186/s13059-022-02734-7.
- 17. Varrone M, Tavernari D, Santamaria-Martinez A, Walsh LA, Ciriello G. Cell charter reveals spatial cell niches associated with tissue remodeling and cell plasticity. Nat Genet. 2024;56:74–84. 10.1038/s41588-023-01588-4.
- 18. Birk S, et al. Quantitative characterization of cell niches in spatially resolved omics data. Nat Genet. 2025;57:897–909. 10.1038/s41588-025-02120-6.
- 19. Zhou X, Dong K, Zhang S. Integrating spatial transcriptomics data across different conditions, technologies and developmental stages. Nat Comput Sci. 2023;3:894–906. 10.1038/s43588-023-00528-w.
- 20. Park HE, et al. Spatial transcriptomics: technical aspects of recent developments and their applications in neuroscience and cancer research. Adv Sci (Weinh). 2023;10:e2206939. 10.1002/advs.202206939.
- 21. Jain S, Eadon MT. Spatial transcriptomics in health and disease. Nat Rev Nephrol. 2024;20:659–71. 10.1038/s41581-024-00841-1.
- 22. Hu Y, et al. Unsupervised and supervised discovery of tissue cellular neighborhoods from cell phenotypes. Nat Methods. 2024;21:267–78. 10.1038/s41592-023-02124-2.
- 23. Kim J, et al. Unsupervised discovery of tissue architecture in multiplexed imaging. Nat Methods. 2022. 10.1038/s41592-022-01657-2.
- 24. Xu H, et al. SPACEL: deep learning-based characterization of spatial transcriptome architectures. Nat Commun. 2023;14:7603. 10.1038/s41467-023-43220-3.
- 25. Li H, et al. Santo: a coarse-to-fine alignment and stitching method for spatial omics. Nat Commun. 2024;15:6048. 10.1038/s41467-024-50308-x.
- 26. Zeira R, Land M, Strzalkowski A, Raphael BJ. Alignment and integration of spatial transcriptomics data. Nat Methods. 2022;19:567–75. 10.1038/s41592-022-01459-6.
- 27. Liu X, Zeira R, Raphael BJ. Partial alignment of multislice spatially resolved transcriptomics data. Genome Res. 2023;33:1124–32. 10.1101/gr.277670.123.
- 28. Clifton K, et al. Stalign: alignment of spatial transcriptomics data using diffeomorphic metric mapping. Nat Commun. 2023;14:8123. 10.1038/s41467-023-43915-7.
- 29. Jones A, Townes FW, Li D, Engelhardt BE. Alignment of spatial genomics data using deep Gaussian processes. Nat Methods. 2023;20:1379–87. 10.1038/s41592-023-01972-2.
- 30. You Y, et al. Systematic comparison of sequencing-based spatial transcriptomic methods. Nat Methods. 2024. 10.1038/s41592-024-02325-3.
- 31. Carstens JL, et al. Spatial multiplexing and omics. Nat Rev Methods Primers. 2024. 10.1038/s43586-024-00330-6.
- 32. Yuan Z, et al. Benchmarking spatial clustering methods with spatially resolved transcriptomics data. Nat Methods. 2024. 10.1038/s41592-024-02215-8.
- 33. Hu Y, et al. Benchmarking clustering, alignment, and integration methods for spatial transcriptomics. Genome Biol. 2024;25:212. 10.1186/s13059-024-03361-0.
- 34. Luecken MD, et al. Benchmarking atlas-level data integration in single-cell genomics. Nat Methods. 2022;19:41–50. 10.1038/s41592-021-01336-8.
- 35. Li B, et al. Benchmarking spatial and single-cell transcriptomics integration methods for transcript distribution prediction and cell type deconvolution. Nat Methods. 2022. 10.1038/s41592-022-01480-9.
- 36. Maynard KR, et al. Transcriptome-scale spatial gene expression in the human dorsolateral prefrontal cortex. Nat Neurosci. 2021;24:425–36. 10.1038/s41593-020-00787-0.
- 37. Brian Long JM. The SpaceTx Consortium. SpaceTx: a roadmap for benchmarking spatial transcriptomics exploration of the brain. 2023. Preprint at 10.48550/arXiv.2301.08436.
- 38. Moffitt JR, et al. Molecular, spatial, and functional single-cell profiling of the hypothalamic preoptic region. Science. 2018. 10.1126/science.aau5324.
- 39. Zhang M, et al. Spatially resolved cell atlas of the mouse primary motor cortex by MERFISH. Nature. 2021;598:137–43. 10.1038/s41586-021-03705-x.
- 40. Wang X, et al. Three-dimensional intact-tissue sequencing of single-cell transcriptional states. Science. 2018;361. 10.1126/science.aat5691.
- 41. Dong K, Gao Y. Code for iSTBench: Benchmarking multi-slice integration and downstream applications in spatial transcriptomics data analysis. GitHub. 2025. https://github.com/bm2-lab/iSTBench.
- 42. Shang L, Zhou X. Spatially aware dimension reduction for spatial transcriptomics. Nat Commun. 2022;13:7203. 10.1038/s41467-022-34879-1.
- 43. Tang Z, et al. Search and match across spatial omics samples at single-cell resolution. Nat Methods. 2024;21:1818–29. 10.1038/s41592-024-02410-7.
- 44. Dong K, Gao Y. Code for iSTBench: benchmarking multi-slice integration and downstream applications in spatial transcriptomics data analysis. Zenodo. 2025. 10.5281/zenodo.17089390.
- 45. Maynard KR, et al. spatialLIBD for hosting dorsolateral prefrontal cortex 10x Visium dataset. Datasets. spatialLIBD. 2021. http://research.libd.org/spatialLIBD.
- 46. Moffitt JR, et al. Molecular, spatial, and functional single-cell profiling of the hypothalamic preoptic region. Datasets. SDMBench. 2024. http://sdmbench.drai.cn.
- 47. Allen WE, Blosser TR, Sullivan ZA, Dulac C, Zhuang X. Molecular and spatial signatures of mouse brain aging at single-cell resolution. Datasets. CZ CELLxGENE. 2023. https://cellxgene.cziscience.com/collections/31937775-0602-4e52-a799-b6acdd2bac2e.
- 48. Zhuang X, Zhang MA. Molecularly defined and spatially resolved cell atlas of the mouse primary motor cortex. Datasets. Brain Image Library. 2020. 10.35077/g.21.
- 49. Brian Long JM. The SpaceTx Consortium. SpaceTx: a roadmap for benchmarking spatial transcriptomics exploration of the brain. Datasets. SDMBench. 2024. http://sdmbench.drai.cn.
- 50. Wang X, et al. Three-dimensional intact-tissue sequencing of single-cell transcriptional states. Datasets. SDMBench. 2024. http://sdmbench.drai.cn.
- 51. Keren L, et al. A structured tumor-immune microenvironment in triple negative breast cancer revealed by multiplexed ion beam imaging. Datasets. IONPATH. 2018. https://mibi-share.ionpath.com.
- 52. Dong K, Gao Y. Benchmarking multi-slice integration and downstream applications in spatial transcriptomics data analysis. Datasets. Zenodo. 2025. 10.5281/zenodo.14906156.
Associated Data
Supplementary Materials
Additional file 1: Supplementary Notes (Notes S1–S5 and Table S1).
Additional file 2: Supplementary Figures (Figs. S1–S26).