Abstract
Recent advances in spatial transcriptomics (ST) have significantly deepened our understanding of biology. A primary focus in ST analysis is to identify spatially variable genes (SVGs), which are crucial for downstream tasks like spatial domain detection. Spatial domains reflect underlying tissue architecture and distinct biological processes. Traditional methods often use a set number of top SVGs for this purpose, but embedding these SVGs simultaneously can confound unrelated spatial signals and dilute weaker patterns, obscuring latent structure. Instead, grouping SVGs and deriving low-dimensional embeddings within each group preserves specific patterns, reduces signal mixing, and enhances the detection of diverse structures. Furthermore, classifying SVGs is akin to identifying cell-type marker genes, offering valuable biological insights. The challenge lies in accurately categorizing SVGs into relevant clusters, aggravated by the absence of prior knowledge regarding the number of clusters and their spatial gene patterns. Here, we propose SPACE, a framework that classifies SVGs based on their spatial patterns by adjusting for shared cell-type confounding effects, to improve spatial domain detection. This method does not require prior knowledge of gene cluster numbers, spatial patterns, or cell type information. Both simulation and real data analyses demonstrate that SPACE is an efficient and promising tool for ST analysis.
Graphical Abstract
Introduction
Recent advancements in spatially-resolved transcriptomics (SRT) technology have revolutionized our ability to acquire comprehensive gene expression data for thousands of genes across tissue locations in multiple samples. The number of genes and spatial resolution vary depending on the specific technology employed. However, regardless of the technology and resolution, spatial transcriptomic data facilitate the exploration of various biological questions.
Often a fundamental initial step in the analysis of SRT data involves identifying spatially variable genes (SVGs). These genes exhibit expression level variations either across the entire tissue or within predefined spatial domains. In recent years, there has been an abundance of research and the development of new methods to address the challenge of detecting SVGs [1, 2]. Although the detection of SVGs lets us visualize the spatial patterns in the tissue which might offer some level of biological insights about the tissue of interest, the main use of SVGs lies in downstream analysis, specifically for spatial domain detection. Spatial domains are distinct and functionally specialized anatomical structures within tissue, each distinguished by unique local characteristics including cell-type composition, transcriptome heterogeneity, and cell–cell interactions [3–5]. Detecting these domains is crucial for understanding their collaborative role in tissue functions and development stages. To achieve this, a set number of top SVGs is typically selected, and spatial domains are identified using these top SVGs [1, 2].
However, using an arbitrary number of top SVGs might not represent all the spatial patterns exhibited by the SVGs. As previously argued [6], some dominant patterns may overshadow less pronounced yet relevant patterns. Previous methods, like SpatialDE [7], SPARK [8], and Sepal [6], attempted to classify SVGs into groups with similar spatial patterns, aiding in a more holistic representation of results. It is important to highlight that classifying SVGs is a challenging task requiring specialized methods. Simple clustering approaches are inadequate as they overlook spatial information [7]. However, in existing methods, challenges arise regarding the selection of an unknown number of spatial pattern groups and the selection of other parameters, as well as the unclear impact of classification on downstream analysis. Here, we propose an efficient method, SPACE, to classify SVGs into clusters and explore the benefits of this clustering step in the final goal of spatial domain detection.
The concept of clustering SVGs carries a significant biological rationale. Researchers have already identified many SVGs as the markers for specific cell types [9–12], yet detecting these cell type-specific SVGs (ctSVGs) presents a formidable challenge. As distinct spatial patterns are associated with distinct cell type compositions, it follows that different ctSVGs would manifest distinct spatial patterns. Consequently, clustering SVGs with similar patterns can be viewed as a means of segregating distinct cell type-specific SVGs. As reviewed by Yan et al. (2025) [13], three ctSVG detection methods, CTSV [14], spVC [15], and CSIDE [16], were previously available. These methods treat spatial coordinates as fixed effects when estimating cell type-specific spatial effects, so they are not invariant to spatial rotation (see [17] for a detailed evaluation). Since tissue sections can be placed in random orientation, the significance of detected ctSVGs may vary with rotation, which is a critical limitation. More recently, Celina [18] and STANCE [19] have been proposed for ctSVG detection, both of which address rotation-invariance. However, these methods rely on accurate spatial deconvolution results, assuming that cell-type proportions are well-estimated. This assumption cannot always be guaranteed, as deconvolution performance depends on the availability of sufficient reference data and the specific deconvolution method used. Here we introduce SPACE, a method that bypasses the need for ctSVG detection while identifying SVG clusters that capture cell type-relevant information to enhance spatial domain detection.
SPACE is a Gaussian process-based method that initially identifies SVGs and then establishes a dependency map among these SVGs using an intuitive approach (see Materials and Methods). This map links each SVG with other SVGs exhibiting similar spatial patterns, ultimately clustering similarly expressed SVGs together (see Fig. 1A). The resulting SVG-clusters from the SPACE algorithm can serve as inputs for further downstream analysis.
Figure 1.
(A) Schematic overview of the SPACE framework. (B–E) An example simulation based on a synthetic dataset: (B) The synthetic dataset is generated based on a real dataset with annotated spatial domains. Three types of domain-specific SVGs and noise genes are created. (C) The distribution of the number of each type of gene is presented. (D) Evaluation based on ARI (higher is better) and PAS score (lower is better). (E) Visualization by the t-SNE plot, shown for a randomly selected simulation result. The gene clusters identified by SPACE are highlighted with distinct colors. The plot illustrates that genes within the clusters are densely grouped together and separated from genes belonging to other clusters.
For performing downstream analysis, we employ a well-known dimension reduction technique tailored for spatial data, SpatialPCA [20], to derive low-dimensional embeddings specific to each SVG-cluster. These embeddings are subsequently utilized for spatial domain detection. In each example considered in this study, whether through simulation setups or real data analysis, we compare our findings with those obtained from the SpatialPCA framework, which generates low-dimensional embeddings for all top SVGs collectively and subsequently performs the same spatial domain detection step. This comparison aims to elucidate the advantages of the gene clustering step facilitated by SPACE.
To evaluate the performance of SPACE, we conducted a comparison using a synthetic dataset derived from real human DLPFC cortex data, with known annotations of its 5 layers (4 prefrontal cortex layers and the white matter). See Supplementary Fig. S2 for more details on the synthetic data generation. Fig. 1B illustrates the domain-based SVG clusters within the synthetic dataset: cluster 1 represents genes predominantly expressed in layer 1, cluster 2 includes genes from cortex layers 2, 3, or 4, while cluster 3 comprises genes overexpressed in the white matter region. The distribution of the number of genes across clusters is uneven in this scenario as shown in Fig. 1C, as is frequently evident in real datasets. Upon repeating the analysis on 10 simulated synthetic datasets, and considering ARI scores (Adjusted Rand Index, higher the better) and PAS scores (Percentage of abnormal spots, lower the better, see ‘Materials and Methods’) (see Fig. 1D), we observe that our framework provides better domain detection results compared to the SpatialPCA framework. Additionally, Fig. 1E displays a t-SNE plot [21] representing genes from a randomly chosen simulation outcome. The genes are colour-coded according to the gene cluster label identified by SPACE. This visualization demonstrates that genes within each cluster are packed together, indicating the accuracy of gene classification by SPACE.
In this paper, we conducted two types of simulation studies: firstly, to assess the accuracy of the SPACE framework for SVG classification, and secondly, to evaluate the accuracy of domain detection based on the detected SVG clusters by SPACE using synthetic datasets mimicking real-world scenarios. In addition, we analyzed three publicly available datasets and one newly acquired pancreatic ductal adenocarcinoma (PDAC) SRT dataset. The publicly available datasets include: (1) the DLPFC human cortex annotated dataset [22], comprising 12 samples; (2) the HER2 human breast tumor annotated dataset [23] (used one sample); and (3) the dataset from the study of human breast cancer biopsies [24]. The newly acquired dataset is the PDAC dataset, which comes with rudimentary annotation. The domains detected after applying SPACE align well with this rough annotation. Overall, findings from simulation studies and real data analyses across multiple datasets affirm that utilizing SVG clusters and, consequently, the SPACE framework can markedly enhance the performance of spatial domain detection. This approach holds significant promise to unveil hidden spatial patterns that provide novel insights into the spatial heterogeneity of tissue samples.
Materials and methods
In a typical spatial transcriptomics setup, the dataset comprises gene expression measures or counts for m genes distributed across N known spatial coordinates or spots. Suppose y = (y1, y2, ..., yN) represents the gene expression profiles or counts for a particular gene across spatial coordinates (referred to as samples or spots), denoted by s = (s1, ..., sN). The spatial locations are typically represented as two-dimensional coordinates, i.e., si = (si1, si2), although coordinates of any dimensionality can be employed. The primary objective of SVG detection models is to identify which genes, among the m genes, exhibit spatial variability across the tissue. In essence, the key goal is to determine whether the gene expression measure y is dependent on or related to the spatial locations where the gene expression measures are sampled.
Consider a scenario where there are kd spatial domains within the tissue of interest. A spatial domain represents a distinct region or area within the tissue characterized by unique molecular signatures or gene expression patterns. These patterns may arise from specific factors such as cellular composition, anatomical organization, spatial arrangement, or functional attributes.
In reality, the specific spatial domains are typically unknown. However, each domain may be defined with a set of SVGs exhibiting characteristic gene expression patterns in proximity to these domains. To accurately reconstruct the underlying domain structure, it is beneficial to group SVGs based on their spatial patterns and utilize all SVG groups in downstream analysis, rather than relying solely on an arbitrary subset of all SVGs.
With this objective in mind, we present the SPACE framework, organized into two primary steps. The first step involves the selection of SVGs, while the second step utilizes the SVGs selected in the first step to generate an SVG dependency list and subsequently create clusters of SVGs based on this information (see Fig. 1A). It is worth noting that there exists a plethora of SVG detection techniques in the literature [1, 2], and any of these methods can substitute for the first step in the framework. However, methods that rigorously control the false discovery rate (FDR) are preferable, as they enhance accuracy in the subsequent stages of analysis.
Step 1: Selecting SVGs
Here we adopt a model-based method for SVG detection focusing on the Gaussian process (GP) regression model which models the normalized gene expression y for a given gene assuming the following multivariate normal model [7]:
$$ y \sim \mathrm{MVN}\left(X\beta,\; \tau K + \delta I\right) \tag{1} $$

where the covariance term is decomposed into a spatial and a non-spatial part, with δI and τK representing the non-spatial and spatial covariance matrices, respectively. The (i, j)th element in the kernel matrix K denotes the spatial similarity between the ith and jth spot calculated based on the corresponding coordinates si and sj. The choice of the kernel function plays a very important role in detecting the spatial correlation present in the gene expressions. X (of dimension N × k) represents the covariate matrix, while β (of dimension k × 1) denotes the vector of corresponding coefficients. This model can incorporate up to k − 1 covariates, such as cell type information or domain structure information. However, often such information is either unavailable or deemed untrustworthy. Hence, in practice, we typically employ X solely as the intercept. In model (1), testing whether a gene is spatially variable is equivalent to testing H0: τ = 0.
Within this framework, we use a straightforward score test to test the underlying hypothesis, and a p-value is calculated. More information regarding the test is provided in the supplementary material. Prior studies [8, 25] indicate that Gaussian and Cosine kernels are adept at capturing a wide spectrum of distinct spatial gene expression patterns. Hence, we utilize 10 different kernels (5 Cosine and 5 Gaussian kernels with varying parameter values) following the approach established by SPARK [8]. Denote the number of detected SVGs by m1 in step 1. We note that while many existing techniques can be used for SVG detection, this is not the primary objective of our method. Therefore, we did not compare this step with other SVG detection approaches.
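As an illustration of the kernel construction in step 1, the following minimal sketch builds a set of Gaussian and Cosine kernel matrices from spot coordinates. The functional forms (a squared-exponential kernel and a periodic cosine kernel) are the commonly used ones and are assumed here rather than taken from the text; the function names and the length-scale and period values are hypothetical, chosen only for illustration.

```python
import math

def pairwise_distances(coords):
    """Euclidean distance between every pair of spots."""
    return [[math.dist(a, b) for b in coords] for a in coords]

def gaussian_kernel(coords, length_scale):
    """Assumed form: K_ij = exp(-d_ij^2 / (2 * length_scale^2))."""
    dists = pairwise_distances(coords)
    return [[math.exp(-d * d / (2.0 * length_scale ** 2)) for d in row]
            for row in dists]

def cosine_kernel(coords, period):
    """Assumed form: K_ij = cos(2 * pi * d_ij / period)."""
    dists = pairwise_distances(coords)
    return [[math.cos(2.0 * math.pi * d / period) for d in row]
            for row in dists]

def build_kernels(coords, length_scales, periods):
    """One Gaussian kernel per length scale, one Cosine kernel per period."""
    kernels = [gaussian_kernel(coords, ls) for ls in length_scales]
    kernels += [cosine_kernel(coords, p) for p in periods]
    return kernels

# Hypothetical parameter grids: 5 Gaussian + 5 Cosine kernels.
coords = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
kernels = build_kernels(coords, [0.5, 1, 2, 4, 8], [1, 2, 4, 8, 16])
```

Each kernel would then be tested separately under model (1), with the resulting p-values combined across kernels; the combination rule is not shown here.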
Step 2: Classifying SVGs by spatial patterns
This stage categorizes the SVGs identified in step 1 into cohesive groups. While conventional clustering algorithms can segregate genes into distinct groups based on gene expression, they invariably disregard spatial information. Hence, we require more sophisticated algorithms to classify SVGs more precisely [7].
Intuitively, if two SVGs y1 and y2 exhibit similar spatial patterns, they should be correlated, which is equivalent to assuming y1 = α0 + α1y2 + ε, where ε is a random noise term with arbitrary variance and α0, α1 are constants. Hence, while testing gene y1, if we use gene y2 as a covariate in model (1) and test the same alternative hypothesis described in step 1, we would expect gene y1 to be less or even not significant, depending on how strong the correlation between y1 and y2 is. On the other hand, if after using another gene y3 as a covariate in the model we still find gene y1 to be significant, that would imply that gene y1 and gene y3 have different spatial patterns. Given that many SVGs are cell type marker genes [9–12], we would expect gene y1 and y2 to belong to the same cell type while gene y3 belongs to a different cell type. However, our algorithm does not require such knowledge (often not available in spot-level SRT data) as detailed below.
With this intuition, for each SVG j, j = 1, ⋅⋅⋅, m1, we can select a list of genes Sj which are correlated with gene j and use them as covariates in model (1). If the number of genes in Sj exceeds 3, we choose the top kj principal components and include them as covariates in the model. Typically, kj is chosen such that at least 80% of the total variance is explained by the top kj principal components. With this, model (1) becomes:
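$$ y_j = \beta_0 + \sum_{l=1}^{k_j} \beta_l \, PC_l + b(s) + \varepsilon(s) \tag{2} $$

where PCl represents the lth principal component of the genes in Sj, b(s) ∼ MVN(0, τK) denotes the spatial effect with kernel matrix K and spatial variance component τ, and ε(s) ∼ N(0, δI). More details about selecting the gene list Sj are available in Section 3 of the supplementary material.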
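The covariate construction described above can be sketched as follows. This is a minimal illustration using an ordinary SVD-based PCA in NumPy; the function names are hypothetical, and using the raw genes directly when Sj contains three genes or fewer is our reading of the rule stated in the text rather than a confirmed implementation detail.

```python
import numpy as np

def top_pcs_for_covariates(expr, var_threshold=0.80):
    """expr: (N spots x g genes) expression of the genes in S_j.

    Returns the N x k_j matrix of PC scores, where k_j is the smallest
    number of components whose cumulative explained variance reaches
    var_threshold.
    """
    centered = expr - expr.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    explained = s ** 2 / np.sum(s ** 2)
    # First index at which the cumulative variance reaches the threshold.
    k = int(np.searchsorted(np.cumsum(explained), var_threshold)) + 1
    k = min(k, len(s))
    return u[:, :k] * s[:k]

def build_design_matrix(expr, max_raw_genes=3, var_threshold=0.80):
    """Intercept plus either the raw genes (small S_j) or their top PCs."""
    n_spots, n_genes = expr.shape
    if n_genes <= max_raw_genes:
        covariates = expr
    else:
        covariates = top_pcs_for_covariates(expr, var_threshold)
    return np.hstack([np.ones((n_spots, 1)), covariates])

# Toy example: five genes driven by one shared factor plus small noise.
rng = np.random.default_rng(0)
factor = rng.normal(size=(50, 1))
expr = factor @ np.ones((1, 5)) + 0.01 * rng.normal(size=(50, 5))
X = build_design_matrix(expr)
```

In this toy example a single component captures almost all of the variance, so the design matrix has only an intercept column and one PC column.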
In real applications, many SVGs represent marker genes for cell types, exhibiting spatial expression patterns that reflect the distribution of various cell types. Given that marker genes for a given cell type often exhibit similar or strongly correlated gene expression distributions, employing this model adeptly controls for the cell-type specific effect, a factor typically challenging to ascertain or quantify in real-world settings as the cell type information is typically unknown in spot-level SRT data.
Employing this model on all the genes, we can get a set of genes with unique patterns (showing significance under model (2)) and a gene dependency list. From the list, we come up with a weighted graph structure, where nodes are SVGs and two nodes are connected if they are correlated. By applying a clustering algorithm such as Leiden [26], groups of genes are determined and the unique genes (which are not part of any gene group) create singleton sets. The full algorithm steps are provided in the supplementary file.
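To make the grouping step concrete, the sketch below turns a gene dependency list into gene groups and singleton sets. Note that it deliberately swaps in connected components (via union-find) for the Leiden algorithm and ignores edge weights, so it only illustrates the graph construction and the handling of unique genes, not the community detection actually used. The function name and the input layout (a dictionary mapping each SVG to its correlated genes) are assumptions.

```python
def cluster_svgs(svgs, dependency):
    """Group SVGs that are linked in the dependency list.

    svgs: list of gene names detected in step 1.
    dependency: dict mapping a gene to the genes it is correlated with.
    Returns a list of gene groups; unlinked genes form singleton sets.
    """
    parent = {g: g for g in svgs}

    def find(g):
        # Path-halving lookup of the group representative.
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g

    # Connect two nodes whenever one gene lists the other as correlated.
    for gene, partners in dependency.items():
        for other in partners:
            if gene in parent and other in parent:
                parent[find(gene)] = find(other)

    groups = {}
    for g in svgs:
        groups.setdefault(find(g), []).append(g)
    # Largest groups first, ties broken by the first gene name.
    return sorted(groups.values(), key=lambda c: (-len(c), c[0]))

svgs = ["g1", "g2", "g3", "g4", "g5"]
dependency = {"g1": ["g2"], "g2": ["g1", "g3"], "g4": []}
clusters = cluster_svgs(svgs, dependency)
```

Connected components will merge groups that are joined by even a single weak edge, which is one reason a modularity-based method such as Leiden is preferable on real dependency graphs.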
Downstream analysis: spatial domain detection
Spatially resolved transcriptomics serves a crucial role in identifying tissue or region substructures through domain detection analysis. Numerous frameworks have been developed for this purpose, to name a few, SpaGCN [27], SpatialPCA [20] and BayesSpace [28]. In our analysis, we opted for SpatialPCA [20] due to its proven superiority in performance over other available algorithms. Furthermore, to conduct domain detection post identification of SVG clusters using our framework, we require an effective low-dimensional representation of the dataset, a task efficiently facilitated by SpatialPCA. SpatialPCA effectively extracts a low-dimensional representation of spatial transcriptomic data while preserving both biological signals and spatial correlation structures. This condensed representation can serve as an input for efficient clustering algorithms, such as the Louvain [29] or Walktrap algorithm [30], facilitating the clustering of spots and thereby identifying spatial domains. The steps for obtaining spatial domains in the SpatialPCA workflow include: (1) select the top 3000 SVGs and calculate spatial PCs based on these genes and (2) use the top 20–30 spatial PCs for spatial clustering using algorithms like Louvain or Walktrap.
In our approach, we utilize our framework to identify SVG groups with similar spatial patterns. Within each SVG group, we compute the spatial PCs and aggregate the top spatial PCs from each group to form the final embedding. These aggregated spatial PCs are then used as input in the clustering algorithm to detect spatial domains. To ensure the results are comparable with SpatialPCA, we controlled the total number of aggregated spatial PCs by selecting five spatial PCs from each SVG cluster. The differences between our method and the existing ones are that existing methods use low-dimensional embeddings based on all top SVGs, while our method gets low-dimensional embeddings within each cluster. The within-cluster embeddings can capture unique spatial structures and hence lead to improved spatial domain detection.
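The aggregation of within-cluster embeddings can be sketched as below. SpatialPCA is an R package, so this illustration substitutes ordinary PCA for the spatial PCs; it shows only how five components per SVG cluster are concatenated into one embedding. The function names and the column-index representation of gene clusters are hypothetical.

```python
import numpy as np

def top_pcs(expr, n_pcs):
    """Ordinary PCA scores (a stand-in for spatial PCs)."""
    centered = expr - expr.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    n = min(n_pcs, len(s))
    return u[:, :n] * s[:n]

def aggregate_cluster_embeddings(expr, gene_clusters, pcs_per_cluster=5):
    """Concatenate the top PCs computed within each SVG cluster.

    expr: (N spots x m genes) normalized expression matrix.
    gene_clusters: list of lists of column indices, one list per cluster.
    Returns an N x (pcs_per_cluster * number of clusters) embedding,
    provided every cluster has at least pcs_per_cluster genes.
    """
    blocks = [top_pcs(expr[:, cols], pcs_per_cluster) for cols in gene_clusters]
    return np.hstack(blocks)

rng = np.random.default_rng(1)
expr = rng.normal(size=(30, 20))
gene_clusters = [list(range(0, 8)), list(range(8, 14)), list(range(14, 20))]
embedding = aggregate_cluster_embeddings(expr, gene_clusters)
```

With three clusters and five components each, the embedding has 15 columns, which is then passed to the spot-level clustering algorithm.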
Measuring accuracy of domain detection
Detecting domains essentially involves assigning a cluster label to each of the N spots in the tissue sample. Once a framework is implemented for detecting spatial domains, it becomes crucial to measure its accuracy against the ground truth domain labels. We primarily employ a standard clustering evaluation metric, the Adjusted Rand Index (ARI), to assess the similarity between the predicted domain labels and the true labels. Additionally, we utilize the percentage of abnormal spots (PAS) score to quantify the clustering performance of spatial domain detection, following the approach outlined in [20]. This score gauges the randomness of spots located outside their clustered spatial domain computed as the proportion of spots with a cluster label differing from at least six of their ten neighboring spots. A lower PAS score reflects greater homogeneity within spatial clusters.
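Both metrics can be computed with a few lines of code. The sketch below is a plain-Python illustration, assuming labels are given as lists and neighbors are determined by Euclidean distance between spot coordinates; the function names are ours, and the brute-force neighbor search is only suitable for small examples.

```python
import math
from collections import Counter

def adjusted_rand_index(true_labels, pred_labels):
    """ARI computed from the contingency table of the two labelings."""
    total = math.comb(len(true_labels), 2)
    if total == 0:
        return 1.0
    cells = Counter(zip(true_labels, pred_labels))
    sum_cells = sum(math.comb(c, 2) for c in cells.values())
    sum_rows = sum(math.comb(c, 2) for c in Counter(true_labels).values())
    sum_cols = sum(math.comb(c, 2) for c in Counter(pred_labels).values())
    expected = sum_rows * sum_cols / total
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:
        return 1.0
    return (sum_cells - expected) / (max_index - expected)

def pas_score(coords, labels, n_neighbors=10, min_different=6):
    """Proportion of spots whose label differs from at least
    min_different of their n_neighbors nearest spots."""
    abnormal = 0
    for i, (p, lab) in enumerate(zip(coords, labels)):
        others = sorted(
            (math.dist(p, q), j) for j, q in enumerate(coords) if j != i
        )
        nearest = [j for _, j in others[:n_neighbors]]
        if sum(labels[j] != lab for j in nearest) >= min_different:
            abnormal += 1
    return abnormal / len(labels)

# Twelve spots on a line; one isolated spot carries a different label.
coords = [(float(i), 0.0) for i in range(12)]
labels = ["A"] * 12
labels[5] = "B"
pas = pas_score(coords, labels)
```

In the toy example only the isolated "B" spot disagrees with most of its neighbors, so exactly one of the twelve spots is counted as abnormal.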
Results
Simulation Study
We conducted two simulation studies: one to demonstrate the effectiveness of the SVG clustering performance and another to assess whether the SVG clusters identified by SPACE are beneficial for enhancing domain detection accuracy.
Evaluation of the clustering performance
To illustrate SPACE’s capability to accurately classify SVGs, we devise the simulation scenario I in which 100 normalized gene expression datasets were simulated, each consisting of 53 genes (labeled g1-g53) and 2000 spots following model (1) with no covariates. The first 10 genes (g1-g10) represent independent noise genes, not correlated with any other gene in the dataset. The next set of 10 genes (g11-g20) exhibits strong correlations among themselves but does not show any spatial pattern. The third (g21-g30), fourth (g31-g40), and fifth (g41-g50) sets of genes represent SVGs with three distinct spatial patterns (Spatial Pattern 1, Spatial Pattern 2, and Spatial Pattern 3, respectively). Fig. 2A exhibits six representative genes, each with a distinct spatial pattern. Genes are correlated within spatial patterns 1-3 as exhibited in Fig. 2B, and the spatial effect strength increases with the gene index. For example, g21 and g30 are correlated and share the same spatial pattern, but the spatial effect is stronger in g30 compared to g21. The correlation between the genes within a gene group could have different structures. They could display compound symmetry (CS) if any pair of genes within a group has the same correlation, i.e., ρij = ρ. Alternatively, they could demonstrate a first-order autoregressive (AR(1)) pattern if the correlation between two genes decays as the distance between their indices increases, i.e., ρij = ρ^|i − j|. Fig. 2B exhibits the CS correlation structure. The last three genes, g51, g52, and g53, each exhibit a unique spatial pattern (Spatial Pattern 4, Spatial Pattern 5, and Spatial Pattern 6, respectively). As the spatial pattern strength increases within each spatial gene group (patterns 1-3), the SVG detection power converges towards 1, as expected (see Fig. 2C).
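The two within-group correlation structures can be written down directly. The short sketch below constructs both matrices for a gene group; the function names and the value ρ = 0.5 are illustrative choices, not the settings used in the simulation.

```python
def compound_symmetry(n_genes, rho):
    """CS structure: every pair of distinct genes has correlation rho."""
    return [[1.0 if i == j else rho for j in range(n_genes)]
            for i in range(n_genes)]

def ar1(n_genes, rho):
    """AR(1) structure: correlation rho ** |i - j| decays with index gap."""
    return [[rho ** abs(i - j) for j in range(n_genes)]
            for i in range(n_genes)]

cs_matrix = compound_symmetry(3, 0.5)
ar1_matrix = ar1(3, 0.5)
```

Under CS the first and third genes remain as correlated as adjacent genes (0.5), whereas under AR(1) their correlation drops to 0.25.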
Furthermore, upon comparing this outcome with the performance of the most efficient comparable SVG detection method, nnSVG [31] (where both nnSVG and SPACE use normalized gene expressions), we observed that SPACE detects spatial genes slightly more effectively (see Supplementary Fig. S4). The SPACE framework not only demonstrates efficacy in identifying the true SVGs (g21-g53) but also adeptly categorizes them. The Adjusted Rand Index (ARI) scores, computed for each simulated dataset from the cluster labels of the detected SVGs and their true cluster labels, concentrate near 1 in the violin plot in Fig. 2D, indicating high accuracy of gene clustering. The FDRs were also monitored in this study. Here, a false discovery occurs when any of the non-spatial genes (g1-g20) appears in any of the final gene clusters. FDR is calculated as the number of false discoveries divided by the total number of SVG discoveries. As indicated in Fig. 2E, the FDR values are predominantly distributed near 0, indicating high accuracy of the results.
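The empirical FDR defined above amounts to a simple ratio per simulated dataset. A minimal sketch, assuming the final gene clusters are available as lists of gene names (the function name and the convention of returning 0 when nothing is discovered are our own choices):

```python
def empirical_fdr(gene_clusters, non_spatial_genes):
    """False discoveries divided by the total number of SVG discoveries.

    gene_clusters: final gene clusters (lists of gene names), singletons included.
    non_spatial_genes: genes simulated without any spatial pattern.
    """
    discovered = {g for cluster in gene_clusters for g in cluster}
    if not discovered:
        return 0.0  # convention used here when nothing is discovered
    false_discoveries = discovered & set(non_spatial_genes)
    return len(false_discoveries) / len(discovered)

non_spatial = [f"g{i}" for i in range(1, 21)]  # g1-g20
fdr = empirical_fdr([["g21", "g22", "g11"], ["g51"]], non_spatial)
```

Here one non-spatial gene (g11) appears among four discoveries, giving an FDR of 0.25 for this hypothetical result.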
Figure 2.
Simulation setting for SPACE. (A) Six representative genes each with a distinct spatial pattern in the simulated dataset. Green and Red color represents low and high gene expression. (B) Correlation heatmap of the simulated dataset with compound symmetry correlation structure within gene groups. Independent: uncorrelated gene group for genes without any spatial pattern; Correlated: correlated gene group for genes without any spatial pattern. Patterns 1–3: correlated gene group for genes with spatial patterns 1–3. Patterns 4–6: single gene with spatial patterns 4–6. (C) Empirical power of the SVG detection step at detecting SVGs for the simulated datasets with compound symmetry (Left) and AR(1) (Right) correlation structure within each gene group. (D) The distribution of ARI values based on predicted gene clusters for simulation under each correlation structure. (E) Empirical FDR distribution for simulation under each correlation structure. Here, false discovery occurs when Correlated or Independent genes show up in any of the final gene clusters.
Evaluation of the spatial domain detection performance
To showcase how the outputs of the SPACE algorithm aid in the downstream analysis of spatial domain detection and enhance its accuracy, we generated synthetic datasets based on the annotated human DLPFC data with sample ID 151670. Supplementary Fig. S2 details the steps involved in generating the synthetic data, ensuring that key features of the original dataset are preserved, such as the distribution of means and variances of all genes. The dataset is annotated with 5 layers (4 prefrontal cortex layers and the white matter layer). After filtering out sparse genes, we ended up with 4865 genes whose expressions were measured in 3484 spots. We first randomly selected 2000 genes which were converted to SVGs in the generated dataset (see Supplementary Fig. S2). The rest of the genes were converted to random noise genes with no specific pattern. Among the 2000 SVGs, three distinct spatial domain structures were represented (see Fig. 1B and C): 800 SVGs correspond to the first cluster, wherein genes are predominantly expressed in cortex layer 1; 150 SVGs exhibit overexpression in cortex layers 2, 3, or 4; and the remaining 1050 SVGs from cluster 3 predominantly display expression in the white matter domain region. Across 10 such simulated synthetic datasets, we applied our framework, SPACE, to identify gene clusters that were further leveraged for domain detection. We compared the domain detection results of SPACE with the default one by SpatialPCA without an additional clustering step. As we mentioned in the introduction, our framework significantly improves the spatial domain detection accuracy, as evidenced by ARI scores and PAS scores (see Fig. 1D). Additionally, the t-SNE plot [21] of the genes from a randomly chosen simulation result displays the accuracy of the gene clustering performance of SPACE. The t-SNE plots for all 10 simulation results are given in Supplementary Fig. S3.
Real data analysis
Human DLPFC 10x Genomics Visium dataset
We applied the SPACE algorithm to the human dorsolateral prefrontal cortex (DLPFC) data [22] generated by Visium from 10x Genomics. Publicly available datasets from 12 human DLPFC tissue samples, obtained from three individuals, can be accessed and downloaded from the link: http://spatial.libd.org/spatialLIBD/. We directly downloaded the processed datasets from the SpatialPCA Github repository, available at: https://github.com/shangll123/SpatialPCA_analysis_codes. These samples, on average, encompassed 3973 spots, each manually annotated to one of the six prefrontal cortex layers or white matter. We illustrated the method focusing on two samples with ID 151670 and 151673, which contain expression measurements of 33,538 genes across 3498 spots and 33,538 genes across 3639 spots, respectively. We also analysed the other 10 samples and the results are available in the supplementary files.
We applied our framework to detect the spatial domains (see ‘Materials and methods’) for each of the samples. We also followed the SpatialPCA framework provided in https://github.com/shangll123/SpatialPCA_analysis_codes without separating genes into clusters to detect spatial domains to compare with our results. The comparison is based on the ARI scores, utilizing the provided annotations for each sample. We also compared the PAS scores for all the samples. In Fig. 3A, sample 151670 is presented with annotated spatial domains as the ground truth (left), spatial domains detected by SpatialPCA (middle), and spatial domains detected by our framework (right). The ARI values for the detected domains by SpatialPCA and our framework (0.34 and 0.69, respectively, as shown in Fig. 3E), along with the PAS scores (0.013722 and 0.006289, respectively, depicted in Fig. 3F), highlight the superior performance of our framework over SpatialPCA.
Figure 3.
Spatial domain detection analysis results of the Human cortex data from DLPFC: (A) The detected spatial domains for Sample 151670: Annotated domain used as the ground truth (left), by SpatialPCA (middle), and by our framework (right). (B) The t-SNE plot for all the SVGs detected by SPACE for sample 151670. SVGs are colored based on cluster labels calculated by SPACE. (C) The spatial domains detected for sample 151673: Annotated domain used as the ground truth (left), by SpatialPCA (middle), and by our framework (right). (D) The t-SNE plot for all the SVGs detected by SPACE for sample 151673. The fact that SVGs with similar colors are close to each other but are away from those with different colors indicates accurate gene clustering. (E) Comparison of ARI score (higher the better) between SpatialPCA and our framework based on these two samples. (F) Comparison of PAS score (lower the better) between SpatialPCA and our framework based on these two samples. (G) Comparison of ARI score between SpatialPCA and our framework based on all 12 samples. (H) Comparison of PAS score between SpatialPCA and our framework based on all 12 samples.
Similarly, in Fig. 3C, the spatial domains for sample 151673 are displayed in the same order: annotated domains, spatial domains by SpatialPCA, and spatial domains by our framework. The ARIs for SpatialPCA and our framework are 0.58 and 0.65, respectively, and the PAS scores are 0.028579 and 0.023083, respectively. The performance of both methods across all samples is shown in Supplementary Fig. S6. Combining the results from all the samples, our framework significantly enhances spatial domain detection compared to SpatialPCA (see Fig. 3G and H) as confirmed by the increase in the median ARI score. Additionally, we have included a benchmarking table (Supplementary Table S1) that compares the performance of SPACE with existing methods such as SpaGCN [27], stLearn [32], BayesSpace [28], and BASS [33]. The results clearly demonstrate that SPACE consistently outperforms any of these methods in spatial domain detection accuracy. Representative genes from the clusters in sample 151673 are depicted in Supplementary Fig. S5.
We visualized the SVG clusters identified by SPACE through the t-SNE plots in Fig. 3B and D for the two samples. These visualizations highlight the resemblance among genes within the same cluster and the disparity between genes from separate clusters, emphasizing the efficacy of SPACE in delineating SVG clusters. The t-SNE plots for all 12 samples are shown in Supplementary Fig. S7.
Analysis of HER2 breast tumor data
We applied SPACE to another dataset, the HER2-positive breast tumor data [23], initially comprising 36 tumor datasets from eight individuals (patients A-H), each consisting of three or six sections. Following SpatialPCA analysis, we selected the H1 dataset, encompassing 15,030 genes measured across 613 spatial locations. We utilized the pre-processed dataset available in the SpatialPCA repository, containing 10,053 genes across 607 spots, and omitted datasets from other samples due to sparse gene expressions or minimal spot coverage. SPACE identified 268 SVGs grouped into three main clusters, along with a few unique pattern genes. The three main gene clusters consist of 84, 48, and 117 genes, respectively. In the t-SNE plot of the SVGs (see Fig. 4B), genes within the same cluster lie in close proximity to each other. Comparing spatial domains detected by SpatialPCA and SPACE based on the ARI value, our framework shows better performance (ARI = 0.48) than SpatialPCA (ARI = 0.44), thereby improving domain detection accuracy (see Fig. 4C).
Figure 4.
Analysis of the HER2 data. (A) Annotated spatial layers considered as ground truth, showcasing six known tissue components including cancer-related spots in blue shades. (B) The t-SNE plot illustrating SVGs detected by SPACE, with distinct colors representing distinct SVG clusters. (C) Spatial domains detected by SPACE and SpatialPCA, with respective ARI scores of 0.48 and 0.44.
Analysis of the PDAC data
Our final analysis was conducted on a PDAC dataset, obtained from the Henry Ford Health System (Institutional Review Board approval is maintained for 16,150 at Henry Ford Hospital for The Translational and Clinical Research Center Biorepository). This dataset comprises gene expression measurements for 17,943 genes across 3142 spots. Following standard filtering and normalization procedures, we applied our SPACE framework to identify SVG clusters and detect spatial domains. Three primary SVG clusters were identified, each showcasing differential expressions in distinct tissue regions. Representative genes from these clusters are depicted in Supplementary Fig. S8.
For each SVG cluster, we extracted low-dimensional embeddings (top SpatialPCs from SpatialPCA), combined them, and applied the Leiden algorithm to identify spatial domains. We repeated this process for the SpatialPCA framework, utilizing the top 20 Spatial PCs from the top 3000 SVGs. Due to the absence of spot annotations, the exact number of spatial domains is unknown. Therefore, we employed the same Leiden algorithm with a resolution parameter set at the default value of 1 for the SpatialPCA framework, rather than using algorithms typically utilized by SpatialPCA that require a predetermined number of domains.
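The combination step described above can be sketched as follows. This is a minimal illustration under two assumptions that the text does not spell out: "combined" is read as column-wise concatenation of the per-cluster spot embeddings, and the graph passed to community detection is a k-nearest-neighbour graph on the combined embedding. The helper names `combine_embeddings` and `knn_edges` and the toy coordinates are hypothetical; the Leiden step itself is not reproduced here.

```python
from math import dist


def combine_embeddings(per_cluster_embeddings):
    """Concatenate per-cluster embeddings column-wise, one row per spot."""
    n_spots = len(per_cluster_embeddings[0])
    if any(len(emb) != n_spots for emb in per_cluster_embeddings):
        raise ValueError("every embedding must cover the same spots")
    return [
        [value for emb in per_cluster_embeddings for value in emb[i]]
        for i in range(n_spots)
    ]


def knn_edges(points, k):
    """Undirected k-nearest-neighbour edges (Euclidean) as sorted index pairs."""
    edges = set()
    for i, p in enumerate(points):
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: dist(p, points[j]),
        )
        for j in others[:k]:
            edges.add((min(i, j), max(i, j)))
    return sorted(edges)


# Toy input: 4 spots, two SVG clusters with 2 and 1 embedding dimensions.
cluster_a = [[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 5.0]]
cluster_b = [[1.0], [1.1], [9.0], [9.2]]

combined = combine_embeddings([cluster_a, cluster_b])  # 4 spots x 3 dimensions
edges = knn_edges(combined, k=1)
print(combined[0], edges)
```

The resulting edge list is the kind of input a community detection routine such as Leiden would then partition into spatial domains.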
The dataset includes a rough annotation (see Fig. 5A) highlighting the tumor (marked in red) and non-tumor (marked in yellow) regions of interest. Figs 5B and C display the domains predicted by the SpatialPCA and SPACE frameworks, respectively. As precise annotation labels for each spot are unavailable, we cannot compute scores like ARI to compare spatial domain detection accuracy between the two methods. However, through visual inspection, the domains detected by SPACE appear more accurate, effectively capturing the most important regions. Tumor-containing regions identified by SPACE are notably smaller and more accurate compared to those identified by SpatialPCA.
Figure 5.
Analysis of the PDAC data: (A) Rough annotation of the tissue depicting different important regions. Tumor regions are highlighted in red, while non-tumor yet significant regions are marked in yellow. (B) Spatial domains detected by SpatialPCA, utilizing the calculation of spatial PCs and employing the Leiden algorithm for clustering without presupposing the number of clusters. (C) Spatial domain detection by SPACE, involving the aggregation of SVG-cluster-specific spatial PCs and utilizing the Leiden algorithm for clustering without presupposing the number of clusters.
The SPACE algorithm identified three primary clusters of SVGs (comprising 3931, 2751, and 1781 genes, respectively) in the PDAC dataset, each providing significant biological insights. Supplementary Fig. S8 demonstrates that genes in Cluster 1 are overexpressed mostly in the tumor-adjacent regions, while genes in Clusters 2 and 3 are overexpressed around tumor regions. Pathway enrichment analysis, shown in Fig. 6, reveals that Cluster 1 includes pathways highly suggestive of changes in the metabolic state of the cells. Changes in the metabolic state of the tumor and tumor-adjacent regions are a well-established feature of PDAC [34, 35]. Cluster 3 is uniquely enriched for inflammatory pathways specific to T cell signaling, a feature not observed in Cluster 1. These data suggest there is a specific spatial distribution of immune cells (T and NK cells specifically) within tumor-adjacent regions, in agreement with previous work [36]. Immune suppression is a key aspect of PDAC [37], and our pipeline may lead to new understandings of the mechanisms driving this aspect of the disease in the future. Cluster 2 displayed a mixed result, with pathways identified in both inflammatory signaling and cell-to-cell communication (i.e. tight junctions). As PDAC is highly desmoplastic, this result may reflect the abundant cancer-associated fibroblasts (CAFs) present in this cancer type that signal to tumor cells and promote or restrict tumor growth depending on the subtype of CAF present. We also plotted the average gene expression in each cluster across spatial locations (Supplementary Fig. S9) and observed cluster-specific spatial expression patterns.
Figure 6.
Enrichment analysis of the PDAC data: KEGG Pathway enrichment analysis of genes from (A) Cluster 1, (B) Cluster 2, and (C) Cluster 3 are showcased.
On the other hand, a similar enrichment analysis on the top 3000 SVGs (see Supplementary Fig. S10(A)) or on the group of all SVGs (see Supplementary Fig. S10(B)) shows that these gene sets are enriched in a mixture of different pathways, which does not clearly indicate the functional dependency between spatial gene patterns and their biological mechanisms. The unique enrichment of Cluster 3 SVGs in Fig. 6C for inflammatory pathways specific to T-cell signaling would be impossible to discern if all SVGs or the top 3000 SVGs were selected. This demonstrates that SVG clusters carry more biological information than the top genes alone and suggests that they might be a better option for downstream analysis.
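To illustrate why a mixed gene set can blur a pathway signal that a focused cluster shows clearly, the sketch below computes an over-representation P-value with the upper tail of the hypergeometric distribution. This is a common choice for pathway enrichment and is an assumption here; the exact test behind Fig. 6 is not stated in this section. The function name `enrichment_p_value` and all counts are hypothetical.

```python
from math import comb


def enrichment_p_value(n_background, n_pathway, n_selected, n_overlap):
    """Upper-tail hypergeometric probability P(X >= n_overlap).

    n_background: genes in the background universe
    n_pathway:    background genes annotated to the pathway
    n_selected:   genes in the tested set (e.g. one SVG cluster)
    n_overlap:    tested genes that are annotated to the pathway
    """
    upper = min(n_pathway, n_selected)
    total = comb(n_background, n_selected)
    tail = sum(
        comb(n_pathway, k) * comb(n_background - n_pathway, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical counts: the same 30 pathway genes found in a focused
# cluster of 200 genes versus a mixed set of 3000 genes.
focused = enrichment_p_value(10000, 100, 200, 30)
mixed = enrichment_p_value(10000, 100, 3000, 30)
print(focused, mixed)
```

With these toy counts the focused set yields a far smaller P-value than the mixed set, because 30 pathway genes among 3000 selected genes is exactly what chance alone would produce.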
We also analyzed another human breast cancer ST dataset and the details can be found in Section 4 in the supplemental file, together with the plots of representative genes in each cluster (Supplementary Fig. S1).
Discussion
In recent years, the exploration and analysis of spatial data have reached unprecedented heights, offering diverse insights into biological systems. Central to this endeavor is the identification of SVGs, which serve as pivotal components in understanding tissue organization and function. However, merely detecting SVGs does not inherently yield substantial biological insights. Rather, their significance lies in their dual role: (1) SVGs are used for spatial domain detection and (2) some SVGs serve as markers for specific cell types. Traditional approaches often struggle to achieve precise spatial domain detection, and discerning cell type-specific SVGs amidst the data noise poses a formidable task. Our proposed framework, SPACE, addresses these challenges to improve spatial domain detection without requiring cell type information.
By implementing SPACE, we effectively detect SVG clusters that can be interpreted as clusters of cell-type SVGs. Remarkably, this identification is accomplished without the need for complex cell-type deconvolution techniques, streamlining the analysis process while providing biologically meaningful insights. Through extensive real data analysis and simulation studies, we have demonstrated the efficacy of SPACE in enhancing spatial domain detection accuracy, validating its utility in spatial transcriptomic analysis.
The implementation of SPACE involves two sequential steps, with the initial phase focusing on SVG detection. This step can be performed using any existing SVG detection technique, and the technique can be swapped without changing any SPACE code. The next step focuses on finding gene clusters, which uses the Leiden community detection algorithm [26] (details in the Supplementary Material). The detailed step-by-step code and results are provided in our GitHub repository: https://github.com/wangjr03/SPACE.
All the results presented in this paper utilize the default marginal correlation test to select related genes in step 2 of the SPACE algorithm (see the full workflow in supplementary materials). However, there might be alternative ways of selecting related genes. It could be done by choosing a threshold based on a correlation measure, either linear or nonlinear, such as Pearson correlation, Spearman correlation, distance correlation or kernel correlation. Alternatively, one can run a penalized regression with LASSO [38], Elastic net [39, 40] or MCP penalty [41] or perform sure independence screening [42] to select genes correlated with the gene of interest. Identifying the optimal method and determining an appropriate threshold could greatly improve the method’s performance and is an area worth exploring further.
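The threshold-based alternative mentioned above can be sketched as follows. This is an illustrative example only, not the default marginal correlation test used by SPACE: it selects genes whose absolute Pearson or Spearman correlation with a gene of interest meets a user-chosen threshold. The function names, the toy expression values, and the threshold of 0.95 are all assumptions made for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either vector is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation as the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def select_related_genes(target, expression, threshold, corr=pearson):
    """Genes whose |correlation| with the target gene is >= threshold."""
    reference = expression[target]
    return [
        gene
        for gene, values in expression.items()
        if gene != target and abs(corr(reference, values)) >= threshold
    ]


# Toy expression profiles across five spots (hypothetical values).
expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.0],
    "geneB": [2.1, 3.9, 6.2, 8.0, 9.9],      # nearly linear in geneA
    "geneC": [1.0, 8.0, 27.0, 64.0, 125.0],  # monotone but nonlinear
    "geneD": [3.0, 1.0, 4.0, 1.0, 5.0],      # weakly related
}
print(select_related_genes("geneA", expression, 0.95))
print(select_related_genes("geneA", expression, 0.95, corr=spearman))
```

In this toy case the rank-based measure also retains the monotone but nonlinear gene, which shows how both the choice of correlation measure and the threshold change the selected gene set.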
For all the results in the paper, the parameters involved in the community detection algorithm were not tuned, and default values were used. However, it has been observed that tuning the parameters can further improve spatial domain detection accuracy in some cases. Nonetheless, tuning the parameters in this instance poses certain challenges. Looking ahead, there are ample opportunities to refine and expand upon the SPACE framework. Future enhancements may involve simplifying and scaling up the SVG detection and cluster detection methods to accommodate larger and more complex datasets. Additionally, continued refinement of the framework will enable researchers to extract deeper insights from spatial transcriptomic data, advancing our understanding of tissue biology and disease mechanisms.
We note that many spatial transcriptomics datasets, such as the DLPFC dataset used in this study, include multiple samples, and integrating them can ideally enhance spatial domain detection accuracy. Recent work, including the BASS [33] and IRIS [43] frameworks, supports the benefits of multi-sample integration. However, kernel-based methods, despite their efficiency, often face scalability issues with large datasets, as noted in previous studies like SPARK-X [25]. As a result, multi-sample analysis is beyond the scope of this work. We plan to explore this direction in future studies, particularly by using SVGs from multiple samples as input in Step 2 of our method, which may further enhance spatial domain detection performance.
In conclusion, the use of SVG clusters generated by SPACE represents a crucial advancement in spatial transcriptomic analysis. As we continue to refine and evolve this methodology, it is poised to become an indispensable tool for dissecting the spatial complexity of biological systems and unraveling the intricate interplay between genes, cells, and tissues.
Acknowledgements
We thank Mr. Ian Loveless for his contribution to the PDAC sample alignment. We also thank MSU iCER for providing the high-performance computing infrastructure. We would like to acknowledge the use of OpenAI’s ChatGPT for assistance in language refinement and editing.
Author contributions: S.D.A. (), N.G.S. (Data curation [equal], Investigation [equal], Resources [equal]), B.T. (Data curation [equal], Resources [equal]), J.W. (Funding acquisition [equal], Investigation [equal], Methodology [equal], Supervision [equal], Writing – review & editing [equal]), Y.C. (Conceptualization [equal], Investigation [equal], Methodology [equal], Supervision [equal], Writing – review & editing [equal]).
Contributor Information
Sikta Das Adhikari, Department of Statistics and Probability, Department of Computational Mathematics, Science and Engineering, Michigan State University, East Lansing, MI, 48824, United States.
Nina G Steele, Department of Surgery, Henry Ford Pancreatic Cancer Center, Henry Ford Hospital, Detroit, MI, 48202, United States; Department of Pharmacology and Toxicology, Michigan State University, East Lansing, MI, 48824, United States; Department of Internal Medicine, University of Cincinnati, Cincinnati, OH, 45219, United States.
Brian Theisen, Department of Pathology, Henry Ford Health, Detroit, MI, 48202, United States.
Jianrong Wang, Department of Computational Mathematics, Science and Engineering, Michigan State University, East Lansing, MI, 48824, United States.
Yuehua Cui, Department of Statistics and Probability, Michigan State University, East Lansing, MI, 48824, United States.
Supplementary data
Supplementary data is available at NAR online.
Conflict of interest
We do not have any conflicts of interest, and we have not received any financial support for this work that could create potential conflicts of interest.
Funding
National Science Foundation Division of Mathematical Sciences (DMS1942143, DMS2152011); U.S. Department of Health and Human Services (R00CA263154); National Institute of General Medical Sciences (R01GM131398); National Institutes of Health (R01ES031937). Funding to pay the Open Access publication charges for this article was provided by the Michigan State University.
Data availability
All relevant codes for reproducing each step of the real data analysis and simulation study results are available on our GitHub repository: https://github.com/wangjr03/SPACE and Zenodo: https://doi.org/10.5281/zenodo.17025709. The publicly accessible datasets and their sources are provided in the data folder. The human PDAC data [44] can be found through dbGaP with accession number PRJNA1124001 and the image files are available at https://zenodo.org/records/13379726.
References
- 1. Das Adhikari S, Yang J, Wang J et al. Recent advances in spatially variable gene detection in spatial transcriptomics. Comput Struct Biotechnol J. 2024; 23:883–91. 10.1016/j.csbj.2024.01.016.
- 2. Li Z, Patel ZM, Song D et al. Benchmarking computational methods to identify spatially variable genes and peaks. Nucleic Acids Res. 2025; 53:gkaf303. 10.1093/nar/gkaf303.
- 3. Swanton C. Intratumor heterogeneity: evolution through space and time. Cancer Res. 2012; 72:4875–82. 10.1158/0008-5472.CAN-12-2217.
- 4. Janiszewska M. The microcosmos of intratumor heterogeneity: the space-time of cancer evolution. Oncogene. 2020; 39:2031–9. 10.1038/s41388-019-1127-5.
- 5. Scadden DT. Nice neighborhood: emerging concepts of the stem cell niche. Cell. 2014; 157:41–50. 10.1016/j.cell.2014.02.013.
- 6. Andersson A, Lundeberg J. sepal: Identifying transcript profiles with spatial patterns by diffusion-based modeling. Bioinformatics. 2021; 37:2644–50. 10.1093/bioinformatics/btab164.
- 7. Svensson V, Teichmann SA, Stegle O. SpatialDE: identification of spatially variable genes. Nat Methods. 2018; 15:343–6. 10.1038/nmeth.4636.
- 8. Sun S, Zhu J, Zhou X. Statistical analysis of spatial expression patterns for spatially resolved transcriptomic studies. Nat Methods. 2020; 17:193–200. 10.1038/s41592-019-0701-7.
- 9. Hunter MV, Moncada R, Weiss JM et al. Spatially resolved transcriptomics reveals the architecture of the tumor-microenvironment interface. Nat Commun. 2021; 12:6278. 10.1038/s41467-021-26614-z.
- 10. Wang Q, Hu B, Hu X et al. Tumor evolution of glioma-intrinsic gene expression subtypes associates with immunological changes in the microenvironment. Cancer Cell. 2017; 32:42–56. 10.1016/j.ccell.2017.06.003.
- 11. Zhang Y, Xiang G, Jiang AY et al. MetaTiME integrates single-cell gene expression to characterize the meta-components of the tumor immune microenvironment. Nat Commun. 2023; 14:2634. 10.1038/s41467-023-38333-8.
- 12. Raghavan S, Winter PS, Navia AW et al. Microenvironment drives cell state, plasticity, and drug response in pancreatic cancer. Cell. 2021; 184:6119–37. 10.1016/j.cell.2021.11.017.
- 13. Yan G, Hua SH, Li JJ. Categorization of 34 computational methods to detect spatially variable genes from spatially resolved transcriptomics data. Nat Commun. 2025; 16:1141. 10.1038/s41467-025-56080-w.
- 14. Yu J, Luo X. Identification of cell-type-specific spatially variable genes accounting for excess zeros. Bioinformatics. 2022; 38:4135–44. 10.1093/bioinformatics/btac457.
- 15. Yu S, Li WV. spVC for the detection and interpretation of spatial gene expression variation. Genome Biol. 2024; 25:103. 10.1186/s13059-024-03245-3.
- 16. Cable DM, Murray E, Shanmugam V et al. Cell type-specific inference of differential expression in spatial transcriptomics. Nat Methods. 2022; 19:1076–87. 10.1038/s41592-022-01575-3.
- 17. Su H, Cui Y. Rotation-invariance is essential for accurate detection of spatially variable genes in spatial transcriptomics. Nat Commun. 2025; 16:7122. 10.1038/s41467-025-62574-4.
- 18. Shang L, Wu P, Zhou X. Statistical identification of cell type-specific spatially variable genes in spatial transcriptomics. Nat Commun. 2025; 16:1059. 10.1038/s41467-025-56280-4.
- 19. Su H, Wu Y, Chen B et al. STANCE: a unified statistical model to detect cell-type-specific spatially variable genes in spatial transcriptomics. Nat Commun. 2025; 16:1793. 10.1038/s41467-025-57117-w.
- 20. Shang L, Zhou X. Spatially aware dimension reduction for spatial transcriptomics. Nat Commun. 2022; 13:7203. 10.1038/s41467-022-34879-1.
- 21. Van der Maaten L, Hinton G. Visualizing data using t-SNE. J Mach Learn Res. 2008; 9:2579–605.
- 22. Maynard KR, Collado-Torres L, Weber LM et al. Transcriptome-scale spatial gene expression in the human dorsolateral prefrontal cortex. Nat Neurosci. 2021; 24:425–36. 10.1038/s41593-020-00787-0.
- 23. Andersson A, Larsson L, Stenbeck L et al. Spatial deconvolution of HER2-positive breast cancer delineates tumor-associated cell type interactions. Nat Commun. 2021; 12:6012. 10.1038/s41467-021-26271-2.
- 24. Ståhl PL, Salmén F, Vickovic S et al. Visualization and analysis of gene expression in tissue sections by spatial transcriptomics. Science. 2016; 353:78–82. 10.1126/science.aaf2403.
- 25. Zhu J, Sun S, Zhou X. SPARK-X: non-parametric modeling enables scalable and robust detection of spatial expression patterns for large spatial transcriptomic studies. Genome Biol. 2021; 22:184. 10.1186/s13059-021-02404-0.
- 26. Traag VA, Waltman L, Van Eck NJ. From Louvain to Leiden: guaranteeing well-connected communities. Sci Rep. 2019; 9:5233. 10.1038/s41598-019-41695-z.
- 27. Hu J, Li X, Coleman K et al. SpaGCN: Integrating gene expression, spatial location and histology to identify spatial domains and spatially variable genes by graph convolutional network. Nat Methods. 2021; 18:1342–51. 10.1038/s41592-021-01255-8.
- 28. Zhao E, Stone MR, Ren X et al. Spatial transcriptomics at subspot resolution with BayesSpace. Nat Biotechnol. 2021; 39:1375–84. 10.1038/s41587-021-00935-2.
- 29. Blondel VD, Guillaume JL, Lambiotte R et al. Fast unfolding of communities in large networks. J Stat Mech Theor Exp. 2008; 2008:P10008. 10.1088/1742-5468/2008/10/P10008.
- 30. Pons P, Latapy M. Computing communities in large networks using random walks. Computer and Information Sciences-ISCIS 2005: 20th International Symposium, Istanbul, Turkey, October 26–28, 2005. Lecture Notes in Computer Science. 2005; 3733: Berlin, Heidelberg: Springer; 284–293.
- 31. Weber LM, Saha A, Datta A et al. nnSVG for the scalable identification of spatially variable genes using nearest-neighbor Gaussian processes. Nat Commun. 2023; 14:4059. 10.1038/s41467-023-39748-z.
- 32. Pham D, Tan X, Balderson B et al. Robust mapping of spatiotemporal trajectories and cell–cell interactions in healthy and diseased tissues. Nat Commun. 2023; 14:7739. 10.1038/s41467-023-43120-6.
- 33. Li Z, Zhou X. BASS: multi-scale and multi-sample analysis enables accurate cell type clustering and spatial domain detection in spatial transcriptomic studies. Genome Biol. 2022; 23:168. 10.1186/s13059-022-02734-7.
- 34. Nwosu ZC, Ward MH, Sajjakulnukit P et al. Uridine-derived ribose fuels glucose-restricted pancreatic cancer. Nature. 2023; 618:151–8. 10.1038/s41586-023-06073-w.
- 35. Halbrook CJ, Lyssiotis CA. Employing metabolism to improve the diagnosis and treatment of pancreatic cancer. Cancer Cell. 2017; 31:5–19. 10.1016/j.ccell.2016.12.006.
- 36. Steele NG, Carpenter ES, Kemp SB et al. Multimodal mapping of the tumor and peripheral blood immune landscape in human pancreatic cancer. Nat Cancer. 2020; 1:1097–112. 10.1038/s43018-020-00121-4.
- 37. Halbrook CJ, Pasca di Magliano M, Lyssiotis CA. Tumor cross-talk networks promote growth and support immune evasion in pancreatic cancer. Am J Physiol Gastrointest Liver Physiol. 2018; 315:G27–35. 10.1152/ajpgi.00416.2017.
- 38. Tibshirani R. Regression shrinkage and selection via the lasso. J Roy Stat Soc Ser B: Stat Methodol. 1996; 58:267–88. 10.1111/j.2517-6161.1996.tb02080.x.
- 39. Friedman J, Hastie T, Tibshirani R. Regularization paths for generalized linear models via coordinate descent. J Stat Softw. 2010; 33:1. 10.18637/jss.v033.i01.
- 40. Simon N, Friedman J, Hastie T et al. Regularization paths for Cox’s proportional hazards model via coordinate descent. J Stat Softw. 2011; 39:1. 10.18637/jss.v039.i05.
- 41. Zhang CH. Nearly unbiased variable selection under minimax concave penalty. Ann Stat. 2010; 38:894–942.
- 42. Fan J, Lv J. Sure independence screening for ultrahigh dimensional feature space. J Roy Stat Soc Ser B: Stat Methodol. 2008; 70:849–911. 10.1111/j.1467-9868.2008.00674.x.
- 43. Ma Y, Zhou X. Accurate and efficient integrative reference-informed spatial domain detection for spatial transcriptomics. Nat Methods. 2024; 21:1231–44. 10.1038/s41592-024-02284-9.
- 44. Loveless IM, Kemp SB, Hartway KM et al. Human pancreatic cancer single-cell atlas reveals association of CXCL10+ fibroblasts and basal subtype tumor cells. Clin Cancer Res. 2025; 31:756–72. 10.1158/1078-0432.CCR-24-2183.