Probabilistic embedding, clustering, and alignment for integrating spatial transcriptomics data with PRECAST

Wei Liu; Xu Liao; Ziye Luo; Yi Yang; Mai Chan Lau; Yuling Jiao; Xingjie Shi; Weiwei Zhai; Hongkai Ji; Joe Yeong; Jin Liu

doi:10.1038/s41467-023-35947-w

Nat Commun. 2023 Jan 18;14:296. doi: 10.1038/s41467-023-35947-w


PMCID: PMC9849443 PMID: 36653349

Abstract

Spatially resolved transcriptomics involves a set of emerging technologies that enable the transcriptomic profiling of tissues with the physical location of expressions. Although a variety of methods have been developed for data integration, most of them are for single-cell RNA-seq datasets without consideration of spatial information. Thus, methods that can integrate spatial transcriptomics data from multiple tissue slides, possibly from multiple individuals, are needed. Here, we present PRECAST, a data integration method for multiple spatial transcriptomics datasets with complex batch effects and/or biological effects between slides. PRECAST unifies spatial factor analysis simultaneously with spatial clustering and embedding alignment, while requiring only partially shared cell/domain clusters across datasets. Using both simulated and four real datasets, we show improved cell/domain detection with outstanding visualization, and the estimated aligned embeddings and cell/domain labels facilitate many downstream analyses. We demonstrate that PRECAST is computationally scalable and applicable to spatial transcriptomics datasets from different platforms.

Subject terms: Data integration, Statistical methods, Software, Bioinformatics

Methods that perform data integration are needed to analyse spatial transcriptomics data from multiple tissue slides. Here, the authors present PRECAST, an efficient data integration method for multiple spatial transcriptomics datasets with complex batch or biological effects between slides.

Introduction

Spatially resolved transcriptomics (SRT) encompass a set of recently developed technologies that characterize the gene expression profiles of tissues while retaining information on their physical location. The methodologies used for resolving spatial gene expression are primarily categorized into in situ hybridization (ISH) technologies (e.g., MERFISH^1–3, seqFISH^4,5, seqFISH+⁶), and in situ capturing technologies (e.g., ST⁷, HDST⁸, Slide-seq^9,10, and 10x Genomics Visium¹¹)¹². The in situ capturing technologies are unbiased and involve transcriptome-wide expression measurements, while ISH-based methods are targeted and require prior knowledge of the genes of interest. These technologies have provided extraordinary new opportunities for researchers to characterize the transcriptomic landscape within a spatial context; explore how cells influence and are influenced by the cells around them¹³; identify genes with spatial variations other than cell/domain differences, e.g., cell morphology¹⁴; and identify spatial trajectories or RNA velocity in tissues^15,16, among other applications¹⁷.

Similar to single-cell RNA-sequencing (scRNA-seq) studies, in SRT studies of a single slide, identifying the cell/domain clusters for each spot with the collation of both spatial information and expression measurements is an important step^18–20. Recently, multiple studies have involved the analysis of SRT datasets from multiple slides, which further requires the removal of unwanted variation between batches. For example, SRT profiles were characterized in 12 human cortex tissue slides from three adult donors using 10x Visium²¹ and in multiple sections from a mouse olfactory bulb (OB) that were equally distributed along the anterior-posterior axis of the same mouse using Slide-seqV2²². Moreover, when multiple SRT datasets from different clinical/biological conditions are available, integrative analysis to estimate shared embeddings of expressions representing variations between cell/domain types can provide the first step towards detecting genes that are differentially expressed between conditions²³. Thus, it is important to develop rigorous methods that are capable of performing data integration across multiple SRT datasets by aligning shared embeddings of biological effects between cell/domain types while accounting for complex batch effects and/or biological effects between slides²⁴.

An ideal data integration method for SRT datasets should be capable of the following three tasks: (1) estimating the shared embeddings of biological effects between cell/domain types across SRT datasets and slide-specific embeddings that account for local microenvironments (spatial dimension reduction); (2) aligning the shared embeddings that capture cellular biological variation across datasets with heterogeneous batch effects and/or biological effects between slides (data alignment); (3) clustering the aligned embeddings to obtain cell/domain clusters across datasets that promote spatial smoothness (spatial clustering). Most existing data integration methods, including MNN²⁴, Scanorama²⁵, Seurat V3²⁶, Harmony²³, scVI²⁷, and scGen²⁸, have been developed for scRNA-seq datasets without any consideration of spatial information. More recently, MEFISTO was proposed as a way to analyze datasets with repeated spatio-temporal measurements²⁹, and PASTE was proposed as a method to stack and/or integrate SRT data from multiple adjacent tissue slices into a single slide, but it is not applicable to the integration of tissue sections from different individuals³⁰. In addition, most existing methods perform data integration in low-dimensional space using the principal components (PCs) of conventional dimension reduction, e.g., principal component analysis (PCA), without considering the consistent loss functions of dimension reduction, alignment across datasets, or spatial clustering.

To address the challenges presented by SRT data integration and facilitate the downstream analyses of combinations of multiple tissue slides, we propose the use of a unified and principled probabilistic model, PRECAST, to simultaneously estimate low-dimensional embeddings for biological effects between cell/domain types, perform spatial clustering, and most importantly, align embeddings for normalized gene expression matrices from multiple tissue slides. As a result, PRECAST can resolve aligned representations, provide outstanding visualizations, and achieve higher spatial clustering accuracy for combined tissue slides. The resolved aligned representations and estimated labels from PRECAST can be used in multiple downstream analyses, e.g., removing batch effects, identifying differentially expressed genes under different conditions/stages, and recovering spatial trajectories/RNA velocity, etc. In addition, PRECAST uniquely estimates slide-specific embeddings that capture the spatial dependence in neighboring cells/spots, providing an opportunity to understand the spatial impact of various microenvironments. We illustrate the benefits of using PRECAST through extensive simulations and analysis of a diverse range of example datasets collated with different spatial transcriptomics technologies: 10x Visium datasets of 12 human dorsolateral prefrontal cortex (DLPFC) samples and four hepatocellular carcinoma (HCC) samples, ST datasets from eight mouse liver tissue sections and Slide-seqV2 datasets from 16 OB tissue slides.

Results

Spatial transcriptomics data integration using PRECAST

Unlike other integration methods that take as input the top (spatial) PCs and apply multiple steps to remove batch effects and merge the data, PRECAST takes as input the normalized gene expression matrices from multiple tissue slides, factorizes the input to each matrix into a latent factor with a shared distribution in each cell/domain cluster while simultaneously performing spatial dimension reduction and spatial clustering, and aligning and estimating joint embeddings for biological effects between cell/domain types across multiple tissue slides (Fig. 1a). In the dimension-reduction step, we used an intrinsic conditional autoregressive (CAR) component to capture the spatial dependence induced by neighboring microenvironments while in the spatial-clustering step, we used a Potts model to promote spatial smoothness within spot neighborhoods in the space of cluster labels. PRECAST applies a simple projection strategy to non-cellular biological effects, e.g., batch effects and/or biological effects between slides, and implicitly accounts for shifts in the centroid of each cluster using the intrinsic CAR component in the dimension-reduction step (see “Methods”). These considerations of PRECAST mimic some of the recent explorations into self-supervised learning and domain adaptation in deep learning but in a parametric manner. We show that PRECAST outperforms existing data integration methods by more successfully aligning similar clusters across multiple tissue slides while separating clusters with outstanding visualization. Uniquely, PRECAST estimates slide-specific embeddings for spatial dependence among neighboring cells/spots, providing an opportunity to explore the impact of neighboring microenvironments.
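To make the Potts smoothing idea concrete, the sketch below computes the conditional probability of a spot's label given its neighbors' labels under the standard Potts form, P(y_i = k | neighbors) ∝ exp(β × number of neighbors with label k). This is an illustration only, not PRECAST's implementation: the function name `potts_conditional`, the parameter `beta`, and the example values are assumptions introduced here.

```python
import math

def potts_conditional(neighbor_labels, n_clusters, beta):
    """P(y_i = k | neighbor labels) under a standard Potts prior (illustrative)."""
    # Count how many neighbors carry each label.
    counts = [0] * n_clusters
    for lab in neighbor_labels:
        counts[lab] += 1
    # Unnormalized weight: exp(beta * number of agreeing neighbors).
    weights = [math.exp(beta * c) for c in counts]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical spot with four neighbors, three of them in cluster 0.
probs = potts_conditional([0, 0, 1, 0], n_clusters=3, beta=1.0)
```

A larger `beta` pulls the spot more strongly towards the majority label of its neighborhood, while `beta = 0` gives a uniform prior with no spatial smoothing.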

PRECAST can be applied to SRT datasets of various resolutions. By increasing the resolution for SRT datasets, PRECAST can reveal fine-scale cell-type distributions. In the following analysis, datasets in Slide-seqV2 with near-single-cell resolution present spatial patterns with much more “noise” than those of Visium due to the heterogeneity of the cell-type distributions. When we merged nearby beads in the Slide-seqV2 datasets, the recovered spatial patterns resembled those from Visium.
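One simple way to merge nearby beads is to bin them on a square grid and sum the counts within each bin. The sketch below assumes this grid-binning rule and a hypothetical `bin_size`; the collapsing procedure actually used in the study is described in its “Methods” and may differ.

```python
def collapse_spots(coords, counts, bin_size):
    """Sum the expression counts of all spots that fall in the same square bin."""
    merged = {}
    for (x, y), expr in zip(coords, counts):
        # Integer grid index of the bin containing this spot.
        key = (int(x // bin_size), int(y // bin_size))
        if key not in merged:
            merged[key] = list(expr)
        else:
            merged[key] = [a + b for a, b in zip(merged[key], expr)]
    # Represent each merged spot by the center of its bin.
    new_coords = [((i + 0.5) * bin_size, (j + 0.5) * bin_size) for i, j in merged]
    new_counts = list(merged.values())
    return new_coords, new_counts

# Three toy beads with two genes each; the first two share a bin.
coords = [(1.0, 1.0), (2.0, 3.0), (12.0, 1.0)]
counts = [[1, 0], [2, 5], [4, 4]]
new_coords, new_counts = collapse_spots(coords, counts, bin_size=10.0)
```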

We use the estimated cell/domain labels and finely aligned embeddings to showcase some downstream analyses, as depicted in Fig. 1b. First, users can visualize the inferred embeddings for biological effects between cell/domain types using two components from either tSNE³¹ or UMAP³². Second, because the aligned embeddings obtained from PRECAST only carry information on the biological differences between cell/domain types, we provide a module to recover the gene expression matrices with batch effects removed and further identify genes differentially expressed either across different cell/domain clusters and/or under different conditions. Third, using these embeddings, we can identify genes whose spatial variability is not just due to biological differences between cell/domain types. Fourth, with the aligned embeddings estimated by PRECAST, we can perform trajectory inference/RNA velocity analysis to determine either the pseudotime or the pattern of dynamics in spatial spots across multiple tissue slides.

Validation using simulated data

We performed comprehensive simulations to evaluate the performance of PRECAST and compare it with that of several other methods (Fig. 1c and Supplementary Figs. S1–S2). Specifically, we considered the following eight integration methods: Harmony²³, Seurat V3²⁶, fastMNN²⁴, scGen²⁸, Scanorama²⁵, scVI²⁷, MEFISTO²⁹, and PASTE³⁰, all of which can be used to estimate the aligned embeddings among samples, except PASTE, which can only estimate the embeddings of the center slice. The simulation details are provided in the “Methods” section. Briefly, we simulated either the normalized or count matrices of gene expression to compare PRECAST with other methods in terms of performance in data integration, dimension reduction, and spatial clustering. To mimic real data, we also considered the two ways to generate spatial coordinates and cell/domain labels, i.e., Potts models and real data in three DLPFC Visium slices. Using real data in DLPFC, we considered slides either from different or the same donor. In total, we investigated five scenarios: (1) Potts + Count, with three different scales in batch effects (low, middle, high); (2) DLPFC (slides from different donors) + Count, with three different scales in batch effects (low, middle, high); (3) DLPFC (slides from the same donor) + Count; (4) Potts + logCount; and (5) DLPFC (slides from different donors) + logCount. Scenarios 1 and 2 were used to examine the impact of scales in batch effects on data integration performance.

To quantify the performance of the data integration, we calculated the F1 scores of the average silhouette coefficients, which summarize two silhouette-based metrics in a single quantity, and two versions of the local inverse Simpson’s Index (LISI): integration LISI (iLISI) and cell-type LISI (cLISI). iLISI was employed to assess integration mixing, while cLISI was employed to assess the separation of each domain cluster. In scenarios 1 and 2, the data integration performance of all methods dropped as the scale of batch effects increased. However, PRECAST achieved the best performance among all methods, with the highest F1 scores for the average silhouette coefficients. In scenario 3, PRECAST was comparable to PASTE (Fig. 1c, top panel), and in scenarios 4 and 5, PRECAST markedly outperformed other methods (Supplementary Fig. S1c). Moreover, PRECAST was able to merge spots within common cell/domain clusters across datasets, while separating spots from different cell/domain clusters (small cLISI, Supplementary Fig. S1a, top panel; Supplementary Fig. S1c, middle panel) and, at the same time, maintaining a sufficient mix of different samples (large iLISI, Supplementary Fig. S1a, c, bottom panel).
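The following sketch illustrates the two kinds of metric under simplifying assumptions. `simple_lisi` uses unweighted k nearest neighbors, whereas the published LISI uses perplexity-based Gaussian weights. `silhouette_f1` assumes the common benchmark form, in which both average silhouette coefficients are rescaled to [0, 1] and combined as a harmonic mean of cluster separation and batch mixing. All names and the toy data are hypothetical.

```python
import math

def simple_lisi(points, labels, k):
    """Unweighted kNN inverse Simpson's index per point (simplified LISI)."""
    scores = []
    for i, p in enumerate(points):
        # Distances to all other points, nearest first.
        dists = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        neigh = [labels[j] for _, j in dists[:k]]
        # Inverse Simpson's index: 1 / sum of squared label proportions.
        props = [neigh.count(lab) / len(neigh) for lab in set(neigh)]
        scores.append(1.0 / sum(pr * pr for pr in props))
    return scores

def silhouette_f1(sil_cluster, sil_batch):
    """Assumed F1 form: harmonic mean of cluster separation and batch mixing."""
    c = (sil_cluster + 1) / 2  # rescale from [-1, 1] to [0, 1]
    b = (sil_batch + 1) / 2
    denom = (1 - b) + c
    return 0.0 if denom == 0 else 2 * (1 - b) * c / denom

# Two separated groups (cell types), each containing spots from two batches.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
cell_type = [0, 0, 0, 0, 1, 1, 1, 1]
batch = [0, 1, 0, 1, 0, 1, 0, 1]
clisi = simple_lisi(points, cell_type, k=3)  # near 1: neighbors share one type
ilisi = simple_lisi(points, batch, k=3)      # above 1: neighbors mix batches
```

In this toy layout every neighborhood contains a single cell type and both batches, which is the pattern the text describes as small cLISI together with large iLISI.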

We then evaluated the performance of estimating the embeddings induced by neighboring microenvironments in PRECAST. The estimated embeddings due to neighboring microenvironments were highly correlated with the underlying truth (Fig. 1c, middle panel; Supplementary Fig. S1d, top panel), suggesting that PRECAST accurately recovered the spatial dependence of spots on their microenvironments. Furthermore, the canonical correlation coefficients decreased as the scale of batch effects increased. Next, we evaluated the performance of the models in obtaining embeddings for biological effects between cell/domain types. For the average canonical correlation between the estimated aligned embeddings and the true latent features, PRECAST ranked at the top in the majority of scenarios (Supplementary Fig. S1b, top panel; Supplementary Fig. S1d, middle panel). This suggested that the aligned embeddings estimated by PRECAST were more accurate. We also showed that Pearson’s correlation coefficients between the observed expression and the estimated cell/domain labels conditioned on the aligned embeddings from PRECAST were lower than those from other methods, except for scenario 3 (Supplementary Fig. S1b, d, bottom panel), suggesting PRECAST captured more relevant information regarding cell/domain clusters and, thus, facilitated downstream analysis.

Last, we compared the clustering performance of each method. As a unified method, PRECAST simultaneously estimates embeddings for biological effects between cell/domain types and cluster labels. For the other methods, we sequentially performed spatial clustering based on each of the estimated embeddings using SC-MEB, except for Seurat V3, which has its own clustering pipeline based on Louvain. PRECAST achieved the highest adjusted Rand index (ARI) and normalized mutual information (NMI) in all considered scenarios (Fig. 1c, bottom panel, and Supplementary Fig. S2a, b), while the other methods, such as Seurat V3 and Harmony, were sensitive to the data generation process. Moreover, we observed that all methods correctly chose the number of clusters. Notably, when using embeddings from PRECAST, other clustering methods such as SC-MEB, BASS, BayesSpace, and Louvain achieved comparable clustering performance to PRECAST (Supplementary Fig. S3a). PRECAST was also computationally efficient, exhibiting linear computational complexity with respect to the number of genes and the total number of spots (Supplementary Fig. S3b). It only took ~6 h to analyze a dataset with 2000 genes and 600,000 spots for a fixed number of clusters (K = 7; Supplementary Fig. S3b, left panel).
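For readers unfamiliar with the ARI, the sketch below computes it from the contingency counts of two labelings using the standard pair-counting formula. It is a plain-Python illustration with hypothetical names and toy labels, not the code used in the study; NMI is not shown.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(truth, pred):
    """Adjusted Rand index between two labelings of the same spots."""
    n = len(truth)
    # Pairs of spots that are together in both labelings.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(truth, pred)).values())
    # Pairs that are together within each labeling separately.
    sum_rows = sum(comb(c, 2) for c in Counter(truth).values())
    sum_cols = sum(comb(c, 2) for c in Counter(pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. a single cluster in both labelings
    return (sum_cells - expected) / (max_index - expected)

same_up_to_renaming = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
fully_crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

The index is invariant to how clusters are numbered, so a perfect clustering with permuted labels still scores 1, while agreement at or below chance level scores around 0 or negative.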

Application to human dorsolateral prefrontal cortex Visium data

We applied PRECAST and the other methods to the analysis of four published datasets obtained via either Visium, ST, or Slide-seqV2 technologies (see “Methods”). By obtaining the estimated aligned embeddings and cluster labels from PRECAST, we could perform many downstream analyses using all tissue slides. Here, we showcase the differential expression (DE) analyses across detected domains, spatial variation analysis (SVA) adjusting for aligned embeddings as covariates, and trajectory inference/RNA velocity analysis. To examine the clustering performance with low-resolution data, we performed deconvolution analysis to infer the cell compositions of the domains detected by PRECAST.

To quantitatively show that PRECAST outperforms existing data integration methods, we first analyzed LIBD human DLPFC data generated using 10x Visium²¹ that contained 12 tissue slices from three adult donors, comprising four tissue slices from each donor. In all 12 tissue slices, the median number of spots was 3844, and the median number of genes per spot was 1716. The original study provided manual annotations for the tissue layers based on the cytoarchitecture that allowed us to evaluate the performance of both the data integration and accuracy of spatial domain detection by taking the manual annotations as ground truth. For each method, we summarized the inferred embeddings for biological effects between cell/domain types using three components from either tSNE or UMAP and visualized the resulting tSNE/UMAP components with red/green/blue (RGB) colors in the RGB plot (Fig. 2a, right-top panel, and Supplementary Figs. S4a, b to S6a, b). The resulting RGB plots from PRECAST showed the laminar organization of the human cerebral cortex, and PRECAST provided smoother transitions across neighboring spots and spatial domains than those from other methods. For each of the other methods, we further performed clustering analysis to detect spatial domains using different methods with the inferred embeddings (Fig. 2a, right-bottom panel, and Supplementary Figs. S4c–S6c). We observed that the results from PRECAST had stronger laminar patterns and the estimated aligned embeddings carried more information about the domain labels (Supplementary Fig. S7a).
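An RGB plot of this kind can be produced by mapping the three tSNE/UMAP components of each spot to color channels. The sketch below assumes a per-component min-max rescaling to [0, 1], which is one common choice; the exact scaling behind the published figures is not stated in this section, and the function name is hypothetical.

```python
def embedding_to_rgb(components):
    """Min-max scale three components to [0, 1]; each row becomes (R, G, B)."""
    scaled_cols = []
    for col in zip(*components):
        lo, hi = min(col), max(col)
        span = hi - lo
        # A constant component carries no information; map it to mid-gray.
        scaled_cols.append([(v - lo) / span if span > 0 else 0.5 for v in col])
    return [tuple(rgb) for rgb in zip(*scaled_cols)]

# Three toy spots, each with three tSNE/UMAP components.
rgb = embedding_to_rgb([(-2.0, 0.0, 5.0), (0.0, 1.0, 5.0), (2.0, 2.0, 5.0)])
```

Spots with similar embeddings receive similar colors, so smooth color transitions across the tissue indicate spatially coherent embeddings.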

Fig. 2 — a Left panel: H&E image and manual annotation of sample ID151674. Top panel: UMAP RGB plots of sample ID151674 for PRECAST, Seurat V3, Harmony, and fastMNN. Bottom panel: Clustering assignment heatmaps for these four methods. The color scheme used in the clustering assignment heatmap for PRECAST is the same as in (b) and (d). b tSNE plots for these four data integration methods, with the right-most column showing the analysis without correction; domains are labeled as in (d). c Box/violin plot of ARI values for PRECAST and other methods; SC-MEB was used in the other methods for clustering based on their aligned embeddings. In the boxplot, the center line and box lines denote the median, upper, and lower quartiles, respectively. d Heatmap of Pearson’s correlation coefficients among detected domains. L1-L6, Layer 1–Layer 6; WM, white matter; NA, undetermined. e Spatial expression patterns of DE genes for Domain 1 (*HOPX*), Domain 2 (*PCP4*), Domain 4 (*ENC1*), Domain 5 (*CNP*), Domain 5 (*MBP*), and Domain 8 (*NEFL*) for sample ID151674. f Spatial expression patterns of genes associated with pseudotime: *MOBP*, *GFAP*, *MAG*, *TF*, *MBP*, and *COX1*, where the arrow represents the direction of the increased pseudotime. g Bubble plot of −log10(p-values) for GO enrichment analysis of genes associated with pseudotime. The p-values are based on one-sided hypergeometric tests without multiple testing adjustment. h Bubble plot of −log10(p-values) for KEGG enrichment analysis of SVGs while adjusting domain-relevant aligned embeddings by PRECAST for sample ID151674. The p-values are based on one-sided Fisher’s exact tests with the Benjamini-Hochberg FDR corrections.

A unique feature of PRECAST is its ability to estimate slide-specific embeddings capturing spatial dependence in the neighboring cells/spots due to various neighboring microenvironments in different regions. Supplementary Fig. S8 provides RGB plots of the inferred embeddings for spatial dependence using three components from either tSNE or UMAP. We observed that spots in the domain of white matter had similar microenvironments, while spots in layers 1 to 6 had two distinct microenvironment patterns from left to right, suggesting potentially distinct functions in the left and right regions of layers 1 to 6.

PRECAST can offer outstanding data visualization compared with other methods. We visualized the inferred embeddings for biological effects between cell/domain types using two components from tSNE for each method (Fig. 2b and Supplementary Fig. S7b). The tSNE plots for PRECAST show that spots from different slices were well mixed (Fig. 2b, top panel) while the domain clusters were well segregated (Fig. 2b, bottom panel), and there were significant improvements in visualization in comparison to the other methods, including the method applied with no corrections. Supplementary Fig. S7c shows that PRECAST achieved the best data integration in terms of F1 scores, iLISI, and cLISI. To evaluate the clustering accuracy, we used both ARI and NMI. As shown in Fig. 2c and Supplementary Fig. S7d, PRECAST achieved the highest ARI and NMI for the separate evaluation and combined evaluation: the median ARI was 0.434 for PRECAST, 0.382 for Scanorama, and 0.406 for scVI in the separate evaluation; and the ARI was 0.374 for PRECAST, 0.216 for Scanorama, and 0.301 for scVI in the combined evaluation. Using embeddings aligned from PRECAST, we further demonstrated that other clustering methods could achieve a similar clustering performance to PRECAST (Supplementary Fig. S9). A heatmap of Pearson’s correlation coefficients among the detected domains shows the good separation of the estimated aligned embeddings across domains (Fig. 2d) and the correlations between deeper layers were high, e.g., there were high correlations between layers 5 and 6, while correlations among the separated layers were low.

A key benefit of PRECAST is its ability to estimate aligned embeddings for biological effects between cell/domain types and joint labels for all slides. We performed DE analysis for the combined 12 slices (see “Methods”). In total, we detected 1331 DE genes with adjusted p-values of less than 0.001 among the 10 spatial domains identified by PRECAST, with 314 genes being specific to Domain 5, which corresponded to white matter (Supplementary Data 1). Many of these genes were reported to be enriched in different layers of DLPFC, e.g., PCP4 (Domain 2, layer 5)¹⁹, HOPX (Domain 1, layer 1/3), and ENC1 (Domain 4, layer 2/3)³³ (Fig. 2e and Supplementary Figs. S10–S12). Next, we performed trajectory inference using the aligned embeddings and domain labels estimated by PRECAST (Supplementary Fig. S13). The pseudotime analysis inferred using the aligned embeddings from PRECAST was “sample-invariant” compared with that using embeddings from either PCA or DR-SC³⁴ for a single slide (Supplementary Fig. S14). In total, we identified 858 genes associated with the estimated pseudotime with adjusted p-values of less than 0.001. Among them, 373 were identified as DE genes in at least one domain (Fig. 2f and Supplementary Data 2). We further found that the pseudotime-associated genes identified by PRECAST were significantly enriched for nervous system development (Fig. 2g). Among the most enriched of these genes was GFAP, which encodes glial fibrillary acidic protein and plays an important role in human brain development³⁵.
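The GO enrichment p-values in Fig. 2g are described as coming from one-sided hypergeometric tests. As a minimal sketch of that test, the function below returns the upper-tail probability of seeing at least the observed overlap between a gene list and a GO term. The function name, argument names, and toy numbers are assumptions for illustration and are not taken from the study.

```python
from math import comb

def hypergeom_enrichment_pvalue(n_universe, n_in_term, n_selected, n_overlap):
    """P(X >= n_overlap) when n_selected genes are drawn without replacement."""
    total = comb(n_universe, n_selected)
    upper = min(n_in_term, n_selected)
    # Sum the probabilities of every overlap at least as large as the observed one.
    tail = sum(
        comb(n_in_term, k) * comb(n_universe - n_in_term, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total

# Toy example: 10 genes in total, 4 in the term, 3 selected, 2 of them in the term.
p = hypergeom_enrichment_pvalue(10, 4, 3, 2)
```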

To further show that the estimated embeddings for biological effects between cell/domain types from PRECAST were well aligned across tissue slides, we performed SVA analysis with the aligned embeddings from PRECAST as covariates for each slice to identify spatially variable genes (SVGs) with nonlaminar patterns. A detailed list of the genes identified at a false discovery rate (FDR) of 1% is available in Supplementary Data 3. Interestingly, many of the identified genes were related to immune function. For example, ISG15 encodes a ubiquitin-like molecule induced by type I interferon, and ISG15 deficiency increases antiviral responses in humans^36,37. Many studies in the literature have highlighted the importance of immune-brain interactions to the development of many disorders of the central nervous system^38,39. Further enrichment analysis showed the genes from each slice to be highly enriched in many common pathways, suggesting the embeddings for biological effects between cell/domain types were effectively aligned by PRECAST (Fig. 2h and Supplementary Figs. S15–S19).
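Selecting genes at a given FDR is commonly done with the Benjamini-Hochberg procedure, which the Fig. 2 caption names for the KEGG enrichment p-values. Whether the same procedure was used for the SVG list is an assumption here. The sketch below computes BH-adjusted p-values in plain Python; the function name and the example p-values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

pvals = [0.001, 0.008, 0.039, 0.041, 0.6]
adj = benjamini_hochberg(pvals)
# Indices of tests passing a 1% FDR threshold.
selected = [i for i, q in enumerate(adj) if q < 0.01]
```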

To demonstrate the robustness of PRECAST, we applied different methods to select top genes as input. As presented in Supplementary Fig. S20, when using the top 2000 highly variable genes (HVGs) as input, we observed similar patterns in results from different data integration methods. Supplementary Fig. S21 confirms the robustness of PRECAST by using top genes identified by different methods, such as SPARK⁴⁰, SPARK-X⁴¹, SpatialDE⁴², and nnSVG⁴³, as input.

Application to mouse liver ST data

We further applied PRECAST and other methods to analyze eight liver sections from the caudate and right liver lobes of three wild-type adult female mice, profiled using ST technology⁴⁴. In all eight sections, the median number of spots was 640, and the average number of genes was 15,302. The original study provided manual annotations based on marker genes that allowed us to evaluate the performance of both the data integration and accuracy of spatial domain detection by taking the manual annotations as ground truth. For each method, we summarized the clustering performance of each section and combined sections using both ARI and NMI (Fig. 3a, top panel, and Supplementary Fig. S22a). PRECAST achieved the highest ARI and NMI in both cases: in each separate section, the median ARI was 0.24 for PRECAST, 0.18 for Seurat V3, and 0.02 for scVI; and jointly, in the combined sections, the value of ARI was 0.23 for PRECAST, 0.18 for Seurat V3, and 0.02 for scVI. We visualized the cluster labels obtained by PRECAST and other methods as well as the manual annotations (Supplementary Fig. S23) and found PRECAST performed best for each individual sample. On the other hand, PRECAST achieved better data integration than most of the other methods in terms of F1 score, iLISI, and cLISI (Fig. 3a, bottom panel, and Supplementary Fig. S22b) with comparable conditional correlations (Supplementary Fig. S22c). The tSNE plots for PRECAST show that spots from different sections were well mixed (Fig. 3b and Supplementary Fig. S22d, top panel), while the domain clusters were well segregated (Fig. 3b and Supplementary Fig. S22d, bottom panel), and there were significant improvements in visualization over other methods, including the method applied with no corrections. We further visualized spatial dependence due to variations in neighboring microenvironments using RGB plots of the inferred slide-specific embeddings (Supplementary Fig. S22e, f) and observed various patterns of microenvironments in the different sections. Using embeddings aligned from PRECAST, we further demonstrated that (spatial) clustering methods could achieve comparable clustering performance to PRECAST (Supplementary Fig. S24).

Fig. 3 — a Top panel: Box/violin plot of ARI values of each sample for PRECAST and the other methods (left); bar plot of ARI value of combined samples for PRECAST and other methods (right). Bottom panel: Box/violin plots of cLISI and iLISI values for PRECAST and other methods. Color scheme of each method is the same as in (d). In the boxplot, the center line and box lines denote the median, upper, and lower quartiles, respectively. b tSNE plots for four data integration methods, with the right-most column showing analysis without correction. Color scheme of each domain is the same as in (c) and (g). c Heatmap of differentially expressed genes for each domain identified by PRECAST. CV-1, central veins 1; CV-2, central veins 2; Endo, endothelial cells; Hep, hepcidin-related cells; Mes, mesenchymal-related cells; PV-1, portal veins 1; PV-2, portal veins 2. d Bar plot of McFadden’s adjusted R² values for PRECAST and other methods. McFadden’s adjusted R² measures the association between the cluster label obtained by each method and the cell proportion obtained by RCTD cell-type deconvolution, and a larger value indicates a stronger association. e Visualization of the cell type proportions mapped to spatial coordinates for six cell types of the first three samples. f Visualization of the combined trajectory inferred by PRECAST in tSNE plot of all samples. Domains 1-2 and Domains 6-7 representing central veins and portal veins, respectively, are circled. g Heatmap for genes with expression change in the Slingshot pseudotime inferred by PRECAST.

As a key feature, PRECAST estimates aligned embeddings for biological effects between cell/domain types and joint labels for the combined sections. We performed DE analysis of the combined sections (see “Methods”). In total, we detected 367 DE genes with adjusted p-values of less than 0.001 in all seven spatial domains detected by PRECAST (Supplementary Data 4). A heatmap of the findings shows the good separation of the DE genes across different spatial domains (Fig. 3c). Many of these genes are markers that define particular cellular regions in liver lobes, e.g., Cyp2e1, Cyp2c37, Oat, and Slc1a2 for central veins (Domains 1–2)^44,45; Cyp2f2, Hal, Sds, and Ctsc for portal veins (Domains 6–7)⁴⁴; and Gsn, Vim, and Col3a1 for the mesenchyme (Domain 5)⁴⁴. Further enrichment analysis shows that genes specific to both central veins and portal veins were highly enriched for metabolic processes, with central veins more enriched for fatty acid metabolism and portal veins more enriched for amino acid metabolism (Supplementary Fig. S25). By performing enrichment analysis for DE genes unique to each of the two subtypes in central/portal veins, we found pathways unique to each subtype of central/portal veins (Supplementary Fig. S26).

To examine the cell compositions of each spatial domain detected by PRECAST, we performed cell-type deconvolution analysis of all mouse liver datasets using scRNA-seq data in the Mouse Cell Atlas (MCA)⁴⁶ as the reference panel. To assess the performance in spatial clustering, we evaluated the associations between the cell type proportions obtained by cell-type deconvolution and the cluster labels estimated by PRECAST and other methods. The results displayed in Fig. 3d suggest that PRECAST attained a larger McFadden’s adjusted R² than the other methods. In addition, cell-type deconvolution enabled us to spatially map 17 cell types annotated in the MCA dataset for liver tissue sections (Fig. 3e and Supplementary Figs. S27–S28). We observed a high proportion of estimated values for periportal and pericentral hepatocytes, which is consistent with the findings in the existing literature⁴⁴.
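McFadden’s adjusted R² compares the log-likelihood of a fitted model with that of an intercept-only model, with a penalty for the number of parameters: R² = 1 − (ln L_full − K) / ln L_null. The sketch below assumes the full model is a multinomial regression of cluster labels on cell-type proportions, fitted elsewhere, and takes its log-likelihood as input. The function names and toy numbers are hypothetical.

```python
import math
from collections import Counter

def null_log_likelihood(labels):
    """Log-likelihood of an intercept-only multinomial model for the labels."""
    n = len(labels)
    return sum(c * math.log(c / n) for c in Counter(labels).values())

def mcfadden_adjusted_r2(loglik_full, loglik_null, n_params):
    """1 - (lnL_full - K) / lnL_null, where K is the number of parameters."""
    return 1.0 - (loglik_full - n_params) / loglik_null

# Toy example: four spots in two clusters; the full model's log-likelihood
# (-1.0) and parameter count (1) are made-up inputs for illustration.
ll_null = null_log_likelihood([0, 0, 1, 1])
r2 = mcfadden_adjusted_r2(loglik_full=-1.0, loglik_null=ll_null, n_params=1)
```

A larger value means the cell-type proportions explain the cluster labels better, after penalizing model size.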

Lastly, we performed trajectory inference using aligned embeddings and estimated domain labels from PRECAST, to examine the cell lineages in the detected domains. Figure 3f shows the inferred pseudotime mapped to PRECAST-induced tSNE, which suggests the central veins differentiated earlier than portal veins, in accordance with findings that showed Wnts and R-spondin3 signals were released from the central veins and transited along the venular wall towards the perivenous hepatocytes⁴⁵. Based on the inferred pseudotime, we identified differentially expressed genes along the cell pseudotime using TSCAN⁴⁷. The heatmap of the expression of the top 20 most significant genes (Fig. 3g) suggested the occurrence of some interesting dynamic expression patterns over pseudotime.

We further confirmed the robustness of PRECAST by using top genes identified by different methods as input (Supplementary Figs. S29 and S30).

Application to mouse olfactory bulb Slide-seqV2 data

In Visium Spatial Gene Expression, barcoded beads (55 μm diameter) with a center-to-center distance of 100 μm are used to capture mRNA¹¹. The Slide-seq technique was developed to perform high-resolution SRT using 10-μm-diameter barcoded beads^9,10, and Slide-seqV2 further improved the detection sensitivity. To show the scalability of PRECAST, we analyzed a mouse OB dataset generated using Slide-seqV2 technology²². In this dataset, spatial transcriptomic information was obtained from a total of 20 OB sections distributed evenly along the anterior-posterior axis. We removed four slides from the end of the mouse OB region because of poor section quality, and analyzed the remaining 16 slides, containing 21,571 genes on average across a total of 693,863 spots, using PRECAST and other methods. After quality control (QC), we obtained data of a lower resolution (~4000 spots) by collapsing nearby spots in each slide (see “Methods”), and the resulting resolution was similar to that of Visium (average of 10 cells per spot). To further examine the structure of the mouse OB, we relied on the structural annotations in the Allen Brain Atlas⁴⁸ (Fig. 4a). With near-single-cell resolution, the identified aligned embeddings and cluster labels both showed more fine-scale cell-type distribution patterns in the mouse OB⁴⁹. In comparison, when we lowered the resolution, PRECAST estimated the aligned embeddings and cluster labels with smoother spatial patterns at the expense of less detailed local spatial information (Fig. 4b and Supplementary Figs. S31–S34). Moreover, the resulting RGB plots from PRECAST showed the laminar organization of the mouse OB, and PRECAST provided smoother transitions across neighboring spots and spatial domains than the other methods (Supplementary Figs. S31a, b to S34a, b). We next visualized the inferred aligned embeddings for biological effects between cell/domain types using two components from tSNE from each method (Fig. 4c and Supplementary Fig. S35) and showed that, using PRECAST, the spatial spots were mixed well across the 16 slides, while the cell/domain clusters were well segregated, suggesting PRECAST was more effective at spatial data integration. We further visualized the inferred slide-specific embeddings for spatial dependence due to variations in the microenvironment using RGB plots (Supplementary Fig. S36) and observed a few microenvironmental patterns in the inner layers (e.g., the granule cell layer, GCL), middle layers (e.g., the glomerular layer, GL), and outer layers (e.g., the olfactory nerve layer, ONL).

Fig. 4 — a Structure of the mouse olfactory bulb annotated using the Allen Brain Atlas. b Clustering assignment heatmaps for 16 tissue slides by PRECAST, where the first row shows samples 1–6, the second row samples 7–12, and the last row samples 13–16 (RMS, rostral migratory stream; GCL, granule cell layer; IPL, inner plexiform layer; MCL, mitral cell layer; OPL, outer plexiform layer; GL, glomerular layer; ONL, olfactory nerve layer). Color scheme for domains detected in PRECAST is as in (c) and (d), and the order of domain labels in (b) is the same as in (c), (d), and (e). c tSNE plots for four data integration methods with the right-most column showing analysis without correction. d Percentage of different cell types in each domain detected by PRECAST with scaling. e MacFadden’s adjusted R² between the inferred cell type proportions and the estimated domain labels by PRECAST and other methods (top panel); boxplot of ARI values of 16 samples for PRECAST and other methods, where each spot is annotated using the cell type with the highest proportion from the spatial deconvolution (bottom panel). In the boxplot, the center and box lines denote the median, upper, and lower quartiles, respectively. f Visualization of the trajectory inferred by PRECAST in spatial heatmap for samples 1–8.

At reduced resolution, PRECAST detected 12 spatial domains with laminar organization, including the rostral migratory stream (RMS, Domain 1), GCL (Domain 2), GCL/inner plexiform layer (GCL/IPL, Domain 3), mitral cell layer (MCL, Domain 4), outer plexiform layer (OPL, Domain 5), GL (Domains 6 and 7), and ONL (Domain 8), with Domains 9–12 belonging to low-quality regions or experimental artifacts. To characterize the transcriptomic properties of the spatial domains identified by PRECAST, we performed DE analysis of the combined 16 tissue slides (see “Methods”). In total, we detected 4131 DE genes with adjusted p-values of less than 0.001 in all 12 spatial domains detected by PRECAST (Supplementary Data 5), including representative genes that define particular cellular layers in the mouse OB, e.g., Sox2ot and Sox11 (RMS)^50,51. The heatmap of the findings shows good separation of the DE genes across different spatial domains (Supplementary Fig. S37), and we found that genes specific to Domain 1 (RMS) were enriched for myelin sheath and structural constituents of myelin sheath (Supplementary Fig. S38). Compared with the reduced-resolution data, when near-single-cell level resolution was used, PRECAST identified 24 cell clusters, including fine-scale cell-type clusters. To better visualize each detected cell cluster, we plotted a heatmap of each cluster assignment for all 16 slides (Supplementary Figs. S39–S40). A heatmap of DE genes across different cell clusters showed that Clusters 1–3 were subtypes of granule cells, and Clusters 4–6 belonged to cells in MCL and GL (Supplementary Fig. S41).

To examine the cell compositions in each spatial domain detected by PRECAST at the reduced resolution, we performed cell-type deconvolution analysis of all 16 tissue slides using scRNA-seq data from adult mouse OB as the reference panel⁵². As shown in Fig. 4d and Supplementary Fig. S42a, Domain 1 (RMS) was enriched for immature neurons. Immature neurons reportedly migrate to the OB through RMS⁵³. Unsurprisingly, we found that Domains 2–3 (GCL) were dominated by two primary subtypes of granule cell, with a larger proportion of immature neurons in Domain 2 (inner) and an enrichment of mitral and tufted cells in Domain 4 (MCL). Domain 8 (ONL) was primarily enriched in olfactory sensory neurons (OSNs); OSNs express odorant receptors in the olfactory epithelium⁵⁴. We additionally quantified the association between the inferred cell type proportions and the domain labels estimated by PRECAST and the other methods using MacFadden’s adjusted R² (Fig. 4e, top panel), and PRECAST achieved the highest R². Then, we manually annotated each spot with the cell type present at the highest proportion⁵⁵ and quantitatively evaluated the clustering performance of PRECAST and other methods. As shown in Fig. 4e (bottom panel) and Supplementary Fig. S42b, c, PRECAST achieved the highest ARI and NMI values assessed separately for each slide or jointly for the combined slides. Further analyses showed that PRECAST achieved better data integration, with the highest iLISI and the lowest cLISI (Supplementary Fig. S42d), and maintained comparable conditional Pearson’s correlations with the other methods (Supplementary Fig. S42e).
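The annotation-and-scoring step described above can be sketched as follows. This is a minimal illustration rather than the evaluation code used in the study: each spot is labeled with the cell type of highest deconvolved proportion, and the resulting labels are compared with cluster labels using the standard adjusted Rand index. The function names, the toy proportions, and the toy domain labels are all hypothetical.

```python
from collections import Counter
from math import comb


def dominant_cell_type(proportions):
    """Return the cell type with the highest deconvolved proportion for one spot."""
    return max(proportions, key=proportions.get)


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same spots."""
    n = len(labels_a)
    if n != len(labels_b):
        raise ValueError("both labelings must cover the same spots")
    if n < 2:
        return 1.0
    # Pair counts within contingency-table cells and within each marginal.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:  # degenerate case, e.g., one cluster in both labelings
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


# Toy deconvolution output for four spots (hypothetical values).
spot_proportions = [
    {"GC-1": 0.7, "OSNs": 0.2, "Immature": 0.1},
    {"GC-1": 0.6, "OSNs": 0.1, "Immature": 0.3},
    {"GC-1": 0.1, "OSNs": 0.8, "Immature": 0.1},
    {"GC-1": 0.2, "OSNs": 0.5, "Immature": 0.3},
]
annotation = [dominant_cell_type(p) for p in spot_proportions]
cluster_labels = [2, 2, 8, 8]  # hypothetical domain labels
ari = adjusted_rand_index(annotation, cluster_labels)
```

Because the index depends only on how spots are grouped, the cell-type names and the numeric domain labels do not need to share a coding scheme.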

We performed trajectory inference to examine the cell lineages among the detected domains using aligned embeddings and estimated domain labels from PRECAST. In general, the estimated trajectory showed an “inside-out” sequence, consistent with the general understanding that OB neurons migrate from the subventricular zone along the RMS to the OB, before migrating radially out of the RMS in an inside-out sequence⁵⁶ (Fig. 4f and Supplementary Fig. S43a). Further DE analysis identified genes along the pseudotime (Supplementary Fig. S44b).

Application to hepatocellular carcinoma Visium data

To study the dynamics of tumorigenesis in tumors and tumor-adjacent tissues, we further analyzed four slides of in-house HCC data generated using the 10x Visium platform, with two slides from tumors (HCC1 and HCC2) and two from tumor-adjacent tissues (HCC3 and HCC4) from an HCC patient. The median number of spots was 2748, and the median number of genes per spot was 3635. Figure 5a shows a histology image (top panel) with manual annotations for tumor/normal epithelium (TNE) and stroma provided by a pathologist (bottom panel). Consistent with the DLPFC data, the RGB plots generated by PRECAST clearly segregated the tissue slices into multiple spatial domains (Fig. 5b, top panel), with neighboring spots and spots in the same domain across multiple slides more closely sharing similar RGB colors than those generated by other methods (Supplementary Fig. S44a, b). Similarly, the PRECAST spatial heatmaps of clustering assignment for the four tissue slides resembled the corresponding RGB plots and presented more spatial patterns across the tissue slides than those from the other methods (Fig. 5b, bottom panel; Supplementary Fig. S44c). We further visualized the inferred embeddings for biological effects between cell/domain types using two components from tSNE for each method (Fig. 5c and Supplementary Fig. S45a), in which the tSNE plots for PRECAST showed that spots from different slices were mixed well (Fig. 5c, top panel) while the domain clusters were well segregated (Fig. 5c, bottom panel), with significant improvements in visualization. A heatmap of Pearson’s correlation coefficients among the detected domains shows the good separation of estimated embeddings across domains (Supplementary Fig. S45b), in which correlations between regions in TNE were high, and correlations between regions for TNE and stroma were low. 
We further visualized the spatial dependence due to microenvironment variations using RGB plots of the inferred slide-specific embeddings (Supplementary Fig. S46a) and observed variations in microenvironmental patterns between Domains 1 and 3, and between Domains 4 and 5.

Fig. 5 — a Top panel: H&E images from four tissue slides. Bottom panel: Manual annotation by a pathologist of four tissue slides. b Top panel: UMAP RGB plots of PRECAST for four tissue slides. Bottom panel: Clustering assignment heatmaps for four tissue sections by PRECAST. Color scheme for clustering assignment heatmap in PRECAST is the same as in (c), (f), and (g). c tSNE plots for four data integration methods with the right-most column showing analysis without correction; domains are labeled as in (e) and (f). TNE, tumor/normal epithelium. d Spatial heatmap of deconvoluted cell proportions in malignant cells, immune cells, and HPC-like cells. e Percentage of different cell types in each domain detected by PRECAST, with scaling to the summation of all cell types across all domains equal to 100%. f PC plot of estimated RNA velocity. g Heatmap of genes with expression change in latent time.

To characterize the transcriptomic properties of the spatial domains identified by PRECAST, we performed DE analysis of the combined four tissue slides (see Methods). In total, we detected 2093 DE genes with adjusted p-values of less than 0.001 in all nine spatial domains detected by PRECAST, with 539 genes being specific to Domains 1–5, which corresponded to TNE regions (Supplementary Data 6). A heatmap and ridge plots of the findings showed the good separation of the DE genes across different spatial domains (Supplementary Figs. S46b, S47–S48). In TNE regions (Domains 1–5), we further found that genes specific to Domains 1 and 5 were highly enriched in pathways of chemical carcinogenesis DNA adducts and chemical carcinogenesis receptor activation. Genes specific to Domain 4 were enriched in signaling pathways of RAF1 mutants and signaling by RAS mutants, and genes specific to Domains 2 and 3 were highly enriched in complement and coagulation cascade pathways (Supplementary Fig. S49). Genes specific to Domains 6–9 were enriched in the angiogenesis pathway, and we found 40 out of 427, 34 out of 606, 25 out of 281, and 18 out of 240 angiogenesis signature genes, respectively, in Domains 6–9, from multiple studies^57–60. Interestingly, Domains 1–3 were only present in the tumor tissues (HCC1 and HCC2), Domain 5 was only present in the tumor-adjacent tissues (HCC3 and HCC4), while Domain 4 was shared across the tumor and tumor-adjacent tissues.

To identify SVGs other than those that were merely relevant to domain differences, we performed SVA analysis with the embeddings estimated by PRECAST as covariates for each slice. A detailed list of genes identified at an FDR of 1% is available in Supplementary Data 7. By performing functional enrichment analysis of these SVGs, we detected SVGs adjusted for domain-relevant covariates to be highly enriched in many common pathways in the four HCC slices, e.g., cytoplasmic translation, and cytosolic ribosome (Supplementary Figs. S50–S51).

Next, to examine the cell compositions of each spatial domain detected by PRECAST, we performed cell-type deconvolution analysis of all four HCC slides using scRNA-seq data as the reference panel (see “Methods”). The scRNA-seq reference panel consisted of malignant and tumor microenvironment cells, including cancer-associated fibroblasts (CAFs), tumor-associated macrophages (TAMs), tumor-associated endothelial cells (TECs), cells of an unknown entity but expressing hepatic progenitor cell markers (HPC-like), and immune cells⁶¹. As shown in Fig. 5d, e and Supplementary Fig. S52a, the proportions of malignant cells were substantially higher in Domains 1–5, while HPC-like cells were seen at higher proportions in Domain 7. In Domain 6, we observed an increased proportion of TAMs and immune cells, and the genes specific to this domain included TGFB1 and MMP2, which have been used for the classification of TAMs^62,63.

The estimated aligned embeddings and cluster labels can also be used in RNA velocity analysis to investigate the directed transcriptional dynamics of tumorigenesis when spliced and unspliced mRNA are available (see “Methods”). Interestingly, two cell lineages were identified, with one that originated in Domain 2 (TNE) and spread to Domains 4 and 5 (TNE), followed by Domains 6, 7, and 9 (stroma), and the other that originated in Domains 1 and 3 (TNE) and spread to Domain 8 (stroma). The TNE regions in Domain 2 residing in HCC1 may play a key role in tumorigenesis with Domain 4 (TNE) shared among the four slides and Domain 5 (TNE) in the tumor-adjacent tissues (Fig. 5f and Supplementary Fig. S52b). To infer cell states in the identified TNE, we further performed RNA velocity analysis using the spots identified in Domains 1–5. An expression heatmap of the top genes associated with cell states with induction close to 0 and repression close to 1 is shown in Fig. 5g (see “Methods”). The TNEs in Domains 1 and 2 tended to be transcriptionally active, while TNEs in Domains 4 and 5 tended to be repressed and show no transcription. The top genes associated with cell states included SPINK1, RPL30, and IL32, which highlights the importance of genes associated with cell states in HCC^64–66.

Discussion

PRECAST takes, as input, matrices of normalized expression levels and the physical location of each spot across multiple tissue slides. The output of PRECAST comprises aligned embeddings for cellular biological effects, slide-specific embeddings that capture spatial dependence in neighboring cells/spots, and estimated cluster labels. In contrast to other existing methods of data integration, PRECAST is a unified and principled probabilistic model that simultaneously estimates embeddings for cellular biological effects, performs spatial clustering, and more importantly, aligns the estimated embeddings across multiple tissue sections. Thus, we recommend applying PRECAST before a comprehensive data analysis pipeline is deployed. After applying PRECAST, the aligned embeddings and estimated cluster labels can be used for many types of downstream analyses, such as visualization, trajectory analysis, and SVA and DE analysis for combined tissue slices. In addition, we developed a module to further remove batch effects across multiple tissue slides based on housekeeping genes, making expression data comparable for different cell/domain clusters. This module is also applicable to the examination of expression differences caused by multiple conditions when such information on tissue slides is obtainable.

PRECAST simultaneously performs dimension reduction and spatial clustering while using simple projections to align embeddings, and uniquely estimating embeddings that capture spatial dependence of neighboring cells/spots due to varied microenvironments. Recently, Liu et al.³⁴ showed that, compared with methods that perform dimension reduction and spatial clustering sequentially^18–20, joint methods can estimate embeddings for cellular biological effects more efficiently while accounting for the uncertainty in obtaining low-dimensional features from sequential analysis. A similar strategy has been described in previous self-supervised learning literature^67,68. PRECAST also takes advantage of CAR to account for the local microenvironments of neighboring spots, and an intrinsic CAR component has been used to promote spatial smoothness in the observed expressions of SRT data⁶⁹. We showed that, by projecting non-cellular biological effects onto cellular biological space, different sample slides exhibit a constant shift in the centroid of each cell/domain type. With the assistance of joint modeling, we can use a subclass of CAR, intrinsic CAR, to simultaneously account for both smoothness in neighboring embeddings and shift across batches due to non-cellular biological effects such as complex batch effects.

With the advent of high-throughput technologies for SRT, data integration is particularly relevant for analyzing SRT datasets from multiple tissue slides. Analysis of a single section with current state-of-the-art techniques, e.g., 10x Visium and Slide-seqV2, only covers a tiny area of the region of interest, and it takes a few or dozens of slides to cover the whole tissue/organ. In this case, biological variations between cell/domain types are often confounded by factors related to data generation processes. Methods in data integration, which serve as the first step before downstream analyses, not only align the embeddings but also estimate the shared cell/domain clusters across samples^23–26. Most existing methods of data integration were designed to analyze scRNA-seq data without considering additional spatial information in the SRT data.

We examined the SRT data generated by three major platforms, 10x Visium, ST, and Slide-seqV2, with different spatial resolutions. With 10x Visium, mRNA is captured in 55-μm-diameter spots, while Slide-seqV2 achieves a higher resolution with 10-μm-diameter spots. As expected, with low-resolution datasets, PRECAST recovers aligned embeddings for cellular biological effects and cluster labels with smoother spatial patterns while losing detailed local spatial information, whereas at near-single-cell resolution, the identified aligned embeddings and cluster labels show more fine-scale cell-type distribution patterns. Using four datasets, we demonstrated that PRECAST can successfully perform data integration with aligned embeddings across tissue sections, such that spots across different tissue sections are well mixed while cell/domain clusters are well segregated, improve clustering performance, and detect varied microenvironments in tissues. When applied to an HCC dataset, PRECAST identified five spatial domains belonging to TNE cells that were consistent with both manual annotations and spatial deconvolution. We further performed RNA velocity analysis to show the potential velocity of spots in different regions, shedding light on tumorigenesis in the context of the tissues. To demonstrate the scalability of PRECAST, we analyzed a Slide-seqV2 dataset of 16 slides equally distributed along the anterior-posterior axis of a mouse OB.

PRECAST provides opportunities for new exciting research routes. Firstly, when SRT datasets of single-cell resolution are available, its use can be extended to the integration of multimodal single-cell data. For example, integrating single-cell ATAC-seq will allow users to examine cell-type-specific regulatory mechanisms in the spatial context of tissues. Secondly, it would be interesting to integrate single-cell-resolution SRT datasets with CITE-seq data, thus integrating spatial transcriptomics with immunophenotyping, in which surface proteins are detected by antibody-derived tags. This would enable the exploration of surface proteins not measured in SRT datasets.

Methods

PRECAST model

Here, we present a basic overview of PRECAST, and further details are available in the Supplementary Notes. PRECAST is a data integration method for SRT data from multiple tissue slides. The proposed method involves simultaneous dimension reduction and spatial clustering built on a hierarchical model with two layers, as shown in Fig. 1a. The first layer, the dimension-reduction step, relates gene expression to the shared latent embeddings, while the second layer, the spatial-clustering step, relates the shared latent embeddings and spatial coordinates to the cluster labels. In the dimension-reduction step, an intrinsic CAR model captures the spatial dependence induced by neighboring microenvironments in the low-dimensional embedding space, while in the spatial-clustering step, a Potts model promotes spatial smoothness in the cluster label space. Using simple projections of the batch effects and/or biological effects of slides onto the space of biological effects between cell/domain types, PRECAST aligns cell/domain clusters across multiple tissue slides with the shared distributions of embeddings of each cell/domain and detects their cell/domain labels. With M SRT datasets, we observe an n_r × p normalized expression matrix $X_{r} = (x_{r1}, \dots, x_{ri}, \dots, x_{rn_{r}})^{T}$ for each sample r (= 1, ⋯, M), where $x_{ri} = (x_{ri1}, \dots, x_{rip})^{T}$ is a p-dimensional normalized expression vector for each spot $s_{ri} \in \mathbb{R}^{2}$ of sample r on square or hexagonal lattices, among others; while the cluster label of spot s_ri, y_ri ∈ {1, ⋯, K}, and q-dimensional shared embeddings, z_ri’s, are unavailable. Without loss of generality, we assume that, for each sample r, x_ri is centered, and PRECAST models the centered normalized expression vector x_ri with its latent low-dimensional feature, z_ri, and class label, y_ri, as

$$x_{ri} = W (z_{ri} + v_{ri}) + \varepsilon_{ri}, \quad \varepsilon_{ri} \sim N(0, \Lambda_{r}), \tag{1}$$

$$z_{ri} \mid y_{ri} = k \sim N(\mu_{k}, \Sigma_{k}), \tag{2}$$

where Λ_r = diag(λ_r1, ⋯, λ_rp) is a diagonal matrix for residual variance, $W \in \mathbb{R}^{p \times q}$ is a loading matrix that relates the p-dimensional expression vector to the q-dimensional embeddings shared across the M datasets, $\mu_{k} \in \mathbb{R}^{q \times 1}$ and $\Sigma_{k} \in \mathbb{R}^{q \times q}$ are the mean vector and covariance matrix for the kth cluster, respectively, and v_ri is a q-dimensional slide-specific latent vector that captures the spatial dependence among neighboring spots and aligns embeddings across datasets. Equation (1) relates the high-dimensional expression vector (x_ri) of p genes to a low-dimensional feature (z_ri) via a probabilistic PCA model⁷⁰ that accounts for spatial dependence, while Eq. (2) is a Gaussian mixture model (GMM)⁷¹ for this latent feature among all spots across the M datasets. To promote spatial smoothness in the space of cluster labels, we assume each latent class label, y_ri, is interconnected with the class labels of its neighborhoods via a discrete hidden Markov random field (HMRF). In detail, we use the following Potts model⁷² for the latent labels,

$$P(y_{r}) = C_{r}(\beta_{r})^{-1} \exp\left\{ -\frac{1}{2} \sum_{i} \sum_{i' \in N_{ri}} \beta_{r} \left(1 - \delta(y_{ri}, y_{ri'})\right) \right\},$$

where C_r(β_r) is a normalization constant that does not have a closed form, N_ri is the neighborhood of spot s_ri in sample r, and β_r is the sample-specific smoothing parameter that captures the label similarity among the neighboring spots. For the vector v_ri, which captures spatial dependence in the embedding space, we instead assume a continuous multivariate HMRF. In detail, we assume an intrinsic CAR model⁷³ for v_ri

$$v_{ri} \mid v_{[n_{r}] \setminus i} \sim N(\mu_{v_{ri}}, m_{ri}^{-1} \Psi_{r}),$$

where the subscript $[n_{r}] \setminus i$ denotes all spots but s_ri in sample r, m_ri is the number of neighbors of spot i in sample r, $\mu_{v_{ri}} = m_{ri}^{-1} \sum_{i' \in N_{ri}} v_{ri'}$ is the conditional mean relevant to the neighbors of spot s_ri, and Ψ_r is a q × q conditional covariance matrix for the elements of v_ri. Conventionally, the joint distribution of the intrinsic CAR model is non-identifiable, as the mean of the joint distribution of intrinsic CAR is not zero. As shown in the next section, non-cellular biological effects, e.g., batch effects, from each slide can be projected onto the cellular biological space, which can be corrected by the non-zero-mean property of intrinsic CAR.
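To make the two random-field components concrete, the following sketch evaluates the unnormalized Potts log-probability of a labeling and the intrinsic CAR conditional mean and covariance for one spot. It is an illustration under stated assumptions, not the PRECAST implementation: the neighbor dictionary, the toy embeddings, the values of β_r and Ψ_r, and the function names are all hypothetical, and the normalization constant C_r(β_r) is deliberately left out because it has no closed form.

```python
import numpy as np


def potts_log_potential(labels, neighbors, beta):
    """Unnormalized log-probability of a labeling under the Potts model.

    Computes -0.5 * sum_i sum_{i' in N_i} beta * (1 - delta(y_i, y_i')),
    omitting the intractable normalization constant C_r(beta_r).
    """
    total = 0.0
    for i, nbrs in neighbors.items():
        for j in nbrs:
            total += beta * (1.0 - float(labels[i] == labels[j]))
    return -0.5 * total


def car_conditional(v, neighbors, psi, i):
    """Conditional mean and covariance of v_i given all other spots (intrinsic CAR)."""
    nbrs = neighbors[i]
    m_i = len(nbrs)
    mean = v[nbrs].mean(axis=0)  # average of the neighboring embeddings
    cov = psi / m_i              # m_i^{-1} * Psi_r
    return mean, cov


# Four spots on a line; each spot's neighbors are the adjacent spots (toy layout).
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
labels = [0, 0, 1, 1]
log_potential = potts_log_potential(labels, neighbors, beta=1.5)

v = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 1.0], [2.0, 2.0]])  # q = 2 embeddings
psi = np.eye(2)
cond_mean, cond_cov = car_conditional(v, neighbors, psi, i=1)
```

In this toy layout only the pair of spots 1 and 2 disagrees. It is counted once from each side and then halved, so the log-potential equals −β_r.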

Projections of non-cellular biological effects

For each tissue slide, we assume the normalized expressions in each sample r can be decomposed into additional parts with respect to non-cellular biological effects as follows:

$$x_{ri} = W (z_{ri} + \nu_{ri}) + W_{r} \zeta_{ri} + \varepsilon_{ri},$$

where ν_ri is a q-dimensional vector that captures the spatial dependence among neighboring spots; $W_{r} \in \mathbb{R}^{p \times \tilde{q}}$ is a loading matrix for a factor related to non-cellular biological effects; and ζ_ri, independent of (z_ri, ν_ri), is the corresponding $\tilde{q}$-dimensional vector. Assuming cell biological space (W) and non-cellular biological space (W_r) are non-orthogonal, we can project the $W_{r}$'s onto the column space of W, i.e., $\hat{W}_{r} = W (W^{T} W)^{-1} W^{T} W_{r}$, and then rewrite the normalized expressions as the following

$$x_{ri} \approx W (z_{ri} + \nu_{ri} + \tilde{W}^{T} W_{r} \zeta_{ri}) + \varepsilon_{ri},$$

where $\tilde{W} = W (W^{T} W)^{-1}$. We denote $v_{ri} = \nu_{ri} + \tilde{W}^{T} W_{r} \zeta_{ri}$, $\mu'_{v_{ri}} = E(\nu_{ri} \mid \nu_{[n_{r}] \setminus i})$, $m_{ri}^{-1} \Psi'_{r} = \mathrm{var}(\nu_{ri} \mid \nu_{[n_{r}] \setminus i})$, then the conditional mean and variance in the intrinsic CAR component can be written as

$$\begin{aligned} E(v_{ri} \mid v_{[n_{r}] \setminus i}) &= \mu'_{v_{ri}} + \tilde{W}^{T} W_{r} E(\zeta_{ri}) \equiv \mu_{v_{ri}}, \\ \mathrm{var}(v_{ri} \mid v_{[n_{r}] \setminus i}) &= m_{ri}^{-1} \Psi'_{r} + \tilde{W}^{T} W_{r} \, \mathrm{var}(\zeta_{ri}) \, W_{r}^{T} \tilde{W} \equiv m_{ri}^{-1} \Psi_{r}, \end{aligned}$$

where we assume $\mathrm{var}(\zeta_{ri}) = m_{ri}^{-1} \Psi''_{r}$. Then, the projected approximated model can be written as

$$x_{ri} \approx W (z_{ri} + v_{ri}) + \varepsilon_{ri}, \quad r = 1, 2, \dots, M,$$

where $z_{ri} \mid y_{ri} = k \sim N(\mu_{k}, \Sigma_{k})$, and $v_{ri} \mid v_{[n_{r}] \setminus i} \sim N(\mu_{v_{ri}}, m_{ri}^{-1} \Psi_{r})$.
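The projection step above can be checked numerically. The sketch below uses random toy matrices (the dimensions, the seed, and all variable names are assumptions for illustration only) to show that W W̃ᵀ W_r is the orthogonal projection of W_r onto the column space of W, so the slide-specific effect W_r ζ_ri is approximated by W applied to a q-dimensional shift.

```python
import numpy as np

rng = np.random.default_rng(0)
p, q, q_tilde = 50, 5, 3             # toy dimensions (assumed for illustration)
W = rng.normal(size=(p, q))          # loading matrix for cellular biological effects
W_r = rng.normal(size=(p, q_tilde))  # slide-specific loading matrix
zeta = rng.normal(size=q_tilde)      # non-cellular factor for one spot

W_tilde = W @ np.linalg.inv(W.T @ W)  # W (W^T W)^{-1}
W_r_hat = W @ W_tilde.T @ W_r         # projection of W_r onto the column space of W

# The slide effect W_r zeta is approximated by W times a q-dimensional shift.
shift = W_tilde.T @ W_r @ zeta
approx_effect = W @ shift
residual = W_r @ zeta - approx_effect  # part of the effect outside col(W)
```

The residual is orthogonal to every column of W, which is why the approximation discards only the component of the non-cellular effect that the shared loading matrix cannot represent.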

Recovery of comparable gene expression matrices

Once the estimated cluster labels are obtained by PRECAST, we can remove unwanted variations using a set of housekeeping genes as negative control genes that are not affected by other biological effects⁷⁴. In this study, we used a set of mouse/human housekeeping genes from the Housekeeping and Reference Transcript Atlas⁷⁵. First, we obtained vectors, $\tilde{x}_{ri} = (\tilde{x}_{ri1}, \dots, \tilde{x}_{riL})$, of the expression of L housekeeping genes by matching the set of housekeeping genes and genes passing QC for each dataset. By performing PCA, we obtained the top 10 PCs, $\hat{h}_{ri}$, as covariates to adjust for unwanted variation. One of the outputs of PRECAST, the posterior probability of y_ri ($\hat{r}_{ri} \in \mathbb{R}^{K}$), can be used as the design matrix to explain biological variation between cell/domain types. Finally, we used a linear model for the normalized gene expression vector

$$x_{ri} = \alpha \hat{r}_{ri} + \gamma \hat{h}_{ri} + \varepsilon_{ri}, \tag{7}$$

where α is a p × K matrix for biological effects between cell/domain types and γ is a p × 10 matrix of regression coefficients associated with the unwanted factors. After obtaining the parameter estimates in Eq. (7), users can remove batch effects from the original normalized gene expression using

$$\hat{x}_{ri} = x_{ri} - \hat{\gamma} \hat{h}_{ri}.$$

This strategy can also be applied to samples from multiple biological conditions when such information is available. This can be achieved by adding additional covariates for biological conditions in Eq. (7).
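The regression-based correction can be sketched as below. This is a minimal illustration, assuming ordinary least squares as the fitting procedure (the text specifies the linear model but not the estimator) and using noise-free toy data; the function names, matrix sizes, and the use of hard labels in place of posterior probabilities are all assumptions made for the example.

```python
import numpy as np


def top_pcs(hk_expr, n_pcs=10):
    """Leading principal-component scores of a spots-by-genes matrix."""
    centered = hk_expr - hk_expr.mean(axis=0, keepdims=True)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_pcs] * s[:n_pcs]


def remove_unwanted_variation(x, r_hat, h_hat):
    """Fit x = alpha r_hat + gamma h_hat + eps by least squares, then subtract gamma_hat h_hat."""
    design = np.hstack([r_hat, h_hat])                 # n x (K + n_pcs)
    coef, *_ = np.linalg.lstsq(design, x, rcond=None)  # (K + n_pcs) x p
    gamma_hat = coef[r_hat.shape[1]:].T                # p x n_pcs
    return x - h_hat @ gamma_hat.T, gamma_hat


# Noise-free toy data (all sizes and values are illustrative).
rng = np.random.default_rng(1)
n, p, K, L, n_pcs = 300, 6, 3, 20, 4
r_hat = np.eye(K)[rng.integers(0, K, size=n)]    # hard labels standing in for posteriors
h_hat = top_pcs(rng.normal(size=(n, L)), n_pcs)  # PCs of simulated housekeeping genes
alpha = rng.normal(size=(p, K))
gamma = rng.normal(size=(p, n_pcs))
x = r_hat @ alpha.T + h_hat @ gamma.T

x_clean, gamma_hat = remove_unwanted_variation(x, r_hat, h_hat)
```

Including the cluster-membership term in the fit is what keeps the correction from absorbing differences between cell/domain types. Covariates for biological conditions could be appended to the design matrix in the same way.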

Differential expression analysis and enrichment analysis

After removing unwanted variation from the gene expression matrices for multiple slides, we performed DE analysis and enrichment analysis on all four datasets. In detail, we used the FindAllMarkers function in the R package Seurat with default settings to detect the differentially expressed genes for each domain detected by PRECAST. Domain-specific DE genes were defined as those with adjusted p-values of less than 0.001 and a log-fold change of greater than 0.25. After obtaining a set of genes specific to each domain, we performed gene set enrichment analysis (GSEA) on the set of detected DE genes using g:Profiler with the g:SCS multiple testing correction method and applying a significance threshold of 0.05⁷⁶.
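As a small illustration of the two thresholds, the sketch below filters a marker table and groups the surviving genes by domain. The field names (`gene`, `cluster`, `p_val_adj`, `avg_log2FC`) are assumptions modeled on a typical marker table exported from R, and the rows are made-up placeholder values, not results from the study.

```python
def domain_specific_genes(markers, p_adj_cutoff=0.001, logfc_cutoff=0.25):
    """Group marker genes by domain after applying both thresholds."""
    selected = {}
    for row in markers:
        if row["p_val_adj"] < p_adj_cutoff and row["avg_log2FC"] > logfc_cutoff:
            selected.setdefault(row["cluster"], []).append(row["gene"])
    return selected


# Placeholder marker table (hypothetical genes and values).
markers = [
    {"gene": "GeneA", "cluster": 1, "p_val_adj": 1e-8, "avg_log2FC": 1.20},
    {"gene": "GeneB", "cluster": 1, "p_val_adj": 1e-8, "avg_log2FC": 0.10},  # fold change too small
    {"gene": "GeneC", "cluster": 5, "p_val_adj": 5e-4, "avg_log2FC": 0.40},
    {"gene": "GeneD", "cluster": 5, "p_val_adj": 0.02, "avg_log2FC": 2.00},  # not significant
]
de_genes = domain_specific_genes(markers)
```

The per-domain gene lists produced this way are the kind of input that would then be passed to the enrichment analysis.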

Conditional SVG analysis

After obtaining aligned embeddings and domain labels for DLPFC and HCC Visium data using PRECAST, we detected SVGs while adjusting for the estimated aligned embeddings as covariates, to investigate the role of SVGs beyond differences between cell/domain types. In detail, we used the function spark.vc in the R package SPARK to identify SVGs adjusted for cell/domain-relevant covariates for each sample. Finally, an FDR of 1% was adopted to identify the significant SVGs.

Trajectory inference/RNA velocity analysis

To further investigate the development and differentiation of the spatial domains/cells identified by PRECAST, we used the aligned embeddings and domain clusters estimated by PRECAST to perform trajectory inference, or RNA velocity analysis if spliced and unspliced counts were available. For the DLPFC Visium data, mouse liver ST data, and mouse OB Slide-seqV2 data, we conducted trajectory inference for the combined spots across all tissue sections using Slingshot⁷⁷. We passed the aligned embeddings and domain clusters estimated by PRECAST to the function slingshot in the R package slingshot. Because spliced and unspliced counts were available for the HCC Visium dataset, we ran RNA velocity analysis using the scvelo.tl.velocity function in the Python module scvelo based on the spliced and unspliced count matrices; the domain clusters estimated by PRECAST were then used to visualize the inferred RNA velocity and latent time.

Cell-type deconvolution analysis

We performed deconvolution analysis using Robust Cell Type Decomposition (RCTD)⁵⁵, a supervised learning method used to decompose each spatial transcriptomics pixel into a mixture of individual cell types while accounting for platform effects. We leveraged the results of deconvolution analysis for better biological interpretation of the real data analysis.

For mouse liver ST data, we used scRNA-seq data on liver tissue sections from an adult mouse from the MCA⁴⁶. By removing cell types with cell numbers of less than 18, the reference data retained 4640 cells belonging to 17 cell types: B cells with high Jchain expression (n = 43), erythroid cells with high Hbb-bs expression (n = 555), periportal hepatocytes (n = 26), hepatocytes with high Fabp1 expression (n = 149), Kupffer cells (n = 1046), erythroid cells with high Hbb-bt expression (n = 62), endothelial cells (n = 1196), granulocytes (n = 194), NK cells (n = 114), macrophages (n = 191), dendritic cells (n = 422), B cells with high Fcmr expression (n = 97), T cells (n = 219), plasmacytoid dendritic cells (n = 90), hepatocytes with high Spp1 expression (n = 99), pericentral hepatocytes (n = 119), and hepatic stellate cells (n = 18).

For the mouse OB Slide-seqV2 data, we used 10X Genomics Chromium scRNA-seq data collected from the wild-type OB⁵². The reference data included 17,453 cells belonging to 40 cell types: three astrocyte types (Astro1: n = 1087; Astro2: n = 129; and Astro3: n = 599), two endothelial types (EC1: n = 897 and EC2: n = 445), two mesenchymal types (Mes1: n = 200 and Mes2: n = 65), three microglial types (MicroG1: n = 255; MicroG2: n = 766; and MicroG3: n = 294), monocytes (n = 238), two mural types (Mural1: n = 43 and Mural2: n = 272), myelinating-oligodendrocyte cells (n = 294), macrophages (n = 197), five olfactory ensheathing cell types (OEC1: n = 812; OEC2: n = 664; OEC3: n = 1073; OEC4: n = 1083; and OEC5: n = 281), oligodendrocyte progenitor cells (OPC: n = 137), red blood cells (RBCs: n = 103), and 18 neural subtypes (OSNs: n = 467; PGC-1: n = 437; PGC-2: n = 146; PGC-3: n = 106; GC-1: n = 540; GC-2: n = 1245; GC-3: n = 96; GC-4: n = 80; GC-5: n = 516; GC-6: n = 89; GC-7: n = 293; Immature: n = 1150; Transition: n = 733; Astrocyte-Like: n = 1078; M/TC-1: n = 24; M/TC-2: n = 44; M/TC-3: n = 411; and EPL-IN: n = 64). For more details of the cell types, please refer to Tepe et al.⁵².

For the HCC Visium data, we leveraged a droplet-based scRNA-seq dataset collected from HCC and intra-hepatic cholangiocarcinoma patients to serve as the reference data for deconvolution⁶¹. After filtering out 92 cells of unclassified cell type, the reference data contained 5023 cells belonging to seven cell types: B cells (n = 598), CAFs (n = 724), HPC-like cells (n = 254), malignant cells (n = 702), T cells (n = 1429), TAMs (n = 437), and TECs (n = 879). After acquiring the deconvolution results from the above-mentioned reference, we combined the proportions of B cells, T cells, and TECs into a single set and referred to this as the immune cell proportion.

Comparisons of methods

We conducted comprehensive simulation and real data analyses to compare PRECAST with existing methods of data integration and clustering.

We applied the following integration methods to benchmark the data integration performance of PRECAST: (1) Seurat V3²⁶; (2) Harmony²³ implemented in the R package harmony; (3) fastMNN²⁴ implemented in the R package batchelor; (4) Scanorama²⁵ implemented in the Python module scanorama; (5) scGen²⁸; (6) scVI²⁷ implemented in the Python module scvi; (7) MEFISTO²⁹ implemented in the Python module mofax; and (8) PASTE³⁰ implemented in the Python module paste. In the real data analyses, PRECAST and all other methods used the same list of selected genes from the preprocessing steps as input. The first six methods were designed for scRNA-seq integration, while MEFISTO can be used for scRNA-seq and SRT data integration, and PASTE is designed for integrating SRT data from multiple adjacent tissue slices into a single slice (see Supplementary Materials).

To evaluate clustering performance, we considered the following four methods using the extracted aligned embeddings as input: (1) SC-MEB implemented in the R package SC.MEB²⁰, (2) Louvain implemented in the R package igraph⁷⁸, (3) BayesSpace implemented in the R package BayesSpace⁷⁹, and (4) BASS implemented in the R package BASS⁸⁰. SC-MEB and BayesSpace were recently developed to perform spatial clustering based on a discrete Markov random field^20,79, and Louvain is a conventional non-spatial clustering algorithm based on community detection in large networks⁷⁸, while BASS is a recently developed clustering method for multiple SRT datasets based on the aligned embeddings from Harmony. In the implementation, we used the default parameter values of the respective packages (see Supplementary Materials).

Evaluation metrics

We evaluated the methods’ performances in data integration, obtaining embeddings for cellular biological effects, and spatial clustering using the following metrics.

Local inverse Simpson’s index

To assess performance in batch-effect removal, we used two variants of the local inverse Simpson’s index (LISI)²³, the cell-type LISI (cLISI) and the integration LISI (iLISI), to quantify the performance in merging the shared cell populations among tissue slides and mixing spots from M tissue slides. cLISI assigns a diversity score to each spot that represents the effective number of cell types in the neighborhood of that spot. For M datasets with a total of K cell types, accurate integration should maintain a cLISI value close to 1, reflecting the purity of the unique cell types in the neighborhood of each spot, as defined by the aligned embeddings. An erroneous embedding produces neighborhoods with a cLISI greater than 1, and in the worst case cLISI is close to K, suggesting that the neighborhood contains all K cell types. iLISI has a similar form and interpretation to cLISI but is based on the sample index rather than the cell-type clusters. Thus, a larger iLISI value indicates more thorough mixing of the different samples.
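To make the "effective number of labels" reading concrete, the following minimal sketch computes an inverse Simpson's index over the k nearest neighbors of each spot. It is a simplification: neighbors are weighted equally, whereas the LISI of Korsunsky et al.²³ uses perplexity-based kernel weights. The function names, the toy coordinates, and the choice k = 3 are illustrative assumptions, not part of the original analysis.

```python
import math
from collections import Counter


def inverse_simpson(labels):
    """Inverse Simpson's index: the effective number of distinct labels."""
    n = len(labels)
    return 1.0 / sum((c / n) ** 2 for c in Counter(labels).values())


def simple_lisi(embeddings, labels, k=3):
    """Unweighted k-nearest-neighbor LISI for every spot (simplified sketch)."""
    scores = []
    for i, point in enumerate(embeddings):
        # Rank all other spots by Euclidean distance in the embedding space.
        others = sorted(
            (j for j in range(len(embeddings)) if j != i),
            key=lambda j: math.dist(point, embeddings[j]),
        )
        scores.append(inverse_simpson([labels[j] for j in others[:k]]))
    return scores


# Toy embedding: two well-separated domains, each containing spots from two slides.
embeddings = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
cell_types = ["L1"] * 4 + ["L2"] * 4
slides = ["s1", "s2"] * 4

clisi = simple_lisi(embeddings, cell_types)  # 1.0 everywhere: pure neighborhoods
ilisi = simple_lisi(embeddings, slides)      # 1.8 everywhere: slides are mixed
```

In this toy example every neighborhood contains a single cell type (cLISI = 1) while both slides are represented in every neighborhood (iLISI = 1.8, out of a maximum of 2 for two slides).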

Silhouette coefficient

To simultaneously evaluate the separation of each cell/domain cluster and mixing of multiple datasets, we calculated the average silhouette coefficient of the SRT datasets using two different groupings: (1) grouping using known cell types as the cell/domain-type silhouette coefficient (silh_cluster) and (2) grouping using different datasets as the batch silhouette coefficient (silh_batch). In the data integration, a larger value of silh_cluster indicates better preservation of the biological signals between cell/domain types, while a smaller value of silh_batch suggests better mixing of datasets. These two metrics can be summarized using the F1 score as follows⁸¹:

$$\text{F1 score} = \frac{2 (1 - {silh}_{batch}^{'}) \, {silh}_{cluster}^{'}}{{silh}_{cluster}^{'} + (1 - {silh}_{batch}^{'})} \in [0, 1],$$

where ${silh}_{batch}^{'} = \frac{1 + {silh}_{batch}}{2}$ and ${silh}_{cluster}^{'} = \frac{1 + {silh}_{cluster}}{2}$ . A larger F1 score suggests better data integration that preserves the biological variations between cell/domain types while removing other non-cellular biological variations across multiple tissues.
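As a worked illustration of this summary, the short sketch below rescales both silhouette coefficients from [−1, 1] to [0, 1] and combines them as a harmonic mean. The function names and the example input values are hypothetical; the zero-denominator guard is an added assumption for the degenerate case silh_cluster = −1 and silh_batch = 1.

```python
def rescale(silh):
    """Map a silhouette coefficient from [-1, 1] onto [0, 1]."""
    return (1.0 + silh) / 2.0


def silhouette_f1(silh_cluster, silh_batch):
    """Harmonic mean of silh_cluster' and (1 - silh_batch')."""
    cluster_term = rescale(silh_cluster)
    batch_term = 1.0 - rescale(silh_batch)
    denominator = cluster_term + batch_term
    if denominator == 0.0:  # assumption: define the score as 0 in this edge case
        return 0.0
    return 2.0 * batch_term * cluster_term / denominator


# Hypothetical values: moderately separated domains, slightly mixed batches.
f1 = silhouette_f1(silh_cluster=0.5, silh_batch=-0.1)  # about 0.635
```

With silh_cluster = 0.5 and silh_batch = −0.1 the rescaled terms are 0.75 and 0.55, so the score is 2 × 0.55 × 0.75 / 1.3 ≈ 0.635.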

Canonical correlation coefficients and/or conditional correlation

For dimension reduction, we applied two measurements to assess the performance of true latent feature recovery in the simulation studies. The first was the mean canonical correlation between the estimated features and the true features, defined as

$$\mathrm{CCor} = \frac{1}{q} \sum_{l = 1}^{q} ζ_{l}(z_{i}, \hat{z}_{i}),$$

where ζ_l is the l-th canonical correlation coefficient. The second measure was the mean conditional correlation between gene expression x_i and cell type label y_i given the estimated latent features ${\hat{z}}_{i}$ , defined as

$$\mathrm{ConCor} = \frac{1}{p} \sum_{j = 1}^{p} \mathrm{corr}(y_{i}, \mathrm{resid}_{ij}),$$

where resid_ij is the residual from regressing x_ij on $\hat{z}_{i}$, and corr(y_i, resid_ij) is the Pearson correlation coefficient between y_i and resid_ij.

Adjusted Rand index and/or normalized mutual information

To evaluate performance in spatial clustering, we used ARI⁸² and NMI⁸³. ARI is the corrected version of the Rand index (RI)⁸⁴ and is used to avoid some of the drawbacks of RI⁸². ARI measures the similarity between two different partitions and ranges from −1 to 1. A larger value of ARI means a higher degree of similarity between two partitions. ARI takes a value of 1 when the two partitions are equal up to a permutation. NMI is a variant of the mutual information (MI) that normalizes the value of MI to within the range of 0 and 1. When the two partitions are equal up to a permutation, NMI also takes a value of 1.
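Both metrics can be computed from the contingency table of the two partitions. The sketch below is a minimal standard-library illustration rather than the implementation used in the study; in particular, the text does not state which normalization of MI was used, so the arithmetic mean of the two entropies is an assumption here, as are the function names and the toy labels.

```python
import math
from collections import Counter


def adjusted_rand_index(a, b):
    """ARI from the contingency table of two label vectors."""
    n = len(a)
    if n < 2:
        return 1.0

    def pairs(m):
        return m * (m - 1) / 2

    sum_cells = sum(pairs(c) for c in Counter(zip(a, b)).values())
    sum_a = sum(pairs(c) for c in Counter(a).values())
    sum_b = sum(pairs(c) for c in Counter(b).values())
    expected = sum_a * sum_b / pairs(n)
    maximum = 0.5 * (sum_a + sum_b)
    if maximum == expected:  # both partitions are trivial
        return 1.0
    return (sum_cells - expected) / (maximum - expected)


def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def normalized_mutual_information(a, b):
    """MI divided by the mean entropy (the normalization is an assumption)."""
    n = len(a)
    count_a, count_b = Counter(a), Counter(b)
    mi = sum(
        (c / n) * math.log(c * n / (count_a[x] * count_b[y]))
        for (x, y), c in Counter(zip(a, b)).items()
    )
    mean_entropy = 0.5 * (entropy(a) + entropy(b))
    return mi / mean_entropy if mean_entropy > 0 else 1.0


truth = [0, 0, 0, 1, 1, 1, 2, 2, 2]
relabelled = [2, 2, 2, 0, 0, 0, 1, 1, 1]   # same partition, permuted labels
ari_same = adjusted_rand_index(truth, relabelled)              # 1.0
nmi_same = normalized_mutual_information(truth, relabelled)    # 1.0 up to rounding
ari_crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])  # -0.5
```

The last call shows that ARI can be negative when two partitions agree less than expected by chance, whereas NMI for the same pair is 0.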

Simulations

Scenario 1. Raw gene expression count data with different scales of batch effects: domain labels and spatial coordinates from Potts models. For this scenario, we generated the raw gene expression count data for three samples, as well as the spatial coordinates based on Potts models with four neighborhoods. All quantities, such as the labels y_ris and latent features ν_ris, of spots were randomly simulated from the generative model (5). We generated the class label y_ri for each r = 1, 2, 3 and i = 1, ⋯ , n_r, with n_r ∈ {65 × 65, 60 × 60, 60 × 60}, corresponding to rectangular lattices from a K-state (K = 7) Potts model with the smoothing parameter β_r = 0.8 + 0.2(r − 1), using the function sampler.mrf from the R package GiRaF. Then, we generated latent features ν_ris from the CAR model with the same N_ri as defined in the Potts model and covariance matrix $Ψ_{r}^{'} = (σ_{r, i j})$ , with σ_r,ij = r(0.2r)^∣i−j∣, using the function rmatrixnorm in R package LaplacesDemon. We set different values for the smoothing parameter β_r and covariance matrices $Ψ_{r}^{'}$ across r, to mimic the heterogeneity of three samples. The domain labels y_ris and latent features ν_ris were fixed once they were obtained.
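The sample-specific settings in this scenario, β_r = 0.8 + 0.2(r − 1) and σ_r,ij = r(0.2r)^∣i−j∣, can be tabulated directly, as in the following sketch. It only evaluates the two formulas; it does not sample from the Potts or CAR models, for which the study used the R packages GiRaF and LaplacesDemon. The function names and the matrix dimension `dim` are illustrative assumptions.

```python
def smoothing_parameter(r):
    """Potts smoothing parameter beta_r for sample r = 1, 2, 3."""
    return 0.8 + 0.2 * (r - 1)


def covariance_matrix(r, dim):
    """Matrix with entries sigma_{r,ij} = r * (0.2 r)^|i - j|.

    `dim` is a hypothetical argument giving the number of rows and columns.
    """
    return [[r * (0.2 * r) ** abs(i - j) for j in range(dim)] for i in range(dim)]


betas = [smoothing_parameter(r) for r in (1, 2, 3)]           # about 0.8, 1.0, 1.2
psi_by_sample = {r: covariance_matrix(r, dim=3) for r in (1, 2, 3)}
```

Both the diagonal (equal to r) and the decay rate (0.2r) grow with r, so the three samples differ in the scale and the correlation of their latent features.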

Following this, we generated domain-relevant latent features z_ri in the model (5) from the conditional Gaussian distribution, such that z_ri∣y_ri = k ~ N(μ_k, Σ_k), where z_ri ∈ R^q with q = 10. Structures for μ_k and Σ_k are shown in Supplementary Data 8. Next, we generated $\tilde{W} = (\tilde{w}_{ij}, i \leq p, j \leq q)$ with each $\tilde{w}_{ij} \overset{i.i.d.}{\sim} N(0, 1)$, performed QR decomposition on $\tilde{W}$ such that $\tilde{W} = \tilde{Q} \tilde{R}$, and assigned $W = \tilde{Q}$, which is a column orthogonal matrix. Then, we generated a batch-loading matrix W_r by generating $\bar{W}_{r} = \bar{W} + \bar{E}_{r}$ with $\bar{W} = (\bar{w}_{ij}, i \leq p = 2000, j \leq q_{r})$, $q_{r} = 2$, $\bar{w}_{ij} \overset{i.i.d.}{\sim} N(0, 1)$, $\bar{E}_{r} = (\bar{e}_{rij}, i \leq p, j \leq q_{r})$, and $\bar{e}_{rij} \sim N(0, \bar{σ}_{r}^{2})$ with $\bar{σ}_{1} = 0.5$, $\bar{σ}_{2} = 0.8$, $\bar{σ}_{3} = 1$. In a similar manner to the generation of W, we performed orthogonalization of $\bar{W}_{r}$ to generate W_r. Next, we generated ζ_ri in the model (5) as $ζ_{ri} = (ζ_{ri1}, \dots, ζ_{riq_{r}})^{T}$ such that $ζ_{rik} \overset{i.i.d.}{\sim} N(0, b_{scale}^{2} σ_{r}^{2})$ with σ₁ = 1, σ₂ = 2 and σ₃ = 0.5, where b_scale controlled the scale of the batch effects. Here, we considered three scales of batch effects, corresponding to low, middle and high, by setting b_scale to 1, 2, or 3, respectively.

Next, we generated a high-dimensional normalized expression matrix using x_ri = τ_r + W(z_ri + ν_ri) + W_rζ_ri + ε_ri, τ_rj ~ N(0, 4), ε_ri ~ N(0, Λ_r), where τ_rj is the j-th element of τ_r, Λ_r = diag(λ_rj), j = 1,…,p, λ_1j = 2(1 + 1.5∣z_1j∣) with $z_{1j} \overset{i.i.d.}{\sim} N(0, 3)$; λ_2j = 2(1 + z_2j) with $z_{2j} \overset{i.i.d.}{\sim} U[0, 1]$; and λ_3j = 2(1 + 2z_3j) with $z_{3j} \overset{i.i.d.}{\sim} U[0, 1]$. Finally, we generated raw gene expressions $\tilde{x}_{ri} = (\tilde{x}_{ri1}, \dots, \tilde{x}_{rip})^{T}$ using $\tilde{x}_{rij} \sim \mathrm{Poisson}(\exp(x_{rij}))$. The term ε_ri makes the distribution of $\tilde{x}_{rij}$ over-dispersed, which better imitates the properties of count expression. In this scenario, for each sample r, we only observed the raw expression $\tilde{x}_{rij}$ of gene j and spot i and spatial coordinates s_ri for spot i.
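The over-dispersion induced by Gaussian noise on the log scale can be checked numerically. The sketch below draws Poisson counts whose log-rate is itself Gaussian and compares the sample mean with the sample variance; a plain Poisson variable would have the two equal. It is a toy, single-gene illustration: the noise standard deviation, the number of spots, the seed, and the simple Poisson sampler are all assumptions and do not reproduce the simulation settings above.

```python
import math
import random
import statistics


def sample_poisson(rate, rng):
    """Knuth's multiplication method; adequate for small to moderate rates."""
    threshold = math.exp(-rate)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k


rng = random.Random(0)
n_spots = 5000      # hypothetical number of spots
noise_sd = 1.0      # hypothetical standard deviation of the log-scale noise

# Log-rates play the role of x_rij; here they consist of the noise term only.
log_rates = [rng.gauss(0.0, noise_sd) for _ in range(n_spots)]
counts = [sample_poisson(math.exp(x), rng) for x in log_rates]

mean_count = statistics.fmean(counts)
var_count = statistics.pvariance(counts)
# var_count exceeds mean_count, i.e. the counts are over-dispersed.
```

For this Poisson log-normal mixture the theoretical mean is exp(0.5) ≈ 1.65 and the theoretical variance is about 6.3, so the variance-to-mean ratio is well above 1.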

Scenario 2. Raw gene expression count data with different scales of batch effects: domain labels and spatial coordinates from DLPFC data. To validate the generalizability of PRECAST, we also generated data based on three DLPFC datasets (ID: 151507, 151669, and 151673) from three donors (Visium platform). The domain labels y_ris for spots were obtained from annotations made by Maynard et al.²¹, together with the spatial coordinates. To generate the latent features ν_ris, PRECAST was used to fit the three datasets, and we used the estimated features $\hat{ν}_{ri}$s as input. The cluster labels y_ris and latent features ν_ris were fixed once they were obtained. The other quantities were generated in the same way as in scenario 1, but with a different μ_k (Supplementary Data 8).

Scenario 3. Raw gene expression data: count matrix, domain labels and spatial coordinates from DLPFC data. To make PASTE comparable to the other methods, we obtained raw gene expression data, domain labels and spatial coordinates from all three tissue slides (ID: 151673, 151674, and 151675) from the last donor. The domain labels were obtained from Maynard et al.²¹. As the true shared embeddings were unknown, we did not evaluate the canonical correlations. To generate the count matrix, we randomly added a pseudocount for each raw count from a binomial distribution with size three and probability parameter 0.3, similar to Zeira et al.³⁰. Finally, we obtained the count matrix and spatial coordinates as input for the compared methods.
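The perturbation step in this scenario can be sketched as follows: each raw count receives an independent pseudocount drawn from a binomial distribution with size 3 and probability 0.3. The function name, the seed argument, and the toy count matrix are illustrative assumptions.

```python
import random


def add_binomial_pseudocounts(counts, size=3, prob=0.3, seed=0):
    """Add an independent Binomial(size, prob) draw to every raw count."""
    rng = random.Random(seed)

    def draw():
        # A binomial draw as the number of successes in `size` Bernoulli trials.
        return sum(rng.random() < prob for _ in range(size))

    return [[c + draw() for c in row] for row in counts]


raw_counts = [[0, 5, 2], [1, 0, 7], [3, 3, 0]]   # toy spots-by-genes matrix
perturbed = add_binomial_pseudocounts(raw_counts)
```

Each entry of `perturbed` exceeds the corresponding raw count by 0, 1, 2, or 3, with an expected increase of 0.9.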

Gene selection for integrative analysis

By performing QC, we filtered out genes with zero expression in multiple spots, and spots with zero expression of many genes (see “Data resources”). In our analyses, we used SPARK⁴⁰ to select top SVGs for human DLPFC Visium data, mouse liver ST data and HCC Visium data. However, we used SPARK-X⁴¹ to select SVGs for mouse olfactory bulb Slide-seqV2 data, since SPARK cannot handle datasets with a large number of spots. In total, we selected the top 2000 SVGs for each sample using SPARK or SPARK-X. Next, we prioritized genes based on the number of times they were selected as SVGs in all samples and chose the top 2000 genes as input for PRECAST and the other compared analytical methods.
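The final prioritization step, ranking genes by the number of samples in which they were selected as SVGs, can be sketched as below. The per-sample SVG lists are assumed to have been computed already (for example by SPARK or SPARK-X); the function name, the alphabetical tie-breaking rule, and the toy gene names are assumptions not specified in the text.

```python
from collections import Counter


def select_shared_genes(per_sample_svgs, n_top=2000):
    """Rank genes by how many samples selected them and keep the top n_top."""
    frequency = Counter(gene for svgs in per_sample_svgs for gene in set(svgs))
    # Assumption: ties are broken alphabetically to make the result deterministic.
    ranked = sorted(frequency, key=lambda gene: (-frequency[gene], gene))
    return ranked[:n_top]


per_sample_svgs = [
    ["A", "B", "C"],   # SVGs selected in sample 1 (toy names)
    ["B", "C", "D"],   # sample 2
    ["C", "D", "E"],   # sample 3
]
selected = select_shared_genes(per_sample_svgs, n_top=3)  # ['C', 'B', 'D']
```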

We used human DLPFC Visium data and mouse liver ST data, which had manual annotations, to confirm that spatially-aware gene selection methods did not represent a crucial part of PRECAST. Here, we selected the top 2000 highly variable genes (HVGs) for each sample using FindVariableFeatures with default settings in the Seurat R package. In addition, to examine the impact of the SVG selection method on the performance of PRECAST, we applied four methods to select the top 2000 SVGs for each sample of these two datasets. These methods included SPARK, SPARK-X, SpatialDE⁴², and nnSVG⁴³.

Data resources

Human dorsolateral prefrontal cortex Visium data

We downloaded spatial transcriptomic data for human DLPFC obtained on the 10x Visium platform from 10.5281/zenodo.4730634. These data were collected from 12 human postmortem DLPFC tissue sections from three independent neurotypical adult donors, and the raw expression count matrix contained 33,538 genes for each sample, with a total of 47,681 spatial locations. Before conducting the analysis, we first performed QC on each sample to filter out genes with non-zero expression levels in fewer than 20 spots and spots with non-zero expression levels for fewer than 20 genes. The filtering step led to sets of 14,535 genes on average in a total of 47,680 spatial locations. The annotated spatial domains in all 12 samples based on the cytoarchitecture in the original study²¹ were layer 1 (n = 5321), layer 2 (n = 2858), layer 3 (n = 17,587), layer 4 (n = 3547), layer 5 (n = 7300), layer 6 (n = 6201), white matter (n = 4514), and undetermined spots (n = 352). In the analysis, we treated these manual annotations as the ground truth to evaluate the clustering and data integration performance of the different methods.
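The QC rule used throughout this section (keep genes detected in at least 20 spots and spots with at least 20 detected genes) can be sketched on a plain spots-by-genes count matrix. Filtering genes first and then evaluating spots on the retained genes is an assumption about the order of operations; the function name and the toy matrix, which uses thresholds of 2 for readability, are also illustrative.

```python
def qc_filter(counts, min_spots=20, min_genes=20):
    """Filter a spots-by-genes count matrix given as a list of rows."""
    n_genes = len(counts[0]) if counts else 0
    # Keep genes with non-zero expression in at least `min_spots` spots.
    keep_genes = [
        j for j in range(n_genes)
        if sum(row[j] > 0 for row in counts) >= min_spots
    ]
    # Keep spots with non-zero expression for at least `min_genes` retained genes.
    keep_spots = [
        i for i, row in enumerate(counts)
        if sum(row[j] > 0 for j in keep_genes) >= min_genes
    ]
    filtered = [[counts[i][j] for j in keep_genes] for i in keep_spots]
    return filtered, keep_spots, keep_genes


toy_counts = [
    [1, 0, 2, 0],
    [3, 0, 1, 0],
    [0, 0, 0, 1],
    [2, 5, 4, 0],
]
filtered, kept_spots, kept_genes = qc_filter(toy_counts, min_spots=2, min_genes=2)
# kept_genes == [0, 2]; kept_spots == [0, 1, 3]
```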

Mouse liver ST data

We downloaded eight sets of mouse liver ST data from https://zenodo.org/record/4399655. The eight datasets contained 15,302 genes, on average, measured over 4865 spatial spots in total. In the QC steps, we first filtered out genes with non-zero expression levels in fewer than 20 spots and spots with non-zero expression levels for fewer than 20 genes. The filtering step led to a set of 9221 genes, on average, from a total of 4865 locations. The annotated spatial domains in all eight samples, based on marker genes in the original study⁴⁴, were portal veins (n = 1223), central veins (n = 720), haemoglobin (n = 464), immune-related domain (n = 163), mesenchymal-related domain (n = 110), and undetermined spots (n = 2183), which were used to evaluate the clustering and data integration performances of the different methods in the analyses.

Mouse olfactory bulb Slide-seqV2 data

We obtained the data for 20 mouse OB Slide-seqV2 sections in Replicate 2 from the NCBI Gene Expression Omnibus under accession number GSE169021 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE169021). We used the first 16 slides because of the low quality of the last four slides. The data contained 21,571 genes, on average, over all 693,863 spatial locations across 16 sections, which were equally distributed along the anterior-posterior axis of the same mouse. In the QC steps, we first filtered out genes with non-zero expression levels in fewer than 20 spots and spots with non-zero expression levels for fewer than 20 genes. The filtering step led to a set of 14,307 genes, on average, for a total of 594,890 locations. To evaluate the impact of spatial resolution, we further collapsed nearby spots in each tissue slide using square grids of size 70 × 70. Then, the same QC steps were performed to obtain normalized expression for the analysis. To examine the structure of the mouse OB, we relied on the structural annotation in the Allen Brain Atlas⁴⁸ and used the results of downstream analyses to determine the specific regions of OB.
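One way to collapse nearby spots onto a square grid is to bin the spatial coordinates and pool the counts within each bin, as sketched below. Summing counts within a bin, interpreting the grid size in the same units as the coordinates, and the function name are all assumptions; the text does not specify how the collapsing was implemented.

```python
def collapse_to_grid(coords, counts, grid_size=70.0):
    """Pool the count vectors of all spots that fall in the same square bin."""
    pooled = {}
    for (x, y), row in zip(coords, counts):
        key = (int(x // grid_size), int(y // grid_size))
        if key in pooled:
            pooled[key] = [a + b for a, b in zip(pooled[key], row)]
        else:
            pooled[key] = list(row)
    return pooled


toy_coords = [(5, 5), (60, 10), (75, 5), (140, 140)]
toy_counts = [[1, 0], [2, 1], [0, 3], [4, 4]]
pooled = collapse_to_grid(toy_coords, toy_counts)
# {(0, 0): [3, 1], (1, 0): [0, 3], (2, 2): [4, 4]}
```

The first two toy spots share a bin and are merged, so four spots are reduced to three pooled locations.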

Human hepatocellular carcinoma Visium data

These data were from two tissue sections each from tumor and tumor-adjacent regions of an HCC patient, and contained 36,601 genes measured over 9813 spatial locations. In the QC steps, we first filtered out genes with non-zero expression levels in fewer than 20 spots and spots with non-zero expression levels for fewer than 20 genes. The filtering step led to a set of 14,851 genes on average from a total of 9813 locations. In these data, manual annotations for the TNE and stroma regions were provided by a pathologist using the Visium companion H&E images. We further performed spatial deconvolution to examine the spatial distribution of malignant cells using RCTD⁵⁵.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Acknowledgements

We thank Dr. Juan Zhou for critical reading and feedback. This work was supported by University Development Fund (UDF01003033) from The Chinese University of Hong Kong, Shenzhen; AcRF Tier 2 grant (MOET2EP20220-0009) from the Ministry of Education, Singapore, and grant from the National Natural Science Foundation of China (11931014). The computational work for this article was partially performed using resources from the National Supercomputing Centre, Singapore (https://www.nscc.sg).

Author contributions

J.L. initiated and designed the study; W.L. implemented the model and developed the software tool with assistance from Y.Y.; W.L., X.L., and Z.L. performed the simulation studies and the benchmark evaluation; J.L. wrote the manuscript; and W.L., X.L., Z.L., Y.Y., M.L., Y.J., X.S., W.Z., H.J., J.Y., and J.L. edited and revised the manuscript.

Peer review

Peer review information

Nature Communications thanks the anonymous reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.

Data availability

All datasets used in this study are publicly available. These include the 12 human dorsolateral prefrontal cortex Visium datasets (10.5281/zenodo.4730634), eight mouse liver ST datasets (https://zenodo.org/record/4399655), 16 mouse OB Slide-seqV2 datasets (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE169021) and four human hepatocellular carcinoma Visium datasets (Raw FASTQ data are available at https://www.ncbi.nlm.nih.gov/sra?linkname=bioproject_sra_all&from_uid=858545, and H&E images are available at 10.6084/m9.figshare.21280569.v1 and 10.6084/m9.figshare.21061990.v1). The structural annotation of mouse olfactory bulb is available at Allen Brain Atlas (https://atlas.brain-map.org/). All other relevant data supporting the key findings of this study are available within the article and its Supplementary Information files or from the corresponding author upon reasonable request. Source data are provided with this paper.

Code availability

The PRECAST methods were implemented in an open-source, publicly available R package⁸⁵ that is available at https://cran.r-project.org/package=PRECAST and https://github.com/feiyoung/PRECAST. Code for reproducing the analysis can be found at https://github.com/feiyoung/PRECAST_Analysis.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

These authors contributed equally: Wei Liu, Xu Liao.

Change history

10/18/2023

A Correction to this paper has been published: 10.1038/s41467-023-42412-1

Supplementary information

The online version contains supplementary material available at 10.1038/s41467-023-35947-w.

References

1. Chen KH, Boettiger AN, Moffitt JR, Wang S, Zhuang X. Spatially resolved, highly multiplexed RNA profiling in single cells. Science. 2015;348:aaa6090.
2. Moffitt JR, et al. High-throughput single-cell gene-expression profiling with multiplexed error-robust fluorescence in situ hybridization. Proc. Natl Acad. Sci. USA. 2016;113:11046–11051. doi: 10.1073/pnas.1612826113.
3. Wang G, Moffitt JR, Zhuang X. Multiplexed imaging of high-density libraries of RNAs with merfish and expansion microscopy. Sci. Rep. 2018;8:1–13. doi: 10.1038/s41598-018-22297-7.
4. Lubeck E, Coskun AF, Zhiyentayev T, Ahmad M, Cai L. Single-cell in situ RNA profiling by sequential hybridization. Nat. Methods. 2014;11:360–361. doi: 10.1038/nmeth.2892.
5. Shah S, Lubeck E, Zhou W, Cai L. In situ transcription profiling of single cells reveals spatial organization of cells in the mouse hippocampus. Neuron. 2016;92:342–357. doi: 10.1016/j.neuron.2016.10.001.
6. Eng C-HL, et al. Transcriptome-scale super-resolved imaging in tissues by RNA seqFISH+. Nature. 2019;568:235–239. doi: 10.1038/s41586-019-1049-y.
7. Ståhl PL, et al. Visualization and analysis of gene expression in tissue sections by spatial transcriptomics. Science. 2016;353:78–82. doi: 10.1126/science.aaf2403.
8. Vickovic S, et al. High-definition spatial transcriptomics for in situ tissue profiling. Nat. Methods. 2019;16:987–990. doi: 10.1038/s41592-019-0548-y.
9. Rodriques SG, et al. Slide-seq: a scalable technology for measuring genome-wide expression at high spatial resolution. Science. 2019;363:1463–1467. doi: 10.1126/science.aaw1219.
10. Stickels RR, et al. Highly sensitive spatial transcriptomics at near-cellular resolution with Slide-seqV2. Nat. Biotechnol. 2021;39:313–319. doi: 10.1038/s41587-020-0739-1.
11. 10x Genomics. Visium spatial gene expression. https://www.10xgenomics.com/products/spatial-gene-expression (2019).
12. Rao A, Barkley D, França GS, Yanai I. Exploring tissue architecture using spatial transcriptomics. Nature. 2021;596:211–220. doi: 10.1038/s41586-021-03634-9.
13. Armingol E, Officer A, Harismendy O, Lewis NE. Deciphering cell–cell interactions and communication from gene expression. Nat. Rev. Genet. 2021;22:71–88. doi: 10.1038/s41576-020-00292-x.
14. Nassiri I, McCall MN. Systematic exploration of cell morphological phenotypes associated with a transcriptomic query. Nucleic Acids Res. 2018;46:e116–e116. doi: 10.1093/nar/gky626.
15. Saelens W, Cannoodt R, Todorov H, Saeys Y. A comparison of single-cell trajectory inference methods. Nat. Biotechnol. 2019;37:547–554. doi: 10.1038/s41587-019-0071-9.
16. Qiu X, et al. Mapping transcriptomic vector fields of single cells. Cell. 2022;185:690–711. doi: 10.1016/j.cell.2021.12.045.
17. Palla G, Fischer DS, Regev A, Theis FJ. Spatial components of molecular tissue biology. Nat. Biotechnol. 2022;40:308–318.
18. Zhao E, et al. Spatial transcriptomics at subspot resolution with BayesSpace. Nat. Biotechnol. 2021;39:1375–1384.
19. Hu J, et al. SpaGCN: Integrating gene expression, spatial location and histology to identify spatial domains and spatially variable genes by graph convolutional network. Nat. Methods. 2021;18:1342–1351. doi: 10.1038/s41592-021-01255-8.
20. Yang Y, et al. SC-MEB: spatial clustering with hidden Markov random field using empirical Bayes. Brief. Bioinform. 2022;23:bbab466. doi: 10.1093/bib/bbab466.
21. Maynard KR, et al. Transcriptome-scale spatial gene expression in the human dorsolateral prefrontal cortex. Nat. Neurosci. 2021;24:425–436. doi: 10.1038/s41593-020-00787-0.
22. Wang I-H, et al. Spatial transcriptomic reconstruction of the mouse olfactory glomerular map suggests principles of odor processing. Nat. Neurosci. 2022;25:484–492.
23. Korsunsky I, et al. Fast, sensitive and accurate integration of single-cell data with harmony. Nat. Methods. 2019;16:1289–1296. doi: 10.1038/s41592-019-0619-0.
24. Haghverdi L, Lun AT, Morgan MD, Marioni JC. Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors. Nat. Biotechnol. 2018;36:421–427. doi: 10.1038/nbt.4091.
25. Hie B, Bryson B, Berger B. Efficient integration of heterogeneous single-cell transcriptomes using scanorama. Nat. Biotechnol. 2019;37:685–691. doi: 10.1038/s41587-019-0113-3.
26. Stuart T, et al. Comprehensive integration of single-cell data. Cell. 2019;177:1888–1902. doi: 10.1016/j.cell.2019.05.031.
27. Lopez R, Regier J, Cole MB, Jordan MI, Yosef N. Deep generative modeling for single-cell transcriptomics. Nat. Methods. 2018;15:1053–1058. doi: 10.1038/s41592-018-0229-2.
28. Lotfollahi M, Wolf FA, Theis FJ. scGen predicts single-cell perturbation responses. Nat. Methods. 2019;16:715–721. doi: 10.1038/s41592-019-0494-8.
29. Velten B, et al. Identifying temporal and spatial patterns of variation from multimodal data using MEFISTO. Nat. Methods. 2022;19:179–186.
30. Zeira R, Land M, Strzalkowski A, Raphael BJ. Alignment and integration of spatial transcriptomics data. Nat. Methods. 2022;19:567–575.
31. Van der Maaten L, Hinton G. Visualizing data using t-SNE. J. Mach. Learn. Res. 2008;9:2579–2605.
32. McInnes L, et al. UMAP: Uniform Manifold Approximation and Projection. J. Open Source Softw. 2018;3(29):861. doi: 10.21105/joss.00861.
33. Yeh FL, Wang Y, Tom I, Gonzalez LC, Sheng M. TREM2 binds to apolipoproteins, including apoe and CLU/APOJ, and thereby facilitates uptake of amyloid-beta by microglia. Neuron. 2016;91:328–340. doi: 10.1016/j.neuron.2016.06.015.
34. Liu W, et al. Joint dimension reduction and clustering analysis of single-cell RNA-seq and spatial transcriptomics data. Nucleic Acids Res. 2022. doi: 10.1093/nar/gkac219.
35. Mamber C, et al. GFAPδ expression in glia of the developmental and adolescent mouse brain. PLoS ONE. 2012;7:e52659. doi: 10.1371/journal.pone.0052659.
36. Zhang X, et al. Human intracellular ISG15 prevents interferon-α/β over-amplification and auto-inflammation. Nature. 2015;517:89–93. doi: 10.1038/nature13801.
37. Hermann M, Bogunovic D. ISG15: in sickness and in health. Trends Immunol. 2017;38:79–93. doi: 10.1016/j.it.2016.11.001.
38. Dantzer R, O’connor JC, Freund GG, Johnson RW, Kelley KW. From inflammation to sickness and depression: when the immune system subjugates the brain. Nat. Rev. Neurosci. 2008;9:46–56. doi: 10.1038/nrn2297.
39. Khandaker GM, et al. Inflammation and immunity in schizophrenia: implications for pathophysiology and treatment. Lancet Psychiatry. 2015;2:258–270. doi: 10.1016/S2215-0366(14)00122-9.
40. Sun S, Zhu J, Zhou X. Statistical analysis of spatial expression patterns for spatially resolved transcriptomic studies. Nat. Methods. 2020;17:193–200. doi: 10.1038/s41592-019-0701-7.
41. Zhu J, Sun S, Zhou X. SPARK-X: non-parametric modeling enables scalable and robust detection of spatial expression patterns for large spatial transcriptomic studies. Genome Biol. 2021;22:1–25. doi: 10.1186/s13059-021-02404-0.
42. Svensson V, Teichmann SA, Stegle O. Spatialde: identification of spatially variable genes. Nat. Methods. 2018;15:343–346. doi: 10.1038/nmeth.4636.
43. Weber LM, Saha A, Datta A, Hansen KD, Hicks SC. nnSVG: scalable identification of spatially variable genes using nearest-neighbor gaussian processes. Preprint at bioRxiv. 2022. doi: 10.1101/2022.05.16.492124.
44. Hildebrandt F, et al. Spatial transcriptomics to define transcriptional patterns of zonation and structural components in the mouse liver. Nat. Commun. 2021;12:1–14. doi: 10.1038/s41467-021-27354-w.
45. Mak KM, Png CM. The hepatic central vein: structure, fibrosis, and role in liver biology. Anatomical Record. 2020;303:1747–1767. doi: 10.1002/ar.24273.
46. Han X, et al. Mapping the mouse cell atlas by microwell-seq. Cell. 2018;172:1091–1107. doi: 10.1016/j.cell.2018.02.001.
47. Ji Z, Ji H. TSCAN: Pseudo-time reconstruction and evaluation in single-cell RNA-seq analysis. Nucleic Acids Res. 2016;44:e117–e117. doi: 10.1093/nar/gkw430.
48. Lein ES, et al. Genome-wide atlas of gene expression in the adult mouse brain. Nature. 2007;445:168–176. doi: 10.1038/nature05453.
49. Marx V. Method of the year: spatially resolved transcriptomics. Nat. Methods. 2021;18:9–14. doi: 10.1038/s41592-020-01033-y.
50. Amaral PP, et al. Complex architecture and regulated expression of the Sox2ot locus during vertebrate development. RNA. 2009;15:2013–2027. doi: 10.1261/rna.1705309.
51. Haslinger A, Schwarz TJ, Covic M, Chichung Lie D. Expression of Sox11 in adult neurogenic niches suggests a stage-specific role in adult neurogenesis. Eur. J. Neurosci. 2009;29:2103–2114. doi: 10.1111/j.1460-9568.2009.06768.x.
52. Tepe B, et al. Single-cell RNA-seq of mouse olfactory bulb reveals cellular heterogeneity and activity-dependent molecular census of adult-born neurons. Cell Rep. 2018;25:2689–2703. doi: 10.1016/j.celrep.2018.11.034.
53. Sanai N, et al. Corridors of migrating neurons in the human brain and their decline during infancy. Nature. 2011;478:382–386. doi: 10.1038/nature10487.
54. Nagayama S, Homma R, Imamura F. Neuronal organization of olfactory bulb circuits. Front. Neural Circ. 2014;8:98. doi: 10.3389/fncir.2014.00098.
55. Cable DM, et al. Robust decomposition of cell type mixtures in spatial transcriptomics. Nat. Biotechnol. 2022;40:517–526.
56. Tufo C, et al. Development of the mammalian main olfactory bulb. Development. 2022;149:dev200210. doi: 10.1242/dev.200210.
57. Hu J, et al. Gene expression signature for angiogenic and nonangiogenic non-small-cell lung cancer. Oncogene. 2005;24:1212–1219. doi: 10.1038/sj.onc.1208242.
58. Bentink S, et al. Angiogenic mRNA and microRNA gene expression signature predicts a novel subtype of serous ovarian cancer. PLoS ONE. 2012;7:e30269. doi: 10.1371/journal.pone.0030269.
59. Masiero M, et al. A core human primary tumor angiogenesis signature identifies the endothelial orphan receptor ELTD1 as a key regulator of angiogenesis. Cancer Cell. 2013;24:229–241. doi: 10.1016/j.ccr.2013.06.004.
60. Langlois B, et al. Angiomatrix, a signature of the tumor angiogenic switch-specific matrisome, correlates with poor prognosis for glioma and colorectal cancer patients. Oncotarget. 2014;5:10529. doi: 10.18632/oncotarget.2470.
61. Ma L, et al. Tumor cell biodiversity drives microenvironmental reprogramming in liver cancer. Cancer Cell. 2019;36:418–430. doi: 10.1016/j.ccell.2019.08.007.
62. Capece D, et al. The inflammatory microenvironment in hepatocellular carcinoma: a pivotal role for tumor-associated macrophages. BioMed Res. Int. 2013;2013:1–15.
63.Sawa-Wejksza K, Kandefer-Szerszeń M. Tumor-associated macrophages as target for antitumor therapy. Arch. Immunol. Ther. Exp. 2018;66:97–111. doi: 10.1007/s00005-017-0480-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
64.Huang K, et al. High SPINK1 expression predicts poor prognosis and promotes cell proliferation and metastasis of hepatocellular carcinoma. J. Invest. Surg. 2021;34:1011–1020. doi: 10.1080/08941939.2020.1728443. [DOI] [PubMed] [Google Scholar]
65.Birgani MT, et al. Long non-coding RNA SNHG6 as a potential biomarker for hepatocellular carcinoma. Pathol. Oncol. Res. 2018;24:329–337. doi: 10.1007/s12253-017-0241-3. [DOI] [PubMed] [Google Scholar]
66.Kang YH, et al. Dysregulation of overexpressed IL-32α in hepatocellular carcinoma suppresses cell growth and induces apoptosis through inactivation of NF-κB and Bcl-2. Cancer Lett. 2012;318:226–233. doi: 10.1016/j.canlet.2011.12.023. [DOI] [PubMed] [Google Scholar]
67.Tsai, Y.-H. H. et al. Self-supervised representation learning with relative predictive coding. In International Conference on Learning Representations (2021). https://openreview.net/forum?id=068E_JSq9O
68.Lin, Y. et al. scJoint integrates atlas-scale single-cell RNA-seq and ATAC-seq data with transfer learning. Nat. Biotechnol.40, 703–710 (2022). [DOI] [PMC free article] [PubMed]
69.Allen, C. et al. A Bayesian multivariate mixture model for high throughput spatial transcriptomics. Biometrics. online, (2022). 10.1111/biom.13727 [DOI] [PMC free article] [PubMed]
70.Tipping ME, Bishop CM. Probabilistic principal component analysis. J. R. Stat. Soc. Series B Stat. Methodol. 1999;61:611–622. [Google Scholar]
71.Bishop, C. M. Pattern Recognition and Machine Learning (Springer, 2006).
72.Graner F, Glazier JA. Simulation of biological cell sorting using a two-dimensional extended potts model. Phys. Rev. Lett. 1992;69:2013. doi: 10.1103/PhysRevLett.69.2013. [DOI] [PubMed] [Google Scholar]
73.Besag J. Spatial interaction and the statistical analysis of lattice systems. J. R. Stat. Soc. Series B Methodol. 1974;36:192–225. [Google Scholar]
74.Risso D, Ngai J, Speed TP, Dudoit S. Normalization of RNA-seq data using factor analysis of control genes or samples. Nat. Biotechnol. 2014;32:896–902. doi: 10.1038/nbt.2931. [DOI] [PMC free article] [PubMed] [Google Scholar]
75.Hounkpe BW, Chenou F, de Lima F, De Paula EV. HRT Atlas v1. 0 database: redefining human and mouse housekeeping genes and candidate reference transcripts by mining massive RNA-seq datasets. Nucleic Acids Res. 2021;49:D947–D955. doi: 10.1093/nar/gkaa609. [DOI] [PMC free article] [PubMed] [Google Scholar]
76.Raudvere U, et al. g: Profiler: a web server for functional enrichment analysis and conversions of gene lists (2019 update) Nucleic Acids Res. 2019;47:W191–W198. doi: 10.1093/nar/gkz369. [DOI] [PMC free article] [PubMed] [Google Scholar]
77.Street K, et al. Slingshot: cell lineage and pseudotime inference for single-cell transcriptomics. BMC Genomics. 2018;19:1–16. doi: 10.1186/s12864-018-4772-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
78.Blondel VD, Guillaume J-L, Lambiotte R, Lefebvre E. Fast unfolding of communities in large networks. J. Stat. Mech. Theory Exp. 2008;2008:P10008. [Google Scholar]
79.Zhao, E., et al. Spatial transcriptomics at subspot resolution with BayesSpace. Nature Biotechnology39, 1375–1384 (2021). [DOI] [PMC free article] [PubMed]
80.Li Z, Zhou X. BASS: multi-scale and multi-sample analysis enables accurate cell type clustering and spatial domain detection in spatial transcriptomic studies. Genome Biol. 2022;23:1–35. doi: 10.1186/s13059-022-02734-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
81.Lin Y, et al. scMerge leverages factor analysis, stable expression, and pseudoreplication to merge multiple single-cell RNA-seq datasets. Proc. Natl Acad. Sci. USA. 2019;116:9775–9784. doi: 10.1073/pnas.1820006116. [DOI] [PMC free article] [PubMed] [Google Scholar]
82.Hubert L, Arabie P. Comparing partitions. J. Classif. 1985;2:193–218. [Google Scholar]
83.Cover, T. M. & Thomas, J. A. Elements of Information Theory 2nd Edition (wiley series in telecommunications and signal processing) (Wiley-Interscience, 2006).
84.Rand WM. Objective criteria for the evaluation of clustering methods. J. Am. Stat. Assoc. 1971;66:846–850. [Google Scholar]
85.Liu, W. et al. Probabilistic embedding, clustering, and alignment for integrating spatial transcriptomics data with precast. feiyoung/PRECAST: v1.3.0. 10.5281/zenodo.7417715 (2022).

Associated Data

Supplementary Materials

Supplementary Information (47.5 MB, pdf)

Peer Review File (16.3 MB, pdf)

41467_2023_35947_MOESM3_ESM.pdf (94.8 KB, pdf)

Description of Additional Supplementary Files

Supplementary Data 1-8 (1.1 MB, zip)

Reporting Summary (345.9 KB, pdf)

Source Data (68.2 MB, xlsx)
