Abstract
Analyzing multi-sample spatial transcriptomics data requires accounting for biological variation. We present multi-sample non-negative spatial factorization (mNSF), an alignment-free framework extending single-sample spatial factorization to multi-sample datasets. mNSF incorporates sample-specific spatial correlation modeling and extracts low-dimensional data representations. Through simulations and real data analysis, we demonstrate mNSF’s efficacy in identifying true factors, shared anatomical regions, and region-specific biological functions. mNSF’s performance is comparable to alignment-based methods when alignment is feasible, while enabling analysis in scenarios where spatial alignment is unfeasible. mNSF shows promise as a robust method for analyzing spatially resolved transcriptomics data across multiple samples.
Supplementary Information
The online version contains supplementary material available at 10.1186/s13059-025-03601-x.
Keywords: Spatial transcriptomics, Matrix factorization, Multi-sample analysis, Dimensionality reduction, Spatial gene expression
Background
Spatially resolved transcriptomics (SRT) measures gene expression levels in the context of spatial positions [1–11], either at the single cell level or as a local aggregate of multiple cells across a spatial location, also termed a spot. The last 30 years of genomics have established that it is essential to consider biological replicates when trying to understand a biological system [12, 13]. Indeed, technology does not remove biological variation [14].
Multisample (population-level) analysis of spatial data is common in functional magnetic resonance imaging (fMRI) brain data, and it is instructive to briefly review the approach in this field [15]. In fMRI analysis, the first step is to spatially align the samples to a common coordinate system (known as template-based alignment). The unit of measurements are 3D cubes known as “voxels.” Following alignment, analysis then proceeds separately for each voxel (or sometimes region), typically by using a general linear model across samples. For fMRI data, spatial alignment makes it possible to deploy standard statistical models for each voxel separately, substantially simplifying downstream analysis.
In spatially resolved transcriptomics, a number of methods for spatial alignment has been proposed, including PASTE [16], PASTE2 [17], STalign [18], and GPSA [19]. Some of these methods align to a common coordinate system, others align the samples to each other. However, we posit that there are natural limitations to the potential success of this approach to multi-sample analysis. In fMRI imaging, alignment is helped by the fact that the whole brain is imaged in 3D in each sample. In contrast to fMRI data, the alignment of spatially resolved transcriptomics is complicated by the possibility that different samples may be collected from different anatomical areas and have differences in the shape, size, and rotation of the sections. Indeed, SRT samples can represent completely disjoint areas; in this case, spatial alignment is impossible except to a common coordinate system. But even then, it is unclear how downstream analysis should proceed, when the samples are non-overlapping.
Factor analysis has been a successful approach to unsupervised discovery of patterns in genomics. There are a few existing methods for the factor analysis of SRT data that model the spatial dependency of gene expression data [20–22]. NSF [20] and MEFISTO [21] are focused on the factorization of data from a single biospecimen, as each factor is modeled using a single Gaussian Process. MEFISTO allows multi-sample analysis, but it requires the samples to be spatially aligned as a preprocessing step. Similarly, NSF can be applied to multiple samples, provided they have been spatially aligned. The spatial alignment process, preceding MEFISTO and NSF analyses, transforms samples to a common coordinate system. This creates an expanded covariance matrix encompassing all samples, with inter-spot correlations based on aligned distances. Both MEFISTO and NSF then utilize this aligned, multi-sample data structure for their analyses. Neither methods are evaluated for this purpose in their respective publications. In contrast, SpatialPCA [22] can be applied to unaligned samples and the publication contains a light evaluation of this task. The authors compares clusters obtained from a joint respectively single-sample analysis to manually annotated cortical layers and conclude that the multi-sample analysis does not outperform single-sample analysis. The SpatialPCA model has a few limitations. Primarily is lacks the ability to have a sample-specific parametrization of spatial dependencies for individual factors, such as the bandwidth in its Gaussian Process components. This restriction potentially hampers its effectiveness when examining samples of diverse sizes or those originating from different platforms. Together, this suggests that across-sample factorization of SRT data still has substantial challenges.
Results
Bypassing spatial alignment by parameter modeling
It is important to account for sample-to-sample variation in the analysis of genomics data. We are considering this question in the context of applying matrix factorization methods, such as non-negative matrix factorization (NMF), to spatially resolved transcriptomics data. Gene expression data exhibits a spatial dependence whereby genes measured at two locations which are spatially close show a different dependence from genes measured at two distant spatial locations. Such dependence can be driven by a variety of sources including spatial patterns in the distribution of cell types as well as correlated measurement error. The goal of any analysis of spatially resolved transcriptomics data is to identify systematic changes in gene expression which are associated with spatial location. Broadly, we refer to such dependence as “spatial dependence.”
One approach to analysis of multi-sample spatially resolved transcriptomics data is to start the analysis with spatial alignment of the samples into a common coordinate system. This process essentially defines spatial neighbourhoods and maps these neighbourhoods between samples. But spatial alignment is well-recognized to be a challenging problem, due to the need to account for differences in shape, rotation, and placement of anatomical regions or other features between samples.
Here, we provide a general recipe for extending a one-sample spatial factorization framework to allow multi-sample analysis. Our approach bypasses the need for spatial alignment.
In a spatial factorization framework, we represent spatial expression data on a single sample as a sum of products between gene loadings and spatial factors,
Here, is the spatial data matrix, is the number of spatial factors, are gene loadings and are the spatial factors. The gene loadings and spatial factors represent systematic changes in gene expression. Accounting for spatial dependence in such a model is done by additional modeling of the spatial factors; an example is the proposed non-negative spatial factorization (NSF) of Townes and Engelhardt [20] where the spatial factors are modelled using Gaussian processes.
In a multi-sample dataset, we have an additional m index for the different samples. Our recipe prescribes letting the spatial factors be sample-specific while the gene loadings are shared across samples (“Methods” section), giving rise to the following factorization
Note the absence of the m index on the gene loadings . Sharing the gene loadings across samples is exactly what happens when NMF is applied to data without spatial information such as bulk RNA-seq data (“Methods” section).
With such a parameterization, our recipe enforces each factor to have the same association with genes across samples, while allowing spatial dependence to be modeled separately in each sample. Fitting such a model will usually require the development of new software (“Methods” section), often by extending existing software.
As a proof of concept, here we have applied this recipe to non-negative spatial factorization (NSF). We allow each sample to have its own spatial dependence structure (or more specifically a sample-specific covariance term in the Gaussian process). We call this extended model mNSF (multi-sample NSF). We provide a python package implementing our model.
The performance of mNSF in simulations
We examined the performance of our mNSF model using a simulation study which is a simple extension of the study conducted by Townes and Engelhardt [20], but adapted to examine issues that are particularly relevant for multi-sample analysis of SRT data. Briefly, we specify true latent spatial factors and generate gene expression data with noise. We only depict the true and estimated spatial features, and not the gene loadings. We use T1–4 to denote the 4 true factors in the simulation, and use M1–4 to denote the 4 mNSF spatial features. There are no particular order to the mNSF output so we manually identify the best true factor which matches a given estimated factor.
First, we examined how mNSF handles the important case where the spatial factors are rotated between samples (T1–T4 in Fig. 1A). For each factor, its spatial distributions are the same in the two samples, but rotated 90 degrees. We find the mNSF factors among M1–4 each corresponding to one simulated factor among T1–4, according to their spatial pattern (M1, M2, M3, and M4, corresponding to T2, T1 T4, and T3). We use Moran’s I to measure the spatial dependency for each of the mNSF factors (i.e., M1–4) (Fig. 1C), and we find that all four mNSF factors show high spatial dependency. Those validations suggest that mNSF successfully identifies these simulated spatial features.
Fig. 1.
Multi-sample NSF on simulated data. A We generate 4 true factors for 2 samples, labeled T1–T4. Using these 4 spatial features, we generate noisy expression data for 500 genes following the approach in [20]. Simulated true factors are shown in the top two rows. Factors in sample 2 are the same as sample 1 except they are each rotated 90 degrees anticlockwise. mNSF factors are shown in the bottom two rows. The order of the factors estimated by mNSF is arbitrary and we manually pair each true factor with the estimated factor which best represent it. B Moran’s I for each of the 4 estimated factors across the 2 samples in A. C As A but factor T1 for sample 2 is set to 0. D As B but for the samples in C
Second, we examined mNSF in the case where one of the factors has a spatial pattern only in one sample, and has constant low value in the other sample (i.e., T1 among the factors T1–T4 in Fig. 1B). We find mNSF factors each corresponding to one simulated factor (M1, M2, M3, and M4, corresponds to T2, T4, T3, and T1). For each set of simulations, we use Moran’s I to measure the spatial dependency of each of the mNSF factors among M1–4 (Fig. 1D). We find that factor M4, which corresponds to factor T1 (i.e., the factor which is only operational in one of the samples by design), show high spatial dependency in sample 1 and almost zero spatial dependency in sample 2. Those results suggest that mNSF is capable of identifying factors that represent patterns that are operational only in some of the samples.
Third, we examined the performance of mNSF under additional complex conditions, including size differences between samples (Additional file 1: Fig. S1a), pattern distortions between samples (Additional file 1: Fig. S1b), and sample-specific noise levels (Additional file 1: Fig. S1c). These results demonstrate mNSF’s robustness in extracting meaningful patterns despite varying noise, size, and shape of the signal across samples.
Analysis of a mouse sagittal brain dataset
To assess the ability of mNSF to identify spatially resolved features in an actual, multi-sample, SRT dataset, we next analyzed the adult mouse sagittal brain dataset generated by 10X genomics, generated using the Visium technology [2]. This dataset consists of two sagittal sections from a single mouse brain. Each section was cut in two halves, one anterior and one posterior, for a total of 4 samples. Each sample was assayed using a separate Visium slide. We know that certain anatomical regions are present only in the anterior (e.g. olfactory bulb) or the posterior (e.g., cerebellum), while other anatomical regions will be split across the two halves (e.g., the hippocampus). This provides an opportunity to assess the ability of mNSF to identify sample-specific factors as well as both common factors and factors that vary across the anterior and posterior sections.
Townes and Engelhardt [20] applies NSF to one of the two anterior samples. They use 20 factors with a split of 10 spatial factors and 10 spatially-unrestricted features. Despite this demarcation, they find that most of the 20 learned patterns have strong spatial components. They establish that most of these factors correspond to known anatomical regions in the anterior mouse brain. Consistent with their parameter choices, we apply mNSF to all 4 samples using 20 spatial features.
To interpret each mNSF factor, we find a list of genes that are highly associated with, and most specific for each factor by analyzing the gene loading matrix using the patternMarkers approach identified in [23]. We then use the set of genes associated with each factor to interpret the cell types, biological functions, or anatomical regions represented by each factor.
Some mNSF factors reflect specific anatomical regions, only present in some of the samples. For example, factor M9 can be identified as highlighting the gyri of the cerebellum (Fig. 2). This hindbrain-specific factor is close to zero for the two anterior samples and visually similar in intensity and distribution across the two replicate sections for the posterior brain. The top genes identified by patternMarkers include Pcp2, Calb1, Car8, and Itpr1 which are specific markers for Purkinje cells within the cerebellum [24], as well as Cbln1 and Cbln3 which are known to exhibit high expression in the cerebellum, and Zic1 which is a specific marker for cerebellar granular cells [25] (Additional file 1: Table S1). This result highlights the ability of mNSF to identify factors associated with a signal that is present only in some, but not all, of the samples.
Fig. 2.
mNSF factors of mouse sagittal data reflect anatomical structures. The dataset is composed of four samples from the same mouse. The sagittal brain was divided into an anterior and a posterior half and two parallel sections were obtained for each half, for a total of 4 samples. We depict the results of applying mNSF using 20 factors to these 4 samples. Factor M9 (left) exhibits high use in the cerebellum and is an example of a factor specific to the posterior brain. Factor M17 (right) exhibits high use in both the hippocampus and hypothalamus regions, where the hippocampus region spans the posterior and anterior areas. This is an example of a factor which varies continuously across disjoint samples. All 20 factors are depicted in Additional file 1: Fig. S2
Some mNSF factors represent anatomical regions that span the posterior and anterior brain (Fig. 2). For example, factor M17 predominately marks both the hippocampus and hypothalamus, with moderate signal in select layers of the cortex. Note how the estimated factor varies smoothly across the posterior and anterior brain, specifically across the CA1–3 layers(which appear as a rotated U in these samples). The regions labeled by this factor are considered regions of increased synaptic plasticity. For example, the hippocampus, which functions primarily in learning and memory [26], requires this plasticity for the formation and consolidation of short-term memories. The hypothalamus plays a crucial role in maintaining homeostasis in the body [27], regulating a variety of essential functions such as hunger, thirst, sleep, circadian rhythms, stress responses, and reproductive behaviors. Synaptic plasticity in the hypothalamus is important for adaptation to changes in physiological states and the environment [28–31]. Consistent with this categorization, patternMarker genes for M17 are associated with synaptic plasticity, including AMPA receptor regulation (Cnih2, Herring et al. [32]), dendritic spine development (Ddn, Ncdn, Yang et al. [33, 34]), and synaptogenesis (Nptxr, Lee et al. [35]) (Additional file 1: Table S1).
Broadly, across the 20 spatial features, our observations about factors M9 and M17 hold for other spatial features. Specifically, we observe (a) consistency between each factor across the two replicate sections (b) there are multiple factors which continuously vary across the anterior and posterior brain (Additional file 1: Fig. S2) (c) a few factors (M10, M19) are specific to either the anterior or posterior brain.
Sensitivity analysis on the dispersion parameter demonstrates that varying the dispersion parameter (ranging from 0.001 to 100) does not significantly affect the factorization results (Additional file 1: Figs. S3, S4). The goodness-of-fit (measured by Poisson deviance between observed counts and predicted mean values) decreases as the number of factors increases, with the decrease rate noticeably slowing after 16 factors (Additional file 1: Fig. S5).
Analysis of human DLPFC data
Next, to evaluate mNSF across a dataset with replicate samples from different donors, we apply mNSF to a widely used spatially resolved transcriptomics dataset from the Visium platform on human dorsolateral prefrontal cortex (DLPFC) Maynard et al. [36]. This data consists of 4 samples from each of 3 donors. The 4 samples from each donor consist of parallel sections (Fig. 3). Each section has a width of 10 and we label the sections as A–D, representing the physical ordering of the sections. The physical separation is as follows: the AB pair is separated by , the BC pair is separated by and the CD pair is separated by . Furthermore, the data are supplied with manual annotations of cortical layers based on expression and H&E staining, with labels of white matter (WM), cortical layers 1 to 6 and NA (which are excluded from our analysis). Not all layers are present in all samples; specifically, the 4 samples from individual 2 do not have layers 1 and 2 present.
Fig. 3.
Layer annotation of DLPFC data. Manually annotated layer for each spatial location in each sample from each donor in DLPFC dataset. Each donor contributes 4 different sections, arbitrarily labelled A–D. For samples from donor one and three, each sample has seven layers: layer 1–6 and white matter. The samples from donor 2 do not cover layers 1 and 2. While the manual annotation includes the NA (not available classification), these spatial locations are excluded from our analyses
We apply mNSF to all 12 samples using 10 spatial features. The model does not encode the design of the experiment with 4 sections from 3 donors. We expect that different samples from the same donor are more similar than different samples from different donors; this is true for the manual annotations of the samples (Fig. 3). We do not necessarily expect that different cortical layers form distinct “clusters” in expression space. For example, it is understood that some genes are expressed in a gradient across the cortical layers [37–39]. Nevertheless, we expect some relationship between cortical layers and mNSF spatial features. To compare the mNSF factors with the discrete manual annotation, we use the following approach: we group each spatial location in each sample according to its manual annotation and display each factor value across the layers (Fig. 4). The M6 factor displayed in Fig. 4 is particularly high in cortical layer 2, followed by blending into cortical layer 3. The lowest layer is cortical layer 1 and the factor is almost absent in white matter.
Fig. 4.
Cross-layer trend of the M6 factor in sample 1. The left plot depict the spatial distribution of factor M6 on the spatial locations from sample 1 (donor 1, section A). The middle plot depicts the distribution of this factor, stratified by which layer each spatial location belongs to (according to the manual layer annotation). The right plot depicts the manual annotation for this sample. The color scale in the left plot is the factor values whereas the color in the middle and left plot is used to depict the manual annotation
Figure 5 depicts 3 mNSF spatial features. Due to size restrictions, we display 4 samples, 1 sample from each donor as well as an additional parallel section from each donor. This display depicts both between-donor variability and between-sections-within-a-donor variability. For completeness, we depict all 12 samples and 10 factors in Additional file 1: Figs. S6–S15.
Fig. 5.
mNSF factors represent layers of the prefrontal cortex. mNSF A factor M5, B factor M2 and C factor M6 in sample 1 (donor 1, section A), sample 5 (donor 2, section A), sample 7 (donor 2, section B), and sample 12 (donor 3, section D)
Factor M6 has high values in the spatial locations manually annotated as layer 2, intermediate values in spatial locations manually annotated as layer 3 and close to zero values for spatial locations manually annotated as white matter (Fig. 5A). PatternMarker genes for this factor include HPCAL1 (Additional file 1: Table S2), which is a marker for layer 2 excitatory neurons [40]. This factor is consistent with the manual annotation across the 3 donors, and across parallel sections within each donor.
Factor M2 is consistently high for spatial locations annotated as white matter (Fig. 5B) and low otherwise. PatternMarker genes for this factor include known oligodendrocyte and myelin-associated genes, such as MOG, MOBP, MBP, and BCAS1 [41, 42] (Additional file 1: Table S2), which are expected to be specific to white matter. This factor is consistent with the manual annotation across the 3 donors, and across parallel sections within each donor.
Factor M5 has high values in the spatial locations manually annotated as white matter in the samples from donor 1 and 3, but consistently low values across all the spatial locations in the samples from donor 2 (Fig. 5C). It therefore suggests an inconsistency, potentially in tissue processing or orientation, across the samples from different donors. PatternMarker genes for this factor include genes that mark oligodendrocytes and myelination (PLP1, TF, CNP, ENPP2) [41, 42], and genes potentially associated with neurovasculature, blood, and vascular endothelial cells (HBA2, HBB, CLDND1) [43] (Additional file 1: Table S2). We believe differences in this factor across donors may reflect variation in how tissue blocks were dissected where sections from donor 2 are cut at a more horizontal plane that does not contain layers 1 and 2. This was confirmed to be a possible interpretation by the original manual annotator (K. Maynard, personal communication). This factor represents a pattern only present in some, but not all samples, once again showcasing the ability of mNSF to identify such patterns.
Sensitivity analysis of the dispersion parameter demonstrated that factorization results remained stable across a range of values from 0.001 to 10. However, significant changes were observed for dispersion values between 10 and 100 (Additional file 1: Fig. S16). Additionally, the association between mNSF factors and manually annotated layers exhibited strong correlation for dispersion parameters up to 10 but weakened considerably for values larger than 100 (Additional file 1: Fig. S17). This analysis suggests that the factors identified by the model are robust across a wide range of dispersion parameter values.
To address the choice of the number of factors, we evaluated mNSF performance on DLPFC data using 4–20 factors, guided by three metrics: (a) factor-layer association (b) adjusted Rand index (ARI) for domain identification (c) goodness-of-fit (Poisson deviance). We observed increasing cluster stability as more factors are included.(Additional file 1: Fig. S18). We chose to use 10 factors for our analysis of the DLPFC dataset, as this provides a good balance between model complexity and performance.
We conclude that mNSF shows encouraging performance on this dataset. It produces factors which make biological sense and are consistent across parallel sections within the same tissue block. Many of the factors are also consistent across the donors. However, the scale of the factors sometime vary (see white matter for factor M2, Fig. 5). This might reflect variability in the manual annotation, biological variability between samples or unwanted (technical) variation which might be possible to remove with additional normalization.
Comparison to spatial alignment
The DLPFC dataset is an excellent candidate for spatial alignment. However, considering the manual annotations (Fig. 3) it is clear that aligning different sections from the same donor is much easier than aligning different sections from different donors. For example, layers 1 and 2 are absent in donor 2 and this complicates spatial alignment. Zeira et al. [16] describes PASTE, a method for spatial alignment, and apply PASTE to the DLPFC dataset to perform direct alignment of these samples. However, for exactly the challenges described above, Zeira et al. [16] only attempt to align samples from the same donor to each other, doing both pairwise and 4-sample alignment. Using these data allows us to compare mNSF to spatial alignment on a dataset which is particularly well suited for spatial alignment.
Specifically, we compare the performance of mNSF and PASTE-NSF on identical sample pairs. This approach, which we term pasteNSF, involves using PASTE to align samples followed by NSF on the aligned samples. Following Zeira et al. [16], we focus on pairwise alignment, considering only adjacent section pairs in our evaluation. Specifically, both methods are applied to the same set of two adjacent section pairs (either AB, BC, or CD) for each donor. For instance, when analyzing the BC pair from Donor 1, mNSF utilizes factors specific to the BC-Donor1 case, while PASTE-NSF employs its alignment followed by NSF, also using factors specific to the BC-Donor1 case. By conducting the comparison in this manner, we ensure a direct and fair assessment of both methods’ effectiveness on the same data.
We expect, and this is confirmed by the authors, that PASTE performs best with adjacent tissue sections separated by (ie. comparisons AB or CD). In this analysis we use 10 spatial spatial features. Following factorization, we use a multinomial model to predict the 5 or 7 manually annotated layers (depending on the donor) as a function of the 10 estimated spatial features, and we use the model fit to assess performance. The model fit measures the association between the 10 inferred factors and the manual annotation.
This approach shows that mNSF has comparable performance to pasteNSF (Fig. 6A–B). PASTE supplies the user with a mapping score which represents how well the spatial alignment is performed (higher is better). The pair with the lowest mapping score (donor 1, BC) is the pair where mNSF outperforms pasteNSF the most (Fig. 6C–D).
Fig. 6.
A comparison of mNSF to spatial alignment using PASTE. The performance of mNSF compared with applying NSF on the aligned coordinates by PASTE. Each consecutive sections (within donor) (i.e., sections AB, sections BC or section CD) are used to fit either mNSF, or a combination of PASTE alignment and NSF, using three spatial features. For the factor sets inferred from each pair of samples in each method, we run a multinomial regression of the layer as a categorical variable, against the three spatial features, and calculated the A the accuracy of layer prediction using a multinomial model (“Methods” section, higher is better). B The multinomial model deviance (lower is better, “Methods” section). C Difference in the layer prediction accuracy between mNSF and the combination of PASTE and NSF, compared to the mapping score of PASTE. D Difference in the deviance of the multinomial regression between mNSF and the combination of PASTE and NSF, compared to the mapping score of PASTE. The BC sections, which are separated by , are highlighted in red
In summary, mNSF has at least comparable performance to PASTE followed by NSF when spatial alignment is easy, but extends factorization to data where spatial alignment is hard (between donors in the DLPFC dataset) or impossible (between anterior/posterior sections in the mouse sagittal dataset).
Comparison with SpatialPCA
We performed a comparative evaluation of domain identification capabilities between mNSF and SpatialPCA using the DLPFC dataset, with the manually annotated layers providing ground truth. Both methods used 10 factors for consistency. Following the SpatialPCA tutorial, we used walktrap clustering to cluster the continuous factors into discrete sets [44, 45]. Using these discrete sets, we measured the performance using the Adjusted Rand Index (ARI) (Additional file 1: Fig. S19), with ARI values predominantly within the 0.30–0.45 range. Our results show comparable proficiency in identifying spatial domains within this dataset.
Next, we compared spatialPCA and mNSF at domain identification using data across two different technologies, Visium and Slide-seq. We used mouse brain samples containing the cerebellum, from two different datasets. The datasets included a Visium-based mouse sagittal brain dataset focusing on the posterior half (posterior S1, previously described) and Slide-seq data of mouse cerebellum obtained from the Broad Single Cell Portal (ID SCP354) [3], with processed data as reported in the SpatialPCA manuscript Shang and Zhou [22]. This brain region has a very distinctive shape and location which allow us to roughly assess whether the domain is consistent across the samples, even in the absence of a known ground truth. Both methods used 10 factors. We then employed k-NN clustering for domain identification, which requires pre-specification of the number of domains. Using 5 domains, mNSF roughly identifies the cerebellum in the two samples, whereas spatialPCA fails at this task (Fig. 7). However, the performance of the mNSF is sensitive to the pre-specified number of domains. Using more domains than 5, the visual agreement between Visium and SlideSeq is not as striking as depicted in Fig. 7. In contrast, we were unable to obtain decent results using SpatialPCA despite varying the number of domains. This comparison suggests that mNSF’s have better performance in identifying biologically relevant spatial domains across different technological platforms.
Fig. 7.
Comparison of domain identification performance between mNSF and SpatialPCA in cross-platform data analysis of mouse cerebellum. Predicted layers by A mNSF and B SpatialPCA for the joint analysis of Slide-seq and Visium data. C Expression level of Pcp2, marker gene of cerebellum in each dataset
Memory and time consumption
To investigate memory and time consumption, we ran mNSF multiple times on datasets of various sizes (Fig. 8) using a GPU. The memory usage has a linear relationship with the number of factors, genes, samples and a quadratic relationship with the number of spots within each sample. Run time has a linear relationship with the number of factors, genes and samples and a quadratic relationship with the number of spots within each sample. The number of genes and the number of factors only has a small impact on run time.
Fig. 8.
Memory and time profile of mNSF. A, B Memory usage and runtime vs. number of factors. C, D Memory usage and runtime vs. number of genes. E, F Memory usage and runtime vs. number of samples. G, H Memory usage and runtime vs. number of spots within each sample. Each experiment is repeated three times. Points represent the average memory usage (or runtime) across three runs. The blue shaded area indicates the 95% confidence interval based on 10 training sets
Resource consumption will grow linearly with larger sample size and quadratically with higher resolution. In addition, mNSF is built on tensorflow and is therefore designed to be used on GPUs which have lower memory limits than CPUs. For these reasons, we implemented two strategies to reduce (primarily) memory consumption. The first is the use of inducing points, which was already part of the NSF method. Inducing points is a standard approach used with Gaussian processes: instead of using all data points, they use a smaller set of points for prediction. The second we call chunking and it consists of making multiple passes over the data, each time only using a random set of points. In each pass over random sets of points, we keep the gene weights unchanged (“Methods” section). Chunking and inducing points can be combined. We examined how computational resources scale with the percentage of induced points and number of chunks used in the analysis (Fig. 8). Memory usage increases linearly with the percentage of induced points, while runtime shows a moderate increase. When increasing the number of chunks, memory usage shows a clear decrease, demonstrating that chunking can be an effective strategy for reducing memory requirements. However, this comes with a trade-off of increased runtime, which rises with the number of chunks.
To understand how model performance scales with these computational parameters, we evaluated mNSF using different percentages of induced points and numbers of data chunks. Using the DLPFC dataset as a benchmark, we assessed both layer prediction accuracy and deviance (Additional file 1: Fig. S20). The results show that accuracy remains relatively stable down to 25% induced points before declining, while chunking the data into more pieces gradually reduces accuracy. This suggests a trade-off between computational efficiency and model performance. We caution that this trade-off is likely to be dataset dependent.
To examine factor stability under different computational settings, we calculated correlations between factors obtained using different numbers of data chunks (Additional file 1: Fig. S21). While factors remain generally consistent with 2–4 chunks, using 10 chunks leads to decreased correlations, particularly for some factors. This indicates that excessive chunking may compromise the model’s ability to capture certain spatial patterns.
We performed a similar analysis varying the percentage of induced points used (Additional file 1: Fig. S22). The correlation matrices demonstrate that most factors remain highly stable even when reducing induced points to 25%, with correlation values typically above 0.8. However, further reduction to 5–10% induced points results in decreased factor stability for some patterns. This analysis helps guide parameter selection by showing where computational savings can be achieved while maintaining factor stability.
The memory and time consumption for the analysis used in this paper is in Additional file 1: Table S3.
Discussion
In this study, we describe a general approach to extending a matrix factorization method to multi-sample datasets. Using this approach, we extended non-negative spatial matrix factorization (NSF) by Townes and Engelhardt [20] to spatial transcriptomics datasets with multiple samples. Our model allows for a sample-specific spatial dependence structure. Our method bypasses the need to align factors between samples into a consensus coordinate system, which is a challenging problem. Both real and simulated data analysis support that the method yields usable results when applied to data from multiple sources, even if it is impossible to perform spatial alignment. Classic matrix factorization methods are widely used in expression analysis and it is well recognized that it is hard to identify the biological or technical process(es) associated with each factor or pattern. Our method retains this limitation.
There are multiple possible downstream applications of our method, including spatial domain detection. These applications are left for future work. Our evaluation is focused on comparing factors to known anatomical regions, and we have not considered the impact on downstream analyses. Nevertheless, we believe that our method can serve as a foundation or input to downstream analysis of multi-sample data.
Batch effects could cause differences in spatial patterns between samples. Such differences would appear as factors which are variable across samples. Our current method cannot distinguish batch effects from biological variation. It will be an important question for future research to appropriately model and correct batch effects in spatially resolved transcriptomics data.
In applications, researchers are sometimes aware of existing patterns or factors across samples, either at sample level or at the level of spatial features. Accounting for such known biology will require the application of a semi-supervised matrix factorization method. Such methods have been suggested for other analysis domains [46]. We believe it will be important to develop such models for spatially resolved transcriptomics data and it is likely that our framework will allow for the extension of such models to multiple samples.
Conclusions
Here, we provide an alignment-free framework for generalizing a one-sample spatial factorization model to multi-sample data. In simulations, our method is capable of identifying spatial structures which are rotated between samples, as well as structures which only appear in some, but not all, samples. Using real data, we show our method is capable of identifying matched functional regions in multi-sample spatial transcriptomics data.
Methods
A general approach to multi-sample spatial factorization
Spatially resolved transcriptomic (SRT) data for a single sample can be represented as a matrix of expression measures indexed by genes g with associated spatial (physical) location . We use the index i to index the spatial locations.
Consider a standard non-negative matrix factorization model applied to SRT data on a single sample:
In this model, we decompose the expression values into a term representing genes and a term representing the spatial locations. The model does not impose any kind of spatial structure on . This model is widely used for non-spatially resolved bulk and single-cell transcriptomic data. Adapting this model to spatial data is usually done by additional requirements on the terms to account for expected spatial dependence. Such extensions are considered below, but for the sake of clarity, we first consider a matrix factorization model without spatial dependency.
Our suggested approach to extend this model across M samples is to use the following (simplified) model
where the gene loadings are shared across samples, but the (spatial) factors are sample-specific. The model is easy to fit using standard software for non-negative matrix factorization models, by concatenating the involved matrices:
where is concatenation. This is possible because of the simple model formulation where we do not impose spatial structure on the spatial features.
As argued by Townes and Engelhardt [20], this model could be improved by (a) incorporating the digital (discrete) nature of the expression data and (b) modelling the spatial dependence between spots.
As a first step towards a better model for SRT data, Townes and Engelhardt [20] describes probabilistic NMF (PNMF) which models the discrete nature of digital expression data using a Negative Binomial distribution but does not address the spatial dependence. We propose a multi-sample version of PNMF (mPNMF) specified as
Here, is a vector of known sample-specific size factors (one for each spot), and are factor-parameters which are not sample-specific. Any software that fits single-sample PNMF can fit multi-sample PNMF by concatenating the data matrices.
To model the spatial dependence of SRT data, Townes and Engelhardt [20] develops non-negative spatial factorization (NSF) by using a Gaussian process to model the spatial factors . We propose a multi-sample version of this model (mNSF), which is stated as follows (Fig. 9):
where is a sample- and factor-specific mean function and is a sample- and factor-specific covariance kernel, both depending on the sample-specific vector of spatial locations . Unlike the multi-sample version of PNMF, fitting mNSF requires sample-specific parameterization to handle the sample-specific spatial features.
Fig. 9.
Multi-sample NSF model for a dataset with two samples and three spatial features. As input, we use expression data from the two samples where rows are genes and columns are spatial spots. Each factor is modeled by a Gaussian process which represents the spatial dependency between spots; this process has sample-specific parameters. The gene loadings are shared between the two samples
We have implemented mNSF by extending the code provided by Townes and Engelhardt [20]. Our extension includes the interpolated version of NSF, where the model is fit using a subset of spatial locations which are then interpolated to encompass the entire data matrix.
Data processing
Mouse sagittal section data
The spot-level gene expression counts data, as well as the 2-dimensional coordinates denoting the position of each spot, are downloaded from 10X website: https://cf.10xgenomics.com/samples/spatial-exp/1.1.0/V1_Mouse_Brain_Sagittal_Anterior_Section_1, https://cf.10xgenomics.com/samples/spatial-exp/1.1.0/V1_Mouse_Brain_Sagittal_Anterior_Section_2, https://cf.10xgenomics.com/samples/spatial-exp/1.1.0/V1_Mouse_Brain_Sagittal_Posterior_Section_1, and https://cf.10xgenomics.com/samples/spatial-exp/1.1.0/V1_Mouse_Brain_Sagittal_Posterior_Section_2.
Top 500 genes are selected based on the maximal Poisson deviance of each gene across all four samples, calculated by a built-in function in NSF package (see code on GitHub for details).
DLPFC data
DLPFC dataset are downloaded from the SpatialExperiment [47] package. Top 500 genes are selected based on the maximal Poisson deviance of each gene across all twelve samples (see code on GitHub for details).
PASTE alignment
For each sample pair used in this study, spatial locations on the pairwise aligned coordinate system are downloaded from PASTE GitHub website: https://github.com/raphael-group/paste_reproducibility
The 500 genes selected in the 12-sample mNSF analysis for DLPFC data are used for this analysis.
For each sample pair used in this study, one-sample NSF is applied on the aligned data, i.e., the concatenated gene expression matrix of the two samples as well as the coordinate of each spatial location in each sample on the aligned coordinate system. mNSF is then used on the unaligned data, i.e., the gene expression matrix of each sample as well as the coordinate of the spatial locations in each sample in the original coordinate system.
Models without induced points
One-sample NSF
As reference, we describe the one-sample NSF model proposed in [20]. Briefly, the model assumes that the log value of each non-negative factor follows a Gaussian process across the spatial locations in the sample, with the intercept equal to a linear combination of the coordinates. The gene expression level in each spatial location follows a Poisson distribution with the mean equal to the product of a loading matrix and the factor matrix, multiplied by the size factor (i.e., library size) of the spot.
Assume there are n spatial locations in total at locations measuring G genes. We will use L to denote the number of non-negative spatial spatial features; this is a user-supplied parameter.
Let denote the observed count value for gene and spatial location i. It is assumed to follow a Negative Binomial distribution
Here, is a known size factor for spatial location i, is the dispersion parameter and
Here, denotes the loading of gene g for the factor, and denotes the value of the factor at spatial location i.
The value of factor on the observed spatial locations follows a Gaussian process distribution with a linear mean and a Matern kernel for covariance. In particular, this means that – the specification of the factor on the grid follows a normal distribution
with
and
where the kernel function for the factor is of Matern class,
Here
denotes the distance between two spatial locations with coordinates and , both of which are vectors of length 2.
In summary, is an matrix denoting the correlation of , is the length scale parameter and is the amplitude parameter for the kernel of Gaussian process.
Multi-sample NSF
If no interpolation is used, multi-sample NSF assumes that the log value of each non-negative factor follows a Gaussian Process across the spatial locations in each sample, with the intercept equals a linear combination of the coordinates. And the gene expression level in each spatial location follows a Poisson distribution with the mean equals a weighted sum of the factors multiplied by the size factor (i.e., library size) of the spot, where the weights are shared across different samples and the other parameters are all sample-specific.
Assume there are spatial locations in total at locations , G genes used, and L non-negative spatial spatial features.
The observed count value for gene at spatial location in the sample, denotes as , follows a Negative Binomial distribution
where is the scale factor of spatial location i, is the dispersion parameter, and
Here, denotes the loading of gene j for the factor, and denotes the value of the factor at spatial location i in the sample.
The value of factor on the observed spatial locations X conditional on follows a GP distribution
Here, is an matrix denoting the correlation of , with
where the kernel function for the factor in sample m is
where is the length scale parameter and is the amplitude parameter for the kernel of Gaussian Process for the lth factor in the sample.
Models with inducing points
One-sample NSF
If a set of interpolated points is used, one-sample NSF assumes that the log value of each non-negative factor follows a Gaussian Process across both the observed and interpolated spots, with the mean equals a linear combination of the coordinates. A set of parameters are created for the interpolated points, and the posterior distribution of the observed point conditional on the interpolated points is derived. The overall likelihood of both the observed and interpolated points is calculated through the likelihood of the interpolated points and the posterior likelihood of the observed points.
Assume there are N spatial locations in total at locations , J genes used, n spatial locations interpolated at locations , and L non-negative spatial spatial features.
The observed count value for gene at spot, denotes as , follows a Negative Binomial distribution
where is the scale factor of spatial location i, is the dispersion parameter, and
Here, denotes the loading of gene j for the factor, and denotes the value of the factor at spatial location i.
The distribution of (i.e. the value of the lth factor on the induced points) and follows a Gaussian Process distribution,
where
Here, is an matrix denoting the correlation of , is an matrix denoting the correlation of , and is an matrix denoting the correlation between and .
where the kernel function for the factor is
where is the length scale parameter and is the amplitude parameter for the kernel of Gaussian Process for the lth factor.
Decomposing the joint distribution of and into and , we have
where could be derived by
and could be derived by
Multi-sample NSF
In multi-sample NSF, for each factor, the loading of the same gene is shared across the samples, while all the other parameters are sample-specific. The observed data from different samples are assumed to be independent.
Assume there are K samples, with the sample containing spatial locations at , interpolated points at . The same set of G genes are used in all the samples. Assume there are L non-negative spatial factors for each sample, with the loadings of those G genes for each factor shared by samples.
For sample m, the observed count value for gene at spot, denotes as , follows a Negative Binomial distribution
where is the scale factor of spatial location i in sample m, is the dispersion parameter of sample m, and
Here, denotes the loading of gene j for the factor for sample m, and denotes the value of the factor at spatial location i in sample m.
The value of the factor on the interpolated locations of sample k are assumed to follow a GP distribution
.
The distribution of and follows a Gaussian Process distribution,
where
Here, is an matrix denoting the correlation of , is an matrix denoting the correlation of , and is an matrix denoting the correlation between and .
where the kernel function for the factor in the sample is
where is the length scale parameter for sample m and is the amplitude parameter for the kernel of Gaussian Process for sample m for the lth factor.
Decomposing the joint distribution of and into and , we have
where could be derived by
and could be derived by
Model fitting
Firstly, let’s assume a one-sample data with the distribution in the same form of one-sample NSF, as described in the first subsection under the “Methods” section, and discuss its model fitting approach.
For one-sample spatial data, in NSF paper, it has been shown that by maximizing the following function (called ELBO function), we will get the MLE estimates of all the parameters involved in the model [20]:
1 |
where denotes the parameter space, is defined by letting , and is the product of the posterior likelihood of conditional on , denoted as , and the approximated likelihood of , denoted as .
Next, we will discuss the model fitting approach for multi-sample data, where the distribution of the data is in the same form of the mNSF model.
The statement that “maximizing the ELBO function will give us the MLE estimates of all parameters involved in the model” is hold in general regardless of the form of distribution settings, such statement also holds for a data that is concatenated by data from multiple samples, where each data has the same form of distribution but with different values of parameters, i.e.,
2 |
with
and where is the observed data at all spatial locations in sample m, denotes the latent factors at induced points in sample m, is the factor at all spatial locations in sample m, is the spatial locations in sample m, and is the induced points in sample m.
As discussed in the last paragraph, the statement “maximizing the ELBO function will give us the MLE estimates of all parameters involved in the model” holds true for function (2), so in the next step, we will discuss the approach to maximize the function (2) above.
One way to maximize function (2) is using “Adam algorithm (Kingma and Ba, 2014) with gradients computed by automatic differentiation in Tensorflow” [20], which calculate the gradient of a target function with respect to a set of parameters and update the parameters by adding to each of the parameter where s denote the ’step size’ (a constant scalar that has the same value for fitting different parameters) in the gradient approach and g denotes the gradient of a parameter. To satisfy the non-negativity constraint of W parameter, we can set any negative values in W to zero after the parameters’ update in each iteration.
In the setting that the distributions of data from different samples are independent, we can re-write function (2) as
3 |
The Eq. (3) above suggests that, in terms of the gradient calculation and the parameters update within each iteration of applying Adam gradient approach in the multi-sample model fitting, it equals to
-
Step 1:
calculate the gradient of the parameters involved in each sample, only using the data of the corresponding sample;
-
Step 2:
for the parameters that are sample-specific, update those parameters in the same way of fitting one-sample NSF model; for the parameters that are shared by samples (here for mNSF model, it is the loadings parameter ), the gradient of this parameter for function (3) equals the sum of the gradients of the parameter across m samples.
Note that as long as the ’step size’ parameter are the same for the individual sample’s model fitting, the sample-specific parameter fitting in “Step 2” equals:
Step 2*: for the sample-specific parameters, update those parameters separately in the same way of fitting one-sample NSF model, (here in mNSF model, we will get M sets of updated Ws, written as ), then average those updated parameters to get the updated parameter with respect to the full model (here for mNSF model, the updated W parameter can be calculated by )
Based on all the discussions above in this subsection, we can draw the conclusion that the following two model fitting process will give us the same parameter estimates:
Process 1: maximize the ELBO function of the full mNSF model, using Adam gradient approach with step size of s and updating the parameters with 100 iterations, where at the end of each iteration, set the non-negative values in the averaged to zeros.
Process 2: repeat the following parameter updating step for up to 1000 iterations, until converge: for each sample: firstly do the parameter updates in the same way as one iteration in “Process 1” excluding the step of setting the negative values in to zero; then for parameter , get the average of its updated value across the M samples, set the non-negative values in the averaged to zeros, and use this non-negative as the updated
In mNSF, we use “Process 2” to fit the model, which will, assuming the approximated likelihoods used in NSF model fitting are close enough to the non-approximated likelihoods, give us a estimate of parameters that is close to the MLE estimates of the model.
Data chunking implementation
Mini-batch processing, where data is divided into smaller subsets for parallel processing, has become ubiquitous in deep learning due to its efficiency in handling large datasets while maintaining model convergence. Drawing inspiration from this approach, mNSF implements data chunking for analyzing large spatial transcriptomics datasets. This approach divides the spatial locations into multiple chunks and processes them in parallel, providing another layer of computational efficiency on top of the inducing points method.
In our implementation, the spatial locations in each sample are first randomly partitioned into several chunks. Each chunk contains a subset of the original spatial locations. The number of chunks can be specified by the user based on their computational resources and dataset characteristics. This chunking is performed independently for each sample.
The model is then fit to each chunk separately while maintaining the shared gene loadings across all chunks and samples. Each chunk only needs to store and process the corresponding subset of the data. The results from different chunks are then combined during the parameter update step of the model fitting process. Importantly, the gene loadings remain shared across all chunks and samples, ensuring that the biological interpretation of the factors remains consistent across the entire dataset.
The distributed points approach can be used in combination with inducing points, providing two complementary strategies for handling large datasets.
Parameter selection for induced points and data chunks
The inducing points approach, implemented in the NSF by Townes and Engelhardt [20], provides an efficient approximation method for handling spatial correlations. Our implementation of mNSF extends this approach while adding data chunking as a complementary strategy for computational efficiency.
While our implementation of mNSF provides both induced points and data chunking for computational efficiency, selecting appropriate parameters for these approximations requires careful consideration. The optimal parameters depend on both computational constraints and data characteristics. The key trade-off is between the number of induced points (U) and number of chunks (K) per sample, as both parameters affect memory usage and computational efficiency, with induced points primarily impacting model accuracy and chunks affecting parallel processing capability.
In our cross-platform analysis of mouse brain data (Visium sagittal brain and Slide-seq cerebellum), we used different parameter settings for each technology due to their distinct characteristics. For the Visium data (N = 3355 spots), we selected 35% of spatial locations as induced points and processed the data as a single chunk. For the larger Slide-seq dataset (N = 25,415 spots), we maintained the same proportion of induced points but divided the data into 10 chunks. This parameter selection was guided by multiple considerations: the cerebellum’s distinctive layered structure requires sufficient induced points to capture its complexity, while the large number of Slide-seq spots benefits from parallel processing through chunking.
For the analyses of both the DLPFC dataset (12 samples, 3500 spots per sample) and the mouse sagittal brain dataset (4 samples, 2500–4000 spots per sample), we selected 35% of spatial locations as induced points and processed each dataset as a single chunk. This parameter choice balances computational efficiency with model accuracy. The relatively moderate size of these datasets (less than 5000 spots per sample) and their well-defined anatomical structures (cortical layers in DLPFC, distinct brain regions in sagittal sections) made this configuration suitable. The 35% induced points provided sufficient coverage to capture the spatial patterns while maintaining reasonable computational demands. Single-chunk processing was appropriate given the manageable sample sizes and the importance of preserving spatial relationships within each sample.
These parameters should be adjusted based on specific dataset characteristics. Datasets with fine spatial structures (e.g., cortical layers) may require more induced points while potentially using more chunks. In contrast, datasets with large homogeneous regions may achieve good results with fewer induced points. When samples vary greatly in size or complexity, consider sample-specific parameters. Users should validate parameter choices by comparing results with different settings on a subset of data, monitoring convergence behavior, and assessing whether anatomically meaningful patterns are preserved.
Supplementary Information
Additional file 1: Contains Figs. S1–S22 and Tables S1–S3
Acknowledgements
Yuanhua Huang and Andrew Cosgrove were the primary editors of this article and managed its editorial process and peer review in collaboration with the rest of the editorial team. The peer-review history is available in the online version of this article.
Authors’ contributions
YW developed the method and performed analyses, supervised by KDH. KW, CS, GSO, and LAG provided feedback on the method, software, evaluation and interpretation of the method. YW and KDH wrote the manuscript, with feedback and help from all other authors.
Funding
Research reported in this publication was supported by the National Institute of General Medical Sciences of the National Institutes of Health under award number R35GM149323 (YW, KDH), the National Institute on Aging of the National Institutes of Health under award numbers R01 AG066768 (KW, CS, LAG), R01 AG072305 (LAG), the National Institute of Neurological Disorders and Stroke of the National Institutes of Health under award number R00 NS122085 (CS, GSO), the National Cancer Institute of the National Institutes of Health under award number U01 CA284090 (GSO), the Raynor Foundation (GSO), and the Chan Zuckerberg Initiative DAF, an advised fund of the Silicon Valley Community Foundation under award CZF2019-002443 (YW, KDH). The content is solely the authors’ responsibility and does not necessarily represent the official views of the National Institutes of Health or the National Science Foundation.
Data availability
The Visium data for DLPFC is available for download through spatialLIBD package [48]. The Visium data for mouse sagittal section is available through 10X portal [49]. Slide-seq V2 data are available at Broad Institutes single-cell repository with ID SCP815 [50]. We used code for generating the aligned spatial coordinates using PASTE available through GitHub (https://github.com/raphael-group/paste_reproducibility).
All code to analyze the data and generate figures is available at GitHub [51] and Zenodo [52] (including an archive of the software package).
The mNSF package described in this manuscript is available at GitHub [53]. The software is licensed under GNU Lesser General Public License v3.0.
Declarations
Ethics approval and consent to participate
Not applicable.
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1.Chen KH, Boettiger AN, Moffitt JR, Wang S, Zhuang X. RNA imaging. Spatially resolved, highly multiplexed RNA profiling in single cells. Science. 2015;348(6233):aaa6090. 10.1126/science.aaa6090. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2.Ståhl PL, Salmén F, Vickovic S, Lundmark A, Navarro JF, Magnusson J, et al. Visualization and analysis of gene expression in tissue sections by spatial transcriptomics. Science. 2016;353(6294):78–82. 10.1126/science.aaf2403. [DOI] [PubMed] [Google Scholar]
- 3.Rodriques SG, Stickels RR, Goeva A, Martin CA, Murray E, Vanderburg CR, et al. Slide-seq: A scalable technology for measuring genome-wide expression at high spatial resolution. Science. 2019;363(6434):1463–7. 10.1126/science.aaw1219. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Stickels RR, Murray E, Kumar P, Li J, Marshall JL, Di Bella DJ, et al. Highly sensitive spatial transcriptomics at near-cellular resolution with Slide-seqV2. Nat Biotechnol. 2021;39(3):313–9. 10.1038/s41587-020-0739-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Lee Y, Bogdanoff D, Wang Y, Hartoularos GC, Woo JM, Mowery CT, et al. XYZeq: Spatially resolved single-cell RNA sequencing reveals expression heterogeneity in the tumor microenvironment. Sci Adv. 2021;7(17):eabg4755. 10.1126/sciadv.abg4755. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Zhao T, Chiang ZD, Morriss JW, LaFave LM, Murray EM, Del Priore I, et al. Spatial genomics enables multi-modal study of clonal heterogeneity in tissues. Nature. 2022;601(7891):85–91. 10.1038/s41586-021-04217-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Lubeck E, Cai L. Single-cell systems biology by super-resolution imaging and combinatorial labeling. Nat Methods. 2012;9(7):743–8. 10.1038/nmeth.2069. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.Eng CHL, Lawson M, Zhu Q, Dries R, Koulena N, Takei Y, et al. Transcriptome-scale super-resolved imaging in tissues by RNA seqFISH. Nature. 2019;568(7751):235–9. 10.1038/s41586-019-1049-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Goltsev Y, Samusik N, Kennedy-Darling J, Bhate S, Hale M, Vazquez G, et al. Deep Profiling of Mouse Splenic Architecture with CODEX Multiplexed Imaging. Cell. 2018;174(4):968-981.e15. 10.1016/j.cell.2018.07.010. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.Keren L, Bosse M, Thompson S, Risom T, Vijayaragavan K, McCaffrey E, et al. MIBI-TOF: A multiplexed imaging platform relates cellular phenotypes and tissue structure. Sci Adv. 2019;5(10):eaax5851. 10.1126/sciadv.aax5851. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11.Thornton CA, Mulqueen RM, Torkenczy KA, Nishida A, Lowenstein EG, Fields AJ, et al. Spatially mapped single-cell chromatin accessibility. Nat Commun. 2021;12(1):1274. 10.1038/s41467-021-21515-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12.Schurch NJ, Schofield P, Gierliński M, Cole C, Sherstnev A, Singh V, et al. How many biological replicates are needed in an RNA-seq experiment and which differential expression tool should you use? RNA. 2016;22(6):839–951. 10.1261/rna.053959.115. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Mendelevich A, Vinogradova S, Gupta S, Mironov AA, Sunyaev SR, Gimelbrant AA. Replicate sequencing libraries are important for quantification of allelic imbalance. Nat Commun. 2021;12(1):3370. 10.1038/s41467-021-23544-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Hansen KD, Wu Z, Irizarry RA, Leek JT. Sequencing technology does not eliminate biological variability. Nat Biotechnol. 2011;29(7):572–3. 10.1038/nbt.1910. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Ombao H, Lindquist M, Thompson W, Aston J. A Tutorial for Multisequence Clinical Structural Brain MRI. In: Ombao H, Lindquist M, Thompson W, Aston J, editors. Handbook of Neuroimaging Data Analysis. Chapman & Hall/CRC; Boca Raton, FL, USA. 2019. p. 2.
- 16.Zeira R, Land M, Strzalkowski A, Raphael BJ. Alignment and integration of spatial transcriptomics data. Nat Methods. 2022;19(5):567–75. 10.1038/s41592-022-01459-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.Liu X, Zeira R, Raphael BJ. Partial alignment of multislice spatially resolved transcriptomics data. Genome Res. 2023;33(7):1124–32. 10.1101/gr.277670.123. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18.Clifton K, Anant M, Aihara G, Atta L, Aimiuwu OK, Kebschull JM, et al. STalign: Alignment of spatial transcriptomics data using diffeomorphic metric mapping. Nat Commun. 2023;14(1):8123. 10.1038/s41467-023-43915-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19.Jones A, Townes FW, Li D, Engelhardt BE. Alignment of spatial genomics data using deep Gaussian processes. Nat Methods. 2023;20(9):1379–87. 10.1038/s41592-023-01972-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20.Townes FW, Engelhardt BE. Nonnegative spatial factorization applied to spatial genomics. Nat Methods. 2023;20(2):229–38. 10.1038/s41592-022-01687-w. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21.Velten B, Braunger JM, Argelaguet R, Arnol D, Wirbel J, Bredikhin D, et al. Identifying temporal and spatial patterns of variation from multimodal data using MEFISTO. Nat Methods. 2022;19(2):179–86. 10.1038/s41592-021-01343-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22.Shang L, Zhou X. Spatially aware dimension reduction for spatial transcriptomics. Nat Commun. 2022;13(1):7203. 10.1038/s41467-022-34879-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23.Fertig EJ, Ding J, Favorov AV, Parmigiani G, Ochs MF. CoGAPS: an R/C++ package to identify patterns and biological process activity in transcriptomic data. Bioinformatics. 2010;26(21):2792–3. 10.1093/bioinformatics/btq503. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24.Chen X, Du Y, Broussard GJ, Kislin M, Yuede CM, Zhang S, et al. Transcriptomic mapping uncovers Purkinje neuron plasticity driving learning. Nature. 2022;605(7911):722–7. 10.1038/s41586-022-04711-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 25.Aruga J, Minowa O, Yaginuma H, Kuno J, Nagai T, Noda T, et al. Mouse Zic1 is involved in cerebellar development. J Neurosci. 1998;18(1):284–93. 10.1523/JNEUROSCI.18-01-00284.1998. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26.Bliss TV, Collingridge GL. A synaptic model of memory: long-term potentiation in the hippocampus. Nature. 1993;361(6407):31–9. 10.1038/361031a0. [DOI] [PubMed] [Google Scholar]
- 27.Saper CB, Lowell BB. The hypothalamus. Curr Biol. 2014;24(23):R1111-6. 10.1016/j.cub.2014.10.023. [DOI] [PubMed] [Google Scholar]
- 28.Dietrich MO, Horvath TL. Hypothalamic control of energy balance: insights into the role of synaptic plasticity. Trends Neurosci. 2013;36(2):65–73. 10.1016/j.tins.2012.12.005. [DOI] [PubMed] [Google Scholar]
- 29.Serrenho D, Santos SD, Carvalho AL. The Role of Ghrelin in Regulating Synaptic Function and Plasticity of Feeding-Associated Circuits. Front Cell Neurosci. 2019;13:205. 10.3389/fncel.2019.00205. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 30.Bains JS, Wamsteeker Cusulin JI, Inoue W. Stress-related synaptic plasticity in the hypothalamus. Nat Rev Neurosci. 2015;16(7):377–88. 10.1038/nrn3881. [DOI] [PubMed] [Google Scholar]
- 31.Horvath TL. Synaptic plasticity in energy balance regulation. Obesity. 2006;14(Suppl 5):228S-233S. 10.1038/oby.2006.314. [DOI] [PubMed] [Google Scholar]
- 32.Herring BE, Shi Y, Suh YH, Zheng CY, Blankenship SM, Roche KW, et al. Cornichon proteins determine the subunit composition of synaptic AMPA receptors. Neuron. 2013;77(6):1083–96. 10.1016/j.neuron.2013.01.017. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 33.Yang L, Liu Q, Zhao Y, Lin N, Huang Y, Wang Q, et al. DExH-box helicase 9 modulates hippocampal synapses and regulates neuropathic pain. iScience. 2024;27(2):109016. 10.1016/j.isci.2024.109016. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 34.Nicolas G, Sévigny M, Lecoquierre F, Marguet F, Deschênes A, Del Pelaez MC, et al. A postzygotic de novo NCDN mutation identified in a sporadic FTLD patient results in neurochondrin haploinsufficiency and altered FUS granule dynamics. Acta Neuropathol Commun. 2022;10(1):20. 10.1186/s40478-022-01314-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 35.Lee SJ, Wei M, Zhang C, Maxeiner S, Pak C, Calado Botelho S, et al. Presynaptic Neuronal Pentraxin Receptor Organizes Excitatory and Inhibitory Synapses. J Neurosci. 2017;37(5):1062–80. 10.1523/JNEUROSCI.2768-16.2016. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 36.Maynard KR, Collado-Torres L, Weber LM, Uytingco C, Barry BK, Williams SR, et al. Transcriptome-scale spatial gene expression in the human dorsolateral prefrontal cortex. Nat Neurosci. 2021;24(3):425–36. 10.1038/s41593-020-00787-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 37.O’Leary DDM, Chou SJ, Sahara S. Area patterning of the mammalian cortex. Neuron. 2007;56(2):252–69. 10.1016/j.neuron.2007.10.010. [DOI] [PubMed] [Google Scholar]
- 38.Lau HYG, Fornito A, Fulcher BD. Scaling of gene transcriptional gradients with brain size across mouse development. NeuroImage. 2021;224:117395. 10.1016/j.neuroimage.2020.117395. [DOI] [PubMed] [Google Scholar]
- 39.Lodato S, Arlotta P. Generating neuronal diversity in the mammalian cerebral cortex. Ann Rev Cell Dev Biol. 2015;31:699–720. 10.1146/annurev-cellbio-100814-125353. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 40.Wei JR, Hao ZZ, Xu C, Huang M, Tang L, Xu N, et al. Identification of visual cortex cell types and species differences using single-cell RNA sequencing. Nat Commun. 2022;13(1):6902. 10.1038/s41467-022-34590-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 41.Cahoy JD, Emery B, Kaushal A, Foo LC, Zamanian JL, Christopherson KS, et al. A transcriptome database for astrocytes, neurons, and oligodendrocytes: a new resource for understanding brain development and function. J Neurosci. 2008;28(1):264–78. 10.1523/JNEUROSCI.4178-07.2008. [DOI] [PMC free article] [PubMed] [Google Scholar]
-
42.Plant LD, Xiong D, Dai H, Goldstein SAN. Individual I
channels at the surface of mammalian cells contain two KCNE1 accessory subunits. Proc Natl Acad Sci USA. 2014;111(14):E1438-46. 10.1073/pnas.1323548111. [DOI] [PMC free article] [PubMed]
- 43.Günzel D, Yu ASL. Claudins and the modulation of tight junction permeability. Physiol Rev. 2013;93(2):525–69. 10.1152/physrev.00019.2012. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 44.Pons P, Latapy M. Computing communities in large networks using random walks. In: Computer and Information Sciences - ISCIS 2005. Lecture notes in computer science. Berlin, Heidelberg: Springer Berlin Heidelberg; 2005. pp. 284–93.
- 45.Shang L, Zhou X. DLPFC data example. https://lulushang.org/SpatialPCA_Tutorial/DLPFC.html. Accessed 19 Sep 2024.
- 46.Haddock J, Kassab L, Li S, Kryshchenko A, Grotheer R, Sizikova E, et al. Semi-supervised Nonnegative Matrix Factorization for Document Classification. arXiv. 2022. p. 2203.03551. 10.48550/arXiv.2203.03551.
- 47.Righelli D, Weber LM, Crowell HL, Pardo B, Collado-Torres L, Ghazanfar S, et al. SpatialExperiment: infrastructure for spatially-resolved transcriptomics data in R using Bioconductor. Bioinformatics. 2022;38(11):3128–31. 10.1093/bioinformatics/btac299. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 48.Collado-Torres L, Maynard K, Jaffe A, Pardo B, Spangler A, Santiago J, et al. spatialLIBD. 2022. http://bioconductor.org/packages/spatialLIBD/. Accessed 07 Sep 2023.
- 49.10x Genomics. Sagittal Slice of the Anterior from C57BL/6 mice (v1). https://www.10xgenomics.com/datasets. Accessed 07 Sep 2023.
- 50.Robert S, Evan M, Pawan K, Jilong L, Jamie M, Daniela B, et al. Highly sensitive spatial transcriptomics at near-cellular resolution with Slide-seqV2. 2020. ID: SCP815. https://singlecell.broadinstitute.org/single_cell/study/SCP815/sensitive-spatial-genome-wide-expression-profiling-at-cellular-resolution#study-summary. Accessed 2 Nov 2023.
- 51.Wang Y, Hansen KD. mNSF paper. GitHub. 2024. https://github.com/hansenlab/mNSF_paper. Accessed 24 Sep 2024.
- 52.Wang Y, Hansen KD. mNSF paper. Zenodo. 2024. 10.5281/zenodo.15108678.
- 53.Wang Y, Hansen KD. mNSF. GitHub. 2024. https://github.com/hansenlab/mNSF. Accessed 1 April 2025.
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Supplementary Materials
Additional file 1: Contains Figs. S1–S22 and Tables S1–S3
Data Availability Statement
The Visium data for DLPFC is available for download through spatialLIBD package [48]. The Visium data for mouse sagittal section is available through 10X portal [49]. Slide-seq V2 data are available at Broad Institutes single-cell repository with ID SCP815 [50]. We used code for generating the aligned spatial coordinates using PASTE available through GitHub (https://github.com/raphael-group/paste_reproducibility).
All code to analyze the data and generate figures is available at GitHub [51] and Zenodo [52] (including an archive of the software package).
The mNSF package described in this manuscript is available at GitHub [53]. The software is licensed under GNU Lesser General Public License v3.0.