Abstract
It is important to model biological variation when analyzing spatial transcriptomics data from multiple samples. One approach to multi-sample analysis is to spatially align samples, but this is a challenging problem. Here, we provide an alignment-free framework for generalizing a one-sample spatial factorization model to multi-sample data. Using this framework, we develop a method, called multi-sample non-negative spatial factorization (mNSF) that extends the one-sample non-negative spatial factorization (NSF) framework to a multi-sample dataset. Our model allows for a sample-specific model for the spatial correlation structure and extracts a low-dimensional representation of the data. We illustrate the performance of mNSF by simulation studies and real data. mNSF identifies true factors in simulated data, identifies shared anatomical regions across samples in real data and reveals region-specific biological functions. mNSFs performance is similar to alignment based methods when alignment is possible, but extends analysis to situations where spatial alignment is impossible. We expect multi-sample factorization methods to be a powerful class of methods for analyzing spatially resolved transcriptomics data.
Background
Spatially resolved transcriptomics (SRT) measures gene expression levels in the context of spatial positions (KH Chen et al., 2015; Ståhl et al., 2016; Rodriques et al., 2019; Stickels et al., 2021; Y Lee et al., 2021; Zhao et al., 2022; Lubeck, Cai, 2012; Eng et al., 2019; Goltsev et al., 2018; Keren et al., 2019; Thornton et al., 2021), either at the single cell level or as a local aggregate of multiple cells across a spatial location, also termed a spot. The last 30 years of genomics have established that it is essential to consider biological replicates when trying to understand a biological system (Schurch et al., 2016; Mendelevich et al., 2021). Indeed, technology does not remove biological variation (Hansen et al., 2011).
Multisample (population-level) analysis of spatial data is common in functional magnetic resonance imaging (fMRI) brain data, and it is instructive to briefly review the approach in this field (FIXME, 2019). In fMRI analysis, the first step is to spatially align the samples to a common coordinate system (known as template-based alignment). The unit of measurements are 3D cubes known as “voxels”. Following alignment, analysis then proceeds separately for each voxel (or sometimes region), typically by using a general linear model across samples. For fMRI data, spatial alignment makes it possible to deploy standard statistical models for each voxel separately, substantially simplifying downstream analysis.
In spatially resolved transcriptomics, a number of methods for spatial alignment has been proposed, including PASTE (Zeira et al., 2022), PASTE2 (Liu et al., 2023), STalign (Clifton et al., 2023) and GPSA (Jones et al., 2023). Some of these methods align to a common coordinate system, others align the samples to each other. However, we posit that there are natural limitations to the potential success of this approach to multi-sample analysis. In fMRI imaging, alignment is helped by the fact that the whole brain is imaged in 3D in each sample. In contrast to fMRI data, the alignment of spatially resolved transcriptomics is complicated by the possibility that different samples may be collected from different anatomical areas and have differences in the shape, size, and rotation of the sections. Indeed, SRT samples can represent completely disjoint areas; in this case, spatial alignment is impossible except to a common coordinate system. But even then, it is unclear how downstream analysis should proceed, when the samples are non-overlapping.
Factor analysis has been a successful approach to unsupervised discovery of patterns in genomics. There are a few existing methods for the factor analysis of SRT data that model the spatial dependency of gene expression data (Townes, Engelhardt, 2023; Velten et al., 2022; Shang, Zhou, 2022). NSF (Townes, Engelhardt, 2023) and MEFISTO (Velten et al., 2022) are focused on the factorization of data from a single biospecimen, as each factor is modeled using a single Gaussian Process. Nevertheless, the models are straightforward to apply to data that has been spatially aligned by treating the aligned samples as one larger sample; the performance of this approach is not evaluated in the associated publications. In contrast, spatialPCA can be applied to multiple unaligned samples. Shang, Zhou (2022) compares clusters obtained from the jointly analyzed data to manually annotated cortical layers. However, they conclude that the clusters obtained by using a joint analysis across multiple samples do not outperform the clusters obtained from a single sample. This suggests that across-sample factorization of SRT data still has substantial challenges.
Results
Bypassing spatial alignment by parameter modeling
It is important to account for sample-to-sample variation in the analysis of genomics data. We are considering this question in the context of applying matrix factorization methods, such as non-negative matrix factorization (NMF), to spatially resolved transcriptomics data. Gene expression data exhibits a spatial dependence whereby genes measured at two locations which are spatially close show a different dependence from genes measured at two distant spatial locations. Such dependence can be driven by a variety of sources including spatial patterns in the distribution of cell types as well as correlated measurement error. The goal of any analysis of spatially resolved transcriptomics data is to identify systematic changes in gene expression which are associated with spatial location. Broadly, we refer to such dependence as “spatial dependence”.
One approach to analysis of multi-sample spatially resolved transcriptomics data is to start the analysis with spatial alignment of the samples into a common coordinate system. This process essentially defines spatial neighbourhoods and maps these neighbourhoods between samples. But spatial alignment is well-recognized to be a challenging problem, due to the need to account for differences in shape, rotation, and placement of anatomical regions or other features between samples.
Here, we provide a general recipe for extending a one-sample spatial factorization framework to allow multi-sample analysis. Our approach bypasses the need for spatial alignment.
In a spatial factorization framework, we represent spatial expression data on a single sample as a sum of products between gene loadings and spatial factors,
Here is the spatial data matrix, is the number of spatial factors, are gene loadings and are the spatial factors. The gene loadings and spatial factors represent systematic changes in gene expression. Accounting for spatial dependence in such a model is done by additional modeling of the spatial factors; an example is the proposed non-negative spatial factorization (NSF) of Townes, Engelhardt (2023) where the spatial factors are modelled using Gaussian processes.
In a multi-sample dataset we have an additional index for the different samples. Our recipe prescribes letting the spatial factors be sample-specific while the gene loadings are shared across samples (Methods), giving rise to the following factorization
Note the absence of the index on the gene loadings . Sharing the gene loadings across samples is exactly what happens when NMF is applied to data without spatial information such as bulk RNA-seq data (Methods).
With such a parameterization, our recipe enforces each factor to have the same association with genes across samples, while allowing spatial dependence to be modeled separately in each sample.
Fitting such a model will usually require the development of new software (Methods), often by extending existing software.
As a proof of concept, here we have applied this recipe to non-negative spatial factorization (NSF). We allow each sample to have its own spatial dependence structure (or more specifically a sample-specific covariance term in the Gaussian process). We call this extended model mNSF (multisample NSF). We provide a python package implementing our model.
The performance of mNSF in simulations
We examined the performance of our mNSF model using a simulation study which is a simple extension of the study conducted by Townes, Engelhardt (2023), but adapted to examine issues that are particularly relevant for multi-sample analysis of SRT data. Briefly, we specify true latent spatial factors and generate gene expression data with noise. We only depict the true and estimated spatial features, and not the gene loadings. We use T1–4 to denote the 4 true factors in the simulation, and use M1–4 to denote the 4 mNSF spatial features. There are no particular order to the mNSF output so we manually identify the best true factor which matches a given estimated factor.
First, we examined how mNSF handles the important case where the spatial factors are rotated between samples (T1-T4 in Figure 1a). For each factor, its spatial distributions are the same in the two samples, but rotated 90 degrees. We find the mNSF factors among M1–4 each corresponding to one simulated factor among T1–4, according to their spatial pattern (M1, M2, M3 and M4, corresponding to T2, T1 T4 and T3). We use Moran’s I to measure the spatial dependency for each of the mNSF factors (i.e. M1–4) (Figure 1c), and we find that all four mNSF factors show high spatial dependency. Those validations suggest that mNSF successfully identifies these simulated spatial features.
Second, we examined mNSF in the case where one of the factors has a spatial pattern only in one sample, and has constant low value in the other sample (i.e. T1 among the factors T1-T4 in Figure 1b). We find mNSF factors each corresponding to one simulated factor (M1, M2, M3 and M4, corresponds to T2, T4, T3 and T1). For each set of simulations, we use Moran’s I to measure the spatial dependency of each of the mNSF factors among M1–4 (Figure 1d). We find that factor M4, which corresponds to factor T1 (i.e. the factor which is only operational in one of the samples by design), show high spatial dependency in sample 1 and almost zero spatial dependency in sample 2. Those results suggest that mNSF is capable of identifying factors that represent patterns that are operational only in some of the samples.
Analysis of a mouse sagittal brain dataset
To assess the ability of mNSF to identify spatially-resolved features in an actual, multi-sample, SRT dataset, we next analyzed the adult mouse sagittal brain dataset generated by 10X genomics, generated using the Visium technology (Ståhl et al., 2016). This dataset consists of two sagittal sections from a single mouse brain. Each section was cut in two halves, one anterior and one posterior, for a total of 4 samples. Each sample was assayed using a separate Visium slide. We know that certain anatomical regions are present only in the anterior (e.g. olfactory bulb) or the posterior (e.g. cerebellum), while other anatomical regions will be split across the two halves (e.g. the hippocampus). This provides an opportunity to assess the ability of mNSF to identify sample-specific factors as well as both common factors and factors that vary across the anterior and posterior sections.
Townes, Engelhardt (2023) applies NSF to one of the two anterior samples. They use 20 factors with a split of 10 spatial factors and 10 spatially-unrestricted features. Despite this demarcation, they find that most of the 20 learned patterns have strong spatial components. They establish that most of these factors correspond to known anatomical regions in the anterior mouse brain. Consistent with their parameter choices, we apply mNSF to all 4 samples using 20 spatial features.
To interpret each mNSF factor, we find a list of genes that are highly associated with, and most specific for each factor by analyzing the gene loading matrix using the patternMarkers approach identified in (Fertig et al., 2010). We then use the set of genes associated with each factor to interpret the cell types, biological functions, or anatomical regions represented by each factor.
Some mNSF factors reflect specific anatomical regions, only present in some of the samples. For example, factor M9 can be identified as highlighting the gyri of the cerebellum (Figure 2). This hindbrain-specific factor is close to zero for the two anterior samples and visually similar in intensity and distribution across the two replicate sections for the posterior brain. The top genes identified by patternMarkers include Pcp2, Calb1, Car8, and Itpr1 which are specific markers for Purkinje cells within the cerebellum (X Chen et al., 2022), as well as Cbln1 and Cbln3 which are known to exhibit high expression in the cerebellum, and Zic1 which is a specific marker for cerebellar granular cells (Aruga et al., 1998) (Table S1). This result highlights the ability of mNSF to identify factors associated with a signal that is present only in some, but not all, of the samples.
Some mNSF factors represent anatomical regions that span the posterior and anterior brain (Figure 2). For example, factor M17 predominately marks both the hippocampus and hypothalamus, with moderate signal in select layers of the cortex. Note how the estimated factor varies smoothly across the posterior and anterior brain, specifically across the CA1–3 layers(which appear as a rotated U in these samples). The regions labeled by this factor are considered regions of increased synaptic plasticity. For example, the hippocampus, which functions primarily in learning and memory (Bliss, Collingridge, 1993), requires this plasticity for the formation and consolidation of short-term memories. The hypothalamus plays a crucial role in maintaining homeostasis in the body (Saper, Lowell, 2014), regulating a variety of essential functions such as hunger, thirst, sleep, circadian rhythms, stress responses, and reproductive behaviors. Synaptic plasticity in the hypothalamus is important for adaptation to changes in physiological states and the environment (Dietrich, Horvath, 2013; Serrenho et al., 2019; Bains et al., 2015; Horvath, 2006). Consistent with this categorization, patternMarker genes for M17 are associated with synaptic plasticity, including AMPA receptor regulation (Cnih2, Herring et al. (2013)), dendritic spine development (Ddn, Ncdn, Yang et al. (2024) and Nicolas et al. (2022)), and synaptogenesis (Nptxr, SJ Lee et al. (2017)) (Table S1).
Broadly, across the 20 spatial features, our observations about factors M9 and M17 hold for other spatial features. Specifically, we observe (a) consistency between each factor across the two replicate sections (b) there are multiple factors which continuously vary across the anterior and posterior brain (Supplementary Figure S1) (c) a few factors (M10, M19) are specific to either the anterior or posterior brain.
Analysis of human DLPFC data
Next, to evaluate mNSF across a dataset with replicate samples from different donors, we apply mNSF to a widely used spatially resolved transcriptomics dataset from the Visium platform on human dorsolateral prefrontal cortex (DLPFC) Maynard et al. (2021). This data consists of 4 samples from each of 3 donors. The 4 samples from each donor consist of parallel sections (Figure 3). Each section has a width of 10μm and we label the sections as A-D, representing the physical ordering of the sections. The physical separation is as follows: the AB pair is separated by 0μm, the BC pair is separated by 300μm and the CD pair is separated by 0μm. Furthermore, the data are supplied with manual annotations of cortical layers based on expression and H&E staining, with labels of white matter (WM), cortical layers 1 to 6 and NA (which are excluded from our analysis). Not all layers are present in all samples; specifically, the 4 samples from individual 2 do not have layers 1 and 2 present.
We apply mNSF to all 12 samples using 10 spatial features. The model does not encode the design of the experiment with 4 sections from 3 donors. We expect that different samples from the same donor are more similar than different samples from different donors; this is true for the manual annotations of the samples (Figure 3). We do not necessarily expect that different cortical layers form distinct “clusters” in expression space. For example, it is understood that some genes are expressed in a gradient across the cortical layers (O’Leary et al., 2007; Lau et al., 2021; Lodato, Arlotta, 2015). Nevertheless, we expect some relationship between cortical layers and mNSF spatial features. To compare the mNSF factors with the discrete manual annotation, we use the following approach: we group each spatial location in each sample according to its manual annotation and display each factor value across the layers (Figure 4). The M6 factor displayed in Figure 4 is particularly high in cortical layer 2, followed by blending into cortical layer 3. The lowest layer is cortical layer 1 and the factor is almost absent in white matter.
Figure 5 depicts 3 mNSF spatial features. Due to size restrictions, we display 4 samples, 1 sample from each donor as well as an additional parallel section from each donor. This display depicts both between-donor variability and between-sections-within-a-donor variability. For completeness, we depict all 12 samples and 10 factors in Supplementary Figures S2–S11.
Factor M6 has high values in the spatial locations manually annotated as layer 2, intermediate values in spatial locations manually annotated as layer 3 and close to zero values for spatial locations manually annotated as white matter (Figure 5a). PatternMarker genes for this factor include HPCAL1 (Table S2), which is a marker for layer 2 excitatory neurons (Wei et al., 2022). This factor is consistent with the manual annotation across the 3 donors, and across parallel sections within each donor.
Factor M2 is consistently high for spatial locations annotated as white matter (Figure 5b) and low otherwise. PatternMarker genes for this factor include known oligodendrocyte and myelin-associated genes, such as MOG, MOBP, MBP, and BCAS1 (Cahoy et al., 2008; Plant et al., 2014) (Table S2), which are expected to be specific to white matter. This factor is consistent with the manual annotation across the 3 donors, and across parallel sections within each donor.
Factor M5 has high values in the spatial locations manually annotated as white matter in the samples from donor 1 and 3, but consistently low values across all the spatial locations in the samples from donor 2 (Figure 5c). It therefore suggests an inconsistency, potentially in tissue processing or orientation, across the samples from different donors. PatternMarker genes for this factor include genes that mark oligodendrocytes and myelination (PLP1, TF, CNP, ENPP2) (Cahoy et al., 2008; Plant et al., 2014), and genes potentially associated with neurovasculature, blood, and vascular endothelial cells (HBA2, HBB, CLDND1) (Günzel, Yu, 2013) (Table S2). We believe differences in this factor across donors may reflect variation in how tissue blocks were dissected where sections from donor 2 are cut at a more horizontal plane that does not contain layers 1 and 2. This was confirmed to be a possible interpretation by the original manual annotator (K. Maynard, personal communication). This factor represents a pattern only present in some, but not all samples, once again showcasing the ability of mNSF to identify such patterns.
We conclude that mNSF shows encouraging performance on this dataset. It produces factors which make biological sense and are consistent across parallel sections within the same tissue block. Many of the factors are also consistent across the donors. However, the scale of the factors sometime vary (see white matter for factor M2, Figure 5). This might reflect variability in the manual annotation, biological variability between samples or unwanted (technical) variation which might be possible to remove with additional normalization.
Comparison with spatial alignment
The DLPFC dataset is an excellent candidate for spatial alignment. However, considering the manual annotations (Figure 3) it is clear that aligning different sections from the same donor is much easier than aligning different sections from different donors. For example, layers 1 and 2 are absent in donor 2 and this complicates spatial alignment. Zeira et al. (2022) describes PASTE, a method for spatial alignment, and apply PASTE to the DLPFC dataset to perform direct alignment of these samples. However, for exactly the challenges described above, Zeira et al. (2022) only attempt to align samples from the same donor to each other, doing both pairwise and 4 -sample alignment. Using these data allows us to compare mNSF to spatial alignment on a dataset which is particularly well suited for spatial alignment.
Specifically, we compare the result of using mNSF on multiple samples to the result of using PASTE to align samples followed by NSF on the aligned samples, an approach which we term pasteNSF. Following Zeira et al. (2022) we evaluate pairwise alignment, where we only consider pairs of adjacent sections (either AB, BC or CD). We expect, and this is confirmed by the authors, that PASTE performs best with adjacent tissue sections separated by 0μm (ie. comparisons AB or CD). In this analysis we use 10 spatial spatial features. Following factorization, we use a multinomial model to predict the 5 or 7 manually annotated layers (depending on the donor) as a function of the 10 estimated spatial features, and we use the model fit to assess performance. The model fit measures the association between the 10 inferred factors and the manual annotation.
This approach shows that mNSF has comparable performance to pasteNSF (Figure 6a–b). PASTE supplies the user with a mapping score which represents how well the spatial alignment is performed (higher is better). The pair with the lowest mapping score (donor 1, BC) is the pair where mNSF outperforms pasteNSF the most (Figure 6c–d).
In summary, mNSF has at least comparable performance to PASTE followed by NSF when spatial alignment is easy, but extends factorization to data where spatial alignment is hard (between donors in the DLPFC dataset) or impossible (between anterior / posterior sections in the mouse sagittal dataset).
Discussion
In this study, we describe a general approach to extending a matrix factorization method to multisample datasets. Using this approach, we extended non-negative spatial matrix factorization (NSF) by Townes, Engelhardt (2023) to spatial transcriptomics datasets with multiple samples. Our model allows for a sample-specific spatial dependence structure. Our method bypasses the need to align factors between samples into a consensus coordinate system, which is a challenging problem. Both real and simulated data analysis support that the method yields usable results when applied to data from multiple sources, even if it is impossible to perform spatial alignment. Classic matrix factorization methods are widely used in expression analysis and it is well recognized that it is hard to identify the biological or technical process(es) associated with each factor or pattern. Our method retains this limitation.
There are multiple possible downstream applications of our method, including spatial domain detection. These applications are left for future work. Our evaluation is focused on comparing factors to known anatomical regions, and we have not considered the impact on downstream analyses. Nevertheless, we believe that our method can serve as a foundation or input to downstream analysis of multi-sample data.
Batch effects could cause differences in spatial patterns between samples. Such differences would appear as factors which are variable across samples. Our current method cannot distinguish batch effects from biological variation. It will be an important question for future research to appropriately model and correct batch effects in spatially resolved transcriptomics data.
In applications, researchers are sometimes aware of existing patterns or factors across samples, either at sample level or at the level of spatial features. Accounting for such known biology will require the application of a semi-supervised matrix factorization method. Such methods have been suggested for other analysis domains (Haddock et al., 2022). We believe it will be important to develop such models for spatially resolved transcriptomics data and it is likely that our framework will allow for the extension of such models to multiple samples.
Conclusions
Here, we provide an alignment-free framework for generalizing a one-sample spatial factorization model to multi-sample data. In simulations, our method is capable of identifying spatial structures which are rotated between samples, as well as structures which only appear in some, but not all, samples. Using real data, we show our method is capable of identifying matched functional regions in multi-sample spatial transcriptomics data.
Methods
A general approach to multi-sample spatial factorization
Spatially resolved transcriptomic (SRT) data for a single sample can be represented as a matrix of expression measures indexed by genes with associated spatial (physical) location . We use the index to index the spatial locations.
Consider a standard non-negative matrix factorization model applied to SRT data on a single sample:
In this model, we decompose the expression values into a term representing genes and a term representing the spatial locations. The model does not impose any kind of spatial structure on . This model is widely used for non-spatially resolved bulk and single-cell transcriptomic data. Adapting this model to spatial data is usually done by additional requirements on the terms to account for expected spatial dependence. Such extensions are considered below, but for the sake of clarity, we first consider a matrix factorization model without spatial dependency.
Our suggested approach to extend this model across samples is to use the following (simplified) model
where the gene loadings are shared across samples, but the (spatial) factors are sample-specific. The model is easy to fit using standard software for non-negative matrix factorization models, by concatenating the involved matrices:
where [·] is concatenation. This is possible because of the simple model formulation where we do not impose spatial structure on the spatial features.
As argued by Townes, Engelhardt (2023), this model could be improved by (a) incorporating the digital (discrete) nature of the expression data and (b) modelling the spatial dependence between spots.
As a first step towards a better model for SRT data, Townes, Engelhardt (2023) describes probabilistic NMF (PNMF) which models the discrete nature of digital expression data using a Negative Binomial distribution but does not address the spatial dependence. We propose a multi-sample version of PNMF (mPNMF) specified as
Here, is a vector of known sample-specific size factors (one for each spot), and are factorparameters which are not sample-specific. Any software that fits single-sample PNMF can fit multi-sample PNMF by concatenating the data matrices.
To model the spatial dependence of SRT data, Townes, Engelhardt (2023) develops non-negative spatial factorization (NSF) by using a Gaussian process to model the spatial factors . We propose a multi-sample version of this model (mNSF), which is stated as follows (Figure 7):
where is a sample- and factor-specific mean function and is a sample- and factor-specific covariance kernel, both depending on the sample-specific vector of spatial locations . Unlike the multi-sample version of PNMF, fitting mNSF requires sample-specific parameterization to handle the sample-specific spatial features.
We have implemented mNSF by extending the code provided by Townes, Engelhardt (2023). Our extension includes the interpolated version of NSF, where the model is fit using a subset of spatial locations which are then interpolated to encompass the entire data matrix.
Data processing
Mouse sagittal section data
The spot-level gene expression counts data, as well as the 2-dimensional coordinates denoting the position of each spot, are downloaded from 10X website: https://cf.10xgenomics.com/samples/spatial-exp/1.1.0/V1_Mouse_Brain_Sagittal_Anterior_Section_1,https://cf.10xgenomics.com/samples/spatial-exp/1.1.0/V1_Mouse_Brain_Sagittal_Anterior_Section_2,https://cf.10xgenomics.com/samples/spatial-exp/1.1.0/V1_Mouse_Brain_Sagittal_Posterior_Section_1, and https://cf.10xgenomics.com/samples/spatial-exp/1.1.0/V1_Mouse_Brain_Sagittal_Posterior_Section_2.
Top 500 genes are selected based on the maximal Poisson deviance of each gene across all four samples, calculated by a built-in function in NSF package (see code on GitHub for details).
DLPFC data
DLPFC dataset are downloaded from the SpatialExperiment (Righelli et al., 2022) package. Top 500 genes are selected based on the maximal Poisson deviance of each gene across all twelve samples (see code on GitHub for details).
PASTE alignment
For each sample pair used in this study, spatial locations on the pairwise aligned coordinate system are downloaded from PASTE GitHub website: https://github.com/raphael-group/paste_reproducibility
The 500 genes selected in the 12-sample mNSF analysis for DLPFC data are used for this analysis.
For each sample pair used in this study, one-sample NSF is applied on the aligned data, i.e. the concatenated gene expression matrix of the two samples as well as the coordinate of each spatial location in each sample on the aligned coordinate system. mNSF is then used on the unaligned data, i.e. the gene expression matrix of each sample as well as the coordinate of the spatial locations in each sample in the original coordinate system.
Models without induced points
One-sample NSF
As reference, we describe the one-sample NSF model proposed in (Townes, Engelhardt, 2023). Briefly, the model assumes that the log value of each non-negative factor follows a Gaussian process across the spatial locations in the sample, with the intercept equal to a linear combination of the coordinates. The gene expression level in each spatial location follows a Poisson distribution with the mean equal to the product of a loading matrix and the factor matrix, multiplied by the size factor (i.e. library size) of the spot.
Assume there are spatial locations in total at locations measuring genes. We will use to denote the number of non-negative spatial spatial features; this is a user-supplied parameter.
Let denote the observed count value for gene and spatial location . It is assumed to follow a Negative Binomial distribution
Here is a known size factor for spatial location is the dispersion parameter and
Here denotes the loading of gene for the factor, and denotes the value of the factor at spatial location .
The value of factor on the observed spatial locations follows a Gaussian process distribution with a linear mean and a Matern kernel for covariance. In particular, this means that – the specification of the factor on the grid follows a normal distribution
with
and
where the kernel function for the factor is of Matern class,
Here
denotes the distance between two spatial locations with coordinates and , both of which are vectors of length 2.
In summary, is an matrix denoting the correlation of , is the length scale parameter and is the amplitude parameter for the kernel of Gaussian process.
Multi-sample NSF
If no interpolation is used, multi-sample NSF assumes that the log value of each non-negative factor follows a Gaussian Process across the spatial locations in each sample, with the intercept equals a linear combination of the coordinates. And the gene expression level in each spatial location follows a Poisson distribution with the mean equals a weighted sum of the factors multiplied by the size factor (i.e. library size) of the spot, where the weights are shared across different samples and the other parameters are all sample-specific.
Assume there are spatial locations in total at locations , genes used, and non-negative spatial spatial features.
The observed count value for gene at spatial location in the sample, denotes as , follows a Negative Binomial distribution
where is the scale factor of spatial location , is the dispersion parameter, and
Here denotes the loading of gene for the factor, and denotes the value of the factor at spatial location in the sample.
The value of factor on the observed spatial locations conditional on follows a GP distribution
Here is an matrix denoting the correlation of , with
where the kernel function for the factor in sample is
where is the length scale parameter and is the amplitude parameter for the kernel of Gaussian Process for the th factor in the sample.
Models with inducing points
One-sample NSF
If a set of interpolated points is used, one-sample NSF assumes that the log value of each non-negative factor follows a Gaussian Process across both the observed and interpolated spots, with the mean equals a linear combination of the coordinates. A set of parameters are created for the interpolated points, and the posterior distribution of the observed point conditional on the interpolated points is derived. The overall likelihood of both the observed and interpolated points is calculated through the likelihood of the interpolated points and the posterior likelihood of the observed points.
Assume there are spatial locations in total at locations genes used, spatial locations interpolated at locations , and non-negative spatial spatial features.
The observed count value for gene at spot, denotes as , follows a Negative Binomial distribution
where is the scale factor of spatial location , is the dispersion parameter, and
Here denotes the loading of gene for the factor, and denotes the value of the factor at spatial location .
The distribution of (i.e. the value of the th factor on the induced points) and follows a Gaussian Process distribution,
where
Here is an matrix denoting the correlation of , is an matrix denoting the correlation of , and is an matrix denoting the correlation between and .
where the kernel function for the factor is
where is the length scale parameter and is the amplitude parameter for the kernel of Gaussian Process for the th factor.
Decomposing the joint distribution of and into and , we have
where could be derived by
and could be derived by
Multi-sample NSF
In multi-sample NSF, for each factor, the loading of the same gene is shared across the samples, while all the other parameters are sample-specific. The observed data from different samples are assumed to be independent.
Assume there are samples, with the sample containing spatial locations at , interpolated points at . The same set of genes are used in all the samples. Assume there are L non-negative spatial factors for each sample, with the loadings of those genes for each factor shared by samples.
For sample , the observed count value for gene at spot, denotes as , follows a Negative Binomial distribution
where is the scale factor of spatial location in sample , is the dispersion parameter of sample , and
Here denotes the loading of gene for the factor for sample , and denotes the value of the factor at spatial location in sample .
The value of the factor on the interpolated locations of sample are assumed to follow a GP distribution
The distribution of and follows a Gaussian Process distribution,
where
Here is an matrix denoting the correlation of , is an matrix denoting the correlation of , and is an matrix denoting the correlation between and .
where the kernel function for the factor in the sample is
where is the length scale parameter for sample and is the amplitude parameter for the kernel of Gaussian Process for sample for the th factor.
Decomposing the joint distribution of and into and , we have
where could be derived by
and could be derived by
Model fitting
Firstly, let’s assume a one-sample data with the distribution in the same form of one-sample NSF, as described in the first subsection under Method section, and discuss it’s model fitting approach.
For one-sample spatial data, in NSF paper, it has been shown that by maximizing the following function (called ELBO function), we will get the MLE estimates of all the parameters involved in the model (Townes, Engelhardt, 2023):
(1) |
where denotes the parameter space, is defined by letting , and is the product of the posterior likelihood of conditional on , denoted as , and the approximated likelihood of , denoted as .
Next, we will discuss the model fitting approach for multi-sample data, where the distribution of the data is in the same form of the mNSF model.
The statement that “maximizing the ELBO function will give us the MLE estimates of all parameters involved in the model” is hold in general regardless of the form of distribution settings, such statement also holds for a data that is concatenated by data from multiple samples, where each data has the same form of distribution but with different values of parameters, i.e.
(2) |
where
where is the observed data at all spatial locations in sample , denotes the latent factors at induced points in sample , is the factor at all spatial locations in sample , is the spatial locations in sample , and is the induced points in sample .
As discussed in the last paragraph, the statement “maximizing the ELBO function will give us the MLE estimates of all parameters involved in the model” holds true for function (2), so in the next step, we will discuss the approach to maximize the function (2) above.
One way to maximize function (2) is using “Adam algorithm (Kingma and Ba, 2014) with gradients computed by automatic differentiation in Tensorflow” (Townes, Engelhardt, 2023), which calculate the gradient of a target function with respect to a set of parameters and update the parameters by adding to each of the parameter where denote the ‘step size’ (a constant scalar that has the same value for fitting different parameters) in the gradient approach and denotes the gradient of a parameter. To satisfy the non-negativity constraint of parameter, we can set any negative values in to zero after the parameters’ update in each iteration.
In the setting that the distributions of data from different samples are independent, we can re-write function (2) as
(3) |
The equation (3) above suggests that, in terms of the gradient calculation and the parameters update within each iteration of applying Adam gradient approach in the multi-sample model fitting, it equals to
Step 1: calculate the gradient of the parameters involved in each sample, only using the data of the corresponding sample;
Step 2: for the parameters that are sample-specific, update those parameters in the same way of fitting one-sample NSF model; for the parameters that are shared by samples (here for mNSF model, it is the loadings parameter ), the gradient of this parameter for function (3) equals the sum of the gradients of the parameter across samples.
Note that as long as the ‘step size’ parameter are the same for the individual sample’s model fitting, the sample-specific parameter fitting in “Step 2” equals:
Step 2*: for the sample-specific parameters, update those parameters separately in the same way of fitting one-sample NSF model, (here in mNSF model, we will get sets of updated Ws, written as , then average those updated parameters to get the updated parameter with respect to the full model (here for mNSF model, the updated parameter can be calculated by )
Based on all the discussions above in this subsection, we can draw the conclusion that the following two model fitting process will give us the same parameter estimates:
Process 1: maximize the ELBO function of the full mNSF model, using Adam gradient approach with step size of and updating the parameters with 100 iterations, where at the end of each iteration, set the non-negative values in the averaged to zeros.
Process 2: repeat the following parameter updating step for up to 1000 iterations, until converge: for each sample: firstly do the parameter updates in the same way as one iteration in ‘Process 1’ excluding the step of setting the negative values in to zero; then for parameter , get the average of its updated value across the samples, set the non-negative values in the averaged to zeros, and use this non-negative as the updated
In mNSF, we use “Process 2” to fit the model, which will, assuming the approximated likelihoods used in NSF model fitting are close enough to the non-approximated likelihoods, give us a estimate of parameters that is close to the MLE estimates of the model.
Supplementary Material
Acknowledgements
Funding
Research reported in this publication was supported by the National Institute of General Medical Sciences of the National Institutes of Health under award number R35GM149323 (YW, KDH), the National Institute on Aging of the National Institutes of Health under award numbers R01AG066768 (KW, CS, LAG), R01AG072305 (LAG), the National Institute of Neurological Disorders and Stroke of the National Institutes of Health under award number R00NS122085 (CS, GSO), the National Cancer Institute of the National Institutes of Health under award number U01CA284090 (GSO), the Raynor Foundation (GSO) and the Chan Zuckerberg Initiative DAF, an advised fund of the Silicon Valley Community Foundation under award CZF2019-002443 (YW, KDH).
Footnotes
Conflicts of Interest None.
Availability of data and materials
The Visium data for mouse sagittal section is available through 10X portal (https://www.10xgenomics.com). The Visium data for DLPFC is available for download through SpatialExperiment package. Code for generating the aligned spatial coordinates using PASTE is available through GitHub (https://github.com/raphael-group/paste_reproducibility). All code to analyze the data and generate figures is available at https://github.com/hansenlab/mNSF_paper. Our mNSF package is available at https://github.com/hansenlab/mNSF.
Bibliography
- Aruga J, Minowa O, Yaginuma H, Kuno J, Nagai T, Noda T, Mikoshiba K (1998). Mouse Zic1 is involved in cerebellar development. The Journal of Neuroscience 18.1: 284–293. DOI: 10.1523/JNEUROSCI.18-01-00284.1998. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Bains JS, Wamsteeker Cusulin JI, Inoue W (2015). Stress-related synaptic plasticity in the hypothalamus. Nature Rev Neuroscience 16.7: 377–388. DOI: 10.1038/nrn3881. [DOI] [PubMed] [Google Scholar]
- Bliss TV, Collingridge GL (1993). A synaptic model of memory: long-term potentiation in the hippocampus. Nature 361.6407: 31–39. DOI: 10.1038/361031a0. [DOI] [PubMed] [Google Scholar]
- Cahoy JD, Emery B, Kaushal A, Foo LC, Zamanian JL, Christopherson KS, Xing Y, Lubischer JL, Krieg PA, Krupenko SA, et al. (2008). A transcriptome database for astrocytes, neurons, and oligodendrocytes: a new resource for understanding brain development and function. The Journal of Neuroscience: the official journal of the Society for Neuroscience 28.1: 264–278. DOI: 10.1523/JNEUROSCI.4178-07.2008. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Chen KH, Boettiger AN, Moffitt JR, Wang S, Zhuang X (2015). RNA imaging. Spatially resolved, highly multiplexed RNA profiling in single cells. Science 348.6233: aaa6090. DOI: 10.1126/science.aaa6090. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Chen X, Du Y, Broussard GJ, Kislin M, Yuede CM, Zhang S, Dietmann S, Gabel H, Zhao G, Wang SSH, et al. (2022). Transcriptomic mapping uncovers Purkinje neuron plasticity driving learning. Nature 605.7911: 722–727. DOI: 10.1038/s41586-022-04711-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Clifton K, Anant M, Aihara G, Atta L, Aimiuwu OK, Kebschull JM, Miller MI, Tward D, Fan J (2023). STalign: Alignment of spatial transcriptomics data using diffeomorphic metric mapping. Nature Communications 14.1: 8123. DOI: 10.1038/s41467-023-43915-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Dietrich MO, Horvath TL (2013). Hypothalamic control of energy balance: insights into the role of synaptic plasticity. Trends in Neurosciences 36.2: 65–73. DOI: 10.1016/j.tins.2012.12.005. [DOI] [PubMed] [Google Scholar]
- Eng CHL, Lawson M, Zhu Q, Dries R, Koulena N, Takei Y, Yun J, Cronin C, Karp C, Yuan GC, et al. (2019). Transcriptome-scale super-resolved imaging in tissues by RNA seqFISH. Nature 568.7751: 235–239. DOI: 10.1038/s41586-019-1049-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Fertig EJ, Ding J, Favorov AV, Parmigiani G, Ochs MF (2010). CoGAPS: an R/C++ package to identify patterns and biological process activity in transcriptomic data. Bioinformatics 26.21: 2792–2793. DOI: 10.1093/bioinformatics/btq503. [DOI] [PMC free article] [PubMed] [Google Scholar]
- FIXME (2019). “A Tutorial for Multisequence Clinical Structural Brain MRI”. Handbook of Neuroimaging Data Analysis. Ed. by Ombao H, Lindquist M, Thompson W, Aston J. Chapman & Hall/CRC Handbooks of Modern Statistical Methods. Chapman & Hall/CRC. Chap. 5: 2. DOI: 10.1201/9781315373652. [DOI] [Google Scholar]
- Goltsev Y, Samusik N, Kennedy-Darling J, Bhate S, Hale M, Vazquez G, Black S, Nolan GP (2018). Deep Profiling of Mouse Splenic Architecture with CODEX Multiplexed Imaging. Cell 174.4: 968–981.e15. DOI: 10.1016/j.cell.2018.07.010. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Günzel D, Yu ASL (2013). Claudins and the modulation of tight junction permeability. Physiological Reviews 93.2: 525–569. DOI: 10.1152/physrev.00019.2012. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Haddock J, Kassab L, Li S, Kryshchenko A, Grotheer R, Sizikova E, Wang C, Merkh T, Madushani R, Ahn M, et al. (2022). Semi-supervised Nonnegative Matrix Factorization for Document Classification. arXiv: 2203.03551 [cs.IR]. [Google Scholar]
- Hansen KD, Wu Z, Irizarry RA, Leek JT (2011). Sequencing technology does not eliminate biological variability. Nature Biotechnology 29.7: 572–573. DOI: 10.1038/nbt.1910. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Herring BE, Shi Y, Suh YH, Zheng CY, Blankenship SM, Roche KW, Nicoll RA (2013). Cornichon proteins determine the subunit composition of synaptic AMPA receptors. Neuron 77.6: 1083–1096. DOI: 10.1016/j.neuron.2013.01.017. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Horvath TL (2006). Synaptic plasticity in energy balance regulation. Obesity 14 Supp1 5: 228S–233S. DOI: 10.1038/oby.2006.314. [DOI] [PubMed] [Google Scholar]
- Jones A, Townes FW, Li D, Engelhardt BE (2023). Alignment of spatial genomics data using deep Gaussian processes. Nature Methods 20.9: 1379–1387. DOI: 10.1038/s41592-023-3;01972-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Keren L, Bosse M, Thompson S, Risom T, Vijayaragavan K, McCaffrey E, Marquez D, Angoshtari R, Greenwald NF, Fienberg H, et al. (2019). MIBI-TOF: A multiplexed imaging platform relates cellular phenotypes and tissue structure. Science Advances 5.10: eaax5851. DOI: 10.1126/sciadv.aax5851. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Lau HYG, Fornito A, Fulcher BD (2021). Scaling of gene transcriptional gradients with brain size across mouse development. NeuroImage 224: 117395. DOI: 10.1016/j.neuroimage.2020.117395. [DOI] [PubMed] [Google Scholar]
- Lee SJ, Wei M, Zhang C, Maxeiner S, Pak C, Calado Botelho S, Trotter J, Sterky FH, Südhof TC (2017). Presynaptic Neuronal Pentraxin Receptor Organizes Excitatory and Inhibitory Synapses. The Journal of Neuroscience: the official journal of the Society for Neuroscience 37.5: 1062–1080. DOI: 10.1523/JNEUROSCI.2768-16.2016. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Lee Y, Bogdanoff D, Wang Y, Hartoularos GC, Woo JM, Mowery CT, Nisonoff HM, Lee DS, Sun Y, Lee J, et al. (2021). XYZeq: Spatially resolved single-cell RNA sequencing reveals expression heterogeneity in the tumor microenvironment. Science Advances 7.17. DOI: 10.1126/sciadv.abg4755. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Liu X, Zeira R, Raphael BJ (2023). PASTE2: Partial Alignment of Multi-slice Spatially Resolved Transcriptomics Data. bioRxiv. DOI: 10.1101/2023.01.08.523162. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Lodato S, Arlotta P (2015). Generating neuronal diversity in the mammalian cerebral cortex. Annual Review of Cell and Developmental Biology 31: 699–720. DOI: 10.1146/annurev-cellbio-100814-125353. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Lubeck E, Cai L (2012). Single-cell systems biology by super-resolution imaging and combinatorial labeling. Nature Methods 9.7: 743–748. DOI: 10.1038/nmeth.2069. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Maynard KR, Collado-Torres L, Weber LM, Uytingco C, Barry BK, Williams SR, Catallini JL 2nd, Tran MN, Besich Z, Tippani M, et al. (2021). Transcriptome-scale spatial gene expression in the human dorsolateral prefrontal cortex. Nature Neuroscience 24.3: 425–436. DOI: 10.1038/s41593-020-00787-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Mendelevich A, Vinogradova S, Gupta S, Mironov AA, Sunyaev SR, Gimelbrant AA (2021). Replicate sequencing libraries are important for quantification of allelic imbalance. Nature Communications 12.1: 3370. DOI: 10.1038/s41467-021-23544-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Nicolas G, Sévigny M, Lecoquierre F, Marguet F, Deschênes A, Del Pelaez MC, Feuillette S, Audebrand A, Lecourtois M, Rousseau S, et al. (2022). A postzygotic de novo NCDN mutation identified in a sporadic FTLD patient results in neurochondrin haploinsufficiency and altered FUS granule dynamics. Acta Neuropathologica Communications 10.1: 20. DOI: 10.1186/s40478-022-01314-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- O’Leary DDM, Chou SJ, Sahara S (2007). Area patterning of the mammalian cortex. Neuron 56.2: 252–269. DOI: 10.1016/j.neuron.2007.10.010. [DOI] [PubMed] [Google Scholar]
- Plant LD, Xiong D, Dai H, Goldstein SAN (2014). Individual channels at the surface of mammalian cells contain two KCNE1 accessory subunits. Proceedings of the National Academy of Sciences of the United States of America 111.14: E1438–46. DOI: 10.1073/pnas.1323548111. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Righelli D, Weber LM, Crowell HL, Pardo B, Collado-Torres L, Ghazanfar S, Lun ATL, Hicks SC, Risso D (2022). SpatialExperiment: infrastructure for spatially-resolved transcriptomics data in R using Bioconductor. Bioinformatics 38.11: 3128–3131. DOI: 10.1093/bioinformatics/btac299. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Rodriques SG, Stickels RR, Goeva A, Martin CA, Murray E, Vanderburg CR, Welch J, Chen LM, Chen F, Macosko EZ (2019). Slide-seq: A scalable technology for measuring genome-wide expression at high spatial resolution. Science 363.6434: 1463–1467. DOI: 10.1126/science.aaw1219. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Saper CB, Lowell BB (2014). The hypothalamus. Current Biology 24.23: R1111–6. DOI: 10.1016/j.cu.b.2014.10.023$. [DOI] [PubMed] [Google Scholar]
- Schurch NJ, Schofield P, Gierliński M, Cole C, Sherstnev A, Singh V, Wrobel N, Gharbi K, Simpson GG, Owen-Hughes T, et al. (2016). Erratum: How many biological replicates are needed in an RNA-seq experiment and which differential expression tool should you use? RNA 22.10: 1641. DOI: 10.1261/rna.058339.116. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Serrenho D, Santos SD, Carvalho AL (2019). The Role of Ghrelin in Regulating Synaptic Function and Plasticity of Feeding-Associated Circuits. Frontiers in Cellular Neuroscience 13: 205. DOI: 10.3389/fncel.2019.00205. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Shang L, Zhou X (2022). Spatially aware dimension reduction for spatial transcriptomics. Nature Communications 13.1: 7203. DOI: 10.1038/s41467-022-34879-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Ståhl PL, Salmén F, Vickovic S, Lundmark A, Navarro JF, Magnusson J, Giacomello S, Asp M, West-holm JO, Huss M, et al. (2016). Visualization and analysis of gene expression in tissue sections by spatial transcriptomics. Science 353.6294: 78–82. DOI: 10.1126/science.aaf2403. [DOI] [PubMed] [Google Scholar]
- Stickels RR, Murray E, Kumar P, Li J, Marshall JL, Di Bella DJ, Arlotta P, Macosko EZ, Chen F (2021). Highly sensitive spatial transcriptomics at near-cellular resolution with Slide-seqV2. Nature Biotechnology 39.3: 313–319. DOI: 10.1038/s41587-020-0739-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Thornton CA, Mulqueen RM, Torkenczy KA, Nishida A, Lowenstein EG, Fields AJ, Steemers FJ, Zhang W, McConnell HL, Woltjer RL, et al. (2021). Spatially mapped single-cell chromatin accessibility. Nature Aommunications 12.1: 1274. DOI: 10.1038/s41467-021-21515-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Townes FW, Engelhardt BE (2023). Nonnegative spatial factorization applied to spatial genomics. Nature Methods 20.2: 229–238. DOI: 10.1038/s41592-022-01687-w. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Velten B, Braunger JM, Argelaguet R, Arnol D, Wirbel J, Bredikhin D, Zeller G, Stegle O (2022). Identifying temporal and spatial patterns of variation from multimodal data using MEFISTO. Nature Methods 19.2: 179–186. DOI: 10.1038/s41592-021-01343-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Wei JR, Hao ZZ, Xu C, Huang M, Tang L, Xu N, Liu R, Shen Y, Teichmann SA, Miao Z, et al. (2022). Identification of visual cortex cell types and species differences using single-cell RNA sequencing. Nature Communications 13.1: 6902. DOI: 10.1038/s41467-022-34590-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Yang L, Liu Q, Zhao Y, Lin N, Huang Y, Wang Q, Yang K, Wei R, Li X, Zhang M, et al. (2024). DExH-box helicase 9 modulates hippocampal synapses and regulates neuropathic pain. iScience 27.2: 109016. DOI: 10.1016/j.isci.2024.109016. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Zeira R, Land M, Strzalkowski A, Raphael BJ (2022). Alignment and integration of spatial transcriptomics data. Nature Methods 19.5: 567–575. DOI: 10.1038/s41592-022-01459-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Zhao T, Chiang ZD, Morriss JW, LaFave LM, Murray EM, Del Priore I, Meli K, Lareau CA, Nadaf NM, Li J, et al. (2022). Spatial genomics enables multi-modal study of clonal heterogeneity in tissues. Nature 601.7891: 85–91. DOI: 10.1038/s41586-021-04217-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Supplementary Materials
Data Availability Statement
The Visium data for mouse sagittal section is available through 10X portal (https://www.10xgenomics.com). The Visium data for DLPFC is available for download through SpatialExperiment package. Code for generating the aligned spatial coordinates using PASTE is available through GitHub (https://github.com/raphael-group/paste_reproducibility). All code to analyze the data and generate figures is available at https://github.com/hansenlab/mNSF_paper. Our mNSF package is available at https://github.com/hansenlab/mNSF.