Author manuscript; available in PMC: 2023 Jul 9.
Published in final edited form as: Nat Genet. 2023 Jan 9;55(1):78–88. doi: 10.1038/s41588-022-01256-z

SPICEMIX enables integrative single-cell spatial modeling of cell identity

Benjamin Chidester 1,#, Tianming Zhou 1,#, Shahul Alam 1, Jian Ma 1,*
PMCID: PMC9840703  NIHMSID: NIHMS1854437  PMID: 36624346

Abstract

Spatial transcriptomics can reveal spatially-resolved gene expression of diverse cells in complex tissues. However, the development of computational methods that can utilize the unique properties of spatial transcriptome data to unveil cell identities remains a challenge. Here, we introduce SpiceMix, an interpretable method based on probabilistic, latent variable modeling for joint analysis of spatial information and gene expression from spatial transcriptome data. Both simulation and real data evaluations demonstrate that SpiceMix markedly improves upon the inference of cell types and their spatial patterns compared with existing approaches. By applying to spatial transcriptome data of brain regions in human and mouse acquired by seqFISH+, STARmap, and Visium, we show that SpiceMix can enhance the inference of complex cell identities, reveal interpretable spatial metagenes, and uncover differentiation trajectories. SpiceMix is a generalizable analysis framework for spatial transcriptome data to investigate cell type composition and spatial organization of cells in complex tissues.

Introduction

The compositions of different cell types in mammalian tissues, such as brain, remain poorly understood, due to the complex interplay among intrinsic, spatial, and temporal factors that collectively contribute to cell identity [1–3]. The emerging spatial transcriptomics technologies based on multiplexed imaging and sequencing [4–14] are able to reveal spatial information of gene expression of dozens to tens of thousands of genes in individual cells in situ within the tissue context. However, the development of computational methods that can incorporate the unique properties of spatially-resolved transcriptome data to unveil cell identities and spatially-variable features remains a challenge [15, 16].

Computational methods have been developed to use spatial transcriptome data to identify spatial domains and cell types in tissues [17–21], to explore the spatial variance of genes [22–25], and to align scRNA-seq with spatial transcriptome data [26–29]. To model spatial dependencies, methods using hidden Markov random fields (HMRFs) have been proposed [18, 21]. However, the conventional HMRF has two major limiting assumptions for modeling cell identity: that cell types or spatial domains are discrete, thereby ignoring the interplay of intrinsic and spatial factors, and that they exhibit smooth spatial patterns, which is not true of many cell types, such as inhibitory neurons with sparse spatial patterns. More recently, graph convolution neural networks, such as SpaGCN [19], have been used for identifying spatial domains, but such methods are more susceptible to overfitting and their learned latent representations are not easily interpreted, in comparison to effective linear latent variable models for scRNA-seq data, such as non-negative matrix factorization (NMF) [30]. In addition, the existing methods typically do not integrate the modeling of the spatial variability of genes with their contribution to cell identity. Therefore, there is a need for robust, interpretable methods that can jointly model both the spatial and intrinsic factors of cell identity, which is of vital importance to fully utilize the novel properties of spatial transcriptome data.

Here, we introduce SpiceMix (Spatial Identification of Cells using Matrix Factorization), an interpretable and integrative framework to model cellular diversity based on spatial transcriptome data. SpiceMix uses latent variable modeling to elucidate the interplay of spatial and intrinsic factors of cell identity. Crucially, SpiceMix enhances the NMF [30] model of gene expression by integrating with a graphical model of the spatial organization of cells, leading to more meaningful latent representations. Applications to the spatial transcriptome datasets of brain regions in human and mouse acquired by seqFISH+ [9], STARmap [10], and Visium [31] demonstrate, on both imaging-based and spatial-barcoding-based sequencing technologies, that the enhanced SpiceMix model of cell identity can uncover complex spatially-variable metagenes and unveil important biological processes.

Results

Overview of SpiceMix

SpiceMix models spatial transcriptome data by a probabilistic graphical model, which we call NMF-HMRF (Figure 1 and Methods). Our model has a natural interpretation for single-cell spatial transcriptome data, where each node in the graph represents a cell and edges capture nearby cell-to-cell relationships, but it can also be applied to in situ sequencing-based methods (e.g., Visium [7]), where each node represents a spatially-barcoded spot that consists of potentially multiple cells.
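The graph structure described above, with one node per cell (or spot) and edges between nearby nodes, can be sketched as follows. This is only an illustration: this section does not say how the neighbor graph is constructed, so the k-nearest-neighbor rule, the function name `knn_graph`, and the toy coordinates are all assumptions.

```python
import math

def knn_graph(coords, k=2):
    """Return an undirected edge set linking each node to its k nearest neighbors.

    The k-nearest-neighbor rule is an assumption made for illustration only.
    """
    edges = set()
    for i, p in enumerate(coords):
        # Sort all other nodes by Euclidean distance to node i.
        others = sorted(
            (j for j in range(len(coords)) if j != i),
            key=lambda j: math.dist(p, coords[j]),
        )
        for j in others[:k]:
            edges.add((min(i, j), max(i, j)))  # store each undirected edge once
    return edges

# Toy 2D positions for four cells (hypothetical values).
coords = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
edges = knn_graph(coords, k=1)
```

Storing each edge as a sorted pair keeps the graph undirected, so a relationship between two cells is counted once regardless of which cell selected the other as a neighbor.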

Figure 1: Overview of SpiceMix.


Gene expression measurements and a neighbor graph are extracted from spatial transcriptome data and fed into the SpiceMix framework. SpiceMix decomposes the expression yi in cell (or spot) i into a mixture of metagenes weighted by the hidden state xi. Spatial interaction between neighboring cells (or spots) i and j is modeled by an inner product of their hidden states, weighted by Σx⁻¹, the inferred spatial affinities between metagenes. The hidden mixture weights X, the metagene spatial affinity Σx⁻¹, and K metagenes M, all inferred by SpiceMix, provide unique insight into the spatially variable features that collectively constitute the identity of each cell.

For each node i in the graphical model, a latent state vector xi represents the mixture of weights for K different intrinsic or extrinsic factors of cell identity (Figure 1). To capture the continuous nature of cell state, our model extends the standard HMRF by allowing these latent states to be continuous. Importantly, different types of correlations of latent states in nearby cells are captured by the matrix Σx⁻¹, which, unlike a conventional HMRF and many other spatial models, does not exclusively assume smooth spatial patterns, but instead has the flexibility to represent both the smooth and sparse spatial patterns that compose real tissue. Each element of the K × K matrix Σx⁻¹ represents the pairwise affinity between two factors, providing an intuitive interpretation of the spatial patterns of cells in tissue. For each factor, a “metagene” in the G × K matrix M captures the expression of its associated genes, where G denotes the number of genes. The observed expression from spatial transcriptome data, yi = Mxi for node i, follows a robust linear mixing model, which gives an intuitive interpretation of the relationship of gene expression to the different latent factors representing cell identities and critical genes. Thus, the NMF-HMRF model in SpiceMix is able to uniquely integrate the spatial modeling of the HMRF with the NMF formulation for gene expression into a single model for spatial transcriptome data.
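The two ingredients of this model, the linear mixing of metagenes and the pairwise interaction between the latent states of neighboring nodes, can be illustrated with a small sketch. It is a simplification under stated assumptions: the function names, the toy matrices, and the plain bilinear form are hypothetical, and the noise model, any normalization of the latent states, and the sign convention of the affinity matrix are left to the Methods section.

```python
def mix(M, x):
    """Expression predicted for one node, y = M x (G genes, K metagenes)."""
    return [sum(m_gk * x_k for m_gk, x_k in zip(row, x)) for row in M]

def pair_score(x_i, x_j, A):
    """Bilinear form x_i^T A x_j between two neighboring nodes.

    A stands in for the K x K spatial affinity matrix; its values here are toy
    numbers, and no sign convention is implied.
    """
    K = len(x_i)
    return sum(x_i[a] * A[a][b] * x_j[b] for a in range(K) for b in range(K))

# Toy example with G = 3 genes and K = 2 metagenes (hypothetical values).
M = [[1.0, 0.0],
     [0.0, 2.0],
     [1.0, 1.0]]
x_i = [0.5, 0.5]   # latent state of node i: equal mixture of both metagenes
x_j = [1.0, 0.0]   # latent state of neighbor j: metagene 1 only
A = [[-1.0, 0.5],
     [0.5, -1.0]]

y_i = mix(M, x_i)              # predicted expression of node i
s = pair_score(x_i, x_j, A)    # interaction term for the edge (i, j)
```

Because each off-diagonal entry of the affinity matrix couples two different metagenes, the pairwise term can express relationships between distinct factors in adjacent cells, and not only similarity of the same factor.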

Given an input spatial transcriptome dataset, SpiceMix simultaneously learns the intuitive metagenes M of latent factors, the latent states X for all nodes, and their spatial affinity Σx⁻¹. This is achieved by our alternating maximum a posteriori (MAP) optimization algorithm. Importantly, in SpiceMix, metagenes are an integral part of the model outcome, which presents a methodological advance in comparison to the calculation of spatially-variable genes as a post-processing step in other recent methods (such as SpaGCN [19]). A regularizing parameter allows users to control the weight given to the spatial information during optimization to suit the input data. The detailed description of the NMF-HMRF model is provided in the Methods section, with additional details of the optimization in the Supplementary Note.
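The alternating structure of such an optimization, updating the latent states with the metagenes held fixed and then the metagenes with the latent states held fixed, can be sketched on plain NMF. Note the substitution: the code below uses the standard multiplicative updates for NMF with a squared-error objective, not the SpiceMix MAP algorithm. It has no spatial affinity term and no regularizing parameter, and all names and toy values are assumptions.

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def sq_error(Y, M, X):
    """Squared reconstruction error ||Y - M X||^2."""
    R = matmul(M, X)
    return sum((y - r) ** 2 for yr, rr in zip(Y, R) for y, r in zip(yr, rr))

def alternate(Y, M, X, n_iter=50, eps=1e-12):
    """Plain-NMF multiplicative updates (illustrative stand-in, no spatial prior)."""
    for _ in range(n_iter):
        # Step 1: hold the metagenes M fixed and update the latent states X.
        Mt = transpose(M)
        num, den = matmul(Mt, Y), matmul(matmul(Mt, M), X)
        X = [[x * n / (d + eps) for x, n, d in zip(xr, nr, dr)]
             for xr, nr, dr in zip(X, num, den)]
        # Step 2: hold X fixed and update the metagenes M.
        Xt = transpose(X)
        num, den = matmul(Y, Xt), matmul(M, matmul(X, Xt))
        M = [[m * n / (d + eps) for m, n, d in zip(mr, nr, dr)]
             for mr, nr, dr in zip(M, num, den)]
    return M, X

# Toy data: G = 3 genes, N = 4 nodes, K = 2 metagenes (hypothetical values).
Y = [[1.0, 0.0, 2.0, 1.0],
     [0.0, 1.0, 1.0, 3.0],
     [1.0, 1.0, 3.0, 4.0]]
M0 = [[0.9, 0.2], [0.3, 0.8], [0.6, 0.7]]
X0 = [[0.5, 0.4, 0.6, 0.3], [0.2, 0.7, 0.5, 0.9]]

err_before = sq_error(Y, M0, X0)
M_fit, X_fit = alternate(Y, M0, X0)
err_after = sq_error(Y, M_fit, X_fit)
```

Multiplicative updates keep every entry non-negative as long as the initial values are positive, which is why they are a convenient stand-in here for showing the two-block alternation.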

Evaluation using simulated spatial transcriptome data

We first evaluated SpiceMix using simulations that model the mouse cortex, a featured region for many spatial transcriptomic studies (Figure 2a,b; see Methods for the simulation method details). We devised two methods of generating expression based on the position and type of each cell: Approach I follows a metagene-based simulation; Approach II uses scDesign2 [32] trained on real scRNA-seq data [33]. For Approach II, we introduced two forms of spatial noise: leakage, which randomly swaps some reads of neighboring cells, to mimic challenges of processing real spatial transcriptomics data; and additive noise that follows random, spatially-smooth patterns. We compared the results from SpiceMix with those of NMF, HMRF, Seurat [34], and the recent SpaGCN [19]. We evaluated different methods by comparing the inferred cell types with the true cell types using the adjusted Rand index (ARI) metric. For SpiceMix and NMF, we subsequently applied Louvain clustering to the learned latent representations. The approaches for preprocessing the data and for choosing other hyperparameters for each method are provided in the Supplementary Note.
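For readers unfamiliar with the evaluation metric, the adjusted Rand index can be computed from the contingency counts of two labelings, as in the minimal sketch below. The function name and the toy labels are illustrative, and the software actually used for the evaluation is not stated in this section.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same cells."""
    n = len(labels_true)
    pairs = Counter(zip(labels_true, labels_pred))   # contingency table cells
    rows = Counter(labels_true)                      # true cluster sizes
    cols = Counter(labels_pred)                      # inferred cluster sizes
    sum_ij = sum(comb(c, 2) for c in pairs.values())
    sum_a = sum(comb(c, 2) for c in rows.values())
    sum_b = sum(comb(c, 2) for c in cols.values())
    expected = sum_a * sum_b / comb(n, 2)            # agreement expected by chance
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. one cluster in both labelings
    return (sum_ij - expected) / (max_index - expected)

# Same partition under renamed labels, and a partition that cuts across it.
ari_same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
ari_cross = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

The score depends only on which cells are grouped together, so it is invariant to how the clusters are named, and it can be negative when agreement is below the level expected by chance.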

Figure 2: Performance evaluation based on simulated spatial transcriptome data.


a. Illustration of the simulated spatial transcriptome data of the mouse cortex, including 3 major cell types distributed in 4 layers. Excitatory (blue, cyan, green, and brown) and inhibitory (red and yellow) neurons are star-shaped and glial cells (purple and magenta) are ovals. Subtypes are distinguished by their colors. b. Dendrogram showing the similarity of the expression profiles of the 8 cell types (top), their metagene profiles (middle), and their colors and shapes (bottom) used in panel (a). The top 4 rows correspond to metagenes that determine major type, the next 6 rows correspond to metagenes that determine subtypes or are layer-specific, and the bottom 3 rows correspond to noise metagenes. c. Simulated expression of metagenes 6 and 7, from a single sample generated with σy = 0.2 and σx = 0.15, in their spatial context (top) and the inferred expression of those metagenes by SpiceMix and NMF. Expression levels of metagenes are linearly scaled to [0, 1] for visualization. Visualizations in panel (e) are of the same simulated sample. d. Performance comparison of SpiceMix, NMF, HMRF, Seurat, and SpaGCN. Bar plots of the average adjusted Rand index (ARI) score, that measures the matching between the identified cell types and the true cell types, are shown. The score is averaged across n=20 replicates per scenario. Results are reported across four simulation scenarios, with varying degrees of randomness. Error bars show +/− one standard deviation. e. Imputed cell-type labels of each method for the excitatory neurons, shown in their spatial context. Neurons that were correctly identified are colored faintly. Neurons that were incorrectly identified are colored dark gray. The upper left panel is the ground truth cell type of all cells in the simulated sample. The colors match those of panels (a) and (b).

For both simulation approaches, we found that SpiceMix consistently outperformed other methods (Figure 2c–e). For Approach I, SpiceMix achieved the highest average ARI scores (0.65–0.82) across scenarios. For lower noise settings (σy = 0.2), the ARI of SpiceMix was 9–18% higher than that of SpaGCN or NMF (Figure 2d). SpiceMix, SpaGCN, and NMF all outperformed Seurat and HMRF. For the higher noise setting (σy = 0.3), SpiceMix clearly outperformed all methods (Figure 2d). We found that SpiceMix was able to recover both the layer-specific and sparse metagenes that underlie the identity of cells. For example, SpiceMix successfully recovered metagene 7, which is specific to layer L1 (Figure 2c) and is enriched in eL1 excitatory neurons (blue in Figure 2a). Notably, SpiceMix was able to reveal nearly all excitatory neurons (Figure 2e). SpiceMix also recovered metagene 6 (Figure 2c), which captures intrinsic factors of the sparse inhibitory neuron subtype i1 (red in Figure 2a). In contrast, the equivalent of metagene 7 for NMF is strongly expressed across layers L1-L3 (Figure 2c), and NMF confused some eL3 excitatory neurons (light green) with eL1 excitatory neurons (Figure 2e). The equivalent of metagene 6 for NMF shows a more diffuse pattern (Figure 2c). Additional evaluation by varying the parameter λx or zero-thresholding to reflect different sparsity of the latent variables of NMF further demonstrated the robust advantage of SpiceMix (Figure S1). In addition, SpaGCN, Seurat, and HMRF all incorrectly assigned the spatial patterns for many more excitatory neurons (Figure 2e).

For simulation Approach II, SpiceMix performed the best for all but one scenario, for which it tied with NMF, and the advantage of SpiceMix became more significant as the influence of noise and leakage on spatial expression patterns became more prevalent (see Supplementary Note and Figure S2a). We found that the spatial metagenes from SpiceMix reliably reflect both cell type composition and spatial noise (Figure S2b). Overall, SpiceMix achieved much more accurate spatial assignments of cells than all other methods (Figure S2c).

Taken together, we showed that the integration of matrix factorization and spatial modeling in SpiceMix yields better and robust inference of spatially variable features (both sparse and layer-specific) that underlie cell identities as compared to existing methods.

Improving cell identity modeling of seqFISH+ data

We applied SpiceMix to a recent single-cell spatial transcriptomic dataset of the primary visual cortex of a mouse (five samples of nearby regions), acquired by seqFISH+ [9], with single-cell expression of 2,470 genes in 523 cells. We compared the spatial patterns revealed by SpiceMix to those produced by NMF with various levels of sparsity via λx and zero-thresholding, as well as Louvain clustering (Supplementary Note) and the HMRF-based method of Zhu et al. [18], both reported in Eng et al. [9]. In addition, SpiceMix revealed spatially-informed metagenes capturing biological processes in the cortex (see Supplementary Table 1).

We first clustered the cells in the latent representation of SpiceMix using hierarchical clustering, which revealed five excitatory neural subtypes, two inhibitory neural subtypes, and eight glial types (Figure 3a), supported by known marker genes [33] (Figure 3b (left) and Supplementary Note). Major cell type assignments were generally consistent among SpiceMix, NMF, and Louvain clustering (Figure 3b (middle), Figure S3a, and Figure S4). However, SpiceMix uncovered more refined cell subtypes and states. Notably, SpiceMix identified three distinct clusters following known stages of oligodendrocyte maturation [35], from oligodendrocyte precursor cells (OPCs) to mature, myelin-sheath forming oligodendrocytes, throughout the five samples, as reflected by the spatially-informed metagenes. Metagene 8 is enriched among oligodendrocytes, distinguishing them from OPCs (Figure 3b (right)), while metagene 7, which is also in OPCs, separates a cluster of early-stage oligodendrocytes (Oligo-E) from later-stage oligodendrocytes (Oligo-L), suggesting that these metagenes capture their maturation trajectory. These stages are supported by the expression patterns of the OPC marker gene Cspg4, the differentiating oligodendrocyte marker gene Tcf7l2 [36], and the mature oligodendrocyte marker gene Mog [37] (Figure 3b (left)), in addition to a large set of marker genes for oligodendrocyte stages from [35] (Figure S5). Metagene 7 was distinguished from metagene 8 by its strong spatial affinity with metagenes 3 and 4 (highlighted by black arrows in Figure 3c), which are expressed primarily by the excitatory neurons of deeper tissue layers (eL5, eL6a, and eL6b) (Figure 3b (right)). No other method (NMF, Louvain clustering as reported by [9], and the HMRF-based method of Zhu et al. [18] as reported in [9]) could clearly distinguish these spatially-distinct cells (Figure S4, Figure 3b (middle), Figure S3b). 
Note that sparsity constraints on NMF did not yield these oligodendrocyte stages either (Supplementary Note and Figure S6). SpiceMix also discovered spatially-variable features that led to the identification of excitatory and inhibitory neuron subtypes whose layer-specificity patterns matched those of prior scRNA-seq studies. Specifically, the excitatory neurons exhibited strong layer-specificity (Figure 3d) and were supported by prior scRNA-seq studies (Figure S7), and SpiceMix revealed a separation of SST and VIP inhibitory neurons that matched known layer-specificity (Figure S8 and Supplementary Note).

Figure 3: Application of SpiceMix to the seqFISH+ data from the mouse primary visual cortex [9].


Note that colors throughout the figure of cells and labels correspond to the cell-type assignments of SpiceMix. a. UMAP plot of the latent states of SpiceMix (left) and the dendrogram of the arithmetic average of the expression for each cell type of SpiceMix (right). It is highlighted in (a) (left) that SpiceMix further delineated inhibitory neurons into VIPs (yellow) and SSTs (red-brown) enclosed by the orange dashed circle and refined oligodendrocytes and OPCs into separate subtypes: Astro/Oligo (magenta), Oligo-1 (beige), Oligo-2 (blue), and OPC (coral), enclosed within the red dashed circle. b. (Top) The inferred pairwise spatial affinity of metagenes, or Σx⁻¹. The strong attractions between metagene 7 and metagenes 3 and 4, which helped distinguish the spatial patterns of Oligo-L cells, are highlighted by the black arrows. (Bottom) The inferred pairwise spatial affinity of SpiceMix cell types. c. (Left) Average z-score normalized expression of known marker genes within SpiceMix cell types, along with the number of cells belonging to each type (colored bar plot). The colored boxes on the top following the name of each marker gene correspond to their known associated cell type. (Middle) Agreement of SpiceMix cell-type assignments with those of the original analysis in [9]. (Right) Average expression of inferred metagenes within SpiceMix cell types. The expression is normalized by the standard deviation per metagene. Metagenes 7 and 8, which revealed the separation of oligodendrocyte subtypes, are highlighted by black arrows. d. In situ SpiceMix cell-type assignments for all cells in each of the five FOVs. Colors of cell types are the same as in above panels.

Together, our analysis of seqFISH+ data of the mouse cortex with SpiceMix revealed spatially-variable features and more refined cell states. Our results demonstrate the advantages and unique capabilities of joint modelling of spatial and transcriptomic data using SpiceMix.

Revealing spatial metagenes and cell types from STARmap data

Next, we applied SpiceMix to a single-cell spatial transcriptome dataset of the mouse V1 neocortex acquired by STARmap [10], consisting of 930 cells passing quality control, all from a single field-of-view (FOV), with expression measurements for 1,020 genes. We compared primarily the results of SpiceMix, NMF, and Wang et al. [10]. An asterisk is appended to the end of the cell labels of Wang et al. [10] when referenced. In addition, SpiceMix generated spatially variable metagenes (see Supplementary Table 1).

We found that SpiceMix identified refined, spatial subtypes (Figure 4a) and improved upon the cell labels of [10] (Figure 4b). The learned spatial affinities (Figure 4c) enabled improved cell layer-specificity, which was particularly notable among excitatory neurons (Figure 4d). The clear boundaries between excitatory layers matched layer-enrichment analysis from scRNA-seq studies (see Figure 4b in [38]), in contrast to the cell assignments reported in Figure 5d in [10], which showed significant mixing of excitatory types across boundaries. Comparison of marker genes from [33] for eL2/3 and eL4 showed that, among eL2/3 and eL4 neurons that were differently assigned between SpiceMix and [10], their expression levels in SpiceMix assignments more closely followed that of [33] (Supplementary Note and Figure S9). The NMF formulation of SpiceMix helped reassign a large set of cells from the Astro-1* type of [10] to eL5 (Figure S10), which was further refined along layer boundaries by the learned spatial affinities (Figure 4b (middle), d). This reassignment was supported by the expression of known excitatory marker genes (Supplementary Note and Figure S11). In contrast, we found that HMRF missed sparse cell types and smoothed across layers, missing even the layer-wise structure of excitatory neurons (Figure S12). Further, SpiceMix achieved a refined, spatially-informed separation of three eL6 subtypes, driven by the identification of two strongly spatially-attracted metagenes: 5 and 7 (highlighted by a black arrow in Figure 4c).

Figure 4: Metagenes and refined cell types discovered by SpiceMix from the STARmap data of the mouse primary visual cortex [10].


Note that colors throughout the figure of cells and labels correspond to the cell-type assignments of SpiceMix. a. UMAP plots of the latent states of SpiceMix (left) and the dendrogram of the arithmetic average of the expression for each cell type of SpiceMix (right). It is highlighted in (a) (left) that SpiceMix delineated eL6 neurons into three subtypes enclosed in the green circle and delineated oligodendrocytes and OPCs into three separate subtypes: Oligo-1 (beige), Oligo-2 (blue), and Astro-2/OPC (magenta), enclosed within the beige dashed circle. b. (Top) The inferred pairwise spatial affinity of metagenes, or Σx⁻¹. The strong attraction between metagene 5 and metagene 7, which helped distinguish excitatory eL6 neurons, is highlighted by the black arrow. (Bottom) The inferred pairwise spatial affinity of cell types. c. (Left) Average z-score normalized expression of known marker genes within SpiceMix cell types, along with the number of cells belonging to each type (colored bar plot). The colored boxes on the top following the name of each marker gene correspond to their known associated cell types. (Middle) Agreement of SpiceMix cell-type assignments with those of the original analysis in [10]. (Right) Average expression of inferred metagenes within SpiceMix cell types. The expression is normalized by the standard deviation per metagene. The average proportions of metagenes 12 and 13 in oligodendrocyte cell types, which helped delineate subtypes, are highlighted by black arrows. d. In situ map of SpiceMix cell-type assignments for all cells.

SpiceMix also produced a significant refinement of glial subtypes. SpiceMix identified two oligodendrocyte clusters and an OPC cluster, distinguished by their relative expression of metagenes 12, 13, and 14 (Figure 4b (right)). Metagenes 12 and 13 were highly enriched in layer L6 and strongly attracted to each other (Figure 4c, Figure 5a). Their proportional expression by oligodendrocytes within L6 captured a maturation trajectory from OPCs to Oligo-1 cells that could not be revealed by other methods (see later section). Metagene 14 also has distinct oligodendrocyte markers (Figure 4b right), but scatters from layers L2/3 to L6 (Figure 5a), leading to a spatially distinct Oligo-2 type, clearly separated in the SpiceMix latent space from neighboring excitatory neurons (Figure S13). The expression of oligodendrocyte marker genes identified by [35] supports that the Oligo-1 and Oligo-2 clusters represent mature oligodendrocytes, distinct from the OPCs (Figure S14). In addition, SpiceMix distinguished astrocytes into two types (Astro-1 and Astro-2) based on metagenes 11 and 12. Although Astro-2 cells shared metagene 12 with OPCs, both their spatial location in the superficial layer and the expression of astrocyte marker genes defined them as astrocytes (Figure S15). In contrast, Astro-1 cells expressed metagene 11 with a scattered spatial pattern throughout all layers (Figure 5a). This Astro-1/Astro-2 separation was supported by the expression of known marker genes [39], including Gfap (P=0.024), a marker for astrocytes in the glia limitans, and Mfge8 (P=0.0013), a marker for a separate, diffuse astrocyte type (Figure 5b). We found that NMF did not reveal these subtypes (Figure S16) and the NMF metagenes typically exhibited unspecific spatial patterns and pairwise affinity (Figure S17, Figure S10d).

Figure 5: Spatial glial subtypes and the process of myelination in oligodendrocytes revealed by SpiceMix metagenes in STARmap data of the mouse primary visual cortex [10].


Note that colors throughout the figure of cells and labels correspond to the cell-type assignments of SpiceMix. a. (Left) In situ map of SpiceMix cell-type assignments for astrocyte and oligodendrocyte cells in the sample. (Middle and right) In situ maps of expression of both layer-specific and ubiquitous metagenes learned by SpiceMix that are relevant to astrocytes and oligodendrocytes. b. The log-normalized expression of astrocyte subtype marker genes in Astro-1 (n=78 cells) and Astro-2 (n=13 cells) types of SpiceMix (left), and a comparison of the percentage of cells expressing those marker genes (right). *: The two-sided Wilcoxon rank sum test P<0.05. c. Trajectory analysis of SpiceMix oligodendrocyte types using Monocle2, showing the unnormalized expression of metagenes 12 and 13 along the trajectory from OPC to Oligo-1. d. (Left) The expression of metagene 13 plotted against the expression of metagene 12 for oligodendrocytes of the SpiceMix Oligo-1 and OPC types. (Right) The expression of important marker genes for myelin-sheath formation in oligodendrocytes plotted against the relative expression of metagenes 12 and 13 of the same cells. The dashed lines are the fitted linear regression model. The title of each plot consists of the gene symbol and the Benjamini/Hochberg corrected two-sided Wald test with t-distribution P-value of having a nonzero slope, respectively. *: P<0.05.

These results suggest that SpiceMix is able to refine cell identity and metagene inference with distinct spatial patterns from STARmap data, further demonstrating its advantage.

Identifying continuous oligodendrocyte myelination stages

The expression of metagenes learned by SpiceMix from seqFISH+ and STARmap suggested the existence of continuous factors of oligodendrocyte identity. Applying Monocle2 [40] to the raw counts of cells in the STARmap dataset labeled by SpiceMix as oligodendrocytes showed a clear trajectory from the OPCs to the mature Oligo-1 class (Figure 5c and Supplementary Note). The Oligo-2 class is likely a distinct type of mature oligodendrocytes compared to Oligo-1. Importantly, the relative expression of metagenes 12 and 13, which were highly expressed in OPC and Oligo-1 cells, respectively, strongly correlated with the inferred trajectory (Figure 5c).

Using linear regression, we tested if the differences in the proportions of metagenes 12 and 13 along this trajectory corresponded to the expected change in expression of myelin sheath-related genes during myelination. The eleven genes that we tested were those from the STARmap panel attributed to myelin sheath formation, according to Gene Ontology (GO) (Supplementary Note) that were expressed in at least 30% of cells. We found that the correlations of seven of the eleven genes are significant (P<0.05, after a two-step FDR correction for multiple testing) (Figure 5d and Figure S15), supporting our hypothesis. One of these genes is Atp1a2, recently confirmed by scRNA-seq studies to be suppressed as myelination progresses [41, 42], further demonstrating the robustness of our analysis. Furthermore, we found that the more recent latent variable model scHPF [43], a hierarchical Poisson factorization model for scRNA-seq data, could not reveal this continuous process of oligodendrocytes, confirming the importance of spatial information (Figure S18 and Supplementary Note).
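Testing eleven genes at once requires a correction for multiple testing. As a hedged illustration, the sketch below implements the plain Benjamini-Hochberg step-up adjustment, which the Figure 5 caption names. The exact two-step variant mentioned above may differ from this, and the P values shown are made-up toy numbers, not results from the paper.

```python
def benjamini_hochberg(pvals):
    """Return Benjamini-Hochberg adjusted P values in the original order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices by ascending P
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down so that adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Toy P values for four hypothetical tests.
pvals = [0.01, 0.04, 0.03, 0.20]
adj = benjamini_hochberg(pvals)
n_significant = sum(p < 0.05 for p in adj)
```

In this toy case two of the raw P values below 0.05 no longer pass the threshold after adjustment, which is the kind of filtering that the count of significant genes reported above already reflects.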

This result further demonstrates that the latent representation of SpiceMix is uniquely able to elucidate important biological processes underlying cell states.

Unveiling spatial patterns from Visium human brain data

We next sought to demonstrate the effectiveness and interpretability of SpiceMix on a dataset of the human dorsolateral prefrontal cortex (DLPFC) acquired by the 10x Genomics Visium platform [31]. We made a direct comparison of SpiceMix to two recent methods on this dataset: SpaGCN [19] and BayesSpace [21], which was designed for spatial-barcoding methods.

SpiceMix achieved consistent advantages in identifying the layer structures of DLPFC (Figure 6a), which consisted of six cortical layers (layer L1 to layer L6) and white matter. We focused on the 4 FOVs from sample Br8100 for this analysis (Supplementary Note). The clusters from SpiceMix produced an ARI score between 0.54 and 0.61 (average 0.575), with consistent advantage over SpaGCN and BayesSpace (Figure 6a). We observed that although SpaGCN and BayesSpace could produce layer-like patterns, these layers did not closely match the true boundaries (Figure S19 and S20). In contrast, SpiceMix produced contiguous layers for all FOVs and identified clearer boundaries (Figure 6b) and learned metagenes that clearly manifest the layer structure of DLPFC (Figure S21 and Supplementary Table 1). Using all four FOVs as input did not significantly affect the ARI score of SpaGCN (Figure 6a), and we were unable to run BayesSpace effectively on all four FOVs simultaneously. Although layer L4 could not be reliably identified by any method, the metagenes a3 and a6 learned by SpiceMix showed differential expression among L3, L4, and L5 (P < 10⁻³⁰⁰, highlighted in Figure 6c).

Figure 6: Application to the Visium dataset of human dorsolateral prefrontal cortex [31].


a. Comparison of the performance of SpiceMix, BayesSpace, and SpaGCN on the 4 FOVs from sample Br8100. SpiceMix and SpaGCN(4) were trained on 4 FOVs simultaneously and evaluated both on single FOVs and on 4 FOVs altogether. BayesSpace and SpaGCN(1) were trained and evaluated only on single FOVs. For SpaGCN and BayesSpace, gray dots represent one of n=10 runs with different random seeds. Data are presented as mean values and 95% CIs. b. The in situ layer assignments of SpiceMix for FOV 151673. The boundaries between ground-truth layers are illustrated by dashed lines. The gyrus and sulcus subregions of L3 identified by SpiceMix are labeled L3g and L3s, respectively. c. The in situ expression of 8 metagenes from SpiceMix, normalized by the maximum value per metagene across FOVs. Metagenes a3 and a6 collectively distinguish L4 spots (n=7952) from L3 (n=28160) (two-sided t-test P smaller than the smallest representable value) and L5 (n=21400) (two-sided t-test P = 6 × 10⁻³²²; red rectangles). d. The rank distribution of known marker genes [44] (n=53, 406, 188, and 67 genes, respectively) of 4 cell types in the 8 metagenes. ‘Exc (S)’ and ‘Exc (D)’ denote markers of excitatory neurons of superficial and deep layers, respectively. For each row, metagenes with greater ranks are highlighted by red rectangles (one-sided highlighted-vs-rest Mann-Whitney U test P = 2 × 10⁻²¹, 10⁻⁹⁰, 3 × 10⁻³², 10⁻²⁸, respectively). e. Kernel-smoothed in situ expressions of metagenes a4 and a5, showing their differential expressions (highlighted by arrows) between the gyric side (right side) and the sulcal side (upper side). f. The distribution of the rank difference of gyro-sulcal DEGs between metagenes a4 and a5. Gyric DEGs have greater ranks in a5 than in a4 (two-sided Wilcoxon P = 3 × 10⁻²⁶, n=1836 genes), and sulcal DEGs exhibit the opposite trend (two-sided Wilcoxon P = 4 × 10⁻²⁵, n=1136 genes).
All boxplots show the median, first, and third quartiles, and whiskers extend no further than 1.5×IQR (inter-quartile range).

The interpretability of metagenes from SpiceMix helped unveil spatially-variable expression and spatial patterns of cell types of DLPFC. We used differentially expressed genes (DEGs) identified from [44] (Supplementary Note). The high ranks of astrocyte DEGs in metagene a1 (Figure 6d) suggest that it captures astrocyte expression, along with its ubiquitous presence in all seven layers (Figure 6c), consistent with a recent work [45]. Oligodendrocyte DEGs were enriched in metagenes a6 and a7, which were primarily in deep layers and the white matter, respectively (Figure 6cd). This is consistent with the spatial distributions of oligodendrocytes [46] and suggests a spatial-subtype separation. Moreover, the DEGs of excitatory neurons in superficial layers and deep layers were enriched in metagenes a3 and a6, respectively, which were present mostly in layers L1-L3 and layer L6, accordingly, reflecting the layer-like patterns of excitatory neurons (Figure 6cd). These findings confirm the unique ability of SpiceMix to unveil spatially-variable features and cell type composition.

Delineating finer anatomic structures of the human brain

SpiceMix was able to identify finer anatomical structures and cell composition of the brain based on its learned spatially-variable metagenes from the DLPFC Visium data [31]. On the four FOVs from sample Br8100, metagenes a4 and a5 captured the gradual gyro-sulcal variability (Figure 6ef and Figure S22). We found that more than 50% of the genes used for SpiceMix were differentially expressed across the two regions (Supplementary Note), strongly supporting this separation. The relative ranking of DEGs within each metagene, according to its weight, was significantly associated with the respective region ($P < 10^{-24}$) (Figure 6f). This shows the distinct ability of the metagenes from SpiceMix to represent gradual changes in spatial gene expression.

Applying SpiceMix to FOV 151507 from sample Br5292 (Figure 7a), we found that metagenes b1-b3 defined three finer anatomical structures within layer L1 annotated in [31] (Figure 7bc) (see Supplementary Table). Based on the brightness of the staining in the histology image, we classified each spot into one of four types (Supplementary Note and Figure 7b (top left)): the dark stripe (yellow), the bright gap (green), the flanking cortex (blue), and ambiguous mixtures of these three regions (grey). All 7 marker genes of mural cells, which constitute the wall of blood vessels, from [39] that passed quality control (Supplementary Note) were highly expressed in the dark stripe. The enrichment of 5 out of the 7 genes was significant (P ≤ 0.002), suggesting that the dark stripe is potentially a blood vessel. Aside from the brightness, spots exhibited other varying phenotypes across the three regions, such as cell density, UMI count, and mitochondrial RNA ratio (Figure S23a), indicating that these three regions are biologically different. We found that metagenes b1, b2, and b3 were enriched in the flanking cortex, the bright gap, and the dark stripe (the putative blood vessel), respectively (Figure 7bc), supporting the delineation of the three anatomical structures by SpiceMix.

Figure 7: SpiceMix metagenes associated with finer anatomical structures in the human dorsolateral prefrontal cortex from Visium data [31].

a. The in situ layer annotations of the ground truth on FOV 151507. b. The finer structure annotations of spots (top left) and the in situ inferred unnormalized expressions of metagenes b1-b3 on FOV 151507 (the other three panels). The color legend of the top left panel is in (c). Based on the intensity on the histological image, a spot was assigned to a dark stripe (green), a bright gap (blue), a peripheral region (orange), or a mixture of the bright gap and dark stripe (grey). As highlighted by black arrows, metagenes b1-b3 are enriched in the peripheral region, the bright gap, and the dark stripe, respectively. c. The differential expressions of metagenes b1-b3 across the finer structures. One-sided one-vs-rest Mann-Whitney U test P is displayed above each column. For better visualization, the raw expression levels were divided by the maximum expression level across all spots in the 4 FOVs per metagene. d. The inferred in situ unnormalized expression of metagenes b4 and b5 on FOV 151507, implying the delineation of the superficial part (denoted by S) and the deep region (denoted by D) in white matter. e. The rank distribution of oligodendrocyte marker genes in metagenes b4 and b5. These genes have significantly higher ranks in metagene b5 than in b4 (one-sided Wilcoxon P is shown). All boxplots show the median and first and third quartiles, and whiskers extend to values no further than 1.5×IQR (inter-quartile range).

Additionally, metagenes b4 and b5 defined two finer anatomical structures in the white matter region (Figure 7d). Specifically, metagene b4 was mainly present in a 400μm-wide superficial layer (Figure 7d (S)), whereas metagene b5 was nearly restricted to the deep part (Figure 7d (D)). Spots also exhibited different phenotypes across the two structures that are supported by DEGs (Figure S23bc and Supplementary Note). Consistent with this finding, marker genes of oligodendrocytes had a higher rank in metagene b5, which was enriched in the deep part (Figure 7e).

Together, these results further demonstrated the ability of SpiceMix to capture subtle but biologically important anatomical structures from spatial transcriptome data acquired by a variety of technologies.

Discussion

We have developed SpiceMix, an unsupervised method for modeling the diverse factors of cell identity in complex tissues based on various types of spatial transcriptome data. The integrated model of SpiceMix combines the expressive power of NMF for modeling gene expression with the HMRF for modeling spatial relationships, advancing current state-of-the-art modeling for spatial transcriptomics as clearly shown in both simulation evaluation and real data applications. On single-cell spatial transcriptome data of the mouse primary visual cortex from seqFISH+ and STARmap, SpiceMix demonstrated its effectiveness in producing reliable spatially variable metagenes and biologically informative latent representations of cell identity. On the human DLPFC data acquired by Visium, SpiceMix improved the identification of annotated layers and revealed finer anatomical structures.

A significant feature of SpiceMix is the spatially variable metagene formulation, which can model the interplay of the spatial and intrinsic composition of the transcriptome and not merely the spatial patterns of individual genes [22, 25]. Crucially, as part of the model formulation, SpiceMix considers how these metagenes are integrally related to continuous cell states, which represents a major distinction compared to other approaches [18, 19]. We note that since SpiceMix is an unsupervised method, we have showcased its application to datasets with large, unbiased gene panels. For datasets with targeted panels guided largely by prior knowledge, however, a tool such as Tangram [29] could be used to extend the gene panel and thereby further increase the power of SpiceMix.

As the field of spatial transcriptomics continues to grow and its technologies become more widely available, new technologies and datasets will open many new directions. In particular, it will be of great interest to model the dynamics of spatial patterns across diverse samples and along normal development or disease progression. Another exciting development is the generation of spatial multiomic data, which integrates transcriptome with other data types such as protein expression [47]. Understanding the relationships between different data modalities within their spatial context could lead to a more complete understanding of the in situ molecular underpinning of diverse cell states in complex tissues. There is also continued interest in studying cell-cell interaction and communication [48], which spatial transcriptomics can uniquely elucidate.

Enhanced computational methods that can analyze, summarize, and interpret spatial omics data will be crucial to future studies. By effectively modeling the complex mixing of latent intrinsic and spatial factors of heterogeneous cell identity in complex tissues, SpiceMix offers a useful tool to facilitate discoveries for diverse types of spatial omics data. We note that SpiceMix is not limited to transcriptomic data only, and its methodology may also be well-suited for multiomic data. In future work, enhancements may be made to SpiceMix to allow for progressive changes in the learned spatial patterns. Further, the refined cell identities and learned spatial affinities of SpiceMix may be useful for studying other aspects of tissue dynamics, including cell-cell interactions. Overall, SpiceMix is a powerful framework for the analysis of diverse types of spatial transcriptome and multiomic data, with the distinct advantage that it can unravel the complex mixing of latent intrinsic and spatial factors of heterogeneous cell identity in complex tissues.

Methods

The probabilistic graphical model NMF-HMRF in SpiceMix

Gene expression as matrix factorization

We consider the expression of individual cells $Y=[y_1,\ldots,y_N]\in\mathbb{R}_+^{G\times N}$, where constants $G$ and $N$ denote the number of genes and cells, respectively, to be the product of $K$ underlying factors (i.e., metagenes), $M=[m_1,\ldots,m_K]\in\mathbb{R}^{G\times K}$, $m_k\in S_{G-1}$, and weights, $X=[x_1,\ldots,x_N]\in\mathbb{R}_+^{K\times N}$, i.e.,

Y=MX+E. (1)

This follows the non-negative matrix factorization (NMF) formulation of expression of prior work [49]. The term $E=[e_1,\ldots,e_N]\in\mathbb{R}^{G\times N}$ captures unexplained variation or noise, which we model as i.i.d. Gaussian, i.e., $e_i\sim\mathcal{N}(0,\sigma_y^2 I)$. To resolve the scaling ambiguity between $M$ and $X$, we constrain the columns of $M$ to sum to one, so as to lie in the $(G-1)$-dimensional simplex, $S_{G-1}$. For notational consistency, we use capital letters to denote matrices and lowercase letters to denote their column vectors.
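As an illustration of Eq. 1 and the simplex constraint, the following minimal sketch draws a synthetic dataset and shows that moving the column scale of M into X leaves the product MX unchanged. The sizes, the noise level `sigma_y`, the Dirichlet draw for M, and the helper `rescale` are assumptions made for this example and are not taken from the SpiceMix implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
G, N, K = 50, 200, 4      # genes, cells, metagenes (illustrative sizes)
sigma_y = 0.1             # noise standard deviation (assumed value)

# Metagenes: each column of M is a point on the (G-1)-dimensional simplex.
M = rng.dirichlet(np.ones(G), size=K).T        # shape (G, K)
# Non-negative weights of the metagenes in each cell.
X = rng.exponential(scale=1.0, size=(K, N))    # shape (K, N)
# i.i.d. Gaussian noise; note that it can push entries of Y slightly below zero.
E = rng.normal(0.0, sigma_y, size=(G, N))
Y = M @ X + E                                  # Eq. 1

def rescale(M, X):
    """Move the column scale of M into X so that each column of M sums to one."""
    s = M.sum(axis=0)
    return M / s, X * s[:, None]

# Any rescaled pair (2M, X/2) maps back to the same constrained factorization.
M_s, X_s = rescale(2.0 * M, X / 2.0)
```

The constraint removes the ambiguity that (cM, X/c) explains the data equally well for any c > 0.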

Graphical model formulation

The formulation for our probabilistic graphical model NMF-HMRF in SpiceMix enhances standard NMF by modeling the spatial correlations among samples (i.e., cells or spots in this context) via the HMRF [50]. This novel integration aids inference of the latent M and X by enforcing spatial consistency. The spatial relationship between cells in tissue is represented as a graph 𝓖=(𝓥,𝓔) of nodes 𝓥 and edges 𝓔, where each cell is a node and edges are determined from the spatial locations. Any graph construction algorithm, such as distance thresholding or Delaunay triangulation, can be used for determining edges. For each node i in the graph, the measured gene expression vector, yi, is the set of observed variables and the weights, xi, describing the mixture of metagenes are the hidden states. The observations are related to the hidden variables via the potential function ϕ, which captures the NMF formulation. The spatial affinity between the metagene proportions of neighboring cells is captured by the potential function φ. Together, these elements constitute the HMRF.
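The graph construction step can be sketched with distance thresholding, one of the two options mentioned above. This is a hedged example: the function name `threshold_graph`, the toy coordinates, and the cutoff are hypothetical, and the dense pairwise distance matrix is only practical for small datasets.

```python
import numpy as np

def threshold_graph(coords, max_dist):
    """Return undirected edges (i, j), i < j, for points within max_dist of each other."""
    coords = np.asarray(coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # Keep only the upper triangle so each pair is listed once and self-loops are excluded.
    i, j = np.where(np.triu(dist <= max_dist, k=1))
    return list(zip(i.tolist(), j.tolist()))

# Three nearby cells and one isolated cell (toy coordinates).
coords = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
edges = threshold_graph(coords, max_dist=1.5)
```

In this toy example the three nearby cells are mutually connected and the isolated cell has no edges.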

More specifically, the potential function ϕ measures the squared reconstruction error of the observed expression of cell i according to the estimated xi and M,

$\phi(y_i,x_i)=\exp(-U_y(y_i,x_i)),\quad U_y(y_i,x_i)=\frac{\|y_i-Mx_i\|_2^2}{2\sigma_y^2},$ (2)

where $\sigma_y^2$ represents the variation of expression, or noise, of the NMF. The spatial potential function φ measures the inner-product between the metagene proportions of neighboring cells $i$ and $j$, weighted by the learned, pairwise correlation matrix $\Sigma_x^{-1}$, which captures the spatial affinity of metagenes, i.e.,

$\varphi(x_i,x_j)=\exp(-U_x(x_i,x_j)),\quad U_x(x_i,x_j)=\frac{x_i^\top}{\|x_i\|_1}\,\Sigma_x^{-1}\,\frac{x_j}{\|x_j\|_1}.$ (3)

This form for φ has several motivations. The weighted inner-product allows the affinity between two cells to be decomposed simply as the weighted sum of affinities between metagenes and for the metagenes to have different and learnable affinities between each other. It also allows the model to capture both positive and negative affinities between metagenes. By normalizing the weights $x_i$ of each cell, any scaling effects, such as cell size, are removed. In this way, the similarity that is measured is purely a function of the relative proportions of metagenes. This form also affords a straightforward interpretation for the affinity matrix $\Sigma_x^{-1}$. Lastly, it is more convenient for optimization.
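The scale invariance described above can be checked numerically. The sketch below assumes that the spatial energy of Eq. 3 is the inner product of the L1-normalized weights, weighted by the affinity matrix. The function name `spatial_energy` and the 2×2 matrix are hypothetical values chosen only for illustration.

```python
import numpy as np

def spatial_energy(x_i, x_j, Sigma_x_inv):
    """Weighted inner product between the L1-normalized weights of two cells."""
    p_i = x_i / np.abs(x_i).sum()
    p_j = x_j / np.abs(x_j).sum()
    return float(p_i @ Sigma_x_inv @ p_j)

# Toy symmetric affinity matrix for K = 2 metagenes (illustrative values).
Sigma_x_inv = np.array([[-1.0, 2.0],
                        [2.0, -1.0]])
a = np.array([3.0, 0.0])   # a cell expressing only metagene 1
b = np.array([0.0, 5.0])   # a cell expressing only metagene 2

same = spatial_energy(a, a, Sigma_x_inv)           # picks out the (1, 1) entry
cross = spatial_energy(a, b, Sigma_x_inv)          # picks out the (1, 2) entry
scaled = spatial_energy(10.0 * a, b, Sigma_x_inv)  # unchanged by rescaling a
```

For cells that each express a single metagene, the energy reduces to the corresponding entry of the affinity matrix, which is what makes the matrix directly interpretable.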

Given an observed dataset, the model can be learned by maximizing the likelihood of the data. By the Hammersley-Clifford theorem [51], the likelihood of the data for the pairwise HMRF can be formulated as the product of pairwise dependencies between nodes,

$P(Y,X\mid\Theta)=\frac{1}{Z(\Theta)}\prod_{(i,j)\in\mathcal{E}}\varphi(x_i,x_j)\prod_{i\in\mathcal{V}}\phi(y_i,x_i)\,\pi(x_i),$ (4)

where Θ = {Δ, M} is the set of model parameters and metagenes and Z(Θ) is the normalizing partition function that ensures P is a proper probability distribution. The potential function π is added to capture an exponential prior on the hidden states X,

$\lambda_x=1,\quad \pi(x_i)=\exp(-\lambda_x\|x_i\|_1),$ (5)

with scale parameter 1. We normalize the average of the total normalized expression levels in individual cells to K correspondingly.

Parameter priors

We introduce a regularization hyperparameter $\lambda_\Sigma$ on the spatial affinities, which allows the users to control the importance of the spatial relationships during inference to suit the dataset of interest. As the parameter decreases, the influence of spatial affinities during inference diminishes and the model becomes more similar to standard NMF. If we represent $\lambda_\Sigma$ in the form $\lambda_\Sigma=1/(2\sigma_\Sigma^2)$, we can treat it as a Gaussian prior, with zero mean and variance $\sigma_\Sigma^2$, on the elements of the spatial affinity matrix $\Sigma_x^{-1}$,

$P(\Sigma_x^{-1})=(\pi/\lambda_\Sigma)^{-K^2/2}\exp(-\lambda_\Sigma\|\Sigma_x^{-1}\|_F^2),$ (6)

where $\|\cdot\|_F$ denotes the Frobenius norm. Note that the matrix $\Sigma_x^{-1}$ is constrained to be symmetric.

Alternating estimation of hidden states and parameters

To infer the hidden states and model parameters of the NMF-HMRF model in SpiceMix, we optimize the data likelihood via coordinate ascent, alternating between optimizing hidden states and model parameters. This optimization scheme is summarized in Supplementary Note. First, to make inference tractable, we approximate the joint probability of the hidden states by the pseudo-likelihood [51], which is the product of conditional probabilities of the hidden state of individual nodes given that of their neighbors,

$P(X\mid\Theta)\approx\prod_{i\in\mathcal{V}}P(x_i\mid x_{\eta(i)},\Theta),$ (7)

where η(i) is the set of neighbors to node i.

Estimation of hidden states

Given parameters Θ of the model, we estimate the factorizations X by maximizing their posterior distribution. The maximum a posteriori (MAP) estimate of X is given by:

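$\hat{X}=\arg\max_{X\in\mathbb{R}_+^{K\times N}}P(X\mid Y,\Theta)=\arg\max_{X\in\mathbb{R}_+^{K\times N}}P(Y,X\mid\Theta)=\arg\max_{X\in\mathbb{R}_+^{K\times N}}\{\log P(Y,X\mid\Theta)\}$ (8)
$=\arg\max_{X\in\mathbb{R}_+^{K\times N}}\Big\{\sum_{i\in\mathcal{V}}\big[-U_y(y_i,x_i)+\log\pi(x_i)\big]-\sum_{(i,j)\in\mathcal{E}}U_x(x_i,x_j)\Big\}.$ (9)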

This is a quadratic program and can be solved efficiently via the iterated conditional modes (ICM) algorithm [52] using the software package Gurobi [53] (see Supplementary Note for more details of the optimization for hidden states).
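To make the objective of Eq. 9 concrete, the following sketch evaluates its negative (so that smaller is better) for a given X. It assumes the reconstruction energy is the squared error scaled by twice the noise variance, that the exponential prior contributes an L1 penalty on X, and that the spatial energy uses L1-normalized weights. The function name `neg_log_posterior` and the toy inputs are hypothetical. The sketch only evaluates the objective and does not perform the ICM or quadratic-program solve.

```python
import numpy as np

def neg_log_posterior(Y, X, M, Sigma_x_inv, edges, sigma_y=1.0, lambda_x=1.0):
    """Negative of the Eq. 9 objective: reconstruction + prior + spatial terms."""
    resid = Y - M @ X
    recon = (resid ** 2).sum() / (2.0 * sigma_y ** 2)   # sum of U_y over cells
    prior = lambda_x * np.abs(X).sum()                  # -log pi(x_i), summed over cells
    P = X / np.abs(X).sum(axis=0, keepdims=True)        # metagene proportions per cell
    spatial = sum(P[:, i] @ Sigma_x_inv @ P[:, j] for i, j in edges)
    return float(recon + prior + spatial)

# Toy problem: two genes, two metagenes, two neighboring cells, exact reconstruction.
M = np.eye(2)
X = np.array([[1.0, 2.0],
              [3.0, 4.0]])
Y = M @ X
edges = [(0, 1)]
no_spatial = neg_log_posterior(Y, X, M, np.zeros((2, 2)), edges)
with_spatial = neg_log_posterior(Y, X, M, np.eye(2), edges)
```

With a zero affinity matrix the value reduces to the NMF reconstruction term plus the prior, which corresponds to the non-spatial limit of the model.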

Estimation of model parameters

Given an estimate of the hidden states X, we can likewise solve for the unknown model parameters Θ by maximizing their posterior distribution. The MAP estimate of the parameters Θ is given by:

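$\hat{\Theta}=\arg\max_{\Theta}P(\Theta\mid Y,X)=\arg\max_{\Theta}P(Y,X\mid\Theta)P(\Theta)=\arg\max_{\Theta}\{\log P(Y,X\mid\Theta)+\log P(\Theta)\}$ (10)
$=\arg\max_{\Theta}\Big\{\sum_{i\in\mathcal{V}}\big[-U_y(y_i,x_i)+\log\pi(x_i)\big]-\sum_{(i,j)\in\mathcal{E}}U_x(x_i,x_j)-\log Z(\Theta)+\log P(\Theta)\Big\}$ (11)
$\approx\arg\max_{\Theta}\Big\{\sum_{i\in\mathcal{V}}\big[-U_y(y_i,x_i)+\log\pi(x_i)-\log Z_i(\Theta)\big]-\sum_{(i,j)\in\mathcal{E}}U_x(x_i,x_j)+\log P(\Theta)\Big\}.$ (12)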

Eqn. 12 is an approximation by the mean-field assumption [51], which is used, in addition to the pseudo-likelihood assumption, to make the inference of model parameters tractable. We note that we can estimate metagenes, spatial affinity, and the noise level independently. The MAP estimate of the metagenes $M$ is a quadratic program, which is efficient to solve. The MAP estimation of $\Sigma_x^{-1}$ is a convex problem and is solved by the optimizer Adam [54]. Due to the complexity of the partition function $Z_i(\Theta)$ of the likelihood, which includes integration over $X$, it is approximated by a Taylor expansion. Since it is a function of $\Theta$, this computation must be performed at each optimization iteration. See Supplementary Note for details of the optimization method for model parameters.

Initialization

To produce the initial estimates of the model parameters and hidden states, we do the following. First, we use a common strategy for initializing NMF, which is to cluster the data using K-means clustering, with K equal to the number of metagenes, and use the means of the clusters as an estimate of the metagenes. We then alternate for T0 iterations between solving the NMF objective for X and M. This produces, in only a few quick iterations, an appropriate initial estimate for the algorithm, which will be subsequently refined. We observed that if T0 is too large, it can cause the algorithm to prematurely reach a local minimum before spatial relationships are considered. However, this value can be easily tuned by experimentation, and in our analysis, we found that just 5 iterations were necessary.
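The initialization strategy can be sketched as follows. This is a minimal illustration and not the SpiceMix code: it uses a plain Lloyd's K-means written in NumPy, and it substitutes the multiplicative updates of Lee and Seung [30] for the alternating NMF solves, which is an assumption made to keep the example dependency-free. The function names, the synthetic data, and the default `T0=5` are illustrative.

```python
import numpy as np

def kmeans_centers(Y, K, n_iter=20, seed=0):
    """Plain Lloyd's K-means on the columns (cells) of Y; returns a (G, K) matrix of means."""
    rng = np.random.default_rng(seed)
    centers = Y[:, rng.choice(Y.shape[1], size=K, replace=False)].copy()
    for _ in range(n_iter):
        # Squared distance from every cell to every center: shape (N, K).
        d = ((Y[:, :, None] - centers[:, None, :]) ** 2).sum(axis=0)
        labels = d.argmin(axis=1)
        for k in range(K):
            members = Y[:, labels == k]
            if members.shape[1] > 0:   # keep the old center if a cluster empties
                centers[:, k] = members.mean(axis=1)
    return centers

def init_nmf(Y, K, T0=5, eps=1e-9, seed=0):
    """K-means start followed by T0 rounds of multiplicative NMF updates."""
    M = np.maximum(kmeans_centers(Y, K, seed=seed), eps)
    X = np.ones((K, Y.shape[1]))
    for _ in range(T0):
        X *= (M.T @ Y) / (M.T @ M @ X + eps)   # update weights with M fixed
        M *= (Y @ X.T) / (M @ X @ X.T + eps)   # update metagenes with X fixed
    scale = M.sum(axis=0)                      # put columns of M on the simplex
    return M / scale, X * scale[:, None]

# Synthetic non-negative data (illustrative sizes).
rng = np.random.default_rng(1)
Y = rng.dirichlet(np.ones(30), size=3).T @ rng.exponential(size=(3, 120))
M0, X0 = init_nmf(Y, K=3, T0=5)
```

Keeping T0 small follows the observation above that too many NMF-only iterations can settle the estimate before the spatial term is taken into account.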

Empirical running time

On a CentOS 7 machine with sixteen 2.30GHz Intel(R) Xeon(R) Gold 5218 CPUs and one GeForce 2080 Ti GPU, SpiceMix takes 0.5–2 hours to run on a typical spatial transcriptome dataset with 2,000 genes and 1,000 cells. The GPU is used only for roughly the first 5 iterations, during which the spatial affinity matrix $\Sigma_x^{-1}$ changes significantly. In subsequent iterations, most time is spent solving quadratic programs. Since the algorithm uses a few iterations of NMF to provide an initial estimate, which is a reasonable starting point, it is expected to find a good initial estimate of metagenes and latent states efficiently.

Generation and analysis of simulated data

We generated simulated spatial transcriptomic data following expression and spatial patterns of cells of the mouse primary visual cortex. Cells in the mouse cortex are classified into three primary categories: inhibitory neurons, excitatory neurons, and non-neurons or glial cells [33, 55]. Excitatory neurons in the cortex exhibit dense, concentrated, layer-wise specificity, whereas inhibitory neurons are sparse and can be spread across several layers. Non-neuronal cells can be either layer-specific or scattered across layers. We simulated single-cell data from an imaging-based method applied to a slice of tissue, which consists of four distinct vertical layers and eight cell types: four excitatory, two inhibitory, and two glial (Figure 2a). Each layer was densely populated by one layer-specific excitatory neuron type. The two inhibitory neuron types were scattered sparsely throughout several layers. One non-neuronal type was restricted to the first layer and the other was scattered sparsely throughout several layers. For each simulated image, or tissue sample, 500 cells were created with locations generated randomly in such a way so as to maintain a minimum distance between any two cells, so that the density of cells across the sample was roughly constant. With this spatial layout of cells, we devised two methodologies for generating gene expression data for individual cells. The first uses a metagene-based formulation and the second uses a recent method, scDesign2 [32], which we fit to real scRNA-seq data of the mouse cortex [33] (see Supplementary Note). See Supplementary Note for details of the methodology of the analysis of the two simulation datasets.
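One simple way to place cells at random while maintaining a minimum pairwise distance is rejection sampling, sketched below. The unit-square domain, the function name `sample_locations`, and the parameter values are assumptions for illustration. The exact procedure used for the simulations is described in the Supplementary Note and may differ.

```python
import numpy as np

def sample_locations(n_cells, min_dist, seed=0, max_tries=100000):
    """Draw points uniformly in the unit square, rejecting any closer than min_dist to an accepted point."""
    rng = np.random.default_rng(seed)
    pts = []
    tries = 0
    while len(pts) < n_cells and tries < max_tries:
        tries += 1
        cand = rng.random(2)
        if all(np.hypot(*(cand - p)) >= min_dist for p in pts):
            pts.append(cand)
    if len(pts) < n_cells:
        raise RuntimeError("could not place all cells; lower min_dist or n_cells")
    return np.array(pts)

# Smaller than the 500 cells per sample used in the simulations, to keep the example fast.
locations = sample_locations(n_cells=100, min_dist=0.03)
```

Because candidates that fall too close to an existing cell are discarded, the accepted points cover the sample at a roughly constant density.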

Data processing for the used spatial transcriptome datasets

Preprocessing and analysis of seqFISH+ data

We applied SpiceMix on a seqFISH+ dataset that profiled the mouse primary visual cortex [9]. We first removed genes that had non-zero expression in fewer than 40% of cells, which yielded an unbiased set of 2,470 genes. We then normalized the expression of these genes by scaling the total counts to 10,000 per cell, adding one, and applying the log transform: $E_{ig}:=\log\left(1+10^4\,E_{ig}/\sum_{g'}E_{ig'}\right)$. To generate a graphical representation of the cells, we applied Delaunay triangulation to physical coordinates of cells, and then removed edges of length larger than 300 pixels (30.9 μm).
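The gene filtering and normalization steps can be sketched as below, assuming a cells-by-genes count matrix with no empty cells. The function names `filter_genes` and `log_normalize` and the toy counts are hypothetical, and the sketch does not cover the Delaunay graph construction.

```python
import numpy as np

def filter_genes(counts, min_frac):
    """Boolean mask of genes with non-zero expression in at least min_frac of cells."""
    return (counts > 0).mean(axis=0) >= min_frac

def log_normalize(counts, target=1e4):
    """Scale each cell (row) to `target` total counts, add one, and take the log."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    return np.log1p(target * counts / totals)

# Toy matrix: 3 cells x 3 genes; the middle gene is detected in only one cell.
counts = np.array([[1, 0, 3],
                   [1, 0, 1],
                   [2, 1, 1]])
keep = filter_genes(counts, min_frac=0.4)
normalized = log_normalize(counts[:, keep])
```

Filtering before normalization matches the order described above, so the per-cell totals are computed over the retained genes only.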

For the regularization parameter of the spatial pairwise dependency, $\lambda_\Sigma$, we considered possible values in the set $\{2|\mathcal{E}|\times10^{-2},\,2|\mathcal{E}|\times10^{-4},\,2|\mathcal{E}|\times10^{-6}\}$. We found $2|\mathcal{E}|\times10^{-4}$ to yield the desired balance of spatial regularization based upon visual inspection. We experimented with the number of metagenes, $K$, and chose the highest value before the expression of metagenes became too sparse. We also examined the UMAP plots of latent states, without annotations from the original analysis, to guide our selection. This led us to use $K = 20$ metagenes for both SpiceMix and NMF. For each hyperparameter configuration, we ran the algorithm several times with different initial random seeds and chose the random seed that resulted in the highest value of the objective function, $Q$. After learning the latent states, we z-score normalized the latent states along the cell dimension and performed hierarchical clustering on the normalized latent states to define cell type assignment using Ward’s method and the Euclidean distance [56]. We used the Calinski-Harabasz (CH) index [57] as the criterion for determining the optimal number of clusters. Before downstream analysis, we repeatedly merged the two clusters with the lowest threshold from hierarchical clustering until the last 3 splits did not create any cluster with fewer than five cells. We then eliminated outlier SpiceMix cell types that had fewer than five cells. This led to 15 cell types for SpiceMix and 13 cell types for NMF.
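The z-score normalization and the CH criterion can be sketched in NumPy as follows. The CH index here follows its standard definition as the ratio of between-cluster to within-cluster dispersion, each divided by its degrees of freedom. The function names and toy data are illustrative, and the Ward clustering step itself is not reproduced.

```python
import numpy as np

def zscore_cells(X):
    """X has shape (K, N); standardize each latent dimension across the N cells."""
    mu = X.mean(axis=1, keepdims=True)
    sd = X.std(axis=1, keepdims=True)
    return (X - mu) / np.where(sd > 0, sd, 1.0)   # leave constant rows at zero

def calinski_harabasz(Z, labels):
    """Z has shape (N, d); labels has length N. Returns the CH index."""
    Z = np.asarray(Z, dtype=float)
    labels = np.asarray(labels)
    n, ks = Z.shape[0], np.unique(labels)
    overall = Z.mean(axis=0)
    between = within = 0.0
    for k in ks:
        Zk = Z[labels == k]
        ck = Zk.mean(axis=0)
        between += Zk.shape[0] * ((ck - overall) ** 2).sum()
        within += ((Zk - ck) ** 2).sum()
    return (between / (len(ks) - 1)) / (within / (n - len(ks)))

# Four cells on a line: a well-separated partition scores higher than a mixed one.
Z = np.array([[0.0], [2.0], [10.0], [12.0]])
good = calinski_harabasz(Z, [0, 0, 1, 1])
bad = calinski_harabasz(Z, [0, 1, 0, 1])
```

Evaluating the index over candidate numbers of clusters and taking the maximum is one way to apply this criterion.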

For further details of the methodology of our analysis, see Supplementary Note. This includes details of our selection of Louvain clusters from [9] used in our comparative analysis with SpiceMix and additional details of the method to justify the excitatory neuron clusters of SpiceMix.

Preprocessing and analysis of STARmap data

We also applied SpiceMix on a STARmap dataset that profiled the mouse primary visual cortex [10]. We normalized the data by scaling the total counts to 10,000 per cell, adding one, and applying the log transform: $E_{ig}:=\log\left(1+10^4\,E_{ig}/\sum_{g'}E_{ig'}\right)$. To generate a graphical representation of the cells, we applied Delaunay triangulation to physical coordinates of cells, and then removed edges of length larger than 600 pixels.

For the regularization parameter of the spatial pairwise dependency, $\lambda_\Sigma$, we considered possible values in the set $\{2|\mathcal{E}|\times10^{-2},\,2|\mathcal{E}|\times10^{-4},\,2|\mathcal{E}|\times10^{-6}\}$. We found $2|\mathcal{E}|\times10^{-4}$ to yield the desired balance of spatial regularization based upon visual inspection. We experimented with the number of metagenes, $K$, and chose the highest value for each algorithm before the expression of metagenes became too sparse. We also examined the UMAP plots of latent states, without annotations from the original analysis, to guide our selection. This led us to use $K = 20$ metagenes for SpiceMix and $K = 15$ metagenes for NMF. For each hyperparameter configuration, we ran the algorithm several times with different initial random seeds and chose the random seed that resulted in the highest value of the objective function, $Q$. After learning the latent states, we z-score normalized the latent states along the cell dimension and performed hierarchical clustering on the normalized latent states to define cell type assignment using Ward’s method and the Euclidean distance [56]. We used the CH index as the criterion for determining the optimal number of clusters. Before downstream analysis, we removed an outlier SpiceMix cell type that had only one cell. This led to 16 cell types for SpiceMix and 11 cell types for NMF.

For further details of the methodology of our analysis, see Supplementary Note. This includes details of the selection of hyperparameters for HMRF and scHPF, our trajectory analysis of oligodendrocytes using Monocle2, and GO enrichment analysis of myelin sheath formation in oligodendrocytes.

Preprocessing and analysis of Visium data

Lastly, we applied SpiceMix to a dataset acquired from the 10x Genomics Visium platform that profiled spatial transcriptome of the human DLPFC [31]. For analysis with SpiceMix, we removed genes that had non-zero expression in fewer than 10% of spots, which yielded an unbiased set of 3,194 genes. We did not apply this filtering when using SpaGCN or BayesSpace. We then normalized the expression of these genes by scaling the total counts to 10,000 per spot, adding one, and applying the log transform: $E_{ig}:=\log\left(1+10^4\,E_{ig}/\sum_{g'}E_{ig'}\right)$. To generate a graphical representation of the spots, we defined the neighborhood of a spot to be the set of directly adjacent spots in the hexagonal grid, since the spots in each FOV form a hexagonal grid. Therefore, except for spots on the edge of the grid, each spot has exactly 6 neighbors.
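The hexagonal neighborhood can be sketched as follows. The example assumes spots are indexed by integer (row, column) offset coordinates in which the column parity matches the row parity, so that the six neighbors sit at column offsets of ±2 in the same row and ±1 in the adjacent rows. This indexing convention and the function name `hex_neighbors` are assumptions of the example and should be checked against the coordinates of the dataset at hand.

```python
def hex_neighbors(spots):
    """Map each (row, col) spot to its directly adjacent spots on a hexagonal grid."""
    spot_set = set(spots)
    # Same row: two columns left/right; adjacent rows: one column left/right.
    offsets = [(0, -2), (0, 2), (-1, -1), (-1, 1), (1, -1), (1, 1)]
    return {
        (r, c): [(r + dr, c + dc) for dr, dc in offsets if (r + dr, c + dc) in spot_set]
        for (r, c) in spot_set
    }

# Toy grid with three rows; even rows use even columns and odd rows use odd columns.
spots = [(r, c) for r in range(3) for c in range(7) if (r + c) % 2 == 0]
neighbors = hex_neighbors(spots)
interior = neighbors[(1, 3)]   # an interior spot
corner = neighbors[(0, 0)]     # a spot on the edge of the grid
```

Interior spots obtain six neighbors, whereas spots on the edge of the grid obtain fewer, in line with the description above.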

For further details of the methodology of our analysis, see Supplementary Note. This includes details on the selection of the four FOVs from the Br8100 sample to use for the ARI score comparison, the ARI score comparison between SpiceMix, SpaGCN, and BayesSpace on these FOVs, the subsequent analysis of SpiceMix metagenes on these FOVs, and the analysis of SpiceMix metagenes on sample Br5292.

Additional data processing

Doublet detection [58] was performed on the seqFISH+ and STARmap datasets to confirm that none of the cells in either dataset were doublets; see Supplementary Note for details. For the explanation of our method for constructing the cell-type affinity matrix for the seqFISH+ and STARmap datasets, see Supplementary Note.

Ethical approval

Our study does not require ethics approval.

Statistics and reproducibility

All code necessary to recreate the results in this work is available on our GitHub repository: https://github.com/ma-compbio/SpiceMix, downloadable from: https://doi.org/10.5281/zenodo.7256107.

Acknowledgements

This work was supported in part by the National Institutes of Health Common Fund 4D Nucleome Program grant UM1HG011593 (J.M.), National Institutes of Health Common Fund Cellular Senescence Network Program grant UG3CA268202 (J.M.), National Institutes of Health grant R01HG007352 (J.M.), and National Science Foundation grant 1717205 (J.M.). J.M. is additionally supported by a Guggenheim Fellowship from the John Simon Guggenheim Memorial Foundation. The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript.

Footnotes

Competing Interests

The authors declare no competing interests.

Code Availability

The source code of SpiceMix can be accessed at https://github.com/ma-compbio/SpiceMix and is downloadable from https://doi.org/10.5281/zenodo.7256107 [59]. For our comparisons against other methods, the following versions were used: Seurat v4.0.5, SpaGCN v1.0.0, BayesSpace v1.2.0, HMRF v1.3.3, and scHPF v0.5.0. The tool scDesign2 v0.1.0 for single-cell simulation was used as part of the process for generating the simulated data of Approach II.

Data Availability

The simulated data generated for this work is available at: https://github.com/ma-compbio/SpiceMix. The spatial transcriptomic and single-cell datasets used in this study were obtained through publicly available repositories.

References

  • [1]. Arendt D et al. The origin and evolution of cell types. Nature Reviews Genetics 17, 744–757 (2016).
  • [2]. Chen X, Teichmann SA & Meyer KB From tissues to cell types and back: Single-cell gene expression analysis of tissue architecture. Annual Review of Biomedical Data Science 1, 29–51 (2018).
  • [3]. Consortium H et al. The human body at cellular resolution: the NIH Human Biomolecular Atlas Program. Nature 574, 187–192 (2019).
  • [4]. Lee JH et al. Highly multiplexed subcellular RNA sequencing in situ. Science 343, 1360–1363 (2014).
  • [5]. Chen KH, Boettiger AN, Moffitt JR, Wang S & Zhuang X Spatially resolved, highly multiplexed RNA profiling in single cells. Science 348, aaa6090 (2015).
  • [6]. Shah S, Lubeck E, Zhou W & Cai L In situ transcription profiling of single cells reveals spatial organization of cells in the mouse hippocampus. Neuron 92, 342–357 (2016).
  • [7]. Ståhl PL et al. Visualization and analysis of gene expression in tissue sections by spatial transcriptomics. Science 353, 78–82 (2016).
  • [8]. Moffitt JR et al. Molecular, spatial, and functional single-cell profiling of the hypothalamic preoptic region. Science 362, eaau5324 (2018).
  • [9]. Eng C-HL et al. Transcriptome-scale super-resolved imaging in tissues by RNA seqFISH+. Nature 568, 235–239 (2019).
  • [10]. Wang X et al. Three-dimensional intact-tissue sequencing of single-cell transcriptional states. Science 361, eaat5691 (2018).
  • [11]. Rodriques SG et al. Slide-seq: A scalable technology for measuring genome-wide expression at high spatial resolution. Science 363, 1463–1467 (2019).
  • [12]. Vickovic S et al. High-definition spatial transcriptomics for in situ tissue profiling. Nature Methods 16, 987–990 (2019).
  • [13]. Zhuang X Spatially resolved single-cell genomics and transcriptomics by imaging. Nature Methods 18, 18–22 (2021).
  • [14]. Larsson L, Frisén J & Lundeberg J Spatially resolved transcriptomics adds a new dimension to genomics. Nature Methods 18, 15–18 (2021).
  • [15]. Lein E, Borm LE & Linnarsson S The promise of spatial transcriptomics for neuroscience in the era of molecular cell typing. Science 358, 64–69 (2017).
  • [16]. Palla G, Fischer DS, Regev A & Theis FJ Spatial components of molecular tissue biology. Nature Biotechnology 40, 308–318 (2022).
  • [17]. Schapiro D et al. histoCAT: analysis of cell phenotypes and interactions in multiplex image cytometry data. Nature Methods 14, 873–876 (2017).
  • [18]. Zhu Q, Shah S, Dries R, Cai L & Yuan G-C Identification of spatially associated subpopulations by combining scRNAseq and sequential fluorescence in situ hybridization data. Nature Biotechnology 36, 1183–1190 (2018).
  • [19]. Hu J et al. SpaGCN: Integrating gene expression, spatial location and histology to identify spatial domains and spatially variable genes by graph convolutional network. Nature Methods 18, 1342–1351 (2021).
  • [20]. Jerby-Arnon L & Regev A DIALOGUE maps multicellular programs in tissue from single-cell or spatial transcriptomics data. Nature Biotechnology 1–11 (2022).
  • [21]. Zhao E et al. Spatial transcriptomics at subspot resolution with BayesSpace. Nature Biotechnology 39, 1375–1384 (2021).
  • [22]. Svensson V, Teichmann SA & Stegle O SpatialDE: identification of spatially variable genes. Nature Methods 15, 343–346 (2018).
  • [23]. Arnol D, Schapiro D, Bodenmiller B, Saez-Rodriguez J & Stegle O Modeling cell-cell interactions from spatial molecular data with spatial variance component analysis. Cell Reports 29, 202–211 (2019).
  • [24]. Nitzan M, Karaiskos N, Friedman N & Rajewsky N Gene expression cartography. Nature 576, 132–137 (2019).
  • [25]. Sun S, Zhu J & Zhou X Statistical analysis of spatial expression patterns for spatially resolved transcriptomic studies. Nature Methods 17, 193–200 (2020).
  • [26]. Stuart T et al. Comprehensive integration of single-cell data. Cell 177, 1888–1902.e21 (2019).
  • [27]. Welch JD et al. Single-cell multi-omic integration compares and contrasts features of brain cell identity. Cell 177, 1873–1887.e17 (2019).
  • [28]. Elosua-Bayes M, Nieto P, Mereu E, Gut I & Heyn H SPOTlight: seeded NMF regression to deconvolute spatial transcriptomics spots with single-cell transcriptomes. Nucleic Acids Research 49, e50 (2021).
  • [29]. Biancalani T et al. Deep learning and alignment of spatially resolved single-cell transcriptomes with Tangram. Nature Methods 18, 1352–1362 (2021).
  • [30]. Lee DD & Seung HS Algorithms for non-negative matrix factorization. In Advances in Neural Information Processing Systems, 556–562 (2001).
  • [31]. Maynard KR et al. Transcriptome-scale spatial gene expression in the human dorsolateral prefrontal cortex. Nature Neuroscience 24, 425–436 (2021).
  • [32]. Sun T, Song D, Li WV & Li JJ scDesign2: a transparent simulator that generates high-fidelity single-cell gene expression count data with gene correlations captured. Genome Biology 22, 1–37 (2021).
  • [33]. Tasic B et al. Adult mouse cortical cell taxonomy revealed by single cell transcriptomics. Nature Neuroscience 19, 335–346 (2016).
  • [34]. Satija R, Farrell JA, Gennert D, Schier AF & Regev A Spatial reconstruction of single-cell gene expression data. Nature Biotechnology 33, 495–502 (2015).
  • [35]. Marques S et al. Oligodendrocyte heterogeneity in the mouse juvenile and adult central nervous system. Science 352, 1326–1329 (2016).
  • [36].Zhao C et al. Dual regulatory switch through interactions of Tcf7l2/Tcf4 with stage-specific partners propels oligodendroglial maturation. Nature Communications 7, 1–15 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [37].Linington C, Bradl M, Lassmann H, Brunner C & Vass K Augmentation of demyelination in rat acute allergic encephalomyelitis by circulating mouse monoclonal antibodies directed against a myelin/oligodendrocyte glycoprotein. The American Journal of Pathology 130, 443–454 (1988). [PMC free article] [PubMed] [Google Scholar]
  • [38].Tasic B et al. Shared and distinct transcriptomic cell types across neocortical areas. Nature 563, 72–78 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [39].Zeisel A et al. Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq. Science 347, 1138–1142 (2015). [DOI] [PubMed] [Google Scholar]
  • [40].Qiu X et al. Reversed graph embedding resolves complex single-cell trajectories. Nature Methods 14, 979–982 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [41].Marques S et al. Transcriptional convergence of oligodendrocyte lineage progenitors during development. Developmental Cell 46, 504–517 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [42].Beiter RM et al. Evidence for oligodendrocyte progenitor cell heterogeneity in the adult mouse brain. Scientific Reports 12, 1–15 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [43].Levitin HM et al. De novo gene signature identification from single-cell rna-seq with hierarchical poisson factorization. Molecular Systems Biology 15, e8557 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [44].Dataset: Allen institute for brain science (2021). allen cell types database – human multiple cortical areas [dataset]. available from: http://celltypes.brain-map.org/rnaseq.
  • [45].Zhang M et al. Spatially resolved cell atlas of the mouse primary motor cortex by merfish. Nature 598, 137–143 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [46].Tan S-S et al. Oligodendrocyte positioning in cerebral cortex is independent of projection neuron layering. Glia 57, 1024–1030 (2009). [DOI] [PubMed] [Google Scholar]
  • [47].Liu Y et al. High-spatial-resolution multi-omics sequencing via deterministic barcoding in tissue. Cell 183, 1665–1681 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [48].Armingol E, Officer A, Harismendy O & Lewis NE Deciphering cell–cell interactions and communication from gene expression. Nature Reviews Genetics 22, 71–88 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [49].Brunet J-P, Tamayo P, Golub TR & Mesirov JP Metagenes and molecular pattern discovery using matrix factorization. Proceedings of the National Academy of Sciences 101, 4164–4169 (2004). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [50].Zhang Y, Brady M & Smith S Segmentation of brain MR images through a hidden Markov random field model and the expectation-maximization algorithm. IEEE Transactions on Medical Imaging 20, 45–57 (2001). [DOI] [PubMed] [Google Scholar]
  • [51].Murphy K Machine learning: a probabilistic perspective (MIT Press, 2012). [Google Scholar]
  • [52].Besag J On the statistical analysis of dirty pictures. Journal of the Royal Statistical Society: Series B (Methodological) 48, 259–279 (1986). [Google Scholar]
  • [53].Gurobi Optimization, L. Gurobi optimizer reference manual (2020). URL http://www.gurobi.com.
  • [54].Kingma DP & Ba J Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014). [Google Scholar]
  • [55].Lein ES et al. Genome-wide atlas of gene expression in the adult mouse brain. Nature 445, 168–176 (2007). [DOI] [PubMed] [Google Scholar]
  • [56].Pedregosa F et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12, 2825–2830 (2011). [Google Scholar]
  • [57].Caliński T & Harabasz J A dendrite method for cluster analysis. Commun. Stat. Simul. Comput. 3, 1–27 (1974). [Google Scholar]
  • [58].Gayoso A, Shor J, Carr AJ, Sharma R & Pe’er D Doubletdetection (version v3.0) URL https://zenodo.org/record/6349517 (2020).
  • [59].Chidester B, Zhou T, Alam S & Ma J SpiceMix (version v1.0.0). URL https://zenodo.org/record/7256107 (2022).

Supplementary Materials

1854437_RS
1854437_Sup_Table
1854437_Sup_Info

Data Availability Statement

The simulated data generated for this work are available at: https://github.com/ma-compbio/SpiceMix. The spatial transcriptomic and single-cell datasets used in this study were obtained from publicly available repositories.
