An interpretable Bayesian clustering approach with feature selection for analyzing spatially resolved transcriptomics data

Huimin Li; Bencong Zhu; Xi Jiang; Lei Guo; Yang Xie; Lin Xu; Qiwei Li

Biometrics. 2024 Jul 29;80(3):ujae066. doi: 10.1093/biomtc/ujae066

PMCID: PMC11285114 PMID: 39073775

ABSTRACT

Recent breakthroughs in spatially resolved transcriptomics (SRT) technologies have enabled comprehensive molecular characterization at the spot or cellular level while preserving spatial information. Cells are the fundamental building blocks of tissues, organized into distinct yet connected components. Although many non-spatial and spatial clustering approaches have been used to partition the entire region into mutually exclusive spatial domains based on the SRT high-dimensional molecular profile, most require an ad hoc selection of less interpretable dimensional-reduction techniques. To overcome this challenge, we propose a zero-inflated negative binomial mixture model to cluster spots or cells based on their molecular profiles. To increase interpretability, we employ a feature selection mechanism to provide a low-dimensional summary of the SRT molecular profile in terms of discriminating genes that shed light on the clustering result. We further incorporate the SRT geospatial profile via a Markov random field prior. We demonstrate how this joint modeling strategy improves clustering accuracy, compared with alternative state-of-the-art approaches, through simulation studies and 3 real data applications.

Keywords: high-dimensional count data, Markov random field, spatial clustering, spatial transcriptomics, STARmap, zero-inflated negative binomial mixture model

1. INTRODUCTION

Recent advances in spatially resolved transcriptomics (SRT) technologies have enabled the comprehensive molecular and spatial characterization of single cells. Understanding the spatial organization of cells, together with their molecular profiles (eg, mRNA and protein abundances), provides valuable insight into their underlying biological functions, such as embryo development (Satija et al., 2015) and tumor progression (De Bruin et al., 2014). Spatial analysis holds enormous potential for deepening our understanding of biomedicine and has led to the increasingly common application of SRT technologies in various fields, such as cancer research and developmental biology (Marx, 2021), resulting in an explosive generation of SRT data. Current SRT techniques are either imaging-based by way of single-molecule fluorescence in situ hybridization (FISH), such as seqFISH (Lubeck et al., 2014; Eng et al., 2019) and STARmap (Wang et al., 2018), or next-generation sequencing-based (NGS-based) by way of spatial barcoding, such as spatial transcriptomics (ST) (Ståhl et al., 2016) and the improved Visium platform, Slide-seq (Rodriques et al., 2019), Seq-Scope (Cho et al., 2021), and Stereo-seq (Chen et al., 2022). Imaging-based methods can measure hundreds of genes across a large number of individual cells, while NGS-based methods can measure tens of thousands of genes at spatial locations, also known as spots, consisting of multiple cells (eg, Ståhl et al., 2016) or at subcellular resolution (eg, Cho et al., 2021; Chen et al., 2022). In summary, these SRT technologies integrate geospatial and molecular profiles to enable simultaneous measurement and spatial mapping of gene expression over a tissue slide, providing a comprehensive understanding of cellular organization and function.

The emergence of SRT technologies provides new opportunities to investigate research questions. Clustering is an essential step in the analysis of SRT data. It is crucial for downstream analyses, such as cell typing and differential expression analysis, providing insight into underlying biological processes. Non-spatial clustering methods such as $k$-means and the Louvain method (Blondel et al., 2008) may lead to scattered clusters, as the algorithm only takes gene expression as input and ignores spatial information. Recently, several methods have been developed to incorporate available spatial information to account for spatial correlation of gene expression: stLearn (Pham et al., 2020) uses deep learning features extracted from the histology image and the expression of spatially neighboring spots to account for spatial correlation in gene expression; BayesSpace (Zhao et al., 2021) uses a Bayesian approach by employing a Markov random field (MRF) prior, encouraging neighboring spots to belong to the same cluster; SpaGCN (Hu et al., 2021) employs a graph convolutional network approach that combines gene expression, spatial location, and histology to perform clustering; SC-MEB (Yang et al., 2022) conducts spatial clustering with a hidden MRF using an empirical Bayes approach; and DR-SC (Liu et al., 2022) simultaneously performs dimension reduction and clustering via a unified statistically principled method.

Although these spatial clustering methods can perform clustering analysis on SRT data, they all have clear limitations. First, all of these methods must transform counts in the SRT molecular profile to continuous levels for the sake of convenient statistical modeling. However, this extra step does not truly reflect the underlying data generation mechanism and may cause information loss (Sun et al., 2020). Second, SRT data commonly contain a substantial number of zeros (eg, 60%-90%), an issue that may significantly reduce statistical power and is handled by none of the aforementioned methods. Table S2 shows a brief summary of the 3 real datasets analyzed in the paper. Third, to overcome the curse of dimensionality, most of them require linear or non-linear dimension reduction procedures to obtain a low-dimensional representation of data before conducting clustering. Such procedures include principal component analysis (PCA), t-distributed stochastic neighbor embedding (Van der Maaten and Hinton, 2008), and uniform manifold approximation and projection (McInnes et al., 2018). This introduces 2 shortcomings: (1) they require selecting leading components that are less interpretable than the original features, and (2) their results are sub-optimal due to information loss.

Motivated by Li et al. (2017), which proposed a zero-inflated Poisson-Dirichlet process mixture model with feature selection for text analysis, we developed a Bayesian finite mixture model with an MRF prior to simultaneously cluster spots or cells and identify the associated discriminating genes (DGs) for SRT data analysis, named Bayesian clustering approach with feature selection (BayesCafe). BayesCafe directly models the molecular profile of SRT data using a zero-inflated negative binomial (ZINB) mixture model to account for the zero inflation and over-dispersion observed in SRT data and avoids choosing an ad hoc data normalization method. To increase interpretability, BayesCafe utilizes a feature selection approach that offers low-dimensional representations of the SRT data in terms of a list of DGs. This enables any gene-wise downstream analyses. Furthermore, BayesCafe employs an MRF prior to integrate the geospatial profile of SRT data to improve clustering accuracy. We demonstrated the advantages of BayesCafe through a comprehensive simulation study that considers various spatial patterns and zero-inflation settings. In addition, we applied BayesCafe to 2 NGS-based and 1 imaging-based SRT datasets, showing improved clustering accuracy compared to existing methods.

2. MODEL

In this section, we introduce BayesCafe for spatial clustering and feature selection. The overall workflow is depicted in Figure 1. Figure S1 and Table S1 summarize the graphical and hierarchical formulation of the proposed model in the supplementary materials.

The schematic diagram of the proposed BayesCafe model.

We begin by summarizing the observed data as follows: Let an $n$-by-$p$ count matrix $\boldsymbol{Y}$ denote the molecular profile generated by an SRT technique. Each entry $y_{ij}$, $i = 1, \ldots, n$, $j = 1, \ldots, p$, is the read count observed in sample point $i$ (ie, spot or cell) for gene $j$. Let an $n$-by-2 matrix $\boldsymbol{T}$ represent the geospatial profile, where each row $\boldsymbol{t}_i = (t_{i1}, t_{i2})$ gives the coordinates of sample point $i$ in a compact subset of the 2-dimensional Cartesian plane (see examples in Figure 1). In addition, we employ the notations $f(\cdot)$ and $\pi(\cdot)$ to denote the distributions of data (ie, likelihood) and parameters (ie, priors and posteriors), respectively, throughout the paper.

2.1. Modeling the SRT molecular profile via a ZINB model

Previous studies on scRNA-seq data analysis have suggested that accounting for the large proportion of zeros in the model can lead to a substantial improvement in model fitting and accuracy of identifying differentially expressed genes (Finak et al., 2015; Lun et al., 2016); therefore, we start by considering a ZINB model to model the read counts:

$$y_{ij} \mid \pi_i, \mu_{ij}, \phi_j \sim \pi_i\,\mathrm{I}(y_{ij} = 0) + (1 - \pi_i)\,\mathrm{NB}(\mu_{ij}, \phi_j), \tag{1}$$

where we use $\mathrm{I}(\cdot)$ to denote the indicator function and $\mathrm{NB}(\mu, \phi)$, $\mu, \phi > 0$, to denote a negative binomial (NB) distribution with expectation $\mu$ and dispersion $1/\phi$. With this NB parameterization, the probability mass function is written as $\frac{\Gamma(y + \phi)}{y!\,\Gamma(\phi)} \left(\frac{\phi}{\mu + \phi}\right)^{\phi} \left(\frac{\mu}{\mu + \phi}\right)^{y}$, with variance $\mu + \mu^2/\phi$, thus allowing for over-dispersion. A small value of $\phi$ indicates a large variance-to-mean ratio, while a large value approaching infinity reduces the NB model to a Poisson model with the same mean and variance. In this model, we constrain 1 component of the mixture model to degenerate at zero, thereby allowing for zero inflation. The sample-specific parameter $\pi_i \in [0, 1]$ can be treated as the proportion of extra zero (ie, false zero or structural zero) counts at sample point $i$.
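As a rough illustration of the ZINB likelihood described above, the following sketch evaluates the probability mass function in plain Python. The function names and the argument names `pi`, `mu`, and `phi` (zero-inflation proportion, NB mean, and inverse dispersion) are labels chosen for this example and are not taken from the authors' software.

```python
import math

def nb_logpmf(y, mu, phi):
    """Log pmf of an NB with mean mu > 0 and variance mu + mu**2 / phi."""
    return (math.lgamma(y + phi) - math.lgamma(phi) - math.lgamma(y + 1)
            + phi * math.log(phi / (mu + phi))
            + y * math.log(mu / (mu + phi)))

def zinb_pmf(y, pi, mu, phi):
    """pmf of the zero-inflated NB: a point mass at zero with weight pi,
    mixed with an NB(mu, phi) component with weight 1 - pi."""
    point_mass = 1.0 if y == 0 else 0.0
    return pi * point_mass + (1.0 - pi) * math.exp(nb_logpmf(y, mu, phi))
```

For instance, `zinb_pmf(0, 0.3, 5.0, 2.0)` exceeds the plain NB probability of a zero because of the extra point mass, and taking `phi` very large makes the NB component behave like a Poisson distribution with mean `mu`.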

The mean $\mu_{ij}$ in the NB distribution is further decomposed into 2 multiplicative effects, $\mu_{ij} = s_i \lambda_{ij}$: the size factor $s_i$ and the normalized expression level $\lambda_{ij}$. The collection $\boldsymbol{s} = (s_1, \ldots, s_n)$ reflects many nuisance effects across sample points, including but not limited to (1) reverse transcription efficiency; (2) amplification and dilution efficiency; and (3) sequencing depth. After adjusting for the global sample-specific effect, $\lambda_{ij}$ can be regarded as the normalized expression level of gene $j$ observed at sample point $i$. We follow Sun et al. (2020) to set $s_i$ proportional to the summation of the total number of read counts across all genes observed at sample point $i$, combined with a constraint of $\prod_{i=1}^{n} s_i = 1$. This results in $s_i = \sum_{j=1}^{p} y_{ij} \big/ \sqrt[n]{\prod_{i'=1}^{n} \sum_{j=1}^{p} y_{i'j}}$. After accounting for zero inflation (via $\pi_i$), over-dispersion (via $\phi_j$), and sample point heterogeneity (via $s_i$), our modeling approach produces a denoised version of gene expression levels, represented by $\lambda_{ij}$.
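The size factor computation can be sketched as follows. This example assumes that the identifiability constraint is that the size factors multiply to one, so each factor is the spot's total count divided by the geometric mean of all totals; that constraint and the function name `size_factors` are assumptions of this sketch rather than details confirmed by the text above.

```python
import math

def size_factors(counts):
    """Size factors proportional to each sample point's total read count,
    rescaled so that their product equals 1 (assumed constraint).

    counts: list of rows, one row of gene counts per sample point;
    every row must have a positive total.
    """
    totals = [sum(row) for row in counts]
    # Work on the log scale so the geometric mean does not overflow.
    log_geo_mean = sum(math.log(t) for t in totals) / len(totals)
    return [t / math.exp(log_geo_mean) for t in totals]
```

Under this scaling, a spot with three times the sequencing depth of another receives a size factor three times as large, while the overall scale of the normalized expression levels stays fixed.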

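To create an environment conducive to model fitting, we rewrite (1) by introducing a latent indicator variable $\eta_{ij}$:

$$y_{ij} \mid \eta_{ij}, \mu_{ij}, \phi_j \sim \begin{cases} \mathrm{I}(y_{ij} = 0) & \text{if } \eta_{ij} = 1, \\ \mathrm{NB}(\mu_{ij}, \phi_j) & \text{if } \eta_{ij} = 0. \end{cases} \tag{2}$$

We impose an independent Bernoulli prior for $\eta_{ij}$, that is, $\eta_{ij} \sim \mathrm{Bern}(\pi_i)$, which can be further relaxed by formulating a beta hyperprior $\mathrm{Be}(a_\pi, b_\pi)$ on $\pi_i$, leading to a beta-Bernoulli prior of $\eta_{ij}$ with expectation $a_\pi/(a_\pi + b_\pi)$. Setting $a_\pi = b_\pi = 1$ results in a uniform prior on $\pi_i$. We assume a gamma prior for all dispersion parameters $\phi_j$, that is, $\phi_j \sim \mathrm{Ga}(a_\phi, b_\phi)$, and suggest small values of $a_\phi$ and $b_\phi$ for a weakly informative setting.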

2.2. Clustering spots/cells while identifying DGs via a ZINB mixture model

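When dealing with high-dimensional datasets, many features can provide very little information, and the inclusion of unnecessary features can complicate or even mask cluster recovery; therefore, we envision that only some features are relevant to discriminate the $n$ spots or cells into distinct clusters. To identify these DGs, we first introduce a latent binary vector $\boldsymbol{\gamma} = (\gamma_1, \ldots, \gamma_p)$, with $\gamma_j = 1$ if gene $j$ is differentially expressed among clusters (ie, a DG), and $\gamma_j = 0$ otherwise. Conditioning on $\gamma_j$, we can write our model as follows:

$$y_{ij} \mid \gamma_j, \eta_{ij} = 0 \sim \begin{cases} \mathrm{NB}(s_i \lambda_{jk}, \phi_j) & \text{if } \gamma_j = 1 \text{ and sample point } i \text{ belongs to cluster } k, \\ \mathrm{NB}(s_i \lambda_{j0}, \phi_j) & \text{if } \gamma_j = 0. \end{cases} \tag{3}$$

If $\gamma_j = 0$, we assume a common level $\lambda_{j0}$ irrespective of the cluster to which sample point $i$ belongs, for $i = 1, \ldots, n$ and $j = 1, \ldots, p$. We utilize the common independent Bernoulli prior for $\gamma_j$ with a hyperparameter $\omega$, that is, $\gamma_j \sim \mathrm{Bern}(\omega)$, which is equivalent to a binomial prior on the number of DGs, that is, $\sum_{j=1}^{p} \gamma_j \sim \mathrm{Bin}(p, \omega)$. The hyperparameter $\omega$ can be elicited as the proportion of genes expected a priori to be in the DG set. This prior assumption can be further relaxed by formulating a beta hyperprior $\mathrm{Be}(a_\omega, b_\omega)$ on $\omega$, which produces a beta-binomial prior on the number of DGs with expectation $p\,a_\omega/(a_\omega + b_\omega)$. We set a vague prior on $\omega$ by constraining its hyperparameters (Tadesse et al., 2005).

For the task of cluster assignment, an auxiliary set of cluster allocation variables $\boldsymbol{z} = (z_1, \ldots, z_n)$ is introduced, where $z_i = k$ if the $i$th sample point belongs to cluster $k$ for $k = 1, \ldots, K$, where $K$ is the number of clusters. Here, $\lambda_{jk}$ is the normalized expression level of gene $j$ with $\gamma_j = 1$ observed in cluster $k$, while for a non-discriminating gene (non-DG) (ie, $\gamma_j = 0$), we assume a uniform normalized expression level $\lambda_{j0}$ on all clusters for gene $j$. We assume a gamma prior for $\lambda_{jk}$ and $\lambda_{j0}$, that is, $\lambda_{jk} \sim \mathrm{Ga}(a_\lambda, b_\lambda)$ and $\lambda_{j0} \sim \mathrm{Ga}(a_0, b_0)$, and recommend small hyperparameter values for a weakly informative setting.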

2.3. Integrating the SRT geospatial profile via an MRF prior model

To efficiently incorporate available spatial information, we impose an MRF prior on $\boldsymbol{z}$ to encourage neighboring samples to be clustered into the same group:

$$\pi(z_i = k \mid \boldsymbol{z}_{-i}) = \frac{\exp\left\{e_k + f \sum_{i' \neq i} g_{ii'}\,\mathrm{I}(z_{i'} = k)\right\}}{\sum_{k'=1}^{K} \exp\left\{e_{k'} + f \sum_{i' \neq i} g_{ii'}\,\mathrm{I}(z_{i'} = k')\right\}}, \tag{4}$$

where $\boldsymbol{e} = (e_1, \ldots, e_K)$ and $f$ are hyperparameters to be chosen and $\boldsymbol{z}_{-i}$ denotes the vector of $\boldsymbol{z}$, excluding the $i$th element. Here, $\boldsymbol{e}$ controls the abundance of each cluster, and $f$ controls the strength of spatial dependence. We can also write the joint MRF prior on $\boldsymbol{z}$ by

$$\pi(\boldsymbol{z}) \propto \exp\left\{\sum_{k=1}^{K} e_k \sum_{i=1}^{n} \mathrm{I}(z_i = k) + f \sum_{i < i'} g_{ii'}\,\mathrm{I}(z_i = z_{i'})\right\}. \tag{5}$$

As a result, neighboring samples are more likely to be assigned to the same cluster. The adjacency matrix $\boldsymbol{G}$, an $n \times n$ symmetric matrix, is constructed based on the geospatial profile to define the neighborhood structure, with $g_{ii'} = 1$ if samples $i$ and $i'$ are neighbors, and $g_{ii'} = 0$ otherwise. For ST and 10x Visium platforms, $\boldsymbol{G}$ is created based on their square and triangular lattices, respectively. For other SRT technologies, $\boldsymbol{G}$ is constructed using a Voronoi diagram (Okabe et al., 2009). Note that if a sample point does not have any neighbors, its prior distribution reduces to a multinomial prior with parameter $\boldsymbol{q} = (q_1, \ldots, q_K)$, where $q_k = \exp(e_k)/\sum_{k'=1}^{K} \exp(e_{k'})$ is a multinomial logistic transformation of $\boldsymbol{e}$. Although the parameterization is somewhat arbitrary, a careful selection of $f$ is crucial. In particular, a large value of $f$ may lead to a phase transition problem (ie, all sample points are assigned to the same cluster). This problem arises because Equation 4 can only increase as a function of the number of indicators $\mathrm{I}(z_{i'} = k)$ equal to 1. In this paper, we fixed $\boldsymbol{e}$ following Jiang et al. (2023) and chose the value of $f$ that produced the best results (Figure S8) in the sensitivity analysis (see Web Appendix D) as described in the supplementary materials. This setting performed very well in the simulation study and generated good results in the real data analysis.
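To make the role of the MRF prior more concrete, here is a minimal sketch of the conditional prior probability of a cluster label given the labels of neighboring sample points, together with a square-lattice neighborhood of the kind used for ST data. The names `e` (cluster abundance terms) and `f` (spatial dependence strength), the helper names, and the 0-based cluster labels are conventions assumed for this example only.

```python
import math

def lattice_neighbors(coords):
    """Adjacency lists for integer grid coordinates: two spots are neighbors
    when they sit one step apart horizontally or vertically."""
    index = {xy: i for i, xy in enumerate(coords)}
    neighbors = []
    for (x, y) in coords:
        candidates = [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]
        neighbors.append([index[c] for c in candidates if c in index])
    return neighbors

def mrf_conditional(i, z, neighbors, e, f):
    """Prior probabilities that sample point i falls in each cluster
    k = 0..K-1, given the current labels z of all other points."""
    K = len(e)
    same_label_counts = [0] * K
    for j in neighbors[i]:
        same_label_counts[z[j]] += 1
    logits = [e[k] + f * same_label_counts[k] for k in range(K)]
    m = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(v - m) for v in logits]
    total = sum(weights)
    return [w / total for w in weights]
```

With no neighbors the probabilities reduce to a softmax of `e`, and increasing `f` pulls a point more strongly toward the majority label among its neighbors, which is the mechanism behind the phase transition concern when `f` is too large.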

3. MODEL FITTING

3.1. MCMC algorithm

We use a Markov chain Monte Carlo (MCMC) algorithm to update all parameters in our model. Our model allows simultaneously partitioning sample points into distinct clusters via $\boldsymbol{z}$ and identifying DGs via $\boldsymbol{\gamma}$. We update the cluster allocation parameters $\boldsymbol{z}$ and the false zero indicators $\eta_{ij}$ using a Gibbs sampler. We jointly update the DG indicators $\boldsymbol{\gamma}$ and the normalized expression levels $\lambda_{jk}$ and $\lambda_{j0}$ via an add-delete algorithm, and update the remaining parameters using the random walk Metropolis-Hastings algorithm. We note that this algorithm ensures ergodicity for our model. For more details, please refer to the supplementary materials.
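As an illustration of one Gibbs step, the sketch below draws a false zero indicator for a single count conditional on the zero-inflation proportion and the NB parameters. It follows directly from the two-component form of the ZINB model (a nonzero count cannot be a false zero; a zero count is a false zero with probability proportional to the point-mass weight). The exact update used by the authors is given in their supplementary materials and may differ, for example by integrating out the zero-inflation proportion, so this is only a sketch with hypothetical function names.

```python
import random

def nb_prob_zero(mu, phi):
    """Probability that an NB with mean mu and inverse dispersion phi is 0."""
    return (phi / (mu + phi)) ** phi

def prob_false_zero(y, pi, mu, phi):
    """Conditional probability that the observed count y is a false zero."""
    if y > 0:
        return 0.0  # a positive count must come from the NB component
    return pi / (pi + (1.0 - pi) * nb_prob_zero(mu, phi))

def gibbs_update_eta(y, pi, mu, phi, rng=random):
    """Draw the false zero indicator (1 = false zero) from its conditional."""
    return 1 if rng.random() < prob_false_zero(y, pi, mu, phi) else 0
```

A zero observed for a gene with a large NB mean is far more likely to be flagged as a false zero than a zero for a lowly expressed gene, since the NB component itself rarely produces zeros in the former case.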

3.2. Posterior inference

Our primary goal is to identify DGs via the vector $\boldsymbol{\gamma}$ and cluster samples via the vector $\boldsymbol{z}$. We obtain posterior inference on these parameters by postprocessing the MCMC samples after burn-in. A common way to summarize the posterior distribution of $\boldsymbol{z}$ is by utilizing the maximum a posteriori (MAP) estimate, which corresponds to the configuration with the highest conditional posterior probability among those drawn by the MCMC sampler. Specifically, we define $\hat{\boldsymbol{z}}^{\mathrm{MAP}} = \boldsymbol{z}^{(u^*)}$ with $u^* = \arg\max_{1 \le u \le U} \pi(\boldsymbol{z}^{(u)} \mid \cdot)$. In addition, we can obtain a summary of $\boldsymbol{z}$ based on the pairwise probability matrix (PPM). The PPM is an $n$-by-$n$ symmetric matrix whose elements are the posterior pairwise probabilities of co-clustering, that is, the probability that sample $i$ and sample $i'$ are assigned to the same cluster: $c_{ii'} = \Pr(z_i = z_{i'} \mid \cdot) \approx \frac{1}{U} \sum_{u=1}^{U} \mathrm{I}(z_i^{(u)} = z_{i'}^{(u)})$ (Dahl, 2006), where $u = 1, \ldots, U$ indexes the MCMC iterations after burn-in. Then, the point estimate of the clustering, $\hat{\boldsymbol{z}}^{\mathrm{PPM}}$, can be obtained by minimizing the sum of squared deviations of its association matrix from the PPM:

$$\hat{\boldsymbol{z}}^{\mathrm{PPM}} = \arg\min_{\boldsymbol{z}} \sum_{i < i'} \left\{\mathrm{I}(z_i = z_{i'}) - c_{ii'}\right\}^2.$$

The PPM estimate has the advantage of utilizing information from all clusterings through the PPM. It is also intuitively appealing because it selects the “average” clustering rather than forming a clustering via an external, ad hoc clustering algorithm.
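The PPM-based summary can be sketched in a few lines. The example below estimates the co-clustering probabilities from a list of sampled label vectors and then, as one common simplification in the spirit of Dahl (2006), restricts the search to the sampled clusterings and returns the one closest to the PPM in squared error. Restricting the search to the sampled draws is an assumption of this sketch, and the function names are illustrative.

```python
def pairwise_probability_matrix(samples):
    """Fraction of MCMC draws in which each pair of points shares a label."""
    U, n = len(samples), len(samples[0])
    ppm = [[0.0] * n for _ in range(n)]
    for z in samples:
        for i in range(n):
            for j in range(n):
                if z[i] == z[j]:
                    ppm[i][j] += 1.0 / U
    return ppm

def ppm_estimate(samples):
    """Sampled clustering whose association matrix is closest to the PPM."""
    ppm = pairwise_probability_matrix(samples)
    n = len(ppm)

    def loss(z):
        return sum(((1.0 if z[i] == z[j] else 0.0) - ppm[i][j]) ** 2
                   for i in range(n) for j in range(i + 1, n))

    return min(samples, key=loss)
```

Because the PPM depends only on whether two points share a label, draws that differ by a permutation of the cluster labels contribute identically, so this summary is not affected by label switching.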

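For feature selection, although we can compute the MAP estimate of $\boldsymbol{\gamma}$ as $\hat{\boldsymbol{\gamma}}^{\mathrm{MAP}}$, a more comprehensive way to summarize $\boldsymbol{\gamma}$ is based on their marginal posterior probabilities of inclusion (PPI), where $\mathrm{PPI}_j = \Pr(\gamma_j = 1 \mid \cdot) \approx \frac{1}{U} \sum_{u=1}^{U} \gamma_j^{(u)}$. Then, the discriminating features are identified if their PPI values exceed a given threshold $c$:

$$\hat{\gamma}_j^{\mathrm{PPI}} = \mathrm{I}(\mathrm{PPI}_j \ge c).$$

We can use the threshold $c = 0.5$, which is commonly referred to as the median model. Alternatively, we can determine the threshold that controls for multiplicity (Newton et al., 2004), which ensures that the expected Bayesian false discovery rate (BFDR) is less than a specified value. The BFDR is computed as follows:

$$\mathrm{BFDR}(c) = \frac{\sum_{j=1}^{p} (1 - \mathrm{PPI}_j)\,\mathrm{I}(\mathrm{PPI}_j \ge c)}{\sum_{j=1}^{p} \mathrm{I}(\mathrm{PPI}_j \ge c)},$$

where BFDR($c$) is the desired significance level. Unless otherwise noted, we use the PPM estimate for $\boldsymbol{z}$ and the PPI estimate for $\boldsymbol{\gamma}$ by default in our study.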
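The PPI and BFDR calculations can be illustrated with a short sketch. It computes marginal inclusion probabilities from sampled indicator vectors, evaluates the Bayesian false discovery rate for a given threshold, and searches the observed PPI values for the smallest threshold whose BFDR does not exceed a target level. The function names, the use of a non-strict inequality when selecting genes, and the fallback value of 1.0 are choices made for this example.

```python
def ppi(gamma_samples):
    """Marginal posterior inclusion probability of each gene, estimated as
    the fraction of MCMC draws in which its indicator equals 1."""
    U = len(gamma_samples)
    p = len(gamma_samples[0])
    return [sum(g[j] for g in gamma_samples) / U for j in range(p)]

def bfdr(ppis, c):
    """Average posterior probability of a false discovery among the genes
    selected at threshold c (defined as 0 when nothing is selected)."""
    selected = [q for q in ppis if q >= c]
    if not selected:
        return 0.0
    return sum(1.0 - q for q in selected) / len(selected)

def threshold_for_bfdr(ppis, alpha):
    """Smallest observed PPI value whose BFDR is at most alpha."""
    for c in sorted(set(ppis)):
        if bfdr(ppis, c) <= alpha:
            return c
    return 1.0  # no observed threshold meets the target
```

Since raising the threshold always removes the least certain genes first, the BFDR is non-increasing in the threshold, which is why a simple scan from the smallest PPI upward finds the most inclusive set that meets the target.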

3.3. Model selection

The number of clusters $K$ can be determined by prior biological knowledge when available or otherwise by the elbow plot of the modified Bayesian Information Criterion (mBIC) (Wang et al., 2007). Since the number of parameters is associated with feature selection results, we give the mBIC as follows:

(6)

where the likelihood term is the full data likelihood (see Web Appendix A in the supplementary materials) evaluated at the estimators of the corresponding parameters, and the penalty term involves the estimated number of DGs, through which it reflects the number of estimated parameters in the model.

4. SIMULATION STUDY

In this section, we briefly summarize the simulation study. A detailed description is available in Web Appendix C in the supplementary materials. We followed the data generative schemes described in Li et al. (2021) and Jiang et al. (2022), based on 2 real spatial patterns constructed from a mouse olfactory bulb (MOB) study and a human breast cancer (BC) study, respectively, which are depicted in Figure 2A. The 2 patterns differ in size (the BC pattern contains 250 spots), and only a subset of the simulated genes were DGs. We generated count data from a ZINB mixture model and incorporated spatial variation through a gene-specific zero-mean stationary Gaussian process (GP). The data generative model is detailed in Web Appendix C.1 in the supplementary materials, while the prior choice and MCMC algorithm settings are presented in Web Appendix C.2. We reported the scalability of BayesCafe in Web Appendix E and Figure S9 in the supplementary materials.

The simulation study. (A) The 2 spatial patterns used to generate the simulated data, which were constructed from the mouse olfactory bulb (MOB) and human breast cancer (BC) study, respectively. (B) The boxplots of adjusted Rand indices (ARIs) achieved by BayesCafe, BayesCafe (NB), BayesSpace, Louvain, SpaGCN, and stLearn under different scenarios in terms of spatial pattern and sparsity setting. (C) The boxplots of area under curves (AUCs) achieved by BayesCafe, BayesCafe (NB), ZINB-WaVE DESeq2, ZINB-WaVE edgeR, DESeq2, edgeR, SPARK, and SpatialDE under different scenarios in terms of spatial pattern and sparsity setting.

To compare clustering performance, we used the adjusted Rand index (ARI) (Hubert and Arabie, 1985), a corrected-for-chance version of the Rand index (Rand, 1971). The Rand index is used to measure the similarity between 2 different partitions and ranges between 0 and 1. Larger ARI values indicate better clustering results, and a value of 1 indicates a perfect match between 2 partitions. To assess the performance of identifying the DGs via the binary vector $\boldsymbol{\gamma}$, we utilized the area under the curve (AUC) of the receiver operating characteristic, a widely used metric in the evaluation of binary classifiers. AUC considers both the true positive rate and false positive rate at various threshold settings and ranges from 0 to 1; the higher the value, the more accurate the results.
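For readers who want to see how the ARI is computed, a minimal pure-Python sketch based on the pair-counting formula of Hubert and Arabie (1985) is given below. The function name is illustrative; established implementations exist in common statistical libraries and would normally be preferred in practice.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two partitions of the same items."""
    n = len(labels_a)
    # Pairs of items that are together in both partitions.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = together_a * together_b / comb(n, 2)
    maximum = (together_a + together_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both partitions have one cluster
    return (together_both - expected) / (maximum - expected)
```

The index depends only on which items are grouped together, so relabeling the clusters leaves it unchanged. Unlike the raw Rand index, the adjusted version can fall slightly below 0 when two partitions agree less than expected by chance.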

Figure 2B displays the boxplots of ARIs achieved by different methods over 30 replicates under the 6 scenarios. The competing methods included SpaGCN, BayesSpace, stLearn, and Louvain. To illustrate the necessity of employing a zero-inflated model, we compared the performance of our model by substituting the ZINB model with an NB model. This alternative configuration of our model has been named BayesCafe (NB). As shown in Figure 2B, under the low zero-inflation setting, BayesCafe and BayesCafe (NB) had better performances, followed by BayesSpace and Louvain. When the proportion of false zeros increased, BayesCafe clearly outperformed other methods and maintained nearly unaffected performance, implying that realistic modeling (ie, accounting for zero inflation) delivered an advantage over other methods. In contrast, all other methods suffered from decreased power, suggesting that the variance caused by an excess of zero counts was not properly addressed.

To evaluate the performance of identifying DGs, we compared BayesCafe with alternative methods. For these alternatives, we used the clustering results from BayesSpace as the input cluster labels, since BayesSpace produced better results than the other competing clustering methods. The competitor pool included DESeq2 (Love et al., 2014), edgeR (Robinson et al., 2010), ZINB-WaVE DESeq2 (Risso et al., 2018), ZINB-WaVE edgeR (Risso et al., 2018), SPARK (Sun et al., 2020), and SpatialDE (Svensson et al., 2018). The first 4 methods are widely used to identify differentially expressed genes in single-cell RNA-sequencing (scRNA-seq) data. edgeR uses an exact binomial test generalized for over-dispersed counts, while DESeq2 employs a Wald test by adopting a generalized linear model based on an NB kernel. ZINB-WaVE edgeR and ZINB-WaVE DESeq2 are the modified versions of edgeR and DESeq2, respectively, using a ZINB-based Wanted Variation Extraction (ZINB-WaVE) strategy to downweight the inflated number of zeros in scRNA-seq data. SPARK and SpatialDE, both based on GP models, are used for the detection of spatially variable genes (SVGs) in SRT data.

Figure 2C displays the boxplots of AUCs achieved by different methods under different scenarios in terms of spatial pattern and sparsity settings. We observed that all methods, with the exception of SPARK and SpatialDE, performed well under low and medium zero-inflation settings. Under the high zero-inflation setting, BayesCafe consistently maintained good performance, while BayesCafe (NB) and other methods experienced decreased power. It is worth noting that SPARK, which models count data directly using a Poisson model, outperformed SpatialDE but obtained much lower AUC values than BayesCafe. All findings suggest BayesCafe’s advantages of modeling count data directly and properly handling the zero-inflation problem through the ZINB model. As shown in Figure S2C, the DGs identified by BayesCafe explain as much of the variance in the PCA as the true DGs, indicating high identification accuracy.

In all, the simulation study clearly demonstrated that the joint modeling of spatial cluster structure and the associated DGs via BayesCafe can boost the performance of both tasks.

5. REAL DATA ANALYSIS

BayesCafe was applied to 3 real datasets, and the outcomes are documented below. For an overview, Table S4 in the supplementary materials outlines the clustering accuracy and computational efficiency. Additionally, we have complementarily employed the normalized mutual information (NMI) (Strehl and Ghosh, 2002), a well-established metric in machine learning and data mining (Knops et al., 2006; Do et al., 2021; Molaei et al., 2021). This approach addresses the potential sensitivities of ARI to factors such as cluster size and shape.
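A minimal sketch of the NMI is shown below for reference. It uses the normalization by the geometric mean of the two partition entropies, as in Strehl and Ghosh (2002); other normalizations are also in common use, and which variant underlies Table S4 is not stated here, so the choice is an assumption of this example.

```python
from collections import Counter
from math import log, sqrt

def normalized_mutual_information(labels_a, labels_b):
    """Mutual information between two partitions, divided by the geometric
    mean of their entropies."""
    n = len(labels_a)
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    count_ab = Counter(zip(labels_a, labels_b))
    entropy_a = -sum(c / n * log(c / n) for c in count_a.values())
    entropy_b = -sum(c / n * log(c / n) for c in count_b.values())
    if entropy_a == 0.0 or entropy_b == 0.0:
        # A single-cluster partition carries no information.
        return 1.0 if entropy_a == entropy_b else 0.0
    mutual_info = sum(c / n * log(c * n / (count_a[a] * count_b[b]))
                      for (a, b), c in count_ab.items())
    return mutual_info / sqrt(entropy_a * entropy_b)
```

Like the ARI, the NMI is invariant to relabeling of clusters, but it is built from entropies rather than pair counts, which is why the two metrics can rank methods differently when cluster sizes are very unbalanced.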

5.1. Application to the mouse olfactory bulb ST data

To further evaluate BayesCafe’s performance, we first examined a publicly available ST dataset from an MOB study (Ståhl et al., 2016). We used replicate 12, which contains 16 034 genes measured on 282 spots. The MOB data includes 4 main anatomic layers (ie, clusters) organized in an inside-out fashion, annotated by CARD (Ma and Zhou, 2022) based on histology (see Figure 3A): the granule cell layer (GCL), the mitral cell layer (MCL), the glomerular layer (GL), and the olfactory nerve layer (ONL). We filtered out spots with fewer than 100 total counts across all genes (Ma and Zhou, 2022) and genes whose proportion of zero read counts across all spots exceeded a preset threshold (Li et al., 2021). This quality control procedure led to a final set of spots and 9904 genes. We then selected the top highly variable genes (HVGs) as input for our model. Identifying HVGs is a common preprocessing step in clustering analysis because it can help prioritize biologically relevant genes, reduce noise, and increase computational efficiency (Risso et al., 2018; Zhao et al., 2021; Zhang et al., 2023).

The mouse olfactory bulb ST data analysis. (A) The hematoxylin and eosin (H&E)-stained image of the tissue section with manual annotation, and clusters detected by BayesCafe, BayesCafe (NB), SpaGCN, BayesSpace, stLearn, and Louvain. (B) The heatmap of DGs across different clusters detected by BayesCafe.

We ran 4 MCMC chains and evaluated convergence based on the PPI vector of $\boldsymbol{\gamma}$. The pairwise Pearson correlation coefficients of the PPIs across the 4 chains ranged from 0.95 to 0.97, which, along with the trace plot of the number of DGs in Figure S3A, indicated good MCMC convergence. We then aggregated the outputs of all 4 chains. We compared the clustering result of BayesCafe with BayesCafe (NB), SpaGCN, BayesSpace, stLearn, and Louvain, using default settings, and set the number of clusters to 4 ($K = 4$) for all methods, even though the mBIC plot depicted in Figure S7A indicates a close alternative of $K = 3$. As Figure 3A shows, BayesCafe achieved the highest ARI (0.582), with SpaGCN (ARI = 0.578), BayesCafe (NB) (ARI = 0.572), and BayesSpace (ARI = 0.572) close behind, while stLearn (ARI = 0.530) and Louvain (ARI = 0.535) generated inferior results. All methods were able to distinguish the GCL and ONL layers well, but compared to BayesCafe and BayesCafe (NB), the other methods created a blurrier boundary between MCL and GL, resulting in compromised performance.

We also performed model validation between BayesCafe and BayesCafe (NB) on all 3 real datasets, as reported in Web Appendix B of the supplementary materials. The results (Table S3) support that the ZINB model is more appropriate for modeling these SRT data.

Next, we examined the DGs detected by BayesCafe. Figure S3B shows the estimated marginal PPIs, $\Pr(\gamma_j = 1 \mid \cdot)$, of each gene after burn-in. A threshold of 0.5 on the marginal probabilities results in the median model, and the genes it includes were taken as the DGs. We examined the differential expression of DGs among clusters using the Heatmap() function from the ComplexHeatmap R package. The heatmap in Figure 3B shows 3 distinct gene groups, each representing major expression patterns observed in the data. DGs in group 1 showed enriched expression patterns in cluster 1, while those in group 2 were highly expressed in clusters 2 and 3. DGs in group 3 had a high expression level in cluster 4. The similar expression patterns between the MCL and GL may explain why BayesCafe was unable to distinguish these 2 layers very well, and why the number of clusters $K$ was selected as 3 by the mBIC (Figure S7A). This analysis supports the hypotheses that DGs identified by BayesCafe are differentially expressed among clusters, and that BayesCafe can lead to biologically meaningful clusters.

To further demonstrate that BayesCafe-defined DGs align well with known biological knowledge, we compared DGs detected by BayesCafe and BayesCafe (NB), and SVGs detected by SPARK and SpatialDE, with the known olfactory bulb-specific gene set defined in the Harmonizome database (PMID: 27374120). We found that the DGs detected by BayesCafe showed higher overlap with the known olfactory bulb-specific gene set than the ones defined by SPARK or SpatialDE (see Figure S3D), indicating that BayesCafe is able to identify biologically meaningful gene sets when analyzing ST datasets. We also conducted additional analyses to demonstrate that these identified DGs retain more significant data features and could uncover the underlying biological processes or functions. For more details, please see Web Appendix F.1 in the supplementary materials.

5.2. Application to the human breast cancer 10x Visium data

The second analyzed NGS-based SRT dataset was collected from a study on human breast cancer, consisting of 2518 spots and 17 943 genes. The gene expression was measured on a section of human breast tissue with invasive ductal carcinoma using the 10x Visium platform, along with manual annotation that can be used to evaluate clustering performance. This dataset contains 5 annotated tissue regions: tumor (invasive carcinoma), fibrous tissue, immune cells, necrosis, and fat as shown in Figure 4A. The mBIC plot in Figure S7B also suggests $K = 5$. We used the same analysis procedures as described in Section 5.1, with spots and HVGs among the 10 910 genes remaining after the quality control procedure. The pairwise Pearson correlation coefficients of PPIs ranged from 0.86 to 0.88 for 4 MCMC chains, along with the trace plot of the number of DGs in Figure S4A, suggesting good convergence.

The human breast cancer 10x Visium data analysis. (A) The hematoxylin and eosin (H&E)-stained image of the tissue section with manual annotation, and clusters detected by BayesCafe, BayesCafe (NB), SpaGCN, BayesSpace, stLearn, and Louvain. (B) The heatmap of DGs across different clusters detected by BayesCafe.

First, we compared the clusters detected by BayesCafe, BayesCafe (NB), and competing methods. We observed that BayesCafe achieved the highest consistency with manual annotation, with an ARI of 0.558, followed by SpaGCN (ARI = 0.528), stLearn (ARI = 0.526), and BayesCafe (NB) (ARI = 0.493), while BayesSpace surprisingly clustered the tumor region into 2 groups (Figure 4A). The integration of the image profile might have helped SpaGCN and stLearn achieve satisfactory clustering performance. It is worth noting that none of the methods could clearly separate the fat region from the fibrous tissue. This limitation could stem from the minimal cellular presence within fat tissue, resulting in a constrained pool of gene expression data for analysis. In conclusion, the superior performance of BayesCafe underscores the advantages of integrating both molecular and spatial profiles in the clustering analysis of SRT data, along with the inclusion of the feature selection procedure and consideration of zero inflation.

Next, we examined the biological significance of the DGs identified by BayesCafe. As shown in Figure 4B, the identified DGs showed distinct differential expression patterns among the clusters detected by BayesCafe. Moreover, 2 ERBB family genes, ERBB2 and ERBB3, were detected as DGs and displayed high expression levels in cluster 1; both are well-known oncogenes in human breast cancer and play important roles in cell signaling and cancer development (Revillion et al., 1998; Baselga and Swain, 2009). To further validate the biological significance of the clusters identified by BayesCafe, we compared known breast cancer genes from the Catalogue of Somatic Mutations in Cancer database (PMID: 15188009) with DGs defined by BayesCafe and BayesCafe (NB) and SVGs defined by SPARK and SpatialDE. Our findings revealed that the DGs detected by BayesCafe were more consistent with biological knowledge than the SVGs found by SPARK or SpatialDE, as evidenced by a notably higher enrichment of known breast cancer genes, as illustrated in Figure S4D. This outcome underscores the robustness of BayesCafe in identifying DGs that align closely with established biological knowledge pertaining to breast cancer. Gene ontology (GO) enrichment analysis, as described in Web Appendix F.2, reveals the biological functions associated with the identified DGs (see Figure S6 and Table S6). In summary, these findings confirm that BayesCafe-identified DGs are consistent with established biological knowledge and display a distinct expression pattern within the clusters.

5.3. Application to the mouse visual cortex STARmap data

To illustrate that BayesCafe is also able to analyze data from imaging-based SRT technologies, we applied BayesCafe to a mouse visual cortex STARmap dataset at single-cell resolution (Wang et al., 2018). The STARmap dataset measured 1020 genes among 1207 cells, corresponding to 7 layers (Figure 5A). The mBIC plot in Figure S7C also recommends a number of clusters consistent with these layers. We used the same analysis procedures as described in Section 5.1, applied to the cells and genes remaining after our quality control procedure. The pairwise Pearson correlation coefficients of PPIs ranged from 0.67 to 0.73 for 4 MCMC chains, which, along with the trace plot of the number of DGs in Figure S5A, suggests reasonable convergence. The neighboring structure was defined with a Voronoi diagram, where 2 samples sharing the same edge were defined as neighbors. Figure 5A shows that BayesCafe outperformed all other methods by achieving the highest ARI of 0.523 and NMI of 0.523 (Table S4), while BayesCafe (NB) had compromised performance, with ARI and NMI of 0.438 and 0.459, respectively. Additionally, BayesCafe displayed clearer boundaries between layers in comparison to the other methods. However, BayesCafe faced a challenge in distinguishing between L5 and L6. By contrast, the ARIs of other methods were much lower (0.438 for SpaGCN, 0.190 for stLearn, and 0.295 for Louvain). This dataset demonstrates that BayesCafe is able to incorporate spatial information more efficiently than SpaGCN and stLearn, and that taking account of zero inflation in SRT data could improve clustering accuracy.
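The convergence check based on PPIs can be sketched as follows: for each chain, the PPI of a gene is taken as the average of its binary inclusion indicator over the retained MCMC iterations, and the PPI vectors of every pair of chains are then compared with the Pearson correlation. This Python sketch assumes that definition of the PPI and uses hypothetical toy samples; it is not the authors' R/C++ implementation.

```python
from itertools import combinations
from math import sqrt


def ppi(indicator_samples):
    """Per-gene average of 0/1 inclusion indicators over MCMC iterations."""
    n_iter = len(indicator_samples)
    n_genes = len(indicator_samples[0])
    return [sum(s[j] for s in indicator_samples) / n_iter for j in range(n_genes)]


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return sxy / sqrt(sxx * syy)


def pairwise_ppi_correlations(chains):
    """Map each pair of chain indices to the correlation of their PPIs."""
    ppis = [ppi(chain) for chain in chains]
    return {
        (a, b): pearson(ppis[a], ppis[b])
        for a, b in combinations(range(len(ppis)), 2)
    }


# Hypothetical toy input: 3 chains, 2 retained iterations, 3 genes each.
toy_chains = [
    [[1, 0, 1], [1, 0, 0]],  # PPIs: 1.0, 0.0, 0.5
    [[1, 0, 0], [1, 0, 1]],  # PPIs: 1.0, 0.0, 0.5
    [[0, 1, 1], [0, 1, 0]],  # PPIs: 0.0, 1.0, 0.5
]
correlations = pairwise_ppi_correlations(toy_chains)
```

With 4 chains, this returns 6 pairwise correlations; values that are consistently high across pairs indicate that independent chains select similar genes.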

The mouse visual cortex STARmap data analysis. (A) Layer structure of the tissue section from the original study, and clusters detected by BayesCafe, BayesCafe (NB), SpaGCN, stLearn, and Louvain. (B) The heatmap of DGs across different clusters detected by BayesCafe.

BayesCafe detected DGs among the 886 genes using a threshold of 0.5 (Figure S5B). As Figure 5B shows, the DGs were divided into 3 groups of 26, 28, and 26 genes, respectively (80 genes in total). DGs in group 1 showed high expression levels in clusters 2 and 3, while those in groups 2 and 3 were highly expressed in clusters 5 and 6, respectively. We also note that gene expression patterns are similar between L5 and L6, which could explain why cells in layers L5 and L6 were clustered together by BayesCafe. In addition, PCA shows that the identified DGs preserve more significant data features, while GO enrichment analysis supports their ability to uncover underlying biological processes (see Web Appendix F.3).
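The thresholding step can be illustrated with a few lines of Python. The sketch assumes that the threshold of 0.5 is applied to each gene's PPI and that the boundary value is included; both points, as well as the gene names, are assumptions made for illustration.

```python
def select_dgs(ppi_by_gene, threshold=0.5):
    """Return the sorted names of genes whose PPI reaches the threshold."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must lie in [0, 1]")
    # Inclusive comparison is an assumption; a strict rule would use `>`.
    return sorted(gene for gene, p in ppi_by_gene.items() if p >= threshold)


# Hypothetical PPIs for three genes.
toy_ppi = {"geneA": 0.92, "geneB": 0.10, "geneC": 0.50}
selected = select_dgs(toy_ppi)
```

Raising the threshold yields a smaller, more conservative set of DGs, while lowering it admits genes with weaker posterior support.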

To further validate the biological significance of the clusters defined by BayesCafe, we extracted known visual cortex gene sets from the Allen Mouse Brain Atlas (PMID: 23193282) and compared them with the genes defined by BayesCafe, BayesCafe (NB), SPARK, and SpatialDE in the visual cortex samples based on STARmap technology. We found that the set of DGs detected by BayesCafe was more biologically meaningful, as evidenced by the higher enrichment of known visual cortex genes (Figure S5D), confirming that BayesCafe-identified DGs are more consistent with known biological knowledge than the SVGs defined by SPARK and SpatialDE. Overall, these discoveries highlight the meaningful biological interpretations that can be inferred from the DGs identified by BayesCafe.

6. DISCUSSION

In this paper, we developed BayesCafe, a novel Bayesian ZINB mixture model that accounts for the spatial correlation of SRT data by employing an MRF prior for clustering analysis and uses a feature selection approach to detect DGs. Compared to existing methods, BayesCafe offers several advantages. First, it directly models count data with an NB distribution, which accounts for over-dispersion better than a Poisson distribution and thus provides a more accurate representation of the data. Second, it properly addresses the excess zero counts commonly observed in SRT data by incorporating a ZINB model, resulting in more robust performance. While discussions continue about the need for zero-inflation components in modeling scRNA-seq or SRT data (Silverman et al., 2020; Svensson, 2020; Zhao et al., 2022), the posterior predictive model validation detailed in Web Appendix B and summarized in Table S3, along with the real data analysis results in Table S5, indicates that the majority of zeros in the count matrix are excessive, underscoring the necessity of adopting a ZINB model to accurately analyze and interpret SRT data. Third, it utilizes a feature selection mechanism that not only generates low-dimensional summaries of the SRT data but also identifies the most discriminating genes, thus improving model performance and interpretability. Fourth, it efficiently incorporates spatial information and obtains a more robust and accurate clustering result via the MRF prior model. Finally, it improves parameter estimation and uncertainty quantification using a Bayesian approach, leading to more reliable and quantifiable inference. In our simulation study, BayesCafe outperformed all other clustering methods, especially when a high number of zeros were present in the data. Furthermore, BayesCafe was able to detect the DGs with higher and more stable accuracy. In our real data analysis, BayesCafe demonstrated higher accuracy in clustering analysis by incorporating spatial information and employing the feature selection procedure. In addition, DGs identified by BayesCafe were differentially expressed and associated with biological functions.
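The role of the zero-inflation component can be seen by comparing the probability of a zero count under an NB and a ZINB distribution. The Python sketch below uses a generic mean-dispersion parameterization (mean `mu`, dispersion `phi`, variance `mu + mu**2 / phi`) and a scalar zero-inflation probability `pi`; these symbols are assumptions for illustration and may differ from the notation of Section 2.1.

```python
from math import exp, lgamma, log


def nb_pmf(y, mu, phi):
    """NB probability of count y with mean mu > 0 and dispersion phi > 0."""
    if y < 0 or mu <= 0 or phi <= 0:
        raise ValueError("require y >= 0, mu > 0, and phi > 0")
    log_p = (
        lgamma(y + phi) - lgamma(phi) - lgamma(y + 1)
        + phi * log(phi / (mu + phi))
        + y * log(mu / (mu + phi))
    )
    return exp(log_p)


def zinb_pmf(y, mu, phi, pi):
    """ZINB probability: a point mass at zero mixed with an NB component."""
    if not 0.0 <= pi <= 1.0:
        raise ValueError("pi must lie in [0, 1]")
    nb_part = (1.0 - pi) * nb_pmf(y, mu, phi)
    return pi + nb_part if y == 0 else nb_part


# Hypothetical parameter values: zero inflation raises P(Y = 0).
p0_nb = nb_pmf(0, mu=2.0, phi=1.0)
p0_zinb = zinb_pmf(0, mu=2.0, phi=1.0, pi=0.3)
```

With these values, the NB component alone assigns probability 1/3 to a zero count, whereas the ZINB distribution assigns 0.3 + 0.7/3; setting `pi` to 0 recovers the NB model, which corresponds to the BayesCafe (NB) variant compared above.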

BayesCafe primarily relies on the molecular profile from NGS-based SRT experiments, and it potentially faces a limitation in differentiating regions that exhibit similar gene expression but diverse morphological features visible in paired histology or pathology images. Hu et al. (2021) and Jiang et al. (2024) have shown that the integration of imaging information significantly boosts the accuracy of spatial domain identification, particularly when manual annotations by pathologists serve as the benchmark for true segmentation. A further challenge with BayesCafe is its sensitivity to the feature selection process; varying sets of DGs can yield different clustering results, with an increased number of DGs typically resulting in a higher cluster number. Another constraint is that the number of clusters is fixed in BayesCafe, even though it is typically determined by experienced pathologists or through the proposed mBIC criterion. To address the aforementioned limitations, several extensions are worthy of exploration. BayesCafe could be extended to estimate the number of clusters using a Dirichlet process mixture model (Müller et al., 2015; Li et al., 2017), which would not only determine this number but also quantify its uncertainty. To make the feature selection process more robust, BayesCafe could integrate pathway information as prior knowledge, allowing for joint estimation of DGs while considering regulatory relationships among genes (Li et al., 2019). Furthermore, as the field evolves toward multi-sample analyses, extending the model to handle multiple samples could substantially improve the robustness and reliability of the clustering results. Implementing any one of these extensions, or a combination thereof, could optimize the selection process for DGs, thereby yielding more precise and reliable spatial clustering results.

Supplementary Material

ujae066_Supplemental_Files

Web Appendices, Tables, and Figures referenced in Sections 1, 2, 3, 4, and 5 are available with this paper at the Biometrics website on Oxford Academic. All simulated and real datasets utilized in our analysis, along with the associated source code in R/C++ languages, are available with this paper at the Biometrics website on Oxford Academic, and are also openly accessible at https://github.com/huimin230/BayesCafe.

ujae066_supplemental_files.zip (5.9 MB, zip)

Acknowledgement

The authors would like to thank the co-editor, associate editor, and reviewers for their suggestions and feedback, which improved the paper significantly, and Kevin W. Jin for helping us proofread the manuscript.

Contributor Information

Huimin Li, Department of Mathematical Sciences, The University of Texas at Dallas, Richardson, TX 75080, United States.

Bencong Zhu, Department of Mathematical Sciences, The University of Texas at Dallas, Richardson, TX 75080, United States; Department of Statistics, The Chinese University of Hong Kong, Hong Kong SAR, China.

Xi Jiang, Department of Statistics and Data Science, Southern Methodist University, Dallas, TX 75205, United States; Quantitative Biomedical Research Center, Peter O’Donnell Jr. School of Public Health, The University of Texas Southwestern Medical Center, Dallas, TX 75390, United States.

Lei Guo, Quantitative Biomedical Research Center, Peter O’Donnell Jr. School of Public Health, The University of Texas Southwestern Medical Center, Dallas, TX 75390, United States.

Yang Xie, Quantitative Biomedical Research Center, Peter O’Donnell Jr. School of Public Health, The University of Texas Southwestern Medical Center, Dallas, TX 75390, United States.

Lin Xu, Quantitative Biomedical Research Center, Peter O’Donnell Jr. School of Public Health, The University of Texas Southwestern Medical Center, Dallas, TX 75390, United States.

Qiwei Li, Department of Mathematical Sciences, The University of Texas at Dallas, Richardson, TX 75080, United States.

FUNDING

This work was supported by the following funding: the National Science Foundation [2210912, 2113674] and the National Institutes of Health [1R01GM141519] (to Qiwei Li); the Rally Foundation, Children’s Cancer Fund (Dallas), the Cancer Prevention and Research Institute of Texas (RP180319, RP200103, RP220032, RP170152, and RP180805), and the National Institutes of Health (R01DK127037, R01CA263079, R21CA259771, UM1HG011996, and R01HL144969) (to Lin Xu). The funding bodies had no role in the design, collection, analysis, or interpretation of data in this study.

CONFLICT OF INTEREST

None declared.

DATA AVAILABILITY

The authors analyzed 3 publicly available spatially resolved transcriptomics (SRT) datasets. Mouse olfactory bulb Spatial Transcriptomics (ST) data are accessible on the website of the Spatial Research Lab at the KTH Royal Institute of Technology at https://www.spatialresearch.org/. Human breast cancer 10x Visium data are accessible on the 10x Genomics website at https://www.10xgenomics.com/resources/datasets. Mouse visual cortex STARmap data are accessible at https://www.starmapresources.com/data.

References

Baselga J., Swain S. M. (2009). Novel anticancer targets: revisiting ERBB2 and discovering ERBB3. Nature Reviews Cancer, 9, 463–475.
Blondel V. D., Guillaume J.-L., Lambiotte R., Lefebvre E. (2008). Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008, P10008.
Chen A., Liao S., Cheng M., Ma K., Wu L., Lai Y. et al. (2022). Spatiotemporal transcriptomic atlas of mouse organogenesis using DNA nanoball-patterned arrays. Cell, 185, 1777–1792.
Cho C.-S., Xi J., Si Y., Park S.-R., Hsu J.-E., Kim M. et al. (2021). Microscopic examination of spatial transcriptome using Seq-Scope. Cell, 184, 3559–3572.
Dahl D. B. (2006). Model-based clustering for expression data via a Dirichlet process mixture model. Bayesian Inference for Gene Expression and Proteomics, 4, 201–218.
De Bruin E. C., McGranahan N., Mitter R., Salm M., Wedge D. C., Yates L. et al. (2014). Spatial and temporal diversity in genomic instability processes defines lung cancer evolution. Science, 346, 251–256.
Do K., Tran T., Venkatesh S. (2021). Clustering by maximizing mutual information across views. Proceedings of the IEEE/CVF International Conference on Computer Vision, 9928–9938.
Eng C.-H. L., Lawson M., Zhu Q., Dries R., Koulena N., Takei Y. et al. (2019). Transcriptome-scale super-resolved imaging in tissues by RNA seqFISH+. Nature, 568, 235–239.
Finak G., McDavid A., Yajima M., Deng J., Gersuk V., Shalek A. K. et al. (2015). MAST: a flexible statistical framework for assessing transcriptional changes and characterizing heterogeneity in single-cell RNA sequencing data. Genome Biology, 16, 1–13.
Hu J., Li X., Coleman K., Schroeder A., Ma N., Irwin D. J. et al. (2021). SpaGCN: integrating gene expression, spatial location and histology to identify spatial domains and spatially variable genes by graph convolutional network. Nature Methods, 18, 1342–1351.
Hubert L., Arabie P. (1985). Comparing partitions. Journal of Classification, 2, 193–218.
Jiang X., Wang S., Guo L., Zhu B., Wen Z., Jia L. et al. (2024). iIMPACT: integrating image and molecular profiles for spatial transcriptomics analysis. Genome Biology, 25, 147.
Jiang X., Xiao G., Li Q. (2022). A Bayesian modified Ising model for identifying spatially variable genes from spatial transcriptomics data. Statistics in Medicine, 41, 4647–4665.
Knops Z. F., Maintz J. A., Viergever M. A., Pluim J. P. (2006). Normalized mutual information based registration using k-means clustering and shading correction. Medical Image Analysis, 10, 432–439.
Li Q., Cassese A., Guindani M., Vannucci M. (2019). Bayesian negative binomial mixture regression models for the analysis of sequence count and methylation data. Biometrics, 75, 183–192.
Li Q., Guindani M., Reich B. J., Bondell H. D., Vannucci M. (2017). A Bayesian mixture model for clustering and selection of feature occurrence rates under mean constraints. Statistical Analysis and Data Mining: The ASA Data Science Journal, 10, 393–409.
Li Q., Zhang M., Xie Y., Xiao G. (2021). Bayesian modeling of spatial molecular profiling data via Gaussian process. Bioinformatics, 37, 4129–4136.
Liu W., Liao X., Yang Y., Lin H., Yeong J., Zhou X. et al. (2022). Joint dimension reduction and clustering analysis of single-cell RNA-seq and spatial transcriptomics data. Nucleic Acids Research, 50, e72.
Love M. I., Huber W., Anders S. (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology, 15, 1–21.
Lubeck E., Coskun A. F., Zhiyentayev T., Ahmad M., Cai L. (2014). Single-cell in situ RNA profiling by sequential hybridization. Nature Methods, 11, 360–361.
Lun A. T., Bach K., Marioni J. C. (2016). Pooling across cells to normalize single-cell RNA sequencing data with many zero counts. Genome Biology, 17, 1–14.
Ma Y., Zhou X. (2022). Spatially informed cell-type deconvolution for spatial transcriptomics. Nature Biotechnology, 40, 1349–1359.
Marx V. (2021). Method of the year: spatially resolved transcriptomics. Nature Methods, 18, 9–14.
McInnes L., Healy J., Melville J. (2018). UMAP: uniform manifold approximation and projection for dimension reduction. arXiv, arXiv:1802.03426, preprint: not peer reviewed.
Molaei S., Bousejin N. G., Zare H., Jalili M. (2021). Deep node clustering based on mutual information maximization. Neurocomputing, 455, 274–282.
Müller P., Quintana F. A., Jara A., Hanson T. (2015). Bayesian Nonparametric Data Analysis. New York: Springer.
Newton M. A., Noueiry A., Sarkar D., Ahlquist P. (2004). Detecting differential gene expression with a semiparametric hierarchical mixture method. Biostatistics, 5, 155–176.
Okabe A., Boots B., Sugihara K., Chiu S. N. (2009). Spatial Tessellations: Concepts and Applications of Voronoi Diagrams. New York: John Wiley & Sons.
Pham D., Tan X., Xu J., Grice L. F., Lam P. Y., Raghubar A. et al. (2020). stLearn: integrating spatial location, tissue morphology and gene expression to find cell types, cell-cell interactions and spatial trajectories within undissociated tissues. BioRxiv, 2020-05.
Rand W. M. (1971). Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66, 846–850.
Revillion F., Bonneterre J., Peyrat J. (1998). ERBB2 oncogene in human breast cancer and its clinical significance. European Journal of Cancer, 34, 791–808.
Risso D., Perraudeau F., Gribkova S., Dudoit S., Vert J.-P. (2018). A general and flexible method for signal extraction from single-cell RNA-seq data. Nature Communications, 9, 284.
Robinson M. D., McCarthy D. J., Smyth G. K. (2010). edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26, 139–140.
Rodriques S. G., Stickels R. R., Goeva A., Martin C. A., Murray E., Vanderburg C. R. et al. (2019). Slide-seq: a scalable technology for measuring genome-wide expression at high spatial resolution. Science, 363, 1463–1467.
Satija R., Farrell J. A., Gennert D., Schier A. F., Regev A. (2015). Spatial reconstruction of single-cell gene expression data. Nature Biotechnology, 33, 495–502.
Silverman J. D., Roche K., Mukherjee S., David L. A. (2020). Naught all zeros in sequence count data are the same. Computational and Structural Biotechnology Journal, 18, 2789–2798.
Ståhl P. L., Salmén F., Vickovic S., Lundmark A., Navarro J. F., Magnusson J. et al. (2016). Visualization and analysis of gene expression in tissue sections by spatial transcriptomics. Science, 353, 78–82.
Strehl A., Ghosh J. (2002). Cluster ensembles—a knowledge reuse framework for combining multiple partitions. Journal of Machine Learning Research, 3, 583–617.
Sun S., Zhu J., Zhou X. (2020). Statistical analysis of spatial expression patterns for spatially resolved transcriptomic studies. Nature Methods, 17, 193–200.
Svensson V. (2020). Droplet scRNA-seq is not zero-inflated. Nature Biotechnology, 38, 147–150.
Svensson V., Teichmann S. A., Stegle O. (2018). SpatialDE: identification of spatially variable genes. Nature Methods, 15, 343–346.
Tadesse M. G., Sha N., Vannucci M. (2005). Bayesian variable selection in clustering high-dimensional data. Journal of the American Statistical Association, 100, 602–617.
Van der Maaten L., Hinton G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9, 2579–2605.
Wang H., Li R., Tsai C.-L. (2007). Tuning parameter selectors for the smoothly clipped absolute deviation method. Biometrika, 94, 553–568.
Wang X., Allen W. E., Wright M. A., Sylwestrak E. L., Samusik N., Vesuna S. et al. (2018). Three-dimensional intact-tissue sequencing of single-cell transcriptional states. Science, 361, eaat5691.
Yang Y., Shi X., Liu W., Zhou Q., Chan Lau M., Chun Tatt Lim J. et al. (2022). SC-MEB: spatial clustering with hidden Markov random field using empirical Bayes. Briefings in Bioinformatics, 23, bbab466.
Zhang C., Dong K., Aihara K., Chen L., Zhang S. (2023). STAMarker: determining spatial domain-specific variable genes with saliency maps in deep learning. Nucleic Acids Research, 51, e103.
Zhao E., Stone M. R., Ren X., Guenthoer J., Smythe K. S., Pulliam T. et al. (2021). Spatial transcriptomics at subspot resolution with BayesSpace. Nature Biotechnology, 39, 1375–1384.
Zhao P., Zhu J., Ma Y., Zhou X. (2022). Modeling zero inflation is not necessary for spatial transcriptomics. Genome Biology, 23, 118.

