Abstract
Spatial transcriptomics has been emerging as a powerful technique for resolving gene expression profiles while retaining tissue spatial information. These spatially resolved transcriptomics make it feasible to examine the complex multicellular systems of different microenvironments. To answer scientific questions with spatial transcriptomics and expand our understanding of how cell types and states are regulated by microenvironment, the first step is to identify cell clusters by integrating the available spatial information. Here, we introduce SC-MEB, an empirical Bayes approach for spatial clustering analysis using a hidden Markov random field. We have also derived an efficient expectation-maximization algorithm based on an iterative conditional mode for SC-MEB. In contrast to BayesSpace, a recently developed method, SC-MEB is not only computationally efficient and scalable to large sample sizes but is also capable of choosing the smoothness parameter and the number of clusters. We performed comprehensive simulation studies to demonstrate the superiority of SC-MEB over some existing methods. We applied SC-MEB to analyze the spatial transcriptome of human dorsolateral prefrontal cortex tissues and mouse hypothalamic preoptic region. Our analysis results showed that SC-MEB can achieve a similar or better clustering performance to BayesSpace, which uses the true number of clusters and a fixed smoothness parameter. Moreover, SC-MEB is scalable to large ‘sample sizes’. We then employed SC-MEB to analyze a colon dataset from a patient with colorectal cancer (CRC) and COVID-19, and further performed differential expression analysis to identify signature genes related to the clustering results. The heatmap of identified signature genes showed that the clusters identified using SC-MEB were more separable than those obtained with BayesSpace. Using pathway analysis, we identified three immune-related clusters, and in a further comparison, found the mean expression of COVID-19 signature genes was greater in immune than non-immune regions of colon tissue. SC-MEB provides a valuable computational tool for investigating the structural organizations of tissues from spatial transcriptomic data.
Keywords: spatial transcriptomics, cell phenotype, empirical Bayes, hidden Markov random field, expectation-maximization algorithm
1 Introduction
Recent advances in spatial transcriptomics (ST) have allowed researchers to simultaneously measure transcriptome-wide gene expression at near single-cell resolution, whereas the spatial information for each measurement is retained [5]. These spatially resolved transcriptomics have deepened our understanding of how cell types and states are regulated by tissues microenvironment, e.g. of the human brain [20], mouse brain [1, 30] and mouse embryo [17], among others. The technologies used for resolving spatial gene expression can largely be classified as either imaging-based or next-generation-sequencing-based methods [37]. Imaging-based methods, which were developed to study spatial complexity, are based on fluorescent in situ hybridization (FISH) and include smFISH [19], seqFISH [11] and MERFISH [38]. Although FISH-based methods are capable of capturing both RNA quantity and position, they are limited by their throughput scalability and accuracy in measuring gene expression levels. However, multiple next-generation-sequencing-based methods have been developed to facilitate high-throughput analysis, including Geo-seq [6], Slide-seq [28] and, more recently, the commercial 10x Genomics Visium system [31]. Emerging ST technologies offer new opportunities to investigate the spatial patterns of gene expression for many applications, such as cell type identification, tissue exploration and differential expression (DE) analysis. Among these applications, cell-type clustering is the first problem that needs to be addressed.
Similar to single-cell RNA-seq data, ST data contains excessive amounts of zeros or ‘drop-outs’ [26]. Recently, many academics have argued that drop-outs are mostly due to biological variation, such as cell-type heterogeneity, rather than technical shortcomings [32]. Kim et al. [14] suggested that clustering analysis should be performed before imputing or normalizing the data. To overcome the curse of dimensionality due to high-throughput spatial gene expression, clustering is often preceded by standard dimension reduction procedures, e.g. principal component analysis (PCA), -distributed stochastic neighbor embedding [34] and uniform manifold approximation and projection [22].
In ST datasets, the majority of existing clustering methods, e.g. k-means [15] and Gaussian mixture models (GMM) [4], do not consider the available spatial information. To allow additional spatial information to be incorporated into ST datasets, several methods have been recently developed, including the hidden Markov random field (HMRF) model implemented in the Giotto package [10, 40] and a fully Bayesian model with a MRF, BayesSpace [39]. Given the spatial coordinates for each transcriptome-profiled spot, spatial clustering methods achieve better classification accuracy. For example, [39] showed that BayesSpace improved the resolution and achieved better classification accuracy for manually annotated human brain samples. However, these methods have certain limitations. First, BayesSpace is a fully Bayesian method based on Markov chain Monte Carlo; therefore, it is not computationally scalable for ST data with high resolution. Second, smoothness is an essential parameter of MRF-based methods and largely determines the proximity of the neighboring spots [33]. BayesSpace takes this smoothness parameter as fixed and, thus, cannot choose the optimal value that best fits a given dataset. Third, without optimizing the smoothness parameter, one cannot apply any model selection methods to obtain the optimal number of clusters. In practice, the number of clusters in ST datasets is usually unknown before the follow-up analysis, and the preferred method would be to automatically choose the number of clusters.
To address these limitations, we propose a method of Spatial Clustering using the hidden Markov random field based on Empirical Bayes (SC-MEB) to model a low-dimensional representation of a gene expression matrix that incorporates the spatial coordinates for each measurement. In contrast to existing methods [10, 39], SC-MEB is not only computationally efficient and scalable to larger sample sizes but also accommodates adjustments to the smoothness parameter and the number of clusters. We derived an efficient expectation-maximization (EM) algorithm based on an iterative conditional mode (ICM) and further selected the number of clusters for SC-MEB based on the modified Bayesian information criterion (MBIC) [36]. We demonstrated the effectiveness of SC-MEB over existing methods through comprehensive simulation studies. We then applied SC-MEB to the clustering analysis of three ST datasets. Using a 10x Genomics Visium dataset from human dorsolateral prefrontal cortex (DLPFC) tissues that were manually annotated, we showed that the performance of SC-MEB was comparable or better than that of BayesSpace, even though the latter uses the ‘true’ number of clusters and a prespecified, fine-tuned smoothness parameter. Using a large MERFISH dataset from mouse hypothalamic preoptic region (MHPR), we demonstrated the better clustering performance as well as the scalability of SC-MEB. We further applied SC-MEB and alternative methods to analyze ST data of a colon tissue from a patient with colorectal cancer (CRC) and COVID-19. We performed follow-up DEs analysis using the clustering results from SC-MEB and BayesSpace, and the heatmap of identified signature genes showed that SC-MEB clustering results were more reasonable and interpretable. Using pathway analysis, we identified three immune-related clusters, and the mean expressions of COVID-19 signature genes were further compared between immune and non-immune regions of a colon sample.
2 Materials and Methods
2.1 Problem formulation
The SC-MEB consists of three major stages (Figure 1A). First, PCA is conducted on the log-transformed expression of the highly variable genes to obtain the top principal components (PCs) (Figure 1B). Next, spatial clustering is performed using PCs for each spot. Finally, downstream analyses, such as DE analysis, can be performed to obtain signature genes for each cluster (Figure 1D).
Our spatial clustering method builds on a two-level hierarchical probabilistic model (Figure 1C). Briefly, for spot , the first level specifies the conditional probability of the low-dimensional representation (e.g. top PCs) of its gene expression given an unknown label , where is the number of clusters. In SC-MEB, we assume that given the labels for each spot, a -dimensional representation is mutually independent among all spots, and its distribution within a given cluster can be written as
(1) |
where , and and denote the mean and covariance matrix for cluster , respectively.
The second level of SC-MEB depicts the prior probability of the hidden labels, and an MRF prior is implemented to encourage smoothness among spots. In other words, spots of the same cluster can be in close proximity. As spots in Visium are primarily arranged on hexagonal lattices, the neighborhood of each spot is defined by applying a proximity threshold. To promote smoothness within spot neighborhoods, we use the Potts model [25] for the hidden labels. The Potts model is well known as a statistical model for use with complex systems with nearest neighbor interactions. Essentially, it views the total energies as a summation of pairwise interaction energies with neighbors, where a positive parameter represents the strength of interactions. Specifically, the Potts model promotes spatial smoothness by penalizing cases in which neighboring spots are assigned to different cluster labels. The hidden random field is assumed to be
(2) |
where , is the delta function, and is a normalization constant that does not have a closed form. When all labels on a neighborhood take the same value, meaning that the hidden is locally smooth, it incurs no neighborhood cost; otherwise, if they are not all the same, a positive cost is incurred, and the amount of cost is controlled by parameter . Thus, parameter controls the smoothness in latent labels; the larger the , the spatially smoother the latent labels. When is zero, SC-MEB reverts to the method that does not consider spatial information, i.e. GMM. Combining two levels of SC-MEB, (1) and (2), we denote the parameter space.
As the smoothing parameter does not have an explicit updated form, SC-MEB adaptively selects via a grid search strategy. That is, the SC-MEB model is trained with a prefixed using an efficient iterative-conditional-mode-based expectation-maximization (ICM-EM) scheme [9] that incorporates a pseudo-likelihood maximization step, as in the ICM method of [2]. The optimal is the value that maximizes the marginal log-likelihood. In a similar way, the marginal log-likelihood can be evaluated for a sequence of . Then, MBIC [36] is applied to choose the optimal number of clusters in a data-driven manner (Figure 1D). We also tried BIC and found that in general MBIC has similar or better numerical performance. Please refer to Supplementary for more details about the MBIC used in this case.
2.2 ICM-EM algorithm
The parameter is estimated through an ICM-EM algorithm [9]. Here, we assume is known.
In the ICM step, the estimate of is obtained by maximizing its posterior with respect to coordinately:
where , until converge [3]. Given initial values of , and observed , we have the updated equation:
(3) |
, where
In the expectation (E) step, instead of using the original complete likelihood, which is difficult to evaluate, the following pseudo-likelihood is used:
With the optimal conditional distribution of (details in Supplementary), we have
where is the responsibility that component has for explaining the observation , which is defined as follows:
(4) |
By taking partial derivatives of with respect to the parameter and setting them to zero, we obtain the updated equations in the maximization (M) step:
(5) |
(6) |
where . Since there is no closed-form solution for , we optimize the smoothness parameter via a grid search strategy:
(7) |
where the sequence is a vector of 20 evenly spaced points in the interval .
The ICM-EM algorithm iterates the ICM step and M step until convergence. Further details on the ICM-EM algorithm are provided in the Supplementary Material.
2.3 Methods for comparison
We conducted comprehensive simulations and real data analysis to gauge the performance of different methods for clustering a low-dimensional representation of a gene expression matrix, including both non-spatial and spatial clustering methods.
In detail, we considered the following non-spatial clustering methods: (i) k-means implemented in the R package stats, available at CRAN; (ii) GMM implemented in the R package mclust, available at CRAN; (iii) Louvain implemented in the R package igraph, available at https://igraph.org/r/. In addition, we compared the clustering performance of spatial methods: (i) SC-MEB implemented in the R package SC.MEB, available at https://github.com/Shufeyangyi2015310117/SC.MEB; (ii) BayesSpace implemented in the R package BayesSpace, available at Bioconductor; (iii) HMRF implemented in the Giotto package, available at http://spatialgiotto.rc.fas.harvard.edu/.
2.4 Preprocessing of ST datasets
The Visium ST [39] data were aligned and quantified using Space Ranger downloaded from 10x Genomics official website against the GRCh38 human reference genome also from 10x Genomics official website. For all datasets, we applied log-transformation of the raw count matrix using library size [18, 21]. Then, we performed PCA on the 2000 most highly variable genes. In the clustering analysis, we chose the top 15 PCs from the study datasets as the input for SC-MEB as well as for the alternative methods.
2.5 ST datasets
2.5.1 Human dorsolateral prefrontal cortex
Maynard et al [20] used recently released ST technology, the 10x Genomics Visium platform, to generate spatial maps of gene expression matrices for the six-layered DLPFC of the adult human brain that are provided in the spatialLIBD package. They also provided manual annotations of the layers based on the cytoarchitecture. In their study, they profiled the ST of human postmortem DLPFC tissue sections from 12 samples, with a median depth of 291 M reads for each sample, corresponding to a mean 3462 unique molecular indices and a mean 1734 genes per spot.
2.5.2 Mouse hypothalamic preoptic region
Moffitt et al. [24] used the combination of MERFISH with scRNA-seq to profile the gene expression of 1 million cells in situ that reveals neuronal populations in the preoptic region of 36 mouses, each with distinct molecular signatures and spatial organizations. Specifically, the MHPR dataset contains expression values of 161 genes in 1 027 848 cells. To demonstrate the scalability, we perform joint clustering for all cells of the 36 samples. The spatial locations for each sample are offset so that cells of different samples are not neighbors. Here we add 10 000 to row and column coordinates to achieve this. Sample 1 further contains six slices, we refer them as Sample 1-1 to Sample 1-6. We further analyze all cells in Sample 1 as well as each of these six slices. The number of cells for each dataset are summarized in Table 4. On average, there are six neighbors for every cells determined by their Euclidean distance orders.
Table 4.
ID | SC-MEB | BayesSpace | GMM | Giotto | Louvain | |
---|---|---|---|---|---|---|
All | 1 027 848 | 16 647 | - | 107 914 | 11 940 | 45 884 |
1 | 73 665 | 1390 | - | 14 288 | 744 | 287 |
1-1 | 1 1738 | 318 | 16 152 | 2192 | 153 | 10 |
1-2 | 12 672 | 349 | 17 238 | 2267 | 172 | 13 |
1-3 | 12 556 | 357 | 19 071 | 2332 | 170 | 17 |
1-4 | 11 763 | 326 | 15 459 | 2293 | 150 | 10 |
1-5 | 12 306 | 383 | 16 818 | 2130 | 160 | 10 |
1-6 | 12 620 | 332 | 18 366 | 2234 | 164 | 14 |
2.5.3 Human colon tissue adjacent to CRC
The colon tissue was from a 45-year-old South Asian male who was diagnosed with COVID-19 on 16 April 2020. As previously described [7], the patient had experienced mild upper respiratory tract symptoms throughout the course of the disease. He was confirmed COVID-19-negative after two consecutive nasopharyngeal swabs on and 9 and 10 May 2020 and was discharged from the isolation facility on 10 May 2020. During hospital admission, further investigation involving computed tomography scanning and colonoscopy revealed the presence of a large circumferential malignant mass in the cecum. Histology of the biopsies confirmed that the patient had invasive CRC stage II T3N0. He underwent laparoscopic right hemicolectomy on 18 May 2020, 9 days after testing negative for COVID-19. He recovered uneventfully and was discharged on 21 May 2020. Using this sample, we profiled the ST using the 10x Genomics Visium platform. In summary, it has a depth of 143 million reads for a total of 2988 spots within the tissue and a median 492 genes per spot.
2.6 Evaluation metrics
We evaluated the clustering performance by adjusted Rand index (ARI) [29]. The general formula for ARI is as follows
(8) |
where RI is the Rand index [27], and max and expected are the maximum value and the expected value of RI, respectively. Assuming that is the number of spots in an ST dataset. and represent two clustering labels for spots, where and are the corresponding numbers of clusters in and , respectively. Denoting as the number of spots belonging to both classes and , and and as the number of spots in classes and , respectively; then ARI (8) is defined as
(9) |
As the expected value of RI for two random partitions does not take a constant value and is concentrated within a small interval, ARI is a corrected version of RI to avoid these drawbacks [13]. Note that ARI lies between and 1 and takes a value of 1 when the two partitions are equal up to a permutation. Obviously, a larger ARI value indicates a higher similarity between two partitions. In the simulation, ARI was used to measure the similarity between the estimated partition and the true one. In the analysis of the DLPFC and MHPR dataset [20], manual annotations based on additional experiments and computational results were available. ARI was used to measure the similarity between labels from the estimated partition and the manually annotated clusters.
3 Results
3.1 Simulation settings
Using simulations, we compared the clustering performance of SC-MEB and with five other methods, including k-means, GMM, Louvain, BayesSpace and Giotto. For -means, BayesSpace and Giotto, we considered the true number of clusters , and its two nearest numbers, and , as the number of clusters had to be manually specified for these two methods. For all other methods, the number of clusters was selected automatically. The smoothness parameter of BayesSpace was fixed at 3 for Visium dataset and 2 for ST dataset, respectively, whereas of SC-MEB was optimized with a grid search. We compared the clustering performances using ARI for all methods, in which we ran 50 replicates in each setting.
In Example I, the labels for spots were randomly generated. In detail, for a 70 70 squared lattice with 4900 spatial spots, we generated cluster labels for each spot from the -states Potts model [as shown in Equation. (2)] with using the R package GiRaF. The number of neighbors was set to be 4, and the number of true clusters was set to 3, 5 or 7. We then considered two distributions for low-dimensional PCs: a mixture of Gaussian and a mixture of Student’s distributions. The number of PCs was set to either 10 or 15. The mean and the covariance matrix for each component are listed in Supplementary Tables S1–S4.
In Example II, labels for spots were obtained from real data analysis. In detail, we used the inferred cluster labels from SC-MEB () of colon data as the true labels for all 2988 spots. PCs were randomly generated in the same way as in Example I. The mean and the covariance matrix for each component are provided in Supplementary Tables S5 and S6.
In the above examples, all components had different covariance matrices. Because BayesSpace adopts a strategy in which all components have a shared covariance, we further conducted additional simulations with equal covariance matrices.
3.2 Performance of SC-MEB in comparison with other methods in simulation studies
In Example I, when PCs were from a mixture of Gaussian distributions, SC-MEB was more powerful than all other methods (Figure 2A). BayesSpace had a smaller ARI, i.e. poorer concordance between predicted and true clustering assignment, than SC-MEB, even when the true number of clusters was used as input. The inferior performance of BayesSpace was due to its lack of adaptation to the smoothness parameter . The other methods, Giotto, GMM, -means and Louvain, achieved lower ARIs. When PCs were from a mixture of Student’s -distributions, assumptions of BayesSpace were satisfied. As shown in Figure 2B, using BayesSpace with the correct number of clusters showed the best performance. Even though SC-MEB was miss-specified in this setting, it still achieved a high ARI that was larger than that of BayesSpace with miss-specificed and other methods. Note that most components in the mixture of Student’s -distributions simulated here are (5) and (6), which are reasonably close to Gaussian. This demonstrates the robust performance of SC-MEB when there is moderate miss-specification of distributions. We note that if the data are far from the Gaussian component, the performance of SC-MEB will degenerate. The results from other settings (Supplementary Figure S1 and S2) prompted similar conclusions.
In Example II, the comparative results (Figure 2C and D) were largely consistent with the results obtained in Example I. Specifically, SC-MEB was more powerful than all the other methods. The performance of BayesSpace was the next most powerful, and -means had the worst performance. The results obtained from other settings (Supplementary Figures S5A and B, and S6A and B) led to similar conclusions.
Finally, we considered the above two examples under BayesSpace’s assumption that all components share a common covariance. The results are shown in (Supplementary Figures S3, S4, S5C and D, and S6C and D). As is shown, the ARI of SC-MEB was comparable with that of BayesSpace, and both demonstrated better performance than the other methods.
All simulations were conducted on a computer with a 2.1 GHz Intel Xeon Gold 6230 CPU and 16 GB memory. SC-MEB was computationally more efficient than BayesSpace. In all simulations, SC-MEB toke approximately 8 min to complete the analysis for 10 combinatorial values in , whereas BayesSpace required about 25 min for prefixed and and up to 600 times more computation time than SC-MEB for fixed combinatorial values of and . To better demonstrate the computational efficiency and scalability of SC-MEB, we conducted additional simulations with an increasing sample size . In Figure 3, we can see that the computation time of SC-MEB for a fixed number of iterations increased almost linearly with increasing sample size, taking about 0.5 h to run 50 iterations for a dataset with 200K spots. Thus, SC-MEB can be used to perform clustering analysis for ST datasets with a higher resolution than other methods.
3.3 Benchmark clustering performance with real datasets
To evaluate the clustering performance of SC-MEB with real datasets, it was applied to the DLPFC and MHPR datasets and its clustering performance was compared with that of alternative methods. Specifically, we first obtained the top 15 PCs from the 2000 most highly variable genes in DLPFC dataset and all 161 genes in MHPR dataset, respectively. Then we performed clustering analysis with all methods, except k-means. As BayesSpace and Giotto cannot choose the number of clusters , the K was set to the number of clusters in the manual annotations. All other methods selected the number of clusters automatically.
Table 1 shows the ARI values for 12 DLPFC samples, where the manually annotated layers were taken as the ‘ground truth’. SC-MEB clearly outperformed BayesSpace in the analysis of six samples, and vice versa for the other five samples. In the analysis, BayesSpace took both the ‘true’ number of clusters (from the manual annotations) and the prefixed fine-tuned as input. In this case, the proposed SC-MEB achieved the similar performance without the prior information. Additionally, SC-MEB achieved the best clustering performance among the methods that can select () automatically. Table 2 compares the computational times required for all methods. The speed of SC-MEB was almost 200 times faster than that of BayesSpace, and comparable to GMM and Giotto. Louvain was the fastest method, but its clustering accuracy was inferior.
Table 1.
ID | SC-MEB | BayesSpace | GMM | Giotto | Louvain |
---|---|---|---|---|---|
151507 | 0.42 | 0.33 | 0.33 | 0.33 | 0.32 |
151508 | 0.44 | 0.36 | 0.35 | 0.34 | 0.25 |
151509 | 0.52 | 0.44 | 0.40 | 0.35 | 0.30 |
151510 | 0.39 | 0.43 | 0.44 | 0.33 | 0.28 |
151669 | 0.32 | 0.41 | 0.29 | 0.25 | 0.20 |
151670 | 0.43 | 0.43 | 0.35 | 0.21 | 0.26 |
151571 | 0.42 | 0.38 | 0.27 | 0.40 | 0.36 |
151672 | 0.44 | 0.77 | 0.14 | 0.38 | 0.27 |
151673 | 0.49 | 0.55 | 0.40 | 0.37 | 0.29 |
151674 | 0.43 | 0.33 | 0.36 | 0.29 | 0.33 |
151675 | 0.31 | 0.41 | 0.28 | 0.32 | 0.24 |
151676 | 0.39 | 0.32 | 0.21 | 0.26 | 0.25 |
The values of the best performance are in bold type.
Table 2.
ID | SC-MEB | BayesSpace | GMM | Giotto | Louvain |
---|---|---|---|---|---|
151507 | 23.00 | 7015.95 | 46.16 | 33.10 | 2.86 |
151508 | 23.55 | 5471.57 | 36.26 | 20.48 | 6.97 |
151509 | 23.56 | 4660.43 | 53.22 | 27.05 | 6.00 |
151510 | 26.62 | 4288.56 | 39.50 | 24.27 | 7.57 |
151669 | 23.79 | 5840.10 | 20.96 | 20.81 | 3.49 |
151670 | 18.96 | 4917.43 | 37.94 | 16.35 | 5.26 |
151571 | 24.54 | 3889.07 | 31.01 | 20.28 | 8.82 |
151672 | 27.91 | 3793.58 | 22.44 | 21.92 | 5.49 |
151673 | 18.92 | 5984.89 | 40.20 | 35.30 | 1.49 |
151674 | 19.80 | 5344.68 | 36.82 | 23.57 | 3.34 |
151675 | 18.93 | 3863.19 | 27.90 | 26.01 | 4.22 |
151676 | 18.93 | 3609.71 | 37.44 | 28.07 | 3.24 |
The ARI and computation times of the five methods for MHPR dataset are provided in Tables 3 and 4. Obviously, SC-MEB has the best clustering performance. The ARI of BayesSpace for each dataset is lower than SC-MEB. And for large datasets such as all cells and cells in Sample 1, it cannot work. The Giotto and Louvain can work well for each dataset, but their ARI is smaller than SC-MEB.
Table 3.
ID | SC-MEB | BayesSpace | GMM | Giotto | Louvain |
---|---|---|---|---|---|
All | 0.47 | - | 0.25 | 0.19 | 0.36 |
1 | 0.51 | - | 0.20 | 0.21 | 0.37 |
1-1 | 0.44 | 0.24 | 0.23 | 0.22 | 0.46 |
1-2 | 0.46 | 0.41 | 0.20 | 0.25 | 0.41 |
1-3 | 0.49 | 0.31 | 0.20 | 0.26 | 0.41 |
1-4 | 0.52 | 0.35 | 0.22 | 0.23 | 0.42 |
1-5 | 0.39 | 0.31 | 0.22 | 0.24 | 0.40 |
1-6 | 0.49 | 0.31 | 0.21 | 0.25 | 0.41 |
The values of the best performance are in bold type.
3.4 Spatial clustering of in-house CRC sample
To apply SC-MEB in the analysis of an in-house colon sample from a patient suffered from CRC and COVID-19, we first obtained the top 15 PCs, as described for the DLPFC dataset. The spatial clustering performed by SC-MEB was compared with that of other methods. Because SC-MEB and Louvain selected eight clusters as the optimal number (), we also ran BayesSpace and Giotto with eight clusters. The computational times for SC-MEB, BayesSpace, GMM, Giotto and Louvain were 24.66, 5324.48, 46.26, 49.88 and 0.69 seconds, respectively.
The clustering results obtained using the different methods are shown in Figure 4. In general, the pattern of clustering assigned by SC-MEB was similar to that of GMM, but the latter retained more noisy spots. In addition, the results from SC-MEB and BayesSpace had stronger spatial patterns than those of the other methods.
By checking PanglaoDB [12] for signature genes identified via DE analysis and with the help of the H&E staining shown in Figure 4A, we were able to identify regions of muscle, stroma, epithelial and immune cells. As shown in Figure 4B–F, all methods except BayesSpace returned good partitions for the muscle region, which were visually verified with the H&E staining (Figure 4A). The epithelial regions identified by BayesSpace were much smaller than those identified by SC-MEB, in which a large proportion of the epithelial regions in Figure 4B and C were classified as stromal regions (stroma 2) by BayesSpace. The immune regions identified by BayesSpace were larger at the 9 and 12 o’clock positions but smaller at the 6 o’clock position in Figure 4C than those identified by SC-MEB (Figure 4B) and GMM (Figure 4F). Strikingly, a large proportion of the regions identified as stroma 1 by both SC-MEB and GMM were classified as stroma 2 by BayesSpace. Even though stroma 1 and stroma 2 are both stromal regions, one can observe clear differences in the normalized expression of signature genes for these two clusters (Figure 5). These observations illustrate the possible over-smoothing behavior of BayesSpace, whereas SC-MEB was able to recover the fine structure of tissues.
3.4.1 DE analysis of the identified clusters
As true labels for all spots were not available for the colon dataset, we could not quantitatively evaluate the clustering performance. For the clustering results of SC-MEB and BayesSpace, we further performed DE analysis comparing an identified cluster with all others using the BPSC package [35] for log-normalized expression. Using the partition results from SC-MEB, we identified 180, 158, 128, 94, 145, 95, 137 and 84 genes that were differentially expressed for stroma 1 and 2; muscle; epithelial 1 and 2; and immune 1, 2 and 3, respectively, with a false discovery rate of . The details of all differentially expressed genes identified by SC-MEB and BayesSpace are provided in Supplementary Tables S7 and S8. We further restricted the number of signature genes by choosing those with log-fold changes larger than 0.5. Finally, we obtained a total of 62 and 57 signature genes for SC-MEB and BayesSpace, respectively.
Figure 5 shows the heatmap of normalized expression for the signature genes identified in the DE analysis by SC-MEB (Figure 5A) and BayesSpace (Figure 5B), respectively. Clearly, with BayesSpace, the normalized expression of signature genes in stroma 2 could be further divided into two sub-clusters, and the expression pattern in the second sub-cluster was very similar to that of stroma 1. This misclassification is also apparent when comparing Figure 4B and C, as a large proportion of the regions identified as stroma 1 by SC-MEB were identified as stroma 2 by BayesSpace. The findings obtained using SC-MEB demonstrated that stroma 1 and stroma 2 clusters, epithelial clusters and immune clusters were arranged in layers that are morphologically supported by the anatomical architecture of colonic tissue [23]: (from lumen to serosa) mucosal epithelium; lamina propria (in which immune cells are abundant, and the isolated lymphoid nodules present in this tissue extend into the submucosal layer); submucosal layer, the stromal layer with abundant connective tissue; and lastly muscularis externa, which is represented by the muscle layer.
3.4.2 Pathway analysis of the signature genes identified in the DE analysis
We further conducted pathway analysis using gene ontology [8] for the signature genes from each cluster identified by SC-MEB. Supplementary Table S9 shows the top four pathways in each cluster. For the regions identified as muscle, muscle contraction was among the top four significant pathways. For the three identified immune clusters, the most significant pathways included humoral immune response and antimicrobial humoral response. For stroma 2 clusters, extracellular structure organization and external encapsulating structure organization were among the most significant pathways. We also found similar patterns in the heatmap for the normalized expression of signature genes (Figure 5) between stroma 1 and 2 clusters, among the three immune clusters, and between epithelial clusters 1 and 2. There was high cosine similarity between the two stromal clusters (0.97), as well as among the immune clusters (see Supplementary Table S10). We ultimately compared the mean expression of COVID-19 signature genes [16] in the immune and non-immune regions identified by SC-MEB (Figure 6), and it was clear that COVID-19 signature genes were more highly expressed in the immune regions than the non-immune regions of the colorectal tumor sample.
4 Discussion
We propose a new computational tool, SC-MEB, for identifying cell types in ST data analysis. Our method builds on a two-level hierarchical probabilistic model that is computationally efficient and can be easily used to analyze the high-resolution data generated by ST technology. Compared with existing approaches, SC-MEB is both more computationally efficient, and is more powerful when no prior knowledge regarding the number of clusters is available. Specifically, the performance of SC-MEB is determined by its ability to select optimal K and , which is not possessed by BayesSpace and Giotto. Both SC-MEB and Giotto use the EM algorithm, but they are quite different in using MRF. SC-MEB takes the step of ICM to get the maximum a posteriori probability estimates for the hidden labels. However, there is no such step in Giotto. It directly assigns a label to the one with the largest posterior probability. We have illustrated the benefits of SC-MEB through extensive simulations, as well as in-depth analysis of four real data sets.
We benchmarked the clustering performance of SC-MEB as well as its computational efficiency using two ST datasets, DLPFC from 10x Genomics Visium and MHPR from MERFISH, respectively. In the DLPFC dataset, the ARI values for SC-MEB and BayesSpace were comparable, but SC-MEB was 200 times faster at running the analysis than BayesSpace. More importantly, SC-MEB optimized the smoothness parameter and selected the number of clusters in a data-driven manner. SC-MEB could also be applied to perform spatial clustering in other types of ST datasets. Our analysis of the MHPR dataset from MERFISH showed that SC-MEB not only outperformed other methods but was also scalable to larger sample sizes. It took less than 5 h to complete the analysis for all cells (> 1 million). By applying SC-MEB and other methods, we performed spatial clustering for a colon dataset from a patient with CRC and COVID-19 and further performed DE analysis to identify signature genes related to the clustering results. We compared the heatmaps of signature genes identified using SC-MEB and BayesSpace and observed that the clusters identified using SC-MEB were more separable. Using pathway analysis, we identified three immune-related clusters and in a further comparison, we found the mean expression of COVID-19 signature genes was greater in immune than non-immune regions.
There are some caveats associated with SC-MEB that may require further explorations. First, although clustering using low-dimensional features ensures computational efficiency, it is not certain that the features obtained are relevant to class labels that could improve the spatial clustering performance. Thus, an optimal strategy might be to perform joint dimension reduction and spatial clustering for high-dimensional ST datasets. Second, problems with bulk and single-cell RNA-seq remain in the analysis of ST datasets. For example, without removing batch effects from different experiments, findings from DE analysis under different conditions could be confounded.
Key Points
We propose an empirical Bayes approach for spatial clustering analysis using a hidden MRF.
We extensively benchmark SC-MEB against existing methods using both simulated and real ST datasets.
We further employed SC-MEB to analyze a colon dataset and the follow-up DE analysis showed that clusters obtained from SC-MEB are more separable.
The implemented R package for SC-MEB is available at https://github.com/Shufeyangyi2015310117/SC.MEB including all codes for both experiments and real data analysis to promote reproducibility.
6 Author contributions statement
J.L. and J.Y. initiated and designed the study, X.S. and Y.Y. implemented the model and performed simulation studies and benchmarking evaluation, X.S. and J.L. wrote the manuscript, and all authors edited and revised the manuscript.
Supplementary Material
Acknowledgments
The authors thank the editor and three anonymous reviewers for their valuable suggestions. This work was supported by grant R-913-200-098-263 from the Duke-NUS Medical School, AcRF Tier 2 (MOE2018-T2-1-046 and MOE2018-T2-2-006) from the Ministry of Education, Singapore, grants (NSF: # 12171229, # 71931004 and # 92046005) from the National Natural Science Foundation of China.
Yi Yang is a Research Fellow in Health Services & Systems Research program at Duke-NUS Medical School.
Xingjie Shi is an Associate Professor of the Academy of Statistics and Interdisciplinary Sciences at East China Normal University, Shanghai, China. His research interests include statistical genetics/genomics, bioinformatics, and statistical computing.
Wei Liu is a Research Fellow in Health Services & Systems Research program at Duke-NUS Medical School.
Qiuzhong Zhou is a Research Fellow in Cardiovascular and Metabolic Disorders program at Duke-NUS Medical School.
Mai Chan Lau is a bioinformaticiam with wetlab skillset who is now a postdoctorate fellow in IMCB, A*STAR Singapore.
Jeffrey Chun Tatt Lim is a medical technologist who have almost 10 years expertise in multiplex IHC/IF and pathology, currently working in IMCB, A*STAR Singapore.
Lei Sun is an Associate Professor in Cardiovascular and Metabolic Disorders program at Duke-NUS Medical School, Singapore. His research interest is to understand the role of RNA-regulatory network in metabolic diseases.
Cedric Chuan Young Ng is the CTO of Cancer Discovery Hub, National Cancer Centre Singapore. He has more than 10 years experience in genetic technologies.
Joe Yeong is a immunopathologist who is a group leader in IMCB, A*STAR Singapore. His research interest is to use spatial techologies to study and overcome resistance of cancer immunotherapy.
Jin Liu is an Assistant Professor in Health Services & Systems Research program at Duke-NUS Medical School, Singapore. His research interest includes statistical genetics/genomics, bioinformatics and machine learning.
Contributor Information
Yi Yang, Program in Health Services & Systems Research, Duke-NUS Medical School, 8 College Road, 169857, Singapore.
Xingjie Shi, Academy of Statistics and Interdisciplinary Sciences, East China Normal University, 3663 Zhongshan North Road, 200062, Shanghai, China.
Wei Liu, Program in Health Services & Systems Research, Duke-NUS Medical School, 8 College Road, 169857, Singapore.
Qiuzhong Zhou, Program in Cardiovascular and Metabolic Disorders, Duke-NUS Medical School, 8 College Road, 169857, Singapore.
Mai Chan Lau, Institute of Molecular and Cell Biology (IMCB), Agency of Science, Technology and Research (A*STAR), Street, 138673, Singapore.
Jeffrey Chun Tatt Lim, Institute of Molecular and Cell Biology (IMCB), Agency of Science, Technology and Research (A*STAR), Street, 138673, Singapore.
Lei Sun, Program in Cardiovascular and Metabolic Disorders, Duke-NUS Medical School, 8 College Road, 169857, Singapore.
Joe Yeong, Institute of Molecular and Cell Biology (IMCB), Agency of Science, Technology and Research (A*STAR), Street, 138673, Singapore; Department of Anatomical Pathology, Singapore General Hospital, 20 College Road, 169856, Singapore.
Jin Liu, Program in Health Services & Systems Research, Duke-NUS Medical School, 8 College Road, 169857, Singapore.
References
- 1. Alon S, Goodwin DR, Sinha A, et al. Expansion sequencing: Spatially precise in situ transcriptomics in intact biological systems. Science 2021; 371(6528). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2. Besag J. Spatial interaction and the statistical analysis of lattice systems. J R Stat Soc B Methodol 1974; 36(2): 192–225. [Google Scholar]
- 3. Besag J. On the statistical analysis of dirty pictures. J R Stat Soc B Methodol 1986; 48(3): 259–79. [Google Scholar]
- 4. Bishop CM. Pattern recognition and machine learning. springer, 2006. [Google Scholar]
- 5. Burgess DJ. Spatial transcriptomics coming of age. Nat Rev Genet 2019; 20(6): 317–7. [DOI] [PubMed] [Google Scholar]
- 6. Chen J, Suo S, Tam PP, et al. Spatial transcriptomic analysis of cryosectioned tissue samples with geo-seq. Nat Protoc 2017; 12(3): 566–80. [DOI] [PubMed] [Google Scholar]
- 7. Cheung CCL, Goh D, Lim X, et al. Residual SARS-CoV-2 viral antigens detected in GI and hepatic tissues from five recovered patients with COVID-19. Gut 2021. [DOI] [PubMed] [Google Scholar]
- 8. Consortium TGO. The gene ontology resource: enriching a gold mine. Nucleic Acids Res 2021; 49(D1): D325–34. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9. Cuadra MB, Cammoun L, Butz T, et al. Comparison and validation of tissue modelization and statistical classification methods in T1-weighted MR brain images. IEEE Trans Med Imaging 2005; 24(12): 1548–65. [DOI] [PubMed] [Google Scholar]
- 10. Dries R, Zhu Q, Eng C-HL, et al. Giotto, a pipeline for integrative analysis and visualization of single-cell spatial transcriptomic data BioRxiv. 2019;701680.
- 11. Eng C-HL, Shah S, Thomassie J, et al. Profiling the transcriptome with RNA SPOTs. Nat Methods 2017; 14(12): 1153–5. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12. Franzén O, Gan L-M, Björkegren JL. PanglaoDB: a web server for exploration of mouse and human single-cell RNA sequencing data. Database 2019; 2019. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13. Hubert L, Arabie P. Comparing partitions. Journal of classification 1985; 2(1): 193–218. [Google Scholar]
- 14. Kim TH, Zhou X, Chen M. Demystifying “drop-outs” in single-cell UMI data. Genome Biol 2020; 21(1): 1–19. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15. Kriegel H-P, Schubert E, Zimek A. The (black) art of runtime evaluation: Are we comparing algorithms or implementations? Knowledge and Information Systems 2017; 52(2): 341–78. [Google Scholar]
- 16. Lee JS, Park S, Jeong HW, et al. Immunophenotyping of COVID-19 and influenza highlights the role of type I interferons in development of severe COVID-19. Science immunology 2020; 5(49). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17. Lohoff T, Ghazanfar S, Missarova A, et al. Highly multiplexed spatially resolved gene expression profiling of mouse organogenesis bioRxiv. 2020.
- 18. Lun AT, Bach K, Marioni JC. Pooling across cells to normalize single-cell RNA sequencing data with many zero counts. Genome Biol 2016; 17(1): 1–14. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19. Lyubimova A, Itzkovitz S, Junker JP, et al. Single-molecule mRNA detection and counting in mammalian tissue. Nat Protoc 2013; 8(9): 1743. [DOI] [PubMed] [Google Scholar]
- 20. Maynard KR, Collado-Torres L, Weber LM, et al. Transcriptome-scale spatial gene expression in the human dorsolateral prefrontal cortex. Nature Neuroscience, pages 2021; 1–12. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21. McCarthy DJ, Campbell KR, Lun AT, et al. Scater: pre-processing, quality control, normalization and visualization of single-cell RNA-seq data in R. Bioinformatics 2017; 33(8): 1179–86. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22. McInnes L, Healy J, Melville J. Umap: Uniform manifold approximation and projection for dimension reduction arXiv preprint arXiv:1802.03426. 2018.
- 23. Mills S. Histology for pathologists. Lippincott Williams & Wilkins, 2019. [Google Scholar]
- 24. Moffitt JR, Bambah-Mukku D, Eichhorn SW, et al. Molecular, spatial, and functional single-cell profiling of the hypothalamic preoptic region. Science 2018; 362(6416). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 25. Potts RB. Some generalized order-disorder transformations. In: Mathematical proceedings of the cambridge philosophical society, Vol. 48. Cambridge University Press, 1952, 106–9. [Google Scholar]
- 26. Qiu P. Embracing the dropouts in single-cell RNA-seq analysis. Nat Commun 2020; 11(1): 1–9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 27. Rand WM. Objective criteria for the evaluation of clustering methods. J Am Stat Assoc 1971; 66(336): 846–50. [Google Scholar]
- 28. Rodriques SG, Stickels RR, Goeva A, et al. Slide-seq: A scalable technology for measuring genome-wide expression at high spatial resolution. Science 2019; 363(6434): 1463–7. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 29. Schütze H, Manning CD, Raghavan P. Introduction to information retrieval, volume 39. Cambridge University Press Cambridge, 2008. [Google Scholar]
- 30. Shah S, Lubeck E, Zhou W, et al. In situ transcription profiling of single cells reveals spatial organization of cells in the mouse hippocampus. Neuron 2016; 92(2): 342–57. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 31. Ståhl PL, Salmén F, Vickovic S, et al. Visualization and analysis of gene expression in tissue sections by spatial transcriptomics. Science 2016; 353(6294): 78–82. [DOI] [PubMed] [Google Scholar]
- 32. Svensson V. Droplet scRNA-seq is not zero-inflated. Nat Biotechnol 2020; 38(2): 147–50. [DOI] [PubMed] [Google Scholar]
- 33. Tolpekin VA, Stein A. Quantification of the effects of land-cover-class spectral separability on the accuracy of markov-random-field-based superresolution mapping. IEEE transactions on geoscience and remote sensing 2009; 47(9): 3283–97. [Google Scholar]
- 34. Van der Maaten L, Hinton G. Visualizing data using t-SNE. Journal of machine learning research 2008; 9(11). [Google Scholar]
- 35. Vu TN, Wills QF, Kalari KR, et al. Beta-Poisson model for single-cell RNA-seq data analyses. Bioinformatics 2016; 32(14): 2128–35. [DOI] [PubMed] [Google Scholar]
- 36. Wang H, Li B, Leng C. Shrinkage tuning parameter selection with a diverging number of parameters. J R Stat Soc Series B Stat Methodology 2009; 71(3): 671–83. [Google Scholar]
- 37. Waylen LN, Nim HT, Martelotto LG, et al. From whole-mount to single-cell spatial assessment of gene expression in 3D. Communications biology 2020; 3(1): 1–11. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 38. Xia C, Babcock HP, Moffitt JR, et al. Multiplexed detection of RNA using MERFISH and branched DNA amplification. Sci Rep 2019; 9(1): 1–13. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 39. Zhao E, Stone MR, Ren X, et al. Spatial transcriptomics at subspot resolution with BayesSpace. Nature Biotechnology, pages 2021; 1–10. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 40. Zhu Q, Shah S, Dries R, et al. Identification of spatially associated subpopulations by combining scRNAseq and sequential fluorescence in situ hybridization data. Nat Biotechnol 2018; 36(12): 1183–90. [DOI] [PMC free article] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.