Skip to main content
Bioinformatics logoLink to Bioinformatics
. 2021 Jun 19;37(22):4129–4136. doi: 10.1093/bioinformatics/btab455

Bayesian modeling of spatial molecular profiling data via Gaussian process

Qiwei Li 1,✉,a, Minzhe Zhang 2,a, Yang Xie 3, Guanghua Xiao 4,
Editor: Inanc Birol
PMCID: PMC9502169  PMID: 34146105

Abstract

Motivation

The location, timing and abundance of gene expression (both mRNA and proteins) within a tissue define the molecular mechanisms of cell functions. Recent technology breakthroughs in spatial molecular profiling, including imaging-based technologies and sequencing-based technologies, have enabled the comprehensive molecular characterization of single cells while preserving their spatial and morphological contexts. This new bioinformatics scenario calls for effective and robust computational methods to identify genes with spatial patterns.

Results

We represent a novel Bayesian hierarchical model to analyze spatial transcriptomics data, with several unique characteristics. It models the zero-inflated and over-dispersed counts by deploying a zero-inflated negative binomial model that greatly increases model stability and robustness. Besides, the Bayesian inference framework allows us to borrow strength in parameter estimation in a de novo fashion. As a result, the proposed model shows competitive performances in accuracy and robustness over existing methods in both simulation studies and two real data applications.

Availability and implementation

The related R/C++ source code is available at https://github.com/Minzhe/BOOST-GP.

Supplementary information

Supplementary data are available at Bioinformatics online.

1 Introduction

Investigating the spatial organization of cells, together with their mRNA and protein abundances, is essential to delineating how cells from different origins form tissues with distinctive structures and functions. Traditional molecular profiling technologies require tissue-dissociation, which leads to the loss of the spatial context of gene expression, while traditional imaging technologies can only measure the levels of several markers. Thus, the spatial information of tissues and the high-throughput molecular profile, for a long time, were analyzed individually with little crosstalk.

Recent technology breakthroughs in spatial molecular profiling (SMP) empower the comprehensive molecular characterization at a high spatial resolution. There are two major types of SMP techniques: imaging-based methods based on the single-molecule fluorescence in situ hybridization (FISH), such as sequential FISH (seqFISH) (Eng et al., 2019; Lubeck et al., 2014) and multiplexed error-robust FISH (MERFISH) (Chen et al., 2015), and sequencing-based methods, such as spatial transcriptomics sequencing (STS) (Ståhl et al., 2016) and slide-sequencing (Rodriques et al., 2019). The overview of their workflows is shown in Figure 1a. We could view the imaging-orientated approaches as regular FISH imaging integrated with multi-round re-hybridization, which allows multiplexing. The sequencing-orientated approaches can be regarded as traditional sequencing technology with additional spatial barcoding to restore spatial information. We recently published a review paper to discuss technical details on SMP platforms (Zhang et al., 2021). In all, these SMP technologies combine the spatial information of tissues and the high-throughput molecular profile, enabling mapping and measuring the gene expression of thousands of cells over a tissue slide simultaneously.

Fig. 1.

Fig. 1.

Flowchart of an SMP study. (a) Hybridization and quantification phases in imaging-based SMP technologies (left) and barcoding and sequencing phases in sequencing-based SMP technologies (right). (b) An example of molecular profile Y (left) and an example of geospatial profile T (right). (c) Examples of molecular spatial distributions to an SV gene and a non-SV gene

Many new questions can be studied with this powerful new technology available. One of the most immediate ones is to identify genes whose expressions display spatially correlated patterns, which we refer to as spatially variable (SV) genes (see an example in Fig. 1c). Such genes may reflect the tissue heterogeneity and underlying tissue structure that drive the differentiated expression across different spatial locations. Thus, they are potentially significant and may lead to new biological insights. Statistically, identifying SV genes is a new challenge to test the association between gene-expression levels and their spatial coordinates with a null hypothesis of spatial invariance. Several methodologies have been recently proposed for this task. SpatialDE (Svensson et al., 2018) and SPARK (Sun et al., 2020) are two that adopt a Gaussian process (GP), which serves as a natural fit for this problem because of its ability to model temporal (Roberts et al., 2013) or spatial (Diggle et al., 1998) dependence via a pre-specified kernel. They share the same geostatistical modeling framework, with SPARK being more explicit in modeling count data, sample normalization and P-value calibration. Trendsceek (Edsgärd et al., 2018) is another method that models the spatial gene expression of cells as a realization marked point process. Its idea is to calculate several summary statistics for all pairwise points with respect to their distances and evaluate the significance under null distribution generated with random permutations.

All of these methods are for identifying SV genes in SMP data, while still leaving several issues unsolved. First, SpatialDE and Trendsceek transformed count data, which do not truly reflect the underlying data generative mechanism. Although SPARK used a Poisson distribution for modeling counts, the simple mean-variance relationship may not be sufficient in accounting for the over-dispersion observed in real sequencing data. Second, typical SMP data contain a large proportion of zero counts (e.g. 60–99%), and none of the mentioned methods properly addressed the zero count, which may largely reduce the statistical power. Table 1 shows a brief summary of four widely analyzed SMP cohorts. Last but not least, SpatialDE and SPARK both employ GP to estimate spatial covariance but only at certain predefined length-scales of the spatial kernels, which means the estimation is only an approximated solution.

Table 1.

A summary of four widely analyzed SMP cohort data

Name Technique No. of datasets No. of dimensions K No. of locations n No. of genes p zero prop. (%) Reference
Mouse olfactory bulb (MOB) STS 12 replicates 2 231–282 spots 15 284–16 675 89–90 Ståhl et al. (2016)
Human breast cancer (BC) STS 4 layers 2 251–264 spots 14 789–14 929 60–79 Ståhl et al. (2016)
Mouse hypothalamus MERFISH 31 individuals 2 4877–6000 cells 160–161 59–68 Moffitt et al. (2018)
Mouse hippocampus seqFISH 21 fields 3 97–362 cells 249 1–18 Shah et al. (2016)

Note: STS, spatial transcriptomics sequencing; MERFISH, multiplexed error-robust fluorescence in situ hybridization; seqFISH, sequential fluorescence in situ hybridization.

To address the issues mentioned above, we developed Bayesian mOdeling Of Spatial Transcriptomics data via Gaussian Process (BOOST-GP), which integrates the GP to capture the spatial correlation. Note that although the name only includes spatial transcriptomics data, the proposed method can be applied to analyze other SMP data without modification. BOOST-GP directly models count data using a negative binomial (NB) distribution, which was also adopted by widely used RNA-seq analysis tools, such as DESeq2 (Love et al., 2014) and edgeR (Robinson et al., 2010) to account for the over-dispersion observed in real sequencing data. Previous studies in single-cell RNA-seq (scRNA-seq) data analysis have shown that accounting for the large proportion of zero counts in the model significantly improves the model fitting and accuracy in identifying differentially expressed genes (Finak et al., 2015; Kharchenko et al., 2014; Lun et al., 2016). One novelty of this study is to explicitly accounts for the excess zeros through a zero-inflated negative binomial (ZINB) model. Furthermore, it uses a Bayesian inference framework to improve parameter estimations and quantify uncertainties. We demonstrated the advantages of BOOST-GP in robust inference across various spatial patterns and minimally affected performance when data contains zero-inflation in the simulation study. We also applied BOOST-GP to two real spatial transcriptomics datasets, and BOOST-GP shows an outstanding sensitivity–specificity balance compared to existing methods.

The rest of this article is arranged as follows. In Section 2, we formulate the probabilistic generative model and explore the parameters structure. In Section 3, we present the schematic procedure and demonstrate the Markov chain Monte Carlo (MCMC) algorithms. In Sections 4 and 5, we evaluate BOOST-GP via both simulated and real data. The last section concludes the article and proposes some potential future work.

2 Model

In this section, we present BOOST-GP for identifying SV genes in SMP data. Supplementary Figure S1 and Table S1 summarize the graphical and hierarchical formulation of the proposed model, respectively.

Before introducing the main components, we summarize the observed data as follows. Let a n-by-p count matrix Y denote the molecular profile generated by a spatially resolved transcriptomics technology based on either a sequencing or imaging-based method. The former can measure tens of thousands of genes on spatial locations consisting of a couple of hundred single cells (i.e. pn), while the latter directly identifies a large number of individual cells but is only able to measure hundreds of genes simultaneously (i.e. pn). Each entry yijN,i=1,,n,j=1,,p is the read count observed in sample i (with known location) for gene j. We use the notation yj to denote the j-th column of Y, which represents all read counts belonging to the same gene j. Let a n-by-2 matrix T denote the related geospatial profile, where each row ti=(ti1,ti2)R2 gives the coordinates in a compact subset of the 2D Cartesian plane for sample i. Illustration examples of Y and T are provided in Figure 1b.

2.1 Modeling sequence count data via a ZINB model

In addition to over-dispersion, sequence count data suffer from zero-inflation, especially when the sequence depth is not enough. For instance, the real data analyzed in Marioni et al. (2008) and Witten et al. (2010) contain 40–50% zeros of all numerical values. scRNA-seq data usually have a higher proportion of zero read counts, compared with bulk RNA-seq data (e.g. Finak et al., 2015; Kharchenko et al., 2014; Lun et al., 2016). In general, zero-inflation arises for both biological reasons (e.g. subpopulations of cells or transient states where a gene is not expressed) and technique reasons (e.g. dropouts, where a gene is expressed but not detected through sequencing due to limitation of the sampling effort).

Accommodating toward these characteristics, we start by considering a ZINB model to model the read counts,

yij|πi,λij,ϕjπiI(yij=0)+(1πi)NB(siλij,ϕj), (1)

where we use I(·) to denote the indicator function and NB(μ,ϕ),μ,ϕ>0 to denote a NB distribution with expectation μ and dispersion 1/ϕ. Here, we constrain one of the two mixture kernels to be degenerate at zero, thereby allowing for zero-inflation. The sample-specific parameter πi(0,1) can be viewed as the proportion of extra zero (i.e. false zero or structural zero) counts in sample i. With this NB parameterization, the p.m.f. is written as Γ(y+ϕ)y!Γ(ϕ)(ϕμ+ϕ)ϕ(μμ+ϕ)y, with the variance Var(Y)=μ+μ2/ϕ, thus allowing for over-dispersion. A small value of ϕ indicates a large variance to mean ratio, while a large value approaching infinity reduces the NB model to a Poisson model with the same mean and variance.

The NB mean is decomposed of two multiplicative effects, the size factor si and the normalized expression level λij. The collection s=(s1,,sn) reflects many nuisance effects across samples, including but not limited to (i) reverse transcription efficiency; (ii) amplification and dilution efficiency; and (iii) sequencing depth. Once the global sample-specific effect is accounted for, λij can be interpreted as the normalized expression level of gene j observed at sample i. Note that such a multiplicative characterization of the NB of Poisson mean is typical in both the frequentist (e.g. Cameron and Trivedi, 2013; Li et al., 2012; Witten, 2011) and the Bayesian literature (e.g. Airoldi and Bischof, 2016; Banerjee et al., 2014) to justify latent heterogeneity and extra over-dispersion in multivariate count data. To ensure identifiability between these two classes of parameters, we follow Sun et al. (2020) to set si proportional to the summation of the total number of read counts across all genes for sample i, combined with a constraint of i=1nsi=1. It results in si=j=1yij/i=1nj=1yij. If the main interest is in the absolute gene-expression level (Li et al., 2019), si’s can be set to 1. Our modeling approach yields a de-noised version of gene-expression data, i.e. λij’s, after characterizing zero-inflation (via πi’s), over-dispersion (via ϕj’s) and sample heterogeneity (via si’s). Note that although we fix the values of si’s, it is possible to use a regularizing prior specification (Li et al., 2017) on si’s that allows flexible modeling of count data.

We rewrite model (1) by introducing a latent indicator variable ηij,

yij|ηij,λij,ϕj{0 if ηij=1NB(yij;siλij,ϕj) if ηij=0ηijBern(πi). (2)

The independent Bernoulli prior assumption can be further relaxed by formulating a Be(aπ,bπ) hyperprior on πi, leading to a beta-Bernoulli prior of ηij with expectation aπ/(aπ+bπ). Setting aπ=bπ=1 results in a weakly informative prior on πi. For all dispersion parameters ϕj’s, we assume a gamma distribution, i.e. ϕjGa(aϕ,bϕ). We recommend small values, such as aϕ=bϕ=0.001, for a weakly informative setting.

2.2 Identifying SV genes via a geostatistical mixture model

We incorporate the covariates and spatial random process into the model construction by specifying a geostatistical model (Gelfand and Schliep, 2016) for the normalized expression level of gene j for sample i,

logλij=xiβj+wj(ti), (3)

where xi is a R-dimensional column vector of covariates that includes a scalar of one for the intercept and R—1 measurable explanatory variables for sample i at location ti. These explanatory variables could be cell types, tissue microenvironment, cell-cycle information and other information that might be important to adjust for during the analysis. βj is a R-dimensional column vector of coefficients that includes an intercept representing the mean log-expression of gene j across spatial locations. We denote the collection (wj(t1),,wj(tn)) by wj(T) and assume wj(T) is a gene-specific zero-mean stationary GP, modeling the spatial correlation pattern among spatial locations through the covariance σj2K(T) in a multivariate normal distribution, where σj2 is a scaling factor and the kernel K(T) is a positive definite matrix with each diagonal entry being one and each off-diagonal entry being a function of the relative position (e.g. Euclidean distance) between each pair of locations, kij(||titj||)[0,1).

The choice of an appropriate kernel function is of critical importance in spatial modeling. A comprehensive overview of many covariance functions can be found in Chapter 4 of Williams and Rasmussen (2006). The use of white noise kernel K=I indicates that i.i.d. noise, multiplied by σj2, is added to the mean log-expression of gene j determined by the covariates. Consequently, no spatial correlation w.r.t. underlying gene-expression levels should be observed across the space T. In contrast, a SV gene is the one that displays significant spatial expression pattern, which is usually defined by a squared exponential (SE) kernel (e.g. Sun et al., 2020; Svensson et al., 2018), where kij=exp(||titj||2/2lj2) with the gene-specific characteristic length-scale denoted by lj. A large value of lj encourages the underlying expression levels of distant spots or cells become more correlated.

To identify a subset of gene that is SV across locations, we postulate the existence of a latent binary vector γ=(γ1,,γp), with γj=1 if gene j is an SV gene, and γj=0 otherwise. Formulating this assumption, we can rewrite model (3),

logλj|γj,lj{MVN(Xβj,σj2K) if γj=1MVN(Xβj,σj2I) if γj=0. (4)

A common choice for the prior of the binary latent vector γ is a product of Bernoulli distributions on each individual component with a common hyperparameter ω, i.e. γjBern(ω). It is equivalent to a binomial prior on the number of SV genes, i.e. pγ=j=1pγjBin(p,ω). The hyperparameter ω can be elicited as the proportion of genes expected a priori to be SV. This prior assumption can be further relaxed by formulating a Be(aω,bω) hyperprior on ω, which leads to a beta-binomial prior on pγ with expectation paω/(aω+bω). We impose a constraint of aω+bω=2 for a vague prior of ω (Tadesse et al., 2005).

Taking a conjugate Bayesian approach, we impose a normal prior on each coefficient in βj and an inverse-gamma (IG) prior on σj2; i.e. βj|σj2MVN(0,hσj2I) and σj2IG(aσ,bσ). This parameterization setting is standard in most Bayesian normal models. It allows for creating a computationally efficient feature selection algorithm by integrating out the GP mean and covariance scaling factor. The integration leads to marginal non-standardized multivariate student’s t-distributions (MVT) on logλj. Consequently, we can write the collapsed version of model (4)

logλj|γj,lj{MVT2aσ(0,bσaσ(K1K1XG1XK1)1) if γj=1MVT2aσ(0,bσaσ(IX(XX+I/h)1X)1) if γj=0, (5)

where G=XK1X+I/h. To complete the model specification, we choose a uniform prior ljU(al,bl). Following Sun et al. (2020), we suggest the value of al and bl set to be tmin/2 and 2tmax, where tmin and tmax are the minimum and maximum value of the non-zero Euclidean distances across all pairs of spatial locations, respectively.

3 Model fitting

3.1 MCMC algorithm

We use the MCMC algorithm to sample from the posterior distribution. We update the false zero indicator ηij’s using Gibbs sampler and remaining parameters using a Metropolis–Hastings algorithm. We note that this algorithm is sufficient to guarantee ergodicity for our model. See the details in the Supplementary Material.

3.2 Posterior inference

Our primary interest lies in the identification of SV genes via the selection vector γ. For each gene, the null and alternative hypotheses are M0:γj=0 and M1:γj=1. We could select the model via calculating the Bayes factor (BF) in favor of M1 over M0, which is defined as the ratio of posterior odds to prior odds. The latter equals aω/bω, while the former can be approximated using the MCMC samples {γj(1),,γj(U)}, where U denotes the total number of iterations after burn-in. Thus,

BFj=p(M1|·)p(M0|·)/p(M1)p(M0)u=1Uγj(u)/Uu=1U(1γj(u))/U/aωbω.

The larger the BFj, the more likely gene j is an SV gene, integrating over the uncertainty in all model parameters. We suggest choosing the BF threshold based on the scale for interpretation (Kass and Raftery, 1995).

In addition to BF, we could summarize the posterior distribution of γ via maximum-a-posteriori estimates,

γMAP=argmaxγi=1nj=1pf(yij|ηij,λij,ϕj)p(logλj|γj(u),lj)p(γj(u)).

A more comprehensive summarization of γ is based on their marginal posterior probabilities of inclusion (PPI), where PPIj=u=1Uγj(u)/U. Then, SV genes are selected if their PPIs are greater than a threshold c,

γPPI=(I(PPI1c),,I(PPIpc)).

We suggest choosing the threshold that controls for multiplicity (Newton et al., 2004), which guarantees the expected Bayesian false discovery rate (BFDR) to be smaller than a value. The BFDR is calculated as follows,

BFDR(c)=j=1p(1PPIj)I(1PPIj<c)j=1pI(1PPIj<c),

where BFDR(c) is the desired significance level.

4 Simulation study

We used simulated data to assess the performance of BOOST-GP and compare it with that of three alternatives, i.e. Trendsceek (Svensson et al., 2018), SpatialDE (Edsgärd et al., 2018) and SPARK (Sun et al., 2020). We also carried out a sensitivity analysis reported in the Supplementary Material.

We followed the data generative schemes originated by Edsgärd et al. (2018) and Sun et al. (2020) based on two artificially generated spatial patterns and two real spatial patterns in mouse olfactory bulb (MOB) and human breast cancer (BC) data. The first two namely spot and linear patterns, which are shown in Figure 2a and b, were on a 16-by16 square lattice (n = 256 spots). The MOB and BC patterns, which are shown in Figure 2c and d, were on n = 260 and 250 spots, respectively. We simulated p =100 genes, among which 15 were SV genes while the rest were non-SV genes. For gene j at spot i, its latent normalized expression level on a logarithmic scale was the sum of three components, log(λij)=βj+ei+ϵij, where βj is the gene-specific baseline, ei is the spot-specific fold-change between SV and non-SV genes and ϵij is the non-spatial random error. In our simulation study, we set the baseline βj=2 and assumed ϵijN(0,0.32). The spatial patterns were embedded in the construction of ei’s. For a non-SV gene, we set ei=0,i=1,,n so that the latent normalized expression levels were i.i.d. from a log-normal distribution with mean and variance being 2 and 0.32, respectively. Consequently, no spatial correlation should be observed. For an SV gene with the spot pattern, the values of ei’s of the four center spots with coordinates (8, 8), (8, 9), (9, 8) and (9, 9), were set to log6, while all others were linearly decreased to zero within the radius of five spots. For an SV gene with the linear pattern, the value of ei of the most bottom-left spot with coordinate (1, 1), was set to log6, while all others were linearly decreased to zero along the diagonal line. For an SV gene with the real patterns, each spot was categorized into two groups. We set ei = 0 for those spots from the low expression group, while altering its value to log3 for those spots from the high expression group. The difference in ei’s between the two groups of spots thus introduced spatial differential expression patterns. Then, we simulated each gene-expression count data yij from an NB distribution where its mean was a product of the latent normalized expression level λij and the size factor si. We sampled si’s from a log-normal distribution with mean zero and variance 0.22. The gene-specific NB dispersion parameters were sampled from an exponential distribution with mean 10. Furthermore, to mimic the excess zeros observed in the real data, we randomly chose 30% spots and forced their counts to be zero. Combined with the four patterns (i.e. spot, linear, MOB and BC) and two count generating models (i.e. ZINB with 0% and 30% false zeros), there were 4 × 2 = 8 scenarios in total. For each of the scenarios, we repeated the above steps to generate ten replicates.

Fig. 2.

Fig. 2.

The four spatial patterns used in the simulation study: (a) and (b) Two artificially generated patterns; (c) and (d) two binary patterns that were summarized on the basis of SV genes identified by SPARK in MOB and human BC data, respectively

For prior specification of BOOST-GP, we recommended and used the following default settings. The hyperparameters that controlled the percentage of false zeros a priori were set to πBe(aπ=1,bπ=1). As for the gamma priors on the NB dispersion and GP characteristic length-scale parameters, i.e. ϕjGa(aϕ,bϕ) and ljGa(al,bl), we set aϕ,bϕ, al and bl to 0.001, which led to a vague distribution with mean and variance equal to 1 and 1000. We set the hyperparameters that control the selection of SV genes, ωBe(aω=0.1,bω=1.9), resulting in the proportion of SV genes expected a priori to be aω/(aω+bω)=5%. As for the IG priors on the variance components σj2, we set aσ=3 to achieve a distribution with an infinite variance. We further set bσ=1 and h =10 suggested by the sensitivity analysis results, which is shown in Supplementary Table S2. As for the algorithm setting, we ran four independent MCMC chain with 2000 iterations, discarding the first 50% sweeps as burn-in. We started each chain from a model by setting all genes to be non-SV and randomly drawing all other model parameters from their prior distribution. Results we report below were obtained by pooling together the MCMC outputs from the four chains. All experiments were implemented in R with Rcpp package to accelerate computations.

To quantify the accuracy of identifying SV genes via the binary vector γ, we consider two widely used measures of the quality of binary classifiers: (i) area under the curve (AUC) of the receiver operating characteristic; and (ii) Matthews correlation coefficient (MCC) (Matthews, 1975). The former considers both true positive (TP) and false positive (FP) rates across various threshold settings, while the latter balances TP, FP, true negative (TN) and false negative (FN) counts even if the true zeros and ones in γ are of very different sizes.

MCC=(TP×TNFP×FN)(TP+FP)(TP+FN)(TN+FP)(TN+FN).

The number of SV genes is usually assumed to be a small fraction of the total. Thus, MCC is more appropriate to handle such an imbalanced scenario. Note that the AUC yields a value between 0 to 1 that is averaged by all possible thresholds used to select discriminatory features based on PPIs or P-values, and the MCC value ranges from –1 to 1 to pinpoint a specified threshold. The larger the index, the more accurate the inference.

Figure 3 displays the boxplot of AUCs by different methods over 10 replicated datasets under the 15 scenarios. We can see that BOOST-GP, SPARK and SpatialDE had similar power when there were no false zeros in the simulated data. The Markcorr and Markvario of Trendsceek also had close performance in the spot and linear patterns, but performed worse in the other two patterns. The Emark and Vmark of Trendsceek did not generated satisfactory results among any of the four patterns. Because BOOST-GP, SPARK and SpatialDE are all GP-based methods, it was not surprising that they had similar performance in the non-zero-inflation scenarios. While under the setting with a high proportion of false zeros, BOOST-GP clearly stood out and still maintained an almost unaffected performance, suggesting that realistic modeling (i.e. accounting for zero-inflation) contributes to the advantage over SPARK and SpatialDE. Meanwhile, all other methods suffered from reduced power, indicating the variance brought by an excess of zero count not being properly resolved. We further evaluated all methods in terms of MCC. To control for the type-I error rate, we adjusted P-values from SPARK, SpatialDE and Trendsceek using the Benjamini–Hochberg method (Benjamini and Hochberg, 1995) and chose a significance level of 0.05 to select SV genes. For BOOST-GP, we chose the PPI threshold corresponding to a BFDR of 5% to select SV genes. Supplementary Table S2 shows the averaged MCCs of all methods in all scenarios. BOOST-GP had very stable performance among all the four patterns. The performance of SpatialDE and SPARK were only good in some patterns. Trendsceek was not able to report any SV genes. This was probably due to the P-value calculation procedure of Trendsceek, which highly depends on the permutation times. Under the zero-inflation settings, SpatialDE and SPARK failed the task in most of the patterns. BOOST-GP still had reasonable performance though the accuracy dropped.

Fig. 3.

Fig. 3.

Simulation study: The boxplots of AUCs achieved by BOOST-GP, SPARK, SpatialDE and Trendsceek under different scenarios in terms of spatial pattern and count generating process

Regarding the efficiency, the average execution time per gene was 0.171, 0.003 and 4.413 s for SPARK, SpatialDE and Trendsceek. Note that Trendsceek conducted four tests (i.e. Emark, Markcorr, Markvario and Vmark) together, so it took longer. BOOST-GP spent 10.320 s per gene on average due to the computationally intensive MCMC algorithms, even though it was written by Rcpp to accelerate computation. All experiments were implemented on a high performance computing server with two Intel Xeon CPUs (45 MB cache and 2.10 GHz) and 250 GB memory.

5 Real data analysis

We applied BOOST-GP to analyze two published SMP datasets generated by STS, the MOB and human BC, accessible on Spatial Research Lab (https://www.spatialresearch.org/). These data consist of gene-expression measurements in the form of read counts collected on a number of spatial locations known as spots. There are 12 replicates and 4 layers available for MOB and BC cohorts, respectively. Following the previous studies (Sun et al., 2020; Svensson et al., 2018), we used the MOB replicate 11, which contains 16 218 genes measured on 262 spots, and the BC layer 2, which contains 14 789 genes measured on 251 spots. We applied the same quality control steps suggested by Sun et al. (2020) to remove outlier genes and spots. After filtering out those spots with fewer than 10 total counts across all genes and those genes with more than 90% zero read counts across all spots, the MOB data had p =11 274 genes and n = 260 spots, while the BC data had p = 5262 genes and n = 250 spots. We applied the same prior specification, algorithm setting and significance criteria that we used in the simulation. Because of the poor performance of Trendsceek in simulation studies, we did not include it in real data analysis.

We checked the MCMC algorithm convergence based on the PPI vector of γ. We calculated the PPIs for all the four chains and found their pairwise Pearson correlation coefficients ranged from 0.934 to 0.988, suggesting good MCMC convergence. We then aggregated the outputs of all chains. Figure 4 shows the overlap of identified SV genes by SpatialDE, SPARK and BOOST-GP methods in MOB and BC datasets. SPARK was shown to be the most aggressive method that reported the most SV genes, while SpatialDE was the most conservative one and reported the fewest SV genes. BOOST-GP lay in between SpatialDE and SPARK, with a lot in common with them, but also uniquely identified genes. Because SpatialDE was almost always a subset of SPARK, we focused on the comparison between BOOST-GP and SPARK in the following analysis. We then did hierarchical clustering on the SV genes identified by BOOST-GP and SPARK, and plotted the averaged normalized expression levels within each cluster to represent each pattern.

Fig. 4.

Fig. 4.

Real data analysis: the Venn diagram of SV genes identified by BOOST-GP, SPARK, and SpatialDE in (a) MOB and (b) human BC data

In order to have a more detailed comparison and not overshadow any subtle patterns, we clustered those SV genes of MOB, as shown in Figure 5a and b, into six and four groups. We could see that in MOB data, the patterns detected by BOOST-GP and SPARK resemble each other well. More specifically, the first to sixth patterns can be further merged, leading to three major patterns, which was consistent with what was reported by Sun et al. (2020). This suggested that although BOOST-GP detected fewer SV genes than SPARK, it did not miss any major patterns. In BC data, the patterns summarized by the two methods showed more distinctiveness, especially the first pattern. The remaining three patterns could still largely overlap each other, though not perfectly. This agreed with the Venn diagram that more discrepancy was found between BOOST-GP and SPARK in BC data compared to the MOB data.

Fig. 5.

Fig. 5.

Real data analysis: (a) and (b) the averaged normalized gene-expression levels of SV genes identified by BOOST-GP and SPARK in (a) MOB and (b) human BC data, respectively. (c) and (d) The associated hematoxylin and eosin-stained tissue slides of MOB and BC data, respectively

To have an additional diagnosis of the spatial signals of the detected SV genes, we employed Moran’s I spatial autocorrelation test (Li et al., 2007; Moran, 1950) (detailed in the Supplementary Material). Its values range from –1 to 1. The larger the Moran’s I, the stronger the spatial correlation, which means clustering spots with similar normalized expression levels. Supplementary Figure S2 displays three binary spatial pattern examples with different values of Moran’s I. Figure 6a summarized the Moran’s I’s of the common SV genes found by both methods and SV genes found by individual method alone. Spatial autocorrelation was shown to be the strongest in the common SV gene set, followed by the BOOST-GP only set and then the SPARK only set. The result implies: (i) SV genes identified by both methods are more likely to be true SV genes and (ii) SPARK were in general more aggressive to report more candidates but also with lower average signals. This might be because SPARK considered two types of kernels (SE and periodic) and used five characteristic length-scales for each kernel (i.e. 10 kernels in total), which caused more versatility. Moreover, SPARK did not take zero-inflation into account, which tended to report more FP discoveries.

Fig. 6.

Fig. 6.

Real data analysis: (a) the boxplots of Moran’s I’s of SV genes identified by both BOOST-GP and SPARK, BOOST-GP only and SPARK only. (b) and (c) The averaged normalized gene-expression levels of SV genes identified by BOOST-GP only and SPARK only in MOB and human BC data, respectively

In parallel, we also performed hierarchical clustering on these three sets of genes. In MOB data, 680 genes identified by SPARK only could be clustered into four groups, as shown in Figure 6b, same as the ones previously described in Figure 5a, while the 45 genes identified by BOOST-GP only mainly belonged to two patterns, as shown in Figure 6b. This result suggested SPARK might be more sensitive in MOB data, as it is able to find some weak signal genes, but in the meantime, according to Figure 6a, some of those could potentially be FPs that lead to the dilution of Moran’s I’s. In BC data, 372 genes identified by SPARK only showed both much weaker spatial signals and less spatial patterns compared to 108 genes identified by BOOST-GP only, as shown in Figure 6c, suggesting the SPARK method potentially scarified specificity while it did not capture TPs as well.

Last, we conducted gene ontology (GO) enrichment analysis using a Python wrapper GSEAPY (Kuleshov et al., 2016; Subramanian et al., 2007) to see whether those detected SV genes by BOOST-GP were related to certain biological functions. GSEAPY can be accessible on https://gseapy.readthedocs.io/en/latest/. We performed this on 2941 mouse and 2338 human GO terms of biological processes, which had at least one gene overlap with detected SV gene pools. Using the adjusted P-value smaller than 0.05 as the threshold, we found 148 and 197 enriched biological process GO terms in MOB and BC datasets. We took a closer look at those terms that were associated with the synaptic and nervous system for MOB data, as they play important roles in synaptic organization and nerve development, which is expected to be involved in olfactory functioning. As showed in Figure 7, many of those synaptic- and nervous-related GO terms were significantly enriched, suggesting the functions of those detected SV genes were aligned with the underlying mechanism that drove the spatially differentiated expression pattern. This was consistent with what was reported in Sun et al. (2020). For BC data, the terms related to extracellular matrix and immune response, as expected, were found to be significantly enriched in the identified SV genes, as they are over-activated in BC. Furthermore, five ERBB-related pathways were found to be significant in our analysis. ERBB family genes, especially ERBB2, are well-known oncogenes in BC that have a close relationship with altered signaling pathways and tumorigenesis and development (Stern, 2000). Three of them were displayed in Figure 7. However, none of them were found significant in the SPARK analysis result [e.g. the adjusted P-value for ERBB2 signaling pathway (GO: 0038128) was 0.239]. We went back and had a deeper check. We found the reason was some genes involved in ERBB pathway were significant in BOOST-GP result but only marginally significant in SPARK (e.g. the adjusted P-value and PPI of gene UBB were 9.22×1016 and 0.650 reported by BOOST-GP, the corresponding adjusted P-value for SPARK and SpatialDE were 0.053 and 0.857). It echoed with the previous clustering and autocorrelation analysis, where BOOST-GP had more reliable results in BC data than SPARK.

Fig. 7.

Fig. 7.

Real data analysis: GO enrichment analysis of SV genes identified by BOOST-GP in (a) MOB and (b) human BC data. The red dashed line indicates a significant level of 0.05

6 Conclusion

Recently, the emerging SMP technologies demonstrated great potential in revealing tissue spatial organization and heterogeneity. In this study, we developed a novel Bayesian hierarchical model that incorporates a GP model to identify SV genes. It can also characterize the embedded temporal or spatial patterns in high-resolution time-series RNA-Seq data (e.g. Owens et al., 2016) or three-dimensional SMP data (e.g. Shah et al., 2016). Compared to other existing methods, it has several advantages. First, it directly models count data with an NB distribution that could accommodate the over-dispersion compared to the Poisson distribution. Second and most importantly, it properly addresses the excess of zero count problem, which is typically observed in the SMP data, using a zero-inflation model. Third, it improves the estimation of the kernel length-scale parameter using a Bayesian approach, leading to more stable and accurate results. In simulation studies, BOOST-GP had similar power in identifying SV genes compared with SpatialDE and SPARK, and much better than Trendsceek in general spatial settings. It greatly outperformed all other methods when false zeros were present in the data. In real data, BOOST-GP and SPARK both had reasonable performance and had a lot in common, while SpatialDE was a bit conservative. The performance of different methods may vary in different datasets with different inherent spatial patterns and data structures. SPARK seemed to be more sensitive in MOB data with some sacrifice in precision. BOOST-GP performed better in human BC data, identifying undiscovered spatial patterns and meaningful enriched pathways. In general, SPARK was a more aggressive method under the selected significance cutoff, but at the same time bearing a higher risk of discovering more FPs. Because BOOST-GP reports both marginal PPI and BFs (which could be converted to P-values under some assumption) as measures of significance, users have more freedom to balance the sensitivity and specificity on their own considerations depending on the study needs.

Several extensions of our model are worth investigating. Firstly, with a Bayesian framework, BOOST-GP can be further extended to incorporate pathway information as prior knowledge to take into account the regulatory relationships between genes to perform a joint estimation. Secondly, the choice of an appropriate kernel function is of critical importance in spatial modeling (Rasmussen, 2003). We used only the SE kernel in BOOST-GP. However, the related covariance function is infinitely differentiable, making a spatial pattern too smooth to be realistic. In the future, we would consider the Matérn and rational quadratic kernels suggested by Stein (2012), which are suitable for both weak and strong smoothness assumptions.

Supplementary Material

btab455_Supplementary_Data

Acknowledgements

The authors would like to thank Jessie Norris for helping us in proofreading the manuscript.

Funding

This work was supported by the National Institutes of Health (NIH) [R35GM136375, P30CA142543, 1R01GM140012, P50CA70907 and 1R01GM115473]; and the Cancer Prevention and Research Institute of Texas (CPRIT) [RP190107, RP180805].

Conflict of Interest: none declared.

Contributor Information

Qiwei Li, Department of Mathematical Sciences, The University of Texas at Dallas, Richardson, TX 75080, USA.

Minzhe Zhang, Quantitative Biology Research Center, Department of Population and Data Sciences, The University of Texas Southwestern Medical Center, Dallas, TX 75390, USA.

Yang Xie, Quantitative Biology Research Center, Department of Population and Data Sciences, The University of Texas Southwestern Medical Center, Dallas, TX 75390, USA.

Guanghua Xiao, Quantitative Biology Research Center, Department of Population and Data Sciences, The University of Texas Southwestern Medical Center, Dallas, TX 75390, USA.

References

  1. Airoldi E.M., Bischof J.M. (2016) Improving and evaluating topic models and other models of text. J. Am. Stat. Assoc., 111, 1381–1403. [Google Scholar]
  2. Banerjee S.  et al. (2014) Hierarchical Modeling and Analysis for Spatial Data. CRC Press, Boca Raton, USA. [Google Scholar]
  3. Benjamini Y., Hochberg Y. (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. Series B Stat. Methodol., 57, 289–300. [Google Scholar]
  4. Cameron A.C., Trivedi P.K. (2013) Regression Analysis of Count Data. Vol. 53. Cambridge University Press, Cambridge, UK. [Google Scholar]
  5. Chen K.H.  et al. (2015) Spatially resolved, highly multiplexed RNA profiling in single cells. Science, 348, aaa6090. [DOI] [PMC free article] [PubMed] [Google Scholar]
  6. Diggle P.J.  et al. (1998) Model-based geostatistics. J. R. Stat. Soc. Series C Appl. Stat., 47, 299–350. [Google Scholar]
  7. Edsgärd D.  et al. (2018) Identification of spatial expression trends in single-cell gene expression data. Nat. Methods, 15, 339–342. [DOI] [PMC free article] [PubMed] [Google Scholar]
  8. Eng C.-H.L.  et al. (2019) Transcriptome-scale super-resolved imaging in tissues by RNA seqFISH+. Nature, 568, 235–239. [DOI] [PMC free article] [PubMed] [Google Scholar]
  9. Finak G.  et al. (2015) MAST: a flexible statistical framework for assessing transcriptional changes and characterizing heterogeneity in single-cell RNA sequencing data. Genome Biol., 16, 278. [DOI] [PMC free article] [PubMed] [Google Scholar]
  10. Gelfand A.E., Schliep E.M. (2016) Spatial statistics and Gaussian processes: a beautiful marriage. Spat. Stat., 18, 86–104. [Google Scholar]
  11. Kass R.E., Raftery A.E. (1995) Bayes factors. J. Am. Stat. Assoc., 90, 773–795. [Google Scholar]
  12. Kharchenko P.V.  et al. (2014) Bayesian approach to single-cell differential expression analysis. Nat. Methods, 11, 740–742. [DOI] [PMC free article] [PubMed] [Google Scholar]
  13. Kuleshov M.V.  et al. (2016) Enrichr: a comprehensive gene set enrichment analysis web server 2016 update. Nucleic Acids Res., 44, W90–W97. [DOI] [PMC free article] [PubMed] [Google Scholar]
  14. Li H.  et al. (2007) Beyond Moran’s I: testing for spatial dependence based on the spatial autoregressive model. Geogr. Anal., 39, 357–375. [Google Scholar]
  15. Li J.  et al. (2012) Normalization, testing, and false discovery rate estimation for RNA-sequencing data. Biostatistics, 13, 523–538. [DOI] [PMC free article] [PubMed] [Google Scholar]
  16. Li Q.  et al. (2017) A Bayesian mixture model for clustering and selection of feature occurrence rates under mean constraints. Stat. Anal. Data Min., 10, 393–409. [Google Scholar]
  17. Li Q.  et al. (2019) Bayesian negative binomial mixture regression models for the analysis of sequence count and methylation data. Biometrics, 75, 183–192. [DOI] [PubMed] [Google Scholar]
  18. Love M.I.  et al. (2014) Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol., 15, 550. [DOI] [PMC free article] [PubMed] [Google Scholar]
  19. Lubeck E.  et al. (2014) Single-cell in situ RNA profiling by sequential hybridization. Nat. Methods, 11, 360–361. [DOI] [PMC free article] [PubMed] [Google Scholar]
  20. Lun A.T.  et al. (2016) Pooling across cells to normalize single-cell RNA sequencing data with many zero counts. Genome Biol., 17, 75. [DOI] [PMC free article] [PubMed] [Google Scholar]
  21. Marioni J.C.  et al. (2008) RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome Res., 18, 1509–1517. [DOI] [PMC free article] [PubMed] [Google Scholar]
  22. Matthews B.W. (1975) Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochim. Biophys. Acta, 405, 442–451. [DOI] [PubMed] [Google Scholar]
  23. Moffitt J.R.  et al. (2018) Molecular, spatial, and functional single-cell profiling of the hypothalamic preoptic region. Science, 362, eaau5324. [DOI] [PMC free article] [PubMed] [Google Scholar]
  24. Moran P.A. (1950) Notes on continuous stochastic phenomena. Biometrika, 37, 17–23. [PubMed] [Google Scholar]
  25. Newton M.A.  et al. (2004) Detecting differential gene expression with a semiparametric hierarchical mixture method. Biostatistics, 5, 155–176. [DOI] [PubMed] [Google Scholar]
  26. Owens N.D.  et al. (2016) Measuring absolute RNA copy numbers at high temporal resolution reveals transcriptome kinetics in development. Cell Rep., 14, 632–647. [DOI] [PMC free article] [PubMed] [Google Scholar]
  27. Rasmussen C.E. (2003) Gaussian processes in machine learning. In: Summer School on Machine Learning. Springer, Berlin, Germany, pp. 63–71. [Google Scholar]
  28. Roberts S.  et al. (2013) Gaussian processes for time-series modelling. Philos. Trans. R. Soc. A, 371, 20110550. [DOI] [PubMed] [Google Scholar]
  29. Robinson M.D.  et al. (2010) edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26, 139–140. [DOI] [PMC free article] [PubMed] [Google Scholar]
  30. Rodriques S.G.  et al. (2019) Slide-seq: a scalable technology for measuring genome-wide expression at high spatial resolution. Science, 363, 1463–1467. [DOI] [PMC free article] [PubMed] [Google Scholar]
  31. Shah S.  et al. (2016) In situ transcription profiling of single cells reveals spatial organization of cells in the mouse hippocampus. Neuron, 92, 342–357. [DOI] [PMC free article] [PubMed] [Google Scholar]
  32. Ståhl P.L.  et al. (2016) Visualization and analysis of gene expression in tissue sections by spatial transcriptomics. Science, 353, 78–82. [DOI] [PubMed] [Google Scholar]
  33. Stein M.L. (2012) Interpolation of Spatial Data: Some Theory for Kriging. Springer Science & Business Media, Berlin, Germany. [Google Scholar]
  34. Stern D.F. (2000) Tyrosine kinase signalling in breast cancer: ErbB family receptor tyrosine kinases. Breast Cancer Res., 2, 1–8. [DOI] [PMC free article] [PubMed] [Google Scholar]
  35. Subramanian A.  et al. (2007) GSEA-P: a desktop application for gene set enrichment analysis. Bioinformatics, 23, 3251–3253. [DOI] [PubMed] [Google Scholar]
  36. Sun S.  et al. (2020) Statistical analysis of spatial expression patterns for spatially resolved transcriptomic studies. Nat. Methods, 17, 193–200. [DOI] [PMC free article] [PubMed] [Google Scholar]
  37. Svensson V.  et al. (2018) SpatialDE: identification of spatially variable genes. Nat. Methods, 15, 343–346. [DOI] [PMC free article] [PubMed] [Google Scholar]
  38. Tadesse M.G.  et al. (2005) Bayesian variable selection in clustering high-dimensional data. J. Am. Stat. Assoc., 100, 602–617. [Google Scholar]
  39. Williams C.K., Rasmussen C.E. (2006) Gaussian Processes for Machine Learning. Vol. 2. MIT Press, Cambridge, MA. [Google Scholar]
  40. Witten D.  et al. (2010) Ultra-high throughput sequencing-based small RNA discovery and discrete statistical biomarker analysis in a collection of cervical tumours and matched controls. BMC Biol., 8, 58. [DOI] [PMC free article] [PubMed] [Google Scholar]
  41. Witten D.M. (2011) Classification and clustering of sequencing data using a Poisson model. Ann. Appl. Stat., 5, 2493–2518. [Google Scholar]
  42. Zhang M.  et al. (2021) Spatial molecular profiling: platforms, applications and analysis tools. Brief. Bioinform., 22, bbaa145. [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

btab455_Supplementary_Data

Articles from Bioinformatics are provided here courtesy of Oxford University Press

RESOURCES